Overview Literature

The Data Science Interview Book

Learning Python, R, SQL & Data Science

101 machine learning algorithms for data science with cheat sheets

Reddit Data Science Resources

AWESOME DATA SCIENCE

Data Science Interviews

Data Science Cheatsheet 2.0

data-science-ipython-notebooks

Awesome Machine Learning

Bean Machine

Minimum Viable Study Plan for Machine Learning Interviews https://github.com/khangich/machine-learning-interview

Causal Inference: The Mixtape https://mixtape.scunning.com/index.html

Bayesian Workflow Andrew Gelman, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, Martin Modrák https://arxiv.org/abs/2011.01808

How to avoid machine learning pitfalls: a guide for academic researchers Michael A. Lones https://arxiv.org/abs/2108.02497

Information geometry and divergences https://franknielsen.github.io/IG/#bookIG

Statistical Rethinking: A Bayesian Course with Examples in R and Stan (& PyMC3 & brms) https://xcelab.net/rm/statistical-rethinking/ https://www.youtube.com/playlist?list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN

ML Frameworks Interoperability Cheat Sheet http://bl.ocks.org/miguelusque/raw/f44a8e729896a96d0a3e4b07b5176af4/

Regression and Other Stories, Andrew Gelman, Jennifer Hill, Aki Vehtari (copy of the book) https://users.aalto.fi/~ave/ROS.pdf

tidybayes: Bayesian analysis + tidy data + geoms

Graphical Data Analysis with R Antony Unwin

Data Visualization A practical introduction, Kieran Healy

Bayes Rules! An Introduction to Applied Bayesian Modeling, Alicia A. Johnson, Miles Q. Ott, Mine Dogucu, 2021-12-01

Bayesian Statistics Independent readings course on Bayesian statistics with R and Stan, Andrew Heiss and Meng Ye, Fall 2022 https://bayesf22-notebook.classes.andrewheiss.com/rethinking/ https://bayesf22-notebook.classes.andrewheiss.com/bayes-rules/

Prior Setting in Practice: Strategies and Rationales Used in Choosing Prior Distributions for Bayesian Analysis

A Selective Review of Negative Control Methods in Epidemiology

Backpropagation is not just the chain rule

Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs Andrew Gelman & Guido Imbens

R Markdown Cookbook Yihui Xie, Christophe Dervieux, Emily Riederer 2022-11-07 https://bookdown.org/yihui/rmarkdown-cookbook/

Understanding Machine Learning: From Theory to Algorithms https://www.cs.huji.ac.il/w~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf

https://simplystatistics.org/

Prediction, Estimation, and Attribution

The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant

A Parsimonious Tour of Bayesian Model Uncertainty

Causal Inference for the Brave and True

https://bayesiancomputationbook.com/welcome.html

Measurement error and the replication crisis: The assumption that measurement error always reduces effect sizes is false https://www.science.org/doi/10.1126/science.aal3618

https://journals.sagepub.com/doi/abs/10.1177/00031224211004187

Exploring the Dynamics of Latent Variable Models https://www.cambridge.org/core/journals/political-analysis/article/abs/exploring-the-dynamics-of-latent-variable-models/CBE116F37900DAE957B2D7EB53DB0907

https://github.com/HenrikBengtsson/matrixStats

Let’s Git started

https://github.com/facebookresearch/StarSpace

https://dennybritz.com/posts/wildml/understanding-convolutional-neural-networks-for-nlp/

What’s wrong with my time series? Model validation without a hold-out set, Alex Smolyanskaya, February 28, 2017 https://multithreaded.stitchfix.com/blog/2017/02/28/whats-wrong-with-my-time-series/

ggRandomForests: Exploring Random Forest Survival https://arxiv.org/pdf/1612.08974.pdf

https://districtdatalabs.silvrback.com/time-maps-visualizing-discrete-events-across-many-timescales

Explained Visually https://setosa.io/ev/

https://github.com/google/BIG-bench/blob/main/docs/paper/BIG-bench.pdf

Two Experiments in Peer Review: Posting Preprints and Citation Bias

Random Walk: A Modern Introduction Gregory F. Lawler and Vlada Limic

Can Transformers be Strong Treatment Effect Estimators? https://arxiv.org/pdf/2202.01336v1.pdf

Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition https://bookdown.org/content/4857/

Patches Are All You Need? https://openreview.net/forum?id=TVHS5Y4dNvM

The validate R package makes it easy to check whether data lives up to expectations you have based on domain knowledge. https://github.com/data-cleaning/validate

Let’s Put Garbage-Can Regressions and Garbage-Can Probits Where They Belong https://journals.sagepub.com/doi/10.1080/07388940500339167

autoxgboost https://github.com/ja-thomas/autoxgboost

1,500 scientists lift the lid on reproducibility https://www.nature.com/articles/533452a

Methodology over metrics: current scientific standards are a disservice to patients and society https://www.jclinepi.com/article/S0895-4356(21)00170-0/fulltext

bper: Bayesian Prediction for Ethnicity and Race https://github.com/bwilden/bper

Automatic Differentiation Variational Inference https://www.jmlr.org/papers/volume18/16-107/16-107.pdf

What are the most important statistical ideas of the past 50 years? Andrew Gelman, Aki Vehtari https://arxiv.org/pdf/2012.00174.pdf

Why Propensity Scores Should Not Be Used for Matching https://gking.harvard.edu/publications/why-propensity-scores-should-not-be-used-formatching

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/

PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R https://cran.r-project.org/web/packages/PRROC/vignettes/PRROC.pdf

On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives https://arxiv.org/abs/1902.10286

‘Trust Us’: Open Data and Preregistration in Political Science and International Relations https://osf.io/preprints/metaarxiv/8h2bp/

pals https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html

Greedy Function Approximation: A Gradient Boosting Machine https://jerryfriedman.su.domains/ftp/trebst.pdf

Natural Scales in Geographical Patterns https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5379183/

https://daattali.com/shiny/timevis-demo/

https://www.extremetech.com/computing/151980-inside-ibms-67-billion-sage-the-largest-computer-ever-built

Faux peer-reviewed journals: a threat to research integrity http://deevybee.blogspot.com/2020/12/?m=1

https://github.com/mmxgn/spacy-clausie

http://www.deeplearningbook.org

Statistical Nonsignificance in Empirical Economics https://www.aeaweb.org/articles?id=10.1257/aeri.20190252&from=f

Acquiescence Bias Inflates Estimates of Conspiratorial Beliefs and Political Misperceptions Seth J. Hill, Margaret E. Roberts, October 25, 2021 http://www.margaretroberts.net/wp-content/uploads/2021/10/hillroberts_acqbiaspoliticalbeliefs.pdf

The lesson of ivermectin: meta-analyses based on summary data alone are inherently unreliable https://www.nature.com/articles/s41591-021-01535-y

https://www.math.uzh.ch/pages/varrank/index.html

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing https://arxiv.org/pdf/2107.13586.pdf

How should variable selection be performed with multiply imputed data? https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.3177

Feature Interactions in XGBoost https://arxiv.org/abs/2007.05758

Landscape of R packages for eXplainable Artificial Intelligence by Szymon Maksymiuk, Alicja Gosiewska, Przemysław Biecek https://arxiv.org/pdf/2009.13248.pdf

Feature Engineering and Selection: A Practical Approach for Predictive Models https://bookdown.org/max/FES/

Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials https://www.ncbi.nlm.nih.gov/pmc/articles/PMC300808/

xgboost.surv https://github.com/bcjaeger/xgboost.surv

DoubleML: The Python and R packages DoubleML provide an implementation of the double / debiased machine learning framework of Chernozhukov et al. (2018). The Python package is built on top of scikit-learn (Pedregosa et al., 2011) and the R package on top of mlr3 and the mlr3 ecosystem (Lang et al., 2019). https://docs.doubleml.org/stable/index.html

Preplication, Replication: A Proposal to Efficiently Upgrade Journal Replication Standards Michael Colaresi https://academic.oup.com/isp/article-abstract/17/4/367/2528282?redirectedFrom=fulltext

https://deepmind.com/blog/article/using-jax-to-accelerate-our-research

https://github.com/tidyverts/fable

The Effect: An Introduction to Research Design and Causality https://theeffectbook.net/

https://github.com/dedupeio/dedupe

https://arxiv.org/abs/2205.07407 What GPT Knows About Who is Who Xiaohan Yang, Eduardo Peynetti, Vasco Meerman, Chris Tanner

An Introduction to Ontology Engineering https://people.cs.uct.ac.za/~mkeet/files/OEbook.pdf

R Packages for Item Response Theory Analysis: Descriptions and Features https://www.tandfonline.com/doi/full/10.1080/15366367.2019.1586404

Accuracy vs Explainability of Machine Learning Models [NIPS workshop poster review] https://www.inference.vc/accuracy-vs-explainability-in-machine-learning-models-nips-workshop-poster-review/

https://arxiv-sanity-lite.com/

Attitudes toward amalgamating evidence in statistics Andrew Gelman, Keith O’Rourke http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

An overview of gradient descent optimization algorithms https://ruder.io/optimizing-gradient-descent/

https://codeocean.com/

ClustGeo: an R package for hierarchical clustering with spatial constraints https://arxiv.org/pdf/1707.03897.pdf

An Algorithmic Framework for Bias Bounties Ira Globus-Harris, Michael Kearns, Aaron Roth https://arxiv.org/abs/2201.10408

On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis https://arxiv.org/pdf/1707.01780.pdf

Fast TreeSHAP: Accelerating SHAP Value Computation for Trees Jilei Yang https://arxiv.org/abs/2109.09847

Comparing interpretability and explainability for feature selection Jack Dunn, Luca Mingardi, Ying Daisy Zhuo https://arxiv.org/abs/2105.05328

Training Deep Nets with Sublinear Memory Cost Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin https://arxiv.org/abs/1604.06174

ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R https://arxiv.org/pdf/1508.04409.pdf

A Survey of Recent Abstract Summarization Techniques Diyah Puspitaningrum https://arxiv.org/abs/2105.00824

Understanding Random Forests: From Theory to Practice https://arxiv.org/pdf/1407.7502.pdf

Performance Metrics (Error Measures) in Machine Learning Regression, Forecasting and Prognostics: Properties and Typology https://arxiv.org/pdf/1809.03006.pdf

Spike-and-Slab Meets LASSO: A Review of the Spike-and-Slab LASSO Ray Bai, Veronika Rockova, Edward I. George https://arxiv.org/abs/2010.06451

Representation Tradeoffs for Hyperbolic Embeddings Christopher De Sa, Albert Gu, Christopher Ré, Frederic Sala https://arxiv.org/pdf/1804.03329.pdf

Ratios: A short guide to confidence limits and proper use V.H. Franz, October 2007 https://arxiv.org/pdf/0710.2024.pdf

The Endogeneity of Historical Data, Adam Slez, August 28, 2020 https://broadstreet.blog/2020/08/28/the-endogeneity-of-historical-data/

A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0251194

Post model-fitting exploration via a “Next-Door” analysis Leying Guan and Robert Tibshirani https://tibshirani.su.domains/ftp/nextDoor.pdf

Understanding BERT Transformer: Attention isn’t all you need A parsing/composition framework for understanding Transformers https://medium.com/synapse-dev/understanding-bert-transformer-attention-isnt-all-you-need-5839ebd396db

Einstein VI: General and Integrated Stein Variational Inference in NumPyro Ahmad Salim Al-Sibahi, Ola Rønning, Christophe Ley, Thomas Wim Hamelryck https://openreview.net/forum?id=nXSDybDWV3

Dream Investigation Results Official Report by the Minecraft Speedrunning Team https://mcspeedrun.com/dream.pdf

Improving Parameter Estimation of Epidemic Models: Likelihood Functions and Kalman Filtering, posted 8 Aug 2022 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4165188

Do Name-Based Treatments Violate Information Equivalence? Evidence from a Correspondence Audit Experiment Published online by Cambridge University Press: 09 March 2021 https://www.cambridge.org/core/journals/political-analysis/article/abs/do-namebased-treatments-violate-information-equivalence-evidence-from-a-correspondence-audit-experiment/56C6846518DDADE6EAF92DAE11552BDF

How Much Should We Trust Staggered Difference-In-Differences Estimates? European Corporate Governance Institute – Finance Working Paper No. 736/2021 Rock Center for Corporate Governance at Stanford University Working Paper No. 246 Journal of Financial Economics (JFE), Forthcoming https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3794018

Building useful models for industry—some tips Jim Savage January 2017 https://khakieconomics.github.io/2017/01/01/Building-useful-models-for-industry.html

An Introduction to Proximal Causal Learning https://arxiv.org/pdf/2009.10982.pdf

First Things First: Assessing Data Quality before Model Quality Anita Gohdes and Megan Price https://journals.sagepub.com/doi/full/10.1177/0022002712459708

Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans https://www.nature.com/articles/s42256-021-00307-0

Why and How We Should Join the Shift From Significance Testing to Estimation https://www.preprints.org/manuscript/202112.0235/v1

How to make replication the norm https://www.nature.com/articles/d41586-018-02108-9

Applied Bayesian Statistics Using Stan and R https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/applied-bayesian-statistics/

https://seeing-theory.brown.edu/index.html

https://www.brodrigues.co/

FINDING ECONOMIC ARTICLES WITH DATA AND SPECIFIC EMPIRICAL METHODS http://skranz.github.io//r/2021/01/05/FindingEconomicArticles4.html

Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2849145

Machine vision on historical maps https://weinman.cs.grinnell.edu/research/maps.shtml

Enhancing Validity in Observational Settings When Replication Is Not Possible https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2543525

1.1 Billion Taxi Rides with SQLite, Parquet & HDFS https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-parquet-hdfs.html

Understanding the Bias-Variance Tradeoff http://scott.fortmann-roe.com/docs/BiasVariance.html

Is the LKJ(1) prior uniform? “Yes” http://srmart.in/is-the-lkj1-prior-uniform-yes/

Informative priors for correlation matrices: An easy approach http://srmart.in/informative-priors-for-correlation-matrices-an-easy-approach/

A Tutorial on Spectral Clustering https://arxiv.org/pdf/0711.0189v1.pdf

Automated Geocoding of Textual Documents: A Survey of Current Approaches https://onlinelibrary.wiley.com/doi/full/10.1111/tgis.12212

Sparklyr https://spark.rstudio.com/

The AAA Tranche of Subprime Science Andrew Gelman and Eric Loken http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics10.pdf

Never trust rownames of a dataframe, Ankur Gupta, June 16th, 2015 https://www.perfectlyrandom.org/2015/06/16/never-trust-the-row-names-of-a-dataframe-in-R/

GRAPH ALGORITHMS http://www.martinbroadhurst.com/tag/igraph

Groundhog: Addressing The Threat That R Poses To Reproducible Research http://datacolada.org/95

CS231n Convolutional Neural Networks for Visual Recognition https://cs231n.github.io/neural-networks-3/

Implementing Variational Autoencoders in Keras: Beyond the Quickstart Tutorial http://louistiao.me/posts/implementing-variational-autoencoders-in-keras-beyond-the-quickstart-tutorial/

Hypothesis Testing in Econometrics http://home.uchicago.edu/amshaikh/webfiles/testingreview.pdf

“Why Should I Trust You?” Explaining the Predictions of Any Classifier https://arxiv.org/pdf/1602.04938v3.pdf

Yes, but Did It Work?: Evaluating Variational Inference http://www.stat.columbia.edu/~gelman/research/published/Evaluating_Variational_Inference.pdf https://statmodeling.stat.columbia.edu/2018/06/27/yes-work-evaluating-variational-inference/

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets https://arxiv.org/abs/2103.12028

One Instrument to Rule Them All: The Bias and Coverage of Just-ID IV Joshua Angrist, Michal Kolesár https://arxiv.org/abs/2110.10556

Underspecification Presents Challenges for Credibility in Modern Machine Learning https://arxiv.org/abs/2011.03395

A Survey of Predictive Modelling under Imbalanced Distributions https://arxiv.org/pdf/1505.01658.pdf

Varying Slopes Models and the CholeskyLKJ distribution in TensorFlow Probability https://adamhaber.github.io/post/varying-slopes/

Shapley Decomposition of R-Squared in Machine Learning Models https://arxiv.org/pdf/1908.09718.pdf

Understanding Global Feature Contributions With Additive Importance Measures Ian Covert, Scott Lundberg, Su-In Lee https://arxiv.org/abs/2004.00668

True to the Model or True to the Data? https://arxiv.org/abs/2006.16234

When to Impute? Imputation before and during cross-validation Byron C. Jaeger, Nicholas J. Tierney, Noah R. Simon https://arxiv.org/pdf/2010.00718.pdf

A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications Hongyun Cai, Vincent W. Zheng, Kevin Chen-Chuan Chang https://arxiv.org/abs/1709.07604

Comparing methods addressing multi-collinearity when developing prediction models https://arxiv.org/abs/2101.01603

Nonparametric causal effects based on incremental propensity score interventions https://arxiv.org/abs/1704.00211

Deep learning generalizes because the parameter-function map is biased towards simple functions Guillermo Valle-Pérez, Chico Q. Camargo, Ard A. Louis https://arxiv.org/abs/1805.08522

Bayesian Item Response Modeling in R with brms and Stan https://arxiv.org/pdf/1905.09501.pdf

Bayesian Inference for a Covariance Matrix https://arxiv.org/pdf/1408.4050.pdf

Cross-validation Confidence Intervals for Test Error Pierre Bayle, Alexandre Bayle, Lucas Janson, Lester Mackey https://arxiv.org/abs/2007.12671

Comparing Published Scientific Journal Articles to Their Pre-print Versions https://arxiv.org/pdf/1604.05363.pdf

End-to-End Weak Supervision Salva Rühling Cachay, Benedikt Boecking, Artur Dubrawski https://arxiv.org/abs/2107.02233

Estimation and Inference of Heterogeneous Treatment Effects using Random Forests https://arxiv.org/pdf/1510.04342.pdf

Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift https://arxiv.org/pdf/1801.05134.pdf

A review on outlier/anomaly detection in time series data https://arxiv.org/abs/2002.04236

Entropic Out-of-Distribution Detection: Seamless Detection of Unknown Examples David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano L. I. Oliveira, Teresa Ludermir https://arxiv.org/abs/2006.04005

An Exploratory Characterization of Bugs in COVID-19 Software Projects Akond Rahman, Effat Farhana https://arxiv.org/abs/2006.00586

Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting Anders Oland, Aayush Bansal, Roger B. Dannenberg, Bhiksha Raj https://arxiv.org/abs/1707.04199

Introducing Stan2tfp - a lightweight interface for the Stan-to-TensorFlow Probability compiler, May 21, 2020 https://adamhaber.github.io/post/stan2tfp-post1/

L2 Regularization versus Batch and Weight Normalization Twan van Laarhoven https://arxiv.org/abs/1706.05350

Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis David G. Clark, Jesse A. Livezey, Kristofer E. Bouchard https://arxiv.org/abs/1905.09944

Monte Carlo Gradient Estimation in Machine Learning Shakir Mohamed, Mihaela Rosca, Michael Figurnov, Andriy Mnih https://arxiv.org/abs/1906.10652

Large-scale linear regression: Development of high-performance routines Alvaro Frank, Diego Fabregat-Traver, Paolo Bientinesi https://arxiv.org/abs/1504.07890

The Kernel Interaction Trick: Fast Bayesian Discovery of Pairwise Interactions in High Dimensions Raj Agrawal, Jonathan H. Huggins, Brian Trippe, Tamara Broderick https://arxiv.org/abs/1905.06501

TensorFlow Distributions Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, Rif A. Saurous https://arxiv.org/abs/1711.10604

Asymptotically Exact, Embarrassingly Parallel MCMC Willie Neiswanger, Chong Wang, Eric Xing https://arxiv.org/abs/1311.4780

Python for Data Science https://aeturrell.github.io/python4DS/welcome.html

Using the flextable R package https://ardata-fr.github.io/flextable-book/

Coding for Economists https://aeturrell.github.io/coding-for-economists/intro.html

When Should You Adjust Standard Errors for Clustering? Alberto Abadie, Susan Athey, Guido W Imbens, Jeffrey M Wooldridge https://academic.oup.com/qje/advance-article-abstract/doi/10.1093/qje/qjac038/6750017

Awesome Deep Learning for Natural Language Processing (NLP) https://github.com/brianspiering/awesome-dl4nlp

R for applied epidemiology and public health https://epirhandbook.com/en/index.html

COVID 19: Reduced forms have gone viral, but what do they tell us? https://drive.google.com/file/d/1ERjcGXD2jvfDFXdI0_NtF4X95UeQ5f4W/view

Reproducibility in Cancer Biology: Challenges for assessing replicability in preclinical cancer biology https://elifesciences.org/articles/67995

Taking Uncertainty Seriously: Bayesian Marginal Structural Models for Causal Inference in Political Science https://github.com/ajnafa/Latent-Bayesian-MSM

Generalized Linear Models https://data.princeton.edu/wws509/notes/c7s4

genieclust: Fast and Robust Hierarchical Clustering with Noise Point Detection https://genieclust.gagolewski.com/

Awesome Graph Classification https://github.com/benedekrozemberczki/awesome-graph-classification

parallelDist https://github.com/alexeckert/parallelDist

Interpretable Machine Learning A Guide for Making Black Box Models Explainable Christoph Molnar https://christophm.github.io/interpretable-ml-book/

The Inverse CDF Method https://dk81.github.io/dkmathstats_site/prob-inverse-cdf.html

HamiltonianMC https://chi-feng.github.io/mcmc-demo/app.html#HamiltonianMC,banana

End-to-End Balancing for Causal Continuous Treatment-Effect Estimation https://assets.amazon.science/5b/71/fa078e6f4f97a76a2a622c767dd5/end-to-end-balancing-for-causal-continuous-treatment-effect-estimation.pdf

A tour of probabilistic programming language APIs What does it look like to do MCMC in different frameworks? https://colcarroll.github.io/ppl-api/

Probabilistic Programming & Bayesian Methods for Hackers https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/

Beyond Multiple Linear Regression Applied Generalized Linear Models and Multilevel Models in R https://bookdown.org/roback/bookdown-bysh/

Maybe a section on hyperparameters?

Does batch size matter? https://blog.janestreet.com/does-batch-size-matter/

The Much Quieter Revolution of Synthetic Control: Episode I https://causalinf.substack.com/p/the-much-quieter-revolution-of-synthetic

User-friendly introduction to PAC-Bayes bounds https://arxiv.org/pdf/2110.11216.pdf

Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results https://journals.sagepub.com/doi/full/10.1177/2515245917747646

The RecordLinkage Package: Detecting Errors in Data https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf

https://grow.google/certificates/interview-warmup/

The inverse-transform method for generating random variables in R https://heds.nz/posts/inverse-transform/

The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology https://journals.sagepub.com/doi/10.1177/1948550616673876

Evolution of Reporting P Values in the Biomedical Literature, 1990-2015 https://jamanetwork.com/journals/jama/fullarticle/2503172

SHAP (SHapley Additive exPlanations) https://github.com/slundberg/shap

The h-index is no longer an effective correlate of scientific reputation https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0253397

Prior Choice Recommendations, Stan wiki (page last edited by Andrew Gelman, Apr 17, 2020) https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations#prior-for-a-covariance-matrix

Institute for Replication (I4R) https://i4replication.org/index.html

How Life Sciences Actually Work: Findings of a Year-Long Investigation https://guzey.com/how-life-sciences-actually-work/

Efficient Neural Causal Discovery without Acyclicity Constraints https://github.com/phlippe/ENCO

awesome-text-summarization https://github.com/mathsyouth/awesome-text-summarization

(Ir)Reproducible Machine Learning: A Case Study https://reproducible.cs.princeton.edu/irreproducibility-paper.pdf

THE MYTH OF THE EXPERT REVIEWER https://parameterfree.com/2021/07/06/the-myth-of-the-expert-reviewer/

Understanding and Choosing the Right Probability Distributions https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781119197096.app03

Spatial Interdependence and Instrumental Variable Models https://osf.io/preprints/socarxiv/pgrcu/

The case for formal methodology in scientific reform https://royalsocietypublishing.org/doi/10.1098/rsos.200805

Using Difference-in-Differences to Identify Causal Effects of COVID-19 Policies https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3603970

Pandas Comparison with R / R libraries https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html

Non-Standard Errors https://orbilu.uni.lu/bitstream/10993/48686/1/SSRN-id3961574.pdf

Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors https://pubmed.ncbi.nlm.nih.gov/26186114/

The (lack of) impact of retraction on citation networks Charisse R Madlock-Brown, David Eichmann https://pubmed.ncbi.nlm.nih.gov/24668038/

The puzzling relationship between multi-lab replications and meta-analyses of the rest of the literature https://psyarxiv.com/pbrdk/

Bayesian Estimation of Correlation Matrices of Longitudinal Data Riddhi Pratim Ghosh, Bani Mallick, Mohsen Pourahmadi https://projecteuclid.org/journals/bayesian-analysis/volume-16/issue-3/Bayesian-Estimation-of-Correlation-Matrices-of-Longitudinal-Data/10.1214/20-BA1237.full

Operationalizing the Replication Standard: A Case Study of the Data Curation and Verification Workflow for Scholarly Journals https://osf.io/preprints/socarxiv/cfdba/

How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It https://osf.io/preprints/socarxiv/453jk/

Lost in Aggregation: Improving Event Analysis with Report-Level Data Scott J. Cook, Nils B. Weidmann https://onlinelibrary.wiley.com/doi/full/10.1111/ajps.12398

Frequentist versus Bayesian approaches to multiple testing Arvid Sjölander & Stijn Vansteelandt https://link.springer.com/article/10.1007/s10654-019-00517-2

Research note: Examining potential bias in large-scale censored data https://misinforeview.hks.harvard.edu/article/research-note-examining-potential-bias-in-large-scale-censored-data/

When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai, In Song Kim https://onlinelibrary.wiley.com/doi/abs/10.1111/ajps.12417

Runtime warnings and convergence problems Stan Development Team https://mc-stan.org/misc/warnings.html

Dirichlet Process Gaussian mixture model via the stick-breaking construction in various PPLs https://luiarthur.github.io/TuringBnpBenchmarks/dpsbgmm

xgboost: “Hi I’m Gamma. What can I do for you?” — and the tuning of regularization https://medium.com/data-design/xgboost-hi-im-gamma-what-can-i-do-for-you-and-the-tuning-of-regularization-a42ea17e6ab6

PostGIS In Action https://livebook.manning.com/book/postgis-in-action-second-edition/about-this-book/

Stan User’s Guide https://mc-stan.org/docs/stan-users-guide/index.html

Smoothing Terms in GAM Models https://maths-people.anu.edu.au/~johnm/r-book/xtras/autosmooth.pdf

Designing a Deep Learning Project https://medium.com/@erogol/designing-a-deep-learning-project-9b3698aef127

PyTorch With Baby Steps: From y=x To Training A Convnet https://lelon.io/blog/pytorch-baby-steps

Bayesian inference with Stan: A tutorial on adding custom distributions Jeffrey Annis, Brent J. Miller & Thomas J. Palmeri https://link.springer.com/article/10.3758/s13428-016-0746-9

Bayes Rules! An Introduction to Applied Bayesian Modeling https://www.bayesrulesbook.com/

Graduate Qualitative Methods Training in Political Science: A Disciplinary Crisis Published online by Cambridge University Press: 21 November 2019 https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/graduate-qualitative-methods-training-in-political-science-a-disciplinary-crisis/7B0EEB76E1CC234AFED7EED8DA71BA35

Time Series Analysis by State Space Methods (Oxford Statistical Science Series) https://www.amazon.com/dp/0198523548

Hyperparameters and tuning strategies for random forest https://wires.onlinelibrary.wiley.com/doi/full/10.1002/widm.1301

Your Cross Validation Error Confidence Intervals are Wrong — here’s how to Fix Them https://towardsdatascience.com/your-cross-validation-error-confidence-intervals-are-wrong-heres-how-to-fix-them-abbfe28d390

Probabilistic Programming with Variational Inference: Under the Hood https://willcrichton.net/notes/probabilistic-programming-under-the-hood/

How to Measure Statistical Causality: A Transfer Entropy Approach with Financial Applications https://towardsdatascience.com/causality-931372313a1c

Kullback-Leibler Divergence Explained https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

Problems and Opportunities in Training Deep Learning Software Systems: An Analysis of Variance https://www.cs.purdue.edu/homes/lintan/publications/variance-ase20.pdf

How (Not) to Reproduce: Practical Considerations to Improve Research Transparency in Political Science https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/abs/how-not-to-reproduce-practical-considerations-to-improve-research-transparency-in-political-science/32E7CF5D975C081BA666D3BD475D7913

Quantifying Bias from Measurable and Unmeasurable Confounders Across Three Domains of Individual Determinants of Political Preferences Published online by Cambridge University Press: 22 February 2022 https://www.cambridge.org/core/journals/political-analysis/article/quantifying-bias-from-measurable-and-unmeasurable-confounders-across-three-domains-of-individual-determinants-of-political-preferences/D1D2DEE9E7180BDCFC592885BE66E9AF

5 Levels of Difficulty — Bayesian Gaussian Random Walk with PyMC3 and Theano https://towardsdatascience.com/5-levels-of-difficulty-bayesian-gaussian-random-walk-with-pymc3-and-theano-34343911c7d2

Single-Parameter Models | Pyro vs. STAN https://towardsdatascience.com/single-parameter-models-pyro-vs-stan-e7e69b45d95c

Partial Identification in Econometrics Elie Tamer https://scholar.harvard.edu/files/tamer/files/pie.pdf

LightGBM for Quantile Regression Understand Quantile Regression https://towardsdatascience.com/lightgbm-for-quantile-regression-4288d0bb23fd

Assessing the Impact of Non-Random Measurement Error on Inference: A Sensitivity Analysis Approach https://strathprints.strath.ac.uk/59463/1/Gallop_Weschle_PSRM_2016_Assessing_the_impact_of_non_random_measurement_error_on_inference.pdf

yardstick is a package to estimate how well models are working using tidy data principles. See the package webpage for more information. https://yardstick.tidymodels.org/index.html

The Three Faces of Bayes https://slackprop.wordpress.com/2016/08/28/the-three-faces-of-bayes/

Evaluating Random Forests for Survival Analysis using Prediction Error Curves https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4194196/

The role of metadata in reproducible computational research https://www.sciencedirect.com/science/article/pii/S2666389921001707

Ecological Inference in the Social Sciences https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2885825/

Two Wrongs Make a Right: Addressing Underreporting in Binary Data from Multiple Sources https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5667662/

On the low reproducibility of cancer studies https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6599599/

Quarto with Python https://www.meyerperin.com/using-quarto/

Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015 https://www.nature.com/articles/s41562-018-0399-z

Bayesian analysis of tests with unknown specificity and sensitivity Andrew Gelman, Bob Carpenter https://www.medrxiv.org/content/10.1101/2020.05.22.20108944v3.full.pdf

Notes on the Negative Binomial Distribution https://www.johndcook.com/negative_binomial.pdf

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Automatic Differentiation Variational Inference https://www.jmlr.org/papers/volume18/16-107/16-107.pdf

IZA DP No. 13233: The Influence of Hidden Researcher Decisions in Applied Microeconomics https://www.iza.org/publications/dp/13233/the-influence-of-hidden-researcher-decisions-in-applied-microeconomics

cdfquantreg: An R Package for CDF-Quantile Regression https://www.jstatsoft.org/article/view/v088i01

https://techdevguide.withgoogle.com/

What the F-measure doesn’t measure: Features, Flaws, Fallacies and Fixes David M. W. Powers https://arxiv.org/abs/1503.06410

When LOO and other cross-validation approaches are valid https://statmodeling.stat.columbia.edu/2018/08/03/loo-cross-validation-approaches-valid/

Hamiltonian Monte Carlo explained http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html

Controlling for Unobserved Confounds in Classification Using Correlational Constraints Virgile Landeiro, Aron Culotta https://arxiv.org/abs/1703.01671

The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies Scott E. Maxwell https://statmodeling.stat.columbia.edu/wp-content/uploads/2017/07/maxwell2004.pdf

You need 16 times the sample size to estimate an interaction than to estimate a main effect https://statmodeling.stat.columbia.edu/2018/03/15/need-16-times-sample-size-estimate-interaction-estimate-main-effect/

Machine Learning of Sets http://akosiorek.github.io/ml/2020/08/12/machine_learning_of_sets.html

Weak Supervision: A New Programming Paradigm for Machine Learning http://ai.stanford.edu/blog/weak-supervision/

The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research https://peerj.com/preprints/2921/

Advanced Natural Language Processing with TensorFlow 2: Build effective real-world NLP applications using NER, RNNs, seq2seq models, Transformers, and more https://www.amazon.com/Advanced-Natural-Language-Processing-TensorFlow/dp/1800200935?encoding=UTF8&qid=&sr=&linkCode=sl1&tag=kirkdborne-20&linkId=4448e1a0cd126f52a2aba844c4bdb78e&language=en_US&ref=as_li_ss_tl

3 reasons why you can’t always use predictive performance to choose among models https://statmodeling.stat.columbia.edu/2015/10/23/26857/

Using Heteroscedasticity to Identify and Estimate Mismeasured and Endogenous Regressor Models Arthur Lewbel https://www.tandfonline.com/doi/full/10.1080/07350015.2012.643126

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Sergey Ioffe, Christian Szegedy https://arxiv.org/abs/1502.03167

Gradient Boosting explained [demonstration] http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html

Clustered standard errors vs. multilevel modeling https://statmodeling.stat.columbia.edu/2007/11/28/clustered_stand/

Advanced R https://adv-r.hadley.nz/index.html

Regression to the mean continues to confuse people and lead to errors in published research https://statmodeling.stat.columbia.edu/2018/06/24/regression-mean-continues-confuse-people-lead-errors-published-research/

The statistical significance filter leads to overoptimistic expectations of replicability https://statmodeling.stat.columbia.edu/2018/05/22/statistical-significance-filter-leads-overoptimistic-expectations-replicability/

How to cross-validate PCA, clustering, and matrix decomposition models http://alexhwilliams.info/itsneuronalblog/2018/02/26/crossval/?mlreview

Inference in Experiments Conditional on Observed Imbalances in Covariates Per Johansson, Mattias Nordin https://www.tandfonline.com/doi/full/10.1080/00031305.2022.2054859

Scientific progress despite irreproducibility: A seeming paradox Richard M. Shiffrin, Katy Borner, Stephen M. Stigler https://arxiv.org/abs/1710.01946

On Statistical Non-Significance Alberto Abadie https://arxiv.org/abs/1803.00609

On the number of signals in multivariate time series Markus Matilainen, Klaus Nordhausen, Joni Virta https://arxiv.org/abs/1801.04925

Data Science vs. Statistics: Two Cultures? Iain Carmichael, J.S. Marron https://arxiv.org/abs/1801.00371

The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions George Philipp, Dawn Song, Jaime G. Carbonell https://arxiv.org/abs/1712.05577

Theory of Deep Learning III: explaining the non-overfitting puzzle Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, Hrushikesh Mhaskar https://arxiv.org/abs/1801.00173

On overfitting and post-selection uncertainty assessments Liang Hong, Todd A. Kuffner, Ryan Martin https://arxiv.org/abs/1712.02379

A Theory of Statistical Inference for Ensuring the Robustness of Scientific Results Beau Coker, Cynthia Rudin, Gary King https://arxiv.org/abs/1804.08646

Labelling as an unsupervised learning problem Terry Lyons, Imanol Perez Arribas https://arxiv.org/abs/1805.03911

Structural Breaks in Time Series Alessandro Casini, Pierre Perron https://arxiv.org/abs/1805.03807

On consistency and inconsistency of nonparametric tests Mikhail Ermakov https://arxiv.org/abs/1807.09076

A New Angle on L2 Regularization Thomas Tanay, Lewis D Griffin https://arxiv.org/abs/1806.11186

On the Robustness of Interpretability Methods David Alvarez-Melis, Tommi S. Jaakkola https://arxiv.org/abs/1806.08049

Identifying Causal Effects with the R Package causaleffect Santtu Tikka, Juha Karvanen https://arxiv.org/abs/1806.07161

How Does Batch Normalization Help Optimization? Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry https://arxiv.org/abs/1805.11604

The effect of the choice of neural network depth and breadth on the size of its hypothesis space Lech Szymanski, Brendan McCane, Michael Albert https://arxiv.org/abs/1806.02460

Is preprocessing of text really worth your time for online comment classification? Fahim Mohammad https://arxiv.org/abs/1806.02908

Geometric Understanding of Deep Learning Na Lei, Zhongxuan Luo, Shing-Tung Yau, David Xianfeng Gu https://arxiv.org/abs/1805.10451

Cross validation residuals for generalised least squares and other correlated data models Ingrid Annette Baade https://arxiv.org/abs/1809.01319

Out-of-Distribution Detection Using an Ensemble of Self Supervised Leave-out Classifiers Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar Das, Bharat Kaul, Theodore L. Willke https://arxiv.org/abs/1809.03576

Handling Imbalanced Dataset in Multi-label Text Categorization using Bagging and Adaptive Boosting Genta Indra Winata, Masayu Leylia Khodra https://arxiv.org/abs/1810.11612

On the Art and Science of Machine Learning Explanations Patrick Hall https://arxiv.org/abs/1810.02909

Causal inference under over-simplified longitudinal causal models Lola Etievant, Vivian Viallon https://arxiv.org/abs/1810.01294

Revisiting the Gelman-Rubin Diagnostic Dootika Vats, Christina Knudson https://arxiv.org/abs/1812.09384

A Survey on Data Collection for Machine Learning: a Big Data – AI Integration Perspective Yuji Roh, Geon Heo, Steven Euijong Whang

A Fundamental Measure of Treatment Effect Heterogeneity Jonathan Levy, Mark van der Laan, Alan Hubbard, Romain Pirracchio https://arxiv.org/abs/1811.03745

Causal Discovery Toolbox: Uncover causal relationships in Python Diviyan Kalainathan, Olivier Goudet https://arxiv.org/abs/1903.02278

Dying ReLU and Initialization: Theory and Numerical Examples Lu Lu, Yeonjong Shin, Yanhui Su, George Em Karniadakis https://arxiv.org/abs/1903.06733

ROC and AUC with a Binary Predictor: a Potentially Misleading Metric John Muschelli https://arxiv.org/abs/1903.04881

Gamification in Science: A Study of Requirements in the Context of Reproducible Research Sebastian S. Feger, Sünje Dallmeier-Tiessen, Paweł W. Woźniak, Albrecht Schmidt https://arxiv.org/abs/1903.02446

Matrix factorization for multivariate time series analysis Pierre Alquier, Nicolas Marie https://arxiv.org/abs/1903.05589

On the complexity of logistic regression models Nicola Bulso, Matteo Marsili, Yasser Roudi https://arxiv.org/abs/1903.00386

On Heavy-user Bias in A/B Testing Yu Wang, Somit Gupta, Jiannan Lu, Ali Mahmoudzadeh, Sophia Liu https://arxiv.org/abs/1902.02021

DeepMoD: Deep learning for Model Discovery in noisy data Gert-Jan Both, Subham Choudhury, Pierre Sens, Remy Kusters

Learning Causality: Synthesis of Large-Scale Causal Networks from High-Dimensional Time Series Data Mark-Oliver Stehr, Peter Avar, Andrew R. Korte, Lida Parvin, Ziad J. Sahab, Deborah I. Bunin, Merrill Knapp, Denise Nishita, Andrew Poggio, Carolyn L. Talcott, Brian M. Davis, Christine A. Morton, Christopher J. Sevinsky, Maria I. Zavodszky, Akos Vertes https://arxiv.org/abs/1905.02291

Text Classification Algorithms: A Survey Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura E. Barnes, Donald E. Brown https://arxiv.org/abs/1904.08067

The Information Complexity of Learning Tasks, their Structure and their Distance Alessandro Achille, Giovanni Paolini, Glen Mbeng, Stefano Soatto https://arxiv.org/abs/1904.03292

Recent Advances in Natural Language Inference: A Survey of Benchmarks, Resources, and Approaches Shane Storks, Qiaozi Gao, Joyce Y. Chai https://arxiv.org/abs/1904.01172

Evaluating A Key Instrumental Variable Assumption Using Randomization Tests Zach Branson, Luke Keele https://arxiv.org/abs/1907.01943

Model selection for high-dimensional linear regression with dependent observations Ching-Kang Ing https://arxiv.org/abs/1906.07395

Doubts on the efficacy of outliers correction methods Marjorie Fonnesu, Nicola Kuczewski

The Design of Global Correlation Quantifiers and Continuous Notions of Statistical Sufficiency Nicholas Carrara, Kevin Vanslette

An Econometric Perspective on Algorithmic Subsampling Sokbae Lee, Serena Ng https://arxiv.org/abs/1907.01954

Factor Analysis for High-Dimensional Time Series with Change Point Xialu Liu, Ting Zhang https://arxiv.org/abs/1907.09522

Causal Regularization Dominik Janzing https://arxiv.org/abs/1906.12179

The exact form of the ‘Ockham factor’ in model selection Jonathan Rougier, Carey Priebe https://arxiv.org/abs/1906.11592

Measuring Average Treatment Effect from Heavy-tailed Data Jason (Xiao) Wang, Pauline Burke https://arxiv.org/abs/1905.09252

The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial Benyamin Ghojogh, Mark Crowley https://arxiv.org/abs/1905.12787

Statistical methods research done as science rather than mathematics James S. Hodges https://arxiv.org/abs/1905.08381

Regression Analysis of Unmeasured Confounding Brian Knaeble, Braxton Osting, Mark Abramson

Dyadic Regression Bryan S. Graham https://arxiv.org/abs/1908.09029

Illusion of Causality in Visualized Data Cindy Xiong, Joel Shapiro, Jessica Hullman, Steven Franconeri https://arxiv.org/abs/1908.00215

“multiColl”: An R package to detect multicollinearity Román Salmerón, Catalina García, José García https://arxiv.org/abs/1910.14590

All of Linear Regression Arun K. Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai https://arxiv.org/abs/1910.06386

What is the Value of Data? On Mathematical Methods for Data Quality Estimation Netanel Raviv, Siddharth Jain, Jehoshua Bruck https://arxiv.org/abs/2001.03464

Imputation for High-Dimensional Linear Regression Kabir Aladin Chandrasekher, Ahmed El Alaoui, Andrea Montanari https://arxiv.org/abs/2001.09180

On Model Evaluation under Non-constant Class Imbalance Jan Brabec, Tomáš Komárek, Vojtěch Franc, Lukáš Machlica https://arxiv.org/abs/2001.05571

Identifying Mislabeled Data using the Area Under the Margin Ranking Geoff Pleiss, Tianyi Zhang, Ethan R. Elenberg, Kilian Q. Weinberger https://arxiv.org/abs/2001.10528

Expanding the scope of statistical computing: Training statisticians to be software engineers Alex Reinhart, Christopher R. Genovese https://arxiv.org/abs/1912.13076

Learning under Model Misspecification: Applications to Variational and Ensemble methods Andres R. Masegosa https://arxiv.org/abs/1912.08335

Explaining the Explainer: A First Theoretical Analysis of LIME Damien Garreau, Ulrike von Luxburg https://arxiv.org/abs/2001.03447

Algorithms for Heavy-Tailed Statistics: Regression, Covariance Estimation, and Beyond Yeshwanth Cherapanamjeri, Samuel B. Hopkins, Tarun Kathuria, Prasad Raghavendra, Nilesh Tripuraneni https://arxiv.org/abs/1912.11071

Markov Chain Monte Carlo Methods, a survey with some frequent misunderstandings Christian P. Robert (U Paris Dauphine and U Warwick), Wu Changye (U Paris Dauphine) https://arxiv.org/abs/2001.06249

Valid p-Values and Expectations of p-Values Revisited Albert Vexler https://arxiv.org/abs/2001.05126

Counterexamples to “The Blessings of Multiple Causes” by Wang and Blei Elizabeth L. Ogburn, Ilya Shpitser, Eric J. Tchetgen Tchetgen https://arxiv.org/abs/2001.06555

Identifying Mislabeled Instances in Classification Datasets Nicolas Michael Müller, Karla Markert https://arxiv.org/abs/1912.05283

Randomized p-values for multiple testing and their application in replicability analysis Anh-Tuan Hoang, Thorsten Dickhaus https://arxiv.org/abs/1912.06982

Over-parametrized deep neural networks do not generalize well Michael Kohler, Adam Krzyzak https://arxiv.org/abs/1912.03925

Re-Evaluating Strengthened-IV Designs: Asymptotic Efficiency, Bias Formula, and the Validity and Power of Sensitivity Analyses Siyu Heng, Bo Zhang, Xu Han, Scott A. Lorch, Dylan S. Small https://arxiv.org/abs/1911.09171

Unbiased variable importance for random forests Markus Loecher https://arxiv.org/abs/2003.02106

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited Wesley J. Maddox, Gregory Benton, Andrew Gordon Wilson https://arxiv.org/abs/2003.02139

A Multi-Way Correlation Coefficient Benjamin M. Taylor https://arxiv.org/abs/2003.02561

Sense and Sensitivity Analysis: Simple Post-Hoc Analysis of Bias Due to Unobserved Confounding Victor Veitch, Anisha Zaveri https://arxiv.org/abs/2003.01747

The Implicit and Explicit Regularization Effects of Dropout Colin Wei, Sham Kakade, Tengyu Ma https://arxiv.org/abs/2002.12915

Natural Language Processing Advancements By Deep Learning: A Survey Amirsina Torfi, Rouzbeh A. Shirvani, Yaser Keneshloo, Nader Tavaf, Edward A. Fox https://arxiv.org/abs/2003.01200

An Evaluation of Change Point Detection Algorithms Gerrit J.J. van den Burg, Christopher K.I. Williams https://arxiv.org/abs/2003.06222

Complexity Measures and Features for Times Series classification Francisco J. Baldán, José M. Benítez https://arxiv.org/abs/2002.12036

Computing Shapley Effects for Sensitivity Analysis Elmar Plischke, Giovanni Rabitti, Emanuele Borgonovo https://arxiv.org/abs/2002.12024

Bayesian Posterior Interval Calibration to Improve the Interpretability of Observational Studies Jami J. Mulgrave, David Madigan, George Hripcsak https://arxiv.org/abs/2003.06002

Demystify Lindley’s Paradox by Interpreting P-value as Posterior Probability Guosheng Yin, Haolun Shi https://arxiv.org/abs/2002.10883

Estimation of causal effects with small data in the presence of trapdoor variables Jouni Helske, Santtu Tikka, Juha Karvanen https://arxiv.org/abs/2003.03187

Dimensional Analysis in Statistical Modelling Tae Yoon Lee, James V. Zidek, Nancy Heckman https://arxiv.org/abs/2002.11259

Causal bounds for outcome-dependent sampling in observational studies Erin E. Gabriel, Michael C. Sachs, Arvid Sjölander https://arxiv.org/abs/2002.10519

cutpointr: Improved Estimation and Validation of Optimal Cutpoints in R https://arxiv.org/abs/2002.09209

A New Framework for Online Testing of Heterogeneous Treatment Effect Miao Yu, Wenbin Lu, Rui Song https://arxiv.org/abs/2002.03277

Combining Observational and Experimental Datasets Using Shrinkage Estimators Evan Rosenman, Guillaume Basse, Art Owen, Michael Baiocchi https://arxiv.org/abs/2002.06708

A confidence interval robust to publication bias for random-effects meta-analysis of few studies M. Henmi, S. Hattori, T. Friede https://arxiv.org/abs/2002.07598

Boosting Simple Learners Noga Alon, Alon Gonen, Elad Hazan, Shay Moran https://arxiv.org/abs/2001.11704

Analytic Study of Double Descent in Binary Classification: The Impact of Loss Ganesh Kini, Christos Thrampoulidis https://arxiv.org/abs/2001.11572

Fast Bayesian Estimation of Spatial Count Data Models Prateek Bansal, Rico Krueger, Daniel J. Graham https://arxiv.org/abs/2007.03681

High-recall causal discovery for autocorrelated time series with latent confounders Andreas Gerhardus, Jakob Runge https://arxiv.org/abs/2007.01884

Estimating the Prediction Performance of Spatial Models via Spatial k-Fold Cross Validation Jonne Pohjankukka, Tapio Pahikkala, Paavo Nevalainen, Jukka Heikkonen https://arxiv.org/abs/2005.14263

Validating Label Consistency in NER Data Annotation Qingkai Zeng, Mengxia Yu, Wenhao Yu, Tianwen Jiang, Meng Jiang https://arxiv.org/abs/2101.08698

Learning Prediction Intervals for Model Performance Benjamin Elder, Matthew Arnold, Anupama Murthi, Jiri Navratil https://arxiv.org/abs/2012.08625

Dive into Decision Trees and Forests: A Theoretical Demonstration Jinxiong Zhang https://arxiv.org/abs/2101.08656

Self-semi-supervised Learning to Learn from NoisyLabeled Data Jiacheng Wang, Yue Ma, Shuang Gao https://arxiv.org/abs/2011.01429

Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, Yu Su https://arxiv.org/abs/2011.07743

Uncertainty as a Form of Transparency: Measuring, Communicating, and Using Uncertainty Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q. Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Gauthier Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo, Lama Nachman, Rumi Chunara, Madhulika Srikumar, Adrian Weller, Alice Xiang https://arxiv.org/abs/2011.07586

A Survey on Data Augmentation for Text Classification Markus Bayer, Marc-André Kaufhold, Christian Reuter https://arxiv.org/abs/2107.03158

A Survey on Automated Fact-Checking Zhijiang Guo, Michael Schlichtkrull, Andreas Vlachos https://arxiv.org/abs/2108.11896

The Benchmark Lottery Mostafa Dehghani, Yi Tay, Alexey A. Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, Oriol Vinyals https://arxiv.org/abs/2107.07002

Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence Alexander Hoyle, Pranav Goel, Denis Peskov, Andrew Hian-Cheong, Jordan Boyd-Graber, Philip Resnik https://arxiv.org/abs/2107.02173

The Modern Mathematics of Deep Learning Julius Berner, Philipp Grohs, Gitta Kutyniok, Philipp Petersen https://arxiv.org/abs/2105.04026

Biases in human mobility data impact epidemic modeling Frank Schlosser, Vedran Sekara, Dirk Brockmann, Manuel Garcia-Herranz https://arxiv.org/abs/2112.12521

Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation Shaoshi Sun, Zhenyuan Zhang, BoCheng Huang, Pengbin Lei, Jianlin Su, Shengfeng Pan, Jiarun Cao https://arxiv.org/abs/2112.12433

Clean or Annotate: How to Spend a Limited Data Collection Budget Derek Chen, Zhou Yu, Samuel R. Bowman https://arxiv.org/abs/2110.08355

How many labelers do you have? A closer look at gold-standard labels Chen Cheng, Hilal Asi, John Duchi https://arxiv.org/abs/2206.12041

Eliciting and Learning with Soft Labels from Every Annotator https://arxiv.org/abs/2207.00810

Quantified Reproducibility Assessment of NLP Results Anya Belz, Maja Popović, Simon Mille https://arxiv.org/abs/2204.05961

SHAP and LIME Python Libraries: Part 2 - Using SHAP and LIME https://www.dominodatalab.com/blog/shap-lime-python-libraries-part-2-using-shap-lime

Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, Douglas G. Altman https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/

The Mythos of Model Interpretability https://arxiv.org/pdf/1606.03490v1.pdf

Lessons Learned Reproducing a Deep Reinforcement Learning Paper Apr 6, 2018 http://amid.fish/reproducing-deep-rl

Spatial autocorrelation: bane or bonus? Matt. D. M. Pawley, Brian H. McArdle doi: https://doi.org/10.1101/385526 https://www.biorxiv.org/content/10.1101/385526v1

On Reality and the Limits of Language Data Nigel H. Collier, Fangyu Liu, Ehsan Shareghi https://arxiv.org/abs/2208.11981

Open Information Extraction from 2007 to 2022 – A Survey Pai Liu, Wenyang Gao, Wenjie Dong, Songfang Huang, Yue Zhang https://arxiv.org/abs/2208.08690

Colah’s blog http://colah.github.io/

Causal Reasoning: Fundamentals and Machine Learning Applications http://causalinference.gitlab.io/book/

http://courses.d2l.ai/berkeley-stat-157/units/index.html#

A Compendium of Clean Graphs in R http://shinyapps.org/apps/RGraphCompendium/index.php?utm_content=bufferd23cb&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Bayesian Inference an interactive visualization https://rpsychologist.com/d3/bayes/?utm_content=buffera5352&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Fitting distributions with R https://www.magesblog.com/post/2011-12-01-fitting-distributions-with-r/

Concerns About Bots on Mechanical Turk: Problems and Solutions https://www.cloudresearch.com/resources/blog/concerns-about-bots-on-mechanical-turk-problems-and-solutions/

Reproducible Research with R & RStudio 2nd Edition Christopher Gandrud http://christophergandrud.github.io/RepResR-RStudio/

Regression and Causality https://arxiv.org/pdf/2006.11754.pdf

Introduction to Causal Inference Fall 2020 https://www.bradyneal.com/causal-inference-course

Tensorflow 2.0 Pitfalls A list of commonly seen issues along with solutions. http://blog.ai.ovgu.de/posts/jens/2019/001_tf20_pitfalls/index.html

Cold Case: The Lost MNIST Digits Chhavi Yadav, Léon Bottou https://arxiv.org/abs/1905.10498

Automated Text Classification of News Articles: A Practical Guide Published online by Cambridge University Press: 09 June 2020 https://www.cambridge.org/core/journals/political-analysis/article/abs/automated-text-classification-of-news-articles-a-practical-guide/10462DB284B1CD80C0FAE796AD786BC6

How to Use t-SNE Effectively https://distill.pub/2016/misread-tsne/

Locality Sensitive Hashing in R http://dsnotes.com/post/locality-sensitive-hashing-in-r-part-1/

Identification of and Correction for Publication Bias Isaiah Andrews https://www.aeaweb.org/articles?id=10.1257/aer.20180310

Mediation Analysis is Counterintuitively Invalid http://datacolada.org/103

Dive into Deep Learning http://d2l.ai/

CS224d: Deep Learning for Natural Language Processing http://cs224d.stanford.edu/syllabus.html

Regression Models for Count Data: beyond the Poisson model http://cursos.leg.ufpr.br/rmcd/

p-hacking fast and slow: Evaluating a forthcoming AER paper deeming some econ literatures less trustworthy http://datacolada.org/91

Attention Is All You Need Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin https://arxiv.org/abs/1706.03762

When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai, In Song Kim https://imai.fas.harvard.edu/research/files/FEmatch.pdf

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? https://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf

P values in display items are ubiquitous and almost invariably significant: A survey of top science journals https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0197440

R Workflow for Reproducible Data Analysis and Reporting http://hbiostat.org/rflow/

The reusable holdout: Preserving validity in adaptive data analysis https://ai.googleblog.com/2015/08/the-reusable-holdout-preserving.html

The science that’s never been cited Nature investigates how many papers really end up without a single citation. https://www.nature.com/articles/d41586-017-08404-0?WT.mc_id=TWT_NA_1711_FHNEWSFNEVERCITED_PORTFOLIO

didimputation: the goal of didimputation is to estimate TWFE models without running into the problem of staggered treatment adoption. https://github.com/kylebutts/didimputation

Methods Matter: P-Hacking and Causal Inference in Economics https://docs.iza.org/dp11796.pdf

CloudForest https://github.com/ryanbressler/CloudForest

The idea for Artificial Contrasts is based on: Eugene Tuv and Kari Torkkola’s “Feature Filtering with Ensembles Using Artificial Contrasts” http://enpub.fulton.asu.edu/workshop/FSDM05-Proceedings.pdf#page=74 and Eugene Tuv, Alexander Borisov, George Runger and Kari Torkkola’s “Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination” http://www.researchgate.net/publication/220320233_Feature_Selection_with_Ensembles_Artificial_Variables_and_Redundancy_Elimination/file/d912f5058a153a8b35.pdf

The idea for growing trees to minimize categorical entropy comes from Ross Quinlan’s ID3: http://en.wikipedia.org/wiki/ID3_algorithm

“The Elements of Statistical Learning” 2nd edition by Trevor Hastie, Robert Tibshirani and Jerome Friedman was also consulted during development.

Methods for classification from unbalanced data are covered in several papers: http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/ http://www.biomedcentral.com/1471-2105/11/523 http://bib.oxfordjournals.org/content/early/2012/03/08/bib.bbs006 http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067863

Density-estimating trees/forests are discussed in: http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p627.pdf http://research.microsoft.com/pubs/158806/CriminisiForests_FoundTrends_2011.pdf The latter also introduces the idea of manifold forests, which can be learned using downstream analysis of the outputs of leafcount to find the Fiedler vectors of the graph Laplacian.

An introduction to Git and how to use it with RStudio http://r-bio.github.io/intro-git-rstudio/

Probability and Statistics Cookbook https://pages.cs.wisc.edu/~tdw/files/cookbook-en.pdf

The Plain Person’s Guide to Plain Text Social Science https://plain-text.co/

Causal Graphical Views of Fixed Effects and Random Effects Models https://psyarxiv.com/cxd2n/

Beware Default Random Forest Importances https://explained.ai/rf-importance/index.html

Tools and guides to put R models into production https://putrinprod.com/

’Metrics Monday: You Can’t Compare OLS with 2SLS Published November 20, 2017 http://marcfbellemare.com/wordpress/12723

Causal Inference Animated Plots https://nickchk.com/causalgraphs.html#iv

Scaling Data from Multiple Sources https://www.cambridge.org/core/journals/political-analysis/article/abs/scaling-data-from-multiple-sources/1F9D30D8DDCE44379E8B962C29DADBAB?utm_source=hootsuite&utm_medium=twitter&utm_campaign=PAN_Nov20

GAM: The Predictive Modeling Silver Bullet https://multithreaded.stitchfix.com/blog/2015/07/30/gam/

Generalized Full Matching Published online by Cambridge University Press: 23 November 2020 https://www.cambridge.org/core/journals/political-analysis/article/abs/generalized-full-matching/3DA71D8BEDA6F02B5D36457E114C79B6?utm_source=hootsuite&utm_medium=twitter&utm_campaign=PAN_Nov20

A Deep Dive Into How R Fits a Linear Model http://madrury.github.io/jekyll/update/statistics/2016/07/20/lm-in-R.html

A ModernDive into R and the Tidyverse https://moderndive.com/

Instrumental Variables Regressions Involving Seasonal Data David E.A. Giles http://web.uvic.ca/~dgiles/blog/Giles_FWL.pdf

The Book of Statistical Proofs https://statproofbook.github.io/

An econometric method for estimating population parameters from non‐random samples: An application to clinical case finding http://www-personal.umich.edu/~zmclaren/mclaren_tbprevalence.pdf

Parallelizing neural networks on one GPU with JAX http://willwhitney.com/parallel-training-jax.html

https://wrdrd.github.io/docs/

Learning interactions via hierarchical group-lasso regularization Michael Lim, Trevor Hastie June 21, 2014 https://hastie.su.domains/Papers/glinternet_jcgs.pdf

On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data http://web.mit.edu/insong/www/pdf/FEmatch-twoway.pdf

Backprop is not just the chain rule Aug 18, 2017 http://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule/

How to Plot XGBoost Trees in R http://theautomatic.net/2021/04/28/how-to-plot-xgboost-trees-in-r/

Rectangling https://tidyr.tidyverse.org/articles/rectangle.html

R Packages (2e) https://r-pkgs.org/

Prior distributions for variance parameters in hierarchical models http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf

A visual introduction to machine learning http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Prior distribution Andrew Gelman Volume 3, pp 1634–1637 http://www.stat.columbia.edu/~gelman/research/published/p039-_o.pdf

P-curve.com http://www.p-curve.com/

Model Tuning and the Bias-Variance Tradeoff http://www.r2d3.us/visual-intro-to-machine-learning-part-2/

How (and why) to create a good validation set https://www.fast.ai/posts/2017-11-13-validation-sets.html

The Promise and Pitfalls of Differences-in-Differences: Reflections on ‘16 and Pregnant’ and Other Applications https://www.nber.org/papers/w24857

Applied Bayesian Modeling http://www.leg.ufpr.br/lib/exe/fetch.php/wiki:internas:biblioteca:cogdon.pdf

Gaussian Distributions are Soap Bubbles https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/

Interpreting Instrumented Difference-in-Differences http://www.mit.edu/~liebers/DDIV.pdf

Probability, log-odds, and odds https://www.montana.edu/rotella/documents/502/Prob_odds_log-odds.pdf

TRANSFORMERS FROM SCRATCH https://peterbloem.nl/blog/transformers

How to Examine External Validity Within an Experiment https://www.nber.org/papers/w24834

Program Evaluation https://www.lecy.info/program-evaluation/

Facing Imbalanced Data Recommendations for the Use of Performance Metrics https://sites.pitt.edu/~jeffcohn/biblio/Jeni_Metrics.pdf

The Art and Practice of Economics Research: Lessons from Leading Minds https://static1.squarespace.com/static/56ec62678a65e20b89da5f33/t/6164758b00bbcb015c12dd53/1633973644033/Card.pdf

Statistics: P values are just the tip of the iceberg https://www.nature.com/articles/520612a

Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem https://www.jmlr.org/papers/volume19/16-474/16-474.pdf

Can transparency undermine peer review? A simulation model of scientist behavior under open peer review Federico Bianchi, Flaminio Squazzoni https://academic.oup.com/spp/article/49/5/791/6602348?login=false

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation https://aclanthology.org/2021.emnlp-main.97/

Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results Alwyn Young https://academic.oup.com/qje/article-abstract/134/2/557/5195544?redirectedFrom=fulltext

Advanced R https://adv-r.hadley.nz/index.html

Causal Machine Learning: A Survey and Open Problems https://ai.papers.bar/paper/460ac86ef8e611ecb9b9d35608ee6155

On the Meaning of Within-Factor Correlated Measurement Errors https://academic.oup.com/jcr/article-abstract/11/1/572/1822756

Trustworthy Machine Learning http://www.trustworthymachinelearning.com/

Statistical Modeling: The Two Cultures Leo Breiman http://www2.math.uu.se/~thulin/mm/breiman.pdf

Estimating misclassification error with small samples via bootstrap cross-validation https://academic.oup.com/bioinformatics/article/21/9/1979/409121?login=true

Critical appraisal of artificial intelligence-based prediction models for cardiovascular disease https://academic.oup.com/eurheartj/article/43/31/2921/6593474?login=false

The Only Probability Cheatsheet You’ll Ever Need http://www.wzchen.com/probability-cheatsheet/

Come back and scrape these http://www.wzchen.com/data-science-books

Download the Datasaurus: Never trust summary statistics alone; always visualize your data http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html

Prior Setting in Practice: Strategies and Rationales Used in Choosing Prior Distributions for Bayesian Analysis https://abhsarma.github.io/pubs/Prior_Setting_CHI2020.pdf

50 Years of Data Science David Donoho https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734

The Assessment of Intrinsic Credibility and a New Argument for p<0.005 Leonhard Held https://arxiv.org/abs/1803.10052

Arbitrariness of peer review: A Bayesian analysis of the NIPS experiment Olivier Francois https://arxiv.org/abs/1507.06411

Diaries of Social Data Research https://anchor.fm/diaries-soc-data-research/episodes/The-Evolution-of-Computational-Social-Science-from-a-Sociology-Perspective-with-Chris-Bail-e17vikf

pipecleaner https://alistaire47.github.io/pipecleaner/

Cross-Validation for Correlated Data https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2020.1801451

The causal hype ratchet https://statmodeling.stat.columbia.edu/2018/12/21/causal-hype-ratchet/

A Permutation Test for the Regression Kink Design Peter Ganong & Simon Jäger https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2017.1328356#.XEx7z89KjXF

Approximate Residual Balancing: De-Biased Inference of Average Treatment Effects in High Dimensions https://arxiv.org/abs/1604.07125

High-Dimensional Convex Geometry https://amitrajaraman.github.io/notes/convex-geometry/Convex_Geometry.pdf

Data Science Interviews During the 2020 Pandemic https://alexgude.com/blog/interviewing-for-data-science-positions-in-2020/

Tech Interviews: Respect Everyone’s Time https://alexgude.com/blog/interviews-respect-time/

Distribution-Free Prediction Intervals with Conformal Inference using R https://arelbundock.com/posts/conformal/

Robustness checks https://statmodeling.stat.columbia.edu/2018/11/14/robustness-checks-joke/ https://statmodeling.stat.columbia.edu/2017/11/29/whats-point-robustness-check/

Synthetically generated text for supervised text analysis Andrew Halterman https://andrewhalterman.com/files/Halterman_synthetic_text.pdf

EPP: interpretable score of model predictive power Alicja Gosiewska, Mateusz Bakala, Katarzyna Woznica, Maciej Zwolinski, Przemyslaw Biecek https://arxiv.org/abs/1908.09213

Measuring Calibration in Deep Learning Jeremy Nixon, Mike Dusenberry, Ghassen Jerfel, Timothy Nguyen, Jeremiah Liu, Linchuan Zhang, Dustin Tran https://arxiv.org/abs/1904.01685

What can be estimated? Identifiability, estimability, causal inference and ill-posed inverse problems Oliver J. Maclaren, Ruanui Nicholson https://arxiv.org/abs/1904.02826

Comparing Spike and Slab Priors for Bayesian Variable Selection Gertraud Malsiner-Walli, Helga Wagner https://arxiv.org/abs/1812.07259

Time-uniform, nonparametric, nonasymptotic confidence sequences Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, Jasjeet Sekhon https://arxiv.org/abs/1810.08240

Open Science in Software Engineering Daniel Méndez Fernández, Daniel Graziotin, Stefan Wagner, Heidi Seibold

Safe Testing We develop the theory of hypothesis testing based on the E-value, a notion of evidence that, unlike the p-v https://arxiv.org/abs/1906.07801

A Mini-Introduction To Information Theory Edward Witten https://arxiv.org/abs/1805.11965

An Introduction to Deep Reinforcement Learning Vincent Francois-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau https://arxiv.org/abs/1811.12560

On the cross-validation bias due to unsupervised pre-processing Amit Moscovich, Saharon Rosset https://arxiv.org/abs/1901.08974

Troubling Trends in Machine Learning Scholarship Zachary C. Lipton, Jacob Steinhardt https://arxiv.org/abs/1807.03341

The Role of the Propensity Score in Fixed Effect Models Dmitry Arkhangelsky, Guido Imbens https://arxiv.org/abs/1807.02099

Proxy Controls and Panel Data Ben Deaner https://arxiv.org/abs/1810.00283

Structural Breaks in Time Series Alessandro Casini, Pierre Perron https://arxiv.org/abs/1805.03807

Comparing interpretability and explainability for feature selection Jack Dunn, Luca Mingardi, Ying Daisy Zhuo https://arxiv.org/abs/2105.05328

Cross-validation: what does it estimate and how well does it do it? Stephen Bates, Trevor Hastie, Robert Tibshirani https://arxiv.org/abs/2104.00673

On the implied weights of linear regression for causal inference Ambarish Chattopadhyay, Jose R. Zubizarreta https://arxiv.org/abs/2104.06581

A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification Anastasios N. Angelopoulos, Stephen Bates https://arxiv.org/abs/2107.07511

Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations Aahlad Puli, Lily H. Zhang, Eric K. Oermann, Rajesh Ranganath https://arxiv.org/abs/2107.00520

A Tutorial on VAEs: From Bayes’ Rule to Lossless Compression Ronald Yu https://arxiv.org/abs/2006.10273

Common Limitations of Image Processing Metrics: A Picture Story https://arxiv.org/abs/2104.05642

On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies Tianyi Zhang, Tatsunori Hashimoto https://arxiv.org/abs/2104.05694

A large-scale study on research code quality and execution Ana Trisovic, Matthew K. Lau, Thomas Pasquier, Mercè Crosas https://arxiv.org/abs/2103.12793

Explaining by Removing: A Unified Framework for Model Explanation Ian Covert, Scott Lundberg, Su-In Lee https://arxiv.org/abs/2011.14878

What is Entropy? A new perspective from games of chance Sarah Brandsen, Isabelle Jianing Geng, Gilad Gour https://arxiv.org/abs/2103.08681

Instrumental variables, spatial confounding and interference Andrew Giffin, Brian J. Reich, Shu Yang, Ana G. Rappold https://arxiv.org/abs/2103.00304

On Linear Identifiability of Learned Representations Geoffrey Roeder, Luke Metz, Diederik P. Kingma https://arxiv.org/abs/2007.00810

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi https://arxiv.org/abs/2009.10795

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks Curtis G. Northcutt, Anish Athalye, Jonas Mueller https://arxiv.org/abs/2103.14749

Contamination Bias in Linear Regressions Paul Goldsmith-Pinkham, Peter Hull, Michal Kolesár

Towards optimal doubly robust estimation of heterogeneous causal effects Edward H. Kennedy https://arxiv.org/abs/2004.14497

When are Non-Parametric Methods Robust? Robi Bhattacharjee, Kamalika Chaudhuri https://arxiv.org/abs/2003.06121

When Is Parallel Trends Sensitive to Functional Form? Jonathan Roth, Pedro H. C. Sant’Anna https://arxiv.org/abs/2010.04814

Optimal Regularization Can Mitigate Double Descent Preetum Nakkiran, Prayaag Venkat, Sham Kakade, Tengyu Ma

Valid Causal Inference with (Some) Invalid Instruments Jason Hartford, Victor Veitch, Dhanya Sridhar, Kevin Leyton-Brown https://arxiv.org/abs/2006.11386

A Survey on Knowledge Graphs: Representation, Acquisition and Applications Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, Philip S. Yu https://arxiv.org/abs/2002.00388

The MCC-F1 curve: a performance evaluation technique for binary classification Chang Cao, Davide Chicco, Michael M. Hoffman https://arxiv.org/abs/2006.11278

Causal Inference and Data Fusion in Econometrics Paul Hünermund (Copenhagen Business School), Elias Bareinboim (Columbia University) https://arxiv.org/abs/1912.09104

Learning to Induce Causal Structure Nan Rosemary Ke, Silvia Chiappa, Jane Wang, Anirudh Goyal, Jorg Bornschein, Melanie Rey, Theophane Weber, Matthew Botvinick, Michael Mozer, Danilo Jimenez Rezende https://arxiv.org/abs/2204.04875

Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett https://arxiv.org/abs/2202.05928

Causal Inference Through the Structural Causal Marginal Problem Luigi Gresele, Julius von Kügelgen, Jonas M. Kübler, Elke Kirschbaum, Bernhard Schölkopf, Dominik Janzing https://arxiv.org/abs/2202.01300

Benefits and costs of matching prior to a Difference in Differences analysis when parallel trends does not hold Dae Woong Ham, Luke Miratrix https://arxiv.org/abs/2205.08644

Causal influence, causal effects, and path analysis in the presence of intermediate confounding Iván Díaz https://arxiv.org/abs/2205.08000

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra https://arxiv.org/abs/2201.02177

Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI Shlomo Kashani, Amir Ivry https://arxiv.org/abs/2201.00650

Better Uncertainty Calibration via Proper Scores for Classification and Beyond Sebastian Gruber, Florian Buettner https://arxiv.org/abs/2203.07835

Sensitivity Analysis of Individual Treatment Effects: A Robust Conformal Inference Approach Ying Jin, Zhimei Ren, Emmanuel J. Candès https://arxiv.org/abs/2111.12161

Towards a Unified Information-Theoretic Framework for Generalization Mahdi Haghifam, Gintare Karolina Dziugaite, Shay Moran, Daniel M. Roy https://arxiv.org/abs/2111.05275

Learning in High Dimension Always Amounts to Extrapolation Randall Balestriero, Jerome Pesenti, Yann LeCun https://arxiv.org/abs/2110.09485

Understanding Dataset Difficulty with V-Usable Information Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta https://arxiv.org/abs/2110.08420

Batch Normalization Explained Randall Balestriero, Richard G. Baraniuk https://arxiv.org/abs/2209.14778

Bayesian Online Changepoint Detection https://arxiv.org/pdf/0710.3742.pdf

Impact of subsampling and pruning on random forests. https://arxiv.org/pdf/1603.04261.pdf

Selection Collider Bias in Large Language Models Emily McMilin https://arxiv.org/abs/2208.10063

On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models Rohan Anil, Sandra Gadanho, Da Huang, Nijith Jacob, Zhuoshu Li, Dong Lin, Todd Phillips, Cristina Pop, Kevin Regan, Gil I. Shamir, Rakesh Shivanna, Qiqi Yan https://arxiv.org/abs/2209.05310

Selective review of offline change point detection methods https://arxiv.org/pdf/1801.00718.pdf

How Much More Data Do I Need? Estimating Requirements for Downstream Tasks Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law https://arxiv.org/abs/2207.01725

Snorkel: Rapid Training Data Creation with Weak Supervision https://arxiv.org/pdf/1711.10160.pdf

Defining and Characterizing Reward Hacking Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger https://arxiv.org/abs/2209.13085

On Leave-One-Out Conditional Mutual Information For Generalization Mohamad Rida Rammal, Alessandro Achille, Aditya Golatkar, Suhas Diggavi, Stefano Soatto https://arxiv.org/abs/2207.00581

Formal Algorithms for Transformers Mary Phuong, Marcus Hutter https://arxiv.org/abs/2207.09238

Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, Armen Aghajanyan https://arxiv.org/abs/2205.10770

Why do tree-based models still outperform deep learning on tabular data? Léo Grinsztajn (SODA), Edouard Oyallon (ISIR, CNRS), Gaël Varoquaux (SODA) https://arxiv.org/abs/2207.08815

Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran https://arxiv.org/abs/2207.06569

Towards Understanding Grokking: An Effective Theory of Representation Learning Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams https://arxiv.org/abs/2205.10343

Pen and Paper Exercises in Machine Learning Michael U. Gutmann https://arxiv.org/abs/2206.13446

Introduction to DiD with Multiple Time Periods Brantly Callaway and Pedro H.C. Sant’Anna 2022-07-19 https://bcallaway11.github.io/did/articles/multi-period-did.html

Applications of Deep Neural Networks with Keras Jeff Heaton Fall 2022.0 https://arxiv.org/pdf/2009.05673.pdf

Joint Distributions for TensorFlow Probability Dan Piponi, Dave Moore & Joshua V. Dillon, Google Research https://arxiv.org/pdf/2001.11819.pdf

Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers https://arxiv.org/pdf/2007.01547.pdf

Geographic Difference-in-Discontinuities Kyle Butts https://arxiv.org/pdf/2109.07406.pdf

Pre-trained Models for Natural Language Processing: A Survey Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai & Xuanjing Huang https://arxiv.org/pdf/2003.08271.pdf

Knowledge Graphs on the Web – an Overview https://arxiv.org/pdf/2003.00719.pdf

Knowledge Graphs https://arxiv.org/pdf/2003.02320.pdf

TOPOLOGY OF DEEP NEURAL NETWORKS GREGORY NAITZAT, ANDREY ZHITNIKOV, AND LEK-HENG LIM https://arxiv.org/pdf/2004.06093.pdf

Noise-Induced Randomization in Regression Discontinuity Designs https://arxiv.org/pdf/2004.09458.pdf

Markov Chain Monte Carlo Methods, a survey with some frequent misunderstandings https://arxiv.org/pdf/2001.06249.pdf

Learning Dependency Structures for Weak Supervision Models https://arxiv.org/pdf/1903.05844.pdf

Approximate leave-future-out cross-validation for Bayesian time series models https://arxiv.org/pdf/1902.06281.pdf

Relational Representation Learning for Dynamic (Knowledge) Graphs: A Survey https://arxiv.org/pdf/1905.11485v1.pdf

Statistical methods research done as science rather than mathematics James S. Hodges https://arxiv.org/pdf/1905.08381.pdf

R Tip: use isTRUE() https://win-vector.com/2018/06/11/r-tip-use-istrue/

The tidymodels Package https://www.tidyverse.org/blog/2018/08/tidymodels-0-0-1/

Regular expressions are tricky. RegExplain makes it easier to see what you’re doing. https://www.garrickadenbuie.com/project/regexplain/

The ability of different peer review procedures to flag problematic publications https://link.springer.com/article/10.1007/s11192-018-2969-2

gglabeller https://github.com/AliciaSchep/gglabeller?utm_content=buffer552f9&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

The {targets} R package user manual https://books.ropensci.org/targets/

How Regularization Works https://e2eml.school/regularization.html

Don’t be tricked by the Hashing Trick https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087

How to Use Catboost with Tidymodels https://blog.rmhogervorst.nl/blog/2020/08/28/how-to-use-catboost-with-tidymodels/

R Markdown: The Definitive Guide https://bookdown.org/yihui/rmarkdown/

37 Reasons why your Neural Network is not working https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607

Time to assume that health research is fraudulent until proven otherwise? https://blogs.bmj.com/bmj/2021/07/05/time-to-assume-that-health-research-is-fraudulent-until-proved-otherwise/

Quality of reporting of randomised controlled trials of artificial intelligence in healthcare: a systematic review Rida Shahzad, Bushra Ayub, M A Rehman Siddiqui https://bmjopen.bmj.com/content/12/9/e061519.abstract

Running R Scripts on a Schedule with GitHub Actions By Simon P. Couch DECEMBER 27, 2020 https://www.simonpcouch.com/blog/r-github-actions-commit/

Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction Shangzhi Hong & Henry S. Lynn https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01080-1

spatialRF: Easy Spatial Regression with Random Forest https://blasbenito.github.io/spatialRF/

Supervised Clustering: How to Use SHAP Values for Better Cluster Analysis https://www.aidancooper.co.uk/supervised-clustering-shap-values/

Exploring Neural Networks Visually in the Browser https://cprimozic.net/blog/neural-network-experiments-and-visualizations/

How Much Should We Trust Instrumental Variable Estimates in Political Science? Practical Advice Based on Over 60 Replicated Studies https://yiqingxu.org/papers/english/2021_iv/LLXZ.pdf

Feathr: LinkedIn’s feature store is now available on Azure Posted on April 12, 2022 Xiaoyong Zhu https://azure.microsoft.com/en-us/blog/feathr-linkedin-s-feature-store-is-now-available-on-azure/

A Survey of Learning on Small Data Xiaofeng Cao, Weixin Bu, Shengjun Huang, Yingpeng Tang, Yaming Guo, Yi Chang, Ivor W. Tsang https://arxiv.org/abs/2207.14443

Ontology-based industrial data management platform Sergey Gorshkov, Alexander Grebeshkov, Roman Shebalov https://arxiv.org/abs/2103.05538

How to Speed Up XGBoost Model Training https://www.anyscale.com/blog/how-to-speed-up-xgboost-model-training

Markov Chain Monte Carlo Methods for Bayesian Data Analysis in Astronomy https://www.datasciencecentral.com/markov-chain-monte-carlo-methods-for-bayesian-data-analysis-in/#w6JI5

How to Easily Draw Neural Network Architecture Diagrams https://towardsdatascience.com/how-to-easily-draw-neural-network-architecture-diagrams-a6b6138ed875

L2 Regularization and Batch Norm https://blog.janestreet.com/l2-regularization-and-batch-norm/

Trust in LIME: Yes, No, Maybe So? https://www.dominodatalab.com/blog/trust-in-lime-local-interpretable-model-agnostic-explanations

Inside Manifold: Uber’s Stack for Debugging Machine Learning Models https://towardsai.net/p/l/inside-manifold-ubers-stack-for-debugging-machine-learning-models?utm_source=twitter&utm_medium=social&utm_campaign=rop-content-recycle

Data validation for machine learning JUNE 5, 2019 ~ ADRIAN COLYER https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/

Multiprocessing vs. Threading in Python: What Every Data Scientist Needs to Know https://blog.floydhub.com/multiprocessing-vs-threading-in-python-what-every-data-scientist-needs-to-know/

A Comprehensive Guide to Machine Learning https://www.eecs189.org/static/resources/comprehensive-guide.pdf

A Concrete Introduction to Probability (using Python) https://github.com/norvig/pytudes/blob/main/ipynb/Probability.ipynb

An Interactive Guide To The Fourier Transform https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/

Identity Crisis https://betanalpha.github.io/assets/case_studies/identifiability.html

https://betanalpha.github.io/writing/

Bayes Sparse Regression Michael Betancourt March 2018 https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html#1_fading_into_irrelevance

An Introduction to Stan Michael Betancourt March 2020 https://betanalpha.github.io/assets/case_studies/stan_intro.html

https://developers.google.com/machine-learning

Prior Modeling Michael Betancourt September 2021 https://betanalpha.github.io/assets/case_studies/prior_modeling.html

Colorized Math Equations https://betterexplained.com/articles/colorized-math-equations/

Towards A Principled Bayesian Workflow Michael Betancourt April 2020 https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html

Ordinal Regression Michael Betancourt May 2019 https://betanalpha.github.io/assets/case_studies/ordinal_regression.html

Analysing continuous proportions in ecology and evolution: A practical introduction to beta and Dirichlet regression Jacob C. Douma, James T. Weedon https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13234

slider https://davisvaughan.github.io/slider/index.html

torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision https://davidpicard.github.io/pdf/lucky_seed.pdf

A Day in the Life of a Silicon Valley Data Engineer https://towardsdatascience.com/a-day-in-the-life-of-a-google-data-engineer-722f1b2206cc

ICML 2018 Notes https://david-abel.github.io/blog/posts/misc/icml_2018.pdf

ICML 2019 Notes https://david-abel.github.io/notes/icml_2019.pdf

Keep using plate notation https://davidrushingdewhurst.com/blog/2020-07-28keep-using-plate-notation.html

Data Visualization https://datavizs21.classes.andrewheiss.com/content/

DO YOU KNOW THE 4 TYPES OF ADDITIVE VARIABLE IMPORTANCES? https://datajms.com/post/variable_importance_feature_attribution/

geostan: Bayesian spatial analysis https://connordonegan.github.io/geostan/

Using Observational Study Data as an External Control Group for a Clinical Trial: an Empirical Comparison of Methods to Account for Longitudinal Missing Data https://www.researchgate.net/publication/357609855_Using_Observational_Study_Data_as_an_External_Control_Group_for_a_Clinical_Trial_an_Empirical_Comparison_of_Methods_to_Account_for_Longitudinal_Missing_Data

Selective Ignorability Assumptions in Causal Inference Marshall M. Joffe , Wei Peter Yang and Harold I. Feldman https://www.degruyter.com/document/doi/10.2202/1557-4679.1199/html

Expressing Regret: A Unified View of Credible Intervals Kenneth Rice & Lingbo Ye https://www.tandfonline.com/doi/abs/10.1080/00031305.2022.2039764

Efficient Identification in Linear Structural Causal Models with Instrumental Cutsets https://causalai.net/r49.pdf

Polymatching algorithm in observational studies with multiple treatment groups https://www.sciencedirect.com/science/article/abs/pii/S0167947321001985

An Introduction to Statistical Learning https://hastie.su.domains/ISLR2/ISLRv2_website.pdf

Core concepts in pharmacoepidemiology: Confounding by indication and the role of active comparators https://onlinelibrary.wiley.com/doi/10.1002/pds.5407

Aim for Clinical Utility, Not Just Predictive Accuracy Sachs, Michael C.; Sjölander, Arvid; Gabriel, Erin E. https://journals.lww.com/epidem/Fulltext/2020/05000/Aim_for_Clinical_Utility,_Not_Just_Predictive.8.aspx#ej-article-sam-container

Monitoring Machine Learning Models in Production A Comprehensive Guide https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/

*args and **kwargs in Python https://towardsdatascience.com/args-kwargs-python-d9c71b220970

Waiting for Event Studies: A Play in Three Acts Sun and Abraham (2020) Explainer https://causalinf.substack.com/p/waiting-for-event-studies-a-play

Computational Socioeconomics https://arxiv.org/abs/1905.06166

Is Peer Review a Good Idea? Remco Heesen and Liam Kofi Bright https://www.journals.uchicago.edu/doi/10.1093/bjps/axz029

Opening the Black Box: a motivation for the assessment of mediation Danella M Hafeman, Sharon Schwartz https://pubmed.ncbi.nlm.nih.gov/19261660/

Invited Commentary: Propensity Scores Marshall M. Joffe, Paul R. Rosenbaum https://academic.oup.com/aje/article/150/4/327/98791

Advances in propensity score analysis Peter C Austin https://journals.sagepub.com/doi/full/10.1177/0962280219899248

Analysis in an imperfect world Michael Wallace First published: 29 January 2020 https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2020.01353.x

Central Limit Theorem http://mfviz.com/central-limit/?utm_content=buffere918f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Adjusting for Covariates in Randomized Clinical Trials for Drugs and Biological Products Draft Guidance for Industry https://www.fda.gov/regulatory-information/search-fda-guidance-documents/adjusting-covariates-randomized-clinical-trials-drugs-and-biological-products

Machine learning for improving high-dimensional proxy confounder adjustment in healthcare database studies: An overview of the current literature https://onlinelibrary.wiley.com/doi/10.1002/pds.5500

A Gentle Introduction to tidymodels https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/

Simultaneous Variable and Covariance Selection with the Multivariate Spike-and-Slab LASSO https://arxiv.org/pdf/1708.08911.pdf?utm_content=bufferb1cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Transformers Explained Visually — Not Just How, but Why They Work So Well https://towardsdatascience.com/transformers-explained-visually-not-just-how-but-why-they-work-so-well-d840bd61a9d3

Easy Bayesian Bootstrap in R https://www.sumsar.net/blog/2015/07/easy-bayesian-bootstrap-in-r/?utm_content=buffer53c16&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Leave-future-out cross-validation for time-series models https://discourse.mc-stan.org/t/leave-future-out-cross-validation-for-time-series-models/12954/2

PCA in a tidy(verse) framework https://tbradley1013.github.io/2018/02/01/pca-in-a-tidy-verse-framework/?utm_content=bufferfaf31&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Visualising Residuals https://drsimonj.svbtle.com/visualising-residuals?utm_content=bufferdb80e&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Covariate adjustment for randomized controlled trials revisited Jixian Wang https://onlinelibrary.wiley.com/doi/full/10.1002/pst.1988?campaign=wolearlyview

STAT 545 Data wrangling, exploration, and analysis with R https://stat545.com/index.html

Topics in Econometrics: Advances in Causality and Foundations of Machine Learning https://maxkasy.github.io/home/TopicsInEconometrics2019/

A nontechnical explanation of the counterfactual definition of confounding Martijn J.L. Bours https://www.jclinepi.com/article/S0895-4356(19)30173-8/pdf

Discovering Reliable Correlations in Categorical Data https://deepai.org/publication/discovering-reliable-correlations-in-categorical-data

Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations https://deepai.org/publication/dealing-with-disagreements-looking-beyond-the-majority-vote-in-subjective-annotations

https://papers.labml.ai/papers/weekly

Generalized Principal Component Analysis https://deepai.org/publication/generalized-principal-component-analysis

Deeptime: a Python library for machine learning dynamical models from time series data https://deepai.org/publication/deeptime-a-python-library-for-machine-learning-dynamical-models-from-time-series-data

Delving into Deep Imbalanced Regression https://deepai.org/publication/delving-into-deep-imbalanced-regression

Causality-based Feature Selection: Methods and Evaluations https://deepai.org/publication/causality-based-feature-selection-methods-and-evaluations

Causal Inference Through the Structural Causal Marginal Problem 02/02/2022 Luigi Gresele, et al. https://deepai.org/publication/causal-inference-through-the-structural-causal-marginal-problem

Causal Discovery from Incomplete Data: A Deep Learning Approach https://deepai.org/publication/causal-discovery-from-incomplete-data-a-deep-learning-approach

AutoML: A Survey of the State-of-the-Art https://deepai.org/publication/automl-a-survey-of-the-state-of-the-art

CatBoostLSS – An extension of CatBoost to probabilistic forecasting https://deepai.org/publication/catboostlss-an-extension-of-catboost-to-probabilistic-forecasting

Breiman’s “Two Cultures” Revisited and Reconciled 05/27/2020 Subhadeep, et al. https://deepai.org/publication/breiman-s-two-cultures-revisited-and-reconciled

Graphical Representation of Missing Data Problems Felix Thoemmes and Karthika Mohan https://ftp.cs.ucla.edu/pub/stat_ser/r448-reprint.pdf

Does Obesity Shorten Life? Or is it the Soda? On Non-manipulable Causes Judea Pearl https://ftp.cs.ucla.edu/pub/stat_ser/r483-reprint.pdf

Gene name errors are widespread in the scientific literature Mark Ziemann, Yotam Eren & Assam El-Osta https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

A hypothesis is a liability Itai Yanai & Martin Lercher https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02133-w

gganimate extends the grammar of graphics as implemented by ggplot2 to include the description of animation. https://gganimate.com/

Geocomputation with R https://geocompr.robinlovelace.net/index.html

ftfy: fixes text for you https://ftfy.readthedocs.io/en/latest/

ggforce https://ggforce.data-imaginist.com/index.html

Geographically based Economic data (G-Econ) https://gecon.yale.edu/

The packages dtw for R and dtw-python for Python provide the most complete, freely-available (GPL) implementation of Dynamic Time Warping-type (DTW) algorithms up to date. https://dynamictimewarping.github.io/

Preprints: An underutilized mechanism to accelerate outbreak science Michael A. Johansson, Nicholas G. Reich, Lauren Ancel Meyers, Marc Lipsitch Published: April 3, 2018 https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002549

Randomization does not imply unconfoundedness https://drive.google.com/file/d/1nV8QMLxwXi-iWSqiwRN4KnMSWfoWJned/view

Bayesian Gaussian Graphical Models https://donaldrwilliams.github.io/BGGM/index.html

DoWhy: Addressing Challenges in Expressing and Validating Causal Assumptions https://drive.google.com/file/d/1i81CnMd683A788RYtEb8KSowhhPJn3Z6/view

Plotting background data for groups with ggplot2 https://drsimonj.svbtle.com/plotting-background-data-for-groups-with-ggplot2

Benign Overfitting in Linear Regression 06/26/2019 https://deepai.org/publication/benign-overfitting-in-linear-regression

Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data https://deepai.org/publication/amortized-causal-discovery-learning-to-infer-causal-graphs-from-time-series-data

Accelerating Deep Learning by Focusing on the Biggest Losers 10/02/2019 https://deepai.org/publication/accelerating-deep-learning-by-focusing-on-the-biggest-losers

A Class of Algorithms for General Instrumental Variable Models 06/11/2020 https://deepai.org/publication/a-class-of-algorithms-for-general-instrumental-variable-models

A Survey of Parameters Associated with the Quality of Benchmarks in NLP 10/14/2022 https://deepai.org/publication/a-survey-of-parameters-associated-with-the-quality-of-benchmarks-in-nlp

A study of uncertainty quantification in overparametrized high-dimensional models 10/23/2022 https://deepai.org/publication/a-study-of-uncertainty-quantification-in-overparametrized-high-dimensional-models

DeclareDesign Blog The trouble with ‘controlling for blocks’ https://declaredesign.org/blog/biased-fixed-effects.html

A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning 06/09/2022 https://deepai.org/publication/a-critical-review-on-the-use-and-misuse-of-differential-privacy-in-machine-learning

A Comprehensive Survey of Image Augmentation Techniques for Deep Learning 05/03/2022 https://deepai.org/publication/a-comprehensive-survey-of-image-augmentation-techniques-for-deep-learning

Stance Detection: A Survey ACM Computing Surveys, Volume 53, Issue 1, January 2021 https://dl.acm.org/doi/abs/10.1145/3369026

PyTorch: An Imperative Style, High-Performance Deep Learning Library 12/03/2019 https://deepai.org/publication/pytorch-an-imperative-style-high-performance-deep-learning-library

On Causally Disentangled Representations 12/10/2021 https://deepai.org/publication/on-causally-disentangled-representations

A Visual Exploration of Gaussian Processes https://distill.pub/2019/visual-exploration-gaussian-processes/

Principled Machine Learning: Practices and Tools for Efficient Collaboration https://dev.to/robogeek/principled-machine-learning-4eho

Rules of Machine Learning: Best Practices for ML Engineering Martin Zinkevich https://developers.google.com/machine-learning/guides/rules-of-ml

The Variability of Model Specification 10/06/2021 https://deepai.org/publication/the-variability-of-model-specification

Taxonomy of Benchmarks in Graph Representation Learning 06/15/2022 https://deepai.org/publication/taxonomy-of-benchmarks-in-graph-representation-learning

Recognizing Variables from their Data via Deep Embeddings of Distributions 09/11/2019 https://deepai.org/publication/recognizing-variables-from-their-data-via-deep-embeddings-of-distributions

Relaxed Softmax for learning from Positive and Unlabeled data 09/17/2019 https://deepai.org/publication/relaxed-softmax-for-learning-from-positive-and-unlabeled-data

On Quantitative Evaluations of Counterfactuals 10/30/2021 https://deepai.org/publication/on-quantitative-evaluations-of-counterfactuals

Learning Neural Causal Models from Unknown Interventions 10/02/2019 https://deepai.org/publication/learning-neural-causal-models-from-unknown-interventions

Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models 10/26/2020 https://deepai.org/publication/memorizing-without-overfitting-bias-variance-and-interpolation-in-over-parameterized-models

InceptionTime: Finding AlexNet for Time Series Classification 09/11/2019 https://deepai.org/publication/inceptiontime-finding-alexnet-for-time-series-classification

Learning from Positive and Unlabeled Data by Identifying the Annotation Process 03/02/2020 https://deepai.org/publication/learning-from-positive-and-unlabeled-data-by-identifying-the-annotation-process

Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning 12/22/2019 https://deepai.org/publication/lessons-from-archives-strategies-for-collecting-sociocultural-data-in-machine-learning

Identification In Missing Data Models Represented By Directed Acyclic Graphs 06/29/2019 https://deepai.org/publication/identification-in-missing-data-models-represented-by-directed-acyclic-graphs

rrtools: Tools for Writing Reproducible Research in R https://github.com/benmarwick/rrtools

fastshap The goal of fastshap is to provide an efficient and speedy (relative to other implementations) approach to computing approximate Shapley values which help explain the predictions from machine learning models.

Monitoring Data Quality at Scale with Statistical Modeling May 7, 2020 https://www.uber.com/blog/monitoring-data-quality-at-scale/

This standard operating procedure (SOP) document describes the default practices of the experimental research group led by Donald P. Green at Columbia University. https://github.com/acoppock/Green-Lab-SOP

ggplot2 extensions https://exts.ggplot2.tidyverse.org/gallery/

RemixAutoML website and reference manual https://github.com/AdrianAntico/RemixAutoML

Inference in Linear Regression Models with Many Covariates and Heteroscedasticity Matias D. Cattaneo, Michael Jansson, and Whitney K. Newey https://eml.berkeley.edu/~mjansson/Publications/Cattaneo-Jansson-Newey_2018_JASA.pdf

Derivation of front door adjustment without intervention on the mediator https://figshare.com/articles/journal_contribution/Derivation_of_front_door_adjustment_without_intervention_on_the_mediator/20278347/1

Satellite image datasets https://eod-grss-ieee.com/dataset-search

Point of View: How should novelty be valued in science? Barak A Cohen https://elifesciences.org/articles/28699

The Softmax function and its derivative https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

Why software projects take longer than you think: a statistical model 2019-04-15 https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html

MOVE IT OR LOSE IT: INTRODUCING PSEUDO-EARTH MOVER DIVERGENCE AS A CONTEXT-SENSITIVE METRIC FOR EVALUATING AND IMPROVING FORECASTING AND PREDICTION SYSTEMS https://events.barcelonagse.eu/live/files/2912-pemdivbarcelonapdf

The spatial allocation of population: a review of large-scale gridded population data products and their fitness for use https://eprints.soton.ac.uk/434156/1/The_spatial_allocation_of_population.pdf

https://www.youtube.com/playlist?list=PL8PYTP1V4I8D0UkqW2fEhgLrnlDW9QK7z

Robust misinterpretation of confidence intervals Rink Hoekstra & Richard D. Morey & Jeffrey N. Rouder & Eric-Jan Wagenmakers https://ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf

Transformers from Scratch Brandon Rohrer https://e2eml.school/transformers.html

Evidence on Research Transparency in Economics Edward Miguel https://www.aeaweb.org/articles?id=10.1257/jep.35.3.193

The Dance of the Mechanisms: How Observed Information Influences the Validity of Missingness Assumptions https://journals.sagepub.com/doi/10.1177/0049124118799376

The Generalizability of Survey Experiments Published online by Cambridge University Press: 12 January 2016 https://www.cambridge.org/core/journals/journal-of-experimental-political-science/article/abs/generalizability-of-survey-experiments/72D4E3DB90569AD7F2D469E9DF3A94CB

Preregistering qualitative research Tamarinde L. Haven & Dr. Leonie Van Grootel https://www.tandfonline.com/doi/full/10.1080/08989621.2019.1580147

Categorical Perception of p-Values V. N. Vimal Rao, Jeffrey K. Bye, Sashank Varma https://onlinelibrary.wiley.com/doi/10.1111/tops.12589

On the Practice of Lagging Variables to Avoid Simultaneity William Robert Reed https://onlinelibrary.wiley.com/doi/10.1111/obes.12088

Phantom Counterfactuals Tara Slough https://onlinelibrary.wiley.com/doi/10.1111/ajps.12715

“Don’t Know” Responses, Personality, and the Measurement of Political Knowledge Published online by Cambridge University Press: 19 June 2015 https://www.cambridge.org/core/journals/political-science-research-and-methods/article/abs/dont-know-responses-personality-and-the-measurement-of-political-knowledge/C28B2FF6AD8181F9F60651C0933E5620

The influence of hidden researcher decisions in applied microeconomics https://onlinelibrary.wiley.com/doi/ftr/10.1111/ecin.12992

Inference in Experiments Conditional on Observed Imbalances in Covariates Per Johansson & Mattias Nordin https://www.tandfonline.com/doi/full/10.1080/00031305.2022.2054859

Research Replication: Practical Considerations Published online by Cambridge University Press: 04 April 2018 https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/research-replication-practical-considerations/B744967268CDAA3F44103AA5C8539EA2

The self-fulfilling prophecy of post-hoc power calculations Christos Christogiannis, Stavros Nikolakopoulos, Nikolaos Pandis, Dimitris Mavridis https://www.ajodo.org/article/S0889-5406(21)00697-1/fulltext

Equinox is a JAX library based around a simple idea: represent parameterised functions (such as neural networks) as PyTrees. https://docs.kidger.site/equinox/

P-Hacking, Data Type and Data-Sharing Policy https://docs.iza.org/dp15586.pdf

UpSetR generates static UpSet plots. https://github.com/hms-dbmi/UpSetR

The purpose of the future package is to provide a very simple and uniform way of evaluating R expressions asynchronously using various resources available to the user. https://github.com/HenrikBengtsson/future

miceRanger: Fast Imputation with Random Forests https://github.com/farrellday/miceRanger

scoringRules An R package to compute scoring rules for fixed (parametric) and simulated forecast distributions. https://github.com/FK83/scoringRules

bayesdfa implements Bayesian Dynamic Factor Analysis (DFA) with Stan. https://github.com/fate-ewi/bayesdfa

dtplyr provides a data.table backend for dplyr. https://github.com/tidyverse/dtplyr

performance https://github.com/easystats/performance

BorutaShap is a wrapper feature selection method which combines both the Boruta feature selection algorithm with shapley values. https://github.com/Ekeany/Boruta-Shap

D-Lab’s Introduction to Machine Learning with tidymodels https://github.com/dlab-berkeley/R-Machine-Learning

ggVennDiagram https://github.com/gaospecial/ggVennDiagram

The {InteractionPoweR} package conducts power analyses for regression models in cross-sectional data sets where the term of interest is an interaction between two variables, also known as ‘moderation’ analyses. https://github.com/dbaranger/InteractionPoweR

varimpact uses causal inference statistics to generate variable importance estimates for a given dataset and outcome. https://github.com/ck37/varImpact

tidystringdist https://github.com/ColinFay/tidystringdist

https://github.com/CenterForPeaceAndSecurityStudies/IntroductiontoMachineLearning

Probabilistic Programming and Bayesian Methods for Hackers Chapter 1 https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Ch1_Introduction_TFP.ipynb

Papers about Causal Inference and Language https://github.com/causaltext/causal-text-papers

diffobj - Diffs for R Objects https://github.com/brodieG/diffobj

CatBoost https://github.com/catboost/catboost

Awesome Public Datasets https://github.com/awesomedata/awesome-public-datasets/blob/master/README.rst

BlackJAX is a library of samplers for JAX that works on CPU as well as GPU. https://github.com/blackjax-devs/blackjax

sensemakr: Sensitivity Analysis Tools for OLS https://github.com/carloscinelli/sensemakr

TorchArrow: a data processing library for PyTorch https://github.com/pytorch/torcharrow

causaleffect is a Python library for computing conditional and non-conditional causal effects. https://github.com/pedemonte96/causaleffect

Dynamic State Space Models in JAX. https://github.com/probml/dynamax

numpy-hilbert-curve https://github.com/PrincetonLIPS/numpy-hilbert-curve

Bayesian optimization in JAX https://github.com/PredictiveIntelligenceLab/JAX-BO

splines_in_stan.pdf https://github.com/milkha/Splines_in_Stan/blob/master/splines_in_stan.pdf

This is a repository that makes an attempt to empirically take stock of the most important concepts necessary to understand cutting-edge research in neural network models for NLP. https://github.com/neulab/nn4nlp-concepts

The EloML package provides an Elo rating system for machine learning models. The Elo Predictive Power (EPP) score helps to assess model performance based on the Elo ranking system. https://github.com/ModelOriented/EloML

SPTAG (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search https://github.com/microsoft/SPTAG

fixest: Fast and user-friendly fixed-effects estimation https://github.com/lrberge/fixest/

tidybayes: Bayesian analysis + tidy data + geoms https://github.com/mjskay/tidybayes

Milvus is an open-source vector database built to power embedding similarity search and AI applications. https://github.com/milvus-io/milvus

bpCausal: Bayesian Causal Inference with Time-Series Cross-Sectional Data. R package for A Bayesian Alternative to the Synthetic Control Method. https://github.com/liulch/bpCausal

priors.pdf https://github.com/lsbastos/Delay/blob/master/Code/priors.pdf

cheat_sheet-slabinterval.pdf https://github.com/mjskay/ggdist/blob/master/figures-source/cheat_sheet-slabinterval.pdf

tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users. https://github.com/markfairbanks/tidypolars

ftfy: fixes text for you https://github.com/rspeer/python-ftfy

ggannotate https://github.com/MattCowgill/ggannotate

Conducting and Visualizing Specification Curve Analyses The goal of specr is to facilitate specification curve analyses (Simonsohn, Simmons & Nelson, 2019; also known as multiverse analyses, see Steegen, Tuerlinckx, Gelman & Vanpaemel, 2016). https://github.com/masurp/specr

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. https://github.com/lmcinnes/umap

survminer: Survival Analysis and Visualization https://github.com/kassambara/survminer

A curated list of resources dedicated to Natural Language Processing https://github.com/keon/awesome-nlp

vip: Variable Importance Plots https://github.com/koalaverse/vip/

How To Make Your Data Analysis Notebooks More Reproducible https://github.com/karthik/rstudio2019

Awesome Self-Supervised Learning https://github.com/jason718/awesome-self-supervised-learning

dagitty Graphical Analysis of Structural Causal Models https://github.com/jtextor/dagitty

Awesome Machine Learning https://github.com/josephmisiti/awesome-machine-learning#computer-vision-5

Replication, Replication https://gking.harvard.edu/files/abs/replication-abs.shtml

Ecological Regression with Partial Identification https://gking.harvard.edu/publications/ecological-regression-partial-identification

biglasso: Extend Lasso Model Fitting to Big Data in R https://github.com/YaohuiZeng/biglasso

marginaleffects package for R https://github.com/vincentarelbundock/marginaleffects

Feature Engineering and Selection by Max Kuhn and Kjell Johnson (2019). https://github.com/topepo/FES

tidyposterior https://github.com/tidymodels/tidyposterior

R Data Science Tutorials https://github.com/ujjwalkarn/DataScienceR

janitor https://github.com/sfirke/janitor

Conformal Inference R Project Maintained by Ryan Tibshirani https://github.com/ryantibs/conformal

semTools: Useful tools for structural equation modeling https://github.com/simsem/semTools/wiki

collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are: https://github.com/SebKrantz/collapse

latex2exp https://github.com/stefano-meschiari/latex2exp

Tracking Progress in Natural Language Processing https://github.com/sebastianruder/NLP-progress

Introduction to R Package Idealstan Robert Kubinec December 27, 2021 https://github.com/saudiwin/idealstan

Google’s Compact Language Detector 3 is a neural network model for language identification and the successor of CLD2 (available from CRAN). https://github.com/ropensci/cld3

terra is an R package for spatial data analysis. https://github.com/rspatial/terra

rmcelreath stat_rethinking_2022 https://github.com/rmcelreath/stat_rethinking_2022#calendar–topical-outline

charlatan makes fake data, inspired from and borrowing some code from Python’s faker (https://github.com/joke2k/faker) https://github.com/ropensci/charlatan

skimr provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a https://github.com/ropensci/skimr

assertr https://github.com/ropensci/assertr

Explaining Models by Propagating Shapley Values of Local Components Hugh Chen, Scott Lundberg, Su-In Lee https://arxiv.org/abs/1911.11888

Visualizing a Million Time Series with the Density Line Chart https://idl.cs.washington.edu/files/2018-DenseLines-arXiv.pdf

What are GANs? GANs (Generative Adversarial Networks) are models used in unsupervised machine learning https://hollobit.github.io/All-About-the-GAN/

Explaining machine learning models with SHAP and SAGE https://iancovert.com/blog/understanding-shap-sage/

The Dozen Things Experimental Economists Should Do (More of) https://ideas.repec.org/p/feb/artefa/00648.html

Synthetic Control Using Lasso (scul) https://hollina.github.io/scul/

Everything is fucked: The syllabus https://thehardestscience.com/2016/08/11/everything-is-fucked-the-syllabus/

Regression Modeling With Proportion Data (Part 1) Predicting Attendance in the German Handball-Bundesliga https://hansjoerg.me/2019/05/10/regression-modeling-with-proportion-data-part-1/

Conditional independences and causal relations implied by sets of equations Tineke Blom, Mirthe M. van Diepen, Joris M. Mooij https://jmlr.org/papers/v22/20-863.html

Researcher Degrees of Freedom Analysis https://joachim-gassen.github.io/rdfanalysis/

Evidence-Based Medicine—An Oral History Richard Smith, MBChB, CBE, FMedSci, FRCPE, FRCGP; Drummond Rennie, MD, FRCP https://jamanetwork.com/journals/jama/article-abstract/1817042

Geocomputation with R’s guide to reproducible spatial data analysis https://jakubnowosad.com/ogh2022/#/title-slide

Tutorial: JAX 101 https://jax.readthedocs.io/en/latest/jax-101/index.html

Autodidax: JAX core from scratch https://jax.readthedocs.io/en/latest/autodidax.html

Conda: Myths and Misconceptions https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/

A Statistical Method for Empirical Testing of Competing Theories Kosuke Imai Princeton University Dustin Tingley https://imai.fas.harvard.edu/research/files/mixture.pdf

The Influence of Data Pre-processing and Post-processing on Long Document Summarization Xinwei Du, Kailun Dong, Yuchen Zhang, Yongsheng Li, Ruei-Yu Tsay https://arxiv.org/abs/2112.01660

COLLIDER: A Robust Training Framework for Backdoor Data Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie https://arxiv.org/abs/2210.06704

Time Series Data Augmentation for Deep Learning: A Survey Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, Huan Xu https://arxiv.org/abs/2002.12478

Bayesian Changepoint Detection in (Num)Pyro Posted on Tue 08 June 2021 in probabilistic programming, changepoint detection, Bayesian https://irustandi.github.io/bayesian-changepoint-detection-in-numpyro.html

Modeling Regime Shifts in Multiple Time Series Etienne Gael Tajeuna, Mohamed Bouguessa, Shengrui Wang https://arxiv.org/abs/2109.09692

Why negative results? Publication of negative results is difficult in most fields, but in NLP the problem is exacerbated by the near-universal focus on improvements in benchmarks. https://insights-workshop.github.io/

Small Data, Big Decisions: Model Selection in the Small-Data Regime Jorg Bornschein, Francesco Visin, Simon Osindero https://arxiv.org/abs/2009.12583

Quantifying With Only Positive Training Data Denis dos Reis, Marcílio de Souto, Elaine de Sousa, Gustavo Batista https://arxiv.org/abs/2004.10356

Superbloom: Bloom filter meets Transformer John Anderson, Qingqing Huang, Walid Krichene, Steffen Rendle, Li Zhang https://arxiv.org/abs/2002.04723

Selection via Proxy: Efficient Data Selection for Deep Learning Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia https://arxiv.org/abs/1906.11829

Many Proxy Controls Ben Deaner https://arxiv.org/abs/2110.03973

https://datatalks.club/slack.html

The Reparameterization “Trick” As Simple as Possible in TensorFlow https://medium.com/@llionj/the-reparameterization-trick-4ff30fe92954

Mixed Models for Big Data: Explorations of a fast penalized regression approach with mgcv https://m-clark.github.io/posts/2019-10-20-big-mixed-models/

I saw your RCT and I have some worries! FAQs Macartan Humphreys 6 September 2021 https://macartan.github.io/i/notes/rct_faqs.html

Avoiding technical debt in social science research https://medium.com/pew-research-center-decoded/avoiding-technical-debt-in-social-science-research-54618194790a

Confounder Selection: Objectives and Approaches F. Richard Guo, Anton Rask Lundborg, Qingyuan Zhao https://math.papers.bar/paper/cc98597a2c2e11edaa66a71c10a887e7

Diagnosing Biased Inference with Divergences Michael Betancourt January 2017 https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html

Regression and Causality Michael Schomaker https://math.papers.bar/paper/7e46323aaf3d11eb9864394904658322

Mathematical Proof Between Generations https://math.papers.bar/paper/347f685c018d11edb9b9d35608ee6155

Document Deduplication with Locality Sensitive Hashing May 23, 2017 https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html

Mastering Shiny https://mastering-shiny.org/

How to be a modern scientist https://leanpub.com/modernscientist

Blind 75 LeetCode Questions https://leetcode.com/discuss/general-discussion/460599/blind-75-leetcode-questions

How Exactly UMAP Works And why exactly it is better than tSNE https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

MATH 342 (Time Series) https://lbelzile.github.io/timeseRies/

Generative vs. Discriminative; Bayesian vs. Frequentist https://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/

All Bayesian Models are Generative (in Theory) https://lingpipe-blog.com/2013/05/23/all-bayesian-models-are-generative-in-theory/

Failing Grade: 89% of Introduction-to-Psychology Textbooks That Define or Explain Statistical Significance Do So Incorrectly https://journals.sagepub.com/doi/full/10.1177/2515245919858072

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant https://journals.sagepub.com/doi/full/10.1177/0956797611417632

Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone Uri Simonsohn https://journals.sagepub.com/doi/pdf/10.1177/0956797613480366?casa_token=r3DLe47WVEcAAAAA:Ct3qoVeZQvowii2Xk4wu5TRzV0GTAzNGUeH8qPHJhb2jR9p0GEkScL1-p8JHlSlDfwfyYHGtnWyqSw

The national accounting paradox: how statistical norms corrode international economic data Daniel Mügge and Lukas Linsi https://journals.sagepub.com/doi/full/10.1177/1354066120936339

Intellectual contributions meriting authorship: Survey results from the top cited authors across all science categories Gregory S. Patience, Federico Galli, Paul A. Patience, Daria C. Boffito https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0198117

The goal of gluedown is to ease the transition from R’s powerful vectors to formatted markdown text. https://kiernann.com/gluedown/

The Nine Circles of Scientific Hell Neuroskeptic https://journals.sagepub.com/doi/10.1177/1745691612459519

The Temporal Structure of Scientific Consensus Formation Uri Shwed and Peter S. Bearman https://journals.sagepub.com/doi/10.1177/0003122410388488

Efficient estimation of generalized linear latent variable models https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0216129

The Phantom Menace: Omitted Variable Bias in Econometric Research Kevin A. Clarke https://journals.sagepub.com/doi/10.1080/07388940500339183

Predicting replicability—Analysis of survey and prediction market data from large-scale forecasting projects Michael Gordon, Domenico Viganola, Anna Dreber, Magnus Johannesson, Thomas Pfeiffer Published: April 14, 2021 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0248780

Why we publish where we do: Faculty publishing values and their relationship to review, promotion and tenure expectations Meredith T. Niles, Lesley A. Schimanski, Erin C. McKiernan, Juan Pablo Alperin Published: March 11, 2020 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0228914

Reappraising the utility of Google Flu Trends Sasikiran Kandula, Jeffrey Shaman https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007258

The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets Takaya Saito, Marc Rehmsmeier Published: March 4, 2015 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432

Statistically Controlling for Confounding Constructs Is Harder than You Think Jacob Westfall, Tal Yarkoni Published: March 31, 2016 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152719

Ten simple rules for collaboratively writing a multi-authored paper Marieke A. Frassl, David P. Hamilton, Blaize A. Denfeld, Elvira de Eyto, Stephanie E. Hampton, Philipp S. Keller, Sapna Sharma, Abigail S. L. Lewis, Gesa A. Weyhenmeyer, Catherine M. O’Reilly, Mary E. Lofton, Núria Catalán Published: November 15, 2018 https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006508

Analyzing Selection Bias for Credible Causal Inference: When in Doubt, DAG It Out https://journals.lww.com/epidem/fulltext/2019/07000/analyzing_selection_bias_for_credible_causal.8.aspx

Selective publication of antidepressant trials and its influence on apparent efficacy: Updated comparisons and meta-analyses of newer versus older trials Erick H. Turner, Andrea Cipriani, Toshi A. Furukawa, Georgia Salanti, Ymkje Anna de Vries Published: January 19, 2022 https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003886#sec018

Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time Robert M. Kaplan, Veronica L. Irvin Published: August 5, 2015 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132382

Test-Negative Designs: Differences and Commonalities with Other Case–Control Studies with “Other Patient” Controls https://journals.lww.com/epidem/Abstract/2019/11000/Test_Negative_Designs__Differences_and.10.aspx

Examining linguistic shifts between preprints and publications David N. Nicholson, Vincent Rubinetti, Dongbo Hu, Marvin Thielk, Lawrence E. Hunter, Casey S. Greene Published: February 1, 2022 https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001470

Break Down: Model Agnostic Explainers for Individual Predictions https://pbiecek.github.io/breakDown/

‘Trust Us’: Open Data and Preregistration in Political Science and International Relations https://osf.io/preprints/metaarxiv/8h2bp/

The Methodological Divide of Sociology - Evidence From Two Decades of Journal Publications https://osf.io/preprints/socarxiv/s59bp/

Shapley Residuals: Quantifying the limits of the Shapley value for explanations. https://par.nsf.gov/biblio/10187138-shapley-residuals-quantifying-limits-shapley-value-explanations

Activation Functions https://paperswithcode.com/methods/category/activation-functions

Software citation principles https://peerj.com/articles/cs-86/

Causality Redux: The Evolution of Empirical Methods in Accounting Research and the Growth of Quasi-Experiments https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3935088

Large-Scale Study of Curiosity-Driven Learning https://pathak22.github.io/large-scale-curiosity/

Bayes and big data: the consensus Monte Carlo algorithm https://orsociety.tandfonline.com/doi/full/10.1080/17509653.2016.1142191?casa_token=AaNmx-7IVb4AAAAA%3Af_Zh3iwRXbyNvI4Tz5Erf0UrxkvftTGLN2EXwtvBu5Je0ejMp3fOYbYpUT9R6vBlgbwU2hoid24#.X-9wZtaIZHA

Finance is Not Excused: Why Finance Should Not Flout Basic Principles of Statistics Forthcoming, Significance (Royal Statistical Society), 2021 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3895330

Bayesian Time Series Forecasting with Change Point and Anomaly Detection https://openreview.net/forum?id=rJLTTe-0W

How to translate a verbal theory into a formal model https://osf.io/preprints/metaarxiv/n7qsh/

Does Regression Produce Representative Estimates of Causal Effects? https://onlinelibrary.wiley.com/doi/abs/10.1111/ajps.12185

Mixed Hamiltonian Monte Carlo for Mixed Discrete and Continuous Variables https://papers.nips.cc/paper/2020/file/c6a01432c8138d46ba39957a8250e027-Paper.pdf

Specification Curve: Descriptive and Inferential Statistics on All Reasonable Specifications https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2694998

The Standard Errors of Persistence Morgan Kelly https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3398303

https://papers.labml.ai/lists

The International Political Economy Data Resource https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534067

LightGBM: A Highly Efficient Gradient Boosting Decision Tree https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf

Notes on Changing from Rmarkdown/Bookdown to Quarto https://www.njtierney.com/post/2022/04/11/rmd-to-qmd/

Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014-2017) https://osf.io/preprints/metaarxiv/9sz2y/

Teaching Safe-Stats, Not Statistical Abstinence https://nhorton.people.amherst.edu/mererenovation/17_Wickham.PDF

Quantile Regression With LightGBM https://notebook.community/ethen8181/machine-learning/ab_tests/quantile_regression/quantile_regression

Forecasting: Principles and Practice https://otexts.com/fpp3/

Comparison of Preregistration Platforms https://osf.io/preprints/metaarxiv/zry2u

https://nips.cc/Conferences

https://opensyllabus.org/

Deep Learning Yoshua Bengio

An Introduction to Statistical Learning Gareth James, Daniela Witten, Trevor Hastie

Critical Questions for Big Data Danah Boyd, Kate Crawford

The Elements of Statistical Learning Trevor Hastie

Mostly Harmless Econometrics Joshua D. Angrist

Counterfactuals and Causal Inference Stephen L. Morgan

Machine Learning: A Probabilistic Perspective Kevin P. Murphy

Causality: Models, Reasoning, and Inference Judea Pearl

Methods to Estimate Causal Effects - An Overview on IV, DiD and RDD and a Guide on How to Apply them in Practice https://osf.io/preprints/socarxiv/usvta/

WINNER’S CURSE? ON PACE, PROGRESS, AND EMPIRICAL RIGOR https://openreview.net/pdf?id=rJWF0Fywf

NLP Highlights Podcast https://open.spotify.com/show/4tGHzmicSHIVU3ksf5iYv8

A method to streamline p-hacking https://open.lnu.se/index.php/metapsychology/article/view/2529

Machine Learning Theory - Part 3: Regularization and the Bias-variance Trade-off https://mostafa-samir.github.io/ml-theory-pt3/

NeetCode https://neetcode.io/

Detecting p-Hacking https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA18583

Is Temperature Exogenous? The Impact of Civil Conflict on the Instrumental Climate Record in Sub-Saharan Africa Kenneth A. Schultz, Justin S. Mankin First published: 28 March 2019 https://onlinelibrary.wiley.com/doi/abs/10.1111/ajps.12425

Making apples from oranges: Comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets Rhian Daniel, Jingjing Zhang, Daniel Farewell First published: 14 December 2020 https://onlinelibrary.wiley.com/doi/full/10.1002/bimj.201900297

vtree a flexible R package for displaying nested subsets of a data frame https://nbarrowman.github.io/vtree.html

Election polling errors across time and space Will Jennings & Christopher Wlezien https://www.nature.com/articles/s41562-018-0315-6

Bayesian Estimation of Signal Detection Models https://mvuorre.github.io/posts/2017-10-09-bayesian-estimation-of-signal-detection-theory-models/

Add to feature engineering: The xspliner package is a collection of tools for training interpretable surrogate ML models. https://modeloriented.github.io/xspliner/index.html

Observational Studies https://muse.jhu.edu/issue/48885

Bringing more causality to analytics https://motifanalytics.medium.com/bringing-more-causality-to-analytics-d378108bb15

Nice stats note https://moultano.wordpress.com/2013/08/09/logs-tails-long-tails/

Figuring out why my object detection model is underperforming with FiftyOne, a great tool you probably haven’t heard of https://mlops.systems/redactionmodel/computervision/tools/debugging/jupyter/2022/03/12/fiftyone-computervision.html

A ModernDive into R and the Tidyverse https://moderndive.com/index.html

Hopfield Networks is All You Need https://ml-jku.github.io/hopfield-layers/

Rediscovering Bayesian Structural Time Series June 7, 2020 https://minimizeregret.com/post/2020/06/07/rediscovering-bayesian-structural-time-series/

Prophet https://www.youtube.com/watch?v=pOYAXv15r3A&feature=emb_logo

A paper is the tip of an iceberg https://minhlab.wordpress.com/2017/03/18/a-paper-is-the-tip-of-the-iceberg/

Geometric Intuition for Training Neural Networks https://sea-adl.org/2019/11/25/geometric-intuition-for-training-neural-networks/

Latent Variable Modelling in brms January 20, 2020 https://scottclaessens.github.io/blog/2020/brmsLV/

Robust Empirical Bayes Confidence Intervals https://scholar.princeton.edu/mikkelpm/ebci

Bias of OLS Estimators due to Exclusion of Relevant Variables and Inclusion of Irrelevant Variables Deepankar Basu https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1257&context=econ_workingpaper

scikit-survival https://scikit-survival.readthedocs.io/en/latest/index.html

Scikit-learn’s Defaults are Wrong https://ryxcommar.com/2019/08/30/scikit-learns-defaults-are-wrong/

Selecting on the DV Design, Inference, and the Strategic Logic of Suicide Terrorism: A Rejoinder https://scholar.princeton.edu/sites/default/files/rejoinder3.pdf

Sampling from weird probability distributions Alan R. Pearse 6 July 2019 https://rpubs.com/a_pear_9/weird_distributions

Outliers: Love’em or leave’em João Neto April 2020 https://rpubs.com/jpn3to/outliers

Synthetic controls with staggered adoption Eli Ben-Michael, Avi Feller, Jesse Rothstein https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12448

Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends https://pubs.aeaweb.org/doi/pdfplus/10.1257/aeri.20210236?cookieSet=1

(PROTOTYPE) INTRODUCTION TO NAMED TENSORS IN PYTORCH Author: Richard Zou https://pytorch.org/tutorials/intermediate/named_tensor_tutorial.html

Comparing meta-analyses and preregistered multiple-laboratory replication projects https://pubmed.ncbi.nlm.nih.gov/31873200/

Does retraction after misconduct have an impact on citations? A pre-post study https://pubmed.ncbi.nlm.nih.gov/33187964/

Sparsity information and regularization in the horseshoe and other shrinkage priors Juho Piironen, Aki Vehtari https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-11/issue-2/Sparsity-information-and-regularization-in-the-horseshoe-and-other-shrinkage/10.1214/17-EJS1337SI.full

A Word of Caution about Many Labs 4: If You Fail to Follow Your Preregistered Plan, You May Fail to Find a Real Effect https://psyarxiv.com/ejubn

Discrepancies between meta-analyses and subsequent large randomized, controlled trials https://pubmed.ncbi.nlm.nih.gov/9262498/

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective https://research.facebook.com/publications/applied-machine-learning-at-facebook-a-datacenter-infrastructure-perspective/

An overview of systematic reviews found suboptimal reporting and methodological limitations of mediation studies investigating causal mechanisms https://pubmed.ncbi.nlm.nih.gov/30904567/

That’s a lot to Process! Pitfalls of Popular Path Models Julia M. Rohrer, Paul Hünermund, Ruben C. Arslan, Malte Elson https://psyarxiv.com/paeb7/

The Matrix-F Prior for Estimating and Testing Covariance Matrices Joris Mulder, Luis Raúl Pericchi https://projecteuclid.org/journals/bayesian-analysis/volume-13/issue-4/The-Matrix-F-Prior-for-Estimating-and-Testing-Covariance-Matrices/10.1214/17-BA1092.full

Sample Size Justification Daniel Lakens https://psyarxiv.com/9d3yf

A unified view on Bayesian varying coefficient models Maria Franco-Villoria, Massimo Ventrucci, Håvard Rue https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-13/issue-2/A-unified-view-on-Bayesian-varying-coefficient-models/10.1214/19-EJS1653.full

Introduction to the concept of likelihood and its applications Alexander Etz https://psyarxiv.com/85ywt

Tapped Out or Barely Tapped? Recommendations for How to Harness the Vast and Largely Unused Potential of the Mechanical Turk Participant Pool Jonathan Robinson, Cheskie Rosenzweig, Aaron J. Moss, Leib Litman https://psyarxiv.com/jq589

Share the code, not just the data: A case study of the reproducibility of articles published in the Journal of Memory and Language under the open data policy Anna Laurinavichyute, Himanshu Yadav, Shravan Vasishth https://psyarxiv.com/hf297/

Play with Generative Adversarial Networks (GANs) in your browser! https://poloclub.github.io/ganlab/

Probabilistic Machine Learning: An Introduction https://probml.github.io/pml-book/book1.html

How Good are FiveThirtyEight Forecasts https://projects.fivethirtyeight.com/checking-our-work/

plotnine is an implementation of a grammar of graphics in Python, it is based on ggplot2 https://plotnine.readthedocs.io/en/stable/

Doubly Robust Difference-in-Differences https://psantanna.com/DRDID/

Locally Adaptive Smoothing with Markov Random Fields and Shrinkage Priors James R. Faulkner, Vladimir N. Minin https://projecteuclid.org/journals/bayesian-analysis/volume-13/issue-1/Locally-Adaptive-Smoothing-with-Markov-Random-Fields-and-Shrinkage-Priors/10.1214/17-BA1050.full

Introduction to Probability for Data Science https://probability4datascience.com/

The Design Space of Computational Notebooks: An Analysis of 60 Systems in Academia and Industry https://pg.ucsd.edu/publications/computational-notebooks-design-space_VLHCC-2020.pdf

Analyzing Minard’s Visualization Of Napoleon’s 1812 March https://thoughtbot.com/blog/analyzing-minards-visualization-of-napoleons-1812-march

A course in Time Series Analysis https://web.stat.tamu.edu/~suhasini/teaching673/time_series.pdf

When and How Should One Use Deep Learning for Causal Effect Inference https://technionmail-my.sharepoint.com/personal/urishalit_technion_ac_il/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Furishalit%5Ftechnion%5Fac%5Fil%2FDocuments%2FPresentations%2FIAS2018%2FIAS2018%5F2%5Ffor%5Fpublic%2Epdf&parent=%2Fpersonal%2Furishalit%5Ftechnion%5Fac%5Fil%2FDocuments%2FPresentations%2FIAS2018&ga=1

A Primer on Pólya-gamma Random Variables - Part II: Bayesian Logistic Regression https://tiao.io/post/polya-gamma-bayesian-logistic-regression/

treeheatr https://trang1618.github.io/treeheatr/

DiD Reading Group https://taylorjwright.github.io/did-reading-group/

Why is it that natural log changes are percentage changes? What is about logs that makes this so? https://stats.stackexchange.com/questions/244199/why-is-it-that-natural-log-changes-are-percentage-changes-what-is-about-logs-th

STATISTICAL PARADISES AND PARADOXES IN BIG DATA (I): LAW OF LARGE POPULATIONS, BIG DATA PARADOX, AND THE 2016 US PRESIDENTIAL ELECTION https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf

Prior predictive checks for Bayesian regression. https://engineeringdecisionanalysis.shinyapps.io/Priors/?_ga=2.230023708.1800474280.1612547156-513452691.1612547156

No, you can’t explain what a p-value is with one sentence (Parts I, II) https://statsepi.substack.com/p/no-you-cant-explain-what-a-p-value

Does it make sense to log-transform the dependent when using Gradient Boosted Trees? https://stats.stackexchange.com/questions/262114/does-it-make-sense-to-log-transform-the-dependent-when-using-gradient-boosted-tr/263753#263753

Why is Euclidean distance not a good metric in high dimensions? https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions

Plotting partial pooling in mixed-effects models https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/

A Monte Carlo study on methods for handling class imbalance https://static1.squarespace.com/static/58a7d1e52994ca398697a621/t/5a2833cec83025cca6b99ff8/1512584144990/manuscript.pdf

Billion-scale semantic similarity search with FAISS+SBERT https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2

The caret Package Max Kuhn 2019-03-27 https://topepo.github.io/caret/index.html

Custom Loss Functions for Gradient Boosting Optimize what matters https://towardsdatascience.com/custom-loss-functions-for-gradient-boosting-f79c1b40466d

What is the trade-off between batch size and number of iterations to train a neural network? https://stats.stackexchange.com/questions/164876/what-is-the-trade-off-between-batch-size-and-number-of-iterations-to-train-a-neu/236393#236393

Generalized Instrumental Variables https://arxiv.org/pdf/1301.0560.pdf

experimenter demand effects (EDEs)—bias that occurs when participants infer the purpose of an experiment and respond so as to help confirm a researcher’s hypothesis https://www.cambridge.org/core/journals/american-political-science-review/article/abs/demand-effects-in-survey-experiments-an-empirical-assessment/043386DC63A69098E859414EF9932EBC

An Overview of Google’s Work on AutoML and Future Directions Jun 14, 2019 https://slideslive.com/38917526/an-overview-of-googles-work-on-automl-and-future-directions?locale=en

The other kind of machine learning regression — unmeasured method performance https://stuart-reynolds.medium.com/the-other-kind-of-machine-learning-regression-unmeasured-method-performance-81b7eb00efda

The tidyverse style guide https://style.tidyverse.org/index.html

STOP CONFOUNDING YOURSELF! STOP CONFOUNDING YOURSELF! https://slatestarcodex.com/2014/04/26/stop-confounding-yourself-stop-confounding-yourself/

Causal model and theory Suparna Chaudhry and Andrew Heiss 2021-05-26 https://stats.andrewheiss.com/donors-crackdowns-aid/00_causal-model-theory.html

Stanza – A Python NLP Package for Many Human Languages https://stanfordnlp.github.io/stanza/

Inference for deterministic simulation models: The Bayesian melding approach https://sites.stat.washington.edu/raftery/Research/PDF/poole2000.pdf

Unofficial guidance on various topics by Social Science Data Editors https://social-science-data-editors.github.io/guidance/

Welcome to The Advanced Matrix Factorization Jungle https://sites.google.com/site/igorcarron2/matrixfactorizations

Exploring Enterprise Databases with R: A Tidyverse Approach https://smithjd.github.io/sql-pet/

Even with randomization, mediation analysis can still be confounded https://www.r-bloggers.com/2019/04/even-with-randomization-mediation-analysis-can-still-be-confounded/

Inference and Prediction Diverge in Biomedicine https://www.cell.com/patterns/fulltext/S2666-3899(20)30160-4

Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning https://www.cis.upenn.edu/~jean/math-deep.pdf

The conditional nature of publication bias: a meta-regression analysis Published online by Cambridge University Press: 11 May 2020 https://www.cambridge.org/core/journals/political-science-research-and-methods/article/abs/conditional-nature-of-publication-bias-a-metaregression-analysis/40C0A166F3ED1516A051C5ED270D1650

A Practical Guide to Weak Instruments Michael Keane and Timothy Neal https://uc822f03065d525621a0034d9737.dl.dropboxusercontent.com/cd/0/inline2/Bwkck59j4aGcZXzlbeX5dTlWFJ75Q4t_Aw8oEzfBlrtKDJ0UQT2snWJVT2up4b1-hAQXypGx3CI1GMrv7IzuLn3qml-_qg3e7n9WFySMEteOh8YarvEvY0co5iYeI7ah1ppzWGgI3CLOk-5aStOsOeAY9TIcEABvPqXkGLXgZd6eXOFKRBv-OlDPL3mcixiiBC2OoLXuBymI3IyQIzTE2BPwCLdFAMijckE-VG6tTEnG7yAEiJwOGXnwoFk6gB7td51Loi_1f26t_3zcBSHgpjBD_yVRhmb_R_Nt0kxdh3Nvhm0rueJcbzf1-gkqXGgZApK5Rc3JdAi7woThAAkD1hGko0HSYQT7SIdRBZZ28FpMer2sZVBkFXpY_9o-nefwJiFcbyIaiuqQVvQckMw6QWx_L4nJRL1Btd7ztnss1dJ_YA/file

Why You Should (or Shouldn’t) be Using Google’s JAX in 2022 https://www.assemblyai.com/blog/why-you-should-or-shouldnt-be-using-jax-in-2022/

A guide to working with country-year panel data and Bayesian multilevel models https://www.andrewheiss.com/blog/2021/12/01/multilevel-models-panel-data-guide/

Statistical Significance Annual Review of Statistics and Its Application https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-031219-041051

Bayesian Additive Regression Trees: A Review and Look Forward Annual Review of Statistics and Its Application https://www.annualreviews.org/doi/abs/10.1146/annurev-statistics-031219-041110

Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work https://www.amazon.ca/Bad-Data-Handbook-Cleaning-Back/dp/1449321887

Data Replication with Code Ocean – A How-To Guide for PA Authors Simon Heuberger October 2, 2019 https://www.american.edu/spa/data-science/upload/authors_how_to.pdf

Statistical Significance, p-Values, and the Reporting of Uncertainty Guido W. Imbens https://www.aeaweb.org/articles?id=10.1257/jep.35.3.157

Methods Matter: P-Hacking and Publication Bias in Causal Analysis in Economics By Abel Brodeur, Nikolai Cook, and Anthony Heyes https://www.aeaweb.org/content/file?id=12747

Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects Alberto Abadie https://www.aeaweb.org/articles?id=10.1257/jel.20191450&from=f

Star Wars: The Empirics Strike Back Abel Brodeur, Mathias Lé, Marc Sangnier, Yanos Zylberberg, American Economic Journal: Applied Economics, Vol. 8, No. 1, January 2016 https://www.aeaweb.org/articles?id=10.1257/app.20150044

Beware performative reproducibility Well-meant changes to improve science could become empty gestures unless underlying values change. https://www.nature.com/articles/d41586-021-01824-z

Plausibly Exogenous Timothy G. Conley, Christian B. Hansen, Peter E. Rossi https://direct.mit.edu/rest/article-abstract/94/1/260/57981/Plausibly-Exogenous

Deep Learning on Electronic Medical Records is doomed to fail Originally posted 2022-03-22 https://www.moderndescartes.com/essays/deep_learning_emr/

Collider bias undermines our understanding of COVID-19 disease risk and severity https://www.nature.com/articles/s41467-020-19478-2

Bayesian analysis of tests with unknown specificity and sensitivity Andrew Gelman, Bob Carpenter https://www.medrxiv.org/content/10.1101/2020.05.22.20108944v3

The One Standard Error Rule for Model Selection: Does It Work? Yuchen Chen and Yuhong Yang https://www.mdpi.com/2571-905X/4/4/51

MODELING COVARIANCE MATRICES IN TERMS OF STANDARD DEVIATIONS AND CORRELATIONS, WITH APPLICATION TO SHRINKAGE John Barnard, Robert McCulloch and Xiao-Li Meng https://www.jstor.org/stable/24306780#metadata_info_tab_contents

Regression and Other Stories, with Andrew Gelman, Jennifer Hill & Aki Vehtari podcast https://learnbayesstats.com/episode/20-regression-and-other-stories-with-andrew-gelman-jennifer-hill-aki-vehtari/

Causal Inference: What If (the book) https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/2022/10/hernanrobins_WhatIf_15sep22.pdf

Statistical Comparisons of Classifiers over Multiple Data Sets https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf

CRITICAL VALUES ROBUST TO P-HACKING https://www.pascalmichaillat.org/12.html

P-value, compatibility, and S-value Mohammad Ali Mansournia, Maryam Nazemipour, Mahyar Etminan https://www.sciencedirect.com/science/article/pii/S2590113322000153?via%3Dihub

The Dunning-Kruger effect is (mostly) a statistical artefact: Valid approaches to testing the hypothesis with individual differences data Gilles E. Gignac, Marcin Zajenkowski https://www.sciencedirect.com/science/article/abs/pii/S0160289620300271

Natural Scales in Geographical Patterns Telmo Menezes and Camille Roth https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5379183/

Making Sense of Sensitivity: Extending Omitted Variable Bias January 2018 https://www.researchgate.net/publication/322509816_Making_Sense_of_Sensitivity_Extending_Omitted_Variable_Bias

A Survey of Methods for Time Series Change Point Detection Samaneh Aminikhanghahi and Diane J. Cook https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5464762/

Negative Controls: A Tool for Detecting Confounding and Bias in Observational Studies https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3053408/

Introduction to Facebook AI Similarity Search (Faiss) https://www.pinecone.io/learn/faiss-tutorial/

Is probabilistic bias analysis approximately Bayesian? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3257063/

Everything you always wanted to know about evaluating prediction models (but were too afraid to ask) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2997853/

Should We Trust Clustered Standard Errors? A Comparison with Randomization-Based Methods Lourenço S. Paz & James E. West https://www.nber.org/papers/w25926

The Value of Statistical Life: A Meta-analysis of Meta-analyses H. Spencer Banzhaf https://www.nber.org/papers/w29185

A large-scale study on research code quality and execution Ana Trisovic, Matthew K. Lau, Thomas Pasquier & Mercè Crosas https://www.nature.com/articles/s41597-022-01143-6

SELECTION INTO IDENTIFICATION IN FIXED EFFECTS MODELS, WITH APPLICATION TO HEAD START https://www.nber.org/system/files/working_papers/w26174/w26174.pdf

Fast and effective pseudo transfer entropy for bivariate data-driven causal inference Riccardo Silini & Cristina Masoller Scientific Reports volume 11, Article number: 8423 (2021) https://www.nature.com/articles/s41598-021-87818-3

Variable selection in the presence of missing data: resampling and imputation Qi Long https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5156376/

Consensus features nested cross-validation Saeid Parvandeh, Hung-Wen Yeh, Martin P Paulus, and Brett A McKinney https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7776094/

Bayesian Approaches for Missing Not at Random Outcome Data: The Role of Identifying Restrictions Antonio R. Linero and Michael J. Daniels https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6936760/

Do Pre-Registration and Pre-analysis Plans Reduce p-Hacking and Publication Bias? Abel Brodeur, Nikolai Cook, Jonathan Hartley, Anthony Heyes https://osf.io/preprints/metaarxiv/uxf39/

Elements of Information Theory, 2nd Edition https://www.wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959

Why Data Is Never Raw https://www.thenewatlantis.com/publications/why-data-is-never-raw

When U.S. air force discovered the flaw of averages https://www.thestar.com/news/insight/2016/01/16/when-us-air-force-discovered-the-flaw-of-averages.html

A History of Scientific Journals Publishing at the Royal Society, 1665-2015 Aileen Fyfe, Noah Moxham, Julie McDougall-Waters, and Camilla Mørk Røstvik https://www.uclpress.co.uk/products/187262

Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban Ronald D. Fricker Jr., Katherine Burke, Xiaoyan Han & William H. Woodall https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1537892

Oh No! I Got the Wrong Sign! What Should I Do? Peter E. Kennedy https://www.tandfonline.com/doi/abs/10.3200/JECE.36.1.77-92

Prediction, Estimation, and Attribution Bradley Efron https://www.tandfonline.com/doi/full/10.1080/01621459.2020.1762613

The Gaussian Graphical Model in Cross-Sectional and Time-Series Data Sacha Epskamp, Lourens J. Waldorp, René Mõttus & Denny Borsboom Pages 453-480 | Published online: 16 Apr 2018 https://www.tandfonline.com/doi/full/10.1080/00273171.2018.1454823

Jackknife Standard Errors for Clustered Regression Bruce E. Hansen, University of Wisconsin, August 2022 https://www.ssc.wisc.edu/~bhansen/papers/tcauchy.pdf

A Five-Star Guide for Achieving Replicability and Reproducibility When Working with GIS Software and Algorithms https://www.tandfonline.com/doi/abs/10.1080/24694452.2020.1806026?journalCode=raag21

Some useful equations for nonlinear regression in R Andrea Onofri 2019-01-08 https://www.statforbiology.com/nonlinearregression/usefulequations

Random Forests for Spatially Dependent Data Arkajyoti Saha, Sumanta Basu & Abhirup Datta Received 02 Dec 2020 https://www.tandfonline.com/doi/abs/10.1080/01621459.2021.1950003

Social Science Reproduction Platform (SSRP) is an openly licensed platform that facilitates the sourcing, cataloging, and review of attempts to verify and improve the computational reproducibility of social science research. https://www.socialsciencereproduction.org/about

Non-Standard Errors https://orbilu.uni.lu/bitstream/10993/48686/1/SSRN-id3961574.pdf

Do growth mindset interventions impact students’ academic achievement? A systematic review and meta-analysis with recommendations for best practices. https://psycnet.apa.org/record/2023-14088-001

Bayesian inference with INLA https://becarioprecario.bitbucket.io/inla-gitbook/index.html

Measurement Models http://cfariss.com/documents/FarissKenwickReuning2019_MesurmentModels.pdf

Distinguishing cause from effect using observational data: methods and benchmarks https://arxiv.org/abs/1412.3773v3

The Effect: An Introduction to Research Design and Causality https://theeffectbook.net/

Feature Engineering and Selection: A Practical Approach for Predictive Models https://bookdown.org/max/FES/

Feature Interactions in XGBoost https://arxiv.org/abs/2007.05758

How should variable selection be performed with multiply imputed data? https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.3177

Statistical Nonsignificance in Empirical Economics https://www.aeaweb.org/articles?id=10.1257/aeri.20190252&from=f

Deep Learning https://www.deeplearningbook.org/

On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives https://arxiv.org/abs/1902.10286

Why Propensity Scores Should Not Be Used for Matching https://gking.harvard.edu/publications/why-propensity-scores-should-not-be-used-formatching

What are the most important statistical ideas of the past 50 years? https://arxiv.org/pdf/2012.00174.pdf

Automatic Differentiation Variational Inference https://www.jmlr.org/papers/volume18/16-107/16-107.pdf

Methodology over metrics: current scientific standards are a disservice to patients and society https://www.jclinepi.com/article/S0895-4356(21)00170-0/fulltext

Let’s Put Garbage-Can Regressions and Garbage-Can Probits Where They Belong https://journals.sagepub.com/doi/10.1080/07388940500339167

Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition https://bookdown.org/content/4857/

Random Walk: A Modern Introduction https://math.uchicago.edu/~lawler/srwbook.pdf

On the reliability of published findings using the regression discontinuity design in political science https://arxiv.org/abs/2109.14526

Exploring the Dynamics of Latent Variable Models https://www.cambridge.org/core/journals/political-analysis/article/abs/exploring-the-dynamics-of-latent-variable-models/CBE116F37900DAE957B2D7EB53DB0907#.X7h7GMnwHwM.twitter

Cross-validation: what does it estimate and how well does it do it? https://arxiv.org/abs/2104.00673

What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory https://journals.sagepub.com/doi/abs/10.1177/00031224211004187#:~:text=The%20estimand%20is%20the%20target,purpose%20of%20the%20statistical%20analysis.&text=By%20grounding%20all%20three%20steps,connects%20statistical%20evidence%20to%20theory

Measurement error and the replication crisis https://www.science.org/doi/10.1126/science.aal3618

Bayesian Modeling and Computation in Python https://bayesiancomputationbook.com/welcome.html

The Separation Plot: A New Visual Method for Evaluating the Fit of Binary Models https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-5907.2011.00525.x

Causal Inference for The Brave and True https://matheusfacure.github.io/python-causality-handbook/landing-page.html

A Parsimonious Tour of Bayesian Model Uncertainty https://arxiv.org/abs/1902.05539

The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf

Prediction, Estimation, and Attribution https://efron.ckirby.su.domains//papers/2019PredictEstimatAttribut.pdf

Difference-in-Differences with a Continuous Treatment https://psantanna.com/files/Callaway_Goodman-Bacon_SantAnna_2021.pdf

A Practical Introduction to Regression Discontinuity Designs: Foundations https://arxiv.org/pdf/1911.09511.pdf

The influence of decision-making in tree ring-based climate reconstructions https://www.nature.com/articles/s41467-021-23627-6

The Influence of Hidden Researcher Decisions in Applied Microeconomics https://docs.iza.org/dp13233.pdf

Introducing geofacet https://ryanhafen.com/blog/geofacet/

The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432

Cross-validation FAQ Aki Vehtari First version 2020-03-11. Last modified 2022-07-30. https://avehtari.github.io/modelselection/CV-FAQ.html

Shapley values for feature selection: The good, the bad, and the axioms Daniel Fryer, Inga Strümke, Hien Nguyen https://arxiv.org/abs/2102.10936

A Crash Course in Good and Bad Controls Carlos Cinelli, Andrew Forney, Judea Pearl https://ftp.cs.ucla.edu/pub/stat_ser/r493.pdf

Reinforcement Learning in R Nicolas Pröllochs, Stefan Feuerriegel https://arxiv.org/abs/1810.00240

Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure https://onlinelibrary.wiley.com/doi/pdf/10.1111/ecog.02881

NumPyro https://github.com/pyro-ppl/numpyro

Replacing the do-calculus with Bayes rule https://arxiv.org/pdf/1906.07125.pdf

Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results https://academic.oup.com/qje/article-abstract/134/2/557/5195544?redirectedFrom=fulltext&login=false

A Survey on Societal Event Forecasting with Deep Learning https://arxiv.org/pdf/2112.06345.pdf

Has the Credibility of the Social Sciences Been Credibly Destroyed? Reanalyzing the “Many Analysts, One Data Set” Project https://journals.sagepub.com/doi/full/10.1177/23780231211024421

Political Event Coding as Text to Text Sequence Generation https://yaoyaodai.github.io/files/CASE_2022.pdf

Bayesian Thinking for Toddlers https://psyarxiv.com/w5vbp/

The Dunning-Kruger Effect is Autocorrelation https://economicsfromthetopdown.com/2022/04/08/the-dunning-kruger-effect-is-autocorrelation/

Causal Inference and Its Applications in Online Industry https://alexdeng.github.io/causal/

Bayesian Workflow https://arxiv.org/abs/2011.01808

Papers about Causal Inference and Language https://github.com/causaltext/causal-text-papers

Achieving Statistical Significance with Control Variables and Without Transparency https://www.cambridge.org/core/journals/political-analysis/article/abs/achieving-statistical-significance-with-control-variables-and-without-transparency/1E867C357835019E0C9322B918414045

Questionable research practices among researchers in the most research-productive management programs https://onlinelibrary.wiley.com/doi/10.1002/job.2623

The problem of the missing dead Sophia Dawkins https://orcid.org/0000-0002-2609-0820

https://declaredesign.org/

Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty https://www.pnas.org/doi/full/10.1073/pnas.2203150119

How to avoid machine learning pitfalls: a guide for academic researchers https://arxiv.org/pdf/2108.02497.pdf

Multiple Imputation Through XGBoost https://arxiv.org/abs/2106.01574

TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. https://nlp.stanford.edu/projects/tacred/

Understanding lime https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html

Race to the bottom: Spatial aggregation and event data https://www.tandfonline.com/doi/abs/10.1080/03050629.2022.2025365?casa_token=wrWE–FIltIAAAAA%3AU5Dsm6FMC_1wN1GKsdbEyveqc7XKFEe2beBsBxSVjVopzFgrJdYgfQ9gvW0nL17UUSyAIR5_8Kg&journalCode=gini20

I saw your RCT and I have some worries! FAQs Macartan Humphreys https://macartan.github.io/i/notes/rct_faqs.html

Measuring the landscape of civil war: Evaluating geographic coding decisions with historic data from the Mau Mau rebellion https://journals.sagepub.com/eprint/dRCkdD4ZWSp99x8cinAV/full

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale https://arxiv.org/abs/2208.07339

Understanding Machine Learning: From Theory to Algorithms https://www.cs.huji.ac.il/w~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf

Known broken ones that didn’t get sorted

Confounder Selection: Objectives and Approaches F. Richard Guo, Anton Rask Lundborg, Qingyuan Zhao

arXiv:2011.01808 https://arxiv.org/abs/2011.01808

arXiv:2108.02497 https://arxiv.org/abs/2108.02497