Overview Literature
The Data Science Interview Book
Learning Python, R, SQL & Data Science
101 machine learning algorithms for data science with cheat sheets
Awesome Data Science
Data Science Interviews
Data Science Cheatsheet 2.0
data-science-ipython-notebooks
Awesome Machine Learning
Minimum Viable Study Plan for Machine Learning Interviews https://github.com/khangich/machine-learning-interview
Causal Inference: The Mixtape https://mixtape.scunning.com/index.html
Bayesian Workflow Andrew Gelman, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, Martin Modrák https://arxiv.org/abs/2011.01808
How to avoid machine learning pitfalls: a guide for academic researchers Michael A. Lones https://arxiv.org/abs/2108.02497
Information geometry and divergences https://franknielsen.github.io/IG/#bookIG
Statistical Rethinking: A Bayesian Course with Examples in R and Stan (& PyMC3 & brms) https://xcelab.net/rm/statistical-rethinking/ https://www.youtube.com/playlist?list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN
ML Frameworks Interoperability Cheat Sheet http://bl.ocks.org/miguelusque/raw/f44a8e729896a96d0a3e4b07b5176af4/
Regression and Other Stories, Andrew Gelman, Jennifer Hill, Aki Vehtari copy of the book https://users.aalto.fi/~ave/ROS.pdf
tidybayes: Bayesian analysis + tidy data + geoms
Graphical Data Analysis with R Antony Unwin
Data Visualization A practical introduction, Kieran Healy
Bayesian Statistics Independent readings course on Bayesian statistics with R and Stan, Andrew Heiss and Meng Ye, Fall 2022 https://bayesf22-notebook.classes.andrewheiss.com/rethinking/ https://bayesf22-notebook.classes.andrewheiss.com/bayes-rules/
A Selective Review of Negative Control Methods in Epidemiology
Backpropagation is not just the chain rule
R Markdown Cookbook Yihui Xie, Christophe Dervieux, Emily Riederer 2022-11-07 https://bookdown.org/yihui/rmarkdown-cookbook/
Understanding Machine Learning: From Theory to Algorithms https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf
Estimation
Prediction, Estimation, and Attribution
The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant
A Parsimonious Tour of Bayesian Model Uncertainty
Causal Inference for the Brave and True
https://bayesiancomputationbook.com/welcome.html
Measurement error and the replication crisis The assumption that measurement error always reduces effect sizes is false https://www.science.org/doi/10.1126/science.aal3618
https://journals.sagepub.com/doi/abs/10.1177/00031224211004187
Exploring the Dynamics of Latent Variable Models https://www.cambridge.org/core/journals/political-analysis/article/abs/exploring-the-dynamics-of-latent-variable-models/CBE116F37900DAE957B2D7EB53DB0907
https://github.com/HenrikBengtsson/matrixStats
https://github.com/facebookresearch/StarSpace
https://dennybritz.com/posts/wildml/understanding-convolutional-neural-networks-for-nlp/
What’s wrong with my time series? Model validation without a hold-out set, Alex Smolyanskaya, February 28, 2017 https://multithreaded.stitchfix.com/blog/2017/02/28/whats-wrong-with-my-time-series/
ggRandomForests: Exploring Random Forest Survival https://arxiv.org/pdf/1612.08974.pdf
https://districtdatalabs.silvrback.com/time-maps-visualizing-discrete-events-across-many-timescales
Explained Visually https://setosa.io/ev/
https://github.com/google/BIG-bench/blob/main/docs/paper/BIG-bench.pdf
Two Experiments in Peer Review: Posting Preprints and Citation Bias
Random Walk: A Modern Introduction Gregory F. Lawler and Vlada Limic
Can Transformers be Strong Treatment Effect Estimators? https://arxiv.org/pdf/2202.01336v1.pdf
Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition https://bookdown.org/content/4857/
Patches Are All You Need? https://openreview.net/forum?id=TVHS5Y4dNvM
The validate R package makes it easy to check whether data lives up to expectations you have based on domain knowledge. https://github.com/data-cleaning/validate
Let’s Put Garbage-Can Regressions and Garbage-Can Probits Where They Belong https://journals.sagepub.com/doi/10.1080/07388940500339167
autoxgboost https://github.com/ja-thomas/autoxgboost
1,500 scientists lift the lid on reproducibility https://www.nature.com/articles/533452a
Methodology over metrics: current scientific standards are a disservice to patients and society https://www.jclinepi.com/article/S0895-4356(21)00170-0/fulltext
bper: Bayesian Prediction for Ethnicity and Race https://github.com/bwilden/bper
Automatic Differentiation Variational Inference https://www.jmlr.org/papers/volume18/16-107/16-107.pdf
What are the most important statistical ideas of the past 50 years? Andrew Gelman, Aki Vehtari https://arxiv.org/pdf/2012.00174.pdf
Why Propensity Scores Should Not Be Used for Matching https://gking.harvard.edu/publications/why-propensity-scores-should-not-be-used-formatching
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/
PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R https://cran.r-project.org/web/packages/PRROC/vignettes/PRROC.pdf
On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives https://arxiv.org/abs/1902.10286
‘Trust Us’: Open Data and Preregistration in Political Science and International Relations https://osf.io/preprints/metaarxiv/8h2bp/
pals https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html
Greedy Function Approximation: A Gradient Boosting Machine https://jerryfriedman.su.domains/ftp/trebst.pdf
Natural Scales in Geographical Patterns https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5379183/
https://daattali.com/shiny/timevis-demo/
https://www.extremetech.com/computing/151980-inside-ibms-67-billion-sage-the-largest-computer-ever-built
Faux peer-reviewed journals: a threat to research integrity http://deevybee.blogspot.com/2020/12/?m=1
https://github.com/mmxgn/spacy-clausie
http://www.deeplearningbook.org
Statistical Nonsignificance in Empirical Economics https://www.aeaweb.org/articles?id=10.1257/aeri.20190252&from=f
Acquiescence Bias Inflates Estimates of Conspiratorial Beliefs and Political Misperceptions Seth J. Hill, Margaret E. Roberts, October 25, 2021 http://www.margaretroberts.net/wp-content/uploads/2021/10/hillroberts_acqbiaspoliticalbeliefs.pdf
The lesson of ivermectin: meta-analyses based on summary data alone are inherently unreliable https://www.nature.com/articles/s41591-021-01535-y
https://www.math.uzh.ch/pages/varrank/index.html
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing https://arxiv.org/pdf/2107.13586.pdf
How should variable selection be performed with multiply imputed data? https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.3177
Feature Interactions in XGBoost https://arxiv.org/abs/2007.05758
Landscape of R packages for eXplainable Artificial Intelligence by Szymon Maksymiuk, Alicja Gosiewska, Przemysław Biecek https://arxiv.org/pdf/2009.13248.pdf
Feature Engineering and Selection: A Practical Approach for Predictive Models https://bookdown.org/max/FES/
Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials https://www.ncbi.nlm.nih.gov/pmc/articles/PMC300808/
xgboost.surv https://github.com/bcjaeger/xgboost.surv
DoubleML The Python and R package DoubleML provide an implementation of the double / debiased machine learning framework of Chernozhukov et al. (2018). The Python package is built on top of scikit-learn (Pedregosa et al., 2011) and the R package on top of mlr3 and the mlr3 ecosystem (Lang et al., 2019). https://docs.doubleml.org/stable/index.html
Preplication, Replication: A Proposal to Efficiently Upgrade Journal Replication Standards Michael Colaresi https://academic.oup.com/isp/article-abstract/17/4/367/2528282?redirectedFrom=fulltext
https://deepmind.com/blog/article/using-jax-to-accelerate-our-research
https://github.com/tidyverts/fable
The Effect: An Introduction to Research Design and Causality https://theeffectbook.net/
https://github.com/dedupeio/dedupe
https://arxiv.org/abs/2205.07407 What GPT Knows About Who is Who Xiaohan Yang, Eduardo Peynetti, Vasco Meerman, Chris Tanner
An Introduction to Ontology Engineering https://people.cs.uct.ac.za/~mkeet/files/OEbook.pdf
R Packages for Item Response Theory Analysis: Descriptions and Features https://www.tandfonline.com/doi/full/10.1080/15366367.2019.1586404
Accuracy vs Explainability of Machine Learning Models [NIPS workshop poster review] https://www.inference.vc/accuracy-vs-explainability-in-machine-learning-models-nips-workshop-poster-review/
https://arxiv-sanity-lite.com/
Attitudes toward amalgamating evidence in statistics Andrew Gelman, Keith O’Rourke http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf
An overview of gradient descent optimization algorithms https://ruder.io/optimizing-gradient-descent/
https://codeocean.com/
ClustGeo: an R package for hierarchical clustering with spatial constraints https://arxiv.org/pdf/1707.03897.pdf
An Algorithmic Framework for Bias Bounties Ira Globus-Harris, Michael Kearns, Aaron Roth https://arxiv.org/abs/2201.10408
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis https://arxiv.org/pdf/1707.01780.pdf
Fast TreeSHAP: Accelerating SHAP Value Computation for Trees Jilei Yang https://arxiv.org/abs/2109.09847
Comparing interpretability and explainability for feature selection Jack Dunn, Luca Mingardi, Ying Daisy Zhuo https://arxiv.org/abs/2105.05328
Training Deep Nets with Sublinear Memory Cost Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin https://arxiv.org/abs/1604.06174
ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R https://arxiv.org/pdf/1508.04409.pdf
A Survey of Recent Abstract Summarization Techniques Diyah Puspitaningrum https://arxiv.org/abs/2105.00824
Understanding Random Forests: From Theory to Practice https://arxiv.org/pdf/1407.7502.pdf
Performance Metrics (Error Measures) in Machine Learning Regression, Forecasting and Prognostics: Properties and Typology https://arxiv.org/pdf/1809.03006.pdf
Spike-and-Slab Meets LASSO: A Review of the Spike-and-Slab LASSO Ray Bai, Veronika Rockova, Edward I. George https://arxiv.org/abs/2010.06451
Representation Tradeoffs for Hyperbolic Embeddings Christopher De Sa, Albert Gu, Christopher Ré, Frederic Sala https://arxiv.org/pdf/1804.03329.pdf
Ratios: A short guide to confidence limits and proper use V.H. Franz, October 2007 https://arxiv.org/pdf/0710.2024.pdf
The Endogeneity of Historical Data Adam Slez, August 28, 2020 https://broadstreet.blog/2020/08/28/the-endogeneity-of-historical-data/
A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0251194
Post model-fitting exploration via a “Next-Door” analysis Leying Guan and Robert Tibshirani https://tibshirani.su.domains/ftp/nextDoor.pdf
Understanding BERT Transformer: Attention isn’t all you need A parsing/composition framework for understanding Transformers https://medium.com/synapse-dev/understanding-bert-transformer-attention-isnt-all-you-need-5839ebd396db
Einstein VI: General and Integrated Stein Variational Inference in NumPyro Ahmad Salim Al-Sibahi, Ola Rønning, Christophe Ley, Thomas Wim Hamelryck https://openreview.net/forum?id=nXSDybDWV3
Dream Investigation Results Official Report by the Minecraft Speedrunning Team https://mcspeedrun.com/dream.pdf
Improving Parameter Estimation of Epidemic Models: Likelihood Functions and Kalman Filtering, posted 8 Aug 2022 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4165188
Do Name-Based Treatments Violate Information Equivalence? Evidence from a Correspondence Audit Experiment Published online by Cambridge University Press: 09 March 2021 https://www.cambridge.org/core/journals/political-analysis/article/abs/do-namebased-treatments-violate-information-equivalence-evidence-from-a-correspondence-audit-experiment/56C6846518DDADE6EAF92DAE11552BDF
How Much Should We Trust Staggered Difference-In-Differences Estimates? European Corporate Governance Institute – Finance Working Paper No. 736/2021 Rock Center for Corporate Governance at Stanford University Working Paper No. 246 Journal of Financial Economics (JFE), Forthcoming https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3794018
Building useful models for industry—some tips Jim Savage January 2017 https://khakieconomics.github.io/2017/01/01/Building-useful-models-for-industry.html
An Introduction to Proximal Causal Learning https://arxiv.org/pdf/2009.10982.pdf
First Things First: Assessing Data Quality before Model Quality Anita Gohdes and Megan Price https://journals.sagepub.com/doi/full/10.1177/0022002712459708
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans https://www.nature.com/articles/s42256-021-00307-0
Why and How We Should Join the Shift From Significance Testing to Estimation https://www.preprints.org/manuscript/202112.0235/v1
How to make replication the norm https://www.nature.com/articles/d41586-018-02108-9
Applied Bayesian Statistics Using Stan and R https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/applied-bayesian-statistics/
https://seeing-theory.brown.edu/index.html
https://www.brodrigues.co/
FINDING ECONOMIC ARTICLES WITH DATA AND SPECIFIC EMPIRICAL METHODS http://skranz.github.io//r/2021/01/05/FindingEconomicArticles4.html
Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2849145
Machine vision on historical maps https://weinman.cs.grinnell.edu/research/maps.shtml
Enhancing Validity in Observational Settings When Replication Is Not Possible https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2543525
1.1 Billion Taxi Rides with SQLite, Parquet & HDFS https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-parquet-hdfs.html
Understanding the Bias-Variance Tradeoff http://scott.fortmann-roe.com/docs/BiasVariance.html
Is the LKJ(1) prior uniform? “Yes” http://srmart.in/is-the-lkj1-prior-uniform-yes/
Informative priors for correlation matrices: An easy approach http://srmart.in/informative-priors-for-correlation-matrices-an-easy-approach/
A Tutorial on Spectral Clustering https://arxiv.org/pdf/0711.0189v1.pdf
Automated Geocoding of Textual Documents: A Survey of Current Approaches https://onlinelibrary.wiley.com/doi/full/10.1111/tgis.12212
Sparklyr https://spark.rstudio.com/
The AAA Tranche of Subprime Science Andrew Gelman and Eric Loken http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics10.pdf
Never trust rownames of a dataframe Ankur Gupta, June 16th, 2015 https://www.perfectlyrandom.org/2015/06/16/never-trust-the-row-names-of-a-dataframe-in-R/
GRAPH ALGORITHMS http://www.martinbroadhurst.com/tag/igraph
Groundhog: Addressing The Threat That R Poses To Reproducible Research http://datacolada.org/95
CS231n Convolutional Neural Networks for Visual Recognition https://cs231n.github.io/neural-networks-3/
Implementing Variational Autoencoders in Keras: Beyond the Quickstart Tutorial http://louistiao.me/posts/implementing-variational-autoencoders-in-keras-beyond-the-quickstart-tutorial/
Hypothesis Testing in Econometrics http://home.uchicago.edu/amshaikh/webfiles/testingreview.pdf
“Why Should I Trust You?” Explaining the Predictions of Any Classifier https://arxiv.org/pdf/1602.04938v3.pdf
Yes, but Did It Work?: Evaluating Variational Inference http://www.stat.columbia.edu/~gelman/research/published/Evaluating_Variational_Inference.pdf https://statmodeling.stat.columbia.edu/2018/06/27/yes-work-evaluating-variational-inference/
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets https://arxiv.org/abs/2103.12028
One Instrument to Rule Them All: The Bias and Coverage of Just-ID IV Joshua Angrist, Michal Kolesár https://arxiv.org/abs/2110.10556
Underspecification Presents Challenges for Credibility in Modern Machine Learning https://arxiv.org/abs/2011.03395
A Survey of Predictive Modelling under Imbalanced Distributions https://arxiv.org/pdf/1505.01658.pdf
Varying Slopes Models and the CholeskyLKJ distribution in TensorFlow Probability https://adamhaber.github.io/post/varying-slopes/
Shapley Decomposition of R-Squared in Machine Learning Models https://arxiv.org/pdf/1908.09718.pdf
Understanding Global Feature Contributions With Additive Importance Measures Ian Covert, Scott Lundberg, Su-In Lee https://arxiv.org/abs/2004.00668
True to the Model or True to the Data? https://arxiv.org/abs/2006.16234
When to Impute? Imputation before and during cross-validation Byron C. Jaeger, Nicholas J. Tierney, Noah R. Simon https://arxiv.org/pdf/2010.00718.pdf
A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications Hongyun Cai, Vincent W. Zheng, Kevin Chen-Chuan Chang https://arxiv.org/abs/1709.07604
Comparing methods addressing multi-collinearity when developing prediction models https://arxiv.org/abs/2101.01603
Nonparametric causal effects based on incremental propensity score interventions https://arxiv.org/abs/1704.00211
Deep learning generalizes because the parameter-function map is biased towards simple functions Guillermo Valle-Pérez, Chico Q. Camargo, Ard A. Louis https://arxiv.org/abs/1805.08522
Bayesian Item Response Modeling in R with brms and Stan https://arxiv.org/pdf/1905.09501.pdf
Bayesian Inference for a Covariance Matrix https://arxiv.org/pdf/1408.4050.pdf
Cross-validation Confidence Intervals for Test Error Pierre Bayle, Alexandre Bayle, Lucas Janson, Lester Mackey https://arxiv.org/abs/2007.12671
Comparing Published Scientific Journal Articles to Their Pre-print Versions https://arxiv.org/pdf/1604.05363.pdf
End-to-End Weak Supervision Salva Rühling Cachay, Benedikt Boecking, Artur Dubrawski https://arxiv.org/abs/2107.02233
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests https://arxiv.org/pdf/1510.04342.pdf
Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift https://arxiv.org/pdf/1801.05134.pdf
A review on outlier/anomaly detection in time series data https://arxiv.org/abs/2002.04236
Entropic Out-of-Distribution Detection: Seamless Detection of Unknown Examples David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano L. I. Oliveira, Teresa Ludermir https://arxiv.org/abs/2006.04005
An Exploratory Characterization of Bugs in COVID-19 Software Projects Akond Rahman, Effat Farhana https://arxiv.org/abs/2006.00586
Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting Anders Oland, Aayush Bansal, Roger B. Dannenberg, Bhiksha Raj https://arxiv.org/abs/1707.04199
Introducing Stan2tfp - a lightweight interface for the Stan-to-TensorFlow Probability compiler May 21, 2020 https://adamhaber.github.io/post/stan2tfp-post1/
L2 Regularization versus Batch and Weight Normalization Twan van Laarhoven https://arxiv.org/abs/1706.05350
Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis David G. Clark, Jesse A. Livezey, Kristofer E. Bouchard https://arxiv.org/abs/1905.09944
Monte Carlo Gradient Estimation in Machine Learning Shakir Mohamed, Mihaela Rosca, Michael Figurnov, Andriy Mnih https://arxiv.org/abs/1906.10652
Large-scale linear regression: Development of high-performance routines Alvaro Frank, Diego Fabregat-Traver, Paolo Bientinesi https://arxiv.org/abs/1504.07890
The Kernel Interaction Trick: Fast Bayesian Discovery of Pairwise Interactions in High Dimensions Raj Agrawal, Jonathan H. Huggins, Brian Trippe, Tamara Broderick https://arxiv.org/abs/1905.06501
TensorFlow Distributions Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, Rif A. Saurous https://arxiv.org/abs/1711.10604
Asymptotically Exact, Embarrassingly Parallel MCMC Willie Neiswanger, Chong Wang, Eric Xing https://arxiv.org/abs/1311.4780
Python for Data Science https://aeturrell.github.io/python4DS/welcome.html
Using the flextable R package https://ardata-fr.github.io/flextable-book/
Coding for Economists https://aeturrell.github.io/coding-for-economists/intro.html
When Should You Adjust Standard Errors for Clustering? Alberto Abadie, Susan Athey, Guido W Imbens, Jeffrey M Wooldridge https://academic.oup.com/qje/advance-article-abstract/doi/10.1093/qje/qjac038/6750017
Awesome Deep Learning for Natural Language Processing (NLP) https://github.com/brianspiering/awesome-dl4nlp
R for applied epidemiology and public health https://epirhandbook.com/en/index.html
COVID 19: Reduced forms have gone viral, but what do they tell us? https://drive.google.com/file/d/1ERjcGXD2jvfDFXdI0_NtF4X95UeQ5f4W/view
Reproducibility in Cancer Biology: Challenges for assessing replicability in preclinical cancer biology https://elifesciences.org/articles/67995
Taking Uncertainty Seriously: Bayesian Marginal Structural Models for Causal Inference in Political Science https://github.com/ajnafa/Latent-Bayesian-MSM
Generalized Linear Models https://data.princeton.edu/wws509/notes/c7s4
genieclust: Fast and Robust Hierarchical Clustering with Noise Point Detection https://genieclust.gagolewski.com/
Awesome Graph Classification https://github.com/benedekrozemberczki/awesome-graph-classification
parallelDist https://github.com/alexeckert/parallelDist
Interpretable Machine Learning A Guide for Making Black Box Models Explainable Christoph Molnar https://christophm.github.io/interpretable-ml-book/
The Inverse CDF Method https://dk81.github.io/dkmathstats_site/prob-inverse-cdf.html
HamiltonianMC https://chi-feng.github.io/mcmc-demo/app.html#HamiltonianMC,banana
End-to-End Balancing for Causal Continuous Treatment-Effect Estimation https://assets.amazon.science/5b/71/fa078e6f4f97a76a2a622c767dd5/end-to-end-balancing-for-causal-continuous-treatment-effect-estimation.pdf
A tour of probabilistic programming language APIs What does it look like to do MCMC in different frameworks? https://colcarroll.github.io/ppl-api/
Probabilistic Programming & Bayesian Methods for Hackers https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
Beyond Multiple Linear Regression Applied Generalized Linear Models and Multilevel Models in R https://bookdown.org/roback/bookdown-bysh/
Maybe a section on hyperparameters?
Does batch size matter? https://blog.janestreet.com/does-batch-size-matter/
The Much Quieter Revolution of Synthetic Control: Episode I https://causalinf.substack.com/p/the-much-quieter-revolution-of-synthetic
User-friendly introduction to PAC-Bayes bounds https://arxiv.org/pdf/2110.11216.pdf
Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results https://journals.sagepub.com/doi/full/10.1177/2515245917747646
The RecordLinkage Package: Detecting Errors in Data https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf
https://grow.google/certificates/interview-warmup/
The inverse-transform method for generating random variables in R https://heds.nz/posts/inverse-transform/
The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology https://journals.sagepub.com/doi/10.1177/1948550616673876
Evolution of Reporting P Values in the Biomedical Literature, 1990-2015 https://jamanetwork.com/journals/jama/fullarticle/2503172
SHAP (SHapley Additive exPlanations) https://github.com/slundberg/shap
The h-index is no longer an effective correlate of scientific reputation https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0253397
Prior Choice Recommendations https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations#prior-for-a-covariance-matrix
Institute for Replication (I4R) https://i4replication.org/index.html
How Life Sciences Actually Work: Findings of a Year-Long Investigation https://guzey.com/how-life-sciences-actually-work/
Efficient Neural Causal Discovery without Acyclicity Constraints https://github.com/phlippe/ENCO
awesome-text-summarization https://github.com/mathsyouth/awesome-text-summarization
(Ir)Reproducible Machine Learning: A Case Study https://reproducible.cs.princeton.edu/irreproducibility-paper.pdf
THE MYTH OF THE EXPERT REVIEWER https://parameterfree.com/2021/07/06/the-myth-of-the-expert-reviewer/
Understanding and Choosing the Right Probability Distributions https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781119197096.app03
Spatial Interdependence and Instrumental Variable Models https://osf.io/preprints/socarxiv/pgrcu/
The case for formal methodology in scientific reform https://royalsocietypublishing.org/doi/10.1098/rsos.200805
Using Difference-in-Differences to Identify Causal Effects of COVID-19 Policies https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3603970
Pandas Comparison with R / R libraries https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html
Non-Standard Errors https://orbilu.uni.lu/bitstream/10993/48686/1/SSRN-id3961574.pdf
Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors https://pubmed.ncbi.nlm.nih.gov/26186114/
The (lack of) impact of retraction on citation networks Charisse R Madlock-Brown, David Eichmann https://pubmed.ncbi.nlm.nih.gov/24668038/
The puzzling relationship between multi-lab replications and meta-analyses of the rest of the literature https://psyarxiv.com/pbrdk/
Bayesian Estimation of Correlation Matrices of Longitudinal Data Riddhi Pratim Ghosh, Bani Mallick, Mohsen Pourahmadi https://projecteuclid.org/journals/bayesian-analysis/volume-16/issue-3/Bayesian-Estimation-of-Correlation-Matrices-of-Longitudinal-Data/10.1214/20-BA1237.full
Operationalizing the Replication Standard: A Case Study of the Data Curation and Verification Workflow for Scholarly Journals https://osf.io/preprints/socarxiv/cfdba/
How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It https://osf.io/preprints/socarxiv/453jk/
Lost in Aggregation: Improving Event Analysis with Report-Level Data Scott J. Cook, Nils B. Weidmann https://onlinelibrary.wiley.com/doi/full/10.1111/ajps.12398
Frequentist versus Bayesian approaches to multiple testing Arvid Sjölander & Stijn Vansteelandt https://link.springer.com/article/10.1007/s10654-019-00517-2
Research note: Examining potential bias in large-scale censored data https://misinforeview.hks.harvard.edu/article/research-note-examining-potential-bias-in-large-scale-censored-data/
When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai, In Song Kim https://onlinelibrary.wiley.com/doi/abs/10.1111/ajps.12417
Runtime warnings and convergence problems Stan Development Team https://mc-stan.org/misc/warnings.html
Dirichlet Process Gaussian mixture model via the stick-breaking construction in various PPLs https://luiarthur.github.io/TuringBnpBenchmarks/dpsbgmm
xgboost: “Hi I’m Gamma. What can I do for you?” — and the tuning of regularization https://medium.com/data-design/xgboost-hi-im-gamma-what-can-i-do-for-you-and-the-tuning-of-regularization-a42ea17e6ab6
PostGIS In Action https://livebook.manning.com/book/postgis-in-action-second-edition/about-this-book/
Stan User’s Guide https://mc-stan.org/docs/stan-users-guide/index.html
Smoothing Terms in GAM Models https://maths-people.anu.edu.au/~johnm/r-book/xtras/autosmooth.pdf
Designing a Deep Learning Project https://medium.com/@erogol/designing-a-deep-learning-project-9b3698aef127
PyTorch With Baby Steps: From y=x To Training A Convnet https://lelon.io/blog/pytorch-baby-steps
Bayesian inference with Stan: A tutorial on adding custom distributions Jeffrey Annis, Brent J. Miller & Thomas J. Palmeri https://link.springer.com/article/10.3758/s13428-016-0746-9
Bayes Rules! An Introduction to Applied Bayesian Modeling https://www.bayesrulesbook.com/
Graduate Qualitative Methods Training in Political Science: A Disciplinary Crisis Published online by Cambridge University Press: 21 November 2019 https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/graduate-qualitative-methods-training-in-political-science-a-disciplinary-crisis/7B0EEB76E1CC234AFED7EED8DA71BA35
Time Series Analysis by State Space Methods (Oxford Statistical Science Series) https://www.amazon.com/dp/0198523548
Hyperparameters and tuning strategies for random forest https://wires.onlinelibrary.wiley.com/doi/full/10.1002/widm.1301
Your Cross Validation Error Confidence Intervals are Wrong — here’s how to Fix Them https://towardsdatascience.com/your-cross-validation-error-confidence-intervals-are-wrong-heres-how-to-fix-them-abbfe28d390
Probabilistic Programming with Variational Inference: Under the Hood https://willcrichton.net/notes/probabilistic-programming-under-the-hood/
How to Measure Statistical Causality: A Transfer Entropy Approach with Financial Applications https://towardsdatascience.com/causality-931372313a1c
Kullback-Leibler Divergence Explained https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
Problems and Opportunities in Training Deep Learning Software Systems: An Analysis of Variance https://www.cs.purdue.edu/homes/lintan/publications/variance-ase20.pdf
How (Not) to Reproduce: Practical Considerations to Improve Research Transparency in Political Science https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/abs/how-not-to-reproduce-practical-considerations-to-improve-research-transparency-in-political-science/32E7CF5D975C081BA666D3BD475D7913
Quantifying Bias from Measurable and Unmeasurable Confounders Across Three Domains of Individual Determinants of Political Preferences Published online by Cambridge University Press: 22 February 2022 https://www.cambridge.org/core/journals/political-analysis/article/quantifying-bias-from-measurable-and-unmeasurable-confounders-across-three-domains-of-individual-determinants-of-political-preferences/D1D2DEE9E7180BDCFC592885BE66E9AF
5 Levels of Difficulty — Bayesian Gaussian Random Walk with PyMC3 and Theano https://towardsdatascience.com/5-levels-of-difficulty-bayesian-gaussian-random-walk-with-pymc3-and-theano-34343911c7d2
Single-Parameter Models | Pyro vs. STAN https://towardsdatascience.com/single-parameter-models-pyro-vs-stan-e7e69b45d95c
Partial Identification in Econometrics Elie Tamer https://scholar.harvard.edu/files/tamer/files/pie.pdf
LightGBM for Quantile Regression Understand Quantile Regression https://towardsdatascience.com/lightgbm-for-quantile-regression-4288d0bb23fd
Assessing the Impact of Non-Random Measurement Error on Inference: A Sensitivity Analysis Approach https://strathprints.strath.ac.uk/59463/1/Gallop_Weschle_PSRM_2016_Assessing_the_impact_of_non_random_measurement_error_on_inference.pdf
yardstick is a package to estimate how well models are working using tidy data principles. See the package webpage for more information. https://yardstick.tidymodels.org/index.html
The Three Faces of Bayes https://slackprop.wordpress.com/2016/08/28/the-three-faces-of-bayes/
Evaluating Random Forests for Survival Analysis using Prediction Error Curves https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4194196/
The role of metadata in reproducible computational research https://www.sciencedirect.com/science/article/pii/S2666389921001707
Ecological Inference in the Social Sciences https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2885825/
Two Wrongs Make a Right: Addressing Underreporting in Binary Data from Multiple Sources https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5667662/
On the low reproducibility of cancer studies https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6599599/
Quarto with Python https://www.meyerperin.com/using-quarto/
Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015 https://www.nature.com/articles/s41562-018-0399-z
Bayesian analysis of tests with unknown specificity and sensitivity Andrew Gelman and Bob Carpenter https://www.medrxiv.org/content/10.1101/2020.05.22.20108944v3.full.pdf
Notes on the Negative Binomial Distribution https://www.johndcook.com/negative_binomial.pdf
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Automatic Differentiation Variational Inference https://www.jmlr.org/papers/volume18/16-107/16-107.pdf
IZA DP No. 13233: The Influence of Hidden Researcher Decisions in Applied Microeconomics https://www.iza.org/publications/dp/13233/the-influence-of-hidden-researcher-decisions-in-applied-microeconomics
cdfquantreg: An R Package for CDF-Quantile Regression https://www.jstatsoft.org/article/view/v088i01
https://techdevguide.withgoogle.com/
What the F-measure doesn’t measure: Features, Flaws, Fallacies and Fixes David M. W. Powers https://arxiv.org/abs/1503.06410
When LOO and other cross-validation approaches are valid https://statmodeling.stat.columbia.edu/2018/08/03/loo-cross-validation-approaches-valid/
Hamiltonian Monte Carlo explained http://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html
Controlling for Unobserved Confounds in Classification Using Correlational Constraints Virgile Landeiro, Aron Culotta https://arxiv.org/abs/1703.01671
The Persistence of Underpowered Studies in Psychological Research: Causes, Consequences, and Remedies Scott E. Maxwell https://statmodeling.stat.columbia.edu/wp-content/uploads/2017/07/maxwell2004.pdf
You need 16 times the sample size to estimate an interaction than to estimate a main effect https://statmodeling.stat.columbia.edu/2018/03/15/need-16-times-sample-size-estimate-interaction-estimate-main-effect/
Machine Learning of Sets http://akosiorek.github.io/ml/2020/08/12/machine_learning_of_sets.html
Weak Supervision: A New Programming Paradigm for Machine Learning http://ai.stanford.edu/blog/weak-supervision/
The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research https://peerj.com/preprints/2921/
Advanced Natural Language Processing with TensorFlow 2: Build effective real-world NLP applications using NER, RNNs, seq2seq models, Transformers, and more https://www.amazon.com/Advanced-Natural-Language-Processing-TensorFlow/dp/1800200935?encoding=UTF8&qid=&sr=&linkCode=sl1&tag=kirkdborne-20&linkId=4448e1a0cd126f52a2aba844c4bdb78e&language=en_US&ref=as_li_ss_tl
3 reasons why you can’t always use predictive performance to choose among models https://statmodeling.stat.columbia.edu/2015/10/23/26857/
Using Heteroscedasticity to Identify and Estimate Mismeasured and Endogenous Regressor Models Arthur Lewbel https://www.tandfonline.com/doi/full/10.1080/07350015.2012.643126
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Sergey Ioffe, Christian Szegedy https://arxiv.org/abs/1502.03167
Gradient Boosting explained [demonstration] http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
Clustered standard errors vs. multilevel modeling https://statmodeling.stat.columbia.edu/2007/11/28/clustered_stand/
Advanced R https://adv-r.hadley.nz/index.html
Regression to the mean continues to confuse people and lead to errors in published research https://statmodeling.stat.columbia.edu/2018/06/24/regression-mean-continues-confuse-people-lead-errors-published-research/
The statistical significance filter leads to overoptimistic expectations of replicability https://statmodeling.stat.columbia.edu/2018/05/22/statistical-significance-filter-leads-overoptimistic-expectations-replicability/
How to cross-validate PCA, clustering, and matrix decomposition models http://alexhwilliams.info/itsneuronalblog/2018/02/26/crossval/?mlreview
Inference in Experiments Conditional on Observed Imbalances in Covariates Per Johansson & Mattias Nordin https://www.tandfonline.com/doi/full/10.1080/00031305.2022.2054859
Scientific progress despite irreproducibility: A seeming paradox Richard M. Shiffrin, Katy Borner, Stephen M. Stigler https://arxiv.org/abs/1710.01946
On Statistical Non-Significance Alberto Abadie https://arxiv.org/abs/1803.00609
On the number of signals in multivariate time series Markus Matilainen, Klaus Nordhausen, Joni Virta https://arxiv.org/abs/1801.04925
Data Science vs. Statistics: Two Cultures? Iain Carmichael, J.S. Marron https://arxiv.org/abs/1801.00371
The exploding gradient problem demystified - definition, prevalence, impact, origin, tradeoffs, and solutions George Philipp, Dawn Song, Jaime G. Carbonell https://arxiv.org/abs/1712.05577
Theory of Deep Learning III: explaining the non-overfitting puzzle Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, Hrushikesh Mhaskar https://arxiv.org/abs/1801.00173
On overfitting and post-selection uncertainty assessments Liang Hong, Todd A. Kuffner, Ryan Martin https://arxiv.org/abs/1712.02379
A Theory of Statistical Inference for Ensuring the Robustness of Scientific Results Beau Coker, Cynthia Rudin, Gary King https://arxiv.org/abs/1804.08646
Labelling as an unsupervised learning problem Terry Lyons, Imanol Perez Arribas https://arxiv.org/abs/1805.03911
Structural Breaks in Time Series Alessandro Casini, Pierre Perron https://arxiv.org/abs/1805.03807
On consistency and inconsistency of nonparametric tests Mikhail Ermakov https://arxiv.org/abs/1807.09076
A New Angle on L2 Regularization Thomas Tanay, Lewis D Griffin https://arxiv.org/abs/1806.11186
On the Robustness of Interpretability Methods David Alvarez-Melis, Tommi S. Jaakkola https://arxiv.org/abs/1806.08049
Identifying Causal Effects with the R Package causaleffect Santtu Tikka, Juha Karvanen https://arxiv.org/abs/1806.07161
How Does Batch Normalization Help Optimization? Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry https://arxiv.org/abs/1805.11604
The effect of the choice of neural network depth and breadth on the size of its hypothesis space Lech Szymanski, Brendan McCane, Michael Albert https://arxiv.org/abs/1806.02460
Is preprocessing of text really worth your time for online comment classification? Fahim Mohammad https://arxiv.org/abs/1806.02908
Geometric Understanding of Deep Learning Na Lei, Zhongxuan Luo, Shing-Tung Yau, David Xianfeng Gu https://arxiv.org/abs/1805.10451
Cross validation residuals for generalised least squares and other correlated data models Ingrid Annette Baade https://arxiv.org/abs/1809.01319
Out-of-Distribution Detection Using an Ensemble of Self Supervised Leave-out Classifiers Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar Das, Bharat Kaul, Theodore L. Willke https://arxiv.org/abs/1809.03576
Handling Imbalanced Dataset in Multi-label Text Categorization using Bagging and Adaptive Boosting Genta Indra Winata, Masayu Leylia Khodra https://arxiv.org/abs/1810.11612
On the Art and Science of Machine Learning Explanations Patrick Hall https://arxiv.org/abs/1810.02909
Causal inference under over-simplified longitudinal causal models Lola Etievant, Vivian Viallon https://arxiv.org/abs/1810.01294
Revisiting the Gelman-Rubin Diagnostic Dootika Vats, Christina Knudson https://arxiv.org/abs/1812.09384
A Survey on Data Collection for Machine Learning: a Big Data – AI Integration Perspective Yuji Roh, Geon Heo, Steven Euijong Whang
A Fundamental Measure of Treatment Effect Heterogeneity Jonathan Levy, Mark van der Laan, Alan Hubbard, Romain Pirracchio https://arxiv.org/abs/1811.03745
Causal Discovery Toolbox: Uncover causal relationships in Python Diviyan Kalainathan, Olivier Goudet https://arxiv.org/abs/1903.02278
Dying ReLU and Initialization: Theory and Numerical Examples Lu Lu, Yeonjong Shin, Yanhui Su, George Em Karniadakis https://arxiv.org/abs/1903.06733
ROC and AUC with a Binary Predictor: a Potentially Misleading Metric John Muschelli https://arxiv.org/abs/1903.04881
Gamification in Science: A Study of Requirements in the Context of Reproducible Research Sebastian S. Feger, Sünje Dallmeier-Tiessen, Paweł W. Woźniak, Albrecht Schmidt https://arxiv.org/abs/1903.02446
Matrix factorization for multivariate time series analysis Pierre Alquier, Nicolas Marie https://arxiv.org/abs/1903.05589
On the complexity of logistic regression models Nicola Bulso, Matteo Marsili, Yasser Roudi https://arxiv.org/abs/1903.00386
On Heavy-user Bias in A/B Testing Yu Wang, Somit Gupta, Jiannan Lu, Ali Mahmoudzadeh, Sophia Liu https://arxiv.org/abs/1902.02021
DeepMoD: Deep learning for Model Discovery in noisy data Gert-Jan Both, Subham Choudhury, Pierre Sens, Remy Kusters
Learning Causality: Synthesis of Large-Scale Causal Networks from High-Dimensional Time Series Data Mark-Oliver Stehr, Peter Avar, Andrew R. Korte, Lida Parvin, Ziad J. Sahab, Deborah I. Bunin, Merrill Knapp, Denise Nishita, Andrew Poggio, Carolyn L. Talcott, Brian M. Davis, Christine A. Morton, Christopher J. Sevinsky, Maria I. Zavodszky, Akos Vertes https://arxiv.org/abs/1905.02291
Text Classification Algorithms: A Survey Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura E. Barnes, Donald E. Brown https://arxiv.org/abs/1904.08067
The Information Complexity of Learning Tasks, their Structure and their Distance Alessandro Achille, Giovanni Paolini, Glen Mbeng, Stefano Soatto https://arxiv.org/abs/1904.03292
Recent Advances in Natural Language Inference: A Survey of Benchmarks, Resources, and Approaches Shane Storks, Qiaozi Gao, Joyce Y. Chai https://arxiv.org/abs/1904.01172
Evaluating A Key Instrumental Variable Assumption Using Randomization Tests Zach Branson, Luke Keele https://arxiv.org/abs/1907.01943
Model selection for high-dimensional linear regression with dependent observations Ching-Kang Ing https://arxiv.org/abs/1906.07395
Doubts on the efficacy of outliers correction methods Marjorie Fonnesu, Nicola Kuczewski
The Design of Global Correlation Quantifiers and Continuous Notions of Statistical Sufficiency Nicholas Carrara, Kevin Vanslette
An Econometric Perspective on Algorithmic Subsampling Sokbae Lee, Serena Ng https://arxiv.org/abs/1907.01954
Factor Analysis for High-Dimensional Time Series with Change Point Xialu Liu, Ting Zhang https://arxiv.org/abs/1907.09522
Causal Regularization Dominik Janzing https://arxiv.org/abs/1906.12179
The exact form of the ‘Ockham factor’ in model selection Jonathan Rougier, Carey Priebe https://arxiv.org/abs/1906.11592
Measuring Average Treatment Effect from Heavy-tailed Data Jason (Xiao) Wang, Pauline Burke https://arxiv.org/abs/1905.09252
The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial Benyamin Ghojogh, Mark Crowley https://arxiv.org/abs/1905.12787
Statistical methods research done as science rather than mathematics James S. Hodges https://arxiv.org/abs/1905.08381
Regression Analysis of Unmeasured Confounding Brian Knaeble, Braxton Osting, Mark Abramson
Dyadic Regression Bryan S. Graham https://arxiv.org/abs/1908.09029
Illusion of Causality in Visualized Data Cindy Xiong, Joel Shapiro, Jessica Hullman, Steven Franconeri https://arxiv.org/abs/1908.00215
“multiColl”: An R package to detect multicollinearity Román Salmerón, Catalina García, José García https://arxiv.org/abs/1910.14590
All of Linear Regression Arun K. Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai https://arxiv.org/abs/1910.06386
What is the Value of Data? On Mathematical Methods for Data Quality Estimation Netanel Raviv, Siddharth Jain, Jehoshua Bruck https://arxiv.org/abs/2001.03464
Imputation for High-Dimensional Linear Regression Kabir Aladin Chandrasekher, Ahmed El Alaoui, Andrea Montanari https://arxiv.org/abs/2001.09180
On Model Evaluation under Non-constant Class Imbalance Jan Brabec, Tomáš Komárek, Vojtěch Franc, Lukáš Machlica https://arxiv.org/abs/2001.05571
Identifying Mislabeled Data using the Area Under the Margin Ranking Geoff Pleiss, Tianyi Zhang, Ethan R. Elenberg, Kilian Q. Weinberger https://arxiv.org/abs/2001.10528
Expanding the scope of statistical computing: Training statisticians to be software engineers Alex Reinhart, Christopher R. Genovese https://arxiv.org/abs/1912.13076
Learning under Model Misspecification: Applications to Variational and Ensemble methods Andres R. Masegosa https://arxiv.org/abs/1912.08335
Explaining the Explainer: A First Theoretical Analysis of LIME Damien Garreau, Ulrike von Luxburg https://arxiv.org/abs/2001.03447
Algorithms for Heavy-Tailed Statistics: Regression, Covariance Estimation, and Beyond Yeshwanth Cherapanamjeri, Samuel B. Hopkins, Tarun Kathuria, Prasad Raghavendra, Nilesh Tripuraneni https://arxiv.org/abs/1912.11071
Markov Chain Monte Carlo Methods, a survey with some frequent misunderstandings Christian P. Robert (U Paris Dauphine and U Warwick), Wu Changye (U Paris Dauphine) https://arxiv.org/abs/2001.06249
Valid p-Values and Expectations of p-Values Revisited Albert Vexler https://arxiv.org/abs/2001.05126
Counterexamples to “The Blessings of Multiple Causes” by Wang and Blei Elizabeth L. Ogburn, Ilya Shpitser, Eric J. Tchetgen Tchetgen https://arxiv.org/abs/2001.06555
Identifying Mislabeled Instances in Classification Datasets Nicolas Michael Müller, Karla Markert https://arxiv.org/abs/1912.05283
Randomized p-values for multiple testing and their application in replicability analysis Anh-Tuan Hoang, Thorsten Dickhaus https://arxiv.org/abs/1912.06982
Over-parametrized deep neural networks do not generalize well Michael Kohler, Adam Krzyzak https://arxiv.org/abs/1912.03925
Re-Evaluating Strengthened-IV Designs: Asymptotic Efficiency, Bias Formula, and the Validity and Power of Sensitivity Analyses Siyu Heng, Bo Zhang, Xu Han, Scott A. Lorch, Dylan S. Small https://arxiv.org/abs/1911.09171
Unbiased variable importance for random forests Markus Loecher https://arxiv.org/abs/2003.02106
Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited Wesley J. Maddox, Gregory Benton, Andrew Gordon Wilson https://arxiv.org/abs/2003.02139
A Multi-Way Correlation Coefficient Benjamin M. Taylor https://arxiv.org/abs/2003.02561
Sense and Sensitivity Analysis: Simple Post-Hoc Analysis of Bias Due to Unobserved Confounding Victor Veitch, Anisha Zaveri https://arxiv.org/abs/2003.01747
The Implicit and Explicit Regularization Effects of Dropout Colin Wei, Sham Kakade, Tengyu Ma https://arxiv.org/abs/2002.12915
Natural Language Processing Advancements By Deep Learning: A Survey Amirsina Torfi, Rouzbeh A. Shirvani, Yaser Keneshloo, Nader Tavaf, Edward A. Fox https://arxiv.org/abs/2003.01200
An Evaluation of Change Point Detection Algorithms Gerrit J.J. van den Burg, Christopher K.I. Williams https://arxiv.org/abs/2003.06222
Complexity Measures and Features for Times Series classification Francisco J. Baldán, José M. Benítez https://arxiv.org/abs/2002.12036
Computing Shapley Effects for Sensitivity Analysis Elmar Plischke, Giovanni Rabitti, Emanuele Borgonovo https://arxiv.org/abs/2002.12024
Bayesian Posterior Interval Calibration to Improve the Interpretability of Observational Studies Jami J. Mulgrave, David Madigan, George Hripcsak https://arxiv.org/abs/2003.06002
Demystify Lindley’s Paradox by Interpreting P-value as Posterior Probability Guosheng Yin, Haolun Shi https://arxiv.org/abs/2002.10883
Estimation of causal effects with small data in the presence of trapdoor variables Jouni Helske, Santtu Tikka, Juha Karvanen https://arxiv.org/abs/2003.03187
Dimensional Analysis in Statistical Modelling Tae Yoon Lee, James V. Zidek, Nancy Heckman https://arxiv.org/abs/2002.11259
Causal bounds for outcome-dependent sampling in observational studies Erin E. Gabriel, Michael C. Sachs, Arvid Sjölander https://arxiv.org/abs/2002.10519
cutpointr: Improved Estimation and Validation of Optimal Cutpoints in R https://arxiv.org/abs/2002.09209
A New Framework for Online Testing of Heterogeneous Treatment Effect Miao Yu, Wenbin Lu, Rui Song https://arxiv.org/abs/2002.03277
Combining Observational and Experimental Datasets Using Shrinkage Estimators Evan Rosenman, Guillaume Basse, Art Owen, Michael Baiocchi https://arxiv.org/abs/2002.06708
A confidence interval robust to publication bias for random-effects meta-analysis of few studies M. Henmi, S. Hattori, T. Friede https://arxiv.org/abs/2002.07598
Boosting Simple Learners Noga Alon, Alon Gonen, Elad Hazan, Shay Moran https://arxiv.org/abs/2001.11704
Analytic Study of Double Descent in Binary Classification: The Impact of Loss Ganesh Kini, Christos Thrampoulidis https://arxiv.org/abs/2001.11572
Fast Bayesian Estimation of Spatial Count Data Models Prateek Bansal, Rico Krueger, Daniel J. Graham https://arxiv.org/abs/2007.03681
High-recall causal discovery for autocorrelated time series with latent confounders Andreas Gerhardus, Jakob Runge https://arxiv.org/abs/2007.01884
Estimating the Prediction Performance of Spatial Models via Spatial k-Fold Cross Validation Jonne Pohjankukka, Tapio Pahikkala, Paavo Nevalainen, Jukka Heikkonen https://arxiv.org/abs/2005.14263
Validating Label Consistency in NER Data Annotation Qingkai Zeng, Mengxia Yu, Wenhao Yu, Tianwen Jiang, Meng Jiang https://arxiv.org/abs/2101.08698
Learning Prediction Intervals for Model Performance Benjamin Elder, Matthew Arnold, Anupama Murthi, Jiri Navratil https://arxiv.org/abs/2012.08625
Dive into Decision Trees and Forests: A Theoretical Demonstration Jinxiong Zhang https://arxiv.org/abs/2101.08656
Self-semi-supervised Learning to Learn from NoisyLabeled Data Jiacheng Wang, Yue Ma, Shuang Gao https://arxiv.org/abs/2011.01429
Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, Yu Su https://arxiv.org/abs/2011.07743
Uncertainty as a Form of Transparency: Measuring, Communicating, and Using Uncertainty Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q. Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Gauthier Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo, Lama Nachman, Rumi Chunara, Madhulika Srikumar, Adrian Weller, Alice Xiang https://arxiv.org/abs/2011.07586
A Survey on Data Augmentation for Text Classification Markus Bayer, Marc-André Kaufhold, Christian Reuter https://arxiv.org/abs/2107.03158
A Survey on Automated Fact-Checking Zhijiang Guo, Michael Schlichtkrull, Andreas Vlachos https://arxiv.org/abs/2108.11896
The Benchmark Lottery Mostafa Dehghani, Yi Tay, Alexey A. Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, Oriol Vinyals https://arxiv.org/abs/2107.07002
Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence Alexander Hoyle, Pranav Goel, Denis Peskov, Andrew Hian-Cheong, Jordan Boyd-Graber, Philip Resnik https://arxiv.org/abs/2107.02173
The Modern Mathematics of Deep Learning Julius Berner, Philipp Grohs, Gitta Kutyniok, Philipp Petersen https://arxiv.org/abs/2105.04026
Biases in human mobility data impact epidemic modeling Frank Schlosser, Vedran Sekara, Dirk Brockmann, Manuel Garcia-Herranz https://arxiv.org/abs/2112.12521
Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation Shaoshi Sun, Zhenyuan Zhang, BoCheng Huang, Pengbin Lei, Jianlin Su, Shengfeng Pan, Jiarun Cao https://arxiv.org/abs/2112.12433
Clean or Annotate: How to Spend a Limited Data Collection Budget Derek Chen, Zhou Yu, Samuel R. Bowman https://arxiv.org/abs/2110.08355
How many labelers do you have? A closer look at gold-standard labels Chen Cheng, Hilal Asi, John Duchi https://arxiv.org/abs/2206.12041
Eliciting and Learning with Soft Labels from Every Annotator https://arxiv.org/abs/2207.00810
Quantified Reproducibility Assessment of NLP Results Anya Belz, Maja Popović, Simon Mille https://arxiv.org/abs/2204.05961
SHAP and LIME Python Libraries: Part 2 - Using SHAP and LIME https://www.dominodatalab.com/blog/shap-lime-python-libraries-part-2-using-shap-lime
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/
The Mythos of Model Interpretability https://arxiv.org/pdf/1606.03490v1.pdf
Lessons Learned Reproducing a Deep Reinforcement Learning Paper Apr 6, 2018 http://amid.fish/reproducing-deep-rl
Spatial autocorrelation: bane or bonus? Matt. D. M. Pawley, Brian H. McArdle doi: https://doi.org/10.1101/385526 https://www.biorxiv.org/content/10.1101/385526v1
On Reality and the Limits of Language Data Nigel H. Collier, Fangyu Liu, Ehsan Shareghi https://arxiv.org/abs/2208.11981
Open Information Extraction from 2007 to 2022 – A Survey Pai Liu, Wenyang Gao, Wenjie Dong, Songfang Huang, Yue Zhang https://arxiv.org/abs/2208.08690
Colah’s blog http://colah.github.io/
Causal Reasoning: Fundamentals and Machine Learning Applications http://causalinference.gitlab.io/book/
http://courses.d2l.ai/berkeley-stat-157/units/index.html#
A Compendium of Clean Graphs in R http://shinyapps.org/apps/RGraphCompendium/index.php?utm_content=bufferd23cb&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Bayesian Inference an interactive visualization https://rpsychologist.com/d3/bayes/?utm_content=buffera5352&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Fitting distributions with R https://www.magesblog.com/post/2011-12-01-fitting-distributions-with-r/
Concerns About Bots on Mechanical Turk: Problems and Solutions https://www.cloudresearch.com/resources/blog/concerns-about-bots-on-mechanical-turk-problems-and-solutions/
Reproducible Research with R & RStudio 2nd Edition Christopher Gandrud http://christophergandrud.github.io/RepResR-RStudio/
Regression and Causality https://arxiv.org/pdf/2006.11754.pdf
Introduction to Causal Inference Fall 2020 https://www.bradyneal.com/causal-inference-course
Tensorflow 2.0 Pitfalls A list of commonly seen issues along with solutions. http://blog.ai.ovgu.de/posts/jens/2019/001_tf20_pitfalls/index.html
Cold Case: The Lost MNIST Digits Chhavi Yadav, Léon Bottou https://arxiv.org/abs/1905.10498
Automated Text Classification of News Articles: A Practical Guide Published online by Cambridge University Press: 09 June 2020 https://www.cambridge.org/core/journals/political-analysis/article/abs/automated-text-classification-of-news-articles-a-practical-guide/10462DB284B1CD80C0FAE796AD786BC6
How to Use t-SNE Effectively https://distill.pub/2016/misread-tsne/
Locality Sensitive Hashing in R http://dsnotes.com/post/locality-sensitive-hashing-in-r-part-1/
Identification of and Correction for Publication Bias Isaiah Andrews https://www.aeaweb.org/articles?id=10.1257/aer.20180310
Mediation Analysis is Counterintuitively Invalid http://datacolada.org/103
Dive into Deep Learning http://d2l.ai/
CS224d: Deep Learning for Natural Language Processing http://cs224d.stanford.edu/syllabus.html
Regression Models for Count Data: beyond the Poisson model http://cursos.leg.ufpr.br/rmcd/
p-hacking fast and slow: Evaluating a forthcoming AER paper deeming some econ literatures less trustworthy http://datacolada.org/91
Attention Is All You Need Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin https://arxiv.org/abs/1706.03762
When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data? Kosuke Imai (Harvard University), In Song Kim https://imai.fas.harvard.edu/research/files/FEmatch.pdf
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? https://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf
P values in display items are ubiquitous and almost invariably significant: A survey of top science journals https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0197440
R Workflow for Reproducible Data Analysis and Reporting http://hbiostat.org/rflow/
The reusable holdout: Preserving validity in adaptive data analysis https://ai.googleblog.com/2015/08/the-reusable-holdout-preserving.html
The science that’s never been cited Nature investigates how many papers really end up without a single citation. https://www.nature.com/articles/d41586-017-08404-0?WT.mc_id=TWT_NA_1711_FHNEWSFNEVERCITED_PORTFOLIO
didimputation The goal of didimputation is to estimate TWFE models without running into the problem of staggered treatment adoption. https://github.com/kylebutts/didimputation
Methods Matter: P-Hacking and Causal Inference in Economics https://docs.iza.org/dp11796.pdf
CloudForest https://github.com/ryanbressler/CloudForest
The idea for Artificial Contrasts is based on: Eugene Tuv and Kari Torkkola’s “Feature Filtering with Ensembles Using Artificial Contrasts” http://enpub.fulton.asu.edu/workshop/FSDM05-Proceedings.pdf#page=74 and Eugene Tuv, Alexander Borisov, George Runger and Kari Torkkola’s “Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination” http://www.researchgate.net/publication/220320233_Feature_Selection_with_Ensembles_Artificial_Variables_and_Redundancy_Elimination/file/d912f5058a153a8b35.pdf
The idea for growing trees to minimize categorical entropy comes from Ross Quinlan’s ID3: http://en.wikipedia.org/wiki/ID3_algorithm
“The Elements of Statistical Learning” 2nd edition by Trevor Hastie, Robert Tibshirani and Jerome Friedman was also consulted during development.
Methods for classification from unbalanced data are covered in several papers: http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/ http://www.biomedcentral.com/1471-2105/11/523 http://bib.oxfordjournals.org/content/early/2012/03/08/bib.bbs006 http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067863
Density estimating trees/forests are discussed in: http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p627.pdf http://research.microsoft.com/pubs/158806/CriminisiForests_FoundTrends_2011.pdf The latter also introduces the idea of manifold forests, which can be learned using downstream analysis of the outputs of leafcount to find the Fiedler vectors of the graph Laplacian.
An introduction to Git and how to use it with RStudio http://r-bio.github.io/intro-git-rstudio/
Probability and Statistics Cookbook https://pages.cs.wisc.edu/~tdw/files/cookbook-en.pdf
The Plain Person’s Guide to Plain Text Social Science https://plain-text.co/
Causal Graphical Views of Fixed Effects and Random Effects Models https://psyarxiv.com/cxd2n/
Beware Default Random Forest Importances https://explained.ai/rf-importance/index.html
Tools and guides to put R models into production https://putrinprod.com/
’Metrics Monday: You Can’t Compare OLS with 2SLS PUBLISHED NOVEMBER 20, 2017 http://marcfbellemare.com/wordpress/12723
Causal Inference Animated Plots https://nickchk.com/causalgraphs.html#iv
Scaling Data from Multiple Sources https://www.cambridge.org/core/journals/political-analysis/article/abs/scaling-data-from-multiple-sources/1F9D30D8DDCE44379E8B962C29DADBAB?utm_source=hootsuite&utm_medium=twitter&utm_campaign=PAN_Nov20
GAM: The Predictive Modeling Silver Bullet https://multithreaded.stitchfix.com/blog/2015/07/30/gam/
Generalized Full Matching Published online by Cambridge University Press: 23 November 2020 https://www.cambridge.org/core/journals/political-analysis/article/abs/generalized-full-matching/3DA71D8BEDA6F02B5D36457E114C79B6?utm_source=hootsuite&utm_medium=twitter&utm_campaign=PAN_Nov20
A Deep Dive Into How R Fits a Linear Model http://madrury.github.io/jekyll/update/statistics/2016/07/20/lm-in-R.html
A ModernDive into R and the Tidyverse https://moderndive.com/
INSTRUMENTAL VARIABLES REGRESSIONS INVOLVING SEASONAL DATA David E.A. GILES http://web.uvic.ca/~dgiles/blog/Giles_FWL.pdf
The Book of Statistical Proofs https://statproofbook.github.io/
An econometric method for estimating population parameters from non‐random samples: An application to clinical case finding http://www-personal.umich.edu/~zmclaren/mclaren_tbprevalence.pdf
Parallelizing neural networks on one GPU with JAX http://willwhitney.com/parallel-training-jax.html
https://wrdrd.github.io/docs/
Learning interactions via hierarchical group-lasso regularization Michael Lim and Trevor Hastie June 21, 2014 https://hastie.su.domains/Papers/glinternet_jcgs.pdf
On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data http://web.mit.edu/insong/www/pdf/FEmatch-twoway.pdf
Backprop is not just the chain rule AUG 18, 2017 http://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule/
HOW TO PLOT XGBOOST TREES IN R http://theautomatic.net/2021/04/28/how-to-plot-xgboost-trees-in-r/
Rectangling https://tidyr.tidyverse.org/articles/rectangle.html
R Packages (2e) https://r-pkgs.org/
Prior distributions for variance parameters in hierarchical models http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf
A visual introduction to machine learning http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Prior distribution Andrew Gelman Volume 3, pp 1634–1637 http://www.stat.columbia.edu/~gelman/research/published/p039-_o.pdf
P-curve.com http://www.p-curve.com/
Model Tuning and the Bias-Variance Tradeoff http://www.r2d3.us/visual-intro-to-machine-learning-part-2/
How (and why) to create a good validation set https://www.fast.ai/posts/2017-11-13-validation-sets.html
The Promise and Pitfalls of Differences-in-Differences: Reflections on ‘16 and Pregnant’ and Other Applications https://www.nber.org/papers/w24857
Applied Bayesian Modeling http://www.leg.ufpr.br/lib/exe/fetch.php/wiki:internas:biblioteca:cogdon.pdf
Gaussian Distributions are Soap Bubbles https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
Interpreting Instrumented Difference-in-Differences http://www.mit.edu/~liebers/DDIV.pdf
Probability, log-odds, and odds https://www.montana.edu/rotella/documents/502/Prob_odds_log-odds.pdf
TRANSFORMERS FROM SCRATCH https://peterbloem.nl/blog/transformers
How to Examine External Validity Within an Experiment https://www.nber.org/papers/w24834
Program Evaluation https://www.lecy.info/program-evaluation/
Facing Imbalanced Data Recommendations for the Use of Performance Metrics https://sites.pitt.edu/~jeffcohn/biblio/Jeni_Metrics.pdf
The Art and Practice of Economics Research: Lessons from Leading Minds https://static1.squarespace.com/static/56ec62678a65e20b89da5f33/t/6164758b00bbcb015c12dd53/1633973644033/Card.pdf
Statistics: P values are just the tip of the iceberg https://www.nature.com/articles/520612a
Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem https://www.jmlr.org/papers/volume19/16-474/16-474.pdf
Can transparency undermine peer review? A simulation model of scientist behavior under open peer review Federico Bianchi, Flaminio Squazzoni https://academic.oup.com/spp/article/49/5/791/6602348?login=false
The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation https://aclanthology.org/2021.emnlp-main.97/
Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results Alwyn Young https://academic.oup.com/qje/article-abstract/134/2/557/5195544?redirectedFrom=fulltext
Advanced R https://adv-r.hadley.nz/index.html
Causal Machine Learning: A Survey and Open Problems https://ai.papers.bar/paper/460ac86ef8e611ecb9b9d35608ee6155
On the Meaning of Within-Factor Correlated Measurement Errors https://academic.oup.com/jcr/article-abstract/11/1/572/1822756
Trustworthy Machine Learning http://www.trustworthymachinelearning.com/
Statistical Modeling: The Two Cultures Author(s): Leo Breiman http://www2.math.uu.se/~thulin/mm/breiman.pdf
Estimating misclassification error with small samples via bootstrap cross-validation https://academic.oup.com/bioinformatics/article/21/9/1979/409121?login=true
Critical appraisal of artificial intelligence-based prediction models for cardiovascular disease https://academic.oup.com/eurheartj/article/43/31/2921/6593474?login=false
The Only Probability Cheatsheet You’ll Ever Need http://www.wzchen.com/probability-cheatsheet/
Come back and scrape these http://www.wzchen.com/data-science-books
Download the Datasaurus: Never trust summary statistics alone; always visualize your data http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
Prior Setting in Practice: Strategies and Rationales Used in Choosing Prior Distributions for Bayesian Analysis https://abhsarma.github.io/pubs/Prior_Setting_CHI2020.pdf
50 Years of Data Science David Donoho https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734
The Assessment of Intrinsic Credibility and a New Argument for p<0.005 Leonhard Held https://arxiv.org/abs/1803.10052
Arbitrariness of peer review: A Bayesian analysis of the NIPS experiment Olivier Francois https://arxiv.org/abs/1507.06411
Diaries of Social Data Research https://anchor.fm/diaries-soc-data-research/episodes/The-Evolution-of-Computational-Social-Science-from-a-Sociology-Perspective-with-Chris-Bail-e17vikf
pipecleaner https://alistaire47.github.io/pipecleaner/
Cross-Validation for Correlated Data https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2020.1801451
The causal hype ratchet https://statmodeling.stat.columbia.edu/2018/12/21/causal-hype-ratchet/
A Permutation Test for the Regression Kink Design Peter Ganong & Simon Jäger https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2017.1328356#.XEx7z89KjXF
Approximate Residual Balancing: De-Biased Inference of Average Treatment Effects in High Dimensions https://arxiv.org/abs/1604.07125
High-Dimensional Convex Geometry https://amitrajaraman.github.io/notes/convex-geometry/Convex_Geometry.pdf
Data Science Interviews During the 2020 Pandemic https://alexgude.com/blog/interviewing-for-data-science-positions-in-2020/
Tech Interviews: Respect Everyone’s Time https://alexgude.com/blog/interviews-respect-time/
Distribution-Free Prediction Intervals with Conformal Inference using R https://arelbundock.com/posts/conformal/
Robustness checks https://statmodeling.stat.columbia.edu/2018/11/14/robustness-checks-joke/ https://statmodeling.stat.columbia.edu/2017/11/29/whats-point-robustness-check/
Synthetically generated text for supervised text analysis Andrew Halterman https://andrewhalterman.com/files/Halterman_synthetic_text.pdf
EPP: interpretable score of model predictive power Alicja Gosiewska, Mateusz Bakala, Katarzyna Woznica, Maciej Zwolinski, Przemyslaw Biecek https://arxiv.org/abs/1908.09213
Measuring Calibration in Deep Learning Jeremy Nixon, Mike Dusenberry, Ghassen Jerfel, Timothy Nguyen, Jeremiah Liu, Linchuan Zhang, Dustin Tran https://arxiv.org/abs/1904.01685
What can be estimated? Identifiability, estimability, causal inference and ill-posed inverse problems Oliver J. Maclaren, Ruanui Nicholson https://arxiv.org/abs/1904.02826
Comparing Spike and Slab Priors for Bayesian Variable Selection Gertraud Malsiner-Walli, Helga Wagner https://arxiv.org/abs/1812.07259
Time-uniform, nonparametric, nonasymptotic confidence sequences Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, Jasjeet Sekhon https://arxiv.org/abs/1810.08240
Open Science in Software Engineering Daniel Méndez Fernández, Daniel Graziotin, Stefan Wagner, Heidi Seibold
Safe Testing https://arxiv.org/abs/1906.07801
A Mini-Introduction To Information Theory Edward Witten https://arxiv.org/abs/1805.11965
An Introduction to Deep Reinforcement Learning Vincent Francois-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau https://arxiv.org/abs/1811.12560
On the cross-validation bias due to unsupervised pre-processing Amit Moscovich, Saharon Rosset https://arxiv.org/abs/1901.08974
Troubling Trends in Machine Learning Scholarship Zachary C. Lipton, Jacob Steinhardt https://arxiv.org/abs/1807.03341
The Role of the Propensity Score in Fixed Effect Models Dmitry Arkhangelsky, Guido Imbens https://arxiv.org/abs/1807.02099
Proxy Controls and Panel Data Ben Deaner https://arxiv.org/abs/1810.00283
Structural Breaks in Time Series Alessandro Casini, Pierre Perron https://arxiv.org/abs/1805.03807
Comparing interpretability and explainability for feature selection Jack Dunn, Luca Mingardi, Ying Daisy Zhuo https://arxiv.org/abs/2105.05328
Cross-validation: what does it estimate and how well does it do it? Stephen Bates, Trevor Hastie, Robert Tibshirani https://arxiv.org/abs/2104.00673
On the implied weights of linear regression for causal inference Ambarish Chattopadhyay, Jose R. Zubizarreta https://arxiv.org/abs/2104.06581
A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification Anastasios N. Angelopoulos, Stephen Bates https://arxiv.org/abs/2107.07511
Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations Aahlad Puli, Lily H. Zhang, Eric K. Oermann, Rajesh Ranganath https://arxiv.org/abs/2107.00520
A Tutorial on VAEs: From Bayes’ Rule to Lossless Compression Ronald Yu https://arxiv.org/abs/2006.10273
Common Limitations of Image Processing Metrics: A Picture Story https://arxiv.org/abs/2104.05642
On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies Tianyi Zhang, Tatsunori Hashimoto https://arxiv.org/abs/2104.05694
A large-scale study on research code quality and execution Ana Trisovic, Matthew K. Lau, Thomas Pasquier, Mercè Crosas https://arxiv.org/abs/2103.12793
Explaining by Removing: A Unified Framework for Model Explanation Ian Covert, Scott Lundberg, Su-In Lee https://arxiv.org/abs/2011.14878
What is Entropy? A new perspective from games of chance Sarah Brandsen, Isabelle Jianing Geng, Gilad Gour https://arxiv.org/abs/2103.08681
Instrumental variables, spatial confounding and interference Andrew Giffin, Brian J. Reich, Shu Yang, Ana G. Rappold https://arxiv.org/abs/2103.00304
On Linear Identifiability of Learned Representations Geoffrey Roeder, Luke Metz, Diederik P. Kingma https://arxiv.org/abs/2007.00810
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi https://arxiv.org/abs/2009.10795
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks Curtis G. Northcutt, Anish Athalye, Jonas Mueller https://arxiv.org/abs/2103.14749
Contamination Bias in Linear Regressions Paul Goldsmith-Pinkham, Peter Hull, Michal Kolesár
Towards optimal doubly robust estimation of heterogeneous causal effects Edward H. Kennedy https://arxiv.org/abs/2004.14497
When are Non-Parametric Methods Robust? Robi Bhattacharjee, Kamalika Chaudhuri https://arxiv.org/abs/2003.06121
When Is Parallel Trends Sensitive to Functional Form? Jonathan Roth, Pedro H. C. Sant’Anna https://arxiv.org/abs/2010.04814
Optimal Regularization Can Mitigate Double Descent Preetum Nakkiran, Prayaag Venkat, Sham Kakade, Tengyu Ma
Valid Causal Inference with (Some) Invalid Instruments Jason Hartford, Victor Veitch, Dhanya Sridhar, Kevin Leyton-Brown https://arxiv.org/abs/2006.11386
A Survey on Knowledge Graphs: Representation, Acquisition and Applications Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, Philip S. Yu https://arxiv.org/abs/2002.00388
The MCC-F1 curve: a performance evaluation technique for binary classification Chang Cao, Davide Chicco, Michael M. Hoffman https://arxiv.org/abs/2006.11278
Causal Inference and Data Fusion in Econometrics Paul Hünermund (Copenhagen Business School), Elias Bareinboim (Columbia University) https://arxiv.org/abs/1912.09104
Learning to Induce Causal Structure Nan Rosemary Ke, Silvia Chiappa, Jane Wang, Anirudh Goyal, Jorg Bornschein, Melanie Rey, Theophane Weber, Matthew Botvinic, Michael Mozer, Danilo Jimenez Rezende https://arxiv.org/abs/2204.04875
Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett https://arxiv.org/abs/2202.05928
Causal Inference Through the Structural Causal Marginal Problem Luigi Gresele, Julius von Kügelgen, Jonas M. Kübler, Elke Kirschbaum, Bernhard Schölkopf, Dominik Janzing https://arxiv.org/abs/2202.01300
Benefits and costs of matching prior to a Difference in Differences analysis when parallel trends does not hold Dae Woong Ham, Luke Miratrix https://arxiv.org/abs/2205.08644
Causal influence, causal effects, and path analysis in the presence of intermediate confounding Iván Díaz https://arxiv.org/abs/2205.08000
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra https://arxiv.org/abs/2201.02177
Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI Shlomo Kashani, Amir Ivry https://arxiv.org/abs/2201.00650
Better Uncertainty Calibration via Proper Scores for Classification and Beyond Sebastian Gruber, Florian Buettner https://arxiv.org/abs/2203.07835
Sensitivity Analysis of Individual Treatment Effects: A Robust Conformal Inference Approach Ying Jin, Zhimei Ren, Emmanuel J. Candès https://arxiv.org/abs/2111.12161
Towards a Unified Information-Theoretic Framework for Generalization Mahdi Haghifam, Gintare Karolina Dziugaite, Shay Moran, Daniel M. Roy https://arxiv.org/abs/2111.05275
Learning in High Dimension Always Amounts to Extrapolation Randall Balestriero, Jerome Pesenti, Yann LeCun https://arxiv.org/abs/2110.09485
Understanding Dataset Difficulty with V-Usable Information Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta https://arxiv.org/abs/2110.08420
Batch Normalization Explained Randall Balestriero, Richard G. Baraniuk https://arxiv.org/abs/2209.14778
Bayesian Online Changepoint Detection https://arxiv.org/pdf/0710.3742.pdf
Impact of subsampling and pruning on random forests. https://arxiv.org/pdf/1603.04261.pdf
Selection Collider Bias in Large Language Models Emily McMilin https://arxiv.org/abs/2208.10063
On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models Rohan Anil, Sandra Gadanho, Da Huang, Nijith Jacob, Zhuoshu Li, Dong Lin, Todd Phillips, Cristina Pop, Kevin Regan, Gil I. Shamir, Rakesh Shivanna, Qiqi Yan https://arxiv.org/abs/2209.05310
Selective review of offline change point detection methods https://arxiv.org/pdf/1801.00718.pdf
How Much More Data Do I Need? Estimating Requirements for Downstream Tasks Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law https://arxiv.org/abs/2207.01725
Snorkel: Rapid Training Data Creation with Weak Supervision https://arxiv.org/pdf/1711.10160.pdf
Defining and Characterizing Reward Hacking Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger https://arxiv.org/abs/2209.13085
On Leave-One-Out Conditional Mutual Information For Generalization Mohamad Rida Rammal, Alessandro Achille, Aditya Golatkar, Suhas Diggavi, Stefano Soatto https://arxiv.org/abs/2207.00581
Formal Algorithms for Transformers Mary Phuong, Marcus Hutter https://arxiv.org/abs/2207.09238
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, Armen Aghajanyan https://arxiv.org/abs/2205.10770
Why do tree-based models still outperform deep learning on tabular data? Léo Grinsztajn (SODA), Edouard Oyallon (ISIR, CNRS), Gaël Varoquaux (SODA) https://arxiv.org/abs/2207.08815
Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran https://arxiv.org/abs/2207.06569
Towards Understanding Grokking: An Effective Theory of Representation Learning Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams https://arxiv.org/abs/2205.10343
Pen and Paper Exercises in Machine Learning Michael U. Gutmann https://arxiv.org/abs/2206.13446
Introduction to DiD with Multiple Time Periods Brantly Callaway and Pedro H.C. Sant’Anna 2022-07-19 https://bcallaway11.github.io/did/articles/multi-period-did.html
Applications of Deep Neural Networks with Keras Jeff Heaton Fall 2022 https://arxiv.org/pdf/2009.05673.pdf
Joint Distributions for TensorFlow Probability Dan Piponi, Dave Moore & Joshua V. Dillon, Google Research https://arxiv.org/pdf/2001.11819.pdf
Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers https://arxiv.org/pdf/2007.01547.pdf
Geographic Difference-in-Discontinuities Kyle Butts https://arxiv.org/pdf/2109.07406.pdf
Pre-trained Models for Natural Language Processing: A Survey Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai & Xuanjing Huang https://arxiv.org/pdf/2003.08271.pdf
Knowledge Graphs on the Web – an Overview https://arxiv.org/pdf/2003.00719.pdf
Knowledge Graphs https://arxiv.org/pdf/2003.02320.pdf
TOPOLOGY OF DEEP NEURAL NETWORKS GREGORY NAITZAT, ANDREY ZHITNIKOV, AND LEK-HENG LIM https://arxiv.org/pdf/2004.06093.pdf
Noise-Induced Randomization in Regression Discontinuity Designs https://arxiv.org/pdf/2004.09458.pdf
Markov Chain Monte Carlo Methods, a survey with some frequent misunderstandings https://arxiv.org/pdf/2001.06249.pdf
Learning Dependency Structures for Weak Supervision Models https://arxiv.org/pdf/1903.05844.pdf
Approximate leave-future-out cross-validation for Bayesian time series models https://arxiv.org/pdf/1902.06281.pdf
Relational Representation Learning for Dynamic (Knowledge) Graphs: A Survey https://arxiv.org/pdf/1905.11485v1.pdf
Statistical methods research done as science rather than mathematics James S. Hodges https://arxiv.org/pdf/1905.08381.pdf
R Tip: use isTRUE() https://win-vector.com/2018/06/11/r-tip-use-istrue/
The tidymodels Package https://www.tidyverse.org/blog/2018/08/tidymodels-0-0-1/
Regular expressions are tricky. RegExplain makes it easier to see what you’re doing. https://www.garrickadenbuie.com/project/regexplain/
The ability of different peer review procedures to flag problematic publications https://link.springer.com/article/10.1007/s11192-018-2969-2
gglabeller https://github.com/AliciaSchep/gglabeller?utm_content=buffer552f9&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
The {targets} R package user manual https://books.ropensci.org/targets/
How Regularization Works https://e2eml.school/regularization.html
Don’t be tricked by the Hashing Trick https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087
How to Use Catboost with Tidymodels https://blog.rmhogervorst.nl/blog/2020/08/28/how-to-use-catboost-with-tidymodels/
R Markdown: The Definitive Guide https://bookdown.org/yihui/rmarkdown/
37 Reasons why your Neural Network is not working https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
Time to assume that health research is fraudulent until proven otherwise? https://blogs.bmj.com/bmj/2021/07/05/time-to-assume-that-health-research-is-fraudulent-until-proved-otherwise/
Quality of reporting of randomised controlled trials of artificial intelligence in healthcare: a systematic review Rida Shahzad, Bushra Ayub, M A Rehman Siddiqui https://bmjopen.bmj.com/content/12/9/e061519.abstract
Running R Scripts on a Schedule with GitHub Actions By Simon P. Couch DECEMBER 27, 2020 https://www.simonpcouch.com/blog/r-github-actions-commit/
Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction Shangzhi Hong & Henry S. Lynn https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01080-1
spatialRF: Easy Spatial Regression with Random Forest https://blasbenito.github.io/spatialRF/
Supervised Clustering: How to Use SHAP Values for Better Cluster Analysis https://www.aidancooper.co.uk/supervised-clustering-shap-values/
Exploring Neural Networks Visually in the Browser https://cprimozic.net/blog/neural-network-experiments-and-visualizations/
How Much Should We Trust Instrumental Variable Estimates in Political Science? Practical Advice Based on Over 60 Replicated Studies https://yiqingxu.org/papers/english/2021_iv/LLXZ.pdf
Feathr: LinkedIn’s feature store is now available on Azure Posted on April 12, 2022 Xiaoyong Zhu https://azure.microsoft.com/en-us/blog/feathr-linkedin-s-feature-store-is-now-available-on-azure/
A Survey of Learning on Small Data Xiaofeng Cao, Weixin Bu, Shengjun Huang, Yingpeng Tang, Yaming Guo, Yi Chang, Ivor W. Tsang https://arxiv.org/abs/2207.14443
Ontology-based industrial data management platform Sergey Gorshkov, Alexander Grebeshkov, Roman Shebalov https://arxiv.org/abs/2103.05538
How to Speed Up XGBoost Model Training https://www.anyscale.com/blog/how-to-speed-up-xgboost-model-training
Markov Chain Monte Carlo Methods for Bayesian Data Analysis in Astronomy https://www.datasciencecentral.com/markov-chain-monte-carlo-methods-for-bayesian-data-analysis-in/#w6JI5
How to Easily Draw Neural Network Architecture Diagrams https://towardsdatascience.com/how-to-easily-draw-neural-network-architecture-diagrams-a6b6138ed875
L2 Regularization and Batch Norm https://blog.janestreet.com/l2-regularization-and-batch-norm/
Trust in LIME: Yes, No, Maybe So? https://www.dominodatalab.com/blog/trust-in-lime-local-interpretable-model-agnostic-explanations
Inside Manifold: Uber’s Stack for Debugging Machine Learning Models https://towardsai.net/p/l/inside-manifold-ubers-stack-for-debugging-machine-learning-models?utm_source=twitter&utm_medium=social&utm_campaign=rop-content-recycle
Data validation for machine learning JUNE 5, 2019 ~ ADRIAN COLYER https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
Multiprocessing vs. Threading in Python: What Every Data Scientist Needs to Know https://blog.floydhub.com/multiprocessing-vs-threading-in-python-what-every-data-scientist-needs-to-know/
A Comprehensive Guide to Machine Learning https://www.eecs189.org/static/resources/comprehensive-guide.pdf
A Concrete Introduction to Probability (using Python) https://github.com/norvig/pytudes/blob/main/ipynb/Probability.ipynb
An Interactive Guide To The Fourier Transform https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/
Identity Crisis https://betanalpha.github.io/assets/case_studies/identifiability.html
https://betanalpha.github.io/writing/
Bayes Sparse Regression Michael Betancourt March 2018 https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html#1_fading_into_irrelevance
An Introduction to Stan Michael Betancourt March 2020 https://betanalpha.github.io/assets/case_studies/stan_intro.html
https://developers.google.com/machine-learning
Prior Modeling Michael Betancourt September 2021 https://betanalpha.github.io/assets/case_studies/prior_modeling.html
Colorized Math Equations https://betterexplained.com/articles/colorized-math-equations/
Towards A Principled Bayesian Workflow Michael Betancourt April 2020 https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html
Ordinal Regression Michael Betancourt May 2019 https://betanalpha.github.io/assets/case_studies/ordinal_regression.html
Analysing continuous proportions in ecology and evolution: A practical introduction to beta and Dirichlet regression Jacob C. Douma, James T. Weedon https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13234
slider https://davisvaughan.github.io/slider/index.html
torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision https://davidpicard.github.io/pdf/lucky_seed.pdf
A Day in the Life of a Silicon Valley Data Engineer https://towardsdatascience.com/a-day-in-the-life-of-a-google-data-engineer-722f1b2206cc
ICML 2018 Notes https://david-abel.github.io/blog/posts/misc/icml_2018.pdf
ICML 2019 Notes https://david-abel.github.io/notes/icml_2019.pdf
Keep using plate notation https://davidrushingdewhurst.com/blog/2020-07-28keep-using-plate-notation.html
Data Visualization https://datavizs21.classes.andrewheiss.com/content/
DO YOU KNOW THE 4 TYPES OF ADDITIVE VARIABLE IMPORTANCES? https://datajms.com/post/variable_importance_feature_attribution/
geostan: Bayesian spatial analysis https://connordonegan.github.io/geostan/
Using Observational Study Data as an External Control Group for a Clinical Trial: an Empirical Comparison of Methods to Account for Longitudinal Missing Data https://www.researchgate.net/publication/357609855_Using_Observational_Study_Data_as_an_External_Control_Group_for_a_Clinical_Trial_an_Empirical_Comparison_of_Methods_to_Account_for_Longitudinal_Missing_Data
Selective Ignorability Assumptions in Causal Inference Marshall M. Joffe, Wei Peter Yang and Harold I. Feldman https://www.degruyter.com/document/doi/10.2202/1557-4679.1199/html
Expressing Regret: A Unified View of Credible Intervals Kenneth Rice & Lingbo Ye https://www.tandfonline.com/doi/abs/10.1080/00031305.2022.2039764
Efficient Identification in Linear Structural Causal Models with Instrumental Cutsets https://causalai.net/r49.pdf
Polymatching algorithm in observational studies with multiple treatment groups https://www.sciencedirect.com/science/article/abs/pii/S0167947321001985
An Introduction to Statistical Learning https://hastie.su.domains/ISLR2/ISLRv2_website.pdf
Core concepts in pharmacoepidemiology: Confounding by indication and the role of active comparators https://onlinelibrary.wiley.com/doi/10.1002/pds.5407
Aim for Clinical Utility, Not Just Predictive Accuracy Sachs, Michael C.; Sjölander, Arvid; Gabriel, Erin E. https://journals.lww.com/epidem/Fulltext/2020/05000/Aim_for_Clinical_Utility,_Not_Just_Predictive.8.aspx#ej-article-sam-container
Monitoring Machine Learning Models in Production A Comprehensive Guide https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/
*args and **kwargs in Python https://towardsdatascience.com/args-kwargs-python-d9c71b220970
Waiting for Event Studies: A Play in Three Acts Sun and Abraham (2020) Explainer https://causalinf.substack.com/p/waiting-for-event-studies-a-play
Computational Socioeconomics https://arxiv.org/abs/1905.06166
Is Peer Review a Good Idea? Remco Heesen and Liam Kofi Bright https://www.journals.uchicago.edu/doi/10.1093/bjps/axz029
Opening the Black Box: a motivation for the assessment of mediation Danella M Hafeman, Sharon Schwartz https://pubmed.ncbi.nlm.nih.gov/19261660/
Invited Commentary: Propensity Scores Marshall M. Joffe, Paul R. Rosenbaum https://academic.oup.com/aje/article/150/4/327/98791
Advances in propensity score analysis Peter C Austin https://journals.sagepub.com/doi/full/10.1177/0962280219899248
Analysis in an imperfect world Michael Wallace First published: 29 January 2020 https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2020.01353.x
Central Limit Theorem http://mfviz.com/central-limit/?utm_content=buffere918f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Adjusting for Covariates in Randomized Clinical Trials for Drugs and Biological Products Draft Guidance for Industry https://www.fda.gov/regulatory-information/search-fda-guidance-documents/adjusting-covariates-randomized-clinical-trials-drugs-and-biological-products
Machine learning for improving high-dimensional proxy confounder adjustment in healthcare database studies: An overview of the current literature https://onlinelibrary.wiley.com/doi/10.1002/pds.5500
A Gentle Introduction to tidymodels https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/
Simultaneous Variable and Covariance Selection with the Multivariate Spike-and-Slab LASSO https://arxiv.org/pdf/1708.08911.pdf?utm_content=bufferb1cd5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Transformers Explained Visually — Not Just How, but Why They Work So Well https://towardsdatascience.com/transformers-explained-visually-not-just-how-but-why-they-work-so-well-d840bd61a9d3
Easy Bayesian Bootstrap in R https://www.sumsar.net/blog/2015/07/easy-bayesian-bootstrap-in-r/?utm_content=buffer53c16&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Leave-future-out cross-validation for time-series models https://discourse.mc-stan.org/t/leave-future-out-cross-validation-for-time-series-models/12954/2
PCA in a tidy(verse) framework https://tbradley1013.github.io/2018/02/01/pca-in-a-tidy-verse-framework/?utm_content=bufferfaf31&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Visualising Residuals https://drsimonj.svbtle.com/visualising-residuals?utm_content=bufferdb80e&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Covariate adjustment for randomized controlled trials revisited Jixian Wang https://onlinelibrary.wiley.com/doi/full/10.1002/pst.1988?campaign=wolearlyview
STAT 545 Data wrangling, exploration, and analysis with R https://stat545.com/index.html
Topics in Econometrics: Advances in Causality and Foundations of Machine Learning https://maxkasy.github.io/home/TopicsInEconometrics2019/
A nontechnical explanation of the counterfactual definition of confounding Martijn J.L. Bours https://www.jclinepi.com/article/S0895-4356(19)30173-8/pdf
Discovering Reliable Correlations in Categorical Data https://deepai.org/publication/discovering-reliable-correlations-in-categorical-data
Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations https://deepai.org/publication/dealing-with-disagreements-looking-beyond-the-majority-vote-in-subjective-annotations
https://papers.labml.ai/papers/weekly
Generalized Principal Component Analysis https://deepai.org/publication/generalized-principal-component-analysis
Deeptime: a Python library for machine learning dynamical models from time series data https://deepai.org/publication/deeptime-a-python-library-for-machine-learning-dynamical-models-from-time-series-data
Delving into Deep Imbalanced Regression https://deepai.org/publication/delving-into-deep-imbalanced-regression
Causality-based Feature Selection: Methods and Evaluations https://deepai.org/publication/causality-based-feature-selection-methods-and-evaluations
Causal Inference Through the Structural Causal Marginal Problem 02/02/2022 Luigi Gresele, et al. https://deepai.org/publication/causal-inference-through-the-structural-causal-marginal-problem
Causal Discovery from Incomplete Data: A Deep Learning Approach https://deepai.org/publication/causal-discovery-from-incomplete-data-a-deep-learning-approach
AutoML: A Survey of the State-of-the-Art https://deepai.org/publication/automl-a-survey-of-the-state-of-the-art
CatBoostLSS – An extension of CatBoost to probabilistic forecasting https://deepai.org/publication/catboostlss-an-extension-of-catboost-to-probabilistic-forecasting
Breiman’s “Two Cultures” Revisited and Reconciled 05/27/2020 Subhadeep, et al. https://deepai.org/publication/breiman-s-two-cultures-revisited-and-reconciled
Graphical Representation of Missing Data Problems Felix Thoemmes and Karthika Mohan https://ftp.cs.ucla.edu/pub/stat_ser/r448-reprint.pdf
Does Obesity Shorten Life? Or is it the Soda? On Non-manipulable Causes Judea Pearl https://ftp.cs.ucla.edu/pub/stat_ser/r483-reprint.pdf
Gene name errors are widespread in the scientific literature Mark Ziemann, Yotam Eren & Assam El-Osta https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
A hypothesis is a liability Itai Yanai & Martin Lercher https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02133-w
gganimate extends the grammar of graphics as implemented by ggplot2 to include the description of animation. https://gganimate.com/
Geocomputation with R https://geocompr.robinlovelace.net/index.html
ftfy: fixes text for you https://ftfy.readthedocs.io/en/latest/
ggforce https://ggforce.data-imaginist.com/index.html
Geographically based Economic data (G-Econ) https://gecon.yale.edu/
The packages dtw for R and dtw-python for Python provide the most complete, freely-available (GPL) implementation of Dynamic Time Warping-type (DTW) algorithms up to date. https://dynamictimewarping.github.io/
Preprints: An underutilized mechanism to accelerate outbreak science Michael A. Johansson, Nicholas G. Reich, Lauren Ancel Meyers, Marc Lipsitch Published: April 3, 2018 https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002549
Randomization does not imply unconfoundedness https://drive.google.com/file/d/1nV8QMLxwXi-iWSqiwRN4KnMSWfoWJned/view
Bayesian Gaussian Graphical Models https://donaldrwilliams.github.io/BGGM/index.html
DoWhy: Addressing Challenges in Expressing and Validating Causal Assumptions https://drive.google.com/file/d/1i81CnMd683A788RYtEb8KSowhhPJn3Z6/view
Plotting background data for groups with ggplot2 https://drsimonj.svbtle.com/plotting-background-data-for-groups-with-ggplot2
Benign Overfitting in Linear Regression 06/26/2019 https://deepai.org/publication/benign-overfitting-in-linear-regression
Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data https://deepai.org/publication/amortized-causal-discovery-learning-to-infer-causal-graphs-from-time-series-data
Accelerating Deep Learning by Focusing on the Biggest Losers 10/02/2019 https://deepai.org/publication/accelerating-deep-learning-by-focusing-on-the-biggest-losers
A Class of Algorithms for General Instrumental Variable Models 06/11/2020 https://deepai.org/publication/a-class-of-algorithms-for-general-instrumental-variable-models
A Survey of Parameters Associated with the Quality of Benchmarks in NLP 10/14/2022 https://deepai.org/publication/a-survey-of-parameters-associated-with-the-quality-of-benchmarks-in-nlp
A study of uncertainty quantification in overparametrized high-dimensional models 10/23/2022 https://deepai.org/publication/a-study-of-uncertainty-quantification-in-overparametrized-high-dimensional-models
DeclareDesign Blog The trouble with ‘controlling for blocks’ https://declaredesign.org/blog/biased-fixed-effects.html
A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning 06/09/2022 https://deepai.org/publication/a-critical-review-on-the-use-and-misuse-of-differential-privacy-in-machine-learning
A Comprehensive Survey of Image Augmentation Techniques for Deep Learning 05/03/2022 https://deepai.org/publication/a-comprehensive-survey-of-image-augmentation-techniques-for-deep-learning
Stance Detection: A Survey ACM Computing Surveys, Volume 53, Issue 1, January 2021 https://dl.acm.org/doi/abs/10.1145/3369026
PyTorch: An Imperative Style, High-Performance Deep Learning Library 12/03/2019 https://deepai.org/publication/pytorch-an-imperative-style-high-performance-deep-learning-library
On Causally Disentangled Representations 12/10/2021 https://deepai.org/publication/on-causally-disentangled-representations
A Visual Exploration of Gaussian Processes https://distill.pub/2019/visual-exploration-gaussian-processes/
Principled Machine Learning: Practices and Tools for Efficient Collaboration https://dev.to/robogeek/principled-machine-learning-4eho
Rules of Machine Learning: Best Practices for ML Engineering Martin Zinkevich https://developers.google.com/machine-learning/guides/rules-of-ml
The Variability of Model Specification 10/06/2021 https://deepai.org/publication/the-variability-of-model-specification
Taxonomy of Benchmarks in Graph Representation Learning 06/15/2022 https://deepai.org/publication/taxonomy-of-benchmarks-in-graph-representation-learning
Recognizing Variables from their Data via Deep Embeddings of Distributions 09/11/2019 https://deepai.org/publication/recognizing-variables-from-their-data-via-deep-embeddings-of-distributions
Relaxed Softmax for learning from Positive and Unlabeled data 09/17/2019 https://deepai.org/publication/relaxed-softmax-for-learning-from-positive-and-unlabeled-data
On Quantitative Evaluations of Counterfactuals 10/30/2021 https://deepai.org/publication/on-quantitative-evaluations-of-counterfactuals
Learning Neural Causal Models from Unknown Interventions 10/02/2019 https://deepai.org/publication/learning-neural-causal-models-from-unknown-interventions
Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models 10/26/2020 https://deepai.org/publication/memorizing-without-overfitting-bias-variance-and-interpolation-in-over-parameterized-models
InceptionTime: Finding AlexNet for Time Series Classification 09/11/2019 https://deepai.org/publication/inceptiontime-finding-alexnet-for-time-series-classification
Learning from Positive and Unlabeled Data by Identifying the Annotation Process 03/02/2020 https://deepai.org/publication/learning-from-positive-and-unlabeled-data-by-identifying-the-annotation-process
Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning 12/22/2019 https://deepai.org/publication/lessons-from-archives-strategies-for-collecting-sociocultural-data-in-machine-learning
Identification In Missing Data Models Represented By Directed Acyclic Graphs 06/29/2019 https://deepai.org/publication/identification-in-missing-data-models-represented-by-directed-acyclic-graphs
rrtools: Tools for Writing Reproducible Research in R https://github.com/benmarwick/rrtools
fastshap The goal of fastshap is to provide an efficient and speedy (relative to other implementations) approach to computing approximate Shapley values which help explain the predictions from machine learning models.
Monitoring Data Quality at Scale with Statistical Modeling May 7, 2020 https://www.uber.com/blog/monitoring-data-quality-at-scale/
This standard operating procedure (SOP) document describes the default practices of the experimental research group led by Donald P. Green at Columbia University. https://github.com/acoppock/Green-Lab-SOP
ggplot2 extensions https://exts.ggplot2.tidyverse.org/gallery/
RemixAutoML website and reference manual https://github.com/AdrianAntico/RemixAutoML
Inference in Linear Regression Models with Many Covariates and Heteroscedasticity Matias D. Cattaneo, Michael Jansson, and Whitney K. Newey https://eml.berkeley.edu/~mjansson/Publications/Cattaneo-Jansson-Newey_2018_JASA.pdf
Derivation of front door adjustment without intervention on the mediator https://figshare.com/articles/journal_contribution/Derivation_of_front_door_adjustment_without_intervention_on_the_mediator/20278347/1
Satellite image datasets https://eod-grss-ieee.com/dataset-search
Point of View: How should novelty be valued in science? Barak A Cohen https://elifesciences.org/articles/28699
The Softmax function and its derivative https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
Why software projects take longer than you think: a statistical model 2019-04-15 https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html
MOVE IT OR LOSE IT: INTRODUCING PSEUDO-EARTH MOVER DIVERGENCE AS A CONTEXT-SENSITIVE METRIC FOR EVALUATING AND IMPROVING FORECASTING AND PREDICTION SYSTEMS https://events.barcelonagse.eu/live/files/2912-pemdivbarcelonapdf
The spatial allocation of population: a review of large-scale gridded population data products and their fitness for use https://eprints.soton.ac.uk/434156/1/The_spatial_allocation_of_population.pdf
https://www.youtube.com/playlist?list=PL8PYTP1V4I8D0UkqW2fEhgLrnlDW9QK7z
Robust misinterpretation of confidence intervals Rink Hoekstra & Richard D. Morey & Jeffrey N. Rouder & Eric-Jan Wagenmakers https://ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf
Transformers from Scratch Brandon Rohrer https://e2eml.school/transformers.html
Evidence on Research Transparency in Economics Edward Miguel https://www.aeaweb.org/articles?id=10.1257/jep.35.3.193
The Dance of the Mechanisms: How Observed Information Influences the Validity of Missingness Assumptions https://journals.sagepub.com/doi/10.1177/0049124118799376
The Generalizability of Survey Experiments Published online by Cambridge University Press: 12 January 2016 https://www.cambridge.org/core/journals/journal-of-experimental-political-science/article/abs/generalizability-of-survey-experiments/72D4E3DB90569AD7F2D469E9DF3A94CB
Preregistering qualitative research Tamarinde L. Haven & Leonie Van Grootel https://www.tandfonline.com/doi/full/10.1080/08989621.2019.1580147
Categorical Perception of p-Values V. N. Vimal Rao, Jeffrey K. Bye, Sashank Varma https://onlinelibrary.wiley.com/doi/10.1111/tops.12589
On the Practice of Lagging Variables to Avoid Simultaneity William Robert Reed https://onlinelibrary.wiley.com/doi/10.1111/obes.12088
Phantom Counterfactuals Tara Slough https://onlinelibrary.wiley.com/doi/10.1111/ajps.12715
“Don’t Know” Responses, Personality, and the Measurement of Political Knowledge Published online by Cambridge University Press: 19 June 2015 https://www.cambridge.org/core/journals/political-science-research-and-methods/article/abs/dont-know-responses-personality-and-the-measurement-of-political-knowledge/C28B2FF6AD8181F9F60651C0933E5620
The influence of hidden researcher decisions in applied microeconomics https://onlinelibrary.wiley.com/doi/ftr/10.1111/ecin.12992
Inference in Experiments Conditional on Observed Imbalances in Covariates Per Johansson & Mattias Nordin https://www.tandfonline.com/doi/full/10.1080/00031305.2022.2054859
Research Replication: Practical Considerations Published online by Cambridge University Press: 04 April 2018 https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/research-replication-practical-considerations/B744967268CDAA3F44103AA5C8539EA2
The self-fulfilling prophecy of post-hoc power calculations Christos Christogiannis, Stavros Nikolakopoulos, Nikolaos Pandis, Dimitris Mavridis https://www.ajodo.org/article/S0889-5406(21)00697-1/fulltext
Equinox is a JAX library based around a simple idea: represent parameterised functions (such as neural networks) as PyTrees. https://docs.kidger.site/equinox/
P-Hacking, Data Type and Data-Sharing Policy https://docs.iza.org/dp15586.pdf
UpSetR generates static UpSet plots. https://github.com/hms-dbmi/UpSetR
The purpose of the future package is to provide a very simple and uniform way of evaluating R expressions asynchronously using various resources available to the user. https://github.com/HenrikBengtsson/future
miceRanger: Fast Imputation with Random Forests https://github.com/farrellday/miceRanger
scoringRules An R package to compute scoring rules for fixed (parametric) and simulated forecast distributions. https://github.com/FK83/scoringRules
bayesdfa implements Bayesian Dynamic Factor Analysis (DFA) with Stan. https://github.com/fate-ewi/bayesdfa
dtplyr provides a data.table backend for dplyr. https://github.com/tidyverse/dtplyr
performance https://github.com/easystats/performance
BorutaShap is a wrapper feature selection method which combines the Boruta feature selection algorithm with Shapley values. https://github.com/Ekeany/Boruta-Shap
D-Lab’s Introduction to Machine Learning with tidymodels https://github.com/dlab-berkeley/R-Machine-Learning
ggVennDiagram https://github.com/gaospecial/ggVennDiagram
The {InteractionPoweR} package conducts power analyses for regression models in cross-sectional data sets where the term of interest is an interaction between two variables, also known as ‘moderation’ analyses. https://github.com/dbaranger/InteractionPoweR
varimpact uses causal inference statistics to generate variable importance estimates for a given dataset and outcome. https://github.com/ck37/varImpact
tidystringdist https://github.com/ColinFay/tidystringdist
https://github.com/CenterForPeaceAndSecurityStudies/IntroductiontoMachineLearning
Probabilistic Programming and Bayesian Methods for Hackers Chapter 1 https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Ch1_Introduction_TFP.ipynb
Papers about Causal Inference and Language https://github.com/causaltext/causal-text-papers
diffobj - Diffs for R Objects https://github.com/brodieG/diffobj
CatBoost https://github.com/catboost/catboost
Awesome Public Datasets https://github.com/awesomedata/awesome-public-datasets/blob/master/README.rst
BlackJAX is a library of samplers for JAX that works on CPU as well as GPU. https://github.com/blackjax-devs/blackjax
sensemakr: Sensitivity Analysis Tools for OLS https://github.com/carloscinelli/sensemakr
TorchArrow: a data processing library for PyTorch https://github.com/pytorch/torcharrow
causaleffect is a Python library for computing conditional and non-conditional causal effects. https://github.com/pedemonte96/causaleffect
Dynamic State Space Models in JAX. https://github.com/probml/dynamax
numpy-hilbert-curve https://github.com/PrincetonLIPS/numpy-hilbert-curve
Bayesian optimization in JAX https://github.com/PredictiveIntelligenceLab/JAX-BO
splines_in_stan.pdf https://github.com/milkha/Splines_in_Stan/blob/master/splines_in_stan.pdf
This is a repository that makes an attempt to empirically take stock of the most important concepts necessary to understand cutting-edge research in neural network models for NLP. https://github.com/neulab/nn4nlp-concepts
The EloML package provides an Elo rating system for machine learning models. The Elo Predictive Power (EPP) score helps to assess model performance based on the Elo ranking system. https://github.com/ModelOriented/EloML
SPTAG (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search https://github.com/microsoft/SPTAG
fixest: Fast and user-friendly fixed-effects estimation https://github.com/lrberge/fixest/
tidybayes: Bayesian analysis + tidy data + geoms https://github.com/mjskay/tidybayes
Milvus is an open-source vector database built to power embedding similarity search and AI applications. https://github.com/milvus-io/milvus
bpCausal: Bayesian Causal Inference with Time-Series Cross-Sectional Data. R package for A Bayesian Alternative to the Synthetic Control Method. https://github.com/liulch/bpCausal
priors.pdf https://github.com/lsbastos/Delay/blob/master/Code/priors.pdf
cheat_sheet-slabinterval.pdf https://github.com/mjskay/ggdist/blob/master/figures-source/cheat_sheet-slabinterval.pdf
tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users. https://github.com/markfairbanks/tidypolars
ftfy: fixes text for you https://github.com/rspeer/python-ftfy
ggannotate https://github.com/MattCowgill/ggannotate
Conducting and Visualizing Specification Curve Analyses The goal of specr is to facilitate specification curve analyses (Simonsohn, Simmons & Nelson, 2019; also known as multiverse analyses, see Steegen, Tuerlinckx, Gelman & Vanpaemel, 2016). https://github.com/masurp/specr
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. https://github.com/lmcinnes/umap
survminer: Survival Analysis and Visualization https://github.com/kassambara/survminer
A curated list of resources dedicated to Natural Language Processing https://github.com/keon/awesome-nlp
vip: Variable Importance Plots https://github.com/koalaverse/vip/
How To Make Your Data Analysis Notebooks More Reproducible https://github.com/karthik/rstudio2019
Awesome Self-Supervised Learning https://github.com/jason718/awesome-self-supervised-learning
dagitty Graphical Analysis of Structural Causal Models https://github.com/jtextor/dagitty
Awesome Machine Learning https://github.com/josephmisiti/awesome-machine-learning#computer-vision-5
Replication, Replication https://gking.harvard.edu/files/abs/replication-abs.shtml
Ecological Regression with Partial Identification https://gking.harvard.edu/publications/ecological-regression-partial-identification
biglasso: Extend Lasso Model Fitting to Big Data in R https://github.com/YaohuiZeng/biglasso
marginaleffects package for R https://github.com/vincentarelbundock/marginaleffects
Feature Engineering and Selection by Max Kuhn and Kjell Johnson (2019). https://github.com/topepo/FES
tidyposterior https://github.com/tidymodels/tidyposterior
R Data Science Tutorials https://github.com/ujjwalkarn/DataScienceR
janitor https://github.com/sfirke/janitor
Conformal Inference R Project Maintained by Ryan Tibshirani https://github.com/ryantibs/conformal
semTools: Useful tools for structural equation modeling https://github.com/simsem/semTools/wiki
collapse is a C/C++ based package for data transformation and statistical computing in R. https://github.com/SebKrantz/collapse
latex2exp https://github.com/stefano-meschiari/latex2exp
Tracking Progress in Natural Language Processing https://github.com/sebastianruder/NLP-progress
Introduction to R Package Idealstan Robert Kubinec December 27, 2021 https://github.com/saudiwin/idealstan
Google’s Compact Language Detector 3 is a neural network model for language identification and the successor of CLD2 (available from CRAN). https://github.com/ropensci/cld3
terra is an R package for spatial data analysis. https://github.com/rspatial/terra
rmcelreath stat_rethinking_2022 https://github.com/rmcelreath/stat_rethinking_2022#calendar--topical-outline
charlatan makes fake data, inspired from and borrowing some code from Python’s faker (https://github.com/joke2k/faker) https://github.com/ropensci/charlatan
skimr provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a https://github.com/ropensci/skimr
assertr https://github.com/ropensci/assertr
Explaining Models by Propagating Shapley Values of Local Components Hugh Chen, Scott Lundberg, Su-In Lee https://arxiv.org/abs/1911.11888
Visualizing a Million Time Series with the Density Line Chart https://idl.cs.washington.edu/files/2018-DenseLines-arXiv.pdf
What are GANs? GANs (Generative Adversarial Networks) are models used in unsupervised machine learning. https://hollobit.github.io/All-About-the-GAN/
Explaining machine learning models with SHAP and SAGE https://iancovert.com/blog/understanding-shap-sage/
The Dozen Things Experimental Economists Should Do (More of) https://ideas.repec.org/p/feb/artefa/00648.html
Synthetic Control Using Lasso (scul) https://hollina.github.io/scul/
Everything is fucked: The syllabus https://thehardestscience.com/2016/08/11/everything-is-fucked-the-syllabus/
Regression Modeling With Proportion Data (Part 1) Predicting Attendance in the German Handball-Bundesliga https://hansjoerg.me/2019/05/10/regression-modeling-with-proportion-data-part-1/
Conditional independences and causal relations implied by sets of equations Tineke Blom, Mirthe M. van Diepen, Joris M. Mooij https://jmlr.org/papers/v22/20-863.html
Researcher Degrees of Freedom Analysis https://joachim-gassen.github.io/rdfanalysis/
Evidence-Based Medicine—An Oral History Richard Smith, MBChB, CBE, FMedSci, FRCPE, FRCGP; Drummond Rennie, MD, FRCP https://jamanetwork.com/journals/jama/article-abstract/1817042
Geocomputation with R’s guide to reproducible spatial data analysis https://jakubnowosad.com/ogh2022/#/title-slide
Tutorial: JAX 101 https://jax.readthedocs.io/en/latest/jax-101/index.html
Autodidax: JAX core from scratch https://jax.readthedocs.io/en/latest/autodidax.html
Conda: Myths and Misconceptions https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/
A Statistical Method for Empirical Testing of Competing Theories Kosuke Imai Princeton University Dustin Tingley https://imai.fas.harvard.edu/research/files/mixture.pdf
The Influence of Data Pre-processing and Post-processing on Long Document Summarization Xinwei Du, Kailun Dong, Yuchen Zhang, Yongsheng Li, Ruei-Yu Tsay https://arxiv.org/abs/2112.01660
COLLIDER: A Robust Training Framework for Backdoor Data Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie https://arxiv.org/abs/2210.06704
Time Series Data Augmentation for Deep Learning: A Survey Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, Huan Xu https://arxiv.org/abs/2002.12478
Bayesian Changepoint Detection in (Num)Pyro June 8, 2021 https://irustandi.github.io/bayesian-changepoint-detection-in-numpyro.html
Modeling Regime Shifts in Multiple Time Series Etienne Gael Tajeuna, Mohamed Bouguessa, Shengrui Wang https://arxiv.org/abs/2109.09692
Why negative results? Publication of negative results is difficult in most fields, but in NLP the problem is exacerbated by the near-universal focus on improvements in benchmarks. https://insights-workshop.github.io/
Small Data, Big Decisions: Model Selection in the Small-Data Regime Jorg Bornschein, Francesco Visin, Simon Osindero https://arxiv.org/abs/2009.12583
Quantifying With Only Positive Training Data Denis dos Reis, Marcílio de Souto, Elaine de Sousa, Gustavo Batista https://arxiv.org/abs/2004.10356
Superbloom: Bloom filter meets Transformer John Anderson, Qingqing Huang, Walid Krichene, Steffen Rendle, Li Zhang https://arxiv.org/abs/2002.04723
Selection via Proxy: Efficient Data Selection for Deep Learning Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia https://arxiv.org/abs/1906.11829
Many Proxy Controls Ben Deaner https://arxiv.org/abs/2110.03973
https://datatalks.club/slack.html
The Reparameterization “Trick” As Simple as Possible in TensorFlow https://medium.com/@llionj/the-reparameterization-trick-4ff30fe92954
Mixed Models for Big Data: Explorations of a fast penalized regression approach with mgcv https://m-clark.github.io/posts/2019-10-20-big-mixed-models/
I saw your RCT and I have some worries! FAQs Macartan Humphreys 6 September 2021 https://macartan.github.io/i/notes/rct_faqs.html
Avoiding technical debt in social science research https://medium.com/pew-research-center-decoded/avoiding-technical-debt-in-social-science-research-54618194790a
Confounder Selection: Objectives and Approaches F. Richard Guo, Anton Rask Lundborg, Qingyuan Zhao https://math.papers.bar/paper/cc98597a2c2e11edaa66a71c10a887e7
Diagnosing Biased Inference with Divergences Michael Betancourt January 2017 https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html
Regression and Causality Michael Schomaker https://math.papers.bar/paper/7e46323aaf3d11eb9864394904658322
Mathematical Proof Between Generations https://math.papers.bar/paper/347f685c018d11edb9b9d35608ee6155
Document Deduplication with Locality Sensitive Hashing May 23, 2017 https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html
Mastering Shiny https://mastering-shiny.org/
How to be a modern scientist https://leanpub.com/modernscientist
Blind 75 LeetCode Questions https://leetcode.com/discuss/general-discussion/460599/blind-75-leetcode-questions
How Exactly UMAP Works And why exactly it is better than tSNE https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668
MATH 342 (Time Series) https://lbelzile.github.io/timeseRies/
Generative vs. Discriminative; Bayesian vs. Frequentist https://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/
All Bayesian Models are Generative (in Theory) https://lingpipe-blog.com/2013/05/23/all-bayesian-models-are-generative-in-theory/
Failing Grade: 89% of Introduction-to-Psychology Textbooks That Define or Explain Statistical Significance Do So Incorrectly https://journals.sagepub.com/doi/full/10.1177/2515245919858072
False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant https://journals.sagepub.com/doi/full/10.1177/0956797611417632
Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone Uri Simonsohn https://journals.sagepub.com/doi/pdf/10.1177/0956797613480366?casa_token=r3DLe47WVEcAAAAA:Ct3qoVeZQvowii2Xk4wu5TRzV0GTAzNGUeH8qPHJhb2jR9p0GEkScL1-p8JHlSlDfwfyYHGtnWyqSw
The national accounting paradox: how statistical norms corrode international economic data Daniel Mügge and Lukas Linsi https://journals.sagepub.com/doi/full/10.1177/1354066120936339
Intellectual contributions meriting authorship: Survey results from the top cited authors across all science categories Gregory S. Patience, Federico Galli, Paul A. Patience, Daria C. Boffito https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0198117
The goal of gluedown is to ease the transition from R’s powerful vectors to formatted markdown text. https://kiernann.com/gluedown/
The Nine Circles of Scientific Hell Neuroskeptic https://journals.sagepub.com/doi/10.1177/1745691612459519
The Temporal Structure of Scientific Consensus Formation Uri Shwed and Peter S. Bearman https://journals.sagepub.com/doi/10.1177/0003122410388488
Efficient estimation of generalized linear latent variable models https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0216129#:~:text=Generalized%20linear%20latent%20variable%20models%20(GLLVM)%20are%20popular%20tools%20for,from%20a%20set%20of%20sites
The Phantom Menace: Omitted Variable Bias in Econometric Research Kevin A. Clarke https://journals.sagepub.com/doi/10.1080/07388940500339183
Predicting replicability—Analysis of survey and prediction market data from large-scale forecasting projects Michael Gordon, Domenico Viganola, Anna Dreber, Magnus Johannesson, Thomas Pfeiffer Published: April 14, 2021 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0248780
Why we publish where we do: Faculty publishing values and their relationship to review, promotion and tenure expectations Meredith T. Niles, Lesley A. Schimanski, Erin C. McKiernan, Juan Pablo Alperin Published: March 11, 2020 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0228914
Reappraising the utility of Google Flu Trends Sasikiran Kandula, Jeffrey Shaman https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007258
The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets Takaya Saito, Marc Rehmsmeier Published: March 4, 2015 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
Statistically Controlling for Confounding Constructs Is Harder than You Think Jacob Westfall, Tal Yarkoni Published: March 31, 2016 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152719
Ten simple rules for collaboratively writing a multi-authored paper Marieke A. Frassl, David P. Hamilton, Blaize A. Denfeld, Elvira de Eyto, Stephanie E. Hampton, Philipp S. Keller, Sapna Sharma, Abigail S. L. Lewis, Gesa A. Weyhenmeyer, Catherine M. O’Reilly, Mary E. Lofton, Núria Catalán Published: November 15, 2018 https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006508
Analyzing Selection Bias for Credible Causal Inference: When in Doubt, DAG It Out https://journals.lww.com/epidem/fulltext/2019/07000/analyzing_selection_bias_for_credible_causal.8.aspx
Selective publication of antidepressant trials and its influence on apparent efficacy: Updated comparisons and meta-analyses of newer versus older trials Erick H. Turner, Andrea Cipriani, Toshi A. Furukawa, Georgia Salanti, Ymkje Anna de Vries Published: January 19, 2022 https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003886#sec018
Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time Robert M. Kaplan, Veronica L. Irvin Published: August 5, 2015 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132382
Test-Negative Designs: Differences and Commonalities with Other Case–Control Studies with “Other Patient” Controls https://journals.lww.com/epidem/Abstract/2019/11000/Test_Negative_Designs__Differences_and.10.aspx
Examining linguistic shifts between preprints and publications David N. Nicholson, Vincent Rubinetti, Dongbo Hu, Marvin Thielk, Lawrence E. Hunter, Casey S. Greene Published: February 1, 2022 https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001470
Break Down: Model Agnostic Explainers for Individual Predictions https://pbiecek.github.io/breakDown/
‘Trust Us’: Open Data and Preregistration in Political Science and International Relations https://osf.io/preprints/metaarxiv/8h2bp/
The Methodological Divide of Sociology - Evidence From Two Decades of Journal Publications https://osf.io/preprints/socarxiv/s59bp/
Shapley Residuals: Quantifying the limits of the Shapley value for explanations. https://par.nsf.gov/biblio/10187138-shapley-residuals-quantifying-limits-shapley-value-explanations
Activation Functions https://paperswithcode.com/methods/category/activation-functions
Software citation principles https://peerj.com/articles/cs-86/
Causality Redux: The Evolution of Empirical Methods in Accounting Research and the Growth of Quasi-Experiments https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3935088
Large-Scale Study of Curiosity-Driven Learning https://pathak22.github.io/large-scale-curiosity/
Bayes and big data: the consensus Monte Carlo algorithm https://orsociety.tandfonline.com/doi/full/10.1080/17509653.2016.1142191?casa_token=AaNmx-7IVb4AAAAA%3Af_Zh3iwRXbyNvI4Tz5Erf0UrxkvftTGLN2EXwtvBu5Je0ejMp3fOYbYpUT9R6vBlgbwU2hoid24#.X-9wZtaIZHA
Finance is Not Excused: Why Finance Should Not Flout Basic Principles of Statistics Forthcoming, Significance (Royal Statistical Society), 2021 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3895330
Bayesian Time Series Forecasting with Change Point and Anomaly Detection https://openreview.net/forum?id=rJLTTe-0W
How to translate a verbal theory into a formal model https://osf.io/preprints/metaarxiv/n7qsh/
Does Regression Produce Representative Estimates of Causal Effects? https://onlinelibrary.wiley.com/doi/abs/10.1111/ajps.12185
Mixed Hamiltonian Monte Carlo for Mixed Discrete and Continuous Variables https://papers.nips.cc/paper/2020/file/c6a01432c8138d46ba39957a8250e027-Paper.pdf
Specification Curve: Descriptive and Inferential Statistics on All Reasonable Specifications https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2694998
The Standard Errors of Persistence Morgan Kelly https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3398303
https://papers.labml.ai/lists
The International Political Economy Data Resource https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534067
LightGBM: A Highly Efficient Gradient Boosting Decision Tree https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
Notes on Changing from Rmarkdown/Bookdown to Quarto https://www.njtierney.com/post/2022/04/11/rmd-to-qmd/
Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014-2017) https://osf.io/preprints/metaarxiv/9sz2y/
Teaching Safe-Stats, Not Statistical Abstinence https://nhorton.people.amherst.edu/mererenovation/17_Wickham.PDF
Quantile Regression With LightGBM https://notebook.community/ethen8181/machine-learning/ab_tests/quantile_regression/quantile_regression
Forecasting: Principles and Practice https://otexts.com/fpp3/
Comparison of Preregistration Platforms https://osf.io/preprints/metaarxiv/zry2u
https://nips.cc/Conferences
https://opensyllabus.org/
Deep Learning Yoshua Bengio
An Introduction to Statistical Learning Gareth James, Daniela Witten, Trevor Hastie
Critical Questions for Big Data Danah Boyd, Kate Crawford
The Elements of Statistical Learning Trevor Hastie
Mostly Harmless Econometrics Joshua D. Angrist
Counterfactuals and Causal Inference Stephen L. Morgan
Machine Learning: A Probabilistic Perspective Kevin P. Murphy
Causality: Models, Reasoning, and Inference Judea Pearl
Methods to Estimate Causal Effects - An Overview on IV, DiD and RDD and a Guide on How to Apply them in Practice https://osf.io/preprints/socarxiv/usvta/
WINNER’S CURSE? ON PACE, PROGRESS, AND EMPIRICAL RIGOR https://openreview.net/pdf?id=rJWF0Fywf
NLP Highlights Podcast https://open.spotify.com/show/4tGHzmicSHIVU3ksf5iYv8
A method to streamline p-hacking https://open.lnu.se/index.php/metapsychology/article/view/2529
Machine Learning Theory - Part 3: Regularization and the Bias-variance Trade-off https://mostafa-samir.github.io/ml-theory-pt3/
NeetCode https://neetcode.io/
Detecting p-Hacking https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA18583
Is Temperature Exogenous? The Impact of Civil Conflict on the Instrumental Climate Record in Sub-Saharan Africa Kenneth A. Schultz, Justin S. Mankin First published: 28 March 2019 https://onlinelibrary.wiley.com/doi/abs/10.1111/ajps.12425
Making apples from oranges: Comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets Rhian Daniel, Jingjing Zhang, Daniel Farewell First published: 14 December 2020 https://onlinelibrary.wiley.com/doi/full/10.1002/bimj.201900297
vtree a flexible R package for displaying nested subsets of a data frame https://nbarrowman.github.io/vtree.html
Election polling errors across time and space Will Jennings & Christopher Wlezien https://www.nature.com/articles/s41562-018-0315-6
Bayesian Estimation of Signal Detection Models https://mvuorre.github.io/posts/2017-10-09-bayesian-estimation-of-signal-detection-theory-models/
Add to feature engineering The xspliner package is a collection of tools for training interpretable surrogate ML models. https://modeloriented.github.io/xspliner/index.html
Observational Studies https://muse.jhu.edu/issue/48885
Bringing more causality to analytics https://motifanalytics.medium.com/bringing-more-causality-to-analytics-d378108bb15
Nice stats note https://moultano.wordpress.com/2013/08/09/logs-tails-long-tails/
Figuring out why my object detection model is underperforming with FiftyOne, a great tool you probably haven’t heard of https://mlops.systems/redactionmodel/computervision/tools/debugging/jupyter/2022/03/12/fiftyone-computervision.html
A ModernDive into R and the Tidyverse https://moderndive.com/index.html
Hopfield Networks is All You Need https://ml-jku.github.io/hopfield-layers/
Rediscovering Bayesian Structural Time Series June 7, 2020 https://minimizeregret.com/post/2020/06/07/rediscovering-bayesian-structural-time-series/
Prophet https://www.youtube.com/watch?v=pOYAXv15r3A&feature=emb_logo
A paper is the tip of an iceberg https://minhlab.wordpress.com/2017/03/18/a-paper-is-the-tip-of-the-iceberg/
Geometric Intuition for Training Neural Networks https://sea-adl.org/2019/11/25/geometric-intuition-for-training-neural-networks/
Latent Variable Modelling in brms January 20, 2020 https://scottclaessens.github.io/blog/2020/brmsLV/
Robust Empirical Bayes Confidence Intervals https://scholar.princeton.edu/mikkelpm/ebci
Bias of OLS Estimators due to Exclusion of Relevant Variables and Inclusion of Irrelevant Variables Deepankar Basu https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1257&context=econ_workingpaper
scikit-survival https://scikit-survival.readthedocs.io/en/latest/index.html
Scikit-learn’s Defaults are Wrong https://ryxcommar.com/2019/08/30/scikit-learns-defaults-are-wrong/
Selecting on the DV Design, Inference, and the Strategic Logic of Suicide Terrorism: A Rejoinder https://scholar.princeton.edu/sites/default/files/rejoinder3.pdf
Sampling from weird probability distributions Alan R. Pearse 6 July 2019 https://rpubs.com/a_pear_9/weird_distributions
Outliers: Love’em or leave’em João Neto April 2020 https://rpubs.com/jpn3to/outliers
Synthetic controls with staggered adoption Eli Ben-Michael, Avi Feller, Jesse Rothstein https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12448
Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends https://pubs.aeaweb.org/doi/pdfplus/10.1257/aeri.20210236?cookieSet=1
(PROTOTYPE) INTRODUCTION TO NAMED TENSORS IN PYTORCH Author: Richard Zou https://pytorch.org/tutorials/intermediate/named_tensor_tutorial.html
Comparing meta-analyses and preregistered multiple-laboratory replication projects https://pubmed.ncbi.nlm.nih.gov/31873200/
Does retraction after misconduct have an impact on citations? A pre-post study https://pubmed.ncbi.nlm.nih.gov/33187964/
Sparsity information and regularization in the horseshoe and other shrinkage priors Juho Piironen, Aki Vehtari https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-11/issue-2/Sparsity-information-and-regularization-in-the-horseshoe-and-other-shrinkage/10.1214/17-EJS1337SI.full
A Word of Caution about Many Labs 4: If You Fail to Follow Your Preregistered Plan, You May Fail to Find a Real Effect https://psyarxiv.com/ejubn
Discrepancies between meta-analyses and subsequent large randomized, controlled trials https://pubmed.ncbi.nlm.nih.gov/9262498/
Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective https://research.facebook.com/publications/applied-machine-learning-at-facebook-a-datacenter-infrastructure-perspective/
An overview of systematic reviews found suboptimal reporting and methodological limitations of mediation studies investigating causal mechanisms https://pubmed.ncbi.nlm.nih.gov/30904567/
That’s a lot to Process! Pitfalls of Popular Path Models Julia M. Rohrer, Paul Hünermund, Ruben C. Arslan, Malte Elson https://psyarxiv.com/paeb7/
The Matrix-F Prior for Estimating and Testing Covariance Matrices Joris Mulder, Luis Raúl Pericchi https://projecteuclid.org/journals/bayesian-analysis/volume-13/issue-4/The-Matrix-F-Prior-for-Estimating-and-Testing-Covariance-Matrices/10.1214/17-BA1092.full
Sample Size Justification Daniel Lakens https://psyarxiv.com/9d3yf
A unified view on Bayesian varying coefficient models Maria Franco-Villoria, Massimo Ventrucci, Håvard Rue https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-13/issue-2/A-unified-view-on-Bayesian-varying-coefficient-models/10.1214/19-EJS1653.full
Introduction to the concept of likelihood and its applications Alexander Etz https://psyarxiv.com/85ywt
Tapped Out or Barely Tapped? Recommendations for How to Harness the Vast and Largely Unused Potential of the Mechanical Turk Participant Pool Jonathan Robinson, Cheskie Rosenzweig, Aaron J Moss, Leib Litman https://psyarxiv.com/jq589
Share the code, not just the data: A case study of the reproducibility of articles published in the Journal of Memory and Language under the open data policy Anna Laurinavichyute, Himanshu Yadav, Shravan Vasishth https://psyarxiv.com/hf297/
Play with Generative Adversarial Networks (GANs) in your browser! https://poloclub.github.io/ganlab/
Probabilistic Machine Learning: An Introduction https://probml.github.io/pml-book/book1.html
How Good are FiveThirtyEight Forecasts https://projects.fivethirtyeight.com/checking-our-work/
plotnine is an implementation of a grammar of graphics in Python, it is based on ggplot2 https://plotnine.readthedocs.io/en/stable/
Doubly Robust Difference-in-Differences https://psantanna.com/DRDID/
Locally Adaptive Smoothing with Markov Random Fields and Shrinkage Priors James R. Faulkner, Vladimir N. Minin https://projecteuclid.org/journals/bayesian-analysis/volume-13/issue-1/Locally-Adaptive-Smoothing-with-Markov-Random-Fields-and-Shrinkage-Priors/10.1214/17-BA1050.full
Introduction to Probability for Data Science https://probability4datascience.com/
The Design Space of Computational Notebooks: An Analysis of 60 Systems in Academia and Industry https://pg.ucsd.edu/publications/computational-notebooks-design-space_VLHCC-2020.pdf
Analyzing Minard’s Visualization Of Napoleon’s 1812 March https://thoughtbot.com/blog/analyzing-minards-visualization-of-napoleons-1812-march
A course in Time Series Analysis https://web.stat.tamu.edu/~suhasini/teaching673/time_series.pdf
When and How Should One Use Deep Learning for Causal Effect Inference https://technionmail-my.sharepoint.com/personal/urishalit_technion_ac_il/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Furishalit%5Ftechnion%5Fac%5Fil%2FDocuments%2FPresentations%2FIAS2018%2FIAS2018%5F2%5Ffor%5Fpublic%2Epdf&parent=%2Fpersonal%2Furishalit%5Ftechnion%5Fac%5Fil%2FDocuments%2FPresentations%2FIAS2018&ga=1
A Primer on Pólya-gamma Random Variables - Part II: Bayesian Logistic Regression https://tiao.io/post/polya-gamma-bayesian-logistic-regression/
treeheatr https://trang1618.github.io/treeheatr/
DiD Reading Group https://taylorjwright.github.io/did-reading-group/
Why is it that natural log changes are percentage changes? What is about logs that makes this so? https://stats.stackexchange.com/questions/244199/why-is-it-that-natural-log-changes-are-percentage-changes-what-is-about-logs-th
STATISTICAL PARADISES AND PARADOXES IN BIG DATA (I): LAW OF LARGE POPULATIONS, BIG DATA PARADOX, AND THE 2016 US PRESIDENTIAL ELECTION https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf
Prior predictive checks for Bayesian regression. https://engineeringdecisionanalysis.shinyapps.io/Priors/?_ga=2.230023708.1800474280.1612547156-513452691.1612547156
No, you can’t explain what a p-value is with one sentence (Parts I, II) https://statsepi.substack.com/p/no-you-cant-explain-what-a-p-value
Does it make sense to log-transform the dependent when using Gradient Boosted Trees? https://stats.stackexchange.com/questions/262114/does-it-make-sense-to-log-transform-the-dependent-when-using-gradient-boosted-tr/263753#263753
Why is Euclidean distance not a good metric in high dimensions? https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions
Plotting partial pooling in mixed-effects models https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/
A Monte Carlo study on methods for handling class imbalance https://static1.squarespace.com/static/58a7d1e52994ca398697a621/t/5a2833cec83025cca6b99ff8/1512584144990/manuscript.pdf
Billion-scale semantic similarity search with FAISS+SBERT https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2
The caret Package Max Kuhn 2019-03-27 https://topepo.github.io/caret/index.html
Custom Loss Functions for Gradient Boosting Optimize what matters https://towardsdatascience.com/custom-loss-functions-for-gradient-boosting-f79c1b40466d
What is the trade-off between batch size and number of iterations to train a neural network? https://stats.stackexchange.com/questions/164876/what-is-the-trade-off-between-batch-size-and-number-of-iterations-to-train-a-neu/236393#236393
Generalized Instrumental Variables https://arxiv.org/pdf/1301.0560.pdf
experimenter demand effects (EDEs)—bias that occurs when participants infer the purpose of an experiment and respond so as to help confirm a researcher’s hypothesis https://www.cambridge.org/core/journals/american-political-science-review/article/abs/demand-effects-in-survey-experiments-an-empirical-assessment/043386DC63A69098E859414EF9932EBC
An Overview of Google’s Work on AutoML and Future Directions Jun 14, 2019 https://slideslive.com/38917526/an-overview-of-googles-work-on-automl-and-future-directions?locale=en
The other kind of machine learning regression — unmeasured method performance https://stuart-reynolds.medium.com/the-other-kind-of-machine-learning-regression-unmeasured-method-performance-81b7eb00efda
The tidyverse style guide https://style.tidyverse.org/index.html
STOP CONFOUNDING YOURSELF! STOP CONFOUNDING YOURSELF! https://slatestarcodex.com/2014/04/26/stop-confounding-yourself-stop-confounding-yourself/
Causal model and theory Suparna Chaudhry and Andrew Heiss 2021-05-26 https://stats.andrewheiss.com/donors-crackdowns-aid/00_causal-model-theory.html
Stanza – A Python NLP Package for Many Human Languages https://stanfordnlp.github.io/stanza/
Inference for deterministic simulation models: The Bayesian melding approach https://sites.stat.washington.edu/raftery/Research/PDF/poole2000.pdf
Unofficial guidance on various topics by Social Science Data Editors https://social-science-data-editors.github.io/guidance/
Welcome to The Advanced Matrix Factorization Jungle https://sites.google.com/site/igorcarron2/matrixfactorizations
Exploring Enterprise Databases with R: A Tidyverse Approach https://smithjd.github.io/sql-pet/
Even with randomization, mediation analysis can still be confounded https://www.r-bloggers.com/2019/04/even-with-randomization-mediation-analysis-can-still-be-confounded/
Inference and Prediction Diverge in Biomedicine https://www.cell.com/patterns/fulltext/S2666-3899(20)30160-4
Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning https://www.cis.upenn.edu/~jean/math-deep.pdf
The conditional nature of publication bias: a meta-regression analysis Published online by Cambridge University Press: 11 May 2020 https://www.cambridge.org/core/journals/political-science-research-and-methods/article/abs/conditional-nature-of-publication-bias-a-metaregression-analysis/40C0A166F3ED1516A051C5ED270D1650
A Practical Guide to Weak Instruments Michael Keane and Timothy Neal https://uc822f03065d525621a0034d9737.dl.dropboxusercontent.com/cd/0/inline2/Bwkck59j4aGcZXzlbeX5dTlWFJ75Q4t_Aw8oEzfBlrtKDJ0UQT2snWJVT2up4b1-hAQXypGx3CI1GMrv7IzuLn3qml-_qg3e7n9WFySMEteOh8YarvEvY0co5iYeI7ah1ppzWGgI3CLOk-5aStOsOeAY9TIcEABvPqXkGLXgZd6eXOFKRBv-OlDPL3mcixiiBC2OoLXuBymI3IyQIzTE2BPwCLdFAMijckE-VG6tTEnG7yAEiJwOGXnwoFk6gB7td51Loi_1f26t_3zcBSHgpjBD_yVRhmb_R_Nt0kxdh3Nvhm0rueJcbzf1-gkqXGgZApK5Rc3JdAi7woThAAkD1hGko0HSYQT7SIdRBZZ28FpMer2sZVBkFXpY_9o-nefwJiFcbyIaiuqQVvQckMw6QWx_L4nJRL1Btd7ztnss1dJ_YA/file
Why You Should (or Shouldn’t) be Using Google’s JAX in 2022 https://www.assemblyai.com/blog/why-you-should-or-shouldnt-be-using-jax-in-2022/
A guide to working with country-year panel data and Bayesian multilevel models https://www.andrewheiss.com/blog/2021/12/01/multilevel-models-panel-data-guide/
Statistical Significance Annual Review of Statistics and Its Application https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-031219-041051
Bayesian Additive Regression Trees: A Review and Look Forward Annual Review of Statistics and Its Application https://www.annualreviews.org/doi/abs/10.1146/annurev-statistics-031219-041110
Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work https://www.amazon.ca/Bad-Data-Handbook-Cleaning-Back/dp/1449321887
Data Replication with Code Ocean – A How-To Guide for PA Authors Simon Heuberger October 2, 2019 https://www.american.edu/spa/data-science/upload/authors_how_to.pdf
Statistical Significance, p-Values, and the Reporting of Uncertainty Guido W. Imbens https://www.aeaweb.org/articles?id=10.1257/jep.35.3.157
Methods Matter: P-Hacking and Publication Bias in Causal Analysis in Economics By Abel Brodeur, Nikolai Cook, and Anthony Heyes https://www.aeaweb.org/content/file?id=12747
Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects Alberto Abadie https://www.aeaweb.org/articles?id=10.1257/jel.20191450&from=f
Star Wars: The Empirics Strike Back Abel Brodeur, Mathias Lé, Marc Sangnier, Yanos Zylberberg American Economic Journal: Applied Economics Vol. 8, No. 1, January 2016 https://www.aeaweb.org/articles?id=10.1257/app.20150044
Beware performative reproducibility Well-meant changes to improve science could become empty gestures unless underlying values change. https://www.nature.com/articles/d41586-021-01824-z
Plausibly Exogenous Timothy G. Conley, Christian B. Hansen, Peter E. Rossi https://direct.mit.edu/rest/article-abstract/94/1/260/57981/Plausibly-Exogenous
Deep Learning on Electronic Medical Records is doomed to fail Originally posted 2022-03-22 https://www.moderndescartes.com/essays/deep_learning_emr/
Collider bias undermines our understanding of COVID-19 disease risk and severity https://www.nature.com/articles/s41467-020-19478-2
Bayesian analysis of tests with unknown specificity and sensitivity Andrew Gelman, Bob Carpenter https://www.medrxiv.org/content/10.1101/2020.05.22.20108944v3
The One Standard Error Rule for Model Selection: Does It Work? Yuchen Chen and Yuhong Yang https://www.mdpi.com/2571-905X/4/4/51
MODELING COVARIANCE MATRICES IN TERMS OF STANDARD DEVIATIONS AND CORRELATIONS, WITH APPLICATION TO SHRINKAGE John Barnard, Robert McCulloch and Xiao-Li Meng https://www.jstor.org/stable/24306780#metadata_info_tab_contents
Regression and Other Stories, with Andrew Gelman, Jennifer Hill & Aki Vehtari podcast https://learnbayesstats.com/episode/20-regression-and-other-stories-with-andrew-gelman-jennifer-hill-aki-vehtari/
Causal Inference: What If (the book) https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/2022/10/hernanrobins_WhatIf_15sep22.pdf
Statistical Comparisons of Classifiers over Multiple Data Sets https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf
CRITICAL VALUES ROBUST TO P-HACKING https://www.pascalmichaillat.org/12.html
P-value, compatibility, and S-value Mohammad Ali Mansournia, Maryam Nazemipour, Mahyar Etminan https://www.sciencedirect.com/science/article/pii/S2590113322000153?via%3Dihub
The Dunning-Kruger effect is (mostly) a statistical artefact: Valid approaches to testing the hypothesis with individual differences data Gilles E. Gignac, Marcin Zajenkowski https://www.sciencedirect.com/science/article/abs/pii/S0160289620300271
Natural Scales in Geographical Patterns Telmo Menezes and Camille Roth https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5379183/
Making Sense of Sensitivity: Extending Omitted Variable Bias January 2018 https://www.researchgate.net/publication/322509816_Making_Sense_of_Sensitivity_Extending_Omitted_Variable_Bias
A Survey of Methods for Time Series Change Point Detection Samaneh Aminikhanghahi and Diane J. Cook https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5464762/
Negative Controls: A Tool for Detecting Confounding and Bias in Observational Studies https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3053408/
Introduction to Facebook AI Similarity Search (Faiss) https://www.pinecone.io/learn/faiss-tutorial/
Is probabilistic bias analysis approximately Bayesian? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3257063/
Everything you always wanted to know about evaluating prediction models (but were too afraid to ask) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2997853/
Should We Trust Clustered Standard Errors? A Comparison with Randomization-Based Methods Lourenço S. Paz & James E. West https://www.nber.org/papers/w25926
The Value of Statistical Life: A Meta-analysis of Meta-analyses H. Spencer Banzhaf https://www.nber.org/papers/w29185
A large-scale study on research code quality and execution Ana Trisovic, Matthew K. Lau, Thomas Pasquier & Mercè Crosas https://www.nature.com/articles/s41597-022-01143-6
SELECTION INTO IDENTIFICATION IN FIXED EFFECTS MODELS, WITH APPLICATION TO HEAD START https://www.nber.org/system/files/working_papers/w26174/w26174.pdf
Fast and effective pseudo transfer entropy for bivariate data-driven causal inference Riccardo Silini & Cristina Masoller Scientific Reports volume 11, Article number: 8423 (2021) https://www.nature.com/articles/s41598-021-87818-3
Variable selection in the presence of missing data: resampling and imputation Qi Long https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5156376/
Consensus features nested cross-validation Saeid Parvandeh, Hung-Wen Yeh, Martin P Paulus, and Brett A McKinney https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7776094/
Bayesian Approaches for Missing Not at Random Outcome Data: The Role of Identifying Restrictions Antonio R. Linero and Michael J. Daniels https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6936760/
Do Pre-Registration and Pre-analysis Plans Reduce p-Hacking and Publication Bias? Abel Brodeur, Nikolai Cook, Jonathan Hartley, Anthony Heyes https://osf.io/preprints/metaarxiv/uxf39/
Elements of Information Theory, 2nd Edition https://www.wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959
Why Data Is Never Raw https://www.thenewatlantis.com/publications/why-data-is-never-raw
When U.S. air force discovered the flaw of averages https://www.thestar.com/news/insight/2016/01/16/when-us-air-force-discovered-the-flaw-of-averages.html
A History of Scientific Journals Publishing at the Royal Society, 1665-2015 Aileen Fyfe, Noah Moxham, Julie McDougall-Waters, and Camilla Mørk Røstvik https://www.uclpress.co.uk/products/187262
Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban Ronald D. Fricker Jr., Katherine Burke, Xiaoyan Han & William H. Woodall https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1537892
Oh No! I Got the Wrong Sign! What Should I Do? Peter E. Kennedy https://www.tandfonline.com/doi/abs/10.3200/JECE.36.1.77-92
Prediction, Estimation, and Attribution Bradley Efron https://www.tandfonline.com/doi/full/10.1080/01621459.2020.1762613
The Gaussian Graphical Model in Cross-Sectional and Time-Series Data Sacha Epskamp, Lourens J. Waldorp, René Mõttus & Denny Borsboom Pages 453-480 | Published online: 16 Apr 2018 https://www.tandfonline.com/doi/full/10.1080/00273171.2018.1454823
Jackknife Standard Errors for Clustered Regression Bruce E. Hansen August 2022 https://www.ssc.wisc.edu/~bhansen/papers/tcauchy.pdf
A Five-Star Guide for Achieving Replicability and Reproducibility When Working with GIS Software and Algorithms https://www.tandfonline.com/doi/abs/10.1080/24694452.2020.1806026?journalCode=raag21
Some useful equations for nonlinear regression in R Andrea Onofri 2019-01-08 https://www.statforbiology.com/nonlinearregression/usefulequations
Random Forests for Spatially Dependent Data Arkajyoti Saha, Sumanta Basu & Abhirup Datta Received 02 Dec 2020 https://www.tandfonline.com/doi/abs/10.1080/01621459.2021.1950003
Social Science Reproduction Platform (SSRP) is an openly licensed platform that facilitates the sourcing, cataloging, and review of attempts to verify and improve the computational reproducibility of social science research. https://www.socialsciencereproduction.org/about
Non-Standard Errors https://orbilu.uni.lu/bitstream/10993/48686/1/SSRN-id3961574.pdf
Do growth mindset interventions impact students’ academic achievement? A systematic review and meta-analysis with recommendations for best practices. https://psycnet.apa.org/record/2023-14088-001
Bayesian inference with INLA https://becarioprecario.bitbucket.io/inla-gitbook/index.html
Measurement Models http://cfariss.com/documents/FarissKenwickReuning2019_MesurmentModels.pdf
Distinguishing cause from effect using observational data: methods and benchmarks https://arxiv.org/abs/1412.3773v3
The Effect: An Introduction to Research Design and Causality https://theeffectbook.net/
Feature Engineering and Selection: A Practical Approach for Predictive Models https://bookdown.org/max/FES/
Feature Interactions in XGBoost https://arxiv.org/abs/2007.05758
How should variable selection be performed with multiply imputed data? https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.3177
Statistical Nonsignificance in Empirical Economics https://www.aeaweb.org/articles?id=10.1257/aeri.20190252&from=f
Deep Learning https://www.deeplearningbook.org/
On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives https://arxiv.org/abs/1902.10286
Why Propensity Scores Should Not Be Used for Matching https://gking.harvard.edu/publications/why-propensity-scores-should-not-be-used-formatching
What are the most important statistical ideas of the past 50 years? https://arxiv.org/pdf/2012.00174.pdf
Automatic Differentiation Variational Inference https://www.jmlr.org/papers/volume18/16-107/16-107.pdf
Methodology over metrics: current scientific standards are a disservice to patients and society https://www.jclinepi.com/article/S0895-4356(21)00170-0/fulltext
Let’s Put Garbage-Can Regressions and Garbage-Can Probits Where They Belong https://journals.sagepub.com/doi/10.1080/07388940500339167
Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition https://bookdown.org/content/4857/
Random Walk: A Modern Introduction https://math.uchicago.edu/~lawler/srwbook.pdf
On the reliability of published findings using the regression discontinuity design in political science https://arxiv.org/abs/2109.14526
Exploring the Dynamics of Latent Variable Models https://www.cambridge.org/core/journals/political-analysis/article/abs/exploring-the-dynamics-of-latent-variable-models/CBE116F37900DAE957B2D7EB53DB0907#.X7h7GMnwHwM.twitter
Cross-validation: what does it estimate and how well does it do it? https://arxiv.org/abs/2104.00673
What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory https://journals.sagepub.com/doi/abs/10.1177/00031224211004187#:~:text=The%20estimand%20is%20the%20target,purpose%20of%20the%20statistical%20analysis.&text=By%20grounding%20all%20three%20steps,connects%20statistical%20evidence%20to%20theory
Measurement error and the replication crisis https://www.science.org/doi/10.1126/science.aal3618
Bayesian Modeling and Computation in Python https://bayesiancomputationbook.com/welcome.html
The Separation Plot: A New Visual Method for Evaluating the Fit of Binary Models https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-5907.2011.00525.x
Causal Inference for The Brave and True https://matheusfacure.github.io/python-causality-handbook/landing-page.html
A Parsimonious Tour of Bayesian Model Uncertainty https://arxiv.org/abs/1902.05539
The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf
Prediction, Estimation, and Attribution https://efron.ckirby.su.domains//papers/2019PredictEstimatAttribut.pdf
Difference-in-Differences with a Continuous Treatment https://psantanna.com/files/Callaway_Goodman-Bacon_SantAnna_2021.pdf
A Practical Introduction to Regression Discontinuity Designs: Foundations https://arxiv.org/pdf/1911.09511.pdf
The influence of decision-making in tree ring-based climate reconstructions https://www.nature.com/articles/s41467-021-23627-6
The Influence of Hidden Researcher Decisions in Applied Microeconomics https://docs.iza.org/dp13233.pdf
Introducing geofacet https://ryanhafen.com/blog/geofacet/
The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
Cross-validation FAQ Aki Vehtari First version 2020-03-11. Last modified 2022-07-30. https://avehtari.github.io/modelselection/CV-FAQ.html
Shapley values for feature selection: The good, the bad, and the axioms Daniel Fryer, Inga Strümke, Hien Nguyen https://arxiv.org/abs/2102.10936
A Crash Course in Good and Bad Controls Carlos Cinelli, Andrew Forney, Judea Pearl https://ftp.cs.ucla.edu/pub/stat_ser/r493.pdf
Reinforcement Learning in R Nicolas Pröllochs, Stefan Feuerriegel https://arxiv.org/abs/1810.00240
Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure https://onlinelibrary.wiley.com/doi/pdf/10.1111/ecog.02881
NumPyro https://github.com/pyro-ppl/numpyro
Replacing the do-calculus with Bayes rule https://arxiv.org/pdf/1906.07125.pdf
Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results https://academic.oup.com/qje/article-abstract/134/2/557/5195544?redirectedFrom=fulltext&login=false
A Survey on Societal Event Forecasting with Deep Learning https://arxiv.org/pdf/2112.06345.pdf
Has the Credibility of the Social Sciences Been Credibly Destroyed? Reanalyzing the “Many Analysts, One Data Set” Project https://journals.sagepub.com/doi/full/10.1177/23780231211024421
Political Event Coding as Text to Text Sequence Generation https://yaoyaodai.github.io/files/CASE_2022.pdf
Bayesian Thinking for Toddlers https://psyarxiv.com/w5vbp/
The Dunning-Kruger Effect is Autocorrelation https://economicsfromthetopdown.com/2022/04/08/the-dunning-kruger-effect-is-autocorrelation/
Causal Inference and Its Applications in Online Industry https://alexdeng.github.io/causal/
Papers about Causal Inference and Language https://github.com/causaltext/causal-text-papers
Achieving Statistical Significance with Control Variables and Without Transparency https://www.cambridge.org/core/journals/political-analysis/article/abs/achieving-statistical-significance-with-control-variables-and-without-transparency/1E867C357835019E0C9322B918414045
Questionable research practices among researchers in the most research-productive management programs https://onlinelibrary.wiley.com/doi/10.1002/job.2623
The problem of the missing dead Sophia Dawkins https://orcid.org/0000-0002-2609-0820
DeclareDesign https://declaredesign.org/
Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty https://www.pnas.org/doi/full/10.1073/pnas.2203150119
How to avoid machine learning pitfalls: a guide for academic researchers https://arxiv.org/pdf/2108.02497.pdf
Multiple Imputation Through XGBoost https://arxiv.org/abs/2106.01574
TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. https://nlp.stanford.edu/projects/tacred/
Understanding lime https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html
Race to the bottom: Spatial aggregation and event data https://www.tandfonline.com/doi/abs/10.1080/03050629.2022.2025365?casa_token=wrWE–FIltIAAAAA%3AU5Dsm6FMC_1wN1GKsdbEyveqc7XKFEe2beBsBxSVjVopzFgrJdYgfQ9gvW0nL17UUSyAIR5_8Kg&journalCode=gini20
I saw your RCT and I have some worries! FAQs Macartan Humphreys https://macartan.github.io/i/notes/rct_faqs.html
Measuring the landscape of civil war: Evaluating geographic coding decisions with historic data from the Mau Mau rebellion https://journals.sagepub.com/eprint/dRCkdD4ZWSp99x8cinAV/full
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale https://arxiv.org/abs/2208.07339
Understanding Machine Learning: From Theory to Algorithms https://www.cs.huji.ac.il/w~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf
Known broken ones that didn’t get sorted
Confounder Selection: Objectives and Approaches F. Richard Guo, Anton Rask Lundborg, Qingyuan Zhao
https://arxiv.org/abs/2011.01808
https://arxiv.org/abs/2108.02497