ML and Stats Notes


This is an attempt to organize the more than 10,000 links to resources I’ve tweeted over the last decade, along with several thousand Zotero entries, into coherent, quick-to-find categories. For some topics, I am working toward providing code examples in multiple languages. Consider everything here a rough draft; some topics are still empty placeholders.

―The Sociology and Economics of Data―

1 Literature

Broad Literature/Textbooks Unsorted
Statistics Information Theory

2 Data and Benchmarks

Imagery Text

3 Industry

Jobs Salaries
Coding Practice Interview Practice

4 Meta-Science

Open Science Null Results
Sociology of Inference Meta-Reviews

5 Communicating Results


6 Production

Monitoring Experiments

―Managing Data―

7 Computer Languages

Bash Git SQL
R Python
PyTorch TensorFlow
JAX NumPyro

8 Databases

DuckDB Postgres Arrow

9 Mathematical and Data Structures

null NaN
Floating Point Numbers (floats)
set list dictionary
trees tensor
table stacks_queues
linked_lists hash_table
graphs function

10 Data Transformations

filter joins
fuzzyrecordmatching regex

11 Data Loading

dataloaders Google Sheets

―Problems of Population―

12 Statistics

Statistic Population
Domain Variation
Consistency Efficiency

13 Random Variables

Random Variable
independent and identically distributed (iid) exchangeable

14 Probability

Probability Case Studies probability_distribution
probability density mass function entropy
maximum entropy probability distribution
Normal Distribution Poisson Distribution

―Problems of Distribution―

15 Hypothesis Testing

Hypothesis Testing Statistical Power
Critiques of Hypothesis Testing

―Problems of Specification―

16 Causal Identification Strategies

Randomized Controlled Trial Regression Discontinuity
Difference in Differences Double ML
Fixed Effects Granger
Instrumental Variables Placebo Tests
Synthetic Controls

17 Regression and Causality (the 12 Assumptions)

1 No Unmeasured Confounders 2 Correct Model Specification
3 No conditioning on a collider 4 No conditioning on a mediator
5 Positivity 6 Consistency
7 No interference 8 No relevant effect modification
9 Collapsibility 10 Compliance
11 The Missing Data Mechanism 12 No relevant measurement error

18 Multilevel Modeling

(Mixed Effect / Hierarchical / Random / Variance Component / Nested)

Fully Pooled Random Effects
1-Way Fixed Effects
2-way Fixed Effects

19 Calibration


20 Images

Annotating Images
YOLO Remote Sensing

21 Text

Large Language Models (LLMs)

22 Time Series

Kalman Filter