ML and Stats Notes
[FOREVER UNDER CONSTRUCTION]
This is an attempt to organize the more than 10,000 links to resources I’ve tweeted over the last decade, along with several thousand Zotero entries, into coherent, quick-to-find categories. For some topics, I am working toward providing code examples in multiple languages. Consider everything here a rough draft; some topics are still empty placeholders.
―The Sociology and Economics of Data―
1 Literature
• Broad Literature/Textbooks | • Unsorted |
• Statistics | • Information Theory |
• | • |
2 Data and Benchmarks
• Imagery | • Text |
3 Industry
• Jobs | • Salaries |
• Coding Practice | • Interview Practice |
4 Meta-Science
• Open Science | • Null Results |
• Sociology of Inference | • Meta-Reviews |
5 Communicating Results
• Markdown |
6 Production
• Monitoring Experiments |
―Managing Data―
7 Computer Languages
• Bash | • Git | • SQL |
• R | • Python | |
• Stan | ||
• PyTorch | • TensorFlow | |
• JAX | • NumPyro |
8 Databases
• DuckDB | • Postgres | • Arrow |
9 Mathematical and Data Structures
• null | • NaN | |
• boolean | • | |
• Floating Point Numbers (floats) | ||
• set | • list | • dictionary |
• trees | • tensor | |
• table | • stacks_queues | |
• linked_lists | • hash_table | |
• graphs | • function |
10 Data Transformations
• filter | • joins |
• fuzzy record matching | • regex |
11 Data Loading
• dataloaders | • Google Sheets |
―Problems of Population―
12 Statistics
• Statistic | • Population |
• Domain | • Variation |
• Consistency | • Efficiency |
• Sufficiency | • |
13 Random Variables
• Random Variable | • |
• independent and identically distributed (iid) | • exchangeable |
14 Probability
• Probability Case Studies | • probability_distribution |
• probability density/mass function | • entropy |
• | • maximum entropy probability distribution |
• Normal Distribution | • Poisson Distribution |
―Problems of Distribution―
15 Hypothesis Testing
• Hypothesis Testing | • Statistical Power |
• Critiques of Hypothesis Testing | • |
• | • |
• | • |
• | • |
• | • |
―Problems of Specification―
16 Causal Identification Strategies
• Randomized Controlled Trial | • Regression Discontinuity |
• Difference in Differences | • Double ML |
• Fixed Effects | • Granger |
• Instrumental Variables | • Placebo Tests |
• Synthetic Controls | • |
17 Regression and Causality (the 12 Assumptions)
18 Multilevel Modeling
(Mixed Effect / Hierarchical / Random / Variance Component / Nested)
• Fully Pooled | • Random Effects |
• 1-Way Fixed Effects | |
• 2-way Fixed Effects |
19 Calibration
• Calibration |
20 Images
• Annotating Images | • |
• YOLO | • Remote Sensing |
• | • |
21 Text
• Large Language Models (LLMs) |
22 Time Series
• Kalman Filter | • |