1 Automated Syllabus of Machine Learning Papers
Built by Rex W. Douglass @RexDouglass ; Github ; LinkedIn
Papers curated by hand, summaries and taxonomy written by LLMs.
2 Machine learning
2.1 Hyperbolic Embedding
Consider using hyperbolic embeddings combined with Hearst patterns to infer concept hierarchies from large text corpora because this approach allows for improved taxonomic consistency, increased efficiency in handling large datasets, and greater interpretability of results. (Le et al. 2019)
Consider using hyperbolic spaces for learning hierarchical embeddings of directed acyclic graphs, as they offer superior representational capacity compared to Euclidean spaces and can better preserve the underlying network properties. (Ganea, Bécigneul, and Hofmann 2018)
Consider combining both Euclidean and hyperbolic embeddings for improved representational power in node classification and link prediction tasks, especially when dealing with complex graphs that exhibit both hierarchical and non-hierarchical structures. (Kipf and Welling 2016)
Consider using hyperbolic spaces, specifically the hyperbolic plane, as a target space for embedding trees due to its ability to preserve the topological and geometric properties of the tree, enable guaranteed greedy routing, and achieve low distortion when embedding weighted trees. (Chepoi et al. 2010)
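The recommendations above rely on distances growing very quickly toward the boundary of hyperbolic space, which is what leaves room for tree-like branching. A minimal sketch of the distance in the Poincaré ball model, one common model of hyperbolic space, is given here; the choice of that model and the function name `poincare_distance` are assumptions for illustration and are not taken from the cited papers.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

origin = (0.0, 0.0)
near_center = (0.5, 0.0)
near_boundary = (0.95, 0.0)

print(poincare_distance(origin, near_center))    # Euclidean distance is 0.5
print(poincare_distance(origin, near_boundary))  # Euclidean distance is 0.95
```

Points near the boundary end up much farther from the origin than their Euclidean coordinates suggest, so deep levels of a hierarchy can be placed there without crowding.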
2.2 Transformer
Leverage the power of transformer architectures, specifically through the use of separate embeddings for covariates and treatments followed by cross-modal attention, to improve treatment effect estimation while reducing parameter inefficiency and increasing robustness to changes in treatment or dosage. (Zhang et al. 2022)
Prioritize using pre-trained language representations over explicit linguistic features when conducting relation extraction studies, as they offer superior performance, require less data annotation, and reduce the risk of error accumulation. (Alt, Hübner, and Hennig 2019)
Consider the potential impact of non-identifiability in self-attention mechanisms, as the presence of a non-trivial null space in the attention matrix implies that there are multiple sets of attention weights that produce the same output, making interpretation of attention weights difficult and potentially misleading. (Brunner et al. 2019)
Consider leveraging the power of the open-source library “Transformers,” which offers carefully engineered state-of-the-art Transformer architectures under a unified API, backed by a curated collection of pretrained models, and designed to be extensible, simple, and fast for both research and industrial deployments. (Wolf et al. 2019)
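The non-identifiability point attributed to Brunner et al. (2019) above can be made concrete with a toy example. The sketch below uses a hand-picked value matrix with more tokens than value dimensions (an assumption chosen so the arithmetic is easy to follow) and shows two different, valid attention distributions that produce the same output vector.

```python
import math

def attend(weights, values):
    """One row of attention output: the weighted sum of the value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Four tokens with two-dimensional value vectors (sequence longer than head size).
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

weights_a = [0.1, 0.2, 0.3, 0.4]

# This direction is orthogonal to every column of `values` and sums to zero,
# so adding it keeps the weights a probability distribution and leaves the
# output unchanged.
null_direction = [1.0, 1.0, -1.0, -1.0]
weights_b = [w + 0.1 * n for w, n in zip(weights_a, null_direction)]

out_a = attend(weights_a, values)
out_b = attend(weights_b, values)

print(weights_a, out_a)
print(weights_b, out_b)
```

Because both weight vectors explain the output equally well, reading either one as "the" explanation of the model's behavior is not justified by the output alone.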
2.3 Deep Learning
Leverage heterogeneous data sources and study the dependencies between societal events to build interpretable deep learning models for accurate event prediction. (Deng and Ning 2021)
Consider employing a multi-task learning approach for analyzing complex scenarios involving crowds, as it can lead to improved performance in individual tasks, as evidenced by the authors' study showing a 9% improvement in ROC curve AUC for violent behavior detection. (Marsden et al. 2017)
2.4 Imbalanced Dataset
Consider using Monte Carlo simulation methods to systematically vary specific aspects of your data while controlling others, allowing you to draw conclusions about the impact of those variations on the performance of different algorithms for handling class imbalance in machine learning tasks. (Abdar et al. 2019)
Employ specialized evaluation metrics and modify learning algorithms to prioritize rare and important cases when working with imbalanced datasets, where standard evaluation criteria and learning algorithms may perform poorly due to non-uniform user preference biases. (Branco, Torgo, and Ribeiro 2015)
Consider the potential impact of data skewness on performance metrics, particularly for imbalanced datasets, and report skew-normalized scores alongside raw scores to ensure accurate interpretation of results. (Jeni, Cohn, and Torre 2013)
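To see why standard metrics can mislead on skewed data, it may help to compare plain accuracy with balanced accuracy (the mean of per-class recall) for a classifier that always predicts the majority class. This is a minimal sketch with made-up labels; it does not implement the skew-normalized scores of Jeni, Cohn, and Torre (2013), only a simpler metric that is insensitive to class proportions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Made-up data: 95 negatives, 5 positives, and a majority-class predictor.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # high, despite missing every positive
print(balanced_accuracy(y_true, y_pred))  # chance level for two classes
```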
2.5 Attention Mechanism
- Utilize a multiscale visualization tool to better understand complex attention mechanisms in transformer models, allowing for the identification of biases, location of relevant attention heads, and linking of neurons to model behavior. (Vig 2019)
2.6 Autoencoder
Consider using deterministic autoencoders with explicit regularization schemes for the decoder as a simpler and potentially more effective alternative to variational autoencoders for generative modeling tasks. (Ghosh et al. 2019)
Consider using a hierarchical Vector Quantized Variational AutoEncoder (VQ-VAE) model for large scale image generation, as it enables the generation of high-coherence and high-fidelity synthetic samples, while being scalable and efficient due to its use of simple feed-forward encoder and decoder networks and fast sampling in the compressed latent space. (Razavi, Oord, and Vinyals 2019)
2.7 Batch Normalization
Carefully consider the choice of batch in BatchNorm, including its size, data source, and algorithm for computing statistics, as different choices can lead to inconsistencies and affect the generalization of models. (Wu and Johnson 2021)
Carefully consider the potential for variance shift when combining batch normalization (BN) and dropout techniques in neural networks, as the differing ways these methods handle variance can lead to numerical instability and decreased performance. (Li et al. 2018)
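The variance shift described above can be illustrated with a small simulation. The sketch assumes zero-mean, unit-variance activations and inverted dropout (survivors rescaled by `1 / keep_prob`); under those assumptions the variance seen during training is roughly `1 / keep_prob` times the variance seen at inference, so statistics accumulated by a following BN layer during training would not match inference-time inputs.

```python
import random
import statistics

def dropout_train(x, keep_prob, rng):
    """Inverted dropout: zero each unit with prob 1 - keep_prob, rescale the rest."""
    return [xi / keep_prob if rng.random() < keep_prob else 0.0 for xi in x]

rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
keep_prob = 0.5

# Training mode: dropout is active, so the variance is inflated.
train_var = statistics.pvariance(dropout_train(activations, keep_prob, rng))
# Inference mode: dropout is the identity, so the variance is unchanged.
test_var = statistics.pvariance(activations)

print(train_var, test_var)
```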
2.8 Reproducibility In Machine Learning
Vary the random seed in your deep learning experiments and analyze the distribution of scores to assess the potential impact of randomness on your results. (Caron et al. 2021)
Consider both algorithmic and implementation-level sources of non-determinism when evaluating deep learning models, as these factors can significantly impact model performance and lead to inconsistent results across identical training runs. (Pham et al. 2020)
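A seed sweep of the kind recommended above can be organized as in the sketch below. `run_experiment` is a hypothetical stand-in for a real training run (here it just returns a noisy number); in practice it would seed every random number generator the pipeline uses and return a validation score.

```python
import random
import statistics

def run_experiment(seed):
    """Hypothetical stand-in for a training run; returns a noisy score."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0.0, 0.01)

def summarize_over_seeds(seeds):
    """Run the experiment once per seed and describe the score distribution."""
    scores = [run_experiment(s) for s in seeds]
    return {
        "n": len(scores),
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

summary = summarize_over_seeds(range(10))
print(summary)
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two methods exceeds seed-to-seed noise. Fixing the seed removes only the algorithmic sources of randomness; implementation-level non-determinism can still make identical seeds diverge.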
2.9 Transfer Learning
Utilize a combination of pre-training and supervised fine-tuning when developing language models, allowing for effective transfer learning and high performance on a wide range of tasks. (Al-Rfou et al. 2018)
Consider leveraging pre-trained deep CNN models on large and diverse datasets for unsupervised classification tasks, as they can outperform more sophisticated and specifically tailored image-set clustering methods. (Guérin et al. 2017)
2.10 Automatic Differentiation
- Utilize automatic differentiation variational inference (ADVI) to enable rapid iteration and exploration of complex probabilistic models, allowing for efficient and accurate estimation of model parameters without requiring manual derivation of algorithms. (Abadi et al. 2016)
2.11 Boosting
- Carefully consider the trade-offs between computational resources and model quality when selecting a gradient boosting decision tree (GBDT) algorithm and its hyperparameters, especially when working with large-scale datasets and limited time or hardware constraints. (Anghel et al. 2018)
2.12 Clustering
- Consider creating new clustering methods by systematically combining and modifying components of existing methods, guided by a comprehensive taxonomy of clustering methods that utilize deep neural networks. (Aljalbout et al. 2018)
2.13 Comparison Of Classifiers
- Aim to compare a large number of classifiers from various families across numerous datasets to increase the likelihood of identifying the best performing model for a particular dataset, rather than relying solely on familiar or commonly used classifiers. (Aha, Kibler, and Albert 1991)
2.14 Computer Vision In Politics
- Consider using automated coding techniques, specifically machine learning algorithms, to efficiently and accurately code large volumes of video data, achieving comparable levels of accuracy to traditional human coding methods. (Tarr, Hwang, and Imai 2022)
2.15 Deep Reinforcement Learning
- Carefully select and preprocess your training data to improve the signal-to-noise ratio, specifically by focusing on periods of high price activity, which can lead to better performance of reinforcement learning algorithms in high frequency trading applications. (Briola et al. 2021)
2.16 Embeddings
- Consider using the epsilon-four-points condition (ε-4PC) as a measure of proximity to tree metrics, as it is a scalable and easily verifiable condition that accurately reflects the hierarchical nature of complex networks like the Internet. (Abraham et al. 2007)
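A brute-force check of the epsilon-four-points condition might look like the sketch below. It assumes the condition is stated as follows: for any four points, order the three pairwise-sum pairings so that S1 <= S2 <= S3; then S3 <= S2 + 2 * eps * m, where m is the smaller of the two distances in the S1 pairing. That formulation, the tie handling, and the function names are assumptions for illustration, and the loop over all quadruples is only practical for small point sets.

```python
from itertools import combinations

def quadruple_epsilon(d, a, b, c, e):
    """Smallest eps for which one quadruple satisfies the assumed condition."""
    # Each entry: (sum of the two distances in the pairing, smaller of the two).
    pairings = sorted([
        (d[a][b] + d[c][e], min(d[a][b], d[c][e])),
        (d[a][c] + d[b][e], min(d[a][c], d[b][e])),
        (d[a][e] + d[b][c], min(d[a][e], d[b][c])),
    ])
    gap = pairings[2][0] - pairings[1][0]
    if gap <= 0:
        return 0.0  # exact four-point condition holds, as in a tree metric
    smallest_sum = pairings[0][0]
    # If two pairings tie for the smallest sum, use the stricter one.
    denom = min(m for s, m in pairings if s == smallest_sum)
    return gap / (2.0 * denom)  # assumes distinct points, so denom > 0

def epsilon_four_points(d):
    """Largest per-quadruple eps over all quadruples of a distance matrix."""
    n = len(d)
    return max(quadruple_epsilon(d, *q) for q in combinations(range(n), 4))

# Leaves of a star tree with edge lengths 1, 2, 3, 4: an exact tree metric.
lengths = [1.0, 2.0, 3.0, 4.0]
star = [[0.0 if i == j else lengths[i] + lengths[j] for j in range(4)] for i in range(4)]

# Shortest-path metric of a 4-cycle with unit edges: far from a tree.
cycle = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]

print(epsilon_four_points(star), epsilon_four_points(cycle))
```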
2.17 Expectation Maximization Algorithm
- Consider using the Orthogonalizing Expectation Maximization (OEM) algorithm for penalized regression analysis in situations involving “tall” data sets, where the number of observations far exceeds the number of variables, as it offers significant computational advantages compared to other methods. (Huling and Qian 2018)
2.18 Gaussian Mixture Model
- Carefully consider the assumptions of independence and homogeneity in event count models, as violations of these assumptions can lead to inefficient estimates and biased standard errors. (King 1989)
2.19 Generative Models
- Utilize a deep generative model-based framework like Credence to validate causal inference methods, as it generates synthetic data anchored at the empirical distribution for the observed sample, enabling users to specify ground truth for causal effects and confounding bias, and evaluate the potential performance of various causal estimation methods on data similar to the observed sample. (Cui and Tchetgen 2019)
2.20 Interaction Model
- Carefully evaluate the appropriateness of the Linear Interaction Effect (LIE) assumption in multiplicative interaction models, as it often fails in empirical settings, leading to potentially biased and inconsistent estimates. Additionally, researchers should ensure adequate common support in the data to avoid making inferences based solely on interpolation or extrapolation. (Hainmueller, Mummolo, and Xu 2018)
2.21 Interactive Machine Learning
- Consider incorporating dynamic memory of user feedback into your models to enable continual system improvement without the need for model retraining. (Mishra, Tafjord, and Clark 2022)
2.22 Isaac Gym
- Consider implementing an end-to-end GPU-accelerated training pipeline for robotics simulations to achieve significant speed-ups in training complex environments, as demonstrated by the development of Isaac Gym. (Makoviychuk et al. 2021)
2.23 Leave Future Out Cross Validation
- Use leave-future-out cross-validation (LFO-CV) instead of leave-one-out cross-validation (LOO-CV) for time series analysis to avoid overly optimistic estimates caused by the availability of future information during the prediction of past observations. (Bürkner, Gabry, and Vehtari 2020)
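The splitting scheme behind leave-future-out evaluation can be sketched as an expanding window in which every test block lies strictly after its training data. The function name and parameters below are illustrative assumptions; the sketch covers only the exact-refit splits, not any approximation used to avoid refitting the model at each step.

```python
def leave_future_out_splits(n_obs, min_train, horizon=1):
    """Yield (train_indices, test_indices) with the test block after the train block."""
    for end in range(min_train, n_obs - horizon + 1):
        yield list(range(end)), list(range(end, end + horizon))

splits = list(leave_future_out_splits(n_obs=6, min_train=3, horizon=1))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
```

Unlike leave-one-out, no training set ever contains an observation that comes after the one being predicted.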
2.24 Machine Learning Pitfalls
- Meticulously manage your data, including setting aside independent test sets early on, avoiding data leakage, and ensuring that feature selection and dimensionality reduction are treated as part of the model training process. (Lones 2021)
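One common leakage path is computing preprocessing statistics on the full data set before splitting. The sketch below, with made-up numbers and hypothetical helper names, sets the test data aside first and fits a standardizer on the training portion only, which is the pattern the recommendation above describes for any data-dependent preprocessing step.

```python
import statistics

def fit_scaler(train_values):
    """Learn mean and standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0.0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_scaler(values, scaler):
    """Standardize values with previously fitted statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]

data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]  # hold out the test set before anything else

scaler = fit_scaler(train)        # the test value never influences the scaler
train_scaled = apply_scaler(train, scaler)
test_scaled = apply_scaler(test, scaler)

print(scaler, test_scaled)
```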
2.25 Manifold Learning
- Focus on generating samples from the target distribution on the manifold rather than the input space, especially when dealing with high dimensional data and complex models, as this leads to more accurate and representative estimates of the underlying population. (Oh et al. 2013)
2.26 Merlion Machine Learning Library
- Consider using Merlion, an open-source machine learning library specifically designed for time series analysis, which offers a unified interface for various models and datasets, standard pre/post-processing layers, visualization tools, anomaly score calibration, AutoML for hyperparameter tuning and model selection, and model ensembling, allowing for rapid development and benchmarking of models for specific time series needs. (Bhatnagar et al. 2021)
2.27 Meta-learning
- Consider meta-learning unsupervised update rules for unsupervised representation learning, specifically targeting semi-supervised classification performance, and constraining the update rule to be a biologically-motivated, neuron-local function to improve generalizability across different neural network architectures, datasets, and data modalities. (Metz et al. 2018)
2.28 Multi-label Classification
- Consider using the mldr package in R for working with multilabel datasets, which provides functions for loading, analyzing, and manipulating such datasets, as well as applying binary relevance (BR) and label powerset (LP) transformations to enable the use of traditional binary and multiclass classification models. (Charte and Charte 2015)
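The two transformations named above are simple enough to sketch directly. The example below is written in Python for illustration and is not the mldr API: binary relevance produces one binary target per label, and label powerset maps each distinct combination of labels to a single class id.

```python
def binary_relevance(label_sets, all_labels):
    """One binary target vector per label (1 if the label is present)."""
    return {lab: [int(lab in s) for s in label_sets] for lab in all_labels}

def label_powerset(label_sets):
    """Map each distinct label combination to one multiclass target."""
    class_ids = {}
    targets = []
    for s in label_sets:
        key = tuple(sorted(s))
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids

# Made-up multilabel data: each instance carries a set of labels.
label_sets = [{"news", "sports"}, {"news"}, {"sports", "news"}, set()]

br = binary_relevance(label_sets, ["news", "sports"])
lp_targets, lp_classes = label_powerset(label_sets)

print(br)
print(lp_targets, lp_classes)
```

Binary relevance ignores correlations between labels, while label powerset preserves them at the cost of a class count that can grow with the number of distinct combinations.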
2.29 Multi-modal Learning
- Consider leveraging the binding property of images to enable emergent cross-modal interactions among multiple modalities, even when explicit paired data between all modalities is absent. (Alayrac et al. 2022)
2.30 Nonparametric Autoencoder
- Consider combining Bayesian nonparametric methods with variational autoencoders to enable greater modeling flexibility and structured interpretability in unsupervised representation learning tasks. (Bowman et al. 2015)
2.31 One-class Classification
- Consider extending traditional Random Forests (RFs) to one-class classification problems by developing a natural methodology to adapt standard splitting criteria to the one-class setting, allowing for structural generalizations of RFs to one-class classification. (Goix et al. 2016)
2.32 Optimizers
- Evaluate multiple optimizers with default hyperparameters, as this approach performs approximately as well as tuning the hyperparameters for a fixed optimizer, and can save valuable computational resources. (Schmidt, Schneider, and Hennig 2020)
2.33 Overfitting
- Carefully consider the potential impact of adaptive overfitting when using holdout data for model evaluation, especially in situations where the test set is reused frequently or the sample size is small. (Chen and Guestrin 2016)
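A small simulation can show how reusing a holdout set inflates scores. In the sketch below every "model" is a coin flip, so its true accuracy is 0.5, yet picking the best of many on the same small holdout set yields a noticeably higher holdout score that does not carry over to fresh data. The set sizes and number of models are arbitrary assumptions.

```python
import random

def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

rng = random.Random(0)
n_test, n_models = 100, 500

holdout_labels = [rng.randint(0, 1) for _ in range(n_test)]
fresh_labels = [rng.randint(0, 1) for _ in range(n_test)]

# Each model's holdout predictions are pure guesses.
holdout_preds = [[rng.randint(0, 1) for _ in range(n_test)] for _ in range(n_models)]

# Adaptive step: keep whichever model scores best on the reused holdout set.
best = max(range(n_models), key=lambda i: accuracy(holdout_preds[i], holdout_labels))
holdout_acc = accuracy(holdout_preds[best], holdout_labels)

# On fresh data the selected model is still just guessing.
fresh_preds = [rng.randint(0, 1) for _ in range(n_test)]
fresh_acc = accuracy(fresh_preds, fresh_labels)

print(holdout_acc, fresh_acc)
```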
2.34 PAC-Bayes
- Consider using PAC-Bayes bounds, which are a type of tool in statistical learning theory, to analyze the generalization ability of aggregated and randomized predictors, as these bounds do not rely on minimization problems and can handle complex models such as neural networks. (Alquier 2021)
2.35 Predictive Maintenance Framework
- Explicitly define the predictimand, or the specific question about treatment effects that your clinical prediction model aims to answer, as this choice determines the appropriate statistical approach and ensures accurate interpretation of results. (Geloven et al. 2020)
2.36 Pretrained Model
- Consider utilizing pre-trained models (PTMs) for natural language processing (NLP) tasks, as they offer significant benefits such as learning universal language representations, providing better model initializations, acting as a regularization technique to prevent overfitting, and reducing the reliance on labeled data through leveraging large-scale unlabeled corpora. (Qiu et al. 2020)
2.37 Proximal Causal Learning
- Attempt to identify and categorize proxy variables into three buckets - those that are common causes of treatment and outcome, treatment-inducing confounding proxies, and outcome-inducing confounding proxies - in order to enable proximal causal learning and improve causal inferences in situations where traditional exchangeability assumptions fail. (Tchetgen et al. 2020)
2.38 Pure Prediction Algorithms
- Carefully consider whether your primary objective is accurate prediction or understanding the underlying scientific truth when choosing between traditional regression methods and newer pure prediction algorithms. (Efron 2020)
2.39 Quantized LLMs
- Consider using the QLoRA approach for efficient fine-tuning of large language models, which combines 4-bit NormalFloat quantization, Double Quantization, and Paged Optimizers to reduce memory usage while maintaining full 16-bit finetuning task performance. (Dettmers et al. 2023)
2.40 Random Forest
- Carefully consider the impact of subsampling rate and tree depth when using Breiman's random forests, as properly tuning these parameters can significantly improve the accuracy of the model. (Duroux and Scornet 2016)
2.41 Reinforcement Learning
- Carefully consider the limitations of Markov reward functions in expressing complex tasks, as there are certain tasks that cannot be accurately captured by these functions. Therefore, researchers should explore alternative formulations of the problem when encountering such tasks. (Abel et al. 2021)
2.42 Representation Learning
- Consider incorporating both word-level and entity-level features when developing models for text classification tasks, as demonstrated by the improved performance of the TextEnt-full model over the TextEnt-word and TextEnt-entity models in the entity typing task. (Yamada, Shindo, and Takefuji 2018)
2.43 SHAP
- Consider using a novel \(R^2\) metric based on Shapley decomposition to evaluate feature importance in machine learning models, as it provides a fair allocation of explained variability to each feature, is model-agnostic, and can be computed efficiently using pre-calculated Shapley values. (Redell 2019)
2.44 Self-supervised Learning
- Carefully consider the impact of the choice of learning objective on the learned representations, especially in the final layers, when using self-supervised or supervised methods for visual deep learning. (Grigg et al. 2021)
2.45 Sequential Model Based Optimization
- Consider using the flexible and comprehensive R toolbox, mlrMBO, for model-based optimization (MBO) when dealing with expensive black-box functions, as it enables approximation of the objective function through a surrogate regression model, supports single- and multi-objective optimization with mixed continuous, categorical, and conditional parameters, and is implemented in a modular fashion allowing for easy replacement or adaptation of components. (Bischl et al. 2017)
2.46 Snorkel Software
- Consider utilizing weak supervision, specifically through the use of noisy, programmatically-generated training data, to address the common issue of limited labeled training data in machine learning projects. (Dehghani et al. 2017)
2.47 Statistical Comparison Of Classifiers
- Utilize non-parametric tests like the Wilcoxon signed ranks test for comparing two classifiers and the Friedman test with corresponding post-hoc tests for comparing multiple classifiers over multiple data sets, as these tests are simple, safe, and robust alternatives to parametric tests that rely on strong assumptions. (Purg et al. 2023)
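As a sketch of the multiple-classifier case, the Friedman statistic can be computed from per-dataset ranks. The version below uses the standard chi-square form without a tie correction, and the accuracy table is made up for illustration; a statistics library would normally be used to obtain the p-value and the post-hoc tests.

```python
def average_ranks(scores):
    """Rank scores within one dataset (1 = best); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def friedman_statistic(table):
    """Friedman chi-square for N datasets (rows) and k classifiers (columns)."""
    n, k = len(table), len(table[0])
    per_dataset = [average_ranks(row) for row in table]
    mean_ranks = [sum(r[j] for r in per_dataset) / n for j in range(k)]
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return chi2, mean_ranks

# Made-up accuracies: rows are datasets, columns are classifiers A, B, C.
accuracies = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.70],
    [0.75, 0.78, 0.72],
    [0.92, 0.90, 0.85],
]

chi2, mean_ranks = friedman_statistic(accuracies)
print(chi2, mean_ranks)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; with so few datasets the approximation is rough, so this is only meant to show the mechanics.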
2.48 Statistical Learning Theory
- Carefully distinguish between descriptive and causal inferences, utilizing counterfactual frameworks and potential outcome models to accurately estimate causal effects while controlling for confounding factors. (Gelman and Vehtari 2020)
2.49 Survival Analysis
- Consider using the ggRandomForests package to visualize and explore the structure of your random forest models, as it provides separation of data and figures, modularity of data objects/figures, and flexibility in modifying the output using ggplot2 functions. (Ehrlinger 2016)
2.50 Synthetic Data Generation
- Consider using synthetic text generated by large language models (LLMs) to overcome obstacles in supervised text analysis, such as the high cost of labeling and retrieval, and copyright restrictions, while still maintaining transparency, reproducibility, and interpretability. (Jankowski and Huber 2023)
2.51 TensorFlow Distributions
- Carefully consider the shape semantics of your data when working with probability distributions, particularly distinguishing between sample, batch, and event shapes, to ensure efficient and accurate analysis. (Dillon et al. 2017)
2.52 TensorFlow Probability
- Utilize the flexibility of TensorFlow Probability JointDistributions to specify complex probabilistic models using either imperative or declarative styles, leveraging the shared interface for inference algorithms and the ability to easily switch between different model specifications. (Piponi, Moore, and Dillon 2020)
2.53 Understanding Machine Learning
- Consider the trade-offs between bias and variance when selecting a machine learning model, as well as the importance of evaluating model performance using methods such as cross-validation and train-test splits. (Chang, Weiss, and Freeman 2009)
2.54 Unsupervised Feature Learning
- Distinguish the contributions of architectures from those of learning systems by reporting random weight performance, as a sizeable component of a system's performance can come from the intrinsic properties of the architecture, and not from the learning system. (Gray 2005)
2.55 Variational Inference
- Utilize the Pareto Smoothed Importance Sampling (PSIS) diagnostic tool to assess the quality of your variational inference (VI) approximations, as it provides a continuous estimate of the Renyi divergence between the true and approximated posteriors, allowing for early detection of potentially disastrous VI approximations. (Yao et al. 2018)
2.56 Weak Supervision
- Consider utilizing a robust PCA-based algorithm for learning dependency structures in weak supervision models, as it can lead to improved theoretical recovery rates and superior performance on real-world tasks compared to existing methods that ignore sparsity patterns or make assumptions about conditional independence. (Varma et al. 2019)
2.57 Wikipedia Infobox Completion
- Consider utilizing both word and network embeddings when attempting to predict Wikipedia infobox types, particularly when working with limited information such as tables of contents and named entities in article abstracts. (Biswas et al. 2023)
3 Artificial intelligence
3.1 NA
Consider incorporating knowledge-guided linguistic rewrites as a secondary source of evidence when generating inference rule corpora, as it can significantly improve the precision of the rules without sacrificing substantial recall. (Jain, Rathi, and Chakrabarti 2020)
Consider both the degree of constrainedness and the availability of positive examples when studying the learnability of boolean formulas using deep neural networks, as these factors can significantly affect the performance of the models. (Nicolau et al. 2020)
Consider using automated methods, such as mining-based and paraphrasing-based approaches, to generate diverse and high-quality prompts for querying language models, rather than relying solely on manually created prompts, in order to more accurately estimate the knowledge contained in the models. (McCann et al. 2018)
3.2 Explainable Artificial Intelligence
Pay close attention to the issue of disagreement between explanations generated by different post hoc explanation methods, as it frequently arises in practice and can have significant implications for model interpretation and decision-making. Moreover, there is a lack of principled, well-established approaches for resolving such disagreements, suggesting a need for further research in this area. (Krishna et al. 2022)
Consider generating explanations for AI systems in the form of entailment trees, which are hierarchical structures that capture the logical relationships between premises and conclusions, and can help improve the interpretability and debuggability of AI systems. (Dalvi et al. 2021)
Carefully select appropriate explainable artificial intelligence (XAI) methods based on the model structure and the purpose of the explanation, recognizing that model-agnostic methods offer greater flexibility across different types of models but may sacrifice accuracy compared to model-specific or inherently interpretable methods. (Maksymiuk, Gosiewska, and Biecek 2020)
Aim to develop automatic concept-based explanation methods that prioritize meaningfulness, coherence, and importance in identifying higher-level human-understandable concepts applicable across datasets, as opposed to focusing solely on feature importance scores for individual inputs. (Ghorbani et al. 2019)
3.3 Atomic Commonsense Reasoning Dataset
- Consider organizing commonsense knowledge into typed if-then relations with variables, and distinguishing between causes vs. effects, agents vs. themes, voluntary vs. involuntary events, and actions vs. mental states, as this approach leads to improved accuracy in commonsense reasoning tasks. (Sap et al. 2018)
3.4 Language Models And Legal Reasoning
- Actively involve domain experts in the creation of evaluation tasks for large language models (LLMs) to ensure that the tasks accurately reflect real-world scenarios and enable meaningful engagement in discussions of LLM performance using familiar terminology and conceptual frameworks. (Guha et al. 2023)
3.5 Neuralsymbolic Integration
- Consider combining symbolic and subsymbolic knowledge representations in your language models, allowing for improved interpretability, adaptability, and control over the model's factual information. (Verga et al. 2020)
4 Artificial neural networks
4.1 Attention Mechanism
- Consider analyzing the relationship between in-context learning in Transformers and gradient-based optimization techniques, particularly in the context of auto-regressive tasks, as the authors propose that in-context learning in the Transformer forward pass is implemented via gradient-based optimization of an implicit auto-regressive inner loss constructed from its in-context data. (Oswald et al. 2022)
4.2 Convolutional Neural Network
Carefully consider the choice of hyperparameters in your convolutional neural network models, particularly the number of filters, filter size, activation function, and stride, as these decisions can significantly impact the performance of the model. (Thoma 2017)
Employ a multi-granularity and multi-perspective approach to modeling sentence similarity using convolutional neural networks, combining both holistic and per-dimension filters, along with various pooling methods, to effectively capture diverse linguistic patterns and enhance overall performance. (He, Gimpel, and Lin 2015)
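To make the effect of filter size, stride, and filter count concrete, the sketch below computes the spatial output size and parameter count of a single 2D convolution layer using the standard formulas. The helper names are illustrative assumptions, and square inputs and filters with no dilation are assumed.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def conv_param_count(kernel_size, in_channels, out_channels, bias=True):
    """Weights in a square 2D convolution layer, plus one bias per filter."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

# A 3x3 filter with stride 1 and padding 1 keeps a 32x32 input at 32x32.
print(conv_output_size(32, kernel_size=3, stride=1, padding=1))
# A 7x7 filter with stride 2 and padding 3 halves a 224x224 input.
print(conv_output_size(224, kernel_size=7, stride=2, padding=3))
# 64 filters of size 3x3 over 3 input channels.
print(conv_param_count(3, in_channels=3, out_channels=64))
```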
4.3 Deep Residual Network
- Consider optimizing your deep learning models for super-resolution tasks by simplifying the network architecture, modifying the loss function, and transferring knowledge from pre-trained models at other scales. (Lim et al. 2017)
4.4 Neural Network
- Employ statistical models capable of handling complex, nonlinear, and contingent relationships, such as neural network models, especially when studying rare events like international conflicts, where traditional linear-normal models may miss important nuances due to the rarity and heterogeneity of the phenomenon. (Beck, King, and Zeng 2000)
4.5 Recurrent Neural Networks
- Consider decomposing the output of an LSTM into a product of factors, where each factor represents the contribution of a particular word, in order to gain insights into the underlying learned patterns of the model. (Murdoch and Szlam 2017)
4.6 Transformers
- Consider utilizing a knowledge attribution method to identify knowledge neurons in pretrained transformers, as these neurons have been found to be positively correlated with the expression of their corresponding facts, allowing for targeted editing of specific factual knowledge without fine-tuning. (Dai et al. 2021)