1 Automated Syllabus of Natural Language Processing Papers

Built by Rex W. Douglass (@RexDouglass)

Papers curated by hand, summaries and taxonomy written by LLMs.

2 Natural language processing

2.1 Word Embedding

Leverage pre-trained language models and multi-task learning to promote cross-language knowledge transfer for Temporal Expression Extraction (TEE) in low-resource languages, thereby improving performance and reducing reliance on scarce labeled data. (Cao et al. 2022)
Move beyond treating words as discrete entities and instead represent them as continuous vectors in a high-dimensional space, enabling better capture of semantic similarity between words. (Smith 2020)
Consider using context-independent anchors to facilitate the mapping of context-dependent embeddings, particularly in low-resource scenarios where direct supervision may not be feasible. (Aldarmaki and Diab 2019)
Optimize the dimensionality of word embeddings by balancing the bias-variance trade-off inherent in the Pairwise Inner Product (PIP) loss, which provides a theoretically sound and computationally efficient way to measure the dissimilarity between word embeddings. (Bahdanau, Cho, and Bengio 2014)
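
The PIP loss named above compares two embeddings through their Pairwise Inner Product matrices (the embedding matrix times its own transpose) and takes the Frobenius norm of the difference. A minimal pure-Python sketch follows; the function names and the toy vectors are illustrative assumptions, not code from the cited work.

```python
import math

def pip_matrix(emb):
    """Pairwise Inner Product matrix: entry (i, j) is the dot product of word vectors i and j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in emb] for u in emb]

def pip_loss(emb_a, emb_b):
    """Frobenius norm of PIP(emb_a) - PIP(emb_b).

    Both embeddings must list the same words in the same order,
    but they may have different dimensionalities.
    """
    pa, pb = pip_matrix(emb_a), pip_matrix(emb_b)
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(pa, pb) for x, y in zip(ra, rb)))

# Toy 3-word vocabulary with 2-dimensional vectors (made-up numbers).
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The same embedding rotated by 90 degrees: (x, y) -> (-y, x).
rotated = [[-y, x] for x, y in emb]
# A 1-dimensional embedding of the same words (first coordinate only).
truncated = [[x] for x, _ in emb]

print(pip_loss(emb, rotated))    # rotations leave every inner product unchanged
print(pip_loss(emb, truncated))  # dropping a dimension changes them
```

Because the loss depends only on inner products, it can compare embeddings of different sizes, which is what makes it usable for choosing a dimensionality.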

2.2 Causal Inference

Aim to create low-dimensional document embeddings that capture the necessary information for causal identification while reducing noise and irrelevant information, allowing for accurate estimation of causal effects from observational text data. (Egami et al. 2018)

2.3 Large Language Models

Carefully curate and refine your training datasets to improve model performance while reducing training costs and time, including removing similar and duplicate questions, checking for contamination, and selecting specialized fine-tuned LoRA modules for merging. (Lee, Hunter, and Ruiz 2023)
Carefully consider and specify the type of prompt used when evaluating large language models (LLMs) for complex tasks, as the choice of prompt can significantly affect performance and make comparisons across studies difficult without a consistent taxonomy. (Santu and Feng 2023)
Utilize a comprehensive and rigorous assessment framework to evaluate the reasoning capabilities of large language models (LLMs) on complex planning tasks, rather than relying solely on simple benchmarks or anecdotal evidence. (Valmeekam et al. 2022)

2.4 Chain Of Thought Prompting

Consider using a “Chain of Density” (CoD) approach to generate increasingly dense summaries through iterative identification and fusion of missing entities, while controlling for length, in order to strike an optimal balance between informativeness and readability. (Adams et al. 2023)
Explore the potential of zero-shot reasoning abilities in large language models (LLMs) by using simple prompts such as “Let’s think step by step”, which can lead to significant improvements in performance on diverse reasoning tasks compared to traditional zero-shot approaches. (Black et al. 2021)
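
As a rough illustration of the zero-shot prompt described above, the sketch below only builds prompt strings and does not call any model. The `Q:`/`A:` layout, the helper names, and the second-stage "Therefore, the answer is" suffix are assumptions made for this example.

```python
def zero_shot_cot_prompt(question: str) -> str:
    """Append the reasoning trigger to a question; no worked examples are included."""
    return f"Q: {question}\nA: Let's think step by step."

def answer_extraction_prompt(question: str, reasoning: str) -> str:
    """Hypothetical second stage: feed the model's reasoning back and ask for a final answer."""
    return f"{zero_shot_cot_prompt(question)} {reasoning}\nTherefore, the answer is"

prompt = zero_shot_cot_prompt("A juggler has 16 balls and half are golf balls. How many golf balls are there?")
print(prompt)
```

The text returned by a model for `prompt` would be passed as `reasoning` to the second helper if a short final answer is needed.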

2.5 Dependency Parsing

Consider leveraging large-scale web-based corpora, such as DepCC, for improved performance in natural language processing tasks, particularly when dealing with unsupervised methods or verb similarity assessment. (Panchenko et al. 2017)
Consider using bidirectional Long Short-Term Memory (BiLSTM) networks as feature extractors for natural language processing tasks like dependency parsing, because they excel at representing elements in a sequence along with their contexts, require less feature engineering compared to traditional methods, and can be trained jointly with the parsing objective to optimize parsing performance. (Kiperwasser and Goldberg 2016)

2.6 Hownet

Consider using OpenHowNet, an open sememe-based lexical knowledge base built upon HowNet, which offers core data, web access, and APIs for natural language processing tasks such as word similarity computation, word sense disambiguation, and sentiment analysis. (Qi et al. 2019)
Consider using a common-sense knowledge base such as HowNet, which utilizes sememe-based interpretation and structured language markup to define concepts, in order to accurately capture complex inter-conceptual and inter-attribute relationships in natural language processing tasks. (Dong and Dong 2006)

2.7 Latent Dirichlet Allocation

Consider using the Rlda package for mixed-membership clustering analysis of categorical data, which extends the traditional Latent Dirichlet Allocation (LDA) model to handle Multinomial, Bernoulli, and Binomial data types, and allows for the selection of the optimal number of clusters based on a truncated stick-breaking prior approach. (Albuquerque, Valle, and Li 2019)
Consider using an asymmetric Dirichlet prior over the document-topic distributions in your LDA models, as it leads to improved model performance and greater robustness to variations in the number of topics and skewed word frequency distributions, without incurring additional computational costs beyond standard inference techniques. (Geman and Geman 1984)
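
To make the difference between a symmetric and an asymmetric Dirichlet prior concrete, the sketch below draws document-topic proportions using only the standard library. The concentration values are made up for illustration; in practice they would be chosen or learned as part of model fitting, which is not shown here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def prior_mean(alpha):
    """Expected topic proportions under Dirichlet(alpha): alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

rng = random.Random(0)
symmetric = [0.5, 0.5, 0.5, 0.5]      # every topic equally likely a priori
asymmetric = [2.0, 0.5, 0.25, 0.25]   # hypothetical: topic 0 expected in most documents

theta = sample_dirichlet(asymmetric, rng)  # topic proportions for one document
print(prior_mean(symmetric))   # [0.25, 0.25, 0.25, 0.25]
print(prior_mean(asymmetric))
print(theta)
```

Under the asymmetric prior, a few topics can absorb very frequent words while the rest stay specific, which is one intuition for the robustness claimed above.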

2.8 Abstractive Text Summarization

Consider using a two-stage decoding process for natural language generation tasks, where the initial stage uses a left-context-only decoder to produce a draft summary, followed by a refine decoder that considers context on both sides to generate each word of the summary, leading to improved naturalness and coherence of the generated text. (H. Zhang, Xu, and Wang 2019)

2.9 Automated Text Analysis

Consider utilizing large language models (LLMs) for coding open-text survey responses due to their demonstrated ability to achieve near-human accuracy, potentially saving time and resources compared to traditional human coding methods. (Mellon et al. 2022)

2.10 Automatic Event Timeline Generation

Carefully define and operationalize the concept of an event in historical texts, paying attention to its temporal, spatial, and actor components, and utilizing appropriate natural language processing tools and techniques for accurate extraction and representation. (Adak et al. 2022)

2.11 BERT

Consider using simple BERT-based models for relation extraction and semantic role labeling tasks, as they have been shown to achieve state-of-the-art performance without requiring external lexical or syntactic features. (P. Shi and Lin 2019)

2.12 COMET

Consider using generative models like COMET for automatic commonsense knowledge base construction, as they can transfer implicit knowledge from deep pre-trained language models to generate explicit, high-quality, and diverse commonsense knowledge in natural language. (Bosselut et al. 2019)

2.13 Commonsense Validation

Consider leveraging pre-trained transformer-based language models, particularly RoBERTa and GPT-2, for commonsense validation and explanation tasks, as they demonstrated strong performance across various subtasks in the SemEval2020 challenge. (Cer et al. 2018)

2.14 Comparing Text Representations

Utilize data-dependent complexity (DDC) to assess the compatibility between text representations and tasks, which avoids potential biases introduced by varying initializations, hyperparameters, and stochastic gradient descent during empirical evaluations. (Y. Liu et al. 2019)

2.15 Deep Learning for Event Extraction

Consider both pipeline-based and joint-based event extraction paradigms when working on event extraction tasks, taking into account the potential issue of error propagation in pipeline-based methods and the benefits of reducing error propagation in joint-based methods. (Q. Li et al. 2021)

2.16 Distant Supervision

Consider using distant supervision with a latent disjunction model for entity-event extraction tasks, particularly when dealing with limited labeled data, as it enables accurate identification of entities even when only some of their associated mentions convey relevant information. (Keith et al. 2017)

2.17 Evaluation of Large Language Models

Carefully consider what to evaluate, where to evaluate, and how to evaluate when developing evaluation protocols for large language models, taking into account the specific goals, available resources, and potential limitations of each dimension. (Chang et al. 2023)

2.18 Event Storylines

Focus on developing methods for accurately detecting and classifying temporal and causal relationships between events in news data, as demonstrated by the introduction of the Event StoryLine Corpus (ESC) v0.9 benchmark dataset and the establishment of the StoryLine Extraction task. (Mostafazadeh et al. 2016)

2.19 FRANK benchmark

Adopt a nuanced, multidimensional view of factuality when evaluating summarization models, rather than treating it as a simple binary concept, and utilize a comprehensive typology of factual errors to guide your analyses. (Pagnoni, Balachandran, and Tsvetkov 2021)

2.20 Factuality Evaluation

Develop comprehensive factuality evaluation benchmarks covering multiple domains, including world knowledge, science and technology, math, writing and recommendation, and reasoning, and annotate factual errors at the segment level with predefined error types and reference links to support or refute statements. (S. Chen et al. 2023)

2.21 GPT-all

Prioritize openness and reproducibility in your work by releasing your data, training procedures, and model parameters, as demonstrated by the authors themselves in creating the GPT4All-J and GPT4All-13B-snoozy models. (Anand et al. 2023)

2.22 Hypernym Discovery

Consider using a combination of lexico-syntactic patterns and natural language processing tools to extract hypernymy relationships from large-scale web corpora, while carefully considering issues of data quality and redundancy through strategies such as pattern precision estimation, sentence splitting, and tuple aggregation. (Hubert et al. 2023)
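
A single lexico-syntactic pattern and the tuple-aggregation step can be sketched as follows. The regular expression covers only the "X such as Y, Z, and W" construction with single-word terms, and the example sentences are invented; pattern precision estimation and sentence splitting are not shown.

```python
import re
from collections import Counter

# One classic pattern: "<hypernym> such as <hyponym>[, <hyponym> ... and <hyponym>]".
# Longer separators are listed first so ", and " is not read as ", " plus the word "and".
SEP = r"(?:, and |, or |, | and | or )"
SUCH_AS = re.compile(rf"(\w+) such as (\w+(?:{SEP}\w+)*)")

def extract_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1).lower()
        for hyponym in re.split(SEP, match.group(2)):
            pairs.append((hyponym.lower(), hypernym))
    return pairs

sentences = [
    "Fruits such as apples, pears, and plums are sweet.",
    "She likes fruits such as apples.",
]
# Tuple aggregation: count how often each pair is extracted across the corpus.
counts = Counter(pair for s in sentences for pair in extract_pairs(s))
print(counts.most_common())
```

Pairs seen many times across independent sentences are usually more trustworthy than one-off matches, which is the motivation for aggregating before filtering.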

2.23 JEEBench

Utilize challenging benchmarks like JEEBench to evaluate the problem-solving abilities of large language models (LLMs), as traditional benchmarks may not adequately capture the full range of difficulties encountered in real-world applications. (Arora and Singh 2023)

2.24 LEXNLP

Prioritize using established, open-source libraries with standard licenses, high levels of maturity, extensive documentation, broad platform and language support, and strong developer communities when conducting natural language processing and machine learning projects involving legal and regulatory text. (Bommarito, Katz, and Detterman 2018)

2.25 LLAMA model

Consider using open-source pre-trained models like Llama 2 due to their potential for faster development and wider accessibility, as evidenced by the success stories of early adopters in implementing various tasks such as model deployment, chatbot development, fine-tuning in different languages, domain-specific chatbot creation, parameter customization for CPU and GPU, and runtime efficiency optimization with limited resources. (Roumeliotis, Tselikas, and Nasiopoulos 2023)

2.26 Language Model Contamination

Develop and employ automatic or semi-automatic measures to detect data contamination in natural language processing benchmarks, build a registry of contamination cases, and address data contamination issues during peer review to ensure accurate and reliable results. (Sainz et al. 2023)

2.27 Language Model Hallucinations

Carefully consider and differentiate between two types of hallucinations in large language models: factuality hallucination, which involves generating false or inconsistent information about the real world, and faithfulness hallucination, which involves failing to accurately represent user instructions or provided context. (Huang et al. 2023)

2.28 Model Characteristics

Use a value order approach to establish monotone comparative statics of characteristic demand, which involves defining a partial order on the consumption set and utilizing lattice theoretical comparative statics or generalized monotone comparative statics to identify the sufficient conditions for monotonicity of income effects. (Shirai 2010)

2.29 Multilingual NLP

Consider decomposing inputs and outputs into smaller components, such as bytes and triples, to enable models to learn the interactions between those components and potentially achieve better performance in natural language processing tasks. (Gillick et al. 2015)
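
One way to picture the decomposition described above: the input becomes a sequence of UTF-8 byte values, and each annotation becomes a small triple. Reading "triples" as (start byte, length in bytes, label) is an assumption of this sketch, as are the helper names and the example sentence.

```python
def to_bytes(text: str) -> list[int]:
    """Represent text as UTF-8 byte values (0-255) instead of words or characters."""
    return list(text.encode("utf-8"))

def span_triple(text: str, mention: str, label: str) -> tuple[int, int, str]:
    """Describe an annotated span as (start byte, length in bytes, label)."""
    raw, target = text.encode("utf-8"), mention.encode("utf-8")
    start = raw.find(target)
    if start < 0:
        raise ValueError(f"{mention!r} not found in text")
    return (start, len(target), label)

# "\u00eb" is e-diaeresis and "\u00f6" is o-umlaut; each takes two bytes in UTF-8.
sentence = "Zo\u00eb lives in K\u00f6ln"
print(to_bytes("Zo\u00eb"))                        # [90, 111, 195, 171]
print(span_triple(sentence, "K\u00f6ln", "LOC"))  # (14, 5, 'LOC')
```

A byte vocabulary has at most 256 symbols regardless of language, so the same small model input layer can serve every script.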

2.30 NLPerformance

Be aware of the potential drawbacks of advanced prompting strategies, such as chain-of-thought and tree-of-thought, as they may not always provide consistent benefits and could negatively impact the performance of certain models, particularly smaller ones. (Song et al. 2023)

2.31 Named Entity Transliteration

Carefully consider the choice of transliteration approach, as the recent Tensor2Tensor Transformer architecture outperforms the traditional WFST approach and the Seq2Seq approach on every language, although it requires significantly more computational resources. (Merhav and Ash 2018)

2.32 Natural Language Processing

Be aware of potential discrepancies between your own beliefs and the actual distribution of beliefs within your field, as demonstrated by the finding that NLP researchers tend to overestimate their peers’ belief in the usefulness of benchmarks and scalability solutions, while underestimating their peers’ emphasis on linguistic structure, inductive bias, and interdisciplinary science. (Michael et al. 2022)

2.33 Neural Knowledge Language Models

Consider combining symbolic knowledge provided by knowledge graphs with RNN language models to improve the ability of language models to encode and decode knowledge, reduce perplexity, and generate fewer unknown words. (Ahn et al. 2016)

2.34 Neural Relation Extraction

Consider using a neural pattern diagnosis framework like DIAG-NRE to automatically summarize and refine high-quality relational patterns from noise data, thereby reducing the need for significant expert labor and enabling quick generalization to new relation types. (Zheng et al. 2018)

2.35 Never-Ending Language Learning

Utilize a combination of semi-supervised learning techniques, an ensemble of diverse knowledge extraction methods, and a versatile knowledge base representation to create a never-ending language learner that continually improves its performance. (Banko and Etzioni 2007)

2.36 News Summarization

Prioritize instruction tuning over model size when developing large language models for news summarization, as it leads to superior zero-shot summarization capabilities and avoids the pitfall of underestimating human performance due to low-quality reference summaries. (T. Zhang et al. 2023)

2.37 One Billion Word Benchmark

Prioritize working with large datasets and utilize advanced techniques such as character-level CNNs and importance sampling to improve the efficiency and accuracy of language modeling tasks. (Jozefowicz et al. 2016)

2.38 Open Source Large Language Models

Use a combination of LLM-based and traditional evaluation metrics to comprehensively assess the performance of open-source LLMs across a broad spectrum of tasks, in order to identify true advancements and the leading models. (H. Chen et al. 2023)

2.39 Pathways Language Model

Consider the potential for discontinuous improvements in model performance when scaling up large language models, as evidenced by the fact that the PaLM 540B model exhibited a drastic jump in accuracy compared to the PaLM 62B model on roughly 25% of the BIG-bench tasks. (Chowdhery et al. 2022)

2.40 Poincaré GloVe

Consider using hyperbolic embeddings for word representation tasks, as they offer several advantages over traditional Euclidean embeddings, including the ability to capture hierarchical relationships between words and improved performance on tasks such as similarity, analogy, and hypernymy detection. (Tifrea, Bécigneul, and Ganea 2018)
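
To show why hyperbolic space suits hierarchies, the sketch below computes distances in the Poincaré ball, one standard model of hyperbolic space. The points are invented, and treating this distance as the one used by the cited method is an assumption of the example.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

root = [0.0, 0.0]     # hypothetical general concept, placed at the origin
leaf_a = [0.9, 0.0]   # hypothetical specific concepts, placed near the boundary
leaf_b = [0.0, 0.9]

print(euclidean_distance(leaf_a, leaf_b))
print(poincare_distance(leaf_a, leaf_b))
print(poincare_distance(root, leaf_a))
```

Distances grow rapidly toward the boundary, so many specific concepts can sit far from one another while each stays comparatively close to a shared general concept near the origin, mirroring a tree.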

2.41 Pretrained Language Models

Carefully examine the types of associations that pre-trained language models (PLMs) rely on to capture factual knowledge, as the findings suggest that while PLMs tend to depend more on positionally close and frequently co-occurring associations, knowledge-dependent associations are actually more effective for accurate factual knowledge capture. (S. Li et al. 2022)

2.42 Pretraining

Consider using distant supervision to automatically generate pre-training examples that require long-range reasoning, rather than relying solely on local contexts of naturally occurring texts. (Deng et al. 2021)

2.43 Prompt-based Learning

Carefully consider the choice of pre-trained language model, prompt engineering strategy, and answer engineering approach when implementing prompt-based learning methods in natural language processing. (P. Liu et al. 2021)

2.44 Reinforcement Learning

Consider framing natural language processing tasks as Markov Decision Processes (MDPs) and utilizing reinforcement learning algorithms to optimize policies for handling sequences of actions and rewards within those tasks. (Uc-Cetina et al. 2022)

2.45 Robustness in NLP

Prioritize creating benchmarks with clearly differentiated and challenging distribution shifts to accurately evaluate out-of-distribution robustness in NLP models. (Yuan et al. 2023)

2.46 SCROLLS benchmark

Prioritize tasks requiring synthesis of information across long sequences when developing benchmarks for evaluating models designed to handle long texts. (Shaham et al. 2022)

2.47 Safety in Large Language Models

Incorporate both safe and unsafe prompts when evaluating large language models to ensure that models strike an appropriate balance between helpfulness and harmlessness, avoiding exaggerated safety behaviors that limit their usefulness. (Röttger et al. 2023)

2.48 Sequence Labeling

Consider utilizing a novel method for class-conditional feature detection from a large, expressive deep network, which allows for token-level predictions to be derived from document-level predictions, and for those token-level predictions to be approximately decomposed into an explicit weighting over a set of nearest exemplar representations and their associated labels and predictions. (Schmaltz 2019)

2.49 SkipThought Vectors

Consider using an encoder-decoder model for unsupervised learning of a generic, distributed sentence encoder, which can effectively capture semantic and syntactic properties of sentences and produce robust, high-performing sentence representations for various NLP tasks. (Kiros et al. 2015)

2.50 TACRED dataset

Consider combining high-quality labeled data with a powerful model that utilizes position-aware attention to improve relation extraction performance. (Zaremba, Sutskever, and Vinyals 2014)

2.51 Text Categorization

Consider utilizing pre-trained language models like BERT and fine-tuning them on domain-specific data to achieve superior performance in text categorization tasks, especially in scenarios with limited labeled data. (Beieler 2016)

2.52 TinyBERT

Employ a two-stage learning framework for efficient transfer of knowledge from a large pre-trained language model like BERT to a smaller model like TinyBERT, involving general distillation followed by task-specific distillation, to ensure optimal performance and generalizability. (Jiao et al. 2019)

2.53 Tool-augmented Language Models

Carefully consider domain diversity, API authenticity, API diversity, and evaluation authenticity when developing benchmarks for tool-augmented LLMs, as these factors significantly affect the validity and generalizability of the results. (M. Li et al. 2023)

2.54 Transformers

Consider using the proposed Attention Free Transformer (AFT) model, which eliminates the need for dot product self-attention and reduces memory complexity to linear w.r.t. both context size and feature dimension, leading to improved efficiency and competitive performance compared to traditional Transformer models. (Zhai et al. 2021)
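
The idea of attention without query-key dot products can be sketched with a simplified, position-free variant: keys are turned into per-feature softmax weights over positions, values are pooled once, and each query only gates the pooled result. This is a minimal pure-Python illustration under those assumptions; learned position biases and all projection layers are omitted, and the function name is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aft_simple(Q, K, V):
    """Position-free attention-free mixing.

    Q, K, V are lists of T rows with d features each. Every output row is
    sigmoid(Q_t) * (sum over t' of exp(K_t') * V_t') / (sum over t' of exp(K_t')),
    with all operations applied element-wise per feature.
    """
    d = len(K[0])
    # Subtract the per-feature maximum before exponentiating for numerical stability.
    k_max = [max(row[j] for row in K) for j in range(d)]
    weights = [[math.exp(row[j] - k_max[j]) for j in range(d)] for row in K]
    norm = [sum(w[j] for w in weights) for j in range(d)]
    # The pooled context is computed once and shared by all positions,
    # so no T-by-T attention matrix is ever built.
    context = [sum(w[j] * v[j] for w, v in zip(weights, V)) / norm[j] for j in range(d)]
    return [[sigmoid(q[j]) * context[j] for j in range(d)] for q in Q]

# Toy inputs (made-up numbers): 2 positions, 2 features.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[1.0, 2.0], [3.0, -1.0]]
V = [[4.0, 6.0], [4.0, 6.0]]
print(aft_simple(Q, K, V))
```

Because the values in the toy input are identical at every position, the pooled context equals that shared value, and a zero query gates it by one half.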

2.55 Vector Representations

Consider utilizing unambiguous resources such as Wikipedia when performing entity or concept embedding to avoid the issue of ambiguity inherent in existing word embedding approaches, leading to potentially better document representations. (Sherkat and Milios 2017)

2.56 Word Embeddings

Be aware of the limitations of using linear SVMs for hypernymy classification, as they may not truly capture the relationship between hyponym and hypernym, but instead detect differences in generality. (Vilnis and McCallum 2014)

2.57 Word Sense Disambiguation

Consider using a combination of manual, semi-automatic, automatic, and collaborative methods to create sense-annotated corpora for various resources and languages, as demonstrated by the success of existing datasets such as SemCor, MASC-WSA, SemEval, OntoNotes, Princeton Gloss, OMSTI, Wikipedia hyperlinks, SEW, BabelNet, SenseDefs, EuroSense, T-o-M, and OneSec. (Pasini and Camacho-Collados 2018)

2.58 XWIKIRE dataset

Consider framing relation extraction as a multilingual machine reading problem, leveraging resources like X-WikiRE, to improve cross-lingual transfer and enhance zero-shot relation extraction capabilities, ultimately leading to better knowledge base population. (Cer et al. 2017)

2.59 Zero-Shot and Few-Shot Learning

Be cautious when interpreting the performance of large language models (LLMs) in zero-shot and few-shot settings, as task contamination (the presence of task-relevant examples in the pre-training data) can lead to inflated performance estimates, particularly for datasets released prior to the LLM’s training data creation date. (C. Li and Flanigan 2023)
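
The date-based caution above can be turned into a simple first-pass check: flag any dataset that was public before the model's training data was collected. The model cutoff and benchmark dates below are hypothetical, and a flag only signals possible contamination, not proof of it.

```python
from datetime import date

def possibly_contaminated(dataset_release: date, training_data_cutoff: date) -> bool:
    """Flag datasets published before the model's training data was collected."""
    return dataset_release < training_data_cutoff

# Hypothetical model cutoff and benchmark release dates, for illustration only.
cutoff = date(2021, 9, 1)
benchmarks = {
    "older_benchmark": date(2019, 5, 1),
    "newer_benchmark": date(2023, 2, 1),
}
flags = {name: possibly_contaminated(released, cutoff) for name, released in benchmarks.items()}
print(flags)  # {'older_benchmark': True, 'newer_benchmark': False}
```

Reporting zero-shot and few-shot scores separately for flagged and unflagged datasets makes any inflation visible.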

3 Information retrieval

3.1 Causal Inference

Shift from a correlation-driven paradigm to a causality-driven paradigm in building recommender systems, as this can mitigate data biases, handle missing or noisy data, and enable the achievement of beyond-accuracy objectives such as fairness, explainability, and transparency. (Gao et al. 2024)
Incorporate causal inference techniques when developing recommender systems to mitigate bias, promote explainability, and improve generalization, as traditional approaches rely solely on correlational reasoning and fail to account for underlying causal mechanisms. (Zhu, Ma, and Li 2023)

3.2 Entity Linking

Consider combining both entity-content similarity and entity-entity similarity methods when performing named entity linking (NEL), as this approach has been shown to lead to improved performance compared to relying solely on entity popularity measures. (Čuljak et al. 2022)
Carefully consider using natural language processing techniques to extract event-location relationships from text data when traditional methods may be insufficient or unavailable. (Halterman 2019)

3.3 Disambiguation

Employ a joint embedding model that combines feature-entity, mention-entity, knowledge graph, and coherence embeddings to accurately perform named entity linking tasks, particularly when dealing with issues such as limited training data and ambiguous mentions. (W. Shi et al. 2020)

3.4 Evaluation Metrics

Be aware of the unachievable region in precision-recall space, which is a function of class skew and influences the minimum precision that can be achieved for a given recall level. Ignoring this unachievable region can lead to biased estimates of algorithm performance and misleading conclusions. (Boyd et al. 2012)
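
The boundary of the unachievable region follows from a worst-case argument: if every negative is ranked above every positive, then reaching recall r means all negatives are already false positives. With skew π (the fraction of positives), precision at recall r is therefore at least πr / (1 − π + πr). The sketch below implements that bound and the area beneath it; the function names are illustrative.

```python
import math

def min_precision(recall: float, skew: float) -> float:
    """Lowest precision any ranking can have at a given recall.

    skew is the fraction of positive examples in the data set.
    """
    if not 0.0 < skew < 1.0:
        raise ValueError("skew must be strictly between 0 and 1")
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be between 0 and 1")
    return skew * recall / (1.0 - skew + skew * recall)

def unachievable_area(skew: float) -> float:
    """Area under the minimum-precision curve: 1 + (1 - skew) * ln(1 - skew) / skew."""
    if not 0.0 < skew < 1.0:
        raise ValueError("skew must be strictly between 0 and 1")
    return 1.0 + (1.0 - skew) * math.log(1.0 - skew) / skew

for skew in (0.01, 0.1, 0.5):
    print(skew, min_precision(1.0, skew), unachievable_area(skew))
```

At full recall the minimum precision equals the skew itself, so areas under precision-recall curves from data sets with different class balances are not directly comparable.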

3.5 Keyword Extraction

Consider using a computer-assisted (human empowered) algorithm for keyword and document set discovery from unstructured text, as opposed to a fully automated algorithm, because human input is necessary to resolve the inherent ambiguity in natural language and ensure accurate identification of relevant documents. (King, Lam, and Roberts 2017)

3.6 Language Models

Consider leveraging Large Language Models (LLMs) as a self-contained recommender system for various recommendation tasks, as evidenced by the promising results obtained from the LLMRec benchmark study. (NA?)

3.7 Search Theory

Carefully consider the role of search frictions and acceptance constraints in shaping participation decisions in skilled labor markets, as these factors can lead to counterintuitive comparative static properties and the possibility of underinvestment. (Bidner, Roger, and Moses 2016)

3.8 Wikipedia Reading

Consider utilizing end-to-end deep neural network architectures for natural language understanding tasks, particularly for tasks requiring diverse forms of reasoning, as these models can operate on increasingly raw forms of text input and potentially eliminate intermediate processing steps. (Hewlett et al. 2016)

4 Semantic web

4.1 Knowledge Graph

Consider using a graph data model for your data, as it offers greater flexibility for integrating diverse sources of data compared to traditional relational models, and supports the application of advanced graph analytics techniques for gaining insights. (Hogan et al. 2021)
Carefully evaluate the suitability of different publicly available knowledge graphs for your specific needs, considering factors such as size, level of detail, content focus, and overlap with other knowledge graphs. (Heist et al. 2020)
Carefully consider the type of relation being studied when choosing a knowledge graph representation model, as different types of relations require different geometric relationships between word embeddings, and certain models may be better suited to specific types of relations. (Allen, Balažević, and Hospedales 2019)
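
The flexibility of a graph data model can be illustrated with a tiny store of (subject, predicate, object) triples. Facts from different sources are added without agreeing on a table schema first, and a path query follows one edge label transitively. The class, the predicate names, and the example facts are assumptions made for this sketch.

```python
from collections import defaultdict

class TripleStore:
    """A tiny directed, edge-labelled graph: each fact is (subject, predicate, object)."""

    def __init__(self):
        self._out = defaultdict(list)  # subject -> [(predicate, object), ...]

    def add(self, subject, predicate, obj):
        self._out[subject].append((predicate, obj))

    def objects(self, subject, predicate):
        """Direct neighbours of subject along edges with the given label."""
        return [o for p, o in self._out.get(subject, []) if p == predicate]

    def reachable(self, start, predicate):
        """All nodes reachable from start by repeatedly following one edge label."""
        seen, frontier = [], [start]
        while frontier:
            node = frontier.pop()
            for nxt in self.objects(node, predicate):
                if nxt not in seen:
                    seen.append(nxt)
                    frontier.append(nxt)
        return seen

kg = TripleStore()
# Hypothetical source A: geographic containment.
kg.add("Santiago", "located_in", "Chile")
kg.add("Chile", "located_in", "South America")
# Hypothetical source B: a different kind of fact about the same node, no schema change needed.
kg.add("Santiago", "population_source", "census")

print(kg.reachable("Santiago", "located_in"))  # ['Chile', 'South America']
```

In a relational design, adding the second source would typically require a new column or table; here it is just another labelled edge.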

4.2 Wikidata

Consider the importance of evaluating the multilinguality of community-driven knowledge bases, particularly in relation to their ontology and real-world language distribution, as demonstrated by the analysis of Wikidata labels revealing an unequal distribution of languages and a need for improvement in language coverage. (Kaffee et al. 2017)
Consider leveraging the power of collaborative platforms like Wikidata to integrate and link disparate data sources, thereby improving data quality, reducing duplication, and promoting openness and collaboration in scientific research. (Vrandečić 2012)
Consider developing real-time visualization tools to monitor and analyze large-scale collaborative datasets, such as Wikidata, in order to detect anomalies, identify trends, and gain insights into user behavior. (Suchanek, Kasneci, and Weikum 2007)

4.3 DBPedia

Consider extending existing models like LEMON to better accommodate legacy lexical data, particularly when dealing with complex linguistic phenomena such as underspecified relations and multiple lexical entries within a single Wiktionary page. (McCrae et al. 2012)
Consider utilizing the DBpedia FlexiFusion workflow to efficiently integrate and enhance the quality of your data, particularly when working with multiple language-specific databases. (Mendes, Mühleisen, and Bizer 2012)

4.4 Linked Data

Consider linking visual and semantic information when creating large-scale linked datasets, as demonstrated by the creation of IMGpedia, which combines visual descriptors and visual similarity relations for the images of Wikimedia Commons with metadata from DBpedia Commons and DBpedia. (Ferrada 2017)

5 Information extraction

5.1 Event Extraction

Consider combining world knowledge (such as Freebase) and linguistic knowledge (such as FrameNet) to automatically generate labeled data for large-scale event extraction, which can improve the performance of models learned from these data. (Y. Chen et al. 2017)
Consider combining rule-based systems with machine learning models to accurately extract event properties from text, particularly when dealing with complex sentences where grammatical information alone may not be enough to resolve ambiguities. (Blei and Lafferty 2007)

5.2 Open Information Extraction

Consider using a two-stage transformation process involving clausal and phrasal disembedding to convert complex sentences into hierarchical representations of core facts and associated contexts, preserving semantic relationships through rhetorical relations, before performing relation extraction. (Cetto et al. 2018)
Prioritize developing automated, efficient, and domain-independent Open Information Extraction (Open IE) systems that accurately extract relational tuples from text, while minimizing reliance on manual efforts and deep linguistic processing techniques. (Niklaus et al. 2018)

5.3 Infobox Extraction

Consider building probabilistic models for relation extraction from infobox tables, which can improve robustness to template changes, and use distant supervision to automatically generate training data for these models. Additionally, avoid over-trusting anchor links for entity disambiguation and instead develop entity linking systems that incorporate information from HTML anchors and contextual information surrounding the mention in the same infobox. Lastly, aim to preserve unlinkable entities in the final output to improve (Peng et al. 2019)

5.4 Relation Extraction

Carefully consider the importance of feature engineering in machine learning models, as demonstrated by the finding that a simpler classifier trained on similar features performed comparably to a more complex neural network system for the task of relation extraction from unstructured text. (Joulin et al. 2016)

5.5 Zero-shot Event Extraction

Consider employing unsupervised sentence simplification techniques to improve the accuracy of machine reading comprehension (MRC)-based event extraction models, particularly for long-range dependencies and complex sentence structures. (Mehta, Rangwala, and Ramakrishnan 2022)

References

Adak, Sayantan, Altaf Ahmad, Aditya Basu, and Animesh Mukherjee. 2022. “Placing (Historical) Facts on a Timeline: A Classification Cum Coref Resolution Approach.” arXiv. https://doi.org/10.48550/ARXIV.2206.14089.

Adams, Griffin, Alexander Fabbri, Faisal Ladhak, Eric Lehman, and Noémie Elhadad. 2023. “From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting.” arXiv. https://doi.org/10.48550/ARXIV.2309.04269.

Ahn, Sungjin, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. “A Neural Knowledge Language Model.” arXiv. https://doi.org/10.48550/ARXIV.1608.00318.

Albuquerque, Pedro H. M., Denis Ribeiro do Valle, and Daijiang Li. 2019. “Bayesian LDA for Mixed-Membership Clustering Analysis: The Rlda Package.” Knowledge-Based Systems 163 (January). https://doi.org/10.1016/j.knosys.2018.10.024.

Aldarmaki, Hanan, and Mona Diab. 2019. “Context-Aware Cross-Lingual Mapping.” arXiv. https://doi.org/10.48550/ARXIV.1903.03243.

Allen, Carl, Ivana Balažević, and Timothy Hospedales. 2019. “Interpreting Knowledge Graph Relation Representation from Word Embeddings.” arXiv. https://doi.org/10.48550/ARXIV.1909.11611.

Anand, Yuvanesh, Zach Nussbaum, Adam Treat, Aaron Miller, Richard Guo, Ben Schmidt, GPT4All Community, Brandon Duderstadt, and Andriy Mulyar. 2023. “GPT4All: An Ecosystem of Open Source Compressed Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2311.04931.

Arora, Daman, and Himanshu Gaurav Singh. 2023. “Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark for Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2305.15074.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv. https://doi.org/10.48550/ARXIV.1409.0473.

Banko, Michele, and Oren Etzioni. 2007. “Strategies for Lifelong Knowledge Extraction from the Web.” Proceedings of the 4th International Conference on Knowledge Capture, October. https://doi.org/10.1145/1298406.1298425.

Beieler, John. 2016. “Generating Politically-Relevant Event Data.” arXiv. https://doi.org/10.48550/ARXIV.1609.06239.

Bidner, Chris, Guillaume Roger, and Jessica Moses. 2016. “Investing in Skill and Searching for Coworkers: Endogenous Participation in a Matching Market.” American Economic Journal: Microeconomics 8 (February). https://doi.org/10.1257/mic.20140110.

Black, Sid, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. “GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow,” March. https://doi.org/10.5281/ZENODO.5297715.

Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” The Annals of Applied Statistics 1 (June). https://doi.org/10.1214/07-aoas114.

Bommarito, Michael J, Daniel Martin Katz, and Eric M Detterman. 2018. “LexNLP: Natural Language Processing and Information Extraction for Legal and Regulatory Texts.” arXiv. https://doi.org/10.48550/ARXIV.1806.03688.

Bosselut, Antoine, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. “COMET: Commonsense Transformers for Automatic Knowledge Graph Construction.” arXiv. https://doi.org/10.48550/ARXIV.1906.05317.

Boyd, Kendrick, Vitor Santos Costa, Jesse Davis, and David Page. 2012. “Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation.” arXiv. https://doi.org/10.48550/ARXIV.1206.4667.

Cao, Yuwei, William Groves, Tanay Kumar Saha, Joel R. Tetreault, Alex Jaimes, Hao Peng, and Philip S. Yu. 2022. “XLTime: A Cross-Lingual Knowledge Transfer Framework for Temporal Expression Extraction.” arXiv. https://doi.org/10.48550/ARXIV.2205.01757.

Cer, Daniel, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. “SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation.” Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). https://doi.org/10.18653/v1/s17-2001.

Cer, Daniel, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, et al. 2018. “Universal Sentence Encoder.” arXiv. https://doi.org/10.48550/ARXIV.1803.11175.

Cetto, Matthias, Christina Niklaus, André Freitas, and Siegfried Handschuh. 2018. “Graphene: A Context-Preserving Open Information Extraction System.” arXiv. https://doi.org/10.48550/ARXIV.1808.09463.

Chang, Yupeng, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, et al. 2023. “A Survey on Evaluation of Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2307.03109.

Chen, Hailin, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, and Shafiq Joty. 2023. “ChatGPT’s One-Year Anniversary: Are Open-Source Large Language Models Catching Up?” arXiv. https://doi.org/10.48550/ARXIV.2311.16989.

Chen, Shiqi, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He. 2023. “FELM: Benchmarking Factuality Evaluation of Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2310.00741.

Chen, Yubo, Shulin Liu, Xiang Zhang, Kang Liu, and Jun Zhao. 2017. “Automatically Labeled Data Generation for Large Scale Event Extraction.” Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/p17-1038.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv. https://doi.org/10.48550/ARXIV.2204.02311.

Čuljak, Marko, Andreas Spitz, Robert West, and Akhil Arora. 2022. “Strong Heuristics for Named Entity Linking.” arXiv. https://doi.org/10.48550/ARXIV.2207.02824.

Deng, Xiang, Yu Su, Alyssa Lees, You Wu, Cong Yu, and Huan Sun. 2021. “ReasonBERT: Pre-Trained to Reason with Distant Supervision.” arXiv. https://doi.org/10.48550/ARXIV.2109.04912.

Dong, Zhendong, and Qiang Dong. 2006. “HowNet and the Computation of Meaning,” February. https://doi.org/10.1142/5935.

Egami, Naoki, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. 2018. “How to Make Causal Inferences Using Texts,” February. http://arxiv.org/abs/1802.02163v1.

Ferrada, Sebastián. 2017. “IMGpedia Dataset.” https://doi.org/10.6084/M9.FIGSHARE.4991099.V2.

Gao, Chen, Yu Zheng, Wenjie Wang, Fuli Feng, Xiangnan He, and Yong Li. 2024. “Causal Inference in Recommender Systems: A Survey and Future Directions.” ACM Transactions on Information Systems, January. https://doi.org/10.1145/3639048.

Geman, Stuart, and Donald Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (November). https://doi.org/10.1109/tpami.1984.4767596.

Gillick, Dan, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. “Multilingual Language Processing from Bytes.” arXiv. https://doi.org/10.48550/ARXIV.1512.00103.

Halterman, Andrew. 2019. “Geolocating Political Events in Text.” Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science. https://doi.org/10.18653/v1/w19-2104.

Heist, Nicolas, Sven Hertling, Daniel Ringler, and Heiko Paulheim. 2020. “Knowledge Graphs on the Web – an Overview.” arXiv. https://doi.org/10.48550/ARXIV.2003.00719.

Hewlett, Daniel, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. “WikiReading: A Novel Large-Scale Language Understanding Task over Wikipedia.” arXiv. https://doi.org/10.48550/ARXIV.1608.03542.

Hogan, Aidan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, et al. 2021. “Knowledge Graphs.” ACM Computing Surveys 54 (July). https://doi.org/10.1145/3447772.

Huang, Lei, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, et al. 2023. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.” arXiv. https://doi.org/10.48550/ARXIV.2311.05232.

Hubert, Nicolas, Heiko Paulheim, Pierre Monnin, Armelle Brun, and Davy Monticolo. 2023. “Schema First! Learn Versatile Knowledge Graph Embeddings by Capturing Semantics with MASCHInE.” Proceedings of the 12th Knowledge Capture Conference 2023, December. https://doi.org/10.1145/3587259.3627550.

Jiao, Xiaoqi, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. “TinyBERT: Distilling BERT for Natural Language Understanding.” arXiv. https://doi.org/10.48550/ARXIV.1909.10351.

Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” arXiv. https://doi.org/10.48550/ARXIV.1607.01759.

Jozefowicz, Rafal, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. “Exploring the Limits of Language Modeling.” arXiv. https://doi.org/10.48550/ARXIV.1602.02410.

Kaffee, Lucie-Aimée, Alessandro Piscopo, Pavlos Vougiouklis, Elena Simperl, Leslie Carr, and Lydia Pintscher. 2017. “A Glimpse into Babel.” Proceedings of the 13th International Symposium on Open Collaboration, August. https://doi.org/10.1145/3125433.3125465.

Keith, Katherine A., Abram Handler, Michael Pinkham, Cara Magliozzi, Joshua McDuffie, and Brendan O’Connor. 2017. “Identifying Civilians Killed by Police with Distantly Supervised Entity-Event Extraction.” arXiv. https://doi.org/10.48550/ARXIV.1707.07086.

King, Gary, Patrick Lam, and Margaret E. Roberts. 2017. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” American Journal of Political Science 61 (April). https://doi.org/10.1111/ajps.12291.

Kiperwasser, Eliyahu, and Yoav Goldberg. 2016. “Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations.” arXiv. https://doi.org/10.48550/ARXIV.1603.04351.

Kiros, Ryan, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. “Skip-Thought Vectors.” arXiv. https://doi.org/10.48550/ARXIV.1506.06726.

Lee, Ariel N., Cole J. Hunter, and Nataniel Ruiz. 2023. “Platypus: Quick, Cheap, and Powerful Refinement of LLMs.” arXiv. https://doi.org/10.48550/ARXIV.2308.07317.

Li, Changmao, and Jeffrey Flanigan. 2023. “Task Contamination: Language Models May Not Be Few-Shot Anymore.” arXiv. https://doi.org/10.48550/ARXIV.2312.16337.

Li, Minghao, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. “API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs.” arXiv. https://doi.org/10.48550/ARXIV.2304.08244.

Li, Qian, Jianxin Li, Jiawei Sheng, Shiyao Cui, Jia Wu, Yiming Hei, Hao Peng, et al. 2021. “A Survey on Deep Learning Event Extraction: Approaches and Applications.” arXiv. https://doi.org/10.48550/ARXIV.2107.02126.

Li, Shaobo, Xiaoguang Li, Lifeng Shang, Zhenhua Dong, Chengjie Sun, Bingquan Liu, Zhenzhou Ji, Xin Jiang, and Qun Liu. 2022. “How Pre-Trained Language Models Capture Factual Knowledge? A Causal-Inspired Analysis.” arXiv. https://doi.org/10.48550/ARXIV.2203.16747.

Liu, Pengfei, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. “Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.” arXiv. https://doi.org/10.48550/ARXIV.2107.13586.

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv. https://doi.org/10.48550/ARXIV.1907.11692.

McCrae, John, Guadalupe Aguado-de-Cea, Paul Buitelaar, Philipp Cimiano, Thierry Declerck, Asunción Gómez-Pérez, Jorge Gracia, et al. 2012. “Interchanging Lexical Resources on the Semantic Web.” Language Resources and Evaluation 46 (May). https://doi.org/10.1007/s10579-012-9182-3.

Mehta, Sneha, Huzefa Rangwala, and Naren Ramakrishnan. 2022. “Improving Zero-Shot Event Extraction via Sentence Simplification.” arXiv. https://doi.org/10.48550/ARXIV.2204.02531.

Mellon, Jonathan, Jack Bailey, Ralph Scott, James Breckwoldt, and Marta Miori. 2022. “Does GPT-3 Know What the Most Important Issue Is? Using Large Language Models to Code Open-Text Social Survey Responses at Scale.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4310154.

Mendes, Pablo N., Hannes Mühleisen, and Christian Bizer. 2012. “Sieve.” Proceedings of the 2012 Joint EDBT/ICDT Workshops, March. https://doi.org/10.1145/2320765.2320803.

Merhav, Yuval, and Stephen Ash. 2018. “Design Challenges in Named Entity Transliteration.” arXiv. https://doi.org/10.48550/ARXIV.1808.02563.

Michael, Julian, Ari Holtzman, Alicia Parrish, Aaron Mueller, Alex Wang, Angelica Chen, Divyam Madaan, et al. 2022. “What Do NLP Researchers Believe? Results of the NLP Community Metasurvey.” arXiv. https://doi.org/10.48550/ARXIV.2208.12852.

Mostafazadeh, Nasrin, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. “A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories.” arXiv. https://doi.org/10.48550/ARXIV.1604.01696.

Niklaus, Christina, Matthias Cetto, André Freitas, and Siegfried Handschuh. 2018. “A Survey on Open Information Extraction.” arXiv. https://doi.org/10.48550/ARXIV.1806.05599.

Pagnoni, Artidoro, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. “Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics.” arXiv. https://doi.org/10.48550/ARXIV.2104.13346.

Panchenko, Alexander, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann. 2017. “Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl.” arXiv. https://doi.org/10.48550/ARXIV.1710.01779.

Pasini, Tommaso, and Jose Camacho-Collados. 2018. “A Short Survey on Sense-Annotated Corpora.” arXiv. https://doi.org/10.48550/ARXIV.1802.04744.

Peng, Boya, Yejin Huh, Xiao Ling, and Michele Banko. 2019. “Improving Knowledge Base Construction from Robust Infobox Extraction.” Proceedings of the 2019 Conference of the North. https://doi.org/10.18653/v1/n19-2018.

Qi, Fanchao, Chenghao Yang, Zhiyuan Liu, Qiang Dong, Maosong Sun, and Zhendong Dong. 2019. “OpenHowNet: An Open Sememe-Based Lexical Knowledge Base.” arXiv. https://doi.org/10.48550/ARXIV.1901.09957.

Röttger, Paul, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2023. “XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2308.01263.

Roumeliotis, Konstantinos I., Nikolaos D. Tselikas, and Dimitrios K. Nasiopoulos. 2023. “Llama 2: Early Adopters’ Utilization of Meta’s New Open-Source Pretrained Model,” August. https://doi.org/10.20944/preprints202307.2142.v2.

Sainz, Oscar, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. “NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for Each Benchmark.” arXiv. https://doi.org/10.48550/ARXIV.2310.18018.

Santu, Shubhra Kanti Karmaker, and Dongji Feng. 2023. “TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks.” arXiv. https://doi.org/10.48550/ARXIV.2305.11430.

Schmaltz, Allen. 2019. “Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition.” arXiv. https://doi.org/10.48550/ARXIV.1906.01154.

Shaham, Uri, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, et al. 2022. “SCROLLS: Standardized CompaRison over Long Language Sequences.” arXiv. https://doi.org/10.48550/ARXIV.2201.03533.

Sherkat, Ehsan, and Evangelos Milios. 2017. “Vector Embedding of Wikipedia Concepts and Entities.” arXiv. https://doi.org/10.48550/ARXIV.1702.03470.

Shi, Peng, and Jimmy Lin. 2019. “Simple BERT Models for Relation Extraction and Semantic Role Labeling.” arXiv. https://doi.org/10.48550/ARXIV.1904.05255.

Shi, Wei, Siyuan Zhang, Zhiwei Zhang, Hong Cheng, and Jeffrey Xu Yu. 2020. “Joint Embedding in Named Entity Linking on Sentence Level.” arXiv. https://doi.org/10.48550/ARXIV.2002.04936.

Shirai, Koji. 2010. “Monotone Comparative Statics of Characteristic Demand.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.1553547.

Smith, Noah A. 2020. “Contextual Word Representations.” Communications of the ACM 63 (May). https://doi.org/10.1145/3347145.

Song, Linxin, Jieyu Zhang, Lechao Cheng, Pengyuan Zhou, Tianyi Zhou, and Irene Li. 2023. “NLPBench: Evaluating Large Language Models on Solving NLP Problems.” arXiv. https://doi.org/10.48550/ARXIV.2309.15630.

Suchanek, Fabian M., Gjergji Kasneci, and Gerhard Weikum. 2007. “Yago.” Proceedings of the 16th International Conference on World Wide Web, May. https://doi.org/10.1145/1242572.1242667.

Tifrea, Alexandru, Gary Bécigneul, and Octavian-Eugen Ganea. 2018. “Poincaré GloVe: Hyperbolic Word Embeddings.” arXiv. https://doi.org/10.48550/ARXIV.1810.06546.

Uc-Cetina, Víctor, Nicolás Navarro-Guerrero, Anabel Martin-Gonzalez, Cornelius Weber, and Stefan Wermter. 2022. “Survey on Reinforcement Learning for Language Processing.” Artificial Intelligence Review 56 (June). https://doi.org/10.1007/s10462-022-10205-5.

Valmeekam, Karthik, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2022. “PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change.” arXiv. https://doi.org/10.48550/ARXIV.2206.10498.

Vilnis, Luke, and Andrew McCallum. 2014. “Word Representations via Gaussian Embedding.” arXiv. https://doi.org/10.48550/ARXIV.1412.6623.

Vrandečić, Denny. 2012. “Wikidata.” Proceedings of the 21st International Conference on World Wide Web, April. https://doi.org/10.1145/2187980.2188242.

Yuan, Lifan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023. “Revisiting Out-of-Distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations.” arXiv. https://doi.org/10.48550/ARXIV.2306.04618.

Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals. 2014. “Recurrent Neural Network Regularization.” arXiv. https://doi.org/10.48550/ARXIV.1409.2329.

Zhai, Shuangfei, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. 2021. “An Attention Free Transformer.” arXiv. https://doi.org/10.48550/ARXIV.2105.14103.

Zhang, Haoyu, Jianjun Xu, and Ji Wang. 2019. “Pretraining-Based Natural Language Generation for Text Summarization.” arXiv. https://doi.org/10.48550/ARXIV.1902.09243.

Zhang, Tianyi, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2023. “Benchmarking Large Language Models for News Summarization.” arXiv. https://doi.org/10.48550/ARXIV.2301.13848.

Zheng, Shun, Xu Han, Yankai Lin, Peilin Yu, Lu Chen, Ling Huang, Zhiyuan Liu, and Wei Xu. 2018. “DIAG-NRE: A Neural Pattern Diagnosis Framework for Distantly Supervised Neural Relation Extraction.” arXiv. https://doi.org/10.48550/ARXIV.1811.02166.

Zhu, Yaochen, Jing Ma, and Jundong Li. 2023. “Causal Inference in Recommender Systems: A Survey of Strategies for Bias Mitigation, Explanation, and Generalization.” arXiv. https://doi.org/10.48550/ARXIV.2301.00910.