The paper links are added now. If you are one of the 1000 productive authors, we will ask you to vote soon!
Note on paper #18: this great paper by Amanda and Lyn was nominated by area chairs. But Amanda and Lyn graciously asked to exclude it from the award selection because they are NAACL-HLT 2018 organizers :-) So we keep it in the nomination list, but won’t include it in the final award selection. -Heng
(in alphabetical order by title)
Title: A General, Abstract Model of Incremental Dialogue Processing
Authors: David Schlangen and Gabriel Skantze
Justification: The paper presented a general model and a conceptual framework for incremental dialog processing, that is, how dialog systems should be able to process information not utterance-by-utterance, but in a continuous fashion, allowing for much more fluent and human-like interaction. At the same conference, the authors also presented another paper, describing the world’s first fully incremental dialog system (although in a very limited domain), based on the proposed model. This work has inspired numerous studies on incremental processing in dialog systems, and incremental processing continues to be, alongside dialog state tracking and neural modeling, one of the hottest areas of dialog systems research. At the time of nomination, the paper has 182 citations on Google Scholar.
Title: A Linear Programming Formulation for Global Inference in Natural Language Tasks
Authors: Dan Roth and Scott (Wen-Tau) Yih
Justification: The ILP formulation introduced in Roth & Yih (2004) has changed the way the research community thinks about global inference in natural language processing and has had an impact on all areas of NLP, from syntax to summarization, information extraction, and multiple tasks in semantics. It introduced a new technical language that is now mainstream and a modeling tool that researchers have used broadly to significantly advance many NLP applications, and it has triggered a range of research questions that have challenged and advanced our understanding of some of the key issues in natural language inference.
Title: An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems
Authors: Ehud Reiter and Anja Belz
Justification: The paper explores the relation between automated evaluation metrics (such as BLEU and ROUGE) and human evaluation of NLG systems. The findings of the paper suggest that although automatic metrics can be useful for predicting the linguistic quality of generated text, they cannot capture the quality of the generated content, which is very important for NLG systems. Automatic evaluation has been a matter of ongoing debate, not only in NLG, and this paper has played a crucial role in bolstering the view among many in the field that the most important results need to be backed up by human evaluation to be accepted. The paper is still highly cited, and it is even more relevant now that interest in NLG is growing due to increasing industrial interest in spoken dialogue systems and personal assistants.
Title: An Unsupervised Method for Word Sense Tagging using Parallel Corpora
Authors: Mona Diab and Philip Resnik
Justification: This was the first paper to successfully use large-scale cross-lingual projections for semantic representations, specifically in sense disambiguation. It extends Diab’s work from 2000 (ACL Workshop on Word Senses and Multi-Linguality) on using parallel corpora for projecting senses, extending the notion of context cross-linguistically on a large scale. The techniques in this paper helped propel a whole slew of research into leveraging cross-lingual projections for semantics and multilingual resource creation, bootstrapping labeled data and knowledge resources for other languages. This paper, together with contemporaneous work by Yarowsky, Ngai, and Wicentowski (2001), was foundational in launching cross-lingual projection work across parallel corpora on NLP tasks ranging from semantics, multilingual resource creation, and information extraction to syntax. The paper has been cited 268 times; the latest citations include several from 2017 in articles written in languages other than English.
Title: Anaphora and Discourse Structure
Authors: Bonnie Webber, Matthew Stone, Aravind Joshi, Alistair Knott
Justification: This paper helped to establish the theoretical basis for the Penn Discourse Treebank (PDTB), which has catalyzed a new wave of research in discourse parsing, as demonstrated in the CoNLL 2015 shared task. In this paper, Webber et al. propose a new relationship between discourse structure and semantics, arguing that adverbial discourse cue phrases (e.g., then, instead, otherwise) function as anaphors, linking the matrix clause to the discourse context. This made it possible to develop structurally simpler models of discourse, providing a new perspective on earlier debates about whether discourse could be viewed as a tree. The paper also provides support for the PDTB’s turn towards local discourse phenomena, which are more practical to annotate and model computationally.
Title: BLEU: a Method for Automatic Evaluation of Machine Translation
Authors: Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu
Justification: This paper has had a long-lasting and ongoing impact on the field of machine translation, in both the research community and industry. The metric is a standard measure of translation quality and has helped advance the state of the art of machine translation systems.
Title: Cheap and Fast—But is it Good?: Evaluating Non-Expert Annotations for Natural Language Tasks
Authors: Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Y. Ng
Justification: It was the first paper (to our knowledge) to use MTurk in NLP, which is now pervasive.
Title: Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
Author: Michael Collins
Justification: A key element of determining the “Test of Time” paper is: does this paper still influence work today? This paper laid the foundation for how a host of machine learning methods could be used across a range of NLP tasks. The idea behind the paper is simple and beautiful, as it applies the well-known, and very old, Perceptron algorithm to structured prediction problems. This simple method achieved remarkably good results, opening the door for a range of complex NLP prediction tasks to use relatively simple ML methods with great results. This work led directly to a few different models that came to dominate a range of NLP tasks, such as information extraction and parsing. While the paper is a strong empirical contribution, it includes a theoretical analysis as well. As a result, this is among the most widely cited papers from the ACL community in the past two decades.
Title: Evaluating Content Selection in Summarization: The Pyramid Method
Authors: Ani Nenkova and Rebecca Passonneau
Justification: The Pyramid Method is one of the most widely used methods for consensus-based evaluation and has been used time after time across all evaluations of summarization (mono-lingual, cross-/multi-lingual or other). It is a very well studied and well documented process which has provided invaluable insights into the subjectivity of human summarization and evaluation, and has also suggested ways to deal with the challenges it poses.
Title: Frustratingly Easy Domain Adaptation
Author: Hal Daumé III
Justification: This paper has had a huge practical impact: the idea is simple to understand and implement, the paper has over 1000 citations, and the idea of feature augmentation for domain adaptation continues to be important, even in the new neural era of NLP.
Title: Minimum Error Rate Training In Statistical Machine Translation
Author: Franz Och
Justification: This paper presents a method for directly optimizing the non-differentiable BLEU score, a critical algorithm for bringing Statistical Machine Translation to a usable level of quality. It’s a novel and really cool algorithm with an efficient implementation. Many other tuning algorithms have succeeded it (and Neural Machine Translation has taken the wind out of tuning these days), but MERT is often the simplest and best working thing to use. Bonus points for David Chiang’s efficient and elegant pure C implementation of MERT, which is very widely used and pops up where you least expect it.
Title: Modeling Local Coherence: An entity-based approach
Authors: Regina Barzilay and Mirella Lapata
Justification: This is the most influential data-driven coherence model. It is a good example of a computational model inspired by theory (Centering) that learns preferences from data rather than imposing hard constraints. The paper introduced a framework for representing documents and features for sentence flow, producing a very competitive model. The model inspired much follow-up work that explored additions to the entity framework as well as to the document representation. This paper remains a comparison model in coherence papers to this day.
Title: Probabilistic Text Structuring: Experiments with Sentence Ordering
Author: Mirella Lapata
Justification: This is the first paper to introduce a probabilistic approach to coherence. It introduced the idea of learning sentence ordering constraints from a large corpus of regular documents, paving the way for unsupervised coherence models. This paper opened up both the area of unsupervised probabilistic models for coherence and the data-driven approach to learning them. In that way, the paper made a big mark on the field.
Title: Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis
Authors: Theresa Wilson, Janyce Wiebe, Paul Hoffmann
Justification: Sentiment analysis is one of the earliest impactful NLP tasks that continues to be widely applied in industry to analyze customer reviews, survey responses, customer service logs, social media posts, news, and healthcare data. Wilson, Wiebe and Hoffmann’s (2005) paper pioneered the problem of context-dependent phrase-level sentiment analysis, and has become the reference work for anyone looking at finer-grained aspects of sentiment analysis. The paper introduced a linguistically motivated machine learning approach to automatically identify the contextual polarity of a large subset of sentiment expressions. This work has had both research and data impact. The paper provided a gamut of research contributions: it developed intuitions about linguistic phenomena of fine-grained contextual polarity, provided a corpus annotation study, developed a lexical resource, and provided empirical studies of machine learning experiments. Over the years it has influenced research in multiple NLP areas such as sentiment analysis, social media analysis and argumentation. The work produced a dataset that added contextual polarity judgments to existing annotations in the Multi-perspective Question Answering (MPQA) corpus. MPQA is one of the most widely used datasets for sentiment and opinion mining, including target-based analysis. The authors also released a sentiment lexicon that has been widely used as a resource for building subjectivity and opinion detection systems. The ideas discussed by the researchers in this paper continue to be relevant today, as the need for separating fact from opinion in news and social media is more pressing than ever. With over 2500 Google Scholar citations (as of 12/30/2017), and 295 Google Scholar citations in 2017 alone, this work has stood the test of time.
Title: Sentence Level Discourse Parsing using Syntactic and Lexical Information
Authors: Radu Soricut and Daniel Marcu
Justification: This paper presented the first probabilistic approach to discourse parsing in the Rhetorical Structure Theory (RST) framework, with fundamental impact on subsequent work. Soricut and Marcu introduced probabilistic models for discourse-unit segmentation and sentence-level discourse parsing; they showed that, at the sentence level, there exist strong ties between syntax and discourse which can be exploited to give rise to an effective parser. Their approach and findings not only continue to inspire modern discourse parsers, but also boosted the integration of RST-style discourse structure into other NLP tasks, such as summarization and sentiment analysis.
Title: TextRank: Bringing Order into Texts
Authors: Rada Mihalcea and Paul Tarau
Justification: TextRank has been commonly used as a baseline for both extractive and abstractive summarization systems and is a milestone for graph methods in summarization. The paper highlighted the use of the proposed algorithm across sub-fields (keyword extraction, sentence extraction), demonstrating generic applicability and robustness. It also highlighted the value of unsupervised methods in a highly “supervised” research scene at the time.
Title: Thumbs up?: Sentiment Classification using Machine Learning Techniques
Authors: Bo Pang, Lillian Lee, Shivakumar Vaithyanathan
Justification: Sentiment analysis is one of the earliest NLP tasks that has had direct real-world impact in a number of industries. It now has widespread and practical application in review mining, customer management, social media analysis, news analysis, healthcare support and decision support in the finance industry. Pang, Lee and Vaithyanathan’s (2002) paper is a pioneering work that has enabled NLP to make this impact. It is amongst the first works in sentiment analysis and helped define the subfield of sentiment and opinion analysis and review mining. It has become the go-to paper for anyone starting work in this area. This work has had research, application and data impact. The paper introduced a new way to look at document classification, developed the first solutions to it using several machine learning methods and feature combinations, and presented insights into and challenges of sentiment classification. Beyond the task formulation and technical methods, this paper also had significant data impact. The movie review dataset has supported much of the early work in this area and it is still one of the commonly used benchmark evaluation datasets. There are two key reasons for this success: (a) emphasis on making the data widely available; and (b) careful curation of the data, for example to avoid domination by prolific reviewers. The data is extensively used in courses, and is part of NLTK as a core starting application for students interested in NLP. The insights and challenges discussed in this work have provided the basis of many works and are still driving new research today. According to recent statistics it is the highest cited EMNLP paper. With over 6800 Google Scholar citations, and over 400 Google Scholar citations in 2017 alone, this work has stood the test of time. Given the award time constraints, it is the last opportunity for this paper to be considered.
Title: Trainable sentence planning for complex information presentation in spoken dialog systems
Authors: Amanda Stent, Rashmi Prasad and Marilyn Walker
Justification: This paper introduced SPaRKy (Sentence Planning with Rhetorical Knowledge), the first trainable approach to sentence planning in natural language generation to use rhetorical relations to structure the discourse. SPaRKy uses hand-crafted sentence planning rules to generate candidate sentence plans which are then ranked by a trained sentence plan ranker. Experimental results indicated that the top-ranked sentence plan scored within 10% on average of the best human-ranked sentence plan. While recent papers on end-to-end NLG based on recurrent neural nets avoid the need for hand-crafted rules entirely, they do not take rhetorical/discourse relations into account, which have long been considered to be central to achieving coherence in NLG; in this respect, the paper continues to be a key reference point more than a dozen years later. Together with a follow-up journal article (JAIR-07), it has been cited 186 times according to Google Scholar, making it one of the most cited papers in Natural Language Generation.
Title: Unsupervised Discovery of Morphemes
Authors: Mathias Creutz and Krista Lagus
Justification: The paper “Unsupervised Discovery of Morphemes” by Mathias Creutz and Krista Lagus, first presented at the 2002 ACL Workshop on Morphological and Phonological Learning, is an influential, frequently cited paper in the area of Phonology, Morphology and Segmentation. It presents two unsupervised algorithms for segmentation of words into possibly lengthy sequences of morpheme-like units, one based on the Minimum Description Length principle (building on prior work by Goldsmith (2001)) and one based on Maximum Likelihood Estimation. The algorithms are tested on English and Finnish and are shown to be particularly appropriate for languages with agglutinative morphological structure such as Finnish. The ideas in the paper formed the basis for the first version of Morfessor, an open-source morphological segmenter which has been widely used within the community to segment text for use in applications such as speech recognition, information retrieval and machine translation, and which has served as a baseline for subsequent segmentation approaches (e.g. the Poon et al. 2009 NAACL best paper).