The wonderful NAACL 2018 PC co-chairs asked for some thoughts on how to make computational linguistics papers linguistically informative. My take on this is that not all NAACL papers need to be linguistically informative—but they should all be linguistically informed. Here are four steps to achieving that, which I hope will be timely for short paper authors, and for everyone ahead of the camera-ready deadline:
Step 1: Know Your Data
Open up the training (or dev) data and look inside. If it’s in a language you know, read (or listen to) a good chunk of it. (And if it’s not, find someone who can read/listen to it and understand it for you.) Think about the properties it has. Is it well-edited text/scripted speech, or is it spontaneous and somewhat messy? What language varieties are represented? Are there any properties of variety or genre that you might expect to trip up any pre-trained components you are using? Are the sentences/utterances long and complicated, or relatively short and simple? Are there repeated items in the dataset, and are they there for good reason (e.g. standard section headers in medical documents)? These are a few relatively generic questions I could think of without looking at your data—any given dataset will have its own specific properties of interest that go well beyond these.
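A few of these generic questions can be surfaced with a quick script before you ever train anything. Here is a minimal sketch (the `inspect_corpus` helper, the medical-notes toy corpus, and the specific statistics are my own illustrative choices, not a standard tool); nothing replaces actually reading the sample it hands back to you:

```python
import random
from collections import Counter

def inspect_corpus(sentences, sample_size=5, seed=0):
    """Surface a few generic properties of a text dataset worth eyeballing."""
    random.seed(seed)
    sample = random.sample(sentences, min(sample_size, len(sentences)))
    lengths = [len(s.split()) for s in sentences]
    duplicates = [s for s, n in Counter(sentences).items() if n > 1]
    return {
        "sample": sample,  # read these yourself (or find someone who can)
        "mean_tokens": sum(lengths) / len(lengths),
        "max_tokens": max(lengths),
        "n_repeated_items": len(duplicates),  # e.g. boilerplate section headers
    }

# A toy corpus standing in for your actual training data:
corpus = [
    "Patient reports no pain.",
    "HISTORY OF PRESENT ILLNESS:",  # a standard section header...
    "HISTORY OF PRESENT ILLNESS:",  # ...which recurs, for good reason
    "Follow-up in two weeks.",
]
stats = inspect_corpus(corpus)
print(stats["n_repeated_items"])  # 1 distinct repeated item
```

Summary statistics like these only tell you where to look; the judgment calls about variety, genre, and messiness come from reading the sample.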
Step 2: Describe Your Data Honestly
Far too many *ACL papers talk as if the methods they present have been shown to work on “language in general”, and fail to even mention the language the data are from (let alone variety and genre). The mystery language (in my experience) is always English, but English is not and should not be understood as the canonical human language. The bare minimum here is to explicitly state the language that is being studied, but more description of the dataset is definitely better.
Step 3: Focus on Linguistic Structure, At Least Some of the Time
In the early 2000s, when I started working at UW, I found that some conversations with my colleagues in CS would get very confusing. After a year or two, though, I sorted it out: I was entering the conversations interested in language qua language. My colleagues were primarily interested in natural language data not for the language itself, but for the information encoded in it. And that’s fine! Both of these interests are legitimate. Once I pinpointed the source of confusion, it got much easier to communicate. The problem that I see, however, is that many people who work on NLP to get at the info in unstructured data look past the structures in language. As fluent speakers, it’s easy to miss the fact that the information isn’t right there in the words, because when we read (or listen), the linguistic processing that we are doing is not generally accessible to our conscious thought. But any linguist can tell you: a sentence is more than a bag of words, and the structures that connect the words (as well as those inside them, i.e. morphology) are critical to the way in which the meanings are built up. Understanding that structure as NLP researchers puts us in a better position to take advantage of it. (Haven’t studied any linguistics and want a quick intro to the kind of structure I’m talking about? I wrote a book for you.)
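The “more than a bag of words” point can be made concrete in a few lines. The sketch below (a deliberately naive whitespace tokenizer of my own devising, not anyone’s production pipeline) shows two sentences with identical bags of words but reversed meanings:

```python
from collections import Counter

def bag_of_words(sentence):
    """A deliberately naive bag-of-words: lowercase, split on whitespace."""
    return Counter(sentence.lower().split())

a = "the dog chased the cat"
b = "the cat chased the dog"

# Identical bags of words, yet who-chased-whom is reversed:
print(bag_of_words(a) == bag_of_words(b))  # True
print(a == b)                              # False
```

Any representation that discards the syntactic structure relating `dog`, `chased`, and `cat` cannot, by itself, recover which animal did the chasing.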
Step 4: Do Error Analysis
Ideally, error analysis should be both a part of the cycle of system development and part of the final description that goes in the paper (which usually, let’s be honest, describes an arbitrary iteration of the development cycle, chosen by timing with respect to the paper deadline…). Error analysis means collecting a sample of errors, usually on the development data (or possibly all of the errors, if the dataset is small or the system very good), and classifying them. The interesting thing about such a classification task is that the classes aren’t set out ahead of time. You can look at errors based on confusion matrices, or by sentence length, and so on, but better error analyses go deeper, again by looking at the data. What recurring linguistic patterns can be found in the items that trip the system up? For example, is negation confusing a sentiment analysis system? Or perhaps the difference between presupposed and asserted content (which is marked with varying linguistic structures) is too subtle for a factoid QA system. Error analysis of this type requires a good deal of linguistic insight, and can be an excellent arena for collaboration with linguists (and far more rewarding to the linguist than doing annotation). Start this process early. The conversations can be tricky, as you try to explain how the system works to a linguist who might not be familiar with the type of algorithms you’re using and the linguist in turn tries to explain the patterns they are seeing in the errors. But they can be rewarding in equal measure, as the linguistic insight brought out by the error analysis can inform further system development. Finally, a well-done comparative error analysis can give substance to claims that some particular aspect of system design is responsible for the measured improvement.
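The mechanical part of this process, collecting the errors and drawing a manageable sample for manual classification, is easy to script. Here is a minimal sketch (the `sample_errors` helper and the toy sentiment data are hypothetical; the interesting work, inventing the error categories, happens when a human reads the sample):

```python
import random

def sample_errors(inputs, gold, predicted, k=50, seed=0):
    """Collect (input, gold, predicted) triples where the system erred,
    then draw a sample small enough to classify by hand."""
    errors = [(x, g, p) for x, g, p in zip(inputs, gold, predicted) if g != p]
    random.seed(seed)
    return random.sample(errors, min(k, len(errors)))

# Toy sentiment-analysis outputs; the labels are illustrative only.
inputs    = ["great film", "not a great film", "bad acting", "hardly bad"]
gold      = ["pos",        "neg",              "neg",        "pos"]
predicted = ["pos",        "pos",              "neg",        "neg"]

for x, g, p in sample_errors(inputs, gold, predicted):
    print(f"{x!r}: gold={g}, predicted={p}")
# Reading the sample, a pattern emerges: both errors involve negation,
# exactly the kind of recurring linguistic structure worth naming.
```

The script only gathers the evidence; the classification scheme itself emerges from looking at the errors, ideally together with a linguist.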