Putting the Linguistics in Computational Linguistics

The wonderful NAACL 2018 PC co-chairs asked for some thoughts on how to make a computational linguistics paper linguistically informative.  My take on this is that not all NAACL papers need to be linguistically informative—but they should all be linguistically informed.  Here are four steps to achieving that, which I hope will be timely for short paper authors, and for everyone ahead of the camera-ready deadline:

Step 1: Know Your Data

Open up the training (or dev) data and look inside.  If it’s in a language you know, read (or listen to) a good chunk of it.  (And if it’s not, find someone who can read/listen to it and understand it for you.) Think about the properties it has. Is it well-edited text/scripted speech, or is it spontaneous and somewhat messy? What language varieties are represented? Are there any properties of variety or genre that you might expect to trip up any pre-trained components you are using? Are the sentences/utterances long and complicated, or relatively short and simple? Are there repeated items in the dataset, and are they there for good reason (e.g. standard section headers in medical documents)? These are a few relatively generic questions I could think of without looking at your data—any given dataset will have its own specific properties of interest that go well beyond these.
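Some of these generic questions can be answered with a few lines of code before (not instead of!) reading the data itself. Here is a minimal sketch of that kind of quick corpus profiling; the medical-style sample lines are invented for illustration, and in practice you would read lines from your own training file:

```python
from collections import Counter

def profile_corpus(sentences):
    """Collect a few generic corpus statistics: length distribution and repeats."""
    lengths = [len(s.split()) for s in sentences]
    counts = Counter(sentences)
    repeated = {s: n for s, n in counts.items() if n > 1}
    return {
        "n_items": len(sentences),
        "mean_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
        "repeated_items": repeated,
    }

# Tiny invented sample -- substitute lines read from your training data.
corpus = [
    "HISTORY OF PRESENT ILLNESS :",
    "The patient reports mild chest pain .",
    "HISTORY OF PRESENT ILLNESS :",   # a repeated section header
    "no insurance lol gonna walk it off",
]
stats = profile_corpus(corpus)
print(stats["repeated_items"])  # surfaces the duplicated header
```

Numbers like these are a prompt for the real work: once you see, say, a cluster of very short repeated items, go back and read them to find out why they are there.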

Step 2: Describe Your Data Honestly

Far too many *ACL papers talk as if the methods they present have been shown to work on “language in general”, and fail to even mention the language the data are from (let alone variety and genre). The mystery language (in my experience) is always English, but English is not and should not be understood as the canonical human language.  The bare minimum here is to explicitly state the language that is being studied, but more description of the dataset is definitely better.

Step 3: Focus on Linguistic Structure, At Least Some of the Time

In the early 2000s, when I started working at UW, I found that some conversations with my colleagues in CS would get very confusing. After a year or two, though, I sorted it out: I was entering the conversations interested in language qua language. My colleagues were primarily interested in natural language data not for the language itself, but for the information encoded in it.  And that’s fine!  Both of these interests are legitimate.  Once I pinpointed the source of confusion, it got much easier to communicate.  The problem that I see, however, is that many people who work on NLP to get at the info in unstructured data look past the structures in language.  As fluent speakers, it’s easy to miss the fact that the information isn’t right there in the words, because when we read (or listen) the linguistic processing that we are doing is not generally accessible to our conscious thought.  But any linguist can tell you: a sentence is more than a bag of words and the structures that connect the words (as well as those inside them, i.e. morphology) are critical to the way in which the meanings are built up. Understanding that structure as NLP researchers puts us in a better position to take advantage of it. (Haven’t studied any linguistics and want a quick intro to the kind of structure I’m talking about? I wrote a book for you.)
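One way to make the bag-of-words point concrete: word order alone can flip who did what to whom while the bag of words stays identical, so any representation that discards structure cannot tell the two apart. A tiny illustration:

```python
from collections import Counter

# Two sentences with the exact same bag of words but very different meanings:
a = "the dog bit the man".split()
b = "the man bit the dog".split()

# As bags of words they are indistinguishable...
print(Counter(a) == Counter(b))

# ...even though as sequences (let alone syntactic structures) they differ.
print(a == b)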

Step 4: Do Error Analysis

Ideally, error analysis should be both a part of the cycle of system development and part of the final description that goes in the paper (which usually, let’s be honest, describes an arbitrary iteration of the development cycle, chosen by timing with respect to the paper deadline…). Error analysis means collecting a sample of errors, usually on the development data (or possibly all of the errors if the data set is small or the system very good), and classifying them. The interesting thing about such a classification task is that the classes aren’t set out ahead of time.  You can look at errors based on confusion matrices, or by sentence length, and so on, but better error analyses go deeper, again by looking at the data. What recurring linguistic patterns can be found in the items that trip the system up? For example, is negation confusing a sentiment analysis system? Or perhaps the difference between presupposed and asserted content (which is marked with varying linguistic structures) is too subtle for a factoid QA system. Error analysis of this type requires a good deal of linguistic insight, and can be an excellent arena for collaboration with linguists (and far more rewarding to the linguist than doing annotation). Start this process early. The conversations can be tricky, as you try to explain how the system works to a linguist who might not be familiar with the type of algorithms you’re using and the linguist in turn tries to explain the patterns they are seeing in the errors. But they can be rewarding in equal measure as the linguistic insight brought out by the error analysis can inform further system development. Finally, a well-done comparative error analysis can give substance to claims that some particular aspect of system design is responsible for the measured improvement.
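The mechanical part of this process—pulling out the errors, counting confusions, and drawing a sample to classify by hand—can be sketched in a few lines. The sentiment labels below are invented for illustration; substitute your own system’s dev-set outputs:

```python
import random
from collections import Counter

def sample_errors(gold, predicted, items, k=50, seed=0):
    """Collect (item, gold, predicted) triples where the system was wrong,
    count the gold/predicted confusions, and draw a reproducible sample
    of errors for manual classification."""
    errors = [(it, g, p) for it, g, p in zip(items, gold, predicted) if g != p]
    confusion = Counter((g, p) for _, g, p in errors)
    rng = random.Random(seed)
    sample = rng.sample(errors, min(k, len(errors)))
    return confusion, sample

# Invented sentiment-analysis outputs -- use your own dev-set predictions.
items = ["great film", "not great at all", "I did not hate it", "so boring"]
gold  = ["pos", "neg", "pos", "neg"]
pred  = ["pos", "pos", "neg", "neg"]
confusion, sample = sample_errors(gold, pred, items, k=10)
print(confusion)
```

The counting and sampling is the easy part; the analysis proper is reading the sampled items and noticing, for instance, that both errors in this toy example involve negation.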


5 thoughts on “Putting the Linguistics in Computational Linguistics”

  1. My limited experiences as an author and reviewer with CL conferences and workshops gave me the impression that these very things you mention (error analysis, writing about understanding the data, etc.) are considered “blabbering” by some. So, perhaps reviewers also need to be given this advice – not just authors?


    1. Fully agree. Similarly, Nitin’s advice on reviewing should also be followed by authors when they write related work sections.


  2. Another example of “look at the data and think”, from the semi-routine task of tagging English with the Penn Treebank part-of-speech tagset, is the distinction between the VBD/VBN/JJ tags. VBD is the past tense of a verb, as in “The blow broke/VBD the glass”, VBN is the past participle, as in “The glass has been broken/VBN.” and JJ is an adjective as in “He suffered broken/JJ ribs.” Do you understand why the last two are tagged differently? Can you (a human being) reliably tell the difference? Can your system? Does your system need to get this right?

    What if all three sentences use the same word? This is easy to arrange:

    The blow damaged/VBD the glass.
    The glass has been damaged/VBN.
    He suffered damaged/JJ ribs.

    It is pretty easy to understand that this could be a problem for some system designs and not others. Whether you are a linguist or not, when you make choices about which datasets you use, you are adopting the ideas behind the classifications they use, and become responsible for the consequences. Maybe you really don’t care, because confusions between these labels don’t matter for your end task. But as scientists or engineers, we are really supposed to know for sure whether we care or not, and be able to justify the choices we make. Including ones like those above, where we kind of inherited the choice, and may not have known it.
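To make the commenter’s point measurable: whether your system needs to get these labels right is an empirical question. A hedged sketch that separates within-{VBD, VBN, JJ} confusions from all other tagging errors, so you can see how much of your error mass lives in exactly this ambiguity (the tag sequences below are invented, standing in for the three “damaged” sentences above):

```python
AMBIG = {"VBD", "VBN", "JJ"}

def past_participle_confusions(gold_tags, pred_tags):
    """Count tag errors, separating confusions within the VBD/VBN/JJ set
    (often the same surface form, e.g. 'damaged') from all other errors."""
    within, other = 0, 0
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            continue
        if g in AMBIG and p in AMBIG:
            within += 1
        else:
            other += 1
    return within, other

# Invented tagger output for the three 'damaged' sentences.
gold = ["VBD", "VBN", "JJ"]
pred = ["VBD", "VBN", "VBN"]   # the tagger calls the adjectival use VBN
within, other = past_participle_confusions(gold, pred)
print(within, other)  # -> 1 0
```

If `within` dominates and your downstream task is insensitive to the participle/adjective distinction, you can justify ignoring those errors; if not, you have found a place to look harder at the data.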

