A Review of Reviewer Assignment Methods

Authors:

Amanda Stent and Heng Ji (NAACL-HLT2018 PC Chairs)

The Reviewer Assignment Problem

The reviewer assignment problem is the task of assigning submissions to reviewers so as to ensure a “fair” and “balanced” assignment. This problem has been widely studied (Price & Flach, 2017). The problem is most often described in terms of constraint satisfaction: make an assignment of reviewers R to submissions S such that:

  • each submission gets exactly n reviewers
  • each reviewer gets no more than m_r submissions
Of course, the problem so far is underspecified. In addition to satisfying these hard constraints, in our field as in many others we want to achieve less well-defined goals, including:
  • Fairness – no reviewer should be assigned to a submission for which they have a conflict of interest, e.g. they have advised or been advised by a co-author on the submission; they have co-authored with a co-author on the submission in the recent past; they work at the same institution as a co-author on the submission.
  • Expertise – all other things being equal, the reviewers assigned to a submission should be the most qualified to review that submission.
  • Interest – all other things being equal, the reviewers assigned to a submission should be interested in reviewing that submission.
Also, many CS authors would like as short a time as possible between submission and publication, and CS conferences (especially in fields like ML, computer vision and NLP) are growing rapidly. It seems that it ought to be possible to make an automatic assignment of submissions to reviewers that respects criteria like expertise, interest and fairness (Li & Hou, 2015). Indeed, work in this area has been going on since at least 1992 (Dumais & Nielsen, 1992). It is known that this type of problem (with all the soft and hard constraints) is NP-hard (Goldsmith & Sloan, 2007; Long et al., 2013). However, there exist a number of approximate solutions (Karimzadehgan & Zhai, 2009; Li & Hou, 2015; Liu, Suel & Memon, 2014), any one of which could be implemented inside START (softconf), the conference management system ACL uses, given enough lead time.
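
To make the constraints concrete, here is a minimal greedy sketch in the spirit of the approximate methods cited above. It is not the algorithm from any of those papers, and the data structures, affinity scores and quota values are hypothetical: pairs are simply visited in order of decreasing affinity and accepted whenever the hard constraints allow.

```python
from collections import defaultdict

def greedy_assign(affinity, coi, n_per_paper=3, quotas=None, default_quota=6):
    """Greedy reviewer assignment: visit (paper, reviewer) pairs from best to
    worst affinity and accept a pair whenever the hard constraints allow.

    affinity: dict mapping (paper, reviewer) -> match score (higher is better)
    coi:      set of (paper, reviewer) pairs that must never be assigned
    quotas:   optional dict reviewer -> maximum number of papers
    """
    quotas = quotas or {}
    load = defaultdict(int)        # papers assigned to each reviewer so far
    assigned = defaultdict(list)   # reviewers assigned to each paper so far

    for (paper, reviewer), score in sorted(
            affinity.items(), key=lambda kv: kv[1], reverse=True):
        if (paper, reviewer) in coi:
            continue                                   # fairness: skip conflicts
        if len(assigned[paper]) >= n_per_paper:
            continue                                   # paper already has n reviewers
        if load[reviewer] >= quotas.get(reviewer, default_quota):
            continue                                   # reviewer is at quota
        assigned[paper].append(reviewer)
        load[reviewer] += 1
    return dict(assigned)
```

Note that a single greedy pass like this can leave some submissions with fewer than n reviewers; the exact formulations instead solve the assignment as a constrained optimization problem.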
 

Assessing Fairness, Expertise and Interest

The questions conference chairs face are:
  • How can we assess expertise (especially in a way that is scalable as our field grows)?
    • A standard solution is to compute a kind of match between a reviewer’s published papers and the submission, typically on the text alone (Dumais & Nielsen, 1992; Mimno & McCallum, 2008; Li & Hou, 2015). START supports one of these, the Toronto Paper Matching system (TPMS; Charlin & Zemel, 2013) – more on TPMS below.
    • START supports bidding, which can be used to assess expertise – more on bidding below.
    • START supports the assignment of reviewers to areas or keywords; this can also be used to assess expertise – more on this below.
  • How can we assess interest (especially in a way that is scalable as our field grows)?
    • TPMS and bidding can be used to assess interest.
    • A (history-weighted) analysis of the papers a reviewer cites in their own work could also be used to assess interest, although we know of no conference management system that supports this at present.
  • How can we assess fairness?
    • START has good automatic methods for assessing conflicts of interest based on user profiles. However, a reviewer may change affiliations without updating their profile. Also, a particular affiliation may be too broad – should every employee at a large tech company be prohibited from reviewing submissions by every other employee at that company, even if they have never met? (A minimal sketch of such a profile-based check appears after this list.)
    • In addition, START cannot capture those conflicts of interest due to personality and politics that arise in any collection of individuals.
    • Other fields have broader definitions of fairness that include: no two submissions get assigned more than one shared reviewer, or no set of reviewers assigned to a submission share an affiliation. In our case, we wanted to achieve a broad distribution of submissions across qualified reviewers – regardless of organizational or geographic affiliation or popularity. 
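
As referenced in the fairness bullet above, a profile-based conflict-of-interest check might look roughly like the sketch below. All of the inputs are hypothetical stand-ins for what a system like START stores; the last test also illustrates why a bare same-affiliation rule can be too broad.

```python
def has_conflict(reviewer, authors, affiliation, recent_coauthors, advising):
    """Profile-based conflict-of-interest check (all inputs are hypothetical).

    reviewer:         reviewer id
    authors:          list of author ids on the submission
    affiliation:      dict person -> current institution
    recent_coauthors: dict person -> set of recent co-authors
    advising:         dict person -> set of advisors and advisees
    """
    for author in authors:
        if author == reviewer:
            return True                                 # reviewer is an author
        if author in advising.get(reviewer, set()):
            return True                                 # advising relationship
        if author in recent_coauthors.get(reviewer, set()):
            return True                                 # recent co-authorship
        a, r = affiliation.get(author), affiliation.get(reviewer)
        if a is not None and a == r:
            return True       # same institution; possibly too broad, as noted above
    return False
```
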
For NAACL HLT 2018, we used a combination of areas, TPMS and bidding to assess interest and expertise.
  • For long paper submissions, we gave area chairs reviewers’ areas of interest, conflicts of interest, quotas, TPMS scores and bids for all their submissions. However, area chairs struggled to combine these sources of information, especially given that a) many reviewers bid minimally, only selecting a few papers as “yes”; b) most reviewers could be assigned submissions in multiple areas.
  • For short paper submissions, we did an automatic pre-assignment of reviewers to submissions based on areas of interest, conflicts of interest, quotas and TPMS scores, using the greedy method from (Liu, Suel & Memon, 2014). Reviewers then bid, after which area chairs could adjust the reviewer assignments based on bids and their own expertise. One way signals like these might be blended into a single score is sketched after this list.
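
The following is an illustrative sketch of one way to blend such signals into a single affinity score. The bid values and weights are arbitrary, and this is not what was done for NAACL HLT 2018 (there the pre-assignment used areas, conflicts, quotas and TPMS, and bids were weighed by area chairs afterwards).

```python
# Hypothetical bid values and weights, for illustration only.
BID_VALUE = {"yes": 1.0, "maybe": 0.5, "none": 0.25, "no": 0.0}

def combined_score(tpms, bid, same_area, w_tpms=0.6, w_bid=0.3, w_area=0.1):
    """Blend a TPMS similarity, a bid, and area overlap into one affinity score.

    tpms:      TPMS similarity in [0, 1] (use 0 if the reviewer has no profile)
    bid:       "yes", "maybe", "none" (did not bid) or "no"
    same_area: True if the reviewer signed up for the submission's area
    """
    if bid == "no":
        return 0.0  # treat an explicit "no" as a veto
    return (w_tpms * tpms
            + w_bid * BID_VALUE[bid]
            + w_area * (1.0 if same_area else 0.0))
```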

So, now we want to know:

  • What is the “buy in” for TPMS in our reviewer pool?
  • What is the “buy in” for bidding in our reviewer pool?
  • How highly correlated are TPMS scores and bids, in our reviewer pool?

Areas and Keywords

We used areas for NAACL HLT 2018, primarily for the following reasons:

  • To give reviewers a smaller set of papers to consider during bidding – because we were not able to get sorting of submissions by TPMS score implemented before the submission deadlines.
  • To sequester submissions for which area chairs had conflicts away from their view during the review process.
  • To spare area chairs from having to consider the entire reviewer pool during reviewer assignment.

The use of strict areas is problematic because sometimes a submission has to be placed in a less-than-optimal area due to conflicts of interest, and because many reviewers have expertise in multiple areas. In addition, area chairs may think first of assigning submissions to people they “know” – leading to some reviewers being heavily overloaded and others (equally qualified and willing) left out.

NAACL HLT 2016 used keywords instead of areas (Nenkova & Rambow, 2016). While interesting, this is difficult to do well because folksonomies are known to be difficult to navigate accurately (Price & Flach, 2017). We think keywords would work quite well if authors were required to submit abstracts and keywords a week before the paper submission deadline, and reviewers were required to choose from those keywords during the intervening week – either instead of something like TPMS, or as an adjunct to it.
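
Under that proposal, matching reviewers to submissions could be as simple as a keyword-overlap score. A minimal sketch, assuming (hypothetically) that authors and reviewers choose from the same keyword list:

```python
def keyword_overlap(submission_keywords, reviewer_keywords):
    """Jaccard overlap between a submission's keywords and a reviewer's keywords."""
    s = {k.lower() for k in submission_keywords}
    r = {k.lower() for k in reviewer_keywords}
    if not s or not r:
        return 0.0
    return len(s & r) / len(s | r)
```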

Of course, automatic topic clustering methods like TPMS are supposed to be more generalizable than any fixed set of topics or keywords.

Toronto Paper Matching System

The Toronto Paper Matching system (TPMS) suggests reviewer assignments for each submission based on an automatic comparison of the text of each reviewer’s provided prior publications and the text of the submission (Charlin & Zemel, 2013). TPMS is relatively new to the *ACL community – it has been used, at least in part, since ACL 2017. It is much better known in the ML community.

A primary benefit of TPMS is that it is almost entirely automatic – once a reviewer has set up a TPMS profile, it can be reused in a wholly automatic fashion for multiple conferences – or the reviewer can update their profile at any time. 

A primary drawback of TPMS is that quite often people prefer to review papers not directly related to work they have published in the past – i.e. that preference and expertise may not be fully aligned.

Another issue with the use of TPMS in our field is that it is only as good as the individual reviewer’s interest in uploading a relevant set of prior publications. In the case of NAACL 2018, about 80% of our reviewers either had or created a TPMS profile that was findable by START (using their email address). Variations on TPMS that are based on the ACL anthology could be used in the future. This would eliminate the need for reviewers to set up or maintain a profile, and offers opportunities for computational linguistics research.
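
For intuition only, here is a much simpler stand-in for this kind of text matching: TF-IDF cosine similarity between each submission and each reviewer’s concatenated publications. This is not TPMS’s own scoring model (see Charlin & Zemel, 2013), just a bag-of-words sketch of the general idea.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_affinity(submission_texts, reviewer_texts):
    """TF-IDF cosine similarity between submissions and reviewers' publications.

    submission_texts: list of strings, one per submission
    reviewer_texts:   list of strings, one per reviewer (e.g. their concatenated
                      prior papers, or ACL anthology papers as suggested above)
    Returns an (n_submissions x n_reviewers) similarity matrix.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit one shared vocabulary so the two sets of vectors are comparable.
    vectorizer.fit(submission_texts + reviewer_texts)
    S = vectorizer.transform(submission_texts)
    R = vectorizer.transform(reviewer_texts)
    return cosine_similarity(S, R)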

Bidding

Bidding is relatively well known to the *ACL community by now (although we think it has been in use for less than 10 years). Bidding can be used to assess expertise and interest.

A primary drawback is that it is not automatable – and in fact, faced with hundreds of submissions, many reviewers fail to bid at all (Price & Flach, 2017). If bidding were done on a list of submissions sorted by expertise, it might help with this issue.

Another drawback of bidding is that bidders frequently conflate expertise and interest – or ignore expertise altogether. Many bidders mark a small number of submissions as “yes”, a small number as “maybe”, and the rest as “no” – even when they are qualified to review those submissions. A side effect is that submissions on “popular” topics get many bids, and submissions on less-popular topics get none – but all eligible submissions do need to be reviewed, preferably by a qualified reviewer.

A third drawback of bidding is that busy reviewers may not bid at all. Typically, for a *ACL conference the default bid is set to “qualified to review” – bidders who don’t bid at all may be assigned submissions they are not, in fact, qualified to review. If no default bid is set, reviewers who don’t bid are effectively eliminated from any automatic reviewer assignment process.

A fourth drawback of bidding is that it results in information leakage – sure, it’s fun in a gossipy kind of way to see what submissions a conference gets, but it impacts the integrity of double blind review in ways we cannot currently assess (think, for example, of a submission that is rejected from one conference and submitted to the next one).

How Well Do TPMS Scores Match Bids?

Given the caveats above, we are nonetheless interested in determining how well TPMS scores match bids – and how well either of these correlates with reviewer assignments. We are able to examine this on the short paper submissions to NAACL HLT 2018.

1040 of our reviewers have TPMS profiles findable by START.

For each of the 444 short papers, the area chairs assigned 3 reviewers. For the 444*3 = 1332 assigned <paper, reviewer> pairs:

  • 152 (11.4%) matched one of the top 3 reviewers recommended by TPMS.
  • 302 (22.7%) matched one of the top 10 reviewers recommended by TPMS, or had a TPMS score above 0.8.

We observed that some reviewers consistently have high TPMS scores for many submissions, while others do not. In other words, TPMS scores alone cannot be used for automatic reviewer assignment – some reviewers would be assigned many papers, and many reviewers would be assigned none.
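
For reference, statistics like the top-k figures above can be computed with a few lines of code. The data structures here are hypothetical stand-ins for the START/TPMS exports we worked from.

```python
def topk_agreement(assignments, tpms_scores, k=3, threshold=None):
    """Fraction of assigned (paper, reviewer) pairs in the paper's TPMS top-k
    (or, optionally, with a TPMS score above a threshold).

    assignments: iterable of (paper, reviewer) pairs that were actually assigned
    tpms_scores: dict paper -> dict reviewer -> TPMS score
    """
    hits = total = 0
    for paper, reviewer in assignments:
        scores = tpms_scores.get(paper, {})
        topk = sorted(scores, key=scores.get, reverse=True)[:k]
        in_topk = reviewer in topk
        above = threshold is not None and scores.get(reviewer, 0.0) > threshold
        hits += int(in_topk or above)
        total += 1
    return hits / total if total else 0.0
```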

Reviewers were invited to bid on short paper submissions if they had either said they could review short paper submissions or not expressed a preference for reviewing short vs. long, and if they had reviewed two or fewer long paper submissions. 208 reviewers were only available to review short paper submissions. 795 reviewers were invited to bid on short paper submissions.

479 reviewers bid “yes” or “maybe” on at least one short paper submission. Of those, 392 bid “yes” or “maybe” on at least 5 submissions, and 243 on at least 10. We conclude that bidding alone does not give sufficient information for automatic reviewer assignment.

Area chairs were, overall, deeply considerate of bids. After bidding, they changed at least one of the automatic area/quota/TPMS-based assignments for 360 short paper submissions, and changed all three for 85. Of the 1278 assigned <paper, reviewer> pairs, 370 (29.0%) had a “yes” bid from the assigned reviewer, 245 (19.2%) had a “maybe”, 23 (1.8%) had a “no”, and 640 (50.1%) received no bid from the assigned reviewer.

20.5% of the time, a “yes” bid corresponded to a submission that had a TPMS score of 0.8 or higher. 11.0% of the time, a “maybe” bid corresponded to a submission that had a TPMS score of 0.8 or higher. 90.9% of the time, a “yes” or “maybe” bid corresponded to a reviewer who was not in the top 10 matching reviewers for that submission.

0.4% of the time, a “no” bid corresponded to the most closely matching reviewer for the submission according to TPMS scores, and 4.1% of the time it corresponded to a reviewer in the top 10 matching reviewers for the submission. We conclude that TPMS and bids capture independent sources of information, i.e. reviewers are generally bidding based on preferences.
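
The bid-versus-TPMS percentages above come from a simple cross-tabulation; a minimal sketch, again with hypothetical inputs:

```python
from collections import Counter

def bid_vs_tpms(bids, tpms_scores, threshold=0.8):
    """For each bid value, the fraction of pairs with a TPMS score >= threshold.

    bids:        dict (paper, reviewer) -> "yes" | "maybe" | "no"
    tpms_scores: dict (paper, reviewer) -> TPMS score
    """
    hits, totals = Counter(), Counter()
    for pair, bid in bids.items():
        totals[bid] += 1
        if tpms_scores.get(pair, 0.0) >= threshold:
            hits[bid] += 1
    return {bid: hits[bid] / totals[bid] for bid in totals}
```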

Suggestions For the Future

Based on the experience of both NAACL HLT 2016 and NAACL HLT 2018, we suggest that future program chairs abandon areas altogether and go back to the senior program committee (meta-reviewer) model. Conflicts of interest arising when meta reviewers are themselves authors of submissions can be handled in START. Meta reviewers can provide their expertise in recruiting reviewers, pre-checking submissions to determine desk rejects, ensuring high quality and timely reviews, and making informed accept-reject suggestions.

In the early years of the conference, the senior program committee would align their recommendations at an in-person meeting, which later became a telephone call, and then spreadsheets shared around. For NAACL HLT 2016 and NAACL HLT 2018, area chairs/meta reviewers worked together in small groups of 2 or 3 to align their recommendations; this allowed for coordination without the overhead of syncing with 5-10 other area chairs.

We suggest that future program chairs require authors to submit paper titles, authors, keywords and abstracts a week ahead of the paper submission deadline (as is done in many other fields). This will give program chairs the opportunity to assign submissions to meta reviewers/area chairs, to solicit interest data from reviewers, and (if areas are used) to assign reviewers to areas ahead of the submission deadline. It may even be possible to make a preliminary assignment of reviewers to submissions during that week, and to solicit additional reviewers for submissions where insufficient numbers of qualified reviewers have already been invited.

It will be necessary to accompany these changes with more automated reviewer (pre-)assignments, as reviewers won’t bid on a whole list of submissions, and area chairs find it hard to balance quotas, expertise and interest for large reviewer pools. If bidding is desired, we suggest that ACL work with START to implement a method for sorting submissions by reviewer expertise (keywords, citation graph analysis, TPMS scores or another method). This is something that was requested by ACL 2017, NAACL HLT 2018 and ACL 2018.

Citation graph analysis would be an alternative to keywords, bidding and/or TPMS (we can find no references to work on this task using citation graph analysis, but it seems like a promising approach to try). It could easily be done given the considerable work our field has put in on the ACL anthology. Unfortunately, it can’t be done and integrated into START in the 3-5 months new program chairs have to set up a conference – it would have to be done as an ACL-wide activity.
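
As a purely illustrative sketch of what this might look like (the weights are arbitrary, and it assumes an anthology-derived citation graph is available), one could score reviewers by how closely their own papers connect to the papers a submission cites:

```python
def citation_affinity(submission_refs, reviewer_papers, citations):
    """Score reviewers by how closely their papers connect to a submission's references.

    submission_refs: set of paper ids cited by the submission
    reviewer_papers: dict reviewer -> set of paper ids they authored
    citations:       dict paper id -> set of paper ids it cites (e.g. built from
                     the ACL anthology)
    """
    scores = {}
    for reviewer, papers in reviewer_papers.items():
        score = 0
        for ref in submission_refs:
            if ref in papers:
                score += 2          # the submission cites the reviewer's own work
            elif any(ref in citations.get(p, set()) for p in papers):
                score += 1          # the reviewer has cited the same work
        scores[reviewer] = score
    return scores
```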

We suggest that ACL work with program chairs to implement systems that use the ACL anthology to identify reviewers and to assess interest, expertise and diversity of reviewer assignments.

Another way to look at reviewer assignment is to treat it as a recommender system problem. Typical recommender systems use metadata such as clicks/views. We use no historical information in reviewer assignment in our field – program chairs don’t know whether a reviewer has reviewed this submission before, or a submission from this group before, and they don’t know whether a reviewer has read a submission from a set of authors before. So, many of the signals that would be part of a “standard” recommender system are not currently available to our program chairs for reviewer assignment. This would be a fruitful area for future consideration.

One thought on “A Review of Reviewer Assignment Methods”

  1. It sounds like you’re viewing this as a kind of maximum-weight b-matching problem (http://proceedings.mlr.press/v15/huang11a/huang11a.pdf), where each “fair” reviewer-paper edge is *separately* scored by some combination of “expertise” (is the reviewer useful for the paper?) and “interest” (is the paper congenial to the reviewer?).

    I agree that fairness, expertise, and interest are the basic criteria, but I don’t agree that the objective function decomposes over edges. Whether a paper will be expertly reviewed is a (stochastic) function of the *group* of 3 reviewers (and the area chairs). When I’m an area chair, I am looking to assign each paper a group of reviewers whose *collective* expertise will reduce my uncertainty about the paper. A paper with multiple aspects ought to include multiple kinds of expertise in that group, so that all aspects can be checked. And the group should be robust, so I don’t want it to include more than one potentially unreliable reviewer — I will need enough usable expertise in the set of reviews (+ discussion) to make a decision.

    As for identifying expertise: it’s best to go deeper than “bag of words,” high-level keywords, or bids. I’m often looking for people who have worked on conceptually similar problems, even if they use different terminology or are working in a different application setting. Certainly I try to come up with such reviewers when handling a single paper as a TACL or JAIR editor, and I try to do it as a conference area chair as well — which is one reason it’s good that you’ve reduced the load of papers per AC.
