A Gold Standard Dataset for the Reviewer Assignment Problem

March 23, 2023 · Declared Dead · 🏛 Trans. Mach. Learn. Res.

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Ivan Stelmakh, John Wieting, Sarina Xi, Graham Neubig, Nihar B. Shah
arXiv ID: 2303.16750
Category: cs.IR (Information Retrieval)
Cross-listed: cs.DL, cs.LG
Citations: 20
Venue: Trans. Mach. Learn. Res.
Last Checked: 4 months ago

Abstract

Many peer-review venues are using algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score" -- a numerical estimate of the expertise of a reviewer in reviewing a paper -- and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose an algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better ones is the lack of publicly available gold-standard data. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously. Using our dataset, we compare several widely used similarity algorithms and offer key insights. First, all algorithms exhibit significant error, with misranking rates of 12%-30% in easier cases and 36%-43% in harder ones. Second, most specialized algorithms are designed to work with titles and abstracts of papers, and in this regime the SPECTER2 algorithm performs best. Interestingly, classical TF-IDF matches SPECTER2 in accuracy when given access to full submission texts. In contrast, off-the-shelf LLMs lag behind specialized approaches.
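To make the "misranking rate" in the abstract concrete, here is a minimal Python sketch of one plausible reading: for a single reviewer, the fraction of paper pairs whose order under the predicted similarity scores disagrees with the order under the reviewer's self-reported expertise. This is an illustrative assumption, not the paper's exact metric, which may weight pairs differently. The function name `misranking_rate`, the half-error rule for ties, and the sample numbers are all hypothetical.

```python
from itertools import combinations


def misranking_rate(expertise, predicted):
    """Fraction of paper pairs ordered incorrectly by the predicted scores.

    expertise: self-reported expertise scores for one reviewer's papers.
    predicted: algorithm similarity scores for the same papers, same order.
    Pairs with equal expertise are skipped (no ground-truth ordering).
    A tie in the predicted scores counts as half an error (an assumption).
    """
    errors = 0.0
    total = 0
    for i, j in combinations(range(len(expertise)), 2):
        if expertise[i] == expertise[j]:
            continue  # the reviewer did not rank this pair
        total += 1
        truth = expertise[i] - expertise[j]
        guess = predicted[i] - predicted[j]
        if guess == 0:
            errors += 0.5
        elif (truth > 0) != (guess > 0):
            errors += 1.0
    return errors / total if total else 0.0


# Hypothetical data for one reviewer: expertise on a 1-5 scale,
# predicted similarity in [0, 1]. Not taken from the dataset.
expertise = [5, 3, 4, 1]
predicted = [0.9, 0.4, 0.3, 0.2]
print(misranking_rate(expertise, predicted))
```

In this toy example only one of the six comparable pairs is ordered incorrectly (the papers with expertise 3 and 4). A dataset-level figure could be obtained by pooling pairs across reviewers, though how the paper aggregates is not stated in the abstract.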