Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems
April 15, 2024 ยท Declared Dead ยท ๐ NAACL-HLT
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke
arXiv ID
2404.09980
Category
cs.CL: Computation & Language
Cross-listed
cs.HC,
cs.IR
Citations
3
Venue
NAACL-HLT
Last Checked
4 months ago
Abstract
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs). Obtaining high-quality and consistent ground-truth labels from annotators presents challenges. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process. However, the impact of this limitation on label quality remains unexplored. This study investigates the influence of dialogue context on annotation quality, considering the truncated context for relevance and usefulness labeling. We further propose to use large language models (LLMs) to summarize the dialogue context to provide a rich and short description of the dialogue context and study the impact of doing so on the annotator's performance. Reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using the first user utterance as context leads to consistent ratings, akin to those obtained using the entire dialogue, with significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Computation & Language
๐
๐
Old Age
๐
๐
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
๐
๐
Old Age
XLNet: Generalized Autoregressive Pretraining for Language Understanding
๐ฎ
๐ฎ
The Ethereal
Effective Approaches to Attention-based Neural Machine Translation
๐
๐
Old Age
A large annotated corpus for learning natural language inference
๐
๐
Old Age
HellaSwag: Can a Machine Really Finish Your Sentence?
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
๐ป
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
๐ป
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
๐ป
Ghosted