The Reasonable Effectiveness of Diverse Evaluation Data
January 23, 2023 Β· Declared Dead Β· π arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Lora Aroyo, Mark Diaz, Christopher Homan, Vinodkumar Prabhakaran, Alex Taylor, Ding Wang
arXiv ID
2301.09406
Category
cs.HC: Human-Computer Interaction
Citations
11
Venue
arXiv.org
Last Checked
4 months ago
Abstract
In this paper, we present findings from an semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI-chat bot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development -- specifically human evaluation of generative models -- on the backdrop of growing work on sociotechnical AI evaluations.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Human-Computer Interaction
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Improving fairness in machine learning systems: What do industry practitioners need?
R.I.P.
π»
Ghosted
Identifying Stable Patterns over Time for Emotion Recognition from EEG
R.I.P.
π»
Ghosted
Questioning the AI: Informing Design Practices for Explainable AI User Experiences
R.I.P.
π»
Ghosted
Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities
R.I.P.
π»
Ghosted
Educational data mining and learning analytics: An updated survey
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted