Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources
November 28, 2022 · The Cartographer · Conference on Empirical Methods in Natural Language Processing
"No code URL or promise found in abstract"
"Title-pattern auto-detect: Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources"
Evidence collected by the PWNC Scanner
Authors
Xinyan Velocity Yu, Akari Asai, Trina Chatterjee, Junjie Hu, Eunsol Choi
arXiv ID
2211.15649
Category
cs.CL: Computation & Language
Cross-listed
cs.AI
Citations
29
Venue
Conference on Empirical Methods in Natural Language Processing
Abstract
While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including input text and label sources and tools used to build them, and what they study, tasks they address and motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.
Similar Papers
In the same crypt: Computation & Language
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Effective Approaches to Attention-based Neural Machine Translation
A large annotated corpus for learning natural language inference