The State and Fate of Summarization Datasets: A Survey
November 07, 2024 · The Cartographer · North American Chapter of the Association for Computational Linguistics
"No code URL or promise found in abstract"
"Title-pattern auto-detect: The State and Fate of Summarization Datasets: A Survey"
Evidence collected by the PWNC Scanner
Authors
Noam Dahan, Gabriel Stanovsky
arXiv ID
2411.04585
Category
cs.CL: Computation & Language
Citations
3
Venue
North American Chapter of the Association for Computational Linguistics
Abstract
Automatic summarization has consistently attracted attention due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed, and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages, and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.
Similar Papers
In the same crypt: Computation & Language
Old Age
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Old Age
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Transcended
Effective Approaches to Attention-based Neural Machine Translation
Old Age
A large annotated corpus for learning natural language inference
Old Age