🌅
Old Age
LMentry: A Language Model Benchmark of Elementary Language Tasks
November 03, 2022 · 🏛 Annual Meeting of the Association for Computational Linguistics
"No code URL or promise found in abstract"
"HuggingFace models found (backfill)"
Evidence collected by the PWNC Scanner
Authors
Avia Efrat, Or Honovich, Omer Levy
arXiv ID
2211.02069
Category
cs.CL: Computation & Language
Cross-listed
cs.AI, cs.LG
Citations
31
Venue
Annual Meeting of the Association for Computational Linguistics
Repository
https://huggingface.co/datasets/clembench-playpen/lmentry
Last Checked
7 days ago
Abstract
As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run "unit test", without resorting to large benchmark suites of complex tasks.
📜 Similar Papers
In the same crypt — Computation & Language
🌅
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
🌅
Old Age
XLNet: Generalized Autoregressive Pretraining for Language Understanding
🔮
The Ethereal
Effective Approaches to Attention-based Neural Machine Translation
🌅
Old Age
A large annotated corpus for learning natural language inference
🌅
Old Age