Old Age
PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models
August 03, 2024 · Entered Twilight · arXiv.org
Repo contents: README.md, calc_metrics.py, plugh.json, plugh.responses.json, sample_responses.py
Authors
Alexey Tikhonov
arXiv ID
2408.04648
Category
cs.CL: Computation & Language
Cross-listed
cs.AI,
cs.IR
Citations
2
Venue
arXiv.org
Repository
https://github.com/altsoph/PLUGH
Last Checked
3 months ago
Abstract
We present PLUGH (https://www.urbandictionary.com/define.php?term=plugh), a modern benchmark that currently consists of 5 tasks, each with 125 input texts extracted from 48 different games and representing 61 different (non-isomorphic) spatial graphs to assess the abilities of Large Language Models (LLMs) for spatial understanding and reasoning. Our evaluation of API-based and open-sourced LLMs shows that while some commercial LLMs exhibit strong reasoning abilities, open-sourced competitors can demonstrate almost the same level of quality; however, all models still have significant room for improvement. We identify typical reasons for LLM failures and discuss possible ways to deal with them. Datasets and evaluation code are released (https://github.com/altsoph/PLUGH).
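To make the phrase "non-isomorphic spatial graphs" concrete, here is a minimal sketch of an isomorphism check between two small location graphs. It is an illustration only, not the authors' code: the location names, the undirected-edge assumption, and the helper names `edge_set` and `are_isomorphic` are all hypothetical, and the brute-force search is practical only for graphs with a handful of nodes.

```python
from itertools import permutations


def edge_set(edges):
    """Normalize an undirected edge list into a set of frozensets."""
    return {frozenset(e) for e in edges}


def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism check; only practical for small graphs."""
    ea, eb = edge_set(edges_a), edge_set(edges_b)
    if len(nodes_a) != len(nodes_b) or len(ea) != len(eb):
        return False
    # Try every relabeling of graph A's nodes onto graph B's nodes.
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        if {frozenset(mapping[n] for n in e) for e in ea} == eb:
            return True
    return False


# Hypothetical maps: two three-room corridors and one four-room hub.
house_nodes = ["Cellar", "Kitchen", "Hall"]
house_edges = [("Cellar", "Kitchen"), ("Kitchen", "Hall")]

cave_nodes = ["Cave", "Tunnel", "Lake"]
cave_edges = [("Tunnel", "Cave"), ("Tunnel", "Lake")]

path_nodes = ["A", "B", "C", "D"]
path_edges = [("A", "B"), ("B", "C"), ("C", "D")]

hub_nodes = ["Hub", "North", "East", "West"]
hub_edges = [("Hub", "North"), ("Hub", "East"), ("Hub", "West")]

same_shape = are_isomorphic(house_nodes, house_edges, cave_nodes, cave_edges)
different_shape = are_isomorphic(path_nodes, path_edges, hub_nodes, hub_edges)
```

Under these assumptions, the house and the cave have the same layout despite different room names, so they would count as one spatial graph, while the four-room corridor and the hub would count as two.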
Similar Papers
In the same crypt: Computation & Language
Old Age
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Old Age
XLNet: Generalized Autoregressive Pretraining for Language Understanding
The Ethereal
Effective Approaches to Attention-based Neural Machine Translation
Old Age
A large annotated corpus for learning natural language inference
Old Age