Benchmarking Failures in Tool-Augmented Language Models

March 18, 2025 Β· Declared Dead Β· πŸ› North American Chapter of the Association for Computational Linguistics

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Eduardo TreviΓ±o, Hugo Contant, James Ngai, Graham Neubig, Zora Zhiruo Wang arXiv ID 2503.14227 Category cs.SE: Software Engineering Cross-listed cs.CL Citations 4 Venue North American Chapter of the Association for Computational Linguistics Last Checked 4 months ago
Abstract
The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume 'perfect' information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find all current models except for Claude struggle to recognize missing tools or information. Further, to study possible mitigation of the failures, we enable real-time human interaction, named the Ask-and-Help (AAH) method, to provide missing information or replace non-functional tools. While AAH can help models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Software Engineering

Died the same way β€” πŸ‘» Ghosted