Does Audio Matter for Modern Video-LLMs and Their Benchmarks?

September 22, 2025 · Declared Dead · 🏛 arXiv.org

Repo contents: .gitignore, README.md, main_table.png

Authors: Geewook Kim, Minjoon Seo
arXiv ID: 2509.17901
Category: cs.CV (Computer Vision)
Cross-listed: cs.MM, cs.SD
Citations: 0
Venue: arXiv.org
Repository: https://github.com/naver-ai/LLaVA-AV-SSM (⭐ 3)
Last Checked: 2 months ago

Abstract

Modern multimodal large language models often claim "video understanding," yet most evaluations use muted videos or simply discard audio. We ask a direct question: how much does audio actually matter for contemporary Video-LLMs and the benchmarks that certify them? We audit widely used suites and observe that many items are solvable even from a single frame, rendering audio largely redundant. Building on the LLaVA-OneVision architecture, we attach a speech/audio encoder (e.g., Whisper) and analyze when audio helps, while addressing audio token explosion with a lightweight Mamba-based state-space token compressor. We find that audio yields minimal gains on recent video benchmarks but is decisive on curated, audio-sensitive subsets. To enable faithful evaluation, we release AVQA-Hard and Music-AVQA-Hard, our model, and code. Our findings surface a growing gap between current academic practice and real-world expectations, and provide practical tools for scalable audio-visual Video-LLMs. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
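
To make the "audio token explosion" mentioned in the abstract concrete, here is a minimal Python sketch. It assumes the standard Whisper encoder rate of 1500 frames per 30-second window (50 per second); the paper's actual audio pipeline may differ. The `ssm_compress` function is a hypothetical toy: a fixed linear recurrence that emits its state every `stride` steps. It is not Mamba (there is no input-dependent selection) and not the authors' compressor; it only illustrates how a recurrent state can summarize many tokens into a few.

```python
from typing import List

# Assumption: the standard Whisper encoder emits 1500 frames per 30 s window.
WHISPER_TOKENS_PER_SECOND = 50


def audio_token_count(duration_s: float) -> int:
    """Approximate number of encoder frames for a clip of the given length."""
    return int(duration_s * WHISPER_TOKENS_PER_SECOND)


def ssm_compress(
    tokens: List[List[float]], stride: int, decay: float = 0.9
) -> List[List[float]]:
    """Toy state-space compressor (illustrative only, not the paper's method).

    Runs the recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t over the
    sequence and keeps the state every `stride` steps, plus the final state.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    if not tokens:
        return []
    state = [0.0] * len(tokens[0])
    compressed: List[List[float]] = []
    for t, x in enumerate(tokens, start=1):
        # Each state dimension is an exponential moving summary of the inputs.
        state = [decay * h + (1.0 - decay) * v for h, v in zip(state, x)]
        if t % stride == 0 or t == len(tokens):
            compressed.append(list(state))
    return compressed


if __name__ == "__main__":
    print(audio_token_count(600))          # frames for a 10-minute clip
    dummy = [[1.0, 2.0] for _ in range(100)]
    print(len(ssm_compress(dummy, stride=25)))
```

Under this assumption a 10-minute clip produces 30,000 audio frames before any compression, which is why some form of token reduction is needed before the frames reach the language model.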

📜 Similar Papers

In the same crypt — Computer Vision

🌅 Old Age

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, ... (+2 more)

cs.CV 🏛 CVPR 📚 220.4K cites 10 years ago

🌅 Old Age

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, Kaiming He, ... (+2 more)

cs.CV 🏛 IEEE TPAMI 📚 70.4K cites 10 years ago

R.I.P. 👻 Ghosted

You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, ... (+2 more)

cs.CV 🏛 CVPR 📚 43.4K cites 10 years ago

🌅 Old Age

SSD: Single Shot MultiBox Detector

Wei Liu, Dragomir Anguelov, ... (+5 more)

cs.CV 🏛 ECCV 📚 33.8K cites 10 years ago

🌅 Old Age

Squeeze-and-Excitation Networks

Jie Hu, Li Shen, ... (+3 more)

cs.CV 🏛 CVPR 📚 32.3K cites 8 years ago

R.I.P. 👻 Ghosted

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, Vincent Vanhoucke, ... (+3 more)

cs.CV 🏛 CVPR 📚 30.2K cites 10 years ago

Died the same way — 🦴 Skeleton Repo

R.I.P. 🦴 Skeleton Repo

EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification

Patrick Helber, Benjamin Bischke, ... (+2 more)

cs.CV 🏛 J.STAEORS 📚 2.4K cites 8 years ago

R.I.P. 🦴 Skeleton Repo

Deep Learning for 3D Point Clouds: A Survey

Yulan Guo, Hanyun Wang, ... (+4 more)

cs.CV 🏛 IEEE TPAMI 📚 2.1K cites 6 years ago

R.I.P. 🦴 Skeleton Repo

Adversarial Examples: Attacks and Defenses for Deep Learning

Xiaoyong Yuan, Pan He, ... (+2 more)

cs.LG 🏛 IEEE TNNLS 📚 1.8K cites 8 years ago

R.I.P. 🦴 Skeleton Repo

Neural Style Transfer: A Review

Yongcheng Jing, Yezhou Yang, ... (+4 more)

cs.CV 🏛 IEEE TVCG 📚 828 cites 8 years ago