VGGSounder: Audio-Visual Evaluations for Foundation Models

August 11, 2025 Β· Declared Dead Β· πŸ› arXiv.org

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Daniil Zverev, ThaddΓ€us Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke arXiv ID 2508.08237 Category cs.MM: Multimedia Cross-listed cs.AI, cs.CV, cs.SD, eess.AS Citations 4 Venue arXiv.org Last Checked 3 months ago
Abstract
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluation audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Multimedia

R.I.P. πŸ‘» Ghosted

Video Generation From Text

Yitong Li, Martin Renqiang Min, ... (+3 more)

cs.MM πŸ› AAAI πŸ“š 300 cites 8 years ago

Died the same way β€” πŸ‘» Ghosted