Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation

May 11, 2025 · Declared Dead · 🏛 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Xilin Jiang, Junkai Wu, Vishal Choudhari, Nima Mesgarani
arXiv ID: 2505.06803
Category: cs.SD (Sound)
Cross-listed: cs.CL, cs.CV, cs.MM, eess.AS
Citations: 2
Venue: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
Last Checked: 4 months ago

Abstract

Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a heuristic model. Distillation in both directions, from Qwen2-VL to Qwen2-Audio and vice versa, leads to notable improvements, particularly in challenging classes. This work highlights the sensory gap in LLMs from a human-aligned perspective and proposes a principled approach to enhancing modality-specific perception in multimodal LLMs.
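
The abstract describes the distillation only at a high level: a teacher in one modality, a student in another, and transfer restricted to classes a heuristic model flags as hard for the student. As a rough illustration of that idea, here is a minimal Python sketch. It is not the authors' implementation. The function names, the temperature-softened KL objective, and the `hard_classes` set (standing in for the heuristic model's output) are all assumptions made for this example, since the abstract does not specify the loss or the selection rule.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def selective_distillation_loss(student_logits, teacher_logits, labels,
                                hard_classes, temperature=2.0):
    """Average KL(teacher || student) over samples in 'hard' classes only.

    hard_classes is a hypothetical stand-in for the output of the
    heuristic model mentioned in the abstract: the set of class indices
    predicted to be challenging for the student modality.
    """
    losses = []
    for s, t, y in zip(student_logits, teacher_logits, labels):
        if y not in hard_classes:
            continue  # no knowledge transfer for classes the student handles well
        p_teacher = softmax(t, temperature)
        p_student = softmax(s, temperature)
        losses.append(kl_divergence(p_teacher, p_student))
    # If no sample falls in a hard class, there is nothing to distill.
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch, swapping which model supplies `teacher_logits` and which supplies `student_logits` would correspond to the two directions mentioned in the abstract (Qwen2-VL to Qwen2-Audio and vice versa), with a separate `hard_classes` set for each student.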