Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
April 16, 2026 · Grace Period
Authors
Jieyi Wang, Yazhe Niu, Dexuan Xu, Zhongyu Wei
arXiv ID
2604.14806
Category
cs.SD: Sound
Cross-listed
cs.MM
Citations
0
Abstract
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, and reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Inspired by Auditory Scene Analysis, we first introduce a Perception-Aware Question Answering (PAQA) dataset. PAQA implements a hierarchical decoupling strategy that separates speech from environmental sound and distinguishes multiple speakers, providing explicit perceptual reasoning for training. Building on this, we propose HyPeR, a two-stage Hybrid Perception-Reasoning framework. In Stage I, we fine-tune the model on PAQA to perceive acoustic attributes in complex audio. In Stage II, we leverage GRPO to refine the model's internal deliberation. We also introduce PAUSE tokens to facilitate latent computation during acoustically ambiguous phases, and we design a perceptual consistency reward to align reasoning rationales with the raw audio. Experiments across benchmarks demonstrate that HyPeR achieves absolute improvements over the base model, with performance comparable to large-scale models, underscoring the effectiveness of hybrid perception-grounded reasoning for robust, multi-speaker audio understanding.
Similar Papers
In the same crypt: Sound
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
TasNet: time-domain audio separation network for real-time, single-channel speech separation
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model