VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

April 19, 2026 · Grace Period · 🏛 INTERSPEECH 2026

โณ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors: Yi-Cheng Lin, Yusuke Hirota, Sung-Feng Huang, Hung-yi Lee
arXiv ID: 2604.17248
Category: eess.AS (Audio & Speech)
Cross-listed: cs.CL, cs.SD
Citations: 0
Venue: INTERSPEECH 2026
Abstract
Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both of which offer only a fragmented view of fairness. We propose VIBE, a framework that evaluates generative bias through open-ended tasks such as personalized recommendations, using real-world human recordings. Unlike MCQs, our method allows stereotypical associations to manifest organically without predefined options, making it easily extensible to new tasks. Evaluating 11 state-of-the-art LALMs reveals systematic biases in realistic scenarios. We find that gender cues often trigger larger distributional shifts than accent cues, indicating that current LALMs reproduce social stereotypes.
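The abstract does not spell out how a "distributional shift" between voice groups is scored, and no code is released yet. As an illustration only, one standard way to quantify such a shift between open-ended categorical outputs is the Jensen-Shannon divergence between the response distributions elicited by two voice groups. A minimal sketch, with hypothetical data and helper names (not the paper's actual metric or pipeline):

```python
import math
from collections import Counter

def to_dist(items):
    """Normalize a list of categorical model outputs into a probability distribution."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1]) between two discrete distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # KL(a || b); terms with a(k) = 0 contribute nothing.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a.get(k, 0.0) > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical career recommendations for the same spoken prompt,
# grouped by the perceived gender of the speaker's voice.
female_voice = ["nursing", "teaching", "nursing", "design"]
male_voice = ["engineering", "finance", "engineering", "design"]

shift = js_divergence(to_dist(female_voice), to_dist(male_voice))  # 0.75 for this toy data
```

A larger divergence for gender-contrasted voices than for accent-contrasted voices, holding the prompt fixed, would match the pattern the abstract reports; in practice one would also need many prompts per task and a canonicalization step for free-form answers before counting them as categories.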
Community shame: Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt · Audio & Speech