MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions
July 14, 2025 Β· Declared Dead Β· π Conference on Empirical Methods in Natural Language Processing
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
arXiv ID
2507.10859
Category
cs.MM: Multimedia
Cross-listed
cs.CL,
cs.HC
Citations
0
Venue
Conference on Empirical Methods in Natural Language Processing
Last Checked
4 months ago
Abstract
The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Multimedia
π
π
Old Age
R.I.P.
π»
Ghosted
Viewport-Adaptive Navigable 360-Degree Video Delivery
π
π
The Cartographer
A Comprehensive Survey on Cross-modal Retrieval
π
π
The Cartographer
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
R.I.P.
π»
Ghosted
A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding
R.I.P.
π»
Ghosted
Video Generation From Text
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted