R.I.P.
๐ป
Ghosted
Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data
March 09, 2026 ยท Grace Period ยท ๐ Interspeech 2026
Authors
Pol Buitrago, Pol Gร lvez, Oriol Pareras, Javier Hernando
arXiv ID
2603.08249
Category
eess.AS: Audio & Speech
Cross-listed
cs.CL,
eess.IV
Citations
0
Venue
Interspeech 2026
Abstract
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Audio & Speech
R.I.P.
๐ป
Ghosted
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
R.I.P.
๐ป
Ghosted
DiffWave: A Versatile Diffusion Model for Audio Synthesis
R.I.P.
๐ป
Ghosted
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
R.I.P.
๐ป
Ghosted
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
R.I.P.
๐ป
Ghosted