Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling

December 21, 2024 · Declared Dead · 🏛 IEEE International Conference on Acoustics, Speech, and Signal Processing

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Shao-Syuan Huang, Kuan-Po Huang, Andy T. Liu, Hung-yi Lee arXiv ID 2412.16474 Category eess.AS: Audio & Speech Cross-listed cs.CL Citations 5 Venue IEEE International Conference on Acoustics, Speech, and Signal Processing Last Checked 3 months ago

Abstract

Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum embedding to more closely approximate the true embedding for unseen languages. Experimental results demonstrate substantial improvements in ASR performance, both in zero-shot and fine-tuning settings. Our proposed methods outperform baseline approaches, providing an effective solution for addressing unseen languages in multilingual ASR.