Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition

November 26, 2024 Β· Declared Dead Β· πŸ› IEEE Signal Processing Letters

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Hyeonseung Lee, Ji Won Yoon, Sungsoo Kim, Nam Soo Kim arXiv ID 2411.17537 Category eess.AS: Audio & Speech Cross-listed cs.LG Citations 0 Venue IEEE Signal Processing Letters Last Checked 3 months ago
Abstract
Transducer neural networks have emerged as the mainstream approach for streaming automatic speech recognition (ASR), offering state-of-the-art performance in balancing accuracy and latency. In the conventional framework, streaming transducer models are trained to maximize the likelihood function based on non-streaming recursion rules. However, this approach leads to a mismatch between training and inference, resulting in the issue of deformed likelihood and consequently suboptimal ASR accuracy. We introduce a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, namely forward variable causal compensation (FoCC). We also present its estimator, FoCCE, as a solution to estimate the exact likelihood. Through experiments on the LibriSpeech dataset, we show that FoCCE training improves the accuracy of the streaming transducers.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Audio & Speech

Died the same way β€” πŸ‘» Ghosted