Improving RNN Transducer Modeling for End-to-End Speech Recognition

September 26, 2019 · Declared Dead · 🏛 Automatic Speech Recognition & Understanding

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Jinyu Li, Rui Zhao, Hu Hu, Yifan Gong
arXiv ID: 1909.12415
Category: cs.CL: Computation & Language
Cross-listed: eess.AS
Citations: 176
Venue: Automatic Speech Recognition & Understanding
Last Checked: 3 months ago

Abstract

In the last few years, an emerging trend in automatic speech recognition research has been the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among them, RNN-T has the advantage of supporting online streaming, which is challenging for AED, and it does not make CTC's frame-independence assumption. In this paper, we improve RNN-T training in two respects. First, we optimize the RNN-T training algorithm to reduce memory consumption, so that a larger training minibatch can be used for faster training. Second, we propose better model structures that yield RNN-T models with very good accuracy but a small footprint. Trained on 30 thousand hours of anonymized and transcribed Microsoft production data, the best RNN-T model, with an even smaller model size (216 megabytes), achieves up to 11.8% relative word error rate (WER) reduction over the baseline RNN-T model. This best RNN-T model is significantly better than the device hybrid model of similar size, achieving up to 15.0% relative WER reduction, and obtains WERs similar to those of the server hybrid model, which is 5120 megabytes in size.
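The abstract reports its gains as relative WER reductions and compares model sizes. As a reading aid, the sketch below shows how such a relative reduction is conventionally computed. The function name `relative_wer_reduction` and the example WER values are illustrative assumptions, not figures from the paper; only the two model sizes (216 and 5120 megabytes) come from the abstract.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Conventional definition: 100 * (baseline - new) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs chosen only to illustrate the formula;
# they are NOT results reported in the paper.
example_reduction = relative_wer_reduction(baseline_wer=10.0, new_wer=9.0)

# Model sizes quoted in the abstract, in megabytes.
rnnt_size_mb = 216
server_hybrid_size_mb = 5120
size_ratio = server_hybrid_size_mb / rnnt_size_mb

print(f"Illustrative relative WER reduction: {example_reduction:.1f}%")
print(f"Server hybrid model is about {size_ratio:.1f}x larger than the RNN-T model")
```

A relative reduction is measured against the baseline's own error rate, so the same percentage corresponds to a smaller absolute WER change when the baseline is already accurate.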