Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition

December 10, 2020 · Declared Dead · 🏛 arXiv.org

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, Fan Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, Xin Lei arXiv ID 2012.05481 Category cs.SD: Sound Cross-listed cs.CL, eess.AS Citations 89 Venue arXiv.org Last Checked 2 months ago

Abstract

In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right context length. At inference time, the CTC decoder generates n-best hypotheses in a streaming way. The inference latency could be easily controlled by only changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to get the final result. This efficient rescoring process causes very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that, the proposed method can unify the streaming and non-streaming model simply and efficiently. On the AISHELL-1 test set, our unified model achieves 5.60% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. The same model achieves 5.42% CER with 640ms latency in a streaming ASR system.

📄 View on arXiv 🌐 View on ar5iv 📑 PDF 🎉 Report Code Found

Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt — Sound

R.I.P. 👻 Ghosted

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, Sander Dieleman, ... (+7 more)

cs.SD 🏛 Speech Synthesis 📚 8.0K cites 9 years ago

R.I.P. 👻 Ghosted

CNN Architectures for Large-Scale Audio Classification

Shawn Hershey, Sourish Chaudhuri, ... (+11 more)

cs.SD 🏛 ICASSP 📚 2.8K cites 9 years ago

R.I.P. 👻 Ghosted

Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

Yi Luo, Nima Mesgarani

cs.SD 🏛 IEEE/ACM TASLP 📚 2.1K cites 7 years ago

R.I.P. 👻 Ghosted

Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

Justin Salamon, Juan Pablo Bello

cs.SD 🏛 IEEE SPL 📚 1.4K cites 9 years ago

R.I.P. 👻 Ghosted

WaveGlow: A Flow-based Generative Network for Speech Synthesis

Ryan Prenger, Rafael Valle, Bryan Catanzaro

cs.SD 🏛 ICASSP 📚 1.1K cites 7 years ago

R.I.P. 👻 Ghosted

Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks

Morten Kolbæk, Dong Yu, ... (+2 more)

cs.SD 🏛 IEEE/ACM TASLP 📚 763 cites 9 years ago

Died the same way — 👻 Ghosted

R.I.P. 👻 Ghosted

Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann, ... (+29 more)

cs.CL 🏛 NeurIPS 📚 54.2K cites 5 years ago

R.I.P. 👻 Ghosted

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, Sam Gross, ... (+19 more)

cs.LG 🏛 NeurIPS 📚 49.7K cites 6 years ago

R.I.P. 👻 Ghosted

XGBoost: A Scalable Tree Boosting System

Tianqi Chen, Carlos Guestrin

cs.LG 🏛 KDD 📚 49.2K cites 10 years ago

R.I.P. 👻 Ghosted

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy

cs.LG 🏛 ICML 📚 46.0K cites 11 years ago