Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training
May 22, 2023 · Entered Twilight · Conference of the European Chapter of the Association for Computational Linguistics
Repo contents: CMSST_main.py, README.md, data_prepare, data_process, datasets, env_speechbrain4.yaml, eval, hparams, models, prepare_aux_text_embedding.py, prepare_process_data.py, processed_data, script, select_aux_sample.py, text_filter, utils
Authors
Jianfeng He, Julian Salazar, Kaisheng Yao, Haoqi Li, Jinglun Cai
arXiv ID
2305.12793
Category
eess.AS: Audio & Speech
Cross-listed
cs.CL,
cs.MM,
cs.SD
Citations
7
Venue
Conference of the European Chapter of the Association for Computational Linguistics
Repository
https://github.com/amazon-science/zero-shot-E2E-slu
⭐ 9
Last Checked
2 months ago
Abstract
End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore *zero-shot* E2E SLU, which learns E2E SLU without speech-semantics pairs, using only speech-text and text-semantics pairs instead. Previous work achieved zero-shot learning by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model trained on text-semantics corpora. However, this method requires the domains of the speech-text and text-semantics data to match, and they often mismatch because the corpora are collected separately. Furthermore, using the entire collected speech-text corpus from any domain leads to *imbalance* and *noise* issues. To address these, we propose *cross-modal selective self-training* (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also introduce two benchmarks for zero-shot E2E SLU, covering matched and found speech (mismatched) settings. Experiments show that CMSST improves performance in both settings, with significantly reduced sample sizes and training time. Our code and data are released at https://github.com/amazon-science/zero-shot-E2E-slu.
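The selection idea in the abstract — cluster pseudolabeled speech-text samples in a joint embedding space to curb imbalance, then prefer low-noise samples within each cluster — can be illustrated with a toy sketch. Everything here is an assumption for illustration: the random 2-D `embeddings`, the `confidences` array standing in for NLU pseudolabel scores, and the per-cluster cap. CMSST's actual joint space and selection network are learned models; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: joint-space embeddings of speech-text pairs and
# NLU pseudolabel confidence scores. The real joint space fuses speech,
# text, and semantics encoders; here we use random 2-D points.
embeddings = rng.normal(size=(200, 2))
confidences = rng.uniform(size=200)

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means used to cluster samples in the joint space."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def select_samples(X, conf, k=5, per_cluster=10):
    """Cap each cluster's contribution (imbalance) and keep the most
    confident pseudolabels within a cluster (noise). A conceptual sketch,
    not the paper's selection network."""
    assign = kmeans(X, k)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        idx = idx[np.argsort(-conf[idx])]  # most confident first
        keep.extend(idx[:per_cluster].tolist())
    return sorted(keep)

selected = select_samples(embeddings, confidences)
print(len(selected))  # at most k * per_cluster = 50 samples are kept
```

The cap-per-cluster step is what shrinks the training set: a dominant region of the joint space contributes at most `per_cluster` samples, which is one simple way to realize the "reduced sample sizes and training time" the abstract reports.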
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt · Audio & Speech
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
R.I.P.
👻
Ghosted
DiffWave: A Versatile Diffusion Model for Audio Synthesis
R.I.P.
👻
Ghosted
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
R.I.P.
👻
Ghosted
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
R.I.P.
👻
Ghosted