A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

March 05, 2023 · Declared Dead · 🏛 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
arXiv ID: 2303.02719
Category: eess.AS (Audio & Speech)
Cross-listed: cs.HC, cs.LG, cs.SD
Citations: 7
Venue: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Last Checked: 3 months ago

Abstract
Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts