How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer

November 15, 2025 · Declared Dead · 🏛 arXiv.org

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Minu Kim, Ji Sub Um, Hoirin Kim
arXiv ID: 2511.12285
Category: eess.AS: Audio & Speech
Cross-listed: cs.CL
Citations: 0
Venue: arXiv.org
Last Checked: 3 months ago

Abstract
Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems (Burmese, Thai, Lao, and Vietnamese) to ask how far such models "listen" for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues: approximately 100ms (Burmese/Thai) and 180ms (Lao/Vietnamese). Probes and gradient analysis on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias toward overly long spans. These findings indicate that tone transfer is shaped by downstream task, highlighting task effects on temporal focus in tone modeling.
