Phonology-Guided Speech-to-Speech Translation for African Languages
October 30, 2024 Β· Declared Dead Β· π Speech Communication
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Peter Ochieng, Dennis Kaburu
arXiv ID
2410.23323
Category
eess.AS: Audio & Speech
Cross-listed
cs.AI,
cs.CL
Citations
0
Venue
Speech Communication
Last Checked
3 months ago
Abstract
We present a prosody-guided framework for speech-to-speech translation (S2ST) that aligns and translates speech \emph{without} transcripts by leveraging cross-linguistic pause synchrony. Analyzing a 6{,}000-hour East African news corpus spanning five languages, we show that \emph{within-phylum} language pairs exhibit 30--40\% lower pause variance and over 3$\times$ higher onset/offset correlation compared to cross-phylum pairs. These findings motivate \textbf{SPaDA}, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment $F_1$ by +3--4 points and eliminates up to 38\% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, we train \textbf{SegUniDiff}, a diffusion-based S2ST model guided by \emph{external gradients} from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade in BLEU (30.3 on CVSS-C vs.\ 28.9 for UnitY), reduces speaker error rate (EER) from 12.5\% to 5.3\%, and runs at an RTF of 1.02. To support evaluation in low-resource settings, we also release a three-tier, transcript-free BLEU suite (M1--M3) that correlates strongly with human judgments. Together, our results show that prosodic cues in multilingual speech provide a reliable scaffold for scalable, non-autoregressive S2ST.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Audio & Speech
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
LPCNet: Improving Neural Speech Synthesis Through Linear Prediction
R.I.P.
π»
Ghosted
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
R.I.P.
π»
Ghosted
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
R.I.P.
π»
Ghosted
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders
R.I.P.
π»
Ghosted
Utterance-level Aggregation For Speaker Recognition In The Wild
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted