Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward
November 12, 2025 Β· Declared Dead Β· π arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Guansu Wang, Peijie Sun
arXiv ID
2511.17555
Category
eess.AS: Audio & Speech
Cross-listed
cs.CL
Citations
0
Venue
arXiv.org
Last Checked
3 months ago
Abstract
Recent advances in text-to-speech (TTS) have enabled models to clone arbitrary unseen speakers and synthesize high-quality, natural-sounding speech. However, evaluation methods lag behind: typical mean opinion score (MOS) estimators perform regression over entire utterances, while failures usually occur in a few problematic words. We observe that encoder-decoder ASR models (e.g., Whisper) surface word-level mismatches between speech and text via cross-attention, providing a fine-grained reward signal. Building on this, we introduce Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). Without explicit reward annotations, W3AR uses attention from a pre-trained ASR model to drive finer-grained alignment and optimization of sequences predicted by a TTS model. Experiments show that W3AR improves the quality of existing TTS systems and strengthens zero-shot robustness on unseen speakers. More broadly, our results suggest a simple recipe for generative modeling: understanding models can act as evaluators, delivering informative, fine-grained feedback for optimization.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Audio & Speech
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
LPCNet: Improving Neural Speech Synthesis Through Linear Prediction
R.I.P.
π»
Ghosted
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
R.I.P.
π»
Ghosted
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
R.I.P.
π»
Ghosted
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders
R.I.P.
π»
Ghosted
Utterance-level Aggregation For Speaker Recognition In The Wild
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted