A Comprehensive Study on the Effectiveness of ASR Representations for Noise-Robust Speech Emotion Recognition
November 13, 2023 ยท Declared Dead ยท ๐ IEEE Transactions on Audio, Speech, and Language Processing
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda
arXiv ID
2311.07093
Category
cs.SD: Sound
Cross-listed
cs.CL,
eess.AS
Citations
5
Venue
IEEE Transactions on Audio, Speech, and Language Processing
Last Checked
3 months ago
Abstract
This paper proposes an efficient attempt to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation for the downstream NSER task. Our experimental results show that 1) the proposed method achieves better NSER performance compared with the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Sound
๐ฎ
๐ฎ
The Ethereal
R.I.P.
๐ป
Ghosted
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
R.I.P.
๐ป
Ghosted
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
R.I.P.
๐ป
Ghosted
TasNet: time-domain audio separation network for real-time, single-channel speech separation
R.I.P.
๐ป
Ghosted
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
R.I.P.
๐ป
Ghosted
MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
๐ป
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
๐ป
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
๐ป
Ghosted