Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers
September 16, 2024 Β· Declared Dead Β· π IEEE Robotics and Automation Letters
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa
arXiv ID
2409.10687
Category
eess.AS: Audio & Speech
Cross-listed
cs.HC,
cs.RO,
cs.SD
Citations
4
Venue
IEEE Robotics and Automation Letters
Last Checked
3 months ago
Abstract
Emotions are an essential element in verbal communication, so understanding individuals' affect during a human-robot interaction (HRI) becomes imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models for individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT and BEiT-based models and tested these models on unseen speech samples from the participants. In the results, we show that fine-tuning vision transformers on benchmark datasets and and then using either these already fine-tuned models or ensembling ViT/BEiT models gets us the highest classification accuracies per individual when it comes to identifying four primary emotions from their speech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla-ViTs or BEiTs.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Audio & Speech
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
LPCNet: Improving Neural Speech Synthesis Through Linear Prediction
R.I.P.
π»
Ghosted
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
R.I.P.
π»
Ghosted
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
R.I.P.
π»
Ghosted
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders
R.I.P.
π»
Ghosted
Utterance-level Aggregation For Speaker Recognition In The Wild
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted