Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
May 20, 2025 ยท Declared Dead ยท ๐ European Signal Processing Conference
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Parthasaarathy Sudarsanam, Irene Martรญn-Moratรณ, Tuomas Virtanen
arXiv ID
2505.14562
Category
cs.SD: Sound
Cross-listed
cs.MM,
eess.AS
Citations
4
Venue
European Signal Processing Conference
Last Checked
3 months ago
Abstract
This paper proposes a single-stage training approach that semantically aligns three modalities - audio, visual, and text using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. Existing deep learning approach for trimodal alignment involves two-stages, that separately align visual-text and audio-text modalities. This approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual and audio-visual captions for video clips, our method jointly optimizes the representation of all the modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio based visual retrieval, highlighting the advantages of unified multimodal representation learning.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Sound
๐ฎ
๐ฎ
The Ethereal
R.I.P.
๐ป
Ghosted
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
R.I.P.
๐ป
Ghosted
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
R.I.P.
๐ป
Ghosted
TasNet: time-domain audio separation network for real-time, single-channel speech separation
R.I.P.
๐ป
Ghosted
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
R.I.P.
๐ป
Ghosted
MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
๐ป
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
๐ป
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
๐ป
Ghosted