Sequential Contrastive Audio-Visual Learning
July 08, 2024 ยท Declared Dead ยท ๐ IEEE International Conference on Acoustics, Speech, and Signal Processing
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Ioannis Tsiamas, Santiago Pascual, Chunghsin Yeh, Joan Serrร
arXiv ID
2407.05782
Category
cs.SD: Sound
Cross-listed
cs.CV,
cs.LG,
cs.MM,
eess.AS
Citations
5
Venue
IEEE International Conference on Acoustics, Speech, and Signal Processing
Last Checked
3 months ago
Abstract
Contrastive learning has emerged as a powerful technique in audio-visual representation learning, leveraging the natural co-occurrence of audio and visual modalities in webscale video datasets. However, conventional contrastive audio-visual learning (CAV) methodologies often rely on aggregated representations derived through temporal aggregation, neglecting the intrinsic sequential nature of the data. This oversight raises concerns regarding the ability of standard approaches to capture and utilize fine-grained information within sequences. In response to this limitation, we propose sequential contrastive audiovisual learning (SCAV), which contrasts examples based on their non-aggregated representation space using multidimensional sequential distances. Audio-visual retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV, with up to 3.5x relative improvements in recall against traditional aggregation-based contrastive learning and other previously proposed methods, which utilize more parameters and data. We also show that models trained with SCAV exhibit a significant degree of flexibility regarding the metric employed for retrieval, allowing us to use a hybrid retrieval approach that is both effective and efficient.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Sound
๐ฎ
๐ฎ
The Ethereal
R.I.P.
๐ป
Ghosted
Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks
R.I.P.
๐ป
Ghosted
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
R.I.P.
๐ป
Ghosted
TasNet: time-domain audio separation network for real-time, single-channel speech separation
R.I.P.
๐ป
Ghosted
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
R.I.P.
๐ป
Ghosted
MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
๐ป
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
๐ป
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
๐ป
Ghosted