Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

March 12, 2025 Β· Declared Dead Β· πŸ› European Signal Processing Conference

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Ali Vosoughi, Dimitra Emmanouilidou, Hannes Gamper arXiv ID 2503.09205 Category cs.MM: Multimedia Cross-listed cs.CL, cs.IR, cs.SD, eess.AS Citations 3 Venue European Signal Processing Conference Last Checked 3 months ago
Abstract
Integrating audio and visual data for training multimodal foundational models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hrs of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quality for data quantity, yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound, compared to training on the full spectrum of uncurated data.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Multimedia

R.I.P. πŸ‘» Ghosted

Video Generation From Text

Yitong Li, Martin Renqiang Min, ... (+3 more)

cs.MM πŸ› AAAI πŸ“š 300 cites 8 years ago

Died the same way β€” πŸ‘» Ghosted