Enhancing Video Music Recommendation with Transformer-Driven Audio-Visual Embeddings

March 06, 2025 · Declared Dead · 🏛 2024 IEEE 5th International Symposium on the Internet of Sounds (IS2)

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Shimiao Liu, Alexander Lerch
arXiv ID: 2503.05008
Category: cs.MM (Multimedia)
Citations: 1
Venue: 2024 IEEE 5th International Symposium on the Internet of Sounds (IS2)
Last Checked: 4 months ago
Abstract
A fitting soundtrack can help a video better convey its content and provide a more immersive experience. This paper introduces a novel approach utilizing self-supervised learning and contrastive learning to automatically recommend audio for video content, thereby eliminating the need for manual labeling. We use a dual-branch cross-modal embedding model that maps both audio and video features into a common low-dimensional space. The fit of various audio-video pairs can then be modeled as an inverse distance measure. In addition, a comparative analysis of various temporal encoding methods is presented, emphasizing the effectiveness of transformers in managing the temporal information of audio-video matching tasks. Through multiple experiments, we demonstrate that our model TIVM, which integrates transformer encoders and uses the InfoNCE loss, significantly improves the performance of audio-video matching and surpasses traditional methods.
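
The abstract names two ingredients that are easy to illustrate: an InfoNCE contrastive loss over paired audio and video embeddings, and ranking candidate audio by distance in the shared space. The following is a minimal NumPy sketch of both, not the authors' implementation. The abstract does not state the similarity function, the temperature, whether the loss is symmetric, or the distance metric, so cosine similarity, `temperature=0.1`, the symmetric form, and Euclidean distance are all assumptions here, and the function names (`info_nce`, `rank_audio`) are hypothetical. The dual-branch encoders themselves are omitted; the inputs stand in for their outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch; row i of each input is a true pair.

    Assumption: cosine similarity and a symmetric video->audio / audio->video
    loss. The abstract only says that an InfoNCE loss is used.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    logits = v @ a.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Positives sit on the diagonal; every other entry acts as a negative.
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2a + loss_a2v)


def rank_audio(query_video, audio_bank):
    """Return audio indices ordered from best to worst fit.

    Assumption: fit is the inverse of Euclidean distance between
    unit-normalized embeddings, so the smallest distance ranks first.
    """
    q = l2_normalize(query_video[None, :])
    bank = l2_normalize(audio_bank)
    dists = np.linalg.norm(bank - q, axis=1)
    return np.argsort(dists)
```

With embeddings from trained encoders, `info_nce` would be minimized during training, and `rank_audio(video_vector, audio_bank)` would produce the recommendation order at inference time.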

📜 Similar Papers

In the same crypt: Multimedia

R.I.P. 👻 Ghosted

Video Generation From Text

Yitong Li, Martin Renqiang Min, ... (+3 more)

cs.MM 🏛 AAAI 📚 300 cites 8 years ago

Died the same way: 👻 Ghosted