A Comprehensive Survey on Generative AI for Video-to-Music Generation

February 18, 2025 · The Cartographer · 🏛 arXiv.org

Authors: Shulei Ji, Songruoyao Wu, Zihao Wang, Shuyu Li, Kejun Zhang
arXiv ID: 2502.12489
Category: eess.AS (Audio & Speech)
Cross-listed: cs.AI, cs.MM
Citations: 5
Venue: arXiv.org

Abstract

The burgeoning growth of video-to-music generation can be attributed to the ascendancy of multimodal generative models. However, there is a lack of literature that comprehensively combs through the work in this field. To fill this gap, this paper presents a comprehensive review of video-to-music generation using deep generative AI techniques, focusing on three key components: conditioning input construction, conditioning mechanism, and music generation frameworks. We categorize existing approaches based on their designs for each component, clarifying the roles of different strategies. Preceding this, we provide a fine-grained categorization of video and music modalities, illustrating how different categories influence the design of components within the generation pipelines. Furthermore, we summarize available multimodal datasets and evaluation metrics while highlighting ongoing challenges in the field.