Multimodal Alignment and Fusion: A Survey

November 26, 2024 · The Cartographer · 🏛 International Journal of Computer Vision


Authors: Songtao Li, Hao Tang
arXiv ID: 2411.17040
Category: cs.CV (Computer Vision)
Citations: 91
Venue: International Journal of Computer Vision

Abstract

This survey provides a comprehensive overview of recent advances in multimodal alignment and fusion within the field of machine learning, driven by the increasing availability and diversity of data modalities such as text, images, audio, and video. Unlike previous surveys that often focus on specific modalities or limited fusion strategies, our work presents a structure-centric and method-driven framework that emphasizes generalizable techniques. We systematically categorize and analyze key approaches to alignment and fusion from two angles: structural perspectives (data-level, feature-level, and output-level fusion) and methodological paradigms, including statistical, kernel-based, graphical, generative, contrastive, attention-based, and large language model (LLM)-based methods. The analysis draws on an extensive review of over 260 relevant studies. Furthermore, this survey highlights critical challenges such as cross-modal misalignment, computational bottlenecks, data quality issues, and the modality gap, along with recent efforts to address them. Applications ranging from social media analysis and medical imaging to emotion recognition and embodied AI are explored to illustrate the real-world impact of robust multimodal systems. The insights provided aim to guide future research toward optimizing multimodal learning systems for improved scalability, robustness, and generalizability across diverse domains.