A Survey of Multimodal Large Language Model from A Data-centric Perspective

May 26, 2024 · The Cartographer · 🏛 arXiv.org

"No code URL or promise found in abstract"
"Title-pattern auto-detect: A Survey of Multimodal Large Language Model from A Data-centric Perspective"

Evidence collected by the PWNC Scanner

Authors Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, Wentao Zhang arXiv ID 2405.16640 Category cs.AI: Artificial Intelligence Cross-listed cs.CL, cs.CV, cs.MM Citations 68 Venue arXiv.org Last Checked 1 day ago

Abstract

Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities, including text, vision, audio, video, and 3D environments. Data plays a pivotal role in the development and refinement of these models. In this survey, we comprehensively review the literature on MLLMs from a data-centric perspective. Specifically, we explore methods for preparing multimodal data during the pretraining and adaptation phases of MLLMs. Additionally, we analyze the evaluation methods for the datasets and review the benchmarks for evaluating MLLMs. Our survey also outlines potential future research directions. This work aims to provide researchers with a detailed understanding of the data-driven aspects of MLLMs, fostering further exploration and innovation in this field.