MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation

February 06, 2025 · Declared Dead · 🏛 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Repo contents: README.md

Authors Qinhan Yu, Zhiyou Xiao, Binghui Li, Zhengren Wang, Chong Chen, Wentao Zhang arXiv ID 2502.04176 Category cs.LG: Machine Learning Cross-listed cs.IR Citations 9 Venue Annual International ACM SIGIR Conference on Research and Development in Information Retrieval Repository https://github.com/MRAMG-Bench/MRAMG ⭐ 17 Last Checked 1 month ago

Abstract

Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, in which we aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a notable lack of a comprehensive benchmark persists for effectively evaluating its performance. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web, Academia, and Lifestyle. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models in the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that can leverage LLMs/MLLMs to generate multimodal responses. Our datasets and complete evaluation results for 11 popular generative models are available at https://github.com/MRAMG-Bench/MRAMG.