Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models

December 03, 2023 Β· Declared Dead Β· πŸ› 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors AndrΓ©s Villa, Juan Carlos LeΓ³n AlcΓ‘zar, Alvaro Soto, Bernard Ghanem arXiv ID 2312.02219 Category cs.CV: Computer Vision Cross-listed cs.CL Citations 22 Venue 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) Last Checked 4 months ago
Abstract
Large Vision and Language Models have enabled significant advances in fully supervised and zero-shot visual tasks. These large architectures serve as the baseline to what is currently known as Instruction Tuning Large Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and visual data. Despite this versatility, IT-LVLM effectiveness in fundamental computer vision problems remains unclear, primarily due to the absence of a standardized evaluation benchmark. This paper introduces a Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs. Our results bring important insights on the performance of state-of-the-art IT-LVLMs including limitations at identifying fine-grained visual concepts, object hallucinations across tasks, and biases towards the language query. Our findings also suggest that these models have weak visual grounding, but manage to make adequate guesses from global visual patterns or language biases contained in the LLM component. We name this phenomenon of correct answers with no visual grounding as hidden hallucinations.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Computer Vision

πŸŒ… πŸŒ… Old Age

Fast R-CNN

Ross Girshick

cs.CV πŸ› ICCV πŸ“š 27.7K cites 11 years ago

Died the same way β€” πŸ‘» Ghosted