Leveraging Large Vision-Language Model as User Intent-aware Encoder for Composed Image Retrieval

December 15, 2024 · Declared Dead · 🏛 AAAI Conference on Artificial Intelligence

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Zelong Sun, Dong Jing, Guoxing Yang, Nanyi Fei, Zhiwu Lu
arXiv ID: 2412.11087
Category: cs.IR (Information Retrieval)
Citations: 8
Venue: AAAI Conference on Artificial Intelligence
Last Checked: 4 months ago

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images from a candidate set using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies to address the task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose CIR-LVLM, a novel framework that leverages a large vision-language model (LVLM) as a powerful user intent-aware encoder to better meet these requirements. Our motivation is to explore the advanced reasoning and instruction-following capabilities of LVLMs for accurately understanding and responding to the user intent. Furthermore, we design a novel hybrid intent instruction module to provide explicit intent guidance at two levels: (1) The task prompt clarifies the task requirement and assists the model in discerning user intent at the task level. (2) The instance-specific soft prompt, which is adaptively selected from a learnable prompt pool, enables the model to better comprehend the user intent at the instance level than a universal prompt shared by all instances. CIR-LVLM achieves state-of-the-art performance across three prominent benchmarks with acceptable inference efficiency. We believe this study provides fundamental insights into CIR-related fields.
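
The abstract does not say how the instance-specific soft prompt is selected from the pool, so the following is only a minimal sketch of one plausible scheme: each pool entry carries a key vector, and the entries whose keys are most similar (by cosine similarity) to a feature of the query are chosen. The `PromptPool` class, the key-matching rule, the `top_k` parameter, and all dimensions are assumptions made for illustration, not details from the paper. Keys and prompts are fixed random values here, whereas in a real model they would be trainable parameters.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class PromptPool:
    """Hypothetical pool of (key, soft prompt) pairs.

    Assumption: selection is done by key-query similarity. The values are
    random placeholders standing in for learnable parameters.
    """

    def __init__(self, pool_size, key_dim, prompt_len, embed_dim, seed=0):
        rng = random.Random(seed)
        # One key vector per pool entry, compared against the query feature.
        self.keys = [
            [rng.gauss(0, 1) for _ in range(key_dim)] for _ in range(pool_size)
        ]
        # One soft prompt per entry: prompt_len token embeddings of size embed_dim.
        self.prompts = [
            [[rng.gauss(0, 0.02) for _ in range(embed_dim)] for _ in range(prompt_len)]
            for _ in range(pool_size)
        ]

    def select(self, query_feature, top_k=1):
        """Return the indices of the top_k best-matching keys and their prompts,
        concatenated into one list of token embeddings."""
        scores = [cosine(query_feature, key) for key in self.keys]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[:top_k]
        soft_prompt = [token for i in chosen for token in self.prompts[i]]
        return chosen, soft_prompt


# Usage: a query feature (random here) picks its instance-specific prompt.
pool = PromptPool(pool_size=4, key_dim=8, prompt_len=2, embed_dim=16)
query = [random.Random(1).gauss(0, 1) for _ in range(8)]
indices, soft_prompt = pool.select(query, top_k=2)
print(indices, len(soft_prompt))
```

Under this assumed design, the selected soft-prompt embeddings would be placed alongside the task prompt and the relative caption in the LVLM input, so that different queries can receive different instance-level guidance rather than one universal prompt.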