Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey

September 09, 2025 · The Cartographer · 🏛 arXiv.org

"No code URL or promise found in abstract"
"Title-pattern auto-detect: Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey"

Evidence collected by the PWNC Scanner

Authors Minghan Li, Xinxuan Lv, Junjie Zou, Tongna Chen, Chao Zhang, Suchao An, Ercong Nie, Guodong Zhou arXiv ID 2509.07794 Category cs.IR: Information Retrieval Citations 5 Venue arXiv.org Last Checked 3 days ago

Abstract

Modern information retrieval (IR) must reconcile short, ambiguous queries with increasingly diverse and dynamic corpora. Query expansion (QE) remains central to alleviating vocabulary mismatch, yet the design space has shifted with pre-trained and large language models (PLMs, LLMs). In this survey, we organize recent work along four complementary dimensions: the point of injection (implicit/embedding vs. selection-based explicit), grounding and interaction (from zero-grounding prompts to multi-round retrieve-expand loops), learning and alignment (SFT/PEFT/DPO), and knowledge-graph integration. A model-centric taxonomy is also outlined, spanning encoder-only, encoder-decoder, decoder-only, instruction-tuned, and domain or multilingual variants, with affordances for QE such as contextual disambiguation, controllable generation, and zero-shot or few-shot reasoning. Practice-oriented guidance specifies where neural QE helps most: first-stage retrieval, multi-query fusion, re-ranking, and retrieval-augmented generation (RAG). The survey compares traditional and neural QE across seven aspects and maps applications in web search, biomedicine, e-commerce, open-domain question answering/RAG, conversational and code search, and cross-lingual settings. The survey concludes with an agenda focused on reliable, safe, efficient, and adaptive QE, offering a principled blueprint for deploying and combining techniques under real-world constraints.