Evaluating LLMs' Assessment of Mixed-Context Hallucination Through the Lens of Summarization

March 03, 2025 · Declared Dead · 🏛 Annual Meeting of the Association for Computational Linguistics

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Siya Qi, Rui Cao, Yulan He, Zheng Yuan arXiv ID 2503.01670 Category cs.CL: Computation & Language Cross-listed cs.AI, cs.CY, cs.IR, cs.LG Citations 4 Venue Annual Meeting of the Association for Computational Linguistics Last Checked 4 months ago

Abstract

With the rapid development of large language models (LLMs), LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, which remains inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs' capability in detecting mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct generation and retrieval-based models of varying scales, our main observations are: (1) LLMs' intrinsic knowledge introduces inherent biases in hallucination evaluation; (2) These biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; (3) The fundamental challenge lies in effective knowledge utilization, balancing between LLMs' intrinsic knowledge and external context for accurate mixed-context hallucination evaluation.