TOP-Training: Target-Oriented Pretraining for Medical Extractive Question Answering

October 25, 2023 · Declared Dead · 🏛 COLING 2025

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Saptarshi Sengupta, Connor Heaton, Shreya Ghosh, Wenpeng Yin, Preslav Nakov, Suhang Wang arXiv ID 2310.16995 Category cs.CL: Computation & Language Citations 2 Venue COLING 2025 Last Checked 4 months ago

Abstract

We study extractive question-answering in the medical domain (Medical-EQA). This problem has two main challenges: (i) domain specificity, as most AI models lack necessary domain knowledge, and (ii) extraction-based answering style, which restricts most autoregressive LLMs due to potential hallucinations. To handle those challenges, we propose TOP-Training, a target-oriented pre-training paradigm that stands out among all domain adaptation techniques with two desirable features: (i) TOP-Training moves one step further than popular domain-oriented fine-tuning since it not only moves closer to the target domain, but also familiarizes itself with the target dataset, and (ii) it does not assume the existence of a large set of unlabeled instances from the target domain. Specifically, for a target Medical-EQA dataset, we extract its entities and leverage large language models (LLMs) to generate synthetic texts containing those entities; we then demonstrate that pretraining on this synthetic text data yields better performance on the target Medical-EQA benchmarks. Overall, our contributions are threefold: (i) TOP-Training, a new pretraining technique to effectively adapt LLMs to better solve a target problem, (ii) TOP-Training has a wide application scope because it does not require the target problem to have a large set of unlabeled data, and (iii) our experiments highlight the limitations of autoregressive LLMs, emphasizing TOP-Training as a means to unlock the true potential of bidirectional LLMs.