Robustifying and Selecting Cohort-Appropriate Prognostic Models under Distributional Shifts

April 16, 2026

Authors: Dimitris Bertsimas, Carol Gao, Angelos G. Koulouras, Georgios Antonios Margonis
arXiv ID: 2604.16537
Category: stat.ME
Cross-listed: cs.AI, stat.AP
Citations: 0

Abstract

External validation is widely regarded as the gold standard for prognostic model evaluation. In this study, we challenge the assumption that successful external calibration guarantees model generalizability and propose two complementary strategies to improve the transportability of prognostic models across cohorts. Using six real-world surgical cohorts from tertiary academic centers, we tested whether successful external calibration depends largely on similarity in covariates and outcomes between training and validation cohorts, quantified using Kullback-Leibler (KL) divergence, with calibration assessed by the Integrated Calibration Index (ICI). From the model developer's perspective, we trained the "best-on-average" prognostic model by tuning toward a meta-analysis-derived covariate and outcome distribution as an approximation of the broader target population. From the end user's perspective, we proposed a simple measure of cohort outcome similarity to identify, among published models, the one most suitable for a given target cohort in terms of both calibration and clinical utility. External calibration worsened as distributional mismatch increased. Higher KL divergence was associated with higher ICI in both surgery-alone (Spearman $\rho=0.614$, $p=0.004$) and surgery + adjuvant chemotherapy cohorts (Spearman $\rho=0.738$, $p<0.001$). Meta-analysis-informed weighting improved calibration in most settings without materially affecting discrimination, with the clearest benefit when evaluated on the aggregated external population ($p=0.037$). Models developed in more similar cohorts achieved lower ICI in surgery-alone (Spearman $\rho=0.803$, $p<0.001$) and surgery + adjuvant chemotherapy cohorts (Spearman $\rho=0.737$, $p<0.001$), and provided greater clinical utility on decision curve analysis (DCA).
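
To make the two quantities in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It computes a KL divergence between the empirical distributions of one categorical covariate in two cohorts, and a binned calibration error that serves only as a crude stand-in for the ICI (the ICI is usually computed from a smoothed calibration curve, which is omitted here for brevity). The function names, the `eps` floor, the bin count, and the toy data are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def empirical_distribution(values):
    """Map each observed category to its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return {k: c / n for k, c in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) in nats for discrete distributions given as dicts.

    Assumption: probabilities in Q are floored at eps so that categories
    missing from Q do not cause a division by zero.
    """
    total = 0.0
    for k, pk in p.items():
        if pk > 0:
            qk = max(q.get(k, 0.0), eps)
            total += pk * math.log(pk / qk)
    return total


def binned_calibration_error(pred, outcome, n_bins=10):
    """Weighted mean |observed event rate - mean predicted risk| over
    equal-width probability bins. A simplified proxy, not the ICI itself."""
    bins = {}
    for p_i, y_i in zip(pred, outcome):
        b = min(int(p_i * n_bins), n_bins - 1)  # clamp p = 1.0 into last bin
        bins.setdefault(b, []).append((p_i, y_i))
    n = len(pred)
    err = 0.0
    for items in bins.values():
        mean_p = sum(p for p, _ in items) / len(items)
        obs = sum(y for _, y in items) / len(items)
        err += len(items) / n * abs(obs - mean_p)
    return err


# Hypothetical toy data: tumor stage in a training and a validation cohort.
train_stage = ["I", "I", "II", "III"]
valid_stage = ["I", "II", "II", "III"]
kl = kl_divergence(empirical_distribution(train_stage),
                   empirical_distribution(valid_stage))

# Hypothetical predicted risks and observed binary outcomes.
miscalibration = binned_calibration_error([0.9, 0.9], [0, 0])
```

In a study like the one described, one such divergence and one calibration value would be computed per training/validation cohort pair, and the rank correlation between the two series would then be examined.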