Evaluating and Generating Query Workloads for High Dimensional Vector Similarity Search

June 12, 2026 · Grace Period · 🏛 the proceedings of KDD 2025

Authors: Matteo Ceccarello, Alexandra Levchenko, Ioana Ileana, Themis Palpanas · arXiv ID: 2606.14511 · Category: cs.DB (Databases) · Citations: 0 · Venue: the proceedings of KDD 2025

Abstract

Similarity search lies at the heart of many modern applications, ranging from databases to deep learning to data series analysis. As such, a vast effort has been invested in developing algorithms, data structures, and implementations to speed up this crucial subroutine. To empirically validate these approaches, several benchmarking efforts have been initiated, covering a wide array of datasets. In this paper, we observe that little control is usually exercised over the hardness of the workloads with which methods are tested and compared. To address this issue, we first evaluate several query hardness measures with respect to their ability to capture the empirical hardness of a query, i.e., the effort invested by an index data structure to provide an answer. Then, we propose two methods, named HephAnn and HephGrad, for synthesizing query workloads that meet a user-specified hardness target. Both methods can produce workloads with the desired hardness: we find that HephGrad is faster, while HephAnn makes fewer assumptions about the target hardness measure. The resulting workloads can be used to gain insights into the behavior of similarity search algorithms.
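
To make the idea of a hardness target concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not name the hardness measures it evaluates, so the sketch assumes relative contrast (mean distance to the dataset divided by the distance to the k-th nearest neighbor, where values close to 1 indicate harder queries) as an illustrative measure. The search loop is a plain hill climb standing in for the two proposed synthesis methods, whose details are not given here; `relative_contrast`, `synthesize_query`, and all parameters are hypothetical names chosen for this example.

```python
import numpy as np


def relative_contrast(query, data, k=10):
    """Illustrative hardness measure (an assumption, not taken from the abstract):
    mean distance from the query to all points, divided by the distance to its
    k-th nearest neighbor. Values close to 1 suggest a harder query."""
    dists = np.linalg.norm(data - query, axis=1)
    kth = np.partition(dists, k - 1)[k - 1]  # k-th smallest distance
    return dists.mean() / kth


def synthesize_query(data, target, k=10, steps=2000, scale=0.1, tol=0.01, seed=0):
    """Hill-climbing stand-in for a hardness-targeted query generator.
    Starts near a random data point and keeps random perturbations that move
    the measured hardness closer to `target`. Returns (query, remaining gap)."""
    rng = np.random.default_rng(seed)
    start = data[rng.integers(len(data))]
    query = start + rng.normal(scale=scale, size=data.shape[1])
    gap = abs(relative_contrast(query, data, k) - target)
    for _ in range(steps):
        if gap <= tol:
            break
        candidate = query + rng.normal(scale=scale, size=query.shape)
        cand_gap = abs(relative_contrast(candidate, data, k) - target)
        if cand_gap < gap:  # accept only improving moves
            query, gap = candidate, cand_gap
    return query, gap


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    data = rng.normal(size=(1000, 16))
    query, gap = synthesize_query(data, target=1.5)
    print("hardness:", relative_contrast(query, data), "gap to target:", gap)
```

The hill climb gives no guarantee of reaching the target within the step budget, so the returned gap should be checked before a generated query is added to a workload.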