A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighborhood Search
August 06, 2024 Β· Declared Dead Β· π International Conference on Information and Knowledge Management
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Yiping Sun, Yang Shi, Jiaolong Du
arXiv ID
2408.02937
Category
cs.IR: Information Retrieval
Citations
7
Venue
International Conference on Information and Knowledge Management
Last Checked
4 months ago
Abstract
In recent years, Approximate Nearest Neighbor Search (ANNS) has played a pivotal role in modern search and recommendation systems, especially in emerging LLM applications like Retrieval-Augmented Generation. There is a growing exploration into harnessing the parallel computing capabilities of GPUs to meet the substantial demands of ANNS. However, existing systems primarily focus on offline scenarios, overlooking the distinct requirements of online applications that necessitate real-time insertion of new vectors. This limitation renders such systems inefficient for real-world scenarios. Moreover, previous architectures struggled to effectively support real-time insertion due to their reliance on serial execution streams. In this paper, we introduce a novel Real-Time Adaptive Multi-Stream GPU ANNS System (RTAMS-GANNS). Our architecture achieves its objectives through three key advancements: 1) We initially examined the real-time insertion mechanisms in existing GPU ANNS systems and discovered their reliance on repetitive copying and memory allocation, which significantly hinders real-time effectiveness on GPUs. As a solution, we introduce a dynamic vector insertion algorithm based on memory blocks, which includes in-place rearrangement. 2) To enable real-time vector insertion in parallel, we introduce a multi-stream parallel execution mode, which differs from existing systems that operate serially within a single stream. Our system utilizes a dynamic resource pool, allowing multiple streams to execute concurrently without additional execution blocking. 3) Through extensive experiments and comparisons, our approach effectively handles varying QPS levels across different datasets, reducing latency by up to 40%-80%. The proposed system has also been deployed in real-world industrial search and recommendation systems, serving hundreds of millions of users daily, and has achieved good results.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Information Retrieval
R.I.P.
π»
Ghosted
π
π
Old Age
Neural Graph Collaborative Filtering
R.I.P.
π»
Ghosted
DeepFM: A Factorization-Machine based Neural Network for CTR Prediction
R.I.P.
π»
Ghosted
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
R.I.P.
π
404 Not Found
Graph Neural Networks for Social Recommendation
R.I.P.
π»
Ghosted
Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted