Splitwiser: Efficient LM inference with constrained resources
April 21, 2025 · Entered Twilight · arXiv.org
Repo contents: .buildkite, .dockerignore, .github, .gitignore, .readthedocs.yaml, CONTRIBUTING.md, Dockerfile, Dockerfile.rocm, LICENSE, MANIFEST.in, README.md, benchmarks, csrc, docs, examples, format.sh, mypy.ini, patch_xformers.rocm.sh, pyproject.toml, requirements-build.txt, requirements-dev.txt, requirements-neuron.txt, requirements-rocm.txt, requirements.txt, rocm_patch, setup.py, tests, vllm
Authors
Asad Aali, Adney Cardoza, Melissa Capo
arXiv ID
2505.03763
Category
cs.AR: Hardware Architecture
Cross-listed
cs.AI, cs.DC, cs.LG
Citations
0
Venue
arXiv.org
Repository
https://github.com/adney11/vllm-sysml
Last Checked
3 months ago
Abstract
Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface and vLLM. We open-source our code for the respective implementations: 1) Huggingface (https://github.com/asad-aali/splitwiser), and 2) vLLM (https://github.com/adney11/vllm-sysml).
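The two-phase split described in the abstract can be illustrated with a minimal sketch, assuming a toy stand-in for the model. One process runs the prompt (prefill) phase, a second runs token generation (decode), and a queue hands each request from one to the other. Every name here (`prefill`, `decode`, `prefill_worker`, `decode_worker`, `run_split`) is hypothetical and is not taken from the Splitwiser repositories. No GPU is involved, and the "KV cache" is a plain Python list that the queue copies between processes, whereas the paper's design keeps both phases on the same GPU.

```python
import multiprocessing as mp


def prefill(prompt):
    """Toy prompt phase: consume the whole prompt at once and build a 'KV cache'."""
    return [ord(ch) % 97 for ch in prompt]


def decode(cache, max_new_tokens):
    """Toy generation phase: emit one token per step, each depending on the cache."""
    cache = list(cache)  # work on a copy so the caller's cache is untouched
    generated = []
    for _ in range(max_new_tokens):
        token = sum(cache) % 97  # placeholder for a real forward pass
        generated.append(token)
        cache.append(token)
    return generated


def prefill_worker(requests, handoff):
    """Process 1: run the prompt phase, then pass the cache to the decode process."""
    for req_id, prompt, n in iter(requests.get, None):  # None is the stop sentinel
        handoff.put((req_id, prefill(prompt), n))
    handoff.put(None)


def decode_worker(handoff, results):
    """Process 2: run token generation for every request handed over."""
    for req_id, cache, n in iter(handoff.get, None):
        results.put((req_id, decode(cache, n)))
    results.put(None)


def run_split(prompts, max_new_tokens=4):
    """Serve all prompts through the two-process pipeline (hypothetical driver)."""
    requests, handoff, results = mp.Queue(), mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=prefill_worker, args=(requests, handoff)),
        mp.Process(target=decode_worker, args=(handoff, results)),
    ]
    for worker in workers:
        worker.start()
    for req_id, prompt in enumerate(prompts):
        requests.put((req_id, prompt, max_new_tokens))
    requests.put(None)
    finished = dict(iter(results.get, None))  # drain results until the sentinel
    for worker in workers:
        worker.join()
    return [finished[req_id] for req_id in range(len(prompts))]
```

Calling `run_split(["hello", "split the phases"])` from under an `if __name__ == "__main__":` guard returns one list of generated tokens per prompt. Because the two workers run concurrently, the prefill of a later request can overlap with the decode of an earlier one, which is the overlap the abstract aims for; how much this helps on a real GPU is what the paper's preliminary results examine.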
Similar Papers
In the same crypt: Hardware Architecture
Corona: System Implications of Emerging Nanophotonic Technology
A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks