AdaSplash-2: Faster Differentiable Sparse Attention

April 16, 2026 ยท Grace Period ยท + Add venue

โณ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors Nuno Gonรงalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, Marcos Treviso arXiv ID 2604.15180 Category cs.LG: Machine Learning Cross-listed cs.CL Citations 0
Abstract
Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context training. A promising line of work is $ฮฑ$-entmax attention, a differentiable sparse alternative to softmax that enables input-dependent sparsity yet has lagged behind softmax due to the computational overhead necessary to compute the normalizer $ฯ„$. In this paper, we introduce AdaSplash-2, which addresses this limitation through a novel histogram-based initialization that reduces the number of iterations needed to compute $ฯ„$ to typically 1--2. The key idea is to compute a coarse histogram of attention scores on the fly and store it in on-chip SRAM, yielding a more accurate initialization that enables fast forward and backward computation. Combined with a sparsity-aware GPU implementation that skips zero blocks with low overhead, AdaSplash-2 matches or improves per-step training time relative to FlashAttention-2 when block sparsity is moderate-to-high (e.g., $>$60\%), which often occurs at long-context lengths. On downstream tasks, models trained with our efficient $ฮฑ$-entmax attention match softmax baselines at short-context lengths and achieve substantial gains in long-context settings.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Machine Learning