Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation
November 27, 2024 · 🏛 Conference on Empirical Methods in Natural Language Processing
"No code URL or promise found in abstract"
"HuggingFace models found (backfill)"
Evidence collected by the PWNC Scanner
Authors
Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, Zhaopeng Tu
arXiv ID
2411.18462
Category
cs.CL: Computation and Language
Cross-listed
cs.AI
Citations
3
Venue
Conference on Empirical Methods in Natural Language Processing
Repository
https://huggingface.co/Geralt-Targaryen/QwQ-1.5B-Persona
Abstract
Conventional speculative decoding (SD) methods utilize a predefined length policy for proposing drafts, which implies the premise that the target model smoothly accepts the proposed draft tokens. However, reality deviates from this assumption: the oracle draft length varies significantly, and the fixed-length policy hardly satisfies such a requirement. Moreover, such discrepancy is further exacerbated in scenarios involving complex reasoning and long-form generation, particularly under test-time scaling for reasoning-specialized models. Through both theoretical and empirical estimation, we establish that the discrepancy between the draft and target models can be approximated by the draft model's prediction entropy: a high entropy indicates a low acceptance rate of draft tokens, and vice versa. Based on this insight, we propose SVIP: Self-Verification Length Policy for Long-Context Speculative Decoding, which is a training-free dynamic length policy for speculative decoding systems that adaptively determines the lengths of draft sequences by referring to the draft entropy. Experimental results on mainstream SD benchmarks as well as reasoning-heavy benchmarks demonstrate the superior performance of SVIP, achieving up to 17% speedup on MT-Bench at 8K context compared with fixed draft lengths, and 22% speedup for QwQ in long-form reasoning.
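The entropy-driven length policy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give SVIP's exact stopping criterion, so the fixed entropy threshold, the `max_len` cap, and the function names `entropy` and `draft_length` are all assumptions made for illustration. The idea shown is only that drafting continues while the draft model's next-token distribution has low entropy and stops once entropy rises.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_length(step_distributions, threshold=1.0, max_len=8):
    """Return how many draft tokens to propose before handing off to the target model.

    step_distributions: the draft model's next-token distributions, one per
        drafting step (hypothetical input; a real system would compute these
        on the fly while drafting).
    threshold: assumed entropy cutoff in nats, not a value from the paper.
    max_len: assumed upper bound on the draft length.
    """
    n = 0
    for probs in step_distributions[:max_len]:
        # High entropy suggests the target model is unlikely to accept this
        # token, so stop drafting here.
        if entropy(probs) > threshold:
            break
        n += 1
    return n


# Usage: two confident steps followed by an uncertain (uniform) one.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(draft_length([confident, confident, uncertain, confident]))
```

With these example distributions the draft stops at the third step, because the uniform distribution's entropy (ln 4, about 1.39 nats) exceeds the assumed threshold of 1.0. A fixed-length policy would instead keep drafting tokens that are likely to be rejected.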