๐ฎ
๐ฎ
The Ethereal
A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions
April 19, 2026 ยท Grace Period ยท ๐ ACL 2026
Authors
Zhiyin Yu, Yuchen Mou, Juncheng Yan, Junyu Luo, Chunchun Chen, Xing Wei, Yunhui Liu, Hongru Sun, Yuxing Zhang, Jun Xu, Yatao Bian, Ming Zhang, Wei Ye, Tieke He, Jie Yang, Guanjie Zheng, Zhonghai Wu, Bo Zhang, Lei Bai, Xiao Luo
arXiv ID
2604.17312
Category
cs.LG: Machine Learning
Cross-listed
cs.AI
Citations
0
Venue
ACL 2026
Abstract
Reinforcement learning (RL) has emerged as a powerful post-training paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, reinforcement learning for LLMs faces substantial data scarcity challenges, including the limited availability of high-quality external supervision and the constrained volume of model-generated experience. These limitations make data-efficient reinforcement learning a critical research direction. In this survey, we present the first systematic review of reinforcement learning for LLMs under data scarcity. We propose a bottom-up hierarchical framework built around three complementary perspectives: the data-centric perspective, the training-centric perspective, and the framework-centric perspective. We develop a taxonomy of existing methods, summarize representative approaches in each category, and analyze their strengths and limitations. Our taxonomy aims to provide a clear conceptual foundation for understanding the design space of data-efficient RL for LLMs and to guide researchers working in this emerging area. We hope this survey offers a comprehensive roadmap for future research and inspires new directions toward more efficient and scalable reinforcement learning post-training for LLMs.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Machine Learning
๐ฎ
๐ฎ
The Ethereal
Continuous control with deep reinforcement learning
๐
๐
Old Age
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
๐
๐
Old Age
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
๐
๐
Old Age
SGDR: Stochastic Gradient Descent with Warm Restarts
๐ฎ
๐ฎ
The Ethereal