A Survey of Reinforcement Learning from Human Feedback
December 22, 2023 · The Cartographer · arXiv.org
"No code URL or promise found in abstract"
"Title-pattern auto-detect: A Survey of Reinforcement Learning from Human Feedback"
Evidence collected by the PWNC Scanner
Authors
Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier
arXiv ID
2312.14925
Category
cs.LG: Machine Learning
Citations
281
Venue
arXiv.org
Abstract
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning provides a promising approach to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The success in training large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF has played a decisive role in directing the model's capabilities towards human objectives. This article provides an overview of the fundamentals of RLHF, exploring how RL agents interact with human feedback. While recent focus has been on RLHF for LLMs, our survey covers the technique across multiple domains. We provide our most comprehensive coverage in control and robotics, where many fundamental techniques originate, alongside a dedicated LLM section. We examine the core principles that underpin RLHF, how algorithms and human feedback work together, and the main research trends in the field. Our goal is to give researchers and practitioners a clear understanding of this rapidly growing field.
Similar Papers
In the same crypt: Machine Learning
Transcended
Transcended
Continuous control with deep reinforcement learning
Old Age
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Old Age
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Old Age
SGDR: Stochastic Gradient Descent with Warm Restarts
Transcended