Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
June 08, 2019 · Declared Dead · Neural Information Processing Systems
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Tengyang Xie, Yifei Ma, Yu-Xiang Wang
arXiv ID
1906.03393
Category
cs.LG: Machine Learning
Cross-listed
cs.AI, stat.ML
Citations
189
Venue
Neural Information Processing Systems
Last Checked
3 months ago
Abstract
Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $$ \frac{1}{n} \sum\nolimits_{t=1}^H\mathbb{E}_\mu\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2} \mathrm{Var}_\mu\left[\frac{\pi_t(a_t|s_t)}{\mu_t(a_t|s_t)}\big( V_{t+1}^\pi(s_{t+1}) + r_t\big) \middle| s_t\right]\right] + \tilde{O}(n^{-1.5}) $$ where $\mu$ and $\pi$ are the logging and target policies, $d_t^\mu(s_t)$ and $d_t^\pi(s_t)$ are the marginal distributions of the state at the $t$th step, $H$ is the horizon, $n$ is the sample size and $V_{t+1}^\pi$ is the value function of the MDP under $\pi$. The result matches the Cramer-Rao lower bound in Jiang and Li (2016) up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.
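The recursion the abstract describes can be illustrated with a minimal tabular sketch. This is not the authors' code: it assumes discrete states and actions, known behavior-policy probabilities, and logged episodes of equal length. The function names `mis_estimate` and `simulate`, the array layouts, and the fixed initial state 0 are all hypothetical choices made for illustration. At each step the sketch reweights observed transitions and rewards by the per-step ratio $\pi_t(a_t|s_t)/\mu_t(a_t|s_t)$, then propagates the estimated state marginal of the target policy forward.

```python
import numpy as np


def mis_estimate(states, actions, rewards, pi, mu, num_states):
    """Tabular marginalized importance sampling sketch (illustrative only).

    states:  (n, H+1) int array of visited states
    actions: (n, H) int array of logged actions
    rewards: (n, H) float array of logged rewards
    pi, mu:  (H, S, A) arrays of target / behavior action probabilities
    """
    n, H = actions.shape
    # Estimated state marginal under pi at step 0 (initial states do not depend on the policy).
    d_hat = np.bincount(states[:, 0], minlength=num_states) / n
    value = 0.0
    for t in range(H):
        s, a, s_next = states[:, t], actions[:, t], states[:, t + 1]
        rho = pi[t, s, a] / mu[t, s, a]  # per-step importance ratio
        counts = np.bincount(s, minlength=num_states)
        # Assumption: unvisited states contribute zero (divide by 1 instead of 0).
        safe = np.maximum(counts, 1)
        # Importance-weighted mean reward per state at step t.
        r_hat = np.bincount(s, weights=rho * rewards[:, t], minlength=num_states) / safe
        # Importance-weighted state-to-state transition estimate under pi.
        P_hat = np.zeros((num_states, num_states))
        np.add.at(P_hat, (s, s_next), rho)
        P_hat /= safe[:, None]
        value += d_hat @ r_hat
        d_hat = d_hat @ P_hat  # recursive update of the marginal
    return value


def simulate(P, R, policy, n, rng):
    """Hypothetical helper: roll out n episodes from state 0.

    P: (H, S, A, S) transition probabilities, R: (H, S, A) deterministic rewards.
    """
    H, S, A, _ = P.shape
    states = np.zeros((n, H + 1), dtype=int)
    actions = np.zeros((n, H), dtype=int)
    rewards = np.zeros((n, H))
    for i in range(n):
        for t in range(H):
            s = states[i, t]
            a = rng.choice(A, p=policy[t, s])
            actions[i, t] = a
            rewards[i, t] = R[t, s, a]
            states[i, t + 1] = rng.choice(S, p=P[t, s, a])
    return states, actions, rewards
```

One sanity check on this sketch: when the target and behavior policies coincide, every ratio equals 1 and the estimate reduces to the empirical mean return of the logged episodes. Because the weights are applied per step and averaged per state, no product of $H$ ratios ever appears, which is the intuition behind avoiding the exponential dependence on the horizon.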
Similar Papers
In the same crypt: Machine Learning
The Ethereal
The Ethereal
Continuous control with deep reinforcement learning
Old Age
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Old Age
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Old Age
SGDR: Stochastic Gradient Descent with Warm Restarts
The Ethereal
Asynchronous Methods for Deep Reinforcement Learning
Died the same way: Ghosted
R.I.P.
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
Ghosted