Q($\lambda$) with Off-Policy Corrections

February 16, 2016 · Declared Dead · 🏛 International Conference on Algorithmic Learning Theory

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Anna Harutyunyan, Marc G. Bellemare, Tom Stepleton, Remi Munos
arXiv ID: 1602.04951
Category: cs.AI (Artificial Intelligence)
Cross-listed: cs.LG, stat.ML
Citations: 99
Venue: International Conference on Algorithmic Learning Theory
Last Checked: 3 months ago

Abstract
We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter and the discount factor, and formalize an underlying tradeoff in off-policy TD($\lambda$). We illustrate this theoretical relationship empirically on a continuous-state control task.
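
Since no code accompanies the paper, here is a minimal tabular sketch of the idea the abstract describes: each multi-step return is built from TD errors whose bootstrap term is the expectation of the current Q-function under the target policy, with no importance-sampling ratios. The function name `corrected_lambda_returns`, the array layout, and the choice to bootstrap from the final state of a truncated trajectory are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def corrected_lambda_returns(q, pi, states, actions, rewards, gamma, lam):
    """Hypothetical helper: Q-corrected lambda-returns for one trajectory.

    q:       array [S, A], current Q-function estimate
    pi:      array [S, A], target policy probabilities pi(a | s)
    states:  length T + 1 sequence of state indices (generated by any behavior policy)
    actions: length T sequence of action indices
    rewards: length T sequence of rewards
    """
    T = len(actions)
    # Expected Q-value under the target policy in every state.
    v = (pi * q).sum(axis=1)
    returns = np.empty(T)
    acc = 0.0
    # Accumulate discounted TD errors backwards through the trajectory.
    for t in reversed(range(T)):
        s, a, s_next = states[t], actions[t], states[t + 1]
        # Off-policy correction: bootstrap with E_pi[Q(s_next, .)], no probability ratios.
        delta = rewards[t] + gamma * v[s_next] - q[s, a]
        acc = delta + gamma * lam * acc
        returns[t] = q[s, a] + acc
    return returns
```

With `lam=0` this reduces to the one-step expected backup `r + gamma * E_pi[Q]`; larger values of `lam` propagate more of the behavior policy's trajectory into each return, which is the tradeoff the abstract says must be balanced against the policy distance and the discount factor.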


📜 Similar Papers

In the same crypt: Artificial Intelligence

Died the same way: 👻 Ghosted