oO(ML Discuss)
Talking about ICML 2010
Toward Off-Policy Learning Control with Function Approximation
by Csaba Szepesvari, Hamid Maei, Shalabh Bhatnagar, and Richard Sutton, at ICML 2010
We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.
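The abstract does not spell out the update rule, so the following is only a minimal sketch of a single Greedy-GQ-style step in the form commonly given for gradient temporal-difference methods: a primary weight vector theta for the action values plus a secondary vector w that corrects the gradient. The function name `greedy_gq_step`, its argument layout, and the step sizes `alpha` and `beta` are illustrative assumptions, not taken from the paper. Each step uses only dot products and elementwise updates, so its cost is linear in the number of features (times the number of candidate next actions for the greedy maximum).

```python
def dot(u, v):
    """Inner product of two equal-length feature/weight lists."""
    return sum(a * b for a, b in zip(u, v))


def greedy_gq_step(theta, w, phi, reward, next_phis, gamma, alpha, beta):
    """One Greedy-GQ-style update (hypothetical helper, not the authors' code).

    theta     -- weights of the linear action-value estimate
    w         -- secondary weights used for the gradient correction
    phi       -- features of the (state, action) chosen by the behavior policy
    next_phis -- features of every action available in the next state
                 (assumed non-empty; terminal states are not handled here)
    Returns (new_theta, new_w, delta).
    """
    # The target policy is greedy with respect to the current estimate.
    phi_hat = max(next_phis, key=lambda f: dot(theta, f))

    # TD error toward the greedy next-state value.
    delta = reward + gamma * dot(theta, phi_hat) - dot(theta, phi)

    w_phi = dot(w, phi)
    # Main update: TD step plus a correction term along the greedy features.
    new_theta = [
        t + alpha * (delta * p - gamma * w_phi * ph)
        for t, p, ph in zip(theta, phi, phi_hat)
    ]
    # Secondary update: w tracks the expected TD error given the features.
    new_w = [wi + beta * (delta - w_phi) * p for wi, p in zip(w, phi)]
    return new_theta, new_w, delta


# Example: two features, two candidate next actions, all weights start at zero.
theta, w, delta = greedy_gq_step(
    theta=[0.0, 0.0], w=[0.0, 0.0], phi=[1.0, 0.0], reward=1.0,
    next_phis=[[0.0, 1.0], [1.0, 0.0]], gamma=0.9, alpha=0.5, beta=0.25,
)
```

In this sketch the behavior policy only supplies the transition (phi, reward, next state); the greedy policy appears solely inside the update, which matches the "latent learning" setting described above.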