Abstract:
We propose a new approach to reinforcement learning which
combines least squares function approximation with policy
iteration. Our method is model-free and completely off-policy. We
are motivated by the least squares temporal difference learning
algorithm (LSTD), which is known for its efficient use of sample
experiences compared to pure temporal difference algorithms. LSTD
is well suited to prediction problems; however, it has not
previously had a straightforward application to control problems. Moreover,
approximations learned by LSTD are strongly influenced by the
visitation distribution over states. Our new algorithm, Least
Squares Policy Iteration (LSPI), addresses these issues. The
result is an off-policy method that can use (or reuse) data
collected from any source. We have tested LSPI on several
problems, including a bicycle simulator in which it learns to
guide the bicycle to a goal efficiently by merely observing a
relatively small number of completely random trials.
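To make the combination of least squares function approximation and policy iteration concrete, the sketch below shows one common way such a method can be organized: a least-squares evaluation step that fits a linear state-action value function from stored transitions, wrapped in a policy-improvement loop. This is an illustrative sketch, not the paper's exact algorithm; the feature map `phi`, the sample format, and the convergence tolerance are assumptions introduced here for exposition.

```python
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.95):
    """Least-squares evaluation of the current policy from stored samples.

    samples: iterable of (s, a, r, s_next) transitions collected from any source
    phi:     feature map phi(s, a) -> length-k vector (illustrative assumption)
    policy:  current greedy policy, policy(s) -> action
    k:       number of basis functions
    """
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)   # accumulate the least-squares system
        b += r * f
    # small ridge term guards against a singular system on sparse data
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)

def lspi(samples, phi, actions, k, gamma=0.95, tol=1e-6, max_iter=50):
    """Policy iteration built around the least-squares evaluation step."""
    w = np.zeros(k)
    for _ in range(max_iter):
        # greedy policy with respect to the current weights (frozen via default arg)
        policy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, phi, policy, k, gamma)
        if np.linalg.norm(w_new - w) < tol:     # weights converged
            return w_new
        w = w_new
    return w
```

Because the evaluation step works directly from a fixed batch of transitions, the same data can be reused across every policy-improvement iteration, which is what makes this style of method off-policy and sample-efficient relative to pure temporal difference updates.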