Abstract:
We address the problem of non-convergence of online
reinforcement learning algorithms (e.g., Q-learning and
SARSA(λ)) by adopting an incremental-batch approach that
separates the exploration process from the function-fitting
process. Our BFBP (Batch Fit to Best Paths) algorithm alternates
between an exploration phase (during which trajectories are
generated in an attempt to discover fragments of the optimal policy) and a
function-fitting phase (during which a function approximator is
fit to the best known paths from start states to terminal
states). An advantage of this approach is that batch
value-function fitting is a global process, which allows it to
address the tradeoffs in function approximation that cannot be
handled by local, online algorithms. This approach was pioneered
by Boyan and Moore with their GrowSupport and ROUT algorithms. We
show how to improve upon their work by applying a better
exploration process and by enriching the function-fitting
procedure to incorporate Bellman error and advantage error
measures into the objective function. The results show improved
performance on several benchmark problems.
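
To make the alternation concrete, the following is a minimal sketch of a BFBP-style loop on a toy chain MDP. The environment, the tabular value table, and all parameters below are illustrative assumptions rather than the paper's setup, and the fitting phase is reduced to a Monte Carlo backup along the best known path, standing in for the paper's batch function approximation with Bellman-error and advantage-error terms in the objective.

    import random
    from collections import defaultdict

    # Toy chain MDP: states 0..N, actions move left/right, reward -1
    # per step, episode ends on reaching state N. All of this is an
    # illustrative assumption, not the paper's experimental setup.
    N = 10
    ACTIONS = (-1, +1)

    def step(s, a):
        """One transition of the chain MDP."""
        s2 = max(0, min(N, s + a))
        return s2, -1.0, s2 == N

    def greedy(value, s):
        """One-step lookahead on the current value estimate,
        breaking ties at random."""
        q = {a: step(s, a)[1] + value[step(s, a)[0]] for a in ACTIONS}
        best = max(q.values())
        return random.choice([a for a, v in q.items() if v == best])

    def rollout(value, epsilon=0.2, max_len=200):
        """Exploration phase helper: one epsilon-greedy trajectory."""
        s, path, ret, done = 0, [], 0.0, False
        for _ in range(max_len):
            a = random.choice(ACTIONS) if random.random() < epsilon \
                else greedy(value, s)
            s2, r, done = step(s, a)
            path.append((s, a, r))
            ret += r
            s = s2
            if done:
                break
        return path, ret, done

    def fit(best_path):
        """Fitting phase stand-in: back up Monte Carlo returns along
        the best known path. The paper's batch fit would instead train
        a function approximator, with Bellman-error and
        advantage-error penalties added to this supervised
        objective."""
        value = defaultdict(float)
        g = 0.0
        for s, _, r in reversed(best_path):
            g += r
            value[s] = g
        return value

    def bfbp(iterations=10, rollouts=50):
        """Alternate exploration and fitting, as described above."""
        value = defaultdict(float)
        best_ret, best_path = float("-inf"), []
        for _ in range(iterations):
            for _ in range(rollouts):              # exploration phase
                path, ret, done = rollout(value)
                if done and ret > best_ret:
                    best_ret, best_path = ret, path
            if best_path:                          # fitting phase
                value = fit(best_path)
        return best_ret, best_path

    if __name__ == "__main__":
        ret, path = bfbp()
        print("best return found:", ret, "in", len(path), "steps")

Because the fit is performed in batch over whole paths rather than per transition, the approximator sees a globally consistent training set, which is the property the abstract contrasts with local, online updates.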