Abstract:
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
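To make the "steepest descent direction based on the underlying structure of the parameter space" concrete, here is a minimal sketch of a natural-gradient policy update. It assumes the standard formulation in which the metric is the Fisher information of the policy and the update direction is the Fisher-preconditioned policy gradient; the function name, the damping term, and the learning rate are illustrative choices, not part of the paper.

```python
import numpy as np

def natural_gradient_step(theta, grad_log_pi, vanilla_grad, lr=0.1, damping=1e-3):
    """One natural-gradient ascent step (illustrative sketch).

    theta        : (d,) current policy parameters
    grad_log_pi  : (n, d) rows are grad_theta log pi(a_i | s_i; theta)
                   for sampled state-action pairs
    vanilla_grad : (d,) estimate of the vanilla policy gradient
    """
    n, d = grad_log_pi.shape
    # Empirical Fisher information matrix: the metric on parameter space,
    # averaged over sampled state-action pairs. Damping keeps it invertible.
    fisher = grad_log_pi.T @ grad_log_pi / n + damping * np.eye(d)
    # Natural gradient: steepest ascent direction under the Fisher metric.
    nat_grad = np.linalg.solve(fisher, vanilla_grad)
    return theta + lr * nat_grad

# Toy usage with random stand-in data (2 parameters, 50 sampled pairs).
rng = np.random.default_rng(0)
theta = np.zeros(2)
scores = rng.normal(size=(50, 2))   # stand-ins for grad log pi terms
g = scores.mean(axis=0)             # stand-in for a policy-gradient estimate
theta = natural_gradient_step(theta, scores, g)
print(theta)
```

Because the update is preconditioned by the Fisher matrix rather than the raw Euclidean metric, the step is invariant to smooth reparameterizations of the policy, which is the sense in which it respects the structure of the parameter space.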
References
[9] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 13, 2000.