Abstract:
Function approximation is essential to reinforcement learning,
but the standard approach of approximating a value function
and determining a policy from it has so far proven theoretically
intractable. In this paper we explore an alternative approach in
which the policy is explicitly represented by its own function
approximator, independent of the value function, and is updated
according to the gradient of expected reward with respect to the
policy parameters. Williams's REINFORCE method and actor-critic
methods are examples of this approach. Our main new result is to
show that the gradient can be written in a form suitable for
estimation from experience aided by an approximate action-value or
advantage function. Using this result, we prove for the first time
that a version of policy iteration with arbitrary differentiable
function approximation is convergent to a locally optimal
policy.
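
For concreteness, the gradient form the abstract refers to can be
sketched in standard policy-gradient notation (the symbols theta,
rho, d^pi, pi(s,a), and Q^pi are not defined in the abstract itself
and are assumptions of this sketch): the gradient of the performance
measure rho with respect to the policy parameters theta involves only
the policy's own derivatives and the action values, weighted by the
on-policy state distribution d^pi.

    % Sketch of the gradient form: performance measure \rho,
    % policy parameters \theta, on-policy state distribution d^\pi,
    % parameterized policy \pi(s,a), action-value function Q^\pi.
    \[
      \frac{\partial \rho}{\partial \theta}
        \;=\; \sum_{s} d^{\pi}(s)
              \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta}\,
                       Q^{\pi}(s,a)
    \]

Replacing Q^pi(s,a) with a sampled return yields a REINFORCE-style
estimator, while replacing it with a learned approximation yields the
actor-critic form of estimation from experience that the abstract
alludes to.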