Abstract:
We propose and analyze a class of actor-critic algorithms for
simulation-based optimization of a Markov decision process over a
parameterized family of randomized stationary policies. These are
two-time-scale algorithms in which the critic uses TD learning with
a linear approximation architecture, and the actor is updated in an
approximate gradient direction based on information provided by the
critic. We show that a set of appropriate features for the critic
is prescribed by the choice of parametrization of the actor. We
provide an interpretation of the gradient in terms of Riemannian
geometry, and conclude by discussing convergence properties and
some open problems.
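
To make the two-time-scale structure concrete, the updates can be written schematically as follows (an illustrative sketch of a generic actor-critic scheme, not the exact algorithm analyzed in the paper; the symbols $r_k$, $\theta_k$, $\phi$, $d_k$, $\gamma_k$, $\beta_k$ are introduced here for exposition):
\[
r_{k+1} = r_k + \gamma_k \, d_k \, \phi(x_k, u_k),
\qquad
\theta_{k+1} = \theta_k + \beta_k \, \widehat{\nabla} J(\theta_k; r_{k+1}),
\]
where $r_k$ are the critic's linear weights, $d_k$ is a temporal-difference error, $\phi$ are the critic's features, $\theta_k$ is the actor's policy parameter, $\widehat{\nabla} J$ is an approximate gradient assembled from the critic's estimate, and the step sizes satisfy $\beta_k / \gamma_k \to 0$, so the critic evolves on the faster time scale. In this setting the natural critic features are of the form $\psi_\theta(x,u) = \nabla_\theta \ln \mu_\theta(u \mid x)$, i.e., they are determined by the actor's parameterization, as the abstract indicates.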