Abstract:
We consider the use of two additive control variate methods to
reduce the variance of performance gradient estimates in
reinforcement learning problems. The first approach we consider
is the baseline method, in which a function of the current state
is added to the discounted value estimate. We relate the
performance of these methods, which use sample paths, to the
variance of estimates based on iid data. We derive the baseline
function that minimizes this variance, and we show that the
variance for any baseline is the sum of the optimal variance and
a weighted squared distance to the optimal baseline. We show that
the widely used average discounted value baseline (where the
reward is replaced by the difference between the reward and its
expectation) is suboptimal. The second approach we consider is
the actor-critic method, which uses an approximate value
function. We give bounds on the expected squared error of its
estimates. We show that minimizing distance to the true value
function is suboptimal in general; we provide an example for
which the true value function gives an estimate with positive
variance, but the optimal value function gives an unbiased
estimate with zero variance. Our bounds suggest algorithms for
estimating the gradient, with respect to their parameters, of the
performance of parameterized baseline or value functions. We present preliminary experiments
that illustrate the performance improvements on a simple control
problem.
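
As an illustrative sketch of the control variate idea (the notation below is
ours and is not taken from the paper): for a policy parameterized by theta,
the score-function identity E[grad_theta log pi_theta(a|s) | s] = 0 means that
an arbitrary function b of the current state can be subtracted from (or, with
a sign change, added to) the discounted value estimate J without biasing the
gradient estimate:

% Illustrative sketch only; generic policy-gradient notation, assumed rather
% than taken from the paper (pi_theta: policy, J: discounted value estimate,
% b: state-dependent baseline).
\[
  \nabla_\theta \eta
  \;=\; \mathbb{E}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\, J\bigr]
  \;=\; \mathbb{E}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\,\bigl(J - b(s)\bigr)\bigr],
\]

so the choice of b affects only the variance of the estimate. In this
notation, the average discounted value baseline corresponds to
b(s) = E[J | s]; the abstract's claim is that this natural choice does not
minimize the variance.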