Abstract:
We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state-action value function, Q. Theory is presented showing that linear function approximation representations of Q can degrade the rate of convergence of performance gradient estimates by a factor of O(ML) relative to when no function approximation of Q is used, where M is the number of possible actions and L is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state-action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by O(1 - 1/M), where M is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
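To make concrete where the quantities M, L, and the bias term enter a performance gradient estimate, the following is a minimal sketch only, not the paper's algorithm or experimental setup: a score-function (REINFORCE-style) estimator whose Q-values come from a linear combination of L basis functions, with an optional scalar bias subtracted. All names, feature choices, and shapes (softmax_policy, phi, estimate_gradient, the parameter dimension D) are illustrative assumptions, not from the paper.

```python
# Illustrative sketch of a policy-gradient estimate built from a linear
# approximation of Q (L basis functions, M actions) with an optional bias term.
# Nothing here reproduces the paper's theory or experiments.
import numpy as np

rng = np.random.default_rng(0)

M = 4   # number of possible actions
L = 3   # number of basis functions in the linear approximation of Q
D = 5   # policy parameter dimension (illustrative)

def softmax_policy(theta, s):
    """Action probabilities pi(a|s; theta) for a simple softmax policy."""
    logits = theta @ s            # theta: (M, D), s: (D,)
    logits -= logits.max()        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def phi(s, a):
    """Basis features for the linear approximation Q(s, a) ~ w . phi(s, a)."""
    f = np.zeros(L)
    f[a % L] = s.sum()            # purely illustrative features
    return f

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s; theta) for the softmax policy above."""
    p = softmax_policy(theta, s)
    g = -np.outer(p, s)
    g[a] += s
    return g

def estimate_gradient(theta, w, states, actions, bias=0.0):
    """Average gradient estimate using the approximated Q minus a bias term."""
    g = np.zeros_like(theta)
    for s, a in zip(states, actions):
        q_hat = w @ phi(s, a)     # linear function approximation of Q(s, a)
        g += grad_log_pi(theta, s, a) * (q_hat - bias)
    return g / len(states)

theta = rng.normal(size=(M, D))
w = rng.normal(size=L)
states = [rng.normal(size=D) for _ in range(100)]
actions = [int(rng.integers(M)) for _ in range(100)]

g_no_bias = estimate_gradient(theta, w, states, actions, bias=0.0)
g_with_bias = estimate_gradient(theta, w, states, actions, bias=0.5)
```

In estimators of this general form, subtracting a constant bias term does not change the expected value of the gradient estimate but can change its variance; the abstract's O(1 - 1/M) result concerns how such a term affects the rate of convergence of the estimates.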