Abstract:
The problem of reinforcement learning in a non-Markov
environment is explored using a dynamic Bayesian network, where
conditional independence assumptions between random variables are
compactly represented by network parameters. The parameters are
learned on-line, and approximations are used to perform inference
and to compute the optimal value function. The relative effects of
inference and value function approximations on the quality of the
final policy are investigated by learning to solve a moderately
difficult driving task. The two value function approximations,
linear and quadratic, were found to perform similarly, but the
quadratic model was more sensitive to initialization. Both
fell short of human performance on the task. The
dynamic Bayesian network performed comparably to a model using an
HMM-style representation, while requiring exponentially fewer
parameters.
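
For a concrete picture of the two value-function approximations compared above, the sketch below contrasts a linear and a quadratic model fitted by least squares on sampled state-value pairs. The feature maps, the batch least-squares fit, and the toy data are assumptions made purely for illustration; the paper's own feature set and on-line update rule are not reproduced here.

    # Illustrative sketch only: the abstract compares linear and quadratic
    # value-function approximations but does not specify features or the
    # fitting rule. Everything below is an assumption chosen to show the
    # difference in model capacity, not the paper's actual method.
    import numpy as np

    def linear_features(s):
        # phi(s) = [1, s_1, ..., s_d]
        return np.concatenate(([1.0], s))

    def quadratic_features(s):
        # phi(s) = [1, s_i, s_i * s_j for i <= j] -- full quadratic expansion
        quad = np.outer(s, s)[np.triu_indices(len(s))]
        return np.concatenate(([1.0], s, quad))

    def fit_value_function(states, targets, features):
        # Least-squares fit of V(s) ~ w . phi(s) to sampled value targets
        # (e.g. Monte Carlo returns); an on-line variant would update w
        # incrementally instead of solving in one batch.
        Phi = np.array([features(s) for s in states])
        w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
        return w

    def value(s, w, features):
        return features(s) @ w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        states = rng.normal(size=(200, 3))    # toy 3-dimensional belief states
        targets = (states ** 2).sum(axis=1)   # toy value targets
        for name, phi in [("linear", linear_features),
                          ("quadratic", quadratic_features)]:
            w = fit_value_function(states, targets, phi)
            preds = np.array([value(s, w, phi) for s in states])
            print(name, "RMSE:", np.sqrt(np.mean((preds - targets) ** 2)))

On this toy target the quadratic model fits far better, but it uses many more weights, which is consistent with the abstract's observation that the richer model is more sensitive to initialization.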