Abstract:
We consider the problem of reliably choosing a near-best strategy from a restricted class of strategies in a partially observable Markov decision process (POMDP). We assume we are given the ability to simulate the POMDP, and study what might be called the sample complexity --- that is, the amount of data one must generate in the POMDP in order to choose a good strategy. We prove upper bounds on the sample complexity showing that, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class, and exponentially on the horizon time. This latter dependence can be eased in a variety of ways, including the application of gradient and local search algorithms. Our measure of complexity generalizes the classical supervised learning notion of VC dimension to the settings of reinforcement learning and planning.
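
To make the setting concrete, the sketch below illustrates the naive baseline implicit in the abstract: estimate each strategy's expected cumulative reward from independently simulated trajectories and return the empirical maximizer. All names here (simulate_episode, choose_near_best, the toy two-action example) are hypothetical illustrations, not the algorithm analyzed in the paper; the sample-complexity question is how many simulated trajectories suffice for the chosen strategy to be near-best within the class.

```python
import random


def evaluate_strategy(simulate_episode, strategy, horizon, num_trajectories):
    """Estimate a strategy's expected return over `horizon` steps by
    averaging the returns of independently simulated trajectories."""
    total = 0.0
    for _ in range(num_trajectories):
        total += simulate_episode(strategy, horizon)
    return total / num_trajectories


def choose_near_best(simulate_episode, strategy_class, horizon, num_trajectories):
    """Return the strategy in the (finite, restricted) class with the highest
    estimated value; with enough trajectories per strategy, this choice is
    near-best with high probability by a uniform-convergence argument."""
    return max(
        strategy_class,
        key=lambda pi: evaluate_strategy(simulate_episode, pi, horizon, num_trajectories),
    )


if __name__ == "__main__":
    # Toy illustration: two strategies with noisy rewards per step.
    def simulate_episode(strategy, horizon):
        mean = 0.7 if strategy == "a" else 0.5
        return sum(random.gauss(mean, 1.0) for _ in range(horizon))

    best = choose_near_best(simulate_episode, ["a", "b"], horizon=5, num_trajectories=200)
    print("chosen strategy:", best)
```

This per-strategy evaluation is only a sketch of the problem setup; it does not reflect the paper's more refined analysis, which allows simulated data to be shared across the strategy class rather than generated anew for each strategy.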