Abstract:
Partially Observable Markov Decision Processes (POMDPs)
constitute an important class of reinforcement learning problems
which present unique theoretical and computational difficulties. In
the absence of the Markov property, popular reinforcement learning
algorithms such as Q-learning may no longer be effective, and
memory-based methods which remove partial observability via
state-estimation are notoriously expensive. An alternative approach
is to seek a stochastic memoryless policy which, for each
observation of the environment, prescribes a probability
distribution over the available actions so as to maximize the
average reward per timestep. A reinforcement learning algorithm
which learns a locally optimal stochastic memoryless policy has
been proposed by Jaakkola, Singh, and Jordan, but has not been
empirically verified. We present a variation of this algorithm,
discuss its implementation, and demonstrate its viability on four
test problems.
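
To make the idea of a stochastic memoryless policy concrete, the following is a minimal sketch (illustrative only, not the paper's implementation): each observation indexes its own set of parameters, a softmax turns those parameters into a probability distribution over actions, and actions are sampled with no dependence on past observations. The class name, softmax parameterization, and example sizes are assumptions made here for illustration.

import numpy as np

# Illustrative sketch of a stochastic memoryless policy: one parameter
# vector per observation, turned into action probabilities by a softmax.
class StochasticMemorylessPolicy:
    def __init__(self, n_observations, n_actions, rng=None):
        # Zero-initialized parameters give a uniform distribution per observation.
        self.theta = np.zeros((n_observations, n_actions))
        self.rng = rng or np.random.default_rng()

    def action_probabilities(self, observation):
        # Numerically stable softmax over this observation's parameters.
        logits = self.theta[observation]
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    def sample_action(self, observation):
        # The action depends only on the current observation, not on history.
        probs = self.action_probabilities(observation)
        return self.rng.choice(len(probs), p=probs)

# Example: 3 observations, 2 actions; initially every observation maps to
# a uniform distribution over the two actions.
policy = StochasticMemorylessPolicy(n_observations=3, n_actions=2)
print(policy.action_probabilities(0))   # [0.5 0.5]
print(policy.sample_action(0))          # 0 or 1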