Abstract:
Agents acting in the real world are confronted with the
problem of making good decisions with limited knowledge of the
environment. Partially observable Markov decision processes
(POMDPs) model decision problems in which an agent tries to
maximize its reward in the face of limited sensor feedback. Recent
work has shown empirically that a reinforcement learning (RL)
algorithm called Sarsa(λ) can efficiently find optimal
memoryless policies, which map current observations to actions, for
POMDP problems (Loch and Singh 1998). The Sarsa(λ) algorithm
uses a form of short-term memory called an eligibility trace, which
distributes temporally delayed rewards to the observation-action pairs
that lead up to the reward. This paper explores the effect of
eligibility traces on the ability of the Sarsa(λ) algorithm to
find optimal memoryless policies. A variant of Sarsa(λ), called
k-step truncated Sarsa(λ), is applied to four test problems
taken from the recent work of Littman; Littman, Cassandra, and
Kaelbling; Parr and Russell; and Chrisman. The empirical results
show that eligibility traces can be significantly truncated without
affecting the ability of Sarsa(λ) to find optimal memoryless
policies for POMDPs.
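
To make the eligibility-trace mechanism concrete, below is a minimal sketch of tabular Sarsa(λ) with k-step truncated, replacing traces over observation-action pairs. This is not the paper's implementation: the environment interface (env.reset(), env.step()), the Q-table layout, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, obs, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[obs]))

def run_episode(env, Q, n_actions, alpha=0.1, gamma=0.95, lam=0.9,
                epsilon=0.1, k=5, rng=np.random.default_rng(0)):
    """One episode of Sarsa(lambda) over observations (memoryless policy).

    Q is a (n_observations, n_actions) array, updated in place. Traces older
    than k steps are dropped, giving the k-step truncated variant.
    """
    trace = {}   # (observation, action) -> eligibility value
    age = {}     # (observation, action) -> steps since the trace was set
    obs = env.reset()
    a = epsilon_greedy(Q, obs, n_actions, epsilon, rng)
    done = False
    while not done:
        next_obs, reward, done = env.step(a)   # assumed env interface
        next_a = epsilon_greedy(Q, next_obs, n_actions, epsilon, rng)
        # TD error of the observation-based (memoryless) value estimate.
        target = reward + (0.0 if done else gamma * Q[next_obs, next_a])
        delta = target - Q[obs, a]
        # Replacing trace for the current observation-action pair.
        trace[(obs, a)] = 1.0
        age[(obs, a)] = 0
        # Credit recently visited pairs, then decay and truncate their traces.
        for key in list(trace):
            Q[key] += alpha * delta * trace[key]
            trace[key] *= gamma * lam
            age[key] += 1
            if age[key] >= k:   # drop traces older than k steps (truncation)
                del trace[key], age[key]
        obs, a = next_obs, next_a
    return Q
```

Setting k large enough recovers ordinary (replacing-trace) Sarsa(λ) behavior, while small k limits how far back in time delayed rewards are propagated.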