Abstract:
We show the convergence of two deterministic variants of
Q-learning. The first is the widely used optimistic Q-learning,
which initializes the Q-values to large values and then
follows a greedy policy with respect to the Q-values. We show
that setting the initial value sufficiently large guarantees
convergence to an ε-optimal policy. The second is a new
algorithm, incremental Q-learning, which gradually promotes
the values of actions that are not
taken. We show that incremental Q-learning converges, in the
limit, to the optimal policy. Our incremental Q-learning
algorithm can be viewed as a derandomization of
ε-greedy Q-learning.
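
As a rough illustration of the first variant, the sketch below shows tabular Q-learning driven only by optimistic initialization and a greedy policy. The environment interface (reset/step), the initial value q_init, and the fixed learning rate are placeholder assumptions for illustration; this is not the paper's exact algorithm or its convergence analysis.

```python
import numpy as np

def optimistic_q_learning(env, num_states, num_actions,
                          q_init=100.0, gamma=0.9, alpha=0.1, episodes=500):
    """Tabular Q-learning with optimistic initialization and a greedy policy.

    Every Q-value starts at the (assumed) large constant q_init, so untried
    actions look attractive until their estimates decay toward the true
    values; exploration emerges without any explicit randomness.
    """
    Q = np.full((num_states, num_actions), q_init)
    for _ in range(episodes):
        state = env.reset()          # hypothetical tabular environment API
        done = False
        while not done:
            action = int(np.argmax(Q[state]))   # greedy w.r.t. current Q-values
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```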