Q-Learning: Off-Policy TD Control
Published:

The Q-Learning Update
Q-learning was introduced by Christopher Watkins in his 1989 PhD thesis. The core update rule is:
The crucial difference from TD(0) (which evaluates a fixed policy) is the \(\max_{a'}\) operator. By always bootstrapping from the best possible next action, Q-learning targets the optimal Q-function \(Q^*\) directly, without requiring the agentโs behaviour to match the optimal policy.
Full tabular Q-learning algorithm:
Initialise Q(s, a) = 0 for all s, a
for each episode:
s โ initial state
while s is not terminal:
a โ ฮต-greedy(Q, s) # behaviour policy
s', r โ step(s, a)
Q(s,a) โ Q(s,a) + ฮฑ[r + ฮณ max_{a'} Q(s',a') - Q(s,a)]
s โ s'
Off-Policy Nature of Q-Learning
Q-learning is off-policy: the policy used to select actions (the behaviour policy \(\mu\), typically ฮต-greedy) is different from the policy being learned (the target policy, which is greedy with respect to Q).
This is a major practical advantage: Q-learning can learn from any data โ replayed experience, data from a different policy, or even expert demonstrations โ as long as every state-action pair is visited sufficiently often.
Compare with SARSA (on-policy TD control):
SARSA uses the actual next action \(a_{t+1}\) (sampled from the behaviour policy) rather than the max. As a result, SARSA learns the Q-function of the behaviour policy โ which is safer in settings where exploration is risky (cliffs, dangerous states), because SARSA accounts for the exploration noise in its value estimates.
Convergence Theorem
Theorem (Watkins & Dayan, 1992): In the tabular case, Q-learning converges to \(Q^*\) almost surely, provided:
- All state-action pairs are visited infinitely often.
- Step-sizes satisfy \(\sum_t \alpha_t(s,a) = \infty\) and \(\sum_t \alpha_t^2(s,a) < \infty\).
- Rewards are bounded.
The proof uses the theory of stochastic approximation (Robbins-Monro). The key step shows that the Q-learning update is an instance of a contraction applied in expectation, guaranteeing convergence to the unique fixed point \(Q^*\).
Note that convergence is not guaranteed with function approximation (neural networks). The combination of off-policy learning, bootstrapping, and function approximation is called the deadly triad โ addressed by DQNโs experience replay and target networks.
Grid World Example
Consider a 4ร4 grid world with a goal state (reward +1) and a hole (reward โ1), with \(\gamma = 0.9\). After 500 episodes of Q-learning with ฮต=0.1:
- Q-values at states adjacent to the goal converge to approximately 0.9 (one step away).
- Q-values two steps away converge to โ 0.81 = 0.9ยฒ.
- The greedy policy recovers the shortest path to the goal.
SARSA, run on the same grid with ฮต=0.1, learns a slightly different policy: it avoids states adjacent to the hole (because ฮต-greedy may select the dangerous action), while Q-learning finds the optimal (riskier but shorter) path.
SARSA vs Q-Learning: When to Use Which
| ย | SARSA | Q-Learning |
|---|---|---|
| On/Off policy | On-policy | Off-policy |
| Safety | Safer (accounts for exploration) | Riskier (assumes greedy future) |
| With replay buffer | Problematic | Works naturally |
| Convergence | To \(V^\mu\) | To \(V^*\) |
| Typical use | Safe RL, on-policy settings | DQN, experience replay |
References
- Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge.
- Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3โ4), 279โ292.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 6. MIT Press.
