Q-Learning: Off-Policy TD Control

4 minute read

Published:

TL;DR: Q-learning updates action-value estimates using the max over next-state Q-values as a bootstrap target, making it off-policy โ€” it converges to Q* regardless of the exploration strategy used to collect data. Combined with neural networks, Q-learning becomes DQN, which achieved human-level performance on Atari games.
Q-learning and DQN architecture
Q-network architecture for Atari (Mnih et al., 2015)

The Q-Learning Update

Q-learning was introduced by Christopher Watkins in his 1989 PhD thesis. The core update rule is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \underbrace{\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]}_{\text{Q-learning TD error}}$$

The crucial difference from TD(0) (which evaluates a fixed policy) is the \(\max_{a'}\) operator. By always bootstrapping from the best possible next action, Q-learning targets the optimal Q-function \(Q^*\) directly, without requiring the agentโ€™s behaviour to match the optimal policy.

Full tabular Q-learning algorithm:

Initialise Q(s, a) = 0 for all s, a
for each episode:
    s โ† initial state
    while s is not terminal:
        a โ† ฮต-greedy(Q, s)        # behaviour policy
        s', r โ† step(s, a)
        Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณ max_{a'} Q(s',a') - Q(s,a)]
        s โ† s'

Off-Policy Nature of Q-Learning

Q-learning is off-policy: the policy used to select actions (the behaviour policy \(\mu\), typically ฮต-greedy) is different from the policy being learned (the target policy, which is greedy with respect to Q).

This is a major practical advantage: Q-learning can learn from any data โ€” replayed experience, data from a different policy, or even expert demonstrations โ€” as long as every state-action pair is visited sufficiently often.

Compare with SARSA (on-policy TD control):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$

SARSA uses the actual next action \(a_{t+1}\) (sampled from the behaviour policy) rather than the max. As a result, SARSA learns the Q-function of the behaviour policy โ€” which is safer in settings where exploration is risky (cliffs, dangerous states), because SARSA accounts for the exploration noise in its value estimates.

Key Insight: The off-policy property of Q-learning is why experience replay works: past transitions, even from an old policy, can still teach the agent about Q*. SARSA cannot safely use experience replay because old transitions reflect an outdated behaviour policy.

Convergence Theorem

Theorem (Watkins & Dayan, 1992): In the tabular case, Q-learning converges to \(Q^*\) almost surely, provided:

  1. All state-action pairs are visited infinitely often.
  2. Step-sizes satisfy \(\sum_t \alpha_t(s,a) = \infty\) and \(\sum_t \alpha_t^2(s,a) < \infty\).
  3. Rewards are bounded.

The proof uses the theory of stochastic approximation (Robbins-Monro). The key step shows that the Q-learning update is an instance of a contraction applied in expectation, guaranteeing convergence to the unique fixed point \(Q^*\).

Note that convergence is not guaranteed with function approximation (neural networks). The combination of off-policy learning, bootstrapping, and function approximation is called the deadly triad โ€” addressed by DQNโ€™s experience replay and target networks.

Grid World Example

Consider a 4ร—4 grid world with a goal state (reward +1) and a hole (reward โˆ’1), with \(\gamma = 0.9\). After 500 episodes of Q-learning with ฮต=0.1:

  • Q-values at states adjacent to the goal converge to approximately 0.9 (one step away).
  • Q-values two steps away converge to โ‰ˆ 0.81 = 0.9ยฒ.
  • The greedy policy recovers the shortest path to the goal.

SARSA, run on the same grid with ฮต=0.1, learns a slightly different policy: it avoids states adjacent to the hole (because ฮต-greedy may select the dangerous action), while Q-learning finds the optimal (riskier but shorter) path.

SARSA vs Q-Learning: When to Use Which

ย SARSAQ-Learning
On/Off policyOn-policyOff-policy
SafetySafer (accounts for exploration)Riskier (assumes greedy future)
With replay bufferProblematicWorks naturally
ConvergenceTo \(V^\mu\)To \(V^*\)
Typical useSafe RL, on-policy settingsDQN, experience replay

References

  1. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge.
  2. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3โ€“4), 279โ€“292.
  3. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 6. MIT Press.