Q-Learning: Off-Policy TD Control


Published: August 06, 2025

TL;DR: Q-learning updates action-value estimates using the max over next-state Q-values as a bootstrap target, which makes it off-policy: in the tabular setting it converges to Q* under any exploration strategy that keeps visiting every state-action pair. Combined with neural networks, Q-learning becomes DQN, which reached human-level performance on many Atari games (Mnih et al., 2015).

Figure: Q-network architecture for Atari (Mnih et al., 2015).

The Q-Learning Update

Q-learning was introduced by Christopher Watkins in his 1989 PhD thesis. The core update rule is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \underbrace{\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]}_{\text{Q-learning TD error}}$$

The crucial difference from TD(0) (which evaluates a fixed policy) is the $\max_{a'}$ operator. By always bootstrapping from the best possible next action, Q-learning targets the optimal Q-function $Q^*$ directly, without requiring the agent’s behaviour to match the optimal policy.
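A single application of the update rule can be traced numerically. The sketch below assumes a NumPy array as the Q-table; the function name `q_learning_update` and the 2-state, 2-action values are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the best next action; a terminal next state contributes 0.
    bootstrap = 0.0 if terminal else np.max(Q[s_next])
    td_error = r + gamma * bootstrap - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical Q-table: rows are states, columns are actions.
Q = np.array([[0.0, 0.5],
              [1.0, 2.0]])

# TD error = 1.0 + 0.9 * max(1.0, 2.0) - 0.5 = 2.3, so Q[0, 1] moves from 0.5 to 0.73.
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
```

Note that the update never asks which action the agent will actually take in `s_next`; only the largest entry of that row matters.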

Full tabular Q-learning algorithm:

Initialise Q(s, a) = 0 for all s, a
for each episode:
    s ← initial state
    while s is not terminal:
        a ← ε-greedy(Q, s)        # behaviour policy
        s', r ← step(s, a)
        Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]
        s ← s'
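
The pseudocode above translates fairly directly into Python. The following is a minimal sketch, not a reference implementation: the 5-state chain environment, the function names, and the hyperparameters are assumptions made for illustration. Ties in the greedy choice are broken at random so that the all-zero initial table does not lock the agent into one action.

```python
import numpy as np

N_STATES, GOAL = 5, 4          # hypothetical chain: states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic chain: move left or right, reward +1 on entering the goal."""
    s_next = max(s - 1, 0) if a == LEFT else s + 1
    r = 1.0 if s_next == GOAL else 0.0
    return s_next, r

def epsilon_greedy(Q, s, eps, rng):
    """Behaviour policy: random action with prob. eps, else greedy (random tie-break)."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(best))

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))            # the terminal row is never updated, so it stays 0
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = epsilon_greedy(Q, s, eps, rng)
            s_next, r = step(s, a)
            # Off-policy target: max over next actions, not the action taken next.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q

Q = q_learning()
greedy = Q[:GOAL].argmax(axis=1)   # greedy action in each non-terminal state
```

In this deterministic chain the values of moving right approach 0.729, 0.81, 0.9, and 1.0 for states 0 to 3, that is, powers of γ counted back from the goal-entering action.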

Off-Policy Nature of Q-Learning

Q-learning is off-policy: the policy used to select actions (the behaviour policy $\mu$, typically ε-greedy) is different from the policy being learned (the target policy, which is greedy with respect to Q).

This is a major practical advantage: Q-learning can learn from any data — replayed experience, data from a different policy, or even expert demonstrations — as long as every state-action pair is visited sufficiently often.

Compare with SARSA (on-policy TD control):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$

SARSA uses the actual next action $a_{t+1}$ (sampled from the behaviour policy) rather than the max. As a result, SARSA learns the Q-function of the behaviour policy — which is safer in settings where exploration is risky (cliffs, dangerous states), because SARSA accounts for the exploration noise in its value estimates.
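
The difference between the two targets is easiest to see on a single transition. In this sketch the Q-values are hypothetical: in the next state, action 0 is safe and action 1 steps off a cliff. The helper `mean_sarsa_target` is an illustrative name for the average SARSA target under an ε-greedy behaviour policy.

```python
import numpy as np

def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: bootstrap from the greedy next action."""
    return r + gamma * np.max(Q[s_next])

def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return r + gamma * Q[s_next, a_next]

def mean_sarsa_target(Q, r, s_next, gamma, eps):
    """Average SARSA target when a_next is drawn from an eps-greedy policy."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    return r + gamma * probs @ Q[s_next]

# Hypothetical values: in state 1, action 0 is safe and action 1 is catastrophic.
Q = np.array([[0.0, 0.0],
              [5.0, -100.0]])
r, s_next, gamma = 0.0, 1, 0.9

t_q = q_learning_target(Q, r, s_next, gamma)                  # 4.5
t_sarsa = sarsa_target(Q, r, s_next, a_next=1, gamma=gamma)   # -90.0 (exploratory step sampled)
t_mean = mean_sarsa_target(Q, r, s_next, gamma, eps=0.1)      # about -0.225
```

With these numbers Q-learning rates the preceding action at 4.5 because it assumes greedy behaviour afterwards, while SARSA's targets average to a slightly negative value because a 5% chance of the catastrophic action is priced in.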

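Key Insight: The off-policy property of Q-learning is why experience replay works: the target $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ does not depend on which policy chose the stored actions, so past transitions, even from an old policy, remain valid samples for learning Q*. SARSA's target uses the stored next action $a_{t+1}$, which was chosen by an outdated behaviour policy, so naively replaying old transitions pulls its estimates toward that old policy's values.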
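
A minimal sketch of replay with tabular Q-learning follows. The `ReplayBuffer` class, its method names, and the 3-state chain are assumptions for illustration, not an API from any particular library. The stored tuples contain no information about the policy that produced them, and the update does not need any.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniform sample without replacement, capped at the buffer size."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

def replay_update(Q, batch, alpha, gamma):
    """Q-learning updates from stored transitions."""
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

# Transitions from a hypothetical chain 0 -> 1 -> 2 (terminal), gathered by any policy.
buffer = ReplayBuffer(capacity=100)
buffer.add(0, 0, 0.0, 1, False)
buffer.add(1, 0, 1.0, 2, True)

Q = np.zeros((3, 1))
for _ in range(200):
    replay_update(Q, buffer.sample(2), alpha=0.5, gamma=0.9)
```

Replaying the same two transitions repeatedly drives `Q[1, 0]` toward 1.0 and `Q[0, 0]` toward 0.9, without any new interaction with the environment.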

Convergence Theorem

Theorem (Watkins & Dayan, 1992): In the tabular case, Q-learning converges to $Q^*$ almost surely, provided:

All state-action pairs are visited infinitely often.
Step-sizes satisfy $\sum_t \alpha_t(s,a) = \infty$ and $\sum_t \alpha_t^2(s,a) < \infty$.
Rewards are bounded.
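
As a concrete instance of the step-size condition, take $\alpha_n = 1/n$, where $n$ counts visits to a given pair $(s, a)$ (this particular schedule is an illustration, not something the theorem prescribes):

$$\sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty$$

A constant step size $\alpha_n = \alpha > 0$ satisfies the first condition but not the second, since $\sum_n \alpha^2 = \infty$. In that case the estimates generally keep fluctuating around $Q^*$ in stochastic environments rather than converging almost surely.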

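A standard proof route uses stochastic approximation theory (Robbins-Monro). The Bellman optimality operator is a $\gamma$-contraction in the max norm with unique fixed point $Q^*$, and each Q-learning update is a noisy, sampled application of that operator to a single entry of the table. Under the conditions above the noise averages out and the iterates converge to $Q^*$.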
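
The contraction property behind this argument can be checked numerically. The sketch below builds a small random MDP (the sizes, the seed, and the name `bellman_optimality` are assumptions for illustration) and verifies that applying the Bellman optimality operator shrinks the max-norm distance between two arbitrary Q-tables by at least a factor of γ.

```python
import numpy as np

def bellman_optimality(Q, P, R, gamma):
    """(TQ)(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')."""
    return R + gamma * P @ Q.max(axis=1)

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Hypothetical random MDP: P[s, a] is a probability distribution over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# Two arbitrary Q-tables and their max-norm distance before and after applying T.
Q1 = rng.normal(size=(n_states, n_actions))
Q2 = rng.normal(size=(n_states, n_actions))
before = np.abs(Q1 - Q2).max()
after = np.abs(bellman_optimality(Q1, P, R, gamma)
               - bellman_optimality(Q2, P, R, gamma)).max()

# Repeated application (value iteration on Q) approaches the fixed point Q*.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    Q = bellman_optimality(Q, P, R, gamma)
residual = np.abs(bellman_optimality(Q, P, R, gamma) - Q).max()
```

Here `after <= gamma * before` holds for any pair of tables, and `residual` is near machine precision after 500 sweeps. Q-learning replaces the exact expectation over next states with single samples, which is where the step-size conditions come in.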

Note that this guarantee does not extend to function approximation (neural networks). The combination of off-policy learning, bootstrapping, and function approximation is known as the deadly triad and can cause divergence. DQN's experience replay and target networks mitigate the instability in practice, but they do not restore a convergence guarantee.
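
The target-network idea can be sketched with a linear Q-function standing in for a neural network. Everything here is an illustrative assumption rather than the DQN implementation: the feature size, the learning rate, the sync period of 50 steps, and the random minibatch exist only to exercise the code. The point is that bootstrap targets are computed from a frozen copy of the parameters that is refreshed only occasionally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, lr = 4, 2, 0.9, 0.01

# Linear stand-in for a Q-network: Q(s, a) = W[a] @ phi(s).
W = rng.normal(scale=0.1, size=(n_actions, n_features))   # online parameters
W_target = W.copy()                                        # frozen copy used for targets

def td_targets(W_target, r, phi_next, done, gamma):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states."""
    q_next = phi_next @ W_target.T                 # shape (batch, n_actions)
    return r + gamma * (1.0 - done) * q_next.max(axis=1)

def sgd_step(W, W_target, batch, gamma, lr):
    """One semi-gradient step on the squared TD error; updates W in place."""
    phi, a, r, phi_next, done = batch
    y = td_targets(W_target, r, phi_next, done, gamma)   # treated as a constant
    q_sa = np.einsum("bf,bf->b", phi, W[a])              # Q(s, a) for the taken actions
    err = y - q_sa
    np.add.at(W, a, lr * err[:, None] * phi)             # accumulate updates per action row
    return float(np.mean(err ** 2))

# Hypothetical minibatch of random transitions: (phi, a, r, phi_next, done).
B = 32
batch = (rng.normal(size=(B, n_features)), rng.integers(n_actions, size=B),
         rng.normal(size=B), rng.normal(size=(B, n_features)),
         (rng.random(B) < 0.1).astype(float))

for t in range(1, 201):
    loss = sgd_step(W, W_target, batch, gamma, lr)
    if t % 50 == 0:
        W_target = W.copy()      # periodic hard sync of the target parameters
```

Between syncs the regression targets stay fixed, so each 50-step window is an ordinary least-squares problem instead of a chase after a moving target.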

Grid World Example

Consider a 4×4 grid world with a goal state (reward +1) and a hole (reward −1), with $\gamma = 0.9$. After 500 episodes of Q-learning with ε=0.1:

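The Q-value of the action that enters the goal converges to approximately 1.0, because the +1 reward is received on that transition and the terminal state has value 0.
The best Q-value one step further from the goal converges to ≈ 0.9, and two steps further to ≈ 0.81 = 0.9².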
The greedy policy recovers the shortest path to the goal.
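
An experiment of this kind can be reproduced with a short script. The sketch below makes several assumptions the text does not fix: the goal sits at cell (3, 3) and the hole at (1, 1), moves are deterministic, bumping into a wall leaves the agent in place, the start is (0, 0), and the step size is 0.5. Rewards are paid on the transition into a terminal cell, matching the pseudocode earlier in the post.

```python
import numpy as np

SIZE = 4
GOAL, HOLE = (3, 3), (1, 1)                   # hypothetical layout
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(state, action):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    row = min(max(state[0] + MOVES[action][0], 0), SIZE - 1)
    col = min(max(state[1] + MOVES[action][1], 0), SIZE - 1)
    nxt = (row, col)
    reward = 1.0 if nxt == GOAL else -1.0 if nxt == HOLE else 0.0
    return nxt, reward, nxt in (GOAL, HOLE)

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((SIZE, SIZE, len(MOVES)))
    for _ in range(episodes):
        s, done = (0, 0), False
        while not done:
            if rng.random() < eps:
                a = int(rng.integers(len(MOVES)))
            else:  # greedy with random tie-breaking
                a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
            s_next, r, done = step(s, a)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

def greedy_path(Q, start=(0, 0), max_steps=SIZE * SIZE):
    """Follow the greedy policy from start until a terminal cell or the step cap."""
    path, s = [start], start
    for _ in range(max_steps):
        s, _reward, done = step(s, int(Q[s].argmax()))
        path.append(s)
        if done:
            break
    return path

Q = train()
path = greedy_path(Q)
```

Printing `Q.max(axis=2)` shows the learned state values, and `path` lists the cells visited by the greedy policy. Along a fully learned shortest path the values are successive powers of γ = 0.9, counted back from the goal-entering action.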

SARSA, run on the same grid with ε=0.1, learns a slightly different policy: it avoids states adjacent to the hole (because ε-greedy may select the dangerous action), while Q-learning finds the optimal (riskier but shorter) path.

SARSA vs Q-Learning: When to Use Which

	SARSA	Q-Learning
On/Off policy	On-policy	Off-policy
Safety	Safer (accounts for exploration)	Riskier (assumes greedy future)
With replay buffer	Problematic	Works naturally
Convergence	To $Q^\mu$ (behaviour policy)	To $Q^*$
Typical use	Safe RL, on-policy settings	DQN, experience replay

References

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 6. MIT Press.


Alessio Borgi
