Temporal Difference Learning: Bootstrapping from Experience
Published:

The DP–MC–TD Triangle
Three approaches to policy evaluation form a triangle of trade-offs:
| Method | Requires model? | Waits for episode? | Bias | Variance |
|---|---|---|---|---|
| Dynamic Programming | Yes | No | Low | Zero |
| Monte Carlo | No | Yes | Zero | High |
| Temporal Difference | No | No | Low | Low |
Monte Carlo (MC) methods wait for the episode to end, then update using the actual return \(G_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t} r_T\). This is unbiased but has high variance because \(G_t\) is a long sum of random variables.
Temporal Difference methods update after each step using an estimated return — the TD target: \(r_t + \gamma V(s_{t+1})\). This introduces some bias (\(V\) is not yet converged) but dramatically reduces variance.
TD(0): The Fundamental Update
The simplest TD algorithm, TD(0), updates the value function after each transition:
The TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the central object: it measures how much better (or worse) the outcome was compared to the current prediction. When \(\delta_t > 0\), the agent was pleasantly surprised; when \(\delta_t < 0\), the outcome was worse than expected.
The TD(0) algorithm:
Initialise V(s) = 0 for all s
for each episode:
s ← initial state
while s is not terminal:
a ← π(s)
s', r ← step(s, a)
δ ← r + γ V(s') - V(s)
V(s) ← V(s) + α δ
s ← s'
Convergence: for tabular representations and a policy \(\pi\), TD(0) converges to \(V^\pi\) almost surely as long as the step-size \(\alpha\) satisfies the Robbins-Monro conditions: \(\sum_t \alpha_t = \infty\) and \(\sum_t \alpha_t^2 < \infty\).
n-Step Returns
TD(0) uses a 1-step return. MC uses an \(\infty\)-step return. n-step TD interpolates between them:
The update is: \(V(s_t) \leftarrow V(s_t) + \alpha [G_t^{(n)} - V(s_t)]\).
For \(n=1\): TD(0). For \(n=\infty\): MC. The optimal \(n\) is task-dependent, and in practice \(n \in [3, 20]\) often outperforms both extremes.
TD(λ) and Eligibility Traces
TD(λ) combines n-step returns for all \(n\) simultaneously, weighted by \(\lambda^{n-1}\):
\[G_t^\lambda = (1-\lambda) \sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}\]This is implemented online via eligibility traces — a memory variable \(z(s)\) that decays exponentially:
z(s) ← 0 for all s
for each step:
z(s_t) ← z(s_t) + 1 # accumulating trace
δ ← r + γ V(s') - V(s)
for all s: V(s) ← V(s) + α δ z(s)
z(s) ← γ λ z(s) # decay
When \(\lambda = 0\): reduces to TD(0). When \(\lambda = 1\): equivalent to MC (in the offline case). The trace \(z(s)\) records which states were recently visited, so a surprising reward propagates backward through the trajectory.
Why TD is More Efficient than MC
Consider estimating the value at a state visited 1000 times. MC updates \(V(s)\) 1000 times — once per visit, with the episode return. TD(0) also updates 1000 times, but every other state on the trajectory is also updated after each step. This means TD learning propagates information much faster through the state space.
Additionally, TD can learn online during an episode without waiting for termination — critical for long or continuing tasks where waiting for episode end would make learning prohibitively slow.
References
- Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 6. MIT Press.
- Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16(5), 1936–1947.
