Bellman Equations: Recursive Structure of Optimal Value
Published:

Bellman Expectation vs. Bellman Optimality
The previous post introduced the Bellman expectation equations, which characterise the value of a given policy \(\pi\). Now we ask: what is the best possible value at each state?
The optimal state-value function \(V^*(s)\) is:
\[V^*(s) = \max_\pi V^\pi(s) \qquad \forall s \in \mathcal{S}\]There always exists a deterministic optimal policy \(\pi^*\) that simultaneously achieves \(V^*\) in all states (a key theorem of MDPs). The Bellman optimality equation for \(V^*\) is:
For the optimal Q-function:
The optimal policy is then simply greedy with respect to \(Q^*\):
Note the crucial difference: in the expectation equation, the action is averaged over \(\pi\); in the optimality equation, it is maximised. The max operator makes the optimality equations nonlinear, which is why they cannot be solved by simple matrix inversion (unlike policy evaluation).
Contraction Mapping and Convergence
The Bellman optimality operator \(\mathcal{T}^*\) maps a value function \(V\) to a new one:
\[(\mathcal{T}^* V)(s) = \max_a \sum_{s'} P(s' \mid s, a)\!\left[R(s,a,s') + \gamma V(s')\right]\]Theorem (Contraction Mapping): \(\mathcal{T}^*\) is a \(\gamma\)-contraction under the sup-norm:
\[\|\mathcal{T}^* V - \mathcal{T}^* U\|_\infty \leq \gamma \|V - U\|_\infty\]By the Banach fixed-point theorem, repeated application of \(\mathcal{T}^*\) from any initial \(V^{(0)}\) converges geometrically to the unique fixed point \(V^*\) at rate \(\gamma^k\) after \(k\) iterations:
\[\|V^{(k)} - V^*\|_\infty \leq \gamma^k \|V^{(0)} - V^*\|_\infty\]Dynamic Programming Algorithms
When the MDP model \((P, R)\) is known, dynamic programming (DP) provides exact solutions.
Policy Evaluation โ compute \(V^\pi\) for a fixed policy by iterating the Bellman expectation operator until convergence:
V โ 0 (arbitrary initialisation)
repeat:
for each s in S:
V(s) โ ฮฃ_a ฯ(a|s) ฮฃ_{s'} P(s'|s,a) [R(s,a,s') + ฮณ V(s')]
until max_s |ฮV(s)| < ฮต
Policy Improvement โ given \(V^\pi\), construct a strictly better policy by acting greedily:
\[\pi'(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\!\left[R(s,a,s') + \gamma V^\pi(s')\right]\]The Policy Improvement Theorem guarantees \(V^{\pi'} \geq V^\pi\) everywhere. Alternating evaluation and improvement until the policy is stable gives policy iteration, which converges to \(\pi^*\) in a finite number of steps (for finite MDPs).
Value Iteration โ merges evaluation and improvement into a single sweep by applying \(\mathcal{T}^*\) directly, avoiding the inner loop of policy evaluation. It converges to \(V^*\) asymptotically, with the optimal policy recovered by taking the greedy action with respect to the converged \(V^*\).
Limitations of DP
DP requires:
- Full knowledge of the model \(P\) and \(R\).
Tabular representation โ infeasible when $$ \mathcal{S} \(is astronomical (e.g., Go has ~\)10^{170}$$ states).
RL algorithms (Q-learning, DQN, TD) can be understood as sample-based, model-free approximations to DP: they estimate the Bellman operator from experience rather than computing it exactly from the model.
References
- Bellman, R. (1957). Dynamic Programming. Princeton University Press.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 4. MIT Press.
- Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Athena Scientific.
