Markov Decision Processes: The Mathematics of Sequential Decisions
Published:

The MDP Tuple
A Markov Decision Process is defined by five components:
- S: state space (finite or continuous)
- A: action space (finite or continuous)
- P: transition function, \(P(s' \mid s, a)\) โ probability of next state \(s'\) given current state \(s\) and action \(a\)
- R: reward function, \(R(s, a, s')\) โ scalar reward for the transition \((s, a, s')\)
- ฮณ โ [0, 1): discount factor
The Markov property is the fundamental assumption: the future is conditionally independent of the past given the present state.
\[P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t)\]This property makes dynamic programming tractable. Without it, optimal decision-making would require reasoning over the entire trajectory history, which is computationally infeasible in general.
Return and Discount
The agentโs goal is to maximise the discounted return \(G_t\), defined as the sum of future rewards weighted by powers of \(\gamma\):
Note the recursive structure: \(G_t = r_t + \gamma G_{t+1}\). This telescoping identity is what enables Bellmanโs equations.
The discount factor \(\gamma\) serves two purposes: (1) mathematical convenience โ it ensures \(G_t\) is finite for bounded rewards; (2) economic interpretation โ future rewards are worth less than immediate ones, encoding a preference for earlier payoffs.
Value Functions
Value functions quantify how good it is to be in a given state or to take a given action.
The state-value function under policy \(\pi\) is:
The action-value function (Q-function) under policy \(\pi\) is:
The two are related by: \(V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)\).
Bellman Expectation Equations
By expanding the definition of \(V^\pi\) using the recursive structure of \(G_t\):
\[V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]\]This is the Bellman expectation equation for \(V^\pi\). It states that the value of a state equals the expected immediate reward plus the discounted value of the next state, averaged over actions (per the policy) and over stochastic transitions.
Similarly for \(Q^\pi\):
\[Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]\]These equations can be written compactly using the Bellman operator \(\mathcal{T}^\pi\), which is a contraction under the sup-norm โ guaranteeing that iterating it converges to the unique fixed point \(V^\pi\).
Finite vs. Infinite Horizon
In episodic (finite-horizon) tasks, episodes terminate after \(T\) steps. The return is \(G_t = \sum_{k=t}^T r_k\) (no discounting necessary, though still valid). Examples: Atari games, chess.
In continuing (infinite-horizon) tasks, there is no natural episode boundary. Discounting (\(\gamma < 1\)) is required to ensure finite returns. Examples: continuous control, process optimisation.
Many algorithms handle both with a single formulation by setting \(\gamma = 1\) and using episode termination as a special absorbing state that emits zero reward indefinitely.
References
- Bellman, R. (1957). Dynamic Programming. Princeton University Press.
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 3. MIT Press.
