Markov Decision Processes: The Mathematics of Sequential Decisions

4 minute read

Published:

TL;DR: A Markov Decision Process (MDP) is a tuple (S, A, P, R, ฮณ) that formalises sequential decision-making under uncertainty. Value functions V^ฯ€ and Q^ฯ€ capture the expected return from each state or state-action pair, and the Bellman equations relate them recursively โ€” forming the backbone of almost every RL algorithm.
MDP and agent-environment loop
Agent-environment interaction in the MDP framework (Mnih et al., 2016)

The MDP Tuple

A Markov Decision Process is defined by five components:

  • S: state space (finite or continuous)
  • A: action space (finite or continuous)
  • P: transition function, \(P(s' \mid s, a)\) โ€” probability of next state \(s'\) given current state \(s\) and action \(a\)
  • R: reward function, \(R(s, a, s')\) โ€” scalar reward for the transition \((s, a, s')\)
  • ฮณ โˆˆ [0, 1): discount factor

The Markov property is the fundamental assumption: the future is conditionally independent of the past given the present state.

\[P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t)\]

This property makes dynamic programming tractable. Without it, optimal decision-making would require reasoning over the entire trajectory history, which is computationally infeasible in general.

Return and Discount

The agentโ€™s goal is to maximise the discounted return \(G_t\), defined as the sum of future rewards weighted by powers of \(\gamma\):

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

Note the recursive structure: \(G_t = r_t + \gamma G_{t+1}\). This telescoping identity is what enables Bellmanโ€™s equations.

The discount factor \(\gamma\) serves two purposes: (1) mathematical convenience โ€” it ensures \(G_t\) is finite for bounded rewards; (2) economic interpretation โ€” future rewards are worth less than immediate ones, encoding a preference for earlier payoffs.

Value Functions

Value functions quantify how good it is to be in a given state or to take a given action.

The state-value function under policy \(\pi\) is:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^\infty \gamma^k r_{t+k} \;\middle|\; S_t = s\right]$$

The action-value function (Q-function) under policy \(\pi\) is:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s,\, A_t = a\right]$$

The two are related by: \(V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)\).

Key Insight: The Q-function is strictly more informative than the V-function โ€” given Q^ฯ€, we can always recover V^ฯ€ by marginalising over the policy. Conversely, given only V^ฯ€, we need knowledge of the transition model P to compute Q^ฯ€. This is why model-free, value-based methods (DQN, Q-learning) learn Q directly.

Bellman Expectation Equations

By expanding the definition of \(V^\pi\) using the recursive structure of \(G_t\):

\[V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]\]

This is the Bellman expectation equation for \(V^\pi\). It states that the value of a state equals the expected immediate reward plus the discounted value of the next state, averaged over actions (per the policy) and over stochastic transitions.

Similarly for \(Q^\pi\):

\[Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]\]

These equations can be written compactly using the Bellman operator \(\mathcal{T}^\pi\), which is a contraction under the sup-norm โ€” guaranteeing that iterating it converges to the unique fixed point \(V^\pi\).

Finite vs. Infinite Horizon

In episodic (finite-horizon) tasks, episodes terminate after \(T\) steps. The return is \(G_t = \sum_{k=t}^T r_k\) (no discounting necessary, though still valid). Examples: Atari games, chess.

In continuing (infinite-horizon) tasks, there is no natural episode boundary. Discounting (\(\gamma < 1\)) is required to ensure finite returns. Examples: continuous control, process optimisation.

Many algorithms handle both with a single formulation by setting \(\gamma = 1\) and using episode termination as a special absorbing state that emits zero reward indefinitely.

References

  1. Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  2. Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
  3. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 3. MIT Press.