Markov Decision Processes: The Mathematics of Sequential Decisions

4 minute read

Published: August 02, 2025

TL;DR: A Markov Decision Process (MDP) is a tuple (S, A, P, R, γ) that formalises sequential decision-making under uncertainty. Value functions V^π and Q^π capture the expected return from each state or state-action pair, and the Bellman equations relate them recursively — forming the backbone of almost every RL algorithm.

MDP and agent-environment loop — Agent-environment interaction in the MDP framework (Mnih et al., 2016)

The MDP Tuple

A Markov Decision Process is defined by five components:

S: state space (finite or continuous)
A: action space (finite or continuous)
P: transition function, $P(s' \mid s, a)$ — probability of next state $s'$ given current state $s$ and action $a$
R: reward function, $R(s, a, s')$ — scalar reward for the transition $(s, a, s')$
γ ∈ [0, 1): discount factor

The Markov property is the fundamental assumption: the future is conditionally independent of the past given the present state.

\[P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t)\]

This property makes dynamic programming tractable. Without it, optimal decision-making would require reasoning over the entire trajectory history, which is computationally infeasible in general.

Return and Discount

The agent’s goal is to maximise the discounted return $G_t$, defined as the sum of future rewards weighted by powers of $\gamma$:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

Note the recursive structure: $G_t = r_t + \gamma G_{t+1}$. This telescoping identity is what enables Bellman’s equations.

The discount factor $\gamma$ serves two purposes: (1) mathematical convenience — it ensures $G_t$ is finite for bounded rewards; (2) economic interpretation — future rewards are worth less than immediate ones, encoding a preference for earlier payoffs.

Value Functions

Value functions quantify how good it is to be in a given state or to take a given action.

The state-value function under policy $\pi$ is:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^\infty \gamma^k r_{t+k} \;\middle|\; S_t = s\right]$$

The action-value function (Q-function) under policy $\pi$ is:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s,\, A_t = a\right]$$

The two are related by: $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$.

Key Insight: The Q-function is strictly more informative than the V-function — given Q^π, we can always recover V^π by marginalising over the policy. Conversely, given only V^π, we need knowledge of the transition model P to compute Q^π. This is why model-free, value-based methods (DQN, Q-learning) learn Q directly.

Bellman Expectation Equations

By expanding the definition of $V^\pi$ using the recursive structure of $G_t$:

\[V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]\]

This is the Bellman expectation equation for $V^\pi$. It states that the value of a state equals the expected immediate reward plus the discounted value of the next state, averaged over actions (per the policy) and over stochastic transitions.

Similarly for $Q^\pi$:

\[Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]\]

These equations can be written compactly using the Bellman operator $\mathcal{T}^\pi$, which is a contraction under the sup-norm — guaranteeing that iterating it converges to the unique fixed point $V^\pi$.

Finite vs. Infinite Horizon

In episodic (finite-horizon) tasks, episodes terminate after $T$ steps. The return is $G_t = \sum_{k=t}^T r_k$ (no discounting necessary, though still valid). Examples: Atari games, chess.

In continuing (infinite-horizon) tasks, there is no natural episode boundary. Discounting ($\gamma < 1$) is required to ensure finite returns. Examples: continuous control, process optimisation.

Many algorithms handle both with a single formulation by setting $\gamma = 1$ and using episode termination as a special absorbing state that emits zero reward indefinitely.

References

Bellman, R. (1957). Dynamic Programming. Princeton University Press.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.), Chapter 3. MIT Press.

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Alessio Borgi

Markov Decision Processes: The Mathematics of Sequential Decisions

The MDP Tuple

Return and Discount

Value Functions

Bellman Expectation Equations

Finite vs. Infinite Horizon

References

Share on

You May Also Enjoy

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

PolyNSD: Polynomial Neural Sheaf Diffusion

TDA in Materials Science: Topology of Structure and Phase

TDA in Drug Discovery: Molecular Topology