Reinforcement Learning: A Complete Guide

What is Reinforcement Learning?
Reinforcement Learning (RL) is the computational study of decision-making. An agent interacts with an environment over discrete time steps: at each step, the agent observes a state \(s_t\), selects an action \(a_t\) according to its policy \(\pi\), and receives a scalar reward \(r_t\) along with the next state \(s_{t+1}\).
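The goal is to find a policy \(\pi\) that maximises the expected cumulative discounted return:
\[ J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right], \qquad \gamma \in [0, 1) \]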
The discount factor \(\gamma\) balances immediate vs. future rewards. When \(\gamma \to 0\) the agent is myopic; when \(\gamma \to 1\) it cares about the distant future.
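To make the effect of \(\gamma\) concrete, here is a minimal sketch that computes the discounted return of one finite reward sequence, assuming the usual convention that the reward at step \(t\) is weighted by \(\gamma^t\). The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: small rewards early, a large reward at the end.
rewards = [1.0, 1.0, 1.0, 10.0]

myopic = discounted_return(rewards, gamma=0.0)       # only the first reward counts
far_sighted = discounted_return(rewards, gamma=0.9)  # the late reward still matters

print(myopic, far_sighted)
```

With \(\gamma = 0\) only the first reward contributes, while with \(\gamma = 0.9\) the final reward of 10 is still weighted by \(0.9^3 = 0.729\).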
What distinguishes RL from other machine learning paradigms:
- No supervision: no teacher provides the correct action.
- Delayed credit: a reward at step \(t\) may result from actions taken many steps earlier.
- Non-stationarity: the agent's own learning changes the data distribution it encounters.
The RL Pipeline
Every RL system follows the same fundamental loop:
- Agent observes state \(s_t\) from the environment.
- Agent selects action \(a_t \sim \pi(\cdot \mid s_t)\).
- Environment transitions to \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\) and emits reward \(r_t = R(s_t, a_t, s_{t+1})\).
- Agent updates its policy using the observed transition \((s_t, a_t, r_t, s_{t+1})\).
This loop is mathematically formalised as a Markov Decision Process (MDP), covered in the next post. The Markov property, that \(s_{t+1}\) depends only on \((s_t, a_t)\) and not on the full history, is the key simplifying assumption.
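The interaction loop can be sketched in a few lines of Python. Everything here is a hypothetical illustration, not a standard API: `CorridorEnv` is a toy five-state environment invented for this example, `random_policy` stands in for \(\pi\), and the `reset`/`step` method names are an assumed interface.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: states 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clip the move.
        next_state = min(max(self.state + action, 0), self.length - 1)
        done = next_state == self.length - 1
        reward = 1.0 if done else 0.0
        self.state = next_state
        return next_state, reward, done


def random_policy(state, rng):
    """Stand-in for pi(. | s): ignores the state and picks uniformly."""
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    transitions = []
    state = env.reset()                                # observe s_t
    for _ in range(max_steps):
        action = policy(state, rng)                    # select a_t
        next_state, reward, done = env.step(action)    # receive r_t and s_{t+1}
        transitions.append((state, action, reward, next_state))
        # A learning agent would update its policy from this transition here.
        state = next_state
        if done:
            break
    return transitions


rng = random.Random(0)
episode = run_episode(CorridorEnv(), random_policy, rng)
print(len(episode), episode[-1])
```

Note that the next state in `CorridorEnv.step` is computed from the current state and action alone, which is exactly the Markov property described above.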
Algorithm Landscape
Modern RL algorithms can be organised along two axes:
Model-free vs. Model-based:
- Model-free: the agent learns a policy or value function directly from experience, without building an explicit model of environment dynamics. Examples: Q-learning, DQN, PPO, SAC.
- Model-based: the agent learns a model \(\hat{P}(s' \mid s, a)\) and uses it for planning. Examples: Dyna, World Models, MuZero.
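As a rough illustration of what learning a model \(\hat{P}(s' \mid s, a)\) can mean in the simplest tabular case, the sketch below estimates transition probabilities from empirical counts. The function name `fit_tabular_model` and the sample data are assumptions made for this example. Methods such as World Models and MuZero learn neural models instead.

```python
from collections import Counter, defaultdict


def fit_tabular_model(transitions):
    """Estimate P_hat(s' | s, a) as normalised counts of observed next states."""
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1

    model = {}
    for state_action, next_counts in counts.items():
        total = sum(next_counts.values())
        model[state_action] = {
            s_next: n / total for s_next, n in next_counts.items()
        }
    return model


# Made-up experience: moving "right" from state 0 succeeds two times out of three.
data = [(0, "right", 1), (0, "right", 1), (0, "right", 0), (1, "right", 2)]
model = fit_tabular_model(data)
print(model[(0, "right")])
```

A planner could then sample imagined transitions from `model` instead of querying the real environment, which is the basic idea behind Dyna-style methods.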
Value-based vs. Policy-based:
- Value-based: learn \(Q^*(s,a)\) and derive a greedy policy. Examples: Q-learning, DQN, Rainbow.
- Policy-based: directly parameterise and optimise \(\pi_\theta\). Examples: REINFORCE, A3C, PPO.
- Actor-critic: combine the two by maintaining both a policy (actor) and a value function (critic). Examples: A3C, SAC, PPO with value baseline.
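For the value-based family, a minimal sketch of one tabular Q-learning step and the greedy policy derived from the table might look like the following. The step size `alpha`, the default `gamma`, the action names, and the helper names are illustrative assumptions. DQN and Rainbow replace the table with a neural network.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9, done=False):
    """Apply one Q-learning step: move Q(s, a) towards r + gamma * max_b Q(s', b)."""
    best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
    td_target = r + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])


def greedy_action(q, s, actions):
    """Greedy policy derived from the Q-table (ties go to the first action listed)."""
    return max(actions, key=lambda a: q[(s, a)])


actions = ["left", "right"]
q = defaultdict(float)  # all Q-values start at 0.0

# One observed transition (s=0, a="right", r=1.0, s'=1).
q_learning_update(q, s=0, a="right", r=1.0, s_next=1, actions=actions)

print(q[(0, "right")], greedy_action(q, 0, actions))
```

After this single update, `q[(0, "right")]` moves halfway from 0 towards the target of 1.0, so the greedy policy in state 0 now prefers `"right"`.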
The rough historical progression: tabular Q-learning (1989) → DQN with deep neural networks (2013) → policy gradient methods with trust regions (2015–2017, TRPO/PPO) → off-policy maximum-entropy methods (2018, SAC) → model-based planning (2019, MuZero).
RL vs. Supervised Learning
| Aspect | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Labels | Provided by oracle | Generated through interaction |
| Data distribution | Fixed | Non-stationary (policy-dependent) |
| Feedback | Immediate, per-sample | Delayed, sparse reward |
| Goal | Minimise prediction error | Maximise cumulative return |
This distinction matters when applying RL to language model alignment via reinforcement learning from human feedback (RLHF): the reward signal comes from a trained reward model or human preferences, not ground-truth labels.
Book Structure
This book is organised into six parts:
- Foundations: MDPs, Bellman equations, exploration, temporal-difference learning.
- Value-Based Methods: Q-learning, DQN, Rainbow.
- Policy Gradient Methods: REINFORCE, A3C, PPO, SAC, TRPO.
- Model-Based RL: Dyna, World Models, MuZero.
- Multi-Agent RL: cooperative and competitive settings, QMIX, MADDPG.
- Applications: games, RLHF for LLMs, robotics.
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. [Online: incompleteideas.net]
- Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is enough. Artificial Intelligence, 299, 103535.
- Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.
