Reinforcement Learning: A Complete Guide

TL;DR: Reinforcement Learning trains agents to maximise cumulative reward through trial-and-error interaction with an environment. Unlike supervised learning, there are no labelled examples, only a scalar reward signal that may be sparse and delayed. This book covers foundational theory through state-of-the-art algorithms.

Figure: The RL training loop and RLHF pipeline (Ouyang et al., 2022)

What is Reinforcement Learning?

Reinforcement Learning (RL) is the computational study of decision-making. An agent interacts with an environment over discrete time steps: at each step, the agent observes a state \(s_t\), selects an action \(a_t\) according to its policy \(\pi\), and receives a scalar reward \(r_t\) along with the next state \(s_{t+1}\).

The goal is to find a policy \(\pi\) that maximises the expected cumulative discounted return:

$$J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \quad \gamma \in [0, 1)$$

The discount factor \(\gamma\) balances immediate vs. future rewards. As \(\gamma \to 0\) the agent becomes myopic and values only the immediate reward; as \(\gamma \to 1\) it weights distant rewards almost as heavily as immediate ones.
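
To make the effect of \(\gamma\) concrete, here is a minimal sketch that evaluates the discounted sum for a finite reward sequence. The finite horizon, the helper name `discounted_return`, and the example rewards are assumptions made for illustration; the definition above sums over an infinite horizon.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative sequence: a single reward that arrives three steps late.
rewards = [0.0, 0.0, 0.0, 1.0]
for gamma in (0.0, 0.5, 0.99):
    print(gamma, discounted_return(rewards, gamma))
```

With \(\gamma = 0\) the delayed reward contributes nothing to the return, with \(\gamma = 0.5\) it is worth \(0.5^3 = 0.125\), and with \(\gamma = 0.99\) it keeps almost its full value.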

What distinguishes RL from other machine learning paradigms:

  • No supervision: no teacher provides the correct action.
  • Delayed credit: a reward at step \(t\) may result from actions taken many steps earlier.
  • Non-stationarity: the agent's own learning changes the data distribution it encounters.

The RL Pipeline

Every RL system follows the same fundamental loop:

  1. Agent observes state \(s_t\) from the environment.
  2. Agent selects action \(a_t \sim \pi(\cdot \mid s_t)\).
  3. Environment transitions to \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\) and emits reward \(r_t = R(s_t, a_t, s_{t+1})\).
  4. Agent updates its policy using the observed transition \((s_t, a_t, r_t, s_{t+1})\).

This loop is mathematically formalised as a Markov Decision Process (MDP), covered in the next post. The Markov property, that \(s_{t+1}\) depends only on \((s_t, a_t)\) and not on the full history, is the key simplifying assumption.
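
The four steps of the loop can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `CorridorEnv` is a made-up toy environment, `random_policy` replaces a learned \(\pi\), and step 4 only records the transition rather than performing a real policy update.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: states 0..4, start at 0, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clip the move.
        move = 1 if action == 1 else -1
        next_state = min(max(self.state + move, 0), self.length - 1)
        done = next_state == self.length - 1
        reward = 1.0 if done else 0.0  # r_t = R(s_t, a_t, s_{t+1})
        self.state = next_state
        return next_state, reward, done


def random_policy(state, rng):
    """Placeholder for pi(. | s_t): picks left or right uniformly."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=200):
    transitions = []
    state = env.reset()                                   # 1. observe s_t
    for _ in range(max_steps):
        action = policy(state, rng)                       # 2. sample a_t
        next_state, reward, done = env.step(action)       # 3. get s_{t+1}, r_t
        transitions.append((state, action, reward, next_state))  # 4. data for the update
        state = next_state
        if done:
            break
    return transitions


rng = random.Random(0)
episode = run_episode(CorridorEnv(), random_policy, rng)
print(len(episode), episode[-1])
```

Because the environment's `step` uses only the current state and action, this toy example satisfies the Markov property by construction.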

Key Insight: RL is fundamentally different from supervised learning because the agent must explore the environment to generate its own training data. This creates the exploration-exploitation dilemma: should the agent choose actions it already knows to be good (exploit), or try new ones to gather information (explore)?
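
One common and simple way to trade off the two is epsilon-greedy action selection, which this post does not otherwise cover. The sketch below assumes the agent keeps a list of estimated action values; the function name, the value list, and the choice of \(\epsilon = 0.1\) are all illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-looking one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q_values = [0.2, 0.8, 0.5]  # illustrative value estimates for three actions
picks = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
print({a: picks.count(a) for a in range(len(q_values))})
```

Most picks go to action 1, the current best estimate, while a small fraction is spread over all actions so that the other estimates can still be corrected.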

Algorithm Landscape

Modern RL algorithms can be organised along two axes:

Model-free vs. Model-based:

  • Model-free: the agent learns a policy or value function directly from experience, without building an explicit model of environment dynamics. Examples: Q-learning, DQN, PPO, SAC.
  • Model-based: the agent learns a model \(\hat{P}(s' \mid s, a)\) and uses it for planning. Examples: Dyna, World Models, MuZero.
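
As a rough illustration of what learning \(\hat{P}(s' \mid s, a)\) can mean in the simplest tabular case, the sketch below estimates transition probabilities by counting observed transitions. This maximum-likelihood count estimate is an assumption chosen for brevity, not how World Models or MuZero learn their models; the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_model(transitions):
    """Estimate P_hat(s' | s, a) as empirical frequencies of observed next states."""
    counts = defaultdict(Counter)
    for s, a, _r, s_next in transitions:
        counts[(s, a)][s_next] += 1
    model = {}
    for state_action, next_counts in counts.items():
        total = sum(next_counts.values())
        model[state_action] = {
            s_next: n / total for s_next, n in next_counts.items()
        }
    return model


# Illustrative (s, a, r, s') tuples, as collected from the interaction loop.
data = [(0, 1, 0.0, 1), (0, 1, 0.0, 1), (0, 1, 0.0, 0), (1, 1, 1.0, 2)]
model = estimate_model(data)
print(model[(0, 1)])
```

A planner can then query `model[(s, a)]` to simulate outcomes without touching the real environment, which is the core idea behind model-based methods.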

Value-based vs. Policy-based:

  • Value-based: learn \(Q^*(s,a)\) and derive a greedy policy. Examples: Q-learning, DQN, Rainbow.
  • Policy-based: directly parameterise and optimise \(\pi_\theta\). Examples: REINFORCE, A3C, PPO.
  • Actor-critic: combine the two by maintaining both a policy (actor) and a value function (critic). Examples: A3C, SAC, PPO with value baseline.
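
To make the value-based entry concrete, here is a minimal sketch of the standard tabular Q-learning update followed by greedy action selection. The dictionary-based table, the function names, and the values of `alpha` and `gamma` are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Move Q(s, a) towards the target r + gamma * max_b Q(s_next, b)."""
    if done:
        target = r  # no future value after a terminal transition
    else:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Derive the greedy policy from the current value estimates."""
    return max(actions, key=lambda a: Q[(s, a)])


Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
actions = [0, 1]

# A terminal transition with reward 1, then a non-terminal one leading into it.
q_learning_update(Q, s=3, a=1, r=1.0, s_next=4, actions=actions, done=True)
q_learning_update(Q, s=2, a=1, r=0.0, s_next=3, actions=actions)
print(Q[(3, 1)], Q[(2, 1)], greedy_action(Q, 3, actions))
```

The second update shows how value propagates backwards: state 2 receives no reward itself, yet its estimate rises because its successor state has become valuable. Policy-based methods skip this table and adjust the parameters of \(\pi_\theta\) directly.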

The rough historical progression: tabular Q-learning (1989) → DQN with deep neural networks (2013) → policy gradient methods with trust regions (2015–2017, TRPO/PPO) → off-policy maximum-entropy methods (2018, SAC) → model-based planning (2019, MuZero).

RL vs. Supervised Learning

| Aspect | Supervised Learning | Reinforcement Learning |
| --- | --- | --- |
| Labels | Provided by oracle | Generated through interaction |
| Data distribution | Fixed | Non-stationary (policy-dependent) |
| Feedback | Immediate, per-sample | Delayed, sparse reward |
| Goal | Minimise prediction error | Maximise cumulative return |

This distinction matters when applying RL to language model alignment (RLHF): the reward signal comes from a trained reward model or human preferences, not ground-truth labels.

Book Structure

This book is organised into six parts:

  1. Foundations: MDPs, Bellman equations, exploration, temporal-difference learning.
  2. Value-Based Methods: Q-learning, DQN, Rainbow.
  3. Policy Gradient Methods: REINFORCE, A3C, PPO, SAC, TRPO.
  4. Model-Based RL: Dyna, World Models, MuZero.
  5. Multi-Agent RL: cooperative and competitive settings, QMIX, MADDPG.
  6. Applications: games, RLHF for LLMs, robotics.
