PPO: Proximal Policy Optimisation

TL;DR: PPO (Proximal Policy Optimisation) is one of the most widely used deep RL algorithms. It keeps each update close to the current policy by clipping the probability ratio between the new and old policies, which prevents destructively large steps. Combined with Generalised Advantage Estimation (GAE), PPO is stable, reasonably sample-efficient for an on-policy method, and easy to implement, which makes it the default choice for many practitioners.

Figure: Actor-critic architecture used by PPO (Mnih et al., 2016)

The Problem: Policy Gradient Instability

Vanilla policy gradient updates can be catastrophic. An overly large gradient step can collapse the policy to always selecting the same action, from which recovery is slow or impossible. TRPO (Schulman et al. 2015) solved this by enforcing a hard KL-divergence constraint between old and new policies, but each update needs a second-order computation (a conjugate-gradient solve followed by a line search), which is expensive and complicates the implementation.

PPO (Schulman et al. 2017) achieves similar stability with a first-order method: instead of enforcing a hard constraint, it clips the objective so that large policy changes simply stop contributing to the gradient.

The Clipped Surrogate Objective

Define the probability ratio:

\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \]

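The unclipped surrogate objective (called \(L^{CPI}\) in the PPO paper, after conservative policy iteration) is \(L^{CPI} = E_t[r_t(\theta) \cdot \hat{A}_t]\), where \(\hat{A}_t\) is an estimate of the advantage at timestep \(t\). At \(\theta = \theta_{old}\) the ratio equals 1 and the gradient of this surrogate matches the standard policy gradient. A ratio far from 1 signals a large policy shift, which nothing in this objective discourages. PPO clips the ratio: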

\[ L^{CLIP}(\theta) = E_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t \right) \right] \]

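With \(\varepsilon = 0.2\), the policy ratio is clipped to \([0.8, 1.2]\). Taking the minimum means that once the ratio has moved outside this range in the direction the advantage favours (above \(1+\varepsilon\) when \(\hat{A}_t > 0\), below \(1-\varepsilon\) when \(\hat{A}_t < 0\)), the clipped term is selected. That term is constant in \(\theta\), so the sample contributes zero gradient and stops pushing the policy any further.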
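
A minimal sketch of this objective in plain Python, assuming per-sample log-probabilities under the new and old policies are already available. The function name `clipped_surrogate` and its arguments are illustrative and not taken from any library; a real implementation would compute the same quantity in an autodiff framework so that it can be differentiated.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximise)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r_t, computed from log-probabilities for numerical stability.
        ratio = math.exp(lp_new - lp_old)
        # clip(r_t, 1 - eps, 1 + eps)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Keep the more pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a ratio of 1.5 and an advantage of +1 the function returns 1.2 (the clipped term); with the same ratio and an advantage of -1 it returns -1.5 (the unclipped term).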

Key Insight: The clipped objective is a pessimistic lower bound on the unclipped surrogate. A sample stops rewarding further movement once the policy has already moved far enough in the favoured direction, but a move in the wrong direction is never clipped, so it can always be corrected. This asymmetry prevents overshooting while still allowing learning.
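
A worked case makes the asymmetry concrete. The numbers are chosen purely for illustration, with \(\varepsilon = 0.2\), and \(\ell\) is shorthand introduced here for the per-sample term \(\min(r\hat{A},\ \text{clip}(r, 0.8, 1.2)\,\hat{A})\):

\[
\begin{aligned}
r = 1.5,\ \hat{A} = +1 &: \quad \ell = \min(1.5,\ 1.2) = 1.2 && \text{(clipped term, zero gradient)} \\
r = 1.5,\ \hat{A} = -1 &: \quad \ell = \min(-1.5,\ -1.2) = -1.5 && \text{(unclipped term, gradient lowers } r\text{)} \\
r = 0.5,\ \hat{A} = +1 &: \quad \ell = \min(0.5,\ 0.8) = 0.5 && \text{(unclipped term, gradient raises } r\text{)} \\
r = 0.5,\ \hat{A} = -1 &: \quad \ell = \min(-0.5,\ -0.8) = -0.8 && \text{(clipped term, zero gradient)}
\end{aligned}
\]

In the first and last rows the policy has already moved far in the direction the advantage favours, so the sample is switched off. In the middle two rows the policy has moved the wrong way, and the full gradient remains available to undo it.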

Generalised Advantage Estimation (GAE)

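PPO pairs well with GAE (Schulman et al. 2016), which computes each advantage as an exponentially weighted average of n-step advantage estimates. This average collapses to a discounted sum of one-step TD residuals \(\delta_t\):

\[ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]

Here \(r_t\) in \(\delta_t\) is the reward at step \(t\), not the probability ratio \(r_t(\theta)\) defined earlier, and \(V\) is the learned value function. The hyperparameter \(\lambda \in [0,1]\) interpolates between high-bias one-step TD (\(\lambda = 0\)) and high-variance Monte Carlo returns (\(\lambda = 1\)). Values around \(\lambda = 0.95\) work well in practice.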
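
In practice the infinite sum is truncated at the end of the collected rollout and evaluated with a backward recursion. The sketch below assumes a single rollout stored as plain Python lists; the function name `compute_gae`, the `dones` episode-termination flags, and the `last_value` bootstrap argument are illustrative assumptions rather than anything specified above.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Backward-recursive GAE over one rollout.

    rewards[t] and values[t] = V(s_t) describe step t; dones[t] is True if the
    episode ended at step t; last_value = V(s_T) bootstraps the final step.
    Returns (advantages, returns), where returns[t] = advantages[t] + values[t].
    """
    advantages = [0.0] * len(rewards)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # Do not bootstrap or accumulate across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD residual delta_t.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

The `returns` list is a common choice of regression target for the value function. Setting `lam=0.0` reduces each advantage to its one-step TD residual, and `lam=1.0` recovers the discounted Monte Carlo return minus the value baseline.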

Full PPO Loss and Implementation

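The final objective, which is maximised, combines the clipped policy objective, a value function regression penalty, and an entropy bonus, with coefficients \(c_1, c_2 > 0\):

\[ L(\theta) = L^{CLIP}(\theta) - c_1 \left( V_\theta(s_t) - V_t^{targ} \right)^2 + c_2 \, H[\pi_\theta(\cdot \mid s_t)] \]

Implementations built on gradient descent minimise \(-L\) instead.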
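
The expression above rewards a high clipped objective and high entropy while penalising value error, so optimisers that minimise are given its negative. A minimal sketch of that combination, assuming the clipped objective and the mean policy entropy have already been computed for the batch; the name `ppo_loss` and the default coefficients `c1=0.5` and `c2=0.01` are illustrative choices, not values given in this post.

```python
def ppo_loss(clip_objective, values, value_targets, entropy, c1=0.5, c2=0.01):
    """Scalar loss to minimise: the negative of L^CLIP - c1 * L^VF + c2 * H."""
    # Mean squared error between value predictions and their targets.
    value_loss = sum((v - t) ** 2 for v, t in zip(values, value_targets)) / len(values)
    objective = clip_objective - c1 * value_loss + c2 * entropy
    return -objective
```

Raising `c2` encourages exploration by favouring higher-entropy policies; raising `c1` gives the value regression more weight, which matters mainly when the policy and value networks share parameters.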

Key implementation details:

  • Multiple epochs: after collecting a batch of experience, PPO performs several gradient steps (typically 3–10 epochs) over the same data, improving sample efficiency relative to A2C.
  • Mini-batches: the collected rollout is split into mini-batches for each gradient step.
  • Advantage normalisation: normalising advantages across the batch (zero mean, unit std) stabilises learning.
  • Value function clipping: optionally limit how far the new value prediction can move from the old value estimate in one update, analogously to clipping the policy ratio.
  • Orthogonal initialisation and learning rate annealing help with stability on continuous control tasks.
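
Three of the details above (advantage normalisation, value clipping, and the epoch/mini-batch schedule) can be sketched with the standard library alone. All function names here are hypothetical, and the value-clipping form shown (taking the larger of the clipped and unclipped squared errors) is one common variant rather than the only one.

```python
import random
import statistics

def normalise_advantages(advantages, eps=1e-8):
    """Rescale advantages to zero mean and (approximately) unit standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]

def clipped_value_loss(value, old_value, target, eps=0.2):
    """Squared error where the prediction may move at most eps from the old one."""
    clipped = old_value + max(-eps, min(value - old_value, eps))
    # Pessimistic choice: keep the larger of the two errors.
    return max((value - target) ** 2, (clipped - target) ** 2)

def minibatch_schedule(batch_size, minibatch_size, n_epochs, rng=None):
    """Yield lists of indices: n_epochs reshuffled passes over one rollout."""
    if rng is None:
        rng = random.Random()
    indices = list(range(batch_size))
    for _ in range(n_epochs):
        rng.shuffle(indices)
        for start in range(0, batch_size, minibatch_size):
            yield indices[start:start + minibatch_size]

# Usage: 4 epochs over a rollout of 8 transitions, in mini-batches of 4.
for batch in minibatch_schedule(8, 4, n_epochs=4, rng=random.Random(0)):
    pass  # compute the PPO loss on these indices and take one gradient step
```

Reshuffling at the start of every epoch means each transition is visited once per epoch but grouped with different neighbours each time.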

PPO in Practice

PPO has become the de facto standard for many RL applications. It trained OpenAI Five (Dota 2), was used for the initial RLHF fine-tuning of InstructGPT and ChatGPT, and is widely used to train robotic control policies in simulation. Its success stems from a rare combination: it is theoretically motivated, empirically robust, and comparatively straightforward to implement.

References

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  • Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR. arXiv:1506.02438.
  • Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML. arXiv:1502.05477.
  • Andrychowicz, M., et al. (2020). What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study. arXiv:2006.05990.