PPO: Proximal Policy Optimisation

The Problem: Policy Gradient Instability
Vanilla policy gradient updates can be catastrophic. A single overly large gradient step can collapse the policy into always selecting the same action, and because the collapsed policy then collects its own training data, recovery is slow or impossible. TRPO (Schulman et al. 2015) addressed this by constraining the KL divergence between old and new policies, but it requires an expensive second-order step at every update: a conjugate gradient solve followed by a line search.
PPO (Schulman et al. 2017) achieves similar stability with a first-order method: instead of enforcing a hard constraint, it clips the objective so that large policy changes simply stop contributing to the gradient.
The Clipped Surrogate Objective
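Define the probability ratio between the new and old policies:
\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \]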
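The unclipped surrogate objective that TRPO maximises is \(L^{CPI} = E_t[r_t(\theta) \cdot \hat{A}_t]\); its gradient at \(\theta = \theta_{\text{old}}\) equals the standard policy gradient. A ratio \(r_t\) far from 1 signals a large policy shift. PPO clips this ratio inside the objective:
\[ L^{CLIP}(\theta) = E_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t\right)\right] \]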
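With \(\varepsilon = 0.2\), the policy ratio is clipped to \([0.8, 1.2]\). Taking the minimum makes the objective a pessimistic bound on the unclipped one. Once the ratio has moved outside the clip range in the direction the advantage favours (above \(1+\varepsilon\) for a positive advantage, below \(1-\varepsilon\) for a negative one), the clipped term is selected, its gradient is zero, and that sample stops pushing the policy further. Changes that make the objective worse are never clipped, so the policy can still be pulled back from them.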
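The clipping behaviour can be checked numerically. The following is a minimal pure-Python sketch, not a reference implementation: the function names `clipped_surrogate` and `is_clipped` are illustrative, and it assumes the policy exposes per-sample log-probabilities so that the ratio can be computed as the exponential of their difference.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximised) over a batch.

    new_logp / old_logp: log pi(a_t | s_t) under the new and old policy.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)                 # r_t = pi_new / pi_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)          # pessimistic of the two
    return total / len(advantages)


def is_clipped(ratio, adv, eps=0.2):
    """True when a sample sits on the flat part of the objective (zero gradient)."""
    return (adv > 0 and ratio > 1.0 + eps) or (adv < 0 and ratio < 1.0 - eps)


# Ratio 1.5 with a positive advantage: the contribution is capped at 1.2 * A.
capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: not clipped, the full penalty applies.
penalised = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])
```

In the first case the sample no longer contributes any gradient; in the second it still does, which is what lets the policy undo a harmful change.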
Generalised Advantage Estimation (GAE)
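PPO pairs well with GAE (Schulman et al. 2016), which computes advantage estimates as an exponentially weighted average of n-step advantage estimates. Writing \(R_t\) for the reward (to avoid confusion with the ratio \(r_t(\theta)\)), \(\gamma\) for the discount factor, and \(V\) for the value function:
\[ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}, \qquad \delta_t = R_t + \gamma V(s_{t+1}) - V(s_t) \]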
The hyperparameter \(\lambda \in [0,1]\) interpolates between high-bias, low-variance one-step TD (\(\lambda = 0\)) and low-bias, high-variance Monte Carlo returns (\(\lambda = 1\)). Values around \(\lambda = 0.95\) work well in practice.
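In practice GAE is usually computed with a single backward pass over the rollout rather than as an explicit sum. The sketch below is a minimal pure-Python version under stated assumptions: `compute_gae` and its argument names are illustrative, `dones[t]` is assumed to mean that the episode ended after step `t`, and `last_value` is assumed to be the critic's estimate for the state following the final step.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return GAE advantages for one rollout of length T.

    rewards, values, dones: length-T lists; values[t] is V(s_t).
    last_value: V(s_T), used to bootstrap past the end of the rollout.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Recursive form of the (gamma * lambda)-weighted sum of TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
dones = [False, False, False]
td_only = compute_gae(rewards, values, 0.5, dones, gamma=0.9, lam=0.0)
monte_carlo = compute_gae(rewards, values, 0.5, dones, gamma=0.9, lam=1.0)
```

With `lam=0.0` every entry is the one-step TD error; with `lam=1.0` each entry is the discounted return (plus the bootstrap value) minus `values[t]`. Value-function regression targets are commonly taken to be `advantages[t] + values[t]`.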
Full PPO Loss and Implementation
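The final objective combines the clipped policy objective, a value function regression term, and an entropy bonus:
\[ L(\theta) = E_t\left[ L^{CLIP}_t(\theta) - c_1 \left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2 + c_2 \, S[\pi_\theta](s_t) \right] \]
Here \(L^{CLIP}_t\) is the per-step clipped surrogate objective, \(V_t^{\text{targ}}\) is the value target, \(S\) is the policy entropy, and \(c_1, c_2\) are weighting coefficients. This objective is maximised; implementations that use a gradient-descent optimiser minimise its negative as the loss.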
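A minimal pure-Python sketch of this combination, written as a loss to minimise, is given below. It is illustrative only: the function name `ppo_loss`, the argument layout, and the default coefficients `vf_coef=0.5` and `ent_coef=0.01` are assumptions (commonly seen values, not prescribed by the text), and a real implementation would compute the same quantities on tensors inside an autodiff framework.

```python
import math


def ppo_loss(new_logp, old_logp, advantages, values, returns, entropies,
             eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Scalar PPO loss (to be minimised) for one mini-batch.

    values:    current value predictions V(s_t)
    returns:   value targets, e.g. GAE advantage + old value
    entropies: per-sample policy entropy at s_t
    """
    n = len(advantages)
    policy_obj = 0.0
    value_loss = 0.0
    for lp_new, lp_old, adv, v, ret in zip(new_logp, old_logp, advantages,
                                           values, returns):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        policy_obj += min(ratio * adv, clipped * adv)   # clipped surrogate
        value_loss += (v - ret) ** 2                    # squared-error regression
    policy_obj /= n
    value_loss /= n
    entropy = sum(entropies) / n
    # Maximise surrogate and entropy, minimise value error.
    return -policy_obj + vf_coef * value_loss - ent_coef * entropy


loss = ppo_loss(new_logp=[0.0], old_logp=[0.0], advantages=[1.0],
                values=[1.0], returns=[2.0], entropies=[0.5])
```

The signs are the part most easily got wrong: the surrogate and the entropy enter with a minus sign because they are to be maximised, while the value error enters with a plus sign.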
Key implementation details:
- Multiple epochs: after collecting a batch of experience, PPO makes several passes (typically 3–10 epochs) over the same data, improving sample efficiency relative to A2C, which uses each batch for a single gradient step.
- Mini-batches: the collected rollout is split into mini-batches for each gradient step.
- Advantage normalisation: normalising advantages across the batch (zero mean, unit std) stabilises learning.
- Value function clipping: optionally clip the value function update analogously to the policy.
- Orthogonal initialisation and learning rate annealing help with stability on continuous control tasks.
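Several of the details above are small enough to sketch directly. The helpers below are a hedged pure-Python illustration, with all function names chosen here for the example. The value clipping shown follows one common variant (keep the new prediction within `eps` of the old one and take the larger of the two squared errors), and the linear schedule is just one possible annealing choice. Orthogonal initialisation is omitted because it depends on the neural network library in use.

```python
import random
import statistics


def normalise(advantages, tiny=1e-8):
    """Shift and scale advantages to zero mean and unit standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + tiny) for a in advantages]


def minibatch_indices(n_samples, n_epochs, batch_size, rng):
    """Yield index lists; each epoch is one reshuffled pass over the rollout."""
    for _ in range(n_epochs):
        order = list(range(n_samples))
        rng.shuffle(order)
        for start in range(0, n_samples, batch_size):
            yield order[start:start + batch_size]


def clipped_value_loss(value, old_value, ret, eps=0.2):
    """Per-sample value loss with the prediction kept within eps of old_value."""
    value_clipped = old_value + max(-eps, min(value - old_value, eps))
    return max((value - ret) ** 2, (value_clipped - ret) ** 2)


def annealed_lr(initial_lr, update, total_updates):
    """Linear decay from initial_lr towards zero over training."""
    return initial_lr * (1.0 - update / total_updates)


rng = random.Random(0)
batches = list(minibatch_indices(n_samples=10, n_epochs=3, batch_size=4, rng=rng))
norm_adv = normalise([1.0, 2.0, 3.0])
```

A training loop would iterate over `batches`, evaluate the loss on each index list, and take one optimiser step per mini-batch, so one rollout of 10 samples yields 9 gradient steps in this configuration.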
PPO in Practice
PPO has become the de facto standard for many RL applications. It trained OpenAI Five (Dota 2), was used for the RLHF fine-tuning of InstructGPT and ChatGPT, and is widely used in robotic control pipelines. Its success stems from a rare combination: it is theoretically motivated, empirically robust, and comparatively straightforward to implement, although the implementation details listed above have a real effect on results.
References
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR. arXiv:1506.02438.
- Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML. arXiv:1502.05477.
- Andrychowicz, M., et al. (2020). What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study. arXiv:2006.05990.
