TRPO and Natural Policy Gradient

4 minute read

Published: August 13, 2025

TL;DR: TRPO (Trust Region Policy Optimisation) places a hard KL-divergence constraint on policy updates, guaranteeing that performance never decreases. The natural policy gradient steers updates in the direction of steepest ascent in policy space (not parameter space), using the Fisher information matrix as a Riemannian metric. Together, these ideas provide principled, stable policy optimisation with monotonic improvement guarantees.

The Problem with Vanilla Gradient Ascent

Gradient ascent in parameter space has an implicit flaw: the same step size in parameter space can produce wildly different changes in policy behaviour depending on the local curvature. A large step might leap past the trust region into a region where the policy is catastrophically bad, and recovery is difficult.

The core question TRPO asks is: how do we define a “safe” step size that is invariant to the parameterisation of the policy?

The Natural Gradient

In Euclidean gradient ascent, the update is:

θ ← θ + α ∇_θ J(θ)

The natural gradient (Amari 1998) replaces the Euclidean metric with the Fisher information matrix \(F(\theta)\), which measures the local curvature of the KL divergence between policy distributions:

F(θ) = E_π [ ∇_θ log π_θ(a|s) ∇_θ log π_θ(a|s)^T ]

The natural gradient update is:

θ ← θ + α F(θ)^{-1} ∇_θ J(θ)

This steepest ascent direction in distribution space is invariant to reparameterisation: regardless of how we represent the policy (softmax, log-linear, neural network), the natural gradient points in the same direction in policy space.

Key Insight: The Fisher information matrix is the Riemannian metric on the manifold of probability distributions. Computing the natural gradient is equivalent to asking: "which parameter change produces the maximum improvement in objective per unit of KL divergence from the current policy?" This is the right question to ask for policy optimisation.

TRPO: Formalising the Trust Region

TRPO (Schulman et al. 2015) formalises this as a constrained optimisation:

max_θ E_t [ (π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)) · Â_t ] subject to: E_t [ KL( π_{θ_old}(·|s_t) || π_θ(·|s_t) ) ] ≤ δ

The constraint keeps the new policy within a trust region of radius \(\delta\) in KL space. The monotonic improvement theorem guarantees that, within this constraint, the true objective cannot decrease.

Solving the Constrained Problem

Directly inverting \(F\) for large neural networks is intractable. TRPO avoids this via:

Conjugate gradient: efficiently compute \(F^{-1} g\) without forming \(F\) explicitly, using matrix-vector products \(F v = \nabla_\theta (\nabla_\theta \log \pi \cdot v)\).
Line search: after finding the natural gradient direction, perform a backtracking line search to find the largest step that satisfies the KL constraint and actually improves the surrogate objective.

This makes TRPO second-order in concept but avoids full Hessian computation.

TRPO vs PPO

TRPO’s theoretical guarantees are attractive, but the conjugate gradient solve and line search make it 10–100× more expensive per update than first-order methods. PPO (Schulman et al. 2017) approximates the same constraint with a clipping heuristic, achieving comparable empirical stability at a fraction of the cost.

For most practical applications, PPO is preferred. TRPO remains valuable when theoretical guarantees matter — for instance, in safety-critical settings — and its analysis underpins the theoretical foundations of PPO.

ACKTR: Natural Gradient with Kronecker Factoring

ACKTR (Wu et al. 2017) approximates the Fisher matrix using Kronecker-factored structure (K-FAC), enabling efficient second-order updates without conjugate gradient. It achieves the stability of TRPO with computational cost closer to first-order methods, and outperforms A2C and TRPO on several Atari and MuJoCo benchmarks.

References

Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML. arXiv:1502.05477.
Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251–276.
Kakade, S. (2002). A Natural Policy Gradient. NeurIPS.
Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. NeurIPS. arXiv:1708.05144.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Alessio Borgi

TRPO and Natural Policy Gradient

The Problem with Vanilla Gradient Ascent

The Natural Gradient

TRPO: Formalising the Trust Region

Solving the Constrained Problem

TRPO vs PPO

ACKTR: Natural Gradient with Kronecker Factoring

References

Share on

You May Also Enjoy

GAPE: Remember to Forget — Gated Adaptive Positional Encoding

PolyNSD: Polynomial Neural Sheaf Diffusion

TDA in Materials Science: Topology of Structure and Phase

TDA in Drug Discovery: Molecular Topology