Definition
Proximal Policy Optimisation
Let be a Markov decision process. Let be a parameterised stochastic policy with parameters , let be the policy from the previous iteration, and let be an estimator of the advantage function at timestep .
PPO maximises the clipped surrogate objective
where is the probability ratio and is a clipping hyperparameter.
Problem Setup
Formal setup
An agent interacts with an environment modelled as a Markov decision process , where:
- is the state space,
- is the action space,
- is the transition probability,
- is the immediate reward,
- is the discount factor.
A stochastic policy gives the probability of taking action in state , parameterised by .
The return from timestep is
The objective is to find that maximises the expected return from the start distribution:
Policy Gradient
Policy Gradient Theorem
Why the advantage?
The action-value tells you the expected return after taking in . The value tells you the expected return from under the current policy. The advantage measures how much better is than the average action in . Using the advantage instead of the raw return reduces variance without introducing bias.
Vanilla policy gradient
Naively following with a fixed step size is brittle. A step that is too large can collapse performance, because the policy changes too much and the data collected under no longer represents . A step that is too small wastes samples.
From Trust Region to Clipping
The trust region idea
Trust Region Policy Optimization (TRPO) constrains each update to a KL divergence ball:
This ensures monotonic improvement in theory, but the constrained optimisation requires conjugate gradient and a line search, making it computationally heavy.
PPO's solution: clipping
PPO replaces the hard constraint with a clipped surrogate objective that penalises large policy changes directly in the loss. Define the probability ratio
When , we have . The unconstrained surrogate would allow to grow unboundedly. PPO clips to :
The ensures the objective is a pessimistic bound on the unconstrained surrogate:
- When (the action was good): increasing beyond yields no further gain — the clip caps the incentive.
- When (the action was bad): decreasing below yields no further gain — the clip caps the penalty.
In both cases, PPO ignores changes to that would move the new policy too far from the old one.
Algorithm
PPO with clipped objective
Instance: MDP , initial policy parameters , value function parameters , clipping , number of epochs , minibatch size .
Repeat for each iteration:
- Collect: Run for timesteps, collecting trajectories .
- Compute advantage estimates using generalised advantage estimation.
- Compute returns .
- For epochs:
- Shuffle the samples into minibatches of size .
- For each minibatch :
- Update by gradient ascent on over .
- Update by gradient descent on .
- Set .
Hyperparameters
Typical settings: , , , and chosen so that the total collected experience is a few thousand timesteps.
Properties
On-policy
PPO is an on-policy algorithm: each batch of data is collected under the current policy and discarded after one (or a few) gradient steps. This ensures the probability ratio is computed with respect to the policy that generated the data.
Model-free
PPO does not learn or use a model of the transition dynamics . It is a model-free method.
Actor-critic
PPO maintains both a policy (actor) with parameters and a value function (critic) with parameters . The critic provides the baseline for advantage estimation, reducing variance.
Monotonic improvement
Unlike TRPO, PPO does not provide a theoretical guarantee of monotonic improvement. The clipping heuristic is an empirical approximation of the trust region constraint, but in practice it achieves comparable or better stability with a simpler implementation.
Variance reduction via GAE
PPO typically uses generalised advantage estimation (GAE) to compute , which trades off bias and variance through a parameter :
where is the temporal-difference error.