Lukas' Notes

reinforcement-learning

Definition

Proximal Policy Optimisation

Let be a Markov decision process. Let be a parameterised stochastic policy with parameters , let be the policy from the previous iteration, and let be an estimator of the advantage function at timestep .

PPO maximises the clipped surrogate objective

where is the probability ratio and is a clipping hyperparameter.

Problem Setup

Formal setup

An agent interacts with an environment modelled as a Markov decision process , where:

  • is the state space,
  • is the action space,
  • is the transition probability,
  • is the immediate reward,
  • is the discount factor.

A stochastic policy gives the probability of taking action in state , parameterised by .

The return from timestep is

The objective is to find that maximises the expected return from the start distribution:

Policy Gradient

Policy Gradient Theorem

The gradient of with respect to is

where is an estimate of the advantage function

Why the advantage?

The action-value tells you the expected return after taking in . The value tells you the expected return from under the current policy. The advantage measures how much better is than the average action in . Using the advantage instead of the raw return reduces variance without introducing bias.

Vanilla policy gradient

Naively following with a fixed step size is brittle. A step that is too large can collapse performance, because the policy changes too much and the data collected under no longer represents . A step that is too small wastes samples.

From Trust Region to Clipping

The trust region idea

Trust Region Policy Optimization (TRPO) constrains each update to a KL divergence ball:

This ensures monotonic improvement in theory, but the constrained optimisation requires conjugate gradient and a line search, making it computationally heavy.

PPO's solution: clipping

PPO replaces the hard constraint with a clipped surrogate objective that penalises large policy changes directly in the loss. Define the probability ratio

When , we have . The unconstrained surrogate would allow to grow unboundedly. PPO clips to :

The ensures the objective is a pessimistic bound on the unconstrained surrogate:

  • When (the action was good): increasing beyond yields no further gain — the clip caps the incentive.
  • When (the action was bad): decreasing below yields no further gain — the clip caps the penalty.

In both cases, PPO ignores changes to that would move the new policy too far from the old one.

Algorithm

PPO with clipped objective

Instance: MDP , initial policy parameters , value function parameters , clipping , number of epochs , minibatch size .

Repeat for each iteration:

  1. Collect: Run for timesteps, collecting trajectories .
  2. Compute advantage estimates using generalised advantage estimation.
  3. Compute returns .
  4. For epochs:
    • Shuffle the samples into minibatches of size .
    • For each minibatch :
  5. Set .

Hyperparameters

Typical settings: , , , and chosen so that the total collected experience is a few thousand timesteps.

Properties

On-policy

PPO is an on-policy algorithm: each batch of data is collected under the current policy and discarded after one (or a few) gradient steps. This ensures the probability ratio is computed with respect to the policy that generated the data.

Model-free

PPO does not learn or use a model of the transition dynamics . It is a model-free method.

Actor-critic

PPO maintains both a policy (actor) with parameters and a value function (critic) with parameters . The critic provides the baseline for advantage estimation, reducing variance.

Monotonic improvement

Unlike TRPO, PPO does not provide a theoretical guarantee of monotonic improvement. The clipping heuristic is an empirical approximation of the trust region constraint, but in practice it achieves comparable or better stability with a simpler implementation.

Variance reduction via GAE

PPO typically uses generalised advantage estimation (GAE) to compute , which trades off bias and variance through a parameter :

where is the temporal-difference error.