Proximal Policy Optimisation

Definition

Proximal Policy Optimisation

Let $M = (S, A, P, R, γ)$ be a Markov decision process. Let $π_{θ}$ be a parameterised stochastic policy with parameters $θ$ , let $π_{θ_{old}}$ be the policy from the previous iteration, and let $\hat{A}_{t}$ be an estimator of the advantage function at timestep $t$ .

PPO maximises the clipped surrogate objective
$L^{CLIP} (θ) = E_{t} [\min (r_{t} (θ) \hat{A}_{t}, \operatorname{clip} (r_{t} (θ), 1 - ε, 1 + ε) \hat{A}_{t})]$
where $r_{t} (θ) = \frac{π _{θ} ( a _{t} ∣ s _{t} )}{π _{θ_{old}} ( a _{t} ∣ s _{t} )}$ is the probability ratio and $ε \in (0, 1)$ is a clipping hyperparameter.

Problem Setup

Formal setup

An agent interacts with an environment modelled as a Markov decision process $M = (S, A, P, R, γ)$ , where:

$S$ is the state space,

$A$ is the action space,

$P (s^{'} ∣ s, a)$ is the transition probability,

$R (s, a)$ is the immediate reward,

$γ \in [0, 1]$ is the discount factor; with bounded rewards, $γ < 1$ keeps the infinite-horizon return finite.

A stochastic policy $π_{θ} (a ∣ s)$ gives the probability of taking action $a$ in state $s$ , parameterised by $θ \in R^{d}$ .

The return from timestep $t$ is
$G_{t} = \sum_{k = 0}^{\infty} γ^{k} R (s_{t + k}, a_{t + k}) .$
The objective is to find $θ^{*}$ that maximises the expected return from the start distribution:
$J (θ) = E_{τ \sim π_{θ}} [G_{0}] .$
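For a finite episode, the return can be computed in one backward pass using $G_{t} = R (s_{t}, a_{t}) + γ G_{t + 1}$. The sketch below assumes the rewards of one episode are already stored in a list; the function name `discounted_returns` is a hypothetical helper, not part of any library.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every timestep t of one finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0  # holds G_{t+1} while iterating backwards
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```

The first entry is $G_{0}$, the quantity whose expectation defines $J (θ)$.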

Policy Gradient

Policy Gradient Theorem

The gradient of $J (θ)$ with respect to $θ$ is
$\nabla_{θ} J (θ) = E_{τ \sim π_{θ}} [\sum_{t = 0}^{\infty} \nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) \hat{A}_{t}]$
where $\hat{A}_{t}$ is an estimate of the advantage function
$A^{π_{θ}} (s, a) = Q^{π_{θ}} (s, a) - V^{π_{θ}} (s) .$
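As a rough illustration of the estimator, consider a softmax policy over a handful of actions in a single state, with the logits playing the role of $θ$. For that policy, the gradient of $\log π_{θ} (a)$ with respect to the logits is the one-hot vector of $a$ minus the probability vector. The sketch below averages the per-sample terms over the sampled timesteps instead of summing them, which only rescales the gradient; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_estimate(logits, actions, advantages):
    """Average of grad log pi(a_t) * A_t over the sampled timesteps."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for i in range(len(logits)):
            indicator = 1.0 if i == a else 0.0
            # d/d logit_i of log softmax(logits)[a] = 1[i == a] - probs[i]
            grad[i] += (indicator - probs[i]) * adv
    return [g / len(actions) for g in grad]


# Action 0 had positive advantage, action 1 negative:
# the estimate pushes the logit of action 0 up and that of action 1 down.
print(policy_gradient_estimate([0.0, 0.0], actions=[0, 1], advantages=[1.0, -1.0]))
# [0.5, -0.5]
```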

Why the advantage?

The action-value $Q^{π_{θ}} (s, a)$ tells you the expected return after taking $a$ in $s$ . The value $V^{π_{θ}} (s)$ tells you the expected return from $s$ under the current policy. The advantage $A (s, a)$ measures how much better $a$ is than the average action in $s$ . Using the advantage instead of the raw return reduces variance without introducing bias.
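A small numeric example may help. It assumes a single state with three actions whose action-values are known exactly, which is never the case in practice; the numbers are made up for illustration.

```python
def advantages(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s), with V(s) the policy-weighted mean of Q."""
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - v for q in q_values]


q = [1.0, 3.0, 2.0]      # Q(s, a) for three actions (illustrative values)
pi = [0.5, 0.25, 0.25]   # pi(a | s)
adv = advantages(q, pi)
print(adv)  # [-0.75, 1.25, 0.25]

# The advantage averages to zero under the policy itself.
print(sum(p * a for p, a in zip(pi, adv)))  # 0.0
```

Action 0 is the most likely action but is worse than average, so its advantage is negative even though its action-value is positive.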

Vanilla policy gradient

Naively following $\nabla_{θ} J (θ)$ with a fixed step size is brittle. A step that is too large can collapse performance, because the policy changes too much and the data collected under $π_{θ_{old}}$ no longer represents $π_{θ}$ . A step that is too small wastes samples.

From Trust Region to Clipping

The trust region idea

Trust Region Policy Optimization (TRPO) constrains each update to a KL divergence ball:
$\max_{θ} E_{t} [r_{t} (θ) \hat{A}_{t}] \quad \text{s.t.} \quad E_{t} [D_{KL} (π_{θ_{old}} (\cdot ∣ s_{t}) ∥ π_{θ} (\cdot ∣ s_{t}))] \leq δ .$
The underlying theory guarantees monotonic improvement, but solving the constrained problem requires conjugate gradient and a line search, which makes each update computationally heavy.
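The constraint itself is cheap to evaluate for discrete action spaces. The sketch below assumes categorical policies, represented as one probability list per visited state, and checks whether a candidate update stays inside the KL ball of radius $δ$. It only tests the constraint; it does not perform TRPO's conjugate-gradient step or line search. All function names are hypothetical.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl(old_dists, new_dists):
    """Average KL(old || new) over the visited states."""
    kls = [kl_categorical(p, q) for p, q in zip(old_dists, new_dists)]
    return sum(kls) / len(kls)


def within_trust_region(old_dists, new_dists, delta):
    return mean_kl(old_dists, new_dists) <= delta


old = [[0.5, 0.5], [0.9, 0.1]]
small_step = [[0.52, 0.48], [0.88, 0.12]]
large_step = [[0.9, 0.1], [0.5, 0.5]]
print(within_trust_region(old, small_step, delta=0.01))  # True
print(within_trust_region(old, large_step, delta=0.01))  # False
```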

PPO's solution: clipping

PPO replaces the hard constraint with a clipped surrogate objective that penalises large policy changes directly in the loss. Define the probability ratio
$r_{t} (θ) = \frac{π _{θ} ( a _{t} ∣ s _{t} )}{π _{θ_{old}} ( a _{t} ∣ s _{t} )} .$
When $π_{θ} = π_{θ_{old}}$ , we have $r_{t} (θ) = 1$ . The unconstrained surrogate $L^{CPI} (θ) = E_{t} [r_{t} (θ) \hat{A}_{t}]$ would allow $r_{t}$ to grow unboundedly. PPO clips $r_{t}$ to $[1 - ε, 1 + ε]$ :
$L^{CLIP} (θ) = E_{t} [\min (r_{t} (θ) \hat{A}_{t}, \operatorname{clip} (r_{t} (θ), 1 - ε, 1 + ε) \hat{A}_{t})] .$
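The $\min$ makes the objective a pessimistic (lower) bound on the unconstrained surrogate:

When $\hat{A}_{t} > 0$ (the action was good): increasing $r_{t}$ beyond $1 + ε$ yields no further gain, because the clip caps the incentive.

When $\hat{A}_{t} < 0$ (the action was bad): decreasing $r_{t}$ below $1 - ε$ yields no further gain, because the clip caps the incentive here as well.

In both cases, the objective is flat once $r_{t}$ has left $[1 - ε, 1 + ε]$ in the direction that would improve it, so there is no incentive to move the new policy further from the old one. Moves in the harmful direction (for example $r_{t} > 1 + ε$ when $\hat{A}_{t} < 0$) are not clipped: the $\min$ then selects the unclipped term, so the objective still penalises them.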
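The case analysis above can be checked numerically. The sketch below computes the ratio from log-probabilities, which is how it is usually done to avoid underflow, and averages the clipped term over a batch. It assumes the log-probabilities under both policies are already available as lists; `clipped_term` and `clipped_surrogate` are hypothetical names.

```python
import math


def clipped_term(ratio, advantage, epsilon):
    """Per-sample term min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Batch average of the clipped term, with r = exp(logp_new - logp_old)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        total += clipped_term(ratio, adv, epsilon)
    return total / len(advantages)


eps = 0.2
# Good action, ratio already above 1 + eps: capped at (1 + eps) * A.
print(clipped_term(1.5, 1.0, eps))
# Bad action, ratio already below 1 - eps: capped at (1 - eps) * A.
print(clipped_term(0.5, -1.0, eps))
# Bad action made MORE likely: not clipped, the full penalty r * A applies.
print(clipped_term(1.5, -1.0, eps))
```

In an actual training loop the same expression would be written in an automatic-differentiation framework and negated to obtain a loss to minimise.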

Algorithm

PPO with clipped objective

Instance: MDP $M$ , initial policy parameters $θ_{0}$ , value function parameters $ϕ_{0}$ , clipping $ε$ , number of epochs $K$ , rollout length $T$ , minibatch size $M$ (an integer, not to be confused with the MDP $M$ ).

Repeat for each iteration:

Collect: Run $π_{θ_{old}}$ for $T$ timesteps, collecting transitions $\{(s_{t}, a_{t}, R (s_{t}, a_{t}))\}$ .

Compute advantage estimates $\hat{A}_{1}, \dots, \hat{A}_{T}$ using generalised advantage estimation.

Compute returns $\hat{G}_{t} = \hat{A}_{t} + V_{ϕ} (s_{t})$ .

For $k = 1, \dots, K$ epochs:

Shuffle the $T$ samples into minibatches of size $M$ .

For each minibatch $B$ :

Update $θ$ by gradient ascent on $L^{CLIP} (θ)$ over $B$ .

Update $ϕ$ by gradient descent on $\frac{1}{∣ B ∣} \sum_{t \in B} (V_{ϕ} (s_{t}) - \hat{G}_{t})^{2}$ .

Set $θ_{old} \leftarrow θ$ .
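The inner loop of the procedure above can be sketched without committing to any particular neural-network library. The code below only covers the bookkeeping: shuffling the $T$ samples into minibatches for $K$ epochs, and evaluating the critic's squared-error loss on one minibatch. The gradient steps themselves are left out, and both function names are hypothetical.

```python
import random


def minibatch_indices(num_samples, batch_size, num_epochs, seed=0):
    """Yield index lists: a fresh shuffle per epoch, split into minibatches."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for _ in range(num_epochs):
        rng.shuffle(indices)
        for start in range(0, num_samples, batch_size):
            # The last minibatch of an epoch may be smaller than batch_size.
            yield indices[start:start + batch_size]


def value_loss(values, returns, batch):
    """Mean squared error between V(s_t) and the return target over a batch."""
    return sum((values[t] - returns[t]) ** 2 for t in batch) / len(batch)


values = [0.0, 1.0, 2.0, 3.0]    # V_phi(s_t), illustrative numbers
returns = [1.0, 1.0, 0.0, 3.0]   # return targets, illustrative numbers
for batch in minibatch_indices(num_samples=4, batch_size=2, num_epochs=2):
    # A real implementation would take one gradient step on the policy
    # objective and one on the value loss here.
    print(batch, value_loss(values, returns, batch))
```

Every sample is visited exactly once per epoch, so each collected transition contributes to $K$ gradient steps before the data is thrown away.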

Hyperparameters

Typical settings: $ε \in [0.1, 0.3]$ , $K \in [3, 10]$ , $M \in [64, 256]$ , and $T$ chosen so that the total collected experience is a few thousand timesteps.
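One way to keep these settings together is a small configuration object. The defaults below are picked from inside the ranges quoted above purely for illustration; they are assumptions, not recommendations from these notes, and `PPOConfig` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    epsilon: float = 0.2       # clipping parameter
    num_epochs: int = 4        # K
    minibatch_size: int = 64   # M
    horizon: int = 2048        # T, timesteps collected per iteration

    def __post_init__(self):
        if not 0.0 < self.epsilon < 1.0:
            raise ValueError("epsilon must lie in (0, 1)")
        if min(self.num_epochs, self.minibatch_size, self.horizon) <= 0:
            raise ValueError("K, M and T must be positive")

    def minibatches_per_epoch(self):
        # Ceiling division: the last minibatch may be partial.
        return -(-self.horizon // self.minibatch_size)

    def updates_per_iteration(self):
        return self.num_epochs * self.minibatches_per_epoch()


cfg = PPOConfig()
print(cfg.minibatches_per_epoch())   # 32
print(cfg.updates_per_iteration())   # 128
```

The last number shows how much each batch is reused: with these values, every iteration performs 128 gradient steps on 2048 collected timesteps.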

Properties

On-policy

PPO is an on-policy algorithm: each batch of data is collected under the current policy, reused for $K$ epochs of minibatch updates, and then discarded. The denominator of the probability ratio $r_{t} (θ)$ is therefore always the policy that generated the data.

Model-free

PPO does not learn or use a model of the transition dynamics $P (s^{'} ∣ s, a)$ . It is a model-free method.

Actor-critic

PPO maintains both a policy (actor) with parameters $θ$ and a value function (critic) with parameters $ϕ$ . The critic provides the baseline for advantage estimation, reducing variance.

Monotonic improvement

Unlike TRPO, PPO does not provide a theoretical guarantee of monotonic improvement. The clipping is a heuristic approximation of the trust region constraint, but in practice it often achieves comparable stability with a much simpler implementation.

Variance reduction via GAE

PPO typically uses generalised advantage estimation (GAE) to compute $\hat{A}_{t}$ , which trades off bias and variance through a parameter $λ \in [0, 1]$ :
$\hat{A}_{t}^{GAE (γ, λ)} = \sum_{l = 0}^{\infty} (γλ)^{l} δ_{t + l}$
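where $δ_{t} = R (s_{t}, a_{t}) + γ V_{ϕ} (s_{t + 1}) - V_{ϕ} (s_{t})$ is the temporal-difference error (the reward is written $R (s_{t}, a_{t})$ here to avoid a clash with the probability ratio $r_{t} (θ)$ ). Setting $λ = 0$ gives the one-step estimate $\hat{A}_{t} = δ_{t}$ , which has lower variance but more bias; $λ = 1$ sums all discounted temporal-difference errors, which has less bias but more variance.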
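On a finite rollout the infinite sum is truncated, and the estimate can be computed with the backward recursion $\hat{A}_{t} = δ_{t} + γλ \hat{A}_{t + 1}$. The sketch below assumes a single rollout of length $T$ with no episode boundary inside it, and takes the critic's value of the state after the last step as `last_value`; handling of terminal states is deliberately omitted. The function name `gae` is hypothetical.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE advantages and return targets for one rollout.

    rewards[t] and values[t] = V(s_t) for t = 0..T-1;
    last_value = V(s_T), used to bootstrap after the final step.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0  # holds A_{t+1} while iterating backwards
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    # Return targets for the critic: G_t = A_t + V(s_t).
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# With gamma = lam = 1 and a zero critic, the advantage is the reward-to-go.
adv, ret = gae([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
print(adv, ret)  # [2.0, 1.0] [2.0, 1.0]

# With lam = 0 the advantage reduces to the one-step TD error.
adv0, _ = gae([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=0.0)
print(adv0)  # [1.0, 1.0]
```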

Lukas' Notes
