Definition
Monte Carlo Advantage Estimation
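A compact statement of the estimator, reconstructed here from the implementation further down. The symbol names ($G_t$ for the return, $d_{t+1}$ for the done flag, $\gamma$ for `discount_factor`) are notational assumptions rather than confirmed choices of the original text:

$$\hat{A}_t = G_t - V(s_t), \qquad G_t = r_t + \gamma\,(1 - d_{t+1})\,G_{t+1}.$$

At the last step of a rollout, $G_{t+1}$ is replaced by the critic's value of the observation that follows the rollout.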
Unbiased, but high-variance
Because $G_t$ is a single draw from the distribution whose mean is $Q^\pi(s_t, a_t)$, the estimator $\hat{A}_t = G_t - V(s_t)$ is unbiased for $A^\pi(s_t, a_t)$ whenever $V$ is the true value function $V^\pi$: nothing is assumed about the future beyond what was actually observed. The defining limitation is that $G_t$ can only be known once the return has finished playing out, so the method needs complete episodes, or a bootstrap with $V$ at the point the return is truncated.
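The unbiasedness claim can be checked exactly on a toy problem by enumerating every possible reward sequence instead of sampling. The sketch below is illustrative only: the reward model (each step pays 1 with probability `p`, else 0) and the helper name `return_stats` are assumptions made for this example, not part of the original implementation.

```python
from itertools import product


def return_stats(horizon, p=0.5, gamma=1.0):
    """Exact mean and variance of the discounted return G.

    Hypothetical toy model: each of `horizon` steps independently pays
    reward 1 with probability p and 0 otherwise.
    """
    mean = 0.0
    second_moment = 0.0
    for rewards in product([0, 1], repeat=horizon):
        prob = 1.0
        g = 0.0
        for k, r in enumerate(rewards):
            prob *= p if r == 1 else 1.0 - p
            g += gamma**k * r
        mean += prob * g
        second_moment += prob * g * g
    return mean, second_moment - mean * mean


for horizon in (1, 4, 8):
    mean, variance = return_stats(horizon)
    # Closed-form expected return for this toy model: sum_k gamma^k * p.
    expected = sum(0.5 * 1.0**k for k in range(horizon))
    print(horizon, mean, expected, variance)
```

The mean of the sampled return matches the closed-form expectation at every horizon, so subtracting an exact baseline leaves an unbiased advantage estimate. The variance column is the price: in this toy model it grows linearly with the horizon.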
The cost is variance. $G_t$ sums every reward from $t$ onward, each one stochastic, so its spread grows with the horizon; a single lucky or unlucky trajectory can swing $\hat{A}_t$ far from $A^\pi(s_t, a_t)$. This is the opposite trade from bootstrapping. A one-step temporal-difference error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ replaces the tail of the return with $V(s_{t+1})$, which is far less noisy but imports the critic's bias. Generalised advantage estimation interpolates between the two, with Monte Carlo as the $\lambda = 1$ endpoint.
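The two endpoints of that interpolation can be verified numerically. The following is a minimal pure-Python sketch for a single trajectory segment with no episode boundary inside it. The function names `gae` and `monte_carlo_advantages` and all numbers are made up for illustration, and the recursion is the commonly used GAE form, not code from this document.

```python
def gae(rewards, values, next_value, gamma, lam):
    """GAE over one uninterrupted segment: A_t = delta_t + gamma * lam * A_{t+1}."""
    n = len(rewards)
    advantages = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        v_next = next_value if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def monte_carlo_advantages(rewards, values, next_value, gamma):
    """Discounted return (bootstrapped once at the end) minus the value baseline."""
    n = len(rewards)
    advantages = [0.0] * n
    ret = next_value
    for t in reversed(range(n)):
        ret = rewards[t] + gamma * ret
        advantages[t] = ret - values[t]
    return advantages


# Illustrative numbers only.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5]
next_value = 0.25
gamma = 0.9

mc = monte_carlo_advantages(rewards, values, next_value, gamma)
gae_lambda_1 = gae(rewards, values, next_value, gamma, lam=1.0)
gae_lambda_0 = gae(rewards, values, next_value, gamma, lam=0.0)
td_errors = [
    rewards[t] + gamma * (values[t + 1] if t + 1 < len(values) else next_value) - values[t]
    for t in range(len(rewards))
]
```

With `lam=1.0` the result coincides with the Monte Carlo advantages, and with `lam=0.0` it reduces to the one-step TD errors; intermediate values trade variance against critic bias.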
Implementation
```python
from typing import Annotated, NamedTuple

from torch import Tensor, zeros_like


class Advantage(NamedTuple):
    ret: Annotated[Tensor, "n_steps n_envs"]
    adv: Annotated[Tensor, "n_steps n_envs"]


def monte_carlo(
    rew: Annotated[Tensor, "n_steps n_envs"],
    dones: Annotated[Tensor, "n_steps n_envs"],
    val: Annotated[Tensor, "n_steps n_envs"],
    next_val: Annotated[Tensor, "n_envs"],
    next_done: Annotated[Tensor, "n_envs"],
    discount_factor: float,
) -> Advantage:
    returns: Annotated[Tensor, "n_steps n_envs"] = zeros_like(rew)
    # Walk backward so returns[step + 1] is ready when returns[step] needs it.
    for step in reversed(range(rew.shape[0])):
        if step == rew.shape[0] - 1:
            # Last step: bootstrap from the critic's value after the rollout.
            next_nonterminal = 1.0 - next_done
            next_return = next_val
        else:
            next_nonterminal = 1.0 - dones[step + 1]
            next_return = returns[step + 1]
        returns[step] = rew[step] + discount_factor * next_nonterminal * next_return
    advantages: Annotated[Tensor, "n_steps n_envs"] = returns - val
    return Advantage(ret=returns, adv=advantages)
```

Each iteration fills one entry of `returns` from the return recurrence

$$G_t = r_t + \gamma\,(1 - d_{t+1})\,G_{t+1},$$

so the loop runs in reversed order: each $G_t$ needs $G_{t+1}$. The `if`/`else` only chooses where $G_{t+1}$ and the done flag come from:
- Every step but the last reads the already-computed `returns[step + 1]` and `dones[step + 1]`.
- The last step has no `returns[step + 1]`, so it bootstraps from `next_val` (the critic's value of the observation after the rollout), gated by `next_done`.
Both branches use the same done mask $1 - d_{t+1}$: it zeroes $G_{t+1}$ when transition $t$ ended an episode, so $G_t$ collapses to $r_t$ with no bootstrap. Vectorised environments auto-reset on termination and pack several episodes into one rollout; the mask is what keeps one episode's return from leaking into the next.
In the example rollout, transition 2 is terminal, so $d_3 = 1$ and the arrow carrying $G_3$ back into $G_2$ is severed: $G_2 = r_2$. Episode A's returns fill backward from $G_2$; Episode B's start fresh at $G_3$. Without the mask, $G_2$ would absorb rewards from the reset episode.
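The effect of the mask can be traced by hand on a small rollout. The sketch below is a single-environment, pure-Python restatement of the same recurrence; the function name `masked_returns` and every number are hypothetical, chosen so the arithmetic is easy to follow (a discount of 0.5, an episode ending at transition 2, and a bootstrap value of 4.0 after the rollout).

```python
def masked_returns(rewards, dones, next_value, next_done, gamma):
    """Backward return recurrence with a done mask, for one environment.

    dones[t] == 1.0 means the observation at step t starts a new episode,
    i.e. transition t - 1 ended the previous one.
    """
    n = len(rewards)
    returns = [0.0] * n
    for t in reversed(range(n)):
        if t == n - 1:
            nonterminal = 1.0 - next_done
            next_return = next_value
        else:
            nonterminal = 1.0 - dones[t + 1]
            next_return = returns[t + 1]
        returns[t] = rewards[t] + gamma * nonterminal * next_return
    return returns


rewards = [1.0, 1.0, 1.0, 10.0, 10.0]

# Episode A covers steps 0-2, Episode B covers steps 3-4.
dones = [0.0, 0.0, 0.0, 1.0, 0.0]
with_mask = masked_returns(rewards, dones, next_value=4.0, next_done=0.0, gamma=0.5)

# Same rollout with the boundary ignored, to show the leak.
no_dones = [0.0] * 5
without_mask = masked_returns(rewards, no_dones, next_value=4.0, next_done=0.0, gamma=0.5)

print(with_mask)     # [1.75, 1.5, 1.0, 16.0, 12.0]
print(without_mask)  # [3.75, 5.5, 9.0, 16.0, 12.0]
```

With the mask, the return at step 2 is just its own reward; without it, the large rewards of the following episode inflate every return in Episode A.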
Finally, `advantages = returns - val` is the literal $\hat{A}_t = G_t - V(s_t)$. The tuple is returned because PPO uses the two fields for different updates: `adv` as the policy-gradient signal, `ret` as the critic's regression target.
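To make the two roles concrete, here is a rough pure-Python sketch of how the fields might enter a PPO update. It assumes the standard clipped surrogate objective and a squared-error value loss. The source does not show its loss code, so the function name `ppo_losses`, the `clip_eps` default, and the 0.5 coefficient are all assumptions.

```python
def ppo_losses(adv, ret, new_values, ratios, clip_eps=0.2):
    """Hypothetical sketch: clipped policy loss from `adv`, value loss from `ret`.

    ratios[i] is the probability ratio pi_new(a|s) / pi_old(a|s) for sample i.
    """
    n = len(adv)
    policy_terms = []
    for a, r in zip(adv, ratios):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        # Pessimistic (minimum) of the unclipped and clipped objectives.
        policy_terms.append(min(r * a, clipped * a))
    policy_loss = -sum(policy_terms) / n
    # The critic regresses its fresh predictions onto the returns.
    value_loss = 0.5 * sum((v - g) ** 2 for v, g in zip(new_values, ret)) / n
    return policy_loss, value_loss


# Illustrative numbers only.
policy_loss, value_loss = ppo_losses(
    adv=[1.0, -1.0],
    ret=[2.0, 0.0],
    new_values=[1.0, 1.0],
    ratios=[1.5, 0.5],
)
```

Only `adv` touches the policy term and only `ret` touches the value term, which is why the function above returns both rather than the advantage alone.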