Lukas' Notes

reinforcement-learning

Definition

Monte Carlo Advantage Estimation

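Monte Carlo advantage estimation approximates the advantage of the action actually taken by subtracting a value baseline from a single sampled return:

$$\hat{A}_t = G_t - V(s_t)$$

where $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$ is the realised discounted return from time $t$. It stands in for the true advantage $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ by using the sample $G_t$ in place of the intractable expectation $Q^\pi(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t]$.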
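As a small numeric illustration, the sketch below computes the discounted return and the resulting advantage estimate for one finished episode. The helper names `discounted_returns` and `mc_advantages`, the rewards, and the baseline values are made up for this example and are not part of the implementation further down.

```python
def discounted_returns(rewards: list[float], gamma: float) -> list[float]:
    """Return G_t = r_t + gamma * G_{t+1} for one complete episode."""
    returns = [0.0] * len(rewards)
    running = 0.0  # nothing is collected after a finished episode
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_advantages(rewards: list[float], values: list[float], gamma: float) -> list[float]:
    """Subtract the baseline V(s_t) from each sampled return G_t."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 0.0, 2.0]   # hypothetical rewards r_0, r_1, r_2
values = [2.0, 1.5, 1.0]    # hypothetical baseline V(s_0), V(s_1), V(s_2)
gamma = 0.5

print(discounted_returns(rewards, gamma))     # [1.5, 1.0, 2.0]
print(mc_advantages(rewards, values, gamma))  # [-0.5, -0.5, 1.0]
```

The first two actions look worse than the baseline predicted and the last one looks better, but each number rests on a single sampled trajectory.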

Unbiased, but high-variance

Because $G_t$ is a single draw from the distribution whose mean is $Q^\pi(s_t, a_t)$, the estimator is unbiased for $A^\pi(s_t, a_t)$ whenever $V$ is the true value function $V^\pi$: nothing is assumed about the future beyond what was actually observed. The defining limitation is that $G_t$ can only be known once the return has finished playing out, so the method needs complete episodes, or a bootstrap with $V$ at the point the return is truncated.
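The unbiasedness claim can be written out in one line. The derivation assumes the baseline is exactly the true value function $V^\pi$ and uses the usual definition $Q^\pi(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t]$; the symbols are notation chosen for this sketch.

```latex
\begin{aligned}
\mathbb{E}\big[\hat{A}_t \mid s_t, a_t\big]
  &= \mathbb{E}\big[G_t \mid s_t, a_t\big] - V^\pi(s_t) \\
  &= Q^\pi(s_t, a_t) - V^\pi(s_t) \\
  &= A^\pi(s_t, a_t).
\end{aligned}
```

With a learned critic $V \neq V^\pi$, the same steps give $A^\pi(s_t, a_t) + V^\pi(s_t) - V(s_t)$, so the estimate is offset by the critic's error at $s_t$.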

The cost is variance. $G_t$ sums every reward from $t$ onward, each one stochastic, so its spread grows with the horizon; a single lucky or unlucky trajectory can swing $\hat{A}_t$ far from $A^\pi(s_t, a_t)$. This is the opposite trade from bootstrapping. A one-step temporal-difference error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ replaces the tail of the return with $V(s_{t+1})$, which is far less noisy but imports the critic’s bias. Generalised advantage estimation interpolates between the two, with Monte Carlo as the $\lambda = 1$ endpoint.
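The interpolation can be checked numerically. The sketch below is a minimal single-trajectory version of generalised advantage estimation in plain Python, not the code from these notes; the function names, the `lam` parameter, and the numbers are assumptions for illustration. Setting `lam=0.0` should return the one-step TD errors, and `lam=1.0` should match the Monte Carlo estimate.

```python
def gae(rewards, values, last_value, gamma, lam):
    """GAE for one trajectory without episode boundaries inside it.

    values[t] is V(s_t); last_value is V of the state after the final step.
    """
    n = len(rewards)
    advantages = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def monte_carlo_advantages(rewards, values, last_value, gamma):
    """G_t - V(s_t), with the return bootstrapped from last_value at the end."""
    n = len(rewards)
    advantages = [0.0] * n
    running = last_value
    for t in reversed(range(n)):
        running = rewards[t] + gamma * running
        advantages[t] = running - values[t]
    return advantages


rewards = [1.0, 0.0, 2.0]   # hypothetical rewards
values = [2.0, 1.5, 1.0]    # hypothetical critic outputs
gamma = 0.5

print(gae(rewards, values, 0.0, gamma, lam=0.0))             # [-0.25, -1.0, 1.0]
print(gae(rewards, values, 0.0, gamma, lam=1.0))             # [-0.5, -0.5, 1.0]
print(monte_carlo_advantages(rewards, values, 0.0, gamma))   # [-0.5, -0.5, 1.0]
```

At `lam=1.0` the intermediate value terms telescope away, leaving only the sampled rewards minus `values[t]`, which is why the last two lines agree.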

Implementation

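# Assumes PyTorch; the original snippet omitted its imports.
from typing import Annotated, NamedTuple

from torch import Tensor, zeros_like


class Advantage(NamedTuple):
    ret: Annotated[Tensor, "n_steps n_envs"]  # returns G_t (critic target)
    adv: Annotated[Tensor, "n_steps n_envs"]  # advantages G_t - V(s_t)


def monte_carlo(
    rew: Annotated[Tensor, "n_steps n_envs"],
    # Assumed float 0/1; dones[t] = 1 if transition t - 1 ended an episode.
    dones: Annotated[Tensor, "n_steps n_envs"],
    val: Annotated[Tensor, "n_steps n_envs"],
    next_val: Annotated[Tensor, "n_envs"],  # critic value of the observation after the rollout
    next_done: Annotated[Tensor, "n_envs"],  # done flag for that observation
    discount_factor: float,
) -> Advantage:
    returns: Annotated[Tensor, "n_steps n_envs"] = zeros_like(rew)

    # Walk backward so returns[step + 1] exists before returns[step] needs it.
    for step in reversed(range(rew.shape[0])):
        if step == rew.shape[0] - 1:
            # Last rollout step: bootstrap from the critic.
            next_nonterminal = 1.0 - next_done
            next_return = next_val
        else:
            next_nonterminal = 1.0 - dones[step + 1]
            next_return = returns[step + 1]

        # The mask drops the tail whenever this step ended an episode.
        returns[step] = rew[step] + discount_factor * next_nonterminal * next_return

    advantages: Annotated[Tensor, "n_steps n_envs"] = returns - val
    return Advantage(ret=returns, adv=advantages)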

Each iteration fills one entry of returns from the return recurrence

$$G_t = r_t + \gamma \, (1 - d_{t+1}) \, G_{t+1}$$

so the loop runs in reverse order: each $G_t$ needs $G_{t+1}$. The if/else only chooses where $G_{t+1}$ and the done flag $d_{t+1}$ come from:

  • Every step but the last reads the already-computed returns[step + 1] and dones[step + 1].
  • The last step has no returns[step + 1], so it bootstraps from next_val — the critic’s value of the observation after the rollout — gated by next_done.

Both branches use the same done mask $1 - d_{t+1}$: it zeroes $G_{t+1}$ when transition $t$ ended an episode, so $G_t$ collapses to $r_t$ with no bootstrap. Vectorised environments auto-reset on termination and pack several episodes into one rollout; the mask is what keeps one episode’s return from leaking into the next.

Suppose transition 2 is terminal. Then $d_3 = 1$ and the link carrying $G_3$ back into $G_2$ is severed: $G_2 = r_2$. Episode A’s returns fill backward from $G_2$; Episode B’s start fresh at $G_3$. Without the mask, $G_2$ would absorb rewards from the reset episode.
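The effect of the mask can be seen on a tiny single-environment rollout. This is a plain-Python sketch of the same recurrence, with a hypothetical `use_mask` switch added only to show what goes wrong without it; the rewards, `next_value`, and discount are invented for the example.

```python
def masked_returns(rewards, dones, next_value, next_done, gamma, use_mask=True):
    """Backward return recurrence for one environment.

    dones[t] is 1.0 when the transition before step t ended an episode,
    i.e. the observation at step t is the first of a new episode.
    """
    n = len(rewards)
    returns = [0.0] * n
    for t in reversed(range(n)):
        if t == n - 1:
            d_next, g_next = next_done, next_value  # bootstrap past the rollout
        else:
            d_next, g_next = dones[t + 1], returns[t + 1]
        mask = 1.0 - d_next if use_mask else 1.0
        returns[t] = rewards[t] + gamma * mask * g_next
    return returns


# Episode A covers steps 0-2, episode B covers steps 3-4.
rewards = [1.0, 1.0, 1.0, 4.0, 4.0]
dones = [0.0, 0.0, 0.0, 1.0, 0.0]   # d_3 = 1: transition 2 was terminal
next_value, next_done = 8.0, 0.0    # episode B is still running
gamma = 0.5

print(masked_returns(rewards, dones, next_value, next_done, gamma))
# [1.75, 1.5, 1.0, 8.0, 8.0]
print(masked_returns(rewards, dones, next_value, next_done, gamma, use_mask=False))
# [2.75, 3.5, 5.0, 8.0, 8.0]
```

With the mask, the return at step 2 is just its own reward. Without it, episode B's larger rewards inflate every return in episode A.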

Finally, advantages = returns - val is the literal $\hat{A}_t = G_t - V(s_t)$. The tuple is returned because PPO uses the two fields for different updates: adv as the policy-gradient signal, ret as the critic’s regression target.