Monte Carlo Advantage Estimation

Definition

Monte Carlo advantage estimation estimates the advantage of the action actually taken by subtracting a value baseline from a single sampled return:
$A_{t}^{MC} = G_{t} - V^{π} (s_{t}),$
where $G_{t}$ is the realised discounted return from time $t$. It stands in for the true advantage $A^{π} (s_{t}, a_{t}) = Q^{π} (s_{t}, a_{t}) - V^{π} (s_{t})$ by using the sample $G_{t}$ in place of the intractable expectation $Q^{π} (s_{t}, a_{t}) = \mathbb{E}_{π} [G_{t} \mid s_{t}, a_{t}]$.
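As a minimal sketch of the definition, the snippet below computes the discounted return at every step of one finished episode and subtracts a baseline. The helper names `discounted_returns` and `mc_advantages`, the reward values, and the baseline values are illustrative assumptions, not part of the implementation further down.

```python
def discounted_returns(rewards: list[float], gamma: float) -> list[float]:
    """Return G_t for every step of one complete episode."""
    returns = [0.0] * len(rewards)
    running = 0.0  # plays the role of G_{t+1}; zero past the end of the episode
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_advantages(rewards: list[float], values: list[float], gamma: float) -> list[float]:
    """A_t = G_t - V(s_t), one entry per step."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]


rewards = [1.0, 0.0, 2.0]
values = [2.0, 1.0, 1.5]  # hypothetical critic outputs, chosen for illustration
advantages = mc_advantages(rewards, values, gamma=0.5)
print(advantages)  # [-0.5, 0.0, 0.5]
```

Here the returns are $G_{2} = 2$, $G_{1} = 1$ and $G_{0} = 1.5$, so the first action looks worse than the baseline predicted and the last one looks better.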

Unbiased, but high-variance

Because $G_{t}$ is a single draw from the distribution whose mean is $Q^{π} (s_{t}, a_{t})$, the estimator is unbiased for $A^{π} (s_{t}, a_{t})$ whenever $V^{π}$ is the true value function: nothing is assumed about the future beyond what was actually observed. The defining limitation is that $G_{t}$ is known only once the return has finished playing out, so the method needs complete episodes, or a bootstrap with $V^{π}$ at the point where the return is truncated. That bootstrap stays unbiased only insofar as the critic matches the true $V^{π}$.

The cost is variance. $G_{t}$ sums every reward from $t$ onward, each one stochastic, so its spread grows with the horizon; a single lucky or unlucky trajectory can swing $A_{t}^{MC}$ far from $A^{π}$. This is the opposite trade from bootstrapping. A one-step temporal-difference error replaces the tail of the return with $V^{π} (s_{t + 1})$, which is far less noisy but imports the critic’s bias. Generalised advantage estimation interpolates between the two, with the one-step temporal-difference error as the $λ = 0$ endpoint and Monte Carlo as the $λ = 1$ endpoint.
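The variance gap can be illustrated with a small simulation. This is a sketch under toy assumptions that are not from the notes: rewards are independent standard-normal draws, so the true value of every state is 0, and the one-step target uses that true value as its critic. The function name `sample_targets` is hypothetical.

```python
import random
import statistics


def sample_targets(horizon: int, gamma: float, rng: random.Random) -> tuple[float, float]:
    """Draw one trajectory and return (Monte Carlo return, one-step TD target)."""
    rewards = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    mc_return = sum(gamma**k * r for k, r in enumerate(rewards))
    # Assumption: the critic is exact and every state has value 0,
    # so the TD target r_0 + gamma * V(s_1) reduces to r_0.
    td_target = rewards[0] + gamma * 0.0
    return mc_return, td_target


rng = random.Random(0)
samples = [sample_targets(horizon=50, gamma=0.99, rng=rng) for _ in range(5000)]
mc_var = statistics.variance([mc for mc, _ in samples])
td_var = statistics.variance([td for _, td in samples])
print(f"MC variance: {mc_var:.2f}, TD variance: {td_var:.2f}")
```

In this setting the Monte Carlo variance is the sum of $γ^{2k}$ over the horizon, roughly 32 for these parameters, while the one-step target has variance 1. With an imperfect critic the one-step target would keep its low variance but inherit the critic’s error.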

Implementation

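from typing import Annotated, NamedTuple

from torch import Tensor, zeros_like


class Advantage(NamedTuple):
    ret: Annotated[Tensor, "n_steps n_envs"]  # returns G_t, the critic's regression target
    adv: Annotated[Tensor, "n_steps n_envs"]  # G_t - V(s_t), the policy-gradient signal


def monte_carlo(
    rew: Annotated[Tensor, "n_steps n_envs"],
    dones: Annotated[Tensor, "n_steps n_envs"],
    val: Annotated[Tensor, "n_steps n_envs"],
    next_val: Annotated[Tensor, "n_envs"],
    next_done: Annotated[Tensor, "n_envs"],
    discount_factor: float,
) -> Advantage:
    # Assumption: dones and next_done are float tensors of 0.0/1.0, where
    # dones[t] == 1.0 means the transition before step t ended an episode.
    returns: Annotated[Tensor, "n_steps n_envs"] = zeros_like(rew)

    # Walk backward so that returns[step + 1] is ready when returns[step] needs it.
    for step in reversed(range(rew.shape[0])):
        if step == rew.shape[0] - 1:
            # Last step of the rollout: bootstrap from the critic's next value.
            next_nonterminal = 1.0 - next_done
            next_return = next_val
        else:
            next_nonterminal = 1.0 - dones[step + 1]
            next_return = returns[step + 1]

        # The mask zeroes the tail whenever this transition ended an episode.
        returns[step] = rew[step] + discount_factor * next_nonterminal * next_return

    advantages: Annotated[Tensor, "n_steps n_envs"] = returns - val
    return Advantage(ret=returns, adv=advantages)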

Each iteration fills one entry of returns from the return recurrence

$G_{t} = r_{t} + γ (1 - d_{t + 1}) G_{t + 1},$

so the loop runs in reversed order — each $G_{t}$ needs $G_{t + 1}$ . The if/else only chooses where $G_{t + 1}$ and the done flag come from:

Every step but the last reads the already-computed returns[step + 1] and dones[step + 1].
The last step has no returns[step + 1], so it bootstraps from next_val — the critic’s value of the observation after the rollout — gated by next_done.

Both branches use the same done mask $1 - d_{t + 1}$: it zeroes $G_{t + 1}$ when transition $t$ ended an episode, so $G_{t}$ collapses to $r_{t}$ with no bootstrap. Vectorised environments auto-reset on termination and pack several episodes into one rollout; the mask is what keeps one episode’s return from leaking into the next.

As an example, take a rollout in which transition 2 is terminal, so $d_{3} = 1$ and the link carrying $G_{3}$ back into $G_{2}$ is severed: $G_{2} = r_{2}$. Episode A’s returns fill backward from $r_{2}$; Episode B’s start fresh at $r_{3}$. Without the mask, $G_{2}$ would absorb rewards from the reset episode.
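The effect of the mask can be checked on a toy rollout. The sketch below is a single-environment version of the same recurrence on plain Python lists; the function name `masked_returns` and all numbers are made up for illustration. Episode A covers steps 0 to 2 with reward 1, and episode B covers steps 3 and 4 with reward 5.

```python
def masked_returns(rew, dones, next_val, next_done, gamma):
    """Return recurrence for one environment, with dones given as 0.0/1.0 floats."""
    n = len(rew)
    returns = [0.0] * n
    for t in reversed(range(n)):
        if t == n - 1:
            # Last step: bootstrap from the value of the observation after the rollout.
            nonterminal, nxt = 1.0 - next_done, next_val
        else:
            nonterminal, nxt = 1.0 - dones[t + 1], returns[t + 1]
        returns[t] = rew[t] + gamma * nonterminal * nxt
    return returns


rew = [1.0, 1.0, 1.0, 5.0, 5.0]
dones = [0.0, 0.0, 0.0, 1.0, 0.0]  # dones[3] = 1: transition 2 ended episode A
masked = masked_returns(rew, dones, next_val=2.0, next_done=0.0, gamma=0.5)
leaky = masked_returns(rew, [0.0] * 5, next_val=2.0, next_done=0.0, gamma=0.5)
print(masked)  # [1.75, 1.5, 1.0, 8.0, 6.0]
print(leaky)   # [2.75, 3.5, 5.0, 8.0, 6.0]
```

With the mask, $G_{2}$ equals $r_{2} = 1$. With the done flags zeroed out, $G_{2}$ becomes 5 because it absorbs episode B’s rewards, and the error propagates back to steps 0 and 1. Steps 3 and 4 are identical in both runs, since they only depend on the bootstrap from `next_val`.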

Finally, advantages = returns - val is the literal $A_{t} = G_{t} - V (s_{t})$ . The tuple is returned because PPO uses the two fields for different updates: adv as the policy-gradient signal, ret as the critic’s regression target.

Lukas' Notes
