Temporal-Difference Error

Definition

The temporal-difference (TD) error is the difference between two successive estimates of the same value, taken one time step apart. Given a policy $π$ with value estimate $V^{π}$, reward $r_{t}$, and discount factor $γ$, it is
$δ_{t} = r_{t} + γ V^{π} (s_{t + 1}) - V^{π} (s_{t}) .$
The term $r_{t} + γ V^{π} (s_{t + 1})$ is the one-step target: the reward just received, plus the discounted value of the state reached. Subtracting $V^{π} (s_{t})$ asks how much that target disagrees with the critic’s current prediction. A positive $δ_{t}$ means the transition was better than expected; a negative one means it was worse.
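The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `td_error`, its argument names, and the numbers are all assumptions chosen for the example.

```python
def td_error(reward, value_s, value_next, gamma):
    """One-step TD error: delta = r + gamma * V(s') - V(s)."""
    target = reward + gamma * value_next  # one-step target
    return target - value_s               # disagreement with the current prediction


# Hypothetical numbers: the critic predicted V(s) = 2.0 in both cases.
delta_good = td_error(reward=1.0, value_s=2.0, value_next=2.0, gamma=0.9)
delta_bad = td_error(reward=0.0, value_s=2.0, value_next=1.0, gamma=0.9)

print(delta_good)  # positive: the transition went better than predicted
print(delta_bad)   # negative: the transition went worse than predicted
```

In the first call the target is 1.0 + 0.9 · 2.0 = 2.8, above the prediction of 2.0; in the second it is 0.9, below it.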

Relation to the Bellman equation

The TD error vanishes in expectation when $V^{π}$ is the true value function. Taking the expectation of $δ_{t}$ given $s_{t} = s$ and using the Bellman equation $V^{π} (s) = E_{π} [r_{t} + γ V^{π} (s_{t + 1}) ∣ s_{t} = s]$ gives

$E_{π} [δ_{t} ∣ s_{t} = s] = 0,$

which is exactly the Bellman consistency condition for $V^{π}$. So $δ_{t}$ is the residual of that equation on a single sampled transition: it is zero on average when the value is correct, and its expectation is nonzero at any state where the critic violates the Bellman equation.
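The zero-mean property can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-state Markov reward process (the policy is already folded into the transition probabilities); the table `TRANSITIONS`, the helper names, and the value of `GAMMA` are illustrative choices, not anything from a particular library.

```python
GAMMA = 0.9

# Hypothetical two-state process under a fixed policy.
# TRANSITIONS[s] is a list of (probability, reward, next_state).
TRANSITIONS = {
    0: [(0.5, 1.0, 0), (0.5, 0.0, 1)],
    1: [(1.0, 2.0, 0)],
}


def expected_td_error(values, state):
    """Exact E[delta_t | s_t = state] under the given value estimate."""
    return sum(
        p * (r + GAMMA * values[s_next] - values[state])
        for p, r, s_next in TRANSITIONS[state]
    )


def evaluate_policy(sweeps=2000):
    """Iterate the Bellman equation until the values stop changing."""
    values = {s: 0.0 for s in TRANSITIONS}
    for _ in range(sweeps):
        values = {
            s: sum(p * (r + GAMMA * values[s_next]) for p, r, s_next in TRANSITIONS[s])
            for s in TRANSITIONS
        }
    return values


true_values = evaluate_policy()
residual_true = {s: expected_td_error(true_values, s) for s in TRANSITIONS}

# Perturb the estimate at state 0 to mimic an inaccurate critic.
wrong_values = dict(true_values)
wrong_values[0] += 1.0
residual_wrong = {s: expected_td_error(wrong_values, s) for s in TRANSITIONS}

print(residual_true)   # both entries are zero up to floating-point error
print(residual_wrong)  # both entries are clearly nonzero
```

Note that an error at state 0 also produces a nonzero expected TD error at state 1, because state 1 bootstraps from the value of state 0.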

As a learning signal

The TD error is the basic update signal of temporal-difference learning. The value estimate is moved toward the target by a step proportional to $δ_{t}$, that is, $V^{π} (s_{t}) ← V^{π} (s_{t}) + α δ_{t}$ for a step size $α$, which drives $V^{π}$ toward Bellman consistency without waiting for a full episode’s return. It also underlies advantage estimation: generalised advantage estimation builds its estimator from a discounted sum of TD errors.
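Both uses can be sketched for the tabular case. The code below is an illustration under stated assumptions: the step size `alpha`, the extra decay parameter `lam` (the λ of generalised advantage estimation, which the text above does not name), the function names, and the toy numbers are all introduced here for the example.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Move V(state) toward the one-step target by alpha * delta; return delta."""
    delta = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * delta
    return delta


def gae_advantages(deltas, gamma=0.9, lam=0.95):
    """A_t = sum over l of (gamma * lam)**l * delta_{t+l}, accumulated backwards."""
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# One TD(0) step on a hypothetical two-state value table.
values = {"a": 0.0, "b": 1.0}
delta = td0_update(values, "a", reward=1.0, next_state="b")

# Advantages from a short, made-up sequence of TD errors.
advantages = gae_advantages([1.0, 0.0, -1.0], gamma=0.5, lam=1.0)
print(delta, values["a"], advantages)
```

The backward loop relies on the recursion A_t = δ_t + γλ A_{t+1}, which avoids recomputing the discounted sum at every time step.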

Lukas' Notes
