Lukas' Notes

reinforcement-learning

Definition

Temporal-Difference Error

The temporal-difference (TD) error is the difference between two successive estimates of the same value, one step apart. Given a policy $\pi$ with value estimate $V$, it is

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t).$$

The term $r_t + \gamma V(s_{t+1})$ is the one-step target: the reward just received, plus the discounted value of the state reached. Subtracting $V(s_t)$ asks how much that target disagrees with the critic’s current prediction. A positive $\delta_t$ means the transition was better than expected; a negative one means it was worse.
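As a small worked illustration of the definition, the sketch below computes the one-step target and the TD error for a single transition. The function name `td_error` and the numbers are made up for this example; the value estimates are passed in as plain floats rather than read from any particular critic.

```python
def td_error(reward, value_s, value_next, gamma):
    """TD error for one transition: (reward + gamma * V(s')) - V(s)."""
    target = reward + gamma * value_next  # one-step target
    return target - value_s               # disagreement with the current prediction


# Illustrative numbers only: reward 1.0, V(s) = 2.5, V(s') = 2.0, gamma = 0.9.
# Target = 1.0 + 0.9 * 2.0 = 2.8, so the error is about +0.3 (better than expected).
better = td_error(reward=1.0, value_s=2.5, value_next=2.0, gamma=0.9)

# Same transition with no reward: target = 1.8, so the error is about -0.7 (worse).
worse = td_error(reward=0.0, value_s=2.5, value_next=2.0, gamma=0.9)
```

The sign of the result is what carries the "better or worse than expected" reading described above.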

Relation to the Bellman equation

The TD error vanishes in expectation when $V$ is the true value function $V^\pi$. Taking expectations and using $V = V^\pi$,

$$\mathbb{E}_\pi[\delta_t \mid s_t = s] = \mathbb{E}_\pi[r_t + \gamma V^\pi(s_{t+1}) \mid s_t = s] - V^\pi(s) = 0,$$

which is exactly the Bellman consistency condition for $V^\pi$. So $\delta_t$ is the residual of that equation on a single sampled transition: it is zero on average when the value is correct, and nonzero wherever the critic is wrong.
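The claim can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-state Markov reward process (the chain a fixed policy would induce); the transition probabilities `P`, rewards `R`, and discount `GAMMA` are invented for illustration. It finds the true values by repeated Bellman backups, then evaluates the expected TD error in each state.

```python
GAMMA = 0.9

# Hypothetical chain induced by a fixed policy (all numbers are assumptions):
# P[s][s2] is the probability of moving from s to s2,
# R[s][s2] is the reward received on that move.
P = [[0.8, 0.2], [0.3, 0.7]]
R = [[1.0, 0.0], [2.0, -1.0]]


def expected_td_error(V, s):
    """Average of the TD error over all transitions out of state s."""
    return sum(P[s][s2] * (R[s][s2] + GAMMA * V[s2] - V[s]) for s2 in range(len(V)))


def evaluate_policy(sweeps=2000):
    """Approximate the true values by repeatedly applying the Bellman backup."""
    V = [0.0, 0.0]
    for _ in range(sweeps):
        V = [
            sum(P[s][s2] * (R[s][s2] + GAMMA * V[s2]) for s2 in range(2))
            for s in range(2)
        ]
    return V


V_true = evaluate_policy()
V_wrong = [V_true[0] + 1.0, V_true[1]]  # deliberately overestimate state 0

residual_true = [expected_td_error(V_true, s) for s in range(2)]    # both close to 0
residual_wrong = [expected_td_error(V_wrong, s) for s in range(2)]  # both nonzero
```

With the true values the expected TD error is zero up to floating-point error. Overestimating state 0 makes the residual negative there and positive in state 1, because state 1 bootstraps from the inflated estimate.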

As a learning signal

The TD error is the basic update signal of temporal-difference learning. The value estimate $V(s_t)$ is moved toward the target by a step proportional to $\delta_t$, which drives $V$ toward Bellman consistency without waiting for a full episode’s return. It also underlies advantage estimation: generalised advantage estimation builds its estimator from a discounted sum of TD errors.
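Both uses can be sketched in a few lines. The code below is a minimal tabular illustration, not a reference implementation: `td0_update` assumes a dictionary-backed value table and a step size `alpha`, and `gae` assumes the usual generalised-advantage weighting $(\gamma\lambda)^l$ on the TD error $l$ steps ahead, with the sum truncated at the end of the recorded trajectory. The parameter `lam` (λ) is not defined in the note above and is introduced here as an assumption.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Move V[s] toward the one-step target by alpha times the TD error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta


def gae(deltas, gamma, lam):
    """Advantage estimates as discounted sums of TD errors.

    advantages[t] = sum over l of (gamma * lam)**l * deltas[t + l],
    computed backwards so each step reuses the running sum.
    """
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Illustrative value table with made-up states and numbers.
V = {"a": 0.0, "b": 1.0}
delta = td0_update(V, s="a", r=1.0, s_next="b", alpha=0.5, gamma=0.9)
# delta is about 1.9, and V["a"] has moved half of the way to the target.

# Three TD errors of 1.0 with gamma = 1.0 and lam = 0.5.
advantages = gae([1.0, 1.0, 1.0], gamma=1.0, lam=0.5)  # [1.75, 1.5, 1.0]
```

Setting `lam` to 0 reduces each advantage to the single TD error at that step, while larger values fold in more of the later errors.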