Lukas' Notes

reinforcement-learning

Definition

Return

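Let $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ be a trajectory sampled under policy $\pi$. The return $G_t$ from timestep $t$ is the discounted sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}$$

where $\gamma \in [0, 1]$ is the discount factor.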
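
As a rough illustration of the definition, here is a minimal Python sketch that computes the return from every timestep of a finite episode. It assumes that `rewards[t]` holds the reward collected at timestep `t` and that the episode ends after the last entry, so the infinite sum is truncated. The function name `discounted_returns` is hypothetical and not taken from any library. It uses the recursion "return at `t` = reward at `t` + discount factor × return at `t + 1`", which follows directly from the discounted sum.

```python
def discounted_returns(rewards, gamma):
    """Compute the discounted return from each timestep of a finite episode.

    Assumption: rewards[t] is the reward at timestep t, and all rewards
    after the end of the list are zero (the episode has terminated).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it:
    # G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Three-step episode with discount factor 0.5.
    print(discounted_returns([1.0, 0.0, 2.0], 0.5))  # [1.5, 1.0, 2.0]
```

Working backwards keeps the computation linear in the episode length, whereas evaluating the sum separately for each timestep would be quadratic.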