Lukas' Notes

reinforcement-learning

Definition

Policy Gradient

Given a parameterised stochastic policy , the policy gradient is the gradient of the expected return

with respect to the policy parameters:

By the policy gradient theorem, it can be estimated as

where is an advantage estimate. Therefore, actions with positive advantage are made more likely, while actions with negative advantage are made less likely.