Lukas' Notes

reinforcement-learning

Definition

Policy Gradient Theorem

Let be a parameterised stochastic policy and let be the . The expected return gradient of with respect to is

where is any baseline that depends only on . Common choices include the return , the action-value , or the advantage .

The theorem reduces the problem of differentiating through environment dynamics to differentiating only through the policy’s own log-probabilities, enabling gradient-based policy improvement without a model of .