Definition
Policy Gradient Theorem
Let be a parameterised stochastic policy and let be the . The expected return gradient of with respect to is
where is any baseline that depends only on . Common choices include the return , the action-value , or the advantage .
The theorem reduces the problem of differentiating through environment dynamics to differentiating only through the policy’s own log-probabilities, enabling gradient-based policy improvement without a model of .