Policy Gradient Theorem

Definition

Policy Gradient Theorem

Let $π_{θ}$ be a parameterised stochastic policy and let $J (θ) = E_{τ \sim π_{θ}} [G_{0}]$ be the expected return. The gradient of $J$ with respect to $θ$ is
$\nabla_{θ} J (θ) = E_{τ \sim π_{θ}} \left[ \sum_{t = 0}^{\infty} \nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) Ψ_{t} \right]$
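where $Ψ_{t}$ is a weighting term that can take several forms without changing the expectation. Common choices include the return from step $t$ onward, $G_{t}$, the action-value $Q^{π} (s_{t}, a_{t})$, or the advantage $A^{π} (s_{t}, a_{t})$, which is the action-value minus a baseline that depends only on $s_{t}$.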
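
As a short worked step (a sketch, assuming a discrete action space; the symbol $b$ is introduced here for illustration and does not appear in the definition above), subtracting any function $b (s_{t})$ of the state alone from $Ψ_{t}$ leaves the expectation unchanged, because the score term averages to zero under the policy:
$E_{a_{t} \sim π_{θ} (\cdot ∣ s_{t})} [\nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) b (s_{t})] = b (s_{t}) \sum_{a} \nabla_{θ} π_{θ} (a ∣ s_{t}) = b (s_{t}) \nabla_{θ} 1 = 0$
This is why the advantage $A^{π} (s_{t}, a_{t})$ can stand in for $Q^{π} (s_{t}, a_{t})$: the two differ only by a term that depends on $s_{t}$.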

The theorem reduces the problem of differentiating through environment dynamics to differentiating only through the policy’s own log-probabilities, enabling gradient-based policy improvement without a model of $P (s' ∣ s, a)$.
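
The model-free claim can be checked numerically on a small example. The sketch below is illustrative only and rests on several assumptions that are not part of the definition: a hypothetical 2-state, 2-action MDP (`P`, `R`), a fixed horizon of 3 undiscounted steps instead of an infinite sum, a tabular softmax policy, and $Ψ_{t} = G_{t}$. The sampled estimator only draws transitions from the environment and never differentiates through `P`; the model is used solely to compute a finite-difference reference for comparison.

```python
import numpy as np

# Hypothetical toy MDP (an assumption for illustration): 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                  # R[s, a] immediate rewards
              [0.0, 2.0]])
HORIZON = 3                                # finite, undiscounted episodes


def policy(theta):
    """Tabular softmax policy: row s holds pi(. | s)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def exact_return(theta):
    """Model-based J(theta) by backward recursion; used only as a check."""
    pi = policy(theta)
    v = np.zeros(2)
    for _ in range(HORIZON):
        q = R + P @ v                      # q[s, a] = R[s, a] + sum_s' P[s, a, s'] v[s']
        v = (pi * q).sum(axis=1)
    return v[0]                            # episodes start in state 0


def finite_difference(theta, eps=1e-5):
    """Central-difference gradient of the exact return (reference value)."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        grad[idx] = (exact_return(theta + d) - exact_return(theta - d)) / (2 * eps)
    return grad


def sampled_gradient(theta, episodes, rng):
    """Score-function estimate with Psi_t = G_t; P is only sampled from."""
    pi = policy(theta)
    grad = np.zeros_like(theta)
    for _ in range(episodes):
        s, steps = 0, []
        for _ in range(HORIZON):
            a = rng.choice(2, p=pi[s])
            steps.append((s, a, R[s, a]))
            s = rng.choice(2, p=P[s, a])   # environment step, treated as a black box
        g = 0.0
        for s_t, a_t, r_t in reversed(steps):
            g += r_t                       # reward-to-go G_t
            score = -pi[s_t]               # grad of log softmax w.r.t. theta[s_t]
            score[a_t] += 1.0
            grad[s_t] += score * g
    return grad / episodes


theta = np.array([[0.2, -0.1], [0.0, 0.3]])
rng = np.random.default_rng(0)
estimate = sampled_gradient(theta, 20000, rng)
reference = finite_difference(theta)
print("sampled estimate:\n", estimate)
print("finite-difference reference:\n", reference)
```

With enough episodes the two printed matrices should agree up to sampling noise. Using the reward-to-go for $Ψ_{t}$ keeps the example short; swapping in an advantage estimate would typically reduce the variance of the sampled gradient.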

Lukas' Notes
