Policy Gradient

Definition

Policy Gradient

Given a parameterised stochastic policy $π_{θ}(a ∣ s)$, the policy gradient is the gradient of the expected return
$J (θ) = E_{τ \sim π_{θ}} [R (τ)]$
with respect to the policy parameters:
$\nabla_{θ} J (θ) .$
By the policy gradient theorem, it can be estimated as
$\nabla_{θ} J (θ) = E_{t} [\nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) A_{t}],$
where $A_{t}$ is an advantage estimate. A gradient ascent step on $J (θ)$ therefore makes actions with positive advantage more likely and actions with negative advantage less likely.
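
To make the estimator concrete, here is a minimal sketch in plain Python. It assumes a softmax policy over three actions in a single state, with the logits themselves as the parameters $θ$; the function names, the sampled actions, the advantage values, and the step size are all made up for illustration and are not part of the definition above.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(logits, action):
    # For a softmax policy with logits as parameters:
    # d/d logits[k] of log pi(action) = 1[k == action] - pi(k)
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def policy_gradient_estimate(logits, actions, advantages):
    # Sample average of grad log pi(a_t | s_t) * A_t over the given samples.
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        g = grad_log_softmax(logits, a)
        for k in range(len(logits)):
            grad[k] += g[k] * adv
    return [g / len(actions) for g in grad]

# Hypothetical data: uniform initial policy, one sample per action.
theta = [0.0, 0.0, 0.0]
actions = [0, 1, 2]
advantages = [1.0, -1.0, 0.0]  # action 0 was good, action 1 was bad

grad = policy_gradient_estimate(theta, actions, advantages)

# One gradient ascent step on J(theta); the step size is an arbitrary choice.
step_size = 0.5
new_theta = [t + step_size * g for t, g in zip(theta, grad)]

before = softmax(theta)
after = softmax(new_theta)
print("before:", before)
print("after: ", after)
```

After the step, the probability of action 0 (positive advantage) goes up and the probability of action 1 (negative advantage) goes down, which is the behaviour described above. In practice the gradient of $\log π_{θ}$ would usually come from automatic differentiation rather than a hand-written formula.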

Lukas' Notes
