reinforcement-learning Definition On-Policy Action-Value Function

The on-policy action-value function $Q^{\pi}(s,a)$ gives the expected return if you start in state $s$, take an arbitrary action $a$ (which need not be the action the policy would choose), and then forever after act according to policy $\pi$:

$$Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s,\ a_0 = a \right]$$
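The expectation in this definition can be approximated by averaging sampled returns: fix the first state and first action, then let the policy choose every later action. The sketch below is a minimal illustration under stated assumptions, not a reference implementation. The 4-state corridor, the names `step`, `policy`, `rollout_return`, and `estimate_q`, the per-move reward of -1, and the choice of R(τ) as a (optionally discounted) sum of rewards are all hypothetical and chosen only to keep the example small.

```python
import random

GOAL = 3  # terminal state of a hypothetical 4-state corridor: 0, 1, 2, 3


def step(state, action):
    """Toy deterministic dynamics (an assumption): action is -1 (left) or +1 (right).

    Every move yields reward -1; the episode ends on reaching GOAL.
    """
    next_state = min(max(state + action, 0), GOAL)
    return next_state, -1.0, next_state == GOAL


def policy(state, rng):
    """The policy pi being evaluated. Here it always moves right.

    The rng argument is unused but kept so a stochastic policy could be swapped in.
    """
    return +1


def rollout_return(s, a, rng, gamma=1.0, max_steps=100):
    """Sample one return R(tau) with s0 = s, a0 = a, and a_t chosen by pi for t >= 1."""
    state, action = s, a  # the first action is forced, not drawn from the policy
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        state, reward, done = step(state, action)
        total += discount * reward
        discount *= gamma
        if done:
            break
        action = policy(state, rng)  # from here on, act according to pi
    return total


def estimate_q(s, a, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of Q^pi(s, a) for a non-terminal start state s."""
    rng = random.Random(seed)
    returns = [rollout_return(s, a, rng) for _ in range(n_rollouts)]
    return sum(returns) / n_rollouts
```

In this toy setup, `estimate_q(1, +1)` evaluates to -2.0, while `estimate_q(1, -1)` evaluates to -4.0: the forced first step to the left costs one move and leaves three more moves for the policy to reach the goal. The second call shows the point of the definition, since moving left is an action this policy would never choose itself. Because both the dynamics and the policy here are deterministic, every rollout returns the same value; with a stochastic environment or policy the average would only approximate the expectation.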