First recall that a policy [math]\pi[/math] maps each state [math]s[/math] and action [math]a[/math] to the probability [math]\pi(a|s)[/math] of taking action [math]a[/math] when in state [math]s[/math].
The state value function, [math]V_\pi(s)[/math], is the expected return when starting in state [math]s[/math] and following [math]\pi[/math] thereafter.
Similarly, the state-action value function, [math]Q_\pi(s, a)[/math], is the expected return when starting in state [math]s[/math], taking action [math]a[/math], and following policy [math]\pi[/math] thereafter.
Read these 3 times out loud and you’ll get the difference.
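To make the connection concrete: [math]V_\pi(s)[/math] is just the [math]\pi[/math]-weighted average of [math]Q_\pi(s, \cdot)[/math]. Here's a minimal Python sketch of that identity; the state, actions, and numbers are all made up for illustration.

```python
# pi[s][a]: probability of taking action a in state s (illustrative values)
pi = {"s0": {"left": 0.4, "right": 0.6}}

# Q[s][a]: expected return of taking a in s, then following pi (illustrative values)
Q = {"s0": {"left": 1.0, "right": 2.0}}

def state_value(s):
    # V_pi(s) = sum over a of pi(a|s) * Q_pi(s, a)
    return sum(pi[s][a] * Q[s][a] for a in pi[s])

print(state_value("s0"))  # 0.4*1.0 + 0.6*2.0 = 1.6
```

So if you know [math]Q_\pi[/math] and the policy, you get [math]V_\pi[/math] for free; the reverse is not true, which is one reason action-value methods like Q-learning work with [math]Q[/math] directly.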