Question

After reading several questions here and browsing some pages on the topic, here is my understanding of the key difference between Q-learning (as an example of an off-policy method) and SARSA (as an example of an on-policy method). Please correct me if I am mistaken.

1) With an on-policy algorithm we use the current policy (a regression model with weights W, and ε-greedy action selection) to choose the next action a', and the update target uses that action's value Q(s', a').

2) With an off-policy algorithm we use a greedy version of the current policy for the target, i.e. we bootstrap on max_a Q(s', a) regardless of which action is actually taken next (see the first sketch after this list).

3) If the exploration constant ε is set to 0, the off-policy method effectively becomes on-policy, since the next state's value is then derived from the same greedy policy that is being followed.

4) However, an on-policy method updates from a single freshly generated sample: it requires on-line interaction with the world, because we need to know exactly which actions the current policy takes in the current and next states. An off-policy method, by contrast, may use experience replay of past trajectories (generated by different, older policies) to feed a whole distribution of inputs and targets to the policy model (see the second sketch after this list).
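
To make points 1)–3) concrete, here is a minimal tabular sketch of the two update targets. All of the specifics (state/action counts, α, γ, ε) are illustrative assumptions of mine, not something from the slides:

```python
import numpy as np

# Hypothetical tabular setup: the sizes, learning rate, discount and epsilon
# below are illustrative assumptions, not taken from the linked lectures.
rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1

def eps_greedy(s):
    """Behaviour policy used by both methods: random with probability eps, else greedy."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa_update(s, a, r, s_next):
    """On-policy target: bootstrap on the action a' actually drawn from the
    eps-greedy behaviour policy (point 1)."""
    a_next = eps_greedy(s_next)
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return a_next  # SARSA must then really take a_next in the environment

def q_learning_update(s, a, r, s_next):
    """Off-policy target: bootstrap on the greedy maximum over actions in s',
    regardless of what the behaviour policy does next (point 2)."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# With eps = 0 the eps-greedy policy is exactly the greedy policy, so
# Q[s_next, eps_greedy(s_next)] == np.max(Q[s_next]) and the two updates
# coincide (point 3).
```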
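
And a rough sketch of the experience-replay part of point 4), reusing the Q, alpha and gamma from the sketch above (the buffer and batch sizes are again arbitrary):

```python
import random
from collections import deque

# Replay buffer of past transitions; maxlen and batch_size are illustrative.
buffer = deque(maxlen=10_000)

def store(s, a, r, s_next, done):
    """Transitions can come from any (possibly old) behaviour policy."""
    buffer.append((s, a, r, s_next, done))

def replay_q_learning(batch_size=32):
    """Off-policy replay: the Q-learning target needs only (s, a, r, s') from
    the buffer. A SARSA-style target would also need the action the *current*
    policy takes in s', which transitions stored under older policies cannot provide."""
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
```

This is also why replay is routinely combined with Q-learning-style targets (as in DQN) rather than with plain SARSA.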

[Image: slide from the lecture notes linked below]

Source: https://courses.engr.illinois.edu/cs440/fa2009/lectures/Lect24.pdf

One more reading: http://mi.eng.cam.ac.uk/~mg436/LectureSlides/MLSALT7/L3.pdf

No correct solution
