SLIDE 24 Monotonic ε-greedy Policy Improvement
Theorem
For any ε-greedy policy π_i, the ε-greedy policy w.r.t. Q^{π_i}, call it π_{i+1}, is a monotonic improvement: V^{π_{i+1}} ≥ V^{π_i}.
$$
\begin{aligned}
Q^{\pi_i}(s, \pi_{i+1}(s)) &= \sum_{a \in A} \pi_{i+1}(a|s)\, Q^{\pi_i}(s, a) \\
&= \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s, a) + (1 - \epsilon) \max_{a} Q^{\pi_i}(s, a) \\
&\geq \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s, a) + (1 - \epsilon) \sum_{a \in A} \frac{\pi_i(a|s) - \frac{\epsilon}{|A|}}{1 - \epsilon}\, Q^{\pi_i}(s, a) \\
&= \sum_{a \in A} \pi_i(a|s)\, Q^{\pi_i}(s, a) = V^{\pi_i}(s)
\end{aligned}
$$

The inequality holds because the max over actions is at least any convex combination of the Q-values: since π_i is ε-greedy, the weights (π_i(a|s) − ε/|A|)/(1 − ε) are nonnegative and sum to 1.
Therefore V^{π_{i+1}} ≥ V^{π_i} (from the policy improvement theorem).
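The first step of the derivation can be checked numerically: for any ε-greedy policy π_i and arbitrary Q-values, the ε-greedy policy π_{i+1} with respect to those Q-values achieves at least as high an expected value in every state. The sketch below (not from the slide; the helper `eps_greedy` and the random Q-values are illustrative assumptions) samples random states and verifies the inequality.

```python
import numpy as np

def eps_greedy(q, eps):
    """Epsilon-greedy action probabilities for one state:
    probability eps/|A| on every action, plus 1 - eps on the argmax."""
    n = len(q)
    probs = np.full(n, eps / n)
    probs[np.argmax(q)] += 1.0 - eps
    return probs

rng = np.random.default_rng(0)
eps = 0.1
for _ in range(1000):
    q = rng.normal(size=5)                    # arbitrary Q^{pi_i}(s, .)
    pi_i = eps_greedy(rng.normal(size=5), eps)  # some epsilon-greedy pi_i
    pi_next = eps_greedy(q, eps)              # pi_{i+1}: eps-greedy w.r.t. q
    # Q^{pi_i}(s, pi_{i+1}(s)) >= V^{pi_i}(s) for this state
    assert pi_next @ q >= pi_i @ q - 1e-12
```

Here `pi_next @ q` is the expected Q-value under π_{i+1}, i.e. the left-hand side of the derivation, and `pi_i @ q` is V^{π_i}(s); the assertion never fires because π_{i+1} shifts the non-exploration mass onto the maximizing action.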
Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 4: Model Free Control Winter 2020 24 / 58