Temporal Difference Learning
Deep Reinforcement Learning and Control Katerina Fragkiadaki
Carnegie Mellon School of Computer Science Spring 2019, CMU 10-403
Used Materials Disclaimer: Much of the material and slides for this lecture were …
- MC: a sample average of returns approximates the expectation.
- DP: the expected values are provided by a model, but we use a current estimate V(S_{t+1}) of the true v_π(S_{t+1}).
- TD: combines both: it samples the expected values and uses a current estimate V(S_{t+1}) of the true v_π(S_{t+1}).
V(S_t) ← Σ_a π(a|S_t) Σ_{s′,r} p(s′, r | S_t, a) [r + γ V(s′)]
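To make the full-width DP backup concrete, here is a minimal sketch (not from the slides); the tabular model p, policy pi, and value table V are hypothetical stand-ins for the quantities in the equation above.

# Minimal sketch of the one-step DP expected backup, assuming:
#   pi[s]     is a dict {action: probability},
#   p[(s, a)] is a list of (prob, next_state, reward) transitions,
#   V         is a dict mapping states to current value estimates.
def dp_backup(V, s, pi, p, gamma=1.0):
    # Full-width backup: sum over all actions and all successor states,
    # exactly the double sum in the equation above.
    return sum(
        pi[s][a] * sum(prob * (r + gamma * V[s2])
                       for prob, s2, r in p[(s, a)])
        for a in pi[s]
    )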
V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
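A minimal TD(0) prediction sketch, assuming a hypothetical episodic environment where env.reset() returns a start state, env.step(a) returns (next_state, reward, done), and policy(s) returns an action:

def td0_episode(V, env, policy, alpha=0.1, gamma=1.0):
    s = env.reset()
    done = False
    while not done:
        s2, r, done = env.step(policy(s))
        # Sample one transition and bootstrap from the current estimate V(S_{t+1}).
        target = r + (0.0 if done else gamma * V[s2])
        V[s] += alpha * (target - V[s])  # V(S_t) <- V(S_t) + alpha * (TD error)
        s = s2
    return V

Unlike the DP backup above, this touches only the one successor state that was actually sampled, so no model p is needed.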
Figure: the space of backup methods, arranged by the width of the backup (sample backups vs. full backups) and its height/depth (one-step backups vs. full returns): temporal-difference learning (sample, one-step), dynamic programming (full, one-step), Monte Carlo (sample, full return), and exhaustive search (full, full depth). Search and planning in a later lecture!
Figure: a sampled trajectory of states, actions, and rewards: S_t, A_t → R_{t+1}, S_{t+1}, A_{t+1} → R_{t+2}, S_{t+2}, A_{t+2} → R_{t+3}, S_{t+3}, A_{t+3} → …
Sarsa (on-policy TD control):

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
        Q(S, A) ← Q(S, A) + α[R + γ Q(S′, A′) − Q(S, A)]
        S ← S′; A ← A′
    until S is terminal
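The boxed pseudocode translates almost line for line into Python; this is a sketch, assuming a hypothetical episodic environment with reset() returning the initial state and step(a) returning (next_state, reward, done), plus a finite action list `actions`.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    # Explore with probability eps, otherwise act greedily w.r.t. Q.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = defaultdict(float)  # unseen entries default to 0, so Q(terminal, .) = 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, eps)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2  # continue with the A' that was backed up
    return Q

Note that the same ε-greedy choice produces both the behavior and the backed-up action A′, which is what makes Sarsa on-policy.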
Q: Can a policy result in infinite loops? What will MC policy iteration do then?
A: Yes. With such a policy the episode never terminates, so MC, which must wait for the end of the episode to compute the return, never gets to update its estimates or improve the policy. TD methods update at every step, so they can still learn within the episode.
Q-learning (off-policy TD control):

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    until S is terminal
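Under the same hypothetical conventions as the Sarsa sketch above, the only substantive change in a Q-learning sketch is the max over actions in the target:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = defaultdict(float)  # unseen entries default to 0, so Q(terminal, .) = 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily ...
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s2, r, done = env.step(a)
            # ... but back up the greedy (max) action value, regardless of
            # which action will actually be taken next: off-policy.
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q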
Sarsa update: Q(S, A) ← Q(S, A) + α[R + γ Q(S′, A′) − Q(S, A)]
with ε-greedy action selection, ε = 0.1
Expected Sarsa replaces the sampled next action with an expectation over π:
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)]
Q: Is Expected Sarsa on-policy or off-policy? What if π is the greedy deterministic policy?
A: As written, with π the same ε-greedy policy that generates behavior, it is on-policy. If π is instead the greedy deterministic policy while the behavior stays exploratory, the expectation collapses to max_a Q(S_{t+1}, a) and Expected Sarsa becomes off-policy, recovering Q-learning.
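A sketch of just the Expected Sarsa target under an ε-greedy π (same hypothetical conventions as the sketches above) makes the second question concrete: setting eps = 0 collapses the expectation to the max, i.e., the Q-learning target.

def expected_sarsa_target(Q, s2, actions, r, gamma=1.0, eps=0.1):
    # Probabilities of an epsilon-greedy pi: the greedy action gets
    # (1 - eps) + eps/|A|, every other action gets eps/|A|.
    greedy = max(actions, key=lambda a: Q[(s2, a)])
    expected_q = 0.0
    for a in actions:
        pi_a = (1 - eps) + eps / len(actions) if a == greedy else eps / len(actions)
        expected_q += pi_a * Q[(s2, a)]  # E_pi[Q(S', A') | S']
    return r + gamma * expected_q        # with eps = 0 this is r + gamma * max_a Q(S', a)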
Figure (cliff walking): interim performance (after n = 100 episodes) and asymptotic performance (n = 100,000 episodes) of Sarsa, Expected Sarsa, and Q-learning as a function of the step size α (x-axis: α from 0.1 to 1; y-axis: reward per episode, roughly −160 to −20).