Temporal Difference Learning
CMPUT 366: Intelligent Systems
S&B §6.0-6.2, §6.4-6.5
Lecture Overview

1. Recap
2. TD Prediction
3. On-Policy TD Control (Sarsa)
4. Off-Policy TD Control (Q-Learning)

Recap: Monte Carlo RL
- Monte Carlo methods learn value estimates by averaging actual returns over sampled trajectories
- On-policy Monte Carlo control estimates the Q-values of the soft policy being followed (e.g., ε-greedy)
- Off-policy Monte Carlo control estimates a target policy's Q-values based on episodes generated by a different behaviour policy
TD Prediction

Suppose we want to learn to play a game, but we don't know the rules; we are only told when we win or lose.

Question: Can we apply dynamic programming in this scenario?
Bootstrapping: updating value estimates based partly on estimates from previous iterations.
Dynamic Programming (bootstrapping; requires full dynamics):
$$V(S_t) \leftarrow \sum_a \pi(a \mid S_t) \sum_{s',r} p(s', r \mid S_t, a)\,[r + \gamma V(s')]$$

Monte Carlo (no bootstrapping; learns from experience):
$$V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$$

TD(0) (bootstrapping; learns from experience):
$$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$
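To make the TD(0) update concrete, here is one update computed by hand (the numbers are illustrative, not from the slides). Suppose α = 0.1, γ = 0.9, the current estimates are V(St) = 0.5 and V(St+1) = 0.8, and the observed reward is Rt+1 = 1:

$$V(S_t) \leftarrow 0.5 + 0.1\,[1 + 0.9 \cdot 0.8 - 0.5] = 0.5 + 0.1 \cdot 1.22 = 0.622$$

Unlike the Monte Carlo update, this can be computed at time t+1, without waiting for the episode to end.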
$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
- Monte Carlo: approximate because the expectation 𝔼 is estimated from sample returns
- Dynamic programming: approximate because vπ(St+1) is not known; the current estimate V(St+1) is used instead
- TD(0): approximate for both reasons: the expectation 𝔼 is sampled, and vπ(St+1) is not known
Tabular TD(0) for estimating vπ:

    Input: the policy π to be evaluated
    Algorithm parameter: step size α ∈ (0, 1]
    Initialize V(s), for all s ∈ S⁺, arbitrarily except that V(terminal) = 0
    Loop for each episode:
        Initialize S
        Loop for each step of episode:
            A ← action given by π for S
            Take action A, observe R, S′
            V(S) ← V(S) + α[R + γV(S′) − V(S)]
            S ← S′
        until S is terminal
Question: What information does this algorithm use?
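For reference, a minimal Python sketch of tabular TD(0). The env interface (reset() returning a state, step(action) returning (next_state, reward, done)) and the callable policy are assumptions for illustration, not part of the slides:

from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation for a hypothetical env interface."""
    V = defaultdict(float)  # terminal states are never updated, so they stay 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(S) toward the bootstrapped target R + γV(S′)
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V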
On-Policy TD Control (Sarsa)

[Figure: alternating policy evaluation and policy improvement steps, π → Q → π → Q → ... (generalized policy iteration)]
Sarsa (on-policy TD control) for estimating Q ≈ q*:

    Algorithm parameters: step size α ∈ (0, 1], small ε > 0
    Initialize Q(s, a), for all s ∈ S⁺, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
    Loop for each episode:
        Initialize S
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Loop for each step of episode:
            Take action A, observe R, S′
            Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
            Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
            S ← S′; A ← A′
        until S is terminal
Question: What information does this algorithm use?
Question: Will this estimate the Q-values of the optimal policy?
No: it estimates the Q-values of the policies that generate the episodes, which are ε-greedy.
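A Python sketch of Sarsa under the same assumed env interface as the TD(0) sketch; actions is assumed to be a list of the legal actions, and epsilon_greedy is a helper introduced here for illustration:

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability ε explore uniformly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa: on-policy TD control."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target: bootstrap from the action A′ the agent will actually take
            target = reward + gamma * (0.0 if done else Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q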
Off-Policy TD Control (Q-Learning)

Q-learning (off-policy TD control) for estimating π ≈ π*:

    Algorithm parameters: step size α ∈ (0, 1], small ε > 0
    Initialize Q(s, a), for all s ∈ S⁺, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
    Loop for each episode:
        Initialize S
        Loop for each step of episode:
            Choose A from S using policy derived from Q (e.g., ε-greedy)
            Take action A, observe R, S′
            Q(S, A) ← Q(S, A) + α[R + γ maxₐ Q(S′, a) − Q(S, A)]
            S ← S′
        until S is terminal

Question: What information does this algorithm use?
Question: Why aren't we estimating the policy π explicitly?
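A Python sketch of Q-learning under the same assumptions, reusing epsilon_greedy from the Sarsa sketch. The only change from Sarsa is the target, which bootstraps from the greedy action rather than the action actually taken next:

from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning: off-policy TD control."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)  # behaviour policy
            next_state, reward, done = env.step(action)
            # Off-policy target: bootstrap from max_a Q(S′, a), regardless of
            # which action the ε-greedy behaviour policy takes next
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * (0.0 if done else best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q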
Example: Cliff Walking (γ = 1, undiscounted)

[Figure: 4x12 gridworld with start S and goal G in the bottom corners and "The Cliff" between them. Every step gives R = −1; stepping into the cliff gives R = −100 and sends the agent back to the start. The safer path runs along the top of the grid; the optimal path runs directly along the cliff edge.]

[Plot: sum of rewards during episode vs. episodes (100 to 500), for Sarsa and Q-learning.]
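To try both algorithms on this example, here is a minimal cliff-walking environment matching the env interface assumed in the sketches above; the 4x12 layout and rewards follow the figure (each step R = -1; falling off the cliff gives R = -100 and returns the agent to the start):

class CliffWalk:
    """4x12 cliff-walking gridworld: start at (3, 0), goal at (3, 11),
    cliff cells at (3, 1) through (3, 10)."""
    ACTIONS = ["up", "down", "left", "right"]

    def reset(self):
        self.pos = (3, 0)
        return self.pos

    def step(self, action):
        row, col = self.pos
        row += {"up": -1, "down": 1}.get(action, 0)
        col += {"left": -1, "right": 1}.get(action, 0)
        row = min(max(row, 0), 3)        # moves off the grid leave the agent in place
        col = min(max(col, 0), 11)
        if row == 3 and 1 <= col <= 10:  # fell off the cliff: R = -100, back to start
            self.pos = (3, 0)
            return self.pos, -100, False
        self.pos = (row, col)
        return self.pos, -1, self.pos == (3, 11)

env = CliffWalk()
Q = sarsa(env, CliffWalk.ACTIONS, num_episodes=500)

Running both learners on this environment should reproduce the qualitative result in the plot: Q-learning heads for the cliff-edge path, while Sarsa's ε-greedy behaviour favours the safer path.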
Summary

- TD methods learn from experience, like Monte Carlo, while bootstrapping, like dynamic programming (which requires full dynamics)
- Sarsa (on-policy TD control) estimates the Q-values of the ε-greedy policy it actually follows
- Q-learning (off-policy TD control) directly estimates the Q-values of the optimal policy
- Q-Learning estimates the optimal policy, but Sarsa consistently earns higher rewards during learning on the cliff-walking task