New Temporal-Difference Methods Based on Gradient Descent
Rich Sutton, Hamid Maei, Doina Precup (McGill), Shalabh Bhatnagar (IISc Bangalore), Csaba Szepesvari, Eric Wiewiora, David Silver
Outline: The promise and problems of TD learning
We approximate the value of each state as a linear function of its features:

V(s) = E[outcome | s]

Vθ(s) ≈ V(s),  θ ∈ ℝⁿ
Vθ(s) = θᵀφs,  φs ∈ ℝⁿ

θ - the modifiable parameter vector
φs - the feature vector for state s
[Figure: a sparse binary feature vector φs for state s is matched against the parameter vector θ; the value estimate is the sum of the parameters at the active features, e.g. Vθ(s) = θᵀφs = −2 + 0 + 5 = 3.]
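A minimal sketch of this computation (the feature and parameter values are hypothetical, chosen only to reproduce the figure's arithmetic):

    import numpy as np

    phi_s = np.array([1.0, 0.0, 1.0, 1.0])   # sparse binary features (hypothetical)
    theta = np.array([-2.0, 7.0, 0.0, 5.0])  # parameter vector (hypothetical)
    v_s = theta @ phi_s                      # Vθ(s) = θᵀφs = -2 + 0 + 5 = 3
    print(v_s)                               # 3.0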
[Figure: a state trajectory with per-step rewards; the target value (return) at each state is the sum of future rewards until the end of the trajectory, with discounting shrinking the effective horizon.]

V(s) = E[ Σ_{t=0}^∞ γᵗ rₜ | s₀ = s ]

γ - discount rate, 0 ≤ γ ≤ 1
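As an aside (a sketch, not from the slides), the return of a finite sampled trajectory can be computed with a backward pass:

    def discounted_return(rewards, gamma=0.9):
        """Sum of future rewards: r0 + gamma*r1 + gamma^2*r2 + ..."""
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    discounted_return([1, 2, 1, 1])  # 1 + 0.9*2 + 0.81*1 + 0.729*1 = 4.339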
The on-policy setting: training data comes as trajectories, or as transitions (s, r, s′), in feature terms (φ, r, φ′), where

ds - distribution of the first state s
rs - expected reward given s
Pss′ - probability of next state s′ given s

When following the policy, P and d are linked: d is the state distribution that P itself generates.

Linear TD(0):

θ ← θ + αδφ
δ = r + γθᵀφ′ − θᵀφ
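As a concrete sketch (not from the slides; the name and step sizes are illustrative), one linear TD(0) update from a single transition:

    import numpy as np

    def td0_update(theta, phi, r, phi_next, alpha=0.01, gamma=0.99):
        """One linear TD(0) update from a transition (phi, r, phi_next)."""
        delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
        return theta + alpha * delta * phi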
The off-policy setting: the same kinds of transitions, with ds, rs, and Pss′ as before, but now P and d are no longer linked: the states are sampled from a distribution d that does not match the dynamics P. TD(0) may diverge!
Baird's counterexample:

[Figure: a star problem with seven parameters. Five states have approximate values Vk(s) = θk(7) + 2θk(i), i = 1,…,5; a sixth state has Vk(s) = 2θk(7) + θk(6). Transition probabilities are 99% / 1%, and 100% into the terminal state.]

α = 0.01,  γ = 0.99,  θ₀ = (1, 1, 1, 1, 1, 10, 1)ᵀ,  r = 0 everywhere

[Figure: parameter values θk(i) over 5000 iterations k, on a log scale broken at ±1; the curves θk(7), θk(1)-θk(5), and θk(6) all diverge.]
The same divergence with deterministic updates (the θ → 2θ problem: one parameter, φ = 1, φ′ = 2, r = 0, γ taken as 1 for simplicity):

TD update:    δ = r + γθᵀφ′ − θᵀφ = 0 + 2θ − θ = θ
              ∆θ = αδφ = αθ

TD fixpoint:  θ∗ = 0

Each update multiplies θ by (1 + α), so θ moves steadily away from the fixpoint.
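A quick numeric check of this divergence (a sketch under the assumptions above: φ = 1, φ′ = 2, r = 0, γ = 1, α = 0.1):

    theta, alpha = 1.0, 0.1
    for k in range(50):
        delta = 0.0 + 1.0 * 2.0 * theta - 1.0 * theta  # r + gamma*theta*phi' - theta*phi = theta
        theta += alpha * delta * 1.0                   # theta *= (1 + alpha)
    print(theta)  # ~117.4 after 50 updates, running away from theta* = 0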
Is TD(0) a true gradient-descent method? Gradient descent on an objective J(θ) would update θ ← θ − α∇θJt(θ), whereas TD(0) updates

δ = r + γθᵀφ′ − θᵀφ
∆θ = αδφ

Assume there is a J such that ∂J/∂θi = δφi. Then look at the second derivatives:

∂²J/∂θj∂θi = ∂(δφi)/∂θj = (γφ′j − φj)φi
∂²J/∂θi∂θj = ∂(δφj)/∂θi = (γφ′i − φi)φj

Real second derivatives must be symmetric, ∂²J/∂θj∂θi = ∂²J/∂θi∂θj, but these two expressions differ in general. Contradiction! The TD(0) update is not the gradient of any objective function.
The Bellman operator T:  V = r + γPV = TV
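A quick illustration (a sketch, not from the slides) of solving the Bellman equation directly for the true values, assuming the transition matrix P and expected-reward vector r are known:

    import numpy as np

    def true_values(P, r, gamma=0.99):
        """Solve V = r + gamma * P @ V, i.e. V = (I - gamma*P)^{-1} r."""
        return np.linalg.solve(np.eye(len(r)) - gamma * P, r)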
Mean-Square Value Error:

MSE(θ) = Σs ds (Vθ(s) − V(s))² = ‖Vθ − V‖²_D

Mean-Square Bellman Error:

MSBE(θ) = ‖Vθ − TVθ‖²_D
[Figure: the geometry of value-function space. The subspace {Φθ}, spanned by the feature vectors and weighted by the state-visitation distribution (D = diag(d)), contains all representable functions; the true value function lies outside it. Applying T to Vθ takes you outside the space; the projection Π takes you back into it. The distance from Vθ to TVθ is the RMSBE; the distance from Vθ to ΠTVθ is the RMSPBE.]

The TD fixpoint is exactly Vθ = ΠTVθ. Is this the basis for a better objective function?

Previous work on gradient methods for TD minimized the Bellman error (Baird 1995, 1999).
A counterexample for the Bellman-error objective:

[Figure: from state A, 50% / 50% branching; from B, 100% to termination with reward 1.]

True values: V(A) = 0.5, V(B) = 1.
Minimizing J(θ) = E[δ²] instead gives V(A) = 1/3, V(B) = 2/3.

[Figure: a second version in which A is split into A1 and A2, with all transitions 100% deterministic; it yields V(A) = 1/3, V(B) = 2/3.]
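To see where 1/3 and 2/3 come from, here is a worked minimization (my reconstruction, assuming rewards are 0 except the final 1, γ = 1, state weights d(A) = 2/3, d(B) = 1/3 from visit frequencies, and writing a = V(A), b = V(B)):

J(a, b) = (1/3)(b − a)² + (1/3)a² + (1/3)(1 − b)²
∂J/∂a = 0  ⟹  b = 2a
∂J/∂b = 0  ⟹  2b − a = 1
⟹  a = V(A) = 1/3,  b = V(B) = 2/3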
Two identities connect the geometry to expectations over transitions:

ΦᵀD(TVθ − Vθ) = E[δφ]
ΦᵀDΦ = E[φφᵀ]
Mean-Square Projected Bellman Error:

MSPBE(θ) = ‖Vθ − ΠTVθ‖²_D
         = ‖Π(Vθ − TVθ)‖²_D
         = (Π(Vθ − TVθ))ᵀ D (Π(Vθ − TVθ))
         = (Vθ − TVθ)ᵀ Πᵀ D Π (Vθ − TVθ)
         = (Vθ − TVθ)ᵀ D Φ (ΦᵀDΦ)⁻¹ ΦᵀD (Vθ − TVθ)
         = (ΦᵀD(TVθ − Vθ))ᵀ (ΦᵀDΦ)⁻¹ ΦᵀD(TVθ − Vθ)
         = E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ]

(the first step uses ΠVθ = Vθ; the step for ΠᵀDΠ uses Π = Φ(ΦᵀDΦ)⁻¹ΦᵀD). The gradient is

−½∇θ MSPBE(θ) = E[(φ − γφ′)φᵀ] E[φφᵀ]⁻¹ E[δφ] ≈ E[(φ − γφ′)φᵀ] w,

assuming w ≈ E[φφᵀ]⁻¹ E[δφ].
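The quadratic form E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ] also suggests a direct batch estimator of the MSPBE (a sketch, not from the slides; rows of Phi and PhiNext are the feature vectors of sampled transitions):

    import numpy as np

    def mspbe(theta, Phi, R, PhiNext, gamma=0.99):
        """Estimate E[delta*phi]' E[phi phi']^{-1} E[delta*phi] from a batch."""
        delta = R + gamma * PhiNext @ theta - Phi @ theta  # per-sample TD errors
        e_dphi = Phi.T @ delta / len(R)                    # estimate of E[delta * phi]
        e_phiphi = Phi.T @ Phi / len(R)                    # estimate of E[phi phi^T]
        return e_dphi @ np.linalg.solve(e_phiphi, e_dphi)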
Assuming a second learned weight vector w ≈ E[φφᵀ]⁻¹ E[δφ], we get the GTD-2 algorithm:

θ ← θ + α(φ − γφ′)(φᵀw)
w ← w + β(δ − φᵀw)φ
δ = r + γθᵀφ′ − θᵀφ

Estimating the troublesome expectation with a second learned vector w: this is the main trick!
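In code, one GTD-2 update might look like this (a sketch; the function name and step sizes are illustrative):

    import numpy as np

    def gtd2_update(theta, w, phi, r, phi_next, alpha=0.01, beta=0.1, gamma=0.99):
        """One GTD-2 update from a transition (phi, r, phi_next)."""
        delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
        theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
        w = w + beta * (delta - phi @ w) * phi  # LMS step toward E[phi phi']^{-1} E[delta phi]
        return theta, w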
The earlier GTD algorithm works the same way but descends the norm of the expected TD update, NEU(θ) = E[δφ]ᵀ E[δφ]:

−½∇θ NEU(θ) = E[(φ − γφ′)φᵀ] E[δφ] ≈ E[(φ − γφ′)φᵀ] w,

assuming w ≈ E[δφ].
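A matching sketch under the same conventions (here w tracks E[δφ] directly, so its update is a simple running average):

    import numpy as np

    def gtd_update(theta, w, phi, r, phi_next, alpha=0.01, beta=0.1, gamma=0.99):
        """One update of the original GTD algorithm."""
        delta = r + gamma * theta @ phi_next - theta @ phi
        theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
        w = w + beta * (delta * phi - w)  # w -> E[delta * phi]
        return theta, w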
The same MSPBE gradient can be rewritten to yield TD with gradient correction (TD-C):

−½∇θ MSPBE(θ) = E[(φ − γφ′)φᵀ] E[φφᵀ]⁻¹ E[δφ]
              = (E[φφᵀ] − γE[φ′φᵀ]) E[φφᵀ]⁻¹ E[δφ]
              = E[δφ] − γE[φ′φᵀ] E[φφᵀ]⁻¹ E[δφ]
              ≈ E[δφ] − γE[φ′φᵀ] w,

again assuming w ≈ E[φφᵀ]⁻¹ E[δφ]. Sampling this gives the update

θ ← θ + αδφ − αγφ′(φᵀw)

where the first term is the conventional TD(0) update and the second is the gradient-correction term, with w updated as in GTD-2.
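A matching sketch of one TD-C update (illustrative names; note the θ step is plain TD(0) plus the correction):

    import numpy as np

    def tdc_update(theta, w, phi, r, phi_next, alpha=0.01, beta=0.1, gamma=0.99):
        """One TD-C update: conventional TD(0) plus a gradient correction."""
        delta = r + gamma * theta @ phi_next - theta @ phi
        theta = theta + alpha * delta * phi - alpha * gamma * phi_next * (phi @ w)
        w = w + beta * (delta - phi @ w) * phi  # same w update as GTD-2
        return theta, w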
Convergence remarks: with appropriately decreasing step sizes (α, β → 0, with either α/β → 0 on two timescales or α = β → 0), the methods converge to the TD fixpoint, where E[δφ] → 0.
Empirical results, on two test problems:

[Figure: a 5-state random walk (states A-E, starting in the middle, reward 1 on termination at one end), evaluated with 3 different feature representations.]

[Figure: the Boyan chain (Boyan 1999), a 13-state chain in which each step advances one or two states (probabilities 0.5 / 0.5).]
[Figure: Random Walk, tabular features. RMSPBE learning curves over 200 episodes, and RMSPBE as a function of step size α, for GTD, GTD-2, TD-C, and TD.]

[Figure: Random Walk, inverted features. The same comparison over 500 episodes.]

[Figure: Random Walk, dependent features. The same comparison over 300 episodes.]

[Figure: Boyan Chain. The same comparison over 100 episodes.]
Summary of the learning-curve results: TD, TD-C > GTD-2 > GTD; sometimes TD > TD-C.
[Figure: NEU after training versus step size α (log-log axes), for TD, GTD, TD-C, and GTD-2.]
[Figure: parameter values versus sweeps on Baird's counterexample (log scale).]