Eligibility Traces
Chapter 12
Eligibility traces are:
- Another way of interpolating between MC and TD methods
- A way of implementing compound λ-return targets
- A basic mechanistic idea: a short-term, fading memory
- A new style of algorithm development/analysis: the forward-view ⇔ backward-view transformation
  - Forward view: conceptually simple; good for theory and intuition
  - Backward view: computationally congenial implementation of the forward view
Unified View
(Figure: the space of methods arranged by backup width (sample backups vs. full backups) and backup height/depth (one-step bootstrapping vs. full returns); temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search occupy the corners.)
Multi-step bootstrapping
Recall n-step targets
For example, in the episodic case, with linear function approximation:

2-step target:
$$G^{(2)}_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 \theta_{t+1}^\top \phi_{t+2}$$

n-step target:
$$G^{(n)}_t \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \theta_{t+n-1}^\top \phi_{t+n}$$

with everything after termination taken as zero, and the n-step return becoming the full return near the end of the episode ($G^{(n)}_t \doteq G_t$ if $t + n \geq T$).
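A minimal NumPy sketch (not from the slides; the function name and arguments are illustrative) of how such an n-step target could be computed in the linear case:

```python
import numpy as np

def n_step_target(rewards, phi_tn, theta, gamma, n):
    """n-step bootstrapped target G_t^(n) with linear function approximation.

    rewards : the n rewards observed after time t, [R_{t+1}, ..., R_{t+n}]
    phi_tn  : feature vector of the state reached n steps later, phi_{t+n}
    theta   : weights used to bootstrap (theta_{t+n-1} in the slides)
    """
    g = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))  # discounted reward sum
    g += (gamma ** n) * theta.dot(phi_tn)                         # bootstrap from v-hat
    return g
```

If fewer than n steps remain before termination, one would instead return the full Monte Carlo return, matching the convention above.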
Any set of update targets can be averaged to produce new compound update targets
- For example, half a 2-step plus half a 4-step
- Called a compound backup
- Draw each component
- Label with the weights for that component

A compound backup:
$$U_t = \tfrac{1}{2} G^{(2)}_t + \tfrac{1}{2} G^{(4)}_t$$
The λ-return is a compound update target
The λ-return is a target that averages all n-step targets, each weighted by $\lambda^{n-1}$:
$$G^{\lambda}_t \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$$
(Backup diagram for TD(λ), the λ-return: the 1-step, 2-step, 3-step, ... components get weights $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2, \ldots$, and the final full-return component gets weight $\lambda^{T-t-1}$.)
λ-return Weighting Function
Until termination the n-step returns are weighted by $(1-\lambda)\lambda^{n-1}$; after termination all remaining weight goes to the actual return:
$$G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{T-t-1} G_t$$
(Figure: the weight given to each n-step return as a function of time between $t$ and $T$; the weight given to the 3-step return is $(1-\lambda)\lambda^2$, weights decay by $\lambda$ per step, the weight given to the actual, final return is $\lambda^{T-t-1}$, and the total area is 1.)
Relation to TD(0) and MC
The λ-return can be rewritten as:
$$G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{T-t-1} G_t$$
If λ = 1, you get the MC target:
$$G^{\lambda}_t = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} G^{(n)}_t + 1^{T-t-1} G_t = G_t$$
If λ = 0, you get the TD(0) target:
$$G^{\lambda}_t = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} G^{(n)}_t + 0^{T-t-1} G_t = G^{(1)}_t$$
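A small Python sketch (illustrative names, not from the slides) that builds the episodic λ-return from a state's n-step returns; the λ = 0 and λ = 1 limits above fall out directly:

```python
def lambda_return(n_step_returns, full_return, lam):
    """Episodic lambda-return G_t^lambda for one state.

    n_step_returns : [G_t^(1), ..., G_t^(T-t-1)], the bootstrapped n-step targets
    full_return    : G_t, the actual return through termination
    """
    g = 0.0
    for n, g_n in enumerate(n_step_returns, start=1):
        g += (1.0 - lam) * (lam ** (n - 1)) * g_n    # weight (1 - lambda) * lambda^(n-1)
    g += (lam ** len(n_step_returns)) * full_return  # remaining weight lambda^(T-t-1)
    return g

# lam = 0.0 returns n_step_returns[0], the TD(0) target; lam = 1.0 returns full_return, the MC target.
```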
The off-line λ-return “algorithm”
- Wait until the end of the episode (offline)
- Then go back over the time steps, updating:
$$\theta_{t+1} \doteq \theta_t + \alpha \left[ G^{\lambda}_t - \hat v(S_t, \theta_t) \right] \nabla \hat v(S_t, \theta_t), \qquad t = 0, \ldots, T-1$$
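A minimal sketch of that post-episode sweep for the linear case (assuming NumPy arrays; the helper name and arguments are illustrative, and the λ-returns are assumed to be precomputed, e.g. with `lambda_return` above):

```python
import numpy as np

def offline_lambda_return_updates(theta, alpha, lambda_returns, features):
    """Apply the off-line lambda-return updates after an episode ends (linear v-hat).

    lambda_returns : [G_0^lambda, ..., G_{T-1}^lambda]
    features       : [phi_0, ..., phi_{T-1}], feature vectors of the visited states
    """
    theta = np.array(theta, dtype=float, copy=True)
    for g_lam, phi in zip(lambda_returns, features):
        # semi-gradient update toward the lambda-return target
        theta += alpha * (g_lam - theta.dot(phi)) * phi
    return theta
```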
The λ-return algorithm performs similarly to the n-step algorithms.
(Figure: RMS error at the end of the first 10 episodes as a function of α; left panel, n-step TD methods from Chapter 7 with n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512; right panel, the off-line λ-return algorithm with λ = 0, .4, .8, .9, .95, .975, .99, 1.)
- Intermediate λ is best (just like intermediate n is best)
- The λ-return is slightly better than n-step returns
The forward view looks forward from the state being updated to future states and rewards
(Figure: the forward view; from the state $S_t$ being updated we look forward in time to the future states $S_{t+1}, S_{t+2}, S_{t+3}, \ldots$ and rewards $R_{t+1}, R_{t+2}, R_{t+3}, \ldots, R_T$.)
The backward view looks back to the recently visited states (marked by eligibility traces)
- Shout the TD error backwards
- The traces fade with temporal distance by γλ
(Figure: the backward view; at each step the TD error is sent back to the recently visited states $S_{t-1}, S_{t-2}, S_{t-3}, \ldots$, each marked with an eligibility trace $e_t$ that fades with temporal distance.)
Demo
Here we are marking state-action pairs with a replacing eligibility trace
Eligibility traces (mechanism)
- The forward view was for theory
- The backward view is for mechanism
- New memory vector called the eligibility trace
- On each step, decay each component by γλ and increment the trace for the current state by 1
- Accumulating trace
$$e_t \in \mathbb{R}^n, \; e_t \geq 0, \qquad e_0 \doteq 0, \qquad e_t \doteq \nabla \hat v(S_t, \theta_t) + \gamma\lambda e_{t-1}$$
(the trace has the same shape as θ)
(Figure: an accumulating eligibility trace rising at the times of visits to a state and fading in between.)
The Semi-gradient TD(λ) algorithm
$$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t$$
$$\delta_t \doteq R_{t+1} + \gamma \hat v(S_{t+1}, \theta_t) - \hat v(S_t, \theta_t)$$
$$e_0 \doteq 0, \qquad e_t \doteq \nabla \hat v(S_t, \theta_t) + \gamma\lambda e_{t-1}$$
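A per-transition sketch of semi-gradient TD(λ) for the linear case (illustrative function name and signature; `e` is reset to zeros at the start of each episode):

```python
import numpy as np

def td_lambda_step(theta, e, phi_t, phi_next, reward, alpha, gamma, lam, terminal=False):
    """One step of semi-gradient TD(lambda) with an accumulating trace and linear v-hat."""
    e = gamma * lam * e + phi_t                         # fade the trace, add grad v-hat = phi_t
    v_next = 0.0 if terminal else theta.dot(phi_next)   # value of S_{t+1}, zero at termination
    delta = reward + gamma * v_next - theta.dot(phi_t)  # TD error
    theta = theta + alpha * delta * e                   # shout delta back along the trace
    return theta, e
```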
TD(λ) performs similarly to the off-line λ-return algorithm, but slightly worse, particularly at high α.
(Figure: RMS error at the end of the first 10 episodes as a function of α; left panel, the off-line λ-return algorithm from the previous section; right panel, TD(λ); curves for λ = 0, .4, .8, .9, .95, .975, .99, 1.)
Can we do better? Can we update online?
The online λ-return algorithm performs best of all.
(Figure 12.7: RMS error at the end of the first 10 episodes on the tabular 19-state random walk task, as a function of α; left panel, the off-line λ-return algorithm; right panel, the on-line λ-return algorithm, which is identical to true online TD(λ); curves for λ = 0, .4, .8, .9, .95, .975, .99, 1.)
The online λ-return algorithm uses a truncated λ-return as its target:
$$G^{\lambda|h}_t \doteq (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{h-t-1} G^{(h-t)}_t, \qquad 0 \leq t < h \leq T$$
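A direct translation of that definition into code (an illustrative helper, not from the slides; the n-step returns are assumed to be available up to the horizon):

```python
def truncated_lambda_return(n_step_returns, lam, t, h):
    """Truncated lambda-return G_t^(lambda|h), bootstrapping no further than horizon h.

    n_step_returns : [G_t^(1), G_t^(2), ...] with at least h - t entries
    """
    k = h - t                                          # number of steps to the horizon
    g = 0.0
    for n in range(1, k):                              # n = 1, ..., h - t - 1
        g += (1.0 - lam) * (lam ** (n - 1)) * n_step_returns[n - 1]
    g += (lam ** (k - 1)) * n_step_returns[k - 1]      # final term uses the (h - t)-step return
    return g
```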
(Figure: the forward view truncated at the horizon, here h = t + 3; from $S_t$ we look forward only to $S_{t+1}, S_{t+2}, S_{t+3}$ and the rewards $R_{t+1}, R_{t+2}, R_{t+3}$.)
$$\theta^h_{t+1} \doteq \theta^h_t + \alpha \left[ G^{\lambda|h}_t - \hat v(S_t, \theta^h_t) \right] \nabla \hat v(S_t, \theta^h_t)$$
There is a separate θ sequence for each horizon h!
The online λ-return algorithm
$$\theta^h_{t+1} \doteq \theta^h_t + \alpha \left[ G^{\lambda|h}_t - \hat v(S_t, \theta^h_t) \right] \nabla \hat v(S_t, \theta^h_t)$$
There is a separate θ sequence for each horizon h!
$$\begin{array}{lllll}
\theta_0 & & & & \\
\theta^1_0 & \theta^1_1 & & & \\
\theta^2_0 & \theta^2_1 & \theta^2_2 & & \\
\theta^3_0 & \theta^3_1 & \theta^3_2 & \theta^3_3 & \\
\;\vdots & \;\vdots & \;\vdots & \;\vdots & \ddots \\
\theta^T_0 & \theta^T_1 & \theta^T_2 & \theta^T_3 & \cdots \;\; \theta^T_T
\end{array}$$
$$h = 1: \quad \theta^1_1 \doteq \theta^1_0 + \alpha \left[ G^{\lambda|1}_0 - \hat v(S_0, \theta^1_0) \right] \nabla \hat v(S_0, \theta^1_0)$$
$$h = 2: \quad \theta^2_1 \doteq \theta^2_0 + \alpha \left[ G^{\lambda|2}_0 - \hat v(S_0, \theta^2_0) \right] \nabla \hat v(S_0, \theta^2_0),$$
$$\qquad\quad\; \theta^2_2 \doteq \theta^2_1 + \alpha \left[ G^{\lambda|2}_1 - \hat v(S_1, \theta^2_1) \right] \nabla \hat v(S_1, \theta^2_1)$$
$$h = 3: \quad \theta^3_1 \doteq \theta^3_0 + \alpha \left[ G^{\lambda|3}_0 - \hat v(S_0, \theta^3_0) \right] \nabla \hat v(S_0, \theta^3_0),$$
$$\qquad\quad\; \theta^3_2 \doteq \theta^3_1 + \alpha \left[ G^{\lambda|3}_1 - \hat v(S_1, \theta^3_1) \right] \nabla \hat v(S_1, \theta^3_1),$$
$$\qquad\quad\; \theta^3_3 \doteq \theta^3_2 + \alpha \left[ G^{\lambda|3}_2 - \hat v(S_2, \theta^3_2) \right] \nabla \hat v(S_2, \theta^3_2)$$
…
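To make the cost concrete, here is a rough sketch (illustrative names, not from the slides; the truncated returns are assumed to come from a helper such as `truncated_lambda_return` above) of the pass the online λ-return algorithm performs at one horizon h. It redoes such a pass at every step, so its per-episode computation grows with episode length:

```python
def online_lambda_return_pass(theta_init, alpha, features, targets):
    """Redo all updates up to the current horizon h (linear v-hat).

    theta_init : weights from the start of the episode (theta_0^h for every h)
    features   : [phi_0, ..., phi_{h-1}]
    targets    : [G_0^(lambda|h), ..., G_{h-1}^(lambda|h)], truncated lambda-returns
    """
    theta = theta_init.copy()
    for phi, g in zip(features, targets):
        theta = theta + alpha * (g - theta.dot(phi)) * phi
    return theta                       # this is theta_h^h, the weights used at time h
```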
True online TD(λ) computes just the diagonal, cheaply (for linear FA)
True online TD(λ)
$$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t + \alpha \left( \theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t \right) (e_t - \phi_t)$$
$$e_t \doteq \gamma\lambda e_{t-1} + \left( 1 - \alpha\gamma\lambda\, e_{t-1}^\top \phi_t \right) \phi_t \qquad \text{(dutch trace)}$$
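A per-transition sketch of those equations (illustrative names; `v_old` carries $\theta_{t-1}^\top \phi_t$ forward and is 0 at the start of an episode, and `e` starts as zeros):

```python
import numpy as np

def true_online_td_lambda_step(theta, e, v_old, phi, phi_next, reward,
                               alpha, gamma, lam, terminal=False):
    """One step of true online TD(lambda) with a dutch trace (linear v-hat)."""
    v = theta.dot(phi)                                   # v-hat(S_t, theta_t)
    v_next = 0.0 if terminal else theta.dot(phi_next)    # v-hat(S_{t+1}, theta_t)
    delta = reward + gamma * v_next - v                  # TD error
    e = gamma * lam * e + (1.0 - alpha * gamma * lam * e.dot(phi)) * phi   # dutch trace
    theta = theta + alpha * delta * e + alpha * (v - v_old) * (e - phi)    # extra correction term
    return theta, e, v_next                              # v_next becomes v_old on the next step
```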
Accumulating, Dutch, and Replacing Traces
All traces fade the same way, but they increment differently!
(Figure: accumulating traces, dutch traces (α = 0.5), and replacing traces for the same sequence of visit times to a state.)
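The three increment rules, as a small sketch (an illustrative function, not from the slides; the replacing rule assumes binary features):

```python
import numpy as np

def update_trace(e, phi, alpha, gamma, lam, kind="accumulating"):
    """Fade the trace by gamma*lambda, then apply one of the three increments."""
    e = gamma * lam * e                                   # all traces fade the same way
    if kind == "accumulating":
        e = e + phi                                       # add the feature vector
    elif kind == "dutch":
        e = e + (1.0 - alpha * e.dot(phi)) * phi          # step-size-tempered increment
    elif kind == "replacing":
        e = np.where(phi > 0, phi, e)                     # reset visited components to 1
    return e
```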
The simplest example of deriving a backward view from a forward view
- Monte Carlo learning of a final target
- Will derive dutch traces
- Showing that dutch traces really are not about TD
- They are about efficiently implementing online algorithms

MC:
$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \theta_t^\top \phi_t \right) \phi_t, \qquad t = 0, \ldots, T-1$$
(all updates done at time $T$; $\alpha_t$ is the step size)
The Problem: predict a final target $Z$ with linear function approximation.
(Figure: a timeline of one episode and the start of the next; the data are the feature vectors $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$ arriving at times $1, 2, \ldots, T-1$, followed by the target $Z$ at time $T$; the weights are $\theta_0$ throughout the episode and $\theta_T$ afterwards; the predictions are $\theta_0^\top \phi_0, \theta_0^\top \phi_1, \theta_0^\top \phi_2, \ldots, \theta_0^\top \phi_{T-1}$, each meant to approximate $Z$.)
Computation per step (including memory) must be independent of the span. In general, the predictive span is the number of steps between making a prediction and observing its outcome.

MC:
$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \theta_t^\top \phi_t \right) \phi_t, \qquad t = 0, \ldots, T-1 \quad \text{(all done at time } T\text{; } \alpha_t \text{ is the step size)}$$

- Is MC independent of span? No.
- What is the span? $T$.
- The computation and memory needed at step $T$ increase with $T$ ⇒ not independent of span (IoS).
Given: $\theta_0$; $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$; $Z$

MC algorithm:
$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \theta_t^\top \phi_t \right) \phi_t, \qquad t = 0, \ldots, T-1$$

Equivalent independent-of-span algorithm:
$$\theta_T \doteq a_{T-1} + Z e_{T-1}$$
$$a_0 \doteq \theta_0, \quad \text{then } a_t \doteq a_{t-1} - \alpha_t \phi_t \phi_t^\top a_{t-1}, \qquad t = 1, \ldots, T-1$$
$$e_0 \doteq \alpha_0 \phi_0, \quad \text{then } e_t \doteq e_{t-1} - \alpha_t \phi_t \phi_t^\top e_{t-1} + \alpha_t \phi_t, \qquad t = 1, \ldots, T-1$$
where $a_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^n$ are auxiliary short-term-memory vectors.

Proved: both algorithms produce exactly the same final weight vector $\theta_T$.
Derivation (with $F_t \doteq I - \alpha_t \phi_t \phi_t^\top$):
$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \theta_t^\top \phi_t \right) \phi_t, \qquad t = 0, \ldots, T-1 \tag{1}$$
$$\begin{aligned}
\theta_T &= \theta_{T-1} + \alpha_{T-1} \left( Z - \theta_{T-1}^\top \phi_{T-1} \right) \phi_{T-1} \\
&= \left( I - \alpha_{T-1} \phi_{T-1} \phi_{T-1}^\top \right) \theta_{T-1} + Z \alpha_{T-1} \phi_{T-1} \\
&= F_{T-1} \theta_{T-1} + Z \alpha_{T-1} \phi_{T-1} \\
&= F_{T-1} \left( F_{T-2} \theta_{T-2} + Z \alpha_{T-2} \phi_{T-2} \right) + Z \alpha_{T-1} \phi_{T-1} \\
&= F_{T-1} F_{T-2} \theta_{T-2} + Z \left( F_{T-1} \alpha_{T-2} \phi_{T-2} + \alpha_{T-1} \phi_{T-1} \right) \\
&= F_{T-1} F_{T-2} \left( F_{T-3} \theta_{T-3} + Z \alpha_{T-3} \phi_{T-3} \right) + Z \left( F_{T-1} \alpha_{T-2} \phi_{T-2} + \alpha_{T-1} \phi_{T-1} \right) \\
&\;\;\vdots \\
&= \underbrace{F_{T-1} F_{T-2} \cdots F_0 \theta_0}_{a_{T-1}} + Z \underbrace{\sum_{k=0}^{T-1} F_{T-1} F_{T-2} \cdots F_{k+1} \alpha_k \phi_k}_{e_{T-1}} = a_{T-1} + Z e_{T-1} \tag{2}
\end{aligned}$$
where $a_{T-1}, e_{T-1} \in \mathbb{R}^n$ are auxiliary short-term-memory vectors that can be computed incrementally:
$$\begin{aligned}
e_t &\doteq \sum_{k=0}^{t} F_t F_{t-1} \cdots F_{k+1} \alpha_k \phi_k, \qquad t = 0, \ldots, T-1 \\
&= \sum_{k=0}^{t-1} F_t F_{t-1} \cdots F_{k+1} \alpha_k \phi_k + \alpha_t \phi_t \\
&= F_t \sum_{k=0}^{t-1} F_{t-1} F_{t-2} \cdots F_{k+1} \alpha_k \phi_k + \alpha_t \phi_t \\
&= F_t e_{t-1} + \alpha_t \phi_t \\
&= e_{t-1} - \alpha_t \phi_t \phi_t^\top e_{t-1} + \alpha_t \phi_t, \qquad t = 1, \ldots, T-1 \tag{3}
\end{aligned}$$
$$a_t \doteq F_t F_{t-1} \cdots F_0 \theta_0 = F_t a_{t-1} = a_{t-1} - \alpha_t \phi_t \phi_t^\top a_{t-1}, \qquad t = 1, \ldots, T-1 \tag{4}$$
Given: $\theta_0$; $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$; $Z$
MC: the forward update (1).
Equivalent independent-of-span algorithm: equations (2)-(4), maintaining only $a_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^n$ during the episode.
Proved: both produce exactly the same final weights $\theta_T$.
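A quick numerical check of that equivalence (a standalone sketch with made-up data; nothing here comes from the slides beyond the two update rules):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 10, 4
phi = rng.normal(size=(T, n))            # features phi_0, ..., phi_{T-1}
alphas = np.full(T, 0.1)                 # per-step step sizes alpha_t
theta0 = rng.normal(size=n)
Z = 2.5                                  # final target, revealed at time T

# Forward (MC) version: all T updates done at time T, so span grows with T.
theta = theta0.copy()
for t in range(T):
    theta = theta + alphas[t] * (Z - theta.dot(phi[t])) * phi[t]

# Backward (independent-of-span) version: only a_t and e_t kept during the episode.
a, e = theta0.copy(), alphas[0] * phi[0]
for t in range(1, T):
    a = a - alphas[t] * phi[t] * phi[t].dot(a)
    e = e - alphas[t] * phi[t] * phi[t].dot(e) + alphas[t] * phi[t]
theta_ios = a + Z * e

print(np.allclose(theta, theta_ios))     # True: identical final weights
```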
Conclusions from the forward-backward derivation
- We have derived dutch eligibility traces from an MC update, without any TD learning
- Dutch traces, and in fact all eligibility traces, are not about TD; they are about efficient multi-step learning
- We can derive new, non-obvious algorithms that are equivalent to obvious algorithms but have better computational properties
- This is a different type of machine-learning result: an algorithm equivalence
Conclusions regarding Eligibility Traces
- Provide an efficient, incremental way to combine MC and TD
  - Includes advantages of MC (better when non-Markov)
  - Includes advantages of TD (faster; computationally congenial)
- True online TD(λ) is new and best
  - Is exactly equivalent to the online λ-return algorithm
- Three varieties of traces: accumulating, dutch, (replacing)
- Traces extend to control, in on-policy and off-policy forms
- Traces do have a small cost in computation (roughly 2×)