SLIDE 1

Eligibility Traces

Chapter 12

SLIDE 2

Eligibility traces are:

  • Another way of interpolating between MC and TD methods
  • A way of implementing compound λ-return targets
  • A basic mechanistic idea: a short-term, fading memory
  • A new style of algorithm development/analysis: the forward-view ⇔ backward-view transformation
    • Forward view: conceptually simple, good for theory and intuition
    • Backward view: a computationally congenial implementation of the forward view

SLIDE 3

Unified View

[Figure: the space of RL methods arranged along two axes. The width of backup separates sample backups (temporal-difference learning, Monte Carlo) from full backups (dynamic programming, exhaustive search); the height (depth) of backup separates one-step methods (TD, DP) from full-episode methods (Monte Carlo, exhaustive search). Multi-step bootstrapping occupies the region between TD and Monte Carlo.]

SLIDE 4

Recall n-step targets

For example, in the episodic case, with linear function approximation:

2-step target:

$$G^{(2)}_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 \theta_{t+1}^\top \phi_{t+2}$$

n-step target:

$$G^{(n)}_t \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \theta_{t+n-1}^\top \phi_{t+n}$$

with rewards after termination taken as zero and the n-step return capped at the episode's end ($G^{(n)}_t \doteq G_t$ if $t + n \geq T$).
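A minimal sketch of this target in Python (the array layout, with `rewards[k]` holding $R_{k+1}$ and `phi[k]` holding $\phi_k$, and the single `theta` argument standing in for $\theta_{t+n-1}$, are assumptions for illustration):

```python
def n_step_return(rewards, phi, theta, t, n, gamma, T):
    """n-step target G_t^(n) with linear function approximation.

    Assumed layout: rewards[k] holds R_{k+1} and phi[k] holds the feature
    vector of state S_k (as numpy arrays). Rewards past termination are
    absent, so the return is capped: G_t^(n) = G_t when t + n >= T.
    """
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        g += gamma ** n * (theta @ phi[t + n])  # bootstrap from v_hat(S_{t+n})
    return g
```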

SLIDE 5

Any set of update targets can be averaged to produce new compound update targets.

For example, half a 2-step plus half a 4-step:

$$U_t \doteq \tfrac{1}{2} G^{(2)}_t + \tfrac{1}{2} G^{(4)}_t$$

This is called a compound backup. In the backup diagram, draw each component and label it with the weight for that component.

SLIDE 6

The λ-return is a compound update target

The λ-return is a target that averages all the n-step targets, each weighted by λⁿ⁻¹:

$$G^\lambda_t \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$$

[Figure: the TD(λ), λ-return backup diagram. The 1-step, 2-step, 3-step, ... components receive weights $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, ..., summing to 1; the component that runs to termination receives all the remaining weight, $\lambda^{T-t-1}$.]

SLIDE 7

λ-return Weighting Function

$$G^\lambda_t = \underbrace{(1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t}_{\text{until termination}} + \underbrace{\lambda^{T-t-1} G_t}_{\text{after termination}}$$

[Figure: the weight on each n-step return between times t and T. Weights start at $1-\lambda$ and decay by $\lambda$ per step, so the weight given to the 3-step return is $(1-\lambda)\lambda^2$; the weight given to the actual, final return is $\lambda^{T-t-1}$; the total area is 1.]
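In code, the episodic λ-return is just this weighted sum over n-step targets. A sketch reusing the `n_step_return` function from the earlier slide (same assumed array layout):

```python
def lambda_return(rewards, phi, theta, t, lam, gamma, T):
    """G_t^lambda: weight (1-lam)*lam^(n-1) on each n-step return until
    termination, plus the leftover weight lam^(T-t-1) on the full return."""
    g = (1 - lam) * sum(
        lam ** (n - 1) * n_step_return(rewards, phi, theta, t, n, gamma, T)
        for n in range(1, T - t)  # n = 1, ..., T-t-1
    )
    # n = T-t reaches termination, so this last term is the actual return G_t
    return g + lam ** (T - t - 1) * n_step_return(rewards, phi, theta, t, T - t, gamma, T)
```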

SLIDE 8

Relation to TD(0) and MC

The λ-return can be rewritten as (the first term runs until termination, the second covers after termination):

$$G^\lambda_t = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{T-t-1} G_t.$$

If λ = 1, you get the MC target:

$$G^\lambda_t = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} G^{(n)}_t + 1^{T-t-1} G_t = G_t.$$

If λ = 0, you get the TD(0) target:

$$G^\lambda_t = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} G^{(n)}_t + 0^{T-t-1} G_t = G^{(1)}_t$$

(with the convention $0^0 = 1$, only the $n = 1$ term survives).

SLIDE 9

The off-line λ-return “algorithm”

Wait until the end of the episode (offline); then go back over the time steps, updating:

$$\theta_{t+1} \doteq \theta_t + \alpha \left[ G^\lambda_t - \hat{v}(S_t, \theta_t) \right] \nabla \hat{v}(S_t, \theta_t), \qquad t = 0, \ldots, T-1.$$
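A sketch of one episode of these offline updates with linear function approximation, so that $\hat{v}(S_t, \theta) = \theta^\top \phi_t$ and $\nabla \hat{v}(S_t, \theta) = \phi_t$; it reuses `lambda_return` from above. (Which weight vector the λ-return bootstraps from is left implicit on the slide; this sketch uses the weights current at each update.)

```python
def offline_lambda_return_episode(phi, rewards, theta, alpha, lam, gamma):
    """After the episode ends, sweep t = 0, ..., T-1 and update theta."""
    T = len(rewards)
    theta = theta.copy()
    for t in range(T):
        G = lambda_return(rewards, phi, theta, t, lam, gamma, T)
        theta = theta + alpha * (G - theta @ phi[t]) * phi[t]
    return theta
```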

SLIDE 10

The λ-return alg performs similarly to n-step algs

On the 19-state random walk (tabular):

[Figure: two panels plotting RMS error at the end of the episode, averaged over the first 10 episodes, as a function of α. Left: n-step TD methods (from Chapter 7), n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512. Right: the off-line λ-return algorithm, λ = 0, .4, .8, .9, .95, .975, .99, 1.]

Intermediate λ is best (just like intermediate n is best), and the λ-return is slightly better than n-step.

SLIDE 11

The forward view looks forward from the state being updated to future states and rewards

[Figure: a time line of states $S_t, S_{t+1}, S_{t+2}, S_{t+3}, \ldots$ and rewards $R_{t+1}, R_{t+2}, R_{t+3}, \ldots, R_T$; each state looks ahead at the states and rewards that follow it.]

SLIDE 12

The backward view looks back to the recently visited states (marked by eligibility traces)

Shout the TD error backwards. The traces fade with temporal distance by γλ.

[Figure: a time line of states $S_{t-3}, S_{t-2}, S_{t-1}, S_t, S_{t+1}$; the TD error $\delta_t$ at time t is broadcast back to recently visited states, each receiving it in proportion to its eligibility trace $e_t$.]

SLIDE 13

Demo

Here we are marking state-action pairs with a replacing eligibility trace.

SLIDE 14

Eligibility traces (mechanism)

The forward view was for theory; the backward view is for mechanism: a new memory vector called the eligibility trace. On each step, decay each component by γλ and increment the trace for the current state (by 1 in the tabular case; by the gradient $\nabla \hat{v}(S_t, \theta_t)$ in general). This is the accumulating trace:

$$e_t \in \mathbb{R}^n, \qquad e_0 \doteq \mathbf{0}, \qquad e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma \lambda e_{t-1},$$

with the same shape as θ (and componentwise $\geq 0$ in the tabular case).

[Figure: the accumulating eligibility trace of one state, rising at the times of visits to the state and fading in between.]

SLIDE 15

The Semi-gradient TD(λ) algorithm

$$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t,$$

$$\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t),$$

$$e_0 \doteq \mathbf{0}, \qquad e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma \lambda e_{t-1}.$$
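A runnable sketch of one episode of semi-gradient TD(λ) with linear features (so $\nabla \hat{v}(S_t, \theta_t) = \phi_t$). The trajectory interface, with `phis[t]` holding the features of $S_t$ and the terminal state's features taken as all zeros, is an assumption for illustration:

```python
import numpy as np

def td_lambda_episode(phis, rewards, theta, alpha, lam, gamma):
    """Semi-gradient TD(lambda), linear case. phis has T+1 rows, one per
    state S_0, ..., S_T, with phis[T] all zeros (terminal value is 0)."""
    theta = theta.copy()
    e = np.zeros_like(theta)               # eligibility trace, same shape as theta
    for t in range(len(rewards)):
        delta = rewards[t] + gamma * (theta @ phis[t + 1]) - theta @ phis[t]
        e = gamma * lam * e + phis[t]      # decay by gamma*lam, add the gradient
        theta = theta + alpha * delta * e  # broadcast the TD error along the trace
    return theta
```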

SLIDE 16

TD(λ) performs similarly to the offline λ-return alg., but slightly worse, particularly at high α

[Figure: two panels plotting RMS error at the end of the episode, over the first 10 episodes, as a function of α, on the tabular 19-state random walk task. Left: the off-line λ-return algorithm (from the previous section). Right: TD(λ). Both sweep λ = 0, .4, .8, .9, .95, .975, .99, 1.]

Can we do better? Can we update online?

SLIDE 17

The online λ-return algorithm performs best of all

[Figure 12.7: RMS error over the first 10 episodes as a function of α, on the tabular 19-state random walk task. Left: the off-line λ-return algorithm. Right: the on-line λ-return algorithm (= true online TD(λ)). Both sweep λ = 0, .4, .8, .9, .95, .975, .99, 1.]

SLIDE 18

The online λ-return alg uses a truncated λ-return as its target

$$G^{\lambda|h}_t \doteq (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{h-t-1} G^{(h-t)}_t, \qquad 0 \leq t < h \leq T.$$

[Figure: the forward view truncated at horizon $h = t + 3$: from $S_t$, only the states $S_{t+1}, S_{t+2}, S_{t+3}$ and rewards $R_{t+1}, R_{t+2}, R_{t+3}$ are looked at.]

$$\theta^h_{t+1} \doteq \theta^h_t + \alpha \left[ G^{\lambda|h}_t - \hat{v}(S_t, \theta^h_t) \right] \nabla \hat{v}(S_t, \theta^h_t).$$

There is a separate θ sequence for each h!
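The truncated return in code, again reusing `n_step_return` (a sketch; only data up through the horizon h is touched, and the leftover weight goes to the (h−t)-step return rather than the full return):

```python
def truncated_lambda_return(rewards, phi, theta, t, h, lam, gamma, T):
    """G_t^{lambda|h} for 0 <= t < h <= T."""
    g = (1 - lam) * sum(
        lam ** (n - 1) * n_step_return(rewards, phi, theta, t, n, gamma, T)
        for n in range(1, h - t)  # n = 1, ..., h-t-1
    )
    return g + lam ** (h - t - 1) * n_step_return(rewards, phi, theta, t, h - t, gamma, T)
```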

SLIDE 19

The online λ-return algorithm

$$\theta^h_{t+1} \doteq \theta^h_t + \alpha \left[ G^{\lambda|h}_t - \hat{v}(S_t, \theta^h_t) \right] \nabla \hat{v}(S_t, \theta^h_t).$$

There is a separate θ sequence for each h! The weight vectors form a triangle, one row per horizon (each row starting from $\theta^h_0 \doteq \theta_0$):

$$\begin{array}{ccccc}
\theta_0 & & & & \\
\theta^1_0 & \theta^1_1 & & & \\
\theta^2_0 & \theta^2_1 & \theta^2_2 & & \\
\theta^3_0 & \theta^3_1 & \theta^3_2 & \theta^3_3 & \\
\vdots & \vdots & \vdots & \vdots & \ddots \\
\theta^T_0 & \theta^T_1 & \theta^T_2 & \theta^T_3 & \cdots \; \theta^T_T
\end{array}$$

For example:

$$\begin{aligned}
h = 1: \quad \theta^1_1 &\doteq \theta^1_0 + \alpha \left[ G^{\lambda|1}_0 - \hat{v}(S_0, \theta^1_0) \right] \nabla \hat{v}(S_0, \theta^1_0), \\
h = 2: \quad \theta^2_1 &\doteq \theta^2_0 + \alpha \left[ G^{\lambda|2}_0 - \hat{v}(S_0, \theta^2_0) \right] \nabla \hat{v}(S_0, \theta^2_0), \\
\theta^2_2 &\doteq \theta^2_1 + \alpha \left[ G^{\lambda|2}_1 - \hat{v}(S_1, \theta^2_1) \right] \nabla \hat{v}(S_1, \theta^2_1), \\
h = 3: \quad \theta^3_1 &\doteq \theta^3_0 + \alpha \left[ G^{\lambda|3}_0 - \hat{v}(S_0, \theta^3_0) \right] \nabla \hat{v}(S_0, \theta^3_0), \\
\theta^3_2 &\doteq \theta^3_1 + \alpha \left[ G^{\lambda|3}_1 - \hat{v}(S_1, \theta^3_1) \right] \nabla \hat{v}(S_1, \theta^3_1), \\
\theta^3_3 &\doteq \theta^3_2 + \alpha \left[ G^{\lambda|3}_2 - \hat{v}(S_2, \theta^3_2) \right] \nabla \hat{v}(S_2, \theta^3_2).
\end{aligned}$$

True online TD(λ) computes just the diagonal, cheaply (for linear FA).
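Computed naively, the triangle above costs O(T²) updates per episode: each new horizon h restarts its own pass from the start-of-episode weights, as sketched below (reusing `truncated_lambda_return`; bootstrapping each pass with its own current weights is one reading of the $\theta^h_t$ notation):

```python
def online_lambda_return_episode(phi, rewards, theta0, alpha, lam, gamma):
    """Naive online lambda-return algorithm: O(T^2) updates per episode.
    True online TD(lambda) obtains the same diagonal values theta^h_h
    with O(T) updates in the linear case."""
    T = len(rewards)
    theta = theta0.copy()
    for h in range(1, T + 1):       # one pass per new horizon
        theta = theta0.copy()       # theta^h_0 = theta_0
        for t in range(h):
            G = truncated_lambda_return(rewards, phi, theta, t, h, lam, gamma, T)
            theta = theta + alpha * (G - theta @ phi[t]) * phi[t]
    return theta                    # the last pass ends at theta^T_T
```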

SLIDE 20

True online TD(λ)

$$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t + \alpha \left( \theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t \right) (e_t - \phi_t),$$

$$e_t \doteq \gamma \lambda e_{t-1} + \left( 1 - \alpha \gamma \lambda\, e_{t-1}^\top \phi_t \right) \phi_t.$$

The trace update here is the dutch trace.
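A runnable sketch of these two updates over one episode with linear FA (`v_old` carries $\theta_{t-1}^\top \phi_t$ forward between steps, which is what the correction term needs; the `phis`/`rewards` layout matches the earlier TD(λ) sketch):

```python
import numpy as np

def true_online_td_lambda_episode(phis, rewards, theta, alpha, lam, gamma):
    """True online TD(lambda) with a dutch trace; phis[T] is all zeros."""
    theta = theta.copy()
    e = np.zeros_like(theta)
    v_old = 0.0                              # becomes theta_{t-1}^T phi_t
    for t in range(len(rewards)):
        v = theta @ phis[t]                  # theta_t^T phi_t
        v_next = theta @ phis[t + 1]         # theta_t^T phi_{t+1}
        delta = rewards[t] + gamma * v_next - v
        # dutch trace: fade, then add only a fraction of phi_t
        e = gamma * lam * e + (1 - alpha * gamma * lam * (e @ phis[t])) * phis[t]
        theta = theta + alpha * delta * e + alpha * (v - v_old) * (e - phis[t])
        v_old = v_next
    return theta
```

For linear FA this reproduces exactly the diagonal $\theta^h_h$ of the online λ-return algorithm, at a cost comparable to ordinary TD(λ).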

SLIDE 21

Accumulating, Dutch, and Replacing Traces

All traces fade the same way, but they increment differently!

[Figure: one state's trace over time, with visits at the same times in all three panels: accumulating traces increment by 1 per visit and can exceed 1, dutch traces increment only part of the way toward 1 (shown for α = 0.5), and replacing traces reset to 1.]
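A small sketch of the three increments on one step (the γλ decay is shared; binary or one-hot features are assumed for the replacing rule):

```python
import numpy as np

def trace_step(e, x, rule, alpha, lam, gamma):
    """Fade the trace, then increment for the visited features x
    (a one-hot indicator vector in the tabular case)."""
    e = gamma * lam * e                         # all traces fade the same
    if rule == "accumulating":
        return e + x                            # add the full indicator/gradient
    if rule == "dutch":
        return e + (1 - alpha * (e @ x)) * x    # add only part of the way
    if rule == "replacing":
        return np.where(x != 0, x, e)           # reset active components to 1
    raise ValueError(f"unknown rule: {rule}")
```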

SLIDE 22

The simplest example of deriving a backward view from a forward view

  • Monte Carlo learning of a final target
  • We will derive dutch traces
  • This shows that dutch traces really are not about TD; they are about efficiently implementing online algorithms

SLIDE 23

The Problem: Predict final target Z with linear function approximation

MC update, all done at time T, with step size $\alpha_t$:

$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \phi_t^\top \theta_t \right) \phi_t, \qquad t = 0, \ldots, T-1.$$

[Table: over one episode, Time runs 1, 2, ..., T−1, T. The Data are the feature vectors $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$, followed by the target $Z$ at time T. The Weights remain $\theta_0$ throughout the episode and become $\theta_T$ only for the next episode. The Predictions $\theta_0^\top \phi_0,\ \theta_0^\top \phi_1,\ \theta_0^\top \phi_2,\ \ldots,\ \theta_0^\top \phi_{T-1}$ should each approximate $Z$.]

SLIDE 24

Computational goals

Computation per step (including memory) must be:

  1. Constant (non-increasing with the number of episodes)
  2. Proportionate (proportional to the number of weights, i.e. O(n))
  3. Independent of span (not increasing with episode length)

In general, the predictive span is the number of steps between making a prediction and observing the outcome.

MC (all done at time T, with step size $\alpha_t$):

$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \phi_t^\top \theta_t \right) \phi_t, \qquad t = 0, \ldots, T-1.$$

Is MC independent of span? No. What is the span? T.

SLIDE 25

Computational goals (continued)

Same goals and the same MC update as the previous slide, all done at time T. The computation and memory needed at step T increase with T ⇒ MC is not independent of span.

SLIDE 26

Final Result

Given: $\theta_0$; features $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$; final target $Z$.

MC algorithm:

$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \phi_t^\top \theta_t \right) \phi_t, \qquad t = 0, \ldots, T-1, \text{ all done at time } T.$$

Equivalent independent-of-span algorithm, with auxiliary vectors $a_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^n$:

$$a_0 \doteq \theta_0 - \alpha_0 \phi_0 \phi_0^\top \theta_0, \qquad a_t \doteq a_{t-1} - \alpha_t \phi_t \phi_t^\top a_{t-1}, \qquad t = 1, \ldots, T-1,$$

$$e_0 \doteq \alpha_0 \phi_0, \qquad e_t \doteq e_{t-1} - \alpha_t \phi_t \phi_t^\top e_{t-1} + \alpha_t \phi_t, \qquad t = 1, \ldots, T-1,$$

$$\theta_T \doteq a_{T-1} + Z e_{T-1}.$$

Proved: both algorithms produce the same $\theta_T$.

SLIDE 27

MC:

$$\theta_{t+1} \doteq \theta_t + \alpha_t \left( Z - \phi_t^\top \theta_t \right) \phi_t, \qquad t = 0, \ldots, T-1, \tag{1}$$

$$\begin{aligned}
\theta_T &= \theta_{T-1} + \alpha_{T-1} \left( Z - \phi_{T-1}^\top \theta_{T-1} \right) \phi_{T-1} \\
&= \theta_{T-1} + \alpha_{T-1} \phi_{T-1} \left( -\phi_{T-1}^\top \theta_{T-1} \right) + \alpha_{T-1} Z \phi_{T-1} \\
&= \left( I - \alpha_{T-1} \phi_{T-1} \phi_{T-1}^\top \right) \theta_{T-1} + Z \alpha_{T-1} \phi_{T-1} \\
&= F_{T-1} \theta_{T-1} + Z \alpha_{T-1} \phi_{T-1} \qquad \left( \text{where } F_t \doteq I - \alpha_t \phi_t \phi_t^\top \right) \\
&= F_{T-1} \left( F_{T-2} \theta_{T-2} + Z \alpha_{T-2} \phi_{T-2} \right) + Z \alpha_{T-1} \phi_{T-1} \\
&= F_{T-1} F_{T-2} \theta_{T-2} + Z \left( F_{T-1} \alpha_{T-2} \phi_{T-2} + \alpha_{T-1} \phi_{T-1} \right) \\
&= F_{T-1} F_{T-2} \left( F_{T-3} \theta_{T-3} + Z \alpha_{T-3} \phi_{T-3} \right) + Z \left( F_{T-1} \alpha_{T-2} \phi_{T-2} + \alpha_{T-1} \phi_{T-1} \right) \\
&= F_{T-1} F_{T-2} F_{T-3} \theta_{T-3} + Z \left( F_{T-1} F_{T-2} \alpha_{T-3} \phi_{T-3} + F_{T-1} \alpha_{T-2} \phi_{T-2} + \alpha_{T-1} \phi_{T-1} \right) \\
&\;\;\vdots \\
&= \underbrace{F_{T-1} F_{T-2} \cdots F_0\, \theta_0}_{a_{T-1}} + Z \underbrace{\sum_{k=0}^{T-1} F_{T-1} F_{T-2} \cdots F_{k+1}\, \alpha_k \phi_k}_{e_{T-1}} = a_{T-1} + Z e_{T-1}, \tag{2}
\end{aligned}$$

where $a_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^n$ are auxiliary short-term-memory vectors.

SLIDE 28

From (2), the auxiliary vectors can be computed incrementally:

$$\begin{aligned}
e_t &\doteq \sum_{k=0}^{t} F_t F_{t-1} \cdots F_{k+1}\, \alpha_k \phi_k, \qquad t = 0, \ldots, T-1 \\
&= \sum_{k=0}^{t-1} F_t F_{t-1} \cdots F_{k+1}\, \alpha_k \phi_k + \alpha_t \phi_t \\
&= F_t \sum_{k=0}^{t-1} F_{t-1} F_{t-2} \cdots F_{k+1}\, \alpha_k \phi_k + \alpha_t \phi_t \\
&= F_t e_{t-1} + \alpha_t \phi_t \\
&= e_{t-1} - \alpha_t \phi_t \phi_t^\top e_{t-1} + \alpha_t \phi_t, \qquad t = 1, \ldots, T-1. \tag{3}
\end{aligned}$$

$$a_t \doteq F_t F_{t-1} \cdots F_0\, \theta_0 = F_t a_{t-1} = a_{t-1} - \alpha_t \phi_t \phi_t^\top a_{t-1}, \qquad t = 1, \ldots, T-1. \tag{4}$$

SLIDE 29

Final Result (recap)

Given $\theta_0$; $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$; and $Z$: the MC algorithm (1) and the independent-of-span algorithm given by (2), (3), and (4) are equivalent. Both produce the same final $\theta_T$, and the latter needs only the auxiliary vectors $a_t \in \mathbb{R}^n$ and $e_t \in \mathbb{R}^n$ during the episode.
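A sketch checking the equivalence numerically (the random features and step sizes are illustrative; the key property is that `a` and `e` are maintained without any knowledge of Z until the end):

```python
import numpy as np

def mc_at_end(theta0, phis, alphas, Z):
    """The MC algorithm (1): all T updates applied at time T."""
    theta = theta0.copy()
    for t, x in enumerate(phis):
        theta = theta + alphas[t] * (Z - x @ theta) * x
    return theta

def span_independent(theta0, phis, alphas, Z):
    """Equations (3) and (4): maintain a_t and e_t step by step,
    then combine with Z only at the end, as in (2)."""
    a, e = theta0.copy(), np.zeros_like(theta0)
    for t, x in enumerate(phis):
        a = a - alphas[t] * (x @ a) * x                   # a_t = F_t a_{t-1}
        e = e - alphas[t] * (x @ e) * x + alphas[t] * x   # e_t = F_t e_{t-1} + alpha_t phi_t
    return a + Z * e                                      # theta_T

rng = np.random.default_rng(0)
theta0 = rng.normal(size=4)
phis = rng.normal(size=(10, 4))
alphas = rng.uniform(0.01, 0.1, size=10)
assert np.allclose(mc_at_end(theta0, phis, alphas, 3.0),
                   span_independent(theta0, phis, alphas, 3.0))
```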

SLIDE 30

Conclusions from the forward-backward derivation

  • We have derived dutch eligibility traces from an MC update, without any TD learning
  • Dutch traces, and in fact all eligibility traces, are not about TD; they are about efficient multi-step learning
  • We can derive new, non-obvious algorithms that are equivalent to obvious algorithms but have better computational properties
  • This is a different type of machine-learning result: an algorithm equivalence

slide-31
SLIDE 31

51

Conclusions regarding Eligibility Traces

Provide an efficient, incremental way to combine MC and TD Includes advantages of MC (better when non-Markov) Includes advantages of TD (faster, comp. congenial) True online TD(λ) is new and best Is exactly equivalent to online λ-return algorithm Three varieties of traces: accumulating, dutch, (replacing) Traces to control in on-policy and off-policy forms Traces do have a small cost in computation (≈x2)