SLIDE 1

Eligibility Traces

Unifying Monte Carlo and TD

key algorithms: TD(λ), Sarsa(λ), Q(λ)

SLIDE 2

Unified View

[Figure: the space of backup methods, organized along two dimensions: the width of the backup (sample backups vs. full backups) and the height/depth of the backup (one-step bootstrapping vs. full returns). The corners are temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search.]

SLIDE 3

N-step TD Prediction

Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps).

[Figure: backup diagrams for 1-step TD, 2-step, 3-step, …, n-step, and Monte Carlo backups.]

SLIDE 4

Mathematics of N-step TD Prediction

Monte Carlo return:

    G_t ≐ R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯ + γ^{T−t−1}R_T

TD (1-step) return, using V_t to estimate the remaining return:

    G_t^{(1)} ≐ R_{t+1} + γV_t(S_{t+1})

2-step return:

    G_t^{(2)} ≐ R_{t+1} + γR_{t+2} + γ²V_t(S_{t+2})

n-step return:

    G_t^{(n)} ≐ R_{t+1} + γR_{t+2} + ⋯ + γ^{n−1}R_{t+n} + γ^n V_t(S_{t+n})
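To make the n-step return concrete, here is a minimal Python sketch; the function name, the rewards/values array convention, and the episode length T are illustrative assumptions, not from the slides:

```python
def n_step_return(rewards, values, t, n, gamma, T):
    """n-step return G_t^(n): up to n discounted rewards, plus a discounted
    bootstrap from V(S_{t+n}) if the episode has not terminated by then.
    Convention: rewards[k] is R_{k+1}, values[k] is V_t(S_k)."""
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:                     # bootstrap only if we stopped short of termination
        G += gamma ** n * values[t + n]
    return G
```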

SLIDE 5

Forward View of TD(λ)

Look forward from each state to determine update from future states and rewards:

[Figure: timeline. Looking forward from state S_t through S_{t+1}, S_{t+2}, S_{t+3}, …, collecting the rewards R_{t+1}, R_{t+2}, R_{t+3}, …, R_T.]

SLIDE 6

Learning with n-step Backups

The backup computes an increment:

    Δ_t(S_t) ≐ α [ G_t^{(n)} − V_t(S_t) ]        (Δ_t(s) = 0 for all s ≠ S_t)

Online updating applies the increment immediately:

    V_{t+1}(s) = V_t(s) + Δ_t(s),   for all s ∈ S

Off-line updating accumulates the increments and applies them at the end of the episode:

    V(s) ← V(s) + Σ_{t=0}^{T−1} Δ_t(s),   for all s ∈ S
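A sketch of both updating schemes over one recorded episode, reusing the hypothetical n_step_return helper from above; the state/reward conventions and names are my own assumptions:

```python
from collections import defaultdict

def n_step_td_episode(states, rewards, V, alpha, gamma, n, online=True):
    """Apply n-step TD increments over one recorded episode.
    states[k] = S_k (including the terminal state), rewards[k] = R_{k+1};
    V maps every state to its current value estimate. online=True applies
    each increment immediately; online=False sums them and applies them
    once at the end of the episode."""
    T = len(rewards)
    pending = defaultdict(float)              # used only for off-line updating
    for t in range(T):
        values = [V[s] for s in states]       # bootstrap from the current estimates
        G = n_step_return(rewards, values, t, n, gamma, T)
        increment = alpha * (G - V[states[t]])
        if online:
            V[states[t]] += increment
        else:
            pending[states[t]] += increment
    if not online:
        for s, d in pending.items():
            V[s] += d
```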

SLIDE 7

Error-reduction property

The n-step return has an error-reduction property: its worst-case error is at most γⁿ times the worst-case error of the current estimate V_t. Using this property, you can show that n-step methods converge.

    max_s | E_π[ G_t^{(n)} | S_t = s ] − v_π(s) |  ≤  γ^n  max_s | V_t(s) − v_π(s) |

    (maximum error using the n-step return)          (maximum error using V_t)
SLIDE 8

Random Walk Examples

How does 2-step TD work here? How about 3-step TD?

[Figure: 5-state random walk A–B–C–D–E, starting in the middle at C; termination on the right gives reward 1, all other rewards are 0.]

SLIDE 9

A Larger Example – 19-state Random Walk

On-line is better than off-line. An intermediate n is best. Do you think there is an optimal n for every task?

[Figure: RMS error over the first 10 episodes as a function of α, for on-line n-step TD methods (left) and off-line n-step TD methods (right), with n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512.]

SLIDE 10

Averaging N-step Returns

n-step methods were introduced to help with understanding TD(λ). Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return. This is called a complex backup. In the backup diagram, draw each component and label it with its weight.

A complex backup, e.g.:

    (1/2) G_t^{(2)} + (1/2) G_t^{(4)}

Any such average is valid as long as the weights on the component returns sum to 1.

SLIDE 11

Forward View of TD(λ)

TD(λ) is a method for averaging all n-step backups, weighting the n-step backup by λ^{n−1} (decaying with time since visitation). The λ-return and the backup using it are:

1!" (1!") " (1!") "2

# = 1

TD("), "-return

"

T-t-1

t

. = (1 − λ)

X

n=1

λn−1G(n)

t

∆t(St) . = α h Gλ

t Vt(St)

i

SLIDE 12

λ-return Weighting Function

For an episode terminating at time T, the λ-return splits into a part over the n-step returns available until termination and a part for the actual return after termination:

    G_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} G_t^{(n)} + λ^{T−t−1} G_t

[Figure: the weighting function from t to T. Each n-step return gets weight (1−λ)λ^{n−1} (e.g. the 3-step return gets (1−λ)λ²), decaying by λ per step; the actual, final return gets all the remaining weight, λ^{T−t−1}. The total area under the weights is 1.]
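A small Python sketch of this episodic λ-return, built on the hypothetical n_step_return helper from earlier; all names are illustrative:

```python
def lambda_return(rewards, values, t, lam, gamma):
    """Episodic λ-return G_t^λ: (1-λ)λ^(n-1)-weighted n-step returns for
    n = 1 .. T-t-1, plus weight λ^(T-t-1) on the actual return G_t."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):                   # truncated n-step returns
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma, T)
    G_full = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))   # actual return G_t
    return G_lam + lam ** (T - t - 1) * G_full
```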

SLIDE 13

Relation to TD(0) and MC

The λ-return can be rewritten as below. If λ = 1, you get MC; if λ = 0, you get TD(0).

    G_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} G_t^{(n)} + λ^{T−t−1} G_t

λ = 1 (Monte Carlo):

    G_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} G_t^{(n)} + 1^{T−t−1} G_t = G_t

λ = 0 (TD(0)):

    G_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} G_t^{(n)} + 0^{T−t−1} G_t = G_t^{(1)}

SLIDE 14

Forward View of TD(λ)

Look forward from each state to determine update from future states and rewards:

[Figure: timeline. Looking forward from state S_t through S_{t+1}, S_{t+2}, S_{t+3}, …, collecting the rewards R_{t+1}, R_{t+2}, R_{t+3}, …, R_T.]

SLIDE 15

λ-return on the Random Walk

On-line is much better than off-line. Intermediate values of λ are best. The λ-return does better than the n-step return.

[Figure: RMS error over the first 10 episodes vs. α for the on-line λ-return algorithm (left) and the off-line λ-return algorithm, which is equivalent to off-line TD(λ) with accumulating traces (right), for λ = 0, .4, .8, .9, .95, .975, .99, 1.]

SLIDE 16

Backward View

Shout δ_t backwards over time. The strength of your voice decreases with temporal distance by γλ.

[Figure: the TD error δ_t computed at S_t is sent backwards to the previously visited states S_{t−1}, S_{t−2}, S_{t−3}, …, each receiving it in proportion to its eligibility trace E_t(s).]

    δ_t ≐ R_{t+1} + γV_t(S_{t+1}) − V_t(S_t)

    ΔV_t(s) ≐ α δ_t E_t(s)

SLIDE 17

Backward View of TD(λ)

The forward view was for theory; the backward view is for mechanism. It uses a new variable called the eligibility trace. On each step, decay all traces by γλ and increment the trace for the current state by 1 (the accumulating trace).

[Figure: an accumulating eligibility trace, rising at the times a state is visited and decaying between visits.]

Accumulating eligibility trace, E_t(s) ∈ ℝ⁺:

    E_t(s) = { γλ E_{t−1}(s)        if s ≠ S_t
             { γλ E_{t−1}(s) + 1    if s = S_t

SLIDE 18

On-line Tabular TD(λ)

Initialize V(s) arbitrarily (but set to 0 if s is terminal)
Repeat (for each episode):
    Initialize E(s) = 0, for all s ∈ S
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe reward R and next state S′
        δ ← R + γV(S′) − V(S)
        E(S) ← E(S) + 1                  (accumulating traces)
        or E(S) ← (1 − α)E(S) + 1        (dutch traces)
        or E(S) ← 1                      (replacing traces)
        For all s ∈ S:
            V(s) ← V(s) + αδE(s)
            E(s) ← γλE(s)
        S ← S′
    until S is terminal
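A runnable Python version of this algorithm with accumulating traces. The env.reset()/env.step() interface and the policy callable are assumptions for illustration, not part of the slides:

```python
from collections import defaultdict

def tabular_td_lambda(env, policy, num_episodes, alpha, gamma, lam):
    """On-line tabular TD(λ) prediction with accumulating traces.
    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done)."""
    V = defaultdict(float)                  # value estimates (0 for unseen and terminal states)
    for _ in range(num_episodes):
        E = defaultdict(float)              # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
            E[s] += 1.0                     # accumulating trace for the visited state
            for state in list(E.keys()):    # update every state with a nonzero trace
                V[state] += alpha * delta * E[state]
                E[state] *= gamma * lam
            s = s_next
    return V
```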

SLIDE 19

Relation of Backwards View to MC & TD(0)

Using the update rule:

    ΔV_t(s) ≐ α δ_t E_t(s)

As before, if you set λ to 0, you get TD(0). If you set λ to 1, you get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode).

SLIDE 20

Forward View = Backward View

The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating. On-line updating with small α is similar.

Sum of backward (TD) updates = sum of forward (λ-return) updates:

    Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} ΔV_t^λ(S_t) 𝟙[s = S_t]

With some algebra, both sides reduce to

    Σ_{t=0}^{T−1} α 𝟙[s = S_t] Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k

SLIDE 21

On-line versus Off-line on Random Walk

Same 19-state random walk. On-line performs better over a broader range of parameters.

[Figure: RMS error over the first 10 episodes vs. α for on-line TD(λ) with accumulating traces (left) and off-line TD(λ) with accumulating traces, which is equivalent to the off-line λ-return algorithm (right), for λ = 0, .4, .8, .9, .95, .975, .99, 1.]

SLIDE 22

Replacing and Dutch Traces

All traces fade the same way, but they increment differently!

[Figure: how accumulating traces, dutch traces (α = 0.5), and replacing traces evolve across the times a state is visited.]

Fading (all trace types), for all s ∈ S, s ≠ S_t:

    E_t(s) ≐ γλ E_{t−1}(s)

Increment for the visited state S_t:

    accumulating traces:  E_t(S_t) ≐ γλ E_{t−1}(S_t) + 1
    dutch traces:         E_t(S_t) ≐ (1 − α) γλ E_{t−1}(S_t) + 1
    replacing traces:     E_t(S_t) ≐ 1
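The three increment rules in a small Python sketch; the function name and dict-based trace storage are my own choices:

```python
def update_traces(E, visited_state, gamma, lam, alpha, kind="accumulating"):
    """Decay all traces by γλ, then increment the visited state's trace
    according to the chosen trace type (accumulating, dutch, or replacing)."""
    for s in E:
        E[s] *= gamma * lam                  # all traces fade the same way
    if kind == "accumulating":
        E[visited_state] = E.get(visited_state, 0.0) + 1.0
    elif kind == "dutch":
        E[visited_state] = (1 - alpha) * E.get(visited_state, 0.0) + 1.0
    elif kind == "replacing":
        E[visited_state] = 1.0
    return E
```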

SLIDE 23

Replacing and Dutch on the Random Walk

[Figure: RMS error over the first 10 episodes vs. α for on-line TD(λ) with dutch traces (left) and on-line TD(λ) with replacing traces (right), for λ = 0, .4, .8, .9, .95, .975, .99, 1.]

SLIDE 24

All λ results on the 19-state random walk

[Figure: RMS error over the first 10 episodes vs. α for six algorithms: on-line TD(λ) with accumulating traces, on-line TD(λ) with dutch traces, the on-line λ-return algorithm, the off-line λ-return algorithm (= off-line TD(λ) with accumulating traces), on-line TD(λ) with replacing traces, and true on-line TD(λ) (= real-time λ-return), each for λ = 0, .4, .8, .9, .95, .975, .99, 1.]

SLIDE 25

Control: Sarsa(λ)

Everything changes from states to state-action pairs

[Figure: Sarsa(λ) backup diagram over state-action pairs starting from (S_t, A_t), with weights 1−λ, (1−λ)λ, (1−λ)λ², …, λ^{T−t−1}, summing to 1.]

    Q_{t+1}(s, a) = Q_t(s, a) + α δ_t E_t(s, a),   for all s, a

where

    δ_t = R_{t+1} + γ Q_t(S_{t+1}, A_{t+1}) − Q_t(S_t, A_t)

and, for all s, a,

    E_t(s, a) = { γλ E_{t−1}(s, a) + 1    if s = S_t and a = A_t
                { γλ E_{t−1}(s, a)        otherwise

SLIDE 26

Demo

SLIDE 27

Sarsa(λ) Algorithm

Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
        δ ← R + γQ(S′, A′) − Q(S, A)
        E(S, A) ← E(S, A) + 1
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + αδE(s, a)
            E(s, a) ← γλE(s, a)
        S ← S′; A ← A′
    until S is terminal
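A runnable Python sketch of Sarsa(λ) with accumulating traces and ε-greedy action selection, assuming the same hypothetical env.reset()/env.step() interface as in the earlier TD(λ) sketch plus a list env.actions (also an assumption):

```python
import random
from collections import defaultdict

def sarsa_lambda(env, num_episodes, alpha, gamma, lam, epsilon):
    """Tabular Sarsa(λ) with accumulating traces and ε-greedy action selection.
    Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done), env.actions."""
    Q = defaultdict(float)                               # Q[(state, action)]

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        E = defaultdict(float)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = 0.0 if done else gamma * Q[(s_next, a_next)]
            delta = r + target - Q[(s, a)]
            E[(s, a)] += 1.0                             # accumulating trace
            for sa in list(E.keys()):
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam
            s, a = s_next, a_next
    return Q
```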

SLIDE 28

Sarsa(λ) Gridworld Example

With one trial, the agent has much more information about how to get to the goal (though not necessarily the best way). This can considerably accelerate learning.

[Figure: the path taken, the action values increased by one-step Sarsa, and the action values increased by Sarsa(λ) with λ = 0.9.]

SLIDE 29

Watkins’s Q(λ)

How can we extend this to Q-learning? If you mark every state-action pair as eligible, you back up over the non-greedy policy. Watkins's approach: zero out the eligibility traces after a non-greedy action, and use the max when backing up at the first non-greedy choice.

1!" (1!") " (1!") "

2

Watkins's Q(")

OR

first non-greedy action

"

n!1

s , a

t t

st+n

"

T-t-1

St St-n , At

Zt(s, a) = 8 < : 1 + γλ Zt−1(s, a) if St = s, At = a, and At was greedy; if St = s, At = a, and At was not greedy; γλ Zt−1(s, a) for all other s, a; Qt+1(s, a) = Qt(s, a) + αδtZt(s, a), ∀s ∈ S, a ∈ A(s)

δt = Rt+1 + γ max

a0

Qt(St+1, a0) − Qt(St, At).

E( E( E(

SLIDE 30

Watkins’s Q(λ)

Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
        A* ← arg max_a Q(S′, a)    (if A′ ties for the max, then A* ← A′)
        δ ← R + γQ(S′, A*) − Q(S, A)
        E(S, A) ← E(S, A) + 1
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + αδE(s, a)
            If A′ = A*, then E(s, a) ← γλE(s, a)
            else E(s, a) ← 0
        S ← S′; A ← A′
    until S is terminal
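A Python sketch of Watkins's Q(λ), again under the hypothetical env interface used above; note how the traces are cut whenever the action actually chosen is not greedy:

```python
import random
from collections import defaultdict

def watkins_q_lambda(env, num_episodes, alpha, gamma, lam, epsilon):
    """Tabular Watkins's Q(λ): Q-learning backups with eligibility traces that
    are zeroed as soon as an exploratory (non-greedy) action is selected."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        E = defaultdict(float)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            a_star = max(env.actions, key=lambda b: Q[(s_next, b)])
            if Q[(s_next, a_next)] == Q[(s_next, a_star)]:
                a_star = a_next                      # ties broken in favour of the chosen action
            target = 0.0 if done else gamma * Q[(s_next, a_star)]
            delta = r + target - Q[(s, a)]
            E[(s, a)] += 1.0
            for sa in list(E.keys()):
                Q[sa] += alpha * delta * E[sa]
                E[sa] = gamma * lam * E[sa] if a_next == a_star else 0.0
            s, a = s_next, a_next
    return Q
```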

SLIDE 31

Replacing Traces Example

Same 19-state random walk task as before. Replacing traces perform better than accumulating traces over more values of λ.

[Figure: RMS error at the best α as a function of λ, for accumulating traces and replacing traces.]

SLIDE 32

Why Replacing Traces?

Replacing traces can significantly speed learning. They can make the system perform well over a broader set of parameters. Accumulating traces can do poorly on certain types of tasks. Why is this task particularly onerous for accumulating traces?

[Figure: a chain task in which every state has a "right" action and a "wrong" action, with a +1 reward only after the final "right" action.]

SLIDE 33

Interim TD(λ) Forward View

At each time t you can only see the data up to time t, so you must bootstrap at time t. However, you can go back and redo all previous updates at times k < t. TD(λ) is equivalent to this: exactly under off-line updating, approximately under on-line updating.

[Figure: interim TD(λ) backup. At an earlier time k < t, the backup looks forward through S_{k+1}, S_{k+2}, S_{k+3}, … only as far as the current time t (the data horizon), with weights 1−λ, (1−λ)λ, (1−λ)λ², … on the component backups and weight λ^{t−k−1} on the longest available backup, ending at S_t and R_t.]

SLIDE 34

True Online TD(λ)

A new algorithm that more truly achieves the goals of TD(λ) under on-line updating. It achieves the interim TD(λ) forward view exactly, even under on-line updating, for any λ. It is not restricted to episodic problems, extends immediately to function approximation, and appears to perform better than both accumulating and replacing traces ("enhanced" traces). Tabular version:

    E_t(s) = γλ E_{t−1}(s) + (if s = S_t)  [ 1 − α γλ E_{t−1}(s) ]
    δ_t = R_{t+1} + γ V_t(S_{t+1}) − V_{t−1}(S_t)
    V_{t+1}(s) = V_t(s) + α δ_t E_t(s) + (if s = S_t)  α [ V_{t−1}(S_t) − V_t(S_t) ]
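A Python sketch of these tabular true online TD(λ) updates, using the same hypothetical env.reset()/env.step() interface as the earlier sketches; v_old plays the role of V_{t−1}(S_t):

```python
from collections import defaultdict

def true_online_td_lambda(env, policy, num_episodes, alpha, gamma, lam):
    """Tabular true online TD(λ), following the update equations on this slide."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)
        s = env.reset()
        v_old = 0.0                                  # V_{t-1}(S_t)
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            v_next = 0.0 if done else V[s_next]
            delta = r + gamma * v_next - v_old       # δ_t = R_{t+1} + γV_t(S_{t+1}) − V_{t−1}(S_t)
            v_current = V[s]                         # V_t(S_t), needed for the correction term
            e_prev = E[s]
            for state in list(E.keys()):             # E_t(s) = γλE_{t−1}(s), plus extra term for S_t below
                E[state] *= gamma * lam
            E[s] += 1.0 - alpha * gamma * lam * e_prev
            for state in list(E.keys()):             # V_{t+1}(s) = V_t(s) + αδ_tE_t(s), plus correction for S_t
                V[state] += alpha * delta * E[state]
            V[s] += alpha * (v_old - v_current)
            v_old = v_next
            s = s_next
    return V
```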

SLIDE 35

More Replacing Traces

Off-line replacing-trace TD(1) is identical to first-visit MC. Extension to action values: when you revisit a state, what should you do with the traces for the other actions? Perhaps you should set them to zero, but it is not clear that this is a good idea in all cases.

    E_t(s, a) = { 1                     if s = S_t and a = A_t
                { 0                     if s = S_t and a ≠ A_t
                { γλ E_{t−1}(s, a)      if s ≠ S_t
    for all s, a

SLIDE 36

Implementation Issues with Traces

Eligibility traces could require much more computation, but most traces are VERY close to zero, and you really only need to update the ones that are not. In practice, traces increase computation by only a small multiple.

SLIDE 37

Variable λ

TD(λ) can be generalized to a variable λ. Here λ is a function of time; for example, you could define

    λ_t = λ(S_t)

The trace update becomes:

    E_t(s) = { γλ_t E_{t−1}(s)        if s ≠ S_t
             { γλ_t E_{t−1}(s) + 1    if s = S_t
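A tiny sketch of this trace update with a state-dependent λ; the lambda_of callable is a hypothetical stand-in for λ(S_t):

```python
def update_traces_variable_lambda(E, visited_state, gamma, lambda_of):
    """Trace update when λ_t = λ(S_t): decay all traces by γλ_t, then
    increment the visited state's trace (accumulating-style)."""
    lam_t = lambda_of(visited_state)
    for s in E:
        E[s] *= gamma * lam_t
    E[visited_state] = E.get(visited_state, 0.0) + 1.0
    return E
```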

SLIDE 38

Conclusions regarding Eligibility Traces

Eligibility traces provide an efficient, incremental way to combine Monte Carlo (MC) and temporal-difference (TD) learning methods. They include the advantages of MC (can deal with lack of the Markov property) and the advantages of TD (use the TD error, bootstrap). They can achieve MC-like behavior even on non-episodic problems, and they can significantly speed learning. They extend to control in on-policy (Sarsa(λ)) and semi-off-policy (Q(λ)) forms. Three varieties: accumulating, replacing, and the new dutch traces.

SLIDE 39

questions?

SLIDE 40

TD(λ) algorithm/model/neuron

[Figure: TD(λ) as an algorithm/model/neuron. Input features x_i (of states or actions) with weights w_i give the value Σ_i w_i·x_i of the state or action; the reward and the value produce the TD error δ; each input keeps an eligibility trace e_i, modulated by λ; and the weights change according to ẇ_i ∝ δ·e_i.]