

  1. Eligibility Traces Chapter 12

  2. Eligibility traces are: another way of interpolating between MC and TD methods; a way of implementing compound λ-return targets; a basic mechanistic idea — a short-term, fading memory; and a new style of algorithm development/analysis: the forward-view ⇔ backward-view transformation. Forward view: conceptually simple, good for theory and intuition. Backward view: a computationally congenial implementation of the forward view.

  3. Unified View. [Figure: the space of methods laid out by width of backup on one axis and depth of backup (degree of bootstrapping, multi-step) on the other; the corners are temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search.]

  4. Recall n-step targets. For example, in the episodic case with linear function approximation:
 2-step target: $G_t^{(2)} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 \theta_{t+1}^\top \phi_{t+2}$
 n-step target: $G_t^{(n)} \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \theta_{t+n-1}^\top \phi_{t+n}$
 (with $G_t^{(n)} \doteq G_t$ if $t+n \geq T$, the value of the terminal state being taken as zero).
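A minimal Python sketch of the n-step target with linear function approximation; the function and argument names are my own, not from the slides:

```python
def n_step_target(rewards, phi_next, theta, gamma, n, t, T):
    """Sketch of the n-step target G_t^{(n)} with linear v-hat.

    rewards[k] holds R_{k+1}, phi_next is the feature vector phi_{t+n}, and
    theta is the weight vector used to bootstrap. If the episode ends before
    t+n, the target is just the full return (no bootstrap term).
    """
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:                       # bootstrap from the estimated value
        g += gamma ** n * (theta @ phi_next)
    return g
```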

  5. Any set of update targets can be averaged to produce a new compound update target. For example, half a 2-step plus half a 4-step: $U_t = \tfrac{1}{2} G_t^{(2)} + \tfrac{1}{2} G_t^{(4)}$. This is called a compound backup. In its backup diagram, draw each component and label it with the weight for that component ($\tfrac{1}{2}$ each here).
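A small sketch of forming a compound target from component returns; the helper name and the assertion are my own:

```python
def compound_target(component_returns, weights):
    """Average several n-step returns into one compound target.

    The weights must be nonnegative and sum to 1; e.g. weights = [0.5, 0.5]
    with a 2-step and a 4-step return gives U_t = 0.5*G^(2) + 0.5*G^(4).
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * g for w, g in zip(weights, component_returns))
```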

  6. The λ-return is a compound update target: a target that averages all the n-step targets, each weighted by $\lambda^{n-1}$:
 $G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
 [Backup diagram: TD(λ), the λ-return, with component weights $(1-\lambda)$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, ..., $\lambda^{T-t-1}$, summing to 1.]
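A minimal sketch of computing the episodic λ-return from stored n-step returns. It uses the finite form shown on the next slide, where every n-step return that extends past termination equals the full return $G_t$; the names are my own:

```python
def lambda_return(n_step_returns, lam, full_return):
    """Sketch of the episodic lambda-return.

    n_step_returns[n-1] holds G_t^{(n)} for n = 1 .. T-t-1, and full_return
    is the ordinary Monte Carlo return G_t. All n-step returns that reach past
    the end of the episode equal G_t, so their weights collapse into a single
    lambda^{T-t-1} term.
    """
    g = 0.0
    for n, g_n in enumerate(n_step_returns, start=1):
        g += (1 - lam) * lam ** (n - 1) * g_n
    g += lam ** len(n_step_returns) * full_return
    return g
```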

  7. λ-return weighting function. The weight given to each n-step return decays by λ per step; for example, the 3-step return gets weight $(1-\lambda)\lambda^2$. The total area (sum of the weights) is 1, and the weight given to the actual, final return is $\lambda^{T-t-1}$:
 $G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$
 The sum covers the n-step returns until termination; the last term covers everything after termination. [Figure: the weighting over n-step returns between time t and T.]

  8. Relation to TD(0) and MC. The λ-return can be rewritten as
 $G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$
 (the sum until termination, the last term after termination). If λ = 1, the sum vanishes and you get the MC target: $G_t^\lambda = 1^{T-t-1} G_t = G_t$. If λ = 0, only the n = 1 term survives (taking $0^0 = 1$) and you get the TD(0) target: $G_t^\lambda = G_t^{(1)}$.

  9. The off-line λ-return "algorithm". Wait until the end of the episode (off-line); then go back over the time steps, updating
 $\theta_{t+1} \doteq \theta_t + \alpha \big[ G_t^\lambda - \hat v(S_t, \theta_t) \big] \nabla \hat v(S_t, \theta_t), \qquad t = 0, \ldots, T-1.$
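A sketch of one episode of the off-line λ-return algorithm with linear $\hat v$ (so the gradient is just the feature vector). The λ-return targets here bootstrap with the weights held at their start-of-episode values, which is one reasonable reading of the slide; the function and variable names are my own:

```python
import numpy as np

def offline_lambda_return_episode(features, rewards, theta, alpha, gamma, lam):
    """features[t] is phi(S_t) for t = 0..T-1, rewards[t] is R_{t+1}.

    All weight updates are applied only after the episode has ended, and
    every lambda-return is computed with the start-of-episode weights.
    """
    T = len(rewards)
    theta = theta.copy()
    targets = []
    for t in range(T):
        full = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        n_returns = []
        for n in range(1, T - t):       # n-step returns until termination
            g = sum(gamma ** (k - t) * rewards[k] for k in range(t, t + n))
            g += gamma ** n * (theta @ features[t + n])
            n_returns.append(g)
        g_lam = (1 - lam) * sum(lam ** (n - 1) * g for n, g in enumerate(n_returns, 1))
        g_lam += lam ** (T - t - 1) * full
        targets.append(g_lam)
    for t in range(T):                  # now go back over the time steps
        theta = theta + alpha * (targets[t] - theta @ features[t]) * features[t]
    return theta
```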

  10. The λ-return algorithm performs similarly to n-step algorithms on the 19-state random walk. [Figure: RMS error at the end of the episode, averaged over the first 10 episodes, as a function of α, for the off-line λ-return algorithm (λ = 0, .4, .8, .9, .95, .975, .99, 1) and the tabular n-step TD methods from Chapter 7 (n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512).] An intermediate λ is best (just as an intermediate n is best), and the λ-return is slightly better than n-step.

  11. The forward view looks forward from the state being updated to future states and rewards. [Figure: from state $S_t$, the update looks ahead along the time axis to $R_{t+1}, S_{t+1}, R_{t+2}, S_{t+2}, R_{t+3}, S_{t+3}, \ldots, R_T$.]

  12. The backward view looks back to the recently visited states (marked by eligibility traces). [Figure: the TD error $\delta_t$ at $S_t$ is shouted backwards along the time axis to $S_{t-1}, S_{t-2}, S_{t-3}, \ldots$, each carrying its eligibility trace.] Shout the TD error backwards; the traces fade with temporal distance by γλ.

  13. Demo. Here we are marking state-action pairs with a replacing eligibility trace.

  14. Eligibility traces (mechanism). The forward view was for theory; the backward view is for mechanism. A new memory vector called the eligibility trace, $e_t \in \mathbb{R}^n$ (same shape as θ). On each step, decay each component by γλ and increment the trace for the current state by 1; this is the accumulating trace:
 $e_0 \doteq 0, \qquad e_t \doteq \nabla \hat v(S_t, \theta_t) + \gamma\lambda\, e_{t-1}$
 [Figure: an accumulating trace rises at the times of visits to a state and fades in between.]
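A small sketch of the accumulating-trace update for linear $\hat v$, where the gradient is just the feature vector; the helper name and the toy one-hot example are my own:

```python
import numpy as np

def accumulating_trace_step(e, grad_v, gamma, lam):
    """One step of the accumulating trace: decay by gamma*lambda, then add
    the gradient of v-hat at the current state (phi(S_t) for linear v-hat)."""
    return gamma * lam * e + grad_v

# For a tabular/one-hot feature, revisiting a state pushes its trace above 1:
e = np.zeros(3)
phi = np.array([0.0, 1.0, 0.0])     # visit state 1 twice in a row
e = accumulating_trace_step(e, phi, gamma=1.0, lam=0.9)
e = accumulating_trace_step(e, phi, gamma=1.0, lam=0.9)
print(e)                            # [0.  1.9 0. ]
```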

  15. The semi-gradient TD(λ) algorithm:
 $\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t,$
 $\delta_t \doteq R_{t+1} + \gamma \hat v(S_{t+1}, \theta_t) - \hat v(S_t, \theta_t),$
 $e_0 \doteq 0, \qquad e_t \doteq \nabla \hat v(S_t, \theta_t) + \gamma\lambda\, e_{t-1}.$
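A sketch of semi-gradient TD(λ) for linear $\hat v$ with an accumulating trace. The environment interface (reset/step/sample_action) and the feature map phi are illustrative assumptions of mine, not part of the slides:

```python
import numpy as np

def semi_gradient_td_lambda(env, phi, theta, alpha, gamma, lam, n_episodes):
    """Sketch of semi-gradient TD(lambda) with linear v-hat.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); `phi(s)` maps a state to its
    feature vector; the behavior policy here is just a random action.
    """
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros_like(theta)                   # e_0 = 0
        done = False
        while not done:
            s_next, r, done = env.step(env.sample_action())
            e = gamma * lam * e + phi(s)           # accumulating trace
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v         # TD error
            theta = theta + alpha * delta * e      # shout delta backwards
            s = s_next
    return theta
```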

  16. TD(λ) performs similarly to the off-line λ-return algorithm, but slightly worse, particularly at high α. [Figure: tabular 19-state random walk task; RMS error at the end of the episode, averaged over the first 10 episodes, as a function of α, for the off-line λ-return algorithm (from the previous section) and TD(λ), each for λ = 0, .4, .8, .9, .95, .975, .99, 1.] Can we do better? Can we update online?

  17. The online λ-return algorithm performs best of all. [Figure 12.7: tabular 19-state random walk task; RMS error over the first 10 episodes, as a function of α, for the online λ-return algorithm (= true online TD(λ)) and the off-line λ-return algorithm, each for λ = 0, .4, .8, .9, .95, .975, .99, 1.]

  18. The online λ-return algorithm uses a truncated λ-return as its target:
 $G_t^{\lambda|h} \doteq (1-\lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{h-t-1} G_t^{(h-t)}, \qquad 0 \le t < h \le T.$
 [Figure: the forward view truncated at horizon h = t + 3.]
 There is a separate θ sequence for each horizon h:
 $\theta_{t+1}^h \doteq \theta_t^h + \alpha \big[ G_t^{\lambda|h} - \hat v(S_t, \theta_t^h) \big] \nabla \hat v(S_t, \theta_t^h)$
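A sketch of the truncated λ-return for linear $\hat v$. For simplicity a single weight vector is used for all bootstrapping, and h < T is assumed so that $\phi(S_h)$ exists (at h = T the bootstrap value would be the terminal value, zero); the names are my own:

```python
def truncated_lambda_return(rewards, features, theta, gamma, lam, t, h):
    """Sketch of the truncated lambda-return G_t^{lambda|h} with linear v-hat.

    rewards[k] holds R_{k+1} and features[k] holds phi(S_k); the sum of n-step
    returns runs only up to the horizon h, and the last term is the (h-t)-step
    return weighted by lambda^{h-t-1}.
    """
    def n_step(n):
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, t + n))
        return g + gamma ** n * (theta @ features[t + n])

    g = (1 - lam) * sum(lam ** (n - 1) * n_step(n) for n in range(1, h - t))
    return g + lam ** (h - t - 1) * n_step(h - t)
```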

  19. The online λ-return algorithm. There is a separate θ sequence for each horizon h:
 $\theta_{t+1}^h \doteq \theta_t^h + \alpha \big[ G_t^{\lambda|h} - \hat v(S_t, \theta_t^h) \big] \nabla \hat v(S_t, \theta_t^h)$
 h = 1:  $\theta_1^1 = \theta_0^1 + \alpha\big[G_0^{\lambda|1} - \hat v(S_0, \theta_0^1)\big]\nabla\hat v(S_0, \theta_0^1)$
 h = 2:  $\theta_1^2 = \theta_0^2 + \alpha\big[G_0^{\lambda|2} - \hat v(S_0, \theta_0^2)\big]\nabla\hat v(S_0, \theta_0^2)$,  $\theta_2^2 = \theta_1^2 + \alpha\big[G_1^{\lambda|2} - \hat v(S_1, \theta_1^2)\big]\nabla\hat v(S_1, \theta_1^2)$
 h = 3:  $\theta_1^3 = \theta_0^3 + \alpha\big[G_0^{\lambda|3} - \hat v(S_0, \theta_0^3)\big]\nabla\hat v(S_0, \theta_0^3)$,  $\theta_2^3 = \theta_1^3 + \alpha\big[G_1^{\lambda|3} - \hat v(S_1, \theta_1^3)\big]\nabla\hat v(S_1, \theta_1^3)$,  $\theta_3^3 = \theta_2^3 + \alpha\big[G_2^{\lambda|3} - \hat v(S_2, \theta_2^3)\big]\nabla\hat v(S_2, \theta_2^3)$,  and so on up to h = T.
 The weight vectors form a triangle $\theta_t^h$, $0 \le t \le h \le T$; true online TD(λ) computes just its diagonal $\theta_h^h$, cheaply (for linear FA).

  20. True online TD(λ):
 $\theta_{t+1} \doteq \theta_t + \alpha\delta_t e_t + \alpha\big(\theta_t^\top\phi_t - \theta_{t-1}^\top\phi_t\big)(e_t - \phi_t),$
 $e_t \doteq \gamma\lambda\, e_{t-1} + \big(1 - \alpha\gamma\lambda\, e_{t-1}^\top\phi_t\big)\phi_t$  (the dutch trace).
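A sketch of one episode of true online TD(λ) for linear $\hat v$, written in the usual form that carries the previous prediction forward as v_old (algebraically the same update as above). The environment interface and phi are the same illustrative assumptions as in the earlier TD(λ) sketch:

```python
import numpy as np

def true_online_td_lambda_episode(env, phi, theta, alpha, gamma, lam):
    """Sketch of one episode of true online TD(lambda) with the dutch trace.

    v_old carries theta_{t-1}^T phi_t across steps, so the update below equals
    theta + alpha*delta*e + alpha*(theta^T phi - theta_old^T phi)*(e - phi).
    """
    s = env.reset()
    x = phi(s)
    e = np.zeros_like(theta)
    v_old = 0.0
    done = False
    while not done:
        s_next, r, done = env.step(env.sample_action())
        x_next = np.zeros_like(x) if done else phi(s_next)
        v = theta @ x
        v_next = theta @ x_next
        delta = r + gamma * v_next - v
        e = gamma * lam * e + (1 - alpha * gamma * lam * (e @ x)) * x   # dutch trace
        theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * x
        v_old = v_next
        x = x_next
    return theta
```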

  21. Accumulating, Dutch, and Replacing Traces. All traces fade the same way (by γλ per step), but they increment differently at the times of state visits. [Figure: accumulating traces, dutch traces (α = 0.5), and replacing traces for the same sequence of visits to a state.]
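A small sketch contrasting the three increments for a tabular (one-hot) feature; only the bump at the visited state's component differs. The helper name is my own, and α = 0.5 matches the figure's dutch-trace setting:

```python
def decay_then_increment(e, s, gamma, lam, kind, alpha=0.5):
    """e is a NumPy vector of per-state traces; s is the visited state index.

    All three kinds first decay every component by gamma*lambda; they differ
    only in how the visited state's component is then bumped.
    """
    e = gamma * lam * e
    if kind == "accumulating":
        e[s] += 1.0
    elif kind == "replacing":
        e[s] = 1.0
    elif kind == "dutch":               # one-hot special case of the dutch trace
        e[s] += 1.0 - alpha * e[s]
    return e
```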

  22. The simplest example of deriving a backward view from a forward view: Monte Carlo learning of a final target. We will derive dutch traces, showing that dutch traces really are not about TD; they are about efficiently implementing online algorithms.

  23. The Problem: predict the final target Z with linear function approximation.
 Time: 0, 1, 2, ..., T-1, T (then the next episode begins).
 Data: $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}$, then Z.
 Weights: θ_0 at every step during the episode, then θ_T from time T (all updates done at time T).
 Predictions: $\theta_0^\top\phi_0, \theta_0^\top\phi_1, \theta_0^\top\phi_2, \ldots, \theta_0^\top\phi_{T-1}$, each intended to approximate Z.
 MC: $\theta_{t+1} \doteq \theta_t + \alpha_t\big(Z - \theta_t^\top\phi_t\big)\phi_t, \quad t = 0, \ldots, T-1$  ($\alpha_t$ is the step size; all updates are done at time T).

  24. Computational goals. Computation per step (including memory) must be:
 1. Constant (non-increasing with the number of episodes)
 2. Proportionate (proportional to the number of weights, i.e. O(n))
 3. Independent of span (not increasing with episode length)
 In general, the predictive span is the number of steps between making a prediction and observing the outcome.
 MC: $\theta_{t+1} \doteq \theta_t + \alpha_t\big(Z - \theta_t^\top\phi_t\big)\phi_t, \quad t = 0, \ldots, T-1$, all done at time T.
 What is the span? T. Is MC independent of span? No.

  25. Computational goals. Computation per step (including memory) must be:
 1. Constant (non-increasing with the number of episodes)
 2. Proportionate (proportional to the number of weights, i.e. O(n))
 3. Independent of span (not increasing with episode length)
 MC: $\theta_{t+1} \doteq \theta_t + \alpha_t\big(Z - \theta_t^\top\phi_t\big)\phi_t, \quad t = 0, \ldots, T-1$, all done at time T. The computation and memory needed at step T increase with T ⇒ not independent of span.

  26. Final Result. Given: $\theta_0$ and $\phi_0, \phi_1, \phi_2, \ldots, \phi_{T-1}, Z$.
 MC algorithm: $\theta_{t+1} \doteq \theta_t + \alpha_t\big(Z - \theta_t^\top\phi_t\big)\phi_t, \quad t = 0, \ldots, T-1.$
 Equivalent independent-of-span algorithm ($a_t \in \mathbb{R}^n$, $e_t \in \mathbb{R}^n$):
 $\theta_T \doteq a_{T-1} + Z e_{T-1},$
 $a_0 \doteq \theta_0 - \alpha_0\phi_0\phi_0^\top\theta_0$, then $a_t \doteq a_{t-1} - \alpha_t\phi_t\phi_t^\top a_{t-1}, \quad t = 1, \ldots, T-1,$
 $e_0 \doteq \alpha_0\phi_0$, then $e_t \doteq e_{t-1} - \alpha_t\phi_t\phi_t^\top e_{t-1} + \alpha_t\phi_t, \quad t = 1, \ldots, T-1.$
 Proved: the independent-of-span algorithm produces exactly the same final $\theta_T$ as MC.
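A minimal numerical check of the claimed equivalence, with $a_0$ including the first step's decay as written above; the array sizes, step sizes, and target value are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 10, 4
phi = rng.normal(size=(T, n))          # phi_0 .. phi_{T-1}
alpha = np.full(T, 0.1)                # per-step step sizes alpha_t
Z = 2.5                                # final target, revealed at time T
theta0 = rng.normal(size=n)

# MC algorithm: all T updates done at time T, using the stored phi's and Z.
theta = theta0.copy()
for t in range(T):
    theta = theta + alpha[t] * (Z - theta @ phi[t]) * phi[t]

# Independent-of-span algorithm: update a_t and e_t online, use Z only once.
a = theta0 - alpha[0] * phi[0] * (phi[0] @ theta0)   # a_0
e = alpha[0] * phi[0]                                # e_0
for t in range(1, T):
    a = a - alpha[t] * phi[t] * (phi[t] @ a)
    e = e - alpha[t] * phi[t] * (phi[t] @ e) + alpha[t] * phi[t]
theta_span_free = a + Z * e

print(np.allclose(theta, theta_span_free))           # True
```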
