Eligibility Traces



  1. Eligibility Traces: Unifying Monte Carlo and TD. Key algorithms: TD(λ), Sarsa(λ), Q(λ).

  2. Unified View. [Diagram: the space of methods spanned by the width of the backup and the height (depth) of the backup, with Temporal-difference learning, Dynamic programming, Monte Carlo, and Exhaustive search at the four corners.]

  3. N-step TD Prediction. Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps). [Backup diagrams: TD (1-step), 2-step, 3-step, …, n-step, Monte Carlo.]

  4. Mathematics of N-step TD Prediction.
     Monte Carlo return: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$
     1-step return (TD), using $V_t$ to estimate the remaining return: $G_t^{(1)} \doteq R_{t+1} + \gamma V_t(S_{t+1})$
     2-step return: $G_t^{(2)} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_t(S_{t+2})$
     n-step return: $G_t^{(n)} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_t(S_{t+n})$
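As a concrete illustration, here is a minimal Python sketch of the n-step return (the function name and argument layout are my own, not from the slides): it sums the first n rewards with discounting and then bootstraps from the current value estimate of the state reached after n steps.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """n-step return: R_{t+1} + g*R_{t+2} + ... + g^(n-1)*R_{t+n} + g^n * V(S_{t+n}).

    `rewards` is the list [R_{t+1}, ..., R_{t+n}] and `bootstrap_value` is the
    current estimate V(S_{t+n}) (use 0.0 if S_{t+n} is terminal)."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r                     # discounted sum of the first n rewards
    return g + (gamma ** len(rewards)) * bootstrap_value
```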

  5. Forward View of TD(λ). Look forward from each state to determine its update from future states and rewards. [Diagram: from S_t, looking ahead along the trajectory S_{t+1}, S_{t+2}, S_{t+3}, … with rewards R_{t+1}, R_{t+2}, R_{t+3}, …, R_T over time.]

  6. Learning with n-step Backups.
     The backup computes an increment: $\Delta_t(S_t) \doteq \alpha \big[ G_t^{(n)} - V_t(S_t) \big]$, with $\Delta_t(s) = 0$ for all $s \neq S_t$.
     On-line updating: $V_{t+1}(s) = V_t(s) + \Delta_t(s)$, for all $s \in \mathcal{S}$.
     Off-line updating: $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta_t(s)$, for all $s \in \mathcal{S}$.
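A sketch of one episode of n-step TD prediction with either on-line or off-line updating, reusing the n_step_return helper above. The env/policy interface (env.reset(), env.step(action) returning (next_state, reward, done), a policy callable, and V indexable by state) is an assumption of mine, not something the slides specify.

```python
def n_step_td_episode(env, policy, V, n, alpha, gamma, online=True):
    """One episode of n-step TD prediction with on-line or off-line updating."""
    states, rewards = [env.reset()], []
    increments = {}                        # off-line increments, applied at the end
    t, T = 0, float('inf')
    while True:
        if t < T:
            s_next, r, done = env.step(policy(states[t]))
            states.append(s_next)
            rewards.append(r)
            if done:
                T = t + 1
        tau = t - n + 1                    # the time whose estimate gets updated
        if tau >= 0:
            last = min(tau + n, T)
            boot = 0.0 if last == T else V[states[last]]
            g = n_step_return(rewards[tau:last], boot, gamma)
            delta = alpha * (g - V[states[tau]])
            if online:
                V[states[tau]] += delta    # apply immediately
            else:
                increments[states[tau]] = increments.get(states[tau], 0.0) + delta
        if tau == T - 1:
            break
        t += 1
    if not online:
        for s, d in increments.items():    # apply accumulated increments after the episode
            V[s] += d
    return V
```

In the off-line branch the increments are computed from the episode-start value estimates and only applied once the episode ends, matching the off-line update rule above.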

  7. Error-reduction Property.
     The n-step return has an error-reduction property: $\max_s \big| \mathbb{E}_\pi[ G_t^{(n)} \mid S_t = s ] - v_\pi(s) \big| \le \gamma^n \max_s \big| V_t(s) - v_\pi(s) \big|$,
     i.e. the maximum error using the n-step return is at most $\gamma^n$ times the maximum error of $V_t$. Using this, you can show that n-step methods converge.

  8. Random Walk Examples. [Diagram: a 5-state random walk with states A, B, C, D, E, started in the middle; all rewards are 0 except a reward of 1 for stepping off the right end.] How does 2-step TD work here? How about 3-step TD?
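To make the example concrete, the walk could be coded as a tiny environment compatible with the sketches above (the class and its reset/step interface are my own assumptions); the usage lines run 2-step TD prediction on it.

```python
import random

class RandomWalk:
    """5-state random walk: states 1..5 are A..E, 0 and 6 are terminal.
    Reward is 1 only for stepping off the right end, otherwise 0."""
    def __init__(self, n_states=5):
        self.n_states = n_states
    def reset(self):
        self.s = (self.n_states + 1) // 2       # start in the middle state (C)
        return self.s
    def step(self, action=None):                # the "policy" is a fair coin flip
        self.s += random.choice([-1, 1])
        done = self.s == 0 or self.s == self.n_states + 1
        reward = 1.0 if self.s == self.n_states + 1 else 0.0
        return self.s, reward, done

V = [0.0] * 7                                   # value estimates; terminals stay 0
env = RandomWalk()
for _ in range(100):                            # 2-step TD prediction
    n_step_td_episode(env, lambda s: None, V, n=2, alpha=0.1, gamma=1.0)
print(V[1:6])    # should drift toward the true values 1/6, 2/6, ..., 5/6
```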

  9. A Larger Example: the 19-state Random Walk. [Plots: RMS error over the first 10 episodes as a function of α, for on-line and off-line n-step TD methods with n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512.] On-line is better than off-line, and an intermediate n is best. Do you think there is an optimal n for every task?

  10. Averaging n-step Returns. n-step methods were introduced to help with understanding TD(λ). Idea: back up an average of several returns, e.g. back up half of the 2-step return and half of the 4-step return: $\tfrac{1}{2} G_t^{(2)} + \tfrac{1}{2} G_t^{(4)}$. Any average of returns works as long as the weights are positive and sum to 1. This is called a complex backup: draw each component and label it with the weight for that component.

  11. Forward View of TD(λ): the λ-return. TD(λ) is a method for averaging all n-step backups, weighting the n-step backup by $\lambda^{n-1}$ (decaying with time since visitation). The λ-return: $G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$. Backup using the λ-return: $\Delta_t(S_t) \doteq \alpha \big[ G_t^\lambda - V_t(S_t) \big]$. [Backup diagram: component weights $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, …, with weight $\lambda^{T-t-1}$ on the final (Monte Carlo) backup; the weights sum to 1.]

  12. λ-return Weighting Function. [Plot: the weight given to each n-step return decays by λ per step; the weight given to the actual, final return is whatever remains, so the total area is 1.] $G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$; the first term covers the returns until termination, the second the weight given after termination.
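A quick numeric check that these weights really sum to 1 (the episode length T, time t, and λ value here are arbitrary choices of mine):

```python
lam, T, t = 0.8, 10, 0
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, T - t)]   # n-step weights
weights.append(lam ** (T - t - 1))                                 # weight on the final return
print(sum(weights))   # 1.0 (up to floating-point rounding)
```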

  13. Relation to TD(0) and MC. The λ-return can be rewritten as $G_t^\lambda \doteq (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$ (returns until termination, plus the weight given after termination).
      If λ = 1, you get Monte Carlo: $G_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} G_t^{(n)} + 1^{T-t-1} G_t = G_t$.
      If λ = 0, you get TD(0): $G_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} G_t^{(n)} + 0^{T-t-1} G_t = G_t^{(1)}$.
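The episodic λ-return can also be written directly as code. This sketch uses my own names and indexing conventions (rewards[k] holds R_{k+1}, values[k] holds V(S_k), T = len(rewards), terminal value 0); setting lam to 0 or 1 reproduces the TD(0) and Monte Carlo targets.

```python
def lambda_return(rewards, values, t, lam, gamma):
    """G_t^lambda for a finished episode of length T = len(rewards)."""
    T = len(rewards)

    def n_step(n, bootstrap):
        # R_{t+1} + g*R_{t+2} + ... + g^(n-1)*R_{t+n} + g^n * bootstrap
        g = sum(gamma ** k * rewards[t + k] for k in range(n))
        return g + gamma ** n * bootstrap

    # (1 - lam) * sum of lam^(n-1) * n-step returns, until termination ...
    g = sum((1 - lam) * lam ** (n - 1) * n_step(n, values[t + n])
            for n in range(1, T - t))
    # ... plus the remaining weight lam^(T-t-1) on the full (Monte Carlo) return
    return g + lam ** (T - t - 1) * n_step(T - t, 0.0)
```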

  14. Forward View of TD(λ) (repeated). Look forward from each state to determine its update from future states and rewards. [Same trajectory diagram as slide 5.]

  15. λ-return on the Random Walk. [Plots: RMS error over the first 10 episodes as a function of α for the off-line λ-return algorithm (≡ off-line TD(λ) with accumulating traces) and the on-line λ-return algorithm, for λ = 0, .4, .8, .9, .95, .975, .99, 1.] On-line is much better than off-line; intermediate values of λ are best; the λ-return does better than the n-step return.

  16. Backward View. $\delta_t \doteq R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$ and $\Delta V_t(s) \doteq \alpha \delta_t E_t(s)$. [Diagram: δ_t propagated back over time to S_{t-1}, S_{t-2}, S_{t-3}, …, each carrying its eligibility trace E_t(s).] Shout δ_t backwards over time; the strength of your voice decreases with temporal distance by γλ.

  17. Backward View of TD(λ). The forward view was for theory; the backward view is for mechanism. New variable called the eligibility trace, $E_t(s) \in \mathbb{R}^+$. On each step, decay all traces by γλ and increment the trace for the current state by 1 (accumulating trace): $E_t(s) = \gamma\lambda E_{t-1}(s)$ if $s \neq S_t$, and $E_t(s) = \gamma\lambda E_{t-1}(s) + 1$ if $s = S_t$. [Plot: the accumulating trace jumps at the times of visits to a state and decays in between.]

  18. On-line Tabular TD(λ)
      Initialize V(s) arbitrarily (but set to 0 if s is terminal)
      Repeat (for each episode):
          Initialize E(s) = 0, for all s ∈ S
          Initialize S
          Repeat (for each step of episode):
              A ← action given by π for S
              Take action A, observe reward R and next state S′
              δ ← R + γV(S′) − V(S)
              E(S) ← E(S) + 1 (accumulating traces)
                or E(S) ← (1 − α)E(S) + 1 (dutch traces)
                or E(S) ← 1 (replacing traces)
              For all s ∈ S:
                  V(s) ← V(s) + αδE(s)
                  E(s) ← γλE(s)
              S ← S′
          until S is terminal
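A minimal Python rendering of this pseudocode, with the trace variant as a parameter; the env/policy interface is the same assumed one used in the earlier sketches, and V is anything indexable by state.

```python
def td_lambda_episode(env, policy, V, alpha, gamma, lam, trace='accumulating'):
    """One episode of on-line tabular TD(lambda)."""
    E = {}                                      # eligibility traces, default 0
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        v_next = 0.0 if done else V[s_next]     # V(terminal) = 0
        delta = r + gamma * v_next - V[s]
        if trace == 'accumulating':
            E[s] = E.get(s, 0.0) + 1.0
        elif trace == 'dutch':
            E[s] = (1 - alpha) * E.get(s, 0.0) + 1.0
        else:                                   # 'replacing'
            E[s] = 1.0
        for state in list(E):                   # update every state with a nonzero trace
            V[state] += alpha * delta * E[state]
            E[state] *= gamma * lam             # then decay its trace
        s = s_next
    return V
```

Passing trace='dutch' or trace='replacing' switches to the other two increment rules compared on slide 22.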

  19. Relation of the Backward View to MC and TD(0). Using the update rule $\Delta V_t(s) \doteq \alpha \delta_t E_t(s)$: as before, if you set λ to 0, you get TD(0); if you set λ to 1, you get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode).

  20. Forward View = Backward View. The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating: $\sum_{t=0}^{T-1} \Delta V^{TD}_t(s) = \sum_{t=0}^{T-1} \Delta V^{\lambda}_t(S_t)\, \mathbb{I}_{sS_t}$ (backward updates on the left, forward updates on the right); by algebra, both sides equal $\sum_{t=0}^{T-1} \alpha\, \mathbb{I}_{sS_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$. On-line updating with small α is similar.

  21. On-line versus Off-line on the Random Walk. [Plots: RMS error over the first 10 episodes as a function of α for off-line TD(λ) with accumulating traces (≡ the off-line λ-return algorithm) and on-line TD(λ) with accumulating traces, λ = 0 to 1, on the same 19-state random walk.] On-line performs better over a broader range of parameters.

  22. Replacing and Dutch Traces. All traces fade the same way, $E_t(s) \doteq \gamma\lambda E_{t-1}(s)$ for all $s \in \mathcal{S}, s \neq S_t$, but the visited state is incremented differently:
      accumulating traces: $E_t(S_t) \doteq \gamma\lambda E_{t-1}(S_t) + 1$
      dutch traces: $E_t(S_t) \doteq (1-\alpha)\gamma\lambda E_{t-1}(S_t) + 1$
      replacing traces: $E_t(S_t) \doteq 1$
      [Plot: the three trace types over the times of state visits; dutch traces shown with α = 0.5.]

  23. Replacing and Dutch Traces on the Random Walk. [Plots: RMS error over the first 10 episodes as a function of α for on-line TD(λ) with replacing traces and with dutch traces, λ = 0, .4, .8, .9, .95, .975, .99, 1.]

  24. All λ Results on the Random Walk. [Six panels of RMS error over the first 10 episodes on the 19-state random walk, as a function of α and for λ = 0 to 1: the off-line λ-return algorithm (= off-line TD(λ) with accumulating traces), the on-line λ-return algorithm, on-line TD(λ) with accumulating traces, on-line TD(λ) with dutch traces, on-line TD(λ) with replacing traces, and true on-line TD(λ) (= the real-time λ-return algorithm).]

  25. Control: Sarsa(λ). Everything changes from states to state-action pairs: $Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t E_t(s,a)$ for all $s, a$, where $\delta_t = R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t)$, and $E_t(s,a) = \gamma\lambda E_{t-1}(s,a) + 1$ if $s = S_t$ and $a = A_t$, else $\gamma\lambda E_{t-1}(s,a)$, for all $s, a$. [Backup diagram: component weights $1-\lambda$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, …, $\lambda^{T-t-1}$; Σ = 1.]

  26. Demo

  27. Sarsa(λ) Algorithm
      Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
      Repeat (for each episode):
          E(s, a) = 0, for all s ∈ S, a ∈ A(s)
          Initialize S, A
          Repeat (for each step of episode):
              Take action A, observe R, S′
              Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
              δ ← R + γQ(S′, A′) − Q(S, A)
              E(S, A) ← E(S, A) + 1
              For all s ∈ S, a ∈ A(s):
                  Q(s, a) ← Q(s, a) + αδE(s, a)
                  E(s, a) ← γλE(s, a)
              S ← S′; A ← A′
          until S is terminal
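A compact Python sketch of tabular Sarsa(λ) with accumulating traces and an ε-greedy policy. Q is a dict keyed by (state, action), actions(s) lists the legal actions, and the env interface is the same assumed one as before; none of these names come from the slides.

```python
import random

def sarsa_lambda_episode(env, Q, actions, alpha, gamma, lam, epsilon=0.1):
    """One episode of tabular Sarsa(lambda) with accumulating traces."""
    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions(s))
        return max(actions(s), key=lambda a: Q.get((s, a), 0.0))

    E = {}                                       # eligibility traces, default 0
    s = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        if done:                                 # Q of a terminal state is 0
            delta = r - Q.get((s, a), 0.0)
        else:
            a_next = epsilon_greedy(s_next)
            delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
        E[(s, a)] = E.get((s, a), 0.0) + 1.0     # accumulating trace increment
        for sa in list(E):                       # update every traced pair
            Q[sa] = Q.get(sa, 0.0) + alpha * delta * E[sa]
            E[sa] *= gamma * lam                 # then decay its trace
        if not done:
            s, a = s_next, a_next
    return Q
```

The replacing or dutch variants would only change the trace-increment line, exactly as on slide 22.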

  28. Sarsa(λ) Gridworld Example. [Diagram: the path taken on one trial, the action values increased by one-step Sarsa, and the action values increased by Sarsa(λ) with λ = 0.9.] With one trial, the agent has much more information about how to get to the goal (though not necessarily the best way). This can considerably accelerate learning.
