  1. Efficient Planning (R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction)

  2. Tuesday class summary
     - Planning: any computational process that uses a model to create or improve a policy
     - Dyna framework: [architecture diagram]
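
     As a concrete reminder of the Dyna framework, here is a minimal tabular Dyna-Q sketch. The `env.reset()`, `env.step(s, a)`, and `env.actions(s)` calls are hypothetical interfaces assumed for the illustration, not code from the course:

     ```python
     import random
     from collections import defaultdict

     def dyna_q(env, episodes=50, n_planning=5, alpha=0.1, gamma=0.95, epsilon=0.1):
         """Tabular Dyna-Q: learn from real experience, learn a model, plan from the model."""
         Q = defaultdict(float)        # Q[(state, action)] -> action value
         model = {}                    # model[(state, action)] -> (reward, next_state, done)

         def epsilon_greedy(s):
             if random.random() < epsilon:
                 return random.choice(env.actions(s))
             return max(env.actions(s), key=lambda a: Q[(s, a)])

         def td_update(s, a, r, s2, done):
             target = r if done else r + gamma * max(Q[(s2, b)] for b in env.actions(s2))
             Q[(s, a)] += alpha * (target - Q[(s, a)])

         for _ in range(episodes):
             s, done = env.reset(), False
             while not done:
                 a = epsilon_greedy(s)
                 r, s2, done = env.step(s, a)      # real experience
                 td_update(s, a, r, s2, done)      # direct RL update
                 model[(s, a)] = (r, s2, done)     # model learning (deterministic model)
                 for _ in range(n_planning):       # planning: replay simulated experience
                     ps, pa = random.choice(list(model))
                     pr, ps2, pdone = model[(ps, pa)]
                     td_update(ps, pa, pr, ps2, pdone)
                 s = s2
         return Q
     ```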

  3. Questions during class
     - "Why use simulated experience? Can't you directly compute the solution based on the model?"
     - "Wouldn't it be better to plan backwards from the goal?"

  4. How to Achieve Efficient Planning?
     - What type of backup is better?
       - Sample vs. full backups
       - Incremental vs. less incremental backups
     - How to order the backups?

  5. What is Efficient Planning?
     Planning algorithm A is more efficient than planning algorithm B if:
     - it can compute the optimal policy (or value function) in less time, or
     - given the same amount of computation time, it improves the policy (or value function) more.

  6. What backup type is best?

  7. Full vs. Sample Backups
     [Diagram: backup diagrams pairing each full backup (DP) with its one-step-TD sample counterpart]
     - v_pi: full backup = policy evaluation; sample backup = TD(0)
     - v_*:  full backup = value iteration
     - q_pi: full backup = Q-policy evaluation; sample backup = Sarsa
     - q_*:  full backup = Q-value iteration; sample backup = Q-learning
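
     A minimal sketch of the two backup types for action values (q_*). The interfaces are assumptions for the illustration: `model[(s, a)]` is a list of (probability, reward, next_state) triples, `actions(s)` lists the available actions, and `Q` is a defaultdict(float) keyed by (state, action):

     ```python
     import random

     def full_backup_q(Q, model, actions, s, a, gamma=0.95):
         """Full (expected) backup: sum over all successors, weighted by the model's probabilities."""
         return sum(p * (r + gamma * max(Q[(s2, b)] for b in actions(s2)))
                    for p, r, s2 in model[(s, a)])

     def sample_backup_q(Q, model, actions, s, a, alpha=0.1, gamma=0.95):
         """Sample backup (Q-learning style): draw one successor from the model, step toward it."""
         probs, outcomes = zip(*[(p, (r, s2)) for p, r, s2 in model[(s, a)]])
         r, s2 = random.choices(outcomes, weights=probs, k=1)[0]
         target = r + gamma * max(Q[(s2, b)] for b in actions(s2))
         return Q[(s, a)] + alpha * (target - Q[(s, a)])
     ```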

  8. Full vs. Sample Backups
     [Plot: RMS error in the value estimate max_a Q(s0, a0) versus number of computations,
     comparing one full backup against repeated sample backups for branching factors
     b = 2, 10, 100, 1000, and 10,000]
     - Setup: b successor states, equally likely; initial error = 1; all next states' values assumed correct
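
     A rough sketch in the spirit of this experiment (not the original code): it estimates how the error of the sample-average estimate shrinks with the number of sampled successors, for one state-action pair with b equally likely successors whose values are taken as correct:

     ```python
     import numpy as np

     def sample_backup_rms_error(b, n_samples, n_runs=2000, seed=0):
         """RMS error of the estimate after n_samples sample backups, averaged over n_runs
         random problems; a full backup would need b computations and give zero error."""
         rng = np.random.default_rng(seed)
         sq_err = 0.0
         for _ in range(n_runs):
             successor_values = rng.normal(size=b)      # "correct" next-state values
             true_value = successor_values.mean()       # what the full backup would compute
             samples = rng.choice(successor_values, size=n_samples, replace=True)
             sq_err += (samples.mean() - true_value) ** 2
         return float(np.sqrt(sq_err / n_runs))
     ```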

  9. Small Backups
     - Small backups are single-successor backups based on the model
     - Small backups have the same computational complexity as sample backups
     - Small backups have no sampling error
     - Small backups require storage for 'old' values

  10. Main Idea behind Small Backups
      Consider an estimate $A$ that is constructed as a weighted sum of estimates $X_i$:
          full backup:  $A \leftarrow \sum_i w_i X_i$
      What can we do if we know that only a single successor, $X_j$, has changed value since the
      last backup? Let $x_j$ be the old value of $X_j$ that was used to construct the current
      value of $A$. The value of $A$ can then be updated for that single successor by adding the
      weighted difference between the new and the old value:
          small backup:  $A \leftarrow A + w_j (X_j - x_j)$
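
      A direct transcription of the two update rules as code, with `weights`, `estimates`, and the stored old value `x_j_old` as assumed inputs:

      ```python
      def full_backup(weights, estimates):
          """Full backup: recompute A from scratch as a weighted sum of all successor estimates."""
          return sum(w * x for w, x in zip(weights, estimates))

      def small_backup(A, w_j, x_j_new, x_j_old):
          """Small backup: only successor j changed since the last backup, so adjust A by the
          weighted difference between its new value and the old value stored for it."""
          return A + w_j * (x_j_new - x_j_old)
      ```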

  11. Small vs. Sample Backups
      [Plots: normalized RMS error on a small task with rewards r_left and r_right, comparing a
      sample backup against a small backup. Top: sample backup = TD(0) with a decaying step-size,
      plotted against the step-size decay. Bottom: sample backup = TD(0) with a constant
      step-size, plotted against alpha. The small backup's error is shown for comparison.]

  12. Small vs. Sample Backups
      [Figure: a small example with states A, B, and C, transition probabilities 0.667 and 0.333,
      and bar charts of the estimated transition probabilities and state values for states A and B]

  13. Backup Ordering

  14. Backup Ordering
      Do forever:
          1) Select a state $s \in S$ according to some selection strategy $H$
          2) Apply a full backup to $s$:
             $V(s) \leftarrow \max_a \big[ r(s,a) + \sum_{s'} \hat{p}(s'|s,a) V(s') \big]$
      Asynchronous Value Iteration: for every selection strategy $H$ that selects each state
      infinitely often, the values $V$ converge to the optimal value function $V^*$.
      The rate of convergence depends strongly on the selection strategy $H$.
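
      A sketch of asynchronous value iteration with the selection strategy H passed in as a function. The `model`, `actions`, and `select_state` interfaces are assumptions, and the backup is written undiscounted to match the slide's formula:

      ```python
      def asynchronous_value_iteration(states, actions, model, select_state, n_backups=100_000):
          """Asynchronous value iteration sketch: model[(s, a)] is a list of
          (probability, reward, next_state) triples and select_state(V) implements H."""
          V = {s: 0.0 for s in states}
          for _ in range(n_backups):
              s = select_state(V)              # strategy H: any rule that picks every state
              V[s] = max(                      # infinitely often converges to V*
                  sum(p * (r + V[s2]) for p, r, s2 in model[(s, a)])
                  for a in actions(s)
              )
          return V
      ```

      For example, `select_state=lambda V: random.choice(list(V))` gives random sweeps; the prioritized strategies on the following slides differ only in how H is chosen.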

  15. The Trade-Off
      For any effective ordering strategy, the cost saved by performing fewer backups should
      outweigh the cost of maintaining the ordering:
          cost to maintain ordering  <  cost savings due to fewer backups

  16. Prioritized Sweeping
      Which states or state-action pairs should be generated during planning?
      Work backwards from states whose values have just changed:
      - Maintain a queue of state-action pairs whose values would change a lot if backed up,
        prioritized by the size of the change
      - When a backup changes a value, insert the predecessors of that state according to their
        priorities
      - Always perform backups from the first pair in the queue
      Moore & Atkeson 1993; Peng & Williams 1993; improved by McMahan & Gordon 2005 and
      van Seijen 2013
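
      A sketch of the planning loop described above, assuming a deterministic learned model `model[(s, a)] = (reward, next_state)`, a `predecessors` table, an `actions(s)` helper, and a defaultdict `Q`; it follows the queue logic on this slide rather than any one published variant:

      ```python
      import heapq
      import itertools

      def prioritized_sweeping_planning(Q, model, predecessors, actions, s, a,
                                        gamma=0.95, alpha=0.5, theta=1e-4, n=5):
          """Planning after one real transition from (s, a): repeatedly back up the pair whose
          value would change the most, then enqueue its predecessors."""
          tie = itertools.count()    # tie-breaker so the heap never compares states directly

          def priority(s, a):
              r, s2 = model[(s, a)]
              best_next = max((Q[(s2, b)] for b in actions(s2)), default=0.0)
              return abs(r + gamma * best_next - Q[(s, a)])

          queue = []                 # min-heap of (-priority, tie, s, a): largest change first
          if priority(s, a) > theta:
              heapq.heappush(queue, (-priority(s, a), next(tie), s, a))

          for _ in range(n):         # n planning backups per real interaction
              if not queue:
                  break
              _, _, s, a = heapq.heappop(queue)
              r, s2 = model[(s, a)]
              best_next = max((Q[(s2, b)] for b in actions(s2)), default=0.0)
              Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
              for ps, pa in predecessors[s]:        # predecessors of the backed-up state
                  if priority(ps, pa) > theta:
                      heapq.heappush(queue, (-priority(ps, pa), next(tie), ps, pa))
          return Q
      ```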

  17. Moore and Atkeson's Prioritized Sweeping
      Published in 1993.

  18. Prioritized Sweeping vs. Dyna-Q
      Both use n = 5 backups per environmental interaction

  19. Bellman Error Ordering
      The Bellman error is a measure of the difference between the current value of a state and
      its value after a full backup:
          $BE(s) = \Big| V(s) - \max_a \big[ r(s,a) + \sum_{s'} \hat{p}(s'|s,a) V(s') \big] \Big|$
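
      The same quantity as code, keeping the slide's undiscounted formula. The `model[(s, a)]` list of (probability, reward, next_state) triples and the `actions(s)` helper are assumed interfaces:

      ```python
      def bellman_error(V, s, actions, model):
          """Bellman error of state s: how much a full backup would change V(s)."""
          backed_up = max(
              sum(p * (r + V[s2]) for p, r, s2 in model[(s, a)])
              for a in actions(s)
          )
          return abs(V[s] - backed_up)
      ```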

  20. Bellman Error Ordering
      initialize V(s) arbitrarily for all s
      compute BE(s) for all s
      loop  {until convergence}
          select the state s' with the worst Bellman error
          perform a full backup of s'
          BE(s') ← 0
          for all predecessor states s̄ of s' do
              recompute BE(s̄)
          end for
      end loop
      To get a positive trade-off, the computation time of a Bellman-error update must be much
      smaller than that of a full backup:  comp. time BE  <<  comp. time full backup
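
      A self-contained sketch of this loop under the same assumed model interface; scanning for the worst Bellman error is exactly the kind of ordering cost the trade-off slide warns about, and a priority queue is the usual way to keep it cheap:

      ```python
      def bellman_error_ordering(states, actions, model, tol=1e-6, max_backups=100_000):
          """Always fully back up the state with the largest Bellman error, then refresh the
          Bellman errors of its predecessors (undiscounted, as on the slide)."""
          V = {s: 0.0 for s in states}

          def backed_up(s):                    # value of s after a full backup
              return max(sum(p * (r + V[s2]) for p, r, s2 in model[(s, a)])
                         for a in actions(s))

          preds = {s: set() for s in states}   # predecessor table derived from the model
          for (s, a), outcomes in model.items():
              for _, _, s2 in outcomes:
                  preds[s2].add(s)

          BE = {s: abs(V[s] - backed_up(s)) for s in states}
          for _ in range(max_backups):
              s = max(BE, key=BE.get)          # state with the worst Bellman error
              if BE[s] < tol:
                  break                        # converged
              V[s] = backed_up(s)              # full backup
              BE[s] = 0.0
              for sb in preds[s]:              # only predecessors of s can have changed errors
                  BE[sb] = abs(V[sb] - backed_up(sb))
          return V
      ```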

  21. Prioritized Sweeping with Small Backups
      initialize V(s) arbitrarily for all s
      initialize U(s) = V(s) for all s
      initialize Q(s,a) = V(s) for all s, a
      initialize N_sa and N_sa^{s'} to 0 for all s, a, s'
      loop  {over episodes}
          initialize s
          repeat  {for each step in the episode}
              select action a, based on Q(s, ·)
              take action a, observe r and s'
              N_sa ← N_sa + 1;  N_sa^{s'} ← N_sa^{s'} + 1
              Q(s,a) ← [ Q(s,a)(N_sa − 1) + r + γ V(s') ] / N_sa
              V(s) ← max_b Q(s,b)
              p ← |V(s) − U(s)|
              if s is on the queue, set its priority to p; otherwise, add it with priority p
              for a number of update cycles do
                  remove the top state s̄' from the queue
                  ΔU ← V(s̄') − U(s̄')
                  U(s̄') ← V(s̄')
                  for all (s̄, ā) pairs with N_s̄ā^{s̄'} > 0 do
                      Q(s̄, ā) ← Q(s̄, ā) + γ (N_s̄ā^{s̄'} / N_s̄ā) · ΔU
                      V(s̄) ← max_b Q(s̄, b)
                      p ← |V(s̄) − U(s̄)|
                      if s̄ is on the queue, set its priority to p; otherwise, add it with priority p
                  end for
              end for
              s ← s'
          until s is terminal
      end loop

  22. Empirical Comparison
      [Plot: RMS error, averaged over the first 10^5 observations, versus computation time per
      observation (seconds) for prioritized sweeping variants (Moore & Atkeson; Wiering &
      Schmidhuber; Peng & Williams), prioritized sweeping with small backups, and value
      iteration; the initial error is indicated for reference]

  23. Trajectory Sampling
      - Trajectory sampling: perform backups along simulated trajectories
      - This samples from the on-policy distribution
      - Advantages when function approximation is used (Chapter 8)
      - Focusing of computation: can cause vast uninteresting parts of the state space to be
        (usefully) ignored
        [Diagram: starting from the initial states, the states reachable under optimal control form
        a small part of the full state space; irrelevant states can be ignored]
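
      A sketch of trajectory sampling with one-step full backups of the visited state-action pairs, where the model is only sampled to decide where the simulated trajectory goes next. The interfaces are assumptions: `model[(s, a)]` is a list of (probability, reward, next_state) triples with `None` marking termination, `actions(s)` lists actions, and `Q` is a defaultdict(float):

      ```python
      import random

      def on_policy_trajectory_planning(Q, model, actions, start_state, n_trajectories=100,
                                        max_steps=100, gamma=1.0, epsilon=0.1):
          """Back up state-action pairs in the order they are visited along trajectories
          simulated under the current epsilon-greedy policy (the on-policy distribution)."""

          def full_backup(s, a):
              return sum(p * (r + (0.0 if s2 is None
                                   else gamma * max(Q[(s2, b)] for b in actions(s2))))
                         for p, r, s2 in model[(s, a)])

          for _ in range(n_trajectories):
              s = start_state
              for _ in range(max_steps):
                  acts = actions(s)
                  if random.random() < epsilon:
                      a = random.choice(acts)
                  else:
                      a = max(acts, key=lambda b: Q[(s, b)])
                  Q[(s, a)] = full_backup(s, a)        # full backup of the visited pair
                  # sample one outcome from the model to continue the simulated trajectory
                  probs, outcomes = zip(*[(p, (r, s2)) for p, r, s2 in model[(s, a)]])
                  r, s2 = random.choices(outcomes, weights=probs, k=1)[0]
                  if s2 is None:
                      break
                  s = s2
          return Q
      ```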

  24. Trajectory Sampling Experiment
      - One-step full tabular backups
      - Uniform: cycled through all state-action pairs
      - On-policy: backed up along simulated trajectories
      - 200 randomly generated undiscounted episodic tasks
      - 2 actions for each state, each leading to b equally likely next states
      - 0.1 probability of transition to the terminal state
      - Expected reward on each transition drawn from a Gaussian with mean 0 and variance 1
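
      A sketch of how one such random task could be generated, producing a model in the same (probability, reward, next_state) format used in the sketch above; this is my reading of the slide's setup, not the original experiment code:

      ```python
      import random

      def generate_task(n_states, b, seed=0):
          """One random episodic task: 2 actions per state, each with b equally likely next
          states, a 0.1 chance of termination, and N(0, 1) expected rewards."""
          rng = random.Random(seed)
          model = {}
          for s in range(n_states):
              for a in range(2):
                  successors = [rng.randrange(n_states) for _ in range(b)]
                  rewards = [rng.gauss(0.0, 1.0) for _ in range(b)]
                  # each successor gets probability 0.9 / b; None (terminal) gets 0.1
                  outcomes = [(0.9 / b, r, s2) for r, s2 in zip(rewards, successors)]
                  outcomes.append((0.1, rng.gauss(0.0, 1.0), None))
                  model[(s, a)] = outcomes
          return model
      ```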

  25. Heuristic Search
      - Used for action selection, not for changing a value function (the heuristic evaluation function)
      - Backed-up values are computed, but typically discarded
      - An extension of the idea of a greedy policy, only deeper
      - Also suggests ways to select states to back up: smart focusing
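
      A sketch of depth-limited heuristic search used purely for action selection, under the same assumed model format; the backed-up values are computed inside the search and then thrown away:

      ```python
      def heuristic_search_action(s, actions, model, heuristic_value, depth=2, gamma=0.95):
          """Expand the model a few steps ahead, back values up greedily from heuristic
          evaluations at the leaves, and return only the best root action."""

          def value(s, d):
              if s is None:                      # terminal successor contributes no future value
                  return 0.0
              if d == 0:
                  return heuristic_value(s)      # leaf: heuristic evaluation function
              return max(                        # greedy (max) backup over actions
                  sum(p * (r + gamma * value(s2, d - 1)) for p, r, s2 in model[(s, a)])
                  for a in actions(s)
              )

          return max(actions(s),
                     key=lambda a: sum(p * (r + gamma * value(s2, depth - 1))
                                       for p, r, s2 in model[(s, a)]))
      ```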
