
Refresh Your Knowledge 6: Experience Replay in Deep Q-Learning - PowerPoint PPT Presentation



  1. Lecture 7: Imitation Learning in Large State Spaces. Emma Brunskill, CS234 Reinforcement Learning, Winter 2020. With slides from Katerina Fragkiadaki and Pieter Abbeel.

  2. Refresh Your Knowledge 6. Experience replay in deep Q-learning (select all):
     1. Involves using a bank of prior (s, a, r, s') tuples and doing Q-learning updates on the tuples in the bank
     2. Always uses the most recent history of tuples
     3. Reduces the data efficiency of DQN
     4. Increases the computational cost
     5. Not sure
     Answer: It increases the computational cost, it uses a bank of tuples and samples from them, it is likely to increase the data efficiency, and it does not have to always use the most recent history of tuples.
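
To make the answer concrete, here is a minimal replay-buffer sketch in Python (the class name, capacity, and batch size are illustrative assumptions, not settings from the lecture): it keeps a bank of prior (s, a, r, s') tuples and samples a random minibatch for each Q-learning update, rather than always using the most recent transitions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bank of (s, a, r, s_next, done) tuples sampled for Q-learning updates."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are dropped when full

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random minibatch: updates are not limited to recent history
        return random.sample(self.buffer, batch_size)
```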

  3. Deep RL. Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL. Some immediate improvements (many others!):
     - Double DQN (Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016)
     - Prioritized Replay (Prioritized Experience Replay, Schaul et al., ICLR 2016)
     - Dueling DQN, best paper ICML 2016 (Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., ICML 2016)

  4. Class Structure. Last time: CNNs and deep reinforcement learning. This time: deep RL and imitation learning in large state spaces. Next time: policy search.

  5. Double DQN. Recall the maximization bias challenge: the max of the estimated state-action values can be a biased estimate of the true max. Double Q-learning was introduced to address this.
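
A quick numerical illustration of the maximization bias (an illustrative simulation, not from the slides): when several actions all have true value 0 but their value estimates are noisy, each individual estimate is unbiased, yet the max over the estimates is biased upward.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 5, 100000, 1.0

# True Q-values are all 0; each estimate is the true value plus zero-mean noise.
noisy_estimates = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

print("mean estimate per action (about 0):", noisy_estimates.mean(axis=0).round(3))
print("mean of max over actions (clearly > 0):", noisy_estimates.max(axis=1).mean().round(3))
```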

  6. Recall: Double Q-Learning (shown here with a lookup-table representation for the state-action value).
     1: Initialize Q1(s, a) and Q2(s, a) for all s ∈ S, a ∈ A; t = 0; initial state s_t = s_0
     2: loop
     3:   Select a_t using ε-greedy π(s) = arg max_a [Q1(s_t, a) + Q2(s_t, a)]
     4:   Observe (r_t, s_{t+1})
     5:   if (with 0.5 probability) then
     6:     Q1(s_t, a_t) ← Q1(s_t, a_t) + α (r_t + Q1(s_{t+1}, arg max_{a'} Q2(s_{t+1}, a')) − Q1(s_t, a_t))
     7:   else
     8:     Q2(s_t, a_t) ← Q2(s_t, a_t) + α (r_t + Q2(s_{t+1}, arg max_{a'} Q1(s_{t+1}, a')) − Q2(s_t, a_t))
     9:   end if
     10:  t = t + 1
     11: end loop
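
A minimal Python sketch of the update branches above (illustrative only: Q1 and Q2 are assumed to be numpy arrays indexed by [state, action], and gamma is added as a parameter even though the slide's version is written without a discount factor):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """One double Q-learning backup, mirroring lines 5-9 of the algorithm."""
    if np.random.rand() < 0.5:
        # Q2 picks the next action, Q1 evaluates it (as written on the slide)
        a_next = int(np.argmax(Q2[s_next]))
        Q1[s, a] += alpha * (r + gamma * Q1[s_next, a_next] - Q1[s, a])
    else:
        a_next = int(np.argmax(Q1[s_next]))
        Q2[s, a] += alpha * (r + gamma * Q2[s_next, a_next] - Q2[s, a])

def epsilon_greedy_action(Q1, Q2, s, eps=0.1):
    """Act epsilon-greedily with respect to Q1 + Q2 (line 3 of the algorithm)."""
    n_actions = Q1.shape[1]
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q1[s] + Q2[s]))
```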

  7. Double DQN. Extend this idea to DQN: the current Q-network w is used to select actions, and the older Q-network w⁻ is used to evaluate actions:
     Δw = α ( r + γ Q̂(s', arg max_{a'} Q̂(s', a'; w); w⁻) − Q̂(s, a; w) ) ∇_w Q̂(s, a; w)
     Action selection uses w (the inner arg max); action evaluation uses w⁻ (the outer Q̂).
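
A hedged numpy sketch of the Double DQN target inside that update, assuming q_online(s) and q_target(s) return the vector of action values under the current weights w and the older weights w⁻ respectively (both function names are illustrative, not part of the lecture):

```python
import numpy as np

def double_dqn_target(r, s_next, q_online, q_target, gamma=0.99, done=False):
    """Double DQN target: select the action with the online net, evaluate it with the older net."""
    if done:
        return r
    a_star = int(np.argmax(q_online(s_next)))     # action selection: w
    return r + gamma * q_target(s_next)[a_star]   # action evaluation: w-
```

The plain DQN target would instead take the max over q_target(s_next), using the same network for both selection and evaluation, which is what introduces the maximization bias.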

  8. Double DQN results. Figure: van Hasselt, Guez, Silver, 2015.

  9. Deep RL (outline revisited). Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL. Some immediate improvements (many others!):
     - Double DQN (Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016)
     - Prioritized Replay (Prioritized Experience Replay, Schaul et al., ICLR 2016)
     - Dueling DQN, best paper ICML 2016 (Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., ICML 2016)

  10. Check Your Understanding: Mars Rover Model-Free Policy Evaluation
      Seven states s1, ..., s7 in a row, with rewards r(s1) = +1, r(s7) = +10, and 0 in all other states (figure omitted).
      π(s) = a1 for all s, γ = 1. Any action from s1 and s7 terminates the episode.
      Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, +1, terminal)
      First-visit MC estimate of V for each state? [1 1 1 0 0 0 0]
      TD estimate of all states (initialized at 0) with α = 1 is [1 0 0 0 0 0 0]
      Choose 2 "replay" backups to do. Which should we pick to get an estimate closest to the MC first-visit estimate?
      1. Doesn't matter, any will yield the same
      2. (s3, a1, 0, s2) then (s2, a1, 0, s1)
      3. (s2, a1, 0, s1) then (s3, a1, 0, s2)
      4. (s2, a1, 0, s1) then (s3, a1, 0, s2)
      5. Not sure

  11. Check Your Understanding: Mars Rover Model-Free Policy Evaluation (same setup and options as the previous slide).
      Answer: (s2, a1, 0, s1) then (s3, a1, 0, s2), yielding V = [1 1 1 0 0 0 0].
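
A short worked check of this example (illustrative code, not from the slides), computing the first-visit MC estimate, the TD(0) estimate along the trajectory, and the effect of replaying the two chosen backups:

```python
import numpy as np

# Indices 0..6 correspond to states s1..s7; gamma = 1 and alpha = 1 as on the slide.
gamma, alpha = 1.0, 1.0
# Each step is (state, reward, next_state); next_state None marks termination.
traj = [(2, 0.0, 1), (1, 0.0, 1), (1, 0.0, 0), (0, 1.0, None)]

# First-visit Monte Carlo estimate
V_mc = np.zeros(7)
returns_to_go, G = [], 0.0
for s, r, _ in reversed(traj):
    G = r + gamma * G
    returns_to_go.append((s, G))
seen = set()
for s, G in reversed(returns_to_go):   # forward order again, keep first visits only
    if s not in seen:
        seen.add(s)
        V_mc[s] = G
print("first-visit MC:", V_mc)         # [1. 1. 1. 0. 0. 0. 0.]

# TD(0) estimate along the trajectory, with V initialized to 0
V = np.zeros(7)
for s, r, s_next in traj:
    target = r + (gamma * V[s_next] if s_next is not None else 0.0)
    V[s] += alpha * (target - V[s])
print("TD(0):", V)                     # [1. 0. 0. 0. 0. 0. 0.]

# Replay the two chosen backups: (s2, a1, 0, s1) then (s3, a1, 0, s2)
for s, r, s_next in [(1, 0.0, 0), (2, 0.0, 1)]:
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
print("after replay:", V)              # [1. 1. 1. 0. 0. 0. 0.]
```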

  12. Impact of Replay? In tabular TD-learning, the order in which updates are replayed can help speed learning. Repeating some updates seems to propagate information better than repeating others. Are there systematic ways to prioritize updates?

  13. Potential Impact of Ordering Episodic Replay Updates. Figure: Schaul, Quan, Antonoglou, Silver, ICLR 2016. An oracle picks the (s, a, r, s') tuple to replay that will minimize the global loss, and achieves an exponential improvement in convergence (measured in the number of updates needed to converge). The oracle is not a practical method, but it illustrates the impact of ordering.

  14. Prioritized Experience Replay
      Let i be the index of the i-th tuple of experience (s_i, a_i, r_i, s_{i+1}). Sample tuples for update using a priority function. The priority of tuple i is proportional to the DQN error:
      p_i = | r + γ max_{a'} Q(s_{i+1}, a'; w⁻) − Q(s_i, a_i; w) |
      Update p_i on every update; p_i for new tuples is set to the maximum value. One method¹: proportional (stochastic prioritization)
      P(i) = p_i^β / Σ_k p_k^β
      β = 0 yields what rule for selecting among existing tuples?
      ¹ See paper for details and an alternative.
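
A minimal sketch of the proportional prioritization above (illustrative only; the actual paper also applies importance-sampling corrections and offers a rank-based variant, neither of which is shown here):

```python
import numpy as np

def priority(r, q_next_target, q_online, a, gamma=0.99):
    """p_i = | r + gamma * max_a' Q(s_{i+1}, a'; w-) - Q(s_i, a_i; w) | for one tuple.

    q_next_target: action-value vector for s_{i+1} under the older weights w-.
    q_online: action-value vector for s_i under the current weights w.
    """
    return abs(r + gamma * np.max(q_next_target) - q_online[a])

def sample_tuple_indices(priorities, batch_size=32, beta=0.6):
    """Proportional stochastic prioritization: P(i) = p_i^beta / sum_k p_k^beta."""
    p = np.asarray(priorities, dtype=float) ** beta
    probs = p / p.sum()
    return np.random.choice(len(priorities), size=batch_size, p=probs)

# With beta = 0, every p_i^beta equals 1, so P(i) = 1/N and sampling is uniform at random.
```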

  15. Exercise: Prioritized Replay
      Let i be the index of the i-th tuple of experience (s_i, a_i, r_i, s_{i+1}). Sample tuples for update using a priority function. The priority of tuple i is proportional to the DQN error:
      p_i = | r + γ max_{a'} Q(s_{i+1}, a'; w⁻) − Q(s_i, a_i; w) |
      Update p_i on every update; p_i for new tuples is set to the maximum value. One method¹: proportional (stochastic prioritization)
      P(i) = p_i^β / Σ_k p_k^β
      β = 0 yields what rule for selecting among existing tuples?
      1. Selects randomly
      2. Selects the one with the highest priority
      3. It depends on the priorities of the tuples
      4. Not sure
      Answer: Selects randomly. With β = 0, every p_i^β = 1, so P(i) = 1/N for all existing tuples, i.e., uniform random selection.

  16. Performance of Prioritized Replay vs. Double DQN. Figure: Schaul, Quan, Antonoglou, Silver, ICLR 2016.

  17. Deep RL (outline revisited). Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL. Some immediate improvements (many others!):
     - Double DQN (Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016)
     - Prioritized Replay (Prioritized Experience Replay, Schaul et al., ICLR 2016)
     - Dueling DQN, best paper ICML 2016 (Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., ICML 2016)
