
Refresh Your Knowledge 6: Experience Replay in Deep Q-Learning - PowerPoint PPT Presentation



  1. Lecture 7: Imitation Learning in Large State Spaces. Emma Brunskill, CS234 Reinforcement Learning, Winter 2020. With slides from Katerina Fragkiadaki and Pieter Abbeel.

  2. Refresh Your Knowledge 6. Experience replay in deep Q-learning (select all):
     1. Involves using a bank of prior (s, a, r, s') tuples and doing Q-learning updates on the tuples in the bank
     2. Always uses the most recent history of tuples
     3. Reduces the data efficiency of DQN
     4. Increases the computational cost
     5. Not sure
     Answer: It increases the computational cost, it uses a bank of tuples and samples from them, it is likely to increase the data efficiency, and it does not have to always use the most recent history of tuples.
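
To make the answer concrete, here is a minimal replay-buffer sketch in Python (the class name, capacity, and batch size are illustrative assumptions, not settings from the lecture): it keeps a bank of prior (s, a, r, s') tuples and samples a random minibatch for each Q-learning update, rather than always using the most recent transitions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bank of (s, a, r, s_next, done) tuples sampled for Q-learning updates."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are dropped when full

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random minibatch: updates are not limited to recent history
        return random.sample(self.buffer, batch_size)
```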

  3. Deep RL. Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL. Some immediate improvements (many others!):
     - Double DQN (Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016)
     - Prioritized Replay (Prioritized Experience Replay, Schaul et al., ICLR 2016)
     - Dueling DQN, best paper ICML 2016 (Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., ICML 2016)

  4. Class Structure. Last time: CNNs and deep reinforcement learning. This time: deep RL and imitation learning in large state spaces. Next time: policy search.

  5. Double DQN. Recall the maximization bias challenge: the max of the estimated state-action values can be a biased estimate of the true max. Double Q-learning was introduced to address this.
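
A quick numerical illustration of the maximization bias (an illustrative simulation, not from the slides): when several actions all have true value 0 but their value estimates are noisy, each individual estimate is unbiased, yet the max over the estimates is biased upward.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 5, 100000, 1.0

# True Q-values are all 0; each estimate is the true value plus zero-mean noise.
noisy_estimates = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

print("mean estimate per action (about 0):", noisy_estimates.mean(axis=0).round(3))
print("mean of max over actions (clearly > 0):", noisy_estimates.max(axis=1).mean().round(3))
```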

  6. Recall: Double Q-Learning (shown here with a lookup-table representation for the state-action value).
     1: Initialize Q1(s, a) and Q2(s, a) for all s ∈ S, a ∈ A; t = 0; initial state s_t = s_0
     2: loop
     3:   Select a_t using ε-greedy π(s) = arg max_a [Q1(s_t, a) + Q2(s_t, a)]
     4:   Observe (r_t, s_{t+1})
     5:   if (with 0.5 probability) then
     6:     Q1(s_t, a_t) ← Q1(s_t, a_t) + α (r_t + Q1(s_{t+1}, arg max_{a'} Q2(s_{t+1}, a')) − Q1(s_t, a_t))
     7:   else
     8:     Q2(s_t, a_t) ← Q2(s_t, a_t) + α (r_t + Q2(s_{t+1}, arg max_{a'} Q1(s_{t+1}, a')) − Q2(s_t, a_t))
     9:   end if
     10:  t = t + 1
     11: end loop
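
A minimal Python sketch of the update branches above (illustrative only: Q1 and Q2 are assumed to be numpy arrays indexed by [state, action], and gamma is added as a parameter even though the slide's version is written without a discount factor):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """One double Q-learning backup, mirroring lines 5-9 of the algorithm."""
    if np.random.rand() < 0.5:
        # Q2 picks the next action, Q1 evaluates it (as written on the slide)
        a_next = int(np.argmax(Q2[s_next]))
        Q1[s, a] += alpha * (r + gamma * Q1[s_next, a_next] - Q1[s, a])
    else:
        a_next = int(np.argmax(Q1[s_next]))
        Q2[s, a] += alpha * (r + gamma * Q2[s_next, a_next] - Q2[s, a])

def epsilon_greedy_action(Q1, Q2, s, eps=0.1):
    """Act epsilon-greedily with respect to Q1 + Q2 (line 3 of the algorithm)."""
    n_actions = Q1.shape[1]
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q1[s] + Q2[s]))
```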

  7. Double DQN. Extend this idea to DQN: the current Q-network w is used to select actions, and the older Q-network w⁻ is used to evaluate actions:
     Δw = α ( r + γ Q̂(s', arg max_{a'} Q̂(s', a'; w); w⁻) − Q̂(s, a; w) ) ∇_w Q̂(s, a; w)
     Action selection uses w (the inner arg max); action evaluation uses w⁻ (the outer Q̂).
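
A hedged numpy sketch of the Double DQN target inside that update, assuming q_online(s) and q_target(s) return the vector of action values under the current weights w and the older weights w⁻ respectively (both function names are illustrative, not part of the lecture):

```python
import numpy as np

def double_dqn_target(r, s_next, q_online, q_target, gamma=0.99, done=False):
    """Double DQN target: select the action with the online net, evaluate it with the older net."""
    if done:
        return r
    a_star = int(np.argmax(q_online(s_next)))     # action selection: w
    return r + gamma * q_target(s_next)[a_star]   # action evaluation: w-
```

The plain DQN target would instead take the max over q_target(s_next), using the same network for both selection and evaluation, which is what introduces the maximization bias.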

  8. Double DQN results. Figure: van Hasselt, Guez, Silver, 2015.

  9. Deep RL (outline revisited). Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL. Some immediate improvements (many others!):
     - Double DQN (Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016)
     - Prioritized Replay (Prioritized Experience Replay, Schaul et al., ICLR 2016)
     - Dueling DQN, best paper ICML 2016 (Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., ICML 2016)

  10. Check Your Understanding: Mars Rover Model-Free Policy Evaluation
      Seven states s1, ..., s7 in a row, with rewards r(s1) = +1, r(s7) = +10, and 0 in all other states (figure omitted).
      π(s) = a1 for all s, γ = 1. Any action from s1 and s7 terminates the episode.
      Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, +1, terminal)
      First-visit MC estimate of V for each state? [1 1 1 0 0 0 0]
      TD estimate of all states (initialized at 0) with α = 1 is [1 0 0 0 0 0 0]
      Choose 2 "replay" backups to do. Which should we pick to get an estimate closest to the MC first-visit estimate?
      1. Doesn't matter, any will yield the same
      2. (s3, a1, 0, s2) then (s2, a1, 0, s1)
      3. (s2, a1, 0, s1) then (s3, a1, 0, s2)
      4. (s2, a1, 0, s1) then (s3, a1, 0, s2)
      5. Not sure

  11. Check Your Understanding: Mars Rover Model-Free Policy Evaluation (same setup and options as the previous slide).
      Answer: (s2, a1, 0, s1) then (s3, a1, 0, s2), yielding V = [1 1 1 0 0 0 0].
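
A short worked check of this example (illustrative code, not from the slides), computing the first-visit MC estimate, the TD(0) estimate along the trajectory, and the effect of replaying the two chosen backups:

```python
import numpy as np

# Indices 0..6 correspond to states s1..s7; gamma = 1 and alpha = 1 as on the slide.
gamma, alpha = 1.0, 1.0
# Each step is (state, reward, next_state); next_state None marks termination.
traj = [(2, 0.0, 1), (1, 0.0, 1), (1, 0.0, 0), (0, 1.0, None)]

# First-visit Monte Carlo estimate
V_mc = np.zeros(7)
returns_to_go, G = [], 0.0
for s, r, _ in reversed(traj):
    G = r + gamma * G
    returns_to_go.append((s, G))
seen = set()
for s, G in reversed(returns_to_go):   # forward order again, keep first visits only
    if s not in seen:
        seen.add(s)
        V_mc[s] = G
print("first-visit MC:", V_mc)         # [1. 1. 1. 0. 0. 0. 0.]

# TD(0) estimate along the trajectory, with V initialized to 0
V = np.zeros(7)
for s, r, s_next in traj:
    target = r + (gamma * V[s_next] if s_next is not None else 0.0)
    V[s] += alpha * (target - V[s])
print("TD(0):", V)                     # [1. 0. 0. 0. 0. 0. 0.]

# Replay the two chosen backups: (s2, a1, 0, s1) then (s3, a1, 0, s2)
for s, r, s_next in [(1, 0.0, 0), (2, 0.0, 1)]:
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
print("after replay:", V)              # [1. 1. 1. 0. 0. 0. 0.]
```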

  12. Impact of Replay? In tabular TD-learning, the order in which updates are replayed can help speed learning. Repeating some updates seems to propagate information better than repeating others. Are there systematic ways to prioritize updates?

  13. Potential Impact of Ordering Episodic Replay Updates. Figure: Schaul, Quan, Antonoglou, Silver, ICLR 2016. An oracle picks the (s, a, r, s') tuple to replay that will minimize the global loss, and achieves an exponential improvement in convergence (measured in the number of updates needed to converge). The oracle is not a practical method, but it illustrates the impact of ordering.

  14. Prioritized Experience Replay
      Let i be the index of the i-th tuple of experience (s_i, a_i, r_i, s_{i+1}). Sample tuples for update using a priority function. The priority of tuple i is proportional to the DQN error:
      p_i = | r + γ max_{a'} Q(s_{i+1}, a'; w⁻) − Q(s_i, a_i; w) |
      Update p_i on every update; p_i for new tuples is set to the maximum value. One method¹: proportional (stochastic prioritization)
      P(i) = p_i^β / Σ_k p_k^β
      β = 0 yields what rule for selecting among existing tuples?
      ¹ See paper for details and an alternative.
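
A minimal sketch of the proportional prioritization above (illustrative only; the actual paper also applies importance-sampling corrections and offers a rank-based variant, neither of which is shown here):

```python
import numpy as np

def priority(r, q_next_target, q_online, a, gamma=0.99):
    """p_i = | r + gamma * max_a' Q(s_{i+1}, a'; w-) - Q(s_i, a_i; w) | for one tuple.

    q_next_target: action-value vector for s_{i+1} under the older weights w-.
    q_online: action-value vector for s_i under the current weights w.
    """
    return abs(r + gamma * np.max(q_next_target) - q_online[a])

def sample_tuple_indices(priorities, batch_size=32, beta=0.6):
    """Proportional stochastic prioritization: P(i) = p_i^beta / sum_k p_k^beta."""
    p = np.asarray(priorities, dtype=float) ** beta
    probs = p / p.sum()
    return np.random.choice(len(priorities), size=batch_size, p=probs)

# With beta = 0, every p_i^beta equals 1, so P(i) = 1/N and sampling is uniform at random.
```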

  15. Exercise: Prioritized Replay
      Let i be the index of the i-th tuple of experience (s_i, a_i, r_i, s_{i+1}). Sample tuples for update using a priority function. The priority of tuple i is proportional to the DQN error:
      p_i = | r + γ max_{a'} Q(s_{i+1}, a'; w⁻) − Q(s_i, a_i; w) |
      Update p_i on every update; p_i for new tuples is set to the maximum value. One method¹: proportional (stochastic prioritization)
      P(i) = p_i^β / Σ_k p_k^β
      β = 0 yields what rule for selecting among existing tuples?
      1. Selects randomly
      2. Selects the one with the highest priority
      3. It depends on the priorities of the tuples
      4. Not sure
      Answer: Selects randomly. With β = 0, every p_i^β = 1, so P(i) = 1/N for all existing tuples, i.e., uniform random selection.

  16. Performance of Prioritized Replay vs. Double DQN. Figure: Schaul, Quan, Antonoglou, Silver, ICLR 2016.

  17. Deep RL (outline revisited). Success in Atari has led to huge excitement in using deep neural networks to do value function approximation in RL. Some immediate improvements (many others!):
     - Double DQN (Deep Reinforcement Learning with Double Q-Learning, van Hasselt et al., AAAI 2016)
     - Prioritized Replay (Prioritized Experience Replay, Schaul et al., ICLR 2016)
     - Dueling DQN, best paper ICML 2016 (Dueling Network Architectures for Deep Reinforcement Learning, Wang et al., ICML 2016)
