  1. Deep Reinforcement Learning [Human-Level Control through deep reinforcement learning, Nature 2015] CS 486/686, University of Waterloo, Lecture 20: July 10, 2017. Slides (c) 2017 P. Poupart

  2. Outline
     • Value Function Approximation
       – Linear approximation
       – Neural network approximation
     • Deep Q-network

  3. Quick recap
     • Markov Decision Processes: value iteration
     • Reinforcement Learning: Q-learning
       Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]
     • Complexity depends on the number of states and actions
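
As a concrete illustration of the tabular Q-learning update recapped above, here is a minimal sketch in Python; the environment interface (env.reset() returns a state, env.step(a) returns (s', r, done)) and the hyperparameter values are illustrative assumptions, not part of the lecture.

    import random
    from collections import defaultdict

    def tabular_q_learning(env, n_actions, episodes=500,
                           alpha=0.1, gamma=0.99, epsilon=0.1):
        """One-step tabular Q-learning with an epsilon-greedy policy."""
        Q = defaultdict(float)  # Q[(s, a)] defaults to 0

        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # Epsilon-greedy action selection
                if random.random() < epsilon:
                    a = random.randrange(n_actions)
                else:
                    a = max(range(n_actions), key=lambda a_: Q[(s, a_)])

                s_next, r, done = env.step(a)   # assumed environment interface

                # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
                # (no bootstrapping from terminal states)
                target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in range(n_actions))
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q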

  4. Large State Spaces
     • Computer Go: roughly 10^170 states
     • Inverted pendulum: 4-dimensional continuous state space (cart position, cart velocity, pole angle, pole angular velocity)
     • Atari: 210 x 160 x 3 dimensions (pixel values)

  5. Functions to be Approximated
     • Policy: π(s) → a
     • Q-function: Q(s,a) → ℝ
     • Value function: V(s) → ℝ

  6. Q-function Approximation
     • Let x(s,a) = (x_1, ..., x_n) be a feature vector for the state-action pair (s,a)
     • Linear: Q_w(s,a) = Σ_i w_i x_i
     • Non-linear (e.g., neural network): Q_w(s,a) = g(x(s,a); w)
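
A minimal sketch of the linear parameterization in Python, assuming a hand-crafted feature vector for each state-action pair; the feature values below are made up for illustration.

    import numpy as np

    # Linear Q-value: Q_w(s,a) = sum_i w_i * x_i = w . x(s,a)
    def linear_q(w, x):
        return np.dot(w, x)

    # Hypothetical example: 4 features and random weights in [-1, 1]
    w = np.random.uniform(-1.0, 1.0, size=4)
    x = np.array([1.0, 0.5, -0.2, 0.0])   # made-up feature vector for some (s,a)
    print(linear_q(w, x))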

  7. Gradient Q-learning
     • Minimize the squared error between the Q-value estimate and the target
       – Q-value estimate: Q_w(s,a)
       – Target: r + γ max_a' Q_w(s',a'), treated as fixed when differentiating
     • Squared error: Err(w) = ½ [Q_w(s,a) − r − γ max_a' Q_w(s',a')]²
     • Gradient: ∂Err/∂w = [Q_w(s,a) − r − γ max_a' Q_w(s',a')] ∂Q_w(s,a)/∂w

  8. Gradient Q-learning
     Initialize weights w at random in [-1, 1]
     Observe current state s
     Loop
       Select action a and execute it
       Receive immediate reward r
       Observe new state s'
       Gradient: ∂Err/∂w = [Q_w(s,a) − r − γ max_a' Q_w(s',a')] ∂Q_w(s,a)/∂w
       Update weights: w ← w − α ∂Err/∂w
       Update state: s ← s'
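
For the linear case, ∂Q_w(s,a)/∂w is simply the feature vector x(s,a), so the loop above can be sketched as follows; the environment interface and the feature function are assumptions for illustration.

    import numpy as np

    def gradient_q_learning(env, features, n_actions, n_features,
                            steps=10_000, alpha=0.01, gamma=0.99, epsilon=0.1):
        """Gradient Q-learning with a linear approximator Q_w(s,a) = w . x(s,a)."""
        w = np.random.uniform(-1.0, 1.0, size=n_features)   # weights at random in [-1, 1]
        q = lambda s, a: np.dot(w, features(s, a))

        s = env.reset()
        for _ in range(steps):
            # Epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: q(s, a_))

            s_next, r, done = env.step(a)        # assumed environment interface

            # Target r + gamma * max_a' Q_w(s',a'), treated as a constant
            target = r if done else r + gamma * max(q(s_next, a_) for a_ in range(n_actions))

            # For linear Q, dQ_w(s,a)/dw = x(s,a)
            w -= alpha * (q(s, a) - target) * features(s, a)

            s = env.reset() if done else s_next
        return w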

  9. Recap: Convergence of Tabular Q-learning
     • Tabular Q-learning converges to the optimal Q-function under the following conditions:
       Σ_t α_t = ∞ and Σ_t α_t² < ∞
     • Let α_n(s,a) = 1 / n(s,a)
       – where n(s,a) is the number of times that (s,a) is visited
     • Q-learning update: Q(s,a) ← Q(s,a) + α_n(s,a) [r + γ max_a' Q(s',a') − Q(s,a)]

  10. Convergence of Linear Gradient Q-learning
     • Linear gradient Q-learning converges under the same conditions:
       Σ_t α_t = ∞ and Σ_t α_t² < ∞
     • Let α_n = 1/n
     • Let Q_w(s,a) = Σ_i w_i x_i(s,a), a linear function of the features
     • Q-learning update: w ← w − α_n [Q_w(s,a) − r − γ max_a' Q_w(s',a')] ∂Q_w(s,a)/∂w

  11. Divergence of non-linear Q-learning
     • Even when the following conditions hold
       Σ_t α_t = ∞ and Σ_t α_t² < ∞
       non-linear Q-learning may diverge
     • Intuition:
       – Adjusting w to increase Q at (s,a) might introduce errors at nearby state-action pairs

  12. Mitigating divergence
     • Two tricks are often used in practice:
       1. Experience replay
       2. Use two networks:
          – Q-network
          – Target network

  13. Experience Replay
     • Idea: store previous experiences in a buffer and, at each step, sample a mini-batch of previous experiences to learn from by Q-learning
     • Advantages
       – Breaks correlations between successive updates (more stable learning)
       – Fewer interactions with the environment are needed to converge (greater data efficiency)
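
A minimal replay-buffer sketch in Python; the capacity and mini-batch size are illustrative defaults, not values from the slides.

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size store of past experiences (s, a, r, s', done)."""

        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            # Uniform sampling breaks the correlation between successive updates
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)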

  14. Target Network
     • Idea: use a separate set of target weights w̄ that is updated only periodically
       repeat
         for each (s, a, s', r) in mini-batch:
           w ← w − α [Q_w(s,a) − r − γ max_a' Q_w̄(s',a')] ∂Q_w(s,a)/∂w
         update target: w̄ ← w
     • Advantage: mitigates divergence

  15. Target Network
     • Similar to value iteration:
       repeat
         for all (s, a):
           Q̂(s,a) ← R(s,a) + γ Σ_s' Pr(s'|s,a) max_a' Q̄(s',a')
         update target: Q̄ ← Q̂
       repeat
         for each (s, a, s', r) in mini-batch:
           w ← w − α [Q_w(s,a) − r − γ max_a' Q_w̄(s',a')] ∂Q_w(s,a)/∂w
         update target: w̄ ← w
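
In code, the two-network trick amounts to keeping a frozen copy w̄ of the weights: gradients are taken with respect to w while the target is computed from w̄, which is overwritten only after each pass. A sketch for the linear case, assuming a feature function and mini-batches of (s, a, r, s', done) tuples.

    import numpy as np

    def fit_with_target_network(minibatches, features, n_actions, n_features,
                                alpha=0.01, gamma=0.99):
        """Repeatedly fit Q_w against targets computed with frozen weights w_bar."""
        w = np.random.uniform(-1.0, 1.0, size=n_features)   # Q-network weights
        w_bar = w.copy()                                     # target-network weights (frozen)

        for minibatch in minibatches:
            for (s, a, r, s_next, done) in minibatch:
                q_sa = np.dot(w, features(s, a))
                # Target uses w_bar, so it stays fixed while w changes
                q_next = 0.0 if done else max(np.dot(w_bar, features(s_next, a_))
                                              for a_ in range(n_actions))
                w -= alpha * (q_sa - (r + gamma * q_next)) * features(s, a)
            w_bar = w.copy()   # update target: w_bar <- w
        return w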

  16. Deep Q-network
     • Google DeepMind
     • Deep Q-network: gradient Q-learning with
       – Deep neural networks
       – Experience replay
       – Target network
     • Breakthrough: human-level play in many Atari video games

  17. Deep Q-network
     Initialize weights w and w̄ at random in [-1, 1]
     Observe current state s
     Loop
       Select action a and execute it
       Receive immediate reward r
       Observe new state s'
       Add (s, a, s', r) to experience buffer
       Sample mini-batch of experiences from buffer
       For each experience (ŝ, â, ŝ', r̂) in mini-batch
         Gradient: ∂Err/∂w = [Q_w(ŝ,â) − r̂ − γ max_a' Q_w̄(ŝ',a')] ∂Q_w(ŝ,â)/∂w
         Update weights: w ← w − α ∂Err/∂w
       Update state: s ← s'
       Every c steps, update target: w̄ ← w
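
Putting the pieces together, a compact sketch of the DQN training step in PyTorch; the network architecture (a small fully connected network rather than the convolutional network used for Atari), the hyperparameters, and the experience format are illustrative assumptions.

    import random
    from collections import deque

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    GAMMA, LR, BATCH, SYNC_EVERY = 0.99, 1e-4, 32, 1000

    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q_w
    target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q_w_bar
    target_net.load_state_dict(q_net.state_dict())          # start with w_bar = w
    optimizer = torch.optim.Adam(q_net.parameters(), lr=LR)
    buffer = deque(maxlen=100_000)                           # experience replay buffer

    def train_step(step):
        """One DQN gradient step on a sampled mini-batch of (s, a, r, s', done)."""
        if len(buffer) < BATCH:
            return
        s, a, r, s_next, done = zip(*random.sample(buffer, BATCH))
        s = torch.tensor(s, dtype=torch.float32)
        s_next = torch.tensor(s_next, dtype=torch.float32)
        a = torch.tensor(a, dtype=torch.int64)
        r = torch.tensor(r, dtype=torch.float32)
        done = torch.tensor(done, dtype=torch.float32)

        # Q_w(s, a) for the actions actually taken
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

        # Target r + gamma * max_a' Q_w_bar(s', a'), with w_bar held fixed
        with torch.no_grad():
            target = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done)

        loss = F.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % SYNC_EVERY == 0:
            target_net.load_state_dict(q_net.state_dict())   # every c steps: w_bar <- w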

  18. Deep Q-Network for Atari

  19. DQN versus Linear approx.
