CS885 Reinforcement Learning Lecture 4b (May 11, 2018): Deep Q-networks


  1. CS885 Reinforcement Learning
     Lecture 4b: May 11, 2018
     Deep Q-networks
     Readings: [SutBar] Sec. 9.4, 9.7; [Sze] Sec. 4.3.2
     University of Waterloo, CS885 Spring 2018, Pascal Poupart

  2. Outline
     • Value Function Approximation
       – Linear approximation
       – Neural network approximation
     • Deep Q-network

  3. Q-function Approximation
     • Let $s = (x_1, x_2, \ldots, x_n)^T$ be a vector of state features
     • Linear: $Q(s, a) \approx \sum_i w_{ai} x_i$
     • Non-linear (e.g., neural network): $Q(s, a) \approx g(x; w)$
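A minimal Python sketch of the two approximators (the class name `QNetwork`, the layer sizes, and all other names are my own choices, not from the slides): a linear Q-function with one weight vector per action, and a small feed-forward network in PyTorch.

```python
# Sketch of the two Q-function approximators on this slide.
import numpy as np
import torch.nn as nn

def linear_q(x, W):
    """Linear: Q(s, a) ~ sum_i w_{ai} x_i.
    x: feature vector of shape (n,); W: weights of shape (num_actions, n).
    Returns one Q-value per action."""
    return W @ x

class QNetwork(nn.Module):
    """Non-linear: Q(s, a) ~ g(x; w), here a small feed-forward network."""
    def __init__(self, n_features, n_actions, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_actions))   # one output head per action

    def forward(self, x):
        return self.net(x)
```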

  4. Gradient Q-learning
     • Minimize squared error between Q-value estimate and target
       – Q-value estimate: $Q_w(s, a)$
       – Target: $r + \gamma \max_{a'} Q_{\bar{w}}(s', a')$, where $\bar{w}$ is fixed
     • Squared error: $Err(w) = \frac{1}{2}\left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right]^2$
     • Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
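For the linear approximation the gradient above has a closed form, since $\partial Q_w(s,a)/\partial w_a = x$. A small sketch under assumed names (`W` holds one weight row per action, `gamma` is the discount factor; the helper is hypothetical, not from the slides):

```python
# Squared error and its gradient for the linear case, target held fixed.
import numpy as np

def squared_error_and_grad(W, x, a, r, x_next, gamma):
    delta = W[a] @ x - (r + gamma * np.max(W @ x_next))  # TD error
    err = 0.5 * delta ** 2                               # Err(w)
    grad = np.zeros_like(W)
    grad[a] = delta * x                                  # chain rule: delta * dQ/dw_a
    return err, grad
```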

  5. Gradient Q-learning
     Initialize weights $w$ uniformly at random in $[-1, 1]$
     Observe current state $s$
     Loop
       Select action $a$ and execute it
       Receive immediate reward $r$
       Observe new state $s'$
       Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(s, a) - r - \gamma \max_{a'} Q_w(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
       Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
       Update state: $s \leftarrow s'$
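A runnable sketch of this loop for the linear case, assuming a Gymnasium-style `env` whose observations are feature vectors. The epsilon-greedy action selection is my addition; the slide only says "select action".

```python
import numpy as np

def gradient_q_learning(env, n_actions, n_features,
                        alpha=0.01, gamma=0.99, epsilon=0.1, n_steps=10_000):
    rng = np.random.default_rng(0)
    W = rng.uniform(-1, 1, size=(n_actions, n_features))  # init in [-1, 1]
    s, _ = env.reset()
    for _ in range(n_steps):
        # Select action a and execute it
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(W @ s))
        s_next, r, terminated, truncated, _ = env.step(a)
        # Semi-gradient update: the target is treated as a constant
        target = r if terminated else r + gamma * np.max(W @ s_next)
        delta = W[a] @ s - target
        W[a] -= alpha * delta * s        # w <- w - alpha * dErr/dw
        if terminated or truncated:
            s, _ = env.reset()
        else:
            s = s_next
    return W
```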

  6. Recap: Convergence of Tabular Q-learning
     • Tabular Q-learning converges to the optimal Q-function under the following conditions:
       $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
     • Let $\alpha_t(s, a) = 1/n(s, a)$
       – where $n(s, a)$ is the number of times that $(s, a)$ is visited
     • Q-learning update: $Q(s, a) \leftarrow Q(s, a) + \alpha_t(s, a)\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$
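One tabular update with the visit-count learning rate above, as a sketch; `Q` and `n` are `(num_states, num_actions)` arrays indexed by integer states and actions, and the names are mine.

```python
import numpy as np

def tabular_q_update(Q, n, s, a, r, s_next, gamma=0.99):
    n[s, a] += 1
    alpha = 1.0 / n[s, a]                      # alpha_t(s, a) = 1/n(s, a)
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])   # move Q(s,a) toward the target
```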

  7. Convergence of Linear Gradient Q-Learning
     • Linear Q-learning converges under the same conditions:
       $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
     • Let $\alpha_t = 1/t$
     • Let $Q_w(s, a) = \sum_i w_i x_i$
     • Q-learning update: $w \leftarrow w - \alpha_t \left[Q_w(s, a) - r - \gamma \max_{a'} Q_w(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
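The only change from the slide 5 loop, for this convergent linear case, is the $1/t$ step size; a one-step sketch with `t` as the global step count starting at 1 and the other names as in the earlier sketches:

```python
import numpy as np

def linear_q_step(W, t, s, a, r, s_next, gamma=0.99):
    alpha_t = 1.0 / t                          # sum alpha_t = inf, sum alpha_t^2 < inf
    delta = W[a] @ s - (r + gamma * np.max(W @ s_next))
    W[a] -= alpha_t * delta * s                # dQ_w(s,a)/dw_a = s for linear Q
```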

  8. Divergence of Non-linear Gradient Q-learning
     • Even when the following conditions hold,
       $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$,
       non-linear Q-learning may diverge
     • Intuition:
       – Adjusting $w$ to increase $Q$ at $(s, a)$ might introduce errors at nearby state-action pairs, because a function approximator generalizes across states in a way a table does not.

  9. Mitigating Divergence
     • Two tricks are often used in practice:
       1. Experience replay
       2. Use two networks:
          – Q-network
          – Target network

  10. Experience Replay
      • Idea: store previous experiences $(s, a, s', r)$ in a buffer and sample a mini-batch of previous experiences at each step to learn by Q-learning
      • Advantages:
        – Breaks correlations between successive updates (more stable learning)
        – Fewer interactions with the environment needed to converge (greater data efficiency)
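A minimal replay buffer for this scheme. The capacity and batch size are my choices; the `done` flag is an addition the slide omits but which is needed in practice to cut off targets at episode ends.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out

    def add(self, s, a, s_next, r, done):
        self.buffer.append((s, a, s_next, r, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)  # uniform mini-batch
```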

  11. Target Network
      • Idea: use a separate target network $\bar{w}$ that is updated only periodically
        repeat
          for each $(s, a, s', r)$ in mini-batch:
            $w \leftarrow w - \alpha \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
          $\bar{w} \leftarrow w$   (update target)
      • Advantage: mitigates divergence
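A PyTorch sketch of this inner loop, assuming `q_net` maps a batch of states to per-action Q-values and `batches` yields mini-batch tensors `(s, a, s_next, r)`; the names and the copy period are mine. `F.mse_loss` differs from the slide's error by the constant factor 1/2, which does not change the minimizer.

```python
import copy
import torch
import torch.nn.functional as F

def train_with_target(q_net, optimizer, batches, gamma=0.99, copy_every=100):
    target_net = copy.deepcopy(q_net)                # \bar{w} <- w
    for step, (s, a, s_next, r) in enumerate(batches, 1):
        with torch.no_grad():                        # target uses frozen \bar{w}
            y = r + gamma * target_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_sa, y)                   # squared TD error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % copy_every == 0:                   # periodic: \bar{w} <- w
            target_net.load_state_dict(q_net.state_dict())
```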

  12. Target Network
      • Similar to value iteration:
        repeat
          for all $s$: $V(s) \leftarrow \max_a r(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \bar{V}(s')$
          $\bar{V} \leftarrow V$   (update target)
        repeat
          for each $(s, a, s', r)$ in mini-batch:
            $w \leftarrow w - \alpha \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
          $\bar{w} \leftarrow w$   (update target)
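The value-iteration half of this slide in NumPy, with the frozen copy made explicit; the array conventions are mine: `R` is a `(|S|, |A|)` reward matrix and `P[a, s, s2] = Pr(s2 | s, a)`.

```python
import numpy as np

def value_iteration(R, P, gamma=0.99, n_sweeps=100):
    V = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        V_bar = V.copy()                 # update target: V_bar <- V
        Q = R + gamma * (P @ V_bar).T    # Q[s, a] = r(s,a) + gamma * E[V_bar(s')]
        V = Q.max(axis=1)                # V(s) <- max_a Q(s, a)
    return V
```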

  13. Deep Q-network
      • Google DeepMind
      • Deep Q-network: gradient Q-learning with
        – Deep neural networks
        – Experience replay
        – Target network
      • Breakthrough: human-level play in many Atari video games

  14. Deep Q-network
      Initialize weights $w$ and $\bar{w}$ at random in $[-1, 1]$
      Observe current state $s$
      Loop
        Select action $a$ and execute it
        Receive immediate reward $r$
        Observe new state $s'$
        Add $(s, a, s', r)$ to experience buffer
        Sample mini-batch of experiences from buffer
        For each experience $(\hat{s}, \hat{a}, \hat{s}', \hat{r})$ in mini-batch:
          Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(\hat{s}, \hat{a}) - \hat{r} - \gamma \max_{\hat{a}'} Q_{\bar{w}}(\hat{s}', \hat{a}')\right] \frac{\partial Q_w(\hat{s}, \hat{a})}{\partial w}$
          Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
        Update state: $s \leftarrow s'$
        Every $c$ steps, update target: $\bar{w} \leftarrow w$
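An end-to-end sketch combining slides 10, 11, and 14. The Gymnasium-style environment interface, the epsilon-greedy policy, the Adam optimizer, the terminal-state masking, and every hyperparameter are my assumptions, not part of the slide.

```python
import copy, random
import numpy as np
import torch
import torch.nn.functional as F

def dqn(env, q_net, n_actions, gamma=0.99, lr=1e-3, epsilon=0.1,
        batch_size=32, copy_every=1000, n_steps=100_000):
    target_net = copy.deepcopy(q_net)              # \bar{w} initialized to w
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = []                                    # experience buffer
    s, _ = env.reset()
    for step in range(1, n_steps + 1):
        # Select action a and execute it
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            with torch.no_grad():
                a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
        s_next, r, term, trunc, _ = env.step(a)
        buffer.append((s, a, s_next, r, term))     # add (s, a, s', r) to buffer
        if len(buffer) >= batch_size:
            # Sample mini-batch of experiences from buffer
            ss, aa, ss2, rr, dd = zip(*random.sample(buffer, batch_size))
            ss = torch.as_tensor(np.array(ss), dtype=torch.float32)
            ss2 = torch.as_tensor(np.array(ss2), dtype=torch.float32)
            aa = torch.as_tensor(aa)
            rr = torch.as_tensor(rr, dtype=torch.float32)
            dd = torch.as_tensor(dd, dtype=torch.float32)
            with torch.no_grad():                  # target network is frozen
                y = rr + gamma * (1 - dd) * target_net(ss2).max(dim=1).values
            q = q_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q, y)                # squared TD error over batch
            opt.zero_grad()
            loss.backward()
            opt.step()
        s = s_next if not (term or trunc) else env.reset()[0]
        if step % copy_every == 0:                 # every c steps: \bar{w} <- w
            target_net.load_state_dict(q_net.state_dict())
```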

  15. Deep Q-Network for Atari
      [figure: DQN network architecture for Atari]

  16. DQN versus Linear approx.
      [figure: performance comparison across Atari games]
