CS885 Reinforcement Learning Lecture 4b: May 11, 2018, Deep Q-networks - PowerPoint PPT Presentation


SLIDE 1

CS885 Reinforcement Learning Lecture 4b: May 11, 2018

Deep Q-networks [SutBar] Sec. 9.4, 9.7, [Sze] Sec. 4.3.2

CS885 Spring 2018, Pascal Poupart, University of Waterloo

SLIDE 2

Outline

  • Value Function Approximation

  – Linear approximation
  – Neural network approximation

  • Deep Q-network

SLIDE 3

Q-function Approximation

  • Let $s = (x_1, x_2, \ldots, x_n)^T$
  • Linear: $Q(s, a) \approx \sum_i w_{ai} x_i$
  • Non-linear (e.g., neural network): $Q(s, a) \approx g(x; w)$
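As an illustration (not from the slides), a minimal sketch of both parameterizations, assuming states are already encoded as feature vectors:

    import numpy as np

    n_features, n_actions, n_hidden = 8, 4, 16

    # Linear: Q(s, a) = sum_i w[a, i] * x_i, one weight vector per action
    w = np.random.uniform(-1, 1, (n_actions, n_features))

    def q_linear(x):
        return w @ x  # vector of Q-values, one entry per action

    # Non-linear: a tiny two-layer network g(x; w), stand-in for a deep net
    W1 = np.random.uniform(-1, 1, (n_hidden, n_features))
    W2 = np.random.uniform(-1, 1, (n_actions, n_hidden))

    def q_neural(x):
        h = np.maximum(0.0, W1 @ x)  # ReLU hidden layer
        return W2 @ h

    x = np.random.rand(n_features)   # example feature vector for one state
    print(q_linear(x), q_neural(x))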

SLIDE 4

Gradient Q-learning

  • Minimize squared error between Q-value estimate and target
    – Q-value estimate: $Q_w(s, a)$
    – Target: $r + \gamma \max_{a'} Q_{\bar{w}}(s', a')$
  • Squared error: $Err(w) = \frac{1}{2}\left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right]^2$
  • Gradient (with $\bar{w}$ held fixed): $\frac{\partial Err}{\partial w} = \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
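For the linear parameterization from Slide 3, the last factor has a simple closed form; this worked step is an addition, not on the original slide:

$$\frac{\partial Q_w(s, a)}{\partial w_{ai}} = \frac{\partial}{\partial w_{ai}} \sum_j w_{aj}\, x_j = x_i,$$

so each weight moves in proportion to the TD error times its feature value.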

SLIDE 5

Gradient Q-learning

Initialize weights $w$ uniformly at random in $[-1, 1]$
Observe current state $s$
Loop
    Select action $a$ and execute it
    Receive immediate reward $r$
    Observe new state $s'$
    Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(s, a) - r - \gamma \max_{a'} Q_w(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
    Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
    Update state: $s \leftarrow s'$
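A minimal Python sketch of this loop for a linear Q-function. The environment interface (`env.reset()`, `env.step(a)` returning `(s', r, done)`) and the feature map `phi` are assumptions, not from the slides:

    import numpy as np

    def gradient_q_learning(env, phi, n_actions, gamma=0.99, alpha=0.01,
                            eps=0.1, n_steps=10_000):
        # Linear Q-function: Q(s, a) = w[a] . phi(s)
        s = env.reset()
        n_features = len(phi(s))
        w = np.random.uniform(-1, 1, (n_actions, n_features))
        for _ in range(n_steps):
            x = phi(s)
            # epsilon-greedy action selection
            a = (np.random.randint(n_actions) if np.random.rand() < eps
                 else int(np.argmax(w @ x)))
            s_next, r, done = env.step(a)
            # target computed with the current weights, treated as fixed
            target = r + (0.0 if done else gamma * np.max(w @ phi(s_next)))
            td_error = (w[a] @ x) - target
            w[a] -= alpha * td_error * x   # dQ_w(s,a)/dw[a] = phi(s)
            s = env.reset() if done else s_next
        return w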

SLIDE 6

Recap: Convergence of Tabular Q-learning

  • Tabular Q-learning converges to the optimal Q-function under the following conditions:
    $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
  • Let $\alpha_t(s, a) = 1/n(s, a)$
    – where $n(s, a)$ is # of times that $(s, a)$ is visited
  • Q-learning update:
    $Q(s, a) \leftarrow Q(s, a) + \alpha_t(s, a)\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$
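A sketch of one such update with the count-based learning rate, assuming `Q` and `n_visits` are arrays indexed by integer states and actions (names are mine, not the slides'):

    import numpy as np

    def tabular_q_update(Q, n_visits, s, a, r, s_next, gamma=0.99):
        # Count-based learning rate alpha_t(s,a) = 1/n(s,a),
        # which satisfies sum(alpha) = inf and sum(alpha^2) < inf
        n_visits[s, a] += 1
        alpha = 1.0 / n_visits[s, a]
        td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * td_error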

SLIDE 7

Convergence of Linear Gradient Q-Learning

  • Linear Q-learning converges under the same conditions:
    $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
  • Let $\alpha_t = 1/t$
  • Let $Q_w(s, a) = \sum_i w_i x_i$
  • Q-learning update:
    $w \leftarrow w - \alpha_t \left[Q_w(s, a) - r - \gamma \max_{a'} Q_w(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$

SLIDE 8

Divergence of Non-linear Gradient Q-learning

  • Even when the following conditions hold
    $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
    non-linear Q-learning may diverge
  • Intuition:
    – Adjusting $w$ to increase $Q$ at $(s, a)$ might introduce errors at nearby state-action pairs.

SLIDE 9

Mitigating divergence

  • Two tricks are often used in practice:
    1. Experience replay
    2. Use two networks:
       – Q-network
       – Target network

SLIDE 10

Experience Replay

  • Idea: store previous experiences $(s, a, s', r)$ into a buffer and sample a mini-batch of previous experiences at each step to learn by Q-learning (a minimal buffer sketch follows below)
  • Advantages
    – Break correlations between successive updates (more stable learning)
    – Fewer interactions with environment needed to converge (greater data efficiency)
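A minimal replay buffer sketch. The slides store $(s, a, s', r)$; the `done` flag here is an added assumption to mark episode ends:

    import random
    from collections import deque

    class ReplayBuffer:
        # Fixed-capacity buffer; oldest experiences are evicted first
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, s_next, r, done):
            self.buffer.append((s, a, s_next, r, done))

        def sample(self, batch_size):
            # Uniform sampling breaks correlations between successive updates
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)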

SLIDE 11

Target Network

  • Idea: Use a separate target network that is updated only periodically

repeat for each $(s, a, s', r)$ in mini-batch:
    $w \leftarrow w - \alpha \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
$\bar{w} \leftarrow w$   (target update)

  • Advantage: mitigates divergence
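A sketch of this inner loop for the linear Q-function used earlier; `w_bar` is the periodically refreshed copy, and all names are mine rather than the slides':

    import numpy as np

    def minibatch_update(w, w_bar, batch, phi, gamma=0.99, alpha=0.01):
        for s, a, s_next, r, done in batch:
            x = phi(s)
            # Bootstrap from the frozen target weights w_bar, not from w
            target = r + (0.0 if done else gamma * np.max(w_bar @ phi(s_next)))
            td_error = (w[a] @ x) - target
            w[a] -= alpha * td_error * x
        return w

    # every so often (the 'target update'): w_bar = w.copy()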

SLIDE 12

Target Network

  • Similar to value iteration:

repeat for all $s$:
    $V(s) \leftarrow \max_a R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \bar{V}(s') \quad \forall s$
$\bar{V} \leftarrow V$   (target update)

repeat for each $(s, a, s', r)$ in mini-batch:
    $w \leftarrow w - \alpha \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
$\bar{w} \leftarrow w$   (target update)
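A sketch of the value-iteration half of the analogy, assuming a tabular MDP given as arrays `R` (reward per state) and `P` (transition probabilities indexed `[s, a, s']`); the names are hypothetical:

    import numpy as np

    def value_iteration_frozen(R, P, gamma=0.99, n_sweeps=100):
        # Written in the same 'frozen copy' style as the target network
        V = np.zeros(R.shape[0])
        for _ in range(n_sweeps):
            V_bar = V.copy()                          # target update: V_bar <- V
            # back up all states against the frozen copy V_bar
            V = np.max(R[:, None] + gamma * (P @ V_bar), axis=1)
        return V

The target network plays the role of $\bar{V}$: updates bootstrap from a frozen copy rather than from the estimate currently being changed.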

SLIDE 13

Deep Q-network

  • Google DeepMind's Deep Q-network: gradient Q-learning with
    – Deep neural networks
    – Experience replay
    – Target network
  • Breakthrough: human-level play in many Atari video games

SLIDE 14

Deep Q-network

Initialize weights $w$ and $\bar{w}$ at random in $[-1, 1]$
Observe current state $s$
Loop
    Select action $a$ and execute it
    Receive immediate reward $r$
    Observe new state $s'$
    Add $(s, a, s', r)$ to experience buffer
    Sample mini-batch of experiences from buffer
    For each experience $(\hat{s}, \hat{a}, \hat{s}', \hat{r})$ in mini-batch
        Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(\hat{s}, \hat{a}) - \hat{r} - \gamma \max_{\hat{a}'} Q_{\bar{w}}(\hat{s}', \hat{a}')\right] \frac{\partial Q_w(\hat{s}, \hat{a})}{\partial w}$
        Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
    Update state: $s \leftarrow s'$
    Every $c$ steps, update target: $\bar{w} \leftarrow w$
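A compact end-to-end sketch tying the pieces together. The two-layer network, the gym-style `env` interface, and all hyperparameter values are assumptions for illustration; DeepMind's actual DQN uses a convolutional network over Atari frames:

    import random
    from collections import deque
    import numpy as np

    class TinyQNet:
        # Two-layer ReLU network with manual gradients; stand-in for a deep net
        def __init__(self, n_in, n_hidden, n_actions):
            self.W1 = np.random.uniform(-1, 1, (n_hidden, n_in))
            self.W2 = np.random.uniform(-1, 1, (n_actions, n_hidden))

        def forward(self, x):
            h = np.maximum(0.0, self.W1 @ x)
            return self.W2 @ h, h

        def update(self, x, a, td_error, alpha):
            # Semi-gradient step: w <- w - alpha * td_error * dQ(s,a)/dw
            _, h = self.forward(x)
            grad_h = td_error * self.W2[a] * (h > 0.0)  # backprop through ReLU
            self.W2[a] -= alpha * td_error * h
            self.W1 -= alpha * np.outer(grad_h, x)

        def copy_from(self, other):
            self.W1, self.W2 = other.W1.copy(), other.W2.copy()

    def dqn(env, n_features, n_actions, gamma=0.99, alpha=1e-3, eps=0.1,
            batch_size=32, target_every=1000, n_steps=100_000):
        q_net = TinyQNet(n_features, 64, n_actions)
        target_net = TinyQNet(n_features, 64, n_actions)
        target_net.copy_from(q_net)
        buffer = deque(maxlen=100_000)              # experience replay buffer
        s = env.reset()
        for step in range(n_steps):
            q, _ = q_net.forward(s)
            a = (np.random.randint(n_actions) if np.random.rand() < eps
                 else int(np.argmax(q)))
            s_next, r, done = env.step(a)
            buffer.append((s, a, s_next, r, done))
            if len(buffer) >= batch_size:
                for bs, ba, bs_next, br, bdone in random.sample(buffer, batch_size):
                    q_bar, _ = target_net.forward(bs_next)   # frozen target network
                    target = br + (0.0 if bdone else gamma * np.max(q_bar))
                    q_est, _ = q_net.forward(bs)
                    q_net.update(bs, ba, q_est[ba] - target, alpha)
            s = env.reset() if done else s_next
            if (step + 1) % target_every == 0:      # every c steps: w_bar <- w
                target_net.copy_from(q_net)
        return q_net

Sampling uniformly from the buffer and bootstrapping only from `target_net` are exactly the two divergence-mitigation tricks from Slide 9.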

SLIDE 15

Deep Q-Network for Atari

SLIDE 16

DQN versus Linear approx.
