CS885 Reinforcement Learning Lecture 12 (June 8, 2018): Deep Recurrent Q-Networks



SLIDE 1

CS885 Reinforcement Learning Lecture 12: June 8, 2018

Deep Recurrent Q-Networks [GBC] Chap. 10

CS885 Spring 2018 Pascal Poupart 1 University of Waterloo

SLIDE 2

Outline

  • Recurrent neural networks

– Long short term memory (LSTM) networks

  • Deep recurrent Q-networks


SLIDE 3

Partial Observability

  • Hidden Markov model

– Initial state distribution: Pr(s_0)
– Transition probabilities: Pr(s_{t+1} | s_t)
– Observation probabilities: Pr(o_t | s_t)

  • Belief monitoring

Pr(s_t | o_{1..t}) ∝ Pr(o_t | s_t) ∑_{s_{t-1}} Pr(s_t | s_{t-1}) Pr(s_{t-1} | o_{1..t-1})

[Figure: HMM as a chain of hidden states s0 → s1 → s2 → s3 → s4, each emitting an observation]

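The belief monitoring recursion can be run numerically. A minimal sketch for a toy two-state HMM; the transition and observation probabilities below are made-up illustrative values, not from the lecture:

```python
# Belief monitoring (forward filtering) for a toy 2-state HMM.
# Implements: Pr(s_t | o_1..t) ∝ Pr(o_t | s_t) * sum_{s_{t-1}} Pr(s_t | s_{t-1}) * Pr(s_{t-1} | o_1..t-1)

T = [[0.9, 0.1],   # T[i][j] = Pr(s_t = j | s_{t-1} = i)
     [0.2, 0.8]]
O = [[0.7, 0.3],   # O[i][k] = Pr(o_t = k | s_t = i)
     [0.1, 0.9]]

def belief_update(b, obs):
    """One step of the filtering recursion: predict, weight by likelihood, normalize."""
    n = len(b)
    predicted = [sum(b[i] * T[i][j] for i in range(n)) for j in range(n)]
    unnorm = [O[j][obs] * predicted[j] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

b = [0.5, 0.5]            # initial state distribution Pr(s_0)
for obs in [0, 0, 1]:     # observation sequence
    b = belief_update(b, obs)
print(b)                  # posterior over the two states after three observations
```

The normalization constant z is exactly the proportionality factor hidden in the ∝ of the slide's formula.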

SLIDE 4

Recurrent Neural Network (RNN)

  • In RNNs, outputs can be fed back to the network as inputs, creating a recurrent structure
  • HMMs can be simulated and generalized by RNNs
  • RNNs can be used for belief monitoring

o_t: vector of observations; b_t: belief state
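As a concrete (scalar) illustration of feeding the output back in as an input, here is a minimal recurrent update b_t = sigmoid(W·b_{t−1} + U·o_t + c); the weights are arbitrary placeholders, not values from the lecture:

```python
import math

# Minimal scalar recurrent update: the previous belief/hidden state b_{t-1}
# is fed back as an input alongside the new observation o_t.
W, U, c = 0.5, 1.0, -0.2   # illustrative weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_step(b_prev, o):
    # b_t = sigmoid(W * b_{t-1} + U * o_t + c): the recurrence
    return sigmoid(W * b_prev + U * o + c)

b = 0.0                     # initial hidden "belief" state
for o in [1.0, 0.0, 1.0]:   # observation sequence
    b = rnn_step(b, o)
print(b)                    # final hidden state after the sequence
```

A vector-valued version with learned weight matrices is what actually plays the role of belief monitoring in a deep recurrent Q-network.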


SLIDE 5

Training

  • Recurrent neural networks are trained by backpropagation on the unrolled network

– E.g., backpropagation through time

  • Weight sharing:

– Combine gradients of shared weights into a single gradient

  • Challenges:

– Gradient vanishing (and explosion)
– Long-range memory
– Prediction drift
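The weight-sharing point can be made concrete on a tiny linear RNN: backpropagation through time visits the same weight at every unrolled step and sums the per-step gradients into one gradient, and the repeated factor w in the backward pass is also where vanishing/exploding gradients come from. A sketch with made-up values, checked against a finite difference:

```python
# Backprop through time on a tiny linear RNN: h_t = w * h_{t-1} + x_t, loss L = h_T.

def forward(w, xs):
    hs = [0.0]                    # h_0 = 0
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs

def bptt_grad(w, xs):
    hs = forward(w, xs)
    grad, dh = 0.0, 1.0           # dL/dh_T = 1 since L = h_T
    for t in range(len(xs), 0, -1):
        grad += dh * hs[t - 1]    # local dh_t/dw = h_{t-1}; shared-weight gradients summed
        dh *= w                   # dh_t/dh_{t-1} = w: repeated products vanish or explode
    return grad

w, xs = 0.9, [1.0, 2.0, 3.0]
g = bptt_grad(w, xs)
eps = 1e-6                        # numerical check against a central finite difference
g_fd = (forward(w + eps, xs)[-1] - forward(w - eps, xs)[-1]) / (2 * eps)
print(g, g_fd)
```

Here h_T = w²x₁ + wx₂ + x₃, so dL/dw = 2wx₁ + x₂ = 3.8, which both computations recover.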

SLIDE 6

Long Short Term Memory (LSTM)

  • Special gated structure to control memorization and forgetting in RNNs
  • Mitigate gradient vanishing
  • Facilitate long-term memory
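A minimal scalar LSTM cell, assuming the standard input/forget/output gating (all weights below are illustrative placeholders, not values from the lecture): the forget gate scales the old cell state, the input gate scales new content, and the output gate controls what is exposed as the hidden state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(c_prev, h_prev, x, W):
    i = sigmoid(W["i_x"] * x + W["i_h"] * h_prev)    # input gate: how much to write
    f = sigmoid(W["f_x"] * x + W["f_h"] * h_prev)    # forget gate: how much to keep
    o = sigmoid(W["o_x"] * x + W["o_h"] * h_prev)    # output gate: how much to expose
    g = math.tanh(W["g_x"] * x + W["g_h"] * h_prev)  # candidate cell content
    c = f * c_prev + i * g     # additive cell update: mitigates gradient vanishing
    h = o * math.tanh(c)       # hidden state seen by the rest of the network
    return c, h

W = {"i_x": 1.0, "i_h": 0.5, "f_x": -1.0, "f_h": 0.5,
     "o_x": 1.0, "o_h": 0.5, "g_x": 1.0, "g_h": 0.5}
c, h = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:     # toy input sequence
    c, h = lstm_step(c, h, x, W)
print(c, h)
```

Because the cell state is updated additively (f·c_prev + i·g) rather than through a repeated matrix product, gradients along the cell-state path decay far more slowly than in a vanilla RNN.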

SLIDE 7

Unrolled long short term memory


[Figure: LSTM unrolled over three time steps, hidden states h_0 … h_3; each step has an input gate, a forget gate, and an output gate controlling the cell state]

SLIDE 8

Deep Recurrent Q-Network

  • Hausknecht and Stone (2016)

– Atari games

  • Transition model

– LSTM network

  • Observation model

– Convolutional network
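Schematically, the DRQN forward pass chains the two models: the convolutional network embeds each image observation, and the LSTM carries a hidden belief state across time, from which Q-values are read out. The sketch below keeps only that structure; the feature extractor, recurrent update, and Q-head are toy stand-ins, not the architecture from Hausknecht and Stone.

```python
# Schematic DRQN forward pass over an episode of image observations.

def conv_features(image):
    # stand-in for the convolutional observation model: crude pixel statistics
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat)]

def recurrent_step(h, feat):
    # stand-in for the LSTM transition model: decayed memory plus new features
    return [0.9 * hv + 0.1 * fv for hv, fv in zip(h, feat)]

def q_values(h, n_actions=3):
    # linear Q-head: one value per action (toy weights)
    return [sum((a + 1) * hv for hv in h) for a in range(n_actions)]

h = [0.0, 0.0]                                                  # initial hidden state
episode = [[[0.0, 1.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]  # two tiny "images"
for image in episode:
    h = recurrent_step(h, conv_features(image))
q = q_values(h)
print(q)   # Q-value per action, conditioned on the whole observation history via h
```

The key point the sketch preserves: the Q-values depend on the hidden state, so they summarize the entire observation history rather than a single frame.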


SLIDE 9


Deep Recurrent Q-Network

Initialize weights w and w̄ at random in [−1, 1]
Observe current state s
Loop
  Execute policy for entire episode
  Add episode (o_1, a_1, o_2, a_2, o_3, a_3, …, o_T, a_T) to experience buffer
  Sample episode from buffer
  Initialize h_0
  For t = 1 till the end of the episode do
    ∂Loss/∂w = [RNN_w(o_{1..t}, a_t) − r̂ − γ max_{a_{t+1}} RNN_{w̄}(o_{1..t+1}, a_{t+1})] ∂RNN_w(o_{1..t}, a_t)/∂w
    Update weights: w ← w − α ∂Loss/∂w
  Every N steps, update target: w̄ ← w

SLIDE 10


Deep Recurrent Q-Network

Initialize weights w and w̄ at random in [−1, 1]
Observe current state s
Loop
  Execute policy for entire episode
  Add episode (o_1, a_1, o_2, a_2, o_3, a_3, …, o_T, a_T) to experience buffer
  Sample episode from buffer
  Initialize h_0
  For t = 1 till the end of the episode do
    ∂Loss/∂w = [RNN_w(h_{t−1}, o_t, a_t) − r̂ − γ max_{a_{t+1}} RNN_{w̄}(h_{t−1}, o_t, o_{t+1}, a_{t+1})] ∂RNN_w(h_{t−1}, o_t, a_t)/∂w
    h_t ← RNN_{w̄}(h_{t−1}, o_t)
    Update weights: w ← w − α ∂Loss/∂w
  Every N steps, update target: w̄ ← w
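The pseudocode above can be sketched as a runnable skeleton: replay an episode, unroll the hidden state step by step, take a gradient step on the TD error, and periodically copy the online weights into the target network. The linear Q-function, recurrent update, and episode data below are toy stand-ins, not the networks from the paper.

```python
import random

random.seed(0)                       # reproducible toy run
GAMMA, ALPHA, COPY_EVERY = 0.9, 0.05, 10

def q(w, h, a):
    return w[a] * h                  # toy linear Q over a scalar hidden state

def rnn(h, o):
    return 0.5 * h + o               # toy recurrent hidden-state update

w = [random.uniform(-1, 1) for _ in range(2)]   # online weights, 2 actions
w_target = list(w)                               # target-network weights
episode = [(0.5, 0, 1.0), (1.0, 1, 0.0), (0.2, 0, 1.0)]  # (o_t, a_t, r_t) triples

step = 0
for _ in range(50):                  # replay the stored episode repeatedly
    h = 0.0                          # re-initialize the hidden state each replay
    for t, (o, a, r) in enumerate(episode):
        h_next = rnn(h, o)
        if t + 1 < len(episode):
            o_next = episode[t + 1][0]
            # TD target uses the target weights and the next hidden state
            target = r + GAMMA * max(q(w_target, rnn(h_next, o_next), a2)
                                     for a2 in range(2))
        else:
            target = r               # terminal step: no bootstrap term
        td = q(w, h_next, a) - target
        w[a] -= ALPHA * td * h_next  # gradient step on the squared TD error
        h = h_next
        step += 1
        if step % COPY_EVERY == 0:
            w_target = list(w)       # periodic target update: w̄ ← w
print(w)
```

Note the structural point the skeleton preserves: the hidden state must be unrolled from the start of the episode, so whole episodes (not isolated transitions) are replayed from the buffer.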


SLIDE 11

Results


Flickering games (missing observations)