CS885 Reinforcement Learning Lecture 12 (June 8, 2018): Deep Recurrent Q-Networks



SLIDE 1

CS885 Reinforcement Learning Lecture 12: June 8, 2018

Deep Recurrent Q-Networks [GBC] Chap. 10

CS885 Spring 2018 Pascal Poupart 1 University of Waterloo

SLIDE 2

Outline

  • Recurrent neural networks

– Long short term memory (LSTM) networks

  • Deep recurrent Q-networks


SLIDE 3

Partial Observability

  • Hidden Markov model

– Initial state distribution: Pr(s_0)
– Transition probabilities: Pr(s_{t+1} | s_t)
– Observation probabilities: Pr(o_t | s_t)

  • Belief monitoring

Pr(s_t | o_{1..t}) ∝ Pr(o_t | s_t) ∑_{s_{t-1}} Pr(s_t | s_{t-1}) Pr(s_{t-1} | o_{1..t-1})

[Figure: HMM as a chain of hidden states s0 → s1 → s2 → s3 → s4, each emitting an observation]

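The belief monitoring recursion can be run numerically. A minimal sketch for a toy two-state HMM; the transition and observation probabilities below are made-up illustrative values, not from the lecture:

```python
# Belief monitoring (forward filtering) for a toy 2-state HMM.
# Implements: Pr(s_t | o_1..t) ∝ Pr(o_t | s_t) * sum_{s_{t-1}} Pr(s_t | s_{t-1}) * Pr(s_{t-1} | o_1..t-1)

T = [[0.9, 0.1],   # T[i][j] = Pr(s_t = j | s_{t-1} = i)
     [0.2, 0.8]]
O = [[0.7, 0.3],   # O[i][k] = Pr(o_t = k | s_t = i)
     [0.1, 0.9]]

def belief_update(b, obs):
    """One step of the filtering recursion: predict, weight by likelihood, normalize."""
    n = len(b)
    predicted = [sum(b[i] * T[i][j] for i in range(n)) for j in range(n)]
    unnorm = [O[j][obs] * predicted[j] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

b = [0.5, 0.5]            # initial state distribution Pr(s_0)
for obs in [0, 0, 1]:     # observation sequence
    b = belief_update(b, obs)
print(b)                  # posterior over the two states after three observations
```

The normalization constant z is exactly the proportionality factor hidden in the ∝ of the slide's formula.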

SLIDE 4

Recurrent Neural Network (RNN)

  • In RNNs, outputs can be fed back to the network as inputs, creating a recurrent structure
  • HMMs can be simulated and generalized by RNNs
  • RNNs can be used for belief monitoring

o_t: vector of observations; b_t: belief state
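As a concrete (scalar) illustration of feeding the output back in as an input, here is a minimal recurrent update b_t = sigmoid(W·b_{t−1} + U·o_t + c); the weights are arbitrary placeholders, not values from the lecture:

```python
import math

# Minimal scalar recurrent update: the previous belief/hidden state b_{t-1}
# is fed back as an input alongside the new observation o_t.
W, U, c = 0.5, 1.0, -0.2   # illustrative weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_step(b_prev, o):
    # b_t = sigmoid(W * b_{t-1} + U * o_t + c): the recurrence
    return sigmoid(W * b_prev + U * o + c)

b = 0.0                     # initial hidden "belief" state
for o in [1.0, 0.0, 1.0]:   # observation sequence
    b = rnn_step(b, o)
print(b)                    # final hidden state after the sequence
```

A vector-valued version with learned weight matrices is what actually plays the role of belief monitoring in a deep recurrent Q-network.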


SLIDE 5

Training

  • Recurrent neural networks are trained by backpropagation on the unrolled network

– E.g., backpropagation through time

  • Weight sharing:

– Combine gradients of shared weights into a single gradient

  • Challenges:

– Gradient vanishing (and explosion)
– Long-range memory
– Prediction drift
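The weight-sharing point can be made concrete on a tiny linear RNN: backpropagation through time visits the same weight at every unrolled step and sums the per-step gradients into one gradient, and the repeated factor w in the backward pass is also where vanishing/exploding gradients come from. A sketch with made-up values, checked against a finite difference:

```python
# Backprop through time on a tiny linear RNN: h_t = w * h_{t-1} + x_t, loss L = h_T.

def forward(w, xs):
    hs = [0.0]                    # h_0 = 0
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs

def bptt_grad(w, xs):
    hs = forward(w, xs)
    grad, dh = 0.0, 1.0           # dL/dh_T = 1 since L = h_T
    for t in range(len(xs), 0, -1):
        grad += dh * hs[t - 1]    # local dh_t/dw = h_{t-1}; shared-weight gradients summed
        dh *= w                   # dh_t/dh_{t-1} = w: repeated products vanish or explode
    return grad

w, xs = 0.9, [1.0, 2.0, 3.0]
g = bptt_grad(w, xs)
eps = 1e-6                        # numerical check against a central finite difference
g_fd = (forward(w + eps, xs)[-1] - forward(w - eps, xs)[-1]) / (2 * eps)
print(g, g_fd)
```

Here h_T = w²x₁ + wx₂ + x₃, so dL/dw = 2wx₁ + x₂ = 3.8, which both computations recover.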

SLIDE 6

Long Short Term Memory (LSTM)

  • Special gated structure to control memorization and forgetting in RNNs
  • Mitigate gradient vanishing
  • Facilitate long-term memory
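A minimal scalar LSTM cell, assuming the standard input/forget/output gating (all weights below are illustrative placeholders, not values from the lecture): the forget gate scales the old cell state, the input gate scales new content, and the output gate controls what is exposed as the hidden state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(c_prev, h_prev, x, W):
    i = sigmoid(W["i_x"] * x + W["i_h"] * h_prev)    # input gate: how much to write
    f = sigmoid(W["f_x"] * x + W["f_h"] * h_prev)    # forget gate: how much to keep
    o = sigmoid(W["o_x"] * x + W["o_h"] * h_prev)    # output gate: how much to expose
    g = math.tanh(W["g_x"] * x + W["g_h"] * h_prev)  # candidate cell content
    c = f * c_prev + i * g     # additive cell update: mitigates gradient vanishing
    h = o * math.tanh(c)       # hidden state seen by the rest of the network
    return c, h

W = {"i_x": 1.0, "i_h": 0.5, "f_x": -1.0, "f_h": 0.5,
     "o_x": 1.0, "o_h": 0.5, "g_x": 1.0, "g_h": 0.5}
c, h = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:     # toy input sequence
    c, h = lstm_step(c, h, x, W)
print(c, h)
```

Because the cell state is updated additively (f·c_prev + i·g) rather than through a repeated matrix product, gradients along the cell-state path decay far more slowly than in a vanilla RNN.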

SLIDE 7

Unrolled long short term memory


[Figure: LSTM unrolled over three time steps, hidden states h_0 … h_3; each step has an input gate, a forget gate, and an output gate controlling the cell state]

SLIDE 8

Deep Recurrent Q-Network

  • Hausknecht and Stone (2016)

– Atari games

  • Transition model

– LSTM network

  • Observation model

– Convolutional network
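Schematically, the DRQN forward pass chains the two models: the convolutional network embeds each image observation, and the LSTM carries a hidden belief state across time, from which Q-values are read out. The sketch below keeps only that structure; the feature extractor, recurrent update, and Q-head are toy stand-ins, not the architecture from Hausknecht and Stone.

```python
# Schematic DRQN forward pass over an episode of image observations.

def conv_features(image):
    # stand-in for the convolutional observation model: crude pixel statistics
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat)]

def recurrent_step(h, feat):
    # stand-in for the LSTM transition model: decayed memory plus new features
    return [0.9 * hv + 0.1 * fv for hv, fv in zip(h, feat)]

def q_values(h, n_actions=3):
    # linear Q-head: one value per action (toy weights)
    return [sum((a + 1) * hv for hv in h) for a in range(n_actions)]

h = [0.0, 0.0]                                                  # initial hidden state
episode = [[[0.0, 1.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]  # two tiny "images"
for image in episode:
    h = recurrent_step(h, conv_features(image))
q = q_values(h)
print(q)   # Q-value per action, conditioned on the whole observation history via h
```

The key point the sketch preserves: the Q-values depend on the hidden state, so they summarize the entire observation history rather than a single frame.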


SLIDE 9


Deep Recurrent Q-Network

Initialize weights w and w̄ at random in [−1, 1]
Observe current state s
Loop
  Execute policy for entire episode
  Add episode (o_1, a_1, o_2, a_2, o_3, a_3, …, o_T, a_T) to experience buffer
  Sample episode from buffer
  Initialize h_0
  For t = 1 till the end of the episode do
    ∂Loss/∂w = [RNN_w(o_{1..t}, a_t) − r̂ − γ max_{a_{t+1}} RNN_{w̄}(o_{1..t+1}, a_{t+1})] ∂RNN_w(o_{1..t}, a_t)/∂w
    Update weights: w ← w − α ∂Loss/∂w
  Every N steps, update target: w̄ ← w

SLIDE 10


Deep Recurrent Q-Network

Initialize weights w and w̄ at random in [−1, 1]
Observe current state s
Loop
  Execute policy for entire episode
  Add episode (o_1, a_1, o_2, a_2, o_3, a_3, …, o_T, a_T) to experience buffer
  Sample episode from buffer
  Initialize h_0
  For t = 1 till the end of the episode do
    ∂Loss/∂w = [RNN_w(h_{t−1}, o_t, a_t) − r̂ − γ max_{a_{t+1}} RNN_{w̄}(h_{t−1}, o_t, o_{t+1}, a_{t+1})] ∂RNN_w(h_{t−1}, o_t, a_t)/∂w
    h_t ← RNN_{w̄}(h_{t−1}, o_t)
    Update weights: w ← w − α ∂Loss/∂w
  Every N steps, update target: w̄ ← w
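The pseudocode above can be sketched as a runnable skeleton: replay an episode, unroll the hidden state step by step, take a gradient step on the TD error, and periodically copy the online weights into the target network. The linear Q-function, recurrent update, and episode data below are toy stand-ins, not the networks from the paper.

```python
import random

random.seed(0)                       # reproducible toy run
GAMMA, ALPHA, COPY_EVERY = 0.9, 0.05, 10

def q(w, h, a):
    return w[a] * h                  # toy linear Q over a scalar hidden state

def rnn(h, o):
    return 0.5 * h + o               # toy recurrent hidden-state update

w = [random.uniform(-1, 1) for _ in range(2)]   # online weights, 2 actions
w_target = list(w)                               # target-network weights
episode = [(0.5, 0, 1.0), (1.0, 1, 0.0), (0.2, 0, 1.0)]  # (o_t, a_t, r_t) triples

step = 0
for _ in range(50):                  # replay the stored episode repeatedly
    h = 0.0                          # re-initialize the hidden state each replay
    for t, (o, a, r) in enumerate(episode):
        h_next = rnn(h, o)
        if t + 1 < len(episode):
            o_next = episode[t + 1][0]
            # TD target uses the target weights and the next hidden state
            target = r + GAMMA * max(q(w_target, rnn(h_next, o_next), a2)
                                     for a2 in range(2))
        else:
            target = r               # terminal step: no bootstrap term
        td = q(w, h_next, a) - target
        w[a] -= ALPHA * td * h_next  # gradient step on the squared TD error
        h = h_next
        step += 1
        if step % COPY_EVERY == 0:
            w_target = list(w)       # periodic target update: w̄ ← w
print(w)
```

Note the structural point the skeleton preserves: the hidden state must be unrolled from the start of the episode, so whole episodes (not isolated transitions) are replayed from the buffer.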


SLIDE 11

Results


Flickering games (missing observations)