CS885 Reinforcement Learning Lecture 4b: May 11, 2018, Deep Q-networks - PowerPoint PPT Presentation


SLIDE 1

CS885 Reinforcement Learning Lecture 4b: May 11, 2018

Deep Q-networks [SutBar] Sec. 9.4, 9.7, [Sze] Sec. 4.3.2

CS885 Spring 2018, Pascal Poupart, University of Waterloo

SLIDE 2

Outline

  • Value Function Approximation

  – Linear approximation
  – Neural network approximation

  • Deep Q-network

SLIDE 3

Q-function Approximation

  • Let $s = (x_1, x_2, \ldots, x_n)^T$
  • Linear: $Q(s, a) \approx \sum_i w_{ai} x_i$
  • Non-linear (e.g., neural network): $Q(s, a) \approx g(x; w)$
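As an illustration (not from the slides), a minimal sketch of both parameterizations, assuming states are already encoded as feature vectors:

    import numpy as np

    n_features, n_actions, n_hidden = 8, 4, 16

    # Linear: Q(s, a) = sum_i w[a, i] * x_i, one weight vector per action
    w = np.random.uniform(-1, 1, (n_actions, n_features))

    def q_linear(x):
        return w @ x  # vector of Q-values, one entry per action

    # Non-linear: a tiny two-layer network g(x; w), stand-in for a deep net
    W1 = np.random.uniform(-1, 1, (n_hidden, n_features))
    W2 = np.random.uniform(-1, 1, (n_actions, n_hidden))

    def q_neural(x):
        h = np.maximum(0.0, W1 @ x)  # ReLU hidden layer
        return W2 @ h

    x = np.random.rand(n_features)   # example feature vector for one state
    print(q_linear(x), q_neural(x))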

SLIDE 4

Gradient Q-learning

  • Minimize squared error between Q-value estimate and target
    – Q-value estimate: $Q_w(s, a)$
    – Target: $r + \gamma \max_{a'} Q_{\bar{w}}(s', a')$
  • Squared error: $Err(w) = \frac{1}{2}\left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right]^2$
  • Gradient (with $\bar{w}$ held fixed): $\frac{\partial Err}{\partial w} = \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
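For the linear parameterization from Slide 3, the last factor has a simple closed form; this worked step is an addition, not on the original slide:

$$\frac{\partial Q_w(s, a)}{\partial w_{ai}} = \frac{\partial}{\partial w_{ai}} \sum_j w_{aj}\, x_j = x_i,$$

so each weight moves in proportion to the TD error times its feature value.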

SLIDE 5

Gradient Q-learning

Initialize weights $w$ uniformly at random in $[-1, 1]$
Observe current state $s$
Loop
    Select action $a$ and execute it
    Receive immediate reward $r$
    Observe new state $s'$
    Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(s, a) - r - \gamma \max_{a'} Q_w(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
    Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
    Update state: $s \leftarrow s'$
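A minimal Python sketch of this loop for a linear Q-function. The environment interface (`env.reset()`, `env.step(a)` returning `(s', r, done)`) and the feature map `phi` are assumptions, not from the slides:

    import numpy as np

    def gradient_q_learning(env, phi, n_actions, gamma=0.99, alpha=0.01,
                            eps=0.1, n_steps=10_000):
        # Linear Q-function: Q(s, a) = w[a] . phi(s)
        s = env.reset()
        n_features = len(phi(s))
        w = np.random.uniform(-1, 1, (n_actions, n_features))
        for _ in range(n_steps):
            x = phi(s)
            # epsilon-greedy action selection
            a = (np.random.randint(n_actions) if np.random.rand() < eps
                 else int(np.argmax(w @ x)))
            s_next, r, done = env.step(a)
            # target computed with the current weights, treated as fixed
            target = r + (0.0 if done else gamma * np.max(w @ phi(s_next)))
            td_error = (w[a] @ x) - target
            w[a] -= alpha * td_error * x   # dQ_w(s,a)/dw[a] = phi(s)
            s = env.reset() if done else s_next
        return w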

SLIDE 6

Recap: Convergence of Tabular Q-learning

  • Tabular Q-learning converges to the optimal Q-function under the following conditions:
    $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
  • Let $\alpha_t(s, a) = 1/n(s, a)$
    – where $n(s, a)$ is # of times that $(s, a)$ is visited
  • Q-learning update:
    $Q(s, a) \leftarrow Q(s, a) + \alpha_t(s, a)\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$
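A sketch of one such update with the count-based learning rate, assuming `Q` and `n_visits` are arrays indexed by integer states and actions (names are mine, not the slides'):

    import numpy as np

    def tabular_q_update(Q, n_visits, s, a, r, s_next, gamma=0.99):
        # Count-based learning rate alpha_t(s,a) = 1/n(s,a),
        # which satisfies sum(alpha) = inf and sum(alpha^2) < inf
        n_visits[s, a] += 1
        alpha = 1.0 / n_visits[s, a]
        td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * td_error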

SLIDE 7

Convergence of Linear Gradient Q-Learning

  • Linear Q-learning converges under the same conditions:
    $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
  • Let $\alpha_t = 1/t$
  • Let $Q_w(s, a) = \sum_i w_i x_i$
  • Q-learning update:
    $w \leftarrow w - \alpha_t \left[Q_w(s, a) - r - \gamma \max_{a'} Q_w(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$

SLIDE 8

Divergence of Non-linear Gradient Q-learning

  • Even when the following conditions hold
    $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
    non-linear Q-learning may diverge
  • Intuition:
    – Adjusting $w$ to increase $Q$ at $(s, a)$ might introduce errors at nearby state-action pairs.

SLIDE 9

Mitigating divergence

  • Two tricks are often used in practice:
    1. Experience replay
    2. Use two networks:
       – Q-network
       – Target network

SLIDE 10

Experience Replay

  • Idea: store previous experiences $(s, a, s', r)$ into a buffer and sample a mini-batch of previous experiences at each step to learn by Q-learning (a minimal buffer sketch follows below)
  • Advantages
    – Break correlations between successive updates (more stable learning)
    – Fewer interactions with environment needed to converge (greater data efficiency)
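A minimal replay buffer sketch. The slides store $(s, a, s', r)$; the `done` flag here is an added assumption to mark episode ends:

    import random
    from collections import deque

    class ReplayBuffer:
        # Fixed-capacity buffer; oldest experiences are evicted first
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, s_next, r, done):
            self.buffer.append((s, a, s_next, r, done))

        def sample(self, batch_size):
            # Uniform sampling breaks correlations between successive updates
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)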

SLIDE 11

Target Network

  • Idea: Use a separate target network that is updated only periodically

repeat for each $(s, a, s', r)$ in mini-batch:
    $w \leftarrow w - \alpha \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
$\bar{w} \leftarrow w$   (target update)

  • Advantage: mitigates divergence
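A sketch of this inner loop for the linear Q-function used earlier; `w_bar` is the periodically refreshed copy, and all names are mine rather than the slides':

    import numpy as np

    def minibatch_update(w, w_bar, batch, phi, gamma=0.99, alpha=0.01):
        for s, a, s_next, r, done in batch:
            x = phi(s)
            # Bootstrap from the frozen target weights w_bar, not from w
            target = r + (0.0 if done else gamma * np.max(w_bar @ phi(s_next)))
            td_error = (w[a] @ x) - target
            w[a] -= alpha * td_error * x
        return w

    # every so often (the 'target update'): w_bar = w.copy()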

SLIDE 12

Target Network

  • Similar to value iteration:

repeat for all $s$:
    $V(s) \leftarrow \max_a R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \bar{V}(s') \quad \forall s$
$\bar{V} \leftarrow V$   (target update)

repeat for each $(s, a, s', r)$ in mini-batch:
    $w \leftarrow w - \alpha \left[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')\right] \frac{\partial Q_w(s, a)}{\partial w}$
$\bar{w} \leftarrow w$   (target update)
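A sketch of the value-iteration half of the analogy, assuming a tabular MDP given as arrays `R` (reward per state) and `P` (transition probabilities indexed `[s, a, s']`); the names are hypothetical:

    import numpy as np

    def value_iteration_frozen(R, P, gamma=0.99, n_sweeps=100):
        # Written in the same 'frozen copy' style as the target network
        V = np.zeros(R.shape[0])
        for _ in range(n_sweeps):
            V_bar = V.copy()                          # target update: V_bar <- V
            # back up all states against the frozen copy V_bar
            V = np.max(R[:, None] + gamma * (P @ V_bar), axis=1)
        return V

The target network plays the role of $\bar{V}$: updates bootstrap from a frozen copy rather than from the estimate currently being changed.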

SLIDE 13

Deep Q-network

  • Google DeepMind's Deep Q-network: gradient Q-learning with
    – Deep neural networks
    – Experience replay
    – Target network
  • Breakthrough: human-level play in many Atari video games

SLIDE 14

Deep Q-network

Initialize weights $w$ and $\bar{w}$ at random in $[-1, 1]$
Observe current state $s$
Loop
    Select action $a$ and execute it
    Receive immediate reward $r$
    Observe new state $s'$
    Add $(s, a, s', r)$ to experience buffer
    Sample mini-batch of experiences from buffer
    For each experience $(\hat{s}, \hat{a}, \hat{s}', \hat{r})$ in mini-batch
        Gradient: $\frac{\partial Err}{\partial w} = \left[Q_w(\hat{s}, \hat{a}) - \hat{r} - \gamma \max_{\hat{a}'} Q_{\bar{w}}(\hat{s}', \hat{a}')\right] \frac{\partial Q_w(\hat{s}, \hat{a})}{\partial w}$
        Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
    Update state: $s \leftarrow s'$
    Every $c$ steps, update target: $\bar{w} \leftarrow w$
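A compact end-to-end sketch tying the pieces together. The two-layer network, the gym-style `env` interface, and all hyperparameter values are assumptions for illustration; DeepMind's actual DQN uses a convolutional network over Atari frames:

    import random
    from collections import deque
    import numpy as np

    class TinyQNet:
        # Two-layer ReLU network with manual gradients; stand-in for a deep net
        def __init__(self, n_in, n_hidden, n_actions):
            self.W1 = np.random.uniform(-1, 1, (n_hidden, n_in))
            self.W2 = np.random.uniform(-1, 1, (n_actions, n_hidden))

        def forward(self, x):
            h = np.maximum(0.0, self.W1 @ x)
            return self.W2 @ h, h

        def update(self, x, a, td_error, alpha):
            # Semi-gradient step: w <- w - alpha * td_error * dQ(s,a)/dw
            _, h = self.forward(x)
            grad_h = td_error * self.W2[a] * (h > 0.0)  # backprop through ReLU
            self.W2[a] -= alpha * td_error * h
            self.W1 -= alpha * np.outer(grad_h, x)

        def copy_from(self, other):
            self.W1, self.W2 = other.W1.copy(), other.W2.copy()

    def dqn(env, n_features, n_actions, gamma=0.99, alpha=1e-3, eps=0.1,
            batch_size=32, target_every=1000, n_steps=100_000):
        q_net = TinyQNet(n_features, 64, n_actions)
        target_net = TinyQNet(n_features, 64, n_actions)
        target_net.copy_from(q_net)
        buffer = deque(maxlen=100_000)              # experience replay buffer
        s = env.reset()
        for step in range(n_steps):
            q, _ = q_net.forward(s)
            a = (np.random.randint(n_actions) if np.random.rand() < eps
                 else int(np.argmax(q)))
            s_next, r, done = env.step(a)
            buffer.append((s, a, s_next, r, done))
            if len(buffer) >= batch_size:
                for bs, ba, bs_next, br, bdone in random.sample(buffer, batch_size):
                    q_bar, _ = target_net.forward(bs_next)   # frozen target network
                    target = br + (0.0 if bdone else gamma * np.max(q_bar))
                    q_est, _ = q_net.forward(bs)
                    q_net.update(bs, ba, q_est[ba] - target, alpha)
            s = env.reset() if done else s_next
            if (step + 1) % target_every == 0:      # every c steps: w_bar <- w
                target_net.copy_from(q_net)
        return q_net

Sampling uniformly from the buffer and bootstrapping only from `target_net` are exactly the two divergence-mitigation tricks from Slide 9.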

SLIDE 15

Deep Q-Network for Atari

SLIDE 16

DQN versus Linear approx.
