CS885 Reinforcement Learning Lecture 4b: May 11, 2018
Deep Q-networks [SutBar] Sec. 9.4, 9.7, [Sze] Sec. 4.3.2
CS885 Spring 2018 Pascal Poupart 1 University of Waterloo
Outline
Value Function Approximation
Linear approximation
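Target: $r + \gamma \max_{a'} Q_{\bar{w}}(s', a')$
Squared error: $Err(w) = \frac{1}{2}\,[Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')]^2$
Gradient: $\frac{\partial Err}{\partial w} = [Q_w(s, a) - r - \gamma \max_{a'} Q_{\bar{w}}(s', a')]\,\frac{\partial Q_w(s, a)}{\partial w}$, with $\bar{w}$ held fixed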
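The squared error and its gradient can be sketched for a linear Q-function. This is a minimal illustration, not the lecture's code: the linear form $Q_w(s,a) = w^\top x(s,a)$, the function names, and the numeric values are all assumptions chosen for the example. The target weights `w_bar` are treated as constants, so only the estimate $Q_w(s,a)$ contributes to the gradient.

```python
import numpy as np

def q_value(weights, x):
    """Linear Q-value (assumed form): dot product of weights and features of (s, a)."""
    return float(np.dot(weights, x))

def error_and_gradient(w, w_bar, x_sa, r, gamma, next_features):
    """Squared error and its gradient w.r.t. w, with target weights w_bar held fixed.

    x_sa: feature vector of (s, a).
    next_features: one feature vector per action a' available in s'.
    """
    target = r + gamma * max(q_value(w_bar, x) for x in next_features)
    diff = q_value(w, x_sa) - target      # Q_w(s,a) - r - gamma * max_a' Q_wbar(s',a')
    err = 0.5 * diff ** 2
    grad = diff * x_sa                    # dQ_w(s,a)/dw = x(s,a) for a linear model
    return err, grad

# Hypothetical numbers, for illustration only.
w = np.array([0.5, -0.2])
w_bar = w.copy()
x_sa = np.array([1.0, 0.0])
next_features = np.array([[0.0, 1.0], [1.0, 1.0]])
err, grad = error_and_gradient(w, w_bar, x_sa, r=1.0, gamma=0.9,
                               next_features=next_features)
```

A gradient step would then be `w -= alpha * grad` for some step size `alpha`.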
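Select action $a$ and execute it
Receive immediate reward $r$
Observe new state $s'$
Gradient: $\frac{\partial Err}{\partial w} = [Q_w(s, a) - r - \gamma \max_{a'} Q_w(s', a')]\,\frac{\partial Q_w(s, a)}{\partial w}$
Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
Update state: $s \leftarrow s'$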
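The steps of gradient Q-learning can be sketched as a short loop. Everything beyond the update rule itself is an assumption made for illustration: the four-state chain environment, the one-hot features (which make the linear model equivalent to a table), the epsilon-greedy action selection, and the constants.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.5   # hypothetical constants

def features(s, a):
    """One-hot feature vector for the pair (s, a) (assumed representation)."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left.
    Reward 1 whenever the next state is the right-most state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r

w = np.zeros(N_STATES * N_ACTIONS)
s = 0
for _ in range(20000):
    # Select action a (epsilon-greedy is an assumption) and execute it.
    if rng.random() < EPSILON:
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(np.argmax([w @ features(s, b) for b in range(N_ACTIONS)]))
    s_next, r = step(s, a)                      # reward r and new state s'
    # Gradient of the squared error; the target is treated as a constant.
    target = r + GAMMA * max(w @ features(s_next, b) for b in range(N_ACTIONS))
    grad = (w @ features(s, a) - target) * features(s, a)
    w -= ALPHA * grad                           # update weights
    s = s_next                                  # update state

greedy = [int(np.argmax([w @ features(st, b) for b in range(N_ACTIONS)]))
          for st in range(N_STATES)]
```

In this toy chain moving right is optimal in every state, so `greedy` gives a quick sanity check of the learned weights.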
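$\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
$Q(s, a) \leftarrow Q(s, a) + \alpha_t\,[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$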
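A tabular Q-learning update with a decaying step size can be sketched as follows. The schedule $\alpha = 1/n(s,a)$, where $n(s,a)$ counts visits to the pair, is an assumption chosen because it satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (the harmonic series diverges while the sum of $1/n^2$ converges). The state names and rewards are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9                     # hypothetical discount factor
Q = defaultdict(float)          # Q[(s, a)], initialised to zero
visits = defaultdict(int)       # n(s, a): number of updates of each pair

def q_update(s, a, r, s_next, actions):
    """One tabular Q-learning update with step size 1 / n(s, a) (assumed schedule)."""
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return alpha

actions = [0, 1]
alpha_1 = q_update("s0", 1, 1.0, "s1", actions)   # first visit of ("s0", 1)
alpha_2 = q_update("s0", 1, 0.0, "s1", actions)   # second visit of ("s0", 1)
```

With this schedule each entry of `Q` is the running average of the targets seen so far for that pair.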
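$\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
$w \leftarrow w - \alpha_t\,[Q_w(s, a) - r - \gamma \max_{a'} Q_w(s', a')]\,\frac{\partial Q_w(s, a)}{\partial w}$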
$\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
[Equation not recoverable from the extraction; surviving annotations: "target", "update".]
Initialize weights $w$ and $\bar{w}$ at random in $[-1, 1]$
Observe current state $s$
Loop
    Select action $a$ and execute it
    Receive immediate reward $r$
    Observe new state $s'$
    Add $(s, a, s', r)$ to experience buffer
    Sample mini-batch of experiences from buffer
    For each experience $(\hat{s}, \hat{a}, \hat{s}', \hat{r})$ in mini-batch
        Gradient: $\frac{\partial Err}{\partial w} = [Q_w(\hat{s}, \hat{a}) - \hat{r} - \gamma \max_{\hat{a}'} Q_{\bar{w}}(\hat{s}', \hat{a}')]\,\frac{\partial Q_w(\hat{s}, \hat{a})}{\partial w}$
        Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
    Update state: $s \leftarrow s'$
    Every $c$ steps, update target: $\bar{w} \leftarrow w$
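The loop with an experience buffer and periodically synchronised target weights can be sketched as below. This is a simplified illustration under stated assumptions, not a full deep Q-network: a linear Q-function with one-hot features stands in for the neural network, and the toy chain environment, epsilon-greedy selection, sampling with replacement, buffer capacity, and all constants are hypothetical choices.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.5      # hypothetical constants
BATCH_SIZE, TARGET_PERIOD = 8, 50          # mini-batch size and target update period

def features(s, a):
    """One-hot features of (s, a); a linear model stands in for the deep network."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def q(weights, s, a):
    return float(weights @ features(s, a))

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left.
    Reward 1 whenever the next state is the right-most state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

# Initialize weights w and target weights w_bar at random in [-1, 1].
w = rng.uniform(-1.0, 1.0, N_STATES * N_ACTIONS)
w_bar = rng.uniform(-1.0, 1.0, N_STATES * N_ACTIONS)
buffer = deque(maxlen=1000)                # experience buffer (capacity is an assumption)
s = 0
for t in range(1, 10001):
    # Select action (epsilon-greedy is an assumption) and execute it.
    if rng.random() < EPSILON:
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(np.argmax([q(w, s, b) for b in range(N_ACTIONS)]))
    s_next, r = step(s, a)
    buffer.append((s, a, s_next, r))       # add (s, a, s', r) to the buffer
    # Sample a mini-batch (with replacement, for simplicity).
    idx = rng.integers(len(buffer), size=min(BATCH_SIZE, len(buffer)))
    for i in idx:
        s_h, a_h, s_h_next, r_h = buffer[int(i)]
        # The target uses the fixed weights w_bar; the gradient is taken w.r.t. w only.
        target = r_h + GAMMA * max(q(w_bar, s_h_next, b) for b in range(N_ACTIONS))
        grad = (q(w, s_h, a_h) - target) * features(s_h, a_h)
        w -= ALPHA * grad
    s = s_next
    if t % TARGET_PERIOD == 0:             # every TARGET_PERIOD steps, update target
        w_bar = w.copy()

greedy = [int(np.argmax([q(w, st, b) for b in range(N_ACTIONS)]))
          for st in range(N_STATES)]
```

Copying `w` into `w_bar` only every `TARGET_PERIOD` steps keeps the regression target fixed between synchronisations, which is the role the target weights play in the algorithm.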