CS885 Reinforcement Learning Lecture 8b: May 25, 2018
Bayesian and Contextual Bandits [SutBar] Sec. 2.9
CS885 Spring 2018, Pascal Poupart, University of Waterloo
Outline
– Bayesian bandits
– Thompson sampling
– Contextual bandits
– Posterior $\Pr(\theta \mid r_1^a, r_2^a, \dots, r_n^a)$ based on the rewards $r_1^a, r_2^a, \dots, r_n^a$ observed for $a$ so far.
– Bayes' theorem: $\Pr(\theta \mid r_1^a, r_2^a, \dots, r_n^a) \propto \Pr(\theta)\,\Pr(r_1^a, r_2^a, \dots, r_n^a \mid \theta)$
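– Distribution over the next reward $r^a$: $\Pr(r^a \mid r_1^a, r_2^a, \dots, r_n^a) = \int_\theta \Pr(r^a; \theta)\,\Pr(\theta \mid r_1^a, r_2^a, \dots, r_n^a)\,d\theta$
– Distribution over $R(a)$: $\Pr(R(a) \mid r_1^a, r_2^a, \dots, r_n^a) = \Pr(\theta \mid r_1^a, r_2^a, \dots, r_n^a)$ if $\theta = R(a)$
– UCB: $\Pr(R(a) \le \mathrm{bound}(r_1^a, r_2^a, \dots, r_n^a)) \ge 1 - \delta$
– Bayesian techniques: $\Pr(R(a) \mid r_1^a, r_2^a, \dots, r_n^a)$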
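As a rough illustration of the posterior and of the next-reward distribution above, the following sketch assumes Bernoulli rewards with $\theta = R(a)$ and a uniform prior, and approximates the integral over $\theta$ by a sum over a grid. The function names and the grid size are illustrative choices, not part of the lecture.

```python
import numpy as np

def grid_posterior(rewards, grid_size=1001):
    """Posterior over theta on a uniform grid (assumed: Bernoulli rewards, uniform prior)."""
    theta = np.linspace(0.0, 1.0, grid_size)
    prior = np.full(grid_size, 1.0 / grid_size)
    heads = int(np.sum(rewards))
    tails = len(rewards) - heads
    # Likelihood of the observed rewards for each candidate theta.
    likelihood = theta**heads * (1.0 - theta)**tails
    unnormalized = prior * likelihood  # Bayes' theorem, up to a constant
    return theta, unnormalized / unnormalized.sum()

def predictive_prob_of_one(theta, posterior):
    """Pr(next reward = 1 | data): sum over theta of Pr(r = 1; theta) * posterior(theta)."""
    return float(np.sum(theta * posterior))

theta, post = grid_posterior([1, 1, 0, 1])
p_next = predictive_prob_of_one(theta, post)
print(round(p_next, 3))
```

Because $\theta$ is itself the mean reward here, the same grid posterior also serves as the distribution over $R(a)$.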
[Figure: densities $\mathrm{Beta}(\theta; 1, 1)$, $\mathrm{Beta}(\theta; 2, 8)$ and $\mathrm{Beta}(\theta; 20, 80)$, plotted as $\Pr(\theta)$ against $\theta$]
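Posterior after a coin flip:
$\Pr(\theta \mid head) \propto \Pr(\theta)\,\Pr(head \mid \theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta = \theta^{(\alpha+1)-1}(1-\theta)^{\beta-1} \propto \mathrm{Beta}(\theta; \alpha+1, \beta)$
$\Pr(\theta \mid tail) \propto \Pr(\theta)\,\Pr(tail \mid \theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\,(1-\theta) = \theta^{\alpha-1}(1-\theta)^{(\beta+1)-1} \propto \mathrm{Beta}(\theta; \alpha, \beta+1)$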
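The update above amounts to incrementing one of two counters. A minimal sketch, assuming rewards are encoded as 1 for head and 0 for tail and starting from a uniform Beta(1, 1) prior (the class name `BetaBelief` is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) belief over the head probability theta."""
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, reward: int) -> "BetaBelief":
        # head (1): Beta(alpha + 1, beta); tail (0): Beta(alpha, beta + 1)
        if reward == 1:
            return BetaBelief(self.alpha + 1, self.beta)
        return BetaBelief(self.alpha, self.beta + 1)

    def mean(self) -> float:
        # Mean of a Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)

belief = BetaBelief()
for flip in [1, 1, 0, 1]:  # head, head, tail, head
    belief = belief.update(flip)
print(belief, belief.mean())
```

After three heads and one tail the belief is Beta(4, 2), so no past flips need to be stored, only the two counts.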
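– Sample several potential average rewards: $R_1(a), \dots, R_k(a) \sim \Pr(R(a) \mid r_1^a, \dots, r_n^a)$ for each $a$
– Empirical average: $\hat{R}(a) = \frac{1}{k}\sum_{i=1}^{k} R_i(a)$
– Coin example: $\Pr(R(a) \mid r_1^a, \dots, r_n^a) = \mathrm{Beta}(\theta_a; \alpha_a, \beta_a)$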
$\hat{R}(a) = \frac{1}{k}\sum_{i=1}^{k} R_i(a)$
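One way the sampling-and-averaging step can be turned into a full loop for Bernoulli rewards is sketched below. It assumes a uniform Beta(1, 1) prior per arm and a simulated environment with hypothetical true means; the function name, arguments, and return values are illustrative, not taken from the lecture.

```python
import numpy as np

def thompson_sampling(true_means, horizon, k=1, seed=0):
    """Thompson sampling sketch for Bernoulli arms with Beta(1, 1) priors (assumed)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.ones(n_arms)  # 1 + number of observed 1-rewards per arm
    beta = np.ones(n_arms)   # 1 + number of observed 0-rewards per arm
    total_reward = 0
    for _ in range(horizon):
        # Draw k candidate average rewards per arm from the current posteriors.
        samples = rng.beta(alpha, beta, size=(k, n_arms))
        r_hat = samples.mean(axis=0)  # empirical average of the k samples
        best = int(np.argmax(r_hat))
        # Simulated environment: Bernoulli reward from the chosen arm.
        reward = int(rng.random() < true_means[best])
        total_reward += reward
        # Belief update for the arm that was played.
        alpha[best] += reward
        beta[best] += 1 - reward
    return total_reward, alpha, beta

total, alpha, beta = thompson_sampling([0.2, 0.8], horizon=500)
print(total, alpha, beta)
```

The returned `alpha` and `beta` arrays show how often each arm was played, since every pull adds exactly one count to the played arm.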
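Thompson sampling
– Action selection: $a^* = \arg\max_a \hat{R}(a)$
– Empirical mean: $\hat{R}(a) = \frac{1}{k}\sum_{i=1}^{k} R_i(a)$
– Samples: $R_i(a) \sim \Pr(R_i(a) \mid r_1^a, \dots, r_n^a)$, with $r_i^a \sim \Pr(r^a; \theta)$
Greedy strategy
– Action selection: $a^* = \arg\max_a \tilde{R}(a)$
– Empirical mean: $\tilde{R}(a) = \frac{1}{n}\sum_{i=1}^{n} r_i^a$
– Samples: $r_i^a \sim \Pr(r^a; \theta)$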
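The two selection rules can be put side by side on the same data. The sketch below assumes Bernoulli rewards, a uniform Beta(1, 1) prior for the Thompson rule, and that every arm has been pulled at least once (otherwise the greedy mean is undefined); the function names and counts are illustrative.

```python
import numpy as np

def greedy_choice(heads, pulls):
    """Greedy: pick the arm with the highest empirical mean of observed rewards."""
    return int(np.argmax(np.asarray(heads) / np.asarray(pulls)))

def thompson_choice(heads, pulls, rng, k=1):
    """Thompson: average k posterior samples per arm, then pick the best arm."""
    heads = np.asarray(heads, dtype=float)
    tails = np.asarray(pulls, dtype=float) - heads
    # Posterior per arm under the assumed Beta(1, 1) prior.
    samples = rng.beta(heads + 1.0, tails + 1.0, size=(k, len(heads)))
    return int(np.argmax(samples.mean(axis=0)))

heads, pulls = [2, 1], [3, 3]  # arm 0: 2 heads in 3 flips, arm 1: 1 head in 3 flips
rng = np.random.default_rng(0)
greedy_picks = {greedy_choice(heads, pulls) for _ in range(1000)}
thompson_picks = {thompson_choice(heads, pulls, rng) for _ in range(1000)}
print(greedy_picks, thompson_picks)
```

With the data held fixed, the greedy rule returns the same arm every time, while the Thompson rule still selects the other arm on some draws.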
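– As $n$ increases, $\Pr(R(a) \mid r_1^a, \dots, r_n^a)$ becomes more peaked
– As $k$ increases, $\hat{R}(a)$ approaches $E[R(a) \mid r_1^a, \dots, r_n^a]$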
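Both effects can be checked numerically. The sketch below assumes a Beta posterior over the average reward and estimates the standard deviation of the averaged sample for different amounts of data and different sample sizes; the helper name `r_hat_std` and the specific parameter values are illustrative.

```python
import numpy as np

def r_hat_std(alpha, beta, k, trials=20000, seed=0):
    """Standard deviation of the mean of k samples drawn from Beta(alpha, beta)."""
    rng = np.random.default_rng(seed)
    samples = rng.beta(alpha, beta, size=(trials, k))
    return float(samples.mean(axis=1).std())

# More samples per decision (larger k): the averaged estimate is less stochastic.
std_k1 = r_hat_std(3, 2, k=1)
std_k10 = r_hat_std(3, 2, k=10)
std_k100 = r_hat_std(3, 2, k=100)

# More data (larger counts, same mean 0.6): the posterior itself is more peaked.
std_more_data = r_hat_std(30, 20, k=1)
print(std_k1, std_k10, std_k100, std_more_data)
```

A smaller spread means the arm ranking changes less often from one decision to the next, that is, less exploration.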
– $S$: set of states where each state $s$ is defined by a vector of features $x^s = (x_1^s, x_2^s, \dots, x_k^s)$
– $A$: set of actions where each action $a$ is associated with a vector of features $x^a = (x_1^a, x_2^a, \dots, x_l^a)$
– Space of rewards (often ℝ)
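To make the ingredients above concrete, the sketch below represents a state and the available actions by feature vectors and picks the action with the highest estimated reward. The linear reward estimate and the weight vector `w` are assumptions made only for this illustration; the feature values are made up.

```python
import numpy as np

def choose_action(state_features, action_features, weights):
    """Pick the action whose joint (state, action) features score highest.

    Assumption for illustration: estimated reward = weights . (x^s, x^a).
    """
    scores = [
        float(weights @ np.concatenate([state_features, x_a]))
        for x_a in action_features
    ]
    return int(np.argmax(scores))

x_s = np.array([1.0, 0.0])                               # state features x^s
actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # action features x^a
w = np.array([0.5, -0.2, 0.1, 0.9])                      # hypothetical weights
chosen = choose_action(x_s, actions, w)
print(chosen)
```

Here the two actions score 0.6 and 1.4, so the second action (index 1) is selected.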
. / ( .0", 12 / . 3, 4 *.
= @ ∑GH= @