

SLIDE 1

CS885 Reinforcement Learning Lecture 8b: May 25, 2018

Bayesian and Contextual Bandits [SutBar] Sec. 2.9

Pascal Poupart, University of Waterloo

SLIDE 2


Outline

  • Bayesian bandits

– Thompson sampling

  • Contextual bandits


SLIDE 3


Multi-Armed Bandits

  • Problem:

– $k$ bandits (arms) with unknown average reward $R(a)$
– Which arm $a$ should we play at each time step?
– Exploitation/exploration tradeoff

  • Common frequentist approaches:

– $\epsilon$-greedy
– Upper confidence bound (UCB)

  • Alternative Bayesian approaches:

– Thompson sampling
– Gittins indices


SLIDE 4


Bayesian Learning

  • Notation:

– !": random variable for #’s rewards – Pr !"; ' : unknown distribution (parameterized by ') – ( # = *[!"]: unknown average reward

  • Idea:

– Express uncertainty about $\theta$ by a prior $\Pr(\theta)$
– Compute the posterior $\Pr(\theta \mid r^a_1, r^a_2, \ldots, r^a_n)$ based on the samples $r^a_1, r^a_2, \ldots, r^a_n$ observed for $a$ so far

  • Bayes theorem:

$\Pr(\theta \mid r^a_1, r^a_2, \ldots, r^a_n) \propto \Pr(\theta)\,\Pr(r^a_1, r^a_2, \ldots, r^a_n \mid \theta)$


SLIDE 5


Distributional Information

  • Posterior over $\theta$ allows us to estimate:

– Distribution over the next reward $r^a$:
  $\Pr(r^a \mid r^a_1, r^a_2, \ldots, r^a_n) = \int_\theta \Pr(r^a; \theta)\,\Pr(\theta \mid r^a_1, r^a_2, \ldots, r^a_n)\, d\theta$

– Distribution over $R(a)$ when $\theta$ includes the mean:
  $\Pr(R(a) \mid r^a_1, r^a_2, \ldots, r^a_n) = \Pr(\theta \mid r^a_1, r^a_2, \ldots, r^a_n)$ if $\theta = R(a)$

  • To guide exploration:

– UCB: $\Pr(R(a) \le \mathrm{bound}(r^a_1, r^a_2, \ldots, r^a_n)) \ge 1 - \delta$
– Bayesian techniques: $\Pr(R(a) \mid r^a_1, r^a_2, \ldots, r^a_n)$


SLIDE 6


Coin Example

  • Consider two biased coins $C_1$ and $C_2$

$ !" = Pr !" = ℎ)*+ $ !# = Pr !# = ℎ)*+

  • Problem:

– Maximize the number of heads in $N$ flips
– Which coin should we choose for each flip?


SLIDE 7


Bernoulli Variables

  • !"#, !"$ are Bernoulli variables with domain {0,1}
  • Bernoulli dist. are parameterized by their mean

– i.e. Pr !"#; -. = -. = 0 1. Pr !"$; -2 = -2 = 0(12)


SLIDE 8


Beta distribution

  • Let the prior $\Pr(\theta)$ be a Beta distribution:

$Beta(\theta; \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$

  • $\alpha - 1$: number of heads
  • $\beta - 1$: number of tails
  • $E[\theta] = \alpha / (\alpha + \beta)$

[Figure: $\Pr(\theta)$ vs. $\theta$ for $Beta(\theta; 1, 1)$, $Beta(\theta; 2, 8)$, and $Beta(\theta; 20, 80)$]
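What the figure conveys can also be checked numerically; a short sketch using scipy.stats.beta (one possible implementation, not prescribed by the slides): all three priors follow the count interpretation of $\alpha$ and $\beta$, and larger $\alpha + \beta$ concentrates the distribution.

```python
# Means and concentration of the three Beta priors from the figure.
from scipy.stats import beta

for a, b in [(1, 1), (2, 8), (20, 80)]:
    dist = beta(a, b)
    # mean = a / (a + b); spread shrinks as a + b grows
    print(f"Beta({a},{b}): mean={dist.mean():.2f}, std={dist.std():.3f}")
```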


SLIDE 9


Belief Update

  • Prior: $\Pr(\theta) = Beta(\theta; \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$
  • Posterior after coin flip:

$\Pr(\theta \mid head) \propto \Pr(\theta)\Pr(head \mid \theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta = \theta^{(\alpha+1)-1}(1-\theta)^{\beta-1} \propto Beta(\theta; \alpha+1, \beta)$

$\Pr(\theta \mid tail) \propto \Pr(\theta)\Pr(tail \mid \theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}(1-\theta) = \theta^{\alpha-1}(1-\theta)^{(\beta+1)-1} \propto Beta(\theta; \alpha, \beta+1)$
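Since the conjugate update only increments a count, it reduces to a one-line sketch (variable and function names here are my own, with hypothetical observations):

```python
# Conjugate Beta-Bernoulli belief update: heads increment alpha, tails increment beta.
def update_beta(alpha, beta, flip):
    """flip = 1 for head, 0 for tail."""
    return alpha + flip, beta + (1 - flip)

a, b = 1, 1                 # Beta(1, 1) uniform prior
for flip in [1, 0, 1, 1]:   # hypothetical observations
    a, b = update_beta(a, b, flip)
print(a, b)                 # Beta(4, 2): 3 heads, 1 tail observed
```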


SLIDE 10


Thompson Sampling

  • Idea:

– Sample several potential average rewards for each $a$:
  $\hat{R}_1(a), \ldots, \hat{R}_k(a) \sim \Pr(R(a) \mid r^a_1, \ldots, r^a_n)$
– Estimate the empirical average $\hat{R}(a) = \frac{1}{k}\sum_{i=1}^{k}\hat{R}_i(a)$
– Execute $\mathrm{argmax}_a\, \hat{R}(a)$

  • Coin example:

– $\Pr(R(a) \mid r^a_1, \ldots, r^a_n) = Beta(\theta_a; \alpha_a, \beta_a)$
  where $\alpha_a - 1 = \#heads$ and $\beta_a - 1 = \#tails$


SLIDE 11


Thompson Sampling Algorithm (Bernoulli Rewards)

ThompsonSampling(h)
    V ← 0
    For t = 1 to h
        Sample R̂_1(a), …, R̂_k(a) ~ Pr(R(a)) for each a
        R̂(a) ← (1/k) Σ_{i=1}^{k} R̂_i(a) for each a
        a* ← argmax_a R̂(a)
        Execute a* and receive r
        V ← V + r
        Update Pr(R(a*)) based on r
    Return V
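A minimal runnable sketch of this algorithm for Bernoulli rewards, assuming Beta(1, 1) priors, sample size k = 1, and hypothetical true arm means chosen for illustration:

```python
# Thompson sampling for Bernoulli bandits with Beta(1, 1) priors.
import numpy as np

def thompson_sampling(true_means, horizon, rng=np.random.default_rng(0)):
    n_arms = len(true_means)
    alpha = np.ones(n_arms)  # alpha - 1 = #heads observed per arm
    beta = np.ones(n_arms)   # beta - 1 = #tails observed per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Sample one potential average reward per arm (sample size k = 1)
        sampled_means = rng.beta(alpha, beta)
        a = int(np.argmax(sampled_means))
        r = float(rng.random() < true_means[a])  # Bernoulli reward
        total_reward += r
        # Conjugate belief update: heads increment alpha, tails increment beta
        alpha[a] += r
        beta[a] += 1.0 - r
    return total_reward

print(thompson_sampling([0.4, 0.6], horizon=1000))
```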


SLIDE 12

Comparison

Thompson Sampling

  • Action selection: $a^* = \mathrm{argmax}_a\, \hat{R}(a)$
  • Empirical mean: $\hat{R}(a) = \frac{1}{k}\sum_{i=1}^{k}\hat{R}_i(a)$
  • Samples: $\hat{R}_i(a) \sim \Pr(R(a) \mid r^a_1, \ldots, r^a_n)$ where $r^a_i \sim \Pr(R^a; \theta)$
  • Some exploration

Greedy Strategy

  • Action selection: $a^* = \mathrm{argmax}_a\, \tilde{R}(a)$
  • Empirical mean: $\tilde{R}(a) = \frac{1}{n}\sum_{i=1}^{n} r^a_i$
  • Samples: $r^a_i \sim \Pr(R^a; \theta)$
  • No exploration


SLIDE 13


Sample Size

  • In Thompson sampling, the amount of data $n$ and the sample size $k$ regulate the amount of exploration

  • As $n$ and $k$ increase, $\hat{R}(a)$ becomes less stochastic, which reduces exploration

– As $n \uparrow$, $\Pr(R(a) \mid r^a_1, \ldots, r^a_n)$ becomes more peaked
– As $k \uparrow$, $\hat{R}(a)$ approaches $E[R(a) \mid r^a_1, \ldots, r^a_n]$

  • The stochasticity of $\hat{R}(a)$ ensures that all actions are chosen with some probability (see the sketch below)
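A small sketch of this effect, assuming a hypothetical Beta(4, 2) posterior: averaging k posterior samples shrinks the spread of R̂(a) roughly as 1/√k, so larger k behaves more like acting on the posterior mean.

```python
# How the sample size k affects the stochasticity of the TS estimate R_hat(a).
import numpy as np

rng = np.random.default_rng(0)
posterior = lambda size: rng.beta(4, 2, size)  # hypothetical Beta(4, 2) posterior

for k in [1, 10, 100]:
    # 10000 replications of R_hat(a) = mean of k posterior samples
    r_hat = posterior((10_000, k)).mean(axis=1)
    print(f"k={k:3d}: std of R_hat = {r_hat.std():.3f}")  # shrinks ~ 1/sqrt(k)
```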


SLIDE 14


Analysis

  • Thompson sampling converges to the best arm
  • Theory:

– Expected cumulative regret: $O(\log T)$
– On par with UCB and $\epsilon$-greedy

  • Practice:

– Sample size $k$ often set to 1


SLIDE 15


Contextual Bandits

  • In many applications, the context provides additional information for selecting an action

– E.g., personalized advertising, user interfaces
– Context: user demographics (location, age, gender)

  • Actions can also be characterized by features that influence their payoff

– E.g., ads, webpages
– Action features: topics, keywords, etc.


SLIDE 16


Contextual Bandits

  • Contextual bandits: multi-armed bandits with states (corresponding to contexts) and action features

  • Formally:

– $S$: set of states, where each state $s$ is defined by a vector of features $x_s = (x^s_1, x^s_2, \ldots, x^s_k)$
– $A$: set of actions, where each action $a$ is associated with a vector of features $x_a = (x^a_1, x^a_2, \ldots, x^a_l)$
– Space of rewards (often $\mathbb{R}$)

  • No transition function, since the states at each step are independent
  • Goal: find a policy $\pi: x_s \to a$ that maximizes the expected reward $E[R(s, a)] = E[R \mid x_s, x_a]$


SLIDE 17


Approximate Reward Function

  • Common approach:

– Learn an approximate average reward function $\tilde{R}(s, a) = \tilde{R}(x)$ (where $x = (x_s, x_a)$) by regression

  • Linear approximation: $\tilde{R}(s, a) = w^\top x$
  • Non-linear approximation: $\tilde{R}(s, a) = neuralNet(x; w)$


SLIDE 18


Bayesian Linear Regression

  • Consider a Gaussian prior:

$\Pr(w) = N(w; 0, \lambda^2 I) \propto \exp\left(-\frac{w^\top w}{2\lambda^2}\right)$

  • Consider also a Gaussian likelihood:

$\Pr(r \mid x, w) = N(r; w^\top x, \sigma^2) \propto \exp\left(-\frac{(r - w^\top x)^2}{2\sigma^2}\right)$

  • The posterior is also Gaussian:

$\Pr(w \mid r, x) \propto \Pr(w)\Pr(r \mid x, w) \propto \exp\left(-\frac{w^\top w}{2\lambda^2}\right)\exp\left(-\frac{(r - w^\top x)^2}{2\sigma^2}\right) = N(w; \mu, \Sigma)$

where $\mu = \sigma^{-2}\,\Sigma\, x\, r$ and $\Sigma = (\sigma^{-2} x x^\top + \lambda^{-2} I)^{-1}$
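A minimal sketch of this posterior computation for a batch of observations, assuming hypothetical values for the prior scale λ and the noise level σ (the slides state the single-observation case; summing over observations is the standard batch form):

```python
# Bayesian linear regression posterior N(w; mu, Sigma) from the formulas above.
import numpy as np

def blr_posterior(X, r, lam=1.0, sigma=0.5):
    """X: (n, d) matrix of feature vectors, r: (n,) rewards."""
    d = X.shape[1]
    # Sigma = (sigma^-2 * sum_i x_i x_i^T + lambda^-2 * I)^-1
    precision = X.T @ X / sigma**2 + np.eye(d) / lam**2
    Sigma = np.linalg.inv(precision)
    # mu = sigma^-2 * Sigma * sum_i x_i r_i
    mu = Sigma @ X.T @ r / sigma**2
    return mu, Sigma

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # hypothetical data
r = np.array([0.9, 0.2, 1.1])
mu, Sigma = blr_posterior(X, r)
print(mu)
```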


SLIDE 19


Predictive Posterior

  • Consider a state-action pair $x = (x_s, x_a)$ for which we would like to predict the reward $r$

  • Predictive posterior:

$\Pr(r \mid x) = \int_w N(r; w^\top x, \sigma^2)\, N(w; \mu, \Sigma)\, dw = N(r; \mu^\top x,\ \sigma^2 + x^\top \Sigma x)$

  • UCB: $\Pr\left(r \le \mu^\top x + c\sqrt{x^\top \Sigma x}\right) \ge 1 - \delta$ where $c = 1 + \sqrt{\ln(2/\delta)/2}$

  • Thompson sampling: $\hat{r} \sim N(r; \mu^\top x, x^\top \Sigma x)$
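Given a posterior (μ, Σ) such as the one computed above, both exploration rules reduce to a line each; a minimal sketch (δ and the random seed are hypothetical choices):

```python
# UCB score and Thompson sample for one state-action feature vector x.
import numpy as np

def ucb_score(x, mu, Sigma, delta=0.1):
    # Pr(r <= mu^T x + c * sqrt(x^T Sigma x)) >= 1 - delta
    c = 1 + np.sqrt(np.log(2 / delta) / 2)
    return mu @ x + c * np.sqrt(x @ Sigma @ x)

def thompson_sample(x, mu, Sigma, rng=np.random.default_rng(0)):
    # Sample r_hat ~ N(mu^T x, x^T Sigma x)
    return rng.normal(mu @ x, np.sqrt(x @ Sigma @ x))
```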


SLIDE 20


Upper Confidence Bound (UCB) Algorithm (Linear Gaussian)

UCB(h)
    V ← 0; Pr(w) = N(w; μ, Σ) with μ = 0 and Σ = λ²I
    Repeat until t = h
        Receive state x_s
        For each action a, where x = (x_s, x_a), do
            confidenceBound(a) ← x^T μ + c √(x^T Σ x)
        a* ← argmax_a confidenceBound(a)
        Execute a* and receive r
        V ← V + r
        Update μ and Σ based on x = (x_s, x_{a*}) and r
    Return V
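A minimal runnable sketch of this loop; the simulated environment (hidden weight vector, random candidate feature vectors, parameter values) is hypothetical, and the posterior update uses the Bayesian linear regression formulas from the earlier slide:

```python
# Linear-Gaussian UCB for contextual bandits with a simulated environment.
import numpy as np

def lin_ucb(horizon=500, d=4, lam=1.0, sigma=0.5, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=d)             # hidden reward weights (simulator)
    c = 1 + np.sqrt(np.log(2 / delta) / 2)
    precision = np.eye(d) / lam**2          # Sigma^-1, from prior N(0, lam^2 I)
    b = np.zeros(d)                         # sigma^-2 * sum of x * r
    V = 0.0
    for _ in range(horizon):
        xs = [rng.normal(size=d) for _ in range(5)]  # 5 candidate (x_s, x_a) vectors
        Sigma = np.linalg.inv(precision)
        mu = Sigma @ b
        scores = [mu @ x + c * np.sqrt(x @ Sigma @ x) for x in xs]
        x = xs[int(np.argmax(scores))]
        r = true_w @ x + rng.normal(0, sigma)        # noisy reward
        V += r
        precision += np.outer(x, x) / sigma**2       # Bayesian linear regression update
        b += x * r / sigma**2
    return V

print(lin_ucb())
```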


SLIDE 21


Thompson Sampling Algorithm (Linear Gaussian)

ThompsonSampling(h)
    V ← 0; Pr(w) = N(w; μ, Σ) with μ = 0 and Σ = λ²I
    For t = 1 to h
        Receive state x_s
        For each action a, where x = (x_s, x_a), do
            Sample R̂_1(a), …, R̂_k(a) ~ N(r; x^T μ, x^T Σ x)
            R̂(a) ← (1/k) Σ_{i=1}^{k} R̂_i(a)
        a* ← argmax_a R̂(a)
        Execute a* and receive r
        V ← V + r
        Update μ and Σ based on x = (x_s, x_{a*}) and r
    Return V
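A minimal runnable sketch of this loop with sample size k = 1, again with a hypothetical simulated environment and parameter values:

```python
# Linear-Gaussian Thompson sampling for contextual bandits (sample size k = 1).
import numpy as np

def lin_thompson(horizon=500, d=4, lam=1.0, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=d)          # hidden reward weights (simulator)
    precision = np.eye(d) / lam**2       # Sigma^-1, from prior N(0, lam^2 I)
    b = np.zeros(d)                      # sigma^-2 * sum of x * r
    V = 0.0
    for _ in range(horizon):
        xs = [rng.normal(size=d) for _ in range(5)]  # candidate (x_s, x_a) vectors
        Sigma = np.linalg.inv(precision)
        mu = Sigma @ b
        # One posterior sample per action: R_hat(a) ~ N(mu^T x, x^T Sigma x)
        scores = [rng.normal(mu @ x, np.sqrt(x @ Sigma @ x)) for x in xs]
        x = xs[int(np.argmax(scores))]
        r = true_w @ x + rng.normal(0, sigma)        # noisy reward
        V += r
        precision += np.outer(x, x) / sigma**2       # posterior update
        b += x * r / sigma**2
    return V

print(lin_thompson())
```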


SLIDE 22


Industrial Use

  • Contextual bandits are now commonly used for

– Personalized advertising
– Personalized web content

  • MSN news: 26% improvement in click-through rate after adopting contextual bandits (https://www.microsoft.com/en-us/research/blog/real-world-interactive-learning-cusp-enabling-new-class-applications/)
