SLIDE 1

Module 13 Bayesian Bandits

CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo
(c) 2013 Pascal Poupart

SLIDE 2

Multi-Armed Bandits

  • Problem:

– N bandits with unknown average reward R(a)
– Which arm a should we play at each time step?
– Exploitation/exploration tradeoff

  • Common frequentist approaches:

– πœ—-greedy – Upper confidence bound (UCB)

  • Alternative Bayesian approaches:

– Thompson sampling
– Gittins indices

SLIDE 3

Bayesian Learning

  • Notation:

– r_a: random variable for arm a's rewards
– Pr(r_a; ΞΈ): unknown reward distribution (parameterized by ΞΈ)
– R(a) = E[r_a]: unknown average reward

  • Idea:

– Express uncertainty about ΞΈ by a prior Pr(ΞΈ)
– Compute the posterior Pr(ΞΈ | r_a^1, r_a^2, …, r_a^n) based on the samples r_a^1, r_a^2, …, r_a^n observed for a so far

  • Bayes theorem:

Pr πœ„ 𝑠

1 𝑏, 𝑠 2 𝑏, … , 𝑠 π‘œ 𝑏 ∝ Pr πœ„ Pr⁑

(𝑠

1 𝑏, 𝑠 2 𝑏, … , 𝑠 π‘œ 𝑏|πœ„)
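To make the update concrete, here is a minimal Python sketch (the observed rewards are hypothetical) that applies Bayes theorem on a discretized grid of ΞΈ values, assuming Bernoulli rewards and a uniform prior:

import numpy as np

# Pr(theta | r_a^1..r_a^n) ∝ Pr(theta) * Pr(r_a^1..r_a^n | theta),
# evaluated on a grid of theta values for Bernoulli rewards.
theta = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(theta)              # uniform prior Pr(theta)
samples = [1, 0, 1, 1, 0]                # hypothetical rewards observed for arm a

likelihood = np.ones_like(theta)
for r in samples:
    likelihood *= theta if r == 1 else (1.0 - theta)   # Pr(r | theta)

posterior = prior * likelihood
posterior /= posterior.sum()             # normalize over the grid
print((theta * posterior).sum())         # posterior mean, an estimate of R(a)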

SLIDE 4

Distributional Information

  • Posterior over πœ„ allows us to estimate

– Distribution over next reward 𝑠𝑏 Pr 𝑠𝑏|𝑠

1 𝑏, 𝑠 2 𝑏, … , 𝑠 π‘œ 𝑏 = Pr 𝑠𝑏; πœ„ Pr πœ„ 𝑠 1 𝑏, 𝑠 2 𝑏, … , 𝑠 π‘œ 𝑏 π‘’πœ„ πœ„

– Distribution over 𝑆(𝑏) when πœ„ includes the mean Pr 𝑆(𝑏) 𝑠

1 𝑏, 𝑠 2 𝑏, … , 𝑠 π‘œ 𝑏 = Pr πœ„ 𝑠 1 𝑏, 𝑠 2 𝑏, … , 𝑠 π‘œ 𝑏 if πœ„ = 𝑆(𝑏)

  • To guide exploration:

– UCB: Pr(R(a) ≀ bound(r_a^1, r_a^2, …, r_a^n)) β‰₯ 1 βˆ’ Ξ΄
– Bayesian techniques: Pr(R(a) | r_a^1, r_a^2, …, r_a^n)
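Anticipating the Beta posterior introduced on slide 7, both quantities have closed forms in the Bernoulli case; a small sketch with scipy.stats (the head/tail counts are hypothetical):

from scipy.stats import beta

# Hypothetical posterior after 3 heads and 7 tails under a Beta(1, 1) prior.
a, b = 1 + 3, 1 + 7

# Distribution over the next reward: Pr(r_a = 1 | data) = E[theta | data].
print(beta.mean(a, b))                   # 0.333...

# A UCB-style statement from the posterior:
# Pr(R(a) <= bound) >= 1 - delta, taking bound as the (1 - delta) quantile.
delta = 0.05
print(beta.ppf(1 - delta, a, b))         # upper confidence bound on R(a)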

SLIDE 5

Coin Example

  • Consider two biased coins C1 and C2

R(C1) = Pr(C1 = head)    R(C2) = Pr(C2 = head)

  • Problem:

– Maximize # of heads in k flips
– Which coin should we choose for each flip?
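As a running environment for the code sketches in this module, here is a toy version of the two coins (the bias values are hypothetical and hidden from the learner):

import random

TRUE_BIAS = {"C1": 0.45, "C2": 0.60}     # R(C1), R(C2): unknown to the learner

def flip(coin, rng):
    """Flip the chosen coin: returns 1 for head, 0 for tail."""
    return 1 if rng.random() < TRUE_BIAS[coin] else 0

rng = random.Random(0)
print(sum(flip("C2", rng) for _ in range(100)))   # about 60 heads out of 100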

SLIDE 6

Bernoulli Variables

  • r_C1, r_C2 are Bernoulli variables with domain {0, 1}
  • Bernoulli distributions are parameterized by their mean

– i.e. Pr(r_C1 = 1; ΞΈ_1) = ΞΈ_1 = R(C1) and Pr(r_C2 = 1; ΞΈ_2) = ΞΈ_2 = R(C2)

SLIDE 7

Beta distribution

  • Let the prior Pr(ΞΈ) be a Beta distribution:

Beta(ΞΈ; Ξ±, Ξ²) ∝ ΞΈ^(Ξ±βˆ’1) (1 βˆ’ ΞΈ)^(Ξ²βˆ’1)

  • Ξ± βˆ’ 1: # of heads
  • Ξ² βˆ’ 1: # of tails
  • E[ΞΈ] = Ξ±/(Ξ± + Ξ²)

𝐢𝑓𝑒𝑏 πœ„; 1, 1 𝐢𝑓𝑒𝑏 πœ„; 2, 8 𝐢𝑓𝑒𝑏(πœ„; 20, 80) πœ„ Pr⁑ (πœ„)

SLIDE 8

Belief Update

  • Prior: Pr πœ„ = 𝐢𝑓𝑒𝑏 πœ„; 𝛽, 𝛾 ∝ πœ„π›½βˆ’1 1 βˆ’ πœ„ π›Ύβˆ’1
  • Posterior after coin flip:

Pr πœ„ β„Žπ‘“π‘π‘’ ∝ ⁑⁑⁑⁑⁑⁑⁑⁑Pr πœ„ ⁑⁑⁑⁑⁑⁑⁑⁑⁑⁑⁑⁑Pr β„Žπ‘“π‘π‘’ πœ„

∝ πœ„π›½βˆ’1 1 βˆ’ πœ„ π›Ύβˆ’1 πœ„ = πœ„ 𝛽+1 βˆ’1 1 βˆ’ πœ„ π›Ύβˆ’1 ∝ 𝐢𝑓𝑒𝑏(πœ„; 𝛽 + 1, 𝛾) Pr πœ„ π‘’π‘π‘—π‘š ∝ ⁑⁑⁑⁑⁑⁑⁑⁑Pr πœ„ ⁑⁑⁑⁑⁑⁑⁑⁑⁑⁑⁑⁑Pr π‘’π‘π‘—π‘š πœ„ ∝ πœ„π›½βˆ’1 1 βˆ’ πœ„ π›Ύβˆ’1 (1 βˆ’ πœ„) = πœ„π›½βˆ’1 1 βˆ’ πœ„ (𝛾+1)βˆ’1 ∝ 𝐢𝑓𝑒𝑏(πœ„; 𝛽, 𝛾 + 1)

SLIDE 9

Thompson Sampling

  • Idea:

– Sample several potential average rewards R_1(a), …, R_k(a) ~ Pr(R(a) | r_a^1, …, r_a^n) for each a
– Estimate the empirical average RΜ‚(a) = (1/k) Ξ£_{i=1}^k R_i(a)
– Execute argmax_a RΜ‚(a)

  • Coin example

– Pr(R(a) | r_a^1, …, r_a^n) = Beta(ΞΈ_a; Ξ±_a, Ξ²_a)
where Ξ±_a βˆ’ 1 = #heads and Ξ²_a βˆ’ 1 = #tails observed for a

SLIDE 10

Thompson Sampling Algorithm (Bernoulli Rewards)

ThompsonSampling(h)
    V ← 0
    For n = 1 to h
        Sample R_1(a), …, R_k(a) ~ Pr(R(a))   βˆ€a
        RΜ‚(a) ← (1/k) Ξ£_{i=1}^k R_i(a)   βˆ€a
        a* ← argmax_a RΜ‚(a)
        Execute a* and receive r
        V ← V + r
        Update Pr(R(a*)) based on r
    Return V

SLIDE 11

Comparison

Thompson Sampling

  • Action selection: a* = argmax_a RΜ‚(a)
  • Empirical mean: RΜ‚(a) = (1/k) Ξ£_{i=1}^k R_i(a)
  • Samples: R_i(a) ~ Pr(R(a) | r_a^1 … r_a^n), where r_a^i ~ Pr(r_a; ΞΈ)
  • Some exploration

Greedy Strategy

  • Action selection: a* = argmax_a RΜ‚(a)
  • Empirical mean: RΜ‚(a) = (1/n) Ξ£_{i=1}^n r_a^i
  • Samples: r_a^i ~ Pr(r_a; ΞΈ)
  • No exploration
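For contrast, a sketch of the greedy strategy; the slide leaves initialization unspecified, so this version optimistically treats unpulled arms as having mean 1 (an assumption) to avoid dividing by zero:

import random

def greedy(true_means, horizon, rng=None):
    """Always play the arm with the best empirical mean so far."""
    rng = rng or random.Random(0)
    n_arms = len(true_means)
    heads = [0] * n_arms
    pulls = [0] * n_arms
    V = 0
    for _ in range(horizon):
        # Empirical mean (1/n) sum_i r_a^i; unpulled arms default to 1.0.
        means = [heads[a] / pulls[a] if pulls[a] else 1.0 for a in range(n_arms)]
        a_star = max(range(n_arms), key=lambda a: means[a])
        r = 1 if rng.random() < true_means[a_star] else 0
        V += r
        heads[a_star] += r
        pulls[a_star] += 1
    return V

print(greedy([0.45, 0.60], horizon=1000))   # can lock onto the inferior arm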


SLIDE 12

Sample Size

  • In Thompson sampling, the amount of data n and the sample size k regulate the amount of exploration
  • As n and k increase, RΜ‚(a) becomes less stochastic, which reduces exploration

– As n ↑, Pr(R(a) | r_a^1 … r_a^n) becomes more peaked
– As k ↑, RΜ‚(a) approaches E[R(a) | r_a^1 … r_a^n]

  • The stochasticity of RΜ‚(a) ensures that all actions are chosen with some probability
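A small sketch illustrating both effects: the spread of RΜ‚(a) shrinks as the posterior counts grow (mimicking larger n) and as k grows; all counts are hypothetical:

import random
import statistics

rng = random.Random(1)

def r_hat_spread(alpha, beta, k, trials=2000):
    """Empirical std of R_hat(a), the average of k posterior samples."""
    draws = [sum(rng.betavariate(alpha, beta) for _ in range(k)) / k
             for _ in range(trials)]
    return statistics.pstdev(draws)

# Larger counts (more data) or larger k => less stochastic R_hat(a).
for alpha, beta in [(2, 2), (20, 20)]:
    for k in [1, 10]:
        print(f"alpha={alpha}, beta={beta}, k={k}: "
              f"std = {r_hat_spread(alpha, beta, k):.3f}")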

SLIDE 13

Continuous Rewards

  • So far we assumed that r ∈ {0, 1}
  • What about continuous rewards, i.e. r ∈ [0, 1]?

– NB: rewards in [a, b] can be remapped to [0, 1] by an affine transformation without changing the problem

  • Idea:

– When we receive a reward r
– Sample b ~ Bernoulli(r) s.t. b ∈ {0, 1}
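The trick in one function: b is an unbiased summary of r because E[b] = Pr(b = 1) = r, so a Beta posterior updated with b still tracks the mean reward (a minimal sketch):

import random

def binarize(r, rng):
    """Bernoulli trick: map r in [0, 1] to b in {0, 1} with E[b] = r."""
    return 1 if rng.random() < r else 0

rng = random.Random(2)
print(sum(binarize(0.3, rng) for _ in range(1000)) / 1000)   # close to 0.3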

SLIDE 14

Thompson Sampling Algorithm (Continuous Rewards)

ThompsonSampling(h)
    V ← 0
    For n = 1 to h
        Sample R_1(a), …, R_k(a) ~ Pr(R(a))   βˆ€a
        RΜ‚(a) ← (1/k) Ξ£_{i=1}^k R_i(a)   βˆ€a
        a* ← argmax_a RΜ‚(a)
        Execute a* and receive r
        V ← V + r
        Sample b ~ Bernoulli(r)
        Update Pr(R(a*)) based on b
    Return V

SLIDE 15

Analysis

  • Thompson sampling converges to the best arm
  • Theory:

– Expected cumulative regret: O(log n)
– On par with UCB and Ξ΅-greedy

  • Practice:

– Sample size k often set to 1
– Used by Bing for ad placement

  • Graepel, Candela, Borchert, Herbrich (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. ICML.