Module 13: Bayesian Bandits
CS 886: Sequential Decision Making and Reinforcement Learning, University of Waterloo
Pascal Poupart (c) 2013
Multi-Armed Bandits
- Problem:
  - Bandits (arms) with unknown average reward μ(a)
  - Which arm a should we play at each time step?
  - Exploitation/exploration tradeoff
- Common frequentist approaches:
  - ε-greedy
  - Upper confidence bound (UCB)
- Alternative Bayesian approaches:
  - Thompson sampling
  - Gittins indices
Bayesian Learning
- Notation:
  - R^a: random variable for arm a's rewards
  - Pr(R^a; θ): unknown reward distribution (parameterized by θ)
  - μ(a) = E[R^a]: unknown average reward
- Idea:
  - Express uncertainty about θ by a prior Pr(θ)
  - Compute the posterior Pr(θ | r_1^a, r_2^a, …, r_n^a) based on the samples r_1^a, r_2^a, …, r_n^a observed for a so far
- Bayes theorem:
  Pr(θ | r_1^a, r_2^a, …, r_n^a) ∝ Pr(θ) Pr(r_1^a, r_2^a, …, r_n^a | θ)
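To make the update concrete, here is a minimal sketch with a discrete prior over two candidate values of θ for a Bernoulli arm (plain Python; the numbers are illustrative, not from the slides):

# Bayes theorem with a discrete prior: posterior ∝ prior × likelihood.
thetas = [0.4, 0.8]          # candidate values of theta
posterior = [0.5, 0.5]       # uniform prior Pr(theta)
rewards = [1, 1, 0, 1]       # observed samples r_1^a, ..., r_n^a

def likelihood(r, theta):    # Pr(r | theta) for a Bernoulli arm
    return theta if r == 1 else 1 - theta

for r in rewards:
    posterior = [p * likelihood(r, th) for p, th in zip(posterior, thetas)]
norm = sum(posterior)
posterior = [p / norm for p in posterior]
print(posterior)             # mass shifts toward theta = 0.8 (mostly heads)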
Distributional Information
- The posterior over θ allows us to estimate:
  - The distribution over the next reward r^a:
    Pr(r^a | r_1^a, r_2^a, …, r_n^a) = ∫ Pr(r^a; θ) Pr(θ | r_1^a, r_2^a, …, r_n^a) dθ
  - The distribution over μ(a) when θ includes the mean:
    Pr(μ(a) | r_1^a, r_2^a, …, r_n^a) = Pr(θ | r_1^a, r_2^a, …, r_n^a) if θ = μ(a)
- To guide exploration:
  - UCB: Pr(μ(a) ≤ bound(r_1^a, r_2^a, …, r_n^a)) ≥ 1 − δ
  - Bayesian techniques: Pr(μ(a) | r_1^a, r_2^a, …, r_n^a)
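For the Beta/Bernoulli setting used in the coin example below, the integral above has a closed form: if the posterior is Beta(θ; α, β), then Pr(r^a = 1 | r_1^a, …, r_n^a) = E[θ] = α/(α + β). A quick Monte Carlo check of this identity (assuming numpy; the parameter values are illustrative):

import numpy as np

# Posterior predictive of a Beta-Bernoulli model, checked by sampling.
alpha, beta = 3.0, 5.0
theta = np.random.default_rng(0).beta(alpha, beta, size=100_000)
print(theta.mean())             # Monte Carlo estimate of E[theta]
print(alpha / (alpha + beta))   # closed form: 0.375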
Coin Example
- Consider two biased coins C1 and C2:
  μ(C1) = Pr(C1 = head)
  μ(C2) = Pr(C2 = head)
- Problem:
  - Maximize the # of heads in N flips
  - Which coin should we choose for each flip?
Bernoulli Variables
- R^{C1} and R^{C2} are Bernoulli variables with domain {0, 1}
- Bernoulli distributions are parameterized by their mean,
  i.e. Pr(R^{C1} = 1; θ_1) = θ_1 = μ(C1) and Pr(R^{C2} = 1; θ_2) = θ_2 = μ(C2)
Beta Distribution
- Let the prior Pr(θ) be a Beta distribution:
  Beta(θ; α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
- α − 1: # of heads
- β − 1: # of tails
- E[θ] = α / (α + β)
[Figure: densities Pr(θ) vs. θ for Beta(θ; 1, 1), Beta(θ; 2, 8), and Beta(θ; 20, 80)]
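The three priors in the figure can be inspected numerically: Beta(2, 8) and Beta(20, 80) share the mean 0.2, but the distribution gets more peaked as the pseudo-counts grow. A small sketch (assuming scipy):

from scipy.stats import beta as beta_dist

# Compare the three priors from the figure: same type of density,
# increasingly concentrated as the pseudo-counts alpha, beta grow.
for a, b in [(1, 1), (2, 8), (20, 80)]:
    d = beta_dist(a, b)
    print(f"Beta({a},{b}): mean = {d.mean():.2f}, std = {d.std():.3f}")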
Belief Update
- Prior: Pr(θ) = Beta(θ; α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
- Posterior after a coin flip:
  Pr(θ | head) ∝ Pr(θ) Pr(head | θ) ∝ θ^(α−1) (1 − θ)^(β−1) θ
              = θ^((α+1)−1) (1 − θ)^(β−1) ∝ Beta(θ; α + 1, β)
  Pr(θ | tail) ∝ Pr(θ) Pr(tail | θ) ∝ θ^(α−1) (1 − θ)^(β−1) (1 − θ)
              = θ^(α−1) (1 − θ)^((β+1)−1) ∝ Beta(θ; α, β + 1)
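Because the Beta prior is conjugate to the Bernoulli likelihood, the belief update is just a count increment. A minimal sketch in plain Python (the function name is mine):

def update(alpha, beta, r):
    """Posterior Beta parameters after observing flip r in {0, 1}:
    heads increment alpha, tails increment beta."""
    return (alpha + 1, beta) if r == 1 else (alpha, beta + 1)

a, b = 1, 1                  # uniform prior Beta(theta; 1, 1)
for r in [1, 0, 1, 1]:       # head, tail, head, head
    a, b = update(a, b, r)
print(a, b)                  # Beta(theta; 4, 2): 3 heads, 1 tail seen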
Thompson Sampling
- Idea:
  - Sample several potential average rewards for each arm a:
    μ_1(a), …, μ_k(a) ~ Pr(μ(a) | r_1^a, …, r_n^a)
  - Estimate the empirical average:
    μ̂(a) = (1/k) Σ_{i=1}^k μ_i(a)
  - Execute argmax_a μ̂(a)
- Coin example:
  - Pr(μ(a) | r_1^a, …, r_n^a) = Beta(θ_a; α_a, β_a)
    where α_a − 1 = #heads_a and β_a − 1 = #tails_a
Thompson Sampling Algorithm (Bernoulli Rewards)

ThompsonSampling(h)
  V ← 0
  For t = 1 to h
    Sample μ_1(a), …, μ_k(a) ~ Pr(μ(a))  ∀a
    μ̂(a) ← (1/k) Σ_{i=1}^k μ_i(a)  ∀a
    a* ← argmax_a μ̂(a)
    Execute a* and receive reward r
    V ← V + r
    Update Pr(μ(a*)) based on r
  Return V
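A runnable sketch of the algorithm above for Bernoulli arms with Beta(1, 1) priors (assuming numpy; the arm probabilities and horizon are illustrative):

import numpy as np

def thompson_sampling(true_means, horizon, k=1, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.ones(n_arms)                   # 1 + #heads per arm
    beta = np.ones(n_arms)                    # 1 + #tails per arm
    V = 0.0
    for _ in range(horizon):
        # sample k candidate means per arm and average them
        samples = rng.beta(alpha, beta, size=(k, n_arms))
        mu_hat = samples.mean(axis=0)
        a = int(np.argmax(mu_hat))            # a* = argmax_a mu_hat(a)
        r = float(rng.random() < true_means[a])   # simulated Bernoulli reward
        V += r
        alpha[a] += r                         # conjugate update of Pr(mu(a*))
        beta[a] += 1 - r
    return V

print(thompson_sampling([0.3, 0.5, 0.7], horizon=1000))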
Comparison

Thompson Sampling:
- Action selection: a* = argmax_a μ̂(a)
- Empirical mean: μ̂(a) = (1/k) Σ_{i=1}^k μ_i(a)
- Samples: μ_i(a) ~ Pr(μ(a) | r_1^a … r_n^a), with observed rewards r_i^a ~ Pr(R^a; θ)
- Some exploration

Greedy Strategy:
- Action selection: a* = argmax_a μ̂(a)
- Empirical mean: μ̂(a) = (1/n) Σ_{i=1}^n r_i^a
- Samples: r_i^a ~ Pr(R^a; θ)
- No exploration
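For contrast, a sketch of the greedy strategy in the same simulated Bernoulli setup as above; because μ̂(a) is a deterministic function of the observed rewards, greedy can lock onto a suboptimal arm and never explore:

import numpy as np

def greedy(true_means, horizon, seed=0):
    """Greedy strategy: always play the arm with the highest empirical mean."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    V = 0.0
    for t in range(horizon):
        if t < n_arms:
            a = t                               # pull each arm once to start
        else:
            a = int(np.argmax(sums / counts))   # purely exploitative choice
        r = float(rng.random() < true_means[a])
        sums[a] += r
        counts[a] += 1
        V += r
    return V

print(greedy([0.3, 0.5, 0.7], horizon=1000))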
Sample Size
- In Thompson sampling, the amount of data n and the sample size k regulate the amount of exploration
- As n and k increase, μ̂(a) becomes less stochastic, which reduces exploration:
  - As n → ∞, Pr(μ(a) | r_1^a … r_n^a) becomes more peaked
  - As k → ∞, μ̂(a) approaches E[μ(a) | r_1^a … r_n^a]
- The stochasticity of μ̂(a) ensures that all actions are chosen with some probability
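A quick numerical illustration of the effect of k (assuming numpy; the posterior parameters are illustrative): the spread of μ̂(a) shrinks roughly as 1/√k, so larger k means less exploration.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 3.0, 5.0                    # some fixed posterior Beta(3, 5)
for k in [1, 10, 100, 1000]:
    mu_hat = rng.beta(alpha, beta, size=(10_000, k)).mean(axis=1)
    print(k, round(mu_hat.std(), 4))      # spread of mu_hat shrinks with k
# As k -> infinity, mu_hat -> E[theta] = alpha / (alpha + beta) = 0.375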
Continuous Rewards
- So far we assumed that r ∈ {0, 1}
- What about continuous rewards, i.e. r ∈ [0, 1]?
  - NB: rewards in [a, b] can be remapped to [0, 1] by an affine transformation without changing the problem
- Idea:
  - When we receive a reward r, sample r̃ ~ Bernoulli(r) so that r̃ ∈ {0, 1}
  - Since E[r̃] = r, updating the posterior with r̃ still tracks the arm's average reward
Thompson Sampling Algorithm (Continuous Rewards)

ThompsonSampling(h)
  V ← 0
  For t = 1 to h
    Sample μ_1(a), …, μ_k(a) ~ Pr(μ(a))  ∀a
    μ̂(a) ← (1/k) Σ_{i=1}^k μ_i(a)  ∀a
    a* ← argmax_a μ̂(a)
    Execute a* and receive reward r
    V ← V + r
    Sample r̃ ~ Bernoulli(r)
    Update Pr(μ(a*)) based on r̃
  Return V
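A sketch of the binarization step (assuming numpy; the reward values are illustrative): the continuous reward is replaced by a random bit with the same mean, after which the conjugate Beta update applies unchanged.

import numpy as np

def binarize(r, rng):
    """Replace r in [0, 1] by r_tilde in {0, 1} with E[r_tilde] = r."""
    return float(rng.random() < r)

rng = np.random.default_rng(0)
alpha, beta = 1.0, 1.0                 # uniform prior on the arm's mean
for r in [0.9, 0.7, 0.8]:              # continuous rewards received
    r_tilde = binarize(r, rng)
    alpha += r_tilde                   # conjugate update on the binary proxy
    beta += 1 - r_tilde
print(alpha, beta)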
Analysis
- Thompson sampling converges to the best arm
- Theory:
  - Expected cumulative regret: O(log T)
  - On par with UCB and ε-greedy
- Practice:
  - Sample size k often set to 1
  - Used by Bing for ad placement:
    Graepel, Candela, Borchert, Herbrich (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. ICML.