Module 13: Bayesian Bandits
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Multi-Armed Bandits
• Problem:
  – k bandits with unknown average reward R(a)
  – Which arm a should we play at each time step?
  – Exploitation/exploration tradeoff
• Common frequentist approaches:
  – ε-greedy
  – Upper confidence bound (UCB)
• Alternative Bayesian approaches:
  – Thompson sampling
  – Gittins indices
Bayesian Learning
• Notation:
  – R^a: random variable for arm a's rewards
  – Pr(R^a; θ): unknown distribution (parameterized by θ)
  – R(a) = E[R^a]: unknown average reward
• Idea:
  – Express uncertainty about θ by a prior Pr(θ)
  – Compute the posterior Pr(θ | r_1^a, r_2^a, …, r_n^a) based on the samples r_1^a, r_2^a, …, r_n^a observed for a so far
• Bayes' theorem:
  Pr(θ | r_1^a, r_2^a, …, r_n^a) ∝ Pr(θ) Pr(r_1^a, r_2^a, …, r_n^a | θ)
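To make this concrete, here is a minimal numerical sketch of the Bayes-rule computation (my own example, not from the slides): it assumes Bernoulli rewards and a uniform prior, and evaluates the posterior on a discretized grid over θ.

```python
import numpy as np

# A minimal sketch (assumed setup: Bernoulli rewards, uniform prior):
# Bayes' rule computed numerically on a grid over theta.
thetas = np.linspace(0.001, 0.999, 999)          # discretized values of theta
prior = np.full_like(thetas, 1.0 / len(thetas))  # uniform prior Pr(theta)

rewards = [1, 0, 1, 1]                           # observed samples r_1..r_n for arm a

# Likelihood Pr(r_1..r_n | theta) for Bernoulli rewards
likelihood = np.ones_like(thetas)
for r in rewards:
    likelihood *= thetas if r == 1 else (1.0 - thetas)

# Posterior: Pr(theta | r_1..r_n) ∝ Pr(theta) Pr(r_1..r_n | theta)
posterior = prior * likelihood
posterior /= posterior.sum()                     # normalize

print("posterior mean of theta:", (thetas * posterior).sum())
```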
Distributional Information
• The posterior over θ allows us to estimate:
  – The distribution over the next reward r^a:
    Pr(r^a | r_1^a, r_2^a, …, r_n^a) = ∫_θ Pr(r^a; θ) Pr(θ | r_1^a, r_2^a, …, r_n^a) dθ
  – The distribution over R(a) when θ includes the mean:
    Pr(R(a) | r_1^a, r_2^a, …, r_n^a) = Pr(θ | r_1^a, r_2^a, …, r_n^a) if θ = R(a)
• To guide exploration:
  – UCB: Pr(R(a) ≤ bound(r_1^a, r_2^a, …, r_n^a)) ≥ 1 − δ
  – Bayesian techniques: Pr(R(a) | r_1^a, r_2^a, …, r_n^a)
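For the Beta–Bernoulli model introduced on the next slides, this integral over θ has a closed form; a hedged sketch (the helper name predictive_prob_head is mine):

```python
# Posterior predictive in the Beta-Bernoulli case (a sketch; the Beta
# posterior is defined on the following slides):
# Pr(next flip = head | data) = ∫ theta · Beta(theta; alpha, beta) dtheta
#                             = alpha / (alpha + beta)
def predictive_prob_head(alpha: float, beta: float) -> float:
    """Probability that the next flip is a head under a Beta(alpha, beta) posterior."""
    return alpha / (alpha + beta)

# e.g. prior Beta(1, 1) plus 3 observed heads and 1 tail -> 4/6 ≈ 0.667
print(predictive_prob_head(1 + 3, 1 + 1))
```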
Coin Example
• Consider two biased coins C_1 and C_2:
  R(C_1) = Pr(C_1 = head)
  R(C_2) = Pr(C_2 = head)
• Problem:
  – Maximize the number of heads in N flips
  – Which coin should we choose for each flip?
Bernoulli Variables
• R^{C_1} and R^{C_2} are Bernoulli variables with domain {0,1}
• Bernoulli distributions are parameterized by their mean, i.e.
  Pr(R^{C_1} = head; θ_1) = θ_1 = R(C_1)
  Pr(R^{C_2} = head; θ_2) = θ_2 = R(C_2)
Beta Distribution
• Let the prior Pr(θ) be a Beta distribution:
  Beta(θ; α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
• α − 1: # of heads
• β − 1: # of tails
• E[θ] = α / (α + β)
[Figure: Pr(θ) vs. θ for Beta(θ; 1, 1), Beta(θ; 2, 8), and Beta(θ; 20, 80)]
Belief Update
• Prior: Pr(θ) = Beta(θ; α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
• Posterior after a coin flip:
  Pr(θ | head) ∝ Pr(θ) Pr(head | θ)
               ∝ θ^(α−1) (1 − θ)^(β−1) · θ
               = θ^((α+1)−1) (1 − θ)^(β−1)
               ∝ Beta(θ; α + 1, β)
  Pr(θ | tail) ∝ Pr(θ) Pr(tail | θ)
               ∝ θ^(α−1) (1 − θ)^(β−1) · (1 − θ)
               = θ^(α−1) (1 − θ)^((β+1)−1)
               ∝ Beta(θ; α, β + 1)
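A minimal sketch of this conjugate update in code (the sequence of flips is hypothetical; a head increments α, a tail increments β):

```python
# Conjugate Beta-Bernoulli belief update, as derived above.
def update(alpha: float, beta: float, flip: int) -> tuple:
    """Return the updated Beta parameters after observing flip (1 = head, 0 = tail)."""
    return (alpha + 1, beta) if flip == 1 else (alpha, beta + 1)

alpha, beta = 1.0, 1.0           # uniform prior Beta(1, 1)
for flip in [1, 1, 0, 1]:        # a hypothetical sequence of observed flips
    alpha, beta = update(alpha, beta, flip)

print(alpha, beta)                              # Beta(4, 2): 3 heads, 1 tail
print("posterior mean:", alpha / (alpha + beta))
```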
Thompson Sampling
• Idea:
  – Sample several potential average rewards:
    R_1(a), …, R_k(a) ~ Pr(R(a) | r_1^a, …, r_n^a) for each a
  – Estimate the empirical average: R̂(a) = (1/k) Σ_{i=1}^k R_i(a)
  – Execute argmax_a R̂(a)
• Coin example:
  Pr(R(a) | r_1^a, …, r_n^a) = Beta(θ_a; α_a, β_a)
  where α_a − 1 = # heads and β_a − 1 = # tails
Thompson Sampling Algorithm (Bernoulli Rewards)

ThompsonSampling(h)
  V ← 0
  For t = 1 to h
    Sample R_1(a), …, R_k(a) ~ Pr(R(a))  ∀a
    R̂(a) ← (1/k) Σ_{i=1}^k R_i(a)  ∀a
    a* ← argmax_a R̂(a)
    Execute a* and receive r
    V ← V + r
    Update Pr(R(a*)) based on r
  Return V
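A hedged Python sketch of this pseudocode (the simulation setup, the arm means, and the default k = 1 are my assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sampling(true_means, h, k=1):
    """A sketch of the slide's pseudocode for Bernoulli bandits.

    true_means -- hidden Pr(head) of each arm (used only to simulate rewards)
    h          -- horizon (number of plays)
    k          -- number of posterior samples averaged into R_hat(a)
    """
    n_arms = len(true_means)
    alpha = np.ones(n_arms)                 # Beta(1, 1) prior for every arm
    beta = np.ones(n_arms)
    V = 0.0
    for _ in range(h):
        # Sample R_1(a)..R_k(a) ~ Beta(alpha_a, beta_a) and average them
        samples = rng.beta(alpha, beta, size=(k, n_arms))
        r_hat = samples.mean(axis=0)
        a_star = int(np.argmax(r_hat))
        r = float(rng.random() < true_means[a_star])   # simulated Bernoulli reward
        V += r
        alpha[a_star] += r                  # conjugate update of the played arm
        beta[a_star] += 1.0 - r
    return V

print(thompson_sampling([0.4, 0.6], h=1000))
```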
Comparison: Thompson Sampling vs. Greedy Strategy
• Action selection (both): a* = argmax_a R̂(a)
• Empirical mean:
  – Thompson sampling: R̂(a) = (1/k) Σ_{i=1}^k R_i(a)
  – Greedy strategy: R̂(a) = (1/n) Σ_{i=1}^n r_i^a
• Samples:
  – Thompson sampling: R_1(a), …, R_k(a) ~ Pr(R(a) | r_1^a, …, r_n^a) → some exploration
  – Greedy strategy: r_1^a, …, r_n^a ~ Pr(R^a; θ) → no exploration
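To see the difference empirically, a small head-to-head simulation under assumed arm means (my own illustrative setup, not from the slides): greedy ranks arms by the empirical mean of observed rewards, Thompson sampling by a draw from the Beta posterior, so only the latter keeps exploring.

```python
import numpy as np

rng = np.random.default_rng(1)

def run(select, true_means=(0.4, 0.6), h=2000):
    """Play h rounds, choosing arms with the given selection rule."""
    heads = np.zeros(len(true_means))
    pulls = np.zeros(len(true_means))
    total = 0.0
    for _ in range(h):
        a = select(heads, pulls)
        r = float(rng.random() < true_means[a])
        heads[a] += r
        pulls[a] += 1
        total += r
    return total

# Greedy: empirical mean (0.5 for unpulled arms); Thompson: posterior draw.
greedy = lambda h, p: int(np.argmax(np.where(p > 0, h / np.maximum(p, 1), 0.5)))
thompson = lambda h, p: int(np.argmax(rng.beta(1 + h, 1 + p - h)))

print("greedy  :", run(greedy))
print("thompson:", run(thompson))
```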
Sample Size
• In Thompson sampling, the amount of data n and the sample size k regulate the amount of exploration
• As n and k increase, R̂(a) becomes less stochastic, which reduces exploration:
  – As n → ∞, Pr(R(a) | r_1^a, …, r_n^a) becomes more peaked
  – As k → ∞, R̂(a) approaches E[R(a) | r_1^a, …, r_n^a]
• The stochasticity of R̂(a) ensures that all actions are chosen with some probability
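A quick way to see the effect of k (the Beta(5, 5) posterior here is an assumed example): the spread of R̂(a) shrinks as k grows, so the arm choice becomes less exploratory.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, beta = 5, 5                      # some fixed Beta posterior (assumed)
for k in [1, 10, 100]:
    # Distribution of R_hat(a) = mean of k posterior draws
    r_hats = rng.beta(alpha, beta, size=(10000, k)).mean(axis=1)
    print(f"k = {k:>3}: std of R_hat = {r_hats.std():.3f}")
```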
Continuous Rewards
• So far we assumed that r ∈ {0,1}
• What about continuous rewards, i.e. r ∈ [0,1]?
  – NB: rewards in [a, b] can be remapped to [0,1] by an affine transformation without changing the problem
• Idea:
  – When we receive a reward r, sample r̃ ~ Bernoulli(r) so that r̃ ∈ {0,1}
Thompson Sampling Algorithm (Continuous Rewards)

ThompsonSampling(h)
  V ← 0
  For t = 1 to h
    Sample R_1(a), …, R_k(a) ~ Pr(R(a))  ∀a
    R̂(a) ← (1/k) Σ_{i=1}^k R_i(a)  ∀a
    a* ← argmax_a R̂(a)
    Execute a* and receive r
    V ← V + r
    Sample r̃ ~ Bernoulli(r)
    Update Pr(R(a*)) based on r̃
  Return V
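A sketch of the binarization step alone (assuming rewards already lie in [0, 1]); the point is that the Bernoulli pseudo-reward has the same expectation as the continuous reward, so the Beta–Bernoulli update still applies:

```python
import numpy as np

rng = np.random.default_rng(3)

# r_tilde ~ Bernoulli(r): Pr(r_tilde = 1) = r, hence E[r_tilde] = r.
def binarize(r: float) -> int:
    return int(rng.random() < r)

r = 0.73
print(np.mean([binarize(r) for _ in range(10000)]))   # ≈ 0.73
```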
Analysis
• Thompson sampling converges to the best arm
• Theory:
  – Expected cumulative regret: O(log n)
  – On par with UCB and ε-greedy
• Practice:
  – Sample size k is often set to 1
  – Used by Bing for ad placement:
    Graepel, Candela, Borchert, Herbrich (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. ICML.