

  1. Module 13: Bayesian Bandits
     CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo
     (c) 2013 Pascal Poupart

  2. Multi-Armed Bandits
     β€’ Problem:
       – N bandits with unknown average reward R(a)
       – Which arm a should we play at each time step?
       – Exploitation/exploration tradeoff
     β€’ Common frequentist approaches:
       – Ξ΅-greedy
       – Upper confidence bound (UCB)
     β€’ Alternative Bayesian approaches:
       – Thompson sampling
       – Gittins indices

  3. Bayesian Learning
     β€’ Notation:
       – r_a: random variable for a's rewards
       – Pr(r_a; θ): unknown distribution (parameterized by θ)
       – R(a) = E[r_a]: unknown average reward
     β€’ Idea:
       – Express uncertainty about θ by a prior Pr(θ)
       – Compute the posterior Pr(θ | r_1^a, r_2^a, …, r_n^a) based on the samples r_1^a, r_2^a, …, r_n^a observed for a so far
     β€’ Bayes theorem:
       Pr(θ | r_1^a, r_2^a, …, r_n^a) ∝ Pr(θ) Pr(r_1^a, r_2^a, …, r_n^a | θ)
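     A minimal sketch of this posterior computation, assuming Bernoulli rewards and a discretized grid over θ (both are illustrative choices, not from the slides):

         import numpy as np

         thetas = np.linspace(0.01, 0.99, 99)        # candidate values of theta
         prior = np.ones_like(thetas) / len(thetas)  # uniform prior Pr(theta)

         def posterior(prior, thetas, rewards):
             """Pr(theta | r_1..r_n) is proportional to Pr(theta) * prod_i Pr(r_i | theta)."""
             likelihood = np.prod([thetas**r * (1 - thetas)**(1 - r) for r in rewards], axis=0)
             unnorm = prior * likelihood
             return unnorm / unnorm.sum()

         post = posterior(prior, thetas, rewards=[1, 0, 1, 1])  # three heads, one tail
         print(thetas[np.argmax(post)])                         # MAP estimate, near 0.75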

  4. Distributional Information
     β€’ The posterior over θ allows us to estimate:
       – The distribution over the next reward r_a:
         Pr(r_a | r_1^a, r_2^a, …, r_n^a) = ∫_θ Pr(r_a; θ) Pr(θ | r_1^a, r_2^a, …, r_n^a) dθ
       – The distribution over R(a) when θ includes the mean:
         Pr(R(a) | r_1^a, r_2^a, …, r_n^a) = Pr(θ | r_1^a, r_2^a, …, r_n^a) if θ = R(a)
     β€’ To guide exploration:
       – UCB: Pr(R(a) ≤ bound(r_1^a, r_2^a, …, r_n^a)) β‰₯ 1 βˆ’ Ξ΄
       – Bayesian techniques: Pr(R(a) | r_1^a, r_2^a, …, r_n^a)

  5. Coin Example
     β€’ Consider two biased coins C_1 and C_2:
       R(C_1) = Pr(C_1 = head)
       R(C_2) = Pr(C_2 = head)
     β€’ Problem:
       – Maximize the number of heads in k flips
       – Which coin should we choose for each flip?

  6. Bernoulli Variables
     β€’ r_{C_1} and r_{C_2} are Bernoulli variables with domain {0, 1}
     β€’ Bernoulli distributions are parameterized by their mean, i.e.
       Pr(r_{C_1}; θ_1) = θ_1 = R(C_1)
       Pr(r_{C_2}; θ_2) = θ_2 = R(C_2)

  7. Beta Distribution
     β€’ Let the prior Pr(θ) be a Beta distribution:
       Beta(θ; Ξ±, Ξ²) ∝ θ^(Ξ±βˆ’1) (1 βˆ’ ΞΈ)^(Ξ²βˆ’1)
     β€’ Ξ± βˆ’ 1: number of heads
     β€’ Ξ² βˆ’ 1: number of tails
     β€’ E[θ] = Ξ± / (Ξ± + Ξ²)
     [Plot: Pr(θ) against θ for Beta(θ; 1, 1), Beta(θ; 2, 8), and Beta(θ; 20, 80)]
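     As a quick numerical check of these facts (scipy is an implementation choice here, not something the slides use):

         from scipy.stats import beta

         # mean = alpha/(alpha+beta); a larger alpha+beta gives a more peaked density
         for a, b in [(1, 1), (2, 8), (20, 80)]:
             d = beta(a, b)
             print(f"Beta({a},{b}): mean={d.mean():.2f}, std={d.std():.3f}")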

  8. Belief Update
     β€’ Prior: Pr(θ) = Beta(θ; Ξ±, Ξ²) ∝ θ^(Ξ±βˆ’1) (1 βˆ’ ΞΈ)^(Ξ²βˆ’1)
     β€’ Posterior after a coin flip:
       Pr(θ | head) ∝ Pr(θ) Pr(head | θ)
                    ∝ θ^(Ξ±βˆ’1) (1 βˆ’ ΞΈ)^(Ξ²βˆ’1) Β· ΞΈ
                    = ΞΈ^((Ξ±+1)βˆ’1) (1 βˆ’ ΞΈ)^(Ξ²βˆ’1)
                    ∝ Beta(θ; Ξ± + 1, Ξ²)
       Pr(θ | tail) ∝ Pr(θ) Pr(tail | θ)
                    ∝ θ^(Ξ±βˆ’1) (1 βˆ’ ΞΈ)^(Ξ²βˆ’1) Β· (1 βˆ’ ΞΈ)
                    = ΞΈ^(Ξ±βˆ’1) (1 βˆ’ ΞΈ)^((Ξ²+1)βˆ’1)
                    ∝ Beta(θ; Ξ±, Ξ² + 1)
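     In code the conjugate update is just a counter increment; a minimal sketch (the names are mine, not from the slides):

         def update_belief(alpha, beta_, flip):
             """One Beta-Bernoulli update: a head increments alpha, a tail increments beta."""
             if flip == 1:                 # head: Beta(theta; alpha+1, beta)
                 return alpha + 1, beta_
             return alpha, beta_ + 1      # tail: Beta(theta; alpha, beta+1)

         alpha, beta_ = 1, 1              # uniform prior Beta(1, 1)
         for flip in [1, 1, 0, 1]:        # three heads, one tail
             alpha, beta_ = update_belief(alpha, beta_, flip)
         print(alpha, beta_)              # Beta(4, 2): posterior mean 4/6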

  9. Thompson Sampling
     β€’ Idea:
       – Sample several potential average rewards for each a:
         RΜ‚_1(a), …, RΜ‚_k(a) ~ Pr(R(a) | r_1^a, …, r_n^a)
       – Estimate the empirical average:
         RΜ„(a) = (1/k) Ξ£_{i=1}^k RΜ‚_i(a)
       – Execute argmax_a RΜ„(a)
     β€’ Coin example:
       Pr(R(a) | r_1^a, …, r_n^a) = Beta(θ_a; Ξ±_a, Ξ²_a)
       where Ξ±_a βˆ’ 1 = #heads and Ξ²_a βˆ’ 1 = #tails

  10. Thompson Sampling Algorithm (Bernoulli Rewards)

      ThompsonSampling(h)
          V ← 0
          For n = 1 to h
              Sample RΜ‚_1(a), …, RΜ‚_k(a) ~ Pr(R(a))  βˆ€a
              RΜ„(a) ← (1/k) Ξ£_{i=1}^k RΜ‚_i(a)  βˆ€a
              a* ← argmax_a RΜ„(a)
              Execute a* and receive r
              V ← V + r
              Update Pr(R(a*)) based on r
          Return V
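      A runnable sketch of this algorithm for two Bernoulli arms, assuming Beta(1, 1) priors and true arm means chosen by me for the demo:

          import numpy as np

          rng = np.random.default_rng(0)
          true_means = [0.3, 0.6]          # hypothetical R(a) for two arms
          alpha = [1, 1]; beta_ = [1, 1]   # Beta posterior parameters per arm

          def thompson_sampling(horizon, k=1):
              V = 0
              for _ in range(horizon):
                  # sample k means per arm from the Beta posterior and average them
                  means = [rng.beta(alpha[a], beta_[a], size=k).mean()
                           for a in range(len(true_means))]
                  a_star = int(np.argmax(means))
                  r = rng.binomial(1, true_means[a_star])  # execute a*, receive r
                  V += r
                  alpha[a_star] += r                       # head: alpha + 1
                  beta_[a_star] += 1 - r                   # tail: beta + 1
              return V

          print(thompson_sampling(horizon=1000))  # total reward, roughly 0.6 * 1000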

  11. Comparison
      Thompson Sampling:
      β€’ Action selection: a* = argmax_a RΜ„(a)
      β€’ Empirical mean: RΜ„(a) = (1/k) Ξ£_{i=1}^k RΜ‚_i(a)
      β€’ Samples: RΜ‚_i(a) ~ Pr(R(a) | r_1^a … r_n^a)
      β€’ Some exploration

      Greedy Strategy:
      β€’ Action selection: a* = argmax_a RΜ„(a)
      β€’ Empirical mean: RΜ„(a) = (1/n) Ξ£_{i=1}^n r_i^a
      β€’ Samples: r_i^a ~ Pr(r^a; θ)
      β€’ No exploration

  12. Sample Size
      β€’ In Thompson sampling, the amount of data n and the sample size k regulate the amount of exploration
      β€’ As n and k increase, RΜ„(a) becomes less stochastic, which reduces exploration:
        – As n ↑, Pr(R(a) | r_1^a … r_n^a) becomes more peaked
        – As k ↑, RΜ„(a) approaches E[R(a) | r_1^a … r_n^a]
      β€’ The stochasticity of RΜ„(a) ensures that all actions are chosen with some probability
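      A quick illustration of the effect of k, assuming a fixed Beta(2, 8) posterior: averaging k posterior samples shrinks the spread of RΜ„(a), so exploration fades as k grows.

          import numpy as np

          rng = np.random.default_rng(0)
          for k in [1, 10, 100]:
              # 10,000 simulated values of the k-sample average of Beta(2, 8) draws
              r_bar = rng.beta(2, 8, size=(10_000, k)).mean(axis=1)
              print(f"k={k:3d}  std of sampled mean = {r_bar.std():.3f}")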

  13. Continuous Rewards
      β€’ So far we assumed that r ∈ {0, 1}
      β€’ What about continuous rewards, i.e. r ∈ [0, 1]?
        – NB: rewards in [a, b] can be remapped to [0, 1] by an affine transformation without changing the problem
      β€’ Idea:
        – When we receive a reward r
        – Sample b ~ Bernoulli(r) s.t. b ∈ {0, 1}
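      A minimal sketch of this trick (the function names are mine): the pseudo-reward b is binary with E[b] = r, so the Beta update from slide 8 still applies unchanged.

          import numpy as np

          rng = np.random.default_rng(0)

          def rescale(r, lo, hi):
              """Affine remap of a reward in [lo, hi] to [0, 1]."""
              return (r - lo) / (hi - lo)

          def binarize(r):
              """Sample b ~ Bernoulli(r): b = 1 with probability r."""
              return int(rng.random() < r)

          print(binarize(rescale(3.2, lo=0.0, hi=10.0)))  # 1 with probability 0.32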

  14. Thompson Sampling Algorithm (Continuous Rewards)

      ThompsonSampling(h)
          V ← 0
          For n = 1 to h
              Sample RΜ‚_1(a), …, RΜ‚_k(a) ~ Pr(R(a))  βˆ€a
              RΜ„(a) ← (1/k) Ξ£_{i=1}^k RΜ‚_i(a)  βˆ€a
              a* ← argmax_a RΜ„(a)
              Execute a* and receive r
              V ← V + r
              Sample b ~ Bernoulli(r)
              Update Pr(R(a*)) based on b
          Return V
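      Only the update step changes relative to the slide-10 sketch. A runnable variant, assuming for the demo that each arm's continuous reward is a Beta draw with the stated mean (my choice, not from the slides):

          import numpy as np

          rng = np.random.default_rng(0)
          true_means = [0.3, 0.6]
          alpha = [1, 1]; beta_ = [1, 1]

          def thompson_continuous(horizon, k=1):
              V = 0.0
              for _ in range(horizon):
                  means = [rng.beta(alpha[a], beta_[a], size=k).mean()
                           for a in range(len(true_means))]
                  a_star = int(np.argmax(means))
                  m = true_means[a_star]
                  r = rng.beta(2 * m, 2 * (1 - m))  # continuous reward in [0, 1] with mean m
                  V += r
                  b = int(rng.random() < r)         # b ~ Bernoulli(r)
                  alpha[a_star] += b                # update the belief with b, not r
                  beta_[a_star] += 1 - b
              return V

          print(thompson_continuous(horizon=1000))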

  15. Analysis
      β€’ Thompson sampling converges to the best arm
      β€’ Theory:
        – Expected cumulative regret: O(log n)
        – On par with UCB and Ξ΅-greedy
      β€’ Practice:
        – The sample size k is often set to 1
        – Used by Bing for ad placement:
          Graepel, Candela, Borchert, Herbrich (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. ICML.
