Module 13: Bayesian Bandits
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Multi-Armed Bandits
• Problem:
  – k bandits with unknown average reward R(a)
  – Which arm a should we play at each time step?
  – Exploitation/exploration tradeoff
• Common frequentist approaches:
  – ε-greedy
  – Upper confidence bound (UCB)
• Alternative Bayesian approaches:
  – Thompson sampling
  – Gittins indices
Bayesian Learning
• Notation:
  – R^a: random variable for arm a's rewards
  – Pr(R^a; θ): unknown distribution (parameterized by θ)
  – R(a) = E[R^a]: unknown average reward
• Idea:
  – Express uncertainty about θ by a prior Pr(θ)
  – Compute the posterior Pr(θ | r_1^a, r_2^a, …, r_n^a) based on the samples r_1^a, r_2^a, …, r_n^a observed for a so far
• Bayes' theorem:
  Pr(θ | r_1^a, r_2^a, …, r_n^a) ∝ Pr(θ) Pr(r_1^a, r_2^a, …, r_n^a | θ)
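To make this concrete, here is a minimal numerical sketch of the Bayes-rule computation (my own example, not from the slides): it assumes Bernoulli rewards and a uniform prior, and evaluates the posterior on a discretized grid over θ.

```python
import numpy as np

# A minimal sketch (assumed setup: Bernoulli rewards, uniform prior):
# Bayes' rule computed numerically on a grid over theta.
thetas = np.linspace(0.001, 0.999, 999)          # discretized values of theta
prior = np.full_like(thetas, 1.0 / len(thetas))  # uniform prior Pr(theta)

rewards = [1, 0, 1, 1]                           # observed samples r_1..r_n for arm a

# Likelihood Pr(r_1..r_n | theta) for Bernoulli rewards
likelihood = np.ones_like(thetas)
for r in rewards:
    likelihood *= thetas if r == 1 else (1.0 - thetas)

# Posterior: Pr(theta | r_1..r_n) ∝ Pr(theta) Pr(r_1..r_n | theta)
posterior = prior * likelihood
posterior /= posterior.sum()                     # normalize

print("posterior mean of theta:", (thetas * posterior).sum())
```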
Distributional Information
• The posterior over θ allows us to estimate:
  – The distribution over the next reward r^a:
    Pr(r^a | r_1^a, r_2^a, …, r_n^a) = ∫_θ Pr(r^a; θ) Pr(θ | r_1^a, r_2^a, …, r_n^a) dθ
  – The distribution over R(a) when θ includes the mean:
    Pr(R(a) | r_1^a, r_2^a, …, r_n^a) = Pr(θ | r_1^a, r_2^a, …, r_n^a) if θ = R(a)
• To guide exploration:
  – UCB: Pr(R(a) ≤ bound(r_1^a, r_2^a, …, r_n^a)) ≥ 1 − δ
  – Bayesian techniques: Pr(R(a) | r_1^a, r_2^a, …, r_n^a)
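For the Beta–Bernoulli model introduced on the next slides, this integral over θ has a closed form; a hedged sketch (the helper name predictive_prob_head is mine):

```python
# Posterior predictive in the Beta-Bernoulli case (a sketch; the Beta
# posterior is defined on the following slides):
# Pr(next flip = head | data) = ∫ theta · Beta(theta; alpha, beta) dtheta
#                             = alpha / (alpha + beta)
def predictive_prob_head(alpha: float, beta: float) -> float:
    """Probability that the next flip is a head under a Beta(alpha, beta) posterior."""
    return alpha / (alpha + beta)

# e.g. prior Beta(1, 1) plus 3 observed heads and 1 tail -> 4/6 ≈ 0.667
print(predictive_prob_head(1 + 3, 1 + 1))
```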
Coin Example
• Consider two biased coins C_1 and C_2:
  R(C_1) = Pr(C_1 = head)
  R(C_2) = Pr(C_2 = head)
• Problem:
  – Maximize the number of heads in N flips
  – Which coin should we choose for each flip?
Bernoulli Variables
• R^{C_1} and R^{C_2} are Bernoulli variables with domain {0,1}
• Bernoulli distributions are parameterized by their mean, i.e.
  Pr(R^{C_1} = head; θ_1) = θ_1 = R(C_1)
  Pr(R^{C_2} = head; θ_2) = θ_2 = R(C_2)
Beta Distribution
• Let the prior Pr(θ) be a Beta distribution:
  Beta(θ; α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
• α − 1: # of heads
• β − 1: # of tails
• E[θ] = α / (α + β)
[Figure: Pr(θ) vs. θ for Beta(θ; 1, 1), Beta(θ; 2, 8), and Beta(θ; 20, 80)]
Belief Update
• Prior: Pr(θ) = Beta(θ; α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
• Posterior after a coin flip:
  Pr(θ | head) ∝ Pr(θ) Pr(head | θ)
               ∝ θ^(α−1) (1 − θ)^(β−1) · θ
               = θ^((α+1)−1) (1 − θ)^(β−1)
               ∝ Beta(θ; α + 1, β)
  Pr(θ | tail) ∝ Pr(θ) Pr(tail | θ)
               ∝ θ^(α−1) (1 − θ)^(β−1) · (1 − θ)
               = θ^(α−1) (1 − θ)^((β+1)−1)
               ∝ Beta(θ; α, β + 1)
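A minimal sketch of this conjugate update in code (the sequence of flips is hypothetical; a head increments α, a tail increments β):

```python
# Conjugate Beta-Bernoulli belief update, as derived above.
def update(alpha: float, beta: float, flip: int) -> tuple:
    """Return the updated Beta parameters after observing flip (1 = head, 0 = tail)."""
    return (alpha + 1, beta) if flip == 1 else (alpha, beta + 1)

alpha, beta = 1.0, 1.0           # uniform prior Beta(1, 1)
for flip in [1, 1, 0, 1]:        # a hypothetical sequence of observed flips
    alpha, beta = update(alpha, beta, flip)

print(alpha, beta)                              # Beta(4, 2): 3 heads, 1 tail
print("posterior mean:", alpha / (alpha + beta))
```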
Thompson Sampling
• Idea:
  – Sample several potential average rewards:
    R_1(a), …, R_k(a) ~ Pr(R(a) | r_1^a, …, r_n^a) for each a
  – Estimate the empirical average: R̂(a) = (1/k) Σ_{i=1}^k R_i(a)
  – Execute argmax_a R̂(a)
• Coin example:
  Pr(R(a) | r_1^a, …, r_n^a) = Beta(θ_a; α_a, β_a)
  where α_a − 1 = # heads and β_a − 1 = # tails
Thompson Sampling Algorithm (Bernoulli Rewards)

ThompsonSampling(h)
  V ← 0
  For t = 1 to h
    Sample R_1(a), …, R_k(a) ~ Pr(R(a))  ∀a
    R̂(a) ← (1/k) Σ_{i=1}^k R_i(a)  ∀a
    a* ← argmax_a R̂(a)
    Execute a* and receive r
    V ← V + r
    Update Pr(R(a*)) based on r
  Return V
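A hedged Python sketch of this pseudocode (the simulation setup, the arm means, and the default k = 1 are my assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sampling(true_means, h, k=1):
    """A sketch of the slide's pseudocode for Bernoulli bandits.

    true_means -- hidden Pr(head) of each arm (used only to simulate rewards)
    h          -- horizon (number of plays)
    k          -- number of posterior samples averaged into R_hat(a)
    """
    n_arms = len(true_means)
    alpha = np.ones(n_arms)                 # Beta(1, 1) prior for every arm
    beta = np.ones(n_arms)
    V = 0.0
    for _ in range(h):
        # Sample R_1(a)..R_k(a) ~ Beta(alpha_a, beta_a) and average them
        samples = rng.beta(alpha, beta, size=(k, n_arms))
        r_hat = samples.mean(axis=0)
        a_star = int(np.argmax(r_hat))
        r = float(rng.random() < true_means[a_star])   # simulated Bernoulli reward
        V += r
        alpha[a_star] += r                  # conjugate update of the played arm
        beta[a_star] += 1.0 - r
    return V

print(thompson_sampling([0.4, 0.6], h=1000))
```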
Comparison: Thompson Sampling vs. Greedy Strategy
• Action selection (both): a* = argmax_a R̂(a)
• Empirical mean:
  – Thompson sampling: R̂(a) = (1/k) Σ_{i=1}^k R_i(a)
  – Greedy strategy: R̂(a) = (1/n) Σ_{i=1}^n r_i^a
• Samples:
  – Thompson sampling: R_1(a), …, R_k(a) ~ Pr(R(a) | r_1^a, …, r_n^a) → some exploration
  – Greedy strategy: r_1^a, …, r_n^a ~ Pr(R^a; θ) → no exploration
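To see the difference empirically, a small head-to-head simulation under assumed arm means (my own illustrative setup, not from the slides): greedy ranks arms by the empirical mean of observed rewards, Thompson sampling by a draw from the Beta posterior, so only the latter keeps exploring.

```python
import numpy as np

rng = np.random.default_rng(1)

def run(select, true_means=(0.4, 0.6), h=2000):
    """Play h rounds, choosing arms with the given selection rule."""
    heads = np.zeros(len(true_means))
    pulls = np.zeros(len(true_means))
    total = 0.0
    for _ in range(h):
        a = select(heads, pulls)
        r = float(rng.random() < true_means[a])
        heads[a] += r
        pulls[a] += 1
        total += r
    return total

# Greedy: empirical mean (0.5 for unpulled arms); Thompson: posterior draw.
greedy = lambda h, p: int(np.argmax(np.where(p > 0, h / np.maximum(p, 1), 0.5)))
thompson = lambda h, p: int(np.argmax(rng.beta(1 + h, 1 + p - h)))

print("greedy  :", run(greedy))
print("thompson:", run(thompson))
```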
Sample Size
• In Thompson sampling, the amount of data n and the sample size k regulate the amount of exploration
• As n and k increase, R̂(a) becomes less stochastic, which reduces exploration:
  – As n → ∞, Pr(R(a) | r_1^a, …, r_n^a) becomes more peaked
  – As k → ∞, R̂(a) approaches E[R(a) | r_1^a, …, r_n^a]
• The stochasticity of R̂(a) ensures that all actions are chosen with some probability
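A quick way to see the effect of k (the Beta(5, 5) posterior here is an assumed example): the spread of R̂(a) shrinks as k grows, so the arm choice becomes less exploratory.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, beta = 5, 5                      # some fixed Beta posterior (assumed)
for k in [1, 10, 100]:
    # Distribution of R_hat(a) = mean of k posterior draws
    r_hats = rng.beta(alpha, beta, size=(10000, k)).mean(axis=1)
    print(f"k = {k:>3}: std of R_hat = {r_hats.std():.3f}")
```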
Continuous Rewards
• So far we assumed that r ∈ {0,1}
• What about continuous rewards, i.e. r ∈ [0,1]?
  – NB: rewards in [a, b] can be remapped to [0,1] by an affine transformation without changing the problem
• Idea:
  – When we receive a reward r, sample r̃ ~ Bernoulli(r) so that r̃ ∈ {0,1}
Thompson Sampling Algorithm (Continuous Rewards)

ThompsonSampling(h)
  V ← 0
  For t = 1 to h
    Sample R_1(a), …, R_k(a) ~ Pr(R(a))  ∀a
    R̂(a) ← (1/k) Σ_{i=1}^k R_i(a)  ∀a
    a* ← argmax_a R̂(a)
    Execute a* and receive r
    V ← V + r
    Sample r̃ ~ Bernoulli(r)
    Update Pr(R(a*)) based on r̃
  Return V
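A sketch of the binarization step alone (assuming rewards already lie in [0, 1]); the point is that the Bernoulli pseudo-reward has the same expectation as the continuous reward, so the Beta–Bernoulli update still applies:

```python
import numpy as np

rng = np.random.default_rng(3)

# r_tilde ~ Bernoulli(r): Pr(r_tilde = 1) = r, hence E[r_tilde] = r.
def binarize(r: float) -> int:
    return int(rng.random() < r)

r = 0.73
print(np.mean([binarize(r) for _ in range(10000)]))   # ≈ 0.73
```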
Analysis
• Thompson sampling converges to the best arm
• Theory:
  – Expected cumulative regret: O(log n)
  – On par with UCB and ε-greedy
• Practice:
  – Sample size k is often set to 1
  – Used by Bing for ad placement:
    Graepel, Candela, Borchert, Herbrich (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. ICML.