New Perspectives for Multi-Armed Bandits and Their Applications - PowerPoint PPT Presentation

SLIDE 1

New Perspectives for Multi-Armed Bandits and Their Applications

Vianney Perchet

Workshop Learning & Statistics IHES, January 19 2017 CMLA, ENS Paris-Saclay

SLIDE 2

Motivations & Objectives

SLIDE 3

Classical Examples of Bandits Problems

– Size of data: n patients with some proba of getting cured
– Choose one of two treatments (red or blue) to prescribe
– Patients cured or dead

1) Inference: find the best treatment between the red and the blue
2) Cumul: save as many patients as possible

SLIDE 4

Classical Examples of Bandits Problems

– Size of data: n banners with some proba of click
– Choose one of two ads (red or blue) to display
– Banner clicked or ignored

1) Inference: find the best ad between the red and the blue
2) Cumul: get as many clicks as possible

SLIDE 5

Classical Examples of Bandits Problems

– Size of data: n auctions with some expected revenue
– Choose one of two strategies (bid / opt out) to follow
– Auction won or lost

1) Inference: find the best strategy between the red and the blue
2) Cumul: win as many profitable auctions as possible

SLIDE 6

Classical Examples of Bandits Problems

– Size of data: n mails with some proba of spam
– Choose one of two actions: spam or ham
– Mail correctly or incorrectly classified

1) Inference: find the best strategy between the red and the blue
2) Cumul: minimize the number of errors


SLIDE 8

Two-Armed Bandit

– Patients arrive and are treated sequentially.
– Save as many as possible.


SLIDE 21

A bit of theory

SLIDE 22

Stochastic Multi-Armed Bandit

SLIDE 23

K-Armed Stochastic Bandit Problems

– K actions i ∈ {1, . . . , K}; outcomes X^i_1, X^i_2, . . . ∈ ℝ, i.i.d., (sub-)Gaussian or bounded, e.g. X^i_t ∼ N(µ_i, 1)

– Non-anticipative policy: π_t(X^{π_1}_1, X^{π_2}_2, . . . , X^{π_{t−1}}_{t−1}) ∈ {1, . . . , K}

– Goal: maximize the expected reward ∑_{t=1}^T E X^{π_t}_t = ∑_{t=1}^T µ_{π_t}

– Performance: cumulative regret

  R_T = max_i ∑_{t=1}^T µ_i − ∑_{t=1}^T µ_{π_t} = ∑_i ∆_i ∑_{t=1}^T 1{π_t = i ≠ ⋆},

  with ∆_i = µ⋆ − µ_i, the “gap” or cost of error i.
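The protocol above is easy to simulate. Here is a minimal sketch (illustrative code, not from the talk), with unit-variance Gaussian rewards as on the slide; the pseudo-regret accumulates ∆ at every suboptimal pull:

```python
import random

def regret(mus, policy, T, seed=0):
    """Simulate a K-armed Gaussian bandit; return the cumulative pseudo-regret.

    mus    : list of arm means mu_i
    policy : function (history of (arm, reward) pairs, round t) -> arm index
    """
    rng = random.Random(seed)
    best = max(mus)
    history, reg = [], 0.0
    for t in range(T):
        arm = policy(history, t)
        reward = rng.gauss(mus[arm], 1.0)   # X_t ~ N(mu_arm, 1), as on the slide
        history.append((arm, reward))
        reg += best - mus[arm]              # Delta of the chosen arm accumulates
    return reg

# A policy that ignores the data and always pulls arm 0 suffers linear regret:
print(regret([0.3, 0.7], lambda h, t: 0, 1000))  # ~ 0.4 per round, so ~ 400
```

Any non-anticipative policy fits this interface, since it only sees the past (arm, reward) pairs.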

SLIDE 24

Most Famous Algorithm [Auer, Cesa-Bianchi, Fischer, ’02]

  • UCB - “Upper Confidence Bound”

  π_{t+1} = arg max_i { X̄^i_t + √(2 log(t) / T_i(t)) },

  where T_i(t) = ∑_{s=1}^t 1{π_s = i} and X̄^i_t = (1/T_i(t)) ∑_{s ≤ t : π_s = i} X^i_s.

  Regret: E R_T ≲ ∑_k log(T)/∆_k

  Worst case: E R_T ≲ sup_∆ { K log(T)/∆ ∧ T∆ } ≂ √(K T log(T))
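A minimal sketch of the UCB rule above (illustrative code, not the authors' implementation): pull each arm once, then always pull the arm with the highest empirical mean plus confidence bonus.

```python
import math
import random

def ucb(mus, T, seed=0):
    """UCB: at each round pull argmax_i  mean_i + sqrt(2 log(t) / T_i(t)).
    Returns the number of pulls of each arm."""
    rng = random.Random(seed)
    K = len(mus)
    counts = [0] * K     # T_i(t)
    sums = [0.0] * K     # running reward sums, so sums[i]/counts[i] is the empirical mean
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # initialization: pull each arm once
        else:
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        x = rng.gauss(mus[arm], 1.0)
        counts[arm] += 1
        sums[arm] += x
    return counts

counts = ucb([0.0, 0.5], 5000)
print(counts)  # the suboptimal arm is pulled only O(log(T)/Delta^2) times
```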

SLIDE 25

Ideas of proof   π_{t+1} = arg max_i { X̄^i_t + √(2 log(t) / T_i(t)) }

  • 2-line proof:

  π_{t+1} = i ≠ ⋆ ⟺ X̄^⋆_t + √(2 log(t) / T_⋆(t)) ≤ X̄^i_t + √(2 log(t) / T_i(t))

  “⟹” ∆_i ≤ √(2 log(t) / T_i(t)) ⟹ T_i(t) ≲ log(t)/∆_i²

  • The number of mistakes grows as log(t)/∆_i²; each mistake costs ∆_i.

  Regret at stage T ≲ ∑_i log(T)/∆_i² × ∆_i ≂ ∑_i log(T)/∆_i

  • “⟹” actually happens with overwhelming proba
  • “Optimal”: no algo can always have a regret smaller than ∑_i log(T)/∆_i

SLIDE 26

Other Algos

  • Other algo, ETC [Perchet, Rigollet]: pulls in round robin, then eliminates.
    R_T ≲ ∑_k log(T∆_k)/∆_k, worst case R_T ≤ √(T K log(K))
  • Other algo, MOSS [Audibert, Bubeck], a variant of UCB:
    R_T ≲ K log(T∆_min/K)/∆_min, worst case R_T ≤ √(TK)
  • Infinite number of actions x ∈ [0, 1]^d with ∆(x) 1-Lipschitz:
    discretize + UCB gives R_T ≲ Tε + √(T/ε) ≤ T^{2/3} (choosing ε ≂ T^{−1/3})
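The round-robin-then-eliminate idea behind ETC can be sketched as follows (a simplified successive-elimination variant; the confidence widths and constants are illustrative, not those of [Perchet, Rigollet]):

```python
import math
import random

def successive_elimination(mus, T, seed=0):
    """Round-robin sampling with elimination: drop an arm as soon as its
    empirical mean falls below the leader's by more than the confidence width.
    Returns the cumulative pseudo-regret."""
    rng = random.Random(seed)
    K = len(mus)
    best = max(mus)
    active = list(range(K))
    sums, counts = [0.0] * K, [0] * K
    reg, t = 0.0, 0
    while t < T:
        for i in list(active):          # one round-robin pass over the active arms
            if t >= T:
                break
            sums[i] += rng.gauss(mus[i], 1.0)
            counts[i] += 1
            reg += best - mus[i]
            t += 1
        if len(active) > 1:             # eliminate arms that are provably worse
            n = min(counts[i] for i in active)
            width = 2 * math.sqrt(2 * math.log(T) / n)
            lead = max(sums[i] / counts[i] for i in active)
            active = [i for i in active if lead - sums[i] / counts[i] <= width]
    return reg

print(successive_elimination([0.0, 0.5], 10000))  # far below the 5000 of always pulling arm 0
```

Once a single arm survives, the loop simply commits to it, which is where the log(T∆)/∆ behaviour comes from.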

SLIDE 27

Very interesting… useful? No… Here is a list of reasons.

SLIDE 28

On the basic assumptions

  • 1. Stochastic: data are not i.i.d., patients are different
    ill-posedness, feature selection / model selection
  • 2. Different timing: several actions for one reward
    POMDPs, learning to trade off bias/variance
  • 3. Delays: rewards are not received instantaneously
    grouping, evaluations
  • 4. Combinatorial: several decisions at each stage
    combinatorial optimization, cascading
  • 5. Non-linearity: concave gains, diminishing returns, etc.

SLIDE 29

Investigating them (past / present / future)

SLIDE 30

Patients are different

  • We assumed (implicitly?) that all patients/users are identical
  • Treatment efficiency (proba of clicks) depends on age, gender…
  • These covariates or contexts are observed/known before taking the decision of the blue/red pill
  • The decision (and the regret…) should ultimately depend on them

SLIDE 31

General Model of Contextual Bandits

  • Covariates: ω_t ∈ Ω = [0, 1]^d, i.i.d., law µ, equivalent to λ
    The cookies of a user, the medical history, etc.
  • Decisions: π_t ∈ {1, . . . , K}
    The decision can (should) depend on the context ω_t
  • Rewards: X^k_t ∈ [0, 1] ∼ ν_k(ω_t), with E[X^k | ω] = µ_k(ω)
    The expected reward of action k depends on the context ω
  • Objectives: find the best decision given the request.
    Minimize the regret R_T := ∑_{t=1}^T µ_{π⋆(ω_t)}(ω_t) − µ_{π_t}(ω_t)
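The contextual protocol can be sketched as follows (illustrative code; the means µ_k below are an arbitrary choice, not from the talk). The key point is that the regret compares, at each context, to the best arm AT that context:

```python
import random

def contextual_bandit(policy, mu, K, T, seed=0):
    """Protocol: observe a context w_t ~ U([0,1]), pick an arm, and compare
    the chosen mean mu(arm, w) to the best mean mu(star, w) at that context.
    Returns the cumulative regret."""
    rng = random.Random(seed)
    reg = 0.0
    for t in range(T):
        w = [rng.random()]                              # context (d = 1 here)
        arm = policy(w, t)
        star = max(range(K), key=lambda k: mu(k, w))    # pi*(w_t)
        reg += mu(star, w) - mu(arm, w)
    return reg

# Two arms whose ranking flips with the context (an illustrative choice):
mu = lambda k, w: w[0] if k == 0 else 1 - w[0]
print(contextual_bandit(lambda w, t: 0, mu, K=2, T=1000))                        # ignores the context
print(contextual_bandit(lambda w, t: 0 if w[0] > 0.5 else 1, mu, K=2, T=1000))   # uses it: 0.0
```

Even with full knowledge of both arms, a policy that ignores the context pays a linear price here, which is why the decision (and the regret) must depend on ω.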


SLIDE 33

Regularity assumptions

  • 1. Smoothness of the pb: every µ_k is β-Hölder, with β ∈ (0, 1]:
    ∃ L > 0, ∀ ω, ω′ ∈ Ω, ∥µ(ω) − µ(ω′)∥ ≤ L ∥ω − ω′∥^β
  • 2. Complexity of the pb (α-margin condition): ∃ C₀ > 0,
    P_ω[ 0 < µ⋆(ω) − µ♯(ω) < δ ] ≤ C₀ δ^α,
    where µ⋆(ω) = max_k µ_k(ω) is the maximal µ_k and µ♯(ω) = max{ µ_k(ω) s.t. µ_k(ω) < µ⋆(ω) } is the second max.
    With K > 2: µ⋆ is β-Hölder but µ♯ is not continuous.

SLIDE 34

Regularity: an easy example (α big)

[Figure, built up over several animation slides: curves µ1(ω), µ2(ω), µ3(ω), then µ⋆(ω) and µ♯(ω); the gap µ⋆(ω) − µ♯(ω) is rarely small]

SLIDE 40

Regularity: a hard example (α small)

[Figure, built up over several animation slides: curves µ1(ω), µ2(ω), µ3(ω), then µ⋆(ω) and µ♯(ω); the gap µ⋆(ω) − µ♯(ω) is small over a large region]

SLIDE 46

Binned policy

[Figure, built up over several animation slides: the context space is partitioned into bins, and the means µ1(ω), µ2(ω), µ3(ω) are compared bin by bin]

SLIDE 49

Binned Successive Elimination (BSE)

Theorem [P. and Rigollet (’13)]. If α < 1,

  E[R_T(BSE)] ≲ T ( K log(K) / T )^{β(1+α)/(2β+d)},  with bin side ( K log(K) / T )^{1/(2β+d)}.

For K = 2, this matches the lower bound: minimax optimal w.r.t. T.

  • Same bound as with full monitoring [Audibert and Tsybakov, ’07]
  • No log(T): the difficulty of nonparametric estimation washes away the effects of exploration/exploitation
  • α < 1: cannot attain fast rates for easy problems
  • Adaptive partitioning!
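The binned idea can be sketched by running an independent bandit instance inside each bin (here UCB instead of successive elimination, purely for brevity; all names and the means µ_k are illustrative):

```python
import math
import random

def binned_bandit(mu, K, T, n_bins, seed=0):
    """Binned policy sketch: partition [0, 1] into n_bins intervals and run an
    independent bandit instance (UCB here) inside each bin.
    Returns the cumulative regret against the best arm at each context."""
    rng = random.Random(seed)
    counts = [[0] * K for _ in range(n_bins)]
    sums = [[0.0] * K for _ in range(n_bins)]
    reg = 0.0
    for t in range(1, T + 1):
        w = rng.random()
        b = min(int(w * n_bins), n_bins - 1)       # context -> bin
        if 0 in counts[b]:
            arm = counts[b].index(0)               # init: each arm once per bin
        else:
            arm = max(range(K),
                      key=lambda i: sums[b][i] / counts[b][i]
                      + math.sqrt(2 * math.log(t) / counts[b][i]))
        counts[b][arm] += 1
        sums[b][arm] += rng.gauss(mu(arm, w), 1.0)
        reg += max(mu(k, w) for k in range(K)) - mu(arm, w)
    return reg

mu = lambda k, w: w if k == 0 else 1 - w   # illustrative smooth (hence Hölder) means
print(binned_bandit(mu, K=2, T=20000, n_bins=10))
```

Smoothness is what makes this work: within a bin the means are nearly constant, so the per-bin bandit faces an almost standard problem; the bin side balances this approximation error against the per-bin exploration cost.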

SLIDE 50

Suboptimality of (BSE) for α ≥ 1

[Figure, over two animation slides: curves µ1(ω), µ2(ω), µ3(ω), illustrating why a fixed bin width is too coarse for easy (α ≥ 1) problems]

SLIDE 52

Adaptive BSE (ABSE)

Theorem [P. and Rigollet (’13)]. For all α,

  E[R_T(ABSE)] ≲ T ( K log(K) / T )^{β(1+α)/(2β+d)}.

For K = 2, this matches the lower bound: minimax optimal w.r.t. T.

  • Same bound as (BSE), now also valid for easy problems (α ≥ 1).

SLIDE 53

This is not the solution

  • 1. Dimension-dependent bound: T^{1 − β/(2β+d)}
    d = +∞ and β = 0: lots of contexts, no regularity
    Online model selection? Ill-posed pb: µ(·) not β-Hölder
    Estimation/approximation errors: Performance = Approx error + Regret(β, d, T)
  • 2. Non-stationarity of arms: values are not i.i.d., they evolve with time
    Ex.: ads for movies
    Cumulative objectives are clearly not the solution. Discount? How, why, at which speeds?
  • 3. Non-stationarity of the set of arms:
    Arms arrive and disappear. How to incorporate a new arm? With which index?

SLIDE 54

This was really not the solution

  • 1. Non-stationarity of the set of arms:
    Arms arrive and disappear. How to incorporate a new arm? With which index?
  • 2. Contexts (covariates) are not in ℝ^d:
    Rather descriptions, texts, ids, images… How to embed them? The training set is influenced by the algorithms…

SLIDE 55

Different Timing

SLIDE 56

Example of Repeated Auctions

Ad slot sold by lemonde.fr through 2nd-price auctions:

  • Several (marketing) companies place bids
  • The highest bid wins (say, criteo) and pays the 2nd-highest bid to lemonde
  • criteo chooses the ad of a client, fnac or singapore airlines
  • criteo is paid by the client if the user clicks on the ad

Main problem: repeated auctions with unknown private valuation. Learn the valuations, find which ad to display & good strategies.
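The second-price rule above, as a minimal sketch (the bids are illustrative numbers): the highest bidder wins but pays the runner-up's bid.

```python
def second_price(bids):
    """Second-price auction: the highest bidder wins but pays the 2nd-highest bid."""
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    winner, runner_up = order[0], order[1]
    return winner, bids[runner_up]

# Bidder 2 wins with a bid of 1.5 but pays the runner-up bid, 1.2:
print(second_price([0.8, 1.2, 1.5]))  # (2, 1.2)
```

With a known private valuation, bidding it truthfully is optimal in a single such auction; the learning problem arises precisely because the valuation is unknown and the auctions repeat.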

SLIDE 57

Repeated auctions

  • 1. Can be modeled as a bandit pb with extra structure
  • 2. Actually, Criteo (Google, Facebook) is paid only if the user buys something after the click
    Several “costly” auctions are needed to seal a deal
    Auctions that are lost can also help seal a deal (a competitor displays the ad for free)
    Optimal strategy in repeated auctions: learn it? (POMDP?)
    Reward timing is per user; decision timing is per opportunity

SLIDE 58

Other examples - repeated A/B tests

  • Companies test new technologies (algo, hardware, etc.) before putting them in production: sequences of A/B tests
    Timing of decisions: each day, continue, stop or validate the current A/B test
    Timing of rewards: total improvement of the implemented technos
  • The longer A/B tests are, the higher the confidence (reduced variance), but the less time remains for implementation
    Online tradeoff between risks and performance

SLIDE 59

Delays

SLIDE 60

Rewards are not observed immediately

  • Clinical trials: have to wait 6 months to see results
    A trial lasts 3 years: 6 phases; the regret is still √T
  • Marketing (ad displays): we only see whether users buy
    No feedback means either no sale (forever) or no sale yet
    Build estimators with censored/missing data
    Feasible with i.i.d. data… but they are not!

SLIDE 61

Combinatorial Structure

SLIDE 62

Large Decision Spaces

  • Choose to display not 1 ad, but 4, 6, 10…
  • Paid if a sale follows a click (even if unrelated)
    Lots of correlations (between products, positions, colors/style of banner, time, etc.)
    Some products are seen, others are not (carousels…)
  • Too many possibilities of (almost) equal performance
    Compete with the best: R_T ≤ √(KT); but at least with the top 5%: R_T ≤ √(log(K) T / 5%)??

SLIDE 63

Bandit theory is quite neat. To be “applied”, or relevant, it needs LOTS of work. Anybody is welcome to join & collaborate!

Model selection, feature extraction, missing data, censored data, combinatorial optimization, new estimator techniques…