The Multi-Armed Bandit Problem, Nicolò Cesa-Bianchi, Università degli Studi di Milano



SLIDE 1

The Multi-Armed Bandit Problem

Nicolò Cesa-Bianchi

Università degli Studi di Milano

SLIDE 2

The bandit problem

[Robbins, 1952]

K slot machines. The rewards X_{i,1}, X_{i,2}, . . . of machine i are i.i.d. [0,1]-valued random variables. An allocation policy prescribes which machine I_t to play at time t, based on the realizations X_{I_1,1}, . . . , X_{I_{t−1},t−1}. The goal is to play as often as possible the machine with the largest reward expectation

    µ∗ = max_{i=1,...,K} E[X_{i,1}]
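This setting can be simulated in a few lines of Python. A minimal sketch, assuming Bernoulli-distributed rewards with hypothetical response rates, and using uniformly random play as a naive baseline policy (neither is from the slides):

```python
# Simulate a K-armed stochastic bandit with Bernoulli rewards.
import random

random.seed(0)
mu = [0.2, 0.5, 0.7]          # hypothetical reward expectations; mu* = 0.7
K, n = len(mu), 10_000

def pull(i):
    """Draw an i.i.d. [0,1]-valued reward from machine i."""
    return 1.0 if random.random() < mu[i] else 0.0

# The simplest allocation policy: pick a machine uniformly at random.
total = sum(pull(random.randrange(K)) for _ in range(n))

mu_star = max(mu)
regret = mu_star * n - total   # mu*·n minus the policy's accumulated reward
print(f"empirical regret of uniform play: {regret:.0f}")
```

Uniform play earns the average of the means per step, so its regret grows linearly in n; the policies on the following slides do much better.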


SLIDE 3

Bandits for targeting content

Choose the best content to display to the next visitor of your website. The goal is to elicit a response from the visitor (e.g., a click on a banner).

Content options = slot machines
Response rate = reward expectation

Simplifying assumptions:

1. fixed response rates
2. no visitor profiles


SLIDE 4

Regrets, I’ve had a few

(F. Sinatra)

Definition (Regret after n plays)

    µ∗·n − Σ_{t=1}^n E[X_{I_t,t}]

Theorem (Lai and Robbins, 1985). There exist allocation policies satisfying

    µ∗·n − Σ_{t=1}^n E[X_{I_t,t}] ≤ c K ln n    uniformly over n

The constant c is roughly equal to 1/∆∗, where ∆∗ = µ∗ − max_{j : µj < µ∗} µj.


SLIDE 5

A simple policy



[Agrawal, 1995]

1. At the beginning, play each machine once.
2. At each time t > K, play the machine I_t maximizing, over i = 1, . . . , K,

    X̄_{i,t} + √((2 ln t) / T_{i,t})

where X̄_{i,t} is the average reward obtained from machine i and T_{i,t} is the number of times machine i has been played.
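A runnable sketch of this index policy, again assuming Bernoulli rewards with hypothetical means (the slides do not provide code):

```python
# Index policy: play each machine once, then maximize the optimistic index
# X̄_{i,t} + sqrt(2 ln t / T_{i,t}).
import math
import random

random.seed(1)
mu = [0.2, 0.5, 0.7]               # hypothetical means; machine 2 is best
K, n = len(mu), 5000

avg = [0.0] * K                    # X̄_{i,t}: average reward of machine i
cnt = [0] * K                      # T_{i,t}: number of plays of machine i

def pull(i):
    return 1.0 if random.random() < mu[i] else 0.0

for t in range(1, n + 1):
    if t <= K:                     # play each machine once at the start
        i = t - 1
    else:                          # maximize the index over i = 1, ..., K
        i = max(range(K),
                key=lambda j: avg[j] + math.sqrt(2 * math.log(t) / cnt[j]))
    x = pull(i)
    cnt[i] += 1
    avg[i] += (x - avg[i]) / cnt[i]   # incremental mean update

print("plays per machine:", cnt)
```

The optimistic bonus shrinks as a machine is played more, so suboptimal machines are abandoned once their confidence intervals separate from the best one.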


SLIDE 6

A finite-time regret bound

Theorem (Auer, C-B, and Fischer, 2002). At any time n, the regret of the UCB policy is at most (8K/∆∗) ln n + 5K.


SLIDE 7

Upper confidence bounds

√((2 ln t)/T_{i,t}) is the size (using Chernoff-Hoeffding bounds) of the one-sided confidence interval for the average reward within which µi falls with probability at least 1 − 1/t.
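This coverage claim is easy to check empirically. In the sketch below, the true mean, sample size, and time index are arbitrary illustrative choices:

```python
# Empirically check that the one-sided interval X̄ + sqrt(2 ln t / T)
# contains the true mean mu with probability at least 1 - 1/t.
import math
import random

random.seed(2)
mu, t, T, trials = 0.6, 100, 200, 2000   # illustrative parameters

covered = 0
for _ in range(trials):
    xbar = sum(1.0 if random.random() < mu else 0.0 for _ in range(T)) / T
    if mu <= xbar + math.sqrt(2 * math.log(t) / T):
        covered += 1

coverage = covered / trials
print(f"coverage: {coverage:.4f}  (target >= {1 - 1/t:.2f})")
```

Hoeffding-type bounds are conservative, so the empirical coverage typically far exceeds the 1 − 1/t target.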


SLIDE 8

The epsilon-greedy policy

Input parameter: a schedule ε1, ε2, . . . where 0 ≤ εt ≤ 1. At each time t:

1. with probability 1 − εt, play the machine with the highest average reward
2. with probability εt, play a machine chosen uniformly at random

Is there a schedule of εt guaranteeing logarithmic regret?
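The two steps above can be sketched as follows, using a hypothetical constant schedule εt = 0.1 and the same illustrative Bernoulli machines as before (none of this is from the slides):

```python
# ε-greedy: exploit the best-looking machine, explore at random with prob. εt.
import random

random.seed(3)
mu = [0.2, 0.5, 0.7]               # hypothetical means
K, n, eps = len(mu), 5000, 0.1     # constant schedule εt = 0.1

avg = [0.0] * K
cnt = [0] * K

def pull(i):
    return 1.0 if random.random() < mu[i] else 0.0

for t in range(1, n + 1):
    if t <= K or random.random() < eps:    # explore: uniformly random machine
        i = random.randrange(K)
    else:                                  # exploit: highest average reward
        i = max(range(K), key=lambda j: avg[j])
    x = pull(i)
    cnt[i] += 1
    avg[i] += (x - avg[i]) / cnt[i]

print("plays per machine:", cnt)
```

With a constant εt the policy keeps exploring forever and hence suffers linear regret, which is exactly why the schedule question at the end of the slide matters.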


SLIDE 9

The tuned epsilon-greedy policy

Theorem (Auer, C-B, and Fischer, 2002). If εt = 12/(d²t), where d satisfies 0 < d ≤ ∆∗, then the instantaneous regret of tuned ε-greedy at any time n is at most O(K/(dn)).
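The schedule from the theorem can be written down directly. In the sketch below, the value of d is a hypothetical lower bound on the gap ∆∗, and the cap at 1 (needed so that εt is a valid probability for small t) is an implementation detail, not from the slides:

```python
# Tuned ε-greedy schedule: εt = 12 / (d² t), capped at 1.
d = 0.2                               # hypothetical: must satisfy 0 < d <= Δ*

def epsilon(t):
    """Exploration probability at time t for the tuned schedule."""
    return min(1.0, 12.0 / (d * d * t))

print([round(epsilon(t), 3) for t in (1, 10, 100, 1000, 10000)])
```

Since εt decays like 1/t, the total amount of exploration up to time n grows only like ln n, which is what makes logarithmic regret possible.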


SLIDE 10

Practical performance

The UCB policy in practice: the exploration term √((2 ln t)/T_{i,t}) is replaced by

    √((ln t / T_{i,t}) · min{1/4, V_{i,t}})

where V_{i,t} is an upper confidence bound for the variance of machine i.
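A sketch of this variance-aware term. The exact form of the variance bound V_{i,t} below follows the common UCB1-TUNED formula and is an assumption, since the slides do not spell it out:

```python
# Variance-aware exploration term: sqrt((ln t / T) * min(1/4, V)),
# where V is an upper confidence bound on the variance of the machine.
import math

def bonus(t, T, mean_sq, mean):
    """Exploration bonus given t plays overall, T plays of this machine,
    and the machine's empirical mean and mean of squared rewards."""
    # UCB1-TUNED-style variance bound (assumed form): empirical variance
    # plus a confidence term.
    V = mean_sq - mean ** 2 + math.sqrt(2 * math.log(t) / T)
    return math.sqrt((math.log(t) / T) * min(0.25, V))

# When the empirical variance is small, the bonus drops below the
# distribution-free term sqrt(2 ln t / T):
t, T = 1000, 100
print(bonus(t, T, mean_sq=0.26, mean=0.5), math.sqrt(2 * math.log(t) / T))
```

The min with 1/4 matters because 1/4 is the largest possible variance of a [0,1]-valued random variable, so the bonus never exceeds the distribution-free one.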


SLIDE 11

Practical performance

Optimally tuned ε-greedy performs best almost always, unless there are several nonoptimal machines with wildly different response rates. The performance of ε-greedy is quite sensitive to bad tuning. UCB performs comparably to a well-tuned ε-greedy and is not very sensitive to large differences in the response rates.


SLIDE 12

The nonstochastic bandit problem

[Auer, C-B, Freund, and Schapire, 2002]

What if probability is removed altogether? Nonstochastic bandits: bounded real rewards x_{i,1}, x_{i,2}, . . . are deterministically assigned to each machine i. There are analogies with the repeated play of an unknown game

[Baños, 1968; Megiddo, 1980]

Allocation policies are allowed to randomize.


SLIDE 13

Example reward table x_{i,t} (one row per machine, one column per time step):

0 1 0 0 7 9 9 8 9 0 0 1
5 7 9 6 0 0 2 2 0 0 0 1
0 2 0 1 0 1 0 0 8 9 8 7

Definition (Regret)

    max_{i=1,...,K} Σ_{t=1}^n x_{i,t} − E[Σ_{t=1}^n x_{I_t,t}]
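A randomized policy for this setting can be sketched in the spirit of the exponential-weights policy Exp3 from [Auer, C-B, Freund, and Schapire, 2002]; the reward table and the mixing parameter gamma below are illustrative assumptions, not the paper's tuning:

```python
# Exp3-style policy for nonstochastic bandits: sample from exponential
# weights mixed with uniform exploration; update the played machine's
# weight using an importance-weighted reward estimate.
import math
import random

random.seed(4)
rewards = [  # x_{i,t} in [0,1]: rows = machines, columns = time steps
    [0.1, 0.9, 0.8, 0.9, 0.1, 0.9, 0.8, 0.9],
    [0.5, 0.1, 0.2, 0.1, 0.5, 0.1, 0.2, 0.1],
]
K, n = len(rewards), len(rewards[0])
gamma = 0.3                       # illustrative exploration/mixing parameter
w = [1.0] * K

total = 0.0
for t in range(n):
    p = [(1 - gamma) * w[i] / sum(w) + gamma / K for i in range(K)]
    i = random.choices(range(K), weights=p)[0]
    x = rewards[i][t]
    total += x
    w[i] *= math.exp(gamma * (x / p[i]) / K)   # importance-weighted update

best = max(sum(row) for row in rewards)
print(f"best machine: {best:.1f}, policy: {total:.1f}, regret: {best - total:.1f}")
```

The importance weighting x/p[i] keeps the reward estimates unbiased even though only the played machine's reward is observed, which is the key difficulty of the bandit (as opposed to full-information) setting.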


SLIDE 14

Competing against arbitrary policies

Example reward table x_{i,t} (one row per machine, one column per time step); an arbitrary comparator policy may switch machines over time:

0 1 0 0 7 9 9 8 9 0 0 1
5 7 9 6 0 0 2 2 0 0 0 1
0 2 0 1 0 1 0 0 8 9 8 7


SLIDE 15

Tracking regret

Regret against an arbitrary and unknown policy (j1, j2, . . . , jn):

    Σ_{t=1}^n x_{j_t,t} − E[Σ_{t=1}^n x_{I_t,t}]

Theorem (Auer, C-B, Freund, and Schapire, 2002)

For all fixed S, the regret of the weight-sharing policy against any policy j = (j1, j2, . . . , jn) is at most √(S n K ln K), where S is the number of times j switches to a different machine.
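Counting a comparator's switches and evaluating the bound is straightforward. The helper and the small reward table below are illustrative, not from the slides:

```python
# Given a comparator policy j = (j1, ..., jn) and a reward table, compute
# its total reward, its number of switches S, and the sqrt(S n K ln K) bound.
import math

def tracking_bound(rewards, j):
    K, n = len(rewards), len(j)
    total = sum(rewards[j[t]][t] for t in range(n))
    S = sum(1 for t in range(1, n) if j[t] != j[t - 1])  # switch count
    return total, S, math.sqrt(S * n * K * math.log(K))

rewards = [              # hypothetical x_{i,t} values
    [0.1, 0.9, 0.8, 0.1],
    [0.5, 0.1, 0.2, 0.9],
]
total, S, bound = tracking_bound(rewards, j=[1, 0, 0, 1])
print(total, S, bound)
```

For S = 0 the comparator is a single fixed machine and the guarantee reduces to the ordinary (non-tracking) regret bound.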
