  1. The Multi-Armed Bandit Problem
  Nicolò Cesa-Bianchi, Università degli Studi di Milano

  2. The bandit problem [Robbins, 1952]
  There are $K$ slot machines. The rewards $X_{i,1}, X_{i,2}, \dots$ of machine $i$ are i.i.d. $[0,1]$-valued random variables.
  An allocation policy prescribes which machine $I_t$ to play at time $t$ based on the realizations $X_{I_1,1}, \dots, X_{I_{t-1},t-1}$.
  We want to play as often as possible the machine with the largest reward expectation $\mu^* = \max_{i=1,\dots,K} \mathbb{E}[X_{i,1}]$.
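
  To make the setting concrete, here is a minimal simulation sketch, assuming Bernoulli rewards; the names (BernoulliBandit, pull) are illustrative, not from the slides.

```python
import random

class BernoulliBandit:
    """K slot machines whose i.i.d. rewards are {0, 1}-valued;
    machine i pays 1 with probability means[i], so mu* = max(means)."""
    def __init__(self, means):
        self.means = means            # reward expectations mu_1, ..., mu_K
        self.K = len(means)

    def pull(self, i):
        # Draw one [0,1]-valued reward from machine i.
        return 1.0 if random.random() < self.means[i] else 0.0

bandit = BernoulliBandit([0.3, 0.5, 0.7])   # mu* = 0.7
```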

  3. Bandits for targeting content
  Choose the best content to display to the next visitor of your website. The goal is to elicit a response from the visitor (e.g., a click on a banner).
  Content options = slot machines; response rate = reward expectation.
  Simplifying assumptions: (1) fixed response rates; (2) no visitor profiles.

  4. Regrets, I’ve had a few (F. Sinatra)
  Definition (regret after $n$ plays): $\mu^* n - \mathbb{E}\sum_{t=1}^{n} X_{I_t,t}$.
  Theorem (Lai and Robbins, 1985): there exist allocation policies satisfying $\mu^* n - \mathbb{E}\sum_{t=1}^{n} X_{I_t,t} \le c\,K \ln n$ uniformly over $n$.
  The constant $c$ is roughly $1/\Delta^*$, where $\Delta^* = \mu^* - \max_{j:\mu_j<\mu^*} \mu_j$ is the gap between the best and second-best reward expectations.

  5. A simple policy: UCB1 [Agrawal, 1995]
  At the beginning, play each machine once. At each time $t > K$, play the machine $I_t$ maximizing $\overline{X}_{i,t} + \sqrt{\frac{2\ln t}{T_{i,t}}}$ over $i = 1, \dots, K$,
  where $\overline{X}_{i,t}$ is the average reward obtained from machine $i$ and $T_{i,t}$ is the number of times machine $i$ has been played.
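
  A minimal sketch of this index policy in Python, reusing the hypothetical BernoulliBandit above; the function name ucb1 and its return values are illustrative assumptions.

```python
import math

def ucb1(bandit, n):
    """Play each machine once, then always play the machine maximizing
    average reward + sqrt(2 ln t / T_i)."""
    K = bandit.K
    counts = [0] * K       # T_i: number of times machine i was played
    sums = [0.0] * K       # cumulative reward of machine i
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1      # initialization round: machine t-1
        else:
            i = max(range(K),
                    key=lambda j: sums[j] / counts[j]
                                  + math.sqrt(2 * math.log(t) / counts[j]))
        sums[i] += bandit.pull(i)
        counts[i] += 1
    return counts, sums
```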

  6. A finite-time regret bound
  Theorem (Auer, C-B, and Fischer, 2002): at any time $n$, the regret of the UCB1 policy is at most $\frac{8K}{\Delta^*} \ln n + 5K$.

  7. Upper confidence bounds
  $\sqrt{(2\ln t)/T_{i,t}}$ is the size (using Chernoff-Hoeffding bounds) of the one-sided confidence interval for the average reward within which $\mu_i$ falls with probability at least $1 - \frac{1}{t}$.
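
  The step behind this claim, sketched here as a heuristic (the full proof must also handle the fact that $T_{i,t}$ depends on the past, via a union bound): Hoeffding's inequality for $T$ i.i.d. $[0,1]$-valued draws gives
  $$\Pr\left(\overline{X}_{i,t} + \sqrt{\tfrac{2\ln t}{T}} \le \mu_i\right) \le \exp\left(-2T \cdot \tfrac{2\ln t}{T}\right) = t^{-4} \le \tfrac{1}{t}.$$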

  8. The epsilon-greedy policy
  Input parameter: a schedule $\varepsilon_1, \varepsilon_2, \dots$ where $0 \le \varepsilon_t \le 1$. At each time $t$:
  (1) with probability $1 - \varepsilon_t$, play the machine $I_t$ with the highest average reward;
  (2) with probability $\varepsilon_t$, play a machine chosen uniformly at random.
  Is there a schedule of $\varepsilon_t$ guaranteeing logarithmic regret?
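
  A minimal sketch, assuming the BernoulliBandit above and an initialization round so that every average is defined; epsilon_greedy and schedule are illustrative names.

```python
import random

def epsilon_greedy(bandit, n, schedule):
    """With probability eps_t explore a uniformly random machine,
    otherwise exploit the machine with the highest average reward."""
    K = bandit.K
    counts = [1] * K
    sums = [bandit.pull(i) for i in range(K)]   # play each machine once
    for t in range(K + 1, n + 1):
        if random.random() < schedule(t):
            i = random.randrange(K)                                # explore
        else:
            i = max(range(K), key=lambda j: sums[j] / counts[j])   # exploit
        sums[i] += bandit.pull(i)
        counts[i] += 1
    return counts, sums
```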

  9. The tuned epsilon-greedy policy
  Theorem (Auer, C-B, and Fischer, 2002): if $\varepsilon_t = 12/(d^2 t)$, where $d$ satisfies $0 < d \le \Delta^*$, then the instantaneous regret at any time $n$ of tuned $\varepsilon$-greedy is at most $O\left(\frac{K}{d\,n}\right)$.
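
  With the sketch above, the tuned schedule drops in directly; capping at 1 (an assumption) keeps $\varepsilon_t$ a valid probability for small $t$:

```python
d = 0.2   # any value with 0 < d <= Delta*, assumed known for illustration
counts, sums = epsilon_greedy(bandit, n=10_000,
                              schedule=lambda t: min(1.0, 12 / (d * d * t)))
```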

  10. Practical performance
  The UCB1-TUNED policy: $\sqrt{\frac{2\ln t}{T_{i,t}}}$ is replaced by $\sqrt{\frac{\ln t}{T_{i,t}} \min\left\{\frac{1}{4},\, V_{i,t}\right\}}$, where $V_{i,t}$ is an upper confidence bound for the variance of machine $i$.
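
  A sketch of the modified index for one machine; taking $V_{i,t}$ to be the empirical variance plus $\sqrt{2\ln t / T_{i,t}}$ follows Auer et al. (2002), but treat the exact form here as an assumption.

```python
import math

def tuned_index(sum_r, sum_r2, count, t):
    """Variance-aware index: avg + sqrt((ln t / T) * min(1/4, V)),
    with V an upper confidence bound on the machine's reward variance."""
    avg = sum_r / count
    emp_var = sum_r2 / count - avg * avg               # empirical variance
    v = emp_var + math.sqrt(2 * math.log(t) / count)   # variance UCB
    return avg + math.sqrt((math.log(t) / count) * min(0.25, v))
```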

  11. Practical performance
  Optimally tuned $\varepsilon$-greedy almost always performs best, unless there are several nonoptimal machines with wildly different response rates.
  The performance of $\varepsilon$-greedy is quite sensitive to bad tuning.
  UCB1-TUNED performs comparably to a well-tuned $\varepsilon$-greedy and is not very sensitive to large differences in the response rates.

  12. The nonstochastic bandit problem [Auer, C-B, Freund, and Schapire, 2002]
  What if probability is removed altogether? In nonstochastic bandits, bounded real rewards $x_{i,1}, x_{i,2}, \dots$ are deterministically assigned to each machine $i$.
  There are analogies with repeated play of an unknown game [Baños, 1968; Megiddo, 1980]. Allocation policies are allowed to randomize.

  13. [Slide figure: a table of deterministic rewards $x_{i,t}$, one row per machine and one column per time step.]
  Definition (regret): $\max_{i=1,\dots,K} \sum_{t=1}^{n} x_{i,t} - \mathbb{E}\sum_{t=1}^{n} x_{I_t,t}$.

  14. Competing against arbitrary policies
  [Slide figure: the same table of deterministic rewards $x_{i,t}$.]

  15. Tracking regret
  Regret against an arbitrary and unknown policy $(j_1, j_2, \dots, j_n)$: $\sum_{t=1}^{n} x_{j_t,t} - \mathbb{E}\sum_{t=1}^{n} x_{I_t,t}$.
  Theorem (Auer, C-B, Freund, and Schapire, 2002): for every fixed $S$, the regret of the weight-sharing policy against any policy $j = (j_1, j_2, \dots, j_n)$ is at most $\sqrt{S\,nK \ln K}$, where $S$ is the number of times $j$ switches to a different machine.
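
  The slides do not spell the weight-sharing policy out; below is a minimal Exp3.S-style sketch in the spirit of the cited paper, with the exploration rate gamma, sharing rate alpha, and the fixed-share mixing step all taken as illustrative assumptions rather than the authors' exact tuning.

```python
import math
import random

def weight_sharing(rewards, gamma=0.1, alpha=0.01):
    """Exp3.S-style play on a K x n table of [0,1] rewards: sample from
    exponential weights mixed with uniform exploration, update the played
    machine with an importance-weighted reward estimate, then share a
    fraction alpha of the total weight across all machines (the sharing
    is what lets the policy track a switching comparator)."""
    K, n = len(rewards), len(rewards[0])
    w = [1.0] * K
    total = 0.0
    for t in range(n):
        W = sum(w)
        probs = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
        i = random.choices(range(K), weights=probs)[0]
        x = rewards[i][t]
        total += x
        w[i] *= math.exp(gamma * (x / probs[i]) / K)   # importance weighting
        W = sum(w)
        w = [(1 - alpha) * wi + alpha * W / K for wi in w]   # weight sharing
    return total
```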
