

  1. Linear Bandits
     Dávid Pál
     Google, New York & Department of Computing Science, University of Alberta
     dpal@google.com
     November 2, 2011
     Joint work with Yasin Abbasi-Yadkori and Csaba Szepesvári

  2. Linear Bandits
     In round t = 1, 2, ...
     ◮ Choose an action X_t from a set D_t ⊂ ℝ^d.
     ◮ Receive a reward ⟨X_t, θ*⟩ + random noise.
     ◮ The weight vector θ* is unknown but fixed.
     ◮ Goal: maximize total reward.

  3. Motivation
     ◮ Exploration & exploitation with side information
     ◮ action = arm = ad = feature vector
     ◮ reward = click

  4. Outline
     ◮ Formal model & regret
     ◮ Algorithm: the Optimism in the Face of Uncertainty principle
     ◮ Confidence sets for least squares
     ◮ Sparse models: Online-to-Confidence-Set Conversion

  5. Formal model
     Unknown but fixed weight vector θ* ∈ ℝ^d. In round t = 1, 2, ...
     ◮ Receive D_t ⊂ ℝ^d
     ◮ Choose an action X_t ∈ D_t
     ◮ Receive a reward Y_t = ⟨X_t, θ*⟩ + η_t
     The noise is conditionally R-sub-Gaussian, i.e.
         E[ exp(γ η_t) | X_{1:t}, η_{1:t−1} ] ≤ exp(γ² R² / 2)   for all γ ∈ ℝ.
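A minimal simulation of this protocol (the dimension, noise level, random decision sets, and the uniformly random placeholder policy are all assumptions for illustration, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d, R, n = 5, 0.1, 100
theta_star = rng.normal(size=d)            # unknown but fixed weight vector
theta_star /= np.linalg.norm(theta_star)

total_reward = 0.0
for t in range(n):
    D_t = rng.normal(size=(20, d))         # decision set D_t: 20 arms in R^d
    X_t = D_t[rng.integers(len(D_t))]      # placeholder policy: pick uniformly
    eta_t = rng.normal(scale=R)            # Gaussian noise is R-sub-Gaussian
    Y_t = X_t @ theta_star + eta_t         # observed reward Y_t = <X_t, theta*> + eta_t
    total_reward += Y_t
```

A bandit algorithm replaces the uniform choice of X_t with something that trades off exploration and exploitation.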

  6. Sub-Gaussianity
     Definition: A random variable Z is R-sub-Gaussian for some R ≥ 0 if
         E[ exp(γ Z) ] ≤ exp(γ² R² / 2)   for all γ ∈ ℝ.
     The condition implies that
     ◮ E[Z] = 0
     ◮ Var[Z] ≤ R²
     Examples:
     ◮ Zero-mean, bounded in an interval of length 2R (Hoeffding-Azuma)
     ◮ Zero-mean Gaussian with variance ≤ R²
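A quick numerical check of the first example: for Z uniform on [−R, R] (an interval of length 2R), the moment generating function is E[exp(γZ)] = sinh(γR)/(γR), which indeed stays below the sub-Gaussian bound:

```python
import numpy as np

# For Z ~ Uniform[-R, R]: E[exp(gamma * Z)] = sinh(gamma * R) / (gamma * R).
# The sub-Gaussian bound exp(gamma^2 R^2 / 2) should dominate for every gamma.
R = 1.0
for gamma in [0.25, 0.5, 1.0, 2.0, 4.0]:
    mgf = np.sinh(gamma * R) / (gamma * R)
    bound = np.exp(gamma**2 * R**2 / 2)
    assert mgf <= bound
```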

  7. Regret
     ◮ If we knew θ*, then in round t we would choose the action
           X*_t = argmax_{x ∈ D_t} ⟨x, θ*⟩
     ◮ Regret is our reward in n rounds relative to X*_t:
           Regret_n = Σ_{t=1}^n ⟨X*_t, θ*⟩ − Σ_{t=1}^n ⟨X_t, θ*⟩
     ◮ We want Regret_n / n → 0 as n → ∞
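The definition translates directly into an evaluation helper (a sketch; the function name is made up, and regret is computable only by an evaluator who knows θ*):

```python
import numpy as np

def regret(decision_sets, chosen_actions, theta_star):
    """Regret_n = sum_t ( <X*_t, theta*> - <X_t, theta*> ), where
    X*_t = argmax_{x in D_t} <x, theta*> is the best action in round t."""
    return sum(float((D_t @ theta_star).max() - X_t @ theta_star)
               for D_t, X_t in zip(decision_sets, chosen_actions))

# With arms e1 and e2, theta* = e1, and the learner picking e2,
# one round of play incurs regret 1.
D = [np.array([[1.0, 0.0], [0.0, 1.0]])]
picked = [np.array([0.0, 1.0])]
print(regret(D, picked, np.array([1.0, 0.0])))  # -> 1.0
```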

  8. Optimism in the Face of Uncertainty Principle
     ◮ Maintain a confidence set C_t ⊆ ℝ^d such that θ* ∈ C_t with high probability.
     ◮ In round t, choose
           (X_t, θ̃_t) = argmax_{(x, θ) ∈ D_t × C_{t−1}} ⟨x, θ⟩
     ◮ θ̃_t is an "optimistic" estimate of θ*.
     ◮ The UCB algorithm is a special case.
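For an ellipsoidal confidence set the joint maximization splits nicely: the inner maximum over θ ∈ C has the closed form ⟨x, θ̂⟩ + β ‖x‖_{V⁻¹}, so only a maximum over arms remains. A sketch (function name and argument shapes are assumptions):

```python
import numpy as np

def ofu_action(D_t, theta_hat, V, beta):
    """Choose argmax over (x, theta) in D_t x C for the ellipsoid
    C = { theta : ||theta - theta_hat||_V <= beta }.  For fixed x the
    inner maximum over theta equals <x, theta_hat> + beta * ||x||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("ij,jk,ik->i", D_t, V_inv, D_t))  # ||x||_{V^{-1}} per arm
    ucb = D_t @ theta_hat + beta * widths
    return D_t[int(np.argmax(ucb))]
```

With β = 0 this is pure exploitation of θ̂; larger β adds an exploration bonus to directions the data has not yet pinned down.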

  9. Least Squares
     ◮ Data (X_1, Y_1), ..., (X_n, Y_n) such that Y_t ≈ ⟨X_t, θ*⟩
     ◮ Stack them into matrices: X_{1:n} is n × d and Y_{1:n} is n × 1
     ◮ Regularized least-squares estimate:
           θ̂_n = (X_{1:n}ᵀ X_{1:n} + λI)⁻¹ X_{1:n}ᵀ Y_{1:n}
     ◮ Let V_n = X_{1:n}ᵀ X_{1:n} + λI

     Theorem: If ‖θ*‖_2 ≤ S, then with probability at least 1 − δ, for all t, θ* lies in
         C_t = { θ : ‖θ̂_t − θ‖_{V_t} ≤ R √( 2 ln( det(V_t)^{1/2} / (δ det(λI)^{1/2}) ) ) + √λ S },
     where ‖v‖_A = √(vᵀ A v) is the matrix A-norm.
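The estimate and the theorem's radius can be transcribed directly (a sketch; function names are made up, and `slogdet` is used for numerical stability of the determinant ratio):

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """theta_hat_n = (X^T X + lam I)^{-1} X^T Y  and  V_n = X^T X + lam I,
    for stacked data X of shape (n, d) and Y of shape (n,)."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(V, X.T @ Y), V

def confidence_radius(V, lam, R, S, delta):
    """R * sqrt(2 ln( det(V)^{1/2} / (delta det(lam I)^{1/2}) )) + sqrt(lam) * S."""
    d = V.shape[0]
    _, logdet_V = np.linalg.slogdet(V)
    log_ratio = 0.5 * logdet_V - 0.5 * d * np.log(lam) + np.log(1.0 / delta)
    return R * np.sqrt(2.0 * log_ratio) + np.sqrt(lam) * S
```

On noiseless data the estimate recovers θ* up to the regularization bias, which vanishes as λ → 0.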

  10. Confidence Set C_t
      (Figure: the confidence ellipsoid C_t, with θ̂_t at its center, θ* inside, and θ̃_{t+1} on its boundary.)
      ◮ The least-squares solution θ̂_t is the center of C_t.
      ◮ θ* lies somewhere in C_t w.h.p.
      ◮ The optimistic estimate θ̃_{t+1} chosen in the next round lies on the boundary of C_t.

  11. Comparison with Previous Confidence Sets
      ◮ Our bound:
            ‖θ̂_t − θ*‖_{V_t} ≤ R √( 2 ln( det(V_t)^{1/2} / (δ det(λI)^{1/2}) ) ) + √λ S
      ◮ [Dani et al. (2008)] If ‖θ*‖_2, ‖X_t‖_2 ≤ 1, then for a specific λ,
            ‖θ̂_t − θ*‖_{V_t} ≤ R max{ √( 128 d ln(t) ln(t²/δ) ), (8/3) ln(t²/δ) }
      ◮ [Rusmevichientong and Tsitsiklis (2010)] If ‖X_t‖_2 ≤ 1,
            ‖θ̂_t − θ*‖_{V_t} ≤ 2 R κ √(ln t) √( d ln t + ln(t²/δ) ) + √λ S,
        where κ² = 3 + 2 ln( (1 + λd)/λ ).
      Our bound does not depend on t.

  12. Regret of the Bandit Algorithm
      Theorem ([Dani et al. (2008)]): If ‖θ*‖_2 ≤ 1 and the D_t are subsets of the unit 2-ball, then with probability at least 1 − δ,
          Regret_n ≤ O( R d √n · polylog(n, d, 1/δ) ).
      We obtain the same result with a smaller polylog(n, d, 1/δ) factor.

  13. Sparse Bandits
      What if θ* is sparse?
      ◮ Using least squares is not a good idea.
      ◮ Better to use e.g. L1-regularization.
      ◮ How do we construct confidence sets?
      Our new technique: Online-to-Confidence-Set Conversion
      ◮ Similar in spirit to online-to-batch conversion, but very different.
      ◮ We start with an online prediction algorithm.

  14. Online Prediction Algorithms
      In round t:
      ◮ Receive X_t ∈ ℝ^d
      ◮ Predict Ŷ_t ∈ ℝ
      ◮ Receive the correct label Y_t ∈ ℝ
      ◮ Suffer loss (Y_t − Ŷ_t)²
      No assumptions whatsoever on (X_1, Y_1), (X_2, Y_2), ...
      There are heaps of algorithms with this structure:
      ◮ online gradient descent [Zinkevich (2003)]
      ◮ online least squares [Azoury and Warmuth (2001), Vovk (2001)]
      ◮ exponentiated gradient [Kivinen and Warmuth (1997)]
      ◮ online LASSO (??)
      ◮ SeqSEW [Gerchinovitz (2011), Dalalyan and Tsybakov (2007)]
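As an illustrative sketch of the first listed algorithm, online gradient descent on the squared loss fits the protocol in a few lines (the step size and projection radius below are arbitrary choices, not from the talk):

```python
import numpy as np

def ogd_squared_loss(stream, d, eta=0.05, radius=1.0):
    """Online gradient descent [Zinkevich 2003] for the prediction protocol:
    predict Y_hat_t = <X_t, theta_t>, suffer (Y_t - Y_hat_t)^2, take a
    gradient step, and project back onto the ball ||theta||_2 <= radius."""
    theta = np.zeros(d)
    losses = []
    for X_t, Y_t in stream:
        Y_hat = X_t @ theta
        losses.append((Y_t - Y_hat) ** 2)
        grad = 2.0 * (Y_hat - Y_t) * X_t       # gradient of the squared loss in theta
        theta = theta - eta * grad
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm             # projection step
    return theta, losses
```

Any such predictor, not just this one, can be fed into the conversion on the next slides; only its regret bound B_n matters.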

  15. Online Prediction Algorithms, cont'd
      ◮ Regret with respect to a linear predictor θ ∈ ℝ^d:
            ρ_n(θ) = Σ_{t=1}^n (Y_t − Ŷ_t)² − Σ_{t=1}^n (Y_t − ⟨X_t, θ⟩)²
      ◮ Prediction algorithms come with "regret bounds" B_n:  ∀n, ρ_n(θ) ≤ B_n
      ◮ B_n depends on n, d, θ, and possibly X_1, X_2, ..., X_n and Y_1, Y_2, ..., Y_n
      ◮ Typically B_n = O(√n) or B_n = O(log n)

  16. Online-to-Confidence-Set Conversion
      ◮ Data (X_1, Y_1), ..., (X_n, Y_n), where Y_t = ⟨X_t, θ*⟩ + η_t and η_t is conditionally R-sub-Gaussian.
      ◮ Predictions Ŷ_1, Ŷ_2, ..., Ŷ_n
      ◮ Regret bound ρ_n(θ*) ≤ B_n
      Theorem (Conversion): With probability at least 1 − δ, for all n, θ* lies in
          C_n = { θ ∈ ℝ^d : Σ_{t=1}^n (Ŷ_t − ⟨X_t, θ⟩)² ≤ 1 + 2 B_n + 32 R² ln( (R √8 + √(1 + B_n)) / δ ) }.
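The theorem's squared-error budget is a simple closed-form function of the online regret bound B_n, so a better predictor immediately yields a smaller confidence set. A sketch (the function name is made up):

```python
import numpy as np

def conversion_bound(B_n, R, delta):
    """Right-hand side of the conversion theorem:
    1 + 2 B_n + 32 R^2 ln( (R sqrt(8) + sqrt(1 + B_n)) / delta ).
    C_n is then { theta : sum_t (Y_hat_t - <X_t, theta>)^2 <= this value }."""
    return 1.0 + 2.0 * B_n + 32.0 * R**2 * np.log(
        (R * np.sqrt(8.0) + np.sqrt(1.0 + B_n)) / delta)
```

Note the bound grows only linearly in B_n (plus a logarithmic term), so B_n = O(log n) regret bounds give confidence sets whose radius barely grows with n.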

  17. Optimistic Algorithm with Conversion
      Theorem: If |⟨x, θ*⟩| ≤ 1 for all x ∈ D_t and all t, then with probability at least 1 − δ, for all n, the regret of the Optimistic Algorithm is
          Regret_n ≤ O( √(d n B_n) · polylog(n, d, 1/δ, B_n) ).

  18. Bandits Combined with SeqSEW
      Theorem ([Gerchinovitz (2011)]): If ‖θ‖_∞ ≤ 1 and ‖θ‖_0 ≤ p, then the SeqSEW algorithm has regret bound ρ_n(θ) ≤ B_n = O(p log(nd)).
      Suppose ‖θ*‖_2 ≤ 1 and ‖θ*‖_0 ≤ p. Via the conversion, the Optimistic Algorithm has regret
          O( R √(p d n) · polylog(n, d, 1/δ) ),
      which is better than O( R d √n · polylog(n, d, 1/δ) ).

  19. Open Problems
      ◮ Confidence sets for batch algorithms, e.g. offline LASSO.
      ◮ An adaptive bandit algorithm that does not need p upfront.

  20. Questions? Read papers at http://david.palenica.com/

  21. References
      Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246, 2001.
      Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th Annual Conference on Learning Theory, pages 97–111, 2007.
      Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Rocco Servedio and Tong Zhang, editors, Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 355–366, 2008.
      Sébastien Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. In Proceedings of the 24th Annual Conference on Learning Theory (COLT 2011), 2011.
      Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, January 1997.
      Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
      Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.
      Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.

