Linear Bandits
Dávid Pál
Google, New York & Department of Computing Science, University of Alberta
dpal@google.com
November 2, 2011
Joint work with Yasin Abbasi-Yadkori and Csaba Szepesvári

In round t = 1, 2, . . . :
◮ Choose an action Xt from a set Dt ⊂ ℝ^d.
◮ Receive a reward Yt = ⟨Xt, θ∗⟩ + ηt, where ηt is random noise.
◮ Weights θ∗ are unknown but fixed.
◮ Goal: Maximize total reward.
◮ exploration & exploitation with side information
◮ action = arm = ad = feature vector
◮ reward = click
◮ Formal model & Regret
◮ Algorithm:
  ◮ Confidence sets for Least Squares
  ◮ Sparse models: Online-to-Confidence-Set Conversion
◮ Receive a decision set Dt ⊂ ℝ^d
◮ Choose an action Xt ∈ Dt
◮ Receive a reward Yt = ⟨Xt, θ∗⟩ + ηt, where the noise ηt = Z satisfies:
◮ E[Z] = 0 and Var[Z] ≤ R²
◮ Example: zero-mean noise bounded in an interval of length 2R
◮ Example: zero-mean Gaussian noise with variance ≤ R²
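As a concrete illustration of the protocol above, a minimal simulation sketch (NumPy assumed; names are illustrative; the noise here is the zero-mean Gaussian example with standard deviation R):

    import numpy as np

    rng = np.random.default_rng(0)
    d, R = 5, 0.1
    theta_star = rng.normal(size=d)      # unknown but fixed weight vector

    def play_round(actions, x_index):
        """Given the decision set D_t (rows of `actions`) and a chosen row index,
        return the noisy reward Y_t = <X_t, theta_star> + eta_t."""
        x = actions[x_index]
        eta = rng.normal(scale=R)        # zero-mean noise with Var <= R^2
        return x @ theta_star + eta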
◮ If we knew θ∗, then in round t we’d choose the action
  X∗t = argmax_{x ∈ Dt} ⟨x, θ∗⟩
◮ Regret is our reward in n rounds relative to X∗t:
  Regret_n = Σ_{t=1}^n ⟨X∗t, θ∗⟩ − Σ_{t=1}^n ⟨Xt, θ∗⟩
◮ We want Regret_n / n → 0 as n → ∞
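In a simulation where θ∗ is known to the experimenter, the regret of a sequence of plays can be computed directly from this definition; a sketch with hypothetical names (`decision_sets` holds the Dt as NumPy arrays, `chosen` the played Xt):

    def regret(decision_sets, chosen, theta_star):
        """Sum over rounds of <X*_t, theta_star> - <X_t, theta_star>,
        where X*_t is the best action in D_t."""
        total = 0.0
        for D_t, x_t in zip(decision_sets, chosen):
            best = max(x @ theta_star for x in D_t)   # value of X*_t
            total += best - x_t @ theta_star
        return total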
◮ Maintain a confidence set Ct ⊆ ℝ^d such that θ∗ ∈ Ct (with high probability)
◮ In round t, choose
  (Xt, θ̃t) = argmax_{(x, θ) ∈ Dt × Ct−1} ⟨x, θ⟩
  and play Xt
◮ UCB algorithm is a special case.
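Why this yields a UCB-style rule: when Ct−1 is an ellipsoid centered at a least-squares estimate θ̂t−1 (the construction developed on the following slides), the inner maximization over θ has a standard closed form. A sketch, assuming Ct−1 = {θ : ‖θ − θ̂t−1‖_{Vt−1} ≤ βt−1}:

    \max_{\theta \in C_{t-1}} \langle x, \theta \rangle
      = \langle x, \hat\theta_{t-1} \rangle + \beta_{t-1}\,\|x\|_{V_{t-1}^{-1}},
    \qquad \|x\|_{V^{-1}} := \sqrt{x^\top V^{-1} x},

so the chosen action maximizes an upper confidence bound on its expected reward.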
◮ Data (X1, Y1), . . . , (Xn, Yn) such that Yt ≈ ⟨Xt, θ∗⟩
◮ Stack them into matrices: X1:n is n × d and Y1:n is n × 1
◮ Least squares estimate: θ̂n = (X1:nᵀ X1:n + λI)⁻¹ X1:nᵀ Y1:n
◮ Let Vn = X1:nᵀ X1:n + λI
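A minimal NumPy sketch of the two quantities just defined, the regularized least-squares estimate θ̂n and the matrix Vn (function and variable names are illustrative):

    import numpy as np

    def least_squares(X, Y, lam=1.0):
        """X is n x d (rows X_1..X_n), Y has length n; returns (theta_hat, V_n)."""
        d = X.shape[1]
        V = X.T @ X + lam * np.eye(d)              # V_n = X^T X + lambda I
        theta_hat = np.linalg.solve(V, X.T @ Y)    # (X^T X + lambda I)^{-1} X^T Y
        return theta_hat, V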
◮ The confidence set Ct is an ellipsoid (in the Vt-norm) centered at the least squares solution θ̂t
◮ θ∗ lies somewhere in Ct w.h.p.
◮ Next action: (Xt+1, θ̃t+1) = argmax_{(x, θ) ∈ Dt+1 × Ct} ⟨x, θ⟩
◮ Our bound: with probability at least 1 − δ, simultaneously for all n,
  ‖θ̂n − θ∗‖_{Vn} ≤ R √( 2 log( det(Vn)^{1/2} det(λI)^{−1/2} / δ ) ) + λ^{1/2} S,  where ‖θ∗‖₂ ≤ S
◮ Compare [Dani et al.(2008)], which assumes ‖θ∗‖₂ ≤ 1 and ‖Xt‖₂ ≤ 1
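To make the algorithm concrete, a minimal sketch of the optimistic action choice implied by the ellipsoidal confidence set above; the radius beta is left as a free parameter here (the bound above gives a valid choice), and the names are illustrative:

    import numpy as np

    def choose_action(actions, theta_hat, V, beta):
        """Pick the row of `actions` maximizing <x, theta_hat> + beta * ||x||_{V^{-1}}."""
        V_inv = np.linalg.inv(V)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))  # ||x||_{V^{-1}} per row
        ucb = actions @ theta_hat + beta * widths
        return int(np.argmax(ucb))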
◮ When θ∗ is sparse, it is not a good idea to use least squares.
◮ Better to use e.g. L1-regularization (see the sketch below).
◮ How do we construct confidence sets?
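For reference, a minimal sketch of computing an L1-regularized estimate of θ∗ from the data, assuming scikit-learn is available; the regularization strength alpha is an illustrative choice, not something the talk specifies:

    import numpy as np
    from sklearn.linear_model import Lasso

    def lasso_estimate(X, Y, alpha=0.1):
        """L1-regularized least squares; returns a (typically sparse) estimate of theta_star."""
        model = Lasso(alpha=alpha, fit_intercept=False)
        model.fit(X, Y)
        return model.coef_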
◮ Similar to Online-to-Batch Conversion, but very different.
◮ We start with an online prediction algorithm.
◮ Receive Xt ∈ ℝ^d
◮ Predict Ŷt ∈ ℝ
◮ Receive correct label Yt ∈ ℝ
◮ Suffer loss (Yt − Ŷt)²
◮ online gradient descent [Zinkevich(2003)] (sketched below)
◮ online least-squares [Azoury and Warmuth(2001), Vovk(2001)]
◮ exponentiated gradient [Kivinen and Warmuth(1997)]
◮ online LASSO (??)
◮ SeqSEW [Gerchinovitz(2011), Dalalyan and Tsybakov(2007)]
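As one concrete instance of the protocol above, a minimal sketch of projected online gradient descent for the squared loss [Zinkevich(2003)]; the constant step size and projection radius are illustrative choices, not tuned:

    import numpy as np

    def online_gradient_descent(stream, d, eta=0.1, radius=1.0):
        """`stream` yields (x_t, y_t) pairs with x_t a NumPy vector of length d.
        Predicts y_hat_t = <x_t, theta_t>, then updates theta by a gradient step
        on the squared loss and projects onto an L2 ball of the given radius."""
        theta = np.zeros(d)
        predictions = []
        for x, y in stream:
            y_hat = x @ theta
            predictions.append(y_hat)
            grad = 2.0 * (y_hat - y) * x          # gradient of (y - <x, theta>)^2
            theta = theta - eta * grad
            norm = np.linalg.norm(theta)
            if norm > radius:
                theta *= radius / norm            # projection onto {||theta|| <= radius}
        return predictions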
◮ Regret with respect to a linear predictor θ ∈ ℝ^d:
  ρ(θ) = Σ_{t=1}^n (Yt − Ŷt)² − Σ_{t=1}^n (Yt − ⟨Xt, θ⟩)²
◮ Prediction algorithms come with “regret bounds” Bn:
◮ Bn depends on n, d, θ and possibly X1, X2, . . . , Xn and Y1, Y2, . . . , Yn
◮ Typically, Bn = O(√n) or Bn = O(log n)
◮ Data (X1, Y1), . . . , (Xn, Yn) where Yt = ⟨Xt, θ∗⟩ + ηt
◮ Predictions Ŷ1, Ŷ2, . . . , Ŷn produced by the online prediction algorithm
◮ Regret bound ρ(θ∗) ≤ Bn
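The algebra behind the conversion is short. A sketch of the standard argument, using only the definitions above (Yt = ⟨Xt, θ∗⟩ + ηt, squared loss, and ρ(θ∗) ≤ Bn); the final high-probability step uses a martingale (self-normalized) concentration bound:

    \rho(\theta_*) = \sum_{t=1}^n (Y_t - \hat Y_t)^2 - \sum_{t=1}^n (Y_t - \langle X_t, \theta_*\rangle)^2
                   = \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta_*\rangle)^2
                     - 2\sum_{t=1}^n \eta_t\,(\hat Y_t - \langle X_t, \theta_*\rangle),

    \text{hence}\quad
    \sum_{t=1}^n (\hat Y_t - \langle X_t, \theta_*\rangle)^2
        \le B_n + 2\sum_{t=1}^n \eta_t\,(\hat Y_t - \langle X_t, \theta_*\rangle).

The last sum is a martingale in the noise, so it can be bounded with high probability in terms of observable quantities; this turns the inequality into a confidence set for θ∗ defined by a single quadratic constraint.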
◮ Confidence sets for batch algorithms e.g. offline
◮ Adaptive bandit algorithm that doesn’t need p
Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246, 2001.
Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Proceedings of the 20th Annual Conference on Learning Theory, pages 97–111, 2007.
Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Rocco Servedio and Tong Zhang, editors, Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 355–366, 2008.
Sébastien Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. In Proceedings of the 24th Annual Conference on Learning Theory (COLT 2011), 2011.
Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, January 1997.
Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.