SLIDE 1
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I
Sébastien Bubeck, Theory Group
SLIDES 2-6
i.i.d. multi-armed bandit, Robbins [1952]
Known parameters: number of arms n and (possibly) number of rounds T ≥ n.
Unknown parameters: n probability distributions ν1, . . . , νn on [0, 1] with means µ1, . . . , µn (notation: µ∗ = max_{i∈[n]} µi).
Protocol: For each round t = 1, 2, . . . , T, the player chooses It ∈ [n] based on past observations and receives a reward/observation Yt ∼ ν_{It} (independently from the past).
Performance measure: The cumulative regret is the difference between the player's accumulated reward and the maximum the player could have obtained had she known all the parameters:
RT = Tµ∗ − E ∑_{t∈[T]} Yt.
Fundamental tension between exploration and exploitation. Many applications!
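To fix ideas, here is a minimal sketch of this protocol in Python (the BernoulliBandit environment, the random baseline and all names are illustrative choices of this transcript, not from the slides):

import random

class BernoulliBandit:
    """i.i.d. environment: arm i pays 1 with probability mu[i], else 0."""
    def __init__(self, mu):
        self.mu = mu
    def pull(self, i):
        return 1.0 if random.random() < self.mu[i] else 0.0

def play(env, choose_arm, T):
    """Run the protocol: at each round the player picks I_t from the past
    observations and receives Y_t ~ nu_{I_t}."""
    history, total = [], 0.0
    for t in range(T):
        i = choose_arm(history)
        y = env.pull(i)
        history.append((i, y))
        total += y
    return total

# Baseline that ignores observations: its regret grows linearly in T.
env = BernoulliBandit([0.5, 0.6])
reward = play(env, lambda history: random.randrange(2), T=10000)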
SLIDES 7-11
i.i.d. multi-armed bandit: fundamental limitations
How small can we expect RT to be? Consider the 2-armed case where ν1 = Ber(1/2) and ν2 = Ber(1/2 + ξ∆), where ξ ∈ {−1, 1} is unknown.
With τ expected observations from the second arm there is a probability at least exp(−τ∆²) of making the wrong guess on the value of ξ. Let τ(t) be the expected number of pulls of arm 2 up to time t when ξ = −1. Then
RT(ξ = +1) + RT(ξ = −1) ≥ ∆ τ(T) + ∆ ∑_{t=1}^{T} exp(−τ(t)∆²) ≥ ∆ min_{t∈[T]} (t + T exp(−t∆²)) ≈ log(T∆²)/∆.
See Bubeck, Perchet and Rigollet [2012] for the details.
For ∆ fixed the lower bound is log(T)/∆, and for the worst ∆ (≈ 1/√T) it is √T (Auer, Cesa-Bianchi, Freund and Schapire [1995]: √(Tn) for the n-armed case).
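The last approximation comes from optimizing t in the minimum; spelling this step out (my filling-in, not on the slide), with f(t) = t + T exp(−t∆²):

f'(t) = 1 - T\Delta^2 e^{-t\Delta^2} = 0
\;\Rightarrow\; t^\star = \frac{\log(T\Delta^2)}{\Delta^2},
\qquad \Delta\, f(t^\star) = \frac{\log(T\Delta^2) + 1}{\Delta} \;\approx\; \frac{\log(T\Delta^2)}{\Delta}.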
SLIDES 12-15
i.i.d. multi-armed bandit: fundamental limitations
Notation: ∆i = µ∗ − µi and Ni(t) is the number of pulls of arm i up to time t. Then one has RT = ∑_{i=1}^{n} ∆i E Ni(T).
For p, q ∈ [0, 1], kl(p, q) := p log(p/q) + (1 − p) log((1 − p)/(1 − q)).
Theorem (Lai and Robbins [1985])
Consider a strategy s.t. ∀a > 0, we have E Ni(T) = o(T^a) if ∆i > 0. Then for any Bernoulli distributions,
lim inf_{T→+∞} RT / log(T) ≥ ∑_{i:∆i>0} ∆i / kl(µi, µ∗).
Note that 1/(2∆i) ≥ ∆i/kl(µi, µ∗) ≥ µ∗(1 − µ∗)/(2∆i), so up to a variance-like term the Lai and Robbins lower bound is ∑_{i:∆i>0} log(T)/(2∆i).
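A quick numerical check of this sandwich (my numbers, not the slides'): take µ∗ = 0.6 and µi = 0.5, so ∆i = 0.1. Then

\mathrm{kl}(0.5, 0.6) = 0.5 \log\frac{0.5}{0.6} + 0.5 \log\frac{0.5}{0.4} \approx 0.0204,
\qquad \frac{\Delta_i}{\mathrm{kl}(\mu_i, \mu^*)} \approx \frac{0.1}{0.0204} \approx 4.9,

which indeed lies between µ∗(1 − µ∗)/(2∆i) = 1.2 and 1/(2∆i) = 5.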
SLIDES 16-19
i.i.d. multi-armed bandit: fundamental strategy
Hoeffding's inequality: w.p. ≥ 1 − 1/T, ∀t ∈ [T], i ∈ [n],
µi ≤ (1/Ni(t)) ∑_{s<t: Is=i} Ys + √(2 log(T)/Ni(t)) =: UCBi(t).
UCB (Upper Confidence Bound) strategy (Lai and Robbins [1985], Agrawal [1995], Auer, Cesa-Bianchi and Fischer [2002]):
It ∈ argmax_{i∈[n]} UCBi(t).
Simple analysis: on a 1 − 2/T probability event one has
Ni(t) ≥ 8 log(T)/∆i² ⇒ UCBi(t) < µ∗ ≤ UCB_{i∗}(t),
so that E Ni(T) ≤ 2 + 8 log(T)/∆i² and in fact
RT ≤ 2 + ∑_{i:∆i>0} 8 log(T)/∆i.
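A minimal sketch of the UCB strategy (illustrative implementation, reusing the BernoulliBandit sketch above; the exploration bonus is exactly the Hoeffding term of the slide):

import math

def ucb(env, n, T):
    """Pull each arm once, then always the arm with the highest UCB_i(t)."""
    counts = [0] * n      # N_i(t), number of pulls of arm i
    sums = [0.0] * n      # sum of rewards observed from arm i
    total = 0.0
    for t in range(1, T + 1):
        if t <= n:
            i = t - 1     # initialization: one pull per arm
        else:
            i = max(range(n), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(T) / counts[j]))
        y = env.pull(i)
        counts[i] += 1
        sums[i] += y
        total += y
    return total

# On BernoulliBandit([0.5, 0.6]) the number of pulls of the bad arm should
# be of order log(T)/Delta^2 rather than T/2.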
SLIDES 20-22
i.i.d. multi-armed bandit: going further
1. Optimal constant (replacing 8 by 1/2 in the UCB regret bound) and Lai and Robbins variance-like term (replacing ∆i by kl(µi, µ∗)): see Cappé, Garivier, Maillard, Munos and Stoltz [2013].
2. In many applications one is merely interested in finding the best arm (instead of maximizing cumulative reward): this is the best arm identification problem. For the fundamental strategies see Even-Dar, Mannor and Mansour [2006] for the fixed-confidence setting (see also Jamieson and Nowak [2014] for a recent short survey) and Audibert, Bubeck and Munos [2010] for the fixed-budget setting. Key takeaway: one needs of order H := ∑_i ∆i⁻² rounds to find the best arm.
3. The UCB analysis extends to sub-Gaussian reward distributions. For heavy-tailed distributions, say with a finite moment of order 1 + ε for some ε ∈ (0, 1], one can get a regret that scales with ∆i^{-1/ε} (instead of ∆i⁻¹) by using a robust mean estimator, see Bubeck, Cesa-Bianchi and Lugosi [2012].
SLIDES 23-25
Adversarial multi-armed bandit, Auer, Cesa-Bianchi, Freund and Schapire [1995, 2001]
For t = 1, . . . , T, the player chooses It ∈ [n] based on previous observations, and simultaneously an adversary chooses a loss vector ℓt ∈ [0, 1]ⁿ. The player's loss/observation is ℓt(It).
The regret and pseudo-regret are defined as:
RT = max_{i∈[n]} ∑_{t∈[T]} (ℓt(It) − ℓt(i)),    R̄T = max_{i∈[n]} E ∑_{t∈[T]} (ℓt(It) − ℓt(i)).
Obviously E RT ≥ R̄T, and there is equality in the oblivious case (≡ the adversary's choices are independent of the player's choices). The case where ℓ1, . . . , ℓT is an i.i.d. sequence corresponds to the i.i.d. case we just studied. In particular we have a √(Tn) lower bound.
SLIDES 26-31
Adversarial multi-armed bandit, fundamental strategy
Exponential weights strategy for full information (ℓt is observed at the end of round t): play It at random from pt, where
pt+1(i) = (1/Zt+1) pt(i) exp(−η ℓt(i)).
In five lines one can show R̄T ≤ √(2T log(n)) with p1(i) = 1/n:
Ent(δj ∥ pt) − Ent(δj ∥ pt+1) = log(pt+1(j)/pt(j)) = log(1/Zt+1) − η ℓt(j),
ψt := log E_{I∼pt} exp(−η(ℓt(I) − E_{I′∼pt} ℓt(I′))) = η E_{I′∼pt} ℓt(I′) + log(Zt+1),
η (∑_t ∑_i pt(i) ℓt(i) − ∑_t ℓt(j)) = Ent(δj ∥ p1) − Ent(δj ∥ pT+1) + ∑_t ψt.
Using that ℓt ≥ 0 one has ψt ≤ (η²/2) E_{I∼pt} ℓt(I)², thus R̄T ≤ log(n)/η + ηT/2, and η = √(2 log(n)/T) gives the claimed bound.
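A minimal sketch of this full-information strategy (illustrative implementation; losses assumed in [0, 1], η tuned as above):

import math, random

def exponential_weights(loss_vectors, n):
    """Play I_t ~ p_t, observe the whole loss vector, reweight multiplicatively."""
    T = len(loss_vectors)
    eta = math.sqrt(2 * math.log(n) / T)
    w = [1.0] * n                        # p_t is w normalized
    total_loss = 0.0
    for ell in loss_vectors:             # ell = (ell_t(1), ..., ell_t(n))
        Z = sum(w)
        p = [wi / Z for wi in w]
        I = random.choices(range(n), weights=p)[0]
        total_loss += ell[I]
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, ell)]
    return total_loss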
SLIDES 32-35
Adversarial multi-armed bandit, fundamental strategy
Exp3: replace ℓt by ℓ̃t in the exponential weights strategy, where
ℓ̃t(i) = (ℓt(It)/pt(i)) 1{i = It}.
Key property: E_{It∼pt} ℓ̃t(i) = ℓt(i). Thus with the analysis from the previous slide:
R̄T ≤ log(n)/η + (η/2) E ∑_t E_{I∼pt} ℓ̃t(I)².
Amazingly the variance term is automatically controlled:
E_{It,I∼pt} ℓ̃t(I)² ≤ E_{It,I∼pt} 1{I = It}/pt(It)² = E_{I∼pt} 1/pt(I) = n.
Thus with η = √(2n log(n)/T) one gets R̄T ≤ √(2Tn log(n)).
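The same loop with bandit feedback gives Exp3; a sketch (illustrative implementation; loss_fn stands in for the adversary and is queried only at the played arm):

import math, random

def exp3(loss_fn, n, T):
    """Exponential weights on the importance-weighted estimates
    tilde-ell_t(i) = ell_t(I_t) 1{i = I_t} / p_t(i)."""
    eta = math.sqrt(2 * n * math.log(n) / T)
    w = [1.0] * n
    total_loss = 0.0
    for t in range(T):
        Z = sum(w)
        p = [wi / Z for wi in w]
        I = random.choices(range(n), weights=p)[0]
        loss = loss_fn(t, I)                   # only ell_t(I_t) is observed
        total_loss += loss
        w[I] *= math.exp(-eta * loss / p[I])   # the estimate is zero off I_t
    return total_loss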
SLIDES 36-40
Adversarial multi-armed bandit, going further
1. With the modified loss estimate ℓ̃t(i) = (ℓt(It) 1{i = It} + β)/pt(i) one can prove high-probability bounds on RT, and by integrating the deviations one can show E RT = O(√(Tn log(n))).
2. The extraneous logarithmic factor in the pseudo-regret upper bound can be removed, see Audibert and Bubeck [2009]. Conjecture: one cannot remove the log factor for the expected regret, that is, for any strategy there exists an adaptive adversary such that E RT = Ω(√(Tn log(n))).
3. T can be replaced by various measures of "variance" in the loss sequence, see e.g. Hazan and Kale [2009].
4. There exist strategies which guarantee simultaneously R̄T = O(√(Tn)) in the adversarial model and R̄T = O(∑_i ∆i⁻¹) in the i.i.d. model, see Bubeck and Slivkins [2012].
5. Graph feedback structure, regret with respect to S switches, label efficient, switching cost, ...
SLIDES 41-44
Bayesian multi-armed bandit, Thompson [1933]
Set of models {(ν1(θ), . . . , νn(θ)), θ ∈ Θ} and prior distribution π0 over Θ. The Bayesian regret is defined as
BRT(π0) = E_{θ∼π0} RT(ν1(θ), . . . , νn(θ)).
In principle the strategy minimizing the Bayesian regret can be computed by dynamic programming on the potentially huge state space P(Θ). The celebrated Gittins index theorem gives a sufficient condition to dramatically reduce the computational complexity of implementing the optimal Bayesian strategy, under a strong product assumption on π0.
Notation: πt denotes the posterior distribution on θ at time t.
SLIDES 45-47
Bayesian multi-armed bandit, Gittins index
Theorem (Gittins [1979])
Consider the product and γ-discounted case: Θ = ×i Θi, νi(θ) := ν(θi), π0 = ⊗i π0(i), and furthermore one is interested in maximizing E ∑_{t≥0} γ^t Yt. The optimal Bayesian strategy is to pick at time s the arm maximizing
sup { λ ∈ R : sup_τ E [ ∑_{t<τ} γ^t Xt + (γ^τ/(1 − γ)) λ ] ≥ λ/(1 − γ) },
where the expectation is over (Xt) drawn from ν(θ) with θ ∼ πs(i), and the supremum is taken over all stopping times τ.
For much more (implementation for exponential families, interpretation as a multitoken Markov game, ...) see Dumitriu, Tetali and Winkler [2003], Gittins, Glazebrook and Weber [2011], and Kaufmann [2014].
SLIDES 48-52
Bayesian multi-armed bandit, Gittins index
Weber [1992] gives an exquisite proof of Gittins' theorem. Let
λt(i) := sup { λ ∈ R : sup_τ E ∑_{t<τ} γ^t (Xt − λ) ≥ 0 }
be the Gittins index of arm i at time t, which we interpret as the maximum charge one is willing to pay to play arm i given the current information. The prevailing charge is defined as min_{s≤t} λs(i) (i.e. whenever the prevailing charge is too high we just drop it to the fair level).
1. The discounted sum of prevailing charges for played arms is an upper bound (in expectation) on the discounted sum of rewards.
2. Since the prevailing charge is nonincreasing, the discounted sum of prevailing charges is maximized if we always pick the arm with maximum prevailing charge.
3. The Gittins index strategy does exactly 2., and in this case 1. is an equality. Q.E.D.
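Weber's calibration definition suggests a direct numerical scheme. Below is a sketch for a single Bernoulli arm with a Beta(a, b) posterior (illustrative implementation; the horizon truncation and tolerance are my choices, not from the slides): bisect on the charge λ at which playing once and then stopping optimally is worth exactly zero.

def gittins_index(a, b, gamma=0.9, horizon=200, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm with Beta(a, b) posterior."""
    def value_of_playing(lam):
        # V maps a posterior state (a', b') to sup_tau E sum_{t<tau} gamma^t (X_t - lam),
        # computed by backward induction on the tree of posterior states,
        # truncated after `horizon` pulls (where the value is set to 0).
        V = {(a + s, b + horizon - s): 0.0 for s in range(horizon + 1)}
        for k in range(horizon - 1, -1, -1):
            Vk = {}
            for s in range(k + 1):
                aa, bb = a + s, b + k - s
                p = aa / (aa + bb)       # posterior mean of the arm
                cont = p - lam + gamma * (p * V[(aa + 1, bb)]
                                          + (1 - p) * V[(aa, bb + 1)])
                # the first pull is forced; afterwards stopping gives 0
                Vk[(aa, bb)] = cont if k == 0 else max(0.0, cont)
            V = Vk
        return V[(a, b)]
    lo, hi = 0.0, 1.0                    # rewards live in [0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if value_of_playing(mid) > 0 else (lo, mid)
    return (lo + hi) / 2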
SLIDES 53-56
Bayesian multi-armed bandit, Thompson Sampling (TS)
In machine learning we want (i) strategies that can deal with complicated priors, and (ii) guarantees for misspecified priors. This is why we have to go beyond the Gittins index theory.
In his 1933 paper Thompson proposed the following strategy: sample θ′ ∼ πt and play It ∈ argmax_{i∈[n]} µi(θ′).
Theoretical guarantees for this highly practical strategy have long remained elusive. Recently Agrawal and Goyal [2012] and Kaufmann, Korda and Munos [2012] proved that TS with Bernoulli reward distributions and uniform prior on the parameters achieves RT = O(∑_i log(T)/∆i) (note that this is the frequentist regret!).
Guha and Munagala [2014] conjecture that, for product priors, TS is a 2-approximation to the optimal Bayesian strategy for the objective of minimizing the number of pulls on suboptimal arms.
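A minimal sketch of TS in this Beta-Bernoulli setting (illustrative implementation, reusing the BernoulliBandit sketch from above):

import random

def thompson_sampling(env, n, T):
    """Sample theta' from the posterior, play the arm with the largest mean."""
    a = [1] * n   # Beta(a_i, b_i) posterior for arm i, starting uniform
    b = [1] * n
    total = 0.0
    for t in range(T):
        theta = [random.betavariate(a[i], b[i]) for i in range(n)]
        i = max(range(n), key=lambda j: theta[j])   # I_t in argmax mu_i(theta')
        y = env.pull(i)
        total += y
        if y == 1.0:
            a[i] += 1    # posterior update after a success
        else:
            b[i] += 1    # ... or after a failure
    return total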
SLIDES 57-60
Bayesian multi-armed bandit, Russo and Van Roy [2014] information ratio analysis
Assume a prior in the adversarial model, that is a prior over (ℓ1, . . . , ℓT) ∈ [0, 1]^{n×T}, and let Et denote the posterior distribution (given ℓ1(I1), . . . , ℓt−1(It−1)). We introduce
rt(i) = Et(ℓt(i) − ℓt(i∗)), and vt(i) = Var_t(Et(ℓt(i) | i∗)).
Key observation (next slide): E ∑_{t≤T} vt(It) ≤ (1/2) H(i∗), which implies:
∀t, Et rt(It) ≤ √(C Et vt(It)) ⇒ E ∑_{t=1}^{T} rt(It) ≤ ∑_{t=1}^{T} √(C E vt(It)) ⇒ BRT ≤ √(C T H(i∗)/2).
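The final implication combines Cauchy-Schwarz with the key observation; spelled out (my filling-in):

\sum_{t=1}^{T} \sqrt{C\,\mathbb{E}\,v_t(I_t)}
\;\le\; \sqrt{C\,T \sum_{t=1}^{T} \mathbb{E}\,v_t(I_t)}
\;\le\; \sqrt{C\,T\,H(i^*)/2}.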
SLIDES 61-63
Bayesian multi-armed bandit, accumulation of information
vt(i) = Var_t(Et(ℓt(i) | i∗)),  πt(j) = Pt(i∗ = j),  E ∑_{t≤T} vt(It) ≤ (1/2) H(i∗).
Equipped with Pinsker's inequality and basic information theory concepts (such as the mutual information It) one has:
vt(i) = ∑_j πt(j) (Et(ℓt(i) | i∗ = j) − Et(ℓt(i)))²
≤ (1/2) ∑_j πt(j) Ent(Lt(ℓt(i) | i∗ = j) ∥ Lt(ℓt(i)))
= (1/2) It(ℓt(i); i∗) = (1/2) (Ht(i∗) − Ht(i∗ | ℓt(i))).
Thus E vt(It) ≤ (1/2) E (Ht(i∗) − Ht+1(i∗)).
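Summing over t, the right-hand side telescopes (my filling-in), which proves the key observation of the previous slide:

\mathbb{E} \sum_{t \le T} v_t(I_t)
\;\le\; \frac{1}{2} \sum_{t \le T} \mathbb{E}\big(H_t(i^*) - H_{t+1}(i^*)\big)
\;=\; \frac{1}{2}\, \mathbb{E}\big(H_1(i^*) - H_{T+1}(i^*)\big)
\;\le\; \frac{1}{2} H(i^*).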
SLIDES 64-66
Bayesian multi-armed bandit, TS' information ratio
Let ℓ̄t(i) = Et ℓt(i) and ℓ̄t(i, j) = Et(ℓt(i) | i∗ = j). Then
Et rt(It) ≤ √(C Et vt(It))
⇔ Et ℓ̄t(It) − ∑_i πt(i) ℓ̄t(i, i) ≤ √(C Et ∑_j πt(j) (ℓ̄t(It, j) − ℓ̄t(It))²).
For TS the following shows that one can take C = n:
Et ℓ̄t(It) − ∑_i πt(i) ℓ̄t(i, i) = ∑_i πt(i) (ℓ̄t(i) − ℓ̄t(i, i))
≤ √(n ∑_i πt(i)² (ℓ̄t(i) − ℓ̄t(i, i))²)
≤ √(n ∑_{i,j} πt(i) πt(j) (ℓ̄t(i) − ℓ̄t(i, j))²).
Thus TS always satisfies BRT ≤ √(Tn H(i∗)) ≤ √(Tn log(n)).
Side note: by the minimax theorem this implies there exists a strategy for the oblivious adversarial model with regret √(Tn log(n)).
SLIDE 67
Summary of basic results
1. In the i.i.d. model UCB attains a regret of O(∑_i log(T)/∆i), and by Lai and Robbins' lower bound this is optimal (up to a multiplicative variance-like term).
2. In the adversarial model Exp3 attains a regret of O(√(Tn log(n))), and this is optimal up to the logarithmic term.
3. In the Bayesian model, Gittins index gives an optimal strategy for the product and discounted case.