Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously
Julian Zimmert (University of Copenhagen)
Haipeng Luo (University of Southern California)
Chen-Yu Wei (University of Southern California)

Semi-bandits Example
Day 1: 15 mins. Day 2: 13 mins. Day 3: 16 mins. . . .
Goal: minimize the average commuting time.
Types of Environments
Spectrum: i.i.d. (more benign) ←→ adversarial
◮ Algorithms designed for i.i.d. environments perform badly in the adversarial case.
◮ Algorithms designed for adversarial environments do not take advantage of an i.i.d. environment.
⇒ To achieve optimal performance, they need to know which environment they are in and pick the corresponding algorithm.
Motivation
Spectrum: i.i.d. (more benign) ←→ unknown / mixed ←→ adversarial
What if
- 1. we have no prior knowledge about the environment?
- 2. the environment is usually i.i.d., but we want to be robust to adversarial attacks?
- 3. the environment is usually arbitrary, but we want to exploit its benignness when we get lucky?
Our Results
◮ We propose the first semi-bandit algorithm that has optimal
performance guarantees in both i.i.d. and adversarial environments, without knowing which environment it is in.
Formalizing Semi-bandits
Given: action set X = {X^(1), X^(2), . . .} ⊆ {0, 1}^d (the set of all paths; d = #edges).
For t = 1, . . . , T,
◮ The learner chooses X_t ∈ X (chooses a path).
◮ The environment reveals ℓ_{ti} for each i with X_{ti} = 1 (reveals the cost of each chosen edge).
◮ The learner suffers loss ⟨X_t, ℓ_t⟩ (the path cost).
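To make the notation concrete, here is a toy instance in Python (our own illustration; the graph, edge indexing, and loss values are made up): each path is encoded as a binary incidence vector over the d edges, and the semi-bandit feedback reveals exactly the losses of the chosen edges.

```python
import numpy as np

# Toy commuting graph (hypothetical): nodes s, a, t with d = 3 edges,
# indexed 0: s->a, 1: a->t, 2: s->t. Each path is a vector in {0,1}^3.
ACTIONS = {
    "via_a":  np.array([1, 1, 0]),  # path s -> a -> t
    "direct": np.array([0, 0, 1]),  # path s -> t
}

ell_t = np.array([0.3, 0.4, 0.8])  # today's per-edge delays (hidden from learner)
X_t = ACTIONS["via_a"]             # the learner's chosen path X_t
observed = ell_t[X_t == 1]         # semi-bandit feedback: losses of chosen edges only
loss = X_t @ ell_t                 # suffered loss <X_t, ell_t> = 0.3 + 0.4
print(observed, loss)              # [0.3 0.4] 0.7
```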
Semi-bandits Regret Bounds
Goal: minimize

$$\mathrm{Regret} = \underbrace{\mathbb{E}\Big[\textstyle\sum_{t=1}^{T} \langle X_t, \ell_t \rangle\Big]}_{\text{learner's total cost}} \;-\; \min_{X \in \mathcal{X}} \underbrace{\mathbb{E}\Big[\textstyle\sum_{t=1}^{T} \langle X, \ell_t \rangle\Big]}_{\text{best fixed action's total cost}}.$$

◮ When the ℓ_t are i.i.d.: Regret = Θ(log T).
◮ When the ℓ_t are adversarially generated: Regret = Θ(√T).
◮ Our algorithm always guarantees O(√T), and gets O(log T) when the losses happen to be i.i.d.
Related Work in Multi-armed Bandit (MAB)
MAB is the special case of semi-bandits with X = {e_1, . . . , e_d}.

Algorithm / Idea:
◮ SAO [BS12], SAPO [AC16]: i.i.d. algorithm + non-i.i.d. detection.
◮ EXP3++ [SS14, SL17]: adversarial algorithm (EXP3) + sophisticated exploration mechanism.
◮ BROAD [WL18], T-INF [ZS19] (optimal): adversarial algorithm (FTRL with a special regularizer) + improved analysis.

Our work generalizes the idea of [WL18] and [ZS19] to semi-bandits.
Algorithm
Follow the Regularized Leader (FTRL) with learning rate η_t = 1/√t and regularizer Ψ.
For t = 1, 2, 3, . . .
◮ Compute

$$x_t = \operatorname*{argmin}_{x \in \mathrm{Conv}(\mathcal{X})} \Big\langle x, \sum_{s=1}^{t-1} \hat{\ell}_s \Big\rangle + \eta_t^{-1} \Psi(x).$$

◮ Sample X_t such that E[X_t] = x_t, and observe ℓ_{ti} for each i with X_{ti} = 1.
◮ Construct the unbiased estimator ℓ̂_t of ℓ_t:

$$\hat{\ell}_{ti} = \frac{\ell_{ti}\, \mathbf{1}[X_{ti} = 1]}{x_{ti}}.$$
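A minimal runnable sketch of this loop (our own, not the authors' code): it assumes the simplest combinatorial set X = {0,1}^d, so that Conv(X) = [0,1]^d, the argmin decomposes coordinate-wise, and sampling with E[X_t] = x_t is just independent Bernoullis; env_loss is a hypothetical callback returning a loss vector in [0,1]^d. For a general X one needs a convex solver over Conv(X) and a decomposition of x_t into points of X.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def psi_coord(z):
    # Coordinate-wise piece of the two-sided hybrid regularizer (next slide):
    # Poly-INF part -sqrt(z) plus complementary neg-entropy (1-z)*log(1-z).
    return -np.sqrt(z) + (1.0 - z) * np.log1p(-z)

def ftrl_semi_bandit(env_loss, d, T, seed=0):
    """FTRL with eta_t = 1/sqrt(t), specialized to X = {0,1}^d."""
    rng = np.random.default_rng(seed)
    L_hat = np.zeros(d)          # cumulative loss estimates: sum_s l_hat_s
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)
        # The FTRL objective separates over coordinates on [0,1]^d, and each
        # 1-D piece z -> z*L_hat[i] + psi(z)/eta is convex, so solve directly.
        x = np.array([minimize_scalar(
                lambda z, Li=L_hat[i]: z * Li + psi_coord(z) / eta,
                bounds=(1e-9, 1 - 1e-9), method="bounded").x
            for i in range(d)])
        X = (rng.random(d) < x).astype(float)  # sample X_t with E[X_t] = x_t
        ell = env_loss(t)                      # environment's loss vector
        total_loss += X @ ell                  # suffer <X_t, ell_t>
        # Importance weighting: l_hat_ti = ell_ti * 1[X_ti = 1] / x_ti.
        L_hat += np.where(X == 1.0, ell / x, 0.0)
    return total_loss

# Example: i.i.d. Bernoulli edge losses with one clearly better edge.
rng = np.random.default_rng(1)
means = np.array([0.2, 0.5, 0.5, 0.5, 0.5])
print(ftrl_semi_bandit(lambda t: (rng.random(5) < means).astype(float), d=5, T=500))
```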
Regularizer (Key Contribution)
Two-sided hybrid regularizer:

$$\Psi(x) = \underbrace{\sum_{i=1}^{d} \big(-\sqrt{x_i}\big)}_{\text{[AB09]'s Poly-INF}} \;+\; \underbrace{\sum_{i=1}^{d} (1 - x_i) \log(1 - x_i)}_{\text{neg-entropy for the complement}}.$$

Intuition:
◮ When x_i is close to 0, the learner starves for information ⇒ like a bandit problem ⇒ use the optimal regularizer for bandits (Poly-INF).
◮ When x_i is close to 1 ⇒ like a full-information problem ⇒ use the optimal regularizer for full information (neg-entropy).
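One way to see this two-sided behavior numerically (our own illustration, not from the talk): the coordinate-wise curvature of Ψ is ψ″(x) = 1/(4x^{3/2}) + 1/(1−x), and the Poly-INF term dominates near 0 while the neg-entropy term dominates near 1.

```python
import numpy as np

# Curvature of the coordinate-wise regularizer psi(x) = -sqrt(x) + (1-x)log(1-x):
#   psi''(x) = 1/(4 x^(3/2)) + 1/(1 - x).
def polyinf_term(x):   # bandit-like regime: blows up as x -> 0
    return 1.0 / (4.0 * x ** 1.5)

def negent_term(x):    # full-info-like regime: blows up as x -> 1
    return 1.0 / (1.0 - x)

for x in [0.001, 0.1, 0.5, 0.9, 0.999]:
    print(f"x={x:5.3f}  Poly-INF term={polyinf_term(x):9.1f}  "
          f"neg-entropy term={negent_term(x):9.1f}")
```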
Results Overview

Env. \ X       General                {X ∈ {0,1}^d : ‖X‖₁ = m}                            {0,1}^d
i.i.d.         (md log T) / ∆_min     Σ_{i>m} (log T) / ∆_i                               Σ_i (log T) / ∆_i
Adversarial    √(mdT)                 √(mdT) if m ≤ d/2;  (d − m)√(T log d) if m > d/2    d√T

Here m ≜ max_{X∈X} ‖X‖₁, and ∆_min = E[second-best action's loss] − E[best action's loss] (the minimal optimality gap).
Analysis Steps
- 1. Analyze FTRL with the new regularizer and get O(√T) for the adversarial setting.
- 2. Further use the self-bounding technique to get O(log T) for the i.i.d. setting.
Analyzing FTRL for the New Regularizer
Key lemma.

$$\mathrm{Reg} \;\le\; \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \sum_{i} \min\left\{ \sqrt{x_{ti}},\; (1 - x_{ti})\Big(1 + \log \frac{1}{1 - x_{ti}}\Big) \right\}.$$

Remarks.
- 1. The analysis is mostly standard but needs more care (do not drop terms that the usual analysis drops).
- 2. The two-sidedness of the regularizer is the key to getting the "min{·, ·}".
- 3. From this bound, we easily get the O(√T) bound.
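To fill in Remark 3 (our own calculation, not spelled out on the slide): drop the second argument of the min, apply Cauchy–Schwarz together with Σ_i x_{ti} ≤ m (which holds for every x_t ∈ Conv(X)), and sum 1/√t:

$$\mathrm{Reg} \;\le\; \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \sum_{i=1}^{d} \sqrt{x_{ti}} \;\le\; \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \sqrt{d \sum_{i=1}^{d} x_{ti}} \;\le\; \sqrt{md}\, \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \;\le\; 2\sqrt{mdT},$$

matching the √(mdT) adversarial entry in the Results Overview.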
Self-bounding to Get the O(log T) Bound

$$\mathrm{Reg} \;\le\; \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \sum_{i} \min\left\{ \sqrt{x_{ti}},\; (1 - x_{ti})\Big(1 + \log \frac{1}{1 - x_{ti}}\Big) \right\}$$

Goal: upper bound the inner sum by $C \sqrt{\Pr[X_t \neq X^*]}$.
Intuitively true: Pr[X_t ≠ X^*] → 0 ⇒ x_t → X^* ⇒ the expression above → 0.

Assume it is proved. Since every round with X_t ≠ X^* incurs at least ∆_min expected regret,

$$\sum_t \Delta_{\min} \Pr[X_t \neq X^*] \;\le\; \mathrm{Reg} \;\le\; \sum_t \frac{C \sqrt{\Pr[X_t \neq X^*]}}{\sqrt{t}} \;\le\; \sum_t \frac{C^2}{2 t \Delta_{\min}} + \sum_t \frac{\Delta_{\min} \Pr[X_t \neq X^*]}{2} \quad \text{(AM-GM)}$$

Moving the last sum to the left-hand side,

$$\sum_t \Delta_{\min} \Pr[X_t \neq X^*] \;\le\; \sum_{t=1}^{T} \frac{C^2}{t \Delta_{\min}} = \frac{C^2 \log T}{\Delta_{\min}} \;\Longrightarrow\; \mathrm{Reg} \;\le\; \frac{C^2 \log T}{\Delta_{\min}}.$$
Experiments (regret vs. time)
[Figure: regret-vs-time curves (linear and log-log scales) for an i.i.d. environment and a non-i.i.d. environment, comparing Exp2, CUCB, LogBar, TS, and Ours.]
Summary
◮ This paper considers semi-bandits and proposes the first single algorithm that has optimal regret guarantees in both adversarial and i.i.d. environments.
◮ The algorithm is a simple instantiation of the Follow the Regularized Leader framework. The keys to getting the O(log T) bound in the i.i.d. setting are to
- 1. use the two-sided hybrid regularizer, and
- 2. analyze it using the self-bounding technique.
◮ Experiments show our algorithm indeed has best-of-both-worlds performance.