
SLIDE 1

Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

Julian Zimmert (University of Copenhagen) Haipeng Luo (University of Southern California) Chen-Yu Wei (University of Southern California)

SLIDE 2

Semi-bandits Example

Each day, a commuter picks a route to work and observes the time spent on each road segment along it.

◮ Day 1: 15 mins
◮ Day 2: 13 mins
◮ Day 3: 16 mins
◮ ...

Goal: minimize the average commuting time.

SLIDES 3–4

Types of Environments

i.i.d. (more benign) vs. adversarial.

◮ Algorithms for the i.i.d. setting perform badly in the adversarial case.
◮ Algorithms for the adversarial setting do not take advantage of an i.i.d. environment when they get one.

⇒ To achieve optimal performance, one needs to know which environment one is in and pick the corresponding algorithm.

SLIDE 5

Motivation

Spectrum of environments: i.i.d. (more benign), unknown/mixed, adversarial.

What if

1. we have no prior knowledge about the environment?
2. the environment is usually i.i.d., but we want to be robust to adversarial attacks?
3. the environment is usually arbitrary, but we want to exploit its benignness when we get lucky?

SLIDE 6

Our Results

◮ We propose the first semi-bandit algorithm that has optimal performance guarantees in both i.i.d. and adversarial environments, without knowing which environment it is in.

SLIDES 7–11

Formalizing Semi-bandits

Given: action set X = {X(1), X(2), ...} ⊆ {0, 1}^d (the set of all paths; d = #edges).

For t = 1, ..., T,

◮ The learner chooses Xt ∈ X (chooses a path).
◮ The environment reveals ℓti for each i with Xti = 1 (reveals the cost of each chosen edge).
◮ The learner suffers loss ⟨Xt, ℓt⟩ (the path cost).
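
To make the protocol concrete, here is a minimal Python sketch of the interaction loop. Everything in it (the toy action set, the Bernoulli edge losses, the placeholder policy) is our own illustrative assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 6, 1000
# Toy action set: three edge-disjoint "paths", each an indicator vector in {0,1}^d.
actions = [np.array(a) for a in ([1, 1, 0, 0, 0, 0],
                                 [0, 0, 1, 1, 0, 0],
                                 [0, 0, 0, 0, 1, 1])]
edge_means = rng.uniform(0.2, 0.8, size=d)   # i.i.d. environment (assumed)

total_loss = 0.0
for t in range(1, T + 1):
    X_t = actions[rng.integers(len(actions))]             # placeholder policy
    ell_t = rng.binomial(1, edge_means)                   # full loss vector (hidden)
    feedback = {i: ell_t[i] for i in np.flatnonzero(X_t)}  # semi-bandit feedback
    total_loss += X_t @ ell_t                             # suffer <X_t, ell_t>
```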

SLIDE 12

Semi-bandits Regret Bounds

Goal: minimize

  Regret = E[ Σ_{t=1}^T ⟨Xt, ℓt⟩ ] − min_{X∈X} E[ Σ_{t=1}^T ⟨X, ℓt⟩ ],

i.e., the learner's total cost minus the best fixed action's total cost.

◮ When the ℓt are i.i.d.: Regret = Θ(log T).
◮ When the ℓt are adversarially generated: Regret = Θ(√T).
◮ Our algorithm: always has O(√T) regret, but gets O(log T) when the losses happen to be i.i.d.
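
As a sanity check on the definition, the realized (rather than expected) regret against a fixed loss sequence can be computed directly; a small sketch, with empirical_regret being our own illustrative helper rather than anything from the paper:

```python
import numpy as np

def empirical_regret(chosen, losses, action_set):
    """Realized regret of the learner's actions against the best fixed
    action in hindsight.

    chosen:     (T, d) array of the learner's actions X_t in {0,1}^d
    losses:     (T, d) array of loss vectors ell_t
    action_set: list of candidate {0,1}^d action vectors
    """
    learner_cost = np.einsum("td,td->", chosen, losses)   # sum_t <X_t, ell_t>
    best_cost = min(losses.sum(axis=0) @ np.asarray(X) for X in action_set)
    return learner_cost - best_cost
```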

SLIDES 13–17

Related Work in Multi-armed Bandits (MAB)

MAB is the special case of semi-bandits with X = {e1, ..., ed}.

Algorithm / Idea:

◮ SAO [BS12], SAPO [AC16]: i.i.d. algorithm + non-i.i.d. detection.
◮ EXP3++ [SS14, SL17]: adversarial algorithm (EXP3) + sophisticated exploration mechanism.
◮ BROAD [WL18], T-INF [ZS19] (optimal): adversarial algorithm (FTRL with a special regularizer) + improved analysis.

Our work generalizes the idea of [WL18] and [ZS19] to semi-bandits.

SLIDES 18–21

Algorithm

Follow the Regularized Leader (FTRL) with learning rate ηt = 1/√t and regularizer Ψ.

For t = 1, 2, 3, ...

◮ Compute

  xt = argmin_{x ∈ Conv(X)} ⟨x, Σ_{s=1}^{t−1} ℓ̂s⟩ + ηt⁻¹ Ψ(x).

◮ Sample Xt such that E[Xt] = xt, and observe ℓti for each i with Xti = 1.
◮ Construct ℓt's unbiased estimator ℓ̂t: ℓ̂ti = ℓti · 1[Xti = 1] / xti.
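
A compact sketch of one run in the MAB special case (X = {e1, ..., ed}, so Conv(X) is the probability simplex and sampling Xt with E[Xt] = xt is trivial). The generic numerical solver and all names here are our own simplifications; general semi-bandits additionally need to decompose xt into a convex combination of actions, which we elide.

```python
import numpy as np
from scipy.optimize import minimize

def psi(x):
    """Two-sided hybrid regularizer: sum_i [ -sqrt(x_i) + (1-x_i)log(1-x_i) ]."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return np.sum(-np.sqrt(x) + (1 - x) * np.log(1 - x))

def ftrl_step(L_hat, t):
    """One FTRL iterate: argmin_{x in simplex} <x, L_hat> + psi(x)/eta_t."""
    d = len(L_hat)
    eta = 1.0 / np.sqrt(t)
    res = minimize(lambda x: x @ L_hat + psi(x) / eta,
                   np.full(d, 1.0 / d),
                   bounds=[(1e-9, 1 - 1e-9)] * d,
                   constraints=({"type": "eq", "fun": lambda x: x.sum() - 1.0},),
                   method="SLSQP")
    return res.x

rng = np.random.default_rng(0)
d, T = 5, 200
mu = rng.uniform(0.2, 0.8, d)        # i.i.d. Bernoulli losses (assumed)
L_hat = np.zeros(d)                  # cumulative loss estimates sum_s ell_hat_s
for t in range(1, T + 1):
    x_t = ftrl_step(L_hat, t)
    i = rng.choice(d, p=x_t / x_t.sum())   # sample X_t = e_i with E[X_t] = x_t
    loss_i = rng.binomial(1, mu[i])        # observe only the chosen coordinate
    L_hat[i] += loss_i / x_t[i]            # ell_hat_ti = ell_ti 1[X_ti=1] / x_ti
```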

SLIDES 22–23

Regularizer (Key Contribution)

Two-sided hybrid regularizer:

  Ψ(x) = Σ_{i=1}^d (−√xi) + Σ_{i=1}^d (1 − xi) log(1 − xi),

where the first sum is [AB09]'s Poly-INF regularizer and the second is the negative entropy of the complement.

Intuition:

◮ When xi is close to 0, the learner starves for information ⇒ like a bandit problem ⇒ use the optimal regularizer for bandits (Poly-INF).
◮ When xi is close to 1 ⇒ like a full-information problem ⇒ use the optimal regularizer for full information (negative entropy).
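
One way to see the two regimes is through the curvature of Ψ; differentiating coordinate-wise (our own arithmetic, not on the slide):

```latex
\Psi(x) = \sum_{i=1}^{d}\Bigl[-\sqrt{x_i} + (1-x_i)\log(1-x_i)\Bigr],
\qquad
\frac{\partial^2 \Psi}{\partial x_i^2}
  = \frac{1}{4\,x_i^{3/2}} + \frac{1}{1-x_i},
```

so the Poly-INF term dominates as xi → 0 (the bandit-like regime) and the complementary negative-entropy term dominates as xi → 1 (the full-information-like regime).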

SLIDES 24–26

Results Overview

◮ General X:
  i.i.d.: md log T / ∆min
  Adversarial: √(mdT)
◮ X = {X ∈ {0,1}^d : ‖X‖₁ = m}:
  i.i.d.: Σ_{i>m} log T / ∆i
  Adversarial: √(mdT) if m ≤ d/2; (d − m)√(T log d) if m > d/2
◮ X = {0,1}^d:
  i.i.d.: Σ_i log T / ∆i
  Adversarial: d√T

Here m := max_{X∈X} ‖X‖₁, and ∆min = E[second-best action's loss] − E[best action's loss] (the minimal optimality gap).

SLIDE 27

Analysis Steps

1. Analyze FTRL with the new regularizer and obtain the O(√T) bound for the adversarial setting.
2. Further use the self-bounding technique to obtain the O(log T) bound for the i.i.d. setting.

SLIDE 28

Analyzing FTRL for the New Regularizer

Key lemma.

  Reg ≤ Σ_{t=1}^T (1/√t) Σ_i min{ √xti, (1 − xti)(1 + log(1/(1 − xti))) }.

Remarks.

1. The analysis is mostly standard but needs more care (certain terms that are dropped in the usual analysis must be kept).
2. The two-sidedness of the regularizer is the key to getting the min{·, ·}.
3. From this bound, the O(√T) bound follows easily.
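
For remark 3, a one-line sketch of our own (not on the slide): bound each min term by its first argument √xti, apply Cauchy–Schwarz together with ‖xt‖₁ ≤ m, and sum 1/√t:

```latex
\mathrm{Reg}
\le \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \sum_{i=1}^{d} \sqrt{x_{ti}}
\le \sum_{t=1}^{T} \frac{\sqrt{d\,\lVert x_t\rVert_1}}{\sqrt{t}}
\le \sqrt{md}\,\sum_{t=1}^{T} \frac{1}{\sqrt{t}}
\le 2\sqrt{mdT},
```

which matches the √(mdT) adversarial bound in the results overview.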

SLIDES 29–33

Self-bounding to Get the O(log T) Bound

  Reg ≤ Σ_{t=1}^T (1/√t) Σ_i min{ √xti, (1 − xti)(1 + log(1/(1 − xti))) }.

Goal: upper bound the right-hand side by Σ_t C √(Pr[Xt ≠ X*]) / √t.

Intuitively true: Pr[Xt ≠ X*] → 0 ⇒ xt → X* ⇒ the expression above → 0.

Assume it is proved. Since each round with Xt ≠ X* incurs at least ∆min expected excess loss,

  Σ_t ∆min Pr[Xt ≠ X*] ≤ Reg ≤ Σ_t C √(Pr[Xt ≠ X*]) / √t
                              ≤ Σ_t C² / (2t∆min) + Σ_t ∆min Pr[Xt ≠ X*] / 2.   (AM-GM)

Thus,

  Σ_t ∆min Pr[Xt ≠ X*] ≤ Σ_{t=1}^T C² / (t∆min) = O(C² log T / ∆min)
  ⇒ Reg ≤ O(C² log T / ∆min).
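
The AM-GM step, spelled out (our own arithmetic): apply ab ≤ a²/2 + b²/2 with a = C/√(t∆min) and b = √(∆min Pr[Xt ≠ X*]):

```latex
\frac{C\sqrt{\Pr[X_t \neq X^*]}}{\sqrt{t}}
  = \frac{C}{\sqrt{t\,\Delta_{\min}}}
    \cdot \sqrt{\Delta_{\min}\Pr[X_t \neq X^*]}
  \le \frac{C^2}{2t\,\Delta_{\min}}
    + \frac{\Delta_{\min}\Pr[X_t \neq X^*]}{2}.
```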

SLIDE 34

Experiments (regret vs. time)

[Figure: regret vs. time under an i.i.d. environment (left) and a non-i.i.d. environment (right), comparing Exp2, Cucb, LogBar, TS, and Ours.]

SLIDE 35

Summary

◮ This paper considers semi-bandits and proposes the first single algorithm that has optimal regret guarantees in both adversarial and i.i.d. environments.
◮ The algorithm is a simple instantiation of the Follow the Regularized Leader framework. The keys to getting the O(log T) bound in the i.i.d. setting are to
  1. use the two-sided hybrid regularizer, and
  2. analyze it with the self-bounding technique.
◮ Experiments show that our algorithm indeed has best-of-both-worlds performance, while previous algorithms do not.

Poster #126