On adaptive regret bounds for non-stochastic bandits (Gergely Neu)



SLIDE 1

On adaptive regret bounds for non-stochastic bandits

Gergely Neu

INRIA Lille, SequeL team

β†’ Universitat Pompeu Fabra, Barcelona

SLIDE 2

Outline

  • Online learning and bandits
  • Adaptive bounds in online learning
  • Adaptive bounds for bandits
    β€Ί What we already have
    β€Ί What's new: first-order bounds
    β€Ί What may be possible
    β€Ί What seems impossible*

*Opinion alert!

SLIDES 3–5

Online learning and non-stochastic bandits

Full information: for each round t = 1, 2, ..., T
  • Learner chooses action I_t ∈ {1, 2, ..., N}
  • Environment chooses losses \ell_{t,i} ∈ [0,1] for all i
  • Learner suffers loss \ell_{t,I_t}
  • Learner observes the losses \ell_{t,i} for all i

Bandit feedback: for each round t = 1, 2, ..., T
  • Learner chooses action I_t ∈ {1, 2, ..., N}
  • Environment chooses losses \ell_{t,i} ∈ [0,1] for all i
  • Learner suffers loss \ell_{t,I_t}
  • Learner observes only its own loss \ell_{t,I_t}

Need to explore!
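To make the two feedback models concrete, here is a minimal Python sketch of the interaction protocol. The uniformly random learner and the loss matrix are hypothetical placeholders, not something from the talk.

```python
import random

def run_protocol(T, N, losses, full_information=False):
    """Simulate T rounds with N actions; losses[t][i] is in [0, 1].

    Returns the total loss of a (deliberately naive) uniform learner.
    """
    total_loss = 0.0
    for t in range(T):
        # Learner chooses an action; here uniformly at random.
        I_t = random.randrange(N)
        # Learner suffers the loss of the chosen action.
        total_loss += losses[t][I_t]
        # Feedback: every loss under full information, only the
        # chosen action's loss under bandit feedback.
        if full_information:
            feedback = list(losses[t])
        else:
            feedback = {I_t: losses[t][I_t]}
        # A real learner would update its action distribution from `feedback`.
    return total_loss

random.seed(0)
T, N = 100, 5
losses = [[random.random() for _ in range(N)] for _ in range(T)]
print(run_protocol(T, N, losses, full_information=False))
```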

SLIDES 6–7

Minimax regret

  • Define the (expected) regret against action i as

      R_{T,i} = \mathbb{E}\Big[ \sum_{t=1}^{T} \ell_{t,I_t} \Big] - \sum_{t=1}^{T} \ell_{t,i}

  • Goal: minimize the regret against the best action i^*:

      R_T = R_{T,i^*} = \max_i R_{T,i}

  • Minimax rates:

      Full information: R_T = \Theta\big(\sqrt{T \log N}\big)        Bandit: R_T = \Theta\big(\sqrt{NT}\big)

SLIDE 8

Beyond minimax: i.i.d. losses

                   Full information                        Bandit
  worst case       R_T = \Theta\big(\sqrt{T \log N}\big)   R_T = \Theta\big(\sqrt{NT}\big)
  i.i.d. losses    \Theta(\log N)                          \Theta(N \log T)

SLIDES 9–11

Beyond minimax: "Higher-order" bounds

  minimax:
      full information: R_T = O\big(\sqrt{T \log N}\big)        bandit: R_T = O\big(\sqrt{NT}\big)
  first-order, L_{T,i} = \sum_{t=1}^{T} \ell_{t,i}:
      full information: R_T = O\big(\sqrt{L_{T,i^*} \log N}\big)        bandit: (it's complicated)
  second-order, S_{T,i} = \sum_{t=1}^{T} \ell_{t,i}^2:
      full information: R_T = O\big(\sqrt{S_{T,i^*} \log N}\big)        bandit: R_T = O\big(\sqrt{\sum_i S_{T,i}}\big)
  variance, V_{T,i} = \sum_{t=1}^{T} (\ell_{t,i} - \bar\ell_{T,i})^2:
      full information: R_T = O\big(\sqrt{V_{T,i^*} \log N}\big)  (with a little cheating)        bandit: R_T = O\big(N^2 \sqrt{\sum_i V_{T,i}}\big)

Full information: Cesa-Bianchi, Mansour and Stoltz (2005); Hazan and Kale (2010) ("with a little cheating" for the variance bound).
Bandit: Auer et al. (2002) + some hacking; Hazan and Kale (2010); Hazan and Kale (2011).

SLIDES 12–13

First-order bounds for bandits

  • "Small-gain" bounds: consider the gain game with g_{t,i} = 1 - \ell_{t,i}
  • Auer, Cesa-Bianchi, Freund and Schapire (2002):

      R_T = O\big(\sqrt{N \, G_{T,i^*} \log N}\big),    where G_{T,i} = \sum_{t=1}^{T} g_{t,i}

Problem: only good if the best expert is bad!
SLIDES 14–15

First-order bounds for bandits

  • "Small-gain" bounds: R_T = O\big(\sqrt{N G_{T,i^*} \log N}\big)
  • A slightly trickier analysis gives

      R_T = O\Big(\sqrt{\sum_t \sum_i g_{t,i} \, \log N}\Big)    or    R_T = O\Big(\sqrt{\sum_t \sum_i \ell_{t,i} \, \log N}\Big)

Problem: one misbehaving action ruins the bound!
SLIDES 16–17

First-order bounds for bandits

  • "Small-gain" bounds: R_T = O\big(\sqrt{N G_{T,i^*} \log N}\big)
  • A slightly trickier analysis gives R_T = O\big(\sqrt{\sum_t \sum_i \ell_{t,i} \log N}\big)
  • Some obscure actual first-order bounds (writing L_T^* = L_{T,i^*}):
    β€Ί Stoltz (2005): N \sqrt{L_T^*}
    β€Ί Allenberg, Auer, GyΓΆrfi and OttucsΓ‘k (2006): \sqrt{N L_T^*}
    β€Ί Rakhlin and Sridharan (2013): N^{3/2} \sqrt{L_T^*}

Problem: no real insight from the analyses!

SLIDE 18

First-order bounds for non-stochastic bandits

SLIDE 19

A typical bandit algorithm

For every round t = 1, 2, ..., T
  • Choose arm I_t = i with probability p_{t,i}
  • Compute the unbiased loss estimate

      \hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i}} \, \mathbb{I}\{I_t = i\}

  • Feed the \hat\ell_{t,i} into a black-box online learning algorithm to compute p_{t+1}
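As a concrete instance of this template, here is a minimal sketch of exponential weights (EXP3-style) run on the importance-weighted estimates. The environment function get_loss is a hypothetical placeholder, and the learning rate is not tuned.

```python
import math
import random

def exp3(T, N, get_loss, eta=0.1):
    """Exponential weights over importance-weighted loss estimates."""
    L_hat = [0.0] * N  # cumulative loss estimates \hat L_{t,i}
    total_loss = 0.0
    for t in range(T):
        # p_{t,i} proportional to exp(-eta * \hat L_{t-1,i}); shift by the min for stability.
        low = min(L_hat)
        w = [math.exp(-eta * (L - low)) for L in L_hat]
        Z = sum(w)
        p = [wi / Z for wi in w]
        # Choose arm I_t = i with probability p_{t,i}.
        I_t = random.choices(range(N), weights=p)[0]
        loss = get_loss(t, I_t)
        total_loss += loss
        # Unbiased estimate: nonzero only for the arm actually played.
        L_hat[I_t] += loss / p[I_t]
    return total_loss

random.seed(1)
print(exp3(T=1000, N=5,
           get_loss=lambda t, i: random.random() * (0.5 if i == 0 else 1.0)))
```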

SLIDES 20–23

A typical regret bound

    R_T \le \frac{\log N}{\eta} + \eta \, \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} \, \hat\ell_{t,i}^{\,2} \Big]        (\eta: "learning rate")

        \le \frac{\log N}{\eta} + \eta \, \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i=1}^{N} \hat\ell_{t,i} \Big]        (since p_{t,i} \hat\ell_{t,i}^2 \le \hat\ell_{t,i} for losses in [0,1])

        = \frac{\log N}{\eta} + \eta \sum_{i=1}^{N} L_{T,i}        (since \mathbb{E}[\hat\ell_{t,i}] = \ell_{t,i})

        = O\Big( \sqrt{ \sum_{i=1}^{N} L_{T,i} \, \log N } \Big)        (for appropriate \eta)
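For completeness, the final step is the standard learning-rate tuning; a short worked version (choosing \eta in terms of the unknown sum is the usual idealization, removable with a doubling trick or the adaptive rate used later in the talk):

```latex
\[
\min_{\eta > 0}\left\{ \frac{\log N}{\eta} + \eta S \right\}
  = 2\sqrt{S \log N}
  \quad\text{at}\quad
  \eta^* = \sqrt{\frac{\log N}{S}},
  \qquad S := \sum_{i=1}^{N} L_{T,i}.
\]
```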

SLIDES 24–27

A typical regret bound

    R_T = O\Big( \sqrt{ \sum_{i=1}^{N} L_{T,i} \, \log N } \Big)

It's all because \mathbb{E}[\hat L_{T,i}] = L_{T,i}, where \hat L_{T,i} = \sum_{t=1}^{T} \hat\ell_{t,i}!!!

Idea: try to enforce \mathbb{E}[\hat L_{T,i}] = O(L_{T,i^*})

Need optimistic estimates!

SLIDES 28–29

A typical algorithm – fixed!

For every round t = 1, 2, ..., T
  • Choose arm I_t = i with probability p_{t,i}
  • Compute the biased loss estimate

      \hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \, \mathbb{I}\{I_t = i\}        ("implicit exploration")

  • Feed the \hat\ell_{t,i} into a black-box online learning algorithm to compute p_{t+1}

(KocΓ‘k, Neu, Valko and Munos, 2015)

SLIDE 30

Algorithm: Follow the Perturbed Leader
(Kalai and Vempala, 2005; Poland, 2005)

For every round t = 1, 2, ..., T
  • Draw perturbations z_{t,i} ~ Exp(1) for all i
  • Choose arm

      I_t = \arg\min_i \big( \eta \hat L_{t-1,i} - z_{t,i} \big)

  • Compute the biased loss estimate ("implicit exploration", KocΓ‘k, Neu, Valko and Munos, 2015)

      \hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \, \mathbb{I}\{I_t = i\}

  • Update \hat L_{t,i} = \hat L_{t-1,i} + \hat\ell_{t,i}
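A runnable sketch of FPL with IX estimates. Since FPL never computes p_{t,i} explicitly, the sketch approximates p_{t,I_t} by plain Monte Carlo over fresh perturbations; the actual paper uses the more refined geometric resampling trick for this, and the environment function is again a placeholder.

```python
import random

def fpl_ix(T, N, get_loss, eta=0.05, gamma=0.025, n_mc=200):
    """Follow the Perturbed Leader with implicit-exploration (IX) estimates."""
    L_hat = [0.0] * N  # cumulative loss estimates \hat L_{t,i}

    def leader(z):
        # arg min_i (eta * \hat L_{t-1,i} - z_i)
        return min(range(N), key=lambda i: eta * L_hat[i] - z[i])

    total_loss = 0.0
    for t in range(T):
        I_t = leader([random.expovariate(1.0) for _ in range(N)])
        loss = get_loss(t, I_t)
        total_loss += loss
        # Crude Monte Carlo estimate of p_{t,I_t} (the paper: geometric resampling).
        hits = sum(leader([random.expovariate(1.0) for _ in range(N)]) == I_t
                   for _ in range(n_mc))
        p = max(hits / n_mc, 1.0 / n_mc)
        # Biased IX estimate: divide by p + gamma instead of p.
        L_hat[I_t] += loss / (p + gamma)
    return total_loss

random.seed(2)
print(fpl_ix(T=500, N=5,
             get_loss=lambda t, i: random.random() * (0.6 if i == 0 else 1.0)))
```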

SLIDES 31–35

Implicit exploration in action

    \hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \, \mathbb{I}\{I_t = i\}

[Figure: the true losses, the unbiased importance-weighted estimates, the unbiased estimates with added uniform exploration, and the IX estimates.]

SLIDES 36–37

Optimistic estimates

Lemma (Neu, 2015a): Assume that z_{t,i} \le B for all t and i. Then, for any i and j,

    \hat L_{T,i} \le \hat L_{T,j} + \frac{\log N + B}{\eta} + \frac{1}{\gamma}

All perturbations are nicely bounded with high probability β†’ bad arms are suppressed!

SLIDES 38–42

Suppressing bad arms

[Figure: the loss estimates \hat\ell_{t,1}, ..., \hat\ell_{t,6} of six arms against the threshold B.]

Bad arms are no longer drawn, so \hat L_{t,i} stops growing if i is bad.

SLIDES 43–44

Finally: a first-order bound!

    R_T \le \frac{\log N}{\eta} + \eta \, \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} \, \hat\ell_{t,i}^{\,2} \Big]

        \le \frac{\log N}{\eta} + \eta \, \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{i=1}^{N} \hat\ell_{t,i} \Big]

        \le \frac{\log N}{\eta} + \eta N L_{T,i^*} + N (B + \log N)        (by the Lemma)

        = O\big( \sqrt{N L_{T,i^*}} \big)        (for appropriate \eta, \gamma, B)

SLIDES 45–48

Finally: a first-order bound!

    R_T = O\big( \sqrt{N L_{T,i^*}} \big)

Parameters:
  • Set \gamma = \eta / 2
  • If we know L_{T,i^*}:    \eta = \sqrt{\dfrac{\log N + 1}{N L_{T,i^*}}}
  • If we don't:    \eta_t = \sqrt{\dfrac{\log N + 1}{N \big( 1 + \min_i \hat L_{t-1,i} \big)}}

Arguments also extend to combinatorial semi-bandits!
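The adaptive learning rate is easy to splice into an implementation; below is a minimal illustration, written for exponential weights rather than FPL purely to keep the sketch short (the talk's results are stated for FPL), with the usual placeholder environment.

```python
import math
import random

def exp3_ix_adaptive(T, N, get_loss):
    """EXP3 with IX estimates and the adaptive parameters above:
    eta_t = sqrt((log N + 1) / (N (1 + min_i \hat L_{t-1,i}))), gamma_t = eta_t / 2."""
    L_hat = [0.0] * N
    total_loss = 0.0
    for t in range(T):
        eta = math.sqrt((math.log(N) + 1.0) / (N * (1.0 + min(L_hat))))
        gamma = eta / 2.0
        low = min(L_hat)
        w = [math.exp(-eta * (L - low)) for L in L_hat]
        Z = sum(w)
        p = [wi / Z for wi in w]
        I_t = random.choices(range(N), weights=p)[0]
        loss = get_loss(t, I_t)
        total_loss += loss
        L_hat[I_t] += loss / (p[I_t] + gamma)  # IX estimate
    return total_loss

random.seed(3)
print(exp3_ix_adaptive(T=1000, N=5,
                       get_loss=lambda t, i: random.random() * (0.4 if i == 0 else 1.0)))
```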

SLIDE 49

What’s next?

SLIDES 50–52

Beyond minimax: "Higher-order" bounds

  first-order, L_{T,i} = \sum_t \ell_{t,i}:
      full information: R_T = O\big(\sqrt{L_{T,i^*} \log N}\big)        bandit: R_T = O\big(\sqrt{N L_{T,i^*}}\big)  ← the new bound
  second-order, S_{T,i} = \sum_t \ell_{t,i}^2:
      full information: R_T = O\big(\sqrt{S_{T,i^*} \log N}\big)        bandit: R_T = O\big(\sqrt{\sum_i S_{T,i}}\big)
  variance, V_{T,i} = \sum_t (\ell_{t,i} - \bar\ell_{T,i})^2:
      full information: R_T = O\big(\sqrt{V_{T,i^*} \log N}\big)  (with a little cheating)        bandit: R_T = O\big(N^2 \sqrt{\sum_i V_{T,i}}\big)

(References as on slides 9–11: Cesa-Bianchi, Mansour and Stoltz (2005); Hazan and Kale (2010, 2011); Auer et al. (2002) + some hacking.)

What about these? That is, can the bandit second-order and variance bounds be made to depend on the best arm i^* alone?

SLIDES 53–56

Beyond first-order bounds?

A key tool for adaptive bounds in full info: PROD (Cesa-Bianchi, Mansour and Stoltz, 2005)

    p_{t,i} \propto \prod_{s=1}^{t-1} \big( 1 - \eta \ell_{s,i} \big)

Used for proving
  • Second-order bounds (Cesa-Bianchi et al., 2005)
  • Variance-dependent bounds (Hazan and Kale, 2010)
  • Path-length bounds (Steinhardt and Liang, 2014)
  • Quantile bounds (Koolen and Van Erven, 2015)
  • Best-of-both-worlds bounds (Sani et al., 2014)
  • ...

But does it work for bandits?
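To make the update concrete, here is a minimal full-information PROD sketch; the loss matrix is a placeholder, and eta must satisfy eta * loss < 1 so the weights stay positive.

```python
import random

def prod_full_info(T, N, losses, eta=0.1):
    """PROD: multiplicative update w_i <- w_i * (1 - eta * loss_i)."""
    w = [1.0] * N
    total_loss = 0.0
    for t in range(T):
        Z = sum(w)
        p = [wi / Z for wi in w]
        I_t = random.choices(range(N), weights=p)[0]
        total_loss += losses[t][I_t]
        # Full information: every arm's weight is updated with its true loss.
        for i in range(N):
            w[i] *= 1.0 - eta * losses[t][i]
    return total_loss

random.seed(4)
T, N = 1000, 5
losses = [[random.random() for _ in range(N)] for _ in range(T)]
print(prod_full_info(T, N, losses))
```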

SLIDES 57–60

Why does PROD fail?

Run on the loss estimates, PROD is exponential weights in disguise:

    p_{t,i} \propto \prod_{s=1}^{t-1} \big( 1 - \eta \hat\ell_{s,i} \big) = e^{-\eta \sum_{s=1}^{t-1} \tilde\ell_{s,i}}
    \quad \text{with} \quad \tilde\ell_{t,i} = -\frac{1}{\eta} \log\big( 1 - \eta \hat\ell_{t,i} \big)

That is EXP3 with a pessimistic estimate: \mathbb{E}\big[\tilde\ell_{t,i}\big] \ge \ell_{t,i}.

By contrast,

    p_{t,i} \propto e^{-\eta \sum_{s=1}^{t-1} \tilde\ell_{s,i}}
    \quad \text{with} \quad \tilde\ell_{t,i} = \frac{1}{\eta} \log\big( 1 + \eta \hat\ell_{t,i} \big)

is approximately EXP3 with implicit exploration, \hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}\{I_t = i\}.
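The pessimism is just Jensen's inequality, since -log(1 - x) is convex; here is a quick single-round numeric check with made-up numbers.

```python
import math

# One round, one arm: played with probability p, true loss ell.
p, ell, eta = 0.2, 0.5, 0.1
ell_hat = ell / p  # importance-weighted estimate when the arm is played (else 0)

# Unbiased: E[ell_hat] = p * (ell / p) = ell.
print(p * ell_hat)                                     # 0.5

# PROD's implied estimate over-shoots in expectation (pessimistic):
tilde_prod = -(1.0 / eta) * math.log(1.0 - eta * ell_hat)
print(p * tilde_prod)                                  # about 0.575 > 0.5

# The log(1 + eta x) variant under-shoots (optimistic, like IX):
tilde_ix = (1.0 / eta) * math.log(1.0 + eta * ell_hat)
print(p * tilde_ix)                                    # about 0.446 < 0.5
```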

SLIDES 61–62

PROD for bandits

[Figure: the true losses vs. PROD's loss estimates.]

SLIDES 63–66

Summary

  • Key for first-order bounds: implicit exploration

      \hat\ell_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \, \mathbb{I}\{I_t = i\}

    (also key for high-probability bounds! NIPS 2015)
  • Further bounds seem difficult to prove: smoothness conflicts with the need to explore!
  • More depressing results by Lattimore (NIPS 2015)
  • Still open for bandits (?):
    β€Ί Second-order bounds (Cesa-Bianchi et al., 2005)
    β€Ί Variance-dependent bounds (Hazan and Kale, 2010)
    β€Ί Path-length bounds (Steinhardt and Liang, 2014)
    β€Ί Quantile bounds (Koolen and Van Erven, 2015)

SLIDE 67

Thanks!

SLIDE 69

Appendix

SLIDE 70

First-order bounds for combinatorial semi-bandits

SLIDES 71–73

Combinatorial semi-bandits

For every round t = 1, 2, ..., T
  • Learner picks an action V_t ∈ S βŠ† \{0,1\}^d
  • Environment chooses a loss vector \ell_t ∈ [0,1]^d
  • Learner suffers loss V_t^\top \ell_t
  • Learner observes the losses V_{t,i} \ell_{t,i}

Decision set: S = \{ v_i \}_{i=1}^{N} \subseteq \{0,1\}^d with \|v_i\|_1 \le m
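Here is a minimal sketch of the semi-bandit protocol, with the decision set instantiated as all m-subsets of d coordinates; the environment and the random learner are placeholders.

```python
import random
from itertools import combinations

def run_semi_bandit(T, d, m, get_loss_vector, pick_action):
    """Simulate the combinatorial semi-bandit protocol above."""
    # Decision set S: indicator vectors of all m-subsets of {0, ..., d-1}.
    S = [tuple(1 if i in c else 0 for i in range(d))
         for c in combinations(range(d), m)]
    history = []
    total_loss = 0.0
    for t in range(T):
        V_t = pick_action(S, history)
        ell = get_loss_vector(t)  # hidden loss vector in [0, 1]^d
        total_loss += sum(v * l for v, l in zip(V_t, ell))
        # Semi-bandit feedback: losses on the chosen coordinates only.
        history.append({i: ell[i] for i in range(d) if V_t[i]})
    return total_loss

random.seed(5)
print(run_semi_bandit(
    T=100, d=6, m=2,
    get_loss_vector=lambda t: [random.random() for _ in range(6)],
    pick_action=lambda S, h: random.choice(S)))  # naive placeholder learner
```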

SLIDES 74–75

Combinatorial semi-bandits

  • Goal: minimize the (expected) regret

      R_T = \max_{v \in S} \, \mathbb{E}\Big[ \sum_{t=1}^{T} (V_t - v)^\top \ell_t \Big]

  • Minimax regret: R_T = \Theta\big( \sqrt{m d T} \big)
  • Best efficient algorithm (FPL) gives R_T = O\big( m \sqrt{d T \log d} \big)
  • Our bound: R_T = O\big( m \sqrt{d L_T^* \log d} \big)