
Tight Bounds on Minimax Regret under Logarithmic Loss via Self-Concordance

Blair Bilodeau^{1,2,3}, Dylan J. Foster^{4}, and Daniel M. Roy^{1,2,3}
Presented at the 2020 International Conference on Machine Learning

^{1}Department of Statistical Sciences, University of Toronto  ^{2}Vector Institute  ^{3}Institute for Advanced Study  ^{4}Institute for Foundations of Data Science, Massachusetts Institute of Technology

Contextual Online Learning with Log Loss

Example: Image Identification. For rounds t = 1, . . . , n:

  • Receive an image. Context x_t ∈ X
  • Assign a probability to whether the image is adversarially generated. Prediction p̂_t ∈ [0, 1]
  • Observe the true label. Observation y_t ∈ {0, 1}
  • Incur a penalty. Loss ℓ_log(p̂_t, y_t) = −y_t log(p̂_t) − (1 − y_t) log(1 − p̂_t)

Notice that ℓ_log equals the negative log-likelihood of y_t under the model p̂_t.

Challenges

  • We do not rely on data-generating assumptions.
  • ℓ_log is neither bounded nor Lipschitz.
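
The log loss is simple to compute directly; a minimal sketch in plain Python (the function name is ours):

```python
import math

def log_loss(p_hat, y):
    """Negative log-likelihood of the binary label y under the model p_hat."""
    # p_hat in (0, 1) is the predicted probability that y = 1.
    return -y * math.log(p_hat) - (1 - y) * math.log(1 - p_hat)

# A confident correct prediction costs little; a confident mistake costs a lot.
# The loss diverges as p_hat -> 0 with y = 1, which is why it is neither
# bounded nor Lipschitz on [0, 1].
print(log_loss(0.99, 1))  # small
print(log_loss(0.01, 1))  # large (about 4.6)
```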

Measuring Performance with Regret

Without model assumptions, guaranteeing small loss on predictions is impossible. If I can't make promises about the future, can I say something about the past? Consider a relative notion of performance in hindsight.

  • Relative to a class F ⊆ {f : X → [0, 1]}, consisting of experts f ∈ F.
  • Compete against the optimal f ∈ F on the actual sequence of observations.

Regret:  R_n(p̂; F, x, y) = ∑_{t=1}^{n} ℓ_log(p̂_t, y_t) − inf_{f∈F} ∑_{t=1}^{n} ℓ_log(f(x_t), y_t).

This quantity depends on

  • p̂: the player's predictions,
  • F: the expert class,
  • x: the observed contexts,
  • y: the observed data points.
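
With a finite expert class, the regret is a direct computation. A toy sketch (plain Python; the experts and sequences are arbitrary choices of ours):

```python
import math

def log_loss(p, y):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def regret(p_hat, experts, xs, ys):
    """Player's cumulative log loss minus the best expert's loss in hindsight."""
    player = sum(log_loss(p, y) for p, y in zip(p_hat, ys))
    best = min(sum(log_loss(f(x), y) for x, y in zip(xs, ys)) for f in experts)
    return player - best

# Two constant experts, a uniform player, and a short label sequence.
experts = [lambda x: 0.25, lambda x: 0.75]
xs = [None] * 4                 # contexts are unused by constant experts
ys = [1, 1, 0, 1]
print(regret([0.5] * 4, experts, xs, ys))  # positive: the hindsight expert wins
```

Note that regret can also be negative when the player happens to beat every expert on the realized sequence.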

Summary of Results

We control the minimax regret using the sequential entropy of the experts F.

  • Minimax regret: the smallest possible regret under worst-case observations.
  • Sequential entropy: a data-dependent complexity measure for F.

Contributions

  • Improved upper bound for expert classes with polynomial sequential entropy.
  • Novel proof technique that exploits the curvature of the log loss to avoid a key "truncation step" used by previous works.
  • Resolution of the minimax regret with log loss for Lipschitz experts on [0, 1]^p, with matching lower bounds.
  • Conclusion that the minimax regret with log loss cannot be completely characterized using sequential entropy.

Minimax Regret

Regret:  R_n(p̂; F, x, y) = ∑_{t=1}^{n} ℓ_log(p̂_t, y_t) − inf_{f∈F} ∑_{t=1}^{n} ℓ_log(f(x_t), y_t).

Minimax regret: an algorithm-free quantity on worst-case observations.

R_n(F) = sup_{x_1} inf_{p̂_1} sup_{y_1} sup_{x_2} inf_{p̂_2} sup_{y_2} ··· sup_{x_n} inf_{p̂_n} sup_{y_n} R_n(p̂; F, x, y).

In each round, the context is observed, the player makes their prediction, and the adversary plays an observation; this repeats for all n rounds.

Interpretation: the experts F are minimax online learnable if R_n(F) = o(n).

  • slow rate: R_n(F) = Θ(√n)
  • fast rate: R_n(F) ≤ O(log(n))
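
The nested sup-inf can be evaluated exactly on tiny instances by backward induction. A sketch (all choices ours: two constant experts, no contexts, and the player's inf restricted to a finite grid, so this approximates the true minimax value):

```python
import math
from functools import lru_cache

def log_loss(p, y):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# Toy instance: two constant experts and a discretized set of player predictions.
EXPERTS = [0.25, 0.75]
GRID = [i / 100 for i in range(1, 100)]

def minimax_regret(n):
    """Backward induction over label histories: inf over p-hat, sup over y."""
    def best_expert_loss(ys):
        return min(sum(log_loss(f, y) for y in ys) for f in EXPERTS)

    @lru_cache(maxsize=None)
    def value(ys):
        if len(ys) == n:
            # End of game: subtract the best expert's hindsight loss.
            return -best_expert_loss(ys)
        # Player moves (min over the grid), then the adversary moves (max over y).
        return min(max(log_loss(p, y) + value(ys + (y,)) for y in (0, 1))
                   for p in GRID)

    return value(())

print(minimax_regret(1))  # log(3/2): play p = 1/2, either label costs log(0.75/0.5)
```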

Covering Numbers

Goal: Obtain regret bounds using a notion of complexity of the expert class F.

Covering Numbers

  • Define a notion of distance between experts, d(f, g).
  • Find the smallest G ⊆ F so that for each f ∈ F, there is a g ∈ G with d(f, g) ≤ γ.
  • The covering number for F is |G|, and the entropy is log(|G|).

Uniform Covering

d(f, g) = sup_{x∈X} sup_{y∈{0,1}} |ℓ_log(f(x), y) − ℓ_log(g(x), y)|

A uniform covering may be infinite for large expert classes. Instead, we use the sequential covering of Rakhlin and Sridharan (2014).
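
For intuition, here is the covering number of a toy class of constant experts f_c(x) = c, c ∈ [0.01, 0.99], under the simpler sup-norm distance |c − c'| rather than the log-loss distance (all names and choices are ours):

```python
import math

def covering_number(lo, hi, gamma):
    """Fewest grid points so that every c in [lo, hi] is within gamma of one."""
    return max(1, math.ceil((hi - lo) / (2 * gamma)))

def cover(lo, hi, gamma):
    """Midpoints of equal subintervals: an explicit gamma-cover G."""
    n = covering_number(lo, hi, gamma)
    step = (hi - lo) / n
    return [lo + step * (i + 0.5) for i in range(n)]

gamma = 0.05
G = cover(0.01, 0.99, gamma)
entropy = math.log(len(G))  # the entropy is log |G|
print(len(G), round(entropy, 3))  # 10 grid points suffice at margin 0.05
```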

Sequential Covering

Key characteristics of sequential covering:

  • Only need to cover the expert predictions on the actual observed contexts.
  • The cover respects the sequential dependency of the online game.

We encode the sequential nature of x_t and y_t using binary trees: a context tree x assigns to round t a context x_t(y) that depends on the path of observations y so far.

A class of trees V sequentially covers F at margin γ on context tree x if:

sup_{f∈F} sup_{y∈{0,1}^n} inf_{v∈V} sup_{t∈[n]} |f(x_t(y)) − v_t(y)| ≤ γ.

Observations

  • V is chosen after observing x, so it doesn't have to apply to all of X.
  • v ∈ V is chosen with knowledge of y, the actual path of observations.

Definitions

  • The size of the smallest such V for x is N_∞(F ∘ x, γ).
  • The sequential entropy for n rounds is H_∞(F, γ, n) = sup_x log(N_∞(F ∘ x, γ)).
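
The margin condition can be checked by brute force on a tiny instance. A sketch (every choice here is ours: n = 2, experts f_w(x) = w·x with w ∈ [0, 1], one fixed context tree, and one candidate tree per grid slope):

```python
import itertools
import math

GAMMA = 0.1
X1 = 0.5
X2 = {0: 0.2, 1: 0.9}  # the round-2 context depends on the first label

def expert_path_values(w, y):
    """Predictions of f_w(x) = w * x along the path y = (y1, y2)."""
    return (w * X1, w * X2[y[0]])

def tree_path_values(v, y):
    """Values of the tree v = (v1, {0: v2_left, 1: v2_right}) along y."""
    return (v[0], v[1][y[0]])

# One candidate tree per slope on a grid of spacing 2 * GAMMA.
slopes = [GAMMA * (2 * i + 1) for i in range(math.ceil(1 / (2 * GAMMA)))]
V = [(w * X1, {0: w * X2[0], 1: w * X2[1]}) for w in slopes]

def is_sequential_cover(V, gamma, resolution=200):
    """Check sup_f sup_y inf_v sup_t |f(x_t(y)) - v_t(y)| <= gamma on a grid of f."""
    for k in range(resolution + 1):
        w = k / resolution
        for y in itertools.product((0, 1), repeat=2):
            fv = expert_path_values(w, y)
            gap = min(max(abs(a - b) for a, b in zip(fv, tree_path_values(v, y)))
                      for v in V)
            if gap > gamma:
                return False
    return True

print(is_sequential_cover(V, GAMMA))  # True with only len(V) = 5 trees
```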

Improved Minimax Bounds

Theorem (BFR '20). There exists c > 0 such that for all F,

R_n(F) ≤ inf_{γ>0} { 4nγ + c · H_∞(F, γ, n) }.

Upper Bound (Computation). If H_∞(F, γ, n) = Θ(γ^{−p}) for p > 0, then choosing γ = Θ(n^{−1/(p+1)}) to balance the two terms gives R_n(F) ≤ O(n^{p/(p+1)}).

Theorem (BFR '20). If p ∈ ℕ, there exists an F with H_∞(F, γ, n) = Θ(γ^{−p}) and R_n(F) ≥ Ω(n^{p/(p+1)}).
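
The computation behind the upper bound is a tradeoff: shrinking γ shrinks the 4nγ term but inflates the entropy term. A numeric sketch (the constant c = 1 and the grid are arbitrary choices of ours):

```python
def bound(n, p, c=1.0, grid=10**5):
    """Minimize 4*n*gamma + c * gamma**(-p) over a gamma grid in (0, 1)."""
    return min(4 * n * g + c * g ** (-p)
               for g in (k / grid for k in range(1, grid)))

p = 2
r_small, r_big = bound(10**3, p), bound(10**6, p)
# Scaling n by 1000 should scale the bound by about 1000**(p/(p+1)) = 100.
print(r_big / r_small)
```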

Applications

  • 1-Lipschitz:

    F = {f : [0, 1]^p → [0, 1] | |f(x) − f(y)| ≤ ‖x − y‖ for all x, y ∈ [0, 1]^p},  with H_∞(F, γ, n) = Θ(γ^{−p}).

    We have matching upper and lower bounds for this class, so: R_n(F) = Θ(n^{p/(p+1)}).

  • Linear Predictors:

    F = {f | ∃ w with ‖w‖₂ ≤ 1 such that f(x) = (1/2)(1 + ⟨w, x⟩) for all ‖x‖₂ ≤ 1},  with H_∞(F, γ, n) = Θ̃(γ^{−2}).

    Our upper bound prescribes R_n(F) ≤ Õ(n^{2/3}). However, Rakhlin and Sridharan (2015) showed (with an explicit algorithm) that R_n(F) ≤ Õ(√n). Our upper bound cannot be improved, so the minimax regret under log loss cannot be characterized solely by sequential entropy.
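
The linear predictor f(x) = (1 + ⟨w, x⟩)/2 is a valid probability whenever ‖w‖₂ ≤ 1 and ‖x‖₂ ≤ 1, by Cauchy-Schwarz. A quick randomized check (dimension and sample count are arbitrary choices of ours):

```python
import math
import random

def predict(w, x):
    """Linear predictor f_w(x) = (1 + <w, x>) / 2."""
    return (1 + sum(wi * xi for wi, xi in zip(w, x))) / 2

def random_ball_point(dim, rng):
    """A random point with Euclidean norm at most 1."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    r = rng.random()  # random radius in (0, 1)
    return [c * r / norm for c in v]

rng = random.Random(0)
probs = [predict(random_ball_point(3, rng), random_ball_point(3, rng))
         for _ in range(1000)]
print(min(probs) >= 0 and max(probs) <= 1)  # True: predictions stay in [0, 1]
```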