SLIDE 1

Improved Bounds on Minimax Regret under Logarithmic Loss via Self-Concordance

Blair Bilodeau (1,2) with Dylan J. Foster (3) and Daniel M. Roy (1,2), March 11, 2020

(1) Department of Statistical Sciences, University of Toronto; (2) Vector Institute; (3) Institute for Foundations of Data Science, Massachusetts Institute of Technology

SLIDE 2

Motivation


SLIDE 4

Weather Forecasting

Goal: forecast the probability of rain from historical data and current conditions.

Considerations
  • Which assumptions to make about historical trends continuing?
  • How many physical relationships should be incorporated in the model?
  • Are some missed predictions more expensive than others?
SLIDE 5

Traditional Statistical Learning

  • Receive a batch of data
  • Estimate a prediction function ĥ
  • Evaluate performance on new data assumed to be from the same distribution
SLIDE 6

Traditional Statistical Learning

But what if there’s a changepoint...

SLIDE 7

Traditional Statistical Learning

...or your training data isn’t even i.i.d.?

SLIDE 10

Statistical Solutions

We want to remove assumptions about the data generating process. In particular, future data may not be i.i.d. with past data. Statistics does this with, for example,

  • Markov assumption
  • stationarity assumption (time series)
  • covariance structure assumption (e.g., Gaussian process)

But these assumptions are often uncheckable or false.

SLIDE 11

Online Learning

SLIDE 12

Online Learning

A framework where the past may not be indicative of the future.


SLIDE 14

Online Learning

A framework where the past may not be indicative of the future.

Online Learning: for rounds t = 1, . . . , n:
  • Predict ŷ_t ∈ Ŷ
  • Observe y_t ∈ Y ← we do not assume this is generated by a model
  • Incur loss ℓ(ŷ_t, y_t)


SLIDE 16

Online Learning

A framework where the past may not be indicative of the future.

Contextual Online Learning: for rounds t = 1, . . . , n:
  • Observe context x_t ∈ X ← also has no model assumptions
  • Predict ŷ_t ∈ Ŷ
  • Observe y_t ∈ Y ← we do not assume this is generated by a model
  • Incur loss ℓ(ŷ_t, y_t)
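The contextual protocol above can be sketched directly in code. Everything concrete below (the running-mean player, the alternating adversary, the squared loss) is an illustrative placeholder, not part of the talk; any strategy/adversary pair fits the same loop.

```python
# Minimal sketch of the contextual online learning protocol.
def online_learning(player, environment, loss, n):
    """Run n rounds of the protocol; return the player's total loss."""
    total = 0.0
    history = []
    for t in range(n):
        x_t = environment.context(t, history)    # observe context x_t
        y_hat = player.predict(x_t, history)     # predict y_hat_t
        y_t = environment.outcome(t, history)    # observe y_t (no model assumed)
        total += loss(y_hat, y_t)                # incur loss
        history.append((x_t, y_hat, y_t))
    return total

class FollowTheMean:
    """Toy player: predict the running mean of past outcomes."""
    def predict(self, x_t, history):
        ys = [y for (_, _, y) in history]
        return sum(ys) / len(ys) if ys else 0.5

class AlternatingEnvironment:
    """Toy adversary: outcomes simply alternate 0, 1, 0, 1, ..."""
    def context(self, t, history):
        return t
    def outcome(self, t, history):
        return t % 2

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2
```

Note that the adversary may depend on the full history, which is exactly why no i.i.d.-style guarantee about future rounds is possible.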

SLIDE 19

Measuring Performance

In statistical learning, performance is often measured against:

  • a ground truth, e.g., parameter estimation
  • the best predictor from some class for the underlying probability model

These measures quantify guarantees about the future given the past. Without a probabilistic model:

  • no notion of ground truth to compare with
  • the “best hypothesis” in a class is not clearly defined
  • cannot naively hope to do well on future observations

If I can’t make promises about the future, can I say something about the past?

SLIDE 20

Measuring Performance

Consider a relative notion of performance in hindsight.

  • Relative to a class F ⊆ {f : X → Ŷ}, consisting of experts f ∈ F.
  • Compete against the optimal f ∈ F on the actual sequence of observations from past rounds.


SLIDE 22

Regret

Regret:

R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^{n} ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^{n} ℓ(f(x_t), y_t).

This quantity depends on
  • ŷ: player predictions,
  • F: expert class,
  • x: observed contexts,
  • y: observed data points.
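Regret against a finite expert class can be computed in hindsight exactly as defined. This sketch uses squared loss and a class of constant experts purely for illustration; neither choice comes from the talk.

```python
# Cumulative player loss minus the loss of the best expert in hindsight.
def regret(predictions, contexts, outcomes, experts, loss):
    player_loss = sum(loss(p, y) for p, y in zip(predictions, outcomes))
    best_expert_loss = min(
        sum(loss(f(x), y) for x, y in zip(contexts, outcomes))
        for f in experts
    )
    return player_loss - best_expert_loss

# Three constant experts that ignore the context entirely (illustrative class).
experts = [lambda x, q=q: q for q in (0.0, 0.5, 1.0)]
loss = lambda p, y: (p - y) ** 2

r = regret(
    predictions=[0.5, 0.5, 0.5, 0.5],
    contexts=[None] * 4,
    outcomes=[1, 1, 0, 1],
    experts=experts,
    loss=loss,
)
```

Here the player's loss equals the best expert's loss, so the regret is zero; regret can also be negative, since the comparator is fixed in hindsight while the player may adapt.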
SLIDE 23

Minimax Regret

Regret:

R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^{n} ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^{n} ℓ(f(x_t), y_t).

Minimax regret: an algorithm-free quantity on worst-case observations.

R^ℓ_n(F) = sup_{x_1} inf_{ŷ_1} sup_{y_1} sup_{x_2} inf_{ŷ_2} sup_{y_2} · · · sup_{x_n} inf_{ŷ_n} sup_{y_n} R^ℓ_n(ŷ; F, x, y).

Reading the operators left to right: the first context is observed (sup_{x_1}), the player makes their prediction (inf_{ŷ_1}), the adversary plays an observation (sup_{y_1}), and this repeats for all n rounds.

SLIDE 29

Minimax Regret

The notation ⟪·⟫_{t=1}^{n} denotes repeated application of operators:

R^ℓ_n(F) = ⟪sup_{x_t} inf_{ŷ_t} sup_{y_t}⟫_{t=1}^{n} R^ℓ_n(ŷ; F, x, y).

SLIDE 30

Minimax Regret

Interpretation: the tuple (ℓ, F) is online learnable if R^ℓ_n(F) = o(n).
  • slow rate: R^ℓ_n(F) = Θ(√n)
  • fast rate: R^ℓ_n(F) ≤ O(log(n))

SLIDE 31

Logarithmic Loss


SLIDE 33

Problem Formulation

Sequential Probability Assignment: in each round, the prediction is a distribution on possible observations.

Predicting Binary Outcomes: y ∈ Y = {0, 1} and p̂ ∈ Ŷ ≡ [0, 1].

SLIDE 34

Measuring Loss

What is the correct notion of loss?


SLIDE 36

Measuring Loss

Intuition: being confidently wrong is much worse than being indecisive.
Statistical motivation: maximum likelihood estimation for a Bernoulli.

Logarithmic Loss: ℓ_log(p̂_t, y_t) = −y_t log(p̂_t) − (1 − y_t) log(1 − p̂_t).
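A quick numeric illustration of the definition above; the probabilities chosen are arbitrary, and natural logarithms are assumed.

```python
import math

def log_loss(p_hat, y):
    """Logarithmic loss for binary outcome y under predicted probability p_hat."""
    return -y * math.log(p_hat) - (1 - y) * math.log(1 - p_hat)

# Being indecisive costs log(2) per round regardless of the outcome...
indecisive = log_loss(0.5, 1)        # = log(2)
# ...while being confidently wrong is far more expensive.
confident_wrong = log_loss(0.01, 1)  # = -log(0.01)
```

As p̂ → 0 with y = 1 (or p̂ → 1 with y = 0) the loss diverges, which is exactly the intuition that confident mistakes are penalized most heavily.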


SLIDE 38

Measuring Loss

Why is this difficult? Standard online learning techniques rely on the loss being bounded or Lipschitz.

ℓ_log(p̂_t, y_t) = −y_t log(p̂_t) − (1 − y_t) log(1 − p̂_t).

[Plots: ℓ_log(p, 1) and ℓ_log(p, 0) for p ∈ [0, 1]; the loss is unbounded as p approaches the wrong endpoint.]

SLIDE 39

Measuring Loss

[Plots: d/dp ℓ_log(p, 1) and d/dp ℓ_log(p, 0) for p ∈ [0, 1]; the gradient is also unbounded near the endpoints, so the loss is not Lipschitz either.]

SLIDE 40

Bounding Regret


SLIDE 43

Dual Game

Recall that the minimax regret is

R^log_n(F) = ⟪sup_{x_t} inf_{p̂_t} sup_{y_t}⟫_{t=1}^{n} R^log_n(p̂; F, x, y).

The worst-case observations can equivalently be viewed as

R^log_n(F) = ⟪sup_{x_t} inf_{p̂_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^{n} R^log_n(p̂; F, x, y).

An extension of the minimax theorem (Abernethy et al., 2009; Rakhlin and Sridharan, 2015) gives

R^log_n(F) = ⟪sup_{x_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^{n} R^log_n(p; F, x, y).


SLIDE 46

Empirical Process Theory

Expanding the regret term, we get

R^log_n(F) = ⟪sup_{x_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^{n} sup_{f∈F} Σ_{t=1}^{n} [ℓ_log(p_t, y_t) − ℓ_log(f(x_t), y_t)].

The presence of an expected supremum suggests empirical process theory.

  • Discretize the infinite supremum into a finite cover.
  • Bound the expected maximum of the finite cover.
  • Bound the error from only considering the finite cover.

SLIDE 50

Uniform Covering Fails

Early work (Cesa-Bianchi and Lugosi, 1999; Opper and Haussler, 1999) used a uniform covering approach, but this is too coarse for many expert classes.

Distance between f, g ∈ F:

d(f, g) = sup_{x∈X} sup_{y∈{0,1}} |ℓ_log(f(x), y) − ℓ_log(g(x), y)|

Class G covers class F at margin γ if: sup_{f∈F} inf_{g∈G} d(f, g) ≤ γ.

Instead, we use sequential covering from Rakhlin and Sridharan (2014).

SLIDE 51

Binary Tree Notation

R^log_n(F) = ⟪sup_{x_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^{n} sup_{f∈F} Σ_{t=1}^{n} [ℓ_log(p_t, y_t) − ℓ_log(f(x_t), y_t)].

We can encode the sequential nature of x_t and p_t using binary trees: each becomes a depth-n tree whose value at level t depends only on the path y_1, . . . , y_{t−1} of past observations.



SLIDE 59

Sequential Covering

R^log_n(F) = sup_x sup_p E_{y∼p} sup_{f∈F} Σ_{t=1}^{n} [ℓ_log(p_t(y), y_t) − ℓ_log(f(x_t(y)), y_t)].

Cover the class of trees F ∘ x defined by composing F with a context tree x:

A class of trees V sequentially covers F ∘ x at margin γ if: sup_{u∈F∘x} sup_{y∈{0,1}^n} inf_{v∈V} ‖u(y) − v(y)‖_p ≤ γ.

The order of observations and covering elements is reversed from a uniform cover.


SLIDE 62

Sequential Covering Example

To illustrate the utility of sequential covering, consider binary experts for n = 2: The only uniform cover of F ◦ x is itself, which has 8 elements. For a sequential cover, we can choose a different element for each path, so only 4 trees are required.
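The counting claim above can be verified by brute force. The tree encoding below is an illustrative choice: a depth-2 binary tree plays its root value at round 1 and the child selected by the first observation at round 2, and we check an exact (margin-0) sequential cover of 4 trees against all 8 trees in the class.

```python
from itertools import product

# Depth-2 binary tree: (root, (left_child, right_child)). On a path y = (y1, y2)
# the tree plays root at round 1 and children[y1] at round 2.
def values_on_path(tree, y):
    root, children = tree
    return (root, children[y[0]])

# The full class: all 8 depth-2 binary-valued trees.
all_trees = [(r, (a, b)) for r, a, b in product((0, 1), repeat=3)]

# Candidate sequential cover: 4 trees with root a and both children equal to b.
cover = [(a, (b, b)) for a, b in product((0, 1), repeat=2)]

# Sequential covering at margin 0: for every tree and every path, some cover
# element agrees with the tree along that path (the element may vary per path).
ok = all(
    any(values_on_path(v, y) == values_on_path(u, y) for v in cover)
    for u in all_trees
    for y in product((0, 1), repeat=2)
)
```

The cover works because on any single path only one child of the root is ever read, so the covering element can commit to that child's value, which is exactly the per-path freedom a uniform cover lacks.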


SLIDE 66

Sequential Covering Examples

Examples of sequential covering numbers:

  • Time-Invariant: F = {f | ∃q ∈ [0, 1] s.t. f(x) = q ∀x ∈ X}.
      sup_x log N_∞(F ∘ x, γ) ≤ log(1/γ).
  • Linear Predictors: F = {f | ∃w s.t. ‖w‖_2 ≤ 1, f(x) = (1 + ⟨w, x⟩)/2 ∀ ‖x‖_2 ≤ 1}.
      sup_x log N_∞(F ∘ x, γ) ≍ 1/γ^2.
  • 1-Lipschitz: F = {f | f : R^d → [0, 1], ‖∇f(x)‖_∞ ≤ 1}.
      sup_x log N_∞(F ∘ x, γ) ≍ 1/γ^d.

SLIDE 67

Improved Minimax Bounds


SLIDE 70

Improved Minimax Bounds

Theorem (B., Foster, Roy, 2020). There exists c > 0 such that for all F,

R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log N_∞(F ∘ x, γ)}.

Upper Bound (Computation). If sup_x log N_∞(F ∘ x, γ) ≍ γ^{−p}, then

R^log_n(F) ≤ O(n^{p/(p+1)}).

Theorem (B., Foster, Roy, 2020). If p > 0, there exists an F with sup_x log N_∞(F ∘ x, γ) ≍ γ^{−p} and

R^log_n(F) ≥ Ω(n^{p/(p+2)}).
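The rate computation above can be sanity-checked numerically: minimizing 4nγ + cγ^{−p} over γ produces a quantity scaling as n^{p/(p+1)}. The constants, the grid, and the two values of n below are arbitrary choices for the check.

```python
import math

def optimized_bound(n, p, c=1.0):
    """Minimize 4*n*gamma + c*gamma**(-p) over a log-spaced grid of gamma."""
    gammas = [10 ** (k / 200.0) for k in range(-2000, 1)]  # 1e-10 .. 1
    return min(4 * n * g + c * g ** (-p) for g in gammas)

# The optimized bound should grow like n**(p/(p+1)); estimate the exponent
# from two values of n on a log-log scale.
p = 2.0
b1 = optimized_bound(10**4, p)
b2 = optimized_bound(10**6, p)
exponent = math.log(b2 / b1) / math.log(10**6 / 10**4)  # expect ~ 2/3
```

Analytically, setting the derivative to zero gives γ* = (pc/(4n))^{1/(p+1)}, at which point both terms scale as n^{p/(p+1)}, matching the stated rate.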
SLIDE 71

Improved Minimax Bounds Visualized

Our results compared to the previous best upper bound from Foster et al. (2018).

[Plot: optimized power of n in the regret bound against the order p of the sequential covering number, for the Foster et al. (2018) bound, the new upper bound, and the new lower bound.]

SLIDE 72

Advances Underlying Results


SLIDE 77

Truncation Free

The standard procedure to control log loss uses truncation. Define the truncated expert class F_δ = {f^δ : f ∈ F} for δ ∈ (0, 1/2), where

f^δ(x) = δ if f(x) < δ;  f(x) if δ ≤ f(x) ≤ 1 − δ;  1 − δ if f(x) > 1 − δ.

  • Observe that for p ∈ [δ, 1 − δ], ℓ_log(p, y) is 1/δ-Lipschitz.
  • It can be shown that R^log_n(F) ≤ R^log_n(F_δ) + 2nδ.

Rakhlin and Sridharan (2015) hypothesize this truncation argument is suboptimal, and pose the open problem of finding a tighter bound without it. Our argument does not require truncation.
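The truncation map f ↦ f^δ above is just a clip into [δ, 1 − δ]; a one-line sketch follows. The expert used in the demonstration is an arbitrary illustrative choice.

```python
def truncate(f, delta):
    """Clip an expert's predictions into [delta, 1 - delta]."""
    return lambda x: min(max(f(x), delta), 1 - delta)

# Illustrative expert that can be arbitrarily confident near the endpoints.
f = lambda x: x
f_delta = truncate(f, 0.1)

low, mid, high = f_delta(0.0), f_delta(0.5), f_delta(0.99)
```

On [δ, 1 − δ] the gradient of the log loss is bounded by 1/δ (since |d/dp (−log p)| = 1/p ≤ 1/δ and symmetrically for −log(1 − p)), which is what makes the truncated class amenable to standard Lipschitz-based techniques, at the cost of the extra 2nδ term.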

SLIDE 82

Self-Concordance

Self-Concordant (Nesterov and Nemirovski, 1994). A function F : R → R is self-concordant if |F′′′(x)| ≤ 2F′′(x)^{3/2}.

Logarithmic loss is self-concordant as a function of p.

Utility: in convex optimization, encoding the constraint boundary with a self-concordant barrier function yields high accuracy in polynomially many iterations.

If F is self-concordant, then for all x, y ∈ R,

F(x) − F(y) ≤ (x − y)F′(x) − |x − y|√(F′′(x)) + log(1 + |x − y|√(F′′(x))).

We use the second term to control the gradient of logarithmic loss.
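The self-concordance of the log loss can be checked directly. Taking F(p) = −log(p), one branch of the logarithmic loss, we have F′′(p) = 1/p² and F′′′(p) = −2/p³, so the defining inequality holds with equality; the grid of test points below is an arbitrary choice.

```python
# Numeric check that F(p) = -log(p) satisfies |F'''(p)| <= 2 * F''(p)**(3/2).
# For this F the inequality is an exact equality at every p in (0, 1).

def second_derivative(p):
    return 1.0 / p**2

def third_derivative(p):
    return -2.0 / p**3

ok = all(
    abs(third_derivative(p)) <= 2 * second_derivative(p) ** 1.5 + 1e-9
    for p in [k / 100.0 for k in range(1, 100)]
)
```

The same computation applies to the other branch −log(1 − p) by symmetry, so the full binary log loss is self-concordant in p for either outcome.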


SLIDE 86

Chaining Free

Recall our upper bound:

R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log N_∞(F ∘ x, γ)}.

  • Rather than a single discretization step, it is common to use multiple, nested discretizations of finer sizes, a technique called chaining.
  • Our current approach does not permit such a technique, yet it improves on previous results that do use chaining.
  • Naive attempts to adapt our result to allow chaining fail; this remains an active direction of our work.

SLIDE 87

Summary

SLIDE 91

Summary

Motivation
  • Make probabilistic forecasts without assumptions about the data generating process, whether i.i.d. or a more sophisticated dependence structure.

Problem Setup
  • Bound minimax regret for arbitrary expert classes under logarithmic loss.

Contributions
  • Improved the upper bound for complex classes and provided a lower bound.
  • The proof technique is truncation free and requires only a single discretization step.

Next Steps
  • Match the upper and lower bounds.
  • Obtain bounds that interpolate between stochastic and fully adversarial data.

SLIDE 95

Open Problem

Infinite Dimensional Linear Prediction
  • X = B_2, the unit ball in a Hilbert space,
  • F = {f(x) = (⟨w, x⟩ + 1)/2 : w ∈ B_2},
  • The log loss can be written (up to an additive constant) as
      g_t(w) = −y_t log(1 + ⟨w, x_t⟩) − (1 − y_t) log(1 − ⟨w, x_t⟩).

Constructive Algorithm (Rakhlin and Sridharan, 2015)
  • Follow-the-Regularized-Leader with a self-concordant barrier function gives R^log_n(F) ≤ Õ(√n).
  • This is tighter than any known upper bound, including ours, and matches the lower bound.
  • It is not clear how to extend a concrete algorithmic technique like this to arbitrary expert classes.