SLIDE 1 Improved Bounds on Minimax Regret under Logarithmic Loss via Self-Concordance
Blair Bilodeau1,2 with Dylan J. Foster3 and Daniel M. Roy1,2 March 11, 2020
1Department of Statistical Sciences, University of Toronto 2Vector Institute 3Institute for Foundations of Data Science, Massachusetts Institute of Technology
SLIDE 2
Motivation
SLIDE 3
Weather Forecasting
Goal: forecast the probability of rain from historical data and current conditions.
SLIDE 4 Weather Forecasting
Goal: forecast the probability of rain from historical data and current conditions. Considerations
- Which assumptions to make about historical trends continuing?
- How many physical relationships should be incorporated in the model?
- Are some missed predictions more expensive than others?
SLIDE 5 Traditional Statistical Learning
- Receive a batch of data
- Estimate a prediction function ĥ
- Evaluate performance on new data assumed to be from the same distribution
SLIDE 6
Traditional Statistical Learning
But what if there’s a changepoint...
SLIDE 7
Traditional Statistical Learning
...or your training data isn’t even i.i.d.?
SLIDE 8
Statistical Solutions
We want to remove assumptions about the data generating process. In particular, future data may not be i.i.d. with past data.
SLIDE 9 Statistical Solutions
We want to remove assumptions about the data generating process. In particular, future data may not be i.i.d. with past data. Statistics does this with, for example,
- Markov assumption
- stationarity assumption (time series)
- covariance structure assumption (e.g., Gaussian process)
SLIDE 10 Statistical Solutions
We want to remove assumptions about the data generating process. In particular, future data may not be i.i.d. with past data. Statistics does this with, for example,
- Markov assumption
- stationarity assumption (time series)
- covariance structure assumption (e.g., Gaussian process)
But these assumptions are often uncheckable or false.
SLIDE 11
Online Learning
SLIDE 12
Online Learning
A framework where the past may not be indicative of the future.
SLIDE 13 Online Learning
A framework where the past may not be indicative of the future. Online Learning For rounds t = 1, . . . , n:
- Predict ŷ_t ∈ Ŷ
- Observe y_t ∈ Y
- Incur loss ℓ(ŷ_t, y_t)
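In code, this protocol is a simple loop. The following is a minimal sketch (not from the talk); the player, adversary, and squared loss below are toy choices for illustration.

```python
def online_learning(player, adversary, loss, n):
    """Run the online learning protocol for n rounds and return cumulative loss."""
    total = 0.0
    history = []
    for t in range(n):
        y_hat = player(history)        # predict based on past observations only
        y = adversary(history, y_hat)  # observe an arbitrary outcome (no model assumed)
        total += loss(y_hat, y)        # incur loss
        history.append(y)
    return total

# Toy example: squared loss, a player predicting the running mean,
# and an adversary that always plays 1.
loss = lambda y_hat, y: (y_hat - y) ** 2
player = lambda h: sum(h) / len(h) if h else 0.5
adversary = lambda h, y_hat: 1.0
print(online_learning(player, adversary, loss, 10))
```

The key point the loop makes explicit: the player sees only the past, and nothing constrains how the adversary generates `y`.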
SLIDE 14 Online Learning
A framework where the past may not be indicative of the future. Online Learning For rounds t = 1, . . . , n:
- Predict ŷ_t ∈ Ŷ
- Observe y_t ∈ Y (we do not assume this is generated by a model)
- Incur loss ℓ(ŷ_t, y_t)
SLIDE 15 Online Learning
A framework where the past may not be indicative of the future. Contextual Online Learning For rounds t = 1, . . . , n:
- Observe context x_t ∈ X
- Predict ŷ_t ∈ Ŷ
- Observe y_t ∈ Y (we do not assume this is generated by a model)
- Incur loss ℓ(ŷ_t, y_t)
SLIDE 16 Online Learning
A framework where the past may not be indicative of the future. Contextual Online Learning For rounds t = 1, . . . , n:
- Observe context x_t ∈ X (also has no model assumptions)
- Predict ŷ_t ∈ Ŷ
- Observe y_t ∈ Y (we do not assume this is generated by a model)
- Incur loss ℓ(ŷ_t, y_t)
SLIDE 17 Measuring Performance
In statistical learning, performance is often measured against:
- a ground truth, e.g., parameter estimation
- the best predictor from some class for the underlying probability model
SLIDE 18 Measuring Performance
In statistical learning, performance is often measured against:
- a ground truth, e.g., parameter estimation
- the best predictor from some class for the underlying probability model
These measures quantify guarantees about the future given the past. Without a probabilistic model:
- no notion of ground truth to compare with
- the “best hypothesis” in a class is not clearly defined
- cannot naively hope to do well on future observations
SLIDE 19 Measuring Performance
In statistical learning, performance is often measured against:
- a ground truth, e.g., parameter estimation
- the best predictor from some class for the underlying probability model
These measures quantify guarantees about the future given the past. Without a probabilistic model:
- no notion of ground truth to compare with
- the “best hypothesis” in a class is not clearly defined
- cannot naively hope to do well on future observations
If I can’t promise about the future, can I say something about the past?
SLIDE 20 Measuring Performance
In statistical learning, performance is often measured against:
- a ground truth, e.g., parameter estimation
- the best predictor from some class for the underlying probability model
These measures quantify guarantees about the future given the past. Without a probabilistic model:
- no notion of ground truth to compare with
- the “best hypothesis” in a class is not clearly defined
- cannot naively hope to do well on future observations
Consider a relative notion of performance in hindsight.
- Relative to a class F ⊆ {f : X → Ŷ}, consisting of experts f ∈ F.
- Compete against the optimal f ∈ F on the actual sequence of observations from past rounds.
SLIDE 21 Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t).
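For a finite expert class, this definition can be computed directly. A minimal sketch (not from the talk; the constant experts and squared loss are toy choices):

```python
def regret(loss, y_hat_seq, experts, x_seq, y_seq):
    """Player's cumulative loss minus that of the best expert in hindsight."""
    player_loss = sum(loss(p, y) for p, y in zip(y_hat_seq, y_seq))
    best_expert_loss = min(
        sum(loss(f(x), y) for x, y in zip(x_seq, y_seq)) for f in experts
    )
    return player_loss - best_expert_loss

# Two constant experts under squared loss; contexts are unused here.
loss = lambda p, y: (p - y) ** 2
experts = [lambda x: 0.0, lambda x: 1.0]
x_seq = [None] * 4
y_seq = [1, 1, 0, 1]
y_hat_seq = [0.5, 0.5, 0.5, 0.5]
print(regret(loss, y_hat_seq, experts, x_seq, y_seq))
```

Note that the comparator is chosen after seeing the whole sequence, which is exactly the "in hindsight" aspect of the definition.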
SLIDE 22 Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t). This quantity depends on
- ŷ: Player predictions,
- F: Expert class,
- x: Observed contexts,
- y: Observed data points.
SLIDE 23 Minimax Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t). Minimax regret: an algorithm-free quantity on worst-case observations. R^ℓ_n(F) = sup_{x_1} inf_{ŷ_1} sup_{y_1} sup_{x_2} inf_{ŷ_2} sup_{y_2} · · · sup_{x_n} inf_{ŷ_n} sup_{y_n} R^ℓ_n(ŷ; F, x, y).
SLIDE 24 Minimax Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t). Minimax regret: an algorithm-free quantity on worst-case observations. R^ℓ_n(F) = sup_{x_1} inf_{ŷ_1} sup_{y_1} sup_{x_2} inf_{ŷ_2} sup_{y_2} · · · sup_{x_n} inf_{ŷ_n} sup_{y_n} R^ℓ_n(ŷ; F, x, y). The first context is observed.
SLIDE 25 Minimax Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t). Minimax regret: an algorithm-free quantity on worst-case observations. R^ℓ_n(F) = sup_{x_1} inf_{ŷ_1} sup_{y_1} sup_{x_2} inf_{ŷ_2} sup_{y_2} · · · sup_{x_n} inf_{ŷ_n} sup_{y_n} R^ℓ_n(ŷ; F, x, y). The player makes their prediction.
SLIDE 26 Minimax Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t). Minimax regret: an algorithm-free quantity on worst-case observations. R^ℓ_n(F) = sup_{x_1} inf_{ŷ_1} sup_{y_1} sup_{x_2} inf_{ŷ_2} sup_{y_2} · · · sup_{x_n} inf_{ŷ_n} sup_{y_n} R^ℓ_n(ŷ; F, x, y). The adversary plays an observation.
SLIDE 27 Minimax Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t). Minimax regret: an algorithm-free quantity on worst-case observations. R^ℓ_n(F) = sup_{x_1} inf_{ŷ_1} sup_{y_1} sup_{x_2} inf_{ŷ_2} sup_{y_2} · · · sup_{x_n} inf_{ŷ_n} sup_{y_n} R^ℓ_n(ŷ; F, x, y). This repeats for all n rounds.
SLIDE 29 Minimax Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t). Minimax regret: an algorithm-free quantity on worst-case observations. R^ℓ_n(F) = ⟪sup_{x_t} inf_{ŷ_t} sup_{y_t}⟫_{t=1}^n R^ℓ_n(ŷ; F, x, y). The notation ⟪·⟫_{t=1}^n denotes repeated application of operators.
SLIDE 30 Minimax Regret
Regret: R^ℓ_n(ŷ; F, x, y) = Σ_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} Σ_{t=1}^n ℓ(f(x_t), y_t). Minimax regret: an algorithm-free quantity on worst-case observations. R^ℓ_n(F) = ⟪sup_{x_t} inf_{ŷ_t} sup_{y_t}⟫_{t=1}^n R^ℓ_n(ŷ; F, x, y). Interpretation: The tuple (ℓ, F) is online learnable if R^ℓ_n(F) = o(n).
- e.g., R^ℓ_n(F) = Θ(√n)
- e.g., R^ℓ_n(F) ≤ O(log(n))
SLIDE 31
Logarithmic Loss
SLIDE 32
Problem Formulation
Sequential Probability Assignment In each round, the prediction is a distribution on possible observations.
SLIDE 33
Problem Formulation
Sequential Probability Assignment In each round, the prediction is a distribution on possible observations. Predicting Binary Outcomes y ∈ Y = {0, 1} and p̂ ∈ Ŷ ≡ [0, 1]
SLIDE 34
Measuring Loss
What is the correct notion of loss?
SLIDE 35
Measuring Loss
Intuition: being confidently wrong is much worse than being indecisive. Statistical motivation: maximum likelihood estimation for a Bernoulli.
SLIDE 36
Measuring Loss
Intuition: being confidently wrong is much worse than being indecisive. Statistical motivation: maximum likelihood estimation for a Bernoulli. Logarithmic Loss ℓlog(p̂_t, y_t) = −y_t log(p̂_t) − (1 − y_t) log(1 − p̂_t).
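The intuition is easy to check numerically; a minimal sketch (not from the talk):

```python
import math

def log_loss(p, y):
    """Logarithmic loss for a probabilistic prediction p of a binary outcome y."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# Being indecisive costs log(2), roughly 0.69, regardless of the outcome...
print(log_loss(0.5, 1))
# ...while being confidently wrong is far more expensive.
print(log_loss(0.01, 1))
```

As the prediction approaches the wrong endpoint, the loss grows without bound, which is the source of the difficulty discussed next.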
SLIDE 37
Measuring Loss
Why is this difficult? Standard online learning techniques rely on loss being bounded or Lipschitz.
SLIDE 38 Measuring Loss
Why is this difficult? Standard online learning techniques rely on loss being bounded or Lipschitz. ℓlog(p̂_t, y_t) = −y_t log(p̂_t) − (1 − y_t) log(1 − p̂_t).
[Plot: ℓlog(p, 1) and ℓlog(p, 0) as functions of p ∈ [0, 1]; the loss is unbounded as p approaches the wrong endpoint.]
SLIDE 39 Measuring Loss
Why is this difficult? Standard online learning techniques rely on loss being bounded or Lipschitz. ℓlog(p̂_t, y_t) = −y_t log(p̂_t) − (1 − y_t) log(1 − p̂_t).
[Plot: the derivative d/dp ℓlog(p, y) for y = 1 and y = 0; the gradient is unbounded near the endpoints.]
SLIDE 40
Bounding Regret
SLIDE 41 Dual Game
Recall that the minimax regret is R^log_n(F) = ⟪sup_{x_t} inf_{p̂_t} sup_{y_t}⟫_{t=1}^n R^log_n(p̂; F, x, y).
SLIDE 42 Dual Game
Recall that the minimax regret is R^log_n(F) = ⟪sup_{x_t} inf_{p̂_t} sup_{y_t}⟫_{t=1}^n R^log_n(p̂; F, x, y). The worst-case observations can equivalently be viewed as R^log_n(F) = ⟪sup_{x_t} inf_{p̂_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^n R^log_n(p̂; F, x, y).
SLIDE 43 Dual Game
Recall that the minimax regret is R^log_n(F) = ⟪sup_{x_t} inf_{p̂_t} sup_{y_t}⟫_{t=1}^n R^log_n(p̂; F, x, y). The worst-case observations can equivalently be viewed as R^log_n(F) = ⟪sup_{x_t} inf_{p̂_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^n R^log_n(p̂; F, x, y). (Abernethy et al., 2009; Rakhlin and Sridharan, 2015) An extension of the minimax theorem gives R^log_n(F) = ⟪sup_{x_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^n R^log_n(p; F, x, y).
SLIDE 44 Empirical Process Theory
Expanding the regret term, we get R^log_n(F) = ⟪sup_{x_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^n sup_{f∈F} Σ_{t=1}^n [ℓlog(p_t, y_t) − ℓlog(f(x_t), y_t)]
SLIDE 45 Empirical Process Theory
Expanding the regret term, we get R^log_n(F) = ⟪sup_{x_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^n sup_{f∈F} Σ_{t=1}^n [ℓlog(p_t, y_t) − ℓlog(f(x_t), y_t)]
The presence of an expected supremum suggests empirical process theory.
SLIDE 46 Empirical Process Theory
Expanding the regret term, we get R^log_n(F) = ⟪sup_{x_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^n sup_{f∈F} Σ_{t=1}^n [ℓlog(p_t, y_t) − ℓlog(f(x_t), y_t)]
The presence of an expected supremum suggests empirical process theory.
- Discretize the infinite supremum into a finite cover.
- Bound the expected maximum of the finite cover.
- Bound the error from only considering the finite cover.
SLIDE 47
Uniform Covering Fails
Early work (Cesa-Bianchi and Lugosi, 1999; Opper and Haussler, 1999) used a uniform covering approach, but this is too coarse for many expert classes.
SLIDE 48 Uniform Covering Fails
Early work (Cesa-Bianchi and Lugosi, 1999; Opper and Haussler, 1999) used a uniform covering approach, but this is too coarse for many expert classes. Distance between f, g ∈ F: d(f, g) = sup_{x∈X} sup_{y∈{0,1}} |ℓlog(f(x), y) − ℓlog(g(x), y)|
SLIDE 49 Uniform Covering Fails
Early work (Cesa-Bianchi and Lugosi, 1999; Opper and Haussler, 1999) used a uniform covering approach, but this is too coarse for many expert classes. Distance between f, g ∈ F: d(f, g) = sup_{x∈X} sup_{y∈{0,1}} |ℓlog(f(x), y) − ℓlog(g(x), y)| Class G covers class F at margin γ if: sup_{f∈F} inf_{g∈G} d(f, g) ≤ γ.
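One way to see why this metric is problematic is to evaluate it for constant experts. A minimal sketch (not from the talk): experts that are close as probabilities can be far apart in this distance when they sit near the boundary, so a uniform cover at a fixed margin needs many elements there.

```python
import math

def log_loss(p, y):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def d(q1, q2):
    """Uniform log-loss distance between two constant experts q1 and q2."""
    return max(abs(log_loss(q1, y) - log_loss(q2, y)) for y in (0, 1))

print(d(0.5, 0.51))    # small: interior experts are easy to cover
print(d(1e-4, 2e-4))   # large, despite |q1 - q2| = 1e-4
```

Near p = 0 the y = 1 branch contributes |log(q1) − log(q2)|, which depends on the ratio of the probabilities rather than their difference.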
SLIDE 50 Uniform Covering Fails
Early work (Cesa-Bianchi and Lugosi, 1999; Opper and Haussler, 1999) used a uniform covering approach, but this is too coarse for many expert classes. Distance between f, g ∈ F: d(f, g) = sup_{x∈X} sup_{y∈{0,1}} |ℓlog(f(x), y) − ℓlog(g(x), y)| Class G covers class F at margin γ if: sup_{f∈F} inf_{g∈G} d(f, g) ≤ γ.
Instead, we use sequential covering from Rakhlin and Sridharan (2014).
SLIDE 51 Binary Tree Notation
R^log_n(F) = ⟪sup_{x_t} sup_{p_t} E_{y_t∼p_t}⟫_{t=1}^n sup_{f∈F} Σ_{t=1}^n [ℓlog(p_t, y_t) − ℓlog(f(x_t), y_t)]
We can encode the sequential nature of x_t and p_t using binary trees:
SLIDE 56 Sequential Covering
R^log_n(F) = sup_x sup_p E_{y∼p} sup_{f∈F} Σ_{t=1}^n [ℓlog(p_t(y), y_t) − ℓlog(f(x_t(y)), y_t)]
SLIDE 57 Sequential Covering
R^log_n(F) = sup_x sup_p E_{y∼p} sup_{f∈F} Σ_{t=1}^n [ℓlog(p_t(y), y_t) − ℓlog(f(x_t(y)), y_t)]
Cover the class of trees F ◦ x defined by composing F with a context tree x:
SLIDE 58 Sequential Covering
R^log_n(F) = sup_x sup_p E_{y∼p} sup_{f∈F} Σ_{t=1}^n [ℓlog(p_t(y), y_t) − ℓlog(f(x_t(y)), y_t)]
Cover the class of trees F ◦ x defined by composing F with a context tree x: A class of trees V sequentially covers F ◦ x at margin γ if: sup_{u∈F◦x} sup_{y∈{0,1}^n} inf_{v∈V} ‖u(y) − v(y)‖_p ≤ γ.
SLIDE 59 Sequential Covering
R^log_n(F) = sup_x sup_p E_{y∼p} sup_{f∈F} Σ_{t=1}^n [ℓlog(p_t(y), y_t) − ℓlog(f(x_t(y)), y_t)]
Cover the class of trees F ◦ x defined by composing F with a context tree x: A class of trees V sequentially covers F ◦ x at margin γ if: sup_{u∈F◦x} sup_{y∈{0,1}^n} inf_{v∈V} ‖u(y) − v(y)‖_p ≤ γ.
The order of observations and covering elements is reversed from a uniform cover.
SLIDE 60
Sequential Covering Example
To illustrate the utility of sequential covering, consider binary experts for n = 2:
SLIDE 61
Sequential Covering Example
To illustrate the utility of sequential covering, consider binary experts for n = 2: The only uniform cover of F ◦ x is itself, which has 8 elements.
SLIDE 62
Sequential Covering Example
To illustrate the utility of sequential covering, consider binary experts for n = 2: The only uniform cover of F ◦ x is itself, which has 8 elements. For a sequential cover, we can choose a different element for each path, so only 4 trees are required.
SLIDE 63
Sequential Covering Examples
Examples of sequential covering numbers:
SLIDE 64 Sequential Covering Examples
Examples of sequential covering numbers:
- Time-Invariant: F = {f | ∃q ∈ [0, 1] s.t. f(x) = q ∀x ∈ X}.
  sup_x log(N∞(F ◦ x, γ)) ≤ log(1/γ).
SLIDE 65 Sequential Covering Examples
Examples of sequential covering numbers:
- Time-Invariant: F = {f | ∃q ∈ [0, 1] s.t. f(x) = q ∀x ∈ X}.
  sup_x log(N∞(F ◦ x, γ)) ≤ log(1/γ).
- Linear: F = {f | ∃w s.t. ‖w‖_2 ≤ 1, f(x) = (1/2)[1 + ⟨w, x⟩] ∀ ‖x‖_2 ≤ 1}.
  sup_x log(N∞(F ◦ x, γ)) ≍ 1/γ².
SLIDE 66 Sequential Covering Examples
Examples of sequential covering numbers:
- Time-Invariant: F = {f | ∃q ∈ [0, 1] s.t. f(x) = q ∀x ∈ X}.
  sup_x log(N∞(F ◦ x, γ)) ≤ log(1/γ).
- Linear: F = {f | ∃w s.t. ‖w‖_2 ≤ 1, f(x) = (1/2)[1 + ⟨w, x⟩] ∀ ‖x‖_2 ≤ 1}.
  sup_x log(N∞(F ◦ x, γ)) ≍ 1/γ².
- 1-Lipschitz: F = {f | f : R^d → [0, 1], ‖∇f(x)‖_∞ ≤ 1}.
  sup_x log(N∞(F ◦ x, γ)) ≍ 1/γ^d.
SLIDE 67
Improved Minimax Bounds
SLIDE 68 Improved Minimax Bounds
Theorem (B., Foster, Roy, 2020) There exists c > 0 such that for all F, R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log(N∞(F ◦ x, γ))}.
SLIDE 69 Improved Minimax Bounds
Theorem (B., Foster, Roy, 2020) There exists c > 0 such that for all F, R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log(N∞(F ◦ x, γ))}.
Upper Bound (Computation) If sup_x log(N∞(F ◦ x, γ)) ≍ γ^{−p}, then R^log_n(F) ≤ O(n^{p/(p+1)}).
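The stated rate follows from balancing the two terms in the theorem; a short derivation:

```latex
% Balancing the two terms of the upper bound when
% \sup_x \log N_\infty(\mathcal{F} \circ x, \gamma) \asymp \gamma^{-p}:
R^{\log}_n(\mathcal{F})
  \le \inf_{\gamma > 0} \left\{ 4n\gamma + c\,\gamma^{-p} \right\}.
% The first-order condition 4n - c\,p\,\gamma^{-p-1} = 0 gives
\gamma^* \asymp n^{-1/(p+1)},
\qquad
4n\gamma^* \asymp c\,(\gamma^*)^{-p} \asymp n^{p/(p+1)}.
```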
SLIDE 70 Improved Minimax Bounds
Theorem (B., Foster, Roy, 2020) There exists c > 0 such that for all F, R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log(N∞(F ◦ x, γ))}.
Upper Bound (Computation) If sup_x log(N∞(F ◦ x, γ)) ≍ γ^{−p}, then R^log_n(F) ≤ O(n^{p/(p+1)}).
Theorem (B., Foster, Roy, 2020) If p > 0, there exists an F with sup_x log(N∞(F ◦ x, γ)) ≍ γ^{−p} and R^log_n(F) ≥ Ω(n^{p/(p+2)}).
SLIDE 71 Improved Minimax Bounds Visualized
Our results compared to the previous best upper bound from Foster et al. (2018).
[Plot: optimized power of n in the regret bound versus the order of the sequential covering number, comparing the Foster et al. (2018) upper bound with the new upper and lower bounds.]
SLIDE 72
Advances Underlying Results
SLIDE 73
Truncation Free
The standard procedure to control log loss uses truncation. Define the truncated expert class Fδ = {f^δ : f ∈ F} for δ ∈ (0, 1/2), where f^δ(x) = δ if f(x) < δ; f(x) if δ ≤ f(x) ≤ 1 − δ; and 1 − δ if f(x) > 1 − δ.
SLIDE 74 Truncation Free
The standard procedure to control log loss uses truncation. Define the truncated expert class Fδ = {f^δ : f ∈ F} for δ ∈ (0, 1/2), where f^δ(x) = δ if f(x) < δ; f(x) if δ ≤ f(x) ≤ 1 − δ; and 1 − δ if f(x) > 1 − δ.
- Observe that for p ∈ [δ, 1 − δ], ℓlog(p, y) is 1/δ-Lipschitz.
SLIDE 75 Truncation Free
The standard procedure to control log loss uses truncation. Define the truncated expert class Fδ = {f^δ : f ∈ F} for δ ∈ (0, 1/2), where f^δ(x) = δ if f(x) < δ; f(x) if δ ≤ f(x) ≤ 1 − δ; and 1 − δ if f(x) > 1 − δ.
- Observe that for p ∈ [δ, 1 − δ], ℓlog(p, y) is 1/δ-Lipschitz.
- It can be shown that R^log_n(F) ≤ R^log_n(Fδ) + 2nδ.
SLIDE 76 Truncation Free
The standard procedure to control log loss uses truncation. Define the truncated expert class Fδ = {f^δ : f ∈ F} for δ ∈ (0, 1/2), where f^δ(x) = δ if f(x) < δ; f(x) if δ ≤ f(x) ≤ 1 − δ; and 1 − δ if f(x) > 1 − δ.
- Observe that for p ∈ [δ, 1 − δ], ℓlog(p, y) is 1/δ-Lipschitz.
- It can be shown that R^log_n(F) ≤ R^log_n(Fδ) + 2nδ.
Rakhlin and Sridharan (2015) hypothesize this truncation argument is suboptimal, and pose the open problem of finding a tighter bound without it.
SLIDE 77 Truncation Free
The standard procedure to control log loss uses truncation. Define the truncated expert class Fδ = {f^δ : f ∈ F} for δ ∈ (0, 1/2), where f^δ(x) = δ if f(x) < δ; f(x) if δ ≤ f(x) ≤ 1 − δ; and 1 − δ if f(x) > 1 − δ.
- Observe that for p ∈ [δ, 1 − δ], ℓlog(p, y) is 1/δ-Lipschitz.
- It can be shown that R^log_n(F) ≤ R^log_n(Fδ) + 2nδ.
Rakhlin and Sridharan (2015) hypothesize this truncation argument is suboptimal, and pose the open problem of finding a tighter bound without it. Our argument does not require truncation.
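For reference, the truncation map being avoided is simple to state in code; a minimal sketch (not from the talk):

```python
def truncate(f, delta):
    """Clamp an expert's predictions into [delta, 1 - delta]."""
    return lambda x: min(max(f(x), delta), 1 - delta)

f = lambda x: x           # a toy expert that passes its input through
f_delta = truncate(f, 0.1)
print(f_delta(0.03), f_delta(0.5), f_delta(0.99))
```

Clamping keeps the loss 1/δ-Lipschitz, but costs the additive 2nδ term above, which is exactly the slack our argument removes.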
SLIDE 78
Self-Concordance
Self-Concordant (Nesterov and Nemirovski, 1994) A function F : R → R is self-concordant if |F′′′(x)| ≤ 2 F′′(x)^{3/2}.
SLIDE 79
Self-Concordance
Self-Concordant (Nesterov and Nemirovski, 1994) A function F : R → R is self-concordant if |F′′′(x)| ≤ 2 F′′(x)^{3/2}. Logarithmic loss is self-concordant as a function of p.
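This claim can be checked directly for one branch of the loss; a sketch not from the slides: for F(p) = −log(p) the self-concordance inequality holds with equality.

```python
# For F(p) = -log(p): F''(p) = 1/p**2 and F'''(p) = -2/p**3,
# so |F'''(p)| = 2 * F''(p)**1.5 exactly.
def second_derivative(p):
    return 1 / p**2

def third_derivative(p):
    return -2 / p**3

for p in [0.1, 0.5, 0.9]:
    lhs = abs(third_derivative(p))
    rhs = 2 * second_derivative(p) ** 1.5
    assert abs(lhs - rhs) < 1e-9 * rhs
print("self-concordance holds with equality for -log(p)")
```

The same computation applies to the −log(1 − p) branch by symmetry.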
SLIDE 80
Self-Concordance
Self-Concordant (Nesterov and Nemirovski, 1994) A function F : R → R is self-concordant if |F′′′(x)| ≤ 2 F′′(x)^{3/2}. Logarithmic loss is self-concordant as a function of p. Utility: In convex optimization, encoding the constraint boundary with a self-concordant barrier function yields high accuracy in polynomially many iterations.
SLIDE 81 Self-Concordance
Self-Concordant (Nesterov and Nemirovski, 1994) A function F : R → R is self-concordant if |F′′′(x)| ≤ 2 F′′(x)^{3/2}. Logarithmic loss is self-concordant as a function of p. Utility: In convex optimization, encoding the constraint boundary with a self-concordant barrier function yields high accuracy in polynomially many iterations. If F is self-concordant, then for all x, y ∈ R, F(x) − F(y) ≤ (x − y)F′(x) − |x − y|√(F′′(x)) + log(1 + |x − y|√(F′′(x))).
SLIDE 82 Self-Concordance
Self-Concordant (Nesterov and Nemirovski, 1994) A function F : R → R is self-concordant if |F′′′(x)| ≤ 2 F′′(x)^{3/2}. Logarithmic loss is self-concordant as a function of p. Utility: In convex optimization, encoding the constraint boundary with a self-concordant barrier function yields high accuracy in polynomially many iterations. If F is self-concordant, then for all x, y ∈ R, F(x) − F(y) ≤ (x − y)F′(x) − |x − y|√(F′′(x)) + log(1 + |x − y|√(F′′(x))).
We use the second term to control the gradient of logarithmic loss.
SLIDE 83 Chaining Free
Recall our upper bound: R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log(N∞(F ◦ x, γ))}.
SLIDE 84 Chaining Free
Recall our upper bound: R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log(N∞(F ◦ x, γ))}.
- Rather than a single discretization step, it is common to use multiple, nested discretizations of finer sizes, called chaining.
SLIDE 85 Chaining Free
Recall our upper bound: R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log(N∞(F ◦ x, γ))}.
- Rather than a single discretization step, it is common to use multiple, nested discretizations of finer sizes, called chaining.
- Our current approach does not permit such a technique, yet improves on previous results which do.
SLIDE 86 Chaining Free
Recall our upper bound: R^log_n(F) ≤ sup_x inf_{γ>0} {4nγ + c log(N∞(F ◦ x, γ))}.
- Rather than a single discretization step, it is common to use multiple, nested discretizations of finer sizes, called chaining.
- Our current approach does not permit such a technique, yet improves on previous results which do.
- Naive attempts to modify our result to allow chaining fail, and this is an area of active work for us.
SLIDE 87
Summary
SLIDE 88 Summary
Motivation
- Make probabilistic forecasts without assumptions about the data generating process, whether i.i.d. or a more sophisticated dependence structure.
SLIDE 89 Summary
Motivation
- Make probabilistic forecasts without assumptions about the data generating process, whether i.i.d. or a more sophisticated dependence structure. Problem Setup
- Bounding minimax regret for arbitrary expert classes under logarithmic loss.
SLIDE 90 Summary
Motivation
- Make probabilistic forecasts without assumptions about the data generating process, whether i.i.d. or a more sophisticated dependence structure. Problem Setup
- Bounding minimax regret for arbitrary expert classes under logarithmic loss.
Contributions
- Improved upper bound for complex classes and provided a lower bound.
- Proof technique is truncation free and requires only a single discretization step.
SLIDE 91 Summary
Motivation
- Make probabilistic forecasts without assumptions about the data generating process, whether i.i.d. or a more sophisticated dependence structure. Problem Setup
- Bounding minimax regret for arbitrary expert classes under logarithmic loss.
Contributions
- Improved upper bound for complex classes and provided a lower bound.
- Proof technique is truncation free and requires only a single discretization step.
Next Steps
- Match upper and lower bounds.
- Obtain bounds that interpolate between stochastic and fully adversarial.
SLIDE 92 Open Problem
Infinite Dimensional Linear Prediction
- X = B_2, the unit ball in a Hilbert space,
- F = {f(x) = (⟨w, x⟩ + 1)/2 : w ∈ B_2},
- Log-loss can be written as g_t(w) = −y_t log(1 + ⟨w, x_t⟩) − (1 − y_t) log(1 − ⟨w, x_t⟩).
SLIDE 93 Open Problem
Infinite Dimensional Linear Prediction
- X = B_2, the unit ball in a Hilbert space,
- F = {f(x) = (⟨w, x⟩ + 1)/2 : w ∈ B_2},
- Log-loss can be written as g_t(w) = −y_t log(1 + ⟨w, x_t⟩) − (1 − y_t) log(1 − ⟨w, x_t⟩). Constructive Algorithm (Rakhlin and Sridharan, 2015)
- Follow-the-Regularized-Leader with a self-concordant barrier function gives R^log_n(F) ≤ Õ(√n).
SLIDE 94 Open Problem
Infinite Dimensional Linear Prediction
- X = B_2, the unit ball in a Hilbert space,
- F = {f(x) = (⟨w, x⟩ + 1)/2 : w ∈ B_2},
- Log-loss can be written as g_t(w) = −y_t log(1 + ⟨w, x_t⟩) − (1 − y_t) log(1 − ⟨w, x_t⟩). Constructive Algorithm (Rakhlin and Sridharan, 2015)
- Follow-the-Regularized-Leader with a self-concordant barrier function gives R^log_n(F) ≤ Õ(√n).
- This is tighter than any known upper bound, including ours, and matches the lower bound.
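For concreteness, the Follow-the-Regularized-Leader update can be sketched as follows; the specific barrier choice is an assumption on my part, −log(1 − ‖w‖²) being a standard self-concordant barrier for the unit ball:

```latex
% Follow-the-Regularized-Leader with a self-concordant barrier R:
w_{t+1} = \operatorname*{arg\,min}_{w \in \mathbb{B}_2}
  \left\{ \sum_{s=1}^{t} g_s(w) + \lambda\, R(w) \right\},
\qquad R(w) = -\log\!\left(1 - \|w\|^2\right).
```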
SLIDE 95 Open Problem
Infinite Dimensional Linear Prediction
- X = B_2, the unit ball in a Hilbert space,
- F = {f(x) = (⟨w, x⟩ + 1)/2 : w ∈ B_2},
- Log-loss can be written as g_t(w) = −y_t log(1 + ⟨w, x_t⟩) − (1 − y_t) log(1 − ⟨w, x_t⟩). Constructive Algorithm (Rakhlin and Sridharan, 2015)
- Follow-the-Regularized-Leader with a self-concordant barrier function gives R^log_n(F) ≤ Õ(√n).
- This is tighter than any known upper bound, including ours, and matches the lower bound.
- It is not clear how to extend such a concrete algorithmic technique to arbitrary expert classes.