

SLIDE 1

Easy Data

Peter Grünwald

Centrum Wiskunde & Informatica – Amsterdam Mathematical Institute – Leiden University

Joint work with

  • W. Koolen, T. Van Erven, N. Mehta, T. Sterkenburg
SLIDE 2

Today: Three Things To Tell You

  • 1. Nifty Reformulation of Conditions for Fast Rates in Statistical Learning – Tsybakov, Bernstein, Exp-Concavity, ...

  • 2. Do this via new concept: ESI
  • 3. Precise Analogue of the Bernstein Condition for Fast Rates in the Individual Sequence Setting – ... and an algorithm that achieves these rates!

SLIDE 3

Today: Three Things To Tell You

  • 1. Nifty Reformulation of Conditions for Fast Rates in Statistical Learning

  • 2. Do this via new concept: ESI
  • 3. Precise Analogue of the Bernstein Condition for Fast Rates in the Individual Sequence Setting – ... and an algorithm that achieves these rates!

SLIDE 4
  • Figure from the ‘stochmix’ paper:

Van Erven, Grünwald, Mehta, Reid, Williamson. Fast Rates in Statistical and Online Learning. JMLR Special Issue in Memory of A. Chervonenkis, Oct. 2015

VC: Vapnik–Chervonenkis (1974!) optimistic (realizability) condition
TM: Tsybakov (2004) margin condition (special case: Massart condition)
𝒗-BC: Audibert, Bousquet (2005), Bartlett, Mendelson (2006) “Bernstein Condition”

  • Does not require 0/1 or absolute loss
  • Does not require Bayes act to be in model

SLIDE 5

Decision Problem

  • A decision problem (DP) is defined as a tuple (𝑄, 𝒵, 𝒢, ℓ), where
  • 𝑄 is the distribution of a random quantity 𝑍 taking values in 𝒵,
  • the model 𝒢 is a set of predictors 𝑔, and for each 𝑔 ∈ 𝒢, ℓ_𝑔(𝑧) indicates the loss 𝑔 makes on 𝑧
  • Example: squared error loss ℓ_𝑔(𝑧) = (𝑧 − 𝑔)²
SLIDE 6

Decision Problem

  • A decision problem (DP) is defined as a tuple (𝑄, 𝒵, 𝒢, ℓ), where
  • 𝑄 is the distribution of a random quantity 𝑍 taking values in 𝒵,
  • the model 𝒢 is a set of predictors 𝑔, and for each 𝑔 ∈ 𝒢, ℓ_𝑔(𝑧) indicates the loss 𝑔 makes on 𝑧
  • We assume throughout that the model contains a risk minimizer 𝑔∗, achieving min_{𝑔∈𝒢} E[ℓ_𝑔(𝑍)]
  • E[⋅] abbreviates E_{𝑍∼𝑄}[⋅]
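
A minimal Python sketch of these definitions (the concrete 𝑄, model, and loss are illustrative choices of ours, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Decision problem: Q = N(0.3, 1); model G = constant predictors g in [-1, 1];
    # squared error loss: the loss of g on outcome z is (z - g)^2.
    predictors = np.linspace(-1.0, 1.0, 201)

    def loss(g, z):
        return (z - g) ** 2

    # Risk minimizer g*: argmin over g in G of E_{Z~Q}[loss(g, Z)],
    # with the expectation estimated by Monte Carlo.
    z = rng.normal(0.3, 1.0, size=100_000)
    risks = np.array([loss(g, z).mean() for g in predictors])
    g_star = predictors[risks.argmin()]
    print(g_star)  # close to 0.3: squared-error risk is minimized at E[Z]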
SLIDE 7

Bernstein Condition

  • Fix a DP with (for now) bounded loss
  • DP satisfies the (𝐷, 𝛽)-Bernstein condition if there exist 𝐷 > 0, 𝛽 ∈ [0,1], such that for all 𝑔 ∈ 𝒢: E[𝑋_𝑔²] ≤ 𝐷 ⋅ (E[𝑋_𝑔])^𝛽, where we set 𝑋_𝑔 ≔ ℓ_𝑔(𝑍) − ℓ_{𝑔∗}(𝑍)
  • E[𝑋_𝑔] is the ‘regret of 𝑔 relative to 𝑔∗’.
SLIDE 8

Bernstein Condition

  • Fix a DP with (for now) bounded loss
  • DP satisfies the (𝐷, 𝛽)-Bernstein condition if there exist 𝐷 > 0, 𝛽 ∈ [0,1], such that for all 𝑔 ∈ 𝒢: E[𝑋_𝑔²] ≤ 𝐷 ⋅ (E[𝑋_𝑔])^𝛽, where we set 𝑋_𝑔 ≔ ℓ_𝑔(𝑍) − ℓ_{𝑔∗}(𝑍)
  • Generalizes the Tsybakov condition: 𝑔∗ does not need to be the Bayes act, loss does not need to be 0/1
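
An illustrative numerical check (our example, not from the slides): for squared error with a well-specified location model, the Bernstein condition holds with 𝛽 = 1, so the ratio E[𝑋_𝑔²]/E[𝑋_𝑔] should stay bounded in 𝑔:

    import numpy as np

    rng = np.random.default_rng(1)
    z = rng.normal(0.3, 1.0, size=200_000)        # Z ~ Q with g* = E[Z] = 0.3
    g_star = 0.3

    for g in [0.4, 0.5, 1.0, -1.0]:
        x_g = (z - g) ** 2 - (z - g_star) ** 2    # excess loss X_g
        # (D, 1)-Bernstein: E[X_g^2] <= D * E[X_g] for a single constant D
        print(g, x_g.mean(), (x_g ** 2).mean() / x_g.mean())

Here the ratio is (up to Monte Carlo error) 4𝜎² + (𝑔 − 𝑔∗)², so any 𝐷 ≥ 4𝜎² plus the squared model diameter works.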

SLIDE 9

Bernstein Condition

  • Fix a DP with (for now) bounded loss
  • DP satisfies the (𝐷, 𝛽)-Bernstein condition if there exist 𝐷 > 0, 𝛽 ∈ [0,1], such that for all 𝑔 ∈ 𝒢: E[𝑋_𝑔²] ≤ 𝐷 ⋅ (E[𝑋_𝑔])^𝛽, where we set 𝑋_𝑔 ≔ ℓ_𝑔(𝑍) − ℓ_{𝑔∗}(𝑍)
  • Suppose data 𝑍₁, …, 𝑍_𝑇 are i.i.d. and the (𝐷, 𝛽)-Bernstein condition holds. Then...

SLIDE 10

Under Bernstein(𝑫, 𝜷)

  • Empirical risk minimization satisfies, with high prob*, E[𝑋_{𝑔̂}] = O((1/𝑇)^{1/(2−𝛽)}) (up to log factors)
  • 𝛽 = 0: condition trivially satisfied, get minimax rate O(1/√𝑇)
  • 𝛽 = 1: nice case (Massart condition), get ‘log-loss’ rate O(1/𝑇)

SLIDE 11

Under Bernstein(𝑫, 𝜷)

  • 𝜽-“Bayes” MAP satisfies, with high prob*, E[𝑋_{𝑔̂}] = O((1/𝑇)^{1/(2−𝛽)})
  • This requires setting the “learning rate” 𝜃 in terms of 𝛽 and 𝑇!
  • 𝛽 = 0: slow rate O(1/√𝑇); 𝛽 = 1: fast rate O(1/𝑇)

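A reconstruction of the tuning (consistent with the 𝑣-central connection later in the deck, where 𝜃 = 𝑣(𝜀) with 𝑣(𝜀) ∝ 𝜀^{1−𝛽}): plugging the target rate 𝜀_𝑇 into 𝑣 gives

    \[
      \varepsilon_T \;=\; T^{-\frac{1}{2-\beta}},
      \qquad
      \theta \;\propto\; \varepsilon_T^{\,1-\beta} \;=\; T^{-\frac{1-\beta}{2-\beta}} .
    \]
    % beta = 0: theta ~ T^{-1/2}, the conservative slow-rate tuning;
    % beta = 1: theta constant, the aggressive fast-rate tuning.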
SLIDE 12

GOAL: Sequential Bernstein

  • 𝜃-“Bayes” MAP satisfies, with high prob*, the fast rate above
  • GOAL: design a ‘sequential Bernstein condition’ and accompanying sequential prediction algorithm s.t.
  • 1. cumulative regret always satisfies a worst-case (slow-rate) bound, for all 𝑔∗, all sequences
  • 2. if the condition holds, it also satisfies, with high prob*, the corresponding fast-rate bound
SLIDE 13

GOAL: Sequential Bernstein

  • GOAL: design a ‘sequential Bernstein condition’ and accompanying sequential prediction algorithm s.t.
  • 1. cumulative regret always satisfies a worst-case (slow-rate) bound, for all 𝑔∗, all sequences
  • 2. if the condition holds, it also satisfies, with high prob*, the corresponding fast-rate bound
SLIDE 14

DREAM

  • DREAM: design a ‘sequential Bernstein condition’ and accompanying sequential prediction algorithm s.t.
  • 1. cumulative regret always satisfies a worst-case (slow-rate) bound, for all 𝑔∗, all sequences
  • 2. if the condition holds for a given sequence, then cumulative regret satisfies, for that sequence, the corresponding fast-rate bound

SLIDE 15

GOAL: Sequential Bernstein

  • GOAL: design a ‘sequential Bernstein condition’ s.t.
  • 1. for all 𝑔∗, all sequences, a worst-case (slow-rate) regret bound holds
  • 2. if the condition holds, it also satisfies, with high prob*, the fast-rate bound

Approach 1: define seq. Bernstein as standard Bernstein + i.i.d. Even then none of the standard algorithms achieve this... with one (?) exception!

SLIDE 16

Today: Three Things To Tell You

  • 1. Nifty Reformulation of Fast Rate Conditions in Statistical Learning
  • 2. Do this via new concept: ESI
  • 3. Precise Analogue of the Bernstein Condition for Fast Rates in the Individual Sequence Setting – ... and an algorithm that achieves these rates!

SLIDE 17

Exponential Stochastic Inequality (ESI)

  • For any given 𝜃 > 0 we write 𝒀 ≤*_𝜽 𝝑 as shorthand for E[exp(𝜃(𝑌 − 𝜗))] ≤ 1
  • 𝑌 ≤*_𝜃 𝜗 implies, via Jensen, E[𝑌] ≤ 𝜗
  • 𝑌 ≤*_𝜃 𝜗 implies, via Markov, for all 𝐵 > 0: P(𝑌 ≥ 𝜗 + 𝐵) ≤ exp(−𝜃𝐵)

SLIDE 18

ESI-Example

  • Hoeffding’s Inequality: suppose that 𝑌 has support [−1, 1] and mean 0. Then, for all 𝜃 > 0: 𝑌 ≤*_𝜃 𝜃/2
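
A quick Monte Carlo sanity check of this ESI form of Hoeffding in Python (the Rademacher choice of 𝑌 is ours, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.choice([-1.0, 1.0], size=1_000_000)   # mean 0, support [-1, 1]

    for theta in [0.1, 0.5, 1.0, 2.0]:
        # ESI form of Hoeffding: E[exp(theta * (Y - theta/2))] <= 1
        print(theta, np.exp(theta * (y - theta / 2)).mean())

Every printed value should be at most 1 (for Rademacher 𝑌 the exact value is cosh(𝜃)·e^{−𝜃²/2}).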

SLIDE 19

ESI – More Properties

  • For i.i.d. rvs 𝑌, 𝑌₁, …, 𝑌_𝑇 with 𝑌 ≤*_𝜃 𝜗 we have Σ_{𝑡=1}^{𝑇} 𝑌_𝑡 ≤*_𝜃 𝑇𝜗
  • For arbitrary rvs 𝑌, 𝑍 with 𝑌 ≤*_𝜃 𝜗 and 𝑍 ≤*_𝜃 𝜗′ we have 𝑌 + 𝑍 ≤*_{𝜃/2} 𝜗 + 𝜗′
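
Reconstructed one-line proofs of both properties (independence for the first, Cauchy–Schwarz for the second):

    \[
      \mathbf{E}\Bigl[e^{\theta\bigl(\sum_{t=1}^{T} Y_t - T\vartheta\bigr)}\Bigr]
      \;=\; \prod_{t=1}^{T}\mathbf{E}\bigl[e^{\theta(Y_t-\vartheta)}\bigr] \;\le\; 1
      \qquad \text{(independence)}
    \]
    \[
      \mathbf{E}\Bigl[e^{\frac{\theta}{2}(Y+Z-\vartheta-\vartheta')}\Bigr]
      \;\le\; \sqrt{\mathbf{E}\bigl[e^{\theta(Y-\vartheta)}\bigr]}\,
              \sqrt{\mathbf{E}\bigl[e^{\theta(Z-\vartheta')}\bigr]} \;\le\; 1
      \qquad \text{(Cauchy--Schwarz)}
    \]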
SLIDE 20

Bernstein in ESI Terms

  • Most general form of the Bernstein condition: for some nondecreasing function 𝑢: for all 𝑔 ∈ 𝒢, E[𝑋_𝑔²] ≤ 𝑢(E[𝑋_𝑔]) (the (𝐷, 𝛽)-case corresponds to 𝑢(𝑦) = 𝐷𝑦^𝛽)

SLIDE 21

Bernstein in ESI Terms

  • Most general form of the Bernstein condition: for some nondecreasing function 𝑢: for all 𝑔 ∈ 𝒢, E[𝑋_𝑔²] ≤ 𝑢(E[𝑋_𝑔])
  • Van Erven et al. (2015) show this is equivalent to the central condition (next slide) holding for some nondecreasing function 𝑣 with 𝑣(𝑦) of order 𝑦/𝑢(𝑦)
SLIDE 22

v-Central Condition

  • Van Erven et al. (2015) show the Bernstein condition is equivalent to the existence of an increasing function 𝑣 such that for some 𝑔∗ ∈ 𝒢: for all 𝜀 > 0 and all 𝑔 ∈ 𝒢, E[exp(−𝑣(𝜀) ⋅ 𝑋_𝑔)] ≤ exp(𝑣(𝜀) ⋅ 𝜀)

They term this the 𝒗-central condition

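A numerical illustration (our example, not from the slides): for the well-specified squared-error problem above, even the strongest version with constant 𝑣 (the 𝜂-central condition with 𝜀 = 0) holds for any 𝜂 ≤ 1/(2𝜎²):

    import numpy as np

    rng = np.random.default_rng(2)
    z = rng.normal(0.3, 1.0, size=1_000_000)   # Z ~ N(0.3, 1); g* = 0.3
    eta = 0.25                                  # any eta <= 1/(2 sigma^2) works here

    for g in [0.0, 0.5, 1.0, -1.0]:
        x_g = (z - g) ** 2 - (z - 0.3) ** 2     # excess loss X_g
        # eta-central condition at epsilon = 0: E[exp(-eta * X_g)] <= 1
        print(g, np.exp(-eta * x_g).mean())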
SLIDE 23

v-Central Condition

  • Van Erven et al. (2015) show the Bernstein condition is equivalent to the existence of an increasing function 𝑣 such that for some 𝑔∗ ∈ 𝒢: for all 𝜀 > 0 and all 𝑔 ∈ 𝒢, E[exp(−𝑣(𝜀) ⋅ 𝑋_𝑔)] ≤ exp(𝑣(𝜀) ⋅ 𝜀)

They term this the 𝒗-central condition
– can also be related to mixability, exp-concavity, the JRT-condition, and a condition for well-behavedness of Bayesian inference under misspecification

SLIDE 24

v-Central Condition

  • Van Erven et al. (2015) show the Bernstein condition is equivalent to the existence of an increasing function 𝑣 such that for some 𝑔∗ ∈ 𝒢: for all 𝜀 > 0 and all 𝑔 ∈ 𝒢, E[exp(−𝑣(𝜀) ⋅ 𝑋_𝑔)] ≤ exp(𝑣(𝜀) ⋅ 𝜀)

They term this the 𝒗-central condition
– can also be related to mixability, exp-concavity, the JRT-condition, and a condition for well-behavedness of Bayesian inference under misspecification
– for unbounded losses, it becomes different (and better!) than the Bernstein condition
– it is one-sided

SLIDE 25

Three Equivalent Notions for Bounded Losses

  • 𝑣-central condition in terms of regret: for all 𝜀 > 0, all 𝑔 ∈ 𝒢: −𝑋_𝑔 ≤*_{𝑣(𝜀)} 𝜀

.....or equivalently (extending notation): 𝑋_𝑔 ≥*_{𝑣(𝜀)} −𝜀

SLIDE 26

Three Equivalent Notions for Bounded Losses

  • 𝑣-central condition in terms of regret: 𝑋_𝑔 ≥*_{𝑣(𝜀)} −𝜀, with 𝑣 nondecreasing
  • For bounded losses, this turns out to be equivalent to: for some appropriately chosen 𝑣̄ with 𝑣̄ ≍ 𝑣: 𝑋_𝑔 ≥*_{𝑣̄(𝜀_𝑔)} 𝜀_𝑔/2, where 𝜀_𝑔 = E[𝑋_𝑔]
SLIDE 27

Three Equivalent Notions for Bounded Losses

  • 𝑣-central condition in terms of regret: 𝑋_𝑔 ≥*_{𝑣(𝜀)} −𝜀, with 𝑣 nondecreasing
  • For bounded losses, this turns out to be equivalent to: for some appropriately chosen 𝑣̄ with 𝑣̄ ≍ 𝑣: 𝑋_𝑔 ≥*_{𝑣̄(𝜀_𝑔)} 𝜀_𝑔/2, where 𝜀_𝑔 = E[𝑋_𝑔]
  • More similar to the original Bernstein condition. However, the condition is now in ‘exponential’ rather than ‘expectation’ form

SLIDE 28

Today: Three Things To Tell You

  • 1. Nifty Reformulation of Fast Rate Conditions in Statistical Learning
  • 2. Do this via new concept: ESI
  • 3. Precise Analogue of the Bernstein Condition for Fast Rates in the Individual Sequence Setting – ... and an algorithm that achieves these rates!

SLIDE 29
T-fold v-Central Condition

  • Suppose that the 𝑣-central condition holds (i.e., 𝑦/𝑣(𝑦)-Bernstein holds) and data are i.i.d. Then, by the generic i.i.d. property of ESI, with 𝜃_𝜗 = 𝐷₁ ⋅ 𝑣(𝜗): Σ_{𝑡=1}^{𝑇} 𝑋_{𝑔,𝑡} ≥*_{𝜃_𝜗} −𝜗𝑇, where 𝑋_{𝑔,𝑡} = ℓ_𝑔(𝑍_𝑡) − ℓ_{𝑔∗}(𝑍_𝑡)

SLIDE 30
T-fold v-Central Condition

  • Under the 𝑣-central cond. and i.i.d. data, with 𝜃_𝜗 = 𝐷₁ ⋅ 𝑣(𝜗): Σ_{𝑡=1}^{𝑇} 𝑋_{𝑔,𝑡} ≥*_{𝜃_𝜗} −𝜗𝑇 for every fixed 𝑔 ∈ 𝒢, but also Σ_{𝑡=1}^{𝑇} 𝑋_{𝑔̂_𝑡,𝑡} ≥*_{𝜃_𝜗} −𝜗𝑇 for every learning algorithm 𝑔̂ with 𝑔̂_𝑡 ∈ 𝒢

SLIDE 31
Cumulative v-Central Condition

  • Under the 𝑣-central cond. and i.i.d. data, with 𝜃_𝜗 = 𝐷₁ ⋅ 𝑣(𝜗): Σ_{𝑡=1}^{𝑇} 𝑋_{𝑔,𝑡} ≥*_{𝜃_𝜗} −𝜗𝑇, but also Σ_{𝑡=1}^{𝑇} 𝑋_{𝑔̂_𝑡,𝑡} ≥*_{𝜃_𝜗} −𝜗𝑇 for every learning algorithm
  • This condition may of course also hold for non-i.i.d. data. It is the condition we need, so we term it the cumulative 𝑣-central condition

SLIDE 32

Hedge with Oracle Learning Rate

  • Hedge with learning rate 𝜃 achieves, for all 𝑔∗ ∈ 𝒢 and all sequences, the standard regret bound of order log 𝐾/𝜃 + 𝜃𝑇
  • We assume the cumulative 𝑣-central condition for some 𝑣. For simplicity assume 𝑣(𝜗) = 𝜗^𝛾; then the condition holds with 𝜃_𝜗 = 𝐷₁ ⋅ 𝜗^𝛾, and even with 𝜃_𝜗 = 𝜗^𝛾 at the cost of some other constant in the bound
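
For concreteness, a minimal exponential-weights sketch of Hedge with a fixed learning rate (our illustration for finitely many experts; the slides' setting is a countable model 𝒢):

    import numpy as np

    def hedge(loss_matrix, theta):
        """Hedge (exponential weights) with fixed learning rate theta.

        loss_matrix[t, k] = loss of expert k in round t, assumed in [0, 1].
        Returns (algorithm's cumulative loss, best expert's cumulative loss).
        """
        T, K = loss_matrix.shape
        log_w = np.zeros(K)                    # log-weights; uniform prior
        alg_loss = 0.0
        for t in range(T):
            p = np.exp(log_w - log_w.max())    # normalize in log-space for stability
            p /= p.sum()
            alg_loss += p @ loss_matrix[t]     # expected loss of the algorithm's play
            log_w -= theta * loss_matrix[t]    # exponential-weights update
        return alg_loss, loss_matrix.sum(axis=0).min()

With 𝜃 ≈ √(8 log 𝐾 / 𝑇) this recovers the familiar worst-case √(𝑇 log 𝐾) regret; the oracle tuning on the next slide instead sets 𝜃 via 𝑣.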

SLIDE 33

Hedge with Oracle Learning Rate

  • Combining, we get: regret of order log 𝐾/𝜃_𝜗 + 𝜗𝑇
  • We can set 𝜗 (or eqv. 𝜃) as we like. Best possible bound achieved if we make sure all terms are of the same order, i.e. we set, at time 𝑇, 𝜗 = (log 𝐾/𝑇)^{1/(1+𝛾)}
  • and then 𝜃_𝜗 = 𝑣(𝜗) and regret = O(𝑇^{𝛾/(1+𝛾)} (log 𝐾)^{1/(1+𝛾)})
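
The balancing arithmetic, reconstructed for the special case 𝑣(𝜗) = 𝜗^𝛾 (constants suppressed):

    \[
      \frac{\log K}{\theta_\vartheta} + \vartheta T
      \;=\; \frac{\log K}{\vartheta^{\gamma}} + \vartheta T ,
      \qquad
      \vartheta \;=\; \Bigl(\frac{\log K}{T}\Bigr)^{\frac{1}{1+\gamma}}
      \;\Longrightarrow\;
      \text{regret} \;=\; O\Bigl(T^{\frac{\gamma}{1+\gamma}}\,(\log K)^{\frac{1}{1+\gamma}}\Bigr) .
    \]
    % gamma = 1 recovers the slow sqrt(T log K) rate;
    % gamma = 0 gives the fast log K rate.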
SLIDE 34

Squint without Oracle Learning Rate!

  • Hedge achieves the ESI-(!)-bound above ... but needs to know 𝑔∗, 𝛾 and 𝑇 to set the learning rate!
  • Squint (Koolen and Van Erven ’15)
  • achieves the same bound without knowing these!
  • Gets the bound with 𝛾 = 0 automatically for individual sequences
  • What about AdaNormalHedge? (Luo & Schapire ’15)
SLIDE 35

Dessert: Easy Data Rather than Distributions

  • We are working with algorithms such as Hedge and Squint, designed for individual, nonstochastic sequences
  • Yet the condition is stochastic
  • Does there exist a nonstochastic analogue?
  • Answer is yes:
SLIDE 36

Non-Stochastic Inequality

Suppose the cumulative 𝑣-central condition holds for some 𝑣. Using martingale theory one shows that this also implies the following:

  • fix a countable, otherwise arbitrary set of learning algorithms.
  • Fix a decreasing sequence 𝜗₁, 𝜗₂, … and set corresponding 𝜃₁ = 𝑣(𝜗₁), 𝜃₂ = 𝑣(𝜗₂), …
  • Then we have with probability 1: for every learning algorithm in the set and every 𝑗, there exists 𝐷 such that for all 𝑇: Σ_{𝑡=1}^{𝑇} 𝑋_{𝑔̂_𝑡,𝑡} ≥ −𝜗_𝑗 𝑇 − 𝐷
SLIDE 37

Individual Sequence Condition

Hence we define (we only give the special case with 𝑣(𝑦) = 𝑦^𝛾 here): an individual sequence satisfies the 𝑣-fast rate condition relative to a countable set of learning algorithms and constants 𝐷₁, 𝐷₂, … if there exists 𝑔∗ such that for all 𝑇 > 0 and all 𝑗, with 𝜃_𝑗 = 𝑣(𝜗_𝑗), we have Σ_{𝑡=1}^{𝑇} 𝑋_{𝑔̂_𝑡,𝑡} ≥ −𝜗_𝑗 𝑇 − 𝐷_𝑗

SLIDE 38

Conclusion

  • If a sequence satisfies the 𝑣-fast rate condition, then Hedge (with oracle) and Squint (without oracle) both achieve the desired regret bound
  • We’ve removed all stochastics!
  • Similar idea used by György and Szepesvári in this workshop!
  • Notion implies a (very close!) analogy to Martin-Löf randomness

Van Erven, Grünwald, Mehta, Reid, Williamson. Fast Rates in Statistical and Online Learning. JMLR Special Issue in Memory of A. Chervonenkis, Oct. 2015

SLIDE 39

Say something about: L* bound, unbounded losses, mixability, JRT, exp-concavity, .... Tell Csaba, Peter B, Philippe: \eta \leq u(\epsilon), but also with \eta = u(\epsilon). Star means...