Easy Data
Peter Grünwald
Centrum Wiskunde & Informatica – Amsterdam Mathematical Institute – Leiden University
Joint work with
- W. Koolen, T. Van Erven, N. Mehta, T. Sterkenburg
Today: Three Things To Tell You
1. Nifty Reformulation of Fast Rate Conditions in Statistical Learning – Tsybakov, Bernstein, Exp-Concavity, ...
2. Fast Rates in Individual Sequence Setting
3. ...and algorithm that achieves these rates!
Van Erven, G., Mehta, Reid, Williamson. Fast Rates in Statistical and Online Learning. JMLR Special Issue in Memory of A. Chervonenkis, Oct. 2015.
VC: Vapnik-Chervonenkis (1974!) optimistic (realizability) condition
TM: Tsybakov (2004) margin condition (special case: Massart condition)
𝑣-BC: Audibert, Bousquet (2005), Bartlett, Mendelson (2006) “Bernstein condition”
(the latter two conditions are recalled below)
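For reference, a recap of the margin and Bernstein conditions in standard form (notation mine, not the slides'; writing 𝜋(x) = P(Y = 1 | X = x) for binary classification with 0/1-loss, to avoid clashing with the learning-rate symbol):

\[
\text{Massart: } |2\pi(X)-1| \ge h > 0 \ \text{a.s.}
\qquad
\text{Tsybakov: } P\big(0 < |2\pi(X)-1| \le t\big) \le C\,t^{\alpha} \ \ \text{for all } t>0
\]
\[
\text{Bernstein: } \mathbf{E}\big[(\ell_g-\ell_{g^*})^2\big] \le D\,\big(\mathbf{E}[\ell_g-\ell_{g^*}]\big)^{\beta} \ \ \text{for all } g\in\mathcal{G}
\]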
The Statistical Learning Setting
Data Z₁, Z₂, … are i.i.d. ∼ P, taking values in 𝒵, where 𝒢 is a set of predictors 𝑔, and for each 𝑔 ∈ 𝒢, ℓ_𝑔(z) indicates the loss 𝑔 makes on z (e.g. 0/1 loss, squared loss, absolute loss).
We do not require the Bayes act to be in the model.
We assume there exists a risk minimizer 𝑔∗ ∈ 𝒢, achieving min_{𝑔∈𝒢} E_{Z∼P}[ℓ_𝑔(Z)].
The 𝛽-Bernstein Condition
There exist 𝐷 > 0, 𝛽 ∈ [0,1], such that for all 𝑔 ∈ 𝒢:
E[X_𝑔²] ≤ 𝐷 ⋅ (E[X_𝑔])^𝛽,
where we set X_𝑔 := ℓ_𝑔 − ℓ_{𝑔∗} and expectations are under Z ∼ P.
– 𝑔∗ does not have to be the Bayes act, and the loss does not need to be 0/1.
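Two standard sanity checks (mine, not on the slide): any bounded excess loss satisfies the condition trivially with 𝛽 = 0, and for 0/1-loss the Massart condition yields 𝛽 = 1:

\[
|X_g| \le 1 \;\Longrightarrow\; \mathbf{E}[X_g^2] \le 1 = 1\cdot\big(\mathbf{E}[X_g]\big)^{0} \qquad (\beta = 0,\ D = 1)
\]
\[
\text{Massart margin } h: \quad \mathbf{E}[X_g^2] = P\big(g(X)\neq g^*(X)\big) \le \tfrac{1}{h}\,\mathbf{E}[X_g] \qquad (\beta = 1,\ D = 1/h)
\]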
Suppose the 𝛽-Bernstein condition holds. Then ERM (and related algorithms) achieves excess-risk rate n^{−1/(2−𝛽)}, interpolating between 1/√n and 1/n:
– 𝛽 = 0: slow rate 1/√n
– 𝛽 = 1: fast rate 1/n
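A back-of-the-envelope for where the exponent comes from (a heuristic, not the slides' proof): empirical deviations of the excess risk of 𝑔 scale as √(Var/n), and Bernstein bounds the variance by a power of the excess risk itself, so the critical 𝜀 solves

\[
\varepsilon \;\asymp\; \sqrt{\frac{D\,\varepsilon^{\beta}}{n}}
\;\Longleftrightarrow\;
\varepsilon^{2-\beta} \;\asymp\; \frac{D}{n}
\;\Longleftrightarrow\;
\varepsilon \;\asymp\; n^{-1/(2-\beta)}.
\]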
Question: is there an individual-sequence analogue? I.e., a condition on sequences together with an accompanying sequential prediction algorithm s.t. for all sequences z₁, …, z_T satisfying the condition, the algorithm's cumulative regret satisfies, for that sequence:
Regret_T = O(T^{(1−𝛽)/(2−𝛽)})   (the statistical rate times T)
Approach 1: define a sequential Bernstein condition as standard Bernstein + i.i.d. data. Even then, none of the standard algorithms achieve this... with one (?) exception!
1. Fast Rate Conditions in Statistical Learning
2. Fast Rates in Individual Sequence Setting
3. ...and algorithm that achieves these rates!
ESI (Exponential Stochastic Inequality) notation: define X ⊴_𝜗 Y as shorthand for E[e^{𝜗(X−Y)}] ≤ 1.
– X ⊴_𝜗 Y implies, via Jensen, E[X] ≤ E[Y]
– X ⊴_𝜗 Y implies, via Markov, for all 𝐵 > 0: P(X ≥ Y + 𝐵) ≤ e^{−𝜗𝐵}
Example: let X have support [−1,1] and mean 0. Then X ⊴_𝜗 𝜗/2 for all 𝜗 > 0 (Hoeffding).
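Both implications are one-liners; spelled out (standard arguments):

\[
\text{Jensen: } e^{\vartheta(\mathbf{E}[X]-\mathbf{E}[Y])} \le \mathbf{E}\big[e^{\vartheta(X-Y)}\big] \le 1 \;\Longrightarrow\; \mathbf{E}[X] \le \mathbf{E}[Y]
\]
\[
\text{Markov: } P(X \ge Y + B) = P\big(e^{\vartheta(X-Y)} \ge e^{\vartheta B}\big) \le e^{-\vartheta B}\,\mathbf{E}\big[e^{\vartheta(X-Y)}\big] \le e^{-\vartheta B}
\]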
For bounded losses, having E[X_𝑔²] ≤ u(E[X_𝑔]) for all 𝑔 ∈ 𝒢, for some nondecreasing function u, is equivalent to the existence of an increasing function 𝑣 such that for some 𝑔∗: for all 𝜀 > 0, all 𝑔 ∈ 𝒢:
ℓ_{𝑔∗} ⊴_{𝑣(𝜀)} ℓ_𝑔 + 𝜀
They term this the 𝒗-central condition
– can also be related to mixability, exp-concavity, the JRT-condition, and a condition for well-behavedness of Bayesian inference under misspecification
– for unbounded losses, it becomes different (and better!) than the Bernstein condition
– it is one-sided
(the correspondence between 𝛽 and 𝑣 is sketched below)
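The correspondence between the two conditions, as in the JMLR paper (stated here up to constant factors):

\[
\beta\text{-Bernstein} \;\Longleftrightarrow\; v\text{-central with } v(\varepsilon) \;\asymp\; \varepsilon^{1-\beta},
\]

so 𝛽 = 1 corresponds to 𝑣 bounded away from 0, i.e. the strong 𝜂-central condition E[e^{−𝜂 X_𝑔}] ≤ 1 for some fixed 𝜂 > 0.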
.....or equivalently (extending notation): the condition ℓ_{𝑔∗} ⊴_{𝑣(𝜀)} ℓ_𝑔 + 𝜀 can be restated to: for some appropriately chosen 𝜂 with 𝜂 ≤ 𝑣(𝜀):
E[e^{−𝜂 X_𝑔}] ≤ e^{𝜂𝜀}
However, the condition is now in ‘exponential’ rather than ‘expectation’ form.
1. Fast Rate Conditions in Statistical Learning
2. Fast Rates in Individual Sequence Setting
3. ...and algorithm that achieves these rates!
Suppose the 𝑣-central condition holds (e.g. because Bernstein holds), and data are i.i.d. Then, by a generic property of ESI, with 𝜃_𝜗 = 𝐷₁ ⋅ 𝑣(𝜗): for all T, for all 𝜀 > 0, with 𝜂 = 𝑣(𝜀),
Σ_{t=1..T} ℓ_{𝑔∗}(z_t) ⊴_𝜂 Σ_{t=1..T} ℓ_{𝑔_t}(z_t) + T𝜀,
not just for every fixed 𝑔, but also for every learning algorithm (with 𝑔_t chosen based on z₁, …, z_{t−1}).
We call this the cumulative 𝑣-central condition. This condition may of course also hold for non-i.i.d. data!
Fix 𝑣. For simplicity assume 𝑣(𝜀) = 𝜀^𝛾; then the regret is O(T^{𝛾/(1+𝛾)}), and even C ⋅ T^{𝛾/(1+𝛾)} for some other constant C.
The bound is achieved if we make sure all terms are of the same order, i.e. we set, at time T, 𝜀_T ∝ T^{−1/(1+𝛾)} and hence learning rate 𝜂_T = 𝑣(𝜀_T) ∝ T^{−𝛾/(1+𝛾)}.
...but the algorithm needs to know 𝑔∗, 𝛾 and T to set the learning rate!
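To make the tuning issue concrete, here is a minimal Python sketch of Hedge (exponential weights) with a fixed learning rate — the quantity that, per the slide, would need oracle knowledge of 𝛾 and T. This is an illustrative sketch under my own conventions, not the deck's exact algorithm and not Squint:

import numpy as np

def hedge(losses, eta):
    """Minimal Hedge / exponential-weights sketch.

    losses: (T, K) array; losses[t, k] = loss of expert k in round t.
    eta:    fixed learning rate -- tuning it optimally requires
            knowledge of gamma and the horizon T, as the slide notes.
    Returns (algorithm's cumulative loss, regret vs. best expert).
    """
    T, K = losses.shape
    log_w = np.zeros(K)                   # log-weights; start uniform
    alg_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())   # stabilized softmax over experts
        p /= p.sum()
        alg_loss += float(p @ losses[t])  # expected loss this round
        log_w -= eta * losses[t]          # exponential-weights update
    best = losses.sum(axis=0).min()       # best single expert in hindsight
    return alg_loss, alg_loss - best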
Back to individual sequences: Hedge and Squint are designed for individual, nonstochastic sequences.
Suppose the 𝑣-cumulative central condition holds for some 𝑣. Using martingale theory, one shows that this also implies the following, relative to any countable set of learning algorithms: for any 𝜗₁, 𝜗₂, … with corresponding 𝜃₁ = 𝑣(𝜗₁), 𝜃₂ = 𝑣(𝜗₂), …, there exists 𝐷 such that the corresponding per-sequence inequalities hold.
Hence we define (we only give the special case with 𝑣(y) = y^𝛾 here): an individual sequence satisfies the 𝑣-fast-rate condition relative to a countable set of learning algorithms and constants if there exists 𝑔∗ such that for all T > 0, for all 𝜂 > 0, with 𝜀 = 𝜂^{1/𝛾}, we have
Σ_{t=1..T} ℓ_{𝑔∗}(z_t) ≤ Σ_{t=1..T} ℓ_{𝑔_t}(z_t) + T𝜀 + 𝐷/𝜂.
Hedge (with oracle) and Squint (without oracle!) both achieve the desired regret bound – Squint: see also this workshop!
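A toy run of the hedge sketch above (synthetic data; all numbers illustrative, my own example) showing why the oracle matters on easy data: the safe worst-case rate √(log K / T) is far more conservative than what an oracle who knows the sequence is easy could pick:

import numpy as np

rng = np.random.default_rng(0)
T, K = 10_000, 10
# An "easy" sequence: expert 0 is clearly the best throughout.
losses = rng.uniform(0.4, 0.6, size=(T, K))
losses[:, 0] = rng.uniform(0.0, 0.2, size=T)

# Worst-case tuning vs. an aggressive, oracle-style learning rate.
for eta in (np.sqrt(np.log(K) / T), 1.0):
    _, regret = hedge(losses, eta)   # hedge() from the sketch above
    print(f"eta = {eta:.4f}   regret = {regret:.2f}")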
Van Erven, G., Mehta, Reid, Williamson. Fast Rates in Statistical and Online Learning. JMLR Special Issue in Memory of A. Chervonenkis, Oct. 2015.