SLIDE 1

SMILE Seminar, 24 September 2012

Mixability in Statistical Learning

Tim van Erven

Joint work with: Peter Grünwald, Mark Reid, Bob Williamson

SLIDE 2

Summary

  • Stochastic mixability ⇔ fast rates of convergence in different settings:
  • statistical learning (margin condition)
  • sequential prediction (mixability)
SLIDE 3

Outline

  • Part 1: Statistical learning
  • Stochastic mixability (definition)
  • Equivalence to margin condition
  • Part 2: Sequential prediction
  • Part 3: Convexity interpretation for stochastic mixability
  • Part 4: Grünwald’s idea for adaptation to the margin
SLIDE 4

Notation

SLIDE 5

Notation

  • Data: (X_1, Y_1), …, (X_n, Y_n)
  • Predict Y from X: F = {f : X → A}
  • Loss: ℓ : Y × A → [0, ∞]

SLIDE 6

Notation: Classification

  • Data: (X_1, Y_1), …, (X_n, Y_n)
  • Predict Y from X: F = {f : X → A}, with Y = {0, 1}, A = {0, 1}
  • Loss: 0/1-loss, ℓ(y, a) = 0 if y = a and 1 if y ≠ a

SLIDE 7

Notation: Classification / Density estimation

  • Data: (X_1, Y_1), …, (X_n, Y_n)
  • Predict Y from X: F = {f : X → A}
  • Classification: Y = {0, 1}, A = {0, 1}, ℓ(y, a) = 0 if y = a and 1 if y ≠ a
  • Density estimation: A = density functions on Y, ℓ(y, p) = − log p(y)

SLIDE 8

Notation: Classification / Density estimation

  • Data: (X_1, Y_1), …, (X_n, Y_n)
  • Predict Y from X: F = {f : X → A}
  • Classification: Y = {0, 1}, A = {0, 1}, ℓ(y, a) = 0 if y = a and 1 if y ≠ a
  • Density estimation: A = density functions on Y, ℓ(y, p) = − log p(y)
  • Without X: F ⊂ A

SLIDE 9

Statistical Learning

SLIDE 10–11

Statistical Learning

  • Data: (X_1, Y_1), …, (X_n, Y_n) i.i.d. ∼ P*
  • Best predictor in the model: f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]
  • Excess risk of an estimator f̂: d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))]

SLIDE 12

Statistical Learning

  • (X_1, Y_1), …, (X_n, Y_n) i.i.d. ∼ P*,  f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]
  • At what rate does the excess risk go to zero? d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))] = O(n^{−?})

SLIDE 13

Statistical Learning

  • (X_1, Y_1), …, (X_n, Y_n) i.i.d. ∼ P*,  f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]
  • d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))] = O(n^{−?})
  • Two factors determine the rate of convergence:
  • 1. the complexity of F
  • 2. the margin condition

SLIDE 14

Definition of Stochastic Mixability

  • Let η ≥ 0. Then (ℓ, F, P*) is η-stochastically mixable if there exists an f* ∈ F such that

      E[ exp(−η ℓ(Y, f(X))) / exp(−η ℓ(Y, f*(X))) ] ≤ 1   for all f ∈ F.

  • Stochastically mixable: this holds for some η > 0.

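To make the definition concrete, here is a minimal Monte Carlo sketch (not from the slides): it estimates the left-hand side of the inequality for a toy setup in which the loss, the model F, and the sampling distribution P* are all made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(y, a):                      # squared loss: a placeholder choice
    return (y - a) ** 2

# Hypothetical finite model F and sampling distribution P*.
F = [lambda x: 0.0, lambda x: 0.5 * x, lambda x: x]

def sample(n):
    X = rng.uniform(-1, 1, n)
    Y = 0.5 * X + rng.normal(0, 0.3, n)
    return X, Y

def risk(f, n=200_000):
    X, Y = sample(n)
    return np.mean(loss(Y, f(X)))

f_star = min(F, key=risk)            # approximate risk minimizer within F

def is_stochastically_mixable(eta, n=200_000, tol=1e-3):
    # Check E[exp(-eta*loss(Y,f(X))) / exp(-eta*loss(Y,f*(X)))] <= 1 for all f in F.
    X, Y = sample(n)
    ratios = [np.mean(np.exp(-eta * (loss(Y, f(X)) - loss(Y, f_star(X))))) for f in F]
    return all(r <= 1 + tol for r in ratios)   # tol = Monte Carlo slack

for eta in [0.1, 0.5, 1.0, 2.0]:
    print(eta, is_stochastically_mixable(eta))
```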
SLIDE 15

Immediate Consequences

  • f* minimizes risk over F: f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]
  • The larger η, the stronger the property of being η-stochastically mixable:

      E[ exp(−η ℓ(Y, f(X))) / exp(−η ℓ(Y, f*(X))) ] ≤ 1   for all f ∈ F

SLIDE 16

Density estimation example 1

  • Log-loss: ℓ(y, p) = − log p(y),  F = {p_θ | θ ∈ Θ}
  • Suppose p_{θ*} ∈ F is the true density
  • Then for η = 1 and any p_θ ∈ F:

      E[ exp(−η ℓ(Y, p_θ)) / exp(−η ℓ(Y, p_{θ*})) ] = ∫ p_θ(y) / p_{θ*}(y) P*(dy) = 1

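Spelling out the last equality (immediate, but not written on the slide): since p_{θ*} is the true density, P*(dy) = p_{θ*}(y) dy, so the likelihood ratio integrates to one:

```latex
\mathbb{E}\!\left[\frac{e^{-\ell(Y,\,p_\theta)}}{e^{-\ell(Y,\,p_{\theta^*})}}\right]
  = \int \frac{p_\theta(y)}{p_{\theta^*}(y)}\, p_{\theta^*}(y)\, dy
  = \int p_\theta(y)\, dy
  = 1 .
```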
SLIDE 17

Density estimation example 2

SLIDE 18

Density estimation example 2

  • Normal location family with fixed variance σ²: F = {N(μ, σ²) | μ ∈ ℝ}
  • η-stochastically mixable for P* = N(μ*, τ²) with η = σ²/τ²:

      E[ exp(−η ℓ(Y, p_μ)) / exp(−η ℓ(Y, p_{μ*})) ]
        = 1/√(2πτ²) ∫ exp( −η/(2σ²)·(y−μ)² + η/(2σ²)·(y−μ*)² − 1/(2τ²)·(y−μ*)² ) dy
        = 1/√(2πτ²) ∫ exp( −1/(2τ²)·(y−μ)² ) dy = 1

SLIDE 19

Density estimation example 2 (continued)

  • If f̂ is the empirical mean: E[d(f̂, f*)] = τ²/(2σ²n) = η⁻¹/(2n)

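Two steps that the slide compresses, spelled out here: substituting η = σ²/τ² makes the exponents cancel to a single Gaussian, and the stated rate for the empirical mean follows from E[(μ̂ − μ*)²] = τ²/n.

```latex
% Exponent simplification with \eta = \sigma^2/\tau^2:
-\tfrac{\eta}{2\sigma^2}(y-\mu)^2 + \tfrac{\eta}{2\sigma^2}(y-\mu^*)^2 - \tfrac{1}{2\tau^2}(y-\mu^*)^2
  = -\tfrac{1}{2\tau^2}(y-\mu)^2 ,
% so the integral equals \sqrt{2\pi\tau^2}, cancelling the 1/\sqrt{2\pi\tau^2} prefactor.

% Rate for the empirical mean \hat\mu under log-loss:
d(p_{\hat\mu}, p_{\mu^*}) = \frac{(\hat\mu-\mu^*)^2}{2\sigma^2}, \qquad
\mathbb{E}\bigl[(\hat\mu-\mu^*)^2\bigr] = \frac{\tau^2}{n}
\;\Longrightarrow\;
\mathbb{E}\bigl[d(\hat f, f^*)\bigr] = \frac{\tau^2}{2\sigma^2 n} = \frac{\eta^{-1}}{2n} .
```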
SLIDE 20

Outline

  • Part 1: Statistical learning
  • Stochastic mixability (definition)
  • Equivalence to margin condition
  • Part 2: Sequential prediction
  • Part 3: Convexity interpretation for stochastic mixability
  • Part 4: Grünwald’s idea for adaptation to the margin
SLIDE 21

Margin condition

  • Margin condition [Tsybakov, 2004]: there exist κ ≥ 1 and c_0 > 0 such that

      c_0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F,

    where d(f, f*) = E[ℓ(Y, f(X)) − ℓ(Y, f*(X))] and V(f, f*) = E[ (ℓ(Y, f(X)) − ℓ(Y, f*(X)))² ]
  • For 0/1-loss this implies the rate of convergence O(n^{−κ/(2κ−1)})
  • So smaller κ is better

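Instantiating the exponent κ/(2κ − 1) (simple arithmetic, added for orientation):

```latex
\kappa = 1:\ O(n^{-1})\ \text{(fast rate)}, \qquad
\kappa = 2:\ O(n^{-2/3}), \qquad
\kappa \to \infty:\ O(n^{-1/2})\ \text{(worst case)}.
```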
SLIDE 22

Stochastic mixability ⇔ margin condition

  • Thm [κ = 1]: Suppose ℓ takes values in [0, V]. Then (ℓ, F, P*) is stochastically mixable if and only if there exists c_0 > 0 such that the margin condition

      c_0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F

    is satisfied with κ = 1.

SLIDE 23

Margin condition with κ > 1

  • Thm [all κ ≥ 1]: Suppose ℓ takes values in [0, V]. Then the margin condition is satisfied if and only if there exists a constant C > 0 such that, for all ε > 0, (ℓ, F_ε, P*) is η-stochastically mixable for η = C ε^{(κ−1)/κ}, where

      F_ε = {f*} ∪ {f ∈ F | d(f, f*) ≥ ε}.

SLIDE 24

Outline

  • Part 1: Statistical learning
  • Part 2: Sequential prediction
  • Part 3: Convexity interpretation for stochastic mixability
  • Part 4: Grünwald’s idea for adaptation to the margin
SLIDE 25

Sequential Prediction with Expert Advice

  • For rounds t = 1, …, n:
  • K experts predict f̂_t^1, …, f̂_t^K
  • Predict by choosing f̂_t
  • Observe (x_t, y_t)
  • Regret = (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t(x_t)) − min_k (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t^k(x_t))
  • Game-theoretic (minimax) analysis: want to guarantee small regret against adversarial data

SLIDE 26

Sequential Prediction with Expert Advice

  • For rounds t = 1, …, n:
  • K experts predict f̂_t^1, …, f̂_t^K
  • Predict by choosing f̂_t
  • Observe (x_t, y_t)
  • Regret = (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t(x_t)) − min_k (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t^k(x_t))
  • Worst-case regret = O(1/n) iff the loss is mixable! [Vovk, 1995]

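As a concrete illustration of the mixability → O(1/n) direction (a sketch, not from the slides): for log-loss, which is 1-mixable, Vovk's Aggregating Algorithm reduces to predicting with the weighted mixture of the experts' densities, and its cumulative regret is at most ln K, i.e. O(1/n) per round on average. The experts below are fixed Bernoulli densities, a toy choice.

```python
import numpy as np

rng = np.random.default_rng(1)

eta = 1.0                                            # log-loss is 1-mixable
expert_probs = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # expert k always predicts P(y=1)
K = len(expert_probs)
w = np.ones(K) / K                                   # uniform prior weights

n = 1000
ys = rng.binomial(1, 0.7, size=n)                    # the data could also be adversarial

cum_loss_alg = 0.0
cum_loss_experts = np.zeros(K)
for y in ys:
    p_experts = np.where(y == 1, expert_probs, 1 - expert_probs)  # each expert's density at y
    loss_experts = -np.log(p_experts)
    p_alg = np.dot(w, p_experts)                     # mixture prediction (AA for log-loss)
    cum_loss_alg += -np.log(p_alg)
    cum_loss_experts += loss_experts
    w = w * np.exp(-eta * loss_experts)              # = w * p_experts: Bayesian weight update
    w /= w.sum()

regret = cum_loss_alg - cum_loss_experts.min()
print(f"cumulative regret {regret:.3f} <= ln K = {np.log(K):.3f}")
```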
SLIDE 27

Mixability

  • A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

      E_{A∼π}[ exp(−η ℓ(y, A)) / exp(−η ℓ(y, a_π)) ] ≤ 1   for all y.

  • Vovk: fast O(1/n) rates if and only if the loss is mixable

SLIDE 28

(Stochastic) Mixability

  • A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

      E_{A∼π}[ exp(−η ℓ(y, A)) / exp(−η ℓ(y, a_π)) ] ≤ 1   for all y.

  • (ℓ, F, P*) is η-stochastically mixable if

      E_{X,Y∼P*}[ exp(−η ℓ(Y, f(X))) / exp(−η ℓ(Y, f*(X))) ] ≤ 1   for all f ∈ F.

SLIDE 30

(Stochastic) Mixability

  • A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

      ℓ(y, a_π) ≤ −(1/η) ln ∫ exp(−η ℓ(y, a)) π(da)   for all y.

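The two formulations of η-mixability (the ratio form two slides back and the log-integral form here) are the same statement; taking logarithms:

```latex
\mathbb{E}_{A\sim\pi}\!\left[\frac{e^{-\eta\ell(y,A)}}{e^{-\eta\ell(y,a_\pi)}}\right] \le 1
\;\Longleftrightarrow\;
e^{-\eta\ell(y,a_\pi)} \ge \int e^{-\eta\ell(y,a)}\,\pi(da)
\;\Longleftrightarrow\;
\ell(y,a_\pi) \le -\tfrac{1}{\eta}\ln\!\int e^{-\eta\ell(y,a)}\,\pi(da).
```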
SLIDE 31

(Stochastic) Mixability

  • A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

      ℓ(y, a_π) ≤ −(1/η) ln ∫ exp(−η ℓ(y, a)) π(da)   for all y.

  • Thm: (ℓ, F, P*) is η-stochastically mixable iff for any distribution π on F there exists f* ∈ F such that

      E[ℓ(Y, f*(X))] ≤ E[ −(1/η) ln ∫ exp(−η ℓ(Y, f(X))) π(df) ]

SLIDE 32

Equivalence of Stochastic Mixability and Ordinary Mixability

SLIDE 33

Equivalence of Stochastic Mixability and Ordinary Mixability

  • Thm: Suppose ℓ is a proper loss and X is discrete. Then ℓ is η-mixable if and only if (ℓ, F_full, P*) is η-stochastically mixable for all P*, where

      F_full = {all functions from X to A}.

SLIDE 34

Equivalence of Stochastic Mixability and Ordinary Mixability

  • Thm: Suppose ℓ is a proper loss and X is discrete. Then ℓ is η-mixable if and only if (ℓ, F_full, P*) is η-stochastically mixable for all P*, where F_full = {all functions from X to A}.
  • Proper losses are e.g. 0/1-loss, log-loss, squared loss
  • The theorem generalizes to other losses that satisfy two technical conditions

SLIDE 35

Summary

  • Stochastic mixability ⇔ fast rates of convergence in different settings:
  • statistical learning (margin condition)
  • sequential prediction (mixability)
SLIDE 36

Outline

  • Part 1: Statistical learning
  • Part 2: Sequential prediction
  • Part 3: Convexity interpretation for stochastic mixability
  • Part 4: Grünwald’s idea for adaptation to the margin
SLIDE 37

Density estimation example 1 (recap)

  • Log-loss: ℓ(y, p) = − log p(y),  F = {p_θ | θ ∈ Θ}
  • Suppose p_{θ*} ∈ F is the true density
  • Then for η = 1 and any p_θ ∈ F:

      E[ exp(−η ℓ(Y, p_θ)) / exp(−η ℓ(Y, p_{θ*})) ] = ∫ p_θ(y) / p_{θ*}(y) P*(dy) = 1

SLIDE 38

Log-loss example 3 (convex F)

  • Log-loss: ℓ(y, p) = − log p(y),  F = {p_θ | θ ∈ Θ}
  • Suppose the model is misspecified: p_{θ*} = arg min_{p_θ ∈ F} E[− log p_θ(Y)] is not the true density
  • Thm [Li, 1999]: Suppose F is convex. Then

      ∫ p_θ(y) / p_{θ*}(y) P*(dy) ≤ 1   for all p_θ ∈ F

  • Convexity is a common condition for convergence of minimum description length and Bayesian methods

SLIDE 39

Log-loss and convexity for η = 1

SLIDE 40

Log-loss and convexity for η = 1

  • Thm: (ℓ, F, P*) is η-stochastically mixable iff for any distribution π on F there exists f* ∈ F such that

      E[ℓ(Y, f*(X))] ≤ E[ −(1/η) ln ∫ exp(−η ℓ(Y, f(X))) π(df) ]

SLIDE 41

Log-loss and convexity for η = 1

  • Thm: (ℓ, F, P*) is η-stochastically mixable iff for any distribution π on F there exists f* ∈ F such that

      E[ℓ(Y, f*(X))] ≤ E[ −(1/η) ln ∫ exp(−η ℓ(Y, f(X))) π(df) ]

  • Corollary: For log-loss, 1-stochastic mixability means

      min_{p ∈ F} E[− ln p(Y)] = min_{p ∈ co(F)} E[− ln p(Y)],

    where co(F) denotes the convex hull of F.

SLIDE 42

Log-loss and convexity for η = 1

  • Corollary: For log-loss, 1-stochastic mixability means

      min_{p ∈ F} E[− ln p(Y)] = min_{p ∈ co(F)} E[− ln p(Y)],

    where co(F) denotes the convex hull of F.

  • [Figure: two panels, “Stochastically mixable” and “Not stochastically mixable”, showing F, its convex hull co(F), and the minimizers p_{θ*} over F and p* over co(F).]

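A toy numerical check (not from the slides) of this convex-hull criterion for log-loss. With F = {N(−1, 1), N(1, 1)} and P* = N(0, 1), the best mixture in co(F) has strictly smaller expected log-loss than the best element of F, so this (ℓ, F, P*) is not 1-stochastically mixable; the particular densities are my own example.

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(y, mu, sigma=1.0):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

Y = rng.normal(0.0, 1.0, size=500_000)            # samples from P* = N(0, 1)

def expected_log_loss(density_vals):
    return np.mean(-np.log(density_vals))

p0, p1 = normal_pdf(Y, -1.0), normal_pdf(Y, 1.0)  # the two densities in F, evaluated at Y

risk_F = min(expected_log_loss(p0), expected_log_loss(p1))
risk_hull = min(expected_log_loss(a * p0 + (1 - a) * p1)   # grid over co(F)
                for a in np.linspace(0, 1, 101))

print(f"min over F     : {risk_F:.4f}")
print(f"min over co(F) : {risk_hull:.4f}  (strictly smaller => not 1-stochastically mixable)")
```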
SLIDE 43

Convexity interpretation with pseudo-likelihoods

  • Pseudo-likelihoods: p_{f,η}(Y|X) = exp(−η ℓ(Y, f(X))),  P_F(η) = {p_{f,η}(Y|X) | f ∈ F}
  • Corollary: (ℓ, F, P*) is η-stochastically mixable iff

      min_{p ∈ P_F(η)} E[−(1/η) ln p(Y|X)] = min_{p ∈ co(P_F(η))} E[−(1/η) ln p(Y|X)]

SLIDE 44

Outline

  • Part 1: Statistical learning
  • Part 2: Sequential prediction
  • Part 3: Convexity interpretation for stochastic mixability
  • Part 4: Grünwald’s idea for adaptation to the margin
SLIDE 45

Adapting to the margin / η

  • Penalized empirical risk minimization:

      f̂ = arg min_{f ∈ F} { (1/n) Σ_{i=1}^n ℓ(Y_i, f(X_i)) + λ · pen(f) }

  • The optimal λ ∝ 1/η depends on η / the margin
  • Single model: take pen(f) = const., so no need to know λ
  • Model selection: F = ∪_m F_m,  pen(f) = pen(m) ≠ const.

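A schematic sketch (not from the slides) of the penalized estimator above in the model-selection case F = ∪_m F_m with pen(f) = pen(m); the models, loss, and penalty values are placeholders.

```python
import numpy as np

def penalized_erm(models, penalties, loss, X, Y, lam):
    """models: list of lists of predictors f: x -> action; penalties: pen(m) per model.
    lam plays the role of the tuning parameter, ideally proportional to 1/eta."""
    best_f, best_obj = None, np.inf
    for F_m, pen_m in zip(models, penalties):
        for f in F_m:
            emp_risk = np.mean([loss(y, f(x)) for x, y in zip(X, Y)])
            obj = emp_risk + lam * pen_m          # penalized empirical risk
            if obj < best_obj:
                best_f, best_obj = f, obj
    return best_f
```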
SLIDE 46

Convexity testing

SLIDE 47

Convexity testing

  • Corollary: (ℓ, F, P*) is η-stochastically mixable iff

      min_{p ∈ P_F(η)} E[−(1/η) ln p(Y|X)] = min_{p ∈ co(P_F(η))} E[−(1/η) ln p(Y|X)]

SLIDE 48

Convexity testing

  • Corollary: (ℓ, F, P*) is η-stochastically mixable iff

      min_{p ∈ P_F(η)} E[−(1/η) ln p(Y|X)] = min_{p ∈ co(P_F(η))} E[−(1/η) ln p(Y|X)]

  • [Grünwald, 2011]: pick the largest η such that

      min_{p ∈ P_F(η)} (1/n) Σ_{i=1}^n −(1/η) ln p(Y_i|X_i) ≥ min_{p ∈ co(P_F(η))} (1/n) Σ_{i=1}^n −(1/η) ln p(Y_i|X_i) − something,

    where “something” depends on concentration inequalities and the penalty function.

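A rough sketch (not from the slides, and not Grünwald's actual procedure) of how such an empirical test could look: for each η on a grid, compare the best empirical pseudo-likelihood risk over P_F(η) with the best over a finite approximation of co(P_F(η)), and keep the largest η for which the gap stays within a slack term. The slack and the random-mixture approximation of the convex hull are placeholder choices.

```python
import numpy as np

def empirical_risk(p_vals, eta):
    # p_vals: pseudo-likelihood values p_{f,eta}(Y_i | X_i) for one p, shape (n,)
    return np.mean(-np.log(p_vals) / eta)

def largest_convex_eta(losses, etas, slack, n_mix=200, seed=0):
    """losses: array of shape (|F|, n) with loss(Y_i, f(X_i)) for each f in F."""
    rng = np.random.default_rng(seed)
    best_eta = None
    for eta in sorted(etas):
        P = np.exp(-eta * losses)                        # pseudo-likelihoods P_F(eta)
        risk_F = min(empirical_risk(p, eta) for p in P)
        W = rng.dirichlet(np.ones(len(P)), size=n_mix)   # random mixtures approximating co(P_F(eta))
        risk_hull = min(empirical_risk(w @ P, eta) for w in W)
        if risk_F >= risk_hull - slack:                  # empirical convexity holds at this eta
            best_eta = eta
    return best_eta
```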
SLIDE 49

Summary

  • Stochastic mixability ⇔ fast rates of convergence in different settings:
  • statistical learning (margin condition)
  • sequential prediction (mixability)
  • Convexity interpretation
  • Idea for adaptation to the margin
SLIDE 50

References

  • P. D. Grünwald. Safe learning: Bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In Proceedings of the 24th Conference on Learning Theory (COLT 2011), pp. 551–573, 2011.
  • J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 2009.
  • B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics, 2006.
  • A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
  • Y. Kalnishkan and M. V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74:1228–1244, 2008.
  • J. Li. Estimation of Mixture Models. PhD thesis, Yale University, 1999.
  • V. Vovk. A game of prediction with expert advice. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pp. 51–60. ACM, 1995.

Slides and NIPS 2012 paper: www.timvanerven.nl