

SLIDE 1

HAL Id: hal-00668212
https://hal.archives-ouvertes.fr/hal-00668212
Submitted on 9 Feb 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Nathalie Villa-Vialaneix, Fabrice Rossi. Classification and regression based on derivatives: a consistency result. II Simposio sobre Modelamiento Estadístico, Dec 2010, Valparaiso, Chile. hal-00668212

SLIDE 2

Classification and regression based on derivatives: a consistency result

Nathalie Villa-Vialaneix (joint work with Fabrice Rossi)
http://www.nathalievilla.org
II Simposio sobre Modelamiento Estadístico, Valparaiso, December 3rd

1 / 30 Nathalie Villa-Vialaneix

SLIDE 3

Introduction and motivations

Outline

1. Introduction and motivations
2. A general consistency result
3. Examples

SLIDE 8

Introduction and motivations

Regression and classification from an infinite dimensional predictor

Settings

$(X, Y)$ is a random pair of variables where

- $Y \in \{-1, 1\}$ (binary classification problem) or $Y \in \mathbb{R}$;
- $X \in (\mathcal{X}, \langle \cdot, \cdot \rangle_{\mathcal{X}})$, an infinite dimensional Hilbert space.

We are given a learning set $S_n = \{(X_i, Y_i)\}_{i=1}^n$ of $n$ i.i.d. copies of $(X, Y)$.

Purpose: find $\phi_n : \mathcal{X} \to \{-1, 1\}$ or $\mathbb{R}$ that is universally consistent:

- Classification case: $\lim_{n\to+\infty} P(\phi_n(X) \neq Y) = L^*$, where $L^* = \inf_{\phi:\mathcal{X}\to\{-1,1\}} P(\phi(X) \neq Y)$ is the Bayes risk.
- Regression case: $\lim_{n\to+\infty} \mathbb{E}\left[\phi_n(X) - Y\right]^2 = L^*$, where $L^* = \inf_{\phi:\mathcal{X}\to\mathbb{R}} \mathbb{E}\left[\phi(X) - Y\right]^2$ will also be called the Bayes risk.

SLIDE 9

Introduction and motivations

An example

Predicting the rate of yellow berry in durum wheat from its NIR spectrum.

SLIDE 12

Introduction and motivations

Using derivatives

Practically, $X^{(m)}$ is often more relevant than $X$ for the prediction. But $X \mapsto X^{(m)}$ induces information loss:

$\inf_{\phi: D^m\mathcal{X} \to \{-1,1\}} P\left(\phi(X^{(m)}) \neq Y\right) \geq \inf_{\phi:\mathcal{X}\to\{-1,1\}} P\left(\phi(X) \neq Y\right) = L^*$

and

$\inf_{\phi: D^m\mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X^{(m)}) - Y\right]^2 \geq \inf_{\phi:\mathcal{X}\to\mathbb{R}} \mathbb{E}\left[\phi(X) - Y\right]^2 = L^*.$

SLIDE 16

Introduction and motivations

Sampled functions

Practically, the $(X_i)_i$ are not perfectly known; only a discrete sampling is given: $X_i^{\tau_d} = (X_i(t))_{t\in\tau_d}$ where $\tau_d = \{t_1^{\tau_d}, \ldots, t_{|\tau_d|}^{\tau_d}\}$.

The sampling can be non uniform... and the data can be corrupted by noise.

Then, $X_i^{(m)}$ is estimated from $X_i^{\tau_d}$ by $\widehat{X}^{(m)}_{\tau_d}$, which also induces information loss:

$\inf_{\phi: D^m\mathcal{X}\to\{-1,1\}} P\left(\phi(\widehat{X}^{(m)}_{\tau_d}) \neq Y\right) \geq \inf_{\phi: D^m\mathcal{X}\to\{-1,1\}} P\left(\phi(X^{(m)}) \neq Y\right) \geq L^*$

and

$\inf_{\phi: D^m\mathcal{X}\to\mathbb{R}} \mathbb{E}\left[\phi(\widehat{X}^{(m)}_{\tau_d}) - Y\right]^2 \geq \inf_{\phi: D^m\mathcal{X}\to\mathbb{R}} \mathbb{E}\left[\phi(X^{(m)}) - Y\right]^2 \geq L^*.$

SLIDE 18

Introduction and motivations

Purpose of the presentation

Find a classifier or a regression function $\phi_{n,\tau_d}$ built from $\widehat{X}^{(m)}_{\tau_d}$ such that the risk of $\phi_{n,\tau_d}$ asymptotically reaches the Bayes risk $L^*$:

$\lim_{|\tau_d|\to+\infty} \lim_{n\to+\infty} P\left(\phi_{n,\tau_d}(\widehat{X}^{(m)}_{\tau_d}) \neq Y\right) = L^*$

or

$\lim_{|\tau_d|\to+\infty} \lim_{n\to+\infty} \mathbb{E}\left[\phi_{n,\tau_d}(\widehat{X}^{(m)}_{\tau_d}) - Y\right]^2 = L^*.$

Main idea: use a relevant way to estimate $X^{(m)}$ from $X^{\tau_d}$ (by smoothing splines) and combine the consistency of splines with the consistency of an $\mathbb{R}^{|\tau_d|}$-classifier or regression function.

SLIDE 19

A general consistency result

Outline

1. Introduction and motivations
2. A general consistency result
3. Examples

SLIDE 22

A general consistency result

Basics about smoothing splines I

Suppose that $\mathcal{X}$ is the Sobolev space

$\mathcal{H}^m = \left\{ h \in L^2_{[0,1]} \mid \forall\, j = 1, \ldots, m,\ D^j h \text{ exists (weak sense) and } D^m h \in L^2 \right\}$

equipped with the scalar product

$\langle u, v \rangle_{\mathcal{H}^m} = \langle D^m u, D^m v \rangle_{L^2} + \sum_{j=1}^m B_j u \, B_j v$

where the $B_j$ are $m$ boundary conditions such that $\mathrm{Ker}\, B \cap \mathcal{P}^{m-1} = \{0\}$.

$(\mathcal{H}^m, \langle\cdot,\cdot\rangle_{\mathcal{H}^m})$ is a RKHS: there exist $k_0 : \mathcal{P}^{m-1} \times \mathcal{P}^{m-1} \to \mathbb{R}$ and $k_1 : \mathrm{Ker}\,B \times \mathrm{Ker}\,B \to \mathbb{R}$ such that

$\forall\, u \in \mathcal{P}^{m-1},\ t \in [0,1]: \quad \langle u, k_0(t, \cdot) \rangle_{\mathcal{H}^m} = u(t)$

and

$\forall\, u \in \mathrm{Ker}\,B,\ t \in [0,1]: \quad \langle u, k_1(t, \cdot) \rangle_{\mathcal{H}^m} = u(t).$

See [Berlinet and Thomas-Agnan, 2004] for further details.

SLIDE 23

A general consistency result

Basics about smoothing splines II

A simple example of boundary conditions: $h(0) = h^{(1)}(0) = \ldots = h^{(m-1)}(0) = 0$. Then,

$k_0(s, t) = \sum_{k=0}^{m-1} \frac{t^k s^k}{(k!)^2}$

and

$k_1(s, t) = \int_0^1 \frac{(t - w)_+^{m-1} \, (s - w)_+^{m-1}}{((m-1)!)^2} \, dw.$

10 / 30 Nathalie Villa-Vialaneix
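For $m = 2$ (cubic smoothing splines), $(m-1)! = 1$ and the integral defining $k_1$ reduces to a simple closed form, $k_1(s,t) = stu - (s+t)u^2/2 + u^3/3$ with $u = \min(s,t)$. The short numpy sketch below checks that closed form against direct numerical integration; it is purely illustrative and the function names are ours.

```python
import numpy as np

# Closed form of k1 for m = 2 with boundary conditions h(0) = h'(0) = 0.
def k1_closed(s, t):
    u = min(s, t)
    return s * t * u - (s + t) * u ** 2 / 2 + u ** 3 / 3

# Trapezoidal quadrature of \int_0^1 (t - w)_+ (s - w)_+ dw for comparison.
def k1_quad(s, t, n=200_001):
    w = np.linspace(0.0, 1.0, n)
    f = np.maximum(t - w, 0.0) * np.maximum(s - w, 0.0)
    h = w[1] - w[0]
    return float(h * (np.sum(f) - 0.5 * (f[0] + f[-1])))

val_closed = k1_closed(0.2, 0.7)
val_quad = k1_quad(0.2, 0.7)
```

The two values agree to quadrature accuracy, which is a quick sanity check before using $k_1$ to build the Gram matrices of the next slides.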

SLIDE 26

A general consistency result

Estimating the predictors with smoothing splines I

Assumption (A1)

- $|\tau_d| \geq m - 1$;
- the sampling points are distinct in $[0, 1]$;
- the $B_j$ are linearly independent from $h \mapsto h(t)$ for all $t \in \tau_d$.

[Kimeldorf and Wahba, 1971]: for $x^{\tau_d}$ in $\mathbb{R}^{|\tau_d|}$, there is a unique $\hat{x}_{\lambda,\tau_d} \in \mathcal{H}^m$ solution of

$\arg\min_{h\in\mathcal{H}^m} \frac{1}{|\tau_d|} \sum_{l=1}^{|\tau_d|} \left(h(t_l) - x_l^{\tau_d}\right)^2 + \lambda \int_{[0,1]} \left(h^{(m)}(t)\right)^2 dt,$

and $\hat{x}_{\lambda,\tau_d} = S_{\lambda,\tau_d} x^{\tau_d}$ where $S_{\lambda,\tau_d} : \mathbb{R}^{|\tau_d|} \to \mathcal{H}^m$ is linear.

These assumptions are fulfilled by the previous simple example as long as $0 \notin \tau_d$.

11 / 30 Nathalie Villa-Vialaneix
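The minimization above can be solved with a few linear systems once $k_1$ is known: writing $h = \sum_j d_j \omega_j + \sum_l c_l k_1(t_l, \cdot)$ and setting $M = K_1 + |\tau_d|\lambda I$, stationarity gives $d = (U M^{-1} U^T)^{-1} U M^{-1} x$ and $c = M^{-1}(x - U^T d)$. Below is a minimal numpy sketch for $m = 2$ with boundary conditions $h(0) = h'(0) = 0$ (so $\mathcal{P}^{m-1}$ has basis $\{1, t\}$ and $k_1$ has the closed form of the previous slide). It is our illustrative implementation under those assumptions, not the authors' code.

```python
import numpy as np

# Reproducing kernel of Ker B for m = 2, h(0) = h'(0) = 0.
def k1(s, t):
    u = np.minimum(s, t)
    return s * t * u - (s + t) * u ** 2 / 2 + u ** 3 / 3

def smoothing_spline(tau, x, lam):
    """Return h minimizing (1/n) sum_l (h(t_l) - x_l)^2 + lam * int (h'')^2."""
    n = len(tau)
    K = k1(tau[:, None], tau[None, :])      # Gram matrix of k1 on tau
    U = np.vstack([np.ones(n), tau])        # m x n matrix (omega_i(t_l)), basis {1, t}
    M = K + n * lam * np.eye(n)             # the 1/n in the objective rescales lam
    MinvUt = np.linalg.solve(M, U.T)
    Minvx = np.linalg.solve(M, x)
    d = np.linalg.solve(U @ MinvUt, U @ Minvx)   # polynomial part coefficients
    c = np.linalg.solve(M, x - U.T @ d)          # kernel part coefficients
    return lambda t: d[0] + d[1] * np.asarray(t) + \
        k1(np.asarray(t)[:, None], tau[None, :]) @ c

tau = np.linspace(0.01, 1.0, 40)     # sampling grid avoiding 0 (0 not in tau_d)
x = np.sin(2 * np.pi * tau)          # a sampled curve
h = smoothing_spline(tau, x, lam=1e-6)
fit = h(tau)
```

With a small $\lambda$ the spline nearly interpolates the samples; increasing $\lambda$ trades fidelity for smoothness of the estimated $x^{(m)}$.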

SLIDE 28

A general consistency result

Estimating the predictors with smoothing splines II

$S_{\lambda,\tau_d}$ is given by:

$S_{\lambda,\tau_d} = \omega^T \left(U (K_1 + \lambda I_{|\tau_d|})^{-1} U^T\right)^{-1} U (K_1 + \lambda I_{|\tau_d|})^{-1} + \eta^T (K_1 + \lambda I_{|\tau_d|})^{-1} \left( I_{|\tau_d|} - U^T \left(U (K_1 + \lambda I_{|\tau_d|})^{-1} U^T\right)^{-1} U (K_1 + \lambda I_{|\tau_d|})^{-1} \right) = \omega^T M_0 + \eta^T M_1$

with $\{\omega_1, \ldots, \omega_m\}$ a basis of $\mathcal{P}^{m-1}$, $\omega = (\omega_1, \ldots, \omega_m)^T$, $U = (\omega_i(t))_{i=1,\ldots,m,\ t\in\tau_d}$, $\eta = (k_1(t, \cdot))^T_{t\in\tau_d}$ and $K_1 = (k_1(t, t'))_{t,t'\in\tau_d}$.

The observations of the predictor $X$ (NIR spectra) are then estimated from their sampling $X^{\tau_d}$ by $\widehat{X}_{\lambda,\tau_d}$.

SLIDE 34

A general consistency result

Two important consequences

1. No information loss:

$\inf_{\phi:\mathcal{H}^m\to\{-1,1\}} P\left(\phi(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) = \inf_{\phi:\mathbb{R}^{|\tau_d|}\to\{-1,1\}} P\left(\phi(X^{\tau_d}) \neq Y\right)$

and

$\inf_{\phi:\mathcal{H}^m\to\mathbb{R}} \mathbb{E}\left[\phi(\widehat{X}_{\lambda,\tau_d}) - Y\right]^2 = \inf_{\phi:\mathbb{R}^{|\tau_d|}\to\mathbb{R}} \mathbb{E}\left[\phi(X^{\tau_d}) - Y\right]^2.$

2. Easy way to use derivatives:

$(u^{\tau_d})^T M_0^T W M_0 v^{\tau_d} + (u^{\tau_d})^T M_1^T K_1 M_1 v^{\tau_d} = (u^{\tau_d})^T M_{\lambda,\tau_d} v^{\tau_d} = (Q_{\lambda,\tau_d} u^{\tau_d})^T (Q_{\lambda,\tau_d} v^{\tau_d}) = \langle \hat{u}_{\lambda,\tau_d}, \hat{v}_{\lambda,\tau_d} \rangle_{\mathcal{H}^m} \simeq \langle \hat{u}^{(m)}_{\lambda,\tau_d}, \hat{v}^{(m)}_{\lambda,\tau_d} \rangle_{L^2}$

where $K_1$, $M_0$ and $M_1$ have been previously defined, $W = (\langle \omega_i, \omega_j \rangle_{\mathcal{H}^m})_{i,j=1,\ldots,m}$, $M_{\lambda,\tau_d} = M_0^T W M_0 + M_1^T K_1 M_1$ is symmetric positive definite, and $Q_{\lambda,\tau_d}$ is the Cholesky triangle of $M_{\lambda,\tau_d}$: $Q_{\lambda,\tau_d}^T Q_{\lambda,\tau_d} = M_{\lambda,\tau_d}$.

Remark: $Q_{\lambda,\tau_d}$ is calculated only from the RKHS, $\lambda$ and $\tau_d$: it does not depend on the data set.

13 / 30 Nathalie Villa-Vialaneix
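The Cholesky trick is plain linear algebra: for any symmetric positive definite $M$, mapping vectors through $Q$ with $Q^T Q = M$ turns the $M$-weighted inner product into the ordinary Euclidean one. A numpy sketch, where M is an arbitrary SPD matrix standing in for $M_{\lambda,\tau_d}$ (which, in the slides, is computed from the RKHS only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary symmetric positive definite stand-in for M_{lambda,tau_d}.
A = rng.standard_normal((6, 6))
M = A @ A.T + 6 * np.eye(6)

# np.linalg.cholesky returns lower-triangular L with L @ L.T = M,
# so Q = L.T satisfies Q.T @ Q = M (the "Cholesky triangle" of the slides).
Q = np.linalg.cholesky(M).T

u = rng.standard_normal(6)
v = rng.standard_normal(6)

lhs = u @ M @ v            # M-weighted inner product
rhs = (Q @ u) @ (Q @ v)    # plain Euclidean inner product after mapping by Q
```

This is why any $\mathbb{R}^{|\tau_d|}$ method based on the Euclidean scalar product can be reused unchanged on the transformed coordinates $Q_{\lambda,\tau_d} u^{\tau_d}$.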

SLIDE 37

A general consistency result

Classification and regression based on derivatives

Suppose that we know a consistent classifier or regression function in $\mathbb{R}^{|\tau_d|}$ that is based on the $\mathbb{R}^{|\tau_d|}$ scalar product or norm. Example: nonparametric kernel regression,

$\Psi : u \in \mathbb{R}^{|\tau_d|} \mapsto \frac{\sum_{i=1}^n T_i \, K\!\left( \frac{\|u - U_i\|_{\mathbb{R}^{|\tau_d|}}}{h_n} \right)}{\sum_{i=1}^n K\!\left( \frac{\|u - U_i\|_{\mathbb{R}^{|\tau_d|}}}{h_n} \right)}$

where $(U_i, T_i)_{i=1,\ldots,n}$ is a learning set in $\mathbb{R}^{|\tau_d|} \times \mathbb{R}$.

The corresponding derivative based classifier or regression function is obtained by using the norm induced by $Q_{\lambda,\tau_d}$:

$\phi_{n,d} = \Psi \circ Q_{\lambda,\tau_d} : x \in \mathcal{H}^m \mapsto \frac{\sum_{i=1}^n Y_i \, K\!\left( \frac{\|Q_{\lambda,\tau_d} x^{\tau_d} - Q_{\lambda,\tau_d} X_i^{\tau_d}\|_{\mathbb{R}^{|\tau_d|}}}{h_n} \right)}{\sum_{i=1}^n K\!\left( \frac{\|Q_{\lambda,\tau_d} x^{\tau_d} - Q_{\lambda,\tau_d} X_i^{\tau_d}\|_{\mathbb{R}^{|\tau_d|}}}{h_n} \right)} \simeq \frac{\sum_{i=1}^n Y_i \, K\!\left( \frac{\|x^{(m)} - X_i^{(m)}\|_{L^2}}{h_n} \right)}{\sum_{i=1}^n K\!\left( \frac{\|x^{(m)} - X_i^{(m)}\|_{L^2}}{h_n} \right)}.$

14 / 30
Nathalie Villa-Vialaneix
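The $\Psi$ above is the classical Nadaraya-Watson estimator; applying it after the $Q_{\lambda,\tau_d}$ transform gives the derivative-based version. A minimal numpy sketch of $\Psi$ itself, with a Gaussian kernel; the names and test function are ours, for illustration only:

```python
import numpy as np

def kernel_regression(U, T, u, h):
    """Nadaraya-Watson prediction at the rows of u from training pairs (U, T)."""
    # Pairwise Euclidean distances between query points and training points.
    d = np.linalg.norm(u[:, None, :] - U[None, :, :], axis=2)
    w = np.exp(-0.5 * (d / h) ** 2)     # Gaussian kernel weights
    return (w @ T) / w.sum(axis=1)      # locally weighted average of the targets

rng = np.random.default_rng(1)
U = rng.uniform(0, 1, size=(300, 1))                              # 1-D inputs
T = np.sin(2 * np.pi * U[:, 0]) + 0.05 * rng.standard_normal(300)  # noisy targets
pred = kernel_regression(U, T, np.array([[0.25]]), h=0.05)
```

To obtain the derivative-based estimator of the slide, one would call the same function on `Q @ x_tau` vectors instead of the raw samples.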

SLIDE 41

A general consistency result

Remark for consistency

Classification case (approximately the same is true for regression):

$P\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) - L^* = \left[ P\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) - L_d^* \right] + \left[ L_d^* - L^* \right]$

where $L_d^* = \inf_{\phi:\mathbb{R}^{|\tau_d|}\to\{-1,1\}} P\left(\phi(X^{\tau_d}) \neq Y\right)$.

1. For all fixed $d$, $\lim_{n\to+\infty} P\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) = L_d^*$ as long as the $\mathbb{R}^{|\tau_d|}$-classifier is consistent, because there is a one-to-one mapping between $X^{\tau_d}$ and $\widehat{X}_{\lambda,\tau_d}$.

2. $L_d^* - L^* \leq \mathbb{E}\left| \mathbb{E}(Y \mid \widehat{X}_{\lambda,\tau_d}) - \mathbb{E}(Y \mid X) \right|$: with consistency of the spline estimate $\widehat{X}_{\lambda,\tau_d}$ and an assumption on the regularity of $\mathbb{E}(Y \mid X = \cdot)$, consistency would be proved. But continuity of $\mathbb{E}(Y \mid X = \cdot)$ is a strong assumption in the infinite dimensional case, and is not easy to check.

SLIDE 43

A general consistency result

Spline consistency

Let $\lambda$ depend on $d$ and denote by $(\lambda_d)_d$ the sequence of regularization parameters. Also introduce

$\overline{\Delta \tau_d} := \max\{t_1,\ t_2 - t_1,\ \ldots,\ 1 - t_{|\tau_d|}\}, \qquad \underline{\Delta \tau_d} := \min_{1 \leq i < |\tau_d|} \{t_{i+1} - t_i\}.$

Assumption (A2)

- $\exists\, R$ such that $\overline{\Delta\tau_d} / \underline{\Delta\tau_d} \leq R$ for all $d$;
- $\lim_{d\to+\infty} |\tau_d| = +\infty$;
- $\lim_{d\to+\infty} \lambda_d = 0$.

[Ragozin, 1983]: under (A1) and (A2), there exist $A_{R,m}$ and $B_{R,m}$ such that for any $x \in \mathcal{H}^m$ and any $\lambda_d > 0$,

$\|\hat{x}_{\lambda_d,\tau_d} - x\|^2_{L^2} \leq \left( A_{R,m}\, \lambda_d + B_{R,m}\, \frac{1}{|\tau_d|^{2m}} \right) \|D^m x\|^2_{L^2} \xrightarrow[d\to+\infty]{} 0.$

SLIDE 46

A general consistency result

Bayes risk consistency

Assumption (A3a): $\mathbb{E}\left[\|D^m X\|^2_{L^2}\right]$ is finite and $Y \in \{-1, 1\}$.

or

Assumption (A3b): $\tau_d \subset \tau_{d+1}$ for all $d$ and $\mathbb{E}(Y^2)$ is finite.

Result: under (A1)-(A3), $\lim_{d\to+\infty} L_d^* = L^*$.

SLIDE 49

A general consistency result

Proof under assumption (A3a)

Assumption (A3a): $\mathbb{E}\left[\|D^m X\|^2_{L^2}\right]$ is finite and $Y \in \{-1, 1\}$.

The proof is based on a result of [Faragó and Györfi, 1975]: for a pair of random variables $(X, Y)$ taking their values in $\mathcal{X} \times \{-1, 1\}$, where $\mathcal{X}$ is an arbitrary metric space, and for a sequence of functions $T_d : \mathcal{X} \to \mathcal{X}$ such that

$\mathbb{E}\left(\delta(T_d(X), X)\right) \xrightarrow[d\to+\infty]{} 0,$

we have $\lim_{d\to+\infty} \inf_{\phi:\mathcal{X}\to\{-1,1\}} P\left(\phi(T_d(X)) \neq Y\right) = L^*$.

Here $T_d$ is the spline estimate based on the sampling; the inequality of [Ragozin, 1983] about this estimate is exactly the assumption of Faragó and Györfi's theorem. The result follows.

SLIDE 51

A general consistency result

Proof under assumption (A3b)

Assumption (A3b): $\tau_d \subset \tau_{d+1}$ for all $d$ and $\mathbb{E}(Y^2)$ is finite.

Under (A3b), $\left(\mathbb{E}(Y \mid \widehat{X}_{\lambda_d,\tau_d})\right)_d$ is a uniformly bounded martingale and thus converges in $L^1$ norm. Using the consistency of $(\widehat{X}_{\lambda_d,\tau_d})_d$ to $X$ ends the proof.

SLIDE 52

A general consistency result

Concluding result (consistency)

Theorem: under assumptions (A1)-(A3),

$\lim_{|\tau_d|\to+\infty} \lim_{n\to+\infty} P\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) \neq Y\right) = L^*$

and

$\lim_{|\tau_d|\to+\infty} \lim_{n\to+\infty} \mathbb{E}\left[\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) - Y\right]^2 = L^*.$

Proof: for $\epsilon > 0$, fix $d_0$ such that, for all $d \geq d_0$, $L_d^* - L^* \leq \epsilon/2$. Then conclude by consistency of the $\mathbb{R}^{|\tau_d|}$-classifier or regression function.

SLIDE 54

A general consistency result

A practical application to SVM I

Recall that, for a learning set $(U_i, T_i)_{i=1,\ldots,n}$ in $\mathbb{R}^p \times \{-1,1\}$, the Gaussian SVM is the classifier

$u \in \mathbb{R}^p \mapsto \mathrm{Sign}\left( \sum_{i=1}^n \alpha_i T_i \, e^{-\gamma \|u - U_i\|^2_{\mathbb{R}^p}} \right)$

where the $(\alpha_i)_i$ satisfy the quadratic optimization problem

$\arg\min_w \sum_{i=1}^n \left( 1 - T_i \, w(U_i) \right)_+ + C \|w\|^2_{\mathcal{S}}$

with $w(u) = \sum_{i=1}^n \alpha_i e^{-\gamma \|u - U_i\|^2_{\mathbb{R}^p}}$, $\mathcal{S}$ the RKHS associated with the Gaussian kernel, and $C$ a regularization parameter.

Under suitable assumptions, [Steinwart, 2002] proves the consistency of SVM classifiers.

SLIDE 56

A general consistency result

A practical application to SVM II

Additional assumptions related to SVM. Assumptions (A4):

- For all $d$, the regularization parameter depends on $n$ in such a way that $\lim_{n\to+\infty} n C_n^d = +\infty$ and $C_n^d = O\!\left(n^{\beta_d - 1}\right)$ for a $0 < \beta_d < 1/d$.
- For all $d$, there is a bounded subset $B_d$ of $\mathbb{R}^{|\tau_d|}$ such that $X^{\tau_d}$ belongs to $B_d$.

Result: under assumptions (A1)-(A4), the SVM

$\phi_{n,d} : x \in \mathcal{H}^m \mapsto \mathrm{Sign}\left( \sum_{i=1}^n \alpha_i Y_i \, e^{-\gamma \|Q_{\lambda_d,\tau_d} x^{\tau_d} - Q_{\lambda_d,\tau_d} X_i^{\tau_d}\|^2} \right) \simeq \mathrm{Sign}\left( \sum_{i=1}^n \alpha_i Y_i \, e^{-\gamma \|x^{(m)} - X_i^{(m)}\|^2_{L^2}} \right)$

is consistent: $\lim_{|\tau_d|\to+\infty} \lim_{n\to+\infty} P\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) \neq Y\right) = L^*$.

22 / 30 Nathalie Villa-Vialaneix
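In practice the SVM above only needs the Gaussian kernel evaluated after the $Q_{\lambda_d,\tau_d}$ map, and since $Q^T Q = M$ the transformed kernel can be computed directly from $M$: $\|Qu - Qv\|^2 = (u - v)^T M (u - v)$. A numpy sketch of this identity, with an arbitrary SPD matrix standing in for $M_{\lambda,\tau_d}$ (names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
p, gamma = 5, 0.3

# Arbitrary SPD stand-in for M_{lambda,tau_d} and its Cholesky triangle.
A = rng.standard_normal((p, p))
M = A @ A.T + p * np.eye(p)
Q = np.linalg.cholesky(M).T          # Q^T Q = M

u = rng.standard_normal(p)
v = rng.standard_normal(p)

# Gaussian kernel computed on transformed coordinates...
k_transformed = np.exp(-gamma * np.sum((Q @ u - Q @ v) ** 2))
# ...equals the kernel computed directly from M, with no explicit transform.
k_from_M = np.exp(-gamma * (u - v) @ M @ (u - v))
```

So any off-the-shelf Gaussian-kernel SVM can be fed either the pre-transformed vectors $Q u^{\tau_d}$ or, equivalently, a kernel matrix built from $M$.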

SLIDE 57

A general consistency result

Additional remark about the link between $n$ and $|\tau_d|$

Under suitable (and usual) regularity assumptions on $\mathbb{E}(Y \mid X = \cdot)$, and if $n \sim \nu |\tau_d| \log |\tau_d|$, the rate of convergence of this method is of order $d^{-\frac{2\nu}{2\nu+1}}$, where $\nu$ is either equal to $m$ or to a Lipschitz constant related to $\mathbb{E}(Y \mid X = \cdot)$.

SLIDE 58

Examples

Outline

1. Introduction and motivations
2. A general consistency result
3. Examples

SLIDE 60

Examples

Chosen regression method: regression with kernel ridge regression

Recall that kernel ridge regression in $\mathbb{R}^p$ is given by solving

$\arg\min_w \sum_{i=1}^n \left(T_i - w(U_i)\right)^2 + C \|w\|^2_{\mathcal{S}}$

where $\mathcal{S}$ is the RKHS induced by a given kernel (such as the Gaussian kernel) and $(U_i, T_i)_i$ is a training sample in $\mathbb{R}^p \times \mathbb{R}$.

In the following examples, $U_i$ is either:

- the original (sampled) functions $X_i$ (viewed as $\mathbb{R}^{|\tau_d|}$ vectors);
- $Q_{\lambda,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2.

25 / 30 Nathalie Villa-Vialaneix
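By the representer theorem, the minimizer is $w(u) = \sum_i \alpha_i k(u, U_i)$ with $\alpha = (K + C I)^{-1} T$, where $K$ is the training Gram matrix. A minimal numpy sketch with a Gaussian kernel; the function names and the toy data are ours, for illustration only:

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gram matrix exp(-gamma * ||a - b||^2) between rows of A and rows of B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * d2)

def krr_fit_predict(U, T, u, gamma=10.0, C=1e-3):
    K = gaussian_gram(U, U, gamma)
    alpha = np.linalg.solve(K + C * np.eye(len(U)), T)   # ridge-regularized solve
    return gaussian_gram(u, U, gamma) @ alpha            # w(u) at the query points

rng = np.random.default_rng(3)
U = rng.uniform(0, 1, size=(200, 1))
T = np.cos(2 * np.pi * U[:, 0]) + 0.05 * rng.standard_normal(200)
pred = krr_fit_predict(U, T, np.array([[0.5]]))
```

Feeding `krr_fit_predict` the vectors $Q_{\lambda,\tau_d} X_i^{\tau_d}$ instead of raw samples gives the derivative-based variant used in the experiments.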

SLIDE 62

Examples

Example 1: predicting yellow berry in durum wheat from NIR spectra

953 wheat samples were analyzed:

- NIR spectrometry: 1049 wavelengths regularly ranged from 400 to 2498 nm;
- yellow berry: manual count (%) of affected grains.

Methodology for comparison:

- split the data into train/test sets (50 times);
- train 50 regression functions on the 50 train sets (hyper-parameters tuned by CV);
- evaluate these regression functions by calculating the MSE on the 50 corresponding test sets.

SLIDE 63

Examples

Example 1: predicting yellow berry in durum wheat from NIR spectra

Kernel (SVM)                            | MSE on test (sd ×10⁻³)
Linear (L)                              | 0.122 (8.77)
Linear on derivatives (L(1))            | 0.138 (9.53)
Linear on second derivatives (L(2))     | 0.122 (1.71)
Gaussian (G)                            | 0.110 (20.2)
Gaussian on derivatives (G(1))          | 0.098 (7.92)
Gaussian on second derivatives (G(2))   | 0.094 (8.35)

The differences are significant between G(2) / G(1) and between G(1) / G.

SLIDE 64

Examples

Comparison with PLS...

                       | MSE (mean) | MSE (sd)
PLS                    | 0.154      | 0.012
Kernel PLS             | 0.154      | 0.013
KRR splines (reg. D2)  | 0.094      | 0.008

Error decrease: almost 40%.

[Figure: boxplots of test MSE for SVM-D2, KPLS and PLS.]

SLIDE 65

Examples

Example 2: simulated noisy spectra

Original data. Variable to predict: fat content of pieces of meat.

SLIDE 66

Examples

Example 2: simulated noisy spectra

Noisy data: $X_i^b(t) = X_i(t) + \epsilon_{it}$, $\epsilon_{it} \sim \mathcal{N}(0, 0.01)$, i.i.d.

SLIDE 67

Examples

Example 2: simulated noisy spectra

Worse noisy data: $X_i^b(t) = X_i(t) + \epsilon_{it}$, $\epsilon_{it} \sim \mathcal{N}(0, 0.2)$, i.i.d.

SLIDE 68

Examples

Methodology for comparison

- Split the data into train/test sets (250 times);
- train 250 regression functions on the 250 train sets (hyper-parameters tuned by CV), with the predictors being:
  - the original (sampled) functions $X_i$ (viewed as $\mathbb{R}^{|\tau_d|}$ vectors);
  - $Q_{\lambda,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2: smoothing splines derivatives;
  - $Q_{0,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2: interpolating splines derivatives;
  - derivatives of order 1 or 2 evaluated by $\frac{X_i(t_{j+1}) - X_i(t_j)}{t_{j+1} - t_j}$: finite differences derivatives;
- evaluate these regression functions by calculating the MSE on the 250 corresponding test sets.

29 / 30 Nathalie Villa-Vialaneix
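The simplest competitor above, finite differences, can be sketched in a few lines. `np.gradient` uses central differences in the interior; on clean samples this is accurate, but additive noise is amplified roughly like sigma divided by the grid spacing, which is the motivation for the spline-based derivatives. The toy curve below is ours, for illustration:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 201)
x_clean = np.sin(2 * np.pi * t)
dx_true = 2 * np.pi * np.cos(2 * np.pi * t)

# Finite-difference derivative on noiseless samples: close to the truth.
dx_clean = np.gradient(x_clean, t)

# The same estimate on noisy samples: the noise is amplified by 1/spacing.
rng = np.random.default_rng(4)
x_noisy = x_clean + 0.05 * rng.standard_normal(t.size)
dx_noisy = np.gradient(x_noisy, t)

err_clean = np.max(np.abs(dx_clean - dx_true))
err_noisy = np.max(np.abs(dx_noisy - dx_true))
```

Comparing `err_clean` and `err_noisy` shows the noise amplification directly, and the effect worsens for second derivatives.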

SLIDE 69

Examples

Performances

SLIDE 71

References

Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers.

Faragó, T. and Györfi, L. (1975). On the continuity of the error distortion function for multiple-hypothesis decisions. IEEE Transactions on Information Theory, 21(4):458-460.

Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82-95.

Ragozin, D. (1983). Error bounds for derivative estimation based on spline smoothing of exact or noisy data. Journal of Approximation Theory, 37:335-355.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768-791.

Any questions?