Classification and statistical machine learning

Sylvain Arlot

http://www.di.ens.fr/~arlot/

CNRS – École Normale Supérieure (Paris), DI/ENS, Équipe Sierra

CEMRACS 2013, July 26th, 2013

Outline

1. Introduction
2. Goals
3. Overfitting
4. Examples
5. Key issues


Hand-written digit recognition (MNIST): digit image ⇒ which digit?

http://yann.lecun.com/exdb/mnist/


Object recognition

Example images of each category (American flag, butterfly, teddy bear) ⇒ for a new image: American flag? Butterfly? Teddy bear? ...

http://www.vision.caltech.edu/Image_Datasets/Caltech256/


Kinect: body part recognition [Shotton et al., 2011]

http://research.microsoft.com/en-us/projects/vrkinect/


Predict biochemical properties of molecules from structure

Mutagenic compounds vs. non-mutagenic compounds. A compound with unknown properties: is it likely to be mutagenic or not? [Mahé et al., 2005, Shervashidze et al., 2011]

Figure obtained from Koji Tsuda


Many other applications

Bioinformatics: sequencing data for diagnosis and prognosis (cancer, ...), personalized medicine, ...
Text classification: spam detection, Google ads, automatic document classification
Action recognition in videos, speech recognition, credit scoring, ...


Classification in ℝ: data

[Figure: 1-D data sample, class 0 vs. class 1]


Classification in ℝ: regression function



Classification in ℝ: Bayes classifier



Binary supervised classification

Data Dn: (X1, Y1), ..., (Xn, Yn) ∈ X × {0, 1}, i.i.d. ∼ P
Classifier: f : X → {0, 1} measurable
Cost/loss function ℓ(f(x), y) measures how well f(x) "predicts" y. For this talk: ℓ(y, y′) = 1{y ≠ y′} (0–1 loss)
Goal: learn f ∈ S = {measurable functions X → {0, 1}} such that the risk R(f) := E_{(X,Y)∼P}[ℓ(f(X), Y)] = P(f(X) ≠ Y) is minimal.
Remark: asymmetric cost ℓ_w(f(x), y) = w(y) 1{f(x) ≠ y} with w(0) ≠ w(1) > 0 (spam, medical diagnosis).


Bayes estimator and excess risk

Bayes classifier: f⋆ ∈ argmin_{f ∈ S} R(f)

Proposition. In binary classification with the 0–1 loss, f⋆(X) = 1{η(X) ≥ 1/2} (except maybe on {η(X) = 1/2}), where η(X) = P(Y = 1 | X) is the regression function. The Bayes risk is R(f⋆) = E[min{η(X), 1 − η(X)}] and the excess risk of any f ∈ S is

    R(f) − R(f⋆) = E[ |2η(X) − 1| 1{f(X) ≠ f⋆(X)} ] .

Remark: for the asymmetric cost ℓw, a similar result holds with 1/2 replaced by w(0)/(w(0) + w(1)).
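The proposition is easy to check numerically. Below is a minimal sketch (not from the slides; the regression function eta and all names are ours): we draw (X, Y) with a hand-picked η, form the Bayes classifier 1{η ≥ 1/2}, and compare Monte Carlo estimates of the risks with the two closed-form expressions above.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Toy regression function eta(x) = P(Y = 1 | X = x) on X = [0, 1]."""
    return 0.8 * np.sin(np.pi * x) ** 2

def bayes_classifier(x):
    return (eta(x) >= 0.5).astype(int)

def some_classifier(x):
    # an arbitrary competitor: threshold the input at 0.5
    return (x >= 0.5).astype(int)

n = 200_000
X = rng.uniform(0.0, 1.0, size=n)                   # X uniform on [0, 1]
Y = (rng.uniform(size=n) < eta(X)).astype(int)      # Y | X ~ Bernoulli(eta(X))

risk_bayes = np.mean(bayes_classifier(X) != Y)      # estimates R(f*)
risk_f = np.mean(some_classifier(X) != Y)           # estimates R(f)

# closed-form expressions from the slide, estimated by Monte Carlo over X only
bayes_risk_formula = np.mean(np.minimum(eta(X), 1 - eta(X)))
excess_risk_formula = np.mean(np.abs(2 * eta(X) - 1)
                              * (some_classifier(X) != bayes_classifier(X)))

print(f"R(f*) by 0-1 loss        : {risk_bayes:.4f}")
print(f"E[min(eta, 1 - eta)]     : {bayes_risk_formula:.4f}")
print(f"R(f) - R(f*) empirically : {risk_f - risk_bayes:.4f}")
print(f"excess-risk formula      : {excess_risk_formula:.4f}")
```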


Bayes estimator and excess risk: proof

P(f(X) ≠ Y | X) = P(Y = 1 | X) 1{f(X) = 0} + P(Y = 0 | X) 1{f(X) = 1} = η(X) 1{f(X) = 0} + (1 − η(X)) 1{f(X) = 1} ≥ min{η(X), 1 − η(X)},

with equality if and only if η(X) = 1/2 or f(X) = 1{η(X) ≥ 1/2}. The first two results follow by integrating over X. Then, the excess risk is equal to

    E[ 1{f(X) ≠ Y} − 1{f⋆(X) ≠ Y} ]
  = E[ 1{f(X) ≠ f⋆(X)} ( 1{f(X) ≠ Y} − 1{f⋆(X) ≠ Y} ) ]
  = E[ 1{f(X) ≠ f⋆(X)} E( 1{f(X) ≠ Y} − 1{f⋆(X) ≠ Y} | X ) ]
  = E[ 1{f(X) ≠ f⋆(X)} ( max{η(X), 1 − η(X)} − min{η(X), 1 − η(X)} ) ]
  = E[ |2η(X) − 1| 1{f(X) ≠ f⋆(X)} ] .


Classification seen as a testing problem

[Figure: data points, class 0 vs. class 1]

f_i: density of P_i = L(X | Y = i) for i = 0, 1

Regression function: η(x) = P(Y = 1) f1(x) / [ P(Y = 0) f0(x) + P(Y = 1) f1(x) ]

Bayes predictor: f⋆(x) = 1{η(x) ≥ 1/2} = 1{ f1(x)/f0(x) ≥ P(Y = 0)/P(Y = 1) }

⇔ likelihood-ratio test 1{ f1(x)/f0(x) ≥ t } of H0: "X ∼ P0" against H1: "X ∼ P1".


Goals


Classification rule/algorithm

Classification rule: f̂ : ⋃_{n≥1} (X × {0, 1})^n → S
Input: a data set Dn (of any size n ≥ 1); output: a classifier f̂(Dn) : X → {0, 1}.

Example: k-nearest neighbours (k-NN): x ∈ X ↦ majority vote among the Yi such that Xi is one of the k nearest neighbours of x among X1, ..., Xn.
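A minimal NumPy implementation of the k-NN rule just described (a sketch with our own function names, not the code behind the figures):

```python
import numpy as np

def knn_classify(x_new, X, Y, k=3):
    """k-NN rule: for each row of x_new, majority vote among the Y_i of the k
    nearest X_i (Euclidean distance)."""
    x_new = np.atleast_2d(x_new)
    dists = np.linalg.norm(X[None, :, :] - x_new[:, None, :], axis=2)
    neighbours = np.argsort(dists, axis=1)[:, :k]              # indices of the k nearest points
    return (Y[neighbours].mean(axis=1) >= 0.5).astype(int)     # majority vote (ties -> class 1)

# toy usage: two Gaussian clouds in R^2
rng = np.random.default_rng(0)
Y_train = rng.integers(0, 2, size=200)
X_train = rng.normal(size=(200, 2)) + Y_train[:, None]         # class 1 shifted by (1, 1)
print(knn_classify(np.array([[0.0, 0.0], [1.5, 1.5]]), X_train, Y_train, k=3))
```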


Example: 3-nearest neighbours



Universal consistency

weak consistency: E[R(f̂(Dn))] → R(f⋆) as n → ∞
strong consistency: R(f̂(Dn)) → R(f⋆) almost surely as n → ∞
universal (weak) consistency: for all P, E[R(f̂(Dn))] → R(f⋆) as n → ∞
universal strong consistency: for all P, R(f̂(Dn)) → R(f⋆) almost surely as n → ∞

Stone's theorem [Stone, 1977]: if X = R^d with the Euclidean distance, kn-NN is (weakly) universally consistent if kn → +∞ and kn/n → 0 as n → +∞.
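A small simulation (our own construction, not from the slides) illustrating Stone's theorem: with kn = ⌈√n⌉, so that kn → ∞ and kn/n → 0, the test error of kn-NN approaches the Bayes risk as n grows, for a toy distribution whose η is known.

```python
import numpy as np

rng = np.random.default_rng(1)

def eta(x):                               # known regression function on R
    return 1.0 / (1.0 + np.exp(-3.0 * x))

def sample(n):
    X = rng.normal(size=n)
    Y = (rng.uniform(size=n) < eta(X)).astype(int)
    return X, Y

def knn_predict(x_test, X, Y, k):
    # indices of the k nearest neighbours of each test point (1-D distance)
    idx = np.argpartition(np.abs(X[None, :] - x_test[:, None]), k - 1, axis=1)[:, :k]
    return (Y[idx].mean(axis=1) >= 0.5).astype(int)

X_test, Y_test = sample(2_000)
bayes_risk = np.mean((eta(X_test) >= 0.5).astype(int) != Y_test)

for n in [100, 1_000, 5_000]:
    X, Y = sample(n)
    k = int(np.ceil(np.sqrt(n)))          # k_n -> infinity and k_n / n -> 0
    risk = np.mean(knn_predict(X_test, X, Y, k) != Y_test)
    print(f"n={n:5d}  k_n={k:3d}  test error={risk:.3f}  (Bayes risk ~ {bayes_risk:.3f})")
```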


Uniform universal consistency?

universal weak consistency: sup_{P ∈ M1(X×{0,1})} lim_{n→+∞} { E[R(f̂(Dn))] − R(f⋆) } = 0

uniform universal weak consistency: lim_{n→+∞} sup_{P ∈ M1(X×{0,1})} { E[R(f̂(Dn))] − R(f⋆) } = 0

that is, a common learning rate for all P? Yes if X is finite. No otherwise (see Chapter 7 of [Devroye et al., 1996]).


Classification on X finite

Theorem. If X is finite and f̂_maj is the majority vote rule (for each x ∈ X, majority vote among {Yi : Xi = x}), then

    sup_P { E[R(f̂_maj(Dn))] − R(f⋆) } ≤ √( Card(X) log(2) / (2n) ) .

Proof: standard risk bounds (see next section) + the maximal inequality

    E[ sup_{t∈T} (1/n) Σ_{i=1}^n ξ_{i,t} ] ≤ √( log(Card(T)) / (2n) )

if for all t, the (ξ_{i,t})_i are independent, centered and in [0, 1]; here T = {0, 1}^X, so log(Card(T)) = Card(X) log(2).

See e.g. http://www.di.ens.fr/~arlot/2013orsay.htm

Constants matter: Card(X) can be larger than n ⇒ beware of asymptotic results and O(·) that can hide such constants in first or second order terms.
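A quick simulation (our own sketch) of the majority-vote rule on a finite X, comparing its average excess risk with the √(Card(X) log(2)/(2n)) bound of the theorem; the distribution used is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
card_X = 50
eta = rng.uniform(0.0, 1.0, size=card_X)          # eta(x) = P(Y=1 | X=x), X uniform on {0, ..., card_X-1}
bayes_risk = np.mean(np.minimum(eta, 1 - eta))

def majority_vote_risk(n):
    X = rng.integers(0, card_X, size=n)
    Y = (rng.uniform(size=n) < eta[X]).astype(int)
    ones = np.bincount(X, weights=Y, minlength=card_X)    # votes for class 1 at each x
    total = np.bincount(X, minlength=card_X)              # number of observations at each x
    f_maj = (ones > total / 2).astype(int)                # majority vote (class 0 if tie or no data)
    # exact risk of f_maj under the known distribution
    return np.mean(np.where(f_maj == 1, 1 - eta, eta))

for n in [50, 500, 5000]:
    excess = np.mean([majority_vote_risk(n) for _ in range(200)]) - bayes_risk
    bound = np.sqrt(card_X * np.log(2) / (2 * n))
    print(f"n={n:5d}  average excess risk ~ {excess:.4f}   bound = {bound:.4f}")
```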


No Free Lunch Theorem

Theorem. If X is infinite, for any classification rule f̂ and any n ≥ 1,

    sup_{P ∈ M1(X×{0,1})} { E[R(f̂(Dn))] − R(f⋆) } ≥ 1/2 .

[Figure: Y = η(X), data points, estimator, unobserved points]
Remark: for any sequence (an) decreasing to zero and any f̂, some P exists such that E[R(f̂(Dn))] − R(f⋆) ≥ an. See Chapter 7 of [Devroye et al., 1996].

⇒ impossible to have C(P)/log(log n) as a universal risk bound!


No Free Lunch Theorem: proof

Assume N ⊂ X and let K ≥ 1. For any r ∈ {0, 1}^K, define P_r by: X uniform on {1, ..., K} and P(Y = r_i | X = i) = 1 for all i = 1, ..., K. Under P_r, f⋆(x) = r_x and R(f⋆) = 0. So, with r drawn uniformly at random on {0, 1}^K,

    sup_P { E_P[R_P(f̂(Dn))] − R_P(f⋆) }
      ≥ sup_{P_r} P_{P_r}( f̂(X; Dn) ≠ r_X )
      ≥ E_r [ P_{P_r}( f̂(X; Dn) ≠ r_X ) ]
      ≥ E[ 1{X ∉ {X1, ..., Xn}} E( 1{ f̂(X; (Xi, r_{Xi})_{i=1..n}) ≠ r_X } | X, (Xi, r_{Xi})_{i=1..n} ) ]
      = (1/2) P( X ∉ {X1, ..., Xn} ) = (1/2) (1 − 1/K)^n ,

and letting K → +∞ (n fixed) gives the bound 1/2.


Learning rates

How can we get a bound such as R(f̂(Dn)) − R(f⋆) ≤ C(P) n^{−1/2} ?

No Free Lunch Theorems ⇒ must make assumptions on P.

Minimax rate: given a set P ⊂ M1(X × {0, 1}),

    inf_{f̂} sup_{P∈P} E[ R(f̂(Dn)) − R(f⋆) ]

Examples:
  √(V/n) when f⋆ ∈ S known and dimVC(S) = V [Devroye et al., 1996]
  V/(nh) when in addition P(|η(X) − 1/2| ≤ h) = 0 (margin assumption) [Massart and Nédélec, 2006]


Overfitting


Overfitting with k-nearest-neighbours: k = 1



Choosing k ∈ {1, 3, 20, 200} for k-NN (n = 200)

[Figure: k-NN decision rules for k = 1, 3, 20, 200 on the same sample]
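The same comparison can be reproduced in a few lines on simulated data (this is not the dataset of the figure; the distribution and all names are ours). With n = 200, k = 1 overfits (zero training error, high test error) while k = 200 underfits.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

def make_data(n):
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    eta = 1.0 / (1.0 + np.exp(-4.0 * (X[:, 0] ** 2 + X[:, 1] - 0.3)))   # P(Y = 1 | X)
    Y = (rng.uniform(size=n) < eta).astype(int)
    return X, Y

X_train, Y_train = make_data(200)        # n = 200 as in the slide
X_test, Y_test = make_data(10_000)

for k in [1, 3, 20, 200]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, Y_train)
    print(f"k={k:3d}  train error={1 - clf.score(X_train, Y_train):.3f}"
          f"  test error={1 - clf.score(X_test, Y_test):.3f}")
```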


Empirical risk minimization

Empirical risk: R̂n(f) := (1/n) Σ_{i=1}^n ℓ(f(Xi), Yi)

Empirical risk minimizer over a model S ⊂ S: f̂_S ∈ argmin_{f∈S} R̂n(f)

Examples:
  partitioning rule: S = { Σ_{k≥1} αk 1_{Ak} : αk ∈ {0, 1} } for some partition (Ak)_{k≥1} of X
  linear discrimination (X = R^d): S = { x ↦ 1{β⊤x + β0 ≥ 0} : β ∈ R^d, β0 ∈ R }
  ...
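ERM with the 0–1 loss is computationally hard for rich models (see the last section), but for a one-dimensional model of threshold classifiers S = {x ↦ 1{x ≥ t}} it can be done exactly by scanning the data points. A minimal sketch (names and data are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

# toy 1-D data: class 1 tends to have larger X
n = 200
Y = rng.integers(0, 2, size=n)
X = rng.normal(loc=Y.astype(float), scale=1.0)

def empirical_risk(f_values, Y):
    """R_n(f) = (1/n) * sum of 0-1 losses."""
    return np.mean(f_values != Y)

# model S = { x -> 1{x >= t} }: it suffices to test thresholds at the data points
thresholds = np.sort(X)
risks = [empirical_risk((X >= t).astype(int), Y) for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]          # empirical risk minimizer over S

print(f"ERM threshold t = {t_hat:.3f}, empirical risk = {min(risks):.3f}")
```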


Example: linear discrimination

Fig. 4.3 of [Devroye et al., 1996]


Bias-variance trade-off

E[R(f̂_S)] − R(f⋆) = Bias + Variance

Bias or approximation error: R(f⋆_S) − R(f⋆) = inf_{f∈S} R(f) − R(f⋆)

Variance or estimation error: e.g., OLS in regression: σ² dim(S)/n ; k-NN in regression: σ²/k

Bias–variance trade-off ⇔ avoid overfitting and underfitting


Examples: plug-in rules; empirical risk minimization and model selection; convexification and support vector machines; decision trees and forests


Plug in classifiers

Idea: f⋆(x) = 1{η(x) ≥ 1/2} ⇒ if η̂(Dn) estimates η (a regression problem), take

    f̂(x; Dn) = 1{ η̂(x; Dn) ≥ 1/2 }

Examples: partitioning, k-NN, local average classifiers [Devroye et al., 1996], [Audibert and Tsybakov, 2007]...
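A minimal plug-in classifier (our own sketch): η is estimated here by scikit-learn's logistic regression, standing in for any regression estimator of η, and then thresholded at 1/2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

n = 1_000
X = rng.normal(size=(n, 2))
eta = 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - X[:, 1])))   # true regression function
Y = (rng.uniform(size=n) < eta).astype(int)

# step 1: estimate eta (a regression problem)
eta_hat = LogisticRegression().fit(X, Y)

# step 2: plug-in classification rule f_hat(x) = 1{ eta_hat(x) >= 1/2 }
def plug_in_classify(x):
    return (eta_hat.predict_proba(x)[:, 1] >= 0.5).astype(int)

X_test = rng.normal(size=(5, 2))
print(plug_in_classify(X_test))
```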


Risk bound for plug in

Proposition (Theorem 2.2 in [Devroye et al., 1996]). For a plug-in classifier f̂,

    R(f̂(Dn)) − R(f⋆) ≤ 2 E[ |η(X) − η̂(X; Dn)| | Dn ] ≤ 2 √( E[ (η(X) − η̂(X; Dn))² | Dn ] )

(First step for proving Stone's theorem [Stone, 1977].)

Proof: R(f̂(Dn)) − R(f⋆) = E[ |2η(X) − 1| 1{f̂(X; Dn) ≠ f⋆(X)} | Dn ], and f̂(X; Dn) ≠ f⋆(X) implies |2η(X) − 1| ≤ 2 |η(X) − η̂(X; Dn)|.


Empirical risk minimization (ERM)

ERM over S: f̂_S ∈ argmin_{f∈S} R̂n(f)

E[R(f̂_S)] − R(f⋆) = Approximation error + Estimation error

Approximation error R(f⋆_S) − R(f⋆): bounded thanks to approximation theory, or assumed equal to zero.

Estimation error: E[ R(f̂_S) − R(f⋆_S) ] ≤ E[ sup_{f∈S} { R(f) − R̂n(f) } ]

Proof:
    R(f̂_S) − R(f⋆_S) = R(f̂_S) − R̂n(f̂_S) + R̂n(f̂_S) − R̂n(f⋆_S) + R̂n(f⋆_S) − R(f⋆_S)
                      ≤ sup_{f∈S} { R(f) − R̂n(f) } + R̂n(f⋆_S) − R(f⋆_S)
since R̂n(f̂_S) ≤ R̂n(f⋆_S); taking expectations (E[R̂n(f⋆_S)] = R(f⋆_S)) gives the bound.


Bounds on the estimation error (1): global approach

E[ R(f̂_S) − R(f⋆_S) ]
  ≤ E[ sup_{f∈S} { R(f) − R̂n(f) } ]                      (global complexity of S)
  ≤ 2 E[ sup_{f∈S} (1/n) Σ_{i=1}^n εi ℓ(f(Xi), Yi) ]       (symmetrization)
  ≤ (2√2/√n) E[ √( H(S; X1, ..., Xn) ) ]                   (combinatorial entropy)
  ≤ 2 √( 2 V(S) log( en / V(S) ) / n )                     (VC dimension)

References: Section 3 of [Boucheron et al., 2005], Chapters 12–13 of [Devroye et al., 1996].
See also lectures 1–2 of http://www.di.ens.fr/~arlot/2013orsay.htm


Bounds on the estimation error (2): localization

sup_{f∈S} { Var(R(f) − R̂n(f)) } ≥ C n^{−1/2} ⇒ no faster rate

Margin condition: P(|η(X) − 1/2| ≤ h) = 0 with h > 0 [Mammen and Tsybakov, 1999]

Localization idea: use that f̂_S is not anywhere in S:
    f̂_S ∈ { f ∈ S : R(f) − R(f⋆) ≤ ε } ⊂ { f ∈ S : Var(ℓ(f(X), Y) − ℓ(f⋆(X), Y)) ≤ ε/h }
by the margin condition.

+ Talagrand concentration inequality [Talagrand, 1996, Bousquet, 2002] + ... ⇒ fast rates (depending on the assumptions), e.g.,

    κ (V(S)/(nh)) (1 + log( nh² / V(S) ))

[Boucheron et al., 2005, Sec. 5], [Massart and Nédélec, 2006].


Model selection

Family of models (Sm)_{m∈M} ⇒ family of classifiers (f̂m(Dn))_{m∈M} ⇒ choose m̂ = m̂(Dn) such that R(f̂_{m̂(Dn)}) is minimal?

Goal: minimize the risk, i.e., oracle inequality (in expectation or with a large probability):

    R(f̂_{m̂}) − R(f⋆) ≤ C inf_{m∈M} { R(f̂m) − R(f⋆) } + Rn

Interpretation of m̂: the best model can be wrong / the true model can be worse than smaller ones.


Penalization for model selection

Penalization: m̂ ∈ argmin_{m∈M} { R̂n(f̂m) + pen(m) }

Ideal penalty: penid(m) = R(f̂m) − R̂n(f̂m), for which m̂ ∈ argmin_{m∈M} { R(f̂m) }

General idea: choose pen such that pen(m) ≈ penid(m), or at least pen(m) ≥ penid(m) for all m ∈ M.

Lemma (see next slide): if pen(m) ≥ penid(m) for all m ∈ M,

    R(f̂_{m̂}) − R(f⋆) ≤ inf_{m∈M} { R(f̂m) − R(f⋆) + pen(m) − penid(m) } .


Penalization for model selection: lemma

Lemma. If for all m ∈ M, −B(m) ≤ pen(m) − penid(m) ≤ A(m), then

    R(f̂_{m̂}) − R(f⋆) − B(m̂) ≤ inf_{m∈M} { R(f̂m) − R(f⋆) + A(m) } .

Proof: for all m ∈ M, by definition of m̂,

    R̂n(f̂_{m̂}) + pen(m̂) ≤ R̂n(f̂m) + pen(m) .

So,
    R̂n(f̂_{m̂}) + pen(m̂) = R(f̂_{m̂}) − penid(m̂) + pen(m̂) ≥ R(f̂_{m̂}) − B(m̂)
and
    R̂n(f̂m) + pen(m) = R(f̂m) − penid(m) + pen(m) ≤ R(f̂m) + A(m) .


Penalization for model selection

Structural risk minimization (Vapnik): penid(m) ≤ sup_{f∈Sm} { R(f) − R̂n(f) } ⇒ can use the previous bounds [Koltchinskii, 2001, Bartlett et al., 2002, Fromont, 2007], but remainder terms ≥ C n^{−1/2} ⇒ no fast rates.

Tighter estimates of penid(m) for fast rates: localization [Koltchinskii, 2006], resampling [Arlot, 2009].

See also Section 8 of [Boucheron et al., 2005].


Convexification of the classification problem

Convention: Yi ∈ {−1, 1}, so that 1{y ≠ y′} = 1{yy′ < 0} = Φ0−1(yy′)

min_f (1/n) Σ_{i=1}^n Φ0−1(Yi f(Xi)) is computationally heavy in general.

Classifier f : X → {−1, 1} ⇒ prediction function f : X → R such that sign(f(x)) will be used to classify x.

Risk R0−1(f) = E[Φ0−1(Y f(X))] ⇒ Φ-risk RΦ(f) = E[Φ(Y f(X))] for some Φ : R → R+

⇒ min_{f∈S} (1/n) Σ_{i=1}^n Φ(Yi f(Xi)) with S and Φ convex.


Examples of functions Φ

Figure from [Bartlett et al., 2006].

exponential: Φ(u) = e^{−u} ⇒ AdaBoost
hinge: Φ(u) = max{1 − u, 0} ⇒ support vector machines
logistic/logit: Φ(u) = log(1 + exp(−u)) ⇒ logistic regression
truncated quadratic: Φ(u) = (max{1 − u, 0})²

References: [Bartlett et al., 2006] and Section 4 of [Boucheron et al., 2005].
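Written out as code, the losses above read as follows (a plain illustration with our own function names; the argument is the margin u = y f(x)):

```python
import numpy as np

def phi_01(u):           # 0-1 loss: 1{u < 0} (not convex)
    return (u < 0).astype(float)

def phi_exponential(u):  # AdaBoost
    return np.exp(-u)

def phi_hinge(u):        # support vector machines
    return np.maximum(1.0 - u, 0.0)

def phi_logistic(u):     # logistic regression
    return np.log1p(np.exp(-u))

def phi_trunc_quad(u):   # truncated quadratic
    return np.maximum(1.0 - u, 0.0) ** 2

u = np.linspace(-2, 2, 5)
for name, phi in [("0-1", phi_01), ("exponential", phi_exponential), ("hinge", phi_hinge),
                  ("logistic", phi_logistic), ("trunc. quad.", phi_trunc_quad)]:
    print(f"{name:12s}", np.round(phi(u), 3))
```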


Links between 0–1 and convex risks

Definition. Φ is classification-calibrated if, for any x with η(x) ≠ 1/2, sign(f⋆_Φ(x)) = f⋆(x) for any f⋆_Φ ∈ argmin_f RΦ(f).

Theorem ([Bartlett et al., 2006]). Φ convex is classification-calibrated ⇔ Φ is differentiable at 0 and Φ′(0) < 0. Then, a function ψ exists such that

    ψ( R0−1(f) − R0−1(f⋆_{0−1}) ) ≤ RΦ(f) − RΦ(f⋆_Φ) .

Examples: exponential loss: ψ(θ) = 1 − √(1 − θ²); hinge loss: ψ(θ) = |θ|; truncated quadratic: ψ(θ) = θ².


Support Vector Machines: linear classifier

X = R^d, linear classifier: sign(β⊤x + β0) with β ∈ R^d, β0 ∈ R

    argmin_{β, β0 : ‖β‖ ≤ R} { (1/n) Σ_{i=1}^n Φhinge( Yi (β⊤Xi + β0) ) }

  = argmin_{β, β0} { (1/n) Σ_{i=1}^n Φhinge( Yi (β⊤Xi + β0) ) + λ ‖β‖² }

up to some (random) reparametrization (λ = λ(R; Dn)).

⇒ quadratic program with 2n linear constraints.
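A hedged sketch of the penalized formulation above, solved by plain subgradient descent on (1/n) Σ Φhinge(Yi(β⊤Xi + β0)) + λ‖β‖² (for illustration only; in practice one solves the quadratic program or calls a dedicated SVM solver). Labels are taken in {−1, +1}; the data and all names are ours.

```python
import numpy as np

def linear_svm_subgradient(X, Y, lam=0.1, n_steps=2000, lr=0.05):
    """Subgradient descent on (1/n) * sum hinge(Y_i (beta.X_i + beta0)) + lam * ||beta||^2."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(n_steps):
        margins = Y * (X @ beta + beta0)
        active = margins < 1.0                       # points where the hinge loss is > 0
        g_beta = -(Y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * beta
        g_beta0 = -Y[active].sum() / n
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0

rng = np.random.default_rng(6)
n = 300
Y = rng.choice([-1, 1], size=n)
X = rng.normal(size=(n, 2)) + 1.2 * Y[:, None]       # two shifted Gaussian clouds

beta, beta0 = linear_svm_subgradient(X, Y)
pred = np.sign(X @ beta + beta0)
print("training 0-1 error:", np.mean(pred != Y))
```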


Support Vector Machines: linear classifier

Figure from http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf


Support Vector Machines: kernel trick

Positive definite kernel k : X × X → R, i.e., (k(Xi, Xj))_{i,j} is symmetric positive definite.

Reproducing Kernel Hilbert Space (RKHS) F: space of functions X → R spanned by the Φ(x) = k(x, ·), x ∈ X.

Theorem (Representer theorem). For any cost function ℓ,

    min_{f∈F} { (1/n) Σ_{i=1}^n ℓ(Yi, f(Xi)) + λ ‖f‖²_F }

is attained at some f of the form Σ_{i=1}^n αi k(Xi, ·)

⇒ any algorithm for X = R^d relying only on the dot products (⟨Xi, Xj⟩)_{i,j} can be kernelized.
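A sketch of a kernelized SVM with the Gaussian kernel via scikit-learn's SVC, which solves the dual of the hinge-loss problem (the parameter values gamma and C and the toy data are our choices):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)

# a problem no linear classifier can solve well: the class depends on the radius
n = 500
X = rng.normal(size=(n, 2))
Y = (np.linalg.norm(X, axis=1) > 1.2).astype(int)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)    # Gaussian kernel k(x, y) = exp(-gamma ||x - y||^2)
clf.fit(X, Y)

X_test = rng.normal(size=(2000, 2))
Y_test = (np.linalg.norm(X_test, axis=1) > 1.2).astype(int)
print("test error of the Gaussian-kernel SVM:", np.mean(clf.predict(X_test) != Y_test))
print("number of support vectors:", clf.support_vectors_.shape[0])
```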


Kernel examples

linear kernel: X = R^d, k(x, y) = ⟨x, y⟩ ⇒ F = R^d (Euclidean)
polynomial kernel: X = R^d, k(x, y) = (⟨x, y⟩ + 1)^r ⇒ F = R_r[X1, ..., Xd]
Gaussian kernel: X = R^d, k(x, y) = exp(−‖x − y‖²/(2σ²))
Laplace kernel: X = R, k(x, y) = e^{−|x−y|/2} ⇒ F = H¹ (Sobolev space), ‖f‖²_F = ‖f‖²_{L²} + ‖f′‖²_{L²}
min kernel: X = [0, 1], k(x, y) = min{x, y} ⇒ F = {f ∈ C⁰([0, 1]), f′ ∈ L², f(0) = 0}, ‖f‖_F = ‖f′‖_{L²}

⇒ intersection kernel: X = { p ∈ [0, 1]^d : p1 + · · · + pd = 1 }, k(p, q) = Σ_{i=1}^d min(pi, qi), useful in computer vision [Hein and Bousquet, 2004, Maji et al., 2008].

Other kernels on non-vectorial data (graphs, words / DNA sequences, ...): see for instance [Schölkopf et al., 2004, Mahé et al., 2005, Shervashidze et al., 2011] and http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf


Support Vector Machines: results / references

Main mathematical tools for SVM analysis: probability in Hilbert spaces (RKHS), functional analysis.

Some references:
  Risk bounds: e.g., [Blanchard et al., 2008] (SVM as a penalization procedure for selecting among balls); see also [Boucheron et al., 2005, Section 4]
  Tutorials and lecture notes: [Burges, 1998], http://cbio.ensmp.fr/~jvert/svn/kernelcourse/slides/master/master.pdf
  Books: e.g., [Steinwart and Christmann, 2008, Hastie et al., 2009, Schölkopf and Smola, 2001]


Decision / classification tree

piecewise constant predictor
partition obtained by recursive splitting of X ⊂ R^p, orthogonally to one axis (X^j < t vs. X^j ≥ t)
empirical risk minimization

Figures from [Hastie et al., 2009]


CART (Classification And Regression Trees)

CART [Breiman et al., 1984]:
1. generate one large tree by splitting the data recursively (minimization of some impurity measure) ⇒ over-adapted to the data
2. pruning (⇔ model selection)

Model selection results: e.g., [Gey and Nédélec, 2005, Sauvé and Tuleau-Malot, 2011, Gey and Mary-Huard, 2011].
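A sketch of the two CART steps with scikit-learn's DecisionTreeClassifier: grow one large tree, then select a pruned subtree by cost-complexity pruning, here with a simple validation split standing in for the model-selection step (the data and parameter values are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n = 1_000
X = rng.uniform(-1, 1, size=(n, 2))
Y = ((X[:, 0] * X[:, 1] > 0) ^ (rng.uniform(size=n) < 0.1)).astype(int)   # noisy XOR-like classes

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.5, random_state=0)

# step 1: grow one large tree (over-adapted to the data)
big_tree = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)

# step 2: pruning = model selection among the subtrees of the large tree
alphas = big_tree.cost_complexity_pruning_path(X_train, Y_train).ccp_alphas
scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, Y_train).score(X_val, Y_val)
          for a in alphas]
best = DecisionTreeClassifier(random_state=0, ccp_alpha=alphas[int(np.argmax(scores))])
best.fit(X_train, Y_train)

print("leaves before/after pruning:", big_tree.get_n_leaves(), "/", best.get_n_leaves())
print("validation error before/after:",
      round(1 - big_tree.score(X_val, Y_val), 3), "/", round(1 - best.score(X_val, Y_val), 3))
```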


Random forests [Breiman, 2001]

Dn → Bootstrap → Dn⋆1, Dn⋆2, ..., Dn⋆K → Tree building → T1, T2, ..., TK → Voting → final classifier

Various ways to build the individual trees (subset of variables, ...)
Purely random forests: partitions independent from the training data.
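A minimal version of the diagram above — bootstrap resamples, one tree per resample, majority vote — built on scikit-learn decision trees (a sketch; in practice RandomForestClassifier also randomizes the variables considered at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(9)
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
Y = ((X[:, 0] * X[:, 1] > 0) ^ (rng.uniform(size=n) < 0.1)).astype(int)

def forest_fit(X, Y, n_trees=50):
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))                     # bootstrap resample D*_n
        trees.append(DecisionTreeClassifier().fit(X[idx], Y[idx]))     # tree building
    return trees

def forest_predict(trees, X):
    votes = np.mean([t.predict(X) for t in trees], axis=0)             # voting
    return (votes >= 0.5).astype(int)

trees = forest_fit(X, Y)
X_test = rng.uniform(-1, 1, size=(2000, 2))
Y_test = (X_test[:, 0] * X_test[:, 1] > 0).astype(int)
print("forest test error     :", np.mean(forest_predict(trees, X_test) != Y_test))
print("single tree test error:", np.mean(trees[0].predict(X_test) != Y_test))
```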


Results on random forests (classification and regression)

Most theoretical results are on purely random forests (partitions independent from the training data: by data splitting or with simpler models)
Consistency result in classification [Biau et al., 2008]
Convergence rate and some combination with variable selection [Biau, 2012]
From a single tree to a large forest:
  estimation error reduction (at least a constant factor) [Genuer, 2012]
  approximation error reduction (A. & Genuer, work in progress)
  ⇒ sometimes an improvement in the learning rate

See also [Breiman, 2004, Genuer et al., 2008, Genuer et al., 2010].


Kinect: depth features ⇒ body part

Depth image ⇒ depth comparison features at each pixel ⇒ body part at each pixel ⇒ body part positions ⇒ · · ·

Figure from [Shotton et al., 2011]


Key issues


Hyperparameter choice

Always one or several parameters to choose: k for k-NN, the model in model selection, λ for SVM, the kernel bandwidth for SVM with a Gaussian kernel, the tree size in random forests, ...
No universal choice is possible (the No Free Lunch Theorems apply) ⇒ must use some prior knowledge at some point.
Most general idea: data splitting (cross-validation) [Arlot and Celisse, 2010].
Sometimes specific approaches (penalization, ...): more efficient (for risk and computational cost) but also dependent on stronger assumptions.
It is important to choose a good parametrization (e.g., for cross-validation, the optimal parameter should not vary too much from one sample to another).
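A sketch of the data-splitting idea for one hyperparameter: choosing k for k-NN by V-fold cross-validation with scikit-learn (the grid of candidate k and the toy data are ours):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
n = 400
X = rng.normal(size=(n, 2))
Y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

candidate_k = [1, 3, 5, 10, 20, 50, 100]
cv_errors = []
for k in candidate_k:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, Y, cv=5)   # 5-fold CV accuracy
    cv_errors.append(1 - scores.mean())

best_k = candidate_k[int(np.argmin(cv_errors))]
print("CV errors :", np.round(cv_errors, 3))
print("selected k:", best_k)
```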


Computational complexity

Most classifiers are defined as f̂ ∈ argmin_{f∈S} C(f).

Optimization algorithms: usually faster (polynomial) when C and S are convex. Often NP-hard with the 0–1 loss. Counterexample: interval classification [Kearns et al., 1997].

General convex optimization algorithms are usually too slow if n or p = dim(X) are > 10³. ⇒ Need for specific, faster algorithms (e.g., for SVM, consider the dual problem and take advantage of the "sparsity" of the solution). Constants matter! (e.g., dependence on p).

Choice of a classification learning algorithm: trade-off between statistical performance and computational cost. It also depends on the confidence in the chosen modelling.


Optimization error

Risk = Approximation error + Estimation error + Optimization error

Figure from [Bottou and Bousquet, 2011]


The big data setting

Given ε > 0, what do we need to get R(f̂) − R(f⋆) ≤ ε?

Traditional statistical learning: sample complexity, i.e., n ≥ n0(ε), whatever the computational cost.

Big data: n so large that exploring all the data is impossible (and unnecessary) ⇒ better to throw away some data! [Bottou and Bousquet, 2008, Shalev-Shwartz and Srebro, 2008] ⇒ time complexity, i.e., the minimal number of computations, whatever n.

A very active field: Big Data Research and Development Initiative (US government), MASTODONS (CNRS), AMPLab (UC Berkeley), ...


Computational trade-offs, from statistics to big data

Figure from [Chandrasekaran and Jordan, 2012]


Conclusion

Learning theory: assumptions ⇒ learning rates (NFLT)
Main danger: overfitting
Various ways to model the data:
  k-NN: f⋆ locally constant w.r.t. d
  ERM/model selection: family of possible f⋆
  SVM: kernel ⇒ smoothness of f⋆ / feature space
  random forests: weak modelling (trees) + aggregation
  many other approaches: Bayesian statistics, neural networks, deep learning, ...
Key issues: tuning parameters & computational complexity
Big data ⇒ new challenges
Main mathematical domains involved (outside statistics): probability theory (concentration of measure, ...), approximation theory, functional analysis, optimization, ...


More references

These slides: http://www.di.ens.fr/~arlot/

Arlot, S. (2009). Model selection by resampling penalization. Electron. J. Stat., 3:557–624.
Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statist. Surv., 4:40–79.
Audibert, J.-Y. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. Ann. Statist., 35(2):608–633.
Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48:85–113.
Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
Biau, G. (2012). Analysis of a random forests model. J. Mach. Learn. Res., 13:1063–1095.
Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res., 9:2015–2033.
Blanchard, G., Bousquet, O., and Massart, P. (2008). Statistical performance of support vector machines. Ann. Statist., 36(2):489–531.
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161–168.
Bottou, L. and Bousquet, O. (2011). The tradeoffs of large scale learning. In Optimization for Machine Learning, pages 351–368. MIT Press.
Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM Probab. Stat., 9:323–375.
Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris, 334(6):495–500.
Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.
Breiman, L. (2004). Consistency for a simple model of random forests. Technical Report 670, U.C. Berkeley Department of Statistics. http://www.stat.berkeley.edu/tech-reports/670.pdf
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167.
Chandrasekaran, V. and Jordan, M. I. (2012). Computational and statistical tradeoffs via convex relaxation. arXiv:1211.1073.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer-Verlag, New York.
Fromont, M. (2007). Model selection by bootstrap penalization for classification. Mach. Learn., 66(2–3):165–207.
Genuer, R. (2012). Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3):543–562.
Genuer, R., Poggi, J.-M., and Tuleau, C. (2008). Random forests: some methodological insights. arXiv:0811.3619.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236.
Gey, S. and Mary-Huard, T. (2011). Risk bounds for embedded variable selection in classification trees. arXiv:1108.0757.
Gey, S. and Nédélec, É. (2005). Model selection for CART regression trees. IEEE Trans. Inform. Theory, 51(2):658–670.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, second edition.
Hein, M. and Bousquet, O. (2004). Hilbertian metrics and positive-definite kernels on probability measures. In AISTATS.
Kearns, M., Mansour, Y., Ng, A. Y., and Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Mach. Learn., 27:7–50.
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47(5):1902–1914.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593–2656.
Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L., and Vert, J.-P. (2005). Graph kernels for molecular structure-activity relationship analysis with support vector machines. Journal of Chemical Information and Modeling, 45(4):939–951.
Maji, S., Berg, A. C., and Malik, J. (2008). Classification using intersection kernel support vector machines is efficient. In CVPR.
Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist., 27(6):1808–1829.
Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist., 34(5):2326–2366.
Sauvé, M. and Tuleau-Malot, C. (2011). Variable selection through CART. arXiv:1101.0689.
Schölkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
Schölkopf, B., Tsuda, K., and Vert, J.-P., editors (2004). Kernel Methods in Computational Biology. MIT Press.
Shalev-Shwartz, S. and Srebro, N. (2008). SVM optimization: inverse dependence on training set size. In 25th International Conference on Machine Learning (ICML).
Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561.
Shotton, J., Fitzgibbon, A. W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR, pages 1297–1304.
Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Information Science and Statistics. Springer, New York.
Stone, C. J. (1977). Consistent nonparametric regression. Ann. Statist., 5(4):595–645.
Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math., 126(3):505–563.