3. Dimensionality Reduction. Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech. PowerPoint PPT Presentation.



SLIDE 1
  • 3. Dimensionality Reduction

Introduction to Machine Learning, CentraleSupélec Paris, Fall 2017
Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech

chloe-agathe.azencott@mines-paristech.fr

SLIDE 2

Learning objectives

  • Give reasons why one would wish to reduce the dimensionality of a data set.
  • Explain the difference between feature selection and feature extraction.
  • Implement some filter strategies.
  • Implement some wrapper strategies.
  • Derive the computation of principal components from a “max variance” definition.
  • Implement PCA.

SLIDE 3

Curse of dimensionality

  • Methods / intuitions that work in low dimension may not apply to high dimensions.
  • p=2: Fraction of the points within a square that fall outside of the circle inscribed in it: ?


SLIDE 5

Curse of dimensionality

  • Methods / intuitions that work in low dimension may not apply to high dimensions.
  • p=3: Fraction of the points within a cube that fall outside of the sphere inscribed in it:

[Figure: cube with inscribed sphere of radius r]

SLIDE 6

Curse of dimensionality

  • Volume of a p-dimensional ball of radius r:  V_p(r) = π^(p/2) r^p / Γ(p/2 + 1)
  • When p ↗ the proportion of a hypercube outside of its inscribed hypersphere approaches 1.
  • What this means:
    – hyperspace is very big
    – all points are far apart
    ⇒ dimensionality reduction.

The Gamma function Γ generalizes the factorial: Γ(n) = (n-1)!
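
A minimal numeric sketch of this claim, using only the formula above and the Python standard library (the variable names are illustrative):

    # Fraction of the hypercube [-r, r]^p occupied by its inscribed p-ball.
    # V_p(r) = pi^(p/2) r^p / Gamma(p/2 + 1); the cube has volume (2r)^p,
    # so the ratio does not depend on r.
    import math

    def fraction_inside_ball(p):
        return math.pi ** (p / 2) / (math.gamma(p / 2 + 1) * 2 ** p)

    for p in (2, 3, 5, 10, 20):
        # fraction of the hypercube that falls OUTSIDE the inscribed ball
        print(p, 1 - fraction_inside_ball(p))
    # p=2 -> ~0.21, p=3 -> ~0.48, p=10 -> ~0.998: nearly everything is outside.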

SLIDE 7

More reasons to reduce dimensionality

  • Computational complexity (time and space)
  • Interpretability
  • Simpler models are more robust (less variance)
  • Data visualization
  • Cost of data acquisition
  • Eliminate non-relevant attributes that can make it harder for an algorithm to learn.

SLIDE 8

Approaches to dimensionality reduction

  • Feature selection
    Choose m < p features, ignore the remaining (p-m).
    – Filtering approaches
      Apply a statistical measure to assign a score to each feature (correlation, χ²-test).
    – Wrapper approaches
      Search problem: find the best set of features for a given predictive model.
    – Embedded approaches
      Simultaneously fit a model and learn which features should be included.

Are these approaches supervised or unsupervised? All these feature selection approaches are supervised.
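
A minimal sketch of a filter approach with scikit-learn; the dataset and the univariate score function are illustrative assumptions, not from the slides:

    # Filter-style feature selection: score each feature, keep the m best.
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_breast_cancer(return_X_y=True)          # n samples, p = 30 features
    selector = SelectKBest(score_func=f_classif, k=10)  # keep m = 10 features
    X_reduced = selector.fit_transform(X, y)

    print(X.shape, "->", X_reduced.shape)        # (569, 30) -> (569, 10)
    print(selector.get_support(indices=True))    # indices of the kept features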


SLIDE 12

Feature selection: Overview

[Diagram comparing filter, wrapper and embedded approaches: filters score all features before the predictor is trained; wrappers evaluate candidate feature sets 1, …, M with the predictor; embedded methods select features while the predictor is being fit.]

Subset selection: forward selection, backward selection, floating selection.
Embedded methods: Lasso, Elastic Net (see Chap. 7).

SLIDE 13

Feature selection: Subset selection


SLIDE 15

Subset selection

  • Goal: Find the subset of features that leads to the best-performing algorithm.
  • How many subsets of p features are there? ?

SLIDE 17

Subset selection

  • Goal: Find the subset of features that leads to the best-performing algorithm.
  • Issue: there are 2^p such sets.
  • Greedy approach: forward search
    Add the “best” feature at each step.
    – Initially: the set of selected features S is empty.
    – New best feature: the feature j not in S that minimizes E(S ∪ {j}).
    – Stop if the error no longer decreases.
    – Else: add j to S and repeat.

E(S): error of a predictor trained only using the features in S.


SLIDE 19

Subset selection

  • Goal: Find the subset of features that leads to the best-performing algorithm.
  • Issue: there are 2^p such sets.
  • Greedy approach: forward search
    Add the “best” feature at each step.
  • What is the complexity of this algorithm?
    Complexity: O(p² x C) where C = complexity of training and evaluating the model (might depend on p also). Much better than O(2^p)!

E(S): error of a predictor trained only using the features in S.

SLIDE 20

Subset selection

  • Greedy approach: forward search
    Add the “best” feature at each step. Complexity: O(p²).
  • Alternative strategies:
    – Backward search: start from {1, …, p}, eliminate features.
    – Floating search: alternately add q features and remove r features.

E(S): error of a predictor trained only using the features in S.
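
A minimal sketch of the greedy forward search described above (a wrapper approach); the model, dataset and scoring choices are illustrative assumptions, and E(S) is estimated by cross-validation:

    # Greedy forward selection: at each step add the feature that most reduces
    # the cross-validated error; stop when no remaining feature helps.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    p = X.shape[1]

    def error(feature_set):
        """E(S): cross-validated error of a model trained on the features in S."""
        scores = cross_val_score(LogisticRegression(max_iter=5000),
                                 X[:, sorted(feature_set)], y, cv=5)
        return 1.0 - scores.mean()

    selected, best_err = set(), np.inf
    while len(selected) < p:
        candidates = [(error(selected | {j}), j) for j in range(p) if j not in selected]
        err, j = min(candidates)
        if err >= best_err:          # stop if adding a feature no longer helps
            break
        selected.add(j)
        best_err = err

    print(sorted(selected), best_err)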

SLIDE 21

Approaches to dimensionality reduction

  • Feature extraction
    Project the p features on m < p new dimensions.
  • Principal Components Analysis (PCA)
  • Factor Analysis (FA)
  • Non-negative Matrix Factorization (NMF)
  • Linear Discriminant Analysis (LDA)
  • Multidimensional scaling (MDS)
  • Isometric feature mapping (Isomap)
  • Locally Linear Embedding (LLE)
  • Autoencoders

Most of these approaches are unsupervised.
(The slide tags each method as linear or non-linear; LDA is the supervised one.)

SLIDE 22

Feature extraction: Principal Component Analysis

SLIDE 23

Principal Components Analysis (PCA)

  • Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space.


SLIDE 27

Principal Components Analysis (PCA)

  • Goal: Find a low-dimensional space such that information loss is minimized when the data is projected on that space.
  • Unsupervised: We're only looking at the data, not at any labels.
  • In PCA, we want the variance to be maximized.

[Figure: projection of the data on x1 vs. projection on x2]

Warning! This requires standardizing the features.

SLIDE 29

Feature standardization

  • Variance of feature j in data set D:  Var_j = (1/n) Σ_i (x_ij − μ_j)²,  with mean μ_j = (1/n) Σ_i x_ij.
  • Features that take large values will have large variance.
    Compare [10, 20, 30, 40, 50] with [0.1, 0.2, 0.3, 0.4, 0.5].
  • Standardization:
    – mean centering: give each feature a mean of 0
    – variance scaling: give each feature a variance of 1
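
A minimal sketch of both steps with NumPy, following the formulas above (the toy data is an illustrative assumption):

    # Standardize each feature: subtract its mean (centering) and divide by its
    # standard deviation (variance scaling).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=5.0, scale=[10.0, 0.1, 2.0], size=(100, 3))  # toy data, p = 3

    mu = X.mean(axis=0)                  # per-feature means
    sigma = X.std(axis=0)                # per-feature standard deviations
    X_std = (X - mu) / sigma

    print(X_std.mean(axis=0).round(6))   # ~[0, 0, 0]
    print(X_std.var(axis=0).round(6))    # ~[1, 1, 1]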


SLIDE 32

Principal Components Analysis (PCA)

  • Goal: Find a low-dimensional space such that variance is maximized when the data is projected on that space.
  • Assumption: the data is centered, i.e. it has mean 0. If not: subtract the mean of each feature (mean centering, as on slide 29).
  • What formula gives us the projection of x on the direction of w (assuming w to be a unit vector)? ?

SLIDE 33

Principal Components Analysis (PCA)

  • Goal: Find a low-dimensional space such that variance is maximized when the data is projected on that space.
  • Projection of X on the direction of w (a unit vector: wᵀw = 1):  z = Xw.
    Dimensions: (n, 1) = (n, p) x (p, 1), with n samples and p features.
  • Compute Var(z) as a function of w and X. ?

SLIDE 34

Principal Components Analysis (PCA)

  • Goal: Find a low-dimensional space such that variance is maximized when the data is projected on that space.
  • Projection of X on the direction of w:  z = Xw.
  • Var(z) can be computed as:
    Var(z) = (1/n) zᵀz = (1/n) (Xw)ᵀ(Xw) = wᵀ ((1/n) XᵀX) w = wᵀΣw
    where Σ = (1/n) XᵀX is the covariance matrix estimated from the data, and the mean of z is 0 because the data is centered.
    Dimensions: (1, p) x (p, n) x (n, p) x (p, 1).
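
A small numeric sanity check of Var(Xw) = wᵀΣw on random toy data (all names are illustrative):

    # The variance of the projection z = Xw equals w^T Sigma w.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    X = X - X.mean(axis=0)             # center the data

    w = rng.normal(size=4)
    w = w / np.linalg.norm(w)          # unit vector

    Sigma = X.T @ X / X.shape[0]       # empirical covariance (data already centered)
    z = X @ w

    print(z.var(), w @ Sigma @ w)      # the two numbers match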

SLIDE 35

First principal component (PC1)

  • Goal: find  w1 = argmax_w wᵀΣw  subject to  wᵀw = 1.
  • What kind of optimization problem do we have? ?


SLIDE 37

First principal component (PC1)

  • Goal: find  w1 = argmax_w wᵀΣw  subject to  wᵀw = 1.
  • Optimization problem: Maximize f(w) s.t. g(w) = 0, with f(w) = wᵀΣw and g(w) = wᵀw − 1.

[Figure: iso-contours of f and the constraint set g = 0]

Lagrangian:  L(w, α) = wᵀΣw − α (wᵀw − 1)

SLIDE 38

First principal component (PC1)

  • Maximize f(w) s.t. g(w) = 0.
  • If (w*, α*) maximize the Lagrangian, then:
    – (stationarity point) ∇w L(w*, α*) = 0, hence ∇f(w*) = α* ∇g(w*);
    – for any admissible w (i.e. g(w) = 0), L(w*, α*) ≥ L(w, α*) (by definition of the maximum of L), hence f(w*) ≥ f(w) (because of admissibility, the constraint terms vanish).
  • That is to say, any maximizer of the Lagrangian gives us a maximizer of f under the constraint that g = 0.



SLIDE 41

First principal component (PC1)

  • The maximum is an extremum point, hence necessarily the gradient is zero.
  • Hence  ∇w L = 2Σw1 − 2αw1 = 0,  i.e.  Σw1 = αw1.
  • What does this tell us about α and w1 w.r.t. the matrix Σ?
    (α, w1) are an (eigenvalue, eigenvector) pair of Σ.

SLIDE 42

First principal component (PC1)

  • (α, w1) are an (eigenvalue, eigenvector) pair of Σ.
  • Now we want to find the maximum of all possible extremum points: at such a point, Var(Xw1) = w1ᵀΣw1 = α w1ᵀw1 = α.
  • Hence w1 is the eigenvector of Σ corresponding to its largest eigenvalue.
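
A minimal sketch of PC1 as the leading eigenvector of the empirical covariance matrix (random toy data; names are illustrative):

    # PC1 = eigenvector of Sigma with the largest eigenvalue.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X = X - X.mean(axis=0)                    # center (standardize first if needed)

    Sigma = X.T @ X / X.shape[0]              # empirical covariance, p x p
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

    w1 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    print(eigvals[-1], (X @ w1).var())        # top eigenvalue = variance along w1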


SLIDE 44

Second principal component (PC2)

  • Goal: find w2
    – of unit length: w2ᵀw2 = 1
    – orthogonal to w1: w2ᵀw1 = 0
    – such that the projection of x on w2 has maximal variance: maximize w2ᵀΣw2.
  • Hence the following optimization problem (Lagrangian):
    L(w2, α, β) = w2ᵀΣw2 − α (w2ᵀw2 − 1) − β (w2ᵀw1)

SLIDE 45

Second principal component (PC2)

  • Take the derivative with respect to w2 and set it to 0:
    2Σw2 − 2αw2 − βw1 = 0.
  • Pre-multiplying by w1ᵀ (and using w1ᵀw2 = 0, w1ᵀw1 = 1, Σw1 = λ1 w1) shows that the scalar β = 0.
  • Hence  Σw2 = αw2.

SLIDE 46

Second principal component (PC2)

  • The second principal component is given by the eigenvector of Σ with the second largest eigenvalue.
  • And so on and so forth for all other PCs.

SLIDE 47

Singular value decomposition

  • SVD of X:  X = U D Vᵀ
    – U: n x n, orthogonal
    – D: n x p, diagonal, non-negative (the singular values)
    – V: p x p, orthogonal
  • Cov(X):  Σ = (1/n) XᵀX  (p x p, X centered)
  • Eigendecomposition of Σ:  Σ = W Λ Wᵀ
    – W: p x p, j-th column = j-th eigenvector
    – Λ: p x p, diagonal, eigenvalues
  • The singular values of X are the square roots of the eigenvalues of XᵀX (i.e. of nΣ).
  • The right-singular vectors of X (the columns of V) are the eigenvectors of Σ.
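
A small check of this correspondence with NumPy, using the 1/n covariance convention from slide 34 (random toy data):

    # SVD of the centered data matrix vs. eigendecomposition of the covariance.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    X = X - X.mean(axis=0)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
    Sigma = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(Sigma)           # ascending order

    # eigenvalues of Sigma are the squared singular values divided by n
    print(np.allclose(sorted(s ** 2 / X.shape[0]), eigvals))
    # top right-singular vector = top eigenvector of Sigma (up to sign)
    print(np.allclose(np.abs(Vt[0]), np.abs(eigvecs[:, -1])))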

SLIDE 48

What PCA does

  • W: p x m matrix of the m leading eigenvectors of Σ.
  • m first PCs:  Z = X W
    Dimensions: (n, m) = (n, p) x (p, m)
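
A minimal sketch of the full projection Z = XW with NumPy (m = 2 is an arbitrary illustrative choice):

    # Project centered data onto the m leading eigenvectors of the covariance.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X = X - X.mean(axis=0)

    Sigma = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(Sigma)

    m = 2
    W = eigvecs[:, ::-1][:, :m]    # p x m: the m leading eigenvectors
    Z = X @ W                      # n x m: the first m principal components
    print(Z.shape)                 # (200, 2)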

SLIDE 49

How to choose m

  • Percentage of variance explained:
    – Total variance in the data = Tr(Σ) = λ1 + λ2 + … + λp.
    – The first m PCs account for (λ1 + … + λm) / (λ1 + … + λp) of the total variance.
  • Scree graph: variance explained as a function of the number of PCs.

SLIDE 50

How to choose m

  • Pick enough components to explain a fixed percentage of the variance.
  • Find the “elbow” (where adding more components doesn't really help much).
  • Scree graph: variance explained as a function of the number of PCs.
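
A minimal sketch of the "fixed percentage" rule computed from the eigenvalues (toy data; the 90% threshold is an arbitrary illustration):

    # Choose m so that the first m PCs explain at least 90% of the total variance.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
    X = X - X.mean(axis=0)

    eigvals = np.linalg.eigvalsh(X.T @ X / X.shape[0])[::-1]  # descending eigenvalues
    explained = np.cumsum(eigvals) / eigvals.sum()            # cumulative fraction of variance

    m = int(np.searchsorted(explained, 0.90)) + 1
    print(explained.round(3), "-> m =", m)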

SLIDE 51

Scree graph (example)

Optdigits dataset of the UCI repository [Ethem Alpaydin]

SLIDE 52

PCA example: Population genetics

Genetic data of 1387 Europeans (Novembre et al., 2008)

SLIDE 53

Non-linear feature extraction

SLIDE 54

Multidimensional scaling (MDS)

  • Find a mapping that preserves the dissimilarities between the data points.
    – In Euclidean space: find z_1, …, z_n in m dimensions such that the distances ||z_i − z_j|| match the original distances ||x_i − x_j|| as closely as possible. Equivalent to PCA!
    – The dissimilarity can come from another metric…
    – … or be non-metric.
  • Application: place cities on a map so as to respect their pairwise distances.
    Works up to translation / rotation.
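
A minimal sketch of the city-map application with scikit-learn; the distance matrix below is made up for illustration, not real data:

    # Metric MDS on a precomputed dissimilarity matrix: recover 2-D coordinates
    # whose pairwise distances match the given ones (up to translation/rotation).
    import numpy as np
    from sklearn.manifold import MDS

    # toy "inter-city" distance matrix (symmetric, zero diagonal)
    D = np.array([[0.0, 2.0, 5.0, 6.0],
                  [2.0, 0.0, 4.0, 5.5],
                  [5.0, 4.0, 0.0, 2.5],
                  [6.0, 5.5, 2.5, 0.0]])

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)
    print(coords)         # one (x, y) position per "city"
    print(mds.stress_)    # how well the distances are preserved (lower is better)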


SLIDE 60

Reconstructing a map of the Netherlands

http://www.wouterspekkink.org/?p=299

SLIDE 61

IsoMap

  • Incorporate local structure in MDS.
  • Create a neighborhood graph:
    – connect each point to its K nearest neighbors (or all neighbors within a distance ε)
    – weight each edge by the distance between its endpoints.
  • Compute geodesic distances on the neighborhood graph:
    length of the shortest (weighted) path, e.g. with Dijkstra's algorithm.
  • Apply MDS to the resulting dissimilarity matrix.
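
A minimal sketch of these three steps (neighborhood graph, geodesic distances, MDS) with scikit-learn and SciPy on toy data; in practice sklearn.manifold.Isomap wraps the same pipeline:

    # IsoMap in three steps: K-NN graph -> shortest-path (geodesic) distances -> MDS.
    import numpy as np
    from scipy.sparse.csgraph import shortest_path
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import MDS
    from sklearn.neighbors import kneighbors_graph

    X, _ = make_swiss_roll(n_samples=300, random_state=0)

    # 1. neighborhood graph: each point connected to its K nearest neighbors,
    #    edges weighted by distance (K must be large enough to keep the graph connected)
    graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

    # 2. geodesic distances: shortest weighted paths on that graph (Dijkstra)
    geodesic = shortest_path(graph, method="D", directed=False)

    # 3. MDS on the resulting dissimilarity matrix
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    embedding = mds.fit_transform(geodesic)
    print(embedding.shape)   # (300, 2)
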
SLIDE 62

IsoMap on “Swiss roll” data

Tenenbaum et al. (2000) A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science. http://web.mit.edu/cocosci/Papers/sci_reprint.pdf

SLIDE 63

t-Stochastic Neighbor Embedding (tSNE)

  • Very popular at the moment.
  • Approximate the distribution of pairwise distances in the data by a t-distribution (Student's).
    https://distill.pub/2016/misread-tsne/
    – P: distribution of the conditional probability that xi picks xj as a neighbor; neighbors are picked in proportion to their probability density under a Gaussian centered in xi.
    – Q: the corresponding distribution in the low-dimensional embedding follows a t-distribution.
  • The embedding minimizes the Kullback-Leibler divergence KL(P || Q) (measures how much P diverges from Q).
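
A minimal sketch with scikit-learn (toy data; the perplexity and other settings are arbitrary and, as the distill.pub article stresses, strongly affect the resulting picture):

    # t-SNE embedding of the digits data into 2 dimensions.
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)     # 1797 samples, 64 features
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    print(emb.shape)                        # (1797, 2); plot emb colored by y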

SLIDE 64

Summary

  • Goal: Use m < p features.
  • Feature selection:
    – Filter approaches
    – Wrapper approaches: Subset selection
      Greedy search of the best set of features.
    – Embedded approaches: Lasso, Elastic Net
  • Feature extraction: unsupervised
    – Linear: Principal Components Analysis
      Center the data and align it with the axes of largest variance.
    – Non-linear: MDS, IsoMap, tSNE
      Preserve local distances/structure.

SLIDE 65

References

  • An Introduction to Feature Selection. In: Applied Predictive Modeling (2013), Kuhn and Johnson.
    https://link.springer.com/chapter/10.1007/978-1-4614-6849-3_19
    This book chapter should be available to you from within CentraleSupélec. If you cannot access it, please let me know.
    – Feature Selection: Chapters 19.1 – 19.3
  • A Course in Machine Learning.
    http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
    – PCA: Chapter 15.2
  • The Elements of Statistical Learning.
    http://web.stanford.edu/~hastie/ElemStatLearn/
    – PCA: Chapter 14.5.1
    – MDS: Chapter 14.8
  • To go further
    – An Introduction to Variable and Feature Selection (2003), Guyon and Elisseeff.
      http://www.jmlr.org/papers/v3/guyon03a.html

SLIDE 66

Lab 1: some pointers

  • If pandas.plotting does not work:
    – Update your version of pandas
      • Pip: pip install pandas --upgrade
      • Anaconda: conda update pandas
    – Or try using pandas.tools.plotting instead.
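
A hedged sketch of that fallback; scatter_matrix is used here only as an example of a pandas.plotting helper, and the lab may use a different plot:

    # Import a plotting helper from pandas.plotting, falling back to the old
    # pandas.tools.plotting location on outdated pandas versions.
    import pandas as pd

    try:
        from pandas.plotting import scatter_matrix
    except ImportError:                      # very old pandas
        from pandas.tools.plotting import scatter_matrix

    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
    scatter_matrix(df, figsize=(4, 4))       # needs matplotlib installed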

SLIDE 67

Standardization
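
The code shown on this slide is not preserved in the transcript; a minimal sketch of feature standardization with scikit-learn, assuming the lab works on a NumPy array X, would be:

    # Standardize features to zero mean and unit variance before PCA.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[10.0, 0.1], [20.0, 0.2], [30.0, 0.3], [40.0, 0.4], [50.0, 0.5]])
    X_std = StandardScaler().fit_transform(X)
    print(X_std.mean(axis=0), X_std.std(axis=0))   # ~[0, 0] and [1, 1]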

SLIDE 68

Finding the first 2 components
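
Again the slide's code is not in the transcript; a minimal scikit-learn sketch on toy data would be:

    # Fit PCA on standardized data and keep the first 2 principal components.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # toy data: 100 samples, 5 features
    X_std = StandardScaler().fit_transform(X)

    pca = PCA(n_components=2)
    Z = pca.fit_transform(X_std)                  # (100, 2): the first two PCs
    print(Z.shape, pca.components_.shape)         # (100, 2) (2, 5)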

SLIDE 69

SLIDE 70

Computing the percentage of variance explained (sklearn)
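
The slide's code is not in the transcript; with scikit-learn the quantity defined on slide 49 is exposed as explained_variance_ratio_. A minimal sketch on toy data:

    # Percentage of variance explained by each principal component.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
    X_std = StandardScaler().fit_transform(X)

    pca = PCA().fit(X_std)
    print(pca.explained_variance_ratio_)               # lambda_j / sum of eigenvalues
    print(np.cumsum(pca.explained_variance_ratio_))    # cumulative: use it to choose m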