SLIDE 1

Datasets Preprocessing Dimensionality reduction

Preprocessing and Dimensionality Reduction

Jérémy Fix

CentraleSupélec, jeremy.fix@centralesupelec.fr

2017

SLIDE 2

You need datasets

You can use open datasets

For example, for experimenting with a new ML algorithm:

  • UCI ML Repository: http://archive.ics.uci.edu/ml/
  • Kaggle competitions, e.g. https://www.kaggle.com/c/diabetic-retinopathy-detection
  • well-known datasets for specific ML problems

SLIDE 3

Some available datasets

Facial expression classification

48x48 pixel grayscale images of faces. Labels: 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral. 28K train images; 3K for the public test set, another 3K for the final test set.

Kaggle, ICML 2013

SLIDE 4

Some available datasets

Object localization/detection

PascalVOC2012: 20 classes, 20000 train images, 20000 validation images, 11000 test images. Avg image size: 469x387 pixels, RGB.

Classes: person / bird, cat, cow, dog, horse, sheep / aeroplane, bicycle, boat, bus, car, motorbike, train / bottle, chair, dining table, potted plant, sofa, tv-monitor.

http://host.robots.ox.ac.uk/pascal/VOC/

SLIDE 5

Some available datasets

Object localization/detection

ImageNet, ILSVRC2014: 1000 classes, 1.2M train images, 50K validation, 100K test. Avg image size: 482x415 pixels, RGB.

ImageNet Large Scale Visual Recognition Challenge, Russakovsky et al. (2015)

SLIDE 6

Some available datasets

Object localization/detection

Open Images Dataset:

https://github.com/openimages/dataset

  • ≈ 9M automatically labelled images, 4M human-validated
  • 80M bounding boxes, 6000 classes
  • both meta labels (e.g. vehicle) and fine-grained labels (e.g. Honda NSX)

SLIDE 7

Some available datasets

Object segmentation

COCO 2017: 200K images, 80 classes, 500K masks

http://cocodataset.org/

SLIDE 8

Some available datasets

Recommendation systems

MovieLens, Netflix Prize, Anime Recommendations Database. MovieLens 20M:

  • 27K movies rated by 138K users
  • 5-star ratings with half-star increments (0.0, 0.5, ...)
  • 20M ratings
  • metadata (e.g. genre)
  • links to IMDb to enrich the metadata

https://grouplens.org/datasets/movielens/

SLIDE 9

Some available datasets

Automatic speech recognition

TIMIT, VoxForge, ... TIMIT:

  • 630 speakers, eight American English dialects
  • time-aligned orthographic, phonetic and word transcriptions
  • 16kHz speech waveform file for each utterance

https://catalog.ldc.upenn.edu/ldc93s1

SLIDE 10

Some available datasets

Sentiment analysis

Large Movie Review Dataset (IMDB)

  • 25K reviews for training, 25K reviews for testing
  • movie reviews (sentences), with rating ([1,10])
  • aim: are the reviews of a given product positive or negative?

Maas(2011), Learning Word Vectors for Sentiment Analysis

Automatic translation

Dataset from the European Parliament (Europarl dataset)

  • single-language datasets (language modelling)
  • parallel corpora (translation), e.g. French-English (2M sentences), Czech-English (650K sentences), ...

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn (2005)

SLIDE 11

You need datasets

You have a specific problem

You may need to collect data on your own.

  • Crawl the web? (e.g. the Twitter API, ...)
  • if supervised learning: assign labels (Mechanical Turk, domain experts (e.g. classifying tumors))
  • Ensure you collected sufficient features

SLIDE 12

Preprocessing

SLIDE 13

Preprocessing data

Data are not necessarily vectorial

  • Ordinal or categorical: poor/fair/excellent; Male/Female
  • Text documents : bag of words / word embeddings

Even if vectorial

  • Missing data: check how missing values are indicated (-9, ' ', ...) → imputation of missing values

  • Feature scaling

SLIDE 14

Your data might not be vectorial data

Ordinal and categorical features

Ordinal values have an order, e.g. poor < fair < excellent, and can be mapped to ordered numerical values.

Categorical values do not have an order (use one-hot encoding):

Categorical value: American, Spanish, German, French
Numerical value: [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
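As a concrete illustration, here is a minimal sketch of both encodings with scikit-learn; the column names and the ordinal category order are made up for the example, the slide does not prescribe a specific API:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns: an ordinal "quality" and a categorical "nationality".
quality = np.array([["poor"], ["excellent"], ["fair"]])
nationality = np.array([["American"], ["Spanish"], ["German"], ["French"]])

# Ordinal encoding: an explicit category order yields ordered numerical values.
ordinal = OrdinalEncoder(categories=[["poor", "fair", "excellent"]])
print(ordinal.fit_transform(quality))               # [[0.], [2.], [1.]]

# One-hot encoding: one 0/1 indicator column per category, no order implied
# (scikit-learn sorts the categories alphabetically).
onehot = OneHotEncoder()
print(onehot.fit_transform(nationality).toarray())  # 4 x 4 indicator matrix
```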

SLIDE 15

Your data might not be vectorial data

Vectorial representation of text documents

Bag Of Words

  • define a vocabulary V, |V| = n
  • for each document, build a vector x so that x_i is the frequency of the word V_i

e.g. V = {I, in, love, metz, machinelearning, study}
"I love machine learning and love metz too." → x = [1, 0, 2, 1, 1, 0]
"I love studying machine learning in Metz." → x = [1, 1, 1, 1, 1, 1]

This does not take word order into account → n-grams, but these lead to sparser representations (a scikit-learn sketch follows below).
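A minimal sketch of the bag-of-words construction with scikit-learn's CountVectorizer; the toy sentences mirror the slide, but here the vocabulary is inferred from the corpus rather than fixed by hand:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love machine learning and love metz too",
    "I love studying machine learning in Metz",
]

# Each document becomes a vector of word counts over the learned vocabulary.
# ngram_range=(1, 2) would also count bigrams, at the price of a larger, sparser vocabulary.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per document
```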

SLIDE 16

Your data might not be vectorial data

Vectorial representation of text documents

Word/sentence embeddings (e.g. word2vec, GloVe, fastText). Continuous Bag of Words (CBOW): predict a word given its context.

  • Input and output coded with one-hot
  • predict a word given its context
  • hidden layer : word representation

Captures some semantic information. For sentences: tweet2vec, sentence2vec, or averaging word vectors. See also: Bayesian approaches (e.g. Latent Dirichlet Allocation).
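As an illustration only, here is a minimal word2vec (CBOW) sketch using the gensim library; the toy corpus and all hyperparameter values are arbitrary choices for the example, not taken from the slides:

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per document (a real corpus would be much larger).
sentences = [
    ["i", "love", "machine", "learning"],
    ["i", "love", "studying", "machine", "learning", "in", "metz"],
]

# sg=0 selects the CBOW architecture (predict a word from its context);
# the hidden layer of size vector_size gives the word representations.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=200)

print(model.wv["machine"])               # 50-dimensional embedding of "machine"
print(model.wv.most_similar("machine"))  # nearest words in the embedding space
```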

Pennington (2014), GloVe: Global Vectors for Word Representation; Mikolov (2013), Efficient Estimation of Word Representations in Vector Space; https://fasttext.cc/

SLIDE 17

Some features might be missing

Missing features

  • Completely drop the samples with missing attributes, or the dimensions that have missing values
  • or try to impute, i.e. set a value in place of the missing attributes

For missing-value imputation, there are plenty of methods:

  • global: assign the mean, median, or most frequent value of the attribute
  • local: based on the k-nearest neighbors, decide which value to impute

The bias you may introduce by imputing a value may depend on the causes of the missing values, see [Silva(2014)]. A scikit-learn sketch is given below.
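A minimal sketch of both strategies with scikit-learn's imputers; the toy matrix and the choice of k are arbitrary for the example:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data with np.nan marking the missing entries
# (first convert whatever convention your file uses, e.g. -9 or '', to NaN).
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [8.0, 5.0]])

# Global imputation: replace each missing entry by the column mean (or median / most_frequent).
print(SimpleImputer(strategy="mean").fit_transform(X))

# Local imputation: replace each missing entry using the k nearest neighbours (here k=2).
print(KNNImputer(n_neighbors=2).fit_transform(X))
```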

Silva (2014). A brief review of the main approaches for treatment of missing data

SLIDE 18

Some vectorial data might not be appropriately scaled

Feature scaling

  • dimensions with the largest variations will dominate Euclidean distances (e.g. nearest neighbors)
  • when gradient descent is involved, feature scaling makes convergence faster (the loss becomes closer to circularly symmetric)
  • when regularization is involved, we would like to use a single regularization coefficient, independent of the scale of the features

SLIDE 19

Some vectorial data might not be appropriately scaled

Feature scaling

Given $x_i \in \mathbb{R}^d$, you can normalize by:

  • min/max scaling: $\forall i, \forall j \in [0, d-1],\; x'_{i,j} = \dfrac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j}}$

  • z-score normalization: $\forall i, \forall j \in [0, d-1],\; x'_{i,j} = \dfrac{x_{i,j} - \mu_j}{\sigma_j}$, with $\mu_j = \dfrac{1}{N}\sum_k x_{k,j}$ and $\sigma_j = \sqrt{\dfrac{1}{N}\sum_k (x_{k,j} - \mu_j)^2}$

Your statistics must be computed from the training set and then applied to the test data as well (see the sketch below).
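A minimal scikit-learn sketch of both scalings, emphasising that the statistics are fit on the training set only; the random data is just for the example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(100, 5))
X_test = rng.normal(loc=10.0, scale=3.0, size=(20, 5))

# min/max scaling: the statistics (min, max) are estimated on the training set only...
minmax = MinMaxScaler().fit(X_train)
X_train_mm, X_test_mm = minmax.transform(X_train), minmax.transform(X_test)

# ...and the same holds for the z-score normalization (mean, std).
zscore = StandardScaler().fit(X_train)
X_train_z, X_test_z = zscore.transform(X_train), zscore.transform(X_test)

print(X_train_z.mean(axis=0), X_train_z.std(axis=0))  # ≈ 0 and ≈ 1 per feature
```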

SLIDE 20

Dimensionality reduction

SLIDE 21

Dimensionality reduction : what/why/how ?

What

Optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ so that $d \ll n$. It remains to define what "optimally transform" means.

Why

  • visualization of the data
  • interpretability of the predictor
  • speed up the algorithms whose complexity depends on n
  • data may occupy a manifold of lower dimensionality than n
  • curse of dimensionality: data quickly become sparse, and models may overfit

SLIDE 22

Dimensionality reduction: why ?

Data analysis/Visualization

How are your data distributed? How are your classes intertwined? Do we have discriminative features? (Figure: t-SNE on MNIST, van der Maaten et al.)

SLIDE 23

Dimensionality reduction: why ?

Interpretability of the predictor

e.g. Why does this predictor say the tumor is malignant? (Figures: two predictors on the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset, with real risk 0.92 ± 0.05 and 0.92 ± 0.06, estimated by 10-fold CV.)

SLIDE 24

Dimensionality reduction: why ?

Speed up of the algorithms

Decreasing the dimensionality decreases training/inference times. For example:

  • Linear regression: $\hat{y} = \theta^T x + b$
  • Logistic regression (classification): $P(y = 1 \mid x) = \dfrac{1}{1 + \exp(-\theta^T x)}$

Both training and inference are in O(n), with $x \in \mathbb{R}^n$.

SLIDE 25

Dimensionality reduction: why ?

The data may occupy a lower dimensional manifold

Swiss roll → you do not necessarily lose information by reducing the number of dimensions.

SLIDES 26-28

Dimensionality reduction: why ?

The data may occupy a lower dimensional manifold

You want to classify facial expressions of a single person, controlled illumination:

  • suppose a huge image resolution, e.g. 1024 × 1024 RGB pixels, $x \in \mathbb{R}^{1024 \times 1024 \times 3}$
  • what is the dimensionality of the data manifold? ≈ 50

→ you do not necessarily lose information by reducing the number of dimensions.

SLIDE 29

Dimensionality reduction: why ?

You may even have better predictors : Curse of dimensionality

The data become (exponentially) quickly sparse with respect to the dimension. (Image from [Goodfellow, Bengio, Courville (2016), Deep Learning]; see also [Hastie et al. (2017), The Elements of Statistical Learning].)

SLIDE 30

Dimensionality reduction : what/why/how ?

What

Optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ so that $d \ll n$. It remains to define what "optimally transform" means.

How

  • select a subset of the original features: feature selection
  • compute new features from the original ones: feature extraction

SLIDE 31

Feature selection: select a subset of the original features/attributes/dimensions,

$x_i \in \mathbb{R}^n \rightarrow z_i \in \mathbb{R}^d$

SLIDE 32

Feature selection

Overview

  • Embedded: the ML algorithm is designed to select a subset of the features, e.g. linear regression with an L1 penalty
  • Filters: dimensions are selected based on a heuristic
  • Wrappers: dimensions are selected based on an estimation of the real risk

⇒ Notebook ”Feature selection.ipynb”

SLIDE 33

Feature selection: embedded

Embedded : the loss to minimize embeds a penalty promoting sparsity.

Least Absolute Shrinkage and Selection Operator (LASSO)

Given a regression problem $(x_i, y_i)$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, optimize w.r.t. $\theta$:

$$\min_\theta \; \frac{1}{N}\sum_{i=0}^{N-1} (y_i - \theta^T x_i)^2 + \lambda \|\theta\|_1 \qquad (1)$$

This is linear regression with an L1 penalty; the L1 penalty promotes sparse predictors (see the sketch below).

Tibshirani (1996). Regression Shrinkage and Selection via the Lasso
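A minimal scikit-learn sketch illustrating the sparsity induced by the L1 penalty; the synthetic data and the regularization strength alpha are arbitrary choices for the example:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
# Only 3 of the 30 features actually influence the target.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * X[:, 7] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)     # L1-penalized linear regression
ols = LinearRegression().fit(X, y)     # ordinary least squares, for comparison

print("non-zero coefficients (lasso):", np.sum(lasso.coef_ != 0))   # close to 3
print("non-zero coefficients (OLS):  ", np.sum(ols.coef_ != 0))     # typically 30
```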

SLIDE 34

Feature selection: embedded

LASSO example

N = 30 points, $y_i = 0.5 + 0.4 \sin(2\pi x_i) + \mathcal{N}(0, 0.01)$.
30 RBF features + a constant term: $\phi(x) = \big[1, e^{-\frac{(x_0 - x)^2}{2\sigma^2}}, \ldots, e^{-\frac{(x_{N-1} - x)^2}{2\sigma^2}}\big]$

(Figures: fit of the samples with the true function, ordinary linear regression and L1-penalized linear regression; values of the learned parameters.)

SLIDE 35

Feature selection: embedded

Decision tree example

Decision tree with Gini impurity, max depth = 2, 10-fold CV (0.92). UCI ML Breast Cancer Wisconsin dataset: 569 samples, binary classification, 30 continuous features. (A sketch follows below.)
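A minimal sketch reproducing this kind of embedded selection with scikit-learn; the depth and scoring choices follow the slide, and treating the tree's chosen split features as the selected subset is the point of the example:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A depth-2 tree can use at most 3 of the 30 features: the tree itself selects them.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
scores = cross_val_score(tree, X, y, cv=10)
print("10-fold CV accuracy: %.2f" % scores.mean())

tree.fit(X, y)
selected = np.flatnonzero(tree.feature_importances_ > 0)
print("features actually used by the tree:", selected)
```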

SLIDE 36

Feature selection: univariate filters

Principle: measure the correlation/dependency between each input feature, considered independently, and the target. E.g. chi-squared, ANOVA test of independence, mutual information, Pearson correlation, ...

Example (continuous feature, discrete target): ANOVA. (Figures: ANOVA F-values on the breast cancer dataset; class-conditional densities P(x_14 | y) (lowest F) and P(x_27 | y) (highest F).)
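A minimal scikit-learn sketch of such a univariate ANOVA filter on the same dataset; keeping the 5 best features is an arbitrary choice for the example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature independently with the ANOVA F-test, keep the k best ones.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)

print("F-values:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
X_reduced = selector.transform(X)   # shape (569, 5)
```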

SLIDE 37

Feature selection: multivariate filters and wrappers

Overview

Denote χ a subset of the dimensions/attributes/features:

  • suppose we are provided a measure J(χ) of how good this subset is
  • we optimize J(χ) over the possible subsets χ

If $x \in \mathbb{R}^n$, we have $2^n$ possible subsets: $\chi \in \{\emptyset, \{x_1\}, \{x_2\}, \cdots, \{x_1, x_2\}, \cdots\}$

http://featureselection.asu.edu/: algorithms and datasets, Python package scikit-feature. John et al. (1994), Irrelevant Features and the Subset Selection Problem.

SLIDE 38

Feature selection: optimizing J(χ)

Tree search

(Figure: search tree over feature subsets, from ∅ to the singletons $\{x_0\}, \ldots, \{x_{d-1}\}$, then the pairs, and so on. The number of subsets of size k is $\frac{d!}{k!(d-k)!}$.)

Sequential Forward Search / Sequential Backward Search

If you allow steps to be undone: "Sequential Floating Forward Search" / "Sequential Floating Backward Search". Variants and extensions: Somol et al. (2010), Efficient Feature Subset Selection and Subset Size Optimization. (A sketch of the forward search is given below.)
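A minimal sketch of a greedy Sequential Forward Search used as a wrapper, where J(χ) is estimated by cross-validation; the classifier, the dataset and the stopping size are arbitrary choices for the example (scikit-learn also ships a SequentialFeatureSelector implementing the same idea):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

def J(features):
    """Wrapper criterion: estimated real risk (here, mean CV accuracy) of the subset."""
    return cross_val_score(estimator, X[:, features], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < 5:                  # greedily grow the subset up to 5 features
    best_j, best_score = None, -1.0
    for j in remaining:                   # try to add each remaining feature
        score = J(selected + [j])
        if score > best_score:
            best_j, best_score = j, score
    selected.append(best_j)
    remaining.remove(best_j)
    print("added feature %d, CV accuracy %.3f" % (best_j, best_score))
```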

SLIDE 39

Feature selection: quantifying the quality of a subset of features

We need to quantify the quality J(χ) of a subset of features. Filters use a heuristic to be maximized.

Filters

Heuristic: e.g. Correlation-based Feature Selection (CFS). Strategy: keep features correlated with the label, yet uncorrelated with each other. Given a training set $\{(x_i, y_i)\}$:

$$J_{CFS}(\chi) = \frac{k\,\bar{r}(\chi, y)}{\sqrt{k + k(k-1)\,\bar{r}(\chi, \chi)}}$$

$$\bar{r}(\chi, y) = \frac{1}{k}\sum_{j \in \chi} r(x_{\cdot,j}, y), \qquad \bar{r}(\chi, \chi) = \frac{1}{k(k-1)}\sum_{(j_1, j_2) \in \chi,\, j_1 \neq j_2} r(x_{\cdot,j_1}, x_{\cdot,j_2})$$

with $k = |\chi|$ and r a measure of correlation.

SLIDE 40

Feature selection: quantifying the quality of a subset of features

We need to quantify the quality J(χ) of a subset of features. Wrappers use an estimation of the real risk, to be minimized.

Wrappers

1. Train a predictor from the subset χ
2. J(χ) = estimation of the real risk (e.g. by cross-validation)

More theoretically grounded, but more computationally expensive.

SLIDE 41

Feature extraction: given N samples $x_i \in \mathbb{R}^d$, we compute $r \ll d$ new features from the original d features.

SLIDE 42

Principal Component Analysis [Pearson(1901)]

Statement : Find an affine transformation of the data minimizing the reconstruction error

Intuition and formalisation

(Figure: a 2D point cloud and a fitted line parameterized by $(w_0, w_1)$.)

In 1D, we seek a line $(w_0, w_1)$ minimizing the sum of the squared lengths of the red segments. It is not unique!

SLIDE 43

Principal Component Analysis [Pearson(1901)]

Statement: find an affine transformation of the data minimizing the reconstruction error. Formally:

$$\min_{\{w_0, w_1, \ldots, w_r\} \subset \mathbb{R}^d} \; \sum_{i=0}^{N-1} \Big\| x_i - \big(w_0 + \sum_{j=1}^{r} (w_j^T (x_i - w_0))\, w_j\big) \Big\|_2^2 \qquad (2)$$

subject to $w_i^T w_j = \delta_{i,j}$.

→ matrix form?

SLIDE 44

Principal Component Analysis [Pearson(1901)]

Matrix formulation of PCA

Introduce $W = (w_1 | \ldots | w_r) \in M_{d \times r}(\mathbb{R})$:

$$(2) \Leftrightarrow \min_{\{w_0, w_1, \ldots, w_r\} \subset \mathbb{R}^d} \; \sum_{i=0}^{N-1} \big\| (I_d - WW^T)(x_i - w_0) \big\|_2^2$$

subject to $W^T W = I_r$.

SLIDE 45

Principal Component Analysis [Pearson(1901)]

Simplification of the matrix formulation

  • If M is idempotent, so is $(I - M)$
  • $(I_d - WW^T)$ is symmetric and idempotent

$$(2) \Leftrightarrow \min_{\{w_0, w_1, \ldots, w_r\} \subset \mathbb{R}^d} \; \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$

subject to $W^T W = I_r$.

SLIDE 46

Principal Component Analysis [Pearson(1901)]

Remember: for $u: \mathbb{R}^n \to \mathbb{R}^m$, $v: \mathbb{R}^n \to \mathbb{R}^m$, $A \in M_{m,m}(\mathbb{R})$: $\dfrac{d\,(u^T A v)}{dx} = \dfrac{du}{dx} A v + \dfrac{dv}{dx} A^T u$

Finding $w_0$

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0), \qquad \frac{\partial J}{\partial w_0} = -2 (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$

SLIDE 47

Principal Component Analysis [Pearson(1901)]

Finding w0

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0), \qquad \frac{\partial J}{\partial w_0} = -2 (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$

$$\frac{\partial J}{\partial w_0} = 0 \;\Leftrightarrow\; \exists h \in \operatorname{span}\{w_1, \ldots, w_r\}, \quad w_0 = h + \frac{1}{N}\sum_i x_i$$

$(I_d - WW^T)h$ is the residual of the orthogonal projection of h onto the column vectors of W. If $h \in \operatorname{span}\{w_1, \ldots, w_r\}$, the residual is 0; if $h \in \operatorname{span}\{w_1, \ldots, w_r\}^\perp$, the residual is h.

SLIDE 48

Principal Component Analysis [Pearson(1901)]

Finding w0

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0), \qquad \operatorname{argmin}_{w_0} J \;\Rightarrow\; w_0 = h + \frac{1}{N}\sum_i x_i, \quad h \in \operatorname{span}\{w_1, \ldots, w_r\},\; \text{e.g. } h = 0$$

The offset $w_0$

The offset $w_0$ is the mean of the data points, up to a translation in the space spanned by the principal component vectors. Step 1: center the data.

SLIDE 49

Principal Component Analysis [Pearson(1901)]

Denote $\tilde{x}_i = x_i - \bar{x}$, with $\bar{x} = \frac{1}{N}\sum_i x_i$.

Deriving the first principal component

  • $J = \sum_{i=0}^{N-1} \tilde{x}_i^T (I_d - WW^T)\,\tilde{x}_i$
  • $\operatorname{argmin}_{w_1, \ldots, w_r} J = \operatorname{argmax}_{w_1, \ldots, w_r} \sum_{i=0}^{N-1} \tilde{x}_i^T WW^T \tilde{x}_i$
  • with $\tilde{X} = (\tilde{x}_0 | \ldots | \tilde{x}_{N-1})$:
  • $\operatorname{argmin}_{w_1, \ldots, w_r} J = \operatorname{argmax}_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T \tilde{X}\tilde{X}^T w_j$

Our optimization problem turns out to be: $\operatorname{argmax}_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T \tilde{X}\tilde{X}^T w_j$ subject to $W^T W = I_r$. We have a constrained optimization problem: Lagrangian.

SLIDE 50

Principal Component Analysis [Pearson(1901)]

Deriving the first principal component : Lagrangian

$\operatorname{argmax}_{w_1} w_1^T \tilde{X}\tilde{X}^T w_1$ subject to $w_1^T w_1 = 1$

  • $L(w_1, \lambda_1) = w_1^T \tilde{X}\tilde{X}^T w_1 + \lambda_1 (1 - w_1^T w_1)$
  • $\frac{\partial L}{\partial w_1} = 0 \Rightarrow \tilde{X}\tilde{X}^T w_1 = \lambda_1 w_1$: $w_1$ is an eigenvector, but for which $\lambda_1$?
  • $w_1^T \tilde{X}\tilde{X}^T w_1 = \lambda_1$, so $\lambda_1$ is the largest eigenvalue of $\tilde{X}\tilde{X}^T$

First principal component vector

The first principal component vector is a normalized eigenvector associated with the largest eigenvalue of the "sample covariance matrix" $\tilde{X}\tilde{X}^T$.

SLIDE 51

Principal Component Analysis [Pearson(1901)]

Deriving the second principal component : Greedy

Suppose we have $w_1$, a normalized eigenvector of $\tilde{X}\tilde{X}^T$ associated with its largest eigenvalue. Denote $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$ the eigenvalues. We want to optimize:

$$\operatorname{argmax}_{w_2}\; w_1^T \tilde{X}\tilde{X}^T w_1 + w_2^T \tilde{X}\tilde{X}^T w_2 = \operatorname{argmax}_{w_2}\; \lambda_1 + w_2^T \tilde{X}\tilde{X}^T w_2 = \operatorname{argmax}_{w_2}\; w_2^T \tilde{X}\tilde{X}^T w_2$$

with $w_i^T w_j = \delta_{i,j}$. And $w_2^T \tilde{X}\tilde{X}^T w_2 = w_2^T (\tilde{X}\tilde{X}^T - \lambda_1 w_1 w_1^T)\, w_2$.

$w_2$ is a normalized eigenvector associated with the largest eigenvalue of $(\tilde{X}\tilde{X}^T - \lambda_1 w_1 w_1^T)$, i.e. $\lambda_2$.

And so on. But is the greedy algorithm finding the optimum?

SLIDE 52

Principal Component Analysis [Pearson(1901)]

Deriving the other principal components: greedy

Does it make sense to use a greedy algorithm ? (proof in lecture notes)

Theorem

For any symmetric positive semi-definite matrix $M \in M_{d \times d}(\mathbb{R})$, denote $\{\lambda_i\}_{i=1..d}$ its eigenvalues, with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. For any set of $r \in [|1, d|]$ orthogonal unit vectors $\{v_1, \ldots, v_r\}$, we have:

$$\sum_{j=1}^{r} v_j^T M v_j \leq \sum_{j=1}^{r} \lambda_j \qquad (3)$$

and this upper bound is reached by eigenvectors associated with the r largest eigenvalues of M.

SLIDE 53

Principal Component Analysis [Pearson(1901)]

PCA : recipe

Given $\{x_0, \ldots, x_{N-1}\} \subset \mathbb{R}^d$, to compute the r principal component vectors:

1. Center your data: $\tilde{x}_i = x_i - \bar{x}$
2. Build the matrix $\tilde{X} = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3. Compute r normalized eigenvectors associated with the r largest eigenvalues of $\tilde{X}\tilde{X}^T$

A numpy sketch of this recipe is given below.
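A minimal numpy sketch of the recipe, assuming the samples are stacked as the columns of X as in the slides; the random data and r = 2 are arbitrary for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))            # d = 5 features, N = 200 samples (one per column)
r = 2

# 1. Center the data
x_bar = X.mean(axis=1, keepdims=True)
X_tilde = X - x_bar

# 2./3. Eigendecomposition of X_tilde @ X_tilde.T (symmetric, so eigh is appropriate);
# eigh returns eigenvalues in ascending order, so take the last r columns.
eigvals, eigvecs = np.linalg.eigh(X_tilde @ X_tilde.T)
W = eigvecs[:, -r:][:, ::-1]             # d x r matrix of principal component vectors

# Project the centered data: principal components of the training set
Z = W.T @ X_tilde                        # r x N
print(W.shape, Z.shape)
```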

SLIDE 54

Principal Component Analysis [Pearson(1901)]

PCA is a projection method

Given $x \in \mathbb{R}^d$, its principal components are its coordinates in the selected eigenspace: $x \mapsto ((x - \bar{x})^T w_1, (x - \bar{x})^T w_2, \ldots, (x - \bar{x})^T w_r)$. If $x \in \{x_0, \ldots, x_{N-1}\}$, you had better use the SVD, which directly gives you the principal components.

SLIDE 55

Principal Component Analysis [Pearson(1901)]

Singular Value Decomposition

For any matrix $M \in M_{d,N}(\mathbb{R})$, there exist an orthogonal matrix $U \in M_{d,d}(\mathbb{R})$, a diagonal matrix $D \in M_{d,N}(\mathbb{R})$ and an orthogonal matrix $V \in M_{N,N}(\mathbb{R})$ such that:

$$M = U D V^T$$

Orthogonal matrices: $U^T = U^{-1}$.

SLIDE 56

Principal Component Analysis [Pearson(1901)]

PCA with SVD

Given $\tilde{X} = U D V^T$: $\tilde{X}\tilde{X}^T = U D D^T U^{-1}$. This is the diagonalization of $\tilde{X}\tilde{X}^T$. The projection vectors are the column vectors of U: $\{w_1, \ldots, w_r\} = \{u_1, \ldots, u_r\}$. The principal components of the training set are the first r rows of:

$$U^T \tilde{X} = U^T U D V^T = D V^T$$
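The same computation through the SVD, in a short numpy sketch (continuing the kind of toy, already-centered data used in the previous sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X_tilde = rng.normal(size=(5, 200))      # assumed already centered, d = 5, N = 200
r = 2

# Columns of U are the principal directions; the principal components are rows of D V^T.
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
W = U[:, :r]                             # r principal component vectors
Z = s[:r, None] * Vt[:r, :]              # equals W.T @ X_tilde, i.e. the first r rows of D V^T
print(np.allclose(Z, W.T @ X_tilde))     # True
```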

SLIDE 57

Principal Component Analysis [Pearson(1901)]

What is $\tilde{X}\tilde{X}^T$?

$$\tilde{X}\tilde{X}^T = \sum_{i=0}^{N-1} \tilde{x}_i \tilde{x}_i^T = \sum_i \Big(x_i - \frac{1}{N}\sum_j x_j\Big)\Big(x_i - \frac{1}{N}\sum_j x_j\Big)^T = (N-1)\,\Sigma$$

with Σ the sample covariance matrix. Σ is symmetric and positive semi-definite, i.e. its eigenvalues are all non-negative.

SLIDE 58

Principal Component Analysis [Pearson(1901)]

Equivalent formulations

There are two equivalent formulations of PCA:

  • find an affine transformation minimizing the reconstruction error
  • find an affine transformation maximizing the variance of the projections

SLIDE 59

Principal Component Analysis

Maximizing the variance of the projections

Suppose your data are centered, i.e. $\frac{1}{N}\sum_i x_i = 0$. Denote $z_i \in \mathbb{R}^r$ the projection of $x_i$ onto $w_1, \ldots, w_r$. We have $\bar{z} = 0$. The sample covariance matrix $\Sigma \in M_{r,r}(\mathbb{R})$ of the projections is:

$$\Sigma = \frac{1}{N-1}\sum_i z_i z_i^T = \frac{1}{N-1}\, W^T \Big(\sum_i x_i x_i^T\Big) W$$

We want to maximize $\sum_{j=1}^{r} \Sigma_{j,j}$, and:

$$\sum_{j=1}^{r} \Sigma_{j,j} = \frac{1}{N-1}\sum_j \sum_i (w_j^T x_i)(x_i^T w_j) = \frac{1}{N-1}\sum_j w_j^T X X^T w_j$$

This is the same optimization problem as before.

SLIDE 60

Principal Component Analysis [Pearson(1901)]

What is the fraction of variance we keep ?

For any matrix M and orthogonal matrix P: $\operatorname{Tr}(P^{-1} M P) = \operatorname{Tr}(M)$. Therefore $\operatorname{Tr}(\tilde{X}\tilde{X}^T) = \sum_i \lambda_i$, and the variance of our datapoints is

$$\operatorname{Tr}\Big(\frac{1}{N-1}\tilde{X}\tilde{X}^T\Big) = \frac{1}{N-1}\sum_i \lambda_i$$

If we keep r principal components, we keep a fraction of the variance equal to:

$$\frac{\sum_{i=0}^{r-1} \lambda_i}{\sum_i \lambda_i}$$
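A short sketch of this ratio with scikit-learn's PCA, whose explained_variance_ratio_ attribute reports exactly this fraction per kept component; the data and r = 2 are arbitrary for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # samples as rows, correlated features

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)        # fraction of the variance per kept component
print(pca.explained_variance_ratio_.sum())  # total fraction of the variance kept
```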

SLIDE 61

Principal Component Analysis [Pearson(1901)]

PCA on MNIST (28 × 28 images)

(Figures: projection of MNIST onto the first 2 principal vectors, capturing 17.05% of the total variance; the 10 first principal vectors displayed as 28 × 28 images.)

SLIDE 62

Sample covariance and Gram matrices

Definitions

The sample covariance matrix is:

$$\Sigma = \frac{1}{N-1}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{N-1} X X^T$$

(with X the matrix of centered samples). The Gram matrix is:

$$G = X^T X = \begin{pmatrix} x_0^T x_0 & x_0^T x_1 & \cdots & x_0^T x_{N-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N-1}^T x_0 & x_{N-1}^T x_1 & \cdots & x_{N-1}^T x_{N-1} \end{pmatrix}$$

The Gram matrix is built up from dot products. The eigenvectors/eigenvalues of G and Σ are related!

SLIDE 63

Sample covariance and Gram matrices

Lemma

$\forall A \in \mathbb{R}^{n \times m}$, $\ker(A) = \ker(A^T A)$

Theorem (Rank-nullity)

$\forall A \in \mathbb{R}^{n \times m}$, $\operatorname{rk}(A) + \dim(\ker(A)) = m$

Theorem

$\forall A \in \mathbb{R}^{n \times m}$, $\operatorname{rk}(A^T A) = \operatorname{rk}(A A^T) \leq \min(n, m)$

SLIDE 64

Lemma (Eigenvalues of the covariance and Gram matrices)

The nonzero eigenvalues of the scaled covariance matrix $(N-1)\Sigma = XX^T$ and of the Gram matrix $G = X^T X$ are the same:

$$\{\lambda \in \mathbb{R}^*, \exists v \neq 0, (N-1)\Sigma v = \lambda v\} = \{\lambda \in \mathbb{R}^*, \exists v \neq 0, G v = \lambda v\}$$

And, during the proof, we show that:

  • if $(\lambda, v)$ is an eigenpair of $XX^T$, then $(\lambda, X^T v)$ is an eigenpair of $X^T X$
  • if $(\lambda, w)$ is an eigenpair of $X^T X$, then $(\lambda, X w)$ is an eigenpair of $XX^T$

There are several applications of this property:

  • the eigenface algorithm, used when N ≪ d
  • the nonlinear PCA called Kernel PCA [Schölkopf, 1999]

SLIDE 65

What to do when N ≪ d

$G \in M_{N,N}(\mathbb{R})$, $\Sigma \in M_{d,d}(\mathbb{R})$.

Eigenface

If N ≪ d, it is much more efficient to "diagonalize" G than Σ. In that case, the recipe is:

1. Center your data: $\tilde{x}_i = x_i - \bar{x}$
2. Build the matrix $\tilde{X} = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3. Compute the r normalized eigenvectors $w_j \in \mathbb{R}^N$ of G, with eigenvalues $\lambda_j$
4. Project your data on the r normalized eigenvectors of Σ given by: $\dfrac{\tilde{X} w_j}{\|\tilde{X} w_j\|_2} = \dfrac{1}{\sqrt{\lambda_j}} \tilde{X} w_j$

A numpy sketch of this trick is given below.
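A minimal numpy sketch of the N ≪ d trick; the shapes are arbitrary for the example, and it only checks that the vectors obtained through the Gram matrix are indeed unit eigenvectors on the covariance side:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r = 20, 1000, 3                        # few samples, high dimension
X = rng.normal(size=(d, N))
X_tilde = X - X.mean(axis=1, keepdims=True)  # centered samples as columns

# Diagonalize the small N x N Gram matrix instead of the d x d covariance matrix.
G = X_tilde.T @ X_tilde
eigvals, eigvecs = np.linalg.eigh(G)
lam = eigvals[-r:][::-1]                     # r largest eigenvalues
w = eigvecs[:, -r:][:, ::-1]                 # corresponding eigenvectors of G (in R^N)

# Map them to normalized eigenvectors of X_tilde X_tilde^T (in R^d).
W = (X_tilde @ w) / np.sqrt(lam)             # d x r

# Sanity check: the columns of W are unit eigenvectors of the scaled covariance matrix.
print(np.allclose((X_tilde @ X_tilde.T) @ W, W * lam))
print(np.allclose(np.linalg.norm(W, axis=0), 1.0))
```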

SLIDE 66

Toward a Kernel PCA

We can reformulate the PCA using only dot products.

PCA with only dot products

Computing the Gram matrix involves only dot products between the $x_i$. Projecting a vector x on the vector $\frac{1}{\sqrt{\lambda_j}}\tilde{X} w_j$ reads:

$$\Big(\frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j\Big)^T x = \frac{1}{\sqrt{\lambda_j}}\, w_j^T \begin{pmatrix} \langle x_0, x\rangle \\ \vdots \\ \langle x_{N-1}, x\rangle \end{pmatrix}$$

A linear algorithm involving only dot products can be rendered non-linear using the kernel trick (see SVM). The only remaining difficulty is that we must ensure the vectors in the feature space are centered.

SLIDE 67

Non-linear PCA

Kernel PCA [Schölkopf(1999)]

Consider a kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $\langle \phi(x), \phi(x')\rangle = k(x, x')$, e.g.

  • RBF kernel: $k(x, x') = \exp\big(-\frac{\|x - x'\|_2^2}{2\sigma^2}\big)$

We perform a PCA in the feature space, the image of φ: compute the Gram matrix and its eigenvectors/eigenvalues $w_j, \lambda_j$. For projecting a vector x, compute:

$$\frac{1}{\sqrt{\lambda_j}}\, w_j^T \begin{pmatrix} k(x_0, x) \\ \vdots \\ k(x_{N-1}, x) \end{pmatrix}$$

What about centering the $\phi(x_i)$?

SLIDE 68

Non-linear PCA

Kernel PCA : centering in the feature space

It can be shown that the Gram matrix $\tilde{G} = (I_N - \frac{1}{N}\mathbf{1})\, G\, (I_N - \frac{1}{N}\mathbf{1})$, where $\mathbf{1}$ is the N × N matrix of ones, is the matrix of the dot products of the feature vectors centered in the feature space. This transformation is called the double centering transformation. A numpy sketch follows below.
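A minimal numpy sketch of kernel PCA with an RBF kernel and double centering, following the steps above; the toy data, σ and r = 2 are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # N = 100 samples in R^2 (rows)
N, sigma, r = X.shape[0], 1.0, 2

# RBF Gram matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Double centering: Gram matrix of the centered feature vectors.
J = np.eye(N) - np.ones((N, N)) / N
K_centered = J @ K @ J

# Eigenvectors of the centered Gram matrix, largest eigenvalues first.
eigvals, eigvecs = np.linalg.eigh(K_centered)
lam, w = eigvals[-r:][::-1], eigvecs[:, -r:][:, ::-1]

# Principal components of the training points: project the kernel columns on w_j / sqrt(lambda_j).
Z = K_centered @ w / np.sqrt(lam)              # N x r embedding
print(Z.shape)
```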

SLIDE 69

Feature extraction: manifold learning. Goal: for each $x_i \in \mathbb{R}^d$, associate $y_i \in \mathbb{R}^r$ so that the pairwise distances in $\mathbb{R}^d$ are as similar as possible to the pairwise distances in $\mathbb{R}^r$. Perfect for visualizing datasets in low dimensions. Examples: LLE, MDS, Isomap, SNE, t-SNE, ...

SLIDE 70

Manifold learning

Overview

$x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}^r$, $r \ll d$, e.g. r = 2:

1. Quantify the similarity between pairs of points in $\mathbb{R}^d$
2. Quantify the similarity between pairs of points in $\mathbb{R}^r$
3. Quantify the discrepancy between these similarities
4. Optimize with respect to the $y_i$

SLIDE 71

Manifold learning

t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]

Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to become even larger in $\mathbb{R}^r$.

  • Similarity in $\mathbb{R}^d$:

$$\forall i, j, \quad p_{i,j} = \frac{p_{j|i} + p_{i|j}}{2N}, \qquad p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$

SLIDE 72

Manifold learning

t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]

Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to become even larger in $\mathbb{R}^r$.

  • Similarity in $\mathbb{R}^r$:

$$\forall i, j, \quad q_{i,j} = \frac{(1 + \|y_i - y_j\|_2^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|_2^2)^{-1}}$$

  • Maximize the agreement between the $q_{i,j}$ and the $p_{i,j}$ by minimizing the Kullback-Leibler divergence:

$$C = \sum_{i,j} p_{i,j} \log\frac{p_{i,j}}{q_{i,j}}$$

Complexity O(N²); an optimized version with the Barnes-Hut approximation brings the complexity down to O(N log N). A short usage sketch is given below.
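A minimal sketch of running t-SNE with scikit-learn for visualization; the digits dataset and the perplexity value are arbitrary choices for the example:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)        # 1797 samples of 8x8 digit images (64 features)

# Embed in r = 2 dimensions; the perplexity controls the effective neighbourhood size sigma_i.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)                  # shape (1797, 2), one 2D point per sample

print(Y.shape)
# Y can then be scatter-plotted, coloring each point by its class label y.
```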

SLIDE 73

Manifold learning

t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]
