Linear Dimensionality Reduction (Practical Machine Learning, CS294-34) - PowerPoint PPT Presentation


slide-1
SLIDE 1

Linear Dimensionality Reduction

Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang

slide-2
SLIDE 2

Lots of high-dimensional data...

face images

Zambian President Levy Mwanawasa has won a second term in office in an election his challenger Michael Sata accused him of rigging, official results showed on Monday. According to media reports, a pair of hackers said on Saturday that the Firefox Web browser, commonly perceived as the safer and more customizable alternative to market leader Internet Explorer, is critically flawed. A presentation on the flaw was shown during the ToorCon hacker conference in San Diego.

documents, gene expression data, MEG readings

2

slide-3
SLIDE 3

Motivation and context

Why do dimensionality reduction?

  • Computational: compress data ⇒ time/space efficiency

3

slide-4
SLIDE 4

Motivation and context

Why do dimensionality reduction?

  • Computational: compress data ⇒ time/space efficiency
  • Statistical: fewer dimensions ⇒ better generalization

3

slide-5
SLIDE 5

Motivation and context

Why do dimensionality reduction?

  • Computational: compress data ⇒ time/space efficiency
  • Statistical: fewer dimensions ⇒ better generalization
  • Visualization: understand structure of data

3

slide-6
SLIDE 6

Motivation and context

Why do dimensionality reduction?

  • Computational: compress data ⇒ time/space efficiency
  • Statistical: fewer dimensions ⇒ better generalization
  • Visualization: understand structure of data
  • Anomaly detection: describe normal data, detect outliers

3

slide-7
SLIDE 7

Motivation and context

Why do dimensionality reduction?

  • Computational: compress data ⇒ time/space efficiency
  • Statistical: fewer dimensions ⇒ better generalization
  • Visualization: understand structure of data
  • Anomaly detection: describe normal data, detect outliers

Dimensionality reduction in this course:

  • Linear methods (this week)
  • Clustering (last week)
  • Feature selection (next week)
  • Nonlinear methods (later)

3

slide-8
SLIDE 8

Types of problems

  • Prediction x → y: classification, regression

4

slide-9
SLIDE 9

Types of problems

  • Prediction x → y: classification, regression

Applications: face recognition, gene expression prediction
Techniques: kNN, SVM, least squares (+ dimensionality reduction preprocessing)

4

slide-10
SLIDE 10

Types of problems

  • Prediction x → y: classification, regression

Applications: face recognition, gene expression prediction
Techniques: kNN, SVM, least squares (+ dimensionality reduction preprocessing)

  • Structure discovery x → z: find an alternative

representation z of data x

4

slide-11
SLIDE 11

Types of problems

  • Prediction x → y: classification, regression

Applications: face recognition, gene expression prediction
Techniques: kNN, SVM, least squares (+ dimensionality reduction preprocessing)

  • Structure discovery x → z: find an alternative

representation z of data x

Applications: visualization
Techniques: clustering, linear dimensionality reduction

4

slide-12
SLIDE 12

Types of problems

  • Prediction x → y: classification, regression

Applications: face recognition, gene expression prediction
Techniques: kNN, SVM, least squares (+ dimensionality reduction preprocessing)

  • Structure discovery x → z: find an alternative

representation z of data x

Applications: visualization
Techniques: clustering, linear dimensionality reduction

  • Density estimation p(x): model the data

4

slide-13
SLIDE 13

Types of problems

  • Prediction x → y: classification, regression

Applications: face recognition, gene expression prediction
Techniques: kNN, SVM, least squares (+ dimensionality reduction preprocessing)

  • Structure discovery x → z: find an alternative

representation z of data x

Applications: visualization
Techniques: clustering, linear dimensionality reduction

  • Density estimation p(x): model the data

Applications: anomaly detection, language modeling
Techniques: clustering, linear dimensionality reduction

4

slide-14
SLIDE 14

Basic idea of linear dimensionality reduction

Represent each face as a high-dimensional vector x ∈ R361

5

slide-15
SLIDE 15

Basic idea of linear dimensionality reduction

Represent each face as a high-dimensional vector x ∈ R361
x ∈ R361  →  z = U⊤x  →  z ∈ R10

5

slide-16
SLIDE 16

Basic idea of linear dimensionality reduction

Represent each face as a high-dimensional vector x ∈ R361
x ∈ R361  →  z = U⊤x  →  z ∈ R10
How do we choose U?

5

slide-17
SLIDE 17

Outline

  • Principal component analysis (PCA)

– Basic principles – Case studies – Kernel PCA – Probabilistic PCA

  • Canonical correlation analysis (CCA)
  • Fisher discriminant analysis (FDA)
  • Summary

6

slide-18
SLIDE 18

Roadmap

  • Principal component analysis (PCA)

– Basic principles – Case studies – Kernel PCA – Probabilistic PCA

  • Canonical correlation analysis (CCA)
  • Fisher discriminant analysis (FDA)
  • Summary

Principal component analysis (PCA) / Basic principles 7

slide-19
SLIDE 19

Dimensionality reduction setup

Given n data points in d dimensions: x1, . . . , xn ∈ Rd

Principal component analysis (PCA) / Basic principles 8

slide-20
SLIDE 20

Dimensionality reduction setup

Given n data points in d dimensions: x1, . . . , xn ∈ Rd
X = (x1 · · · xn) ∈ Rd×n

Principal component analysis (PCA) / Basic principles 8

slide-21
SLIDE 21

Dimensionality reduction setup

Given n data points in d dimensions: x1, . . . , xn ∈ Rd
X = (x1 · · · xn) ∈ Rd×n
Want to reduce dimensionality from d to k

Principal component analysis (PCA) / Basic principles 8

slide-22
SLIDE 22

Dimensionality reduction setup

Given n data points in d dimensions: x1, . . . , xn ∈ Rd
X = (x1 · · · xn) ∈ Rd×n
Want to reduce dimensionality from d to k
Choose k directions u1, . . . , uk

Principal component analysis (PCA) / Basic principles 8

slide-23
SLIDE 23

Dimensionality reduction setup

Given n data points in d dimensions: x1, . . . , xn ∈ Rd
X = (x1 · · · xn) ∈ Rd×n
Want to reduce dimensionality from d to k
Choose k directions u1, . . . , uk
U = (u1 · · · uk) ∈ Rd×k

Principal component analysis (PCA) / Basic principles 8

slide-24
SLIDE 24

Dimensionality reduction setup

Given n data points in d dimensions: x1, . . . , xn ∈ Rd
X = (x1 · · · xn) ∈ Rd×n
Want to reduce dimensionality from d to k
Choose k directions u1, . . . , uk
U = (u1 · · · uk) ∈ Rd×k
For each uj, compute “similarity” zj = uj⊤x

Principal component analysis (PCA) / Basic principles 8

slide-25
SLIDE 25

Dimensionality reduction setup

Given n data points in d dimensions: x1, . . . , xn ∈ Rd
X = (x1 · · · xn) ∈ Rd×n
Want to reduce dimensionality from d to k
Choose k directions u1, . . . , uk
U = (u1 · · · uk) ∈ Rd×k
For each uj, compute “similarity” zj = uj⊤x
Project x down to z = (z1, . . . , zk)⊤ = U⊤x

Principal component analysis (PCA) / Basic principles 8

slide-26
SLIDE 26

Dimensionality reduction setup

Given n data points in d dimensions: x1, . . . , xn ∈ Rd
X = (x1 · · · xn) ∈ Rd×n
Want to reduce dimensionality from d to k
Choose k directions u1, . . . , uk
U = (u1 · · · uk) ∈ Rd×k
For each uj, compute “similarity” zj = uj⊤x
Project x down to z = (z1, . . . , zk)⊤ = U⊤x
How to choose U?

Principal component analysis (PCA) / Basic principles 8

slide-27
SLIDE 27

PCA objective 1: reconstruction error

U serves two functions:

  • Encode: z = U⊤x, zj = uj⊤x

Principal component analysis (PCA) / Basic principles 9

slide-28
SLIDE 28

PCA objective 1: reconstruction error

U serves two functions:

  • Encode: z = U⊤x, zj = uj⊤x
  • Decode: x̃ = Uz = Σ_{j=1}^k zj uj

Principal component analysis (PCA) / Basic principles 9

slide-29
SLIDE 29

PCA objective 1: reconstruction error

U serves two functions:

  • Encode: z = U⊤x, zj = uj⊤x
  • Decode: x̃ = Uz = Σ_{j=1}^k zj uj

Want reconstruction error ‖x − x̃‖ to be small

Principal component analysis (PCA) / Basic principles 9

slide-30
SLIDE 30

PCA objective 1: reconstruction error

U serves two functions:

  • Encode: z = U⊤x, zj = uj⊤x
  • Decode: x̃ = Uz = Σ_{j=1}^k zj uj

Want reconstruction error ‖x − x̃‖ to be small
Objective: minimize total squared reconstruction error
min_{U ∈ Rd×k} Σ_{i=1}^n ‖xi − UU⊤xi‖²

Principal component analysis (PCA) / Basic principles 9
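
The two bullets above define the encode/decode maps and the squared-error objective; the following is a minimal NumPy sketch of that computation, assuming a centered data matrix X (d × n) and some orthonormal U (here random, standing in for the PCA solution). None of the variable values come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 10, 100, 3
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)          # center the data

# A random orthonormal U (d x k); PCA would choose this to minimize the error
U, _, _ = np.linalg.svd(rng.normal(size=(d, k)), full_matrices=False)

Z = U.T @ X                                    # encode: z = U^T x
X_tilde = U @ Z                                # decode: x_tilde = U z
error = np.sum((X - X_tilde) ** 2)             # sum_i ||x_i - U U^T x_i||^2
print(error)
```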

slide-31
SLIDE 31

PCA objective 2: projected variance

Empirical distribution: uniform over x1, . . . , xn

Principal component analysis (PCA) / Basic principles 10

slide-32
SLIDE 32

PCA objective 2: projected variance

Empirical distribution: uniform over x1, . . . , xn
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(xi)

Principal component analysis (PCA) / Basic principles 10

slide-33
SLIDE 33

PCA objective 2: projected variance

Empirical distribution: uniform over x1, . . . , xn
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(xi)
Variance (think sum of squares if centered):
v̂ar[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(xi)²

Principal component analysis (PCA) / Basic principles 10

slide-34
SLIDE 34

PCA objective 2: projected variance

Empirical distribution: uniform over x1, . . . , xn
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(xi)
Variance (think sum of squares if centered):
v̂ar[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(xi)²
Assume data is centered: Ê[x] = 0

Principal component analysis (PCA) / Basic principles 10

slide-35
SLIDE 35

PCA objective 2: projected variance

Empirical distribution: uniform over x1, . . . , xn
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(xi)
Variance (think sum of squares if centered):
v̂ar[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(xi)²
Assume data is centered: Ê[x] = 0 (what’s Ê[U⊤x]?)

Principal component analysis (PCA) / Basic principles 10

slide-36
SLIDE 36

PCA objective 2: projected variance

Empirical distribution: uniform over x1, . . . , xn
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(xi)
Variance (think sum of squares if centered):
v̂ar[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(xi)²
Assume data is centered: Ê[x] = 0 (what’s Ê[U⊤x]?)
Objective: maximize variance of projected data
max_{U ∈ Rd×k, U⊤U = I} Ê[‖U⊤x‖²]

Principal component analysis (PCA) / Basic principles 10

slide-37
SLIDE 37

Equivalence in two objectives

Key intuition:
variance of data (fixed) = captured variance (want large) + reconstruction error (want small)

Principal component analysis (PCA) / Basic principles 11

slide-38
SLIDE 38

Equivalence in two objectives

Key intuition:
variance of data (fixed) = captured variance (want large) + reconstruction error (want small)
Pythagorean decomposition: x = UU⊤x + (I − UU⊤)x

Principal component analysis (PCA) / Basic principles 11

slide-39
SLIDE 39

Equivalence in two objectives

Key intuition: variance of data

  • fixed

= captured variance

  • want large

+ reconstruction error

  • want small

Pythagorean decomposition: x = UU⊤x + (I − UU⊤)x UU⊤x (I − UU⊤)x x

Principal component analysis (PCA) / Basic principles 11

slide-40
SLIDE 40

Equivalence in two objectives

Key intuition: variance of data

  • fixed

= captured variance

  • want large

+ reconstruction error

  • want small

Pythagorean decomposition: x = UU⊤x + (I − UU⊤)x UU⊤x (I − UU⊤)x x Take expectations; note rotation U doesn’t affect length: ˆ E[x2] = ˆ E[U⊤x2] + ˆ E[x − UU⊤x2]

Principal component analysis (PCA) / Basic principles 11

slide-41
SLIDE 41

Equivalence in two objectives

Key intuition: variance of data

  • fixed

= captured variance

  • want large

+ reconstruction error

  • want small

Pythagorean decomposition: x = UU⊤x + (I − UU⊤)x UU⊤x (I − UU⊤)x x Take expectations; note rotation U doesn’t affect length: ˆ E[x2] = ˆ E[U⊤x2] + ˆ E[x − UU⊤x2] Minimize reconstruction error ↔ Maximize captured variance

Principal component analysis (PCA) / Basic principles 11

slide-42
SLIDE 42

Finding one principal component

Input data: X =( x1 . . . xn)

Principal component analysis (PCA) / Basic principles 12

slide-43
SLIDE 43

Finding one principal component

Input data: X = (x1 . . . xn)
Objective: maximize variance of projected data

Principal component analysis (PCA) / Basic principles 12

slide-44
SLIDE 44

Finding one principal component

Input data: X = (x1 . . . xn)
Objective: maximize variance of projected data
= max_{‖u‖=1} Ê[(u⊤x)²]

Principal component analysis (PCA) / Basic principles 12

slide-45
SLIDE 45

Finding one principal component

Input data: X = (x1 . . . xn)
Objective: maximize variance of projected data
= max_{‖u‖=1} Ê[(u⊤x)²]
= max_{‖u‖=1} (1/n) Σ_{i=1}^n (u⊤xi)²

Principal component analysis (PCA) / Basic principles 12

slide-46
SLIDE 46

Finding one principal component

Input data: X = (x1 . . . xn)
Objective: maximize variance of projected data
= max_{‖u‖=1} Ê[(u⊤x)²]
= max_{‖u‖=1} (1/n) Σ_{i=1}^n (u⊤xi)²
= max_{‖u‖=1} (1/n) ‖u⊤X‖²

Principal component analysis (PCA) / Basic principles 12

slide-47
SLIDE 47

Finding one principal component

Input data: X = (x1 . . . xn)
Objective: maximize variance of projected data
= max_{‖u‖=1} Ê[(u⊤x)²]
= max_{‖u‖=1} (1/n) Σ_{i=1}^n (u⊤xi)²
= max_{‖u‖=1} (1/n) ‖u⊤X‖²
= max_{‖u‖=1} u⊤ ((1/n) XX⊤) u

Principal component analysis (PCA) / Basic principles 12

slide-48
SLIDE 48

Finding one principal component

Input data: X = (x1 . . . xn)
Objective: maximize variance of projected data
= max_{‖u‖=1} Ê[(u⊤x)²]
= max_{‖u‖=1} (1/n) Σ_{i=1}^n (u⊤xi)²
= max_{‖u‖=1} (1/n) ‖u⊤X‖²
= max_{‖u‖=1} u⊤ ((1/n) XX⊤) u
= largest eigenvalue of C := (1/n) XX⊤ (C is the covariance matrix of the data)

Principal component analysis (PCA) / Basic principles 12
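
The derivation above says the best single direction is the top eigenvector of C = (1/n)XX⊤. A small NumPy sketch of exactly that computation, assuming the columns of X are already centered (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)    # center the data

C = (X @ X.T) / n                        # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
u = eigvecs[:, -1]                       # top principal component
projected_variance = eigvals[-1]         # = max_{||u||=1} u^T C u
print(u, projected_variance)
```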

slide-49
SLIDE 49

How many principal components?

  • Similar to question of “How many clusters?”
  • Magnitude of eigenvalues indicates the fraction of variance captured.

Principal component analysis (PCA) / Basic principles 15

slide-50
SLIDE 50

How many principal components?

  • Similar to question of “How many clusters?”
  • Magnitude of eigenvalues indicates the fraction of variance captured.
  • Eigenvalues on a face image dataset:

(plot: eigenvalue λi against index i for a face image dataset)

Principal component analysis (PCA) / Basic principles 15

slide-51
SLIDE 51

How many principal components?

  • Similar to question of “How many clusters?”
  • Magnitude of eigenvalues indicates the fraction of variance captured.
  • Eigenvalues on a face image dataset:

(plot: eigenvalue λi against index i for a face image dataset)

  • Eigenvalues typically drop off sharply, so don’t need that many.
  • Of course variance isn’t everything...

Principal component analysis (PCA) / Basic principles 15
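
A hedged sketch of one common way to pick k: keep enough eigenvalues of C to capture a target fraction of the variance. The 95% threshold and the synthetic data are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 300))
X = X - X.mean(axis=1, keepdims=True)

eigvals = np.sort(np.linalg.eigvalsh(X @ X.T / X.shape[1]))[::-1]  # descending
frac = np.cumsum(eigvals) / eigvals.sum()          # fraction of variance captured
k = int(np.searchsorted(frac, 0.95) + 1)           # smallest k capturing 95%
print(k, frac[:10])
```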

slide-52
SLIDE 52

Computing PCA

Method 1: eigendecomposition
U are eigenvectors of the covariance matrix C = (1/n) XX⊤
Computing C already takes O(nd²) time (very expensive)

Principal component analysis (PCA) / Basic principles 16

slide-53
SLIDE 53

Computing PCA

Method 1: eigendecomposition
U are eigenvectors of the covariance matrix C = (1/n) XX⊤
Computing C already takes O(nd²) time (very expensive)
Method 2: singular value decomposition (SVD)
Find X = Ud×d Σd×n V⊤n×n
where U⊤U = Id×d, V⊤V = In×n, Σ is diagonal
Computing the top k singular vectors takes only O(ndk)

Principal component analysis (PCA) / Basic principles 16

slide-54
SLIDE 54

Computing PCA

Method 1: eigendecomposition
U are eigenvectors of the covariance matrix C = (1/n) XX⊤
Computing C already takes O(nd²) time (very expensive)
Method 2: singular value decomposition (SVD)
Find X = Ud×d Σd×n V⊤n×n
where U⊤U = Id×d, V⊤V = In×n, Σ is diagonal
Computing the top k singular vectors takes only O(ndk)
Relationship between eigendecomposition and SVD:
Left singular vectors are principal components (C = UΣ²U⊤)

Principal component analysis (PCA) / Basic principles 16
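
A minimal sketch of Method 2: get the top-k principal components from a thin SVD of the centered X rather than forming C explicitly. Variable names follow the slides; the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 100, 500, 10
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

# Thin SVD: X = U S V^T with U (d x r), S (r,), V^T (r x n)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]                 # top-k principal components (left singular vectors)
Z = U_k.T @ X                  # k x n low-dimensional codes
eigvals = S[:k] ** 2 / n       # corresponding eigenvalues of C = (1/n) X X^T
```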

slide-55
SLIDE 55

Roadmap

  • Principal component analysis (PCA)

– Basic principles – Case studies – Kernel PCA – Probabilistic PCA

  • Canonical correlation analysis (CCA)
  • Fisher discriminant analysis (FDA)
  • Summary

Principal component analysis (PCA) / Case studies 17

slide-56
SLIDE 56

Eigen-faces [Turk and Pentland, 1991]

  • d = number of pixels
  • Each xi ∈ Rd is a face image
  • xji = intensity of the j-th pixel in image i

Principal component analysis (PCA) / Case studies 18

slide-57
SLIDE 57

Eigen-faces [Turk and Pentland, 1991]

  • d = number of pixels
  • Each xi ∈ Rd is a face image
  • xji = intensity of the j-th pixel in image i

Xd×n ≅ Ud×k Zk×n
(x1 . . . xn) ≅ U (z1 . . . zn)

Principal component analysis (PCA) / Case studies 18

slide-58
SLIDE 58

Eigen-faces [Turk and Pentland, 1991]

  • d = number of pixels
  • Each xi ∈ Rd is a face image
  • xji = intensity of the j-th pixel in image i

Xd×n ≅ Ud×k Zk×n
(x1 . . . xn) ≅ U (z1 . . . zn)

Idea: zi is a more “meaningful” representation of the i-th face than xi
Can use zi for nearest-neighbor classification
Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k
Why no time savings for a linear classifier?

Principal component analysis (PCA) / Case studies 18
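
A small illustrative helper for the nearest-neighbor use described above; U_k, Z_train, and labels are assumed to come from PCA on training faces and are hypothetical names, not part of the original deck.

```python
import numpy as np

def nearest_neighbor_label(x_test, U_k, Z_train, labels):
    """Project a test face into eigen-face space and return the closest training label.

    U_k: d x k principal components; Z_train: k x n codes (Z_train = U_k.T @ X_train);
    labels: length-n array of class labels.
    """
    z_test = U_k.T @ x_test                                      # O(dk) projection
    dists = np.linalg.norm(Z_train - z_test[:, None], axis=0)   # O(nk) distances
    return labels[int(np.argmin(dists))]
```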

slide-59
SLIDE 59

Latent Semantic Analysis [Deerwester, 1990]

  • d = number of words in the vocabulary
  • Each xi ∈ Rd is a vector of word counts
  • xji = frequency of word j in document i

Xd×n ≅ Ud×k Zk×n

X (word counts):               U (principal directions):
  stocks:   2  · · ·  0          0.4    · ·  -0.001
  chairman: 4  · · ·  1          0.8    · ·   0.03
  the:      8  · · ·  7          0.01   · ·   0.04
  ...                            ...
  wins:     0  · · ·  2          0.002  · ·   2.3
  game:     1  · · ·  3          0.003  · ·   1.9

Z = (z1 . . . zn)

Principal component analysis (PCA) / Case studies 19

slide-60
SLIDE 60

Latent Semantic Analysis [Deerwester, 1990]

  • d = number of words in the vocabulary
  • Each xi ∈ Rd is a vector of word counts
  • xji = frequency of word j in document i

Xd×n ≅ Ud×k Zk×n

X (word counts):               U (principal directions):
  stocks:   2  · · ·  0          0.4    · ·  -0.001
  chairman: 4  · · ·  1          0.8    · ·   0.03
  the:      8  · · ·  7          0.01   · ·   0.04
  ...                            ...
  wins:     0  · · ·  2          0.002  · ·   2.3
  game:     1  · · ·  3          0.003  · ·   1.9

Z = (z1 . . . zn)

How to measure similarity between two documents? z1⊤z2 is probably better than x1⊤x2

Principal component analysis (PCA) / Case studies 19

slide-61
SLIDE 61

Latent Semantic Analysis [Deerwester, 1990]

  • d = number of words in the vocabulary
  • Each xi ∈ Rd is a vector of word counts
  • xji = frequency of word j in document i

Xd×n ≅ Ud×k Zk×n

X (word counts):               U (principal directions):
  stocks:   2  · · ·  0          0.4    · ·  -0.001
  chairman: 4  · · ·  1          0.8    · ·   0.03
  the:      8  · · ·  7          0.01   · ·   0.04
  ...                            ...
  wins:     0  · · ·  2          0.002  · ·   2.3
  game:     1  · · ·  3          0.003  · ·   1.9

Z = (z1 . . . zn)

How to measure similarity between two documents? z1⊤z2 is probably better than x1⊤x2

Applications: information retrieval

Principal component analysis (PCA) / Case studies 19

slide-62
SLIDE 62

Latent Semantic Analysis [Deerwester, 1990]

  • d = number of words in the vocabulary
  • Each xi ∈ Rd is a vector of word counts
  • xji = frequency of word j in document i

Xd×n ≅ Ud×k Zk×n

X (word counts):               U (principal directions):
  stocks:   2  · · ·  0          0.4    · ·  -0.001
  chairman: 4  · · ·  1          0.8    · ·   0.03
  the:      8  · · ·  7          0.01   · ·   0.04
  ...                            ...
  wins:     0  · · ·  2          0.002  · ·   2.3
  game:     1  · · ·  3          0.003  · ·   1.9

Z = (z1 . . . zn)

How to measure similarity between two documents? z1⊤z2 is probably better than x1⊤x2

Applications: information retrieval
Note: no computational savings; the original x is already sparse

Principal component analysis (PCA) / Case studies 19
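
A hedged sketch of the LSA pipeline above: truncate the SVD of a term-document matrix and compare documents by z1⊤z2 instead of x1⊤x2. The random count matrix is a stand-in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(5000, 200)).astype(float)   # d words x n documents

U, S, Vt = np.linalg.svd(X, full_matrices=False)        # thin SVD
k = 50
Z = U[:, :k].T @ X                                      # k x n document codes

sim_lsa = Z[:, 0] @ Z[:, 1]                             # z1^T z2 similarity
sim_raw = X[:, 0] @ X[:, 1]                             # x1^T x2 similarity
print(sim_lsa, sim_raw)
```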

slide-63
SLIDE 63

Network anomaly detection [Lakhina, ’05]

xji = amount of traffic on link j in the network during each time interval i

Principal component analysis (PCA) / Case studies 20

slide-64
SLIDE 64

Network anomaly detection [Lakhina, ’05]

xji = amount of traffic on link j in the network during each time interval i

Model assumption: total traffic is sum of flows along a few “paths”

Principal component analysis (PCA) / Case studies 20

slide-65
SLIDE 65

Network anomaly detection [Lakhina, ’05]

xji = amount of traffic on link j in the network during each time interval i

Model assumption: total traffic is sum of flows along a few “paths” Apply PCA: each principal component intuitively represents a “path”

Principal component analysis (PCA) / Case studies 20

slide-66
SLIDE 66

Network anomaly detection [Lakhina, ’05]

xji = amount of traffic on link j in the network during each time interval i

Model assumption: total traffic is sum of flows along a few “paths” Apply PCA: each principal component intuitively represents a “path” Anomaly when traffic deviates from first few principal components

Principal component analysis (PCA) / Case studies 20

slide-67
SLIDE 67

Network anomaly detection [Lakhina, ’05]

xji = amount of traffic on link j in the network during each time interval i

Model assumption: total traffic is sum of flows along a few “paths” Apply PCA: each principal component intuitively represents a “path” Anomaly when traffic deviates from first few principal components

Principal component analysis (PCA) / Case studies 20

slide-68
SLIDE 68

Unsupervised POS tagging [Schütze, ’95]

Part-of-speech (POS) tagging task:

Input: I like reducing the dimensionality of data .
Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .

Principal component analysis (PCA) / Case studies 21

slide-69
SLIDE 69

Unsupervised POS tagging [Schütze, ’95]

Part-of-speech (POS) tagging task:

Input: I like reducing the dimensionality of data .
Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .

Each xi is (the context distribution of) a word; xji is the number of times word i appeared in context j
Key idea: words appearing in similar contexts tend to have the same POS tags, so cluster using the contexts of each word type
Problem: contexts are too sparse

Principal component analysis (PCA) / Case studies 21

slide-70
SLIDE 70

Unsupervised POS tagging [Schütze, ’95]

Part-of-speech (POS) tagging task:

Input: I like reducing the dimensionality of data .
Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .

Each xi is (the context distribution of) a word; xji is the number of times word i appeared in context j
Key idea: words appearing in similar contexts tend to have the same POS tags, so cluster using the contexts of each word type
Problem: contexts are too sparse
Solution: run PCA first, then cluster using the new representation

Principal component analysis (PCA) / Case studies 21

slide-71
SLIDE 71

Multi-task learning [Ando & Zhang, ’05]

  • Have n related tasks (classify documents for various users)
  • Each task has a linear classifier with weights xi
  • Want to share structure between classifiers

Principal component analysis (PCA) / Case studies 22

slide-72
SLIDE 72

Multi-task learning [Ando & Zhang, ’05]

  • Have n related tasks (classify documents for various users)
  • Each task has a linear classifier with weights xi
  • Want to share structure between classifiers

One step of their procedure: given n linear classifiers x1, . . . , xn, run PCA to identify shared structure:

Principal component analysis (PCA) / Case studies 22

slide-73
SLIDE 73

Multi-task learning [Ando & Zhang, ’05]

  • Have n related tasks (classify documents for various users)
  • Each task has a linear classifier with weights xi
  • Want to share structure between classifiers

One step of their procedure: given n linear classifiers x1, . . . , xn, run PCA to identify shared structure: X =( x1 . . . xn) ≅ UZ

Principal component analysis (PCA) / Case studies 22

slide-74
SLIDE 74

Multi-task learning [Ando & Zhang, ’05]

  • Have n related tasks (classify documents for various users)
  • Each task has a linear classifier with weights xi
  • Want to share structure between classifiers

One step of their procedure: given n linear classifiers x1, . . . , xn, run PCA to identify shared structure:
X = (x1 . . . xn) ≅ UZ
Each principal component is an eigen-classifier

Principal component analysis (PCA) / Case studies 22

slide-75
SLIDE 75

Multi-task learning [Ando & Zhang, ’05]

  • Have n related tasks (classify documents for various users)
  • Each task has a linear classifier with weights xi
  • Want to share structure between classifiers

One step of their procedure: given n linear classifiers x1, . . . , xn, run PCA to identify shared structure:
X = (x1 . . . xn) ≅ UZ
Each principal component is an eigen-classifier
Other step of their procedure: retrain classifiers, regularizing towards the subspace U

Principal component analysis (PCA) / Case studies 22

slide-76
SLIDE 76

PCA summary

  • Intuition: capture variance of data or minimize

reconstruction error

Principal component analysis (PCA) / Case studies 23

slide-77
SLIDE 77

PCA summary

  • Intuition: capture variance of data or minimize

reconstruction error

  • Algorithm: find eigendecomposition of covariance

matrix or SVD

Principal component analysis (PCA) / Case studies 23

slide-78
SLIDE 78

PCA summary

  • Intuition: capture variance of data or minimize

reconstruction error

  • Algorithm: find eigendecomposition of covariance

matrix or SVD

  • Impact: reduce storage (from O(nd) to O(nk)), reduce

time complexity

Principal component analysis (PCA) / Case studies 23

slide-79
SLIDE 79

PCA summary

  • Intuition: capture variance of data or minimize

reconstruction error

  • Algorithm: find eigendecomposition of covariance

matrix or SVD

  • Impact: reduce storage (from O(nd) to O(nk)), reduce

time complexity

  • Advantages: simple, fast

Principal component analysis (PCA) / Case studies 23

slide-80
SLIDE 80

PCA summary

  • Intuition: capture variance of data or minimize

reconstruction error

  • Algorithm: find eigendecomposition of covariance

matrix or SVD

  • Impact: reduce storage (from O(nd) to O(nk)), reduce

time complexity

  • Advantages: simple, fast
  • Applications: eigen-faces, eigen-documents, network

anomaly detection, etc.

Principal component analysis (PCA) / Case studies 23

slide-81
SLIDE 81

Roadmap

  • Principal component analysis (PCA)

– Basic principles – Case studies – Kernel PCA – Probabilistic PCA

  • Canonical correlation analysis (CCA)
  • Fisher discriminant analysis (FDA)
  • Summary

Principal component analysis (PCA) / Kernel PCA 24

slide-82
SLIDE 82

Limitations of linearity

Principal component analysis (PCA) / Kernel PCA 25

slide-83
SLIDE 83

Limitations of linearity

PCA is effective

Principal component analysis (PCA) / Kernel PCA 25

slide-84
SLIDE 84

Limitations of linearity

PCA is effective

Principal component analysis (PCA) / Kernel PCA 25

slide-85
SLIDE 85

Limitations of linearity

PCA is effective PCA is ineffective

Principal component analysis (PCA) / Kernel PCA 25

slide-86
SLIDE 86

Limitations of linearity

(figures: one case where PCA is effective and one where it is ineffective)
Problem is that the PCA subspace is linear: S = {x = Uz : z ∈ Rk}

Principal component analysis (PCA) / Kernel PCA 25

slide-87
SLIDE 87

Limitations of linearity

(figures: one case where PCA is effective and one where it is ineffective)
Problem is that the PCA subspace is linear: S = {x = Uz : z ∈ Rk}
In this example: S = {(x1, x2) : x2 = (u2/u1) x1}

Principal component analysis (PCA) / Kernel PCA 25

slide-88
SLIDE 88

Going beyond linearity: quick solution

Broken solution

Principal component analysis (PCA) / Kernel PCA 26

slide-89
SLIDE 89

Going beyond linearity: quick solution

(figures: broken solution vs. desired solution)
We want the desired solution: S = {(x1, x2) : x2 = (u2/u1) x1²}

Principal component analysis (PCA) / Kernel PCA 26

slide-90
SLIDE 90

Going beyond linearity: quick solution

(figures: broken solution vs. desired solution)
We want the desired solution: S = {(x1, x2) : x2 = (u2/u1) x1²}
We can get this: S = {φ(x) = Uz} with φ(x) = (x1², x2)⊤

Principal component analysis (PCA) / Kernel PCA 26

slide-91
SLIDE 91

Going beyond linearity: quick solution

(figures: broken solution vs. desired solution)
We want the desired solution: S = {(x1, x2) : x2 = (u2/u1) x1²}
We can get this: S = {φ(x) = Uz} with φ(x) = (x1², x2)⊤

Linear dimensionality reduction in φ(x) space ⇔ Nonlinear dimensionality reduction in x space

Principal component analysis (PCA) / Kernel PCA 26

slide-92
SLIDE 92

Going beyond linearity: quick solution

(figures: broken solution vs. desired solution)
We want the desired solution: S = {(x1, x2) : x2 = (u2/u1) x1²}
We can get this: S = {φ(x) = Uz} with φ(x) = (x1², x2)⊤

Linear dimensionality reduction in φ(x) space ⇔ Nonlinear dimensionality reduction in x space

In general, can set φ(x) = (x1, x1², x1x2, sin(x1), . . . )⊤

Principal component analysis (PCA) / Kernel PCA 26

slide-93
SLIDE 93

Going beyond linearity: quick solution

(figures: broken solution vs. desired solution)
We want the desired solution: S = {(x1, x2) : x2 = (u2/u1) x1²}
We can get this: S = {φ(x) = Uz} with φ(x) = (x1², x2)⊤

Linear dimensionality reduction in φ(x) space ⇔ Nonlinear dimensionality reduction in x space

In general, can set φ(x) = (x1, x1², x1x2, sin(x1), . . . )⊤

Problems: (1) ad-hoc and tedious (2) φ(x) large, computationally expensive

Principal component analysis (PCA) / Kernel PCA 26

slide-94
SLIDE 94

Towards kernels

Representer theorem: the PCA solution is a linear combination of the xi

Principal component analysis (PCA) / Kernel PCA 27

slide-95
SLIDE 95

Towards kernels

Representer theorem: the PCA solution is a linear combination of the xi
Why? Recall the PCA eigenvalue problem: XX⊤u = λu

Principal component analysis (PCA) / Kernel PCA 27

slide-96
SLIDE 96

Towards kernels

Representer theorem: the PCA solution is a linear combination of the xi
Why? Recall the PCA eigenvalue problem: XX⊤u = λu
Notice that u = Xα = Σ_{i=1}^n αi xi for some weights α

Principal component analysis (PCA) / Kernel PCA 27

slide-97
SLIDE 97

Towards kernels

Representer theorem: the PCA solution is a linear combination of the xi
Why? Recall the PCA eigenvalue problem: XX⊤u = λu
Notice that u = Xα = Σ_{i=1}^n αi xi for some weights α

Analogy with SVMs: weight vector w = Xα

Principal component analysis (PCA) / Kernel PCA 27

slide-98
SLIDE 98

Towards kernels

Representer theorem: the PCA solution is a linear combination of the xi
Why? Recall the PCA eigenvalue problem: XX⊤u = λu
Notice that u = Xα = Σ_{i=1}^n αi xi for some weights α
Analogy with SVMs: weight vector w = Xα
Key fact: PCA only needs inner products K = X⊤X

Principal component analysis (PCA) / Kernel PCA 27

slide-99
SLIDE 99

Towards kernels

Representer theorem: the PCA solution is a linear combination of the xi
Why? Recall the PCA eigenvalue problem: XX⊤u = λu
Notice that u = Xα = Σ_{i=1}^n αi xi for some weights α
Analogy with SVMs: weight vector w = Xα
Key fact: PCA only needs inner products K = X⊤X
Why? Use the representer theorem on the PCA objective:
max_{‖u‖=1} u⊤XX⊤u = max_{α⊤X⊤Xα=1} α⊤(X⊤X)(X⊤X)α

Principal component analysis (PCA) / Kernel PCA 27

slide-100
SLIDE 100

Kernel PCA

Kernel function: k(x1, x2) such that K, the kernel matrix formed by Kij = k(xi, xj), is positive semi-definite

Principal component analysis (PCA) / Kernel PCA 28

slide-101
SLIDE 101

Kernel PCA

Kernel function: k(x1, x2) such that K, the kernel matrix formed by Kij = k(xi, xj), is positive semi-definite
Examples:
Linear kernel: k(x1, x2) = x1⊤x2

Principal component analysis (PCA) / Kernel PCA 28

slide-102
SLIDE 102

Kernel PCA

Kernel function: k(x1, x2) such that K, the kernel matrix formed by Kij = k(xi, xj), is positive semi-definite
Examples:
Linear kernel: k(x1, x2) = x1⊤x2
Polynomial kernel: k(x1, x2) = (1 + x1⊤x2)²

Principal component analysis (PCA) / Kernel PCA 28

slide-103
SLIDE 103

Kernel PCA

Kernel function: k(x1, x2) such that K, the kernel matrix formed by Kij = k(xi, xj), is positive semi-definite
Examples:
Linear kernel: k(x1, x2) = x1⊤x2
Polynomial kernel: k(x1, x2) = (1 + x1⊤x2)²
Gaussian (RBF) kernel: k(x1, x2) = e^{−‖x1−x2‖²}

Principal component analysis (PCA) / Kernel PCA 28

slide-104
SLIDE 104

Kernel PCA

Kernel function: k(x1, x2) such that K, the kernel matrix formed by Kij = k(xi, xj), is positive semi-definite
Examples:
Linear kernel: k(x1, x2) = x1⊤x2
Polynomial kernel: k(x1, x2) = (1 + x1⊤x2)²
Gaussian (RBF) kernel: k(x1, x2) = e^{−‖x1−x2‖²}
Treat data points x as black boxes, only access via k

Principal component analysis (PCA) / Kernel PCA 28

slide-105
SLIDE 105

Kernel PCA

Kernel function: k(x1, x2) such that K, the kernel matrix formed by Kij = k(xi, xj), is positive semi-definite
Examples:
Linear kernel: k(x1, x2) = x1⊤x2
Polynomial kernel: k(x1, x2) = (1 + x1⊤x2)²
Gaussian (RBF) kernel: k(x1, x2) = e^{−‖x1−x2‖²}
Treat data points x as black boxes, only access via k
k intuitively measures “similarity” between two inputs

Principal component analysis (PCA) / Kernel PCA 28

slide-106
SLIDE 106

Kernel PCA

Kernel function: k(x1, x2) such that K, the kernel matrix formed by Kij = k(xi, xj), is positive semi-definite
Examples:
Linear kernel: k(x1, x2) = x1⊤x2
Polynomial kernel: k(x1, x2) = (1 + x1⊤x2)²
Gaussian (RBF) kernel: k(x1, x2) = e^{−‖x1−x2‖²}
Treat data points x as black boxes, only access via k
k intuitively measures “similarity” between two inputs
Mercer’s theorem (using kernels is sensible): there exists a high-dimensional feature space φ such that k(x1, x2) = φ(x1)⊤φ(x2) (like the quick solution earlier!)

Principal component analysis (PCA) / Kernel PCA 28

slide-107
SLIDE 107

Solving kernel PCA

Direct method:
Kernel PCA objective: max_{α⊤Kα=1} α⊤K²α

Principal component analysis (PCA) / Kernel PCA 29

slide-108
SLIDE 108

Solving kernel PCA

Direct method:
Kernel PCA objective: max_{α⊤Kα=1} α⊤K²α
⇒ kernel PCA eigenvalue problem: X⊤Xα = λ′α

Principal component analysis (PCA) / Kernel PCA 29

slide-109
SLIDE 109

Solving kernel PCA

Direct method:
Kernel PCA objective: max_{α⊤Kα=1} α⊤K²α
⇒ kernel PCA eigenvalue problem: X⊤Xα = λ′α
Modular method (if you don’t want to think about kernels):
Find vectors x′1, . . . , x′n such that x′i⊤x′j = Kij = φ(xi)⊤φ(xj)

Principal component analysis (PCA) / Kernel PCA 29

slide-110
SLIDE 110

Solving kernel PCA

Direct method:
Kernel PCA objective: max_{α⊤Kα=1} α⊤K²α
⇒ kernel PCA eigenvalue problem: X⊤Xα = λ′α
Modular method (if you don’t want to think about kernels):
Find vectors x′1, . . . , x′n such that x′i⊤x′j = Kij = φ(xi)⊤φ(xj)

Key: use any vectors that preserve inner products

Principal component analysis (PCA) / Kernel PCA 29

slide-111
SLIDE 111

Solving kernel PCA

Direct method:
Kernel PCA objective: max_{α⊤Kα=1} α⊤K²α
⇒ kernel PCA eigenvalue problem: X⊤Xα = λ′α
Modular method (if you don’t want to think about kernels):
Find vectors x′1, . . . , x′n such that x′i⊤x′j = Kij = φ(xi)⊤φ(xj)
Key: use any vectors that preserve inner products
One possibility is the Cholesky decomposition K = X′⊤X′

Principal component analysis (PCA) / Kernel PCA 29
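
A minimal sketch of the direct method, assuming the feature space is centered (centering of K is omitted for brevity): solve the eigenvalue problem on K, rescale α so that u = Xα has unit norm in feature space, and read off the projections of the training points. The function names are mine, not from the slides.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # X holds one data point per column (d x n), matching the slides
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.exp(-gamma * d2)

def kernel_pca(K, k):
    eigvals, A = np.linalg.eigh(K)             # ascending order
    eigvals = eigvals[::-1][:k]                # top-k eigenvalues
    A = A[:, ::-1][:, :k]                      # corresponding alphas
    A = A / np.sqrt(eigvals)                   # so that u_j = X alpha_j has unit norm
    Z = K @ A                                  # n x k projections of training points
    return Z, A

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
Z, A = kernel_pca(rbf_kernel(X), k=2)
```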

slide-112
SLIDE 112

Roadmap

  • Principal component analysis (PCA)

– Basic principles – Case studies – Kernel PCA – Probabilistic PCA

  • Canonical correlation analysis (CCA)
  • Fisher discriminant analysis (FDA)
  • Summary

Principal component analysis (PCA) / Probabilistic PCA 30

slide-113
SLIDE 113

Probabilistic modeling

So far, deal with objective functions: min_U f(X, U)

Principal component analysis (PCA) / Probabilistic PCA 31

slide-114
SLIDE 114

Probabilistic modeling

So far, deal with objective functions: min_U f(X, U)
Probabilistic modeling: max_U p(X | U)

Principal component analysis (PCA) / Probabilistic PCA 31

slide-115
SLIDE 115

Probabilistic modeling

So far, deal with objective functions: min_U f(X, U)
Probabilistic modeling: max_U p(X | U)

Invent a generative story of how data X arose Play detective: infer parameters U that produced X

Principal component analysis (PCA) / Probabilistic PCA 31

slide-116
SLIDE 116

Probabilistic modeling

So far, deal with objective functions: min_U f(X, U)
Probabilistic modeling: max_U p(X | U)

Invent a generative story of how data X arose Play detective: infer parameters U that produced X Advantages:

  • Model reports estimates of uncertainty
  • Natural way to handle missing data

Principal component analysis (PCA) / Probabilistic PCA 31

slide-117
SLIDE 117

Probabilistic modeling

So far, deal with objective functions: min_U f(X, U)
Probabilistic modeling: max_U p(X | U)

Invent a generative story of how data X arose Play detective: infer parameters U that produced X Advantages:

  • Model reports estimates of uncertainty
  • Natural way to handle missing data
  • Natural way to introduce prior knowledge
  • Natural way to incorporate in a larger model

Principal component analysis (PCA) / Probabilistic PCA 31

slide-118
SLIDE 118

Probabilistic modeling

So far, deal with objective functions: min_U f(X, U)
Probabilistic modeling: max_U p(X | U)

Invent a generative story of how data X arose Play detective: infer parameters U that produced X Advantages:

  • Model reports estimates of uncertainty
  • Natural way to handle missing data
  • Natural way to introduce prior knowledge
  • Natural way to incorporate in a larger model

Example from last lecture: k-means ⇒ GMMs

Principal component analysis (PCA) / Probabilistic PCA 31

slide-119
SLIDE 119

Probabilistic PCA

Generative story [Tipping and Bishop, 1999]: For each data point i = 1, . . . , n:

Principal component analysis (PCA) / Probabilistic PCA 32

slide-120
SLIDE 120

Probabilistic PCA

Generative story [Tipping and Bishop, 1999]: For each data point i = 1, . . . , n: Draw the latent vector: zi ∼ N(0, Ik×k)

Principal component analysis (PCA) / Probabilistic PCA 32

slide-121
SLIDE 121

Probabilistic PCA

Generative story [Tipping and Bishop, 1999]: For each data point i = 1, . . . , n: Draw the latent vector: zi ∼ N(0, Ik×k) Create the data point: xi ∼ N(Uzi, σ2Id×d)

Principal component analysis (PCA) / Probabilistic PCA 32

slide-122
SLIDE 122

Probabilistic PCA

Generative story [Tipping and Bishop, 1999]: For each data point i = 1, . . . , n: Draw the latent vector: zi ∼ N(0, Ik×k) Create the data point: xi ∼ N(Uzi, σ2Id×d) PCA finds the U that maximizes the likelihood of the data

Principal component analysis (PCA) / Probabilistic PCA 32

slide-123
SLIDE 123

Probabilistic PCA

Generative story [Tipping and Bishop, 1999]: For each data point i = 1, . . . , n: Draw the latent vector: zi ∼ N(0, Ik×k) Create the data point: xi ∼ N(Uzi, σ2Id×d) PCA finds the U that maximizes the likelihood of the data Advantages:

  • Handles missing data (important for collaborative

filtering)

Principal component analysis (PCA) / Probabilistic PCA 32

slide-124
SLIDE 124

Probabilistic PCA

Generative story [Tipping and Bishop, 1999]: For each data point i = 1, . . . , n: Draw the latent vector: zi ∼ N(0, Ik×k) Create the data point: xi ∼ N(Uzi, σ2Id×d) PCA finds the U that maximizes the likelihood of the data Advantages:

  • Handles missing data (important for collaborative

filtering)

  • Extension to factor analysis: allow non-isotropic noise

(replace σ2Id×d with arbitrary diagonal matrix)

Principal component analysis (PCA) / Probabilistic PCA 32
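
A tiny sketch of the generative story above, sampling synthetic data from z ∼ N(0, I) and x ∼ N(Uz, σ²I); the parameter values are arbitrary and U here is random rather than learned by maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n, sigma = 20, 3, 1000, 0.1
U = rng.normal(size=(d, k))                    # parameters (would be fit by maximum likelihood)

Z = rng.normal(size=(k, n))                    # latent vectors z_i ~ N(0, I_k)
X = U @ Z + sigma * rng.normal(size=(d, n))    # data points x_i ~ N(U z_i, sigma^2 I_d)
```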

slide-125
SLIDE 125

Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is bad model as it allows negative counts; pLSA fixes this

Principal component analysis (PCA) / Probabilistic PCA 33

slide-126
SLIDE 126

Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is bad model as it allows negative counts; pLSA fixes this Generative story for pLSA [Hofmann, 1999]: For each document i = 1, . . . , n:

Principal component analysis (PCA) / Probabilistic PCA 33

slide-127
SLIDE 127

Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is bad model as it allows negative counts; pLSA fixes this Generative story for pLSA [Hofmann, 1999]: For each document i = 1, . . . , n: Repeat M times (number of word tokens in document):

Principal component analysis (PCA) / Probabilistic PCA 33

slide-128
SLIDE 128

Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is bad model as it allows negative counts; pLSA fixes this Generative story for pLSA [Hofmann, 1999]: For each document i = 1, . . . , n: Repeat M times (number of word tokens in document): Draw a latent topic: z ∼ p(z | i)

Principal component analysis (PCA) / Probabilistic PCA 33

slide-129
SLIDE 129

Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is bad model as it allows negative counts; pLSA fixes this Generative story for pLSA [Hofmann, 1999]: For each document i = 1, . . . , n: Repeat M times (number of word tokens in document): Draw a latent topic: z ∼ p(z | i) Choose the word token: x ∼ p(x | z)

Principal component analysis (PCA) / Probabilistic PCA 33

slide-130
SLIDE 130

Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is bad model as it allows negative counts; pLSA fixes this Generative story for pLSA [Hofmann, 1999]: For each document i = 1, . . . , n: Repeat M times (number of word tokens in document): Draw a latent topic: z ∼ p(z | i) Choose the word token: x ∼ p(x | z) Set xji to be the number of times word j was chosen

Principal component analysis (PCA) / Probabilistic PCA 33

slide-131
SLIDE 131

Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is a bad model as it allows negative counts; pLSA fixes this
Generative story for pLSA [Hofmann, 1999]:
For each document i = 1, . . . , n:
  Repeat M times (number of word tokens in the document):
    Draw a latent topic: z ∼ p(z | i)
    Choose the word token: x ∼ p(x | z)
  Set xji to be the number of times word j was chosen
Learning using Hard EM (analog of k-means):
  E-step: fix parameters, choose best topics
  M-step: fix topics, optimize parameters
More sophisticated methods: EM, Latent Dirichlet Allocation

Principal component analysis (PCA) / Probabilistic PCA 33

slide-132
SLIDE 132

Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is a bad model as it allows negative counts; pLSA fixes this
Generative story for pLSA [Hofmann, 1999]:
For each document i = 1, . . . , n:
  Repeat M times (number of word tokens in the document):
    Draw a latent topic: z ∼ p(z | i)
    Choose the word token: x ∼ p(x | z)
  Set xji to be the number of times word j was chosen
Learning using Hard EM (analog of k-means):
  E-step: fix parameters, choose best topics
  M-step: fix topics, optimize parameters
More sophisticated methods: EM, Latent Dirichlet Allocation
Comparison to a mixture model for clustering:
  Mixture model: assume a single topic for the entire document
  pLSA: allow multiple topics per document

Principal component analysis (PCA) / Probabilistic PCA 33

slide-133
SLIDE 133

Roadmap

  • Principal component analysis (PCA)

– Basic principles – Case studies – Kernel PCA – Probabilistic PCA

  • Canonical correlation analysis (CCA)
  • Fisher discriminant analysis (FDA)
  • Summary

Canonical correlation analysis (CCA) 34

slide-134
SLIDE 134

Motivation for CCA [Hotelling, 1936]

Often, each data point consists of two views:

  • Image retrieval: for each image, have the following:

– x: Pixels (or other visual features) – y: Text around the image

Canonical correlation analysis (CCA) 35

slide-135
SLIDE 135

Motivation for CCA [Hotelling, 1936]

Often, each data point consists of two views:

  • Image retrieval: for each image, have the following:

– x: Pixels (or other visual features) – y: Text around the image

  • Time series:

– x: Signal at time t – y: Signal at time t + 1

Canonical correlation analysis (CCA) 35

slide-136
SLIDE 136

Motivation for CCA [Hotelling, 1936]

Often, each data point consists of two views:

  • Image retrieval: for each image, have the following:

– x: Pixels (or other visual features) – y: Text around the image

  • Time series:

– x: Signal at time t – y: Signal at time t + 1

  • Two-view learning: divide features into two sets

– x: Features of a word/object, etc. – y: Features of the context in which it appears

Canonical correlation analysis (CCA) 35

slide-137
SLIDE 137

Motivation for CCA [Hotelling, 1936]

Often, each data point consists of two views:

  • Image retrieval: for each image, have the following:

– x: Pixels (or other visual features) – y: Text around the image

  • Time series:

– x: Signal at time t – y: Signal at time t + 1

  • Two-view learning: divide features into two sets

– x: Features of a word/object, etc. – y: Features of the context in which it appears Goal: reduce the dimensionality of the two views jointly

Canonical correlation analysis (CCA) 35

slide-138
SLIDE 138

An example

Setup: Input data: (x1, y1), . . . , (xn, yn) (matrices X, Y) Goal: find pair of projections (u, v)

Canonical correlation analysis (CCA) 36

slide-139
SLIDE 139

An example

Setup: Input data: (x1, y1), . . . , (xn, yn) (matrices X, Y) Goal: find pair of projections (u, v) In figure, x and y are paired by brightness

Canonical correlation analysis (CCA) 36

slide-140
SLIDE 140

An example

Setup: Input data: (x1, y1), . . . , (xn, yn) (matrices X, Y) Goal: find pair of projections (u, v) In figure, x and y are paired by brightness Dimensionality reduction solutions: Independent

Canonical correlation analysis (CCA) 36

slide-141
SLIDE 141

An example

Setup: Input data: (x1, y1), . . . , (xn, yn) (matrices X, Y) Goal: find pair of projections (u, v) In figure, x and y are paired by brightness Dimensionality reduction solutions: Independent Joint

Canonical correlation analysis (CCA) 36

slide-142
SLIDE 142

From PCA to CCA

PCA on views separately: no covariance term

max_{u,v}  u⊤XX⊤u / (u⊤u) + v⊤YY⊤v / (v⊤v)

Canonical correlation analysis (CCA) 37

slide-143
SLIDE 143

From PCA to CCA

PCA on views separately: no covariance term

max_{u,v}  u⊤XX⊤u / (u⊤u) + v⊤YY⊤v / (v⊤v)

PCA on concatenation (X⊤, Y⊤)⊤: includes covariance term
max_{u,v}  (u⊤XX⊤u + 2u⊤XY⊤v + v⊤YY⊤v) / (u⊤u + v⊤v)

Canonical correlation analysis (CCA) 37

slide-144
SLIDE 144

From PCA to CCA

PCA on views separately: no covariance term

max_{u,v}  u⊤XX⊤u / (u⊤u) + v⊤YY⊤v / (v⊤v)

PCA on concatenation (X⊤, Y⊤)⊤: includes covariance term
max_{u,v}  (u⊤XX⊤u + 2u⊤XY⊤v + v⊤YY⊤v) / (u⊤u + v⊤v)

Maximum covariance: drop variance terms
max_{u,v}  u⊤XY⊤v / (√(u⊤u) √(v⊤v))

Canonical correlation analysis (CCA) 37

slide-145
SLIDE 145

From PCA to CCA

PCA on views separately: no covariance term

max_{u,v}  u⊤XX⊤u / (u⊤u) + v⊤YY⊤v / (v⊤v)

PCA on concatenation (X⊤, Y⊤)⊤: includes covariance term
max_{u,v}  (u⊤XX⊤u + 2u⊤XY⊤v + v⊤YY⊤v) / (u⊤u + v⊤v)

Maximum covariance: drop variance terms
max_{u,v}  u⊤XY⊤v / (√(u⊤u) √(v⊤v))

Maximum correlation (CCA): divide out variance terms
max_{u,v}  u⊤XY⊤v / (√(u⊤XX⊤u) √(v⊤YY⊤v))

Canonical correlation analysis (CCA) 37

slide-146
SLIDE 146

Canonical correlation analysis (CCA)

Definitions: Variance: var(u⊤x) = u⊤XX⊤u

Canonical correlation analysis (CCA) 38

slide-147
SLIDE 147

Canonical correlation analysis (CCA)

Definitions: Variance: var(u⊤x) = u⊤XX⊤u Covariance: cov(u⊤x, v⊤y) = u⊤XY⊤v

Canonical correlation analysis (CCA) 38

slide-148
SLIDE 148

Canonical correlation analysis (CCA)

Definitions:
Variance: var(u⊤x) = u⊤XX⊤u
Covariance: cov(u⊤x, v⊤y) = u⊤XY⊤v
Correlation: corr(u⊤x, v⊤y) = cov(u⊤x, v⊤y) / (√var(u⊤x) √var(v⊤y))

Canonical correlation analysis (CCA) 38

slide-149
SLIDE 149

Canonical correlation analysis (CCA)

Definitions:
Variance: var(u⊤x) = u⊤XX⊤u
Covariance: cov(u⊤x, v⊤y) = u⊤XY⊤v
Correlation: corr(u⊤x, v⊤y) = cov(u⊤x, v⊤y) / (√var(u⊤x) √var(v⊤y))
Objective: maximize correlation between projected views
max_{u,v} corr(u⊤x, v⊤y)

Canonical correlation analysis (CCA) 38

slide-150
SLIDE 150

Canonical correlation analysis (CCA)

Definitions:
Variance: var(u⊤x) = u⊤XX⊤u
Covariance: cov(u⊤x, v⊤y) = u⊤XY⊤v
Correlation: corr(u⊤x, v⊤y) = cov(u⊤x, v⊤y) / (√var(u⊤x) √var(v⊤y))
Objective: maximize correlation between projected views
max_{u,v} corr(u⊤x, v⊤y)
Properties:

  • Focus on how variables are related, not how much they vary

Canonical correlation analysis (CCA) 38

slide-151
SLIDE 151

Canonical correlation analysis (CCA)

Definitions:
Variance: var(u⊤x) = u⊤XX⊤u
Covariance: cov(u⊤x, v⊤y) = u⊤XY⊤v
Correlation: corr(u⊤x, v⊤y) = cov(u⊤x, v⊤y) / (√var(u⊤x) √var(v⊤y))
Objective: maximize correlation between projected views
max_{u,v} corr(u⊤x, v⊤y)
Properties:

  • Focus on how variables are related, not how much they vary
  • Invariant to any rotation and scaling of data

Canonical correlation analysis (CCA) 38

slide-152
SLIDE 152

Canonical correlation analysis (CCA)

Definitions:
Variance: var(u⊤x) = u⊤XX⊤u
Covariance: cov(u⊤x, v⊤y) = u⊤XY⊤v
Correlation: corr(u⊤x, v⊤y) = cov(u⊤x, v⊤y) / (√var(u⊤x) √var(v⊤y))
Objective: maximize correlation between projected views
max_{u,v} corr(u⊤x, v⊤y)
Properties:

  • Focus on how variables are related, not how much they vary
  • Invariant to any rotation and scaling of data

Solved via a generalized eigenvalue problem (Aw = λBw)

Canonical correlation analysis (CCA) 38

slide-153
SLIDE 153

Regularization is important

Extreme examples of degeneracy:

  • If x = Ay, then any (u, v) with u = Av is optimal

(correlation 1)

Canonical correlation analysis (CCA) 41

slide-154
SLIDE 154

Regularization is important

Extreme examples of degeneracy:

  • If x = Ay, then any (u, v) with u = Av is optimal

(correlation 1)

  • If x and y are independent, then any (u, v) is optimal

(correlation 0)

Canonical correlation analysis (CCA) 41

slide-155
SLIDE 155

Regularization is important

Extreme examples of degeneracy:

  • If x = Ay, then any (u, v) with u = Av is optimal

(correlation 1)

  • If x and y are independent, then any (u, v) is optimal

(correlation 0) Problem: if X or Y has rank n, then any (u, v) is optimal (correlation 1) with u = X†⊤Yv ⇒ CCA is meaningless!

Canonical correlation analysis (CCA) 41

slide-156
SLIDE 156

Regularization is important

Extreme examples of degeneracy:

  • If x = Ay, then any (u, v) with u = Av is optimal

(correlation 1)

  • If x and y are independent, then any (u, v) is optimal

(correlation 0)

Problem: if X or Y has rank n, then any (u, v) is optimal (correlation 1) with u = X†⊤Yv ⇒ CCA is meaningless!
Solution: regularization (interpolate between maximum covariance and maximum correlation)
max_{u,v}  u⊤XY⊤v / (√(u⊤(XX⊤ + λI)u) √(v⊤(YY⊤ + λI)v))

Canonical correlation analysis (CCA) 41
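
One standard way to solve the regularized objective above is as a generalized eigenvalue problem Aw = λBw with w = (u; v), which is also how an earlier slide says CCA is solved. A hedged SciPy sketch under that formulation (the block structure and variable names are my own):

```python
import numpy as np
from scipy.linalg import eigh

def regularized_cca(X, Y, lam=1e-3):
    """Top regularized CCA direction pair for centered views X (dx x n), Y (dy x n)."""
    dx, dy = X.shape[0], Y.shape[0]
    Cxy = X @ Y.T
    A = np.block([[np.zeros((dx, dx)), Cxy],
                  [Cxy.T, np.zeros((dy, dy))]])
    B = np.block([[X @ X.T + lam * np.eye(dx), np.zeros((dx, dy))],
                  [np.zeros((dy, dx)), Y @ Y.T + lam * np.eye(dy)]])
    rho, W = eigh(A, B)            # generalized eigenvalue problem A w = rho B w
    w = W[:, -1]                   # direction with the largest correlation
    return w[:dx], w[dx:], rho[-1]

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 100)), rng.normal(size=(4, 100))
X, Y = X - X.mean(1, keepdims=True), Y - Y.mean(1, keepdims=True)
u, v, rho = regularized_cca(X, Y)
```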

slide-157
SLIDE 157

Kernel CCA

Two kernels: kx and ky

Canonical correlation analysis (CCA) 42

slide-158
SLIDE 158

Kernel CCA

Two kernels: kx and ky Direct method: (some math)

Canonical correlation analysis (CCA) 42

slide-159
SLIDE 159

Kernel CCA

Two kernels: kx and ky Direct method: (some math) Modular method:

  • 1. Transform xi into x′i ∈ Rn satisfying k(xi, xj) = x′i⊤x′j (do same for y)

Canonical correlation analysis (CCA) 42

slide-160
SLIDE 160

Kernel CCA

Two kernels: kx and ky Direct method: (some math) Modular method:

  • 1. Transform xi into x′i ∈ Rn satisfying k(xi, xj) = x′i⊤x′j (do same for y)

  • 2. Perform regular CCA

Canonical correlation analysis (CCA) 42

slide-161
SLIDE 161

Kernel CCA

Two kernels: kx and ky Direct method: (some math) Modular method:

  • 1. Transform xi into x′i ∈ Rn satisfying k(xi, xj) = x′i⊤x′j (do same for y)

  • 2. Perform regular CCA

Regularization is especially important for kernel CCA!

Canonical correlation analysis (CCA) 42

slide-162
SLIDE 162

Roadmap

  • Principal component analysis (PCA)

– Basic principles – Case studies – Kernel PCA – Probabilistic PCA

  • Canonical correlation analysis (CCA)
  • Fisher discriminant analysis (FDA)
  • Summary

Fisher discriminant analysis (FDA) 43

slide-163
SLIDE 163

Motivation for FDA [Fisher, 1936]

What is the best linear projection?

Fisher discriminant analysis (FDA) 44

slide-164
SLIDE 164

Motivation for FDA [Fisher, 1936]

What is the best linear projection? PCA solution

Fisher discriminant analysis (FDA) 44

slide-165
SLIDE 165

Motivation for FDA [Fisher, 1936]

What is the best linear projection with these labels? PCA solution

Fisher discriminant analysis (FDA) 44

slide-166
SLIDE 166

Motivation for FDA [Fisher, 1936]

What is the best linear projection with these labels? PCA solution FDA solution

Fisher discriminant analysis (FDA) 44

slide-167
SLIDE 167

Motivation for FDA [Fisher, 1936]

What is the best linear projection with these labels? PCA solution FDA solution Goal: reduce the dimensionality given labels Idea: want projection to maximize overall interclass variance relative to intraclass variance

Fisher discriminant analysis (FDA) 44

slide-168
SLIDE 168

Motivation for FDA [Fisher, 1936]

What is the best linear projection with these labels? (figures: PCA solution vs. FDA solution)
Goal: reduce the dimensionality given labels
Idea: want the projection to maximize overall interclass variance relative to intraclass variance
Linear classifiers (logistic regression, SVMs) have a similar feel: find a one-dimensional subspace w, e.g., to maximize the margin between different classes

Fisher discriminant analysis (FDA) 44

slide-169
SLIDE 169

Motivation for FDA [Fisher, 1936]

What is the best linear projection with these labels? (figures: PCA solution vs. FDA solution)
Goal: reduce the dimensionality given labels
Idea: want the projection to maximize overall interclass variance relative to intraclass variance
Linear classifiers (logistic regression, SVMs) have a similar feel: find a one-dimensional subspace w, e.g., to maximize the margin between different classes
FDA handles multiple classes, allows multiple dimensions

Fisher discriminant analysis (FDA) 44

slide-170
SLIDE 170

FDA objective function

Setup: xi ∈ Rd, yi ∈ {1, . . . , m}, for i = 1, . . . , n

Fisher discriminant analysis (FDA) 45

slide-171
SLIDE 171

FDA objective function

Setup: xi ∈ Rd, yi ∈ {1, . . . , m}, for i = 1, . . . , n
Objective: maximize (interclass variance) / (intraclass variance) = (total variance) / (intraclass variance) − 1

Fisher discriminant analysis (FDA) 45

slide-172
SLIDE 172

FDA objective function

Setup: xi ∈ Rd, yi ∈ {1, . . . , m}, for i = 1, . . . , n
Objective: maximize (interclass variance) / (intraclass variance) = (total variance) / (intraclass variance) − 1
Total variance: (1/n) Σi (u⊤(xi − µ))²
Mean of all points: µ = (1/n) Σi xi

Fisher discriminant analysis (FDA) 45

slide-173
SLIDE 173

FDA objective function

Setup: xi ∈ Rd, yi ∈ {1, . . . , m}, for i = 1, . . . , n
Objective: maximize (interclass variance) / (intraclass variance) = (total variance) / (intraclass variance) − 1
Total variance: (1/n) Σi (u⊤(xi − µ))²
Mean of all points: µ = (1/n) Σi xi
Intraclass variance: (1/n) Σi (u⊤(xi − µ_yi))²
Mean of points in class y: µy = (1/|{i : yi = y}|) Σ_{i : yi = y} xi

Fisher discriminant analysis (FDA) 45

slide-174
SLIDE 174

FDA objective function

Setup: xi ∈ Rd, yi ∈ {1, . . . , m}, for i = 1, . . . , n
Objective: maximize (interclass variance) / (intraclass variance) = (total variance) / (intraclass variance) − 1
Total variance: (1/n) Σi (u⊤(xi − µ))²
Mean of all points: µ = (1/n) Σi xi
Intraclass variance: (1/n) Σi (u⊤(xi − µ_yi))²
Mean of points in class y: µy = (1/|{i : yi = y}|) Σ_{i : yi = y} xi

Reduces to a generalized eigenvalue problem.

Fisher discriminant analysis (FDA) 45

slide-175
SLIDE 175

FDA objective function

Setup: xi ∈ Rd, yi ∈ {1, . . . , m}, for i = 1, . . . , n
Objective: maximize (interclass variance) / (intraclass variance) = (total variance) / (intraclass variance) − 1
Total variance: (1/n) Σi (u⊤(xi − µ))²
Mean of all points: µ = (1/n) Σi xi
Intraclass variance: (1/n) Σi (u⊤(xi − µ_yi))²
Mean of points in class y: µy = (1/|{i : yi = y}|) Σ_{i : yi = y} xi

Reduces to a generalized eigenvalue problem. Kernel FDA: use modular method

Fisher discriminant analysis (FDA) 45
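
As the slide notes, the objective reduces to a generalized eigenvalue problem; here is a hedged SciPy sketch that maximizes (total variance)/(intraclass variance) directly. The small ridge term is my addition to keep the within-class matrix invertible; names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def fda_directions(X, y, k=1, ridge=1e-6):
    """Top-k FDA directions for data X (d x n) with integer labels y (length n)."""
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    S_total = Xc @ Xc.T / n                                  # total variance matrix
    S_within = np.zeros((d, d))
    for c in np.unique(y):
        Xcls = X[:, y == c]
        Xcls = Xcls - Xcls.mean(axis=1, keepdims=True)
        S_within += Xcls @ Xcls.T / n                        # intraclass variance matrix
    vals, vecs = eigh(S_total, S_within + ridge * np.eye(d)) # generalized eigenproblem
    return vecs[:, ::-1][:, :k]                              # directions with largest ratio

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(10, 150)) + 2.0 * y                     # class-dependent shift
u = fda_directions(X, y, k=2)
```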

slide-176
SLIDE 176

Other linear methods

Random projections: Randomly project data onto k = O(log n) dimensions

Fisher discriminant analysis (FDA) 47

slide-177
SLIDE 177

Other linear methods

Random projections:
Randomly project data onto k = O(log n) dimensions
All pairwise distances preserved with high probability: ‖U⊤xi − U⊤xj‖² ≅ ‖xi − xj‖² for all i, j

Fisher discriminant analysis (FDA) 47

slide-178
SLIDE 178

Other linear methods

Random projections:
Randomly project data onto k = O(log n) dimensions
All pairwise distances preserved with high probability: ‖U⊤xi − U⊤xj‖² ≅ ‖xi − xj‖² for all i, j
Trivial to implement

Fisher discriminant analysis (FDA) 47
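
A minimal sketch of a Gaussian random projection with k = O(log n); the constant 8 and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 500
X = rng.normal(size=(d, n))

k = int(np.ceil(8 * np.log(n)))                  # target dimension k = O(log n)
U = rng.normal(size=(d, k)) / np.sqrt(k)         # random projection matrix
Z = U.T @ X                                      # k x n projected data

# Distances are approximately preserved (Johnson-Lindenstrauss)
i, j = 0, 1
print(np.linalg.norm(X[:, i] - X[:, j]), np.linalg.norm(Z[:, i] - Z[:, j]))
```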

slide-179
SLIDE 179

Other linear methods

Random projections:
Randomly project data onto k = O(log n) dimensions
All pairwise distances preserved with high probability: ‖U⊤xi − U⊤xj‖² ≅ ‖xi − xj‖² for all i, j
Trivial to implement
Kernel dimensionality reduction:
One type of sufficient dimensionality reduction
Find subspace that contains all information about labels

Fisher discriminant analysis (FDA) 47

slide-180
SLIDE 180

Other linear methods

Random projections:
Randomly project data onto k = O(log n) dimensions
All pairwise distances preserved with high probability: ‖U⊤xi − U⊤xj‖² ≅ ‖xi − xj‖² for all i, j
Trivial to implement
Kernel dimensionality reduction:
One type of sufficient dimensionality reduction
Find subspace that contains all information about labels: y ⊥⊥ x | U⊤x

Fisher discriminant analysis (FDA) 47

slide-181
SLIDE 181

Other linear methods

Random projections:
Randomly project data onto k = O(log n) dimensions
All pairwise distances preserved with high probability: ‖U⊤xi − U⊤xj‖² ≅ ‖xi − xj‖² for all i, j
Trivial to implement
Kernel dimensionality reduction:
One type of sufficient dimensionality reduction
Find subspace that contains all information about labels: y ⊥⊥ x | U⊤x
Capturing information is stronger than capturing variance

Fisher discriminant analysis (FDA) 47

slide-182
SLIDE 182

Other linear methods

Random projections:
Randomly project data onto k = O(log n) dimensions
All pairwise distances preserved with high probability: ‖U⊤xi − U⊤xj‖² ≅ ‖xi − xj‖² for all i, j
Trivial to implement
Kernel dimensionality reduction:
One type of sufficient dimensionality reduction
Find subspace that contains all information about labels: y ⊥⊥ x | U⊤x
Capturing information is stronger than capturing variance
Hard nonconvex optimization problem

Fisher discriminant analysis (FDA) 47

slide-183
SLIDE 183

Summary

Framework: z = U⊤x, x ≅ Uz

Fisher discriminant analysis (FDA) 48

slide-184
SLIDE 184

Summary

Framework: z = U⊤x, x ≅ Uz
Criteria for choosing U:

  • PCA: maximize projected variance
  • CCA: maximize projected correlation
  • FDA: maximize projected (interclass variance) / (intraclass variance)

Fisher discriminant analysis (FDA) 48

slide-185
SLIDE 185

Summary

Framework: z = U⊤x, x ≅ Uz
Criteria for choosing U:

  • PCA: maximize projected variance
  • CCA: maximize projected correlation
  • FDA: maximize projected (interclass variance) / (intraclass variance)

Algorithm: generalized eigenvalue problem

Fisher discriminant analysis (FDA) 48

slide-186
SLIDE 186

Summary

Framework: z = U⊤x, x ≅ Uz
Criteria for choosing U:

  • PCA: maximize projected variance
  • CCA: maximize projected correlation
  • FDA: maximize projected (interclass variance) / (intraclass variance)

Algorithm: generalized eigenvalue problem
Extensions: non-linear using kernels (using the same linear framework); probabilistic, sparse, robust (hard optimization)

Fisher discriminant analysis (FDA) 48