Data Mining Techniques CS 6220 - Section 2 - Spring 2017 - Lecture 4



slide-1
SLIDE 1

Data Mining Techniques

CS 6220 - Section 2 - Spring 2017

Lecture 4

Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton
 Rasmussen & Williams, Percy Liang)

slide-2
SLIDE 2

Kernel Regression

slide-3
SLIDE 3

Basis function regression

Linear regression: y(x, w) = w^T x
Basis function regression: y(x, w) = w^T φ(x)
Polynomial regression: φ(x) = (1, x, x², ..., x^M)^T
For N samples: y = Φw, with Φ_{nj} = φ_j(x_n)

slide-4
SLIDE 4

Basis Function Regression

[Figure: polynomial basis function fit with M = 3 to data (x, t)]

slide-5
SLIDE 5

Define a kernel function k(x, x') = ⟨φ(x), φ(x')⟩ such that k can be cheaper to evaluate than φ!

The Kernel Trick


slide-7
SLIDE 7

Kernel Ridge Regression

Φ := Φ(X),  A := Φ^T Φ + λI

MAP / expected value for weights (requires inversion of a D×D matrix):
E[w|y] = A^{-1} Φ^T y

Alternate representation (requires inversion of an N×N matrix):
K := ΦΦ^T,  A^{-1}Φ^T = Φ^T (K + λI)^{-1}

Predictive posterior (using the kernel function):
E[f(x*)|y] = φ(x*)^T E[w|y] = φ(x*)^T Φ^T (K + λI)^{-1} y = Σ_{n,m} k(x*, x_n) [(K + λI)^{-1}]_{nm} y_m

slide-8
SLIDE 8

Kernel Ridge Regression

f* = argmin_{f ∈ H} Σ_{i=1}^n (y_i − ⟨f, φ(x_i)⟩_H)² + λ ‖f‖²_H

[Plots: kernel ridge regression fits with σ = 0.6 and λ = 1e−07, λ = 0.1, λ = 10]

Closed form Solution
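The closed-form solution can be sketched in a few lines of NumPy. This is an illustrative implementation (the function names are ours); the Gaussian kernel with σ = 0.6 and the λ values mirror the plots above.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.6):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def krr_fit(X, y, lam=0.1, sigma=0.6):
    # Solve (K + lambda I) alpha = y: one N x N system instead of D x D
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma=0.6):
    # E[f(x*) | y] = sum_n k(x*, x_n) alpha_n
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```

Small λ (e.g. 1e−07) interpolates the training data; large λ (e.g. 10) oversmooths, matching the three plots.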

slide-9
SLIDE 9

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Gaussian Processes

[Plots: functions drawn from the GP, output f(x) against input x]

p(y*|x*, x, y) = N( k(x*, x)^T [K + σ²_noise I]^{-1} y,
                    k(x*, x*) + σ²_noise − k(x*, x)^T [K + σ²_noise I]^{-1} k(x*, x) )

(a.k.a. Kernel Ridge Regression with variance estimates)
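The predictive mean and variance translate directly into NumPy. A hypothetical sketch: the squared-exponential kernel, lengthscale, and noise level are assumed values, not taken from the slides.

```python
import numpy as np

def k_se(a, b, ell=1.0):
    # squared-exponential kernel on 1-D inputs
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell ** 2))

def gp_posterior(x, y, x_star, ell=1.0, sigma_noise=0.1):
    # Predictive mean and variance, following the slide's formula
    K = k_se(x, x, ell)
    ks = k_se(x, x_star, ell)                  # shape (N, M)
    A = K + sigma_noise ** 2 * np.eye(len(x))  # K + sigma_noise^2 I
    mean = ks.T @ np.linalg.solve(A, y)
    var = (1.0 + sigma_noise ** 2              # k(x*, x*) = 1 for SE
           - (ks * np.linalg.solve(A, ks)).sum(axis=0))
    return mean, var
```

Far from the training inputs, the variance reverts to the prior value 1 + σ²_noise.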
slide-10
SLIDE 10

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Choosing Kernel Hyperparameters

[Plot: mean posterior predictive function for 3 different length scales (too long, about right, too short); input x vs. function value y]

The mean posterior predictive function is plotted for 3 different length scales of the covariance function k(x, x') = v² exp(−(x − x')² / (2ℓ²)) + σ²_noise δ(x, x').

Characteristic Lengthscales

slide-11
SLIDE 11

Intermezzo: Kernels

Borrowing from:
 Arthur Gretton 
 (Gatsby, UCL)

slide-12
SLIDE 12

Hilbert Spaces

Definition (Inner product). Let H be a vector space over R. A function ⟨·,·⟩_H : H × H → R is an inner product on H if

1. Linear: ⟨α₁f₁ + α₂f₂, g⟩_H = α₁⟨f₁, g⟩_H + α₂⟨f₂, g⟩_H
2. Symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H
3. ⟨f, f⟩_H ≥ 0, and ⟨f, f⟩_H = 0 if and only if f = 0.

Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H

slide-13
SLIDE 13

Example: Fourier Bases

slide-14
SLIDE 14

Example: Fourier Bases

slide-15
SLIDE 15

Example: Fourier Bases

slide-16
SLIDE 16

Example: Fourier Bases

Fourier modes define a vector space

slide-17
SLIDE 17

Kernels

Definition. Let X be a non-empty set. A function k : X × X → R is a kernel if there exists an R-Hilbert space H and a map φ : X → H such that for all x, x' ∈ X, k(x, x') := ⟨φ(x), φ(x')⟩_H.

Almost no conditions on X (e.g., X itself doesn't need an inner product; X could be a set of documents). A single kernel can correspond to several possible feature maps. A trivial example for X := R: φ₁(x) = x and φ₂(x) = (x/√2, x/√2).
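The trivial example can be checked numerically: φ₁ and φ₂ are different feature maps, but both induce the same kernel k(x, x') = x·x'. (Function names are ours, for illustration.)

```python
import numpy as np

def phi1(x):
    # feature map into R^1
    return np.array([x])

def phi2(x):
    # a different feature map, into R^2
    return np.array([x / np.sqrt(2), x / np.sqrt(2)])

def kernel(x, xp, phi):
    return float(phi(x) @ phi(xp))
```

Both `kernel(1.5, -0.7, phi1)` and `kernel(1.5, -0.7, phi2)` equal 1.5 · (−0.7) = −1.05.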

slide-18
SLIDE 18

Sums, Transformations, Products

Theorem (Sums of kernels are kernels). Given α > 0 and kernels k, k₁, k₂ on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why? — k₁ − k₂ can fail to be positive definite).

Theorem (Mappings between spaces). Let X and X̃ be sets, and define a map A : X → X̃. Define the kernel k on X̃. Then k(A(x), A(x')) is a kernel on X. Example: k(x, x') = x²(x')².

Theorem (Products of kernels are kernels). Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁ × k₂ is a kernel on X. Proof: Main idea only!

slide-19
SLIDE 19

Polynomial Kernels

Theorem (Polynomial kernels). Let x, x' ∈ R^d for d ≥ 1, let m ≥ 1 be an integer and c ≥ 0 a non-negative real. Then k(x, x') := (⟨x, x'⟩ + c)^m is a valid kernel. To prove: expand into a sum (with non-negative scalars) of kernels ⟨x, x'⟩ raised to integer powers; these individual terms are valid kernels by the product rule.
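For m = 2 and d = 2 the expansion is fully explicit: (⟨x, x'⟩ + c)² = ⟨x, x'⟩² + 2c⟨x, x'⟩ + c², the inner product of the feature vector φ(x) = (x₁², x₂², √2·x₁x₂, √(2c)·x₁, √(2c)·x₂, c). A quick numerical check (illustrative code, our own naming):

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, m=2):
    return (x @ xp + c) ** m

def phi(x, c=1.0):
    # explicit feature map for (<x, x'> + c)^2 when d = 2
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# kernel trick and explicit features agree
assert np.isclose(poly_kernel(x, xp), phi(x) @ phi(xp))
```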

slide-20
SLIDE 20

Infinite Sequences

Definition. The space ℓ² (square-summable sequences) comprises all sequences a := (a_i)_{i≥1} for which ‖a‖²_{ℓ²} = Σ_{i=1}^∞ a_i² < ∞.

Definition. Given a sequence of functions (φ_i(x))_{i≥1} in ℓ², where φ_i : X → R is the i-th coordinate of φ(x), then

k(x, x') := Σ_{i=1}^∞ φ_i(x) φ_i(x')   (1)

slide-21
SLIDE 21

Infinite Sequences

Why square-summable? By Cauchy-Schwarz,

|Σ_{i=1}^∞ φ_i(x) φ_i(x')| ≤ ‖φ(x)‖_{ℓ²} ‖φ(x')‖_{ℓ²},

so the series defining the inner product converges for all x, x' ∈ X.

slide-22
SLIDE 22

Taylor Series Kernels

Definition (Taylor series kernel). For r ∈ (0, ∞], with a_n ≥ 0 for all n ≥ 0, let

f(z) = Σ_{n=0}^∞ a_n z^n,  |z| < r, z ∈ R.

Define X to be the √r-ball in R^d, so ‖x‖ < √r. Then

k(x, x') = f(⟨x, x'⟩) = Σ_{n=0}^∞ a_n ⟨x, x'⟩^n.

Example (Exponential kernel): k(x, x') := exp(⟨x, x'⟩).

slide-23
SLIDE 23

Gaussian Kernel

Example (Gaussian kernel). The Gaussian kernel on R^d is defined as k(x, x') := exp(−γ² ‖x − x'‖²). Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.

(also known as the Radial Basis Function (RBF) kernel)

slide-24
SLIDE 24

Gaussian Kernel

The Gaussian kernel is also called the Squared Exponential (SE) kernel; the Automatic Relevance Determination (ARD) variant uses a separate lengthscale per input dimension.

slide-25
SLIDE 25

Products of Kernels

Base kernels:
- Squared-exp (SE): σ_f² exp(−(x − x')² / (2ℓ²)), plotted against x − x'
- Periodic (Per): σ_f² exp(−(2/ℓ²) sin²(π(x − x')/p)), plotted against x − x'
- Linear (Lin): σ_f² (x − c)(x' − c), plotted against x (with x' = 1)

[Plots of the products Lin × Lin, SE × Per, Lin × SE, and Lin × Per]

source: David Duvenaud (PhD Thesis)

slide-26
SLIDE 26

Positive Definiteness

Definition (Positive definite functions). A symmetric function k : X × X → R is positive definite if ∀n ≥ 1, ∀(a₁, ..., a_n) ∈ R^n, ∀(x₁, ..., x_n) ∈ X^n,

Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) ≥ 0.

The function k(·,·) is strictly positive definite if, for mutually distinct x_i, equality holds only when all the a_i are zero.
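Positive semi-definiteness of a Gram matrix can be probed numerically by checking that its eigenvalues are non-negative; a sketch using the Gaussian kernel (helper names are ours):

```python
import numpy as np

def gaussian_gram(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def is_psd(K, tol=1e-10):
    # symmetric matrix -> real eigenvalues via eigvalsh
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
K = gaussian_gram(X)
assert is_psd(K)                       # a valid kernel's Gram matrix
assert not is_psd(K - 2 * np.eye(30))  # a difference of kernels can fail
```

The second assertion illustrates the earlier remark that a difference of kernels need not be a kernel.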

slide-27
SLIDE 27

Mercer’s Theorem

Theorem. Let H be a Hilbert space, X a non-empty set and φ : X → H. Then ⟨φ(x), φ(y)⟩_H =: k(x, y) is positive definite.

Proof.

Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) = Σ_{i=1}^n Σ_{j=1}^n ⟨a_i φ(x_i), a_j φ(x_j)⟩_H = ‖Σ_{i=1}^n a_i φ(x_i)‖²_H ≥ 0.

The reverse also holds: a positive definite k(x, x') is an inner product in a unique H (Moore-Aronszajn: coming later!).

slide-28
SLIDE 28

DIMENSIONALITY REDUCTION

Borrowing from:
 Percy Liang
 (Stanford)

slide-29
SLIDE 29

Linear Dimensionality Reduction

x ∈ R^361  →  z = U^T x,  z ∈ R^10

Idea: Project a high-dimensional vector onto a lower-dimensional space
slide-30
SLIDE 30

Problem Setup

Given n data points in d dimensions: x₁, ..., x_n ∈ R^d,  X = (x₁ ⋯ x_n) ∈ R^{d×n}

Transpose of X
 used in regression!


slide-33
SLIDE 33

Problem Setup

Given n data points in d dimensions: x₁, ..., x_n ∈ R^d,  X = (x₁ ⋯ x_n) ∈ R^{d×n}
Want to reduce dimensionality from d to k
Choose k directions u₁, ..., u_k:  U = (u₁ ⋯ u_k) ∈ R^{d×k}
For each u_j, compute "similarity" z_j = u_j^T x
Project x down to z = (z₁, ..., z_k)^T = U^T x
How to choose U?

slide-34
SLIDE 34

Principal Component Analysis

x ∈ R^361  →  z = U^T x,  z ∈ R^10

Optimize two equivalent objectives:

  • 1. Minimize the reconstruction error
  • 2. Maximize the projected variance

slide-38
SLIDE 38

PCA Objective 1: Reconstruction Error

U serves two functions:

  • Encode: z = U^T x, with z_j = u_j^T x
  • Decode: x̃ = Uz = Σ_{j=1}^k z_j u_j

Want the reconstruction error ‖x − x̃‖ to be small.

Objective: minimize total squared reconstruction error

min_{U ∈ R^{d×k}} Σ_{i=1}^n ‖x_i − UU^T x_i‖²


slide-43
SLIDE 43

PCA Objective 2: Projected Variance

Empirical distribution: uniform over x₁, ..., x_n
Expectation (think: sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think: sum of squares if centered): v̂ar[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(x_i)²

Assume the data is centered: Ê[x] = 0 (what's Ê[U^T x]?)

Objective: maximize the variance of the projected data

max_{U ∈ R^{d×k}, U^T U = I} Ê[‖U^T x‖²]


slide-46
SLIDE 46

Equivalence of two objectives

Key intuition:  variance of data (fixed)  =  captured variance (want large)  +  reconstruction error (want small)

Pythagorean decomposition: x = UU^T x + (I − UU^T)x, with ‖UU^T x‖² + ‖(I − UU^T)x‖² = ‖x‖².
Take expectations; since the basis U is orthonormal, projecting onto it doesn't change the captured component's length:

Ê[‖x‖²] = Ê[‖U^T x‖²] + Ê[‖x − UU^T x‖²]

Minimize reconstruction error ⟺ Maximize captured variance
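The decomposition can be verified numerically for any orthonormal U (here a random one via QR); illustrative code:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 5, 2, 100
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)              # center: E-hat[x] = 0

# random orthonormal d x k basis via QR
U, _ = np.linalg.qr(rng.normal(size=(d, k)))

total = (X ** 2).sum() / n                      # E-hat[||x||^2]
captured = ((U.T @ X) ** 2).sum() / n           # E-hat[||U^T x||^2]
residual = ((X - U @ (U.T @ X)) ** 2).sum() / n # reconstruction error
assert np.isclose(total, captured + residual)
```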


slide-48
SLIDE 48

Changes of Basis

Data: X = (x₁ ⋯ x_n) ∈ R^{d×n}
Orthonormal basis: U = (u₁ ⋯ u_d) ∈ R^{d×d}

Change of basis: z = U^T x, i.e. z_j = u_j^T x, giving z = (z₁, ..., z_d)^T

Inverse change of basis: x = Uz = Σ_{j=1}^d (u_j^T x) u_j

slide-49
SLIDE 49

Data: X = (x₁ ⋯ x_n) ∈ R^{d×n}
Orthonormal basis (eigenvectors of the covariance): U = (u₁ ⋯ u_d) ∈ R^{d×d}
Eigendecomposition: XX^T = UΛU^T, with Λ = diag(λ₁, λ₂, ..., λ_d)

Principal Component Analysis

Claim: Eigenvectors of a symmetric matrix are orthogonal

slide-50
SLIDE 50

[Proof that eigenvectors of a symmetric matrix are orthogonal (from Stack Exchange)]

Principal Component Analysis

slide-51
SLIDE 51

Data: X = (x₁ ⋯ x_n) ∈ R^{d×n}
Orthonormal basis (eigenvectors of the covariance): U = (u₁ ⋯ u_d) ∈ R^{d×d}
Eigendecomposition: XX^T = UΛU^T, with Λ = diag(λ₁, λ₂, ..., λ_d)

Principal Component Analysis

Idea: Take the top-k eigenvectors to maximize captured variance

slide-52
SLIDE 52

Data: X = (x₁ ⋯ x_n) ∈ R^{d×n}
Truncated basis: U = (u₁ ⋯ u_k) ∈ R^{d×k}
Truncated decomposition: XX^T ≈ UΛ^{(k)}U^T, with Λ^{(k)} = diag(λ₁, λ₂, ..., λ_k)

Principal Component Analysis

slide-53
SLIDE 53

[Scatter plots: projections onto the top 2 vs. bottom 2 principal components]

Data: three varieties of wheat: Kama, Rosa, Canadian
 Attributes: Area, Perimeter, Compactness, Length of Kernel, 
 Width of Kernel, Asymmetry Coefficient, Length of Groove

Principal Component Analysis

slide-54
SLIDE 54

PCA: Complexity

Using eigenvalue decomposition:

  • Computation of covariance C: O(nd²)
  • Eigenvalue decomposition: O(d³)
  • Total complexity: O(nd² + d³)
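This pipeline can be sketched directly in NumPy (illustrative code, our naming; assumes X is already centered):

```python
import numpy as np

def pca_eig(X, k):
    # X is d x n and centered; C costs O(n d^2), eigh costs O(d^3)
    C = X @ X.T / X.shape[1]
    lams, U = np.linalg.eigh(C)            # eigenvalues in ascending order
    order = np.argsort(lams)[::-1][:k]     # take the top-k
    return U[:, order], lams[order]
```

Project with `Z = U.T @ X`.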

slide-55
SLIDE 55

PCA: Complexity

Using singular value decomposition:

  • Full decomposition: O(min{nd², n²d})
  • Rank-k decomposition: O(kdn log n) (with the power method)

slide-56
SLIDE 56

Idea: Decompose a d × d matrix M = UΣV^T into:

  • 1. A change of basis V (unitary matrix)
  • 2. A scaling Σ (diagonal matrix)
  • 3. A change of basis U (unitary matrix)

Singular Value Decomposition

slide-57
SLIDE 57

Singular Value Decomposition

Idea: Decompose the d × n matrix X = U_{d×d} Σ_{d×n} V^T_{n×n} into:

  • 1. An n × n basis V (unitary matrix)
  • 2. A d × n matrix Σ (diagonal projection)
  • 3. A d × d basis U (unitary matrix)
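The two routes agree: the left singular vectors of X are the eigenvectors of XXᵀ, and the squared singular values are its eigenvalues. A numerical check (illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 50))
X -= X.mean(axis=1, keepdims=True)

lams, U_eig = np.linalg.eigh(X @ X.T)                     # eigendecomposition route
U_svd, sigma, Vt = np.linalg.svd(X, full_matrices=False)  # SVD route

# squared singular values = eigenvalues of X X^T
assert np.allclose(np.sort(sigma ** 2), np.sort(lams))
# top principal directions agree up to sign
top = U_eig[:, np.argmax(lams)]
assert np.isclose(abs(U_svd[:, 0] @ top), 1.0)
```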


slide-62
SLIDE 62
  • d = number of pixels
  • Each x_i ∈ R^d is a face image
  • x_{ji} = intensity of the j-th pixel in image i

X_{d×n} ≈ U_{d×k} Z_{k×n}:  face images ≈ (eigenface basis)(codes z₁ ⋯ z_n)

Idea: z_i is a more "meaningful" representation of the i-th face than x_i
Can use z_i for nearest-neighbor classification
Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k
Why no time savings for a linear classifier?

Eigen-faces [Turk & Pentland 1991]

slide-63
SLIDE 63

Aside: How many components?

  • Magnitude of eigenvalues indicate fraction of variance captured.
  • Eigenvalues on a face image dataset:

[Plot: eigenvalues λ_i against component index i = 2, ..., 11 on a face image dataset]

  • Eigenvalues typically drop off sharply, so don’t need that many.
  • Of course variance isn’t everything...

slide-67
SLIDE 67

Latent Semantic Analysis [Deerwester 1990]

  • d = number of words in the vocabulary
  • Each x_i ∈ R^d is a vector of word counts
  • x_{ji} = frequency of word j in document i

X_{d×n} ≈ U_{d×k} Z_{k×n}:  columns of word counts (stocks: 2, chairman: 4, the: 8, ..., wins: 0, game: 1; ...) ≈ (topic directions)(document codes z₁ ⋯ z_n)

How to measure similarity between two documents? z₁^T z₂ is probably better than x₁^T x₂

Applications: information retrieval. Note: no computational savings; the original x is already sparse.
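A tiny illustration with a truncated SVD; the toy term-document counts below are made up to loosely mirror the matrix above:

```python
import numpy as np

# rows: stocks, chairman, the, wins, game; columns: 4 documents
X = np.array([[2., 3., 0., 0.],
              [4., 2., 1., 0.],
              [8., 7., 7., 6.],
              [0., 0., 2., 3.],
              [1., 0., 3., 2.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = np.diag(s[:k]) @ Vt[:k]                    # k x n document codes

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# the two finance-like documents end up closer than finance vs. sports
sim_finance = cosine(Z[:, 0], Z[:, 1])
sim_cross = cosine(Z[:, 0], Z[:, 3])
assert sim_finance > sim_cross
```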

slide-68
SLIDE 68

PCA Summary

  • Intuition: capture the variance of the data, or minimize the reconstruction error
  • Algorithm: find the eigendecomposition of the covariance matrix, or the SVD
  • Impact: reduce storage (from O(nd) to O(nk)), reduce time complexity
  • Advantages: simple, fast
  • Applications: eigen-faces, eigen-documents, network anomaly detection, etc.


slide-70
SLIDE 70

Probabilistic Interpretation

Generative Model [Tipping and Bishop, 1999]: for each data point i = 1, ..., n:

  • Draw the latent vector: z_i ∼ N(0, I_{k×k})
  • Create the data point: x_i ∼ N(Uz_i, σ²I_{d×d})

PCA finds the U that maximizes the likelihood of the data: max_U p(X | U)

Advantages:

  • Handles missing data (important for collaborative filtering)
  • Extension to factor analysis: allow non-isotropic noise (replace σ²I_{d×d} with an arbitrary diagonal matrix)
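The generative model is easy to simulate; with enough samples the empirical covariance approaches UUᵀ + σ²I. A sketch with assumed (made-up) parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n, sigma = 5, 2, 5000, 0.1
U = np.linalg.qr(rng.normal(size=(d, k)))[0]   # d x k loading directions

Z = rng.normal(size=(k, n))                    # z_i ~ N(0, I_{k x k})
X = U @ Z + sigma * rng.normal(size=(d, n))    # x_i ~ N(U z_i, sigma^2 I)

C = X @ X.T / n                                # empirical covariance
assert np.allclose(C, U @ U.T + sigma ** 2 * np.eye(d), atol=0.1)
```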


slide-72
SLIDE 72

Limitations of Linearity

[Figures: a dataset where PCA is effective and one where PCA is ineffective]

Problem: the PCA subspace is linear: S = {x = Uz : z ∈ R^k}. In this example: S = {(x₁, x₂) : x₂ = (u₂/u₁) x₁}


slide-76
SLIDE 76

Nonlinear PCA

[Figures: broken (linear) solution vs. desired solution]

We want the desired solution: S = {(x₁, x₂) : x₂ = (u₂/u₁) x₁²}
We can get this as S = {φ(x) = Uz} with φ(x) = (x₁², x₂)^T

Linear dimensionality reduction in φ(x) space ⇔ nonlinear dimensionality reduction in x space

Idea: Use kernels


slide-79
SLIDE 79

Kernel PCA

Kernel function: k(x₁, x₂) such that K, the kernel matrix formed by K_{ij} = k(x_i, x_j), is positive semi-definite.

PCA eigenvalue problem: XX^T u = λu
Representer theorem: u = Xα = Σ_{i=1}^n α_i x_i

max_{‖u‖=1} u^T XX^T u = max_{α^T X^T X α = 1} α^T (X^T X)(X^T X) α = max_{α^T K α = 1} α^T K² α

slide-80
SLIDE 80

Kernel PCA

Direct method: the kernel PCA objective max_{α^T K α = 1} α^T K² α ⇒ kernel PCA eigenvalue problem: X^T X α = λ'α, i.e. Kα = λ'α.

Modular method (if you don't want to think about kernels): find vectors x'₁, ..., x'_n such that

x'_i^T x'_j = K_{ij} = φ(x_i)^T φ(x_j)

Key: use any vectors that preserve inner products. One possibility is the Cholesky decomposition K = X'^T X'.
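The direct method in code; an illustrative sketch that skips centering the kernel matrix in feature space (a full implementation would center K first):

```python
import numpy as np

def kernel_pca(K, k):
    # eigenvalue problem K alpha = lambda' alpha
    lams, A = np.linalg.eigh(K)
    order = np.argsort(lams)[::-1][:k]
    lams, A = lams[order], A[:, order]
    # rescale so alpha^T K alpha = 1 (unit-norm u = sum_i alpha_i phi(x_i))
    A = A / np.sqrt(np.maximum(lams, 1e-12))
    Z = (K @ A).T              # k x n projections u^T phi(x_i)
    return Z, A
```

With the linear kernel K = XᵀX this reduces to ordinary PCA.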

slide-81
SLIDE 81

Kernel PCA

slide-82
SLIDE 82

Canonical Correlation Analysis (CCA)


slide-86
SLIDE 86

Often, each data point consists of two views:

  • Image retrieval: for each image, have the following:
    – x: pixels (or other visual features)
    – y: text around the image
  • Time series:
    – x: signal at time t
    – y: signal at time t + 1
  • Two-view learning: divide features into two sets
    – x: features of a word/object, etc.
    – y: features of the context in which it appears

Goal: reduce the dimensionality of the two views jointly

Motivation for CCA [Hotelling 1936]


slide-88
SLIDE 88

CCA Example

Setup: input data (x₁, y₁), ..., (x_n, y_n) (matrices X, Y). Goal: find a pair of projections (u, v).

[Figures: dimensionality reduction solutions, independent vs. joint; x and y are paired by brightness]

slide-89
SLIDE 89

CCA Definition

Definitions:

  • Variance: v̂ar(u^T x) = u^T XX^T u
  • Covariance: ĉov(u^T x, v^T y) = u^T XY^T v
  • Correlation: ĉov(u^T x, v^T y) / (√v̂ar(u^T x) √v̂ar(v^T y))

Objective: maximize correlation between projected views: max_{u,v} ĉorr(u^T x, v^T y)

Properties:

  • Focus on how variables are related, not how much they vary
  • Invariant to any rotation and scaling of the data

slide-92
SLIDE 92

From PCA to CCA

PCA on the views separately: no covariance term

max_{u,v} u^T XX^T u / (u^T u) + v^T YY^T v / (v^T v)

PCA on the concatenation (X^T, Y^T)^T: includes a covariance term

max_{u,v} (u^T XX^T u + 2 u^T XY^T v + v^T YY^T v) / (u^T u + v^T v)

Maximum covariance: drop the variance terms

max_{u,v} u^T XY^T v / (√(u^T u) √(v^T v))

Maximum correlation (CCA): divide out the variance terms

max_{u,v} u^T XY^T v / (√(u^T XX^T u) √(v^T YY^T v))

slide-95
SLIDE 95

Importance of Regularization

Extreme examples of degeneracy:

  • If x = Ay, then any (u, v) with u = Av is optimal (correlation 1)
  • If x and y are independent, then any (u, v) is optimal (correlation 0)

Problem: if X or Y has rank n, then any v is optimal (correlation 1) with u = X^{†T} Y v ⇒ CCA is meaningless!

⇒ Solution: regularization (interpolate between maximum covariance and maximum correlation)

max_{u,v} u^T XY^T v / (√(u^T (XX^T + λI) u) √(v^T (YY^T + λI) v))
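One standard way to solve the regularized objective is a whitening reduction: whiten each view with (XXᵀ + λI)^{-1/2}, then take the top singular pair of the whitened cross-covariance. An illustrative sketch (function names are ours):

```python
import numpy as np

def inv_sqrt(C):
    # symmetric inverse square root via eigendecomposition
    lams, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(lams)) @ V.T

def rcca_top_pair(X, Y, lam=0.1):
    # X: dx x n, Y: dy x n, both centered
    Wx = inv_sqrt(X @ X.T + lam * np.eye(X.shape[0]))
    Wy = inv_sqrt(Y @ Y.T + lam * np.eye(Y.shape[0]))
    A, s, Bt = np.linalg.svd(Wx @ (X @ Y.T) @ Wy)
    return Wx @ A[:, 0], Wy @ Bt[0], s[0]   # u, v, regularized correlation
```

With λ > 0 the returned correlation is strictly below 1, avoiding the degenerate rank-n case above.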