
CS 6316 Machine Learning

Dimensionality Reduction

Yangfeng Ji

Department of Computer Science, University of Virginia


Overview

1. Reducing Dimensions
2. Principal Component Analysis
3. A Different Viewpoint of PCA


Reducing Dimensions


Curse of Dimensionality

What is the volume difference between two d-dimensional balls with radii r_1 = 1 and r_2 = 0.99?

◮ d = 2: π(r_1^2 − r_2^2) ≈ 0.06

◮ d = 3: (4/3)π(r_1^3 − r_2^3) ≈ 0.12

◮ General form: (π^{d/2} / Γ(d/2 + 1)) (r_1^d − r_2^d), with r_2^d → 0 when d → ∞

◮ E.g., r_2^500 ≈ 0.00657

Question: what will happen if we uniformly sample from a d-dimensional ball?
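The following short Python sketch (standard library only, names chosen for illustration) evaluates the volume formula above in log space so that large d does not overflow; it prints the volume difference and the shrinking ratio r_2^d:

    import math

    def log_ball_volume(r, d):
        """Log of the d-ball volume pi^(d/2) / Gamma(d/2 + 1) * r^d."""
        return (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1) + d * math.log(r)

    for d in (2, 3, 100, 500):
        outer = math.exp(log_ball_volume(1.0, d))    # radius 1 ball
        inner = math.exp(log_ball_volume(0.99, d))   # radius 0.99 ball
        # volume difference, and r_2^d, which goes to 0 as d grows
        print(d, outer - inner, 0.99 ** d)

For d = 500 the ratio 0.99^d is about 0.00657, i.e., almost all of the volume sits in the thin outer shell.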


Dimensionality Reduction

Dimensionality reduction is the process of taking data in a high-dimensional space and mapping it into a new space whose dimensionality is much smaller. Mathematically, it means

  f : x → x̃  (1)

where x ∈ R^d, x̃ ∈ R^n with n < d.


Reducing Dimensions: A toy example

For the purpose of reducing dimensions, we can project x = (x_1, x_2) onto the direction along x_1 or x_2.

[Figure: two data examples plotted in the (x_1, x_2) plane]

Question: Given these two data examples, which direction should we pick, x_1 or x_2?


Reducing Dimensions: A toy example (II)

There is a better solution if we are allowed to rotate the coordinate system.

[Figure: the same two examples with rotated directions u_1 and u_2]

Pick u_1; then we preserve all the variance of the examples.


Reducing Dimensions: A toy example (III)

Consider a more general case, where the examples do not lie on a perfect line. We can follow the same idea by finding a direction that preserves most of the variance of the examples [Bishop, 2006, Section 12.1].


Principal Component Analysis


Formulation

Given a set of examples S = {x_1, . . . , x_m}:

◮ Center the data by removing the mean:

  x̄ = (1/m) ∑_{i=1}^m x_i,   x_i ← x_i − x̄,  ∀i ∈ [m]  (2)

◮ Assume the direction onto which we would like to project the data is u; then the objective function is the data variance

  J(u) = (1/m) ∑_{i=1}^m (u^T x_i)^2  (3)

◮ Maximizing J(u) is trivial if there is no constraint on u. Therefore, we set ‖u‖_2^2 = u^T u = 1.
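Below is a minimal NumPy sketch of these two steps on a made-up two-dimensional toy dataset (the data, seed, and names are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    # toy dataset: m = 200 examples in d = 2 dimensions, with unequal variances
    X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

    x_bar = X.mean(axis=0)          # the mean in Eq. (2)
    X_centered = X - x_bar          # remove the mean from every example

    def J(u, Xc):
        """Projected data variance from Eq. (3) for a unit direction u."""
        u = u / np.linalg.norm(u)   # enforce the constraint u^T u = 1
        return np.mean((Xc @ u) ** 2)

    print(J(np.array([1.0, 0.0]), X_centered))   # variance along x_1
    print(J(np.array([0.0, 1.0]), X_centered))   # variance along x_2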


Covariance Matrix

The definition of J(u) can be rewritten as

  J(u) = (1/m) ∑_{i=1}^m (u^T x_i)^2          (4)
       = (1/m) ∑_{i=1}^m (u^T x_i)(u^T x_i)   (5)
       = (1/m) ∑_{i=1}^m u^T x_i x_i^T u      (6)
       = u^T [ (1/m) ∑_{i=1}^m x_i x_i^T ] u  (7)
       = u^T Σ u                              (8)

where Σ is the data covariance matrix.
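Continuing the toy NumPy sketch above, one can form Σ and check numerically that u^T Σ u matches the projected variance J(u):

    # the data covariance matrix from Eq. (7)-(8)
    Sigma = X_centered.T @ X_centered / len(X_centered)   # (1/m) sum_i x_i x_i^T

    u = np.array([1.0, 0.0])
    print(u @ Sigma @ u)        # u^T Sigma u ...
    print(J(u, X_centered))     # ... equals the projected variance J(u)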


Optimization

◮ The optimization problem for finding a single projection direction is

  max_u J(u) = u^T Σ u  (9)
  s.t. u^T u = 1        (10)

◮ It can be converted to an unconstrained optimization problem with a Lagrange multiplier:

  max_u  u^T Σ u + λ(1 − u^T u)  (11)

◮ Setting the gradient with respect to u to zero, the optimal solution is given by

  Σu − λu = 0  (12)
  Σu = λu      (13)


Two Observations

There are two observations from

  Σu = λu  (14)

◮ First, λ is an eigenvalue of Σ and u is the corresponding eigenvector (Lecture 01, page 29).

◮ Second, multiplying both sides by u^T, we have

  u^T Σ u = λ  (15)

In order to maximize J(u), λ has to be the largest eigenvalue and u the corresponding eigenvector.
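Continuing the same toy sketch, the largest eigenvalue/eigenvector pair of Σ can be read off NumPy's eigendecomposition, and the eigenvalue equals the maximized projected variance:

    # lambda and u from Eq. (14) via an eigendecomposition of Sigma
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns eigenvalues in ascending order
    u1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    print(eigvals[-1])                         # lambda_1
    print(J(u1, X_centered))                   # equals the maximized variance u1^T Sigma u1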


Principal Component Analysis

◮ As u indicates the first major direction that can preserve the data variance, it is called the first principal component.

◮ In general, with the eigendecomposition, we have

  U^T Σ U = Λ  (16)

  ◮ Eigenvalues: Λ = diag(λ_1, . . . , λ_d)
  ◮ Eigenvectors: U = [u_1, . . . , u_d]


Principal Component Analysis (II)

Assume in Λ = diag(λ_1, . . . , λ_d) that

  λ_1 ≥ λ_2 ≥ · · · ≥ λ_d  (17)

To reduce the dimensionality of x from d to n, with n < d:

◮ Take the first n eigenvectors in U and form

  Ũ = [u_1, . . . , u_n] ∈ R^{d×n}  (18)

◮ Reduce the dimensionality of x as

  x̃ = Ũ^T x ∈ R^n  (19)

◮ The value of n can be determined by requiring

  (∑_{i=1}^n λ_i) / (∑_{i=1}^d λ_i) ≈ 0.95  (20)
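A compact continuation of the toy NumPy sketch that sorts the eigenvalues, picks n with the 95% criterion in Eq. (20), and projects the data as in Eq. (19):

    order = np.argsort(eigvals)[::-1]              # sort lambda_1 >= lambda_2 >= ... (Eq. 17)
    eigvals_sorted = eigvals[order]
    U = eigvecs[:, order]                          # columns are u_1, ..., u_d

    explained = np.cumsum(eigvals_sorted) / eigvals_sorted.sum()
    n = int(np.searchsorted(explained, 0.95)) + 1  # smallest n whose ratio reaches 0.95

    U_tilde = U[:, :n]                             # Eq. (18): first n eigenvectors, d x n
    X_reduced = X_centered @ U_tilde               # Eq. (19): x_tilde = U_tilde^T x, row-wise
    print(n, X_reduced.shape)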


Applications: Image Processing

Reduce the dimensionality of an image dataset from 28 × 28 = 784 to M.

[Figure: (a) original data; (b) reconstructions with the first M principal components; Bishop, 2006, Section 12.1]
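As a small runnable stand-in for this experiment, here is a scikit-learn sketch on its built-in 8 × 8 digits dataset (64 pixels rather than the 784-pixel images on the slide); fit_transform projects onto the first M components and inverse_transform maps back to pixel space:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X_img = load_digits().data                  # 1797 images of 8 x 8 = 64 pixels
    pca = PCA(n_components=16)                  # keep the first M = 16 principal components
    X_rec = pca.inverse_transform(pca.fit_transform(X_img))
    print(X_img.shape, X_rec.shape)             # reconstructions live back in the 64-d space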


A Different Viewpoint of PCA


Data Reconstruction

Another way to formulate the objective function of PCA is

  min_{W, U} ∑_{i=1}^m ‖x_i − U W x_i‖_2^2  (21)

where

◮ W ∈ R^{n×d} maps x_i from the original space to a lower-dimensional space R^n
◮ U ∈ R^{d×n} maps back to the original space R^d
◮ Dimensionality reduction is performed as x̃ = W x, while U makes sure the reduction does not lose much information

[Shalev-Shwartz and Ben-David, 2014, Chapter 23]
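Continuing the earlier NumPy sketch, choosing W = Ũ^T and U = Ũ makes U W x_i the projection of x_i onto the span of the first n principal components, and the objective in Eq. (21) becomes the squared reconstruction error:

    # with W = U_tilde^T and U = U_tilde, U W x_i = U_tilde U_tilde^T x_i
    X_rec = X_centered @ U_tilde @ U_tilde.T     # reconstruction of every example
    err = np.sum((X_centered - X_rec) ** 2)      # the objective value in Eq. (21)
    print(err)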


Optimization

Consider the optimization problem

  min_{W, U} ∑_{i=1}^m ‖x_i − U W x_i‖_2^2  (22)

◮ Let (W, U) be a solution of equation (22). Then [Shalev-Shwartz and Ben-David, 2014, Lemma 23.1]:
  ◮ the columns of U are orthonormal
  ◮ W = U^T

◮ The optimization problem can therefore be simplified as

  min_{U^T U = I} ∑_{i=1}^m ‖x_i − U U^T x_i‖_2^2  (23)

The solution will be the same.


Nonlinear Extension

If we extend both mappings to be nonlinear, then the model becomes a simple encoder-decoder neural network:

  min_{W, U} ∑_{i=1}^m ‖x_i − tanh(U · tanh(W x_i))‖_2^2  (24)

where

◮ x̃ = tanh(W x_i) is a simple encoder
◮ x̂ = tanh(U x̃) is a simple decoder
◮ There are no closed-form solutions for W and U, although the backpropagation algorithm still applies here.
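A minimal sketch of this nonlinear model, assuming PyTorch is available (the toy data, sizes, optimizer, and step count are illustrative choices, not from the slides):

    import torch

    torch.manual_seed(0)
    m, d, n = 200, 2, 1
    X = torch.randn(m, d) * torch.tensor([3.0, 0.5])   # made-up toy data
    X = X - X.mean(dim=0)                              # centered, as in Eq. (2)

    W = torch.randn(n, d, requires_grad=True)          # encoder weights
    U = torch.randn(d, n, requires_grad=True)          # decoder weights
    opt = torch.optim.Adam([W, U], lr=1e-2)

    for step in range(2000):
        opt.zero_grad()
        X_hat = torch.tanh(torch.tanh(X @ W.T) @ U.T)  # tanh(U tanh(W x)) for every row
        loss = ((X - X_hat) ** 2).sum()                # the objective in Eq. (24)
        loss.backward()                                # backpropagation
        opt.step()

    print(loss.item())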


References

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.