


PCA

CS 446

Supervised learning

So far, we've done supervised learning: Given $((x_i, y_i))_{i=1}^n$, find $f$ with $f(x_i) \approx y_i$: k-nn, decision trees, . . .

Most methods used (regularized) ERM: minimize
$$
\widehat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i),
$$
and hope the true risk $R$ is small: least squares, logistic regression, deep networks, SVM, perceptron, . . .

1 / 18

Unsupervised learning

Now we only receive $(x_i)_{i=1}^n$, and the goal is. . . ?

◮ Encoding data in some compact representation (and decoding this).
◮ Data analysis; recovering "hidden structure" in data (e.g., recovering cliques or clusters).
◮ Features for supervised learning.
◮ . . . ?

The task is less clear-cut. In 2019 we still have people trying to formalize it!

2 / 18

1. PCA (Principal Component Analysis)

PCA motivation

Let's formulate a simplistic linear unsupervised method.

◮ Encoding (and decoding) data in some compact representation. Let's linearly map data in $\mathbb{R}^d$ to $\mathbb{R}^k$ and back.
◮ Data analysis; recovering "hidden structure" in data. Let's find if data mostly lies on a low-dimensional subspace.
◮ Features for supervised learning. Let's feed the $\mathbb{R}^k$ encoding to supervised methods.

3 / 18

SVD reminder

1. SV triples: $(s, u, v)$ satisfies $Mv = su$ and $M^T u = sv$.
2. Thin decomposition SVD: $M = \sum_{i=1}^{r} s_i u_i v_i^T$.
3. Full factorization SVD: $M = USV^T$.
4. "Operational" view of SVD: for $M \in \mathbb{R}^{n\times d}$,
$$
M =
\begin{pmatrix} u_1 & \cdots & u_r & u_{r+1} & \cdots & u_n \end{pmatrix}
\begin{pmatrix} s_1 & & & \\ & \ddots & & \\ & & s_r & \\ & & & 0 \end{pmatrix}
\begin{pmatrix} v_1 & \cdots & v_r & v_{r+1} & \cdots & v_d \end{pmatrix}^{T} .
$$

The first parts of $U$, $V$ span the column / row space (respectively), the second parts the left / right nullspaces (respectively).

New: let $(U_k, S_k, V_k)$ denote the truncated SVD with $U_k \in \mathbb{R}^{n\times k}$ (the first $k$ columns of $U$), and similarly for the others.

4 / 18
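
The following is a minimal numpy sketch (not from the slides) of these objects: the thin SVD and the truncated factors $U_k, S_k, V_k$; the data matrix and sizes are made up for illustration.

```python
import numpy as np

# A minimal sketch (not from the slides) of the SVD objects above; X is synthetic.
rng = np.random.default_rng(0)
n, d, k = 100, 20, 5
X = rng.normal(size=(n, d))

# Thin SVD: X = U @ diag(s) @ Vt, with s holding the singular values s_1 >= s_2 >= ...
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, U @ np.diag(s) @ Vt))            # True

# Truncated SVD: keep only the first k singular triples.
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T    # shapes (n, k), (k, k), (d, k)
```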

PCA (Principal component analysis)

Input: Data as rows of $\mathbb{R}^{n\times d} \ni X = USV^T$, integer $k$.
Output: Encoder $V_k$, decoder $V_k^T$, encoded data $XV_k = U_kS_k$.

The goal in unsupervised learning is unclear. We'll try to define this as "best encoding/decoding in Frobenius sense":
$$
\min_{\substack{D \in \mathbb{R}^{k\times d} \\ E \in \mathbb{R}^{d\times k}}} \|X - XED\|_F^2
= \|X - XV_kV_k^T\|_F^2 .
$$

Note $V_kV_k^T$ performs orthogonal projection onto the subspace spanned by $V_k$; thus we are finding the "best $k$-dimensional projection of the data".

5 / 18
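
Below is a small sketch of this encoder/decoder view using numpy; the data matrix `X` and the choice of `k` are placeholders, not part of the slides.

```python
import numpy as np

# A minimal sketch of the PCA encoder/decoder above; X and k are placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
k = 5

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                                   # encoder, d x k

Z = X @ V_k                                      # encoded data
print(np.allclose(Z, U[:, :k] * s[:k]))          # equals U_k S_k, as on the slide

X_hat = Z @ V_k.T                                # decoded data X V_k V_k^T (rank-k projection)
print(np.linalg.norm(X - X_hat, "fro") ** 2)     # Frobenius encoding/decoding error
```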

PCA properties

Theorem. Let $X \in \mathbb{R}^{n\times d}$ with SVD $X = USV^T$ and integer $k \le r$ be given. Then
$$
\min_{\substack{D \in \mathbb{R}^{k\times d} \\ E \in \mathbb{R}^{d\times k}}} \|X - XED\|_F^2
= \min_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|X - XDD^T\|_F^2
= \|X - XV_kV_k^T\|_F^2
= \sum_{i=k+1}^{r} s_i^2 .
$$
Additionally,
$$
\min_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|X - XDD^T\|_F^2
= \|X\|_F^2 - \max_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|XD\|_F^2
= \|X\|_F^2 - \|XV_k\|_F^2
= \|X\|_F^2 - \sum_{i=1}^{k} s_i^2 .
$$

Remark 1. The SVD is not unique, but $\sum_{i=1}^{r} s_i^2$ is identical across SVD choices.

Remark 2. As written, this is not a convex optimization problem!

Remark 3. The second form is interesting. . .

6 / 18
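
As a rough numerical sanity check of the theorem (not part of the slides), the following numpy snippet compares the rank-$k$ reconstruction error against both closed forms.

```python
import numpy as np

# A rough numerical check of the theorem (not part of the slides): the rank-k
# reconstruction error equals the trailing squared singular values, and also
# equals ||X||_F^2 - ||X V_k||_F^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

for k in range(1, len(s)):
    V_k = Vt[:k].T
    err = np.linalg.norm(X - X @ V_k @ V_k.T, "fro") ** 2
    assert np.allclose(err, np.sum(s[k:] ** 2))
    assert np.allclose(err, np.linalg.norm(X, "fro") ** 2 - np.linalg.norm(X @ V_k, "fro") ** 2)
print("theorem holds numerically for this X")
```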

Centered PCA

Some treatments replace $X$ with $X - \mathbf{1}\mu^T$, with mean $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$.

$\frac{1}{n}X^TX \in \mathbb{R}^{d\times d}$ is the data covariance; $\frac{1}{n}(XD)^T(XD)$ is the data covariance after projection; lastly,
$$
\frac{1}{n}\|XD\|_F^2
= \frac{1}{n}\operatorname{tr}\!\big( (XD)^T(XD) \big)
= \frac{1}{n}\sum_{i=1}^{k} (XDe_i)^T(XDe_i),
$$
therefore PCA is maximizing the resulting per-coordinate variances!

7 / 18
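
A minimal sketch of centered PCA on synthetic data, checking that the summed per-coordinate variances of the projected data equal $\|X_c V_k\|_F^2 / n$; the data and $k$ are illustrative.

```python
import numpy as np

# A minimal sketch of centered PCA on synthetic data: after removing the mean,
# the summed per-coordinate variances of the projection equal ||Xc V_k||_F^2 / n.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) + 5.0       # data with a nonzero mean
mu = X.mean(axis=0)
Xc = X - mu                                # X - 1 mu^T

k = 3
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:k].T

Z = Xc @ V_k                               # projected (encoded) data
print(Z.var(axis=0).sum(), np.linalg.norm(Z, "fro") ** 2 / len(Xc))   # equal
```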

Concrete applications of PCA

◮ Image data; e.g., "eigenfaces". Weirdness: negative faces? This motivates non-negative matrix factorization.
◮ LSI (Latent Semantic Indexing): collect many documents into $X \in \mathbb{R}^{n\times d}$, where $x_i$ is a normalized bag-of-words vector (plus nonlinear mappings). Can interpret the new representation as a weighting over "topics".

8 / 18

Application: digit data

Data $(x_i)_{i=1}^n$ with $x_i \in \mathbb{R}^{784}$.

◮ Residual variance left by the rank-$k$ PCA projection:
$$
1 - \frac{\sum_{j=1}^{k} \text{variance in direction } v_j}{\text{total variance}}
= 1 - \frac{\|XV_k\|_F^2}{\|X\|_F^2} .
$$
◮ Residual variance left by the best $k$ coordinate projections:
$$
1 - \frac{\sum_{j=1}^{k} \text{variance in direction } e_j}{\text{total variance}}
= 1 - \frac{\sum_{j=1}^{k} (Xe_j)^T(Xe_j)}{\|X\|_F^2} .
$$

[Figure: fraction of residual variance vs. dimension of projection $k$, comparing coordinate projections with PCA projections.]

9 / 18
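
A sketch of how the two residual-variance curves could be computed; the matrix `X` here is a synthetic stand-in, since the digit data is not included.

```python
import numpy as np

# A sketch of the two residual-variance curves above; X is a synthetic stand-in
# for the digit data (which is not included here).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 784))

total = np.linalg.norm(X, "fro") ** 2
s = np.linalg.svd(X, compute_uv=False)                 # singular values, descending
coord = np.sort((X ** 2).sum(axis=0))[::-1]            # (X e_j)^T (X e_j), best coordinates first

for k in (1, 10, 50, 200):
    pca_res = 1 - np.sum(s[:k] ** 2) / total           # residual of rank-k PCA projection
    coord_res = 1 - np.sum(coord[:k]) / total          # residual of best k coordinate projections
    print(k, pca_res, coord_res)                       # PCA residual is never larger
```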

Application: digit data

$16 \times 16$ pixel images of handwritten 3s (as vectors in $\mathbb{R}^{256}$).

[Figure: the mean $\mu$ and the right singular vectors $v_1, v_2, v_3, v_4$, with $\lambda_1 = 3.4\cdot 10^5$, $\lambda_2 = 2.8\cdot 10^5$, $\lambda_3 = 2.4\cdot 10^5$, $\lambda_4 = 1.6\cdot 10^5$.]

[Figure: reconstructions of an image $x$ with $k = 1, 10, 50, 200$.]

Only have to store $k$ numbers per image, along with the mean $\mu$ and $k$ eigenvectors ($256(k + 1)$ numbers).

10 / 18
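
A small sketch of the storage scheme described above (mean, $k$ eigenvectors, $k$ coefficients per image), again on synthetic stand-in data.

```python
import numpy as np

# A sketch of the storage scheme above on synthetic stand-in "images":
# keep the mean, k right singular vectors, and k coefficients per image.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))            # stand-in for 16x16 images as rows
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)

k = 10
V_k = Vt[:k].T                              # 256 x k "eigenvectors"
codes = (X - mu) @ V_k                      # k numbers per image
X_rec = mu + codes @ V_k.T                  # reconstructions

print(mu.size + V_k.size)                   # 256 * (k + 1) shared numbers, as on the slide
```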

Solving for SVD?

Easiest solver: power method. Suppose $M = USV^T$ is positive semi-definite.

◮ Start from some random $x$ with $\|x\| = 1$.
◮ Let's rewrite it in the basis $V$: $x = VV^Tx = V\alpha$ with $\alpha = V^Tx$.
◮ Now $Mx = VS\alpha$, thus $V\alpha$ is replaced with $V(S\alpha)$.
◮ After $t$ iterations, we have $VS^t\alpha$.
◮ $V\alpha$ is amplified the most in direction $v_1$, less in others.
◮ Finally, $VS^t\alpha / \|VS^t\alpha\|$ is output. (Can also normalize with each iteration.)

11 / 18
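
A minimal sketch of the power method as described, normalizing at each iteration; the matrix `M` and the iteration count are illustrative.

```python
import numpy as np

# A minimal sketch of the power method described above for a PSD matrix M,
# normalizing at every iteration; M and the iteration count are illustrative.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 30))
M = A @ A.T                                  # positive semi-definite

x = rng.normal(size=30)
x /= np.linalg.norm(x)
for _ in range(200):                         # x converges to the top singular vector v_1
    x = M @ x
    x /= np.linalg.norm(x)

print(x @ M @ x, np.linalg.svd(M, compute_uv=False)[0])   # estimate vs. true s_1
```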

2. Proofs

Fact. Let $X \in \mathbb{R}^{n\times d}$ and $k \le r$ be given. Then
$$
\min_{\substack{M \in \mathbb{R}^{d\times d} \\ \operatorname{rank}(M) = k}} \|X - XM\|_F^2
= \min_{\substack{D \in \mathbb{R}^{k\times d} \\ E \in \mathbb{R}^{d\times k}}} \|X - XED\|_F^2
= \min_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|X - XDD^T\|_F^2 .
$$

Proof. Since
$$
\{M \in \mathbb{R}^{d\times d} : \operatorname{rank}(M) = k\}
\supseteq \{ED : E \in \mathbb{R}^{d\times k},\ D \in \mathbb{R}^{k\times d}\}
\supseteq \{DD^T : D \in \mathbb{R}^{d\times k},\ D^TD = I\}
$$
(and minimizing over a larger set can only give a smaller or equal value, so the inclusions already yield the corresponding chain of inequalities), it suffices to show
$$
\min_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|X - XDD^T\|_F^2
\le \min_{\substack{M \in \mathbb{R}^{d\times d} \\ \operatorname{rank}(M) = k}} \|X - XM\|_F^2 .
$$

12 / 18

Proof (continued). For any $M = USV^T \in \mathbb{R}^{d\times d}$ with $\operatorname{rank}(M) \le k$ (whereby $M = U_kS_kV_k^T$),
$$
\begin{aligned}
\|X - XM\|_F^2
&= \|X - XV_kV_k^T + XV_kV_k^T - XM\|_F^2 \\
&= \|X - XV_kV_k^T\|_F^2
+ 2\operatorname{tr}\!\big( (X - XV_kV_k^T)^T (XV_kV_k^T - XM) \big)
+ \|XV_kV_k^T - XM\|_F^2 .
\end{aligned}
$$
We'll show the middle (trace) term is 0, and therefore
$$
\|X - XM\|_F^2
= \|X - XV_kV_k^T\|_F^2 + \|XV_kV_k^T - XM\|_F^2
\ge \|X - XV_kV_k^T\|_F^2 .
$$

13 / 18

Proof (continued). Note
$$
\begin{aligned}
\operatorname{tr}\!\big( (X - XV_kV_k^T)^T (XV_kV_k^T - XM) \big)
&= \operatorname{tr}\!\big( (I - V_kV_k^T)^T X^T (X - XU_kS_kV_k^T)\, V_kV_k^T \big) \\
&= \operatorname{tr}\!\big( X^T (X - XU_kS_kV_k^T)\, V_kV_k^T (I - V_kV_k^T)^T \big),
\end{aligned}
$$
and
$$
V_kV_k^T (I - V_kV_k^T)^T
= \Big(\sum_{j=1}^{k} v_jv_j^T\Big)\Big(\sum_{i=1}^{d} v_iv_i^T - \sum_{i=1}^{k} v_iv_i^T\Big)
= \Big(\sum_{j=1}^{k} v_jv_j^T\Big)\Big(\sum_{i=k+1}^{d} v_iv_i^T\Big)
= 0 .
$$
Therefore $\operatorname{tr}\big( (X - XV_kV_k^T)^T (XV_kV_k^T - XM) \big) = 0$.

14 / 18

Fact. Let $X \in \mathbb{R}^{n\times d}$ be given along with $D \in \mathbb{R}^{d\times k}$ with $D^TD = I$. Then
$$
\|X - XDD^T\|_F^2 = \|X\|_F^2 - \|XD\|_F^2,
$$
and
$$
\min_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|X - XDD^T\|_F^2
= \|X\|_F^2 - \max_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|XD\|_F^2,
\qquad
\mathop{\arg\min}_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|X - XDD^T\|_F^2
= \mathop{\arg\max}_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|XD\|_F^2 .
$$

Proof. Since
$$
\|XDD^T\|_F^2
= \operatorname{tr}\!\big( (XDD^T)^T(XDD^T) \big)
= \operatorname{tr}\!\big( (XD)^T(XDD^TD) \big)
= \operatorname{tr}\!\big( (XD)^T(XD) \big)
= \|XD\|_F^2,
$$
and similarly $\operatorname{tr}\big( (XDD^T)^TX \big) = \operatorname{tr}\big( (XD)^T(XD) \big) = \|XD\|_F^2$, therefore
$$
\|X - XDD^T\|_F^2
= \|X\|_F^2 - 2\operatorname{tr}\!\big( (XDD^T)^TX \big) + \|XDD^T\|_F^2
= \|X\|_F^2 - \|XD\|_F^2 .
$$

15 / 18
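
A quick numerical spot-check of this fact (not from the slides), using an orthonormal `D` obtained from a QR factorization.

```python
import numpy as np

# A quick spot-check of the identity above (not from the slides), with an
# orthonormal D obtained from a QR factorization.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
D, _ = np.linalg.qr(rng.normal(size=(10, 3)))            # D^T D = I

lhs = np.linalg.norm(X - X @ D @ D.T, "fro") ** 2
rhs = np.linalg.norm(X, "fro") ** 2 - np.linalg.norm(X @ D, "fro") ** 2
print(np.allclose(lhs, rhs))                             # True
```
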
Fact. Let $X \in \mathbb{R}^{n\times d}$ be given with SVD $X = USV^T$. Then
$$
\max_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \|XD\|_F^2 = \|XV_k\|_F^2 = \sum_{i=1}^{k} s_i^2 .
$$

Proof. Define
$$
S_1 := \{D \in \mathbb{R}^{d\times k} : D^TD = I\},
\qquad
S_2 := \{VD : D \in S_1\}.
$$
Note $S_1 = S_2$:
◮ $S_1 \subseteq S_2$, since $D \in S_1$ implies $(V^TD)^TV^TD = I$, thus $D = V(V^TD) \in S_2$.
◮ $S_2 \subseteq S_1$, since $VD \in S_2$ implies $(VD)^T(VD) = I$, thus $VD \in S_1$.
Therefore
$$
\begin{aligned}
\max_{D \in S_1} \|XD\|_F^2
= \max_{M \in S_2} \|XM\|_F^2
= \max_{D \in S_1} \|XVD\|_F^2
&= \max_{D \in S_1} \|USV^TVD\|_F^2 \\
&= \max_{D \in S_1} \operatorname{tr}\!\big( (USD)^T(USD) \big)
= \max_{D \in S_1} \operatorname{tr}\!\big( DD^TS^TS \big)
= \max_{D \in S_1} \sum_{j=1}^{r} s_j^2 \sum_{i=1}^{k} D_{ji}^2 .
\end{aligned}
$$

16 / 18

Proof (continued). We've reduced the proof to showing
$$
\max_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \sum_{j=1}^{r} s_j^2 \sum_{i=1}^{k} D_{ji}^2 = \|XV_k\|_F^2,
$$
and note moreover $\|XV_k\|_F^2 = \operatorname{tr}\big( (U_kS_k)^T(U_kS_k) \big) = \sum_{i=1}^{k} s_i^2$. Lastly:

◮ Since $V_k \in \mathbb{R}^{d\times k}$ and $V_k^TV_k = I$,
$$
\max_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \sum_{j=1}^{r} s_j^2 \sum_{i=1}^{k} D_{ji}^2 \ge \|XV_k\|_F^2 .
$$
◮ For any feasible $D \in \mathbb{R}^{d\times k}$, extend it to an orthonormal $M \in \mathbb{R}^{d\times d}$; since $M^TM = I = MM^T$, $M^T$ is orthonormal as well, and $\sum_{i=1}^{k} D_{ji}^2 \le \sum_{i=1}^{d} M_{ji}^2 = 1$. Moreover, $\sum_{i,j} D_{ji}^2 \le k$, so
$$
\max_{\substack{D \in \mathbb{R}^{d\times k} \\ D^TD = I}} \sum_{j=1}^{r} s_j^2 \Big( \sum_{i=1}^{k} D_{ji}^2 \Big)
\le \max_{\substack{w \in [0,1]^d \\ \sum_i w_i \le k}} \sum_{j=1}^{r} s_j^2 w_j
\le \sum_{j=1}^{k} s_j^2 .
$$

17 / 18

3. PCA summary

PCA summary

◮ Unsupervised learning has no "labels"; the goal is unclear.
◮ Three perspectives are: encoding/decoding, finding structure, feature learning.
◮ PCA takes in $X = USV^T$, outputs encoder $V_k$ and encoding $XV_k = U_kS_k$. This is the "best rank-$k$ subspace" in a few concrete ways.

18 / 18