SLIDE 1

compsci 514: algorithms for data science

Cameron Musco, University of Massachusetts Amherst. Fall 2020. Lecture 15.

SLIDE 2

logistics

  • Problem Set 3 is due next Friday 10/23, 8pm.
  • Problem set grades seem to be strongly correlated with whether people are working in groups. So if you don’t have a group, I encourage you to join one. There are multiple people looking, so post on Piazza to find them.

  • This week’s quiz due Monday at 8pm.


SLIDE 3

summary

Last Class: Low-Rank Approximation

  • When data lies in a k-dimensional subspace V, we can perfectly embed it into k dimensions using an orthonormal basis V ∈ R^{d×k} for that subspace.
  • When data lies close to V, the optimal embedding into that subspace is given by projecting onto it:

$XVV^T = \arg\min_{B \text{ with rows in } V} \|X - B\|_F^2.$

This Class: Finding V via eigendecomposition.

  • How do we find the best low-dimensional subspace to approximate X?

  • PCA and its connection to eigendecomposition.


SLIDE 4

basic set up

Reminder of Set Up: Assume that x⃗_1, . . . , x⃗_n lie close to a k-dimensional subspace V of R^d. Let X ∈ R^{n×d} be the data matrix. Let v⃗_1, . . . , v⃗_k be an orthonormal basis for V and V ∈ R^{d×k} be the matrix with these vectors as its columns.

  • VV^T ∈ R^{d×d} is the projection matrix onto V.
  • X ≈ X(VV^T). This gives the closest approximation to X with rows in V.

x⃗_1, . . . , x⃗_n ∈ R^d: data points, X ∈ R^{n×d}: data matrix, v⃗_1, . . . , v⃗_k ∈ R^d: orthogonal basis for subspace V. V ∈ R^{d×k}: matrix with columns v⃗_1, . . . , v⃗_k.
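A minimal numpy sketch of this setup (the toy data, dimensions, and random basis are made up for illustration, not taken from the course):

```python
import numpy as np

# Toy data: n points in d dimensions lying near a k-dimensional subspace.
n, d, k = 200, 10, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.01 * rng.normal(size=(n, d))

# Any orthonormal basis V in R^{d x k}; here a random one obtained via QR,
# just to illustrate the projection mechanics.
V, _ = np.linalg.qr(rng.normal(size=(d, k)))

X_embed = X @ V         # n x k coordinates of each point in the subspace
X_proj = X_embed @ V.T  # X V V^T: closest matrix to X with rows in the subspace
print(np.linalg.norm(X - X_proj, 'fro') ** 2)  # squared Frobenius error of this basis
```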

SLIDE 5

dual view of low-rank approximation


SLIDE 6

best fit subspace

If x⃗_1, . . . , x⃗_n are close to a k-dimensional subspace V with orthonormal basis V ∈ R^{d×k}, the data matrix can be approximated as XVV^T. XV gives the optimal embedding of X in V.

How do we find the subspace V (equivalently, the basis matrix V)?

$\arg\min_{\text{orthonormal } V \in \mathbb{R}^{d \times k}} \|X - XVV^T\|_F^2 \;=\; \arg\max_{\text{orthonormal } V \in \mathbb{R}^{d \times k}} \|XV\|_F^2.$

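This equivalence holds because ∥X − XVV^T∥_F^2 = ∥X∥_F^2 − ∥XV∥_F^2 for any orthonormal V. A quick numpy check of that identity (toy data and a random orthonormal basis, made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 8, 3
X = rng.normal(size=(n, d))
V, _ = np.linalg.qr(rng.normal(size=(d, k)))   # a random orthonormal V in R^{d x k}

# ||X - X V V^T||_F^2 = ||X||_F^2 - ||X V||_F^2, so minimizing the left-hand
# side over orthonormal V is the same as maximizing ||X V||_F^2.
lhs = np.linalg.norm(X - X @ V @ V.T, 'fro') ** 2
rhs = np.linalg.norm(X, 'fro') ** 2 - np.linalg.norm(X @ V, 'fro') ** 2
print(np.isclose(lhs, rhs))  # True up to floating point error
```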

SLIDE 7

solution via eigendecomposition

The V minimizing ∥X − XVV^T∥_F^2 is given by:

$\arg\max_{\text{orthonormal } V \in \mathbb{R}^{d \times k}} \|XV\|_F^2, \quad \text{where} \quad \|XV\|_F^2 = \sum_{i=1}^{n} \|V^T \vec{x}_i\|_2^2 = \sum_{j=1}^{k} \|X \vec{v}_j\|_2^2.$

Surprisingly, we can find the columns of V, v⃗_1, . . . , v⃗_k, greedily:

$\vec{v}_1 = \arg\max_{\vec{v}: \|\vec{v}\|_2 = 1} \|X\vec{v}\|_2^2 = \arg\max_{\vec{v}: \|\vec{v}\|_2 = 1} \vec{v}^T X^T X \vec{v}.$

$\vec{v}_2 = \arg\max_{\vec{v}: \|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_1 \rangle = 0} \vec{v}^T X^T X \vec{v}.$

$\vdots$

$\vec{v}_k = \arg\max_{\vec{v}: \|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_j \rangle = 0\ \forall j < k} \vec{v}^T X^T X \vec{v}.$

These are exactly the top k eigenvectors of X^T X.

x⃗_1, . . . , x⃗_n ∈ R^d: data points, X ∈ R^{n×d}: data matrix, v⃗_1, . . . , v⃗_k ∈ R^d: orthogonal basis for subspace V. V ∈ R^{d×k}: matrix with columns v⃗_1, . . . , v⃗_k.
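A sketch of this in numpy (toy data and dimensions are made up; numpy's eigh is used here as one way to get the eigenvectors, not necessarily how the course computes them):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.01 * rng.normal(size=(n, d))

# Eigendecomposition of the d x d matrix X^T X. eigh returns eigenvalues in
# ascending order, so reverse to get lambda_1 >= ... >= lambda_d.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

V_k = eigvecs[:, :k]        # top-k eigenvectors = the greedy maximizers above
X_approx = X @ V_k @ V_k.T  # best approximation to X with rows in span(V_k)
print(np.linalg.norm(X - X_approx, 'fro') ** 2)
```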

SLIDE 8

review of eigenvectors and eigendecomposition

Eigenvector: x⃗ ∈ R^d is an eigenvector of a matrix A ∈ R^{d×d} if Ax⃗ = λx⃗ for some scalar λ (the eigenvalue corresponding to x⃗).

  • That is, A just ‘stretches’ x⃗.
  • If A is symmetric, we can find d orthonormal eigenvectors v⃗_1, . . . , v⃗_d. Let V ∈ R^{d×d} have these vectors as columns.

$AV = \begin{bmatrix} | & | & & | \\ A\vec{v}_1 & A\vec{v}_2 & \cdots & A\vec{v}_d \\ | & | & & | \end{bmatrix} = \begin{bmatrix} | & | & & | \\ \lambda_1\vec{v}_1 & \lambda_2\vec{v}_2 & \cdots & \lambda_d\vec{v}_d \\ | & | & & | \end{bmatrix} = V\Lambda.$

This yields the eigendecomposition: $AVV^T = A = V\Lambda V^T$ (using that $VV^T = I$ for orthonormal V).

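A quick numpy check of these facts on a small random symmetric matrix (the matrix and its size are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = (B + B.T) / 2            # a symmetric matrix

lam, V = np.linalg.eigh(A)   # orthonormal eigenvectors as the columns of V
print(np.allclose(A @ V, V @ np.diag(lam)))    # A V = V Lambda
print(np.allclose(A, V @ np.diag(lam) @ V.T))  # A = V Lambda V^T
print(np.allclose(V.T @ V, np.eye(5)))         # columns of V are orthonormal
```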

SLIDE 9

review of eigenvectors and eigendecomposition

Typically we order the eigenvectors by decreasing eigenvalue: λ_1 ≥ λ_2 ≥ . . . ≥ λ_d.


SLIDE 10

courant-fischer principle

Courant-Fischer Principle: For symmetric A, the eigenvectors are given by the greedy optimization:

$\vec{v}_1 = \arg\max_{\vec{v}: \|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}.$

$\vec{v}_2 = \arg\max_{\vec{v}: \|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_1 \rangle = 0} \vec{v}^T A \vec{v}.$

$\vdots$

$\vec{v}_d = \arg\max_{\vec{v}: \|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_j \rangle = 0\ \forall j < d} \vec{v}^T A \vec{v}.$

The value attained at the jth step is $\vec{v}_j^T A \vec{v}_j = \lambda_j \cdot \vec{v}_j^T \vec{v}_j = \lambda_j$, the jth largest eigenvalue.

  • The first k eigenvectors of X^T X (corresponding to the largest k eigenvalues) are exactly the directions of greatest variance in X that we use for low-rank approximation.

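A numerical spot check of this greedy characterization (a random symmetric matrix, made up for illustration; trying random unit vectors is a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
A = (B + B.T) / 2

lam, V = np.linalg.eigh(A)
v1 = V[:, -1]                 # eigenvector for the largest eigenvalue
print(v1 @ A @ v1, lam[-1])   # v1^T A v1 equals lambda_1

# No random unit vector should achieve a larger quadratic form.
for _ in range(1000):
    v = rng.normal(size=6)
    v /= np.linalg.norm(v)
    assert v @ A @ v <= lam[-1] + 1e-9
print("no unit vector beat lambda_1 in 1000 random trials")
```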

SLIDE 11

low-rank approximation via eigendecomposition


SLIDE 12

low-rank approximation via eigendecomposition

Upshot: Letting V_k have columns v⃗_1, . . . , v⃗_k corresponding to the top k eigenvectors of the covariance matrix X^T X, V_k is the orthogonal basis minimizing

$\|X - XV_kV_k^T\|_F^2.$

This is principal component analysis (PCA). How accurate is this low-rank approximation? We can understand it using the eigenvalues of X^T X.

x⃗_1, . . . , x⃗_n ∈ R^d: data points, X ∈ R^{n×d}: data matrix, v⃗_1, . . . , v⃗_k ∈ R^d: top eigenvectors of X^T X, V_k ∈ R^{d×k}: matrix with columns v⃗_1, . . . , v⃗_k.

SLIDE 13

spectrum analysis

Let v⃗_1, . . . , v⃗_k be the top k eigenvectors of X^T X (the top k principal components). The approximation error is:

$\|X - XV_kV_k^T\|_F^2 = \underbrace{\|X\|_F^2}_{\mathrm{tr}(X^TX)} - \underbrace{\|XV_kV_k^T\|_F^2}_{\mathrm{tr}(V_k^T X^T X V_k)}$

$= \sum_{i=1}^{d} \lambda_i(X^TX) - \sum_{i=1}^{k} \vec{v}_i^T X^T X \vec{v}_i = \sum_{i=1}^{d} \lambda_i(X^TX) - \sum_{i=1}^{k} \lambda_i(X^TX) = \sum_{i=k+1}^{d} \lambda_i(X^TX).$

  • For any matrix A, $\|A\|_F^2 = \sum_i \|\vec{a}_i\|_2^2 = \mathrm{tr}(A^TA)$ (the sum of the diagonal entries of $A^TA$, which equals the sum of its eigenvalues).

x⃗_1, . . . , x⃗_n ∈ R^d: data points, X ∈ R^{n×d}: data matrix, v⃗_1, . . . , v⃗_k ∈ R^d: top eigenvectors of X^T X, V_k ∈ R^{d×k}: matrix with columns v⃗_1, . . . , v⃗_k.
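A numpy check of this error formula on toy data (the data and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 3
X = rng.normal(size=(n, d))

lam, V = np.linalg.eigh(X.T @ X)
lam, V = lam[::-1], V[:, ::-1]   # sort eigenvalues/eigenvectors, largest first
V_k = V[:, :k]

err = np.linalg.norm(X - X @ V_k @ V_k.T, 'fro') ** 2
tail = lam[k:].sum()             # sum_{i=k+1}^d lambda_i(X^T X)
print(err, tail, np.isclose(err, tail))
```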

SLIDE 14

spectrum analysis

Claim: The error in approximating X with the best rank-k approximation (projecting onto the top k eigenvectors of X^T X) is:

$\|X - XV_kV_k^T\|_F^2 = \sum_{i=k+1}^{d} \lambda_i(X^TX).$

x⃗_1, . . . , x⃗_n ∈ R^d: data points, X ∈ R^{n×d}: data matrix, v⃗_1, . . . , v⃗_k ∈ R^d: top eigenvectors of X^T X, V_k ∈ R^{d×k}: matrix with columns v⃗_1, . . . , v⃗_k.

SLIDE 15

spectrum analysis

Plotting the spectrum of the covariance matrix X^T X (its eigenvalues) shows how compressible X is using low-rank approximation (i.e., how close x⃗_1, . . . , x⃗_n are to a low-dimensional subspace).

x⃗_1, . . . , x⃗_n ∈ R^d: data points, X ∈ R^{n×d}: data matrix, v⃗_1, . . . , v⃗_k ∈ R^d: top eigenvectors of X^T X, V_k ∈ R^{d×k}: matrix with columns v⃗_1, . . . , v⃗_k.
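One way to do this check numerically (the latent-factor toy data is made up for illustration): compute the spectrum of X^T X and look at how much of ∥X∥_F^2 the top k eigenvalues capture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
# Toy data with a rapidly decaying spectrum: noisy mixtures of 3 latent factors.
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d)) + 0.1 * rng.normal(size=(n, d))

lam = np.linalg.eigvalsh(X.T @ X)[::-1]   # spectrum of X^T X, largest first
frac = np.cumsum(lam) / lam.sum()         # fraction of ||X||_F^2 captured by top k
for k in (1, 2, 3, 5, 10):
    print(k, round(frac[k - 1], 4))       # close to 1 already at k = 3 here
```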

SLIDE 16

spectrum analysis

Exercises:

  1. Show that the eigenvalues of X^T X are always non-negative. Hint: use that $\lambda_j = \vec{v}_j^T X^T X \vec{v}_j$.
  2. Show that for symmetric A ∈ R^{d×d}, the trace is the sum of the eigenvalues: $\mathrm{tr}(A) = \sum_{i=1}^{d} \lambda_i(A)$.
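A numerical sanity check of both claims (random matrices made up for illustration; this is a check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(7, 7))
A = (B + B.T) / 2

lam = np.linalg.eigvalsh(A)                # eigenvalues of a symmetric matrix
print(np.isclose(np.trace(A), lam.sum()))  # tr(A) = sum of eigenvalues

X = rng.normal(size=(20, 7))
print(np.linalg.eigvalsh(X.T @ X).min() >= -1e-9)  # eigenvalues of X^T X are >= 0
```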

SLIDE 17

summary

  • Many (most) datasets can be approximated via projection onto a low-dimensional subspace.
  • Find this subspace via a maximization problem: $\max_{\text{orthonormal } V} \|XV\|_F^2$.
  • Greedy solution via eigendecomposition of X^T X.
  • Columns of V are the top eigenvectors of X^T X.
  • Error of the best low-rank approximation (compressibility of the data) is determined by the tail of X^T X’s eigenvalue spectrum.


SLIDE 18

interpretation in terms of correlation

Recall: Low-rank approximation is possible when our data features are correlated. Our compressed dataset is C = XV_k, where the columns of V_k are the top k eigenvectors of X^T X. What is the covariance of C?

$C^T C = V_k^T X^T X V_k = V_k^T V \Lambda V^T V_k = \Lambda_k.$

The covariance becomes diagonal, i.e., all correlations have been removed. Maximal compression.

x⃗_1, . . . , x⃗_n ∈ R^d: data points, X ∈ R^{n×d}: data matrix, v⃗_1, . . . , v⃗_k ∈ R^d: top eigenvectors of X^T X, V_k ∈ R^{d×k}: matrix with columns v⃗_1, . . . , v⃗_k.
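A small numpy check that the compressed data has diagonal covariance (the correlated toy data is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))   # strongly correlated features

lam, V = np.linalg.eigh(X.T @ X)
lam, V = lam[::-1], V[:, ::-1]
C = X @ V[:, :k]                  # compressed dataset C = X V_k

cov_C = C.T @ C                   # k x k covariance of the compressed data
print(np.allclose(cov_C, np.diag(np.diag(cov_C))))  # off-diagonal entries ~ 0
print(np.allclose(np.diag(cov_C), lam[:k]))         # diagonal = top-k eigenvalues
```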