

SLIDE 1

compsci 514: algorithms for data science

Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 13.

SLIDE 2

logistics

  • Pass/Fail deadline is 10/29 for undergraduates and 10/31 for graduates. We will have your Problem Set 2 and midterm grades back before then.
  • Will release Problem Set 3 next week, due ∼ 11/11.
  • MAP Feedback:
  • Going to adjust a bit how I take questions in class.
  • Will try to more clearly identify important information (what will appear on exams or problem sets) vs. motivating examples.
  • Will try to use iPad more to write out proofs in class.


SLIDE 3

summary

Last Few Classes: Low-Rank Approximation and PCA

  • Discussed how to compress a dataset that lies close to a k-dimensional subspace.
  • Optimal compression by projecting onto the top k eigenvectors of the covariance matrix $X^T X$ (PCA).
  • Saw how to calculate the error of the approximation by interpreting the spectrum of $X^T X$.

This Class: Low-rank approximation and its connection to the singular value decomposition.

  • Show how PCA can be interpreted in terms of the singular value decomposition (SVD) of X.
  • Applications to word embeddings, graph embeddings, document classification, recommendation systems.


SLIDE 4

review

Set Up: Assume that the data points $\vec{x}_1, \dots, \vec{x}_n$ lie close to a k-dimensional subspace $\mathcal{V}$ of $\mathbb{R}^d$. Let $X \in \mathbb{R}^{n \times d}$ be the data matrix. Let $\vec{v}_1, \dots, \vec{v}_k$ be an orthonormal basis for $\mathcal{V}$ and $V \in \mathbb{R}^{d \times k}$ be the matrix with these vectors as its columns.

  • $VV^T \in \mathbb{R}^{d \times d}$ is the projection matrix onto $\mathcal{V}$.
  • $X \approx XVV^T$: the closest approximation to X with rows in $\mathcal{V}$.

Notation: $\vec{x}_1, \dots, \vec{x}_n \in \mathbb{R}^d$: data points; $X \in \mathbb{R}^{n \times d}$: data matrix; $\vec{v}_1, \dots, \vec{v}_k \in \mathbb{R}^d$: orthonormal basis for subspace $\mathcal{V}$; $V \in \mathbb{R}^{d \times k}$: matrix with columns $\vec{v}_1, \dots, \vec{v}_k$.


SLIDE 5

review of last time

Low-Rank Approximation: Approximate $X \approx XVV^T$.

  • $XVV^T$ is a rank-k matrix: all its rows fall in $\mathcal{V}$.
  • X's rows are approximately spanned by the columns of V.
  • X's columns are approximately spanned by the columns of XV.
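To make the recap concrete, here is a minimal numpy sketch (mine, not from the slides): build an orthonormal V, project the rows of X onto its span, and check that the result has rank k and is unchanged by further projection.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 3

# Data lying near a k-dimensional subspace: low-rank signal plus small noise.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))

# Orthonormal V in R^{d x k} via QR of a random d x k matrix.
V, _ = np.linalg.qr(rng.standard_normal((d, k)))

X_approx = X @ V @ V.T                            # rows projected onto span(V)
print(np.linalg.matrix_rank(X_approx))            # k: all rows lie in the subspace
print(np.allclose(X_approx @ V @ V.T, X_approx))  # projecting again changes nothing
```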


SLIDE 6

dual view of low-rank approximation

[Figure slide.]

SLIDE 7

optimal low-rank approximation

Given $\vec{x}_1, \dots, \vec{x}_n$ (the rows of X), we want to find a matrix $V \in \mathbb{R}^{d \times k}$ with orthonormal columns (spanning a k-dimensional subspace $\mathcal{V}$):

$$\arg\min_{\text{orthonormal } V \in \mathbb{R}^{d \times k}} \|X - XVV^T\|_F^2 \;=\; \arg\max_{\text{orthonormal } V \in \mathbb{R}^{d \times k}} \|XVV^T\|_F^2,$$

where $\|XVV^T\|_F^2 = \sum_{i=1}^n \|VV^T \vec{x}_i\|_2^2$.
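The min/max equivalence is the Pythagorean theorem in disguise: for orthonormal V, each row splits orthogonally into its projection and its residual, so $\|X\|_F^2 = \|X - XVV^T\|_F^2 + \|XVV^T\|_F^2$. A quick numerical check (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
V, _ = np.linalg.qr(rng.standard_normal((10, 4)))   # orthonormal V, d=10, k=4

residual = np.linalg.norm(X - X @ V @ V.T, 'fro') ** 2
captured = np.linalg.norm(X @ V @ V.T, 'fro') ** 2

# ||X||_F^2 is fixed, so minimizing the residual = maximizing the captured mass.
print(np.isclose(residual + captured, np.linalg.norm(X, 'fro') ** 2))
```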


SLIDE 8

optimal low-rank approximation

[Figure slide.]

SLIDE 9

solution via eigendecomposition

The V minimizing the error $\|X - XVV^T\|_F^2$ equivalently maximizes

$$\|XVV^T\|_F^2 = \sum_{i=1}^k \vec{v}_i^T X^T X \vec{v}_i.$$

Surprisingly, we can find the columns of V, $\vec{v}_1, \dots, \vec{v}_k$, greedily:

$$\vec{v}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \vec{v}^T X^T X \vec{v}$$
$$\vec{v}_2 = \arg\max_{\|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_1 \rangle = 0} \vec{v}^T X^T X \vec{v}$$
$$\vdots$$
$$\vec{v}_k = \arg\max_{\|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_j \rangle = 0\ \forall j < k} \vec{v}^T X^T X \vec{v}$$

These are the top k eigenvectors of $X^T X$, by the Courant-Fischer principle.
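The top direction $\vec{v}_1$ can be found by power iteration, which repeatedly applies $X^T X$ and renormalizes. A sketch of mine illustrating the idea (not the algorithm prescribed in class):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 30))
A = X.T @ X                              # the matrix whose top eigenvector we want

v = rng.standard_normal(30)
for _ in range(500):                     # power iteration: v converges to v_1
    v = A @ v
    v /= np.linalg.norm(v)

lam, V = np.linalg.eigh(A)               # eigh returns eigenvalues in ascending order
print(np.isclose(abs(v @ V[:, -1]), 1.0, atol=1e-6))   # matches the top eigenvector
```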


SLIDE 10

eigendecomposition

Any symmetric matrix A can be decomposed as $A = V \Lambda V^T$, where the columns of V are d orthonormal eigenvectors $\vec{v}_1, \dots, \vec{v}_d$. Typically we order the eigenvalues in decreasing order: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$.

When $A = X^T X$, all eigenvalues are $\ge 0$. Why? Because for any $\vec{v}$, $\vec{v}^T X^T X \vec{v} = \|X\vec{v}\|_2^2 \ge 0$, so $X^T X$ is positive semidefinite.
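Both facts are easy to confirm numerically (a sketch of mine): np.linalg.eigh is numpy's symmetric eigendecomposition routine, and the eigenvalues of $X^T X$ come out nonnegative up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 15))
A = X.T @ X                                    # symmetric (and PSD)

lam, V = np.linalg.eigh(A)                     # A = V diag(lam) V^T, lam ascending
print(np.allclose(V @ np.diag(lam) @ V.T, A))  # eigendecomposition reconstructs A
print(lam.min() >= -1e-10)                     # all eigenvalues >= 0
```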


SLIDE 11

low-rank approximation via eigendecomposition

[Figure slide.]

Notation: $\vec{x}_1, \dots, \vec{x}_n \in \mathbb{R}^d$: data points; $X \in \mathbb{R}^{n \times d}$: data matrix; $\vec{v}_1, \dots, \vec{v}_k \in \mathbb{R}^d$: top eigenvectors of $X^T X$; $V_k \in \mathbb{R}^{d \times k}$: matrix with columns $\vec{v}_1, \dots, \vec{v}_k$.

SLIDE 12

low-rank approximation via eigendecomposition

Upshot: Letting $V_k$ have columns $\vec{v}_1, \dots, \vec{v}_k$ corresponding to the top k eigenvectors of the covariance matrix $X^T X$, $V_k$ is the orthonormal basis minimizing $\|X - XV_kV_k^T\|_F^2$. This is principal component analysis (PCA).

Last Time: Saw how to determine the accuracy of this approximation by looking at the eigenvalues (the 'spectrum') of $X^T X$.
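A compact sketch of exactly this recipe (mine; note that in practice X is usually mean-centered first, and SVD routines are preferred for numerical stability):

```python
import numpy as np

def pca_low_rank(X, k):
    """Rank-k approximation X @ Vk @ Vk.T, with Vk the top k eigenvectors of X^T X."""
    lam, V = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    Vk = V[:, -k:]                     # columns = top k eigenvectors
    return X @ Vk @ Vk.T

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 20))  # rank-8 data
for k in (2, 4, 8):
    print(k, np.linalg.norm(X - pca_low_rank(X, k), 'fro'))  # error -> ~0 at k = 8
```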


SLIDE 13

singular value decomposition

The Singular Value Decomposition (SVD) generalizes the eigendecomposition to asymmetric (even rectangular) matrices. Any matrix $X \in \mathbb{R}^{n \times d}$ with $\mathrm{rank}(X) = r$ can be written as $X = U \Sigma V^T$, where:

  • U has orthonormal columns $\vec{u}_1, \dots, \vec{u}_r \in \mathbb{R}^n$ (left singular vectors).
  • V has orthonormal columns $\vec{v}_1, \dots, \vec{v}_r \in \mathbb{R}^d$ (right singular vectors).
  • $\Sigma$ is diagonal with entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ (singular values).

The 'swiss army knife' of linear algebra.
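In numpy, np.linalg.svd returns exactly these three factors. A small sketch (mine), using full_matrices=False to get the compact shapes used above:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))                 # generically rank r = 20

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # U: 100x20, s: (20,), Vt: 20x20
print(np.allclose(U @ np.diag(s) @ Vt, X))         # X = U Sigma V^T
print(np.all(s[:-1] >= s[1:]) and s[-1] > 0)       # sigma_1 >= ... >= sigma_r > 0
print(np.allclose(U.T @ U, np.eye(20)), np.allclose(Vt @ Vt.T, np.eye(20)))
```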


SLIDE 14

connection of the svd to eigendecomposition

Writing $X \in \mathbb{R}^{n \times d}$ in its singular value decomposition $X = U \Sigma V^T$:

$$X^T X = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T \quad \text{(the eigendecomposition)}$$

Similarly:

$$X X^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T.$$

The right and left singular vectors are the eigenvectors of the covariance matrix $X^T X$ and the Gram matrix $XX^T$, respectively. So, letting $V_k \in \mathbb{R}^{d \times k}$ have columns $\vec{v}_1, \dots, \vec{v}_k$, we have that $XV_kV_k^T$ is the best rank-k approximation to X (the PCA approximation). What about $U_kU_k^TX$, where $U_k \in \mathbb{R}^{n \times k}$ has columns $\vec{u}_1, \dots, \vec{u}_k$? It gives exactly the same approximation!

Notation: $X \in \mathbb{R}^{n \times d}$: data matrix; $U \in \mathbb{R}^{n \times \mathrm{rank}(X)}$: matrix with orthonormal columns $\vec{u}_1, \vec{u}_2, \dots$ (left singular vectors); $V \in \mathbb{R}^{d \times \mathrm{rank}(X)}$: matrix with orthonormal columns $\vec{v}_1, \vec{v}_2, \dots$ (right singular vectors); $\Sigma \in \mathbb{R}^{\mathrm{rank}(X) \times \mathrm{rank}(X)}$: positive diagonal matrix containing the singular values of X.
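All of these identities can be confirmed in a few lines (a sketch of mine): check both eigendecompositions, and check that projecting the rows onto $V_k$ gives the same matrix as projecting the columns onto $U_k$.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 12))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

print(np.allclose(X.T @ X, V @ np.diag(s**2) @ V.T))  # X^T X = V Sigma^2 V^T
print(np.allclose(X @ X.T, U @ np.diag(s**2) @ U.T))  # X X^T = U Sigma^2 U^T

k = 3
Vk, Uk = V[:, :k], U[:, :k]
print(np.allclose(X @ Vk @ Vk.T, Uk @ Uk.T @ X))      # identical rank-k approximations
```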


SLIDE 15

the svd and optimal low-rank approximation

The best rank-k approximation to X,

$$X_k = \arg\min_{\text{rank-}k\ B \in \mathbb{R}^{n \times d}} \|X - B\|_F,$$

is given by:

$$X_k = X V_k V_k^T = U_k U_k^T X = U_k \Sigma_k V_k^T.$$

These correspond to projecting the rows (data points) onto the span of $V_k$, or the columns (features) onto the span of $U_k$.
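A direct implementation of the truncated SVD $X_k = U_k \Sigma_k V_k^T$ (my sketch; for large or sparse matrices one would typically use an iterative or randomized solver rather than a full SVD):

```python
import numpy as np

def svd_low_rank(X, k):
    """Best rank-k approximation of X in Frobenius norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(7)
X = rng.standard_normal((60, 40))
s = np.linalg.svd(X, compute_uv=False)
k = 5
# Optimality: the squared error equals the tail of the spectrum, sum_{i>k} sigma_i^2.
err = np.linalg.norm(X - svd_low_rank(X, k), 'fro') ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))
```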


SLIDE 16

the svd and linear regression

SVD is a 'swiss army knife'. Classic Linear Regression: Given $X \in \mathbb{R}^{n \times d}$ where $n > d$ (we have more data points than parameters) and a response vector $\vec{y} \in \mathbb{R}^n$, we want to find $\vec{c} \in \mathbb{R}^d$ minimizing $\|X\vec{c} - \vec{y}\|_2$.

E.g., $c_1 \cdot (\#\ \text{baths}) + c_2 \cdot (\text{sq.ft.}) + c_3 \cdot (\#\ \text{floors}) + \dots \approx$ home price.


SLIDE 17

the svd and linear regression

Classic Linear Regression: Given $X \in \mathbb{R}^{n \times d}$ where $n > d$ (we have more data points than parameters) and a response vector $\vec{y} \in \mathbb{R}^n$, we want to find $\vec{c} \in \mathbb{R}^d$ minimizing $\|X\vec{c} - \vec{y}\|_2$.

The optimal solution is to choose $\vec{c}$ so that $X\vec{c} = P_X\vec{y}$: the projection of $\vec{y}$ onto the column span of X. Writing the SVD $X = U\Sigma V^T$, the projection onto the column span of X is $UU^T$, so $X\vec{c} = UU^T\vec{y}$, giving $\vec{c} = V\Sigma^{-1}U^T\vec{y}$.
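A regression sketch under those formulas (mine): compute $\vec{c} = V\Sigma^{-1}U^T\vec{y}$ and confirm the fitted values are the projection $UU^T\vec{y}$, matching numpy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
c = Vt.T @ ((U.T @ y) / s)               # c = V Sigma^{-1} U^T y

print(np.allclose(X @ c, U @ (U.T @ y)))                      # X c = U U^T y = P_X y
print(np.allclose(c, np.linalg.lstsq(X, y, rcond=None)[0]))   # agrees with lstsq
```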


SLIDE 18

applications of low-rank approximation

Rest of Class: Examples of how low-rank approximation is applied in a variety of data science applications.

  • Used for many reasons other than dimensionality reduction/data compression.


SLIDE 19

matrix completion

Consider a matrix $X \in \mathbb{R}^{n \times d}$ which we cannot fully observe, but believe is close to rank k (i.e., well approximated by a rank-k matrix). Classic example: the Netflix prize problem. Solve:

$$Y = \arg\min_{\text{rank-}k\ B} \sum_{\text{observed } (j,k)} \left( X_{j,k} - B_{j,k} \right)^2.$$

Under certain assumptions, can show that Y well approximates X on both the observed and (most importantly) unobserved entries.
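The minimization above is non-convex and has no closed-form SVD solution, since only some entries are observed. One common heuristic, sketched below under my own assumptions (alternating least squares on a factorization $B = LR^T$, with a small ridge term for stability), is not necessarily the method analyzed in class:

```python
import numpy as np

def als_complete(X, mask, k, iters=50, reg=1e-3):
    """Fit a rank-k matrix L @ R.T to the observed entries of X (mask=True where observed)."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    L = rng.standard_normal((n, k))
    R = rng.standard_normal((d, k))
    I = reg * np.eye(k)
    for _ in range(iters):
        for i in range(n):                  # least-squares update for each row of L
            Ro = R[mask[i]]
            L[i] = np.linalg.solve(Ro.T @ Ro + I, Ro.T @ X[i, mask[i]])
        for j in range(d):                  # least-squares update for each row of R
            Lo = L[mask[:, j]]
            R[j] = np.linalg.solve(Lo.T @ Lo + I, Lo.T @ X[mask[:, j], j])
    return L @ R.T

# Toy check: recover a rank-2 matrix from ~60% of its entries.
rng = np.random.default_rng(9)
X = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
mask = rng.random(X.shape) < 0.6
Y = als_complete(X, mask, k=2)
print(np.abs((Y - X)[~mask]).max())         # small error on the unobserved entries too
```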


SLIDE 20

entity embeddings

Dimensionality reduction embeds d-dimensional vectors into d′ dimensions. But what about when you want to embed objects other than vectors?

  • Documents (for topic-based search and classification)
  • Words (to identify synonyms, translations, etc.)
  • Nodes in a social network

Classical approach: convert each item into a high-dimensional feature vector and then apply low-rank approximation.


SLIDE 21

example: latent semantic analysis

[Figure: the low-rank approximation written as $X \approx YZ^T$, with row $\vec{y}_i$ of Y embedding $\text{doc}_i$ and row $\vec{z}_a$ of Z embedding $\text{word}_a$.]

  • $\langle \vec{y}_i, \vec{z}_a \rangle \approx 1$ when $\text{doc}_i$ contains $\text{word}_a$.
  • If $\text{doc}_i$ and $\text{doc}_j$ both contain $\text{word}_a$, then $\langle \vec{y}_i, \vec{z}_a \rangle \approx \langle \vec{y}_j, \vec{z}_a \rangle = 1$.


SLIDE 22

example: latent semantic analysis

  • The rows $\vec{z}_1, \vec{z}_2, \dots$ of Z give representations of words, with $\vec{z}_i$ and $\vec{z}_j$ tending to have high dot product if $\text{word}_i$ and $\text{word}_j$ appear in many of the same documents.
  • Z corresponds to the top k right singular vectors of X: the eigenvectors of $X^T X$.

Intuitively, what is $X^T X$?

  • $(X^T X)_{i,j}$ = # of documents in which $\text{word}_i$ and $\text{word}_j$ co-occur.
  • A document-based similarity matrix.
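A toy LSA sketch (mine, with a made-up four-document corpus): factor the document-word count matrix with the SVD, take document embeddings $Y = U_k\Sigma_k$ and word embeddings $Z = V_k$, and compare dot products of co-occurring vs. unrelated words.

```python
import numpy as np

docs = ["the cat sat", "the cat ran", "stocks fell", "stocks rose sharply"]
vocab = sorted({w for doc in docs for w in doc.split()})
X = np.array([[doc.split().count(w) for w in vocab] for doc in docs], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = U[:, :k] * s[:k]     # document embeddings (rows of Y), n x k
Z = Vt[:k].T             # word embeddings (rows of Z): top k right singular vectors

idx = {word: i for i, word in enumerate(vocab)}
# 'cat' and 'sat' share documents; 'cat' and 'stocks' do not.
print(Z[idx['cat']] @ Z[idx['sat']], Z[idx['cat']] @ Z[idx['stocks']])
```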


SLIDE 23

example: word embedding

Not obvious how to convert a word into a feature vector that captures the meaning of that word.

  • In LSA, the feature vector is the set of documents that the word appears in.
  • The SVD of the term-document matrix X corresponds to an eigendecomposition of the document-based similarity matrix $X^T X$.
  • Many alternative similarities: how often do $\text{word}_i$, $\text{word}_j$ appear in the same sentence, in the same window of w words, in similar positions of documents in different languages, etc.
  • Replacing $X^T X$ with these different metrics (sometimes appropriately transformed) leads to popular word embedding algorithms: word2vec, GloVe, fastText, etc.
  • These perform low-rank approximation of the similarity matrix directly.
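As a minimal sketch of that recipe (my own construction; word2vec's actual target is a shifted PMI matrix over context windows, per Levy and Goldberg), one can transform a word-word co-occurrence count matrix into positive PMI and factor it:

```python
import numpy as np

def ppmi_embeddings(C, k):
    """Rank-k word vectors from a symmetric word-word co-occurrence count matrix C."""
    total = C.sum()
    pw = C.sum(axis=1) / total                   # marginal word probabilities
    with np.errstate(divide='ignore'):
        pmi = np.log((C / total) / np.outer(pw, pw))
    ppmi = np.maximum(pmi, 0)                    # keep only positive associations
    U, s, _ = np.linalg.svd(ppmi)
    return U[:, :k] * np.sqrt(s[:k])             # Levy-Goldberg-style rank-k factor

rng = np.random.default_rng(10)
C = rng.integers(0, 5, (8, 8)).astype(float)
C = C + C.T                                      # fake symmetric co-occurrence counts
E = ppmi_embeddings(C, k=3)
print(E.shape)                                   # one 3-dimensional vector per word
```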


SLIDE 24

example: word embedding

word2vec was originally described as a neural-network method, but Levy and Goldberg showed that it is simply low-rank approximation of a specific similarity matrix: 'Neural word embedding as implicit matrix factorization.'


SLIDE 25

Next Time: Build on the idea of low-rank approximation of a similarity matrix to perform non-linear dimensionality reduction, for data that is not close to a low-dimensional linear subspace.

Questions?