SLIDE 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

SLIDE 2

• Often, our data can be represented by an m-by-n matrix
• And this matrix can be closely approximated by the product of three matrices that share a small common dimension r

[Diagram: A (m×n) ≈ U (m×r) · S (r×r) · Vᵀ (r×n)]

SLIDE 3

• Compress / reduce dimensionality:
  - 10⁶ rows; 10³ columns; no updates
  - Random access to any cell(s); small error: OK

[Example matrix:
  1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1 ]

Note: The above matrix is really "2-dimensional": all rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1].

New representation, one coefficient pair per row: [1 0], [2 0], [1 0], [5 0], [0 2], [0 3], [0 1]

SLIDE 4

There are hidden, or latent, factors - latent dimensions - that, to a close approximation, explain why the values are as they appear in the data matrix.

SLIDE 5

The axes of these dimensions can be chosen as follows:
  - The first dimension is the direction in which the points exhibit the greatest variance
  - The second dimension is the direction, orthogonal to the first, in which the points show the 2nd greatest variance
  - And so on, until you have enough dimensions that the remaining variance is really low

SLIDE 6

• Q: What is the rank of a matrix A?
• A: The number of linearly independent rows of A
• Cloud of points in 3D space:
  - Think of the point coordinates as a matrix, one row per point: A, B, C
• We can rewrite the coordinates more efficiently!
  - Old basis vectors: [1 0 0], [0 1 0], [0 0 1]
  - New basis vectors: [1 2 1], [-2 -3 1]
  - Then A has new coordinates [1 0], B: [0 1], C: [1 -1]
  - Notice: We reduced the number of coordinates per point!
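To make the change of basis concrete, here is a minimal NumPy sketch. The three 3-D points are assumptions inferred from the stated new coordinates (A = [1 0], B = [0 1], C = [1 -1]), since the slide's figure is not reproduced in this transcript:

  import numpy as np

  # New basis vectors from the slide
  basis = np.array([[ 1.0,  2.0, 1.0],
                    [-2.0, -3.0, 1.0]])

  # Points consistent with the stated new coordinates (assumed; the slide's
  # figure is not in this transcript): A = b1, B = b2, C = b1 - b2
  points = np.array([[ 1.0,  2.0, 1.0],    # A
                     [-2.0, -3.0, 1.0],    # B
                     [ 3.0,  5.0, 0.0]])   # C

  # Solve coords @ basis = points for each point's 2-D coordinates
  coords, *_ = np.linalg.lstsq(basis.T, points.T, rcond=None)
  print(np.round(coords.T, 6))   # -> [[1, 0], [0, 1], [1, -1]]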

SLIDE 7

• The goal of dimensionality reduction is to discover the axes of the data!

Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (its position along the red line in the figure). By doing this we incur a bit of error, as the points do not exactly lie on the line.

SLIDE 8

SLIDE 9

• SVD gives a decomposition of any matrix into a product of three matrices
• There are strong constraints on the form of each of these matrices
  - Results in a unique decomposition
• From this decomposition, you can choose any number r of intermediate concepts (latent factors) in a way that minimizes the reconstruction error

[Diagram: A (m×n) ≈ U (m×r) · S (r×r) · Vᵀ (r×n)]

SLIDE 10

• A: Input data matrix
  - m × n matrix (e.g., m documents, n terms)
• U: Left singular vectors
  - m × r matrix (m documents, r concepts)
• S: Singular values
  - r × r diagonal matrix (strength of each 'concept'); r = rank of the matrix A
• V: Right singular vectors
  - n × r matrix (n terms, r concepts)

[Diagram: A (m×n) ≈ U (m×r) · S (r×r) · Vᵀ (r×n)]
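A minimal NumPy sketch of these shapes and properties, using the users-to-movies ratings matrix that appears on the following slides (only the standard numpy.linalg.svd API is assumed):

  import numpy as np

  A = np.array([[1, 1, 1, 0, 0],
                [3, 3, 3, 0, 0],
                [4, 4, 4, 0, 0],
                [5, 5, 5, 0, 0],
                [0, 2, 0, 4, 4],
                [0, 0, 0, 5, 5],
                [0, 1, 0, 2, 2]], dtype=float)

  U, s, Vt = np.linalg.svd(A, full_matrices=False)   # "thin" SVD
  print(U.shape, s.shape, Vt.shape)   # (7, 5) (5,) (5, 5)
  print(np.round(s, 1))               # -> [12.5  9.5  1.3  0.  0.]: rank(A) = 3
                                      # (the slides round these to 12.4, 9.5, 1.3)
  print(np.allclose(U.T @ U, np.eye(5)))    # True: U is column-orthonormal
  print(np.allclose(Vt @ Vt.T, np.eye(5)))  # True: V is column-orthonormal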

SLIDE 11

A ≈ σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ + …

  σᵢ … scalar
  uᵢ … vector
  vᵢ … vector

If we set σ₂ = 0, then the green columns (the σ₂ term) may as well not exist.

SLIDE 12

It is always possible to decompose a real matrix A into A = U S Vᵀ, where
• U, S, V: unique
• U, V: column orthonormal
  - UᵀU = I; VᵀV = I (I: identity matrix)
  - (Columns are orthogonal unit vectors)
• S: diagonal
  - Entries (singular values) are non-negative and sorted in decreasing order (σ₁ ≥ σ₂ ≥ … ≥ 0)

Nice proof of uniqueness: https://www.cs.cornell.edu/courses/cs322/2008sp/stuff/TrefethenBau_Lec4_SVD.pdf

SLIDE 13

• Consider a matrix. What does SVD do?

Ratings matrix A: each column is a movie, each row a user; the first 4 users prefer SciFi, the others Romance:

             Matrix  Alien  Serenity  Casablanca  Amelie
  SciFi        1       1       1          0         0
  SciFi        3       3       3          0         0
  SciFi        4       4       4          0         0
  SciFi        5       5       5          0         0
  Romance      0       2       0          4         4
  Romance      0       0       0          5         5
  Romance      0       1       0          2         2

A = U S Vᵀ, where the factors expose the "concepts", AKA latent dimensions, AKA latent factors.

SLIDE 14

• A = U S Vᵀ - example: Users to Movies

Columns of A: Matrix, Alien, Serenity, Casablanca, Amelie. Rows: SciFi users first, then Romance users.

  A                U                    S                 Vᵀ
  1 1 1 0 0        0.13  0.02 -0.01
  3 3 3 0 0        0.41  0.07 -0.03     12.4  0    0      0.56  0.59  0.56  0.09  0.09
  4 4 4 0 0    =   0.55  0.09 -0.04  x   0    9.5  0   x  0.12 -0.02  0.12 -0.69 -0.69
  5 5 5 0 0        0.68  0.11 -0.05      0    0    1.3    0.40 -0.80  0.40  0.09  0.09
  0 2 0 4 4        0.15 -0.59  0.65
  0 0 0 5 5        0.07 -0.73 -0.67
  0 1 0 2 2        0.07 -0.29  0.32

SLIDE 15

• A = U S Vᵀ - example: Users to Movies

[Same factorization as above.] The first column of U and first row of Vᵀ capture the SciFi-concept; the second pair captures the Romance-concept.

SLIDE 16

• A = U S Vᵀ - example:

[Same factorization as above.] U is the "user-to-concept" factor matrix; its two leading columns correspond to the SciFi-concept and the Romance-concept.

SLIDE 17

• A = U S Vᵀ - example:

[Same factorization as above.] The first diagonal entry of S, 12.4, is the "strength" of the SciFi-concept.

SLIDE 18

• A = U S Vᵀ - example:

[Same factorization as above.] V is the "movie-to-concept" factor matrix; its first column (first row of Vᵀ) is the SciFi-concept.

SLIDE 19

Movies, users and concepts:
• U: user-to-concept matrix
• V: movie-to-concept matrix
• S: its diagonal elements give the 'strength' of each concept

SLIDE 20

SLIDE 21

[Figure: user points plotted against Movie 1 rating and Movie 2 rating, with v₁, the first right singular vector, drawn through them]

• Instead of using two coordinates (x, y) to describe point positions, let's use only one coordinate
• A point's position is then its location along the vector v₁

SLIDE 22

• A = U S Vᵀ - example:
  - U: "user-to-concept" matrix
  - V: "movie-to-concept" matrix

[Figure: the same factorization as above, plus the points plotted in the (Movie 1 rating, Movie 2 rating) plane with v₁, the first right singular vector, drawn through them]

SLIDE 23

• A = U S Vᵀ - example:

[Same factorization and figure as above.] The first singular value, 12.4, captures the variance ('spread') on the v₁ axis.

SLIDE 24

A = U S Vᵀ - example:
• U S gives the coordinates of the points in the projection axes

  U S:
   1.61  0.19 -0.01
   5.08  0.66 -0.03
   6.82  0.85 -0.05
   8.43  1.04 -0.06
   1.86 -5.60  0.84
   0.86 -6.93 -0.87
   0.86 -2.75  0.41

The first column is the projection of the users on the "Sci-Fi" axis (v₁, the first right singular vector).
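A quick NumPy check of this projection, reusing A and the thin SVD (U, s) from the slide-10 sketch:

  coords = U[:, :3] * s[:3]    # U S with the 3 nonzero concepts; one row per user
  print(np.round(coords, 2))   # first column close to the slide's
                               # ±[1.61, 5.08, 6.82, 8.43, 1.86, 0.86, 0.86]
                               # (singular-vector signs are arbitrary, rounding differs)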

SLIDE 25

More details
• Q: How is dimensionality reduction done?

[Same factorization A = U S Vᵀ as above]

SLIDE 26

More details
• Q: How exactly is dimensionality reduction done?
• A: Set the smallest singular values to zero

[Same factorization A = U S Vᵀ as above]

SLIDE 27

More details
• Q: How exactly is dimensionality reduction done?
• A: Set the smallest singular values to zero

[Same factorization as above, now with ≈ in place of =: the smallest singular value, 1.3, is zeroed out]

SLIDE 28

More details
• Q: How exactly is dimensionality reduction done?
• A: Set the smallest singular values to zero

[Same factorization as above with σ₃ = 1.3 zeroed out]

This is the Rank-2 approximation to A. We could also do a Rank-1 approximation. The larger the rank, the more accurate the approximation.

SLIDE 29

More details
• Q: How exactly is dimensionality reduction done?
• A: Set the smallest singular values to zero

  A                U                S             Vᵀ
  1 1 1 0 0        0.13  0.02
  3 3 3 0 0        0.41  0.07      12.4  0        0.56  0.59  0.56  0.09  0.09
  4 4 4 0 0    ≈   0.55  0.09   x   0    9.5   x  0.12 -0.02  0.12 -0.69 -0.69
  5 5 5 0 0        0.68  0.11
  0 2 0 4 4        0.15 -0.59
  0 0 0 5 5        0.07 -0.73
  0 1 0 2 2        0.07 -0.29

This is the Rank-2 approximation to A: σ₃, the third column of U, and the third row of Vᵀ have been dropped. We could also do a Rank-1 approximation. The larger the rank, the more accurate the approximation.

SLIDE 30

More details
• Q: How exactly is dimensionality reduction done?
• A: Set the smallest singular values to zero

  A                 B (reconstructed data matrix)
  1 1 1 0 0          0.92  0.95  0.92  0.01  0.01
  3 3 3 0 0          2.91  3.01  2.91 -0.01 -0.01
  4 4 4 0 0    ≈     3.90  4.04  3.90  0.01  0.01
  5 5 5 0 0          4.82  5.00  4.82  0.03  0.03
  0 2 0 4 4          0.70  0.53  0.70  4.11  4.11
  0 0 0 5 5         -0.69  1.34 -0.69  4.78  4.78
  0 1 0 2 2          0.32  0.23  0.32  2.01  2.01

Reconstruction error is quantified by the Frobenius norm:
  ‖M‖_F = √(Σᵢⱼ Mᵢⱼ²)
Here ‖A − B‖_F = √(Σᵢⱼ (Aᵢⱼ − Bᵢⱼ)²) is "small".

This is the Rank-2 approximation to A. We could also do a Rank-1 approximation. The larger the rank, the more accurate the approximation.
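A minimal sketch of this truncation, reusing A from slide 13:

  import numpy as np

  A = np.array([[1,1,1,0,0], [3,3,3,0,0], [4,4,4,0,0], [5,5,5,0,0],
                [0,2,0,4,4], [0,0,0,5,5], [0,1,0,2,2]], dtype=float)
  U, s, Vt = np.linalg.svd(A, full_matrices=False)

  k = 2                                   # keep the 2 strongest concepts
  B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # rank-2 reconstruction
  print(np.round(B, 2))                   # matches the reconstructed matrix above
  print(round(np.linalg.norm(A - B, 'fro'), 1))   # ≈ 1.3, the dropped singular value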

SLIDE 31

• Fact: SVD gives the 'best' axes to project on:
  - 'best' = minimizing the sum of reconstruction errors

[Diagram: the full SVD A = U S Vᵀ versus B = U S Vᵀ with the smallest singular values set to zero]

B is the best approximation of A at that rank, where the error is
  ‖A − B‖_F = √(Σᵢⱼ (Aᵢⱼ − Bᵢⱼ)²)

SLIDE 32

• SVD: A = U S Vᵀ: unique
  - U: user-to-concept factors
  - V: movie-to-concept factors
  - S: strength of each concept
• Q: So what's a good value for r (the number of latent factors)?
• Let the energy of a set of singular values be the sum of their squares
• Pick r so the retained singular values have at least 90% of the total energy
• Back to our example:
  - With singular values 12.4, 9.5, and 1.3, total energy = 245.7
  - If we drop 1.3, whose square is only 1.7, we are left with energy 244, or over 99% of the total
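A small sketch of this energy rule (assuming, as with NumPy's output, that the singular values are sorted in decreasing order):

  import numpy as np

  def choose_rank(singular_values, energy_frac=0.90):
      # Smallest r whose leading singular values retain >= energy_frac of total energy
      energy = np.cumsum(np.square(singular_values))
      return int(np.searchsorted(energy, energy_frac * energy[-1]) + 1)

  print(choose_rank(np.array([12.4, 9.5, 1.3])))   # -> 2 (retains 244/245.7 ≈ 99%)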

SLIDE 33

SLIDE 34

• How do we actually compute the SVD?
• First we need a method for finding the principal eigenvalue (the largest one) and the corresponding eigenvector of a symmetric matrix
  - M is symmetric if mⱼₖ = mₖⱼ for all j and k
• Method (power iteration):
  - Start with any "guess eigenvector" x₀
  - Construct x_{l+1} = M xₗ / ‖M xₗ‖ for l = 0, 1, …
  - ‖…‖ denotes the Frobenius norm (for a vector, its length)
  - Stop when consecutive xₗ show little change

SLIDE 35

  M = [1 2; 2 3],   x₀ = [1; 1]

  M x₀ / ‖M x₀‖ = [3; 5] / √34 = [0.51; 0.86] = x₁
  M x₁ / ‖M x₁‖ = [2.23; 3.60] / √17.93 = [0.53; 0.85] = x₂
  …
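A minimal NumPy sketch of this iteration, reproducing the numbers above (it also computes the eigenvalue via λ = xᵀMx, anticipating the next slide):

  import numpy as np

  def power_iteration(M, tol=1e-9, max_iter=1000):
      # Principal eigenvector of a symmetric M by repeated normalized multiplication
      x = np.ones(M.shape[0])
      x /= np.linalg.norm(x)
      for _ in range(max_iter):
          x_next = M @ x
          x_next /= np.linalg.norm(x_next)
          if np.linalg.norm(x_next - x) < tol:
              break
          x = x_next
      return x_next

  M = np.array([[1.0, 2.0],
                [2.0, 3.0]])
  x = power_iteration(M)
  print(np.round(x, 2))        # ≈ [0.53, 0.85]
  print(round(x @ M @ x, 2))   # eigenvalue ≈ 4.24 (the slide gets 4.25 from the rounded vector)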

SLIDE 36

• Once you have the principal eigenvector x, you find its eigenvalue λ by λ = xᵀ M x
  - In proof: we know λx = Mx if λ is the eigenvalue; multiply both sides by xᵀ on the left
  - Since xᵀx = 1, we have λ = xᵀ M x
• Example: if we take xᵀ = [0.53 0.85], then

  λ = [0.53 0.85] [1 2; 2 3] [0.53; 0.85] = 4.25

SLIDE 37

• Eliminate the portion of the matrix M that can be generated by the first eigenpair, λ and x:
  M* := M − λ x xᵀ
• Recursively find the principal eigenpair for M*, eliminate the effect of that pair, and so on
• Example:

  M* = [1 2; 2 3] − 4.25 [0.53; 0.85] [0.53 0.85] = [−0.19 0.09; 0.09 −0.07]
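A sketch of the deflation step, continuing the power-iteration code above:

  lam = x @ M @ x                      # eigenvalue of the principal eigenvector x
  M_star = M - lam * np.outer(x, x)    # M* = M - lambda x x^T
  print(np.round(M_star, 2))           # ≈ [[-0.17, 0.11], [0.11, -0.07]]
                                       # (slide shows [-0.19 0.09; 0.09 -0.07]
                                       #  from the rounded x and lambda)

  # Iterating on M* finds the next eigenpair. Note: its dominant eigenvalue is
  # negative here, so the iterates flip sign each step and the stopping test
  # never fires; the returned direction and Rayleigh quotient are still correct.
  x2 = power_iteration(M_star)
  print(round(x2 @ M_star @ x2, 2))    # second eigenvalue ≈ -0.24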

SLIDE 38

• Start by supposing A = U S Vᵀ
• Aᵀ = (U S Vᵀ)ᵀ = (Vᵀ)ᵀ Sᵀ Uᵀ = V S Uᵀ
  - Why? (1) Rule for the transpose of a product; (2) the transpose of a transpose and the transpose of a diagonal matrix are both identity operations
• Aᵀ A = V S Uᵀ U S Vᵀ = V S² Vᵀ
  - Why? U is column-orthonormal, so UᵀU is an identity matrix
  - Also note that S² is a diagonal matrix whose j-th element is the square of the j-th element of S
• Aᵀ A V = V S² Vᵀ V = V S²
  - Why? V is also column-orthonormal

SLIDE 39

• Starting with Aᵀ A V = V S²
  - Note that therefore the j-th column of V is an eigenvector of AᵀA, and its eigenvalue is the j-th element of S²
• Thus, we can find V and S by finding the eigenpairs of AᵀA
  - Once we have the eigenvalues in S², we can find the singular values by taking the square root of these eigenvalues
• By a symmetric argument, A Aᵀ gives us U
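A sketch of this route to the SVD, checked against numpy.linalg.svd (np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order, so they are re-sorted here):

  import numpy as np

  A = np.array([[1,1,1,0,0], [3,3,3,0,0], [4,4,4,0,0], [5,5,5,0,0],
                [0,2,0,4,4], [0,0,0,5,5], [0,1,0,2,2]], dtype=float)

  evals, V = np.linalg.eigh(A.T @ A)     # eigenpairs of A^T A (ascending order)
  order = np.argsort(evals)[::-1]        # re-sort to match the SVD convention
  evals, V = evals[order], V[:, order]

  s = np.sqrt(np.clip(evals, 0.0, None))   # singular values = sqrt of eigenvalues
  print(np.round(s, 1))                    # ≈ [12.5  9.5  1.3  0.  0.]
  print(np.allclose(s, np.linalg.svd(A, compute_uv=False), atol=1e-6))   # True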

SLIDE 40

• To compute the full SVD using specialized methods:
  - O(nm²) or O(n²m) (whichever is less)
• But:
  - Less work if we just want the singular values,
  - or if we want only the first k singular vectors,
  - or if the matrix is sparse
• Implemented in linear algebra packages like
  - LINPACK, Matlab, SPlus, Mathematica, ...

SLIDE 41

SLIDE 42

• Q: Find users that like 'Matrix'
• A: Map the query into a 'concept space' - how?

[Same factorization A = U S Vᵀ as above, with SciFi/Romance user groups and movies Matrix, Alien, Serenity, Casablanca, Amelie]

SLIDE 43

• Q: Find users that like 'Matrix'
• A: Map the query into a 'concept space' - how?

A query vector over the movies (Matrix, Alien, Serenity, Casablanca, Amelie):
  q = [5 0 0 0 0]

Project into concept space: take the inner product with each 'concept' vector vᵢ

SLIDE 44

• Q: Find users that like 'Matrix'
• A: Map the query into a 'concept space' - how?

[Figure: q = [5 0 0 0 0] plotted in the (Matrix, Alien) plane, with its projection q·v₁ onto the first concept axis v₁]

Project into concept space: take the inner product with each 'concept' vector vᵢ

SLIDE 45

Compactly, we have: q_concept = q V. E.g.:

  q = [5 0 0 0 0]   (Matrix, Alien, Serenity, Casablanca, Amelie)

  movie-to-concept factors V (2 concepts kept):
    0.56  0.12
    0.59 -0.02
    0.56  0.12
    0.09 -0.69
    0.09 -0.69

  q_concept = q V = [2.8  0.6]   (SciFi-concept: 2.8; Romance-concept: 0.6)

SLIDE 46

• How would the user d that rated ('Alien', 'Serenity') be handled? d_concept = d V. E.g.:

  d = [0 4 5 0 0]   (Matrix, Alien, Serenity, Casablanca, Amelie)

  d_concept = d V = [5.2  0.4]   (using the same movie-to-concept factors V as above)
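A sketch of this mapping, reusing A from slide 13. Cosine similarity is an assumed choice for comparing the two concept vectors; the slides only state that the similarity is greater than zero:

  import numpy as np

  A = np.array([[1,1,1,0,0], [3,3,3,0,0], [4,4,4,0,0], [5,5,5,0,0],
                [0,2,0,4,4], [0,0,0,5,5], [0,1,0,2,2]], dtype=float)
  _, _, Vt = np.linalg.svd(A, full_matrices=False)
  V = Vt[:2].T                      # movie-to-concept factors, 2 concepts kept

  q = np.array([5.0, 0, 0, 0, 0])   # rated 'Matrix' only
  d = np.array([0.0, 4, 5, 0, 0])   # rated 'Alien' and 'Serenity' only

  q_c, d_c = q @ V, d @ V           # map into concept space
  print(np.round(q_c, 1), np.round(d_c, 1))   # ≈ ±[2.8 0.6] and ±[5.2 0.4]
                                              # (signs and rounding may differ)
  cos = (q_c @ d_c) / (np.linalg.norm(q_c) * np.linalg.norm(d_c))
  print(round(cos, 2))   # close to 1: similar users, despite zero ratings in common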

SLIDE 47

• Observation: User d that rated ('Alien', 'Serenity') will be similar to user q that rated ('Matrix'), although d and q have zero ratings in common!

  q = [5 0 0 0 0]  →  q_concept = [2.8  0.6]
  d = [0 4 5 0 0]  →  d_concept = [5.2  0.4]

Zero ratings in common, but similarity > 0: both users load mostly on the SciFi-concept.

SLIDE 48

+ Optimal low-rank approximation in terms of Frobenius norm
− Interpretability problem:
  - A singular vector specifies a linear combination of all input columns or rows
− Lack of sparsity:
  - Singular vectors are dense!

[Diagram: A = U S Vᵀ with dense factors U, S, Vᵀ]

SLIDE 49

SLIDE 50

• It is common for the matrix A that we wish to decompose to be very sparse
• But U and V from an SVD decomposition will not be sparse
• CUR decomposition solves this problem by using only (randomly chosen) rows and columns of A

SLIDE 51

• Goal: Express A as a product of matrices C, U, R:
  make ‖A − C · U · R‖_F small
• "Constraints" on C and R: they are built from actual columns and rows of A

[Diagram: A ≈ C · U · R]

Frobenius norm: ‖X‖_F = √(Σᵢⱼ Xᵢⱼ²)

SLIDE 52

• Goal: Express A as a product of matrices C, U, R:
  make ‖A − C · U · R‖_F small
• "Constraints" on C and R as above; U is built from the pseudo-inverse of the intersection of C and R

[Diagram: A ≈ C · U · R]

Frobenius norm: ‖X‖_F = √(Σᵢⱼ Xᵢⱼ²)

SLIDE 53

• Let W be the "intersection" of the sampled columns C and rows R
• Def: W⁺ is the pseudoinverse:
  - Let the SVD of W be W = X Z Yᵀ
  - Then: W⁺ = Y Z⁺ Xᵀ
  - Z⁺: reciprocals of the non-zero singular values: Z⁺ᵢᵢ = 1/Zᵢᵢ
• Let: U = Y (Z⁺)² Xᵀ

Why the intersection? These are high-magnitude numbers.
Why does the pseudoinverse work? If W = X Z Yᵀ, then W⁻¹ = (Yᵀ)⁻¹ Z⁻¹ X⁻¹. Due to orthonormality, X⁻¹ = Xᵀ and Y⁻¹ = Yᵀ; since Z is diagonal, Z⁻¹ᵢᵢ = 1/Zᵢᵢ. Thus, if W is nonsingular, the pseudoinverse is the true inverse.

[Diagram: A with sampled columns C, sampled rows R, and their intersection W]
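A sketch of building the middle matrix U from the intersection W, following the slide's formula U = Y (Z⁺)² Xᵀ; the 2×2 W below is a hypothetical example, not taken from the deck:

  import numpy as np

  def cur_middle(W):
      # U = Y (Z+)^2 X^T, where W = X Z Y^T is the SVD of the intersection block
      X, z, Yt = np.linalg.svd(W, full_matrices=False)
      # Reciprocals of the non-zero singular values (zeros stay zero)
      z_plus = np.divide(1.0, z, out=np.zeros_like(z), where=z > 1e-12)
      return Yt.T @ np.diag(z_plus ** 2) @ X.T

  W = np.array([[4.0, 0.0],      # hypothetical intersection of sampled columns/rows
                [1.0, 3.0]])
  print(cur_middle(W))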

SLIDE 54

• To decrease the expected error between A and its decomposition, we must pick rows and columns in a nonuniform manner
• The importance of a row or column of A is the square of its Frobenius norm
  - That is, the sum of the squares of its elements
• When picking rows and columns, the probabilities must be proportional to importance
• Example: [3,4,5] has importance 50, and [3,0,1] has importance 10, so pick the first 5 times as often as the second

SLIDE 55

• Sampling columns (similarly for rows):

[The algorithm box from this slide is not reproduced in this transcript; see the sketch below.]

Note: this is a randomized algorithm; the same column can be sampled more than once.
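Below is a sketch of the column-sampling step the missing box describes, based on slide 54's importance weighting. The 1/√(c·P(j)) rescaling of each sampled column is the usual convention in the CUR literature and is an assumption here, since this transcript does not spell it out:

  import numpy as np

  def column_select(A, c, seed=0):
      # Sample c columns with replacement, with probability proportional to
      # squared column norm (the column's "importance"); rescale to keep the
      # estimate unbiased.
      rng = np.random.default_rng(seed)
      importance = (A ** 2).sum(axis=0)          # sum of squares of each column
      p = importance / importance.sum()          # sampling probabilities
      idx = rng.choice(A.shape[1], size=c, p=p)  # the same column may repeat
      return A[:, idx] / np.sqrt(c * p[idx]), idx

  A = np.array([[1,1,1,0,0], [3,3,3,0,0], [4,4,4,0,0], [5,5,5,0,0],
                [0,2,0,4,4], [0,0,0,5,5], [0,1,0,2,2]], dtype=float)
  C, idx = column_select(A, c=2)
  print(idx, C.shape)   # two sampled column indices; C is 7 x 2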

SLIDE 56

• Rough and imprecise intuition behind CUR:
  - CUR is more likely to pick points away from the origin
  - Assuming smooth data with no outliers, these are the directions of maximum variation
• Example: Assume we have 2 point clouds at an angle
  - SVD dimensions are orthogonal and thus will lie in the middle of the two clouds
  - CUR will find the two clouds (but will be redundant)

[Figure: a singular vector between the clouds vs. actual columns along each cloud]

SLIDE 57

• For example:
  - Select c = O(k log k / ε²) columns of A using the ColumnSelect algorithm (slide 55)
  - Select r = O(k log k / ε²) rows of A using the RowSelect algorithm (the same procedure, applied to rows)
  - Set U = Y (Z⁺)² Xᵀ (slide 53)
• Then, with probability 98%:

  ‖A − C U R‖_F  ≤  (2 + ε) ‖A − Aₖ‖_F

  (left: CUR error; right: SVD error, where Aₖ is the best rank-k approximation)

In practice: pick 4k columns/rows for a "rank-k" approximation.

SLIDE 58

+ Easy interpretation
  - Since the basis vectors are actual columns and rows
+ Sparse basis
  - Since the basis vectors are actual columns and rows
− Duplicate columns and rows
  - Columns of large norms will be sampled many times

[Figure: singular vector vs. actual column]

SLIDE 59

SVD: A = U S Vᵀ
  A: huge but sparse; U, S, Vᵀ: big and dense

CUR: A = C U R
  A: huge but sparse; C: big but sparse; U: dense but small; R: sparse and small

SLIDE 60

• DBLP bibliographic data
  - Author-to-conference big sparse matrix
  - Aᵢⱼ: number of papers published by author i at conference j
  - 428K authors (rows), 3659 conferences (columns)
  - Very sparse
• Want to reduce dimensionality
  - How much time does it take?
  - What is the reconstruction error?
  - How much space do we need?

SLIDE 61

• Accuracy:
  - 1 − relative sum of squared errors
• Space ratio:
  - #output matrix entries / #input matrix entries
• CPU time

[Plots comparing SVD, CUR, and CUR with no duplicates on these three metrics]

Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM '07.