CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
High dimension == many features
Find concepts/topics/genres:
- Documents: features are thousands of words, millions of word pairs
- Surveys: e.g., Netflix, 480k users x 17.7k movies
1/24/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets 2
Compress / reduce dimensionality:
- 10^6 rows; 10^3 columns; no updates
- random access to any cell(s); small error: OK
Assumption: Data lies on or near a low d-dimensional subspace
Axes of this subspace are an effective representation of the data
Why reduce dimensionality?
- Discover hidden correlations/topics
  - Words that commonly occur together
- Remove redundant and noisy features
  - Not all words are useful
- Interpretation and visualization
- Easier storage and processing of the data
A[n x m] = U[n x r] Σ[r x r] (V[m x r])^T
A: Input data matrix
- n x m matrix (e.g., n documents, m terms)
U: Left singular vectors
- n x r matrix (n documents, r concepts)
Σ: Singular values
- r x r diagonal matrix (strength of each 'concept')
- r: rank of the matrix A
V: Right singular vectors
- m x r matrix (m terms, r concepts)
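The shapes listed above can be checked numerically. This is a minimal sketch, assuming NumPy; the helper name `compact_svd`, its tolerance, and the random rank-2 "documents x terms" matrix are illustrative choices, not part of the lecture.

```python
import numpy as np

def compact_svd(A, tol=1e-10):
    """Return U (n x r), sigma (length r), V (m x r), with r the numerical rank of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Count singular values that are not negligible relative to the largest one.
    r = int(np.sum(s > tol * s[0]))
    return U[:, :r], s[:r], Vt[:r, :].T

rng = np.random.default_rng(0)
# Hypothetical data: 6 "documents" x 4 "terms", built as a product so that rank(A) = 2.
A = rng.random((6, 2)) @ rng.random((2, 4))

U, sigma, V = compact_svd(A)
print(U.shape, sigma.shape, V.shape)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(sigma) @ V.T
```

Here n = 6, m = 4 and r = 2, so U is 6 x 2, Σ is 2 x 2 and V is 4 x 2.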
[Figure: A (n x m) ≈ U Σ V^T, drawn as a product of three matrix blocks]
[Figure: A ≈ σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ..., where each σ_i is a scalar and each u_i, v_i is a vector]
It is always possible to decompose a real matrix A into A = U Σ V^T, where
U, Σ, V: unique (Σ always; U and V up to a sign flip of each pair u_i, v_i when the singular values are distinct)
U, V: column orthonormal:
- U^T U = I; V^T V = I (I: identity matrix)
- (Columns are orthogonal unit vectors)
Σ: diagonal
- Entries (singular values) are positive, and sorted in decreasing order (σ_1 ≥ σ_2 ≥ σ_3 ≥ ...)
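These properties are easy to verify on a small matrix. This is a sketch assuming NumPy and a hypothetical random 6 x 4 matrix; it also shows why U and V are unique only up to sign.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))          # hypothetical full-rank 6 x 4 data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Columns of U and V are orthogonal unit vectors.
u_orthonormal = np.allclose(U.T @ U, np.eye(4))
v_orthonormal = np.allclose(V.T @ V, np.eye(4))

# Singular values are non-negative and sorted in decreasing order.
sorted_desc = bool(np.all(s[:-1] >= s[1:]) and np.all(s >= 0))

# Flipping the sign of a matching pair (u_1, v_1) gives another valid SVD,
# so U and V are unique only up to such sign choices.
U2, V2 = U.copy(), V.copy()
U2[:, 0] *= -1
V2[:, 0] *= -1
same_product = np.allclose(U2 @ np.diag(s) @ V2.T, A)
```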
A = U Σ V^T - example: users-to-movies ratings
Columns of A: Matrix, Alien, Serenity, Casablanca, Amelie
Rows 1-4 of A: SciFi users; rows 5-7: Romance users

A =
  1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1

U =
  0.18 0
  0.36 0
  0.18 0
  0.90 0
  0    0.53
  0    0.80
  0    0.27

Σ =
  9.64 0
  0    5.29

V^T =
  0.58 0.58 0.58 0    0
  0    0    0    0.71 0.71
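The numbers in this example can be reproduced as follows. This is a sketch assuming NumPy; the sign normalization is an illustrative choice, needed only because a library may return -u_i and -v_i instead of u_i and v_i.

```python
import numpy as np

# Users x movies ratings; columns: Matrix, Alien, Serenity, Casablanca, Amelie.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))            # numerical rank of A
U, s, V = U[:, :r], s[:r], Vt[:r].T   # keep only the r concepts

# Resolve the sign ambiguity so that each concept vector has a positive sum.
signs = np.sign(V.sum(axis=0))
U, V = U * signs, V * signs

print(np.round(s, 2))     # singular values ('strength' of each concept)
print(np.round(U, 2))     # user-to-concept matrix
print(np.round(V.T, 2))   # movie-to-concept matrix, transposed
```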
A = U Σ V^T - example, reading the factors:
- The two columns of U (and of V) correspond to the SciFi-concept and the Romance-concept
- U: user-to-concept similarity matrix
- Σ: the entry 9.64 is the 'strength' of the SciFi-concept
- V: movie-to-concept similarity matrix (its first column is the SciFi-concept)
'movies', 'users' and 'concepts':
- U: user-to-concept similarity matrix
- V: movie-to-concept similarity matrix
- Σ: its diagonal elements give the 'strength' of each concept
SVD gives the best axis to project on:
- 'best' = minimizes the sum of squares of projection errors
- i.e., minimum reconstruction error
[Figure: users plotted by Movie 1 rating vs. Movie 2 rating, with the first singular vector v1 drawn as the projection axis]
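The 'best axis' claim can be tested by brute force in two dimensions. This is a sketch assuming NumPy; the five two-movie ratings and the helper `projection_error` are hypothetical. Since no centering is applied, the axis is the best line through the origin.

```python
import numpy as np

def projection_error(A, v):
    """Sum of squared distances from the rows of A to the line spanned by v."""
    v = v / np.linalg.norm(v)
    residual = A - np.outer(A @ v, v)
    return float(np.sum(residual ** 2))

# Hypothetical ratings of two movies by five users (rows = users).
A = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.3], [4.0, 3.8], [5.0, 5.1]])

_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                               # first right singular vector
best = projection_error(A, v1)

# Compare against a sweep of other directions through the origin.
angles = np.linspace(0.0, np.pi, 181)
sweep = [projection_error(A, np.array([np.cos(t), np.sin(t)])) for t in angles]
print(best, min(sweep))
```

No direction in the sweep beats v1, and the error along v1 equals σ_2^2, the energy of the discarded axis.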
A = U Σ V^T - example:
- v1: first right singular vector (first row of V^T), the first projection axis
- σ_1 = 9.64: variance ('spread') on the v1 axis
- U Σ: gives the coordinates of the points in the projection axis
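A short check that U Σ holds the coordinates of each user along the concept axes, and that the same coordinates come from projecting the rows of A onto V. This is a sketch assuming NumPy; the variable names are illustrative.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                     # number of concepts kept

coords = U[:, :k] * s[:k]                 # U Σ: one row of concept coordinates per user
coords_via_projection = A @ Vt[:k].T      # A V: project each user's ratings onto v1, v2

print(np.round(np.abs(coords), 2))        # absolute values, since signs are arbitrary
```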
More details
Q: How exactly is dim. reduction done?
A: Set the smallest singular values to zero
- In the example, set σ_2 = 5.29 to zero and keep σ_1 = 9.64
- The second column of U and the second row of V^T then drop out:
  A ≈ (0.18, 0.36, 0.18, 0.90, 0, 0, 0)^T x 9.64 x (0.58, 0.58, 0.58, 0, 0)
- Result: A ≈ B, where

B =
  1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 0 0
  0 0 0 0 0
  0 0 0 0 0
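The truncation step can be written in a few lines. This is a sketch assuming NumPy; `rank_k_approximation` is a hypothetical helper name.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Zero out all but the k largest singular values and multiply back."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_trunc = s.copy()
    s_trunc[k:] = 0.0                      # set the smallest singular values to zero
    return U @ np.diag(s_trunc) @ Vt

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

B = rank_k_approximation(A, 1)
print(np.round(np.abs(B), 2))
```

With k = 1 only the SciFi block survives; the Romance ratings are replaced by zeros.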
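Theorem: Let A = U Σ V^T (σ_1 ≥ σ_2 ≥ ..., rank(A) = r)
then B = U S V^T
- S = diagonal r x r matrix where s_i = σ_i (i = 1...k), else s_i = 0
is a best rank-k approximation to A:
- B is a solution to min_B ǁA - Bǁ_F where rank(B) = k
Why? U and V are column orthonormal, so they do not change the Frobenius norm (written squared here):

\min_{B,\ \mathrm{rank}(B)=k} \|A - B\|_F^2 = \min_{S} \|\Sigma - S\|_F^2 = \min_{s_i} \sum_{i=1}^{r} (\sigma_i - s_i)^2

= \min_{s_i} \left( \sum_{i=1}^{k} (\sigma_i - s_i)^2 + \sum_{i=k+1}^{r} \sigma_i^2 \right) = \sum_{i=k+1}^{r} \sigma_i^2

The first sum vanishes when s_i = σ_i for i = 1...k, which leaves only the discarded singular values.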
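A numerical illustration of the theorem. This is a sketch assuming NumPy; the 200 random rank-k competitors are only a spot check, not a proof.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

k = 1
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = (U[:, :k] * s[:k]) @ Vt[:k]            # truncated SVD, rank k

err_sq = np.linalg.norm(A - B, "fro") ** 2 # squared Frobenius error of B
tail_sq = float(np.sum(s[k:] ** 2))        # sum of the discarded sigma_i^2

# Random rank-k matrices should never achieve a smaller error than B.
rng = np.random.default_rng(0)
competitors = [
    np.linalg.norm(A - rng.normal(size=(7, k)) @ rng.normal(size=(k, 5)), "fro") ** 2
    for _ in range(200)
]
print(err_sq, tail_sq, min(competitors))
```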
Equivalent: 'spectral decomposition' of the matrix:
A = [u_1 u_2] x diag(σ_1, σ_2) x [v_1 v_2]^T
  = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ...
- A is n x m; each u_i is n x 1 and each v_i^T is 1 x m
- r terms; assume σ_1 ≥ σ_2 ≥ σ_3 ≥ ...
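The sum-of-outer-products form can be checked directly. This is a sketch assuming NumPy; `terms` and `A_sum` are illustrative names.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# One rank-1 term sigma_i * u_i * v_i^T per singular value (n x 1 times 1 x m).
terms = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s))]
A_sum = sum(terms)
```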
Q: How many σs to keep?
A: Rule of thumb: keep 80-90% of the 'energy' (= Σ_i σ_i^2)
A = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ...   (assume σ_1 ≥ σ_2 ≥ σ_3 ≥ ...)
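The rule of thumb translates into a small helper. This is a sketch assuming NumPy; `choose_k` and its default threshold are hypothetical.

```python
import numpy as np

def choose_k(singular_values, energy_fraction=0.9):
    """Smallest k whose leading singular values retain the requested share of energy."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First position where the cumulative share reaches the threshold.
    k = int(np.searchsorted(cumulative, energy_fraction)) + 1
    return min(k, len(energy))

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

s = np.linalg.svd(A, compute_uv=False)
share_of_first = float(s[0] ** 2 / np.sum(s ** 2))
print(share_of_first, choose_k(s, 0.9))
```

In the ratings example σ_1 alone carries 93/121, about 77% of the energy, so the 80-90% rule keeps both concepts.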
To compute SVD:
- O(nm^2) or O(n^2 m) (whichever is less)
But:
- Less work if we just want singular values
- or if we want only the first k singular vectors
- or if the matrix is sparse
Implemented in linear algebra packages such as LINPACK, Matlab, SPlus, Mathematica ...
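As one example of the 'just singular values' case, NumPy (used here as an assumption; it is not one of the packages named above) lets the caller skip computing U and V:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))            # hypothetical dense data matrix

# Singular values only: U and V are not returned.
s_only = np.linalg.svd(A, compute_uv=False)

# Full (thin) decomposition for comparison.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
```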
SVD: A = U Σ V^T: unique
- U: user-to-concept similarities
- V: movie-to-concept similarities
- Σ: strength of each concept
Dimensionality reduction:
- keep the few largest singular values (80-90% of 'energy')
- SVD: picks up linear correlations
SVD gives us:
- A = U Σ V^T
Eigen-decomposition:
- A = X Λ X^T
- A is symmetric
- U, V, X are orthonormal (U^T U = I)
- Λ, Σ are diagonal
What is:
- A A^T = U Σ V^T (U Σ V^T)^T = U Σ V^T (V Σ^T U^T) = U Σ Σ^T U^T
- A^T A = V Σ^T U^T (U Σ V^T) = V Σ Σ^T V^T
Both have the form X Λ X^T. So, λ_i = σ_i^2
A A^T = U Σ^2 U^T
A^T A = V Σ^2 V^T
(A^T A)^k = V Σ^{2k} V^T
- E.g.: (A^T A)^2 = V Σ^2 V^T V Σ^2 V^T = V Σ^4 V^T
(A^T A)^k ~ v_1 σ_1^{2k} v_1^T for k >> 1
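Both facts, λ_i = σ_i^2 and the dominance of v_1 in high powers of A^T A, can be checked numerically. This is a sketch assuming NumPy; `power_iteration`, its iteration count and its random start are illustrative choices.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
M = A.T @ A

# Eigenvalues of the symmetric matrix A^T A, sorted in decreasing order.
eigvals = np.linalg.eigvalsh(M)[::-1]

def power_iteration(M, iters=100, seed=0):
    """Repeatedly apply M and renormalize; converges to the top eigenvector."""
    x = np.random.default_rng(seed).random(M.shape[0])
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x

v = power_iteration(M)
lam = float(v @ M @ v)                     # Rayleigh quotient, approximates sigma_1^2
print(np.round(eigvals, 4), lam)
```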
Q: Find users that like 'Matrix' and 'Alien'
A: Map the query into a 'concept space'. How?
- Use the same decomposition A = U Σ V^T of the ratings matrix (movies: Matrix, Alien, Serenity, Casablanca, Amelie; concepts: SciFi, Romance)
Q: Find users that like 'Matrix'
A: Map query vectors into 'concept space'. How?
q = [5 0 0 0 0]   (ratings for Matrix, Alien, Serenity, Casablanca, Amelie)
Project into concept space: inner product with each 'concept' vector v_i
[Figure: q drawn in the Matrix/Alien rating plane together with the axes v1 and v2; q*v1 is the projection of q onto v1]
Compactly, we have: q_concept = q V
E.g.:
q = [5 0 0 0 0]   (Matrix, Alien, Serenity, Casablanca, Amelie)

V (movie-to-concept similarities) =
  0.58 0
  0.58 0
  0.58 0
  0    0.71
  0    0.71

q_concept = q V = [2.9 0]   (first entry: SciFi-concept)
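The mapping is a single vector-matrix product. This is a sketch assuming NumPy, with V typed in from the rounded values on the slide:

```python
import numpy as np

# Movie-to-concept matrix V (rows: Matrix, Alien, Serenity, Casablanca, Amelie).
V = np.array([
    [0.58, 0.0],
    [0.58, 0.0],
    [0.58, 0.0],
    [0.0, 0.71],
    [0.0, 0.71],
])

q = np.array([5.0, 0.0, 0.0, 0.0, 0.0])   # query: rated 'Matrix' with 5
q_concept = q @ V                          # inner product with each concept vector
print(q_concept)
```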
How would the user ('Alien', 'Serenity') be handled?
d_concept = d V
E.g.:
d = [0 4 5 0 0]   (Matrix, Alien, Serenity, Casablanca, Amelie)
d_concept = d V = [5.22 0]   (V: movie-to-concept similarities; first entry: SciFi-concept)
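Observation: User ('Alien', 'Serenity') will be retrieved by query ('Matrix'), although that user did not rate 'Matrix'!
d = [0 4 5 0 0]  ->  d_concept = [5.22 0]
q = [5 0 0 0 0]  ->  q_concept = [2.9 0]
(movies: Matrix, Alien, Serenity, Casablanca, Amelie; first concept entry: SciFi-concept)
q and d have no rated movie in common, yet both lie along the SciFi-concept, so their similarity in concept space is non-zero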
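The observation can be made concrete by comparing q and d in both spaces. This is a sketch assuming NumPy and cosine similarity as the similarity measure, which the slides do not specify.

```python
import numpy as np

V = np.array([
    [0.58, 0.0],
    [0.58, 0.0],
    [0.58, 0.0],
    [0.0, 0.71],
    [0.0, 0.71],
])

q = np.array([5.0, 0.0, 0.0, 0.0, 0.0])   # query: 'Matrix'
d = np.array([0.0, 4.0, 5.0, 0.0, 0.0])   # user who rated 'Alien' and 'Serenity'

def cosine(x, y):
    """Cosine similarity; assumed here as the similarity measure."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

sim_movie_space = cosine(q, d)             # no movie in common
q_concept, d_concept = q @ V, d @ V
sim_concept_space = cosine(q_concept, d_concept)
print(sim_movie_space, sim_concept_space)
```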
SVD, pros (+) and cons (-):
+ Optimal low-rank approximation (in L2 norm)
- Interpretability problem: a singular vector specifies a linear combination of all input columns or rows
- Lack of sparsity: singular vectors are dense
[Figure: A = U Σ V^T]
Goal: make ǁA - C U Rǁ_F small
[Figure: A ≈ C x U x R]
- U: pseudo-inverse of the intersection of C and R
Frobenius norm: ǁXǁ_F = sqrt(Σ_ij X_ij^2)
Let A_k be the 'best' rank-k approximation to A (i.e., SVD)
Theorem [Drineas et al.]: CUR in O(mn) time achieves
- ǁA - CURǁ_F ≤ ǁA - A_kǁ_F + ε ǁAǁ_F
with probability at least 1-δ, by picking
- O(k log(1/δ)/ε^2) columns, and
- O(k^2 log^3(1/δ)/ε^6) rows
Sample columns (similarly for rows):
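The sampling procedure itself did not survive extraction. This is a minimal sketch, assuming the norm-squared sampling scheme commonly used for CUR: column j is drawn with probability proportional to its squared norm, with replacement, and rescaled by 1/sqrt(c p_j). The function name `sample_columns` and its return values are hypothetical.

```python
import numpy as np

def sample_columns(A, c, rng):
    """Draw c columns of A with probability proportional to squared column norm."""
    col_energy = np.sum(A ** 2, axis=0)
    p = col_energy / col_energy.sum()          # sampling distribution over columns
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])        # rescale each sampled column
    return C, idx, p

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

rng = np.random.default_rng(0)
C, idx, p = sample_columns(A, 4, rng)
print(idx, np.round(p, 3))
```

Rows can be sampled the same way by applying the function to A^T and transposing the result.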
Let W be the 'intersection' of sampled columns C and rows R
- Let the SVD of W be W = X Σ Y^T
Then: U = W^+ = Y Σ^+ X^T
- Σ^+: reciprocals of the non-zero singular values: Σ^+_ii = 1/Σ_ii
- W^+ is the Moore-Penrose pseudoinverse of W
[Figure: A ≈ C x U x R, with W the intersection of C and R and U = W^+]
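The construction of U can be sketched as follows, assuming NumPy. To keep the example deterministic, the columns and rows are hand-picked (a hypothetical choice) rather than sampled and rescaled; on this rank-2 matrix, two independent columns and rows already reproduce A exactly.

```python
import numpy as np

def pseudoinverse_via_svd(W, tol=1e-10):
    """Moore-Penrose pseudoinverse W+ = Y Sigma+ X^T, where W = X Sigma Y^T."""
    X, sigma, Yt = np.linalg.svd(W, full_matrices=False)
    sigma_plus = np.zeros_like(sigma)
    nonzero = sigma > tol
    sigma_plus[nonzero] = 1.0 / sigma[nonzero]   # reciprocals of non-zero singular values
    return Yt.T @ np.diag(sigma_plus) @ X.T

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

cols = [0, 3]                 # hypothetical pick: 'Matrix' and 'Casablanca'
rows = [3, 5]                 # hypothetical pick: one SciFi user, one Romance user
C = A[:, cols]                # actual columns of A
R = A[rows, :]                # actual rows of A
W = A[np.ix_(rows, cols)]     # intersection of the chosen rows and columns
U = pseudoinverse_via_svd(W)
A_cur = C @ U @ R
print(np.round(A_cur, 2))
```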
CUR, pros (+) and cons (-):
+ Easy interpretation: the basis vectors are actual columns and rows
+ Sparse basis: the basis vectors are actual columns and rows
- Duplicate columns and rows: columns of large norms will be sampled many times
[Figure: a singular vector compared with an actual column of the data]
If we want to get rid of the duplicates:
- Throw them away
- Scale the columns/rows by the square root of the number of duplicates
[Figure: A with sampled C_d, R_d (with duplicates) reduced to C_s, R_s (without duplicates)]
Construct a small U
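The square-root scaling keeps C C^T unchanged when duplicates are dropped, which the following sketch checks. It assumes NumPy; `deduplicate_columns` and the sample `idx` are hypothetical.

```python
import numpy as np

def deduplicate_columns(C_d, idx):
    """Keep one copy of each sampled column, scaled by sqrt(number of duplicates)."""
    unique_idx, first_pos, counts = np.unique(idx, return_index=True, return_counts=True)
    C_s = C_d[:, first_pos] * np.sqrt(counts)
    return C_s, unique_idx

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

idx = np.array([0, 3, 0, 0, 4])   # hypothetical sample: column 0 was drawn three times
C_d = A[:, idx]                   # sampled columns, with duplicates
C_s, kept = deduplicate_columns(C_d, idx)
print(C_d.shape, C_s.shape, kept)
```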
SVD: A = U Σ V^T
- A: huge but sparse; U, V^T: big and dense; Σ: sparse and small
CUR: A = C U R
- A: huge but sparse; C, R: big but sparse; U: dense but small
DBLP bibliographic data
- Author-to-conference big sparse matrix
- A_ij: number of papers published by author i at conference j
- 428K authors (rows), 3659 conferences (columns)
- Very sparse
Want to reduce dimensionality
- How much time does it take?
- What is the reconstruction error?
- How much space do we need?
Accuracy: 1 - relative sum squared error
Space ratio: #output matrix entries / #input matrix entries
CPU time
[Figure: plots comparing SVD, CUR, and CUR with no duplicates]
More details: Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM '07.
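The first two measures can be computed as below. This is a sketch assuming NumPy, with 'entries' interpreted as non-zero entries (an assumption that suits sparse data); the helper names are hypothetical.

```python
import numpy as np

def accuracy(A, A_hat):
    """1 - (sum of squared errors) / (sum of squared entries of A)."""
    return 1.0 - float(np.sum((A - A_hat) ** 2) / np.sum(A ** 2))

def space_ratio(factors, A, tol=1e-12):
    """Non-zero entries stored in the factors, relative to non-zero entries of A."""
    stored = sum(int(np.sum(np.abs(F) > tol)) for F in factors)
    return stored / int(np.sum(np.abs(A) > tol))

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U1, s1, Vt1 = U[:, :1], s[:1], Vt[:1]      # rank-1 factors
A_hat = (U1 * s1) @ Vt1

acc = accuracy(A, A_hat)
ratio = space_ratio([U1, s1, Vt1], A)
print(acc, ratio)
```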
SVD is limited to linear projections:
- Lower-dimensional linear projection that preserves Euclidean distances
Non-linear methods: Isomap
- Data lies on a nonlinear low-dim curve, aka manifold
  - Use the distance as measured along the manifold
- How?
  - Build adjacency graph
  - Geodesic distance is graph distance
  - SVD/PCA the graph pairwise distance matrix
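The three Isomap steps can be sketched in NumPy alone. The assumptions here are a k-nearest-neighbor adjacency graph, Floyd-Warshall shortest paths for the graph distances, and classical MDS for the final SVD/PCA step; the function name `isomap` and the arc-shaped test data are hypothetical.

```python
import numpy as np

def isomap(X, n_neighbors=4, n_components=1):
    n = X.shape[0]
    # Pairwise Euclidean distances between all points.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

    # 1. Adjacency graph: connect each point to its nearest neighbors.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nearest = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, nearest[i]] = D[i, nearest[i]]
        G[nearest[i], i] = D[i, nearest[i]]

    # 2. Geodesic distance = shortest-path distance in the graph (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])

    # 3. Classical MDS: double-center the squared distances and eigendecompose.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# Hypothetical data: 40 points along three quarters of a unit circle.
t = np.linspace(0.0, 1.5 * np.pi, 40)
X = np.column_stack([np.cos(t), np.sin(t)])

embedding = isomap(X, n_neighbors=4, n_components=1)
correlation = abs(np.corrcoef(embedding[:, 0], t)[0, 1])
print(correlation)
```

The one-dimensional embedding tracks the position along the arc. The sketch assumes the neighbor graph is connected; otherwise some graph distances stay infinite.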
- Drineas et al.: Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.
- J. Sun, Y. Xie, H. Zhang, C. Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM 2007.
- P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, P. Drineas: Intra- and interpopulation genotype reconstruction from tagging SNPs, Genome Research, 17(1), 96-107, 2007.
- M. W. Mahoney, M. Maggioni, P. Drineas: Tensor-CUR Decompositions For Tensor-Based Data, Proc. 12th Annual SIGKDD, 327-336, 2006.