CS345a: Data Mining
Jure Leskovec and Anand Rajaraman, Stanford University

Announcements:
Homework 2 is out: due Monday the 15th at midnight. Submit PDFs.
Talk: Yehuda Koren, winner of the Netflix Prize. Wednesday at 12:30 in Terman 453 (http://rain.stanford.edu).
2/9/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 2
Text ‐ LSI: find ‘concepts’
Compress / reduce dimensionality
SVD definition: A = U Σ V^T, where
A: n x m matrix (e.g., n documents, m terms)
U: n x r matrix (n documents, r concepts)
Σ: r x r diagonal matrix (strength of each 'concept'; r is the rank of A)
V: m x r matrix (m terms, r concepts)
[Figure: A (n x m) = U (n x r) x Σ (r x r) x V^T (r x m)]
[Figure: equivalently, A = σ1 u1 v1^T + σ2 u2 v2^T + ..., a sum of rank-1 matrices]
THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Σ V^T, where:
U, Σ, V: unique
U, V: column-orthonormal (U^T U = I and V^T V = I, with I the identity matrix)
Σ: diagonal, with entries (the singular values) non-negative and sorted in decreasing order (σ1 ≥ σ2 ≥ ... ≥ 0)
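As a quick sanity check, the theorem's properties can be verified numerically. A minimal sketch with NumPy; the small matrix here is made up purely for illustration:

```python
import numpy as np

# A small made-up 4 x 3 matrix, just for illustration.
A = np.array([[1., 1., 1.],
              [2., 2., 2.],
              [0., 0., 3.],
              [0., 0., 1.]])

# Thin SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U and V are column-orthonormal: U^T U = I and V^T V = I.
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))

# Singular values are non-negative and sorted in decreasing order.
assert np.all(s >= 0) and np.all(s[:-1] >= s[1:])

# The factors reconstruct A (up to floating-point error).
assert np.allclose(U @ np.diag(s) @ Vt, A)
```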
A = U Σ V^T, example:
Rows of A: seven documents, the first four from CS, the last three from MD. Columns of A: the terms data, inf., retrieval, brain, lung.

          data inf. retrieval brain lung
   CS   [  1    1      1        0     0 ]        [ 0.18  0    ]
   CS   [  2    2      2        0     0 ]        [ 0.36  0    ]
   CS   [  1    1      1        0     0 ]        [ 0.18  0    ]     [ 9.64  0    ]     [ 0.58  0.58  0.58  0     0    ]
   CS   [  5    5      5        0     0 ]    =   [ 0.90  0    ]  x  [ 0     5.29 ]  x  [ 0     0     0     0.71  0.71 ]
   MD   [  0    0      0        2     2 ]        [ 0     0.53 ]
   MD   [  0    0      0        3     3 ]        [ 0     0.80 ]
   MD   [  0    0      0        1     1 ]        [ 0     0.27 ]

Reading the factors:
U is the doc-to-concept similarity matrix. Its first column is the 'CS-concept', its second the 'MD-concept': the CS documents load on the first, the MD documents on the second.
Σ holds the 'strength' of each concept: 9.64 for the CS-concept, 5.29 for the MD-concept.
V^T is the term-to-concept similarity matrix: data, inf., and retrieval belong to the CS-concept (0.58 each); brain and lung belong to the MD-concept (0.71 each).
To summarize:
U: document-to-concept similarity matrix
V: term-to-concept similarity matrix
Σ: its diagonal elements give the 'strength' of each concept
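The example above can be reproduced numerically. A sketch with NumPy (sign flips in U and V are possible, since the decomposition is unique only up to simultaneous sign changes, so we compare absolute values):

```python
import numpy as np

# The document-term matrix from the example:
# rows = 7 documents (4 CS, 3 MD); columns = data, inf., retrieval, brain, lung.
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A has rank 2: the two non-zero singular values are the concept strengths.
assert np.allclose(np.round(s[:2], 2), [9.64, 5.29])
assert s[2] < 1e-8

# First column of U: the CS-concept loadings of the documents.
assert np.allclose(np.abs(U[:, 0]), [0.18, 0.36, 0.18, 0.90, 0, 0, 0], atol=0.01)
# First row of V^T: the CS-concept loadings of the terms.
assert np.allclose(np.abs(Vt[0]), [0.58, 0.58, 0.58, 0, 0], atol=0.01)
```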
SVD gives the best axis to project on: the first singular vector v1.
('Best' = minimizes the sum of squared projection errors, i.e., minimum RMS error.)
[Figure: 2-d point cloud with the first singular vector v1 as the projection axis]
In the running example, v1 = [0.58 0.58 0.58 0 0] is this best projection axis: it points in the direction of maximal variance ('spread') of the document points, and the entries of U Σ give the coordinates of the points on the projection axes.
Q: how exactly is dimensionality reduction done?
A: set the smallest singular values to zero. In the example, that means zeroing σ2 = 5.29 and keeping only σ1 = 9.64.
Zeroing σ2 drops the second column of U and the second row of V^T, leaving a rank-1 decomposition:

  [ 1 1 1 0 0 ]        [ 0.18 ]
  [ 2 2 2 0 0 ]        [ 0.36 ]
  [ 1 1 1 0 0 ]        [ 0.18 ]
  [ 5 5 5 0 0 ]    ~   [ 0.90 ]  x  [ 9.64 ]  x  [ 0.58  0.58  0.58  0  0 ]
  [ 0 0 0 2 2 ]        [ 0    ]
  [ 0 0 0 3 3 ]        [ 0    ]
  [ 0 0 0 1 1 ]        [ 0    ]
Multiplying out gives the rank-1 approximation: the CS block is reproduced exactly, while the MD rows collapse to zero:

  [ 1 1 1 0 0 ]        [ 1 1 1 0 0 ]
  [ 2 2 2 0 0 ]        [ 2 2 2 0 0 ]
  [ 1 1 1 0 0 ]        [ 1 1 1 0 0 ]
  [ 5 5 5 0 0 ]    ~   [ 5 5 5 0 0 ]
  [ 0 0 0 2 2 ]        [ 0 0 0 0 0 ]
  [ 0 0 0 3 3 ]        [ 0 0 0 0 0 ]
  [ 0 0 0 1 1 ]        [ 0 0 0 0 0 ]
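The truncation step is one line in code. A sketch with NumPy (the example matrix is restated so the snippet is self-contained):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the largest singular value (set the rest to zero):
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The CS block is reproduced exactly; the MD rows become all zeros.
assert np.allclose(A_k[:4], A[:4])
assert np.allclose(A_k[4:], 0)
```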
Equivalently, write the decomposition with the singular vectors as columns (the 'spectral' view of the example):

  A = [ u1 u2 ]  x  [ σ1  0  ]  x  [ v1^T ]
                    [ 0   σ2 ]     [ v2^T ]
Written as a sum of at most r rank-1 terms, where each u_i is an n x 1 column vector and each v_i^T is a 1 x m row vector:

  A = σ1 u1 v1^T + σ2 u2 v2^T + ...

Assume σ1 ≥ σ2 ≥ ... Since every u_i and v_i has unit length, the σ_i set the size of each term, so zeroing the smallest singular values discards the smallest terms of the sum.
Q: how many σ's should we keep?
A: rule of thumb: keep enough to retain 80-90% of the 'energy' (= Σ_i σ_i^2).
Computing the SVD costs O(n m^2) or O(n^2 m), whichever is less.
But: it is cheaper if we only need the singular values, or only the first k singular vectors, or if the matrix is sparse.
Implemented in standard linear-algebra packages: SPlus, Mathematica, ...
SVD conclusions so far:
SVD: A = U Σ V^T: unique (*)
U: document-to-concept similarities
V: term-to-concept similarities
Σ: strength of each concept
Dimensionality reduction: keep the first few strongest singular values (80-90% of the 'energy').
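The 80-90% energy rule of thumb translates directly into code. A sketch (the helper name rank_for_energy is ours, not from the lecture):

```python
import numpy as np

def rank_for_energy(s, frac=0.9):
    """Smallest k such that the top-k singular values retain at least
    `frac` of the total energy sum_i s_i^2."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, frac) + 1)

# Singular values from the running example:
s = np.array([9.64, 5.29, 0.0, 0.0, 0.0])
print(rank_for_energy(s, 0.90))   # 2  (one concept alone holds only ~77% of the energy)
print(rank_for_energy(s, 0.70))   # 1
```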
SVD gives us: A = U Σ V^T.
Compare with the eigen-decomposition of a symmetric matrix: B = X Λ X^T.
What is A A^T? What is A^T A?
A A^T = U Σ V^T V Σ U^T = U Σ^2 U^T
A^T A = V Σ U^T U Σ V^T = V Σ^2 V^T
(A^T A)^k = V Σ^2k V^T
(A^T A)^k ~ σ1^2k v1 v1^T for large k, since the first term dominates the sum
(A^T A)^k x ~ (constant) v1, for (almost) any starting vector x
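This observation is exactly the power method: repeatedly multiplying a starting vector by A^T A and normalizing converges to v1. A sketch on the running example:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

rng = np.random.default_rng(0)
x = rng.standard_normal(A.shape[1])   # (almost) any starting vector works

# Each step multiplies by A^T A; normalizing avoids overflow/underflow.
for _ in range(50):
    x = A.T @ (A @ x)
    x /= np.linalg.norm(x)

# x converges to +/- v1, the top right singular vector [0.58 0.58 0.58 0 0].
assert np.allclose(np.abs(x), [0.58, 0.58, 0.58, 0, 0], atol=0.01)
```

Convergence is fast here because the other components shrink by (σ2/σ1)^2 ≈ 0.30 per step.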
Case study: how do we answer queries with LSI?
Example: in the term-document setting above (terms: data, inf., retrieval, brain, lung), find documents related to the term 'data'.
Take the one-term query 'data', written as a vector over the terms (data, inf., retrieval, brain, lung):

  q = [1 0 0 0 0]

Q: how do we map the query (or any new document) into the same concept space?
A: take the inner product (cosine similarity) of q with each 'concept' vector v_i; the first coordinate is q . v1.
[Figure: q in term space (term1/term2 axes) with the concept directions v1 and v2; its projection on v1 is q . v1]
Concretely, map the query into concept space with the term-to-concept similarity matrix V:

                                  [ 0.58  0    ]
                                  [ 0.58  0    ]
  q_concept = q V = [1 0 0 0 0] x [ 0.58  0    ] = [ 0.58  0 ]
                                  [ 0     0.71 ]
                                  [ 0     0.71 ]

The query 'data' relates to the CS-concept with strength 0.58, and not at all to the MD-concept.
The same mapping works for documents, e.g. d = [0 1 1 0 0] (one occurrence each of 'inf.' and 'retrieval'):

  d_concept = d V = [ 1.16  0 ]

so d relates to the CS-concept with strength 1.16.
Observation: q = [1 0 0 0 0] and d = [0 1 1 0 0] share no terms at all, yet their concept-space images q_concept = [0.58 0] and d_concept = [1.16 0] point in exactly the same direction. LSI therefore judges them highly similar, which is the point of retrieving by concept rather than by term.
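The whole query pipeline fits in a few lines. A sketch with the V matrix from the example:

```python
import numpy as np

# Term-to-concept similarity matrix V from the example
# (terms: data, inf., retrieval, brain, lung; concepts: CS, MD).
V = np.array([[0.58, 0.00],
              [0.58, 0.00],
              [0.58, 0.00],
              [0.00, 0.71],
              [0.00, 0.71]])

q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: 'data'
d = np.array([0, 1, 1, 0, 0], dtype=float)   # document: 'inf.', 'retrieval'

q_concept = q @ V   # [0.58 0]
d_concept = d @ V   # [1.16 0]

# No shared terms, yet perfectly aligned in concept space:
cos = q_concept @ d_concept / (np.linalg.norm(q_concept) * np.linalg.norm(d_concept))
print(round(cos, 2))   # 1.0
```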
SVD, summing up:
+ Optimal low-rank approximation (in Frobenius norm).
- Interpretability problem: a singular vector is a linear combination of all input columns or rows.
- Sparsity problem: singular vectors are dense even when A is sparse.
[Figure: A = U Σ V^T with dense factors]
CUR decomposition. Goal: express A as a product of three matrices C, U, R so that ||A - C U R|| is small, under the 'constraints' that C contains actual columns of A and R contains actual rows of A; U is the pseudo-inverse of the intersection of C and R.
Error guarantee, informally: with probability at least 1 - δ, by picking enough columns and rows, ||A - C U R||_F comes within an additive ε ||A||_F of the error of the best rank-k approximation.
Sampling columns (similarly for rows): pick columns at random with probability proportional to their squared Euclidean norms, rescaling the picked columns so that the approximation is unbiased.
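One common variant of this sampling step can be sketched as follows (the rescaling by 1/sqrt(c p_i) keeps the sampled product unbiased; exact details vary between papers, so treat this as an assumption-laden sketch):

```python
import numpy as np

def sample_columns(A, c, rng):
    # Probability of each column: proportional to its squared Euclidean norm.
    p = np.sum(A ** 2, axis=0)
    p = p / p.sum()
    # Draw c columns i.i.d. (duplicates allowed) and rescale each pick.
    idx = rng.choice(A.shape[1], size=c, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])
    return C, idx

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

rng = np.random.default_rng(0)
C, idx = sample_columns(A, c=3, rng=rng)
print(C.shape)   # (7, 3)
```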
Let W be the 'intersection' of the sampled columns and rows, i.e., the submatrix of A at the chosen row and column indices. Then:
U = W^+, the pseudoinverse of W: compute the SVD W = X Z Y^T and invert its non-zero singular values, (Z^+)_ii = 1/Z_ii; this is the Moore-Penrose pseudoinverse.
[Figure: A ~ C x U x R, with U = W^+]
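Putting the pieces together, a minimal CUR sketch. Here the column and row indices are fixed by hand so the arithmetic is easy to follow; a real implementation would sample them by squared norm as described above:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

col_idx = [0, 3]   # one 'CS' column (data) and one 'MD' column (brain)
row_idx = [3, 5]   # one CS document and one MD document

C = A[:, col_idx]                  # actual columns of A
R = A[row_idx, :]                  # actual rows of A
W = A[np.ix_(row_idx, col_idx)]    # the 'intersection' of C and R
U = np.linalg.pinv(W)              # Moore-Penrose pseudoinverse

A_cur = C @ U @ R

# A has rank 2 and the chosen columns/rows span its column/row spaces,
# so in this toy case CUR reconstructs A exactly.
assert np.allclose(A_cur, A)
```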
CUR pros and cons:
+ Easy interpretation: the basis is made of actual columns and rows of A (while a singular vector blends all of them).
+ Sparse basis: an actual column or row is sparse when A is; a singular vector is dense.
- Duplicates: columns (and rows) of large norm will be sampled many times.
If we want to get rid of the duplicates: keep a single copy of each repeated column/row and scale it by the square root of the number of duplicates.
[Figure: A ~ Cd x U x Rd with duplicates, reduced to scaled Cs and Rs; then construct a small U]
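The scaling trick can be sketched as below (dedupe_scale is a hypothetical helper, not from the lecture). Scaling by sqrt(t) preserves C C^T, since t copies of a column c contribute t c c^T = (sqrt(t) c)(sqrt(t) c)^T:

```python
import numpy as np

def dedupe_scale(C):
    # Keep one copy of each duplicated column, scaled by the square root
    # of its multiplicity. This preserves the product C @ C.T.
    cols, counts = np.unique(C, axis=1, return_counts=True)
    return cols * np.sqrt(counts)

# Column [1, 0] was sampled twice, column [2, 3] once:
C = np.array([[1., 1., 2.],
              [0., 0., 3.]])
Cs = dedupe_scale(C)

assert Cs.shape[1] == 2                   # duplicates removed
assert np.allclose(Cs @ Cs.T, C @ C.T)    # Gram matrix unchanged
```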
SVD vs. CUR at a glance:
SVD: A (huge but sparse) = U (big and dense) x Σ (dense but small) x V^T (big and dense)
CUR: A (huge but sparse) = C (big but sparse) x U (dense but small) x R (big but sparse)
Further reading:
Drineas et al.: Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition. SIAM Journal on Computing, 2006.
J. Sun, Y. Xie, H. Zhang, C. Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM 2007.
P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, et al.: Intra- and interpopulation genotype reconstruction from tagging SNPs (2007).
M. W. Mahoney, M. Maggioni, P. Drineas: Tensor-CUR Decompositions for Tensor-Based Data. Proc. 12th Annual SIGKDD, 327-336 (2006).
Acknowledgement: slides borrowed from Jimeng Sun and Christos Faloutsos.