Data Mining Techniques
CS 6220 - Section 3 - Fall 2016
Lecture 13
Jan-Willem van de Meent (credit: David Lopez-Paz, David Duvenaud, Laurens van der Maaten)
Homework
Homework 3 is out today (due 4 Nov)
Homework 1 has been graded (we will grade Homework 2 a little faster)
If you believe there were grading problems (e.g. code not running), you can ask for regrading
The final grade can be lower than before
Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$

Change of basis, from $x \in \mathbb{R}^d$ to $z = (z_1, \ldots, z_k)^\top$:
$z = U^\top x, \qquad z_j = u_j^\top x$

Inverse change of basis:
$\tilde{x} = Uz = \sum_{j=1}^{k} z_j u_j = \sum_{j=1}^{k} u_j u_j^\top x$
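As a quick illustration (not from the slides), here is a minimal numpy sketch of the change of basis and its inverse; the orthonormal basis U is generated with a QR decomposition purely for the example.

```python
import numpy as np

d = 5
x = np.random.randn(d)                  # a single data point in R^d

# Random orthonormal basis (columns of U), obtained from a QR decomposition
U, _ = np.linalg.qr(np.random.randn(d, d))

z = U.T @ x                             # change of basis: z_j = u_j^T x
x_rec = U @ z                           # inverse change of basis: x~ = U z

print(np.allclose(x, x_rec))            # True: a full orthonormal basis loses no information
```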
Eigenvectors of Covariance: Eigen-decomposition
Eigenvectors of Covariance: Eigen-decomposition
Claim: Eigenvectors of a symmetric matrix are orthogonal (proof from Stack Exchange)
Proof sketch for distinct eigenvalues: if $Su = \lambda u$, $Sv = \mu v$, and $S = S^\top$, then $\lambda\, u^\top v = (Su)^\top v = u^\top (Sv) = \mu\, u^\top v$, so $(\lambda - \mu)\, u^\top v = 0$ and hence $u^\top v = 0$.
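A quick numerical check of this claim, as a small numpy sketch (the symmetric matrix here is arbitrary):

```python
import numpy as np

A = np.random.randn(4, 4)
S = A + A.T                              # any symmetric matrix

eigvals, U = np.linalg.eigh(S)           # eigh is the eigendecomposition for symmetric matrices
print(np.allclose(U.T @ U, np.eye(4)))   # True: the eigenvectors are orthonormal
```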
Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
Truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$
Eigenvectors of Covariance: Truncated decomposition
Projection / Encoding:
$z = U^\top x$

Reconstruction / Decoding:
$\tilde{x} = Uz = \sum_{j=1}^{k} u_j u_j^\top x$
Data: three varieties of wheat (Kama, Rosa, Canadian)
Attributes: Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Groove
[Figure: projection onto the top 2 components vs. the bottom 2 components]
Using eigen-value decomposition
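A minimal sketch of PCA via an eigendecomposition of the covariance matrix, assuming (as in the slides) that data points are the columns of X; the function name and placeholder data are mine:

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance. X is d x n (one point per column)."""
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)     # center the data
    C = (Xc @ Xc.T) / n                        # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues, orthonormal eigenvectors
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    U = eigvecs[:, top]                        # truncated basis, d x k
    Z = U.T @ Xc                               # projection / encoding, k x n
    return U, Z

# Hypothetical usage on seeds-like data: 7 attributes, 210 samples, top 2 components
X = np.random.randn(7, 210)                    # placeholder data, not the real seeds dataset
U, Z = pca_eig(X, k=2)
X_rec = U @ Z + X.mean(axis=1, keepdims=True)  # reconstruction / decoding (lossy for k < d)
```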
Using singular-value decomposition
(with power method)
Idea: Decompose a $d \times d$ matrix $M$ into
$M = U \Sigma V^\top$
with $U$ (unitary matrix), $\Sigma$ (diagonal matrix), $V^\top$ (unitary matrix).

Idea: Decompose the $d \times n$ data matrix $X$ into
$X = U_{d \times d}\, \Sigma_{d \times n}\, V^\top_{n \times n}$
with $U$ (unitary matrix), $\Sigma$ (diagonal projection), $V^\top$ (unitary matrix).
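Equivalently, the truncated basis can be read off the SVD of the centered data matrix: the left singular vectors of X are the eigenvectors of the covariance, and the squared singular values divided by n are its eigenvalues. A sketch using numpy's svd rather than the power method mentioned above:

```python
import numpy as np

def pca_svd(X, k):
    """PCA via the SVD of the centered data matrix X (d x n, one point per column)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U @ diag(s) @ Vt
    Uk = U[:, :k]             # top-k left singular vectors = top-k eigenvectors of the covariance
    Z = Uk.T @ Xc             # encoding; equivalently s[:k, None] * Vt[:k]
    return Uk, Z

# Sanity check: the covariance eigenvalues equal s**2 / n (up to numerical error),
# so both routes give the same subspace (columns may differ in sign).
```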
Borrowing from: David Lopez-Paz & David Duvenaud
Fast, efficient, and distance-preserving dimensionality reduction!
Map $x_1, x_2 \in \mathbb{R}^{40500}$ to $y_1, y_2 \in \mathbb{R}^{1000}$ with a single random matrix $W \in \mathbb{R}^{40500 \times 1000}$, so that
$(1 - \epsilon)\|x_1 - x_2\|^2 \le \|y_1 - y_2\|^2 \le (1 + \epsilon)\|x_1 - x_2\|^2$
This result is formalized in the Johnson-Lindenstrauss Lemma.
The proof is a great example of Erdős's probabilistic method.
Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William B. Johnson (1944-)
For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then, for any set $V$ of $m$ points in $\mathbb{R}^N$, there exists $f : \mathbb{R}^N \to \mathbb{R}^k$ such that for all $u, v \in V$:
$(1 - \epsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\|u - v\|^2$
The lemma holds when $f$ is a linear function with random coefficients:
$f(x) = \frac{1}{\sqrt{k}} A x, \quad A \in \mathbb{R}^{k \times N}, \; k < N, \; A_{ij} \sim \mathcal{N}(0, 1)$
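A sketch of this random linear map (function and variable names are mine; the dimensions are illustrative):

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Map rows of X (m points in R^N) to R^k with f(x) = (1/sqrt(k)) A x, A_ij ~ N(0, 1)."""
    m, N = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k, N))
    return X @ A.T / np.sqrt(k)

# Illustrative check of distance preservation
X = np.random.randn(100, 10_000)
Y = random_projection(X, k=1_000)
d_high = np.linalg.norm(X[0] - X[1]) ** 2
d_low = np.linalg.norm(Y[0] - Y[1]) ** 2
print(d_high, d_low)   # the two squared distances should agree up to a (1 ± eps) factor
```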
Data: 20-newsgroups, from 100,000 features to 300 (0.3%)
Data: 20-newsgroups, from 100,000 features to 1,000 (1%)
Data: 20-newsgroups, from 100,000 features to 10,000 (10%)
Conclusion: Random projections preserve distances like PCA, but are much faster than PCA when the number of dimensions is very large.
Borrowing from: Laurens van der Maaten (Delft -> Facebook AI)
Idea: Perform a non-linear dimensionality reduction in a manner that preserves proximity (but not distances)
Euclidean distance is not always a good notion of proximity
Bad projection: relative position to neighbors changes
Intuition: Want to preserve local neighborhood
Similarity in high dimension:
$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$

Similarity in low dimension:
$q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$
Idea: Optimize yi via gradient descent on C
Cost function:
$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$
The gradient has a surprisingly simple form:
$\frac{\partial C}{\partial y_i} = 2 \sum_{j \neq i} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$

The gradient update with a momentum term is given by:
$Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \beta(t)\,(Y^{(t-1)} - Y^{(t-2)})$
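A compact numpy sketch of the SNE similarities, cost, and gradient, assuming for brevity a single global σ rather than the per-point σ_i discussed below (helper names are mine):

```python
import numpy as np

def squared_dists(A):
    """Pairwise squared Euclidean distances between rows of A."""
    s = (A ** 2).sum(axis=1)
    return s[:, None] + s[None, :] - 2 * A @ A.T

def conditional_probs(D, sigma2):
    """Rows of p_{j|i}: softmax of -D_ij / (2 sigma^2) over j != i."""
    P = np.exp(-D / (2 * sigma2))
    np.fill_diagonal(P, 0.0)                          # exclude j = i
    return P / P.sum(axis=1, keepdims=True)

def sne_cost_and_grad(X, Y, sigma2=1.0):
    """KL cost C = sum_i KL(P_i || Q_i) and its gradient with respect to the embedding Y."""
    P = conditional_probs(squared_dists(X), sigma2)   # high-dimensional similarities
    Q = conditional_probs(squared_dists(Y), 0.5)      # low-dim: exponent is -||y_i - y_j||^2
    mask = ~np.eye(len(P), dtype=bool)
    # small constants guard against log(0) when a p_{j|i} underflows
    C = np.sum(P[mask] * np.log((P[mask] + 1e-12) / (Q[mask] + 1e-12)))
    # dC/dy_i = 2 sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)
    W = P - Q + P.T - Q.T
    grad = 2 * (W.sum(axis=1, keepdims=True) * Y - W @ Y)
    return C, grad
```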
Problem: $p_{j|i}$ is not equal to $p_{i|j}$
Solution: Minimize a single KL divergence between joint probability distributions $P$ and $Q$:
$C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

The obvious way to redefine the pairwise similarities is:
$p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}, \qquad q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_l - y_k\|^2)}$
Problem: How should we choose σ?
Bad σ: neighborhood is not local on the manifold
Good σ: Neighborhood contains 5-50 points
Problem: optimal σ may vary if density not uniform
Solution: Define $\sigma_i$ per point.
$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \qquad p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$
Set σi to ensure constant perplexity
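A common implementation choice is a per-point binary search over σ_i so that the perplexity $2^{H(P_i)}$ of the conditional distribution matches a target value (typically in the 5-50 range, matching the neighborhood size above). A sketch, with function names of my own:

```python
import numpy as np

def perplexity(p):
    """Perplexity 2^H(p) of a discrete distribution p."""
    p = np.maximum(p, 1e-12)
    return 2.0 ** (-np.sum(p * np.log2(p)))

def sigma_for_point(dists_i, target_perplexity=30.0, tol=1e-4, max_iter=60):
    """Binary search for sigma_i so that P_i = (p_{j|i})_j has the target perplexity.
    dists_i: squared distances from point i to all other points (point i itself excluded).
    The target should be smaller than the number of other points."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dists_i / (2 * sigma ** 2))
        p /= p.sum()
        if perplexity(p) > target_perplexity:
            hi = sigma          # neighborhood too broad -> shrink sigma
        else:
            lo = sigma          # neighborhood too narrow -> grow sigma
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```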
Similarity in high dimension:
$p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$

Similarity in low dimension:
$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$

Gradient:
$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})(1 + \|y_i - y_j\|^2)^{-1}(y_i - y_j)$
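Putting the pieces together, a sketch of one t-SNE update: compute the Student-t similarities q_ij, the gradient above, and a momentum step. Step sizes are illustrative, and the symmetric p_ij are assumed precomputed (e.g. via the perplexity search above followed by p_ij = (p_{j|i} + p_{i|j}) / 2N):

```python
import numpy as np

def tsne_step(Y, Y_prev, P, eta=100.0, beta=0.5):
    """One t-SNE gradient step with momentum.
    Y, Y_prev: current and previous embeddings (n x 2); P: symmetric p_ij (n x n)."""
    # Student-t similarities: q_ij = (1 + ||y_i - y_j||^2)^-1 / sum_{k != l} (1 + ||y_k - y_l||^2)^-1
    s = (Y ** 2).sum(axis=1)
    D = s[:, None] + s[None, :] - 2 * Y @ Y.T
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    # Gradient: dC/dy_i = 4 sum_j (p_ij - q_ij) (1 + ||y_i - y_j||^2)^-1 (y_i - y_j)
    A = (P - Q) * W
    grad = 4 * (A.sum(axis=1, keepdims=True) * Y - A @ Y)
    # Descend along the gradient (we are minimizing C) and add the momentum term
    return Y - eta * grad + beta * (Y - Y_prev)
```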
[Figure: MNIST digits embedded in two dimensions, points colored by digit class (1-9)]