  1. Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 13 Jan-Willem van de Meent (credit: David Lopez-Paz, David Duvenaud, Laurens van der Maaten)

  2. Homework • Homework 3 is out today (due 4 Nov) • Homework 1 has been graded (we will grade Homework 2 a little faster) • Regrading policy • Step 1: E-mail TAs to resolve simple problems (e.g. code not running) • Step 2: E-mail instructor to request regrading • We will regrade the entire problem set; the final grade can be lower than before.

  3. Review: PCA • Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$ • Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$ • Change of basis: $z = U^\top x$, i.e. $z_j = u_j^\top x$ for $z = (z_1, \ldots, z_k)^\top$ • Inverse change of basis: $\tilde{x} = U z$

  4. Review: PCA • Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$ • Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$ • Eigenvectors of covariance: the columns $u_j$ are eigenvectors of the data covariance matrix $C$ • Eigendecomposition: $C = U \Lambda U^\top$, with $\Lambda$ the diagonal matrix of eigenvalues

  5. Review: PCA • Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$ • Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$ • Eigenvectors of covariance, eigendecomposition: $C = U \Lambda U^\top$ • Claim: eigenvectors of a symmetric matrix are orthogonal
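A short justification of the claim (my own addition, assuming distinct eigenvalues): let $C = C^\top$ with $C u_i = \lambda_i u_i$, $C u_j = \lambda_j u_j$ and $\lambda_i \neq \lambda_j$. Then

$$\lambda_i\, u_j^\top u_i = u_j^\top C u_i = (C u_j)^\top u_i = \lambda_j\, u_j^\top u_i \quad\Longrightarrow\quad (\lambda_i - \lambda_j)\, u_j^\top u_i = 0 \quad\Longrightarrow\quad u_j^\top u_i = 0.$$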

  6. Review: PCA (illustration from Stack Exchange)

  7. Review: PCA (repeats slide 5)

  8. Review: PCA • Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$ • Truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$, the eigenvectors of the covariance with the $k$ largest eigenvalues • Truncated decomposition of the covariance: $C \approx U \Lambda U^\top$, keeping only the top-$k$ eigenvalues

  9. Review: PCA • Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$ • Truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$ • Projection / encoding: $z = U^\top x$ • Reconstruction / decoding: $\tilde{x} = U z$

  10. Review: PCA • Figures: projection onto the top 2 components vs. the bottom 2 components • Data: three varieties of wheat (Kama, Rosa, Canadian) • Attributes: Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Groove

  11. PCA: Complexity • Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$, truncated basis $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$ • Using eigenvalue decomposition: computing the covariance $C$ costs $O(nd^2)$, the eigenvalue decomposition costs $O(d^3)$, so the total complexity is $O(nd^2 + d^3)$
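A minimal NumPy sketch of this eigendecomposition route (function and variable names are my own, not from the lecture):

    import numpy as np

    def pca_eig(X, k):
        """PCA via eigendecomposition of the covariance.
        X is d x n (columns are data points); k is the number of components."""
        mean = X.mean(axis=1, keepdims=True)
        Xc = X - mean                                # center the data
        C = (Xc @ Xc.T) / X.shape[1]                 # covariance, O(n d^2)
        evals, evecs = np.linalg.eigh(C)             # eigendecomposition, O(d^3)
        U = evecs[:, np.argsort(evals)[::-1][:k]]    # top-k eigenvectors, d x k
        Z = U.T @ Xc                                 # encoding: z = U^T x
        X_rec = U @ Z + mean                         # decoding: x_tilde = U z
        return U, Z, X_rec

For large $d$ the $O(d^3)$ eigendecomposition dominates, which motivates the SVD-based route on the next slide.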

  12. PCA: Complexity • Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$, truncated basis $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$ • Using singular value decomposition: a full decomposition costs $O(\min\{nd^2, n^2 d\})$; a rank-$k$ decomposition costs $O(k\,d\,n \log n)$ (with the power method)
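The power method mentioned above can be sketched as follows for the leading singular vector (a simplified illustration of my own; practical rank-$k$ routines such as randomized SVD use several vectors plus re-orthogonalization):

    import numpy as np

    def top_singular_vector(X, n_iter=100, seed=0):
        """Power iteration on X X^T: approximates the leading left singular vector of X."""
        rng = np.random.default_rng(seed)
        u = rng.standard_normal(X.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            u = X @ (X.T @ u)                  # apply X X^T without forming it explicitly
            u /= np.linalg.norm(u)
        sigma = np.linalg.norm(X.T @ u)        # corresponding singular value
        return u, sigma

Each iteration costs only $O(nd)$, which is how the rank-$k$ route avoids the $O(d^3)$ of a full eigendecomposition.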


  13. Singular Value Decomposition • Idea: decompose a $d \times d$ matrix $M$ into 1. a change of basis $V$ (unitary matrix), 2. a scaling $\Sigma$ (diagonal matrix), 3. a change of basis $U$ (unitary matrix)

  14. Singular Value Decomposition • Idea: decompose the $d \times n$ matrix $X$ into 1. an $n \times n$ basis $V$ (unitary matrix), 2. a $d \times n$ matrix $\Sigma$ (diagonal scaling / projection), 3. a $d \times d$ basis $U$ (unitary matrix): $X = U \Sigma V^\top$ with $U \in \mathbb{R}^{d \times d}$, $\Sigma \in \mathbb{R}^{d \times n}$, $V^\top \in \mathbb{R}^{n \times n}$
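A quick NumPy check of these shapes and of the factorization (the example dimensions are my own):

    import numpy as np

    d, n = 5, 8
    X = np.random.default_rng(0).standard_normal((d, n))
    U, s, Vt = np.linalg.svd(X)                 # full SVD
    Sigma = np.zeros((d, n))
    Sigma[:len(s), :len(s)] = np.diag(s)        # singular values on the diagonal of a d x n matrix
    print(U.shape, Sigma.shape, Vt.shape)       # (5, 5) (5, 8) (8, 8)
    print(np.allclose(X, U @ Sigma @ Vt))       # True: X = U Sigma V^T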

  15. Random Projections • Borrowing from: David Lopez-Paz & David Duvenaud

  16. Random Projections • Fast, efficient, and distance-preserving dimensionality reduction! • A random matrix $W \in \mathbb{R}^{40500 \times 1000}$ maps points $x_1, x_2 \in \mathbb{R}^{40500}$ to $y_1, y_2 \in \mathbb{R}^{1000}$ such that $(1 - \epsilon)\,\|x_1 - x_2\|^2 \le \|y_1 - y_2\|^2 \le (1 + \epsilon)\,\|x_1 - x_2\|^2$ • This result is formalized in the Johnson-Lindenstrauss Lemma

  17. Johnson-Lindenstrauss Lemma • For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then, for any set $V$ of $m$ points in $\mathbb{R}^N$, $\exists f : \mathbb{R}^N \to \mathbb{R}^k$ s.t. $\forall u, v \in V$: $(1 - \epsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\,\|u - v\|^2$ • The proof is a great example of Erdős' probabilistic method (1947) • Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William B. Johnson (1944-)
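To get a sense of the scale of $k$ (a worked example of my own, reading $\log$ as the natural logarithm): for $m = 10{,}000$ points and $\epsilon = 0.25$,

$$k = \frac{20 \log m}{\epsilon^2} = \frac{20\,\ln(10000)}{0.25^2} \approx \frac{184.2}{0.0625} \approx 2947,$$

a target dimension that does not depend on the original dimension $N$ at all.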

  18. Johnson-Lindenstrauss Lemma • For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then, for any set $V$ of $m$ points in $\mathbb{R}^N$, $\exists f : \mathbb{R}^N \to \mathbb{R}^k$ s.t. $\forall u, v \in V$: $(1 - \epsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\,\|u - v\|^2$ • Holds when $f$ is a linear function with random coefficients: $f = \frac{1}{\sqrt{k}} A$, with $A \in \mathbb{R}^{k \times N}$, $k < N$, and $A_{ij} \sim \mathcal{N}(0, 1)$
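A minimal sketch of such a Gaussian random projection (the dimensions and the distance check are my own illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, k, m = 10_000, 1_000, 50             # original dim, target dim, number of points
    X = rng.standard_normal((m, N))         # m points in R^N, one per row
    A = rng.standard_normal((k, N))         # random matrix with A_ij ~ N(0, 1)
    Y = X @ A.T / np.sqrt(k)                # f(x) = (1 / sqrt(k)) A x, applied to every row

    # compare one pairwise squared distance before and after the projection
    dx = np.sum((X[0] - X[1]) ** 2)
    dy = np.sum((Y[0] - Y[1]) ** 2)
    print(dy / dx)                          # close to 1, i.e. within (1 - eps, 1 + eps) with high probability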

  19. Example: 20-newsgroups data Data: 20-newsgroups, from 100,000 features to 300 (0.3%)

  20. Example: 20-newsgroups data Data: 20-newsgroups, from 100,000 features to 1,000 (1%)

  21. Example: 20-newsgroups data Data: 20-newsgroups, from 100,000 features to 10,000 (10%)

  22. Example: 20-newsgroups data Data: 20-newsgroups, from 100,000 features to 10,000 (10%) Conclusion: RP preserves distances like PCA, but is faster than PCA when the number of dimensions is very large

  23. Stochastic Neighbor Embeddings • Borrowing from: Laurens van der Maaten (Delft -> Facebook AI)

  24. Manifold Learning Idea: Perform a non-linear dimensionality reduction in a manner that preserves proximity (but not distances)

  25. Manifold Learning

  26. PCA on MNIST Digits

  27. Swiss Roll Euclidean distance is not always a good notion of proximity

  28. Non-linear Projection Bad projection: relative position to neighbors changes

  29. Non-linear Projection Intuition: Want to preserve local neighborhood

  30. Stochastic Neighbor Embedding • Similarity in high dimension: $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$ • Similarity in low dimension: $q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$

  31. Stochastic Neighbor Embedding • Similarity of datapoints in high dimension: $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$ • Similarity of datapoints in low dimension: $q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$ • Cost function: $C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$ • Idea: optimize the $y_i$ via gradient descent on $C$

  32. Stochastic Neighbor Embedding (repeats slide 31)
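A small NumPy sketch of these quantities (my own helper, using a single global bandwidth for simplicity; the per-point $\sigma_i$ come up on later slides):

    import numpy as np

    def conditional_similarities(D, sigma2=None):
        """Row-normalized similarities from a matrix of squared distances D.
        With sigma2 given this computes p_{j|i}; without it, q_{j|i}."""
        D = np.array(D, dtype=float)            # copy so the caller's array is untouched
        if sigma2 is not None:
            D = D / (2.0 * sigma2)
        np.fill_diagonal(D, np.inf)             # exclude k = i from the sums
        W = np.exp(-D)
        return W / W.sum(axis=1, keepdims=True)

    def sne_cost(X, Y, sigma2=1.0):
        """Sum over i of KL(P_i || Q_i) for data X (n x d) and embedding Y (n x 2)."""
        DX = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        DY = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        P = conditional_similarities(DX, sigma2)
        Q = conditional_similarities(DY)
        off_diag = ~np.eye(len(X), dtype=bool)
        return np.sum(P[off_diag] * np.log(P[off_diag] / Q[off_diag]))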

  33. Stochastic Neighbor Embedding • The gradient has a surprisingly simple form: $\frac{\partial C}{\partial y_i} = \sum_{j \neq i} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$ • The gradient update with momentum term is given by $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \beta(t)\,(Y^{(t-1)} - Y^{(t-2)})$

  34. Stochastic Neighbor Embedding • The gradient has a surprisingly simple form: $\frac{\partial C}{\partial y_i} = \sum_{j \neq i} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$ • The gradient update with momentum term is given by $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \beta(t)\,(Y^{(t-1)} - Y^{(t-2)})$ • Problem: $p_{j|i}$ is not equal to $p_{i|j}$
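A rough sketch of one such update (my own code; P is the matrix of $p_{j|i}$ from the previous slides, the step size $\eta$ and momentum $\beta$ are placeholder values, and the step is taken against the gradient so that $C$ decreases):

    import numpy as np

    def sne_gradient(P, Q, Y):
        """dC/dy_i = sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
        M = P - Q + P.T - Q.T
        grad = np.zeros_like(Y)
        for i in range(Y.shape[0]):
            grad[i] = M[i] @ (Y[i] - Y)            # sum over j of the coefficient times (y_i - y_j)
        return grad

    def sne_step(P, Y, Y_prev, eta=0.1, beta=0.5):
        """One gradient step with momentum: Y(t) from Y(t-1) and Y(t-2)."""
        DY = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        np.fill_diagonal(DY, np.inf)
        W = np.exp(-DY)
        Q = W / W.sum(axis=1, keepdims=True)       # q_{j|i} from the current embedding
        grad = sne_gradient(P, Q, Y)
        return Y - eta * grad + beta * (Y - Y_prev)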

  35. Symmetric SNE • Minimize a single KL divergence between joint probability distributions: $C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$ • The obvious way to redefine the pairwise similarities is $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$ and $q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_l - y_k\|^2)}$

  36. Symmetric SNE • Minimize a single KL divergence between joint probability distributions: $C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$ • The obvious way to redefine the pairwise similarities is $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$ and $q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_l - y_k\|^2)}$ • Problem: How should we choose $\sigma$?

  37. Choosing the bandwidth • $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$ • Bad $\sigma$: the neighborhood is not local in the manifold

  38. Choosing the bandwidth • $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$ • Good $\sigma$: the neighborhood contains 5-50 points

  39. Choosing the bandwidth • $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$ • Problem: the optimal $\sigma$ may vary if the density is not uniform

  40. Choosing the bandwidth • $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$, $\quad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$ • Solution: define $\sigma_i$ per point

  41. Choosing the bandwidth (repeats slide 40)

  42. Choosing the bandwidth (repeats slide 40)

  43. Choosing the bandwidth Set σ i to ensure constant perplexity
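In practice (my own implementation sketch, not code from the lecture) each $\sigma_i$ is found by binary search so that the perplexity of $P_i$, defined as $2^{H(P_i)}$ with $H$ the Shannon entropy, matches a user-chosen target, typically somewhere in the 5-50 range suggested above:

    import numpy as np

    def perplexity(p):
        """Perplexity 2^H(p) of a conditional distribution p (zero entries are ignored)."""
        p = p[p > 0]
        return 2.0 ** (-np.sum(p * np.log2(p)))

    def sigma_for_point(dist2_i, target=30.0, n_iter=50):
        """Binary search for sigma_i such that perplexity(p_{.|i}) is close to target.
        dist2_i holds the squared distances from point i to all other points."""
        lo, hi = 1e-10, 1e10
        for _ in range(n_iter):
            sigma = 0.5 * (lo + hi)
            w = np.exp(-dist2_i / (2.0 * sigma ** 2))
            p = w / w.sum()
            if perplexity(p) > target:
                hi = sigma                 # neighborhood too wide: shrink sigma
            else:
                lo = sigma                 # neighborhood too narrow: grow sigma
        return 0.5 * (lo + hi)

The symmetrized $p_{ij} = (p_{j|i} + p_{i|j}) / 2N$ from the previous slides are then built from these per-point conditionals.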
