Data Mining Techniques CS 6220 - Section 3 - Fall 2016 - Lecture 13

slide-1
SLIDE 1

Data Mining Techniques

CS 6220 - Section 3 - Fall 2016

Lecture 13

Jan-Willem van de Meent (credit: David Lopez-Paz, David Duvenaud, Laurens van der Maaten)

slide-2
SLIDE 2

Homework

  • Homework 3 is out today (due 4 Nov)
  • Homework 1 has been graded (we will grade Homework 2 a little faster)
  • Regrading policy
    • Step 1: E-mail TAs to resolve simple problems (e.g. code not running).
    • Step 2: E-mail instructor to request regrading.
    • We will regrade the entire problem set. The final grade can be lower than before.

slide-3
SLIDE 3

Review: PCA

Data: X = (x_1 · · · x_n) ∈ R^{d×n}
Orthonormal basis: U = (u_1 · · · u_k) ∈ R^{d×k}

Change of basis: z = U^T x, i.e. z_j = u_j^T x, giving z = (z_1, . . . , z_k)^T

Inverse change of basis: x̃ = U z = Σ_j u_j u_j^T x

slide-4
SLIDE 4

Review: PCA

Data: X = (x_1 · · · x_n) ∈ R^{d×n}
Orthonormal basis: U = (u_1 · · · u_k) ∈ R^{d×k}

The basis vectors u_j are the eigenvectors of the covariance matrix, obtained by eigendecomposition.

slide-5
SLIDE 5

Review: PCA

Data: X = (x_1 · · · x_n) ∈ R^{d×n}
Orthonormal basis: U = (u_1 · · · u_k) ∈ R^{d×k}

The basis vectors u_j are the eigenvectors of the covariance matrix, obtained by eigendecomposition.

Claim: Eigenvectors of a symmetric matrix are orthogonal

slide-6
SLIDE 6

Review: PCA

Proof sketch (from Stack Exchange): suppose S is symmetric with S u = λ u and S v = μ v for λ ≠ μ. Then λ (u^T v) = (S u)^T v = u^T S v = μ (u^T v), so u^T v = 0.

slide-7
SLIDE 7

Review: PCA

Data: X = (x_1 · · · x_n) ∈ R^{d×n}
Orthonormal basis: U = (u_1 · · · u_k) ∈ R^{d×k}

The basis vectors u_j are the eigenvectors of the covariance matrix, obtained by eigendecomposition.

Claim: Eigenvectors of a symmetric matrix are orthogonal
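This claim is also easy to verify numerically; a minimal sketch in numpy (the symmetric matrix here is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    S = (A + A.T) / 2                        # an arbitrary symmetric matrix

    eigvals, U = np.linalg.eigh(S)           # columns of U are eigenvectors

    print(np.allclose(U.T @ U, np.eye(5)))   # True: the eigenvectors are orthonormal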

slide-8
SLIDE 8

Review: PCA

Data: X = (x_1 · · · x_n) ∈ R^{d×n}
Truncated basis: U = (u_1 · · · u_k) ∈ R^{d×k}

Truncated decomposition: keep only the top-k eigenvectors of the covariance matrix.

slide-9
SLIDE 9

Review: PCA

Data: X = (x_1 · · · x_n) ∈ R^{d×n}
Truncated basis: U = (u_1 · · · u_k) ∈ R^{d×k}

Projection / Encoding: z = U^T x

Reconstruction / Decoding: x̃ = U z = Σ_j u_j u_j^T x
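A minimal numpy sketch of these two steps, assuming data are stored column-wise as above and centered before computing the covariance:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((7, 100))           # toy data: d = 7 features, n = 100 points
    Xc = X - X.mean(axis=1, keepdims=True)      # center each feature

    C = (Xc @ Xc.T) / Xc.shape[1]               # d x d covariance
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
    U = eigvecs[:, ::-1][:, :2]                 # truncated basis (d x k, with k = 2)

    Z = U.T @ Xc                                # projection / encoding: k x n
    X_rec = U @ Z                               # reconstruction / decoding: d x n (lossy for k < d)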

slide-10
SLIDE 10

Review: PCA

[Figure: projections onto the top 2 principal components vs. the bottom 2 components]

Data: three varieties of wheat: Kama, Rosa, Canadian
 Attributes: Area, Perimeter, Compactness, Length of Kernel, 
 Width of Kernel, Asymmetry Coefficient, Length of Groove

slide-11
SLIDE 11

PCA: Complexity

Using eigen-value decomposition

  • Computation of covariance C: O(n d^2)
  • Eigenvalue decomposition: O(d^3)
  • Total complexity: O(n d^2 + d^3)

Data: X = (x_1 · · · x_n) ∈ R^{d×n}
Truncated basis: U = (u_1 · · · u_k) ∈ R^{d×k}

slide-12
SLIDE 12

PCA: Complexity

Using singular-value decomposition

  • Full decomposition: O(min{n d^2, n^2 d})
  • Rank-k decomposition: O(k d n log(n)) (with the power method)


Data: X = (x_1 · · · x_n) ∈ R^{d×n}
Truncated basis: U = (u_1 · · · u_k) ∈ R^{d×k}
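A sketch of the power method for the leading singular vector; repeating it with deflation, or running it on a block of random vectors as in randomized SVD, gives a rank-k decomposition without ever forming the d x d covariance. The function name is illustrative:

    import numpy as np

    def top_singular_vector(X, n_iter=100, seed=0):
        """Power iteration for the leading left singular vector of X (d x n)."""
        d, n = X.shape
        u = np.random.default_rng(seed).standard_normal(d)
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            u = X @ (X.T @ u)                  # apply X X^T without forming it: O(n d) per step
            u /= np.linalg.norm(u)
        sigma = np.linalg.norm(X.T @ u)        # corresponding singular value
        return u, sigma

    X = np.random.default_rng(1).standard_normal((50, 200))
    u, s = top_singular_vector(X)
    s_exact = np.linalg.svd(X, compute_uv=False)[0]
    print(s, s_exact)                          # the two values should agree closely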

slide-13
SLIDE 13

Singular Value Decomposition

Idea: Decompose a d x d matrix M into

  • 1. A change of basis V (unitary matrix)
  • 2. A scaling Σ (diagonal matrix)
  • 3. A change of basis U (unitary matrix)

so that M = U Σ V^T.

slide-14
SLIDE 14

Singular Value Decomposition

Idea: Decompose the d x n matrix X into

  • 1. An n x n basis V (unitary matrix)
  • 2. A d x n matrix Σ (diagonal projection)
  • 3. A d x d basis U (unitary matrix)

X = U_{d×d} Σ_{d×n} V^T_{n×n}
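A quick shape check in numpy (full_matrices=True gives exactly the d x d, d x n, n x n factors above; only the diagonal of Σ is returned, so we rebuild the full matrix):

    import numpy as np

    d, n = 4, 6
    X = np.random.default_rng(0).standard_normal((d, n))

    U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: d x d, s: min(d, n), Vt: n x n
    Sigma = np.zeros((d, n))
    Sigma[:len(s), :len(s)] = np.diag(s)               # embed singular values into a d x n matrix

    print(np.allclose(X, U @ Sigma @ Vt))              # True: X = U Σ V^T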

slide-15
SLIDE 15

Random Projections

Borrowing from:
 David Lopez-Paz & David Duvenaud

slide-16
SLIDE 16

Random Projections

Fast, efficient, and distance-preserving dimensionality reduction!

Example: map x_1, x_2 ∈ R^{40500} to y_1, y_2 ∈ R^{1000} using a random matrix W ∈ R^{40500×1000}. Distances are preserved up to a factor (1 ± ε):

(1 − ε) ||x_1 − x_2||^2 ≤ ||y_1 − y_2||^2 ≤ (1 + ε) ||x_1 − x_2||^2

This result is formalized in the Johnson-Lindenstrauss Lemma.

slide-17
SLIDE 17

Johnson-Lindenstrauss Lemma

The proof is a great example of Erdős' probabilistic method (1947).

Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William B. Johnson (1944-)

For any 0 < ε < 1/2 and any integer m > 4, let k = 20 log(m) / ε^2. Then, for any set V of m points in R^N there exists a map f : R^N → R^k such that for all u, v ∈ V:

(1 − ε) ||u − v||^2 ≤ ||f(u) − f(v)||^2 ≤ (1 + ε) ||u − v||^2

slide-18
SLIDE 18

Johnson-Lindenstrauss Lemma

For any 0 < ε < 1/2 and any integer m > 4, let k = 20 log(m) / ε^2. Then, for any set V of m points in R^N there exists a map f : R^N → R^k such that for all u, v ∈ V:

(1 − ε) ||u − v||^2 ≤ ||f(u) − f(v)||^2 ≤ (1 + ε) ||u − v||^2

This holds when f is a linear function with random coefficients: f(x) = (1/√k) A x, where A ∈ R^{k×N}, k < N, and A_ij ∼ N(0, 1).
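A minimal sketch of such a projection in numpy (dimensions are illustrative; with k = 1,000 the typical distortion of a squared distance is only a few percent):

    import numpy as np

    rng = np.random.default_rng(0)
    N, k, m = 10_000, 1_000, 20            # original dim, target dim, number of points
    X = rng.standard_normal((m, N))        # m points in R^N, stored row-wise

    A = rng.standard_normal((k, N))        # A_ij ~ N(0, 1)
    Y = X @ A.T / np.sqrt(k)               # f(x) = (1/sqrt(k)) A x applied to every point

    d_orig = np.sum((X[0] - X[1]) ** 2)    # squared distance before projection
    d_proj = np.sum((Y[0] - Y[1]) ** 2)    # squared distance after projection
    print(d_proj / d_orig)                 # close to 1, i.e. within the (1 ± ε) bound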

slide-19
SLIDE 19

Example: 20-newsgroups data

Data: 20-newsgroups, from 100,000 features to 300 (0.3%)

slide-20
SLIDE 20

Example: 20-newsgroups data

Data: 20-newsgroups, from 100,000 features to 1,000 (1%)

slide-21
SLIDE 21

Data: 20-newsgroups, from 100,000 features to 10,000 (10%)

Example: 20-newsgroups data

slide-22
SLIDE 22

Data: 20-newsgroups, from 100,000 features to 10,000 (10%)

Example: 20-newsgroups data

Conclusion: RP preserves distances like PCA, but is faster than PCA when the number of dimensions is very large.

slide-23
SLIDE 23

Stochastic Neighbor 
 Embeddings

Borrowing from:
 Laurens van der Maaten
 (Delft -> Facebook AI)

slide-24
SLIDE 24

Manifold Learning

Idea: Perform a non-linear dimensionality reduction
 in a manner that preserves proximity (but not distances)

slide-25
SLIDE 25

Manifold Learning

slide-26
SLIDE 26

PCA on MNIST Digits

slide-27
SLIDE 27

Swiss Roll

Euclidean distance is not always 
 a good notion of proximity

slide-28
SLIDE 28

Non-linear Projection

Bad projection: relative position to neighbors changes

slide-29
SLIDE 29

Non-linear Projection

Intuition: Want to preserve local neighborhood

slide-30
SLIDE 30

Stochastic Neighbor Embedding

Similarity in high dimension:

p_{j|i} = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2)

Similarity in low dimension:

q_{j|i} = exp(−||y_i − y_j||^2) / Σ_{k≠i} exp(−||y_i − y_k||^2)
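These conditional similarities are straightforward to compute directly; a minimal numpy sketch with data stored row-wise (function names are illustrative):

    import numpy as np

    def conditional_similarities(X, sigma):
        """p_{j|i} for data X (n x d) with one bandwidth sigma[i] per point."""
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # n x n
        P = np.exp(-sq_dists / (2.0 * sigma[:, None] ** 2))
        np.fill_diagonal(P, 0.0)                    # exclude k = i from the sum
        return P / P.sum(axis=1, keepdims=True)     # row i is the distribution p_{.|i}

    def low_dim_similarities(Y):
        """q_{j|i} for the embedding Y (n x 2); the bandwidth is fixed (2 sigma^2 = 1)."""
        sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        Q = np.exp(-sq_dists)
        np.fill_diagonal(Q, 0.0)
        return Q / Q.sum(axis=1, keepdims=True)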

slide-31
SLIDE 31

Stochastic Neighbor Embedding

Idea: Optimize yi via gradient descent on C

Similarity of datapoints in high dimension:

p_{j|i} = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2)

Similarity of datapoints in low dimension:

q_{j|i} = exp(−||y_i − y_j||^2) / Σ_{k≠i} exp(−||y_i − y_k||^2)

Cost function:

C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p_{j|i} log(p_{j|i} / q_{j|i})

slide-32
SLIDE 32

Stochastic Neighbor Embedding

Idea: Optimize yi via gradient descent on C

Similarity of datapoints in high dimension:

p_{j|i} = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2)

Similarity of datapoints in low dimension:

q_{j|i} = exp(−||y_i − y_j||^2) / Σ_{k≠i} exp(−||y_i − y_k||^2)

Cost function:

C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p_{j|i} log(p_{j|i} / q_{j|i})

slide-33
SLIDE 33

Stochastic Neighbor Embedding

The gradient has a surprisingly simple form:

∂C/∂y_i = 2 Σ_{j≠i} (p_{j|i} − q_{j|i} + p_{i|j} − q_{i|j}) (y_i − y_j)

The gradient update with a momentum term is given by:

Y^{(t)} = Y^{(t−1)} + η ∂C/∂Y + β(t) (Y^{(t−1)} − Y^{(t−2)})
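A minimal sketch of one such update, assuming P and Q are the n x n matrices of conditional similarities from the earlier sketch (the code takes a descent step, i.e. moves against the gradient to decrease C; names are illustrative):

    import numpy as np

    def sne_gradient(P, Q, Y):
        """dC/dy_i = 2 sum_{j != i} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
        coeff = P - Q + P.T - Q.T                     # n x n; P[i, j] = p_{j|i}
        diffs = Y[:, None, :] - Y[None, :, :]         # n x n x 2 array of (y_i - y_j)
        return 2.0 * np.sum(coeff[:, :, None] * diffs, axis=1)

    def momentum_step(Y, Y_prev, grad, eta=0.1, beta=0.5):
        """One gradient step with momentum: descend on C and keep part of the last move."""
        return Y - eta * grad + beta * (Y - Y_prev)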

slide-34
SLIDE 34

Stochastic Neighbor Embedding

The gradient has a surprisingly simple form:

∂C/∂y_i = 2 Σ_{j≠i} (p_{j|i} − q_{j|i} + p_{i|j} − q_{i|j}) (y_i − y_j)

The gradient update with a momentum term is given by:

Y^{(t)} = Y^{(t−1)} + η ∂C/∂Y + β(t) (Y^{(t−1)} − Y^{(t−2)})

Problem: pj|i is not equal to pi|j

slide-35
SLIDE 35

Symmetric SNE

Minimize a single KL divergence between joint distributions P and Q:

C = KL(P || Q) = Σ_i Σ_{j≠i} p_ij log(p_ij / q_ij)

The obvious way to redefine the pairwise similarities is:

p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / Σ_{k≠l} exp(−||x_l − x_k||^2 / 2σ^2)

q_ij = exp(−||y_i − y_j||^2) / Σ_{k≠l} exp(−||y_l − y_k||^2)
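A minimal sketch of these joint similarities and the cost, with a single bandwidth σ shared by all points (names are illustrative):

    import numpy as np

    def joint_p(X, sigma):
        """p_ij: Gaussian similarities normalized over all pairs k != l."""
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        P = np.exp(-sq / (2.0 * sigma ** 2))
        np.fill_diagonal(P, 0.0)
        return P / P.sum()

    def joint_q(Y):
        """q_ij: similarities in the low-dimensional space, normalized over all pairs."""
        sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        Q = np.exp(-sq)
        np.fill_diagonal(Q, 0.0)
        return Q / Q.sum()

    def symmetric_sne_cost(P, Q):
        """C = KL(P || Q) = sum_{i != j} p_ij log(p_ij / q_ij)."""
        mask = P > 0
        return np.sum(P[mask] * np.log(P[mask] / Q[mask]))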

slide-36
SLIDE 36

Symmetric SNE

Minimize a single KL divergence between joint distributions P and Q:

C = KL(P || Q) = Σ_i Σ_{j≠i} p_ij log(p_ij / q_ij)

The obvious way to redefine the pairwise similarities is:

p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / Σ_{k≠l} exp(−||x_l − x_k||^2 / 2σ^2)

q_ij = exp(−||y_i − y_j||^2) / Σ_{k≠l} exp(−||y_l − y_k||^2)

Problem: How should we choose σ ?

slide-37
SLIDE 37

Choosing the bandwidth

Bad σ: the neighborhood is not local in the manifold

p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / Σ_{k≠l} exp(−||x_l − x_k||^2 / 2σ^2)

slide-38
SLIDE 38

Choosing the bandwidth

p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / Σ_{k≠l} exp(−||x_l − x_k||^2 / 2σ^2)

Good σ: the neighborhood contains 5-50 points

slide-39
SLIDE 39

Choosing the bandwidth

p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / Σ_{k≠l} exp(−||x_l − x_k||^2 / 2σ^2)

Problem: the optimal σ may vary if the density is not uniform

slide-40
SLIDE 40

Choosing the bandwidth

Solution: Define σi per point.

p_ij = (p_{j|i} + p_{i|j}) / 2N

p_{j|i} = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2)
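Reusing the conditional_similarities sketch from earlier, the symmetrized joint distribution is a one-liner:

    # p_ij = (p_{j|i} + p_{i|j}) / (2 N), built from the conditional similarities
    P_cond = conditional_similarities(X, sigma)          # n x n, row i = p_{.|i}
    P_joint = (P_cond + P_cond.T) / (2.0 * X.shape[0])   # symmetric, sums to 1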

slide-41
SLIDE 41

Choosing the bandwidth

Solution: Define σi per point.

p_ij = (p_{j|i} + p_{i|j}) / 2N

p_{j|i} = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2)

slide-42
SLIDE 42

Choosing the bandwidth

Solution: Define σi per point.

p_ij = (p_{j|i} + p_{i|j}) / 2N

p_{j|i} = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2)

slide-43
SLIDE 43

Choosing the bandwidth

Set σ_i to ensure constant perplexity, Perp(P_i) = 2^{H(P_i)}, where H(P_i) = −Σ_j p_{j|i} log_2 p_{j|i} is the Shannon entropy.
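Since perplexity increases monotonically with σ_i, each σ_i can be found by binary search against a user-chosen target (typically between 5 and 50). A minimal sketch (names are illustrative):

    import numpy as np

    def perplexity(p):
        """Perp = 2^H(p), with H the Shannon entropy in bits."""
        p = p[p > 0]
        return 2.0 ** (-np.sum(p * np.log2(p)))

    def find_sigma(sq_dists_i, target=30.0, tol=1e-4, max_iter=64):
        """Binary search for sigma_i so that p_{.|i} has the target perplexity.
        sq_dists_i holds the squared distances from point i to every other point."""
        sq_dists_i = sq_dists_i - sq_dists_i.min()     # stabilize exp; cancels after normalization
        lo, hi = 1e-10, 1e10
        for _ in range(max_iter):
            sigma = (lo + hi) / 2.0
            p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
            p /= p.sum()
            if perplexity(p) > target:
                hi = sigma          # distribution too flat: shrink the bandwidth
            else:
                lo = sigma          # distribution too peaked: grow the bandwidth
            if hi - lo < tol:
                break
        return (lo + hi) / 2.0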

slide-44
SLIDE 44

t-SNE: SNE with a t-Distribution

Similarity in high dimension:

p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / Σ_{k≠l} exp(−||x_l − x_k||^2 / 2σ^2)

Similarity in low dimension (Student-t with one degree of freedom):

q_ij = (1 + ||y_i − y_j||^2)^{−1} / Σ_{k≠l} (1 + ||y_k − y_l||^2)^{−1}

Gradient:

∂C/∂y_i = 4 Σ_{j≠i} (p_ij − q_ij) (1 + ||y_i − y_j||^2)^{−1} (y_i − y_j)
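A minimal numpy sketch of the heavy-tailed low-dimensional similarities and the gradient above, assuming P is the n x n matrix of pairwise p_ij (names are illustrative):

    import numpy as np

    def tsne_q(Y):
        """q_ij under the Student-t kernel, plus the unnormalized kernel itself."""
        sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        W = 1.0 / (1.0 + sq_dists)             # (1 + ||y_i - y_j||^2)^(-1)
        np.fill_diagonal(W, 0.0)
        return W / W.sum(), W

    def tsne_gradient(P, Y):
        """dC/dy_i = 4 sum_{j != i} (p_ij - q_ij)(1 + ||y_i - y_j||^2)^(-1) (y_i - y_j)."""
        Q, W = tsne_q(Y)
        coeff = (P - Q) * W                    # n x n
        diffs = Y[:, None, :] - Y[None, :, :]  # n x n x 2 array of (y_i - y_j)
        return 4.0 * np.sum(coeff[:, :, None] * diffs, axis=1)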

slide-45
SLIDE 45

t-SNE: SNE with a t-Distribution

Similarity in high dimension:

p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / Σ_{k≠l} exp(−||x_l − x_k||^2 / 2σ^2)

Similarity in low dimension (Student-t with one degree of freedom):

q_ij = (1 + ||y_i − y_j||^2)^{−1} / Σ_{k≠l} (1 + ||y_k − y_l||^2)^{−1}

Gradient:

∂C/∂y_i = 4 Σ_{j≠i} (p_ij − q_ij) (1 + ||y_i − y_j||^2)^{−1} (y_i − y_j)

slide-46
SLIDE 46

Symmetric SNE

Minimize a single KL divergence between joint distributions P and Q:

C = KL(P || Q) = Σ_i Σ_{j≠i} p_ij log(p_ij / q_ij)

The obvious way to redefine the pairwise similarities is:

p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / Σ_{k≠l} exp(−||x_l − x_k||^2 / 2σ^2)

q_ij = exp(−||y_i − y_j||^2) / Σ_{k≠l} exp(−||y_l − y_k||^2)

Problem: How should we choose σ ?

slide-47
SLIDE 47

PCA on MNIST Digits

slide-48
SLIDE 48

t-SNE on MNIST Digits

[Figure: t-SNE embedding of MNIST, legend showing digit classes 1-9]

slide-49
SLIDE 49

t-SNE on MNIST Digits

slide-50
SLIDE 50

t-SNE on Olivetti Faces

slide-51
SLIDE 51

t-SNE on Olivetti Faces

slide-52
SLIDE 52

t-SNE on Olivetti Faces

slide-53
SLIDE 53

t-SNE on ImageNet

slide-54
SLIDE 54

t-SNE on ImageNet

slide-55
SLIDE 55

t-SNE on ImageNet

slide-56
SLIDE 56

Next lecture: Recommender Systems