SLIDE 1

compsci 514: algorithms for data science

Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 14.

SLIDE 2

logistics

  • Midterm grades are on Moodle.
  • Average was 32.67, median 33, standard deviation 6.8.
  • Come to office hours if you would like to see your exam/discuss solutions.

SLIDE 3

summary

Last Few Weeks: Low-Rank Approximation and PCA

  • Compress data that lies close to a k-dimensional subspace.
  • Equivalent to finding a low-rank approximation of the data matrix X: X ≈ XVV^T.
  • Optimal solution via PCA (eigendecomposition of X^T X or, equivalently, SVD of X).

This Class: Non-linear dimensionality reduction.

  • How do we compress data that does not lie close to a k-dimensional subspace?
  • Spectral methods (SVD and eigendecomposition) are still key techniques in this setting.
  • Spectral graph theory, spectral clustering.
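
To make the linear case concrete, here is a minimal numpy sketch (not part of the original slides) of the rank-k approximation X ≈ XVV^T via the SVD; the matrix sizes and noise level below are made up for illustration.

```python
import numpy as np

# Hypothetical data: n points in d dimensions lying near a k-dimensional subspace.
n, d, k = 500, 50, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))  # exactly rank k
X += 0.01 * rng.standard_normal((n, d))                        # plus a little noise

# SVD of X: the top-k right singular vectors (columns of Vk) span the best-fit subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T                                                  # d x k, orthonormal columns

# Optimal rank-k approximation: project the rows of X onto that subspace, X ≈ X Vk Vk^T.
X_k = X @ Vk @ Vk.T
print("relative Frobenius error:", np.linalg.norm(X - X_k) / np.linalg.norm(X))
```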

SLIDE 7

entity embeddings

End of Last Class: Embedding objects other than vectors into Euclidean space.

  • Documents (for topic-based search and classification)
  • Words (to identify synonyms, translations, etc.)
  • Nodes in a social network

Usual Approach: Convert each item into a high-dimensional feature vector and then apply low-rank approximation.


SLIDE 12

example: latent semantic analysis

  • If the error ∥X − YZ^T∥_F is small, then on average, X_{i,a} ≈ (YZ^T)_{i,a} = ⟨y_i, z_a⟩.
  • I.e., ⟨y_i, z_a⟩ ≈ 1 when doc_i contains word_a.
  • If doc_i and doc_j both contain word_a, then ⟨y_i, z_a⟩ ≈ ⟨y_j, z_a⟩ = 1.

SLIDE 16

example: latent semantic analysis

If doc_i and doc_j both contain word_a, ⟨y_i, z_a⟩ ≈ ⟨y_j, z_a⟩ = 1.

Another View: Each column of Y represents a 'topic'. y_i(j) indicates how much doc_i belongs to topic j, and z_a(j) indicates how much word_a associates with that topic.
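
A small numpy sketch of the LSA factorization on a made-up term-document count matrix (not from the slides). The toy documents and vocabulary are invented, and the split Y = U_k Σ_k, Z = V_k is one conventional choice (a later slide places Σ_k differently); either way X ≈ YZ^T and X_{i,a} ≈ ⟨y_i, z_a⟩.

```python
import numpy as np

# Toy term-document matrix: rows = documents, columns = words (hypothetical counts).
# Docs 0-1 are about sports, docs 2-3 about cooking.
#              ball  game  team  oven  recipe  flour
X = np.array([[  2,    1,    1,    0,     0,     0],   # doc 0
              [  1,    2,    1,    0,     0,     0],   # doc 1
              [  0,    0,    0,    1,     2,     1],   # doc 2
              [  0,    0,    0,    2,     1,     1]],  # doc 3
             dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# One conventional choice (an assumption here): Y = U_k Sigma_k holds document
# embeddings, Z = V_k holds word embeddings, so X ≈ Y Z^T and X[i, a] ≈ <y_i, z_a>.
Y = U[:, :k] * s[:k]        # n x k document embeddings
Z = Vt[:k].T                # d x k word embeddings

print("doc 0 · doc 1 (same topic):     ", round(Y[0] @ Y[1], 2))
print("doc 0 · doc 2 (different topic):", round(Y[0] @ Y[2], 2))
print("X[0, 0] vs <y_0, z_ball>:       ", X[0, 0], round(Y[0] @ Z[0], 2))
```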

SLIDE 18

example: latent semantic analysis

  • Just like with documents, z_a and z_b will tend to have a high dot product if word_a and word_b appear in many of the same documents.
  • In an SVD decomposition we set Z = Σ_k V_k^T.
  • The columns of V_k are equivalently the top k eigenvectors of XX^T. The eigendecomposition of XX^T is XX^T = VΣ²V^T.
  • What is the best rank-k approximation of XX^T? I.e., arg min over rank-k B of ∥XX^T − B∥_F.
  • XX^T ≈ V_k Σ_k² V_k^T = ZZ^T.
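
The same embedding can be computed directly from the similarity matrix XX^T by eigendecomposition. A sketch, with two assumptions flagged: rows of X index words (so XX^T counts shared documents, matching the word embedding slide that follows), and Z = V_k Σ_k (the transpose of the ordering written above), so that the rows of Z are word embeddings and ZZ^T is the best rank-k approximation of XX^T.

```python
import numpy as np

# Hypothetical word-document incidence matrix: rows = words, columns = documents,
# so S = X X^T counts, for each word pair, how many documents they share.
rng = np.random.default_rng(1)
X = (rng.random((30, 200)) < 0.1).astype(float)   # 30 words, 200 documents
S = X @ X.T                                       # 30 x 30 word-word similarity matrix

# Eigendecomposition of the symmetric PSD matrix S (eigh returns ascending eigenvalues).
k = 5
evals, evecs = np.linalg.eigh(S)
Vk = evecs[:, -k:]                # top-k eigenvectors of XX^T
Lk = evals[-k:]                   # top-k eigenvalues (= squared singular values of X)

# Word embeddings Z = V_k Sigma_k: then Z Z^T = V_k Sigma_k^2 V_k^T is the best
# rank-k approximation of XX^T, as on the slide.
Z = Vk * np.sqrt(Lk)
err = np.linalg.norm(S - Z @ Z.T) / np.linalg.norm(S)
print(f"relative error of rank-{k} approximation of XX^T: {err:.3f}")
```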

SLIDE 24

example: word embedding

LSA gives a way of embedding words into k-dimensional space.

  • Embedding is via low-rank approximation of XX^T, where (XX^T)_{a,b} is the number of documents that both word_a and word_b appear in.
  • Think of XX^T as a similarity matrix (Gram matrix, kernel matrix) with entry (a, b) being the similarity between word_a and word_b.
  • Many ways to measure similarity: the number of sentences both occur in, the number of times both appear in the same window of w words, appearing in similar positions of documents in different languages, etc.
  • Replacing XX^T with these different metrics (sometimes appropriately transformed) leads to popular word embedding algorithms: word2vec, GloVe, fastText, etc.

SLIDE 28

example: word embedding

Note: word2vec is typically described as a neural-network method, but it is really just low-rank approximation of a specific similarity matrix. See "Neural word embedding as implicit matrix factorization", Levy and Goldberg.
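
A rough sketch of that observation in code (not from the slides): build a positive pointwise mutual information (PPMI) matrix from word-word co-occurrence counts and factor it with an SVD. The random counts and the PPMI/symmetric-split choices are illustrative; the exact word2vec (skip-gram with negative sampling) objective corresponds to a shifted PMI matrix, per Levy and Goldberg.

```python
import numpy as np

# Hypothetical symmetric word-word co-occurrence counts C (e.g., from a sliding window).
rng = np.random.default_rng(2)
C = rng.poisson(1.0, size=(40, 40)).astype(float)
C = C + C.T                                   # symmetrize

# Positive PMI: PPMI[a, b] = max(0, log( P(a, b) / (P(a) P(b)) )).
total = C.sum()
Pab = C / total
Pa = C.sum(axis=1) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(Pab / np.outer(Pa, Pa))
ppmi = np.maximum(pmi, 0.0)
ppmi[~np.isfinite(ppmi)] = 0.0                # zero counts -> treat as 0

# Rank-k factorization of the PPMI matrix; rows of W are the word embeddings.
k = 10
U, s, _ = np.linalg.svd(ppmi)
W = U[:, :k] * np.sqrt(s[:k])                 # split the singular values symmetrically
print("embedding matrix shape:", W.shape)
```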

SLIDE 30

similarity via graphs

A common way of encoding similarity is via a graph, e.g., a k-nearest neighbor graph.

  • Connect items to similar items, possibly with higher-weight edges when they are more similar.

Is this set of points compressible? Does it lie close to a low-dimensional subspace?
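
A minimal sketch of building a k-nearest-neighbor graph from raw points (not from the slides); the two-ring point set and k = 5 are made up for illustration.

```python
import numpy as np

# Hypothetical point set: two concentric rings -- data that is not close to any
# low-dimensional subspace, but whose structure a k-NN graph captures well.
rng = np.random.default_rng(3)
def ring(m, radius):
    t = rng.uniform(0, 2 * np.pi, m)
    pts = np.c_[radius * np.cos(t), radius * np.sin(t)]
    return pts + 0.05 * rng.standard_normal(pts.shape)
P = np.vstack([ring(100, 1.0), ring(100, 3.0)])
n = len(P)

# Symmetric k-nearest-neighbor adjacency matrix: connect each point to its k closest.
k = 5
D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
np.fill_diagonal(D2, np.inf)                          # no self loops
A = np.zeros((n, n))
nbrs = np.argsort(D2, axis=1)[:, :k]                  # indices of the k nearest neighbors
for i in range(n):
    A[i, nbrs[i]] = 1.0
A = np.maximum(A, A.T)                                # symmetrize (undirected graph)
print("number of edges:", int(A.sum()) // 2)
```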

SLIDE 35

linear algebraic representation of a graph

Once we have connected n data points x_1, ..., x_n into a graph, we can represent that graph by its (weighted) adjacency matrix A ∈ R^{n×n}, with A_{i,j} = edge weight between nodes i and j.

In the LSA example, when X is the term-document matrix, X^T X is like an adjacency matrix: word_a and word_b are connected if they appear in at least one document together (the edge weight is the number of documents they appear in together).

SLIDE 38

normalized adjacency matrix

What is the sum of entries in the ith column of A? The (weighted) degree of vertex i.

Often, A is normalized as Ā = D^{-1/2} A D^{-1/2}, where D is the degree matrix.

Spectral graph theory is the field of representing graphs as matrices and applying linear algebraic techniques.
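
A small numeric sketch of this normalization (not from the slides); the 4-node weighted graph is made up, and the code assumes every vertex has nonzero degree.

```python
import numpy as np

# A small weighted adjacency matrix and its degree matrix D.
A = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 0.],
              [1., 1., 0., 3.],
              [0., 0., 3., 0.]])
deg = A.sum(axis=1)                       # weighted degrees (row = column sums)
D = np.diag(deg)

# Normalized adjacency matrix A_bar = D^{-1/2} A D^{-1/2}.
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # assumes no isolated vertices
A_bar = D_inv_sqrt @ A @ D_inv_sqrt
print("eigenvalues of A_bar (always within [-1, 1]):",
      np.round(np.linalg.eigvalsh(A_bar), 3))
```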

SLIDE 44

adjacency matrix eigenvectors

How do we compute an optimal low-rank approximation of A?

  • Project onto the top k eigenvectors of A^T A = A². These are just the eigenvectors of A.

SLIDE 47

adjacency matrix eigenvectors

  • Similar vertices (close with regard to graph proximity) should have similar embeddings, i.e., V_k(i) should be similar to V_k(j).

SLIDE 49

spectral embedding
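
A sketch of such a spectral embedding on a made-up graph with two planted communities (not from the slides); the edge probabilities and the dense eigh call are illustrative choices.

```python
import numpy as np

# Hypothetical graph with two planted communities: dense within, sparse across.
rng = np.random.default_rng(4)
n = 100
labels = np.repeat([0, 1], n // 2)
prob = np.where(labels[:, None] == labels[None, :], 0.20, 0.02)  # edge probabilities
A = (rng.random((n, n)) < prob).astype(float)
A = np.triu(A, 1); A = A + A.T                                   # symmetric, no self loops

# Spectral embedding: the coordinates of vertex i are the ith entries of the
# top-k eigenvectors of A.
k = 2
evals, evecs = np.linalg.eigh(A)          # ascending eigenvalues
Vk = evecs[:, -k:]                        # n x k embedding, one row per vertex

# The eigenvector with the second-largest eigenvalue typically splits the two
# communities by sign (the leading one is roughly constant / degree-driven).
u = Vk[:, 0]
print("community 0 mean coordinate:", u[labels == 0].mean().round(3))
print("community 1 mean coordinate:", u[labels == 1].mean().round(3))
```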

SLIDE 50

spectral clustering

A very common task, aside from just embedding points via graph-based similarity and SVD, is to partition or cluster vertices based on this similarity.

Non-linearly separable data.

SLIDE 52

spectral clustering

A very common task, aside from just embedding points via graph-based similarity and SVD, is to partition or cluster vertices based on this similarity.

Community detection in naturally occurring networks.

SLIDE 53

cut minimization

Simple Idea: Partition clusters along a minimum cut in the graph. Small cuts are often not informative.

Solution: Encourage cuts that separate large sections of the graph.

  • Let v ∈ R^n represent a cut: v(i) = 1 if i ∈ S and v(i) = −1 if i ∈ T. We want v to have roughly equal numbers of 1s and −1s, i.e., v^T 1 ≈ 0.

SLIDE 57

the laplacian view

For a graph with adjacency matrix A and degree matrix D, L = D − A is the graph Laplacian. For any vector v,

    v^T L v = v^T D v − v^T A v = Σ_{i=1}^n d(i) v(i)² − Σ_{i=1}^n Σ_{j=1}^n A(i, j) · v(i) · v(j).

SLIDE 59

the laplacian view

For a cut indicator vector v ∈ {−1, 1}^n with v(i) = −1 for i ∈ S and v(i) = 1 for i ∈ T:

    v^T L v = Σ_{(i,j)∈E} (v(i) − v(j))² = 4 · cut(S, T).

So minimizing v^T L v over cut indicator vectors minimizes the cut size:

    arg min_{v ∈ {−1,1}^n} v^T L v.

Relaxing to unit-norm real vectors gives

    arg min_{v ∈ R^n, ∥v∥=1} v^T L v.

By the Courant-Fischer theorem, the relaxed minimizer v is the smallest eigenvector of L = D − A (the eigenvector with the smallest eigenvalue).
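
A quick numerical check of the identities above on a made-up random graph (not from the slides): v^T L v equals the sum of (v(i) − v(j))² over edges, which is 4 · cut(S, T) for a ±1 cut vector.

```python
import numpy as np

# Small random undirected graph (sizes and edge probability are arbitrary).
rng = np.random.default_rng(5)
n = 50
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T
D = np.diag(A.sum(axis=1))
L = D - A                                      # graph Laplacian

v = rng.choice([-1.0, 1.0], size=n)            # random cut: S = {i : v(i) = -1}
quad = v @ L @ v
edges = np.transpose(np.nonzero(np.triu(A)))   # each undirected edge listed once
edge_sum = sum((v[i] - v[j]) ** 2 for i, j in edges)
cut = sum(1 for i, j in edges if v[i] != v[j]) # number of edges crossing the cut
print(quad, edge_sum, 4 * cut)                 # all three values should agree
```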

SLIDE 63

smallest laplacian eigenvector

We have

    v_n = (1/√n) · 1 = arg min_{v ∈ R^n, ∥v∥=1} v^T L v,    with v_n^T L v_n = 0.
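
A two-line check on a made-up 4-node graph (not from the slides): every row of L = D − A sums to zero, so the normalized all-ones vector is an eigenvector with eigenvalue 0 and achieves v^T L v = 0.

```python
import numpy as np

# Tiny example graph (made up).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A
n = len(A)
v_n = np.ones(n) / np.sqrt(n)     # the normalized all-ones vector
print(L @ v_n)                    # ~ all zeros: L annihilates the all-ones direction
print(v_n @ L @ v_n)              # ~ 0, the minimum possible since L is PSD
```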

SLIDE 65

second smallest laplacian eigenvector

By Courant-Fischer, the second smallest eigenvector is obtained greedily:

    v_1 = arg min_{v ∈ R^n, ∥v∥=1} v^T L v
    v_2 = arg min_{v ∈ R^n, ∥v∥=1, v^T v_1 = 0} v^T L v

If v_2 were binary, in {−1, 1}^n, the orthogonality condition would ensure that there are an equal number of vertices on each side of the cut. When v_2 ∈ R^n, it enforces a 'relaxed' version of this constraint.

SLIDE 68

cutting with the second laplacian eigenvector

Find a good partition of the graph by computing

    v_2 = arg min_{v ∈ R^n, ∥v∥=1, v^T 1 = 0} v^T L v.

Set S to be all nodes with v_2(i) < 0 and T to be all nodes with v_2(i) ≥ 0.

The Shi-Malik normalized cuts algorithm is a commonly used variant of this approach, using the normalized Laplacian D^{-1/2} L D^{-1/2}.
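
A sketch of this partitioning step (the unnormalized version, not Shi-Malik) on a made-up two-community graph; for large graphs a sparse eigensolver such as scipy.sparse.linalg.eigsh would replace the dense eigh.

```python
import numpy as np

# Made-up graph with two planted communities.
rng = np.random.default_rng(6)
n = 100
truth = np.repeat([0, 1], n // 2)
prob = np.where(truth[:, None] == truth[None, :], 0.20, 0.02)
A = (rng.random((n, n)) < prob).astype(float)
A = np.triu(A, 1); A = A + A.T

# Second-smallest eigenvector of L = D - A (the Fiedler vector), then threshold at 0.
L = np.diag(A.sum(axis=1)) - A
evals, evecs = np.linalg.eigh(L)         # ascending eigenvalues
v2 = evecs[:, 1]                         # second-smallest eigenvector

side = (v2 < 0).astype(int)              # S = vertices with v2(i) < 0, T = the rest
acc = max((side == truth).mean(), (side != truth).mean())   # agreement up to label swap
print("cluster sizes:", int(side.sum()), int(n - side.sum()),
      "| agreement with planted communities:", round(acc, 2))
```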

SLIDE 72

laplacian embedding

The smallest eigenvectors of L = D − A give the orthogonal 'functions' that are smoothest over the graph, i.e., they minimize

    v^T L v = Σ_{(i,j)∈E} (v(i) − v(j))².

Embedding points with coordinates given by [v_{n−1}(j), v_{n−2}(j), ..., v_{n−k}(j)] ensures that points connected by edges end up close in Euclidean distance.

  • Laplacian Eigenmaps
  • Locally linear embedding
  • Isomap
  • Etc.
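
A sketch of a Laplacian-eigenmap-style embedding using the k eigenvectors with the smallest nonzero eigenvalues (not from the slides); the ring graph is a stand-in for a real k-NN similarity graph.

```python
import numpy as np

# Made-up graph: a ring of n vertices, each joined to its two neighbors.
n, k = 30, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

# Coordinates from the k smallest nonzero eigenvectors of L (skip the all-ones one).
L = np.diag(A.sum(axis=1)) - A
evals, evecs = np.linalg.eigh(L)                  # ascending eigenvalues
coords = evecs[:, 1:k + 1]                        # n x k embedding

# Vertices joined by an edge get nearby coordinates (the ring embeds as a circle).
neighbor_d = np.linalg.norm(coords[1:] - coords[:-1], axis=1).mean()
all_pairs_d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1).mean()
print("mean embedded distance between neighbors:", neighbor_d.round(3))
print("mean embedded distance over all pairs:   ", all_pairs_d.round(3))
```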

SLIDE 76

Questions?