SLIDE 1

Spectral Clustering

Guokun Lai 2016/10

SLIDE 2

Organization

◮ Graph Cut
◮ Fundamental Limitations of Spectral Clustering
◮ Ng 2002 paper (if we have time)

SLIDE 3

Notation

◮ We define an undirected weighted graph G(V, E), where V is the node set of G and E is the edge set of G. The adjacency matrix is $W_{ij} = E(i, j)$, with $W_{ij} \geq 0$.

◮ The degree matrix $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $D_{i,i} = \sum_{j=1}^{n} W_{i,j}$.

◮ The Laplacian matrix $L \in \mathbb{R}^{n \times n}$ is $L = D - W$.

◮ Indicator vector of a cluster: the indicator vector $I_C$ of a cluster C is

$$I_{C,i} = \begin{cases} 1 & v_i \in C \\ 0 & \text{otherwise} \end{cases} \quad (1)$$
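To make the notation concrete, here is a minimal NumPy sketch (not from the slides; the toy weights are made up for illustration) that builds W, the degree matrix D, the unnormalized Laplacian L, and an indicator vector.

```python
# Toy illustration of the notation: adjacency matrix W, degree matrix D,
# unnormalized Laplacian L = D - W, and the indicator vector of a cluster.
import numpy as np

# Small symmetric weighted adjacency matrix for a 4-node graph (made up).
W = np.array([[0.0, 1.0, 0.2, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.2, 0.1, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])

D = np.diag(W.sum(axis=1))   # degree matrix, D_ii = sum_j W_ij
L = D - W                    # unnormalized graph Laplacian

C = [0, 1]                   # a cluster given as node indices
I_C = np.zeros(W.shape[0])
I_C[C] = 1.0                 # indicator vector of the cluster
```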

SLIDE 4

Graph Cut

The intuition of clustering is to separate points into different groups according to their similarities. If we try to separate the node set V into two disjoint sets A and B, we define

$$Cut(A, B) = \sum_{i \in A, j \in B} w_{ij}$$

If we split the node set into K disjoint sets, then

$$Cut(A_1, \cdots, A_K) = \sum_{i=1}^{K} Cut(A_i, \bar{A_i})$$

where $\bar{A_i}$ is the complement set of $A_i$.
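A minimal sketch (assumed, not from the slides) of the cut value between two node sets, using the same toy adjacency matrix as before.

```python
# Cut(A, B) = total weight of edges running between node sets A and B.
import numpy as np

def cut(W, A, B):
    """Sum of edge weights between node index sets A and B."""
    return W[np.ix_(np.asarray(A), np.asarray(B))].sum()

# Hypothetical 4-node graph; the partition {0,1} | {2,3} cuts weight 0.3.
W = np.array([[0.0, 1.0, 0.2, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.2, 0.1, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])
print(cut(W, [0, 1], [2, 3]))   # -> 0.3
```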

SLIDE 5

Defect of Graph Cut

The simplest idea for clustering the node set V is to find a partition that minimizes the Graph Cut function. But in practice this often leads to solutions in which one subset contains only a few nodes.

SLIDE 6

Normalized Cut

To overcome this defect of the Graph Cut, Shi proposed a new cost function that regularizes the size of the subsets. First, we define $Vol(A) = \sum_{i \in A, j \in V} w(i, j)$, and we have

$$Ncut(A, B) = \frac{cut(A, B)}{Vol(A)} + \frac{cut(A, B)}{Vol(B)}$$
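A minimal sketch (assumed) of Vol(A) and Ncut(A, B), reusing the `cut()` helper from the previous sketch.

```python
# Vol(A) = total weight of edges incident to nodes in A; Ncut penalizes
# cuts that isolate low-volume subsets.
import numpy as np

def vol(W, A):
    return W[np.asarray(A), :].sum()

def ncut(W, A, B):
    c = cut(W, A, B)            # cut() defined in the earlier sketch
    return c / vol(W, A) + c / vol(W, B)
```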

SLIDE 7

Relation between NCut and Spectral Clustering

Given a vertex subset $A_i \subseteq V$, we define the vector $f_i = I_{A_i} \cdot \frac{1}{\sqrt{Vol(A_i)}}$. Then we can write the optimization problem as

$$\min_{A_i} \; NCut = \frac{1}{2} \sum_{i=1}^{k} f_i^T L f_i = \frac{1}{2} Tr(F^T L F) \quad s.t. \quad f_i = I_{A_i} \cdot \frac{1}{\sqrt{Vol(A_i)}}, \quad F^T D F = I \quad (2)$$

SLIDE 8

Optimization

Because of the constraint $f_i = I_{A_i} \cdot \frac{1}{\sqrt{Vol(A_i)}}$, the optimization problem is NP-hard. So we relax this constraint and allow $f_i \in \mathbb{R}^n$. Then the optimization problem is

$$\min_{f_i} Tr(F^T L F) \quad s.t. \quad F^T D F = I \quad (3)$$

The solution is given by the eigenvectors of $D^{-1}L$ corresponding to its k smallest eigenvalues. Based on F, we recover the $A_i$ by the k-means algorithm.
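A minimal sketch (assumed, not the authors' code) of this relaxed procedure: solve the generalized eigenproblem $Lf = \lambda Df$ (equivalent to the eigenvectors of $D^{-1}L$, and automatically satisfying $F^T D F = I$), then run k-means on the rows of F.

```python
# Normalized spectral clustering via the relaxed NCut problem.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_spectral_clustering(W, k):
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Generalized eigenproblem L f = lambda D f; eigh returns ascending order
    # with eigenvectors normalized so that F^T D F = I.
    _, eigvecs = eigh(L, D)
    F = eigvecs[:, :k]                       # k smallest eigenpairs
    return KMeans(n_clusters=k, n_init=10).fit_predict(F)
```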

SLIDE 9

Unnormalized Laplacian Matrix

Similar to the above approach, we can prove that the eigenvectors of the unnormalized Laplacian matrix are the relaxed solution for

$$RatioCut(A, B) = \frac{cut(A, B)}{|A|} + \frac{cut(A, B)}{|B|}$$

We can set $f_i = I_{A_i} \cdot \frac{1}{\sqrt{|A_i|}}$ and get the relaxed optimization problem

$$\min_{f_i} Tr(F^T L F) \quad s.t. \quad F^T F = I \quad (4)$$
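A minimal sketch (assumed) of the unnormalized (RatioCut) variant; it differs from the previous sketch only in the eigenproblem, which is now the ordinary one $Lf = \lambda f$, i.e. $F^T F = I$ instead of $F^T D F = I$.

```python
# Unnormalized spectral clustering (RatioCut relaxation).
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(W, k):
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = eigh(L)                     # eigenvalues in ascending order
    F = eigvecs[:, :k]                       # k smallest eigenvectors, F^T F = I
    return KMeans(n_clusters=k, n_init=10).fit_predict(F)
```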

SLIDE 10

Approximation

The solution from the spectral method is only an approximation to the Normalized Cut objective function, and there is no bound on the gap between them. We can easily construct cases where the solution of the relaxed problem is very different from that of the original problem.

SLIDE 11

Experimental Results of the Shi Paper

SLIDE 12

Organization

◮ Graph Cut
◮ Fundamental Limitations of Spectral Clustering
◮ Ng 2002 paper (if we have time)

SLIDE 13

Fundamental Limitations of Spectral Clustering

As mentioned above, spectral clustering approximately solves the Normalized Graph Cut objective function. But is the Normalized Graph Cut a good criterion for all situations?

SLIDE 14

Limitation of NCut

The NCut function tends to capture the global structure of the graph. But sometimes we may want to extract more local features of the graph. For example, the Normalized Graph Cut cannot separate the Gaussian distribution from the band.

SLIDE 15

Limitation of Spectral Clustering

Next we analyze the spectral method from the viewpoint of a random walk process. We define the Markov transition matrix as $M = D^{-1}W$, with eigenvalues $\lambda_i$ and eigenvectors $v_i$. The random walk on the graph converges to the unique equilibrium distribution $\pi_s$. Then we can find the relationship between the eigenvectors and the "diffusion distance" between points:

$$\sum_{j} \lambda_j^{2t} \left( v_j(x) - v_j(y) \right)^2 = \| p(z, t \mid x) - p(z, t \mid y) \|^2_{L_2(1/\pi_s)}$$

So we see that the spectral method tries to capture the major patterns of the random walk on the whole graph.
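A minimal sketch (assumed) of the random-walk quantities on this slide: the transition matrix $M = D^{-1}W$, its stationary distribution $\pi_s$, and the diffusion distance computed directly from the t-step transition probabilities $p(z, t \mid x)$ (the rows of $M^t$), weighted by $1/\pi_s$.

```python
# Diffusion distance between nodes x and y at time t, computed from M^t.
import numpy as np

def diffusion_distance(W, x, y, t):
    d = W.sum(axis=1)
    M = W / d[:, None]                        # M = D^{-1} W
    pi_s = d / d.sum()                        # stationary distribution of the walk
    Mt = np.linalg.matrix_power(M, t)         # row i holds p(z, t | i)
    diff = Mt[x] - Mt[y]
    return np.sqrt(np.sum(diff ** 2 / pi_s))  # L2(1/pi_s) norm
```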

SLIDE 16

Limitation of Spectral Clustering

But this method fails in situations where the scales of the clusters are very different.

SLIDE 17

Self-Tuning Spectral Clustering

One way to handle the above case is to accelerate the random walk process in low-density areas. We define the affinity between nodes as

$$A_{i,j} = \exp\left( -\frac{d(v_i, v_j)^2}{\sigma_i \sigma_j} \right)$$

where $\sigma_i = d(v_i, v_k)$ and $v_k$ is the k-th nearest neighbor of $v_i$.
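A minimal sketch (assumed) of this locally scaled affinity matrix; the default k = 7 and the zeroed diagonal are my assumptions, not from the slide.

```python
# Self-tuning (locally scaled) affinity: A_ij = exp(-d(v_i, v_j)^2 / (sigma_i * sigma_j)).
import numpy as np
from scipy.spatial.distance import cdist

def self_tuning_affinity(X, k=7):
    D = cdist(X, X)                           # pairwise Euclidean distances
    # sigma_i = distance to the k-th nearest neighbor; column 0 of the sorted
    # rows is the point itself at distance 0.
    sigma = np.sort(D, axis=1)[:, k]
    A = np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(A, 0.0)                  # drop self-loops (a common choice)
    return A
```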

SLIDE 18

Result of Self-Tuning Spectral Clustering

SLIDE 19

Failure case

SLIDE 20

Another solution

The paper proposes another solution: split the graph into two subsets recursively. The stopping criterion is based on the relaxation time of the graph, $\tau_V = 1/(1 - \lambda_2)$.

◮ If the sizes of the two subsets after splitting are comparable, we expect $\tau_V \gg \tau_1 + \tau_2$.

◮ Otherwise, we expect $\max(\tau_1, \tau_2) \gg \min(\tau_1, \tau_2)$.

If the partition satisfies either condition, we accept the separation and continue to split the subsets; if not, we stop. But the paper does not address how to handle the K-way clustering problem.
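A minimal sketch (assumed) of this stopping rule. The relaxation time uses the second largest eigenvalue of $M = D^{-1}W$, computed here via the symmetrically normalized matrix $D^{-1/2} W D^{-1/2}$, which has the same spectrum; the numeric threshold standing in for ">>" is my assumption.

```python
# Relaxation-time criterion for accepting a 2-way split.
import numpy as np

def relaxation_time(W):
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    M_sym = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # same spectrum as D^{-1} W
    lambda_2 = np.linalg.eigvalsh(M_sym)[-2]                # second largest eigenvalue
    return 1.0 / (1.0 - lambda_2)

def accept_split(W, idx_1, idx_2, factor=10.0):
    """Accept the split if either condition on the slide holds; `factor`
    encodes the '>>' comparisons and is an assumed threshold."""
    tau_V = relaxation_time(W)
    tau_1 = relaxation_time(W[np.ix_(idx_1, idx_1)])
    tau_2 = relaxation_time(W[np.ix_(idx_2, idx_2)])
    return (tau_V > factor * (tau_1 + tau_2)) or \
           (max(tau_1, tau_2) > factor * min(tau_1, tau_2))
```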

SLIDE 21

Tong Zhang 2007 paper

This paper gives an upper bound on the expected error for the semi-supervised learning task on graphs. Because of the limited presentation time, I will just introduce an interesting conclusion of this paper.

SLIDE 22

S-Normalized Laplacian Matrix

We define the S-Normalized Laplacian Matrix as $L_S = S^{-1/2} L S^{-1/2}$, where S is a diagonal matrix. According to the analysis of this paper, the best choice of S is $S_{i,i} = |C_j|$, where $C_j$ is the cluster containing node i and $|C_j|$ is its size. So this is an approach that tries to address the different-scale cluster problem that spectral clustering cannot deal with. We can see that this is similar to self-tuning spectral clustering: it renormalizes the adjacency matrix as

$$\hat{W}_{ij} = \frac{W_{ij}}{\sqrt{|C_i|}\sqrt{|C_j|}}$$

SLIDE 23

S-Normalized Laplacian Matrix

But we don't know $|C_j|$; the author proposed a method to compute it approximately. We define $K^{-1} = \alpha I + L_S$, $\alpha \in \mathbb{R}$. Consider the ideal case in which the graph has q disjoint connected components. Then we can prove that, as $\alpha \to 0$,

$$\alpha K = \sum_{j=1}^{q} \frac{1}{|C_j|} v_j v_j^T + O(\alpha)$$

where $v_j$ is the indicator vector of cluster j. So for a small $\alpha$, we can assume $K_{i,i} \propto 1/|C_j|$, and we can then set $S_{i,i} \propto 1/K_{i,i}$.
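A minimal sketch (assumed) of this size-estimation idea: with $K = (\alpha I + L)^{-1}$ and small $\alpha$, $\mathrm{diag}(K) \approx 1/(\alpha |C_j|)$ for nodes in component $C_j$, so $1/(\alpha K_{ii})$ recovers the component sizes. For simplicity the sketch uses the unnormalized L (i.e., it starts from S = I), which is my assumption.

```python
# Estimate per-node cluster sizes from the diagonal of (alpha*I + L)^{-1}.
import numpy as np

def estimate_cluster_sizes(W, alpha=1e-6):
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    K = np.linalg.inv(alpha * np.eye(n) + L)
    return 1.0 / (alpha * np.diag(K))   # ~ |C_j| for each node's component

# On a graph with two disconnected components of sizes 2 and 3, the returned
# vector should be close to [2, 2, 3, 3, 3].
```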

SLIDE 24

Comparison

SLIDE 25

Organization

◮ Graph Cut
◮ Fundamental Limitations of Spectral Clustering
◮ Ng 2002 paper (if we have time)

SLIDE 26

Ng 2002 paper

This paper analyzes the spectral clustering problem based on matrix perturbation theory. It obtains an error bound for the spectral clustering algorithm under several assumptions.

SLIDE 27

Algorithm

◮ Define the weighted adjacency matrix W, and construct the Laplacian matrix $L = D^{-1/2} W D^{-1/2}$.

◮ Find $x_1, \cdots, x_k$, the k largest eigenvectors of L, and form the matrix $X = [x_1 \cdots x_k] \in \mathbb{R}^{n \times k}$.

◮ Normalize every row of X to have unit length: $Y_{ij} = X_{ij} / (\sum_j X_{ij}^2)^{1/2}$.

◮ Treating each row of Y as a point in $\mathbb{R}^k$, cluster them into k clusters via K-means.
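A minimal sketch (assumed, not the authors' code) of the four steps above from the Ng, Jordan & Weiss paper: normalized affinity, top-k eigenvectors, row normalization, then k-means.

```python
# Ng-Jordan-Weiss style spectral clustering.
import numpy as np
from sklearn.cluster import KMeans

def ng_spectral_clustering(W, k):
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]    # D^{-1/2} W D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)                       # ascending eigenvalues
    X = eigvecs[:, -k:]                                  # k largest eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit-length rows
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```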

SLIDE 28

Ideal Case

Assume the graph G contains K clusters and no cross-cluster edges. In this case, the matrix L has exactly K eigenvectors with eigenvalue 1.

SLIDE 29

Y Matrix of Ideal Case

After running the algorithm on this graph, we obtain the Y matrix shown on the slide, where R is any rotation matrix, and the rows of Y naturally cluster into 3 groups.

SLIDE 30

The general case

In real-world data, we do have cross-cluster edges. So the author analyzes the influence of the cross-cluster edges on the Y matrix based on matrix perturbation theory.

SLIDE 31

The general case

Assumption 1

There exists $\delta > 0$ so that, for the second largest eigenvalue of each cluster $i = 1, \cdots, k$, we have $\lambda_2^{i} \leq 1 - \delta$.

Assumption 2

There is some fixed $\epsilon_1 > 0$, so that for every $i_1, i_2 \in \{1, \cdots, k\}$, $i_1 \neq i_2$, we have

$$\sum_{j \in S_{i_1}} \sum_{k \in S_{i_2}} \frac{W_{jk}^2}{\hat{d}_j \hat{d}_k} \leq \epsilon_1$$

where $\hat{d}_i$ is the degree of node i within its own cluster. The intuition of this inequality is to limit the weight of the cross-cluster edges compared to the weight of the intra-cluster edges.

SLIDE 32

The general case

Assumption 3

There is some fixed $\epsilon_2 > 0$, so that for every $j \in S_i$, we have

$$\sum_{k \notin S_i} \frac{W_{jk}^2}{\hat{d}_j} \leq \epsilon_2 \left( \sum_{k,l \in S_i} \frac{W_{kl}^2}{\hat{d}_k \hat{d}_l} \right)^{-1/2}$$

The intuition of this inequality is also to limit the weight of the cross-cluster edges compared to the weight of the intra-cluster edges.

Assumption 4

There is some constant C > 0 so that for every $i = 1, \cdots, k$, $j = 1, \cdots, n_i$, we have

$$\hat{d}_j^{(i)} \geq \left( \sum_{k=1}^{n_i} \hat{d}_k^{(i)} \right) / (C n_i)$$

The intuition of this inequality is that no point in a cluster should be "too much less" connected than the other points in the same cluster.

SLIDE 33

The general case

If all of the assumptions hold, set $\epsilon = \sqrt{k(k-1)\epsilon_1 + k\epsilon_2^2}$. If $\delta > (2 + \sqrt{2})\epsilon$, then there exist k orthogonal vectors $r_1, \cdots, r_k$ so that

$$\frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \| y_j^{(i)} - r_i \|_2^2 \leq 4C(4 + 2\sqrt{k})^2 \frac{\epsilon^2}{(\delta - \sqrt{2}\epsilon)^2}$$

SLIDE 34

Liu’s 2016 paper

Motivation

◮ The original semi-supervised learning problem can be formalized as

$$\min_{f} \sum_{i} \ell(f_i, y_i) + f^T L f$$

◮ We can enrich the label propagation patterns through a spectrum transformation, which is called ST-enhanced semi-supervised learning:

$$\min_{f} \sum_{i} \ell(f_i, y_i) + f^T \sigma(L) f$$

SLIDE 35

Spectral Transform

We can write $L = \sum_i \lambda_i \phi_i \phi_i^T$ and define $\theta_i = \sigma(\lambda_i)^{-1}$, where $\sigma(x)$ should be a non-decreasing function. Substituting this into the objective function gives

$$\min_{f} C(f; \theta) = \sum_{i \in \tau} \ell(f_i, y_i) + \gamma \sum_{i=1}^{m} \theta_i^{-1} \langle \phi_i, f \rangle^2$$

where $\theta_1 \geq \theta_2 \geq \cdots \geq \theta_m \geq 0$.
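A minimal sketch (assumed) of evaluating this objective: eigendecompose L, set $\theta_i = \sigma(\lambda_i)^{-1}$, and compute $C(f; \theta)$ on the labeled set. The squared loss and the example $\sigma$ are my assumptions, not from the paper.

```python
# ST-enhanced objective C(f; theta) = sum of losses + gamma * sum_i theta_i^{-1} <phi_i, f>^2.
import numpy as np

def st_objective(L, f, y, labeled_idx, gamma=1.0, sigma=lambda lam: lam + 1e-3):
    lam, phi = np.linalg.eigh(L)              # L = sum_i lambda_i phi_i phi_i^T
    theta = 1.0 / sigma(lam)                  # theta_i = sigma(lambda_i)^{-1}
    loss = np.sum((f[labeled_idx] - y[labeled_idx]) ** 2)   # assumed squared loss
    reg = gamma * np.sum((phi.T @ f) ** 2 / theta)          # sum_i theta_i^{-1} <phi_i, f>^2
    return loss + reg
```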

SLIDE 36

Joint Optimization

We can try to jointly optimize the spectral coefficients $\theta$ and the labels $f$, which gives

$$\min_{\theta} \left( \min_{f} C(f; \theta) \right) + \tau \|\theta\|_1$$

We can prove that this function is convex in $\theta$. The optimization process can be described as follows: first, with $\theta$ fixed, optimize the convex problem over $f$; after that, optimize $\theta$ over its domain.
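A minimal sketch (assumed, not the paper's algorithm) of this alternating scheme, using a squared loss and ignoring the ordering constraint $\theta_1 \geq \cdots \geq \theta_m$ for simplicity; with those assumptions the $\theta$-step has the closed form $\theta_i = \sqrt{\gamma c_i / \tau}$ for $c_i = \langle \phi_i, f \rangle^2$.

```python
# Alternating optimization over f (labels) and theta (spectral coefficients).
import numpy as np

def alternating_st(L, y, labeled_idx, gamma=1.0, tau=0.1, iters=20):
    n = L.shape[0]
    _, phi = np.linalg.eigh(L)
    theta = np.ones(n)
    S = np.zeros((len(labeled_idx), n))
    S[np.arange(len(labeled_idx)), labeled_idx] = 1.0        # selects labeled nodes
    for _ in range(iters):
        # f-step: minimize ||S f - y||^2 + gamma * f^T Phi diag(1/theta) Phi^T f.
        Q = phi @ np.diag(1.0 / theta) @ phi.T
        f = np.linalg.solve(S.T @ S + gamma * Q, S.T @ y)
        # theta-step: minimize gamma * c_i / theta_i + tau * theta_i per coordinate
        # (ordering constraint ignored in this sketch).
        c = (phi.T @ f) ** 2
        theta = np.sqrt(gamma * c / tau) + 1e-12             # keep strictly positive
    return f, theta
```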

SLIDE 37

Proof of convexity

We can rewrite the objective function using the dual form of $C(f; \theta)$, denoted $C^*(u; \theta)$:

$$\min_{\theta} \left( \max_{u} C^*(u; \theta) \right) + \tau \|\theta\|_1$$

where $C^*(u; \theta) = -w(-u) - \frac{1}{4\gamma} \sum_i \theta_i \langle \phi_i, u \rangle^2$, and $-w(-u)$ is the conjugate function of the loss $\ell$. For each fixed u, $C^*(u; \theta)$ is linear in $\theta$, so the objective is the point-wise maximum of a family of convex functions of $\theta$ and is therefore still convex in $\theta$.
