
1-means clustering and conductance

Twan van Laarhoven

Institute for Computing and Information Sciences, Radboud University Nijmegen, The Netherlands

November 11th, 2016

1 / 30


Outline

Network community detection with conductance
The relation to k-means clustering
Algorithms
Experiments
Conclusions

2 / 30


Network community detection

Global community detection: given a network, find all tightly connected sets of nodes (communities).

Local community detection: given a network and a seed node, find the community/communities containing that seed, without inspecting the whole graph.

4 / 30


Communities as optima

Graphs: G = (V, E), with a_ij = a_ji = 1 if (i, j) ∈ E and 0 otherwise.
Score function: φ_G : C(G) → ℝ.
Note: I'll treat sets and vectors interchangeably, so C(G) = P(V) or C(G) = ℝ^V.

5 / 30

Conductance

Definition: the fraction of incident edges that leave the community,

    φ(c) = #{(i, j) ∈ E | i ∈ c, j ∉ c} / #{(i, j) ∈ E | i ∈ c, j ∈ V},

or equivalently

    φ(c) = 1 − (Σ_{i,j∈V} c_i a_ij c_j) / (Σ_{i,j∈V} c_i a_ij),

where c_i ∈ {0, 1}. A very popular objective for finding network communities.

6 / 30
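
To make the definition concrete, here is a minimal NumPy sketch on a toy graph (two triangles joined by one edge); the function and the example are illustrative additions, not part of the talk:

```python
import numpy as np

def conductance(A, c):
    """phi(c) = 1 - (sum_ij c_i a_ij c_j) / (sum_ij c_i a_ij).

    A: symmetric 0/1 adjacency matrix; c: 0/1 indicator vector.
    """
    c = np.asarray(c, dtype=float)
    internal = c @ A @ c            # counts each edge inside c twice
    incident = c @ A.sum(axis=1)    # sum of degrees of the nodes in c
    return 1.0 - internal / incident

# Two triangles joined by one edge; c marks the left triangle.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(conductance(A, [1, 1, 1, 0, 0, 0]))  # 1/7: one of 7 incident edge ends leaves
```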

Continuous optimization

As an optimization problem:

    minimize_c φ(c) subject to c_i ∈ {0, 1} for all i,

relaxed to the box constraints 0 ≤ c_i ≤ 1 for all i.

Karush-Kuhn-Tucker conditions: c is a local optimum of the relaxed problem if for all i, 0 ≤ c_i ≤ 1 and

    ∇φ(c)_i ≥ 0 if c_i = 0,
    ∇φ(c)_i = 0 if 0 < c_i < 1,
    ∇φ(c)_i ≤ 0 if c_i = 1.

7 / 30

Local optima

Local optima are discrete: if c is a strict local minimum of φ, then c_i ∈ {0, 1} for all i.

Proof sketch: view φ as a function of a single c_i:

    φ(c_i) = (α₁ + α₂ c_i + α₃ c_i²) / (α₄ + α₅ c_i).

If 0 < c_i < 1 and φ′(c_i) = 0, then φ″(c_i) = 2α₃ / (α₄ + α₅ c_i)³ ≥ 0.

8 / 30

Outline

Network community detection with conductance
The relation to k-means clustering
Algorithms
Experiments
Conclusions

9 / 30

k-means clustering

k-means clustering:

    minimize_c Σ_{i=1..n} Σ_{j=1..k} c_ij ‖x_i − µ_j‖₂²

Weighted k-means clustering:

    minimize_c Σ_{i=1..n} Σ_{j=1..k} w_i c_ij ‖x_i − µ_j‖₂²

both subject to the constraint that exactly one c_ij is 1 for every i.

1-means clustering (a single cluster, with a per-point penalty λ_i for leaving point i out):

    minimize_c Σ_i w_i (c_i ‖x_i − µ‖₂² + (1 − c_i) λ_i)

10 / 30

k-means clustering (cont.)

Optimal µ: fix the cluster assignment c, then

    µ = (Σ_i w_i c_i x_i) / (Σ_i w_i c_i).

Optimal c: fix µ, then c_i is 1 if ‖x_i − µ‖₂² < λ_i, and 0 otherwise.

11 / 30
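
These two updates suggest the obvious alternating algorithm. A minimal sketch; the function name, the empty-cluster guard, and the convergence check are mine, not from the talk:

```python
import numpy as np

def one_means(X, w, lam, max_iter=100):
    """Alternating optimization for 1-means clustering.

    X: (n, d) points; w: (n,) weights; lam: (n,) penalties lambda_i.
    Minimizes sum_i w_i * (c_i ||x_i - mu||^2 + (1 - c_i) lam_i).
    """
    c = np.ones(len(X))                     # start with every point in the cluster
    mu = None
    for _ in range(max_iter):
        if (w * c).sum() == 0:              # empty cluster: nothing left to update
            break
        mu = (w * c) @ X / (w * c).sum()    # optimal centroid for fixed c
        d2 = ((X - mu) ** 2).sum(axis=1)    # squared distances ||x_i - mu||^2
        c_new = (d2 < lam).astype(float)    # optimal assignment for fixed mu
        if np.array_equal(c_new, c):        # fixed point reached
            break
        c = c_new
    return c, mu
```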

Kernel k-means clustering

Kernels: K(i, j) = ⟨x_i, x_j⟩, so

    ‖x_i − x_j‖₂² = K(i, i) + K(j, j) − 2 K(i, j).

Implicit centroid: the centroid is a linear combination of points, µ = Σ_i µ_i x_i, giving

    ‖x_i − µ‖₂² = K(i, i) − 2 Σ_j µ_j K(i, j) + Σ_{j,k} µ_j K(j, k) µ_k.

The optimal µ becomes µ_i = w_i c_i / (Σ_j w_j c_j).

12 / 30
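
In code, the kernelized distance needs only the kernel matrix K, the weights w, and the assignment c. A small sketch under the definitions above (illustrative; assumes K is precomputed):

```python
import numpy as np

def kernel_dist2(K, w, c):
    """||x_i - mu||^2 for every point i, from the kernel matrix alone.

    The centroid mu is implicit: mu_j = w_j c_j / sum_k w_k c_k.
    """
    mu = (w * c) / (w * c).sum()              # implicit centroid coefficients
    return np.diag(K) - 2.0 * K @ mu + mu @ K @ mu
```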

Kernel k-means clustering (cont.)

Substituting the implicit centroid, µ_i = w_i c_i / (Σ_j w_j c_j), into the 1-means objective

    minimize_c Σ_i w_i (c_i ‖x_i − µ‖₂² + (1 − c_i) λ_i)

gives

    minimize_c Σ_i ( w_i c_i K(i, i) − 2 w_i c_i Σ_j µ_j K(i, j) + w_i c_i Σ_{j,k} µ_j K(j, k) µ_k + w_i (1 − c_i) λ_i )

    = Σ_i w_i c_i (K(i, i) − λ_i) + Σ_i w_i λ_i − (Σ_{i,j} w_i c_i w_j c_j K(i, j)) / (Σ_i w_i c_i).

Taking λ_i = K(i, i), the first term vanishes and the second is a constant, leaving

    minimize_c 1 − (Σ_{i,j} w_i c_i w_j c_j K(i, j)) / (Σ_i w_i c_i).

13 / 30

What is the kernel?

Idea: take K = W⁻¹AW⁻¹ with w_i = Σ_j a_ij. This turns the objective into

    minimize_c 1 − (Σ_{i,j} c_i c_j a_ij) / (Σ_{i,j} c_i a_ij) = φ(c).

We get conductance! But this kernel is not positive definite.

14 / 30
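
A quick numerical sanity check of this identity, reusing the two-triangle toy graph from the conductance sketch above (my check, not the talk's):

```python
import numpy as np

# Two triangles joined by one edge (same toy graph as before).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
w = A.sum(axis=1)                 # node degrees w_i = sum_j a_ij
K = A / np.outer(w, w)            # K = W^-1 A W^-1

c = np.array([1, 1, 1, 0, 0, 0], dtype=float)
kernel_obj = 1 - (w * c) @ K @ (w * c) / (w @ c)   # kernel 1-means objective
phi = 1 - c @ A @ c / (w @ c)                      # conductance
print(np.isclose(kernel_obj, phi))                 # True
```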

Positive definite kernel

Add a diagonal: K = W⁻¹AW⁻¹ + σW⁻¹. The objective becomes

    minimize_c 1 − (Σ_{i,j} c_i c_j a_ij) / (Σ_{i,j} c_i a_ij) − σ (Σ_{i,j} c_i² a_ij) / (Σ_{i,j} c_i a_ij) = φ_σ(c).

When c_i ∈ {0, 1}, c_i² = c_i, so the last term is constant.

15 / 30
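
A direct transcription of φ_σ, to make the "constant last term" point checkable; a sketch that reuses A, c, and conductance from the snippets above:

```python
import numpy as np

def phi_sigma(A, c, sigma):
    """phi_sigma(c) = 1 - (c'Ac)/(w'c) - sigma * (sum_i c_i^2 w_i)/(w'c)."""
    w = A.sum(axis=1)
    return 1 - c @ A @ c / (w @ c) - sigma * (c * c) @ w / (w @ c)

# For any 0/1 vector c, c_i^2 = c_i, so the extra term equals sigma and
# phi_sigma(A, c, sigma) == conductance(A, c) - sigma: a constant shift.
# On fractional c the extra term genuinely changes the landscape.
print(np.isclose(phi_sigma(A, c, 0.5), conductance(A, c) - 0.5))  # True
```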

A look at local optima

Relaxing the optimization problem:

    minimize φ_σ(c) subject to 0 ≤ c_i ≤ 1 for all i ∈ V.    (1)

Theorem: when σ ≥ 2, every discrete community c is a local optimum of (1).

In practice: higher σ ⇒ more clusters are local optima.

16 / 30

Outline

Network community detection with conductance
The relation to k-means clustering
Algorithms
Experiments
Conclusions

17 / 30

The problem

Given some seed nodes s, find the cluster c that contains s.

As constrained optimization:

    minimize φ_σ(c) subject to 0 ≤ c_i ≤ 1 for all i ∈ V, and c_i = 1 for all i ∈ s.

18 / 30

Projected Gradient Descent

Gradient descent:

    c(0) = s,    c(t+1) = p(c(t) − α(t) ∇φ_σ(c(t))).

Project onto the valid set:

    p(c) = argmin_{c′ : 0 ≤ c′_i ≤ 1, c′_i ≥ s_i} ‖c − c′‖₂²,

which is simply p(c) = max(s, min(1, c)).

19 / 30
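
A sketch of the whole loop. Writing φ_σ(c) = 1 − (cᵀAc + σ Σ_i c_i² w_i) / (wᵀc), from the positive definite kernel slide (15 / 30), gives an explicit gradient; the fixed step size α, the iteration cap, and the function names are illustrative choices, not the paper's:

```python
import numpy as np

def grad_phi_sigma(A, w, c, sigma):
    """Gradient of phi_sigma(c) = 1 - (c'Ac + sigma * sum_i c_i^2 w_i) / (w'c)."""
    D = w @ c                                   # denominator w'c
    N = c @ A @ c + sigma * (c * c) @ w         # numerator
    return -2.0 * (A @ c + sigma * w * c) / D + w * N / D**2

def pgd_community(A, seeds, sigma=1.0, alpha=0.5, max_iter=200):
    """Projected gradient descent from a seed set (slide 19 / 30)."""
    w = A.sum(axis=1).astype(float)
    s = np.zeros(A.shape[0])
    s[seeds] = 1.0                              # seed indicator vector
    c = s.copy()
    for _ in range(max_iter):
        step = c - alpha * grad_phi_sigma(A, w, c, sigma)
        c = np.maximum(s, np.minimum(1.0, step))   # p(c) = max(s, min(1, c))
    return c
```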

Expectation Maximization

E-step: c_i ← 1 if ‖x_i − µ‖₂² ≤ λ_i, and 0 otherwise.
M-step: µ_i ← w_i c_i / (Σ_j w_j c_j).

Together: if you work this out, it is equivalent to

    c(0) = s,    c(t+1) = {i | ∇φ_σ(c(t))_i < 0} ∪ s.

This is gradient descent with an infinite step size.

20 / 30
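
The same loop with the "infinite" step: keep exactly the nodes whose partial derivative is negative, plus the seeds. A sketch reusing grad_phi_sigma from the PGD snippet above:

```python
import numpy as np

def em_community(A, seeds, sigma=1.0, max_iter=100):
    """EM updates: c(t+1) = {i | grad phi_sigma(c(t))_i < 0} union seeds."""
    w = A.sum(axis=1).astype(float)
    s = np.zeros(A.shape[0])
    s[seeds] = 1.0
    c = s.copy()
    for _ in range(max_iter):
        negative = (grad_phi_sigma(A, w, c, sigma) < 0).astype(float)
        c_new = np.maximum(s, negative)          # union with the seed set
        if np.array_equal(c_new, c):             # reached a fixed point
            break
        c = c_new
    return c
```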

Outline

Network community detection with conductance
The relation to k-means clustering
Algorithms
Experiments
Conclusions

21 / 30

Datasets

Dataset        #nodes     #edges       clust.coef.  #communities
LFR (om=2)     5000       25123        0.021        146
CYC2008        6230       6531         0.121        408
Amazon         334863     925872       0.079        151037
DBLP           317080     1049866      0.128        13477
Youtube        1134890    2987624      0.002        8385
LiveJournal    3997962    34681189     0.045        287512
Orkut          3072441    117185083    0.014        6288363

22 / 30

Cluster size

[Figure: cluster size (log scale, 10⁰ to 10²) as a function of σ ∈ [0.1, 1], for LFR, CYC2008, DBLP [PGD], DBLP [EM], Amazon [PGD], Amazon [EM], and Youtube.]

23 / 30

Conductance

[Figure: conductance φ as a function of σ ∈ [0.1, 1], for DBLP, Amazon, Youtube, LiveJournal, and Orkut, each with PGD and EM.]

24 / 30

Ground truth recovery

[Figure: F1 score as a function of σ ∈ [0.1, 1], for DBLP, Amazon, Youtube, LiveJournal, and Orkut.]

25 / 30


Choice of σ

  • Choice of σ is important.
  • Heuristic: pick σ to maximize community density (a sketch of one possible reading follows below).

26 / 30
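
The slide does not spell the heuristic out, so this sketch makes loud assumptions: it sweeps a small grid of σ values, reuses em_community from the EM sketch above, and scores each result by internal edge density |E(c)| / (|c| choose 2). Both the grid and that notion of density are my guesses, not the paper's:

```python
import numpy as np

def pick_sigma(A, seeds, sigmas=np.linspace(0.1, 1.0, 10)):
    """Sweep sigma and keep the densest community found (heuristic sketch)."""
    best_sigma, best_c, best_density = None, None, -1.0
    for sigma in sigmas:
        c = em_community(A, seeds, sigma=sigma)
        k = c.sum()
        # c A c counts each internal edge twice and k*(k-1) = 2*C(k,2),
        # so the ratio is |E(c)| / C(|c|, 2).
        density = (c @ A @ c) / (k * (k - 1)) if k > 1 else 0.0
        if density > best_density:
            best_sigma, best_c, best_density = sigma, c, density
    return best_sigma, best_c
```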

F1 scores

Dataset        PGDc-0  PGDc-d  EMc-0  EMc-d  YL     HK     PPR
LFR (om=1)     0.967   0.185   0.868  0.187  0.203  0.040  0.041
LFR (om=2)     0.483   0.095   0.293  0.092  0.122  0.039  0.041
LFR (om=3)     0.275   0.085   0.158  0.083  0.110  0.037  0.039
LFR (om=4)     0.178   0.074   0.100  0.072  0.092  0.032  0.034
Karate         0.831   0.472   0.816  0.467  0.600  0.811  0.914
Football       0.792   0.816   0.766  0.805  0.816  0.471  0.283
Pol.Blogs      0.646   0.141   0.661  0.149  0.017  0.661  0.535
Pol.Books      0.596   0.187   0.622  0.197  0.225  0.641  0.663
Flickr         0.098   0.027   0.097  0.027  0.013  0.054  0.118
CYC            0.474   0.543   0.455  0.543  0.526  0.336  0.294
Amazon         0.470   0.522   0.425  0.522  0.493  0.245  0.130
DBLP           0.356   0.369   0.317  0.371  0.341  0.214  0.210
Youtube        0.089   0.251   0.073  0.248  0.228  0.037  0.071
LiveJournal    0.067   0.262   0.059  0.259  0.183  0.035  0.049
Orkut          0.042   0.231   0.033  0.231  0.171  0.057  0.033

27 / 30

Outline

Network community detection with conductance
The relation to k-means clustering
Algorithms
Experiments
Conclusions

28 / 30


Conclusions

  • Conductance is related to 1-means clustering.
  • Similarly, EM is the limiting case of Projected Gradient Descent.
  • High conductance clusters often correspond to ‘true’ clusters.
  • ...but they can be very large.

29 / 30


1-means clustering and conductance

Twan van Laarhoven

Institute for Computing and Information Sciences, Radboud University Nijmegen, The Netherlands

November 11th, 2016

30 / 30