slide-1
SLIDE 1

Improved Clustering Algorithms for the Random Cluster Graph Model

Ron Shamir Dekel Tsur Tel Aviv University

1/18

slide-2
SLIDE 2

The Clustering Problem

Input: A graph G. (Edges in G represent similarity between the vertices.)
Output: A partition of the vertices of G into sets such that there are many edges between vertices from the same set, and few edges between vertices from different sets.

2/18

slide-4
SLIDE 4

The Random Cluster Graph Model

A graph G = (V, E) which is built by the following process:

  1. V is partitioned into disjoint sets V_1, ..., V_m (clusters).
  2. Mates (= vertices from the same set) are connected by an edge with probability p.
  3. Non-mates are connected by an edge with probability r < p.

The edges are independent.
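The three steps above can be sketched as a small sampler. This is a minimal illustration, not code from the talk: the function name `random_cluster_graph`, the adjacency-set representation, and the choice to number the vertices 0, ..., n−1 cluster by cluster are all assumptions made for the example.

```python
import random
from itertools import combinations


def random_cluster_graph(cluster_sizes, p, r, seed=None):
    """Sample a graph from the random cluster graph model (illustrative sketch).

    Returns (adj, label): adj[v] is the set of neighbours of v and
    label[v] is the index of the cluster containing v.
    """
    if not 0 <= r < p <= 1:
        raise ValueError("need 0 <= r < p <= 1")
    rng = random.Random(seed)
    # Step 1: partition V = {0, ..., n-1} into consecutive clusters.
    label = [i for i, size in enumerate(cluster_sizes) for _ in range(size)]
    n = len(label)
    adj = [set() for _ in range(n)]
    # Steps 2-3: every pair gets an independent edge, with probability p
    # for mates and r for non-mates.
    for u, v in combinations(range(n), 2):
        prob = p if label[u] == label[v] else r
        if rng.random() < prob:
            adj[u].add(v)
            adj[v].add(u)
    return adj, label
```

With p = 1 and r = 0 the sample is noise-free: each cluster is a clique and there are no edges between clusters, which is a convenient sanity check.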

3/18

slide-5
SLIDE 5

The Clustering Problem

Input: A cluster graph G.
Output: The clusters V_1, ..., V_m.

n = |V|,  k = min_i |V_i|,  ∆ = p − r

4/18

slide-8
SLIDE 8

Previous Results

General case

Paper                 Requirement on k                    Requirement on ∆   Complexity
Ben-Dor et al 99      Ω(n)                                Ω(1)               n^2 log^{O(1)} n
This paper            Ω(∆^{-1} √n max(log n, ∆^{-ε}))                        O(mn^2 / log n)

Equal sized clusters

Paper                 m       Requirement on ∆             Complexity
Dyer and Frieze 86    2       Ω(n^{-1/4} log^{1/4} n)      O(n^2)
Boppana 87            2       Ω(n^{-1/2} √(log n))         n^{O(1)}
Jerrum and Sorkin 93  2       Ω(n^{-1/6+ε})                O(n^4)
Condon and Karp 99    O(1)    Ω(n^{-1/2+ε})                O(n^2)
This paper                    Ω(m n^{-1/2} √(log n))       O(mn^2 log n)

n = |V|,  k = min_i |V_i|,  ∆ = p − r

5/18

slide-9
SLIDE 9

Previous Results

General case

Paper                 Requirement on k                    Requirement on ∆   Complexity
Ben-Dor et al 99      Ω(n)                                Ω(1)               n^2 log^{O(1)} n
This paper            Ω(∆^{-1} √n max(log n, ∆^{-ε}))                        O(n log n)

Equal sized clusters

Paper                 m       Requirement on ∆
Dyer and Frieze 86    2       Ω(n^{-1/4} log^{1/4} n)
Boppana 87            2       Ω(n^{-1/2} √(log n))
Jerrum and Sorkin 93  2       Ω(n^{-1/6+ε})
Condon and Karp 99    O(1)    Ω(n^{-1/2+ε})
This paper                    Ω(m n^{-1/2} √(log n))

n = |V|,  k = min_i |V_i|,  ∆ = p − r

5/18

slide-10
SLIDE 10

More Notation

For a graph G = (V, E),

w.h.p. = with probability 1 − n^{−Ω(1)}
N(v) = the neighbors of v
d_S(v) = |N(v) ∩ S|

[Figure: a vertex v with two neighbors inside a set S, so d_S(v) = 2]

6/18

slide-11
SLIDE 11

Top Level Description

A set S ⊆ V is called a subcluster if S ⊆ V_i for some cluster V_i.

Our algorithm:
While G is not empty:
  Find seed: Find a subcluster S of size Θ(log n/∆^2).
  Expand: Find the whole cluster V_i which contains S, and remove it from G.
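The outer loop can be written down independently of how the two phases are implemented. The sketch below is only a skeleton: `cluster_graph`, `find_seed` and `expand` are hypothetical names, and the two phases are passed in as callables rather than implemented here.

```python
def cluster_graph(vertices, find_seed, expand):
    """Peel clusters off one at a time (skeleton of the top-level loop).

    find_seed(remaining) -> set   hypothetical callable returning a subcluster
    expand(seed, remaining) -> set   hypothetical callable returning the
                                     whole cluster that contains the seed
    """
    remaining = set(vertices)
    clusters = []
    while remaining:  # "While G is not empty"
        seed = find_seed(remaining)
        cluster = set(expand(seed, remaining)) & remaining
        if not cluster:
            # Guard added for the sketch: avoid looping forever if a phase fails.
            raise RuntimeError("expand returned no vertices; cannot make progress")
        clusters.append(cluster)
        remaining -= cluster  # "remove it from G"
    return clusters
```

Passing the phases as parameters keeps the loop testable with an oracle (for example, one that reads the true cluster labels) before the probabilistic phases are plugged in.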

7/18

slide-15
SLIDE 15

Expanding a subcluster S

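Suppose that S ⊆ V_i and |S| = Θ(log n/∆^2). Consider d_S(v) for v ∈ V − S:

E[d_S(v)] = |S|p   if v ∈ V_i,
            |S|r   otherwise.

Using a Chernoff-like bound, w.h.p.

|d_S(v) − E[d_S(v)]| < D/2, where D = Θ(√(|S| log n)).

[Figure: the values d_S(v) on a number line. The values for v ∈ V_i lie in a window of width D around |S|p, all other values lie in a window of width D around |S|r, and the gap between the two windows is |S|∆ − D > D.]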
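The gap condition explains the seed size Θ(log n/∆^2). A short derivation, assuming D = c·√(|S| log n) for some constant c > 0 hidden in the Θ (the constant c is introduced here for illustration only):

```latex
% Assumption: D = c \sqrt{|S| \log n} for a constant c > 0.
\begin{align*}
|S|\Delta - D > D
  &\iff |S|\Delta > 2c\sqrt{|S|\log n} \\
  &\iff \sqrt{|S|} > \frac{2c\sqrt{\log n}}{\Delta} \\
  &\iff |S| > \frac{4c^{2}\log n}{\Delta^{2}}.
\end{align*}
```

So a seed of size at least a suitable constant times log n/∆^2 is enough for the two windows to be separated by more than D.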

8/18

slide-16
SLIDE 16

Expanding a subcluster S

  1. Order V − S = {v_1, ..., v_{n−|S|}} such that d_S(v_1) ≥ d_S(v_2) ≥ ··· ≥ d_S(v_{n−|S|}).
  2. Let D = Θ(√(|S| log n)).
  3. If max_j {d_S(v_j) − d_S(v_{j+1})} < D, then return V.
  4. Otherwise, let j be the first index for which d_S(v_j) − d_S(v_{j+1}) ≥ D. Return S ∪ {v_1, ..., v_j}.
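A minimal sketch of the four steps, assuming the graph is given as a list of neighbor sets over the vertices 0, ..., n−1. The name `expand_subcluster` and the parameter `c` (standing in for the constant hidden in the Θ of step 2) are assumptions for the example, not part of the talk.

```python
import math


def expand_subcluster(adj, S, c=1.0):
    """Grow a subcluster S to a full cluster by looking for a gap in d_S.

    adj[v] is the set of neighbours of vertex v; c is an assumed stand-in
    for the constant hidden in D = Theta(sqrt(|S| log n)).
    """
    n = len(adj)
    S = set(S)
    # d_S(v) = |N(v) ∩ S| for every vertex outside S.
    d = {v: len(adj[v] & S) for v in range(n) if v not in S}
    # Step 1: order V - S by decreasing d_S.
    order = sorted(d, key=d.get, reverse=True)
    # Step 2: the gap threshold.
    D = c * math.sqrt(len(S) * math.log(n))
    # Step 4: the first gap of size at least D separates the cluster from the rest.
    for j in range(len(order) - 1):
        if d[order[j]] - d[order[j + 1]] >= D:
            return S | set(order[: j + 1])
    # Step 3: no gap, so all of V is returned.
    return set(range(n))
```

On a noise-free graph (p = 1, r = 0) with two clusters, any seed of a few vertices from one cluster produces a gap of |S| between mates and non-mates, so the whole cluster is returned.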

9/18

slide-17
SLIDE 17

Finding a Subcluster — Imbalance

For two disjoint sets L, R of vertices of equal size, the L, R-imbalance of V_i (Jerrum and Sorkin 93) is

I(V_i, L, R) = (|V_i ∩ L| − |V_i ∩ R|) / |L|.

The imbalance of L, R is max{I(V_1, L, R), ..., I(V_m, L, R)}.

The secondary imbalance of L, R is the second largest value.
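The definitions translate directly into code. The sketch below assumes the true clusters are known (which is only the case in an analysis or a test, not inside the algorithm); the function names are chosen for the example.

```python
def imbalances(clusters, L, R):
    """Return I(V_i, L, R) for every cluster V_i, in the order given."""
    L, R = set(L), set(R)
    if not L or len(L) != len(R) or L & R:
        raise ValueError("L and R must be disjoint, non-empty and of equal size")
    return [(len(set(C) & L) - len(set(C) & R)) / len(L) for C in clusters]


def imbalance_and_secondary(clusters, L, R):
    """Return (imbalance, secondary imbalance) of the pair L, R.

    The secondary imbalance is None when there is only one cluster.
    """
    values = sorted(imbalances(clusters, L, R), reverse=True)
    return values[0], (values[1] if len(values) > 1 else None)
```

For example, with clusters {0, 1, 2, 3} and {4, 5, 6, 7}, L = {0, 1, 4} and R = {2, 5, 6}, the first cluster has imbalance (2 − 1)/3 and the second (1 − 2)/3.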

10/18

slide-18
SLIDE 18

Finding a Subcluster

  1. Find L, R with large imbalance and small secondary imbalance.
  2. Let f(v) = d_L(v) − d_R(v), D = Θ(√(|L| log n)).
  3. Randomly choose Θ(m^2 log n / ∆^2) vertices from V − (L ∪ R) into a set S.
  4. Order S = {v_1, ..., v_s} such that f(v_1) ≥ ··· ≥ f(v_s).
  5. If max_j {f(v_j) − f(v_{j+1})} < D, then return. (L, R are “bad”)
  6. Let j be the first index for which f(v_j) − f(v_{j+1}) ≥ D. Return {v_1, ..., v_j}.
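Steps 2 to 6 can be sketched as follows, taking the pair L, R from step 1 as given. The name `find_subcluster`, the explicit `sample_size` argument (in place of the Θ(m^2 log n/∆^2) bound) and the constant `c` are assumptions made for the example; returning `None` is how this sketch signals a “bad” pair.

```python
import math
import random


def find_subcluster(adj, L, R, sample_size, c=1.0, seed=None):
    """Try to extract a subcluster from a pair L, R (illustrative sketch).

    adj[v] is the set of neighbours of v. Returns a list of vertices,
    or None when no gap of size D is found (L, R are "bad").
    """
    rng = random.Random(seed)
    n = len(adj)
    L, R = set(L), set(R)

    def f(v):
        # Step 2: f(v) = d_L(v) - d_R(v).
        return len(adj[v] & L) - len(adj[v] & R)

    D = c * math.sqrt(len(L) * math.log(n))
    # Step 3: sample vertices outside L and R.
    pool = [v for v in range(n) if v not in L and v not in R]
    S = rng.sample(pool, min(sample_size, len(pool)))
    # Step 4: order the sample by decreasing f.
    S.sort(key=f, reverse=True)
    # Steps 5-6: cut at the first gap of size at least D, if any.
    for j in range(len(S) - 1):
        if f(S[j]) - f(S[j + 1]) >= D:
            return S[: j + 1]
    return None
```

In a noise-free graph with L inside one cluster and R inside another, the sampled mates of L have f = |L| and the sampled mates of R have f = −|L|, so the cut lands exactly between them.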

11/18

slide-21
SLIDE 21

Correctness of the Algorithm

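Denote b_i = I(V_i, L, R) and l = |L|. Suppose that b_1 ≥ b_2 ≥ ··· ≥ b_m.

Lemma: If b_1 ≥ Ω(√(log n) / (∆ √l)) and b_2 ≤ b_1/2, then w.h.p. the algorithm returns a subcluster.

Proof: For v ∈ V_i, E[f(v)] = ∆ l b_i. W.h.p., |f(v) − E[f(v)]| < D/2.

[Figure: the values f(v) on a number line, grouped in windows of width D around ∆lb_1 (V_1), ∆lb_2 (V_2) and ∆lb_3 (V_3). The gap between the V_1 window and the V_2 window is larger than D.]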
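The expectation used in the proof follows from the model by linearity of expectation. A short derivation, assuming v ∈ V_i lies outside L ∪ R and |L| = |R| = l:

```latex
% Assumptions: v \in V_i, v \notin L \cup R, |L| = |R| = l,
% edges independent with probabilities p (mates) and r (non-mates).
\begin{align*}
E[d_L(v)] &= p\,|V_i \cap L| + r\,(l - |V_i \cap L|), \\
E[d_R(v)] &= p\,|V_i \cap R| + r\,(l - |V_i \cap R|), \\
E[f(v)]   &= E[d_L(v)] - E[d_R(v)]
           = (p - r)\,(|V_i \cap L| - |V_i \cap R|)
           = \Delta\, l\, b_i .
\end{align*}
```

The terms in r·l cancel because L and R have the same size, which is why the definition of imbalance requires sets of equal size.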

12/18

slide-22
SLIDE 22

Finding the Sets L, R — Initialization

  • 1. L0, R0 ← φ. Let l = Θ( m2

∆2 ).

  • 2. Randomly select a vertex u and l pairs of vertices.
  • 3. For each pair of vertices, if only one vertex is a neighbor
  • f u, place that vertex in L0 and the other vertex in R0.

u L0 R0

13/18

slide-23
SLIDE 23

Finding the Sets L, R — Initialization

  • 1. L0, R0 ← φ. Let l = Θ( m2

∆2 ).

  • 2. Randomly select a vertex u and l pairs of vertices.
  • 3. For each pair of vertices, if only one vertex is a neighbor
  • f u, place that vertex in L0 and the other vertex in R0.

u L0 R0

13/18

slide-24
SLIDE 24

Finding the Sets L, R — Initialization

  • 1. L0, R0 ← φ. Let l = Θ( m2

∆2 ).

  • 2. Randomly select a vertex u and l pairs of vertices.
  • 3. For each pair of vertices, if only one vertex is a neighbor
  • f u, place that vertex in L0 and the other vertex in R0.

Otherwise randomly place one vertex in L0 and the other vertex in R0.

u L0 R0

13/18

slide-25
SLIDE 25

Finding the Sets L, R — Initialization

  1. L_0, R_0 ← ∅. Let l = Θ(m^2/∆^2).
  2. Randomly select a vertex u and l pairs of vertices.
  3. For each pair of vertices, if only one vertex is a neighbor of u, place that vertex in L_0 and the other vertex in R_0. Otherwise randomly place one vertex in L_0 and the other vertex in R_0.

[Figure: the vertex u and the sets L_0, R_0]
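The initialization can be sketched as below. Two details are assumptions of this sketch, since the slide does not specify them: the l pairs are taken to be pairwise disjoint, and u itself is excluded from the pairs. The name `initialize_LR` and the explicit `num_pairs` argument (in place of l = Θ(m^2/∆^2)) are also chosen for the example.

```python
import random


def initialize_LR(adj, num_pairs, seed=None):
    """Build L_0, R_0 from a random vertex u and num_pairs random pairs.

    adj[v] is the set of neighbours of v. Assumes (not stated in the
    source) that pairs are disjoint and do not contain u.
    """
    rng = random.Random(seed)
    n = len(adj)
    u = rng.randrange(n)
    others = [v for v in range(n) if v != u]
    chosen = rng.sample(others, 2 * num_pairs)  # needs 2 * num_pairs <= n - 1
    L0, R0 = set(), set()
    for v, w in zip(chosen[0::2], chosen[1::2]):
        v_adj, w_adj = v in adj[u], w in adj[u]
        if v_adj == w_adj:
            # Both or neither are neighbours of u: place them at random.
            if rng.random() < 0.5:
                v, w = w, v
        elif w_adj:
            # Exactly one neighbour of u: it goes to L_0.
            v, w = w, v
        L0.add(v)
        R0.add(w)
    return u, L0, R0
```

By construction L_0 never contains fewer neighbors of u than R_0, which is the bias toward the cluster of u that the analysis exploits.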

13/18

slide-26
SLIDE 26

Analysis of the Initialization

Suppose that u ∈ V_1. If v ∈ V_1 and w ∉ V_1, then

P[v is a neighbor of u] = p > r = P[w is a neighbor of u]

⇒ Using Chernoff bounds and Hoeffding-Azuma’s Inequality, w.h.p.,

I(V_1, L_0, R_0) ≈ (1 − 1/m) · ∆/m
I(V_i, L_0, R_0) ≈ −(1/m) · ∆/m   for i > 1

14/18

slide-29
SLIDE 29

Finding the Sets L, R — 1st Iteration

  4. If L_0, R_0 are “good” (yielding a subcluster) stop.
  5. Let f_0(v) = d_{L_0}(v) − d_{R_0}(v).
  6. L_1, R_1 ← ∅. Randomly select l pairs of unchosen vertices.
  7. For each pair v, w, if f_0(v) ≠ f_0(w) place the vertex with larger f_0-value in L_1 and the other vertex in R_1.

[Figure: the sets L_0, R_0, L_1, R_1 and a pair v, w with f_0(v) = 1 and f_0(w) = −2]
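One refinement step (steps 5 to 7) can be sketched as follows. The slide does not say what happens to a pair with equal f_0-values; this sketch places such a pair at random, by analogy with the initialization, and that choice is an assumption. The name `next_LR`, the `used` argument (vertices already chosen in earlier rounds) and `num_pairs` are likewise chosen for the example.

```python
import random


def next_LR(adj, L_prev, R_prev, used, num_pairs, seed=None):
    """Build the next pair of sets from the previous pair (illustrative sketch).

    adj[v] is the set of neighbours of v; `used` holds the vertices chosen
    in earlier rounds. Ties in f are broken at random (an assumption).
    """
    rng = random.Random(seed)
    L_prev, R_prev, used = set(L_prev), set(R_prev), set(used)

    def f(v):
        # Step 5: f(v) = d_L(v) - d_R(v) with respect to the previous pair.
        return len(adj[v] & L_prev) - len(adj[v] & R_prev)

    # Step 6: select disjoint pairs among the unchosen vertices.
    unchosen = [v for v in range(len(adj)) if v not in used]
    chosen = rng.sample(unchosen, 2 * num_pairs)
    L_next, R_next = set(), set()
    # Step 7: the vertex with the larger f-value goes to the new L.
    for v, w in zip(chosen[0::2], chosen[1::2]):
        fv, fw = f(v), f(w)
        if fv < fw or (fv == fw and rng.random() < 0.5):
            v, w = w, v
        L_next.add(v)
        R_next.add(w)
    return L_next, R_next
```

Each round consumes fresh vertices, so the caller is expected to add the returned sets to `used` before the next call.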

15/18

slide-32
SLIDE 32

Finding the Sets L, R — Iterations

  8. If L_1, R_1 are “good” stop.
  9. Otherwise repeat this process (i.e. build L_2, R_2 from L_1, R_1, build L_3, R_3 from L_2, R_2 etc.) until a “good” pair is found.

[Figure: the successive pairs L_0, R_0, L_1, R_1, L_2, R_2, L_3, R_3]

16/18

slide-33
SLIDE 33

Analysis of the Iterations

Denote b_i^t = I(V_i, L_t, R_t).

Using Hoeffding-Azuma’s Inequality and Esseen’s Inequality we show that w.h.p.

  1. The imbalance of V_1 grows exponentially: b_1^t ≥ 2 b_1^{t−1} for all t.
  2. The imbalance of the other V_i-s is much smaller: b_i^t = o(b_1^t) for all i > 1 and all t.

⇒ After at most log n iterations we reach L_t, R_t with high imbalance.
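The logarithmic bound on the number of iterations can be read off from the doubling property. The following is a rough sketch rather than the talk's proof; it assumes b_1^0 > 0 and that the doubling holds at every iteration, and it uses the fact that an imbalance is at most 1 by definition.

```latex
% Assumptions: b_1^0 > 0 and b_1^t \ge 2 b_1^{t-1} at every iteration.
% An imbalance never exceeds 1, since |V_i \cap L| - |V_i \cap R| \le |L|.
\begin{align*}
1 \;\ge\; b_1^{t} \;\ge\; 2^{t}\, b_1^{0}
\quad\Longrightarrow\quad
t \;\le\; \log_2 \frac{1}{b_1^{0}} .
\end{align*}
```

With the initial value b_1^0 ≈ (1 − 1/m)·∆/m from the analysis of the initialization and m ≥ 2, this gives roughly t ≤ log_2(2m/∆), so the doubling can continue for only a logarithmic number of rounds.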

17/18

slide-34
SLIDE 34

Concluding Remarks

Main results:

  • An algorithm for (almost) equal sized clusters (shown). The algorithm requires k = Ω(∆^{-1} √(n log n)).
  • An algorithm for unequal sized clusters (not shown). The algorithm requires k = Ω(∆^{-1} √n max(log n, ∆^{-ε})).

18/18