

SLIDE 1

www.data61.csiro.au

k-variates++: more pluses in the k-means++

Richard Nock, Raphaël Canyasse, Roksana Boreli, Frank Nielsen

DATA61 | ANU | TECHNION | ECOLE POLYTECHNIQUE | UNSW | SONY CS LABS, INC.

(formerly NICTA)

ICML 2016 | Poster #29, Mon. 3-7pm

SLIDE 2

In this talk

❖ A generalization of the popular k-means++ seeding
❖ Two theorems on k-variates++:
  ❖ guarantees on approximation of the global optimum
  ❖ likelihood ratio bound between neighbouring instances
❖ Applications: “reductions” between clustering algorithms + approximation bounds of new clustering algorithms, privacy
❖ And more! (see poster, see paper!)


SLIDE 5

Motivation

k-means++ seeding = a gold standard in clustering:

❖ utterly simple to implement (iteratively pick centers with probability proportional to the squared distance to previously picked centers)

❖ assumption-free (expected) approximation guarantee w.r.t. the k-means global optimum (Arthur & Vassilvitskii, SODA 2007):

$\mathbb{E}_{C}[\phi(A; C)] \leq (2 + \log k) \cdot 8\,\phi_{\mathrm{opt}}$

❖ Inspired many variants (tensor clustering, distributed, data-stream, on-line, parallel clustering, clustering without centroids in closed form, etc.)

[Diagram: variants spawned by k-means++: distributed, on-line, streamed, no closed-form centroid, tensors, more potentials]

SLIDE 6

Motivation

Approaches are spawns of k-means++:

❖ modify the algorithm (e.g. …)

❖ use it as a building block

Our objective: put all of these in the same “bag”: a generalisation of k-means++ of which such approaches would be just “instantiations” (reductions).

Because general ⇒ new applications.

[Diagram: k-means++ ⇒ k-variates ⇒ more applications (distributed, on-line, streamed, no closed-form centroid, more potentials)]

SLIDE 7

k-means++ (Arthur & Vassilvitskii, SODA'07)

Input: data $A \subset \mathbb{R}^d$ with $|A| = m$, $k \in \mathbb{N}^*$;
Step 1: Initialise centers $C \leftarrow \emptyset$;
Step 2: for $t = 1, 2, \ldots, k$:
  2.1: randomly sample $a \sim_{q_t} A$, with $q_1 \doteq u_m$ (uniform) and, for $t > 1$, $q_t(a) \doteq D_t(a) \left( \sum_{a' \in A} D_t(a') \right)^{-1}$, where $D_t(a) \doteq \min_{x \in C} \|a - x\|_2^2$;
  2.2: $x \leftarrow a$;
  2.3: $C \leftarrow C \cup \{x\}$;
Output: $C$;
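The pseudocode above maps directly to a few lines of NumPy. Below is a minimal sketch of the seeding loop; the function name and the broadcast-based distance computation are ours, not from the paper.

```python
# Minimal NumPy sketch of k-means++ seeding as written above.
import numpy as np

def kmeans_pp_seed(A, k, rng=None):
    """Return k centers chosen from the rows of A (an m x d array)."""
    rng = np.random.default_rng(rng)
    m = A.shape[0]
    C = [A[rng.integers(m)]]                 # t = 1: q_1 = u_m (uniform)
    for t in range(2, k + 1):
        # D_t(a) = min over current centers x of ||a - x||_2^2
        D = ((A[:, None, :] - np.asarray(C)[None, :, :]) ** 2).sum(-1).min(1)
        a = A[rng.choice(m, p=D / D.sum())]  # 2.1: sample a ~ q_t
        C.append(a)                          # 2.2-2.3: x <- a, C <- C u {x}
    return np.asarray(C)
```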

SLIDE 8

k-variates++

Input: data $A \subset \mathbb{R}^d$ with $|A| = m$, $k \in \mathbb{N}^*$, random variables $\{X_a, a \in A\}$, probe functions $\wp_t : A \to \mathbb{R}^d$ ($t \geq 1$);
Step 1: Initialise centers $C \leftarrow \emptyset$;
Step 2: for $t = 1, 2, \ldots, k$:
  2.1: randomly sample $a \sim_{q_t} A$, with $q_1 \doteq u_m$ and, for $t > 1$, $q_t(a) \doteq D_t(a) \left( \sum_{a' \in A} D_t(a') \right)^{-1}$, where $D_t(a) \doteq \min_{x \in C} \|\wp_t(a) - x\|_2^2$;
  2.2: randomly sample $x \sim X_a$;
  2.3: $C \leftarrow C \cup \{x\}$;
Output: $C$;
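The change from k-means++ is confined to two places: distances are computed on the probed points $\wp_t(a)$, and the new center is drawn from $X_a$ instead of being $a$ itself. A hedged sketch, reusing the structure of the previous block; the default Gaussian $X_a$ is purely an illustrative choice of ours, not one prescribed by the algorithm.

```python
# Sketch of k-variates++: probe functions reshape the sampling weights,
# per-point densities X_a generate the actual centers.
import numpy as np

def k_variates_pp_seed(A, k, probe=lambda t, a: a, sample_X=None, rng=None):
    rng = np.random.default_rng(rng)
    if sample_X is None:
        # Illustrative X_a: an isotropic Gaussian centered on a (our choice).
        sample_X = lambda a, rng: rng.normal(a, 0.1)
    m = A.shape[0]
    C = [sample_X(A[rng.integers(m)], rng)]       # t = 1, then 2.2: x ~ X_a
    for t in range(2, k + 1):
        P = np.asarray([probe(t, a) for a in A])  # probe_t(a) for every point
        D = ((P[:, None, :] - np.asarray(C)[None, :, :]) ** 2).sum(-1).min(1)
        a = A[rng.choice(m, p=D / D.sum())]       # 2.1: sample a ~ q_t
        C.append(sample_X(a, rng))                # 2.2: x ~ X_a
    return np.asarray(C)
```

With the identity probe and `sample_X` returning `a` itself (a Dirac), this reduces exactly to the k-means++ sketch above.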

SLIDE 9

Two theorems & applications

SLIDE 10

Theorem 1

❖ k-means potential for $C$: $\phi(A; C) \doteq \sum_{a \in A} \|a - c(a)\|_2^2$, with $c(a) \doteq \arg\min_{c \in C} \|a - c\|_2^2$
❖ Suppose $\wp_t$ is $\eta$-stretching ($\eta \geq 0$): for any optimal cluster $A$ with size > 1 and any $a_0 \in A$,
$\phi(A; C) - \phi(A; \{a_0\}) \leq (1 + \eta) \cdot \left( \phi(\wp_t(A); C) - \phi(\wp_t(A); \{\wp_t(a_0)\}) \right), \forall t$
❖ Then $\mathbb{E}_{C \sim k\text{-variates++}}[\phi(A; C)] \leq (2 + \log k) \cdot \Phi$ (approximation of the global optimum), with
$\Phi \doteq (6 + 4\eta)\,\phi_{\mathrm{opt}} + 2\,\phi_{\mathrm{bias}} + 2\,\phi_{\mathrm{var}}$
$\phi_{\mathrm{var}} \doteq \sum_{a \in A} \mathrm{tr}\,(\mathrm{cov}[X_a])$, $\phi_{\mathrm{bias}} \doteq \sum_{a \in A} \|\mathbb{E}[X_a] - c_{\mathrm{opt}}(a)\|_2^2$, $\phi_{\mathrm{opt}} \doteq \sum_{a \in A} \|a - c_{\mathrm{opt}}(a)\|_2^2$

SLIDE 11

Theorem 1, special case: plain k-means++ uses probe $\wp_t = \mathrm{Id}$ and Dirac densities $X_a = \delta_a$.

SLIDE 12

Theorem 1, special case (continued): for k-means++, $\eta = 0$, $\phi_{\mathrm{bias}} = \phi_{\mathrm{opt}}$ and $\phi_{\mathrm{var}} = 0$, hence $\Phi = 8\,\phi_{\mathrm{opt}}$: the Arthur-Vassilvitskii bound is recovered.
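The three potentials in the theorem are cheap to evaluate once a reference clustering is fixed. A small numeric sketch, with helper names of our own, checking the special case above: Dirac $X_a$ and $\eta = 0$ give $\Phi = 8\,\phi_{\mathrm{opt}}$.

```python
# Evaluating phi_opt, phi_bias, phi_var and Phi from Theorem 1 (our names).
import numpy as np

def theorem1_Phi(A, C_opt, mean_X, cov_X, eta=0.0):
    d2 = ((A[:, None, :] - C_opt[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(1)                              # index of c_opt(a)
    phi_opt = d2[np.arange(len(A)), nearest].sum()
    phi_bias = ((mean_X - C_opt[nearest]) ** 2).sum()   # ||E[X_a] - c_opt(a)||^2
    phi_var = sum(np.trace(S) for S in cov_X)           # tr(cov[X_a])
    return (6 + 4 * eta) * phi_opt + 2 * phi_bias + 2 * phi_var, phi_opt

A = np.random.default_rng(0).standard_normal((100, 2))
Phi, phi_opt = theorem1_Phi(A, C_opt=np.zeros((1, 2)), mean_X=A,
                            cov_X=[np.zeros((2, 2))] * 100)
assert np.isclose(Phi, 8 * phi_opt)  # Diracs: phi_bias = phi_opt, phi_var = 0
```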

SLIDE 13

Remarks

❖ Guarantee approaches the statistical lower bound (Fréchet-Cramér-Rao-Darmois)
❖ Can be better than the Arthur-Vassilvitskii bound, in particular if $\phi_{\mathrm{bias}} < \phi_{\mathrm{opt}}$
❖ $\phi_{\mathrm{bias}}$ = knob from which background / domain knowledge may improve the general bound

SLIDE 14

Applications

❖ Reductions from k-variates++ approximability ratios:
  ❖ pick a clustering algorithm $L$,
  ❖ show that the expected output of $L$ = that of k-variates++ for particular choices of $\wp_t$ and $X_a$ (note: no computational constraint, just need existence),
❖ ⇒ get an approximability ratio for $L$!

SLIDE 15

Summary (poster, paper)

Setting     | Algorithm     | Probe functions $\wp_t$                            | Densities $X_a$
Batch       | k-means++     | Identity                                           | Diracs
Distributed | d-k-means++   | Identity                                           | Uniform, support = subsets
Distributed | p+d-k-means++ | Identity                                           | Non-uniform, compact support
Streaming   | s-k-means++   | Synopses                                           | Diracs
On-line     | ol-k-means++  | Point (batch not hit) / closest center (batch hit) | Diracs
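To make the table concrete: each row fixes the (probe, density) pair in the k-variates++ sketch from slide 8. For instance, the Batch row, using the hypothetical argument names from our earlier `k_variates_pp_seed` sketch:

```python
import numpy as np
A = np.random.default_rng(0).standard_normal((200, 2))

# Batch row = plain k-means++: identity probe, Dirac densities (x is a itself).
C = k_variates_pp_seed(A, k=5,
                       probe=lambda t, a: a,
                       sample_X=lambda a, rng: a)
```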


SLIDE 17

Distributed clustering

❖ Setting: {data nodes = Forgy nodes} & a special node (or Forgy node), e.g. hybrid, server-assisted P2P networks

k-variates++: more pluses in the k-means++ | Richard Nock, Raphael Canyasse, Roksana Boreli & Frank Nielsen

(F1, A1) (F2, A2) (F3, A3) (F4, A4) (F5, A5)

N∗

k data points communicated

“Forgy nodes” “Sampling node”

no data communicated

no data non-uniform
 sampling uniform
 sampling data

(∪iAi = A)

& &

e.g. hybrid, server-assisted P2P networks (or Forgy node)

SLIDE 18

Algorithm + Theorem (d-k-means++)

❖ Algorithm: iterate for $t = 1, 2, \ldots, k$ (a simulation sketch follows below):
  ❖ $N^*$ chooses (non-uniformly, $\sim D_t^*$) a Forgy node, say $F_i$
  ❖ $F_i$ samples (uniformly) a point $a_t \in A_i$, sends $a_t$ to $F_j, \forall j$
  ❖ $F_j, \forall j$ computes & sends $d_j \in \mathbb{R}_+$ to $N^*$, which updates $D_t^*$
❖ Theorem: $\mathbb{E}[\phi(A, C)] \leq (2 + \log k) \cdot \Phi$, with $\Phi \doteq 10\,\phi_{\mathrm{opt}} + 6\,\phi_F^s$, where $\phi_F^s \doteq \sum_{i \in [n]} \sum_{a \in A_i} \|c(A_i) - a\|_2^2$ is the spread of the Forgy nodes
❖ Remarks: $\phi_{\mathrm{opt}}$ is the global optimum on the total data; the bound gets all the better as Forgy nodes aggregate “local” data.
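A single-process simulation of the protocol, under one natural but assumed-by-us choice for the reported weights: each Forgy node sends $d_j = \sum_{a \in A_j} \min_{x \in C} \|a - x\|_2^2$, so that node selection at $N^*$ mirrors the k-means++ sampling weights. The paper's exact $D_t^*$ update may differ.

```python
# Hedged simulation of d-k-means++ seeding over Forgy nodes (one process).
import numpy as np

def d_kmeans_pp_seed(nodes, k, rng=None):
    """nodes: list of arrays A_i (the Forgy nodes). Returns k centers."""
    rng = np.random.default_rng(rng)
    C = []
    for t in range(k):
        if not C:   # round 1: weight nodes by size (uniform over all points)
            d = np.array([float(len(A)) for A in nodes])
        else:       # each F_j reports its summed squared distance to C
            d = np.array([((A[:, None, :] - np.asarray(C)[None, :, :]) ** 2)
                          .sum(-1).min(1).sum() for A in nodes])
        j = rng.choice(len(nodes), p=d / d.sum())   # N* picks F_j ~ D*_t
        a = nodes[j][rng.integers(len(nodes[j]))]   # F_j samples uniformly in A_j
        C.append(a)                                 # only k points ever move
    return np.asarray(C)
```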

SLIDE 19

Theorem 2 (likelihood ratio bound for neighbour samples)

❖ Assumption: $\Omega \doteq \mathrm{Ball}(L_2, R)$, and all $X_a$ satisfy
$(\mathrm{d}p_{X_{a'}} / \mathrm{d}p_{X_a})(x) \leq \rho(R), \forall a, a' \in A, \forall x \in \Omega$
(see e.g. differential privacy)

SLIDE 20

Theorem 2 (likelihood ratio bound for neighbour samples)

❖ Fix $\wp_t = \mathrm{Id}(.)$. For any neighbour $A' \approx A$ (differing on one point),
$\frac{P_{C \sim k\text{-variates++}}[C \mid A']}{P_{C \sim k\text{-variates++}}[C \mid A]} \leq (1 + \delta_w)^{k-1} + f(k) \cdot \delta_w \cdot (1 + \delta_s)^{k-1} \cdot \rho(R)$
❖ $\delta_w, \delta_s$ ($0 < \delta_w, \delta_s \ll 1$) are spread and monotonicity parameters (formal definition in poster / paper)
❖ They can be estimated / computed from data
❖ In general, they $\to 0$ with $m$

SLIDE 21

Theorem 2 (likelihood ratio bound for neighbour samples)

❖ Fix $\wp_t = \mathrm{Id}(.)$. For any neighbour $A' \approx A$ (differing on one point),
$\frac{P_{C \sim k\text{-variates++}}[C \mid A']}{P_{C \sim k\text{-variates++}}[C \mid A]} \leq (1 + \delta_w)^{k-1} + f(k) \cdot \delta_w \cdot (1 + \delta_s)^{k-1} \cdot \rho(R)$
❖ Conditions for $(1 + \delta_w)^{k-1} \Rightarrow 1$ and $f(k) \cdot \delta_w \cdot (1 + \delta_s)^{k-1} \cdot \rho(R) \Rightarrow 0$?

SLIDE 22

Theorem 2 (likelihood ratio bound for neighbour samples)

❖ Fix $\wp_t = \mathrm{Id}(.)$. For any neighbour $A' \approx A$ (differing on one point):
❖ If the densities of all $X_a$ lie in $[\epsilon_m, \epsilon_M] \not\ni 0$, then with prob. $\geq 1 - \delta$,
$\frac{P[C \mid A']}{P[C \mid A]} \leq 1 + \underbrace{\left(\frac{\epsilon_M}{\epsilon_m}\right)^{k} \cdot \frac{4}{m^{\frac{1}{4} + \frac{1}{d+1}}} + \left(\frac{64k}{2d}\right)^{k} \cdot \frac{\rho(2R)}{m}}_{o(1)}$
as long as $k \leq \frac{\epsilon_m^2}{4\epsilon_M} \cdot \sqrt{m}$
❖ No $\delta_w, \delta_s$ in the bound (proof exhibits small values w.h.p., experiments display such values). Application in differential privacy (sublinear noise!)

SLIDE 23

Experiments

❖ k-variates++ (d-k-means++) vs k-means++ & k-means$\|$ (Bahmani & al. 2012), simulated data, $d = 50$; sample peers with $\mathbb{E}[|A_i|] = 500$ until $\sum_i |A_i| \approx 20000$. For each peer, (a) data uniformly sampled in a hyperrectangle + (b) $p\%$ of the points given to a random peer (increases the spread $\phi_F^s$, making the problem more “difficult”)

[Plot: ratio $\phi_F^s(p) / \phi_F^s(0)$ as a function of $p$]

SLIDE 24

Experiments

❖ k-variates++ (d-k-means++) vs k-means++ & k-means$\|$ (Bahmani & al. 2012) (used with their best parameters), comparing
$\rho_\phi(H) \doteq \frac{\phi(\text{d-k-means++}) - \phi(H)}{\phi(H)} \cdot 100$

[Plots: $\rho_\phi(\text{k-means++})$ and $\rho_\phi(\text{k-means}\|)$ as functions of $k$ (4 to 10) and $p$ (10 to 50)]

❖ k-variates++ beats k-means++
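The comparison metric on this slide is just a relative potential difference in percent; negative values mean d-k-means++ found a lower potential than the baseline H:

```python
def rho_phi(phi_dkmeans_pp: float, phi_H: float) -> float:
    """rho_phi(H) = (phi(d-k-means++) - phi(H)) / phi(H) * 100."""
    return (phi_dkmeans_pp - phi_H) / phi_H * 100.0

# e.g. rho_phi(90.0, 100.0) == -10.0: d-k-means++ potential 10% below H's.
```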

SLIDE 25

Conclusions

❖ We provide a generalisation of k-means++ with guaranteed approximation of the global optimum
❖ k-variates++ can be used as is (e.g. privacy, k-means++) or to prove approximation properties of other algorithms via “reductions” between clustering algorithms
❖ Come see the poster for more examples
❖ Future: use the Theorems to address stability, generalisation and smoothed analysis

SLIDE 26

Thank you!

Questions?