

SLIDE 1

Informational and Computational Limits of Clustering

and other questions about clustering

Nati Srebro

University of Toronto

based on work in progress with

Gregory Shakhnarovich

Brown University

Sam Roweis

University of Toronto

SLIDE 2

“Clustering”

  • Clustering with respect to a specific model / structure / objective
  • Gaussian mixture model

– Each point comes from one of k “centers”
– Gaussian cloud around each center
– For now: unit-variance Gaussians, uniform prior over choice of center

  • As an optimization problem:

– Likelihood of centers:

$\sum_i \log\Big( \sum_j \exp\big(-\|x_i-\mu_j\|^2/2\big) \Big)$

– k-means objective—Likelihood of assignment:

$\sum_i \min_j \|x_i-\mu_j\|^2$
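Both objectives are easy to state in code. A minimal NumPy sketch (my own illustration, not from the talk; function names are mine):

```python
import numpy as np

def kmeans_cost(X, mu):
    """k-means objective: sum_i min_j ||x_i - mu_j||^2."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return d2.min(axis=1).sum()

def center_loglik(X, mu):
    """Likelihood of the centers under a unit-variance, uniform-prior
    Gaussian mixture, up to an additive constant:
    sum_i log sum_j exp(-||x_i - mu_j||^2 / 2)."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    a = -d2 / 2
    m = a.max(axis=1, keepdims=True)          # stable log-sum-exp
    return (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).sum()
```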

SLIDE 3

Is Clustering Hard or Easy?

  • k-means (and ML estimation?) is NP-hard

– For some point configurations, it is hard to find the optimal solution.

– But do these point configurations actually correspond to clusters of points?

SLIDE 4

Is Clustering Hard or Easy?

  • k-means (and ML estimation?) is NP-hard

– For some point configurations, it is hard to find the optimal solution.

– But do these point configurations actually correspond to clusters of points?

  • Well separated Gaussian clusters, lots of data

– Poly-time algorithms for very large separation and #points
– Empirically, EM* works (modest separation, #points)

*EM with some bells and whistles: spectral projection (PCA), pruning centers, etc.

SLIDE 5

Is Clustering Hard or Easy? (when it’s interesting)

  • k-means (and ML estimation?) is NP-hard

– For some point configurations, it is hard to find the optimal solution.

– But do these point configurations actually correspond to clusters of points?

  • Well separated Gaussian clusters, lots of data

– Poly-time algorithms for very large separation and #points
– Empirically, EM* works (modest separation, #points)

  • Not enough data

– Can’t identify clusters (ML clustering meaningless)

*EM with some bells and whistles: spectral projection (PCA), pruning centers, etc.
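For reference, a bare-bones EM update for this model (spherical unit-variance components, uniform prior), without the bells and whistles from the footnote; a standard textbook sketch, not the authors' implementation:

```python
import numpy as np

def em_spherical(X, mu, n_iter=100):
    """EM for a k-component, unit-variance, uniform-prior spherical
    Gaussian mixture: only the centers mu (shape (k, d)) are updated."""
    for _ in range(n_iter):
        # E-step: responsibilities r_ij ∝ exp(-||x_i - mu_j||^2 / 2)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logr = -d2 / 2
        logr -= logr.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each center becomes the responsibility-weighted mean
        mu = (r.T @ X) / r.sum(axis=0)[:, None]
    return mu
```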

SLIDE 6

Effect of “Signal Strength”

[Figure: likelihood landscape, along an axis from “large separation, more samples” to “small separation, fewer samples”. With lots of data, the true solution creates a distinct peak and is easy to find; with not enough data, the “optimal” solution is meaningless.]

SLIDE 7

Effect of “Signal Strength”

[Figure: same landscape, with a middle regime added. Lots of data: the true solution creates a distinct peak, easy to find. Just enough data: the optimal solution is meaningful, but hard to find? Not enough data: the “optimal” solution is meaningless.]

SLIDE 8

Effect of “Signal Strength”

[Figure: same landscape, now annotated with an informational limit and a computational limit along the axis from “small separation, fewer samples” to “large separation, more samples”. Below the informational limit the “optimal” solution is meaningless; between the two limits the optimal solution is meaningful but hard to find(?); above the computational limit it is easy to find.]

SLIDE 9

Effect of “Signal Strength”

Infinite data limit: $E_x[\mathrm{cost}(x;\text{model})] = KL(\text{true}\,\|\,\text{model})$. The mode is always at the true model. Determined by:

  • number of clusters (k)
  • dimensionality (d)
  • separation (s)


SLIDE 10

Effect of “Signal Strength”

Infinite data limit: $E_x[\mathrm{cost}(x;\text{model})] = KL(\text{true}\,\|\,\text{model})$. The mode is always at the true model. Determined by:

  • number of clusters (k)
  • dimensionality (d)
  • separation (s)

The actual (finite-sample) log-likelihood also depends on:

  • sample size (n)

“local ML model” $\sim N\!\big(\text{true model},\ \tfrac{1}{n} J_{\text{Fisher}}^{-1}\big)$   [Redner & Walker 1984]
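The $N(\text{true},\ \tfrac{1}{n}J_{\text{Fisher}}^{-1})$ asymptotics are easy to check in the simplest case: for a single unit-variance Gaussian, $J_{\text{Fisher}} = 1$, so the ML estimate of the mean should fluctuate around the truth with variance ≈ 1/n. A quick simulation (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    # The ML estimate of the mean of N(0, 1) is the sample mean;
    # the asymptotic variance should be J_Fisher^{-1} / n = 1/n.
    mles = rng.normal(size=(1000, n)).mean(axis=1)
    print(n, round(float(mles.var()), 6), 1.0 / n)
```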

SLIDE 11

Informational and Computational Limits

[Figure: phase diagram with sample size (n) on the horizontal axis and separation (s) on the vertical axis. A boundary separates the region with enough information to reconstruct from the region without enough information to reconstruct (where the ML solution is random).]

SLIDE 12

Informational and Computational Limits

[Figure: the same phase diagram, with a second boundary added. Above it there is enough information to efficiently reconstruct; between the two boundaries there is enough information to reconstruct (but perhaps not efficiently); below the first boundary there is not enough information to reconstruct (the ML solution is random).]


SLIDE 15

Informational and Computational Limits


  • What are the informational and computational limits?
  • Is there a gap?
  • Is there some minimum required separation for computational tractability?
  • Is learning the centers always easy given the true distribution?

Goal: analytic, quantitative answers, independent of any specific algorithm / estimator.

SLIDE 16

Behavior as a function of Sample Size

[Plot: label error (0.02–0.12) vs. sample size (100–3000, log scale) for k=16, d=1024, separation 6σ. Curves: “fair” EM, EM from the true centers, the maximum-likelihood run (fair or not), and the true centers.]

SLIDE 17

Behavior as a function of Sample Size

[Plot: the same label-error curves as Slide 16, plus a second panel showing, in bits/sample, the difference between the likelihood of “fair” EM runs and of EM from the true centers: each run (random init) and the run attaining the maximum likelihood.]
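A rough sketch of how one might reproduce this kind of experiment with scikit-learn (the slides do not specify the implementation; the dimension is reduced from 1024 so the sketch runs quickly, the center placement is a simplification, and the label error ignores label permutation, so it is only meaningful for the run initialized at the true centers):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
k, d, sep = 16, 64, 6.0
# k centers placed along distinct coordinate axes (a simplification
# of the slide's setup)
mu = np.zeros((k, d))
mu[np.arange(k), np.arange(k)] = sep

for n in (100, 300, 1000, 3000):
    z = rng.integers(k, size=n)             # true component labels
    X = mu[z] + rng.normal(size=(n, d))     # unit-variance clouds
    for name, init in (("fair EM", None), ("EM from true centers", mu)):
        gm = GaussianMixture(n_components=k, covariance_type="spherical",
                             means_init=init, random_state=0).fit(X)
        err = (gm.predict(X) != z).mean()   # crude label error
        print(n, name, round(float(err), 3))
```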

SLIDE 18

Clustering

Model of clustering

– What structure are we trying to capture?
– What properties do we expect the data to have?
– What are we trying to get out of it?
– What is a “good clustering”?

Empirical objective and evaluation

(e.g. minimization objective)
– Can it be used to recover the clustering (as specified above)?
– Post-hoc analysis: is what we found “real”?

Algorithm

– How well does it achieve the objective?
– How efficient is it? Under what circumstances?

SLIDE 19

Clustering

Model of clustering

– What structure are we trying to capture?
– What properties do we expect the data to have?
– What are we trying to get out of it?
– What is a “good clustering”?
[Slide labels: “Questions about the world” / “Mathematics”]

Empirical objective and evaluation

(e.g. minimization objective)
– Can it be used to recover the clustering (as specified above)?
– Post-hoc analysis: is what we found “real”?

Algorithm

– How well does it achieve the objective?
– How efficient is it? Under what circumstances?


SLIDE 23

Clustering

Model of clustering

– What structure are we trying to capture?
– What properties do we expect the data to have?
– What are we trying to get out of it?
– What is a “good clustering”?
[Slide labels: “Questions about the world” / “Mathematics”]

Empirical objective and evaluation

(e.g. minimization objective)
– Can it be used to recover the clustering (as specified above)?
– Post-hoc analysis: is what we found “real”? Can what we found generalize?

Algorithm

– How well does it achieve the objective?
– How efficient is it? Under what circumstances?

SLIDE 25

“Clustering is Easy”, take 1: Approximation Algorithms

(1+ε)-approximation for k-means in time $O\big(2^{(k/\varepsilon)^{\text{const}}}\, n d\big)$   [Kumar Sabharwal Sen 2004]

For any data set of points, find a clustering with k-means cost ≤ (1+ε) × cost of the optimal clustering.

SLIDE 26

“Clustering is Easy”, take 1: Approximation Algorithms

(1+ε)-approximation for k-means in time $O\big(2^{(k/\varepsilon)^{\text{const}}}\, n d\big)$   [Kumar Sabharwal Sen 2004]

Example: data from $\tfrac{1}{2} N(\mu_1,I) + \tfrac{1}{2} N(\mu_2,I)$ with $\mu_1 = (5,0,0,\dots,0)$, $\mu_2 = (-5,0,0,\dots,0)$:

$\mathrm{cost}([\mu_1,\mu_2]) = \sum_i \min_j \|x_i-\mu_j\|^2 \approx d\cdot n$
$\mathrm{cost}([0,0]) = \sum_i \|x_i\|^2 \approx (d+25)\cdot n$
⇒ $[0,0]$ is a $(1+25/d)$-approximation

Need $\varepsilon < \mathrm{sep}^2/d$; the running time becomes $O\big(2^{(kd/\mathrm{sep}^2)^{\text{const}}}\, n\big)$.
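A quick numeric check of this example (my own illustration): for large d, the single center at the origin is within a factor 1 + 25/d of the optimal cost, so a (1+ε)-approximation with fixed ε is free to return it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 100
mu = np.zeros((2, d))
mu[0, 0], mu[1, 0] = 5.0, -5.0
X = mu[rng.integers(2, size=n)] + rng.normal(size=(n, d))

def cost(X, centers):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

c_true = cost(X, mu)                   # ≈ d * n
c_zero = cost(X, np.zeros((1, d)))     # ≈ (d + 25) * n
print(c_zero / c_true, 1 + 25 / d)     # both ≈ 1.25 for d = 100
```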

SLIDE 27

“Clustering is Easy”, take 2: Data drawn from a Gaussian Mixture

$x_1, x_2, \dots, x_n \sim \tfrac{1}{k} N(\mu_1,\sigma^2 I) + \tfrac{1}{k} N(\mu_2,\sigma^2 I) + \cdots + \tfrac{1}{k} N(\mu_k,\sigma^2 I)$, with $\|\mu_i-\mu_j\| > s\cdot\sigma$

  • Find the modes
    (ε-neighborhood with the most points; point with the closest neighbors)
    – Required sample size: $n = 2^{\Omega(d)}$

  • Randomly project to Θ(log k) dimensions (see the sketch below)
    – Now $n = \Omega(k \log^2(1/\delta))$ is enough to find the modes
    – With $s > \tfrac{1}{2} d^{1/2}$, the modes are maintained in the projection

[Dasgupta 99]
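A sketch of the random-projection step (a plain Gaussian projection, Johnson–Lindenstrauss style; the mode-finding step and the choice of the Θ(log k) constant are omitted here and are my own assumptions):

```python
import numpy as np

def random_project(X, target_dim, rng):
    """Project n points in R^d down to target_dim dimensions with a
    random Gaussian matrix; pairwise distances (and hence the
    separation between well-separated modes) are approximately
    preserved with high probability."""
    d = X.shape[1]
    R = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
    return X @ R

# usage: project a k=16 mixture down to O(log k) dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1024))
Y = random_project(X, target_dim=8, rng=rng)   # 8 = 2*log2(16), my choice
```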

SLIDE 28

“Clustering is Easy”, take 2: Data drawn from a Gaussian Mixture

$x_1, x_2, \dots, x_n \sim \tfrac{1}{k} N(\mu_1,\sigma^2 I) + \tfrac{1}{k} N(\mu_2,\sigma^2 I) + \cdots + \tfrac{1}{k} N(\mu_k,\sigma^2 I)$, with $\|\mu_i-\mu_j\| > s\cdot\sigma$

  • Dasgupta 1999: random projection, then mode finding; $s > 0.5\, d^{1/2}$, $n = \Omega(k \log^2(1/\delta))$
  • Arora & Kannan 2001: distance based; $s = \Omega(d^{1/4} \log d)$

SLIDE 29

“Clustering is Easy”, take 2: Data drawn from a Gaussian Mixture

$x_1, x_2, \dots, x_n \sim \tfrac{1}{k} N(\mu_1,\sigma^2 I) + \tfrac{1}{k} N(\mu_2,\sigma^2 I) + \cdots + \tfrac{1}{k} N(\mu_k,\sigma^2 I)$, with $\|\mu_i-\mu_j\| > s\cdot\sigma$

  • Randomly project to Θ(log k) dimensions
    – Now $n = \Omega(k \log^2(1/\delta))$ is enough to find the modes
    – With $s > \tfrac{1}{2} d^{1/2}$, the modes are maintained in the projection

  • Project to the k principal directions (PCA); see the sketch below
    – Spherical Gaussian components: the k principal directions of the true distribution span the centers
    – Required separation only $s = \Omega(k^{1/4} \log(dk))$

[Dasgupta 99] [Vempala Wang 04]
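A sketch of the spectral-projection step (project onto the top-k principal directions via SVD; what happens in the projected space, e.g. the distance-based step, is omitted):

```python
import numpy as np

def spectral_project(X, k):
    """Project centered data onto its top-k principal directions.
    For spherical mixture components these directions approximately
    span the k centers, so the separation between centers survives
    while the noise in the remaining d-k directions is discarded."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are the principal directions (right singular vectors)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```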

SLIDE 30

“Clustering is Easy”, take 2: Data drawn from a Gaussian Mixture

$x_1, x_2, \dots, x_n \sim \tfrac{1}{k} N(\mu_1,\sigma^2 I) + \tfrac{1}{k} N(\mu_2,\sigma^2 I) + \cdots + \tfrac{1}{k} N(\mu_k,\sigma^2 I)$, with $\|\mu_i-\mu_j\| > s\cdot\sigma$

  • Dasgupta 1999: random projection, then mode finding; $s > 0.5\, d^{1/2}$, $n = \Omega(k \log^2(1/\delta))$
  • Dasgupta & Schulman 2000: two-round EM with Θ(k·log k) centers; $n = \mathrm{poly}(k)$, $s = \Omega(d^{1/4})$ (large d)
  • Arora & Kannan 2001: distance based; $s = \Omega(d^{1/4} \log d)$
    (distance-based approaches use: all between-class distances > all within-class distances)
  • Vempala & Wang 2004: spectral projection, then distances; $n = \Omega(d^3 k^2 \log(dk/s\delta))$, $s = \Omega(k^{1/4} \log(dk))$

General mixture of Gaussians:
  • Kannan, Salmasian & Vempala 2005: $s = \Omega(k^{5/2} \log(kd))$, $n = \Omega(k^2 d \cdot \log^5(d))$
  • Achlioptas & McSherry 2005: $s > 4k + o(k)$, $n = \Omega(k^2 d)$