Informational and Computational Limits of Clustering
and other questions about clustering
Nati Srebro
University of Toronto
based on work in progress with
Gregory Shakhnarovich
Brown University
Sam Roweis
University of Toronto
– Each point comes from one of k “centers”
– Gaussian cloud around each center
– For now: unit-variance Gaussians, uniform prior over the choice of center
– Likelihood of centers: the marginal log-likelihood of the data
– k-means objective: the likelihood of the best hard assignment (both are sketched below)
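The slide's formulas did not survive extraction; below is a minimal sketch of the two objectives under the stated assumptions (unit-variance spherical Gaussians, uniform prior over the k centers). The notation is mine, not the slide's.

```latex
% Data x_1..x_n in R^d from a uniform mixture of k unit-variance
% spherical Gaussians with centers mu_1..mu_k.

% Likelihood of centers (marginal log-likelihood):
\log L(\mu_1,\dots,\mu_k)
  = \sum_{i=1}^{n} \log \frac{1}{k} \sum_{j=1}^{k}
      (2\pi)^{-d/2} \, e^{-\|x_i - \mu_j\|^2 / 2}

% Likelihood of an assignment c : {1..n} -> {1..k}; maximizing over c
% yields the k-means objective, up to additive constants:
\log L(\mu, c)
  = \mathrm{const} - \frac{1}{2} \sum_{i=1}^{n} \|x_i - \mu_{c(i)}\|^2
```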
*EM with some bells and whistles: spectral projection (PCA), pruning centers, etc.
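The slide leaves the procedure unspecified; the following is a hypothetical sketch of such a pipeline: PCA (spectral) projection, EM over-seeded with extra centers, then pruning. The function name, the `n_extra` over-seeding factor, and the pruning rule are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch of "EM with bells and whistles" (my reading of the
# slide, not the authors' code): PCA projection, EM over-seeded with extra
# centers, then pruning down to the k heaviest components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def cluster(X, k, n_extra=2):
    # Spectral projection: the top-k principal directions approximately
    # contain the span of the true centers, so project before running EM.
    Z = PCA(n_components=k).fit_transform(X)

    # Over-seed: random initialization rarely lands one seed per true
    # cluster, so start EM with k * n_extra candidate centers.
    gmm = GaussianMixture(n_components=k * n_extra,
                          covariance_type="spherical",
                          n_init=5).fit(Z)

    # Prune: keep only the k components with the largest mixing weights.
    means = gmm.means_[np.argsort(gmm.weights_)[-k:]]

    # Hard-assign each point to its nearest surviving center.
    d2 = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```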
From large separation / more samples to small separation / fewer samples:
– Lots of data: the true solution creates a distinct peak. Easy to find.
– Just enough data: meaningful, but hard to find?
– Not enough data: the “optimal” solution is meaningless.
Informational limit vs. computational limit, along the axis from large separation / more samples down to small separation / fewer samples.
Infinite-data limit: Ex[cost(x; model)] = KL(true || model), so the mode is always at the true model; it is determined by the true model alone.
The actual log-likelihood also depends on the sample: near the truth, the local ML model ~ N(true model, (1/n)·J_Fisher⁻¹) [Redner & Walker 84].
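Spelled out (a reconstruction consistent with the slide and with the asymptotic normality result of Redner & Walker 1984; θ* denotes the true parameters):

```latex
% Infinite-data limit: the expected cost is the KL divergence (plus the
% entropy of the true distribution), so its optimum sits at the true model:
\mathbb{E}_x\!\left[\mathrm{cost}(x;\theta)\right]
  = \mathrm{KL}\!\left(\theta^{\star} \,\middle\|\, \theta\right) + \mathrm{const}

% Finite sample of size n: the local ML estimate near the truth fluctuates
% around it with covariance given by the inverse Fisher information J:
\hat{\theta}_{\mathrm{ML}}
  \;\approx\; \mathcal{N}\!\left(\theta^{\star},\; \tfrac{1}{n}\, J_{\mathrm{Fisher}}^{-1}\right)
```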
[Figure: phase diagram over sample size (n) and separation (s). Three regions: not enough information to reconstruct (the ML solution is random); enough information to reconstruct; enough information to efficiently reconstruct.]
Axes: sample size (n) and separation (s). Where does the limit of computational tractability lie? How does it depend on the distribution? We want analytic, quantitative answers, independent of any specific algorithm or estimator.
[Plot: label error (0.02–0.12) vs. sample size (100–3000), for k=16, d=1024, sep=6σ; curves: “fair” EM, EM from true centers, max likelihood (fair or not), true centers.]
[Plot: bits/sample (1–5) vs. sample size (100–3000): difference between the likelihood of “fair” EM runs and EM from the true centers, shown for each run (random init) and for the run attaining the maximum likelihood.]
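A hypothetical reconstruction of the kind of experiment behind these plots, under the stated setting (k=16, d=1024, separation 6σ): “fair” EM is read here as EM from random restarts, contrasted with EM initialized at the true centers. The center placement, restart count, and error metric below are my assumptions, not the authors' protocol.

```python
# Hypothetical reconstruction of the experiment (names and details mine):
# k = 16 unit-variance spherical Gaussians in d = 1024 dimensions, centers
# 6 sigma apart; compare "fair" EM (random restarts) with EM initialized
# at the true centers, measuring label error.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

def sample_mixture(n, k, d, sep, rng):
    # Centers at scaled basis vectors, so every pairwise distance is `sep`.
    centers = np.zeros((k, d))
    centers[np.arange(k), np.arange(k)] = sep / np.sqrt(2)
    labels = rng.integers(k, size=n)
    return centers[labels] + rng.standard_normal((n, d)), labels, centers

def label_error(y_true, y_pred, k):
    # Label error under the best matching of predicted to true clusters.
    conf = np.zeros((k, k))
    np.add.at(conf, (y_true, y_pred), 1)
    rows, cols = linear_sum_assignment(-conf)
    return 1.0 - conf[rows, cols].sum() / len(y_true)

rng = np.random.default_rng(0)
k, d, sep = 16, 1024, 6.0
X, y, centers = sample_mixture(1000, k, d, sep, rng)

fair = GaussianMixture(n_components=k, covariance_type="spherical",
                       n_init=10, random_state=0).fit(X)
oracle = GaussianMixture(n_components=k, covariance_type="spherical",
                         means_init=centers, random_state=0).fit(X)
print("fair EM label error:  ", label_error(y, fair.predict(X), k))
print("oracle EM label error:", label_error(y, oracle.predict(X), k))
```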
What is a “good clustering”?
Questions about the world:
– What structure are we trying to capture?
– What properties do we expect the data to have?
– What are we trying to get out of it?
Mathematics (e.g., a minimization objective):
– Can it be used to recover the clustering (as specified above)?
– Post-hoc analysis: is what we found “real”? Can what we found generalize?
– How well does it achieve the objective? How efficient is it? Under what circumstances?
Separation and sample-size requirements of known algorithms (spherical Gaussians):
– Dasgupta 1999: random projection, then mode finding; s > 0.5·d½ (all between-class distances exceed all within-class distances); n = Ω(k·log²(1/δ))
– Dasgupta & Schulman 2000: two rounds of EM with Θ(k·log k) candidate centers; s = Ω(d¼) (large d); n = poly(k)
– Arora & Kannan 2001: distance based; s = Ω(d¼ log d)
– Vempala & Wang 2004: spectral projection, then distances; s = Ω(k¼ log(dk)); n = Ω(d³k²·log(dk/(sδ)))
General mixtures of Gaussians:
– Kannan, Salmasian & Vempala 2005: s = Ω(k^(5/2)·log(kd)); n = Ω(k²d·log⁵ d)
– Achlioptas & McSherry 2005: s > 4k + o(k); n = Ω(k²d)
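For concreteness, a rough sketch of the “spectral projection, then distances” recipe from the Vempala & Wang row; this illustrates the idea only and is not their algorithm (the single-linkage step is a stand-in for their distance-based grouping).

```python
# Rough illustration of "spectral projection, then distances": project onto
# the top-k singular directions, then group points by pairwise distances.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def spectral_then_distances(X, k):
    # Rank-k spectral projection: the top-k singular directions of the
    # centered data approximately span the true centers.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T

    # With enough separation, all between-cluster distances in the
    # projection exceed the within-cluster ones, so cutting a
    # single-linkage tree into k groups recovers the clusters.
    return fcluster(linkage(Z, method="single"), t=k, criterion="maxclust") - 1
```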