Clustering via Uncoupled REgression (CURE)
Kaizheng Wang, Department of ORFE, Princeton University. May 8th, 2020.
Collaborators: Yuling Yan (Princeton ORFE), Mateo Díaz (Cornell CAM)
Spherical clusters: $\{x_i\}_{i=1}^n \sim \frac{1}{2} N(\mu, I_d) + \frac{1}{2} N(-\mu, I_d)$.
$\{x_i\}_{i=1}^n \sim \frac{1}{2} N(\mu, I_d) + \frac{1}{2} N(-\mu, I_d)$

PCA: $\max_{\beta \in S^{d-1}} \frac{1}{n} \sum_{i=1}^n (\beta^\top x_i)^2$

$k$-means: $\min_{\mu_1, \mu_2, y} \frac{1}{n} \sum_{i=1}^n \|x_i - \mu_{y_i}\|_2^2$
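Both formulations succeed on spherical clusters. As a minimal sanity check (my own simulated data and dimensions, not from the slides), the top principal component of such a mixture aligns with the separating direction $\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10
mu = np.zeros(d)
mu[0] = 3.0                               # cluster centers at +-mu
y = rng.choice([-1, 1], size=n)           # latent labels
x = y[:, None] * mu + rng.standard_normal((n, d))   # 0.5*N(mu,I) + 0.5*N(-mu,I)

# PCA: leading eigenvector of the sample covariance
cov = x.T @ x / n
eigvals, eigvecs = np.linalg.eigh(cov)
beta = eigvecs[:, -1]                     # top principal component

alignment = abs(beta @ mu) / np.linalg.norm(mu)        # |cos angle| with mu
labels = np.sign(x @ beta)
err = min(np.mean(labels != y), np.mean(labels == y))  # labels are up to sign
print(f"alignment = {alignment:.2f}, misclustering = {err:.3f}")
```

With isotropic noise the sample covariance is approximately $\mu\mu^\top + I_d$, so its leading eigenvector recovers the direction of $\mu$ and thresholding the projections clusters the points.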
They are powerful but not omnipotent. ($\Sigma$: covariance matrix.)
Reduction to the spherical case? The overall covariance is $\mu\mu^\top + \Sigma$ and the relevant ratio is $\|\mu\|_2^2 / \|\Sigma\|_2$; unless $\Sigma \approx I$, the whitened data are $\neq$ a spherical mixture, so $\frac{1}{2} N(\mu, \Sigma) + \frac{1}{2} N(-\mu, \Sigma)$ does not reduce to the spherical case.
Stretched mixtures can be catastrophic. Commonly-used assumptions: isotropic, Gaussian, uniform, etc.
Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$ for $i \in [n]$.
Clustering via Uncoupled REgression: $\frac{1}{n} \sum_{i=1}^n \delta_{\beta^\top x_i} \approx \frac{1}{2} \delta_{-1} + \frac{1}{2} \delta_{1}$.
CURE: take $f$ with valleys at $\pm 1$, e.g. $f(x) = (x^2 - 1)^2$; solve $\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i)$; return $\hat{y}_i = \mathrm{sgn}(\hat{\beta}^\top x_i)$.
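A minimal numerical sketch of this recipe (the toy data, step size, and iteration count are my choices, not from the slides): plain gradient descent on the CURE objective with $f(x) = (x^2 - 1)^2$, followed by sign labeling.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 10
mu = np.zeros(d)
mu[0] = 3.0
y = rng.choice([-1, 1], size=n)
x = y[:, None] * mu + rng.standard_normal((n, d))
x -= x.mean(axis=0)                       # center the data

beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)              # random init on the unit sphere
for _ in range(3000):
    t = x @ beta
    grad = x.T @ (4 * t * (t**2 - 1)) / n  # gradient of (1/n) sum (t^2-1)^2
    beta -= 0.005 * grad

yhat = np.sign(x @ beta)                  # labels recovered up to a global sign
err = min(np.mean(yhat != y), np.mean(yhat == y))
print(f"misclustering = {err:.3f}")
```

The quartic pushes every projection $\beta^\top x_i$ toward one of the valleys $\pm 1$, so the fitted direction separates the two clusters without ever observing the labels.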
$\frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i)$ is non-convex by nature. Related: ICA (Hyvärinen and Oja, 2000).
Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}^d$ s.t. $\frac{1}{n} \sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2} \delta_{-1} + \frac{1}{2} \delta_{1}$.

The naïve extension $\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f(\alpha + \beta^\top x_i)$ yields trivial solutions $(\hat{\alpha}, \hat{\beta}) = (\pm 1, 0)$. It only forces $|\alpha + \beta^\top x_i| \approx 1$ rather than $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$.
CURE: $\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2} (\alpha + \beta^\top \bar{x})^2 \right\}$.
The penalty $\frac{1}{2} (\alpha + \beta^\top \bar{x})^2$ rules out the trivial solutions and forces $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$.
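To see the penalty in action, here is a hedged sketch (the data, offset, step size, and iteration count are my choices): gradient descent on the penalized objective with an intercept, on uncentered data. The penalty keeps the optimizer away from the trivial solution $(\pm 1, 0)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 10
mu = np.zeros(d)
mu[0] = 3.0
y = rng.choice([-1, 1], size=n)
x = y[:, None] * mu + rng.standard_normal((n, d)) + 1.0  # uncentered: global offset
xbar = x.mean(axis=0)

def fprime(t):                            # derivative of f(t) = (t^2 - 1)^2
    return 4 * t * (t**2 - 1)

alpha = 0.0
beta = rng.standard_normal(d) / np.sqrt(d)
for _ in range(4000):
    t = alpha + x @ beta
    m = alpha + xbar @ beta               # penalized mean projection
    g = fprime(t)
    alpha -= 0.005 * (g.mean() + m)       # gradient includes the penalty term
    beta -= 0.005 * (x.T @ g / n + m * xbar)

yhat = np.sign(alpha + x @ beta)
err = min(np.mean(yhat != y), np.mean(yhat == y))
print(f"misclustering = {err:.3f}, |alpha + beta.xbar| = {abs(alpha + xbar @ beta):.3f}")
```

Note that $(\alpha, \beta) = (1, 0)$ is no longer stationary: the penalty gradient there equals $1 \neq 0$, so the optimizer is pushed toward a nontrivial separating direction while the intercept absorbs the data offset.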
Clip to improve: $f(x) = (x^2 - 1)^2 / 4$.
Fashion-MNIST: 70,000 fashion products, 10 categories (Xiao et al. 2017). Visualization by PCA.
Goal: cluster 1000 T-shirts/tops and 1000 pullovers. Algorithm: gradient descent with random initialization from the unit sphere. Error rates: CURE 5.2%, $k$-means 44.3%, spectral (vanilla) 41.9%, spectral (Gaussian kernel) 10.5%. CURE also works when the classes are imbalanced.
Given $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, find $f : \mathcal{X} \to \mathcal{Y}$ in $\mathcal{F}$ s.t. $\frac{1}{n} \sum_{i=1}^n \delta_{f(x_i)} \approx \sum_{j=1}^K \pi_j \delta_{y_j}$.
CURE: $\min_{f \in \mathcal{F}} D(f_\# \hat{\rho}_n, \nu)$, where $\hat{\rho}_n$ is the empirical distribution of $\{x_i\}_{i=1}^n$ and $\nu = \sum_{j=1}^K \pi_j \delta_{y_j}$.
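As one concrete, hypothetical choice of the discrepancy $D$ when $\nu = \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$: the squared 2-Wasserstein distance between the empirical distribution of projections and $\nu$ has a closed form via sorting (quantile coupling). This is my illustration, not necessarily the $D$ used in the paper.

```python
import numpy as np

def w2_sq_to_two_point(t):
    """Squared 2-Wasserstein distance between the empirical distribution
    of t and the two-point target 0.5*delta_{-1} + 0.5*delta_{+1}."""
    t = np.sort(np.asarray(t, dtype=float))
    half = len(t) // 2
    targets = np.concatenate([-np.ones(half), np.ones(len(t) - half)])
    return np.mean((t - targets) ** 2)    # optimal coupling matches quantiles

t_good = np.concatenate([-1 + 0.01 * np.arange(5), 1 + 0.01 * np.arange(5)])
t_bad = np.zeros(10)                      # all projections collapse to 0
d_good = w2_sq_to_two_point(t_good)
d_bad = w2_sq_to_two_point(t_bad)
print(d_good, d_bad)                      # well-clustered projections score far lower
```

Because the target has equal mass at $\pm 1$, the optimal monotone coupling sends the lower half of the sorted projections to $-1$ and the upper half to $+1$; a degenerate map that collapses everything to one point is heavily penalized.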
Drawbacks of generative approaches
Example: $\{x_i\}_{i=1}^n \sim \frac{1}{2} N(\mu, I_d) + \frac{1}{2} N(-\mu, I_d)$. Never ask for more than you need! Estimating $\mu$ requires $\|\mu\|_2 \gtrsim \sqrt{d/n}$, while estimating the labels $y$ only requires $\|\mu\|_2 \gtrsim (d/n)^{1/4}$.
Main assumptions: $x_i \sim N(\mu_1, \Sigma)$ if $y_i = 1$ and $x_i \sim N(\mu_{-1}, \Sigma)$ if $y_i = -1$; equivalently, $x_i = \mu_{y_i} + \Sigma^{1/2} z_i$ for noise vectors $z_i$.

CURE: $\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2} (\alpha + \beta^\top \bar{x})^2 \right\}$.
Theorem (WYD'20). Suppose $n/d$ is large. The perturbed gradient descent algorithm (Jin et al. 2017) starting from 0 achieves statistical precision within $\widetilde{O}\left( \frac{n}{d} \vee \frac{d^2}{n} \right)$ iterations (hiding polylog factors).
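The slides do not spell out the algorithm, so here is a simplified sketch in the spirit of Jin et al. (2017): plain gradient descent plus a small random perturbation whenever the gradient is small (to escape saddle points), demonstrated on a toy saddle function. All thresholds and step sizes are illustrative.

```python
import numpy as np

def perturbed_gd(grad, x0, step=0.01, tol=1e-3, radius=1e-2, iters=5000, rng=None):
    """Gradient descent that injects noise at near-stationary points."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            # small gradient: possibly a saddle -- perturb to escape
            x = x + radius * rng.standard_normal(x.shape)
        else:
            x = x - step * g
    return x

# Toy landscape f(x) = x0^2 - x1^2 + x1^4 with a strict saddle at the origin
grad = lambda x: np.array([2 * x[0], -2 * x[1] + 4 * x[1] ** 3])
x = perturbed_gd(grad, np.zeros(2))
print(x)
```

Plain gradient descent started exactly at the origin would stay there forever (zero gradient); the perturbation step kicks the iterate off the saddle, after which descent carries it to one of the minimizers at $x_1 = \pm 1/\sqrt{2}$.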
Statistical precision: $\widetilde{O}(\sqrt{d/n})$ and $\widetilde{O}(d/n)$.
Theorem (population landscape). Let $f(x) = (x^2 - 1)^2 / 4$ and consider the centered case $x_i \sim N(\pm\mu, \Sigma)$ with objective $\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i)$. For the infinite-sample loss, the minimizers are $\pm\beta^*$, where $\beta^* \propto \Sigma^{-1} \mu$.
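A numerical sanity check of this claim (my own construction: a 2-D stretched mixture, a fixed initialization, and hand-picked step sizes): minimizing the empirical CURE loss recovers a direction close to $\Sigma^{-1}\mu$, the Fisher discriminant direction, even though the top principal component points the wrong way.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20000, 2
mu = np.array([1.0, 0.0])
sigma = np.diag([0.1, 9.0])               # stretched orthogonally to mu
y = rng.choice([-1, 1], size=n)
x = y[:, None] * mu + rng.standard_normal((n, d)) @ np.sqrt(sigma)

beta = np.array([1.0, 1.0]) / np.sqrt(2)  # fixed init for reproducibility
for _ in range(5000):
    t = x @ beta
    beta -= 0.005 * (x.T @ (t * (t**2 - 1)) / n)  # gradient for f(t)=(t^2-1)^2/4

target = np.linalg.solve(sigma, mu)        # Sigma^{-1} mu (discriminant direction)
cosine = abs(beta @ target) / (np.linalg.norm(beta) * np.linalg.norm(target))
print(f"cosine similarity with Sigma^-1 mu = {cosine:.3f}")
```

Here the overall covariance $\mu\mu^\top + \Sigma$ has its top eigenvector along the noisy second coordinate, so PCA would fail, while the CURE direction aligns with $\Sigma^{-1}\mu$ as the theorem predicts.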
Clip to improve: $f(x) = (x^2 - 1)^2 / 4$.
Theorem (empirical landscape). Suppose $n/d$ is large and let $\hat{L}(\beta) = \frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i)$. W.h.p., $\nabla \hat{L}$ is $\widetilde{O}(1)$-Lipschitz and $\nabla^2 \hat{L}$ is $\widetilde{O}(1 \vee \frac{d}{\sqrt{n}})$-Lipschitz. A nice landscape ensures efficiency and accuracy of optimization.
Moreover, if $\|\nabla \hat{L}(\beta)\|_2 \le \delta$ and $\lambda_{\min}[\nabla^2 \hat{L}(\beta)] \ge -\delta$, then (up to sign)
$$\|\beta - \beta^*\|_2 \lesssim \underbrace{\|\nabla \hat{L}(\beta)\|_2}_{\text{opt. err.}} + \underbrace{\sqrt{\frac{d}{n} \log\left(\frac{n}{d}\right)}}_{\text{stat. err.}}.$$
Extensions: a general CURE for clustering problems.

Wang, Yan, and Díaz. Efficient clustering for stretched mixtures: landscape and optimality. Submitted.