Clustering via Uncoupled REgression (CURE), by Kaizheng Wang (Department of ORFE, Princeton University). PowerPoint PPT presentation.



SLIDE 1

Clustering via Uncoupled REgression (CURE)

Kaizheng Wang, Department of ORFE, Princeton University. May 8, 2020.

SLIDE 2

Collaborators

Yuling Yan (Princeton ORFE) and Mateo Díaz (Cornell CAM).

SLIDE 3

Clustering

SLIDE 4

Spherical Clusters

$\{x_i\}_{i=1}^n \sim \frac{1}{2}\mathcal{N}(\mu, I_d) + \frac{1}{2}\mathcal{N}(-\mu, I_d)$

SLIDE 5

Spherical Clusters

$\{x_i\}_{i=1}^n \sim \frac{1}{2}\mathcal{N}(\mu, I_d) + \frac{1}{2}\mathcal{N}(-\mu, I_d)$

  • PCA: $\max_{\beta \in S^{d-1}} \frac{1}{n} \sum_{i=1}^n (\beta^\top x_i)^2$;
  • k-means: $\min_{\mu_1, \mu_2, y} \frac{1}{n} \sum_{i=1}^n \|x_i - \mu_{y_i}\|_2^2$;
  • SDP relaxations of k-means, etc.;
  • Density-based methods require large samples.

SLIDE 6

Finding a Needle in a Haystack

They are powerful but not omnipotent. Consider $\frac{1}{2}\mathcal{N}(\mu, \Sigma) + \frac{1}{2}\mathcal{N}(-\mu, \Sigma)$ with $\Sigma \neq I$; its covariance is $\mu\mu^\top + \Sigma$.

  • Max variance is useful only when $\|\mu\|_2^2 / \|\Sigma\|_2 \gtrsim 1$;
  • PCA: needs $\Sigma \approx I$.

Reduction to the spherical case?

  • Estimation of $\Sigma$ is difficult!

SLIDE 7

Headaches

  • PCA and many other methods require nice shapes & large separations.
  • Learning with non-convex losses:
  • 1. Initialization (e.g. spectral methods);
  • 2. Refinement (e.g. gradient descent).

Commonly-used assumptions: isotropic, Gaussian, uniform, etc. Stretched mixtures can be catastrophic.
SLIDE 8

Clustering via Uncoupled REgression

  • The CURE methodology
  • Theoretical guarantees
SLIDE 9

Vanilla CURE

Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$ for all $i \in [n]$.

SLIDE 10

Vanilla CURE

Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$ for all $i \in [n]$. Clustering via Uncoupled REgression:

$\frac{1}{n} \sum_{i=1}^n \delta_{\beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}.$

SLIDE 11

Vanilla CURE

Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$ for all $i \in [n]$. Clustering via Uncoupled REgression:

$\frac{1}{n} \sum_{i=1}^n \delta_{\beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}.$

CURE: take $f$ with valleys at $\pm 1$, e.g. $f(x) = (x^2 - 1)^2$; solve $\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i)$; return $\hat{y}_i = \mathrm{sgn}(\hat{\beta}^\top x_i)$.
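The recipe above fits in a few lines. The following is a minimal sketch on a synthetic spherical mixture; the sample size, separation, step size, and the use of plain (rather than perturbed) gradient descent are assumptions for the demo, and the loss uses the normalization $f(x) = (x^2 - 1)^2/4$ shown on a later slide.

```python
# Minimal sketch of vanilla CURE on a centered spherical mixture.
# Demo assumptions: n, d, ||mu||, the step size, and plain gradient descent
# (the talk analyzes perturbed GD, Jin et al. 2017).
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 5
mu = np.zeros(d)
mu[0] = 3.0                                        # cluster centers at +/- mu
y = rng.choice([-1.0, 1.0], size=n)                # hidden labels
X = y[:, None] * mu + rng.standard_normal((n, d))  # x_i ~ N(y_i * mu, I_d)

def grad(beta):
    """Gradient of (1/n) sum_i f(beta^T x_i) with f(t) = (t^2 - 1)^2 / 4."""
    u = X @ beta
    return X.T @ ((u**2 - 1.0) * u) / n

beta = 0.01 * rng.standard_normal(d)               # random initialization
for _ in range(2000):
    beta -= 0.01 * grad(beta)

y_hat = np.sign(X @ beta)                          # hat{y}_i = sgn(beta^T x_i)
err = min(np.mean(y_hat != y), np.mean(y_hat == y))  # error up to label swap
print(err)
```

The clustering error is invariant under a global sign flip of $\hat{\beta}$, hence the `min` over the two labelings.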

SLIDE 12

Vanilla CURE

$\frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i)$ is non-convex by nature.

  • Projection pursuit (Friedman and Tukey, 1974), ICA (Hyvärinen and Oja, 2000):
  • Maximize deviation from the null (Gaussian);
  • Limited algorithmic guarantees.
  • Phase retrieval (Candès et al. 2011):
  • Isotropic measurements, spectral initialization.

slide-13
SLIDE 13

Given , find and s.t. The naïve extension yields trivial solutions . It only forces rather than

13

α ∈ R

β ∈ Rd

Vanilla CURE with Intercept

1 n

n

X

i=1

↵+β>xi ⇡ 1 21 + 1 21.

{xi}n

i=1 ✓ Rd

(ˆ α, ˆ β) = (±1, 0)

P min

α2R, β2Rd

1 n

n

X

i=1

f(α + β>xi).

|↵ + β>xi| ⇡ 1 #

#{i : α + β>xi ⇡ 1} ⇡ n/2.

slide-14
SLIDE 14

Given , find and s.t. CURE:

14

α ∈ R

β ∈ Rd

Vanilla CURE with Intercept

1 n

n

X

i=1

↵+β>xi ⇡ 1 21 + 1 21.

{xi}n

i=1 ✓ Rd

min

↵2R, β2Rd

⇢ 1 n

n

X

i=1

f(↵ + β>xi) + 1 2(↵ + β> ¯ x)2

  • .
SLIDE 15

Vanilla CURE with Intercept

Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}^d$ s.t.

$\frac{1}{n} \sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}.$

CURE: $\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2 \right\}$.

  • Loss term: forces $|\alpha + \beta^\top x_i| \approx 1$;
  • Penalty term: forces $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$.
  • Moment matching. Extension: imbalanced cases.
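As a numerical sanity check on the roles of the two terms, here is a small sketch on uncentered data; the shift, sample size, and step size are illustrative assumptions, and the loss uses the normalization $f(x) = (x^2 - 1)^2/4$ from the next slide (with the unnormalized quartic, the relative weight of the penalty would change).

```python
# CURE with intercept on *uncentered* data: the mean penalty
# (alpha + beta^T xbar)^2 / 2 rules out the trivial fit (alpha, beta) = (+/-1, 0).
# Demo assumptions: shift, n, d, separation, step size, plain gradient descent.
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 5
mu = np.zeros(d)
mu[0] = 3.0
shift = np.full(d, 1.0)                            # data are NOT centered
y = rng.choice([-1.0, 1.0], size=n)
X = shift + y[:, None] * mu + rng.standard_normal((n, d))
xbar = X.mean(axis=0)

def grads(alpha, beta):
    """Gradients of (1/n) sum_i f(alpha + beta^T x_i) + (alpha + beta^T xbar)^2 / 2
    with f(t) = (t^2 - 1)^2 / 4, i.e. f'(t) = (t^2 - 1) t."""
    u = alpha + X @ beta
    m = alpha + xbar @ beta                        # fitted mean response
    fu = (u**2 - 1.0) * u
    return fu.mean() + m, X.T @ fu / n + m * xbar

alpha, beta = 0.0, 0.01 * rng.standard_normal(d)
for _ in range(3000):
    ga, gb = grads(alpha, beta)
    alpha -= 0.01 * ga
    beta -= 0.01 * gb

y_hat = np.sign(alpha + X @ beta)
err = min(np.mean(y_hat != y), np.mean(y_hat == y))  # error up to label swap
print(err)
```

The intercept absorbs the unknown shift of the data, while the penalty keeps the fitted responses centered around $0$, matching the mean of $\frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.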

SLIDE 16

Loss Function

$f(x) = (x^2 - 1)^2/4$

Clip to improve

  • concentration and robustness for statistics;
  • growth condition and smoothness for optimization.

SLIDE 17

Example: Fashion-MNIST

70000 fashion products, 10 categories (Xiao et al. 2017).

  • T-shirts/tops
  • Pullovers

Visualization by PCA

SLIDE 18

Example: Fashion-MNIST

Goal: cluster 1000 T-shirts/tops and 1000 Pullovers. Algorithm: gradient descent with random initialization from the unit sphere. Errors: CURE 5.2%, k-means 44.3%, spectral (vanilla) 41.9%, spectral (Gaussian kernel) 10.5%. Also works when the classes are imbalanced.

SLIDE 19

General CURE

Given $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, find $f : \mathcal{X} \to \mathcal{Y}$ in $\mathcal{F}$ s.t.

$\frac{1}{n} \sum_{i=1}^n \delta_{f(x_i)} \approx \sum_{j=1}^K \pi_j \delta_{y_j}.$
SLIDE 20

General CURE

Given $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, find $f : \mathcal{X} \to \mathcal{Y}$ in $\mathcal{F}$ s.t.

$\frac{1}{n} \sum_{i=1}^n \delta_{f(x_i)} \approx \sum_{j=1}^K \pi_j \delta_{y_j}.$

CURE: $\min_{f \in \mathcal{F}} D(f_\# \hat{\rho}_n, \nu).$

  • Discrepancy measure: divergence; MMD; $W_p$;
  • Fashion-MNIST (10 classes), CNN + $W_1$: state-of-the-art;
  • Bridle et al. (1992), Krause et al. (2010), Springenberg (2015), Xie et al. (2016), Yang et al. (2017), Hu et al. (2017), Shaham et al. (2018).
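For scalar-valued $f$ and $D = W_1$, the discrepancy to a discrete target reduces to a quantile-matching computation. A minimal sketch, assuming equal target weights $\pi_j = 1/2$ and even $n$; the function name is illustrative:

```python
# One concrete discrepancy for general CURE: the 1-Wasserstein distance between
# the scalar pushforward f_# rho_n and the two-point target
# nu = 0.5 * delta_{-1} + 0.5 * delta_{+1}, computed by quantile matching.
import numpy as np

def w1_to_two_point(u):
    """W1 between the empirical law of u and 0.5*delta_{-1} + 0.5*delta_{+1}.

    In one dimension the optimal (monotone) coupling matches the lower half
    of the sorted values to -1 and the upper half to +1 (n assumed even).
    """
    u = np.sort(np.asarray(u, dtype=float))
    n = len(u)
    target = np.concatenate([-np.ones(n // 2), np.ones(n - n // 2)])
    return np.mean(np.abs(u - target))

# A perfectly clustered pushforward has zero discrepancy:
print(w1_to_two_point([-1.0, -1.0, 1.0, 1.0]))   # -> 0.0
# Collapsing everything to one point is penalized:
print(w1_to_two_point([0.0, 0.0, 0.0, 0.0]))     # -> 1.0
```

Minimizing this quantity over $f \in \mathcal{F}$ (e.g. with a neural network and automatic differentiation) is one instance of $\min_{f \in \mathcal{F}} D(f_\# \hat{\rho}_n, \nu)$.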

SLIDE 21

Clustering Algorithms

  • Generative: model (X, Y), then derive (Y | X)
  • Distribution learning (EM, DBSCAN)
  • ~ Linear discriminant analysis
  • Discriminative: model (Y | X) directly; CURE belongs here.
  • Criterion optimization (projection pursuit, transductive SVM)
  • ~ Logistic regression
SLIDE 22

Clustering Algorithms

Drawbacks of generative approaches

  • Model dependency
  • Unnecessary parameters
  • Computational challenges
  • Strong conditions
SLIDE 23

Clustering Algorithms

Example: $\{x_i\}_{i=1}^n \sim \frac{1}{2}\mathcal{N}(\mu, I_d) + \frac{1}{2}\mathcal{N}(-\mu, I_d)$:

  • Parameter estimation of $\mu$: needs $\|\mu\|_2 \gtrsim \sqrt{d/n}$;
  • Clustering of $y$: needs $\|\mu\|_2 \gtrsim (d/n)^{1/4}$.

Never ask for more than you need!
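One way to read the comparison above, written out under the balanced spherical Gaussian model with constants and logarithmic factors omitted (a recap of the slide's fragments, not a new result):

```latex
\begin{align*}
  \text{estimating } \mu \text{ (relative error} < 1\text{):}\quad
    & \|\hat{\mu} - \mu\|_2 \asymp \sqrt{d/n}
      \;\Longrightarrow\; \|\mu\|_2 \gtrsim (d/n)^{1/2} \text{ required};\\
  \text{clustering } y \text{ (small misclassification):}\quad
    & \|\mu\|_2 \gtrsim (d/n)^{1/4} \text{ suffices}.
\end{align*}
```

In the high-dimensional regime $d \gtrsim n$ we have $(d/n)^{1/4} \le (d/n)^{1/2}$, so reliable clustering is possible at signal strengths where $\mu$ itself cannot be estimated accurately.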

SLIDE 24

Clustering via Uncoupled REgression

  • The CURE methodology
  • Theoretical guarantees
SLIDE 25

Elliptical Mixture Model

Main Assumptions:

  • $x_i \sim (\mu_1, \Sigma)$ if $y_i = 1$ and $x_i \sim (\mu_{-1}, \Sigma)$ if $y_i = -1$; equivalently, $x_i = \mu_{y_i} + \Sigma^{1/2} z_i$;
  • $\mathbb{P}(y_i = 1) = \mathbb{P}(y_i = -1) = 1/2$;
  • $z_i$ spherically symmetric, leptokurtic, sub-Gaussian.

CURE: $\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2 \right\}$.

SLIDE 26

Theoretical Guarantees

Theorem (WYD'20). Suppose $n/d$ is large. The perturbed gradient descent algorithm (Jin et al. 2017), starting from 0, achieves statistical precision within $\widetilde{O}\!\left(\frac{n}{d} \vee \frac{d^2}{n}\right)$ iterations (hiding polylog factors).

SLIDE 27

Theoretical Guarantees

Theorem (WYD'20). Suppose $n/d$ is large. The perturbed gradient descent algorithm (Jin et al. 2017), starting from 0, achieves statistical precision within $\widetilde{O}\!\left(\frac{n}{d} \vee \frac{d^2}{n}\right)$ iterations (hiding polylog factors).

  • Efficient clustering for stretched mixtures without warm start;
  • Two terms: prices for accuracy (stat.) and smoothness (opt.);
  • Angular error: $\widetilde{O}(\sqrt{d/n})$; excess risk: $\widetilde{O}(d/n)$.

SLIDE 28

Proof Sketch: Population

Consider the centered case $x_i \sim (\pm\mu, \Sigma)$:

$\min_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i).$

Theorem (population landscape). Let $f(x) = (x^2 - 1)^2/4$. For the infinite-sample loss:

  • Two minima $\pm\beta^*$, where $\beta^* \propto \Sigma^{-1}\mu$, locally strongly convex;
  • A local maximum at $0$; all saddles are strict.
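One way to see where $\pm\beta^* \propto \Sigma^{-1}\mu$ comes from, sketched for the Gaussian special case (the elliptical model in the talk requires only the analogous moment computation):

```latex
% Population loss along a direction beta, for x ~ 1/2 N(mu, Sigma) + 1/2 N(-mu, Sigma).
\[
  u = \beta^\top x, \qquad m = \beta^\top \mu, \qquad v = \beta^\top \Sigma \beta
  \quad\Longrightarrow\quad
  u \sim \tfrac{1}{2}\mathcal{N}(m, v) + \tfrac{1}{2}\mathcal{N}(-m, v),
\]
\[
  4\,\mathbb{E} f(u) = \mathbb{E} u^4 - 2\,\mathbb{E} u^2 + 1
  = m^4 + 6 m^2 v + 3 v^2 - 2(m^2 + v) + 1.
\]
```

Near the minimizers ($m \approx 1$) the right-hand side is strictly increasing in $v$, so the optimal direction minimizes $v = \beta^\top \Sigma \beta$ subject to $\beta^\top \mu = m$; Lagrange multipliers give $\Sigma \beta \propto \mu$, i.e. $\beta^* \propto \Sigma^{-1}\mu$, and the sign ambiguity yields the two minima $\pm\beta^*$.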

SLIDE 29

Loss Function

$f(x) = (x^2 - 1)^2/4$

Clip to improve

  • concentration and robustness for statistics;
  • growth condition and smoothness for optimization.

SLIDE 30

Proof Sketch: Finite Samples

Let $\widehat{L}(\beta) = \frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i)$.

Theorem (empirical landscape). Suppose $n/d$ is large. With high probability,

  • approximate second-order stationary points are good;
  • $\nabla \widehat{L}$ is $\widetilde{O}(1)$-Lipschitz, $\nabla^2 \widehat{L}$ is $\widetilde{O}(1 \vee \frac{d}{\sqrt{n}})$-Lipschitz.

A nice landscape ensures efficiency and accuracy of optimization.

SLIDE 31

Proof Sketch: Finite Samples

Let $\widehat{L}(\beta) = \frac{1}{n} \sum_{i=1}^n f(\beta^\top x_i)$.

Theorem (empirical landscape). Suppose $n/d$ is large. With high probability,

  • approximate second-order stationary points are good: if $\|\nabla \widehat{L}(\beta)\|_2 \le \delta$ and $\lambda_{\min}[\nabla^2 \widehat{L}(\beta)] \ge -\delta$, then

$\|\beta - \beta^*\|_2 \lesssim \underbrace{\|\nabla \widehat{L}(\beta)\|_2}_{\text{opt. err.}} + \underbrace{\sqrt{\tfrac{d}{n} \log\!\left(\tfrac{n}{d}\right)}}_{\text{stat. err.}};$

  • $\nabla \widehat{L}$ is $\widetilde{O}(1)$-Lipschitz, $\nabla^2 \widehat{L}$ is $\widetilde{O}(1 \vee \frac{d}{\sqrt{n}})$-Lipschitz.

A nice landscape ensures efficiency and accuracy of optimization.

SLIDE 32

Summary

A general CURE for clustering problems. Wang, Yan and Díaz. Efficient clustering for stretched mixtures: landscape and optimality. Submitted.

  • Clustering -> classification;
  • Flexible choices of transforms, OOS-extensions;
  • Stat. and comp. guarantees under mixture models.

Extensions

  • High dim., significance testing, model selection;
  • Representation learning, semi-supervised version.


SLIDE 33

Q & A

SLIDE 34

Thank you!