1. Clustering via Uncoupled REgression (CURE). Kaizheng Wang, Department of ORFE, Princeton University. May 8th, 2020.

2. Collaborators: Yuling Yan (Princeton ORFE), Mateo Díaz (Cornell CAM).

3. Clustering

4. Spherical Clusters: $\{x_i\}_{i=1}^n \sim \frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$.

5. Spherical Clusters: $\{x_i\}_{i=1}^n \sim \frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$.
• PCA: $\max_{\beta \in S^{d-1}} \frac{1}{n}\sum_{i=1}^n (\beta^\top x_i)^2$;
• k-means: $\min_{\mu_1, \mu_2, y} \frac{1}{n}\sum_{i=1}^n \|x_i - \mu_{y_i}\|_2^2$;
• SDP relaxations of k-means, etc.;
• Density-based methods require large samples.
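A minimal sketch of the PCA baseline on this slide, assuming a two-cluster setting: cluster by the sign of the projection onto the top principal component of the centered data. The function name is illustrative, not from the talk.

```python
# Sketch of the PCA baseline: label points by the sign of their projection
# onto the first principal component (labels are recovered up to global sign).
import numpy as np

def pca_cluster(X):
    Xc = X - X.mean(axis=0)
    # Top eigenvector of the sample covariance = first right singular vector.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.sign(Xc @ Vt[0])
```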

6. Finding a Needle in a Haystack. These methods are powerful but not omnipotent. Consider $\{x_i\}_{i=1}^n \sim \frac{1}{2}N(\mu, \Sigma) + \frac{1}{2}N(-\mu, \Sigma)$ with a general covariance $\Sigma$; the data covariance is $\mu\mu^\top + \Sigma$.
• PCA: the max-variance direction is useful only when $\Sigma \approx I$ or $\|\mu\|_2^2 / \|\Sigma\|_2 \gtrsim 1$.
• Reduction to the spherical case? Estimation of $\Sigma$ is difficult!

7. Headaches
• PCA and many other methods require nice cluster shapes and large separations.
• Learning with non-convex losses proceeds in two stages: 1. initialization (e.g. spectral methods); 2. refinement (e.g. gradient descent). Commonly used assumptions: isotropic, Gaussian, uniform, etc.
Stretched mixtures can be catastrophic. [Figure: scatter plot of a stretched two-cluster mixture]

8. Clustering via Uncoupled REgression (CURE)
• The CURE methodology
• Theoretical guarantees

9. Vanilla CURE. Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$, $i \in [n]$.

10. Vanilla CURE. Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$, $i \in [n]$. Clustering via Uncoupled REgression: $\frac{1}{n}\sum_{i=1}^n \delta_{\beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.

11. Vanilla CURE. Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$, $i \in [n]$. Clustering via Uncoupled REgression: $\frac{1}{n}\sum_{i=1}^n \delta_{\beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$. CURE: take $f$ with valleys at $\pm 1$, e.g. $f(x) = (x^2 - 1)^2$; solve $\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$; return $\hat{y}_i = \mathrm{sgn}(\hat{\beta}^\top x_i)$.
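A minimal sketch of this three-step recipe on synthetic spherical data (not the authors' code): plain gradient descent on the quartic loss from a random unit-sphere initialization; the step size and iteration count are illustrative choices.

```python
# Vanilla CURE sketch: minimize (1/n) * sum f(beta' x_i), f(t) = (t^2 - 1)^2,
# then label each point by the sign of its projection.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic spherical mixture: x_i ~ 0.5 N(mu, I) + 0.5 N(-mu, I).
n, d = 2000, 20
mu = np.zeros(d); mu[0] = 2.0
y = rng.choice([-1, 1], size=n)
X = y[:, None] * mu + rng.standard_normal((n, d))

def cure_grad(beta, X):
    """Gradient of (1/n) * sum f(beta' x_i) with f(t) = (t^2 - 1)^2."""
    t = X @ beta                       # projections beta' x_i
    return X.T @ (4 * t * (t**2 - 1)) / len(t)

beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)           # random init on the unit sphere
for _ in range(1000):                  # heuristic step size / iteration budget
    beta -= 0.01 * cure_grad(beta, X)

y_hat = np.sign(X @ beta)              # recovered labels, up to global sign
err = min(np.mean(y_hat != y), np.mean(y_hat == y))
print(f"clustering error: {err:.3f}")
```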

12. Vanilla CURE. $\frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$ is non-convex by nature.
• Projection pursuit (Friedman and Tukey, 1974), ICA (Hyvärinen and Oja, 2000): ‣ maximize deviation from the null (Gaussian); ‣ limited algorithmic guarantees.
• Phase retrieval (Candès et al. 2011): ‣ isotropic measurements, spectral initialization.

13. Vanilla CURE with Intercept. Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\beta \in \mathbb{R}^d$ and $\alpha \in \mathbb{R}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$. The naïve extension $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i)$ yields trivial solutions $(\hat{\alpha}, \hat{\beta}) = (\pm 1, 0)$. It only forces $|\alpha + \beta^\top x_i| \approx 1$ rather than $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$.

14. Vanilla CURE with Intercept. Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\beta \in \mathbb{R}^d$ and $\alpha \in \mathbb{R}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$. CURE: $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \big\{\frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2\big\}$.

15. Vanilla CURE with Intercept. Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\beta \in \mathbb{R}^d$ and $\alpha \in \mathbb{R}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$. CURE: $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \big\{\frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2\big\}$.
• $\frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i)$ forces $|\alpha + \beta^\top x_i| \approx 1$;
• $\frac{1}{2}(\alpha + \beta^\top \bar{x})^2$ forces $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$. ‣ Moment matching. Extension: imbalanced cases.
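A hedged sketch of this penalized objective and its gradient (function names are illustrative; the quartic scale follows the earlier sketch, not the clipped loss introduced on the next slide). The mean-matching penalty is what rules out the trivial solution $(\pm 1, 0)$.

```python
# Intercept objective: data-fit term plus mean-matching penalty.
import numpy as np

def cure_intercept_loss(alpha, beta, X):
    t = alpha + X @ beta
    fit = np.mean((t**2 - 1) ** 2)               # forces |alpha + beta' x_i| ~ 1
    xbar = X.mean(axis=0)
    penalty = 0.5 * (alpha + beta @ xbar) ** 2   # forces ~half on each side
    return fit + penalty

def cure_intercept_grad(alpha, beta, X):
    t = alpha + X @ beta
    g = 4 * t * (t**2 - 1) / len(t)              # d(fit)/d(t_i)
    xbar = X.mean(axis=0)
    m = alpha + beta @ xbar                      # mean projection in the penalty
    return g.sum() + m, X.T @ g + m * xbar       # (d/d alpha, d/d beta)
```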

16. Loss Function. Clip $(x^2 - 1)^2 / 4$ to improve
• concentration and robustness for statistics;
• growth condition and smoothness for optimization.
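The slide does not spell out the clipping, so the following is one possible $C^1$ choice, with an illustrative threshold `TAU`: keep the quartic near its valleys and extend linearly beyond, which caps the gradient (robustness, concentration) while the loss keeps growing (coercivity for optimization).

```python
import numpy as np

TAU = 2.0  # clipping threshold; an illustrative choice, not from the paper

def f_clipped(x):
    """Quartic (x^2-1)^2/4 on |x| <= TAU, linear tail beyond; C^1 at TAU."""
    x = np.abs(x)
    core = (x**2 - 1) ** 2 / 4
    slope = TAU * (TAU**2 - 1)                   # derivative of the quartic at TAU
    tail = (TAU**2 - 1) ** 2 / 4 + slope * (x - TAU)
    return np.where(x <= TAU, core, tail)
```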

17. Example: Fashion-MNIST. 70,000 fashion products, 10 categories (Xiao et al. 2017).
• T-shirts/tops
• Pullovers
Visualization by PCA.

18. Example: Fashion-MNIST. Goal: cluster 1000 T-shirts/tops and 1000 pullovers. Algorithm: gradient descent with random initialization from the unit sphere. Errors: CURE 5.2%, k-means 44.3%, spectral (vanilla) 41.9%, spectral (Gaussian kernel) 10.5%. CURE also works when the classes are imbalanced.

19. General CURE. Given $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, find $f : \mathcal{X} \to \mathcal{Y}$ in $\mathcal{F}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{f(x_i)} \approx \sum_{j=1}^K \pi_j \delta_{y_j}$. [Figure: scatter plot of a stretched mixture]

20. General CURE. Given $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, find $f : \mathcal{X} \to \mathcal{Y}$ in $\mathcal{F}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{f(x_i)} \approx \sum_{j=1}^K \pi_j \delta_{y_j}$. CURE: $\min_{f \in \mathcal{F}} D(f_\# \hat{\rho}_n, \nu)$.
• Discrepancy measure $D$: divergence, MMD, $W_p$;
• Fashion-MNIST (10 classes), CNN + $W_1$: state-of-the-art;
• Bridle et al. (1992), Krause et al. (2010), Springenberg (2015), Xie et al. (2016), Yang et al. (2017), Hu et al. (2017), Shaham et al. (2018).
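To make $D(f_\# \hat{\rho}_n, \nu)$ concrete, here is a sketch of one option from the slide's list, the Wasserstein-1 discrepancy, in the simple case of a one-dimensional output and a discrete target $\nu = \sum_j \pi_j \delta_{y_j}$; the helper name and arguments are illustrative, not from the talk.

```python
# W_1 between the empirical law of f(x_1), ..., f(x_n) and a discrete target,
# computed in 1-D by matching sorted samples to target quantiles.
import numpy as np

def w1_to_discrete_target(values, target_atoms, target_weights):
    n = len(values)
    grid = (np.arange(n) + 0.5) / n                 # uniform quantile grid
    cdf = np.cumsum(target_weights)
    target_q = np.asarray(target_atoms)[np.searchsorted(cdf, grid)]
    return np.mean(np.abs(np.sort(values) - target_q))

# Example with a linear f and two balanced clusters at -1 and +1:
# w1 = w1_to_discrete_target(X @ beta, [-1.0, 1.0], [0.5, 0.5])
```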

21. Clustering Algorithms
• Generative: model $(X, Y)$, then derive $Y \mid X$. ‣ Distribution learning (EM, DBSCAN). ‣ Analogue: linear discriminant analysis.
• Discriminative: model $Y \mid X$ directly; CURE belongs to this class. ‣ Criterion optimization (projection pursuit, transductive SVM). ‣ Analogue: logistic regression.

22. Clustering Algorithms. Drawbacks of generative approaches:
• Model dependency
• Unnecessary parameters
• Computational challenges
• Strong conditions

23. Clustering Algorithms. Example: $\{x_i\}_{i=1}^n \sim \frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$ with $d \lesssim n$.
• Parameter estimation requires $\|\mu\|_2 \gtrsim \sqrt{d/n}$;
• Clustering only requires $\|\mu\|_2 \gtrsim (d/n)^{1/4}$.
Never ask for more than you need!

24. Clustering via Uncoupled REgression (CURE)
• The CURE methodology
• Theoretical guarantees

25. Elliptical Mixture Model. Main assumptions: $x_i \sim (\mu_1, \Sigma)$ if $y_i = 1$; $x_i \sim (\mu_{-1}, \Sigma)$ if $y_i = -1$.
• $x_i = \mu_{y_i} + \Sigma^{1/2} z_i$, with $P(y_i = 1) = P(y_i = -1) = 1/2$;
• $z_i$ spherically symmetric, leptokurtic, sub-Gaussian.
CURE: $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \big\{\frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2\big\}$. [Figure: scatter plot of an elliptical mixture]

26. Theoretical Guarantees. Theorem (WYD'20). Suppose $n/d$ is large. The perturbed gradient descent algorithm (Jin et al. 2017) starting from 0 achieves statistical precision within $\widetilde{O}\big(\frac{n}{d} \vee \frac{d^2}{n}\big)$ iterations (hiding polylog factors).

27. Theoretical Guarantees. Theorem (WYD'20). Suppose $n/d$ is large. The perturbed gradient descent algorithm (Jin et al. 2017) starting from 0 achieves statistical precision within $\widetilde{O}\big(\frac{n}{d} \vee \frac{d^2}{n}\big)$ iterations (hiding polylog factors).
• Efficient clustering for stretched mixtures without warm start;
• Two terms: prices for accuracy (stat.) and smoothness (opt.);
• Angular error: $\widetilde{O}(\sqrt{d/n})$; excess risk: $\widetilde{O}(d/n)$.
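A hedged sketch of the perturbation idea behind this theorem, in a simplified form (the full Jin et al. 2017 algorithm has careful parameter scheduling not reproduced here): take plain gradient steps, but inject ball noise whenever the gradient is small, so strict saddles, such as the local maximum at 0 the iterate starts from, are escaped with high probability. Step size, noise radius, and trigger threshold are illustrative.

```python
import numpy as np

def perturbed_gd(grad, x0, eta=0.01, radius=0.1, gtol=1e-3,
                 iters=2000, rng=np.random.default_rng(0)):
    x = x0.copy()
    for _ in range(iters):
        if np.linalg.norm(grad(x)) <= gtol:
            # Small gradient: near a minimum or a strict saddle. A random
            # perturbation, uniform in a small ball, lets GD escape saddles.
            noise = rng.standard_normal(x.shape)
            noise *= radius * rng.random() ** (1 / len(x)) / np.linalg.norm(noise)
            x = x + noise
        x = x - eta * grad(x)
    return x

# e.g., reusing cure_grad from the earlier sketch, starting from 0:
# beta_hat = perturbed_gd(lambda b: cure_grad(b, X), np.zeros(d))
```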

28. Proof Sketch: Population. Consider the centered case $x_i \sim (\pm\mu, \Sigma)$: $\min_{\beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$. Theorem (population landscape). Let $f(x) = (x^2 - 1)^2 / 4$. For the infinite-sample loss:
• Two minima $\pm\beta^*$, where $\beta^* \propto \Sigma^{-1}\mu$, locally strongly convex;
• Local maximum at 0; all saddles are strict.
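A numerical sanity check of the $\beta^* \propto \Sigma^{-1}\mu$ claim (an assumption-laden illustration, not from the talk): on a stretched Gaussian mixture, the direction found by gradient descent on the quartic loss should align with $\Sigma^{-1}\mu$.

```python
# Simulate a stretched mixture and check that the fitted direction aligns
# with Sigma^{-1} mu, as the population-landscape theorem predicts.
import numpy as np

rng = np.random.default_rng(1)
n, d = 20000, 5
mu = np.array([2.0, 0, 0, 0, 0])
Sigma = np.diag([1.0, 4.0, 4.0, 4.0, 4.0])       # "stretched" covariance
L = np.linalg.cholesky(Sigma)
y = rng.choice([-1, 1], size=n)
X = y[:, None] * mu + rng.standard_normal((n, d)) @ L.T

beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
for _ in range(3000):
    t = X @ beta
    beta -= 0.005 * X.T @ (t * (t**2 - 1)) / n   # gradient of f = (x^2-1)^2/4

direction = np.linalg.solve(Sigma, mu)            # Sigma^{-1} mu
cosine = abs(beta @ direction) / (np.linalg.norm(beta) * np.linalg.norm(direction))
print(f"|cos(beta_hat, Sigma^-1 mu)| = {cosine:.3f}")   # expect close to 1
```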

29. Loss Function. Clip $(x^2 - 1)^2 / 4$ to improve
• concentration and robustness for statistics;
• growth condition and smoothness for optimization.

30. Proof Sketch: Finite Samples. Theorem (empirical landscape). Suppose $n/d$ is large and let $\hat{L}(\beta) = \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$. W.h.p.,
• approximate second-order stationary points are good;
• $\nabla\hat{L}$ is $\widetilde{O}(1)$-Lipschitz and $\nabla^2\hat{L}$ is $\widetilde{O}(1 \vee d/\sqrt{n})$-Lipschitz.
A nice landscape ensures efficiency and accuracy of optimization.

31. Proof Sketch: Finite Samples. Theorem (empirical landscape). Suppose $n/d$ is large and let $\hat{L}(\beta) = \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$. W.h.p.,
• approximate second-order stationary points are good: if $\|\nabla\hat{L}(\beta)\|_2 \le \delta$ and $\lambda_{\min}[\nabla^2\hat{L}(\beta)] \ge -\delta$, then $\|\beta - \beta^*\|_2 \lesssim \underbrace{\|\nabla\hat{L}(\beta)\|_2}_{\text{opt. err.}} + \underbrace{\sqrt{\tfrac{d}{n}\log\big(\tfrac{n}{d}\big)}}_{\text{stat. err.}}$;
• $\nabla\hat{L}$ is $\widetilde{O}(1)$-Lipschitz and $\nabla^2\hat{L}$ is $\widetilde{O}(1 \vee d/\sqrt{n})$-Lipschitz.
A nice landscape ensures efficiency and accuracy of optimization.

32. Summary. A general CURE for clustering problems.
Wang, Yan, and Díaz. Efficient clustering for stretched mixtures: landscape and optimality. Submitted.
‣ Clustering reduced to classification;
‣ flexible choices of transforms, out-of-sample extensions;
‣ statistical and computational guarantees under mixture models.
Extensions:
‣ high dimensions, significance testing, model selection;
‣ representation learning, semi-supervised version.

  33. Q & A

  34. Thank you!
