SD-DP: Sparse Dual of the Density Peaks Algorithm for Cluster - PowerPoint PPT Presentation

SD-DP: Sparse Dual of the Density Peaks Algorithm for Cluster Analysis of High-Dimensional Data
November 5, 2018
Dimitris Floros (1), Tiancheng Liu (2), Nikos Pitsianis (1, 2), Xiaobai Sun (2)
(1) Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki
(2) Department of Computer Science, Duke University
The first two authors contributed equally to this work
Floros Liu Pitsianis Sun (AUTh|Duke) SD-DP: Sparse Dual of Density Peaks November 5, 2018 1 / 29

Outline
1. Cluster analysis of high-dimensional data
2. The Density Peaks (DP) and other influential algorithms
3. SD-DP: Sparse Dual of the DP algorithm
4. Experimental evidence
- Benchmarks
- Exploratory results


Cluster analysis of high-dimensional data

Premise: intrinsic heterogeneous group/cluster structures in real-world data of research interest

Cluster analysis: uncover cluster structures in data, with noise and uncertainty, with quantified features, governed by certain differentiation criteria
- massive data of many attributes/features
- supervised vs. un-supervised

Fundamental to various research studies

Domain-specific analysis: Feature description
- Molecular dynamics trajectory patterns [1]: kinetic, spectral measurements
- Classification of astronomical events [2]: Gamma ray measurements
- Community detection in complex system [3, 4, 5]: link features
- Image segmentation/denoising [6, 7]: intensity, patch texture
- Content-based image retrieval [8]: semantic content descriptor
- Image object recognition [9, 10]: SIFT [11], HOG [12] descriptors
- Gene expression pattern analysis [13, 14, 15, 16, 17]: gene-expression matrix
- Thematic categorization of documents [18, 19]: word frequency vector
- Statistical semantic or sentiment analysis: GloVe [20] word vector
- Statistical categorization of musical genres [21]: musical surface features
- Consumer profiling/market segmentation [22]: purchase history

Figures: Abell 901/902 supercluster [23]; Uber & Taxi demand in NYC [24]; Co-authorship communities [25]; US city lights [26]

DP, other influential algorithms & SD-DP

Algorithms: K-MEANS [27] (1982), DBSCAN [28] (1996), OPTICS [29] (1999), MEAN SHIFT [30] (2002), GN [3] (2002), COMBO [5] (2014), DP [31] (2014), SD-DP [32] (2018)

Desirable properties (1), with the number of checkmarks that survive in each row of the comparison table (out of 8 algorithms):
- No prescription of # clusters: 7
- No restriction in cluster shape: 7
- Free choice of metrics: 6
- Agnostic to distribution: 5
- Easy or no tuning: 4
- Robust in high-dim. space: 1
- Accurate in high-dim. space: 1
- Low computation cost: 1

The per-algorithm positions of the checkmarks are not recoverable from this transcript.

Checkmarks are based on limited benchmarking experiments
(1) Additional properties include low program complexity, stability and more

DP vs SD-DP: classification accuracy

60,000 images of handwritten digits (MNIST dataset) [33]

[Figure: two 10 × 10 confusion matrices, Estimated Clusters vs. True Classes (digits 0 to 9), annotated with per-class precision, recall, error, and total accuracy]

SD-DP: HOG descriptors (D = 144); Euclidean distance; unsupervised cluster revision; total accuracy 96.0%
DP (2018) [34]: intensity feature vector (D = 28 × 28 = 784); tangent distance; manual intervention in peak selection and cluster merge; total accuracy 89.7%

DP vs SD-DP: classification accuracy

Comparison in Dice similarity coefficients (DSC), a.k.a. F1 scores and Sørensen-Dice coefficients
60,000 images of handwritten digits (MNIST dataset)

Digit   DP (2018), semi-supervised   SD-DP, un-supervised
0       0.99                         0.98
1       0.83                         0.98
2       0.77                         0.95
3       0.94                         0.95
4       0.87                         0.96
5       0.95                         0.97
6       0.98                         0.98
7       0.88                         0.96
8       0.95                         0.94
9       0.84                         0.93

With T the set of true members of a class and P the set of predicted members:

\[ \mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2\,|T \cap P|}{|T| + |P|} \]

Figures: all misclassified digit-0 images by SD-DP; subset of misclassified digit-2 images by SD-DP
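The two forms of the DSC on this slide can be checked against each other with a short sketch. The function names and the toy confusion matrix below are illustrative assumptions, not part of the presentation; the matrix layout (rows are estimated clusters, columns are true classes) follows the axis labels of the confusion-matrix slide.

```python
def dice_from_counts(tp, fp, fn):
    """DSC = 2 TP / (2 TP + FP + FN); returns 0.0 for an empty class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def dice_from_sets(true_set, pred_set):
    """DSC = 2 |T ∩ P| / (|T| + |P|) for sets of sample identifiers."""
    total = len(true_set) + len(pred_set)
    return 2 * len(true_set & pred_set) / total if total else 0.0


def dice_per_class(confusion):
    """Per-class DSC from a square confusion matrix.

    Assumed layout: confusion[p][t] counts samples of true class t
    that were assigned to estimated cluster p.
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[k]) - tp                        # rest of row k
        fn = sum(confusion[p][k] for p in range(n)) - tp   # rest of column k
        scores.append(dice_from_counts(tp, fp, fn))
    return scores


if __name__ == "__main__":
    toy = [[8, 1],
           [2, 9]]  # hypothetical 2-class example, not MNIST data
    print(dice_per_class(toy))
```

Since |T| = TP + FN and |P| = TP + FP, the set form and the count form agree term by term, which is why the per-class scores can be read directly off a confusion matrix.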

The Density Peaks principle [Rodriguez and Laio, Science, 2014]

Principle: “Cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities”.

Local density description: population in neighborhood of specified radius r

\[ \rho_i = \begin{cases} |N_r(x_i)|, & \text{hard cutoff} \\ \sum_j \exp\left(-d_{ij}^2 / r^2\right), & \text{soft cutoff} \end{cases} \]

Figure captions:
- Probability distribution from which point distributions are drawn. The regions with lowest intensity correspond to a background uniform probability of 20%.
- Point distribution for samples of 4000 points. Points are colored according to the cluster to which they are assigned. Black points belong to the cluster halos.
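A minimal sketch of the two quantities in the DP principle, assuming a small dense distance matrix: the local density under the hard and soft cutoffs, and the distance from each point to its nearest point of higher density. All function names are hypothetical. Excluding the point itself from its own neighborhood and giving the densest point(s) their maximum distance are assumptions of this sketch, and it illustrates the original DP quantities only, not the sparse SD-DP formulation.

```python
import math


def pairwise_distances(points):
    """Dense Euclidean distance matrix (fine for small examples only)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]


def local_density(dist, r, soft=False):
    """rho_i: neighbor count within radius r (hard cutoff) or
    sum_j exp(-d_ij^2 / r^2) (soft cutoff); j == i is skipped."""
    n = len(dist)
    rho = []
    for i in range(n):
        if soft:
            rho.append(sum(math.exp(-(dist[i][j] / r) ** 2)
                           for j in range(n) if j != i))
        else:
            rho.append(sum(1 for j in range(n)
                           if j != i and dist[i][j] < r))
    return rho


def distance_to_denser(dist, rho):
    """delta_i: distance to the nearest point with strictly higher density.
    Points with no denser point get their largest distance instead."""
    n = len(dist)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return delta


if __name__ == "__main__":
    # Two small, well separated groups of points (toy data).
    pts = [(0, 0), (0, 1), (1, 0), (0.5, 0.5),
           (10, 10), (10, 11), (11, 10)]
    d = pairwise_distances(pts)
    rho = local_density(d, r=1.5, soft=True)
    delta = distance_to_denser(d, rho)
    # Candidate centers: points where both rho and delta are large.
    ranked = sorted(range(len(pts)), key=lambda i: rho[i] * delta[i],
                    reverse=True)
    print(ranked[:2])
```

Ranking by the product of density and distance is one common heuristic for picking candidate centers. The hard cutoff produces ties on small data sets such as this one, which is one reason the soft cutoff is used in the usage example.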
