  1. Asynchronous K-Means Clustering of Multiple Data Sets Marek Fiser, Illia Ziamtsov, Ariful Azad, Bedrich Benes, Alex Pothen

  2. Motivation
     - Clustering is the bottleneck in flow cytometry research
     - 3,000 data sets; 25,000 points in 7D per data set
     - 19 separate clustering tasks per data set
     - Parallel CPU time: 295 minutes
     - Other GPU implementations: 96 minutes (3x speedup)

  3. K-means clustering
     1. Initialize cluster centers (randomly)
     2. Assign each data point to the nearest cluster center (easy to parallelize; see the kernel sketch below)
     3. Recompute the cluster centers (harder to parallelize)
     4. If any cluster assignment changed, go to 2.
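
     A minimal CUDA sketch of step 2, the "easy to parallelize" part: one thread per point, each scanning all k centers. The kernel and variable names (assignPoints, d_points, …) are illustrative, not from the talk.

       #include <cfloat>  // FLT_MAX

       __global__ void assignPoints(const float *d_points,   // n x d, row-major
                                    const float *d_centers,  // k x d, row-major
                                    int *d_labels, int n, int d, int k) {
           int p = blockIdx.x * blockDim.x + threadIdx.x;
           if (p >= n) return;
           float bestDist = FLT_MAX;
           int bestC = 0;
           for (int c = 0; c < k; ++c) {        // try every center
               float dist = 0.0f;
               for (int j = 0; j < d; ++j) {    // squared Euclidean distance
                   float diff = d_points[p * d + j] - d_centers[c * d + j];
                   dist += diff * diff;
               }
               if (dist < bestDist) { bestDist = dist; bestC = c; }
           }
           d_labels[p] = bestC;
       }

     In this conventional multi-kernel formulation, the host must wait for the assignment and center-update kernels to finish before it can test convergence in step 4 — exactly the per-iteration synchronization the rest of the talk eliminates.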

  4. Problem definition
     - Multiple data sets (> 100)
     - Small data set size (2,000 – 200,000 points)
     - Low number of clusters (2 – 30)
     - Low number of dimensions (1 – 50)
     - All data sets are processed serially
     - Synchronization overhead is high for small data sets
     - Synchronization has to be performed in every iteration of the k-means algorithm

  5. K-means clustering requires synchronization
     1. Initialize cluster centers (randomly)
     2. Assign each data point to the nearest cluster center
     3. Recompute the cluster centers (synchronization required here)
     4. If any cluster assignment changed, go to 2.

  6. The problem – graphs
     [Two plots: speedup of GPUMiner (GPU) over MineBench (CPU). Left: speedup (0–20) vs. data set size (2^10 – 2^16); right: speedup (0–40) vs. number of clusters k (2 – 128). Both plots mark an "area of poor performance" at the small end of the axis.]

  7. Our approach
     - Avoid kernel-wise CPU–GPU synchronization
     - Use only one CUDA block per clustering task
       - A single CUDA block can be synchronized on the GPU with __syncthreads() (a sketch follows below)
     - Use CUDA streams to run as many blocks as possible
     - Thanks to CUDA streams, the clustering is fully asynchronous
       - While the GPU is busy clustering, the CPU is loading more data sets
       - There is nearly no overhead from the CPU's I/O operations
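
     A minimal sketch of such a single-block kernel, assuming a straightforward structure (the slides do not show the authors' actual kernel): the whole iteration loop lives inside one launch, and __syncthreads() provides the barriers between steps, so no CPU round-trip is needed.

       #include <cfloat>  // FLT_MAX

       // One call handles one whole (small) data set with a single block.
       __global__ void kmeansOneBlock(const float *points,  // n x d, row-major
                                      float *centers,       // k x d, pre-initialized
                                      int *labels,          // n, pre-initialized to -1
                                      int n, int d, int k, int maxIters) {
           __shared__ int changed;  // did any point switch clusters this iteration?
           for (int iter = 0; iter < maxIters; ++iter) {
               if (threadIdx.x == 0) changed = 0;
               __syncthreads();
               // Step 2: threads stride over the points, assigning each to the
               // nearest center (squared Euclidean distance).
               for (int p = threadIdx.x; p < n; p += blockDim.x) {
                   float bestDist = FLT_MAX;
                   int bestC = 0;
                   for (int c = 0; c < k; ++c) {
                       float dist = 0.0f;
                       for (int j = 0; j < d; ++j) {
                           float diff = points[p * d + j] - centers[c * d + j];
                           dist += diff * diff;
                       }
                       if (dist < bestDist) { bestDist = dist; bestC = c; }
                   }
                   if (labels[p] != bestC) { labels[p] = bestC; changed = 1; }
               }
               __syncthreads();  // all assignments final before centers move
               // Step 3: one thread per center recomputes its mean (k is 2-30
               // here, so this naive O(n*k) scan is acceptable for a sketch).
               for (int c = threadIdx.x; c < k; c += blockDim.x) {
                   int count = 0;
                   for (int j = 0; j < d; ++j) centers[c * d + j] = 0.0f;
                   for (int p = 0; p < n; ++p) {
                       if (labels[p] != c) continue;
                       ++count;
                       for (int j = 0; j < d; ++j) centers[c * d + j] += points[p * d + j];
                   }
                   if (count > 0)
                       for (int j = 0; j < d; ++j) centers[c * d + j] /= count;
               }
               __syncthreads();  // centers final before the convergence test
               if (changed == 0) break;  // uniform across the block after the barrier
           }
       }

     Launching with <<<1, threads, 0, stream>>> keeps each data set on one block; many such single-block kernels from different data sets then run concurrently on different streams, which is how the GPU stays busy despite each kernel using only one block.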

  8. Our approach – Timeline
     [Timeline diagram; horizontal axis: time]

  9. Our approach – Real timeline

  10. Implementation – Core

      for each input data set i do {
          D ← LoadData(i);                              // loads data from HDD or other source
          s ← GetAvailableCudaStream();                 // blocking operation
          EnsureEnoughPinnedMemory(D, s);               // every stream has associated pinned memory
          CopyDataToPinnedMemory(D, s);
          ScheduleMemCopyFromHostToDeviceOnStream(s);   // asynchronous
          ScheduleCudaKernelInvocationOnStream(s);      // (non-blocking)
          ScheduleMemCopyFromDeviceToHostOnStream(s);
      }
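
      The same loop with real CUDA runtime calls sketched in (cudaMemcpyAsync, a stream-bound launch of the kmeansOneBlock sketch from slide 7). DataSet, loadDataSet(), the per-stream pin[]/dev[]/stream[] arrays, and the launch parameters are hypothetical bookkeeping; error checking is omitted.

        for (int i = 0; i < numDataSets; ++i) {
            DataSet D = loadDataSet(i);                   // from HDD or other source
            int s = getAvailableCudaStream();             // blocking; see next slide
            ensureEnoughPinnedMemory(D.inBytes, D.outBytes, s);
            memcpy(pin[s].data, D.points, D.inBytes);     // stage into pinned memory
            cudaMemcpyAsync(dev[s].points, pin[s].data, D.inBytes,
                            cudaMemcpyHostToDevice, stream[s]);
            // One block per data set; queued on the stream, returns immediately.
            kmeansOneBlock<<<1, 256, 0, stream[s]>>>(dev[s].points, dev[s].centers,
                                                     dev[s].labels, D.n, D.d, D.k,
                                                     500 /* max iterations */);
            cudaMemcpyAsync(pin[s].data, dev[s].labels, D.outBytes,
                            cudaMemcpyDeviceToHost, stream[s]);
            // All three operations run in order on stream[s] while the CPU is
            // already free to load data set i + 1.
        }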

  11. Implementation – GetAvailableCudaStream

      freeStream ← null;
      while (freeStream == null) {
          for each stream s_i do {
              if (IsStreamFinished(s_i)) {
                  D ← DownloadResultsFromPinnedMemory(s_i);
                  SaveResults(D);
                  freeStream ← s_i;
              }
          }
      }
      return freeStream;
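
      One plausible realization of IsStreamFinished / GetAvailableCudaStream using the real cudaStreamQuery() call, which returns cudaSuccess once everything queued on a stream has completed. hasPendingResults[], pin[], and saveResults() are assumed bookkeeping (the caller sets hasPendingResults[s] = true after queueing work).

        int getAvailableCudaStream() {
            for (;;) {                                     // poll until a stream frees up
                for (int s = 0; s < NUM_STREAMS; ++s) {
                    if (cudaStreamQuery(stream[s]) != cudaSuccess)
                        continue;                          // stream still working
                    if (hasPendingResults[s]) {            // finished, results not yet saved
                        saveResults(pin[s].data, s);       // D <- pinned memory; save to disk
                        hasPendingResults[s] = false;
                    }
                    return s;                              // stream is free for reuse
                }
            }
        }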

  12. Non-paged (pinned) memory
      - Required for asynchronous memory copies on CUDA streams
      - Uses direct memory access (DMA) for memory copies
      - Used for both input and output
      - Allocated large enough for either: size = max(input size, output size)
      - Pooled per stream
      - Memory is re-used for consecutive data sets, or re-allocated if needed
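
      A sketch of this pooling logic, assuming a per-stream pin[] record; cudaMallocHost() and cudaFreeHost() are the actual CUDA runtime calls for page-locked host memory.

        void ensureEnoughPinnedMemory(size_t inBytes, size_t outBytes, int s) {
            size_t needed = inBytes > outBytes ? inBytes : outBytes;  // max(in, out)
            if (pin[s].capacity >= needed) return;        // reuse buffer from last data set
            if (pin[s].data) cudaFreeHost(pin[s].data);   // too small: grow it
            cudaMallocHost(&pin[s].data, needed);         // page-locked allocation
            pin[s].capacity = needed;
        }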

  13. Flow Cytometry Data
      - 2,872 individual data sets
      - 25,000 points per data set, 7 dimensions
      - 19 separate clusterings for k = {2, …, 20}
      - Total: 2,872 · 19 = 54,568 individual clustering tasks
      - CPU: Intel Core i7 2600K @ 3.40 GHz
      - GPU: Tesla K40

  14. Results on the Flow Cytometry Data
      MineBench – Northwestern; STAMP – Stanford; GPUMiner – Hong Kong University of Science and Technology

  15. Speedup as a function of data set count (d = 5, n = 20,000)

  16. Strengths
      - High performance on multiple data sets
      - Low memory requirements
      - Can process an unlimited number of small data sets
      - Data sets can have different sizes
      - Asynchronous – hides I/O overhead
      - The kernel uses only one CUDA block, which simplifies programming and enables in-kernel synchronization

  17. Limitations
      - The kernel can use only one CUDA block
      - ~30 data sets have to fit in GPU memory at once, which limits the number of points and their dimensionality
      - Multiple data sets are needed to realize the benefit

  18. Conclusion
      - High speedup due to eliminating synchronization overhead
      - Our technique can be applied to other problems that:
        - independently process multiple input data sets
        - have relatively small data sets
        - use an algorithm that may require synchronization

  19. Asynchronous K-Means Clustering of Multiple Data Sets
      Marek Fiser – mfiser@purdue.edu – http://www.marekfiser.com
      These slides can be viewed at: http://goo.gl/arSaoF
      Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!
