SLIDE 1

Fast, Scalable Parallel Comparison Sort on Hybrid Multicore Architectures

Dip Sankar Banerjee, Parikshit Sakurikar and Kishore Kothapalli

Center for Security, Theory, and Algorithmic Research
International Institute of Information Technology, Hyderabad, INDIA

20 May 2013, AsHES 2013, CSTAR, IIIT Hyderabad

SLIDE 2

GPGPU

  • General purpose computation on GPUs (GPGPU) is very common and widely practiced.
  • Provides the lowest cost-to-FLOPS ratio.
  • A many-core device which consists of:
    • Symmetric multiprocessors.
    • Low-power cores in each SM.
    • SIMD programmability.
    • Shared and global memory.
  • Heavily used in general purpose computation, with remarkable results in widely used primitives.

SLIDE 3

Accelerators in Computing

  • Typical usage model:
    • Transfer input from the CPU to the accelerator.
    • Transfer the program to the accelerator.
    • Execute on the accelerator.
    • Transfer results back to the CPU.
  • The above model is necessitated partly because the accelerators do not have I/O capability.
  • They are truly auxiliary devices.

SLIDE 4

Accelerators in Computing

  • The big issue: utilizing multiple multicore devices for computation.
  • CPU utilization for solving generic problems:
    • CPUs have high compute-power cores.
    • The computational power of CPUs is also on the rise.
  • Hybrid multicore computing:
    • Uses all resources available (in a single platform).
    • Provides a higher level of parallelism and efficiency.

SLIDE 5

Outline

  • General Hybrid Computing Platform
  • Problem Statement
  • Our Solution
  • Implementation Details
  • Results
  • Conclusion

SLIDE 6

Hybrid Multicore Platforms

  • The target of our research is to validate the implementation of algorithms on both high-end systems as well as commodity low-end systems.
  • A high-end system will have a high-throughput GPU connected to a multi-core CPU:
    • An Intel i7 980 coupled with an NVidia GTX 580 GPU.
  • A low-end system is typically found in commodity systems such as laptops and desktops:
    • An Intel Core 2 Duo E7400 CPU coupled with an NVidia GT520 GPU.

SLIDE 7

Our Results

  • In this work we implemented comparison sort on a hybrid multicore platform.
  • We used hybrid sample sorting on the platform with different data sets.
  • Our sorting implementation is 20% better than the current best known parallel comparison sort, due to Davidson et al. at InPar 2012.
  • Our results are on average 40% better than the GPU Sample Sort algorithm published at IPDPS 2010.

SLIDE 8

Problem Definition

  • Sorting is a fundamental algorithm which finds massive application in scientific computations, databases, searching, ranking, etc.
  • The problem is to arrange a given set of inputs in a particular order:

    Index:   1  2  3  4  5  6  7
    Input:   5  6 10  2  7 11  9
    Output:  2  5  6  7  9 10 11

  • Sorting is an irregular operation and is not entirely suited to GPUs or parallel architectures.

SLIDE 9

Parallel Sorting

  • Effective use of all available processors by creating independent sub-problems.
  • Quicksort is a popular sorting technique where sub-problems are created and solved in a recursive fashion.
  • Sample sort is a generalization of quicksort that chooses many pivots and hence creates a higher number of sub-problems.
  • Each of these sub-problems can be efficiently allocated to either a CPU or a GPU for sorting.
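The structure described above can be sketched in plain Python. This is a sequential toy, not the paper's hybrid implementation; the pivot count and recursion threshold are arbitrary choices for illustration:

```python
import random

def sample_sort(a, num_pivots=3, threshold=8):
    """Minimal sequential sample sort: pick several pivots, partition the
    input into independent buckets, and recurse on each bucket."""
    if len(a) <= threshold:
        return sorted(a)              # small sub-problem: plain sort
    # Choose distinct pivots from a random sample and sort them.
    pivots = sorted(set(random.sample(a, min(num_pivots, len(a)))))
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        # Find the first pivot >= x by linear scan; the paper bins with a
        # BST of splitters, which does the same in O(log p) comparisons.
        i = 0
        while i < len(pivots) and x > pivots[i]:
            i += 1
        buckets[i].append(x)
    # Guard against degenerate pivot choices (e.g. all-equal input).
    if max(len(b) for b in buckets) == len(a):
        return sorted(a)
    # Each bucket is independent, so a hybrid scheduler could hand any of
    # them to the CPU or the GPU; here we simply recurse sequentially.
    out = []
    for b in buckets:
        out.extend(sample_sort(b, num_pivots, threshold))
    return out
```

The independence of the buckets is the point: unlike quicksort's two sub-problems, many buckets give the scheduler enough work items to keep both devices busy.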

SLIDE 10

Algorithm Overview

  • Phase I
    • Create sqrt(n) bins, where n is the number of input elements.
    • Efficiently bin the elements using a BST.
  • Phase II
    • Compute histograms of the bins allocated to each CPU and GPU.
  • Phase III
    • Scatter elements across all SMs on the GPU and cores on the CPU in an asynchronous manner.
  • Phase IV
    • Recurse Phases I-III until bin sizes are reduced to a certain threshold.
    • Sort the small bins.

SLIDE 11

Algorithm Overview

[Figure: pipeline of the hybrid algorithm: Phase 1 (Binning), Phase 2 (Histogram), Phase 3 (Scatter), Phase 4 (Recursion)]

SLIDE 12

PHASE I : BINNING

  • Select splitters at uniform intervals of sqrt(n).
  • Form a Binary Search Tree (BST) using the splitters.
  • Set a threshold for the separation of the labels between the CPU and the GPU.
  • Transfer the GPU labels to the device.
  • Use the BST on both the CPU and the GPU to bin the elements.
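A sequential sketch of this binning step. Two assumptions not fixed by the slide: splitters are taken at every sqrt(n)-th position of the input, and the splitter BST is replaced by an equivalent binary search over the sorted splitter array:

```python
import math
from bisect import bisect_left

def assign_bins(a):
    """Phase I sketch: pick ~sqrt(n) splitters at uniform intervals of the
    input, then label every element with its bin via binary search over
    the sorted splitters (the array equivalent of walking a splitter BST)."""
    n = len(a)
    step = max(1, math.isqrt(n))
    splitters = sorted(a[::step])     # splitters at uniform sqrt(n) intervals
    # bisect_left returns the index of the first splitter >= x, i.e. the
    # bin label, in O(log(number of splitters)) comparisons per element.
    labels = [bisect_left(splitters, x) for x in a]
    return splitters, labels
```

On the real platform, the label computation is split between host and device by the threshold mentioned above; here every element is labelled on one core.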

SLIDE 13

PHASE II : Histograms

  • Compute histograms in an overlapped fashion on both the CPU and the GPU.
  • Store the histogram Hc of the CPU for LEN/BLOCK size of elements.
  • Store the histogram Hg of the GPU for LEN/BLOCK size of elements.
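The per-block histogram idea can be sketched as follows. It is sequential, the CPU/GPU split into Hc and Hg is omitted, and `block` stands in for the BLOCK size:

```python
def block_histograms(labels, num_bins, block):
    """Phase II sketch: build one histogram per BLOCK-sized chunk of the
    bin-label array. On the hybrid platform the chunks assigned to the CPU
    produce Hc and those assigned to the GPU produce Hg, computed in an
    overlapped fashion; here all chunks are processed sequentially."""
    hists = []
    for start in range(0, len(labels), block):
        h = [0] * num_bins
        for lab in labels[start:start + block]:
            h[lab] += 1                 # count elements of each bin in this block
        hists.append(h)
    return hists
```

Keeping one histogram per block, rather than one shared global histogram, is what avoids cross-block atomics: each chunk counts privately, and the counts are combined later by a scan.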

SLIDE 14

PHASE III : Scatter

  • Perform a scan on the GPU and CPU histograms to compute the block-wise offsets.
  • Scatter elements in a hybrid fashion to all bins:
    • GPU: perform local scattering in each BLOCK.
    • CPU: perform global scattering across the single BLOCK.
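A sequential sketch of the scan-and-scatter step. It assumes a bin-major exclusive scan over the per-block histograms of Phase II; the hybrid local/global split between the two devices is omitted:

```python
def scan_and_scatter(a, labels, hists, num_bins, block):
    """Phase III sketch: an exclusive prefix scan over the per-block
    histograms yields, for every (bin, block) pair, the offset at which
    that block writes its elements of that bin; the scatter then places
    every element directly into its bin's region of the output."""
    num_blocks = len(hists)
    # Scan in bin-major order: all of bin 0's per-block counts, then bin 1's...
    offsets = [[0] * num_bins for _ in range(num_blocks)]
    total = 0
    for b in range(num_bins):
        for blk in range(num_blocks):
            offsets[blk][b] = total
            total += hists[blk][b]
    out = [None] * len(a)
    cursor = [row[:] for row in offsets]   # running write position per (block, bin)
    for i, x in enumerate(a):
        blk = i // block
        out[cursor[blk][labels[i]]] = x
        cursor[blk][labels[i]] += 1
    return out
```

After this pass the output holds the bins contiguously, with each block's contribution to a bin landing at its pre-computed offset, so no two writers ever contend for a slot.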

SLIDE 15

PHASE IV : Recurse and Sort

  • Recurse Phases I to III until the size of each block comes down to a size where we can do a normal quicksort on each thread.
  • Separate the bins among the CPU and GPU and apply the sorting on each of the bins until a final sorted sequence is obtained.

SLIDE 16

Memory Access Optimization

SLIDE 17

Memory Access Optimization

  • Available memory is a vital resource.
  • Reuse of data structures is vital for synchronization and consolidation.
  • We reuse our histogram store in the scattering step, where we do not write all the entries for all the labels together.
  • Instead of writing all the entries in one space, we write them in the order in which they will be read back.
  • This facilitates higher coalescing of reads as well as the reuse of a data structure.
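The read-order idea can be illustrated with the histogram store. The bin-major layout below is our assumption about what "the order in which it will be read back" means here, since the Phase III scan walks each bin across all blocks:

```python
def store_in_read_order(hists):
    """Layout sketch: the next step reads counts bin by bin across blocks,
    so instead of storing each block's histogram contiguously (block-major)
    we write the counts in bin-major order, i.e. the order in which they
    will be read back. Consecutive reads then touch consecutive addresses,
    which coalesces well on a GPU."""
    num_blocks, num_bins = len(hists), len(hists[0])
    flat = [0] * (num_blocks * num_bins)
    for blk in range(num_blocks):
        for b in range(num_bins):
            flat[b * num_blocks + blk] = hists[blk][b]   # bin-major slot
    return flat
```

The same buffer then serves both the histogram and scan steps, which is the data-structure reuse the slide refers to.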

SLIDE 18

Results on Key‐Value Pairs

[Chart: sorting rate in million pairs/sec (80 to 180) vs. number of elements (2^15 to 2^24) for the Hybrid, Merge Sort, Sample Sort, and Satish key-value sorts]
SLIDE 19

Results on 32-bit Integers

[Chart: sorting rate in MKeys/sec (100 to 300) vs. number of elements (2^15 to 2^24) for the Hybrid, Merge Sort, Sample Sort, and Satish key-only sorts]

SLIDE 20

Results of Key‐Value Pairs on Low‐End Platform

[Chart: sorting rate in million pairs/sec (30 to 55) vs. number of elements (5 to 30 million) for the Hybrid, Merge Sort, Sample Sort, and Satish key-value sorts]

SLIDE 21

Variation of Threshold

[Chart: threshold in percent (5 to 30) vs. number of elements (2^15 to 2^20) for the high-end and low-end platforms]

SLIDE 22

Results

  • Our key-value pair sorting is on average 20% better than the current best known result.
  • Our 32-bit sorting results are on average 23% better than the current best known result.
  • The performance benefit can be attributed to:
    • Hybrid histogram computation, which reduces atomic and irregularity overheads.
    • Overlapped scattering, which reduces memory access latencies.

SLIDE 23

Conclusions

  • Our implementation clearly shows the benefits of a heterogeneous platform.
  • Hybrid algorithms show promising results on both high-end as well as commodity-level processors.
  • Our algorithm can very easily be extended to sort variable-length keys such as strings.
  • It will also be of interest to experiment and optimize on other data sets such as deterministic duplicates, staggered keys, and bucket-sorted keys.

SLIDE 24

Hybrid Comparison Sort

THANK YOU

Questions?