SLIDE 1

Fast, Scalable Parallel Comparison Sort on Hybrid Multicore Architectures

Dip Sankar Banerjee, Parikshit Sakurikar and Kishore Kothapalli

Center for Security, Theory, and Algorithmic Research
International Institute of Information Technology, Hyderabad, INDIA

20 May 2013, AsHES 2013, CSTAR, IIIT Hyderabad

SLIDE 2

GPGPU

  • General purpose computation on GPUs (GPGPU) is very common and widely practiced.
  • Provides the lowest cost-to-FLOPS ratio.
  • A many-core device which consists of:
    • Symmetric multiprocessors.
    • Low-power cores in each SM.
    • SIMD programmability.
    • Shared and global memory.
  • Heavily used in general purpose computation, with remarkable results in widely used primitives.

SLIDE 3

Accelerators in Computing

  • Typical usage model:
    • Transfer input from the CPU to the accelerator.
    • Transfer the program to the accelerator.
    • Execute on the accelerator.
    • Transfer results back to the CPU.
  • The above model is necessitated partly because the accelerators do not have I/O capability.
  • They are truly auxiliary devices.

SLIDE 4

Accelerators in Computing

  • The big issue: utilizing multiple multicore devices for computation.
  • CPU utilization for solving generic problems:
    • CPUs have high compute-power cores.
    • The computational power of CPUs is also on the rise.
  • Hybrid multicore computing:
    • Uses all resources available (in a single platform).
    • Provides a higher level of parallelism and efficiency.

SLIDE 5

Outline

  • General Hybrid Computing Platform
  • Problem Statement
  • Our Solution
  • Implementation Details
  • Results
  • Conclusion

SLIDE 6

Hybrid Multicore Platforms

  • The target of our research is to validate the implementation of algorithms on both high-end systems as well as commodity low-end systems.
  • A high-end system will have a high-throughput GPU connected to a multi-core CPU:
    • An Intel i7 980 coupled with an NVidia GTX 580 GPU.
  • A low-end system is typically found in commodity systems such as laptops and desktops:
    • An Intel Core 2 Duo E7400 CPU coupled with an NVidia GT520 GPU.

SLIDE 7

Our Results

  • In this work we implemented comparison sort on a hybrid multicore platform.
  • We used hybrid sample sorting on the platform with different data sets.
  • Our sorting implementation is 20% better than the current best known parallel comparison sort, due to Davidson et al. at InPar 2012.
  • Our results are on average 40% better than the GPU Sample Sort algorithm published at IPDPS 2010.

SLIDE 8

Problem Definition

  • Sorting is a fundamental algorithm which finds massive application in scientific computations, databases, searching, ranking, etc.
  • The problem is to arrange a given set of inputs in a particular order:

    Index:   1  2  3  4  5  6  7
    Input:   5  6 10  2  7 11  9
    Output:  2  5  6  7  9 10 11

  • Sorting is an irregular operation and is not entirely suited to GPUs or parallel architectures.

SLIDE 9

Parallel Sorting

  • Effective use of all available processors by creating independent sub-problems.
  • Quicksort is a popular sorting technique where sub-problems are created and solved in a recursive fashion.
  • Sample sort is a generalization of quicksort that chooses many pivots and hence creates a higher number of sub-problems.
  • Each of these sub-problems can be efficiently allocated to either a CPU or a GPU for sorting.
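The structure described above can be sketched in plain Python. This is a sequential toy, not the paper's hybrid implementation; the pivot count and recursion threshold are arbitrary choices for illustration:

```python
import random

def sample_sort(a, num_pivots=3, threshold=8):
    """Minimal sequential sample sort: pick several pivots, partition the
    input into independent buckets, and recurse on each bucket."""
    if len(a) <= threshold:
        return sorted(a)              # small sub-problem: plain sort
    # Choose distinct pivots from a random sample and sort them.
    pivots = sorted(set(random.sample(a, min(num_pivots, len(a)))))
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        # Find the first pivot >= x by linear scan; the paper bins with a
        # BST of splitters, which does the same in O(log p) comparisons.
        i = 0
        while i < len(pivots) and x > pivots[i]:
            i += 1
        buckets[i].append(x)
    # Guard against degenerate pivot choices (e.g. all-equal input).
    if max(len(b) for b in buckets) == len(a):
        return sorted(a)
    # Each bucket is independent, so a hybrid scheduler could hand any of
    # them to the CPU or the GPU; here we simply recurse sequentially.
    out = []
    for b in buckets:
        out.extend(sample_sort(b, num_pivots, threshold))
    return out
```

The independence of the buckets is the point: unlike quicksort's two sub-problems, many buckets give the scheduler enough work items to keep both devices busy.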

SLIDE 10

Algorithm Overview

  • Phase I
    • Create sqrt(n) bins, where n is the number of input elements.
    • Efficiently bin the elements using a BST.
  • Phase II
    • Compute histograms of the bins allocated to each CPU and GPU.
  • Phase III
    • Scatter elements across all SMs on the GPU and cores on the CPU in an asynchronous manner.
  • Phase IV
    • Recurse Phases I-III until bin sizes are reduced to a certain threshold.
    • Sort the small bins.

SLIDE 11

Algorithm Overview

[Figure: pipeline of the hybrid algorithm: Phase 1 (Binning), Phase 2 (Histogram), Phase 3 (Scatter), Phase 4 (Recursion)]

SLIDE 12

PHASE I : BINNING

  • Select splitters at uniform intervals of sqrt(n).
  • Form a Binary Search Tree (BST) using the splitters.
  • Set a threshold for the separation of the labels between the CPU and the GPU.
  • Transfer the GPU labels to the device.
  • Use the BST on both the CPU and the GPU to bin the elements.
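A sequential sketch of this binning step. Two assumptions not fixed by the slide: splitters are taken at every sqrt(n)-th position of the input, and the splitter BST is replaced by an equivalent binary search over the sorted splitter array:

```python
import math
from bisect import bisect_left

def assign_bins(a):
    """Phase I sketch: pick ~sqrt(n) splitters at uniform intervals of the
    input, then label every element with its bin via binary search over
    the sorted splitters (the array equivalent of walking a splitter BST)."""
    n = len(a)
    step = max(1, math.isqrt(n))
    splitters = sorted(a[::step])     # splitters at uniform sqrt(n) intervals
    # bisect_left returns the index of the first splitter >= x, i.e. the
    # bin label, in O(log(number of splitters)) comparisons per element.
    labels = [bisect_left(splitters, x) for x in a]
    return splitters, labels
```

On the real platform, the label computation is split between host and device by the threshold mentioned above; here every element is labelled on one core.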

SLIDE 13

PHASE II : Histograms

  • Compute histograms in an overlapped fashion on both the CPU and the GPU.
  • Store the histogram Hc of the CPU for LEN/BLOCK size of elements.
  • Store the histogram Hg of the GPU for LEN/BLOCK size of elements.
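The per-block histogram idea can be sketched as follows. It is sequential, the CPU/GPU split into Hc and Hg is omitted, and `block` stands in for the BLOCK size:

```python
def block_histograms(labels, num_bins, block):
    """Phase II sketch: build one histogram per BLOCK-sized chunk of the
    bin-label array. On the hybrid platform the chunks assigned to the CPU
    produce Hc and those assigned to the GPU produce Hg, computed in an
    overlapped fashion; here all chunks are processed sequentially."""
    hists = []
    for start in range(0, len(labels), block):
        h = [0] * num_bins
        for lab in labels[start:start + block]:
            h[lab] += 1                 # count elements of each bin in this block
        hists.append(h)
    return hists
```

Keeping one histogram per block, rather than one shared global histogram, is what avoids cross-block atomics: each chunk counts privately, and the counts are combined later by a scan.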

SLIDE 14

PHASE III : Scatter

  • Perform a scan on the GPU and CPU histograms to compute the block-wise offsets.
  • Scatter elements in a hybrid fashion to all bins:
    • GPU: perform local scattering in each BLOCK.
    • CPU: perform global scattering across the single BLOCK.
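A sequential sketch of the scan-and-scatter step. It assumes a bin-major exclusive scan over the per-block histograms of Phase II; the hybrid local/global split between the two devices is omitted:

```python
def scan_and_scatter(a, labels, hists, num_bins, block):
    """Phase III sketch: an exclusive prefix scan over the per-block
    histograms yields, for every (bin, block) pair, the offset at which
    that block writes its elements of that bin; the scatter then places
    every element directly into its bin's region of the output."""
    num_blocks = len(hists)
    # Scan in bin-major order: all of bin 0's per-block counts, then bin 1's...
    offsets = [[0] * num_bins for _ in range(num_blocks)]
    total = 0
    for b in range(num_bins):
        for blk in range(num_blocks):
            offsets[blk][b] = total
            total += hists[blk][b]
    out = [None] * len(a)
    cursor = [row[:] for row in offsets]   # running write position per (block, bin)
    for i, x in enumerate(a):
        blk = i // block
        out[cursor[blk][labels[i]]] = x
        cursor[blk][labels[i]] += 1
    return out
```

After this pass the output holds the bins contiguously, with each block's contribution to a bin landing at its pre-computed offset, so no two writers ever contend for a slot.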

SLIDE 15

PHASE IV : Recurse and Sort

  • Recurse Phases I to III until the size of each block comes down to a size where we can do a normal quicksort on each thread.
  • Separate the bins among the CPU and GPU and apply the sorting on each of the bins until a final sorted sequence is obtained.

SLIDE 16

Memory Access Optimization

SLIDE 17

Memory Access Optimization

  • Available memory is a vital resource.
  • Reuse of data structures is vital for synchronization and consolidation.
  • We reuse our histogram store in the scattering step, where we do not write all the entries for all the labels together.
  • Instead of writing all the entries in one space, we write them in the order in which they will be read back.
  • This facilitates higher coalescing of reads as well as the reuse of a data structure.
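The read-order idea can be illustrated with the histogram store. The bin-major layout below is our assumption about what "the order in which it will be read back" means here, since the Phase III scan walks each bin across all blocks:

```python
def store_in_read_order(hists):
    """Layout sketch: the next step reads counts bin by bin across blocks,
    so instead of storing each block's histogram contiguously (block-major)
    we write the counts in bin-major order, i.e. the order in which they
    will be read back. Consecutive reads then touch consecutive addresses,
    which coalesces well on a GPU."""
    num_blocks, num_bins = len(hists), len(hists[0])
    flat = [0] * (num_blocks * num_bins)
    for blk in range(num_blocks):
        for b in range(num_bins):
            flat[b * num_blocks + blk] = hists[blk][b]   # bin-major slot
    return flat
```

The same buffer then serves both the histogram and scan steps, which is the data-structure reuse the slide refers to.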

SLIDE 18

Results on Key‐Value Pairs

[Chart: sorting rate in million pairs/sec (80 to 180) vs. number of elements (2^15 to 2^24) for the Hybrid, Merge Sort, Sample Sort, and Satish key-value sorts]
SLIDE 19

Results on 32-bit Integers

[Chart: sorting rate in MKeys/sec (100 to 300) vs. number of elements (2^15 to 2^24) for the Hybrid, Merge Sort, Sample Sort, and Satish key-only sorts]

SLIDE 20

Results of Key‐Value Pairs on Low‐End Platform

[Chart: sorting rate in million pairs/sec (30 to 55) vs. number of elements (5 to 30 million) for the Hybrid, Merge Sort, Sample Sort, and Satish key-value sorts]

SLIDE 21

Variation of Threshold

[Chart: threshold in percent (5 to 30) vs. number of elements (2^15 to 2^20) for the high-end and low-end platforms]

SLIDE 22

Results

  • Our key-value pair sorting is on average 20% better than the current best known result.
  • Our 32-bit sorting results are on average 23% better than the current best known result.
  • The performance benefit can be attributed to:
    • Hybrid histogram computation, which reduces atomic and irregularity overheads.
    • Overlapped scattering, which reduces memory access latencies.

SLIDE 23

Conclusions

  • Our implementation clearly shows the benefits of a heterogeneous platform.
  • Hybrid algorithms show promising results on both high-end as well as commodity-level processors.
  • Our algorithm can very easily be extended to sort variable-length keys such as strings.
  • It will also be of interest to experiment and optimize on other data sets such as deterministic duplicates, staggered keys, and bucket-sorted keys.

SLIDE 24

Hybrid Comparison Sort

THANK YOU

Questions?