1
Highly Scalable Parallel Sorting
Edgar Solomonik
University of Illinois at Urbana-Champaign
April 29, 2010
2
Outline
- Parallel sorting background
- Histogram Sort overview
- Histogram Sort optimizations
- Charm++ implementation
- Results
- Limitations of work
- Contributions
- Future work
3
Parallel Sorting
- Input
– There are n unsorted keys, distributed evenly over p processors
– The distribution of keys in the range is unknown and possibly skewed
- Goal
– Sort the data globally according to keys
– Ensure no processor has more than (n/p)+threshold keys
4
Scaling Challenges
- Load balance
– Main objective of most parallel sorting algorithms
– Each processor needs a contiguous chunk of the data
- Data exchange communication
– Can require a complete communication graph
– The all-to-all moves n elements in p² messages
5
Parallel Sorting Algorithms
- Merge-based
– Bitonic Sort: ½*n*log²(p) data movement
– Cole's Merge Sort: O(n*log(p)) data movement
- Splitter-based
– Sample Sort: n data movement
– Histogram Sort: n data movement
- Other
– Parallel Quicksort: O(n*log(p)) data movement
– Radix Sort: O(n) to ~4*n data movement
6
Splitter-Based Parallel Sorting
- A splitter is a key that partitions the global data at a desired location
- p-1 global splitters are needed to subdivide the data into p contiguous chunks
- Each processor can send out its local data based on the splitters (see the sketch below)
– Data moves only once
- Each processor merges the data chunks as it receives them
[Figure: splitting of initial data — key density histogram partitioned into Proc 1, Proc 2, and Proc 3 ranges by Splitter 1 and Splitter 2; x-axis: Key, y-axis: Number of Keys]
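For concreteness, a minimal sketch (in C++, not the thesis code) of how a processor could apply the p-1 splitters to its sorted local keys to decide how many keys go to each destination range; all names are illustrative.

```cpp
// Sketch: given locally sorted keys and the p-1 global splitters,
// compute how many keys go to each of the p destination ranges.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<size_t> send_counts(const std::vector<uint64_t>& sorted_local,
                                const std::vector<uint64_t>& splitters /* size p-1 */) {
  const size_t p = splitters.size() + 1;
  std::vector<size_t> counts(p);
  size_t prev = 0;
  for (size_t i = 0; i < splitters.size(); ++i) {
    // First position whose key is >= splitter i marks the end of range i.
    size_t pos = std::lower_bound(sorted_local.begin(), sorted_local.end(),
                                  splitters[i]) - sorted_local.begin();
    counts[i] = pos - prev;
    prev = pos;
  }
  counts[p - 1] = sorted_local.size() - prev;  // remainder goes to the last range
  return counts;
}
```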
7
Splitter on Key Density Function
[Figure: number of keys smaller than x as a function of the key value (key_min to key_max, up to n keys); splitter k sits where the curve reaches k*(n/p); restated as a formula below]
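Restating the recoverable content of the figure as a formula: the k-th splitter is chosen so that the number of keys globally smaller than it is about k*(n/p).

```latex
% k-th splitter s_k: the key whose global rank is the k-th multiple of the ideal chunk size
\left|\{\, x \in \text{data} : x < s_k \,\}\right| \;\approx\; k \cdot \frac{n}{p},
\qquad k = 1, \dots, p-1
```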
8
Sample Sort
[Diagram: Sample Sort flow — each processor extracts a regularly spaced local sample from its sorted data; the samples are concatenated into a combined sample, which is sorted; splitters are extracted from the combined sorted sample and broadcast; each processor applies the splitters to its data and an all-to-all delivers every processor's sorted data]
9
Sample Sort
- The sample is typically regularly spaced in the local sorted data, with s = p-1 samples per processor
– Worst-case final load imbalance is 2*(n/p) keys
– In practice, load imbalance is typically very small
- The combined sample becomes a bottleneck since s*p ~ p² (see the sketch below)
– With 64-bit keys, if p = 8192, the sample is 16 GB!
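A minimal serial sketch (illustrative names, not the thesis code) of the two Sample Sort steps described above: regular local sampling, and splitter extraction from the sorted combined sample.

```cpp
// Sketch of Sample Sort's splitter selection, assuming the combined sample
// has already been gathered onto one processor.
#include <algorithm>
#include <cstdint>
#include <vector>

// Each processor picks p-1 regularly spaced keys from its sorted local data.
std::vector<uint64_t> local_sample(const std::vector<uint64_t>& sorted_local, size_t p) {
  std::vector<uint64_t> sample;
  for (size_t k = 1; k < p; ++k)
    sample.push_back(sorted_local[k * sorted_local.size() / p]);
  return sample;
}

// One processor sorts the s*p ~ p^2 combined sample and extracts p-1 splitters.
std::vector<uint64_t> extract_splitters(std::vector<uint64_t> combined, size_t p) {
  std::sort(combined.begin(), combined.end());
  std::vector<uint64_t> splitters;
  for (size_t k = 1; k < p; ++k)
    splitters.push_back(combined[k * combined.size() / p]);
  return splitters;
}
```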
10
Basic Histogram Sort
- Splitter-based
- Uses iterative guessing to find splitters
– O(p) probe rather than an O(p²) combined sample (a refinement loop is sketched below)
– Probe refinement based on the global histogram
- Histogram calculated by applying the splitter guesses to the local data
- Kale and Krishnan, ICPP 1993
- Basis for this work
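As an illustration of the iterative guessing, here is a serial model (not the distributed Charm++ code) of refining one splitter guess by bisection against a histogram count; the real algorithm refines all p-1 guesses at once against the summed histograms of the distributed data.

```cpp
// Serial model of probe refinement for a single splitter: keep a key interval
// [lo, hi], guess the midpoint, count how many keys fall below the guess (the
// "histogram" entry), and narrow the interval until the count is within the
// allowed threshold of the target rank k*(n/p).
#include <algorithm>
#include <cstdint>
#include <vector>

uint64_t refine_splitter(const std::vector<uint64_t>& sorted_keys,  // all n keys in one place, for illustration only
                         size_t target, size_t threshold) {
  uint64_t lo = 0, hi = UINT64_MAX;
  for (;;) {
    uint64_t guess = lo + (hi - lo) / 2;                          // next probe entry
    size_t below = std::lower_bound(sorted_keys.begin(), sorted_keys.end(),
                                    guess) - sorted_keys.begin(); // histogram entry
    if (below + threshold >= target && below <= target + threshold)
      return guess;                                               // converged
    if (below < target) lo = guess + 1; else hi = guess - 1;      // refine the probe bounds
    if (lo > hi) return guess;  // heavily duplicated keys: cannot split any closer
  }
}
```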
11
Basic Histogram Sort
[Diagram: Basic Histogram Sort flow — a probe of splitter guesses is broadcast; each processor applies the probe to its data and calculates a local histogram; the histograms are added up and the global histogram is analyzed; if the probe has not converged it is refined and broadcast again; once converged, the splitters are applied to the data and an all-to-all delivers each processor's sorted data]
12
Basic Histogram Sort
- Positives
– Splitter-based: single all-to-all data transpose
– Can achieve an arbitrarily small threshold
– The probing technique is scalable compared to sample sort: O(p) vs. O(p²)
– Allows good overlap between communication and computation (to be shown)
- Negatives
– Harder to implement
– Running time depends on the data distribution
13
Sorting and Histogramming Overlap
- The local data does not actually need to be sorted first
- Splice the data instead (sketched below)
– Use splitter guesses as Quicksort pivots
– Each splice determines the location of a guess and partitions the data
- Sort the chunks of data while histogramming happens
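A sketch of the splicing idea under simplifying assumptions (serial, illustrative names): the sorted probe guesses are used as quicksort-style pivots, so one recursive partitioning pass both yields each guess's position in the local data (the local histogram) and leaves the data partitioned into chunks that can be sorted later.

```cpp
// "Splice" the local data with a probe of sorted splitter guesses: partition
// around the middle guess, record its position (local histogram entry), and
// recurse on each side with the corresponding half of the probe.
#include <algorithm>
#include <cstdint>
#include <vector>

void splice(std::vector<uint64_t>& keys, size_t begin, size_t end,
            const std::vector<uint64_t>& probe, size_t pbegin, size_t pend,
            std::vector<size_t>& guess_pos /* size = probe.size() */) {
  if (pbegin >= pend) return;
  size_t pmid = pbegin + (pend - pbegin) / 2;
  uint64_t pivot = probe[pmid];
  // Partition keys[begin, end) around the middle guess, as in Quicksort.
  auto mid = std::partition(keys.begin() + begin, keys.begin() + end,
                            [pivot](uint64_t k) { return k < pivot; });
  size_t cut = mid - keys.begin();
  guess_pos[pmid] = cut;  // local count of keys below this guess
  splice(keys, begin, cut, probe, pbegin, pmid, guess_pos);   // locate smaller guesses
  splice(keys, cut, end, probe, pmid + 1, pend, guess_pos);   // locate larger guesses
}
```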
14
Histogramming by Splicing Data
[Diagram: histogramming by splicing data — the unsorted data is spliced with the probe; the resulting chunks are sorted while the unresolved regions are spliced again with the new probe, searching and splicing only within the relevant chunk]
15
Histogram Overlap Analysis
- Probe generation work should be offloaded to one processor
– Reduces the critical path
- Splicing is somewhat expensive
– O((n/p)*log(p)) for the first iteration
- log(p) approaches log(n/p) under weak scaling
– Small theoretical overhead (limited pivot selection)
– Slight implementation overhead (tuned library sorts are faster)
– Some optimizations/extra code are necessary
16
Sorting and All-to-All Overlap
- Overlapping histogramming with the local sort is good, but the all-to-all is the worst scaling bottleneck
- Fortunately, much all-to-all overlap is available
- The all-to-all can initially overlap with local sorting
– Some splitters converge in every histogram iteration
- This happens before local sorting completes
- Sending can begin to any already-defined ranges (sketched below)
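A hedged sketch of the eager-send idea (the `send_chunk` callback stands in for whatever asynchronous send the runtime provides; all names are illustrative): once both splitters bounding a destination range are resolved, that chunk can be shipped immediately while the remaining splitters are still being refined.

```cpp
// Sketch of eager data movement. Assumes the local data has already been
// spliced/sorted so each destination's chunk is contiguous in local_keys.
#include <cstdint>
#include <functional>
#include <vector>

void send_resolved_ranges(const std::vector<uint64_t>& local_keys,
                          const std::vector<size_t>& range_begin,  // per destination, valid once resolved
                          const std::vector<size_t>& range_end,
                          const std::vector<bool>& resolved,       // both bounding splitters known?
                          std::vector<bool>& already_sent,
                          const std::function<void(size_t dest, const uint64_t*, size_t)>& send_chunk) {
  for (size_t dest = 0; dest < resolved.size(); ++dest) {
    if (resolved[dest] && !already_sent[dest]) {
      send_chunk(dest, local_keys.data() + range_begin[dest],
                 range_end[dest] - range_begin[dest]);
      already_sent[dest] = true;  // each chunk moves only once
    }
  }
}
```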
17
Eager Data Movement
[Diagram: eager data movement — when a message with newly resolved ranges arrives, the corresponding chunks are extracted from the partially sorted local data and sent to their destination processors, while the remaining chunks are still being sorted]
18
All-to-All and Merge Overlap
- The k-way merge done as the data arrives should be implemented as a tree merge (sketched below)
– A k-way heap merge requires all k arrays
– A tree merge can start with just two arrays
- Some data arrives much earlier than the rest
– A tree merge allows overlap
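A small illustrative tree merge (not the thesis implementation): arriving sorted chunks are merged pairwise, binary-counter style, so most merging overlaps with the all-to-all instead of waiting for all k chunks.

```cpp
// Incremental tree merge of sorted chunks as they arrive.
#include <algorithm>
#include <cstdint>
#include <vector>

struct TreeMerger {
  // runs[r] holds a sorted run built from 2^r received chunks (or is empty).
  std::vector<std::vector<uint64_t>> runs;

  void add_chunk(std::vector<uint64_t> chunk) {   // chunk is already sorted
    size_t r = 0;
    while (r < runs.size() && !runs[r].empty()) {
      chunk = merge(runs[r], chunk);              // combine equal-sized runs
      runs[r].clear();
      ++r;
    }
    if (r == runs.size()) runs.emplace_back();
    runs[r] = std::move(chunk);
  }

  std::vector<uint64_t> finish() {                // merge whatever runs remain
    std::vector<uint64_t> out;
    for (auto& run : runs)
      if (!run.empty()) out = merge(out, run);
    return out;
  }

  static std::vector<uint64_t> merge(const std::vector<uint64_t>& a,
                                     const std::vector<uint64_t>& b) {
    std::vector<uint64_t> out(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin());
    return out;
  }
};
```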
19
Tree k-way Merging
[Diagram: tree k-way merging — the first chunk is held in a buffer; as further chunks arrive they are merged pairwise into intermediate buffers (B1, B2), and the intermediate results are merged again until the final merged data is produced]
20
Charm++ Implementation
- Why?
– The sort is compatible with Charm++ applications
– Division between histogramming analysis work and data containers
- More natural
- Flexible
– The Charm++ scheduler is used to automatically overlap executing stages and push probes through
- An MPI implementation is possible, but more difficult
21
Overlap Benefit (Weak Scaling)
Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.
22
Overlap Benefit (Weak Scaling)
Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.
23
Overlap Benefit (Weak Scaling)
Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.
24
Effect of All-to-All Overlap
[Figure: processor utilization timelines, "no overlap" vs. "overlap" — phases shown: splice data, sort by chunks, histogram, send data, merge, idle time; y-axis: processor utilization (up to 100%)]
Tests done on 4096 cores of Intrepid (BG/P) with 8 million 64-bit keys per core.
25
All-to-All Spread and Staging
- Personalized all-to-all collective communication strategies are important
– The all-to-all eventually dominates execution time
- Some basic optimizations are easily applied
– Vary the order of sends (sketched below)
- Minimizes network contention
– Only a subset of processors should send data to one destination at a time
- Prevents network overload
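A minimal sketch of the varied send order mentioned above (illustrative, assuming ranks 0..p-1): each processor sends to destinations in a rotated order, so no destination is targeted by all sources at once.

```cpp
// Staggered send order: processor `rank` sends to (rank+1, rank+2, ..., mod p),
// so at any given step each destination is targeted by only one source.
#include <cstddef>
#include <vector>

std::vector<size_t> send_order(size_t rank, size_t p) {
  std::vector<size_t> order;
  order.reserve(p - 1);
  for (size_t step = 1; step < p; ++step)
    order.push_back((rank + step) % p);  // every destination appears exactly once
  return order;
}
```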
26
Communication Spread
[Figure: per-processor timeline showing the spread of communication — phases: data splicing, sorting, sending, merging]
Tests done on 4096 cores of Intrepid (BG/P) with 8 million 64-bit keys per core.
27
Algorithm Scaling Comparison
Tests done on Intrepid (BG/P) with 8 million 64-bit keys per core.
[Chart annotation: "Out of memory"]
28
Histogram Sort Parallel Efficiency
Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.
29
Some Limitations of this Work
- Benchmarking was done with 64-bit keys rather than key-value pairs
- The optimizations presented are only beneficial for certain parallel sorting problems
– Generally, we assumed n > p²
- Splicing is useless unless n/p > p
- Different all-to-all optimizations are required if n/p is small (combine messages)
– Communication is usually cheap until p > 512
- The complexity of the implementation is another issue
30
Future/Ongoing Work
- Write a further optimized library implementation of Histogram Sort
– Sort key-value pairs
– Almost complete; code to be released
- To scale past 32k cores, histogramming needs to be better optimized
– As p → n/p, the probe creation cost matches the cost of local sorting and merging
– One promising solution is to parallelize probing
- Early-determined splitters can be used to divide the probing work
31
Contributions
- Improvements on the original Histogram Sort algorithm
– Overlap between computation and communication
– Interleaved algorithm stages
- Efficient and well-optimized implementation
- Scalability up to tens of thousands of cores
- Groundwork for further parallel scaling of sorting algorithms
32
Acknowledgements
- Everyone in PPL for various and generous help
- IPDPS reviewers for excellent feedback
- Funding and Machine Grants
– DOE Grant DEFG05-08OR23332 through the ORNL LCF
– Blue Gene/P at Argonne National Laboratory, which is supported by DOE under contract DE-AC02-06CH11357
– Jaguar at Oak Ridge National Laboratory, which is supported by DOE under contract DE-AC05-00OR22725
– Accounts on Jaguar were made available via the Performance Evaluation and Analysis Consortium End Station, a DOE INCITE project