Slide 1

Highly Scalable Parallel Sorting

Edgar Solomonik and Laxmikant Kale
University of Illinois at Urbana-Champaign
April 20, 2010

Slide 2

Outline

  • Parallel sorting background
  • Histogram Sort overview
  • Histogram Sort optimizations
  • Results
  • Limitations of work
  • Contributions
  • Future work

Slide 3

Parallel Sorting

  • Input
    – There are n unsorted keys, distributed evenly over p processors
    – The distribution of keys in the range is unknown and possibly skewed
  • Goal
    – Sort the data globally according to keys
    – Ensure no processor has more than (n/p) + threshold keys

Slide 4

Scaling Challenges

  • Load balance
    – Main objective of most parallel sorting algorithms
    – Each processor needs a contiguous chunk of data
  • Data exchange communication
    – Can require a complete communication graph
    – All-to-all contains n elements in p² messages

Slide 5

Parallel Sorting Algorithms

Data movement by algorithm type:

  • Merge-based
    – Bitonic Sort: ½*n*log²(p)
    – Cole's Merge Sort: O(n*log(p))
  • Splitter-based
    – Sample Sort: n
    – Histogram Sort: n
  • Other
    – Parallel Quicksort: O(n*log(p))
    – Radix Sort: O(n) ~ 4*n

Slide 6

Splitter-Based Parallel Sorting

  • A splitter is a key that partitions the global set of keys at a desired location
  • p-1 global splitters are needed to subdivide the data into p contiguous chunks
  • Each processor can send out its local data based on the splitters (sketched below)
    – Data moves only once
  • Each processor merges the data chunks as it receives them

[Figure: splitting of the initial data. A key-density histogram over the range key_min to key_max, with Splitter 1 and Splitter 2 dividing the keys among Proc 1, Proc 2, and Proc 3.]
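
To make the splitter-application step concrete, here is a minimal sequential sketch (Python, not the authors' implementation) of how a processor could cut its sorted local keys into the p chunks it sends out; the function name is illustrative.

```python
import bisect

def partition_by_splitters(local_sorted_keys, splitters):
    """Cut locally sorted keys into p contiguous chunks, one per destination.

    splitters is the ascending list of p-1 global splitter keys; chunk i holds
    the keys destined for processor i.
    """
    chunks, start = [], 0
    for s in splitters:
        end = bisect.bisect_left(local_sorted_keys, s, lo=start)
        chunks.append(local_sorted_keys[start:end])
        start = end
    chunks.append(local_sorted_keys[start:])  # keys >= the last splitter
    return chunks
```

Each chunk is then sent exactly once in the all-to-all, and the receiving processor merges chunks as they arrive.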

Slide 7

Splitter on Key Density Function

[Figure: cumulative key count ("number of keys smaller than x") over the key range key_min to key_max; splitter k sits where the cumulative count reaches k*(n/p).]
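
The figure's definition can be restated in a couple of lines; this tiny helper (illustrative only, using the ideal targets with no threshold) computes the desired global rank of each splitter.

```python
def splitter_targets(n, p):
    """Splitter k (k = 1 .. p-1) should sit where roughly k*(n/p) keys fall below it."""
    return [k * (n // p) for k in range(1, p)]
```

Histogram Sort only needs each splitter's rank to fall within the allowed threshold of its target, which is what makes iterative guessing practical.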

Slide 8

Sample Sort

[Flowchart: each processor extracts a local sample from its sorted data; the samples are concatenated into a combined sample; the combined sample is sorted; splitters are extracted from the sorted combined sample and broadcast; each processor applies the splitters to its data; an all-to-all exchange follows, leaving each processor with its sorted portion.]

Slide 9

Sample Sort

  • The sample is typically regularly spaced in the local sorted data, with s = p-1 samples per processor (sketched below)
    – Worst-case final load imbalance is 2*(n/p) keys
    – In practice, load imbalance is typically very small
  • The combined sample becomes a bottleneck, since s*p ~ p²
    – With 64-bit keys, if p = 8192, sample is 16 GB!
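
For later comparison with Histogram Sort, here is a small sequential sketch of the regular-sampling splitter selection described above; it is illustrative only (single process, list-of-lists input), and sorting the combined ~p²-key sample is exactly the bottleneck the slide points out.

```python
def sample_sort_splitters(local_sorted_data, p):
    """Pick p-1 global splitters from regularly spaced local samples.

    local_sorted_data[i] is processor i's sorted keys (assumed non-empty);
    each contributes s = p-1 evenly spaced keys to the combined sample.
    """
    s = p - 1
    combined = []
    for keys in local_sorted_data:
        n_local = len(keys)
        combined.extend(keys[(j + 1) * n_local // (s + 1)] for j in range(s))
    combined.sort()  # sorting this ~p*(p-1)-key sample is the scalability bottleneck
    return [combined[(j + 1) * len(combined) // p] for j in range(p - 1)]
```

Histogram Sort replaces this combined sample with an O(p) probe that is refined iteratively, as the next slides describe.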

Slide 10

Basic Histogram Sort

  • Splitter-based
  • Uses iterative guessing to find splitters (a sketch of one round follows below)
    – O(p) probe rather than O(p²) combined sample
    – Probe refinement based on global histogram
  • Histogram calculated by applying splitters to data
  • Kale and Krishnan, ICPP 1993
  • Basis for this work
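
A minimal sequential sketch of one histogramming round for integer keys follows; the bisection rule, convergence test, and function names are simplifications chosen for illustration rather than the exact scheme used in the paper.

```python
import bisect

def local_histogram(local_sorted_keys, probe):
    """For each splitter guess, count the local keys below it (binary search)."""
    return [bisect.bisect_left(local_sorted_keys, g) for g in probe]

def refine_probe(probe, global_counts, targets, threshold, key_min, key_max):
    """One refinement round over the summed (global) histogram.

    targets[k] is the desired global rank of splitter k, e.g. (k+1)*(n/p).
    Returns (resolved, new_probe): splitters whose guess landed within
    `threshold` keys of the target, and fresh guesses for the rest.
    """
    resolved, new_probe = {}, []
    for k, target in enumerate(targets):
        # bracket the target rank between neighbouring probe keys
        lo_key, lo_cnt, hi_key = key_min, 0, key_max
        for g, c in zip(probe, global_counts):
            if c <= target:
                lo_key, lo_cnt = g, c
            else:
                hi_key = g
                break
        if abs(lo_cnt - target) <= threshold:
            resolved[k] = lo_key                      # splitter k has converged
        else:
            new_probe.append((lo_key + hi_key) // 2)  # bisect the bracketing key interval
    return resolved, new_probe
```

The global histogram itself is simply the element-wise sum of the local histograms across processors (a reduction), matching the "add up histograms" step in the flowchart on the next slide.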

Slide 11

Basic Histogram Sort

[Flowchart: a probe of splitter guesses is broadcast to processors 1 .. p; each calculates a histogram of the probe against its data; the histograms are added up and the global histogram is analyzed; if the probe has not converged, it is refined and broadcast again; if converged, the splitters are applied to the data, an all-to-all exchange follows, and each processor holds its sorted portion.]

Slide 12

Basic Histogram Sort

  • Positives
    – Splitter-based: single all-to-all data transpose
    – Can achieve arbitrarily small threshold
    – Probing technique is scalable compared to sample sort: O(p) vs. O(p²)
    – Allows good overlap between communication and computation (to be shown)
  • Negatives
    – Harder to implement
    – Running time dependent on data distribution

Slide 13

Sorting and Histogramming Overlap

  • Don't actually need to sort local data first
  • Splice the data instead
    – Use splitter guesses as Quicksort pivots
    – Each splice determines the location of a guess and partitions the data
  • Sort chunks of data while histogramming happens (a sketch of a single splice follows below)
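
A minimal sketch of one splice, assuming in-place partitioning of a Python list around a guessed key; the real code operates on contiguous buffers, and the names here are illustrative.

```python
def splice(data, lo, hi, guess):
    """Partition data[lo:hi] around a splitter guess, quicksort-style.

    Keys smaller than the guess are moved to the front of the range. The
    returned boundary index i means data[lo:i] < guess, so i - lo is this
    range's contribution to the histogram, and the two sides of i can later
    be sorted (or spliced again) independently.
    """
    i = lo
    for j in range(lo, hi):
        if data[j] < guess:
            data[i], data[j] = data[j], data[i]
            i += 1
    return i
```

Successive splices with the remaining guesses subdivide the local data further, and the resulting chunks can be handed to a library sort while the next histogram round is in flight.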

Slide 14

Histogramming by Splicing Data

[Figure: unsorted data is spliced with the probe; the resulting chunks are sorted while the data is spliced again with the new probe (the refinement only needs to search and splice within the chunk containing the guess), eventually producing fully sorted data.]

Slide 15

Histogram Overlap Analysis

  • Probe generation work should be offloaded to one processor
    – Reduces the critical path
  • Splicing is somewhat expensive
    – O((n/p)*log(p)) for the first iteration
      • log(p) approaches log(n/p) in weak scaling
    – Small theoretical overhead (limited pivot selection)
    – Slight implementation overhead (library sorts are faster)
    – Some optimizations/code necessary

Slide 16

Sorting and All-to-All Overlap

  • Histogram and local sort overlap is good, but the all-to-all is the worst-scaling bottleneck
  • Fortunately, much all-to-all overlap is available
  • The all-to-all can initially overlap with local sorting
    – Some splitters converge every histogram iteration
      • This is also prior to completion of local sorting
      • Can begin sending to any defined ranges (see the sketch below)
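
A sketch of this eager-send idea under simplifying assumptions: the chunk for destination d is known once splitters d-1 and d have converged, and `send`, the data layout, and all names are placeholders for the actual communication calls.

```python
def eager_send(chunks_for_dest, resolved, already_sent, send):
    """Send every chunk whose destination range is already fully determined.

    chunks_for_dest[d] holds the local keys headed for processor d (valid once
    splitters d-1 and d are resolved); `resolved` is the set of converged
    splitter indices 0 .. p-2, and `send` stands in for the message send.
    """
    p = len(chunks_for_dest)
    for d in range(p):
        if d in already_sent:
            continue
        lower_known = (d == 0) or (d - 1 in resolved)
        upper_known = (d == p - 1) or (d in resolved)
        if lower_known and upper_known:
            send(d, sorted(chunks_for_dest[d]))  # sort the chunk, then ship it early
            already_sent.add(d)
```

Calling something like this after every histogram round lets the all-to-all begin while local sorting and probing are still in progress.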

Slide 17

Eager Data Movement

[Flowchart: on receiving a message with newly resolved ranges, a processor either extracts the corresponding chunk from its still-unsorted data and sends it to the destination processor, or sorts that chunk first and then sends it, so data movement begins before the local sort completes.]

Slide 18

All-to-All and Merge Overlap

  • The k-way merge done when the data arrives should be implemented as a tree merge (sketched below)
    – A k-way heap merge requires all k arrays
    – A tree merge can start with just two arrays
  • Some data arrives much earlier than the rest
    – Tree merge allows overlap
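
A sequential sketch of the tree-merge idea, assuming chunks arrive through a Python iterable; runs of comparable size are merged as soon as they exist instead of waiting for all k chunks, as a heap merge would.

```python
def tree_merge(arriving_chunks):
    """Merge sorted chunks pairwise as they arrive (tree merge).

    A k-way heap merge needs all k chunks before it can begin; merging runs
    of comparable size as soon as they are available lets the merge overlap
    with the all-to-all, since some chunks arrive much earlier than others.
    """
    def merge_two(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        out.extend(a[i:])
        out.extend(b[j:])
        return out

    runs = []
    for chunk in arriving_chunks:        # chunks arrive over time (e.g. as messages)
        runs.append(list(chunk))
        # eagerly merge the top two runs while the newer one is at least as large
        while len(runs) >= 2 and len(runs[-1]) >= len(runs[-2]):
            b, a = runs.pop(), runs.pop()
            runs.append(merge_two(a, b))
    while len(runs) >= 2:                # finish once everything has arrived
        b, a = runs.pop(), runs.pop()
        runs.append(merge_two(a, b))
    return runs[0] if runs else []
```

For equal-sized chunks this produces the balanced pairwise merging pictured on the next slide, while still making progress the moment a second chunk arrives.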

Slide 19

Tree k-way Merging

[Figure: tree k-way merging with two buffers (B1, B2). The first chunk arrives into buffer 1; as further chunks arrive they are merged pairwise, and the merged runs are themselves merged, alternating between buffers, until the final merged data is produced.]

Slide 20

Overlap Benefit (Weak Scaling)

Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.

Slide 21

Overlap Benefit (Weak Scaling)

Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.

Slide 22

Overlap Benefit (Weak Scaling)

Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.

Slide 23

Effect of All-to-All Overlap

[Figure: processor-utilization timelines (0-100%), no overlap vs. overlap. Without overlap: sort all data, histogram, send data (with idle time), merge. With overlap: splice data, sort by chunks, send data, merge.]

Tests done on 4096 cores of Intrepid (BG/P) with 8 million 64-bit keys per core.

Slide 24

All-to-All Spread and Staging

  • Personalized all-to-all collective communication strategies are important
    – All-to-all eventually dominates execution time
  • Some basic optimizations are easily applied (a sketch follows below)
    – Varying the order of sends
      • Minimizes network contention
    – Only a subset of processors should send data to one destination at a time
      • Prevents network overload
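
A toy sketch of the staged, rotated send order meant here; the stage count, offsets, and function name are made up for illustration, and the real code relies on the runtime's collective communication strategies rather than anything this simple.

```python
def staged_send_schedule(my_rank, p, stages):
    """Order the destinations for one processor's part of the all-to-all.

    Each rank starts its sweep at a different offset (my_rank + 1, ...), so at
    any moment only a subset of processors targets the same destination, and
    splitting the sweep into stages further limits simultaneous senders.
    """
    order = [(my_rank + d) % p for d in range(1, p)]  # rotated destination order
    per_stage = -(-len(order) // stages)              # ceil division: at most `stages` groups
    return [order[i:i + per_stage] for i in range(0, len(order), per_stage)]
```

For example, staged_send_schedule(0, 8, 2) returns [[1, 2, 3, 4], [5, 6, 7]], while rank 3 would begin its sweep at rank 4, so no two ranks start by targeting the same destination.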

Slide 25

Communication Spread

[Figure: time breakdown of the phases (data splicing, sorting, sending, merging), showing how the sends are spread out across the execution.]

Tests done on 4096 cores of Intrepid (BG/P) with 8 million 64-bit keys per core.

Slide 26

Algorithm Scaling Comparison

Tests done on Intrepid (BG/P) with 8 million 64-bit keys per core.

(One data point in the comparison is marked "Out of memory".)

Slide 27

Histogram Sort Parallel Efficiency

Tests done on Intrepid (BG/P) and Jaguar (XT4) with 8 million 64-bit keys per core.

Slide 28

Some Limitations of this Work

  • Benchmarking done with 64-bit keys rather than key-value pairs
  • Optimizations presented are only beneficial for certain parallel sorting problems
    – Generally, we assumed n > p²
      • Splicing is useless unless n/p > p
      • Different all-to-all optimizations are required if n/p is small (combine messages)
    – Communication is usually cheap until p > 512
  • Complex implementation is another issue

Slide 29

Future/Ongoing Work

  • Write a further optimized library implementation of Histogram Sort
    – Sort key-value pairs
    – Almost completed; code to be released
  • To scale past 32k cores, histogramming needs to be better optimized
    – As p → n/p, probe creation cost matches the cost of local sorting and merging
    – One promising solution is to parallelize probing
      • Can use early-determined splitters to divide probing

Slide 30

Contributions

  • Improvements on the original Histogram Sort algorithm
    – Overlap between computation and communication
    – Interleaved algorithm stages
  • Efficient and well-optimized implementation
  • Scalability up to tens of thousands of cores
  • Groundwork for further parallel scaling of sorting algorithms

Slide 31

Acknowledgements

  • Everyone in PPL for various and generous help
  • IPDPS reviewers for excellent feedback
  • Funding and machine grants
    – DOE Grant DEFG05-08OR23332 through ORNL LCF
    – Blue Gene/P at Argonne National Laboratory, which is supported by DOE under contract DE-AC02-06CH11357
    – Jaguar at Oak Ridge National Laboratory, which is supported by the DOE under contract DE-AC05-00OR22725
    – Accounts on Jaguar were made available via the Performance Evaluation and Analysis Consortium End Station, a DOE INCITE project.