Communication-Efficient String Sorting

Timo Bingmann, Peter Sanders, Matthias Schimek · 2020-05-18 @ IPDPS’20
Institute of Theoretical Informatics – Algorithmics
Example strings: s0 = Antidisestablishmentarianism, s1 = Floccinaucinihilipilification, s2 = Honorificabilitudinitatibus
www.kit.edu
KIT – The Research University in the Helmholtz Association

Abstract
There has been surprisingly little work on algorithms for sorting strings on distributed-memory parallel machines. We develop efficient algorithms for this problem based on the multi-way merging principle. These algorithms inspect only the characters that are needed to determine the sorting order. Moreover, communication volume is reduced by also communicating (roughly) only those characters and by communicating repetitions of the same prefixes only once. Experiments on up to 1280 cores reveal that these algorithms are often more than five times faster than previous algorithms.
This document is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Timo Bingmann, Peter Sanders, Matthias Schimek – Communication-Efficient String Sorting 2 / 14 Institute of Theoretical Informatics – Algorithmics May 18th, 2020

Why String Sorting?
string: array of characters over alphabet Σ, e.g. s t r i n g 0
sorted string set: sorted lexicographically ⇒ like in a dictionary
characteristics of string sets: #strings n, #characters N, sum of distinguishing prefix lengths D
example: s0 = algorithm, s1 = compare, s2 = comparison, s3 = prefix
⇒ multidimensional data
only published distributed string sorting algorithm: one paragraph in [Fischer and Kurpicz, ALENEX’19]
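The three characteristics n, N and D can be illustrated on the example set from this slide. The following is a minimal sketch, not the authors' code. It assumes that N counts one terminator per string and that the distinguishing prefix of a string is one character longer than its longest common prefix with a neighbor in sorted order; the function names are hypothetical.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i


def string_set_stats(strings):
    """Return (n, N, D) for a set of strings.

    Assumptions of this sketch: N includes one terminator per string, and
    the distinguishing prefix length of a string is
    max(LCP with sorted predecessor, LCP with sorted successor) + 1.
    """
    ordered = sorted(strings)
    n = len(ordered)
    N = sum(len(s) + 1 for s in ordered)  # +1 for the 0 terminator
    lcps = [lcp(ordered[i - 1], ordered[i]) for i in range(1, n)]
    D = 0
    for i in range(n):
        left = lcps[i - 1] if i > 0 else 0
        right = lcps[i] if i < n - 1 else 0
        D += max(left, right) + 1
    return n, N, D


example = ["algorithm", "compare", "comparison", "prefix"]
print(string_set_stats(example))
```

In this example only "compare" and "comparison" share a long prefix, so D is much smaller than N; a sorter that inspects only distinguishing prefixes touches far fewer characters.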

String Sorting Toolbox
Sequential Sorting: String Radix Sort, Multikey Quicksort, ... [Kärkkäinen et al., SPIRE’08], [Bentley and Sedgewick, SODA’97]
evaluation of many sequential algorithms in [Bingmann ’18]
needed: string sorting + Longest Common Prefix (LCP) array computation
example (LCP, string): (⊥, algorithm), (2, alpha), (5, alphabet), (0, character), (1, complete), (4, computer), (6, computing), (2, copy)
Multiway Merging: LCP Losertree [Bingmann et al., Algorithmica’17]
exploit LCP values to save character comparisons
figure (LCP-Merge of (LCP, string) pairs): (2, aab), (1, acb), (2, aac), (0, bca)
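The two toolbox ingredients can be sketched as follows: computing the LCP array of a sorted sequence, and a two-way merge that uses LCP values to skip character comparisons. This is a simplified illustration only; the slides use a multiway LCP losertree, and the function names here are hypothetical. The ⊥ entry of the LCP array is represented by 0.

```python
def lcp(a, b, start=0):
    """LCP length of a and b; the first `start` characters are known to match."""
    i = start
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def lcp_array(sorted_strings):
    """LCP of each string with its predecessor (0 for the first string)."""
    out = [0] * len(sorted_strings)
    for i in range(1, len(sorted_strings)):
        out[i] = lcp(sorted_strings[i - 1], sorted_strings[i])
    return out


def lcp_merge(a, lcp_a, b, lcp_b):
    """Merge two sorted runs with their LCP arrays; returns (strings, lcps)."""
    out, out_lcp = [], []
    i = j = 0
    ha = hb = 0  # LCP of each run's head with the last output string
    while i < len(a) and j < len(b):
        if ha > hb:
            take_a = True   # a shares more with the last output, so a < b
        elif ha < hb:
            take_a = False
        else:
            # Equal LCPs: compare characters, starting after the known prefix.
            h = lcp(a[i], b[j], ha)
            take_a = a[i][h:h + 1] <= b[j][h:h + 1]
            if take_a:
                hb = h  # LCP of the loser b[j] with the winner a[i]
            else:
                ha = h
        if take_a:
            out.append(a[i]); out_lcp.append(ha)
            i += 1
            ha = lcp_a[i] if i < len(a) else 0
        else:
            out.append(b[j]); out_lcp.append(hb)
            j += 1
            hb = lcp_b[j] if j < len(b) else 0
    if i < len(a):
        out.extend(a[i:]); out_lcp.extend([ha] + lcp_a[i + 1:])
    if j < len(b):
        out.extend(b[j:]); out_lcp.extend([hb] + lcp_b[j + 1:])
    return out, out_lcp


words = ["algorithm", "alpha", "alphabet", "character",
         "complete", "computer", "computing", "copy"]
print(lcp_array(words))
run_a, run_b = ["aab", "acb"], ["aac", "bca"]
print(lcp_merge(run_a, lcp_array(run_a), run_b, lcp_array(run_b)))
```

Whenever the two head LCPs differ, the merge decides without looking at a single character, which is where the saving comes from.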

String Sorting Toolbox: LCP Compression
LCP   string      compressed
⊥     algorithm   algorithm
2     alpha       pha
5     alphabet    bet
0     character   character
1     complete    omplete
4     computer    uter
6     computing   ing
2     copy        py
each longest common prefix is sent only once
compression: iterate over strings + LCP array
decompression: iterate over compressed strings + LCP array
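Compression and decompression are each a single pass over the strings and the LCP array. A minimal sketch, with hypothetical function names and the ⊥ entry represented by 0:

```python
def lcp_compress(sorted_strings, lcps):
    """Drop from each string the prefix it shares with its predecessor."""
    return [s[h:] for s, h in zip(sorted_strings, lcps)]


def lcp_decompress(suffixes, lcps):
    """Rebuild each string from its predecessor's prefix plus the sent suffix."""
    out = []
    prev = ""
    for suffix, h in zip(suffixes, lcps):
        prev = prev[:h] + suffix
        out.append(prev)
    return out


words = ["algorithm", "alpha", "alphabet", "character",
         "complete", "computer", "computing", "copy"]
lcps = [0, 2, 5, 0, 1, 4, 6, 2]
packed = lcp_compress(words, lcps)
print(packed)
print(lcp_decompress(packed, lcps) == words)
```

The receiver needs the LCP array in addition to the compressed strings, so both are sent together.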

Distributed Merge String Sort (MS)
Local Sorting: String Radix Sort; new: String Radix Sort + LCP array
Distributed Partitioning Algorithm
String Exchange: no compression; new: LCP compression
Merging: plain losertree; new: LCP losertree
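The four phases can be traced in a single-process simulation. This sketch is an illustration only: each PE is a Python list, the splitters are passed in, Python's built-in sort and `heapq.merge` stand in for string radix sort and the losertree, and no LCP compression is applied. The name `simulate_ms` is hypothetical.

```python
from bisect import bisect_right
from heapq import merge


def simulate_ms(pe_inputs, splitters):
    """Simulate distributed merge string sort on p lists with p - 1 sorted splitters."""
    p = len(pe_inputs)
    assert len(splitters) == p - 1

    # 1. Local sorting on every PE.
    local = [sorted(strings) for strings in pe_inputs]

    # 2. Partitioning: cut each sorted local array at the splitters.
    outgoing = []
    for arr in local:
        cuts = [0] + [bisect_right(arr, sp) for sp in splitters] + [len(arr)]
        outgoing.append([arr[cuts[k]:cuts[k + 1]] for k in range(p)])

    # 3. String exchange: PE dst receives bucket dst from every PE.
    incoming = [[outgoing[src][dst] for src in range(p)] for dst in range(p)]

    # 4. Merging: each PE merges the p sorted runs it received.
    return [list(merge(*runs)) for runs in incoming]


inputs = [["sorting", "alpha", "string"],
          ["compare", "algo", "scale"],
          ["prefix", "character"]]
for pe, result in enumerate(simulate_ms(inputs, ["character", "scale"])):
    print(pe, result)
```

Concatenating the per-PE results in PE order yields the globally sorted sequence.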

Distributed Merge String Sort (MS): Partitioning
equidistant sampling: regular sampling on each PE yields the sample sets
Sorting of Sample Sets + Final Splitter Selection: gather + seq. sort; new: hypercube quicksort [Axtmann and Sanders, ALENEX’17]
broadcast final splitters: p − 1 final splitters
partitioning on each PE
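The "gather + seq. sort" variant of splitter selection can be sketched as below. The sample positions and the parameter `samples_per_pe` are assumptions of this sketch; the hypercube quicksort variant is not shown, and every local array is assumed to be non-empty.

```python
def regular_sample(sorted_local, v):
    """Pick v equidistant samples from a sorted, non-empty local array."""
    n = len(sorted_local)
    return [sorted_local[(i * n) // (v + 1)] for i in range(1, v + 1)]


def select_splitters(local_sorted_arrays, samples_per_pe):
    """Gather the regular samples of all PEs, sort them, pick p - 1 splitters."""
    p = len(local_sorted_arrays)
    samples = []
    for arr in local_sorted_arrays:
        samples.extend(regular_sample(arr, samples_per_pe))
    samples.sort()  # stands in for "gather + seq. sort"
    m = len(samples)
    # Equidistant selection; the result would be broadcast to all PEs.
    return [samples[(k * m) // p] for k in range(1, p)]


local = [["alpha", "sorting", "string"],
         ["algo", "compare", "scale"],
         ["character", "prefix"]]
print(select_splitters(local, samples_per_pe=2))
```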

Partitioning – Sampling Approaches
example strings (both approaches): aaaaaaa, b, c, d, e, fffffff
string-based sampling
Goal: equal number of strings per bucket
sampling of string array
provable upper bounds
character-based sampling
Goal: equal number of characters per bucket
sampling of character array
provable upper bounds
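The difference between the two approaches shows up on the example strings of this slide. In the following sketch the exact sample positions are an assumption, characters are counted without terminators, and the function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def string_based_samples(sorted_strings, v):
    """v samples at equidistant positions in the string array."""
    n = len(sorted_strings)
    return [sorted_strings[(i * n) // (v + 1)] for i in range(1, v + 1)]


def character_based_samples(sorted_strings, v):
    """v samples at equidistant positions in the concatenated character array."""
    ends = list(accumulate(len(s) for s in sorted_strings))  # end offset of each string
    total = ends[-1]
    samples = []
    for i in range(1, v + 1):
        pos = (i * total) // (v + 1)  # target character position
        # The string that contains character position pos.
        samples.append(sorted_strings[bisect_right(ends, pos)])
    return samples


strings = ["aaaaaaa", "b", "c", "d", "e", "fffffff"]
print(string_based_samples(strings, 2))
print(character_based_samples(strings, 2))
```

String-based sampling picks from the middle of the array regardless of string length, while character-based sampling is drawn towards the long strings, which balances the number of characters per bucket rather than the number of strings.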

Prefix Doubling String Merge Sort (PDMS)
PE1: Antidisestablishmentarianism
PE2: Floccinaucinihilipilification
PE3: Honorificabilitudinitatibus
same main structure as before
use distributed Single-Shot Bloom Filter (dSBF) [Sanders et al., IEEE BigData’13] to approximate distinguishing prefixes with distributed duplicate detection
only operate on those characters
calculate only the permutation for sorting (exchanging further characters is optional)

Distinguishing Prefix Computation
example: two PEs with four strings each, hash values h(s_i) in the range 0 to 19
first step:
      left PE     h(s_i)   right PE   h(s_i)
s0    alpha       2        algo       2
s1    character   19       compare    19
s2    sorting     7        prefix     13
s3    string      7        scale      7
left PE: m1 := [2, 7], m2 := [19]
right PE: m1 := [2, 7], m2 := [13, 19]
second step:
      left PE     h(s_i)   right PE   h(s_i)
s0    alpha       5        algo       5
s1    character   15       compare    11
s2    sorting     7        prefix     (none)
s3    string      0        scale      0
left PE: m1 := [0, 5, 7], m2 := [15]
right PE: m1 := [0, 5], m2 := [11]
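The idea of approximating distinguishing prefixes by prefix doubling and duplicate detection can be sketched in a single process. This is an illustration under assumptions, not the dSBF from the slides: all strings live in one list, `zlib.crc32` stands in for the hash function, a counter replaces the distributed duplicate detection, and each string is treated as ending in a terminator character.

```python
import zlib
from collections import Counter


def approx_dist_prefixes(strings, start_len=1):
    """Approximate distinguishing prefix lengths by prefix doubling.

    In each round, the current prefix of every still-active string is
    hashed. A string whose hash value is unique is finished; the others
    double their prefix length. Hash collisions can only make a prefix
    longer than necessary, never too short.
    """
    terminated = [s + "\0" for s in strings]  # assumed 0 terminator
    result = [0] * len(strings)
    active = list(range(len(strings)))
    k = start_len
    while active:
        hashes = {i: zlib.crc32(terminated[i][:k].encode()) for i in active}
        counts = Counter(hashes.values())
        still_active = []
        for i in active:
            whole_string_hashed = k >= len(terminated[i])
            if counts[hashes[i]] == 1 or whole_string_hashed:
                result[i] = min(k, len(terminated[i]))
            else:
                still_active.append(i)
        active = still_active
        k *= 2
    return result


strings = ["alpha", "character", "sorting", "string",
           "algo", "compare", "prefix", "scale"]
print(list(zip(strings, approx_dist_prefixes(strings))))
```

For "alpha" and "algo" the sketch reports length 4 although 3 characters would suffice; overestimating by less than a factor of two is the price of doubling.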
