
Communication-Efficient String Sorting (Timo Bingmann, Peter Sanders, Matthias Schimek)



  1. Communication-Efficient String Sorting
     Timo Bingmann, Peter Sanders, Matthias Schimek · 2020-05-18 @ IPDPS’20
     Institute of Theoretical Informatics – Algorithmics
     example strings (each terminated by 0): s0 = Antidisestablishmentarianism, s1 = Floccinaucinihilipilification, s2 = Honorificabilitudinitatibus
     www.kit.edu, KIT – The Research University in the Helmholtz Association

  2. Abstract
     There has been surprisingly little work on algorithms for sorting strings on distributed-memory parallel machines. We develop efficient algorithms for this problem based on the multi-way merging principle. These algorithms inspect only characters that are needed to determine the sorting order. Moreover, communication volume is reduced by also communicating (roughly) only those characters and by communicating repetitions of the same prefixes only once. Experiments on up to 1280 cores reveal that these algorithms are often more than five times faster than previous algorithms.
     This document is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
     Timo Bingmann, Peter Sanders, Matthias Schimek – Communication-Efficient String Sorting, 2 / 14, Institute of Theoretical Informatics – Algorithmics, May 18th, 2020

  3. Why String Sorting?
     string: array of characters over alphabet Σ, terminated by 0 (e.g. the characters of "string" followed by 0) ⇒ multidimensional data
     sorted string set: sorted lexicographically ⇒ like in a dictionary
     characteristics of string sets: #strings n, #characters N, sum of distinguishing prefix lengths D
     example string set: s0 = algorithm, s1 = compare, s2 = comparison, s3 = prefix
     only published distributed string sorting algorithm: one paragraph in [Fischer and Kurpicz, ALENEX’19]
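The three characteristics n, N and D from the slide above can be computed for a small string set. This is a minimal sketch, not code from the talk: the function names are hypothetical, N is assumed to count characters without the terminating 0, and the distinguishing prefix of a string is assumed to be one character longer than its longest common prefix with any other string in the set.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i


def string_set_stats(strings):
    """Return (n, N, D) for a set of strings.

    Assumptions of this sketch: N excludes terminators, and the
    distinguishing prefix length of s is max LCP with any other string + 1.
    In sorted order the maximum LCP is always reached at a direct neighbour.
    """
    s = sorted(strings)
    n = len(s)
    N = sum(len(x) for x in s)
    D = 0
    for i, x in enumerate(s):
        left = lcp(s[i - 1], x) if i > 0 else 0
        right = lcp(x, s[i + 1]) if i + 1 < n else 0
        D += max(left, right) + 1
    return n, N, D


# Example set from the slide: only "compare" and "comparison" share a prefix.
print(string_set_stats(["algorithm", "compare", "comparison", "prefix"]))
# prints (4, 32, 16)
```

Here D = 16 is half of N = 32, which illustrates why algorithms that touch only distinguishing prefixes can inspect far fewer characters than the input contains.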

  4. String Sorting Toolbox
     Sequential Sorting: String Radix Sort, Multikey Quicksort, ... [Kärkkäinen et al., SPIRE’08], [Bentley and Sedgewick, SODA’97]; evaluation of many sequential algorithms in [Bingmann ’18]
     needed: string sorting + Longest Common Prefix (LCP) array computation
     example (LCP, string): (⊥, algorithm), (2, alpha), (5, alphabet), (0, character), (1, complete), (4, computer), (6, computing), (2, copy)
     Multiway Merging: LCP Losertree [Bingmann et al., Algorithmica’17]; exploit LCP values to save character comparisons
     figure: LCP-Merge of sorted (LCP, string) sequences containing the pairs (2, aab), (1, acb), (2, aac), (0, bca)
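How LCP values save character comparisons during merging can be sketched with a two-way merge. This is an illustrative simplification, assuming a binary merge instead of the multiway LCP losertree named on the slide; all function names are hypothetical, and the first LCP entry (⊥ on the slide) is represented as 0.

```python
def lcp(a, b, start=0):
    """LCP length of a and b, given that the first `start` characters are known to match."""
    i = start
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i


def lcp_array(sorted_strings):
    """LCP of each string with its predecessor; the first entry is 0 in this sketch."""
    return [0] + [lcp(a, b) for a, b in zip(sorted_strings, sorted_strings[1:])]


def lcp_merge(A, lcp_a, B, lcp_b):
    """Merge two sorted string lists with their LCP arrays; returns (merged, merged LCPs)."""
    out, out_lcp = [], []
    i = j = 0
    ha = hb = 0  # LCP of the current head of A / B with the last output string
    while i < len(A) and j < len(B):
        if ha == hb:
            # Only here are characters compared, starting after the known common prefix.
            k = lcp(A[i], B[j], ha)
            take_a = k == len(A[i]) or (k < len(B[j]) and A[i][k] < B[j][k])
        else:
            # The head sharing the longer prefix with the last output is smaller.
            k = min(ha, hb)
            take_a = ha > hb
        if take_a:
            out.append(A[i]); out_lcp.append(ha)
            hb = k  # LCP of B's head with the string just written
            i += 1
            ha = lcp_a[i] if i < len(A) else 0
        else:
            out.append(B[j]); out_lcp.append(hb)
            ha = k
            j += 1
            hb = lcp_b[j] if j < len(B) else 0
    rest, rest_lcp, h, idx = (A, lcp_a, ha, i) if i < len(A) else (B, lcp_b, hb, j)
    if idx < len(rest):
        out.append(rest[idx]); out_lcp.append(h)
        out.extend(rest[idx + 1:]); out_lcp.extend(rest_lcp[idx + 1:])
    return out, out_lcp


A = ["algorithm", "alpha", "computer", "copy"]
B = ["alphabet", "character", "complete", "computing"]
merged, merged_lcp = lcp_merge(A, lcp_array(A), B, lcp_array(B))
print(list(zip(merged_lcp, merged)))
```

The merged LCP array comes out as a by-product of the merge, so the output can be merged again or LCP-compressed without recomputing any prefix lengths.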

  5. String Sorting Toolbox: LCP Compression
     example (LCP, string ⇒ compressed string): (⊥, algorithm ⇒ algorithm), (2, alpha ⇒ pha), (5, alphabet ⇒ bet), (0, character ⇒ character), (1, complete ⇒ omplete), (4, computer ⇒ uter), (6, computing ⇒ ing), (2, copy ⇒ py)
     each longest common prefix is sent only once
     compression: iterate over strings + LCP array
     decompression: iterate over compressed strings + LCP array

  6. Distributed Merge String Sort (MS)
     Local Sorting (on every processor): String Radix Sort; new: String Radix Sort + LCP array
     Distributed Partitioning Algorithm
     String Exchange: no compression; new: LCP compression
     Merging (on every processor): plain losertree; new: LCP losertree

  7. Distributed Merge String Sort (MS): Partitioning
     regular (equidistant) sampling on every processor yields the sample sets
     Sorting of Sample Sets: gather + seq. sort; new: hypercube quicksort [Axtmann and Sanders, ALENEX’17]
     Final Splitter Selection: broadcast the p − 1 final splitters
     partitioning on every processor
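The four phases of distributed merge string sort (local sorting, partitioning, string exchange, merging) can be simulated in a single process. This sketch makes several assumptions that are not from the talk: processors are modelled as Python lists rather than MPI ranks, samples are gathered and sorted sequentially instead of with hypercube quicksort, `heapq.merge` stands in for the (LCP) losertree, and the function names and the sample count `v` are hypothetical.

```python
from bisect import bisect_right
from heapq import merge


def regular_samples(sorted_local, v):
    """Pick v equidistant samples from a locally sorted run."""
    n = len(sorted_local)
    return [sorted_local[(k * n) // (v + 1)] for k in range(1, v + 1)] if n else []


def distributed_merge_sort(pe_inputs, v=2):
    """Simulate MS on p processors; returns one sorted bucket per processor."""
    p = len(pe_inputs)
    # Phase 1: local sorting on every processor.
    local = [sorted(strings) for strings in pe_inputs]
    # Phase 2: sample, gather and sort the samples, select p - 1 splitters.
    samples = sorted(s for run in local for s in regular_samples(run, v))
    m = len(samples)
    splitters = [samples[(k * m) // p] for k in range(1, p)] if samples else [""] * (p - 1)
    # Phase 3: every processor cuts its run at the splitters and sends piece i to processor i.
    received = [[] for _ in range(p)]
    for run in local:
        bounds = [0] + [bisect_right(run, sp) for sp in splitters] + [len(run)]
        for dest in range(p):
            received[dest].append(run[bounds[dest]:bounds[dest + 1]])
    # Phase 4: every processor merges the sorted pieces it received.
    return [list(merge(*runs)) for runs in received]


buckets = distributed_merge_sort([
    ["sorting", "alpha", "string", "character"],
    ["algo", "compare", "prefix", "scale"],
    ["copy", "computer"],
])
for pe, bucket in enumerate(buckets):
    print(pe, bucket)
```

Concatenating the buckets in processor order gives the globally sorted sequence. How evenly the buckets are filled depends on the sampling, which is the topic of the next slide.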

  8. Partitioning – Sampling Approaches
     example strings in both panels (each terminated by 0): aaaaaaa, b, c, d, e, fffffff
     string-based sampling. Goal: equal number of strings per bucket; sampling of string array; provable upper bounds
     character-based sampling. Goal: equal number of characters per bucket; sampling of character array; provable upper bounds
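The difference between the two sampling approaches shows up on the slide's example strings. The sketch below is one plausible reading of "sampling of string array" versus "sampling of character array": the exact sample positions used by the authors are not given on the slide, so the index formulas, the function names and the parameter `v` are assumptions, and terminators are not counted as characters.

```python
from bisect import bisect_right
from itertools import accumulate


def string_based_samples(sorted_strings, v):
    """Pick v samples at equidistant positions of the string array."""
    n = len(sorted_strings)
    return [sorted_strings[(k * n) // (v + 1)] for k in range(1, v + 1)] if n else []


def character_based_samples(sorted_strings, v):
    """Pick the strings that cover v equidistant positions of the character array."""
    n = len(sorted_strings)
    if n == 0:
        return []
    ends = list(accumulate(len(s) for s in sorted_strings))  # ends[i] = characters in strings 0..i
    total = ends[-1]
    samples = []
    for k in range(1, v + 1):
        pos = (k * total) // (v + 1)             # target character position
        idx = min(bisect_right(ends, pos), n - 1)  # string containing that position
        samples.append(sorted_strings[idx])
    return samples


strings = ["aaaaaaa", "b", "c", "d", "e", "fffffff"]
print(string_based_samples(strings, 2))
# prints ['c', 'e']
print(character_based_samples(strings, 2))
# prints ['aaaaaaa', 'fffffff']
```

With string-based samples the buckets hold similar numbers of strings but very different numbers of characters; character-based samples move the splitters toward the long strings.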

  9. Prefix Doubling String Merge Sort (PDMS)
     example (strings terminated by 0): PE1: Antidisestablishmentarianism, PE2: Floccinaucinihilipilification, PE3: Honorificabilitudinitatibus
     same main structure as before
     use distributed Single-Shot Bloom Filter (dSBF) [Sanders et al., IEEE BigData’13] to approximate distinguishing prefixes with distributed duplicate detection
     only operate on those characters
     calculate only the permutation for sorting (exchanging further characters is optional)
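The prefix doubling idea can be sketched as follows: test prefixes of length 1, 2, 4, ... and stop for a string as soon as its current prefix is no longer shared with any other string. This sketch swaps the distributed Single-Shot Bloom Filter for exact counting with `collections.Counter` in one process, so it shows only the doubling loop, not the communication; the function name is hypothetical.

```python
from collections import Counter


def approx_distinguishing_prefixes(strings, start=1):
    """Upper bounds on the distinguishing prefix lengths, found by prefix doubling.

    Exact duplicate counting replaces the Bloom filter here (assumption of this
    sketch). A string that is exhausted before its prefix becomes unique gets
    len(s) + 1, i.e. the whole string including its terminator.
    """
    result = [0] * len(strings)
    active = list(range(len(strings)))
    length = start
    while active:
        counts = Counter(strings[i][:length] for i in active)
        still_active = []
        for i in active:
            s = strings[i]
            if counts[s[:length]] == 1 or length > len(s):
                result[i] = min(length, len(s) + 1)
            else:
                still_active.append(i)
        active = still_active
        length *= 2
    return result


words = ["alpha", "character", "sorting", "string",
         "algo", "compare", "prefix", "scale"]
print(approx_distinguishing_prefixes(words))
# prints [4, 2, 2, 2, 4, 2, 1, 2]
```

The result overestimates the true distinguishing prefix by less than a factor of two: "alpha" and "algo" differ at their third character but are resolved only at length 4.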

  10. Distinguishing Prefix Computation (figure, animated over several frames)
      setup: two columns of four strings each; left: s0 = alpha, s1 = character, s2 = sorting, s3 = string; right: s0 = algo, s1 = compare, s2 = prefix, s3 = scale; hash values h(s_i) lie on an axis from 0 to 19
      earlier frames, hash values: left: s0 = 2, s1 = 19, s2 = 7, s3 = 7; right: s0 = 2, s1 = 19, s2 = 13, s3 = 7
      earlier frames, messages: left: m1 := [2, 7], m2 := [19]; right: m1 := [2, 7], m2 := [13, 19]
      later frames, hash values: left: s0 = 5, s1 = 15, s2 = 7, s3 = 0; right: s0 = 5, s1 = 11, s2 = (no value), s3 = 0
      later frames, messages: left: m1 := [0, 5, 7], m2 := [15]; right: m1 := [0, 5], m2 := [11]
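The message lists m1 and m2 in the figure suggest that hash values are routed to a processor by value range and checked for duplicates there. The following single-process sketch is an interpretation of the figure, not the authors' implementation: the routing rule `value * p // m`, the local deduplication before sending, the CRC32 default hash and all function names are assumptions.

```python
import zlib
from collections import Counter


def possible_duplicates(pe_prefixes, m=1 << 20, hash_fn=None):
    """For every prefix on every processor, report whether it may be a duplicate.

    Hash values must lie in [0, m). False positives are possible (hash
    collisions), false negatives are not.
    """
    if hash_fn is None:
        hash_fn = lambda s: zlib.crc32(s.encode("utf-8")) % m
    p = len(pe_prefixes)
    hashes = [[hash_fn(s) for s in prefixes] for prefixes in pe_prefixes]
    # Step 1: every processor sends its distinct hash values, sorted,
    # to the processor that owns the matching slice of [0, m).
    received = [Counter() for _ in range(p)]
    for pe_hashes in hashes:
        for value in sorted(set(pe_hashes)):
            received[value * p // m][value] += 1
    # Step 2: owners report the values that arrived from more than one sender.
    global_dups = {v for counter in received for v, c in counter.items() if c > 1}
    # Step 3: a prefix is flagged if its hash repeats locally or globally.
    flags = []
    for pe_hashes in hashes:
        local = Counter(pe_hashes)
        flags.append([local[v] > 1 or v in global_dups for v in pe_hashes])
    return flags


# Hash values taken from the figure (range 0..19, two processors).
h = {"al": 5, "ch": 15, "so": 7, "st": 0, "co": 11, "sc": 0}
print(possible_duplicates([["al", "ch", "so", "st"], ["al", "co", "sc"]], m=20, hash_fn=h.get))
# prints [[True, False, False, True], [True, False, True]]
```

In this run "st" and "sc" are flagged only because both hash to 0. Such false positives cost one extra doubling round for the affected strings but do not affect correctness of the final order.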
