A Top-Down Parallel Semisort. Yan Gu, Julian Shun, Yihan Sun, Guy Blelloch (Carnegie Mellon University). PowerPoint presentation.



  1. A Top-Down Parallel Semisort. Yan Gu, Julian Shun, Yihan Sun, Guy Blelloch (Carnegie Mellon University)

  2. What is semisort?
     key:   45 12 45 61 28 61 61 45 28 45
     value:  2  5  3  9  5  9  8  1  7  5
      Input: An array of records with associated keys
      Assume keys can be hashed to the range [n^k]
      Goal: All records with equal keys should be adjacent

  3. What is semisort?
     key:   12 61 61 61 45 45 45 45 28 28
     value:  5  8  9  9  2  5  1  3  7  5
      Input: An array of records with associated keys
      Assume keys can be hashed to the range [n^k]
      Goal: All records with equal keys should be adjacent

  4. What is semisort?
     key:   45 45 45 45 12 61 61 61 28 28
     value:  2  5  1  3  5  8  9  9  7  5
      Input: An array of records with associated keys
      Assume keys can be hashed to the range [n^k]
      Goal: All records with equal keys should be adjacent
      Different keys are not necessarily sorted
      Records with equal keys do not need to be sorted by their values

  5. What is semisort?
     key:   45 45 45 45 12 61 61 61 28 28
     value:  1  5  3  2  5  8  9  9  7  5
      Input: An array of records with associated keys
      Assume keys can be hashed to the range [n^k]
      Goal: All records with equal keys should be adjacent
      Different keys are not necessarily sorted
      Records with equal keys do not need to be sorted by their values
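The grouping contract above can be captured by a short sequential sketch (a hash-map grouping used only as a specification, not the parallel algorithm from this talk; all names are illustrative):

```python
from collections import defaultdict

def semisort(records):
    """Group (key, value) records so that equal keys become adjacent.

    Neither the order of distinct keys nor the order of values within
    one key is specified -- exactly the semisort contract above.
    """
    buckets = defaultdict(list)
    for key, value in records:
        buckets[key].append(value)
    # Emit each key's values as one contiguous run.
    return [(k, v) for k, values in buckets.items() for v in values]
```

On the slide's input, the four 45-records, three 61-records, and two 28-records each end up contiguous, matching any of the valid outputs shown above.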

  6. Why is parallel semisort important?
      Simulating the PRAM model – concurrent writes [Valiant 1990]
      Key: memory address; Value: the write operation
     Concurrent writes (by thread):
       1: a[3]=71   2: a[1]=99   3: a[2]=19   4: a[3]=10
       5: a[5]=50   6: a[3]=12   7: a[1]=16
     Semisorted by address, one result per address:
       a[3]: t4=10, t1=71, t6=12   ->  result a[3]=71
       a[5]: t5=50                 ->  result a[5]=50
       a[1]: t7=16, t2=99          ->  result a[1]=99
       a[2]: t3=19                 ->  result a[2]=19
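The table above can be reproduced by grouping writes on their address; a minimal sketch, assuming a priority-CRCW rule where the lowest thread id wins (that resolution rule is an illustrative assumption, though it is consistent with the slide's results):

```python
from collections import defaultdict

def resolve_concurrent_writes(writes):
    """Simulate concurrent writes: group by address, pick one winner.

    `writes` is a list of (thread_id, address, value) tuples.  The
    grouping step is a semisort on the address; the winner per
    address here is the write from the lowest thread id.
    """
    by_addr = defaultdict(list)
    for tid, addr, val in writes:
        by_addr[addr].append((tid, val))
    # min() on (tid, val) pairs selects the lowest thread id.
    return {addr: min(ops)[1] for addr, ops in by_addr.items()}
```

With the slide's seven writes this yields a[3]=71, a[1]=99, a[2]=19, a[5]=50, matching the result column.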

  7. Why is parallel semisort important?
      The map-(semisort-)reduce paradigm: the shuffle step between map and reduce is a semisort

  8. Why is parallel semisort important?
      The map-(semisort-)reduce paradigm
      Generate the adjacency array for a graph (figure: a seven-vertex graph)
     Edge list:            (3,5) (1,7) (2,3) (3,6) (5,4) (3,7) (1,6)
     Semisorted edge list: (3,5) (3,7) (3,6) (5,4) (1,6) (1,7) (2,3)
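Grouping edges by their source endpoint is itself a semisort; a minimal sequential sketch (the dict stands in for the grouped buckets):

```python
def adjacency_array(edges):
    """Build adjacency lists by grouping directed edges on the source.

    A semisort on the source endpoint makes all of a vertex's edges
    adjacent; the order within one vertex's list is unspecified.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    return adjacency
```

On the slide's edge list, vertex 3's neighbors 5, 6, 7 end up in one contiguous group, as in the semisorted edge list.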

  9. Why is parallel semisort important?
      The map-(semisort-)reduce paradigm
      Generate the adjacency array for a graph
      Other applications:
      In databases, the relational join operation
      Gather words that differ by a deletion in an edit-distance application
      Collect shared edges based on their endpoints in Delaunay triangulation
      Etc.

  10. Attempts – Sequentially: hash table with open addressing
     (figure: keys 37, 58, 92 hashed into a table, each entry pointing to a linked list of that key's values)
      Problem: Maintaining linked lists in parallel can be hard

  11. Attempts – Sequentially: pre-allocated arrays
     (figure: keys 37, 58, 92 hashed into a table, each entry pointing to a pre-allocated array of that key's values)

  12. Attempts – Parallelized: pre-allocated arrays
     (figure: multiple threads write (key, value) records into per-key pre-allocated arrays, e.g. for keys 58 and 92)
      Problem: Need to pre-count the number of occurrences of each key

  13. Attempts – In parallel
      Comparison-based sort
      O(n log n) work
      Not work-efficient ☹
      Radix sort (probably the best work-efficient option previously)
      O(n^ε) depth
      Not highly parallel ☹

  14. Attempts – In parallel
      R&R integer sort [Rajasekaran and Reif 1989]: sorts n records with keys in the range [n] in O(n) work and O(log n) depth
      Linear work and logarithmic depth
      Must map keys to the range [n]
      Too much global data movement – practically inefficient
      Hashing and packing – 1 pass
      Random radix sort – 1 pass
      Deterministic radix sort – 2 passes ☹

  15. How to design an efficient semisort?  Theoretically efficient:  Linear work  Logarithmic depth  Practically efficient:  Less data communication  Cache-friendly  Space efficient:  Linear space

  16. Our Top-Down Parallel Semisort Algorithm

  17. Key insight: estimate key counts from samples
      Once the count of each key is known, we can pre-allocate an array for each key
      The exact counts are hard to compute in parallel – instead, estimate upper bounds by sampling
      Keys appearing many times: we can make reasonable estimates from the sample
      Keys with few samples: hard to estimate precisely
      Solution: treat “heavy” keys and “light” keys differently

  18. Our parallel semisort algorithm
     1. Select a sample S of keys and sort it
      Sampling rate Θ(1/log n)
     2. Partition S into heavy keys and light keys
      Heavy: appears Ω(log n) times in the sample; each is assigned an individual bucket
      Light: appears O(log n) times in the sample; we evenly partition the hash range into n/log² n buckets for them
     3. Scatter each record into its associated bucket
      The only global data communication
     4. Semisort the light-key buckets
      Performed locally
     5. Pack and output
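Steps 1–2 above can be sketched sequentially as follows (the heavy-key threshold constant, the log base, and the use of `random.Random` are illustrative assumptions; the real algorithm performs these steps in parallel):

```python
import math
import random
from collections import Counter

def classify_keys(keys, c=1.0, seed=0):
    """Phases 1-2 sketch: sample keys and split them into heavy/light.

    Sampling rate is ~1/log n.  A key is 'heavy' if it appears at
    least c*log n times in the sample (threshold constant c is an
    assumption).  Light keys will share n/log^2 n hash-range buckets.
    """
    n = len(keys)
    p = 1.0 / math.log2(n)          # sampling rate Theta(1/log n)
    rng = random.Random(seed)
    # Sampling plus counting stands in for "sort the sample S".
    sample_counts = Counter(k for k in keys if rng.random() < p)
    threshold = c * math.log2(n)
    heavy = {k for k, s in sample_counts.items() if s >= threshold}
    num_light_buckets = max(1, n // int(math.log2(n) ** 2))
    return heavy, sample_counts, num_light_buckets
```

On a skewed input where one key dominates, that key is sampled often and classified heavy with high probability, while rare keys fall into the shared light buckets.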

  19. Heavy vs. light – why?
      [Rajasekaran and Reif 1989] If the records are sampled with probability p = 1/log n, and a key i appears c_i times in the original array and s_i times in the sample, then:
      s_i = Ω(log n) implies c_i = Θ(s_i log n) w.h.p.
      s_i = O(log n) implies c_i = O(log² n) w.h.p.
     (Can be proved using Chernoff bounds)
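The two claims follow from a multiplicative Chernoff bound applied to the binomial sample count; a sketch of the argument, with illustrative constants:

```latex
% s_i \sim \mathrm{Bin}(c_i, p) with p = 1/\log n, so
% \mathbb{E}[s_i] = \mu_i = c_i / \log n.  Chernoff gives
\Pr\big[\,|s_i - \mu_i| \ge \delta \mu_i\,\big] \le 2 e^{-\delta^2 \mu_i / 3}
% If c_i = \Omega(\log^2 n), then \mu_i = \Omega(\log n) and the bound is
% polynomially small for constant \delta, so s_i = \Theta(\mu_i) w.h.p.,
% i.e. c_i = \Theta(s_i \log n).
% Conversely, if c_i = O(\log^2 n), then s_i = O(\log n) w.h.p.
```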

  20. Estimate upper bounds for the counts c_i
      Key insight: if the records are sampled with probability p = 1/log n, and key i has:
      s_i = Ω(log n) samples, then c_i = Θ(s_i log n) w.h.p.
      s_i = O(log n) samples, then c_i = O(log² n) w.h.p.
      Estimate: u_i = c′ · max(log² n, s_i log n)
      where c′ is a sufficiently large constant that provides the high-probability bound
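The per-key estimate on this slide is simple to state in code (a sketch: the constant c′ only needs to be "sufficiently large", so the default below and the log base are arbitrary illustrative choices):

```python
import math

def estimated_upper_bound(s_i, n, c_prime=2.0):
    """Upper-bound estimate u_i = c' * max(log^2 n, s_i * log n).

    Heavy keys (many samples) use the s_i*log n branch; light keys
    (few or zero samples) fall back to the log^2 n floor.
    """
    log_n = math.log2(n)  # base is irrelevant asymptotically
    return c_prime * max(log_n ** 2, s_i * log_n)
```

For n = 1024 (log n = 10), a key with no samples still gets c′·100 slots, while a key with 50 samples gets c′·500.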

  21. Estimate upper bounds for the counts c_i
      Key insight: if the records are sampled with probability p = 1/log n, and key i has:
      s_i = Ω(log n) samples, then c_i = Θ(s_i log n) w.h.p.
      s_i = O(log n) samples, then c_i = O(log² n) w.h.p.
      Extreme case: all samples come from a single key i
      that key has s_i = n/log n, so u_i = O(n)
      every other key has s_j = 0, so u_j = O(log² n)
      For the total estimated space to stay linear, the keys must lie in the range [n/log² n]
      Solution: combine light keys – evenly partition the hash range into n/log² n intervals as buckets

  22. Phase 1: sampling and sorting
     1. Select a sample S of keys, taking each key with probability p = Θ(1/log n)
     2. Sort S (and count key occurrences)
     (figure: the sample S sorted into 5 5 5 8 8 8 8 8 11 17 17, followed by counting)

  23. Phase 2: array construction
     Sorted samples: 5 5 5 8 8 8 8 8 11 17 17 …
     Counting & filtering splits the keys:
      Light keys: share buckets covering even intervals of the hash range (0–15, 16–31, …)
      Heavy keys: each gets an individual bucket

  24. Phase 2: array construction
     Heavy keys:
       keys:         k_1, k_2, k_3, …
       # samples:    s_1, s_2, s_3, …
       array length: f(s_1), f(s_2), f(s_3), …
     Light keys:
       keys:         k′_1, k′_2, …, k′_9, …
       # samples:    s′_1, s′_2, …, s′_9, …
       array length per shared bucket: f(s′_1 + s′_2), f(s′_3 + ⋯ + s′_6), f(s′_7 + s′_8 + s′_9), …

  25. Phase 3: scattering
     (figure: records are scattered into the light-key and heavy-key buckets)
      Conflict! Multiple records may target the same bucket concurrently

  26. Phase 4: local sort
     (figure: each light-key bucket is semisorted locally)
     Phase 5: packing
     (figure: the buckets are compacted into the output array)

  27. Size estimation for arrays – high probability
      Now consider an array that has s samples. We define the size-estimation function
        f(s) = ( s + c′ ln n + √( c′² ln² n + 2 s c′ ln n ) ) / p
      where p = Θ(1/log n) is the sampling probability and c′ is a constant, to be an upper bound on the size of the array
      Lemma 1: If an array has s samples, the probability that its number of records exceeds f(s) is at most n^(−c′)
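For concreteness, the size-estimation function and its 1/p inflation can be sketched as below (the shape of f comes from inverting a Chernoff bound; the exact constants and the choice p = 1/ln n are assumptions of this sketch):

```python
import math

def size_estimate(s, n, c_prime=2.0):
    """Size-estimation function f(s) for a bucket with s samples.

    f(s) = (s + c' ln n + sqrt(c'^2 ln^2 n + 2 s c' ln n)) / p,
    with sampling probability p = 1/ln n.  Since the extra terms are
    nonnegative, f(s) >= s/p: it always dominates the naive estimate
    of scaling the sample count up by 1/p.
    """
    p = 1.0 / math.log(n)
    cl = c_prime * math.log(n)
    return (s + cl + math.sqrt(cl * cl + 2.0 * s * cl)) / p
```

Note that f(0) > 0: even a bucket with no samples is allocated some slack space, which is what makes the bound hold for light keys.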

  28. Size estimation for arrays – linear space in expectation
        f(s) = ( s + c′ ln n + √( c′² ln² n + 2 s c′ ln n ) ) / p
      Lemma 1: If an array has s samples, the probability that its number of records exceeds f(s) is at most n^(−c′)
      Corollary 1: The probability that f gives an upper bound for all buckets is at least 1 − n^(1−c′)/log² n
      Lemma 2: Σ_i f(s_i) = Θ(n) holds in expectation

  29. Comparison with R&R integer sort
      R&R algorithm:
      Preprocessing: hashing and packing – global data movement
      Three passes of bottom-up radix sort – global data movement
      Our parallel semisort:
      Sample and sort – on a small set
      Bucket construction – mostly local computation
      Scatter – the only global data communication
      Local sort – performed locally
      Pack – performed locally

  30. Experiments
