Algorithm Engineering (aka How to Write Fast Code)
CS260 - Lecture 6
Yan Gu
I/O Algorithms and Parallel Samplesort
• Review of Samplesort
• Semisort
• Course Policy
Sample-sort outline
Analogous to multiway quicksort
1. Split the input array into √n contiguous subarrays of size √n. Sort the subarrays recursively (sequentially).
2. Choose √n − 1 "good" pivots p_1 ≤ p_2 ≤ ⋯ ≤ p_{√n−1}.
3. Distribute the subarrays into buckets, according to the pivots. Each bucket has size ≈ √n.
   Bucket 1 ≤ p_1 ≤ Bucket 2 ≤ p_2 ≤ ⋯ ≤ p_{√n−1} ≤ Bucket √n
4. Recursively sort the buckets.
5. Copy the concatenated buckets back to the input array, which is now sorted.
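The five steps above can be sketched sequentially as follows. This is a minimal illustration, not the lecture's implementation: the names `sample_sort`, `cutoff`, and `seed` are hypothetical, and picking pivots from an oversampled sorted sample is an assumption (the slides only ask for "good" pivots). A real implementation would exploit the sorted subarrays by merging them against the pivots; the sketch places each element with a binary search instead.

```python
import bisect
import math
import random


def sample_sort(a, cutoff=16, seed=0):
    """Sequential sketch of the sample-sort outline; returns a new sorted list."""
    n = len(a)
    if n <= cutoff:
        return sorted(a)  # small inputs: fall back to the built-in sort
    size = math.isqrt(n)  # subarray size, about sqrt(n)

    # Step 1: split into contiguous subarrays and sort each one recursively.
    subarrays = [sample_sort(a[i:i + size], cutoff, seed)
                 for i in range(0, n, size)]

    # Step 2: choose size - 1 pivots from a sorted random sample.
    # Oversampling by about log n is an assumption made for this sketch.
    rng = random.Random(seed)
    over = max(1, n.bit_length())
    sample = sorted(rng.choice(a) for _ in range(size * over))
    pivots = [sample[i * over] for i in range(1, size)]

    # Step 3: distribute the elements of every subarray into buckets.
    buckets = [[] for _ in range(len(pivots) + 1)]
    for sub in subarrays:
        for x in sub:
            buckets[bisect.bisect_right(pivots, x)].append(x)

    # Steps 4 and 5: sort each bucket and concatenate the results.
    out = []
    for b in buckets:
        if len(b) == n:
            out.extend(sorted(b))  # degenerate split (e.g. all keys equal)
        else:
            out.extend(sample_sort(b, cutoff, seed))
    return out
```

In a parallel version, the recursive calls in steps 1 and 4 are independent of each other and can run concurrently.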
What is semisort?
• Input:
  • An array of records with associated keys
  • Assume keys can be hashed to the range [n^k]
• Goal:
  • All records with equal keys should be adjacent
  • Different keys are not necessarily sorted
  • Records with equal keys do not need to be sorted by their values

Example input:
key:   45 12 45 61 28 61 61 45 28 45
value:  2  5  3  9  5  9  8  1  7  5

Valid outputs (each one keeps equal keys adjacent):
key:   12 61 61 61 45 45 45 45 28 28
value:  5  8  9  9  2  5  1  3  7  5

key:   45 45 45 45 12 61 61 61 28 28
value:  2  5  1  3  5  8  9  9  7  5

key:   45 45 45 45 12 61 61 61 28 28
value:  1  5  3  2  5  8  9  9  7  5
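To make the goal concrete, here is a small checker for the semisort property, applied to the key/value arrays shown on the slides. The function name `is_semisorted` is hypothetical; the sketch only tests that equal keys are adjacent and places no requirement on the order of keys or values.

```python
def is_semisorted(records):
    """Return True if all records with equal keys are adjacent."""
    finished = set()   # keys whose run has already ended
    current = None     # key of the run currently being scanned
    started = False
    for key, _value in records:
        if started and key == current:
            continue               # still inside the same run
        if key in finished:
            return False           # key reappears after its run ended
        if started:
            finished.add(current)  # close the previous run
        current, started = key, True
    return True


# Arrays taken from the slides.
unsorted_input = list(zip([45, 12, 45, 61, 28, 61, 61, 45, 28, 45],
                          [2, 5, 3, 9, 5, 9, 8, 1, 7, 5]))
output_a = list(zip([12, 61, 61, 61, 45, 45, 45, 45, 28, 28],
                    [5, 8, 9, 9, 2, 5, 1, 3, 7, 5]))
output_b = list(zip([45, 45, 45, 45, 12, 61, 61, 61, 28, 28],
                    [1, 5, 3, 2, 5, 8, 9, 9, 7, 5]))
```

Both `output_a` and `output_b` pass the check even though their key order and value order differ, which is exactly the freedom semisort has over a full sort.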
Semisort is one of the most useful primitives in parallel algorithms
• Parallel In-Place Algorithms: Theory and Practice
• Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing
• Semi-Asymmetric Parallel Graph Algorithms for NVRAMs
• Efficient BVH Construction via Approximate Agglomerative Clustering
• Theoretically-Efficient and Practical Parallel DBSCAN
Why is semisort so useful? (albeit not seen before)
• Semisorting can be done by sorting, but it can be faster because it is less restrictive
• Theoretically, it can be done in O(n) work rather than O(n log n) work
• It can be used to implement counting sort / integer sort
  • Integer sort: given n key-value pairs with keys in the range [1, …, n], query the KV-pairs with a certain key
  • Counting sort: given n key-value pairs with keys in the range [1, …, n], query the number of KV-pairs with a certain key
• In the database community, this is called the GroupBy operator
[Figure: a table of keys (37, …, 58, …, 92, …), each pointing to a linked list of its values]
Attempts - Sequentially: Pre-allocated array
[Figure: a table of keys (37, …, 58, …, 92, …), each pointing to a pre-allocated array of its values]
• Problem: we need to pre-count the number of records for each key
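The pre-allocated-array attempt can be sketched as two passes over the input: one to count each key, and one to place the records. This is a sequential illustration under assumptions; the name `group_by_precount` and the returned `start` table are hypothetical and do not come from the slides.

```python
def group_by_precount(records):
    """Group (key, value) records into one array using a counting pass first."""
    # Pass 1: pre-count the number of records for each key.
    counts = {}
    for key, _value in records:
        counts[key] = counts.get(key, 0) + 1

    # Prefix sums over the counts give the start offset of each key's slice.
    start = {}
    total = 0
    for key, c in counts.items():
        start[key] = total
        total += c

    # Pass 2: write every record into the next free slot of its key's slice.
    out = [None] * len(records)
    cursor = dict(start)
    for key, value in records:
        out[cursor[key]] = (key, value)
        cursor[key] += 1
    return out, start
```

The extra counting pass is the cost the slide points out: the array for each key cannot be allocated until its size is known.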
Another use case for semisort
• Generate the adjacency array for a graph

Edge list    Sorted edge list
(3,5)        (3,5)
(1,7)        (3,7)
(2,3)        (3,6)
(3,6)        (5,4)
(5,4)        (1,6)
(3,7)        (1,7)
(1,6)        (2,3)

[Figure: the corresponding graph on vertices 1 to 7]
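As a rough sketch of this use case, the code below groups the slide's edge list by source vertex and lays the neighbors out in one contiguous array. A dictionary stands in for the semisort step, which is an assumption made to keep the example short; `adjacency_array`, `offsets`, and `neighbors` are hypothetical names.

```python
def adjacency_array(edges):
    """Build (offsets, neighbors) from a list of (source, target) edges."""
    # Stand-in for semisort: bring edges with the same source together.
    groups = {}
    for u, v in edges:
        groups.setdefault(u, []).append(v)

    # Concatenate the groups; offsets[u] is the half-open slice of u's neighbors.
    offsets = {}
    neighbors = []
    for u, targets in groups.items():
        offsets[u] = (len(neighbors), len(neighbors) + len(targets))
        neighbors.extend(targets)
    return offsets, neighbors


edges = [(3, 5), (1, 7), (2, 3), (3, 6), (5, 4), (3, 7), (1, 6)]
offsets, neighbors = adjacency_array(edges)
lo, hi = offsets[3]
print(neighbors[lo:hi])  # neighbors of vertex 3
```

Only the grouping matters here: the sources do not need to appear in increasing order, which is why semisort is enough.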
Why is semisort hard?
key:   45 45 45 45 12 61 61 61 28 28
value:  1  5  3  2  5  8  9  9  7  5
• There can be many duplicate keys
  • Heavy keys
• Or, there can be almost no duplicate keys
  • Light keys
Implement integer sort using semisort
key:   45 45 45 45 12 61 61 61 28 28
value:  1  5  3  2  5  8  9  9  7  5
• Input: n KV-pairs with keys in [n]
• Step 1: hash the keys (i.e., for each (k_i, v_i), generate h_i = hash(k_i))
• Step 2: semisort the pairs (h_i, (k_i, v_i)), and resolve conflicts
• Step 3: get the pointer for each key k_i
The Top-Down Parallel Semisort Algorithm
The main goal: estimate key counts
• And tell the heavy keys from the light ones. How? Sampling!
• A key that appears more than n/l times is called a heavy key
• Otherwise, it is called a light key
• We can treat the two kinds separately
The algorithm
• Take l log n samples and sort them
• Keys with more than log n appearances in the sample are marked as heavy keys; the others are light keys
• Each heavy key gets its own bucket, and another l buckets are used for the light keys, each corresponding to a range of size n^k / l
  • The input keys are hashed into [n^k]
• In total we have no more than 2l buckets
• The rest of the algorithm is very similar to samplesort
Phase 1: Sampling and sorting
1. Select a sample set S of l log n keys
2. Sort S
[Figure: the input is sampled into S, and S is sorted (for counting), e.g. 5 5 5 8 8 8 8 8 11 17 17 17 ……]
Phase 2: Bucket Construction
Sorted samples: 5 5 5 8 8 8 8 8 11 17 17 17 ……
[Figure: counting and filtering the sorted samples yields the heavy keys (e.g. 8, 20, 65, …) and the light keys, grouped by range (0-15: 5, 11; 16-31: 17, 21, 26, 31; ...)]
At the end of Phase 2
• In total we have no more than 2l buckets
  • l of them are for light keys
• Then we construct a hash table for the heavy keys
• Now we know which bucket each KV-pair (k_i, v_i) goes to:
  • If k_i is found in the hash table, assign it to the associated heavy bucket
  • Otherwise, it goes to the light bucket based on the range of k_i
• The rest of the algorithm is almost identical to samplesort
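Phases 1 and 2 can be sketched as below, assuming the keys are already hashed to integers in `[0, hash_range)` and the input is non-empty. Everything named here (`build_buckets`, `num_light`, `hash_range`, `bucket_of`) is hypothetical, and the exact sample size and threshold are assumptions following the description above: about `num_light * log n` samples, and a key is heavy when it appears more than `log n` times in the sample.

```python
import math
import random
from itertools import groupby


def build_buckets(hashed_keys, num_light, hash_range, seed=0):
    """Classify keys as heavy or light from a sample; return the bucket mapping."""
    n = len(hashed_keys)
    log_n = max(1, math.ceil(math.log2(n)))

    # Phase 1: sample about num_light * log n keys and sort the sample.
    rng = random.Random(seed)
    sample = sorted(rng.choice(hashed_keys) for _ in range(num_light * log_n))

    # Phase 2: count runs of equal keys; frequent keys become heavy keys.
    heavy = {}  # hash table: heavy key -> index of its own bucket
    for key, run in groupby(sample):
        if sum(1 for _ in run) > log_n:
            heavy[key] = len(heavy)

    # Light buckets split the hash range into num_light equal-width ranges.
    width = math.ceil(hash_range / num_light)

    def bucket_of(h):
        if h in heavy:
            return heavy[h]              # heavy key: dedicated bucket
        return len(heavy) + h // width   # light key: bucket of its range

    return heavy, bucket_of
```

Because every heavy key needs more than `log n` of the `num_light * log n` samples, there are fewer than `num_light` heavy buckets, so the total stays below `2 * num_light`.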
Sample-sort outline
Analogous to multiway quicksort
1. Split the input array into n/l contiguous subarrays of size ≈ l. Sort the subarrays recursively (sequentially).
2. Distribute the subarrays into buckets.
3. Recursively sort the buckets (only for the light buckets).
4. Copy the concatenated buckets back to the input array.
Difference 2: subarrays are not sorted
• For simplicity, assume n = 16, with a 16-element input array
• First, get the count for each subarray in each bucket
• Then, transpose the array and scan to compute the offsets
• Lastly, move each element to the corresponding bucket
[Figure: the 16-element input array, the per-subarray bucket counts, the transposed counts with their prefix-sum offsets, and the output array as it is filled in]
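The count, transpose, scan, and move steps can be sketched as follows. The input array in the usage example is made up for illustration (the slide's own numbers are not reproduced here), and `distribute`, `bucket_of`, `num_buckets`, and `sub_size` are hypothetical names. The sketch runs sequentially; the loops over subarrays are the parts a parallel version would run concurrently.

```python
def distribute(a, bucket_of, num_buckets, sub_size):
    """Move every element of `a` into its bucket using per-subarray counts."""
    subarrays = [a[i:i + sub_size] for i in range(0, len(a), sub_size)]

    # Step 1: counts[s][b] = number of elements of subarray s in bucket b.
    counts = [[0] * num_buckets for _ in subarrays]
    for s, sub in enumerate(subarrays):
        for x in sub:
            counts[s][bucket_of(x)] += 1

    # Step 2: visit the counts in transposed (bucket-major) order and take an
    # exclusive prefix sum, so offsets[s][b] is where subarray s starts
    # writing inside bucket b.
    offsets = [[0] * num_buckets for _ in subarrays]
    total = 0
    for b in range(num_buckets):
        for s in range(len(subarrays)):
            offsets[s][b] = total
            total += counts[s][b]

    # Step 3: each subarray writes its elements to its own disjoint slots.
    out = [None] * len(a)
    for s, sub in enumerate(subarrays):
        for x in sub:
            b = bucket_of(x)
            out[offsets[s][b]] = x
            offsets[s][b] += 1
    return out


# Hypothetical 16-element input, 4 buckets of width 4, subarrays of size 4.
a = [13, 2, 7, 11, 0, 15, 5, 9, 3, 14, 8, 1, 6, 12, 10, 4]
out = distribute(a, lambda x: x // 4, num_buckets=4, sub_size=4)
```

Since every (subarray, bucket) pair owns a disjoint range of the output, the final step needs no locks or atomic operations when the subarrays are processed in parallel.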