High-Throughput Sorting by Dynamically Merging Multiple Hardware Sequential Sorters


  1. High-Throughput Sorting by Dynamically Merging Multiple Hardware Sequential Sorters. Wei Song, 03/04/2014, Advanced Processor Technologies Group, The School of Computer Science

  2. Motivation
     • Hardware sorters are important.
     • Parallel sorters have a size limit.
       – Sorting N numbers needs a network of size $N \log^2(N)$.
     • Sequential sorters have a throughput limit.
       – Sorting throughput is limited to 1 number per cycle.
     • Is there a way to sort N (N > 1M) numbers with a throughput larger than 1 number per cycle?

  3. Content
     • Review of existing sorters
       – Parallel sorters
       – Sequential sorters
     • Parallel merge-tree sorter
       – Key ideas
       – Hardware structure
       – Performance

  4. Parallel Sorters (Bitonic Sorting Network)
     [Figure: an 8-input bitonic sorting network BN(8) with inputs I0 to I7, outputs O0 to O7 and stages S0 to S5, built from BN(4), BN(2), BM(8) and BM(4) blocks. Each comparator takes A and B and produces max{A, B} and min{A, B}. The worked example sorts the values 12, 89, 53, 9, 30, 79, 62 and 17.]

  5. Parallel Sorters (Bitonic Sorting Network)
     [Figure: BN(8) with stages S0 to S5, decomposed into Bitonic Networks (BN) and Bitonic Mergers (BM); BM(8) is drawn as BM(4) and BM(2) blocks.]
     Data Set Size: P
     Throughput: P
     Size (comparators): $P \log^2(P)$
     Delay: $\log^2(P)$
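The structure on the two slides above can be sketched in software. The following is a minimal model, assuming the common iterative formulation of a bitonic sorting network for a power-of-two input count; the function name and the counters are illustrative, and the comparator wiring in the slide's figure may differ. It sorts the eight values that appear in the slide's example and counts comparators and stages.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length list with a bitonic network.

    Returns (sorted_list, comparator_count, stage_count).
    Illustrative software model, not the hardware from the slides.
    """
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    comparators = 0
    stages = 0
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance between compared elements
        while j >= 1:
            stages += 1  # every comparator in this pass could run in parallel
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    comparators += 1
                    ascending = (i & k) == 0
                    # swap when the pair is out of order for its direction
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a, comparators, stages


result, comparators, stages = bitonic_sort([12, 89, 53, 9, 30, 79, 62, 17])
print(result, comparators, stages)
```

For P = 8 this model uses six stages, matching S0 to S5 on the slide, with P/2 comparators per stage, which is consistent with the order of growth given for size and delay.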

  6. Sequential Sorters (Insertion Sorter)
     [Figure: a chain of comparison cells, Cell 0 to Cell N-1, between data_in and data_out. The example feeds in 3, 12, 7, 1, 9, 20, one number per cycle, and ends with 1 3 7 9 12 20.]
     Data Set Size: N
     Throughput: 1
     Size (cells): N
     Delay: N
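The insertion sorter can be modelled in a few lines. This is a sketch under the assumption that each cell keeps the smaller of its stored number and the incoming number and passes the larger one to the next cell; in hardware the cells would work in parallel, whereas this model walks the chain sequentially. The function name is illustrative.

```python
def insertion_sorter(stream):
    """Model a chain of cells; cell i ends up holding the i-th smallest number."""
    cells = []
    for x in stream:  # one number enters per modelled cycle
        carry = x
        for i in range(len(cells)):
            if carry < cells[i]:
                # the cell keeps the smaller value and forwards the larger one
                cells[i], carry = carry, cells[i]
        cells.append(carry)  # the first empty cell stores what is left
    return cells


print(insertion_sorter([3, 12, 7, 1, 9, 20]))
```

The input order used here is the one shown in the slide's example.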

  7. Sequential Sorter (FIFO-merge)
     [Figure: merge stages S2, S1 and S0 with FIFOs of depth N/8, N/4 and N/2; each stage merges inputs I0 and I1 into output O, shown with a worked example of merging sorted runs.]
     Data Set Size: N
     Throughput: 1
     Size (memory): 2N
     Delay: N
     D. Koch and J. Torresen, "FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting," in Proc. of FPGA, February 2011, pp. 45–54.
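A FIFO-merge sorter can be pictured as a cascade of two-input merge stages whose run length doubles at each stage. The sketch below is a software model under that assumption; the function names and the sample values are illustrative, and it does not reproduce the FIFO sizing or the run-time reconfiguration of the cited design.

```python
from collections import deque


def merge_stage(stream, run_len):
    """Merge consecutive sorted runs of length run_len pairwise."""
    out = []
    for start in range(0, len(stream), 2 * run_len):
        fifo0 = deque(stream[start:start + run_len])
        fifo1 = deque(stream[start + run_len:start + 2 * run_len])
        while fifo0 or fifo1:
            # one comparison and one output per modelled cycle
            if not fifo1 or (fifo0 and fifo0[0] <= fifo1[0]):
                out.append(fifo0.popleft())
            else:
                out.append(fifo1.popleft())
    return out


def fifo_merge_sort(values):
    """Sort by passing the data through stages with doubling run length."""
    data = list(values)
    run_len = 1
    while run_len < len(data):
        data = merge_stage(data, run_len)
        run_len *= 2
    return data


print(merge_stage([5, 12, 16, 19, 4, 9, 10, 22], 4))
print(fifo_merge_sort([22, 5, 19, 12, 16, 4, 10, 9]))
```

Each stage emits one number per step, which is the throughput limit the slide lists for this class of sorter.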

  8. Summary of Existing Sorters
     • Parallel sorters
       – High throughput
       – Area increases significantly with the quantity of data
       – Suited to sorting a small quantity of numbers
     • Sequential sorters
       – Linear area overhead
       – Feasible for large data sets
       – Low throughput

  9. Can we dynamically merge multiple sequential sorters?

  10. Parallel Merging
      Merge multiple sequential sorters using a Bitonic network? YES
      Sequential sorter outputs (one row per sorter):
      3 5 15 22 28 34
      1 9 10 20 24 30
      4 7 15 17 26 29
      5 6 13 24 28 37
      Bitonic network outputs (one row per output port):
      1 5 10 17 24 29
      3 6 13 20 26 30
      4 7 15 22 28 34
      5 9 15 24 28 37

  11. Parallel Merging
      Merge multiple sequential sorters using a Bitonic network? Not in general: NO!
      Sequential sorter outputs (one row per sorter):
      5 9 10 22 30 34
      1 3 15 20 24 28
      5 6 7 13 26 28
      4 15 17 24 29 37
      Bitonic network outputs (one row per output port):
      1 3 7 13 24 28
      4 6 10 20 26 28
      5 9 15 22 29 34
      5 15 17 24 30 37
      Numbers may not be distributed evenly among sequences.
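The YES and NO cases can be reproduced with a small model. The sketch assumes the network simply sorts the head of every sequence each cycle and emits all of them; Python's `sorted` stands in for the Bitonic network, and the function names are illustrative. The sequences are the ones shown on the slides.

```python
def columnwise_merge(sequences):
    """Each modelled cycle, sort the current head of every sequence and emit them."""
    out = []
    for column in zip(*sequences):
        out.extend(sorted(column))  # stand-in for an S-input sorting network
    return out


def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))


even = [
    [3, 5, 15, 22, 28, 34],
    [1, 9, 10, 20, 24, 30],
    [4, 7, 15, 17, 26, 29],
    [5, 6, 13, 24, 28, 37],
]
uneven = [
    [5, 9, 10, 22, 30, 34],
    [1, 3, 15, 20, 24, 28],
    [5, 6, 7, 13, 26, 28],
    [4, 15, 17, 24, 29, 37],
]

print(is_sorted(columnwise_merge(even)))    # True
print(is_sorted(columnwise_merge(uneven)))  # False
```

In the second case the first cycle emits 1 4 5 5 and the next cycle starts with 3, so the output stream is no longer in order.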

  12. Parallel Merging
      Increase the comparing window.
      [Figure: two sequential sorters holding 5 9 10 22 30 34 and 1 3 15 20 24 28.]

  13. Parallel Merging
      Increase the comparing window. Return unselected numbers.
      [Figure: a comparing window over two sequential sorters holding 5 9 10 22 30 34 and 1 3 15 20 24 28, stepped cycle by cycle; the selected numbers are output and the unselected numbers are returned to the sorters.]
      YES!

  14. Parallel Merging
      Increase the comparing window. Return unselected numbers.
      Requirement: to merge S pre-sorted sequences at a speed of S numbers per cycle,
      1. Increase the comparing window to S x S;
      2. Use an (S x S)-input Bitonic sorting network; [Area overhead]
      3. Return the S x (S - 1) unselected numbers; [Control overhead]
      4. Return the unselected numbers in one cycle; [Slow clock]
      5. Support a maximal shifting rate of S numbers per cycle. [Speed mismatch]
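The requirement list above can be sketched as a software model. It assumes that each cycle takes up to S numbers from the head of each of the S sequences, sorts the window, outputs the S smallest and pushes the rest back onto the sequences they came from; `list.sort` stands in for the Bitonic sorting network, and the function name is illustrative.

```python
from collections import deque


def window_merge(sequences):
    """Merge S sorted sequences, emitting up to S numbers per modelled cycle."""
    s = len(sequences)
    fifos = [deque(seq) for seq in sequences]
    out = []
    while any(fifos):
        # build the comparing window: up to S numbers from each sequence
        window = []
        for src, fifo in enumerate(fifos):
            for _ in range(min(s, len(fifo))):
                window.append((fifo.popleft(), src))
        window.sort()  # stand-in for the (S x S)-input sorting network
        out.extend(value for value, _ in window[:s])
        # return unselected numbers, largest first, so that appendleft
        # restores ascending order at the head of each sequence
        for value, src in reversed(window[s:]):
            fifos[src].appendleft(value)
    return out


print(window_merge([[5, 9, 10, 22, 30, 34], [1, 3, 15, 20, 24, 28]]))
```

The S smallest remaining numbers always lie within the first S entries of the sequences, which is why a window of S x S is enough to produce S correct outputs per cycle regardless of how the numbers are distributed.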

  15. Parallel Merging
      [Figure: four sequential sorters holding 5 9 10 22 30 34, 1 3 15 20 24 28, 5 6 7 13 26 28 and 4 15 17 24 29 37, feeding a single merging network.]

  16. Optimising the Parallel Merging
      Using a tree structure reduces the number of comparators by > 50%.
      [Figure: the single merging network over four sequential sorters, compared with a tree of mergers over the same four sorters.]
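A rough count shows how a saving of this order could arise. The sketch rests on assumptions that are not taken from the slides: every node is a full bitonic sorting network with (P/2) x log2(P) x (log2(P) + 1) / 2 comparators, a node that merges k sequences at m numbers per cycle needs a k x m window, and the tree consists of two 2-input nodes feeding one 2-input root. The function names are illustrative.

```python
from math import log2


def bitonic_comparators(p):
    """Comparators in a p-input bitonic sorting network (p a power of two)."""
    levels = int(log2(p))
    stages = levels * (levels + 1) // 2
    return (p // 2) * stages


def node_comparators(num_inputs, rate):
    """Assumed window size: num_inputs sequences times rate numbers per cycle."""
    return bitonic_comparators(num_inputs * rate)


# flat structure: one node merging 4 sequences at 4 numbers per cycle
flat = node_comparators(4, 4)
# tree: two nodes merging 2 sequences at 2 per cycle, one root merging 2 at 4 per cycle
tree = 2 * node_comparators(2, 2) + node_comparators(2, 4)
reduction = 1 - tree / flat

print(flat, tree, reduction)
```

Under these assumptions the flat structure needs 80 comparators and the tree 36, a reduction of 55%; the actual figures behind the slide's claim may be computed differently.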

  17. Optimising the Parallel Merging
      Replace the Bitonic sorting networks with Bitonic mergers because the sequences are pre-sorted.
      1. Reduce the comparing window.
      2. Reduce the size of the sorting networks.
      3. Reduce the numbers being returned.
      [Figure: the merger tree over four sequential sorters, before and after the change.]

  18. Bitonic Partial Merger
      [Figure: an 8-input merger with inputs I0 to I7 and outputs O0 to O7, beside a partial merger with inputs I0 to I7 and outputs O0 to O3 only.]
      A. Farmahini-Farahani, H. J. Duwe III, M. J. Schulte, and K. Compton, "Modular design of high-throughput, low-latency sorting units," IEEE Transactions on Computers, vol. 62, no. 7, pp. 1389–1402, July 2013.

  19. Optimising the Parallel Merging
      Single clock data return.
      [Figure: four sequential sorters merged by a tree of three mergers, each with its own control block.]

  20. Optimising the Parallel Merging
      Single clock data return.
      [Figure: the same merger tree annotated with data rates of 1 N/cyc, 2 N/cyc and 4 N/cyc.]
      The last issue: speed mismatch between inputs and outputs.

  21. Speed Mismatch: Using FIFOs and Allowing Stalls
      [Figure: the merger tree with four sequential sorters and three control blocks.]

  22. How Stalls Occur
      Even distribution has 0 stalls.
      Original sequences:
      16 4 20 6 10 12 18 2 14 8
      11 3 9 5 13 1 19 7 17 15
      Pre-sorted:
      2 4 6 8 10 12 14 16 18 20
      1 3 5 7 9 11 13 15 17 19
      0 stalls, R = 0%, α = 0
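One way to see why the even distribution is the stall-free case is to count how many numbers an ideal 2-per-cycle merge would draw from each input in every cycle. The sketch below does only that counting; it assumes each input can supply one number per cycle and that pressure to stall arises when a cycle needs more than that from one input. It is not a model of the FIFO or control logic, and it does not compute the slide's R or α, which are not defined in this excerpt. The function name is illustrative.

```python
def draws_per_cycle(seq_a, seq_b, rate=2):
    """For each modelled cycle, count how many outputs come from each input."""
    merged = sorted([(v, 0) for v in seq_a] + [(v, 1) for v in seq_b])
    draws = []
    for i in range(0, len(merged), rate):
        chunk = merged[i:i + rate]
        from_a = sum(1 for _, src in chunk if src == 0)
        from_b = sum(1 for _, src in chunk if src == 1)
        draws.append((from_a, from_b))
    return draws


# pre-sorted sequences from the slide: perfectly interleaved
even = draws_per_cycle(list(range(2, 21, 2)), list(range(1, 20, 2)))
# sequences from the earlier two-sorter example: unevenly distributed
uneven = draws_per_cycle([5, 9, 10, 22, 30, 34], [1, 3, 15, 20, 24, 28])

print(even)
print(uneven)
```

With the slide's pre-sorted sequences every cycle draws exactly one number from each input, whereas the uneven example starts by drawing two numbers from the same input.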
