FLiMS: Fast Lightweight Merge Sorter
2018 International Conference on Field-Programmable Technology (FPT)
Philippos Papaphilippou, Dept. of Computing, Imperial College London, London, United Kingdom, pp616@imperial.ac.uk
Chris Brooks, Science Innovation, dunnhumby, London, United Kingdom, Chris.Brooks@dunnhumby.com
Wayne Luk, Dept. of Computing, Imperial College London, London, United Kingdom, w.luk@imperial.ac.uk
12/12/2018 Philippos Papaphilippou

Novel merger design
● Task
 – Merge 2 sorted sequences in parallel
● Contributions
 – Highly-efficient parallel merger
   ● Half hardware resources of the state-of-the-art
   ● Half the latency
 – Open source
 – Evaluation
   ● FPGA
   ● CPU with SIMD registers
[Figure: two sorted sequences being merged]

Introduction: Bitonic sorter
● Bitonic sort [Batcher, 1968]
 – A parallel sorting algorithm
 – N/2 comparisons per step
 – $O(\log^2 N)$ steps: $(\log_2(N) \cdot (\log_2(N) + 1))/2$
 – Pipelineable → FPGAs
● Compare and swap (CAS)
 – if (a<b) swap(a, b)
 – Sorter of 2 elements
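The step count on this slide can be checked with a small software model. The following is a minimal sketch, not the presented hardware design: the names `cas` and `bitonic_sort` are hypothetical, the CAS follows the slide's `if (a<b) swap(a, b)` convention (larger value first), and the input length is assumed to be a power of two.

```python
def cas(values, i, j):
    """Compare-and-swap: leave the larger value at index i."""
    if values[i] < values[j]:
        values[i], values[j] = values[j], values[i]


def bitonic_sort(data):
    """Sort a power-of-two-length list in descending order.

    Returns the sorted list and the number of parallel steps used.
    Each step performs len(data) / 2 independent CAS operations.
    """
    values = list(data)
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    steps = 0
    block = 2
    while block <= n:
        stride = block // 2
        while stride >= 1:
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    if i & block == 0:
                        cas(values, i, partner)   # descending region
                    else:
                        cas(values, partner, i)   # ascending region
            steps += 1
            stride //= 2
        block *= 2
    return values, steps


if __name__ == "__main__":
    print(bitonic_sort([8, 3, 5, 2, 1, 7, 9, 0]))
```

For N = 8 the model uses 6 steps, which agrees with (3 · 4)/2 from the formula above.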

Introduction: Bitonic sorter
● Bitonic sort is based on mergesort
 – Hierarchical merge module
[Figure: a sorter (P=8) composed of sorter (4) modules and a merger (8)]

Introduction: Bitonic sorter (P=64)

Bitonic sort example
Input column: 8 3 5 2 1 7 9 0

Bitonic sort example
Input column: 8 3 5 2 1 7 9 0
Next column: 8 3 5 2 7 1 9 0

Merging
● Bitonic sort:
 – The merger module can be used to merge 2 lists of P elements → 1 list of 2P elements
● Problem: How about sorting bigger lists with limited hardware/logic?
 – Merge 2 lists of arbitrary length
 – The data can be streamed and queued
● Design an efficient parallel merger for arbitrarily long lists
● Simple (unrealistic) FPGA example:
 – CPU mergesort, but
 – Parallel merging in FPGA

Basic merger for longer lists
● Based on bitonic merger
[Figure: Queue A and Queue B feed the merger, which writes to Output; the lower 4 need to be fed back]

Basic merger for longer lists
● Based on bitonic merger
[Figure: Queue A and Queue B feed the merger, which writes to Output; the lower 4 need to be fed back]
● Not pipelined! (works fine for CPUs [Chhugani et al., 2008])
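The feedback scheme can be modelled in a few lines. This is a minimal sketch under stated assumptions, not the cited implementation: `feedback_merge` is a hypothetical name, both inputs are sorted in descending order with lengths that are non-zero multiples of P, and Python's `sorted` stands in for the 2P bitonic merger network.

```python
def feedback_merge(a, b, p):
    """Merge two descending lists, emitting p elements per iteration.

    Each iteration merges the p elements kept from the previous iteration
    (the feedback) with a fresh block of p elements, outputs the top p and
    keeps the lower p. The fresh block comes from the queue whose head is
    larger.
    """
    assert a and b and len(a) % p == 0 and len(b) % p == 0

    def merge_2p(x, y):
        # Stand-in for the 2P bitonic merger (assumption of this sketch).
        return sorted(x + y, reverse=True)

    ia, ib = p, p
    merged = merge_2p(a[:p], b[:p])
    out, kept = merged[:p], merged[p:]
    while ia < len(a) or ib < len(b):
        take_a = ib >= len(b) or (ia < len(a) and a[ia] > b[ib])
        if take_a:
            block, ia = a[ia:ia + p], ia + p
        else:
            block, ib = b[ib:ib + p], ib + p
        merged = merge_2p(kept, block)   # depends on the previous iteration
        out += merged[:p]
        kept = merged[p:]
    return out + kept


if __name__ == "__main__":
    print(feedback_merge([13, 11, 9, 7, 6, 3, 1, 0],
                         [12, 10, 8, 5, 4, 2, 1, 0], 4))
```

The line marked as depending on the previous iteration is the feedback path: a new block cannot enter the merger until the lower half of the previous result is available, which is why this structure does not pipeline well in hardware.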

Structures overview

Optimisation for FPGAs
● Based on 2P-to-P bitonic partial merger, such as in [Song et al., 2016]
[Figure: Queue A and Queue B feed the partial merger, which writes to Output; the 'unsorted' lower 4 are not needed]
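A 2P-to-P bitonic partial merger can be sketched as one selection stage followed by a log2(P)-stage butterfly. The function name `bitonic_partial_merge` and the descending-order convention are assumptions of this sketch; it models the dataflow only, not timing or resources.

```python
def bitonic_partial_merge(a, b):
    """Return the P largest of two descending P-element lists, sorted.

    Stage 1 pairs a[i] with b[P-1-i] and keeps the larger value; the P
    survivors are the top P of the 2P inputs and form a bitonic sequence.
    The discarded lower P are never computed, unlike in a full 2P merger.
    The remaining log2(P) stages are CAS stages with strides P/2, ..., 1.
    """
    p = len(a)
    assert len(b) == p and p > 0 and p & (p - 1) == 0
    top = [max(a[i], b[p - 1 - i]) for i in range(p)]
    stride = p // 2
    while stride >= 1:
        for i in range(p):
            if i & stride == 0 and top[i] < top[i + stride]:
                top[i], top[i + stride] = top[i + stride], top[i]
        stride //= 2
    return top


if __name__ == "__main__":
    print(bitonic_partial_merge([13, 11, 9, 7], [12, 10, 8, 5]))
```

With the inputs shown, stage 1 yields 13 11 10 12, and the two CAS stages reorder this to 13 12 11 10.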

Related work examples
● Merging 2 already-sorted sequences
 – Related work [Song et al., 2016] [Saitoh et al., 2018]
 – Problem: Trade-off between feedback (low frequency) and resources

Contributions
● New parallel merger design
 – Merge algorithm
 – Feedback-less
 – Half the hardware resources of the state-of-the-art MMS [Saitoh et al., 2018]
   ● Lookup-table and flip-flop utilisation
   ● (also with half the latency (pipeline length))
● Proof & Evaluation
● Open source implementation
 – Verilog generator for AXI peripherals
   ● FLiMS & MMS merger
 – SIMD version in C
   ● AVX2 & AVX-512

Merge sorter – solution
● Just one 2P-to-P bitonic partial merger
 – Modified 1st pipeline stage
 – No need for barrel shifters

Algorithm A: Distributed algorithm pseudocode

    int i;                       // i is the entity tag
    int cA_i, cB_i, in_i;        // 32-bit registers
    while forever do
        receive(positive clock edge);
        if cA_i > cB_i then
            in_i ← cA_i;
            cA_i ← dequeue(a_i);
        else
            in_i ← cB_i;
            cB_i ← dequeue(b_{P−i});
        end
    end
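Algorithm A can be exercised with a cycle-level software model. The sketch below is an illustration under several assumptions that are not stated on the slide: inputs are dealt round-robin into P banks, entities are indexed from 0 so that entity i reads `a_i` and `b_{P-1-i}`, exhausted banks return negative infinity, inputs are finite and sorted in descending order, and P is a power of two. The name `flims_merge` is hypothetical.

```python
from collections import deque

NEG_INF = float("-inf")


def flims_merge(a, b, p):
    """Merge two descending lists, producing p outputs per simulated cycle."""
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    # Deal each input round-robin into p banks (queues a_i and b_i).
    qa = [deque(a[i::p]) for i in range(p)]
    qb = [deque(b[i::p]) for i in range(p)]

    def head(queue):
        return queue[0] if queue else NEG_INF

    def dequeue(queue):
        return queue.popleft() if queue else NEG_INF

    out = []
    cycles = -(-(len(a) + len(b)) // p)  # ceiling division
    for _ in range(cycles):
        # First stage: every entity works independently, as in Algorithm A.
        chosen = []
        for i in range(p):
            bank_a, bank_b = qa[i], qb[p - 1 - i]
            if head(bank_a) > head(bank_b):
                chosen.append(dequeue(bank_a))
            else:
                chosen.append(dequeue(bank_b))
        # Remaining stages: butterfly of CAS units with strides p/2, ..., 1.
        stride = p // 2
        while stride >= 1:
            for j in range(p):
                if j & stride == 0 and chosen[j] < chosen[j + stride]:
                    chosen[j], chosen[j + stride] = chosen[j + stride], chosen[j]
            stride //= 2
        out.extend(chosen)
    return [x for x in out if x != NEG_INF]  # drop padding


if __name__ == "__main__":
    A = [29, 26, 26, 17, 16, 11, 5, 4, 4, 3, 3]
    B = [22, 21, 19, 18, 15, 12, 9, 8, 7, 0]
    print(flims_merge(A, B, 4))
```

Note that no entity needs the result of an earlier cycle's sorting network: each one only compares two queue heads and dequeues one of them, so there is no feedback path through the merger and no rotation of the inputs.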

Brief proof overview
● Main principles
 – Top P works for equally rotated input → No need to rotate input
 – Top P is a rotated bitonic sequence → also bitonic → rest of the sorting network works
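The second principle can be spot-checked in software: a butterfly of CAS stages sorts a bitonic sequence regardless of how it has been rotated. The helper names `butterfly_sort` and `rotations` are hypothetical, and the check below is an empirical illustration on one sequence, not a substitute for the proof.

```python
def butterfly_sort(seq):
    """Apply CAS stages with strides P/2, ..., 1 (larger value first)."""
    values = list(seq)
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "length must be a power of two"
    stride = p // 2
    while stride >= 1:
        for i in range(p):
            if i & stride == 0 and values[i] < values[i + stride]:
                values[i], values[i + stride] = values[i + stride], values[i]
        stride //= 2
    return values


def rotations(seq):
    """All cyclic rotations of a list."""
    return [seq[r:] + seq[:r] for r in range(len(seq))]


if __name__ == "__main__":
    bitonic = [2, 5, 9, 14, 12, 8, 3, 1]  # rises, then falls
    for rotated in rotations(bitonic):
        print(rotated, "->", butterfly_sort(rotated))
```

Every rotation of the example sequence comes out in the same descending order, which is why the first stage does not need barrel shifters to undo the rotation.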

Comparison with FLiMS

                         | [Song et al., 2016]                        | [Saitoh et al., 2018]      | FLiMS
Feedback datapath length | $\log_2(P)+1$                              | 1                          | 1
Latency                  | $\log_2(P)+\log_2(2P)$                     | $2 \times \log_2(2P)$      | $\log_2(2P)$
H/W modules              | 1 × b.p.m, 2 crossbars (barrel shifters)   | 2 × b.p.m, shift registers | 1 × b.p.m

b.p.m: bitonic partial merger
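To make the latency entries concrete, they can be evaluated for a given P. This sketch assumes the three latency expressions in the comparison belong to [Song et al., 2016], [Saitoh et al., 2018] and FLiMS in that order, and that P is a power of two; the function name `latencies` is hypothetical.

```python
def latencies(p):
    """Evaluate the latency expressions of the comparison for a given P."""
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    log_p = p.bit_length() - 1   # log2(P), exact for powers of two
    log_2p = log_p + 1           # log2(2P)
    return {
        "Song et al., 2016": log_p + log_2p,
        "Saitoh et al., 2018": 2 * log_2p,
        "FLiMS": log_2p,
    }


if __name__ == "__main__":
    for p in (4, 8, 16, 32, 64, 128):
        print(p, latencies(p))
```

For every P the FLiMS value is exactly half of the [Saitoh et al., 2018] value, in line with the "half the latency" claim in the contributions.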

FLiMS example run (P=4)
[Figure sequence: queues A and B feed a Max stage followed by two CAS stages, which write to Output]
A queues, one row per queue:
 4 16 29
 3 11 26
 3 5 26
 4 17
B queues, one row per queue:
 7 15 22
 0 12 21
 9 19
 8 18
Output after the final step, one row per output lane:
 0 4 9 16 21 29
 4 8 15 19 26
 3 7 12 18 26
 3 5 11 17 22

Merge sorter – results
● Board: MYIR Z-turn (Xilinx Zynq 7020)
[Figure: LUT and FF utilisation against P (integers/cycle), P = 4 to 128, for the proposal and MMS, with a 7z020 label]
[Figure: operating frequency (MHz) against P (integers/cycle) for the proposal and MMS]
[Figure: LUT and FF utilisation improvement against P (integers/cycle), observations and fitting]
