FLiMS: Fast Lightweight Merge Sorter
2018 International Conference on Field-Programmable Technology (FPT)
Philippos Papaphilippou, Chris Brooks, Wayne Luk


SLIDE 1

12/12/2018 1 Philippos Papaphilippou

FLiMS: Fast Lightweight Merge Sorter

Philippos Papaphilippou
Dept. of Computing, Imperial College London
London, United Kingdom
pp616@imperial.ac.uk

Chris Brooks
Science Innovation, dunnhumby
London, United Kingdom
Chris.Brooks@dunnhumby.com

Wayne Luk
Dept. of Computing, Imperial College London
London, United Kingdom
w.luk@imperial.ac.uk

2018 International Conference on Field-Programmable Technology (FPT)

SLIDE 2

Novel merger design

  • Task

– Merge 2 sorted sequences in parallel

  • Contributions

– Highly-efficient parallel merger

  • Half hardware resources of the state-of-the-art
  • Half the latency

– Open source

– Evaluation

  • FPGA
  • CPU with SIMD registers

[Figure: merge example; values shown: 1 4 5 7 2 3 3 8 10 11 15 16 18]

SLIDE 3

Introduction: Bitonic sorter

  • Bitonic sort [K. E. Batcher, 1968]

– A parallel sorting algorithm

– N/2 comparisons per step

– O(log²(N)) steps: (log2(N) · (log2(N) + 1)) / 2

– Pipelineable → FPGAs

  • Compare and swap (CAS)

– if (a < b) swap(a, b)

– A sorter of 2 elements
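
As a quick check of the step count above, the following sketch evaluates the formula for a power-of-two N and multiplies it by the N/2 comparisons per step. The function names are illustrative and do not come from the slides.

```python
def bitonic_steps(n: int) -> int:
    """Parallel CAS steps of a bitonic sorter with n inputs (n a power of two)."""
    k = n.bit_length() - 1          # k = log2(n) for a power of two
    return k * (k + 1) // 2

def bitonic_cas_count(n: int) -> int:
    """Total CAS units: n/2 comparisons in each of the steps."""
    return (n // 2) * bitonic_steps(n)

if __name__ == "__main__":
    for n in (8, 64):
        print(n, bitonic_steps(n), bitonic_cas_count(n))
```

For N=8 this gives 6 steps, which matches the six transitions in the 8-input example on the following slides.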

SLIDE 4

Introduction: Bitonic sorter

  • Bitonic sort is based on mergesort

– Hierarchical merge module

[Figure: a sorter (P=8) composed of two sorter (4) modules followed by a merger (8)]

SLIDE 5

Introduction: Bitonic sorter

[Figure: bitonic sorter network, P=64]

SLIDE 6

Bitonic sort example

8 3 5 2 1 7 9

SLIDE 7

Bitonic sort example

8 3 5 2 1 7 9 → 8 3 5 2 7 1 9

SLIDE 8

Bitonic sort example

8 3 5 2 7 1 9 → 8 5 3 2 7 9 1

SLIDE 9

Bitonic sort example

8 5 3 2 7 9 1 → 8 5 3 2 9 7 1

SLIDE 10

Bitonic sort example

8 5 3 2 9 7 1 → 8 5 7 9 2 3 1

SLIDE 11

Bitonic sort example

8 5 7 9 2 3 1 → 8 9 7 5 2 3 1

SLIDE 12

Bitonic sort example

8 9 7 5 2 3 1 → 9 8 7 5 3 2 1
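
The example on the preceding slides can be replayed in software. The sketch below is one possible descending bitonic network built only from CAS operations; it records the sequence after every parallel step. Two things are assumptions here: the exact wiring (a mirrored first step per merge block, then half-distance steps), and a trailing 0 appended to the input to reach 8 elements, since the extracted slide text shows only 7 values per row.

```python
def cas(v, i, j):
    """Compare-and-swap: keep the larger value at index i (descending order)."""
    if v[i] < v[j]:
        v[i], v[j] = v[j], v[i]

def bitonic_sort_trace(values):
    """Sort descending with a bitonic network; also return the state after each step."""
    v = list(values)
    n = len(v)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    states = []
    block = 2
    while block <= n:
        # First step of a merge block: compare element k with its mirror image.
        for start in range(0, n, block):
            for k in range(block // 2):
                cas(v, start + k, start + block - 1 - k)
        states.append(list(v))
        # Remaining steps: compare elements at distance block/4, block/8, ..., 1.
        dist = block // 4
        while dist >= 1:
            for i in range(n):
                if (i & dist) == 0:
                    cas(v, i, i + dist)
            states.append(list(v))
            dist //= 2
        block *= 2
    return v, states

if __name__ == "__main__":
    # The trailing 0 is an assumption; it is not visible in the extracted slides.
    result, trace = bitonic_sort_trace([8, 3, 5, 2, 1, 7, 9, 0])
    for state in trace:
        print(state)
```

Each step performs N/2 comparisons, and the number of recorded states equals (log2(N) · (log2(N) + 1)) / 2.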

SLIDE 13

Merging

  • Bitonic sort:

– The merger module can be used to merge 2 lists of P elements → 1 list of 2P elements

  • Problem: How about sorting bigger lists with limited hardware/logic?

– Merge 2 lists of arbitrary length
– The data can be streamed and queued

  • Design an efficient parallel merger for arbitrarily long lists
  • Simple (unrealistic) FPGA example:

– CPU mergesort, but
– Parallel merging in FPGA

SLIDE 14

Basic merger for longer lists

  • Based on bitonic merger

Queue A: 13 11 9 7 | 6 3 1 0
Queue B: 12 10 8 5 | 4 2 1 0
Merger input: 13 11 9 7 and 12 10 8 5
Merged: 13 12 11 10 9 8 7 5
Output: 13 12 11 10
Lower 4 need to be fed back

SLIDE 15

Basic merger for longer lists

  • Based on bitonic merger

Queue A: 13 11 9 7 | 6 3 1 0
Queue B: 12 10 8 5 | 4 2 1 0
Next merger input: 9 8 7 5 and 6 3 1
Merged: 13 12 11 10 9 8 7 5
Output: 13 12 11 10
Lower 4 need to be fed back. Not pipelined!

(works fine for CPUs [Chhugani, et al., 2008])
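
The feedback scheme on these two slides can be sketched in software as follows. This is a minimal model, not the authors' implementation: `sorted` stands in for the 2P-input bitonic merger, both lists are assumed to be non-empty, sorted in descending order, and a multiple of P long, and the next batch is taken from the queue whose head is larger. The function name is hypothetical.

```python
def feedback_merge(a, b, p):
    """Merge two descending lists p elements at a time, feeding the lower p back."""
    assert a and b and len(a) % p == 0 and len(b) % p == 0
    ia, ib = p, p
    kept, nxt = a[:p], b[:p]          # first batch from each queue
    out = []
    while True:
        merged = sorted(kept + nxt, reverse=True)   # stand-in for the 2P merger
        out.extend(merged[:p])                      # upper P leave as output
        kept = merged[p:]                           # lower P are fed back
        if ia >= len(a) and ib >= len(b):
            break
        # Load the next batch from the queue with the larger head element.
        if ib >= len(b) or (ia < len(a) and a[ia] >= b[ib]):
            nxt = a[ia:ia + p]
            ia += p
        else:
            nxt = b[ib:ib + p]
            ib += p
    out.extend(kept)
    return out

if __name__ == "__main__":
    queue_a = [13, 11, 9, 7, 6, 3, 1, 0]
    queue_b = [12, 10, 8, 5, 4, 2, 1, 0]
    print(feedback_merge(queue_a, queue_b, 4))
```

Every iteration depends on the lower half produced by the previous one, which is why this loop cannot be pipelined directly in hardware.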

SLIDE 16

Structures overview

SLIDE 17

Optimisation for FPGAs

  • Based on 2P-to-P bitonic partial merger, such as in [Song et al., 2016]

Queue A: 13 11 9 7 | 6 3 1 0
Queue B: 12 10 8 5 | 4 2 1 0
Output: 13 12 11 10
'Unsorted' lower 4: not needed
Other values shown in the figure: 9 7 6 3 and 8 5 4 2
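
A 2P-to-P bitonic partial merger can be sketched as below, assuming two descending inputs of equal power-of-two length P. The first stage keeps only the larger element of each mirrored pair, which yields the top P elements as a bitonic sequence; the remaining log2(P) CAS stages sort them. The function name is illustrative.

```python
def partial_merge_top(a, b):
    """Return the P largest of two descending P-element lists, sorted descending."""
    p = len(a)
    assert p == len(b) and p > 0 and (p & (p - 1)) == 0
    # Stage 1: pair a[i] with b[P-1-i] and keep the maximum (lower half dropped).
    top = [max(a[i], b[p - 1 - i]) for i in range(p)]
    # Stages 2..log2(2P): bitonic merge of the P survivors.
    dist = p // 2
    while dist >= 1:
        for i in range(p):
            if (i & dist) == 0 and top[i] < top[i + dist]:
                top[i], top[i + dist] = top[i + dist], top[i]
        dist //= 2
    return top

if __name__ == "__main__":
    print(partial_merge_top([13, 11, 9, 7], [12, 10, 8, 5]))
```

Compared with a full 2P merger, only half of the comparators after the first stage are needed, because the lower P elements are never sorted.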

SLIDE 18

Related work examples

  • Merging 2 already-sorted sequences

– Related work

– Problem:

  • Trade-off between feedback (low frequency) and resources

[Song et al., 2016] [Saitoh et al., 2018]

SLIDE 19

Contributions

  • New parallel merger design

– Merge algorithm
– Feedback-less
– Half the hardware resources of the state-of-the-art MMS [Saitoh et al., 2018]

  • Lookup-table and flip-flop utilisation
  • (also half the latency, i.e. pipeline length)

  • Proof & Evaluation
  • Open source implementation

– Verilog generator for AXI peripherals

  • FLiMS & MMS merger

– SIMD version in C

  • AVX2 & AVX-512

SLIDE 20

Merge sorter – solution

  • Just one 2P-to-P bitonic partial merger

Modified 1st pipeline stage

  • No need for barrel shifters

Algorithm A: Distributed algorithm pseudocode

    int i;                          // i is the entity tag
    int cA_i, cB_i, in_i;           // 32-bit registers
    while forever do
        receive(positive clock edge);
        if cA_i > cB_i then
            in_i ← cA_i;
            cA_i ← dequeue(a_i);
        else
            in_i ← cB_i;
            cB_i ← dequeue(b_{P−i});
        end
    end
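
Algorithm A can be modelled in software as follows. This is a simulation sketch under several assumptions that are not stated on the slide: inputs are sorted in descending order, each queue is striped round-robin over P banks, entities are indexed from 0 (so entity i reads bank P-1-i of B), an exhausted bank yields a negative-infinity sentinel, and the selected values pass through a plain bitonic merge network. All function names are hypothetical.

```python
from collections import deque

NEG_INF = float("-inf")  # sentinel for an empty bank (an assumption of this sketch)

def bitonic_merge_desc(values):
    """Sort a bitonic sequence of power-of-two length into descending order."""
    v = list(values)
    dist = len(v) // 2
    while dist >= 1:
        for i in range(len(v)):
            if (i & dist) == 0 and v[i] < v[i + dist]:
                v[i], v[i + dist] = v[i + dist], v[i]
        dist //= 2
    return v

def flims_merge(a, b, p):
    """Merge two descending lists, emitting p sorted elements per simulated cycle."""
    def to_banks(seq):
        padded = list(seq) + [NEG_INF] * (-len(seq) % p)
        return [deque(padded[k::p]) for k in range(p)], len(padded)

    def dequeue(bank):
        return bank.popleft() if bank else NEG_INF

    banks_a, len_a = to_banks(a)
    banks_b, len_b = to_banks(b)
    # Per-entity registers: cA_i reads bank i of A, cB_i reads bank P-1-i of B.
    c_a = [dequeue(banks_a[i]) for i in range(p)]
    c_b = [dequeue(banks_b[p - 1 - i]) for i in range(p)]
    out = []
    for _ in range((len_a + len_b) // p):          # one iteration per clock cycle
        selected = []
        for i in range(p):                         # entities act independently
            if c_a[i] > c_b[i]:
                selected.append(c_a[i])
                c_a[i] = dequeue(banks_a[i])
            else:
                selected.append(c_b[i])
                c_b[i] = dequeue(banks_b[p - 1 - i])
        out.extend(bitonic_merge_desc(selected))   # remaining log2(P) CAS stages
    return [x for x in out if x != NEG_INF]

if __name__ == "__main__":
    print(flims_merge([9, 7, 4, 2], [8, 6, 5, 1], 2))
```

No entity needs to know what the others selected, and nothing computed by the sorting stages is fed back to the selection stage.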

SLIDE 21

Brief proof overview

  • Main principles

– Top P works for equally rotated input → No need to rotate input
– Top P is a rotated bitonic sequence → also bitonic → rest of the sorting network works
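
The second principle can be checked empirically: every cyclic rotation of a bitonic sequence is still sorted correctly by the same merge network. The sketch below uses a descending half-cleaner network and an arbitrary 8-element bitonic sequence chosen for illustration; neither is taken from the slides.

```python
def bitonic_merge_desc(values):
    """Sort a bitonic sequence of power-of-two length into descending order."""
    v = list(values)
    dist = len(v) // 2
    while dist >= 1:
        for i in range(len(v)):
            if (i & dist) == 0 and v[i] < v[i + dist]:
                v[i], v[i + dist] = v[i + dist], v[i]
        dist //= 2
    return v

def rotations(seq):
    """All cyclic rotations of seq."""
    return [seq[k:] + seq[:k] for k in range(len(seq))]

if __name__ == "__main__":
    bitonic = [9, 7, 4, 1, 2, 5, 6, 8]   # decreasing, then increasing
    for rotated in rotations(bitonic):
        print(rotated, "->", bitonic_merge_desc(rotated))
```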

SLIDE 23

Comparison with FLiMS

                          [Song et al., 2016]        [Saitoh et al., 2018]        FLiMS
Feedback datapath length  log2(P)+1                  1                            1
Latency                   log2(P)+log2(2P)           2×log2(2P)                   log2(2P)
H/W modules               1 × b.p.m, 2 crossbars     2 × b.p.m, shift registers   1 × b.p.m
                          (barrel shifters)

(b.p.m: bitonic partial merger)
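
To make the comparison concrete, the formulas in the table can be evaluated for a given P. The sketch below only transcribes those expressions; the function name and dictionary layout are illustrative.

```python
def merger_comparison(p: int) -> dict:
    """Feedback datapath length and latency (in stages) for a power-of-two P."""
    assert p > 0 and (p & (p - 1)) == 0, "P must be a power of two"
    lg_p = p.bit_length() - 1      # log2(P)
    lg_2p = lg_p + 1               # log2(2P)
    return {
        "Song et al., 2016": {"feedback": lg_p + 1, "latency": lg_p + lg_2p},
        "Saitoh et al., 2018": {"feedback": 1, "latency": 2 * lg_2p},
        "FLiMS": {"feedback": 1, "latency": lg_2p},
    }

if __name__ == "__main__":
    for name, row in merger_comparison(16).items():
        print(name, row)
```

For P=16 the latencies are 9, 10 and 5 stages respectively, so FLiMS has half the pipeline length of the design in [Saitoh et al., 2018].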

SLIDE 24

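FLiMS example run (P=4)

[Figure: input queues A and B feeding a Max stage and two CAS stages, then Output]
Queue A, one bank per group (head rightmost): 4 16 29 / 3 11 26 / 3 5 26 / 4 17
Queue B, one bank per group (head rightmost): 7 15 22 / 0 12 21 / 9 19 / 8 18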

SLIDE 25

FLiMS example run (P=4)

Heads of A: 29 26 26 17
Heads of B: 22 21 19 18

SLIDE 26

FLiMS example run (P=4)

Heads of A: 16 11 5 17
Heads of B: 15 21 19 18
After Max stage: 29 26 26 22

SLIDE 27

FLiMS example run (P=4)

Heads of A: 16 11 5 4
Heads of B: 15 12 9 8
After Max stage: 18 19 21 17
After first CAS stage: 29 26 26 22

SLIDE 28

FLiMS example run (P=4)

Heads of A: 4 3 5 4
Heads of B: 7 9 8
After Max stage: 16 11 12 15
After first CAS stage: 21 19 18 17
Output: 29 26 26 22

SLIDE 29

FLiMS example run (P=4)

Heads of A: 4 3 3 4
After Max stage: 8 9 5 7
After first CAS stage: 16 15 12 11
Output: 29 26 26 22, 21 19 18 17

SLIDE 30

FLiMS example run (P=4)

After Max stage: 4 3 3 4
After first CAS stage: 8 9 5 7
Output: 29 26 26 22, 21 19 18 17, 16 15 12 11

SLIDE 31

FLiMS example run (P=4)

After first CAS stage: 4 4 3 3
Output: 29 26 26 22, 21 19 18 17, 16 15 12 11, 9 8 7 5

SLIDE 32

FLiMS example run (P=4)

Output: 29 26 26 22, 21 19 18 17, 16 15 12 11, 9 8 7 5, 4 4 3 3

SLIDE 33

FLiMS example run (P=4)

Output: 29 26 26 22, 21 19 18 17, 16 15 12 11, 9 8 7 5, 4 4 3 3, 0

SLIDE 34

Merge sorter – results

  • Board: MYIR Z-turn (Xilinx Zynq 7020)

[Figure: LUT usage (axis 0K to 90K, 7z020 LUT capacity marked), Proposal vs MMS; LUT utilisation improvement (axis 1.6 to 2) against P (integers/cycle), P = 4 to 128, observations and fitting]
[Figure: FF usage (axis 0K to 100K), Proposal vs MMS; FF utilisation improvement (axis 1.5 to 1.8) against P (integers/cycle), P = 4 to 128, observations and fitting]
[Figure: Operating frequency (MHz, axis 70 to 105) against P (integers/cycle), P = 4 to 128, Proposal vs MMS]

SLIDE 35

SIMD in modern processors

  • The same algorithm appears applicable to modern processors

– Single instruction, multiple data (SIMD) model

  • Interesting properties that might yield good CPU performance

– Short pipeline → lower instruction count
– No need for rotation → lower instruction count
– Fewer dependencies → higher Instruction Level Parallelism (ILP)
– Distributed approach → no centralized control mechanism → fewer branch prediction misses

  • Using vector registers, such as in Intel i7 and Xeon processors

– AVX2 registers hold 8 32-bit integers → P=8
– AVX-512 registers hold 16 32-bit integers → P=16

SLIDE 36

SIMD preliminary results

  • Small experiment

– Merging 2 big lists, using 3 compilers
– Compare with serial merge (attempt automatic vectorisation)

Compiler                          gcc 8.1.1   clang 6.0.1   icc 18.0.3
FLiMS:  time (s)                  0.23        0.23          0.22
        write throughput (GB/s)   2.77        2.69          2.81
Serial: time (s)                  0.60        0.36          0.61
        write throughput (GB/s)   1.03        1.73          1.03
Speedup                           2.67        1.55          2.73

TABLE III: Preliminary results for manual vectorisation with P = 16 on i7-8809G, merging 2 lists of 16777216 32-bit integers each.
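
For reference, a serial merge of the kind used as the baseline might look like the sketch below. The measured baseline was written in C and is not shown on the slides, so this Python version is only an assumed illustration of the algorithm: one comparison and one data-dependent branch per output element.

```python
def serial_merge_desc(a, b):
    """Two-pointer merge of two descending lists into one descending list."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Taking from a on ties keeps equal keys in input order (stable).
        if a[i] >= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])   # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out

if __name__ == "__main__":
    print(serial_merge_desc([9, 7, 4, 2], [8, 6, 5, 1]))
```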

SLIDE 37

Possible future work

  • Full FPGA sort designs

– Improve parallel merge trees (PMT [Song et al., 2016])

  • Multiple input queues → Bandwidth-aware
  • Explore other ways of merging multiple input queues
  • Accelerate FPGA sort-merge join for databases
  • Evaluation of FLiMS-based SIMD mergesort

– & multi-threading

  • Stable sort variation

– Output equal keys in the same order as input
– (serial mergesort is already a stable sort)
– Solves the satellite data problem (key-value pair)

  • Multiple passes to sort using more than one key
  • OpenCL implementation on GPUs

[Figure: parallel merge tree example (originally includes crossbars): FIFO and merge stages with ratios 4:2, 8:4 and 16:8]

SLIDE 38

Summary

  • Contributions

– Highly-efficient parallel merger design

  • Half the LUTs, FFs and latency in comparison with the state-of-the-art
  • Feedback-less → High operating frequency

– Evaluation & proof
– SIMD preliminary evaluation
– Open source repository: www.philippos.info/merge

  • Future work

– Building block for a variety of applications (Full H/W sort-merge join, etc.)
– Variations (stable sort, etc.)

SLIDE 39

END

Thank you for your attention! Questions?

Philippos Papaphilippou

SLIDE 40

Backup slides
