Accelerating the merge phase of sort-merge join, FPL 2019

SLIDE 1

9/9/2019 1 Philippos Papaphilippou

Accelerating the merge phase of sort-merge join

FPL 2019 – The 29th International Conference on Field-Programmable Logic and Applications

Philippos Papaphilippou, Holger Pirk, Wayne Luk

Dept. of Computing, Imperial College London, UK

{pp616, pirk, w.luk}@imperial.ac.uk

Source code: philippos.info/mergejoin

SLIDE 2

The task: equi-join

Equi-join

– Join two tables based on key equality
– Cartesian product when a key appears more than once in either of the 2 tables

Popular algorithms

– Hash-join → Random access pattern
– Sort-merge join → Streaming access pattern → FPGA friendly

A-Key  Value
A1     2
A2     2
A3     3
A4     3
A5     3
A6     11

⨝

B-Key  Value
B1     2
B2     2
B3     3
B4     5
B5     6

=

A-Key  B-Key  Value
A1     B1     2
A1     B2     2
A2     B1     2
A2     B2     2
A3     B3     3
A4     B3     3
A5     B3     3
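The equi-join on this slide can be reproduced with a minimal software sketch of a sort-merge join. This is an illustration only, not the presented hardware design: the function name `merge_join` and the `(row_id, key)` row layout are assumptions made for the example.

```python
def merge_join(a, b):
    """Equi-join two lists of (row_id, key) pairs, each sorted by key.

    Returns a list of (a_row_id, b_row_id, key) tuples.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        ka, kb = a[i][1], b[j][1]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Find the end of the run of equal keys in each table.
            i_end = i
            while i_end < len(a) and a[i_end][1] == ka:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][1] == ka:
                j_end += 1
            # Duplicate keys produce a Cartesian product of the two runs.
            for x in range(i, i_end):
                for y in range(j, j_end):
                    out.append((a[x][0], b[y][0], ka))
            i, j = i_end, j_end
    return out


table_a = [("A1", 2), ("A2", 2), ("A3", 3), ("A4", 3), ("A5", 3), ("A6", 11)]
table_b = [("B1", 2), ("B2", 2), ("B3", 3), ("B4", 5), ("B5", 6)]
result = merge_join(table_a, table_b)
for row in result:
    print(row)
```

Both inputs are read strictly front to back, which is the streaming access pattern the slide contrasts with hash-join.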

SLIDE 3

Challenges in related work

Input properties

– Presence of duplicate keys → complicates the hardware and access patterns
– Long input → limited storage inside the FPGA
– Wide input → moving big rows is expensive
– Some designs are inapplicable or slow down

Data movement

– Narrow inter-chip (CPU ↔ FPGA) communication
– Induced latency

Scalability

– Future technologies (high-throughput)
– Big data → arbitrarily long tables

SLIDE 4

Abstracted solution

High-Throughput Stream processor
Inputs

– Sorted keys of table A
– Sorted keys of table B

Output

– Index ranges where the key was the same

Expand on demand (late materialisation)
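Late materialisation can be illustrated with a small sketch that expands index ranges into joined rows only when a consumer asks for them. The tuple layout follows the join module's output format given in these slides. Two things are assumptions for the example: the helper name `expand`, and treating the range ends as inclusive.

```python
def expand(range_tuples, rows_a, rows_b):
    """Lazily expand <a_start, a_end, b_start, b_end, key> tuples into row pairs.

    Range ends are treated as inclusive (an assumption for this sketch).
    """
    for a_start, a_end, b_start, b_end, key in range_tuples:
        for i in range(a_start, a_end + 1):
            for j in range(b_start, b_end + 1):
                yield rows_a[i], rows_b[j], key


# Rows stay where they are; only the compact index ranges are produced by the join.
rows_a = ["A1", "A2", "A3", "A4", "A5", "A6"]
rows_b = ["B1", "B2", "B3", "B4", "B5"]
ranges = [(0, 1, 0, 1, 2), (2, 4, 2, 2, 3)]

rows = list(expand(ranges, rows_a, rows_b))
```

Because `expand` is a generator, wide rows are touched only for the matches that are actually consumed.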

SLIDE 5

Proposal

Building blocks

– Round-robin module
– Co-grouping engine
– Modified FLiMS

SLIDE 6

Round-robin module

Stream processor
Rearranges sparse input before writing to multiple banks
Round-robin effect, but in parallel

[Figure: round-robin module; labels: +, CAS network (bitonic sorter), MSB, barrel shifters]
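The input/output behaviour described on this slide can be modelled in software as follows. This is a behavioural sketch only: the function name `round_robin_write`, the use of `None` for empty lanes, and the mapping of each step to the figure's hardware labels are assumptions, and the loop here is sequential where the hardware works in parallel.

```python
def round_robin_write(cycles, num_banks):
    """Distribute the valid elements of sparse, fixed-width inputs over banks.

    cycles: one list per clock cycle, with None marking an empty lane.
    Returns the contents of each bank after all cycles.
    """
    banks = [[] for _ in range(num_banks)]
    next_bank = 0  # bank that receives the next valid element
    for lanes in cycles:
        # Compaction: drop the empty lanes, keeping the original order
        # (assumed to correspond to the CAS network in the figure).
        valid = [x for x in lanes if x is not None]
        # Rotation: continue from where the previous cycle stopped
        # (assumed to correspond to the barrel shifters in the figure).
        for x in valid:
            banks[next_bank].append(x)
            next_bank = (next_bank + 1) % num_banks
    return banks


cycles = [
    [None, "a", None, "b"],
    ["c", None, None, None],
    ["d", "e", "f", None],
]
banks = round_robin_write(cycles, num_banks=4)
```

However sparse each cycle's input is, the banks fill evenly and in order, so reading them back round-robin recovers the dense stream.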

SLIDE 7

Co-grouping engine

Stream processor
Provides ranges of indexes where the key was the same
Input: Sorted keys
Output: Unique keys, index ranges <indexstart, indexend, key>

[Figure: co-grouping engine datapath; labels: <index, key>, 1 cycle delay, f ... f (P-1), g0 g1 ... gP-1, Round Robin]
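A sequential software sketch of what the co-grouping engine computes is given below. The function name `co_group` and the inclusive range ends are assumptions for illustration. The hardware processes P elements per cycle, which this model does not attempt to reproduce.

```python
def co_group(sorted_keys):
    """Collapse a sorted key list into (index_start, index_end, key) tuples.

    Each tuple covers one run of equal keys; index_end is inclusive
    (an assumption for this sketch).
    """
    groups = []
    start = 0
    for i in range(1, len(sorted_keys) + 1):
        # Close the current run at the end of input or when the key changes.
        if i == len(sorted_keys) or sorted_keys[i] != sorted_keys[start]:
            groups.append((start, i - 1, sorted_keys[start]))
            start = i
    return groups


groups_a = co_group([2, 2, 3, 3, 3, 11])
groups_b = co_group([2, 2, 3, 5, 6])
```

After co-grouping, every key appears at most once per stream, so duplicate keys no longer complicate the later merge.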

SLIDE 8

Join module

Task: merge 2 co-grouped streams
Output: tuples of the form <indexAstart, indexAend, indexBstart, indexBend, key>

Main idea:

– Sort them together, based on a high-throughput H/W merge sorter (FLiMS [FPT’18])
– Match same-key groups by looking only at consecutive elements
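The main idea can be sketched in software as follows. This is not the modified FLiMS design: `heapq.merge` from the Python standard library stands in for the hardware merge sorter, and the function name `join_cogrouped`, the source tags, and the inclusive ranges are assumptions for the example.

```python
import heapq


def join_cogrouped(groups_a, groups_b):
    """Join two co-grouped streams of (index_start, index_end, key) tuples.

    Each stream is sorted by key and holds each key at most once.
    Returns (a_start, a_end, b_start, b_end, key) tuples.
    """
    # Tag each group with its source (0 = A, 1 = B) so that, for equal keys,
    # the group from A sorts directly before the group from B.
    tagged_a = ((key, 0, start, end) for start, end, key in groups_a)
    tagged_b = ((key, 1, start, end) for start, end, key in groups_b)
    merged = list(heapq.merge(tagged_a, tagged_b))

    out = []
    # A key occurs at most twice in the merged stream, so a match can be
    # detected by comparing each element with its immediate successor.
    for prev, cur in zip(merged, merged[1:]):
        if prev[0] == cur[0]:
            out.append((prev[2], prev[3], cur[2], cur[3], cur[0]))
    return out


groups_a = [(0, 1, 2), (2, 4, 3), (5, 5, 11)]
groups_b = [(0, 1, 2), (2, 2, 3), (3, 3, 5), (4, 4, 6)]
matches = join_cogrouped(groups_a, groups_b)
```

Looking only at neighbours works because co-grouping has already removed duplicates within each stream.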

SLIDE 9

Advantages

Input agnostic

– Index-based
– Big data analytics

Stream processor

– FPGA-friendly

Modular design

– Novel building blocks
– Can be combined with other components: H/W sorters, filters, ...

High-throughput design

– Scalable for future architectures
– Lower resources than related work

SLIDE 10

Evaluation on a heterogeneous system

Platform

– Zynq UltraScale+ device
– Operating system: PetaLinux
– Communication: DMA transfers

Speedup of up to 3.1 times

– 1-port (H/W) vs 1-thread (S/W)

Input design space exploration

– Fraction of distinct keys (%)
– Fraction of key matches (%) (directly related to the output size)

Speedup variation factors

– CPU performance
– Length of the DMA transfers (CPU → FPGA)

[Figure: FPGA speedup (scale 1.5 to 3) against distinct keys in A, B (20 to 100 %) and output size (4096 to 16384 rows)]

Empty space: no more key matches than the number of distinct keys
