Accelerating the merge phase of sort-merge join (FPL 2019)

SLIDE 1

9/9/2019 1 Philippos Papaphilippou

Accelerating the merge phase of sort-merge join

FPL 2019 – The 29th International Conference on Field-Programmable Logic and Applications

Philippos Papaphilippou, Holger Pirk, Wayne Luk

  • Dept. of Computing, Imperial College London, UK

{pp616, pirk, w.luk}@imperial.ac.uk

Source code: philippos.info/mergejoin

SLIDE 2

The task: equi-join

  • Equi-join

– Join two tables based on key equality
– Matching rows form a Cartesian product when a key occurs more than once in one of the two tables

  • Popular algorithms

– Hash-join → Random access pattern
– Sort-merge join → Streaming access pattern → FPGA friendly

Table A:

A-Key  Value
A1     2
A2     2
A3     3
A4     3
A5     3
A6     11

Table B:

B-Key  Value
B1     2
B2     2
B3     3
B4     5
B5     6

Result of joining A and B on Value:

A-Key  B-Key  Value
A1     B1     2
A1     B2     2
A2     B1     2
A2     B2     2
A3     B3     3
A4     B3     3
A5     B3     3
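The example above can be reproduced with a minimal software sketch of the merge phase. This is a plain sequential illustration, not the hardware design from the slides; the function name `merge_join` and the `(row_id, key)` layout are assumptions made for the example.

```python
def merge_join(a, b):
    """Merge phase of a sort-merge equi-join.

    a, b: lists of (row_id, key) tuples, each already sorted by key.
    Returns a list of (a_row_id, b_row_id, key) tuples.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        key_a, key_b = a[i][1], b[j][1]
        if key_a < key_b:
            i += 1
        elif key_a > key_b:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(a) and a[i_end][1] == key_a:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][1] == key_a:
                j_end += 1
            # Duplicate keys yield the Cartesian product of the two runs.
            for x in range(i, i_end):
                for y in range(j, j_end):
                    out.append((a[x][0], b[y][0], key_a))
            i, j = i_end, j_end
    return out


table_a = [("A1", 2), ("A2", 2), ("A3", 3), ("A4", 3), ("A5", 3), ("A6", 11)]
table_b = [("B1", 2), ("B2", 2), ("B3", 3), ("B4", 5), ("B5", 6)]
result = merge_join(table_a, table_b)
for row in result:
    print(row)
```

Both inputs are read strictly front to back, which is the streaming access pattern the slide contrasts with hash-join.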

SLIDE 3

Challenges in related work

  • Input properties

– Presence of duplicate keys → complicates the hardware and access patterns
– Long input → limited storage inside the FPGA
– Wide input → moving big rows is expensive
– Some designs are inapplicable or slow down

  • Data movement

– Narrow inter-chip (CPU ↔ FPGA) communication
– Induced latency

  • Scalability

– Future technologies (high-throughput)
– Big data → arbitrarily long tables

SLIDE 4

Abstracted solution

  • High-Throughput Stream processor
  • Inputs

– Sorted keys of table A
– Sorted keys of table B

  • Output

– Index ranges where the key was the same

  • Expand on demand (late materialisation)
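
Late materialisation can be sketched as a lazy expansion of matched index ranges into row pairs. The tuple layout `(a_start, a_end, b_start, b_end, key)`, the inclusive 0-based indexes, and the name `expand` are assumptions for this illustration.

```python
from itertools import islice


def expand(matches):
    """Lazily expand matched index ranges into (index_a, index_b, key) rows.

    matches: iterable of (a_start, a_end, b_start, b_end, key) tuples,
    where both index ranges are inclusive (an assumption of this sketch).
    """
    for a_start, a_end, b_start, b_end, key in matches:
        for index_a in range(a_start, a_end + 1):
            for index_b in range(b_start, b_end + 1):
                yield index_a, index_b, key


# Ranges for keys 2 and 3 of the earlier example, using 0-based row indexes.
matches = [(0, 1, 0, 1, 2), (2, 4, 2, 2, 3)]

# Only the rows that are actually requested get produced.
first_three = list(islice(expand(matches), 3))
all_rows = list(expand(matches))
print(first_three)
print(len(all_rows))
```

The compact range form is what keeps the output small when keys repeat: two tuples here stand for seven joined rows.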

SLIDE 5

Proposal

Building blocks:

  • Round-robin module
  • Co-grouping engine
  • Modified FLiMS

SLIDE 6

Round-robin module

  • Stream processor
  • Rearranges sparse input before writing in multiple banks
  • Round-robin effect, but in parallel

[Figure: round-robin module block diagram. Labels: adders (+), SR, CAS network (bitonic sorter), MSB, barrel shifters]
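
The placement that the round-robin module achieves can be illustrated with a sequential behavioural model. This is only one reading of the slide: the hardware handles all lanes of a cycle in parallel, whereas the loop below visits them one by one, and the names `round_robin_write`, `cycles`, and `num_banks` are hypothetical.

```python
def round_robin_write(cycles, num_banks):
    """Behavioural model: spread valid elements evenly over the banks.

    cycles: list of per-cycle lane lists; None marks an empty lane.
    Returns one list per bank holding the elements written to it.
    """
    banks = [[] for _ in range(num_banks)]
    next_bank = 0  # bank that receives the next valid element
    for lanes in cycles:
        # Compact the sparse input: keep valid elements in lane order.
        valid = [x for x in lanes if x is not None]
        for x in valid:
            banks[next_bank].append(x)
            next_bank = (next_bank + 1) % num_banks
    return banks


cycles = [
    [None, "a", None, "b"],
    ["c", None, None, None],
    ["d", "e", "f", None],
]
banks = round_robin_write(cycles, num_banks=4)
print(banks)
```

The write position carries over from one cycle to the next, so the banks fill evenly regardless of which lanes happened to hold valid data.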

SLIDE 7

Co-grouping engine

  • Stream processor
  • Provides ranges of indexes where the key was the same
  • Input: Sorted keys
  • Output: Unique keys, index ranges

[Figure: co-grouping engine. Labels: input <index, key>; output <indexstart, indexend, key>; 1 cycle delay; units f0, f1, ..., fP-1 and g0, g1, ..., gP-1; Round Robin]
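
The input/output behaviour of the co-grouping engine can be sketched in software as follows. The engine on the slide processes several keys per cycle, while this sketch is sequential; the function name `co_group` and the inclusive 0-based indexes are assumptions.

```python
def co_group(sorted_keys):
    """Collapse runs of equal keys into (index_start, index_end, key) tuples.

    sorted_keys: list of keys in non-decreasing order.
    Index ranges are inclusive (an assumption of this sketch).
    """
    groups = []
    start = 0
    for idx in range(1, len(sorted_keys) + 1):
        # A run ends at the end of the input or where the key changes.
        if idx == len(sorted_keys) or sorted_keys[idx] != sorted_keys[start]:
            groups.append((start, idx - 1, sorted_keys[start]))
            start = idx
    return groups


groups_a = co_group([2, 2, 3, 3, 3, 11])
groups_b = co_group([2, 2, 3, 5, 6])
print(groups_a)
print(groups_b)
```

Each key appears exactly once in the output, which is what lets a later stage match groups without handling duplicates.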

SLIDE 8

Join module

  • Task: merge 2 co-grouped streams
  • Output: tuples of the form <indexAstart, indexAend, indexBstart, indexBend, key>
  • Main idea: sort the two streams together
  • Based on a high-throughput H/W merge sorter (FLiMS [FPT’18])
  • Match same-key groups by looking only at consecutive elements
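
The idea of sorting the two co-grouped streams together and comparing only neighbours can be sketched in software. Here `heapq.merge` from the Python standard library stands in for the hardware merge sorter; it is not FLiMS, and the name `join_groups` and the tagging scheme are assumptions for this illustration.

```python
import heapq


def join_groups(groups_a, groups_b):
    """Match same-key groups from two co-grouped streams.

    groups_a, groups_b: lists of (index_start, index_end, key) tuples
    sorted by key, with each key appearing at most once per stream.
    Returns (a_start, a_end, b_start, b_end, key) tuples.
    """
    # Tag each group with its source: 0 for A, 1 for B. On equal keys
    # the tag breaks the tie, so the A group always comes first.
    tagged_a = ((key, 0, start, end) for start, end, key in groups_a)
    tagged_b = ((key, 1, start, end) for start, end, key in groups_b)
    out = []
    prev = None
    for cur in heapq.merge(tagged_a, tagged_b):
        # Keys are unique per stream, so a match can only be a pair of
        # consecutive elements with the same key and different sources.
        if prev is not None and prev[0] == cur[0] and prev[1] != cur[1]:
            out.append((prev[2], prev[3], cur[2], cur[3], cur[0]))
        prev = cur
    return out


groups_a = [(0, 1, 2), (2, 4, 3), (5, 5, 11)]
groups_b = [(0, 1, 2), (2, 2, 3), (3, 3, 5), (4, 4, 6)]
matches = join_groups(groups_a, groups_b)
print(matches)
```

Because co-grouping has already removed duplicates, one comparison per neighbouring pair is enough and no backtracking over either input is needed.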

SLIDE 9

Advantages

  • Input agnostic

– Index-based
– Big data analytics

  • Stream processor

– FPGA-friendly

  • Modular design

– Novel building blocks
– Can be combined with other modules: H/W sorters, filters, ...

  • High-throughput design

– Scalable for future architectures
– Lower resources than related work

SLIDE 10

Evaluation on a heterogeneous system

  • Platform

– Zynq UltraScale+ device
– Operating system: Petalinux
– Communication: DMA transfers

  • Speedup of up to 3.1 times

– 1-port (H/W) vs 1-thread (S/W)

  • Input design space exploration

– Fraction of distinct keys (%)
– Fraction of key matches (%) (directly related to the output size)

  • Speedup variation factors

– CPU performance
– Length of the DMA transfers (CPU → FPGA)

[Figure: FPGA speedup (1.5 to 3) plotted against distinct keys in A, B (20 to 100 %) and output size (4096 to 16384 rows). Empty space: no more key matches than the number of distinct keys]

SLIDE 11

END

Thank you for your attention!

Source code for Ultra96: philippos.info/mergejoin