FPGA Acceleration for the Frequent Item Problem
Jens Teubner, René Müller, Gustavo Alonso (ETH Zurich, Systems Group)


SLIDE 1

FPGA Acceleration for the Frequent Item Problem

Jens Teubner, René Müller, Gustavo Alonso
ETH Zurich, Systems Group

SLIDE 2

not about a new solution to the frequent item problem
not (only) about FPGAs

2 / 17

SLIDE 3

Frequent Item Problem: Given a stream S of items xi, which items occur most often in S?

Solution [Metwally et al. 2006]:

foreach stream item x ∈ S do
    find bin bx with bx.item = x                    // lookup by item
    if such a bin was found then
        bx.count ← bx.count + 1
    else
        bmin ← bin with minimum count value         // lookup by count
        bmin.count ← bmin.count + 1
        bmin.item ← x
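The loop above (the Space-Saving algorithm of Metwally et al.) can be sketched in plain Python. This is an illustrative software sketch, not the talk's implementation; the class name `SpaceSaving` and its methods are ours:

```python
class SpaceSaving:
    """Space-Saving sketch: maintain k (item, count) bins; items that
    occur frequently in the stream end up with the largest counts."""

    def __init__(self, k):
        self.k = k          # number of bins (items monitored)
        self.bins = {}      # item -> count

    def update(self, x):
        if x in self.bins:                    # lookup by item: bin bx found
            self.bins[x] += 1
        elif len(self.bins) < self.k:         # a free bin is still available
            self.bins[x] = 1
        else:                                 # lookup by count: evict bmin
            b_min = min(self.bins, key=self.bins.get)
            count = self.bins.pop(b_min)
            self.bins[x] = count + 1          # new item inherits count + 1

    def top(self, n):
        """Return the n monitored items with the largest counts."""
        return sorted(self.bins.items(), key=lambda kv: -kv[1])[:n]
```

Note the two lookups per stream item (by item, then possibly by count): these are exactly the operations the FPGA design accelerates with content-addressable and dual-ported memory.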

3 / 17

SLIDE 4

[Plot: software throughput [million items / sec] (10 to 50) vs. number of items monitored (16 to 1024), for Zipf skew z = 0, 1, 1.5, 2, and ∞]

(Intel T9550 @ 2.66 GHz; code by Cormode and Hadjieleftheriou, VLDB 2008)

4 / 17

SLIDE 5

Tricks on FPGAs: content-addressable memory

“hash table on steroids”

dual-ported memory

min-heap maintenance speed-up

5 / 17

SLIDE 6

[Plot: throughput [million items / sec] (10 to 50) vs. number of items monitored (16 to 1024), hardware vs. software]

→ data dependent, not scalable

6 / 17

SLIDE 7

Lesson 1: FPGAs are not a silver bullet.

7 / 17

SLIDE 8

Idea: Parallelize

[Diagram: a coordinator broadcasts item xi to bins 1 … k; each bin compares its item against xi; a reduction over the comparison results and count values yields bx and bmin; the coordinator issues the update]

1. Broadcast input item xi to all bins.
2. Reduce to determine bx and bmin.
3. Update bx/bmin.
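The three steps above might look as follows in software (a minimal Python sketch; on the FPGA the k bin comparisons run in parallel, here they are simulated sequentially, and the function name is illustrative):

```python
def update_data_parallel(bins, x):
    """One stream step: broadcast x to all bins, reduce to find the
    matching bin bx and the minimum-count bin bmin, then update.
    bins is a list of [item, count] pairs (illustrative layout)."""
    # 1. Broadcast: every bin compares its stored item against x.
    matches = [b for b in bins if b[0] == x]
    if matches:
        # 2./3. Reduction found bx -> increment its count.
        matches[0][1] += 1
    else:
        # 2./3. Reduction found bmin -> evict it in favour of x.
        b_min = min(bins, key=lambda b: b[1])
        b_min[1] += 1
        b_min[0] = x
```

The reduction is the catch: its result must travel from every bin back to the coordinator, which on a chip means long wires, as the next slides show.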

8 / 17

SLIDE 9

[Plot: data-parallel FPGA throughput [million items / sec] (10 to 50) vs. number of items monitored (16 to 1024), for z = ∞, 1.5, and 0]

→ still not scalable

9 / 17

SLIDE 10

What went wrong?

[Diagram: the coordinator wired to bins 1 … k; the wires to distant bins grow with k]

Lesson 2: Avoid long-distance communication.

10 / 17

SLIDE 11

Can we keep processing local?

(avoid long-distance communication)

11 / 17

SLIDE 12

Pipeline-Style Processing:

[Diagram: a linear array of (item, count) bins … b_{i−1}, b_i, b_{i+1}, b_{i+2} …; item x1 travels along the array, testing b_i.item = x1 and b_i.count < b_{i+1}.count as it goes]

1. Compare input item x1 to the content of bin bi (and increment the count value if a match was found).
2. Order bins bi and bi+1 according to their count values.
3. Move x1 forward in the array and repeat.

→ Drop x1 into the last bin if no match can be found.
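The walk above can be sketched sequentially in Python (illustrative; in hardware each bin is a pipeline stage, and a bin that is swapped backwards past the travelling item is not re-examined — the same approximation applies in this sketch):

```python
def pipeline_insert(bins, x):
    """Walk item x through the bin array, pipeline-style.
    bins is a list of [item, count] pairs, kept roughly sorted in
    descending count order, so the last bin holds the minimum count."""
    for i in range(len(bins)):
        if bins[i][0] == x:            # 1. compare x to bin b_i
            bins[i][1] += 1            #    increment count on a match
            return
        # 2. order neighbours b_i, b_{i+1} by count (larger counts first)
        if i + 1 < len(bins) and bins[i][1] < bins[i + 1][1]:
            bins[i], bins[i + 1] = bins[i + 1], bins[i]
        # 3. x moves forward to the next bin
    # no match in any bin: drop x into the last (minimum-count) bin
    bins[-1] = [x, bins[-1][1] + 1]
```

Every step touches only a bin and its immediate neighbour, so all communication stays local, which is exactly what Lesson 2 asks for.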

12 / 17

SLIDE 13

Per-item cost: O(1) → O(#bins)? But: this can be parallelized well.

13 / 17

SLIDE 14

Pipeline Parallelism:

[Diagram: the same bin array with several items in flight at once: x2 is compared at bin b_{i−1} (b_{i−1}.item = x2) while x1 is compared at bin b_{i+1} (b_{i+1}.item = x1); the neighbour pairs (b_{i−1}, b_i) and (b_{i+1}, b_{i+2}) are reordered by count in parallel]

O(#bins) → (1/#bins) · O(#bins) = O(1) per item, with #bins items in flight

14 / 17

SLIDE 15

[Plot: throughput [million items / sec] (20 to 100) vs. number of items monitored (16 to 1024): software, FPGA (data parallel), FPGA (pipeline parallel)]

15 / 17

SLIDE 16

Lesson 3: Pipelining → scalability, performance.

16 / 17

SLIDE 17

Lessons learned:

1. FPGAs are not a silver bullet.
   A straightforward s/w → h/w mapping will not do the job.

2. Avoid long-distance communication.
   Signal propagation delays will limit scalability.

3. Pipelining → scalability, performance.
   Keep communication and synchronization cheap.

Frequent item solution:

three times faster than software, data independent.

This work was supported by the Swiss National Science Foundation.

17 / 17