SLIDE 1

Data Processing on Modern Hardware

Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2014

© Jens Teubner · Data Processing on Modern Hardware · Summer 2014

SLIDE 2

Part VII FPGAs for Data Processing

SLIDE 3

Motivation

Modern hardware features a number of "speed-up tricks": caches, instruction scheduling (out-of-order execution, branch prediction, ...), parallelism (SIMD, multi-core), throughput-oriented designs (GPUs). Combining these "tricks" is essentially an economic choice: chip space ≡ €€€, so chip space ↔ component selection ↔ workload.

SLIDE 4

Another Constraint: Power

Can use transistors for either logic or caches.

[Figure: trade-off between logic transistors (millions) and cache size (MB) under a fixed power budget (2008, 45 nm, 100 mm² chip); one balance point shown: 50 MT of logic and 6 MB of cache. Source: Borkar and Chien. The Future of Microprocessors. CACM 2011.]

→ Power consumption limits the amount of logic that can be put on a chip.

SLIDE 5

Heterogeneous Hardware

[Figure: three chip configurations (a)-(c) on an equal transistor budget: a homogeneous design of large 25 MT cores (total throughput 6); a homogeneous design of small 5 MT cores, with per-core throughput (5/25)^0.5 ≈ 0.45 by Pollack's Rule (total throughput 13); and a heterogeneous mix of one large core plus small cores (total throughput 11).]

SLIDE 6

Field-Programmable Gate Arrays

Field-Programmable Gate Arrays (FPGAs) are yet another point in the design space: "programmable hardware," where (some) design decisions are made after chip fabrication. Promises of FPGA technology: Build an application-/workload-specific circuit. Spend chip space only on functionality that you really need. Tune for throughput, latency, energy consumption, ... Overcome limits of general-purpose hardware with regard to the task at hand (e.g., I/O limits).

SLIDE 7

Field-Programmable Gate Arrays

An array of logic gates. Functionality fully programmable; re-programmable after deployment ("in the field") → "programmable hardware." FPGAs can be configured to implement any logic circuit; complexity is bounded by the available chip space. → Obviously, the effective chip space is less than in custom-fabricated chips (ASICs).

SLIDE 8

Basic FPGA Architecture

[Figure: 2D chip layout: a grid of CLBs in the center, IOBs around the periphery, and DCMs; all connected by the interconnect network.]

Chip layout: 2D array. Components: CLB: Configurable Logic Block ("logic gates"). IOB: Input/Output Block. DCM: Digital Clock Manager. Interconnect network: signal lines and configurable switch boxes.

SLIDE 9

Signal Routing

[Figure: signal routing: a programmable switch box sits at the intersection of bundles of lines; each programmable intersection point is a switch controlled by an SRAM cell.]

SLIDE 10

Configurable Logic Block (CLB)

[Figure: a CLB: inputs in0 .. in3 feed a 4-LUT backed by SRAM cells; the LUT output goes to a clocked D flip-flop, and a multiplexer (also controlled by an SRAM cell) selects between the combinational and the registered output.]

The 4-LUT implements an arbitrary {0, 1}⁴ → {0, 1} function; the flip-flop stores a single bit.
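The 4-LUT is just a 16-entry truth table held in SRAM cells. A minimal software sketch (class and method names are mine, purely illustrative):

```python
# A 4-input lookup table (4-LUT) is a 16-entry truth table stored in
# SRAM cells. This sketch configures a LUT and evaluates it, the way
# the FPGA tool flow would program the cells.

class LUT4:
    def __init__(self, truth_table):
        # truth_table: 16 bits, one output bit per input combination
        assert len(truth_table) == 16
        self.sram = list(truth_table)

    def eval(self, in0, in1, in2, in3):
        # the four inputs form the address into the SRAM cells
        addr = in0 | (in1 << 1) | (in2 << 2) | (in3 << 3)
        return self.sram[addr]

# configure the LUT to implement AND over all four inputs
and4 = LUT4([1 if addr == 0b1111 else 0 for addr in range(16)])
print(and4.eval(1, 1, 1, 1))  # 1
print(and4.eval(1, 0, 1, 1))  # 0
```

Any other 4-input function (XOR, majority, ...) is just a different 16-bit configuration.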

SLIDE 11

Programming FPGAs

Programming is usually done using a hardware description language, e.g., VHDL⁶ or Verilog: a high-level circuit description. The circuit description is compiled into a bitstream and loaded into the SRAM cells on the FPGA: VHDL → (synthesis) → netlist → (map, place & route) → bitstream → FPGA.

⁶ VHSIC Hardware Description Language
SLIDE 12

Example: VHDL

HDLs enable programming-language-like descriptions of hardware circuits:

architecture Behavioral of compare is
begin
  process (A, B)
  begin
    if ( A = B ) then
      C <= '1';
    else
      C <= '0';
    end if;
  end process;
end Behavioral;

VHDL can be synthesized, but also executed in software (simulation).

SLIDE 13

Real-World Hardware

[Figure: simplified Virtex-5 XC5VFXxxxT floor plan, including two CPU blocks.] Frequently used high-level components are provided in discrete silicon: BlockRAM (BRAM): a set of blocks that each store up to 36 kbit of data. DSP48 slices: 25×18-bit multipliers followed by a 48-bit accumulator. CPU: two full embedded PowerPC 440 cores.

SLIDE 14

Development Board with Virtex-5 FPGA

source: Xilinx Inc., ML50x Evaluation Platform. User Guide.

Virtex-5 XC5VLX110T
  Lookup Tables (LUTs): 69,120
  Block RAM: 5,328 kbit
  DSP48 Slices: 64
  PowerPC Cores: –
  max. clock speed: ≈ 450 MHz
  release year: 2006

The low-level speed of configurable gates is slower than in custom-fabricated chips (typical clock frequencies: ∼ 100 MHz). → Compensate with an efficient circuit for the problem at hand.

SLIDE 15

State Machines

The key asset of FPGAs is their inherent parallelism: chip areas naturally operate independently and in parallel. For example, consider finite-state automata. [Figure: non-deterministic automaton with states q0 .. q4 (transitions a, b, c, d; '*' self-loops) for the pattern .*abc.*d.]

SLIDE 16

State Machines

How would you implement an automaton in software? Problems with state-machine implementations in software: In non-deterministic automata, several states can be active at a time, which requires iterative execution on sequential hardware. Deterministic automata avoid this problem at the expense of a significantly higher state count.
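The software problem can be made concrete with a small sketch (my own Python rendering of the .*abc.*d automaton from the previous slide): several states are active at once, so every input symbol costs a loop over the active set.

```python
# Software NFA simulation for .*abc.*d (states q0..q4, a sketch):
# the active states are a set, and each input symbol requires
# iterating over that set on sequential hardware.

def nfa_step(active, sym):
    nxt = set()
    for q in active:
        if q == 0:
            nxt.add(0)                    # '*' self-loop on q0
            if sym == 'a': nxt.add(1)
        elif q == 1 and sym == 'b': nxt.add(2)
        elif q == 2 and sym == 'c': nxt.add(3)
        elif q == 3:
            nxt.add(3)                    # '*' self-loop on q3
            if sym == 'd': nxt.add(4)
        elif q == 4:
            nxt.add(4)                    # accepting state stays set
    return nxt

def matches(s):
    active = {0}
    for sym in s:
        active = nfa_step(active, sym)
    return 4 in active

print(matches('xxabcyyd'))  # True
print(matches('abdc'))      # False
```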

SLIDE 17

State Machines in Hardware

Automata can be translated mechanically into hardware circuits: each state → one flip-flop. (A flip-flop holds a single bit of information: just the right amount to keep the 'active'/'not active' information.) Transitions → signals ("wires") between states, conditioned on the current input symbol ('and' gate); multiple sources for one flip-flop input → 'or' gate.
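A sketch of this mechanical translation for the .*abc.*d automaton (Python standing in for the circuit; each tuple element models one flip-flop, and the next-state expressions are exactly the 'and'/'or' gates): all state bits update together in one "clock tick".

```python
# One flip-flop per state; the next-state equations mirror the
# 'and'/'or' gates of the circuit. All bits update in parallel
# per clock cycle (here: per call).

def clock_tick(ff, sym):
    q0, q1, q2, q3, q4 = ff
    return (
        q0,                                   # '*' self-loop on q0
        q0 and sym == 'a',                    # and-gate: input = a
        q1 and sym == 'b',                    # and-gate: input = b
        (q2 and sym == 'c') or q3,            # or-gate: entry or self-loop
        (q3 and sym == 'd') or q4,            # or-gate: accept is sticky
    )

def run(s):
    ff = (1, 0, 0, 0, 0)                      # only q0 active at reset
    for sym in s:
        ff = clock_tick(ff, sym)
    return bool(ff[4])

print(run('abczzd'))  # True
print(run('abc'))     # False
```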

SLIDE 18

State Machines in Hardware

[Figure: the .*abc.*d automaton as a circuit: flip-flops FF q0 .. FF q4; 'and' gates test input = a, input = b, input = c, input = d on the transitions, and 'or' gates merge multiple sources (self-loops and incoming edges) at a flip-flop's input.]

SLIDE 19

[Figure: flip-flop and LUT consumption (in % of chip resources) for NFA, DFA, and compressed-DFA implementations of the patterns (0|1)*1(0|1)^i, for i = 1 .. 10.]

SLIDE 20

Use Case: Network Intrusion Detection

Analyze network traffic using regular expressions. Scan for known attack tools. Prevent exploitation of known security holes. Scan for shell code. E.g., Snort (http://www.snort.org/) → Hundreds of (regular expression-based) rules. Idea: Instantiate a hardware state machine for each rule. → Leverage available hardware parallelism. → Challenge: optimize for high throughput.

SLIDE 21

Predicate Decoding

Optimization 1: Centralized character classification

[Figure: a centralized character decoder classifies the input symbol once; the decoded signals a .. d then drive the 'and' gates at the flip-flops FF q0 .. FF q4.]

→ Optimizes for space, not for speed. Character/predicate decoder: use FPGA logic resources, or use on-chip BRAM (configure it as ROM and use it as a lookup table).

SLIDE 22

Predicate Decoding Factored Out

[Figure: resource consumption (% of LUTs and slices) for patterns (A|B)^i, i = 50 .. 250, with and without the shared character decoder.]

SLIDE 23

Signal Propagation Delay

Signal propagation delays determine a circuit's speed. Here: one state transition per clock cycle. Longest signal path → maximum clock frequency. [Figure: clock waveform; a register's input must be stable at the rising clock edge (it may be undefined in between); the register is written at the rising edge.]

SLIDE 24

Propagation Delays and Many State Machines

Straightforward design with many rules and one input:

[Figure: six NFAs side by side, all fed by one input line and all merged ('or') into one output; the shared input and output wires span the whole chip.]

SLIDE 25

Pipelining

Optimization 2: Pipelining → What matters is the longest path between any two registers (flip-flops).

[Figure: the same six NFAs, with pipeline registers inserted along the longest input/output paths.]

→ Introduce pipeline registers. → Flip side of the idea?

SLIDE 26

Pipelining in Practice

[Figure: pipelined architecture from Yang et al. Compact Architecture for High-Throughput Regular Expression Matching on FPGA. ANCS 2008.]

SLIDE 27

Multi-Character Matching

In a finite state automaton, the state s_{i+1} at step i+1 depends on the previous state s_i, the input symbol σ_i, and a transition function f:

  s_{i+1} = f(s_i, σ_i).

Consequently:

  s_{i+2} = f(s_{i+1}, σ_{i+1}) = f(f(s_i, σ_i), σ_{i+1}).

That is, with the help of a new transition function

  F(s_i, σ_i, σ_{i+1}) := f(f(s_i, σ_i), σ_{i+1}),

an automaton can accept two input symbols per clock cycle.
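The composition can be sketched directly in code (a toy DFA of my own; the slide's argument is independent of the concrete automaton):

```python
# Two-symbol transition function F(s, a, b) = f(f(s, a), b):
# the automaton consumes two input symbols per step.

f = {  # toy DFA for "contains 'ab'": states 0, 1 (seen 'a'), 2 (accept)
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 1, (1, 'b'): 2,
    (2, 'a'): 2, (2, 'b'): 2,
}

def F(s, s1, s2):
    # apply f twice within one "clock cycle"
    return f[(f[(s, s1)], s2)]

def run2(word):
    assert len(word) % 2 == 0
    s = 0
    for i in range(0, len(word), 2):
        s = F(s, word[i], word[i + 1])
    return s == 2

print(run2('baab'))  # True: contains "ab"
print(run2('bbaa'))  # False
```

In hardware, F is simply two copies of the combinational logic for f wired in series, which is exactly the trade-off discussed on the next slide.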

SLIDE 28

Multi-Character Encoding

In hardware: [Figure: single-character version: state register → f → state register; two-character version: two copies of f in series consume two input characters per cycle.] Trade-off: space ↔ performance; longer signal paths.

SLIDE 29

Putting it Together (Snort Workload)

[Figure: throughput/area results for the full Snort workload, from Yang et al. Compact Architecture for High-Throughput Regular Expression Matching on FPGA. ANCS 2008. (Virtex-4 LX100; ≈ 100k 4-LUTs; ≈ 100k flip-flops)]

SLIDE 30

Use Case: XML Projection

Example:

for $i in //regions//item
return <item>
         { $i/name }
         <num-categories>
           { count ($i/incategory) }
         </num-categories>
       </item>

Projection paths: { //regions//item, //regions//item/name # (keep descendants), //regions//item/incategory }

Challenge: avoid explicit synthesis for each query.

SLIDE 31

Advantage: FPGA System Integration

Here: in-network filtering. [Figure: server → XML → FPGA → filtered XML → client.] In general: the FPGA sits in the data path (disk → CPU, memory → CPU, ...).

SLIDE 32

XPath → Finite State Automata

Automaton for //a/b/c//d: [Figure: states q0 .. q4 with transitions a, b, c, d and '*' self-loops, as on the earlier slides.] In hardware (see also the earlier slides): a tag decoder classifies XML tags; the decoded signals a .. d drive the flip-flops FF q0 .. FF q4, implementing root()/desc::a/child::b/child::c/desc::d.

SLIDE 33

Compilation to Hardware

[Figure: compiling an XPath expression (/a//b) into a hardware FSM and then into an FPGA bitstream takes several hours!]

SLIDE 34

Skeleton Automaton

Separate the difficult parts from the latency-critical ones:

Static part (off-line): compile a generic skeleton automaton and load it onto the FPGA.
Dynamic part (runtime): map the user query (e.g., /a//b) to configuration parameters of the skeleton.

SLIDE 35

Skeleton Automaton

Thus: build a skeleton automaton that can be parameterized to implement any projection query.

[Figure: XML parser → skeleton segments seg_1 .. seg_n (passing "cooked XML") → serializer, with RAM attached; the segments form the skeleton automaton (NFA), turning XML into filtered XML.]

Intuitively: the runtime configuration determines the presence of *.

SLIDE 36

Again: Pipelining

[Figure: a chain of segments seg0 .. seg5 with a start signal and the input stream; in the pipelined variant, pipeline registers are inserted between the segments.]

→ Side effect: can support the self and descendant-or-self axes.

SLIDE 37

Scalability

[Figure: clock frequency (MHz) vs. number of segment matchers n (100 .. 600), for no BRAM sharing, 2-way BRAM sharing, and 3-way BRAM sharing.]

SLIDE 38

Application Speedup

[Figure: improvement/speedup in parse time, execution time, and memory consumption for XMark queries Q1 .. Q10.]

→ Jens Teubner, Louis Woods, and Chongling Nie. Skeleton Automata for FPGAs: Reconfiguring without Reconstructing. SIGMOD 2012.

SLIDE 39

Skyline Queries

Problem: find the Pareto-optimal set of multi-dimensional data points. x dominates y (x ≺ y) iff for every dimension i: x_i ≤ y_i, and for at least one dimension j: x_j < y_j. Skyline points: all y not dominated by any x. [Figure: 2D point set p1 .. p7 with its skyline highlighted.] → Parallelize; keep on-chip routing distances short.
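The dominance definition translates directly into code (a naive O(n²) software sketch, not the FPGA design):

```python
# Dominance test and a naive skyline computation following the
# definition above (smaller is better in every dimension).

def dominates(x, y):
    # x ≺ y: x_i <= y_i in every dimension, strictly < in at least one
    return (all(a <= b for a, b in zip(x, y))
            and any(a < b for a, b in zip(x, y)))

def skyline(points):
    # keep every point that no other point dominates
    return [y for y in points if not any(dominates(x, y) for x in points)]

pts = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)]
print(skyline(pts))  # [(1, 5), (2, 2), (4, 1)]
```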

SLIDE 40

“Lemming’s Got Talent”

→ Lemmings have multiple skills (dimensions). → Determine the "best" Lemmings. Let Lemmings battle on a narrow bridge:

[Figure: a queue of challengers q_i facing p_0 on the bridge; dominated Lemmings fall off; undecided challengers requeue.]

p_0 dominates q_i → q_i falls off the bridge. q_i dominates p_0 → p_0 falls off the bridge; q_i becomes the new p_0. Battle undecided → q_i requeues. A Lemming that has survived a full round is a "skyline Lemming."

SLIDE 41

“Lemming’s Got Talent”—Second Year

To speed up the process, let a set of p_j stay on the bridge:

[Figure: challengers q_i from the queue battle the window [p_0, p_{w−1}] on the bridge; dominated Lemmings fall off, undecided challengers requeue.]

→ Challengers q_i fight against multiple p_j in turn. → q_i and/or multiple p_j might fall off the bridge. → Keep surviving q_i on the bridge if there is space; otherwise requeue. → This is the standard Block Nested Loops (BNL) algorithm.

SLIDE 42

1   foreach Lemming qi ∈ queue do
2     isDominated = false;
3     foreach Lemming pj ∈ bridge do
4       if qi.timestamp > pj.timestamp then
5         bridge.movetoskyline(pj);   /* pj ∈ Lemming skyline */
6       else if qi ≺ pj then
7         bridge.drop(pj);
8       else if pj ≺ qi then
9         isDominated = true;
10        break;
11    if not isDominated then
12      timestamp(qi);
13      if bridge.isFull() then
14        queue.insert(qi);
15      else
16        bridge.insert(qi);
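A software rendering of the pseudocode above (helper names and the exact timestamp handling are my interpretation; a sketch, not the reference implementation):

```python
# Block Nested Loops (BNL): bridge = bounded window, queue = input
# plus requeued tuples. A tuple whose timestamp is older than the
# current challenger's has survived a full round -> skyline.
from collections import deque

def dominates(x, y):
    return (all(a <= b for a, b in zip(x, y))
            and any(a < b for a, b in zip(x, y)))

def bnl(points, window_size):
    queue = deque((p, None) for p in points)   # (tuple, timestamp)
    bridge, skyline, now = [], [], 0
    while queue:
        q, q_ts = queue.popleft()
        dominated = False
        for p, p_ts in list(bridge):
            if q_ts is not None and q_ts > p_ts:
                # p was on the bridge before q requeued: full round
                bridge.remove((p, p_ts)); skyline.append(p)
            elif dominates(q, p):
                bridge.remove((p, p_ts))       # p falls off the bridge
            elif dominates(p, q):
                dominated = True; break
        if not dominated:
            now += 1
            if len(bridge) >= window_size:
                queue.append((q, now))         # requeue with timestamp
            else:
                bridge.append((q, now))
    skyline.extend(p for p, _ in bridge)       # survivors are skyline
    return skyline

pts = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)]
print(sorted(bnl(pts, window_size=2)))  # [(1, 5), (2, 2), (4, 1)]
```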

SLIDE 43

Block Nested Loops Algorithm

Design goal of BNL: eliminate the I/O bottleneck. [Figure: number of comparisons and number of overflow tuples vs. window size (4 .. 256 tuples).] → The compute load remains (mostly) unchanged.

SLIDE 44

“Lemming’s Got Talent”—Third Year

Let multiple (pairs of) Lemmings battle in parallel.

[Figure: challengers q_i enter the bridge from the left; potential skyline Lemmings p_j move from right to left; challengers requeue at the right end.]

Challengers q_i move from left to right. Potential skyline Lemmings p_j move from right to left. Either can fall off the bridge if dominated. At the right end, challengers become potential skyline Lemmings (if there is space on the bridge); otherwise they requeue.

SLIDE 45

Parallel BNL with FPGAs

Parallel battles can be realized on distinct processing nodes ν_i. [Figure: a chain of nodes; challengers q_i flow left to right, skyline candidates flow right to left.] Nodes form a list where ν_j only communicates with ν_{j−1} and ν_{j+1}. → Challengers q_i are forwarded from left to right. → Potential skyline tuples are forwarded from right to left. Effectively, q_i scans over the current window (as in BNL). Trick: causality still holds; q_i "sees" the effect of any preceding challenger, but not of any following challenger.

SLIDE 46

Implementation

Let all ν_i operate in lock-step.⁷ Process in two alternating phases:

1 Evaluate: compute dominance; drop tuples if need be.
2 Shift: exchange data ("Lemmings") between nodes.

In practice, exchanging tuples is more tricky: for high-dimensional data, tuples can be passed only one dimension at a time. [Figure: a multi-step shift sequence that moves a tuple's dimensions 1 .. 3 between neighboring nodes one dimension per step.]

⁷ We tried to avoid lock-step operation when we did "handshake joins" on multi-core hardware, because of the high synchronization cost. But on FPGAs it is really cheap.

SLIDE 47

Experiments

Randomly distributed data; seven dimensions (1.48 % skyline density). [Figure: throughput (tuples/sec) vs. window size (4 .. 256) for BNL in software and BNL on the FPGA: 2.28 M tuples/sec (0.45 sec execution time) vs. 0.23 M tuples/sec (4.37 sec execution time).]

SLIDE 48

Experiments

Correlated data; seven dimensions (0.013 % skyline density). [Figure: throughput (tuples/sec) vs. window size for BNL in software and BNL on the FPGA: 41 M tuples/sec (25 ms execution time) vs. 17 M tuples/sec (61 ms execution time).] → The FPGA is bottlenecked by the memory interface of the particular FPGA board.

SLIDE 49

Experiments

Anti-correlated data; seven dimensions (19.8 % skyline density). [Figure: throughput (tuples/sec) vs. window size for BNL in software and BNL on the FPGA: 32 K tuples/sec (32 sec execution time) vs. 1.8 K tuples/sec (579 sec execution time).] → The benefit of the FPGA solution is greatest when it is most needed (i.e., when running times are very high).

SLIDE 50

The Frequent Item Problem

Problem: Given an input stream S, which items in S occur most often? An exact solution is too expensive (O(min{|S|, |A|}) space). Good approximate solutions are available: Space-Saving by Metwally et al. In-depth study: Cormode and Hadjieleftheriou (VLDB 2008).

SLIDE 51

Space-Saving (Metwally et al., TODS 2006)

Space-Saving tries to “monitor” only items that are frequent.

1   foreach stream item x ∈ S do
2     find bin bx with bx.item = x;               /* lookup by item */
3     if such a bin was found then
4       bx.count ← bx.count + 1;
5     else
6       bmin ← bin with minimum count value;      /* lookup by count */
7       bmin.count ← bmin.count + 1;
8       bmin.item ← x;

Main complexity: look up the bin that monitors the input item x; find the bin with the minimum count value.
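A compact software sketch of Space-Saving (a plain dict stands in for the bins; the original uses the stream-summary structure to make the minimum lookup O(1)):

```python
# Space-Saving with k bins: monitored items keep exact-ish counts;
# an unmonitored item evicts the current minimum and inherits its
# count (plus one), which bounds the overestimation error.

def space_saving(stream, k):
    bins = {}                                # item -> count
    for x in stream:
        if x in bins:
            bins[x] += 1                     # lookup by item
        elif len(bins) < k:
            bins[x] = 1                      # free bin available
        else:
            bmin = min(bins, key=bins.get)   # lookup by count
            cnt = bins.pop(bmin)
            bins[x] = cnt + 1                # new item inherits min count
    return bins

print(space_saving('abababcad', k=2))
```

An invariant worth noting: the counts always sum to the number of items processed so far.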

SLIDE 52

Space-Saving in Software

[Figure: software throughput (million items/sec) vs. number of items monitored (16 .. 1024) for skew parameters z = 0, 1, 1.5, 2, ∞; throughput drops as data dependence increases. Code by Cormode and Hadjieleftheriou; Intel Core2 Duo, 2.66 GHz.]

SLIDE 53

Data-Parallel Frequent Item on FPGAs

Idea: use the available (data) parallelism to make both searches efficient. Perform all item searches in parallel: [Figure: the input item x_i is compared against all bins b_0 .. b_4 simultaneously.] Find the bin with the minimum count using a tree: [Figure: binary tree of 'min' comparators over 8 bins.]
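The comparator tree can be sketched in software (a sketch assuming a power-of-two number of bins; each while-iteration corresponds to one hardware comparator level):

```python
# Binary 'min' comparator tree: the minimum-count bin among 8 bins
# is found in log2(8) = 3 comparator levels. In hardware all
# comparators of one level operate in parallel.

def min_tree(counts):
    level = list(enumerate(counts))        # (bin index, count) pairs
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):  # one comparator per pair
            a, b = level[i], level[i + 1]
            nxt.append(a if a[1] <= b[1] else b)
        level = nxt                        # one hardware level per round
    return level[0]                        # (index, count) of the minimum

print(min_tree([5, 3, 7, 2, 9, 4, 6, 8]))  # (3, 2)
```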

SLIDE 54

Evaluation

[Figure: throughput (million items/sec) vs. number of items monitored (16 .. 1024): software vs. the parallel-unit-access FPGA design.] Problem: increasing signal propagation delays.

Teubner, Müller, and Alonso. FPGA Acceleration for the Frequent Item Problem. ICDE 2010.

SLIDE 55

Don’t Think in Software

Organize monitored items as an array (→ keep things local). [Figure: bins b_{i−1}, b_i, b_{i+1}, b_{i+2}, each holding an (item, count) pair; the input item x_1 is compared to b_i.item, and b_i.count is compared to b_{i+1}.count.]

1 Compare the input item x_1 to the content of bin b_i (and increment the count value if a match was found).
2 Order bins b_i and b_{i+1} according to their count values.
3 Move x_1 forward in the array and repeat.

→ Drop x_1 into the last bin if no match can be found.

SLIDE 56

Pipelining

The idea seems terribly inefficient: O(# bins) vs. O(log(# bins)). But: all sub-tasks are simple, and all processing stays local. Thus, the processing of multiple input items can be parallelized. [Figure: two items x_1 and x_2 traverse the bin array at the same time, a few bins apart; the item and count comparisons for both happen in the same cycle.] Multiple input items x_i can traverse this pipeline if they keep sufficient distance.

SLIDE 57

Algorithm

1   foreach stream item x ∈ S do
2     i ← 1;
3     while i < k do
4       if bi.item = x then
5         bi.count ← bi.count + 1;
6         continue foreach;
7       else if bi.count < bi+1.count then
8         swap contents of bi and bi+1;
9       else
10        i ← i + 1;
      /* replace last bin if x was not found */
11    bk.count ← bk.count + 1;
12    bk.item ← x;
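A sequential software rendering of this algorithm (list indices stand in for the bins; on the FPGA, the steps of different input items overlap in the pipeline):

```python
# Array-based frequent-item algorithm: each input item walks the
# bin array, bubbling lower-count bins forward to keep the array
# ordered; an unmatched item replaces the last bin.

def frequent_items(stream, k):
    bins = [[None, 0] for _ in range(k)]   # [item, count] pairs
    for x in stream:
        i = 0
        while i < k - 1:
            if bins[i][0] == x:
                bins[i][1] += 1            # found: bump the count
                break                      # next stream item
            elif bins[i][1] < bins[i + 1][1]:
                # order bins by count (the swap keeps processing local)
                bins[i], bins[i + 1] = bins[i + 1], bins[i]
            else:
                i += 1                     # move x forward in the array
        else:
            # x reached the end unmatched: replace the last bin
            bins[k - 1] = [x, bins[k - 1][1] + 1]
    return bins

print(frequent_items('aabab', k=3))  # [['a', 3], ['b', 2], [None, 0]]
```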

SLIDE 58

Evaluation

[Figure: throughput (million items/sec) vs. number of items monitored / array length (16 .. 1024): the pipelined design compared with the software implementation and the parallel-unit-access design.]

Teubner, Müller, and Alonso. FPGA Acceleration for the Frequent Item Problem. ICDE 2010.