Data Processing on Modern Hardware
Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2014
© Jens Teubner · Data Processing on Modern Hardware · Summer 2014
Part VII
FPGAs for Data Processing
[Figure: total power dissipation (watts) and cache size (MB) as a function of logic transistors (millions) for a 2008, 45 nm, 100 mm² die. Marked design points: Case A (0 logic, 8 W; 16 MB of cache), Case B, and Case C (50 MT logic, 6 MB cache).]
Source: Borkar and Chien. The Future of Microprocessors. CACM 2011.
[Figure: three ways to spend the same transistor budget.
(a) Large-core homogeneous: 25 MT cores, large-core throughput 1, total throughput 6.
(b) Small-core homogeneous: 5 MT cores, small-core throughput by Pollack's Rule (5/25)^0.5 = 0.45, total throughput 13.
(c) Large cores (25 MT, throughput 1) together with small cores (5 MT, throughput 0.45), total throughput 11.]
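The throughput figures quoted for Pollack's Rule can be checked with a few lines of arithmetic. The sketch below is only an illustration: the core counts (6 large cores, 30 small cores, and 2 large plus 20 small cores) are assumptions inferred from a 150 MT budget and the numbers visible in the figure, and the function names are made up for this example.

```python
import math

LARGE_MT = 25  # transistors per large core (millions), as in the figure
SMALL_MT = 5   # transistors per small core (millions), as in the figure


def core_throughput(transistors_mt, baseline_mt=LARGE_MT):
    """Pollack's Rule: performance scales with the square root of complexity."""
    return math.sqrt(transistors_mt / baseline_mt)


def total_throughput(n_large, n_small):
    """Sum of per-core throughputs, relative to one large core."""
    return (n_large * core_throughput(LARGE_MT)
            + n_small * core_throughput(SMALL_MT))


# Assumed core counts for a 150 MT budget (not stated explicitly in the text).
configs = {"(a)": (6, 0), "(b)": (0, 30), "(c)": (2, 20)}
for name, (n_large, n_small) in configs.items():
    print(name, round(total_throughput(n_large, n_small), 1))
```

Under these assumptions the three totals come out close to the 6, 13, and 11 shown in the figure.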
[Figure: FPGA layout with an array of CLBs (configurable logic blocks), IOBs (I/O blocks), and DCMs (digital clock managers).]
[Figure: SRAM cell]
⁶ VHSIC Hardware Description Language (VHDL)
Source: Xilinx Inc., ML50x Evaluation Platform. User Guide.

Virtex-5 XC5VLX110T
    Lookup Tables (LUTs)    69,120
    Block RAM (kbit)        5,328
    DSP48 Slices            64
    PowerPC Cores           (value missing)
    Clock frequency         ≈ 450 MHz
    Release year            2006
(A flip-flop holds a single bit of information. Just the right amount to keep the ‘active’/‘not active’ information.)
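[Figure: chain of flip-flops q0 → q1 → q2 → q3 → q4. The transition into each next flip-flop is an AND of the previous flip-flop's output with a comparator on the input: "input = a", "input = b", "input = c", "input = d".]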
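The flip-flop chain q0 … q4 with its "and input = a/b/c/d" gates can be mimicked in software to see how one bit per state is enough. This is a minimal sketch, not the hardware design itself: the function names are hypothetical, and it assumes that q0 is re-activated in every cycle so that a match may start at any input position.

```python
def step(active, symbol, pattern):
    """One clock cycle: compute the next flip-flop contents from the current ones."""
    nxt = [False] * len(active)
    nxt[0] = True  # assumption: the start state q0 is always active
    for i, expected in enumerate(pattern):
        # AND gate: q(i+1) becomes active iff q(i) is active and input == expected
        nxt[i + 1] = active[i] and symbol == expected
    return nxt


def matches(stream, pattern="abcd"):
    """Return the input positions at which the last flip-flop is active."""
    active = [True] + [False] * len(pattern)  # one bit ("flip-flop") per state
    hits = []
    for pos, symbol in enumerate(stream):
        active = step(active, symbol, pattern)
        if active[-1]:
            hits.append(pos)
    return hits


print(matches("xxabcdabcd"))
```

All next-state bits are computed from the old state vector only, which mirrors the way every flip-flop in the circuit is updated on the same clock edge.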
[Figure: flip-flop chain q0 … q4 with a central decoder ("decod.") whose outputs a … d feed the AND gates between consecutive flip-flops.]
Yang et al. Compact Architecture for High-Throughput Regular Expression Matching on FPGA. ANCS 2008.
[Figure: XML input feeds a tag decoder ("tag decod.") whose outputs a … d drive the AND gates of the flip-flop chain q0 … q4, for the path expression root()/desc::a/child::b/child::c/desc::d.]
↗ Jens Teubner, Louis Woods, and Chongling Nie. Skeleton Automata for FPGAs: Reconfiguring without Reconstructing. SIGMOD 2012.
[Figure: tuples qi, q(i+1), … from the queue are compared against a single window tuple p0; labels: "dominated", "requeue".]
[Figure: the same setup with a window of w tuples [p0, p(w−1)]; labels: "dominated", "requeue".]
foreach Lemming qi ∈ queue do
    isDominated = false;
    foreach Lemming pj ∈ bridge do
        if qi.timestamp > pj.timestamp then
            bridge.movetoskyline(pj);    /* pj ∈ Lemming skyline */
        else if qi ≺ pj then
            bridge.drop(pj);
        else if pj ≺ qi then
            isDominated = true;
            break;
    if not isDominated then
        timestamp(qi);
        if bridge.isFull() then
            queue.insert(qi);
        else
            bridge.insert(qi);
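The queue/bridge loop above can be tried out in software. The following is a minimal sketch under several assumptions that are not part of the pseudocode: smaller values are better in every dimension, timestamps are drawn from an integer counter (fresh tuples carry timestamp 0), requeued tuples go to the back of the queue, and whatever is left on the bridge when the queue runs empty belongs to the skyline. The names `dominates`, `bnl_skyline`, and `window_size` are hypothetical.

```python
from collections import deque


def dominates(p, q):
    """p ≺ q: p is no worse than q everywhere and strictly better somewhere."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def bnl_skyline(points, window_size):
    queue = deque((tuple(p), 0) for p in points)  # (tuple, timestamp)
    bridge = []    # the window, at most window_size entries
    skyline = []
    clock = 0      # assumption: timestamps are a simple counter
    while queue:
        q, q_ts = queue.popleft()
        is_dominated = False
        survivors = []
        for idx, (p, p_ts) in enumerate(bridge):
            if q_ts > p_ts:
                skyline.append(p)          # p has met every remaining tuple
            elif dominates(q, p):
                continue                   # drop p from the bridge
            elif dominates(p, q):
                is_dominated = True
                survivors.extend(bridge[idx:])  # keep p and the unvisited rest
                break
            else:
                survivors.append((p, p_ts))
        bridge = survivors
        if not is_dominated:
            clock += 1                     # timestamp(q)
            if len(bridge) >= window_size:
                queue.append((q, clock))   # bridge full: requeue
            else:
                bridge.append((q, clock))
    # Assumption: tuples still on the bridge at the end are skyline tuples.
    skyline.extend(p for p, _ in bridge)
    return skyline


points = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (6, 6)]
print(sorted(bnl_skyline(points, window_size=2)))
```

The timestamp test is what lets a bridge tuple leave early: once a requeued tuple with a newer timestamp arrives, every tuple that could still dominate the older bridge entry has already been compared against it.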
[Figure: tuples qj … q(i+w−1) from the queue are compared against window tuples pk; labels: "queue", "requeue".]
1 Evaluate: Compute dominance; drop tuples if need be.
2 Shift: Exchange data (“Lemmings”) between nodes.
⁷ We tried to avoid this when we did “handshake joins” on multi-core hardware, because of the high synchronization cost. But on FPGAs this is really cheap.
[Figure: BNL Software vs. BNL FPGA. Annotated data points: 2.28 M tuples/sec (0.45 sec exec. time) and 0.23 M tuples/sec (4.37 sec exec. time).]
[Figure: BNL Software (CPU) vs. BNL FPGA. Annotated data points: 41 M tuples/sec (25 ms exec. time) and 17 M tuples/sec (61 ms exec. time).]
[Figure: BNL Software vs. BNL FPGA. Annotated data points: 32 K tuples/sec (32 sec exec. time) and 1.8 K tuples/sec (579 sec exec. time).]
foreach stream item x ∈ S do
    [loop body, lines 2 to 8, not preserved]
Teubner, Müller, and Alonso. FPGA Acceleration for the Frequent Item Problem. ICDE 2010.
1 Compare input item x1 to content of bin bi (and increment count value if a match was found).
2 Order bins bi and b(i+1) according to count values.
3 Move x1 forward in the array and repeat.
foreach stream item x ∈ S do
    [loop body, lines 2 to 12, not preserved]