
Data Processing on Modern Hardware. Jens Teubner, TU Dortmund, DBIS



  1. Data Processing on Modern Hardware
     Jens Teubner, TU Dortmund, DBIS Group, jens.teubner@cs.tu-dortmund.de
     Summer 2014
     © Jens Teubner · Data Processing on Modern Hardware · Summer 2014

  2. Part VII: FPGAs for Data Processing

  3. Motivation
     Modern hardware features a number of “speed-up tricks”: caches, instruction scheduling (out-of-order execution, branch prediction, ...), parallelism (SIMD, multi-core), and throughput-oriented designs (GPUs).
     Combining these “tricks” is essentially an economic choice:
     → chip space ≡ €€€
     → chip space ↔ component selection ↔ workload

  4. Another Constraint: Power
     Transistors can be used for either logic or caches.
     [Figure: total power (Watts) and cache size (MB) plotted against logic transistors (millions) for a 2008, 45nm, 100mm² die. Case A: 0 logic, 16MB of cache, 8W. Case C: 50MT logic, 6MB cache. Case B lies in between.]
     Source: Borkar and Chien. The Future of Microprocessors. CACM 2011.
     → Power consumption limits the amount of logic that can be put on a chip.

  5. Heterogeneous Hardware
     [Figure: three ways to build cores from the same logic transistor budget, with large cores of 25MT and small cores of 5MT.
     (a) Large-core homogeneous: 6 large cores, large-core throughput 1, total throughput 6.
     (b) Small-core homogeneous: 30 small cores, small-core throughput by Pollack’s Rule (5/25)^0.5 = 0.45, total throughput 13.
     (c) Heterogeneous: 2 large cores plus 20 small cores, total throughput 11.]
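
The totals in the figure can be re-derived from Pollack's Rule (performance scales roughly with the square root of core size). The following is a minimal sketch, assuming the core counts 6, 30, and 2 + 20, which correspond to a logic budget of 150MT; the function names are illustrative and not from the slides.

```python
import math

LARGE_MT = 25  # transistors per large core (millions), as on the slide
SMALL_MT = 5   # transistors per small core (millions), as on the slide


def relative_throughput(core_mt, reference_mt=LARGE_MT):
    """Pollack's Rule: throughput relative to a reference core of size reference_mt."""
    return math.sqrt(core_mt / reference_mt)


def total_throughput(n_large, n_small):
    """Sum of per-core throughputs for a mix of large and small cores."""
    return (n_large * relative_throughput(LARGE_MT)
            + n_small * relative_throughput(SMALL_MT))


# Assumed core counts (large, small) for the three panels of the figure.
configs = {"a": (6, 0), "b": (0, 30), "c": (2, 20)}
totals = {name: total_throughput(*cfg) for name, cfg in configs.items()}

for name, value in totals.items():
    print(f"({name}) total throughput = {value:.1f}")
```

Under these assumptions the small cores deliver more aggregate throughput per transistor, while the mixed design keeps a few large cores for work that does not parallelize.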

  6. Field-Programmable Gate Arrays
     Field-Programmable Gate Arrays (FPGAs) are yet another point in the design space.
     “Programmable hardware”: make (some) design decisions after chip fabrication.
     Promises of FPGA technology:
     - Build an application-/workload-specific circuit.
     - Spend chip space only on functionality that you really need.
     - Tune for throughput, latency, energy consumption, ...
     - Overcome limits of general-purpose hardware with regard to the task at hand (e.g., I/O limits).

  7. Field-Programmable Gate Arrays
     - An array of logic gates
     - Functionality fully programmable
     - Re-programmable after deployment (“in the field”)
     → “programmable hardware”
     FPGAs can be configured to implement any logic circuit. Complexity is bound by the available chip space.
     → Obviously, the effective chip space is less than in custom-fabricated chips (ASICs).

  8. Basic FPGA Architecture
     Chip layout: 2D array.
     [Figure: a grid of CLBs surrounded by IOBs, with DCMs at the edge.]
     Components:
     - CLB: Configurable Logic Block (“logic gates”)
     - IOB: Input/Output Block
     - DCM: Digital Clock Manager
     Interconnect network:
     - signal lines
     - configurable switch boxes

  9. Signal Routing
     [Figure: a programmable switch box joins bundles of lines. Each programmable intersection point is a programmable switch with a memory cell (SRAM cell).]

  10. Configurable Logic Block (CLB)
      [Figure: a 4-LUT with inputs in0 to in3 feeds a D flip-flop driven by the clock; a multiplexer selects the output. LUT contents and multiplexer setting are held in SRAM cells.]
      - 4-LUT: implements a {0,1}^4 → {0,1} function
      - Flip-flop: stores a single bit
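
A software model can make the CLB structure concrete. The sketch below is only an illustration under simplifying assumptions: the 4-LUT is a 16-bit truth table indexed by the four inputs, and a configuration flag plays the role of the output multiplexer. The class names `Lut4` and `Clb` are hypothetical and not taken from any vendor tool.

```python
class Lut4:
    """4-input lookup table: 16 configuration bits encode any {0,1}^4 -> {0,1} function."""

    def __init__(self, truth_table):
        if not 0 <= truth_table < (1 << 16):
            raise ValueError("truth table must fit in 16 bits")
        self.truth_table = truth_table

    @classmethod
    def from_function(cls, fn):
        """Fill the 16 table entries by evaluating fn on every input combination."""
        table = 0
        for idx in range(16):
            bits = [(idx >> k) & 1 for k in range(4)]
            if fn(*bits):
                table |= 1 << idx
        return cls(table)

    def __call__(self, in0, in1, in2, in3):
        idx = in0 | (in1 << 1) | (in2 << 2) | (in3 << 3)
        return (self.truth_table >> idx) & 1


class Clb:
    """Simplified CLB: a 4-LUT, a D flip-flop, and a multiplexer choosing between them."""

    def __init__(self, lut, registered):
        self.lut = lut
        self.registered = registered  # multiplexer setting (one configuration bit)
        self.ff = 0                   # flip-flop content, a single bit

    def clock(self, in0, in1, in2, in3):
        """Rising clock edge: the flip-flop captures the current LUT output."""
        self.ff = self.lut(in0, in1, in2, in3)

    def output(self, in0, in1, in2, in3):
        return self.ff if self.registered else self.lut(in0, in1, in2, in3)


and4 = Lut4.from_function(lambda a, b, c, d: a & b & c & d)
parity = Lut4.from_function(lambda a, b, c, d: a ^ b ^ c ^ d)
```

Changing the 16 table bits changes the implemented function without touching the structure, which is the sense in which the logic is "configurable".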

  11. Programming FPGAs
      Programming is usually done using a hardware description language, e.g., VHDL (VHSIC Hardware Description Language) or Verilog.
      → High-level circuit description
      The circuit description is compiled into a bitstream, then loaded into the SRAM cells on the FPGA:
      VHDL → (synthesis) → netlist → (map, place & route) → bitstream → FPGA

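  12. Example: VHDL
      HDLs enable programming language-like descriptions of hardware circuits.

          library IEEE;
          use IEEE.STD_LOGIC_1164.ALL;

          -- Entity declaration assumed here (the slide shows only the
          -- architecture); the 8-bit port width is illustrative.
          entity compare is
            port ( A, B : in  std_logic_vector(7 downto 0);
                   C    : out std_logic );
          end compare;

          architecture Behavioral of compare is
          begin
            -- Combinational process: re-evaluated whenever A or B changes.
            process (A, B)
            begin
              if ( A = B ) then
                C <= '1';
              else
                C <= '0';
              end if;
            end process;
          end Behavioral;

      VHDL can be synthesized, but also executed in software (simulation).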

  13. Real-World Hardware
      [Figure: simplified Virtex-5 XC5VFXxxxT floor plan, including CPU 0 and CPU 1.]
      Frequently used high-level components are provided in discrete silicon:
      - BlockRAM (BRAM): a set of blocks that each store up to 36 kbit of data
      - DSP48 slices: 25x18-bit multipliers followed by a 48-bit accumulator
      - CPU: two full embedded PowerPC 440 cores

  14. Development Board with Virtex-5 FPGA

      | Resource | Virtex-5 XC5VLX110T |
      |---|---|
      | Lookup tables (LUTs) | 69,120 |
      | Block RAM (kbit) | 5,328 |
      | DSP48 slices | 64 |
      | PowerPC cores | 0 |
      | Max. clock speed | ≈ 450 MHz |
      | Release year | 2006 |

      Source: Xilinx Inc., ML50x Evaluation Platform. User Guide.
      - Configurable gates are slower than gates in custom-fabricated chips (clock frequencies: ∼100 MHz).
      → Compensate with an efficient circuit for the problem at hand.

  15. State Machines
      The key asset of FPGAs is their inherent parallelism. Chip areas naturally operate independently and in parallel.
      For example, consider finite-state automata.
      [Figure: non-deterministic automaton for .*abc.*d with states q0 to q4, transitions labeled a, b, c, d, and two * self-loops.]

  16. State Machines
      Question: How would you implement an automaton in software?
      Problems with state machine implementations in software:
      - In non-deterministic automata, several states can be active at a time, which requires iterative execution on sequential hardware.
      - Deterministic automata avoid this problem at the expense of a significantly higher state count.
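
One possible software answer is to track the set of active states and loop over it for every input symbol. The sketch below does this for the automaton for .*abc.*d; it assumes the two * self-loops sit on q0 and q3, and the names `TRANSITIONS`, `step`, and `matches` are illustrative.

```python
ANY = None  # wildcard label for the '*' self-loops

# (source state, label, target state); states 0..4 stand for q0..q4.
# Assumption: the '*' self-loops are on q0 and q3.
TRANSITIONS = [
    (0, ANY, 0),
    (0, "a", 1),
    (1, "b", 2),
    (2, "c", 3),
    (3, ANY, 3),
    (3, "d", 4),
]


def step(active, symbol):
    """Compute the next set of active states; iterates over all transitions."""
    nxt = set()
    for src, label, dst in TRANSITIONS:
        if src in active and (label is ANY or label == symbol):
            nxt.add(dst)
    return nxt


def matches(text):
    """Report a match as soon as the accepting state q4 becomes active."""
    active = {0}
    for symbol in text:
        active = step(active, symbol)
        if 4 in active:
            return True
    return False


print(matches("xxabcyyd"), matches("abd"))
```

The work per input symbol grows with the number of active states and transitions, which is the iterative execution the slide refers to.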

  17. State Machines in Hardware
      Automata can be translated mechanically into hardware circuits:
      - each state → flip-flop (A flip-flop holds a single bit of information: just the right amount to keep the ‘active’/‘not active’ information.)
      - transitions → signals (“wires”) between states, conditioned on the current input symbol (→ ‘and’ gate)
      - multiple sources for one flip-flop input → ‘or’ gate

  18. State Machines in Hardware
      [Figure: the automaton for .*abc.*d and its circuit. One flip-flop (FF) per state q0 to q4; one ‘and’ gate per transition, testing input = a, input = b, input = c, input = d; ‘or’ gates where a flip-flop has multiple sources.]
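
The translation rule (state → flip-flop, transition → ‘and’, multiple sources → ‘or’) can be mimicked in software by keeping one bit per state and computing all next-state bits from the current ones in a single step. This is a rough sketch of the circuit's behavior, not of its timing; it assumes the * self-loops are on q0 and q3, and the function names are hypothetical.

```python
def next_state(ff, symbol):
    """One clock cycle: every flip-flop input is a fixed expression over current bits."""
    q0, q1, q2, q3, q4 = ff
    is_a = int(symbol == "a")
    is_b = int(symbol == "b")
    is_c = int(symbol == "c")
    is_d = int(symbol == "d")
    return (
        q0,                # '*' self-loop keeps q0 active (assumed placement)
        q0 & is_a,         # transition q0 -a-> q1: 'and' gate
        q1 & is_b,         # transition q1 -b-> q2: 'and' gate
        (q2 & is_c) | q3,  # two sources (c-transition, '*' self-loop): 'or' gate
        q3 & is_d,         # transition q3 -d-> q4: 'and' gate
    )


def hw_matches(text):
    """Feed one symbol per cycle and report whether q4 was ever active."""
    ff = (1, 0, 0, 0, 0)  # only q0 is active initially
    for symbol in text:
        ff = next_state(ff, symbol)
        if ff[4]:
            return True
    return False


print(hw_matches("xxabcyyd"), hw_matches("abd"))
```

In the circuit all five expressions are evaluated at once by separate gates, so the cost per symbol does not depend on how many states are active.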

  19. [Figure: flip-flop consumption in % and LUT consumption in % for NFA, DFA, and DFA (compressed) implementations of (0|1)*1(0|1)^i, plotted for i = 0 to 10.]
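
The expression family (0|1)*1(0|1)^i is a standard example of DFA state blow-up. The sketch below counts states only, which is a proxy and not the flip-flop or LUT consumption shown in the figure: it builds the obvious NFA with i + 2 states and applies the textbook subset construction. All names are illustrative.

```python
def nfa_step(states, symbol, i):
    """NFA for (0|1)*1(0|1)^i: state 0 loops on 0/1 and moves to 1 on '1';
    states 1..i advance on any symbol; state i+1 is accepting."""
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)
            if symbol == "1":
                nxt.add(1)
        elif s <= i:
            nxt.add(s + 1)
    return frozenset(nxt)


def nfa_state_count(i):
    return i + 2


def dfa_state_count(i):
    """Number of reachable DFA states produced by the subset construction."""
    start = frozenset({0})
    seen = {start}
    work = [start]
    while work:
        current = work.pop()
        for symbol in "01":
            nxt = nfa_step(current, symbol, i)
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)


for i in range(6):
    print(i, nfa_state_count(i), dfa_state_count(i))
```

The NFA grows linearly in i, while the DFA has to remember the last i + 1 input symbols and therefore grows exponentially.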

  20. Use Case: Network Intrusion Detection
      Analyze network traffic using regular expressions:
      - Scan for known attack tools.
      - Prevent exploitation of known security holes.
      - Scan for shell code.
      E.g., Snort ( http://www.snort.org/ )
      → Hundreds of (regular expression-based) rules.
      Idea: Instantiate a hardware state machine for each rule.
      → Leverage available hardware parallelism.
      → Challenge: optimize for high throughput.

  21. Predicate Decoding
      Optimization 1: Centralized character classification
      [Figure: a central decoder turns the input into predicate signals (a ... d) that feed the ‘and’ gates of the automaton q0 to q4.]
      → Optimizes for space, not for speed.
      Character/predicate decoder:
      - Use FPGA logic resources, or
      - use on-chip BRAM (configure as ROM and use as lookup table).
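
The BRAM-as-ROM variant can be pictured as a 256-entry table indexed by the input byte, where each entry holds one bit per predicate. The sketch below is a software illustration under that assumption; the predicate set and the names `build_decoder_rom` and `decode` are made up for the example.

```python
# Hypothetical predicate set; a real rule set would define its own.
PREDICATES = {
    "is_a": lambda byte: byte == ord("a"),
    "is_b": lambda byte: byte == ord("b"),
    "is_digit": lambda byte: ord("0") <= byte <= ord("9"),
    "is_space": lambda byte: byte in (0x20, 0x09, 0x0A, 0x0D),
}


def build_decoder_rom(predicates):
    """Precompute one word per input byte; bit k of the word is predicate k."""
    names = list(predicates)
    rom = []
    for byte in range(256):
        word = 0
        for bit, name in enumerate(names):
            if predicates[name](byte):
                word |= 1 << bit
        rom.append(word)
    return names, rom


def decode(names, rom, byte):
    """One ROM lookup yields all predicate signals for the current input byte."""
    word = rom[byte]
    return {name: (word >> bit) & 1 for bit, name in enumerate(names)}


NAMES, ROM = build_decoder_rom(PREDICATES)
print(decode(NAMES, ROM, ord("7")))
```

Every automaton then consumes the shared predicate signals and no longer needs its own character comparators, which is where the space saving comes from.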

  22. Predicate Decoding Factored Out
      [Figure: resource consumption in % (LUTs and slices), with and without decoder, plotted against i in (A B)^i for i = 0 to 250.]

  23. Signal Propagation Delay
      Signal propagation delays determine a circuit’s speed.
      Here: One state transition per clock cycle.
      Longest signal path → maximum clock frequency
      [Figure: timing diagram with CLK, register input, and register output. The register input must be stable at the rising clock edge and may be undefined in between; the register is written at the rising clock edge.]

  24. Propagation Delays and Many State Machines
      Straightforward design with many rules and one input:
      [Figure: a single input fans out to NFA 1 through NFA 6; their results are combined by an ‘or’ into a single output.]

  25. Pipelining
      Optimization 2: Pipelining
      → What matters is the longest path between any two registers (flip-flops).
      [Figure: input, NFA 1 through NFA 6, ‘or’, output, with the longest path highlighted.]
      → Introduce pipeline registers.
      → Question: What is the flip side of the idea?
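
A small timing model can illustrate why pipeline registers help. The sketch below assumes that the clock period is set by the slowest register-to-register path and uses made-up delays in nanoseconds; none of the numbers come from the slides, and `max_clock_mhz` and `latency_ns` are hypothetical helpers.

```python
def max_clock_mhz(stage_delays_ns):
    """The longest register-to-register path determines the clock period."""
    return 1000.0 / max(stage_delays_ns)


def latency_ns(stage_delays_ns):
    """Every stage takes one full clock period, even the faster ones."""
    return len(stage_delays_ns) * max(stage_delays_ns)


# Hypothetical delays: input fan-out, NFA logic, and 'or' reduction.
unpipelined = [12.0]          # everything between two registers in one cycle
pipelined = [4.0, 5.0, 3.0]   # same logic, split by two pipeline registers

for name, stages in (("unpipelined", unpipelined), ("pipelined", pipelined)):
    print(name, round(max_clock_mhz(stages), 1), "MHz,", latency_ns(stages), "ns latency")
```

In this model the pipelined design clocks faster because its longest path is shorter, while a result needs more cycles to reach the output and the extra registers consume flip-flops.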
