FPGAs - PowerPoint PPT Presentation



FPGAs


To read more…

This day’s papers:

Brown and Rose, “Architecture of FPGAs and CPLDs: A Tutorial” (no review required)
Putnam et al., “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services”


reconfigurable hardware

‘normal’ processor:
  stream of instructions
  fetch 1+ instruction/cycle
  lots of control logic
  fixed, fast functional units

reconfig. HW:
  set of wirings
  milliseconds+ to reconfigure
  lots of routing
  flexible, slower functional units

the accelerator concept

second processor specialized for a particular computation
examples:
  GPUs — vector computations
  FPGAs — ???
  custom chips — ??? (next week)


FPGA structure

Brown and Rose, Figure 2


FPGA programs: RTL

e.g.: Verilog
determines wiring between gates, registers, memories
everything happens in parallel, every cycle
manually specify what’s in registers, etc.
same languages used to design real processors

RTL example

module counter(clock, reset, value);
  input clock;
  input reset;
  output value;

  reg [32:0] count;

  always @(posedge reset or posedge clock)
    if (reset) begin
      count <= 0;
    end else begin
      count <= count + 1'b1;
    end

  assign value = count;
endmodule
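To make the always-block semantics concrete, here is a rough Python model of the counter above (my sketch, not from the slides): each call to `tick()` stands in for one clock edge, and the register width causes wrap-around just as `reg [32:0]` would.

```python
class Counter:
    """Rough software model of the Verilog counter above (illustrative only)."""

    WIDTH = 33  # reg [32:0] is 33 bits wide

    def __init__(self):
        self.count = 0

    def tick(self, reset=False):
        """Model one positive edge of clock (or of reset when reset=True)."""
        if reset:
            self.count = 0
        else:
            # count <= count + 1'b1; wrap-around models the fixed register width
            self.count = (self.count + 1) % (1 << self.WIDTH)

    @property
    def value(self):
        # assign value = count;  (continuous assignment, always current)
        return self.count

c = Counter()
c.tick(reset=True)
for _ in range(5):
    c.tick()
print(c.value)  # 5
```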


A note about HW programming

not intuitive; attempts at easier interfaces:

“schematic capture” — draw circuit diagram
  common, doesn’t seem great at scale

higher-level tools, e.g., Chisel (Berkeley research project)
  compile to RTL; used at scale

automatic translation of C-like languages (“C to gates”)
  very mixed reputation — a very hard compiler problem
  but see Aladdin paper


FPGA design pipeline

Brown and Rose, Figure 7


FPGA: place and route

RTL compiles to a “gate list”
needs to be turned into which FPGA components to connect
not straightforward; hours+ to compute if the FPGA is nearly full
affects performance — longer wires/more switches
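A toy illustration (my own, not from the tutorial) of why a nearly-full FPGA hurts: with fewer free sites, connected blocks end up farther apart, and total Manhattan wirelength (a standard placement-quality proxy) grows. The netlist and coordinates below are made up.

```python
def wirelength(placement, nets):
    """Total Manhattan wirelength: sum over nets of |dx| + |dy|
    between the two placed blocks each net connects."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total

nets = [("A", "B"), ("B", "C"), ("C", "D")]

# On an uncongested FPGA the placer can pack connected blocks together...
compact = {"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (1, 0)}
# ...on a nearly-full one, only scattered free sites remain.
scattered = {"A": (0, 0), "B": (5, 2), "C": (1, 7), "D": (6, 6)}

print(wirelength(compact, nets))    # 3
print(wirelength(scattered, nets))  # 22: longer wires, slower signals
```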

Programmable switches: example

Example switch: transistor + SRAM cell

(SRAM cell ≈ 1-bit register)

SRAM cell continuously outputs its stored value
can be written by a separate circuit (not shown)

Brown and Rose, Figure 5
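A tiny software analogue of this switch (hypothetical, just to show the idea): one SRAM bit decides whether a pass transistor joins two routing wires.

```python
def pass_switch(sram_bit, wire_a, wire_b):
    """Model of the transistor + SRAM cell switch: when the SRAM cell
    stores 1, the transistor conducts and the two wires are joined.
    A wire carrying no signal is represented as None."""
    if sram_bit:
        # both wires resolve to whichever one is being driven
        value = wire_a if wire_a is not None else wire_b
        return value, value
    return wire_a, wire_b  # switch off: the wires stay independent

print(pass_switch(1, 1, None))  # (1, 1): wire_b now sees wire_a's signal
print(pass_switch(0, 1, None))  # (1, None): no connection
```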



FPGA routing example


FPGA logic block example (1)


FPGA logic block example (2)


FPGA configuration

what to do for every switch
just loading values into the memory that controls each switch
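As a concrete example of “configuration is just loading memory”, a k-input lookup table (LUT) is a 2^k-bit memory whose contents select the logic function. This sketch is illustrative only; real bitstream formats differ.

```python
class LUT:
    """k-input lookup table: configured by writing 2**k memory bits."""

    def __init__(self, k):
        self.k = k
        self.mem = [0] * (1 << k)  # the configuration memory cells

    def configure(self, truth_table_bits):
        """'Loading the bitstream' is literally writing this memory."""
        assert len(truth_table_bits) == 1 << self.k
        self.mem = list(truth_table_bits)

    def evaluate(self, *inputs):
        # the logic inputs simply address the configuration memory
        addr = 0
        for bit in reversed(inputs):
            addr = (addr << 1) | bit
        return self.mem[addr]

xor2 = LUT(2)
xor2.configure([0, 1, 1, 0])  # truth table bits for 2-input XOR
print(xor2.evaluate(1, 0))  # 1
print(xor2.evaluate(1, 1))  # 0
```

Reconfiguring the same LUT as AND is just another memory write: `xor2.configure([0, 0, 0, 1])`.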



FPGA efficiency

most transistors perform routing, not computation
much longer signal paths than in CPUs
  slower clock rates for the same task
development tool usefulness/quality is not great

FPGA: more complex logic

many FPGAs include specialized fixed functionality:
  RAM
  adders, multipliers
  floating point units
  common DSP computations
  full embedded-class CPU cores
  …

could implement these using fully programmable logic
  but slower/bigger

review comments

what are FPGAs good for anyways?
versus/combined with GPUs/CPUs?
other large-scale deployments?
programmability?

Catapult challenges

datacenter logistics
  cost (only 10% more???)
  power density (cooling, power distribution)
  physical space

programs across multiple FPGAs
  needs fast FPGA-to-FPGA communication
  centralized allocation

failure handling


The Shell

23% of FPGA (configurable) area:


CPU to FPGA transfers

10 µs for 16 KB — approx 15 GB/s (about maximum PCIe 3.0 transfer rate)


Catapult roles

hand-coded Verilog (RTL language)
hand partitioned across FPGAs?
precise duplication of existing software

Search engine architecture

Diagram: a search query goes through a cache to the top-level aggregator (TLA), which fans out to mid-level aggregators (MLA) and their index shards; the index shards send matching documents to the ranking service, which returns rankings for the query.



Overall Motivation


FPGA operation

receive: document, some features via shared memory
output: score

each FPGA runs a macropipeline stage — 8 µs (1600 clock cycles)
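A quick sanity check of the stage numbers (my arithmetic, not from the slides): 1600 cycles in 8 µs implies a 200 MHz FPGA clock.

```python
cycles_per_stage = 1600
stage_latency_s = 8e-6  # 8 microseconds per macropipeline stage

# clock frequency = cycles / time
clock_hz = cycles_per_stage / stage_latency_s
print(round(clock_hz / 1e6))  # 200 (MHz)
```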

Queue Manager

“model reload”: can only store one model at a time — takes 250 µs to load from external RAM

on-FPGA memories: approx. 40 MB capacity (distributed)

trick: process queries for the same model together
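The grouping trick can be illustrated with a toy reload counter (my sketch, with a hypothetical query stream): processing queries for the same model back-to-back turns one 250 µs reload per query into one per model.

```python
def reloads(query_stream):
    """Count model reloads for hardware that holds one model at a time."""
    loaded = None
    n = 0
    for model in query_stream:
        if model != loaded:
            n += 1  # each reload costs ~250 us from external RAM
            loaded = model
    return n

queries = ["A", "B", "A", "B", "A", "B"]
print(reloads(queries))          # 6 reloads in arrival order
print(reloads(sorted(queries)))  # 2 reloads when grouped by model
```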

Feature Extraction FSMs

parallel finite-state machines
essentially regexes compiled to gates?
fully pipelined
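A rough software analogue (mine, not the paper's design) of running many small matchers in parallel: the document streams past one character per cycle, and every FSM updates simultaneously. A real FSM would handle overlapping matches with KMP-style states; this simple restart-on-mismatch version works for non-overlapping patterns like the ones below.

```python
class MatchCounter:
    """One FSM: counts occurrences of a fixed pattern, fed one char per 'cycle'."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.state = 0  # how many pattern chars matched so far
        self.count = 0  # the extracted feature

    def step(self, ch):
        # restart-on-mismatch matcher (simplification; see lead-in note)
        if ch == self.pattern[self.state]:
            self.state += 1
            if self.state == len(self.pattern):
                self.count += 1
                self.state = 0
        else:
            self.state = 1 if ch == self.pattern[0] else 0

fsms = [MatchCounter("cat"), MatchCounter("dog")]
for ch in "catdogcat":    # the document streams past...
    for fsm in fsms:      # ...and every FSM steps in parallel each cycle
        fsm.step(ch)
print([f.count for f in fsms])  # [2, 1]
```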


Feature Expressions

specialized mathematical expressions
custom multithreaded processor
model determines what the expressions are
mostly integer — small FPGA area — but some FP
split across multiple FPGAs
threads priority-scheduled

“Complex” logic area


What are FPGAs good for?

bit-twiddling (lots of simple CPU instrs.)?
inherently parallel programs?
  perhaps even if different operations — hard for GPUs
low-latency I/O interface and processing?
prototyping CPUs, GPUs

What are FPGAs bad at?

floating point, other ‘big’ arithmetic operations
  purpose-built, denser ALUs just win

caching lots of data?
  … but sometimes dedicated SRAM blocks

being easy to program well
  programming FPGAs ≈ processor design!


FPGAs versus GPUs

both good at massively parallel computations
FPGAs better at exploiting multiple-instruction parallelism?
FPGAs can be lower latency for simple operations
FPGAs much worse at floating point/non-small-integer calculations?

Interlude: Homework 3


Homework 3 supplied kernel

what does the supplied kernel do?

(diagram: input elements 1, 2, …, 255, 256, 257, 258, …, 511, 512, …; results 255, 511, …)

Exam topics

Memory hierarchy — caches, TLBs
Pipelining, instruction scheduling, VLIW
Multiple issue/out-of-order:
  register renaming and reservation stations
  reorder buffers and branch prediction
  hardware multithreading
Multicore shared memory:
  cache coherency protocols/networks
  relaxed memory models and sequential consistency
  synchronization: spin locks, transactional memory, etc.
Vector machines, GPUs, other accelerators


Next time: Custom ASICs

higher dev cost / higher efficiency
two papers:
  one on automating the design of custom ASIC accelerators (Aladdin)
  another: a case study using that tool (Minerva)
all these things probably apply to FPGA stuff

Preview: Minerva

Deep Neural Networks — machine learning models
accelerating DNN evaluation (making predictions from a pre-trained model)
mathematical tradeoffs (remove “unimportant” things from the model)
architectural tradeoffs

Preview: Aladdin

tool (used by Minerva) for quickly evaluating accelerator designs
produces fast estimates
complements existing high-level synthesis (“C to gates”-like) tools