SLIDE 1

FireSim

FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud

Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović

https://fires.im @firesimproject sagark@eecs.berkeley.edu

SLIDE 2

The new datacenter hardware environment

  • The end of Moore's Law
  • Custom silicon in the cloud
  • Deeper memory/storage hierarchies, e.g. 3D XPoint, HBM
  • Faster networks, e.g. silicon photonics
  • New datacenter architectures, e.g. disaggregation [1]
SLIDE 3

Disaggregated Datacenters

Diagram from Gao et al., OSDI '16
SLIDE 4

…and custom HW is changing faster than ever

  • FPGAs
  • Agile HW design for ASICs [2]
SLIDE 5

What does our simulator need to do?

  • Model hardware at scale:
  • CPUs down to microarchitecture
  • Fast networks, switches
  • Novel accelerators
  • Run real software:
  • Real OS, networking stack (Linux)
  • Real frameworks/applications (not microbenchmarks)
  • Be productive/usable:
  • Run on a commodity platform
  • Want to encourage collaboration between systems and architecture: real HW/SW co-design
SLIDE 6

Comparing existing HW “simulation” systems

(1) Build the hardware
(2) Build a software simulator
(3) Build a hardware-accelerated simulator
SLIDE 7

A HW-accelerated DC simulator: DIABLO

  • DIABLO, ASPLOS’15 [4]:
  • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
  • Booted Linux, ran apps like Memcached
  • Part of RAMP collaboration [8]
  • Need to hand-write abstract RTL models
  • Harder than writing “tapeout-ready” RTL
  • Need to validate against real HW
  • Tied to an expensive custom host platform
  • $100k+ host platform, custom built

DIABLO Prototype
SLIDE 8

Comparing existing HW “simulation” systems

  • Taping-out excels at:
  • Modeling reality: “single source of truth”
  • Scalability
  • Hardware-accelerated simulators excel at:
  • Simulation rate
  • Ability to run real workloads (as fn. of sim rate)
  • Software-based simulators excel at:
  • Ease-of-use
  • Ease-of-rebuild (time-to-first-cycle)
  • Commodity host platform
  • Cost
  • Introspection
SLIDE 9

Useful trends throughout the architect's stack

  • Open ISA
  • High-productivity hardware design language w/IR
  • Open, silicon-proven SoC implementations
  • FPGAs in the cloud
SLIDE 10

FireSim at a high-level

Server Simulations

  • Inherent parallelism – lots of gates
  • We have tapeout-proven RTL: automatically FAME-1 transform it
  • Put the RTL-derived sims on the FPGAs

Network simulation

  • Little parallelism in switch models (e.g. a thread per port)
  • Need to coordinate all of our distributed server simulations
  • So use CPUs + the host network

[Diagram: an f1.16xlarge host with 8 FPGAs holding server simulations, connected over host PCIe to a CPU-hosted switch model and over host Ethernet (the EC2 network) to other instances]
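To make the host-decoupling concrete, below is a minimal Python sketch of the FAME-1-style simulation discipline (illustrative only; the class and method names are not FireSim's): a transformed server model advances one target cycle only when a network token is available to consume and it emits exactly one token in return, which is what keeps the distributed simulations globally cycle-accurate regardless of host timing.

```python
from collections import deque

class Fame1ServerModel:
    """Toy FAME-1-style host-decoupled server simulation (not FireSim code)."""

    def __init__(self, link_latency_cycles):
        self.target_cycle = 0
        # Each link direction starts with link-latency-in-cycles empty
        # tokens in flight (see the network-modeling slide, SLIDE 19).
        self.rx_tokens = deque(b"\x00" * 8 for _ in range(link_latency_cycles))
        self.tx_tokens = deque()

    def can_advance(self):
        # The host may stall arbitrarily; target time only moves with tokens.
        return bool(self.rx_tokens)

    def advance_one_cycle(self):
        token_in = self.rx_tokens.popleft()   # one cycle of network input
        token_out = self._tick(token_in)      # one cycle of the modeled RTL
        self.tx_tokens.append(token_out)      # one cycle of network output
        self.target_cycle += 1

    def _tick(self, token_in):
        # Placeholder: in FireSim this is the FAME-1-transformed target RTL.
        return token_in

# Usage: step the model while tokens remain.
srv = Fame1ServerModel(link_latency_cycles=6400)
while srv.can_advance() and srv.target_cycle < 10:
    srv.advance_one_cycle()
```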
SLIDE 11

Now, let’s build a datacenter-scale FireSim simulation!

SLIDE 12

Step 1: Server SoC in RTL

[Diagram: server blade simulation with four Rocket cores, each with L1I/L1D caches, a shared L2, a NIC, and other peripherals]

Modeled System
  • 4x RISC-V Rocket cores @ 3.2 GHz
  • 16K I/D L1$
  • 256K shared L2$
  • 200 Gb/s Eth. NIC

Resource Util.
  • < ¼ of an FPGA

Sim Rate
  • N/A

SLIDE 14

Step 2: FPGA Simulation of one server blade

[Diagram: the server blade simulation mapped onto the FPGA fabric, with a DRAM model, a NIC simulation endpoint, and other peripheral simulation endpoints connected over PCIe to the host]

Modeled System
  • 4x RISC-V Rocket cores @ 3.2 GHz
  • 16K I/D L1$
  • 256K shared L2$
  • 200 Gb/s Eth. NIC
  • 16 GB DDR3

Resource Util.
  • < ¼ of an FPGA
  • ¼ Mem Chans

Sim Rate
  • ~150 MHz
  • ~40 MHz (netw)
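For intuition about these rates, a quick back-of-the-envelope check (plain Python, numbers taken from this slide): simulating a 3.2 GHz target at a 40 MHz simulation rate means each target second takes about 80 host seconds.

```python
target_clock_hz = 3.2e9   # modeled core clock (this slide)
sim_rate_hz     = 40e6    # networked simulation rate (this slide)

slowdown = target_clock_hz / sim_rate_hz
print(f"{slowdown:.0f}x slowdown")                    # 80x
print(f"1 target second ~ {slowdown:.0f} host seconds")
```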

SLIDE 16

Step 3: FPGA Simulation of 4 server blades

[Diagram: four server blade simulations on one FPGA, each with its own DRAM model, NIC simulation endpoint, and peripheral simulation endpoints]

Modeled System
  • 4 server blades
  • 16 cores
  • 64 GB DDR3

Resource Util.
  • < 1 FPGA
  • 4/4 Mem Chans

Sim Rate
  • ~14.3 MHz (netw)

Cost: $0.49 per hour (spot), $1.65 per hour (on-demand)

SLIDE 18

Step 4: Simulating a 32 node rack

[Diagram: eight FPGAs (4 simulations each) attached to one host instance, whose CPU runs the ToR switch model]

Modeled System
  • 32 server blades
  • 128 cores
  • 512 GB DDR3
  • 32-port ToR switch
  • 200 Gb/s, 2us links

Resource Util.
  • 8 FPGAs = 1x f1.16xlarge

Sim Rate
  • ~10.7 MHz (netw)

Cost: $2.60 per hour (spot), $13.20 per hour (on-demand)

SLIDE 19

Cycle-accurate Network Modeling

  • For global cycle-accuracy, send a token on each link for each cycle, in each direction
  • Each direction of a link has link-latency-in-cycles tokens in flight
  • e.g. 6400 tokens in flight on a link with 2us link latency @ 3.2 GHz
  • Each token is (desired bandwidth / clock frequency) bits wide
  • e.g. 200 Gbps / 3.2 GHz ≈ 64-bit token sent per cycle
  • Target-transport agnostic (we provide Ethernet switch models)
  • Host-transport agnostic (shared mem, sockets, PCIe)
  • Can “downgrade” to a zero-perf-impact functional network model (150+ MHz)

[Diagram: link model between the NIC's top-level I/O on the FPGA and a switch port, with 6400 64-bit tokens in flight per direction]
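Both example numbers on this slide follow directly from the link parameters; a quick illustrative check in Python:

```python
link_latency_s  = 2e-6     # 2 us link latency (this slide)
target_clock_hz = 3.2e9    # 3.2 GHz target clock (this slide)
link_bw_bps     = 200e9    # 200 Gb/s links (this slide)

tokens_in_flight = link_latency_s * target_clock_hz
token_width_bits = link_bw_bps / target_clock_hz

print(int(tokens_in_flight))      # 6400 tokens per link direction
print(token_width_bits)           # 62.5 bits, carried as ~64-bit tokens
```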

SLIDE 22

Step 5: Simulating a 256 node “aggregation pod”

[Diagram: an aggregation switch connecting eight racks, each rack simulated as in Step 4]

Modeled System
  • 256 server blades
  • 1024 cores
  • 4 TB DDR3
  • 8 ToRs, 1 aggregation switch
  • 200 Gb/s, 2us links

Resource Util.
  • 64 FPGAs = 8x f1.16xlarge + 1x m4.16xlarge

Sim Rate
  • ~9 MHz (netw)

SLIDE 24

Step 6: Simulating a 1024 node datacenter

[Diagram: a root switch connecting four aggregation pods, each built as in Step 5]

Modeled System
  • 1024 servers
  • 4096 cores
  • 16 TB DDR3
  • 32 ToRs, 4 aggregation switches, 1 root switch
  • 200 Gb/s, 2us links

Resource Util.
  • 256 FPGAs = 32x f1.16xlarge + 5x m4.16xlarge

Sim Rate
  • ~6.6 MHz (netw)
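The host-resource figures across Steps 4 through 6 all follow one scaling rule (4 simulated nodes per FPGA, 8 FPGAs per f1.16xlarge); a small illustrative Python check:

```python
SIMS_PER_FPGA = 4           # four server-blade sims fit on one FPGA (Step 3)
FPGAS_PER_F1_16XLARGE = 8   # FPGAs per f1.16xlarge instance

def hosts_for(nodes):
    fpgas = nodes // SIMS_PER_FPGA
    f1_instances = fpgas // FPGAS_PER_F1_16XLARGE
    return fpgas, f1_instances

for nodes in (32, 256, 1024):
    fpgas, f1 = hosts_for(nodes)
    print(f"{nodes:5d} nodes -> {fpgas:4d} FPGAs = {f1:3d}x f1.16xlarge")
#    32 nodes ->    8 FPGAs =   1x f1.16xlarge  (Step 4)
#   256 nodes ->   64 FPGAs =   8x f1.16xlarge  (+ 1x m4.16xlarge, Step 5)
#  1024 nodes ->  256 FPGAs =  32x f1.16xlarge  (+ 5x m4.16xlarge, Step 6)
```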

SLIDE 25

Experimenting on a 1024 Node Datacenter

[Diagram: the 1024-node simulated datacenter from Step 6, running 512 memcached servers and 512 mutilate clients]

50th %-ile latency (us): Cross-ToR 79.3, Cross-aggregation 87.1
512 Memcached servers, 512 Mutilate clients
SLIDE 27

Reproducing tail latency effects from deployed clusters

  • Leverich and Kozyrakis show effects of thread imbalance in memcached in EuroSys '14 [3]

[Diagram: a four-core Rocket SoC on a TileLink2 on-chip interconnect, running memcached threads 1-4, one per core: no thread imbalance]
SLIDE 28

Reproducing tail latency effects from deployed clusters

  • Leverich and Kozyrakis show effects of thread imbalance in memcached in EuroSys '14 [3]

[Diagram: the same four-core SoC with a fifth memcached thread added, so one core runs two threads: thread imbalance]

From [3], under thread imbalance, we expect:
1) Median latency to be unchanged
2) Tail latency to increase drastically
SLIDE 29

Reproducing tail latency effects from deployed clusters

  • Let's run a similar experiment on an 8 node cluster in FireSim:

[Plot: 50th percentile latency]

SLIDE 30

Reproducing tail latency effects from deployed clusters

  • Let's run a similar experiment on an 8 node cluster in FireSim:

[Plot: 50th and 95th percentile latencies]

SLIDE 31

Open-source: Not just datacenter simulation

  • An “easy” button for fast, FPGA-accelerated full-system simulation
  • One-click: parallel FPGA builds, simulation run/result collection, building target software
  • Scales to a variety of use cases:
  • Networked (performance depends on scale)
  • Non-networked (150+ MHz), limited by your budget
  • firesim command line program
  • Like docker or vagrant, but for FPGA sims
  • User doesn't need to care about the distributed magic happening behind the scenes

[Screenshot: FireSim developer environment]
SLIDE 32

Open-source: Not just datacenter simulation

  • Scripts can call firesim to fully automate distributed FPGA sim
  • Reproducibility: included scripts to reproduce ISCA 2018 results
  • e.g. scripts to automatically run SPECInt2017 reference inputs in ≈1 day
  • Many others
  • 91+ pages of documentation: https://docs.fires.im
  • AWS provides grants for researchers: https://aws.amazon.com/grants/

$ cd fsim/deploy/workloads
$ ./run-all.sh

SLIDE 33

Wrapping Up

  • We can prototype thousand-node datacenters built on arbitrary RTL
  • + Mix in software models when desired
  • Simulation is automatically built and deployed
  • Automatically deploy real workloads and collect results
  • Open-source, runs on Amazon EC2 F1, no capex

[Diagram: SoC RTL, other RTL, a network topology, SW models, and a full workload feed an automatically deployed, high-performance, distributed simulation]
SLIDE 34

Now open-sourced! https://fires.im https://github.com/firesim @firesimproject sagark@eecs.berkeley.edu

The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849, DARPA Award Number HR0011-12-2-0016, RISE Lab sponsor Amazon Web Services, and ADEPT/ASPIRE Lab industrial sponsors and affiliates Intel, HP, Huawei, NVIDIA, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or of the industrial sponsors.
SLIDE 35

References

[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network Requirements for Resource Disaggregation. OSDI '16.
[2] Y. Lee et al. "An Agile Approach to Building RISC-V Microprocessors." IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.
[3] Jacob Leverich and Christos Kozyrakis. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. EuroSys '14.
[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanović, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15.
[5] Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanović, and David Patterson. A Case for FAME: FPGA Architecture Model Execution. ISCA '10.
[6] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanović. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV '17.
[7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL. ISCA '16.
[8] http://ramp.eecs.berkeley.edu/
SLIDE 36

Backup Slides/Common Questions

SLIDE 37

New! Beta support for BOOM, an OoO Core

  • A superscalar out-of-order RV64G core
  • Highly parameterizable
  • Beta support for BOOM is now available on the rc-bump-may branch
  • On FPGA, currently boots Linux to userspace, then hangs
  • A target-RTL issue; can be reproduced in VCS without any FireSim simulation shims
  • Working with BOOM devs to solve this
SLIDE 38

FPGA Utilization

  • “Supernode” config – 4 quad-core nodes
  • Available on FPGA: 1,182,000 LUTs
  • Top-level consumed (w/shell): 803,462 LUTs = ~68%
  • firesim_top consumed: 569,921 LUTs = ~48%
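The two percentages follow from the LUT counts above; a quick illustrative check in Python:

```python
available_luts   = 1_182_000  # LUTs on the FPGA (this slide)
top_level_luts   = 803_462    # including the FPGA shell (this slide)
firesim_top_luts = 569_921    # firesim_top alone (this slide)

print(f"{top_level_luts / available_luts:.0%}")    # ~68%
print(f"{firesim_top_luts / available_luts:.0%}")  # ~48%
```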

SLIDE 39

Comparing cost of Cloud FPGAs: SPEC17 Run

  • f1.2xlarge spot market: 49c per hour per FPGA
  • SPEC17 intrate reference inputs: 10 workloads, 177.6 FPGA-sim machine-hours
  • Longest individual benchmark: omnetpp @ 27.3 hours
  • $87.024 for a SPEC17 intrate run w/reference inputs, in 27.3 hours
  • Roughly 1 day
  • ~1/100 – 1/500 the cost of the FPGA to run on the cloud
  • Purchase one FPGA instead:
  • Pay $10k – $50k per FPGA (+ops/cooling)
  • Each run takes 177.6 FPGA-sim hours: roughly 1 week
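The headline cost is just machine-hours times the spot price; a quick illustrative check in Python (the wall-clock time is set by the longest benchmark because the 10 workloads run in parallel on separate FPGAs):

```python
spot_price_per_fpga_hour = 0.49   # f1.2xlarge spot market (this slide)
fpga_sim_hours = 177.6            # SPEC17 intrate, 10 workloads (this slide)

total_cost = fpga_sim_hours * spot_price_per_fpga_hour
print(f"${total_cost:.3f}")       # $87.024
# Wall clock ~ 27.3 h, the longest single benchmark (omnetpp).
```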
SLIDE 40

Or – compare purchasing one FPGA

  • Let's say $10k for one of the FPGAs (conservative)
  • 49c per hour on F1
  • Break even if you run an F1 instance continuously for 2.33 years
  • What if a grad student works 16 hours a day, 7 days a week? 3.5 years
  • What about 16 hours a day, 5 days a week? 4.9 years
  • Ignoring all other benefits of cloud FPGAs…
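The break-even figures follow from the two prices above; a quick illustrative check in Python:

```python
fpga_cost = 10_000.0   # conservative purchase price (this slide)
hourly = 0.49          # F1 spot price per FPGA-hour (this slide)

hours_to_break_even = fpga_cost / hourly          # ~20,408 hours

print(f"{hours_to_break_even / (24 * 365):.2f} years")     # 2.33, running 24/7
print(f"{hours_to_break_even / (16 * 365):.1f} years")     # 3.5, 16 h/day, 7 d/wk
print(f"{hours_to_break_even / (16 * 5 * 52):.1f} years")  # 4.9, 16 h/day, 5 d/wk
```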