SLIDE 1

FireSim

FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud

Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović

https://fires.im @firesimproject sagark@eecs.berkeley.edu

SLIDE 2

The new datacenter hardware environment

  • The end of Moore's Law
  • Custom silicon in the cloud
  • Deeper memory/storage hierarchies, e.g. 3D XPoint, HBM
  • Faster networks, e.g. silicon photonics
  • New datacenter architectures, e.g. disaggregation [1]
SLIDE 3

Disaggregated Datacenters

Diagram from Gao et al., OSDI '16
SLIDE 4

…and custom HW is changing faster than ever

  • FPGAs
  • Agile HW design for ASICs [2]
SLIDE 5

What does our simulator need to do?

  • Model hardware at scale:
  • CPUs down to microarchitecture
  • Fast networks, switches
  • Novel accelerators
  • Run real software:
  • Real OS, networking stack (Linux)
  • Real frameworks/applications (not microbenchmarks)
  • Be productive/usable:
  • Run on a commodity platform
  • Want to encourage collaboration between systems and architecture: real HW/SW co-design
SLIDE 6

Comparing existing HW “simulation” systems

(1) Build the hardware
(2) Build a software simulator
(3) Build a hardware-accelerated simulator
SLIDE 7

A HW-accelerated DC simulator: DIABLO

  • DIABLO, ASPLOS’15 [4]:
  • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
  • Booted Linux, ran apps like Memcached
  • Part of RAMP collaboration [8]
  • Need to hand-write abstract RTL models
  • Harder than writing “tapeout-ready” RTL
  • Need to validate against real HW
  • Tied to an expensive custom host platform
  • $100k+ host platform, custom built

DIABLO Prototype
SLIDE 8

Comparing existing HW “simulation” systems

  • Taping-out excels at:
  • Modeling reality: “single source of truth”
  • Scalability
  • Hardware-accelerated simulators excel at:
  • Simulation rate
  • Ability to run real workloads (as fn. of sim rate)
  • Software-based simulators excel at:
  • Ease-of-use
  • Ease-of-rebuild (time-to-first-cycle)
  • Commodity host platform
  • Cost
  • Introspection
SLIDE 9

Useful trends throughout the architect's stack

  • Open ISA
  • High-productivity hardware design language w/IR
  • Open, silicon-proven SoC implementations
  • FPGAs in the cloud
SLIDE 10

FireSim at a high-level

Server Simulations

  • Inherent parallelism – lots of gates
  • We have tapeout-proven RTL: automatically FAME-1 transform it
  • Put the RTL-derived sims on the FPGAs

Network simulation

  • Little parallelism in switch models (e.g. a thread per port)
  • Need to coordinate all of our distributed server simulations
  • So use CPUs + the host network

[Diagram: an f1.16xlarge host with 8 FPGAs holding server simulations, connected over host PCIe to a CPU-hosted switch model and over host Ethernet (the EC2 network) to other instances]
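To make the host-decoupling concrete, below is a minimal Python sketch of the FAME-1-style simulation discipline (illustrative only; the class and method names are not FireSim's): a transformed server model advances one target cycle only when a network token is available to consume and it emits exactly one token in return, which is what keeps the distributed simulations globally cycle-accurate regardless of host timing.

```python
from collections import deque

class Fame1ServerModel:
    """Toy FAME-1-style host-decoupled server simulation (not FireSim code)."""

    def __init__(self, link_latency_cycles):
        self.target_cycle = 0
        # Each link direction starts with link-latency-in-cycles empty
        # tokens in flight (see the network-modeling slide, SLIDE 19).
        self.rx_tokens = deque(b"\x00" * 8 for _ in range(link_latency_cycles))
        self.tx_tokens = deque()

    def can_advance(self):
        # The host may stall arbitrarily; target time only moves with tokens.
        return bool(self.rx_tokens)

    def advance_one_cycle(self):
        token_in = self.rx_tokens.popleft()   # one cycle of network input
        token_out = self._tick(token_in)      # one cycle of the modeled RTL
        self.tx_tokens.append(token_out)      # one cycle of network output
        self.target_cycle += 1

    def _tick(self, token_in):
        # Placeholder: in FireSim this is the FAME-1-transformed target RTL.
        return token_in

# Usage: step the model while tokens remain.
srv = Fame1ServerModel(link_latency_cycles=6400)
while srv.can_advance() and srv.target_cycle < 10:
    srv.advance_one_cycle()
```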
SLIDE 11

Now, let’s build a datacenter-scale FireSim simulation!

SLIDE 12

Step 1: Server SoC in RTL

[Diagram: server blade simulation with four Rocket cores, each with L1I/L1D caches, a shared L2, a NIC, and other peripherals]

Modeled System
  • 4x RISC-V Rocket cores @ 3.2 GHz
  • 16K I/D L1$
  • 256K shared L2$
  • 200 Gb/s Eth. NIC

Resource Util.
  • < ¼ of an FPGA

Sim Rate
  • N/A

SLIDE 14

Step 2: FPGA Simulation of one server blade

[Diagram: the server blade simulation mapped onto the FPGA fabric, with a DRAM model, a NIC simulation endpoint, and other peripheral simulation endpoints connected over PCIe to the host]

Modeled System
  • 4x RISC-V Rocket cores @ 3.2 GHz
  • 16K I/D L1$
  • 256K shared L2$
  • 200 Gb/s Eth. NIC
  • 16 GB DDR3

Resource Util.
  • < ¼ of an FPGA
  • ¼ Mem Chans

Sim Rate
  • ~150 MHz
  • ~40 MHz (netw)
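For intuition about these rates, a quick back-of-the-envelope check (plain Python, numbers taken from this slide): simulating a 3.2 GHz target at a 40 MHz simulation rate means each target second takes about 80 host seconds.

```python
target_clock_hz = 3.2e9   # modeled core clock (this slide)
sim_rate_hz     = 40e6    # networked simulation rate (this slide)

slowdown = target_clock_hz / sim_rate_hz
print(f"{slowdown:.0f}x slowdown")                    # 80x
print(f"1 target second ~ {slowdown:.0f} host seconds")
```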

SLIDE 16

Step 3: FPGA Simulation of 4 server blades

[Diagram: four server blade simulations on one FPGA, each with its own DRAM model, NIC simulation endpoint, and peripheral simulation endpoints]

Modeled System
  • 4 server blades
  • 16 cores
  • 64 GB DDR3

Resource Util.
  • < 1 FPGA
  • 4/4 Mem Chans

Sim Rate
  • ~14.3 MHz (netw)

Cost: $0.49 per hour (spot), $1.65 per hour (on-demand)

SLIDE 18

Step 4: Simulating a 32 node rack

[Diagram: eight FPGAs (4 simulations each) attached to one host instance, whose CPU runs the ToR switch model]

Modeled System
  • 32 server blades
  • 128 cores
  • 512 GB DDR3
  • 32-port ToR switch
  • 200 Gb/s, 2us links

Resource Util.
  • 8 FPGAs = 1x f1.16xlarge

Sim Rate
  • ~10.7 MHz (netw)

Cost: $2.60 per hour (spot), $13.20 per hour (on-demand)

SLIDE 19

Cycle-accurate Network Modeling

  • For global cycle-accuracy, send a token on each link for each cycle, in each direction
  • Each direction of a link has link-latency-in-cycles tokens in flight
  • e.g. 6400 tokens in flight on a link with 2us link latency @ 3.2 GHz
  • Each token is (desired bandwidth / clock frequency) bits wide
  • e.g. 200 Gbps / 3.2 GHz ≈ 64-bit token sent per cycle
  • Target-transport agnostic (we provide Ethernet switch models)
  • Host-transport agnostic (shared mem, sockets, PCIe)
  • Can “downgrade” to a zero-perf-impact functional network model (150+ MHz)

[Diagram: link model between the NIC's top-level I/O on the FPGA and a switch port, with 6400 64-bit tokens in flight per direction]
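Both example numbers on this slide follow directly from the link parameters; a quick illustrative check in Python:

```python
link_latency_s  = 2e-6     # 2 us link latency (this slide)
target_clock_hz = 3.2e9    # 3.2 GHz target clock (this slide)
link_bw_bps     = 200e9    # 200 Gb/s links (this slide)

tokens_in_flight = link_latency_s * target_clock_hz
token_width_bits = link_bw_bps / target_clock_hz

print(int(tokens_in_flight))      # 6400 tokens per link direction
print(token_width_bits)           # 62.5 bits, carried as ~64-bit tokens
```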

SLIDE 22

Step 5: Simulating a 256 node “aggregation pod”

[Diagram: an aggregation switch connecting eight racks, each rack simulated as in Step 4]

Modeled System
  • 256 server blades
  • 1024 cores
  • 4 TB DDR3
  • 8 ToRs, 1 aggregation switch
  • 200 Gb/s, 2us links

Resource Util.
  • 64 FPGAs = 8x f1.16xlarge + 1x m4.16xlarge

Sim Rate
  • ~9 MHz (netw)

SLIDE 24

Step 6: Simulating a 1024 node datacenter

[Diagram: a root switch connecting four aggregation pods, each built as in Step 5]

Modeled System
  • 1024 servers
  • 4096 cores
  • 16 TB DDR3
  • 32 ToRs, 4 aggregation switches, 1 root switch
  • 200 Gb/s, 2us links

Resource Util.
  • 256 FPGAs = 32x f1.16xlarge + 5x m4.16xlarge

Sim Rate
  • ~6.6 MHz (netw)
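The host-resource figures across Steps 4 through 6 all follow one scaling rule (4 simulated nodes per FPGA, 8 FPGAs per f1.16xlarge); a small illustrative Python check:

```python
SIMS_PER_FPGA = 4           # four server-blade sims fit on one FPGA (Step 3)
FPGAS_PER_F1_16XLARGE = 8   # FPGAs per f1.16xlarge instance

def hosts_for(nodes):
    fpgas = nodes // SIMS_PER_FPGA
    f1_instances = fpgas // FPGAS_PER_F1_16XLARGE
    return fpgas, f1_instances

for nodes in (32, 256, 1024):
    fpgas, f1 = hosts_for(nodes)
    print(f"{nodes:5d} nodes -> {fpgas:4d} FPGAs = {f1:3d}x f1.16xlarge")
#    32 nodes ->    8 FPGAs =   1x f1.16xlarge  (Step 4)
#   256 nodes ->   64 FPGAs =   8x f1.16xlarge  (+ 1x m4.16xlarge, Step 5)
#  1024 nodes ->  256 FPGAs =  32x f1.16xlarge  (+ 5x m4.16xlarge, Step 6)
```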

SLIDE 25

Experimenting on a 1024 Node Datacenter

[Diagram: the 1024-node simulated datacenter from Step 6, running 512 memcached servers and 512 mutilate clients]

50th %-ile latency (us): Cross-ToR 79.3, Cross-aggregation 87.1
512 Memcached servers, 512 Mutilate clients
SLIDE 27

Reproducing tail latency effects from deployed clusters

  • Leverich and Kozyrakis show effects of thread imbalance in memcached in EuroSys '14 [3]

[Diagram: a four-core Rocket SoC on a TileLink2 on-chip interconnect, running memcached threads 1-4, one per core: no thread imbalance]
SLIDE 28

Reproducing tail latency effects from deployed clusters

  • Leverich and Kozyrakis show effects of thread imbalance in memcached in EuroSys '14 [3]

[Diagram: the same four-core SoC with a fifth memcached thread added, so one core runs two threads: thread imbalance]

From [3], under thread imbalance, we expect:
1) Median latency to be unchanged
2) Tail latency to increase drastically
SLIDE 29

Reproducing tail latency effects from deployed clusters

  • Let's run a similar experiment on an 8 node cluster in FireSim:

[Plot: 50th percentile latency]

SLIDE 30

Reproducing tail latency effects from deployed clusters

  • Let's run a similar experiment on an 8 node cluster in FireSim:

[Plot: 50th and 95th percentile latencies]

SLIDE 31

Open-source: Not just datacenter simulation

  • An “easy” button for fast, FPGA-accelerated full-system simulation
  • One-click: parallel FPGA builds, simulation run/result collection, building target software
  • Scales to a variety of use cases:
  • Networked (performance depends on scale)
  • Non-networked (150+ MHz), limited by your budget
  • firesim command line program
  • Like docker or vagrant, but for FPGA sims
  • User doesn't need to care about the distributed magic happening behind the scenes

[Screenshot: FireSim developer environment]
SLIDE 32

Open-source: Not just datacenter simulation

  • Scripts can call firesim to fully automate distributed FPGA sim
  • Reproducibility: included scripts to reproduce ISCA 2018 results
  • e.g. scripts to automatically run SPECInt2017 reference inputs in ≈1 day
  • Many others
  • 91+ pages of documentation: https://docs.fires.im
  • AWS provides grants for researchers: https://aws.amazon.com/grants/

$ cd fsim/deploy/workloads
$ ./run-all.sh

SLIDE 33

Wrapping Up

  • We can prototype thousand-node datacenters built on arbitrary RTL
  • + Mix in software models when desired
  • Simulation is automatically built and deployed
  • Automatically deploy real workloads and collect results
  • Open-source, runs on Amazon EC2 F1, no capex

[Diagram: SoC RTL, other RTL, a network topology, SW models, and a full workload feed an automatically deployed, high-performance, distributed simulation]
SLIDE 34

Now open-sourced! https://fires.im https://github.com/firesim @firesimproject sagark@eecs.berkeley.edu

The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849, DARPA Award Number HR0011-12-2-0016, RISE Lab sponsor Amazon Web Services, and ADEPT/ASPIRE Lab industrial sponsors and affiliates Intel, HP, Huawei, NVIDIA, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or of the industrial sponsors.
SLIDE 35

References

[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network Requirements for Resource Disaggregation. OSDI '16.
[2] Y. Lee et al. "An Agile Approach to Building RISC-V Microprocessors." IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.
[3] Jacob Leverich and Christos Kozyrakis. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. EuroSys '14.
[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanović, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15.
[5] Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanović, and David Patterson. A Case for FAME: FPGA Architecture Model Execution. ISCA '10.
[6] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanović. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV '17.
[7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL. ISCA '16.
[8] http://ramp.eecs.berkeley.edu/
SLIDE 36

Backup Slides/Common Questions

SLIDE 37

New! Beta support for BOOM, an OoO Core

  • A superscalar out-of-order RV64G core
  • Highly parameterizable
  • Beta support for BOOM is now available on the rc-bump-may branch
  • On FPGA, currently boots Linux to userspace, then hangs
  • A target-RTL issue; can be reproduced in VCS without any FireSim simulation shims
  • Working with BOOM devs to solve this
SLIDE 38

FPGA Utilization

  • “Supernode” config – 4 quad-core nodes
  • Available on FPGA: 1,182,000 LUTs
  • Top-level consumed (w/shell): 803,462 LUTs = ~68%
  • firesim_top consumed: 569,921 LUTs = ~48%
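The two percentages follow from the LUT counts above; a quick illustrative check in Python:

```python
available_luts   = 1_182_000  # LUTs on the FPGA (this slide)
top_level_luts   = 803_462    # including the FPGA shell (this slide)
firesim_top_luts = 569_921    # firesim_top alone (this slide)

print(f"{top_level_luts / available_luts:.0%}")    # ~68%
print(f"{firesim_top_luts / available_luts:.0%}")  # ~48%
```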

SLIDE 39

Comparing cost of Cloud FPGAs: SPEC17 Run

  • f1.2xlarge spot market: 49c per hour per FPGA
  • SPEC17 intrate reference inputs: 10 workloads, 177.6 FPGA-sim machine-hours
  • Longest individual benchmark: omnetpp @ 27.3 hours
  • $87.024 for a SPEC17 intrate run w/reference inputs, in 27.3 hours
  • Roughly 1 day
  • ~1/100 – 1/500 the cost of the FPGA to run on the cloud
  • Purchase one FPGA instead:
  • Pay $10k – $50k per FPGA (+ops/cooling)
  • Each run takes 177.6 FPGA-sim hours: roughly 1 week
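The headline cost is just machine-hours times the spot price; a quick illustrative check in Python (the wall-clock time is set by the longest benchmark because the 10 workloads run in parallel on separate FPGAs):

```python
spot_price_per_fpga_hour = 0.49   # f1.2xlarge spot market (this slide)
fpga_sim_hours = 177.6            # SPEC17 intrate, 10 workloads (this slide)

total_cost = fpga_sim_hours * spot_price_per_fpga_hour
print(f"${total_cost:.3f}")       # $87.024
# Wall clock ~ 27.3 h, the longest single benchmark (omnetpp).
```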
SLIDE 40

Or – compare purchasing one FPGA

  • Let's say $10k for one of the FPGAs (conservative)
  • 49c per hour on F1
  • Break even if you run an F1 instance continuously for 2.33 years
  • What if a grad student works 16 hours a day, 7 days a week? 3.5 years
  • What about 16 hours a day, 5 days a week? 4.9 years
  • Ignoring all other benefits of cloud FPGAs…
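The break-even figures follow from the two prices above; a quick illustrative check in Python:

```python
fpga_cost = 10_000.0   # conservative purchase price (this slide)
hourly = 0.49          # F1 spot price per FPGA-hour (this slide)

hours_to_break_even = fpga_cost / hourly          # ~20,408 hours

print(f"{hours_to_break_even / (24 * 365):.2f} years")     # 2.33, running 24/7
print(f"{hours_to_break_even / (16 * 365):.1f} years")     # 3.5, 16 h/day, 7 d/wk
print(f"{hours_to_break_even / (16 * 5 * 52):.1f} years")  # 4.9, 16 h/day, 5 d/wk
```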