GRVI Phalanx Update:
A Massively Parallel RISC-V FPGA Accelerator Framework Jan Gray | jan@fpga.org | http://fpga.org
CARRV2017: 2017/10/14
FPGA Datacenter Accelerators Are Almost Mainstream
Catapult v2. Intel += Altera. OpenPOWER CAPI. AWS F1. Baidu. Alibaba. Huawei …
– Massively parallel, customized, connected, versatile
– High throughput, low latency, low energy
– Software: C++ workload → FPGA accelerator?
– Hardware: “tape out” a complex SoC daily?
[Figure: anatomy of an FPGA datacenter accelerator: an FPGA hosting arrays of accelerators (A1, A2), PHYs and NICs 1-4 to 10G…100G networks, two HBM (high bandwidth memory) channels, two DRAM channels, and two PCI Express links to the host and peripherals]
– Run parallel software on 100s of soft processors
– Add custom logic as needed
= More 5 second recompiles, fewer 5 hour PARs
– Many clusters of PEs, RAMs, accelerators, I/O
– Message passing in a PGAS across a … (see the sketch below)
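A minimal C sketch of what message passing in a partitioned global address space (PGAS) can look like from software; the address layout and the gp_addr/send_word helpers are illustrative assumptions, not the actual GRVI Phalanx API:

    #include <stdint.h>

    #define CMEM_BITS 17  /* 128 KB cluster memory = 17 bits of byte offset */

    /* Hypothetical PGAS address: the top bit marks a remote access, the NoC
     * (x,y) coordinates pick the destination cluster, and the low bits are a
     * byte offset into that cluster's CMEM. */
    static inline uint32_t gp_addr(uint32_t x, uint32_t y, uint32_t off) {
        return (1u << 31) | (x << (CMEM_BITS + 4)) | (y << CMEM_BITS)
             | (off & ((1u << CMEM_BITS) - 1));
    }

    /* A store through a global pointer is carried as a message over the NoC. */
    static inline void send_word(uint32_t x, uint32_t y, uint32_t off, uint32_t v) {
        *(volatile uint32_t *)gp_addr(x, y, off) = v;
    }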
– Specs, tests, simulators, cores, compilers, libs, FOSS
GRVI: an austere RISC-V PE, trading per-core features for more task and memory parallelism
– RV32I, minus CSRs, exceptions, plus mul*, lr/sc
– 3 stage pipeline (fetch, decode, execute)
– 2 cycle loads; 3 cycle taken branches/jumps
– Typically 320 LUTs @ 375 MHz ≈ 0.7 MIPS/LUT (see the check below)
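As a rough sanity check on that last figure (assuming an average CPI of about 1.6, consistent with the 2 cycle loads and 3 cycle taken branches; the slide does not state a CPI): 375 MHz / 1.6 ≈ 234 MIPS per PE, and 234 MIPS / 320 LUTs ≈ 0.73 MIPS/LUT, matching the quoted ≈ 0.7 MIPS/LUT.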
[Figure: GRVI cluster: 8 PEs (P), in pairs through 2:1 muxes sharing 4 IMEMs (4-8 KB each), connected by a 4:4 crossbar (XBAR) to 128 KB of cluster data RAM (CMEM) and optional accelerator(s)]
– No segmentation/flits, VCs, buffering, credits
– Unidirectional rings
– Deflecting dimension order routing of whole msgs (modeled below)
– Simple; frugal; wide; fast: 1-400 Gbps/link
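To make the routing discipline concrete, here is a small C model of one router cycle. It follows the bullets above (whole messages, unidirectional rings, deflecting dimension-order routing); the port priorities and field names are my assumptions, not the published Hoplite RTL:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { bool valid; uint8_t dx, dy; /* 256b payload elided */ } msg_t;

    /* One cycle of the router at (x,y). Whole messages, no flits, no buffers,
     * no credits: an input that loses output arbitration is deflected onward
     * around its unidirectional ring instead of being stored. */
    void hoplite_route(uint8_t x, uint8_t y, msg_t xin, msg_t yin,
                       msg_t *xout, msg_t *yout, msg_t *deliver)
    {
        xout->valid = yout->valid = deliver->valid = false;

        /* Y-ring input has priority: it exits here or continues along Y. */
        if (yin.valid) {
            if (yin.dy == y) *deliver = yin;
            else             *yout = yin;
        }
        /* X-ring input: ride the X ring to its column, then turn onto Y. */
        if (xin.valid) {
            if (xin.dx != x)                          *xout = xin;
            else if (xin.dy == y && !deliver->valid)  *deliver = xin;
            else if (!yout->valid)                    *yout = xin;
            else                                      *xout = xin;  /* deflect */
        }
    }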
256b links @ 400 MHz = 100 Gb/s links; <3% of FPGA
[Figure: GRVI cluster with NoC interface: the same 8-PE cluster (2:1 muxes, 4 IMEMs of 4-8 KB, 4:4 XBAR, 128 KB CMEM, accelerator(s)) plus a NoC ITF connecting it to a Hoplite router]
300b message = header + 32b msg dest addr + 256b msg data
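The 300b format implies 300 - 32 - 256 = 12 bits of header. A C view of the message, with the use of the header bits left as an assumption:

    #include <stdint.h>

    /* One 300b NoC message as on the slide: header + 32b dest addr + 256b data.
     * The 12 remaining header bits (valid/route/type?) are assumed here, not
     * specified on the slide. */
    typedef struct {
        uint16_t hdr;      /* low 12 bits used (assumed) */
        uint32_t dest;     /* 32b PGAS destination address */
        uint32_t data[8];  /* 256b payload: one 32 B write */
    } noc_msg_t;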
– Message passing, memcpy/RDMA to DRAM
– Uses GCC for RISC-V RV32IMA (example invocation below). Thank you!
– Accelerated with custom FUs, AXI cores, RAMs
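As a concrete example of the toolchain flow, a standard riscv-gnu-toolchain cross-compile for an RV32IMA target looks like the line below; demo.c and cluster.ld are placeholder names, and the project's actual build flags are not shown on the slide:

    riscv32-unknown-elf-gcc -march=rv32ima -mabi=ilp32 -O2 \
        -nostartfiles -T cluster.ld -o demo.elf demo.c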
The most 32-bit RISC cores on a chip in any technology
Resource           Use
Logical nets       3.2 M
Cells              1.8 M
CLB LUTs           795 K    67.2%
CLB registers      744 K    31.5%
BRAM               840      38.9%
URAM               840      87.5%
DSP                840      12.3%

Frequency          250 MHz
Peak MIPS          420 GIPS
CRAM bandwidth     2.5 TB/s
NoC bisection BW   900 Gb/s
Power (INA226)     31-40 W
Power/core         18-24 mW
Max VCU118 temp    44 °C
Vivado             2016.4 / ES1
Max RAM use        ~32 GB
Flat build time    11 hours
Tools bugs
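These rows cross-check: 420 GIPS / 250 MHz implies 1,680 cores, and 31-40 W / 1,680 cores ≈ 18-24 mW per core, matching the Power/core row.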
>1000 BRAMs + 6000 DSPs available for accelerators
[Figure: AWS F1 (f1.2xlarge): Xeon host with 122 GB DRAM, ENA networking, and NVMe storage, plus one VU9P FPGA card with 64 GB DRAM]
[Figure: AWS F1 (f1.16xlarge): Xeon hosts with 976 GB DRAM, ENA, and four NVMe drives, connected by a PCIe switch fabric to eight VU9P FPGA cards, each with 64 GB DRAM]
– Message passing with host CPUs (x86 or ARM)
– DRAM channel RDMA request/response messaging (sketched below)
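For illustration only, an RDMA read over the NoC might carry a request like the C sketch below; every name and field here is invented, as the slide only states that the messaging is request/response:

    #include <stdint.h>

    /* Hypothetical request: ask a DRAM channel controller to read 256 B and
     * message the data back to a PGAS reply address. */
    typedef struct {
        uint32_t dram_addr;   /* DRAM byte address to read */
        uint32_t reply_to;    /* PGAS address to receive the response data */
    } rdma_read_req_t;        /* carried in the data field of a NoC message */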
– 1000-core AWS F1 (<$2/hr)
– 80-core PYNQ (Z7020) ($65 edu)
PYNQ-Z1 Demo: Parallel Burst DRAM Readback Test: 80 Cores × 228 × 256 B
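Reading the title as each of the 80 cores streaming 228 bursts of 256 B, a per-core loop might look like this C sketch; dram_base and the plain word-copy are stand-ins, since per the previous slide the actual path is DRAM channel RDMA messaging:

    #include <stdint.h>

    #define BURSTS  228        /* bursts per core, per the demo title */
    #define BURST_W (256 / 4)  /* one 256 B burst = 64 words */

    /* Hypothetical per-core readback: stream bursts from DRAM into cluster RAM. */
    void burst_readback(volatile const uint32_t *dram_base, uint32_t *cmem_buf) {
        for (uint32_t b = 0; b < BURSTS; b++)
            for (uint32_t w = 0; w < BURST_W; w++)
                cmem_buf[w] = dram_base[b * BURST_W + w];
    }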
(Not yet bridged; F1.16XL)
– Specs, libraries, examples, tests
– As PYNQ Jupyter Python notebooks, bitstreams
– AMI+AFI in AWS Marketplace
– Reconfigurable / memory parallelism
– Research and teaching on 80-8,000 core systems
Thank you