Softwar tware-Fir First st FPGA GA Ac Accele elerato rator r - - PowerPoint PPT Presentation

softwar tware fir first st fpga ga ac accele elerato
SMART_READER_LITE
LIVE PREVIEW

Softwar tware-Fir First st FPGA GA Ac Accele elerato rator r - - PowerPoint PPT Presentation

2GRVI Phalanx: A A Ki Kilocor locore RISC ISC-V V RV64I V64I Pr Processor ocessor Clust ster er Arr rray ay with h HBM BM2 2 In In a Xili linx nx VU37P 37P FPG FPGA Wor ork k in Progr ogress ss Re Repor ort Jan Gray |


slide-1
SLIDE 1

2GRVI Phalanx: A

A Ki Kilocor locore RISC ISC-V V RV64I V64I Pr Processor

  • cessor

Clust ster er Arr rray ay with h HBM BM2 2 In In a Xili linx nx VU37P 37P FPG FPGA

Wor

  • rk

k in Progr

  • gress

ss Re Repor

  • rt

Jan Gray | Gray Re Rese search ch LLC | Bellevue levue, , WA | http: p://fp fpga ga.or .org

slide-2
SLIDE 2

Softwar tware-Fir First st FPGA GA Ac Accele elerato rator r De Desi sign gn

  • Make it easier for programmers to exploit spatial fabrics
  • Manycore accelerator overlays
  • Run C++ or OpenCL kernels on 100s of soft processors
  • Add custom functions/accelerators/memories to suit
  • More 5 second recompiles, fewer 5 hour place and routes
  • Software + overlays = familiar programming experiences, easier

ports, rapid iteration, design agility

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 2

slide-3
SLIDE 3

GR GRVI VI Pha hala lanx nx Accelerator ccelerator Frame ramework work

  • A processor cluster array overlay
  • GRVI

VI/2GRVI /2GRVI: RISC-V processing elements

  • Phalanx

lanx: : fabric of clusters of PEs, memories, accelerators, bridges, IOs

  • Hoplite:

ite: 2D torus network on chip

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 3

slide-4
SLIDE 4

GR GRVI VI Proces

  • cessing

sing Eleme lement nt

  • Simpler PEs → more PEs → greater memory parallelism
  • GRVI: austere RISC-V RV32I + mul*/lr/sc

2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 2019/11/17 4

~320 LUTs @ 400 MHz

slide-5
SLIDE 5

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 5

PE PE PE PE PE PE PE PE 2:1 2:1 2:1 2:1 4:4

XBAR

CMEM = 128 KB CLUSTER DATA IMEM 4-8 KB ACCELERATOR(S) 64 IMEM 4-8 KB IMEM 4-8 KB IMEM 4-8 KB

32

~3500 LUTs

GR GRVI VI Cl Clus uster er: : PE PEs, s, Sh Shar ared ed Mem emor

  • ry,

, Acc ccel elerato erators

slide-6
SLIDE 6

Clu luste ster r Compo positio sition: n: Me Mess ssage ge Passi ssing ng On On a a No NoC

  • Hoplite

te: FPGA-optimal 2D torus NoC router

  • Single flits, unidirectional rings, deflection routing, multicast; configurable
  • 300b-wide router uses only ~330 LUTs

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 6

YI XI X Y

1,0 2,0 3,0 0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3

C C C C C C C C C C C C C C C C

256b @ 400 MHz = 100 Gb/s links

slide-7
SLIDE 7

GR GRVI VI Cl Clus uster er: : PE PEs, s, Mem emor

  • ry,

, Rou

  • uter

er, , Mes essa sage ge Pas assi sing ng

PE PE PE PE PE PE PE PE 2:1 2:1 2:1 2:1 4:4

XBAR

CMEM = 128 KB CLUSTER DATA IMEM ACCELERATOR(S)

NoC/ACCEL ITF

HOPLITE ROUTER

64 IMEM IMEM IMEM 310 256

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 7

PGAS: { mx:1; my:1; x:4; y:6; addr:20 }

  • r { dram_addr:40 }
slide-8
SLIDE 8

10 10×5 5 Cl Clus uster ters s × 8 8 PEs s = 40 400 PE 0 PEs

(KU040 040, , 12/2 /2015)

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 8

slide-9
SLIDE 9

30 30×7 7 Clu luste sters s x x 8 PE PE = 1 = 1680 80 PE PEs, s, 26 MB MB S SRAM AM

(VU9 U9P , , 12/2 /2016)

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 9

  • 400,000 MIPS @ 250 MHz @ 40 W
slide-10
SLIDE 10

GR GRVI VI Pha hala lanx nx V1 Sho Shortcomin tcomings gs

  • 32b pointers: awkward for big data on AWS F1, OpenCL
  • 32b accesses: wastes half of 64b UltraRAMs bandwidth?
  • In-order μarch: stall on loads = ~5 cycles
  • DDR4 bandwidth << G

GPU GDDRx Rx/H /HBM2 BM2 ba bandw dwidt dth

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 10

slide-11
SLIDE 11

Ult ltraSc raScale ale+ + HB HBM2 M2 FPGAs! PGAs!

  • VU37P w/ two 4 GB HBM2 stacks
  • 32 AXI-HBM bridge/controllers
  • 32 x 256b x 450 MHz = 460 GB/s

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 11

slide-12
SLIDE 12

V2 Red 2 Redesign esign fo for r HB HBM M FPGAs PGAs

  • Latency tolerant 2GRVI RV64I PEs
  • 64b cluster datapaths
  • 32B/cycle deep pipeline NoC-AXI RDMA bridges
  • Double NoC column rings

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 12

slide-13
SLIDE 13

2GR GRVI VI – A S A Sim impl ple, e, Latency ency T

  • le

lerant nt RV6 V64I 4I PE PE

  • Register file scoreboard: only stall issue on use of a bu

busy sy register

  • Concurrent execution and out of order retirement
  • Example: unrolled block copy – no issue stalls even with 7 cycle memory
  • 400 6-LUTs (sans <<)!

2019/11/17 13

slide-14
SLIDE 14

GR GRVI VI vs

  • vs. 2G

2GRVI VI

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 14

32b GRVI PE 64b 2GRVI VI PE Year 2015 Q4 2019 Q2 ISA RV32I + mul/lr/sc RV64I 4I + lr/sc Area 320 6-LUTs 400 6-LUT UTs s (sans s shared ed <<) Fmax / congested 400 / 300 MHz 500+ 500+ / TBD MHz Pipeline stages 2 / 3 2 / 3 / 4 (superpipelined) Out-of-order retire yes Cluster, load interval 5 cycles 1 / / c cycl cle Cluster, load-to-use 5 cycles 3-6 cycles Cluster, Σ RAM BW 4.8 GB/s (300 MHz) 12.8 GB/s s (400 MHz)

slide-15
SLIDE 15

Ph Phalanx lanx SoC: : 15x1 x15-3 3 Ar Array ay of Clu luste sters s + + HB HBM + P M + PCIe

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 15

C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C A A A A C C C C C C C C C C C C A A A A C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C A A A A C C C C C C A A A H H H H H H H H H H H H H H H H H H H H H H H H H H H H H H 4 GB HBM2 DRAM STACK 4 GB HBM2 DRAM STACK PCIe DMA H 300

C A HH

Cluster { 8 GRVI / 6 2GRVI, 4-8 KB IRAM, 128 KB CRAM, Hoplite router } NoC-AXI RDMA bridge { 2 256b AXI R/W req queues, 2 resp queues } Two AXI-switch-MC-HBM2 bridges, each 256b R/W at up to 450 MHz Unidirectional Hoplite NoC X-ring rows and Y-ring columns

PE ↔ Cluster RAM ↔ NoC ↔ AXI ↔ HBM

  • 32 B write request message;

32×n B burst-read request → n×32 B read responses

  • PE sends R/W request message to its NoC-AXI bridge;

bridge issues request to its AXI-HBM channel(s); bridge sends read response messages to dest. address

  • 32 B write + 32 B read response per cycle per bridge
  • Measured ~130 GB/s write + ~130 GB/s read at 300 MHz
slide-16
SLIDE 16

No NoC-AXI AXI-HBM HBM Tra ransaction nsactions s in in Fl Flight ight

slide-17
SLIDE 17

22 222 x 8 2 x 8 GR GRVI VI PEs s = 17 1776 76 RV32 32I I PEs 22 222 x 6 2 x 6 2 2GR GRVI VI PEs = 13 s = 1332 32 RV64 64I I PEs

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 17

slide-18
SLIDE 18

Pha halanx lanx-HBM HBM2 2 Ne Next xt St Step eps

  • Tune up to 400+ MHz = ~200 GB/s writes + ~200 GB/s reads
  • Computational HBM2 – compute at the bridges
  • Scatter/gather, add-to-memory, block zero, copy, hash, reduce, select,

regexp, sort, …

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 18

slide-19
SLIDE 19

Pha halanx lanx Par arallel allel Programm

  • gramming

ing Models

  • dels
  • Architecture: array of clusters of PEs, no caches, message passing
  • T
  • day: bare metal C/C++ + message passing runtime
  • Future
  • Flat data parallel NDRange OpenCL kernels
  • Streaming kernels composed with OpenCL pipes
  • ‘Gatling gun’ parallel packet processing

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 19

slide-20
SLIDE 20

An An Op Open enCL-li like ke Mo Mode del a l and nd T

  • ols

ls

  • Familiar to GPU developers(?)
  • Host side: Xilinx SDAccel OpenCL runtime
  • Setup, copy buffers, queue parallel kernel calls, wait, copy results
  • FPGA side: GRVI Phalanx  SDAccel-for-RTL shell
  • Map work

k groups ps to PE clusters, work k items s to PEs

  • Memory: global

al = HBM; local al = cluster RAM (static); private ivate = thread (auto)

  • Scheduler (PEs at cluster 0): distribute kernels, map work groups to idle clusters
  • Plan. Not yet implemented

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 20

slide-21
SLIDE 21

OpenCL “Like”

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 21

kernel void vector_add( global int* g_a, global int* g_b, global int* g_sum, const unsigned n) { local align int a[N], b[N], sum[N]; int iloc = get_local_id(0) * n; int iglb = (get_group_id(0) * get_local_size(0) + get_local_id(0)) * n; int size = n * sizeof(int); copy(a + iloc, g_a + iglb, size); // from HBM copy(b + iloc, g_b + iglb, size); barrier(CLK_LOCAL_MEM_FENCE); for (int i = 0; i < n; ++i) sum[i] = a[i] + b[i]; barrier(CLK_LOCAL_MEM_FENCE); copy(g_sum + iglb, sum + iloc, size); // to HBM }

slide-22
SLIDE 22

T ake ake Aways ways

  • (Prior work)
  • Software-first, software-mostly manycore accelerators
  • Die filling, FPGA frugal, clustered, tiled, NoC-interconnected overlays
  • Demo

mocra cratiz tizin ing HBM

  • Xilinx AXI-HBM bridges are easy to use, simplify interconnects, save 100Ks LUTs
  • HBM bandwidth

width is now access cessible ible to all

  • T
  • wards an OpenCL-like SDK, on AWS F1, Azure NP10, Alveo
slide-23
SLIDE 23

Ref efer erences ences

  • The Past and Future of FPGA Soft Processors (2014)

http://fpga.org/2014/12/31/the-past-and-future-of-fpga-soft-processors/

  • GRVI Phalanx and Hoplite NoC (2016)

http://fpga.org/grvi-phalanx/ http://fpga.org/hoplite/

  • 2GRVI Phalanx at Hot Chips 31 (2019)

http://fpga.org/2019/08/19/2grvi-phalanx-at-hot-chips-31-2019/

  • Xilinx AXI High Bandwidth Memory Controller v1.0

https://www.xilinx.com/support/documentation/ip_documentation/hbm/v1_0/pg276-axi-hbm.pdf

2019/11/17 2GRVI Phalanx: Kilocore RV64I + HBM Accelerator Framework 23