
Software-First FPGA Accelerator - PowerPoint PPT Presentation



  1. 2GRVI Phalanx: A Kilocore RISC-V RV64I Processor Cluster Array with HBM2 in a Xilinx VU37P FPGA. Work-in-Progress Report. Jan Gray | Gray Research LLC | Bellevue, WA | http://fpga.org

  2. Software-First FPGA Accelerator Design
  • Make it easier for programmers to exploit spatial fabrics
  • Manycore accelerator overlays
  • Run C++ or OpenCL kernels on 100s of soft processors
  • Add custom functions/accelerators/memories to suit
  • More 5-second recompiles, fewer 5-hour place-and-routes
  • Software + overlays = familiar programming experiences, easier ports, rapid iteration, design agility

  3. GRVI Phalanx Accelerator Framework
  • A processor cluster array overlay
  • GRVI/2GRVI: RISC-V processing elements
  • Phalanx: fabric of clusters of PEs, memories, accelerators, bridges, IOs
  • Hoplite: 2D torus network on chip

  4. GRVI Processing Element
  • Simpler PEs → more PEs → greater memory parallelism
  • GRVI: austere RISC-V RV32I + mul/lr/sc, ~320 LUTs @ 400 MHz

  5. GRVI Cluster: PEs, Shared Memory, Accelerators
  [Block diagram: 8 PEs, each with a 4-8 KB IMEM; 4:4 and 2:1 crossbars (XBAR) with 32b/64b paths into a 128 KB cluster data memory (CMEM); accelerator port(s); whole cluster ≈ 3500 LUTs]

  6. Cluster Composition: Message Passing on a NoC
  • Hoplite: FPGA-optimal 2D torus NoC router (routing decision sketched below)
  • Single flits, unidirectional rings, deflection routing, multicast; configurable
  • 300b-wide router uses only ~330 LUTs
  [Diagram: 4 × 4 torus of Hoplite routers, each with a client C; 256b @ 400 MHz = 100 Gb/s links]
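  For intuition, a minimal C sketch of one Hoplite-style routing decision, assuming dimension-ordered X-then-Y routing where the Y-ring input keeps the Y output and a conflicting X-ring flit is deflected onward rather than buffered; the flit fields and port names are illustrative, not the actual RTL interface.

      #include <stdbool.h>
      #include <stdint.h>

      /* Illustrative flit: destination (x,y) plus a 256b payload. */
      typedef struct { bool v; uint8_t dx, dy; uint8_t data[32]; } flit;

      /* One routing decision at router (x,y): flits travel X first, then Y.
         Delivery to the local client (when yi reaches row y) is omitted. */
      void hoplite_route(uint8_t x, uint8_t y, flit xi, flit yi,
                         flit *xo, flit *yo) {
          *xo = (flit){0};
          *yo = (flit){0};
          if (yi.v && yi.dy != y)
              *yo = yi;                  /* Y input keeps the Y output     */
          if (xi.v) {
              if (xi.dx != x)  *xo = xi; /* continue around the X ring     */
              else if (!yo->v) *yo = xi; /* turn onto the Y ring           */
              else             *xo = xi; /* blocked: deflect, no buffering */
          }
      }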

  7. GRVI Cluster: PEs, Memory, Router, Message Passing
  • PGAS addresses: { mx:1; my:1; x:4; y:6; addr:20 } or { dram_addr:40 } (an encoding sketch follows below)
  [Diagram: cluster as on slide 5, plus a 310b/256b NoC/accelerator interface and Hoplite router attached to the crossbar]
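  A sketch of packing the 32-bit cluster-memory form of that PGAS address. The widths come from the slide (1 + 1 + 4 + 6 + 20 = 32 bits); the msb-to-lsb field order and the reading of mx/my as X/Y multicast flags are assumptions.

      #include <stdint.h>

      /* Assumed layout: { mx:1; my:1; x:4; y:6; addr:20 }, packed msb-first.
         The alternative { dram_addr:40 } form addresses HBM instead. */
      typedef struct {
          uint8_t  mx, my;   /* X/Y multicast flags (1 bit each)     */
          uint8_t  x;        /* destination NoC column (4 bits)      */
          uint8_t  y;        /* destination NoC row (6 bits)         */
          uint32_t addr;     /* offset in cluster memory (20 bits)   */
      } pgas_addr;

      static inline uint32_t pgas_pack(pgas_addr a) {
          return ((uint32_t)(a.mx & 1)    << 31) |
                 ((uint32_t)(a.my & 1)    << 30) |
                 ((uint32_t)(a.x  & 0xF)  << 26) |
                 ((uint32_t)(a.y  & 0x3F) << 20) |
                 (a.addr & 0xFFFFF);
      }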

  8. 10 × 5 Clusters × 8 PEs = 400 PEs (KU040, 12/2015)

  9. 30 × 7 Clusters × 8 PEs = 1680 PEs, 26 MB SRAM (VU9P, 12/2016)
  • 400,000 MIPS @ 250 MHz @ 40 W

  10. GRVI Phalanx V1 Shortcomings
  • 32b pointers: awkward for big data on AWS F1, OpenCL
  • 32b accesses: waste half of the 64b UltraRAM bandwidth
  • In-order μarch: stall on loads ≈ 5 cycles
  • DDR4 bandwidth << GPU GDDRx/HBM2 bandwidth

  11. UltraScale+ HBM2 FPGAs!
  • VU37P w/ two 4 GB HBM2 stacks
  • 32 AXI-HBM bridge/controllers
  • 32 × 256b × 450 MHz = 460 GB/s

  12. V2 Redesign for HBM FPGAs
  • Latency-tolerant 2GRVI RV64I PEs
  • 64b cluster datapaths
  • 32 B/cycle deep-pipeline NoC-AXI RDMA bridges
  • Doubled NoC column rings

  13. 2GRVI – A Simple, Latency-Tolerant RV64I PE
  • Register file scoreboard: only stall issue on use of a busy register
  • Concurrent execution and out-of-order retirement
  • Example: unrolled block copy – no issue stalls even with 7-cycle memory latency (see the sketch below)
  • 400 6-LUTs (sans <<)!
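  The unrolled-copy example can be sketched in C: with a batch of independent loads scheduled ahead of the first use, the scoreboard lets all eight issue on consecutive cycles, and only a store whose source register is still in flight would stall. This is a plausible reconstruction of the slide's example, not its exact code.

      #include <stddef.h>
      #include <stdint.h>

      /* Copy n (a multiple of 8) 64b words. The eight loads per iteration
         have no dependences, so issue proceeds at 1/cycle; by the first
         store, t0 has (nearly) returned from the ~7-cycle cluster memory. */
      void block_copy8(uint64_t *restrict dst, const uint64_t *restrict src,
                       size_t n) {
          for (size_t i = 0; i < n; i += 8) {
              uint64_t t0 = src[i+0], t1 = src[i+1], t2 = src[i+2], t3 = src[i+3];
              uint64_t t4 = src[i+4], t5 = src[i+5], t6 = src[i+6], t7 = src[i+7];
              dst[i+0] = t0; dst[i+1] = t1; dst[i+2] = t2; dst[i+3] = t3;
              dst[i+4] = t4; dst[i+5] = t5; dst[i+6] = t6; dst[i+7] = t7;
          }
      }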

  14. GRVI vs. 2GRVI

                              32b GRVI PE            64b 2GRVI PE
      Year                    2015 Q4                2019 Q2
      ISA                     RV32I + mul/lr/sc      RV64I + lr/sc
      Area                    320 6-LUTs             400 6-LUTs (sans shared <<)
      Fmax / congested        400 / 300 MHz          500+ / TBD MHz
      Pipeline stages         2 / 3                  2 / 3 / 4 (superpipelined)
      Out-of-order retire     no                     yes
      Cluster, load interval  5 cycles               1 / cycle
      Cluster, load-to-use    5 cycles               3-6 cycles
      Cluster, Σ RAM BW       4.8 GB/s (300 MHz)     12.8 GB/s (400 MHz)

  15. Phalanx SoC: 15 × 15 − 3 Array of Clusters + HBM + PCIe
  • PE ↔ cluster RAM ↔ NoC ↔ AXI ↔ HBM
  • 32 B write-request message; 32 × n B burst-read request → n × 32 B read responses (message layouts sketched below)
  • PE sends R/W request message to its NoC-AXI bridge; the bridge issues the request to its AXI-HBM channel(s) and sends read-response messages to the destination address
  • 32 B write + 32 B read response per cycle per bridge
  • Measured ~130 GB/s write + ~130 GB/s read at 300 MHz
  [Floorplan: 15 × 15 grid of clusters (C), a row of NoC-AXI RDMA bridges (A), AXI-HBM channels (H), PCIe DMA, and two 4 GB HBM2 DRAM stacks. Cluster = { 8 GRVI / 6 2GRVI PEs, 4-8 KB IRAM, 128 KB CRAM, Hoplite router }; NoC-AXI RDMA bridge = { 2 × 256b AXI R/W request queues, 2 response queues }; two AXI-switch-MC-HBM2 bridges, each 256b R/W at up to 450 MHz; unidirectional Hoplite NoC X-ring rows and Y-ring columns]
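  Hypothetical C layouts for the two request messages just described. The slide fixes only the 32 B payload granularity and the one-request/n-response burst-read pattern, so the field names and widths here are assumptions.

      #include <stdint.h>

      /* Write request: one 32 B flit stored at an HBM address. */
      typedef struct {
          uint64_t hbm_addr;     /* 40-bit HBM byte address (slide 7's
                                    { dram_addr:40 } form)              */
          uint8_t  payload[32];  /* one 32 B line written per message   */
      } wr_req;

      /* Burst-read request: the bridge replies with n read-response
         messages of 32 B each, sent to the PGAS destination address. */
      typedef struct {
          uint64_t hbm_addr;     /* start of the 32 * n B burst         */
          uint32_t n;            /* number of 32 B responses expected   */
          uint32_t dest;         /* PGAS address the responses go to    */
      } rd_req;

      /* Flow: PE -> bridge (request); bridge -> AXI-HBM channel;
         bridge -> destination cluster RAM (n read-response messages). */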

  16. NoC-AXI-HBM Transactions in Flight

  17. 222 × 8 GRVI PEs = 1776 RV32I PEs; 222 × 6 2GRVI PEs = 1332 RV64I PEs

  18. Phalanx-HBM2 Next Steps
  • Tune up to 400+ MHz = ~200 GB/s writes + ~200 GB/s reads
  • Computational HBM2 – compute at the bridges (a hypothetical request format is sketched below)
  • Scatter/gather, add-to-memory, block zero, copy, hash, reduce, select, regexp, sort, …
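  As a purely hypothetical illustration of "compute at the bridges": an add-to-memory operation might extend the plain R/W messages with an opcode, letting the bridge do a read-modify-write at full channel bandwidth instead of round-tripping data to a PE. Nothing below is an implemented format; it only names the idea.

      #include <stdint.h>

      /* Hypothetical bridge opcodes extending plain R/W requests. */
      typedef enum { OP_READ, OP_WRITE, OP_ADD64, OP_ZERO, OP_COPY } bridge_op;

      typedef struct {
          uint8_t  op;           /* one of bridge_op                    */
          uint64_t hbm_addr;     /* target HBM address                  */
          uint64_t operand[4];   /* e.g. 4 x 64b addends for OP_ADD64   */
      } bridge_req;

      /* Conceptual bridge-side effect of OP_ADD64, per 32 B line:
         for (i = 0; i < 4; ++i) mem64[hbm_addr/8 + i] += operand[i];   */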

  19. Phalanx Parallel Programming Models
  • Architecture: array of clusters of PEs, no caches, message passing
  • Today: bare-metal C/C++ + message-passing runtime (see the sketch below)
  • Future:
    • Flat data-parallel NDRange OpenCL kernels
    • Streaming kernels composed with OpenCL pipes
    • 'Gatling gun' parallel packet processing
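  A minimal sketch of the bare-metal style implied by "C/C++ + message-passing runtime". msg_send/msg_recv and the scheduler's address are hypothetical stand-ins, not the actual Phalanx runtime API; slide 20 locates the scheduler at cluster 0.

      #include <stdint.h>

      /* Hypothetical runtime primitives over the NoC. */
      extern void msg_send(uint32_t pgas_dest, const void *buf, int bytes);
      extern int  msg_recv(void *buf, int max_bytes);  /* blocks for a message */

      /* A PE's event loop: receive work, compute, report back. */
      void worker(void) {
          uint64_t acc = 0, m[4];
          for (;;) {
              int n = msg_recv(m, sizeof m);   /* wait for work from the NoC */
              if (n == 0) break;               /* empty message = shutdown   */
              for (int i = 0; i < n / 8; ++i) acc += m[i];
              msg_send(/* scheduler at cluster 0 */ 0, &acc, sizeof acc);
          }
      }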

  20. An OpenCL-like Model and Tools
  • Familiar to GPU developers(?)
  • Host side: Xilinx SDAccel OpenCL runtime
  • Setup, copy buffers, queue parallel kernel calls, wait, copy results (host-side flow sketched below)
  • FPGA side: GRVI Phalanx in an SDAccel-for-RTL shell
  • Map work groups to PE clusters, work items to PEs
  • Memory: global = HBM; local = cluster RAM (static); private = thread (auto)
  • Scheduler (PEs at cluster 0): distribute kernels, map work groups to idle clusters
  • Plan. Not yet implemented
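  The host-side sequence (setup, copy buffers, queue kernel calls, wait, copy results) sketched with the standard OpenCL C API that the SDAccel runtime exposes, matched to the vector_add kernel on the next slide. The buffer names and work-group geometry are illustrative, and error handling is elided.

      #include <CL/cl.h>

      /* Assumes N is a multiple of n_per_item * 6 (6 work items per
         group, matching the 6 2GRVI PEs per cluster on slide 15). */
      void run_vector_add(cl_context ctx, cl_command_queue q, cl_kernel k,
                          const int *a, const int *b, int *sum, size_t N) {
          cl_mem ga = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     N * sizeof(int), (void *)a, NULL);
          cl_mem gb = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     N * sizeof(int), (void *)b, NULL);
          cl_mem gs = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                     N * sizeof(int), NULL, NULL);
          unsigned n_per_item = 16;               /* elements per work item */
          clSetKernelArg(k, 0, sizeof ga, &ga);
          clSetKernelArg(k, 1, sizeof gb, &gb);
          clSetKernelArg(k, 2, sizeof gs, &gs);
          clSetKernelArg(k, 3, sizeof n_per_item, &n_per_item);
          size_t global = N / n_per_item, local = 6;
          clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
          /* Blocking read: waits for the kernel, then copies results out. */
          clEnqueueReadBuffer(q, gs, CL_TRUE, 0, N * sizeof(int), sum,
                              0, NULL, NULL);
          clReleaseMemObject(ga); clReleaseMemObject(gb); clReleaseMemObject(gs);
      }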

  21. OpenCL “Like”

      // 'align' and copy() are Phalanx dialect extensions; copy() moves
      // data between HBM (global) and cluster RAM (local) via the bridge.
      kernel void vector_add(global int* g_a, global int* g_b,
                             global int* g_sum, const unsigned n) {
          local align int a[N], b[N], sum[N];
          int iloc = get_local_id(0) * n;
          int iglb = (get_group_id(0) * get_local_size(0) + get_local_id(0)) * n;
          int size = n * sizeof(int);
          copy(a + iloc, g_a + iglb, size);   // from HBM
          copy(b + iloc, g_b + iglb, size);
          barrier(CLK_LOCAL_MEM_FENCE);
          for (int i = 0; i < n; ++i)
              sum[iloc + i] = a[iloc + i] + b[iloc + i];
          barrier(CLK_LOCAL_MEM_FENCE);
          copy(g_sum + iglb, sum + iloc, size);   // to HBM
      }

  22. Takeaways
  • (Prior work) Software-first, software-mostly manycore accelerators
  • Die-filling, FPGA-frugal, clustered, tiled, NoC-interconnected overlays
  • Democratizing HBM
  • Xilinx AXI-HBM bridges are easy to use, simplify interconnects, save 100Ks of LUTs
  • HBM bandwidth is now accessible to all
  • Towards an OpenCL-like SDK, on AWS F1, Azure NP10, Alveo
