Scaling the Cascades: Interconnect-aware FPGA implementation of Machine Learning problems - PowerPoint PPT Presentation



SLIDE 1

Scaling the Cascades

Interconnect-aware FPGA implementation of Machine Learning problems

Anand Samajdar, Tushar Garg, Tushar Krishna, Nachiket Kapre nachiket@uwaterloo.ca

DSP URAM BRAM

SLIDE 4

Claim

  • Hard FPGA interconnect (cascades) efficiently supports nearest neighbour communication + reuse in ML workloads
  • Three kinds of UltraScale+ cascades [DSP, BRAM, URAM]
  • Combination of (1) pixel, (2) row, (3) map reuse
  • Deliverables:
  • 650 MHz full-chip operation
  • 7x better latency, 30% lower throughput than the formidable Xilinx SuperTile design for GoogLeNet v1

4

SLIDE 5

Landscape of FPGA+ML accelerators

5

SLIDE 6

Communication Requirements

  • 3x3 Convolution

[Figure: Input Rows k, k+1, k+2 of Input Map I are multiplied by the 3x3 Weights and accumulated into Output Row k of Output Map J]

6
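The three streaming patterns named on the next slide fall directly out of the convolution loop nest. A minimal functional sketch (our own illustration, not the paper's hardware):

```python
import numpy as np

def conv3x3(ifmaps, weights):
    """Direct 3x3 convolution (stride 1, no padding).

    ifmaps:  (C_in, H, W)        input feature maps
    weights: (C_out, C_in, 3, 3) one 3x3 kernel per (output, input) map pair
    returns: (C_out, H-2, W-2)   output feature maps
    """
    c_in, h, w = ifmaps.shape
    c_out = weights.shape[0]
    out = np.zeros((c_out, h - 2, w - 2))
    for j in range(c_out):              # Output Map J
        for i in range(c_in):           # (3) channel streaming: partial sums reused across input maps
            for r in range(h - 2):      # (2) row streaming: rows k..k+2 shared by adjacent output rows
                for c in range(w - 2):  # (1) pixel streaming: neighbouring pixels shared by adjacent outputs
                    out[j, r, c] += np.sum(ifmaps[i, r:r+3, c:c+3] * weights[j, i])
    return out
```

Each inner index reuses operands fetched for its neighbours, which is exactly the nearest-neighbour communication the cascades absorb.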

SLIDE 7

Communication Requirements

  • 3x3 Convolution

[Figure: the same 3x3 convolution, annotated with (1) pixel streaming, (2) row streaming, (3) channel streaming]

7

SLIDE 8

Reuse Patterns

[Figure: 3x3 grid of multiply-add (x+) units, one row per Input Row k, k+1, k+2, holding Weight Rows 0-2; P cascade for summation; pixel streaming]

8

SLIDE 9

Reuse Patterns

[Figure: same grid — P cascade for summation, A cascade for (1) pixel streaming, B cascade for weight streaming]

9

SLIDE 10

Reuse Patterns

[Figure: same grid, exploiting data reuse — (1) pixel streaming, (2) row streaming]

10

SLIDE 11

Reuse Patterns

[Figure: same grid, placed between Input Map I and Output Map J — (1) pixel streaming, (2) row streaming]

11

SLIDE 12

Reuse Patterns

[Figure: a 3x3 Convolution Tile with its Weights, mapping Input Map I to Output Map J]

12

SLIDE 13

Reuse Patterns

[Figure: one 3x3 Convolution Tile per input map (I, I+1, I+..), each with its own Weights, all feeding Output Map J]

13

SLIDE 14

Reuse Patterns

[Figure: same tiles — (3) channel streaming across the input maps]

14

SLIDE 15

Reuse Patterns

[Figure: input maps I, I+1, I+.. sharing one 3x3 Convolution Tile — (3) channel streaming]

15

SLIDE 16

Xilinx UltraScale+ FPGA Cascades

  • BRAM18 supports A/B cascades, 2x72b-wide links
  • DSP48 supports A, B, P cascades (systolic input and summation)
  • URAM288 supports A, B cascades

[Figure: DSP, URAM, and BRAM cascade columns, annotated with (1) pixel streaming, (2) row streaming, (3) channel streaming]

16

SLIDE 17

Outline

  • Understanding Cascades
  • Assembling the FPGA accelerator + FPGA Layout
  • MLPerf Evaluation
  • Conclusions + Discussion

17

SLIDE 18

Promise of Cascades

  • Absorb data movement onto dedicated interconnect vs. general-purpose wiring
  • Higher clock frequency operation, layout-friendly architecture

18

SLIDE 19

Our approach

  • Exploit cascades aggressively!
  • DSP48
  • For 3x3 convolution, length-9 cascades
  • P cascade for summation (like INT8 paper)
  • A cascade for systolic retiming (like DSP48E2 user guide)
  • B cascades for weights (our contribution)

19
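Functionally, a length-9 DSP48 chain for one 3x3 kernel is a dot product distributed over the P cascade: each DSP multiplies its parked weight by the incoming pixel and adds the partial sum arriving on PCIN. A behavioral sketch (our own model; it ignores the per-stage pipeline registers of the real cascade):

```python
def cascade_dot9(weights, pixels):
    """Model of 9 chained DSP48s: B cascade parks one weight per DSP,
    A cascade delivers pixels, P cascade carries the running sum.

    weights, pixels: sequences of 9 values in kernel scan order."""
    p = 0                    # partial sum entering the chain (PCIN of DSP 0)
    for w, a in zip(weights, pixels):
        p = p + w * a        # each DSP48: multiply, then add incoming PCIN
    return p                 # PCOUT of the last DSP = full 3x3 dot product
```

The systolic retiming on the A cascade only changes *when* each product is formed, not the value that emerges from PCOUT.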

SLIDE 20

Our approach

  • Exploit cascades aggressively!
  • RAMB18E2 (our contribution)
  • For 3x3 convolution, only need 3 BRAM-long chains
  • A/B cascade for shift operation
  • Swap between A and B to keep one read port available

20
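The shift operation above behaves like three chained line buffers: a pixel evicted from one buffer cascades into the next, so the three buffers always hold three consecutive image rows. A minimal functional sketch (our own model, not the paper's RTL):

```python
from collections import deque

class LineBuffers:
    """Three chained row buffers (standing in for the 3 cascaded BRAMs)."""
    def __init__(self, width):
        self.rows = [deque([0] * width) for _ in range(3)]

    def push(self, pixel):
        """Shift one pixel in; return the aligned 3-pixel column
        (newest row first) for the 3x3 window."""
        for buf in self.rows:
            evicted = buf.popleft()
            buf.append(pixel)
            pixel = evicted          # cascades into the next line buffer
        return [buf[0] for buf in self.rows]
```

After a full image row has been pushed, each `push` delivers one column of the sliding window with a single write per buffer, matching the one-read-port-free constraint on the slide.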

SLIDE 21

Our approach

  • Exploit cascades aggressively!
  • URAM288 (our contribution)
  • Alternating A/B cascades of length-2
  • Both data + address cascades
  • Shift operation tricky!
  • Due to 72b width and resource ratios, idle cycles are available for realizing shifts

21

SLIDE 22

Putting it together

[Diagram: 3x3 array of DSP48 multiply-add (x +) units]

22


SLIDE 24

Putting it together

[Diagram: DSP48 array — Weights (initial shift)]

24

SLIDE 25

Putting it together

[Diagram: DSP48 array — Weights (initial shift), pixel streaming]

25

SLIDE 26

Putting it together

[Diagram: DSP48 array fed by RAMB18 (A), (B), (C) holding Row i, Row i+1, Row i+2]

26

SLIDE 27

Putting it together

[Diagram: same arrangement — row streaming through RAMB18 (A), (B), (C)]

27

SLIDE 28

Putting it together

[Diagram: adds RAMB18 (Kern) for Weights and URAM288 (Input), chained from previous URAM to next URAM]

28

SLIDE 29

Putting it together

[Diagram: same arrangement — map streaming through the URAM chain]

29

SLIDE 30

Putting it together

[Diagram: adds URAM288 (Output) accumulating (+) results]

30

SLIDE 31

Putting it together

[Diagram: the complete tile — (1) pixel streaming, (2) row streaming, (3) channel streaming]

31

SLIDE 32

A 3x3 tile layout

32


SLIDE 34

A 3x3 tile layout

Places and routes at 1.2ns

34

SLIDE 35

Tiling the design

  • VU37P device has a specific resource mix
  • For each URAM, you get 4.2 BRAMs and 9.4 DSP48s
  • Repeating pattern must conform to this ratio
  • One tile: 2 URAMs, 8 BRAMs, 18 DSPs
  • Physical layout XDC constraints must account for irregular column arrangement of hard resources

35
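The ratios above can be checked directly. Device counts are taken from Xilinx's public VU37P tables (treat them as assumptions here): 9024 DSP48E2, 4032 RAMB18, 960 URAM288.

```python
# VU37P hard-block counts (assumed from Xilinx product tables)
dsp, bram18, uram = 9024, 4032, 960

print(bram18 / uram)   # 4.2  BRAM18 per URAM
print(dsp / uram)      # 9.4  DSP48 per URAM

# One tile = 2 URAMs, 8 BRAMs, 18 DSPs -> tile count is URAM-limited
tiles = uram // 2
assert 8 * tiles <= bram18 and 18 * tiles <= dsp
print(tiles)           # 480 tiles fit on the device
```

Note that 480 tiles x 18 DSPs = 8640 DSP48s, which lines up with the 960x9 systolic array quoted later for the convolution mapping.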

SLIDE 36

Tiling the design

[Same bullets as SLIDE 35, with ONE TILE outlined in the device layout]

36

SLIDE 37

Matrix-Matrix Multiplication

  • Limited reuse opportunities
  • Split large matrix across URAMs
  • Each URAM stores a set of complete rows —> allows result vector to be independently processed
  • Partial vector results then circulated across the chip in a ring-like fashion —> using BRAM cascades
  • URAM cascades only used for loading matrix at start

37
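The row-split scheme is easy to sketch: because every bank owns complete rows, each output slice is independent, and only the finished slices need to travel. A minimal model (our own illustration; on chip the slices circulate over BRAM cascades in a ring rather than being concatenated in one step):

```python
import numpy as np

def uram_split_matvec(matrix, vector, n_banks=4):
    """Each bank (one or more URAMs) holds complete matrix rows, so its
    slice of the result vector is computed with purely local reads.
    Assembling the slices models the ring circulation of partial results."""
    banks = np.array_split(matrix, n_banks, axis=0)   # complete rows per bank
    return np.concatenate([bank @ vector for bank in banks])
```

The key property is that no bank ever needs another bank's matrix data at compute time, which is why URAM cascades are only exercised during the initial matrix load.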

SLIDE 38

VU37P Layout

38

SLIDE 39

VU37P FPGA Mapping

[Floorplan: CONVOLUTION and MATRIX-MULTIPLY regions]

39

SLIDE 40

VU37P FPGA Mapping

[Floorplan: ~80% of the device mapped to CONVOLUTION, ~20% to MATRIX-MULTIPLY]

40

SLIDE 41

VU37P FPGA Mapping

[Floorplan: the same 80% / 20% split, shown on the device view]

41

SLIDE 42

Effect of using cascades

  • Registers in hard interconnect save us fabric registers for other pipelining needs
  • Clock period marginally better
  • Obvious reduction in interconnect utilization

[Chart: route count (25K-75K) vs. congestion (20-80%), Cascade vs. No Cascade]

42


SLIDE 44

Evaluation Methodology

  • We use the SCALE-Sim cycle-accurate simulator
  • https://github.com/ARM-software/SCALE-Sim
  • Map URAMs -> IFMAP/OFMAP SRAMs
  • BRAM and DSP cascades => systolic array links
  • VU37P can fit systolic array of size 960x9 (conv), 480x9 (mm)

44
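The quoted array sizes can be sanity-checked against the device's 9024 DSP48s (count assumed from Xilinx tables, as before):

```python
# Systolic array dimensions quoted for the SCALE-Sim runs
dsp_total = 9024
conv_rows, conv_cols = 960, 9   # one length-9 DSP cascade per 3x3 kernel
mm_rows, mm_cols = 480, 9

print(conv_rows * conv_cols)    # 8640 DSPs for convolution
print(mm_rows * mm_cols)        # 4320 DSPs for matrix-multiply
print(round(conv_rows * conv_cols / dsp_total * 100))  # 96 (% of device)
```

The 9-wide dimension is fixed by the length-9 DSP cascade of a 3x3 kernel; only the row count scales with the device.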

SLIDE 45

Xilinx SuperTile

  • GoogLeNet v1 mapped to VU9P FPGA (Amazon F1)
  • 3046 images/s + 3.3ms latency
  • Scorching 720 MHz operation!
  • Mind-numbing 88% overall efficiency

http://isfpga.org/slides/Compute-Efficient_Neural-Network_Acceleration.pdf 45

SLIDE 46

Why is SuperTile so good?

  • Base Design
  • High-frequency layout using DSP cascades
  • Systolic data movement in fabric
  • Throughput boost
  • Decompose the systolic array into sub-arrays
  • Perform pipelining across CNN layers
  • Sacrifice some latency to significantly boost throughput!

https://dl.acm.org/citation.cfm?id=3293925

46
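The latency/throughput trade behind layer pipelining is simple arithmetic. A toy illustration (hypothetical stage times, not SuperTile's actual numbers):

```python
# Hypothetical per-layer sub-array processing times (ms per image)
stage_ms = [0.4, 0.5, 0.3]

# Pipelining images across layer sub-arrays:
latency_ms = sum(stage_ms)            # one image still traverses every stage
throughput = 1000 / max(stage_ms)     # one image completes per slowest stage

print(round(latency_ms, 2), throughput)   # 1.2 2000.0
```

Without pipelining, throughput would also be 1000 / sum(stage_ms) ≈ 833 images/s, so decomposing into sub-arrays buys throughput at the cost of per-image latency, which is exactly the trade this work inverts.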

SLIDE 47

MLPerf Benchmarks

  • Caveat: Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
  • Different ML workloads
  • Various domains
  • Different compute complexity

47


SLIDE 49

MLPerf Benchmarks

35 MB URAM Capacity!

  • Caveat: Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
  • Different ML workloads
  • Various domains
  • Different compute complexity

49

SLIDE 50

Performance Results

50


SLIDE 52

Performance Results

7x Lower 
 Latency

52

SLIDE 53

Performance Results

30% Lower Throughput

53

SLIDE 54

Performance Results

LAYER PIPELINING

30% Lower 
 Throughput

54

SLIDE 55

Performance Results

LAYER PIPELINING

THREE COPIES 30% Lower 
 Throughput

55

SLIDE 56

Performance Results

56


SLIDE 58

Optimizing the mapping

58

SLIDE 59

Compute Characteristics

59

SLIDE 60

Conclusions

  • 650+ MHz operation for FPGA ML accelerator tailored for Xilinx UltraScale+
  • 7x better latency, 30% poorer throughput vs. Xilinx SuperTile
  • Hard interconnect cascades save us 30% registers + 12% on clock period vs. fabric interconnect

60

SLIDE 61

Discussion

  • URAM Bandwidth balance — Matrix-Multiplication performance suffers due to missing bandwidth from URAM —> Give us 144b ports vs 72b ports!
  • Dynamic Control — Can build unified Conv + MM blocks if data flow in cascades were even more programmable —> Give us more control!
  • Abandon Versal — Versal architecture is not an FPGA. Improve DSPs, BRAMs, URAMs + hard interconnect instead —> Stay true to your roots, Xilinx!

61


SLIDE 63

Communication Requirements

  • Matrix Multiplication

[Figure: matrix rows multiplied by Vectors i..i+k, products accumulated into a Partial result vector]

63


SLIDE 65

Communication Requirements

  • Matrix Multiplication

[Figure: same matrix-vector pattern, annotated with (1) accumulation streaming, (2) vector fanout, (3) matrix initialization]

65

SLIDE 66

Communication Pattern

[Figure: 3x3 grid of multiply-add (x+) units processing Vectors i..i+k against the Matrix; P cascade for summation]

66

SLIDE 67

Communication Pattern

[Figure: same grid — (1) accumulation stream]

67

SLIDE 68

Communication Pattern

[Figure: same grid — (1) accumulation stream into the Partial Result]

68

SLIDE 69

Communication Pattern

[Figure: same grid — (1) accumulation stream, (2) vector fanout]

69

SLIDE 70

Communication Pattern

[Figure: same grid — (1) accumulation stream, (2) vector fanout, (3) matrix initialization]

70

SLIDE 71

Matrix-Matrix Multiplication

[Diagram: 3x3 array of DSP48 multiply-add (x +) units]

71

SLIDE 72

Matrix-Matrix Multiplication

[Diagram: DSP48 array with one RAMB18 per DSP48]

72

SLIDE 73

Matrix-Matrix Multiplication

[Diagram: same array — vector fanout through the RAMB18s]

73

SLIDE 74

Matrix-Matrix Multiplication

[Diagram: adds URAM288 holding the matrix]

74

SLIDE 75

Matrix-Matrix Multiplication

[Diagram: same arrangement — matrix reads from URAM288]

75

SLIDE 76

Matrix-Matrix Multiplication

[Diagram: URAM288 chained from previous URAM to next URAM]

76

SLIDE 77

Matrix-Matrix Multiplication

[Diagram: initial loading of the matrix through the URAM chain; RAMB18 accumulation (+)]

77

SLIDE 78

MLPerf Benchmarks

  • Caveat: Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
  • Different ML workloads
  • Various domains
  • Different compute complexity
  • SoftMax not implemented in hardware

78

SLIDE 79

Why is SuperTile so good?

  • Base Design
  • High-frequency layout using DSP cascades
  • LUT RAMs for weights
  • Systolic data movement in fabric
  • Throughput boost
  • Decompose the systolic array into sub-arrays
  • Perform pipelining across CNN layers
  • Sacrifice some latency to significantly boost throughput!

79

SLIDE 80

Xilinx INT8 optimization

  • Soft-fracture a DSP48 27x18 multiplier to compute two 8x8 multiplications with a common operand
  • A*B and D*B computed together when A, B, D are 8-bit inputs

https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf https://www.xilinx.com/support/documentation/white_papers/wp487-int8-acceleration.pdf

80
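The packing trick can be demonstrated in a few lines: pre-shifting A by 18 bits puts A*B and D*B in disjoint fields of one wide product. Unsigned 8-bit is shown for clarity (the real DSP48 scheme also handles signed operands via sign-extension corrections; see the white papers above):

```python
def packed_int8_mul(a, d, b):
    """Compute a*b and d*b with one wide multiply, wp487-style (unsigned sketch)."""
    assert 0 <= a < 256 and 0 <= d < 256 and 0 <= b < 256
    packed = (a << 18) + d           # fits the 27-bit pre-adder output
    p = packed * b                   # single 27x18-style multiply
    ab = p >> 18                     # upper field: a*b
    db = p & ((1 << 18) - 1)         # lower 18 bits: d*b (max 16 bits, no overlap)
    return ab, db

print(packed_int8_mul(200, 100, 50))   # (10000, 5000)
```

The 18-bit spacing leaves 2 guard bits above each 16-bit product, which is the headroom the next slide's length-7 accumulation chains rely on.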

SLIDE 81

Xilinx INT8 optimization

  • Create DSP cascades of length 7 to accumulate multiple product terms
  • Enough precision headroom to avoid overflow for this length of the chain
  • Some fabric operation needed for final accumulations, or if 3x3 convolution support is required

https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf https://www.xilinx.com/support/documentation/white_papers/wp487-int8-acceleration.pdf

81

SLIDE 82

Xilinx DSP48 Systolic Mode

https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf

82