Scaling the Cascades: Interconnect-aware FPGA implementation of Machine Learning problems - PowerPoint PPT Presentation



SLIDE 1

Scaling the Cascades

Interconnect-aware FPGA implementation of Machine Learning problems

Anand Samajdar, Tushar Garg, Tushar Krishna, Nachiket Kapre nachiket@uwaterloo.ca

DSP URAM BRAM

SLIDE 4

Claim

  • Hard FPGA interconnect (cascades) efficiently supports nearest neighbour communication + reuse in ML workloads
  • Three kinds of UltraScale+ cascades [DSP, BRAM, URAM]
  • Combination of (1) pixel, (2) row, (3) map reuse
  • Deliverables:
  • 650 MHz full-chip operation
  • 7x better latency, 30% lower throughput than the formidable Xilinx SuperTile design for GoogLeNet v1

4

SLIDE 5

Landscape of FPGA+ML accelerators

5

SLIDE 6

Communication Requirements

  • 3x3 Convolution

[Figure: Input Rows k, k+1, k+2 of Input Map I are multiplied by the 3x3 Weights and accumulated into Output Row k of Output Map J]

6
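The three streaming patterns named on the next slide fall directly out of the convolution loop nest. A minimal functional sketch (our own illustration, not the paper's hardware):

```python
import numpy as np

def conv3x3(ifmaps, weights):
    """Direct 3x3 convolution (stride 1, no padding).

    ifmaps:  (C_in, H, W)        input feature maps
    weights: (C_out, C_in, 3, 3) one 3x3 kernel per (output, input) map pair
    returns: (C_out, H-2, W-2)   output feature maps
    """
    c_in, h, w = ifmaps.shape
    c_out = weights.shape[0]
    out = np.zeros((c_out, h - 2, w - 2))
    for j in range(c_out):              # Output Map J
        for i in range(c_in):           # (3) channel streaming: partial sums reused across input maps
            for r in range(h - 2):      # (2) row streaming: rows k..k+2 shared by adjacent output rows
                for c in range(w - 2):  # (1) pixel streaming: neighbouring pixels shared by adjacent outputs
                    out[j, r, c] += np.sum(ifmaps[i, r:r+3, c:c+3] * weights[j, i])
    return out
```

Each inner index reuses operands fetched for its neighbours, which is exactly the nearest-neighbour communication the cascades absorb.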

SLIDE 7

Communication Requirements

  • 3x3 Convolution

[Figure: the same 3x3 convolution, annotated with (1) pixel streaming, (2) row streaming, (3) channel streaming]

7

SLIDE 8

Reuse Patterns

[Figure: 3x3 grid of multiply-add (x+) units, one row per Input Row k, k+1, k+2, holding Weight Rows 0-2; P cascade for summation; pixel streaming]

8

SLIDE 9

Reuse Patterns

[Figure: same grid — P cascade for summation, A cascade for (1) pixel streaming, B cascade for weight streaming]

9

SLIDE 10

Reuse Patterns

[Figure: same grid, exploiting data reuse — (1) pixel streaming, (2) row streaming]

10

SLIDE 11

Reuse Patterns

[Figure: same grid, placed between Input Map I and Output Map J — (1) pixel streaming, (2) row streaming]

11

SLIDE 12

Reuse Patterns

[Figure: a 3x3 Convolution Tile with its Weights, mapping Input Map I to Output Map J]

12

SLIDE 13

Reuse Patterns

[Figure: one 3x3 Convolution Tile per input map (I, I+1, I+..), each with its own Weights, all feeding Output Map J]

13

SLIDE 14

Reuse Patterns

[Figure: same tiles — (3) channel streaming across the input maps]

14

SLIDE 15

Reuse Patterns

[Figure: input maps I, I+1, I+.. sharing one 3x3 Convolution Tile — (3) channel streaming]

15

SLIDE 16

Xilinx UltraScale+ FPGA Cascades

  • BRAM18 supports A/B cascades, 2x72b-wide links
  • DSP48 supports A, B, P cascades (systolic input and summation)
  • URAM288 supports A, B cascades

[Figure: DSP, URAM, and BRAM cascade columns, annotated with (1) pixel streaming, (2) row streaming, (3) channel streaming]

16

SLIDE 17

Outline

  • Understanding Cascades
  • Assembling the FPGA accelerator + FPGA Layout
  • MLPerf Evaluation
  • Conclusions + Discussion

17

SLIDE 18

Promise of Cascades

  • Absorb data movement onto dedicated interconnect vs. general-purpose wiring
  • Higher clock frequency operation, layout-friendly architecture

18

SLIDE 19

Our approach

  • Exploit cascades aggressively!
  • DSP48
  • For 3x3 convolution, length-9 cascades
  • P cascade for summation (like INT8 paper)
  • A cascade for systolic retiming (like DSP48E2 user guide)
  • B cascades for weights (our contribution)

19
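Functionally, a length-9 DSP48 chain for one 3x3 kernel is a dot product distributed over the P cascade: each DSP multiplies its parked weight by the incoming pixel and adds the partial sum arriving on PCIN. A behavioral sketch (our own model; it ignores the per-stage pipeline registers of the real cascade):

```python
def cascade_dot9(weights, pixels):
    """Model of 9 chained DSP48s: B cascade parks one weight per DSP,
    A cascade delivers pixels, P cascade carries the running sum.

    weights, pixels: sequences of 9 values in kernel scan order."""
    p = 0                    # partial sum entering the chain (PCIN of DSP 0)
    for w, a in zip(weights, pixels):
        p = p + w * a        # each DSP48: multiply, then add incoming PCIN
    return p                 # PCOUT of the last DSP = full 3x3 dot product
```

The systolic retiming on the A cascade only changes *when* each product is formed, not the value that emerges from PCOUT.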

SLIDE 20

Our approach

  • Exploit cascades aggressively!
  • RAMB18E2 (our contribution)
  • For 3x3 convolution, only need 3 BRAM-long chains
  • A/B cascade for shift operation
  • Swap between A and B to keep one read port available

20
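The shift operation above behaves like three chained line buffers: a pixel evicted from one buffer cascades into the next, so the three buffers always hold three consecutive image rows. A minimal functional sketch (our own model, not the paper's RTL):

```python
from collections import deque

class LineBuffers:
    """Three chained row buffers (standing in for the 3 cascaded BRAMs)."""
    def __init__(self, width):
        self.rows = [deque([0] * width) for _ in range(3)]

    def push(self, pixel):
        """Shift one pixel in; return the aligned 3-pixel column
        (newest row first) for the 3x3 window."""
        for buf in self.rows:
            evicted = buf.popleft()
            buf.append(pixel)
            pixel = evicted          # cascades into the next line buffer
        return [buf[0] for buf in self.rows]
```

After a full image row has been pushed, each `push` delivers one column of the sliding window with a single write per buffer, matching the one-read-port-free constraint on the slide.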

SLIDE 21

Our approach

  • Exploit cascades aggressively!
  • URAM288 (our contribution)
  • Alternating A/B cascades of length-2
  • Both data + address cascades
  • Shift operation tricky!
  • Due to 72b width and resource ratios, idle cycles are available for realizing shifts

21

SLIDE 22

Putting it together

[Diagram: 3x3 array of DSP48 multiply-add (x +) units]

22


SLIDE 24

Putting it together

[Diagram: DSP48 array — Weights (initial shift)]

24

SLIDE 25

Putting it together

[Diagram: DSP48 array — Weights (initial shift), pixel streaming]

25

SLIDE 26

Putting it together

[Diagram: DSP48 array fed by RAMB18 (A), (B), (C) holding Row i, Row i+1, Row i+2]

26

SLIDE 27

Putting it together

[Diagram: same arrangement — row streaming through RAMB18 (A), (B), (C)]

27

SLIDE 28

Putting it together

[Diagram: adds RAMB18 (Kern) for Weights and URAM288 (Input), chained from previous URAM to next URAM]

28

SLIDE 29

Putting it together

[Diagram: same arrangement — map streaming through the URAM chain]

29

SLIDE 30

Putting it together

[Diagram: adds URAM288 (Output) accumulating (+) results]

30

SLIDE 31

Putting it together

[Diagram: the complete tile — (1) pixel streaming, (2) row streaming, (3) channel streaming]

31

SLIDE 32

A 3x3 tile layout

32


SLIDE 34

A 3x3 tile layout

Places and routes at 1.2ns

34

SLIDE 35

Tiling the design

  • VU37P device has a specific resource mix
  • For each URAM, you get 4.2 BRAMs and 9.4 DSP48s
  • Repeating pattern must conform to this ratio
  • One tile: 2 URAMs, 8 BRAMs, 18 DSPs
  • Physical layout XDC constraints must account for irregular column arrangement of hard resources

35
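The ratios above can be checked directly. Device counts are taken from Xilinx's public VU37P tables (treat them as assumptions here): 9024 DSP48E2, 4032 RAMB18, 960 URAM288.

```python
# VU37P hard-block counts (assumed from Xilinx product tables)
dsp, bram18, uram = 9024, 4032, 960

print(bram18 / uram)   # 4.2  BRAM18 per URAM
print(dsp / uram)      # 9.4  DSP48 per URAM

# One tile = 2 URAMs, 8 BRAMs, 18 DSPs -> tile count is URAM-limited
tiles = uram // 2
assert 8 * tiles <= bram18 and 18 * tiles <= dsp
print(tiles)           # 480 tiles fit on the device
```

Note that 480 tiles x 18 DSPs = 8640 DSP48s, which lines up with the 960x9 systolic array quoted later for the convolution mapping.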

SLIDE 36

Tiling the design

[Same bullets as SLIDE 35, with ONE TILE outlined in the device layout]

36

SLIDE 37

Matrix-Matrix Multiplication

  • Limited reuse opportunities
  • Split large matrix across URAMs
  • Each URAM stores a set of complete rows —> allows result vector to be independently processed
  • Partial vector results then circulated across the chip in a ring-like fashion —> using BRAM cascades
  • URAM cascades only used for loading matrix at start

37
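The row-split scheme is easy to sketch: because every bank owns complete rows, each output slice is independent, and only the finished slices need to travel. A minimal model (our own illustration; on chip the slices circulate over BRAM cascades in a ring rather than being concatenated in one step):

```python
import numpy as np

def uram_split_matvec(matrix, vector, n_banks=4):
    """Each bank (one or more URAMs) holds complete matrix rows, so its
    slice of the result vector is computed with purely local reads.
    Assembling the slices models the ring circulation of partial results."""
    banks = np.array_split(matrix, n_banks, axis=0)   # complete rows per bank
    return np.concatenate([bank @ vector for bank in banks])
```

The key property is that no bank ever needs another bank's matrix data at compute time, which is why URAM cascades are only exercised during the initial matrix load.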

SLIDE 38

VU37P Layout

38

SLIDE 39

VU37P FPGA Mapping

[Floorplan: CONVOLUTION and MATRIX-MULTIPLY regions]

39

SLIDE 40

VU37P FPGA Mapping

[Floorplan: ~80% of the device mapped to CONVOLUTION, ~20% to MATRIX-MULTIPLY]

40

SLIDE 41

VU37P FPGA Mapping

[Floorplan: the same 80% / 20% split, shown on the device view]

41

SLIDE 42

Effect of using cascades

  • Registers in hard interconnect save us fabric registers for other pipelining needs
  • Clock period marginally better
  • Obvious reduction in interconnect utilization

[Chart: route count (25K-75K) vs. congestion (20-80%), Cascade vs. No Cascade]

42


SLIDE 44

Evaluation Methodology

  • We use the SCALE-Sim cycle-accurate simulator
  • https://github.com/ARM-software/SCALE-Sim
  • Map URAMs -> IFMAP/OFMAP SRAMs
  • BRAM and DSP cascades => systolic array links
  • VU37P can fit systolic array of size 960x9 (conv), 480x9 (mm)

44
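The quoted array sizes can be sanity-checked against the device's 9024 DSP48s (count assumed from Xilinx tables, as before):

```python
# Systolic array dimensions quoted for the SCALE-Sim runs
dsp_total = 9024
conv_rows, conv_cols = 960, 9   # one length-9 DSP cascade per 3x3 kernel
mm_rows, mm_cols = 480, 9

print(conv_rows * conv_cols)    # 8640 DSPs for convolution
print(mm_rows * mm_cols)        # 4320 DSPs for matrix-multiply
print(round(conv_rows * conv_cols / dsp_total * 100))  # 96 (% of device)
```

The 9-wide dimension is fixed by the length-9 DSP cascade of a 3x3 kernel; only the row count scales with the device.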

SLIDE 45

Xilinx SuperTile

  • GoogLeNet v1 mapped to VU9P FPGA (Amazon F1)
  • 3046 images/s + 3.3ms latency
  • Scorching 720 MHz operation!
  • Mind-numbing 88% overall efficiency

http://isfpga.org/slides/Compute-Efficient_Neural-Network_Acceleration.pdf 45

SLIDE 46

Why is SuperTile so good?

  • Base Design
  • High-frequency layout using DSP cascades
  • Systolic data movement in fabric
  • Throughput boost
  • Decompose the systolic array into sub-arrays
  • Perform pipelining across CNN layers
  • Sacrifice some latency to significantly boost throughput!

https://dl.acm.org/citation.cfm?id=3293925

46
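The latency/throughput trade behind layer pipelining is simple arithmetic. A toy illustration (hypothetical stage times, not SuperTile's actual numbers):

```python
# Hypothetical per-layer sub-array processing times (ms per image)
stage_ms = [0.4, 0.5, 0.3]

# Pipelining images across layer sub-arrays:
latency_ms = sum(stage_ms)            # one image still traverses every stage
throughput = 1000 / max(stage_ms)     # one image completes per slowest stage

print(round(latency_ms, 2), throughput)   # 1.2 2000.0
```

Without pipelining, throughput would also be 1000 / sum(stage_ms) ≈ 833 images/s, so decomposing into sub-arrays buys throughput at the cost of per-image latency, which is exactly the trade this work inverts.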

SLIDE 47

MLPerf Benchmarks

  • Caveat: Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
  • Different ML workloads
  • Various domains
  • Different compute complexity

47


SLIDE 49

MLPerf Benchmarks

35 MB URAM Capacity!

  • Caveat: Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
  • Different ML workloads
  • Various domains
  • Different compute complexity

49

SLIDE 50

Performance Results

50


SLIDE 52

Performance Results

7x Lower 
 Latency

52

SLIDE 53

Performance Results

30% Lower Throughput

53

SLIDE 54

Performance Results

LAYER PIPELINING

30% Lower 
 Throughput

54

SLIDE 55

Performance Results

LAYER PIPELINING

THREE COPIES 30% Lower 
 Throughput

55

SLIDE 56

Performance Results

56


SLIDE 58

Optimizing the mapping

58

SLIDE 59

Compute Characteristics

59

SLIDE 60

Conclusions

  • 650+ MHz operation for FPGA ML accelerator tailored for Xilinx UltraScale+
  • 7x better latency, 30% poorer throughput vs. Xilinx SuperTile
  • Hard interconnect cascades save us 30% registers + 12% on clock period vs. fabric interconnect

60

SLIDE 61

Discussion

  • URAM Bandwidth balance — Matrix-Multiplication performance suffers due to missing bandwidth from URAM —> Give us 144b ports vs 72b ports!
  • Dynamic Control — Can build unified Conv + MM blocks if data flow in cascades were even more programmable —> Give us more control!
  • Abandon Versal — Versal architecture is not an FPGA. Improve DSPs, BRAMs, URAMs + hard interconnect instead —> Stay true to your roots, Xilinx!

61


SLIDE 63

Communication Requirements

  • Matrix Multiplication

[Figure: matrix rows multiplied by Vectors i..i+k, products accumulated into a Partial result vector]

63


SLIDE 65

Communication Requirements

  • Matrix Multiplication

[Figure: same matrix-vector pattern, annotated with (1) accumulation streaming, (2) vector fanout, (3) matrix initialization]

65

SLIDE 66

Communication Pattern

[Figure: 3x3 grid of multiply-add (x+) units processing Vectors i..i+k against the Matrix; P cascade for summation]

66

SLIDE 67

Communication Pattern

[Figure: same grid — (1) accumulation stream]

67

SLIDE 68

Communication Pattern

[Figure: same grid — (1) accumulation stream into the Partial Result]

68

SLIDE 69

Communication Pattern

[Figure: same grid — (1) accumulation stream, (2) vector fanout]

69

SLIDE 70

Communication Pattern

[Figure: same grid — (1) accumulation stream, (2) vector fanout, (3) matrix initialization]

70

SLIDE 71

Matrix-Matrix Multiplication

[Diagram: 3x3 array of DSP48 multiply-add (x +) units]

71

SLIDE 72

Matrix-Matrix Multiplication

[Diagram: DSP48 array with one RAMB18 per DSP48]

72

SLIDE 73

Matrix-Matrix Multiplication

[Diagram: same array — vector fanout through the RAMB18s]

73

SLIDE 74

Matrix-Matrix Multiplication

[Diagram: adds URAM288 holding the matrix]

74

SLIDE 75

Matrix-Matrix Multiplication

[Diagram: same arrangement — matrix reads from URAM288]

75

SLIDE 76

Matrix-Matrix Multiplication

[Diagram: URAM288 chained from previous URAM to next URAM]

76

SLIDE 77

Matrix-Matrix Multiplication

[Diagram: initial loading of the matrix through the URAM chain; RAMB18 accumulation (+)]

77

SLIDE 78

MLPerf Benchmarks

  • Caveat: Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
  • Different ML workloads
  • Various domains
  • Different compute complexity
  • SoftMax not implemented in hardware

78

SLIDE 79

Why is SuperTile so good?

  • Base Design
  • High-frequency layout using DSP cascades
  • LUT RAMs for weights
  • Systolic data movement in fabric
  • Throughput boost
  • Decompose the systolic array into sub-arrays
  • Perform pipelining across CNN layers
  • Sacrifice some latency to significantly boost throughput!

79

SLIDE 80

Xilinx INT8 optimization

  • Soft-fracture a DSP48 27x18 multiplier to compute two 8x8 multiplications with a common operand
  • A*B and D*B computed together when A, B, D are 8-bit inputs

https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf https://www.xilinx.com/support/documentation/white_papers/wp487-int8-acceleration.pdf

80
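The packing trick can be demonstrated in a few lines: pre-shifting A by 18 bits puts A*B and D*B in disjoint fields of one wide product. Unsigned 8-bit is shown for clarity (the real DSP48 scheme also handles signed operands via sign-extension corrections; see the white papers above):

```python
def packed_int8_mul(a, d, b):
    """Compute a*b and d*b with one wide multiply, wp487-style (unsigned sketch)."""
    assert 0 <= a < 256 and 0 <= d < 256 and 0 <= b < 256
    packed = (a << 18) + d           # fits the 27-bit pre-adder output
    p = packed * b                   # single 27x18-style multiply
    ab = p >> 18                     # upper field: a*b
    db = p & ((1 << 18) - 1)         # lower 18 bits: d*b (max 16 bits, no overlap)
    return ab, db

print(packed_int8_mul(200, 100, 50))   # (10000, 5000)
```

The 18-bit spacing leaves 2 guard bits above each 16-bit product, which is the headroom the next slide's length-7 accumulation chains rely on.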

SLIDE 81

Xilinx INT8 optimization

  • Create DSP cascades of length 7 to accumulate multiple product terms
  • Enough precision headroom to avoid overflow for this length of the chain
  • Some fabric operation needed for final accumulations, or if 3x3 convolution support is required

https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf https://www.xilinx.com/support/documentation/white_papers/wp487-int8-acceleration.pdf

81

SLIDE 82

Xilinx DSP48 Systolic Mode

https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf

82