DeCO: A DSP Block Based FPGA Accelerator Overlay With Low Overhead Interconnect



slide-1
SLIDE 1

DeCO: A DSP Block Based FPGA Accelerator Overlay With Low Overhead Interconnect

Abhishek Kumar Jain, Xiangwei Li, Pranjul Singhai, Douglas L. Maskell
School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore

Suhaib A. Fahmy
School of Engineering, University of Warwick, UK

International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2nd May 2016, Washington DC, USA

slide-2
SLIDE 2

FPGAs in Heterogeneous Computing Platforms

  • Xilinx: FPGAs coupled with ARM (Zynq UltraScale MPSoC)
    – 3500 DSP blocks in the largest device
    – Peak performance of 5200 Giga-Operations Per Second (GOPS)
  • Intel: FPGAs coupled with Xeon
    – 1500 floating point DSP blocks in the largest device
    – Peak performance of 1300 GFLOPS

slide-3
SLIDE 3

Are FPGAs really ready for the mainstream?

  • No, mainly due to poor design productivity
    – Accelerator design at RTL level requires hardware design expertise

slide-4
SLIDE 4

Are FPGAs really ready for the mainstream?

  • No, mainly due to poor design productivity
    – Accelerator design at RTL level requires hardware design expertise
    – Long compilation times of RTL designs

slide-5
SLIDE 5

Coarse grained FPGA overlays

  • Array of coarse-grained tiles
  • Programmable functional unit and interconnect resources

slide-6
SLIDE 6

Coarse grained FPGA overlays

  • Array of coarse-grained tiles
  • Programmable functional unit and interconnect resources
  • Benefits:
    – Accelerator design at a higher level of abstraction
    – Fast compilation
    – Fast reconfiguration
    – Improved design productivity

slide-7
SLIDE 7

Coarse grained FPGA overlays

  • Array of coarse-grained tiles
  • Programmable functional unit and interconnect resources
  • Benefits:
    – Accelerator design at a higher level of abstraction
    – Fast compilation
    – Fast reconfiguration
    – Improved design productivity
  • The major ISSUE is the area and performance overheads

slide-8
SLIDE 8

Coarse grained FPGA overlays

  • Two metrics
    – Interconnect area overhead in terms of LUTs/FU
    – Peak performance in terms of GOPS

Overlay                              Interconnect area overhead   Peak performance
DSP-DySER [HEART2015]                1360 LUTs/FU                 6.3 GOPS
DSP-based island-style [FCCM2015]    437 LUTs/FU                  65 GOPS

FCCM 2015 result: the DSP-based island-style overlay is 3x better (in area overhead) and 10x better (in peak throughput) than DSP-DySER

slide-9
SLIDE 9

Issues

  • Can we improve further?
    – On Zynq, an array of 220 DSP blocks can provide 264 GOPS

slide-10
SLIDE 10

Issues

  • Can we improve further?
    – On Zynq, an array of 220 DSP blocks can provide 264 GOPS
    – Can we reduce interconnect area overhead further to achieve a higher peak performance out of DSP blocks?

[Chart] Interconnect Area Overhead (LUTs/FU): DySER 1360, DSP-based island-style 437
[Chart] Peak Performance on Zynq (GOPS): DySER 6.3, DSP-based island-style 65

slide-11
SLIDE 11

Issues

  • Can we improve further?
    – On Zynq, an array of 220 DSP blocks can provide 264 GOPS
    – Can we reduce interconnect area overhead further to achieve a higher peak performance out of DSP blocks?

[Chart] Interconnect Area Overhead (LUTs/FU): DySER 1360, DSP-based island-style 437, DeCO 68
[Chart] Peak Performance on Zynq (GOPS): DySER 6.3, DSP-based island-style 65, DeCO 264
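
The gaps in the two charts can be read as ratios. The short sketch below only divides the chart values; the variable names are illustrative and not from the talk. The peak-performance figures come from arrays of different sizes, so the performance ratios compare whole overlays on Zynq, not individual FUs.

```python
# Values copied from the two bar charts on this slide.
area_luts_per_fu = {"DySER": 1360, "DSP-based island-style": 437, "DeCO": 68}
peak_gops = {"DySER": 6.3, "DSP-based island-style": 65, "DeCO": 264}

# Lower is better for area, so divide the other overlay by DeCO.
area_gain = {name: luts / area_luts_per_fu["DeCO"]
             for name, luts in area_luts_per_fu.items() if name != "DeCO"}

# Higher is better for throughput, so divide DeCO by the other overlay.
perf_gain = {name: peak_gops["DeCO"] / gops
             for name, gops in peak_gops.items() if name != "DeCO"}

for name in area_gain:
    print(f"DeCO vs {name}: {area_gain[name]:.1f}x less interconnect area per FU, "
          f"{perf_gain[name]:.1f}x higher peak GOPS")
```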

slide-12
SLIDE 12

Approach

  • Island-style interconnect allows communication between any FU and any other FU
  • Not required for feed-forward compute kernels

[Figure] Example feed-forward data flow graph: inputs I0 to I15, eight SUB nodes, four SQR nodes, four SQRADD nodes, three ADD nodes and output O0, arranged in STAGE-1 to STAGE-5

slide-13
SLIDE 13

Approach

  • Island-style interconnect allows communication between any FU and any other FU
  • Not required for feed-forward compute kernels

[Figure] Example feed-forward data flow graph: inputs I0 to I15, eight SUB nodes, four SQR nodes, four SQRADD nodes, three ADD nodes and output O0, arranged in STAGE-1 to STAGE-5
[Figure] Proposed structure: data inputs, then Stage-1, Stage-2, ..., Stage-N, then data outputs; each stage holds FUs and DFs, and stages are connected through programmable routing networks
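
To make the feed-forward idea concrete, here is a minimal sketch that assigns each node of a data flow graph to a stage (inputs at level 0, every node one level after its latest operand). The wiring is an assumption reconstructed from the node labels in the figure (8 SUB, 4 SQR, 4 SQRADD, 3 ADD), read as a sum of squared differences; the function names are hypothetical. It also counts values that must skip a stage, on the assumption that this is what the DF elements are for.

```python
from collections import Counter

def build_distance_dfg(dims=8):
    """Return {node: (op, [operands])} for sum((a_i - b_i)**2).

    Assumed wiring; expects dims // 2 to be a power of two.
    """
    dfg = {}
    for i in range(dims):
        dfg[f"sub{i}"] = ("SUB", [f"I{2 * i}", f"I{2 * i + 1}"])
    partial = []
    for j in range(dims // 2):
        dfg[f"sqr{j}"] = ("SQR", [f"sub{2 * j}"])
        # SQRADD squares one difference and adds an already squared one.
        dfg[f"sqradd{j}"] = ("SQRADD", [f"sub{2 * j + 1}", f"sqr{j}"])
        partial.append(f"sqradd{j}")
    k = 0
    while len(partial) > 1:  # pairwise ADD reduction tree
        nxt = []
        for a, b in zip(partial[0::2], partial[1::2]):
            dfg[f"add{k}"] = ("ADD", [a, b])
            nxt.append(f"add{k}")
            k += 1
        partial = nxt
    return dfg

def assign_stages(dfg):
    """Primary inputs are level 0; a node sits one level after its latest operand."""
    memo = {}
    def level(node):
        if node not in dfg:  # primary input
            return 0
        if node not in memo:
            memo[node] = 1 + max(level(p) for p in dfg[node][1])
        return memo[node]
    return {node: level(node) for node in dfg}

def forwarding_per_stage(dfg, stage):
    """Count values that pass through a stage without being consumed there."""
    df = Counter()
    for node, (_, operands) in dfg.items():
        for p in operands:
            for s in range(stage.get(p, 0) + 1, stage[node]):
                df[s] += 1
    return dict(df)

dfg = build_distance_dfg()
stage = assign_stages(dfg)
depth = max(stage.values())
forwarded = forwarding_per_stage(dfg, stage)
print(len(dfg), "op nodes,", depth, "stages, forwarded values per stage:", forwarded)
```

Every edge in this graph runs from an earlier stage to a later one, so a stage-to-stage network is enough to carry it.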

slide-14
SLIDE 14

Kernel Set Characteristics

Before transformation:

Kernel    I/O nodes   OP nodes   DFG depth
fft       6/4         10         3
kmeans    16/1        23         9
mm        16/1        15         8
spmv      16/2        14         4
mri       11/2        11         6
stencil   15/2        14         5

slide-15
SLIDE 15

Designed Overlay

[Figure] Labels: Cluster, Programmable Routing Network, Data Forwarding (DF) Link

slide-16
SLIDE 16

Designed Overlay

[Figure] Labels: DSP Tile, Cluster, Programmable Routing Network, Data Forwarding (DF) Link
[Figure] Functional unit built around a DSP48E1 primitive (drawn four times). Labels: ports A, B, C, D and P; B Register, Pre-Adder, MUL, C, M; multiplexers X, Y, Z; control signals INMODE, OPMODE, ALUMODE and MUXSEL; 16-bit data paths

slide-17
SLIDE 17

Comparison of Overlays for the kernel set

  • Prototyped DeCO and two other overlays for the kernel set
    – 5x5 DSP-based DySER overlay (Overlay-I)
    – 5x5 DSP block based island-style overlay (Overlay-II)

slide-18
SLIDE 18

Comparison of Overlays for the kernel set

  • Prototyped DeCO and two other overlays for the kernel set
    – 5x5 DSP-based DySER overlay (Overlay-I)
    – 5x5 DSP block based island-style overlay (Overlay-II)
  • Significant savings in LUT requirements
    – 96% compared to Overlay-I
    – 87% compared to Overlay-II

[Chart] Resource Consumption of Overlays: LUTs, FFs and DSP Blocks for Overlay-I, Overlay-II and DeCO

slide-19
SLIDE 19

Mapping Kernels onto DeCO

  • FU utilization of up to 95%

Kernel    Required No. of Cones   % FU Utilization   Achievable GOPS
fft       1                       40%                3.95
kmeans    1                       95%                9.08
mm        1                       75%                5.92
spmv      1                       70%                5.53
mri       1                       75%                4.34
stencil   1                       80%                5.53

slide-20
SLIDE 20

Mapping Kernels onto DeCO

  • FU utilization of up to 95%
  • Small kernels can be replicated and mapped

Kernel      Required No. of Cones   % FU Utilization   Achievable GOPS
fft         1                       40%                3.95
kmeans      1                       95%                9.08
mm          1                       75%                5.92
spmv        1                       70%                5.53
mri         1                       75%                4.34
stencil     1                       80%                5.53
gradient    0.5                     90%                4.34
chebyshev   0.5                     40%                5.53

slide-21
SLIDE 21

Mapping Kernels onto DeCO

  • FU utilization of up to 95%
  • Small kernels can be replicated and mapped
  • Multiple cones can be used to map large kernels

Kernel      Required No. of Cones   % FU Utilization   Achievable GOPS
fft         1                       40%                3.95
kmeans      1                       95%                9.08
mm          1                       75%                5.92
spmv        1                       70%                5.53
mri         1                       75%                4.34
stencil     1                       80%                5.53
gradient    0.5                     90%                4.34
chebyshev   0.5                     40%                5.53
bicg        3                       50%                11.85
trmm        4.5                     60%                21.33
syrk        4.5                     80%                28.44
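
One way to read the Achievable GOPS column, offered as an observation rather than something stated in the talk: for the six kernels in the Kernel Set Characteristics table, dividing the achievable GOPS by the OP-node count before transformation gives almost the same value every time. The sketch below just performs that division; interpreting the constant as a clock of roughly 395 MHz with one result per operation per cycle is an assumption.

```python
# OP nodes before transformation (Kernel Set Characteristics slide).
op_nodes = {"fft": 10, "kmeans": 23, "mm": 15, "spmv": 14, "mri": 11, "stencil": 14}

# Achievable GOPS (Mapping Kernels onto DeCO slide).
achievable_gops = {"fft": 3.95, "kmeans": 9.08, "mm": 5.92,
                   "spmv": 5.53, "mri": 4.34, "stencil": 5.53}

gops_per_op = {k: achievable_gops[k] / op_nodes[k] for k in op_nodes}

for kernel, ratio in gops_per_op.items():
    print(f"{kernel:8s} {ratio:.3f} GOPS per OP node")
```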

slide-22
SLIDE 22

Comparison to HLS

  • Compare DeCO with Vivado HLS implementations (implemented in a partially reconfigurable (PR) region)
    – For the kernel set, HLS required 1 DSP and 3 CLB tiles

[Figure] PR Region, with DSP Tile and CLB Tile labels

slide-23
SLIDE 23

Comparison to HLS

  • Compare DeCO with Vivado HLS implementations (implemented in a partially reconfigurable (PR) region)
    – For the kernel set, HLS required 1 DSP and 3 CLB tiles
    – DeCO requires 2 DSP and 6 CLB tiles
  • DeCO: 2x area penalty

[Figure] PR Region

slide-24
SLIDE 24

Comparison to HLS

  • Compare DeCO with Vivado HLS implementations (implemented in a partially reconfigurable (PR) region)
    – For the kernel set, HLS required 1 DSP and 3 CLB tiles
    – DeCO requires 2 DSP and 6 CLB tiles

[Figure] PR Region, DeCO

                          Vivado-HLS     DeCO
Configuration data size   49000 bytes    53.5 bytes
Configuration time        382 us         2 us (190x faster)
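
The trade-off on this slide reduces to a few ratios. The sketch below uses only the numbers in the table and the tile counts above; the variable names are illustrative.

```python
# Figures from the slide: Vivado HLS design in the PR region vs DeCO.
hls = {"dsp_tiles": 1, "clb_tiles": 3, "config_bytes": 49000, "config_us": 382}
deco = {"dsp_tiles": 2, "clb_tiles": 6, "config_bytes": 53.5, "config_us": 2}

area_penalty = deco["clb_tiles"] / hls["clb_tiles"]      # same ratio as for DSP tiles
size_ratio = hls["config_bytes"] / deco["config_bytes"]  # how much less data DeCO loads
time_ratio = hls["config_us"] / deco["config_us"]        # how much faster DeCO reconfigures

print(f"area penalty: {area_penalty:.0f}x")
print(f"configuration data: {size_ratio:.0f}x smaller")
print(f"configuration time: {time_ratio:.0f}x faster")
```

The time ratio comes out at 191, which the slide rounds to 190x.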

slide-25
SLIDE 25

Mapping to Zynq

[Figure] Labels: PR Region, DeCO

slide-26
SLIDE 26


slide-27
SLIDE 27


slide-28
SLIDE 28

Conclusion

  • Presented DeCO, an overlay with
    – Lower interconnect area overhead
    – Higher peak performance

[Chart] Interconnect Area Overhead (LUTs/FU): DySER 1360, DSP-based island-style 437, DeCO 68
[Chart] Peak Performance on Zynq (GOPS): DySER 6.3, DSP-based island-style 65, DeCO 264

slide-29
SLIDE 29

Conclusion

  • Presented DeCO, an overlay with
    – Lower interconnect area overhead
    – Higher peak performance
  • Compared to HLS generated PR-based implementation
    – 2 times area penalty but 200 times faster reconfiguration

[Chart] Interconnect Area Overhead (LUTs/FU): DySER 1360, DSP-based island-style 437, DeCO 68
[Chart] Peak Performance on Zynq (GOPS): DySER 6.3, DSP-based island-style 65, DeCO 264

slide-30
SLIDE 30

Future Work

  • Releasing DeCO as a programmable accelerator within Zynq

slide-31
SLIDE 31

Future Work

  • Releasing DeCO as a programmable accelerator within Zynq
  • OpenCL compiler for DeCO

slide-32
SLIDE 32

Future Work

  • Releasing DeCO as a programmable accelerator within Zynq
  • OpenCL compiler for DeCO
  • Fast context switching under control of OS/Hypervisor

slide-33
SLIDE 33

Future Work

  • Releasing DeCO as a programmable accelerator within Zynq
  • OpenCL compiler for DeCO
  • Fast context switching under control of OS/Hypervisor

Demo of compiler for island-style overlays tonight!

slide-35
SLIDE 35

Back-up Slides

slide-36
SLIDE 36

DSP based overlays (Island-style network)

  • Fully pipelined DSP block based island-style overlay [1]
    – An 8×8 array of FUs (64 DSPs) on Xilinx Zynq
    – 28K LUTs required to implement interconnect
    – Fmax = 338 MHz, peak throughput of 65 GOPS
    – Interconnect area overhead: 437 LUTs/FU
  • Another overlay: DySER [2]
    – A 6×6 array of FUs (36 DSPs) on Xilinx Zynq
    – 48K LUTs required to implement interconnect
    – Fmax = 175 MHz, peak throughput of 6.3 GOPS
    – Interconnect area overhead: 1360 LUTs/FU
  • Proposed design was 10x better (in peak throughput) and 3x better (in area overhead) than DySER
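
The peak-throughput figures above can be reproduced with a simple product of array size, operations per FU per cycle, and clock frequency. The operations-per-cycle values (3 for the DSP block based FU, 1 for the DySER FU) are assumptions chosen because they match the slide figures; the talk does not state them, and the function name is hypothetical.

```python
def peak_gops(num_fus, ops_per_fu_per_cycle, fmax_mhz):
    """Peak throughput in GOPS, assuming every FU is busy on every cycle."""
    return num_fus * ops_per_fu_per_cycle * fmax_mhz / 1000.0

# Array sizes and Fmax from the slide; ops per cycle are assumed (see above).
island_style = peak_gops(num_fus=64, ops_per_fu_per_cycle=3, fmax_mhz=338)
dyser = peak_gops(num_fus=36, ops_per_fu_per_cycle=1, fmax_mhz=175)

# Under the same assumption, the clock implied by 220 DSP blocks giving 264 GOPS.
implied_mhz = 264 * 1000.0 / (220 * 3)

print(f"island-style: {island_style:.1f} GOPS, DySER: {dyser:.1f} GOPS")
print(f"clock implied by 264 GOPS on 220 DSP blocks: {implied_mhz:.0f} MHz")
```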

[1] A. K. Jain, S. A. Fahmy, and D. L. Maskell, “Efficient Overlay architecture based on DSP blocks,” in FCCM, 2015.
[2] A. K. Jain, X. Li, S. A. Fahmy, and D. L. Maskell, “Adapting the DySER architecture with DSP blocks as an Overlay for the Xilinx Zynq,” in HEART, 2015.

slide-37
SLIDE 37

Comparison to HLS

  • Compare DeCO with a direct PR-based FPGA implementation using Vivado HLS.

[Figure] Labels: DSP Tile, CLB Tile

slide-38
SLIDE 38

Programmability Overhead Modeling

  • Modeled as a function of the DSP and DF requirements in each tile
  • Programmability Overhead (PO)
    – 6.25 LUTs/bit per FU for (a)
    – 4.75 LUTs/bit per FU for (b)
    – 4.30 LUTs/bit per FU for (c)
  • Choose (c) as candidate DFG for overlay design process
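
The selection step amounts to picking the candidate with the lowest PO. The sketch below does that and, assuming a 16-bit datapath (the width drawn on the DSP tile figure, not stated on this slide), scales the result to LUTs per FU. The product is roughly in line with the 68 LUTs/FU reported for DeCO, though the talk does not spell out that link.

```python
# Programmability overhead per candidate DFG, in LUTs/bit per FU (from the slide).
po_luts_per_bit = {"a": 6.25, "b": 4.75, "c": 4.30}

# Pick the candidate with the lowest overhead.
best = min(po_luts_per_bit, key=po_luts_per_bit.get)

DATAPATH_BITS = 16  # assumption: width shown on the DSP tile figure
luts_per_fu = po_luts_per_bit[best] * DATAPATH_BITS

print(f"candidate ({best}): {po_luts_per_bit[best]} LUTs/bit, "
      f"about {luts_per_fu:.1f} LUTs/FU at {DATAPATH_BITS} bits")
```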
slide-39
SLIDE 39

Kernel Set Characteristics

  • Perform transformation on each kernel to get the best candidate DFG

                      Before transformation    After transformation
Kernel    I/O nodes   OP nodes   DFG depth     OP nodes   DFG depth
fft       6/4         10         3             8          3
kmeans    16/1        23         9             19         5
mm        16/1        15         8             15         4
spmv      16/2        14         4             14         3
mri       11/2        11         6             9          5
stencil   15/2        14         5             8          3
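
To see what the transformation buys, the sketch below tabulates the change in OP nodes and DFG depth per kernel from the table above. Reading the largest transformed depth as a lower bound on the number of overlay stages is an inference, not a statement from the talk.

```python
# (OP nodes, DFG depth) before and after transformation, copied from the table.
before = {"fft": (10, 3), "kmeans": (23, 9), "mm": (15, 8),
          "spmv": (14, 4), "mri": (11, 6), "stencil": (14, 5)}
after = {"fft": (8, 3), "kmeans": (19, 5), "mm": (15, 4),
         "spmv": (14, 3), "mri": (9, 5), "stencil": (8, 3)}

# Per kernel: how many nodes and how many levels the transformation removes.
saved = {k: (before[k][0] - after[k][0], before[k][1] - after[k][1]) for k in before}

max_depth_after = max(depth for _, depth in after.values())
max_nodes_after = max(nodes for nodes, _ in after.values())

for kernel, (nodes, depth) in saved.items():
    print(f"{kernel:8s} {nodes} fewer OP nodes, {depth} fewer levels")
print(f"deepest transformed DFG: {max_depth_after} levels, "
      f"largest: {max_nodes_after} OP nodes")
```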