DeCO: A DSP Block Based FPGA Accelerator Overlay With Low Overhead Interconnect
Abhishek Kumar Jain, Xiangwei Li, Pranjul Singhai, Douglas L. Maskell
School of Computer Science and Engineering, Nanyang Technological
FPGAs in Heterogeneous Computing Platforms
- Xilinx: FPGAs coupled with ARM (Zynq UltraScale MPSoC)
  – 3500 DSP blocks in the largest device
  – Peak performance of 5200 Giga-Operations Per Second (GOPS)
- Intel: FPGAs coupled with Xeon
  – 1500 floating point DSP blocks in the largest device
  – Peak performance of 1300 GFLOPS
Are FPGAs really ready for the mainstream?
- No, mainly due to poor design productivity issues
  – Accelerator design at RTL level -> hardware design expertise
  – Long compilation times of RTL design
Coarse-grained FPGA overlays
- Array of coarse-grained tiles
- Programmable functional unit and interconnect resources
- Benefits:
  – Accelerator design at a higher level of abstraction
  – Fast compilation
  – Fast reconfiguration
  – Improved design productivity
- The major issue is the area and performance overheads
Coarse-grained FPGA overlays
- Two metrics
  – Interconnect area overhead in terms of LUTs/FU
  – Peak performance in terms of GOPS

Overlay                              Interconnect area overhead   Peak performance
DSP-DySER [HEART2015]                1360 LUTs/FU                 6.3 GOPS
DSP-based island-style [FCCM2015]    437 LUTs/FU                  65 GOPS

- FCCM 2015 overlay compared to DSP-DySER:
  – 3x better (in area overhead)
  – 10x better (in peak throughput)
Issues
- Can we improve further?
  – On Zynq, an array of 220 DSP blocks can provide 264 GOPS
  – Can we reduce interconnect area overhead further to achieve a higher peak performance out of DSP blocks?

[Figure: two bar charts comparing DySER, DSP-based island-style, and DeCO. Interconnect Area Overhead (LUTs/FU): 1360, 437, 68. Peak Performance on Zynq (GOPS): 6.3, 65, 264.]
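The improvement factors between the three overlays can be checked with a few lines of arithmetic. The sketch below uses only the chart values (LUTs/FU and GOPS); the dictionary layout and the `improvement` helper are illustrative choices, not part of the original slides.

```python
# Chart values from the slides: (interconnect LUTs per FU, peak GOPS on Zynq).
OVERLAYS = {
    "DySER": (1360, 6.3),
    "DSP-based island-style": (437, 65),
    "DeCO": (68, 264),
}


def improvement(baseline: str, candidate: str) -> tuple[float, float]:
    """Return (area overhead reduction, peak throughput gain) as ratios."""
    base_area, base_gops = OVERLAYS[baseline]
    cand_area, cand_gops = OVERLAYS[candidate]
    return base_area / cand_area, cand_gops / base_gops


if __name__ == "__main__":
    pairs = [
        ("DySER", "DSP-based island-style"),
        ("DSP-based island-style", "DeCO"),
        ("DySER", "DeCO"),
    ]
    for base, cand in pairs:
        area, gops = improvement(base, cand)
        print(f"{cand} vs {base}: {area:.1f}x lower area overhead, "
              f"{gops:.1f}x higher peak throughput")
```

The first pair reproduces the roughly 3x and 10x figures quoted for the island-style overlay over DySER; the other two pairs show how much further DeCO moves both metrics.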
Approach
- Island-style interconnect allows communication between any FU and any other FU
- Not required for feed-forward compute kernels

[Figure: example data flow graph with inputs I0 to I15 and output O0, built from SUB, SQR, SQRADD, and ADD nodes arranged in five stages (STAGE-1 to STAGE-5).]
[Figure: overlay structure. Data inputs enter Stage-1 and data outputs leave Stage-N; each stage consists of a programmable routing network followed by DFs and FUs.]
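Because a feed-forward kernel only passes data from earlier nodes to later ones, every operation can be assigned to a stage, and routing is then only needed between successive stages. A minimal sketch of that levelling step follows, assuming the DFG is given as a mapping from each node to its predecessors. The node names, the small example graph, and the `assign_stages` and `forwarding_hops` helpers are hypothetical and are not taken from the DeCO tool flow.

```python
def assign_stages(preds: dict[str, list[str]]) -> dict[str, int]:
    """ASAP levelling: nodes without predecessors (inputs) get stage 0;
    every other node gets 1 + the highest stage among its predecessors."""
    stages: dict[str, int] = {}

    def stage_of(node: str, active: frozenset = frozenset()) -> int:
        if node in active:  # node is already on the current path
            raise ValueError(f"cycle through {node!r}: graph is not feed-forward")
        if node not in stages:
            sources = preds.get(node, [])
            stages[node] = 1 + max(
                (stage_of(src, active | {node}) for src in sources), default=-1)
        return stages[node]

    for node in preds:
        stage_of(node)
    return stages


def forwarding_hops(preds: dict[str, list[str]],
                    stages: dict[str, int]) -> dict[tuple[str, str], int]:
    """Edges that skip stages, with the number of intermediate stages
    the value has to be carried through."""
    return {
        (src, dst): stages[dst] - stages[src] - 1
        for dst, sources in preds.items()
        for src in sources
        if stages[dst] - stages[src] > 1
    }


# Hypothetical kernel fragment: (I0 - I1)^2 + (I2 - I3)^2
EXAMPLE_DFG = {
    "sub0": ["I0", "I1"],
    "sub1": ["I2", "I3"],
    "sqr0": ["sub0"],
    "sqradd0": ["sqr0", "sub1"],
    "O0": ["sqradd0"],
}

if __name__ == "__main__":
    stages = assign_stages(EXAMPLE_DFG)
    print(stages)
    print(forwarding_hops(EXAMPLE_DFG, stages))
```

In this example `sub1` sits in stage 1 while its consumer `sqradd0` is in stage 3, so that value must be carried through one intermediate stage. This may be the situation the DF elements in the figure are meant to handle, though the slides do not spell that out.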
Kernel Set Characteristics

Before transformation:

Kernels   I/O nodes   OP nodes   DFG depth
fft       6/4         10         3
kmeans    16/1        23         9
mm        16/1        15         8
spmv      16/2        14         4
mri       11/2        11         6
stencil   15/2        14         5
Designed Overlay

[Figure: overlay built from clusters, each with a programmable routing network, data forwarding (DF) links, and a DSP tile. Each FU is a DSP48E1 primitive (pre-adder, multiplier, B and C registers, M register, X/Y/Z multiplexers, output P) with 16-bit A, B, C, and D data inputs, INMODE, OPMODE, and ALUMODE control inputs, and MUXSEL-controlled input selection.]
Comparison of Overlays for the kernel set
- Prototyped DeCO and two other overlays for the kernel set
  – 5x5 DSP-based DySER overlay (Overlay-I)
  – 5x5 DSP block based island-style overlay (Overlay-II)
- Significant savings in LUT requirements
  – 96% compared to Overlay-I
  – 87% compared to Overlay-II

[Figure: Resource Consumption of Overlays. Bars for LUTs, FFs, and DSP blocks, comparing Overlay-I, Overlay-II, and DeCO.]
Mapping Kernels onto DeCO
- FU utilization of up to 95%
- Can replicate small kernels and map them
- Multiple cones can be used to map large kernels

Kernels     Required No. of Cones   % FU Utilization   Achievable GOPS
fft         1                       40%                3.95
kmeans      1                       95%                9.08
mm          1                       75%                5.92
spmv        1                       70%                5.53
mri         1                       75%                4.34
stencil     1                       80%                5.53
gradient    0.5                     90%                4.34
chebyshev   0.5                     40%                5.53
bicg        3                       50%                11.85
trmm        4.5                     60%                21.33
syrk        4.5                     80%                28.44
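For the six kernels whose operation counts are listed under Kernel Set Characteristics, the achievable GOPS values are consistent with one operation per OP node per cycle at a clock of about 395 MHz. That clock is inferred from the two tables and is not stated on the slides; truncating to two decimals is likewise an assumption, chosen because it reproduces the table. A short check, with illustrative names:

```python
# OP node counts (kernel characteristics table, before transformation) and
# achievable GOPS (mapping table), both copied from the slides.
OP_NODES = {"fft": 10, "kmeans": 23, "mm": 15,
            "spmv": 14, "mri": 11, "stencil": 14}
REPORTED_GOPS = {"fft": 3.95, "kmeans": 9.08, "mm": 5.92,
                 "spmv": 5.53, "mri": 4.34, "stencil": 5.53}

ASSUMED_CLOCK_MHZ = 395  # inferred from the tables, NOT stated in the source


def achievable_gops(op_nodes: int, clock_mhz: int = ASSUMED_CLOCK_MHZ) -> float:
    """One op per node per cycle: op_nodes * MHz gives mega-ops per second.
    Integer arithmetic truncates the result to hundredths of a GOPS."""
    mega_ops_per_second = op_nodes * clock_mhz
    return (mega_ops_per_second // 10) / 100


if __name__ == "__main__":
    for kernel, ops in OP_NODES.items():
        print(f"{kernel:8s} model={achievable_gops(ops):5.2f} "
              f"reported={REPORTED_GOPS[kernel]:5.2f}")
```

If the model holds, the achievable throughput of a mapped kernel depends only on how many operations it keeps busy each cycle, which is why kmeans, with the most OP nodes, also reports the highest single-cone GOPS.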
Comparison to HLS
- Compare DeCO with Vivado HLS implementations (implemented in PR region)
  – For the kernel set, HLS required 1 DSP and 3 CLB tiles
  – DeCO requires 2 DSP and 6 CLB tiles
- DeCO: 2x area penalty

[Figure: PR region containing DeCO.]

                          Vivado-HLS    DeCO
Configuration data size   49000 bytes   53.5 bytes
Configuration time        382 us        2 us (190x faster)
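The area and reconfiguration comparison reduces to a handful of ratios. The sketch below recomputes them from the numbers on the slide; the constant and function names are illustrative.

```python
# Values copied from the HLS comparison slide.
HLS_TILES = {"DSP": 1, "CLB": 3}
DECO_TILES = {"DSP": 2, "CLB": 6}
HLS_CONFIG_BYTES = 49000
DECO_CONFIG_BYTES = 53.5
HLS_CONFIG_TIME_US = 382
DECO_CONFIG_TIME_US = 2


def comparison_ratios() -> dict[str, float]:
    """Ratios of the HLS/PR implementation against DeCO."""
    return {
        # Area: how many times more tiles DeCO occupies.
        "dsp_tile_penalty": DECO_TILES["DSP"] / HLS_TILES["DSP"],
        "clb_tile_penalty": DECO_TILES["CLB"] / HLS_TILES["CLB"],
        # Reconfiguration: how many times smaller/faster DeCO is.
        "config_size_reduction": HLS_CONFIG_BYTES / DECO_CONFIG_BYTES,
        "config_time_speedup": HLS_CONFIG_TIME_US / DECO_CONFIG_TIME_US,
    }


if __name__ == "__main__":
    for name, value in comparison_ratios().items():
        print(f"{name}: {value:.1f}x")
```

The time ratio comes out at 191, in line with the 190x quoted on the slide, and the configuration data is smaller by roughly three orders of magnitude.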
Mapping to Zynq

[Figure: Zynq device layout showing the PR region and DeCO.]
Conclusion
- Presented DeCO, an overlay with
  – Lower interconnect area overhead
  – Higher peak performance
- Compared to HLS generated PR-based implementation
  – 2 times area penalty but 200 times faster reconfiguration

[Figure: two bar charts comparing DySER, DSP-based island-style, and DeCO. Interconnect Area Overhead (LUTs/FU): 1360, 437, 68. Peak Performance on Zynq (GOPS): 6.3, 65, 264.]
Future Work
- Releasing DeCO as a programmable accelerator within Zynq
- OpenCL compiler for DeCO
- Fast context switching under control of OS/Hypervisor

Demo of compiler for island-style overlays tonight!

Back-up Slides
DSP based overlays (Island-style network)
- Fully Pipelined DSP Block based Island-style Overlay [1]
  – An 8×8 array of FUs (64 DSPs) on Xilinx Zynq
  – 28K LUTs required to implement interconnect
  – Fmax = 338 MHz, peak throughput of 65 GOPS
  – Interconnect area overhead: 437 LUTs/FU
- Another Overlay: DySER [2]
  – A 6×6 array of FUs (36 DSPs) on Xilinx Zynq
  – 48K LUTs required to implement interconnect
  – Fmax = 175 MHz, peak throughput of 6.3 GOPS
  – Interconnect area overhead: 1360 LUTs/FU
- Proposed design was 10x better (in peak throughput) and 3x better (in area overhead) than DySER

[1] A. K. Jain, S. A. Fahmy, and D. L. Maskell, “Efficient Overlay architecture based on DSP blocks,” in FCCM, 2015.
[2] A. K. Jain, X. Li, S. A. Fahmy, and D. L. Maskell, “Adapting the DySER architecture with DSP blocks as an Overlay for the Xilinx Zynq,” in HEART, 2015.
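The peak throughput and area overhead figures for these two overlays follow from the array size, the clock, and the interconnect LUT count. The sketch below assumes that each DSP-block FU in the island-style overlay performs up to three operations per cycle and each DySER FU one. These per-FU operation counts are inferred because they reproduce the quoted GOPS; they are not stated on the slide, and the function names are illustrative.

```python
def peak_gops(num_fus: int, fmax_mhz: float, ops_per_fu: int) -> float:
    """Peak throughput if every FU issues ops_per_fu operations each cycle."""
    return num_fus * fmax_mhz * ops_per_fu / 1000.0


def luts_per_fu(interconnect_luts: int, num_fus: int) -> float:
    """Interconnect area overhead, averaged over the FUs in the array."""
    return interconnect_luts / num_fus


# ops_per_fu values are ASSUMPTIONS inferred from the reported GOPS.
ISLAND_STYLE = {"num_fus": 64, "fmax_mhz": 338, "ops_per_fu": 3}
DYSER = {"num_fus": 36, "fmax_mhz": 175, "ops_per_fu": 1}

if __name__ == "__main__":
    print(f"island-style: {peak_gops(**ISLAND_STYLE):.1f} GOPS, "
          f"{luts_per_fu(28_000, 64):.1f} LUTs/FU")
    print(f"DySER:        {peak_gops(**DYSER):.1f} GOPS, "
          f"{luts_per_fu(48_000, 36):.1f} LUTs/FU")
```

The island-style numbers land on about 65 GOPS and 437.5 LUTs/FU. For DySER the throughput matches 6.3 GOPS, while 48K LUTs over 36 FUs gives about 1333 LUTs/FU rather than 1360, which suggests the 48K figure is rounded. Under the same three-operations assumption, 264 GOPS from 220 DSP blocks would correspond to a 400 MHz clock.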
Comparison to HLS
- Compare DeCO with a direct PR-based FPGA implementation using Vivado HLS.

[Figure: PR region layout showing DSP tiles and CLB tiles.]
Programmability Overhead Modeling
- As a function of DSP and DF requirements in each tile
- Programmability Overhead (PO)
  – 6.25 LUTs/bit per FU for (a)
  – 4.75 LUTs/bit per FU for (b)
  – 4.30 LUTs/bit per FU for (c)
- Choose (c) as candidate DFG for overlay design process
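The programmability overhead is quoted per bit, so turning it into a per-FU LUT count needs a datapath width. A minimal sketch follows, assuming a 16-bit datapath. That width is suggested by the 16-bit port labels in the DSP48E1 figure but is not stated alongside these numbers, and the names below are illustrative.

```python
# Programmability overhead (LUTs/bit per FU) for the three options on the slide.
PO_LUTS_PER_BIT = {"a": 6.25, "b": 4.75, "c": 4.30}

ASSUMED_DATAPATH_BITS = 16  # assumption: not stated with the PO figures


def po_luts_per_fu(option: str, bits: int = ASSUMED_DATAPATH_BITS) -> float:
    """Per-FU LUT overhead for one option at the given datapath width."""
    return PO_LUTS_PER_BIT[option] * bits


def lowest_overhead_option() -> str:
    """Option with the smallest per-bit overhead."""
    return min(PO_LUTS_PER_BIT, key=PO_LUTS_PER_BIT.get)


if __name__ == "__main__":
    for option in PO_LUTS_PER_BIT:
        print(f"({option}) {po_luts_per_fu(option):.1f} LUTs/FU")
    print("lowest overhead:", lowest_overhead_option())
```

Under this assumption option (c) works out to 68.8 LUTs per FU, close to the 68 LUTs/FU reported for DeCO, although the slides do not state that the two numbers are related this way.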
Kernel Set Characteristics
- Pe