Hoplite-DSP Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs
Chethan Kumar H B and Nachiket Kapre nachiket@ieee.org
Hoplite: FPL 2015 paper, Jan Gray co-author
Specs:
— 60 LUTs+100 FFs — 2.9ns clock
available + RTL code
really want clean-slate hard NoCs?
NoC overheads
elements
torus
into the network + PE connection
buffering, no allocation, etc
from: Jan Gray
[Figure: LUT-based Hoplite router. Inputs W, N, PE; outputs E, S/PE; multiplexers built from 5-LUTs and 6-LUTs; DOR logic drives sel0 and sel1,2]
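The DOR logic in the figure can be sketched in a few lines. This is a minimal sketch, assuming a unidirectional torus where packets travel east until they reach the destination column, then south until they reach the destination row, then leave through the shared S/PE port. The names `Out`, `dor_output` and `hops` are hypothetical, and deflections are ignored.

```python
from enum import Enum


class Out(Enum):
    EAST = "E"
    SOUTH = "S"
    EXIT = "PE"   # delivery to the local PE (shares the S/PE port)


def dor_output(x, y, dst_x, dst_y):
    """Dimension-ordered routing: resolve the X offset first, then Y."""
    if x != dst_x:
        return Out.EAST
    if y != dst_y:
        return Out.SOUTH
    return Out.EXIT


def hops(src, dst, n):
    """Hop count on an n x n unidirectional torus, assuming no deflections."""
    (sx, sy), (dx, dy) = src, dst
    return (dx - sx) % n + (dy - sy) % n
```

Because the links are unidirectional, a destination one column to the west costs n - 1 hops east, which is why the modulo appears in `hops`.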
— implement packet multiplexers
address fields, valid signals
Xilinx DSP48 block
[Figure: Xilinx DSP48 block. Inputs A (30b), D (27b), B (18b), C (48b), PCIN (48b); X, Y, Z multiplexers feed the ALU; outputs P (48b), PCOUT; control inputs OPMODE, ALUMODE, INMODE]
computations => mainly arithmetic
OPMODE — 48b multiplexers between A:B, C
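A behavioural sketch of how the OPMODE-controlled multiplexers can act as a 48b packet mux. It assumes the ALU adds the X and Z multiplexer outputs and that one of them is set to zero, so the other input passes through unchanged. The select names (`"AB"`, `"C"`, `"PCIN"`, `"ZERO"`) are symbolic placeholders, not the real OPMODE bit encodings, which are listed in the Xilinx DSP48 user guide.

```python
MASK48 = (1 << 48) - 1


def concat_ab(a, b):
    """A:B concatenation: 30-bit A in the upper bits, 18-bit B in the lower bits."""
    return ((a & ((1 << 30) - 1)) << 18) | (b & ((1 << 18) - 1))


def dsp48_p(x_sel, z_sel, a, b, c, pcin=0):
    """Model P = X + Z (mod 2^48) with symbolic multiplexer selects."""
    x = {"ZERO": 0, "AB": concat_ab(a, b)}[x_sel]
    z = {"ZERO": 0, "C": c & MASK48, "PCIN": pcin & MASK48}[z_sel]
    return (x + z) & MASK48


# Select A:B by zeroing Z; select C by zeroing X.
packet_from_ab = dsp48_p("AB", "ZERO", a=1, b=2, c=99)
packet_from_c = dsp48_p("ZERO", "C", a=1, b=2, c=99)
```

Changing the select inputs every cycle is what turns the arithmetic block into a dynamically controlled multiplexer.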
[Figure: Hoplite router multiplexers mapped onto the DSP48 block; router ports WEST, PE, N, S/PE, EAST]
with S/PE output port shared)
— runs at 2x the frequency of the PEs
[Figure: DSP48 configured for the East output. Inputs: PE, West; output P: East; control: OPMODE, ALUMODE, INMODE, CE]
[Figure: DSP48 configured for the South/PE output. Inputs: PE, West, North; output P: South/PE; control: OPMODE, ALUMODE, INMODE, CE]
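The two configurations above can be read as two phases of one multi-pumped DSP. Below is a minimal sketch, assuming Hoplite-style arbitration (north traffic always continues south, a west packet that cannot turn south is deflected east, and the PE injects only into a free output). Packets are modelled as `(wanted_output, payload)` tuples, and the function names are hypothetical.

```python
def route(north, west, pe):
    """Return (east_out, south_out, pe_accepted) for one PE clock cycle."""
    south = north                      # north input always goes to S/PE
    east = None
    if west is not None:
        if west[0] == "S" and south is None:
            south = west               # turn from west to south succeeds
        else:
            east = west                # continue east, or deflect on conflict
    accepted = False
    if pe is not None:                 # PE has the lowest priority
        if pe[0] == "E" and east is None:
            east, accepted = pe, True
        elif pe[0] == "S" and south is None:
            south, accepted = pe, True
    return east, south, accepted


def multipumped_cycle(north, west, pe):
    """One PE clock = two DSP clocks; the same mux serves E, then S/PE."""
    east, south, accepted = route(north, west, pe)
    schedule = [("dsp_clk0", "E", east), ("dsp_clk1", "S/PE", south)]
    return schedule, accepted
```

For example, `multipumped_cycle(("S", "n"), ("S", "w"), None)` sends the north packet south in the second DSP cycle and deflects the west packet east in the first.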
[Figure: DSP column. DSP48E blocks chained PCOUT to PCIN over dedicated cascade routes; A:B, C and P reach user logic (DOR logic) over the programmable FPGA interconnect]
~100s of DSPs in a column, ~10s of columns
— use passthrough DSPs
[Figure: Hoplite routers in the fabric attached to DSP48E blocks in cascaded columns. Legend: Top-Turn DSPs (PCIN to P), Bottom-Turn DSPs (A:B to PCOUT), Pass-thru DSPs (PCOUT to PCIN), Router DSPs]
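The DSP roles in the legend can be strung together as one cascade segment. This is a rough sketch, assuming a packet enters the cascade from the fabric at a bottom-turn DSP, is forwarded by pass-thru DSPs, and returns to the fabric at a top-turn DSP, with one register stage per DSP. The one-register assumption and the function names are illustrative only, and router DSPs are left out.

```python
def cascade_segment(num_passthru):
    """Roles along one cascade segment, listed from entry to exit."""
    return ["bottom-turn"] + ["pass-thru"] * num_passthru + ["top-turn"]


def segment_latency(roles, regs_per_dsp=1):
    """Cycles to cross the segment if every DSP adds regs_per_dsp stages."""
    return len(roles) * regs_per_dsp


roles = cascade_segment(num_passthru=2)
cycles = segment_latency(roles)
```

Under these assumptions every extra pass-thru DSP adds one cycle, so the number of DSPs skipped between routers sets the link latency.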
8x8 NoC (ML605 board) 16x16 NoC (VC707 board)
— substantially fewer LUTs vs. DSP48s — Importantly, FFs absorbed into DSP48
for random traffic mostly identical
31
— substantially fewer LUTs vs. DSP48s — Importantly, FFs absorbed into DSP48
for random traffic mostly identical
— Hard router = 12.45 LABs — 1 Altera DSP block = 11.9 LABs Stratix-III — Hoplite-DSP marginally smaller
— Hard router ~996 MHz — Hoplite-DSP ~650 MHz (multi-pumped) — Hoplite-DSP limits freq advantage to 3x.
— Hard router ~1.58 W — Hoplite-DSP model ~1.1W 15% activity — Hoplite-DSP uses ~50% less power
Abdelfattah + Betz [TRETS2014] (extrapolated results for 48b-wide 1VC)
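A short worked calculation of the frequency and area comparison above. It takes one reading of the numbers: since the DSP runs at 2x the PE frequency, the ~650 MHz multi-pumped clock is assumed to correspond to an effective rate of half that. The variable names are illustrative.

```python
hard_router_mhz = 996.0
hoplite_dsp_mhz = 650.0
pump_factor = 2                      # DSP runs at 2x the PE frequency

# Effective NoC rate seen by the PEs under 2x multi-pumping.
effective_mhz = hoplite_dsp_mhz / pump_factor

# Frequency advantage of the hard router over the multi-pumped design.
freq_advantage = hard_router_mhz / effective_mhz

# Area of one DSP block relative to one hard router, both in LABs.
area_ratio = 11.9 / 12.45

print(f"effective rate: {effective_mhz:.0f} MHz")
print(f"hard router frequency advantage: {freq_advantage:.2f}x")
print(f"DSP block area / hard router area: {area_ratio:.3f}")
```

Under this reading the hard router is about 3x faster and the DSP block is roughly 4% smaller, which matches the "3x" and "marginally smaller" statements.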
— 48b switched bidirectional routing instead of just cascades (approach hard NoC wiring) — option to skip DSP blocks (segment lengths)
— pattern detection logic with multiple masks (similar to Altera DSP units)
— fracturing 48b-wide lanes into multiple lanes
— use the dynamic OPMODE feature
— Small fraction of DSPs for switching
— glorified “pipelined wires” — multi-pumping 50% back to user
— connect cascades to fabric
[Figure: 2x2 NoC (ML605 board) layout. Legend: Corner-Turn, Pass-Thru, Hoplite]
DSP48s less-efficient than LUT-based Hoplite!