FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs
Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu
Claim: FPGA overlay NoCs designed to exploit interconnect
◮ 2.5–2.8× throughput ↑
◮ 2.2× energy ↓
◮ at 2.5× LUT cost ↑
◮ FPGAs finding comfortable home in datacenters
◮ Offloading compute intensive workloads to the FPGA
◮ Energy-efficiency, fast coupling to networking
◮ Common Infrastructure: NoCs for apps + system IO
[Figure: Peak S/W Bandwidth (packets/ns) vs. Cost per Switch, max(LUTs, FFs), for CONNECT, BLESS, OpenSMART, Hoplite, and FastTrack]
◮ ASIC clones transplanted onto FPGAs fare poorly!
◮ Even contemporary FPGA routers are expensive and slow
◮ FastTrack: Deflection-routing + Bufferless + Torus
Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs, TRETS 2017
Hoplite: Building Austere Overlay NoCs for FPGAs, FPL 2015
HopliteRT: An Efficient FPGA NoC for Real-Time Applications, FPT 2017
◮ Deflection routing → inefficient use of wiring resources
◮ Deflected packets stay in network for longer → latency ↑
◮ Steal bandwidth from other traffic → throughput ↓
◮ Can we improve NoC performance under
◮ Are there unique opportunities provided by the
◮ Hoplite cheap in LUT cost...
◮ FastTrack → inspect FPGA interconnect
◮ FPGA NoC parameterized by three terms:
◮ N: System size
◮ D: Distance of express link
◮ R: Depopulation parameter → controls how many routers
◮ Fully populated 4×4 NoC → FT(16,2,1)
◮ Half population 4×4 NoC → FT(16,2,2)
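The FT(N, D, R) notation can be captured in a small record. This is a minimal sketch, not the authors' RTL parameterization: the class name `FTConfig`, the square-torus assumption, and especially the rule that every R-th router along a ring carries express links are assumptions chosen only to be consistent with the "fully populated" and "half population" examples.

```python
from dataclasses import dataclass
from math import isqrt


@dataclass(frozen=True)
class FTConfig:
    """Hypothetical container for the FT(N, D, R) parameters."""
    n: int  # N: system size (number of routers / PEs)
    d: int  # D: distance of an express link, in routers
    r: int  # R: depopulation parameter

    @property
    def side(self) -> int:
        # ASSUMPTION: square torus, as in the 4x4 examples.
        s = isqrt(self.n)
        if s * s != self.n:
            raise ValueError("N must be a perfect square for a square torus")
        return s

    def express_positions(self) -> list[int]:
        # ASSUMPTION: along each ring, every R-th router has express links.
        return [i for i in range(self.side) if i % self.r == 0]


full = FTConfig(16, 2, 1)   # FT(16,2,1)
half = FTConfig(16, 2, 2)   # FT(16,2,2)
print(full.side, full.express_positions())
print(half.side, half.express_positions())
```

Under these assumptions, FT(16,2,1) places express links at all four positions of each ring and FT(16,2,2) at two of the four.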
[Figure: router with two 3:1 muxes; ports W, E, N, S, PE]
[Figure: router with 4:1, 5:1, 4:1, 4:1 muxes; ports WSh, ESh, NSh, SSh, WEx, EEx, NEx, SEx, PE]
◮ Packets can start in either
◮ DOR routing function:
◮ Packets can upgrade to
◮ Packets can downgrade to
◮ Livelock avoidance:
◮ Express links = higher
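To give a feel for why express links shorten paths under dimension-ordered routing (DOR), the sketch below counts hops on a unidirectional torus. It is an illustration under stated assumptions, not the routing function from the slides: it assumes every router has express links (R = 1), no contention or deflection, and a greedy policy that rides express links of length `d` while they do not overshoot and then finishes on short links. The function names are hypothetical.

```python
def ring_hops(src: int, dst: int, side: int, d: int) -> tuple[int, int]:
    """Return (express_hops, short_hops) along one unidirectional ring."""
    remaining = (dst - src) % side   # distance in the single allowed direction
    express_hops = remaining // d    # ASSUMPTION: greedy use of express links
    short_hops = remaining % d       # leftover distance on short links
    return express_hops, short_hops


def dor_hops(src: tuple[int, int], dst: tuple[int, int], side: int, d: int) -> int:
    """Total hops when X is fully resolved before Y (dimension-ordered)."""
    ex, sx = ring_hops(src[0], dst[0], side, d)
    ey, sy = ring_hops(src[1], dst[1], side, d)
    return ex + sx + ey + sy


# 8x8 torus, packet from (0, 0) to (5, 3)
print(dor_hops((0, 0), (5, 3), side=8, d=1))  # short links only
print(dor_hops((0, 0), (5, 3), side=8, d=2))  # with 2-hop express links
```

With `d=1` the count reduces to the plain ring distance, so the same function also models a network without express links.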
◮ RTL implementation of Routers → parameterized
◮ D, R parameters control cost
◮ Cycle-accurate simulations → Verilator
◮ FPGA synthesis + out-of-context place-and-route + XDC
◮ Benchmarking:
◮ Synthetic traffic patterns at various injection rates
◮ Traces from real workloads SpMV, Graph Analytics,
◮ Measure sustained throughput, average latency, power
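As a rough illustration of "synthetic traffic patterns at various injection rates", the sketch below generates uniform-random traffic with Bernoulli injection. It is an assumption about what such a generator might look like, not the authors' benchmarking harness; the function name, the `(cycle, src, dst)` tuple layout, and the choice of the uniform-random pattern are all hypothetical.

```python
import random


def uniform_random_traffic(num_pes: int, injection_rate: float,
                           cycles: int, seed: int = 0):
    """Return a list of (cycle, src, dst) injection events."""
    rng = random.Random(seed)  # seeded for repeatable experiments
    packets = []
    for cycle in range(cycles):
        for src in range(num_pes):
            # Each PE injects with probability `injection_rate` per cycle.
            if rng.random() < injection_rate:
                # Pick a destination uniformly among the other PEs.
                dst = rng.randrange(num_pes - 1)
                if dst >= src:
                    dst += 1  # skip the source itself
                packets.append((cycle, src, dst))
    return packets


trace = uniform_random_traffic(num_pes=16, injection_rate=0.1, cycles=1000)
offered = len(trace) / (16 * 1000)
print(f"offered load: {offered:.3f} packets/cycle/PE")
```

Sweeping `injection_rate` and feeding each trace to a cycle-accurate simulator is one way to produce latency-versus-injection-rate curves.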
[Figure: Avg Latency (cyc) vs. Injection Rate for Hoplite, FT(N,2,2), and FT(N,2,1)]
◮ FastTrack saturates at
◮ vs Replicated Hoplite, still
◮ Replicated Hoplite has a
[Figure: Sustained Rate (Million Packets/s) vs. Area (LUTs) for Hoplite, Hoplite-2x, Hoplite-3x, FT R=1, and FT R=2]
[Figure: Sustained Rate (Million Packets/s) vs. Wire Count for Hoplite, Hoplite-2x, Hoplite-3x, FT R=1, and FT R=2]
◮ FastTrack makes better
◮ Packets are allowed to
◮ Must pick proper
[Figure: NoC Frequency (MHz) vs. NoC Datawidth for FastTrack (64,4) and Hoplite]
◮ Calibration studies showed
◮ Fmax for 2-hop FastTrack
◮ 4-hop express link
◮ FastTrack outperforms state-of-the-art Hoplite FPGA
◮ 2.5× for synthetic traffic, 2.8× for real-world traces
◮ 2.2× on energy efficiency
◮ 2.5× more LUTs required
◮ FastTrack better at larger system sizes
◮ Ideal hop distance is 2–4 (4–256 PEs)
◮ Fmax gap between FastTrack and Hoplite is small
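The headline ratios can be combined into a rough throughput-per-LUT figure. This is back-of-the-envelope arithmetic on the numbers quoted above, not a result reported in the slides; the helper name `relative_efficiency` is hypothetical, and the calculation assumes the throughput and LUT ratios refer to comparable configurations.

```python
def relative_efficiency(speedup: float, cost_ratio: float) -> float:
    """Throughput gain divided by LUT cost increase, relative to Hoplite."""
    return speedup / cost_ratio


LUT_COST_RATIO = 2.5  # "2.5x more LUTs required"

for label, speedup in [("synthetic traffic", 2.5), ("real-world traces", 2.8)]:
    eff = relative_efficiency(speedup, LUT_COST_RATIO)
    print(f"{label}: {eff:.2f}x throughput per LUT")
```

Read this way, the throughput gain roughly pays for the extra LUTs on synthetic traffic and slightly more than pays for them on real-world traces.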