slide-1
SLIDE 1

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu

1/29

slide-2
SLIDE 2

Claim

FPGA overlay NoCs designed to exploit interconnect properties of the FPGA fabric can surpass existing state-of-the-art NoCs by:

◮ 2.5–2.8× throughput ↑
◮ 2.2× energy ↓
◮ at 2.5× LUT cost ↑

Xilinx Virtex-7 485T FPGA, 8×8 system size, synthetic+real-world traffic.

2/29

slide-3
SLIDE 3

Context

◮ FPGAs finding comfortable home in datacenters

  ◮ Offloading compute-intensive workloads to the FPGA
  ◮ Energy-efficiency, fast coupling to networking

◮ Common Infrastructure: NoCs for apps + system IO

3/29

slide-5
SLIDE 5

Landscape of contemporary FPGA NoC Routers

4/29

slide-9
SLIDE 9

Landscape of contemporary FPGA NoC Routers

[Figure: Peak S/W Bandwidth (packets/ns) vs. Cost per Switch, max(LUTs, FFs); routers plotted: Split-Merge, CONNECT, BLESS, OpenSMART, Hoplite, FastTrack]

◮ ASIC clones transplanted onto FPGAs fare poorly! → expensive buffers, virtual channels, multi-ported switches
◮ Even contemporary FPGA routers are expensive and slow
◮ FastTrack: Deflection-routing + Bufferless + Torus

5/29

slide-10
SLIDE 10

Qualitative Comparison of FPGA NoC Routers

Router        Cost
              Xbar+Arb   Buffers   VCs
OpenSMART     ✗          ✗         ✗
BLESS         ✗          ✓         ✓
CONNECT       ✗          ✗         ✗
Split-Merge   ✗          ✗         ✓
Hoplite       ✓✓         ✓         ✓

6/29

slide-11
SLIDE 11

Quick Tutorial on Hoplite

[Figure: 4×4 array of Hoplite switches (sw), labeled 0,0 through 3,3]

Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs, TRETS 2017
Hoplite: Building Austere Overlay NoCs for FPGAs, FPL 2015

7/29

slide-12
SLIDE 12

Quick Tutorial on HopliteRT

[Figure: 4×4 array of HopliteRT switches (sw), labeled 0,0 through 3,3]

HopliteRT: An Efficient FPGA NoC for Real-Time Applications, FPT 2017

8/29

slide-14
SLIDE 14

Qualitative Comparison of FPGA NoC Routers

Router        Cost                          Perf
              Xbar+Arb   Buffers   VCs      Tput   Latency
OpenSMART     ✗          ✗         ✗        ✓      ✓
BLESS         ✗          ✓         ✓        ✓      ✗
CONNECT       ✗          ✗         ✗        ✓      ✓
Split-Merge   ✗          ✗         ✓        ✓      ✓
Hoplite       ✓✓         ✓         ✓        ✗      ✗

9/29

slide-15
SLIDE 15

Challenge

◮ Deflection routing → inefficient use of wiring resources

  ◮ Deflected packets stay in network for longer → latency ↑
  ◮ Steal bandwidth from other traffic → throughput ↓

◮ Can we improve NoC performance under deflection routing?

◮ Are there unique opportunities provided by the FPGA fabric?

  ◮ Hoplite cheap in LUT cost...
  ◮ FastTrack → inspect FPGA interconnect

10/29

slide-16
SLIDE 16

Outline

Introduction and Motivation
FastTrack NoC Organization
FastTrack Router Operation
Evaluation

11/29

slide-18
SLIDE 18

FPGA Wire Speeds

distances not to scale

13/29

slide-19
SLIDE 19

FastTrack NoC Organization

[Figure: array of switches (sw)]

14/29

slide-20
SLIDE 20

Depopulated Topology Generation

[Figure: array of switches (sw)]

15/29

slide-21
SLIDE 21

Parametric Topology generation

◮ FPGA NoC parameterized by three terms:

  ◮ N: System size
  ◮ D: Distance of express link
  ◮ R: Depopulation parameter → controls how many routers are FastTrack vs. vanilla Hoplite

◮ Fully populated 4×4 NoC → FT(16,2,1)
◮ Half population 4×4 NoC → FT(16,2,2)
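
The FT(N, D, R) notation can be illustrated with a small sketch. The slides do not spell out the placement rule, so the code below assumes one: along each ring of the torus, every R-th switch gets an express link that lands D switches further on, with wrap-around. The function name `ring_express_links` and that placement rule are hypothetical, chosen only to reproduce the "fully populated" and "half population" examples.

```python
import math


def ring_express_links(n, d, r):
    """Map each express-capable position on one ring to its express destination.

    n: total number of PEs (assumed to form a square side x side torus)
    d: express link distance in hops
    r: depopulation parameter (assumed: every r-th switch has an express link)
    """
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes a square NoC")
    # Express links wrap around the ring, as on a torus.
    return {pos: (pos + d) % side for pos in range(0, side, r)}


# FT(16,2,1): every switch on a 4-switch ring has an express link.
print(ring_express_links(16, 2, 1))
# FT(16,2,2): only every second switch has one.
print(ring_express_links(16, 2, 2))
```

Under this assumed rule, R=1 gives express links at all four ring positions and R=2 at two of them, which matches the fully populated and half population cases on the slide.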

16/29

slide-23
SLIDE 23

FastTrack Switch Organization

(a) Base HopliteRT: two 3:1 multiplexers; ports W, E, N, S, PE

(b) FastTrack: 4:1, 5:1, 4:1, and 4:1 multiplexers; ports WSh, ESh, NSh, SSh, WEx, EEx, NEx, SEx, PE

18/29

slide-24
SLIDE 24

Switch Operation

[Figure: FastTrack switch with 4:1, 5:1, 4:1, and 4:1 multiplexers; ports WSh, ESh, NSh, SSh, WEx, EEx, NEx, SEx, PE]

◮ Packets can start in either short or express links
◮ DOR routing function: travel in X first, then Y
◮ Packets can upgrade to fast links if they can
◮ Packets can downgrade to slow links only on turn!
◮ Livelock avoidance: W → S > N → S
◮ Express links = higher priority, deflected packets acquire higher priority → progress
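
One way to read the routing rules above is as a best-case hop count. This is a rough sketch, not the router logic from the talk. It assumes a unidirectional torus, a fully populated network (R=1), no contention or deflection, and a packet that takes its leftover short hops first and then upgrades to express links in each dimension, downgrading only at the X-to-Y turn. The function names are hypothetical.

```python
def torus_delta(src, dst, side):
    """Hops from src to dst along one unidirectional ring of length `side`."""
    return (dst - src) % side


def ideal_hops(src, dst, side, d):
    """Return (short_only_hops, hops_with_express) for X-then-Y routing.

    src, dst: (x, y) coordinates; side: switches per ring; d: express distance.
    Ignores deflections, so both numbers are lower bounds.
    """
    dx = torus_delta(src[0], dst[0], side)
    dy = torus_delta(src[1], dst[1], side)
    short_only = dx + dy
    # Per dimension: (distance // d) express hops plus (distance % d) short hops.
    with_express = sum(divmod(dx, d)) + sum(divmod(dy, d))
    return short_only, with_express


# 8x8 NoC, 2-hop express links, from (0, 0) to (5, 3).
print(ideal_hops((0, 0), (5, 3), side=8, d=2))  # (8, 5)
```

Real traffic will see more hops than this whenever packets are deflected, which is the cost that the priority rules try to contain.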

19/29

slide-33
SLIDE 33

Experimental Setup

◮ RTL implementation of Routers → parameterized

  ◮ D, R parameters control cost

◮ Cycle-accurate simulations → Verilator
◮ FPGA synthesis + out-of-context place-and-route + XDC floorplanning constraints → Vivado

◮ Benchmarking:

  ◮ Synthetic traffic patterns at various injection rates
  ◮ Traces from real workloads: SpMV, Graph Analytics, Multi-processing

◮ Measure sustained throughput, average latency, power model
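
To make "synthetic traffic at various injection rates" concrete, here is a minimal sketch of a uniform RANDOM pattern. It is not the authors' testbench. The function name `random_traffic`, the per-cycle Bernoulli injection model, and the exclusion of self-addressed packets are all assumptions made for illustration.

```python
import random


def random_traffic(n_pes, injection_rate, cycles, seed=0):
    """Generate (cycle, src, dst) tuples for uniform random traffic.

    Each PE injects a packet in a given cycle with probability
    `injection_rate` (assumed model). Destinations are uniform over
    all other PEs; a PE never sends to itself (assumption).
    """
    rng = random.Random(seed)  # seeded for repeatable runs
    packets = []
    for cycle in range(cycles):
        for src in range(n_pes):
            if rng.random() < injection_rate:
                # Pick from n_pes - 1 candidates and skip over src.
                dst = rng.randrange(n_pes - 1)
                if dst >= src:
                    dst += 1
                packets.append((cycle, src, dst))
    return packets


# 8x8 NoC (64 PEs) at an injection rate of 0.1 packets/cycle/PE.
trace = random_traffic(n_pes=64, injection_rate=0.1, cycles=100)
print(len(trace), trace[:3])
```

Sweeping `injection_rate` and feeding each trace to a cycle-accurate simulator is one way to produce latency-versus-injection-rate curves like those in the evaluation.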

21/29

slide-34
SLIDE 34

Avg. Latency, RANDOM traffic, 8×8 NoC

[Figure: Avg Latency (cyc) vs. Injection Rate; series: Hoplite, FT(N,2,2), FT(N,2,1)]

22/29

slide-35
SLIDE 35

Avg. Latency, RANDOM traffic, 8×8 NoC

[Figure: Avg Latency (cyc) vs. Injection Rate; series: Hoplite-3x, Hoplite, FT(N,2,2), FT(N,2,1)]

22/29

slide-36
SLIDE 36

Avg. Latency, RANDOM traffic

[Figure, two panels: Avg Latency (cyc) vs. Injection Rate. Panel 1 series: Hoplite, FT(N,2,2), FT(N,2,1). Panel 2 series: Hoplite-3x, Hoplite, FT(N,2,2), FT(N,2,1)]

◮ FastTrack saturates at 4–5× higher injection rate than Hoplite
◮ vs Replicated Hoplite, still better but by smaller margin
◮ Replicated Hoplite has a new kind of livelock possibility (delivery)

23/29

slide-39
SLIDE 39

Results – LUT vs Throughput 8×8 NoC

[Figure: Sustained Rate (Million Packets/s) vs. Area (LUTs); series: Hoplite, Hoplite-2x, Hoplite-3x, FT R=1, FT R=2]

24/29

slide-42
SLIDE 42

Results – Wiring vs. Throughput 8×8 NoC

[Figure: Sustained Rate (Million Packets/s) vs. Wire Count; series: Hoplite, Hoplite-2x, Hoplite-3x, FT R=1, FT R=2]

25/29

slide-43
SLIDE 43

Results – Cost vs. Throughput 8×8 NoC

[Figure, two panels: Sustained Rate (Million Packets/s) vs. Area (LUTs), and Sustained Rate (Million Packets/s) vs. Wire Count; series in both: Hoplite, Hoplite-2x, Hoplite-3x, FT R=1, FT R=2]

◮ FastTrack makes better use of FPGA resources (LUTs and wires)
◮ Packets are allowed to leave the NoC faster, freeing up resources
◮ Must pick proper combination of FT design parameters

26/29

slide-46
SLIDE 46

Qualitative Comparison of FPGA NoC Routers

Router        Cost                          Perf
              Xbar+Arb   Buffers   VCs      Tput   Latency
OpenSMART     ✗          ✗         ✗        ✓      ✓
BLESS         ✗          ✓         ✓        ✓      ✗
CONNECT       ✗          ✗         ✗        ✓      ✓
Split-Merge   ✗          ✗         ✓        ✓      ✓
Hoplite       ✓✓         ✓         ✓        ✗      ✗
FastTrack     ✓          ✓         ✓        ✓      ✓

27/29

slide-47
SLIDE 47

FPGA Mapping Frequency 8×8 NoC

[Figure: NoC Frequency (MHz) vs. NoC Datawidth; series: FastTrack (64,2), FastTrack (64,4), Hoplite]

◮ Calibration studies showed express links can travel quickly on chip
◮ Fmax for 2-hop FastTrack keeps up with original Hoplite
◮ 4-hop express link distance too large, some noticeable slowdown

28/29

slide-48
SLIDE 48

Conclusions

◮ FastTrack outperforms state-of-the-art Hoplite FPGA NoC by

  ◮ 2.5× for synthetic traffic, 2.8× for real-world traces
  ◮ 2.2× on energy efficiency
  ◮ 2.5× more LUTs required

◮ FastTrack better at larger system sizes
◮ Ideal hop distance is 2–4 (4–256 PEs)
◮ Fmax gap between FastTrack and Hoplite is small

29/29