Contrasting Topologies for Regular Interconnection Networks under the Constraints of Nanoscale Technologies (PowerPoint PPT Presentation)
SLIDE 1

Contrasting Topologies for Regular Interconnection Networks under the Constraints of Nanoscale Technologies

MPSoC Research Group @ University of Ferrara

Daniele Ludovici, Francisco Gilabert, Maria Gomez, Georgi Gaydadjiev, Davide Bertozzi

SLIDE 2

Outline

  • Motivation & Goal
  • Topologies under test
  • Physical Modeling Framework
  • 64-tile topologies: P&R results
  • What happens when the link needs to be pipelined?
  • System-level exploration
  • Conclusions
SLIDE 3

The execution of many multimedia and signal processing functions has historically been accelerated by means of specialized processing engines. With the advent of MPSoC technology, the performance of hardware accelerators is becoming accessible by combining multiple programmable processor tiles within a multicore system.

MPSoC Technology

SLIDE 4

The execution of many multimedia and signal processing functions has historically been accelerated by means of specialized processing engines. With the advent of MPSoC technology, the performance of hardware accelerators is becoming accessible by combining multiple programmable processor tiles within a multicore system.

MPSoC: efficient computation can be achieved while only marginally impacting programmability and/or configurability

MPSoC Technology

SLIDE 5
  • Hierarchical system
  • Programmable accelerator: a tile-based subsystem of homogeneous processing units
  • Top level of the hierarchy: DSP, I/O units, hardware accelerators

System complexity is more a matter of instantiation and connectivity capability than of architecture development

MPSoC Architecture: The new Landscape

SLIDE 6

Connectivity patterns for large-scale systems are well known from off-chip networking; nanoscale silicon technologies, however, change the picture.

[Figure: tiled array of modules]

Growing gap between pre- and post-layout properties of topology connectivity patterns

The Physical Gap

SLIDE 7

  • Over-the-cell routing?
  • Latency in injection links?
  • Latency in express links?
  • Which switch operating frequency?
  • Can automatic routing tools handle this effectively?
  • How is routing congestion at each metal layer impacted?
  • Regularity broken by asymmetric tile size or heterogeneous tiles!

Pencil-and-paper floorplanning considerations may be misleading; therefore, topology comparison with layout awareness is a must!

Layout Effects

SLIDE 8

IDEA: many regular topologies feature better abstract properties than a 2D mesh

GOAL

GOAL: quantify to what extent such properties are impacted by the degradation effects of physical synthesis in nanoscale silicon
SLIDE 9

GOAL

GOAL: we propose a NoC physical characterization methodology enabling layout-aware analysis of large-scale systems while pruning time and memory requirements

IDEA: typically, accurate physical modeling of interconnection networks is limited to small-scale systems

SLIDE 10

IDEA: long links will most probably be implemented with link pipelining

GOAL

GOAL: we capture the impact of link pipelining on topology area and performance, assessing whether and to what extent the theoretical benefits are preserved

SLIDE 11

GOAL

GOAL: we consider the IP core/network speed decoupling typical of GALS systems in the topology evaluation framework

IDEA: system-level power management is achieved by structuring the MPSoC into voltage and frequency islands

SLIDE 12

8-ary 2-mesh => 2D mesh

Topologies under test

SLIDE 13

4-ary 3-mesh

Topologies under test

SLIDE 14

4-ary 2-mesh

Topologies under test

SLIDE 15

Other concentrated variants: 2-ary 6-mesh

Topologies under test

SLIDE 16

8-Cmesh

Topologies under test
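As a quick sanity check on the topologies above, their abstract (pre-layout) properties can be sketched in Python. The formulas are the standard k-ary n-mesh ones; the function name and the radix bookkeeping are illustrative, and the deck's reported radices may differ by a port or two depending on how local ports are counted.

```python
# Abstract (pre-layout) properties of the k-ary n-mesh variants under test.
# All three variants below connect the same 64 tiles.

def kary_nmesh_properties(k: int, n: int) -> dict:
    """Basic metrics of a k-ary n-mesh with one tile per switch."""
    nodes = k ** n                          # total tiles/switches
    diameter = n * (k - 1)                  # worst-case hop count
    # Interior switches have up to two links per dimension (only one when
    # k == 2) plus a local injection/ejection port; real implementations
    # may count ports slightly differently.
    max_radix = n * min(2, k - 1) + 1
    return {"nodes": nodes, "diameter": diameter, "max_radix": max_radix}

for k, n in [(8, 2), (4, 3), (2, 6)]:
    print(f"{k}-ary {n}-mesh:", kary_nmesh_properties(k, n))
```

This makes the trade-off visible at a glance: going from the 8-ary 2-mesh to the 2-ary 6-mesh cuts the diameter from 14 hops to 6 at the cost of higher-radix switches.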

SLIDE 17

Characterization flow: topology specification -> topology generation (SystemC/Verilog RTL) -> physical synthesis -> floorplan -> placement -> clock tree synthesis, power grid, routing, post-routing optimization -> netlist and parasitic extraction -> SDF (timing) -> simulation (OCP traffic generator, transactional simulator) -> VCD trace -> power estimation (PrimeTime PX)

SLIDE 18

Challenge: layout-aware physical modeling of large-scale NoC topologies

Characterization Methodology

SLIDE 19

In a NoC topology, the critical path is determined by the switch-to-switch link! Because of their long links, most of the topologies are not competitive with the 2D mesh, and some are even unusable!

Post-layout Results

The longest link determines the highest achievable frequency (post-layout); the highest switch radix determines the maximum frequency (post-synthesis).
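These two limits can be sketched as a simple maximum over the two delay contributions. All delay values below are hypothetical placeholders, not figures from the actual layouts.

```python
# Sketch of the two frequency limits: post-synthesis the switch critical
# path dominates; post-layout the longest switch-to-switch wire can take
# over. All delay values are hypothetical placeholders.

def max_frequency_ghz(switch_delay_ns: float,
                      longest_link_mm: float,
                      wire_delay_ns_per_mm: float) -> float:
    """The clock period must cover both the switch logic and the longest link."""
    link_delay_ns = longest_link_mm * wire_delay_ns_per_mm
    return 1.0 / max(switch_delay_ns, link_delay_ns)

# Hypothetical: a 0.8 ns switch path with a 3 mm link at 0.2 ns/mm stays
# switch-limited; stretching the link to 6 mm makes it wire-limited.
print(max_frequency_ghz(0.8, 3.0, 0.2))   # 1.25 (GHz), switch-limited
print(max_frequency_ghz(0.8, 6.0, 0.2))   # ~0.83 (GHz), link-limited
```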

SLIDE 20

Area for 64-tile, no pipelining

  • The final area footprint depends on the number of switches, the maximum switch radix and, consequently, their final synthesis frequency.
  • KEY TAKE-AWAY: these topologies are not more area-efficient than the 2D mesh, but due to their slow-down their area footprint can be overly optimized... never forget the target frequency when considering the area footprint!
  • E.g., the 2-ary 6-mesh has a slower final frequency w.r.t. the 8-ary 2-mesh, but all its switches have radix 8 vs. 4, 5 and 6 => the 8-ary 2-mesh has a 10% area saving.
  • 4-ary 2-mesh: short links (3 mm) => small performance drop; few switches (16) => 20% saving.

SLIDE 21

Does this mean that multi-dimensional and concentrated-mesh topologies are totally unusable?

  • Link pipelining breaks long timing paths at a cheaper cost than switches.

[Figure: flip-flops inserted on the data link between two switches]

Link Pipelining

SLIDE 22

The xpipes architecture uses stall/go flow control.

  • 2-slot buffers are needed for stall/go flow control: one slot for normal flit propagation, plus a backup slot to compensate for the propagation delay of the backpressure signals.

[Figure: two pipeline stages with Data/stall/valid signals and control logic (sel, en1, en2)]

Link Pipelining
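The 2-slot stall/go behavior can be illustrated with a toy behavioral model. This is a sketch, not the xpipes RTL: the class name is invented, and the stall-generation policy is deliberately simplified to show why the backup slot exists.

```python
# Behavioral sketch of a stall/go pipeline stage with a 2-slot buffer: one
# slot holds the normally propagating flit, the second absorbs the flit
# already in flight when a stall arrives, since the backpressure signal
# itself takes a cycle to reach the upstream stage.

class StallGoStage:
    def __init__(self):
        self.slots = []                     # at most 2 buffered flits

    def cycle(self, incoming, downstream_stall):
        """One clock edge. Returns (flit sent downstream, stall to upstream)."""
        sent = None
        if not downstream_stall and self.slots:
            sent = self.slots.pop(0)        # normal flit propagation
        if incoming is not None:
            assert len(self.slots) < 2, "backup slot overrun"
            self.slots.append(incoming)     # in-flight flit always has room
        # Simplified policy: propagate the stall upstream while holding data.
        return sent, bool(self.slots) and downstream_stall

stage = StallGoStage()
out1, _ = stage.cycle("flit0", downstream_stall=False)
out2, _ = stage.cycle("flit1", downstream_stall=True)   # flit0 held, flit1 lands in backup slot
out3, _ = stage.cycle(None, downstream_stall=False)     # stall released
print(out1, out2, out3)                                 # None None flit0
```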

SLIDE 23
  • A flow-control stage features a considerable area overhead with respect to a simple barrier of flip-flops.

[Chart: relative area overhead, flip-flop barrier vs. flow-control stage]

What are the implications on topology area figures? How are the area ratios between topologies impacted by link pipelining?

Not just a bunch of flip-flops

SLIDE 24
  • Insertion criterion: from the third link dimension onwards
  • Each topology has a different number of links and a different number of required pipeline stages => this depends on the maximum achievable frequency

Key take-away: each topology has a different price to pay to restore the maximum achievable frequency dictated by its elementary switch block

Area for 64-tile, with pipelining
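The per-link price described above can be sketched as back-of-the-envelope arithmetic: stages are inserted so that no wire segment exceeds one clock period at the frequency restored by the elementary switch. The wire speed below is a hypothetical placeholder, not a figure from the deck.

```python
import math

# Number of pipeline stages a link needs so that no wire segment exceeds
# one clock period at the target frequency. Wire speed is a placeholder.

def pipeline_stages(link_mm: float, target_ghz: float,
                    wire_ns_per_mm: float = 0.2) -> int:
    """Number of flip-flop barriers needed along one link."""
    period_ns = 1.0 / target_ghz
    segments = math.ceil(link_mm * wire_ns_per_mm / period_ns)
    return max(segments - 1, 0)

# Hypothetical: at 1 GHz a 3 mm link fits in one cycle, while a 9 mm
# express link needs one intermediate stage.
print(pipeline_stages(3.0, 1.0))   # 0
print(pipeline_stages(9.0, 1.0))   # 1
```

Topologies with more links, or longer ones, multiply this cost across the whole network, which is why each topology pays a different price to restore its maximum frequency.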

SLIDE 25
  • Area before and after pipeline stage insertion
  • Cell area increment in all cases comes from a twofold contribution:
  • pipeline stage insertion
  • restored higher frequency allowed by such insertion
  • The possibility to restore a high frequency radically changes the area overhead of the elementary switch block as well

Area Overhead for pipeline stage insertion

SLIDE 26

Multi-dimensional and C-mesh topologies become competitive again!

  • However... flow-control stages add latency to the link crossing...
  • ...and the performance gained from a better frequency could be cancelled out by an increased overall link latency.
  • Therefore, in order to quantify this, we performed a system-level exploration.

Link Pipelining

SLIDE 27
  • TLM simulator, cycle-accurate with respect to the xpipes architecture
  • Back-annotation of physical parameters from the layout synthesis, such as real link latency, core and switch operating frequency, etc.
  • The target system implements dual-clock FIFOs in order to model a scenario where every core can run at its own frequency (it was ratio-based)

System-level exploration with layout awareness

System-level Exploration

SLIDE 28

Performance of 64-tile systems with uniform random traffic

Theoretical =>

  • Neglecting layout implications, 2-ary 6-mesh is the best solution
  • Several topologies outperform the 8-ary 2-mesh
SLIDE 29

Performance of 64-tile systems with uniform random traffic

Layout aware no pipelining =>

  • 8-ary 2-mesh is the best solution
  • the poor matching of several topologies with silicon technology

completely offsets their better theoretical properties

SLIDE 30

Performance of 64-tile systems with uniform random traffic

Layout aware, with link pipelining =>

  • When the impact of wiring complexity over the critical path is alleviated by

using link pipelining techniques:

  • 2-ary 6-mesh, 2-ary 5-mesh, 4-ary 3-mesh outperform the 2D mesh
  • BUT their performance comes at an area cost!

Is the performance boost proportionate to the area overhead?

SLIDE 31
  • Area-efficiency metric: throughput/area
  • The performance improvement achieved by complex topologies is NOT cost-effective, in that the area overhead is disproportionate to the performance boost
  • Only a traffic pattern favoring low hop count (perfect shuffle) achieves better area efficiency

Area Efficiency
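The metric itself is simple division; a sketch with hypothetical placeholder numbers (not measured results from the paper) shows how a faster but much larger topology loses on it.

```python
# The slide's area-efficiency metric is accepted throughput divided by
# post-layout area. All numbers below are hypothetical placeholders.

def area_efficiency(throughput_flits_per_cycle: float,
                    freq_ghz: float, area_mm2: float) -> float:
    """Throughput per unit area (Gflits/s per mm^2)."""
    return throughput_flits_per_cycle * freq_ghz / area_mm2

# A topology that is 20% faster but 50% larger loses on this metric:
base = area_efficiency(16.0, 1.0, 10.0)   # 1.6
big = area_efficiency(16.0, 1.2, 15.0)    # 1.28
print(big < base)   # True
```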

SLIDE 32
  • A comprehensive analysis framework to assess k-ary n-mesh and C-mesh topologies at different levels of abstraction: from system level to layout level
  • An accurate physical modeling methodology to characterize topologies from an area and timing viewpoint while pruning implementation time and memory requirements
  • Without link pipelining => forget about these topologies
  • With link pipelining: the area ratio w.r.t. the 2D mesh is inverted (due to synthesis)
  • Even with an increased link latency, some k-ary n-mesh and C-mesh topologies preserve their performance benefits... but this comes at a disproportionate area cost!

Summing up

SLIDE 33

Questions?

daniele.ludovici@unife.it