Contrasting Topologies for Regular Interconnection Networks under the Constraints of Nanoscale Technologies (PowerPoint PPT Presentation)
SLIDE 1

Contrasting Topologies for Regular Interconnection Networks under the Constraints of Nanoscale Technologies

MPSoC Research Group @ University of Ferrara

Daniele Ludovici, Francisco Gilabert, Maria Gomez, Georgi Gaydadjiev, Davide Bertozzi

SLIDE 2

Outline

  • Motivation & Goal
  • Topologies under test
  • Physical Modeling Framework
  • 64-tile topologies: P&R results
  • What happens when the link needs to be pipelined?
  • System-level exploration
  • Conclusions
SLIDE 3

The execution of many multimedia and signal processing functions has historically been accelerated by means of specialized processing engines. With the advent of MPSoC technology, the performance of hardware accelerators is becoming accessible by combining multiple programmable processor tiles within a multicore system.

MPSoC Technology

SLIDE 4

The execution of many multimedia and signal processing functions has historically been accelerated by means of specialized processing engines. With the advent of MPSoC technology, the performance of hardware accelerators is becoming accessible by combining multiple programmable processor tiles within a multicore system.

MPSoC: efficient computation can be achieved while only marginally impacting programmability and/or configurability

MPSoC Technology

SLIDE 5
  • Hierarchical system
  • Programmable accelerator: a tile-based subsystem of homogeneous processing units
  • Top level of the hierarchy: DSP, I/O units, hardware accelerators

System complexity is more a matter of instantiation and connectivity capability than of architecture development

MPSoC Architecture: The new Landscape

SLIDE 6

Connectivity patterns for large-scale systems are well known from off-chip networking; nanoscale silicon technologies, however, change the picture.

[Figure: tiled array of modules]

Growing gap between pre- and post-layout properties of topology connectivity patterns

The Physical Gap

SLIDE 7

  • Over-the-cell routing?
  • Latency in injection links?
  • Latency in express links?
  • Which switch operating frequency?
  • Can automatic routing tools handle this effectively?
  • How is routing congestion at each metal layer impacted?
  • Regularity broken by asymmetric tile size or heterogeneous tiles!

Pencil-and-paper floorplanning considerations may be misleading; therefore, topology comparison with layout awareness is a must!

Layout Effects

SLIDE 8

IDEA: many regular topologies feature better abstract properties than a 2D mesh

GOAL

GOAL: quantify to what extent such properties are impacted by the degradation effects of physical synthesis in nanoscale silicon
SLIDE 9

GOAL

GOAL: we propose a NoC physical characterization methodology enabling layout-aware analysis of large-scale systems while pruning time and memory requirements

IDEA: typically, accurate physical modeling of interconnection networks is limited to small-scale systems

SLIDE 10

IDEA: long links will most probably be implemented with link pipelining

GOAL

GOAL: we capture the impact of link pipelining on topology area and performance, assessing whether and to what extent the theoretical benefits are preserved

SLIDE 11

GOAL

GOAL: we consider the IP core/network speed decoupling typical of GALS systems in the topology evaluation framework

IDEA: system-level power management is achieved by structuring the MPSoC into voltage and frequency islands

SLIDE 12

8-ary 2-mesh => 2D mesh

Topologies under test

SLIDE 13

4-ary 3-mesh

Topologies under test

SLIDE 14

4-ary 2-mesh

Topologies under test

SLIDE 15

Other concentrated variants: 2-ary 6-mesh

Topologies under test

SLIDE 16

8-Cmesh

Topologies under test
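As a quick sanity check on the topologies above, their abstract (pre-layout) properties can be sketched in Python. The formulas are the standard k-ary n-mesh ones; the function name and the radix bookkeeping are illustrative, and the deck's reported radices may differ by a port or two depending on how local ports are counted.

```python
# Abstract (pre-layout) properties of the k-ary n-mesh variants under test.
# All three variants below connect the same 64 tiles.

def kary_nmesh_properties(k: int, n: int) -> dict:
    """Basic metrics of a k-ary n-mesh with one tile per switch."""
    nodes = k ** n                          # total tiles/switches
    diameter = n * (k - 1)                  # worst-case hop count
    # Interior switches have up to two links per dimension (only one when
    # k == 2) plus a local injection/ejection port; real implementations
    # may count ports slightly differently.
    max_radix = n * min(2, k - 1) + 1
    return {"nodes": nodes, "diameter": diameter, "max_radix": max_radix}

for k, n in [(8, 2), (4, 3), (2, 6)]:
    print(f"{k}-ary {n}-mesh:", kary_nmesh_properties(k, n))
```

This makes the trade-off visible at a glance: going from the 8-ary 2-mesh to the 2-ary 6-mesh cuts the diameter from 14 hops to 6 at the cost of higher-radix switches.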

SLIDE 17

Characterization flow: topology specification -> topology generation (SystemC/Verilog RTL) -> physical synthesis -> floorplan -> placement -> clock tree synthesis, power grid, routing, post-routing optimization -> netlist and parasitic extraction -> SDF (timing) -> simulation (OCP traffic generator, transactional simulator) -> VCD trace -> power estimation (PrimeTime PX)

SLIDE 18

Challenge: layout-aware physical modeling of large-scale NoC topologies

Characterization Methodology

SLIDE 19

In a NoC topology, the critical path is determined by the switch-to-switch link! Because of their long links, most of the topologies are not competitive with the 2D mesh, and some are even unusable!

Post-layout Results

The longest link determines the highest achievable frequency (post-layout); the highest switch radix determines the maximum frequency (post-synthesis).
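These two limits can be sketched as a simple maximum over the two delay contributions. All delay values below are hypothetical placeholders, not figures from the actual layouts.

```python
# Sketch of the two frequency limits: post-synthesis the switch critical
# path dominates; post-layout the longest switch-to-switch wire can take
# over. All delay values are hypothetical placeholders.

def max_frequency_ghz(switch_delay_ns: float,
                      longest_link_mm: float,
                      wire_delay_ns_per_mm: float) -> float:
    """The clock period must cover both the switch logic and the longest link."""
    link_delay_ns = longest_link_mm * wire_delay_ns_per_mm
    return 1.0 / max(switch_delay_ns, link_delay_ns)

# Hypothetical: a 0.8 ns switch path with a 3 mm link at 0.2 ns/mm stays
# switch-limited; stretching the link to 6 mm makes it wire-limited.
print(max_frequency_ghz(0.8, 3.0, 0.2))   # 1.25 (GHz), switch-limited
print(max_frequency_ghz(0.8, 6.0, 0.2))   # ~0.83 (GHz), link-limited
```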

SLIDE 20

Area for 64-tile, no pipelining

  • The final area footprint depends on the number of switches, the maximum switch radix and, consequently, their final synthesis frequency.
  • KEY TAKE-AWAY: these topologies are not more area-efficient than the 2D mesh, but due to their slow-down their area footprint can be overly optimized... never forget the target frequency when considering the area footprint!
  • E.g., the 2-ary 6-mesh has a slower final frequency w.r.t. the 8-ary 2-mesh, but all its switches have radix 8 vs. 4, 5 and 6 => the 8-ary 2-mesh has a 10% area saving.
  • 4-ary 2-mesh: short links (3 mm) => small performance drop; few switches (16) => 20% saving.

SLIDE 21

Does this mean that multi-dimensional and concentrated-mesh topologies are totally unusable?

  • Link pipelining breaks long timing paths at a cheaper cost than switches.

[Figure: flip-flops inserted on the data link between two switches]

Link Pipelining

SLIDE 22

The xpipes architecture uses stall/go flow control.

  • 2-slot buffers are needed for stall/go flow control: one slot for normal flit propagation, plus a backup slot to compensate for the propagation delay of the backpressure signals.

[Figure: two pipeline stages with Data/stall/valid signals and control logic (sel, en1, en2)]

Link Pipelining
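The 2-slot stall/go behavior can be illustrated with a toy behavioral model. This is a sketch, not the xpipes RTL: the class name is invented, and the stall-generation policy is deliberately simplified to show why the backup slot exists.

```python
# Behavioral sketch of a stall/go pipeline stage with a 2-slot buffer: one
# slot holds the normally propagating flit, the second absorbs the flit
# already in flight when a stall arrives, since the backpressure signal
# itself takes a cycle to reach the upstream stage.

class StallGoStage:
    def __init__(self):
        self.slots = []                     # at most 2 buffered flits

    def cycle(self, incoming, downstream_stall):
        """One clock edge. Returns (flit sent downstream, stall to upstream)."""
        sent = None
        if not downstream_stall and self.slots:
            sent = self.slots.pop(0)        # normal flit propagation
        if incoming is not None:
            assert len(self.slots) < 2, "backup slot overrun"
            self.slots.append(incoming)     # in-flight flit always has room
        # Simplified policy: propagate the stall upstream while holding data.
        return sent, bool(self.slots) and downstream_stall

stage = StallGoStage()
out1, _ = stage.cycle("flit0", downstream_stall=False)
out2, _ = stage.cycle("flit1", downstream_stall=True)   # flit0 held, flit1 lands in backup slot
out3, _ = stage.cycle(None, downstream_stall=False)     # stall released
print(out1, out2, out3)                                 # None None flit0
```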

SLIDE 23
  • A flow-control stage features a considerable area overhead with respect to a simple barrier of flip-flops.

[Chart: relative area overhead, flip-flop barrier vs. flow-control stage]

What are the implications on topology area figures? How are the area ratios between topologies impacted by link pipelining?

Not just a bunch of flip-flops

SLIDE 24
  • Insertion criterion: from the third link dimension onwards
  • Each topology has a different number of links and a different number of required pipeline stages => this depends on the maximum achievable frequency

Key take-away: each topology has a different price to pay to restore the maximum achievable frequency dictated by its elementary switch block

Area for 64-tile, with pipelining
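The per-link price described above can be sketched as back-of-the-envelope arithmetic: stages are inserted so that no wire segment exceeds one clock period at the frequency restored by the elementary switch. The wire speed below is a hypothetical placeholder, not a figure from the deck.

```python
import math

# Number of pipeline stages a link needs so that no wire segment exceeds
# one clock period at the target frequency. Wire speed is a placeholder.

def pipeline_stages(link_mm: float, target_ghz: float,
                    wire_ns_per_mm: float = 0.2) -> int:
    """Number of flip-flop barriers needed along one link."""
    period_ns = 1.0 / target_ghz
    segments = math.ceil(link_mm * wire_ns_per_mm / period_ns)
    return max(segments - 1, 0)

# Hypothetical: at 1 GHz a 3 mm link fits in one cycle, while a 9 mm
# express link needs one intermediate stage.
print(pipeline_stages(3.0, 1.0))   # 0
print(pipeline_stages(9.0, 1.0))   # 1
```

Topologies with more links, or longer ones, multiply this cost across the whole network, which is why each topology pays a different price to restore its maximum frequency.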

SLIDE 25
  • Area before and after pipeline stage insertion
  • Cell area increment in all cases comes from a twofold contribution:
  • pipeline stage insertion
  • restored higher frequency allowed by such insertion
  • The possibility to restore a high frequency radically changes the area overhead of the elementary switch block as well

Area Overhead for pipeline stage insertion

SLIDE 26

Multi-dimensional and C-mesh topologies become competitive again!

  • However... flow-control stages add latency to the link crossing...
  • ...and the performance gained from a better frequency could be cancelled out by an increased overall link latency.
  • Therefore, in order to quantify this, we performed a system-level exploration.

Link Pipelining

SLIDE 27
  • TLM simulator, cycle-accurate with respect to the xpipes architecture
  • Back-annotation of physical parameters from the layout synthesis, such as real link latency, core and switch operating frequency, etc.
  • The target system implements dual-clock FIFOs in order to model a scenario where every core can run at its own frequency (it was ratio-based)

System-level exploration with layout awareness

System-level Exploration

SLIDE 28

Performance of 64-tile systems with uniform random traffic

Theoretical =>

  • Neglecting layout implications, 2-ary 6-mesh is the best solution
  • Several topologies outperform the 8-ary 2-mesh
SLIDE 29

Performance of 64-tile systems with uniform random traffic

Layout aware no pipelining =>

  • 8-ary 2-mesh is the best solution
  • the poor matching of several topologies with silicon technology

completely offsets their better theoretical properties

SLIDE 30

Performance of 64-tile systems with uniform random traffic

Layout aware, with link pipelining =>

  • When the impact of wiring complexity over the critical path is alleviated by

using link pipelining techniques:

  • 2-ary 6-mesh, 2-ary 5-mesh, 4-ary 3-mesh outperform the 2D mesh
  • BUT their performance comes at an area cost!

Is the performance boost proportionate to the area overhead?

SLIDE 31
  • Area-efficiency metric: throughput/area
  • The performance improvement achieved by complex topologies is NOT cost-effective, in that the area overhead is disproportionate to the performance boost
  • Only a traffic pattern favoring low hop count (perfect shuffle) achieves better area efficiency

Area Efficiency
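The metric itself is simple division; a sketch with hypothetical placeholder numbers (not measured results from the paper) shows how a faster but much larger topology loses on it.

```python
# The slide's area-efficiency metric is accepted throughput divided by
# post-layout area. All numbers below are hypothetical placeholders.

def area_efficiency(throughput_flits_per_cycle: float,
                    freq_ghz: float, area_mm2: float) -> float:
    """Throughput per unit area (Gflits/s per mm^2)."""
    return throughput_flits_per_cycle * freq_ghz / area_mm2

# A topology that is 20% faster but 50% larger loses on this metric:
base = area_efficiency(16.0, 1.0, 10.0)   # 1.6
big = area_efficiency(16.0, 1.2, 15.0)    # 1.28
print(big < base)   # True
```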

SLIDE 32
  • A comprehensive analysis framework to assess k-ary n-mesh and C-mesh topologies at different levels of abstraction: from system level to layout level
  • An accurate physical modeling methodology to characterize topologies from an area and timing viewpoint while pruning implementation time and memory requirements
  • Without link pipelining => forget about these topologies
  • With link pipelining: the area ratio w.r.t. the 2D mesh is inverted (due to synthesis)
  • Even with an increased link latency, some k-ary n-mesh and C-mesh topologies preserve their performance benefits... but this comes at a disproportionate area cost!

Summing up

SLIDE 33

Questions?

daniele.ludovici@unife.it