
Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi-Processors
NoCArc 09
Jesús Camacho Villanueva, José Flich, José Duato (Universidad Politécnica de Valencia)
Hans Eberle, Nils Gura, Wladek Olesinski (Sun Microsystems)
December 12, 2009

Index
- Introduction
- Network simulator
- Simulation model
- Performance analysis
- Conclusions
- Future work
Second International Workshop on Network on Chip Architectures


Introduction
- Networks-on-chip (NoCs) are the critical component of a chip multiprocessor (CMP) as the number of cores increases
- CMPs with 32 cores are already on the drawing table
- 48 cores recently announced by Intel
- Need for a full-system simulator with an accurate network simulation model
- Not considering the network component and full-system simulation may lead to incorrect conclusions

Introduction
Topology considerations for NoCs (in CMPs)
- Crossbars simplify the design, but they have limited scalability [Micro07]
- 2D-Meshes scale better than crossbars and simplify the design of a tiled organization
- Rings have a simpler design than 2D-Meshes, but the average distance between nodes is higher
- The network capacity is also a critical parameter in the design of NoCs

[Micro07] Hoskote Y., Vangal S., Singh A., Borkar N., Borkar S.: 'A 5-GHz mesh interconnect for a teraflops processor', IEEE Micro Mag., 2007, 27, (5), pp. 51-61

Introduction
Goals
- To develop an accurate simulation tool for the on-chip network taking into account the target machine: coherence protocol, OS, and application
- At the network level the simulation tool needs to allow:
  - Collective communication
  - Different topologies
  - Different architectures:
    - Switch architecture
    - Switching mechanisms (WH, VCT)
    - Flit size, flow control…

Network simulator
- SIMICS + GEMS + GAPNET
- SIMICS: Full-system simulator
- GEMS: A set of modules for SIMICS that enables detailed simulation of chip multiprocessors (CMPs)
  - Provides a detailed memory system simulator
  - Implements the cache coherence protocol
- GAPNET: Event-driven network simulator providing collective communication

Network simulator
GapNet and network interface (figure)

Network simulator
GapNet simulator events (figure)
- GEMS: Src, Dst
- INTERFACE: Enqueue, Wakeup
- GAPNET: Send, Route, Cross, Transmit, Receive
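The event names above suggest a classic event-queue design. The following is a minimal sketch of such a loop, not GapNet's actual implementation: the order Send, Route, Cross, Transmit, Receive, the one-cycle delay per stage, and the function name `simulate` are all assumptions made for illustration, and contention is not modeled.

```python
import heapq
from itertools import count

# Hypothetical per-stage delays in cycles; the slides do not give these values.
STAGE_DELAY = {"Send": 1, "Route": 1, "Cross": 1, "Transmit": 1}


def simulate(messages, hops):
    """Run a toy event loop (no contention, hops >= 1 assumed).

    messages: list of (inject_cycle, msg_id)
    hops: number of switches each message crosses
    Returns {msg_id: cycle at which the Receive event fires}.
    """
    queue = []
    order = count()  # tie-breaker so heapq never has to compare payloads
    for cycle, msg_id in messages:
        heapq.heappush(queue, (cycle, next(order), "Send", msg_id, 0))

    received = {}
    while queue:
        cycle, _, event, msg_id, hop = heapq.heappop(queue)
        if event == "Receive":
            received[msg_id] = cycle
            continue
        done = cycle + STAGE_DELAY[event]
        if event == "Send":
            nxt, nxt_hop = "Route", 0
        elif event == "Route":
            nxt, nxt_hop = "Cross", hop
        elif event == "Cross":
            nxt, nxt_hop = "Transmit", hop
        elif hop + 1 < hops:          # Transmit: more switches ahead
            nxt, nxt_hop = "Route", hop + 1
        else:                         # Transmit: last link, deliver
            nxt, nxt_hop = "Receive", hop
        heapq.heappush(queue, (done, next(order), nxt, msg_id, nxt_hop))
    return received
```

Under these assumed delays a message needs one cycle for Send plus three cycles per switch, so `simulate([(0, "a")], hops=2)` delivers message `a` at cycle 7.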

Simulation model
- Sarek machine (Sun Fire server) with Solaris 10
- 32 cores with a SPARC CPU, private cache for the L1 and shared cache among all the processors for the L2

                 L1 cache    L2 cache
  Size           128 KB      8 MB
  Associativity  8-way       16-way
  Line size      64 B        64 B
  Hit latency    3 cycles    6 cycles

- Cache coherency protocol is a directory protocol with non-inclusive and blocking caches

Simulation model
Interconnects
- Four interconnect types: fixed delay interconnect, crossbar, 2D-mesh and bidirectional ring
- The 2D-mesh is organized as a 4x8 array and uses X-Y dimension-order routing. The bidirectional ring chooses the shortest path

                         Ideal     Crossbar   2D-Mesh   Ring
  Link latency [cycles]  -         5          1         1
  Switch delay [cycles]  1..128    2          1         1

Fixed delay interconnect means constant latency and infinite bandwidth
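The two routing rules on this slide can be sketched in a few lines. The code below is an illustration only: it assumes ring nodes are numbered 0..31 in ring order and mesh nodes are addressed as (x, y) with x in 0..7 and y in 0..3, and the function names are hypothetical rather than taken from the simulator.

```python
from itertools import product

ROWS, COLS = 4, 8          # mesh shape from the slide (4x8)
N = ROWS * COLS            # 32 nodes


def xy_route(src, dst):
    """Return the (x, y) nodes visited under X-Y dimension-order routing."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                      # first correct the X coordinate
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then correct the Y coordinate
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def ring_hops(src, dst, n=N):
    """Hops on a bidirectional ring when the shorter direction is taken."""
    clockwise = (dst - src) % n
    return min(clockwise, n - clockwise)


def average_hops():
    """Average hop count over all ordered pairs of distinct nodes: (mesh, ring)."""
    nodes = list(product(range(COLS), range(ROWS)))
    mesh_total = sum(len(xy_route(a, b)) - 1 for a in nodes for b in nodes if a != b)
    ring_total = sum(ring_hops(a, b) for a in range(N) for b in range(N) if a != b)
    pairs = N * (N - 1)
    return mesh_total / pairs, ring_total / pairs
```

For 32 nodes, `average_hops()` gives 4.0 hops for the mesh and about 8.26 for the ring, which matches the earlier remark that the ring's average distance between nodes is higher.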

Simulation model
Interconnects (figure: crossbar and 2D-mesh, nodes numbered 0 to 7)

Simulation model
Interconnects (figure: ring)
Ideal network:
- fixed delay
- free of contention
- unlimited amount of bandwidth

Simulation model
Tile-based design (figure: tile-based 2D-mesh, 4x4)

Simulation model
Network capacity
- We change the capacity of the network by modifying the flit size
- The flit is the smallest unit of data that can be flow-controlled over a link
- The flit size is an important parameter at two levels:
  - Architectural level: Assuming wormhole switching, different flit sizes lead to different contention levels
  - Design level: Large flit sizes lead to more expensive router designs that consume more area and power
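To make the effect of flit size concrete, the sketch below counts how many flits one message occupies. The 64 B payload corresponds to the cache line size given earlier; the 8 B header and the function name `flits_per_message` are assumptions for illustration, since the slides do not describe the message format.

```python
def flits_per_message(payload_bytes, flit_bytes, header_bytes=8):
    """Number of flits needed for one message (header size is hypothetical)."""
    total = payload_bytes + header_bytes
    return -(-total // flit_bytes)      # integer ceiling division


if __name__ == "__main__":
    for flit in (16, 8, 4):             # flit sizes in bytes
        data = flits_per_message(64, flit)      # message carrying a cache line
        control = flits_per_message(0, flit)    # header-only control message
        print(f"{flit:>2} B flits: data = {data} flits, control = {control} flits")
```

Under these assumptions a cache-line message takes 5, 9, or 18 flits for 16 B, 8 B, and 4 B flits respectively, so narrower flits hold each link longer per message.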

Performance analysis
- Ideal network: normalized execution time (cycles) over the delay spectrum for each benchmark (figure)
- For most applications, the system is very sensitive to network latency, e.g. a 41% increase for FFT, 171% for Raytrace, and 32% for Radix (8-cycle vs. 1-cycle delay)

Performance analysis
- Ideal network: normalized number of L1 misses over the delay spectrum for each benchmark (figure)

Performance analysis
- Ideal network: normalized number of messages over the delay spectrum for each benchmark (figure)

Performance analysis
- The 2D-Mesh achieves the best performance. Its average savings for narrow flits:
  - 19% when compared with the ring
  - 26% when compared with the crossbar
- With wide flits, the crossbar performs better than the ring in FMM, LU, FFT and Barnes, and similarly in the other benchmarks.
- As we shrink the flit size, the behavior changes and the crossbar becomes worse.
- The ring with wide flits achieves performance similar to the 2D-Mesh with narrow flits.
- Narrow flits tend to increase execution time regardless of the topology; however, the 2D-Mesh is less affected.
- A good trade-off for this CMP configuration would be a 2D-Mesh with moderate flit sizes (for example, 8 B).

Performance analysis
Comparison between 2D-mesh, ring and crossbar (figure panels: FMM, LU, FFT, LU, Radix, Raytrace)

Performance analysis
Comparison between 2D-mesh, ring and crossbar (figure panels: FTT, Barnes, FFT, LU, Radiosity)

L1 miss types in Radiosity
              16         8            4
  User        537,136    541,517      539,187
  Supervisor  198,764    480,737      201,876
  Total       735,901    1,022,255    741,063

Performance analysis
L1 miss rates (%): low network load

  Benchmark  Flit size  mesh  ring  xbar
  Radix      16B        0.35  0.33  0.33
             8B         0.35  0.37  0.36
             4B         0.38  0.30  0.29
  Radiosity  16B        0.09  0.09  0.09
             8B         0.07  0.09  0.09
             4B         0.09  0.09  0.09
  FFT        16B        0.36  0.29  0.32
             8B         0.36  0.28  0.29
             4B         0.31  0.26  0.22
  Barnes     16B        0.15  0.13  0.14
             8B         0.15  0.13  0.13
             4B         0.14  0.13  0.13
  Raytrace   16B        0.84  0.52  0.38
             8B         0.82  0.53  0.32
             4B         0.68  0.38  0.20

Congestion is not an issue (in this CMP configuration)

Conclusions
- Developed and interfaced a detailed on-chip network simulator to GEMS/SIMICS
- Analyzed the impact of topology and flit sizes on real applications' execution time
- Results:
  - Applications are very sensitive to network latency
  - Application + system behavior may change because of the network (unpredicted behavior captured by our simulation tool)
  - 2D-Meshes always outperform rings and crossbars
  - For this CMP configuration, a 2D-Mesh with moderate flit sizes is the best option

Future work
- The tool will enable us to evaluate:
  - Other cache coherence protocols (token and hammer) with strong requirements for collective communication
  - The impact of multicast traffic on applications' execution time
  - The impact of memory controllers on applications' execution time
  - Commercial workloads

Thank you!
Jesús Camacho Villanueva
e-mail: jecavil@gap.upv.es
December 12, 2009
