Analytical Performance Modeling of Hierarchical Interconnect Fabrics

Analytical Performance Modeling of Hierarchical Interconnect Fabrics
Nikita Nikitin, Javier de San Pedro, Josep Carmona and Jordi Cortadella
Universitat Politècnica de Catalunya
Supported by Intel Corporation
International Symposium on Networks-on-Chip (NOCS) 2012, Copenhagen, Denmark

Outline
• Introduction
  – Hierarchical Chip Multiprocessors (CMPs)
  – Performance modeling for CMPs
  – The cyclic dependency between latency and traffic
• Analytical performance modeling
  – Modeling traffic
  – Modeling latency
  – Methods to resolve the dependency
• Results and conclusions
NOCS'12 Universitat Politècnica de Catalunya

The trends in CMP design
• Hundreds of computing units per chip
  – Smaller, simpler, more power-efficient cores
• Advanced memory management
  – Larger on-chip cache
  – Increasing interconnect (IC) bandwidth
• Tiled architecture
[Figure: tiled CMP with routers (R), a core (C) with L1 and L2 caches, and two memory controllers]

Hierarchical interconnects
• Exploit locality of memory references*
[Figure: tiled CMP with hierarchical interconnect; routers (R) and memory controllers at the global level, and a local IC (bus / ring) connecting cores with L1 (C+L1), L2, L3, directory (Dir) and network interface (NI)]
* "Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs", R. Das et al., HPCA, 2009

Design of CMP architecture
• Goal: efficient use of chip resources
  – Maximize performance
  – Fit area/power/thermal budget
• Multidimensional exploration space (#cores / cache size / memory hierarchy / IC topologies / …)
• Means: automated design space exploration
  – Analytical performance models are essential
[Figure: tiled CMP with routers (R), memory controllers (MC) and local interconnects (IC); one cluster expanded to show cores (C), L3 and directory (D)]

Contention modeling
• Contention impacts CMP performance
• Crucial for evaluating hierarchical interconnects
  – Is the required bandwidth sustainable? Number of wires? Router architecture? Local IC topology?
[Figure: tiled CMP with routers (R), memory controllers and local interconnects (IC)]

Motivational example
48 cores, 16 cache modules
(a) 8x8 mesh
(b) 4x4 mesh with bus clusters
(c) 2x2 mesh with bus clusters
[Figure: bar chart of throughput (IPC, 0 to 10) for configurations (a), (b), (c); series: no contention, with contention; legend: core, cache, IC]
Estimation w/o contention is very inaccurate!

Analytical modeling of CMP performance
• Analytical models for ICs:
  – Latency L as a function of traffic λ
  – λ defined by the workload
Emphasis: λ depends on L!
• This work: resolve the cyclic dependency of traffic and latency
  – Formulate λ as a function of L
  – Add existing model for L(λ)
  – Resolve the system efficiently
[Figure: cores 1 … N, each sending traffic λ_i to the memory subsystem and observing latency L_i; the cycle between L and λ determines IPC (throughput)]

Modeling memory traffic
Parameters of a core executing some workload:
1. $CPI_0$ - ideal Cycles Per Instruction
2. $MPI$ - # Memory references Per Instruction
Real performance of in-order core:
$CPI = CPI_0 + MPI \cdot L$
where $MPI \cdot L$ is the memory access penalty and $L$ is the average latency of a memory access
Traffic to memory (probability of a memory reference per cycle):
$\lambda = MPI / CPI = MPI / (CPI_0 + MPI \cdot L)$
[Figure: core sending traffic λ to the memory subsystem and observing latency L]
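
A minimal sketch of the traffic model, assuming the usual in-order formulation in which the core stalls for the full latency of every memory reference: CPI = CPI_0 + MPI · L and λ = MPI / CPI. The function names are illustrative only; the numeric values are the case-study parameters (IPC_0 = 2.0, MPI = 0.5).

```python
def real_cpi(cpi0: float, mpi: float, latency: float) -> float:
    """Cycles per instruction of an in-order core that stalls on each memory reference."""
    return cpi0 + mpi * latency


def traffic_rate(cpi0: float, mpi: float, latency: float) -> float:
    """Probability that the core issues a memory reference in a given cycle (MPI / CPI)."""
    return mpi / real_cpi(cpi0, mpi, latency)


if __name__ == "__main__":
    # Assumed inputs: CPI_0 = 1 / IPC_0 with IPC_0 = 2.0, and MPI = 0.5.
    cpi0, mpi = 1 / 2.0, 0.5
    for latency in (1, 10, 100):
        print(latency, round(traffic_rate(cpi0, mpi, latency), 4))
```

The sketch makes the dependency visible: a higher latency L lowers the injected traffic λ, which is the feedback the rest of the model has to resolve.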

Modeling average memory latency
• Average latency of memory requests for a core, computed from latencies and probabilities
Latencies are calculated using
  - Cache latencies
  - Interconnect topology
  - Routing algorithm (XY)
Probabilities are calculated using
  - Miss ratio dependency on cache size
[Figure: two plots of miss ratio vs. cache size (Mb) for an application; annotations: 15% miss in 64K L1, 5% miss in 1M L2]
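
A rough sketch of how latencies and probabilities can combine into an average, under assumptions that are not from the slide: miss ratios are local (per level), every request pays the latency of each level it reaches, and interconnect hops are ignored. The numbers reuse values quoted in the deck (15% miss in a 64K L1, 5% miss in a 1M L2, cache latencies of 2 and 6 cycles, MC latency of 100 cycles).

```python
def average_latency(levels, memory_latency):
    """Probability-weighted latency of a memory request.

    levels: list of (access_latency_cycles, local_miss_ratio), ordered from L1 outward.
    memory_latency: latency paid by requests that miss in every cache level.
    """
    total = 0.0
    reach_probability = 1.0  # probability that a request reaches the current level
    for access_latency, miss_ratio in levels:
        total += reach_probability * access_latency
        reach_probability *= miss_ratio
    return total + reach_probability * memory_latency


if __name__ == "__main__":
    print(average_latency([(2, 0.15), (6, 0.05)], memory_latency=100))
```

In the full model each term would also include the interconnect latency between the core and the cache or memory controller being accessed.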

Modeling contention latency
"An Analytical Approach for Network-on-Chip Performance Analysis", Ogras et al., TCAD, 2010 (Best Paper Award)
[Figure: mesh NoC with routers (R), clusters (CL) and memory controllers (MC); bus-based cluster with cores (C), network interface (NI), L3 and directory (D)]
Delays in queues are defined by extending the M/G/1 queuing model.
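
For orientation, the plain M/G/1 mean waiting time (the Pollaczek-Khinchine formula) can be sketched as below. This is the textbook baseline only, not the extended router model from the cited paper; the function and parameter names are illustrative.

```python
def mg1_waiting_time(arrival_rate, mean_service, second_moment_service):
    """Mean time a request waits in an M/G/1 queue before service starts.

    W = lambda * E[S^2] / (2 * (1 - rho)), with utilization rho = lambda * E[S].
    """
    rho = arrival_rate * mean_service
    if not 0.0 <= rho < 1.0:
        raise ValueError("queue is unstable: utilization must satisfy 0 <= rho < 1")
    return arrival_rate * second_moment_service / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    # Deterministic service of 2 cycles: E[S] = 2, E[S^2] = 4.
    print(mg1_waiting_time(0.25, 2.0, 4.0))
```

The waiting time grows without bound as the utilization approaches 1, which is why latency estimates that ignore contention break down at high traffic.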

The cyclic dependency of L and λ
• Analytical model for latency: a system of non-linear equations
• Solve using numerical methods
• General methods are very slow
  – 10x10 mesh (10K vars./eqns.)
  – MATLAB timeout after a few hours
• Proposed methods:
  – Fixed-point iteration
  – Bisection search for λ
• Both work with any "black-box" model for L(λ)!

Fixed-point iteration
[Figure: L, average latency (cycles, 0 to 50) vs. λ, average traffic rate (flits/cycle, 0 to 0.2); curve L(λ) is the characteristic of the IC, curve λ(L) is the characteristic of the cores/workload; the hop-count latency is marked]
+ Fast (10x10 mesh in several ms)
+ Converges to the exact solution
– May not converge for high λ
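
A toy, single-variable sketch of the fixed-point idea: start from the hop-count latency, alternate between λ(L) and L(λ), and stop when L no longer changes. The two model functions are stand-ins chosen for illustration (an in-order core with CPI_0 = 0.5 and MPI = 0.5, and one M/G/1-style queue with a deterministic 4-cycle service time behind a 10-cycle hop-count latency); they are assumptions, not the models used in the paper.

```python
import math


def toy_traffic(latency, cpi0=0.5, mpi=0.5):
    """lambda(L): memory references per cycle of an in-order core (assumed model)."""
    return mpi / (cpi0 + mpi * latency)


def toy_latency(lam, hop=10.0, service=4.0):
    """L(lambda): hop-count latency plus an M/G/1 wait with deterministic service (assumed)."""
    rho = lam * service
    if rho >= 1.0:
        return math.inf  # saturated queue
    return hop + lam * service ** 2 / (2.0 * (1.0 - rho))


def solve_fixed_point(traffic, latency, l_start, tol=1e-9, max_iter=1000):
    """Iterate L <- latency(traffic(L)) until successive values differ by less than tol."""
    l_cur = l_start
    for _ in range(max_iter):
        lam = traffic(l_cur)
        l_next = latency(lam)
        if abs(l_next - l_cur) < tol:
            return lam, l_next
        l_cur = l_next
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    print(solve_fixed_point(toy_traffic, toy_latency, l_start=10.0))
```

Because the solver only calls `traffic` and `latency`, any black-box latency model can be plugged in; the `RuntimeError` path corresponds to the non-convergence case noted for high λ.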

Bisection search for λ
[Figure: L, average latency (cycles, 0 to 50) vs. λ, average traffic rate (flits/cycle, 0 to 0.2); curve L(λ) is the characteristic of the IC, curve λ(L) is the characteristic of the cores/workload; the search interval runs from λ = 0 to λ(L hop-count)]
– Fast, like fixed-point iteration
– Always converges to an approximate solution (good for homogeneous clusters)
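
A scalar sketch of the bisection idea, searching λ in the interval from 0 to λ(L hop-count) for the point where the traffic implied by L(λ) equals λ itself. The model functions are illustrative assumptions (in-order core with CPI_0 = 0.5, MPI = 0.5; one M/G/1-style queue with a deterministic 12-cycle service time and a 10-cycle hop-count latency), picked so that the queue saturates inside the search interval.

```python
import math


def toy_traffic(latency, cpi0=0.5, mpi=0.5):
    """lambda(L): memory references per cycle of an in-order core (assumed model)."""
    return mpi / (cpi0 + mpi * latency)


def toy_latency(lam, hop=10.0, service=12.0):
    """L(lambda): hop-count latency plus an M/G/1 wait with deterministic service (assumed)."""
    rho = lam * service
    if rho >= 1.0:
        return math.inf  # saturated queue
    return hop + lam * service ** 2 / (2.0 * (1.0 - rho))


def solve_bisection(traffic, latency, l_hop, tol=1e-9):
    """Find lambda in [0, traffic(l_hop)] such that traffic(latency(lambda)) == lambda."""
    lo, hi = 0.0, traffic(l_hop)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if traffic(latency(mid)) > mid:
            lo = mid  # cores could inject more than mid: solution lies above
        else:
            hi = mid  # latency at mid throttles the cores below mid: solution lies below
    lam = 0.5 * (lo + hi)
    return lam, latency(lam)


if __name__ == "__main__":
    print(solve_bisection(toy_traffic, toy_latency, l_hop=10.0))
```

The interval always brackets a solution because traffic at L(0) is positive while traffic at the upper bound cannot exceed the upper bound, so the search terminates even where plain iteration would oscillate.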

Performance of analytical methods
                                                Runtime (sec)
Test  Mesh     Cont. lat.  Num. of var./eqn.    MATLAB         Fixed-Point  Bisection
T1    2 x 2    5%          236                  0.023          0.001        0.001
T2    4 x 4    13%         1224                 1.412          0.001        0.002
T3    6 x 6    8%          3108                 30.831         0.002        0.003
T4    8 x 8    12%         6128                 408.539        0.006        0.010
T5    10 x 10  23%         10260                Timeout (1hr)  0.010        0.012
T6    10 x 10  46%         10260                Timeout (1hr)  0.022        0.015
T7    10 x 10  55%         10260                Timeout (1hr)  NA           0.016

Case study: performance exploration
1062 configurations explored
Parameter        Value
Chip area        350 mm²
Core area        1.25 mm²
Core IPC_0       2.0
MPI              0.5
L1 size          64, 128 Kb
L2 size          64 Kb to 3 Mb
Memory density   1 mm² / Mb
Mesh dimensions  2x2 to 16x16
MC latency       100 cycles
[Figure: miss ratio (0 to 0.25) vs. cache size (Mb, 0 to 10)]
Cache size        64K    128K   256K  512K  1M   2M   4M   8M
Area* (mm²)       0.063  0.125  0.25  0.5   1.0  2.0  4.0  8.0
Latency (cycles)  2      3      4     5     6    7    8    9
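
A hypothetical area check of the kind a design space exploration could use to discard configurations that exceed the chip budget. The accounting is an assumption made for illustration: core area excludes the L1, cache area follows the 1 mm² / Mb density (so 64K takes about 0.063 mm²), and router and wire area are ignored.

```python
CHIP_AREA_MM2 = 350.0            # chip area from the parameter table
CORE_AREA_MM2 = 1.25             # core area from the parameter table
MEMORY_DENSITY_MM2_PER_MB = 1.0  # memory density from the parameter table


def config_area(num_cores, l1_kb, l2_total_mb):
    """Estimated area in mm^2 of cores, private L1s and the total L2 capacity (assumed accounting)."""
    l1_mb = l1_kb / 1024.0
    per_core = CORE_AREA_MM2 + l1_mb * MEMORY_DENSITY_MM2_PER_MB
    return num_cores * per_core + l2_total_mb * MEMORY_DENSITY_MM2_PER_MB


def fits_budget(num_cores, l1_kb, l2_total_mb):
    """True if the configuration fits within the chip area budget."""
    return config_area(num_cores, l1_kb, l2_total_mb) <= CHIP_AREA_MM2


if __name__ == "__main__":
    print(config_area(48, 64, 16), fits_budget(48, 64, 16))
```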

Simulation environment
• Verify model by simulation
• Cycle-accurate NoC simulator
  – On top of BookSim 2.0
• Extensions
  – Hierarchical networks
  – Bus topologies
  – Probabilistic state-machines for cores and memories
[Figure: network simulation with a global (mesh) level and a local (bus, ring, …) level; node types: core, L3 cache, memory node, memory controller]

Faithfulness of the model
[Figure: throughput (IPC, 0 to 35) from modeling and from simulation; configurations sorted in descending order of throughput]
• Average difference in throughput is about 10%
• Corresponds to the error of the latency model

Best-throughput ordering
[Figure: best configurations by analysis that include N vs. number of best configurations by simulation (N), plotted for N up to 1000 and zoomed to N up to 60; legend entries: Static latency, No contention, With contention, Full latency, Ideal (Simulation); marked points: (1; 2), (4; 6), (1; 33), (4; 44), (50; 64)]
Simulation time: 5.5 hours
Modeling time: 16.8 sec (>1000x faster)
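
One possible reading of the plotted metric, sketched under that assumption: for the N best configurations according to simulation, find how many of the top configurations according to the analytical model must be kept so that all N are included. The function name and the sample scores are hypothetical.

```python
def analysis_configs_needed(sim_throughput, model_throughput, n):
    """Smallest K such that the top-K configurations by model contain the top-N by simulation.

    Both arguments map a configuration name to its throughput (higher is better).
    """
    sim_rank = sorted(sim_throughput, key=sim_throughput.get, reverse=True)
    model_rank = sorted(model_throughput, key=model_throughput.get, reverse=True)
    position = {cfg: i + 1 for i, cfg in enumerate(model_rank)}  # 1-based model rank
    return max(position[cfg] for cfg in sim_rank[:n])


if __name__ == "__main__":
    # Hypothetical scores for four configurations.
    sim = {"a": 10.0, "b": 9.0, "c": 8.0, "d": 7.0}
    model = {"a": 9.5, "b": 7.0, "c": 9.8, "d": 6.0}
    print([analysis_configs_needed(sim, model, n) for n in (1, 2, 4)])
```

Under this reading, a perfect model gives K = N for every N, and a larger K means more candidates must be passed on to simulation.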

Conclusions
• Analytical modeling of contention in CMPs is essential
• There exists a cyclic dependency between latency and traffic of memory requests
• This dependency can be efficiently resolved using numerical methods (fixed-point, bisection)
• Precision of the model is significantly improved
• Current work: out-of-order cores, heterogeneity

Backup

Fixed-point convergence issues
Sufficient for convergence: [equation missing in source]
[Figure: L, average latency (cycles, 0 to 50) vs. λ, average traffic rate (flits/cycle, 0 to 0.2); curves λ(L) and L(λ); the hop-count latency is marked]
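
The slide's own condition is not reproduced here. As a general reference point, a standard sufficient condition for the iteration L ← F(L), with F(L) = L(λ(L)), is that F is a contraction near the solution, i.e. |F'(L)| < 1. The sketch below estimates that factor numerically for two assumed toy setups (in-order core with CPI_0 = 0.5, MPI = 0.5; a single M/G/1-style queue with deterministic service and a 10-cycle hop-count latency); all names and parameters are illustrative.

```python
import math


def toy_traffic(latency, cpi0=0.5, mpi=0.5):
    """lambda(L): memory references per cycle of an in-order core (assumed model)."""
    return mpi / (cpi0 + mpi * latency)


def toy_latency(lam, hop=10.0, service=4.0):
    """L(lambda): hop-count latency plus an M/G/1 wait with deterministic service (assumed)."""
    rho = lam * service
    if rho >= 1.0:
        return math.inf  # saturated queue
    return hop + lam * service ** 2 / (2.0 * (1.0 - rho))


def contraction_factor(traffic, latency, l_point, h=1e-6):
    """Central-difference estimate of |d/dL latency(traffic(L))| at l_point."""
    def composed(l):
        return latency(traffic(l))
    return abs(composed(l_point + h) - composed(l_point - h)) / (2.0 * h)


if __name__ == "__main__":
    # Light load: 4-cycle service, solution at L = 11.
    print(contraction_factor(toy_traffic, toy_latency, 11.0))
    # Heavy load: 12-cycle service, solution at L = 19.
    heavy = lambda lam: toy_latency(lam, service=12.0)
    print(contraction_factor(toy_traffic, heavy, 19.0))
```

A factor below 1 at the solution means plain iteration converges locally; a factor above 1, as in the heavy-load setup, is the situation in which a bracketing method such as bisection is the safer choice.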
