Analytical Performance Modeling of Hierarchical Interconnect Fabrics



slide-1
SLIDE 1

Analytical Performance Modeling of Hierarchical Interconnect Fabrics

Nikita Nikitin, Javier de San Pedro, Josep Carmona and Jordi Cortadella

Universitat Politècnica de Catalunya

Supported by Intel Corporation

International Symposium on Networks-on-Chip (NOCS) 2012, Copenhagen, Denmark

slide-2
SLIDE 2

Outline

  • Introduction
    – Hierarchical Chip Multiprocessors (CMPs)
    – Performance modeling for CMPs
    – The cyclic dependency between latency and traffic
  • Analytical performance modeling
    – Modeling traffic
    – Modeling latency
    – Methods to resolve the dependency
  • Results and conclusions

NOCS'12 Universitat Politècnica de Catalunya 2

slide-3
SLIDE 3

The trends in CMP design

  • Hundreds of computing units per chip
    – Smaller, simpler, more power-efficient cores
  • Advanced memory management
    – Larger on-chip cache
    – Increasing interconnect (IC) bandwidth
  • Tiled architecture

[Figure: tiled CMP. A mesh of routers (R) with two memory controllers; each tile holds a core (C) with L1, an L2, and a router (R).]

slide-4
SLIDE 4

Hierarchical interconnects

[Figure: Tiled CMP with hierarchical interconnect. Cluster labels: C+L1, L2, L3, Dir, NI, R, local IC (bus / ring). Chip-level labels: routers (R), local ICs, two memory controllers.]

  • Exploit locality of memory references*

* “Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs”, R.Das et al., HPCA, 2009

slide-5
SLIDE 5

Design of CMP architecture

  • Goal: efficient use of chip resources
    – Maximize performance
    – Fit area/power/thermal budget
  • Multidimensional exploration space (#cores / cache size / memory hierarchy / IC topologies /…)
  • Means: automated design space exploration
    – Analytical performance models are essential

[Figure: example hierarchical CMP. Labels: C, L3, D, R, MC, IC.]

slide-6
SLIDE 6

Contention modeling

  • Contention impacts CMP performance
  • Crucial for evaluating hierarchical interconnects

– Is the required bandwidth sustainable?

[Figure: hierarchical CMP (routers, local ICs, two memory controllers) annotated with open design questions: # of wires? Router architecture? Local IC topology?]

slide-7
SLIDE 7

Motivational example

48 cores, 16 cache modules, arranged as:

(a) 8x8 mesh
(b) 4x4 mesh with bus clusters
(c) 2x2 mesh with bus clusters

[Figure: layouts (a), (b), (c) with a legend for core, cache, and IC, plus a bar chart of throughput (IPC) for each layout, estimated with no contention and with contention.]

Estimation w/o contention is very inaccurate!

slide-8
SLIDE 8

Analytical modeling of CMP performance

  • Analytical models for ICs:
    – Latency L as a function of traffic λ
    – λ defined by the workload

Emphasis: λ depends on L!

  • This work: resolve the cyclic dependency of traffic and latency
    – Formulate λ as a function of L
    – Add existing model for L(λ)
    – Resolve the system efficiently

[Figure: cores Core1 … Corei … CoreN connected to the memory subsystem; core i issues traffic λi and observes latency Li. L and λ determine IPC (throughput).]

slide-9
SLIDE 9

Outline

  • Introduction
    – Hierarchical Chip Multiprocessors (CMPs)
    – Performance modeling for CMPs
    – The cyclic dependency between latency and traffic
  • Analytical performance modeling
    – Modeling traffic
    – Modeling latency
    – Methods to resolve the dependency
  • Results and conclusions

slide-10
SLIDE 10

Modeling memory traffic

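Parameters of a core executing some workload:

1. $\mathrm{CPI}_0$: ideal Cycles Per Instruction
2. $\mathrm{MPI}$: number of Memory references Per Instruction

Real performance of an in-order core:

$$\mathrm{CPI} = \mathrm{CPI}_0 + \mathrm{MPI} \cdot L$$

where $L$ is the average latency of a memory access and $\mathrm{MPI} \cdot L$ is the memory access penalty.

Traffic to memory (probability of a memory reference per cycle):

$$\lambda = \frac{\mathrm{MPI}}{\mathrm{CPI}} = \frac{\mathrm{MPI}}{\mathrm{CPI}_0 + \mathrm{MPI} \cdot L}$$

[Figure: a core issues traffic λ to the memory subsystem and observes latency L.]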
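As a small illustration of how traffic falls when latency rises, the sketch below assumes the usual in-order stall model, CPI = CPI0 + MPI · L, and λ = MPI / CPI. The function names are hypothetical, and the parameter values (IPC0 = 2.0, so CPI0 = 0.5, and MPI = 0.5) are borrowed from the case-study slide purely as an example.

```python
def cpi_in_order(cpi0: float, mpi: float, latency: float) -> float:
    """Assumed in-order model: each memory reference stalls the core for `latency` cycles."""
    return cpi0 + mpi * latency


def traffic_rate(cpi0: float, mpi: float, latency: float) -> float:
    """Memory references per cycle = references per instruction / cycles per instruction."""
    return mpi / cpi_in_order(cpi0, mpi, latency)


if __name__ == "__main__":
    cpi0, mpi = 0.5, 0.5  # example values only (IPC0 = 2.0, MPI = 0.5)
    for latency in (0.0, 10.0, 40.0):
        lam = traffic_rate(cpi0, mpi, latency)
        ipc = 1.0 / cpi_in_order(cpi0, mpi, latency)
        print(f"L = {latency:5.1f} cycles  ->  lambda = {lam:.4f} refs/cycle, IPC = {ipc:.4f}")
```

The higher the latency, the fewer requests the core can issue per cycle, which is the workload-side half of the cyclic dependency.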

slide-11
SLIDE 11

Modeling average memory latency

  • Average latency of memory requests for a core:

[Figure: miss ratio vs. cache size (Mb).]

Latencies are calculated using

  • Cache latencies
  • Interconnect topology
  • Routing algorithm (XY)

Probabilities are calculated using

  • Miss ratio dependency on cache size

Example application: 15% miss in 64K L1, 5% miss in 1M L2

[Figure: miss ratio vs. cache size (Mb) for the application.]
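The averaging step can be sketched as a probability-weighted sum over the memory levels. This is a generic average-memory-access-time calculation, not necessarily the exact expression on the slide: it assumes the quoted miss ratios are local (per-level) ratios, takes the cache latencies from the case-study table (2 cycles for 64K, 6 cycles for 1M) and the 100-cycle MC latency, and ignores interconnect hops, which the full model adds to each level's latency. The function name is hypothetical.

```python
def average_memory_latency(levels, memory_latency):
    """Probability-weighted latency over a cache hierarchy.

    levels: list of (access_latency_cycles, local_miss_ratio), ordered from L1 outward.
    memory_latency: latency paid by requests that miss in every cache level.
    """
    total = 0.0
    reach = 1.0  # probability that a request reaches the current level
    for access_latency, miss_ratio in levels:
        total += reach * access_latency
        reach *= miss_ratio
    return total + reach * memory_latency


if __name__ == "__main__":
    # Assumed example: 64K L1 (2 cycles, 15% miss), 1M L2 (6 cycles, 5% miss), memory at 100 cycles.
    hierarchy = [(2.0, 0.15), (6.0, 0.05)]
    print(f"average latency = {average_memory_latency(hierarchy, 100.0):.2f} cycles")
```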

slide-12
SLIDE 12

Modeling contention latency

[Figure: mesh NoC of routers (R) with clusters (CL) and two memory controllers (MC), next to a bus-based cluster with cores (C), L3, directory (D), and network interface (NI).]

Delays in queues are defined by extending the M/G/1 queuing model: “An Analytical Approach for Network-on-Chip Performance Analysis”, Ogras et al., TCAD, 2010 (Best Paper Award)
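For orientation, the plain M/G/1 mean waiting time (the Pollaczek-Khinchine formula) is sketched below. This is the textbook starting point only; the extended router model of Ogras et al. is not reproduced here, and the function and parameter names are hypothetical.

```python
def mg1_waiting_time(arrival_rate: float, mean_service: float, service_scv: float) -> float:
    """Mean time spent waiting in an M/G/1 queue (Pollaczek-Khinchine).

    arrival_rate: Poisson arrival rate (e.g. flits/cycle).
    mean_service: mean service time (cycles).
    service_scv:  squared coefficient of variation of the service time
                  (0 for deterministic service, 1 for exponential service).
    """
    utilization = arrival_rate * mean_service
    if not 0.0 <= utilization < 1.0:
        raise ValueError("queue is unstable: utilization must be in [0, 1)")
    second_moment = (1.0 + service_scv) * mean_service ** 2  # E[S^2]
    return arrival_rate * second_moment / (2.0 * (1.0 - utilization))


if __name__ == "__main__":
    for lam in (0.1, 0.5, 0.9):
        print(f"lambda = {lam:.1f}: W = {mg1_waiting_time(lam, 1.0, 0.0):.3f} cycles")
```

The waiting time grows without bound as utilization approaches 1, which is why contention dominates latency at high traffic.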

slide-13
SLIDE 13

System of non-linear equations

  • Solve using numerical methods
  • General methods are very slow
    – 10x10 mesh (10K vars./eqns.)
    – MATLAB timeout after a few hours
  • Proposed methods:
    – Fixed-point iteration
    – Bisection search for λ

The cyclic dependency of L and λ

[Figure: the cyclic dependency of L and λ, with the analytical model for latency as one block. Any “black-box” model for L(λ)!]

slide-14
SLIDE 14

Fixed-point iteration

+ Fast (10x10 mesh in several ms)
+ Converges to the exact solution
– May not converge for high λ

[Figure: L, average latency (cycles), vs. λ, average traffic rate (flits/cycle). Curve L(λ) is the characteristic of the IC; curve λ(L) is the characteristic of the cores/workload. The hop-count latency is marked.]
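A minimal sketch of the iteration on a single scalar pair (L, λ), starting from the hop-count latency. Both models are toy stand-ins chosen for illustration: `traffic_model` assumes λ = MPI / (CPI0 + MPI · L), and `latency_model` is a hypothetical black-box with a queueing term that diverges near saturation. Neither is the model from the talk.

```python
import math


def traffic_model(latency, cpi0=0.5, mpi=0.5):
    """Assumed workload characteristic: lambda(L) = MPI / (CPI0 + MPI * L)."""
    return mpi / (cpi0 + mpi * latency)


def latency_model(lam, hop_latency=10.0, service=4.0):
    """Toy black-box L(lambda): hop-count latency plus a delay that diverges at saturation."""
    utilization = lam * service
    if utilization >= 1.0:
        return math.inf
    return hop_latency + service * utilization / (1.0 - utilization)


def fixed_point(latency_of, traffic_of, start_latency, tol=1e-9, max_iter=1000):
    """Iterate L <- L(lambda(L)) until two successive latencies differ by less than tol."""
    latency = start_latency
    for iteration in range(1, max_iter + 1):
        new_latency = latency_of(traffic_of(latency))
        if abs(new_latency - latency) < tol:
            return new_latency, iteration
        latency = new_latency
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    latency, steps = fixed_point(latency_model, traffic_model, start_latency=10.0)
    print(f"L = {latency:.4f} cycles, lambda = {traffic_model(latency):.5f}, {steps} iterations")
```

With a steeper latency curve (for example a larger `service`) the same loop can oscillate and end in the `RuntimeError`, matching the caveat that the method may not converge for high λ.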

slide-15
SLIDE 15

Bisection search for λ

– Fast, as fixed-point
– Always converges to an approximate solution (good for homogeneous clusters)

[Figure: L, average latency (cycles), vs. λ, average traffic rate (flits/cycle). Curve L(λ) is the characteristic of the IC; curve λ(L) is the characteristic of the cores/workload. The points λ=0 and λ(Lhop-count) are marked.]
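A sketch of the search, assuming it brackets the solution between λ = 0 and λ(L_hop-count), the two points marked on the plot. The models are the same kind of toy stand-ins (hypothetical names, assumed formulas), and everything is reduced to one scalar λ, so the sketch is exact up to the tolerance; the approximation mentioned on the slide presumably comes from describing a whole system by a single searched value.

```python
import math


def traffic_model(latency, cpi0=0.5, mpi=0.5):
    """Assumed workload characteristic: lambda(L) = MPI / (CPI0 + MPI * L)."""
    return mpi / (cpi0 + mpi * latency)


def latency_model(lam, hop_latency=10.0, service=4.0):
    """Toy black-box L(lambda): hop-count latency plus a delay that diverges at saturation."""
    utilization = lam * service
    if utilization >= 1.0:
        return math.inf
    return hop_latency + service * utilization / (1.0 - utilization)


def bisect_traffic(latency_of, traffic_of, hop_latency, tol=1e-9):
    """Find lambda with traffic_of(latency_of(lambda)) == lambda by interval halving.

    At lambda = 0 the cores offer more traffic than assumed; at lambda(L_hop) they
    offer no more than assumed, so the crossing lies inside [0, lambda(L_hop)].
    """
    low, high = 0.0, traffic_of(hop_latency)
    while high - low > tol:
        mid = 0.5 * (low + high)
        if traffic_of(latency_of(mid)) > mid:
            low = mid   # cores would generate more traffic: solution is to the right
        else:
            high = mid  # network too slow for this traffic: solution is to the left
    lam = 0.5 * (low + high)
    return lam, latency_of(lam)


if __name__ == "__main__":
    lam, latency = bisect_traffic(latency_model, traffic_model, hop_latency=10.0)
    print(f"lambda = {lam:.5f} flits/cycle, L = {latency:.4f} cycles")
```

Because the interval halves at every step, the loop always terminates, even when the latency model returns infinity for saturated traffic.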

slide-16
SLIDE 16

Outline

  • Introduction
    – Hierarchical Chip Multiprocessors (CMPs)
    – Performance modeling for CMPs
    – The cyclic dependency between latency and traffic
  • Analytical performance modeling
    – Modeling traffic
    – Modeling latency
    – Methods to resolve the dependency
  • Results and conclusions

slide-17
SLIDE 17

Performance of analytical methods

Runtime (sec) of MATLAB, fixed-point iteration, and bisection:

Test  Mesh     Cont. lat.  Num. of var./eqn.  MATLAB         Fixed-Point  Bisection
T1    2 x 2    5%          236                0.023          0.001        0.001
T2    4 x 4    13%         1224               1.412          0.001        0.002
T3    6 x 6    8%          3108               30.831         0.002        0.003
T4    8 x 8    12%         6128               408.539        0.006        0.010
T5    10 x 10  23%         10260              Timeout (1hr)  0.010        0.012
T6    10 x 10  46%         10260              Timeout (1hr)  0.022        0.015
T7    10 x 10  55%         10260              Timeout (1hr)  NA           0.016

slide-18
SLIDE 18

Case study: performance exploration

Parameter        Value
Chip area        350 mm²
Core area        1.25 mm²
Core IPC0        2.0
MPI              0.5
L1 size          64, 128 Kb
L2 size          64 Kb to 3 Mb
Memory density   1 mm² / Mb
Mesh dimensions  2x2 to 16x16
MC latency       100 cycles

Cache size        64K    128K   256K  512K  1M   2M   4M   8M
Area* (mm²)       0.063  0.125  0.25  0.5   1.0  2.0  4.0  8.0
Latency (cycles)  2      3      4     5     6    7    8    9

[Figure: miss ratio vs. cache size (Mb).]

1062 configurations explored

slide-19
SLIDE 19

Simulation environment

[Figure: network simulation setup. Labels: global network (mesh), local network (bus, ring, …), core, L3 cache node, memory controller, memory.]

  • Verify model by simulation
  • Cycle-accurate NoC simulator
    – On top of BookSim 2.0
  • Extensions
    – Hierarchical networks
    – Bus topologies
    – Probabilistic state-machines for cores and memories

slide-20
SLIDE 20

Faithfulness of the model

[Figure: throughput (IPC) of each configuration by modeling and by simulation, with configurations sorted in descending order of throughput.]

  • Average difference in throughput is about 10%
  • Corresponds to the error of the latency model


slide-21
SLIDE 21

Best-throughput ordering

Simulation time: 5.5 hours
Modeling time: 16.8 sec (>1000x faster)

[Figure: number of best configurations by analysis that include the N best configurations by simulation, plotted against N, in a full-range panel and a zoomed panel. Curves: static latency (no contention), full latency (with contention), ideal (simulation). Labeled points: (1; 33), (4; 44), (1; 2), (4; 6), (50; 64).]
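One reading of the plot's axes is: for a given N, how many of the model's top-ranked configurations must be kept so that all of the N best configurations by simulation are among them. The sketch below computes that number under this assumed interpretation; the function name and the throughput values are made up for illustration and are not data from the talk.

```python
def analysis_depth(sim_throughput, model_throughput, n):
    """Smallest K such that the model's top-K configurations contain the simulation's top-n.

    Both lists are indexed by configuration; higher throughput is better.
    """
    configs = range(len(sim_throughput))
    by_simulation = sorted(configs, key=lambda c: -sim_throughput[c])
    by_model = sorted(configs, key=lambda c: -model_throughput[c])
    model_rank = {config: rank for rank, config in enumerate(by_model, start=1)}
    return max(model_rank[config] for config in by_simulation[:n])


if __name__ == "__main__":
    # Hypothetical throughputs (IPC) of five configurations.
    simulated = [30.0, 25.0, 20.0, 15.0, 10.0]
    modeled = [24.0, 28.0, 21.0, 9.0, 16.0]
    for n in (1, 2, 4):
        print(f"N = {n}: keep the model's top {analysis_depth(simulated, modeled, n)}")
```

A perfect model gives K = N for every N, which corresponds to the ideal (simulation) line.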

slide-22
SLIDE 22

Conclusions

  • Analytical modeling of contention in CMPs is essential
  • There exists a cyclic dependency between latency and traffic of memory requests
  • This dependency can be efficiently resolved using numerical methods (fixed-point, bisection)
  • Precision of the model is significantly improved
  • Current work: out-of-order cores, heterogeneity

slide-23
SLIDE 23

Backup


slide-24
SLIDE 24

Fixed-point convergence issues

Sufficient for convergence of :

[Figure: L, average latency (cycles), vs. λ, average traffic rate (flits/cycle), with curves L(λ) and λ(L). The hop-count latency is marked.]
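A standard sufficient condition for an iteration x ← G(x) to converge near a fixed point is |G'(x)| < 1; whether this is the exact condition shown on the slide is an assumption. The sketch estimates that slope numerically for the composite map G(L) = L(λ(L)), using hypothetical toy models for both characteristics.

```python
def traffic_model(latency, cpi0=0.5, mpi=0.5):
    """Assumed workload characteristic: lambda(L) = MPI / (CPI0 + MPI * L)."""
    return mpi / (cpi0 + mpi * latency)


def latency_model(lam, hop_latency=10.0, service=4.0):
    """Toy black-box L(lambda), valid below saturation (lam * service < 1)."""
    utilization = lam * service
    return hop_latency + service * utilization / (1.0 - utilization)


def composite_slope(latency_of, traffic_of, latency, eps=1e-6):
    """Central-difference estimate of d/dL of latency_of(traffic_of(L))."""
    upper = latency_of(traffic_of(latency + eps))
    lower = latency_of(traffic_of(latency - eps))
    return (upper - lower) / (2.0 * eps)


if __name__ == "__main__":
    light = composite_slope(latency_model, traffic_model, 12.0)
    heavy = composite_slope(lambda x: latency_model(x, service=10.0), traffic_model, 10.0)
    print(f"light load: |G'| = {abs(light):.3f} -> contraction: {abs(light) < 1.0}")
    print(f"heavy load: |G'| = {abs(heavy):.3f} -> contraction: {abs(heavy) < 1.0}")
```

In the heavy-load case the latency curve is so steep near the operating point that the slope magnitude exceeds 1, and plain fixed-point iteration is no longer guaranteed to settle.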

slide-25
SLIDE 25

Bisection search

[Figure: bisection search diagram with two blocks, the latency model and the traffic model.]

slide-26
SLIDE 26

Average latency calculation

  • Average Memory Access Time (AMAT):


slide-27
SLIDE 27

Best configuration


  • 6x6 mesh, 36 clusters, 5 cores/cluster
  • total 180 cores with 64K L1, 256K L2
  • 68Mb total shared L3

Throughput = 30.81 IPC

[Figure: 6x6 mesh of 36 routers (R) with four memory controllers. Each cluster has five cores (C+L1), each with an L2, plus an L3, a directory (Dir), and a network interface (NI) on a bus, attached to a router (R).]

slide-28
SLIDE 28

Runtime: Modeling vs Simulation

[Figure: runtime (seconds, 0.001 to 1000) vs. number of components (cores + memories) in the CMP (100 to 1100), for the analytical model and for simulation.]

Modeling a CMP with ~700 components in 1 second