SLIDE 1

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi-Processors
NoCArc 09

Jesús Camacho Villanueva, José Flich, José Duato
Universidad Politécnica de Valencia

Hans Eberle, Nils Gura, Wladek Olesinski
Sun Microsystems

December 12, 2009

SLIDE 2

Second International Workshop on Network on Chip Architectures

Index

  • Introduction
  • Network simulator
  • Simulation model
  • Performance analysis
  • Conclusions
  • Future work


SLIDE 4

Introduction

  • Networks-on-chip (NoCs) become a critical component of a chip multiprocessor (CMP) as the number of cores increases
  • CMPs with 32 cores are already on the drawing table
  • 48 cores recently announced by Intel
  • Need for a full-system simulator with an accurate network simulation model
  • Not considering the network component and full-system simulation may lead to incorrect conclusions

SLIDE 5

Introduction

Topology considerations for NoCs (in CMPs)

  • Crossbars simplify the design, but they have limited scalability [Micro07]
  • 2D-Meshes scale better than crossbars and simplify the design of a tiled organization
  • Rings have a simpler design than 2D-Meshes, but the average distance between nodes is higher
  • The network capacity is also a critical parameter in the design of NoCs

[Micro07] Hoskote Y., Vangal S., Singh A., Borkar N., Borkar S.: ‘A 5-GHz mesh interconnect for a teraflops processor’, IEEE Micro Mag., 2007, 27, (5), pp. 51–61
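The claim that rings have a higher average distance than 2D-meshes can be checked by brute force. The sketch below is illustrative only: it assumes 32 nodes (the core count mentioned in the introduction), a 4x8 mesh with row-major node numbering, and hop count as the distance metric. The function names are hypothetical and are not taken from the simulator.

```python
from itertools import product

def ring_distance(a, b, n):
    """Hops between nodes a and b on a bidirectional ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)

def mesh_distance(a, b, cols):
    """Manhattan hops between a and b on a 2D mesh (row-major numbering, assumed)."""
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    return abs(ax - bx) + abs(ay - by)

def average_distance(n, dist):
    """Mean hop count over all ordered pairs of distinct nodes."""
    total = sum(dist(a, b) for a, b in product(range(n), repeat=2) if a != b)
    return total / (n * (n - 1))

N = 32  # assumed node count
ring_avg = average_distance(N, lambda a, b: ring_distance(a, b, N))
mesh_avg = average_distance(N, lambda a, b: mesh_distance(a, b, cols=8))
print(f"ring: {ring_avg:.2f} hops, 4x8 mesh: {mesh_avg:.2f} hops")
```

Under these assumptions the ring averages roughly twice as many hops as the mesh, which is consistent with the bullet above.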

SLIDE 6

Introduction

Goals

  • To develop an accurate simulation tool for the on-chip network that takes into account the target machine: coherence protocol, OS, and application
  • At the network level, the simulation tool needs to allow:
      • Collective communication
      • Different topologies
      • Different architectures:
          • Switch architecture
          • Switching mechanisms (wormhole (WH), virtual cut-through (VCT))
          • Flit size, flow control…


SLIDE 8

Network simulator

  • SIMICS + GEMS + GAPNET
      • SIMICS: full-system simulator
      • GEMS: a set of modules for SIMICS that enables detailed simulation of chip multiprocessors (CMPs)
          • Provides a detailed memory system simulator
          • Implements the cache coherence protocol
      • GAPNET: event-driven network simulator providing collective communication

SLIDE 9

Network simulator

[Figure: GapNet and network interface]

SLIDE 10

Network simulator

GapNet simulator events

[Figure: event flow between the GEMS interface and GAPNET. Labels: Src, Wakeup, Send, Route, Cross, Transmit, Receive, Dst, Enqueue]
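An event-driven simulator of this kind can be sketched with a time-ordered priority queue. The example below is not GAPNET code: it borrows only the event names listed on this slide, and the strict stage order, the one-cycle delay per stage, and the single-hop path are assumptions made for illustration.

```python
import heapq

# Event names come from the slide; their order and the delays are assumptions.
STAGES = ["Wakeup", "Send", "Route", "Cross", "Transmit", "Receive", "Enqueue"]

def simulate(messages, stage_delay=1):
    """Run each message through STAGES using a time-ordered event queue.

    `messages` maps a message id to its injection cycle. Returns a log of
    (cycle, message id, stage) tuples in the order the events fired.
    """
    queue = [(cycle, msg, 0) for msg, cycle in messages.items()]
    heapq.heapify(queue)
    log = []
    while queue:
        cycle, msg, stage = heapq.heappop(queue)  # earliest pending event
        log.append((cycle, msg, STAGES[stage]))
        if stage + 1 < len(STAGES):
            # Schedule the next stage of the same message.
            heapq.heappush(queue, (cycle + stage_delay, msg, stage + 1))
    return log

log = simulate({"m0": 0, "m1": 2})
for cycle, msg, stage in log:
    print(f"cycle {cycle:2d}: {msg} {stage}")
```

In a multi-hop network the Route, Cross and Transmit events would presumably repeat at every switch along the path; the sketch models a single hop to stay short.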


SLIDE 12

Simulation model

  • Sarek machine (Sun Fire server) with Solaris 10
  • 32 cores with SPARC CPUs, a private L1 cache per core, and an L2 cache shared among all the processors
  • The cache coherence protocol is a directory protocol with non-inclusive, blocking caches

                   L1 cache   L2 cache
  Size             128 KB     8 MB
  Associativity    8-way      16-way
  Line size        64 B       64 B
  Hit latency      3 cycles   6 cycles
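The cache parameters above imply a number of sets for each level, which can be derived from size, associativity and line size. This is a minimal sketch: the helper name `cache_sets` is hypothetical, and the L2 is treated as a single unbanked cache, which the slides do not state.

```python
KB = 1024
MB = 1024 * KB

def cache_sets(size_bytes, associativity, line_bytes):
    """Number of sets = (number of lines) / (ways per set)."""
    lines = size_bytes // line_bytes
    return lines // associativity

l1_sets = cache_sets(128 * KB, associativity=8, line_bytes=64)
l2_sets = cache_sets(8 * MB, associativity=16, line_bytes=64)
print(f"L1: {l1_sets} sets, L2: {l2_sets} sets")
```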

SLIDE 13

Simulation model

Interconnects

  • Four interconnect types: fixed delay interconnect (ideal), crossbar, 2D-mesh and bidirectional ring
  • The 2D-mesh is organized as a 4x8 array and uses X-Y dimension-order routing. The bidirectional ring chooses the shortest path

                           Ideal    Crossbar   2D-Mesh   Ring
  Link latency [cycles]    -        5          1         1
  Switch delay [cycles]    1..128   2          1         1

Fixed delay interconnect means constant latency and infinite bandwidth
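The two routing policies named on this slide can be sketched as path generators. This is an illustration under stated assumptions, not the simulator's code: nodes are numbered row-major across 8 columns, and ring ties (destinations exactly half-way around) are broken in the increasing direction. Both choices and both function names are hypothetical.

```python
def xy_route(src, dst, cols=8):
    """Nodes visited under X-Y dimension-order routing: finish X, then Y."""
    x, y = src % cols, src // cols
    dx, dy = dst % cols, dst // cols
    path = [src]
    while x != dx:  # travel along the X dimension first
        x += 1 if dx > x else -1
        path.append(y * cols + x)
    while y != dy:  # then along the Y dimension
        y += 1 if dy > y else -1
        path.append(y * cols + x)
    return path

def ring_route(src, dst, n=32):
    """Nodes visited on a bidirectional ring, taking the shorter direction."""
    forward = (dst - src) % n
    backward = n - forward
    step = 1 if forward <= backward else -1  # tie goes forward (assumed)
    hops = min(forward, backward)
    return [(src + step * i) % n for i in range(hops + 1)]

print(xy_route(0, 31))    # corner to corner on the 4x8 mesh
print(ring_route(30, 2))  # wraps around through node 0
```

Because X-Y routing always exhausts one dimension before turning, every source and destination pair has exactly one path, which keeps the router simple.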

SLIDE 14

Simulation model

Interconnects

[Figure: crossbar and 2D-mesh layouts]

SLIDE 15

Simulation model

Interconnects

[Figure: ring layout]

Ideal network:

  • fixed delay
  • free of contention
  • unlimited amount of bandwidth

SLIDE 16

Simulation model

Tile-based design

[Figure: tile-based 2D-mesh 4x4]

SLIDE 17

Simulation model

Network capacity

  • We change the capacity of the network by modifying the flit size
  • The flit is the minimum amount of information that can be flow-controlled through a link
  • The flit size is an important parameter at two levels:
      • Architectural level: assuming wormhole switching, different flit sizes lead to different contention levels
      • Design level: large flit sizes lead to more expensive router designs that consume more area and power
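To see why the flit size changes network capacity, consider how many flits one message occupies and how long it takes to drain through a wormhole pipeline with no contention. The sketch below rests on assumptions that are not in the slides: a 72-byte data message (a 64 B cache line plus a hypothetical 8 B header), a 4-hop path, one flit per cycle per link, and a one-cycle switch delay and link latency. Real latencies also depend on contention, which this zero-load estimate ignores.

```python
import math

def flits_per_message(message_bytes, flit_bytes):
    """A message is split into ceil(size / flit size) flits."""
    return math.ceil(message_bytes / flit_bytes)

def zero_load_latency(hops, message_bytes, flit_bytes,
                      switch_delay=1, link_latency=1):
    """Cycles until the last flit arrives, with no contention.

    The head flit pays the per-hop pipeline delay; the remaining flits
    follow one per cycle behind it (serialization).
    """
    head = hops * (switch_delay + link_latency)
    return head + flits_per_message(message_bytes, flit_bytes) - 1

MESSAGE_BYTES = 72  # assumed: 64 B line + 8 B header
results = {
    flit: (flits_per_message(MESSAGE_BYTES, flit),
           zero_load_latency(4, MESSAGE_BYTES, flit))
    for flit in (16, 8, 4)
}
for flit, (count, cycles) in results.items():
    print(f"{flit:2d} B flits: {count:2d} flits, {cycles} cycles")
```

Halving the flit size roughly doubles the serialization term, so narrow flits hold links and buffers for longer per message.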


SLIDE 19

Performance analysis

  • Ideal network: normalized execution time (cycles) over the delay spectrum for each benchmark
  • The system (for most applications) is very sensitive to network latency, e.g. a 41% increase for FFT, 171% for Raytrace, and 32% for Radix (8-cycle vs 1-cycle delay)

SLIDE 20

Performance analysis

  • Ideal network: normalized number of L1 misses over the delay spectrum for each benchmark

SLIDE 21

Performance analysis

  • Ideal network: normalized number of messages over the delay spectrum for each benchmark

SLIDE 22

Performance analysis

  • 2D-Mesh achieves the best performance. The average savings for narrow flits:
      • 19% when compared with the ring
      • 26% when compared with the crossbar
  • With wide flits, the crossbar performs better than the ring in FMM, LU, FFT and Barnes, and similarly in the other benchmarks.
  • As we shrink the flit size, the behavior changes and the crossbar becomes worse.
  • The ring with wide flits achieves performance similar to the 2D-Mesh with narrow flits.
  • Narrow flits tend to increase execution time regardless of the topology; however, the 2D-Mesh is less affected.
  • A good trade-off for this CMP configuration would be a 2D-Mesh with moderate flit sizes (for example, 8 B).

SLIDE 23

Performance analysis

Comparison between 2D-mesh, ring and crossbar

[Figure panels: FMM, LU, FFT, Radix, Raytrace]

SLIDE 24

Performance analysis

Comparison between 2D-mesh, ring and crossbar

[Figure panels: FFT, Barnes, LU, Radiosity]

L1 Miss Types in Radiosity

                16          8            4
  User          537,136     541,517      539,187
  Supervisor    198,764     480,737      201,876
  Total         735,901     1,022,255    741,063

SLIDE 25

Performance analysis

L1 miss rates (%): low network load. Congestion is not an issue (in this CMP configuration).

  Benchmark   Flit size   mesh   ring   xbar
  Radix       16B         0.35   0.33   0.33
              8B          0.35   0.37   0.36
              4B          0.38   0.30   0.29
  Radiosity   16B         0.09   0.09   0.09
              8B          0.07   0.09   0.09
              4B          0.09   0.09   0.09
  FFT         16B         0.36   0.29   0.32
              8B          0.36   0.28   0.29
              4B          0.31   0.26   0.22
  Barnes      16B         0.15   0.13   0.14
              8B          0.15   0.13   0.13
              4B          0.14   0.13   0.13
  Raytrace    16B         0.84   0.52   0.38
              8B          0.82   0.53   0.32
              4B          0.68   0.38   0.20


SLIDE 27

Conclusions

  • Developed a detailed on-chip network simulator and interfaced it to GEMS/SIMICS
  • Analyzed the impact of topology and flit size on the execution time of real applications
  • Results:
      • Applications are very sensitive to network latency
      • Application + system behavior may change because of the network (unpredicted behavior captured by our simulation tool)
      • 2D-Meshes always outperform rings and crossbars
  • For this CMP configuration, a 2D-Mesh with moderate flit sizes is the best option


SLIDE 29

Future work

  • The tool will enable us to study:
      • Other cache coherence protocols (token and hammer) with strong requirements for collective communication
      • The impact of multicast traffic on application execution time
      • The impact of memory controllers on application execution time
      • Commercial workloads

SLIDE 30

Thank you!

Jesús Camacho Villanueva
e-mail: jecavil@gap.upv.es

December 12, 2009