7 On-Chip Interconnection Networks (Chip Multiprocessors, ACS MPhil)



SLIDE 1

7: On-Chip Interconnection Networks

Chip Multiprocessors (ACS MPhil) Robert Mullins

SLIDE 2

Introduction

  • Vast transistor budgets, but....
  • Poor interconnect scaling
    – Pressure to decentralise designs
  • Need to manage complexity and power
  • Need for flexible/fault-tolerant designs
  • Parallel architectures
    – Keep core complexity constant or simplify
    – The result is a need to interconnect lots of cores, memories and other IP cores

SLIDE 3

Introduction

  • On-chip communication requirements:
    – High performance
      • Latency and bandwidth
    – Flexibility
      • Move away from fixed application-specific wiring
    – Scalability
      • Number of modules is rapidly increasing
SLIDE 4

Introduction

  • On-chip communication requirements:
    – Simplicity (ease of design and verification)
      • Structured, modular and regular
      • Optimize channel and router once
    – Efficiency
      • Ability to share global wiring resources between different flows
    – Fault tolerance (in the long term)
      • The existence of multiple communication paths between module pairs
    – Support for different traffic types and QoS

SLIDE 5

Introduction

  • The design of the on-chip network is not an isolated design decision (or afterthought)
    – e.g. consider its impact on the cache coherency protocol
    – What is the correct balance of resources (wires and transistors, silicon area, power etc.) between the on-chip network and computational resources?
    – Where does the on-chip network stop and the design of a module or core start?
      • “integrated microarchitectural networks”
    – Does the network simply blindly allow modules to communicate, or does it have additional functionality?

SLIDE 6

Introduction

  • Don't we already know how to design interconnection networks?
    – Many existing network topologies, router designs and much of the theory were developed for high-end supercomputers and telecom switches
    – Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs

SLIDE 7

On-chip vs. Off-chip

  • Compare the availability of pins and wiring tracks on-chip to the cost of pins/connectors and cables off-chip
  • Compare communication latencies on- and off-chip
    – What is the impact on router and network design?
  • Applications and workloads
  • Amount of memory available on-chip
    – What is the impact on router design/flow control?
  • Power budgets on- and off-chip
  • Need to map the network to a planar chip (or perhaps, more recently, a 3D stack of dies)

SLIDE 8

On-chip interconnect

  • Typical interconnect at the 45nm node:
    – 10-14 metal layers
    – Local interconnect (M1)
      • 65nm metal width, 65nm spacing
      • 7700 metal tracks/mm
    – Global (e.g. M10)
      • 400nm metal width, 400nm spacing
      • 1250 metal tracks/mm
  • Remember global interconnects scale poorly when compared to transistors

9Cu+1Al process (Fujitsu 2007)

SLIDE 9

Bus-based interconnects

  • Bus-based interconnects
    – A central arbiter provides access to the bus
    – Logically, the bus is simply viewed as a set of wires shared by all processors

SLIDE 10

Bus-based interconnects

  • Real bus implementations are typically switch-based
    – Multiplexers and unidirectional interconnects with repeaters
    – Tri-states are rarely used now
    – The interconnect itself may be pipelined
  • A bus-based CMP usually exploits multiple unidirectional buses
    – e.g. an address bus, a response bus and a data bus

SLIDE 11

Bus-based interconnects for multicore?

  • Metal/wiring is cheap on-chip!
  • Avoid the complexity of packet-switched networks
  • Keep cache coherency simple
  • Performance issues
    – Centralised arbitration
    – Low clock frequency (pipeline?)
    – Power?
    – Scalability?

[Figure: repeated bus global interconnect, Shekhar Borkar (OCIN'06)]

SLIDE 12

Bus-based interconnects for multicore?

  • Optimising bus-based solutions:
    – Arbitrate for the next cycle on the current clock cycle
    – Use wide, low-swing interconnects
    – Limit broadcasts to a subset of processors?
      • Segment the bus and filter redundant broadcasts to some segments by maintaining some knowledge of cache contents. So-called “filtered segmented buses”
    – Employ multiple buses
    – Move from electrical to on-chip optical solutions?

SLIDE 13

Filtered Segmented Bus

“Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks”, Udipi et al, HPCA 2010

  • Filter broadcasts to segments with a Bloom filter
  • Energy savings are possible vs. mesh and flattened-butterfly networks (for 16, 32 and 64 cores) because routers can be removed
  • For large numbers of cores, multiple (address-interleaved) buses are required to avoid a significant performance penalty due to contention
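The filtering idea can be sketched with a toy Bloom filter guarding one bus segment. This is a hypothetical illustration, not the exact design from the Udipi et al. paper; the table size and hash functions are assumptions:

```python
class SegmentFilter:
    """Toy Bloom filter guarding one bus segment: a false positive only
    costs a redundant broadcast; a false negative never occurs, so no
    cached copy is ever missed."""

    def __init__(self, n_bits: int = 1024):
        self.n_bits = n_bits
        self.bits = [False] * n_bits

    def _hashes(self, addr: int):
        # Two cheap hash functions (chosen for illustration only)
        yield addr % self.n_bits
        yield (addr * 2654435761) % self.n_bits

    def record_fill(self, addr: int) -> None:
        """A cache in this segment has fetched the block at addr."""
        for h in self._hashes(addr):
            self.bits[h] = True

    def must_broadcast(self, addr: int) -> bool:
        """Forward a snoop into this segment only if it may hold the block."""
        return all(self.bits[h] for h in self._hashes(addr))
```

The one-sided error is what makes the filter safe for coherence: filtering out a segment is only allowed when the filter proves the block cannot be cached there.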

SLIDE 14

Bus-based interconnects for multicore?

  • Exploiting multiple buses (or rings):
    – Multiple address-interleaved buses
      • e.g. Sun Wildfire/Starfire
    – Use different buses for different message types
    – Subspace snooping [Huh/Burger06]
      • Associate (dynamic) address ranges with each bus. Each subspace is a region of data that is shared by a stable subset of the processors
      • This technique tackles snoop bandwidth limitations, as not all processors are required to snoop all buses
    – Exploit buses at the lowest level of a hierarchical network (e.g. a mesh interconnecting tiles, where each tile is a group of cores connected by a bus)

SLIDE 15

Sun Starfire (UE10000)

Uses 4 interleaved address buses to scale the snooping protocol

  • Up to 64-way SMP using a bus-based snooping protocol
  • Separate data transfer over a high-bandwidth 16x16 data crossbar
  • 4 processors + memory module per system board

Slide from Krste Asanovic (Berkeley)

SLIDE 16

Ring Networks

  • Exploit short point-to-point interconnects
  • Can support many concurrent data transfers
  • Can keep the coherence protocol simple and avoid the need for directory-based schemes
    – We may still broadcast transactions
  • Modest area requirements

k-node ring (or k-ary 1-cube)

SLIDE 17

Ring Networks

  • Control
    – May be distributed
      • Need to be a little careful to avoid the possibility of deadlock (more later!)
    – Or a centralised arbiter/scheduler may be used
      • e.g. the IBM Cell BE and Larrabee both appear to use a centralised scheduler
      • Try to schedule as many concurrent (non-overlapping) transfers on each available ring as possible
  • Trivial routers at each node
    – Simple routers are attractive as they don't introduce significant latency, power and area overheads
SLIDE 18

Ring Networks: Examples

  • IBM
    – Power4, Power5
  • IBM/Sony/Toshiba
    – Cell BE (PS3, HDTV, Cell blades, ...)
  • Intel
    – Larrabee (graphics), 8-core Xeon processor
  • Kendall Square Research (1990s)
    – Massively parallel supercomputer design
    – Ring of rings (hierarchical or multi-ring) topology
      • Cluster = 32 nodes connected in a ring
      • Up to 34 clusters connected by a higher-level ring
SLIDE 19

Ring Networks: Example IBM Cell BE

  • Cell Broadband Engine
    – Message-passing style (no $ coherence)
    – Element Interconnect Bus (EIB)
      • 2 rings are provided in each direction
      • A crossbar solution was deemed too large

SLIDE 20

Ring Networks: Example Larrabee

  • Cache coherent
  • Bi-directional ring network, 512-bit wide links
    – Short linked rings proposed for >16 processors
    – Routing decisions are made before injecting messages
      • The clockwise ring delivers on even clock cycles and the anticlockwise ring on odd clock cycles

SLIDE 21

Crossbar Networks

  • A crossbar switch is able to directly connect any input to any output without any intermediate stages
    – It is an example of a strictly non-blocking network
      • It can connect any input to any output, incrementally, without the need to rearrange any of the circuits currently set up
    – The main limitation of a crossbar is its cost. Although very useful in small configurations, n x n crossbars quickly become prohibitively expensive as their cost increases as n²

SLIDE 22

Crossbar Networks

  • A 4x3 crossbar implemented using three 4:1 multiplexers
  • Each multiplexer selects a particular input to be connected to the corresponding output

(Dally/Towles book, Chapter 6)
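The multiplexer-based construction can be mimicked in a few lines. A minimal sketch (the function name and select-vector encoding are our own, not from Dally/Towles):

```python
def crossbar(inputs: list, select: list) -> list:
    """A len(inputs) x len(select) crossbar built from multiplexers:
    output j simply carries inputs[select[j]], so any input can drive
    any (or several) outputs with no intermediate stage."""
    return [inputs[s] for s in select]
```

One select value per output is exactly the per-output multiplexer of the slide; a 4x3 crossbar is three 4:1 multiplexers, i.e. `len(inputs) == 4` and `len(select) == 3`.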

SLIDE 23

Crossbar Networks: Example Niagara

  • A crossbar switch interconnects 8 processors to a banked on-chip L2 cache
    – A crossbar is actually provided in each direction:
      • Forward and Return
  • Simple cache coherence protocol
    – See earlier seminar

Reproduced from IEEE Micro, Mar'05

SLIDE 24

Crossbar Networks: Example Cyclops

  • IBM, US Dept. of Energy/Defense, academia
  • Full system: 1M+ processors, 80 cores per chip
  • Interconnect: centralised 96x96 buffered crossbar switch with a 7-stage pipeline

SLIDE 25

Crossbar Networks: Example Cyclops

  • Motivation for use of a crossbar
    – Simple uniform memory access model
    – System is sequentially consistent
  • Crossbar interconnects:
    – 80 processors, each including:
      • 2 x thread units, 2 x scratchpad memories (can also be configured as global memory), 1 x FPU
    – Off-chip memory, I/O and interfaces to the off-chip network
  • Crossbar area
    – 1.6mm x 17mm (27mm²), 6% of total die area
  • Communication and arbitration are centralised
    – Power implications? Fixed core-to-core comms. delay

SLIDE 26

Crossbar Networks: On-chip Optical

  • Fully optical on-chip buses, rings and crossbars have also recently been proposed. They may be constructed using ring-resonator-based modulators and detectors.

[Firefly/Corona, ISCA'08/09]

SLIDE 27

Interconnection Networks

  • So far we have only discussed buses, ring networks and crossbar switches
  • In general, a network can be described by its:
    – Topology: how we arrange our shared routers and channels
      • Router = (buffers) + switch + control
    – Flow control: how we allocate network resources, e.g. buffer capacity and channel bandwidth
    – Routing algorithm: how we select a path through our network from the source to the destination node
SLIDE 28

Nomenclature (See Dally/Towles)

  • Modules (i.e. cores, memories etc.) connect to the network at network terminals
  • The network itself consists of a collection of nodes (switches/routers) which are linked by channels
    – Channels are characterised by their width, frequency and latency
  • Direct networks associate a terminal (a connection to the outside world) with each node. In a direct network a node is both a terminal and a switch
  • In indirect networks nodes are either a terminal or a switch (not both)

SLIDE 29

Nomenclature

  • Paths
    – A route or path between two nodes is given by an ordered set of channels, P. The length or hop count of a path is |P|
    – A minimal path between two nodes is the path with the smallest hop count
    – The diameter (Hmax) of the network is the largest minimal hop count over all pairs of terminals in the network
    – For an NxN mesh, the diameter is 2x(N-1)
      • e.g. bottom-left node to top-right node
      • It is 1 for a crossbar switch
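These path definitions can be sketched for a mesh. The (x, y) coordinate-pair representation of nodes is an assumption made for illustration:

```python
def mesh_min_hops(src: tuple, dest: tuple) -> int:
    """Minimal hop count |P| between two mesh nodes: with dimension-order
    or any minimal route this is the Manhattan distance between their
    (x, y) coordinates."""
    (sx, sy), (dx, dy) = src, dest
    return abs(sx - dx) + abs(sy - dy)

def mesh_diameter(n: int) -> int:
    """Diameter Hmax of an n x n mesh: the corner-to-corner minimal path,
    which is 2*(n-1) hops (bottom-left node to top-right node)."""
    return mesh_min_hops((0, 0), (n - 1, n - 1))
```

For an 8x8 mesh this gives a diameter of 14 hops, consistent with the 2x(N-1) formula above.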
SLIDE 30

Nomenclature

  • Higher-level messages between network clients are communicated over the network as one or more packets
  • Each packet is composed of a number of flits (flow control digits). A flit is the smallest unit of information recognised by the flow control method
    – e.g. an on-chip network may employ 64-bit flits and have the ability to send a new flit over a channel (perhaps also 64 bits wide) every clock cycle
    – A network may expect fixed- or variable-length packets
    – The first flit in a packet is often called the head flit and the last the tail flit
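As a worked version of the 64-bit-flit example above, a small helper (assuming the whole message travels as a single packet and ignoring any header overhead):

```python
def flits_per_packet(message_bits: int, flit_bits: int = 64) -> int:
    """Number of flits needed to carry a message, assuming one message
    maps to one packet and header overhead is ignored."""
    return -(-message_bits // flit_bits)  # ceiling division
```

A 256-bit message therefore needs 4 flits, and with one flit accepted per channel per cycle it occupies the channel for 4 cycles (its serialization latency).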

SLIDE 31

Nomenclature

  • The average length of the minimal paths between all sources and destinations is known as the average minimum hop count, Hmin
  • The latency of a network is the time required for a packet to traverse the network
    – More precisely, this is the time between the head of the message (or packet) reaching the input terminal and its tail departing from the output terminal
    – Head latency is the time for the head of the message to traverse the network
    – The serialization latency is the time for the tail of the message to catch up (i.e. the time taken to send a complete packet across a channel)
    – The latency of a single router is TR

SLIDE 32

Nomenclature

The average zero-load latency (in the absence of contention) provides us with a lower bound on average latency:

T0 = (router delay) + (time of flight) + (serialization latency)
   = Hmin*TR + Dmin/v + L/b

  – Dmin: average distance. The physical distance between network nodes depends on the layout of the network (in larger networks involving multiple chips this will also depend on packaging decisions)
  – v: propagation velocity (how fast does a flit propagate between nodes?)
  – L: packet length; b: channel bandwidth
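The formula can be captured directly. A minimal sketch; the numbers in the usage comment (4 hops, 1-cycle routers, 4mm of wire at 2mm/cycle, a 256-bit packet over 64-bit/cycle channels) are invented for illustration:

```python
def zero_load_latency(h_min: float, t_r: float, d_min: float,
                      v: float, length: float, bandwidth: float) -> float:
    """T0 = Hmin*TR + Dmin/v + L/b:
    router delay + time of flight + serialization latency."""
    return h_min * t_r + d_min / v + length / bandwidth

# e.g. 4 + 2 + 4 = 10 cycles for the illustrative numbers above
t0 = zero_load_latency(h_min=4, t_r=1, d_min=4, v=2, length=256, bandwidth=64)
```

Note how the three terms map directly onto the decomposition on the slide; at zero load no queueing term appears.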

SLIDE 33

Nomenclature

  • The bisection width is the minimum number of wires that must be cut when the network is divided into two equal sets of nodes
  • The collective bandwidth over the bisection width is known as the bisection bandwidth

SLIDE 34

Traffic Patterns (Dally/Towles, 3.2)

  • Random traffic
    – Each source is equally likely to send to any destination
  • Permutation traffic (one fixed destination per source)
    – Bit permutations, e.g. 4-bit source address = {s3, s2, s1, s0}
      • Bit complement (e.g. dest = {!s3, !s2, !s1, !s0})
      • Bit reverse (e.g. dest = {s0, s1, s2, s3})
      • Bit rotation
      • Shuffle (sorting)
      • Transpose (FFT)
    – Digit permutations
      • Tornado (adversary for torus topologies)
      • Neighbour (fluid dynamics)
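The bit-permutation patterns above are easy to state concretely. A minimal sketch using the slide's 4-bit addresses (the function names are our own):

```python
def bit_complement(src: int, n_bits: int) -> int:
    """dest = {!s3, !s2, !s1, !s0}: complement every address bit."""
    return src ^ ((1 << n_bits) - 1)

def bit_reverse(src: int, n_bits: int) -> int:
    """dest = {s0, s1, s2, s3}: reverse the order of the address bits."""
    dest = 0
    for i in range(n_bits):
        if src & (1 << i):
            dest |= 1 << (n_bits - 1 - i)
    return dest

def transpose(src: int, n_bits: int) -> int:
    """Swap the high and low halves of the address, i.e. swap the (x, y)
    coordinates when the address is split into two equal digits."""
    half = n_bits // 2
    low = src & ((1 << half) - 1)
    return (low << half) | (src >> half)
```

Because each source maps to exactly one destination, these patterns concentrate load on specific channels and are useful adversarial workloads for evaluating routing algorithms.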
SLIDE 35

Real Traffic Patterns

“A communication characterization of Splash-2 and Parsec”, Barrow-Williams/Fensch/Moore, IISWC'09

SLIDE 36

Performance

  • Imagine a mesh and uniform random traffic
  • If we divide the nodes of the mesh in two, it is clear that half the traffic from one half must cross the bisection
  • Hence the topology limit is 2Bc/N
  • e.g. for an 8x8 mesh:
    – 32 x traffic_per_node x 0.5 = 0.5 x Bc
    – Let's assume the bisection bandwidth Bc = 16 flits/cycle, so traffic_per_node = 0.5 flits/cycle
  • Offered traffic is often normalised to the topology limit
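The 8x8 worked example above can be checked with a one-liner:

```python
def topology_limit(bisection_bw: float, n_nodes: int) -> float:
    """Per-node injection limit (flits/cycle) under uniform random traffic:
    half the traffic of each of the N/2 nodes on one side crosses the
    bisection, so (N/2) * limit * 0.5 = Bc/2, i.e. limit = 2*Bc/N."""
    return 2 * bisection_bw / n_nodes
```

With Bc = 16 flits/cycle and N = 64 nodes this reproduces the 0.5 flits/cycle limit from the slide.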

SLIDE 37

Performance: Routing Limit

  • A particular routing algorithm may be unable to balance the traffic across the channels of the bisection. In this case, the throughput limit imposed by the routing algorithm may be significantly lower than the topology limit.

SLIDE 38

Topology: The Basics

  • Choosing the topology is the first step in designing a network
    – The flow control and routing algorithm are obviously dependent on the topology
    – Selecting a topology involves a trade-off between complexity/cost and performance (bandwidth and latency)
  • Don't forget we must map the network to a 2D VLSI implementation
    – Die stacking introduces the need for 3D networks
  • On-chip networks typically try hard to minimise communication latency and power
    – e.g. exploit short channels to neighbouring cores
    – Direct networks

SLIDE 39

Topology: Mesh

A 4-ary 2-mesh

  • Simple and regular mapping to a 2D VLSI implementation
  • Channels to nearest neighbours
  • Simple routers
    – Low radix (or degree)
      • e.g. a 5x5 switch (or two 3x3 switches)
      • 4 ports for neighbours and 1 for the terminal
  • Potentially high hop count
SLIDE 40

Topology: Concentrated Mesh

  • Increase the scalability of the mesh by co-locating multiple terminals at each node
    – Here concentration has been achieved using a larger crossbar at each node
    – Could also just use a multiplexer or bus
  • Concentration reduces the average hop count and hence the zero-load latency

SLIDE 41

Topology: Conc. Mesh + Express Chan.

  • We can add express channels to restore lost bisection bandwidth
  • Express channels in general are additional links to non-local nodes
    – Physically, they are often longer than the minimum channel length

Concentrated Mesh with Express Channels (Balfour/Dally, ICS'06)

SLIDE 42

Topology: Mesh + multidrop channels

  • MECS topology
    – Grot et al., HPCA'09
    – Exploits multi-drop express channels
      • Point-to-multipoint unidirectional links connect a source with multiple destinations in a given row or column
  • Network diameter is 2
  • High connectivity with low channel count

SLIDE 43

Topology: Torus

  • Meshes suffer from load imbalance for many traffic patterns, as demand for the central channels is higher than for the edge channels
  • The torus topology has twice the bisection bandwidth of a mesh
  • The long wraparound channels may be avoided by folding the torus

A 4-ary 2-cube (Dally/Towles, p.99)

SLIDE 44

Flow Control: Circuit Switching

  • In a circuit-switched network, network resources (channels) are reserved before a packet is sent
    – The entire path (circuit) must be reserved first
    – Channels are often shared between different circuits using time-division multiplexing or by dividing the channel into multiple narrow links
    – The circuit is torn down once the message has been sent

SLIDE 45

Flow Control: Circuit Switching

  • Minimal buffering at each switch
  • Once the circuit is set up, router latency and control overheads are very low
  • Very poor use of channel bandwidth if lots of short packets must be sent to many different destinations
    – More commonly seen in embedded SoC applications where traffic patterns may be static and involve streaming large amounts of data between different IP blocks
    – Can also provide QoS/Guaranteed Services (GS)

SLIDE 46

Flow Control: Buffered Flow Control

  • We can aim to make better use of channel resources by buffering packets (or a fraction of a packet) at each node. We then arbitrate for access to network resources dynamically.
  • We distinguish between different approaches by the granularity at which we reserve resources (e.g. channels and buffers) and the conditions that must be met for a packet to advance to the next node.
  • Packet-buffer flow control
    – Store-and-forward
    – Virtual cut-through
  • Flit-buffer flow control
    – Wormhole
      • Efficient use of (limited) buffer space
SLIDE 47

Flow Control: Buffered Flow Control

L – Packet Length

SLIDE 48

Virtual-Channel Flow Control

  • Improve performance of wormhole routing, prevent a single packet

blocking a free channel

– e.g. if the green packet is blocked the red packet may still make progress through the network – We can interleave flits from different packets over the same channel

SLIDE 49

Virtual-channel Flow Control

  • Virtual channels are also often used to avoid deadlock
    – Deadlock may be a possibility due to the use of a particular topology, routing algorithm or higher-level protocol (message-dependent deadlock)
    – Messages on one virtual channel cannot block messages on another
      • e.g. we might want to put “request” and “reply” messages on different virtual channels
      • Or provide a unique virtual channel for each message type in a directory-based cache coherency protocol

SLIDE 50

Flow Control: Backpressure

  • How do we know how many buffers are free in the downstream router?
  • Mechanisms for low-level flow control between nodes (to provide backpressure):
    – On/off
      • Able to send, yes/no?
    – Credit-based
      • The upstream router maintains a counter for each downstream VC
      • The counter is decremented when a flit is sent
      • The downstream router sends a credit to the upstream router whenever a flit leaves the VC buffer; when a credit is received, the corresponding counter is incremented
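Credit-based backpressure, as described above, reduces to a per-VC counter at the upstream router. A minimal sketch (the class name and interface are assumptions for illustration):

```python
class CreditCounter:
    """The upstream router's view of one downstream VC buffer."""

    def __init__(self, buffer_depth: int):
        self.credits = buffer_depth  # free slots in the downstream buffer

    def can_send(self) -> bool:
        return self.credits > 0

    def send_flit(self) -> None:
        assert self.can_send(), "no credits: downstream VC buffer may be full"
        self.credits -= 1            # the flit will occupy a downstream slot

    def credit_received(self) -> None:
        self.credits += 1            # a flit left the downstream VC buffer
```

Because the counter only ever undercounts free slots (credits are in flight), the scheme can never overflow the downstream buffer, at the cost of some throughput when the credit-return loop is long.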

SLIDE 51

Deadlock

  • The deadlock problem
    – A group of agents (e.g. packets) is unable to make progress as each is waiting on another to release a resource (e.g. a buffer or channel)
    – Disaster! Need to be careful even when the design looks simple (e.g. a ring network)
  • Message-dependent deadlock
    – External dependencies in a higher-level protocol may also contribute to deadlock
    – Often need to make sure different message classes (e.g. request/reply messages) travel on different virtual channels
SLIDE 52

Deadlock example

Glass/Ni

SLIDE 53

Deadlock

  • Techniques for deadlock avoidance
    – Injection restriction
    – Centralised/static scheduling
    – End-to-end flow control
    – Virtual channels
    – For routing-related deadlock: the turn model, escape channels
  • Deadlock recovery
    – Deadlock might be infrequent and costly to avoid altogether; it may be easier just to detect and recover
    – e.g. drain the deadlocked network to another network, NACK requests, fall back on a simpler protocol (e.g. Culler p.594)

SLIDE 54

Routing: The Basics

  • A routing algorithm aims to maximise network throughput by balancing load across network channels
    – Latency is often important too; need to keep routes short
    – Keep complexity low
      • Cycle time and energy implications
      • Can global network state information be exploited?
    – Avoid deadlock
  • Deterministic and adaptive routing
  • Multicast and broadcast support?
SLIDE 55

Routing: Dimension Order Routing

  • A simple deterministic minimal routing algorithm (aka e-cube routing)
    – Route in one dimension at a time
    – If there is a choice of directions in a dimension (e.g. for a torus) we first compute the distance that would have to be traveled in each case and select the shortest path
  • The 2D version is also called XY routing
    – Route in the X direction first, then the Y direction
  • Restricts turns to avoid deadlock
    – See the turn model (Dally/Towles, p.269)
    – And west-first, north-last and negative-first routing
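XY routing is small enough to state exactly. A minimal sketch; the coordinate convention (x grows to the east, y grows to the north) and the port names are assumptions:

```python
def xy_route(cur: tuple, dest: tuple) -> str:
    """Dimension-order (XY) routing for a 2D mesh: remove the X offset
    first, then the Y offset; eject to the local terminal once both match.
    Returns the output port for the next hop."""
    (cx, cy), (dx, dy) = cur, dest
    if cx != dx:                       # still off in X: go east or west
        return 'E' if dx > cx else 'W'
    if cy != dy:                       # X done, correct Y: go north or south
        return 'N' if dy > cy else 'S'
    return 'EJECT'                     # arrived at the destination node
```

Because a packet never turns from a Y channel back onto an X channel, no cyclic channel dependency can form, which is exactly the turn restriction that makes XY routing deadlock-free on a mesh.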

SLIDE 56

Routing: XY Routing

SLIDE 57

Virtual-Channel Router Design

SLIDE 58

Virtual-Channel Router Design

(2006+) State-of-the-art routers use a single combined VC and switch allocator

SLIDE 59

Virtual-Channel Router Design

  • 4x4 mesh network
  • Single-cycle routers (incl. channel latency)
  • Clock ~35 FO4
  • Virtual-channel support
    – 4 virtual channels per input port
  • Clocking: H-tree or Distributed Clock Generator (DCG)

Lochside chip, Mullins, West, Moore (2005)

SLIDE 60

Fair allocation of resources to flows

  • A flow is a stream of packets between a particular source and destination
  • Locally fair allocation (round-robin on input ports) does not balance flow throughputs globally
  • Solution: arbitrate between flows, not virtual channels [Banerjee, NOCS'09]. See also [Lee, ISCA'08]

Reproduced from [Lee, ISCA'08]

SLIDE 61

Many Networks

  • The cost of replicating networks on-chip is much lower than off-chip. Is it advantageous to build multiple networks?
    – A simple way to provide decoupled/isolated network resources (e.g. to provide QoS guarantees or to help predict performance)
      • e.g. special-purpose low-latency synchronization networks
    – Able to carefully tailor each network for its specific purpose
      • e.g. use both static and dynamic networks
      • See the MIT RAW/Tilera processor
    – Save power/energy this way? e.g. provide both low-power and high-performance networks (steer messages to one network or the other depending on how critical they are)
    – Increased serialization latency due to narrow channels
      • But improved energy efficiency?
    – Lower overall router area, but more control overhead
    – Have to partition network buffers

SLIDE 62

In-Network Optimisations

  • The on-chip network can collect information (in a distributed manner) that in turn can be used to optimise system performance
    – A router can collect information from packets it routes
    – The routing of subsequent packets may be modified depending on the contents of previous packets
      • This isn't just adaptive routing: the information collected isn't about the state of the network but rather about the network clients or the system as a whole
    – e.g. “In-network cache coherence” [Eisley/Peh/Shang, MICRO'06]
SLIDE 63

Conclusions

  • A brief introduction to on-chip network design
    – Read Dally/Towles and Duato/Yalamanchili/Ni for more details
    – Conferences: NOCS/HPCA/ISCA/MICRO...
    – There is lots more to routing and deadlock issues in particular
      • See the course wiki
  • A very broad viable design space and a very active research area
    – Challenging power dissipation constraints
    – Interconnects scale poorly compared to transistors; the problem won't get easier!