7 On-Chip Interconnection Networks
Chip Multiprocessors (ACS MPhil)
Robert Mullins
Introduction
- Vast transistor budgets, but...
– Poor interconnect scaling
– Pressure to decentralise designs
– Need to manage complexity and power
– Need for flexible/fault-tolerant designs
- Parallel architectures
– Keep core complexity constant, or simplify cores
– The result: a need to interconnect large numbers of cores, memories and other IP blocks
Introduction
- On-chip communication requirements:
– High performance
- Latency and bandwidth
– Flexibility
- Move away from fixed application-specific wiring
– Scalability
- Number of modules is rapidly increasing
Introduction
- On-chip communication requirements:
– Simplicity (ease of design and verification)
- Structured, modular and regular
- Optimize channel and router once
– Efficiency
- Ability to share global wiring resources between different flows
– Fault tolerance (in the long term)
- The existence of multiple communication paths between module pairs
– Support for different traffic types and QoS
Introduction
- The design of the on-chip network is not an isolated design decision (or an afterthought)
– e.g. consider its impact on the cache coherency protocol
– What is the correct balance of resources (wires and transistors, silicon area, power, etc.) between the on-chip network and the computational resources?
– Where does the on-chip network stop and the design of a module or core start?
- "Integrated microarchitectural networks"
– Does the network simply allow modules to communicate blindly, or does it have additional functionality?
Chip Multiprocessors (ACS MPhil) 6
Introduction
- Don't we already know how to design interconnection networks?
– Many existing network topologies, router designs and much of the theory have already been developed for high-end supercomputers and telecom switches
– Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs
On-chip vs. Off-chip
- Compare the availability of pins and wiring tracks on-chip to the cost of pins/connectors and cables off-chip
- Compare communication latencies on- and off-chip
– What is the impact on router and network design?
- Applications and workloads
- Amount of memory available on-chip
– What is the impact on router design/flow control?
- Power budgets on- and off-chip
- Need to map the network to a planar chip (or, more recently, a 3D stack of dies)
On-chip interconnect
- Typical interconnect at the 45nm node:
– 10-14 metal layers
– Local interconnect (M1)
- 65nm metal width, 65nm spacing
- ~7700 metal tracks/mm
– Global (e.g. M10)
- 400nm metal width, 400nm spacing
- ~1250 metal tracks/mm
- Remember: global interconnects scale poorly compared to transistors
9Cu+1Al process (Fujitsu 2007)
Bus-based interconnects
- Bus-based interconnects
– A central arbiter provides access to the bus
– Logically the bus is simply viewed as a set of wires shared by all processors
Bus-based interconnects
- Real bus implementations are typically switch based
– Multiplexers and unidirectional interconnects with repeaters
– Tri-states are rarely used now
– The interconnect itself may be pipelined
- A bus-based CMP usually exploits multiple unidirectional buses
– e.g. address bus, response bus and data bus
Bus-based interconnects for multicore?
- Metal/wiring is cheap on-chip!
- Avoid the complexity of packet-switched networks
- Keep cache coherency simple
- Performance issues
– Centralised arbitration
– Low clock frequency (pipeline?)
– Power?
– Scalability?
Shekhar Borkar (OCIN'06), repeated bus global interconnect
Bus-based interconnects for multicore?
- Optimising bus-based solutions:
– Arbitrate for the next cycle on the current clock cycle
– Use wide, low-swing interconnects
– Limit broadcasts to a subset of processors?
- Segment the bus and filter redundant broadcasts to some segments by maintaining some knowledge of cache contents: so-called "filtered segmented buses"
– Employ multiple buses
– Move from electrical to on-chip optical solutions?
Filtered Segmented Bus
"Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks", Udipi et al., HPCA 2010
- Filter broadcasts to segments with a Bloom filter
- Energy savings are possible vs. mesh and flattened-butterfly networks (for 16, 32 and 64 cores) because routers can be removed
- For large numbers of cores, multiple (address-interleaved) buses are required to avoid a significant performance penalty due to contention
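As a rough illustration of the filtering idea (a sketch, not code from the Udipi et al. paper; all names and parameters here are hypothetical), a per-segment Bloom filter can conservatively track which addresses may be cached within a segment, so a broadcast need only be forwarded to segments whose filter reports a possible hit:

```python
# Sketch of per-segment broadcast filtering with a Bloom filter.
# Parameters (bits, hashes) and hash choice are illustrative only.

class SegmentFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = [False] * bits

    def _positions(self, addr):
        # Simple double hashing; a real design would use hardware-friendly hashes.
        h1 = hash(addr) % self.bits
        h2 = (hash(addr * 2654435761) % (self.bits - 1)) + 1
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def insert(self, addr):
        # Called when a line becomes cached somewhere in this segment.
        for p in self._positions(addr):
            self.array[p] = True

    def may_contain(self, addr):
        # No false negatives, occasional false positives.
        return all(self.array[p] for p in self._positions(addr))

def forward_broadcast(addr, segments):
    """Return the indices of segments a snoop must be forwarded to."""
    return [i for i, f in enumerate(segments) if f.may_contain(addr)]
```

Because Bloom filters admit false positives but never false negatives, a snoop is never wrongly filtered out; occasionally it is forwarded to a segment that does not hold the line, costing energy but not correctness.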
Bus-based interconnects for multicore?
- Exploiting multiple buses (or rings):
– Multiple address-interleaved buses
- e.g. Sun Wildfire/Starfire
– Use different buses for different message types
– Subspace snooping [Huh/Burger06]
- Associate (dynamic) address ranges with each bus. Each subspace is a region of data shared by a stable subset of the processors.
- This technique tackles snoop bandwidth limitations, as not all processors are required to snoop all buses
– Exploit buses at the lowest level of a hierarchical network (e.g. a mesh interconnecting tiles, where each tile is a group of cores connected by a bus)
Sun Starfire (UE10000)
- Uses 4 interleaved address buses to scale the snooping protocol
- 16x16 data crossbar
- 4 processors + memory module per system board
- Up to 64-way SMP using a bus-based snooping protocol
- Separate data transfers over a high-bandwidth crossbar
Slide from Krste Asanovic (Berkeley)
Ring Networks
- Exploit short point-to-point interconnects
- Can support many concurrent data transfers
- Can keep the coherence protocol simple and avoid the need for directory-based schemes
– We may still broadcast transactions
- Modest area requirements
A k-node ring (or k-ary 1-cube)
Ring Networks
- Control
– May be distributed
- Need to be a little careful to avoid the possibility of deadlock (more later!)
– Or a centralised arbiter/scheduler may be used
- e.g. the IBM Cell BE and Larrabee both appear to use a centralised scheduler
- Try to schedule as many concurrent (non-overlapping) transfers on each available ring as possible
- Trivial routers at each node
– Simple routers are attractive as they don't introduce significant latency, power and area overheads
Ring Networks: Examples
- IBM
– Power4, Power5
- IBM/Sony/Toshiba
– Cell BE (PS3, HDTV, Cell blades, ...)
- Intel
– Larrabee (graphics), 8-core Xeon processor
- Kendall Square Research (1990s)
– Massively parallel supercomputer design
– Ring of rings (hierarchical or multi-ring) topology
- Cluster = 32 nodes connected in a ring
- Up to 34 clusters connected by a higher-level ring
Ring Networks: Example IBM Cell BE
- Cell Broadband Engine
– Message-passing style (no $ coherence)
– Element Interconnect Bus (EIB)
- Two rings are provided in each direction
- A crossbar solution was deemed too large
Ring Networks: Example Larrabee
- Cache coherent
- Bi-directional ring network, 512-bit wide links
– Short linked rings proposed for >16 processors
– Routing decisions are made before injecting messages
- The clockwise ring delivers on even clock cycles and the anticlockwise ring on odd clock cycles
Crossbar Networks
- A crossbar switch is able to directly connect any input to any output without any intermediate stages
– It is an example of a strictly non-blocking network
- It can connect any input to any output, incrementally, without the need to rearrange any of the circuits currently set up
– The main limitation of a crossbar is its cost. Although very useful in small configurations, n x n crossbars quickly become prohibitively expensive as their cost increases as n²
Crossbar Networks
- A 4x3 crossbar implemented using three 4:1 multiplexers
- Each multiplexer selects a particular input to be connected to the corresponding output
(Dally/Towles book, Chapter 6)
Crossbar Networks: Example Niagara
- A crossbar switch interconnects 8 processors to a banked on-chip L2 cache
– A crossbar is actually provided in each direction: forward and return
- Simple cache coherence protocol
– See earlier seminar
Reproduced from IEEE Micro, Mar'05
Crossbar Networks: Example Cyclops
- IBM, US Dept. of Energy/Defense, academia
- Full system: 1M+ processors; 80 cores per chip
- Interconnect: a centralised 96x96 buffered crossbar switch with a 7-stage pipeline
Crossbar Networks: Example Cyclops
- Motivation for the use of a crossbar
– Simple uniform memory access model
– System is sequentially consistent
- The crossbar interconnects:
– 80 processors, each including:
- 2 x thread units, 2 x scratchpad memories (can also be configured as global memory), 1 x FPU
– Off-chip memory, I/O and interfaces to the off-chip network
- Crossbar area
– 1.6mm x 17mm (27mm²), 6% of total die area
- Communication and arbitration are centralised
– Power implications?
– Fixed core-to-core communication delay
Crossbar Networks: On-chip Optical
- Fully optical on-chip buses, rings and crossbars have also recently been proposed. They may be constructed using ring-resonator-based modulators and detectors.
[Firefly/Corona, ISCA'08/09]
Interconnection Networks
- So far we have only discussed buses, ring networks and crossbar switches
- In general a network can be described by its:
– Topology: how we arrange our shared routers and channels
- Router = (buffers) + switch + control
– Flow control: how we allocate network resources, e.g. buffer capacity and channel bandwidth
– Routing algorithm: how we select a path through our network from the source to the destination node
Nomenclature (See Dally/Towles)
- Modules (i.e. cores, memories, etc.) connect to the network at network terminals
- The network itself consists of a collection of nodes (switches/routers) which are linked by channels
– Channels are characterised by their width, frequency and latency
- Direct networks associate a terminal (a connection to the outside world) with each node. In a direct network a node is both a terminal and a switch.
- In indirect networks nodes are either a terminal or a switch (not both)
Nomenclature
- Paths
– A route or path between two nodes is given by an ordered set of channels, P. The length or hop count of a path is |P|.
– A minimal path between two nodes is a path with the smallest hop count
– The diameter (Hmax) of the network is the largest minimal hop count over all pairs of terminals in the network
– For an NxN mesh, the diameter is 2(N-1)
- e.g. bottom-left node to top-right node
- It is 1 for a crossbar switch
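The diameter claim for the mesh is easy to check with a few lines of Python (illustrative only), using the Manhattan distance as the minimal hop count between mesh nodes:

```python
# Verify that an n x n mesh has diameter 2*(n-1) under minimal routing.

def mesh_hop_count(src, dst):
    """Minimal hop count between two mesh nodes (Manhattan distance)."""
    (x1, y1), (x2, y2) = src, dst
    return abs(x1 - x2) + abs(y1 - y2)

def mesh_diameter(n):
    """Largest minimal hop count over all node pairs of an n x n mesh."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    return max(mesh_hop_count(a, b) for a in nodes for b in nodes)

# For a 4x4 mesh: bottom-left to top-right corner, 2*(4-1) = 6 hops.
```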
Nomenclature
- Higher-level messages between network clients are communicated over the network as one or more packets
- Each packet is composed of a number of flits (flow control digits). A flit is the smallest unit of information recognised by the flow control method.
– e.g. an on-chip network may employ 64-bit flits and have the ability to send a new flit over a channel (perhaps also 64 bits wide) every clock cycle
– A network may expect fixed- or variable-length packets
– The first flit in a packet is often called the head flit and the last the tail flit
Nomenclature
- The average length of the minimal paths between all sources and destinations is known as the average minimum hop count, Hmin
- The latency of a network is the time required for a packet to traverse the network
– More precisely, this is the time between the head of the message (or packet) reaching the input terminal and its tail departing from the output terminal
– Head latency is the time for the head of the message to traverse the network
– The serialization latency is the time for the tail of the message to catch up (i.e. the time taken to send a complete packet across a channel)
– The latency of a single router is TR
Nomenclature
- The average zero-load latency (in the absence of contention) provides us with a lower bound on the average latency:
T0 = (router delay) + (time of flight) + (serialization latency)
T0 = Hmin*TR + Dmin/v + L/b
– Dmin – average distance. The physical distance between network nodes depends on the layout of the network (in larger networks involving multiple chips this will depend on packaging decisions).
– v – propagation velocity (how fast does a flit propagate between nodes?)
– L – packet length, b – channel bandwidth
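A worked example of the zero-load bound, using the symbols from the slide; the numbers below are illustrative, not taken from any particular chip:

```python
# Zero-load latency: T0 = Hmin*TR + Dmin/v + L/b (all terms in cycles).

def zero_load_latency(H_min, T_R, D_min, v, L, b):
    """H_min: avg. hop count, T_R: router delay (cycles),
    D_min: avg. distance (mm), v: signal velocity (mm/cycle),
    L: packet length (bits), b: channel bandwidth (bits/cycle)."""
    return H_min * T_R + D_min / v + L / b

# e.g. 6 hops, 1-cycle routers, 12mm of wire at 2mm/cycle,
# a 512-bit packet over 64-bit channels:
t0 = zero_load_latency(H_min=6, T_R=1, D_min=12.0, v=2.0, L=512, b=64)
# 6*1 + 12/2 + 512/64 = 6 + 6 + 8 = 20 cycles
```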
Nomenclature
- The bisection width is the minimum number of wires that must be cut when the network is divided into two equal sets of nodes
- The aggregate bandwidth of the channels crossing the bisection is known as the bisection bandwidth
Traffic Patterns (Dally/Towles, 3.2)
- Random traffic
– Each source is equally likely to send to any destination
- Permutation traffic (one fixed destination per source)
– Bit permutations, e.g. for a 4-bit source address {s3, s2, s1, s0}:
- Bit complement (dest = {!s3, !s2, !s1, !s0})
- Bit reverse (dest = {s0, s1, s2, s3})
- Bit rotation
- Shuffle (sorting)
- Transpose (FFT)
– Digit permutations
- Tornado (adversary for torus topologies)
- Neighbour (fluid dynamics)
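The bit permutations above are easy to state in code. A small Python sketch for b-bit node addresses (function names are ours, following the Dally/Towles definitions):

```python
# Bit-permutation traffic patterns for b-bit node addresses.

def bit_complement(src, b):
    """dest = {!s_{b-1}, ..., !s1, !s0}: flip every address bit."""
    return src ^ ((1 << b) - 1)

def bit_reverse(src, b):
    """dest bit j = src bit (b-1-j): reverse the bit order."""
    dest = 0
    for i in range(b):
        if src & (1 << i):
            dest |= 1 << (b - 1 - i)
    return dest

def transpose(src, b):
    """Swap the high and low halves of the address (b must be even),
    i.e. node (x, y) sends to node (y, x) on a 2^(b/2) x 2^(b/2) mesh."""
    half = b // 2
    lo = src & ((1 << half) - 1)
    hi = src >> half
    return (lo << half) | hi
```

For a 16-node network (b = 4), for example, bit complement sends node 0b0101 to 0b1010, an adversarial pattern in which every packet must cross the bisection.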
Real Traffic Patterns
“A communication characterization of Splash-2 and Parsec”, Barrow-Williams/Fensch/Moore, IISWC'09
Performance
- Imagine a mesh and uniform random traffic
- If we divide the nodes of the mesh in two, it is clear that half the traffic from one half must cross the bisection
- Hence the topology limit is 2Bc/N
- e.g. for an 8x8 mesh:
32 * traffic_per_node * 0.5 = 0.5 * Bc
– Let's assume the bisection bandwidth Bc = 16 flits/cycle, so traffic_per_node = 0.5 flits/cycle
– Offered traffic is often normalised to the topology limit
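The same bisection argument in a couple of lines of illustrative Python:

```python
# Topology limit under uniform random traffic: on average half of each
# node's traffic must cross the bisection, so N/2 nodes each contribute
# half their injected traffic to a bisection of capacity Bc/2 per direction,
# giving a per-node injection limit of 2*Bc/N flits/cycle.

def topology_limit(Bc, N):
    """Max sustainable injection rate per node (flits/cycle)."""
    return 2.0 * Bc / N

# 8x8 mesh with Bc = 16 flits/cycle, as on the slide:
# topology_limit(16, 64) = 0.5 flits/cycle per node.
```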
Performance: Routing Limit
- A particular routing algorithm may be unable to balance the traffic across the channels of the bisection. In this case, the throughput limit imposed by the routing algorithm may be significantly lower than the topology limit.
Topology: The Basics
- Choosing the topology is the first step in designing a network
– The flow control and routing algorithm are obviously dependent on the topology
– Selecting a topology involves a trade-off between complexity/cost and performance (bandwidth and latency)
- Don't forget we must map the network to a 2D VLSI implementation
– Die stacking introduces the need for 3D networks
- On-chip networks typically try hard to minimise communication latency and power
– e.g. exploit short channels to neighbouring cores
– Direct networks
Topology: Mesh
A 4-ary 2-mesh
- Simple and regular mapping to a 2D VLSI implementation
- Channels to nearest neighbours
- Simple routers
– Low radix (or degree)
- e.g. a 5x5 switch (or two 3x3 switches)
- 4 ports for neighbours and 1 for the terminal
- Potentially high hop count
Topology: Concentrated Mesh
- Increase the scalability of the mesh by co-locating multiple terminals at each node
– Here concentration has been achieved using a larger crossbar at each node
– Could also just use a multiplexer or bus
- Concentration reduces the average hop count and hence the zero-load latency
Topology: Conc. Mesh + Express Chan.
- We can add express channels to restore lost bisection bandwidth
- Express channels in general are additional links to non-local nodes
– Physically, they are often longer than the minimum channel length
Concentrated Mesh with Express Channels (Balfour/Dally, ICS'06)
Topology: Mesh + multidrop channels
- MECS topology
– Grot et al., HPCA'09
– Exploits multi-drop express channels
- Point-to-multipoint unidirectional links connect a source with multiple destinations in a given row or column
- Network diameter is 2
- High connectivity with a low channel count
Topology: Torus
- Meshes suffer from load imbalance for many traffic patterns, as demand for the central channels is higher than for the edge channels
- The torus topology has twice the bisection bandwidth of a mesh
- The long wraparound channels may be avoided by folding the torus
(Dally/Towles, p.99) A 4-ary 2-cube
Flow Control: Circuit Switching
- In a circuit-switched network, network resources (channels) are reserved before a packet is sent
– The entire path (circuit) must be reserved first
– Channels are often shared between different circuits using time-division multiplexing or by dividing the channel into multiple narrow links
– The circuit is torn down once the message has been sent
Flow Control: Circuit Switching
- Minimal buffering at each switch
- Once the circuit is set up, router latency and control overheads are very low
- Very poor use of channel bandwidth if lots of short packets must be sent to many different destinations
– More commonly seen in embedded SoC applications where traffic patterns may be static and involve streaming large amounts of data between different IP blocks
– Can also provide QoS/Guaranteed Services (GS)
Flow Control: Buffered Flow Control
- We can aim to make better use of channel resources by buffering packets (or a fraction of a packet) at each node. We then arbitrate for access to network resources dynamically.
- We distinguish between different approaches by the granularity at which we reserve resources (e.g. channels and buffers) and the conditions that must be met for a packet to advance to the next node
- Packet-buffer flow control
– Store and forward
– Virtual cut-through
- Flit-buffer flow control
– Wormhole
- Efficient use of (limited) buffer space
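The zero-load latency difference between these schemes can be made concrete (a sketch using the TR, L and b notation from the earlier slides; contention and blocking are ignored). Store-and-forward pays the full serialization latency at every hop, while cut-through and wormhole pay it only once:

```python
# Zero-load latency of buffered flow-control schemes over H hops,
# with router delay T_R, packet length L and channel bandwidth b.

def saf_latency(H, T_R, L, b):
    """Store-and-forward: the whole packet is received at each hop
    before being forwarded, so every hop pays L/b serialization."""
    return H * (T_R + L / b)

def cut_through_latency(H, T_R, L, b):
    """Virtual cut-through / wormhole (no blocking): the head pipelines
    through the routers and the packet is serialized only once."""
    return H * T_R + L / b

# e.g. H=6 hops, T_R=1 cycle, 512-bit packet, 64-bit channels:
# SAF: 6*(1+8) = 54 cycles; cut-through: 6+8 = 14 cycles.
```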
Flow Control: Buffered Flow Control
L – Packet Length
Virtual-Channel Flow Control
- Improves the performance of wormhole routing by preventing a single packet from blocking a free channel
– e.g. if the green packet is blocked, the red packet may still make progress through the network
– We can interleave flits from different packets over the same channel
Virtual-channel Flow Control
- Virtual channels are also often used to avoid deadlock
– Deadlock may be a possibility due to the use of a particular topology, routing algorithm or higher-level protocol (message-dependent deadlock)
– Messages on one virtual channel cannot block messages on another
- e.g. we might want to put "request" and "reply" messages on different virtual channels
- Or provide a unique virtual channel for each message type in a directory-based cache coherency protocol
Flow Control: Backpressure
- How do we know how many buffers are free in the downstream router?
- Mechanisms for low-level flow control between nodes (to provide backpressure):
– On/off
- Able to send, yes/no?
– Credit-based
- The upstream router maintains a counter for each downstream VC
- The counter is decremented when a flit is sent
- The downstream router sends a credit to the upstream router whenever a flit leaves the VC buffer; when a credit is received the corresponding counter is incremented
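A minimal sketch of the credit mechanism just described, for a single virtual channel between an upstream and a downstream router (Python; class and method names are illustrative):

```python
# Credit-based backpressure for one VC: the upstream side may only send
# when it holds a credit, i.e. when a downstream buffer slot is free.

class CreditChannel:
    def __init__(self, buffer_depth):
        self.credits = buffer_depth      # free slots in the downstream VC buffer
        self.downstream_buffer = []

    def try_send(self, flit):
        """Upstream router: send a flit only if a credit is available."""
        if self.credits == 0:
            return False                 # backpressure: must stall
        self.credits -= 1                # counter decremented on send
        self.downstream_buffer.append(flit)
        return True

    def downstream_dequeue(self):
        """Downstream router: a flit leaves the VC buffer and a credit
        is returned upstream (modelled here as an immediate increment)."""
        flit = self.downstream_buffer.pop(0)
        self.credits += 1
        return flit
```

In hardware the credit return takes several cycles on a separate wire, so the buffer must be deep enough to cover the credit round-trip time if the channel is to stay fully utilised.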
Deadlock
- The deadlock problem
– A group of agents (e.g. packets) is unable to make progress as they are waiting on each other to release a resource (e.g. a buffer or channel)
– Disaster! Need to be careful even when the design looks simple (e.g. a ring network)
- Message-dependent deadlock
– External dependencies in a higher-level protocol may also contribute to deadlock
– Often need to make sure different message classes (e.g. request/reply messages) travel on different virtual channels
Deadlock example
Glass/Ni
Deadlock
- Techniques for deadlock avoidance
– Injection restriction
– Centralised/static scheduling
– End-to-end flow control
– Virtual channels
– For routing-related deadlock: the turn model, escape channels
- Deadlock recovery
– Deadlock might be infrequent and costly to avoid altogether; it may be easier just to detect it and recover
- e.g. drain the deadlocked network to another network, NACK requests, fall back on a simpler protocol (e.g. Culler, p.594)
Routing: The Basics
- A routing algorithm aims to maximise network throughput by balancing load across the network channels
– Latency is often important too; need to keep routes short
– Keep complexity low
- Cycle time and energy implications
- Can global network state information be exploited?
– Avoid deadlock
- Deterministic and adaptive routing
- Multicast and broadcast support?
Routing: Dimension Order Routing
- A simple deterministic minimal routing algorithm (aka e-cube routing)
– Route in one dimension at a time
– If there is a choice of directions in a dimension (e.g. for a torus), we first compute the distance that would have to be travelled in each case and select the shortest path
- The 2D version is also called XY routing
– Route in the X direction first, then the Y direction
- Restricts turns to avoid deadlock
– See the turn model (Dally/Towles, p.269)
– and west-first, north-last and negative-first routing
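A sketch of XY routing on a 2D mesh (illustrative Python; the E/W/N/S port names are our convention, not from the slides). Because a packet never turns from the Y dimension back into the X dimension, the channel dependency graph is cycle-free and routing deadlock cannot occur:

```python
# Dimension-order (XY) routing on a 2D mesh: traverse the X dimension
# completely before the Y dimension.

def xy_route(src, dst):
    """Return the list of output ports (E/W/N/S) taken from src to dst."""
    (x, y), (xd, yd) = src, dst
    hops = []
    while x != xd:                       # X dimension first
        step = 1 if xd > x else -1
        hops.append('E' if step == 1 else 'W')
        x += step
    while y != yd:                       # then Y dimension
        step = 1 if yd > y else -1
        hops.append('N' if step == 1 else 'S')
        y += step
    return hops

# e.g. xy_route((0, 0), (2, 1)) -> ['E', 'E', 'N']
```

The route is always minimal (its length is the Manhattan distance), but the lack of path diversity means XY routing cannot balance load for adversarial traffic patterns, which is where the routing limit of the earlier slide bites.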
Routing: XY Routing
Virtual-Channel Router Design
Virtual-Channel Router Design
(2006+) State-of-the-art routers use a single combined VC and switch allocator
Virtual-Channel Router Design
- 4x4 mesh network
- Single-cycle routers (incl. channel latency)
- Clock ~35 FO4
- Virtual-channel support
– 4 virtual channels per input port
- Clocking: H-tree or Distributed Clock Generator (DCG)
Lochside chip, Mullins, West, Moore (2005)
Fair allocation of resources to flows
- A flow is a stream of packets between a particular source and destination
- Locally fair allocation (round-robin on input ports) does not balance flow throughputs globally
- Solution: arbitrate between flows, not virtual channels [Banerjee, NOCS'09]; see also [Lee, ISCA'08]
Reproduced from [Lee, ISCA'08]
Many Networks
- The cost of replicating networks on-chip is much lower than off-chip. Is it advantageous to build multiple networks?
– A simple way to provide decoupled/isolated network resources (e.g. to provide QoS guarantees or to help predict performance)
- e.g. special-purpose low-latency synchronization networks
– Able to carefully tailor each network for its specific purpose
- e.g. use both static and dynamic networks
– see the MIT RAW/Tilera processor
– Save power/energy this way? e.g. provide both low-power and high-performance networks (steer messages to one network or the other depending on how critical they are)
– Increased serialization latency due to narrow channels
- But improved energy efficiency?
– Lower overall router area, but more control overhead
– Have to partition network buffers
In-Network Optimisations
- The on-chip network can collect information (in a distributed manner) that in turn can be used to optimise system performance
– A router can collect information from the packets it routes
– The routing of subsequent packets may be modified depending on the contents of previous packets
- This isn't just adaptive routing: the information collected isn't about the state of the network, but rather about the network clients or the system as a whole
– e.g. "In-network cache coherence" [Eisley/Peh/Shang, MICRO'06]
Conclusions
- A brief introduction to on-chip network design
– Read Dally/Towles and Duato/Yalamanchili/Ni for more details
– Conferences: NOCS/HPCA/ISCA/MICRO...
– Lots more to routing and deadlock issues in particular
- See the course wiki
- Very broad viable design space; a very active research area