HPCA: 1 Feb 12, 2007
Interconnect-Centric Computing
William J. Dally
Computer Systems Laboratory, Stanford University
HPCA Keynote, February 12, 2007
HPCA: 2 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 4 Feb 12, 2007
INs: Connect Processors in Clusters
IBM Blue Gene
HPCA: 5 Feb 12, 2007
and on chip
MIT RAW
HPCA: 6 Feb 12, 2007
Connect Processors to Memories in Systems
Cray Black Widow
HPCA: 7 Feb 12, 2007
and on chip
Texas TRIPS
HPCA: 8 Feb 12, 2007
provide the fabric for network Switches and Routers
Avici TSR
HPCA: 9 Feb 12, 2007
and connect I/O Devices
Brocade Switch
HPCA: 10 Feb 12, 2007
Group History: Routing Chips & Interconnection Networks
- Mars Router, Torus Routing Chip, Network Design
Frame, Reliable Router
- Basis for Intel, Cray/SGI, Mercury, Avici network chips
[Photos: MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)]
HPCA: 11 Feb 12, 2007
Group History: Parallel Computer Systems
- J-Machine (MDP) led to Cray T3D/T3E
- M-Machine (MAP)
– Fast messaging, scalable processing nodes, scalable memory architecture
- Imagine – basis for SPI
[Photos: MDP Chip, J-Machine, Cray T3D, MAP Chip, Imagine Chip]
HPCA: 12 Feb 12, 2007
Interconnection Networks are THE Central Component of Modern Computer Systems
- Processors are a commodity
– Performance no longer scaling (ILP mined out)
– Future growth is through CMPs - connected by INs
- Memory is a commodity
– Memory system performance determined by interconnect
- I/O systems are largely interconnect
- Embedded systems built using SoCs
– Standard components
– Connected by on-chip INs (OCINs)
HPCA: 13 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 14 Feb 12, 2007
Technology Trends…
[Chart: bandwidth per router node (Gb/s, log scale) vs. year, 1985-2010; data points from the Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, and Cray XT3 through YARC/BlackWidow]
HPCA: 15 Feb 12, 2007
High-Radix Router
HPCA: 16 Feb 12, 2007
High-Radix Router
[Diagram: low-radix router (small number of fat ports) vs. high-radix router (large number of skinny ports)]
HPCA: 17 Feb 12, 2007
Low-Radix vs. High-Radix Router
[Diagram: a 16-terminal network (inputs I0-I15, outputs O0-O15) built from low-radix routers vs. one built from high-radix routers]
Latency and cost: the low-radix network takes 4 hops and 96 channels; the high-radix network takes 2 hops and 32 channels
HPCA: 18 Feb 12, 2007
Latency
Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2kL/B
where k = radix, B = total router bandwidth, N = number of nodes, L = message size, t_r = per-hop router delay, and b = per-port bandwidth
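To make the model concrete, here is a minimal sketch of evaluating the latency expression above, assuming B is given in Gb/s and t_r in ns (so Gb/s × ns = bits); the function and parameter names, and the commented-out example numbers, are my own illustrations, not values from the talk.

```python
import math

def network_latency_ns(k, N, B_gbps, L_bits, t_r_ns):
    """Latency = header latency + serialization latency (per the slide's model).

    k        -- router radix (ports)
    N        -- number of network endpoints
    B_gbps   -- total router bandwidth (Gb/s; 1 Gb/s == 1 bit/ns)
    L_bits   -- message length (bits)
    t_r_ns   -- per-hop router delay (ns)
    """
    header = 2 * t_r_ns * math.log(N, k)        # H * t_r, with H = 2*log_k(N)
    serialization = 2 * k * L_bits / B_gbps     # L / b, with b = B / (2k)
    return header + serialization

# Illustrative use (made-up numbers):
# print(network_latency_ns(k=64, N=32768, B_gbps=2400, L_bits=512, t_r_ns=10))
```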
HPCA: 19 Feb 12, 2007
Latency vs. Radix
[Chart: latency (ns) vs. radix for 2003 and 2010 technology. The optimal radix is roughly 40 in 2003 technology and roughly 128 in 2010 technology: as radix grows, serialization latency increases while header latency decreases.]
HPCA: 20 Feb 12, 2007
Determining Optimal Radix
Latency = Header Latency + Serialization Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2kL/B
Setting the derivative with respect to k to zero, the optimal radix k satisfies
k·log²(k) = (B·t_r·log N) / L = A, the aspect ratio
where k = radix, B = total router bandwidth, N = number of nodes, L = message size
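A small sketch of solving the aspect-ratio relation numerically, assuming base-2 logs throughout and ignoring constant factors from the choice of log base; the helper names and the example numbers are my own, not from the talk.

```python
import math

def aspect_ratio(B_gbps, t_r_ns, N, L_bits):
    # B in Gb/s and t_r in ns keeps the ratio dimensionless (Gb/s * ns = bits).
    return B_gbps * t_r_ns * math.log2(N) / L_bits

def optimal_radix(A, lo=2.0, hi=4096.0):
    """Bisection on f(k) = k*(log2 k)^2 - A, which is increasing for k > 1."""
    f = lambda k: k * math.log2(k) ** 2 - A
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Illustrative use: a larger aspect ratio yields a larger optimal radix.
# print(optimal_radix(aspect_ratio(B_gbps=2400, t_r_ns=20, N=32768, L_bits=1024)))
```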
HPCA: 21 Feb 12, 2007
Higher Aspect Ratio, Higher Optimal Radix
[Chart: optimal radix k (log scale) vs. aspect ratio (log scale, 10 to 10000), with points for 1991, 1996, 2003, and 2010 technology]
HPCA: 22 Feb 12, 2007
High-Radix Topology
- Use high radix, k, to get low hop count
– H = log_k(N)
- Provide good performance on both benign and
adversarial traffic patterns
– Rules out butterfly networks - no path diversity
– Clos networks work well
- H = 2·log_k(N), less when paths short-circuit at a lower rank
– Cayley graphs have nice properties but are hard to route
HPCA: 23 Feb 12, 2007
Example radix-64 Clos Network
[Diagram: 1,024-endpoint folded Clos built from radix-64 routers. Rank-1 routers Y0-Y31 each connect 32 endpoints (BW0-BW31 on Y0, BW32-BW63 on Y1, ..., BW992-BW1023 on Y31); rank-2 routers Y32-Y63 connect the rank-1 routers.]
HPCA: 24 Feb 12, 2007
Flattened Butterfly Topology
HPCA: 25 Feb 12, 2007
Packaging the Flattened Butterfly
HPCA: 26 Feb 12, 2007
Packaging the Flattened Butterfly (2)
HPCA: 27 Feb 12, 2007
Cost
HPCA: 28 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 29 Feb 12, 2007
Routing in High-Radix Networks
- Adaptive routing avoids transient load imbalance
- Global adaptive routing balances load for adversarial
traffic
– Cost/performance of a butterfly on benign traffic and at low loads
– Cost/performance of a Clos on adversarial traffic
HPCA: 30 Feb 12, 2007
A Clos can statically load balance traffic using oblivious routing
[Diagram: the radix-64 folded Clos from the earlier slide; oblivious routing spreads traffic over the rank-2 routers]
HPCA: 31 Feb 12, 2007
Transient Imbalance
HPCA: 32 Feb 12, 2007
With Adaptive Routing
HPCA: 33 Feb 12, 2007
Latency for UR traffic
HPCA: 34 Feb 12, 2007
Flattened Butterfly Topology
[Diagram: flattened butterfly, nodes 0 through 7]
HPCA: 35 Feb 12, 2007
Flattened Butterfly Topology
[Diagram: flattened butterfly, nodes 0 through 7] What if node 0 sends all of its traffic to node 1?
HPCA: 36 Feb 12, 2007
Flattened Butterfly Topology
[Diagram: flattened butterfly, nodes 0 through 7] What if node 0 sends all of its traffic to node 1? How much traffic should we route over alternate paths?
HPCA: 37 Feb 12, 2007
Simpler Case: Ring of 8 Nodes, Traffic from Node 2 to Node 5
- Model: assume the queues form a network of independent M/D/1 queues
[Diagram: 8-node ring with the traffic from node 2 to node 5 split into a minimal flow x1 and a non-minimal flow x2]
- Total offered traffic x = x1 + x2
- Minimal-path delay = Dm(x1); non-minimal-path delay = Dnm(x2)
- Routing remains minimal as long as Dm'(x) ≤ Dnm'(0)
- Afterwards, route a fraction, x2, non-minimally such that Dm'(x1) = Dnm'(x2)
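A minimal numeric sketch of this model, assuming unit-service-time M/D/1 queues and hop counts of 3 (minimal) and 5 (non-minimal) for the 2-to-5 traffic on the 8-node ring; the function names and the bisection approach are my own construction, not code from the talk.

```python
# Per-hop M/D/1 delay with unit service time: d(rho) = 1 + rho / (2*(1 - rho)),
# so d'(rho) = 1 / (2*(1 - rho)^2). Path delays: Dm(x1) = Hm*d(x1), Dnm(x2) = Hnm*d(x2).

def d_prime(rho):
    return 1.0 / (2.0 * (1.0 - rho) ** 2)

def split_traffic(x_total, Hm=3, Hnm=5):
    """Return (x1, x2): minimal and non-minimal shares of offered load x_total.

    Stay minimal while Hm*d'(x_total) <= Hnm*d'(0); otherwise bisect for the
    split where the marginal delays match: Hm*d'(x1) = Hnm*d'(x2).
    """
    if Hm * d_prime(x_total) <= Hnm * d_prime(0.0):
        return x_total, 0.0
    lo, hi = 0.0, x_total                      # bisection on x2
    for _ in range(100):
        x2 = 0.5 * (lo + hi)
        gap = Hm * d_prime(x_total - x2) - Hnm * d_prime(x2)
        lo, hi = (x2, hi) if gap > 0 else (lo, x2)
    x2 = 0.5 * (lo + hi)
    return x_total - x2, x2

# Example: at high load, most traffic stays minimal and a fraction goes the long way.
# print(split_traffic(0.9))
```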
HPCA: 38 Feb 12, 2007
Traffic divides to balance delay; load is balanced at saturation
[Chart: accepted throughput vs. offered load (fraction of capacity), with model curves for overall, minimal, and non-minimal traffic]
HPCA: 39 Feb 12, 2007
Channel-Queue Routing
- Estimate delay per hop by local queue length Qi
- Overall latency of route i estimated by Li ≈ Qi · Hi
- Route each packet on route with lowest estimated Li
- Works extremely well in practice
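A sketch of the routing decision this describes, assuming each candidate route is summarized by its local output-queue length Q and its hop count H; the tie-breaking rule and all names are my own illustration, not the actual router logic.

```python
import random

def choose_route(candidates):
    """candidates: list of (queue_length_Q, hop_count_H) for each admissible route.

    Returns the index of the route with the lowest estimate L = Q * H,
    breaking ties randomly so equally good routes share the load.
    """
    best = min(Q * H for Q, H in candidates)
    ties = [i for i, (Q, H) in enumerate(candidates) if Q * H == best]
    return random.choice(ties)

# Example: a congested 2-hop minimal route vs. a lightly loaded 4-hop non-minimal route.
# print(choose_route([(9, 2), (1, 4)]))   # -> 1 (the non-minimal route)
```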
HPCA: 40 Feb 12, 2007
Performance on UR Traffic
HPCA: 41 Feb 12, 2007
Performance on WC Traffic
HPCA: 42 Feb 12, 2007
Allocator Design Matters
HPCA: 43 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 44 Feb 12, 2007
Putting It All Together: The Cray BlackWidow Network
In collaboration with Steve Scott and Dennis Abts (Cray Inc.)
HPCA: 45 Feb 12, 2007
Cray Black Widow
- Shared-memory vector parallel computer
- Up to 32K nodes
- Vector processor per node
- Shared memory across nodes
HPCA: 46 Feb 12, 2007
Black Widow Topology
- Up to 32K nodes in a 3-level
folded Clos
- Each node has four 18.75 Gb/s channels, one to each of 4 network slices
HPCA: 47 Feb 12, 2007
YARC: Yet Another Router Chip
- 64 Ports
- Each port is 18.75 Gb/s (3 x 6.25Gb/s links)
- Table-driven routing
- Fault tolerance
– CRC with link-level retry
– Graceful degradation of links: 3 bits → 2 bits → 1 bit → OTS
HPCA: 48 Feb 12, 2007
YARC Microarchitecture
- Regular 8x8 array of tiles
– Easy to lay out chip
- No global arbitration
– All decisions local
- Simple routing
- Hierarchical organization (sketched below)
– Input buffers
– Row buffers
– Column buffers
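A rough sketch of how this hierarchical organization could compose into a route across the 8×8 tile array, under the assumption that a packet crosses its input tile's row to the output port's column and then that column to the output tile; this is my reading of the organization, not the actual YARC control logic.

```python
TILES_PER_SIDE = 8

def tile_of_port(port):
    """Map a port 0..63 onto a (row, col) tile in the 8x8 array."""
    return port // TILES_PER_SIDE, port % TILES_PER_SIDE

def hierarchical_path(in_port, out_port):
    """Input buffer at the input tile -> row buffer in the output's column
    -> column buffer at the output tile. Each step needs only local state."""
    in_row, _ = tile_of_port(in_port)
    out_row, out_col = tile_of_port(out_port)
    return [("input_buffer", tile_of_port(in_port)),
            ("row_buffer", (in_row, out_col)),      # traverse the input tile's row
            ("column_buffer", (out_row, out_col))]  # then the output tile's column

# Example: print(hierarchical_path(in_port=3, out_port=52))
```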
HPCA: 49 Feb 12, 2007
A Closer Look at a Tile
- No global arbitration
- Non-blocking with an 8x
internal speedup in subswitch
- Simple routing
– Small 8-entry routing table per tile
– High routing throughput for small packets
HPCA: 50 Feb 12, 2007
YARC Implementation
- Implemented in a 90nm
CMOS standard-cell ASIC technology
- 192 SerDes on the chip (64 ports × 3 bits per port)
- 6.25 Gbaud data rate
- Estimated power
- 80 W (idle)
- 87 W (peak)
- 17mm x 17mm die
HPCA: 52 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 53 Feb 12, 2007
Much of the future is on-chip (CMP, SoC, Operand)
[Figure: technology roadmap spanning 2006 through 2015]
HPCA: 54 Feb 12, 2007
On-Chip Networks are Fundamentally Different
- Different cost model
– Wires plentiful, no pin constraints
– Buffers expensive (consume die area)
– Slow signal propagation
- Different usage patterns
– Particularly for SoCs
- Significant isochronous traffic
- Hard RT constraints
- Different design problems
– Floorplans
– Energy-efficient transmission circuits
HPCA: 55 Feb 12, 2007
NSF Workshop Identified 3 Critical Issues
- Power
– With current approaches, OCINs consume roughly 10x their allowable power
- Circuit and architecture innovations can close this gap
- Latency
– OCIN latency currently not competitive with buses and dedicated wiring
- Novel flow-control strategies required
- Tool Integration
– OCINs need to be integrated with standard tool flows to enable widespread use
HPCA: 56 Feb 12, 2007
The Road Ahead
- INs become an even more dominant system component
– Number of processors goes up, cost of processors decreases
– Communication dominates performance and cost
– From hand-held media UI devices to huge data centers
- Technology drives topology in new directions
– On-chip, short-reach electrical (10m), optical
– Expect radix to continue to increase
– Hybrid topologies to match each packaging level
- Latency will approach that of dedicated wiring
– Better flow control and router architecture
– Optimized circuits
- Adaptivity will optimize performance
– Balance load, route around defects, tolerate variation, tune power to load
HPCA: 57 Feb 12, 2007
Summary
- Interconnection Networks (INs) are THE central component of modern
computing systems
- High-radix topologies have evolved to exploit packaging/signaling
technology
– Including hybrid optical/electrical
– Flattened Butterfly
- Global adaptive routing balances load and enables advanced topologies
– Eliminate transient load imbalance
– Use local queues to estimate global congestion
- Cray Black Widow - an example high-radix network
- On-Chip INs
– Very different constraints
– Three “gaps” identified: power, latency, tools
- The road ahead
– Lots of room for improvement, INs are in their infancy
HPCA: 58 Feb 12, 2007
Some very good books
HPCA: 59 Feb 12, 2007
Backup
HPCA: 60 Feb 12, 2007
Virtual Channel Router Architecture
[Diagram: a virtual-channel router with k inputs and k outputs. Each input holds v virtual-channel buffers (VC 1 … VC v); a routing computation unit, a VC allocator, and a switch allocator control a crossbar switch connecting Input 1 … Input k to Output 1 … Output k.]
HPCA: 61 Feb 12, 2007
Baseline Performance Evaluation
[Chart: latency (cycles) vs. offered load (0 to 1) for the low-radix baseline]
HPCA: 62 Feb 12, 2007
Baseline Performance Evaluation
[Chart: latency (cycles) vs. offered load (0 to 1)]