HPCA: 1 Feb 12, 2007
Interconnect-Centric Computing
William J. Dally
Computer Systems Laboratory, Stanford University
HPCA Keynote, February 12, 2007
HPCA: 2 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 4 Feb 12, 2007
INs: Connect Processors in Clusters
IBM Blue Gene
HPCA: 5 Feb 12, 2007
and on chip
MIT RAW
HPCA: 6 Feb 12, 2007
Connect Processors to Memories in Systems
Cray Black Widow
HPCA: 7 Feb 12, 2007
and on chip
Texas TRIPS
HPCA: 8 Feb 12, 2007
provide the fabric for network Switches and Routers
Avici TSR
HPCA: 9 Feb 12, 2007
and connect I/O Devices
Brocade Switch
HPCA: 10 Feb 12, 2007
Group History: Routing Chips & Interconnection Networks
- Mars Router, Torus Routing Chip, Network Design
Frame, Reliable Router
- Basis for Intel, Cray/SGI, Mercury, Avici network chips
[Photos: MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)]
HPCA: 11 Feb 12, 2007
Group History: Parallel Computer Systems
- J-Machine (MDP) led to Cray T3D/T3E
- M-Machine (MAP)
– Fast messaging, scalable processing nodes, scalable memory architecture
- Imagine – basis for SPI
[Photos: MDP Chip, J-Machine, Cray T3D, MAP Chip, Imagine Chip]
HPCA: 12 Feb 12, 2007
Interconnection Networks are THE Central Component of Modern Computer Systems
- Processors are a commodity
– Performance no longer scaling (ILP mined out)
– Future growth is through CMPs - connected by INs
- Memory is a commodity
– Memory system performance determined by interconnect
- I/O systems are largely interconnect
- Embedded systems built using SoCs
– Standard components
– Connected by on-chip INs (OCINs)
HPCA: 13 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 14 Feb 12, 2007
Technology Trends…
[Chart: bandwidth per router node (Gb/s, log scale) vs. year, 1985-2010; data points from the Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, and Cray XT3 through YARC/BlackWidow]
HPCA: 15 Feb 12, 2007
High-Radix Router
HPCA: 16 Feb 12, 2007
High-Radix Router
[Diagram: low-radix router (small number of fat ports) vs. high-radix router (large number of skinny ports)]
HPCA: 17 Feb 12, 2007
Low-Radix vs. High-Radix Router
[Diagram: a 16-terminal network (inputs I0-I15, outputs O0-O15) built from low-radix routers vs. one built from high-radix routers]
Latency and cost: the low-radix network takes 4 hops and 96 channels; the high-radix network takes 2 hops and 32 channels
HPCA: 18 Feb 12, 2007
Latency
Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2kL/B
where k = radix, B = total router bandwidth, N = number of nodes, L = message size, t_r = per-hop router delay, and b = per-port bandwidth
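To make the model concrete, here is a minimal sketch of evaluating the latency expression above, assuming B is given in Gb/s and t_r in ns (so Gb/s × ns = bits); the function and parameter names, and the commented-out example numbers, are my own illustrations, not values from the talk.

```python
import math

def network_latency_ns(k, N, B_gbps, L_bits, t_r_ns):
    """Latency = header latency + serialization latency (per the slide's model).

    k        -- router radix (ports)
    N        -- number of network endpoints
    B_gbps   -- total router bandwidth (Gb/s; 1 Gb/s == 1 bit/ns)
    L_bits   -- message length (bits)
    t_r_ns   -- per-hop router delay (ns)
    """
    header = 2 * t_r_ns * math.log(N, k)        # H * t_r, with H = 2*log_k(N)
    serialization = 2 * k * L_bits / B_gbps     # L / b, with b = B / (2k)
    return header + serialization

# Illustrative use (made-up numbers):
# print(network_latency_ns(k=64, N=32768, B_gbps=2400, L_bits=512, t_r_ns=10))
```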
HPCA: 19 Feb 12, 2007
Latency vs. Radix
[Chart: latency (ns) vs. radix for 2003 and 2010 technology. The optimal radix is roughly 40 in 2003 technology and roughly 128 in 2010 technology: as radix grows, serialization latency increases while header latency decreases.]
HPCA: 20 Feb 12, 2007
Determining Optimal Radix
Latency = Header Latency + Serialization Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2kL/B
Setting the derivative with respect to k to zero, the optimal radix k satisfies
k·log²(k) = (B·t_r·log N) / L = A, the aspect ratio
where k = radix, B = total router bandwidth, N = number of nodes, L = message size
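A small sketch of solving the aspect-ratio relation numerically, assuming base-2 logs throughout and ignoring constant factors from the choice of log base; the helper names and the example numbers are my own, not from the talk.

```python
import math

def aspect_ratio(B_gbps, t_r_ns, N, L_bits):
    # B in Gb/s and t_r in ns keeps the ratio dimensionless (Gb/s * ns = bits).
    return B_gbps * t_r_ns * math.log2(N) / L_bits

def optimal_radix(A, lo=2.0, hi=4096.0):
    """Bisection on f(k) = k*(log2 k)^2 - A, which is increasing for k > 1."""
    f = lambda k: k * math.log2(k) ** 2 - A
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Illustrative use: a larger aspect ratio yields a larger optimal radix.
# print(optimal_radix(aspect_ratio(B_gbps=2400, t_r_ns=20, N=32768, L_bits=1024)))
```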
HPCA: 21 Feb 12, 2007
Higher Aspect Ratio, Higher Optimal Radix
[Chart: optimal radix k (log scale) vs. aspect ratio (log scale, 10 to 10000), with points for 1991, 1996, 2003, and 2010 technology]
HPCA: 22 Feb 12, 2007
High-Radix Topology
- Use high radix, k, to get low hop count
– H = log_k(N)
- Provide good performance on both benign and
adversarial traffic patterns
– Rules out butterfly networks - no path diversity
– Clos networks work well
- H = 2·log_k(N), less when paths short-circuit at a lower rank
– Cayley graphs have nice properties but are hard to route
HPCA: 23 Feb 12, 2007
Example radix-64 Clos Network
[Diagram: 1,024-endpoint folded Clos built from radix-64 routers. Rank-1 routers Y0-Y31 each connect 32 endpoints (BW0-BW31 on Y0, BW32-BW63 on Y1, ..., BW992-BW1023 on Y31); rank-2 routers Y32-Y63 connect the rank-1 routers.]
HPCA: 24 Feb 12, 2007
Flattened Butterfly Topology
HPCA: 25 Feb 12, 2007
Packaging the Flattened Butterfly
HPCA: 26 Feb 12, 2007
Packaging the Flattened Butterfly (2)
HPCA: 27 Feb 12, 2007
Cost
HPCA: 28 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 29 Feb 12, 2007
Routing in High-Radix Networks
- Adaptive routing avoids transient load imbalance
- Global adaptive routing balances load for adversarial
traffic
– Cost/performance of a butterfly on benign traffic and at low loads
– Cost/performance of a Clos on adversarial traffic
HPCA: 30 Feb 12, 2007
A Clos can statically load balance traffic using oblivious routing
[Diagram: the radix-64 folded Clos from the earlier slide; oblivious routing spreads traffic over the rank-2 routers]
HPCA: 31 Feb 12, 2007
Transient Imbalance
HPCA: 32 Feb 12, 2007
With Adaptive Routing
HPCA: 33 Feb 12, 2007
Latency for UR traffic
HPCA: 34 Feb 12, 2007
Flattened Butterfly Topology
[Diagram: flattened butterfly, nodes 0 through 7]
HPCA: 35 Feb 12, 2007
Flattened Butterfly Topology
[Diagram: flattened butterfly, nodes 0 through 7] What if node 0 sends all of its traffic to node 1?
HPCA: 36 Feb 12, 2007
Flattened Butterfly Topology
[Diagram: flattened butterfly, nodes 0 through 7] What if node 0 sends all of its traffic to node 1? How much traffic should we route over alternate paths?
HPCA: 37 Feb 12, 2007
Simpler Case: Ring of 8 Nodes, Traffic from Node 2 to Node 5
- Model: assume the queues form a network of independent M/D/1 queues
[Diagram: 8-node ring with the traffic from node 2 to node 5 split into a minimal flow x1 and a non-minimal flow x2]
- Total offered traffic x = x1 + x2
- Minimal-path delay = Dm(x1); non-minimal-path delay = Dnm(x2)
- Routing remains minimal as long as Dm'(x) ≤ Dnm'(0)
- Afterwards, route a fraction, x2, non-minimally such that Dm'(x1) = Dnm'(x2)
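A minimal numeric sketch of this model, assuming unit-service-time M/D/1 queues and hop counts of 3 (minimal) and 5 (non-minimal) for the 2-to-5 traffic on the 8-node ring; the function names and the bisection approach are my own construction, not code from the talk.

```python
# Per-hop M/D/1 delay with unit service time: d(rho) = 1 + rho / (2*(1 - rho)),
# so d'(rho) = 1 / (2*(1 - rho)^2). Path delays: Dm(x1) = Hm*d(x1), Dnm(x2) = Hnm*d(x2).

def d_prime(rho):
    return 1.0 / (2.0 * (1.0 - rho) ** 2)

def split_traffic(x_total, Hm=3, Hnm=5):
    """Return (x1, x2): minimal and non-minimal shares of offered load x_total.

    Stay minimal while Hm*d'(x_total) <= Hnm*d'(0); otherwise bisect for the
    split where the marginal delays match: Hm*d'(x1) = Hnm*d'(x2).
    """
    if Hm * d_prime(x_total) <= Hnm * d_prime(0.0):
        return x_total, 0.0
    lo, hi = 0.0, x_total                      # bisection on x2
    for _ in range(100):
        x2 = 0.5 * (lo + hi)
        gap = Hm * d_prime(x_total - x2) - Hnm * d_prime(x2)
        lo, hi = (x2, hi) if gap > 0 else (lo, x2)
    x2 = 0.5 * (lo + hi)
    return x_total - x2, x2

# Example: at high load, most traffic stays minimal and a fraction goes the long way.
# print(split_traffic(0.9))
```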
HPCA: 38 Feb 12, 2007
Traffic divides to balance delay; load is balanced at saturation
[Chart: accepted throughput vs. offered load (fraction of capacity), with model curves for overall, minimal, and non-minimal traffic]
HPCA: 39 Feb 12, 2007
Channel-Queue Routing
- Estimate delay per hop by local queue length Qi
- Overall latency of route i estimated by Li ≈ Qi · Hi
- Route each packet on route with lowest estimated Li
- Works extremely well in practice
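A sketch of the routing decision this describes, assuming each candidate route is summarized by its local output-queue length Q and its hop count H; the tie-breaking rule and all names are my own illustration, not the actual router logic.

```python
import random

def choose_route(candidates):
    """candidates: list of (queue_length_Q, hop_count_H) for each admissible route.

    Returns the index of the route with the lowest estimate L = Q * H,
    breaking ties randomly so equally good routes share the load.
    """
    best = min(Q * H for Q, H in candidates)
    ties = [i for i, (Q, H) in enumerate(candidates) if Q * H == best]
    return random.choice(ties)

# Example: a congested 2-hop minimal route vs. a lightly loaded 4-hop non-minimal route.
# print(choose_route([(9, 2), (1, 4)]))   # -> 1 (the non-minimal route)
```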
HPCA: 40 Feb 12, 2007
Performance on UR Traffic
HPCA: 41 Feb 12, 2007
Performance on WC Traffic
HPCA: 42 Feb 12, 2007
Allocator Design Matters
HPCA: 43 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 44 Feb 12, 2007
Putting It All Together: The Cray BlackWidow Network
In collaboration with Steve Scott and Dennis Abts (Cray Inc.)
HPCA: 45 Feb 12, 2007
Cray Black Widow
- Shared-memory vector parallel computer
- Up to 32K nodes
- Vector processor per node
- Shared memory across nodes
HPCA: 46 Feb 12, 2007
Black Widow Topology
- Up to 32K nodes in a 3-level
folded Clos
- Each node has four 18.75 Gb/s channels, one to each of 4 network slices
HPCA: 47 Feb 12, 2007
YARC: Yet Another Router Chip
- 64 Ports
- Each port is 18.75 Gb/s (3 x 6.25Gb/s links)
- Table-driven routing
- Fault tolerance
– CRC with link-level retry
– Graceful degradation of links: 3 bits → 2 bits → 1 bit → OTS
HPCA: 48 Feb 12, 2007
YARC Microarchitecture
- Regular 8x8 array of tiles
– Easy to lay out chip
- No global arbitration
– All decisions local
- Simple routing
- Hierarchical organization (sketched below)
– Input buffers
– Row buffers
– Column buffers
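A rough sketch of how this hierarchical organization could compose into a route across the 8×8 tile array, under the assumption that a packet crosses its input tile's row to the output port's column and then that column to the output tile; this is my reading of the organization, not the actual YARC control logic.

```python
TILES_PER_SIDE = 8

def tile_of_port(port):
    """Map a port 0..63 onto a (row, col) tile in the 8x8 array."""
    return port // TILES_PER_SIDE, port % TILES_PER_SIDE

def hierarchical_path(in_port, out_port):
    """Input buffer at the input tile -> row buffer in the output's column
    -> column buffer at the output tile. Each step needs only local state."""
    in_row, _ = tile_of_port(in_port)
    out_row, out_col = tile_of_port(out_port)
    return [("input_buffer", tile_of_port(in_port)),
            ("row_buffer", (in_row, out_col)),      # traverse the input tile's row
            ("column_buffer", (out_row, out_col))]  # then the output tile's column

# Example: print(hierarchical_path(in_port=3, out_port=52))
```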
HPCA: 49 Feb 12, 2007
A Closer Look at a Tile
- No global arbitration
- Non-blocking with an 8x
internal speedup in subswitch
- Simple routing
– Small 8-entry routing table per tile
– High routing throughput for small packets
HPCA: 50 Feb 12, 2007
YARC Implementation
- Implemented in a 90nm
CMOS standard-cell ASIC technology
- 192 SerDes on the chip (64 ports × 3 bits per port)
- 6.25 Gbaud data rate
- Estimated power
- 80 W (idle)
- 87 W (peak)
- 17mm x 17mm die
HPCA: 52 Feb 12, 2007
Outline
- Interconnection Networks (INs) are THE central
component of modern computer systems
- Topology driven to high-radix by packaging
technology
- Global adaptive routing balances load - and enables
efficient topologies
- Case study, the Cray Black Widow
- On-Chip Interconnection Networks (OCINs) face
unique challenges
- The road ahead…
HPCA: 53 Feb 12, 2007
Much of the future is on-chip (CMP, SoC, Operand)
[Figure: technology roadmap spanning 2006 through 2015]
HPCA: 54 Feb 12, 2007
On-Chip Networks are Fundamentally Different
- Different cost model
– Wires plentiful, no pin constraints
– Buffers expensive (consume die area)
– Slow signal propagation
- Different usage patterns
– Particularly for SoCs
- Significant isochronous traffic
- Hard RT constraints
- Different design problems
– Floorplans
– Energy-efficient transmission circuits
HPCA: 55 Feb 12, 2007
NSF Workshop Identified 3 Critical Issues
- Power
– With current approaches, OCINs consume roughly 10x their allowable power
- Circuit and architecture innovations can close this gap
- Latency
– OCIN latency currently not competitive with buses and dedicated wiring
- Novel flow-control strategies required
- Tool Integration
– OCINs need to be integrated with standard tool flows to enable widespread use
HPCA: 56 Feb 12, 2007
The Road Ahead
- INs become an even more dominant system component
– Number of processors goes up, cost of processors decreases
– Communication dominates performance and cost
– From hand-held media UI devices to huge data centers
- Technology drives topology in new directions
– On-chip, short-reach electrical (10m), optical
– Expect radix to continue to increase
– Hybrid topologies to match each packaging level
- Latency will approach that of dedicated wiring
– Better flow control and router architecture
– Optimized circuits
- Adaptivity will optimize performance
– Balance load, route around defects, tolerate variation, tune power to load
HPCA: 57 Feb 12, 2007
Summary
- Interconnection Networks (INs) are THE central component of modern
computing systems
- High-radix topologies have evolved to exploit packaging/signaling
technology
– Including hybrid optical/electrical
– Flattened Butterfly
- Global adaptive routing balances load and enables advanced topologies
– Eliminate transient load imbalance
– Use local queues to estimate global congestion
- Cray Black Widow - an example high-radix network
- On-Chip INs
– Very different constraints
– Three “gaps” identified: power, latency, tools
- The road ahead
– Lots of room for improvement, INs are in their infancy
HPCA: 58 Feb 12, 2007
Some very good books
HPCA: 59 Feb 12, 2007
Backup
HPCA: 60 Feb 12, 2007
Virtual Channel Router Architecture
[Diagram: a virtual-channel router with k inputs and k outputs. Each input holds v virtual-channel buffers (VC 1 … VC v); a routing computation unit, a VC allocator, and a switch allocator control a crossbar switch connecting Input 1 … Input k to Output 1 … Output k.]
HPCA: 61 Feb 12, 2007
Baseline Performance Evaluation
[Chart: latency (cycles) vs. offered load (0 to 1) for the low-radix baseline]
HPCA: 62 Feb 12, 2007
Baseline Performance Evaluation
[Chart: latency (cycles) vs. offered load (0 to 1)]