
Interconnect-Centric Computing, William J. Dally, Computer Systems - PowerPoint PPT Presentation



  1. Interconnect-Centric Computing William J. Dally Computer Systems Laboratory Stanford University HPCA Keynote February 12, 2007 HPCA: 1 Feb 12, 2007

  2. Outline • Interconnection Networks (INs) are THE central component of modern computer systems • Topology driven to high-radix by packaging technology • Global adaptive routing balances load - and enables efficient topologies • Case study, the Cray Black Widow • On-Chip Interconnection Networks (OCINs) face unique challenges • The road ahead… HPCA: 2 Feb 12, 2007

  3. Outline • Interconnection Networks (INs) are THE central component of modern computer systems • Topology driven to high-radix by packaging technology • Global adaptive routing balances load - and enables efficient topologies • Case study, the Cray Black Widow • On-Chip Interconnection Networks (OCINs) face unique challenges • The road ahead… HPCA: 3 Feb 12, 2007

  4. INs: Connect Processors in Clusters IBM Blue Gene Feb 12, 2007 HPCA: 4

  5. and on chip MIT RAW Feb 12, 2007 HPCA: 5

  6. Connect Processors to Memories in Systems Cray Black Widow Feb 12, 2007 HPCA: 6

  7. and on chip Texas TRIPS Feb 12, 2007 HPCA: 7

  8. provide the fabric for network Switches and Routers Avici TSR Feb 12, 2007 HPCA: 8

  9. and connect I/O Devices Brocade Switch Feb 12, 2007 HPCA: 9

  10. Group History: Routing Chips & Interconnection Networks • MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994) • Basis for Intel, Cray/SGI, Mercury, and Avici network chips HPCA: 10 Feb 12, 2007

  11. Group History: Parallel Computer Systems • J-Machine (MDP) led to Cray T3D/T3E • M-Machine (MAP) – fast messaging, scalable processing nodes, scalable memory architecture • Imagine – basis for SPI [Photos: Imagine chip, MDP chip, J-Machine, MAP chip, Cray T3D] HPCA: 11 Feb 12, 2007

  12. Interconnection Networks are THE Central Component of Modern Computer Systems • Processors are a commodity – Performance no longer scaling (ILP mined out) – Future growth is through CMPs, connected by INs • Memory is a commodity – Memory system performance determined by interconnect • I/O systems are largely interconnect • Embedded systems built using SoCs – Standard components – Connected by on-chip INs (OCINs) HPCA: 12 Feb 12, 2007

  13. Outline • Interconnection Networks (INs) are THE central component of modern computer systems • Topology driven to high-radix by packaging technology • Global adaptive routing balances load - and enables efficient topologies • Case study, the Cray Black Widow • On-Chip Interconnection Networks (OCINs) face unique challenges • The road ahead… HPCA: 13 Feb 12, 2007

  14. Technology Trends… [Plot: bandwidth per router node (Gb/s), 0.1 to 10,000 on a log scale, versus year, 1985 to 2010. Data points include the Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, BlackWidow, and YARC.] Feb 12, 2007 HPCA: 14

  15. High-Radix Router [Diagram: routers connected by channels.] Feb 12, 2007 HPCA: 15

  16. High-Radix Router [Diagram: low-radix routers with a small number of fat ports vs. high-radix routers with a large number of skinny ports.] Feb 12, 2007 HPCA: 16

  17. Low-Radix vs. High-Radix Router [Diagram: a 16-terminal network, inputs I0–I15 to outputs O0–O15, built from low-radix routers versus high-radix routers.] Latency: 4 hops (low-radix) vs. 2 hops (high-radix). Cost: 96 channels (low-radix) vs. 32 channels (high-radix). Feb 12, 2007 HPCA: 17

  18. Latency: Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2kL/B, where k = radix, B = total router bandwidth, N = number of nodes, L = message size. Feb 12, 2007 HPCA: 18
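To make the latency model above concrete, here is a minimal Python sketch that evaluates T(k) = 2·t_r·log_k(N) + 2kL/B for a few radices. This is not from the talk; the parameter values (node count, message length, total router bandwidth, per-hop delay) are illustrative assumptions.

```python
import math

def network_latency(k, N, L, B, t_r):
    """Latency model from the slide: T = H*t_r + L/b, with H = 2*log_k(N)
    router hops and a per-port bandwidth b such that L/b = 2*k*L/B."""
    header = 2 * t_r * math.log(N, k)   # header (hop) latency
    serialization = 2 * k * L / B       # serialization latency
    return header + serialization

# Illustrative, assumed parameters: 1024 nodes, 1 Kbit messages,
# 1 Tb/s of total router bandwidth, 20 ns of delay per router hop.
for k in (8, 16, 32, 64, 128):
    print(k, network_latency(k, N=1024, L=1000, B=1e12, t_r=20e-9))
```

Raising k shrinks the header term (fewer hops) but grows the serialization term (skinnier ports), which is why an optimal radix exists.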

  19. Latency vs. Radix [Plot: latency (nsec) vs. radix for 2003 and 2010 technology; header latency decreases with radix while serialization latency increases. Optimal radix ~40 for 2003 technology, ~128 for 2010 technology.] Feb 12, 2007 HPCA: 19

  20. Determining Optimal Radix: Latency = Header Latency + Serialization Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2kL/B, where k = radix, B = total router bandwidth, N = number of nodes, L = message size. Setting dLatency/dk = 0 gives the optimal radix: k·log²k = (B·t_r·log N) / L = Aspect Ratio. HPCA: 20 Feb 12, 2007
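One hedged way to see where that optimum lands is to minimize T(k) numerically over integer radices; setting dT/dk = 0 analytically recovers the closed-form condition above. The parameters below are again illustrative assumptions, not the aspect ratios behind the ~40 and ~128 optima on the previous slide.

```python
import math

def network_latency(k, N, L, B, t_r):
    # T(k) = 2*t_r*log_k(N) + 2*k*L/B, as on the previous slide
    return 2 * t_r * math.log(N, k) + 2 * k * L / B

def optimal_radix(N, L, B, t_r, k_max=1024):
    """Return the integer radix k that minimizes T(k); the minimum satisfies
    the slide's condition k*log^2(k) = (B*t_r*log N)/L = aspect ratio."""
    return min(range(2, k_max + 1),
               key=lambda k: network_latency(k, N, L, B, t_r))

# Illustrative, assumed parameters only.
print(optimal_radix(N=1024, L=1000, B=1e12, t_r=20e-9))
```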

  21. Higher Aspect Ratio, Higher Optimal Radix [Plot: optimal radix k (1–1000, log scale) vs. aspect ratio (10–10,000, log scale), with points for 1991, 1996, 2003, and 2010 technology.] Feb 12, 2007 HPCA: 21

  22. High-Radix Topology • Use high radix, k, to get low hop count – H = log_k(N) • Provide good performance on both benign and adversarial traffic patterns – Rules out butterfly networks: no path diversity – Clos networks work well: H = 2·log_k(N), with short-circuit routes – Cayley graphs have nice properties but are hard to route HPCA: 22 Feb 12, 2007

  23. Example radix-64 Clos Network [Diagram: rank-2 routers Y0–Y63 above rank-1 routers, connecting 1024 endpoints BW0–BW1023.] Feb 12, 2007 HPCA: 23

  24. Flattened Butterfly Topology Feb 12, 2007 HPCA: 24

  25. Packaging the Flattened Butterfly Feb 12, 2007 HPCA: 25

  26. Packaging the Flattened Butterfly (2) Feb 12, 2007 HPCA: 26

  27. Cost Feb 12, 2007 HPCA: 27

  28. Outline • Interconnection Networks (INs) are THE central component of modern computer systems • Topology driven to high-radix by packaging technology • Global adaptive routing balances load - and enables efficient topologies • Case study, the Cray Black Widow • On-Chip Interconnection Networks (OCINs) face unique challenges • The road ahead… HPCA: 28 Feb 12, 2007

  29. Routing in High-Radix Networks • Adaptive routing avoids transient load imbalance • Global adaptive routing balances load for adversarial traffic – Cost/performance of a butterfly on benign traffic and at low loads – Cost/performance of a Clos on adversarial traffic HPCA: 29 Feb 12, 2007

  30. A Clos can statically load balance traffic using oblivious routing [Same radix-64 Clos diagram as slide 23: rank-2 routers Y0–Y63, rank-1 routers, endpoints BW0–BW1023.] Feb 12, 2007 HPCA: 30

  31. Transient Imbalance Feb 12, 2007 HPCA: 31

  32. With Adaptive Routing Feb 12, 2007 HPCA: 32

  33. Latency for UR (uniform random) traffic Feb 12, 2007 HPCA: 33

  34. Flattened Butterfly Topology [Diagram: nodes 0–7.] Feb 12, 2007 HPCA: 34

  35. Flattened Butterfly Topology [Diagram: nodes 0–7.] What if node 0 sends all of its traffic to node 1? Feb 12, 2007 HPCA: 35

  36. Flattened Butterfly Topology [Diagram: nodes 0–7.] What if node 0 sends all of its traffic to node 1? How much traffic should we route over alternate paths? Feb 12, 2007 HPCA: 36

  37. Simpler Case: a ring of 8 nodes, sending traffic from node 2 to node 5 [Diagram: 8-node ring, nodes 0–7; minimal path 2→3→4→5 carrying x_1, non-minimal path 2→1→0→7→6→5 carrying x_2.] • Model: treat the channels as a network of independent M/D/1 queues, with total load λ = x_1 + x_2; minimal path delay D_m(x_1), non-minimal path delay D_nm(x_2) • Routing remains minimal as long as D_m′(λ) ≤ D_nm′(0) • Afterwards, route a fraction, x_2, non-minimally such that D_m′(x_1) = D_nm′(x_2) HPCA: 37 Feb 12, 2007
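The balancing rule above can be turned into a small numerical sketch. The assumptions here are mine, not the talk's: each channel is an independent M/D/1 queue with unit service time, the minimal path (2→3→4→5) crosses 3 channels, and the non-minimal path (2→1→0→7→6→5) crosses 5.

```python
H_MIN, H_NONMIN = 3, 5   # hop counts, as read from the ring figure (assumed)

def mdl_delay_slope(rho):
    """Derivative of the M/D/1 sojourn time T(rho) = 1 + rho/(2*(1 - rho))
    with respect to the channel load rho (unit service time)."""
    return 1.0 / (2.0 * (1.0 - rho) ** 2)

def split_traffic(lam, iters=100):
    """Split offered load lam into (x1 minimal, x2 non-minimal) so that the
    marginal path delays match: H_MIN*T'(x1) = H_NONMIN*T'(x2)."""
    # Stay entirely minimal while D_m'(lam) <= D_nm'(0).
    if H_MIN * mdl_delay_slope(lam) <= H_NONMIN * mdl_delay_slope(0.0):
        return lam, 0.0
    lo, hi = 0.0, lam                    # bisect on x2, the non-minimal share
    for _ in range(iters):
        x2 = (lo + hi) / 2
        if H_MIN * mdl_delay_slope(lam - x2) > H_NONMIN * mdl_delay_slope(x2):
            lo = x2                      # minimal path still costlier at the margin
        else:
            hi = x2
    x2 = (lo + hi) / 2
    return lam - x2, x2

for lam in (0.1, 0.3, 0.5):
    print(lam, split_traffic(lam))       # non-minimal share stays 0 at low load
```

With these assumptions, traffic stays entirely minimal up to an offered load of roughly 0.23 of channel capacity and then begins spilling onto the longer path, qualitatively matching the next slide.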

  38. Traffic divides to balance delay [Plot: accepted throughput vs. offered load (fraction of capacity) for the model's overall, minimal, and non-minimal traffic; the load is balanced across paths at saturation.] Feb 12, 2007 HPCA: 38

  39. Channel-Queue Routing • Estimate delay per hop by the local queue length Q_i • Overall latency estimated by – L_i ~ Q_i·H_i • Route each packet on the route with the lowest estimated L_i • Works extremely well in practice HPCA: 39 Feb 12, 2007
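A minimal sketch of this selection rule, assuming each candidate route exposes its hop count and the queue length observed locally at its output port; the Candidate type and field names are hypothetical, for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str        # label for the route (illustrative)
    hops: int        # H_i, hop count of the candidate route
    queue_len: int   # Q_i, local occupancy of the candidate output queue

def choose_route(candidates):
    """Channel-queue routing: pick the route with the lowest estimated
    latency L_i ~ Q_i * H_i (local queue length times hop count)."""
    return min(candidates, key=lambda c: c.queue_len * c.hops)

# Example: a congested 2-hop minimal route loses to a lightly loaded
# 4-hop non-minimal route, since 12*2 > 3*4.
routes = [Candidate("minimal", hops=2, queue_len=12),
          Candidate("non-minimal", hops=4, queue_len=3)]
print(choose_route(routes).name)   # -> non-minimal
```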

  40. Performance on UR (uniform random) Traffic Feb 12, 2007 HPCA: 40

  41. Performance on WC (worst-case) Traffic Feb 12, 2007 HPCA: 41

  42. Allocator Design Matters Feb 12, 2007 HPCA: 42

  43. Outline • Interconnection Networks (INs) are THE central component of modern computer systems • Topology driven to high-radix by packaging technology • Global adaptive routing balances load - and enables efficient topologies • Case study, the Cray Black Widow • On-Chip Interconnection Networks (OCINs) face unique challenges • The road ahead… HPCA: 43 Feb 12, 2007

  44. Putting it all together: the Cray BlackWidow Network. In collaboration with Steve Scott and Dennis Abts (Cray Inc.) HPCA: 44 Feb 12, 2007

  45. Cray Black Widow • Shared-memory vector parallel computer • Up to 32K nodes • Vector processor per node • Shared memory across nodes HPCA: 45 Feb 12, 2007

  46. Black Widow Topology • Up to 32K nodes in a 3-level folded Clos • Each node has four 18.75 Gb/s channels, one to each of four network slices HPCA: 46 Feb 12, 2007
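As a rough sanity check on these numbers (my arithmetic, not the talk's): a folded Clos built from radix-64 routers supports on the order of 2·(64/2)^n endpoints with n ranks, so two ranks top out near 2K nodes and a third rank is needed to reach 32K.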

  47. YARC (Yet Another Router Chip) • 64 ports • Each port is 18.75 Gb/s (3 × 6.25 Gb/s links) • Table-driven routing • Fault tolerance – CRC with link-level retry – Graceful degradation of links: 3 bits -> 2 bits -> 1 bit -> OTS HPCA: 47 Feb 12, 2007
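For reference, 3 × 6.25 Gb/s = 18.75 Gb/s per port, and 64 ports × 18.75 Gb/s works out to roughly 1.2 Tb/s of aggregate port bandwidth per YARC chip.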
