Multistage Switches are not Crossbars: Effects of Static Routing in - - PowerPoint PPT Presentation


Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks - A Case Study with InfiniBand - Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine Open Systems Lab Indiana University Bloomington, USA IEEE


SLIDE 1

Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks

- A Case Study with InfiniBand -

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine

Open Systems Lab Indiana University Bloomington, USA

IEEE Cluster 2008

Tsukuba, Japan

September 29th, 2008

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

SLIDE 2

Introduction

  • large-scale networks are common in HPC
  • huge variety of different technologies (IB, QSNet, Myrinet)
      • offering: offload, onload, OS bypass
  • we focus on topologies and routing!

Topologies
  • flat: Ring, Kautz, k-ary n-cubes (Torus, Hypercube)
  • MIN: Omega, Banyan, Clos, k-ary n-tree (Fat Tree)

Routing
  • oblivious: fully random, random paths, online, ...
  • adaptive: simple adaptive, probing adaptive, ...

⇒ focus on Fat Tree topologies with oblivious routing!

SLIDE 3

Why Fat Trees?

Fat Tree networks seem to have several advantages:

  • simple construction rule
  • Clos networks are a special case
  • high bandwidth at large scale
  • well understood since the 60s (Telephone)
  • used by many switch vendors
  • can be built with full bisection bandwidth (FBB)
  • maps many (all?) patterns optimally
  • simple deadlock-free routing

... so it seems ...

SLIDE 4

What is Bisection Bandwidth?

Definition: For a general network with N endpoints, represented as a graph with a bandwidth of one on every edge, the bisection bandwidth (BB) is defined as the minimum number of edges that have to be removed in order to split the graph into two equally sized, disconnected parts.
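The definition can be checked by brute force on small graphs. The following is a minimal sketch, not taken from the slides: the function name `bisection_width` is hypothetical, every vertex is treated as an endpoint, and the exhaustive search over all equal splits is only feasible for a handful of nodes.

```python
from itertools import combinations


def bisection_width(nodes, edges):
    """Minimum number of unit-bandwidth edges cut by any equal split of the nodes."""
    nodes = list(nodes)
    if len(nodes) % 2 != 0:
        raise ValueError("need an even number of nodes for an equal split")
    # Pin the first node to part A so each split is enumerated only once.
    first, rest = nodes[0], nodes[1:]
    best = None
    for others in combinations(rest, len(nodes) // 2 - 1):
        part_a = {first, *others}
        # An edge is cut when its two ends lie in different parts.
        cut = sum(1 for u, v in edges if (u in part_a) != (v in part_a))
        if best is None or cut < best:
            best = cut
    return best


# Two small test graphs: an 8-node ring and a complete graph on 4 nodes.
ring = [(i, (i + 1) % 8) for i in range(8)]
complete = list(combinations(range(4), 2))
print(bisection_width(range(8), ring))      # 2
print(bisection_width(range(4), complete))  # 4
```

The ring can be halved by cutting two edges, while every equal split of the complete graph cuts four, which is why the two topologies differ so much in bisection bandwidth despite both being connected.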

SLIDE 5

Clos Networks

  • see [Clos’53] for details
  • can be built blocking, rearrangeable non-blocking, or strictly non-blocking
  • rearrangeable non-blocking is most widely used
  • $N/2 + N$ crossbars of size $N \times N$
  • $(N/2) \cdot N$ endpoints and connections (“cables”)

[Figure: Clos network built from twelve 8x8 crossbars]
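The sizing rule can be evaluated directly. This is a small sketch under the reading that a Clos network of N-port crossbars uses N/2 + N crossbars and offers (N/2) · N endpoints; the helper name `clos_size` is hypothetical.

```python
def clos_size(ports_per_crossbar):
    """Return (number of crossbars, number of endpoints) for N-port crossbars."""
    n = ports_per_crossbar
    if n % 2 != 0:
        raise ValueError("the port count must be even")
    crossbars = n // 2 + n    # N/2 + N crossbars
    endpoints = (n // 2) * n  # (N/2) * N endpoints and cables
    return crossbars, endpoints


print(clos_size(8))   # (12, 32)
print(clos_size(24))  # (36, 288)
```

For N = 8 this gives the twelve crossbars drawn on this slide, and for N = 24 it gives the 288 ports quoted on the InfiniBand slide. Treating a 288-port Clos network as one crossbar gives `clos_size(288)[1] == 41472`, which appears to match the recursive Clos figure quoted there.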

SLIDE 6

k-ary n-trees (Fat Trees)

  • see [Leiserson’90] for details
  • “generalisation” of Clos networks
  • much more flexible in size and bandwidth
  • similar principles

SLIDE 7

Oblivious Routing and InfiniBand

Oblivious Routing
  • static routing without considering the traffic demands
  • e.g., Ethernet, InfiniBand, IP, ...
  • adaptive routing has limits (fast changing patterns with small packets)

InfiniBand
  • Subnet Manager (SM) discovers topology and computes routes
  • crossbars have destination-based forwarding tables
  • 24-port crossbars -> Clos network has 288 ports
  • recursive Clos up to 41472 ports
  • biggest chassis has 3456 ports (Fat Tree)

SLIDE 8

An FBB Network Pattern

[Figure: 16-endpoint fat tree built from 8-port crossbars, with endpoint groups 1 .. 4, 5 .. 8, 9 .. 12 and 13 .. 16. Each port is annotated with the destinations it forwards to, e.g. "5,9,13 (up)" or "1 (down)"; links are annotated with 1 or 2.]

  • two distinct communications (1 to 6 and 4 to 14) in an FBB network ⇒ no full bandwidth!
  • Bandwidth depends on traffic patterns, routing and topology!

SLIDE 9

What is the essence of Bisection Bandwidth?

Is it an upper bound to real bandwidth?

no, see the example on the previous slide!

Is it a lower bound to real bandwidth?

no, see:

Is it the expected (average) bandwidth?

not easy to assess, so simulate different traffic patterns!

SLIDE 10

The effective Bisection Bandwidth (eBB)

  • models real bandwidth as the average bandwidth of all bisect patterns
  • constructing a bisect pattern:
      • divide network in two equal partitions A and B
      • find exactly one peer in B for each node in A
  • $\binom{P}{P/2}$ ways to partition $P$ nodes
  • $\left(\frac{P}{2}\right)!$ ways to pair $\frac{P}{2}$ nodes
  • → huge number of patterns
  • many of them have full bandwidth
  • no closed form yet, thus simulate
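To get a feeling for how quickly the number of bisect patterns grows, the two counts can be multiplied for small P. A minimal sketch, assuming Python 3.8+ for `math.comb`; the function name `bisect_pattern_count` is hypothetical.

```python
import math


def bisect_pattern_count(num_nodes):
    """Return (partitions, pairings, total) for P nodes, following the slide's counts."""
    if num_nodes % 2 != 0:
        raise ValueError("P must be even")
    half = num_nodes // 2
    partitions = math.comb(num_nodes, half)  # ways to choose partition A
    pairings = math.factorial(half)          # ways to assign peers in B to nodes in A
    return partitions, pairings, partitions * pairings


print(bisect_pattern_count(4))   # (6, 2, 12)
print(bisect_pattern_count(16))  # (12870, 40320, 518918400)
```

Already for 16 nodes there are more than half a billion patterns by this count, so exhaustive enumeration is out of reach for systems with thousands of nodes and random sampling is the practical option.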

SLIDE 11

The Network Simulator

  • model physical network as a graph
  • construct random bisect patterns
  • simulate packet routing and record edge congestion
  • find maximum congestion $c$ along each path
  • compute average bandwidth per path $b = \frac{1}{c}$
  • repeat simulation with many patterns
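The steps above can be sketched for a toy two-level fat tree. This is not the authors' simulator: the topology (leaf crossbars with as many uplinks as hosts), the static rule "uplink = destination mod hosts per leaf", one-directional traffic from A to B, and all function names are assumptions made for illustration. The routing rule appears consistent with port labels such as "5,9,13 (up)" in the FBB pattern figure.

```python
import random
from collections import Counter


def random_bisect_pattern(num_nodes, rng):
    """Split the nodes into halves A and B at random and pair each node in A with one in B."""
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    half = num_nodes // 2
    return list(zip(nodes[:half], nodes[half:]))


def route(src, dst, hosts_per_leaf):
    """Static, destination-based path as a list of directed edges (assumed routing rule)."""
    src_leaf, dst_leaf = src // hosts_per_leaf, dst // hosts_per_leaf
    if src_leaf == dst_leaf:
        return [("host", src, "leaf", src_leaf), ("leaf", dst_leaf, "host", dst)]
    spine = dst % hosts_per_leaf  # the uplink depends only on the destination
    return [
        ("host", src, "leaf", src_leaf),
        ("leaf", src_leaf, "spine", spine),
        ("spine", spine, "leaf", dst_leaf),
        ("leaf", dst_leaf, "host", dst),
    ]


def pattern_bandwidth(pattern, hosts_per_leaf):
    """Average over all paths of b = 1/c, where c is the maximum edge congestion on the path."""
    paths = [route(src, dst, hosts_per_leaf) for src, dst in pattern]
    load = Counter(edge for path in paths for edge in path)
    return sum(1.0 / max(load[edge] for edge in path) for path in paths) / len(paths)


def effective_bisection_bandwidth(num_nodes, hosts_per_leaf, num_patterns, seed=0):
    """Average relative bandwidth over many random bisect patterns."""
    rng = random.Random(seed)
    total = sum(
        pattern_bandwidth(random_bisect_pattern(num_nodes, rng), hosts_per_leaf)
        for _ in range(num_patterns)
    )
    return total / num_patterns


# With zero-based numbering, the flows "1 to 6" and "4 to 14" become (0, 5) and (3, 13).
print(pattern_bandwidth([(0, 5), (3, 13)], hosts_per_leaf=4))  # 0.5
print(effective_bisection_bandwidth(16, 4, num_patterns=1000))
```

Under these assumptions both flows leave the first leaf crossbar through the same uplink, so each gets half the link bandwidth even though the toy network has full bisection bandwidth.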

SLIDE 12

Simulating Real-World Systems

  • retrieved topology via ibnetdiscover and ibdiagnet
  • four large-scale InfiniBand systems queried:
      • Thunderbird at SNL - 4390 nodes
      • Atlas at LLNL - 1142 nodes
      • Ranger at TACC - 3908 nodes
      • CHiC at TUC - 566 nodes

SLIDE 13

Influence of Head of Line Blocking

[Figure: two plots. Bandwidth (MBit/s) over Datasize (kiB) for no congestion and for congestion 1, 3, 5, 7 and 10; 1-byte Latency (us) and Peak Bandwidth (Mbps) over Congestion Factor.]

  • communication between different pairs (bisect) laid out to cause congestion
  • 24 ports → max. congestion is 11

SLIDE 14

Simulation and Reality

  • compare 512 node CHiC run with 566 node simulation
  • random bisect patterns, bins of size 50 MiB/s
  • measured and simulated: > 99.9% of the results fall into only 4 bins!

[Figure: histogram of Number of Occurrences (x 100,000) per bandwidth bin, measured and simulated; labelled bins at 627.4 MiB/s, 281.2 MiB/s, 181.2 MiB/s and 133.6 MiB/s]

SLIDE 15

Results on other Systems

Effective Bisection Bandwidth
  • Ranger (3908 nodes, FBB): 57.5%
  • Atlas (1142 nodes, FBB): 55.6%
  • Thunderbird (4390 nodes, 1/2 FBB): 40.6%

Other Effects of Congestion?
  • bandwidth varies with comm. pattern
  • not easy to predict/model
  • effects on latency are not trivial (buffering etc.)
  • leads to network skew (problem at large scale)

SLIDE 16

Influence on Applications?

Application Analysis
  • MPQC - MPI_Reduce, MPI_Bcast, MPI_Allreduce
  • MILC - Neighbor Exchange, MPI_Allreduce
  • POP - Neighbor Exchange, MPI_Allreduce
  • Octopus - Neighbor Exchange, MPI_Allreduce

Conclusions?
  • many applications use collective communication
  • nearest neighbor exchange is also collective
  • patterns can be scaled up!
  • simulate collective patterns: tree, dissemination, six neighbor
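The slides do not spell out how these patterns are generated. The sketch below uses common textbook forms as an assumption: a dissemination pattern in which rank r sends to (r + 2^k) mod P in round k, and a six neighbor pattern on a periodic 3D grid. Function names and the grid layout are hypothetical.

```python
def dissemination_rounds(num_ranks):
    """One list of (sender, receiver) pairs per round, with distances 1, 2, 4, ..."""
    rounds = []
    distance = 1
    while distance < num_ranks:
        rounds.append([(r, (r + distance) % num_ranks) for r in range(num_ranks)])
        distance *= 2
    return rounds


def six_neighbors(rank, dims):
    """The six nearest neighbors of a rank on a periodic nx x ny x nz grid."""
    nx, ny, nz = dims
    x, y, z = rank % nx, (rank // nx) % ny, rank // (nx * ny)

    def to_rank(i, j, k):
        # Wrap coordinates around the grid edges, then linearize.
        return (i % nx) + (j % ny) * nx + (k % nz) * nx * ny

    return [
        to_rank(x - 1, y, z), to_rank(x + 1, y, z),
        to_rank(x, y - 1, z), to_rank(x, y + 1, z),
        to_rank(x, y, z - 1), to_rank(x, y, z + 1),
    ]


print(len(dissemination_rounds(8)))  # 3
print(six_neighbors(0, (4, 4, 4)))   # [3, 1, 12, 4, 48, 16]
```

Each round of such a pattern is a set of sender and receiver pairs, so it can be evaluated for congestion in the same way as a bisect pattern.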

SLIDE 17

Results on other Systems

Six Neighbor Bandwidth
  • Ranger: 62.4%
  • Atlas: 60.7%
  • Thunderbird: 37.4%

Tree Bandwidth
  • Ranger: 69.9%
  • Atlas: 71.3%
  • Thunderbird: 57.4%

Dissemination Bandwidth
  • Ranger: 41.9%
  • Atlas: 40.2%
  • Thunderbird: 27.4%

SLIDE 18

Conclusions and Future Work

Conclusions
  • bisection bandwidth does not reflect practice well
  • effective bisection bandwidth is harder to assess but more realistic
  • application bandwidths suffer, even on FBB networks

Future Work
  • develop better oblivious routing for IB
  • analyze more systems and applications
  • look into adaptive routing options? or LMC??
  • look at other interconnects and topologies
