SLIDE 1

Center for Information Services and High Performance Computing (ZIH)

Deadlock-Free Oblivious Routing for Arbitrary Topologies

Jens Domke, Torsten Hoefler, Wolfgang E. Nagel

May 18th, 2011

Zellescher Weg 12 Willers-Bau A 219 01062 Dresden

Tel. +49 0351 - 463 39114

Jens Domke ( jens.domke@tu-dresden.de )

SLIDE 2

Outline

1. Basics and previous work
2. Deadlocks
3. Deadlock-free SSSP routing algorithm
4. Simulations and measurements
5. Conclusion

Jens Domke Slide 2

SLIDE 3

Outline: Basics and previous work

  • InfiniBand interconnect
  • InfiniBand subnet manager – OpenSM
  • Motivation
  • Previous work

SLIDE 4

InfiniBand interconnect

  • Based on an open standard, developed by the InfiniBand Trade Association
  • One of the most widely used interconnects in the field of HPC

[Pie chart labels: Gigabit Ethernet, InfiniBand, Proprietary, Others; values: 42.4%, 45.6%, 6.2%, 5.8%]

Figure: Top500 List, Interconnects, Nov. 2010

SLIDE 5

InfiniBand subnet manager – OpenSM

Tasks

  • Scan the components of the IB subnet
  • Initialize the IB ports
  • Calculate paths for each port pair in the subnet
  • Generate linear forwarding tables (LFT)
  • Configure the IB ports with additional preferences, e.g. QoS
  • Reconfiguration, if the subnet changes
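As a rough illustration of what a linear forwarding table holds, the sketch below models destination-based forwarding: each switch maps a destination identifier to exactly one output port, independent of the packet's source. The switch names, destination IDs, cabling, and the `forward` helper are all hypothetical and are not part of OpenSM's API.

```python
# Hypothetical model of linear forwarding tables (LFTs); not OpenSM code.
# Each switch stores exactly one output port per destination ID.
lft = {
    "sw1": {1: 1, 2: 2, 3: 3},   # destination ID -> output port
    "sw2": {1: 1, 2: 1, 3: 2},
}

# Hypothetical cabling: (switch, output port) -> next hop,
# which is either another switch name or an end-node ID.
links = {
    ("sw1", 1): 1, ("sw1", 2): 2, ("sw1", 3): "sw2",
    ("sw2", 1): "sw1", ("sw2", 2): 3,
}

def forward(first_switch, dest):
    """Follow the LFT entries hop by hop and return the switches visited."""
    path, hop = [], first_switch
    while hop != dest:
        path.append(hop)
        hop = links[(hop, lft[hop][dest])]
    return path
```

In this toy subnet, `forward("sw1", 3)` visits `sw1` and then `sw2`, because the entry for destination 3 on `sw1` points at the port cabled to `sw2`.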

SLIDE 6

InfiniBand subnet manager – OpenSM

Implemented static/destination-based routing algorithms

  • MinHop
  • Up*/Down*
  • Fat-Tree
  • LASH
  • DOR

SLIDE 7

Motivation

General problems for most of the routing algorithms

  • No global balancing of the traffic ⇒ congestion reduces the bandwidth
  • Only designed for a small set of topologies
  • Not deadlock-free for every topology
  • Not usable for production systems because of long runtime

The algorithm should support irregular topologies, because

  • HPC systems grow during their lifetime
  • Additional nodes, such as I/O or login nodes, are connected
  • Network components can fail

SLIDE 8

Previous work

Single-source-shortest-path routing algorithm

  • "Optimized Routing for Large-Scale InfiniBand Networks" [Hoefler et al., 2009] presented SSSP
  • Minimizes congestion through global balancing
  • Higher effective bisection bandwidth compared to other algorithms

Disadvantages of the presented approach

  • Algorithm is not deadlock-free
  • LFTs are calculated by an external program (not OpenSM)

SLIDE 9

Outline: Deadlocks

  • Definition
  • Deadlocks in interconnects
  • Approaches for deadlock-free routing
  • Theorem of Dally and Seitz
  • Virtual channels and channel dependency graph

SLIDE 10

Definition

Definition Deadlock [Tanenbaum, 2007]

A set of processes is deadlocked if each process in the set is waiting for an event that only a process in the set can cause.

SLIDE 11

Deadlocks in interconnects

[Figure labels: packet source, switch buffer, packet destination]

SLIDE 12

Approaches for deadlock-free routing

  • Packet lifetime (only breaks deadlocks once they occur)
  • Controller principle
  • Up*/Down* routing
  • Virtual channels
      • "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks" [Dally and Seitz, 1987]
      • Each link is split into multiple virtual channels
      • Channel dependency graph

SLIDE 13

Theorem of Dally and Seitz

Theorem of Dally and Seitz

A routing algorithm for an interconnect is deadlock-free if and only if there are no cycles in the corresponding channel dependency graph.
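To make the theorem concrete, the following sketch builds a channel dependency graph (CDG) from a set of routes and tests it for cycles. Representing a channel as a directed `(from_node, to_node)` tuple, giving routes as node sequences, and the function names `build_cdg` and `has_cycle` are assumptions made for illustration only.

```python
from collections import defaultdict

def build_cdg(paths):
    """CDG: edge c1 -> c2 whenever some path uses channel c2 right after c1."""
    cdg = defaultdict(set)
    for nodes in paths:
        channels = list(zip(nodes, nodes[1:]))   # a channel is a directed link (u, v)
        for c1, c2 in zip(channels, channels[1:]):
            cdg[c1].add(c2)
    return cdg

def has_cycle(cdg):
    """Depth-first search with three colors; reaching a GRAY vertex means a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)                     # every channel starts WHITE

    def visit(c):
        color[c] = GRAY
        for nxt in cdg.get(c, ()):
            if color[nxt] == GRAY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(cdg))

# Four nodes in a ring, every route travels two hops clockwise.
ring_routes = [(1, 2, 3), (2, 3, 4), (3, 4, 1), (4, 1, 2)]
```

For `ring_routes`, each channel depends on the next one around the ring, so `has_cycle(build_cdg(ring_routes))` is `True` and, by the theorem, this routing can deadlock. Dropping two of the four routes breaks the cycle.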

SLIDE 14

Virtual channels and channel dependency graph

[Figure: network with nodes n1 to n4 and channels c1 to c4, alongside its channel dependency graph]

SLIDE 15

Virtual channels and channel dependency graph

[Figure: network with nodes n1 to n4 whose channels are split into virtual channels c1,1 to c1,4 and c2,1 to c2,4, alongside its channel dependency graph]

SLIDE 16

Virtual channels and channel dependency graph

[Figure: one diagram labeled with channels c1 to c4 and r1 to r3, and a second diagram labeled with virtual channels c1,1 to c1,3, c2,1 to c2,4 and r1 to r3]

SLIDE 17

Outline: Deadlock-free SSSP routing algorithm

  • DFSSSP routing algorithm
  • How to identify the "weakest" edge?

SLIDE 18

DFSSSP routing algorithm

Algorithm 1: DFSSSP routing algorithm
    /* Phase 1: Identification of all network components */
    Scan(...)
    /* Phase 2: Calculate paths */
    SSSP(...)
    /* Phase 3: Assign paths to virtual layers */
    RemoveDeadlocks(...)
    /* Phase 4: Balancing of all virtual layers */
    Balancing(...)

SLIDE 19

DFSSSP routing algorithm

Algorithm 2: Remove deadlocks from the channel dependency graph (Phase 3)
Input: Linear forwarding tables
Output: Assignment of each path to a virtual layer
    /* Initialization of layer 1 */
    for all PortPairs(source, destination) do
        Update CDG[1] with the source-destination path
    end for
    /* Search cycles in the channel dependency graph */
    for i = 1, ..., max − 1 do
        repeat
            Search for cycle in CDG[i]
            Identify "weakest" edge of the cycle
            Move port pairs or paths on this edge to CDG[i + 1]
        until no cycle found in CDG[i]
    end for
    Search for cycle in CDG[max]
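The layer assignment of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the OpenSM implementation: paths are given as node sequences rather than read from forwarding tables, a channel is a directed `(u, v)` tuple, the "weakest" edge is taken to be the cycle edge induced by the fewest paths, and raising an error when the last layer still contains a cycle is one possible way to handle the final check. All function names are hypothetical.

```python
from collections import defaultdict

def channel_pairs(path):
    """Consecutive channel pairs (CDG edges) induced by one node path."""
    ch = list(zip(path, path[1:]))
    return list(zip(ch, ch[1:]))

def find_cycle(paths):
    """Return the CDG edges of one cycle induced by `paths`, or None if acyclic."""
    cdg = defaultdict(set)
    for p in paths:
        for c1, c2 in channel_pairs(p):
            cdg[c1].add(c2)
    state, stack = {}, []            # state: 1 = on the DFS stack, 2 = finished

    def dfs(c):
        state[c] = 1
        stack.append(c)
        for nxt in cdg.get(c, ()):
            if state.get(nxt) == 1:  # back edge closes a cycle
                cyc = stack[stack.index(nxt):] + [nxt]
                return list(zip(cyc, cyc[1:]))
            if nxt not in state:
                found = dfs(nxt)
                if found:
                    return found
        state[c] = 2
        stack.pop()
        return None

    for c in list(cdg):
        if c not in state:
            found = dfs(c)
            if found:
                return found
    return None

def paths_on_edge(layer, edge):
    """All paths of a layer that induce the given CDG edge."""
    return [p for p in layer if edge in channel_pairs(p)]

def remove_deadlocks(paths, max_layers):
    """Assign every path to a virtual layer so that each layer's CDG is acyclic."""
    layers = [list(paths)] + [[] for _ in range(max_layers - 1)]
    for i in range(max_layers - 1):
        while (cycle := find_cycle(layers[i])) is not None:
            # Assumed heuristic: break the cycle at the edge with the fewest paths.
            weakest = min(cycle, key=lambda e: len(paths_on_edge(layers[i], e)))
            moved = paths_on_edge(layers[i], weakest)
            layers[i] = [p for p in layers[i] if p not in moved]
            layers[i + 1].extend(moved)
    if find_cycle(layers[-1]) is not None:
        raise RuntimeError("not enough virtual layers for a deadlock-free assignment")
    return layers
```

For the four-node ring with routes `(1, 2, 3)`, `(2, 3, 4)`, `(3, 4, 1)`, `(4, 1, 2)`, calling `remove_deadlocks(routes, 2)` moves a single route to the second layer, which is enough to leave both layers acyclic. Rebuilding the CDG for every cycle search keeps the sketch short but is far slower than an incremental update would be.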

SLIDE 20

How to identify the "weakest" edge?

... to minimize the number of needed virtual layers.

  • Abstract formulation: "acyclic path partitioning" problem (APP)
      • Split a set of paths into subsets that each produce an acyclic channel dependency graph
  • Shown to be NP-complete
      • Proof based on a polynomial transformation from the graph k-colorability problem to APP
  • APP is NP-complete ⇒ use a heuristic to identify the "weakest" edge
      • Edge with the most paths in the cycle
      • Random edge of the cycle
      • Edge with the smallest number of paths
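The three candidate heuristics differ only in how one edge of a detected cycle is selected, which the sketch below expresses as a single selector function. The function name, the `strategy` labels, and the `path_count` dictionary are hypothetical; how the per-edge path counts are obtained is left open here.

```python
import random

def pick_weakest_edge(cycle_edges, path_count, strategy="fewest", rng=None):
    """Choose the CDG edge of a cycle at which the cycle is broken.

    cycle_edges: list of edges forming the cycle (any hashable edge IDs)
    path_count:  dict mapping each edge to the number of paths inducing it
    rng:         optional random.Random instance for reproducible choices
    """
    if strategy == "most":       # edge with the most paths in the cycle
        return max(cycle_edges, key=lambda e: path_count[e])
    if strategy == "random":     # random edge of the cycle
        return (rng or random).choice(cycle_edges)
    if strategy == "fewest":     # edge with the smallest number of paths
        return min(cycle_edges, key=lambda e: path_count[e])
    raise ValueError(f"unknown strategy: {strategy}")
```

Moving the edge with the fewest paths relocates the least traffic to the next layer, while moving the edge with the most paths empties the current layer faster; which choice needs fewer layers overall depends on the topology.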

SLIDE 21

Outline: Simulations and measurements

  • Simulations with IBSim
      • Real existing topologies
  • Measurements on a real system – Deimos
      • PC-Farm Deimos
      • Netgauge
      • BenchIT
      • NAS parallel benchmarks

SLIDE 22

Real existing topologies

[Plots: effective bisection bandwidth (axis 0.2 to 1) and routing runtime in s (log axis, 10^-4 to 10^4) for the topologies CHiC, Deimos, JUROPA, Odin, Ranger, and Tsubame; routing algorithms MinHop, Up*/Down*, FatTree, LASH, DOR, SSSP, DFSSSP]

Figure: Simulation with IBSim and ORCS [Schneider et al., 2009]

SLIDE 23

Measurements on a real system – Deimos

  • HPC system operated by ZIH
  • Linux Networx PC-Farm (13.9 TFlop/s)
  • 726 compute nodes connected by 108 IB switches
  • 2.6 GHz AMD Opteron X85 dual core
  • 1, 2 or 4 processors per node
  • 2 GByte RAM per core

SLIDE 24

Measurements on a real system – Deimos

Measurement environment and used benchmarks

  • Exclusive access
  • One MPI process per node (for measurements with ≤ 512 cores)
  • Same number of MPI processes ⇒ same compute nodes used
  • Eff. bisection bandwidth with Netgauge [Hoefler et al., 2007]
  • Runtime and bandwidths of pure MPI communication measured with micro-benchmarks (BenchIT [Juckeland et al., 2004])
  • Performance gain for NASA application benchmarks (NAS Parallel Benchmarks [Bailey et al., 1995])

SLIDE 25

Netgauge

[Plot: effective bisection bandwidth in MiByte/s (axis 50 to 400) over number of cores (128, 256, 512, 1024) for MinHop, LASH, SSSP, DFSSSP]

Figure: Approximation with 1000 random bisections

SLIDE 26

BenchIT

[Plot: runtime in s (axis 0.01 to 0.08) over elements in send buffer (#floats, 512 to 4096) for MinHop, LASH, SSSP, DFSSSP]

Figure: Collective N-to-N MPI operation on 128 nodes

SLIDE 27

NAS parallel benchmarks

[Plot: total Gflop/s (axis 50 to 250) over number of cores (121, 256, 484, 1024) for MinHop, LASH, SSSP, DFSSSP]

Figure: BT, class C – equation system solver

SLIDE 28

Conclusion

  • Developed deadlock-free SSSP routing for arbitrary network topologies
  • DF-/SSSP routing algorithm integrated in OpenSM
      • Patch available: http://unixer.de/research/dfsssp/
  • Not limited to InfiniBand; usable for all interconnects that support virtual channels
  • Modeled the "acyclic path partitioning" problem; proved NP-completeness
  • Doubled the eff. bisection bandwidth of Deimos for 512 nodes
  • Performance gain (communication bound) for application benchmarks of up to 95%

SLIDE 29

References

  • D. Bailey, T. Harris, W. Saphir, R. V. D. Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, Dec. 1995.

  • W. Dally and C. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. Computers, IEEE Transactions on, C-36(5):547–553, May 1987. ISSN 0018-9340. doi: 10.1109/TC.1987.1676939.

  • T. Hamada and N. Nakasato. InfiniBand Trade Association, InfiniBand Architecture Specification, Volume 1, Release 1.0. In International Conference on Field Programmable Logic and Applications, pages 366–373, 2005.

  • T. Hoefler, T. Mehlan, A. Lumsdaine, and W. Rehm. Netgauge: A Network Performance Measurement Framework. In High Performance Computing and Communications, Third International Conference, HPCC 2007, Houston, USA, September 26-28, 2007, Proceedings, volume 4782, pages 659–671. Springer, Sept. 2007. ISBN 978-3-540-75443-5.

  • T. Hoefler, T. Schneider, and A. Lumsdaine. Optimized Routing for Large-Scale InfiniBand Networks. In 17th Annual IEEE Symposium on High Performance Interconnects (HOTI 2009), Aug. 2009.

  • G. Juckeland, S. Börner, M. Kluge, S. Kölling, W. Nagel, S. Pflüger, H. Röding, S. Seidl, T. William, and R. Wloch. BenchIT – Performance Measurement and Comparison for Scientific Applications. In F. P. G.R. Joubert, W.E. Nagel and W. Walter, editors, Parallel Computing - Software Technology, Algorithms, Architectures and Applications, volume 13 of Advances in Parallel Computing, pages 501–508. North-Holland, 2004.

  • T. Schneider, T. Hoefler, and A. Lumsdaine. ORCS: An Oblivious Routing Congestion Simulator. Technical Report 675, Indiana University, Feb. 2009.

  • A. S. Tanenbaum. Modern Operating Systems. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2007. ISBN 9780136006633.

SLIDE 30

Backup – Complexity analysis

Time complexity

The time complexity of the DFSSSP routing algorithm is $O(|N|^2 \cdot (\log|N| + \nabla) + |N| \cdot |C| + \nabla \cdot (|C| + |E|))$.

Memory complexity

The memory complexity of DFSSSP is $O(\nabla \cdot d(I) \cdot |N|^2 + \nabla \cdot (|C| + |E|) + |N|)$.

Variables:

  • N – nodes in the network
  • C – channels/links
  • E – edges in the channel dependency graph
  • ∇ – minimal number of needed virtual layers
  • d(I) – diameter of network I

SLIDE 31

Backup – InfiniBand subnet

[Figure: InfiniBand subnet with switches 1 to n connected by links; compute nodes (CPUs) attached via HCAs and I/O nodes (tape) attached via TCAs; OpenSM managing the subnet; a router connecting to further subnets]

SLIDE 32

Backup – Metrics for interconnects

Significant properties

  • Low latency
  • High bandwidth for packet transfer
  • Absence of deadlocks in the routing

Established metrics to rate the interconnect

  • Latency
  • Bandwidth
  • Bisection bandwidth
  • Effective bandwidth
  • Effective bisection bandwidth

SLIDE 33

Backup – SSSP algorithm

Algorithm 3: SSSP routing algorithm (Phase 2)
Input: Context of DFSSSP routing
Output: Linear forwarding tables
    /* N-to-N, multi-graph Dijkstra algorithm */
    for all Port ∈ Subnet do
        Dijkstra(...) for this port as source
        Update all linear forwarding tables
        Increase edge weights
    end for
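A compact Python sketch of this loop is given below. It makes several simplifying assumptions that are not taken from the slides: the network is a plain undirected graph given as an adjacency dictionary (not a multi-graph of ports), the root of each Dijkstra run is treated as the destination of the resulting forwarding entries, all link weights start at 1, and a link's weight grows by 1 for every path routed over it. The function name `sssp_routing` is hypothetical.

```python
import heapq

def sssp_routing(adj):
    """adj: {node: [neighbor, ...]} for an undirected network with sortable node IDs.

    Returns (next_hop, weight): next_hop[node][dest] is the forwarding entry,
    weight[(u, v)] is the accumulated load estimate of the directed link u -> v.
    """
    weight = {(u, v): 1 for u in adj for v in adj[u]}
    next_hop = {u: {} for u in adj}
    for dest in adj:
        # Dijkstra rooted at dest; link v -> u is weighted in the direction
        # packets travel toward dest.
        dist, parent = {dest: 0}, {}
        heap = [(0, dest)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue                      # stale heap entry
            for v in adj[u]:
                nd = d + weight[(v, u)]
                if nd < dist.get(v, float("inf")):
                    dist[v], parent[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        # Update the forwarding tables and increase the weight of every used link.
        for src in parent:
            next_hop[src][dest] = parent[src]
            node = src
            while node != dest:
                weight[(node, parent[node])] += 1
                node = parent[node]
    return next_hop, weight
```

Because later Dijkstra runs see the weights left behind by earlier ones, heavily used links become less attractive over time, which is the global balancing effect the SSSP approach aims for. For example, `sssp_routing({0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]})` routes a four-node ring.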
