Technische Universität München
Parallel Programming and High-Performance Computing
Part 2: High-Performance Networks
- Dr. Ralf-Peter Mundani
CeSIM / IGSSE
2 High-Performance Networks
640k is enough for anyone, and by the way, what’s a network? —William Gates III, chairman Microsoft Corp., 1984
– degree of a node: number of connections (incoming and outgoing) between this node and other nodes
– degree of a network = max. degree of all nodes in the network
– higher degrees lead to higher costs
– objective: keep degree and, thus, costs small
Figure: example networks of degree 3 and degree 4
– distance of a pair of nodes: length of the shortest path between them, i. e. the number of hops a message has to take on its way from the sender to the receiver
– diameter of a network = max. distance over all pairs of nodes in the network
– higher diameters lead to longer paths between two nodes (along which all intermediate nodes have to work properly)
– objective: small diameter
Figure: example network of diameter 4
– connectivity: min. number of edges (cables) that have to be removed to disconnect the network, i. e. the network falls apart into two loose sub-networks
– higher connectivity leads to higher fault tolerance (more alternative paths through the network)
– objective: high connectivity
Figure: example network of connectivity 2
– complexity / costs: amount of hardware necessary for the realisation of the network (e. g. network cards, cables, switches)
  – more hardware means higher complexity / costs
– regularity: extent of deviation in local network quantities (e. g. degree, connectivity)
  – more regular networks are easier to implement
– length of lines: physical length of the connections
  – shorter lines (for all connections) are advantageous
– bisection width: min. number of edges (cables) that have to be removed to separate the network into two equal parts (bisection width ≠ connectivity, see example below)
– important for determining the number of messages that can be transmitted in parallel from one half of the nodes to the other half without using any connection repeatedly
– extreme case: Ethernet with bisection width 1
– objective: high bisection width (ideal: number of nodes / 2)
Figure: example network with bisection width 4 (connectivity 3)
– blocking: a desired connection between two nodes cannot be established due to already existing connections between other pairs of nodes
  – objective: non-blocking networks
– extensibility: in which steps the network can be extended (e. g. arbitrarily or only by doubling the number of nodes)
– scalability: keeping the essential properties of the network under any increase of the number of nodes
– fault tolerance: connections between (arbitrary) nodes can still be established even if single components break down
  – a fault-tolerant network has to provide at least one redundant path between every pair of nodes
  – graceful degradation: the ability of a system to stay functional (maybe with less performance) even if single components break down
– complexity of routing: costs for determining a route for a message from the sender to the receiver
  – objective: routing should be simple (to be implemented in hardware)
– bandwidth: max. transmission performance of a network for a certain amount of time
  – bandwidth B is in general measured in megabits or megabytes per second (Mbps or MBps, resp.), nowadays more often in gigabits or gigabytes per second (Gbps or GBps, resp.)
– bisection bandwidth: max. transmission performance of a network over the bisection line, i. e. the sum of the single bandwidths of all edges (cables) that are “cut” when bisecting the network
  – thus, bisection bandwidth is a measure of bottleneck bandwidth
  – units are the same as for bandwidth
– latency: delay time of a communication (time between sending and receiving the head of a message)
  – latency L is measured in seconds
– transmission time: time for transmitting an entire message between two nodes
  – transmission time depends on the size S of a message
  – in case there are no conflicts, the transmission time can be computed as D(S) = L + S/B
  – sometimes this is also referred to as delay
– bandwidth = throughput at the maximum message size S_max
– mostly, the (theoretical) bandwidth is not achieved with common message sizes
– throughput: ratio between message size and delay, P(S) = S / D(S)
– throughput is interesting for determining the half-power point: at which message size S_H can half of the bandwidth be achieved?
– example: L = 10 μs, B = 10 MBps
  ½·B = S_H / D(S_H) = S_H / (L + S_H/B)  →  S_H = B·L  →  S_H = 0.1 kB
– a lower half-power point means a higher percentage of frames that can take advantage of a network’s bandwidth
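The formulas D(S) = L + S/B, P(S) = S / D(S), and S_H = B·L can be tried out numerically. The following is a minimal sketch; the function names are chosen for this illustration only, and 1 MB is assumed to be 10^6 bytes.

```python
def transmission_time(size, latency, bandwidth):
    """D(S) = L + S/B for a conflict-free transfer (seconds)."""
    return latency + size / bandwidth


def throughput(size, latency, bandwidth):
    """P(S) = S / D(S), in bytes per second."""
    return size / transmission_time(size, latency, bandwidth)


def half_power_point(latency, bandwidth):
    """S_H = B * L: message size at which P(S) reaches B/2."""
    return bandwidth * latency


# Values from the example: L = 10 microseconds, B = 10 MBps
# (assumption: 1 MB = 10**6 bytes).
L = 10e-6
B = 10e6
S_H = half_power_point(L, B)

print(f"half-power point: {S_H:.0f} bytes")
for size in (10, 100, 1_000, 10_000, 100_000):
    share = throughput(size, L, B) / B
    print(f"S = {size:>7} bytes -> {share:6.1%} of the bandwidth")
```

The loop shows how slowly the throughput approaches B: only messages much larger than S_H come close to the theoretical bandwidth.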
– static networks
  – fixed connections between pairs of nodes
  – control functions are done by the nodes or by special connection hardware
– dynamic networks
  – no fixed connections between pairs of nodes
  – all nodes are connected via inputs and outputs to a so-called switching component
  – control functions are concentrated in the switching component
  – various routes can be switched
– circuit switching
– packet switching
separately from a sender to a receiver node
assigned them to a corresponding output
– destination-based routing
– source-based routing
node (on its way over intermediate nodes)
frames” for TCP/IP, e. g.
– deterministic
– adaptive
– transmission between non-neighbouring sender and receiver nodes requires buffer mechanisms in intermediate nodes
– intermediate nodes have to take care of buffer overflows
– buffer management is organised via an additional flow control
– store-and-forward
– cut-through / wormhole routing
– virtual-cut-through
– buffered wormhole routing
– store-and-forward
analysed, and then forwarded to selected output
– cut-through / wormhole routing
units (phits) or flow control digits (flits)
at any point in time
least enough of it with address information) arrived
rejected in case output is busy (→ blocking)
– virtual-cut-through
frame in case of blocking
causing deadlocks
Figure: cut-through / wormhole routing compared with virtual-cut-through
– buffered wormhole routing
and can disappear more easily
– chain (linear array): one-dimensional network
– N nodes and N−1 edges
– degree = 2
– diameter = N−1
– bisection width = 1
– drawback: too slow for large N
– ring: two-dimensional network
– N nodes and N edges
– degree = 2
– diameter = ⎣N/2⎦
– bisection width = 2
– drawback: too slow for large N
– how about fault tolerance?
– chordal ring: two-dimensional network
– N nodes and 3N/2, 4N/2, 5N/2, … edges
– degree = 3, 4, 5, …
– higher degrees lead to smaller diameters, but also to more edges and, thus, higher costs
Figure: ring with degree 3 and 4
– completely connected network: two-dimensional network
– N nodes and N·(N−1)/2 edges
– degree = N−1
– diameter = 1
– bisection width = ⎣N/2⎦·⎡N/2⎤
– very high fault tolerance
– drawback: too expensive for large N
– star: two-dimensional network
– N nodes and N−1 edges
– degree = N−1
– diameter = 2
– bisection width = ⎣N/2⎦
– drawback: bottleneck in central node
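The edge counts, degrees, and diameters given for the simple static topologies can be cross-checked by building the graphs and measuring them with a breadth-first search. This is a minimal sketch; the builder names (`chain`, `ring`, `complete`, `star`) are labels chosen for this illustration, and nodes are assumed to be numbered 0 … N−1.

```python
from collections import deque
from itertools import combinations


def chain(n):
    return [(i, i + 1) for i in range(n - 1)]


def ring(n):  # assumes n >= 3, otherwise edges would be duplicated
    return [(i, (i + 1) % n) for i in range(n)]


def complete(n):
    return list(combinations(range(n), 2))


def star(n):  # node 0 is the central node
    return [(0, i) for i in range(1, n)]


def metrics(n, edges):
    """Return (number of edges, degree, diameter) of an undirected graph."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def eccentricity(start):
        # Breadth-first search: largest hop distance from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())

    degree = max(len(neighbours) for neighbours in adj.values())
    diameter = max(eccentricity(v) for v in adj)
    return len(edges), degree, diameter


N = 8
for name, build in [("chain", chain), ("ring", ring),
                    ("complete", complete), ("star", star)]:
    print(name, metrics(N, build(N)))
```

For N = 8 the measured values agree with the closed forms, e. g. diameter N−1 = 7 for the chain and ⎣N/2⎦ = 4 for the ring.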
– barrel shifter: two-dimensional network
– N = 2^k nodes and 2^(k−1)·(2k−1) edges
– degree = 2k−1
– diameter = ⎡k/2⎤
– each node has connections to all nodes with a distance d = 2^i, i ∈ [0, k−1]
– simple routing possible (→ address shifting)
– spatial unfolding provides a shifter with k levels
– example: k = 3
Figure: shifter unfolded into level 1, level 2, and level 3
– binary tree: two-dimensional network
– N nodes and N−1 edges (tree height h = ⎡log2 N⎤)
– degree = 3
– diameter = 2(h−1)
– bisection width = 1
– drawback: bottleneck in direction of root (→ blocking)
– addressing
– routing
Figure: binary tree with node addresses 1, 10, 11, 100, …, 1111, showing a route from source S via the common parent P to destination D
– solution to overcome the bottleneck → fat tree
– edges on level m get higher priority than edges on level m+1
– capacity is doubled on each higher level
– now, bisection width = 2^(h−2)
– frequently used: HLRB II, e. g.
– mesh: k-dimensional network
– N nodes and k·(N − N/r) edges with r = N^(1/k) nodes per dimension (r×r mesh for k = 2, where this equals k·(N−r))
– degree = 2k
– diameter = k·(r−1)
– bisection width = r^(k−1)
– high fault tolerance
– drawback: …
– torus: k-dimensional mesh with cyclic connections in each dimension
– N nodes and k·N edges (r×r mesh for k = 2, r = N^(1/k))
– diameter = k·⎣r/2⎦
– bisection width = 2·r^(k−1)
– frequently used: BlueGene/L, e. g.
– drawback: too expensive for k > 3
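The edge counts of k-dimensional meshes and of meshes with cyclic connections can be verified by brute-force enumeration. The sketch below is an illustration under stated assumptions: r nodes per dimension (so N = r^k), and a helper name `grid_edges` invented for this example. It confirms k·(N − N/r) edges for the mesh (which coincides with k·(N−r) in the two-dimensional r×r case) and k·N edges with cyclic connections.

```python
from itertools import product


def grid_edges(r, k, wrap):
    """Count the edges of a k-dimensional grid with r nodes per dimension.

    wrap=False: plain mesh; wrap=True: cyclic connections in each dimension.
    Each edge is counted once, at its endpoint with the smaller coordinate
    (or at the last node of a row for the wrap-around edge).
    """
    count = 0
    for node in product(range(r), repeat=k):
        for dim in range(k):
            has_next = node[dim] + 1 < r
            # For r <= 2 a wrap-around edge would duplicate an existing one.
            has_wrap = wrap and r > 2 and node[dim] == r - 1
            if has_next or has_wrap:
                count += 1
    return count


for r, k in [(4, 2), (3, 3), (4, 3)]:
    n = r ** k
    mesh = grid_edges(r, k, wrap=False)
    cyclic = grid_edges(r, k, wrap=True)
    print(f"r={r}, k={k}, N={n}: mesh {mesh} (formula {k * (n - n // r)}), "
          f"cyclic {cyclic} (formula {k * n})")
```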
– two-dimensional network
– N nodes and 2N edges (r×r mesh, r = √N)
– degree = 4
– diameter = r−1
– bisection width = 2r
– conforms to a chordal ring of degree 4
– hypercube: k-dimensional network
– 2^k nodes and k·2^(k−1) edges
– degree = k
– diameter = k
– bisection width = 2^(k−1)
– drawback: scalability (only doubling of nodes allowed)
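– principle design
  – a k-dimensional hypercube is built by connecting corresponding nodes of two (k−1)-dimensional hypercubes
  – node addresses get the prefix “0” in one sub-cube and the prefix “1” in the other sub-cube
Figure: hypercubes of dimension 0D to 3D with node addresses (1D: 0, 1; 2D: 00 to 11; 3D: 000 to 111)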
– nodes are directly connected for a HAMMING distance of 1 only
– routing
  – … destination is reached
– example
Figure: 3D hypercube with node addresses 000 to 111, source S and destination D
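Since neighbouring hypercube nodes differ in exactly one address bit, a route can be found by flipping the differing bits one at a time. The individual routing steps of the original slide are not preserved, so the sketch below uses the commonly described dimension-order scheme as an assumption: bits are corrected from the least significant to the most significant. The function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two node addresses differ."""
    return bin(a ^ b).count("1")


def hypercube_route(src, dst, k):
    """Route in a k-dimensional hypercube by correcting one bit per hop.

    Assumption: dimensions are processed in fixed order, lowest bit first.
    Returns the list of visited node addresses, including src and dst.
    """
    path = [src]
    node = src
    for bit in range(k):
        mask = 1 << bit
        if (node ^ dst) & mask:   # addresses differ in this dimension
            node ^= mask          # hop to the neighbour across it
            path.append(node)
    return path


route = hypercube_route(0b000, 0b110, k=3)
print(" -> ".join(format(n, "03b") for n in route))  # 000 -> 010 -> 110
```

The number of hops always equals the HAMMING distance of source and destination, which is why the diameter of a k-dimensional hypercube is k.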
– bus: simple and cheap single stage network
– shared usage by all connected nodes, thus, just one frame transfer at any point in time
– frame transfer in one step (i. e. diameter = 1)
– good extensibility, but bad scalability
– fault tolerance only for multiple bus systems
– example: Ethernet
Figure: single bus and multiple bus (here dual)
– crossbar: completely connected network with all possible permutations of N inputs and N outputs (in general N×M inputs / outputs)
– switch elements allow simultaneous communication between all possible disjoint pairs of inputs and outputs without blocking
– very fast (diameter = 1), but expensive due to N^2 switch elements
– used for processor-processor and processor-memory coupling
– example: The Earth Simulator
Figure: 3×3 crossbar with inputs 1 to 3, outputs 1 to 3, and one switch element per crossing
– tradeoff between low performance of buses and high hardware costs of crossbars
– often 2×2 crossbar as basic element
– N inputs can simultaneously be switched to N outputs → permutation of inputs (to outputs)
Figure: states of a 2×2 switch element (straight, crossed, upper broadcast, lower broadcast)
– permutations: unique (bijective) mapping of inputs to outputs
– addressing
– permutations can now be expressed as simple bit manipulation
– typical permutations
– perfect shuffle permutation: (a3 a2 a1) → (a2 a1 a3)

  a3 a2 a1    a2 a1 a3
  0  0  0  →  0  0  0
  0  0  1  →  0  1  0
  0  1  0  →  1  0  0
  0  1  1  →  1  1  0
  1  0  0  →  0  0  1
  1  0  1  →  0  1  1
  1  1  0  →  1  0  1
  1  1  1  →  1  1  1
– butterfly permutation: (a3 a2 a1) → (a1 a2 a3)

  a3 a2 a1    a1 a2 a3
  0  0  0  →  0  0  0
  0  0  1  →  1  0  0
  0  1  0  →  0  1  0
  0  1  1  →  1  1  0
  1  0  0  →  0  0  1
  1  0  1  →  1  0  1
  1  1  0  →  0  1  1
  1  1  1  →  1  1  1
– exchange permutation: (a3 a2 a1) → (a3 a2 ā1)

  a3 a2 a1    a3 a2 ā1
  0  0  0  →  0  0  1
  0  0  1  →  0  0  0
  0  1  0  →  0  1  1
  0  1  1  →  0  1  0
  1  0  0  →  1  0  1
  1  0  1  →  1  0  0
  1  1  0  →  1  1  1
  1  1  1  →  1  1  0
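The three typical permutations are plain bit manipulations on a k-bit address. The following minimal sketch assumes that a_k is the most significant and a_1 the least significant bit; the function names are chosen for this illustration.

```python
def perfect_shuffle(a, k):
    """Cyclic left shift of a k-bit address: (a_k ... a_1) -> (a_{k-1} ... a_1 a_k)."""
    msb = (a >> (k - 1)) & 1
    return ((a << 1) & ((1 << k) - 1)) | msb


def butterfly(a, k):
    """Swap the most and least significant bit: (a_k ... a_1) -> (a_1 ... a_k)."""
    msb = (a >> (k - 1)) & 1
    lsb = a & 1
    if msb != lsb:
        a ^= (1 << (k - 1)) | 1   # flip both outer bits
    return a


def exchange(a):
    """Invert the least significant bit: (a_k ... a_1) -> (a_k ... not a_1)."""
    return a ^ 1


k = 3
for a in range(1 << k):
    print(format(a, "03b"),
          "shuffle", format(perfect_shuffle(a, k), "03b"),
          "butterfly", format(butterfly(a, k), "03b"),
          "exchange", format(exchange(a), "03b"))
```

Applying the perfect shuffle k times returns every address to itself, which is one way to see why a pure shuffle pattern cannot reach every destination from every source.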
– example: perfect shuffle connection pattern
– problem: not all destinations are accessible from a source
Figure: perfect shuffle connection pattern over several stages
– adding exchange permutations (→ shuffle-exchange)
– all destinations are now accessible from any source
Figure: shuffle-exchange connection pattern over several stages
– omega network: based on the shuffle-exchange connection pattern
– exchange permutations replaced by 2×2 switch elements
Figure: one stage consisting of a perfect shuffle followed by 2×2 switch elements
– multistage network
– N nodes and E = N/2⋅log2 N switch elements
– diameter = log2 N (all stages have to be passed)
– N! permutations possible, but only 2^E different switch states
– (self configuring) routing
– example
– omega is a bidelta network → operates also backwards
– drawback: blocking possible
Figure: omega network example
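Self-configuring routing and blocking in an omega network can be simulated in a few lines. The concrete routing rule is not preserved in the text above, so this sketch assumes the commonly described destination-tag scheme: each stage applies a perfect shuffle, and the 2×2 switch element then selects its upper or lower output according to the next destination bit (most significant bit first). All names are hypothetical.

```python
def perfect_shuffle(a, k):
    """Cyclic left shift of a k-bit line number."""
    msb = (a >> (k - 1)) & 1
    return ((a << 1) & ((1 << k) - 1)) | msb


def omega_route(src, dst, k):
    """Return the output line occupied after each of the k stages.

    Assumption: destination-tag routing, destination bits consumed MSB first.
    """
    pos = src
    trace = []
    for stage in range(k):
        pos = perfect_shuffle(pos, k)          # wiring between the stages
        bit = (dst >> (k - 1 - stage)) & 1     # 0 = upper, 1 = lower output
        pos = (pos & ~1) | bit                 # switch element sets the last bit
        trace.append(pos)
    return trace


def is_blocked(pairs, k):
    """True if two (src, dst) pairs need the same output line in some stage."""
    traces = [omega_route(s, d, k) for s, d in pairs]
    for stage in range(k):
        used = [t[stage] for t in traces]
        if len(set(used)) < len(used):
            return True
    return False


k = 3
print(omega_route(0b101, 0b010, k))            # last entry equals the destination
print(is_blocked([(i, i) for i in range(8)], k))   # identity permutation
print(is_blocked([(0, 0), (4, 1)], k))             # both need line 0 after stage 1
```

Every message reaches its destination after exactly k = log2 N stages, but two messages may compete for the same switch output, which illustrates the blocking drawback.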
– idea: unrolling of a static hypercube
– bitwise processing of address bits a_i from left to right
– → dynamic hypercube a. k. a. butterfly (known from FFT flow diagram)
Figure: static 3D hypercube unrolled into a butterfly with node addresses 000 to 111
– replace crossed connections by 2×2 switch elements
– introduced by GOKE and LIPOVSKI in 1973
– blocking still possible
Figure: banyan tree
– BENEŠ network: multistage network
– butterfly merged at the last column with its copied mirror
– N nodes and N⋅log2 N − N/2 switch elements
– diameter = 2⋅log2 N − 1
– N! permutations possible, all can be switched
– key property: for any permutation of inputs to outputs there is a contention-free routing
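The hardware costs quoted for the dynamic networks can be put side by side: N^2 switch elements for a crossbar, N/2⋅log2 N for the omega network, and N⋅log2 N − N/2 for the mirrored butterfly. The sketch below (helper name hypothetical, N assumed to be a power of two) also compares the number of switch states 2^E of 2×2 elements with the N! permutations, which is only a necessary condition for switching all permutations.

```python
import math


def switch_elements(n):
    """Switch element counts for n inputs/outputs (n must be a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    k = n.bit_length() - 1            # k = log2(n)
    return {
        "crossbar": n * n,
        "omega": n // 2 * k,
        "mirrored butterfly": n * k - n // 2,
    }


n = 8
counts = switch_elements(n)
permutations = math.factorial(n)
print(counts)
for name in ("omega", "mirrored butterfly"):
    states = 2 ** counts[name]        # each 2x2 element: straight or crossed
    print(f"{name}: {states} switch states vs. {permutations} permutations")
```

For N = 8 the omega network has fewer switch states than there are permutations, so some permutations cannot be switched without blocking, whereas the larger network has enough states.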
– example
Figure: example permutation of the ports
– example
Figure: example permutation of the ports (continued)
– CLOS network: proposed by CLOS in 1953 for telephone switching systems
– objective: to overcome the costs of crossbars (N^2 switch elements)
– idea
– ingress stage: R crossbars with N×M inputs / outputs
– middle stage: M crossbars with R×R inputs / outputs
– egress stage: R crossbars with M×N inputs / outputs
– any incoming frame is routed from the input via one of the middle stage crossbars to the respective output
– a middle stage crossbar is available if both links to the ingress and egress stage are free
– R⋅N inputs can be assigned to R⋅N outputs
Figure: three-stage network with r ingress crossbars (n inputs, m outputs each), m middle stage crossbars (r inputs, r outputs each), and r egress crossbars (m inputs, n outputs each)
– relative values of M and N define the blocking characteristics
– rearrangeable non-blocking (M ≥ N)
  – a free input can always be connected to a free output
  – existing connections might have to be assigned to different middle stage crossbars (rearrangement)
– strict-sense non-blocking (M ≥ 2N−1)
  – a free input can always be connected to a free output
  – no re-assignment necessary
– proof for M ≥ N via HALL’s “Marriage Theorem” (1)
Let G = (V_IN, V_OUT, E) be a bipartite graph. A perfect matching for G is an injective function f : V_IN → V_OUT so that for every x ∈ V_IN, there is an edge in E whose endpoints are x and f(x).
One would expect a perfect matching to exist if G contains “enough” edges, i. e. if for every subset A ⊆ V_IN the image set δA ⊆ V_OUT is sufficiently large.
Theorem: G has a perfect matching if and only if for every subset A ⊆ V_IN the inequality |A| ≤ |δA| holds.
Often explained as follows: imagine two groups of N boys and N girls. If every subset of S boys (where 0 ≤ S ≤ N) knows S or more girls, each boy can be married to a girl he knows.
– proof for M ≥ N via HALL’s “Marriage Theorem” (2)
boy := ingress stage crossbar
girl := egress stage crossbar
a boy knows a girl if there exists a (direct) connection between them
assume there is one free input and one free output left
1) for 0 ≤ S ≤ R boys there are S⋅N connections → at least S girls
2) thus, HALL’s theorem states there exists a perfect matching
3) R connections can be handled by one middle stage crossbar
4) bundle these connections and delete the middle stage crossbar
5) repeat from step 1) until M = 1
6) new connection can be handled, maybe rearrangement necessary □
– proof for M ≥ 2N−1 via worst case scenario
connected to different middle stage crossbars
Figure: worst case scenario (ports 1 … n of an ingress and an egress crossbar, middle stage crossbars 1 … 2n−1)
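The cost argument and the two thresholds M ≥ N and M ≥ 2N−1 can be made concrete with a small calculation. This is a sketch under the stage sizes given above (R ingress crossbars N×M, M middle crossbars R×R, R egress crossbars M×N); the function names and the class label "strict-sense non-blocking" for the M ≥ 2N−1 case are choices made for this illustration.

```python
def clos_crosspoints(n, m, r):
    """Total switch elements (crosspoints) of a three-stage network."""
    ingress = r * (n * m)   # r crossbars of size n x m
    middle = m * (r * r)    # m crossbars of size r x r
    egress = r * (m * n)    # r crossbars of size m x n
    return ingress + middle + egress


def clos_blocking_class(n, m):
    """Classify the network by the relative values of m and n."""
    if m >= 2 * n - 1:
        return "strict-sense non-blocking"
    if m >= n:
        return "rearrangeable non-blocking"
    return "blocking"


# Example values (arbitrary): 16 ingress crossbars with 8 inputs each.
n, r = 8, 16
ports = n * r
for m in (7, 8, 15):
    print(f"m={m:2d}: {clos_crosspoints(n, m, r):5d} crosspoints "
          f"(single crossbar: {ports * ports}), {clos_blocking_class(n, m)}")
```

Even the variant with M = 2N−1 middle stage crossbars needs fewer crosspoints here than a single crossbar with the same number of ports.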
– more general concept of CLOS and fat tree networks
– construction of a non-blocking network connecting M nodes
  – the downstream bandwidth (in direction of the nodes) is identical to the upstream bandwidth (in direction of the interconnection)
– key for non-blocking: always preserve identical bandwidth (upstream and downstream) between any two levels
– observation: two-stage constant bisection bandwidth (CBB) networks connecting M nodes always need 3M ports (i. e. sum of inputs and outputs): per node two ports in the first stage and one port in the second stage
– frequently used: InfiniBand, e. g.
– example: CBB connecting 16 nodes with 4×4 switch elements
Figure: two-stage CBB network with switch elements on level 1 and level 2
established on the market
– a static and/or dynamic network topology
– sophisticated network interface cards (NIC)
– Myrinet
– InfiniBand
– Scalable Coherent Interface (SCI)
– Myrinet: developed in 1994 by Myricom for clusters
– particularly efficient due to
  – … and low-latency, kernel-bypass operations
– latest product: Myri-10G
– switching: rearrangeable non-blocking CLOS (128 nodes)
– programming model
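Figure: Myrinet programming model with the layers Application; TCP, UDP, and IP over Ethernet or Myrinet in the OS kernel; low level message passing via the Myrinet GM API (mmap) or a proprietary protocol (ParaStation, e. g.); Ethernet and Myrinet hardware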
– InfiniBand: unification of two competing efforts in 1999
– idea: introduction of a future I/O standard as successor for PCI
  – … (via target channel adapters (TCA)) to the I/O “fabric”
– switched point-to-point bidirectional links
– bonding of links for bandwidth improvements: 1× (2.5 Gbps), 4× (10 Gbps), 8× (20 Gbps), and 12× (30 Gbps)
– available for both copper and fiber cables
– nowadays only used for cluster connection
– particularly efficient (among others) due to
  – … write) via HCA to local and remote memory without CPU usage and CPU interrupts
– switching: constant bisection bandwidth (up to 288 nodes)
– approx. costs: 50,000 € for the switch and 110,000 € for 128 NICs
Figure: node with CPUs, memory controller, memory, and HCA, connected by links via a switch to further HCAs and TCAs
– Scalable Coherent Interface (SCI): originated as an offshoot of the IEEE Futurebus+ project in 1988
– became IEEE standard in 1992
– SCI is a high performance interconnect technology that …
– SCI controller monitors I/O transactions (memory) to assure cache coherence of all attached nodes, i. e. all write accesses that invalidate cache entries of other SCI modules are detected
– performance: up to 1 GBps with latencies smaller than 2 μs
– different topologies such as ring or torus possible
– shared memory: SCI uses a 64-bit fixed addressing scheme
is located
can be mapped into a node’s local memory
Figure: node A exports part of its physical address space into the SCI address space, node B imports it; processes P1 and P2 map it into their virtual address spaces via mmap