SLIDE 1

Technische Universität München

Parallel Programming and High-Performance Computing

Part 2: High-Performance Networks

  • Dr. Ralf-Peter Mundani

CeSIM / IGSSE

SLIDE 2

Overview

  • some definitions
  • (more) practical definitions
  • characteristics
  • static network topologies
  • dynamic network topologies
  • examples

640k is enough for anyone, and by the way, what’s a network? —William Gates III, chairman Microsoft Corp., 1984

SLIDE 3

Some Definitions

  • degree (node degree)

– number of connections (incoming and outgoing) between this node and other nodes

– degree of a network = max. degree of all nodes in the network
– higher degrees lead to

  • more parallelism and bandwidth for the communication
  • more costs (due to a higher amount of connections)

– objective: keep degree and, thus, costs small

(figure: example networks of degree 3 and degree 4)

SLIDE 4

Some Definitions

  • diameter

– distance of a pair of nodes: length of the shortest path between them, i. e. the number of nodes a message has to pass on its way from the sender to the receiver
– diameter of a network = max. distance over all pairs of nodes in the network
– higher diameters (between two nodes) lead to

  • longer communications
  • less fault tolerance (due to the higher number of nodes that have to work properly)

– objective: small diameter

(figure: example network with diameter 4)

SLIDE 5

Some Definitions

  • connectivity

– min. number of edges (cables) that have to be removed to disconnect the network, i. e. the network falls apart into two loose sub-networks
– higher connectivity leads to

  • more independent paths between two nodes
  • better fault tolerance (due to more routing possibilities)
  • faster communication (due to the avoidance of congestion in the network)

– objective: high connectivity

(figure: example network with connectivity 2)

SLIDE 6

Some Definitions

  • complexity / costs

– amount of necessary hardware for the realisation of the network (e. g. network cards, cables, switches)
– higher complexity / costs due to more hardware

  • regularity

– extent of deviation in local network quantities (e. g. degree, connectivity)
– more regular quantities are easier to implement

  • length of lines

– physical length of the connections
– shorter lengths of lines (for all connections) are advantageous

SLIDE 7

Some Definitions

  • bisection width

– min. number of edges (cables) that have to be removed to separate the network into two equal parts (bisection width ≠ connectivity, see example below)
– important for determining the number of messages that can be transmitted in parallel from one half of the nodes to the other half without the repeated usage of any connection
– extreme case: Ethernet with bisection width 1
– objective: high bisection width (ideal: number of nodes / 2)

(figure: example network with bisection width 4 but connectivity 3)
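
Since bisection width and connectivity are easy to confuse, here is a minimal Python sketch that computes the bisection width of a small graph by brute force over all balanced node partitions; this is only feasible for toy graphs, and the 4-node ring below is a hypothetical example, not the slide's figure.

```python
from itertools import combinations

def bisection_width(nodes, edges):
    """Min. number of edges crossing any split into two equal halves."""
    nodes = list(nodes)
    half = len(nodes) // 2
    best = None
    for part in combinations(nodes, half):
        a = set(part)
        crossing = sum(1 for u, v in edges if (u in a) != (v in a))
        best = crossing if best is None else min(best, crossing)
    return best

# hypothetical example: a ring of 4 nodes has bisection width 2
ring4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(bisection_width(range(4), ring4))   # -> 2
```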

SLIDE 8

Some Definitions

  • blocking

– a desired connection between two nodes cannot be established due to already existing connections between other pairs of nodes
– objective: non-blocking networks

  • extensibility

– in which steps the network can be extended (e. g. arbitrarily or only by doubling the number of nodes)

  • scalability

– keeping the essential properties of the network under any increase of the number of nodes

SLIDE 9

Some Definitions

  • fault tolerance (redundancy)

– connections between (arbitrary) nodes can still be established even under the breakdown of single components
– a fault-tolerant network has to provide at least one redundant path between all arbitrary pairs of nodes
– graceful degradation: the ability of a system to stay functional (maybe with less performance) even under the breakdown of single components

  • complexity of routing

– costs for determining a route for a message from the sender to the receiver
– objective: routing should be simple (to be implemented in hardware)

SLIDE 10

Overview

  • some definitions
  • (more) practical definitions
  • characteristics
  • static network topologies
  • dynamic network topologies
  • examples
SLIDE 11

Practical Definitions

  • bandwidth

– max. transmission performance of a network for a certain amount of time
– bandwidth B in general measured as megabits or megabytes per second (Mbps or MBps, resp.), nowadays more often as gigabits or gigabytes per second (Gbps or GBps, resp.)

  • bisection bandwidth

– max. transmission performance of a network over the bisection line, i. e. the sum of the single bandwidths of all edges (cables) that are “cut” when bisecting the network
– thus bisection bandwidth is a measure of bottleneck bandwidth
– units are the same as for bandwidth

SLIDE 12

Practical Definitions

  • latency

– delay time of a communication (time between sending and receiving the head of a message)
– latency L measured in seconds

  • transmission time (delay)

– time for transmitting an entire message between two nodes
– transmission time depends on the size S of a message
– in case there are no conflicts, the transmission time can be computed as D(S) = L + S/B
– sometimes this is also referred to as delay
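
To make the formula concrete, here is a minimal Python sketch of D(S) = L + S/B; the latency and bandwidth values are only illustrative (they match the example on the following slide).

```python
def transmission_time(size_bytes: float, latency_s: float, bandwidth_Bps: float) -> float:
    """Conflict-free transmission time D(S) = L + S/B."""
    return latency_s + size_bytes / bandwidth_Bps

L = 10e-6   # latency in seconds (assumed example value)
B = 10e6    # bandwidth in bytes per second (assumed example value)
print(transmission_time(100, L, B))   # 100-byte frame -> 2e-05 s (latency dominates)
print(transmission_time(1e6, L, B))   # 1 MB message   -> ~0.10001 s (bandwidth dominates)
```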

SLIDE 13

Practical Definitions

  • throughput

– bandwidth = throughput for maximal message size Smax
– mostly, the (theoretical) bandwidth is not achieved with common message sizes
– throughput: ratio between message size and delay, P(S) = S / D(S)
– throughput is interesting for determining the half-power point: at which message size SH can half of the bandwidth be achieved?
– example: L = 10 μs, B = 10 MBps
  ½B = SH / D(SH) = SH / (L + SH/B) → SH = B·L → SH = 0.1 kB
– a lower half-power point means a higher percentage of frames can take advantage of a network’s bandwidth
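
The half-power point derivation can be checked numerically; this small sketch uses the slide's example values (L = 10 μs, B = 10 MBps).

```python
def throughput(size_bytes: float, latency_s: float, bandwidth_Bps: float) -> float:
    """P(S) = S / D(S) with D(S) = L + S/B."""
    return size_bytes / (latency_s + size_bytes / bandwidth_Bps)

L, B = 10e-6, 10e6
S_H = B * L                     # closed-form half-power point S_H = B*L
print(S_H)                      # 100.0 bytes = 0.1 kB
print(throughput(S_H, L, B))    # 5000000.0 bytes/s = B/2, as expected
```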

SLIDE 14

Overview

  • some definitions
  • (more) practical definitions
  • characteristics
  • static network topologies
  • dynamic network topologies
  • examples
SLIDE 15

Characteristics

  • static networks

– fixed connections between pairs of nodes
– control functions are done by the nodes or by special connection hardware

  • dynamic networks

– no fixed connections between pairs of nodes
– all nodes are connected via inputs and outputs to a so-called switching component
– control functions are concentrated in the switching component
– various routes can be switched

SLIDE 16

Characteristics

  • data transfer

– circuit switching

  • first some address information is sent
  • inputs are assigned to outputs for a certain period of time
  • switching elements forward data without further processing
  • connection stays alive for the whole duration of the transmission
  • switched connection is private for one sender–receiver pair

– packet switching

  • messages are cut into frames (with address information) to be sent separately from a sender to a receiver node

  • switching elements process each incoming frame individually and assign it to a corresponding output

  • drawback: routing strategies necessary for finding outputs
  • nevertheless standard for multiprocessors
SLIDE 17

Characteristics

  • addressing mode

– destination-based routing

  • head of each frame contains unique address of receiver
  • all intermediate nodes use this address for routing decisions
  • most frequent choice in static networks

– source-based routing

  • frame contains all necessary (routing) information to reach the receiver node (on its way over intermediate nodes)

  • intermediate nodes only responsible for a correct forwarding
  • drawbacks?
  • most frequent choice in dynamic networks
  • dangerous within some protocols: e. g. the option “drop source-routed frames” for TCP/IP

SLIDE 18

Characteristics

  • routing

– deterministic

  • always the same path between pairs of nodes
  • advantage: simple and quick path determination (routing tables, e. g.)
  • drawback: risk of blocking, poor fault tolerance

– adaptive

  • possibility to select between alternative paths between pairs of nodes
  • advantage: more flexible
  • drawback: slightly increased hardware costs
SLIDE 19

Characteristics

  • flow control

– transmission between non-neighbouring sender and receiver nodes requires buffer mechanisms in intermediate nodes
– intermediate nodes have to guard against buffer overflows
– buffer management is organised via an additional flow control

  • transmission mode

– store-and-forward
– cut-through / wormhole routing
– virtual-cut-through
– buffered wormhole routing

SLIDE 20

Characteristics

  • transmission mode (cont’d)

– store-and-forward

  • each node contains a buffer for storing the entire frame
  • a frame is received and completely stored by an intermediate node, analysed, and then forwarded to the selected output

  • subsequent frames are transmitted successively
  • integrity of frames is verified by intermediate nodes
  • high bandwidth, but also high latency
SLIDE 21

Characteristics

  • transmission mode (cont’d)

– cut-through / wormhole routing

  • frames are subdivided into smaller parts called physical transfer units (phits) or flow control digits (flits)
  • one phit is the portion of data to be transferred between two nodes at any point in time
  • the switch element decides about the output as soon as the head of a frame (or at least enough of it, with the address information) has arrived
  • the rest of the frame follows along the path (without further processing) or is rejected in case the output is busy (→ blocking)

  • drawback: frames are distributed over several nodes
  • only very small times for address decoding
  • used primarily in multiprocessor systems
SLIDE 22

Characteristics

  • transmission mode (cont’d)

– virtual-cut-through

  • unlike wormhole routing, nodes receive and store also the rest of a frame in case of blocking
  • in general, blockings have only local character and disappear before causing deadlocks

(figure: cut-through / wormhole routing vs. virtual-cut-through)

SLIDE 23

Characteristics

  • transmission mode (cont’d)

– buffered wormhole routing

  • compromise between virtual-cut-through and wormhole routing
  • nodes with limited buffers receive and store parts of a frame in case of blocking
  • in general, blockings are distributed over smaller numbers of nodes and can disappear more easily

  • used in massively parallel computers (e. g. Cray T3E)
SLIDE 24

Overview

  • some definitions
  • (more) practical definitions
  • characteristics
  • static network topologies
  • dynamic network topologies
  • examples
SLIDE 25

Static Network Topologies

  • chain (linear array)

– one-dimensional network
– N nodes and N−1 edges
– degree = 2
– diameter = N−1
– bisection width = 1
– drawback: too slow for large N

SLIDE 26

Static Network Topologies

  • ring

– two-dimensional network
– N nodes and N edges
– degree = 2
– diameter = ⎣N/2⎦
– bisection width = 2
– drawback: too slow for large N
– how about fault tolerance?

SLIDE 27

Static Network Topologies

  • chordal ring

– two-dimensional network
– N nodes and 3N/2, 4N/2, 5N/2, … edges
– degree = 3, 4, 5, …
– higher degrees lead to

  • smaller diameters
  • higher fault tolerance (due to redundant connections)
  • drawback: higher costs

(figure: chordal rings with degree 3 and degree 4)

SLIDE 28

Static Network Topologies

  • completely connected

– two-dimensional network
– N nodes and N·(N−1)/2 edges
– degree = N−1
– diameter = 1
– bisection width = ⎣N/2⎦·⎡N/2⎤
– very high fault tolerance
– drawback: too expensive for large N

SLIDE 29

Static Network Topologies

  • star

– two-dimensional network
– N nodes and N−1 edges
– degree = N−1
– diameter = 2
– bisection width = ⎣N/2⎦
– drawback: bottleneck in central node

SLIDE 30

Static Network Topologies

  • barrel shifter

– two-dimensional network
– N = 2^k nodes and 2^(k−1)·(2k−1) edges
– degree = 2k−1
– diameter = ⎡k/2⎤
– each node has connections to all nodes with a distance d = 2^i, i ∈ [0, k−1]
– simple routing possible (→ address shifting)

(figure: barrel shifter with N = 8 nodes)
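
A minimal sketch of this connection rule, assuming nodes are numbered 0 to N−1; it lists the neighbours of a node and confirms the degree of 2k−1.

```python
def barrel_shifter_neighbours(i: int, k: int) -> set[int]:
    """Nodes at distance 2^j (j = 0 .. k-1) in both directions, modulo N = 2^k."""
    n = 2 ** k
    return {(i + d) % n for j in range(k) for d in (2 ** j, -(2 ** j))}

print(sorted(barrel_shifter_neighbours(0, 3)))   # -> [1, 2, 4, 6, 7]
# degree = 2k - 1: the distance N/2 is reached in both directions at once
print(len(barrel_shifter_neighbours(0, 3)))      # -> 5 = 2*3 - 1
```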

SLIDE 31

Static Network Topologies

  • barrel shifter (cont’d)

– spatial unfolding provides a shifter with k levels
– example: k = 3

(figure: unfolded barrel shifter for k = 3 with shift levels 1–3)

SLIDE 32

Static Network Topologies

  • binary tree

– two-dimensional network
– N nodes and N−1 edges (tree height h = ⎡log2 N⎤)
– degree = 3
– diameter = 2(h−1)
– bisection width = 1
– drawback: bottleneck in direction of root (→ blocking)

SLIDE 33

Static Network Topologies

  • binary tree (cont’d)

– addressing

  • label on level m consists of m bits; root has label “1”
  • suffix “0” is added to left son, suffix “1” is added to right son

– routing

  • find common parent node P of nodes S and D
  • ascend from S to P
  • descend from P to D

(figure: binary tree with node labels 1–1111 and an example route from S via common parent P to D)
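
A small sketch of this routing scheme on the bit labels: the common parent P is the longest common prefix of the two labels; the route ascends by dropping bits and descends by appending the destination's remaining bits. The source and destination labels are illustrative, not necessarily the slide's S and D.

```python
def tree_route(s: str, d: str) -> list[str]:
    """Route in a binary tree whose nodes carry the prefix labels defined above."""
    p = 0                                        # length of common prefix = parent P
    while p < min(len(s), len(d)) and s[p] == d[p]:
        p += 1
    up = [s[:i] for i in range(len(s), p, -1)]   # ascend from S to P
    down = [d[:i] for i in range(p, len(d) + 1)] # descend from P to D
    return up + down

print(tree_route("1001", "1100"))
# -> ['1001', '100', '10', '1', '11', '110', '1100']
```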

SLIDE 34

Static Network Topologies

  • binary tree (cont’d)

– solution to overcome the bottleneck → fat tree
– edges on level m get higher priority than edges on level m+1
– capacity is doubled on each higher level
– now, bisection width = 2^(h−2)
– frequently used: HLRB II, e. g.

SLIDE 35

Static Network Topologies

  • mesh / torus

– k-dimensional network
– N nodes and k·(N − r^(k−1)) edges (r × … × r mesh, r = N^(1/k))
– degree = 2k
– diameter = k·(r−1)
– bisection width = r^(k−1)
– high fault tolerance
– drawback

  • large diameter
  • too expensive for k > 3

SLIDE 36

Static Network Topologies

  • mesh / torus (cont’d)

– k-dimensional mesh with cyclic connections in each dimension
– N nodes and k·N edges (r × … × r mesh, r = N^(1/k))
– diameter = k·⎣r/2⎦
– bisection width = 2·r^(k−1)
– frequently used: BlueGene/L, e. g.
– drawback: too expensive for k > 3

SLIDE 37

Static Network Topologies

  • ILLIAC mesh

– two-dimensional network
– N nodes and 2N edges (r×r mesh, r = √N)
– degree = 4
– diameter = r−1
– bisection width = 2r
– conforms to a chordal ring of degree 4

SLIDE 38

Static Network Topologies

  • hypercube

– k-dimensional network
– 2^k nodes and k·2^(k−1) edges
– degree = k
– diameter = k
– bisection width = 2^(k−1)
– drawback: scalability (only doubling of nodes allowed)
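
The metric formulas from this and the preceding slides, condensed into a small Python sketch; the functions simply restate the stated values, and the torus side length r is assumed to be an integer k-th root of N.

```python
def chain(n):      return {"degree": 2, "diameter": n - 1, "bisection": 1}
def ring(n):       return {"degree": 2, "diameter": n // 2, "bisection": 2}
def hypercube(k):  return {"nodes": 2 ** k, "degree": k, "diameter": k,
                           "bisection": 2 ** (k - 1)}
def torus(n, k):   # r x ... x r torus with r = n^(1/k)
    r = round(n ** (1 / k))
    return {"degree": 2 * k, "diameter": k * (r // 2), "bisection": 2 * r ** (k - 1)}

print(hypercube(10))   # 1024 nodes: degree 10, diameter 10, bisection 512
print(torus(64, 2))    # 8x8 torus: degree 4, diameter 8, bisection 16
```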

SLIDE 39

Static Network Topologies

  • hypercube (cont’d)

– principle design

  • construction of a k-dimensional hypercube via connection of the corresponding nodes of two (k−1)-dimensional hypercubes
  • inherent labelling via adding prefix “0” to one sub-cube and prefix “1” to the other sub-cube

(figure: construction of 0D, 1D, 2D, and 3D hypercubes with binary node labels)

SLIDE 40

Static Network Topologies

  • hypercube (cont’d)

– nodes are directly connected only for a HAMMING distance of 1
– routing

  • compute S XOR D for the possible ways between nodes S and D
  • route frames in increasing / decreasing order of the differing bits until the final destination is reached

– example

  • S = “011”, D = “110”
  • S XOR D = “101”
  • decreasing: “011” → “010” → “110”
  • increasing: “011” → “111” → “110”

(figure: 3D hypercube with the two routes from S = 011 to D = 110)
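
A minimal sketch of this XOR-based routing: every set bit of S XOR D marks one dimension in which the current address must be flipped. The order names follow the slide's labels, and both calls reproduce the slide's example routes.

```python
def hypercube_route(s: int, d: int, k: int, order: str = "decreasing"):
    """Flip each differing dimension once; returns the route as bit strings."""
    bits = range(k) if order == "decreasing" else range(k - 1, -1, -1)
    route, cur = [s], s
    for b in bits:                 # inspect each dimension once
        if (s ^ d) & (1 << b):     # bit differs -> flip this dimension
            cur ^= 1 << b
            route.append(cur)
    return [format(x, f"0{k}b") for x in route]

# slide example: S = 011, D = 110, S XOR D = 101
print(hypercube_route(0b011, 0b110, 3))                      # ['011', '010', '110']
print(hypercube_route(0b011, 0b110, 3, order="increasing"))  # ['011', '111', '110']
```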

SLIDE 41

Overview

  • some definitions
  • (more) practical definitions
  • characteristics
  • static network topologies
  • dynamic network topologies
  • examples
SLIDE 42

Dynamic Network Topologies

  • bus

– simple and cheap single stage network
– shared usage by all connected nodes, thus just one frame transfer at any point in time
– frame transfer in one step (i. e. diameter = 1)
– good extensibility, but bad scalability
– fault tolerance only for multiple bus systems
– example: Ethernet

(figure: single bus vs. multiple bus (here dual))

SLIDE 43

Dynamic Network Topologies

  • crossbar

– completely connected network with all possible permutations of N inputs and N outputs (in general N×M inputs / outputs)
– switch elements allow simultaneous communication between all possible disjoint pairs of inputs and outputs without blocking
– very fast (diameter = 1), but expensive due to N² switch elements
– used for processor–processor and processor–memory coupling
– example: The Earth Simulator

(figure: 3×3 crossbar with switch elements connecting inputs 1–3 to outputs 1–3)
SLIDE 44

Dynamic Network Topologies

  • permutation networks

– tradeoff between the low performance of buses and the high hardware costs of crossbars
– often a 2×2 crossbar as basic element
– N inputs can simultaneously be switched to N outputs → permutation of inputs (to outputs)

  • single stage: consists of one column of 2×2 switch elements
  • multistage: consists of several of those columns

(figure: 2×2 switch element states: straight, crossed, upper broadcast, lower broadcast)

SLIDE 45

Dynamic Network Topologies

  • permutation networks (cont’d)

– permutations: unique (bijective) mapping of inputs to outputs
– addressing

  • label inputs from 0 to 2N−1 (in case of N switch elements)
  • write labels in binary representation (aK, aK−1, …, a2, a1)

– permutations can now be expressed as simple bit manipulations
– typical permutations

  • perfect shuffle
  • butterfly
  • exchange
SLIDE 46

Dynamic Network Topologies

  • permutation networks (cont’d)

– perfect shuffle permutation

  • cyclic left shift
  • P(aK, aK−1, …, a2, a1) → (aK−1, …, a2, a1, aK)

(table: perfect shuffle for k = 3, a3a2a1 → a2a1a3: 000→000, 001→010, 010→100, 011→110, 100→001, 101→011, 110→101, 111→111)

SLIDE 47

Dynamic Network Topologies

  • permutation networks (cont’d)

– butterfly permutation

  • exchange of first / highest and last / lowest bit
  • B(aK, aK−1, …, a2, a1) → (a1, aK−1, …, a2, aK)

(table: butterfly for k = 3, a3a2a1 → a1a2a3: 000→000, 001→100, 010→010, 011→110, 100→001, 101→101, 110→011, 111→111)

SLIDE 48

Dynamic Network Topologies

  • permutation networks (cont’d)

– exchange permutation

  • negation of last / lowest bit
  • E(aK, aK−1, …, a2, a1) → (aK, aK−1, …, a2, ā1)

(table: exchange for k = 3, a3a2a1 → a3a2ā1: 000→001, 001→000, 010→011, 011→010, 100→101, 101→100, 110→111, 111→110)
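
The three permutations as executable bit manipulations; running the loop with k = 3 reproduces the tables above.

```python
def shuffle(a: int, k: int) -> int:
    """Perfect shuffle: cyclic left shift of the k-bit label."""
    msb = (a >> (k - 1)) & 1
    return ((a << 1) & ((1 << k) - 1)) | msb

def butterfly(a: int, k: int) -> int:
    """Butterfly: exchange highest and lowest bit."""
    hi, lo = (a >> (k - 1)) & 1, a & 1
    a &= ~(1 << (k - 1)) & ~1 & ((1 << k) - 1)   # clear both end bits
    return a | (lo << (k - 1)) | hi

def exchange(a: int) -> int:
    """Exchange: negate the lowest bit."""
    return a ^ 1

for a in range(8):
    print(f"{a:03b} -> P:{shuffle(a, 3):03b} B:{butterfly(a, 3):03b} E:{exchange(a):03b}")
```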

SLIDE 49

Dynamic Network Topologies

  • permutation networks (cont’d)

– example: perfect shuffle connection pattern
– problem: not all destinations are accessible from a source

(figure: repeated perfect shuffle stages over nodes 0–7)

SLIDE 50

Dynamic Network Topologies

  • permutation networks (cont’d)

– adding additional exchange permutations (→ shuffle-exchange)
– all destinations are now accessible from any source

(figure: shuffle-exchange stages over nodes 0–7)

SLIDE 51

Dynamic Network Topologies

  • omega

– based on the shuffle-exchange connection pattern
– exchange permutations replaced by 2×2 switch elements

(figure: omega network connecting 8 inputs to 8 outputs)

SLIDE 52

Dynamic Network Topologies

  • omega (cont’d)

– multistage network
– N nodes and E = N/2·(log2 N) switch elements
– diameter = log2 N (all stages have to be passed)
– N! permutations possible, but only 2^E different switch states
– (self configuring) routing

  • compare the addresses of S and D bitwise from left to right, i. e. stage i evaluates address bits si and di
  • if equal switch straight (−), otherwise switch crossed (×)

– example

  • S = “001”, D = “010”
  • switch states: − × ×
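
The self-routing rule as a one-line sketch; addresses are given as bit strings, and the slide's example is reproduced below.

```python
def omega_switch_states(s: str, d: str) -> str:
    """Stage i: straight (-) if bit i of S and D agree, crossed (x) otherwise."""
    return "".join("-" if sb == db else "x" for sb, db in zip(s, d))

# slide example: S = 001, D = 010
print(omega_switch_states("001", "010"))   # -> '-xx'
```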
SLIDE 53

Dynamic Network Topologies

  • omega (cont’d)

– omega is a bidelta network → operates also backwards
– drawback: blocking possible

(figure: blocking example in an omega network for 8 nodes)

SLIDE 54

Dynamic Network Topologies

  • banyan / butterfly

– idea: unrolling of a static hypercube
– bitwise processing of address bits ai from left to right → dynamic hypercube, a. k. a. butterfly (known from the FFT flow diagram)

(figure: butterfly network derived from unrolling a 3D hypercube, node labels 000–111)

SLIDE 55

Dynamic Network Topologies

  • banyan / butterfly (cont’d)

– replace crossed connections by 2×2 switch elements
– introduced by GOKE and LIPOVSKI in 1973; blocking still possible

(figure: banyan tree network for 8 nodes)

SLIDE 56

Dynamic Network Topologies

  • BENEŠ

– multistage network
– butterfly merged at the last column with its copied mirror
– N nodes and N·(log2 N)−N/2 switch elements
– diameter = 2(log2 N)−1
– N! permutations possible, all can be switched
– key property: for any permutation of inputs to outputs there is a contention-free routing

SLIDE 57

Dynamic Network Topologies


  • BENEŠ (cont’d)

– example

  • S1 = 2, D1 = 3 and S2 = 3, D2 = 1 → blocking for the butterfly

(figure: butterfly network, both connections contend for the same link)
SLIDE 58

Dynamic Network Topologies

  • BENEŠ (cont’d)

– example

  • S1 = 2, D1 = 3 and S2 = 3, D2 = 1 → no blocking for BENEŠ

(figure: BENEŠ network routing both connections without contention)

SLIDE 59

Dynamic Network Topologies

  • CLOS

– proposed by CLOS in 1953 for telephone switching systems
– objective: to overcome the costs of crossbars (N² switch elements)
– idea

  • replace the entire crossbar with three stages of smaller ones

– ingress stage: R crossbars with N×M inputs / outputs
– middle stage: M crossbars with R×R inputs / outputs
– egress stage: R crossbars with M×N inputs / outputs

  • thus much fewer switch elements than for the entire system

– any incoming frame is routed from the input via one of the middle stage crossbars to the respective output
– a middle stage crossbar is available if both links to the ingress and egress stage are free

SLIDE 60

Dynamic Network Topologies

  • CLOS (cont’d)

– R⋅N inputs can be assigned to R⋅N outputs

(figure: three-stage CLOS network with R ingress crossbars (N×M), M middle crossbars (R×R), and R egress crossbars (M×N))

SLIDE 61

Dynamic Network Topologies

  • CLOS (cont’d)

– relative values of M and N define the blocking characteristics

  • M ≥ N: rearrangeable non-blocking

– a free input can always be connected to a free output
– existing connections might be assigned to different middle stage crossbars (rearrangement)

  • M ≥ 2N−1: strict-sense non-blocking

– a free input can always be connected to a free output
– no re-assignment necessary
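
A small sketch that classifies a hypothetical CLOS configuration by the two conditions above and compares its crosspoint count to a single R·N × R·N crossbar; all concrete numbers are illustrative.

```python
def clos_summary(n: int, m: int, r: int) -> dict:
    """Blocking class and crosspoint count of a three-stage CLOS network."""
    blocking = ("strict-sense non-blocking" if m >= 2 * n - 1 else
                "rearrangeable non-blocking" if m >= n else
                "blocking")
    # crosspoints: R ingress (N x M) + M middle (R x R) + R egress (M x N)
    crosspoints = r * n * m + m * r * r + r * m * n
    return {"inputs": r * n, "class": blocking,
            "crosspoints": crosspoints, "single crossbar": (r * n) ** 2}

print(clos_summary(n=6, m=11, r=6))
# -> 36 inputs, strict-sense non-blocking, 1188 crosspoints vs. 1296 for a crossbar
```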

SLIDE 62

Dynamic Network Topologies

  • CLOS (cont’d)

– proof for M ≥ N via HALL’s “Marriage Theorem” (1)

Let G = (VIN, VOUT, E) be a bipartite graph. A perfect matching for G is an injective function f : VIN → VOUT so that for every x ∈ VIN, there is an edge in E whose endpoints are x and f(x).

One would expect a perfect matching to exist if G contains “enough” edges, i. e. if for every subset A ⊂ VIN the image set δA ⊂ VOUT is sufficiently large.

Theorem: G has a perfect matching if and only if for every subset A ⊂ VIN the inequality |A| ≤ |δA| holds.

Often explained as follows: imagine two groups of N men and N women. If any subset of S boys (where 0 ≤ S ≤ N) knows S or more girls, each boy can be married to a girl he knows.

SLIDE 63

Dynamic Network Topologies

  • CLOS (cont’d)

– proof for M ≥ N via HALL’s “Marriage Theorem” (2)

boy := ingress stage crossbar, girl := egress stage crossbar; a boy knows a girl if there exists a (direct) connection between them; assume there is one free input and one free output left

1) for 0 ≤ S ≤ R boys there are S·N connections → at least S girls
2) thus, HALL’s theorem states there exists a perfect matching
3) R connections can be handled by one middle stage crossbar
4) bundle these connections and delete the middle stage crossbar
5) repeat from step 1) until M = 1
6) the new connection can be handled, maybe with rearrangement □

SLIDE 64

Dynamic Network Topologies

  • CLOS (cont’d)

– proof for M ≥ 2N−1 via worst case scenario

  • a crossbar with N−1 inputs and a crossbar with N−1 outputs, all connected to different middle stage crossbars
  • one further connection

(figure: worst case: the first N−1 connections occupy middle crossbars 1 to N−1, the next N−1 occupy crossbars N to 2N−2, and the final connection needs crossbar 2N−1)

SLIDE 65

Dynamic Network Topologies

  • constant bisection bandwidth

– more general concept of CLOS and fat tree networks
– construction of a non-blocking network connecting M nodes

  • using multiple levels of basic N×N switch elements (M > N)
  • for any given level, the downstream bandwidth (in direction to the nodes) is identical to the upstream bandwidth (in direction to the interconnection)

– key for non-blocking: always preserve identical bandwidth (upstream and downstream) between any two levels
– observation: two-stage constant bisection bandwidth (CBB) networks connecting M nodes always need 3M ports (i. e. the sum of inputs and outputs), as each node needs two ports in the first and one port in the second stage
– frequently used: InfiniBand, e. g.

SLIDE 66

Dynamic Network Topologies

  • constant bisection bandwidth (cont’d)

– example: CBB connecting 16 nodes with 4×4 switch elements

  • hence in total 48 ports (i. e. 6 switch elements) are necessary

(figure: two-stage CBB network for 16 nodes, levels 1 and 2)
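
The port-count observation as a two-line computation; applied to the slide's example of 16 nodes and 4×4 switch elements it reproduces the 48 ports and 6 switch elements.

```python
def cbb_two_stage(m_nodes: int, n_switch: int):
    """Ports and switch elements of a two-stage CBB network of NxN switches."""
    ports = 3 * m_nodes                   # each node: 2 ports in stage 1, 1 in stage 2
    switches = ports // (2 * n_switch)    # an NxN element has 2N ports
    return ports, switches

print(cbb_two_stage(16, 4))   # slide example -> (48, 6)
```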

SLIDE 67

Overview

  • some definitions
  • (more) practical definitions
  • characteristics
  • static network topologies
  • dynamic network topologies
  • examples
SLIDE 68

Examples

  • in the past years, different (proprietary) high-performance networks have become established on the market

  • typically, these consist of

– a static and/or dynamic network topology
– sophisticated network interface cards (NIC)

  • popular networks

– Myrinet
– InfiniBand
– Scalable Coherent Interface (SCI)

SLIDE 69

Examples

  • Myrinet

– developed in 1994 by Myricom for clusters
– particularly efficient due to

  • usage of onboard (NIC) processors for protocol offload and low-latency, kernel-bypass operations
  • highly scalable, cut-through switching

– latest product: Myri-10G

  • available for both copper and fiber cables
  • 10+10 Gbps throughput (two-way for sending and receiving)
  • measured performance: 9.6 Gbps one-way with 2.3μs latency

– switching: rearrangeable non-blocking CLOS (128 nodes)

  • “spine” of CLOS network consists of eight 16×16 crossbars
  • nodes are connected via line-cards with 8×8 crossbar each
  • approx. costs 50,000 € for switch and 75,000 € for NICs


SLIDE 70

Examples

  • Myrinet (cont’d)

– programming model

(figure: Myrinet programming model: an application uses either TCP/UDP/IP over the OS kernel’s Ethernet path, a proprietary protocol (ParaStation, e. g.), or the Myrinet GM API for low level message passing; user-level access via mmap)

SLIDE 71

Examples

  • InfiniBand

– unification of two competing efforts in 1999

  • Future I/O initiative (Compaq, IBM, HP)
  • Next-Generation I/O initiative (Dell, Intel, SUN et al.)

– idea: introduction of a future I/O standard as successor for PCI

  • overcome the bottleneck of limited I/O bandwidth
  • connection of hosts (via host channel adapters (HCA)) and devices (via target channel adapters (TCA)) to the I/O “fabric”

– switched point-to-point bidirectional links
– bonding of links for bandwidth improvements: 1× (2.5 Gbps), 4× (10 Gbps), 8× (20 Gbps), and 12× (30 Gbps)
– available for both copper and fiber cables
– nowadays only used for cluster connection

SLIDE 72

Examples

  • InfiniBand (cont’d)

– particularly efficient (among others) due to

  • protocol offload and reduced CPU utilisation
  • Remote Direct Memory Access (RDMA), i. e. direct access (read and write) via HCA to local and remote memory without CPU usage and CPU interrupts

– switching: constant bisection bandwidth (up to 288 nodes)
– approx. costs 50,000 € for switch and 110,000 € for 128 NICs

(figure: InfiniBand fabric: CPUs and memory attach via the memory controller and an HCA to the switch; devices attach via a TCA)


SLIDE 73

Examples

  • Scalable Coherent Interface

– originated as an offshoot of the IEEE Futurebus+ project in 1988
– became an IEEE standard in 1992
– SCI is a high performance interconnect technology that

  • connects up to 64,000 nodes (both hosts and devices)
  • supports remote memory access for read / write (NUMA)
  • uses packet switching point-to-point communication

– the SCI controller monitors I/O transactions (memory) to assure cache coherence of all attached nodes, i. e. all write accesses that invalidate cache entries of other SCI modules are detected
– performance: up to 1 GBps with latencies smaller than 2 μs
– different topologies such as ring or torus possible

SLIDE 74

Examples

  • Scalable Coherent Interface (cont’d)

– shared memory: SCI uses a 64-bit fixed addressing scheme

  • upper 16 bits specify the node on which the addressed physical storage is located
  • lower 48 bits specify the local physical address within memory
  • hence, any physical memory location of the entire memory space can be mapped into a node’s local memory

(figure: SCI address space: node A exports a region of its physical address space; node B imports it and maps it via mmap into the virtual address spaces of processes P1 and P2)
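
A minimal sketch of this fixed 64-bit addressing scheme as bit operations; the node id and offset values are illustrative.

```python
def sci_split(addr: int) -> tuple[int, int]:
    """Split a 64-bit SCI address into (node id, local physical address)."""
    node = addr >> 48                  # upper 16 bits select the node
    offset = addr & ((1 << 48) - 1)    # lower 48 bits address local memory
    return node, offset

def sci_join(node: int, offset: int) -> int:
    """Compose a 64-bit SCI address from node id and local address."""
    return (node << 48) | offset

addr = sci_join(node=5, offset=0x1000)
print(hex(addr))          # 0x5000000001000
print(sci_split(addr))    # (5, 4096)
```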