Scalable Interconnection Networks — Chapter 10
Adapted from the publisher's slides by M. Mario Côrtes, IC/Unicamp, 2009s2

SLIDE 1

Mario Côrtes – IC/Unicamp – 2009s2

Chapter 10 Scalable Interconnection Networks


SLIDE 2

10.1 Scalable, High Performance Network

At the core of parallel computer architecture
Requirements and trade-offs at many levels

  • Elegant mathematical structure
  • Deep relationships to algorithm structure
  • Managing many traffic flows
  • Electrical / Optical link properties


Little consensus

  • interactions across levels
  • Performance metrics?
  • Cost metrics?
  • Workload?

=> need holistic understanding

[Figure: nodes, each with processor (P), memory (M), and communication assist (CA), attached through network interfaces to a Scalable Interconnection Network]

(p. 749)

SLIDE 3

Requirements from Above

Communication-to-computation ratio

=> bandwidth that must be sustained for given computational rate

  • traffic localized or dispersed?
  • bursty or uniform?

Programming Model

  • protocol

  • granularity of transfer
  • degree of overlap (slackness)

=> job of a parallel machine network is to transfer information from source node to dest. node in support of network transactions that realize the programming model

(p. 750)

SLIDE 4

Goals

Latency as small as possible
As many concurrent transfers as possible

  • operation bandwidth
  • data bandwidth

Cost as low as possible


(p. 751)

SLIDE 5

Outline

Introduction
10.1-2 Basic concepts, definitions, performance perspective
10.3 Organizational structure
10.4 Topologies
10.6 Routing and switch design


SLIDE 6

Basic Definitions

Network interface
Links

  • bundle of wires or fibers that carries a signal

Switches

  • connects a fixed number of input channels to a fixed number of output channels

(p. 751-752)

SLIDE 7

Links and Channels

Transmitter converts a stream of digital symbols into a signal that is driven down the link; the receiver converts it back.

[Figure: Transmitter => link => Receiver, carrying a stream of symbols]

  • transmitter and receiver share a physical protocol

Transmitter + link + receiver form a Channel for digital information flow between switches
Link-level protocol segments the stream of symbols into larger units: packets

  • or messages (framing)

node-level protocol embeds commands for dest communication assist within packet

(p. 751-752)

SLIDE 8

Formalism

A network is a graph: V = {switches and nodes} connected by communication channels C ⊆ V × V
A channel has width w and signaling rate f = 1/τ

  • channel bandwidth b = wf
  • phit (physical unit) data transferred per cycle
  • flit - basic unit of flow-control


Number of input (output) channels is the switch degree
The sequence of switches and links followed by a message is a route
Think streets and intersections

(p. 752)
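The channel definitions above can be made concrete with a short sketch (the helper name is illustrative, not from any library): bandwidth follows directly from b = w·f, and a phit is simply the w bits moved per cycle.

```python
# Channel bandwidth from the definitions above: width w (bits) and
# signaling rate f = 1/tau give b = w * f.

def channel_bandwidth(w_bits: int, f_hz: float) -> float:
    """Channel bandwidth in bytes per second."""
    return (w_bits / 8) * f_hz

# A phit (physical unit) is the data transferred per cycle: w bits.
# Example: a 16-bit-wide channel clocked at 150 MHz.
print(channel_bandwidth(16, 150e6))  # 300000000.0 -> 300 MB/s
```

The 16-bit / 150 MHz example is chosen to match the T3D link described later in the deck.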

SLIDE 9

What characterizes a network?

Topology (what)

  • physical interconnection structure of the network graph
  • direct: a host node connected to every switch
  • indirect: nodes connected to specific subset of switches

Routing Algorithm (which routes)

  • restricts the set of paths that msgs may follow
  • many algorithms with different properties


– deadlock avoidance?

Switching Strategy (how)

  • how data in a msg traverses a route
  • circuit switching vs. packet switching

Flow Control Mechanism (when)

  • when a msg or portions of it traverse a route
  • what happens when traffic is encountered?
  • (see the definition of flit and the example on p. 753)

(p. 752-753)

SLIDE 10

What determines performance

Interplay of all of these aspects of the design


SLIDE 11

Topological Properties

Routing Distance – number of links on a route
Shortest Path – smallest routing distance between two nodes
Diameter – maximum shortest path between any two nodes
Average Distance – average of the routing distances over all pairs of nodes
A network is partitioned by a set of links if their removal disconnects the graph

(p. 753-754)

SLIDE 12

Typical Packet Format

[Packet format: Routing and Control Header | Data Payload | Error Code Trailer — a sequence of digital symbols transmitted over a channel]


(Packet switching is used more than circuit switching)
Components: Header (routing and control), Trailer (ECC), Payload
Two basic mechanisms for abstraction

  • Encapsulation: carries higher-level protocol information inside the packet
  • Fragmentation: splits the higher-level protocol information into a sequence of messages

(p. 754)

SLIDE 13

10.2 Communication Perf: Latency

  • Time(n)s-d = overhead + routing delay + channel occupancy + contention delay
  • Occupancy = (n + ne) / b
  • n = payload size; ne = envelope size
  • b = bandwidth
  • seen in previous chapters; but now from the network's point of view
  • Routing delay?
  • depends on the number of links (routing distance) and on the delay ∆ of each switch

  • Contention?

(p. 756)

SLIDE 14

Store&Forward vs Cut-Through Routing

[Figure: time-space diagrams of Store & Forward routing vs Cut-Through routing from Source to Dest, flits 0-3 advancing hop by hop over Time]


Store & forward: h(n/b + ∆)  vs  cut-through: n/b + h∆
What if the message is fragmented? (see text; similar to cut-through)
See text: packet switching and circuit switching; wormhole vs virtual cut-through

(p. 756-757) h = routing distance; ∆ = delay/hop
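The two formulas above can be sketched directly (a simplified model: overhead and contention are ignored, function names are illustrative):

```python
# h = routing distance (hops), n = message size, b = channel bandwidth,
# delta = per-switch routing delay.

def store_and_forward(n, b, h, delta):
    # each hop receives the whole message before forwarding it
    return h * (n / b + delta)

def cut_through(n, b, h, delta):
    # the head of the message pipelines through the switches
    return n / b + h * delta

# Example: 100-byte message, 1 byte/cycle channel, 5 hops, 2-cycle switches
print(store_and_forward(100, 1, 5, 2))  # 510 cycles
print(cut_through(100, 1, 5, 2))        # 110 cycles
```

The gap between the two grows with both h and n, which is why cut-through dominates in modern parallel machines.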

SLIDE 15

Contention

Two packets trying to use the same link at same time


  • limited buffering
  • drop?

Most parallel machine networks block in place

  • link-level flow control
  • (back pressure)
  • tree saturation

Closed system - offered load depends on delivered

(p. 759)

SLIDE 16

10.2.2 Bandwidth

What affects local bandwidth?

  • packet density

b · n / (n + ne)

  • routing delay (derated)

b · n / (n + ne + w∆)

  • contention

– endpoints – within the network

Aggregate bandwidth

w∆ = opportunity lost due to a blocked link

  • bisection bandwidth

– sum of bandwidth of smallest set of links that partition the network

(two equal halves)

  • total bandwidth of all the channels: Cb (number of channels × bandwidth per channel)
  • suppose N hosts issue a packet every M cycles with average distance h

– each message occupies h channels for n/w cycles each
– C/N channels available per node
– link utilization ρ = (h · n/w) / (M · C/N) < 1 (in practice << 1)

(p. 761-762)
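The utilization bound can be sketched numerically (hedged: the formula is reconstructed from the per-node accounting above, and the example numbers are made up for illustration):

```python
def link_utilization(N, M, h, n, w, C):
    """rho: fraction of aggregate channel capacity demanded when each of
    N hosts issues an n-bit packet on w-bit channels every M cycles,
    traveling h hops on a network with C channels in total."""
    channel_cycles_demanded_per_cycle = N * h * (n / w) / M
    return channel_cycles_demanded_per_cycle / C

# Example: 64 hosts, a packet every 200 cycles, average distance 4,
# 160-bit packets on 16-bit channels, 256 channels in the network
print(link_utilization(64, 200, 4, 160, 16, 256))  # 0.05
```

A value near 1 means the offered load has reached the aggregate capacity; the saturation plots on the next slide show latency exploding well before that point.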

SLIDE 17

Saturation

[Figure: two plots — Latency vs Delivered Bandwidth, with latency rising sharply at saturation; and Delivered Bandwidth vs Offered Bandwidth, with delivered bandwidth tracking offered load until it flattens at saturation]

(p. 762-763)

  • Delivered bandwidth: the value actually "delivered" by the network
  • saturation: latency grows exponentially
  • Offered bandwidth: the demand placed by the applications
  • ok for low demands; reaches saturation beyond that
SLIDE 18

Outline

Introduction
10.1-2 Basic concepts, definitions, performance perspective
10.3 Organizational structure
10.4 Topologies
10.6 Routing and switch design


SLIDE 19

10.3 Organizational Structure

Processors

  • datapath + control logic + memory interface
  • datapath: ALU, register file, pipeline latches …..
  • control logic determined by examining register transfers in the datapath
  • connections: datapath (short and fast); control (long and slow); scaling??

Networks (composed of these components)

  • links
  • switches
  • network interfaces

(p. 764)

SLIDE 20

10.3.1 Link Design/Engineering Space

  • Cable of one or more wires/fibers with connectors at the ends, attached to switches or interfaces
  • important characteristics: length, width, clocking

Short:
  • single logical value at a time
Long:
  • stream of logical values at a time

Narrow:
  • control, data and timing multiplexed on one wire
Wide:
  • control, data and timing on n separate wires

Synchronous:
  • source & dest on same clock
Asynchronous:
  • source encodes clock in signal

(p. 764)

SLIDE 21

Example: Cray MPPs

T3D: Short, Wide, Synchronous (300 MB/s)

  • 24 bits: 16 data, 4 control, 4 reverse direction flow control
  • single 150 MHz clock (including processor)
  • flit = phit = 16 bits
  • two control bits identify flit type (idle and framing)

– no-info, routing tag, packet, end-of-packet

T3E: long, wide, asynchronous (500 MB/s)


  • 14 bits, 375 MHz, LVDS (low voltage differential signal)
  • flit = 5 phits = 70 bits

– 64 bits data + 6 control

  • switches operate at 75 MHz
  • framed into 1-word and 8-word read/write request packets

Cost = f(length, width) ?

(p. 764)

SLIDE 22

10.3.2 Switches

[Figure: switch — input ports (receiver, input buffer) connected through a cross-bar to output ports (output buffer, transmitter)]


  • usually, number of input ports = number of output ports = degree (there are exceptions)

Control: routing, scheduling

(p. 767)

SLIDE 23

Switch Components

Output ports

  • transmitter (typically drives clock and data)

Input ports

  • synchronizer aligns data signal with local clock domain
  • essentially FIFO buffer

Crossbar

  • connects each input to any output

  • degree limited by area or pinout

Buffering Control logic

  • complexity depends on routing logic and scheduling algorithm
  • determine output port for each incoming packet
  • arbitrate among inputs directed at same output

(p. 767-768)

SLIDE 24

Outline

Introduction
10.1-2 Basic concepts, definitions, performance perspective
10.3 Organizational structure
10.4 Topologies
10.6 Routing and switch design


SLIDE 25

10.4 Interconnection Topologies

Classes of networks scaling with N

Logical properties:

  • distance, degree

Physical properties:

  • length, width

10.4.1 Fully connected network (one switch, degree N, all-to-all)

  • diameter = 1
  • degree = N
  • cost?

– bus => O(N), but BW is O(1)

  • actually worse

– crossbar => O(N²) for BW O(N)

  • fully connected networks are not scalable in practice

VLSI technology determines switch degree (some nodes are fully connected sub-networks)

(p. 768)

SLIDE 26

10.4.2 Linear Arrays and Rings

Linear Array

[Figure: linear array; torus; torus arranged to use short wires]

  • Diameter?
  • Average Distance?
  • Bisection bandwidth?
  • Route A -> B given by relative address R = B-A

Torus?
Examples: FDDI, SCI, FiberChannel Arbitrated Loop, KSR1

(p. 769)
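The questions on this slide have closed-form answers; a sketch for concreteness (standard formulas, illustrative function names):

```python
def linear_array(N):
    # ends are N-1 hops apart; cutting the middle removes 1 link
    return {"diameter": N - 1, "avg_distance": N / 3, "bisection": 1}

def ring(N):
    # the wraparound link halves distances; a cut must remove 2 links
    return {"diameter": N // 2, "avg_distance": N / 4, "bisection": 2}

# Route A -> B on an array is given by the relative address R = B - A:
# step one switch at a time in the sign of R.
def route(A, B):
    step = 1 if B > A else -1
    return list(range(A + step, B + step, step))  # switches visited

print(ring(16))        # diameter 8, avg distance 4.0, bisection 2
print(route(2, 5))     # [3, 4, 5]
```

Comparing `linear_array(N)` with `ring(N)` shows why the torus question matters: one extra link doubles the bisection and halves the diameter.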

SLIDE 27

10.4.3 Multidimensional Meshes and Tori

[Figure: 2D grid; 3D cube]


d-dimensional array

  • N = k_{d-1} × ... × k_0 nodes
  • described by a d-vector of coordinates (i_{d-1}, ..., i_0)

d-dimensional k-ary mesh: N = k^d

  • k = N^{1/d}
  • described by a d-vector of radix-k coordinates

d-dimensional k-ary torus (or k-ary d-cube)?

(p. 769)
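The "d-vector of radix-k coordinates" is just the radix-k digit expansion of the node index; a small sketch (helper names are illustrative):

```python
# Node index <-> coordinate vector for a d-dimensional k-ary mesh (N = k**d).

def to_coords(node, k, d):
    """Index -> (i_{d-1}, ..., i_0), most significant digit first."""
    coords = []
    for _ in range(d):
        coords.append(node % k)
        node //= k
    return tuple(reversed(coords))

def to_index(coords, k):
    idx = 0
    for c in coords:
        idx = idx * k + c
    return idx

# 3-ary 2-cube (3x3 mesh): node 7 sits at coordinates (2, 1)
print(to_coords(7, k=3, d=2))   # (2, 1)
print(to_index((2, 1), k=3))    # 7
```

The same conversion is what dimension-order routing operates on, dimension by dimension.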

SLIDE 28

Properties

Routing

  • relative distance: R = (b_{d-1} - a_{d-1}, ..., b_0 - a_0)
  • traverse r_i = b_i - a_i hops in each dimension
  • dimension-order routing

Average Distance? Wire Length?

  • d × 2k/3 for mesh
  • dk/2 for cube

Degree? Bisection bandwidth? Partitioning?

  • k^{d-1} bidirectional links

Physical layout?

  • 2D in O(N) space

Short wires

  • higher dimension?

(p. 771)

SLIDE 29

Real World 2D mesh

(see Example 10.1) 1824-node Paragon: 16 × 114 array
Each cabinet: 4 (wide) × 16 (high) = 64 nodes
Communication between cabinets (= bisection bandwidth) = 16

(p. 771)

SLIDE 30

Embeddings in two dimensions


Embed multiple logical dimensions in one physical dimension using long wires

6 x 3 x 2

(p. 772)

(4 dimensions) d=4, k=3; laid out in 2D

SLIDE 31

10.4.4 Trees

Diameter and avg. distance are logarithmic

  • k-ary tree, height d = log_k N
  • address specified by a d-vector of radix-k coordinates describing the path down from the root (0111?)

[Figure: binary tree with d=4, k=2; nodes A and B with common ancestor X]

Fixed degree (in this case = 3)
See direct vs indirect
Route up to the common ancestor and down (no need to go all the way to the root)

  • (A -> B) R = B xor A (ex: 0001 xor 0110 = 0111)
  • let i be the position of the most significant 1 in R; route up i+1 levels (here, up 2+1)
  • down in the direction given by the low i+1 bits of B (001)

H-tree space is O(N) with O(√N) long wires
Bisection BW? (circulatory-system analogy: fat trees)

(p. 772-773)
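The XOR rule above can be sketched in a few lines (illustrative helper; addresses are d-bit paths down from the root, as on the slide):

```python
# Binary-tree routing: climb past the highest differing address bit,
# then descend following the low bits of the destination B.

def tree_route(A, B, d):
    R = A ^ B
    if R == 0:
        return (0, [])                  # same node: no hops
    i = R.bit_length() - 1              # position of most significant 1
    up = i + 1                          # levels to climb to the ancestor
    down = [(B >> j) & 1 for j in range(i, -1, -1)]  # low i+1 bits of B
    return (up, down)

# Slide example: B = 0b0001, A = 0b0110 -> R = 0b0111:
# climb 2+1 levels, descend by bits 001
print(tree_route(0b0110, 0b0001, 4))    # (3, [0, 0, 1])
```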

SLIDE 32

Fat-Trees

[Figure: fat tree]

Fatter links (really more of them) as you go up, so bisection BW scales with N

(p. 774)

SLIDE 33

10.4.5 Butterflies

[Figure: 16-node butterfly (levels 1-4) built from 2x2 building blocks]

Example routes: A -> B with A=0010, B=0110; C -> B with C=1011

Tree with lots of roots!
N = 2^d host nodes and d·2^{d-1} switch nodes
Indirect: hosts deliver packets into level 0 and receive packets from level d
log N levels (actually N/2 × log N switches)
Exactly one route from any source to any dest
R = A xor B; at level i use the 'straight' edge if r_i = 0, otherwise the cross edge
Bisection N/2 vs N^{(d-1)/d} (mesh)

(p. 774-775)
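The single-route property makes the butterfly routing rule trivial to sketch (illustrative helper; I examine R's bits most-significant-first across the d levels):

```python
# Butterfly routing: R = A xor B picks, at each level, the 'straight'
# edge when the corresponding bit of R is 0 and the cross edge otherwise.

def butterfly_route(A, B, d):
    R = A ^ B
    return ["cross" if (R >> (d - 1 - lvl)) & 1 else "straight"
            for lvl in range(d)]

# Slide example: A = 0b0010, B = 0b0110 in a 16-node (d = 4) butterfly
print(butterfly_route(0b0010, 0b0110, 4))
# ['straight', 'cross', 'straight', 'straight']
```

Because there is exactly one route per (source, dest) pair, two simultaneous messages that share any link necessarily conflict — the point made on the next slide.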

SLIDE 34

k-ary d-cubes vs d-ary k-flies

Tree or Mesh vs Fly, both of degree d:

  • Cost: N switches vs N log N switches
  • BW: diminishing BW per node vs constant
  • Requires locality vs little benefit to locality

Can you route all permutations?

  • conflict when two simultaneous messages use some common link
  • reliability: there is only one route between 2 nodes

(p. 776)

SLIDE 35

Benes network and Fat Tree

[Figure: 16-node Benes network (unidirectional); 16-node 2-ary fat tree (bidirectional), with candidate midpoints C and D]

Back-to-back butterfly can route all permutations

  • off line

What if you just pick a random midpoint? (example: from A to B, choose C or D as the intermediate node)
Elegant form but little used (hard to compute the intermediate node)

(p. 776-777)

SLIDE 36

10.4.6 Hypercubes

Also called binary n-cubes. Number of nodes N = 2^n
O(log N) hops
Good bisection BW
Complexity

  • out-degree is n = log N
  • correct dimensions in order


  • with random comm. 2 ports per processor

Scaling problem

  • the switch is always sized for the maximum degree

[Figure: hypercubes of dimension 0-D through 5-D]

(p. 778)
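"Correct dimensions in order" is e-cube routing; a sketch (illustrative helper): flip the differing address bits from the low end, one dimension at a time.

```python
# e-cube (dimension-order) routing on a binary n-cube.

def ecube_route(A, B):
    path = [A]
    R = A ^ B
    bit = 0
    while R:
        if R & 1:
            A ^= (1 << bit)     # correct this dimension
            path.append(A)
        R >>= 1
        bit += 1
    return path

# 4-cube: 0b0000 -> 0b1011 takes 3 hops, one per differing bit
print(ecube_route(0b0000, 0b1011))  # [0, 1, 3, 11]
```

The number of hops equals the number of 1 bits in A xor B, at most n = log N.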

SLIDE 37

Relationship of Butterflies to Hypercubes


Wiring is isomorphic
Except that the butterfly always takes log N steps

(p. 778)

SLIDE 38

Properties of Some Topologies

Topology       Degree     Diameter       Ave Dist      Bisection   D (D ave) @ P=1024
1D Array       2          N-1            N/3           1           huge
1D Ring        2          N/2            N/4           2
2D Mesh        4          2(N^1/2 - 1)   2/3 N^1/2     N^1/2       63 (21)
2D Torus       4          N^1/2          1/2 N^1/2     2 N^1/2     32 (16)
k-ary n-cube   2n         nk/2           nk/4          nk/4        15 (7.5) @ n=3
Hypercube      n = log N  n              n/2           N/2         10 (5)

All have some “bad permutations”

  • many popular permutations are very bad for meshes (transpose)
  • randomness in wiring or routing makes it hard to find a bad one!

(p. ???)

SLIDE 39

Real Machines


Wide links, smaller routing delay
Tremendous variation

(p. 781)

SLIDE 40

How Many Dimensions in Network?

n = 2 or n = 3

  • Short wires, easy to build
  • Many hops, low bisection bandwidth
  • Requires traffic locality

n >= 4

  • Harder to build, more wires, longer average length

  • Fewer hops, better bisection bandwidth
  • Can handle non-local traffic

k-ary d-cubes provide a consistent framework for comparison

  • N = k^d
  • scale dimension (d) or nodes per dimension (k)
  • assume cut-through

(p. ??? 780?)

SLIDE 41

Traditional Scaling: Latency(P)

[Figure: Ave Latency T(n=40) and T(n=140) vs Machine Size (N), for d=2, d=3, d=4, and k=2 n/w]

Assumes equal channel width

  • independent of node count or dimension
  • dominated by average distance


(p. 780)

SLIDE 42

Average Distance

[Figure: Average Distance vs Dimension for N = 256, 1024, 16384, 1048576]

  • Avg. distance = d (k-1)/2


but, equal channel width is not equal cost! Higher dimension => more channels


(p. ????)

SLIDE 43

In the 3-D world

For n nodes, bisection area is O(n^2/3)

For large n, bisection bandwidth is limited to O(n^2/3)

  • Dally, IEEE TPDS, [Dal90a]
  • For fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better)

  • i.e., a few short fat wires are better than many long thin wires
  • What about many long fat wires?

(p. ???)

SLIDE 44

Equal cost in k-ary n-cubes

Equal number of nodes? Equal number of pins/wires? Equal bisection bandwidth? Equal area? Equal wire length?

What do we know?

  • switch degree: d
  • diameter = d(k-1)
  • total links = Nd
  • pins per node = 2wd
  • bisection = k^{d-1} = N/k links in each direction
  • 2Nw/k wires cross the middle

(p. 782???)
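The "what do we know" relations can be tabulated for any (k, d, w) to compare design points (illustrative helper, computing exactly the quantities listed above):

```python
# Cost/performance quantities of a k-ary d-cube with N = k**d nodes
# and channel width w, per the relations on the slide.

def kary_dcube_costs(k, d, w):
    N = k ** d
    return {
        "nodes": N,
        "switch_degree": d,
        "diameter": d * (k - 1),
        "total_links": N * d,
        "pins_per_node": 2 * w * d,
        "bisection_links": N // k,           # k**(d-1) in each direction
        "bisection_wires": 2 * N * w // k,   # wires crossing the middle
    }

print(kary_dcube_costs(k=4, d=3, w=16))
```

Holding one row of this dictionary constant (pins, bisection wires, ...) while varying d is exactly the comparison carried out on the following slides.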

SLIDE 45

Latency(d) for P with Equal Width

[Figure: Average Latency (n=40, ∆=2) vs Dimension for N = 256, 1024, 16384, 1048576]

total links(N) = Nd


(p. ?????)

SLIDE 46

Latency with Equal Pin Count

[Figure: Ave Latency T(n=40B) and T(n=140B) vs Dimension (d), for 256, 1024, 16k, and 1M nodes]

Baseline d=2 has w = 32 (128 wires per node)
Fix 2dw pins => w(d) = 64/d
Distance goes up with d, but channel time goes down

(p. 782-783)

SLIDE 47

Latency with Equal Bisection Width

N-node hypercube has N bisection links
2-d torus has 2N^1/2
Fixed bisection => w(d) = N^{1/d}/2 = k/2

[Figure: Ave Latency T(n=40) vs Dimension (d), for 256, 1024, 16k, and 1M nodes]

1 M nodes, d=2 has w=512!


(p. 782-784 Fig 10.15)

SLIDE 48

Larger Routing Delay (w/ equal pin)

[Figure: Ave Latency T(n=140B) vs Dimension (d), for 256, 1024, 16k, and 1M nodes]

Dally’s conclusions strongly influenced by assumption of small routing delay


(p. 784 Fig. 10.16)

SLIDE 49

Latency under Contention

[Figure: Latency vs Channel Utilization for combinations of message size and topology: n = 40, 16, 8, 4 with (d=2, k=32) and (d=3, k=10)]

Optimal packet size? Channel utilization?


(p. 786 Fig. 10.17)

SLIDE 50

Saturation

[Figure: Latency vs Ave Channel Utilization for n/w = 40, 16, 8, 4]

Fatter links shorten queuing delays


(p. ??)

SLIDE 51

Phits per cycle

[Figure: Latency vs Flits per cycle per processor, for (n8, d3, k10) and (n8, d2, k32)]

Higher degree network has larger available bandwidth

  • cost?


(p. 787-789 Fig 10.18)

SLIDE 52

Topology Summary

Rich set of topological alternatives with deep relationships
Design point depends heavily on cost model

  • nodes, pins, area, ...
  • Wire length or wire delay metrics favor small dimension
  • Long (pipelined) links increase optimal dimension

Need a consistent framework and analysis to separate opinion from design
Optimal point changes with technology

(p. ???)

SLIDE 53

Outline

Introduction
10.1-2 Basic concepts, definitions, performance perspective
10.3 Organizational structure
10.4 Topologies
10.6 Routing and switch design


SLIDE 54

Routing and Switch Design

Routing
Switch Design
Flow Control
Case Studies


SLIDE 55

Routing

Recall: routing algorithm determines

  • which of the possible paths are used as routes
  • how the route is determined
  • R: N x N -> C, which at each switch maps the destination node nd to the next channel on the route

Issues:


  • Routing mechanism

– arithmetic
– source-based port select
– table driven
– general computation

  • Properties of the routes
  • Deadlock free

(p. 789)

SLIDE 56

Routing Mechanism

need to select output port for each input packet

  • in a few cycles

Simple arithmetic in regular topologies

  • ex: ∆x, ∆y routing in a grid

– west (-x)      ∆x < 0
– east (+x)      ∆x > 0
– south (-y)     ∆x = 0, ∆y < 0
– north (+y)     ∆x = 0, ∆y > 0
– processor      ∆x = 0, ∆y = 0

Reduce relative address of each dimension in order

  • Dimension-order routing in k-ary d-cubes
  • e-cube routing in n-cube (first 1 in the relative distance – XOR)

(p. 789)
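The ∆x, ∆y port-select table above is a few-cycle decision; a direct sketch (illustrative helper, x reduced before y as in dimension-order routing):

```python
# Arithmetic routing mechanism for a 2D grid: pick the output port
# from the remaining relative address (dx, dy).

def select_port(dx, dy):
    if dx < 0: return "west"
    if dx > 0: return "east"
    if dy < 0: return "south"
    if dy > 0: return "north"
    return "processor"

# Routing from (1, 3) to (4, 1): go east until dx = 0, then south
print(select_port(3, -2))   # east
print(select_port(0, -2))   # south
print(select_port(0, 0))    # processor
```

Each switch decrements the relative address as the packet moves, so the same test runs at every hop.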

SLIDE 57

Routing Mechanism (cont)

Source-based

  • message header carries series of port selects
  • used and stripped en route
  • CRC? Packet Format?

[Figure: source-based routing — the route header carries port selects P0 P1 P2 P3, stripped switch by switch]


  • CS-2, Myrinet, MIT Artic

Table-driven

  • message header carries an index for the next port at the next switch (an index into a table at each switch)

– output = R[i]

  • table also gives the index for the following hop

– (output, i') = R[i]

  • ATM, HPPI

(p. 790)

SLIDE 58

Properties of Routing Algorithms

Deterministic

  • route determined by (source, dest), not intermediate state (i.e. traffic)

Adaptive

  • route influenced by traffic along the way
  • (can be implemented table-driven or source-based)

Minimal


  • only selects shortest paths

Deadlock free

  • no traffic pattern can lead to a situation where no packets move forward

(p. 790-791)

SLIDE 59

Deadlock Freedom

How can it arise?

  • necessary conditions:

– shared resource
– incrementally allocated
– non-preemptible

  • think of a channel as a shared resource that is acquired incrementally

– source buffer then dest. buffer
– channels along a route

How do you avoid it?

  • constrain how channel resources are allocated
  • ex: dimension order

How do you prove that a routing algorithm is deadlock free?

(p. 791-792)

SLIDE 60

Proof Technique

Resources are logically associated with channels
Messages introduce dependences between resources as they move forward
Need to articulate possible dependences between channels
Show that there are no cycles in the Channel Dependence Graph

  • find a numbering of channel resources such that every legal route follows a monotonic sequence

=> no traffic pattern can lead to deadlock
The network need not be acyclic, only the channel dependence graph

(p. 793)

SLIDE 61

Example: k-ary 2D array

Theorem: x,y routing is deadlock free

Numbering

  • +x channel (i,y) -> (i+1,y) gets i
  • similarly for -x with 0 as most positive edge
  • +y channel (x,j) -> (x,j+1) gets N+j
  • similarly for -y channels

Any routing sequence: x direction, turn, y direction is increasing

[Figure: 4x4 array (nodes 00-33) with channels numbered by this scheme, e.g. +x channels 1, 2, 3 and +y channels 16-19]
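The numbering argument can be checked mechanically (illustrative helper; only the +x-then-+y case is shown, the other sign combinations are symmetric):

```python
# On a k x k grid (N = k*k), give +x channel (i,y)->(i+1,y) number i
# and +y channel (x,j)->(x,j+1) number N+j. An x-then-y route then
# crosses channels in strictly increasing order: no CDG cycle.

def xy_route_channel_numbers(src, dst, k):
    (sx, sy), (dx_, dy_) = src, dst
    N = k * k
    nums = []
    for i in range(sx, dx_):        # +x hops (assumes dx_ >= sx)
        nums.append(i)
    for j in range(sy, dy_):        # then +y hops (assumes dy_ >= sy)
        nums.append(N + j)
    return nums

nums = xy_route_channel_numbers((0, 1), (2, 3), k=4)
print(nums)                          # [0, 1, 17, 18]
print(nums == sorted(nums))          # True: monotonic
```

The 17, 18 in the example are the same +y channel numbers that appear in the figure.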

SLIDE 62

Channel Dependence Graph

[Figure: channel dependence graph for the numbered 4x4 array — every edge goes from a lower-numbered channel to a higher-numbered one, so there are no cycles]

SLIDE 63

More examples

Why is the obvious routing on X deadlock free?

  • butterfly?
  • tree?
  • fat tree?

Any assumptions about the routing mechanism? Amount of buffering?
What about wormhole routing on a ring?

[Figure: ring with nodes numbered 1-7]

SLIDE 64

Deadlock free wormhole networks?

Basic dimension-order routing doesn’t work for k-ary d-cubes

  • only for k-ary d-arrays (bi-directional)

Idea: add channels!

  • provide multiple “virtual channels” to break dependence cycle
  • good for BW too!


  • Don’t need to add links, or xbar, only buffer resources

This adds nodes to the CDG; does it remove edges?

[Figure: switch with virtual-channel input buffers multiplexed through the cross-bar to the output ports]

SLIDE 65

Breaking deadlock with virtual channels


Packet switches from lo to hi channel

SLIDE 66

Up*-Down* routing

Given any bidirectional network
Construct a spanning tree
Number the nodes, increasing from leaves to root
UP edges increase node numbers
Any Source -> Dest by an UP*-DOWN* route

  • up edges, single turn, down edges

Performance?

  • Some numberings and routes much better than others
  • interacts with topology in strange ways
SLIDE 67

Turn Restrictions in X,Y

[Figure: the eight possible turns among +X, -X, +Y, -Y]

XY routing forbids 4 of the 8 turns and leaves no room for adaptive routing
Can you allow more turns and still be deadlock free?
SLIDE 68

Minimal turn restrictions in 2D

West-first
North-last
Negative-first

[Figure: the turns allowed by the west-first, north-last, and negative-first restrictions in 2D]
SLIDE 69

Example legal west-first routes


Can route around failures or congestion
Can combine turn restrictions with virtual channels

SLIDE 70

Adaptive Routing

R: C x N x Σ -> C
Essential for fault tolerance

  • at least multipath

Can improve utilization of the network Simple deterministic algorithms easily run into bad permutations


Fully/partially adaptive, minimal/non-minimal
Can introduce complexity or anomalies
Little adaptation goes a long way!

SLIDE 71

Switch Design

[Figure: switch — input ports (receiver, input buffer) connected through a cross-bar to output ports (output buffer, transmitter); control logic performs routing and scheduling]

SLIDE 72

How do you build a crossbar

[Figure: two crossbar implementations — a multiplexer-based crossbar connecting inputs I0-I3 to outputs O0-O3, and a RAM-based design (RAM phase) with Din/Dout and an address selecting among I0-I3]

SLIDE 73

Input buffered switch

[Figure: input-buffered switch — input ports with per-input routing logic R0-R3 feeding a cross-bar to the output ports]

Independent routing logic per input

  • FSM

Scheduler logic arbitrates each output

  • priority, FIFO, random

Head-of-line blocking problem


SLIDE 74

Output Buffered Switch

[Figure: output-buffered switch — input ports feeding buffered output ports R0-R3, plus control ports]

How would you build a shared pool?

SLIDE 75

Example: IBM SP vulcan switch

[Figure: Vulcan switch datapath — 8-bit input ports (FIFO, CRC check, route control, flow control, deserializer to 64 bits), a central queue (64x128 RAM with input/output arbiters), an 8x8 crossbar with XBar arbiter, and 8-bit output ports (FIFO, CRC gen, flow control, serializer)]

Many gigabit ethernet switches use a similar design without the cut-through

SLIDE 76

Output scheduling

[Figure: output scheduling — input buffers with routing logic R0–R3 requesting cross-bar output ports O0–O2]

n independent arbitration problems?

  • static priority, random, round-robin

Simplifications due to routing algorithm?
General case is max bipartite matching

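The bipartite-matching view can be sketched in a few lines (an illustration, not a hardware arbiter): `requests[i]` lists the outputs input i wants, and a single greedy pass yields a maximal — not necessarily maximum — matching, roughly what one iteration of a simple arbiter achieves.

```python
def greedy_match(requests):
    """One greedy pass: each input takes its first still-free requested output."""
    grant = {}                      # output -> winning input
    for i, wants in enumerate(requests):
        for o in wants:
            if o not in grant:
                grant[o] = i
                break
    return grant
```

The gap between maximal and maximum shows in `greedy_match([[0, 1], [0]])`: greedy matches only input 0, though pairing input 0 with output 1 would let both inputs proceed — which is why iterative schemes re-arbitrate over several rounds.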

SLIDE 77

Stacked Dimension Switches

Dimension order on 3D cube?
Cube-connected cycles?

[Figure: stacked dimension switches — Host In/Out plus a 2×2 switch per dimension (Xin/Xout, Yin/Yout, Zin/Zout)]

SLIDE 78

Flow Control

What do you do when push comes to shove?

  • ethernet: collision detection and retry after delay
  • FDDI, token ring: arbitration token
  • TCP/WAN: buffer, drop, adjust rate
  • any solution must adjust to output rate

Link-level flow control

[Figure: link-level flow control — Data and Ready signals exchanged between adjacent switches]

SLIDE 79

Examples

Short links

[Figure: short-link handshake — Source and Destination exchange Data, Req and Ready/Ack signals, with full/empty (F/E) state bits at each end]

Long links

  • several flits on the wire
SLIDE 80

Smoothing the flow

[Figure: watermark buffering — incoming phits fill a buffer marked Empty / Low Mark / High Mark / Full; Stop/Go flow-control symbols throttle the sender while outgoing phits drain]

How much slack do you need to maximize bandwidth?
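A back-of-the-envelope answer to the slack question, under a simple assumed model (function name hypothetical): a Stop symbol sent at the high mark takes one link flight time to reach the sender, and phits already launched keep arriving for a full round trip — so with Stop/Go flow control the buffer needs roughly a round trip of headroom above the high mark plus a round trip of drain below the low mark.

```python
def min_buffer_phits(link_latency_cycles, phits_per_cycle=1):
    """Rough minimum buffering for full bandwidth under Stop/Go (on/off)
    flow control.  Assumes symmetric link latency and steady injection."""
    round_trip = 2 * link_latency_cycles * phits_per_cycle
    # headroom above the high mark + drain below the low mark
    return 2 * round_trip
```

The point of the sketch is the scaling, not the constant: required slack grows linearly with link latency × bandwidth, which is why long links need deeper buffers (or credit-based schemes) to avoid throttling.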

SLIDE 81

Link vs global flow control

  • hot spots
  • global communication operations
  • natural parallel program dependences

SLIDE 82

Example: T3D

[Figure: T3D packet formats — each packet carries Route Tag, Dest PE and a Command header (packet type 3 bits, req/resp 1 bit, command 8 bits), plus Addr 0, Addr 1, Src PE and data words as needed. Packet types: Read Req (no cache / cache / prefetch / fetch&inc), Read Resp (optionally cached), Write Req (proc / BLT 1 / fetch&inc / proc 4 / BLT 4), Write Resp, BLT Read Req]

  • 3D bidirectional torus, dimension-order routing (NIC selected), virtual cut-through, packet switching
  • 16 bit × 150 MHz, short, wide, synchronous links
  • rotating priority per output
  • logically separate request/response
  • 3 independent, stacked switches
  • 8 16-bit flits on each of 4 virtual channels in each direction

SLIDE 83

Example: SP

[Figure: multi-rack SP configuration — 16-node rack with a switch board; host ports P0–P15 connect nodes intra-rack, external switch ports E0–E15 connect racks]

  • 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
  • packet switching, cut-through, no virtual channels, source-based routing
  • variable packets <= 255 bytes; 31-byte FIFO per input, 7 bytes per output; 16-phit links
  • 128 8-byte 'chunks' in central queue, LRU per output
  • run in shadow mode


SLIDE 84

Routing and Switch Design Summary

Routing Algorithms restrict the set of routes within the topology

  • simple mechanism selects turn at each hop
  • arithmetic, selection, lookup

Deadlock-free if channel dependence graph is acyclic

  • limit turns to eliminate dependences
  • add separate channel resources to break dependences

  • combination of topology, algorithm, and switch design

Deterministic vs adaptive routing

Switch design issues

  • input/output/pooled buffering, routing logic, selection logic

Flow control

Real networks are a 'package' of design choices
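The acyclicity condition in the summary is easy to mechanize. A sketch (input format is an assumption): `deps` maps each channel to the channels a packet may request next while still holding it; a depth-first search with three colors detects any cycle in this channel dependence graph.

```python
def has_cycle(deps):
    """True iff the channel dependence graph contains a cycle
    (i.e. the routing algorithm is NOT provably deadlock-free)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def dfs(c):
        color[c] = GREY                     # on the current DFS path
        for nxt in deps.get(c, ()):
            s = color.get(nxt, WHITE)
            if s == GREY:                   # back edge -> cycle
                return True
            if s == WHITE and dfs(nxt):
                return True
        color[c] = BLACK                    # fully explored
        return False
    return any(color.get(c, WHITE) == WHITE and dfs(c) for c in deps)
```

An unrestricted ring of channels has a cycle; adding turn restrictions or extra virtual channels is exactly the act of deleting edges (or splitting nodes) until this test passes.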

SLIDE 85

Metrics for mesh and torus

                                     Diameter         Avg. distance    Bisection BW
  Linear array                       N-1              2N/3             1
  Torus (ring)                       N-1 ; N/2        N/2 ; N/3        1 ; 2
  d-dim. array                       Σ(ki - 1)        (2/3)Σki         (Πki)/kimax
  d-dim. k-ary mesh                  d(k-1)           (2/3)dk          k^(d-1)
  d-dim. k-ary torus (k-ary d-cube)  d(k-1) ; dk/2    dk/2 ; dk/3      k^(d-1) ; 2k^(d-1)

Note: Σ and Π run over i = 1 to d. Where two values are separated by a semicolon, the first is for unidirectional links and the second for bidirectional links.
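The k-ary rows of the table above translate directly into formulas. A sketch following the table's entries for bidirectional links (function names hypothetical; the torus diameter assumes even k, where dk/2 is exact):

```python
def mesh_diameter(k, d):   return d * (k - 1)        # d-dim. k-ary mesh
def mesh_avg(k, d):        return (2 / 3) * d * k    # per the table's convention
def mesh_bisection(k, d):  return k ** (d - 1)

def torus_diameter(k, d):  return d * (k // 2)       # k-ary d-cube, bidirectional
def torus_avg(k, d):       return d * k / 3
def torus_bisection(k, d): return 2 * k ** (d - 1)   # twice the mesh: wraparound links
```

For example, a 3D torus with k = 4 (64 nodes) has diameter 6 versus 9 for the corresponding mesh, and double the bisection bandwidth — the usual argument for paying for wraparound links.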