Robustness of Interconnection Networks - 3rd JLESC Summer School
SLIDE 1

Robustness of Interconnection Networks

3rd JLESC Summer School

Atsushi Hori

RIKEN AICS

Tuesday, June 28, 2016

SLIDE 2

3rd JLESC SS@Lyon 2016

Self Introduction

Atsushi Hori - System Software Researcher

The oldest and largest governmental research institute in Japan, since 1917

Advanced Institute for Computational Science (AICS), since 2010

Running the K computer, the largest super computer in Japan (ranked 5th in Top500, Jun., 2016) involved in the Flagship2020 project to develop the post-K computer

SLIDE 3


Self Introduction

SLIDE 4


DISCLAIMER: The contents of this talk are based on my personal experiences and are independent from the Flagship2020 project.

Self Introduction

SLIDE 5


The colored slides are supplements

Self Introduction

SLIDE 6

Self Introduction


The venue of the next JLESC, in Dec., Kobe

SLIDE 7

HPC Network

  • Low latency and high bandwidth
    • higher performance than silicon disks
  • High bisection bandwidth
    • low congestion probability (hopefully)
  • Very reliable
    • no errors, no packet loss
  • Dense (fits in a computer room)
    • whereas the Internet covers the whole earth
  • Packet switching
    • no circuit switching (as in the old telephone network)

SLIDE 8

Outline

  • Network Basics
  • Topology
  • Routing
  • Implementation
  • Fault Resilience

+ my personal opinion

SLIDE 9

Glossary

  • A network consists of:
    • Nodes: where packets are sent and received; may include a switch (see below)
    • Switches (routers)
    • Links: connecting nodes and switches
  • Data transfer:
    • Packet: a unit of transfer
    • Message: consists of multiple packets

SLIDE 10

Topology

SLIDE 11

Network Topologies (1)

[Figure: example topologies built from nodes, links, and switches: Mesh, Torus, FatTree, and "SkinnyTree"]

SLIDE 12

Network Topology in Top500

  • Topologies in Top500 (http://www.top500.org)
    • Torus/Mesh: BG/Q, the K computer (Tofu)
    • FatTree: InfiniBand, Aries, Cray Gemini, Tianhe
    • SkinnyTree: Ethernet
    • Misc.: IBM Power 775

[Figure: topology vs. rank (50-500) in the Top500 list as of Nov. 2015: FatTree, Torus/Mesh, SkinnyTree, Misc.]

SLIDE 13

Network Topologies (2)

  • Hypercube: CM-2, nCUBE in the 90s
  • Dragonfly: Cray XC series
  • and many others (ring, star, butterfly, to name a few)

[Figure: Hypercube and Dragonfly topologies, showing nodes, links, and switches]

SLIDE 14

Routing

SLIDE 15

Routing

  • Find a path from a sender node to a receiver node
  • Ex) X-Y (dimension-order) routing in a 2D mesh: a packet first travels along the X dimension, then along the Y dimension

[Figure: the X-Y routing path from node Ni to node Nj in a 2D mesh]
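Dimension-order routing is simple enough to sketch. The following is a minimal illustration (the coordinate convention and function name are mine, not from the slides): it returns the sequence of nodes an X-Y-routed packet visits in a 2D mesh, correcting the X coordinate first and then the Y coordinate.

```python
def xy_route(src, dst):
    """Nodes a packet visits under X-Y (dimension-order) routing in
    a 2D mesh: fix the X coordinate first, then the Y coordinate."""
    x, y = src
    path = [(x, y)]
    step = 1 if dst[0] > x else -1
    while x != dst[0]:          # travel along the X dimension first
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:          # then along the Y dimension
        y += step
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet for a given source-destination pair takes this same path, X-Y routing is a static (deterministic) routing algorithm, and it is deadlock free on a mesh.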

SLIDE 16

Deadlock

  • A routing algorithm on a network topology must be deadlock free
  • A cyclic path can cause deadlock
  • Deadlock can be avoided by having a bypass
    • Virtual channels

[Figure: a link between two switches with 1 channel vs. 2 (virtual) channels]
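Why a cyclic path can deadlock, and how virtual channels break the cycle, can be shown with a toy channel-dependency graph (my own construction, not from the slides): in a unidirectional ring with one channel per link, every channel waits on the next one around the ring, forming a cycle; adding a second virtual channel with a "dateline" removes the cycle.

```python
def has_cycle(edges):
    """Detect a cycle in a channel-dependency graph via DFS."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    def dfs(v):
        color[v] = GRAY
        for w in graph[v]:
            if color[w] == GRAY or (color[w] == WHITE and dfs(w)):
                return True
        color[v] = BLACK
        return False
    return any(color[v] == WHITE and dfs(v) for v in graph)

N = 4  # switches in a unidirectional ring; a channel is (vc, link)

# One channel per link: a packet holding channel i waits for channel i+1,
# all the way around the ring -- a cyclic dependency.
one_vc = [((0, i), (0, (i + 1) % N)) for i in range(N)]

# Two virtual channels with a "dateline": a packet moves from VC0 to VC1
# when it crosses the link from switch N-1 back to switch 0, so no
# dependency chain ever returns to its starting channel.
two_vc = [((0, i), (0, i + 1)) for i in range(N - 1)]    # stay on VC0
two_vc += [((0, N - 1), (1, 0))]                         # cross the dateline
two_vc += [((1, i), (1, i + 1)) for i in range(N - 1)]   # stay on VC1

print(has_cycle(one_vc))   # True  -> deadlock is possible
print(has_cycle(two_vc))   # False -> deadlock free
```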

SLIDE 17

Deadlock


SLIDE 18

Hot Spot

  • Packet congestion happens at hot spots
  • 2D mesh: hot spot at the center
  • 2D torus: no hot spots

[Figure: the hot-spot node at the center of a 2D mesh]
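The hot spot at the center of a mesh can be made concrete with a small traffic count (illustrative code of my own, not from the slides): under uniform all-to-all traffic with X-Y routing, far more paths pass through the central node than through a corner node.

```python
from collections import Counter

def xy_nodes(src, dst):
    # nodes visited by X-Y routing: fix X first, then Y
    (sx, sy), (dx, dy) = src, dst
    xs = range(sx, dx + (1 if dx >= sx else -1), 1 if dx >= sx else -1)
    path = [(x, sy) for x in xs]
    ys = range(sy, dy + (1 if dy >= sy else -1), 1 if dy >= sy else -1)
    path += [(dx, y) for y in ys][1:]
    return path

def node_load(k):
    # uniform all-to-all traffic on a k x k mesh: count how many
    # X-Y routed paths pass through each node (endpoints included)
    cnt = Counter()
    for sx in range(k):
        for sy in range(k):
            for dx in range(k):
                for dy in range(k):
                    if (sx, sy) != (dx, dy):
                        cnt.update(xy_nodes((sx, sy), (dx, dy)))
    return cnt

load = node_load(5)
print(load[(2, 2)], load[(0, 0)])  # the center carries far more traffic
```

On a torus the wraparound links restore symmetry, so no single node is favored in this way.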

SLIDE 19

Partitioning

  • Multiple jobs can run on a big machine
  • The node space is partitioned
  • Partitioning may change the topology seen by a job
  • Jobs may interfere with each other

[Figure: a node space partitioned among Jobs A-D; Jobs B, C, and D can interfere with the others, and partitioning turns a 2D torus into a 2D mesh]

SLIDE 20

Dynamic (Adaptive) Routing

  • Static routing
    • Once a path is fixed, packets always go along that path
  • Dynamic (adaptive) routing
    • Paths can change dynamically according to the state of the network
  • Issues
    • Algorithm: how, who, when?
    • Deadlock freedom
    • Route-changing latency & H/W resources
    • Stability (see next slide)
    • Packet order is not preserved (see the slide after next)

SLIDE 21

Oscillation in Adaptive Routing

Two roads lead to the same destination:

  1. One is very crowded
  2. The radio says the other is empty
  3. Everybody rushes onto the other road
  4. (repeat 1-3)

SLIDE 22

Packet Order

  • Adaptive routing cannot preserve packet ordering
  • This can be problematic when receiving large messages consisting of multiple packets

[Figure: with a fixed route, sending order = receiving order and packets P0-P7 land in Recvbuf 0 and Recvbuf 1 in sequence; with adaptive routing, sending order ≠ receiving order, so packets arrive shuffled and must be placed by sequence number]
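A common remedy, sketched here under an assumed packet format of (sequence number, payload), is to write each packet into the receive buffer at the offset given by its sequence number, so the message is reconstructed correctly regardless of arrival order.

```python
import random

def reassemble(packets, payload_len):
    """Place each (seq, payload) packet into the receive buffer at
    offset seq * payload_len, so the message is correct even if the
    network delivered the packets out of order."""
    buf = bytearray(payload_len * len(packets))
    for seq, payload in packets:
        buf[seq * payload_len:(seq + 1) * payload_len] = payload
    return bytes(buf)

message = b"abcdefgh"
packets = [(i, message[2 * i:2 * i + 2]) for i in range(4)]  # 2-byte payloads
random.shuffle(packets)        # adaptive routing may reorder delivery
print(reassemble(packets, 2))  # b'abcdefgh' regardless of arrival order
```

The price is that the receiver must buffer and track sequence numbers per message, which is exactly the hardware/software overhead the slide alludes to.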

SLIDE 23

Metrics

  • Topology
    • Network diameter
    • High-radix or low-radix (the higher the radix, the smaller the network diameter)
  • Performance
    • Whole network: bisection bandwidth
    • P2P: bandwidth, latency, and hop count
    • Collective operations (barrier, and so on): latency

SLIDE 24

Implementation

SLIDE 25

Installation of the K Computer

SLIDE 26

Direct/Indirect Network

  • Direct or indirect network
    • Direct network: every node has a switch inside
    • Indirect network: nodes have no switch

Machine/Network   Direct/Indirect
the K (Tofu)      Direct
BG/Q              Direct
InfiniBand        Indirect
Ethernet          Indirect

Note: in many books, direct vs. indirect is categorized as an aspect of topology

SLIDE 27

Level off Cable Lengths

  • A naive implementation results in uneven cable lengths

SLIDE 28

Level off Cable Lengths

  • To level off cable lengths, alternate nodes are connected

SLIDE 29

Co-Design

  • Network Cost = ∑ C + ∑ S + ∑ L
    • C: network interface (card) of a node, S: switch, L: cable
  • Co-design
    • Analyze communication patterns of applications
    • Find protocols to maximize performance of possible applications, and
      • to minimize network cost
      • to minimize power consumption
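The cost formula can be instantiated for concrete topologies. The sketch below (the component counts are the textbook values for 2D meshes and tori; the per-component prices are arbitrary placeholders) compares a k x k mesh against a k x k torus, whose wraparound links add 2k cables.

```python
def network_cost(n_nics, n_switches, n_links, c, s, l):
    # Network Cost = sum(C) + sum(S) + sum(L)
    return n_nics * c + n_switches * s + n_links * l

def mesh_2d_cost(k, c, s, l):
    # k x k 2D mesh (direct network): one NIC and one switch per
    # node, and 2*k*(k-1) links between neighboring nodes
    n = k * k
    return network_cost(n, n, 2 * k * (k - 1), c, s, l)

def torus_2d_cost(k, c, s, l):
    # k x k 2D torus: the wraparound adds 2*k links to the mesh
    n = k * k
    return network_cost(n, n, 2 * k * k, c, s, l)

print(mesh_2d_cost(8, c=100, s=200, l=10))   # cost units are arbitrary
print(torus_2d_cost(8, c=100, s=200, l=10))
```

The co-design question is then which communication patterns justify the extra 2k cables (or a costlier topology altogether) for the target applications.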

SLIDE 30

Fault Resilience

SLIDE 31

Fault Resilience

  • The system and/or jobs can survive a network component failure
  • Possible failure points
    • Link
    • Switch
    • Node

SLIDE 32

Link or Switch Failure

  • Static routing
    • Somebody changes the routing info. to bypass the failed part(s)
  • Dynamic routing
    • If a failure can be detected, the failed part(s) can be bypassed automatically
  • Needless to say, it must remain deadlock free

SLIDE 33

Node Failure

  • If the application has
    • Dynamic load balancing: the job stops using the failed node and rebalances the load
    • Static load balancing (ex: stencil computation): hard to rebalance the load => spare node

2D Jacobi iteration on a 2D array V(N,M): V'(i,j) = A * ( V(i-1,j) + V(i+1,j) + V(i,j-1) + V(i,j+1) )
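The Jacobi update above can be written directly (a minimal serial sketch; the grid size and boundary condition are illustrative). In a parallel run, each rank owns a block of V and exchanges halo rows/columns with its four neighbors every iteration, which is why this statically balanced pattern maps so naturally onto a Cartesian topology and is hard to rebalance after a node failure.

```python
def jacobi_step(V, A=0.25):
    """One 5-point Jacobi update over the interior of a 2D array;
    the boundary rows and columns are kept fixed."""
    N, M = len(V), len(V[0])
    Vn = [row[:] for row in V]
    for i in range(1, N - 1):
        for j in range(1, M - 1):
            Vn[i][j] = A * (V[i-1][j] + V[i+1][j] + V[i][j-1] + V[i][j+1])
    return Vn

# 4x4 grid with a "hot" left boundary; repeated steps relax the interior
V = [[1.0 if j == 0 else 0.0 for j in range(4)] for i in range(4)]
for _ in range(50):
    V = jacobi_step(V)
print(V[1][1])
```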

SLIDE 34

Spare Node Substitution

  • Assume switches and links are all healthy
  • A naive spare node substitution may result in a large number of packet collisions
  • The maximum latency depends on the number of collisions
  • Is there a way to avoid this situation?

[Figure: migrating the failed node F to a distant spare node S stretches the communication paths; the two migration examples shown have 4 and 5 possible collisions]

SLIDE 35

Spare Node

  • Pros
    • Easy to program
    • Balanced load
  • Cons
    • Lower node utilization
    • Additional packet collisions

[Figure: node grids illustrating 0D, 1D, and 2D sliding of ranks toward the spare nodes when node 21 fails]

SLIDE 36

Sliding Methods - Basic Idea

  • Sliding methods
    • 0D - the naive method
    • 1D - up to 3 (worst case)
    • 2D - up to 2
    • and so on
  • Hybrid sliding
    • Try the highest-degree method first
    • If that fails, try a lower-degree method

[Figure: node grids illustrating 0D, 1D, and 2D sliding of ranks toward the spare nodes when node 21 fails]
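The difference between 0D and 1D sliding can be sketched with a deliberately simplified 1D model (my own; the slides use a 2D layout of nodes 1-35 with node 21 failing): 0D maps the failed rank straight onto a distant spare, while 1D shifts every rank between the failure and the spare by one node, so each migrated rank stays adjacent to its former neighbors.

```python
def slide_0d(n_nodes, failed):
    # 0D (naive) sliding: map the failed node's rank straight onto
    # the spare node at the end, far from its old neighbors
    m = {r: r for r in range(n_nodes - 1)}
    m[failed] = n_nodes - 1
    return m

def slide_1d(n_nodes, failed):
    # 1D sliding: every rank from the failure up to the spare shifts
    # by one node, so each migrated rank lands next to its old spot
    return {r: (r if r < failed else r + 1) for r in range(n_nodes - 1)}

# 6 nodes (0-5), the last one a spare; node 2 fails
print(slide_0d(6, 2))  # {0: 0, 1: 1, 2: 5, 3: 3, 4: 4}
print(slide_1d(6, 2))  # {0: 0, 1: 1, 2: 3, 3: 4, 4: 5}
```

Under 0D sliding, rank 2's neighbors now talk across the whole row (hence the extra collisions); under 1D sliding, every rank is at most one node away from where its communication pattern expects it.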

SLIDE 37

Repair

SLIDE 38

The K and FX100

[Figure: comparison of the K Computer (2011) and the FX100 (2015). Key technologies (translated from the Japanese labels): doubled cache and strengthened superscalar execution; more out-of-order resources; improved branch prediction; a single-precision double-width mode; byte integer instructions; assistant cores that handle communication daemons; direct water cooling of CPUs, memory, and optical modules; rack-mount chassis; optical links between main units. The K: 1 system board = 4 CPUs and 4 ICCs (32 cores), 24 chassis (768 cores), 8 cores + ICC (Tofu network) per CPU. FX100: 3 CPUs (nodes) per system board, 4 system boards (384 cores), 18 chassis (6,912 cores), 32+2 cores + Tofu2 per CPU.]

http://accc.riken.jp/wp-content/uploads/2015/06/chiba.pdf

SLIDE 39

Fujitsu FX100

  • A chassis contains 12 nodes, plus switches and links
  • A Tofu unit consists of 12 nodes and is also a scheduling unit
  • Tofu 6D coordinates: "XYZabc" (a=2, b=3, c=2)
    • The "XYZ" coordinates give the location of a Tofu unit
    • The "abc" coordinates give the location inside a Tofu unit
  • 3 chassis compose 3 Tofu units

[Figure 14: connection topology in the main unit, showing the A, B, and C axes of a Tofu unit (user's view)]

White paper FUJITSU Supercomputer PRIMEHPC FX100 – Evolution to the Next Generation

https://www.fujitsu.com/global/Images/primehpc-fx100-hard-en.pdf

SLIDE 40

The K and FX100


The Tofu circuit is on the same CPU die; however, the Tofu circuit can keep running while the CPU cores are shut down and powered off.


SLIDE 41

Various Units in FX100

  • Various units in the Fujitsu FX100
    • Unit of network: Tofu (12 nodes)
    • Unit of scheduling: Tofu (so as not to interfere with other jobs)
    • Physical unit: chassis - a chassis spans 3 Tofu units
  • Replacement of a chassis
    • Affects 3 Tofu units (36 nodes, 1,152 cores)
    • Before the replacement, the jobs running on the 36 nodes must be aborted
    • During the replacement, the affected 36 nodes cannot accept jobs
    • Tofu is a direct network, so the replacement can affect the entire network because the XYZ connections for I/O are lost
  • These 36-Nf nodes are called the apparent failure (in this talk)

SLIDE 42

Repair Schedule

  • The entire system must stay in operation as much as possible
  • Replacement may cause more apparent failures as packaging density increases
  • Replacement cannot take place as soon as a failure happens
    • The remedy for apparent failures is getting harder
    • The more frequent the system service, the higher the running cost
  • So, repair is scheduled once a day, 2-3 times a week, once a week, and so on

The K's case: every morning, SEs replace the failed nodes

  1. Shut down the chassis
  2. Unplug the chassis
  3. Replace the failed motherboard
  4. Plug the chassis back in
  5. Reboot (the K's nodes are disk-less)
SLIDE 43

Repair Interval

  • The longer the repair interval, the larger the number of failed parts
  • K: one node failure every 1-2 days

[Figure: number of failed components and apparent failures over operation time, with repairs at each repair time; the averages rise as the repair interval grows]

SLIDE 44

Towards Exa-flops

  • Higher failure rate
    • larger number of components
    • the end of Moore's law is close
  • Longer time between repairs
    • to reduce running cost
    • denser packaging results in more apparent failures
    • larger impact on running jobs

➡ The system will always have one or more failed components

SLIDE 45

Network Resilience Towards Exa-scale and Beyond

My Personal Opinion

SLIDE 46

Failure will be Daily Life

  • The assumption of current HPC: failure happens unexpectedly and unusually
  • System design is based on particular rules and algorithms
  • Random failures break those rules and algorithms
  • Node MTBF is already less than a day
  • If failure happens daily, why don't we design HPC systems with failures in mind?

SLIDE 47

Failure Conscious Design

  • Failures
    • happen randomly
    • the number of combinations is factorial!
    • impossible to handle failures case by case
    • impossible to predict the performance degradation due to failures

SLIDE 48

Stencil and Cartesian Topology

  • The node failure problem in stencil computations, revisited
  • The communication pattern of stencil computation fits a Cartesian topology very well
  • When spare node substitutions take place, the fit is gone and performance degrades

[Figure: number of collisions (best/average/worst) vs. number of node failures]

5P-Stencil Communication Performance Degradation over the Number of Failed Nodes [7]

SLIDE 49

Topology and Protocol

  • Protocols for collective operations are optimized according to topology
  • If the conditions for H/W support are NOT met, a general protocol takes its place
  • Failures break those conditions

[Figure: slowdown vs. number of node failures for K-Barrier, K-Allreduce, BGQ-Barrier, and BGQ-Allreduce]

SLIDE 50

Regular topology turns into random topology as the number of failed links increases

[Figure: a (full) Dragonfly of nodes and switches, and the same network with 22 of 28 and 16 of 28 links remaining]
SLIDE 51

Regular topology turns into random topology as the number of failed links increases

[Figure: as on the previous slide, annotated: the departure from the full Dragonfly is a qualitative change, while further link loss is a quantitative change]

SLIDE 52

Randomness may be an answer

SLIDE 53

Randomness may be an answer

  • Can we rely on rules and algorithms that can be broken by failures?
  • Failures on regularity
    • Qualitative change: hard to imagine

SLIDE 54

Randomness may be an answer

  • What if we give up having such rules?
  • Failures on randomness
    • Quantitative change: easier to imagine

SLIDE 55

Randomness may be an answer

  • Let's start designing random systems from the beginning, and forget about failures in regular systems

➡ Random Topology ➡ Random Network

SLIDE 56

Random Topology (1)

  • Goal: make a low-latency topology for HPC networks - low diameter and low average path hops
  • Random topology is best [Koibuchi et al., ISCA 2012]
    • 100 times improvement

[Figure: "Good Point of Random Topology" - average shortest path length (hops) for (a) non-random shortcuts vs. (b) random shortcuts in a 1,024-node network, as a function of switch degree ≈ number of shortcuts]

Michihiro Koibuchi, http://research.nii.ac.jp/~koibuchi/pdf/hpca2013_slide.pdf
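The effect of random shortcuts is easy to reproduce (a small experiment of my own, not Koibuchi et al.'s code): start from a ring, add a few random shortcut links per node, and measure the average shortest-path hop count by BFS.

```python
import random
from collections import deque

def avg_hops(adj):
    # average shortest-path hops over all node pairs (BFS per source)
    total = pairs = 0
    for s in range(len(adj)):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def ring(n):
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]

def ring_with_shortcuts(n, k, seed=0):
    # ring + roughly k random shortcut links per node
    rng = random.Random(seed)
    adj = ring(n)
    for i in range(n):
        for _ in range(k):
            j = rng.randrange(n)
            if j != i:
                adj[i].add(j)
                adj[j].add(i)
    return adj

n = 256
print(avg_hops(ring(n)))                    # about 64 hops on a plain ring
print(avg_hops(ring_with_shortcuts(n, 2)))  # drops to a handful of hops
```

The regular part (the ring) guarantees connectivity; the random shortcuts do the latency work, which is the "ring + random shortcuts" construction referred to later in the talk.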

SLIDE 57

Random Topology (2)

Two approaches to quasi-randomness:

  • Method A makes a non-random topology random
  • Method B makes a random topology layout-friendly

[Figure: a randomness axis from low (not random) to high (fully random); Method A starts from the low end and Method B from the high end, meeting at quasi-random topologies]

Michihiro Koibuchi, http://research.nii.ac.jp/~koibuchi/pdf/hpca2013_slide.pdf

SLIDE 58

Random Routing in Hypercube

  • For deterministic bit-fixing routing, the worst case requires at least 2^(n/2)/2 steps (exponential in n)
  • But random bit-fixing routing requires O(n) steps with high probability (i.e., the probability of using more than O(n) steps vanishes to 0 as n → ∞)
  • Random bit-fixing routing has two stages:
    1. Pick a random node r(i) in the hypercube independently, and use bit-fixing routing from i to r(i)
    2. Use bit-fixing routing from r(i) to d(i)
  • Obviously, random bit-fixing routing needs longer paths. Then why is it better?
    • The intuition is that random routing averages out the worst-case configurations of deterministic routing
    • The probability that a randomly generated configuration is the worst case is very low, and vanishes for large n
    • This intuition is behind many randomized algorithms

[Figure: random bit-fixing routing in a 4-cube, with a two-stage configuration table i → r(i) → d(i)]

Sid C-K Chau, https://www.cl.cam.ac.uk/teaching/1011/CompSysMod/RandBits_Lec2V2.pdf
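Two-stage random bit-fixing routing can be sketched as follows (a minimal model of the algorithm described above; node addresses are n-bit integers, and the helper names are mine):

```python
import random

def bit_fix_path(src, dst, n):
    """Bit-fixing route in an n-cube: scan the address bits from
    high to low and flip each bit where src and dst differ."""
    path = [src]
    cur = src
    for b in range(n - 1, -1, -1):      # fix the highest-order bit first
        if (cur ^ dst) >> b & 1:
            cur ^= 1 << b
            path.append(cur)
    return path

def random_bit_fix_path(src, dst, n, rng=random):
    # Valiant's trick: bit-fix to a random intermediate node first,
    # then bit-fix from there to the real destination.
    mid = rng.randrange(1 << n)
    return bit_fix_path(src, mid, n)[:-1] + bit_fix_path(mid, dst, n)

print(bit_fix_path(0b0000, 0b1011, 4))
# [0, 8, 10, 11]  i.e. 0000 -> 1000 -> 1010 -> 1011
```

Each path takes at most n hops per stage, so the total route length is at most 2n; the randomization is what removes the adversarial worst-case traffic patterns.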

SLIDE 59

Dynamic routing vs. Random routing

  • A switch has several candidate routes for a packet to go through
  • Static routing
    • always choose the same fixed one
  • Dynamic routing
    • choose one according to the network status
  • Random routing
    • choose one at random
    • the choice does not have to be uniformly random

SLIDE 60

Randomness in a Network

  • Combination of regularity and randomness
  • Random topology
    • Regular part + random part
    • Ex) ring + random shortcuts
  • Random (oblivious) routing (≠ Brownian motion)
    • Random routing + regular routing
    • An intermediate node/switch is chosen randomly
  • What if a failure happens on the regular part?
    • The factorial nature can be relaxed
    • Ex) redundant links in the regular part of the topology

[Figure: a random shortcut topology with a route from node x to node y]

SLIDE 61

My Last Word

“An eye for an eye, a tooth for a tooth”

Randomness for randomness

Randomness MAY save future supercomputers (not yet proven). Thank you

SLIDE 62

Reference

1) High-radix router: "Microarchitecture of a High-Radix Router," John Kim, William J. Dally, et al., ISCA '05.
2) Tofu network: "The Tofu Interconnect," Yuichiro Ajima, et al., Hot Interconnects, 2012.
3) Dragonfly network: "Technology-Driven, Highly-Scalable Dragonfly Topology," John Kim, William J. Dally, et al., ISCA '08.
4) Routing algorithms: "A Survey and Evaluation of Topology-Agnostic Deterministic Routing Algorithms," J. Flich et al., IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 3, pp. 405-425, March 2012.
5) Shortest-path algorithm (Dijkstra's algorithm): "A Note on Two Problems in Connexion with Graphs," E. W. Dijkstra, Numerische Mathematik, 1959.
6) Adaptive routing in InfiniBand: "Fail-in-place Network Design: Interaction Between Topology, Routing Algorithm and Failures," J. Domke, T. Hoefler, and S. Matsuoka, SC '14, 2014.
7) Spare node substitution: "Sliding Substitution of Failed Nodes," Atsushi Hori, et al., Proceedings of the 22nd European MPI Users' Group Meeting, ACM, 2015.
8) Randomized algorithms, including random routing: "Randomized Algorithms," Rajeev Motwani and Prabhakar Raghavan, Cambridge University Press, 1995.
9) Random network: "A Case for Random Shortcut Topologies for HPC Interconnects," Michihiro Koibuchi, et al., ISCA '12.
10) Another view on HPC network robustness: "Robustness Attributes of Interconnection Networks for Parallel Processing," Behrooz Parhami, keynote lecture, 1st Int'l Supercomputing Conf. (ISUM-2010), March 4, 2010. (https://www.ece.ucsb.edu/~parhami/pres_folder/parh10-isum-robustness-int-nets.ppt)