SLIDE 1

OVERHEAD OF A DECENTRALIZED GOSSIP ALGORITHM ON THE PERFORMANCE OF HPC APPLICATIONS

ELY LEVY, AMNON BARAK, AMNON SHILOH, MATTHIAS LIEBER, CARSTEN WEINHOLD, HERMANN HÄRTIG

The Hebrew University of Jerusalem
Faculty of Computer Science, Institute of Systems Architecture, Operating Systems Group

SLIDE 2

TU Dresden Overhead of a Decentralized Gossip Algorithm

MOTIVATION

Management tasks in supercomputers:

  • Process placement
  • Load management
  • System monitoring

Up-to-date information is required to make informed decisions.

SLIDE 3

REQUIREMENTS

  • Low overhead on application performance
  • Scalability:
      • Decentralized information dissemination
      • Decentralized decision making
  • Fault tolerance

SLIDE 4

RANDOMIZED GOSSIP
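The slide's illustration did not survive extraction. As a minimal sketch of one randomized gossip round, under assumptions drawn from slides 5–6 (each node keeps an age vector, ages it every interval, resets its own entry, and pushes a window of its youngest entries to one uniformly random peer; the paper's exact rules may differ in detail):

```python
import random

def gossip_round(vectors, window_frac=0.5, rng=random):
    """One synchronous round of randomized gossip (sketch).

    vectors maps each node to its age vector (node -> age in rounds).
    Every round each node ages all entries, resets its own entry to 0,
    and pushes its youngest entries to one randomly chosen peer, which
    keeps the younger age per entry.
    """
    nodes = list(vectors)
    window = max(1, int(window_frac * len(nodes)))
    for node, vec in vectors.items():
        for k in vec:
            vec[k] += 1           # everything is one round older
        vec[node] = 0             # a node's own information is fresh
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        # Send the `window` youngest entries; the transfer costs one round.
        for k, age in sorted(vectors[node].items(), key=lambda kv: kv[1])[:window]:
            if k not in vectors[peer] or age + 1 < vectors[peer][k]:
                vectors[peer][k] = age + 1
```

Starting from vectors that only know their own node, repeated rounds spread every node's entry through the system without any central coordinator.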

SLIDE 5

MERGING WINDOWS

[Figure: two nodes exchange windows of their age vectors. Node A holds A:0 B:12 C:2 D:4 E:11 and sends the window A:0 C:2 D:4, which arrives at Node E as A:1 C:3 D:5. Node E holds A:5 B:2 C:4 D:3 E:0 and merges, keeping the younger age per entry.]
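The merge step can be sketched with the figure's own numbers (assuming, as the figure's values suggest, that entries age by one round in transit, so Node A's window A:0 C:2 D:4 arrives at Node E as A:1 C:3 D:5):

```python
def merge_window(local, window):
    """Merge a received gossip window into the local age vector,
    keeping the younger (smaller) age for every entry."""
    for node, age in window.items():
        if node not in local or age < local[node]:
            local[node] = age
    return local

# Node E's vector and the window received from Node A (figure values):
node_e = {"A": 5, "B": 2, "C": 4, "D": 3, "E": 0}
received = {"A": 1, "C": 3, "D": 5}
merge_window(node_e, received)
# Node E now has fresher entries for A and C; its own D entry (age 3)
# was already younger than the received one (age 5) and is kept.
```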

SLIDE 6

WINDOW SIZE

Average vector age (in gossip rounds) by window size:

Window size (rel. to node count):   10%    20%    30%   40%   50%   60%   70%   80%   90%   100%
1024 nodes:                        14.21   9.77   8.46  7.83  7.49  7.29  7.18  7.09  7.03  7.01
2048 nodes:                        14.86  10.46   9.15  8.53  8.19  7.99  7.87  7.78  7.73  7.71

How much data to send?

  • Small window sizes already yield good average age
  • Diminishing return for larger window sizes
  • Example: 20% of 1024 nodes w/ 1KiB per node ➞ 200 KiB
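The trade-off above can be made concrete with a small sketch (the helper names are hypothetical; the 1 KiB-per-node figure follows the example on this slide):

```python
def make_window(vector, window_frac):
    """Select the youngest window_frac of all known entries to send."""
    count = max(1, int(window_frac * len(vector)))
    return dict(sorted(vector.items(), key=lambda kv: kv[1])[:count])

def window_payload_bytes(num_nodes, window_frac, entry_bytes=1024):
    """Message size: one entry_bytes record per node in the window."""
    return int(window_frac * num_nodes) * entry_bytes

# 20% of 1024 nodes at 1 KiB per node -> roughly 200 KiB per message.
```

Because small windows already achieve a good average age, the payload can stay far below a full vector exchange.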
SLIDE 7

HARDWARE

BlueGene/Q at Jülich (JUQUEEN):
  • 28,672 nodes total (used 1024–8192)
  • 16 cores per node (PowerPC A2 @ 1.6 GHz)
  • 5D torus network (10 links per node)
  • 2 GB/s per link, send + receive; total bandwidth per node: 40 GB/s
  • 2.6 µs worst-case latency

SLIDE 8

GOSSIP ALGORITHM

  • MPI-based implementation (MPI_Bsend)
  • Gossip algorithm runs on 1 core; application uses the remaining 15 cores
  • How to run two programs on BG/Q?
      • Gossip algorithm and application linked together
      • MPI communicators configured to hide every 16th core from the application
      • Wrapped all uses of MPI_COMM_WORLD
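The rank partitioning can be sketched as follows (assumptions: one MPI rank per core and the last rank on each node reserved for gossip; the exact BG/Q rank-to-core mapping may differ):

```python
def split_color_key(world_rank, cores_per_node=16):
    """(color, key) for an MPI_Comm_split-style partition that hides
    every 16th core: the last rank on each node joins the gossip
    communicator (color 1), the rest the application one (color 0)."""
    color = 1 if world_rank % cores_per_node == cores_per_node - 1 else 0
    return color, world_rank  # key = world rank preserves rank order
```

With mpi4py, the result would feed `MPI.COMM_WORLD.Split(color, key)`, and the wrapped `MPI_COMM_WORLD` would then hand the application its color-0 communicator.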

SLIDE 9

BENCHMARKS

  • HPCC suite: MPI-FFT (heavy network usage)
  • HPCC suite: PTRANS (heavy network usage)
  • Application: COSMO-SPECS+FD4 (moderate network usage)

SLIDE 11

HPCC: MPI-FFT

[Chart: MPI-FFT runtimes vs. gossip interval (without gossip and 1024 ms down to 2 ms); two measured series per configuration. 1024 nodes: 19.0 s without gossip and at 1024–64 ms, rising to 20.0 s at 2 ms (second series: 12.2 s rising to 13.2 s). 2048 nodes: 50.2 s rising to 54.0 s (36.4 s rising to 40.0 s) across the measured intervals.]

Benchmark: Fast Fourier Transform

  • All-to-all communication pattern
  • Stresses bisection bandwidth
  • 1024 nodes: 136 billion vector elements (2025 GiB)
  • 2048+ nodes: 544 billion vector elements (8100 GiB)

[Chart continued: 4096 nodes: 40.0 s without gossip rising to 45.3 s at the shortest measured interval (second series: 32.9 s to 38.1 s); 8192 nodes: 27.8 s rising to 42.2 s (24.0 s to 38.4 s). Overhead grows sharply at short intervals and large node counts.]

SLIDE 12

COSMO-SPECS+FD4


Benchmark: Atmospheric Simulation

  • COSMO: static, regular communication
  • SPECS: dynamic, irregular communication
  • Model coupling: dynamic, irregular, small volume
  • Partitioning: collectives
  • Migration: highly local, mostly between neighbors

[Chart: COSMO-SPECS+FD4 runtimes vs. gossip interval (without gossip and 1024 ms down to 1 ms); two measured series per configuration. 1024 nodes: 40.6 s rising to 41.1 s at 1 ms (second series: 8.2 s to 8.6 s); 2048 nodes: 36.7 s to 38.0 s (4.3 s to 5.5 s); 4096 nodes: 36.7 s to 38.7 s (4.3 s to 6.1 s); 8192 nodes: 37.3 s to 38.2 s (4.8 s to 5.6 s).]

SLIDE 13

GOSSIP SCALABILITY

[Chart: gossip overhead vs. interval. 1024 nodes: 0.0% without gossip and at 1024 ms, then 1.3% (256 ms), 5.4% (64 ms), 10.6% (16 ms), 20.9% (8 ms), 41.4% (4 ms), 80.3% (2 ms). 2048 nodes: 0.0% rising to 79.3% across the measured intervals.]

Computational complexity:
 O(n·log(n))

[Chart continued: 4096 nodes: 0.0% rising to 79.4%; 8192 nodes: 0.2% rising to 80.5% across the measured intervals. Overhead roughly doubles as the interval halves.]

SLIDE 14

RATE VS WIN SIZE

Overhead by gossip interval and window size (window size rel. to node count; more data per message to the right, more messages per second toward the top):

Gossip interval   10%    20%    40%    80%
1 ms              3.8%
2 ms              1.8%   4.7%
4 ms              1.1%   2.6%   6.3%  17.2%
8 ms              0.7%   1.4%   3.2%   8.7%
16 ms             0.3%   0.7%   1.8%   4.5%
32 ms             0.3%   0.4%   0.9%   2.6%
64 ms             0.2%   0.3%   0.6%   1.5%

SLIDE 15

REACTION TIME?

Gossip intervals and resulting average vector age:

  • 256 ms ➞ 2–3 s
  • 1024 ms ➞ 10 s

Applicability for system services:

  • Global load information (allocation, …)
  • Local load balancing (MOSIX-like, …)
  • System monitoring (node health, …)
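These reaction times follow from slide 6: average age in seconds is roughly the average age in gossip rounds (about 8–10 for moderate window sizes, read off slide 6) times the gossip interval. A worked check of that relation:

```python
def avg_age_seconds(interval_s, avg_age_rounds):
    """Average information age: age in gossip rounds times interval."""
    return interval_s * avg_age_rounds

# 256 ms interval at ~9 rounds -> about 2.3 s, within the "2-3 s" above;
# 1024 ms at ~10 rounds -> about 10 s.
```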

SLIDE 16

FUTURE WORK

  • Other types of networks (InfiniBand, Cray, …)
  • Fault tolerance, loss of messages
  • Must adapt for exascale systems:
      • Incomplete knowledge at each node
      • Groups of gossip nodes
      • Smaller vectors
      • Hierarchical gossip for global view

SLIDE 17

CONCLUSIONS

  • Gossip algorithm scales to thousands of nodes
  • Increasing the window size causes more overhead than decreasing the gossip interval
  • Collective MPI communication is the most sensitive
  • Gossip intervals of 256–1024 ms add no noticeable overhead (in most cases)
  • Average age of information at nodes is on the order of 2–3 s with a gossip interval of 256 ms

German Priority Programme 1648 Software for Exascale Computing

FFMK.tudos.org