SLIDE 1

Extreme-Scale HPC Network Analysis using Discrete-Event Simulation

Noah Wolfe1, Misbah Mubarak2, Nikhil Jain3, Jens Domke4, Abhinav Bhatele3, Christopher D. Carothers1, Robert B. Ross2

1Rensselaer Polytechnic Institute 2Argonne National Laboratory 3Lawrence Livermore National Laboratory 4Technische Universität Dresden, Dresden, Germany

SLIDE 2
Outline

  • Introduction
  • Motivation
  • Background: ROSS, CODES, DUMPI
  • Slim Fly Network Model
  • Design
  • Verification
  • Network visualization
  • Large-scale configuration
  • Single-job application trace performance
  • PDES performance
  • Summary

SLIDE 3

Motivation

Fig. Emerging HPC node hardware: IBM Power9 CPU, NVIDIA Volta GPU [1][3], and Intel Xeon Phi MIC [2][3]

Image sources: [1] http://images.anandtech.com/doci/8727/Processors.jpg [2] http://images.fastcompany.com/upload/Aubrey_Isle_die.jpg [3] http://www.anandtech.com/show/9151/intel-xeon-phi-cray-energy-supercomputers

SLIDE 4

Network Design Problem

Image source: http://farm7.static.flickr.com/6114/6322893212_22f00600a1_b.jpg

  • Variables
  • Topology
  • Planes/Rails
  • Technology (link speed, switch radix)
  • Job Allocation
  • Routing
  • Communication Patterns
  • Questions
  • Bandwidth?
  • Latency?
  • Job Interference?
  • Answer: Simulation!!!
SLIDE 5

Approach

Fig. Simulation stack:
  • DUMPI — DOE Application Traces (AMG, Crystal Router, MultiGrid)
  • MPI Simulation Layer
  • CODES — Slim Fly and Fat Tree Network Models
  • ROSS — PDES Framework (Sequential, Conservative, Optimistic)

SLIDE 6
Background — ROSS

  • Rensselaer Optimistic Simulation System (ROSS)
  • Provides the parallel discrete-event simulation platform for CODES simulations
  • Supports optimistic event scheduling using reverse computation (see the sketch below)
  • Demonstrated super-linear speedup and is capable of processing 500 billion events per second with over 250 million LPs on 120 racks of the Sequoia supercomputer at LLNL

Fig. ROSS event rate (events/second) versus number of Blue Gene/Q racks: actual Sequoia performance against linear scaling with 2 racks as the base

97x speedup for 60x more hardware
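To make the reverse-computation idea concrete, here is a minimal sketch of a forward/reverse handler pair for a switch-like LP. The types and names are hypothetical (this is not the ROSS API): the forward handler is executed speculatively and mutates LP state, and the reverse handler undoes exactly that change when the optimistic scheduler rolls the event back.

```c
/* Minimal reverse-computation sketch; hypothetical types, not the ROSS API. */
#include <stdint.h>

typedef struct {
    uint64_t packets_forwarded;   /* LP state: packets this switch has forwarded */
    uint64_t buffer_occupancy;    /* LP state: bytes currently queued            */
} switch_state;

typedef struct {
    uint32_t packet_bytes;        /* payload size carried by this event          */
} packet_event;

/* Forward handler: executed speculatively under optimistic scheduling. */
static void packet_arrive(switch_state *s, const packet_event *ev)
{
    s->packets_forwarded += 1;
    s->buffer_occupancy  += ev->packet_bytes;
}

/* Reverse handler: called on rollback; must undo the forward handler exactly. */
static void packet_arrive_rc(switch_state *s, const packet_event *ev)
{
    s->buffer_occupancy  -= ev->packet_bytes;
    s->packets_forwarded -= 1;
}
```

Because every state change is inverted rather than restored from a checkpoint, rollbacks stay cheap, which is what makes the optimistic event rates quoted above feasible.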

SLIDE 7


Background — CODES

  • Framework for exploring design of HPC interconnects, storage systems, and workloads
  • High-fidelity packet-level models of HPC interconnect topologies
  • Synthetic or application trace network and I/O workloads
  • CO-Design of multi-layer Exascale Storage systems (CODES)

*Source: “Quantifying I/O and communication traffic interference on burst-buffer enabled dragonfly networks,” (submitted Cluster 17)

SLIDE 8


Background — DUMPI

  • DUMPI - The MPI Trace Library
  • Provides libraries to collect and read traces of MPI applications
  • DOE Design Forward Traces
  • Variety of communication patterns and intensities of applications at scale
  • AMG
  • Crystal Router
  • Multigrid
  • Fig. Distribution of MPI communication for AMG trace

Source: nersc.gov

SLIDE 9


HPC Systems: Summit Supercomputer

  • Compute Nodes: ~4,600
  • CPU Cores: ~220,000
  • Routers: ~500
  • Cables: ~12,000

Fig. Summit node components: IBM Power9 CPU and NVIDIA Volta GPU

SLIDE 10


HPC Components and Discrete-Event Simulation Components

  • HPC components: Routers/Switches, Compute Nodes, MPI Processes
  • LPs: Logical Processes
  • PEs: Processing Elements
  • Events: Time-stamped elements of work

SLIDE 11
Discrete-Event Mapping

  • Rensselaer Optimistic Simulation System (ROSS)
  • Logical Processes (LPs), sketched in the declarations below:
  • MPI processes (virtual)
  • Compute nodes
  • Network switches
  • Processing Elements (PEs):
  • MPI process (physical)
  • Events:
  • Network Messages
  • Event Scheduling:
  • Sequential
  • Conservative
  • Optimistic

Fig. LP types (network switch, compute node, virtual MPI process) mapped onto PEs (physical MPI processes)
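To make the mapping concrete, the declarations below sketch the three LP types and a time-stamped network-message event; the names are illustrative only and are not taken from the CODES or ROSS source.

```c
/* Hypothetical LP-type and event declarations illustrating the mapping. */
#include <stdint.h>

/* Each simulated HPC component becomes one LP type. */
enum lp_type {
    LP_MPI_PROCESS,     /* virtual MPI rank replayed from a DUMPI trace */
    LP_COMPUTE_NODE,    /* terminal attached to a router                */
    LP_NETWORK_SWITCH   /* Slim Fly / Fat Tree router                   */
};

/* Events are time-stamped units of work; here, a network message. */
struct network_msg_event {
    double   timestamp_ns;   /* virtual time at which the event fires   */
    uint64_t src_lp;         /* sending LP id                           */
    uint64_t dst_lp;         /* receiving LP id                         */
    uint32_t payload_bytes;  /* message size                            */
};

/* LPs are distributed across PEs, one PE per physical MPI process. */
struct pe_layout {
    int physical_mpi_rank;   /* the PE                                  */
    int num_local_lps;       /* LPs mapped to this PE                   */
};
```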

SLIDE 12

Discrete-Event Implementation

SLIDE 13

Slim Fly Network Model

SLIDE 14

Slim Fly — Design

  • Description:
  • Built on MMS graphs
  • Max network diameter of 2
  • Uses high-radix routers
  • Complex layout and connectivity
  • Network Parameters (worked example below):
  • q: number of routers per group and number of global connections per router
  • p: number of terminal connections per router, p=floor(k’/2)
  • k: router radix
  • k’: router network radix
  • Fig. Slim Fly with q=5
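For a worked example of how these parameters fit together, the sketch below reproduces the q=13 configuration used on the verification slide (338 routers, 3,042 nodes, k=28). It assumes the usual MMS-graph relations for Slim Fly (routers = 2*q^2, q = 4w + delta with delta in {-1, 0, 1}, k' = (3q - delta)/2, p = floor(k'/2), k = k' + p, nodes = routers * p); the code is only an illustrative calculation, not part of the CODES model.

```c
/* Illustrative Slim Fly parameter calculation (not part of the CODES model). */
#include <stdio.h>

int main(void)
{
    int q = 13;                         /* routers per group (verification config) */
    int delta = (q % 4 == 3) ? -1 : q % 4;   /* q = 4w + delta                     */
    int routers = 2 * q * q;            /* 338 routers for q = 13                  */
    int k_net   = (3 * q - delta) / 2;  /* network radix k' = 19                   */
    int p       = k_net / 2;            /* terminal connections per router = 9     */
    int k       = k_net + p;            /* router radix = 28                       */
    int nodes   = routers * p;          /* 3,042 compute nodes                     */

    printf("q=%d routers=%d k'=%d p=%d k=%d nodes=%d\n",
           q, routers, k_net, p, k, nodes);
    return 0;
}
```

Running the same relations with q=37 gives 2,738 routers, k'=55, p=27, k=82, and roughly 74K nodes, matching the Aurora-scale configuration on the large-scale slide.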
SLIDE 15

Slim Fly — Network Model

Synthetic Workloads:
  • Uniform Random (UR)
  • Worst-Case (WC)

Routing (decision sketch below):
  • Minimal routing
  • Max 2 hops
  • Non-minimal routing
  • Messages routed minimally to a random intermediate router and then minimally to the destination
  • Adaptive routing
  • Chooses minimal or non-minimal based on congestion at the source router

Additional Components:
  • Credit-based flow control
  • Virtual Channels (VCs) to avoid routing deadlocks
  • Minimal routing: 2 VCs per port
  • Non-minimal routing: 4 VCs per port
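The sketch below illustrates one plausible (UGAL-style) form of the adaptive choice described above, comparing queue occupancy on the minimal and non-minimal candidate ports at the source router, weighted by hop count (at most 2 hops minimally, up to 4 via a random intermediate router). The names and the exact criterion are assumptions, not the CODES model's code.

```c
/* Hypothetical sketch of adaptive (UGAL-style) routing at the source router. */
#include <stdlib.h>

enum route_kind { ROUTE_MINIMAL, ROUTE_NONMINIMAL };

/* Queue occupancy (e.g., queued bytes) on the two candidate output ports. */
struct port_congestion {
    unsigned minimal_queue;
    unsigned nonminimal_queue;
};

static enum route_kind choose_route(const struct port_congestion *c)
{
    unsigned minimal_cost    = c->minimal_queue    * 2u; /* at most 2 hops */
    unsigned nonminimal_cost = c->nonminimal_queue * 4u; /* up to 4 hops   */
    /* Divert only when the minimal path is clearly more congested. */
    return (minimal_cost > nonminimal_cost) ? ROUTE_NONMINIMAL : ROUTE_MINIMAL;
}

/* Non-minimal routing: route minimally to a random intermediate router,
 * then minimally on to the destination (hypothetical helper).            */
static int pick_intermediate(int num_routers, int src, int dst)
{
    int r;
    do {
        r = rand() % num_routers;
    } while (r == src || r == dst);
    return r;
}
```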
SLIDE 16

Slim Fly — Verification

Fig. Verification results for Adaptive, Non-Minimal, and Minimal Routing

[1] G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and T. Hoefler. Cost-Effective Diameter-Two Topologies: Analysis and Evaluation. Nov. 2015. IEEE/ACM ICHPCNSA (SC15).

Setup

  • Comparison with published Slim Fly network results by Kathareios et al. [1]
  • Slim Fly Configuration
  • 3,042 nodes
  • 338 routers
  • q=13
  • k=28
  • Simulation Configuration
  • link bandwidth: 12.5 GB/s
  • link latency: 50 ns
  • buffer space: 100 KB
  • router delay: 100 ns
  • Simulated time: 220 us

SLIDE 17

Slim Fly — Network Visualization

  • Uniform random traffic with minimal routing
  • Fig. Virtual channel occupancy for all router ports in the network
SLIDE 18

Slim Fly — Network Visualization

Send/Receive Performance:

  • Uniform random traffic
  • Minimal routing
  • 100% bandwidth injection load
  • Fig. Number of sends and receives sampled over the simulation
SLIDE 19

Slim Fly — Large-Scale

  • 74K Node (Aurora) System:
  • 2,738 routers
  • q=37, k=82, p=floor(k’/2)=27
  • 1M Node System:
  • 53,178 routers
  • q=163, k=264, k’=244, ideal_p=122, actual_p=19
  • Simulation Configuration:
  • link bandwidth: 12.5 GB/s
  • link latency: 50 ns
  • buffer space: 100 KB
  • router delay: 100 ns
  • Simulated time: 200 us
SLIDE 20

Application Traces

  • Crystal Router:
  • Description: Mini-app for the highly scalable Nek5000 spectral element code developed at ANL [3].
  • Communication Pattern: Large synchronized messages following an n-dimensional hypercube (many-to-many); a sketch of this exchange pattern follows the list
  • Communication Time: 68.5% of runtime
  • Trace size: 1,000 MPI processes
  • Multigrid:
  • Description: Implements a single production cycle of the linear solver used in BoxLib [1], an adaptive mesh refinement code.
  • Communication Pattern: Bursty periods of small messages along diagonals (many-to-many)
  • Communication Time: 5% of runtime
  • Trace size: 10,648 MPI processes
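The sketch below illustrates the hypercube-style many-to-many exchange attributed to Crystal Router above: each rank exchanges a message with the partner whose rank differs in one bit per dimension and synchronizes after each transfer. This is a generic pattern written for illustration (it assumes a power-of-two rank count and uses the ~2890 B average message size from the trace summary), not code from the Nek5000 mini-app.

```c
/* Generic n-dimensional hypercube exchange (illustrative, not Nek5000 code). */
#include <mpi.h>
#include <string.h>

#define MSG_BYTES 2890   /* roughly the average CR message size in the trace summary */

int main(int argc, char **argv)
{
    int rank, nranks;
    char sendbuf[MSG_BYTES], recvbuf[MSG_BYTES];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    memset(sendbuf, 0, sizeof sendbuf);

    /* One exchange per hypercube dimension: the partner differs in bit d. */
    for (int d = 1; d < nranks; d <<= 1) {
        int partner = rank ^ d;
        MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_BYTE, partner, 0,
                     recvbuf, MSG_BYTES, MPI_BYTE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);   /* synchronized after each transfer */
    }

    MPI_Finalize();
    return 0;
}
```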

[1] Department of Energy, “AMR Box Lib.” [Online]. Available: https://ccse.lbl.gov/BoxLib/
[2] Co-design at Lawrence Livermore National Laboratory, “Algebraic Multigrid Solver (AMG).” (Accessed: Apr. 19, 2015). [Online]. Available: https://codesign.llnl.gov/amg2013.php
[3] J. Shin et al., “Speeding up Nek5000 with autotuning and specialization,” in Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 2010.

SLIDE 21

Crystal Router (CR)

Simulation:

  • Workload: 1,000 ranks CR
  • End Time: 290 us
SLIDE 22

Multigrid (MG)

Simulation:

  • Workload: 10,648 ranks MG
  • End Time: 290 us
SLIDE 23

Application Traces Summary

Application | MPI Ranks | Virtual End Time (ns) | Recvs | GB Received | Waits | Wait Alls | Avg Msg Size
CR          | 1000      | 750866                | 263K  | 724MB       | 263K  | 263K      | 2890B
MG          | 10648     | 44798942              | 2.6M  | 18.1GB      |       |           | 7480B

  • Summary:
  • CR:
  • Small quantity of medium-sized messages
  • Synchronization after each message transfer
  • MG:
  • Large quantity of large messages
  • No synchronization
SLIDE 24

Slim Fly — Crystal Router

Fig. (a) Simulation End Time (b) Packet Hops (c) Packet Latency (d) Network Congestion

SLIDE 25

Slim Fly — Multigrid

Fig. (a) Simulation End Time (b) Packet Hops (c) Packet Latency (d) Network Congestion

SLIDE 26

Slim Fly — PDES Scaling

  • 74K Node System:
  • 43M events/s
  • 543M events processed
  • 1M Node System:
  • 36M events/s
  • 7B events processed
  • Optimistic:
  • Ideal scaling
  • > 95% efficiency
  • Conservative:
  • Little to no scaling performance

SLIDE 27

Slim Fly — PDES Analysis

Optimistic:

  • Distribution of simulation time scales linearly
  • Uniform distribution of work among PEs

Conservative:

  • Distribution of simulation time scales linearly
  • Uniform distribution of work among PEs

SLIDE 28

Slim Fly — PDES Analysis

Fig. (a) Memory Consumption (b) Time Slowdown

Memory Consumption:

  • Physical amount of system memory required to initialize the model and run the simulation

Time Slowdown:

  • A measure of how much slower the simulation is compared to the real-world experiment being modeled (stated as a ratio below)
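Spelled out, the slowdown metric is the ratio of wall-clock time spent running the simulator to the virtual time the simulation covers; the helper below states that definition with hypothetical variable names.

```c
/* Time slowdown: wall-clock seconds spent running the simulation divided by
 * the virtual (simulated) seconds it covers. Hypothetical variable names.   */
double time_slowdown(double wall_clock_seconds, double simulated_seconds)
{
    return wall_clock_seconds / simulated_seconds;
}
```

For example, needing 20 s of wall-clock time to simulate the 200 us window used in the large-scale runs would correspond to a slowdown of 1e5 (the numbers here are purely illustrative).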

SLIDE 29

Summary

  • Slim fly network model: A new parallel discrete-event slim fly network model capable of providing insight into network behavior at scale
  • Verification: Verified the accuracy of the slim fly model against published results
  • Network performance analysis: Performed detailed analysis of the slim fly model in response to single-job executions of application communication traces, showing preferred routing algorithms
  • PDES Analysis: Conducted strong-scaling and discrete-event simulation analysis showing the efficiency and scalability of the network model under both conservative and optimistic event scheduling
  • Overall: Utilizing the discrete-event simulation approach for large-scale HPC system simulations results in an effective tool for analysis and co-design