SLIDE 1

Extreme-Scale HPC Network Analysis using Discrete-Event Simulation

Noah Wolfe1, Misbah Mubarak2, Nikhil Jain3, Jens Domke4, Abhinav Bhatele3, Christopher D. Carothers1, Robert B. Ross2

1Rensselaer Polytechnic Institute 2Argonne National Laboratory 3Lawrence Livermore National Laboratory 4Technische Universität Dresden, Dresden, Germany

SLIDE 2
Outline

  • Introduction
  • Motivation
  • Background: ROSS, CODES, DUMPI
  • Slim Fly Network Model
  • Design
  • Verification
  • Network visualization
  • Large-scale configuration
  • Single-job application trace performance
  • PDES performance
  • Summary

SLIDE 3

Motivation

Fig. Emerging HPC node hardware: IBM Power9 CPU, NVIDIA Volta GPU [1][3], and Intel Xeon Phi MIC [2][3]

Image sources: [1] http://images.anandtech.com/doci/8727/Processors.jpg [2] http://images.fastcompany.com/upload/Aubrey_Isle_die.jpg [3] http://www.anandtech.com/show/9151/intel-xeon-phi-cray-energy-supercomputers

SLIDE 4

Network Design Problem

Image source: http://farm7.static.flickr.com/6114/6322893212_22f00600a1_b.jpg

  • Variables
  • Topology
  • Planes/Rails
  • Technology (link speed, switch radix)
  • Job Allocation
  • Routing
  • Communication Patterns
  • Questions
  • Bandwidth?
  • Latency?
  • Job Interference?
  • Answer: Simulation!!!
SLIDE 5

Approach

Fig. Simulation stack:
  • DUMPI — DOE Application Traces (AMG, Crystal Router, MultiGrid)
  • MPI Simulation Layer
  • CODES — Slim Fly and Fat Tree Network Models
  • ROSS — PDES Framework (Sequential, Conservative, Optimistic)

SLIDE 6
Background — ROSS

  • Rensselaer Optimistic Simulation System (ROSS)
  • Provides the parallel discrete-event simulation platform for CODES simulations
  • Supports optimistic event scheduling using reverse computation (see the sketch below)
  • Demonstrated super-linear speedup and is capable of processing 500 billion events per second with over 250 million LPs on 120 racks of the Sequoia supercomputer at LLNL

Fig. ROSS event rate (events/second) versus number of Blue Gene/Q racks: actual Sequoia performance against linear scaling with 2 racks as the base

97x speedup for 60x more hardware
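To make the reverse-computation idea concrete, here is a minimal sketch of a forward/reverse handler pair for a switch-like LP. The types and names are hypothetical (this is not the ROSS API): the forward handler is executed speculatively and mutates LP state, and the reverse handler undoes exactly that change when the optimistic scheduler rolls the event back.

```c
/* Minimal reverse-computation sketch; hypothetical types, not the ROSS API. */
#include <stdint.h>

typedef struct {
    uint64_t packets_forwarded;   /* LP state: packets this switch has forwarded */
    uint64_t buffer_occupancy;    /* LP state: bytes currently queued            */
} switch_state;

typedef struct {
    uint32_t packet_bytes;        /* payload size carried by this event          */
} packet_event;

/* Forward handler: executed speculatively under optimistic scheduling. */
static void packet_arrive(switch_state *s, const packet_event *ev)
{
    s->packets_forwarded += 1;
    s->buffer_occupancy  += ev->packet_bytes;
}

/* Reverse handler: called on rollback; must undo the forward handler exactly. */
static void packet_arrive_rc(switch_state *s, const packet_event *ev)
{
    s->buffer_occupancy  -= ev->packet_bytes;
    s->packets_forwarded -= 1;
}
```

Because every state change is inverted rather than restored from a checkpoint, rollbacks stay cheap, which is what makes the optimistic event rates quoted above feasible.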

SLIDE 7


Background — CODES

  • Framework for exploring design of HPC interconnects, storage systems, and workloads
  • High-fidelity packet-level models of HPC interconnect topologies
  • Synthetic or application trace network and I/O workloads
  • CO-Design of multi-layer Exascale Storage systems (CODES)

*Source: “Quantifying I/O and communication traffic interference on burst-buffer enabled dragonfly networks,” (submitted Cluster 17)

SLIDE 8


Background — DUMPI

  • DUMPI - The MPI Trace Library
  • Provides libraries to collect and read traces of MPI applications
  • DOE Design Forward Traces
  • Variety of communication patterns and intensities of applications at scale
  • AMG
  • Crystal Router
  • Multigrid
  • Fig. Distribution of MPI communication for AMG trace

Source: nersc.gov

SLIDE 9


HPC Systems: Summit Supercomputer

  • Compute Nodes: ~4,600
  • CPU Cores: ~220,000
  • Routers: ~500
  • Cables: ~12,000

Fig. Summit node components: IBM Power9 CPU and NVIDIA Volta GPU

SLIDE 10


HPC Components and Discrete-Event Simulation Components

  • HPC components: Routers/Switches, Compute Nodes, MPI Processes
  • LPs: Logical Processes
  • PEs: Processing Elements
  • Events: Time-stamped elements of work

SLIDE 11
Discrete-Event Mapping

  • Rensselaer Optimistic Simulation System (ROSS)
  • Logical Processes (LPs), sketched in the declarations below:
  • MPI processes (virtual)
  • Compute nodes
  • Network switches
  • Processing Elements (PEs):
  • MPI process (physical)
  • Events:
  • Network Messages
  • Event Scheduling:
  • Sequential
  • Conservative
  • Optimistic

Fig. LP types (network switch, compute node, virtual MPI process) mapped onto PEs (physical MPI processes)
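To make the mapping concrete, the declarations below sketch the three LP types and a time-stamped network-message event; the names are illustrative only and are not taken from the CODES or ROSS source.

```c
/* Hypothetical LP-type and event declarations illustrating the mapping. */
#include <stdint.h>

/* Each simulated HPC component becomes one LP type. */
enum lp_type {
    LP_MPI_PROCESS,     /* virtual MPI rank replayed from a DUMPI trace */
    LP_COMPUTE_NODE,    /* terminal attached to a router                */
    LP_NETWORK_SWITCH   /* Slim Fly / Fat Tree router                   */
};

/* Events are time-stamped units of work; here, a network message. */
struct network_msg_event {
    double   timestamp_ns;   /* virtual time at which the event fires   */
    uint64_t src_lp;         /* sending LP id                           */
    uint64_t dst_lp;         /* receiving LP id                         */
    uint32_t payload_bytes;  /* message size                            */
};

/* LPs are distributed across PEs, one PE per physical MPI process. */
struct pe_layout {
    int physical_mpi_rank;   /* the PE                                  */
    int num_local_lps;       /* LPs mapped to this PE                   */
};
```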

SLIDE 12

Discrete-Event Implementation

SLIDE 13

Slim Fly Network Model

SLIDE 14

Slim Fly — Design

  • Description:
  • Built on MMS graphs
  • Max network diameter of 2
  • Uses high-radix routers
  • Complex layout and connectivity
  • Network Parameters (worked example below):
  • q: number of routers per group and number of global connections per router
  • p: number of terminal connections per router, p=floor(k’/2)
  • k: router radix
  • k’: router network radix
  • Fig. Slim Fly with q=5
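For a worked example of how these parameters fit together, the sketch below reproduces the q=13 configuration used on the verification slide (338 routers, 3,042 nodes, k=28). It assumes the usual MMS-graph relations for Slim Fly (routers = 2*q^2, q = 4w + delta with delta in {-1, 0, 1}, k' = (3q - delta)/2, p = floor(k'/2), k = k' + p, nodes = routers * p); the code is only an illustrative calculation, not part of the CODES model.

```c
/* Illustrative Slim Fly parameter calculation (not part of the CODES model). */
#include <stdio.h>

int main(void)
{
    int q = 13;                         /* routers per group (verification config) */
    int delta = (q % 4 == 3) ? -1 : q % 4;   /* q = 4w + delta                     */
    int routers = 2 * q * q;            /* 338 routers for q = 13                  */
    int k_net   = (3 * q - delta) / 2;  /* network radix k' = 19                   */
    int p       = k_net / 2;            /* terminal connections per router = 9     */
    int k       = k_net + p;            /* router radix = 28                       */
    int nodes   = routers * p;          /* 3,042 compute nodes                     */

    printf("q=%d routers=%d k'=%d p=%d k=%d nodes=%d\n",
           q, routers, k_net, p, k, nodes);
    return 0;
}
```

Running the same relations with q=37 gives 2,738 routers, k'=55, p=27, k=82, and roughly 74K nodes, matching the Aurora-scale configuration on the large-scale slide.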
SLIDE 15

Slim Fly — Network Model

Synthetic Workloads:
  • Uniform Random (UR)
  • Worst-Case (WC)

Routing (decision sketch below):
  • Minimal routing
  • Max 2 hops
  • Non-minimal routing
  • Messages routed minimally to a random intermediate router and then minimally to the destination
  • Adaptive routing
  • Chooses minimal or non-minimal based on congestion at the source router

Additional Components:
  • Credit-based flow control
  • Virtual Channels (VCs) to avoid routing deadlocks
  • Minimal routing: 2 VCs per port
  • Non-minimal routing: 4 VCs per port
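The sketch below illustrates one plausible (UGAL-style) form of the adaptive choice described above, comparing queue occupancy on the minimal and non-minimal candidate ports at the source router, weighted by hop count (at most 2 hops minimally, up to 4 via a random intermediate router). The names and the exact criterion are assumptions, not the CODES model's code.

```c
/* Hypothetical sketch of adaptive (UGAL-style) routing at the source router. */
#include <stdlib.h>

enum route_kind { ROUTE_MINIMAL, ROUTE_NONMINIMAL };

/* Queue occupancy (e.g., queued bytes) on the two candidate output ports. */
struct port_congestion {
    unsigned minimal_queue;
    unsigned nonminimal_queue;
};

static enum route_kind choose_route(const struct port_congestion *c)
{
    unsigned minimal_cost    = c->minimal_queue    * 2u; /* at most 2 hops */
    unsigned nonminimal_cost = c->nonminimal_queue * 4u; /* up to 4 hops   */
    /* Divert only when the minimal path is clearly more congested. */
    return (minimal_cost > nonminimal_cost) ? ROUTE_NONMINIMAL : ROUTE_MINIMAL;
}

/* Non-minimal routing: route minimally to a random intermediate router,
 * then minimally on to the destination (hypothetical helper).            */
static int pick_intermediate(int num_routers, int src, int dst)
{
    int r;
    do {
        r = rand() % num_routers;
    } while (r == src || r == dst);
    return r;
}
```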
SLIDE 16

Slim Fly — Verification

Fig. Verification results for Adaptive, Non-Minimal, and Minimal Routing

[1] G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and T. Hoefler. Cost-Effective Diameter-Two Topologies: Analysis and Evaluation. Nov. 2015. IEEE/ACM ICHPCNSA (SC15).

Setup

  • Comparison with published Slim Fly network results by Kathareios et al. [1]
  • Slim Fly Configuration
  • 3,042 nodes
  • 338 routers
  • q=13
  • k=28
  • Simulation Configuration
  • link bandwidth: 12.5 GB/s
  • link latency: 50 ns
  • buffer space: 100 KB
  • router delay: 100 ns
  • Simulated time: 220 us

SLIDE 17

Slim Fly — Network Visualization

  • Uniform random traffic with minimal routing
  • Fig. Virtual channel occupancy for all router ports in the network
SLIDE 18

Slim Fly — Network Visualization

Send/Receive Performance:

  • Uniform random traffic
  • Minimal routing
  • 100% bandwidth injection load
  • Fig. Number of sends and receives sampled over the simulation
SLIDE 19

Slim Fly — Large-Scale

  • 74K Node (Aurora) System:
  • 2,738 routers
  • q=37, k=82, p=floor(k’/2)=27
  • 1M Node System:
  • 53,178 routers
  • q=163, k=264, k’=244, ideal_p=122, actual_p=19
  • Simulation Configuration:
  • link bandwidth: 12.5 GB/s
  • link latency: 50 ns
  • buffer space: 100 KB
  • router delay: 100 ns
  • Simulated time: 200 us
SLIDE 20

Application Traces

  • Crystal Router:
  • Description: Mini-app for the highly scalable Nek5000 spectral element code developed at ANL [3].
  • Communication Pattern: Large synchronized messages following an n-dimensional hypercube (many-to-many); a sketch of this exchange pattern follows the list
  • Communication Time: 68.5% of runtime
  • Trace size: 1,000 MPI processes
  • Multigrid:
  • Description: Implements a single production cycle of the linear solver used in BoxLib [1], an adaptive mesh refinement code.
  • Communication Pattern: Bursty periods of small messages along diagonals (many-to-many)
  • Communication Time: 5% of runtime
  • Trace size: 10,648 MPI processes
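The sketch below illustrates the hypercube-style many-to-many exchange attributed to Crystal Router above: each rank exchanges a message with the partner whose rank differs in one bit per dimension and synchronizes after each transfer. This is a generic pattern written for illustration (it assumes a power-of-two rank count and uses the ~2890 B average message size from the trace summary), not code from the Nek5000 mini-app.

```c
/* Generic n-dimensional hypercube exchange (illustrative, not Nek5000 code). */
#include <mpi.h>
#include <string.h>

#define MSG_BYTES 2890   /* roughly the average CR message size in the trace summary */

int main(int argc, char **argv)
{
    int rank, nranks;
    char sendbuf[MSG_BYTES], recvbuf[MSG_BYTES];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    memset(sendbuf, 0, sizeof sendbuf);

    /* One exchange per hypercube dimension: the partner differs in bit d. */
    for (int d = 1; d < nranks; d <<= 1) {
        int partner = rank ^ d;
        MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_BYTE, partner, 0,
                     recvbuf, MSG_BYTES, MPI_BYTE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);   /* synchronized after each transfer */
    }

    MPI_Finalize();
    return 0;
}
```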

[1] Department of Energy, “AMR Box Lib.” [Online]. Available: https://ccse.lbl.gov/BoxLib/
[2] Co-design at Lawrence Livermore National Laboratory, “Algebraic Multigrid Solver (AMG).” (Accessed: Apr. 19, 2015). [Online]. Available: https://codesign.llnl.gov/amg2013.php
[3] J. Shin et al., “Speeding up Nek5000 with autotuning and specialization,” in Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 2010.

SLIDE 21

Crystal Router (CR)

Simulation:

  • Workload: 1,000 ranks CR
  • End Time: 290 us
SLIDE 22

Multigrid (MG)

Simulation:

  • Workload: 10,648 ranks MG
  • End Time: 290 us
SLIDE 23

Application Traces Summary

Application | MPI Ranks | Virtual End Time (ns) | Recvs | GB Received | Waits | Wait Alls | Avg Msg Size
CR          | 1000      | 750866                | 263K  | 724MB       | 263K  | 263K      | 2890B
MG          | 10648     | 44798942              | 2.6M  | 18.1GB      |       |           | 7480B

  • Summary:
  • CR:
  • Small quantity of medium-sized messages
  • Synchronization after each message transfer
  • MG:
  • Large quantity of large messages
  • No synchronization
SLIDE 24

Slim Fly — Crystal Router

Fig. (a) Simulation End Time (b) Packet Hops (c) Packet Latency (d) Network Congestion

SLIDE 25

Slim Fly — Multigrid

Fig. (a) Simulation End Time (b) Packet Hops (c) Packet Latency (d) Network Congestion

SLIDE 26

Slim Fly — PDES Scaling

  • 74K Node System:
  • 43M events/s
  • 543M events processed
  • 1M Node System:
  • 36M events/s
  • 7B events processed
  • Optimistic:
  • Ideal scaling
  • > 95% efficiency
  • Conservative:
  • Little to no scaling performance

SLIDE 27

Slim Fly — PDES Analysis

Optimistic:

  • Distribution of simulation time scales linearly
  • Uniform distribution of work among PEs

Conservative:

  • Distribution of simulation time scales linearly
  • Uniform distribution of work among PEs

SLIDE 28

Slim Fly — PDES Analysis

Fig. (a) Memory Consumption (b) Time Slowdown

Memory Consumption:

  • Physical amount of system memory required to initialize the model and run the simulation

Time Slowdown:

  • A measure of how much slower the simulation is compared to the real-world experiment being modeled (stated as a ratio below)
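Spelled out, the slowdown metric is the ratio of wall-clock time spent running the simulator to the virtual time the simulation covers; the helper below states that definition with hypothetical variable names.

```c
/* Time slowdown: wall-clock seconds spent running the simulation divided by
 * the virtual (simulated) seconds it covers. Hypothetical variable names.   */
double time_slowdown(double wall_clock_seconds, double simulated_seconds)
{
    return wall_clock_seconds / simulated_seconds;
}
```

For example, needing 20 s of wall-clock time to simulate the 200 us window used in the large-scale runs would correspond to a slowdown of 1e5 (the numbers here are purely illustrative).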

SLIDE 29

Summary

  • Slim fly network model: A new parallel discrete-event slim fly network model capable of providing insight into network behavior at scale
  • Verification: Verified the accuracy of the slim fly model against published results
  • Network performance analysis: Performed detailed analysis of the slim fly model in response to single-job executions of application communication traces, showing preferred routing algorithms
  • PDES Analysis: Conducted strong-scaling and discrete-event simulation analysis showing the efficiency and scalability of the network model under both conservative and optimistic event scheduling
  • Overall: Utilizing the discrete-event simulation approach for large-scale HPC system simulations results in an effective tool for analysis and co-design