

SLIDE 1

VIRTUAL INSTITUTE – HIGH PRODUCTIVITY SUPERCOMPUTING

Insightful Automatic Performance Modeling

Alexandru Calotoiu¹, Torsten Hoefler², Martin Schulz³, Sergei Shudler¹ and Felix Wolf¹

¹TU Darmstadt, ²ETH Zürich, ³Lawrence Livermore National Laboratory

SLIDE 2

Sponsors

INSIGHTFUL AUTOMATIC PERFORMANCE MODELING TUTORIAL 2

SLIDE 3

Virtual Institute – High Productivity Supercomputing

Association of HPC programming tool builders

Mission:

  • Development of portable programming tools that assist programmers in diagnosing programming errors and optimizing the performance of their applications
  • Integration of these tools
  • Organization of training events designed to teach the application of these tools
  • Organization of academic workshops to facilitate the exchange of ideas on tool development and to promote young scientists

www.vi-hps.org

SLIDE 4

Motivation – latent scalability bugs

[Chart: execution time vs. system size]

SLIDE 5

Learning objectives

  • Performance modeling background
  • Automatic performance modeling with Extra-P
    • How it works
    • When it doesn’t work
  • Practical experiences with
    • Prepared examples
    • Your own data

SLIDE 6

Talk structure

  • Introduction
    • Background
    • Automatic performance modeling
  • Theory
    • Performance Model Normal Form (PMNF)
    • Assumptions & limitations
  • Practice
    • Workflow
    • Model refinement
  • Examples
    • Case studies
    • Discussion

SLIDE 7

Introduction

SLIDE 8

Outline

  • Performance analysis methods
  • Analytical performance modeling
  • Automatic performance modeling
  • Scalability validation framework

SLIDE 9

Spectrum of performance analysis methods

[Chart: spectrum from benchmark to full simulation, model simulation, and model, trading the number of parameters against the model error]

SLIDE 10

Scaling model

[Chart: time vs. processes (2⁹–2¹³), with fitted model 3·10⁻⁴·p² + c]

  • Represents a performance metric as a function of the number of processes
  • Provides insight into the program behavior at scale

SLIDE 11

Pitfalls

Intuition is not enough

[Chart: competing fits 2.95·log₂(p) + 0.0871·p and 12.06·p]

SLIDE 12

Analytical performance modeling

Identify kernels
  • Parts of the program that dominate its performance at larger scales
  • Identified via small-scale tests and intuition

Create models
  • Laborious process
  • Still confined to a small community of skilled experts

Disadvantages:
  • Time consuming
  • Danger of overlooking unscalable code

Examples:
  • Hoisie et al.: Performance and scalability analysis of teraflop-scale parallel architectures using multi-dimensional wavefront applications. International Journal of High Performance Computing Applications, 2000
  • Bauer et al.: Analysis of the MILC Lattice QCD Application su3_rmd. CCGrid, 2012

SLIDE 13

Automatic performance modeling with Extra-P

[Diagram: an instrumented program (main, foo, bar, compute) yields performance measurements Mᵢ, Mⱼ as input to Extra-P; the output is human-readable performance models of all functions (e.g., t = c1*log(p) + c2)]

SLIDE 14

Automatic performance modeling with Extra-P

[Diagram: profiles measured at p₁ = 128, p₂ = 256, p₃ = 512, p₄ = 1,024, p₅ = 2,048, p₆ = 4,096 are input to Extra-P, which models all functions and ranks them (1. foo, 2. compute, 3. main, 4. bar, …) by target scale pₜ or asymptotic behavior]

SLIDE 15

Requirements modeling

[Diagram: program requirements split into computation (FLOPS, load, store) and communication (P2P, collective, …), each modeled alongside time; disagreement between requirement and time models may be indicative of wait states]

SLIDE 16

Algorithm engineering


Courtesy of Peter Sanders, KIT

SLIDE 17

How to validate scalability in practice?

Program: small textbook example vs. real application
Expectation: a verifiable analytical expression (#FLOPS = n²(2n − 1)) vs. asymptotic complexity (#FLOPS = O(n^2.8074))
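The textbook expression matches, for instance, a naïve n×n matrix multiplication (the concrete pairing with matrix multiplication is our illustration, not stated on the slide); a quick sketch counting every floating-point operation:

```python
def naive_matmul_flops(A, B):
    """Multiply two square matrices, counting every floating-point operation."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    flops = 0
    for i in range(n):
        for j in range(n):
            acc = A[i][0] * B[0][j]          # 1 multiplication
            flops += 1
            for k in range(1, n):
                acc += A[i][k] * B[k][j]     # 1 multiplication + 1 addition
                flops += 2
            C[i][j] = acc
    return C, flops

# Each output element costs 1 + 2(n-1) = 2n - 1 operations, n^2 elements total
n = 6
identity = [[float(i == j) for j in range(n)] for i in range(n)]
_, flops = naive_matmul_flops(identity, identity)
assert flops == n * n * (2 * n - 1)          # matches #FLOPS = n^2(2n - 1)
```

The asymptotic-complexity column (O(n^2.8074)) corresponds to algorithms whose operation count cannot be verified by such simple counting, which is exactly the validation problem the slide raises.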

SLIDE 18

Scalability evaluation framework

[Diagram: benchmark → performance measurements → model generation (over a generated search space) → scaling model; combined with an expectation (plus an optional deviation limit) this yields a divergence model, used for initial validation, comparing alternatives, and regression testing]

Shudler et al.: Exascaling Your Library: Will Your Implementation Meet Your Expectations? ICS, 2015

SLIDE 19

Theory

SLIDE 20

Outline

  • Goal – scaling trends
  • Model generation
  • Performance Model Normal Form (PMNF)
  • Statistical quality control & confidence intervals
  • Assumptions & limitations of the method

SLIDE 21

Automatic performance modeling

[Diagram: an instrumented program (main, foo, bar, compute) yields performance measurements Mᵢ, Mⱼ as input to Extra-P; the output is human-readable performance models of all functions (e.g., t = c1*log(p) + c2)]

SLIDE 22

Primary focus on scaling trend

[Chart: common performance analysis chart in a paper, with ranking 1. F2, 2. F1, 3. F3]

SLIDE 23

Primary focus on scaling trend

[Chart: the same ranking (1. F2, 2. F1, 3. F3) – common performance analysis chart in a paper vs. actual measurement in laboratory conditions]

SLIDE 24

Primary focus on scaling trend

[Chart: the same ranking (1. F2, 2. F1, 3. F3) – production reality]

SLIDE 25

Model building blocks

               Computation    Communication
Samplesort     t(p) ~ p²      t(p) ~ p²·log₂²(p)
Naïve N-body   t(p) ~ p       t(p) ~ p
FFT            t(p) ~ c       t(p) ~ log₂(p)
LU             t(p) ~ c       t(p) ~ c
…              …              …

SLIDE 26

Performance model normal form

f(p) = Σₖ₌₁ⁿ cₖ · p^(iₖ) · log₂^(jₖ)(p),   with n ∈ ℕ, iₖ ∈ I, jₖ ∈ J, I, J ⊂ ℚ

With n = 1, I = {0, 1, 2}, J = {0, 1}, the candidate models are:

c₁
c₁ · p
c₁ · p²
c₁ · log(p)
c₁ · p · log(p)
c₁ · p² · log(p)
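The single-term candidates above can be enumerated mechanically from I and J; a minimal sketch (the tuple representation of exponent pairs is our choice, not part of the PMNF definition):

```python
import math
from itertools import product

I = [0, 1, 2]   # exponents of p (the slide's example)
J = [0, 1]      # exponents of log2(p)

def term(i, j):
    """Return p^i * log2(p)^j as a callable (the coefficient c1 is omitted)."""
    return lambda p: p**i * math.log2(p)**j

# One candidate per (i, j) pair; (0, 0) is the constant model c1
candidates = {(i, j): term(i, j) for i, j in product(I, J)}

assert len(candidates) == 6                 # the six models listed above
assert candidates[(2, 1)](8) == 8**2 * 3    # p^2 * log2(p) at p = 8
```

During model generation, a coefficient c₁ is later fitted for each candidate and the best-fitting hypothesis is kept.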

SLIDE 27

Performance model normal form

f(p) = Σₖ₌₁ⁿ cₖ · p^(iₖ) · log₂^(jₖ)(p),   with n ∈ ℕ, iₖ ∈ I, jₖ ∈ J, I, J ⊂ ℚ

With I = {0, 1, 2}, J = {0, 1}, allowing n = 2 adds the two-term combinations to the six single-term candidates (c₁, c₁·p, c₁·p², c₁·log(p), c₁·p·log(p), c₁·p²·log(p)):

c₁ + c₂·p
c₁ + c₂·p²
c₁ + c₂·log(p)
c₁ + c₂·p·log(p)
c₁ + c₂·p²·log(p)
c₁·log(p) + c₂·p
c₁·log(p) + c₂·p·log(p)
c₁·log(p) + c₂·p²
c₁·log(p) + c₂·p²·log(p)
c₁·p + c₂·p·log(p)
c₁·p + c₂·p²
c₁·p + c₂·p²·log(p)
c₁·p·log(p) + c₂·p²
c₁·p·log(p) + c₂·p²·log(p)
c₁·p² + c₂·p²·log(p)

SLIDE 28

Weak vs. strong scaling

  • Wall-clock time is not necessarily monotonically increasing under strong scaling
    • Harder to capture the model automatically
  • Different invariants require different reductions across processes

              Weak scaling               Strong scaling
Invariant     Problem size per process   Overall problem size
Model target  Wall-clock time            Accumulated time
Reduction     Maximum / average          Sum

SLIDE 29

Statistical quality control

If the 95% confidence interval is too wide, the fit could be adversely influenced by noise, or overfitting might occur.

CI = f(mean, stddev). To improve the CI: increase repetitions, include different configurations.

[Chart: performance metric measured at x₁ … x₅ – unknown behavior of the kernel, data (with noise), confidence interval (t-test), and confidence band reflecting noise uncertainty]
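The CI = f(mean, stddev) idea can be sketched for a set of repeated measurements. This sketch uses the normal approximation to the t-interval (adequate for many repetitions; an exact t-quantile would be slightly wider for few runs), and the timing values are hypothetical:

```python
from statistics import mean, stdev, NormalDist

def confidence_interval(samples, level=0.95):
    """Approximate CI for the mean of repeated, noisy measurements.

    Uses z * s / sqrt(n) with a normal quantile; a t-quantile would be
    slightly wider for small n.
    """
    n = len(samples)
    m, s = mean(samples), stdev(samples)
    z = NormalDist().inv_cdf((1 + level) / 2)   # ~1.96 for 95%
    half = z * s / n**0.5
    return m - half, m + half

runs = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0]   # hypothetical timings [s]
lo, hi = confidence_interval(runs)
assert lo < mean(runs) < hi
```

More repetitions shrink the interval with 1/sqrt(n), which is why adding runs (or extra configurations) is the suggested remedy for noisy fits.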

SLIDE 30

Assumptions & limitations

  • Only one scaling behavior for all the measurements; no jumps
  • Some MPI collective operations switch their algorithm with scale – this results in inaccurate models
  • Example: the red model tries to model measurements of two different algorithms
    • First 4 points – one function
    • Last 4 points – another function (linear)
    • Adj. R² = 0.95085 (!)

SLIDE 31

Changing growth trends

Ranking according to growth rate is difficult: log₂(p) vs. p?

[Chart: two models whose relative order changes as p grows]

SLIDE 32

Changing growth trends (2)


SLIDE 33

Practice

SLIDE 34

Outline

  • Workflow
  • Performance measurements
  • Model refinement
  • Adjusted R²
  • Kernel ranking
  • Output representations

SLIDE 35

Workflow

[Diagram: performance measurements → performance profiles → model generation (with statistical quality control) → scaling models → performance extrapolation → ranking of kernels; if accuracy has not saturated, kernel refinement feeds back into model generation as a refinement loop]

SLIDE 36

Performance measurements

Different ways of collecting measurements:
  • Score-P (http://www.vi-hps.org/projects/score-p/)
  • Other profiling tools, e.g., HPCToolkit
  • Manual ad-hoc measurements

SLIDE 37

Performance measurements (2)

Our experience shows that at least 5 different measurements are required

[Diagram: performance measurements (profiles) at p₁ = 256, p₂ = 512, p₃ = 1024, p₄ = 2048, p₅ = 4096]

SLIDE 38

Performance measurements (3)

Our experience shows that at least 5 different measurements are required, and each measurement should be repeated multiple times

[Diagram: each configuration p₁ = 256 … p₅ = 4096 measured repeatedly – single runs can yield noisy results and too big a variance]

SLIDE 39

Model refinement

Iterative model refinement:

1. Start with n = 1 and R̄₀² = −∞
2. Generate all hypotheses of size n from the input data {(p₁,t₁), …, (p₆,t₆)}
3. Evaluate the hypotheses via cross-validation
4. Compute R̄ₙ² for the best hypothesis
5. If R̄ₙ₋₁² > R̄ₙ² or n = nₘₐₓ, stop and output the best hypothesis (e.g., c₁·log(p)) as the scaling model; otherwise increment n and repeat

Example search space: I = {0, 1, 2}, J = {0, 1}, nₘₐₓ = 2, with single-term candidates
c₁, c₁·p, c₁·p², c₁·log(p), c₁·p·log(p), c₁·p²·log(p)

R² = 1 − residualSumSquares / totalSumSquares
R̄² = 1 − (1 − R²) · (6 − 1) / (6 − n − 2)
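The single-term step of this loop can be sketched with ordinary least squares. The candidate set follows the slide; the noise-free synthetic data set (t = 3·10⁻⁴·p²) and the selection by plain R² (rather than the cross-validated adjusted R̄²) are simplifications of ours:

```python
import math

# Single-term PMNF candidates (i, j): t ~ c * p^i * log2(p)^j
CANDIDATES = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]

def fit_one_term(data, i, j):
    """Least-squares coefficient and R^2 for t ~ c * p^i * log2(p)^j."""
    xs = [p**i * math.log2(p)**j for p, _ in data]
    ts = [t for _, t in data]
    c = sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)
    mean_t = sum(ts) / len(ts)
    rss = sum((t - c * x) ** 2 for x, t in zip(xs, ts))
    tss = sum((t - mean_t) ** 2 for t in ts)
    return c, 1 - rss / tss

# Synthetic measurements following t = 3e-4 * p^2 exactly
data = [(p, 3e-4 * p * p) for p in (512, 1024, 2048, 4096, 8192, 16384)]
best = max(CANDIDATES, key=lambda ij: fit_one_term(data, *ij)[1])
assert best == (2, 0)   # the p^2 hypothesis wins
```

A full implementation would then try two-term hypotheses and keep growing n only while the adjusted R̄² improves, exactly as the loop above describes.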

SLIDE 40

Adjusted R2

R² represents how well the determined function fits the M available measurements. The adjusted R² additionally accounts for N, the number of terms used:

  • Adj. R² decreases when useless variables are added
  • Adj. R² increases when useful variables are added

Rule of thumb: adj. R² > 0.95

R² = 1 − residualSumSquares / totalSumSquares
R̄² = 1 − (1 − R²) · (M − 1) / (M − N − 2)
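A minimal helper for the formula above shows the penalty in action; the sample R² values are hypothetical:

```python
def adjusted_r2(r2, m, n_terms):
    """Adjusted R^2 as defined on the slide: penalizes extra model terms.

    m: number of measurements (M), n_terms: number of model terms (N).
    """
    return 1 - (1 - r2) * (m - 1) / (m - n_terms - 2)

# A second term must raise R^2 noticeably to pay for its penalty:
# here R^2 improves only from 0.97 to 0.975, so the adjusted value drops.
assert adjusted_r2(0.97, m=6, n_terms=1) > adjusted_r2(0.975, m=6, n_terms=2)
```

This is what lets the refinement loop stop: once an extra term no longer improves R̄², the simpler hypothesis is kept.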

SLIDE 41

Ranking of kernels

Kernels are ranked according to the leading-order terms of their models, expressed in big-O notation. For example, O(x) comes before O(x²).

SLIDE 42

Extra-P – User interface


SLIDE 43

Extra-P – User interface

[Screenshot: call tree exploration, plot of the model, selected kernel(s)]

SLIDE 44

Extra-P – Text output

Callpath/Region: exp4
Metric: Test
Data:
  (  1000, 1e+06)     95% CI [1.00001e+06, 999989]
  (  2000, 4e+06)     95% CI [4.00003e+06, 3.99998e+06]
  (  4000, 1.6e+07)   95% CI [1.6e+07, 1.6e+07]
  (  8000, 6.4e+07)   95% CI [6.4e+07, 6.4e+07]
  ( 16000, 2.56e+08)  95% CI [2.56e+08, 2.56e+08]
Model: 0+1*(p^2)
RSS: 3.35017
Adjusted R^2: 1

SLIDE 45

Case studies

SLIDE 46

Case studies

Lulesh, JUSPIC, NEST, UG4, MP2C, BLAST, XNS, Sweep3D, MILC, HOMME

SLIDE 47

Sweep3D – Neutron transport simulation

LogGP model for communication developed by Hoisie et al. (pᵢ ≤ 8k; #bytes and #msg constant):
tcomm = [2(px + py − 2) + 4(nsweep − 1)] · tmsg  ~  tcomm = c·√p

Kernel [2 of 40]    Model [s] t = f(p)   Predictive error [%] at pt = 262k
sweep → MPI_Recv    4.03·√p              5.10
sweep               582.19               0.01

SLIDE 48

Sweep3D – Neutron transport simulation

Model:

4.03·√p

[Chart: time (s) and relative error (%) vs. processes (2⁶–2¹⁸), showing model, data, prediction, and relative error across the training and prediction ranges]

SLIDE 49

MILC

MILC/su3_rmd – from the MILC suite of QCD codes, with a performance model manually created by Hoefler et al. Time per process should remain constant except for a rather small logarithmic term caused by global convergence checks.

Kernel [3 of 479]                Model [s] t = f(p)    Predictive error [%] at pt = 64k
compute_gen_staple_field         2.40·10⁻²             0.43
g_vecdoublesum → MPI_Allreduce   6.30·10⁻⁶·log₂²(p)    0.01
mult_adj_su3_fieldlink_lathwec   3.80·10⁻³             0.04

(training runs pᵢ ≤ 16k)

SLIDE 50

HOMME – Climate

Core of the Community Atmospheric Model (CAM)
  • Spectral element dynamical core on a cubed sphere grid

Kernel [3 of 194]            Model [s] t = f(p)                         Predictive error [%] at pt = 130k
box_rearrange → MPI_Reduce   0.026 + 2.53·10⁻⁶·p^1.5 + 1.24·10⁻¹²·p³    57.02
vlaplace_sphere_vk           49.53                                      99.32
compute_and_apply_rhs        48.68                                      1.65

(training runs pᵢ ≤ 15k)

SLIDE 51

Core of the Community Atmospheric Model (CAM)
  • Spectral element dynamical core on a cubed sphere grid

Kernel [3 of 194]            Model [s] t = f(p)                 Predictive error [%] at pt = 130k
box_rearrange → MPI_Reduce   3.63·10⁻⁶·p^1.5 + 7.21·10⁻¹³·p³    30.34
vlaplace_sphere_vk           24.44 + 2.26·10⁻⁷·p²               4.28
compute_and_apply_rhs        49.09                              0.83

(training runs pᵢ ≤ 43k)

SLIDE 52

HOMME – Climate (3)

[Chart: time (s) vs. processes (2¹⁰–2²²) on a log scale – MPI_Reduce, vlaplace_sphere_wk, and compute_and_apply_rhs across the training and prediction ranges]

SLIDE 53

UG4

  • Numerical framework for grid-based solution of partial differential equations (~500,000 lines of C++ code, 2,000 kernels)
  • Application: drug diffusion through the human skin
  • In general, all kernels scale well
  • Multigrid solver kernel (MGM) scales logarithmically
  • Number of iterations needed by the unpreconditioned conjugate gradient (CG) method depends on the mesh size
    • Increases by a factor of two with each refinement
    • Will therefore suffer from iteration count increase in weak scaling

Kernel   Model (time [s])
CG       0.227 + 0.31 · p^0.5
MGM      0.219 + 0.0006 · log2(p)

SLIDE 54

XNS

Finite element flow simulation
Strong scaling analysis using accumulated time across processes as metric

Kernel                  Runtime [%] p=128   Runtime [%] p=4096   Model [s] t = f(p)
ewdgennprm → MPI_Recv   0.46                51.46                0.029·p²   (#bytes ~ p, #msg ~ p)
ewddot                  44.78               5.04                 37406.80 + 13.29·p·log(p)

SLIDE 55

MPI

Platform                Juqueen    Juropa            Piz Daint
Barrier [s]      Expectation: O(log p)
  Model                 O(log p)   O(p^0.67·log p)   O(p^0.33)
  R²                    0.99       0.99              0.99
  Divergence            O(1)       O(p^0.67)         O(p^0.33/log p)
  Match                 ✔          ✘                 ~
Bcast [s]        Expectation: O(log p)
  Model                 O(log p)   O(p^0.5)          O(p^0.5)
  R²                    0.86       0.98              0.94
  Divergence            O(1)       O(p^0.5/log p)    O(p^0.5/log p)
  Match                 ✔          ~                 ~
Reduce [s]       Expectation: O(log p)
  Model                 O(log p)   O(p^0.5·log p)    O(p^0.5·log p)
  R²                    0.93       0.99              0.94
  Divergence            O(1)       O(p^0.5)          O(p^0.5)
  Match                 ✔          ~                 ~
MPI memory [MB]  Expectation: O(log p)
  Model                 O(log p)   O(p)              O(log p)
  R²                    0.72       1                 0.23
  Divergence            O(1)       O(p/log p)        O(1)
  Match                 ✔          ✘                 ✔
Comm_create [B]  Expectation: O(p)
  Model                 O(p)       O(p)              O(p)
  R²                    1          1                 0.99
  Divergence            O(1)       O(1)              O(1)
  Match                 ✔          ✔                 ✔
Win_create [B]   Expectation: O(p)
  Model                 O(p)       O(p)              O(p)
  R²                    1          1                 0.99
  Divergence            O(1)       O(1)              O(1)
  Match                 ✔          ✔                 ✔

SLIDE 56

MPI (2)

Platform              Juqueen    Juropa           Piz Daint
Allreduce [s]   Expectation: O(log p)
  Model               O(log p)   O(p^0.5)         O(p^0.67·log p)
  R²                  0.87       0.99             0.99
  Divergence                     O(p^0.5/log p)   O(p^0.67)
  Match               ✔          ~                ✘!
Comm_dup [B]    Expectation: O(1)
  Model               2.2e5      256              3770 + 18p
  R²                  1          1                0.99
  Divergence          O(1)       O(1)             O(p)
  Match               ✔          ✔                ✘

SLIDE 57

MPI (3)

MPI_Allreduce 3 different machines – 3 different models


SLIDE 58

Mass-producing performance models


  • Is feasible
  • Offers insight
  • Requires low effort
  • Improves code coverage
SLIDE 59

Coming soon: Fast multi-parameter performance modeling

Multi-parameter performance models consider, e.g., process count (p), problem size (n), order of a solver (o), and result precision (ε).

Independent or compounded effect? Consider the performance difference:

f(p, n, o, ε) = 10 + p + 10·n + o²
f(p, n, o, ε) = 10 + p·10·n·o²
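The difference between the additive and the multiplicative form above becomes concrete when both are evaluated; the parameter values below are arbitrary examples of ours:

```python
def f_additive(p, n, o, eps=None):
    # f(p,n,o,e) = 10 + p + 10*n + o^2   (eps has no effect in this example)
    return 10 + p + 10 * n + o**2

def f_multiplicative(p, n, o, eps=None):
    # f(p,n,o,e) = 10 + p*10*n*o^2      (parameters compound)
    return 10 + p * 10 * n * o**2

# Same inputs, vastly different cost: compounded parameters dominate quickly
assert f_additive(64, 4, 2) == 10 + 64 + 40 + 4            # 118
assert f_multiplicative(64, 4, 2) == 10 + 64 * 10 * 4 * 4  # 10250
```

Distinguishing the two cases automatically is exactly what the multi-parameter model search has to do.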

SLIDE 60

Coming soon: Fast multi-parameter performance modeling

Expanded performance model normal form:

f(p₁, …, pₘ) = Σₖ₌₁ⁿ cₖ · Πₗ₌₁ᵐ pₗ^(iₖₗ) · log₂^(jₖₗ)(pₗ),   with n, m ∈ ℕ, iₖₗ ∈ I, jₖₗ ∈ J, I, J ⊂ ℚ

Example: n = 2, m = 2, I = {0/4, 1/4, …, 12/4}, J = {0, 1, 2}

Model candidates:
  • Constant: c₁
  • Single parameter: c₁ + c₂·p₁
  • Multiple parameters
    • Additive: c₁ + c₂·p₁ + c₃·p₂
    • Multiplicative: c₁ + c₂·p₁·p₂
    • Complex: e.g., c₁ + c₂·p₁ + c₃·p₂·log₂(p₂)

SLIDE 61

Coming soon: Fast multi-parameter performance modeling


  • Exhaustive search: C(59,319, 3) = 34,786,300,841,019 possible three-parameter models – at 300,000 models searched per second*, ~3.5 years per model generated
  • Hierarchical search – reduces the search from all possible models to combinations of the best single-parameter models of each parameter: 3·9,139 + 512 = 27,929 models, ~11 models/second
  • Hierarchical search + modified golden section search – orders the single-parameter search space and applies a modified binary search: 3·26 + 512 = 590 models, ~508 models/second

*This is a simplification; multi-parameter models take much longer to evaluate
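The quoted search-space sizes follow from simple combinatorics: 39 single-parameter terms (13 exponents of p times 3 exponents of log p), 39³ compound terms for three parameters, and hypotheses of up to three terms (the decomposition 13·3 and the up-to-three-terms reading are our interpretation of the numbers on the slide):

```python
from math import comb

# Per-parameter term pool: 13 exponents of p times 3 exponents of log p
single_terms = 13 * 3
assert single_terms == 39

# Up to three terms per single-parameter hypothesis
single_param_hypotheses = comb(single_terms, 3)
assert single_param_hypotheses == 9139

# Exhaustive 3-parameter search: every compound term picks one
# exponent pair per parameter, then choose 3 such terms
compound_terms = single_terms ** 3
assert compound_terms == 59_319
assert comb(compound_terms, 3) == 34_786_300_841_019

# Hierarchical search: best single-parameter models first,
# then 512 cross-parameter combinations
assert 3 * single_param_hypotheses + 512 == 27_929
```

At 300,000 evaluations per second, 27,929 candidates take under a tenth of a second, which reproduces the ~11 models/second figure.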

SLIDE 62

References

[1] Alexandru Calotoiu, Torsten Hoefler, Marius Poke, Felix Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. In Proc. of the ACM/IEEE Conference on Supercomputing (SC13), Denver, CO, USA, pages 1-12, ACM, November 2013.
[2] Sergei Shudler, Alexandru Calotoiu, Torsten Hoefler, Alexandre Strube, Felix Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations? In Proc. of the International Conference on Supercomputing (ICS), Newport Beach, CA, USA, pages 1-11, ACM, June 2015.
[3] Andreas Vogel, Alexandru Calotoiu, Alexandre Strube, Sebastian Reiter, Arne Nägel, Felix Wolf, Gabriel Wittum: 10,000 Performance Models per Minute - Scalability of the UG4 Simulation Framework. In Proc. of the 21st Euro-Par Conference, Vienna, Austria, Lecture Notes in Computer Science, pages 519-531, Springer, August 2015.
[4] Christian Iwainsky, Sergei Shudler, Alexandru Calotoiu, Alexandre Strube, Michael Knobloch, Christian Bischof, Felix Wolf: How Many Threads will be too Many? On the Scalability of OpenMP Implementations. In Proc. of the 21st Euro-Par Conference, Vienna, Austria, Lecture Notes in Computer Science, pages 451-463, Springer, August 2015.
[5] Alexandru Calotoiu, David Beckingsale, Christopher W. Earl, Torsten Hoefler, Ian Karlin, Martin Schulz, Felix Wolf: Fast Multi-Parameter Performance Modeling. In Proc. of the 2016 IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan, pages 1-10, IEEE Computer Society, September 2016 (to appear).

SLIDE 63

Thank You!

Get Extra-P at: http://www.scalasca.org/software/extra-p/download.html