23.10.17 | Department of Computer Science | Laboratory for Parallel Programming | Alexandru Calotoiu | 1
Calotoiu Alexandru Dagstuhl Seminar| 23.10.2017
Requirement Models for Co-Design
Automatic empirical modeling
Input: main() { foo() bar() compute() }
Instrumentation -> Performance measurements -> Model generator
Output: Human-readable performance models of all functions (e.g., t = c1*log(p) + c2)

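As a minimal sketch of what a model generator does for a single function, the following fits the example model t = c1*log(p) + c2 to measurements by ordinary least squares. The function name `fit_log_model`, the base-2 logarithm, and the sample data are assumptions made for illustration; this is not Extra-P's implementation.

```python
import math

def fit_log_model(process_counts, times):
    """Fit t = c1 * log2(p) + c2 by ordinary least squares (base 2 is an assumption)."""
    zs = [math.log2(p) for p in process_counts]
    z_mean = sum(zs) / len(zs)
    t_mean = sum(times) / len(times)
    covariance = sum((z - z_mean) * (t - t_mean) for z, t in zip(zs, times))
    variance = sum((z - z_mean) ** 2 for z in zs)
    c1 = covariance / variance
    c2 = t_mean - c1 * z_mean
    return c1, c2

# Synthetic measurements generated from t = 2.5 * log2(p) + 1.0 (made-up data).
process_counts = [256, 512, 1024, 2048, 4096]
times = [2.5 * math.log2(p) + 1.0 for p in process_counts]
c1, c2 = fit_log_model(process_counts, times)
print(f"t = {c1:.2f}*log2(p) + {c2:.2f}")
```

With real profiles, `times` would hold the measured values of one function at each process count.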
Complexity building blocks
Computation Communication
Samplesort: t(p) ~ p^2
Naïve N-body: t(p) ~ p
FFT: t(p) ~ c
LU: t(p) ~ c

Samplesort: t(p) ~ p^2 ⋅ log2^2(p)
Naïve N-body: t(p) ~ p
FFT: t(p) ~ log2(p)
LU: t(p) ~ c
…

Performance model normal form
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$
$n \in \mathbb{N}$, $i_k \in I$, $j_k \in J$, $I, J \subset \mathbb{Q}$

Creating search spaces
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$
n = 1, I = {0, 1, 2}, J = {0, 1}
Hypotheses:
c1
c1 ⋅ p
c1 ⋅ p^2
c1 ⋅ log(p)
c1 ⋅ p ⋅ log(p)
c1 ⋅ p^2 ⋅ log(p)

Creating search spaces
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$
n = 2, I = {0, 1, 2}, J = {0, 1}
Hypotheses:
c1 + c2 ⋅ p
c1 + c2 ⋅ p^2
c1 + c2 ⋅ log(p)
c1 + c2 ⋅ p ⋅ log(p)
c1 + c2 ⋅ p^2 ⋅ log(p)
c1 ⋅ log(p) + c2 ⋅ p
c1 ⋅ log(p) + c2 ⋅ p ⋅ log(p)
c1 ⋅ log(p) + c2 ⋅ p^2
c1 ⋅ log(p) + c2 ⋅ p^2 ⋅ log(p)
c1 ⋅ p + c2 ⋅ p ⋅ log(p)
c1 ⋅ p + c2 ⋅ p^2
c1 ⋅ p + c2 ⋅ p^2 ⋅ log(p)
c1 ⋅ p ⋅ log(p) + c2 ⋅ p^2
c1 ⋅ p ⋅ log(p) + c2 ⋅ p^2 ⋅ log(p)
c1 ⋅ p^2 + c2 ⋅ p^2 ⋅ log(p)

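The search spaces for n = 1 and n = 2 can be enumerated mechanically. The sketch below assumes that each hypothesis combines n distinct terms drawn from I x J, which reproduces the 6 and 15 hypotheses listed on the slides; the helper names `term_label` and `search_space` are hypothetical.

```python
from itertools import combinations, product

def term_label(i, j):
    """Render p^i * log^j(p) without its coefficient; empty string for the constant term."""
    factors = []
    if i:
        factors.append("p" if i == 1 else f"p^{i}")
    if j:
        factors.append("log(p)" if j == 1 else f"log^{j}(p)")
    return " * ".join(factors)

def search_space(n, exponents_i, exponents_j):
    """All hypotheses with n distinct terms drawn from I x J."""
    terms = list(product(exponents_i, exponents_j))
    hypotheses = []
    for combo in combinations(terms, n):
        parts = []
        for k, (i, j) in enumerate(combo, start=1):
            label = term_label(i, j)
            parts.append(f"c{k} * {label}" if label else f"c{k}")
        hypotheses.append(" + ".join(parts))
    return hypotheses

one_term = search_space(1, [0, 1, 2], [0, 1])
two_terms = search_space(2, [0, 1, 2], [0, 1])
print(len(one_term), len(two_terms))
for hypothesis in two_terms:
    print(hypothesis)
```

The size of the space grows quickly with n and with the sizes of I and J, which is why the choice of these sets matters.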
Case study – HOMME
Core of the Community Atmospheric Model (CAM)
sphere grid
Kernel [3 of 194] | Model [s], t = f(p) | Predictive error [%], pt = 130k
Box_rearrange->MPI_Reduce | 3.63⋅10^-6 ⋅ p ⋅ √p + 7.21⋅10^-13 ⋅ p^3 | 30.34
Vlaplace_sphere_vk | 24.44 + 2.26⋅10^-7 ⋅ p^2 | 4.28
Compute_and_apply_rhs | 49.09 | 0.83
pi ≤ 43k

Case study – HOMME
[Figure: time (s, 0.01 to 10^8) versus processes (2^10 to 2^22) for MPI_Reduce, vlaplace_sphere_wk, and compute_and_apply_rhs, with training and prediction ranges marked]

Multi-parameter performance modeling
Parameters: Process count (p), Problem size (n), Hardware configuration, Algorithm configuration
Metrics: Execution time (t), Floating point, Bytes sent and received
Model: t = f(p) ⋅ g(n) OR t = f(p) + g(n) OR …

Extended performance model normal form
$f(x_1, \ldots, x_m) = \sum_{k=1}^{n} c_k \cdot \prod_{l=1}^{m} x_l^{i_{kl}} \cdot \log_2^{j_{kl}}(x_l)$
$n \in \mathbb{N}$, $m \in \mathbb{N}$, $i_{kl} \in I$, $j_{kl} \in J$, $I, J \subset \mathbb{Q}$
Possible parameter interactions:
c1
c1 + c2 ⋅ x1
c1 + c2 ⋅ x1 + c3 ⋅ x2 + c4 ⋅ x3
c1 + c2 ⋅ x1 ⋅ x2 ⋅ x3
c1 + c2 ⋅ x1 ⋅ x3 + c3 ⋅ x2
1 ⋅ x2 ⋅log2(x2)

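To make the extended normal form concrete, here is a small evaluator, written as a sketch. The representation of a model as a list of `(coefficient, [(i, j), ...])` pairs, one exponent pair per parameter, is an assumption chosen for illustration.

```python
import math

def evaluate_model(terms, xs):
    """Evaluate sum_k c_k * prod_l x_l^i_kl * log2(x_l)^j_kl.

    terms: list of (c_k, [(i_k1, j_k1), ..., (i_km, j_km)])
    xs:    parameter values (x_1, ..., x_m), each greater than 0
    """
    total = 0.0
    for coefficient, exponents in terms:
        value = coefficient
        for x, (i, j) in zip(xs, exponents):
            value *= x ** i * math.log2(x) ** j
        total += value
    return total

# c1 + c2 * x1 * x3 + c3 * x2 with made-up coefficients c1 = 1, c2 = 2, c3 = 3.
model = [
    (1.0, [(0, 0), (0, 0), (0, 0)]),
    (2.0, [(1, 0), (0, 0), (1, 0)]),
    (3.0, [(0, 0), (1, 0), (0, 0)]),
]
print(evaluate_model(model, (2, 4, 8)))
```

Each listed parameter interaction corresponds to one choice of the exponent pairs, with (0, 0) meaning that a parameter does not appear in a term.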
Requirements engineering
Milc, HOMME, Sweep3d, BLAST, Kripke, Clover Leaf, Lulesh, Re-learn, OpenFoam

Requirements engineering – a per-process view
Network bandwidth
Computational performance
Memory bandwidth
Memory capacity
Network

Application requirements
Lulesh
Requirement | Metric | Model
Computation | #FLOPs | 10^5 ⋅ n ⋅ log(n) ⋅ p^0.25 ⋅ log(p)
Communication | #Bytes sent & received | 10^3 ⋅ n ⋅ p^0.25 ⋅ log(p)
Memory access | #Loads & stores | 10^5 ⋅ n ⋅ log(n) ⋅ log(p)
Memory footprint | #Bytes used | 10^5 ⋅ n ⋅ log(n)
Models represent per process effects
p – Number of processes
n – Problem size per process

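The four Lulesh requirement models can be written down directly as functions of n and p. This is a sketch: the function name `lulesh_requirements` and the dictionary keys are hypothetical, and the base-2 logarithm is an assumption, since the slide writes only log.

```python
import math

def lulesh_requirements(n, p):
    """Per-process requirement models for Lulesh (log taken as base 2 by assumption)."""
    log_n = math.log2(n)
    log_p = math.log2(p)
    return {
        "flops": 1e5 * n * log_n * p ** 0.25 * log_p,
        "bytes_sent_received": 1e3 * n * p ** 0.25 * log_p,
        "loads_stores": 1e5 * n * log_n * log_p,
        "bytes_used": 1e5 * n * log_n,
    }

requirements = lulesh_requirements(n=1024, p=16)
for metric, value in requirements.items():
    print(f"{metric}: {value:.3e}")
```

Only the memory footprint is independent of p; the other three requirements grow per process as more processes are added.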
Co-design using performance models
Lulesh
Which is the best investment?
Double the memory
Double the processors
Double the racks

Co-design using performance models
Double the racks
I. # Processes: p' = 2 ⋅ p; memory per process: m' = m
II. Memory requirement: m' = m = 10^5 ⋅ n' ⋅ log(n'); problem size per process: n' = n
III. Overall problem size: n' ⋅ p' = 2 ⋅ n ⋅ p

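Steps I to III can be sketched in code: scale the process count and the memory per process, invert the memory footprint model to obtain the new problem size per process, and compare overall problem sizes. The function names, the bisection bounds, and the base-2 logarithm are assumptions made for this illustration.

```python
import math

def memory_footprint(n):
    """Lulesh memory model from the slides: 10^5 * n * log(n), log taken as base 2."""
    return 1e5 * n * math.log2(n)

def problem_size_for_memory(m, lo=2.0, hi=1e12, iterations=200):
    """Invert the memory model by bisection (the footprint is increasing for n >= 2)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if memory_footprint(mid) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def upgrade(n, p, p_factor, m_factor):
    """Return (n', p', overall problem size ratio) after scaling p and m."""
    m_new = m_factor * memory_footprint(n)
    n_new = problem_size_for_memory(m_new)
    p_new = p_factor * p
    return n_new, p_new, (n_new * p_new) / (n * p)

racks = upgrade(n=1000.0, p=64, p_factor=2, m_factor=1)    # double the racks
memory = upgrade(n=1000.0, p=64, p_factor=1, m_factor=2)   # double the memory
print("double the racks: ", racks)
print("double the memory:", memory)
```

Doubling the racks leaves n' = n and doubles the overall problem size. Under this model, doubling the memory increases the overall problem size by less than a factor of two, because the footprint grows faster than linearly in n.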
Visual representation of requirements
[Figure: Communication, Computation, Problem size, and Memory access compared for three options: p' = p, m' = 2 ⋅ m; p' = 2 ⋅ p, m' = m; p' = 2 ⋅ p, m' = m / 2]

Co-design using performance models
Double the racks
IV. # FLOPS: 10^5 ⋅ n ⋅ log(n) ⋅ (2p)^0.25 ⋅ log(2p); ratio new to old: 2^0.25 ⋅ (1 + 1/log(p))
V. #Bytes sent & received: 10^3 ⋅ n ⋅ (2p)^0.25 ⋅ log(2p); ratio new to old: 2^0.25 ⋅ (1 + 1/log(p))
VI. #Loads & stores: 10^5 ⋅ n ⋅ log(n) ⋅ log(2p); ratio new to old: 1 + 1/log(p)

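The ratios in steps IV to VI follow from evaluating each model at 2p and at p. Note that log(2p)/log(p) equals 1 + 1/log(p) only for base-2 logarithms, so the sketch below uses log2. The function names and the sample values of n and p are assumptions.

```python
import math

def flops(n, p):
    return 1e5 * n * math.log2(n) * p ** 0.25 * math.log2(p)

def bytes_sent_received(n, p):
    return 1e3 * n * p ** 0.25 * math.log2(p)

def loads_stores(n, p):
    return 1e5 * n * math.log2(n) * math.log2(p)

def ratio_new_to_old(model, n, p):
    """Requirement after doubling the racks (p' = 2p, n' = n) relative to before."""
    return model(n, 2 * p) / model(n, p)

n, p = 1000, 1024
log_factor = 1 + 1 / math.log2(p)          # closed form for log2(2p) / log2(p)
for model in (flops, bytes_sent_received, loads_stores):
    print(model.__name__, ratio_new_to_old(model, n, p))
```

The printed ratios can be compared against the closed forms 2^0.25 ⋅ (1 + 1/log(p)) and 1 + 1/log(p); both factors approach a constant as p grows.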
An accessible empirical performance modeling solution
A method for requirements engineering in hardware-software co-design
Users can model entire system workloads to
application requirements

Extra-P
Open source: http://www.scalasca.org/software/extra-p/download.html

Outlook

Requirements modeling
Program: Computation (FLOPS, Load, Store), Communication (P2P, Collective), …, Time
Disagreement may be indicative of wait states

Pitfalls
Intuition is not enough
2.95*log
2 p+0.0871* p
12.06* p
… and modeling is seldom better than our intuition.

Performance measurements
Our experience shows that at least 5 different measurements are required
Each measurement is repeated multiple times
Performance measurements (profiles): p1 = 256, p2 = 512, p3 = 1024, p4 = 2048, p5 = 4096
Noisy results
Variance too big

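A simple pre-check of the measurement set can catch both pitfalls before modeling: too few measurement points and repetitions that vary too much. This is a sketch only; the function name `check_measurements` and the 10% coefficient-of-variation threshold are assumptions, not values from the slides.

```python
import statistics

def check_measurements(runs, min_points=5, max_cv=0.1):
    """runs maps a process count to the list of repeated timings at that count."""
    problems = []
    if len(runs) < min_points:
        problems.append(f"only {len(runs)} measurement points, need at least {min_points}")
    for p, timings in sorted(runs.items()):
        if len(timings) < 2:
            problems.append(f"p = {p}: not repeated")
            continue
        # Coefficient of variation: sample standard deviation relative to the mean.
        cv = statistics.stdev(timings) / statistics.mean(timings)
        if cv > max_cv:
            problems.append(f"p = {p}: variance too big (cv = {cv:.2f})")
    return problems

# Made-up timings: four stable points and one noisy point.
runs = {
    256: [1.00, 1.01, 0.99],
    512: [1.20, 1.21, 1.19],
    1024: [1.40, 1.42, 1.39],
    2048: [1.60, 1.61, 1.59],
    4096: [5.0, 9.0, 2.0],
}
for problem in check_measurements(runs):
    print(problem)
```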
Sweep3D – Neutron transport simulation
LogGP model for communication developed by Hoisie et al.
$t_{comm} = [2(p_x + p_y - 2) + 4(n_{sweep} - 1)] \cdot t_{msg}$
$t_{comm} = c \cdot \sqrt{p}$
Kernel [2 of 40] | Model [s], t = f(p) | Predictive error [%], pt = 262k
sweep → MPI_Recv | 4.03 ⋅ √p | 5.10
sweep | 582.19 | 0.01
#bytes const., #msg const.
pi ≤ 8k

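To see why the LogGP expression reduces to a multiple of √p, the sketch below evaluates it for a square process grid with px = py = √p. The square grid, the value n_sweep = 8, and the unit message time are assumptions chosen only for illustration.

```python
import math

def t_comm(p, n_sweep, t_msg):
    """LogGP communication time, assuming a square px x py process grid."""
    px = py = math.isqrt(p)
    if px * py != p:
        raise ValueError("this sketch assumes p is a perfect square")
    return (2 * (px + py - 2) + 4 * (n_sweep - 1)) * t_msg

# With n_sweep and t_msg fixed, t_comm / sqrt(p) approaches the constant 4 * t_msg.
for p in (64, 1024, 16384, 262144):
    print(p, t_comm(p, n_sweep=8, t_msg=1.0) / math.sqrt(p))
```

For a square grid the bracket becomes 4√p plus a constant, so the constant's share shrinks as p grows and the model behaves like c ⋅ √p.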
Sweep3D – Neutron transport simulation
Model: 4.03 ⋅ √p
[Figure: time (s) and relative error (%) versus processes (2^6 to 2^18); series: Model, Data, Prediction, Relative error; training and prediction ranges marked]

Milc
with performance model manually created by Hoefler et al.
small logarithmic term caused by global convergence checks
Kernel [3 of 479] | Model [s], t = f(p) | Predictive error [%], pt = 64k
compute_gen_staple_field | 2.40⋅10^-2 | 0.43
g_vecdoublesum → MPI_Allreduce | 6.30⋅10^-6 ⋅ log2^2(p) | 0.01
mult_adj_su3_fieldlink_lathwec | 3.80⋅10^-3 | 0.04
pi ≤ 16k

UG4
equations (~500,000 lines of C++ code, 2,000 kernels)
(CG) method depends on the mesh size
Kernel | Model (time [s])
CG | 0.227 + 0.31 * p^0.5
MGM | 0.219 + 0.0006 * log2(p)

XNS
across processes as metric
Kernel | Runtime [%], p = 128 | Runtime [%], p = 4096 | Model [s], t = f(p)
ewdgennprm->MPI_Recv | 0.46 | 51.46 | 0.029 ⋅ p^2
ewddot | 44.78 | 5.04 | 37406.80 + 13.29 ⋅ p ⋅ log(p)
#bytes = ~p, #msg = ~p

MPI
Platform | Juqueen | Juropa | Piz Daint

MPI memory [MB], expectation: O(log p)
Model | O(log p) | O(p) | O(log p)
R^2 | 0.72 | 1 | 0.23
Divergence | O(1) | O(p / log p) | O(1)
Match | ✔ | ✘ | ✔

Comm_create [B], expectation: O(p)
Model | O(p) | O(p) | O(p)
R^2 | 1 | 1 | 0.99
Divergence | O(1) | O(1) | O(1)
Match | ✔ | ✔ | ✔

Win_create [B], expectation: O(p)
Model | O(p) | O(p) | O(p)
R^2 | 1 | 1 | 0.99
Divergence | O(1) | O(1) | O(1)
Match | ✔ | ✔ | ✔

Barrier [s], expectation: O(log p)
Model | O(log p) | O(p^0.67 log p) | O(p^0.33)
R^2 | 0.99 | 0.99 | 0.99
Divergence | O(1) | O(p^0.67) | O(p^0.33 / log p)
Match | ✔ | ✘ | ~

Bcast [s], expectation: O(log p)
Model | O(log p) | O(p^0.5) | O(p^0.5)
R^2 | 0.86 | 0.98 | 0.94
Divergence | O(1) | O(p^0.5 / log p) | O(p^0.5 / log p)
Match | ✔ | ~ | ~

Reduce [s], expectation: O(log p)
Model | O(log p) | O(p^0.5 log p) | O(p^0.5 log p)
R^2 | 0.93 | 0.99 | 0.94
Divergence | O(1) | O(p^0.5) | O(p^0.5)
Match | ✔ | ~ | ~

MPI (2)
Platform | Juqueen | Juropa | Piz Daint

Allreduce [s], expectation: O(log p)
Model | O(log p) | O(p^0.5) | O(p^0.67 log p)
R^2 | 0.87 | 0.99 | 0.99
Divergence | O(1) | O(p^0.5 / log p) | O(p^0.67)
Match | ✔ | ~ | ✘

Comm_dup [B], expectation: O(1)
Model | 2.2e5 | 256 | 3770 + 18p
R^2 | 1 | 1 | 0.99
Divergence | O(1) | O(1) | O(p)
Match | ✔ | ✔ | ✘

MPI (3)
3 different models

Modified golden section search
rate (derivative at observation with largest parameter value)
the best fit
steps
n = 1, m = 1, I = {0/4, 1/4, ..., 12/4}, J = {0}
Hypotheses: {x^0, x^0.25, ..., x^3}
[Figure: fit error over the hypotheses x^0 to x^3, with probe points P1 to P5 at hypotheses including x^0.5, x^1, and x^1.75; the search interval narrows from step to step]

Modified golden section search – What is the reduction?
is eliminated in each step we need 23 steps to identify the best model.
4 models must be computed.
n = 3, m = 3, I = {0/4, 1/4, ..., 12/4}, J = {0, 1, 2}
26 models evaluated for the best model for a single parameter 10 models evaluated using exhaustive search

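The idea of narrowing the exponent range instead of fitting every hypothesis can be sketched with a plain golden-section-style search over the discrete exponent set I = {0/4, ..., 12/4}. This is not necessarily the modified variant from the slides: it assumes the fit error is unimodal in the exponent, fits single-term models c ⋅ x^i only, and uses hypothetical function names and synthetic data.

```python
import math

def fit_error(exponent, xs, ys):
    """Residual sum of squares of the least-squares fit y = c * x**exponent."""
    basis = [x ** exponent for x in xs]
    c = sum(b * y for b, y in zip(basis, ys)) / sum(b * b for b in basis)
    return sum((y - c * b) ** 2 for b, y in zip(basis, ys))

def golden_section_exponent(xs, ys, exponents):
    """Return (best exponent, number of fitted models) for a sorted exponent list."""
    inv_phi = (math.sqrt(5) - 1) / 2
    cache = {}

    def err(index):
        if index not in cache:
            cache[index] = fit_error(exponents[index], xs, ys)
        return cache[index]

    lo, hi = 0, len(exponents) - 1
    while hi - lo > 2:
        step = round(inv_phi * (hi - lo))
        left, right = hi - step, lo + step
        if left == right:          # keep two distinct probe points
            left -= 1
        if err(left) <= err(right):
            hi = right             # minimum lies in [lo, right]
        else:
            lo = left              # minimum lies in [left, hi]
    best = min(range(lo, hi + 1), key=err)
    return exponents[best], len(cache)

exponents = [k / 4 for k in range(13)]
xs = [256, 512, 1024, 2048, 4096]
ys = [3.0 * x ** 1.75 for x in xs]          # synthetic data following x^1.75
best, fitted = golden_section_exponent(xs, ys, exponents)
print(best, fitted, "of", len(exponents))
```

Caching the fitted hypotheses matters here, since probe points are often reused from one step to the next.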
Cache Modeling
A A B B C
Threadspotter

Parameter space sampling
We have (p1, t1); (p2, t2). How to choose p3?
Create f1 and f2 out of the points we have and then find p3!
$\frac{f_1(p_3) - f_2(p_3)}{Cost(p_3, f_1(p_3))} > Threshold$

Weak vs. strong scaling
under strong scaling
| Weak scaling | Strong scaling
Invariant | Problem size per process | Overall problem size
Model target | Wall-clock time | Accumulated time
Reduction | Maximum / average | Sum
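The reduction row can be illustrated with per-process timings of one kernel at a single process count. This is a minimal sketch; the function names and the sample timings are made up.

```python
def weak_scaling_target(per_process_times, reduction="max"):
    """Weak scaling: model wall-clock time, reduced by maximum or average."""
    if reduction == "max":
        return max(per_process_times)
    return sum(per_process_times) / len(per_process_times)

def strong_scaling_target(per_process_times):
    """Strong scaling: model accumulated time, the sum over all processes."""
    return sum(per_process_times)

per_process_times = [1.0, 1.5, 2.0, 1.5]    # seconds on four processes
print(weak_scaling_target(per_process_times))
print(weak_scaling_target(per_process_times, reduction="average"))
print(strong_scaling_target(per_process_times))
```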