Requirement Models for Co-Design Calotoiu Alexandru Dagstuhl - - PowerPoint PPT Presentation

SLIDE 1

23.10.17 | Department of Computer Science | Laboratory for Parallel Programming | Alexandru Calotoiu | 1

Calotoiu Alexandru | Dagstuhl Seminar | 23.10.2017

Requirement Models for Co-Design

SLIDE 2

Automatic empirical modeling

  • Program: main() { foo() bar() compute() }
  • Instrumentation
  • Input: performance measurements (Mi, Mj in the diagram)
  • Model generator
  • Output: human-readable performance models of all functions (e.g., t = c1*log(p) + c2)
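As an illustration of what the model generator produces, the following minimal sketch fits the example model t = c1*log(p) + c2 to measurements by ordinary least squares. It assumes a base-2 logarithm and uses synthetic, hypothetical measurements; the function name `fit_log_model` is invented for this example and is not part of any tool named in the slides.

```python
import math

def fit_log_model(ps, ts):
    """Least-squares fit of t = c1*log2(p) + c2; returns (c1, c2)."""
    xs = [math.log2(p) for p in ps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    # Closed-form simple linear regression on the transformed axis log2(p).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    c1 = sxt / sxx
    c2 = mean_t - c1 * mean_x
    return c1, c2

# Hypothetical measurements generated from t = 0.5*log2(p) + 2.0
process_counts = [64, 128, 256, 512, 1024]
times = [0.5 * math.log2(p) + 2.0 for p in process_counts]

c1, c2 = fit_log_model(process_counts, times)
print(f"t = {c1:.3f}*log2(p) + {c2:.3f}")
```

With real, noisy measurements the recovered coefficients would only approximate the underlying behavior.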

SLIDE 3

Complexity building blocks

Kernel | Computation | Communication
Samplesort | $t(p) \sim p^2$ | $t(p) \sim p^2 \log_2^2(p)$
Naïve N-body | $t(p) \sim p$ | $t(p) \sim p$
FFT | $t(p) \sim c$ | $t(p) \sim \log_2(p)$
LU | $t(p) \sim c$ | $t(p) \sim c$
… | … | …

SLIDE 4

Performance model normal form

$$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$$

$n \in \mathbb{N}$, $i_k \in I$, $j_k \in J$, $I, J \subset \mathbb{Q}$

SLIDE 5

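Creating search spaces

$n = 1$, $I = \{0, 1, 2\}$, $J = \{0, 1\}$

$$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$$

Resulting hypotheses:

  • $c_1$
  • $c_1 \cdot p$
  • $c_1 \cdot p^2$
  • $c_1 \cdot \log(p)$
  • $c_1 \cdot p \cdot \log(p)$
  • $c_1 \cdot p^2 \cdot \log(p)$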

SLIDE 6

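Creating search spaces

$n = 2$, $I = \{0, 1, 2\}$, $J = \{0, 1\}$

$$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$$

Resulting hypotheses:

  • $c_1 + c_2 \cdot p$
  • $c_1 + c_2 \cdot p^2$
  • $c_1 + c_2 \cdot \log(p)$
  • $c_1 + c_2 \cdot p \cdot \log(p)$
  • $c_1 + c_2 \cdot p^2 \cdot \log(p)$
  • $c_1 \cdot \log(p) + c_2 \cdot p^2 \cdot \log(p)$
  • $c_1 \cdot p + c_2 \cdot p \cdot \log(p)$
  • $c_1 \cdot p + c_2 \cdot p^2$
  • $c_1 \cdot p + c_2 \cdot p^2 \cdot \log(p)$
  • $c_1 \cdot p \cdot \log(p) + c_2 \cdot p^2$
  • $c_1 \cdot p \cdot \log(p) + c_2 \cdot p^2 \cdot \log(p)$
  • $c_1 \cdot p^2 + c_2 \cdot p^2 \cdot \log(p)$
  • $c_1 \cdot \log(p) + c_2 \cdot p$
  • $c_1 \cdot \log(p) + c_2 \cdot p \cdot \log(p)$
  • $c_1 \cdot \log(p) + c_2 \cdot p^2$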
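The search spaces on the two preceding slides can be enumerated mechanically. The sketch below is one possible way to do it, assuming each hypothesis is a combination of n distinct terms drawn from I × J; the helper names are hypothetical and chosen only for this illustration.

```python
from itertools import combinations

def term_label(i, j):
    """Readable label for the term p^i * log^j(p)."""
    parts = []
    if i == 1:
        parts.append("p")
    elif i != 0:
        parts.append(f"p^{i}")
    if j == 1:
        parts.append("log(p)")
    elif j != 0:
        parts.append(f"log^{j}(p)")
    return "*".join(parts) if parts else "1"

def search_space(n, I, J):
    """All hypotheses with n distinct terms, each term an (i, j) exponent pair."""
    terms = [(i, j) for j in J for i in I]
    return list(combinations(terms, n))

def hypothesis_label(hypothesis):
    """Readable label such as 'c1 + c2*p' for one hypothesis."""
    pieces = []
    for k, (i, j) in enumerate(hypothesis, start=1):
        label = term_label(i, j)
        pieces.append(f"c{k}" if label == "1" else f"c{k}*{label}")
    return " + ".join(pieces)

single = search_space(1, [0, 1, 2], [0, 1])
pairs = search_space(2, [0, 1, 2], [0, 1])

print(len(single), "hypotheses for n = 1")
print(len(pairs), "hypotheses for n = 2")
for hypothesis in pairs:
    print(hypothesis_label(hypothesis))
```

The counts grow combinatorially with n and with the sizes of I and J, which is why an exhaustive search quickly becomes expensive.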

SLIDE 7

Case study – HOMME

Core of the Community Atmospheric Model (CAM)

  • Spectral element dynamical core on a cubed sphere grid

$p_i \le 43\text{k}$

Kernel [3 of 194] | Model [s], t = f(p) | Predictive error [%], $p_t = 130\text{k}$
Box_rearrange->MPI_Reduce | $3.63 \cdot 10^{-6} \cdot p \cdot \sqrt{p} + 7.21 \cdot 10^{-13} \cdot p^3$ | 30.34
Vlaplace_sphere_wk | $24.44 + 2.26 \cdot 10^{-7} \cdot p^2$ | 4.28
Compute_and_apply_rhs | 49.09 | 0.83

SLIDE 8

Case study – HOMME

[Figure: time (s), from 0.01 to $10^8$, versus number of processes, from $2^{10}$ to $2^{22}$, for MPI_Reduce, vlaplace_sphere_wk, and compute_and_apply_rhs; the plot is divided into a training range and a prediction range.]

SLIDE 9

Multi-parameter performance modeling

Parameters:

  • Process count
  • Problem size
  • Hardware configuration
  • Algorithm configuration

Metrics:

  • Execution time
  • Floating point operations
  • Bytes sent and received

SLIDE 10

Multi-parameter performance modeling

Parameters:

  • Process count (p)
  • Problem size (n)
  • Hardware configuration
  • Algorithm configuration

Metrics:

  • Execution time (t)
  • Floating point operations
  • Bytes sent and received

Model: $t = f(p) \cdot g(n)$ OR $t = f(p) + g(n)$ OR …

SLIDE 11

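Extended performance model normal form

$$f(x_1, \ldots, x_m) = \sum_{k=1}^{n} c_k \prod_{l=1}^{m} x_l^{i_{k_l}} \cdot \log_2^{j_{k_l}}(x_l)$$

$n \in \mathbb{N}$, $m \in \mathbb{N}$, $i_{k_l} \in I$, $j_{k_l} \in J$, $I, J \subset \mathbb{Q}$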
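To make the extended normal form concrete, the sketch below evaluates a model given as a list of terms, each with a coefficient and one (i, j) exponent pair per parameter. The data layout, the function name `evaluate_model`, and the coefficient values are assumptions made for this illustration only.

```python
import math

def evaluate_model(terms, xs):
    """Evaluate sum_k c_k * prod_l x_l**i_kl * log2(x_l)**j_kl.

    terms: list of (c_k, [(i_k1, j_k1), ..., (i_km, j_km)])
    xs:    parameter values [x_1, ..., x_m]
    """
    total = 0.0
    for coeff, factors in terms:
        value = coeff
        for x, (i, j) in zip(xs, factors):
            value *= x ** i * math.log2(x) ** j
        total += value
    return total

# Hypothetical two-parameter model: 2 + 0.5 * x1 * x2 * log2(x2)
model = [
    (2.0, [(0, 0), (0, 0)]),  # constant term
    (0.5, [(1, 0), (1, 1)]),  # multiplicative interaction of x1 and x2
]

result = evaluate_model(model, [4.0, 8.0])
print(result)
```

A term whose exponent pairs are all (0, 0) reduces to its constant coefficient, so the single-parameter normal form is the special case m = 1.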

SLIDE 12

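Extended performance model normal form

$$f(x_1, \ldots, x_m) = \sum_{k=1}^{n} c_k \prod_{l=1}^{m} x_l^{i_{k_l}} \cdot \log_2^{j_{k_l}}(x_l)$$

$n \in \mathbb{N}$, $m \in \mathbb{N}$, $i_{k_l} \in I$, $j_{k_l} \in J$, $I, J \subset \mathbb{Q}$

Possible parameter interactions

  • Constant: $c_1$
  • Single parameter: $c_1 + c_2 \cdot x_1$
  • Additive: $c_1 + c_2 \cdot x_1 + c_3 \cdot x_2 + c_4 \cdot x_3$
  • Multiplicative: $c_1 + c_2 \cdot x_1 \cdot x_2 \cdot x_3$
  • Several options: $c_1 + c_2 \cdot x_1 \cdot x_3 + c_3 \cdot x_1^2 \cdot x_2 \cdot \log_2(x_2)$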

SLIDE 13

Requirements engineering

Applications: Milc, HOMME, Sweep3d, BLAST, Kripke, Clover Leaf, Lulesh, Re-learn, OpenFoam

SLIDE 14

Requirements engineering – a per-process view

  • Network bandwidth
  • Computational performance
  • Memory bandwidth
  • Memory capacity

Network

SLIDE 17

Application requirements

Lulesh

Requirement | Metric | Model
Computation | #FLOPs | $10^5 \cdot n \cdot \log(n) \cdot p^{0.25} \cdot \log(p)$
Communication | #Bytes sent & received | $10^3 \cdot n \cdot p^{0.25} \cdot \log(p)$
Memory access | #Loads & stores | $10^5 \cdot n \cdot \log(n) \cdot \log(p)$
Memory footprint | #Bytes used | $10^5 \cdot n \cdot \log(n)$

Models represent per-process effects.

p – Number of processes
n – Problem size per process

SLIDE 18

Co-design using performance models

Lulesh

Which is the best investment?

SLIDE 19

Co-design using performance models

SLIDE 20

Co-design using performance models

Double the memory

SLIDE 21

Co-design using performance models

Double the processors

SLIDE 22

Co-design using performance models

Double the racks

SLIDE 23

Co-design using performance models

Double the racks

I. # Processes: $p' = 2 \cdot p$; memory per process: $m' = m$

SLIDE 24

Co-design using performance models

Double the racks

I. # Processes: $p' = 2 \cdot p$; memory per process: $m' = m$
II. Memory requirement: $m' = m = 10^5 \cdot n' \cdot \log(n')$; problem size per process: $n' = n$

SLIDE 25

Co-design using performance models

Double the racks

I. # Processes: $p' = 2 \cdot p$; memory per process: $m' = m$
II. Memory requirement: $m' = m = 10^5 \cdot n' \cdot \log(n')$; problem size per process: $n' = n$
III. Overall problem size: $n' \cdot p' = 2 \cdot n \cdot p$

SLIDE 26

Co-design using performance models

Double the racks

IV. # FLOPS: $10^5 \cdot n \cdot \log(n) \cdot (2p)^{0.25} \cdot \log(2p)$; ratio new to old: $2^{0.25} \cdot (1 + 1/\log(p))$

SLIDE 27

Co-design using performance models

Double the racks

IV. # FLOPS: $10^5 \cdot n \cdot \log(n) \cdot (2p)^{0.25} \cdot \log(2p)$; ratio new to old: $2^{0.25} \cdot (1 + 1/\log(p))$
V. #Bytes sent & received: $10^3 \cdot n \cdot (2p)^{0.25} \cdot \log(2p)$; ratio new to old: $2^{0.25} \cdot (1 + 1/\log(p))$

SLIDE 28

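Visual representation of requirements

[Figure: requirements along the axes Communication, Computation, Problem size, and Memory access for three configurations.]

  • $p' = p$, $m' = 2 \cdot m$
  • $p' = 2 \cdot p$, $m' = m$
  • $p' = 2 \cdot p$, $m' = m / 2$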

SLIDE 29

Co-design using performance models

Double the racks

IV. # FLOPS: $10^5 \cdot n \cdot \log(n) \cdot (2p)^{0.25} \cdot \log(2p)$; ratio new to old: $2^{0.25} \cdot (1 + 1/\log(p))$
V. #Bytes sent & received: $10^3 \cdot n \cdot (2p)^{0.25} \cdot \log(2p)$; ratio new to old: $2^{0.25} \cdot (1 + 1/\log(p))$
VI. #Loads & stores: $10^5 \cdot n \cdot \log(n) \cdot \log(2p)$; ratio new to old: $1 + 1/\log(p)$
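The new-to-old ratios for doubling the racks can be checked numerically. The sketch below assumes base-2 logarithms, which is the base under which log(2p) = log(p) + 1 and the closed-form ratios hold as written; the values of n and p are hypothetical.

```python
import math

def flops(n, p):
    """Computation requirement: 10^5 * n * log(n) * p^0.25 * log(p)."""
    return 1e5 * n * math.log2(n) * p ** 0.25 * math.log2(p)

def bytes_sent_received(n, p):
    """Communication requirement: 10^3 * n * p^0.25 * log(p)."""
    return 1e3 * n * p ** 0.25 * math.log2(p)

def loads_stores(n, p):
    """Memory access requirement: 10^5 * n * log(n) * log(p)."""
    return 1e5 * n * math.log2(n) * math.log2(p)

def ratio_after_doubling_racks(model, n, p):
    """p' = 2p and n' = n (memory per process is unchanged)."""
    return model(n, 2 * p) / model(n, p)

n, p = 1000, 1024  # hypothetical problem size per process and process count

flops_ratio = ratio_after_doubling_racks(flops, n, p)
bytes_ratio = ratio_after_doubling_racks(bytes_sent_received, n, p)
loads_ratio = ratio_after_doubling_racks(loads_stores, n, p)

closed_form = 2 ** 0.25 * (1 + 1 / math.log2(p))
print(flops_ratio, bytes_ratio, closed_form)
print(loads_ratio, 1 + 1 / math.log2(p))
```

Because n' = n, every factor that depends only on n cancels in the ratios, leaving only the dependence on p.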

SLIDE 30

An accessible empirical performance modeling solution

A method for requirements engineering in hardware-software co-design

Users can model entire system workloads to

  • Adapt applications to existing hardware
  • Inform purchasing decisions with modeled application requirements
  • Replace back-of-the-envelope (BOE) requirements approximations
SLIDE 31

Extra-P

Open source: http://www.scalasca.org/software/extra-p/download.html

SLIDE 33

Outlook

SLIDE 34

Requirements modeling

Program

  • Computation: FLOPS, Load, Store
  • Communication: P2P, Collective
  • …
  • Time

Disagreement may be indicative of wait states

SLIDE 35

Pitfalls

Intuition is not enough

  • $2.95 \cdot \log_2(p) + 0.0871 \cdot p$
  • $12.06 \cdot p$

… and modeling is seldom better than our intuition.

SLIDE 36

Performance measurements

In our experience, at least 5 different measurements are required.

Performance measurements (profiles): p1 = 256, p2 = 512, p3 = 1024, p4 = 2048, p5 = 4096

SLIDE 37

Performance measurements

In our experience, at least 5 different measurements are required, and each measurement is repeated multiple times.

Performance measurements (profiles), each repeated: p1 = 256, p2 = 512, p3 = 1024, p4 = 2048, p5 = 4096

  • Noisy results
  • Variance too big
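One simple way to detect measurement points whose repetitions are too noisy is to look at the coefficient of variation per process count. The sketch below is an illustration only: the 5% threshold, the function names, and the timing values are assumptions, not taken from the slides.

```python
import statistics

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

def noisy_points(measurements, max_cv=0.05):
    """Process counts whose repeated measurements vary more than max_cv."""
    return [
        p
        for p, samples in sorted(measurements.items())
        if coefficient_of_variation(samples) > max_cv
    ]

# Hypothetical repeated timings (seconds) for the five process counts
measurements = {
    256: [10.0, 10.1, 9.9],
    512: [12.0, 12.1, 11.9],
    1024: [14.0, 18.0, 11.0],  # deliberately noisy
    2048: [16.0, 16.2, 15.8],
    4096: [18.0, 18.1, 17.9],
}

flagged = noisy_points(measurements)
print("Repeat these measurements:", flagged)
```

Points that are flagged could be re-measured or given more repetitions before a model is fitted.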

SLIDE 38

Sweep3D – Neutron transport simulation

LogGP model for communication developed by Hoisie et al.

$$t_{comm} = [2(p_x + p_y - 2) + 4(n_{sweep} - 1)] \cdot t_{msg}$$

$$t_{comm} = c \cdot \sqrt{p}$$

$p_i \le 8\text{k}$

Kernel [2 of 40] | Model [s], t = f(p) | Predictive error [%], $p_t = 262\text{k}$
sweep → MPI_Recv | $4.03 \cdot \sqrt{p}$ | 5.10
sweep | 582.19 | 0.01

#bytes ~ const., #msg ~ const.
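The step from the LogGP expression to a square-root dependence on p can be illustrated numerically. The sketch below assumes a square process grid with p_x = p_y = sqrt(p); the values of n_sweep and t_msg are hypothetical and were chosen so that the expression reduces exactly to 4 * sqrt(p) * t_msg.

```python
import math

def t_comm(px, py, n_sweep, t_msg):
    """LogGP communication time: [2(px + py - 2) + 4(n_sweep - 1)] * t_msg."""
    return (2 * (px + py - 2) + 4 * (n_sweep - 1)) * t_msg

def t_comm_square_grid(p, n_sweep=2, t_msg=1e-5):
    """Communication time on a sqrt(p) x sqrt(p) grid (p must be a perfect square)."""
    side = math.isqrt(p)
    assert side * side == p, "p must be a perfect square"
    return t_comm(side, side, n_sweep, t_msg)

p = 2 ** 16
small = t_comm_square_grid(p)
large = t_comm_square_grid(4 * p)

# Quadrupling p doubles sqrt(p), so the communication time doubles as well.
growth = large / small
print(small, large, growth)
```

For other values of n_sweep the constant offset changes, but the part that depends on p still grows with sqrt(p).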

SLIDE 39

Sweep3D – Neutron transport simulation

Model: $4.03 \cdot \sqrt{p}$

[Figure: time (s), from 500 to 2,500, and relative error (%), from 20 to 100, versus number of processes, from $2^6$ to $2^{18}$; series: model, data, prediction, relative error; the plot is divided into a training range and a prediction range.]

SLIDE 40

Milc

  • MILC/su3_rmd – from MILC suite of QCD codes with performance model manually created by Hoefler et al.
  • Time per process should remain constant except for a rather small logarithmic term caused by global convergence checks

$p_i \le 16\text{k}$

Kernel [3 of 479] | Model [s], t = f(p) | Predictive error [%], $p_t = 64\text{k}$
compute_gen_staple_field | $2.40 \cdot 10^{-2}$ | 0.43
g_vecdoublesum → MPI_Allreduce | $6.30 \cdot 10^{-6} \cdot \log_2^2(p)$ | 0.01
mult_adj_su3_fieldlink_lathwec | $3.80 \cdot 10^{-3}$ | 0.04

SLIDE 41

UG4

  • Numerical framework for grid-based solution of partial differential equations (~500,000 lines of C++ code, 2,000 kernels)
  • Application: drug diffusion through the human skin
  • In general, all kernels scale well
  • Multigrid solver kernel (MGM) scales logarithmically
  • Number of iterations needed by the unpreconditioned conjugate gradient (CG) method depends on the mesh size
  • Increases by factor of two with each refinement
  • Will therefore suffer from iteration count increase in weak scaling

Kernel | Model (time [s])
CG | $0.227 + 0.31 \cdot p^{0.5}$
MGM | $0.219 + 0.0006 \cdot \log_2(p)$

SLIDE 42

XNS

  • Finite element flow simulation
  • Strong scaling analysis using accumulated time across processes as metric

Kernel | Runtime [%], p = 128 | Runtime [%], p = 4096 | Model [s], t = f(p)
ewdgennprm->MPI_Recv | 0.46 | 51.46 | $0.029 \cdot p^2$
ewddot | 44.78 | 5.04 | $37406.80 + 13.29 \cdot p \cdot \log(p)$

#bytes = ~p, #msg = ~p

SLIDE 43

MPI

MPI memory [MB], expectation: $O(\log p)$

Platform | Juqueen | Juropa | Piz Daint
Model | $O(\log p)$ | $O(p)$ | $O(\log p)$
$R^2$ | 0.72 | 1 | 0.23
Divergence | $O(1)$ | $O(p / \log p)$ | $O(1)$
Match | ✔ | ✘ | ✔

Comm_create [B], expectation: $O(p)$

Platform | Juqueen | Juropa | Piz Daint
Model | $O(p)$ | $O(p)$ | $O(p)$
$R^2$ | 1 | 1 | 0.99
Divergence | $O(1)$ | $O(1)$ | $O(1)$
Match | ✔ | ✔ | ✔

Win_create [B], expectation: $O(p)$

Platform | Juqueen | Juropa | Piz Daint
Model | $O(p)$ | $O(p)$ | $O(p)$
$R^2$ | 1 | 1 | 0.99
Divergence | $O(1)$ | $O(1)$ | $O(1)$
Match | ✔ | ✔ | ✔

Barrier [s], expectation: $O(\log p)$

Platform | Juqueen | Juropa | Piz Daint
Model | $O(\log p)$ | $O(p^{0.67} \log p)$ | $O(p^{0.33})$
$R^2$ | 0.99 | 0.99 | 0.99
Divergence | $O(1)$ | $O(p^{0.67})$ | $O(p^{0.33} / \log p)$
Match | ✔ | ✘ | ~

Bcast [s], expectation: $O(\log p)$

Platform | Juqueen | Juropa | Piz Daint
Model | $O(\log p)$ | $O(p^{0.5})$ | $O(p^{0.5})$
$R^2$ | 0.86 | 0.98 | 0.94
Divergence | $O(1)$ | $O(p^{0.5} / \log p)$ | $O(p^{0.5} / \log p)$
Match | ✔ | ~ | ~

Reduce [s], expectation: $O(\log p)$

Platform | Juqueen | Juropa | Piz Daint
Model | $O(\log p)$ | $O(p^{0.5} \log p)$ | $O(p^{0.5} \log p)$
$R^2$ | 0.93 | 0.99 | 0.94
Divergence | $O(1)$ | $O(p^{0.5})$ | $O(p^{0.5})$
Match | ✔ | ~ | ~

SLIDE 44

MPI (2)

Allreduce [s], expectation: $O(\log p)$

Platform | Juqueen | Juropa | Piz Daint
Model | $O(\log p)$ | $O(p^{0.5})$ | $O(p^{0.67} \log p)$
$R^2$ | 0.87 | 0.99 | 0.99
Divergence | $O(1)$ | $O(p^{0.5} / \log p)$ | $O(p^{0.67})$
Match | ✔ | ~ | ✘

Comm_dup [B], expectation: $O(1)$

Platform | Juqueen | Juropa | Piz Daint
Model | 2.2e5 | 256 | 3770 + 18p
$R^2$ | 1 | 1 | 0.99
Divergence | $O(1)$ | $O(1)$ | $O(p)$
Match | ✔ | ✔ | ✘

SLIDE 45

MPI (3)

  • MPI_Allreduce
  • 3 different machines, 3 different models

SLIDE 46

Modified golden section search

  • Hypothesis search space can be ordered according to growth rate (derivative at observation with largest parameter value)
  • In this simplified example the ordering is: $\{x^0, x^{0.25}, \ldots, x^3\}$

$n = 1$, $m = 1$, $I = \{\tfrac{0}{4}, \tfrac{1}{4}, \ldots, \tfrac{12}{4}\}$, $J = \{0\}$

SLIDE 47

Modified golden section search

  • The assumption – the fit error is a unimodal function
  • It has one minimum in the hypothesis space, for the hypothesis with the best fit
  • We use a modified golden section search to find the minimum in $\log_{0.67} I$ steps

$n = 1$, $m = 1$, $I = \{\tfrac{0}{4}, \tfrac{1}{4}, \ldots, \tfrac{12}{4}\}$, $J = \{0\}$

[Figure: fit error over the ordered hypotheses from $x^0$ to $x^3$, with probe points P1 ($x^0$), P2 ($x^1$), P3 ($x^{1.75}$), and P4 ($x^3$).]

SLIDE 49

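Modified golden section search

  • A step of the method:
  • Compare P2 and P3
  • If P1 > P2 < P3, select a new point P5 between P1 and P2

$n = 1$, $m = 1$, $I = \{\tfrac{0}{4}, \tfrac{1}{4}, \ldots, \tfrac{12}{4}\}$, $J = \{0\}$

[Figure: fit error over the ordered hypotheses, with probe points P1 ($x^0$), P5 ($x^{0.5}$), P2 ($x^1$), P3 ($x^{1.75}$), and P4 ($x^3$).]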

SLIDE 50

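Modified golden section search

  • A step of the method:
  • Compare P2 and P3
  • If P1 > P2 < P3, select a new point P5 between P1 and P2
  • Reassign P2, P3, and P4 as below

$n = 1$, $m = 1$, $I = \{\tfrac{0}{4}, \tfrac{1}{4}, \ldots, \tfrac{12}{4}\}$, $J = \{0\}$

[Figure: fit error over the ordered hypotheses after reassignment: P1 ($x^0$), P2 ($x^{0.5}$, formerly P5), P3 ($x^1$, formerly P2), P4 ($x^{1.75}$, formerly P3).]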

SLIDE 51

Modified golden section search

  • A step of the method:
  • Compare P2 and P3
  • If P1 > P2 < P3, select a new point P5 between P1 and P2
  • Reassign P2, P3, and P4 as below
  • Continue with the next step

$n = 1$, $m = 1$, $I = \{\tfrac{0}{4}, \tfrac{1}{4}, \ldots, \tfrac{12}{4}\}$, $J = \{0\}$

[Figure: fit error over the ordered hypotheses after reassignment: P1 ($x^0$), P2 ($x^{0.5}$, formerly P5), P3 ($x^1$, formerly P2), P4 ($x^{1.75}$, formerly P3).]
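The idea of shrinking the interval of candidate hypotheses can be sketched with a plain ternary-style search, which discards about one third of the interval per step. This is a simplified stand-in under the unimodality assumption, not the exact modified golden section search from the slides; the function names and the synthetic data are hypothetical.

```python
def fit_error(exponent, xs, ts):
    """Sum of squared residuals of the best least-squares fit t = c * x**exponent."""
    basis = [x ** exponent for x in xs]
    c = sum(b * t for b, t in zip(basis, ts)) / sum(b * b for b in basis)
    return sum((t - c * b) ** 2 for b, t in zip(basis, ts))

def search_best_hypothesis(exponents, xs, ts):
    """Search ordered hypotheses for the minimum fit error (assumed unimodal).

    Returns the best exponent and the number of models that were fitted.
    """
    cache = {}

    def err(idx):
        if idx not in cache:
            cache[idx] = fit_error(exponents[idx], xs, ts)
        return cache[idx]

    lo, hi = 0, len(exponents) - 1
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if err(m1) < err(m2):
            hi = m2  # the minimum cannot lie to the right of m2
        else:
            lo = m1  # the minimum cannot lie to the left of m1
    best = min(range(lo, hi + 1), key=err)
    return exponents[best], len(cache)

exponents = [k / 4 for k in range(13)]  # I = {0/4, 1/4, ..., 12/4}, J = {0}
xs = [256, 512, 1024, 2048, 4096]
ts = [3.0 * x ** 1.75 for x in xs]      # synthetic data following x^1.75

best_exponent, models_fitted = search_best_hypothesis(exponents, xs, ts)
print(best_exponent, models_fitted, "of", len(exponents))
```

Caching the fit errors means each hypothesis is fitted at most once, so the number of fitted models stays below the size of the full hypothesis list.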

SLIDE 52

Modified golden section search – What is the reduction?

  • We use the example from before
  • Assuming the worst-case scenario that only 1/3 of the interval is eliminated in each step, we need 23 steps to identify the best model.
  • In each step 1 new model needs to be computed, except in the first step, where 4 models must be computed.

$n = 3$, $m = 3$, $I = \{\tfrac{0}{4}, \tfrac{1}{4}, \ldots, \tfrac{12}{4}\}$, $J = \{0, 1, 2\}$

  • 26 models evaluated for the best model for a single parameter
  • 10 models evaluated using exhaustive search

SLIDE 53

Cache Modeling

  • Byte reuse distance
  • Line reuse distance
  • Byte stack distance
  • Line stack distance

A A B B C

Threadspotter

SLIDE 54

Parameter space sampling

We have (p1, t1) and (p2, t2). How to choose p3? Create f1 and f2 out of the points we have and then find p3!

$$\frac{f_1(p_3) - f_2(p_3)}{\mathrm{Cost}(p_3, f_1(p_3))} > \text{Threshold}$$

SLIDE 55

Weak vs. strong scaling

  • Wall-clock time not necessarily monotonically increasing under strong scaling
  • Harder to capture model automatically
  • Different invariants require different reductions across processes

 | Weak scaling | Strong scaling
Invariant | Problem size per process | Overall problem size
Model target | Wall-clock time | Accumulated time
Reduction | Maximum / average | Sum
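The two reductions in the table can be sketched as follows. The function names and the per-process timings are hypothetical and serve only to show how one set of measurements yields different modeling targets.

```python
import statistics

def weak_scaling_metric(per_process_times, use_average=False):
    """Wall-clock target: maximum (or average) time across processes."""
    if use_average:
        return statistics.mean(per_process_times)
    return max(per_process_times)

def strong_scaling_metric(per_process_times):
    """Accumulated-time target: sum of the time across all processes."""
    return sum(per_process_times)

# Hypothetical per-process times (seconds) of one kernel in one run
per_process_times = [1.0, 1.5, 1.25, 1.25]

wall_clock = weak_scaling_metric(per_process_times)
average = weak_scaling_metric(per_process_times, use_average=True)
accumulated = strong_scaling_metric(per_process_times)
print(wall_clock, average, accumulated)
```

Summing over processes for strong scaling gives a quantity that tends to stay flat or grow as processes are added, which is easier to model than wall-clock time that first falls and then rises.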