Lightweight Requirements Engineering for Exascale Co-design
Felix Wolf, TU Darmstadt


SLIDE 1

11/17/19 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 1

Felix Wolf, TU Darmstadt

Lightweight Requirements Engineering for Exascale Co-design

Application System

SLIDE 2

Acknowledgement

  • Alexandru Calotoiu, TU Darmstadt
  • Alexander Graf, TU Darmstadt
  • Torsten Hoefler, ETH Zurich
  • Daniel Lorenz, TU Darmstadt
  • Sergei Shudler, TU Darmstadt
  • Sebastian Rinke, TU Darmstadt
SLIDE 3

Co-design

Workload ↔ System

Better algorithms

SLIDE 4

Current performance might be deceptive…

(Figure: breakdown of communication vs. computation)
SLIDE 5

Hardware-specific performance models

(Figure: Systems 1…n, each carrying its own set of performance models 1.1, 1.2, 1.3 for Application 1)

SLIDE 6

Application-centric requirements models

(Figure: Systems 1…n, all sharing a single requirements model 1 for Application 1)

SLIDE 7

Data metabolism at the hardware / software interface

Application Hardware

SLIDE 8

Hardware-independent requirement metrics

  • CPU: #FLOPS
  • Memory: #Loads & stores, #Bytes used (+ stack distance)
  • Network: #Bytes sent & received

SLIDE 9

Requirements model of an application

Set of functions r_i(p, n), each representing one of the requirement metrics

  • p = #processes, n = input size per process
  • All metrics refer to a single process
  • We model neither time nor energy

SLIDE 10

Lightweight requirements engineering for (exascale) co-design

Collect portable requirement metrics → derive requirement models → extrapolate to a new system

SLIDE 11

Collection of requirement metrics

Requirement        Metric                        Profiling tool
Computation        # Floating-point operations
Network comm.      # Bytes sent & received
Memory footprint   # Bytes used                  getrusage()
Memory access      # Loads & stores
Memory locality    Stack distance                Threadspotter

Collection is single-threaded (#FLOPS roughly independent of #threads).

SLIDE 12

Modeling locality

Reuse distance vs. stack distance

(Figure: access trace with annotated distances; one reuse has reuse distance = 1 and stack distance = 1, another has reuse distance = 3 but stack distance = 2, since stack distance counts only distinct addresses between reuses)

Tool: Paratools Threadspotter

SLIDE 13

Automatic empirical performance modeling with Extra-P

Input: small-scale measurements of an instrumented code (main() { foo(); bar(); compute(); })
Output: human-readable, multi-parameter performance models of the form

    f(x_1, ..., x_m) = \sum_{k=1}^{n} c_k \prod_{l=1}^{m} x_l^{i_{kl}} \cdot \log_2^{j_{kl}}(x_l)

  • A. Calotoiu et al.: Fast Multi-Parameter Performance Modeling (CLUSTER '16)

www.scalasca.org/software/extra-p/download.html

SLIDE 14

Test applications

Kripke MILC LULESH icoFoam Relearn

SLIDE 15

Experimental setup

  • JUQUEEN @ Jülich Supercomputing Centre: IBM Blue Gene/Q
  • Lichtenberg @ TU Darmstadt: Intel Xeon with InfiniBand

SLIDE 16

Modeling application requirements

LULESH

Requirement        Metric                   Model
Computation        #FLOPs                   10^5 · n · log(n) · p^0.25 · log(p)
Communication      #Bytes sent & received   10^3 · n · p^0.25 · log(p)
Memory access      #Loads & stores          10^5 · n · log(n) · log(p)
Memory footprint   #Bytes used              10^5 · n · log(n)
Memory locality    Stack distance           Constant

Models represent per-process effects; p: number of processes, n: problem size per process.

SLIDE 17

Determining requirements on a new system

Available sockets → # processes
Available memory per process → problem size per process
# processes × problem size per process → overall problem size
Requirement models evaluated at (p, n) → requirements: #FLOPS, #Bytes sent, ...

SLIDE 18

Requirements engineering process

(Figure: requirements matched against computational performance, memory capacity, memory bandwidth, and network bandwidth)

SLIDE 19

Case study I: Three system upgrades

  • Racks × 2
  • Sockets × 2
  • Memory × 2

SLIDE 20

Three upgrades: summary (ratios relative to the baseline system)

System Upgrade A: Double the racks
                          Kripke  LULESH  MILC  Relearn  icoFoam  Expected
Problem size per process  1       1       1     1        0.5      1
Overall problem size      2       2       2     2        1        2
Computation               1       1.2     1     1        0.5      1
Communication             1       1.2     1     1        0.7      1
Memory accesses           2       1.2     2.8   2        0.7      1

System Upgrade B: Double the sockets
                          Kripke  LULESH  MILC  Relearn  icoFoam  Expected
Problem size per process  0.5     0.5     0.5   0.3      0.3      0.5
Overall problem size      1       1       1     0.5      0.6      1
Computation               0.5     0.6     0.5   0.3      0.2      0.5
Communication             0.5     0.6     0.5   0.3      0.3      0.5
Memory accesses           0.5     1       1.4   1        0.5      0.5

System Upgrade C: Double the memory
                          Kripke  LULESH  MILC  Relearn  icoFoam  Expected
Problem size per process  2       1.4     2     4        1.4      2
Overall problem size      2       1.4     2     4        1.4      2
Computation               2       1.4     2     4        1.7      2
Communication             2       1.4     2     4        1.4      2
Memory accesses           2       1.4     2     4        1.4      2

SLIDE 21

Case study II: Three exascale strawman systems

Metric                 Massively parallel   Vector      Hybrid
Nodes                  2 · 10^4             5 · 10^4    10^4
Processors             2 · 10^9             5 · 10^7    10^8
Processors per node    10^5                 10^3        10^4
Memory per processor   5 · 10^6             2 · 10^8    10^8
Flop/s per processor   5 · 10^8             2 · 10^10   10^10

Massively parallel: many but weak processors. Vector: few but powerful processors. Hybrid: a moderate number of moderate processors. Total memory in each case: 10 PB.

SLIDE 22

Case study II: Three exascale strawman systems (results)

                                                     Massively parallel  Vector      Hybrid
Kripke    Maximum overall problem size               10^10               10^10       10^10
          Minimum wall time for benchmark problem [s]  0.1               0.1         0.1
LULESH    Maximum overall problem size               3.9 · 10^10         1.7 · 10^10 1.9 · 10^10
          Minimum wall time for benchmark problem [s]  40                21.5        33
MILC      Maximum overall problem size               10^10               10^10       10^10
          Minimum wall time for benchmark problem [s]  10^2              10^2        10^2
Relearn   Maximum overall problem size               5 · 10^10           4 · 10^12   10^12
          Minimum wall time for benchmark problem [s]  4                 0.02        0.2

Bigger problem versus faster solution: the vector system is the clear winner.

SLIDE 23

Summary

Application-centric requirements models

  • No need to integrate hardware knowledge
  • Generation via standard profiling tools
  • Memory locality taken into account

Practical co-design process

  • Extrapolates requirements to envisaged system
  • Points out bottlenecks on both sides

Automated back-of-the-envelope (BOE) co-design for large workloads

SLIDE 24

Tasking

Idea: separate problem decomposition from concurrency

  • Decompose the problem into a set of tasks and insert them into a task pool
  • Threads fetch tasks from the pool until all tasks are completed and the pool is empty; note that a task may create new tasks
  • Advantage: good load balance if the problem is over-decomposed

(Figure: tasks are created into a task pool; a scheduler fetches them and assigns them to threads from a thread pool)

SLIDE 25

Tasking (2)

  • Task-based paradigms: Cilk, OmpSs, OpenMP, …
  • Scheduling managed by the runtime system
  • Example: the task graph of fib(5), which spawns fib(4) and fib(3), and so on, produced by

    #pragma omp task shared(x)
    x = fib( n - 1 );
    #pragma omp task shared(y)
    y = fib( n - 2 );
    #pragma omp taskwait
    return x + y;

SLIDE 26

Efficiency of task-based applications – performance issues

Influencing factors: task graph, core count, input size

  • Efficiency (held constant): E = S_p / p

SLIDE 27

Efficiency of task-based applications – performance issues (2)

Influencing factors: task graph, core count, input size

  • Efficiency (held constant): E = S_p / p

SLIDE 28

Efficiency of task-based applications – performance issues (3)

Influencing factors: input size, resource contention, core count

  • Efficiency (held constant): E = S_p / p

SLIDE 29

Task dependency graph (TDG)

  • Nodes: tasks; edges: dependencies
  • p, n: processing elements, input size
  • T_1(n): sum of all the task times (work)
  • T_∞(n): longest path (depth)
  • π(n) = T_1(n) / T_∞(n): average parallelism
  • T_p(n): execution time
  • S_p(n) = T_1(n) / T_p(n): speedup

(Figure: example TDG with T_1 = 45 and T_∞ = 25)

SLIDE 30

Efficiency & isoefficiency

  • Efficiency is defined as E(p,n) = S_p(n) / p ≤ min{1, π(n) / p} = E_ub(p,n)
  • Isoefficiency binds together the core count and the input size for a specific, constant efficiency: n = f_E(p)
  • An isoefficiency function is a contour line on the efficiency surface E_ub(p,n)
  • Example: Mergesort, with π(n) = log n (the surface depicts E_ub; its contour lines are the isoefficiency functions)

SLIDE 31

Modeled efficiency functions

  • E_ac(p,n): reflects actual performance
  • E_cf(p,n): contention-free replays
  • E_ub(p,n): upper bound based on avg. parallelism

Contention discrepancy Δ_con = E_cf(p,n) − E_ac(p,n): shows how severe the resource contention is.
Structural discrepancy Δ_str = E_ub(p,n) − E_cf(p,n): characterizes optimization potential on the task-graph level.

SLIDE 32

Co-design aspects

App        Model                                                         Input size for p = 60, E = 0.8
Fibonacci  E_ac = 0.98 − 5.11·10^-3 p^1.25 + 1.76·10^-3 p^1.25 log n     51
           E_cf = 0.97 − 1.46·10^-2 p^1.25 + 9.26·10^-3 p^1.25 log n     51
           E_ub = min{1, (25.48 + 0.49 n^2.75 log n) p^-1}               49
Strassen   E_ac = 1.55 − 1.02 p^0.25 + 4.59·10^-2 p^0.25 log n           83,600 × 83,600
           E_cf = 1.26 − 0.65 p^0.33 + 3.89·10^-2 p^0.33 log n           12,680 × 12,680
           E_ub = min{1, 0.25 n^0.75 p^-1}                               1,200 × 1,200

For example (Strassen): let E = 0.8 and p = 60. Solving

    0.8 = 1.55 − 1.02 · 60^0.25 + 4.59·10^-2 · 60^0.25 · log n

yields n = 83,600.

SLIDE 33

Extra-P 3.0

  • GUI improvements, better stability, additional features
  • Tutorials available through VI-HPS and upon request

http://www.scalasca.org/software/extra-p/download.html

SLIDE 34

Related publications

[1] Alexandru Calotoiu, Alexander Graf, Torsten Hoefler, Daniel Lorenz, Sebastian Rinke, Felix Wolf: Lightweight Requirements Engineering for Exascale Co-design. In Proc. of the 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK
[2] Sergei Shudler, Alexandru Calotoiu, Torsten Hoefler, Felix Wolf: Isoefficiency in Practice: Configuring and Understanding the Performance of Task-based Applications. In Proc. of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Austin, TX, USA, pages 1-13, ACM, February 2017
[3] Alexandru Calotoiu, David Beckingsale, Christopher W. Earl, Torsten Hoefler, Ian Karlin, Martin Schulz, Felix Wolf: Fast Multi-Parameter Performance Modeling. In Proc. of the 2016 IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan
[4] Alexandru Calotoiu, Torsten Hoefler, Marius Poke, Felix Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. In Proc. of the ACM/IEEE Conference on Supercomputing (SC13), Denver, CO, USA

SLIDE 35

Thank you!