

  1. Update on the Performance-Modeling Tool Extra-P Felix Wolf, TU Darmstadt

  2. Acknowledgement
  • David Beckingsale
  • Alexandru Calotoiu
  • Christopher W. Earl
  • Torsten Hoefler
  • Kashif Ilyas
  • Ian Karlin
  • Daniel Lorenz
  • Patrick Reisert
  • Martin Schulz
  • Sergei Shudler
  • Andreas Vogel
  7/10/18 | Department of Computer Science | Laboratory for Parallel Programming | Felix Wolf | 2

  3. Latent scalability bugs
  [Plot: wall time vs. system size]

  4. Motivation
  Performance model = formula that expresses relevant performance metrics as a function of one or more execution parameters
  Manual creation is challenging:
  • Incomplete coverage
  • Laborious, difficult
  Workflow: identify kernels → create models
  [Plot: time [s] vs. processes (2⁹–2¹³), with fitted model 3·10⁻⁴·p² + c]

  5. Automatic empirical performance modeling
  Performance model normal form (PMNF): f(p) = ∑_{k=1}^{n} c_k · p^{i_k} · log₂^{j_k}(p)
  Small-scale measurements → generation of candidate models → selection of best fit
  Candidate models include: c₁ + c₂·p, c₁ + c₂·p², c₁ + c₂·log(p), c₁ + c₂·p·log(p), c₁ + c₂·p²·log(p), c₁·log(p) + c₂·p, c₁·log(p) + c₂·p·log(p), c₁·log(p) + c₂·p², c₁·log(p) + c₂·p²·log(p), c₁·p + c₂·p·log(p), c₁·p + c₂·p², c₁·p + c₂·p²·log(p), c₁·p·log(p) + c₂·p², c₁·p·log(p) + c₂·p²·log(p), c₁·p² + c₂·p²·log(p), …
  [Table excerpt, 2 of 40 kernels, t = f(p) [s]: sweep→, MPI_Recv]
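The candidate-generation-and-selection step can be sketched as follows. This is a minimal illustration, not Extra-P's implementation: it fits simple two-term hypotheses c₀ + c₁·p^i·log₂^j(p) by linear regression and keeps the one with the lowest residual sum of squares; the exponent sets `I` and `J` below are illustrative defaults.

```python
import itertools
import numpy as np

def fit_pmnf(p, t, I=(0, 0.25, 0.5, 1, 2), J=(0, 1, 2)):
    """Try two-term PMNF hypotheses f(p) = c0 + c1 * p**i * log2(p)**j
    and return the (i, j) pair and coefficients with the lowest RSS."""
    best = None
    for i, j in itertools.product(I, J):
        if i == 0 and j == 0:
            continue  # would duplicate the constant column
        X = np.column_stack([np.ones_like(p), p**i * np.log2(p)**j])
        coef, *_ = np.linalg.lstsq(X, t, rcond=None)
        rss = float(np.sum((t - X @ coef) ** 2))
        if best is None or rss < best[0]:
            best = (rss, i, j, coef)
    return best

# synthetic measurements following t = 5 + 0.01 * p * log2(p)
p = np.array([2.0**9, 2**10, 2**11, 2**12, 2**13])
t = 5 + 0.01 * p * np.log2(p)
rss, i, j, coef = fit_pmnf(p, t)
print(f"best term: p^{i} * log2(p)^{j}, c0 = {coef[0]:.2f}, c1 = {coef[1]:.4f}")
```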

  6. Extra-P 3.0
  • GUI improvements, better stability, additional features
  • Tutorials available through VI-HPS and upon request
  http://www.scalasca.org/software/extra-p/download.html

  7. Recent developments
  1. Performance models with multiple parameters
  2. Automatic configuration of the search space
  3. Segmented models
  4. Iso-efficiency modeling
  5. Lightweight requirements engineering for co-design

  8. Models with more than one parameter
  f(x₁, …, x_m) = ∑_{k=1}^{n} c_k · ∏_{l=1}^{m} x_l^{i_kl} · log₂^{j_kl}(x_l)
  With n = 3, m = 3, I = {0/4, 1/4, …, 12/4}, J = {0, 1, 2}:
  Search space explosion
  • Total number of hypotheses to search: 34,786,300,841,019
  • Too slow for any practical purpose
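The hypothesis count above can be reproduced as a back-of-the-envelope check, assuming each model is an unordered choice of n = 3 distinct compound terms, where each term picks one (i, j) exponent pair per parameter:

```python
from math import comb

# per parameter: 13 choices for i (I = {0/4, ..., 12/4}) times 3 choices for j
pairs_per_param = 13 * 3            # 39 (i, j) pairs
m = 3                               # number of model parameters
terms = pairs_per_param ** m        # 59,319 distinct compound terms
n = 3                               # number of terms per model
hypotheses = comb(terms, n)         # choose 3 distinct terms, order irrelevant
print(hypotheses)                   # 34786300841019
```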

  9. Search space reduction through heuristics
  • Hierarchical search – assumes the best multi-parameter model is created by combining the best single-parameter hypotheses for each parameter
  • Modified golden section search – speeds up the single-parameter search by ordering the hypothesis space and then using a variant of binary search to find the model in logarithmic rather than linear time
  Calotoiu et al.
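The idea behind the modified golden section search can be illustrated with a plain ternary-search variant over an ordered exponent list. This is a sketch under the unimodality assumption, not Extra-P's actual implementation; the data and exponent set are synthetic.

```python
import numpy as np

def rss(exp, p, t):
    """RSS of fitting c0 + c1 * p**exp to the measurements."""
    X = np.column_stack([np.ones_like(p), p**exp])
    coef, *_ = np.linalg.lstsq(X, t, rcond=None)
    return float(np.sum((t - X @ coef) ** 2))

def ordered_section_search(exponents, p, t):
    """Locate the best exponent in O(log n) model evaluations, assuming
    the error is unimodal along the ordered exponent list."""
    lo, hi = 0, len(exponents) - 1
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if rss(exponents[m1], p, t) < rss(exponents[m2], p, t):
            hi = m2          # minimum lies in the lower part
        else:
            lo = m1          # minimum lies in the upper part
    return min(exponents[lo:hi + 1], key=lambda e: rss(e, p, t))

p = np.array([64.0, 128, 256, 512, 1024])
t = 2 + 0.5 * p**1.5                  # synthetic data with true exponent 3/2
exps = [k / 4 for k in range(13)]     # I = {0/4, 1/4, ..., 12/4}
best = ordered_section_search(exps, p, t)
print(best)
```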

  10.–13. Search space reduction
  • Assuming 300,000 hypotheses searched per second (*this is optimistic)
  • 3-parameter models: n = 3, m = 3, I = {0/4, 1/4, …, 12/4}, J = {0, 1, 2}
  • Exhaustive search: 34,786,300,841,019 hypotheses searched → ~1 model / 3.5 years
  • Hierarchical search: 27,929 hypotheses searched → ~11 models / second
  • Hierarchical + modified golden section search: 590 hypotheses searched → ~508 models / second

  14. Evaluation with synthetic data (100,000 models with two parameters)
  • Exhaustive search: 107 hours
  • Heuristics: 1.5 hours
  [Bar chart: distribution of generated models [%] in three categories: optimal model identified, lead-order term identified, lead-order term not identified]

  15. Evaluation with application data
  [Bar chart: distribution of generated models [%] for Blast (full), Blast (partial), CloverLeaf, and Kripke; categories: identical models, lead-order terms identical, different lead-order terms]

  16. Case study – Kripke
  • Neutron transport proxy code
  • Three parameters considered:
    • Process count – p
    • Number of directions – d
    • Number of groups – g

  17. Expected behavior
  SweepSolver – main computation kernel
  • Expectation: performance depends on problem size → t ~ d · g
  MPI_Testany – main communication kernel (3D wave-front communication pattern)
  • Expectation: performance depends on cubic root of process count → t ~ ∛p

  18. Expected vs. actual behavior
  • SweepSolver – actual model: t = 5 + d·g + 0.005 · ∛p · d·g
  • MPI_Testany – actual model: t = 7 + ∛p + 0.005 · ∛p · d·g
  • Kernels must wait on each other → a smaller compounded effect discovered

  19. How to find good PMNF parameters?
  • Option (1): Rely on default parameters → but what if they don't fit the problem?
  • Option (2): Try those parameters that you expect to fit → requires prior expertise! Also, what if your expectation is wrong?
  • Option (3): Try very large sets I, J → requires more resources (especially bad for multiple parameters)!
  • Option (4): Let Extra-P automatically refine the search space based on previous results.

  20. Simplified PMNF
  • Use only constant and "lead-order" term: f(p) = c₀ + c₁ · p^α · log₂^β(p)
  • Want to find values for c₀, c₁, α, and β such that the model error is minimized
  • c₀ and c₁ are determined by regression
  • What about α and β?

  21. Simplified PMNF
  We define four slices through the (α, β) space:
  • β = 0, α = ?
  • β = 1, α = ?
  • β = 2, α = ?
  • α = 0, β = ?
  Goal: unimodal error distribution along each slice
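The slice-based search can be sketched as follows, assuming (as the slide states) that the error is unimodal along each slice. The synthetic data, the α range, and the ternary refinement loop are illustrative, and the α = 0 slice is omitted for brevity.

```python
import numpy as np

def model_error(p, t, alpha, beta):
    """RSS of c0 + c1 * p**alpha * log2(p)**beta, with c0, c1 from regression."""
    X = np.column_stack([np.ones_like(p), p**alpha * np.log2(p)**beta])
    coef, *_ = np.linalg.lstsq(X, t, rcond=None)
    return float(np.sum((t - X @ coef) ** 2))

def refine_alpha(p, t, beta, lo=0.0, hi=3.0, rounds=20):
    """Narrow the alpha interval by repeated ternary steps, assuming the
    error is unimodal along the slice (a stand-in for golden section search)."""
    for _ in range(rounds):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if model_error(p, t, m1, beta) < model_error(p, t, m2, beta):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

p = np.array([32.0, 64, 128, 256, 512, 1024])
t = 4 + 0.02 * p**1.5 * np.log2(p)   # synthetic data: alpha = 1.5, beta = 1
candidates = []
for beta in (0, 1, 2):               # search each beta slice independently
    alpha = refine_alpha(p, t, beta)
    candidates.append((model_error(p, t, alpha, beta), alpha, beta))
err, alpha, beta = min(candidates)   # keep the slice with the lowest error
print(f"alpha = {alpha:.3f}, beta = {beta}")
```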

  22. Evaluation
  Data from previous case studies:
  • Sweep3D, MILC, UG4, MPI collective operations, BLAST, Kripke
  • 5–9 data points available per study
  • Last data point (largest p) not used for modeling, but to evaluate prediction accuracy
  Results:
  • 4453 models
  • 49% remain unchanged, 39% get better, 12% get worse
  • Mean relative prediction error down from 45.7% to 13.0%
  • Improvements in every individual case study
  Reisert et al.

  23. Segmented behavior
  [Plot: runtime vs. number of processors (p). First behaviour: p²; second behaviour: 30 + p. Model predicted by Extra-P: log₂²(p), which matches neither segment.]

  24. Divide data into subsets
  [Plot: runtime vs. number of processors (p), with the measurements (following p², then 30 + p) divided into overlapping subsets 1–6]

  25. Model each subset and compute nRSS
  [Plot: normalized RSS (nRSS) values per subset; heterogeneous subsets, which span both behaviours, show a high nRSS]
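The subset-modeling step can be sketched as follows. The RSS normalization, the window size, and the candidate exponents are assumptions for illustration; the point is that subsets whose measurements straddle the change in behaviour cannot be fitted well by a single model and therefore show a high nRSS.

```python
import numpy as np

def fit_and_nrss(p, t, alpha):
    """Fit c0 + c1 * p**alpha to one subset and return its RSS normalized
    by the squared magnitude of the measurements (an assumed normalization)."""
    X = np.column_stack([np.ones_like(p), p**alpha])
    coef, *_ = np.linalg.lstsq(X, t, rcond=None)
    rss = float(np.sum((t - X @ coef) ** 2))
    return rss / float(np.sum(t ** 2))

# synthetic segmented data: p**2 up to p = 256, then 30 + p beyond
p = np.array([16.0, 32, 64, 128, 256, 512, 1024, 2048])
t = np.where(p <= 256, p**2, 30 + p)

window = 3   # sliding window of 3 consecutive measurements per subset
flags = []
for start in range(len(p) - window + 1):
    ps, ts = p[start:start + window], t[start:start + window]
    # best single model per subset, trying two candidate exponents
    nrss = min(fit_and_nrss(ps, ts, a) for a in (1.0, 2.0))
    flags.append(nrss > 1e-6)   # heterogeneous subsets have a high nRSS
    marker = "  <-- heterogeneous" if flags[-1] else ""
    print(f"subset {start + 1}: nRSS = {nrss:.2e}{marker}")
```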
