
Finding and Optimizing Phases in Parallel Programs



  1. Finding and Optimizing Phases in Parallel Programs
     Ray Chen <rchen@cs.umd.edu>
     Jeffrey K. Hollingsworth <hollings@cs.umd.edu>
     Scalable Tools Workshop 2016

  2. Motivation
     • HPC programs often contain “phases”
       – Dynamic execution context (like a stack trace for performance)
       – Each has distinct performance traits
     • Particularly disruptive if inside a timestep loop
       – Short phases confound tools
       – Difficult to analyze a rapidly changing landscape
       – Worse if phases are nested
     8/2/16 Finding and Optimizing Phases in Parallel Programs: Scalable Tools Workshop

  3. LULESH2 MPI Call Trace
         while (locDom->time() < locDom->stoptime()) {
             TimeIncrement(*locDom);
             LagrangeLeapFrog(*locDom);
         }

  4. Automatic Phase Identification
     • Prior art (chosen completely at random)
       – IPS-2
       – Paradyn’s Performance Consultant
     • Key: Automatic identification is hard
       – Rely on experts for annotations

  5. Guided Phase Identification
     Original loop:
         while (locDom->time() < locDom->stoptime()) {
             TimeIncrement(*locDom);
             LagrangeLeapFrog(*locDom);
         }
     Annotated loop:
         while (locDom->time() < locDom->stoptime()) {
             // Mark the communication phase.
             cali::Annotation region1("tuner.communication");
             region1.begin();
             TimeIncrement(*locDom);
             region1.end();

             // Mark the computation phase.
             cali::Annotation region2("tuner.computation");
             region2.begin();
             LagrangeLeapFrog(*locDom);
             region2.end();
         }

  6. Performance Landscape
     [Figure: panels labeled “Actual Timeline” and “Contextual Per Iteration Timeline”; sizes shown: 3,700KB and 2.5KB]

  7. Cross-Domain Analysis
     • Utilize experts during development
       – Library writers specify tuning variables
       – Application writers specify code regions
       – Phase dictates different performance context
         • Even though the same function is being called
     Callouts:
       “I know what variables affect FFTW performance”
       “I know what variables affect MPI performance”
       “I know what variables affect BLAS performance”
       “My application has three phases”

  8. Integration Work
     • Special annotation types identify:
       – Tunable variables
       – Code regions that should enable tuning
     • New Caliper tuning service
       – Listens for and reacts to special annotations
       – Calls Active Harmony to perform search
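
The listen-and-react loop described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Caliper tuning service or the Active Harmony API: `RegionTuner` and `TuningService` are hypothetical names, each region has a single integer tunable, the cost is reported by the caller, and a plain exhaustive sweep stands in for Active Harmony's search.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-region tuner: tries each candidate value once, then keeps
// the cheapest. Exhaustive search is a stand-in for a real search engine.
class RegionTuner {
public:
    explicit RegionTuner(std::vector<int> candidates)
        : candidates_(std::move(candidates)) {
        assert(!candidates_.empty());
        best_value_ = candidates_.front();
    }

    // Region begins: return the value the region should use this time.
    int begin() const {
        return searching() ? candidates_[next_] : best_value_;
    }

    // Region ends: record the measured cost (for example, elapsed seconds).
    void end(double cost) {
        if (!searching()) return;
        if (cost < best_cost_) {
            best_cost_ = cost;
            best_value_ = candidates_[next_];
        }
        ++next_;
    }

    bool searching() const { return next_ < candidates_.size(); }
    int best() const { return best_value_; }

private:
    std::vector<int> candidates_;
    std::size_t next_ = 0;
    int best_value_ = 0;
    double best_cost_ = std::numeric_limits<double>::infinity();
};

// Hypothetical service keyed by region name, so every annotated phase is
// tuned independently even when phases call the same library function.
class TuningService {
public:
    void register_region(const std::string& region, std::vector<int> candidates) {
        tuners_.emplace(region, RegionTuner(std::move(candidates)));
    }
    int begin(const std::string& region) { return tuners_.at(region).begin(); }
    void end(const std::string& region, double cost) { tuners_.at(region).end(cost); }
    int best(const std::string& region) const { return tuners_.at(region).best(); }

private:
    std::map<std::string, RegionTuner> tuners_;
};
```

In use, the application would call `begin("tuner.communication")` when that region starts, run the region with the returned value, and pass the measured time to `end`. Once every candidate has been tried, `begin` keeps returning the best value seen.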

  9. 3D Fast Fourier Transform
     • FFT in 3 dimensions
       – Composed of three one-dimensional FFTs
       – Data is redistributed among processes between FFTs
     [Figure: data layout across processes 0 to 3 for stages labeled FFTz, FFTy and FFTx, with all-to-all exchanges A2A1 (blocking) and A2A2 (blocking)]
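
The decomposition into three one-dimensional transforms follows from the separability of the multidimensional DFT. The sketch below illustrates it on a single process, assuming a small dense cube and a naive O(n^2) DFT in place of an optimized FFT library. The names `Cube`, `dft1d`, `dft_along_axis` and `dft3d` are hypothetical, the z, y, x pass order is an assumption (the result does not depend on it), and the redistribution between passes is not modeled.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) 1D DFT; stands in for an optimized FFT library call.
std::vector<cd> dft1d(const std::vector<cd>& in) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += in[j] * std::polar(1.0, -2.0 * pi * double(k * j) / double(n));
    return out;
}

// Dense n*n*n cube in row-major order: index = (x*n + y)*n + z.
struct Cube {
    std::size_t n;
    std::vector<cd> data;
    explicit Cube(std::size_t n_) : n(n_), data(n_ * n_ * n_) {}
    cd& at(std::size_t x, std::size_t y, std::size_t z) {
        return data[(x * n + y) * n + z];
    }
};

// Transform every line of the cube that runs along one axis (0 = x, 1 = y, 2 = z).
void dft_along_axis(Cube& c, int axis) {
    assert(axis >= 0 && axis <= 2);
    const std::size_t n = c.n;
    std::vector<cd> line(n);
    for (std::size_t a = 0; a < n; ++a) {
        for (std::size_t b = 0; b < n; ++b) {
            // Select element i of the current line for the chosen axis.
            auto elem = [&](std::size_t i) -> cd& {
                if (axis == 0) return c.at(i, a, b);
                if (axis == 1) return c.at(a, i, b);
                return c.at(a, b, i);
            };
            for (std::size_t i = 0; i < n; ++i) line[i] = elem(i);
            const std::vector<cd> out = dft1d(line);
            for (std::size_t i = 0; i < n; ++i) elem(i) = out[i];
        }
    }
}

// 3D transform as three 1D passes. In a distributed code, an all-to-all
// exchange between passes would make the next axis local to each process.
void dft3d(Cube& c) {
    dft_along_axis(c, 2);  // z
    dft_along_axis(c, 1);  // y
    dft_along_axis(c, 0);  // x
}
```

A quick sanity check is to fill a cube with ones and confirm that only the element at (0, 0, 0) is nonzero after `dft3d`, with value n^3.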

  10. Computation/Communication Overlap
      [Figure: the blocking pipeline (FFTz, FFTy, FFTx with blocking exchanges A2A1 and A2A2) shown against an overlapped pipeline in which FFTy is split into FFTy1 and FFTy2 and both A2A1 and A2A2 are non-blocking]

  11. Auto-tuning Opportunities
      [Figure: overlapped pipeline (FFTz, FFTx, FFTy1, FFTy2, with non-blocking A2A1 and A2A2) marked with T1 and T2; detail panels “FFTz & Pack” and “Unpack & FFTy1” with labels Px1, Py1, Ux1, Uz1, Nz / p2 and Ny / p2]

  12. Nested Phases
      • Block size during A2A transfer is tunable
        – Relatively independent from other variables
        – May be tuned as a nested sub-phase
      • Outer and inner phases run in tandem
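
One way to picture nested phases is as a stack of annotation names, where the joined stack identifies the current execution context. The sketch below is a hypothetical illustration, not the Caliper API: `PhaseTracker` and the phase names `fft3d` and `a2a` are assumptions. A tuner attached to each distinct context could then search the outer and inner variables separately.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical phase tracker: begin/end calls push and pop phase names, and
// the joined stack is the current context (like a stack trace for phases).
class PhaseTracker {
public:
    void begin(const std::string& name) {
        stack_.push_back(name);
        ++visits_[context()];
    }

    void end(const std::string& name) {
        // Phases must close in the reverse order they were opened.
        assert(!stack_.empty() && stack_.back() == name);
        stack_.pop_back();
    }

    // Current context, for example "fft3d/a2a" while inside the inner phase.
    std::string context() const {
        std::string path;
        for (const std::string& s : stack_) {
            if (!path.empty()) path += '/';
            path += s;
        }
        return path;
    }

    // Number of times a given context has been entered.
    int visits(const std::string& ctx) const {
        auto it = visits_.find(ctx);
        return it == visits_.end() ? 0 : it->second;
    }

private:
    std::vector<std::string> stack_;
    std::map<std::string, int> visits_;
};
```

Because the inner context is entered more often than the outer one, a per-context tuner for the inner phase gets more measurements per timestep, which is one reason a nested variable can be searched alongside the outer ones rather than after them.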

  13. Online Auto-Tuning

  14. Offline Auto-Tuning Cost

  15. Online vs. Offline Tuning
      • Improvements over offline tuning
        – Nested phases reduce search complexity
        – Reduce search dimensions from 24 to 16
        – 40% fewer search steps needed to converge
        – Equivalent performance after convergence
      • Eliminates need for training runs
        – Don’t allocate thousands of nodes to train

  16. Conclusion
      • Phases are key for HPC analysis tools
        – Rely on human guidance through annotations
      • Annotations unite cross-domain expertise
        – Libraries annotate variables to analyze
        – Applications annotate regions to analyze
      • Currently analyzing other HPC codes
        – HPGMG has natural phases to exploit
        – AMR codes are next in line
