SLIDE 1

Finding and Optimizing Phases in Parallel Programs

Ray Chen <rchen@cs.umd.edu> Jeffrey K. Hollingsworth <hollings@cs.umd.edu>

Scalable Tools Workshop 2016

SLIDE 2

Motivation

8/2/16 Finding and Optimizing Phases in Parallel Programs: Scalable Tools Workshop

  • HPC programs often contain “phases”

    – Dynamic execution context (like a stack trace for performance)
    – Each has distinct performance traits

  • Particularly disruptive if inside a timestep loop

    – Short phases confound tools
    – Difficult to analyze a rapidly changing landscape
    – Worse if phases are nested

SLIDE 3

LULESH2 MPI Call Trace

while (locDom->time() < locDom->stoptime()) {
    TimeIncrement(*locDom);
    LagrangeLeapFrog(*locDom);
}

SLIDE 4

Automatic Phase Identification

  • Prior art (chosen completely at random)

    – IPS-2
    – Paradyn’s Performance Consultant

  • Key: Automatic identification is hard

– Rely on experts for annotations

SLIDE 5

Guided Phase Identification

Before annotation:

while (locDom->time() < locDom->stoptime()) {
    TimeIncrement(*locDom);
    LagrangeLeapFrog(*locDom);
}

After annotation:

while (locDom->time() < locDom->stoptime()) {
    cali::Annotation region1("tuner.communication");
    region1.begin();
    TimeIncrement(*locDom);
    region1.end();

    cali::Annotation region2("tuner.computation");
    region2.begin();
    LagrangeLeapFrog(*locDom);
    region2.end();
}
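The annotated loop above pairs every begin() with an explicit end(). A minimal sketch of the same idea using a scope guard follows, so that a phase is closed even on an early return. PhaseLog, ScopedPhase, run_timestep, and the outer "timestep" phase are hypothetical names invented for illustration. They are not part of Caliper or LULESH, and the log only records events instead of measuring anything.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an annotation library: it only records events.
class PhaseLog {
public:
    void begin(const std::string& name) {
        stack_.push_back(name);
        events_.push_back("begin " + path());
    }
    void end() {
        assert(!stack_.empty() && "end() without matching begin()");
        events_.push_back("end " + path());
        stack_.pop_back();
    }
    // Current context, outermost phase first, e.g. "timestep/tuner.communication".
    std::string path() const {
        std::string p;
        for (const auto& name : stack_) {
            if (!p.empty()) p += '/';
            p += name;
        }
        return p;
    }
    const std::vector<std::string>& events() const { return events_; }

private:
    std::vector<std::string> stack_;   // currently open phases
    std::vector<std::string> events_;  // begin/end history
};

// Scope guard: begins the phase on construction, ends it when the scope exits.
class ScopedPhase {
public:
    ScopedPhase(PhaseLog& log, const std::string& name) : log_(log) { log_.begin(name); }
    ~ScopedPhase() { log_.end(); }
    ScopedPhase(const ScopedPhase&) = delete;
    ScopedPhase& operator=(const ScopedPhase&) = delete;

private:
    PhaseLog& log_;
};

// Mimics one iteration of the annotated timestep loop.
inline void run_timestep(PhaseLog& log) {
    ScopedPhase step(log, "timestep");  // hypothetical outer phase
    {
        ScopedPhase comm(log, "tuner.communication");
        // TimeIncrement(*locDom) would run here.
    }
    {
        ScopedPhase comp(log, "tuner.computation");
        // LagrangeLeapFrog(*locDom) would run here.
    }
}
```

Because open phases are kept on a stack, the recorded context reads like a stack trace: the communication phase is logged as "timestep/tuner.communication" when it runs inside the outer phase.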

SLIDE 6

Performance Landscape

[Figure: an actual timeline shown alongside contextual timelines. Panel labels: 2.5KB per iteration, 3,700KB per iteration.]

SLIDE 7

Cross-Domain Analysis

  • Utilize experts during development

    – Library writers specify tuning variables
    – Application writers specify code regions
    – Phase dictates different performance context

      • Even though the same function is being called

“My application has three phases”

“I know what variables affect MPI performance”
“I know what variables affect BLAS performance”
“I know what variables affect FFTW performance”

SLIDE 8

Integration Work

  • Special annotation types identify:

    – Tunable variables
    – Code regions that should enable tuning

  • New Caliper tuning service

    – Listens for and reacts to special annotations
    – Calls Active Harmony to perform search
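The division of labor described above (annotations mark regions and tunable variables, and a service reacts to them and drives a search) can be sketched roughly as follows. Everything here is hypothetical: TuningService and PhaseSearch are invented names, not Caliper or Active Harmony APIs. The search strategy, which tries each candidate value once and keeps the cheapest, is a deliberately simple stand-in for whatever the real search does.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-phase search state. This sketch tries each candidate once;
// a real deployment would delegate the search to a tuning framework.
class PhaseSearch {
public:
    explicit PhaseSearch(std::vector<int> candidates)
        : candidates_(std::move(candidates)) {
        assert(!candidates_.empty());
        best_value_ = candidates_.front();
    }
    // Value the tunable variable should take for the next run of the phase.
    int next() const { return converged() ? best_value_ : candidates_[tried_]; }
    // Feed back the measured cost (e.g. elapsed seconds) of the last run.
    void report(double cost) {
        if (converged()) return;  // search finished: ignore further samples
        if (cost < best_cost_) {
            best_cost_ = cost;
            best_value_ = candidates_[tried_];
        }
        ++tried_;
    }
    bool converged() const { return tried_ == candidates_.size(); }

private:
    std::vector<int> candidates_;
    std::size_t tried_ = 0;
    int best_value_ = 0;
    double best_cost_ = std::numeric_limits<double>::infinity();
};

// Keeps an independent search for every annotated phase name.
class TuningService {
public:
    void register_phase(const std::string& phase, std::vector<int> candidates) {
        searches_.emplace(phase, PhaseSearch(std::move(candidates)));
    }
    // Called when a tunable region begins: returns the value to use.
    int on_phase_begin(const std::string& phase) const { return searches_.at(phase).next(); }
    // Called when the region ends: reports how expensive that run was.
    void on_phase_end(const std::string& phase, double cost) { searches_.at(phase).report(cost); }
    bool converged(const std::string& phase) const { return searches_.at(phase).converged(); }

private:
    std::map<std::string, PhaseSearch> searches_;
};
```

Keying the search state by phase name is what lets the same tunable variable settle on different values in different phases, even when the same library function is called in both.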

SLIDE 9

3D Fast Fourier Transform

  • FFT in 3 dimensions

    – Composed of three 1-dimensional FFTs
    – Data is redistributed among processes between FFTs

[Figure: 3D FFT pipeline with stages FFTz, FFTy, and FFTx, and blocking all-to-all exchanges A2A1 and A2A2.]
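The decomposition above, a 3D transform built from 1D transforms along each axis, can be checked on a small cube. The sketch below is a single-process illustration only: it uses a naive O(n^2) DFT in place of a real FFT library, and it omits the all-to-all redistribution that a distributed run needs between sweeps. Cube, dft1d, transform_axis, dft3d_by_axes, dft3d_direct, and max_abs_diff are all hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Dense nx*ny*nz cube stored with z varying fastest.
struct Cube {
    std::size_t nx, ny, nz;
    std::vector<cplx> data;
    Cube(std::size_t x, std::size_t y, std::size_t z)
        : nx(x), ny(y), nz(z), data(x * y * z) {}
};

// Naive O(n^2) 1D DFT, used instead of a real FFT to keep the sketch short.
std::vector<cplx> dft1d(const std::vector<cplx>& in) {
    const double pi = std::acos(-1.0);
    const std::size_t n = in.size();
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += in[t] * std::polar(1.0, -2.0 * pi * double(k * t) / double(n));
    return out;
}

// Apply the 1D transform to every line running along `axis` (0 = x, 1 = y, 2 = z).
void transform_axis(Cube& c, int axis) {
    const std::size_t dims[3] = {c.nx, c.ny, c.nz};
    const std::size_t strides[3] = {c.ny * c.nz, c.nz, 1};
    const std::size_t n = dims[axis], stride = strides[axis];
    const int a = (axis + 1) % 3, b = (axis + 2) % 3;  // the two other axes
    std::vector<cplx> line(n);
    for (std::size_t p = 0; p < dims[a]; ++p)
        for (std::size_t q = 0; q < dims[b]; ++q) {
            const std::size_t base = p * strides[a] + q * strides[b];
            for (std::size_t t = 0; t < n; ++t) line[t] = c.data[base + t * stride];
            line = dft1d(line);
            for (std::size_t t = 0; t < n; ++t) c.data[base + t * stride] = line[t];
        }
}

// 3D transform as three sweeps of 1D transforms: along z, then y, then x.
Cube dft3d_by_axes(Cube c) {
    transform_axis(c, 2);
    transform_axis(c, 1);
    transform_axis(c, 0);
    return c;
}

// Direct triple-sum definition of the 3D DFT, for checking the decomposition.
Cube dft3d_direct(const Cube& c) {
    const double pi = std::acos(-1.0);
    Cube out(c.nx, c.ny, c.nz);
    for (std::size_t u = 0; u < c.nx; ++u)
        for (std::size_t v = 0; v < c.ny; ++v)
            for (std::size_t w = 0; w < c.nz; ++w) {
                cplx sum = 0.0;
                for (std::size_t i = 0; i < c.nx; ++i)
                    for (std::size_t j = 0; j < c.ny; ++j)
                        for (std::size_t k = 0; k < c.nz; ++k) {
                            const double phase = double(u * i) / double(c.nx) +
                                                 double(v * j) / double(c.ny) +
                                                 double(w * k) / double(c.nz);
                            sum += c.data[(i * c.ny + j) * c.nz + k] *
                                   std::polar(1.0, -2.0 * pi * phase);
                        }
                out.data[(u * c.ny + v) * c.nz + w] = sum;
            }
    return out;
}

// Largest elementwise difference between two cubes of the same shape.
double max_abs_diff(const Cube& a, const Cube& b) {
    double m = 0.0;
    for (std::size_t i = 0; i < a.data.size(); ++i)
        m = std::max(m, std::abs(a.data[i] - b.data[i]));
    return m;
}
```

In a distributed run each process holds only part of the cube, so the data along an axis must be made local before its sweep can start. That is the role of the exchanges between the FFT stages.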

SLIDE 10

Computation/Communication Overlap

[Figure: two 3D FFT pipelines compared. Overlapped version: FFTz, FFTy1, FFTy2, and FFTx with non-blocking A2A1 and A2A2. Original version: FFTz, FFTy, and FFTx with blocking A2A1 and A2A2.]

SLIDE 11

Auto-tuning Opportunities

[Figure: tunable parameters in the overlapped pipeline. Panels: “FFTz & Pack” and “Unpack & FFTy1”. Labels: T1, T2, Px1, Py1, Ux1, Uz1, Ny / p2, Nz / p2; axes x, y, z. Stages: FFTz, FFTy1, FFTy2, and FFTx with non-blocking A2A1 and A2A2.]

SLIDE 12

Nested Phases

  • Block size during A2A transfer is tunable

    – Relatively independent from other variables
    – May be tuned as a nested sub-phase

  • Outer and inner phases run in tandem

SLIDE 13

Online Auto-Tuning

SLIDE 14

Offline Auto-Tuning Cost

SLIDE 15

Online vs. Offline Tuning

  • Improvements over offline tuning

    – Nested phases reduce search complexity
    – Search dimensions reduced from 24 to 16
    – 40% fewer search steps needed to converge
    – Equivalent performance after convergence

  • Eliminates need for training runs

    – Don’t allocate thousands of nodes to train
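One way to see why splitting variables into nested phases shrinks the search is to count configurations. The sketch below assumes the variable groups are independent enough to be searched separately, and it counts the points an exhaustive search would visit. Both are simplifying assumptions made for illustration: the helper names and the example sizes in the usage note are hypothetical, and the count does not model the search used in the talk or reproduce its measured 24-to-16 and 40% figures.

```cpp
#include <cstddef>
#include <vector>

// Points in the joint space: the product of the number of candidate values
// per variable.
std::size_t joint_points(const std::vector<std::size_t>& sizes) {
    std::size_t total = 1;
    for (std::size_t s : sizes) total *= s;
    return total;
}

// Points visited when each group of variables is searched as its own phase:
// the sum of the per-group products.
std::size_t nested_points(const std::vector<std::vector<std::size_t>>& groups) {
    std::size_t total = 0;
    for (const auto& g : groups) total += joint_points(g);
    return total;
}
```

For example, three hypothetical variables with 4, 4, and 8 candidate values give joint_points({4, 4, 8}) = 128 configurations. Moving the third variable into its own nested phase gives nested_points({{4, 4}, {8}}) = 16 + 8 = 24.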

SLIDE 16

Conclusion

  • Phases are key for HPC analysis tools

    – Rely on human guidance through annotations

  • Annotations unite cross-domain expertise

    – Libraries annotate variables to analyze
    – Applications annotate regions to analyze

  • Currently analyzing other HPC codes

    – HPGMG has natural phases to exploit
    – AMR codes are next in line