Finding and Optimizing Phases in Parallel Programs



SLIDE 1

Finding and Optimizing Phases in Parallel Programs

Jeffrey K. Hollingsworth <hollings@cs.umd.edu> Ray Chen <rchen@cs.umd.edu>

SLIDE 2

Phases of UMD CS

  • Computer & Space Sciences: 1962-1987
  • AV Williams: 1987-2018
  • Iribe Center: 2018-

CS@UMD: Future is Exciting

  • Largest major on campus (over 2,800 undergrads, plus 400+ in Computer Engineering)
  • New building in 2018
  • Hiring O(10) new faculty in a couple of years
  • New Big Data master's and certificate programs

SLIDE 3

Motivation

  • HPC programs often contain “phases”
      – Dynamic execution context
      – Each has distinct performance traits
  • Particularly problematic if inside a time-step loop
      – Short phases confound tools
      – Difficult to analyze a rapidly changing landscape
      – Worse if phases are nested

SLIDE 4

LULESH2 MPI Call Trace

    while (locDom->time() < locDom->stoptime()) {
        TimeIncrement(*locDom);
        LagrangeLeapFrog(*locDom);
    }

SLIDE 5

Automatic Phase Identification

  • My Failed Prior Attempts
      – IPS-2 (c. 1990)
      – Paradyn’s Performance Consultant (c. 1995)
  • Solution
      – Automatic identification is hard, rely on experts for annotations
      – Create virtual phases by stitching short ones together
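One way to picture the stitching step: every short instance of a labeled region is folded into a single per-label record, so a tool sees one long virtual phase instead of many brief ones. The sketch below is illustrative only. `PhaseInstance`, `VirtualPhase`, and `stitch` are hypothetical names, not part of Caliper or any other tool named in these slides.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One short phase instance as a tool might record it (hypothetical record type).
struct PhaseInstance {
    std::string label;  // e.g. "tuner.communication"
    double begin_sec;   // wall-clock time at region begin
    double end_sec;     // wall-clock time at region end
};

// Aggregate view of all instances that share a label.
struct VirtualPhase {
    int instances = 0;
    double total_sec = 0.0;
};

// Stitch short instances together: all instances with the same label
// are treated as one longer "virtual" phase.
std::map<std::string, VirtualPhase>
stitch(const std::vector<PhaseInstance>& trace) {
    std::map<std::string, VirtualPhase> phases;
    for (const PhaseInstance& p : trace) {
        assert(p.end_sec >= p.begin_sec);
        VirtualPhase& v = phases[p.label];
        v.instances += 1;
        v.total_sec += p.end_sec - p.begin_sec;
    }
    return phases;
}
```

With a time-step loop that alternates two labeled regions, the result holds two entries, each summing the time of every iteration's instance.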

SLIDE 6

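Guided Phase Identification

Original time-step loop:

    while (locDom->time() < locDom->stoptime()) {
        TimeIncrement(*locDom);
        LagrangeLeapFrog(*locDom);
    }

With Caliper annotations marking each phase:

    // Requires #include <caliper/Annotation.h>
    while (locDom->time() < locDom->stoptime()) {
        // Mark the time-increment step as the communication phase.
        cali::Annotation region1("tuner.communication");
        region1.begin();
        TimeIncrement(*locDom);
        region1.end();

        // Mark the Lagrange leapfrog step as the computation phase.
        cali::Annotation region2("tuner.computation");
        region2.begin();
        LagrangeLeapFrog(*locDom);
        region2.end();
    }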

SLIDE 7

Performance Landscape

[Figure: an actual timeline shown alongside contextual timelines. Labels: 2.5KB per iteration, 3,700KB per iteration.]

SLIDE 8

Cross-Domain Analysis

  • Utilize experts during development
      – Library writers specify tuning variables
      – Application writers specify code regions
      – Phase dictates different performance context
          • Even though the same function is being called

Application writer:

  • “My application has three phases”

Library writers:

  • “I know what variables affect MPI performance”
  • “I know what variables affect BLAS performance”
  • “I know what variables affect FFTW performance”

SLIDE 9

Integration Work

  • Special annotation types identify:
      – Tunable variables
      – Code regions that should enable tuning
  • New Caliper tuning service
      – Listens for and reacts to special annotations
      – Calls Active Harmony to perform search
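The listen-and-react pattern can be sketched as a small per-phase tuner: a region-begin event asks for the value to try, and a region-end event reports the measured cost. Everything here is an assumption made for illustration. `PhaseState`, `PhaseTuner`, `on_begin`, and `on_end` are hypothetical and are not the Caliper or Active Harmony API. A plain exhaustive sweep over a candidate list stands in for the real search strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Search state kept separately for every phase label (hypothetical).
struct PhaseState {
    std::size_t next = 0;     // index of the next untested candidate
    std::size_t current = 0;  // index handed out by the last on_begin
    std::size_t best = 0;     // index of the fastest candidate so far
    double best_sec = std::numeric_limits<double>::infinity();
};

// Hypothetical tuner for one integer variable; NOT a real library API.
class PhaseTuner {
public:
    explicit PhaseTuner(std::vector<int> candidates)
        : candidates_(std::move(candidates)) {
        assert(!candidates_.empty());
    }

    // React to a region-begin event: return the value to use now.
    int on_begin(const std::string& phase) {
        PhaseState& s = states_[phase];
        // Try untested candidates first, then stay on the best one.
        s.current = (s.next < candidates_.size()) ? s.next : s.best;
        return candidates_[s.current];
    }

    // React to a region-end event: record the measured cost.
    void on_end(const std::string& phase, double elapsed_sec) {
        PhaseState& s = states_[phase];
        if (s.next < candidates_.size()) {
            if (elapsed_sec < s.best_sec) {
                s.best_sec = elapsed_sec;
                s.best = s.current;
            }
            ++s.next;
        }
    }

private:
    std::vector<int> candidates_;
    std::map<std::string, PhaseState> states_;
};
```

Because state is keyed by phase label, two phases that call the same tunable code can settle on different values.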

SLIDE 10

3D Fast Fourier Transform

  • FFT in 3 dimensions
      – Composed of three 1-dimensional FFTs
      – Data is redistributed among processes between FFTs

[Figure: blocking pipeline with stages FFTz, A2A1 (blocking), FFTy, A2A2 (blocking), FFTx]
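The decomposition into three 1-dimensional transforms can be sketched serially. This is a minimal illustration, not the parallel code from the talk: a naive O(n^2) DFT stands in for a fast 1-D FFT, the array lives in one process, and strided access replaces the all-to-all redistribution. The names `dft1d` and `dft3d` and the memory layout are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive 1-D DFT over n elements beginning at `start`, `stride` apart.
void dft1d(std::vector<cplx>& a, std::size_t start,
           std::size_t stride, std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            double angle = -2.0 * pi * double(k * j) / double(n);
            sum += a[start + j * stride] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    for (std::size_t k = 0; k < n; ++k) a[start + k * stride] = out[k];
}

// 3-D transform of an nx*ny*nz array stored as a[(x*ny + y)*nz + z].
void dft3d(std::vector<cplx>& a, std::size_t nx, std::size_t ny, std::size_t nz) {
    assert(a.size() == nx * ny * nz);
    // Pass 1 (FFTz): lines along z are contiguous.
    for (std::size_t x = 0; x < nx; ++x)
        for (std::size_t y = 0; y < ny; ++y)
            dft1d(a, (x * ny + y) * nz, 1, nz);
    // Pass 2 (FFTy): lines along y have stride nz.
    for (std::size_t x = 0; x < nx; ++x)
        for (std::size_t z = 0; z < nz; ++z)
            dft1d(a, x * ny * nz + z, nz, ny);
    // Pass 3 (FFTx): lines along x have stride ny*nz.
    for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t z = 0; z < nz; ++z)
            dft1d(a, y * nz + z, ny * nz, nx);
}
```

In a distributed run, each process typically holds only part of the array, so the strided passes become local transforms separated by data exchanges among processes.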

SLIDE 11

Computation/Communication Overlap

[Figure: two pipelines compared. Blocking: FFTz, A2A1 (blocking), FFTy, A2A2 (blocking), FFTx. Overlapped: FFTy is split into FFTy1 and FFTy2, and both exchanges, A2A1 and A2A2, are non-blocking.]
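Why overlap helps can be shown with a small cost model. The sketch below is an idealized two-stage pipeline, not a measurement and not the talk's implementation: work is split into tiles, a tile's exchange may start once its compute finishes, exchanges run one at a time, and compute for later tiles proceeds while an exchange is in flight. The function names and the per-tile cost vectors are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Blocking schedule: compute a tile, then wait for its exchange to
// finish before starting anything else.
double blocking_time(const std::vector<double>& compute,
                     const std::vector<double>& comm) {
    assert(compute.size() == comm.size());
    double t = 0.0;
    for (std::size_t i = 0; i < compute.size(); ++i) t += compute[i] + comm[i];
    return t;
}

// Overlapped schedule: compute runs back to back, and each tile's
// exchange starts as soon as both its compute and the previous
// exchange have finished.
double overlapped_time(const std::vector<double>& compute,
                       const std::vector<double>& comm) {
    assert(compute.size() == comm.size());
    double compute_done = 0.0;
    double comm_done = 0.0;
    for (std::size_t i = 0; i < compute.size(); ++i) {
        compute_done += compute[i];
        comm_done = std::max(compute_done, comm_done) + comm[i];
    }
    return comm_done;
}
```

In this model the tile count and tile sizes change how much of the exchange is hidden, which is one reason such parameters are natural tuning variables.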

SLIDE 12

Auto-tuning Opportunities

[Figure: overlapped pipeline with stages FFTz, A2A1 (non-blocking), FFTy1, FFTy2, A2A2 (non-blocking), FFTx, annotated with tuning parameters. Labels: T1, T2; “FFTz & Pack” with Px1, Py1 on x and y axes (T1, Ny / p2); “Unpack & FFTy1” with Ux1, Uz1 on x and z axes (T1, Nz / p2).]

SLIDE 13

Online Auto-Tuning


SLIDE 14

Phase Aware Tuning

  • Improvements over offline (non-phase) tuning
      – Reduce search dimensions from 24 to 16
      – 40% fewer search steps needed to converge
      – Equivalent performance after convergence
  • Eliminates need for training runs
      – Don’t allocate thousands of nodes to train

SLIDE 15

Offline Auto-Tuning Cost


SLIDE 16

Conclusion

  • Phases are key for HPC analysis tools
      – Rely on human guidance through annotations
      – Virtualizing repeated phases helps many types of tools
  • Annotations unite cross-domain expertise
      – Libraries annotate variables to analyze
      – Applications annotate regions to analyze
  • Currently analyzing other HPC codes