

SLIDE 1

XPACC

DOE/NNSA/ASC/PSAAPII: The Center for Exascale Simulation of Plasma-coupled Combustion

A DSL for Performance Orchestration

Thiago Teixeira, David Padua, William Gropp
Department of Computer Science, University of Illinois at Urbana-Champaign
Scalable Tools Workshop, July 2018

SLIDE 2

XPACC

  • The Center for Exascale Simulation of Plasma-Coupled Combustion
  • Developing a framework to leverage parallelism on exascale systems
  • Comprises Aerospace, Chemistry, CS, ECE, and Mechanical Engineering
  • Some of the tools being developed:

✦ Moya: Just-in-time recompilation
✦ Tangram: Compiler programming system for performance portability
✦ AMPI: Model for coarse-grained overdecomposition for load balancing
✦ PickPocket: Data relocation for efficient computation

[Figure: XPACC organization chart — the PlasComCM/2 application at the center, surrounded by CS, numerics, and physics research threads: annotations/ICE [Gropp/Padua], JIT compilation with Moya [Gropp], analysis with VectorSeeker [Padua], heterogeneous systems with Tangram [Hwu], overdecomposition with AMPI [Kale], overset meshes with Overkit, LANL-BoxMG, Leap, UQ/validation, sensitivity/adjoints, turbulence, plasma/laser breakdown, kinetics, and flame dynamics, annotated with the contributing faculty, staff, and students.]

SLIDE 3

Performance Optimization

  • Applications often target multiple complex systems → large optimization space
  • Compilers deliver unsatisfactory performance (-O3 is not enough)

No Optimization → No Performance

  • Hard to maintain and manage optimizations as the code evolves and new features are added
  • And, as optimizations are added, it becomes hard to maintain the code


SLIDE 4

Performance Optimization on HPC

  • Scientists make decisions based on maximizing scientific output, not the application's performance
  • They also want to control performance to the level needed, even sacrificing abstraction and ease of programming
  • A new technology that can coexist with older ones has a greater chance of success (e.g., MPI)
  • No complete buy-in at the beginning

✦ A barrier for new frameworks is that you can't integrate them incrementally

  • A risk-mitigation strategy is to let competing technologies coexist, but this is not always possible

Source: Understanding the High-Performance Computing Community: A Software Engineer’s Perspective. Victor Basili et al

SLIDE 5

Performance Optimization

  • How to handle all the required optimizations together for many different scenarios?
  • How to keep the code maintainable?
  • How to find the best sequence of optimizations?

SLIDE 6

Goal

  • No complete buy-in
  • Incremental adoption
  • Coexistence with other tools
  • Automatically finding the best sequence of optimizations and applying them without disrupting the original code is important to improve performance and keep the code maintainable

SLIDE 7

ICE

  • Illinois Coding Environment
  • Golden copy approach: a baseline version without architecture- or compiler-specific optimizations (no buy-in required)
  • Search combined with the application developer's expertise
  • Build-time, compile-time, and runtime optimizations
  • Non-prescriptive, gradual adoption, separation of concerns
  • Reuse of optimization tools already implemented

✦ Interfaces to simplify plugging in search and optimization tools

[Figure: ICE architecture — the optimization file, search engine, and golden copy drive plug-in tools such as Tangram, Moya, PIPS, and OpenMP, plus alternative transformations.]

SLIDE 8

ICE

  • Source code is annotated to define code regions
  • The optimization-file notation orchestrates the use of the optimization tools on the defined code regions
  • The interface provides operations on the source code to invoke optimizations by:

✦ Adding pragmas
✦ Adding labels
✦ Replacing code regions

  • These operations are used by the interface to plug in optimization tools
  • Most tools are source-to-source

✦ Tools must understand the output of previous tools
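To illustrate the annotation mechanism, here is a minimal Python sketch (hypothetical, not ICE's actual implementation) that locates the loop nest following a `#pragma @ICE loop=<id>` annotation in C source and inserts a tool-specific pragma in front of it:

```python
import re

def extract_region(source, region_id):
    """Return (start, end) line indices of the loop nest that follows
    a '#pragma @ICE loop=<region_id>' annotation."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if re.search(r'#pragma\s+@ICE\s+loop=%s\b' % re.escape(region_id), line):
            start = i + 1          # region begins on the next line
            depth = 0
            for j in range(start, len(lines)):
                depth += lines[j].count('{') - lines[j].count('}')
                if depth == 0 and j > start:
                    return start, j  # braces balanced: region ends here
    raise KeyError('no region named %r' % region_id)

def add_pragma(source, region_id, pragma):
    """Insert a tool pragma between the @ICE annotation and its loop
    (one of the interface operations described above)."""
    start, _ = extract_region(source, region_id)
    lines = source.splitlines()
    lines.insert(start, pragma)
    return '\n'.join(lines)
```

A tool plug-in built on such operations could, for example, prepend `#pragma omp parallel for` to a region before handing the file to the next source-to-source tool.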

SLIDE 9

[Figure: ICE workflow — a parser front end reads the optimization-language file and the source code (Fortran/C/C++); the back end generates code variants, selects one, evaluates it, and keeps the best.]

  • Parses the original code
  • Extracts the code regions
  • Traverses the optimization space
  • Uses machine-learning methods to select variants
  • Empirically evaluates variants and keeps the best one

Plug-in tools: RoseLoops, PIPS, OpenMP/OpenACC, Clay, Moya

SLIDE 10

Optimization Tools

  • PIPS (MINES ParisTech)

✦ Code optimization tool based on the polyhedral framework

  • Moya (Tarun Prabhu/UIUC)

✦ Runtime optimizations

  • Clay (Joël Poudroux, Oleksandr Zinenko)

✦ Loop transformations using the polyhedral framework

  • OpenMP

✦ Parallelization of code regions using pragmas

  • RoseLoop

✦ Loop transformations based on the Rose compiler infrastructure

  • Altdesc

✦ Replacement of code regions (e.g., with hand-optimized ones)

SLIDE 11

Search Methods

  • Optimization space: optimizations, their parameters, and the software stack (compiler version, flags, libraries)
  • It cannot be exhaustively traversed (gcc's flags alone admit about 10^806 configurations)
  • Complex space that requires different search techniques
  • OpenTuner (Jason Ansel et al.)

✦ A meta technique controls the use of the other techniques (e.g., round robin, random, AUC bandit)
✦ Multi-armed bandit problem: deciding which levers to pull, in which order, and how many times, on a slot machine with many arms of unknown payout probability

  • Spearmint (Jasper Snoek et al.)

✦ Bayesian Optimization of Machine Learning Algorithms (NIPS'12)
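The bandit strategy can be pictured with a small epsilon-greedy sketch in Python (OpenTuner's actual AUC bandit is more elaborate; the technique names and rewards below are made up for illustration):

```python
import random

def bandit_search(techniques, trials, epsilon=0.2, seed=0):
    """Pick a search technique per trial: usually the best so far
    (exploit), sometimes a random one (explore). Each technique is a
    zero-argument callable returning the measured reward (e.g., the
    speedup of the variant it proposed)."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in techniques}
    counts = {name: 0 for name in techniques}
    for _ in range(trials):
        if rng.random() < epsilon or not any(counts.values()):
            name = rng.choice(list(techniques))   # explore
        else:                                     # exploit best average
            name = max(totals, key=lambda n: totals[n] / max(counts[n], 1))
        totals[name] += techniques[name]()
        counts[name] += 1
    return counts
```

Over many trials the selector concentrates its budget on whichever technique has been paying off, while still occasionally sampling the others.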

SLIDE 12

Optimization Language

  • Domain-specific (easier and more straightforward)
  • Exposes the optimization space

✦ What sequences of optimizations to evaluate?
✦ What are the best parameters?

  • Controls the use of the optimization tools
  • Records the steps to efficient code
  • It can be shared with others or shipped along with the deployment and installation

SLIDE 13

ICE

  • Annotations in Fortran

Block:
    ! @ICE block=b1
    …
    ! @ICE endblock

Loop:
    ! @ICE loop=l1
    DO i = 1, n
    …
    END DO

  • Annotations in C/C++

Block:
    #pragma @ICE block=<id>
    …
    #pragma @ICE endblock

Loop:
    #pragma @ICE loop=<id>
    for(…) {
    …
    }

SLIDE 14

ICE

  • Optimization file (extended .yaml)
  • Direct
  • Search
  • Layout: <preamble commands>, then per region:

    <block/loop id>: <ref>
        <commands>[*+?] ...

Example:

    compilers: [gcc, icc]
    # Build command before compilation
    prebuildcmd:
    # Compilation command before tests
    buildcmd: make clean all
    # Command call for each test
    runcmd: time ./run.sh
    search: on   # or off
    memoryBound: &id01
      - unroll:
          loop: 3
          factor: 4
      - tile:
          loop: 8
          factor: 1
    example1: *id01
      - runtime:
    example2:
      - altdesc: ./opt2/*.opt
    sc2:
      - altdesc: ./opt2/*.opt
    ...

  • <commands>+ : 1 or more in the combinations
  • <commands>* : 0 or more in the combinations
  • <commands>? : 0 or 1
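The +, *, and ? quantifiers define a space of command sequences per region. A hedged sketch of how that space could be enumerated (illustrative only; bounding repetition with a `max_repeat` parameter is an assumption, since ICE's actual bound is not stated here):

```python
from itertools import product

def expand(commands, max_repeat=2):
    """Enumerate command sequences for a region.
    Each command is (name, quantifier): '?' -> 0 or 1 use,
    '+' -> 1..max_repeat uses, '*' -> 0..max_repeat uses,
    ''  -> exactly one use."""
    choices = []
    for name, q in commands:
        if q == '?':
            counts = (0, 1)
        elif q == '+':
            counts = tuple(range(1, max_repeat + 1))
        elif q == '*':
            counts = tuple(range(0, max_repeat + 1))
        else:
            counts = (1,)
        # one alternative per repetition count for this command
        choices.append([(name,) * c for c in counts])
    # Cartesian product over commands, concatenated into flat sequences
    return [sum(combo, ()) for combo in product(*choices)]
```

Each resulting tuple is one candidate sequence of optimizations for the search to build and measure.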

SLIDE 15

ICE

  • Loop Interchange

    L1:
      - interchange:
          order: 2,1,0

  • Loop indices in the nest start from 0
  • Order accepts * to generate all combinations
  • Order accepts | to select specific combinations
  • Example: 2,1,0|1,0,2

Before:
    ! @ICE loop=L1
    do i = 1,n
      do j = 1,m
        a(i) = a(i) + b(i,j) * c(j)
      end do
    end do

After:
    ! @ICE loop=L1
    do j = 1,m
      do i = 1,n
        a(i) = a(i) + b(i,j) * c(j)
      end do
    end do
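A small Python sketch of how such `order` values could be expanded into candidate permutations (illustrative, not ICE's actual parser):

```python
from itertools import permutations

def expand_order(spec, depth):
    """Expand an interchange 'order' specification.
    '*'           -> every permutation of a depth-deep nest
    'a,b,c|d,e,f' -> only the listed alternatives"""
    if spec.strip() == '*':
        return list(permutations(range(depth)))
    return [tuple(int(x) for x in alt.split(','))
            for alt in spec.split('|')]
```

The search then instantiates one interchanged variant of the nest per permutation and measures each.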

SLIDE 16

ICE

  • Loop Unroll

    L2:
      - unroll:
          loop: 2
          factor: 3
    ...

  • Loop indices in the nest start from 1
  • Factor accepts .. to generate a range
  • Example: 2..10

Before:
    ! @ICE loop=L2
    do i = 1,n
      do j = 1,m
        a(i) = a(i) + b(i,j) * c(j)
      end do
    end do

After:
    ! @ICE loop=L2
    do i = 1,n
      do j = 1,m,3
        a(i) = a(i) + b(i,j) * c(j)
        a(i) = a(i) + b(i,j+1) * c(j+1)
        a(i) = a(i) + b(i,j+2) * c(j+2)
      end do
    end do
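The unrolled variant above assumes the trip count m is divisible by the factor; in general a remainder loop is also needed. A quick Python check of the equivalence (illustrative only, mirroring the Fortran loops):

```python
def matvec_plain(a, b, c, n, m):
    """Reference version: a(i) += b(i,j) * c(j) over the full nest."""
    for i in range(n):
        for j in range(m):
            a[i] += b[i][j] * c[j]

def matvec_unrolled(a, b, c, n, m, factor=3):
    """Inner loop unrolled by 'factor', with a remainder loop for
    trip counts not divisible by the factor."""
    for i in range(n):
        j = 0
        while j + factor <= m:        # unrolled main loop
            for k in range(factor):
                a[i] += b[i][j + k] * c[j + k]
            j += factor
        while j < m:                  # remainder loop
            a[i] += b[i][j] * c[j]
            j += 1
```

Both versions accumulate identical results; only the loop structure (and hence the instruction-level parallelism exposed to the compiler) differs.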

SLIDE 17

Cache Option

  • The best variant found for each code region is saved in a cache
  • Consistency: a hash of the golden copy's code region that it was based on is saved along with the best variant
  • This lets users take advantage of the efficient generated code while avoiding installation of the whole tool chain
  • The cache can be shipped with the application and invoked according to the machine/system used
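The consistency check can be pictured with a short Python sketch (the hashing scheme and cache layout here are assumptions for illustration, not ICE's actual format):

```python
import hashlib

def region_hash(region_source):
    """Hash a golden-copy code region (ignoring trailing whitespace)
    so the cache can detect when the region has changed."""
    canonical = '\n'.join(line.rstrip() for line in region_source.splitlines())
    return hashlib.sha256(canonical.encode()).hexdigest()

def lookup(cache, region_id, region_source):
    """Return the cached best variant only if the golden copy is
    unchanged; otherwise the region must be re-tuned."""
    entry = cache.get(region_id)
    if entry and entry["hash"] == region_hash(region_source):
        return entry["variant"]
    return None
```

Any edit to the golden copy invalidates the cached variant for that region, so stale optimized code is never used silently.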

SLIDE 18

Evaluation

  • Matrix Multiplication

    #pragma @ICE loop=matmul
    for (i = 0; i < matSize; i++)
      for (j = 0; j < matSize; j++) {
        for (k = 0; k < matSize; k++) {
          matC[i][j] += matA[i][k] * matB[k][j];
        }
      }

Optimization file:

    —
    # Build command before compilation
    prebuildcmd:
    # Compilation command before tests
    buildcmd: make realclean; make
    # Command call for each test
    runcmd: ./mmc
    matmul:
      - Pips.tiling+:
          loop: 1
          factor: [2..512, 2..512, 2..512]
      - Pips.tiling*:
          loop: 4
          factor: [8, 16, 8]
      - OpenMP.OMPFor*:
          loop: 1
    …

SLIDE 19

Evaluation

  • Matrix Multiplication

    #pragma @ICE loop=matmul
    for (i = 0; i < matSize; i++)
      for (j = 0; j < matSize; j++) {
        for (k = 0; k < matSize; k++) {
          matC[i][j] += matA[i][k] * matB[k][j];
        }
      }

Optimization file:

    —
    # Build command before compilation
    prebuildcmd:
    # Compilation command before tests
    buildcmd: make realclean; make
    # Command call for each test
    runcmd: ./mmc
    matmul:
      - Pips.tiling+:
          loop: 1
          factor: [2..512, 2..512, 2..512]
      - Pips.tiling+:
          loop: 4
          factor: [8, 16, 8]
      - OpenMP.OMPFor+:
          loop: 1
    …

Resulting variant (annotated source + optimization file):

    #pragma omp parallel for schedule(static,1) \
        private(i_t, k_t, j_t, i_t_t, k_t_t, j_t_t, i, k, j)
    for (i_t = 0; i_t <= 127; i_t += 1)
      for (k_t = 0; k_t <= 127; k_t += 1)
        for (j_t = 0; j_t <= 3; j_t += 1)
          for (i_t_t = 4 * i_t; i_t_t <= ((4 * i_t) + 3); i_t_t += 1)
            for (k_t_t = 2 * k_t; k_t_t <= ((2 * k_t) + 1); k_t_t += 1)
              for (j_t_t = 32 * j_t; j_t_t <= ((32 * j_t) + 31); j_t_t += 1)
                for (i = 4 * i_t_t; i <= ((4 * i_t_t) + 3); i += 1)
                  for (k = 8 * k_t_t; k <= ((8 * k_t_t) + 7); k += 1)
                    for (j = 16 * j_t_t; j <= ((16 * j_t_t) + 15); j += 1)
                      matC[i][j] += matA[i][k] * matB[k][j];

SLIDE 20

Matrix Multiplication

[Figure: execution time (ms) and speedup vs. number of CPU cores (1–10) for Pluto and ICE; 2048^2 elements, icc 17.0.1, Intel E5-2660 v3, Pluto pet branch.]

  • Two levels of tiling + OpenMP
  • Original version: 78,825 ms
  • 98x speedup (1 core)
  • 694x speedup (10 cores)
  • Avg 2.2x speedup over Pluto
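Speedup here is simply baseline time divided by optimized time, so the per-variant execution times implied by the reported speedups can be recovered directly:

```python
def implied_time_ms(baseline_ms, speedup):
    """Optimized execution time implied by a reported speedup."""
    return baseline_ms / speedup

# Reported: 78,825 ms original; 98x on 1 core, 694x on 10 cores.
one_core = implied_time_ms(78825, 98)    # roughly 804 ms
ten_core = implied_time_ms(78825, 694)   # roughly 114 ms
```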
SLIDE 21

Kripke

  • 3D Sn deterministic particle transport code
  • Proxy app for the LLNL transport code ARDRA
  • 6 data layouts of the angular fluxes (Psi) and 6 hand-tuned versions

*Developed by Adam J. Kunen, Peter N. Brown, Teresa S. Bailey, and Peter G. Maginot. Source: https://codesign.llnl.gov/kripke.php

Data layouts: DGZ, DZG, ZDG, ZGD, GDZ, GZD (+ ICE)

SLIDE 22

Experiments

  • LLNL Kripke + ICE

✦ Optimizations chosen according to the selected data layout
✦ Loop transformations + inlined data-layout accesses
✦ 10–30% slower than the hand-optimized versions, but a 6–8x speedup over the baseline
✦ Other optimizations are being tested to close the gap to hand-optimized performance

SLIDE 23

Conclusions

  • To harness all the computing power available in current and future architectures, it is necessary to apply architecture-specific optimizations to the source code
  • ICE:

✦ Separation of concerns (opt file)
✦ Coexistence with other tools
✦ Gradual adoption
✦ Empirical search + developer knowledge

  • Golden copy: the developer can focus on the problem
  • Simple and easy for programmers to use
  • Hard to get the tools to work, though!

SLIDE 24

This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.