Case Studies in Using a DSL and Task Graphs for Portable Reacting Flow Simulations



SLIDE 1

Case Studies in Using a DSL and Task Graphs for Portable Reacting Flow Simulations

JAMES C. SUTHERLAND

Associate Professor - Chemical Engineering

TONY SAAD

Assistant Professor - Chemical Engineering

SLIDE 2

Acknowledgments

  • BABAK GOSHAYESHI: Research Staff
  • MIKE HANSEN, JOSH MCCONNELL: Ph.D. Students
  • CHRISTOPHER EARL: Postdoctoral Researcher (now at LLNL)
  • ABHISHEK BAGUSETTY, DEVIN ROBISON, MICHAEL BROWN: M.S. Students

XPS award 1337145, DE-NA0002375, DE-NA-000740, DE-SC0008998

SLIDE 3

Nebo (E)DSL: “Matlab for PDEs on Supercomputers”

Field & stencil operations:

    rhs <<= -divOpX( xConvFlux + xDiffFlux )
            -divOpY( yConvFlux + yDiffFlux )
            -divOpZ( zConvFlux + zDiffFlux );

$$\mathrm{rhs} = -\frac{\partial}{\partial x}(J_x + C_x) - \frac{\partial}{\partial y}(J_y + C_y) - \frac{\partial}{\partial z}(J_z + C_z)$$

Can “chain” stencil operations where necessary.

Auto-generate code for efficient execution on CPU, GPU, Xeon Phi, etc. during compilation.

[Figure: expressiveness vs. efficiency, comparing C++, Matlab, and the DSL]

  • Stencils: >150 natively supported stencil operations (easily extensible)
  • cond: “vectorized if”
  • Arbitrary composition of operations
  • Masked assignment (perform operations on a defined subset of points)
  • Portable: same code works for CPU, multicore, GPU execution
  • Embedded in C++ → “plays well with others”

Earl, C., Might, M., Bagusetty, A., & Sutherland, J. C., Journal of Systems and Software (2016).
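
For readers unfamiliar with stencil notation, the loops below are a hand-written 1D analogue of the `rhs <<= ...` statement and of masked assignment. This is a plain C++ sketch, not Nebo code: `divergence1D`, `rhs1D`, and `assignMasked` are hypothetical names, and a uniform grid with face-centered fluxes is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Nebo): divergence of face-centered fluxes
// on a uniform 1D grid. n cells have n+1 faces; cell i sits between faces i and i+1.
std::vector<double> divergence1D(const std::vector<double>& faceFlux, double dx) {
    assert(faceFlux.size() >= 2 && dx > 0.0);
    std::vector<double> div(faceFlux.size() - 1);
    for (std::size_t i = 0; i < div.size(); ++i) {
        div[i] = (faceFlux[i + 1] - faceFlux[i]) / dx;
    }
    return div;
}

// rhs = -d/dx (convFlux + diffFlux): the 1D counterpart of the slide's expression.
std::vector<double> rhs1D(const std::vector<double>& convFlux,
                          const std::vector<double>& diffFlux, double dx) {
    assert(convFlux.size() == diffFlux.size());
    std::vector<double> total(convFlux.size());
    for (std::size_t i = 0; i < total.size(); ++i) {
        total[i] = convFlux[i] + diffFlux[i];
    }
    std::vector<double> rhs = divergence1D(total, dx);
    for (double& r : rhs) {
        r = -r;
    }
    return rhs;
}

// Masked assignment: overwrite only the listed cell indices.
void assignMasked(std::vector<double>& field,
                  const std::vector<std::size_t>& mask, double value) {
    for (std::size_t idx : mask) {
        assert(idx < field.size());
        field[idx] = value;
    }
}
```

A DSL such as Nebo lets the user write the one-line field expression and leaves the loop structure, and its CPU or GPU realization, to the generated code.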

SLIDE 4

The Power of Task Graphs

[Figure: dependency graph with nodes u, τ, Γ, T, p, and y_i, where Γ = Γ(T, p, y_i); legend: direct (expressed) dependencies and indirect (discovered) dependencies]

Register all expressions

  • Each “expression” calculates one or more field quantities.
  • Each expression advertises its direct dependencies.

Set a “root” expression; construct a graph

  • All dependencies are discovered/resolved automatically.
  • Highly localized influence of changes in models.
  • Not all expressions in the registry may be relevant/used.

From the graph:

  • Deduce storage requirements & allocate memory (externally to each expression).
  • Automatically schedule evaluation, ensuring proper ordering.
  • Robust scheduling algorithms are key.

[Figure: expression registry, with entries including ρ, φ, and s_φ]

*Notz, Pawlowski, & Sutherland (2012). ACM Transactions on Mathematical Software, 39(1).
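
The register / set-root / schedule workflow on this slide can be sketched with a depth-first topological sort. This is a minimal illustration, not the authors' library: `Registry`, `visit`, and `schedule` are hypothetical names, and an expression is reduced to a name plus the names it directly depends on.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical registry: expression name -> names of its direct dependencies.
using Registry = std::map<std::string, std::vector<std::string>>;

// Depth-first visit that appends a node only after all of its dependencies.
void visit(const Registry& reg, const std::string& name,
           std::set<std::string>& done, std::set<std::string>& active,
           std::vector<std::string>& order) {
    if (done.count(name) != 0) return;
    if (!active.insert(name).second) {
        throw std::runtime_error("cyclic dependency at: " + name);
    }
    const auto it = reg.find(name);
    if (it == reg.end()) {
        throw std::runtime_error("unregistered expression: " + name);
    }
    for (const std::string& dep : it->second) {
        visit(reg, dep, done, active, order);
    }
    active.erase(name);
    done.insert(name);
    order.push_back(name);
}

// Evaluation order for everything the root needs, directly or indirectly.
// Registry entries that the root never reaches are left out of the schedule.
std::vector<std::string> schedule(const Registry& reg, const std::string& root) {
    std::set<std::string> done;
    std::set<std::string> active;
    std::vector<std::string> order;
    visit(reg, root, done, active, order);
    return order;
}
```

For example, with an illustrative registry in which `tau` depends on `Gamma` and `u`, and `Gamma` depends on `T`, `p`, and `y`, calling `schedule(reg, "tau")` places `T`, `p`, and `y` before `Gamma` and `Gamma` before `tau`, while any registered expression that `tau` does not reach is skipped. The returned order is also what a storage pass would walk to decide which fields to allocate.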

SLIDE 5

Changes in model form are naturally handled

Pure substance heat flux: $q = \lambda \nabla T$

[Figure: dependency graph in which q depends on λ and T]

SLIDE 6

Changes in model form are naturally handled

Multi-species mixture heat flux: $q = \lambda \nabla T + \sum_{i=1}^{n} h_i J_i$

[Figure: dependency graph with nodes q, λ, T, h_1 … h_n, J_1 … J_n, and y_1 … y_n]

No complex logic changes are needed in the code when models are added or changed.

SLIDE 7

“Modifiers” — injecting new dependencies

Motivation:

  • Boundary conditions: modify a subset of the computed values.
  • Multiphase coupling: add source terms to RHS of equations.

[Figure: graph with nodes A, B, and C]

SLIDE 8

“Modifiers” — injecting new dependencies


Modifiers allow “push” rather than “pull” dependency addition. Modifiers are deployed after the node they are attached to, and are provided a handle to the field just computed.

[Figure: graph with nodes A, B, and C, with BC1 and S1 added]

SLIDE 9

“Modifiers” — injecting new dependencies

Modifiers can introduce new dependencies to the graph.

[Figure: graph with nodes A, B, and C, with BC1 and S1 added, plus new nodes D, E, and F]
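
The “push” behavior of modifiers can be illustrated with a node that runs its own computation and then hands the field to each attached modifier. This is a sketch under stated assumptions: `Field`, `Node`, and `makeExampleNode` are hypothetical names, a field is a flat vector, and the graph-level step of registering a modifier's own dependencies is omitted.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Field = std::vector<double>;

// Hypothetical node: a compute step plus modifiers that run after it.
struct Node {
    std::function<void(Field&)> compute;
    std::vector<std::function<void(Field&)>> modifiers;

    // Evaluate the node, then pass the freshly computed field to each modifier in order.
    void evaluate(Field& field) const {
        assert(static_cast<bool>(compute));
        compute(field);
        for (const auto& modify : modifiers) {
            modify(field);
        }
    }
};

// Example node with a source-term modifier followed by a boundary-condition modifier.
Node makeExampleNode(double source, double boundaryValue) {
    Node node;
    // Base computation: fill the field with a constant.
    node.compute = [](Field& f) {
        for (double& v : f) v = 1.0;
    };
    // Source term: add a contribution at every point.
    node.modifiers.push_back([source](Field& f) {
        for (double& v : f) v += source;
    });
    // Boundary condition: overwrite only the first and last points.
    node.modifiers.push_back([boundaryValue](Field& f) {
        if (!f.empty()) {
            f.front() = boundaryValue;
            f.back() = boundaryValue;
        }
    });
    return node;
}
```

The node itself never names its modifiers in its dependency list; they are attached from outside, which is what distinguishes this from the usual “pull” style.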

SLIDE 10

Example: PoKiTT (Portable Kinetics, Thermodynamics & Transport)

$$\rho \frac{\partial y_i}{\partial t} = \nabla \cdot J_i + s_i, \qquad \rho \frac{\partial h}{\partial t} = \nabla \cdot q_i$$

Triple flame computed on GPU with PoKiTT

  • Detailed kinetics
  • Mixture-averaged transport
  • Detailed thermodynamics

Yonkee & Sutherland, SIAM Journal on Scientific Computing (2016)

SLIDE 11

Example: PoKiTT

Triple flame computed on GPU with PoKiTT

  • 32 PDEs
  • 256^2 grid points
  • 8 million timesteps
  • 8 days on 1 GPU (~5 months on 1 CPU core)

[Figure: speedup (axis 6 to 30) for 12 cores and for the GPU at 256^2, 512^2, and 1024^2; bar labels as extracted: 30, 5, 27, 5, 18.2, 2.4]

Yonkee & Sutherland, SIAM Journal on Scientific Computing (2016)

SLIDE 12

Titan: Hybrid Low Mach Algorithm

[Figure: weak scaling, mean time per timestep (0.01 s to 100 s) vs. number of GPUs (1 to 12800; also the number of Titan nodes, 1 GPU per Titan node); series: 16^3, 32^3, 64^3, 128^3]

Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

Everything on GPU except Poisson solve on CPU.

SLIDE 13

Titan: Hybrid Low Mach Algorithm

[Figure: GPU speedup (CPU/GPU, 0X to 2X, with 1X marked) vs. number of CPUs/GPUs (1 to 12800; also the number of Titan nodes, 1 MPI rank per Titan node); series: 16^3, 32^3, 64^3, 128^3]

Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

SLIDE 14

Titan: Compressible Algorithm

[Figure: weak scaling, mean time per timestep (0.01 s to 10 s) vs. number of GPUs (1 to 18252; also the number of Titan nodes, 1 GPU per Titan node); series: 16^3, 32^3, 64^3, 128^3]

Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

SLIDE 15

Titan: Compressible Algorithm

[Figure: GPU speedup (CPU/GPU, 0.1X to 100X, with 1X marked) vs. number of CPUs (1 to 18252; also the number of Titan nodes, 1 MPI rank per Titan node); series: 16^3, 32^3, 64^3, 128^3]

Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

SLIDE 16

Institute for Clean and Secure Energy, The University of Utah

What next?

Wait for linear solvers to get us to many-GPU systems?

  • Even when these arrive, it puts a lot of demand on black-box linear solvers to achieve scalability & performance.

[Figure: compressible speedup (CPU/GPU, 0.1X to 100X) and low-Mach speedup (CPU/GPU, 0X to 2X), each with 1X marked; series: 16^3, 32^3, 64^3, 128^3]

SLIDE 17

What next?

Consider alternative algorithms?


SLIDE 18

Point-implicit algorithms:

  • High arithmetic intensity
  • Communication patterns are the same as explicit codes (ghost/halo updates)
  • Well-suited for reacting flow calculations.

$$\left( I - \Delta\sigma \, \frac{\partial h}{\partial u} \right) \frac{\Delta u}{\Delta\sigma} = h(u)$$

where $\partial h / \partial u$ is the local Jacobian matrix and $h(u)$ is the local residual.

Computational kernel

  • Residual (right-hand side) evaluation
  • Pointwise Jacobian evaluation
  • Local linear solves
  • Local eigenvalue decompositions

Matrix assembly must be efficient and extensible to complex, multiphysics problems
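
To make the kernel concrete, here is a sketch of one point-implicit update at a single grid point. Rearranging the equation gives $\Delta u = \Delta\sigma \, (I - \Delta\sigma \, \partial h / \partial u)^{-1} h(u)$. Everything below is an illustrative assumption rather than the authors' implementation: a fixed system size `N = 2`, the names `solveDense` and `pointImplicitStep`, and a plain Gaussian elimination for the local solve.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

constexpr std::size_t N = 2;  // illustrative number of unknowns per grid point
using Vec = std::array<double, N>;
using Mat = std::array<std::array<double, N>, N>;

// Solve A x = b by Gaussian elimination with partial pivoting (works on copies).
Vec solveDense(Mat A, Vec b) {
    for (std::size_t k = 0; k < N; ++k) {
        std::size_t piv = k;
        for (std::size_t i = k + 1; i < N; ++i) {
            if (std::fabs(A[i][k]) > std::fabs(A[piv][k])) piv = i;
        }
        assert(A[piv][k] != 0.0);  // singular local matrix is not handled here
        std::swap(A[k], A[piv]);
        std::swap(b[k], b[piv]);
        for (std::size_t i = k + 1; i < N; ++i) {
            const double m = A[i][k] / A[k][k];
            for (std::size_t j = k; j < N; ++j) A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
        }
    }
    Vec x{};
    for (std::size_t k = N; k-- > 0;) {  // back substitution
        double s = b[k];
        for (std::size_t j = k + 1; j < N; ++j) s -= A[k][j] * x[j];
        x[k] = s / A[k][k];
    }
    return x;
}

// One local update: (I - dsigma * J) * (du / dsigma) = h, then u <- u + du.
// h is the local residual h(u) and J is the local Jacobian dh/du, both at u.
Vec pointImplicitStep(const Vec& u, const Vec& h, const Mat& J, double dsigma) {
    Mat A{};
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            A[i][j] = (i == j ? 1.0 : 0.0) - dsigma * J[i][j];
        }
    }
    const Vec rate = solveDense(A, h);  // rate = du / dsigma
    Vec next = u;
    for (std::size_t i = 0; i < N; ++i) next[i] += dsigma * rate[i];
    return next;
}
```

Each grid point solves its own small system with no data from its neighbors, which is why the communication pattern matches an explicit code while the arithmetic per point is much higher.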

SLIDE 19

Example: Highly nonlinear, parameterized ODE systems

  • Detailed chemical kinetics
      • Analytical Jacobian in PoKiTT w/ Nebo for GPU
      • Dense matrix formed w/ primitives and sparse transformation
  • Simple convective heat transfer
      • Single-element Jacobian combined with sparse transform
  • Finite mixing time
      • Scalar Jacobian matrix

Right-hand side: $K + Q + T$ (kinetics source terms, convective heat transfer, mixing/flow)

Jacobian: $\left( \frac{\partial K}{\partial V} + \frac{\partial Q}{\partial V} \right) \frac{\partial V}{\partial U} - \frac{1}{\tau} I$

C++ code:

    ( dKdV + dqdV ) * dVdU - invT

  • dKdV: full matrix (dense submat)
  • dqdV: 1-element (sparse)
  • dVdU: 2N-elements (sparse)
  • invT: scalar matrix

[Figure: GPU speedup (axis 5 to 30) for a 16x16 matrix at 16^3, 32^3, and 64^3; operations: dot product, MatVec, Ax=b, eigen-decomposition]
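
The Jacobian expression `( dKdV + dqdV ) * dVdU - invT` can be spelled out with small dense matrices. The sketch below is an assumption-laden stand-in for the matrix types used in the talk: it stores every term densely (ignoring the sparse and scalar structure the slide highlights), fixes the size at `M = 2`, and uses hypothetical helper names `add`, `multiply`, and `assembleJacobian`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t M = 2;  // illustrative matrix dimension
using Matrix = std::array<std::array<double, M>, M>;

// Element-wise sum of two matrices.
Matrix add(const Matrix& a, const Matrix& b) {
    Matrix c{};
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < M; ++j) c[i][j] = a[i][j] + b[i][j];
    }
    return c;
}

// Standard dense matrix product a * b.
Matrix multiply(const Matrix& a, const Matrix& b) {
    Matrix c{};
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < M; ++j) {
            for (std::size_t k = 0; k < M; ++k) c[i][j] += a[i][k] * b[k][j];
        }
    }
    return c;
}

// J = (dKdV + dqdV) * dVdU - (1/tau) * I
Matrix assembleJacobian(const Matrix& dKdV, const Matrix& dqdV,
                        const Matrix& dVdU, double tau) {
    assert(tau > 0.0);
    Matrix J = multiply(add(dKdV, dqdV), dVdU);
    for (std::size_t i = 0; i < M; ++i) {
        J[i][i] -= 1.0 / tau;  // scalar matrix: only the diagonal changes
    }
    return J;
}
```

A production version would exploit the structure noted on the slide, for instance by touching a single entry for `dqdV` and only the diagonal for `invT`, instead of forming dense temporaries.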

SLIDE 20

Conclusions

Robust abstractions are needed to facilitate portable & performant applications on upcoming architectures.

  • DAG-based software design allows flexibility needed for multiphysics codes on heterogeneous platforms.
  • (E)DSLs can provide convenient, portable & performant abstractions for HPC applications.

The Algorithm-Hardware collision:

  • Scalable GPU linear solvers are needed for traditional algorithms to be viable on new architectures.
  • Alternative algorithms may be needed with higher arithmetic intensity
      • higher-order
      • point-implicit?