Case Studies in Using a DSL and Task Graphs for Portable Reacting Flow Simulations
JAMES C. SUTHERLAND
Associate Professor - Chemical Engineering
TONY SAAD
Assistant Professor - Chemical Engineering
Acknowledgments
BABAK GOSHAYESHI
Research Staff
MIKE HANSEN JOSH MCCONNELL
Ph.D. Students
CHRISTOPHER EARL
Postdoctoral Researcher Now at LLNL
ABHISHEK BAGUSETTY DEVIN ROBISON MICHAEL BROWN
M.S. Students
XPS award 1337145, DE-NA0002375, DE-NA-000740, DE-SC0008998
Field & stencil
rhs <<= -divOpX( xConvFlux + xDiffFlux )
rhs = -∂/∂x(J_x + C_x) - ∂/∂y(J_y + C_y) - ∂/∂z(J_z + C_z)
Can “chain” stencil operations where necessary.
Auto-generate code for efficient execution on CPU, GPU, and Xeon Phi.
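The `<<=` assignment above is where a DSL like this can fuse the whole right-hand side into a single loop. A minimal expression-template sketch of the idea (hypothetical `Field`, `Sum`, `Neg`, and `assign` names, not Nebo's actual API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A field of values at grid points.
struct Field {
    std::vector<double> data;
    explicit Field(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
};

// Lazy expression nodes: nothing is computed until assignment.
template <typename L, typename R>
struct Sum {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <typename E>
struct Neg {
    const E& e;
    double operator[](std::size_t i) const { return -e[i]; }
};

template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return {l, r}; }

template <typename E>
Neg<E> operator-(const E& e) { return {e}; }

// Assignment triggers one fused loop over the grid; this loop body
// is what a backend would map onto CPU threads or GPU threads.
template <typename E>
void assign(Field& f, const E& expr) {
    for (std::size_t i = 0; i < f.data.size(); ++i)
        f.data[i] = expr[i];
}
```

Because the whole expression tree is a type, a backend can walk the same `expr[i]` body on different architectures without changing user code.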
[Figure: expressiveness vs. efficiency trade-off: Matlab is expressive, C++ is efficient; the DSL targets both, with multicore and GPU execution.]
Earl, C., Might, M., Bagusetty, A., & Sutherland, J. C., Journal of Systems and Software (2016).
[Figure: example dependency graph with nodes u, Γ, T, p, y_i, and τ, where Γ = Γ(T, p, y_i). Edges distinguish direct (expressed) dependencies from indirect (discovered) dependencies.]
Expression Registry
Register all expressions (computed quantities).
Set a "root" expression; a graph is constructed automatically, and only the expressions that are needed are used.
From the graph, memory management and scheduling are handled externally to each expression.
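The register/root/graph workflow can be sketched as a post-order traversal: only expressions reachable from the root are scheduled, with dependencies ordered before dependents (hypothetical `Registry`/`schedule` names, not ExprLib's actual API):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Registry: expression name -> names of expressions it depends on.
using Registry = std::map<std::string, std::vector<std::string>>;

// From a root, discover the needed subgraph and return a valid
// evaluation order (dependencies before dependents).
std::vector<std::string> schedule(const Registry& reg, const std::string& root) {
    std::vector<std::string> order;
    std::set<std::string> visited;
    std::function<void(const std::string&)> visit = [&](const std::string& name) {
        if (!visited.insert(name).second) return;  // already scheduled
        auto it = reg.find(name);
        if (it != reg.end())
            for (const auto& dep : it->second) visit(dep);
        order.push_back(name);  // post-order: deps are already in place
    };
    visit(root);
    return order;
}
```

With the Γ = Γ(T, p, y_i) example, rooting the graph at τ pulls in Γ, u, and Γ's inputs automatically; registered expressions not reachable from the root never execute.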
[Figure: example graph with nodes ρ, φ, s_φ.]
*Notz, Pawlowski, & Sutherland (2012). ACM Transactions on Mathematical Software, 39(1).
Pure substance heat flux:

q = -λ∇T

[Graph nodes: q, λ, T]

Multi-species mixture heat flux:

q = -λ∇T + Σ_{i=1}^{n} h_i J_i

[Graph nodes: q, λ, T, h_1…h_n, J_1…J_n, y_1…y_n]
No complex logic changes in code when models are added/changed.
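One way to see why model changes need no logic changes: each model is a self-describing node that declares its own dependencies, so swapping the pure-substance flux for the mixture flux changes the graph, not the driver (a sketch with hypothetical `Expression`/`dependencies` names):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each model is a node that declares its own dependencies.
struct Expression {
    virtual ~Expression() = default;
    virtual std::vector<std::string> dependencies() const = 0;
};

// q = -lambda * grad(T): depends only on lambda and T.
struct PureSubstanceHeatFlux : Expression {
    std::vector<std::string> dependencies() const override {
        return {"lambda", "T"};
    }
};

// q = -lambda * grad(T) + sum_i h_i J_i: also needs enthalpies and fluxes.
struct MixtureHeatFlux : Expression {
    std::vector<std::string> dependencies() const override {
        return {"lambda", "T", "h", "J"};
    }
};

// The driver never branches on the model type; it simply asks the
// chosen expression what it needs and builds the graph from that.
std::vector<std::string> requiredFields(const Expression& q) {
    return q.dependencies();
}
```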
Modifiers

Motivation: adjust computed values (e.g., boundary conditions, source terms).
Modifiers allow "push" rather than "pull" dependency addition.
Modifiers are deployed after the node they are attached to, and are provided a handle to the field just computed.
Modifiers can introduce new dependencies to the graph.

[Figure: graph A → B → C, extended by modifiers BC1 and S1, which in turn pull in new dependencies D, E, F.]
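A minimal sketch of the "push" mechanism (hypothetical `Node` type, not the actual framework classes): the modifier runs right after its node and receives the just-computed field, e.g. to pin a boundary value:

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Field = std::vector<double>;

// A node computes a field; modifiers attached to it are deployed
// right after it runs and get a handle to the computed field.
struct Node {
    std::function<void(Field&)> compute;
    std::vector<std::function<void(Field&)>> modifiers;

    void run(Field& f) const {
        compute(f);
        for (const auto& m : modifiers) m(f);  // "push" step
    }
};
```

The node itself never knows about its modifiers' logic; boundary conditions or source-term adjustments are pushed onto it from outside, which is what keeps the core expressions model-agnostic.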
PoKiTT (Portable Kinetics, Thermodynamics & Transport)

ρ ∂y_i/∂t = -∇·J_i + s_i
ρ ∂h/∂t = -∇·q

Triple flame computed on GPU with PoKiTT.

[Figure: GPU speedup over 12 CPU cores for 256^2, 512^2, and 1024^2 grids.]

Yonkee & Sutherland, SIAM Journal on Scientific Computing (2016)
Weak Scaling

[Figure: mean time per timestep (0.01 s to 100 s) vs. number of GPUs (also # Titan nodes, 1 GPU per Titan node): 1, 2, 8, 64, 512, 4096, 8192, 12800; per-node problem sizes 16^3, 32^3, 64^3, 128^3.]

Everything on GPU except the Poisson solve, which runs on the CPU.

[Figure: GPU speedup (CPU/GPU) between 0X and 2X across the same node counts and problem sizes; 1X marked for reference.]

Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)
Weak Scaling

[Figure: mean time per timestep (0.01 s to 10 s) vs. number of GPUs (also # Titan nodes, 1 GPU per Titan node): 1, 8, 512, 8192, 18252; per-node problem sizes 16^3, 32^3, 64^3, 128^3.]

[Figure: GPU speedup (CPU/GPU) from 0.1X to 100X vs. CPUs (also # Titan nodes, 1 MPI rank per Titan node): 1, 8, 512, 8192, 18252; 1X marked for reference.]

Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)
CLEAN AND SECURE ENERGY
THE UNIVERSITY OF UTAH

How do we achieve scalability & performance?
Wait for linear solvers to get us to many-GPU systems?
Consider alternative algorithms?

[Figure: Compressible speedup (CPU/GPU), 0.1X to 100X, and low-Mach speedup (CPU/GPU), 0X to 2X, vs. number of Titan nodes (1 to 18252) at per-node problem sizes 16^3 to 128^3; 1X marked for reference.]
High arithmetic intensity.
Communication patterns are the same as explicit codes (ghost/halo updates).
Well-suited for reacting flow calculations.
[I - Δσ ∂h/∂u] Δu/Δσ = h(u)

∂h/∂u: local Jacobian matrix. h(u): local residual.
Computational kernel
Matrix assembly must be efficient and extensible to complex, multiphysics problems.
Example: Highly nonlinear, parameterized ODE systems
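For one cell with one unknown, the linearized system above reduces to Δu = Δσ h(u) / (1 - Δσ ∂h/∂u). A sketch of the resulting pseudo-time iteration on a stiff model residual (the particular h below is illustrative, not from the talk):

```cpp
#include <cassert>
#include <cmath>

// Model residual: h(u) = -k (u - 1), a stiff relaxation toward u = 1.
double h(double u)  { return -50.0 * (u - 1.0); }
double dhdu(double) { return -50.0; }

// One pseudo-time step of [1 - dsigma * dh/du] * (du/dsigma) = h(u).
double pseudoStep(double u, double dsigma) {
    double du = dsigma * h(u) / (1.0 - dsigma * dhdu(u));
    return u + du;
}

// March in pseudo-time sigma until the residual is driven to zero;
// this is the purely local computational kernel, no global solve.
double solve(double u0, double dsigma, int maxIters) {
    double u = u0;
    for (int i = 0; i < maxIters; ++i) {
        u = pseudoStep(u, dsigma);
        if (std::fabs(h(u)) < 1e-10) break;
    }
    return u;
}
```

The implicit term (1 - Δσ ∂h/∂u) is what lets the pseudo-time step exceed the explicit stability limit on stiff kinetics, while each cell's solve stays local.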
[Figure: matrix assembly with Nebo for GPU and sparse transformation vs. assembly requiring data transfer.]
Right-hand side: K + Q + T
(K: kinetics source terms; Q: mixing/flow; T: convective heat transfer)

Jacobian: (∂K/∂V + ∂Q/∂V) ∂V/∂U - (1/τ) I

In code: ( dKdV + dqdV ) * dVdU - invT
Term structure: dKdV is a full matrix (dense sub-matrices); dqdV is 1-element (sparse); dVdU is 2N-elements (sparse); invT is a scalar matrix.
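The assembly expression J = (∂K/∂V + ∂Q/∂V) ∂V/∂U - (1/τ)I can be exercised with toy dense 2×2 matrices standing in for the per-cell sub-matrices (hypothetical `Mat2`/`assembleJacobian` helpers, ignoring the sparsity optimizations):

```cpp
#include <array>
#include <cassert>

using Mat2 = std::array<std::array<double, 2>, 2>;

Mat2 add(const Mat2& a, const Mat2& b) {
    Mat2 c{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j) c[i][j] = a[i][j] + b[i][j];
    return c;
}

Mat2 mul(const Mat2& a, const Mat2& b) {
    Mat2 c{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k) c[i][j] += a[i][k] * b[k][j];
    return c;
}

// J = (dK/dV + dQ/dV) * dV/dU - (1/tau) * I
Mat2 assembleJacobian(const Mat2& dKdV, const Mat2& dQdV,
                      const Mat2& dVdU, double tau) {
    Mat2 J = mul(add(dKdV, dQdV), dVdU);
    for (int i = 0; i < 2; ++i) J[i][i] -= 1.0 / tau;  // scalar matrix term
    return J;
}
```

In the real assembly, exploiting that some factors are 1-element or 2N-element sparse (rather than dense, as here) is what keeps the kernel's cost down at high arithmetic intensity.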
GPU Speedup, 16×16 Matrix

[Figure: GPU speedup (5X to 30X) for dot product, mat-vec, Ax=b, and eigen-decomposition at 16^3, 32^3, and 64^3 grid sizes.]
Robust abstractions are needed to facilitate portable & performant applications on upcoming architectures.
The Algorithm-Hardware collision: