Case Studies in Using a DSL and Task Graphs for Portable Reacting - PowerPoint PPT Presentation

Case Studies in Using a DSL and Task Graphs for Portable Reacting Flow Simulations
James C. Sutherland, Associate Professor, Chemical Engineering
Tony Saad, Assistant Professor, Chemical Engineering

Acknowledgments
Research Staff / Postdoctoral Researcher (now at LLNL): Babak Goshayeshi, Christopher Earl
Ph.D. Students / M.S. Students: Abhishek Bagusetty, Mike Hansen, Devin Robison, Josh McConnell, Michael Brown
Funding: DE-NA0002375, DE-NA-000740, XPS award 1337145, DE-SC0008998

Nebo (E)DSL: “Matlab for PDEs on Supercomputers”
$\mathrm{rhs} = -\frac{\partial}{\partial x}(J_x + C_x) - \frac{\partial}{\partial y}(J_y + C_y) - \frac{\partial}{\partial z}(J_z + C_z)$
Field & stencil operations:
    rhs <<= -divOpX( xConvFlux + xDiffFlux )
            -divOpY( yConvFlux + yDiffFlux )
            -divOpZ( zConvFlux + zDiffFlux );
Can “chain” stencil operations where necessary.
• Stencils: >150 natively supported stencil operations (easily extensible)
• cond: “vectorized if”
• Arbitrary composition of operations
• Masked assignment (perform operations on a defined subset of points)
• Portable: same code works for CPU, multicore, GPU execution
• Embedded in C++ → “plays well with others”
Auto-generate code for efficient execution on CPU, GPU, XeonPhi, etc. during compilation.
[Figure: efficiency vs. expressiveness; labels: C++, DSL, Matlab]
Earl, C., Might, M., Bagusetty, A., & Sutherland, J. C., Journal of Systems and Software (2016).
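To make the field and stencil idea concrete, here is a minimal plain C++ sketch of the x-direction term, rhs = -d/dx (J_x + C_x), on a 1D grid. It does not use Nebo: the `Field` alias and the functions `add`, `div_x`, and `rhs_x` are hypothetical stand-ins, and the face-to-cell difference is only one plausible choice of divergence stencil.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for illustration only; this is not a Nebo type.
using Field = std::vector<double>;

// Pointwise sum of two fields of equal size.
Field add(const Field& a, const Field& b) {
  assert(a.size() == b.size());
  Field out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
  return out;
}

// 1D divergence stencil: takes n+1 face fluxes, returns n cell values.
Field div_x(const Field& faceFlux, double dx) {
  assert(faceFlux.size() >= 2 && dx > 0.0);
  Field out(faceFlux.size() - 1);
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = (faceFlux[i + 1] - faceFlux[i]) / dx;
  return out;
}

// rhs = -d/dx (J_x + C_x): the x-direction term only.
Field rhs_x(const Field& xConvFlux, const Field& xDiffFlux, double dx) {
  Field d = div_x(add(xConvFlux, xDiffFlux), dx);
  for (double& v : d) v = -v;
  return d;
}
```

The hand-written loops here are what a DSL of this kind would hide: the one-line `rhs <<= ...` statement on the slide expresses the same composition without the user writing any loop or choosing a backend.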

The Power of Task Graphs
Register all expressions
• Each “expression” calculates one or more field quantities.
• Each expression advertises its direct dependencies.
Set a “root” expression; construct a graph
• All dependencies are discovered/resolved automatically.
• Highly localized influence of changes in models.
• Not all expressions in the registry may be relevant/used.
From the graph:
• Deduce storage requirements & allocate memory (externally to each expression).
• Automatically schedule evaluation, ensuring proper ordering.
• Robust scheduling algorithms are key.
[Figure: expression registry and dependency graph for $\Gamma = \Gamma(T, p, y_i)$, distinguishing direct (expressed) from indirect (discovered) dependencies; node labels: $\Gamma$, $p$, $y_i$, $T$, $\rho$, $u$, $\tau$, $s_\phi$, $\phi$]
*Notz, Pawlowski, & Sutherland (2012). ACM Transactions on Mathematical Software, 39(1).
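The register, set-root, and schedule steps can be sketched in a few lines of C++. This is a minimal illustration, not the authors' library: `Registry`, `visit`, and `schedule` are hypothetical names, and a depth-first topological sort stands in for whatever scheduling algorithm the real framework uses.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical registry: expression name -> names of its direct dependencies.
using Registry = std::map<std::string, std::vector<std::string>>;

// Depth-first visit that appends a node only after all of its dependencies.
void visit(const Registry& reg, const std::string& name,
           std::set<std::string>& done, std::set<std::string>& active,
           std::vector<std::string>& order) {
  if (done.count(name)) return;
  assert(reg.count(name) && "dependency was never registered");
  assert(!active.count(name) && "cyclic dependency");
  active.insert(name);
  for (const std::string& dep : reg.at(name))
    visit(reg, dep, done, active, order);
  active.erase(name);
  done.insert(name);
  order.push_back(name);
}

// Build an evaluation order for everything reachable from the root.
// Registered expressions that the root never needs are left out.
std::vector<std::string> schedule(const Registry& reg, const std::string& root) {
  std::set<std::string> done, active;
  std::vector<std::string> order;
  visit(reg, root, done, active, order);
  return order;
}
```

Each expression only lists its direct dependencies; indirect ones are discovered by the traversal, and an expression that is registered but unreachable from the root never appears in the schedule.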

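Changes in model form are naturally handled
Pure substance heat flux: $\mathbf{q} = -\lambda \nabla T$
[Graph: $\mathbf{q}$ with dependencies $\lambda$, $T$]
Multi-species mixture heat flux: $\mathbf{q} = -\lambda \nabla T + \sum_{i=1}^{n} h_i \mathbf{J}_i$
[Graph: $\mathbf{q}$ with dependencies $\lambda$, $T$, $\mathbf{J}_1 \dots \mathbf{J}_n$, $h_1 \dots h_n$, $y_1 \dots y_n$]
No complex logic changes in code when models are added or changed.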

“Modifiers”: injecting new dependencies
Motivation:
• Boundary conditions: modify a subset of the computed values.
• Multiphase coupling: add source terms to RHS of equations.
Modifiers allow “push” rather than “pull” dependency addition.
Modifiers are deployed after the node they are attached to, and are provided a handle to the field just computed.
Modifiers can introduce new dependencies to the graph.
[Graph: nodes A, B, C with modifiers BC1 and S1 attached, and added nodes D, E, F]
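A minimal C++ sketch of the modifier idea follows. The `Node` struct and the `dirichlet` and `add_source` helpers are hypothetical, not the authors' API; the sketch shows only the "run after the node, with a handle to its field" behavior and does not model modifiers adding new dependencies to the graph.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Field = std::vector<double>;  // hypothetical stand-in for a field type

// Hypothetical node: computes a field, then runs its attached modifiers.
struct Node {
  std::function<void(Field&)> compute;                 // fills the field
  std::vector<std::function<void(Field&)>> modifiers;  // run afterwards, in order

  // "Push" a modifier onto this node; the node's own code is not edited.
  void attach(std::function<void(Field&)> m) { modifiers.push_back(std::move(m)); }

  Field evaluate(std::size_t n) const {
    assert(static_cast<bool>(compute));
    Field f(n, 0.0);
    compute(f);
    for (const auto& m : modifiers) m(f);  // each gets a handle to the computed field
    return f;
  }
};

// Example modifier: overwrite the first and last points (a boundary condition).
std::function<void(Field&)> dirichlet(double left, double right) {
  return [=](Field& f) {
    if (!f.empty()) { f.front() = left; f.back() = right; }
  };
}

// Example modifier: add a source term at every point (e.g. multiphase coupling).
std::function<void(Field&)> add_source(double s) {
  return [=](Field& f) { for (double& v : f) v += s; };
}
```

Attaching `add_source` and then `dirichlet` to a node applies them in that order after the node's own computation, so the boundary values are the last thing written.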

Example: PoKiTT (Portable Kinetics, Thermodynamics & Transport)
$\rho \frac{\partial y_i}{\partial t} = -\nabla \cdot \mathbf{J}_i + s_i$
$\rho \frac{\partial h}{\partial t} = -\nabla \cdot \mathbf{q}$
• Detailed kinetics
• Mixture-averaged transport
• Detailed thermodynamics
• 32 PDEs
• $256^2$ grid points
• 8 million timesteps
• 8 days on 1 GPU (~5 months on 1 CPU core)
[Figure: Triple flame computed on GPU with PoKiTT]
[Figure: speedup (axis 0 to 30) on 256^2, 512^2, and 1024^2 grids for 12 cores and GPU; bar labels: 2.4, 5, 5, 18.2, 27, 30]
Yonkee & Sutherland, SIAM Journal on Scientific Computing (2016)

Titan: Hybrid Low Mach Algorithm
Everything on GPU except Poisson solve on CPU.
[Figure: Weak Scaling. Mean time per timestep (0.01s to 100s) vs. GPUs (also # Titan Nodes, 1 GPU per Titan Node), 1 to 12800; series: 16^3, 32^3, 64^3, 128^3]
[Figure: GPU Speedup. Speedup (CPU/GPU), 0X to 2X, vs. CPUs/GPUs (also # Titan Nodes, 1 MPI Rank per Titan Node); series: 16^3, 32^3, 64^3, 128^3]
Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

Titan: Compressible Algorithm
[Figure: Weak Scaling. Mean time per timestep (0.01s to 10s) vs. GPUs (also # Titan Nodes, 1 GPU per Titan Node), 1 to 18252; series: 16^3, 32^3, 64^3, 128^3]
[Figure: GPU Speedup. Speedup (CPU/GPU), 0.1X to 100X, vs. CPUs (also # Titan Nodes, 1 MPI Rank per Titan Node), 1 to 18252; series: 16^3, 32^3, 64^3, 128^3]
Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

What next?
[Figures: Speedup (CPU/GPU) for the Low-Mach algorithm (0X to 2X) and the Compressible algorithm (0.1X to 100X); series: 16^3, 32^3, 64^3, 128^3]
Wait for linear solvers to get us to many-GPU systems?
• Even when these arrive, it puts a lot of demand on black-box linear solvers to achieve scalability & performance.
Consider alternative algorithms?

Point-implicit algorithms:
• High arithmetic intensity
• Communication patterns are the same as explicit codes (ghost/halo updates)
• Well-suited for reacting flow calculations.
$\left[ I - \Delta\sigma \frac{\partial h}{\partial u} \right] \frac{\Delta u}{\Delta\sigma} = h(u)$
where $h(u)$ is the local residual and $\frac{\partial h}{\partial u}$ is the local Jacobian matrix.
Computational kernel
- Residual (right-hand side) evaluation
- Pointwise Jacobian evaluation
- Local linear solves
- Local eigenvalue decompositions
Matrix assembly must be efficient and extensible to complex, multiphysics problems
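A single point-implicit update can be sketched for a two-variable system at one grid point. The sketch assumes the slide's equation reads [I - Δσ ∂h/∂u] Δu/Δσ = h(u), rearranged to (I - Δσ J) Δu = Δσ h; the function name `point_implicit_step`, the 2x2 size, and the use of Cramer's rule are all illustrative choices, not the authors' implementation.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// One point-implicit update at a single grid point:
// solve (I - dsigma * J) du = dsigma * h, with J = dh/du evaluated locally.
Vec2 point_implicit_step(const Vec2& h, const Mat2& J, double dsigma) {
  // Local matrix A = I - dsigma * J.
  const double a00 = 1.0 - dsigma * J[0][0];
  const double a01 = -dsigma * J[0][1];
  const double a10 = -dsigma * J[1][0];
  const double a11 = 1.0 - dsigma * J[1][1];
  const double det = a00 * a11 - a01 * a10;
  assert(std::fabs(det) > 0.0 && "local matrix is singular");
  // Right-hand side b = dsigma * h, solved with Cramer's rule.
  const double b0 = dsigma * h[0];
  const double b1 = dsigma * h[1];
  return {(b0 * a11 - a01 * b1) / det, (a00 * b1 - b0 * a10) / det};
}
```

Everything in the function uses data from one grid point only, which is why the communication pattern matches an explicit code: neighbors are needed to form the residual h, but the linear solve itself is purely local.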

Example: Highly nonlinear, parameterized ODE systems
Right-hand side: $K + Q + T$, with $K$ = kinetics source terms, $Q$ = convective heat transfer, $T$ = mixing/flow
Jacobian: $\left[ \frac{\partial K}{\partial V} + \frac{\partial Q}{\partial V} \right] \frac{\partial V}{\partial U} - \frac{1}{\tau} I$
  $\partial K / \partial V$: full matrix (dense submat); $\partial Q / \partial V$: 1-element matrix (sparse); $\partial V / \partial U$: 2N-elements (sparse); $\frac{1}{\tau} I$: scalar matrix
C++ code: ( dKdV + dqdV ) * dVdU - invT
• Detailed chemical kinetics
  - Analytical Jacobian in PoKiTT w/ Nebo for GPU
  - Dense matrix formed w/ primitives and sparse transformation
• Simple convective heat transfer
  - Single-element Jacobian combined with sparse transform
• Finite mixing time
  - Scalar Jacobian matrix
[Figure: GPU Speedup, 16x16 Matrix (axis 0 to 30) on 16^3, 32^3, 64^3 grids; series: Dot Product, MatVec, Ax=b, Eigen-decomp]
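The Jacobian expression `( dKdV + dqdV ) * dVdU - invT` can be spelled out with plain dense matrices. This is a rough sketch under simplifying assumptions: every term is stored densely (the slide uses sparse and scalar matrix types for three of the four), and `Matrix` and `assemble_jacobian` are hypothetical names rather than the authors' matrix-assembly interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // square, row-major

// Assemble (dKdV + dQdV) * dVdU - (1/tau) * I for one grid point.
Matrix assemble_jacobian(const Matrix& dKdV, const Matrix& dQdV,
                         const Matrix& dVdU, double tau) {
  const std::size_t n = dKdV.size();
  assert(dQdV.size() == n && dVdU.size() == n && tau > 0.0);
  Matrix jac(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      for (std::size_t k = 0; k < n; ++k)
        jac[i][j] += (dKdV[i][k] + dQdV[i][k]) * dVdU[k][j];
    }
    jac[i][i] -= 1.0 / tau;  // scalar matrix (1/tau) * I
  }
  return jac;
}
```

Writing the loops by hand shows what the one-line expression saves: each physics term contributes its own matrix, and dedicated sparse or scalar types would let the same expression skip the zero entries that this dense version multiplies through.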

Conclusions
Robust abstractions are needed to facilitate portable & performant applications on upcoming architectures.
• DAG-based software design allows flexibility needed for multiphysics codes on heterogeneous platforms.
• (E)-DSLs can provide convenient, portable & performant abstractions for HPC applications
The Algorithm-Hardware collision:
• Scalable GPU linear solvers are needed for traditional algorithms to be viable on new architectures.
• Alternative algorithms may be needed with higher arithmetic intensity
  • higher-order
  • point-implicit?
