
Case Studies in Using a DSL and Task Graphs for Portable Reacting Flow Simulations



  1. Case Studies in Using a DSL and Task Graphs for Portable Reacting Flow Simulations
     James C. Sutherland, Associate Professor, Chemical Engineering
     Tony Saad, Assistant Professor, Chemical Engineering

  2. Acknowledgments
     • Research Staff: Babak Goshayeshi
     • Postdoctoral Researcher: Christopher Earl (now at LLNL)
     • Ph.D. and M.S. Students: Abhishek Bagusetty, Mike Hansen, Devin Robison, Josh McConnell, Michael Brown
     • Grants and awards: DE-NA0002375, DE-NA-000740, XPS award 1337145, DE-SC0008998

  3. Nebo (E)DSL: “Matlab for PDEs on Supercomputers”
     Example right-hand side:
     $\mathrm{rhs} = -\frac{\partial}{\partial x}(J_x + C_x) - \frac{\partial}{\partial y}(J_y + C_y) - \frac{\partial}{\partial z}(J_z + C_z)$
     Field & stencil operations:
         rhs <<= -divOpX( xConvFlux + xDiffFlux )
                 -divOpY( yConvFlux + yDiffFlux )
                 -divOpZ( zConvFlux + zDiffFlux );
     Can “chain” stencil operations where necessary.
     • Stencils: >150 natively supported stencil operations (easily extensible)
     • cond: “vectorized if”
     • Arbitrary composition of operations
     • Masked assignment (perform operations on a defined subset of points)
     • Portable: same code works for CPU, multicore, GPU execution
     • Embedded in C++ → “plays well with others”
     Auto-generate code for efficient execution on CPU, GPU, XeonPhi, etc. during compilation.
     [Figure: efficiency vs. expressiveness, with points labeled DSL, C++, and Matlab.]
     Earl, C., Might, M., Bagusetty, A., & Sutherland, J. C., Journal of Systems and Software (2016).
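
     To make the x-direction term of the right-hand side above concrete, here is a minimal sketch in plain C++. It is not the Nebo API: `Field`, `divergence_x`, and `scalar_rhs_x` are hypothetical names, and the layout (n+1 face-centered fluxes on a uniform 1D grid producing n cell values) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ stand-in for a 1D field; this is NOT the Nebo API.
using Field = std::vector<double>;

// Hypothetical helper: divergence of a face-centered flux on a uniform 1D grid.
// flux holds n+1 face values; the result holds n cell values.
Field divergence_x(const Field& flux, double dx) {
    assert(flux.size() >= 2 && dx > 0.0);
    Field div(flux.size() - 1);
    for (std::size_t i = 0; i < div.size(); ++i) {
        div[i] = (flux[i + 1] - flux[i]) / dx;
    }
    return div;
}

// rhs = -d/dx (J_x + C_x): sum the two fluxes on faces, then difference.
Field scalar_rhs_x(const Field& conv_flux, const Field& diff_flux, double dx) {
    assert(conv_flux.size() == diff_flux.size());
    Field total(conv_flux.size());
    for (std::size_t i = 0; i < total.size(); ++i) {
        total[i] = conv_flux[i] + diff_flux[i];
    }
    Field rhs = divergence_x(total, dx);
    for (double& r : rhs) r = -r;  // leading minus sign of the divergence term
    return rhs;
}
```

     The hand-written loops and temporaries here are what the single `rhs <<= ...` statement on the slide replaces; per the slide, the DSL generates the backend-specific code for such an expression during compilation.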

  4. The Power of Task Graphs
     Register all expressions
     • Each “expression” calculates one or more field quantities.
     • Each expression advertises its direct dependencies.
     Set a “root” expression; construct a graph
     • All dependencies are discovered/resolved automatically.
     • Highly localized influence of changes in models.
     • Not all expressions in the registry may be relevant/used.
     From the graph:
     • Deduce storage requirements & allocate memory (externally to each expression).
     • Automatically schedule evaluation, ensuring proper ordering.
     • Robust scheduling algorithms are key.
     [Figure: expression registry and dependency graph rooted at $\Gamma = \Gamma(T, p, y_i)$; node labels Γ, p, y_i, T, ρ, u, τ, s_φ, φ; annotations “Direct (expressed) dependencies” and “Indirect (discovered) dependencies”.]
     *Notz, Pawlowski, & Sutherland (2012). ACM Transactions on Mathematical Software, 39(1).
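
     The register / set-root / schedule steps can be sketched as below. This is a minimal illustration, not the authors' library: `Registry`, `visit`, and `schedule` are hypothetical names, expressions are reduced to strings, and a depth-first post-order walk stands in for whatever scheduling algorithm the real code uses.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical registry: each expression name maps to the names it directly depends on.
using Registry = std::map<std::string, std::vector<std::string>>;

// Depth-first post-order visit: dependencies are appended before the node itself.
void visit(const Registry& reg, const std::string& name,
           std::set<std::string>& done, std::set<std::string>& active,
           std::vector<std::string>& order) {
    if (done.count(name)) return;
    assert(reg.count(name) && "every dependency must be registered");
    assert(!active.count(name) && "cyclic dependency");
    active.insert(name);
    for (const std::string& dep : reg.at(name)) {
        visit(reg, dep, done, active, order);
    }
    active.erase(name);
    done.insert(name);
    order.push_back(name);
}

// Build an evaluation order for the graph reachable from the root expression.
// Registered expressions that the root never reaches are left out.
std::vector<std::string> schedule(const Registry& reg, const std::string& root) {
    std::set<std::string> done, active;
    std::vector<std::string> order;
    visit(reg, root, done, active, order);
    return order;
}
```

     As a usage example under assumed dependencies (Gamma on T, p, y_i; T and y_i on rho; an unrelated tau on u), `schedule(reg, "Gamma")` visits rho, T, p, y_i, Gamma in that order and never touches tau or u, which mirrors the point that not every registered expression ends up in the graph.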

  5. Changes in model form are naturally handled
     Pure substance heat flux: $q = -\lambda \nabla T$
     [Figure: dependency graph with nodes q, λ, T.]

  6. Changes in model form are naturally handled
     Multi-species mixture heat flux: $q = -\lambda \nabla T + \sum_{i=1}^{n} h_i J_i$
     [Figure: dependency graph with nodes q, λ, T, J_1 … J_n, h_1 … h_n, y_1 … y_n.]
     No complex logic changes in code when models are added/changed.

  7. “Modifiers”: injecting new dependencies
     Motivation:
     • Boundary conditions: modify a subset of the computed values.
     • Multiphase coupling: add source terms to RHS of equations.
     [Figure: graph with nodes A, B, C.]

  8. “Modifiers”: injecting new dependencies
     Motivation:
     • Boundary conditions: modify a subset of the computed values.
     • Multiphase coupling: add source terms to RHS of equations.
     Modifiers allow “push” rather than “pull” dependency addition.
     Modifiers are deployed after the node they are attached to, and are provided a handle to the field just computed.
     [Figure: graph with nodes A, B, C and modifiers BC1, S1.]

  9. “Modifiers”: injecting new dependencies
     Motivation:
     • Boundary conditions: modify a subset of the computed values.
     • Multiphase coupling: add source terms to RHS of equations.
     Modifiers allow “push” rather than “pull” dependency addition.
     Modifiers are deployed after the node they are attached to, and are provided a handle to the field just computed.
     Modifiers can introduce new dependencies to the graph.
     [Figure: graph with nodes A, B, C, modifiers BC1, S1, and new nodes D, E, F.]
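
     A minimal sketch of the modifier idea, assuming a node is just a compute callback plus an ordered list of callbacks that run afterwards on the same field. `Node`, `attach`, and `evaluate` are hypothetical names, and the extra graph dependencies a modifier may introduce are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Field = std::vector<double>;

// Hypothetical graph node: computes a field, then applies attached modifiers in order.
struct Node {
    std::function<void(Field&)> compute;
    std::vector<std::function<void(Field&)>> modifiers;

    // "Push" a modifier onto this node without editing the node's own expression.
    void attach(std::function<void(Field&)> m) { modifiers.push_back(std::move(m)); }

    void evaluate(Field& f) const {
        assert(compute && "node needs a compute callback");
        compute(f);                            // the node's own expression
        for (const auto& m : modifiers) m(f);  // each modifier gets the field just computed
    }
};
```

     For instance, a boundary-condition modifier might overwrite `f.front()`, while a source-term modifier adds to the remaining entries; neither requires the node's `compute` callback to know that they exist.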

  10. Example: PoKiTT (Portable Kinetics, Thermodynamics & Transport)
      $\rho \frac{\partial y_i}{\partial t} = -\nabla \cdot J_i + s_i$
      $\rho \frac{\partial h}{\partial t} = -\nabla \cdot q$
      • Detailed kinetics
      • Mixture-averaged transport
      • Detailed thermodynamics
      [Figure: triple flame computed on GPU with PoKiTT.]
      Yonkee & Sutherland, SIAM Journal on Scientific Computing (2016)

  11. Example: PoKiTT (Portable Kinetics, Thermodynamics & Transport)
      $\rho \frac{\partial y_i}{\partial t} = -\nabla \cdot J_i + s_i$
      $\rho \frac{\partial h}{\partial t} = -\nabla \cdot q$
      • Detailed kinetics
      • Mixture-averaged transport
      • Detailed thermodynamics
      • 32 PDEs
      • 256^2 grid points
      • 8 million timesteps
      • 8 days on 1 GPU (~5 months on 1 CPU core)
      [Figure: triple flame computed on GPU with PoKiTT.]
      [Figure: bar chart of speedup (axis 0 to 30) for 12 cores and GPU on 256^2, 512^2, and 1024^2 grids; bar labels: 2.4, 5, 5, 18.2, 27, 30.]
      Yonkee & Sutherland, SIAM Journal on Scientific Computing (2016)

  12. Titan: Hybrid Low Mach Algorithm
      [Figure: weak scaling. Mean time per timestep (0.01 s to 100 s) vs. number of GPUs (1 to 12800; also # Titan nodes, 1 GPU per Titan node); series 16^3, 32^3, 64^3, 128^3.]
      Everything on GPU except Poisson solve on CPU.
      Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

  13. Titan: Hybrid Low Mach Algorithm
      [Figure, left panel: weak scaling. Mean time per timestep (0.01 s to 100 s) vs. number of GPUs (1 to 12800; also # Titan nodes, 1 GPU per Titan node); series 16^3, 32^3, 64^3, 128^3.]
      [Figure, right panel: GPU speedup. Speedup (CPU/GPU), 0X to 2X, vs. number of CPUs/GPUs (also # Titan nodes, 1 MPI rank per Titan node); series 16^3, 32^3, 64^3, 128^3.]
      Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

  14. Titan: Compressible Algorithm
      [Figure: weak scaling. Mean time per timestep (0.01 s to 10 s) vs. number of GPUs (1 to 18252; also # Titan nodes, 1 GPU per Titan node); series 16^3, 32^3, 64^3, 128^3.]
      Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

  15. Titan: Compressible Algorithm
      [Figure, left panel: weak scaling. Mean time per timestep (0.01 s to 10 s) vs. number of GPUs (1 to 18252; also # Titan nodes, 1 GPU per Titan node); series 16^3, 32^3, 64^3, 128^3.]
      [Figure, right panel: GPU speedup. Speedup (CPU/GPU), 0.1X to 100X, vs. number of CPUs (1 to 18252; also # Titan nodes, 1 MPI rank per Titan node); series 16^3, 32^3, 64^3, 128^3.]
      Saad, T., & Sutherland, J. C., Journal of Computational Science (2016)

  16. What next?
      [Figure: GPU speedup (CPU/GPU) panels. Low-Mach: 0X to 2X; Compressible: 0.1X to 100X; series 16^3, 32^3, 64^3, 128^3.]
      Wait for linear solvers to get us to many-GPU systems?
      • Even when these arrive, it puts a lot of demand on black-box linear solvers to achieve scalability & performance.
      Institute for Clean and Secure Energy, The University of Utah

  17. What next?
      [Figure: GPU speedup (CPU/GPU) panels. Low-Mach: 0X to 2X; Compressible: 0.1X to 100X; series 16^3, 32^3, 64^3, 128^3.]
      Wait for linear solvers to get us to many-GPU systems?
      • Even when these arrive, it puts a lot of demand on black-box linear solvers to achieve scalability & performance.
      Consider alternative algorithms?

  18. Point-implicit algorithms:
      • High arithmetic intensity
      • Communication patterns are the same as explicit codes (ghost/halo updates)
      • Well-suited for reacting flow calculations.
      $\left[ I - \Delta\sigma \frac{\partial h}{\partial u} \right] \frac{\Delta u}{\Delta\sigma} = h(u)$
      ($h(u)$: local residual; $\partial h / \partial u$: local Jacobian matrix)
      Computational kernel:
      - Residual (right-hand side) evaluation
      - Pointwise Jacobian evaluation
      - Local linear solves
      - Local eigenvalue decompositions
      - Matrix assembly must be efficient and extensible to complex, multiphysics problems
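
      As a worked instance of the local update above, here is a minimal sketch of one point-implicit step at a single grid point. The two-variable system size, the type names, and the use of Cramer's rule for the local solve are assumptions chosen to keep the example short; a production kernel would use a general dense solver.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// One point-implicit update at a single grid point:
// solve [I - dsigma * J] du = dsigma * h, then return u + du.
// h is the local residual h(u); J is the local Jacobian dh/du.
Vec2 point_implicit_step(const Vec2& u, const Vec2& h, const Mat2& J, double dsigma) {
    const double a = 1.0 - dsigma * J[0][0];
    const double b = -dsigma * J[0][1];
    const double c = -dsigma * J[1][0];
    const double d = 1.0 - dsigma * J[1][1];
    const double det = a * d - b * c;
    assert(std::fabs(det) > 0.0 && "local matrix must be nonsingular");
    const double r0 = dsigma * h[0];
    const double r1 = dsigma * h[1];
    // Cramer's rule for the 2x2 local linear solve.
    const Vec2 du{(r0 * d - b * r1) / det, (a * r1 - c * r0) / det};
    return {u[0] + du[0], u[1] + du[1]};
}
```

      For a linear decay model $h(u) = -k u$ with diagonal Jacobian, this reduces to the familiar backward-Euler result $u / (1 + k \Delta\sigma)$ per component. Because each grid point solves its own small system, no data beyond the usual halo exchange is needed.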

  19. Example: Highly nonlinear, parameterized ODE systems
      • Detailed chemical kinetics
        - Analytical Jacobian in PoKiTT w/ Nebo for GPU
        - Dense matrix formed w/ primitives and sparse transformation
      • Simple convective heat transfer
        - Single-element Jacobian combined with sparse transform
      • Finite mixing time
        - Scalar Jacobian matrix
      Right-hand side: $K + Q + T$ (K: kinetics source terms; Q: convective heat transfer; T: mixing/flow)
      Jacobian: $\left[ \frac{\partial K}{\partial V} + \frac{\partial Q}{\partial V} \right] \frac{\partial V}{\partial U} - \frac{1}{\tau} I$
      ($\partial K / \partial V$: full matrix (dense submat); $\partial Q / \partial V$: 1-element (sparse); $\partial V / \partial U$: 2N-elements (sparse); $\frac{1}{\tau} I$: scalar matrix)
      C++ code: ( dKdV + dqdV ) * dVdU - invT
      [Figure: GPU speedup for a 16x16 matrix (axis 0 to 30) at 16^3, 32^3, and 64^3; series: Dot Product, MatVec, Ax=b, Eigen-decomp.]
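
      The Jacobian assembly expression can be spelled out with small dense matrices as follows. This is only a sketch: `Mat`, `add`, `mul`, and `assemble_jacobian` are hypothetical names, the 2x2 size is illustrative, and every operand is stored densely, whereas the slide indicates that the original exploits single-element, sparse, and scalar structure.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 2;  // illustrative system size, not taken from the slides
using Mat = std::array<std::array<double, N>, N>;

// Element-wise sum of two dense matrices.
Mat add(const Mat& a, const Mat& b) {
    Mat c{};
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            c[i][j] = a[i][j] + b[i][j];
    return c;
}

// Dense matrix-matrix product.
Mat mul(const Mat& a, const Mat& b) {
    Mat c{};
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// J = (dKdV + dQdV) * dVdU - invTau * I
Mat assemble_jacobian(const Mat& dKdV, const Mat& dQdV, const Mat& dVdU, double invTau) {
    assert(invTau >= 0.0);  // inverse mixing time assumed non-negative
    Mat J = mul(add(dKdV, dQdV), dVdU);
    for (std::size_t i = 0; i < N; ++i) J[i][i] -= invTau;  // scalar-matrix term
    return J;
}
```

      Writing the assembly as a composition of small operations keeps each physics contribution separate, so adding another term means adding one more operand rather than rewriting the kernel.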

  20. Conclusions
      Robust abstractions are needed to facilitate portable & performant applications on upcoming architectures.
      • DAG-based software design allows flexibility needed for multiphysics codes on heterogeneous platforms.
      • (E)DSLs can provide convenient, portable & performant abstractions for HPC applications.
      The Algorithm-Hardware collision:
      • Scalable GPU linear solvers are needed for traditional algorithms to be viable on new architectures.
      • Alternative algorithms may be needed with higher arithmetic intensity
        - higher-order
        - point-implicit?
      DE-NA0002375, DE-NA-000740, XPS award 1337145, DE-SC0008998
