
SLIDE 1

Solving Petascale Turbulent Combustion Problems with the Uintah Software

Martin Berzins, DOE NNSA PSAAP2 Center

Thanks to DOE ASCI (97-10), NSF, DOE NETL+NNSA, INCITE, XSEDE, ALCC, ORNL, and ALCF for funding and CPU hours. This work is part of our NNSA PSAAP2 Center, using INCITE + ALCC awards.

SLIDE 2

PSAAP2 Applications Team / PSAAP DSL Team: Todd Harman, Jeremy Thornock, Derek Harris, Ben Issac, James Sutherland, Tony Saad

PSAAP Extreme Scaling Team / SANDIA: John Schmidt, Alan Humphrey, John Holmen, Brad Peterson, Dan Sunderland

Part of the Utah PSAAP Center: Phil Smith (PI), Dave Pershing, MB

NSF Resilience: Sahithi Chaganti, Aditya Pakki

SLIDE 3

Seven abstractions for applications post-petascale

1. A task-based formulation of problems at scale: PSAAP GE/Alstom clean coal boiler

2. A programming model to write these tasks as code: Uintah tasks specify halos and read from / write to a local data warehouse

3. A runtime system to execute these tasks: the Uintah runtime system continues to evolve

4. A low-level portability layer to allow tasks to run on different architectures: Kokkos

5. A domain-specific language to ease problem solving: Nebo, Wasatch (not discussed here)

6. A resilience model: AMR-based duplication

7. Scalable components for I/O, in-situ visualization, and solvers: PIDX, VisIt, hypre
SLIDE 4

[Figure: boiler simulation showing O2 concentrations; the boiler is 92 meters tall]

Alstom Power 1000 MWe "Twin Fireball" boiler, supplying power for 1M people. A 1mm grid resolution gives 9 x 10^12 cells, 100x larger than the largest problems solved today. Requires AMR, linear systems, thermal radiation, and turbulent combustion LES.

SLIDE 5

Simulations of Clean Coal Boilers using ARCHES in Uintah

  • Traditional Lagrangian/RANS approaches do not address particle effects well, so Large Eddy Simulation is used; it has the potential to be an important design tool
  • Structured, high-order finite-volume: mass, momentum, and energy conservation
  • Particles via DQMOM (many small linear solves)
  • Low Mach number approximation: pressure Poisson solve with hypre GMG + red-black Gauss-Seidel (see the sketch after this list)
  • Radiation via Discrete Ordinates: massive solves, 20+ solves of the Radiative Transfer Equation with hypre every few timesteps
  • Radiation via ray tracing
  • Uncertainty quantification
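As an aside on why the low Mach number approximation leads to a pressure Poisson solve every timestep, here is a generic, self-contained 1D projection-step sketch in C++. It is not ARCHES code: the intermediate velocity uStar is made divergence-free by solving a Poisson equation for the pressure and correcting the velocity, with a simple Jacobi iteration standing in for the hypre GMG + red-black Gauss-Seidel solver used in practice.

// Minimal 1D projection step: solve d2p/dx2 = div(u*)/dt, then correct u.
// Generic illustration only; not ARCHES/Uintah code.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const int    n  = 32;                     // interior cells
  const double h  = 1.0 / n;                // grid spacing
  const double dt = 1e-3;                   // timestep
  const double pi = std::acos(-1.0);

  // Intermediate (not divergence-free) face velocities after advection/diffusion.
  std::vector<double> uStar(n + 1);
  for (int i = 0; i <= n; ++i) uStar[i] = std::sin(2.0 * pi * i * h);

  // Right-hand side: div(u*)/dt in each cell.
  std::vector<double> rhs(n + 2, 0.0), p(n + 2, 0.0), pNew(n + 2, 0.0);
  for (int i = 1; i <= n; ++i) rhs[i] = (uStar[i] - uStar[i - 1]) / (h * dt);

  // Solve d2p/dx2 = rhs with Jacobi iteration (stand-in for a multigrid solver).
  for (int it = 0; it < 20000; ++it) {
    for (int i = 1; i <= n; ++i)
      pNew[i] = 0.5 * (p[i - 1] + p[i + 1] - h * h * rhs[i]);
    p.swap(pNew);
  }

  // Pressure correction on interior faces: u = u* - dt * dp/dx.
  std::vector<double> u(uStar);
  for (int i = 1; i < n; ++i) u[i] = uStar[i] - dt * (p[i + 1] - p[i]) / h;

  // The corrected field is now (discretely) divergence-free in interior cells.
  double maxDiv = 0.0;
  for (int i = 2; i < n; ++i)
    maxDiv = std::max(maxDiv, std::fabs((u[i] - u[i - 1]) / h));
  std::printf("max |div u| after projection: %e\n", maxDiv);
  return 0;
}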

[Validation figure: red is experiment, blue is simulation, green is consistent; see Modest and Howarth]

SLIDE 6

Uintah Programming Model for a Stencil Timestep

Example stencil task: Unew = Uold + dt*F(Uold, Uhalo)

The task GETs Uold and Uhalo from the old data warehouse (halo values are received over the network via MPI) and PUTs Unew into the new data warehouse (halo values are then sent to neighbouring patches). The user specifies mesh patches, halo levels, and connections. A minimal sketch of this pattern follows.
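To make the pattern concrete, here is a minimal, self-contained C++ sketch of such a stencil task. This is not the real Uintah task API: DataWarehouse, PatchData and stencilTask are hypothetical stand-ins that only mimic the get-with-halo / compute / put structure described above, with a 3-point stencil standing in for F.

// Toy sketch of the Uintah-style task pattern: GET old data (with halo),
// compute Unew = Uold + dt*F(Uold, Uhalo), PUT the result in the new DW.
#include <map>
#include <string>
#include <vector>

using PatchData     = std::vector<double>;                 // one patch, with halo cells
using DataWarehouse = std::map<std::string, PatchData>;    // variable name -> data

// One stencil task on one patch.
void stencilTask(const DataWarehouse& oldDW, DataWarehouse& newDW,
                 int nCells, int halo, double dt) {
  // GET: Uold plus its halo from the old data warehouse (in Uintah the halo
  // values would have arrived from neighbouring patches via MPI).
  const PatchData& Uold = oldDW.at("U");                    // size nCells + 2*halo

  PatchData Unew(Uold.size(), 0.0);
  for (int i = halo; i < nCells + halo; ++i) {
    double F = Uold[i - 1] - 2.0 * Uold[i] + Uold[i + 1];   // simple 3-point stencil
    Unew[i]  = Uold[i] + dt * F;
  }

  // PUT: store Unew in the new data warehouse; the runtime would then send
  // the halo portions of Unew to the neighbouring patches.
  newDW["U"] = Unew;
}

int main() {
  const int nCells = 8, halo = 1;
  DataWarehouse oldDW, newDW;
  oldDW["U"] = PatchData(nCells + 2 * halo, 1.0);           // constant initial field
  stencilTask(oldDW, newDW, nCells, halo, 0.1);
  return 0;
}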

SLIDE 7

Uintah Architecture

Runtime system: Simulation Controller, Scheduler, Load Balancer, Task Data Warehouse, hypre linear solver.

Applications code and programming model: ARCHES, DSL (Nebo), PIDX, VisIt, UQ drivers, targeting CPUs, GPUs, and Xeon Phis.

An abstract C++ task graph form is automatically generated from the applications code and executed adaptively: asynchronous out-of-order execution, work stealing, and overlapping of communication and computation. Components are NOT architecture specific and do not change. Strong and weak scaling out to 800K cores for AMR fluid-structure interaction.

Open source software, worldwide distribution, broad user base.

SLIDE 8

Uintah: Unified Heterogeneous Scheduler & Runtime (one node)

[Diagram: on each node the task graph feeds CPU task queues (internal ready tasks) and GPU task queues (GPU-enabled and GPU-ready tasks). CPU threads run CPU tasks that GET/PUT shared data in the Data Warehouse (a variables directory); MPI sends and receives move data across the network as it becomes ready. GPU tasks GET/PUT from a GPU Data Warehouse, with H2D and D2H copy streams, stream events, and completed-task tracking around the GPU kernels.]

No MPI inside a node, lock-free Data Warehouse, cores and GPUs pull work (a generic sketch of this pull model follows below).

Devilishly difficult.
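As an illustration of the "cores pull work" idea only, the following generic C++ sketch shows worker threads pulling ready tasks from a shared queue and executing them. It is not Uintah's scheduler, which is lock-free and also manages GPU queues, MPI and the data warehouses; a mutex-protected queue keeps the sketch short.

// Generic work-pulling sketch: each worker thread repeatedly pulls a task
// from the shared queue and runs it until no work is left.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TaskQueue {
public:
  void push(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(m_);
    tasks_.push(std::move(task));
  }
  bool pull(std::function<void()>& task) {     // returns false when empty
    std::lock_guard<std::mutex> lock(m_);
    if (tasks_.empty()) return false;
    task = std::move(tasks_.front());
    tasks_.pop();
    return true;
  }
private:
  std::mutex m_;
  std::queue<std::function<void()>> tasks_;
};

int main() {
  TaskQueue queue;
  for (int i = 0; i < 16; ++i)
    queue.push([i] { std::printf("running task %d\n", i); });

  // Each "core" (thread) pulls work until the queue is empty.
  const unsigned nThreads = std::max(2u, std::thread::hardware_concurrency());
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nThreads; ++t)
    workers.emplace_back([&queue] {
      std::function<void()> task;
      while (queue.pull(task)) task();
    });
  for (auto& w : workers) w.join();
  return 0;
}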

SLIDE 9

Scaling Results on Mira (5/22)

I/O every 10 steps. Standard timestep includes the pressure Poisson solve. Radiation solve via Discrete Ordinates every 7 steps: S_6, 48 directions, a hypre solve for each direction.

One 12x12x12 patch per core, 10K variables per core, 31 timesteps. Largest case: 5 Bn unknowns. Production runs use 250K cores. For I/O, PIDX scales better and is being linked to Uintah. For radiation we have ray tracing working.


SLIDE 10


Radiation Overview

Solving energy and radiative heat transfer equations simultaneously

  • Radiation-energy coupling incorporated by radiative source term
  • Energy equation conventionally solved by ARCHES (finite volume)
  • The temperature field T is used to compute the net radiative source term
  • This requires integration of the incoming intensity over the solid angle, done with reverse Monte Carlo ray tracing (RMCRT)

The energy equation has the form

    ∂T/∂t = Diffusion - Convection + Source/Sinks,

and the radiative part of the source term involves the divergence of the radiative heat flux, ∇·q. Evaluating ∇·q in a cell requires the integral of the incoming intensity over the full solid angle, which is estimated with a finite number of rays:

    ∫_4π I_in dΩ  ≈  (4π / N_ray) Σ_{ray=1..N_ray} I_ray
Rays are traced backwards, e.g. from S to E, with one computational cell per CUDA thread; this eliminates the need to track rays that never reach that cell. A sketch of this backward estimate follows below.

Todd Harman, Alan Humphrey, Derek Harris
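For illustration, here is a minimal C++ sketch of the backward ray-tracing estimate above, under strong simplifying assumptions: a gray medium, a uniform absorption coefficient, fixed-step ray marching instead of exact cell traversal, cold black walls, and a made-up temperature field. It is not the Uintah RMCRT component; it only shows how the incident radiation is estimated as (4π/N_ray) Σ I_ray and used to form ∇·q at one cell.

// Backward Monte Carlo ray tracing for one target cell (gray medium).
#include <cmath>
#include <cstdio>
#include <random>

int main() {
  const double pi    = std::acos(-1.0);
  const double kappa = 5.0;                  // absorption coefficient [1/m]
  const double sigma = 5.670374419e-8;       // Stefan-Boltzmann constant
  const double h     = 1.0 / 32.0;           // nominal cell size on the unit cube
  const double step  = 0.5 * h;              // ray-marching step

  // Made-up temperature field: hot core, cooler surroundings.
  auto T  = [](double x, double y, double z) {
    double r = std::sqrt((x - 0.5) * (x - 0.5) + (y - 0.5) * (y - 0.5) + (z - 0.5) * (z - 0.5));
    return r < 0.25 ? 1500.0 : 800.0;
  };
  auto Ib = [&](double temp) { return sigma * std::pow(temp, 4) / pi; };  // blackbody intensity

  const double px = 0.5, py = 0.5, pz = 0.5; // target cell centre
  const int    nRay = 5000;

  std::mt19937 rng(42);
  std::uniform_real_distribution<double> uni(0.0, 1.0);

  double sumI = 0.0;
  for (int r = 0; r < nRay; ++r) {
    // Uniform random direction over the sphere.
    double mu = 2.0 * uni(rng) - 1.0, phi = 2.0 * pi * uni(rng);
    double s  = std::sqrt(1.0 - mu * mu);
    double dx = s * std::cos(phi), dy = s * std::sin(phi), dz = mu;

    // March backwards along the ray, accumulating emission attenuated by the
    // optical depth between each segment and the target cell.
    double x = px, y = py, z = pz, tau = 0.0, I = 0.0;
    while (x > 0 && x < 1 && y > 0 && y < 1 && z > 0 && z < 1) {
      double dtau = kappa * step;
      I   += Ib(T(x, y, z)) * std::exp(-tau) * (1.0 - std::exp(-dtau));
      tau += dtau;
      x += dx * step; y += dy * step; z += dz * step;
    }
    sumI += I;                               // cold black walls add nothing
  }

  double G    = 4.0 * pi * sumI / nRay;                     // incident radiation ~ ∫ I_in dΩ
  double divq = kappa * (4.0 * pi * Ib(T(px, py, pz)) - G); // net radiative source term
  std::printf("G = %g W/m^2, div q = %g W/m^3\n", G, divq);
  return 0;
}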

SLIDE 11


Multi-Level AMR GPU RMCRT

Replicate the mesh and use a coarse representation of the computational domain with multiple levels. Define a Region of Interest (ROI) and surround it with coarser grids: as rays travel further away from the ROI, the mesh spacing becomes larger. New information relating to heat fluxes and absorption and scattering coefficients is transmitted using the same adaptive ideas. This reduces computational cost, memory, and communication volume.

Todd Harman, Alan Humphrey

16,384 GPUs

SLIDE 12


Better use of GPUs with Per Task GPU Datawarehouse

  • A single, shared DataWarehouse does not scale with problem complexity
  • Increasing DW size meant more device synchronization
  • Solution: per-task DataWarehouses on the GPU
  • No sharing or atomic operations required
  • Computation and communication can be overlapped in a thread-safe manner

Brad Peterson

SLIDE 13


Better use of GPUs with Per Task GPU Datawarehouse (continued)


Allows rapid execution of a GPU task (< 1 microsecond), an order-of-magnitude speedup.

[Execution timelines: before and after the per-task DataWarehouse]

SLIDE 14

Abstractions for Portability and Node Performance

  • The domain specific language Nebo weak-scales to all of Titan: 18K GPUs and 260K CPUs
  • The Kokkos abstraction layer maps loops onto the machine efficiently, using cache-aware memory models and vectorization / OpenMP
  • Both use C++ template metaprogramming for compile-time data structures and functions
  • While Nebo lets users solve problems within the language framework, Kokkos lets users modify code at the loop level to optimize loops and achieve good memory placement

SLIDE 15


Kokkos – Uintah Infrastructure

Incremental refactor to Kokkos parallel patterns/views: replace patch grid iterator loops.

OLD:

for (auto itr = patch.begin(); itr != patch.end(); ++itr) {
  IntVector iv = *itr;
  A[iv] = B[iv] + C[iv];
}

BECOMES (NEW):

parallel_for(patch.range(), LAMBDA(int i, int j, int k) {
  A(i,j,k) = B(i,j,k) + C(i,j,k);
});

Grid variables were refactored to expose unmanaged Kokkos views: these reuse the existing memory allocations and layouts and remove many levels of indirection in the existing implementation (a sketch of an unmanaged view follows below). Future work using managed Kokkos views for portability will benefit all components.

Already a 2x speedup on 72 cores for RMCRT.

Dan Sunderland, Alan Humphrey
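Below is a minimal sketch of the unmanaged-view idea just mentioned: wrapping a pre-existing allocation in a Kokkos::View so that the existing memory and layout are reused without a copy. The variable names, extents and layout here are assumptions for illustration, not the actual Uintah grid-variable refactor.

// Wrap existing storage in an unmanaged Kokkos::View and use it in a
// parallel pattern on the host execution space.
#include <Kokkos_Core.hpp>
#include <vector>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int ni = 8, nj = 8, nk = 8;

    // Pre-existing storage owned by the (hypothetical) grid variable.
    std::vector<double> raw(ni * nj * nk, 1.0);

    // Unmanaged view: reuses the existing allocation and layout, no copy made.
    using HostView = Kokkos::View<double***, Kokkos::LayoutRight,
                                  Kokkos::HostSpace,
                                  Kokkos::MemoryTraits<Kokkos::Unmanaged>>;
    HostView A(raw.data(), ni, nj, nk);

    // The wrapped data can now be used directly in Kokkos parallel patterns.
    Kokkos::parallel_for(
        Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, ni),
        KOKKOS_LAMBDA(const int i) {
          for (int j = 0; j < nj; ++j)
            for (int k = 0; k < nk; ++k)
              A(i, j, k) *= 2.0;
        });
  }
  Kokkos::finalize();
  return 0;
}

Because the view is unmanaged, ownership and lifetime stay with the original allocation, which is what allows the refactor to proceed incrementally without changing existing memory management.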

SLIDE 16

Uintah

[Diagram: Applications (ARCHES, UQ drivers, DSL: Nebo) → Task Graph → Runtime System + key external modules (Simulation Controller, Scheduler, Load Balancer, Task Data Warehouse, PIDX, VisIt, hypre linear solver) → Target Architecture (CPUs, GPUs, Xeon Phis), with Kokkos loops, Kokkos memory "views", and Kokkos infrastructure spanning these layers]

Use the Kokkos abstraction layer, which maps loops onto machine-specific, cache-friendly data layouts and provides appropriate memory abstractions.

SLIDE 17

Resilience Joint Work With NSF XPS Project

  • Need interfaces at the system level to address:
  • Core failure: reroute tasks
  • Comms failure: reroute messages
  • Node failure: replicate patches using an AMR-type approach in which a coarse copy of the patch lives on another node; in 3D this has 12.5% overhead (see the worked estimate after this list). Interpolation is key here
  • Core slowdown: move tasks elsewhere; a 10% slowdown triggers an automatic move (Respa, SC 2015 workshop paper)
  • Need to address a possible MTBF of minutes? Or do we?
  • Early user program: TACC Intel KNL
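A quick check of the 12.5% figure mentioned above, assuming the replicated patch is coarsened by a factor of 2 in each of the 3 dimensions (consistent with the AMR-based duplication idea):

    (1/2)^3 = 1/8 = 0.125

so keeping one coarse copy of every patch on another node adds roughly 12.5% overhead in storage.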

Aditya Pakki, Sahithi Chaganti, Alan Humphrey, John Schmidt

SLIDE 18

Summary

  • The seven abstractions are all important for portability, scaling, and for not needing to change applications code. Showing that this approach works at scale is a key outcome for our project
  • Scalability will still require tuning the runtime system
  • Performance portability: using Kokkos to rewrite legacy applications for Phi and GPU is ongoing; aiming at Coral + Apex and beyond
  • Design study using a 350M CPU-hour INCITE award in 2016
  • Packages for scalable I/O (768K cores, Utah's PIDX) and linear algebra are ongoing, but GPUs remain problematic for the linear solver community
  • Resilience: experiments are ongoing, but perhaps it is not now expected to be such a problem?