Towards Exascale Direct Numerical Simulations of Turbulent Combustion on Heterogeneous Machines
Jacqueline Chen
Director of ExaCT
Sandia National Laboratories, Livermore, CA
jhchen@sandia.gov
www.exactcodesign.org
NVIDIA Booth @ SC14
– Develop new combustor concepts
– Design new fuels
– Need to perform simulations with sufficient chemical fidelity to differentiate effects of fuels where there is strong coupling with turbulence
– Need to address uncertainties in thermochemical properties
– Not addressing complexity of geometry in engineering design codes
Chemiluminescence from diesel lift-off stabilization for #2 diesel, ambient 21% O2, 850 K, 35 bar (courtesy of Lyle Pickett, SNL)
Ethylene-air lifted jet flame at Re=10000
– Express data locality (sometimes at the expense of FLOPS) and independence
– Allow expression of massive parallelism
– Minimize data movement and reduce synchronization
– Detect and address faults
limiter for performance improvement
system: optimize for compute
parallelism by adding nodes
within node & between nodes)
capacity and bandwidth
performance
parallelism within chips
locality and possibly topology
faster than capacity or bandwidth, no global hardware cache coherence
performance non-uniformity increase
– Compressible formulation
– Eighth-order finite difference discretization
– Fourth-order Runge-Kutta temporal integrator
– Detailed kinetics and transport
– Hybrid parallel model with MPI + OpenMP
– MPI + OpenACC (directives for GPUs)
– Legion (deferred execution hides latencies)
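As a rough illustration of the eighth-order finite difference discretization listed above, the sketch below applies the standard eighth-order central stencil for a first derivative on a uniform periodic grid. It is not taken from S3D: the function names, the periodic boundary treatment, and the sine test function are assumptions chosen to keep the example self-contained.

```python
import math

# Standard eighth-order central-difference weights for the first derivative,
# applied to the antisymmetric pairs f[i+k] - f[i-k] for k = 1..4.
COEFFS = (4.0 / 5.0, -1.0 / 5.0, 4.0 / 105.0, -1.0 / 280.0)


def ddx_8th_periodic(f, h):
    """First derivative of samples f on a uniform periodic grid of spacing h."""
    n = len(f)
    out = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(COEFFS, start=1):
            # Modulo indexing wraps around the periodic domain.
            acc += c * (f[(i + k) % n] - f[(i - k) % n])
        out.append(acc / h)
    return out


def max_error(n):
    """Largest error when differentiating sin(x) on [0, 2*pi) with n points."""
    h = 2.0 * math.pi / n
    x = [i * h for i in range(n)]
    d = ddx_8th_periodic([math.sin(v) for v in x], h)
    return max(abs(a - math.cos(v)) for a, v in zip(d, x))


if __name__ == "__main__":
    for n in (16, 32):
        print(n, max_error(n))
```

The stencil reaches four points to each side, so halving the grid spacing should shrink the error by a factor close to 2^8 = 256 for a smooth function.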
– Low Mach number model that exploits separation of scales between acoustic wave speed and fluid motion
– Second-order projection formulation
– Detailed kinetics and transport
– Block-structured adaptive mesh refinement
– Hybrid parallel model with MPI + OpenMP
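The benefit of separating acoustic and advective scales can be sketched with a simple explicit time step estimate. A compressible solver is limited by the acoustic speed |u| + c, while a low Mach number formulation is limited only by the fluid speed |u|. The CFL number, function names, and the sample velocity and sound speed below are illustrative assumptions, not values from LMC.

```python
def compressible_dt(dx, u, c, cfl=0.5):
    """Explicit time step limited by acoustic waves travelling at |u| + c."""
    return cfl * dx / (abs(u) + c)


def low_mach_dt(dx, u, cfl=0.5):
    """Explicit time step limited only by advection at speed |u|."""
    return cfl * dx / abs(u)


def step_ratio(u, c):
    """How many times larger the low Mach number step is than the compressible one."""
    return (abs(u) + c) / abs(u)


if __name__ == "__main__":
    # Hypothetical values: 10 m/s flow, 340 m/s sound speed, 1 mm cells.
    print(compressible_dt(1e-3, 10.0, 340.0))
    print(low_mach_dt(1e-3, 10.0))
    print(step_ratio(10.0, 340.0))
```

For these hypothetical values the Mach number is about 0.03, and the low Mach number step is 35 times larger than the compressible one.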
[Figure: Laboratory scale flames. Panel labels: LMC simulation ... from a low swirl injector; S3D simulation ... marker in lifted flame]
– Each MPI process is in charge of a piece of the 3D domain.
– Large message sizes. Non-blocking sends and receives
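The decomposition described above can be sketched without MPI by computing which block of the 3D grid each rank owns and how large a face message would be. This is a minimal sketch under stated assumptions: the helper names, the row-major rank ordering, and the ghost width of 4 (matching the half-width of an eighth-order central stencil) are illustrative choices, not details confirmed by the slides.

```python
def block_range(n, parts, idx):
    """Half-open range [lo, hi) owned by block idx when n points are split into parts blocks."""
    base, extra = divmod(n, parts)
    lo = idx * base + min(idx, extra)
    hi = lo + base + (1 if idx < extra else 0)
    return lo, hi


def rank_to_coords(rank, dims):
    """Map a rank to (ix, iy, iz) in a process grid of shape dims (row-major order)."""
    px, py, pz = dims
    return rank // (py * pz), (rank // pz) % py, rank % pz


def local_box(rank, dims, shape):
    """Index ranges of the sub-domain owned by rank, one (lo, hi) pair per axis."""
    coords = rank_to_coords(rank, dims)
    return tuple(block_range(n, p, c) for n, p, c in zip(shape, dims, coords))


def face_message_bytes(box, axis, ghost=4, nvars=1, bytes_per_value=8):
    """Size of one ghost-zone message exchanged across the face normal to axis."""
    sizes = [hi - lo for lo, hi in box]
    sizes[axis] = ghost  # only a slab of ghost layers is sent
    return sizes[0] * sizes[1] * sizes[2] * nvars * bytes_per_value


if __name__ == "__main__":
    box = local_box(0, (2, 2, 2), (64, 64, 64))
    print(box, face_message_bytes(box, axis=0))
```

In a real code each face message would be posted with non-blocking sends and receives so that interior work can overlap the exchange; the sketch only shows the bookkeeping.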
Levesque et al. SC’12
The logarithm of the scalar dissipation rate (that is, the local mixing rate) where white denotes high mixing rates and red, lower mixing rates
Bhagatwala et al., Proc. Combust. Inst. (2014)
raw simulation data to persistent storage
Workflow must integrate simulation, UQ, and analysis!
placement of solver + UQ, and analytics, reducing checkpointing size with in-situ and in-transit analytics
– Express the parts of the application that are essential to the science
– Express pain points for performance
– Hide complexities of real apps that are inconsequential to the science
– Express what is important for performance
– Hide complexity that is not consequential to performance
– Not absolute performance predictions
– Scientific experiments that gauge relative improvements over control
[Diagram: Applications, Proxy Apps, Architectures, Proxy Architectures, Models/Measurements, Architecture Simulators]
[Diagram: Exascale machine workflow. Compressible and low-Mach reacting flow solvers; write to disk. Proxy apps: descriptive statistics, visualization, topological analysis, UQ. Meta-skeletal workflow proxy app. New UQ + topology proxy for adjoint solution data flow.]
[Diagram: ExaCT proxy apps and codes: CNS, SMC, LMC, S3D, Castro, Multi-grid, Chemistry, UQ, SSA (Streaming Statistics), PVR (Parallel Volume Rendering), RTC (Reduced Topology)]
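To make the idea of streaming statistics concrete, the sketch below keeps a running mean and variance in a single pass and merges partial results from separate data chunks, which is the kind of operation in-situ analytics needs when data never reaches disk. It uses the well-known Welford update and pairwise merge formulas. It is not the SSA proxy app itself, and the class and method names are assumptions.

```python
class RunningStats:
    """Single-pass mean and variance accumulator that can be merged across chunks."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new sample into the running statistics (Welford update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine with statistics gathered independently on another chunk."""
        if other.n == 0:
            return
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n

    def variance(self):
        """Population variance of all samples seen so far."""
        return self.m2 / self.n if self.n else 0.0


if __name__ == "__main__":
    a, b = RunningStats(), RunningStats()
    for v in (1.0, 2.0):
        a.update(v)
    for v in (3.0, 4.0):
        b.update(v)
    a.merge(b)
    print(a.n, a.mean, a.variance())
```

Because only three numbers per variable are carried, each process can accumulate locally and merge at the end without storing the raw field.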
Chemistry
Instruction mix: adds 3620, muls 4303, divs 540, trans 710
CPU time breakdown: adds 3%, muls 4%, divs 18%, trans 75%
Dynamics
Instruction mix: adds 14509, muls 16447, divs 387, sqrt 1
CPU time breakdown: adds 34%, muls 34%, divs 32%
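The chemistry numbers above imply how expensive each operation type is relative to an add. The sketch below does that arithmetic, reading the chart labels as operation counts (adds 3620, muls 4303, divs 540, trans 710) and CPU time shares (3%, 4%, 18%, 75%). That reading of the extracted chart text and the function name are assumptions.

```python
def relative_cost_per_op(counts, time_share, reference="adds"):
    """Time per operation for each type, normalized to the reference type."""
    per_op = {op: time_share[op] / counts[op] for op in counts}
    return {op: per_op[op] / per_op[reference] for op in per_op}


# Values read from the chemistry instruction-mix and CPU-time charts.
chemistry_counts = {"adds": 3620, "muls": 4303, "divs": 540, "trans": 710}
chemistry_time = {"adds": 0.03, "muls": 0.04, "divs": 0.18, "trans": 0.75}

rel = relative_cost_per_op(chemistry_counts, chemistry_time)

if __name__ == "__main__":
    for op, cost in rel.items():
        print(op, round(cost, 1))
```

Under this reading a transcendental costs roughly 127 adds and a divide roughly 40, which is consistent with the slide's interest in faster exp and div units for the chemistry kernel.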
[Figure: Chemistry FP State Variables by Rank. Number of accesses (1 to 4096) vs. variables sorted by number of accesses (1 to 4096). Series: 9, 21, 53, 71, and 107 species.]
[Figure: Bandwidth Tapering. Bytes per flop (0.2 to 2) vs. cache size (2 to 8192 kB). Series: Baseline, Blocking, Fusion.]
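Why fusion lowers the bytes-per-flop demand can be shown with a toy count of memory accesses. The sketch below computes d = (a + b) * c once as two separate loops with a temporary array and once as a single fused loop, counting every array load and store. It assumes no cache at all and 8-byte values, and the class and function names are illustrative, so the numbers describe the counting model only and not the measured curves.

```python
class CountingArray:
    """List wrapper that counts element loads and stores in a shared counter."""

    def __init__(self, data, counter):
        self.data = list(data)
        self.counter = counter

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        self.counter["loads"] += 1
        return self.data[i]

    def __setitem__(self, i, value):
        self.counter["stores"] += 1
        self.data[i] = value


def unfused(a, b, c, t, d):
    """Two passes: the temporary t is written out and then read back."""
    n = len(a)
    for i in range(n):
        t[i] = a[i] + b[i]
    for i in range(n):
        d[i] = t[i] * c[i]
    return 2 * n  # one add and one multiply per element


def fused(a, b, c, d):
    """One pass: the intermediate sum never touches memory."""
    n = len(a)
    for i in range(n):
        d[i] = (a[i] + b[i]) * c[i]
    return 2 * n


def measure(fuse, n=100, bytes_per_value=8):
    """Return (bytes per flop, result) for the fused or unfused variant."""
    counter = {"loads": 0, "stores": 0}
    a, b, c, t, d = (CountingArray([1.0] * n, counter) for _ in range(5))
    flops = fused(a, b, c, d) if fuse else unfused(a, b, c, t, d)
    traffic = (counter["loads"] + counter["stores"]) * bytes_per_value
    return traffic / flops, d.data


if __name__ == "__main__":
    print(measure(fuse=False)[0], measure(fuse=True)[0])
```

In this model the unfused version moves 24 bytes per flop and the fused version 16, with identical results. Cache blocking pursues the same goal by reusing data while it is still resident in cache.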
[Figure: Estimated Performance Benefits. Estimated teraflops (0.5 to 5) vs. number of species (9, 21, 53, 71, 107). Series: Baseline, +Cache blocking, +Loop fusion, +Fast memory (4 TB/s), +Fast-div, +Fast-exp, +Fast NIC (400 GB/s).]
[Diagram: heterogeneous node with x86 cores, caches ($), NUMA domains, DRAM, and a CUDA device with FB memory]
11/05/13
[Figure panels: ξ, OH, HO2, CH2O]
[Diagram: Legion runtime core, dependence analysis, tasks, GPUs, CPU cores]
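The dependence analysis step named in the diagram can be pictured with a toy model: each task declares which data it reads and writes, and a task must wait for any earlier task whose accesses conflict with its own. The sketch below is only a conceptual illustration in plain Python. It does not use or reproduce the Legion API, and the task and field names are hypothetical.

```python
def dependences(tasks):
    """Map each task name to the set of earlier tasks it must wait for.

    tasks is a list of (name, reads, writes) tuples in program order.
    Two tasks conflict if either one writes data the other reads or writes.
    """
    deps = {}
    for i, (name, reads, writes) in enumerate(tasks):
        reads, writes = set(reads), set(writes)
        waits = set()
        for prev_name, prev_reads, prev_writes in tasks[:i]:
            prev_reads, prev_writes = set(prev_reads), set(prev_writes)
            read_after_write = prev_writes & reads
            write_after_read = prev_reads & writes
            write_after_write = prev_writes & writes
            if read_after_write or write_after_read or write_after_write:
                waits.add(prev_name)
        deps[name] = waits
    return deps


# Hypothetical tasks for one right-hand-side evaluation.
example = [
    ("chemistry", {"T", "Y"}, {"wdot"}),
    ("diffusion", {"T", "Y"}, {"flux"}),
    ("rhs", {"wdot", "flux"}, {"dYdt"}),
]

if __name__ == "__main__":
    print(dependences(example))
```

Here the chemistry and diffusion tasks only read shared inputs, so neither waits on the other and they could run concurrently on different processors, while the rhs task waits for both.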
[Diagram: CHEMKIN, TRANSPORT, and THERMO inputs; SSE, AVX, and CUDA targets]
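[Diagram: Data parallel vs. warp specialization. Data parallel: Warp 0, Warp 1, and Warp 2 each run the whole program P on the input data. Warp specialization: the program is partitioned into P0, P1, and P2, and Warp 0 runs P0, Warp 1 runs P1, and Warp 2 runs P2 on the input data.]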
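The two mappings in the diagram can be contrasted with a small work-assignment model. In the data-parallel mapping every warp runs all program partitions on its own slice of the data. In the warp-specialized mapping each warp runs a single partition on all of the data. The sketch below is a plain Python model of the assignment only, not GPU code, and the function names and round-robin data split are assumptions.

```python
def data_parallel(n_warps, partitions, n_data):
    """Each warp runs every partition on its own round-robin slice of the data."""
    work = {}
    for w in range(n_warps):
        items = [i for i in range(n_data) if i % n_warps == w]
        work[w] = [(p, i) for i in items for p in partitions]
    return work


def warp_specialized(partitions, n_data):
    """Warp w runs only partition w, applied to every data item."""
    return {w: [(p, i) for i in range(n_data)] for w, p in enumerate(partitions)}


def covered(work):
    """All (partition, data item) pairs executed by any warp."""
    return {pair for pairs in work.values() for pair in pairs}


def partitions_per_warp(work):
    """Number of distinct program partitions each warp has to execute."""
    return {w: len({p for p, _ in pairs}) for w, pairs in work.items()}


if __name__ == "__main__":
    parts = ["P0", "P1", "P2"]
    dp = data_parallel(3, parts, 6)
    ws = warp_specialized(parts, 6)
    print(covered(dp) == covered(ws))
    print(partitions_per_warp(dp), partitions_per_warp(ws))
```

Both mappings execute the same set of (partition, data) pairs. The difference is that a specialized warp carries only one partition's code and constants, at the price of exchanging intermediate results between warps.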
[Diagram: chemistry kernel partitioned across Warp 0 to Warp 3. Blocks: Chem Reacs 1 to 6, QSSA, Stiff, and Output Math, with reaction rate exchange through shared memory. QSS nodes shown: QSS 1, 2, 3, 5, 6, 7, 9, 10, 11, 14, 15.]