SLIDE 1

Towards Exascale Direct Numerical Simulations of Turbulent Combustion on Heterogeneous Machines

Jacqueline Chen, Director of ExaCT, Sandia National Laboratories, Livermore, CA. jhchen@sandia.gov | www.exactcodesign.org

NVIDIA Booth@SC’14 November 20, 2014, New Orleans

SLIDE 2

Why Exascale Combustion

  • Predict behavior of new fuels in different combustion scenarios at realistic pressure and turbulence conditions
    – Develop new combustor concepts
    – Design new fuels
  • Co-design center is focusing on high-fidelity direct numerical simulation methodologies
    – Need to perform simulations with sufficient chemical fidelity to differentiate effects of fuels where there is strong coupling with turbulence
    – Need to address uncertainties in thermo-chemical properties
    – Not addressing complexity of geometry in engineering design codes

SLIDE 3

Fundamental Turbulence-Chemistry Interactions Motivated by Advanced Engines and Gas Turbines

  • Higher fuel efficiency and lower emissions are driving combustion towards more dilute, fuel-lean, partially-premixed conditions
  • New mixed-mode combustion regimes
  • Strong sensitivities to fuel chemistry
  • Preferential diffusion effects: synthesis gases enriched with hydrogen for carbon capture and storage in gas turbines for power generation

SLIDE 4

Motivation: Understanding Stabilization of Lifted Flames in Heated Coflow

Chemiluminescence from diesel lift-off stabilization for #2 diesel, ambient 21% O2, 850 K, 35 bar (courtesy of Lyle Pickett, SNL)

What is the role of ignition in lifted flame stabilization?

SLIDE 5

DNS of Lifted Ethylene-air Jet Flame in a Heated Coflow

  • 3D slot burner configuration:
    – Lx × Ly × Lz = 30 × 40 × 6 mm³ with 1.28 billion grid points
    – High fuel jet velocity (204 m/s); coflow velocity (20 m/s)
    – Nozzle size for fuel jet, H = 2.0 mm
    – Re_jet = 10,000
    – Cold fuel jet (18% C2H4 + 82% N2) at 550 K, ξst ≈ 0.27
    – Detailed C2H4/air chemistry: 22 species, 18 global reactions, 201 steps
    – Hot coflow air at 1,550 K

Ethylene-air lifted jet flame at Re = 10,000
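As a consistency check (my arithmetic, not from the slide), the quoted jet Reynolds number follows from the jet velocity and nozzle width for a fuel-stream kinematic viscosity of roughly 4.1 × 10⁻⁵ m²/s:

```latex
\mathrm{Re}_{jet} = \frac{U_{jet}\, H}{\nu}
  = \frac{(204\ \mathrm{m/s})(2.0\times 10^{-3}\ \mathrm{m})}{4.1\times 10^{-5}\ \mathrm{m^2/s}}
  \approx 10{,}000
```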

SLIDE 6

Dynamics of lifted flame stabilization – Log(scalar dissipation) and Temperature

SLIDE 7

Why does this need exascale?

  • Turbulent combustion consists of phenomena occurring over a wide range of scales that are closely coupled
    – More grid points needed to resolve a larger dynamic range of scales
    – More time steps needed for better statistics and less dependence on initial conditions
  • Complex fuels require a higher number of equations per grid point
  • In situ uncertainty quantification with adjoint sensitivity: reverse causality, uncertainties in chemical inputs
  • In situ analytics/visualization
  • Coupled execution (hybrid Eulerian-Lagrangian particle solver, or lockstep DNS/LES)
SLIDE 8

Why do we need to do co-design?

Old Constraints:
  • Peak clock frequency as primary limiter for performance improvement
  • Cost: FLOPs are biggest cost for system; optimize for compute
  • Concurrency: modest growth of parallelism by adding nodes
  • Locality: MPI+X model (uniform costs within node & between nodes)
  • Memory scaling: maintain byte-per-flop capacity and bandwidth
  • Uniformity: assume uniform system performance

New Constraints:
  • Power: primary design constraint for future HPC system design
  • Cost: data movement dominates; optimize to minimize data movement
  • Concurrency: exponential growth of parallelism within chips
  • Locality: must reason about data locality and possibly topology
  • Memory scaling: compute growing 2x faster than capacity or bandwidth; no global hardware cache coherence
  • Heterogeneity: architectural and performance non-uniformity increase

  • Future algorithms, programming environments, runtimes, and hardware need to:
    – Express data locality (sometimes at the expense of FLOPS) and independence
    – Allow expression of massive parallelism
    – Minimize data movement and reduce synchronization
    – Detect and address faults

SLIDE 9

ExaCT Vision and Goal

  • Goal of combustion exascale co-design is to consider all aspects of the

combustion simulation process from formulation and basic algorithms to programming environments to hardware characteristics needed to enable combustion simulations on exascale architectures

– Interact with vendors to help define hardware requirements, computer scientists on requirements for programming environment and software stack, and applied mathematics community locality-aware algorithms for PDE’s, UQ, and analytics

  • Combustion is a surrogate for a much broader range of multiphysics

computational science areas

SLIDE 10

Petascale codes provide starting point for co-design process

  • S3D
    – Compressible formulation
    – Eighth-order finite difference discretization
    – Fourth-order Runge-Kutta temporal integrator
    – Detailed kinetics and transport
    – Hybrid parallel model with MPI + OpenMP
    – MPI + OpenACC (directives for GPUs)
    – Legion (deferred execution hides latencies)
  • LMC
    – Low Mach number model that exploits separation of scales between acoustic wave speed and fluid motion
    – Second-order projection formulation
    – Detailed kinetics and transport
    – Block-structured adaptive mesh refinement
    – Hybrid parallel model with MPI + OpenMP

Expectation is that exascale will require a new code base

Figures: LMC simulation of NOx emissions from a low-swirl injector; S3D simulation of the HO2 ignition marker in a lifted flame; laboratory-scale flames

SLIDE 11

S3D MPI Parallelism

  • 3D domain decomposition.

– Each MPI process is in charge of a piece of the 3D domain.

  • All MPI processes have the same number of grid

points and the same computational load

  • Inter-processor communication is only between

nearest neighbors in 3D topology

– Large message sizes. Non-blocking sends and receives

  • All-to-all communications are only required for

monitoring and synchronization ahead of I/O

  • Good parallel scaling on Titan

1 N 1 N
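A minimal sketch of this communication pattern (illustrative C++/MPI, not S3D's actual Fortran source; the local grid size and halo depth are assumptions):

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int nranks; MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Build a 3D Cartesian topology; every rank owns an identical sub-block,
  // so all processes carry the same computational load.
  int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
  MPI_Dims_create(nranks, 3, dims);
  MPI_Comm cart;
  MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

  const int n = 64;      // assumed local grid points per direction
  const int halo = 4;    // an 8th-order stencil needs 4 ghost planes
  const int count = n * n * halo;

  // Post non-blocking sends/receives to the 6 face neighbors only.
  std::vector<std::vector<double>> sendbuf, recvbuf;
  std::vector<MPI_Request> reqs;
  for (int dim = 0; dim < 3; ++dim) {
    int lo, hi;
    MPI_Cart_shift(cart, dim, 1, &lo, &hi);
    for (int nbr : {lo, hi}) {
      if (nbr == MPI_PROC_NULL) continue;      // physical boundary
      sendbuf.emplace_back(count, 1.0);        // pack ghost planes here
      recvbuf.emplace_back(count);
      reqs.emplace_back();
      MPI_Irecv(recvbuf.back().data(), count, MPI_DOUBLE, nbr, dim, cart, &reqs.back());
      reqs.emplace_back();
      MPI_Isend(sendbuf.back().data(), count, MPI_DOUBLE, nbr, dim, cart, &reqs.back());
    }
  }
  // ... overlap interior computation here ...
  MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);

  MPI_Finalize();
  return 0;
}
```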

SLIDE 12

What happens in the main solver?

  • Computes rate of change of N conserved quantities at every grid

point

– d/dt (Qk) = (Advection) + (Diffusion) + (Source) – Sum of all the terms that contribute to the time derivative is called the RHS

  • d/dt (Qk) is integrated explicitly in time through Runge-Kutta
  • RHS contains multiple terms that are functions of Qk, variables

derived from Qk

  • Advection and diffusion require finite differencing and MPI
  • Source terms are point-wise functions
  • Thermodynamic, chemical and molecular transport properties are

point-wise functions of Qk
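A structural sketch in C++ of the RHS-plus-explicit-RK pattern described above (illustrative only; names are placeholders, the term functions are trivial stubs, and the classical 4-stage RK stands in for the production integrator):

```cpp
#include <vector>

using State = std::vector<double>;  // N conserved quantities, flattened over the grid

// Placeholder term evaluations (assumptions for the sketch):
State advection(const State &Q) { return State(Q.size(), 0.0); } // stencil + MPI
State diffusion(const State &Q) { return State(Q.size(), 0.0); } // stencil + MPI
State source(const State &Q)    { return State(Q.size(), 0.0); } // point-wise chemistry

// RHS: d/dt(Qk) = advection + diffusion + source
State rhs(const State &Q) {
  State adv = advection(Q), dif = diffusion(Q), src = source(Q);
  State out(Q.size());
  for (std::size_t i = 0; i < Q.size(); ++i) out[i] = adv[i] + dif[i] + src[i];
  return out;
}

// Classical 4-stage explicit Runge-Kutta step.
State rk4_step(const State &Q, double dt) {
  auto axpy = [](const State &a, double s, const State &b) {
    State r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] + s * b[i];
    return r;
  };
  State k1 = rhs(Q);
  State k2 = rhs(axpy(Q, 0.5 * dt, k1));
  State k3 = rhs(axpy(Q, 0.5 * dt, k2));
  State k4 = rhs(axpy(Q, dt, k3));
  State out(Q.size());
  for (std::size_t i = 0; i < Q.size(); ++i)
    out[i] = Q[i] + dt / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]);
  return out;
}
```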

SLIDE 13

Source term is the most compute intensive kernel

  • Called as ckwyp or getrates
  • Chemical reaction rate computed using the Arrhenius model
    – A + B ⇌ C + D
    – Forward reaction rate = C · [A] · [B] · T^a · exp(−Ta/T)
    – Equilibrium constant gives the reverse reaction rates
    – More terms for third-body efficiency, collision efficiency, pressure corrections, …
  • The source term for a species is the sum of the rates of all reactions in which it participates
  • The kernel uses exp/log heavily (see the sketch below)
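An illustrative point-wise evaluation of the rate expressions above (not the actual ckwyp/getrates source; the struct layout and names are assumptions):

```cpp
#include <cmath>

// Modified Arrhenius forward rate constant: kf = C * T^a * exp(-Ta / T),
// with pre-exponential C, temperature exponent a, activation temperature Ta.
struct Reaction { double C, a, Ta; };

double forward_rate_const(const Reaction &r, double T) {
  return r.C * std::pow(T, r.a) * std::exp(-r.Ta / T);  // exp/log-heavy
}

// Rate of progress for A + B <=> C + D at one grid point. Kc is the
// concentration-based equilibrium constant, which gives the reverse rate
// constant as kr = kf / Kc.
double rate_of_progress(const Reaction &r, double T,
                        double cA, double cB, double cC, double cD,
                        double Kc) {
  double kf = forward_rate_const(r, T);
  double kr = kf / Kc;
  return kf * cA * cB - kr * cC * cD;
  // Real mechanisms add third-body efficiencies and pressure-dependent
  // fall-off corrections on top of this.
}

// The species source term is then the stoichiometrically weighted sum of
// the rates of progress over every reaction the species participates in.
```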
SLIDE 14

Hybridizing S3D: OLCF CAAR Early Science Project

  • Collaboration between Cray (John Levesque), NVIDIA (Greg Ruetsch, Cliff Woolley), NREL (Ray Grout) and ORNL (Ramanan Sankaran)
  • Significant restructuring of S3D to expose node-level parallelism
    – Movement of outer loops to the highest level in the RHS
    – Combine several pointwise physics computations together within the same OpenMP structure
    – Reordered computation and communication to achieve the most overlap
    – Restructured computation to minimize memory operations and vectorized all loops that reside on the accelerator
  • Control minimal data communication between the host and the accelerator with asynchronous updates (see the sketch below)
  • Resulting code is hybrid MPI+OpenMP and MPI+OpenACC (-DGPU only changes directives)
  • 6-fold performance improvement on Titan over Jaguar

Levesque et al. SC’12
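The actual port is Fortran with OpenACC directives; a C-style sketch of the same pattern (array names and sizes are assumptions) shows device-resident data plus an asynchronous kernel that the host can overlap with communication:

```cpp
// Illustrative only: keep the state resident on the accelerator for the
// whole RHS evaluation, copy in once and copy the result out once, and
// launch the fused point-wise work asynchronously.
void rhs_pointwise(const double *Q, double *rate, int npts, int nvar) {
  #pragma acc data copyin(Q[0:npts*nvar]) copyout(rate[0:npts*nvar])
  {
    // Fused point-wise physics, fully vectorizable, launched async so the
    // host can proceed to MPI halo exchange for the stencil terms.
    #pragma acc parallel loop async(1)
    for (int i = 0; i < npts; ++i) {
      for (int v = 0; v < nvar; ++v)
        rate[i*nvar + v] = 0.5 * Q[i*nvar + v];  // stand-in for real physics
    }
    // ... host-side MPI communication overlaps with the device kernel ...
    #pragma acc wait(1)
  }
}
```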

SLIDE 15

Day 1 Science with S3D OpenACC on Titan

  • Extinction/re-ignition in a turbulent di-methyl ether (DME) jet flame
    – Largest-Reynolds-number reacting DNS with complex chemistry (32 species for DME, an oxygenated biofuel)
    – Enabled comparison with a companion experiment
    – Validated an experimental diagnostic for a surrogate of peak heat release based on product imaging of CH2O and OH
    – Exploring the causality between turbulence and chemistry in the re-ignition process

Figure: the logarithm of the scalar dissipation rate (that is, the local mixing rate), where white denotes high mixing rates and red lower mixing rates

Bhagatwala et al., Proc. Combust. Inst. (2014)

SLIDE 16

In Situ Uncertainty Quantification Guided by Analytics

  • Topological segmentation and tracking
  • Distance field (level set)
  • Statistics
    – Filtering and averaging (spatial and temporal)
    – Statistical moments (conditional)
    – Statistical dimensionality reduction (joint PDFs)
    – Spectra (scalar, velocity, coherency)
  • Chemical explosive mode
  • Uncertainty in reaction rates characterizing the ignition/extinction events that control fuel efficiency and emissions, with respect to uncertainties in input chemical and transport parameters
  • Solve adjoint equations backward in time: need the primal state at all times (see the sketch below)
  • Exploit space-time locality guided by analytics to bound regions of interest
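A minimal sketch of the adjoint data-flow problem (illustrative only; the solver stubs and the store-everything strategy are assumptions, and in practice the primal states are far too large to keep in full, which is exactly why analytics-guided space-time locality matters):

```cpp
#include <vector>

using State = std::vector<double>;

// Placeholder solvers (assumptions for the sketch):
State step_forward(const State &Q, double /*dt*/) { return Q; }
State step_adjoint(const State &lambda, const State &/*Qn*/, double /*dt*/) {
  return lambda;  // adjoint of one step, linearized about the primal Qn
}

// Forward sweep: save every primal state so the backward sweep can use it.
std::vector<State> forward(State Q, double dt, int nsteps) {
  std::vector<State> saved{Q};
  for (int n = 0; n < nsteps; ++n) {
    Q = step_forward(Q, dt);
    saved.push_back(Q);
  }
  return saved;
}

// Backward sweep: the adjoint at step n needs the primal state at step n,
// so the sweep runs in reverse over the stored trajectory.
State backward(State lambda, const std::vector<State> &saved, double dt) {
  for (int n = (int)saved.size() - 1; n >= 0; --n)
    lambda = step_adjoint(lambda, saved[n], dt);
  return lambda;
}
```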

SLIDE 17
Petascale Workflow Model Won't Scale

  • I/O bandwidth constraints make it infeasible to save all raw simulation data to persistent storage: the workflow must integrate simulation, UQ, and analysis
  • Challenge: co-design a workflow that supports smart placement of solver, UQ, and analytics, reducing checkpoint size with in-situ and in-transit analytics

Diagram: O(1B) cores; O(1 PB) per dump every 30 min (or 1 min); O(1-10 EB) per run; synchronous I/O between the combustion simulation and analysis/visualization

Performing the simulation is not enough: need to analyze the results

SLIDE 18

Proxy Architectures and Apps are a key Capability for HPC Co-design

  • Mini-applications and proxy apps
    – Express the parts of the application that are essential to the science
    – Express pain points for performance
    – Hide complexities of REAL apps that are inconsequential to the science
  • Proxy hardware models are a reflection of the underlying machine architecture
    – Express what is important for performance
    – Hide complexity that is not consequential to performance
  • Use proxy applications on proxy hardware architectures to understand what choices offer the highest value (using modeling and simulation)
    – Not absolute performance predictions
    – Scientific experiments that gauge relative improvements over a control
  • Still need full applications

Diagram: Applications and Architectures, connected through Proxy Apps, Proxy Architectures, Models/Measurements, and Architecture Simulators

SLIDE 19

Overall Workflow Captured Through Proxy and Full Apps

Diagram: on the exascale machine, compressible and low-Mach reacting flow solvers (CNS, SMC, S3D, LMC, Castro, multi-grid chemistry, UQ) write reduced checkpoints and reduced data to disk, and feed proxy apps for descriptive statistics (SSA: streaming statistics), visualization (PVR: parallel volume rendering), topological analysis (RTC: reduced topology), and UQ. A meta-skeletal workflow proxy app and a new UQ + topology proxy capture the adjoint-solution data flow. (www.exactcodesign.org)

SLIDE 20

ExaSAT: Analysis of Proxy Applications

  • Extracts floating point instruction mix and estimates CPU time
  • Analyzes state variable accesses and estimates register spill impact on L1 cache traffic

Figure (chemistry instruction mix): 3620 adds, 4303 muls, 540 divs, 710 transcendentals; breakdown of CPU time: adds 3%, muls 4%, divs 18%, transcendentals 75%
Figure (dynamics instruction mix): 14509 adds, 16447 muls, 387 divs, 1 sqrt; breakdown of CPU time: adds 34%, muls 34%, divs 32%
Figure (chemistry FP state variables by rank): number of accesses (1 to 4096, log scale) per variable, sorted by number of accesses, for mechanisms with 9, 21, 53, 71, and 107 species
SLIDE 21

Impact of Software Optimization on SMC Dynamics Computation

  • Data reuse analysis shows the impact of software optimizations (cache blocking, loop fusion; see the sketch below) on memory traffic for various cache sizes
  • Large potential benefit from increased memory bandwidth and fast exponentials
  • Trends can be investigated in further detail using hardware simulators

Figure (bandwidth tapering): bytes per flop versus cache size (2 kB to 8192 kB) for baseline, +best blocking, and +best fusion
Figure (estimated performance benefits): estimated teraflops versus number of species (9, 21, 53, 71, 107) for baseline, +cache blocking, +loop fusion, +fast memory (4 TB/s), +fast-div, +fast-exp, +fast NIC (400 GB/s)
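A generic C++ illustration of the two software optimizations named in the figure (not SMC code; the kernels and blocking factor are assumptions):

```cpp
#include <algorithm>

// Baseline: two separate sweeps, so the intermediate array `tmp` makes a
// full round trip through memory between them (high bytes-per-flop demand).
void baseline(const double *q, double *tmp, double *out, int n) {
  for (int i = 0; i < n; ++i) tmp[i] = q[i] * q[i];
  for (int i = 0; i < n; ++i) out[i] = tmp[i] + q[i];
}

// Loop fusion: one sweep; the intermediate lives in a register, so the
// memory traffic for `tmp` disappears entirely.
void fused(const double *q, double *out, int n) {
  for (int i = 0; i < n; ++i) {
    const double t = q[i] * q[i];
    out[i] = t + q[i];
  }
}

// Cache blocking: when several fused kernels reuse the same data, walking
// the grid in tiles sized to the cache keeps reused values resident
// instead of streaming them from DRAM on every pass.
void blocked(const double *q, double *out, int n, int block /* e.g. 512 */) {
  for (int lo = 0; lo < n; lo += block) {
    const int hi = std::min(lo + block, n);
    for (int i = lo; i < hi; ++i)
      out[i] = q[i] * q[i] + q[i];  // stand-in for the real tiled kernels
  }
}
```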

SLIDE 22

Programming Environment Critical to Performance

Effective use of exascale hardware will require a programming environment that effectively maps algorithms to hardware

  • Driven by programmability of combustion applications and characterization of algorithms on different architecture designs
    – Simplify programming to express locality and independence
    – Simplify programming of block-structured PDEs, analytics, and UQ for performance, scalability & portability on heterogeneous architectures with high variability, while still maintaining readability

SLIDE 23

Legion Approach

  • Capture the structure of program data
  • Decouple specification from mapping
  • Automate data movement, parallelism discovery, synchronization, and the hiding of long latencies

Goal: a performance-oriented programming model with a raised abstraction for productivity and portability.

(Bauer, Treichler, Slaughter, Aiken: Structure Slicing: Extending Logical Regions with Fields, Thursday Nov 20, 1:30-2:00, Rm 388-390) http://legion.stanford.edu

SLIDE 24

Describing Data

  • Logical regions (arrays of objects):
    – Have no implied layout
    – Have no implied location
  • Described by:
    – Index space (set of keys)
    – Field space (set of fields)
  • Operations include:
    – Partitioning into subregions
    – Slicing by fields

Diagram: a logical region pictured as a grid of index space (points p0-p11) by field space (fields f0-f3)

SLIDE 25

Legion Tasks

  • Hierarchical tree of tasks
  • Legion tasks specify:
    – Region usage
    – Field usage
    – Access modes
  • Legion runtime:
    – Infers data dependences
    – Inserts copies

Diagram: Task 1 (Region R, Field A, read-write) on Node 0 and Task 2 (Region R, Field B, read-only) on Node 1; the runtime inserts the needed copy (see the sketch below)
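A schematic Legion task launch patterned on the public Legion C++ tutorials (http://legion.stanford.edu); the task IDs, field IDs, and helper function are assumptions, and API details may differ by runtime version:

```cpp
#include "legion.h"
using namespace Legion;

enum { TASK_1_ID = 1, TASK_2_ID = 2 };
enum { FID_A = 0, FID_B = 1 };

void launch_tasks(Context ctx, Runtime *runtime, LogicalRegion lr) {
  // Task 1: read-write on field A of region R. The runtime uses this
  // declaration, not the task body, to infer data dependences.
  TaskLauncher t1(TASK_1_ID, TaskArgument(NULL, 0));
  t1.add_region_requirement(RegionRequirement(lr, READ_WRITE, EXCLUSIVE, lr));
  t1.add_field(0, FID_A);
  runtime->execute_task(ctx, t1);

  // Task 2: read-only on field B of the same region. Because it touches a
  // disjoint field, the runtime may run it in parallel with Task 1,
  // inserting copies between nodes as the chosen mapping requires.
  TaskLauncher t2(TASK_2_ID, TaskArgument(NULL, 0));
  t2.add_region_requirement(RegionRequirement(lr, READ_ONLY, EXCLUSIVE, lr));
  t2.add_field(0, FID_B);
  runtime->execute_task(ctx, t2);
}
```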

SLIDE 26

Legion Mapping

  • Mapping happens at runtime:
    – Assign tasks to processors
    – Assign regions to memories
  • With the mapping API, the programmer selects:
    – Where tasks run
    – Where data is placed

Diagram: tasks t1-t5 and a region tree (rc, rw, rw1, rw2, rn, rn1, rn2) mapped onto a heterogeneous node with x86 cores, caches, NUMA memories, and a CUDA GPU with framebuffer DRAM

SLIDE 27

Application: S3D

  • Current state-of-the-art combustion DNS simulation
    – Written in 200K lines of Fortran
    – Direct numerical solver using explicit methods
    – Many phases, lots of parallelism

Figure: mixture fraction ξ and the OH, HO2, and CH2O fields

SLIDE 28

S3D Task Parallelism

  • One call to the right-hand-side function (RHSF) as seen by the Legion runtime
    – Called 6 times per time step by the Runge-Kutta solver
    – Width == task parallelism
    – H2 mechanism (only 9 species); heptane (52 species) is significantly wider
  • Manual task scheduling would be difficult!
SLIDE 29

S3D Execution Profile

  • Two custom mappers
    – Schedule tasks onto CPUs and GPUs
    – Prioritize tasks on the critical path of RHSF

Figure: Legion execution of RHSF on one node, showing GPUs, CPU cores, and the Legion runtime core performing dependence analysis tasks

SLIDE 30

Leaf Tasks

  • Legion treats tasks as black boxes
    – Doesn't care how tasks are written
  • Still need fast leaf tasks for computationally expensive chemistry, diffusion, and viscosity
    – For CPUs & GPUs
    – For multiple mechanisms
  • Singe* is a DSL compiler for chemistry kernels

Diagram: the Singe compiler consumes CHEMKIN, TRANSPORT, and THERMO inputs and emits SSE, AVX, and CUDA kernels

*Bauer et al. PPoPP’14

SLIDE 31

Combustion Challenges

  • GPU programming models emphasize data parallelism
    – Not always the best choice for performance
  • Large working sets (per point)
    – n-Heptane chemistry needs 566 double-precision reaction rates (per point)
    – GPU register file can only store 128 per thread
  • Multi-phase computations (compute reaction rates, QSSA, stiffness, accumulate outputs)
    – Fissioned kernels are limited by memory bandwidth, and slow

SLIDE 32

Warp Specialization

  • Leverage knowledge of the underlying hardware
    – GPUs execute warps: streams of 32-wide vector instructions
    – All threads in a warp execute the same program (data-parallel unit)
  • Each warp can run a different computation (see the sketch below)
    – Generate code that specializes each warp (leverage task parallelism)
    – Different warps do different computations on the same data
    – Allows much better use of memory while keeping processors busy
    – Fits large working sets on chip

Diagram: data parallel (warps 0-2 all run the same program P over the input data) versus warp specialization (the program is partitioned into P0, P1, P2, and each warp runs one piece)
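A minimal CUDA sketch of the warp-specialized kernel shape (illustrative; Singe generates far more elaborate code, and the arithmetic here is a stand-in):

```cuda
// Each warp plays a different role in the computation and exchanges
// intermediates through shared memory instead of global memory.
__global__ void warp_specialized(const double *in, double *out) {
  __shared__ double rates[256];   // intermediates stay on chip
  int warp = threadIdx.x / 32;    // which specialized role this warp plays
  int lane = threadIdx.x % 32;

  if (warp == 0) {
    // Warp 0: produce reaction rates into shared memory (stand-in math).
    for (int i = lane; i < 256; i += 32)
      rates[i] = in[i] * 2.0;
  }
  __syncthreads();                // make the rates visible to warp 1

  if (warp == 1) {
    // Warp 1: consume the rates (stand-in for QSSA / stiffness / outputs).
    for (int i = lane; i < 256; i += 32)
      out[i] = rates[i] + 1.0;
  }
  // Unlike fissioned kernels, the intermediate values never round-trip
  // through global memory, so the large working set stays on chip.
}

// Launch with e.g. warp_specialized<<<1, 64>>>(d_in, d_out): two warps,
// each running a different piece of the partitioned program.
```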

SLIDE 33

Singe Warp-Specialized Chemistry Kernel

Diagram: four warps, each running a different pipeline of chemistry reaction stages (Chem Reacs 1-6), QSSA stages (QSS 1-15), stiffness handling, and output math, with reaction rates exchanged through shared memory

SLIDE 34

Performance Results

  • Chemistry kernel
    – All Singe kernels are significantly faster than current production versions
    – Warp-specialized Singe code is up to 3.75 times faster than previously optimized data-parallel CUDA kernels
  • Multi-node heterogeneous testbeds for Legion S3D
    – Keeneland: 128 nodes, 16 Sandy Bridge cores and 3 Fermi GPUs per node
    – Titan: 18K nodes, 16 Interlagos cores and 1 K20 GPU per node

SLIDE 35

S3D Performance Comparison

  • Compare against the MPI+OpenACC code on Titan
    – Tuned by Cray and NVIDIA engineers with ORNL/NREL domain experts
  • OpenACC runs 48³ and 64³ for DME and heptane
    – Fixed mapping (most compute on the GPU)
  • Legion runs 48³, 64³, and 96³ for any mechanism (DME, heptane, PRF)
    – Try both all-GPU and mixed CPU+GPU mappings
  • Use re-ranking script for runs >= 1024 nodes
SLIDE 36

Legion S3D Performance on Titan (weak scaling)

Figure (n-heptane, 52 species, 64³): Legion is 1.71x-2.33x faster between 1024 and 8192 nodes; DME: 1.88x-2.85x faster between 1024 and 8192 nodes

SLIDE 37

Exascale Use Cases: Science at Relevant Conditions

  • Homogeneous charge compression ignition (HCCI engine combustion): a 'chemical' engine with high diesel-like efficiency without NOx and soot; tailor the charge stratification to control ignition and burn rate
  • Turbulent jet flames (swirl, transverse, cavity): low-swirl injector gas turbines with staged lean premixed combustion, flame stabilization, emissions
  • Lifted diesel jet flames: lifted autoignitive diesel jet flame stabilization with multi-stage ignition fuels
  • Need to include UQ with respect to chemistry and transport properties
  • Extrapolations of current capability show that these cases will require exascale-level resources

SLIDE 38

Thanks!

  • Questions?
  • jhchen@sandia.gov
  • www.exactcodesign.org