Practical Combustion Kinetics with CUDA - GPU Technology Conference - PowerPoint PPT Presentation



SLIDE 1

LLNL-PRES-668639

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Practical Combustion Kinetics with CUDA

GPU Technology Conference Russell Whitesides & Matthew McNenly

March 20, 2015

Session S5468

Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton

SLIDE 2

Collaborators

Good guys to work with.

  • Cummins Inc.
  • Convergent Science
  • NVIDIA
  • Indiana University

SLIDE 3

The big question.

(Figure) Does ___ plus ___ equal ___?

SLIDE 4

Lots of smaller questions:

There won’t be a quiz at the end.

  • What has already been done in this area?
  • How are we approaching the problem?
  • What have we accomplished?
  • What’s left to do?


SLIDE 5

Why? NVIDIA GPUs / CUDA Toolkit

More FLOP/s, more GB/s, faster growth in both.

(Figure) Data from NVIDIA's CUDA C Programming Guide, Version 6.0, 2014.

SLIDE 6

Approach also used to simulate gas turbines, burners, flames, etc.

  • Reacting flow simulation
  • Computational Fluid Dynamics (CFD)
  • Detailed chemical kinetics
  • Tracking 10s to 1000s of species
  • ConvergeCFD (internal combustion engines)
SLIDE 7

What has been done already in combustion kinetics on GPUs?

A few groups are working (publicly) on this, and some progress has been made. Recent review by Niemeyer & Sung [1]:

  • Spafford, Sankaran & co-workers (ORNL) (first published 2010)
  • Shi, Green & co-workers (MIT)
  • Stone (CS&E LLC)
  • Niemeyer & Sung (CWRU/OSU, UConn)

Most approaches use explicit or semi-implicit Runge-Kutta techniques. Some only use the GPU for the derivative calculation. From [1]: "Furthermore, no practical demonstration of a GPU chemistry solver capable of handling stiff chemistry has yet been made. This is one area where efforts need to be focused."

[1] K.E. Niemeyer, C.-J. Sung, Recent progress and challenges in exploiting graphics processors in computational fluid dynamics, J. Supercomput. 67 (2014) 528–564. doi:10.1007/s11227-013-1015-7.
SLIDE 8

Problem: Can’t directly port CPU chemistry algorithms to GPU

For chemistry it’s not as simple as adding new hardware.

  • GPUs need dense data and lots of it.
  • Large chemical mechanisms are sparse.
  • Small chemical mechanisms don’t have enough data.

(Even large mechanisms aren’t large in a GPU context.)

Solution: Re-frame many uncoupled reactor calculations into a single system of coupled reactors.

SLIDE 9

How do we solve chemistry on the CPU?

Example: engine simulation in Converge CFD. (Figure panels: Y_O2, temperature.)

SLIDE 10


SLIDE 11

Detailed Chemistry in Reacting Flow CFD:

Each cell is treated as an isolated system for chemistry. Operator splitting technique: solve an independent initial value problem in each cell (or zone) to calculate chemical source terms for the species and energy advection/diffusion equations.
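The split step can be sketched in a few lines. This is a toy illustration, not the actual solver: the "chemistry" is a hypothetical single-species decay solved with backward Euler (implicit, as stiffness demands), followed by an explicit diffusion step standing in for transport.

```python
import numpy as np

K_DECAY = 50.0  # hypothetical rate constant; stands in for a stiff mechanism

def advance_chemistry(conc, dt):
    """Solve the per-cell chemistry IVP over dt with backward Euler.

    Each cell is independent, so this maps over the whole array at once.
    For dC/dt = -k*C, backward Euler gives C_new = C / (1 + k*dt).
    """
    return conc / (1.0 + K_DECAY * dt)

def diffuse(conc, dt, d=0.1):
    """Explicit diffusion step on a periodic 1-D grid of cells."""
    lap = np.roll(conc, 1) + np.roll(conc, -1) - 2.0 * conc
    return conc + d * dt * lap

def operator_split_step(conc, dt):
    """One CFD step: chemistry first, then transport, with the
    chemical source frozen during the transport sub-step."""
    return diffuse(advance_chemistry(conc, dt), dt)

c = np.ones(8)                          # uniform initial concentrations
c = operator_split_step(c, dt=0.02)     # one split step
```

The key point the slide makes is that `advance_chemistry` sees each cell as a standalone initial value problem; only the transport step couples neighbors.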

SLIDE 12


SLIDE 13


SLIDE 14

CPU (un-coupled) chemistry integration

Each cell is treated as an isolated system for chemistry. (Figure: independent integrations from t to t+Δt.)

SLIDE 15

GPU (coupled) chemistry integration

For the GPU we solve chemistry simultaneously in large groups of cells. (Figure: one coupled integration from t to t+Δt.)

SLIDE 16

What about variations in practical engine CFD?

If the systems are not similar, how much extra work needs to be done?

SLIDE 17

What are the equations we’re trying to solve?

Significant effort to transform the fastest CPU algorithms into GPU-appropriate versions.

Derivative equations (vector calculations), for a perfectly stirred reactor:

$$\frac{dy_i}{dt} = \frac{w_i}{\rho}\,\frac{dC_i}{dt}$$

$$\frac{dT}{dt} = -\frac{RT}{\rho c_v}\sum_{i\,\in\,\text{species}} u_i\,\frac{dC_i}{dt}$$

The derivative represents the system of equations to be solved.

Jacobian matrix solution ($A = LU$):

  • Matrix solution required due to stiffness
  • Matrix storage in dense or sparse formats

(Figure: dense vs. sparse matrix structure.)

SLIDE 18

We want to solve many of these simultaneously

Not as easy as copy and paste.

SLIDE 19

Example: Species production rates

Major component of the derivative; lots of sparse operations.

Chemical reaction step rate coefficients:

Arrhenius rates:
$$k_i = A_i T^{n_i} e^{-E_{A,i}/RT}$$

Equilibrium reverse rates:
$$k_i = \frac{k_{i,f}}{K_{eq}} = k_{i,f}\,\exp\!\left(\sum_{j\,\in\,\text{prod}} \frac{G_j}{RT} - \sum_{j\,\in\,\text{reac}} \frac{G_j}{RT}\right)$$

Third-body enhanced rates:
$$k_i' = k_i \sum_{j\,\in\,\text{species}} \alpha_j C_j$$

Fall-off rates:
$$k_i' = k_i\,(\ldots)$$

Chemical reaction rates of progress:
$$R_i = k_i \prod_{j\,\in\,\text{species}} C_j^{\nu_{ij}}$$

Net rates of production:
$$\frac{dC_i}{dt} = \sum_{j\,\in\,\text{create}} R_j - \sum_{j\,\in\,\text{destroy}} R_j$$
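The rate expressions compose straightforwardly. A minimal sketch, assuming a hypothetical two-reaction, three-species mechanism (A → B, B → C) with made-up rate parameters; the stoichiometric matrices carry the sparse species connectivity discussed on the next slide:

```python
import numpy as np

R = 8.314  # gas constant, J/(mol*K)

def arrhenius(a, n, ea, temp):
    """k_i = A_i * T^n_i * exp(-E_A,i / (R*T))."""
    return a * temp**n * np.exp(-ea / (R * temp))

# Hypothetical toy mechanism: reaction 0 is A -> B, reaction 1 is B -> C.
nu_reac = np.array([[1, 0, 0],   # reactant stoichiometry (reaction x species)
                    [0, 1, 0]])
nu_prod = np.array([[0, 1, 0],   # product stoichiometry
                    [0, 0, 1]])

def net_production(conc, k):
    """R_i = k_i * prod_j C_j^nu_ij; dC/dt = creation minus destruction."""
    rop = k * np.prod(conc**nu_reac, axis=1)   # rates of progress per reaction
    return (nu_prod - nu_reac).T @ rop          # net production per species

# Made-up Arrhenius parameters (zero activation energy for simplicity):
k = arrhenius(a=np.array([1e3, 1e2]), n=np.array([0.0, 0.0]),
              ea=np.array([0.0, 0.0]), temp=1000.0)
dcdt = net_production(np.array([1.0, 2.0, 0.5]), k)   # [-1000, 800, 200]
```

Note that `nu_reac`/`nu_prod` are mostly zeros even for two reactions; for real mechanisms with hundreds of species this sparsity is exactly what hurts memory locality on the GPU.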

SLIDE 20

Example: Species production rates

Major component of derivative; Lots of sparse operations.


  • Chemical species connectivity
  • Generally sparsely connected
  • Leads to poor memory locality
  • Bad for GPU performance
SLIDE 21

Example: Species production rates

Approach: couple together reactors (or cells) and make smart use of GPU memory. Each column holds the data for a single reactor (cell); each row holds one data element across all reactors. The data are then arranged for coalesced access.
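The column-per-reactor layout is the classic array-of-structures to structure-of-arrays transpose. A NumPy sketch with toy sizes (the GPU analogue is that adjacent threads, one per reactor, then touch adjacent memory):

```python
import numpy as np

n_species, n_reactors = 4, 6   # toy sizes for illustration

# Reactor-major ("array of structures"): each ROW is one reactor's state.
aos = np.arange(n_species * n_reactors, dtype=float).reshape(n_reactors, n_species)

# Species-major ("structure of arrays"): each COLUMN is one reactor,
# each row is one data element for all reactors -- the slide's layout.
soa = aos.T.copy()

# One row operation now updates the same species across ALL reactors at
# once, which is the coalesced access pattern the slide describes.
soa[0, :] *= 2.0

assert soa.shape == (n_species, n_reactors)
```

On the GPU the same transpose means a warp of 32 threads reading species `i` for reactors `0..31` issues one contiguous memory transaction instead of 32 strided ones.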

SLIDE 22

Benchmarking Platforms

The CPU and GPU used both matter.

  • Big Red 2:
    • AMD Opteron Interlagos (16 core)
    • 1x Tesla K20
  • Surface (not pictured):
    • Intel Xeon E5-2670 (16 core)
    • 2x Tesla K40m

SLIDE 23

dC_i/dt: Significant speedup achieved for species production rates. (Big Red 2)

(Figure: speedup for 128, 256, 512, 1024, and 2048 simultaneous net production rate calculations.)

SLIDE 24

dC_i/dt: Less speedup than Big Red 2 because the CPU is faster. (Surface)

(Figure: speedup for 128, 256, 512, 1024, and 2048 simultaneous net production rate calculations.)

SLIDE 25

Need to put the rest of the calculations on the GPU.

We have implemented or borrowed algorithms for the rest of the chemistry integration: the derivative equations (vector calculations) and the Jacobian matrix solution (LU factorization, dense or sparse).

SLIDE 26

Need to put the rest of the calculations on the GPU.

Apart from dC_i/dt, the derivative is straightforward on the GPU.

SLIDE 27

Need to put the rest of the calculations on the GPU.

We are able to use NVIDIA-developed algorithms to perform matrix operations on the GPU.

SLIDE 28

Matrix Solution Methods

Dense:

  • CPU: LAPACK
    • dgetrf
    • dgetrs
  • GPU: CUBLAS
    • dgetrfBatched
    • dgetriBatched
    • batched matrix-vector multiplication

Sparse:

  • CPU: SuperLU
    • dgstrf
    • dgstrs
  • GPU: GLU (soon cusolverSP (CUDA 7.0))
    • LU refactorization (SuperLU for first factor)
    • LU solve
    • Conglomerate matrix (CUDA < 6.5)
    • Batched matrices (CUDA >= 6.5) (2-4x faster)
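The batched routines exist because each chemistry Jacobian is small; the win comes from factoring thousands of them at once. A NumPy stand-in for the batched dense solve, with toy well-conditioned random "Jacobians" rather than real chemistry matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_reactors, n = 2048, 10   # many small, Jacobian-sized systems (toy sizes)

# One (n x n) matrix per reactor, stacked along a batch axis -- the same
# layout the cuBLAS batched routines (e.g. cublasDgetrfBatched) operate on.
jac = rng.standard_normal((n_reactors, n, n)) + n * np.eye(n)
rhs = rng.standard_normal((n_reactors, n, 1))

# NumPy broadcasts the solve over the leading batch dimension, factoring
# each small matrix independently -- a CPU analogue of a batched GPU LU.
x = np.linalg.solve(jac, rhs)

residual = jac @ x - rhs   # should be near machine epsilon for every reactor
```

One big batched call amortizes launch overhead and keeps the GPU busy, which a loop of 2048 tiny individual factorizations cannot do.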

SLIDE 29

Test case for full chemistry integration

Ignition delay time calculation (i.e. shock tube simulation):

  • 256-2048 constant volume reactor calculations
  • No coupling to CFD
  • Comparing CPU and GPU with both dense and sparse matrix operations

This provides a gauge of what the ideal speedup will be in CFD simulations.

SLIDE 30

0D, Uncoupled, Ideal Case: Max speedup (Big Red 2)

As with dC_i/dt, the best speedup is for a large number of reactors.

(Figure: speedup (CPU time/GPU time) vs. number of species (10, 32, 48, 79, 94, 111, 160) for 256-2048 reactors; comparing CPU dense vs. GPU dense, CPU sparse vs. GPU dense, and CPU sparse vs. GPU sparse.)

SLIDE 31

Synchronization Penalty Test Case: testing the effect of non-identical reactors.

  • Converge CFD
  • Rectilinear volume (16x8x8 mm)
  • Initial conditions:
    • Variable gradients in temperature (T) and equivalence ratio (φ)
    • Uniform zero velocity
    • Uniform pressure (20 bar)
  • Boundary conditions:
    • No flux for all variables
  • ~50 CFD steps capturing complete fuel conversion
  • Every-cell chemistry (2048 cells, 1 CPU core, 1 GPU device)
  • 7 kinetic mechanisms from 10-160 species
  • Solved with both sparse and dense matrix algorithms

SLIDE 32

We compared the total chemistry cost for sequential auto-ignition in a constant volume chamber.

Initial conditions: increasing temperature, increasing equivalence ratio.

(Figure: pressure (MPa, 1.5-5.5) vs. time (µs) traces showing sequential ignition.)

SLIDE 33

We compared the total chemistry cost for sequential auto-ignition in a constant volume chamber.

Initial conditions: increasing temperature, increasing equivalence ratio.

Condition | T spread (K) | φ spread
Grad0     | 1450         | 1.0
Grad1     | 1400-1450    | 0.95-1.05
Grad2     | 1350-1450    | 0.90-1.10
Grad3     | 1250-1450    | 0.80-1.20

SLIDE 34

Converge GPU: Sequential Auto-ignition (Big Red 2)

Even in the non-ideal case we find significant speedup.

(Figure: speedup (CPU time/GPU time) vs. number of species (10-160) for grad0-grad3; comparing CPU sparse vs. GPU dense, CPU dense vs. GPU dense, and CPU sparse vs. GPU sparse.)

SLIDE 35

What’s the speedup on a “real” problem?

Finally ready to run an engine simulation on the GPU. (Figure panels: Y_O2, temperature.)

Compared the cost of every-cell chemistry from -20 to 15 CAD. 24 nodes of Big Red 2: 24 CPU cores vs. 24 GPU devices. Should be close to the worst-case scenario w.r.t. the synchronization penalty.

SLIDE 36

Engine calculation on GPU (Big Red 2)

Good speedup, with caveats.

  • 24 CPU cores = 53.8 hours
  • 24 GPU devices = 14.5 hours
  • Speedup = 53.8/14.5 = 3.7

SLIDE 37

CPU-GPU Work-sharing: let’s make use of the whole machine.

  • GPU speedup = S
  • Number of CPU cores = N_CPU
  • Number of GPU devices = N_GPU

$$S_{total} = \frac{N_{CPU} + N_{GPU}(S-1)}{N_{CPU}}$$

(Figure: ideal-case S_total vs. N_GPU (1-4) for S = 8 and N_CPU = 4, 8, 16, 32; measured points: Big Red 2 (1.4375), Surface (1.8750).)
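The work-sharing estimate, S_total = (N_CPU + N_GPU(S-1)) / N_CPU, can be checked numerically; assuming each GPU is fed by one of the CPU cores, it reproduces the Big Red 2 and Surface figures quoted on the slide:

```python
def total_speedup(s_gpu, n_cpu, n_gpu):
    """S_total = (N_CPU + N_GPU*(S-1)) / N_CPU.

    Each GPU is driven by one CPU core, so it contributes S-1 cores'
    worth of extra throughput on top of the core that feeds it.
    """
    return (n_cpu + n_gpu * (s_gpu - 1)) / n_cpu

# Slide values for S = 8 and 16 cores:
big_red_2 = total_speedup(8, n_cpu=16, n_gpu=1)  # 1 GPU  -> 1.4375
surface   = total_speedup(8, n_cpu=16, n_gpu=2)  # 2 GPUs -> 1.8750
```

The formula makes the later result plausible too: with S = 2.6, 16 cores, and 2 GPUs it predicts (16 + 2·1.6)/16 = 1.20, matching the measured engine-case speedup on Surface.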

SLIDE 38

CPU-GPU Work-sharing: Strong scaling

Strong scaling is good for this problem on the CPU.

Sequential auto-ignition case, grad0, 53 species, ~10,000 cells (Surface). (Figure: CPU chemistry time (seconds, log scale) vs. number of processors, 1-16.)

SLIDE 39

CPU-GPU Work-sharing: Strong scaling

Poor scaling with GPUs if all processors get the same amount of work.

Sequential auto-ignition case, grad0, 53 species, ~10,000 cells (Surface). (Figure: CPU chemistry vs. GPU chemistry with standard work sharing, ~7x apart.)

SLIDE 40

CPU-GPU Work-sharing: Strong scaling

Better scaling if the GPU processors are given an appropriate work load.

Sequential auto-ignition case, grad0, 53 species, ~10,000 cells (Surface). (Figure: CPU chemistry, GPU chemistry with standard work sharing (~7x), and GPU chemistry with custom work sharing (~1.7x S_total, with S = 6.6).)
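"Custom work sharing" means weighting each rank's share of cells by its relative throughput instead of splitting evenly. A hypothetical `partition_cells` helper sketching the idea (not Converge's actual scheme):

```python
def partition_cells(n_cells, n_ranks, gpu_ranks, s_gpu):
    """Split cells so a GPU-equipped rank gets S times a CPU rank's share.

    gpu_ranks: set of rank indices that drive a GPU (assumed mapping).
    s_gpu: measured GPU speedup relative to one CPU core.
    """
    weights = [s_gpu if r in gpu_ranks else 1.0 for r in range(n_ranks)]
    total = sum(weights)
    counts = [int(round(n_cells * w / total)) for w in weights]
    counts[-1] += n_cells - sum(counts)   # absorb rounding in the last rank
    return counts

# 10,000 cells over 4 ranks, rank 0 driving a GPU with S = 6.6:
counts = partition_cells(10_000, 4, gpu_ranks={0}, s_gpu=6.6)
```

With equal shares, every CPU rank would finish long before the GPU rank's batch is large enough to pay off; the weighted split keeps all ranks busy for roughly the same wall time, which is where the improved scaling comes from.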

SLIDE 41

Proof of Principle: Engine calculation on GPU+CPU cores (Surface)

In line with expectations.

  • 16 CPU cores = 21.2 hours
  • 16 CPU cores + 2 GPU devices = 17.6 hours
  • Speedup = 21.2/17.6 = 1.20 (S_total, with S = 2.6)

SLIDE 42

Sort of.

(Figure) Does ___ plus ___ equal ___?

SLIDE 43

Future directions

Possibilities for significant further improvements.

  • Improve CPU/GPU parallel task management
    • Minimize synchronization penalty
    • Work stealing
  • Improvements to derivative calculation
    • Custom code generation
    • Reframe parts as matrix multiplication
  • Improvements to matrix calculations
    • Analytical Jacobian
    • Mixed precision calculations

SLIDE 44

Conclusions

  • GPU chemistry for stiff integration implemented.
  • Implemented as a Converge CFD UDF but flexible for incorporation in other CFD codes.
  • Continuing development:
    • Further speedup envisioned
    • More work can improve applicability

Thank you!

SLIDE 45

Supplemental Slides

Just in case.


SLIDE 46

0D, Uncoupled, Ideal Case: Cost Breakdown (Surface)

Costs are evenly distributed on both CPU and GPU.

(Figure: normalized computation time (0.1-1.0) vs. # of species (10, 32, 48, 79, 94, 111, 160), CPU and GPU, dense and sparse, broken down into matrix formation, matrix factor, matrix solve, derivatives, and other.)

SLIDE 47

0D, Uncoupled, Ideal Case: Cost Breakdown (Surface)

Costs are evenly distributed on both CPU and GPU.

(Figure: normalized computation time (0.02-0.2) vs. # of species (10-160), dense and sparse, broken down into matrix formation, matrix factor, matrix solve, derivatives, and other.)

SLIDE 48

0D, Uncoupled, Ideal Case: Max speedup (Surface)

As with dC_i/dt, the best speedup is for a large number of reactors.

(Figure: speedup (CPU time/GPU time) vs. number of species (10, 32, 48, 79, 94, 111, 160) for 256-2048 reactors; comparing CPU dense vs. GPU dense and CPU sparse vs. GPU sparse.)