Burning on the GPU: Fast and Accurate Chemical Kinetics


SLIDE 1

LLNL-PRES-687782

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Burning on the GPU: Fast and Accurate Chemical Kinetics

GPU Technology Conference
Russell Whitesides

April 7, 2016

Session 6195

Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton

SLIDE 2

Why?

To make it go faster?

SLIDE 3

Why?

We burn a lot of gasoline.

  • Transportation efficiency
  • Chemistry is vital to predictive simulations
  • Chemistry can be > 90% of simulation time.
SLIDE 4

National lab compute power and industry need.

  • Supercomputing @ DOE labs: strong investment in GPUs with an eye towards exascale
  • OEM engine designers: require fast turnaround on desktop-class hardware

Why?

SLIDE 5

“Colorful Fluid Dynamics”

[Figure: engine simulation snapshots of O2 mass fraction (Y_O2) and temperature]

“Typical” engine simulation w/ detailed chemistry

SLIDE 6

Detailed Chemistry in Reacting Flow CFD:

Operator splitting technique: each cell is treated as an isolated system for chemistry. Solve an independent set of ordinary differential equations (ODEs) in each cell to calculate chemical source terms for the species and energy advection/diffusion equations.

[Figure: each cell advanced independently from t to t + ∆t]
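The operator-splitting loop described on this slide can be sketched as follows. This is a toy Python sketch, not the talk's code: `integrate_chemistry` and `advect_diffuse` are hypothetical stand-ins for the real stiff-ODE and transport solvers.

```python
def advance_cfd_step(cells, dt, integrate_chemistry, advect_diffuse):
    """One CFD step with operator splitting: chemistry first, then transport.

    `integrate_chemistry` integrates the stiff ODE system of one isolated
    cell over dt; `advect_diffuse` couples the cells through species and
    energy advection/diffusion. Both are stand-ins for the real solvers.
    """
    # Chemistry: each cell is an independent ODE system over [t, t + dt].
    cells = [integrate_chemistry(cell, dt) for cell in cells]
    # Transport: advection/diffusion couples the cells back together.
    return advect_diffuse(cells, dt)
```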

SLIDE 7

CPU (un-coupled) chemistry integration

Each cell is treated as an isolated system for chemistry.

[Figure: cells integrated one at a time from t to t + ∆t]

SLIDE 8

GPU (batched) chemistry integration

On the GPU we solve chemistry in batches of cells simultaneously.

[Figure: batches of cells integrated together from t to t + ∆t]

SLIDE 9

See also Whitesides & McNenly, GTC 2015; McNenly & Whitesides, GTC 2014

Previously at GTC:

SLIDE 10

n_gpu = 0;

Note: most CFD simulations are done on distributed memory systems

[Figure: eight MPI ranks (rank0–rank7), one CPU core each]

SLIDE 11

++n_gpu; //now what?

Note: most CFD simulations are done on distributed memory systems

[Figure: eight MPI ranks (rank0–rank7), one CPU core each]

SLIDE 12

Here CPU is a single core.

Ideal CPU-GPU Work-sharing

S_GPU = walltime(CPU) / walltime(GPU)

SLIDE 13

Let’s make use of the whole machine.

Ideal CPU-GPU Work-sharing

§ # CPU cores = N_CPU
§ # GPU devices = N_GPU

S_GPU = walltime(CPU) / walltime(GPU)

S_total = (N_CPU + N_GPU (S_GPU − 1)) / N_CPU

[Figure: S_total vs. N_GPU (1–4) for S_GPU = 8 and N_CPU = 4, 8, 16, 32; marked points: TITAN (1.4375) and surface (1.8750)]
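The ideal work-sharing speedup formula above can be checked numerically against the points marked on the plot. A minimal sketch (function name is mine; the reading of the −1 term as the CPU core spent driving each GPU is my interpretation):

```python
def ideal_total_speedup(n_cpu, n_gpu, s_gpu):
    """Ideal whole-machine speedup over n_cpu CPU cores alone.

    S_total = (N_CPU + N_GPU * (S_GPU - 1)) / N_CPU, where s_gpu is the
    speedup of one GPU over one CPU core. The (s_gpu - 1) term reflects
    (my reading) that one core per GPU is occupied driving it.
    """
    return (n_cpu + n_gpu * (s_gpu - 1)) / n_cpu

# Points from the slide's plot, both with S_GPU = 8 and 16 CPU cores:
titan = ideal_total_speedup(16, 1, 8.0)    # 1 GPU per node  -> 1.4375
surface = ideal_total_speedup(16, 2, 8.0)  # 2 GPUs per node -> 1.8750
```

The same formula reproduces the first engine attempt later in the talk: 16 cores, 2 K40m GPUs, and S_GPU = 2.6 give S_total ≈ 1.20.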

SLIDE 14

Distribute based on number of cells and give more to GPU.

Good performance in simple case with both CPU and GPU doing work

[Figure: chemistry time (seconds, log scale, 100–10000) vs. number of processors (1–16); curves: CPU chemistry, GPU chemistry (std work sharing)]

SLIDE 15

Distribute based on number of cells and give more to GPU.

Good performance in simple case with both CPU and GPU doing work

[Figure: chemistry time (seconds, log scale, 100–10000) vs. number of processors (1–16); curves: CPU chemistry, GPU chemistry (std work sharing), GPU chemistry (custom work sharing)]

S_GPU = 7; S_total = 1.7 (S_GPU = 6.6)

SLIDE 16

Let’s go!

First attempt @ engine calculation on GPU+CPU

SLIDE 17

What happened?

First attempt @ engine calculation on GPU+CPU

§ 2x Xeon E5-2670 (16 cores) => 21.2 hours
§ 2x Xeon E5-2670 + 2 Tesla K40m => 17.6 hours (S_GPU = 2.6)
§ S_total = 21.2/17.6 = 1.20

SLIDE 18

Integrator performance when doing batch solution

If the systems are not similar, how much extra work needs to be done?


SLIDE 19

Batches of dissimilar reactors will suffer from excessive extra steps

What penalty do we pay when batching?

SLIDE 20

Batches of dissimilar reactors will suffer from excessive extra steps

What penalty do we pay when batching?

SLIDE 21

Batches of dissimilar reactors will suffer from excessive extra steps

Possibly a lot of extra steps.
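One way to see that penalty is a toy model (mine, not the talk's) in which a lock-step batch must keep stepping until its slowest reactor converges:

```python
def batched_step_count(steps_per_reactor):
    """Steps the whole batch takes: every reactor rides along until the
    slowest one converges (toy model of lock-step batching)."""
    return max(steps_per_reactor)

def wasted_steps(steps_per_reactor):
    """Extra reactor-steps paid versus solving each reactor alone."""
    worst = batched_step_count(steps_per_reactor)
    return sum(worst - s for s in steps_per_reactor)

# A dissimilar batch: one igniting cell forcing 100 steps makes three
# quiescent cells take 100 steps each as well.
# wasted_steps([1, 2, 3, 100]) -> 294 extra reactor-steps
```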

SLIDE 22

Sort reactors by how many steps they took to solve on the last CFD step

Easy as pie?

[Figure: reactors sorted by n_steps (1 to >100) and split into batch0–batch3]
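That sorting step can be sketched as below. This is my own minimal version (names hypothetical); the real code must also handle the distributed-memory bookkeeping the next slide describes.

```python
def sort_into_batches(n_steps_last, n_batches):
    """Group reactor indices by the integrator step counts they needed on
    the last CFD step, so each batch holds reactors of similar stiffness.

    Returns n_batches lists of reactor indices, cheapest reactors first.
    """
    order = sorted(range(len(n_steps_last)), key=lambda i: n_steps_last[i])
    size = -(-len(order) // n_batches)  # ceiling division
    return [order[b * size:(b + 1) * size] for b in range(n_batches)]

# Reactors that took [3, 120, 7, 5, 90, 2] steps last time, 3 batches:
# -> [[5, 0], [3, 2], [4, 1]]  (cheap, medium, expensive)
```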

SLIDE 23

Have to manage the sorting and load-balancing in a distributed memory system

Not so fast.

[Figure: eight MPI ranks (rank0–rank7) exchanging reactors]

SLIDE 24

Load balance based on expected cost and expected performance.

MPI communication to re-balance for chemistry.

[Figure: reactors redistributed across the eight MPI ranks]
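Balancing on expected cost and expected performance can be sketched as a greedy assignment (my own illustration with hypothetical names; the real implementation redistributes the work with MPI communication):

```python
import heapq

def balance_reactors(costs, rank_speeds):
    """Assign reactors (expected costs) to ranks so predicted finish
    times stay even; faster ranks (e.g. GPU-backed) receive more work.

    Greedy heuristic: always hand the next-most-expensive reactor to the
    rank with the lowest predicted finish time so far.
    """
    heap = [(0.0, r) for r in range(len(rank_speeds))]  # (finish time, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in rank_speeds]
    for i in sorted(range(len(costs)), key=lambda i: -costs[i]):
        t, r = heapq.heappop(heap)
        assignment[r].append(i)
        heapq.heappush(heap, (t + costs[i] / rank_speeds[r], r))
    return assignment
```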

SLIDE 25

Let’s go again!

Second attempt @ engine calculation on GPU+CPU

SLIDE 26

How much difference does it make?

Total steps significantly reduced by batching appropriately

SLIDE 27

Engine results with improved work-sharing and reactor sorting

[Figure: runtimes of 13.0 hrs, 9.1 hrs, and 7.6 hrs]

~40% reduction in chemistry time; ~36% reduction in overall time (S_total = 1.7, S_GPU = 6.6)

SLIDE 28

§ Improve S_GPU

  • Derivative kernels
  • Matrix operations

§ Extrapolative integration methods

  • Less “startup” cost when re-initializing
  • Potentially well suited for GPU

§ Non-chemistry calc’s on GPU

  • Multi-species transport
  • Particle spray

Future directions

Possibilities for significant further improvements.

SLIDE 29

§ Much improved CFD chemistry work-sharing with GPU

§ ~40% reduction in chemistry time for real engine case (~36% total time)

§ Working on further improvement

Summary

Thank you!
