HPC Application Porting to CUDA at BSC
Pau Farré, Marc Jordà
GTC 2016 - San Jose
www.bsc.es
Agenda
- WARIS-Transport
○ Atmospheric volcanic ash transport simulation ○ Computer Applications department
- PELE
○ Protein-drug interaction simulation ○ Life Sciences department
WARIS-Transport: Volcanic ash dispersion simulation
Motivation
- VAAC: Volcanic Ash Advisory Centers
○ Monitor volcanic eruptions ○ Help airlines → redirect flights
- Forecast of atmospheric transport and deposition of volcanic ash
○ Meteorological models
Eruptions
- Eyjafjallajökull eruption (Iceland, 2010)
○ 48% of flights in Europe cancelled during a week (107,000 flights) ○ Over €1.3 billion in losses
- Puyehue-Cordón Caulle eruption (Chile, 2011)
○ Multiple flights cancelled in ■ Chile ■ Argentina ■ South Africa ■ Australia
(Figures: ash extension maps, airspace shutdown)
Description
Rectangular Cartesian grid (x, y, z)
Factors controlling atmospheric transport:
- Wind advection
- Turbulent diffusion
- Gravitational settling of particles
General Advection-Diffusion-Reaction equation ⇒ custom Jacobi stencil
(Figure: stencil and input/output layout)
Algorithm
- Finite difference method: iterative process
- Main computation: Advection-Diffusion-Reaction
CUDA Implementation (I)
- 1. Advection-Diffusion-Reaction Kernel (see the stencil sketch below)
○ ~80% of CPU execution time
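The kernel source is not shown in the slides; the following is only a minimal sketch of a 7-point Jacobi-style advection-diffusion update in CUDA to illustrate the stencil pattern. The padded grid layout, the constant wind and diffusion coefficients, and the explicit Euler update are assumptions, not the actual WARIS-Transport kernel (which also handles reaction terms, settling and boundary conditions).

```cpp
// Illustrative sketch only (not the WARIS-Transport kernel): one Jacobi-style
// update of an advection-diffusion field on a padded (nx+2)x(ny+2)x(nz+2) grid.
__global__ void adr_jacobi_step(const float* __restrict__ in, float* __restrict__ out,
                                int nx, int ny, int nz,
                                float dt, float dx,
                                float wx, float wy, float wz,   // wind components (assumed constant)
                                float diff)                      // diffusion coefficient
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // +1: skip the halo layer
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int k = blockIdx.z * blockDim.z + threadIdx.z + 1;
    if (i > nx || j > ny || k > nz) return;

    const int sx = 1, sy = nx + 2, sz = (nx + 2) * (ny + 2);
    const int c  = i * sx + j * sy + k * sz;

    float uc  = in[c];
    float ux0 = in[c - sx], ux1 = in[c + sx];
    float uy0 = in[c - sy], uy1 = in[c + sy];
    float uz0 = in[c - sz], uz1 = in[c + sz];

    // Central-difference advection and 7-point Laplacian diffusion
    float adv = (wx * (ux1 - ux0) + wy * (uy1 - uy0) + wz * (uz1 - uz0)) / (2.0f * dx);
    float lap = (ux0 + ux1 + uy0 + uy1 + uz0 + uz1 - 6.0f * uc) / (dx * dx);

    out[c] = uc + dt * (diff * lap - adv);   // explicit Euler step
}
```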
CUDA Implementation (II)
- 1. Advection-Diffusion-Reaction Kernel
- 2. Compute Terminal Velocity
○ Meteorological computations
CUDA Implementation (III)
- 1. Advection-Diffusion-Reaction Kernel
- 2. Compute Terminal Velocity
- 3. Implement all non-IO computations on the GPU
○ Minimize CPU ⇔ GPU copies
CUDA Implementation (IV)
- 1. Advection-Diffusion-Reaction Kernel
- 2. Compute Terminal Velocity
- 3. Implement all non-IO computations on the GPU
- 4. Different particle sizes are launched in different streams
Kernel Overlap
- Some datasets are too small to fully occupy all SMs with only one kernel
- Parallel kernel execution to fully occupy all SMs (see the streams sketch below)
(Figures: Chile-2011 dataset at 0.25º, grid size 121×121×64, and at 0.05º, grid size 601×601×64)
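The slides show this as profiler timelines; as a rough sketch under assumptions (eight particle-size bins, one concentration field per bin, the adr_jacobi_step kernel sketched earlier, and grid parameters already in scope), the per-bin stream launches would look roughly like this:

```cpp
// Sketch: one CUDA stream per particle-size bin so several small kernels can
// run concurrently and keep all SMs busy. Bin count, field arrays and kernel
// arguments are illustrative assumptions.
const int num_bins = 8;
cudaStream_t streams[num_bins];
for (int b = 0; b < num_bins; ++b)
    cudaStreamCreate(&streams[b]);

dim3 block(32, 4, 4);
dim3 grid((nx + block.x - 1) / block.x,
          (ny + block.y - 1) / block.y,
          (nz + block.z - 1) / block.z);

for (int b = 0; b < num_bins; ++b) {
    // Each bin updates its own concentration field; kernels issued to different
    // streams may overlap when a single grid cannot occupy the whole GPU.
    adr_jacobi_step<<<grid, block, 0, streams[b]>>>(d_in[b], d_out[b],
                                                    nx, ny, nz, dt, dx,
                                                    wx, wy, wz, diff);
}

for (int b = 0; b < num_bins; ++b) {
    cudaStreamSynchronize(streams[b]);
    cudaStreamDestroy(streams[b]);
}
```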
Results
- Chile 2011 dataset, 0.05º resolution
- Implementations compared:
○ MPI + AVX ○ MPI + OpenMP + AVX ○ MIC (MPI + OpenMP + AVX) ○ MPI + CUDA (1 GPU/rank)
- MareNostrum 3 supercomputer
○ 16 cores/node ○ 2x Intel MIC
- GPU server:
○ 4x NVIDIA Tesla K40
- 4 GPUs run as fast as 8 MareNostrum 3 nodes (128 cores)
PELE: Protein Energy Landscape Exploration
Interactive Drug Design with Monte Carlo Simulations
PELE Vision
- Drug design is a costly process
- Design through interactive biomolecular simulations
○ Statistical approach → faster simulations ○ Visual analysis
- Computational power + human intuition
(Figure: PELE-GUI)
PELE: Protein Energy Landscape Exploration
Monte Carlo approach where each trial does (see the sketch below):
- Perturbation
○ Protein shape + ligand position
- Relaxation
○ Further refinement to a more stable position (energy minimization)
- Acceptance test
○ If accepted, used as initial conformation for future trials
(Figures: Perturbation, Relaxation)
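In heavily simplified host-side C++, a trial loop of this kind could be sketched as follows. perturb, minimize, energy and the temperature handling are placeholder stubs added for illustration, not PELE's actual API.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Placeholder types and stubs -- illustration of the trial structure only.
struct Conformation { std::vector<double> coords; };
Conformation perturb(const Conformation& c)  { return c; }  // protein shape + ligand position
Conformation minimize(const Conformation& c) { return c; }  // relaxation (energy minimization)
double energy(const Conformation& c)         { return 0.0; }

void run_trials(Conformation current, int n_trials, double kT)
{
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double e_cur = energy(current);

    for (int t = 0; t < n_trials; ++t) {
        Conformation candidate = minimize(perturb(current));  // perturbation + relaxation
        double e_new = energy(candidate);
        // Metropolis-style acceptance test: accepted conformations seed future trials
        if (e_new <= e_cur || uni(rng) < std::exp(-(e_new - e_cur) / kT)) {
            current = candidate;
            e_cur   = e_new;
        }
    }
}
```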
PELE Demo
PELE Energy Formula
Initial profiling → energy computation was the most time-consuming task
- Execution time cost of energy terms:
○ Bond Energy: 1.27%
○ Angle Energy: 0.93%
○ Dihedral Energy: 2.13%
○ Non-bonding Interactions (Electrostatic, Lennard-Jones, Solvent Energy): 37.58%
○ Update alphas: 27.96%
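The formula itself appears only as an image on the slide; for reference, a generic force-field decomposition of the terms listed above (standard textbook form, not necessarily PELE's exact functional form; the implicit-solvent term is left abstract):

```latex
E_{\text{total}} =
  \sum_{\text{bonds}} k_b (r - r_0)^2
+ \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
+ \sum_{\text{dihedrals}} \frac{V_n}{2}\left[1 + \cos(n\phi - \gamma)\right]
+ \sum_{i<j} \left[ \frac{q_i q_j}{4 \pi \varepsilon_0 r_{ij}}
                  + 4\epsilon_{ij}\!\left(\frac{\sigma_{ij}^{12}}{r_{ij}^{12}}
                                        - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}}\right) \right]
+ E_{\text{solvent}}
```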
CUDA Implementation
Non-bonding Terms (37.58%)
- List of interactions (atom pairs)
○ Several cut-offs to reduce the number of interactions
- CUDA implementation
○ New data structure for the interactions list on the GPU
○ With atomics ■ Profiling showed high overheads (lack of double-precision atomics? high contention due to list order?)
○ Without atomics ■ Main kernel + custom reduction to aggregate results (see the sketch below) ■ ~3x faster than the 1st approach
Update Alphas (27.96%)
- All-to-all atom interactions
- No major issues
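A minimal sketch of that atomic-free pattern (the pair-list layout and the toy Coulomb-only energy are illustrative assumptions, not PELE's data structures): each thread accumulates the energy of the pairs it owns into a private slot, and the partial array is summed afterwards by a separate reduction (for example with CUB, as noted in the conclusions).

```cpp
// Sketch of the atomic-free approach: per-thread partial energies + a later reduction.
// Pair-list layout and the Coulomb-only energy term are illustrative assumptions.
struct AtomPair { int i, j; };

__global__ void pair_energy_kernel(const AtomPair* __restrict__ pairs, int n_pairs,
                                   const double4* __restrict__ atoms,  // x, y, z, charge
                                   double* __restrict__ partial)       // one slot per thread
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    double e = 0.0;
    for (int p = tid; p < n_pairs; p += stride) {
        double4 a = atoms[pairs[p].i];
        double4 b = atoms[pairs[p].j];
        double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        double r  = sqrt(dx * dx + dy * dy + dz * dz);
        e += a.w * b.w / r;            // toy electrostatic term only
    }
    partial[tid] = e;                  // no atomics: each thread writes its own slot
}
```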
CUDA Implementation (II)
- Energy computations are performed multiple times in different parts of PELE
- Data must be kept coherent between CPU and GPU
- High code complexity
○ Porting everything in between would involve a major refactoring
(Figures: PELE call graph, energy computations over time)
CPU/GPU data coherence
Automatic CPU ⇔ GPU copies
- CUDA Unified Virtual Memory (UVM)
- Unified CPU & GPU data structures
○ Allocation pointers can be used both on the CPU and the GPU ○ The CUDA runtime manages the copies internally
- Custom std::allocator for std::vector (see the sketch below)
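The allocator itself is not shown in the slides; a minimal sketch of the idea, assuming CUDA managed memory so that the same pointer is valid on both sides (error handling and alignment details omitted):

```cpp
// Managed-memory allocator sketch: std::vector storage becomes visible to both
// CPU and GPU through the same pointer. Error handling omitted for brevity.
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

template <typename T>
struct ManagedAllocator {
    using value_type = T;
    ManagedAllocator() = default;
    template <typename U> ManagedAllocator(const ManagedAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        cudaMallocManaged(&p, n * sizeof(T));  // usable from host code and kernels
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { cudaFree(p); }
};
template <typename T, typename U>
bool operator==(const ManagedAllocator<T>&, const ManagedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const ManagedAllocator<T>&, const ManagedAllocator<U>&) { return false; }

// A vector whose .data() pointer can be passed directly to a kernel.
using managed_vector = std::vector<double, ManagedAllocator<double>>;
```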
UVM profiling
Explicit CPU ⇔ GPU copies
- Code is harder to follow and maintain
- Complex application:
○ Difficult to track which CPU code uses GPU results ○ Usage may depend on many conditions
- Programmers tend to be conservative
○ Always copy GPU results to host after the kernel ■ If not used, performance cost for no reason
UVM observations
- 4KB copies are not large enough to reach maximum PCIe bandwidth
- Also, some unnecessary copies
○ The runtime has to be conservative because it doesn't always know what is input or output ○ Our use of streams and allocations attached to them was not optimal
Semi-automatic memory manager
UVM style
- It maintains pairs of allocations (CPU & GPU)
- DtoH copies are only performed when data is really needed on the CPU
○ A page-fault handler detects CPU accesses
- Copies the whole allocation at once
○ Better bandwidth
Before launching a kernel
- Call owner_GPU(void* host_ptr, access_type)
○ Access types ■ Read, Write, ReadWrite, FullWrite
○ Returns gpu_ptr
After the kernel launch (usage sketch below)
- Call owner_CPU(...) to notify the memory manager
- As said, copies are done lazily when needed
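owner_GPU, owner_CPU and the four access types come from the slides; the enum name, the exact signatures and the kernel in this usage sketch are assumptions added for illustration.

```cpp
// Usage sketch of the semi-automatic memory manager API described above.
// The declarations below are assumed signatures, not the actual implementation.
#include <cstddef>

enum class AccessType { Read, Write, ReadWrite, FullWrite };

void* owner_GPU(void* host_ptr, AccessType access);  // HtoD copy if needed, returns GPU pointer
void  owner_CPU(void* host_ptr);                      // GPU copy is newest; DtoH happens lazily

__global__ void energy_kernel(double* energies, std::size_t n);  // some GPU computation

void compute_energies(double* host_energies, std::size_t n)
{
    // Before the kernel: FullWrite tells the manager the kernel overwrites the
    // whole allocation, so no HtoD copy is needed.
    double* gpu_energies =
        static_cast<double*>(owner_GPU(host_energies, AccessType::FullWrite));

    energy_kernel<<<(unsigned)((n + 255) / 256), 256>>>(gpu_energies, n);

    // After the kernel: notify the manager. The DtoH copy only happens when a
    // later CPU access to host_energies triggers the page-fault handler.
    owner_CPU(host_energies);
}
```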
Performance comparison
(Chart: UVM vs. semi-automatic memory manager)
- The semi-automatic memory manager has better performance
○ Mainly because of better PCIe bandwidth
Results (I)
(Chart: speedups of 55x, 5.29x and 15.09x)
Results (II)
(Chart: overall speedups of 2.4x and 2x)
- Upper bound: 2.9x (Amdahl's law)
- PELE acceleration is still ongoing:
○ Non-bonding list generation ○ Computations in the perturbation step ○ Etc.
Conclusions
Acceleration of existing applications
- Some parts are accelerated while others are kept on the CPU
- Maintaining data coherence between CPU & GPU is complex
- We showed two examples:
○ WARIS-Transport ■ Simple enough to port most of the computations to the GPU and keep the data there
○ PELE ■ Complex app → use a manager to handle the copies ■ UVM is a great tool to automate the copies ■ We implemented a semi-automatic memory manager to improve performance
Atomics might have a large performance impact
- Store partial results and apply a reduction step after the kernel
- Libraries can help with reductions (see the sketch below)
○ CUB, Modern GPU, etc.
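For the partial-results-plus-reduction pattern used in the PELE port, the aggregation can be delegated to a library; a minimal sketch with CUB's device-wide sum (the array names are illustrative):

```cpp
// Summing per-thread partial energies with cub::DeviceReduce instead of a
// hand-written reduction kernel. Error checking omitted for brevity.
#include <cub/cub.cuh>
#include <cuda_runtime.h>

double reduce_partials(const double* d_partial, int n)
{
    double* d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(double));

    // First call with a null temp buffer only queries the required scratch size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partial, d_sum, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partial, d_sum, n);

    double h_sum = 0.0;
    cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_sum);
    return h_sum;
}
```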