

SLIDE 1

GTC 2016 - San Jose

HPC Application Porting to CUDA at BSC

Pau Farré, Marc Jordà www.bsc.es

SLIDE 2
Agenda

  • WARIS-Transport
    ○ Atmospheric volcanic ash transport simulation
    ○ Computer Applications department
  • PELE
    ○ Protein-drug interaction simulation
    ○ Life Sciences department

SLIDE 3

WARIS-Transport
Volcanic ash dispersion simulation

SLIDE 4

Motivation

  • VAAC: Volcanic Ash Advisory Centers
    ○ Monitoring volcanic eruptions
    ○ Helping airlines → redirecting flights
  • Forecast of atmospheric transport and deposition of volcanic ash
    ○ Meteorological models

SLIDE 5
Eruptions

  • Eyjafjallajökull eruption (Iceland, 2010)
    ○ 48% of flights in Europe cancelled during one week (107,000 flights)
    ○ Over €1.3 billion in losses
  • Puyehue-Cordón Caulle eruption (Chile, 2011)
    ○ Multiple flights cancelled in
      ■ Chile
      ■ Argentina
      ■ South Africa
      ■ Australia

[Figures: ash extension maps, airspace shutdown]

SLIDE 6

Description

  • Rectangular Cartesian grid (x, y, z)
  • Factors controlling atmospheric transport:
    ○ Wind advection
    ○ Turbulent diffusion
    ○ Gravitational settling of particles
  • General Advection-Diffusion-Reaction equation ⇒ custom Jacobi stencil

[Figure: stencil and output grid]
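For reference, the generic advection-diffusion-reaction equation that such a stencil discretizes is shown below; the slides do not give the exact WARIS-Transport formulation (settling term, source model, boundary conditions), so this is only the textbook form:

\[ \frac{\partial C}{\partial t} + \nabla \cdot (\mathbf{u}\,C) = \nabla \cdot (K\,\nabla C) + S \]

where C is the ash concentration, u the transport velocity (wind advection plus gravitational settling), K the turbulent diffusion tensor, and S the source term.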
SLIDE 7
Algorithm

  • Finite difference method: iterative process
  • Main computation
    ○ Advection-Diffusion-Reaction
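The slides do not show the kernel itself, so the following is only a minimal sketch of a Jacobi-style 3D stencil in CUDA; the 7-point shape, the names, and the single coefficient alpha are placeholders, not the actual WARIS-Transport ADR kernel.

    // Minimal 7-point Jacobi-style stencil sketch (illustrative; not the WARIS code).
    __global__ void jacobi_step(const float* __restrict__ in, float* __restrict__ out,
                                int nx, int ny, int nz, float alpha)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
            return;                                   // skip boundary cells

        int idx = (z * ny + y) * nx + x;              // linear index in the (x,y,z) grid
        int sx = 1, sy = nx, sz = nx * ny;            // strides along x, y and z

        // Update from the 6 face neighbours plus the centre point.
        out[idx] = in[idx] + alpha * (in[idx - sx] + in[idx + sx] +
                                      in[idx - sy] + in[idx + sy] +
                                      in[idx - sz] + in[idx + sz] -
                                      6.0f * in[idx]);
    }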

SLIDE 8
CUDA Implementation (I)

  1. Advection-Diffusion-Reaction kernel
     ○ ~80% of CPU execution time

SLIDE 9
CUDA Implementation (II)

  1. Advection-Diffusion-Reaction kernel
  2. Compute terminal velocity
     ○ Meteorological computations

SLIDE 10
CUDA Implementation (III)

  1. Advection-Diffusion-Reaction kernel
  2. Compute terminal velocity
  3. Implement all non-I/O computations on the GPU
     ○ Minimize CPU ⇔ GPU copies

SLIDE 11
CUDA Implementation (IV)

  1. Advection-Diffusion-Reaction kernel
  2. Compute terminal velocity
  3. Implement all non-I/O computations on the GPU
  4. Different particle sizes are launched in different streams (see the sketch after the next slide)

SLIDE 12

Kernel Overlap

12

  • Some datasets are too small to fully occupy all SMs with only one

kernel

  • Parallel kernel execution to fully occupy all SMs

Chile-2011 dataset 0.25º (grid size 121x121x64) Chile-2011 dataset 0.05º (grid size 601x601x64)
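A minimal sketch of the per-particle-size streaming described above, reusing the jacobi_step placeholder kernel from the Algorithm slide; the number of size bins and the buffer layout are assumptions, not the actual WARIS-Transport code.

    // Illustrative only: one stream per particle-size bin so the small kernels
    // can execute concurrently and keep all SMs occupied.
    void launch_all_bins(float** conc, float** conc_next, int nx, int ny, int nz)
    {
        const int NUM_BINS = 8;                              // assumed number of size bins
        dim3 block(32, 4, 4);
        dim3 grid((nx + 31) / 32, (ny + 3) / 4, (nz + 3) / 4);

        cudaStream_t streams[NUM_BINS];
        for (int b = 0; b < NUM_BINS; ++b)
            cudaStreamCreate(&streams[b]);

        for (int b = 0; b < NUM_BINS; ++b)                   // kernels in different streams may overlap
            jacobi_step<<<grid, block, 0, streams[b]>>>(conc[b], conc_next[b], nx, ny, nz, 0.1f);

        cudaDeviceSynchronize();                             // wait for all bins to finish
        for (int b = 0; b < NUM_BINS; ++b)
            cudaStreamDestroy(streams[b]);
    }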

SLIDE 13

Results

4 GPUs run as fast as 8 MareNostrum 3 nodes (128 cores)

Implementations compared:
  • MPI + AVX
  • MPI + OpenMP + AVX
  • MIC (MPI + OpenMP + AVX)
  • MPI + CUDA (1 GPU/rank)

Setup:
  • Chile 2011 dataset, 0.05º
  • MareNostrum supercomputer
    ○ 16 cores/node
    ○ 2x Intel MIC
  • GPU server:
    ○ 4x NVIDIA Tesla K40

SLIDE 14

PELE: Protein Energy Landscape Exploration

Interactive Drug Design with Monte Carlo Simulations

SLIDE 15

PELE Vision

  • Drug design is a costly process
  • Design through interactive biomolecular simulations
    ○ Statistical approach → faster simulations
    ○ Visual analysis
  • Computational power + human intuition

[Figure: PELE-GUI]

SLIDE 16

PELE: Protein Energy Landscape Exploration

Monte Carlo approach where each trial does:
  • Perturbation
    ○ Protein shape + ligand position
  • Relaxation
    ○ Further refinement to a more stable position (energy minimization)
  • Acceptance test
    ○ If accepted, used as initial conformation for future trials

[Figure: Perturbation and Relaxation steps]
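For readers who prefer code, a hypothetical host-side sketch of the trial structure above follows; the types, the placeholder energies, and the Metropolis-style acceptance criterion are assumptions made for illustration, not the actual PELE source.

    #include <cmath>
    #include <cstdlib>

    struct Conformation { double x = 0.0; };                                // stand-in for protein shape + ligand position
    static Conformation perturb(Conformation c) { c.x += 0.1; return c; }   // perturbation (placeholder)
    static Conformation relax(Conformation c)   { return c; }               // energy minimization (placeholder)
    static double energy(const Conformation& c) { return c.x * c.x; }       // total energy (placeholder)

    void run_trials(Conformation current, int num_trials, double kT)
    {
        for (int t = 0; t < num_trials; ++t) {
            Conformation candidate = relax(perturb(current));     // perturbation + relaxation
            double dE = energy(candidate) - energy(current);
            // Acceptance test (Metropolis-style criterion assumed here)
            if (dE <= 0.0 || std::exp(-dE / kT) > (double)std::rand() / RAND_MAX)
                current = candidate;                              // initial conformation for future trials
        }
    }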

SLIDE 17

PELE Demo


SLIDE 18
PELE Energy Formula

Initial profiling → energy computation was the most time-consuming task

  • Execution time cost of energy terms:
    ○ Bond energy: 1.27%
    ○ Angle energy: 0.93%
    ○ Dihedral energy: 2.13%
    ○ Non-bonding interactions (electrostatic, Lennard-Jones, solvent energy): 37.58%
    ○ Update alphas: 27.96%


SLIDE 20

CUDA Implementation

Non-bonding Terms (37.58%)
  • List of interactions (atom pairs)
    ○ Several cut-offs to reduce the number of interactions
  • CUDA implementation
    ○ New data structure for the interactions list on the GPU
    ○ With atomics
      ■ Profiling showed high overheads
        • Lack of double-precision atomics?
        • High contention due to list order?
    ○ Without atomics
      ■ Main kernel + custom reduction to aggregate results
      ■ ~3x faster than the first approach (see the sketch after this slide)

Update Alphas (27.96%)
  • All-to-all atom interactions
  • No major issues
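The following is a schematic of the atomics-free variant mentioned above: every thread writes its partial energy, and a separate reduction aggregates the partials after the kernel. The pair-energy function and data layout are placeholders, not the actual PELE kernels.

    #include <cuda_runtime.h>

    // Stand-in for the electrostatic / Lennard-Jones / solvent pair terms (illustrative).
    __device__ double energy_of_pair(float4 a, float4 b)
    {
        float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        double r2 = (double)(dx * dx + dy * dy + dz * dz);
        return 1.0 / sqrt(r2 + 1e-12);                        // placeholder pair energy
    }

    // Atomics-free pattern: each thread stores its partial result; a reduction
    // (custom kernel or a library call) sums the partials after this kernel.
    __global__ void nonbonding_energy(const int2* pairs, const float4* atoms,
                                      int n_pairs, double* partial)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_pairs) return;
        int2 p = pairs[i];
        partial[i] = energy_of_pair(atoms[p.x], atoms[p.y]);  // no atomicAdd on a shared accumulator
    }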

SLIDE 21
CUDA Implementation (II)

  • Energy computations are performed multiple times in different parts of PELE
  • Data must be kept coherent between CPU and GPU
  • High code complexity
    ○ Porting everything in between involves a major refactoring

[Figures: PELE call graph; energy computations over time]

SLIDE 22

CPU/GPU data coherence

Explicit CPU ⇔ GPU copies
  • Code is harder to follow and maintain
  • Complex application:
    ○ Difficult to track which CPU code uses GPU results
    ○ Usage may depend on many conditions
  • Programmers tend to be conservative
    ○ Always copy GPU results to the host after the kernel
      ■ If not used, performance cost for no reason

Automatic CPU ⇔ GPU copies
  • CUDA Unified Virtual Memory (UVM)
  • Unified CPU & GPU data structures
    ○ Allocation pointers can be used both on the CPU and the GPU
    ○ The CUDA runtime manages the copies internally
  • Custom std::allocator for std::vectors (see the sketch after this slide)
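A minimal sketch of such a custom allocator, assuming managed memory via cudaMallocManaged; this is the generic pattern, not necessarily the exact PELE implementation.

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <vector>

    // Vectors built with this allocator live in CUDA managed memory, so the same
    // pointer is valid on the CPU and in kernels; the runtime migrates pages on demand.
    template <typename T>
    struct managed_allocator {
        using value_type = T;
        managed_allocator() = default;
        template <typename U> managed_allocator(const managed_allocator<U>&) {}

        T* allocate(std::size_t n) {
            void* p = nullptr;
            cudaMallocManaged(&p, n * sizeof(T));        // unified CPU/GPU allocation
            return static_cast<T*>(p);
        }
        void deallocate(T* p, std::size_t) { cudaFree(p); }
    };
    template <typename T, typename U>
    bool operator==(const managed_allocator<T>&, const managed_allocator<U>&) { return true; }
    template <typename T, typename U>
    bool operator!=(const managed_allocator<T>&, const managed_allocator<U>&) { return false; }

    template <typename T>
    using managed_vector = std::vector<T, managed_allocator<T>>;   // drop-in replacement for std::vector<T>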

SLIDE 23
UVM profiling

  • 4KB copies are not large enough to get maximum PCIe bandwidth
  • Also, some unnecessary copies
    ○ The runtime has to be conservative because it doesn’t always know what is input or output
    ○ Our use of streams and allocations attached to them was not optimal
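For reference, stream attachment of managed allocations is done with cudaStreamAttachMemAsync; the snippet below is a generic example of the mechanism, not the PELE code.

    #include <cuda_runtime.h>
    #include <cstddef>

    // Attach a managed allocation to a single stream so the runtime only has to
    // consider work in that stream when deciding whether the GPU may touch the data.
    float* alloc_for_stream(std::size_t n, cudaStream_t stream)
    {
        float* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float), cudaMemAttachHost);
        cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
        cudaStreamSynchronize(stream);                // attachment takes effect after a sync
        return data;
    }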

SLIDE 24

Semi-automatic memory manager

UVM style
  • It maintains pairs of allocations (CPU & GPU)
  • DtoH copies are only performed when data is really needed on the CPU
    ○ A page-fault handler detects CPU accesses
  • Copies the whole allocation at once
    ○ Better bandwidth

Before launching a kernel
  • Call owner_GPU(void* host_ptr, access_type)
    ○ Access types
      ■ Read, Write, ReadWrite, FullWrite
    ○ Returns gpu_ptr

After the kernel launch
  • Call owner_CPU(...) to notify the memory manager
  • As said, copies are done lazily when needed
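A hypothetical usage sketch of the manager around one kernel launch; the call names and access types come from the slide, but the exact signatures, casts, and the kernel shown here are assumptions.

    // Hypothetical usage (signatures assumed): ensure inputs/outputs are on the GPU,
    // launch the kernel, then hand ownership back to the CPU lazily.
    double* d_energies = (double*) owner_GPU(h_energies, FullWrite);  // kernel overwrites it: no HtoD copy
    float*  d_coords   = (float*)  owner_GPU(h_coords,   Read);       // HtoD copy only if GPU copy is stale

    energy_kernel<<<grid, block>>>(d_coords, d_energies, n_atoms);

    owner_CPU(h_energies);   // DtoH copy deferred until host code actually reads h_energies
                             // (detected by the page-fault handler)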

SLIDE 25

Performance comparison

[Figure: UVM vs. semi-automatic memory manager]

  • The semi-automatic memory manager has better performance
    ○ Mainly because of better PCIe bandwidth

SLIDE 26

Results (I)

[Figure: speedups of 55x, 5.29x and 15.09x]

SLIDE 27

Results (II)

[Figure: speedups of 2.4x and 2x; upper bound 2.9x (Amdahl’s law)]

PELE acceleration is still ongoing:
  • Non-bonding list generation
  • Computations in the perturbation step
  • Etc.
SLIDE 28

Conclusions

SLIDE 29

Conclusions

Acceleration of existing applications
  • Some parts are accelerated while others are kept on the CPU
  • Maintaining data coherence between CPU & GPU is complex
  • We showed two examples:
    ○ WARIS-Transport
      ■ Simple enough to port most of the computations to the GPU and keep data there
    ○ PELE
      ■ Complex app → use a manager to handle the copies
      ■ UVM is a great tool to automate the copies
      ■ We implemented a semi-automatic memory manager to improve performance

Atomics might have a large performance impact
  • Store partial results and apply a reduction step after the kernel
  • Libraries can help with reductions (see the sketch below)
    ○ CUB, Modern GPU, etc.
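As a concrete example of the "partial results + library reduction" advice, a sketch using CUB's DeviceReduce follows; buffer names are illustrative.

    #include <cub/cub.cuh>

    // Sum per-thread partial results (e.g. per-interaction energies) into one total.
    void sum_partials(const double* d_partials, double* d_total, int n)
    {
        void*  d_temp     = nullptr;
        size_t temp_bytes = 0;
        cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partials, d_total, n);  // query temp storage size
        cudaMalloc(&d_temp, temp_bytes);
        cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partials, d_total, n);  // perform the reduction
        cudaFree(d_temp);
    }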

SLIDE 30

Thank you!

For further information please contact pau.farre@bsc.es marc.jorda@bsc.es

www.bsc.es