HPC Application Porting to CUDA at BSC
Pau Farré, Marc Jordà
GTC 2016 - San Jose
www.bsc.es
Agenda
- WARIS-Transport
○ Atmospheric volcanic ash transport simulation ○ Computer Applications department
- PELE
○ Protein-drug interaction simulation ○ Life Sciences department
WARIS-Transport: Volcanic ash dispersion simulation
Motivation
- VAAC: Volcanic Ash Advisory Centers
○ Monitor volcanic eruptions ○ Help airlines → redirect flights
- Forecast of atmospheric transport and deposition of volcanic ash
○ Meteorological models
Eruptions
- Eyjafjallajökull eruption (Iceland, 2010)
○ 48% of flights in Europe cancelled during a week (107,000 flights) ○ Over €1.3 billion in losses
- Puyehue-Cordón Caulle eruption (Chile, 2011)
○ Multiple flights cancelled in ■ Chile ■ Argentina ■ South Africa ■ Australia
(Figures: ash extension maps, airspace shutdown)
Description
Rectangular Cartesian grid (x, y, z)
Factors controlling atmospheric transport:
- Wind advection
- Turbulent diffusion
- Gravitational settling of particles
General Advection-Diffusion-Reaction equation ⇒ custom Jacobi stencil
(Figure: stencil and input/output layout)
Algorithm
- Finite difference method: iterative process
- Main computation: Advection-Diffusion-Reaction
CUDA Implementation (I)
- 1. Advection-Diffusion-Reaction Kernel (see the stencil sketch below)
○ ~80% of CPU execution time
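The kernel source is not shown in the slides; the following is only a minimal sketch of a 7-point Jacobi-style advection-diffusion update in CUDA to illustrate the stencil pattern. The padded grid layout, the constant wind and diffusion coefficients, and the explicit Euler update are assumptions, not the actual WARIS-Transport kernel (which also handles reaction terms, settling and boundary conditions).

```cpp
// Illustrative sketch only (not the WARIS-Transport kernel): one Jacobi-style
// update of an advection-diffusion field on a padded (nx+2)x(ny+2)x(nz+2) grid.
__global__ void adr_jacobi_step(const float* __restrict__ in, float* __restrict__ out,
                                int nx, int ny, int nz,
                                float dt, float dx,
                                float wx, float wy, float wz,   // wind components (assumed constant)
                                float diff)                      // diffusion coefficient
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // +1: skip the halo layer
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int k = blockIdx.z * blockDim.z + threadIdx.z + 1;
    if (i > nx || j > ny || k > nz) return;

    const int sx = 1, sy = nx + 2, sz = (nx + 2) * (ny + 2);
    const int c  = i * sx + j * sy + k * sz;

    float uc  = in[c];
    float ux0 = in[c - sx], ux1 = in[c + sx];
    float uy0 = in[c - sy], uy1 = in[c + sy];
    float uz0 = in[c - sz], uz1 = in[c + sz];

    // Central-difference advection and 7-point Laplacian diffusion
    float adv = (wx * (ux1 - ux0) + wy * (uy1 - uy0) + wz * (uz1 - uz0)) / (2.0f * dx);
    float lap = (ux0 + ux1 + uy0 + uy1 + uz0 + uz1 - 6.0f * uc) / (dx * dx);

    out[c] = uc + dt * (diff * lap - adv);   // explicit Euler step
}
```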
CUDA Implementation (II)
- 1. Advection-Diffusion-Reaction Kernel
- 2. Compute Terminal Velocity
○ Meteorological computations
CUDA Implementation (III)
- 1. Advection-Diffusion-Reaction Kernel
- 2. Compute Terminal Velocity
- 3. Implement all non-IO computations on the GPU
○ Minimize CPU ⇔ GPU copies
CUDA Implementation (IV)
- 1. Advection-Diffusion-Reaction Kernel
- 2. Compute Terminal Velocity
- 3. Implement all non-IO computations on the GPU
- 4. Different particle sizes are launched in different streams
Kernel Overlap
- Some datasets are too small to fully occupy all SMs with only one kernel
- Parallel kernel execution to fully occupy all SMs (see the streams sketch below)
(Figures: Chile-2011 dataset at 0.25º, grid size 121×121×64, and at 0.05º, grid size 601×601×64)
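The slides show this as profiler timelines; as a rough sketch under assumptions (eight particle-size bins, one concentration field per bin, the adr_jacobi_step kernel sketched earlier, and grid parameters already in scope), the per-bin stream launches would look roughly like this:

```cpp
// Sketch: one CUDA stream per particle-size bin so several small kernels can
// run concurrently and keep all SMs busy. Bin count, field arrays and kernel
// arguments are illustrative assumptions.
const int num_bins = 8;
cudaStream_t streams[num_bins];
for (int b = 0; b < num_bins; ++b)
    cudaStreamCreate(&streams[b]);

dim3 block(32, 4, 4);
dim3 grid((nx + block.x - 1) / block.x,
          (ny + block.y - 1) / block.y,
          (nz + block.z - 1) / block.z);

for (int b = 0; b < num_bins; ++b) {
    // Each bin updates its own concentration field; kernels issued to different
    // streams may overlap when a single grid cannot occupy the whole GPU.
    adr_jacobi_step<<<grid, block, 0, streams[b]>>>(d_in[b], d_out[b],
                                                    nx, ny, nz, dt, dx,
                                                    wx, wy, wz, diff);
}

for (int b = 0; b < num_bins; ++b) {
    cudaStreamSynchronize(streams[b]);
    cudaStreamDestroy(streams[b]);
}
```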
Results
- Chile 2011 dataset, 0.05º resolution
- Implementations compared:
○ MPI + AVX ○ MPI + OpenMP + AVX ○ MIC (MPI + OpenMP + AVX) ○ MPI + CUDA (1 GPU/rank)
- MareNostrum 3 supercomputer
○ 16 cores/node ○ 2x Intel MIC
- GPU server:
○ 4x NVIDIA Tesla K40
- 4 GPUs run as fast as 8 MareNostrum 3 nodes (128 cores)
PELE: Protein Energy Landscape Exploration
Interactive Drug Design with Monte Carlo Simulations
PELE Vision
- Drug design is a costly process
- Design through interactive biomolecular simulations
○ Statistical approach → faster simulations ○ Visual analysis
- Computational power + human intuition
(Figure: PELE-GUI)
PELE: Protein Energy Landscape Exploration
Monte Carlo approach where each trial does (see the sketch below):
- Perturbation
○ Protein shape + ligand position
- Relaxation
○ Further refinement to a more stable position (energy minimization)
- Acceptance test
○ If accepted, used as initial conformation for future trials
(Figures: Perturbation, Relaxation)
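In heavily simplified host-side C++, a trial loop of this kind could be sketched as follows. perturb, minimize, energy and the temperature handling are placeholder stubs added for illustration, not PELE's actual API.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Placeholder types and stubs -- illustration of the trial structure only.
struct Conformation { std::vector<double> coords; };
Conformation perturb(const Conformation& c)  { return c; }  // protein shape + ligand position
Conformation minimize(const Conformation& c) { return c; }  // relaxation (energy minimization)
double energy(const Conformation& c)         { return 0.0; }

void run_trials(Conformation current, int n_trials, double kT)
{
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double e_cur = energy(current);

    for (int t = 0; t < n_trials; ++t) {
        Conformation candidate = minimize(perturb(current));  // perturbation + relaxation
        double e_new = energy(candidate);
        // Metropolis-style acceptance test: accepted conformations seed future trials
        if (e_new <= e_cur || uni(rng) < std::exp(-(e_new - e_cur) / kT)) {
            current = candidate;
            e_cur   = e_new;
        }
    }
}
```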
PELE Demo
PELE Energy Formula
Initial profiling → energy computation was the most time-consuming task
- Execution time cost of energy terms:
○ Bond Energy: 1.27%
○ Angle Energy: 0.93%
○ Dihedral Energy: 2.13%
○ Non-bonding Interactions (Electrostatic, Lennard-Jones, Solvent Energy): 37.58%
○ Update alphas: 27.96%
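The formula itself appears only as an image on the slide; for reference, a generic force-field decomposition of the terms listed above (standard textbook form, not necessarily PELE's exact functional form; the implicit-solvent term is left abstract):

```latex
E_{\text{total}} =
  \sum_{\text{bonds}} k_b (r - r_0)^2
+ \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
+ \sum_{\text{dihedrals}} \frac{V_n}{2}\left[1 + \cos(n\phi - \gamma)\right]
+ \sum_{i<j} \left[ \frac{q_i q_j}{4 \pi \varepsilon_0 r_{ij}}
                  + 4\epsilon_{ij}\!\left(\frac{\sigma_{ij}^{12}}{r_{ij}^{12}}
                                        - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}}\right) \right]
+ E_{\text{solvent}}
```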
CUDA Implementation
Non-bonding Terms (37.58%)
- List of interactions (atom pairs)
○ Several cut-offs to reduce the number of interactions
- CUDA implementation
○ New data structure for the interactions list on the GPU
○ With atomics ■ Profiling showed high overheads (lack of double-precision atomics? high contention due to list order?)
○ Without atomics ■ Main kernel + custom reduction to aggregate results (see the sketch below) ■ ~3x faster than the 1st approach
Update Alphas (27.96%)
- All-to-all atom interactions
- No major issues
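A minimal sketch of that atomic-free pattern (the pair-list layout and the toy Coulomb-only energy are illustrative assumptions, not PELE's data structures): each thread accumulates the energy of the pairs it owns into a private slot, and the partial array is summed afterwards by a separate reduction (for example with CUB, as noted in the conclusions).

```cpp
// Sketch of the atomic-free approach: per-thread partial energies + a later reduction.
// Pair-list layout and the Coulomb-only energy term are illustrative assumptions.
struct AtomPair { int i, j; };

__global__ void pair_energy_kernel(const AtomPair* __restrict__ pairs, int n_pairs,
                                   const double4* __restrict__ atoms,  // x, y, z, charge
                                   double* __restrict__ partial)       // one slot per thread
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    double e = 0.0;
    for (int p = tid; p < n_pairs; p += stride) {
        double4 a = atoms[pairs[p].i];
        double4 b = atoms[pairs[p].j];
        double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        double r  = sqrt(dx * dx + dy * dy + dz * dz);
        e += a.w * b.w / r;            // toy electrostatic term only
    }
    partial[tid] = e;                  // no atomics: each thread writes its own slot
}
```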
CUDA Implementation (II)
- Energy computations are performed multiple times in different parts of PELE
- Data must be kept coherent between CPU and GPU
- High code complexity
○ Porting everything in between would involve a major refactoring
(Figures: PELE call graph, energy computations over time)
CPU/GPU data coherence
Automatic CPU ⇔ GPU copies
- CUDA Unified Virtual Memory (UVM)
- Unified CPU & GPU data structures
○ Allocation pointers can be used both on the CPU and the GPU ○ The CUDA runtime manages the copies internally
- Custom std::allocator for std::vector (see the sketch below)
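The allocator itself is not shown in the slides; a minimal sketch of the idea, assuming CUDA managed memory so that the same pointer is valid on both sides (error handling and alignment details omitted):

```cpp
// Managed-memory allocator sketch: std::vector storage becomes visible to both
// CPU and GPU through the same pointer. Error handling omitted for brevity.
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

template <typename T>
struct ManagedAllocator {
    using value_type = T;
    ManagedAllocator() = default;
    template <typename U> ManagedAllocator(const ManagedAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        cudaMallocManaged(&p, n * sizeof(T));  // usable from host code and kernels
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { cudaFree(p); }
};
template <typename T, typename U>
bool operator==(const ManagedAllocator<T>&, const ManagedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const ManagedAllocator<T>&, const ManagedAllocator<U>&) { return false; }

// A vector whose .data() pointer can be passed directly to a kernel.
using managed_vector = std::vector<double, ManagedAllocator<double>>;
```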
UVM profiling
Explicit CPU ⇔ GPU copies
- Code is harder to follow and maintain
- Complex application:
○ Difficult to track which CPU code uses GPU results ○ Usage may depend on many conditions
- Programmers tend to be conservative
○ Always copy GPU results to host after the kernel ■ If not used, performance cost for no reason
UVM observations
- 4KB copies are not large enough to reach maximum PCIe bandwidth
- Also, some unnecessary copies
○ The runtime has to be conservative because it doesn't always know what is input or output ○ Our use of streams and allocations attached to them was not optimal
Semi-automatic memory manager
UVM style
- It maintains pairs of allocations (CPU & GPU)
- DtoH copies are only performed when data is really needed on the CPU
○ A page-fault handler detects CPU accesses
- Copies the whole allocation at once
○ Better bandwidth
Before launching a kernel
- Call owner_GPU(void* host_ptr, access_type)
○ Access types ■ Read, Write, ReadWrite, FullWrite
○ Returns gpu_ptr
After the kernel launch (usage sketch below)
- Call owner_CPU(...) to notify the memory manager
- As said, copies are done lazily when needed
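owner_GPU, owner_CPU and the four access types come from the slides; the enum name, the exact signatures and the kernel in this usage sketch are assumptions added for illustration.

```cpp
// Usage sketch of the semi-automatic memory manager API described above.
// The declarations below are assumed signatures, not the actual implementation.
#include <cstddef>

enum class AccessType { Read, Write, ReadWrite, FullWrite };

void* owner_GPU(void* host_ptr, AccessType access);  // HtoD copy if needed, returns GPU pointer
void  owner_CPU(void* host_ptr);                      // GPU copy is newest; DtoH happens lazily

__global__ void energy_kernel(double* energies, std::size_t n);  // some GPU computation

void compute_energies(double* host_energies, std::size_t n)
{
    // Before the kernel: FullWrite tells the manager the kernel overwrites the
    // whole allocation, so no HtoD copy is needed.
    double* gpu_energies =
        static_cast<double*>(owner_GPU(host_energies, AccessType::FullWrite));

    energy_kernel<<<(unsigned)((n + 255) / 256), 256>>>(gpu_energies, n);

    // After the kernel: notify the manager. The DtoH copy only happens when a
    // later CPU access to host_energies triggers the page-fault handler.
    owner_CPU(host_energies);
}
```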
Performance comparison
(Chart: UVM vs. semi-automatic memory manager)
- The semi-automatic memory manager has better performance
○ Mainly because of better PCIe bandwidth
Results (I)
(Chart: speedups of 55x, 5.29x and 15.09x)
Results (II)
(Chart: overall speedups of 2.4x and 2x)
- Upper bound: 2.9x (Amdahl's law)
- PELE acceleration is still ongoing:
○ Non-bonding list generation ○ Computations in the perturbation step ○ Etc.
Conclusions
Acceleration of existing applications
- Some parts are accelerated while others are kept on the CPU
- Maintaining data coherence between CPU & GPU is complex
- We showed two examples:
○ WARIS-Transport ■ Simple enough to port most of the computations to the GPU and keep the data there
○ PELE ■ Complex app → use a manager to handle the copies ■ UVM is a great tool to automate the copies ■ We implemented a semi-automatic memory manager to improve performance
Atomics might have a large performance impact
- Store partial results and apply a reduction step after the kernel
- Libraries can help with reductions (see the sketch below)
○ CUB, Modern GPU, etc.
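For the partial-results-plus-reduction pattern used in the PELE port, the aggregation can be delegated to a library; a minimal sketch with CUB's device-wide sum (the array names are illustrative):

```cpp
// Summing per-thread partial energies with cub::DeviceReduce instead of a
// hand-written reduction kernel. Error checking omitted for brevity.
#include <cub/cub.cuh>
#include <cuda_runtime.h>

double reduce_partials(const double* d_partial, int n)
{
    double* d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(double));

    // First call with a null temp buffer only queries the required scratch size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partial, d_sum, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partial, d_sum, n);

    double h_sum = 0.0;
    cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_sum);
    return h_sum;
}
```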