ACCELERATING SANJEEVINI: A DRUG DISCOVERY SOFTWARE SUITE
Abhilash Jayaraj, IIT Delhi; Bharatkumar Sharma, Nvidia; Shashank Shekhar, IIT Delhi; Nagavijayalakshmi, Nvidia
AGENDA
What to expect and what not to
Quick Introduction to Computer Aided
2
3
Target Discovery -> Lead Generation -> Lead Optimization -> Preclinical Development -> Phase I, II & III Clinical Trials -> FDA Review & Approval -> Drug to the Market
~14 yrs and ~$1.4 billion end to end
[Chart: per-stage duration (2.5 / 3.0 / 1.0 / 6.0 / 1.5 yrs) and cost share (4% / 15% / 10% / 68% / 3%)]
4
Inputs: self-drawn ligand molecule; protein-ligand complex / protein / DNA sequence; NRDBSM / million-molecule library / natural products database
▪ Check Lipinski compliance
▪ Generate rapid binding energy estimates by the RASPD protocol
▪ Predict all possible binding sites and store the top ten sites
▪ Dock and score
▪ Optimize geometry / assign TPACM4 or derive quantum mechanical charges
▪ Assign force field parameters
▪ Perform molecular dynamics simulations and post facto free energy component analyses
▪ (Optional) Generate canonical A/B DNA or MD-averaged structure of B DNA
5
▪ OpenACC acceleration of the ParDOCK module
▪ All-atom, energy-based Monte Carlo docking for protein-ligand complexes
6
Analyze Parallelize Optimize
7
Analyze Parallelize Optimize
8
Flat profile:
% time   cumulative s   self s    calls         self s/call   total s/call   name
 69.78   557.90         557.90    1188000       0.00          0.00           PDB::EnergyCalculator()
 12.92   661.19         103.29    8             12.91         20.26          PDB::clashCombination()
  7.35   719.96         58.77     26051422500   0.00          0.00           getRadius1()
  5.49   763.85         43.89     885075        0.00          0.00           PDB::energyAtom()
9
Analyze Parallelize Optimize
10
double PDB::EnergyCalculator(float **&energyGrid, const vector<points> &vDrugGrid,
                             points coords[], const unsigned &totalDockAtoms, … ){
    for( int atomcount = 0; atomcount < totalDockAtoms; atomcount++ ){
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            // minDis = minimum of 'distance', minCounter = counter corresponding to minDis
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}
11
Simple | Powerful | Portable
Fueling the Next Wave of Scientific Discoveries in HPC
University of Illinois: PowerGrid (MRI Reconstruction), 70x speed-up with 2 days of effort
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
RIKEN Japan: NICAM (Climate Modeling), 7-8x speed-up with 5% of code modified
main() {
    <serial code>
    #pragma acc kernels
    {
        // automatically runs on GPU
        <parallel code>
    }
}
12
Manage Data Movement | Initiate Parallel Execution | Optimize Loop Mappings

#pragma acc data copyin(x,y) copyout(z)
{
    ...
    #pragma acc parallel
    {
        #pragma acc loop gang vector
        for (i = 0; i < n; ++i) {
            z[i] = x[i] + y[i];
            ...
        }
    }
    ...
}
13
double PDB::EnergyCalculator(float **&energyGrid, const vector<points> &vDrugGrid,
                             points coords[], const unsigned &totalDockAtoms, … ){
    #pragma acc parallel loop reduction(+:ene) private(minDis,minCounter) present() copyin() firstprivate()
    for( int atomcount = 0; atomcount < totalDockAtoms; atomcount++ ){
        #pragma acc loop reduction(min:minDis)
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            minDis = (minDis > distance) ? distance : minDis;
        }
14
        #pragma acc loop reduction(min:minCounter)
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            if ( distance == minDis ){
                minCounter = (minCounter > counter) ? counter : minCounter;
            }
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}
15
const points *vDrugGridData = vDrugGrid.data();
// compute 'distance' between coords[atomcount] and vDrugGridData[counter]
▪ Use ‘raw data pointer’ to access vectors
16
unsigned totDockAtoms = totalDockAtoms;
float **eneGrid = EnergyGrid;
#pragma acc parallel loop reduction(+:ene) … copyin(coords[0:totDockAtoms]) present(eneGrid)
ene += eneGrid[minCounter][atomcount];
▪ Use ‘raw data pointer’ to access vectors
▪ Avoid using C++ references in OpenACC pragmas
17
unsigned totDockAtoms = totalDockAtoms;
float **eneGrid = EnergyGrid;
#pragma acc parallel loop reduction(+:ene) … copyin(coords[0:totDockAtoms]) present(eneGrid)
ene += eneGrid[minCounter][atomcount];
▪ Use ‘raw data pointer’ to access vectors
▪ Avoid using C++ references in OpenACC pragmas
PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:])
         Generating copyin(coords[:totalDockAtoms->])
         Generating present(EnergyGrid[:][:][:])
Runtime memory access violation
18
[Diagram: OpenACC execution model: gangs of workers, each worker executing vector lanes]
19
PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:],eneGrid[:][:])
         Generating copyin(coords[:totDockAtoms])
     22, Accelerator kernel generated
         Generating Tesla code
         22, Generating reduction(+:ene)
         24, #pragma acc loop gang /* blockIdx.x */
         31, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minDis)
         45, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minIdx)
     31, Loop is parallelizable
     45, Loop is parallelizable
20
Currently, data is moved at the beginning and end of each function, in case it is needed on the CPU. We know the data is only needed on the CPU after convergence. We should tell the compiler when data movement is really needed, to improve performance.
21
The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region.
#pragma acc data
{
    #pragma acc parallel loop
    ...
    #pragma acc parallel loop
    ...
}
Data Region
Arrays used within the data region will remain on the GPU until the end of the data region.
22
Used to define data regions when scoping doesn't allow the use of normal data regions (e.g. the constructor/destructor of a class).
enter data: defines the start of an unstructured data lifetime
exit data: defines the end of an unstructured data lifetime

#pragma acc enter data copyin(a)
...
#pragma acc exit data delete(a)
23
int **vProteinListData = new int *[vProteinList.size()];
n = vProteinList.size();
#pragma acc enter data create(vProteinListData[0:n][0:1])
for( int count = 0; count < n; count++ ){
    int numPro = vProteinList[count].size();
    vProteinListData[count] = vProteinList[count].data();
    #pragma acc enter data copyin(vProteinListData[count:1][0:numPro])
}
▪ Use ‘raw data pointer’ to access vectors
▪ How will you access a ‘vector of vectors’ (jagged arrays)? Creation and copy
24
for( int count = 0; count < n; count++ ){
    int numPro = vProteinList[count].size();
    #pragma acc exit data delete(vProteinListData[count:1][0:numPro])
    vProteinListData[count] = NULL;
}
#pragma acc exit data delete(vProteinListData[0:n][0:1])
▪ Use ‘raw data pointer’ to access vectors
▪ How will you access a ‘vector of vectors’ (jagged arrays)? Deletion of jagged arrays
25
PDB::energyAtom(const std::vector<PDB, std::allocator<PDB>> &, PDB, points, const std::vector<Box, std::allocator<Box>> &, const std::vector<int, std::allocator<int>> &, const std::vector<std::vector<int, std::allocator<int>>, std::allocator<std::vector<int, std::allocator<int>>>> &, int **):
     79, Generating enter data copyin(boxListData[:boxListNumElements],rec,coord)
     85, Generating present(coord,boxListData[:],rec,vProteinListData[:][:],vProData[:])
         Accelerator kernel generated
         Generating Tesla code
         85, Generating reduction(+:electro,vandw,ehyd)
         87, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    129, Generating exit data delete(boxListData[:boxListNumElements],rec,coord)
26
main:
    266, Generating enter data copyin(vProData[:vProNumElements])
         Generating enter data create(vProteinListData[:vProteinListNumElements][:1])
    275, Generating enter data copyin(vProteinListData[proList][:numElements])
    321, Generating exit data delete(vProteinListData[proList][:numElements])
    322, Generating exit data delete(vProteinListData[:vProteinListNumElements][:1],vProData[:vProNumElements])
27
[Diagram: Without Unified Memory, System Memory and GPU Memory are separate; with Unified Memory they appear as a single address space]
Sometimes referred to as “managed memory.” New “Pascal” GPUs handle Unified Memory in hardware.
28
Analyze Parallelize Optimize
30
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access
CPU+GPU is 5.8x/3.3x faster than CPU-only at 8 MPI procs (ROTATE=1000/100)
16 MPI procs on a single GPU -> the GPU becomes the bottleneck!
31
▪ Average ‘time to predict’ over 160 datasets
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access
CPU+GPU is 5.3x/3.2x faster than CPU-only at 8 MPI procs (ROTATE=1000/100)
32
Volta Architecture: Most Productive GPU
Tensor Core: 125 Programmable TFLOPS Deep Learning
Improved SIMT Model: New Algorithms
Volta MPS: Inference Utilization
Improved NVLink & HBM2: Efficient Bandwidth
33
34
Typical legacy application:
▪ MPI parallel, single or few threads per MPI rank (e.g. OpenMP)
▪ Running with multiple MPI ranks per node
GPU acceleration in phases:
▪ Proof-of-concept prototype, …
▪ Great speedup at kernel level
▪ Application performance misses expectations
4/2/2018
35
[Chart: runtime vs. number of MPI ranks N = 1, 2, 4, 8, multicore CPU only vs. GPU-accelerated with Hyper-Q/MPS; bars split into serial, CPU-parallel, and GPU-parallelizable parts]
Available on Tesla/Quadro with CC 3.5+ (e.g. K20, K40, K80, M40, …)
36
[Diagram: Processes A and B each create their own context (Context A, Context B) on the same GPU]
37
Time-sliced use of the GPU: a context switch occurs between Context A and Context B
38
[Diagram: With MPS, kernels from Process A and Process B are funneled through a single MPS process sharing one GPU context]
39
40
41
42
▪ Average ‘time to predict’ over 160 datasets, ROTATE=1000
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access
Volta is 2.1x faster than Pascal
43
▪ Use ‘raw data pointer’ to access vectors
▪ Avoid using C++ references in OpenACC pragmas
▪ Standard classes called from an OpenACC region may result in compilation/linking errors. Use math.h instead of cmath ☺
▪ Unified memory has improved over time, but sometimes explicit data clauses are still needed to minimize data copies
▪ Volta works very well for programs that need MPS functionality
44
45
▪ Average ‘time to predict’ over 160 datasets, ROTATE=1000
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access
Volta is 2.1x faster than Pascal due to hardware-accelerated MPS
46
▪ ‘1qbt’ dataset, ROTATE=1000, 8 MPI procs
▪ PSG Cluster node, Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access
▪ Higher concurrency possible with more devices -> lower GPU time
▪ Lower latency with more devices/MPS servers -> lower CPU time
47
▪ ‘5cna’ dataset, ROTATE=100, 8 MPI procs, Tesla P100 GPUs, MPS
48
▪ ‘1a4w’ dataset, ROTATE=100, 8 MPI procs, Tesla P100/V100 GPUs, MPS
49
50