SLIDE 1

ACCELERATING SANJEEVINI: A DRUG DISCOVERY SOFTWARE SUITE

Abhilash Jayaraj, IIT Delhi
Bharatkumar Sharma, Nvidia
Shashank Shekhar, IIT Delhi
Nagavijayalakshmi, Nvidia

SLIDE 2

AGENDA

  • Quick introduction to the computer-aided drug discovery software Sanjeevini
  • Challenges
    • Code documentation is still in the process of being improved
    • Code is maintained by non-computer-science researchers
    • Code is designed to suit distributed programming
  • Constraints
    • Code modification should be minimal → ease of maintenance
    • The current cluster has a mix of CPUs and GPUs; the code should run on both → portability
  • Learnings

What to expect and what not to

SLIDE 3

COMPUTER AIDED DRUG DISCOVERY

Introduction

Target Discovery → Lead Generation → Lead Optimization → Preclinical Development → Phase I, II & III Clinical Trials → FDA Review & Approval → Drug to the Market

Roughly 14 years and $1.4 billion per approved drug.

(Chart: stage durations of 2.5, 3.0, 1.0, 6.0, and 1.5 years; cost shares of 4%, 15%, 10%, 68%, and 3%.)

SLIDE 4

SANJEEVINI FOR COMPUTER AIDED DRUG DESIGN

Inputs: a self-drawn ligand molecule; a protein-ligand complex, protein, or DNA sequence; the NRDBSM, million-molecule library, or natural products database.

  • Check Lipinski compliance
  • Generate rapid binding energy estimates by the RASPD protocol
  • Predict all possible binding sites and store the top ten sites
  • Dock and score
  • Optimize geometry / assign TPACM4 or derive quantum mechanical charges
  • Assign force field parameters
  • Perform molecular dynamics simulations and post facto free energy component analyses
  • (Optional) Generate a canonical A/B DNA or MD-averaged structure of B DNA

Overview

SLIDE 5

SANJEEVINI

GPU acceleration

▪ OpenACC acceleration of the ParDOCK module
▪ All-atom, energy-based Monte Carlo docking for protein-ligand complexes

SLIDE 6

PERFORMANCE OPTIMIZATION

Strategy

Analyze → Parallelize → Optimize

SLIDE 7

PERFORMANCE OPTIMIZATION

Strategy

Analyze → Parallelize → Optimize

SLIDE 8

SANJEEVINI: PARDOCK

Flat profile (hotspots):

  % time   cumulative s   self s    calls         self s/call   total s/call   name
  69.78    557.90         557.90    1188000       0.00          0.00           PDB::EnergyCalculator()
  12.92    661.19         103.29    8             12.91         20.26          PDB::clashCombination()
   7.35    719.96          58.77    26051422500   0.00          0.00           getRadius1()
   5.49    763.85          43.89    885075        0.00          0.00           PDB::energyAtom()
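This flat profile matches gprof's format; a minimal sketch of producing such a profile, assuming a GNU toolchain (the binary and input names here are illustrative):

g++ -O2 -pg -o pardock *.cpp     # build with profiling instrumentation
./pardock 1qbt.pdb               # run normally; writes gmon.out
gprof ./pardock gmon.out         # print the flat profile and call graph

With EnergyCalculator() alone consuming roughly 70% of the runtime, it is the natural first target for parallelization.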

SLIDE 9

PERFORMANCE OPTIMIZATION

Strategy

Analyze → Parallelize → Optimize

SLIDE 10

SANJEEVINI: PARDOCK

CPU code: EnergyCalculator

double PDB::EnergyCalculator( float **&energyGrid, const vector<points> &vDrugGrid,
                              points coords[], const unsigned &totalDockAtoms, … ){
    for( int atomcount = 0; atomcount < totalDockAtoms; atomcount++ ){
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            // minDis = minimum of 'distance', minCounter = counter corresponding to minDis
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}

SLIDE 11

OpenACC

Simple | Powerful | Portable

Fueling the Next Wave of Scientific Discoveries in HPC

University of Illinois: PowerGrid (MRI reconstruction), 70x speed-up with 2 days of effort
RIKEN, Japan: NICAM (climate modeling), 7-8x speed-up with 5% of the code modified

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7

main()
{
    <serial code>

    #pragma acc kernels   // automatically runs on GPU
    {
        <parallel code>
    }
}

SLIDE 12

OPENACC DIRECTIVES

Manage data movement | Initiate parallel execution | Optimize loop mappings

#pragma acc data copyin(x,y) copyout(z)
{
    ...
    #pragma acc parallel
    {
        #pragma acc loop gang vector
        for (i = 0; i < n; ++i) {
            z[i] = x[i] + y[i];
            ...
        }
    }
    ...
}

Performance portable | Interoperable | Single source | Incremental
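The directives are activated by a compiler flag; a minimal sketch of a build line with the PGI/NVIDIA HPC compiler (the source file name is hypothetical), where -Minfo=accel produces the kind of feedback shown on the compiler-output slides later:

pgc++ -acc -ta=tesla -Minfo=accel -o pardock pardock.cpp

Without -acc the pragmas are simply ignored, which is what keeps a single source tree portable to CPU-only builds.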

SLIDE 13

SANJEEVINI: PARDOCK

OpenACC parallelization: EnergyCalculator (1)

double PDB::EnergyCalculator( float **&energyGrid, const vector<points> &vDrugGrid,
                              points coords[], const unsigned &totalDockAtoms, … ){
    #pragma acc parallel loop reduction(+:ene) private(minDis,minCounter) present(…) copyin(…) firstprivate(…)
    for( int atomcount = 0; atomcount < totalDockAtoms; atomcount++ ){
        #pragma acc loop reduction(min:minDis)
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            minDis = (minDis > distance) ? distance : minDis;
        }

SLIDE 14

SANJEEVINI: PARDOCK

OpenACC parallelization: EnergyCalculator (2)

        #pragma acc loop reduction(min:minCounter)
        for( int counter = 0; counter < vDrugGrid.size(); counter++ ){
            // compute 'distance' between coords[atomcount] and vDrugGrid[counter]
            if ( distance == minDis ){
                minCounter = (minCounter > counter) ? counter : minCounter;
            }
        }
        ene += EnergyGrid[minCounter][atomcount];
    }
    return ene;
}

SLIDE 15

SANJEEVINI: PARDOCK

OpenACC parallelization: EnergyCalculator (3)

const points *vDrugGridData = vDrugGrid.data();
// compute 'distance' between coords[atomcount] and vDrugGridData[counter]

▪ Use ‘raw data pointer’ to access vectors
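A std::vector object itself cannot be mapped by OpenACC data clauses, but its contiguous buffer can. A minimal, self-contained sketch of the pattern (points here is a hypothetical stand-in for the real struct, and sumX is not a ParDOCK function):

#include <vector>

struct points { float x, y, z; };   // stand-in POD element type

float sumX( const std::vector<points> &v ){
    const points *data = v.data();              // raw pointer the compiler can map
    const int n = static_cast<int>(v.size());   // plain int bound for the data clause
    float s = 0.0f;
    #pragma acc parallel loop reduction(+:s) copyin(data[0:n])
    for( int i = 0; i < n; i++ )
        s += data[i].x;
    return s;
}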

SLIDE 16

SANJEEVINI: PARDOCK

OpenACC parallelization: EnergyCalculator (4)

unsigned  totDockAtoms = totalDockAtoms;   // local copy of the reference parameter
float   **eneGrid      = EnergyGrid;       // local copy of the member pointer

#pragma acc parallel loop reduction(+:ene) … copyin(coords[0:totDockAtoms]) present(eneGrid)
    …
    ene += eneGrid[minCounter][atomcount];

▪ Use ‘raw data pointer’ to access vectors ▪ Avoid using C++ references in OpenACC pragmas

SLIDE 17

SANJEEVINI: PARDOCK

OpenACC parallelization: EnergyCalculator (4)

unsigned  totDockAtoms = totalDockAtoms;   // local copy of the reference parameter
float   **eneGrid      = EnergyGrid;       // local copy of the member pointer

#pragma acc parallel loop reduction(+:ene) … copyin(coords[0:totDockAtoms]) present(eneGrid)
    …
    ene += eneGrid[minCounter][atomcount];

▪ Use ‘raw data pointer’ to access vectors ▪ Avoid using C++ references in OpenACC pragmas

Compiler feedback when the references are left in the pragma:

PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:])
         Generating copyin(coords[:totalDockAtoms->])
         Generating present(EnergyGrid[:][:][:])

→ Runtime memory access violation

SLIDE 18

OPENACC: 3 LEVELS OF PARALLELISM

  • Vector threads work in lockstep (SIMD/SIMT parallelism)
  • Workers compute a vector
  • Gangs have 1 or more workers and share resources (such as cache or the streaming multiprocessor)
  • Multiple gangs work independently of each other (see the sketch below)

(Diagram: each gang contains workers, and each worker drives a vector of lanes.)
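A sketch of requesting the three levels explicitly (an assumed example, not deck code; the sizes are illustrative, and the compiler picks reasonable defaults when these clauses are omitted):

void add3d( int N, int M, int K, const float *a, const float *b, float *c ){
    #pragma acc parallel num_gangs(256) num_workers(4) vector_length(128) \
                copyin(a[0:N*M*K], b[0:N*M*K]) copyout(c[0:N*M*K])
    {
        #pragma acc loop gang              // gangs: independent blocks
        for( int i = 0; i < N; i++ ){
            #pragma acc loop worker        // workers within a gang
            for( int j = 0; j < M; j++ ){
                #pragma acc loop vector    // lockstep vector lanes
                for( int k = 0; k < K; k++ ){
                    int idx = (i * M + j) * K + k;
                    c[idx] = a[idx] + b[idx];
                }
            }
        }
    }
}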

SLIDE 19

SANJEEVINI: PARDOCK

OpenACC compiler output: EnergyCalculator

PDB::EnergyCalculator(float **&, const std::vector<points, std::allocator<points>> &, const std::vector<points, std::allocator<points>> &, points *, const unsigned int &, energy &, int):
     22, Generating present(vDrugGridData[:],eneGrid[:][:])
         Generating copyin(coords[:totDockAtoms])
     22, Accelerator kernel generated
         Generating Tesla code
         22, Generating reduction(+:ene)
         24, #pragma acc loop gang /* blockIdx.x */
         31, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minDis)
         45, #pragma acc loop vector(256) /* threadIdx.x */
             Generating reduction(min:minIdx)
     31, Loop is parallelizable
     45, Loop is parallelizable

SLIDE 20

MANAGE DATA HIGHER IN THE PROGRAM

Currently, data is moved at the beginning and end of each function, in case it is needed on the CPU. We know the data is only needed on the CPU after convergence, so we should tell the compiler when data movement is really required in order to improve performance.

SLIDE 21

STRUCTURED DATA REGIONS

The data directive defines a region of code in which GPU arrays remain on the GPU and are shared among all kernels in that region.

#pragma acc data
{
    #pragma acc parallel loop
    ...
    #pragma acc parallel loop
    ...
}

Data Region

Arrays used within the data region remain on the GPU until the end of the data region.
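A self-contained sketch of the idea (an assumed example, not deck code): the data region keeps x, y, and z resident across both kernels, so nothing moves between them:

void saxpy_twice( int n, const float *x, const float *y, float *z ){
    #pragma acc data copyin(x[0:n], y[0:n]) copyout(z[0:n])
    {
        #pragma acc parallel loop present(x[0:n], y[0:n], z[0:n])
        for( int i = 0; i < n; i++ )
            z[i] = x[i] + y[i];

        #pragma acc parallel loop present(x[0:n], z[0:n])
        for( int i = 0; i < n; i++ )
            z[i] += 2.0f * x[i];
    }   // z is copied back to the host only here
}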

SLIDE 22

UNSTRUCTURED DATA DIRECTIVES

Used to define data regions when scoping doesn't allow the use of normal data regions (e.g. the constructor/destructor of a class).

enter data: defines the start of an unstructured data lifetime
  • clauses: copyin(list), create(list)

exit data: defines the end of an unstructured data lifetime
  • clauses: copyout(list), delete(list), finalize

#pragma acc enter data copyin(a)
...
#pragma acc exit data delete(a)
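The constructor/destructor case mentioned above, as a minimal sketch (an assumed example; DeviceBuffer is not a Sanjeevini class). Copying this[0:1] first makes the object's members visible on the device, a common pattern for attaching a dynamically allocated buffer:

class DeviceBuffer {
public:
    explicit DeviceBuffer( int n ) : n_(n), data_(new float[n]) {
        #pragma acc enter data copyin(this[0:1]) create(data_[0:n_])
    }
    ~DeviceBuffer(){
        #pragma acc exit data delete(data_[0:n_], this[0:1])
        delete[] data_;
    }
private:
    int    n_;
    float *data_;
};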

SLIDE 23

SANJEEVINI: PARDOCK

OpenACC parallelization: EnergyAtom (3)

int **vProteinListData = new int *[vProteinList.size()];
n = vProteinList.size();
#pragma acc enter data create(vProteinListData[0:n][0:1])
for( int count = 0; count < n; count++ ){
    int numPro = vProteinList[count].size();
    vProteinListData[count] = vProteinList[count].data();
    #pragma acc enter data copyin(vProteinListData[count:1][0:numPro])
}

▪ Use 'raw data pointers' to access vectors
▪ How do you access a 'vector of vectors' (jagged arrays)? Creation and copy of jagged arrays
SLIDE 24

SANJEEVINI: PARDOCK

OpenACC parallelization: EnergyAtom (4)

for( int count = 0; count < n; count++ ){
    int numPro = vProteinList[count].size();
    #pragma acc exit data delete(vProteinListData[count:1][0:numPro])
    vProteinListData[count] = NULL;
}
#pragma acc exit data delete(vProteinListData[0:n][0:1])

▪ Use 'raw data pointers' to access vectors
▪ How do you access a 'vector of vectors' (jagged arrays)? Deletion of jagged arrays

SLIDE 25

SANJEEVINI: PARDOCK

OpenACC compiler output: EnergyAtom (1)

PDB::energyAtom(const std::vector<PDB, std::allocator<PDB>> &, PDB, points, const std::vector<Box, std::allocator<Box>> &, const std::vector<int, std::allocator<int>> &, const std::vector<std::vector<int, std::allocator<int>>, std::allocator<std::vector<int, std::allocator<int>>>> &, int **):
     79, Generating enter data copyin(boxListData[:boxListNumElements],rec,coord)
     85, Generating present(coord,boxListData[:],rec,vProteinListData[:][:],vProData[:])
         Accelerator kernel generated
         Generating Tesla code
         85, Generating reduction(+:electro,vandw,ehyd)
         87, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    129, Generating exit data delete(boxListData[:boxListNumElements],rec,coord)

SLIDE 26

SANJEEVINI: PARDOCK

OpenACC compiler output: EnergyAtom (2)

main:
    266, Generating enter data copyin(vProData[:vProNumElements])
         Generating enter data create(vProteinListData[:vProteinListNumElements][:1])
    275, Generating enter data copyin(vProteinListData[proList][:numElements])
    321, Generating exit data delete(vProteinListData[proList][:numElements])
    322, Generating exit data delete(vProteinListData[:vProteinListNumElements][:1],vProData[:vProNumElements])

SLIDE 27

CUDA UNIFIED MEMORY

Simplified Developer Effort

(Diagram: without Unified Memory, system memory and GPU memory are separate pools the developer must copy between; with Unified Memory they appear as a single memory space.)

Sometimes referred to as "managed memory." New "Pascal" GPUs handle Unified Memory in hardware.
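A minimal CUDA sketch of the idea (an assumed example, not deck code): one managed allocation is touched by both host and device with no explicit copies. With OpenACC, the same effect comes from compiling with -ta=tesla:managed.

#include <cuda_runtime.h>

__global__ void scale( float *v, int n, float s ){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n ) v[i] *= s;
}

int main(){
    const int n = 1 << 20;
    float *v = nullptr;
    cudaMallocManaged( &v, n * sizeof(float) );   // one pointer, visible to CPU and GPU
    for( int i = 0; i < n; i++ ) v[i] = 1.0f;     // initialized on the host
    scale<<<(n + 255) / 256, 256>>>( v, n, 2.0f );
    cudaDeviceSynchronize();                      // wait before the host reads v again
    cudaFree( v );
    return 0;
}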

SLIDE 28

PERFORMANCE OPTIMIZATION

Strategy

Analyze → Parallelize → Optimize

SLIDE 29

SANJEEVINI: PARDOCK

Performance: CPU and GPU (1)

▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

CPU+GPU is 5.8x/3.3x faster than CPU alone at 8 MPI processes with ROTATE=1000/100. At 16 MPI processes on a single GPU, the GPU becomes the bottleneck!

SLIDE 30

SANJEEVINI: PARDOCK

Performance: CPU and GPU (2)

▪ Average 'time to predict' over 160 datasets
▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access

CPU+GPU is 5.3x/3.2x faster than CPU alone at 8 MPI processes with ROTATE=1000/100.

SLIDE 31

TESLA V100

The Fastest and Most Productive GPU for AI and HPC

• Volta architecture: most productive GPU
• Tensor Core: 125 programmable TFLOPS for deep learning
• Improved SIMT model: enables new algorithms
• Volta MPS: better inference utilization
• Improved NVLink & HBM2: efficient bandwidth

SLIDE 32

MULTI PROCESS SERVICE (MPS) FOR MPI APPLICATIONS

SLIDE 33

GPU ACCELERATION OF LEGACY MPI APPS

Typical legacy application:
  • MPI-parallel, with a single thread or a few threads per MPI rank (e.g. OpenMP)
  • Runs with multiple MPI ranks per node

GPU acceleration proceeds in phases:
  • Proof-of-concept prototype, …
  • Great speedup at the kernel level, yet application performance misses expectations

SLIDE 34

MULTI PROCESS SERVICE (MPS)

For Legacy MPI Applications

Available on Tesla/Quadro GPUs with compute capability 3.5+ (e.g. K20, K40, K80, M40, …).

(Chart: runtimes for N = 1, 2, 4, and 8 MPI ranks per node, multicore CPU only vs. with Hyper-Q/MPS; each bar is split into serial, CPU-parallel, and GPU-parallelizable/GPU-accelerated parts.)

SLIDE 35

PROCESSES SHARING GPU WITHOUT MPS

No Overlap

(Diagram: processes A and B each create their own GPU context; their kernels cannot overlap on the GPU.)

SLIDE 36

PROCESSES SHARING GPU WITHOUT MPS

Context Switch Overhead

Time-sliced use of the GPU: kernels from different contexts alternate, with a context switch between them.

SLIDE 37

PROCESSES SHARING GPU WITH MPS

Maximum Overlap

(Diagram: processes A and B keep their own contexts, but the MPS process funnels kernels from both onto the GPU so they overlap.)
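Enabling MPS on a node amounts to starting the control daemon before launching the MPI job; a minimal sketch (the mpirun line and binary name are illustrative):

export CUDA_VISIBLE_DEVICES=0          # pin the server to one GPU
nvidia-cuda-mps-control -d             # start the MPS control daemon
mpirun -np 8 ./pardock ...             # ranks now share the GPU through MPS
echo quit | nvidia-cuda-mps-control    # shut the daemon down afterwards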

SLIDE 38

PROCESSES SHARING GPU WITH MPS

No Context Switch Overhead


SLIDE 39

SLIDE 40

SLIDE 41

SANJEEVINI: PARDOCK

Pascal vs Volta

▪ Average 'time to predict' over 160 datasets, ROTATE=1000
▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access

Volta is 2.1x faster than Pascal

SLIDE 42

SANJEEVINI: PARDOCK

OpenACC parallelization

▪ Use 'raw data pointers' to access vectors
▪ Avoid using C++ references in OpenACC pragmas
▪ Standard classes called from an OpenACC region may cause compilation/linking errors; use math.h instead of cmath ☺ (see the sketch below)
▪ Unified Memory has improved over time, but an explicit data clause may still be needed to minimize data copies
▪ Volta works very well for programs that need MPS functionality
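The math.h point, as a minimal sketch (an assumed example, not ParDOCK code): the C header exposes plain functions such as sqrtf that the device compiler resolves cleanly, whereas some <cmath> overloads can fail to compile or link inside accelerator regions:

#include <math.h>   // rather than <cmath> in OpenACC code paths

void norms( int n, const float *x, const float *y, float *r ){
    #pragma acc parallel loop copyin(x[0:n], y[0:n]) copyout(r[0:n])
    for( int i = 0; i < n; i++ )
        r[i] = sqrtf( x[i] * x[i] + y[i] * y[i] );
}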

SLIDE 43

ONGOING WORK

SLIDE 44

SANJEEVINI: PARDOCK

Pascal vs Volta

▪ Average 'time to predict' over 160 datasets, ROTATE=1000
▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100/V100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61/9.0.176
▪ MPS enabled for GPU access

Volta is 2.1x faster than Pascal, thanks to hardware-accelerated MPS.

SLIDE 45

SANJEEVINI: PARDOCK

Multi-GPU scalability (2)

▪ '1qbt' dataset, ROTATE=1000, 8 MPI processes
▪ PSG cluster node: Haswell E5-2698 v3 @ 2.3 GHz, dual socket, 16 cores
▪ 256 GB RAM
▪ Tesla P100 GPU
▪ CentOS 7.2
▪ CUDA Toolkit 8.0.61
▪ MPS enabled for GPU access
▪ Higher concurrency is possible with more devices → lower GPU time
▪ Lower latency with more devices/MPS servers → lower CPU time
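Scaling across devices requires each MPI rank to pick a GPU; a minimal sketch using the OpenACC runtime API (an assumed example; the round-robin policy is illustrative, not necessarily what ParDOCK does):

#include <mpi.h>
#include <openacc.h>

int main( int argc, char **argv ){
    MPI_Init( &argc, &argv );
    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    int ngpus = acc_get_num_devices( acc_device_nvidia );
    acc_set_device_num( rank % ngpus, acc_device_nvidia );  // spread ranks over GPUs

    // ... each rank's OpenACC kernels now run on its assigned device ...

    MPI_Finalize();
    return 0;
}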

SLIDE 46

SANJEEVINI: PARDOCK

Multi-GPU scalability (3)

▪ '5cna' dataset, ROTATE=100, 8 MPI processes, Tesla P100 GPUs, MPS

SLIDE 47

SANJEEVINI: PARDOCK

Pascal vs Volta (2)

▪ '1a4w' dataset, ROTATE=100, 8 MPI processes, Tesla P100/V100 GPUs, MPS

SLIDE 48

REFERENCES: PARDOCK

  • Gupta, A., et al. "ParDOCK: An all atom energy based Monte Carlo docking protocol for protein-ligand complexes." Protein and Peptide Letters 14.7 (2007): 632-646.
  • Nishikawa, Joy L., et al. "Inhibiting fungal multidrug resistance by disrupting an activator-Mediator interaction." Nature 530.7591 (2016): 485.
  • Singh, Tanya, D. Biswas, and Bhyravabhotla Jayaram. "AADS - An automated active site identification, docking, and scoring protocol for protein targets based on physicochemical descriptors." Journal of Chemical Information and Modeling 51.10 (2011): 2515-2527.
  • Singh, Tanya, Olayiwola Adedotun Adekoya, and B. Jayaram. "Understanding the binding of inhibitors of matrix metalloproteinases by molecular docking, quantum mechanical calculations, molecular dynamics simulations, and a MMGBSA/MMBappl study." Molecular BioSystems 11.4 (2015): 1041-1051.
  • Jayaram, Bhyravabhotla, et al. "Sanjeevini: a freely accessible web-server for target directed lead molecule discovery." BMC Bioinformatics. Vol. 13. No. 17. BioMed Central, 2012.

SLIDE 49

SANJEEVINI: PARDOCK

Steps involved