SLIDE 1

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Jorge González-Domínguez

Parallel and Distributed Architectures Group
Johannes Gutenberg University of Mainz, Germany
j.gonzalez@uni-mainz.de

GTC 2015

SLIDE 2

1. Overview of the Problem
2. Intra-GPU Parallelization with CUDA
3. Inter-GPU Parallelization with UPC++
4. Experimental Evaluation
5. Conclusions

SLIDE 3

1. Overview of the Problem

SLIDE 8

Genome-Wide Association Studies (I)

Analyses of genetic influence on diseases

M individuals
  • K cases
  • C controls

N genetic markers, Single Nucleotide Polymorphisms (SNPs), each with 3 genotypes:
  • Homozygous Wild (w, AA, 0)
  • Heterozygous (h, Aa, 1)
  • Homozygous Variant (v, aa, 2)

SLIDE 11

Genome-Wide Association Studies (II)

[Table: genotype values (0, 1 or 2) of SNP 1 to SNP 6 for each individual, with individuals grouped into Cases and Controls]

SLIDE 12

Genome-Wide Association Studies (and III)

Definition

Two SNPs present epistasis or interaction if:
  • Their joint genotype frequencies show a statistically significant difference between cases and controls, which potentially explains the effect of the genetic variation leading to disease.
  • The difference between cases and controls shown by the joint values is significantly higher than that obtained using only the individual SNP values.

SLIDE 13

BOOST

BOolean Operation-based Screening and Testing
  • Binary traits
  • Exhaustive search
  • Statistical regression
  • Good accuracy (used by biologists)
  • Returns a list of SNP pairs with high interaction probability
  • Fastest available tool

Runtime on an Intel Core i7 at 3.20 GHz:
  • 40,000 SNPs and 3,200 individuals: about 800 million pairs, 51 minutes
  • 500,000 SNPs and 5,000 individuals: about 125 billion pairs (moderate size), estimated 7 days
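The pair counts quoted on this slide follow from the number of unordered SNP pairs, N(N-1)/2. A minimal Python sketch of that arithmetic (the function name `num_snp_pairs` is ours, for illustration only):

```python
def num_snp_pairs(num_snps: int) -> int:
    """Number of unordered pairs among num_snps SNPs: N * (N - 1) / 2."""
    return num_snps * (num_snps - 1) // 2


print(num_snp_pairs(40_000))   # 799980000, about 800 million
print(num_snp_pairs(500_000))  # 124999750000, about 125 billion
```

The quadratic growth is what makes an exhaustive search expensive: a 10x increase in SNPs gives roughly 100x more pairs.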

SLIDE 15

GBOOST

  • CUDA version for GPUs
  • Same accuracy as BOOST
  • 40,000 SNPs and 6,400 individuals: about 800 million pairs, 28 seconds on a GTX Titan
  • 500,000 SNPs and 5,000 individuals: about 125 billion pairs (moderate size), 1 hour on a GTX Titan

High-throughput genotyping technologies collect a few million SNPs of an individual within a few minutes → expected datasets with 5M SNPs and 10,000 individuals

SLIDE 16

2. Intra-GPU Parallelization with CUDA

SLIDE 17

Calculation of Contingency Tables (I)

For each SNP-pair → number of occurrences of each combination of genotypes

Cases      SNP2=0   SNP2=1   SNP2=2
SNP1=0     n000     n010     n020
SNP1=1     n100     n110     n120
SNP1=2     n200     n210     n220

Controls   SNP2=0   SNP2=1   SNP2=2
SNP1=0     n001     n011     n021
SNP1=1     n101     n111     n121
SNP1=2     n201     n211     n221
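The counting behind these tables can be sketched in a few lines of Python. This is only an illustration of the definition, not the tool's kernel; the function name and the toy data are hypothetical.

```python
def contingency_tables(snp1, snp2, is_case):
    """Count genotype combinations of two SNPs, separately for cases and controls.

    snp1, snp2: genotype per individual (0, 1 or 2).
    is_case:    True for a case, False for a control.
    Returns table[g1][g2][k] with k = 0 for cases and k = 1 for controls,
    matching the n_{g1 g2 k} indices of the slide.
    """
    table = [[[0, 0] for _ in range(3)] for _ in range(3)]
    for g1, g2, case in zip(snp1, snp2, is_case):
        table[g1][g2][0 if case else 1] += 1
    return table


# Hypothetical toy data: 6 individuals, the first 3 are cases.
snp1 = [0, 1, 1, 2, 0, 1]
snp2 = [0, 1, 1, 0, 0, 2]
is_case = [True, True, True, False, False, False]
t = contingency_tables(snp1, snp2, is_case)
print(t[1][1][0])  # 2: two cases carry genotype 1 at both SNPs
```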

SLIDE 20

Calculation of Contingency Tables (II)

[Figure: worked example building the Cases and Controls contingency tables (rows SNP4=0, 1, 2; columns SNP6=0, 1, 2) from the genotype rows of SNP 4 and SNP 6]

SLIDE 21

Filtering Stage

  • Epistatic interaction measured via log-linear models
  • All SNP-pairs analyzed
  • The measure is obtained with numerical calculations from the values of the contingency table
  • Pairs with a measure higher than a threshold pass the filter and are included in the output file
  • multiEpistSearch uses a faster filter than GBOOST (outside the scope of this talk)
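The slides do not give the formula of the log-linear measure, so the sketch below swaps in a plain Pearson chi-square over the 9 genotype combinations by 2 classes as a stand-in. It only illustrates the shape of the filtering step (score each table, keep pairs above a threshold); the function names and the threshold are assumptions.

```python
def chi_square(table):
    """Pearson chi-square of table[g1][g2][k] (k = 0 cases, 1 controls).

    Stand-in score only; NOT the log-linear measure used by BOOST/GBOOST.
    """
    cells = [table[g1][g2] for g1 in range(3) for g2 in range(3)]
    class_totals = [sum(cell[k] for cell in cells) for k in range(2)]
    total = sum(class_totals)
    if total == 0:
        return 0.0
    stat = 0.0
    for cell in cells:
        row_total = cell[0] + cell[1]
        for k in range(2):
            expected = row_total * class_totals[k] / total
            if expected > 0:
                stat += (cell[k] - expected) ** 2 / expected
    return stat


def filter_pairs(tables_by_pair, threshold):
    """Keep the SNP pairs whose score is higher than the threshold."""
    return [pair for pair, table in tables_by_pair.items()
            if chi_square(table) > threshold]
```

Here `tables_by_pair` maps a pair such as `(4, 6)` to its 3x3x2 contingency table; only pairs that pass would be written to the output file.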

SLIDE 23

CUDA Implementation

CUDA Kernel
  • Genotyping information loaded in device memory through pinned copies
  • Each thread performs the whole calculation of independent SNP-pairs
  • Only one kernel for the whole computation
  • Each call to the kernel analyzes a batch of SNP-pairs

Optimization Techniques
  • Boolean representation of genotyping information
  • Increased coalescing of memory accesses
  • Exploitation of shared memory
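One common way to realize a Boolean representation, assumed here since the slides do not show the layout, is to store each SNP as three bitmasks (one per genotype) and obtain every contingency cell with a bitwise AND followed by a bit count. The Python sketch below uses arbitrary-precision integers as bit vectors; all names are illustrative.

```python
def to_bit_vectors(genotypes):
    """Encode one SNP as three bitmasks, one per genotype value.

    Bit i of mask g is set when individual i has genotype g.
    """
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks


def contingency_from_bits(masks1, masks2):
    """3x3 counts for one class (cases or controls): AND two masks, count set bits."""
    return [[bin(masks1[g1] & masks2[g2]).count("1") for g2 in range(3)]
            for g1 in range(3)]


snp_a = to_bit_vectors([0, 1, 1, 2])
snp_b = to_bit_vectors([0, 1, 2, 2])
print(contingency_from_bits(snp_a, snp_b))  # [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
```

In a CUDA kernel the same idea would operate on fixed-width words, with the bit count mapped to a population-count intrinsic such as `__popc`.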

SLIDE 24

3. Inter-GPU Parallelization with UPC++

SLIDE 25

UPC++ (I)

  • Unified Parallel C++
  • Novel extension of ANSI C++

    Y. Zheng, A. Kamil, M. Driscoll, H. Shan, and K. Yelick. UPC++: a PGAS Extension for C++. In Proc. 28th IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS’14), Phoenix, AZ, USA, 2014.

  • Follows the Partitioned Global Address Space (PGAS) programming model
  • Single Program Multiple Data (SPMD) execution model
  • Works on shared and distributed memory systems

SLIDE 26

UPC++ (and II)

  • Global memory logically partitioned among processes
  • Processes can directly access (read/write) any part of the global memory
  • Memory with affinity usually mapped in the same node (faster accesses)

SLIDE 27

Multi-GPU Approach (I)

  • One UPC++ process per GPU
  • SNP data distributed among the parts of the global memory
      • All the information of the same SNP in the same part
  • Each GPU (UPC++ process) analyzes different SNP-pairs
      • Creation of contingency table
      • Filtering
  • The data of the SNPs to analyze might be in remote memory
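To make the local/remote distinction concrete, the sketch below assumes a contiguous block distribution of SNPs over the processes. The slides only say that all the information of one SNP lives in one part, so the block layout and the function names are assumptions.

```python
def owner_of(snp, num_snps, num_procs):
    """Rank of the process whose part of the global memory holds this SNP.

    Assumes a contiguous block distribution (not confirmed by the slides).
    """
    block = -(-num_snps // num_procs)  # ceiling division
    return snp // block


def is_remote(snp, my_rank, num_snps, num_procs):
    """True when the SNP data has to be fetched from another process's part."""
    return owner_of(snp, num_snps, num_procs) != my_rank


# 10 SNPs over 3 processes: SNPs 0-3 on rank 0, 4-7 on rank 1, 8-9 on rank 2.
print([owner_of(s, 10, 3) for s in range(10)])  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

A process analyzing the pair (1, 9) on rank 0 would find SNP 1 in its own part and would need a remote copy for SNP 9.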

SLIDE 28

Multi-GPU Approach (II)

SLIDE 32

Multi-GPU Approach (VI)

Static distribution
  • Workload distributed at the beginning
      • Metablocks that will be analyzed by each GPU
  • The distribution does not change during the execution
  • Balance of the number of metablocks per GPU
      • Similar workload for each GPU
      • Good distribution for systems with similar GPUs
  • Minimization of remote copies
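A static distribution can be sketched as a one-time assignment of metablock ids to GPUs. The round-robin rule below is an assumption chosen for brevity: it balances the number of metablocks per GPU but does not attempt the minimization of remote copies mentioned on the slide.

```python
def static_distribution(num_metablocks, num_gpus):
    """Assign metablock ids to GPUs once, round-robin, before execution starts."""
    assignment = [[] for _ in range(num_gpus)]
    for mb in range(num_metablocks):
        assignment[mb % num_gpus].append(mb)
    return assignment


print(static_distribution(10, 4))  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The per-GPU counts differ by at most one, which gives similar workloads when all GPUs are equally fast.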

SLIDE 33

Multi-GPU Approach (and VII)

On-demand distribution
  • The metablocks computed by each GPU are initially unknown
  • Table with one binary value per metablock that indicates whether it has been computed
  • When one GPU finishes with one metablock → it looks for the next one that has not been analyzed
  • Locks or semaphores necessary for the concurrent accesses to the table
      • Easy with UPC++ support
      • Synchronizations introduce performance overhead
  • GPUs might compute different numbers of metablocks
      • Faster GPUs analyze more SNP-pairs
      • Good distribution for systems with different GPUs
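The on-demand scheme can be mimicked on a single machine with threads standing in for GPUs and a lock guarding the shared table. This is a sketch of the idea only: the real implementation uses UPC++ processes and its synchronization support, and the class and function names below are hypothetical.

```python
import threading


class MetablockTable:
    """One flag per metablock; a lock guards concurrent accesses to the table."""

    def __init__(self, num_metablocks):
        self.taken = [False] * num_metablocks
        self.lock = threading.Lock()

    def claim_next(self):
        """Return the id of the next unclaimed metablock, or None if all are taken."""
        with self.lock:
            for mb, flag in enumerate(self.taken):
                if not flag:
                    self.taken[mb] = True
                    return mb
        return None


def worker(table, done):
    # Each worker stands in for one GPU: claim a metablock, "analyze" it, repeat.
    while (mb := table.claim_next()) is not None:
        done.append(mb)


def run_on_demand(num_metablocks, num_workers):
    """Run the workers to completion; return the metablocks each one processed."""
    table = MetablockTable(num_metablocks)
    results = [[] for _ in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(table, results[w]))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

How many metablocks each worker ends up with depends on timing, which is the point of the scheme: a faster worker returns to the table more often. Every access pays for the lock, which is the overhead the slide refers to.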

SLIDE 34

4. Experimental Evaluation

SLIDE 36

Evaluation with Homogeneous GPUs (I)

Platform
  • Mogon cluster, Johannes Gutenberg Universität
  • 8 nodes with 3 GTX Titan GPUs each
      • One of the most powerful GPUs
  • InfiniBand network

Dataset
  • Real-world data from the WTCCC database
  • Moderately-sized
      • 500,568 SNPs
      • 2,005 cases with bipolar disorder
      • 3,004 controls

SLIDE 37

Evaluation with Homogeneous GPUs (II)

[Figure: WTCCC Dataset on Homogeneous GPUs. Execution time (sec) against the number of GTX Titan GPUs for the static and on-demand distributions. Speedups (values in parentheses in the plot):]

Number of GTX Titan   2      4      8      16      24
static                1.93   3.91   7.77   15.68   22.82
on-demand             1.83   3.21   6.93   12.20   16.51

  • Static 1.38 times faster for 24 GPUs
  • Static always > 95 % parallel efficiency
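The two claims on this slide can be checked from the numbers in parentheses, reading them as speedups over one GPU with the static value first in each pair (an interpretation, though it reproduces the quoted 1.38). Parallel efficiency is speedup divided by the number of GPUs.

```python
gpus = [2, 4, 8, 16, 24]
static = [1.93, 3.91, 7.77, 15.68, 22.82]
on_demand = [1.83, 3.21, 6.93, 12.20, 16.51]


def parallel_efficiency(speedup, num_gpus):
    """Fraction of the ideal linear speedup that is actually achieved."""
    return speedup / num_gpus


# Efficiency of the static distribution for every GPU count.
for n, s in zip(gpus, static):
    print(n, round(parallel_efficiency(s, n), 3))

# Static versus on-demand with 24 GPUs.
print(round(static[-1] / on_demand[-1], 2))  # 1.38
```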

SLIDE 38

Evaluation with Homogeneous GPUs (and III)

Design             Architecture    Runtime    Speed (10^6 pairs/s)
multiEpistSearch   24 GTX Titan    1 m 11 s   1764.56
multiEpistSearch   1 GTX Titan     27 m       77.34
GBOOST             1 GTX Titan     1 h 15 m   34.23
EpiGPU*            1 GTX 580       2 h 55 m   11.90
SHEsisEPI*         1 GTX 285       27 h       1.29
BOOST**            Intel Core i7   7 d        0.21

Speedups for one GPU:
  • 2.77 over GBOOST
  • > 373 over the estimation for BOOST on a 3GHz Intel Core i7

With 24 Titans: 54.93 and > 8,500 times faster than GBOOST and BOOST, respectively
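The Speed column is the number of SNP pairs divided by the runtime. A short sketch, using the 500,568 SNPs of the WTCCC dataset and the two multiEpistSearch runtimes from the table (the function name is illustrative):

```python
def pairs_per_second(num_snps, runtime_seconds):
    """Throughput of an exhaustive pairwise search over num_snps SNPs."""
    pairs = num_snps * (num_snps - 1) // 2
    return pairs / runtime_seconds


wtccc_snps = 500_568
print(round(pairs_per_second(wtccc_snps, 71) / 1e6, 2))       # 1764.56 (24 GTX Titan, 1 m 11 s)
print(round(pairs_per_second(wtccc_snps, 27 * 60) / 1e6, 2))  # 77.34 (1 GTX Titan, 27 m)
```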

SLIDE 39

Evaluation with Heterogeneous GPUs (I)

Platform
  • Pluton cluster, Universidade da Coruña (Spain)
  • 8 nodes with 1 Tesla K20m
  • 4 nodes with 2 Tesla 2050
      • Fewer cores
  • Gigabit Ethernet network

SLIDE 40

Evaluation with Heterogeneous GPUs (II)

[Figure: WTCCC Dataset on Heterogeneous GPUs. Execution time (sec) against the number of GPUs for the static and on-demand distributions. Speedups (values in parentheses in the plot):]

Number of GPUs   1+1    2+2    4+4    8+8
static           0.90   2.24   4.44   7.98
on-demand        1.31   2.47   4.93   9.43

  • On-demand 1.18 times faster for 16 GPUs

SLIDE 41

Evaluation with Heterogeneous GPUs (and III)

Design             Architecture            Runtime     Speed (10^6 pairs/s)
multiEpistSearch   8 Tesla K20m + 8 2050   4 m 20 s    481.86
multiEpistSearch   8 Tesla K20m            5 m 40 s    348.01
multiEpistSearch   8 Tesla 2050            10 m 12 s   204.71
multiEpistSearch   1 Tesla K20m            41 m        50.93
multiEpistSearch   1 Tesla 2050            1 h 1 m     34.23
GBOOST             1 Tesla K20m            1 h 26 m    24.28
GBOOST             1 Tesla 2050            2 h 17 m    15.22

  • With 1 GPU: 2.10 and 2.25 times faster than GBOOST
  • 1.31 times faster using the whole cluster (on-demand) than only the 8 Tesla K20m

SLIDE 42

Evaluation of a Large-Scale Dataset

Simulated dataset
  • 5M SNPs
  • 5,000 cases
  • 5,000 controls

  • 2 hours and 45 minutes on Mogon (24 GTX Titan)
  • Estimation of more than 2 days and 14 hours on 1 GPU
  • GBOOST is not able to analyze it
      • Out-of-bounds problems in the arrays

SLIDE 43

5. Conclusions

SLIDE 44

Conclusions

  • multiEpistSearch looks for epistatic interactions on GPU clusters
  • Hybrid CUDA & UPC++ implementation
  • On a single GPU, speedups over GBOOST are always higher than 2
  • Two inter-GPU data distributions
      • Static for homogeneous clusters
      • Dynamic (on-demand) for heterogeneous clusters
  • High scalability
      • 95% parallel efficiency with 24 GTX Titans and the WTCCC dataset
      • 2 hours and 45 minutes for 5M SNPs and 10K samples on 24 GTX Titans

SLIDE 45

Bibliography

First version of the GPU kernel

  • J. González-Domínguez, B. Schmidt, J. C. Kässens, and L. Wienbrandt. Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS. In Proc. 20th Intl. European Conf. on Parallel and Distributed Computing (Euro-Par’14), Porto, Portugal.

multiEpistSearch (minor revision)

  • J. González-Domínguez, J. C. Kässens, L. Wienbrandt, and B. Schmidt. Large-Scale Genome-Wide Association Studies on a GPU Cluster Using a CUDA-Accelerated PGAS Programming Model. Intl. Journal of High Performance Computing Applications (IJHPCA).
