Accelerating Gene Set Enrichment Analysis on CUDA-Enabled GPUs
Bertil Schmidt, Christian Hundt
Contents
- Gene Set Enrichment Analysis (GSEA)
– Background
– Algorithmic details
- cudaGSEA
- Performance evaluation
GSEA and Bioinformatics
- High-throughput technologies generate large-scale gene expression data sets
– RNA-Seq
– Microarrays
- GSEA uses annotated gene sets to mine a given gene expression matrix
– MSigDB contains over 10K signatures, each containing around 100 gene identifiers on average
- Typical GSEA study:
– identify metabolic pathways that are differentially changed in human type-2 diabetes
Gene Set Enrichment Analysis
- Reveals correlations between gene sets and diseases using gene expression data
- State-of-the-art tool with over 10,000 citations
- Written in (multi-threaded) Java
- Highly time-consuming
– analyzing 20,639 genes measured in 200 patients with 4,726 pathways and 1M permutations takes around 1 week with GSEA 2.2.2 software on a CPU
- We present
– GSEA parallelization on a GPU using CUDA (cudaGSEA)
– cudaGSEA is around two orders of magnitude faster than BroadGSEA
GSEA Algorithm – Gene Ranking
- Gene expression matrix D obtained from RNA-Seq or Microarray experiments
- For each gene i and patient j, the expression value D[i,j] is stored; each patient j has an associated (binary) phenotype C
- Diseases are driven by complex gene interactions, so simply reporting top-ranked genes produces many false positives
- Domain experts provide sets of genes that might explain the observed phenotypes
GSEA Algorithm – Enrichment score
- Enrichment score (ES) measures the correlation between a given gene set S and the calculated gene ranking g(i)
– Report the maximum deviation of a running sum (k)
– Sum increases if we hit a member of S and decreases otherwise
- How significant is ES = 0.857? p-value calculation using permutation testing
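The running-sum idea can be sketched in a few lines of Python. This is a minimal illustration that assumes the unweighted variant, where every hit adds 1/|S| and every miss subtracts 1/(N - |S|); the slides do not state which weighting cudaGSEA uses, and the function name `enrichment_score` is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (a simplifying assumption).

    ranked_genes: gene identifiers ordered by the ranking g(i), best first.
    gene_set:     the gene set S.
    Returns the signed value of the running sum at its largest absolute deviation.
    """
    members = set(gene_set)
    n_hit = sum(1 for g in ranked_genes if g in members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # the score is undefined here; 0.0 is a convention of this sketch
    up = 1.0 / n_hit      # step taken when the current gene is a member of S
    down = 1.0 / n_miss   # step taken when it is not
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += up if g in members else -down
        if abs(running) > abs(best):
            best = running
    return best


ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranking, {"g1", "g2"}))  # prints 1.0
print(enrichment_score(ranking, {"g5", "g6"}))  # prints -1.0
```

A gene set concentrated at the top of the ranking yields a score near +1, and one concentrated at the bottom yields a score near -1.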
GSEA Algorithm – Permutation testing
GSEA Algorithm
[Figure: histogram of permuted enrichment scores with both tails marked at -|ES| and |ES|]
- Histogram of 1,000,000 enrichment scores obtained by permuting patient phenotypes
- Estimate p-value by counting events in both tails
- Why so many permutations?
– When testing 1,000 gene sets at significance level p < 0.001, Bonferroni correction requires 1,000p < 0.001, i.e. p < 10^-6 per gene set; resolving such a p-value takes more than 1,000,000 permutations
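The permutation test described above can be sketched as follows. This is an illustration under stated assumptions, not the cudaGSEA implementation: genes are ranked by the difference of class means, the enrichment score is the unweighted running sum, and the p-value is the plain fraction of permutations whose |ES| reaches the observed |ES|. All function names are hypothetical.

```python
import random


def mean_difference(values, labels):
    """Difference of class means (class 1 minus class 0) for one gene."""
    ones = [v for v, c in zip(values, labels) if c == 1]
    zeros = [v for v, c in zip(values, labels) if c == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def enrichment_score(ranked, members):
    """Unweighted running sum; returns the signed maximum deviation."""
    n_hit = sum(1 for g in ranked if g in members)
    n_miss = len(ranked) - n_hit
    running, best = 0.0, 0.0
    for g in ranked:
        running += 1.0 / n_hit if g in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def score(D, labels, members):
    """Rank gene indices by the local measure, then compute the ES of the set."""
    order = sorted(range(len(D)),
                   key=lambda i: mean_difference(D[i], labels),
                   reverse=True)
    return enrichment_score(order, members)


def permutation_p_value(D, labels, members, n_perm, seed=0):
    """Two-tailed empirical p-value from n_perm phenotype permutations."""
    rng = random.Random(seed)
    observed = score(D, labels, members)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute patient phenotypes, keep class sizes
        if abs(score(D, shuffled, members)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm


# Toy data: 4 genes (rows) x 6 patients (columns); genes 0 and 1 form the set.
D = [[5, 5, 5, 1, 1, 1],
     [4, 4, 4, 1, 1, 1],
     [1, 2, 1, 2, 1, 2],
     [1, 1, 1, 5, 5, 5]]
labels = [1, 1, 1, 0, 0, 0]
observed, p = permutation_p_value(D, labels, {0, 1}, n_perm=200)
print(observed, p)
```

Because the smallest nonzero p-value this estimator can report is 1/n_perm, a Bonferroni threshold of 10^-6 cannot be resolved with fewer than one million permutations.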
CUDA Parallelization
Transpose D to ensure coalesced memory accesses
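Why transposition helps can be illustrated with plain index arithmetic. The sketch below assumes one thread per gene looping over patients, with D stored row-major as genes x patients; this work distribution is an assumption for illustration, not something the slides specify. It compares the memory offsets that neighbouring threads touch at the same loop iteration before and after the transpose.

```python
def offsets_one_thread_per_row(n_cols, step, n_threads):
    """Row-major offsets read at loop iteration `step` when thread t walks along row t."""
    return [t * n_cols + step for t in range(n_threads)]


def offsets_after_transpose(n_rows, step, n_threads):
    """Offsets of the same logical elements once the matrix is stored transposed.

    The transposed matrix has n_rows columns, so element (step, t) sits at
    offset step * n_rows + t.
    """
    return [step * n_rows + t for t in range(n_threads)]


def stride(offsets):
    """Distance in elements between the addresses of two neighbouring threads."""
    return offsets[1] - offsets[0]


n_genes, n_patients = 20_639, 200
before = offsets_one_thread_per_row(n_patients, step=3, n_threads=32)
after = offsets_after_transpose(n_genes, step=3, n_threads=32)
print(stride(before), stride(after))  # prints "200 1"
```

With a stride of 1, the 32 threads of a warp read consecutive elements, which the hardware can serve with few memory transactions; a stride of 200 scatters the reads across memory.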
CUDA Implementation Details
- Support for single-precision and double-precision
- Resulting matrix of enrichment scores (#gene sets x #permutations) can be large
– e.g. 5K x 1M x 8B = 40GB
- p-value estimation, family-wise error rate (FWER), and normalized enrichment score (NES) computation can be accomplished on the GPU with (sum/max) reduction kernels without storing this matrix
- For false discovery rate (FDR) computation, this matrix is transferred to the CPU for post-processing
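The memory estimate and the idea of reducing on the fly can be sketched in Python. This is an illustration under assumptions: permuted scores arrive in batches, the tail counts play the role of the sum reduction, and the FWER estimate uses the maximum raw |ES| across gene sets per permutation (a simplification; a normalized score could be substituted). The function name `streaming_tail_statistics` is hypothetical.

```python
# Size of the full score matrix from the example: 5K gene sets x 1M permutations x 8 bytes.
matrix_bytes = 5_000 * 1_000_000 * 8
print(matrix_bytes / 1e9, "GB")  # prints "40.0 GB"


def streaming_tail_statistics(observed, batches):
    """Accumulate per-gene-set statistics without keeping the score matrix.

    observed: observed ES, one entry per gene set.
    batches:  iterable of batches; each batch is a list of permutations and
              each permutation is a list of ES values, one per gene set.
    Returns (p_values, fwer_values), both lists with one entry per gene set.
    """
    n_sets = len(observed)
    tail_counts = [0] * n_sets   # sum reduction: permutations at least as extreme
    fwer_counts = [0] * n_sets   # permutations whose maximum |ES| is as extreme
    n_perm = 0
    for batch in batches:
        for perm_scores in batch:
            n_perm += 1
            max_abs = max(abs(s) for s in perm_scores)  # max reduction over gene sets
            for k in range(n_sets):
                threshold = abs(observed[k])
                if abs(perm_scores[k]) >= threshold:
                    tail_counts[k] += 1
                if max_abs >= threshold:
                    fwer_counts[k] += 1
    p_values = [c / n_perm for c in tail_counts]
    fwer_values = [c / n_perm for c in fwer_counts]
    return p_values, fwer_values


observed = [0.5, 0.9]
batches = [[[0.1, 0.6], [0.2, 0.3]],
           [[-0.7, 0.95], [0.1, 0.1]]]
print(streaming_tail_statistics(observed, batches))
# prints ([0.25, 0.25], [0.5, 0.25])
```

Only two counters per gene set survive each batch, so memory use is independent of the number of permutations.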
cudaGSEA Features
- Reading data sets directly in Broad Institute-compatible file formats
- Supporting several local deviation measures
– Mean-based measures (difference/quotient/log-quotient of means)
– Mean and standard deviation-based measures (signal-to-noise ratio, t-tests, one/two-pass estimation)
– Numerically stable summation schemes for local measures and ES (Kahan etc.)
- Package for the R framework and standalone application
- Multi-threaded CPU version in C++ using OpenMP
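One of the local deviation measures listed above, the signal-to-noise ratio, can be sketched together with Kahan summation. The sketch assumes a two-pass estimate with the sample standard deviation (n - 1 in the denominator) and the form (mean_1 - mean_0) / (std_1 + std_0); the exact conventions used by cudaGSEA are not given in the slides, and the function names are hypothetical.

```python
import math


def kahan_sum(values):
    """Compensated (Kahan) summation: tracks the rounding error of each addition."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # re-inject the error lost so far
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total


def mean_and_std(values):
    """Two-pass estimate: the mean first, then the spread around that mean."""
    n = len(values)
    mean = kahan_sum(values) / n
    variance = kahan_sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)


def signal_to_noise(values, labels):
    """Signal-to-noise ratio of one gene across two phenotype classes."""
    class1 = [v for v, c in zip(values, labels) if c == 1]
    class0 = [v for v, c in zip(values, labels) if c == 0]
    mean1, std1 = mean_and_std(class1)
    mean0, std0 = mean_and_std(class0)
    return (mean1 - mean0) / (std1 + std0)


print(signal_to_noise([4.0, 6.0, 1.0, 3.0], [1, 1, 0, 0]))

# Adding many tiny terms to a large one: each 1e-16 is below the rounding
# threshold of 1.0, yet the compensated sum still accumulates them.
print(kahan_sum([1.0] + [1e-16] * 100_000))
```

The sketch requires at least two samples per class and a nonzero combined standard deviation; a production version would need to guard both cases.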
Performance Evaluation
- GSE19429 dataset
– collapsed to 20,639 gene symbols; 200 patients (183 cases + 17 controls)
- Hallmark: 50 gene sets
– MSigDB 5.1 smallest gene set collection
- GeForce Titan X (single precision) / Tesla K40c (double precision, ECC off), CUDA 7.5
- 10 core Xeon E5-2660v3@2.60GHz, 20 Threads, Ubuntu 14.04, gcc 4.8.4, 64-bit OpenJDK
- BroadGSEA v.2.2.2
Performance Evaluation
- GSE19429 dataset
– collapsed to 20,639 gene symbols; 200 patients (183 cases + 17 controls)
- C2: 4,726 gene sets
– MSigDB 5.1 largest gene set collection
- GeForce Titan X (single precision) / Tesla K40c (double precision, ECC off), CUDA 7.5
- 10 core Xeon E5-2660v3@2.60GHz, 20 Threads, Ubuntu 14.04, gcc 4.8.4, 64-bit OpenJDK
- BroadGSEA v.2.2.2
Conclusion
- High-throughput technologies establish the need for scalable bioinformatics tools that can process large-scale gene expression data sets
- CUDA is a suitable technology to address this need
- cudaGSEA on one GPU achieves a speedup of around two orders of magnitude versus BroadGSEA on a CPU
– analyzing 20,639 genes measured in 200 patients with 4,726 pathways and 1M permutations takes around 1 week with GSEA 2.2.2 on a Xeon E5-2660v3 CPU but less than 1 hour on a GeForce Titan X
- Source code available at:
– https://github.com/gravitino/cudaGSEA
- Group Website:
– https://www.hpc.informatik.uni-mainz.de/
Accelerating Gene Set Enrichment Analysis on CUDA-Enabled GPUs
Bertil Schmidt, Christian Hundt
Institute of Computer Science
Johannes Gutenberg University Mainz
{bertil.schmidt, hundt}@uni-mainz.de