SLIDE 1

HIGH-PERFORMANCE GENOME STUDIES

Lucas Beyer, Diego Fabregat-Traver, and Prof. Paolo Bientinesi

RWTH Aachen University, 19 June 2012, SIAM Conference on Applied Linear Algebra, Valencia, Spain

Thanks to the AICES HPAC group and DFG grant GSC111

SLIDE 2

OUTLINE

  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion


SLIDE 3

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 4

THE BIOLOGY


Roughly, how an engineer sees it

Organisms → Cells → Proteins → DNA → Genes → Nucleotides

SLIDE 5

SINGLE NUCLEOTIDE POLYMORPHISM

A single nucleotide whose allele differs between two individuals of a species.


Link to traits, diseases?

Image source: Wikipedia

SLIDE 6

GWAS

  • Human Genome Project (2004)
  • Genome-wide association studies (GWAS)
  • Find correlations between SNPs and traits (diseases)
  • Case group vs. control group
  • Variance Components & Generalized Linear Mixed Models


SLIDE 7

GWAS STATS

Number of GWAS carried out each year:

2005: 2, 2006: 13, 2007: 453, 2008: 999, 2009: 1257, 2010: 2304, 2011: 2333

SLIDE 8

GWAS STATS

[Chart: GWAS sample size per year, 2005–2011; y-axis 0K–40K]

SLIDE 9

GWAS STATS

[Chart: number of SNPs passing QC per year, 2005–2011; y-axis 0M–4M]

SLIDE 10

GWAS STATS

Largest number of SNPs passing QC per year:

2005: 0.2M, 2006: 2.4M, 2007: 1.5M, 2008: 2.6M, 2009: 2.7M, 2010: 7.5M, 2011: 10.5M

SLIDE 11

HPC BASICS

  • Basic Linear Algebra Subprograms (BLAS)
  • LEGO-like building blocks of LA
  • Vendor-optimized implementations
  • TRSM: solution of multiple triangular systems TX = Y
  • Linear Algebra PACKage (LAPACK)
  • Higher-level LA algorithms
  • POTRF: Cholesky factorization of an SPD matrix LLᵀ = A
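To make the two kernels concrete, here is a minimal NumPy/SciPy sketch (my illustration, not the talk's code): np.linalg.cholesky plays the role of LAPACK's POTRF, and scipy.linalg.solve_triangular plays the role of BLAS's TRSM. A production code would call a vendor-optimized BLAS/LAPACK such as MKL directly.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, k = 6, 3

# A small SPD matrix A and a multi-column right-hand side Y.
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # SPD by construction
Y = rng.standard_normal((n, k))

# POTRF: Cholesky factorization L L^T = A, with L lower triangular.
L = np.linalg.cholesky(A)

# TRSM: solve the triangular systems T X = Y (here T = L) for all columns at once.
X = solve_triangular(L, Y, lower=True)

assert np.allclose(L @ X, Y)
```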

SLIDE 12

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 13

GENOME-WIDE ASSOCIATION STUDIES

Input:

  • y ∈ ℝⁿ: observations (phenotype)
  • Xᵢ ∈ ℝⁿˣᵖ: genome measurements/covariates
  • M ∈ ℝⁿˣⁿ: observation dependencies

Output:

  • rᵢ ∈ ℝᵖ: relations between phenotype and genome variations

Lots of GLS problems, because i = 0..millions:

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y
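For reference, a direct NumPy transcription of this formula for a single i; the function and argument names are illustrative, and this is the naive baseline that the algorithm on the following slides improves on.

```python
import numpy as np

def gls_naive(X_i, M, y):
    """Textbook GLS solve: r_i = (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y.

    Solving with the full n x n matrix M from scratch for every i is
    exactly the redundancy the CPU-only algorithm below removes."""
    Minv_X = np.linalg.solve(M, X_i)   # M^{-1} X_i  (n x p)
    Minv_y = np.linalg.solve(M, y)     # M^{-1} y    (n,)
    return np.linalg.solve(X_i.T @ Minv_X, X_i.T @ Minv_y)
```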
SLIDE 14

THE NUMBERS

Number of DNA fragments (SNPs): m ≈ 48–250 million
Number of samples: n ≈ 10 000
Number of covariates: p = 20

y ∈ ℝⁿ: 80 KB
M ∈ ℝⁿˣⁿ: 800 MB
r ∈ ℝᵖˣᵐ: 7–40 GB
X ∈ ℝⁿˣᵖˣᵐ: 72–373 TB

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y
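The storage figures follow from the dimensions at 8 bytes per double-precision number; a quick sanity check (the exact terabyte totals depend on the precise m and on decimal vs. binary units, so this only roughly reproduces the slide):

```python
n, p, B = 10_000, 20, 8                       # samples, covariates, bytes per double

print(f"y: {n * B / 1e3:.0f} KB")             # 10 000 doubles   ->  80 KB
print(f"M: {n * n * B / 1e6:.0f} MB")         # 10 000^2 doubles -> 800 MB
for m in (48_000_000, 250_000_000):           # SNP counts at both ends of the range
    print(f"m = {m:>11,}:  r = {p * m * B / 1e9:4.1f} GB,"
          f"  X = {n * p * m * B / 1e12:5.1f} TB")
```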

SLIDE 15

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 16

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

SLIDE 17

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

Cholesky once during initialization:

LLᵀ := M

SLIDE 18

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

Cholesky once during initialization: LLᵀ := M

rᵢ ← (XᵢᵀL⁻ᵀL⁻¹Xᵢ)⁻¹XᵢᵀL⁻ᵀL⁻¹y

SLIDE 19

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

rᵢ ← (XᵢᵀL⁻ᵀL⁻¹Xᵢ)⁻¹XᵢᵀL⁻ᵀL⁻¹y

Cholesky once during initialization: LLᵀ := M

rᵢ ← ((L⁻¹Xᵢ)ᵀ(L⁻¹Xᵢ))⁻¹(L⁻¹Xᵢ)ᵀ(L⁻¹y)

SLIDE 20

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

rᵢ ← (XᵢᵀL⁻ᵀL⁻¹Xᵢ)⁻¹XᵢᵀL⁻ᵀL⁻¹y

rᵢ ← ((L⁻¹Xᵢ)ᵀ(L⁻¹Xᵢ))⁻¹(L⁻¹Xᵢ)ᵀ(L⁻¹y)

Cholesky once during initialization: LLᵀ := M
One trsm per iteration step i: X̂ᵢ := L⁻¹Xᵢ

rᵢ ← (X̂ᵢᵀX̂ᵢ)⁻¹X̂ᵢᵀ(L⁻¹y)
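Putting the rewrite together, a compact NumPy/SciPy sketch of the resulting algorithm; the names are mine, and a real implementation would stream the Xᵢ from disk instead of holding them in a list.

```python
import numpy as np
from scipy.linalg import solve_triangular

def gwas_gls(M, y, Xs):
    """One Cholesky of M, then one TRSM per SNP:
    L L^T := M;  Xhat_i := L^{-1} X_i;  r_i = (Xhat_i^T Xhat_i)^{-1} Xhat_i^T (L^{-1} y)."""
    L = np.linalg.cholesky(M)                   # POTRF, once during initialization
    y_hat = solve_triangular(L, y, lower=True)  # L^{-1} y, also computed once
    rs = []
    for X_i in Xs:                              # i = 0..millions in a real study
        X_hat = solve_triangular(L, X_i, lower=True)  # one TRSM per iteration step
        rs.append(np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y_hat))
    return np.array(rs)
```

On random SPD test data this agrees with the per-i textbook formula up to rounding, while factoring M only once.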

SLIDE 21

OPTIMIZATIONS

  • Blocking in i
  • many small trsms vs. one big trsm


SLIDE 22

OPTIMIZATIONS

  • Blocking in i
  • many small trsms vs. one big trsm
  • Out-of-core algorithm
  • read block b+1 while computing block b
  • double-buffering technique necessary

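A minimal sketch of both optimizations combined, under the assumption that the blocks of X arrive from some iterator (e.g. reads from the HDD): a background thread prefetches block b+1 into a second buffer while block b is processed with one big trsm.

```python
import threading
import queue

from scipy.linalg import solve_triangular

def blocked_trsm_stream(L, block_iter):
    """Yield L^{-1} X_b for every block X_b (an n x (p * blocksize) panel):
    one big TRSM per block instead of many small ones, with the next block
    prefetched (double-buffered) while the current one is computed on."""
    buf = queue.Queue(maxsize=1)      # a single prefetch slot: the double buffer

    def reader():                     # stands in for the HDD-reading thread
        for X_b in block_iter:
            buf.put(X_b)              # blocks while the slot is still full
        buf.put(None)                 # end-of-stream marker

    threading.Thread(target=reader, daemon=True).start()
    while (X_b := buf.get()) is not None:
        yield solve_triangular(L, X_b, lower=True)
```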

SLIDE 23

PERFORMANCE

From years/months down to weeks/days

[Chart: runtime vs. m (SNP count, 1M–100M), runtime axis from 100 s (minutes) to 10 000 000 s (years), comparing EMMAX, GWFGLS, FLMM, and CLAK-Chol]

SLIDE 24

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 25

CAN GPUS HELP GO FURTHER?

  • trsm takes 90–95% of the time
  • compute it on the GPU
  • while the GPU computes:
  • CPU computations
  • CPU ⇄ GPU transfers
  • our cluster: NVIDIA Fermi GPUs

SLIDE 26

MEMORY PYRAMID

Need for streaming computation.
Need for two levels of double-buffering.

HDD: terabytes
RAM: 10–100 GB
GPU: 1–10 GB

SLIDE 27

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(1) Retrieve previous results from GPU, start reading second-next block from HDD

[Diagram: blocks b−3 … b+3 streaming HDD → CPU/RAM → GPU; RAM buffers A, B, C and GPU buffers α, β; trsm running on block b]

SLIDE 28

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(2) Send next block to GPU, start CPU computation

[Diagram: same buffer layout; block b+1 is sent to the GPU while the CPU computes on block b−1]

SLIDE 29

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(3) Write results to disk (fast because small)

[Diagram: same buffer layout; the results of block b−1 are written from RAM to HDD]

SLIDE 30

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(4) Rotate buffers, iterate, smile

[Diagram: same buffer layout with the buffer roles rotated for the next iteration]

SLIDE 31

2-LEVEL TRIPLE-DOUBLE-BUFFERING

One full iteration

[Diagram: all four steps overlapped: HDD read of block b+2, GPU trsm on block b, CPU computation on block b−1, write-back of results b−1]
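The real implementation does this with pinned host memory, CUDA streams, and asynchronous copies; purely as a sketch of the buffer choreography, here is a thread-and-queue version in which read_block, gpu_trsm, cpu_compute, and write_results are hypothetical stand-ins for the HDD, GPU, and CPU stages shown above.

```python
import threading
import queue

def two_level_pipeline(num_blocks, read_block, gpu_trsm, cpu_compute, write_results):
    """Overlap the four stages of one iteration: HDD read of block b+2,
    GPU trsm on block b, CPU computation on block b-1, write-back of b-1."""
    ram = queue.Queue(maxsize=2)   # RAM-side buffers feeding the GPU (the A/B/C roles)
    gpu = queue.Queue(maxsize=2)   # GPU-side double buffer (the alpha/beta roles)

    def hdd_stage():               # keeps the reads ahead of the computation
        for b in range(num_blocks):
            ram.put((b, read_block(b)))
        ram.put(None)

    def gpu_stage():               # so the GPU never stops computing
        while (item := ram.get()) is not None:
            b, X_b = item
            gpu.put((b, gpu_trsm(X_b)))
        gpu.put(None)

    threading.Thread(target=hdd_stage, daemon=True).start()
    threading.Thread(target=gpu_stage, daemon=True).start()
    while (item := gpu.get()) is not None:   # CPU stage, then the small write-back
        b, X_hat = item
        write_results(b, cpu_compute(b, X_hat))
```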

SLIDE 32

TIMELINE

Parallelism on the vertical axis; heavy use of asynchronous dispatching.

[Timeline diagram: rows GPU / CPU / HDD over time t, showing Read b+3, Send b+2, GPU trsm b+1, GPU trsm b, Read b+2, Recv b−1, Send b+1, CPU comp b−1, Write b−1, Recv b, CPU b; legend: CPU ⇄ GPU transfer, HDD ⇄ CPU transfer, GPU computation, CPU computation, data dependencies, asynchronous dispatch]

SLIDE 33

TIMELINE, TO SCALE

Problem sizes: n = 10k, m = 100k, block size = 10k

GPU: 2x NVIDIA Quadro 6000 (Fermi, 515 GFlops each, 6 GB memory), ca. $10,000
CPU: 2x Intel Xeon X5650 (6 cores, 128 GFlops, 24 GB memory), ca. $2,000
BLAS: Intel MKL 10.2; compiler: icc 12.1

[Timeline diagram, to scale: rows GPU / CPU / HDD over time t; legend: CPU ⇄ GPU transfer, HDD ⇄ CPU transfer, GPU computation, CPU computation]

SLIDE 34

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 35

PERFORMANCE

5.2x speedup using 2 GPUs; sustained in-core performance even when out-of-core.

[Chart: time (s) vs. m (nucleotide count, 1k–90k) for the original CPU-only algorithm (11.6 s … 96.7 s) and the hybrid CPU+2GPU algorithm (4.3 s … 18.3 s); a marker separates the in-core regime (left) from the out-of-core regime (right)]
SLIDE 36

PERFORMANCE

From years/months down to hours

[Chart: runtime vs. m (SNP count, 1M–100M), runtime axis from 100 s to 10 000 000 s, comparing EMMAX, GWFGLS, FLMM, CLAK-Chol, and CLAK-Chol GPU (extrapolated)]

SLIDE 37

SCALABILITY

#GPUs ×2 ⇒ time ×0.54

Runtime vs. number of GPUs:

1 GPU: 40.7 s, 2 GPUs: 21.6 s, 3 GPUs: 16.2 s, 4 GPUs: 11.7 s

Almost perfect scalability.

SLIDE 38

CONCLUSION

  • Don't replace the CPU with the GPU
  • combine them
  • Hide data-transfer latency by overlapping it with computation
  • double/triple-buffering
  • the GPU never stops computing
  • Are GPUs an order of magnitude faster?
  • V. Volkov (http://www.cs.berkeley.edu/~volkov/)
  • Victor W. Lee et al., «Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU», 2010

SLIDE 39

QUESTIONS?

beyer@aices.rwth-aachen.de

[Chart repeated from SLIDE 36: runtime vs. m (SNP count) for EMMAX, GWFGLS, FLMM, CLAK-Chol, and CLAK-Chol GPU (extrapolated)]

SLIDE 40

[Backup slide: diagram of viewer's eyes, projection surface, and 3D scene]

SLIDE 41

[Backup slide: same diagram of viewer's eyes, projection surface, and 3D scene]

SLIDE 42

VISUALIZATION


SLIDE 43

FUTURE WORK

  • Solution for L too big for GPU memory
  • Apply similar technique to similar problems
  • Extension to multiple phenotypes (y)
