SLIDE 1

HIGH-PERFORMANCE GENOME STUDIES

Lucas Beyer, Diego Fabregat-Traver, and Prof. Paolo Bientinesi

RWTH Aachen University, 19 June 2012, SIAM Conference on Applied Linear Algebra, Valencia, Spain

Thanks to the AICES HPAC group and DFG grant GSC111

SLIDE 2

OUTLINE

  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion


SLIDE 3

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 4

THE BIOLOGY


Roughly, how an engineer sees it

Organisms → Cells → Proteins → DNA → Genes → Nucleotides

SLIDE 5

SINGLE NUCLEOTIDE POLYMORPHISM

A single nucleotide whose allele differs between two individuals of a species.


Link to traits, diseases?

Image source: Wikipedia

SLIDE 6

GWAS

  • Human Genome Project (2004)
  • Genome-wide association studies (GWAS)
  • Find correlations between SNPs and traits (diseases)
  • Case group vs. control group
  • Variance Components & Generalized Linear Mixed Models


SLIDE 7

GWAS STATS

Number of GWAS carried out each year:

2005: 2, 2006: 13, 2007: 453, 2008: 999, 2009: 1257, 2010: 2304, 2011: 2333

SLIDE 8

GWAS STATS

[Chart: GWAS sample size per year, 2005–2011; y-axis 0K–40K]

SLIDE 9

GWAS STATS

[Chart: number of SNPs passing QC per year, 2005–2011; y-axis 0M–4M]

SLIDE 10

GWAS STATS

Largest number of SNPs passing QC per year:

2005: 0.2M, 2006: 2.4M, 2007: 1.5M, 2008: 2.6M, 2009: 2.7M, 2010: 7.5M, 2011: 10.5M

SLIDE 11

HPC BASICS

  • Basic Linear Algebra Subprograms (BLAS)
  • LEGO-like building blocks of LA
  • Vendor-optimized implementations
  • TRSM: solution of multiple triangular systems TX = Y
  • Linear Algebra PACKage (LAPACK)
  • Higher-level LA algorithms
  • POTRF: Cholesky factorization of an SPD matrix LLᵀ = A
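To make the two kernels concrete, here is a minimal NumPy/SciPy sketch (my illustration, not the talk's code): np.linalg.cholesky plays the role of LAPACK's POTRF, and scipy.linalg.solve_triangular plays the role of BLAS's TRSM. A production code would call a vendor-optimized BLAS/LAPACK such as MKL directly.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, k = 6, 3

# A small SPD matrix A and a multi-column right-hand side Y.
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # SPD by construction
Y = rng.standard_normal((n, k))

# POTRF: Cholesky factorization L L^T = A, with L lower triangular.
L = np.linalg.cholesky(A)

# TRSM: solve the triangular systems T X = Y (here T = L) for all columns at once.
X = solve_triangular(L, Y, lower=True)

assert np.allclose(L @ X, Y)
```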

SLIDE 12

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 13

GENOME-WIDE ASSOCIATION STUDIES

Input:

  • y ∈ ℝⁿ: observations (phenotype)
  • Xᵢ ∈ ℝⁿˣᵖ: genome measurements/covariates
  • M ∈ ℝⁿˣⁿ: observation dependencies

Output:

  • rᵢ ∈ ℝᵖ: relations between phenotype and genome variations

Lots of GLS problems, because i = 0..millions:

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y
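For reference, a direct NumPy transcription of this formula for a single i; the function and argument names are illustrative, and this is the naive baseline that the algorithm on the following slides improves on.

```python
import numpy as np

def gls_naive(X_i, M, y):
    """Textbook GLS solve: r_i = (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y.

    Solving with the full n x n matrix M from scratch for every i is
    exactly the redundancy the CPU-only algorithm below removes."""
    Minv_X = np.linalg.solve(M, X_i)   # M^{-1} X_i  (n x p)
    Minv_y = np.linalg.solve(M, y)     # M^{-1} y    (n,)
    return np.linalg.solve(X_i.T @ Minv_X, X_i.T @ Minv_y)
```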
SLIDE 14

THE NUMBERS

Number of DNA fragments (SNPs): m ≈ 48–250 million
Number of samples: n ≈ 10 000
Number of covariates: p = 20

y ∈ ℝⁿ: 80 KB
M ∈ ℝⁿˣⁿ: 800 MB
r ∈ ℝᵖˣᵐ: 7–40 GB
X ∈ ℝⁿˣᵖˣᵐ: 72–373 TB

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y
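The storage figures follow from the dimensions at 8 bytes per double-precision number; a quick sanity check (the exact terabyte totals depend on the precise m and on decimal vs. binary units, so this only roughly reproduces the slide):

```python
n, p, B = 10_000, 20, 8                       # samples, covariates, bytes per double

print(f"y: {n * B / 1e3:.0f} KB")             # 10 000 doubles   ->  80 KB
print(f"M: {n * n * B / 1e6:.0f} MB")         # 10 000^2 doubles -> 800 MB
for m in (48_000_000, 250_000_000):           # SNP counts at both ends of the range
    print(f"m = {m:>11,}:  r = {p * m * B / 1e9:4.1f} GB,"
          f"  X = {n * p * m * B / 1e12:5.1f} TB")
```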

SLIDE 15

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 16

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

SLIDE 17

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

Cholesky once during initialization:

LLᵀ := M

SLIDE 18

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

Cholesky once during initialization: LLᵀ := M

rᵢ ← (XᵢᵀL⁻ᵀL⁻¹Xᵢ)⁻¹XᵢᵀL⁻ᵀL⁻¹y

SLIDE 19

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

rᵢ ← (XᵢᵀL⁻ᵀL⁻¹Xᵢ)⁻¹XᵢᵀL⁻ᵀL⁻¹y

Cholesky once during initialization: LLᵀ := M

rᵢ ← ((L⁻¹Xᵢ)ᵀ(L⁻¹Xᵢ))⁻¹(L⁻¹Xᵢ)ᵀ(L⁻¹y)

SLIDE 20

BASIC ALGORITHM

rᵢ ← (XᵢᵀM⁻¹Xᵢ)⁻¹XᵢᵀM⁻¹y

rᵢ ← (XᵢᵀL⁻ᵀL⁻¹Xᵢ)⁻¹XᵢᵀL⁻ᵀL⁻¹y

rᵢ ← ((L⁻¹Xᵢ)ᵀ(L⁻¹Xᵢ))⁻¹(L⁻¹Xᵢ)ᵀ(L⁻¹y)

Cholesky once during initialization: LLᵀ := M
One trsm per iteration step i: X̂ᵢ := L⁻¹Xᵢ

rᵢ ← (X̂ᵢᵀX̂ᵢ)⁻¹X̂ᵢᵀ(L⁻¹y)
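Putting the rewrite together, a compact NumPy/SciPy sketch of the resulting algorithm; the names are mine, and a real implementation would stream the Xᵢ from disk instead of holding them in a list.

```python
import numpy as np
from scipy.linalg import solve_triangular

def gwas_gls(M, y, Xs):
    """One Cholesky of M, then one TRSM per SNP:
    L L^T := M;  Xhat_i := L^{-1} X_i;  r_i = (Xhat_i^T Xhat_i)^{-1} Xhat_i^T (L^{-1} y)."""
    L = np.linalg.cholesky(M)                   # POTRF, once during initialization
    y_hat = solve_triangular(L, y, lower=True)  # L^{-1} y, also computed once
    rs = []
    for X_i in Xs:                              # i = 0..millions in a real study
        X_hat = solve_triangular(L, X_i, lower=True)  # one TRSM per iteration step
        rs.append(np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y_hat))
    return np.array(rs)
```

On random SPD test data this agrees with the per-i textbook formula up to rounding, while factoring M only once.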

SLIDE 21

OPTIMIZATIONS

  • Blocking in i
  • many small trsms vs. one big trsm


SLIDE 22

OPTIMIZATIONS

  • Blocking in i
  • many small trsms vs. one big trsm
  • Out-of-core algorithm
  • read block b+1 while computing block b
  • double-buffering technique necessary

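A minimal sketch of both optimizations combined, under the assumption that the blocks of X arrive from some iterator (e.g. reads from the HDD): a background thread prefetches block b+1 into a second buffer while block b is processed with one big trsm.

```python
import threading
import queue

from scipy.linalg import solve_triangular

def blocked_trsm_stream(L, block_iter):
    """Yield L^{-1} X_b for every block X_b (an n x (p * blocksize) panel):
    one big TRSM per block instead of many small ones, with the next block
    prefetched (double-buffered) while the current one is computed on."""
    buf = queue.Queue(maxsize=1)      # a single prefetch slot: the double buffer

    def reader():                     # stands in for the HDD-reading thread
        for X_b in block_iter:
            buf.put(X_b)              # blocks while the slot is still full
        buf.put(None)                 # end-of-stream marker

    threading.Thread(target=reader, daemon=True).start()
    while (X_b := buf.get()) is not None:
        yield solve_triangular(L, X_b, lower=True)
```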

SLIDE 23

PERFORMANCE

From years/months down to weeks/days

[Chart: runtime vs. m (SNP count, 1M–100M), runtime axis from 100 s (minutes) to 10 000 000 s (years), comparing EMMAX, GWFGLS, FLMM, and CLAK-Chol]

SLIDE 24

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 25

CAN GPUS HELP GO FURTHER?

  • trsm takes 90–95% of the time
  • compute it on the GPU
  • while the GPU computes:
  • CPU computations
  • CPU ⇄ GPU transfers
  • our cluster: NVIDIA Fermi GPUs

SLIDE 26

MEMORY PYRAMID

Need for streaming computation.
Need for two levels of double-buffering.

HDD: terabytes
RAM: 10–100 GB
GPU: 1–10 GB

SLIDE 27

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(1) Retrieve previous results from GPU, start reading second-next block from HDD

[Diagram: blocks b−3 … b+3 streaming HDD → CPU/RAM → GPU; RAM buffers A, B, C and GPU buffers α, β; trsm running on block b]

SLIDE 28

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(2) Send next block to GPU, start CPU computation

[Diagram: same buffer layout; block b+1 is sent to the GPU while the CPU computes on block b−1]

SLIDE 29

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(3) Write results to disk (fast because small)

[Diagram: same buffer layout; the results of block b−1 are written from RAM to HDD]

SLIDE 30

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(4) Rotate buffers, iterate, smile

[Diagram: same buffer layout with the buffer roles rotated for the next iteration]

SLIDE 31

2-LEVEL TRIPLE-DOUBLE-BUFFERING

One full iteration

[Diagram: all four steps overlapped: HDD read of block b+2, GPU trsm on block b, CPU computation on block b−1, write-back of results b−1]
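The real implementation does this with pinned host memory, CUDA streams, and asynchronous copies; purely as a sketch of the buffer choreography, here is a thread-and-queue version in which read_block, gpu_trsm, cpu_compute, and write_results are hypothetical stand-ins for the HDD, GPU, and CPU stages shown above.

```python
import threading
import queue

def two_level_pipeline(num_blocks, read_block, gpu_trsm, cpu_compute, write_results):
    """Overlap the four stages of one iteration: HDD read of block b+2,
    GPU trsm on block b, CPU computation on block b-1, write-back of b-1."""
    ram = queue.Queue(maxsize=2)   # RAM-side buffers feeding the GPU (the A/B/C roles)
    gpu = queue.Queue(maxsize=2)   # GPU-side double buffer (the alpha/beta roles)

    def hdd_stage():               # keeps the reads ahead of the computation
        for b in range(num_blocks):
            ram.put((b, read_block(b)))
        ram.put(None)

    def gpu_stage():               # so the GPU never stops computing
        while (item := ram.get()) is not None:
            b, X_b = item
            gpu.put((b, gpu_trsm(X_b)))
        gpu.put(None)

    threading.Thread(target=hdd_stage, daemon=True).start()
    threading.Thread(target=gpu_stage, daemon=True).start()
    while (item := gpu.get()) is not None:   # CPU stage, then the small write-back
        b, X_hat = item
        write_results(b, cpu_compute(b, X_hat))
```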

SLIDE 32

TIMELINE

Parallelism on the vertical axis; heavy use of asynchronous dispatching.

[Timeline diagram: rows GPU / CPU / HDD over time t, showing Read b+3, Send b+2, GPU trsm b+1, GPU trsm b, Read b+2, Recv b−1, Send b+1, CPU comp b−1, Write b−1, Recv b, CPU b; legend: CPU ⇄ GPU transfer, HDD ⇄ CPU transfer, GPU computation, CPU computation, data dependencies, asynchronous dispatch]

SLIDE 33

TIMELINE, TO SCALE

Problem sizes: n = 10k, m = 100k, block size = 10k

GPU: 2x NVIDIA Quadro 6000 (Fermi, 515 GFlops each, 6 GB memory), ca. $10,000
CPU: 2x Intel Xeon X5650 (6 cores, 128 GFlops, 24 GB memory), ca. $2,000
BLAS: Intel MKL 10.2; compiler: icc 12.1

[Timeline diagram, to scale: rows GPU / CPU / HDD over time t; legend: CPU ⇄ GPU transfer, HDD ⇄ CPU transfer, GPU computation, CPU computation]

SLIDE 34

OUTLINE


  • Introduction and motivation
  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion
SLIDE 35

PERFORMANCE

5.2x speedup using 2 GPUs; sustained in-core performance even when out-of-core.

[Chart: time (s) vs. m (nucleotide count, 1k–90k) for the original CPU-only algorithm (11.6 s … 96.7 s) and the hybrid CPU+2GPU algorithm (4.3 s … 18.3 s); a marker separates the in-core regime (left) from the out-of-core regime (right)]
SLIDE 36

PERFORMANCE

From years/months down to hours

[Chart: runtime vs. m (SNP count, 1M–100M), runtime axis from 100 s to 10 000 000 s, comparing EMMAX, GWFGLS, FLMM, CLAK-Chol, and CLAK-Chol GPU (extrapolated)]

SLIDE 37

SCALABILITY

#GPUs ×2 ⇒ time ×0.54

Runtime vs. number of GPUs:

1 GPU: 40.7 s, 2 GPUs: 21.6 s, 3 GPUs: 16.2 s, 4 GPUs: 11.7 s

Almost perfect scalability.

SLIDE 38

CONCLUSION

  • Don't replace the CPU with the GPU
  • combine them
  • Hide data-transfer latency by overlapping it with computation
  • double/triple-buffering
  • the GPU never stops computing
  • Are GPUs an order of magnitude faster?
  • V. Volkov (http://www.cs.berkeley.edu/~volkov/)
  • Victor W. Lee et al., «Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU», 2010

SLIDE 39

QUESTIONS?

beyer@aices.rwth-aachen.de

[Chart repeated from SLIDE 36: runtime vs. m (SNP count) for EMMAX, GWFGLS, FLMM, CLAK-Chol, and CLAK-Chol GPU (extrapolated)]

SLIDE 40

[Backup slide: diagram of viewer's eyes, projection surface, and 3D scene]

SLIDE 41

[Backup slide: same diagram of viewer's eyes, projection surface, and 3D scene]

SLIDE 42

VISUALIZATION


SLIDE 43

FUTURE WORK

  • Solution for L too big for GPU memory
  • Apply similar technique to similar problems
  • Extension to multiple phenotypes (y)
