HIGH-PERFORMANCE GENOME STUDIES
Lucas Beyer, Diego Fabregat-Traver and Prof. Paolo Bientinesi
RWTH Aachen University
19 June 2012, SIAM Conference on Applied Linear Algebra, Valencia, Spain
Thanks to the AICES HPAC group and DFG grant GSC111
Roughly, how an engineer sees it
Image source: Wikipedia
[Chart over the years 2005 to 2011, vertical axis up to 3000. Data labels as extracted: 2333, 2304, 1257, 999, 453, 13, 2]
[Two charts over the years 2005 to 2011, with vertical axes from 0K to 40K and from 0M to 4M]
[Chart over the years 2005 to 2011, vertical axis from 0M to 12M. Data labels as extracted: 0.2M, 2.4M, 1.5M, 2.6M, 2.7M, 7.5M, 10.5M]
$y \in \mathbb{R}^{n}$
$X_i \in \mathbb{R}^{n \times p}$: genome measurements/covariates
$M \in \mathbb{R}^{n \times n}$
$r_i \in \mathbb{R}^{p}$: relations between phenotype and genome variations
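Number of DNA fragments (nucleotides): m ~ 48 000 000 to 250 000 000
Number of samples: n ~ 10 000
Number of covariates: p = 20

$y \in \mathbb{R}^{n}$: 80 KB
$M \in \mathbb{R}^{n \times n}$: 800 MB
$r \in \mathbb{R}^{p \times m}$: 7 to 40 GB
$X \in \mathbb{R}^{n \times p \times m}$: 72 to 373 TB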
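The storage figures can be checked with a few lines of arithmetic. The sketch below assumes 8-byte double-precision values (an assumption, though it matches the 800 MB quoted for M) and decimal units; the function name `sizes_in_bytes` is illustrative only. With p = 20 it reproduces the orders of magnitude on the slide; the exact terabyte figures depend on the precise m and p used and on rounding.

```python
BYTES_PER_VALUE = 8  # assumption: double precision


def sizes_in_bytes(n, p, m):
    """Return the storage needed for y, M, r and X, in bytes."""
    return {
        "y": n * BYTES_PER_VALUE,          # one phenotype vector of length n
        "M": n * n * BYTES_PER_VALUE,      # n x n matrix
        "r": p * m * BYTES_PER_VALUE,      # one length-p result per fragment
        "X": n * p * m * BYTES_PER_VALUE,  # one n x p matrix per fragment
    }


if __name__ == "__main__":
    for m in (48_000_000, 250_000_000):
        sizes = sizes_in_bytes(n=10_000, p=20, m=m)
        print(f"m = {m:,}")
        for name, size in sizes.items():
            print(f"  {name}: {size / 1e9:,.3f} GB")
```

The point of the exercise is that X is four to five orders of magnitude larger than everything else, so it cannot be held in main memory and has to be streamed from disk.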
$$r_i \leftarrow (X_i^T M^{-1} X_i)^{-1} X_i^T M^{-1} y$$

$$L L^T := M$$

$$r_i \leftarrow (X_i^T L^{-T} L^{-1} X_i)^{-1} X_i^T L^{-T} L^{-1} y$$

$$r_i \leftarrow ((L^{-1} X_i)^T L^{-1} X_i)^{-1} (L^{-1} X_i)^T L^{-1} y$$

$$\hat{X}_i := L^{-1} X_i$$

$$r_i \leftarrow (\hat{X}_i^T \hat{X}_i)^{-1} \hat{X}_i^T L^{-1} y$$
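The Cholesky-based reformulation can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gls_cholesky` is hypothetical, M is assumed symmetric positive definite, and `np.linalg.solve` is used as a stand-in for a dedicated triangular solver (a trsm call in BLAS terms), which is what one would use in practice to exploit the structure of L.

```python
import numpy as np


def gls_cholesky(M, Xs, y):
    """Compute r_i = (X_i^T M^-1 X_i)^-1 X_i^T M^-1 y for every X_i in Xs.

    M is factored once; each X_i then only needs a solve with L.
    """
    L = np.linalg.cholesky(M)            # L L^T = M, L lower triangular
    y_hat = np.linalg.solve(L, y)        # L^-1 y, shared by all i
    results = []
    for X in Xs:
        X_hat = np.linalg.solve(L, X)    # X_hat_i = L^-1 X_i
        S = X_hat.T @ X_hat              # small p x p matrix
        results.append(np.linalg.solve(S, X_hat.T @ y_hat))
    return np.array(results)             # shape (m, p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, m = 50, 3, 4                   # toy sizes, far below the real ones
    A = rng.standard_normal((n, n))
    M = A @ A.T + n * np.eye(n)          # symmetric positive definite
    Xs = rng.standard_normal((m, n, p))
    y = rng.standard_normal(n)
    print(gls_cholesky(M, Xs, y).shape)
```

Factoring M and forming L^-1 y once moves the expensive n x n work out of the loop over i; what remains per fragment is one solve with L and a small p x p system.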
[Plot: runtime versus m (SNP count, 1M to 100M); time axis from 100 s to 10 000 000 s (minutes to years). Legend entries: EMMAX, GWFGLS, FLMM, CLAK-Chol]
[Diagram, repeated over several animation steps. Labels: Data X, Results r; blocks b-3, b-2, b-1, b, b+1; buffers A, B, C, α, β; operations trsm, Computation]
Legend: CPU ⇄ GPU transfer; HDD ⇄ CPU transfer; GPU computation; CPU computation; data dependencies; asynchronous dispatch
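One way to read the diagram is as a pipeline in which the next block of X is loaded while the current block is being computed. The sketch below illustrates that general idea only, using a helper thread and a bounded queue from the Python standard library. It is not the scheme used in the talk: the names `stream_blocks`, `load_block`, `compute_block` and `prefetch` are hypothetical, and the real system overlaps disk, CPU and GPU transfers rather than Python threads.

```python
import queue
import threading


def stream_blocks(load_block, compute_block, num_blocks, prefetch=2):
    """Apply compute_block to every block while a helper thread loads ahead.

    prefetch is the number of loaded blocks allowed to wait for computation.
    """
    ready = queue.Queue(maxsize=prefetch)
    done = object()  # sentinel marking the end of the stream

    def loader():
        for b in range(num_blocks):
            ready.put(load_block(b))  # waits while the queue is full
        ready.put(done)

    thread = threading.Thread(target=loader, daemon=True)
    thread.start()

    results = []
    while True:
        block = ready.get()           # waits until the next block is loaded
        if block is done:
            break
        results.append(compute_block(block))
    thread.join()
    return results


if __name__ == "__main__":
    # Stand-ins: "loading" builds a small list, "computing" sums it.
    print(stream_blocks(lambda b: [b] * 4, sum, num_blocks=5))
    # prints [0, 4, 8, 12, 16]
```

The bounded queue plays the role of the fixed set of buffers: the loader can run ahead of the computation only until every buffer is occupied, so memory use stays constant no matter how many blocks are streamed.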
GPU: 2x NVIDIA Quadro 6000 (Fermi, 515 GFlops each, 6 GB memory) = $10,000
CPU: 2x Intel Xeon X5650 (6 cores, 128 GFlops, 24 GB memory) = $2,000
BLAS: Intel MKL 10.2
Compiler: icc 12.1
[Plot: time [s] versus m (nucleotide count, 1k to 90k) for the hybrid CPU+2GPU algorithm and the original CPU-only algorithm, with an "in-core" annotation. Data labels as extracted: 11.6 s, 24.9 s, 32.9 s, 43.1 s, 52.4 s, 65.6 s, 74.6 s, 84.8 s, 96.7 s and 4.3 s, 6.3 s, 8.3 s, 10.3 s, 12.3 s, 14.3 s, 16.3 s, 18.3 s]
[Plot: runtime versus m (SNP count, 1M to 100M); time axis from 100 s to 10 000 000 s (minutes to years). Legend entries: EMMAX, GWFGLS, FLMM, CLAK-Chol, CLAK-Chol GPU, Extrapolated]
[Plot: runtime versus number of GPUs (1 to 4), compared with perfect scalability. Runtime data labels: 40.7 s, 21.6 s, 16.2 s, 11.7 s]
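The scalability figures can be turned into speedup and parallel efficiency. The sketch below assumes that the four runtime labels (40.7 s, 21.6 s, 16.2 s, 11.7 s) correspond to 1, 2, 3 and 4 GPUs in that order; this mapping is inferred from the plot residue, not stated in the source, and the function name `scaling` is illustrative.

```python
def scaling(runtimes):
    """Map each GPU count to (speedup, efficiency) relative to one GPU."""
    base = runtimes[1]
    return {
        gpus: (base / t, base / (t * gpus))  # efficiency 1.0 = perfect scaling
        for gpus, t in runtimes.items()
    }


if __name__ == "__main__":
    # Assumed mapping of the plot's data labels to GPU counts.
    measured = {1: 40.7, 2: 21.6, 3: 16.2, 4: 11.7}
    for gpus, (speedup, efficiency) in scaling(measured).items():
        print(f"{gpus} GPU(s): speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

Under that assumption, four GPUs give a speedup of roughly 3.5 over one GPU, which is an efficiency of about 87%.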
Almost perfect
Throughput Computing on CPU and GPU», 2010)
[Diagram: viewer's eyes, projection surface, 3D scene]