SLIDE 1

HIGH-PERFORMANCE GENOME STUDIES

Lucas Beyer Diego Fabregat-Traver and Prof. Paolo Bientinesi

RWTH Aachen University 19 June 2012, SIAM Conference on Applied Linear Algebra, Valencia, Spain

Thanks to the AICES HPAC group and DFG grant GSC111

SLIDE 2

OUTLINE

  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion


SLIDE 3

OUTLINE

  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion


SLIDE 4

GENOME-WIDE ASSOCIATION STUDIES

Input:

  • y ∈ ℝⁿ — observations (phenotype)
  • Xᵢ ∈ ℝⁿˣᵖ — genome measurements/covariates
  • M ∈ ℝⁿˣⁿ — observation dependencies

Output:

  • rᵢ ∈ ℝᵖ — relations between phenotype and genome variations

rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y

Lots of GLS problems, because i = 0..millions.
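For concreteness, a single GLS instance can be sketched in a few lines of NumPy. This is a toy illustration with synthetic data and made-up sizes, not the study pipeline itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4                        # toy sizes; the talk uses n ~ 10 000, p = 20

A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)          # observation dependencies: symmetric positive definite
X = rng.standard_normal((n, p))      # genome measurements/covariates for one i
y = rng.standard_normal(n)           # observations (phenotype)

# r <- (X^T M^-1 X)^-1 X^T M^-1 y, without ever forming M^-1 explicitly
Minv_X = np.linalg.solve(M, X)
Minv_y = np.linalg.solve(M, y)
r = np.linalg.solve(X.T @ Minv_X, X.T @ Minv_y)
assert r.shape == (p,)
```

The study repeats this for millions of Xᵢ, which is why all work involving only M and y must be hoisted out of the loop, as the following slides show.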
SLIDE 5

THE NUMBERS

  • # DNA fragments (nucleotides): m ~ 48–250 000 000
  • # samples: n ~ 10 000
  • # covariates: p = 20
  • y ∈ ℝⁿ: 80 MB
  • M ∈ ℝⁿˣⁿ: 800 MB
  • r ∈ ℝᵖˣᵐ: 7–40 GB
  • X ∈ ℝⁿˣᵖˣᵐ: 72 TB–373 PB

rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y

SLIDE 6

OUTLINE

  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion


SLIDE 7

BASIC ALGORITHM

rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y

SLIDE 8

BASIC ALGORITHM

rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y

Cholesky once during initialization: L Lᵀ := M

SLIDE 9

BASIC ALGORITHM

rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y

rᵢ ← (Xᵢᵀ L⁻ᵀ L⁻¹ Xᵢ)⁻¹ Xᵢᵀ L⁻ᵀ L⁻¹ y

Cholesky once during initialization: L Lᵀ := M

SLIDE 10

BASIC ALGORITHM

rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y

rᵢ ← (Xᵢᵀ L⁻ᵀ L⁻¹ Xᵢ)⁻¹ Xᵢᵀ L⁻ᵀ L⁻¹ y

rᵢ ← ((L⁻¹Xᵢ)ᵀ L⁻¹Xᵢ)⁻¹ (L⁻¹Xᵢ)ᵀ L⁻¹y

Cholesky once during initialization: L Lᵀ := M

SLIDE 11

BASIC ALGORITHM

rᵢ ← (Xᵢᵀ M⁻¹ Xᵢ)⁻¹ Xᵢᵀ M⁻¹ y

rᵢ ← (Xᵢᵀ L⁻ᵀ L⁻¹ Xᵢ)⁻¹ Xᵢᵀ L⁻ᵀ L⁻¹ y

rᵢ ← ((L⁻¹Xᵢ)ᵀ L⁻¹Xᵢ)⁻¹ (L⁻¹Xᵢ)ᵀ L⁻¹y

Cholesky once during initialization: L Lᵀ := M

One trsm per iteration step i: X̂ᵢ := L⁻¹Xᵢ

rᵢ ← (X̂ᵢᵀ X̂ᵢ)⁻¹ X̂ᵢᵀ L⁻¹y
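The whole progression can be checked numerically. Below is a toy NumPy/SciPy sketch with synthetic data, where `solve_triangular` plays the role of trsm; it is not the authors' implementation:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
n, p, m = 80, 3, 5                               # toy sizes

A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)                      # SPD dependency matrix
y = rng.standard_normal(n)
Xs = [rng.standard_normal((n, p)) for _ in range(m)]

# Initialization: one Cholesky factorization L L^T := M, and L^-1 y once
L = np.linalg.cholesky(M)
Ly = solve_triangular(L, y, lower=True)

rs = []
for Xi in Xs:
    Xhat = solve_triangular(L, Xi, lower=True)   # the per-i trsm: Xhat := L^-1 Xi
    rs.append(np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ Ly))

# Agrees with the direct formula r_i = (X_i^T M^-1 X_i)^-1 X_i^T M^-1 y
Minv = np.linalg.inv(M)
for Xi, ri in zip(Xs, rs):
    assert np.allclose(ri, np.linalg.solve(Xi.T @ Minv @ Xi, Xi.T @ Minv @ y))
```

The expensive O(n³) Cholesky happens once; each of the millions of iterations is reduced to one trsm plus small p-sized solves.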

SLIDE 12

OPTIMIZATIONS

  • Blocking in i
  • many small trsms vs. one big trsm

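The blocking idea (stack many Xᵢ side by side so that one wide trsm replaces many skinny ones) can be sketched as follows; a toy NumPy/SciPy illustration with made-up sizes, not the authors' code:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
n, p, b = 60, 3, 8                    # b matrices X_i per block

# A well-conditioned lower-triangular factor, standing in for the Cholesky L
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
Xs = [rng.standard_normal((n, p)) for _ in range(b)]

# Many small trsms: one skinny n x p solve per X_i
small = [solve_triangular(L, Xi, lower=True) for Xi in Xs]

# One big trsm: glue the X_i into a single n x (p*b) panel and solve once
big = solve_triangular(L, np.hstack(Xs), lower=True)

# Same numbers either way, but the panel version is one BLAS-3 call
# with far better arithmetic intensity than b skinny solves
assert np.allclose(np.hstack(small), big)
```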

SLIDE 13

OPTIMIZATIONS

  • Blocking in i
  • many small trsms vs. one big trsm
  • Out-of-core algorithm
  • read block b+1 while computing block b
  • double-buffering technique necessary
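The read/compute overlap in the out-of-core bullets can be simulated in a few lines. A minimal sketch using Python threads, where `read_block` and `compute` are stand-ins for the real disk I/O and trsm work:

```python
import threading

NUM_BLOCKS = 6

def read_block(b):
    # Stand-in for reading block b of X from the HDD
    return [b] * 4

def compute(block):
    # Stand-in for the per-block trsm and the small solves
    return sum(block)

results = []
buf = read_block(0)                          # prime the first buffer
for b in range(NUM_BLOCKS):
    nxt = {}
    reader = None
    if b + 1 < NUM_BLOCKS:
        # Read block b+1 in the background while block b is being computed
        reader = threading.Thread(
            target=lambda b=b: nxt.setdefault("data", read_block(b + 1)))
        reader.start()
    results.append(compute(buf))             # compute on the current buffer
    if reader is not None:
        reader.join()
        buf = nxt["data"]                    # buffer switch, no copying
```

Two buffers alternate roles each iteration, so the compute stage never waits for the disk as long as reading a block is faster than computing one.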


SLIDE 14

PERFORMANCE

From years/months down to weeks/days

[Plot: runtime vs. m (nucleotide count, 1m–36m) on a log scale from 100 s to 10 000 000 s (minutes to years) for CLAK-Chol, FLMM, GWFGLS, and EMMAX.]

SLIDE 15

OUTLINE

  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion


SLIDE 16

CAN GPUS HELP GO FURTHER?

  • trsm takes 90–95% of the time
  • compute it on the GPU
  • while the GPU computes:
  • CPU computations
  • CPU ⇄ GPU transfers
  • our cluster: NVIDIA Fermi GPUs

SLIDE 17

MEMORY PYRAMID

Need for streaming computation; need for two levels of double-buffering.

HDD: terabytes ⟶ RAM: 10–100 GB ⟶ GPU: 1–10 GB

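Back-of-envelope arithmetic shows how the pyramid constrains the block size. A sketch assuming double precision, the talk's n = 10 000 and p = 20, a 6 GB card, L kept resident on the GPU, and two block buffers for double-buffering (the buffer count and residency are assumptions, not stated on the slide):

```python
# How many X_i matrices fit in one GPU block, under the assumptions above
n, p = 10_000, 20
bytes_per_Xi = n * p * 8                  # one X_i: n x p doubles
L_bytes = n * n * 8                       # the Cholesky factor L (like M: 800 MB)
gpu_mem = 6 * 1024**3                     # a 6 GB Fermi card
per_buffer = (gpu_mem - L_bytes) // 2     # remaining memory, split over two buffers
block = per_buffer // bytes_per_Xi        # X_i matrices per GPU block
assert L_bytes == 800_000_000             # matches the 800 MB for M on slide 5
```

With these numbers, a couple of thousand Xᵢ fit per GPU buffer, while RAM holds an order of magnitude more and the full X lives only on disk.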

SLIDE 18

2-LEVEL TRIPLE-DOUBLE-BUFFERING

[Diagram: blocks b−3 … b+3 of X stream from the HDD through CPU/RAM buffers to the GPU, which runs trsm on block b−1; results r flow back toward the HDD.]

(1) Retrieve previous results from the GPU, start reading the second-next block from the HDD

SLIDE 19

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(2) Buffer switch (no copying)

SLIDE 20

2-LEVEL TRIPLE-DOUBLE-BUFFERING

Buffers switched

SLIDE 21

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(3) Send the next block to the GPU, start CPU computation

SLIDE 22

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(4) Write results to disk (fast, because they are small)

SLIDE 23

2-LEVEL TRIPLE-DOUBLE-BUFFERING

(5) Buffer switch (no copying)

SLIDE 24

2-LEVEL TRIPLE-DOUBLE-BUFFERING

One full iteration completed
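The two-level pipeline sketched on these slides can be simulated end to end with threads and bounded queues; the queues of size 2 play the role of the double buffers at each level. All stage functions are stand-ins for the real I/O, trsm, and small solves, not the authors' code:

```python
import queue
import threading

NUM_BLOCKS = 5

def hdd_read(b):    return ("X", b)         # stand-in for reading X block b from the HDD
def gpu_trsm(blk):  return ("Xhat", blk[1]) # stand-in for the GPU trsm on one block
def cpu_solve(blk): return ("r", blk[1])    # stand-in for the small CPU solves

to_gpu = queue.Queue(maxsize=2)   # RAM-side buffers feeding the GPU (double-buffered)
to_cpu = queue.Ueue if False else queue.Queue(maxsize=2)  # transformed blocks back for CPU work

def reader():
    for b in range(NUM_BLOCKS):
        to_gpu.put(hdd_read(b))   # HDD -> RAM, overlapped with everything below
    to_gpu.put(None)

def gpu():
    while (blk := to_gpu.get()) is not None:
        to_cpu.put(gpu_trsm(blk))  # as long as blocks keep arriving, the GPU never idles
    to_cpu.put(None)

threads = [threading.Thread(target=reader), threading.Thread(target=gpu)]
for t in threads:
    t.start()
results = []
while (blk := to_cpu.get()) is not None:
    results.append(cpu_solve(blk))  # CPU finishes block b-1 while the GPU works on b
for t in threads:
    t.join()
```

Bounded queues give exactly the behavior on the slides: a stage blocks only when both of its buffers are full, so every level streams at the pace of the slowest stage.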

SLIDE 25

TIMELINE

Parallelism on the vertical axis; heavy use of asynchronous dispatching.

[Timeline diagram, rows GPU / CPU / HDD over time t: GPU trsm on blocks b and b+1 overlaps with CPU compute and write of block b−1, sends of b+1 and b+2, receives of b−1 and b, and HDD reads of b+2 and b+3; the legend marks CPU ⇄ GPU transfers, HDD ⇄ CPU transfers, GPU computation, CPU computation, and data dependencies.]

SLIDE 26

TIMELINE, TO SCALE

Problem sizes: n = 10k, m = 100k, block = 10k

GPU: 2× NVIDIA Quadro 6000 (Fermi, 515 GFlops each, 6 GB memory) ≈ $10,000
CPU: 2× Intel Xeon X5650 (6 cores, 128 GFlops, 24 GB memory) ≈ $2,000
BLAS: Intel MKL 10.2; compiler: icc 12.1

[Timeline diagram to scale, rows GPU / CPU / HDD over time t: CPU ⇄ GPU transfers, HDD ⇄ CPU transfers, GPU computation, CPU computation.]

SLIDE 27

OUTLINE

  • Problem description
  • CPU-only algorithm
  • Leveraging the GPU
  • Results and conclusion


SLIDE 28

PERFORMANCE

Sustained in-core performance even when running out-of-core; extrapolated: 13 h/70 h vs. 2.5 h/13 h

[Plot: time (s) vs. m (nucleotide count), 1k–90k; ⟵ in-core | out-of-core ⟶. Original CPU-only algorithm: 11.6 s–96.7 s; hybrid CPU+2-GPU algorithm: 4.3 s–18.3 s.]
SLIDE 29

PERFORMANCE

[Plot: time (s) vs. m (nucleotide count), 1k–90k; ⟵ in-core | out-of-core ⟶. Original CPU-only algorithm: 11.6 s–96.7 s; hybrid CPU+2-GPU algorithm: 4.3 s–18.3 s.]

5.2× speedup using 2 GPUs, 10× using 4 GPUs: from days to hours (when m is in the millions)

SLIDE 30

CONCLUSION

  • Don’t replace the CPU with the GPU
  • Combine them
  • Hide data-transfer latency by overlapping transfers with computation
  • Double/triple-buffering
  • The GPU never stops computing
  • Are GPUs an order of magnitude faster?
  • V. Volkov (http://www.cs.berkeley.edu/~volkov/)
  • Victor W. Lee et al., “Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU”, 2010

SLIDE 31

[Plot repeated from the performance slides: time (s) vs. m (nucleotide count), original CPU-only algorithm vs. hybrid CPU+2-GPU algorithm.]

QUESTIONS?

beyer@aices.rwth-aachen.de
SLIDE 32

FUTURE WORK

  • Handle an L that is too big for GPU memory
  • Apply similar techniques to similar problems
  • Extend to multiple phenotypes (y)