S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for Scalable Read Sizes
Santiago Marco (CNAG-CRG), Paolo Ribeca (CNAG-CRG), Alejandro Chacón (UAB), Juan Carlos Moure (UAB), Antonio Espinosa (UAB)
Genomic Sequencing applications
A process widely used in bioinformatics analysis. GOAL: correct the sequencing errors.
Genome Region:
Human Genome
GEM CNAG
GEM CNAG
(500x)
GEM3-GPU
CPU vs GPU:
Performance: 2 TFLOPS vs 18 TFLOPS (9x)
Bandwidth: 136 GB/s vs 960 GB/s (7x)
Power efficiency: 8 GFLOPS/W vs 30 GFLOPS/W (3.8x)
Main memory: 128 GB vs 4x12 GB (10%)
Threads: 48 HW threads vs 106K HW threads
Cache/thread: 1.25 MB vs 60 bytes
GPU implications: limited space for data structures; huge parallelism must be extracted explicitly; hard memory constraints; thread hierarchy (warps, blocks, ...); explicit transfers.
[Diagram: GPU (10K cores, 4x12 GB at 960 GB/s) and CPU (16 cores, 128 GB at 136 GB/s) linked by a 16 GB/s interconnect]
(fine grain parallelism & batch mode)
(manage latency- & throughput-oriented cores)
(specialized structures)
(CPU and GPU collaboration & be warp aware)
(problem decomposition & thread-cooperative parallelization)
GEM stages (per query of 500nt): 1 query x 30 seeds; 30 seeds x 50 candidate occurrences each.

Task - Thread: 30 threads / 1500 threads / 1500 threads per query (500nt).
Batch Mode + Task - Thread: 30K threads / 1.5M threads / 1.5M threads.
Batch Mode + Task - R threads: 240K threads (30K x 8 th) / 12M threads (1.5M x 8 th) / 6M threads (1.5M x 4 th).
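The thread counts shown above are plain arithmetic over the slide's numbers. A hedged sketch reproduces them; the helper names, the 1000-query batch size, and the R = 8/8/4 split per stage are my assumptions for illustration:

```python
# Hedged sketch of the batch sizing arithmetic from the slides.
# Assumptions (mine): a batch of 1000 queries; per 500nt query,
# ~30 seeds, each with ~50 candidate occurrences; R = 8/8/4
# cooperating threads per task in the three stages.
SEEDS_PER_QUERY = 30
OCCS_PER_SEED = 50

def task_thread(queries):
    """One thread per task in each stage."""
    seeds = queries * SEEDS_PER_QUERY
    candidates = seeds * OCCS_PER_SEED
    return seeds, candidates, candidates

def batch_task_r_threads(queries, r=(8, 8, 4)):
    """R cooperating threads per task (warp-aware decomposition)."""
    seeds, cands, _ = task_thread(queries)
    return seeds * r[0], cands * r[1], cands * r[2]

print(task_thread(1))               # one query: (30, 1500, 1500)
print(task_thread(1000))            # batch mode: (30000, 1500000, 1500000)
print(batch_task_r_threads(1000))   # (240000, 12000000, 6000000)
```

Batching a thousand queries is what turns 30 threads of work into 30K, enough to keep 10K GPU cores busy.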
[Diagram: one CPU thread per core (CORE 0..n), each with its own transfer queue, reading INPUT from disks 0..r and writing OUTPUT]
[Diagram: pipeline producing OUTPUT: Alignments (SAM); stages 1, 3, 5, 7 in one row and stages 2, 4, 6 in another, with each even stage (S2, S4, S6) having paired input/output buffers]
[Diagram: the same pipeline, simplified: stages 1, 3, 5, 7 and stages 2, 4, 6, producing OUTPUT: Alignments (SAM)]
Non-critical structures can be (1) accessed remotely from CPU memory or (2) have their work processed by the CPU.
A) CPU & GPU collaboration
1. GPU better on regular workflow executions
2. CPU better on latency-bound and divergent executions
B) Special compression strategies
Batch Mode + Fine Parallelization
[Diagram: batch mode + fine parallelization over queries (500nt): tasks split across 30K x 8 threads, 1.5M x 8 threads and 1.5M x 4 threads]
divide seeds (filter Ns)
expand intervals (process Ns)
split BPM (break dependencies)
join BPM (reconstruct results)
(Adapt irregular work)
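The "divide seeds (filter Ns)" step can be illustrated with a toy sketch. The function name, the fixed 20nt seed length, and the non-overlapping tiling are assumptions for illustration, not GEM3's actual seeding strategy:

```python
def divide_seeds(query, seed_len=20):
    """Tile a read into non-overlapping seeds, dropping any seed that
    contains the ambiguous base 'N' (not searchable exactly in the
    index). Returns (position, seed) pairs."""
    seeds = []
    for pos in range(0, len(query) - seed_len + 1, seed_len):
        seed = query[pos:pos + seed_len]
        if 'N' not in seed:
            seeds.append((pos, seed))
    return seeds

# 80nt read with a 20nt run of Ns in the middle: the seed covering
# the N run is filtered out, the rest survive.
read = "ACGT" * 10 + "N" * 20 + "ACGT" * 5
print(divide_seeds(read))  # seeds at positions 0, 20 and 60
```

Filtering Ns before search is what makes the per-seed work regular enough to batch onto the GPU; the irregular N handling stays in the "expand intervals" step.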
[Table: thread parallelism vs working set (# threads vs bytes/thread), ranging from 470 KB/thread down to 256, 64 and 60 bytes/thread]
[Diagram: candidates 0 to p-1 are each compared against the query; every pair produces a result by computing the Levenshtein distance with Myers' algorithm to verify homology]
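The per-candidate verification computes the Levenshtein distance with Myers' bit-parallel (BPM) algorithm. Below is a minimal single-word Python sketch of the classic Myers (1999) recurrence; the function name is mine, and the real GEM3 kernel is a multi-word, thread-cooperative CUDA implementation, not this scalar version:

```python
def myers_edit_distance(query, candidate):
    """Levenshtein distance via Myers' bit-parallel algorithm.

    Single-word variant: in C/CUDA this requires len(query) <= the
    machine word size; Python's big integers lift that limit here.
    Runs in O(len(candidate)) word operations instead of the
    O(len(query) * len(candidate)) cells of the plain DP.
    """
    m = len(query)
    if m == 0:
        return len(candidate)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    # peq[c]: bitmask of the positions where base c occurs in the query
    peq = {}
    for i, c in enumerate(query):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv = mask, 0    # vertical positive/negative delta vectors
    score = m           # distance of query vs the empty prefix
    for c in candidate:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = mv | (~(xh | pv) & mask)   # horizontal positive deltas
        mh = pv & xh                    # horizontal negative deltas
        if ph & last:
            score += 1
        elif mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask     # shift in +1: global alignment
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score

print(myers_edit_distance("GATTACA", "GCATGCU"))  # 4
```

The delta vectors encode each DP column as two bitmasks, which is why the GPU version can keep its working set down to a few words per thread.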
[Diagram: each candidate-query homology task is shared by r threads, reducing the memory per thread (but requires ...)]
[Chart: throughput in giga query bases / second (0.00-2.50) vs query size (m), showing 16x and 3.3x speedups(1)]
(1) compared to the traditional task-parallel strategy
(fine grain parallelism & batch mode)
(manage latency- & throughput-oriented cores)
(specialized structures)
(CPU and GPU collaboration & be warp aware)
(problem decomposition & thread-cooperative parallelization)
[Charts: throughput vs query size (10-1000nt), and a percentage metric (95.0-101.0) vs query size (100, 250, 500, 1000nt), for BWA-MEM, nvBowtie, Cushaw and GEM GPU on 2 sockets E5-2650 (32 threads) + K40]
*Illumina-like sequence data
*Benchmark: BWA-MEM 0.7.10 (CPU) - nvBowtie 1.1.0 (GPU) - CUSHAW2 2.1.8-r16 (CPU+GPU) - GEM GPU 3.1.1 (CPU+GPU)
Mutations
coverage
(1) http://grupsderecerca.uab.cat/hpca4se/en/content/gpu
(2) http://www.cnag.cat/team/bioinformatics/bioinformatics-development-group-statistical-genomics-team/