S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for Scalable Read Sizes



SLIDE 1

S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for Scalable Read Sizes

Alejandro Chacón (UAB), Santiago Marco (CNAG-CRG), Juan Carlos Moure (UAB), Paolo Ribeca (CNAG-CRG), Antonio Espinosa (UAB)

SLIDE 2

Genomic Sequencing applications

  • Personalized Medicine

Example of benefits in diagnosis: detect cancer with a blood test.
→ Earlier detection.
→ Non-intrusive methods.
→ (ctDNA + deep sequencing)

Direct applications:

  • Diagnosis and intervention
  • Drug development and usage
  • Cancer genomics
  • Genome editing (CRISPR)
  • Large-scale population analysis
  • Phylogenetics
  • In vitro meat
SLIDE 3

Genomic Sequencing: Mapping

Sample reads carry both individual mutations and sequencing errors. Sequencing is repeated (x20 - x60 coverage), producing queries (Tera-sequences) that are aligned against a reference genome (Gbases) by an approximate mapping process, yielding aligned queries.

Mapping is a process widely used in bioinformatics analysis. GOAL: correct the sequencing errors.
SLIDE 4

Growing Sequencing Data

Falling sequencing costs → democratize personalized medicine.

This exposes a computationally demanding problem (HPC): deep sequencing.
SLIDE 5

Seed & Extend mapping strategy

Input (FASTQ file) → Query → Seeds → Genome positions → Genome region → Output (SAM file)

Phase 1: SEEDING (text search). Split the query into seeds and search them exactly in the human genome index. Seeds REDUCE THE REPORTED POSITIONS and REDUCE THE COMPUTATIONAL COST.

Phase 2: EXTENDING (text comparison). Verify each candidate genome region against the query. OK → alignment; NO → filtered out.
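The two phases above can be sketched in a few lines. This is a minimal, hypothetical toy: a hash-table index and Hamming-distance verification stand in for GEM's FM-index exact search and edit-distance extension, and all function names are illustrative.

```python
def build_seed_index(genome, seed_len):
    """Toy seeding index: map every seed-length substring to its genome positions.
    (GEM uses a compressed FM-index for this exact search.)"""
    index = {}
    for pos in range(len(genome) - seed_len + 1):
        index.setdefault(genome[pos:pos + seed_len], []).append(pos)
    return index

def seed_and_extend(query, genome, index, seed_len, max_errors):
    """Phase 1 (SEEDING): exact-search non-overlapping seeds of the query.
    Phase 2 (EXTENDING): verify each candidate genome region by text comparison."""
    hits = set()
    for offset in range(0, len(query) - seed_len + 1, seed_len):
        seed = query[offset:offset + seed_len]
        for pos in index.get(seed, []):           # candidate genome positions
            start = pos - offset                   # align candidate to query start
            if start < 0 or start + len(query) > len(genome):
                continue
            region = genome[start:start + len(query)]
            errors = sum(a != b for a, b in zip(query, region))  # Hamming here
            if errors <= max_errors:
                hits.add(start)                    # OK: alignment kept
            # else: NO: filtered out
    return sorted(hits)
```

Because every seed must occur exactly, only a handful of genome positions survive to the (more expensive) extension phase, which is the whole point of the strategy.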

SLIDE 6

Traditional Mapper: internal workflow

Input (FASTQ file) → Query → SEEDING: Exact Search (index) → Seeds → Decode Positions (NO MAP if none) → Genome positions → EXTENDING: Smith & Waterman → Candidates → OK: alignment / NO: filtered out → Output (SAM file)

Human genome mapping time with GEM (CNAG): 340 h.

SLIDE 7

Introducing the new mapper: GEM3-GPU

SLIDE 8

Introducing GEM3: internal workflow

Input (FASTQ file) → Query → Seeds → Candidates → Genome positions → Output (SAM file)

Seeding stages: Exact Search (index), Approximate Search, Neighborhood Search, Decode Positions (NO MAP if none).
Extending stages: K-mer Distance Filter, Bit-Parallel Myers, Smith & Waterman → OK: alignment / NO: filtered out.

The queries present different sizes & error rates: the GEM3 mapper uses adaptive strategies to process them.

Human genome mapping time: GEM (CNAG) 340 h → GEM3-GPU 40 min (500x).
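The K-mer Distance Filter stage above can be illustrated with the classic q-gram lemma: an alignment of the query with at most e errors must share at least (|q| - k + 1) - k·e k-mers with the candidate text, so cheap counting discards most candidates before the expensive verification kernels run. A minimal sketch, not GEM3's actual implementation:

```python
from collections import Counter

def kmer_profile(s, k):
    """Multiset of all k-mers of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def kmer_filter_pass(query, candidate, k, max_errors):
    """q-gram lemma filter: keep the candidate only if it shares enough
    k-mers with the query to possibly align within max_errors errors."""
    qp, cp = kmer_profile(query, k), kmer_profile(candidate, k)
    shared = sum(min(qp[g], cp[g]) for g in qp)        # common k-mer count
    threshold = (len(query) - k + 1) - k * max_errors  # q-gram lower bound
    return shared >= threshold
```

A candidate rejected here can never produce an alignment within the error budget, so the filter is lossless; it only trades a little counting work for skipping Bit-Parallel Myers and Smith & Waterman on hopeless candidates.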

SLIDE 9

Architectural characteristics: CPUs vs GPUs

                     CPUs             GPUs
Performance:         2 TFLOPS         18 TFLOPS (9x)
Bandwidth:           136 GB/s         960 GB/s (7x)
Power efficiency:    8 GFlops/W       30 GFlops/W (3.8x)
Main memory:         128 GB           4x12 GB (10%)
Threads:             48 HW threads    106K HW threads
Cache/thread:        1.25 MB          60 Bytes

GPU algorithmic programming challenges:
  • Limited space for data structures
  • Huge memory constraints
  • Extract explicitly huge parallelism
  • Thread hierarchy (warps, blocks, ...)
  • Explicit transfers

Sequencing production hybrid node: CPUs (16 cores, 2x E5-2640) with DRAM (128 GB, 136 GB/s) ↔ 16 GB/s ↔ GPUs (10K cores, 2x K80) with GDRAM (4x12 GB, 960 GB/s).

SLIDE 10

GEM3: GPU internal workflow

Input (FASTQ file) → Read → Seeds → Candidates → Genome positions → Output (SAM file)

GPU-dedicated stages: (1) the time-consuming stages and (2) those that map best onto the GPU architecture (specialized kernels for common cases), with GPU speedups of x15, x15 and x21. The remaining stages (Approximate Search, Neighborhood Search, K-mer Distance Filter, Smith & Waterman; OK: alignment / NO: filtered out / NO MAP) stay on the CPU.

SLIDE 11

GEM3: GPU algorithmic challenges

1) Exposing massive parallelism (fine-grain parallelism & batch mode)
2) Algorithmic interactions for hybrid systems (manage latency- & throughput-oriented cores)
3) Reducing the memory requirements (specialized structures)
4) Regularize the work (CPU and GPU collaboration & be warp aware)
5) Reduce the thread memory footprint (problem decomposition & thread-cooperative parallelization)

SLIDE 12

1) Expose massive parallelism (sequence life-cycle)

GEM stages per query: 1. Exact Search → 2. Approx Search → 3. Neighborhood Search → 4. Decode → 5. KMER → 6. BPM → 7. SW.

CPU: one query (500nt) → 1 query x 30 seeds → 30 seeds x 50 occurrences (1500 candidates).

GPU, one task per thread: one query (500nt) → 30 threads (seeds) → 1500 threads (candidates). Not enough to saturate the GPU.

GPU, batch mode + one task per thread: 1K queries (500nt) → 30K threads → 1.5M threads. Higher parallelism.

GPU, batch mode + R threads per task: 240K threads (30K x 8 th) → 12M threads (1.5M x 8 th) → 6M threads (1.5M x 4 th). Much higher parallelism while keeping the same amount of memory.
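The thread counts on this slide follow directly from the batch sizes. A back-of-envelope model (function and parameter names are illustrative, the numbers are the slide's) reproduces them:

```python
def stage_threads(n_queries, seeds_per_query, occ_per_seed, th_per_task):
    """Parallelism exposed per stage under batch mode + R threads per task.
    th_per_task maps a stage name to the number of cooperating threads."""
    n_seeds = n_queries * seeds_per_query   # seeding tasks in the batch
    n_cands = n_seeds * occ_per_seed        # candidate tasks (decode/verify)
    return {
        "seeding": n_seeds * th_per_task["seeding"],
        "decode":  n_cands * th_per_task["decode"],
        "verify":  n_cands * th_per_task["verify"],
    }
```

With 1K queries of 500nt, 30 seeds per query, 50 occurrences per seed and 8/8/4 cooperating threads, this yields the 240K, 12M and 6M threads quoted above.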

SLIDE 13

2) GEM3: a hybrid processing system

A dynamic I/O dispatcher reads the INPUT from disks (Disk 0 ... Disk r) and writes the OUTPUT. CPU threads (Thread 0 ... Thread n), one per core (CORE 0 ... CORE n), run CPU tasks over CPU buffers. GPU tasks flow through waiting and ready queues; transfer queues move data between CPU buffers and GPU buffers, and kernels execute on the GPU devices (GPU DEVICE 0 ... GPU DEVICE m).

SLIDE 14

2) Algorithmic interactions on hybrid systems

INPUT: sequences (FASTQ) → STAGE 1 → ... → STAGE 7 → OUTPUT: alignments (SAM). CPU stages (1, 3, 5, 7) and GPU stages (2, 4, 6) exchange data through transfers (S2/S4/S6 inputs and outputs).

Workflow dependences serialize CPU work, GPU work and transfers.

SLIDE 15

2) Algorithmic interactions on hybrid systems

Multi-buffering strategy: overlap transfers in both directions + CPU tasks + GPU tasks.
Breaking dependences: buffer dependence exploration to increase the parallelism.

INPUT: sequences (FASTQ) → CPU stages (1, 3, 5, 7) and GPU stages (2, 4, 6) overlapped through transfers → OUTPUT: alignments (SAM).
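The multi-buffering idea can be sketched host-side with a pool of reusable buffers and two queues: while the consumer processes buffer i, the producer already fills buffer i+1. In GEM3 this overlap would be built with CUDA streams and pinned buffers; the Python threads below are only a stand-in for the structure, and all names are illustrative.

```python
import threading
import queue

def pipeline(batches, n_buffers=3):
    """Multi-buffering sketch: a fixed pool of n_buffers buffers cycles
    between a producer (fills buffers) and a consumer (the 'kernel')."""
    free = queue.Queue()    # buffers available for refilling
    ready = queue.Queue()   # filled buffers awaiting processing
    for _ in range(n_buffers):
        free.put([])
    results = []

    def producer():
        for batch in batches:
            buf = free.get()            # blocks until a buffer is recycled
            buf.clear()
            buf.extend(batch)
            ready.put(buf)
        ready.put(None)                 # end-of-input marker

    def consumer():
        while (buf := ready.get()) is not None:
            results.append([x * 2 for x in buf])   # stand-in "kernel"
            free.put(buf)               # recycle the buffer

    t = threading.Thread(target=producer)
    t.start()
    consumer()
    t.join()
    return results
```

The fixed pool bounds memory use while letting filling and processing run concurrently, which is exactly the dependence-breaking the slide describes.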

SLIDE 16

2) Adapt the application to hybrid systems

Dynamic parallel I/O dispatcher: get the next batch from the input disks (Disk 0 ... Disk r), process queries in batches, store the results.

Static scheduler & system analyzer: recollects the architecture specs and applies memory allocation policies:
  • Buffer tuning (size, number, ...)
  • Buffer distribution between GPUs
  • Enable / disable compute stages
  • Partial data structure allocation

Coupling GPUs with different memory restrictions: GPUs with limited memory space → flexible data allocation policy. Non-critical structures can be (1) remotely accessed from the CPU memory or (2) have their work processed by the CPU.
SLIDE 17

3) Reducing the GPU memory requirements

CPU memory (128 GB) vs GPU memory (12 GB): the CPU side has 10x more space.

Preprocessed data structures are specialized for each system:
  • CPU structures support all query operations (2), but have large memory space requirements. To reduce index size: more memory accesses (the CPU has big caches).
  • GPU structures are highly optimized for common cases (1), but do not support all query operations. To reduce index size: more compute (the GPU has many compute resources).

A) CPU & GPU collaboration:
(1) The GPU is better on regular-workflow executions. (2) The CPU is better on latency-bound and divergent executions.

B) Special compression strategies:
Allocation policies + highly compacted indexes allow large-scale genomes on the GPU!

SLIDE 18

4) Regularize the work

Work and parallelism are irregular along the pipeline (GPUs are friendly to regular work).

A) The CPU helps to regularize the GPU work:
  · The GPU processes common cases → corner cases are relegated to the CPU.
  · Problem decomposition for GPUs → the CPU takes care of splitting into smaller problems.

The CPU adapts irregular work around the GPU stages (1. Exact Search, 4. Decode, 6. BPM): expand intervals (process Ns), divide seeds (filter Ns), split BPM (break dependencies), join BPM (reconstruct results).

B) Fine-grain parallelism (thread-cooperation strategies):
  · Threads working on the same element → helps to regularize the work size.

GPU: batch mode + fine parallelization over 1K queries (500nt): (30K x 8 th), (1.5M x 8 th), (1.5M x 4 th).
SLIDE 19

5) Reduce the thread memory footprint

Memory constraints hit kernels with large memory footprints; bioinformatic algorithms are an example. As thread parallelism (# threads) grows, the working set (bytes/thread) shrinks. Example: a K80 running 27K threads has, per thread:

  Registers:     60 B/thr
  Local memory:  64 B/thr
  L2 cache:      256 B/thr
  Main memory:   470 KB/thr

If the thread memory footprint does not fit in the cache memories:
  · Memory cache pressure issues: increased GPU memory traffic.
  · GPU resource allocation issues: reduced GPU thread occupancy.

We MUST re-think bio-algorithms to scale on GPUs.

SLIDE 20

5) Reduce the thread memory footprint

(A) BPM, task-parallel (1 thread - 1 task): each thread computes the Levenshtein distance (with Myers' algorithm) between the query and one candidate (candidate0 ... candidatep-1). Memory per thread: |q|=100 → 164 bytes; |q|=1000 → 1580 bytes.

(B) BPM, thread-cooperative (r threads - 1 task): r threads cooperate on one candidate. Memory per thread: |q|=100 → 202 bytes; |q|=1000 → 202 bytes.

ALL local data fits in REGISTERS! The query is read just once (avoiding all memory re-accesses), and a larger query simply uses more threads (all is dynamic & flexible). But it requires:
  · Complex register communication
  · Special data layouts
  · Data regularization
  · Distribution of the work
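The Bit-Parallel Myers (BPM) recurrence at the heart of candidate verification computes the Levenshtein distance with a handful of bitwise operations per text character. A single-threaded reference sketch of the bit-vector recurrence is below; the GPU kernel instead tiles these bit-vectors across r cooperating threads so that each thread's slice stays in registers.

```python
def myers_levenshtein(query, text):
    """Levenshtein distance via Myers' bit-vector algorithm.
    Python ints are arbitrary precision, so the query may exceed one
    machine word; the GPU version splits the vectors across threads."""
    m = len(query)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1
    peq = {}                                   # per-character match masks
    for i, c in enumerate(query):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m                 # vertical +1/-1 deltas; D[m][0]=m
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq      # carry trick propagates matches
        ph = (mv | ~(xh | pv)) & mask          # positive horizontal deltas
        mh = pv & xh                           # negative horizontal deltas
        if ph & (1 << (m - 1)):                # track D[m][j] incrementally
            score += 1
        elif mh & (1 << (m - 1)):
            score -= 1
        ph = ((ph << 1) | 1) & mask            # top boundary row D[0][j] = j
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score
```

Note that each step touches only the O(m/w) words of state (pv, mv), which is what lets the thread-cooperative GPU version keep all local data in registers regardless of query length.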

SLIDE 21

5) Kernel performance improvements (memory footprint reduction)

The thread-cooperative strategy allows scaling to larger problems on the GPU. → The memory footprint reduction delivers 2.3x - 6.8x better performance (compared to the traditional task-parallel strategy).

[Plots: performance (Giga query bases / second, 0.0-2.5) vs query size (m) for the Bit-Parallel Myers kernel and the Exact Search kernel; improvements of up to 16x and 3.3x.]

SLIDE 22

GEM3-GPU: final results, complete application

Recap of the GPU algorithmic challenges:
1) Exposing massive parallelism (fine-grain parallelism & batch mode)
2) Algorithmic interactions for hybrid systems (manage latency- & throughput-oriented cores)
3) Reducing the memory requirements (specialized structures)
4) Regularize the work (CPU and GPU collaboration & be warp aware)
5) Reduce the thread memory footprint (problem decomposition & thread-cooperative parallelization)

SLIDE 23

GEM3-GPU: final results, complete application

1) An order of magnitude better performance than the best competitive mappers.
2) The results show a GPU mapper that can process large query sizes.
3) More sensitive, obtaining better mapping quality.

[Plot: performance (Million DNA bases / second*) on 2 sockets E5-2650 (32 th) + K40, query sizes 100-1000; GEM GPU is 7x-18x faster than BWA-MEM, 10x-21x than nvBowtie, 11x-14x than Cushaw and 12x than GEM.]
[Plot: quality (% of queries mapped correctly*, 95.0-100.0 scale) for query sizes 100, 250, 500 and 1000.]

*Illumina-like sequence data.
*Benchmark: BWA-MEM 0.7.10 (CPU), nvBowtie 1.1.0 (GPU), CUSHAW2 2.1.8-r16 (CPU+GPU), GEM GPU 3.1.1 (CPU+GPU).

SLIDE 24

GEM-Cutter: a GPU library for genomic applications

Stack: GEM3 application → GEM-Cutter (GPU library) → NVIDIA CUDA SDK → GPU hardware.

How do we manage these resources? We implemented, on top of CUDA, a library to raise the GPU hardware abstraction:
  · Provides basic-block genomic primitives highly optimized for GPUs.
  · Offers an API based on send / receive primitives (message passing).
  · Supports all GPU architectures.
  · Supports multi-GPU.
  · Manages heterogeneous coupled GPUs.
  · Incorporates a scheduler to balance the work.

SLIDE 25

Future of Genomic Sequencing

Latest short-read technology: short queries (~100 bases), errors ~1%, 30x coverage, mutations.
Upcoming long-read technology: long queries (~20K bases), errors > 20% with random distribution, 75x coverage, mutations.

New analyses:
  · Haplotyping → determining whether a variation comes from the mother or the father.
  · Calling out structural variations → reshuffled DNA present in cancer.
  · Achieving better quality: Q40 (99.99%) → Q60 (99.9999%).

SLIDE 26

Conclusions

· GPU mapper used in a production pipeline.
· Highly optimized library validated in a real production center.
· GEM allows increasing the read size → other GPU mappers have scalability issues.
· Performance >10x higher than competitive mappers.
· Higher accuracy than other mappers.
· Compatibility with old GPUs, and support for the latest technical features of the new GPUs.
→ We think that bio-applications will fit even better in future GPUs.

SLIDE 27

Future bioinformatic trends in GPU

· Future GPUs with bigger memories (32 GB).
· Faster logical instructions (promising a 40% improvement in our algorithms).
· 3D-stacked memory technology: higher memory bandwidths (1 TB/s, 4x).
· NVLink: faster CPU ↔ GPU communications (5x).
· Better performance for random memory accesses (bigger TLB pages).

SLIDE 28

Special Thanks

Thanks to everyone involved in the project, especially:

UAB: Ph.D. colleagues
CRAG: Javier Navarro
CNAG-CRG: Simon Heath, Jordi Camps, Miguel Bernabeu
NVIDIA: Jacopo Pantaleoni, Nuno Subtil, Jon Cohen, Mark Berger
My supportive wife, Alba (for all your support)

SLIDE 29

Main developers: contact information

Alejandro Chacón, HPCA4SE - UAB (Spain) (1): alejandro.chacon@uab.es
Santiago Marco-Sola, CNAG-CRG (Spain) (2): santiagomsola@gmail.com

GEM3 will be available soon at: http://algorithms.cnag.cat/gem3

(1) http://grupsderecerca.uab.cat/hpca4se/en/content/gpu
(2) http://www.cnag.cat/team/bioinformatics/bioinformatics-development-group-statistical-genomics-team/