S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for - PowerPoint PPT Presentation

S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for Scalable Read Sizes Alejandro Chacón (UAB) San%ago Marco (CNAG-CRG) Juan Carlos Moure (UAB) Paolo Ribeca (CNAG-CRG) Antonio Espinosa (UAB) 1

Genomic Sequencing applicaKons Direct ApplicaKons • Personalized Medicine • Diagnosis and intervenKon • Drug development and usage • Cancer genomics • Genome EdiKng (crisp) • Large Scale populaKon analysis • PhylogeneKc • In vitro meat Example of benefits in diagnosis. Detect cancer with a blood test: à Earlier detec%on. à Non-intrusive methods. à (ctDNA + deep sequencing ) 2

Genomic Sequencing Mapping Individual MutaKons Process widely used in Sample bioinformaKcs analysis GOAL: Correct the sequencing errors Queries (Tera-sequences) Repeat (x20 – x60) Approximate mapping process Sequencing Errors Aligned queries Reference Genome (Gbases) 3

Growing the Sequencing Data Falling the sequencing cost à Democra%zes the personalized medicine Exposes a computaKonal demanding problem (HPC) (Deep sequencing) 4

Seed & Extend mapping strategy Input (FASTQ file) Candidates: Query: x Seeds: Phase 1 Phase 2 Genome Region: ? EXTENDING SEEDING (text comparison) (text search) Human Genome OK NO: Alignment Filtered-out Genome : posi%ons Output (SAM file) REDUCE THE COMPUTATIONAL COST REDUCE THE REPORTED POSITIONS 5

TradiKonal Mapper: Internal workflow Input (FASTQ file) Candidates: Query: x Seeds: Exact Search (index) EXTENDING SEEDING Smith&Waterman NO MAP Decode PosiKons OK NO: Alignment Filtered-out Genome : posi%ons Human Genome (%me mapping) GEM CNAG 340 h Output (SAM file) 6

Introducing the new Mapper GEM3-GPU 7

Introducing GEM3: Internal workflow Input (FASTQ file) Candidates: Query: x Seeds: Exact Search (index) K-mer Distance Filter Approximate Search BitParallel Myers Neighborhood Search Smith&Waterman NO MAP Decode PosiKons OK NO: Alignment Filtered-out Genome : posi%ons Human Genome (%me mapping) GEM CNAG GEM3-GPU (500x) 340 h 40 min à Output (SAM file) The queries present different size & error rate: GEM3 Mapper uses adaptaKve strategies to process them. 8

Sequencing ProducKon Hybrid Nodes 2 x K80 2 x E5 2640 16GB/s 136GB/s 16GB/s 960GB/s GDRAM CPUs DRAM GPUs (4x12GB) (16 cores) (128GB) (10K cores) GPU Algorithmic programming challenges Architectural characterisKcs CPUs GPUs Explicit transferences Performance: 2 TFLOPS 18 TFLOPS ( 9x ) Thread hierarchy (warps, blocks ...) Bandwidth: 136 GB/s 960 GB/s ( 7x ) Power Efficiency: 8GFlops/W 30GFlops/W (3.8x) Limited space for data structures Main Memory: 128GBs 4x12GBs ( 10% ) Threads: 48 HW threads 106K HW threads Extract explicitly huge parallelism Cache/thread: 1.25 MB 60 Bytes Huge memory constrains 9

GEM3: GPU Internal workflow Input (FASTQ file) Candidates: Read: Seeds: GPU (x15) Exact Search K-mer Distance Filter Approximate Search GPU (x21) BitParallel Myers Neighborhood Search Smith&Waterman NO MAP Decode PosiKons GPU (x15) OK NO: Alignment Filtered-out Genome : posi%ons Output (SAM file) GPU dedicated à (1) %me consuming stages + (2) best mapping for GPU arch (specialized kernels for common cases) 10

GEM3: GPU algorithmic challenges 1) Exposing massive parallelism (fine grain parallelism & batch mode) 2) Algorithmic interac%ons for hybrid systems (manage latency- & throughput-oriented cores) 3) Reducing the memory requirements (specialized structures) 4) Regularize the work (CPU and GPU collaboraTon & be warp aware) 5) Reduce the thread memory footprint (problem decomposiTon & thread-cooperaTve parallelizaTon) 11

1) Expose massive parallelism (sequence life-cycle) 1. Exact Search 4. Decode 6. BPM GEM stages ... ok 1. Exact S 5. KMER query (500nt) no 2. Approx S 6. BPM CPU: 3. Neigh S 7. SW 4. Decode 1 query x 30 seeds 30 seeds x 50 occ 30 seeds x 50 occ ok no ... ok query (500nt) no GPU: Not enough Task - Thread to saturate GPU 30 threads 1500 threads 1500 threads ... ok query (500nt) ... no GPU: ... ... ... no Higher parallelism ... Batch Mode + Task - Thread 1K queries 1.5M threads 30K threads 1.5M threads ... ok query (500nt) Much higher paralle. ... no GPU: + ... ... ... no Batch Mode ... keep the same + Task - R threads 240K threads 12M threads amount of memory 6M threads 1K queries (30K x 8 th) (1.5M x 8 th) (1.5M x 4 th) 12

2) GEM3: An hybrid processing system GPU Tasks CPU Tasks CPU Buffers GPU Buffers INPUT wai%ng queue CPU Thread 0 CORE 0 kernel execu%on GPU DEVICE 0 transferences Dynamic I/O dispatcher CPU queue Disk 0 Thread 1 Disk 1 CORE 1 Disk 2 CPU Thread 2 ... CORE 2 ready queue Disk r transferences queue GPU DEVICE m ... ... ... CPU Thread n OUTPUT CORE n 13

2) Algorithmic interacKons on hybrid systems INPUT: Sequences (FASTQ) STAGE 1 Transfers S2: output S2: input S2: output S2: input STAGE 2 STAGE 3 S4: output S4: input STAGE 4 S4: output S4: input STAGE 5 S6: output S6: input S6: output S6: input STAGE 6 STAGE 7 OUTPUT: Alignments (SAM) Workflow dependences serialize CPU, GPU and transferences. 14

2) Algorithmic interacKons on hybrid systems INPUT: Sequences (FASTQ) STAGE 1 ... ... STAGE 2 STAGE 3 Transfers ... ... STAGE 4 STAGE 5 ... STAGE 6 ... STAGE 7 OUTPUT: Alignments (SAM) MulK-Buffering strategy : Overlap both-direc%ons transferences + CPU tasks + GPU tasks. Breaking dependences : Dependence buffer explora%on to increase the parallelism. 15

2) Adapt the applicaKon to hybrid systems StaKc Scheduler & System analyzer Recollect architecture specs Dynamic Parallel I/O dispatcher INPUT Memory allocaKon policies Get Next Batch Batch queries processing Disk 0 - Buffer tuning (size, number, ...) - Buffer distribu%on between GPUs Disk 1 ... - Enable / disable compute stages - Par%al data structures alloca%on Disk r OUTPUT Store results Coupling GPUs with different memory restric%ons GPUs with limited memory space à flexible data allocaTon policy: Non-cri%cal structures to be (1) remotely accessed from the CPU memory or (2) work be processed by the CPU. 16

3) Reducing the GPU memory requirements CPU Memory (10x more GPU Memory (128 GB) space) (12 GB) Preprocessed data structures are specialized for each system: A) CPU & GPU collaboraKon Highly opKmized for common cases 1 Support all query operaKons 2 - but ... does not support all query opera%ons - large memory space requirements B) Special compressing strategies To reduce index size ... To reduce index size ... - more mem. accesses (big caches) - more compute (many compute resources) allocaKon policies + highly compacted indexes allow large scale genomes on GPU !! 17 1 GPU be[er on regular workflow -execuTons 2 CPU be[er on latency bounded and divergent -execuTons

4) Regularize the work Irregular work and parallelism along the pipeline (GPUs are friendly to regular work) 1. Exact Search 4. Decode 6. BPM ok ... query (500nt) no ... GPU: no ... ... ... ... Batch Mode + (1.5M x 4 th) (30K x 8 th) (1.5M x 8 th) Fine ParallelizaKon 1K queries CPU: divide seeds expand intervals split BPM join BPM (filter Ns) (process Ns) (Adapt irregular work) (break dependencies) (reconstruct results) A) CPU help to regularize the GPU work: · GPU process common cases à Corner cases relegated to CPU · Problem decomposi%on for GPUs à CPU cares to split in smaller problems B) Fine grain parallelism (thread-cooperaTon strategies) · Threads working in the same element à helps to regularize the work size 18

5) Reduce the thread memory footprint Example K80 runs 27K threads: Cache L2 : 60 B/thr If thread memory footprint not fit in Local Memory: cache memories: 64 B/thr Registers : 256 B/thr · Memory cache pressure issues: Main memory: 470 KB/thr - Increase GPU memory traffic · GPU resources alloca%on issues: Memory constrains: kernels with large memory footprints - Reduces thread GPU occupancy - BioinformaKc algorithms are an example. MUST re-think bio-algorithms to Working set Thread Parallelism scale in GPUs bytes/thread # threads 19

5) Reduce the thread memory footprint (A) BPM: Task parallel (B) BPM: Thread cooperaKve (1 thread – 1 task) (r threads – 1 task) Memory Memory per thread candidate p-1 per thread candidate candidate 1 |q|=100 candidate 0 |q|=100 r threads Levenshtei query 202 Bytes 164 Bytes query n distance Homology Levenshtei with Myers query |q|=1000 |q|=1000 n distance query ... 202 Bytes with Myers Homology 1580 Bytes result 0 result 0 result 0 larger query à more threads (but requires) result 0 (all is dynamic & flexible) · Complex register com. Query read ALL local data fits · Special data layouts in REGISTERS ! just once · Data regulariza%on · Distribu%on of the work (avoid all mem. re-accesses) 20

5) Kernel performance improvements (memory footprint reduc%on) Performance Exact Search kernel Performance Bit Parallel Myers kernel 2.50 Giga query bases / second 2.00 16x 1.50 1.00 3.3x 0.50 0.00 Query Size (m) Thread-coopera%ve strategy allows to scale larger problems in GPU. à The memory footprint reduc%on delivers 2.3x – 6.8x beher performance. 1 1 compared to the tradiTonal task-parallel strategy 21

S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for - PowerPoint PPT Presentation

S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for Scalable Read Sizes Alejandro Chacn (UAB) San%ago Marco (CNAG-CRG) Juan Carlos Moure (UAB) Paolo Ribeca (CNAG-CRG) Antonio Espinosa (UAB) 1 Genomic Sequencing applicaKons

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

TXN/SEC CPU CORES TXN/SEC CPU CORES TXN/SEC CPU CORES TXN/SEC CPU CORES TXN/SEC CPU CORES

This week CSE 527 Sequence alignment Computational Biology More sequence alignment

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Analysis and classification of the DNA Analysis and classification of the DNA sequence of TARA

DNA D DNA Double bl Helix DNA stands for: DNA stands for: U d Under a Deoxyribose

Table of Contents Why DNA Computing? The Structure of DNA DNA Computing Operations on DNA

CSE 427 Computational Biology Winter 2008 Sequence Alignment; DNA Replication 1 Sequence

Router Architectures CPU CPU Memory Memory packets NFE NFE Processor Processor Line Card

CPU Scheduling CPU Scheduling CPU Scheduling 101 CPU Scheduling 101 The CPU scheduler makes a

CPU Scheduling CPU Scheduling CPU Scheduling 101 CPU Scheduling 101 The CPU scheduler makes a

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Take out your DNA model DNA and the Human Genome DNA Model How was your How was your model

Table of Contents Why DNA Computing? The Structure of DNA DNA Computing Operations on

Sequence Alignment Mark Voorhies 4/24/2012 Mark Voorhies Sequence Alignment Exercise:

MISSIONS Infrastructures and applications instrumentation OF THE CEA Physics (Nuclear physics,

Modeling in Molecular Biology Peter Schuster Institut fr Theoretische Chemie, Universitt

Universidade Federal de Pelotas Office for Technology Innovation Prof. Vinicius s Farias s

UV/Visible Light Imaging and BioSAXS Mark Benson Rigaku Europe Automated Drop Imaging

Qualitative Evaluation of the VHA Telegenomics Clinic Implementation Process VA Boston Center

Case 4 Goal: What does team plan for Phase 2b trial? Phase 2A study PK results by CYP2C8

Interpretation tools for coreceptor usage Rolf Kaiser Institute of Virology University of

Statistical Validation of Endophenotypes Using a Surrogate Endpoint Analytic Analogue Guan-Hua

S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for - PowerPoint PPT Presentation

S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for Scalable Read Sizes Alejandro Chacn (UAB) San%ago Marco (CNAG-CRG) Juan Carlos Moure (UAB) Paolo Ribeca (CNAG-CRG) Antonio Espinosa (UAB) 1 Genomic Sequencing applicaKons

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

TXN/SEC CPU CORES TXN/SEC CPU CORES TXN/SEC CPU CORES TXN/SEC CPU CORES TXN/SEC CPU CORES

This week CSE 527 Sequence alignment Computational Biology More sequence alignment

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Analysis and classification of the DNA Analysis and classification of the DNA sequence of TARA

DNA D DNA Double bl Helix DNA stands for: DNA stands for: U d Under a Deoxyribose

Table of Contents Why DNA Computing? The Structure of DNA DNA Computing Operations on DNA

CSE 427 Computational Biology Winter 2008 Sequence Alignment; DNA Replication 1 Sequence

Router Architectures CPU CPU Memory Memory packets NFE NFE Processor Processor Line Card

CPU Scheduling CPU Scheduling CPU Scheduling 101 CPU Scheduling 101 The CPU scheduler makes a

CPU Scheduling CPU Scheduling CPU Scheduling 101 CPU Scheduling 101 The CPU scheduler makes a

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

Take out your DNA model DNA and the Human Genome DNA Model How was your How was your model

Table of Contents Why DNA Computing? The Structure of DNA DNA Computing Operations on

Sequence Alignment Mark Voorhies 4/24/2012 Mark Voorhies Sequence Alignment Exercise:

MISSIONS Infrastructures and applications instrumentation OF THE CEA Physics (Nuclear physics,

Modeling in Molecular Biology Peter Schuster Institut fr Theoretische Chemie, Universitt

Universidade Federal de Pelotas Office for Technology Innovation Prof. Vinicius s Farias s

UV/Visible Light Imaging and BioSAXS Mark Benson Rigaku Europe Automated Drop Imaging

Qualitative Evaluation of the VHA Telegenomics Clinic Implementation Process VA Boston Center

Case 4 Goal: What does team plan for Phase 2b trial? Phase 2A study PK results by CYP2C8

Interpretation tools for coreceptor usage Rolf Kaiser Institute of Virology University of

Statistical Validation of Endophenotypes Using a Surrogate Endpoint Analytic Analogue Guan-Hua

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or