Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs
Anas Abu-Doleh 1,2, Erik Saule 1, Kamer Kaya 1 and Ümit V. Çatalyürek 1,2
1 Department of Biomedical Informatics
2 Department of Electrical and Computer Engineering
The Ohio State University

Outline
I. Introduction
   • Motivation
   • Contribution
   • Related Work
II. Masher Workflow
   • Index Construction
   • Mapping
III. Experiments and Results
IV. Conclusion and Future Work

ACM-BCB13, A. Abu-Doleh, “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs”, 23 Sep 2013

Motivation
• The read length of next-generation sequencing (NGS) devices keeps increasing, so there is wide interest in efficient and accurate mapping of long(er) reads.
• GPUs offer powerful parallel capabilities that can be used to improve the mapping of NGS reads.

Related Work and Contributions
Contribution
• A novel hash-based indexing technique in which:
   o For large genomes, the memory footprint is small enough for the index to be stored on a memory-restricted device such as a GPU.
   o The index data structure is better suited to GPU parallelization.
Related Work
• Burrows-Wheeler Transform (BWT)
   o Bowtie2
   o CUSHAW2
   o SOAP3-dp
• Hash indexing
   o SeqAlto
   o BFAST

Masher workflow

Index Construction
Processing the genome file
• Base pairs are converted to a 2-bit format.
• Each N is replaced with A.
Indexing
• Seed length, $L_S$
• Indexing step size, $\Delta_G$
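The preprocessing and seeding steps above can be sketched as follows. This is a minimal illustration, not Masher's implementation: the 2-bit code assignment (A=0, C=1, G=2, T=3) and the function names `encode_2bit`, `seed_hash`, and `indexed_seeds` are assumptions, since the slides do not give them.

```python
# Assumed 2-bit code; the slides do not state the actual mapping.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_2bit(genome):
    """Map each base to a 2-bit value; N (or any other symbol) becomes A."""
    return [CODE.get(base, 0) for base in genome.upper()]

def seed_hash(codes, start, seed_len):
    """Pack seed_len consecutive 2-bit codes into one integer in [0, 4**seed_len)."""
    value = 0
    for code in codes[start:start + seed_len]:
        value = (value << 2) | code
    return value

def indexed_seeds(genome, seed_len, genome_step):
    """Yield (position, seed value) for every genome_step-th position."""
    codes = encode_2bit(genome)
    for pos in range(0, len(codes) - seed_len + 1, genome_step):
        yield pos, seed_hash(codes, pos, seed_len)

if __name__ == "__main__":
    # Toy parameters; the slides use L_S = 15 and Delta_G = 4 for hg19.
    print(list(indexed_seeds("ACGTNACG", seed_len=2, genome_step=2)))
```

Because each base takes 2 bits, a seed of length $L_S$ maps directly to an integer below $4^{L_S}$, which can serve as an array index without a separate hash table.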

Index Construction
Index arrays: Locations array
• Genome length, $N$
• Stores the indexed locations, in order, for each seed
• Locations array size = $\log_2(N) \times N / \Delta_G$ bits
• Size ≈ 2.9 GB for hg19 with $\Delta_G = 4$

Index Construction
Index arrays: Count array
• Stores the number of occurrences of each seed
• Size = $4^{L_S} \times \log_2(N / \Delta_G)$ bits
• At most 255 locations are stored per seed, so each count fits in 8 bits.
• For seeds that appear more than 255 times, the stored locations are chosen by uniform selection.
• Size = 1 GB for $L_S = 15$

Index Construction
Index arrays: Ptrs array
• Stores the starting index in the locations (locs) array for each group of seeds
• Seed group size, $\delta$
• Group id = $\lfloor \text{seed} / \delta \rfloor$
• Size = $(4^{L_S} / \delta) \times \log_2(N / \Delta_G)$ bits
• Size = 0.5 GB for $\delta = 8$, $\Delta_G = 4$

Index Construction
Index arrays
• $L_S = 15$, $\Delta_G = 4$, $\delta = 8$, hg19
• Total size of the index arrays = 2.9 + 1 + 0.5 = 4.4 GB
• Space-time tradeoff
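The sizes quoted on the index slides can be checked with a short calculation. The sketch below rests on assumptions not stated in the slides: a genome length of about 3.1 billion bases for hg19, entries rounded up to whole 32-bit words for the locations and ptrs arrays, one byte per count, and GB read as $2^{30}$ bytes. The helper names are hypothetical.

```python
import math

GIB = 2 ** 30

def word_bytes(bits):
    """Round an entry width up to whole 32-bit words (an assumption)."""
    return 4 * math.ceil(bits / 32)

def index_sizes_gib(genome_len, seed_len, genome_step, group_size, count_bytes=1):
    """Return (locations, count, ptrs) array sizes in GiB."""
    n_indexed = genome_len // genome_step          # N / Delta_G indexed positions
    n_seed_values = 4 ** seed_len                  # 4^L_S possible seeds
    loc_bytes = word_bytes(math.ceil(math.log2(genome_len)))
    ptr_bytes = word_bytes(math.ceil(math.log2(n_indexed)))
    locs = n_indexed * loc_bytes / GIB
    counts = n_seed_values * count_bytes / GIB     # counts capped at 255 fit in 1 byte
    ptrs = (n_seed_values // group_size) * ptr_bytes / GIB
    return locs, counts, ptrs

if __name__ == "__main__":
    # 3_100_000_000 is an assumed approximate length for hg19.
    locs, counts, ptrs = index_sizes_gib(3_100_000_000, 15, 4, 8)
    print(f"{locs:.1f} {counts:.1f} {ptrs:.1f} {locs + counts + ptrs:.1f}")
```

Under these assumptions the three arrays come to roughly 2.9, 1.0, and 0.5 GB, matching the 4.4 GB total on the slide. A larger $\Delta_G$ or $\delta$ shrinks the index at the cost of extra work per lookup, which is the space-time tradeoff the slide mentions.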

Index Construction
Accessing the index
• Count array
   o Assume seed = $i + 4$, where $i \bmod \delta = 0$ and $\delta = 8$.
   o The seed belongs to seed group $(i, i + \delta - 1)$.
   o Seed index within the group: $k = (i + 4) \bmod \delta$
   o $C_{k=4} = \text{count}[i + 4]$
• Ptrs array
   o $j = \lfloor \text{seed} / \delta \rfloor$
   o Locs group index: $\text{Lgi} = \text{ptrs}[j]$
   o Locs seed index: $\text{Lsi} = \text{Lgi} + \sum_{n=0}^{k-1} C_n$
• Locs array
   o Extract the locations from $(\text{Lsi}, \text{Lsi} + C_k - 1)$
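The lookup procedure above can be illustrated on a toy index. This is a sketch under assumptions: `build_index` and `lookup` are hypothetical names, the arrays are plain Python lists, and the 255-location cap is omitted for brevity.

```python
def build_index(seeds, num_seed_values, group_size):
    """Build the count, ptrs and locs arrays from (position, seed value) pairs."""
    per_seed = [[] for _ in range(num_seed_values)]
    for pos, seed in seeds:
        per_seed[seed].append(pos)
    count = [len(positions) for positions in per_seed]
    locs = [pos for positions in per_seed for pos in positions]
    ptrs = []
    running = 0
    for seed in range(num_seed_values):
        if seed % group_size == 0:
            ptrs.append(running)       # start of this seed group in locs
        running += count[seed]
    return count, ptrs, locs

def lookup(seed, count, ptrs, locs, group_size):
    """Return the stored locations of one seed."""
    j = seed // group_size                 # group id
    k = seed % group_size                  # seed index within the group
    lgi = ptrs[j]                          # locs group index
    lsi = lgi + sum(count[seed - k:seed])  # skip the k earlier seeds of the group
    return locs[lsi:lsi + count[seed]]

if __name__ == "__main__":
    # (position, seed value) pairs for a toy genome with 2-base seeds.
    seeds = [(0, 1), (1, 6), (2, 11), (3, 12), (4, 1), (5, 6), (6, 8)]
    count, ptrs, locs = build_index(seeds, num_seed_values=16, group_size=4)
    print(lookup(6, count, ptrs, locs, group_size=4))
```

Storing one pointer per group of $\delta$ seeds rather than one per seed shrinks the ptrs array by a factor of $\delta$. The price is summing up to $\delta - 1$ counts at lookup time.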

Index Construction
[Figure: cumulative distribution of seed counts. x-axis: seed count (1 to 56); y-axis: Pr(count <= x) (0.5 to 1).]

Mapping
Seed & hash
• Read step size, $\Delta_R$
• Read length, $L_R$
• $N_{\text{seeds}} = \Delta_G \times (L_R - L_S) / \Delta_R$
Locate candidate alignment locations (CALs)
• Each thread is assigned to a specific seed.

Mapping
Merge CALs and weights
• When merging CALs, if two CALs are within a threshold distance of each other, the weight of the second is added to the weight of the first.
• For efficiency, Masher consists of two main loops.
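The merging rule can be sketched as below. This is an illustration, not Masher's GPU code. The name `merge_cals`, the (location, weight) tuple layout, measuring distance from the first CAL of each merged group, and treating "within" as less than or equal are all assumptions.

```python
def merge_cals(cals, threshold):
    """Merge candidate alignment locations that lie close together.

    cals: list of (location, weight) pairs in any order.
    Returns merged (location, weight) pairs sorted by location.
    """
    merged = []
    for loc, weight in sorted(cals):
        if merged and loc - merged[-1][0] <= threshold:
            merged[-1][1] += weight    # add the second weight to the first
        else:
            merged.append([loc, weight])
    return [tuple(entry) for entry in merged]

if __name__ == "__main__":
    print(merge_cals([(100, 1), (103, 1), (500, 2), (104, 1)], threshold=5))
```

Sorting first means each CAL only needs to be compared with the most recent merged entry, so the merge itself is a single linear pass.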

Mapping
Sorting and batching CALs
• CALs are sorted and grouped into batches according to their weights.
• At this stage, CALs with low weight can be filtered out.
Bounded local alignment
• A parameterized variant of the Smith-Waterman (SW) algorithm that supports affine gap scoring.
• In bounded alignment, only the matrix cells $(i, j)$ where $|i - j| \le w$ are visited and scored.
• Masher does two passes, setting $w$ to 4 and 16 respectively.
• Each GPU block performs multiple SW alignments in parallel.
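A CPU-side sketch of bounded (banded) Smith-Waterman with affine gaps may help make the band restriction concrete. It is not Masher's kernel. The scoring values (match 2, mismatch -1, gap open -2, gap extend -1) and the name `banded_sw_affine` are assumptions, and full matrices are allocated for clarity even though only the band is filled.

```python
def banded_sw_affine(read, ref, w, match=2, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Best local alignment score using only cells (i, j) with |i - j| <= w."""
    neg = float("-inf")
    n, m = len(read), len(ref)
    # H: best score ending at (i, j); E / F: best score ending in a
    # horizontal / vertical gap. Out-of-band H cells stay 0, which is
    # harmless because gap penalties are negative and H is floored at 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

if __name__ == "__main__":
    read, ref = "ACGTACGT", "ACGTTACGT"
    # Narrow band first, wider band second, mirroring the two-pass idea.
    for w in (0, 1):
        print(w, banded_sw_affine(read, ref, w))
```

The band limits the work to about $(2w + 1) \times L_R$ cells instead of the full matrix. In the example, widening the band lets the alignment cross the single inserted base in the reference, so the wider pass scores higher than the narrow one.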

Experiments and Results
Platform
• Intel Core i7-960 CPU clocked at 3.2 GHz, 4 Hyper-Threading cores, 24 GB of DDR3 memory
• Tesla K20c GPU, 4.8 GB of global memory
• CUDA 5.0 and GCC 4.2.4
Human genome and simulated reads
• Human genome hg19
• Wgsim simulator: 100K reads of length 100, 300, 500, and 1000 with error rates of 2%, 4%, 6%, and 8%

Experiments and Results
Metrics for comparison
• Sensitivity: the percentage of reads that are aligned.
• Accuracy: the percentage of aligned reads that are correctly aligned to the simulator locations.
• Execution time: only alignment time was measured.
• The lower bound for a valid alignment score is set to $\text{score}_{LB} = L_R \times (1.9 - 0.5 \times \text{error rate})$.
Two modes of Masher
• Normal mode, $\Delta_R = 0.7 L_R$
• Fast mode, $\Delta_R = L_R$
Comparison with
• Bowtie2 (sensitive and fast), 8 threads
• SOAP3-dp
• CUSHAW2-GPU
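The two quality metrics can be computed as in the sketch below. The input layout (a list of reported and simulated positions, with `None` for unaligned reads) and the `tolerance` parameter that decides when an alignment counts as correct are assumptions. The slides do not say how closely a reported position must match the simulator location.

```python
def sensitivity_and_accuracy(results, tolerance=0):
    """Compute (sensitivity %, accuracy %) for a set of simulated reads.

    results: list of (aligned_pos, true_pos); aligned_pos is None if unaligned.
    """
    total = len(results)
    aligned = [(a, t) for a, t in results if a is not None]
    correct = sum(1 for a, t in aligned if abs(a - t) <= tolerance)
    sensitivity = 100.0 * len(aligned) / total if total else 0.0
    accuracy = 100.0 * correct / len(aligned) if aligned else 0.0
    return sensitivity, accuracy

if __name__ == "__main__":
    reads = [(10, 10), (20, 25), (None, 30), (40, 40)]
    print(sensitivity_and_accuracy(reads))
    print(sensitivity_and_accuracy(reads, tolerance=5))
```

Accuracy is measured over aligned reads only, so a mapper can raise its accuracy by declining to align difficult reads. That is why the two metrics are reported together.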

Experiments and Results
$L_R$ = 100 bp
[Figure: bar charts of sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, SOAP3-dp, and CUSHAW2-GPU.]

Experiments and Results
$L_R$ = 500 bp
[Figure: bar charts of sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]

Experiments and Results
$L_R$ = 1000 bp
[Figure: bar charts of sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]

Experiments and Results
$L_R$ = 100 bp
[Figure: bar chart of execution time (sec., log scale) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, SOAP3-dp, and CUSHAW2-GPU.]
