Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs


  1. Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs
     Anas Abu-Doleh (1,2), Erik Saule (1), Kamer Kaya (1), and Ümit V. Çatalyürek (1,2)
     (1) Department of Biomedical Informatics; (2) Department of Electrical and Computer Engineering
     The Ohio State University

  2. Outline
     I. Introduction: Motivation, Contribution, Related Work
     II. Masher Workflow: Index Construction, Mapping
     III. Experiments and Results
     IV. Conclusion and Future Work
     ACM-BCB13, A. Abu-Doleh, "Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs", 23 Sep 2013

  3. Motivation
     • The read length of next-generation sequencing (NGS) devices is continuously increasing, so there is wide interest in efficient and accurate mapping of long(er) reads.
     • Utilize the powerful capabilities of GPUs to improve the mapping of NGS reads.

  4. Related Work and Contributions
     Contribution
     • A novel hash-based indexing technique by which:
       o For large genomes, the memory footprint is small enough to be stored in a restricted-memory device such as a GPU.
       o The index data structure is more suitable for GPU parallelization.
     Related Work
     • Burrows-Wheeler Transform (BWT): Bowtie2, CUSHAW2, SOAP3-dp
     • Hash indexing: SeqAlto, BFAST

  5. Masher workflow

  6. Index Construction
     Processing the genome file
     • Convert base pairs to a 2-bit format.
     • Replace each N with A.

  7. Index Construction
     Processing the genome file
     • Convert base pairs to a 2-bit format.
     • Replace each N with A.
     Indexing
     • Seed length, $L_S$
     • Indexing step size, $\Delta_G$
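
The preprocessing and indexing steps on this slide can be sketched as follows. This is a minimal illustration, not Masher's implementation: the code table (A=0, C=1, G=2, T=3) and the function names `encode_genome`, `seed_hash`, and `indexed_seeds` are assumptions; the slides only state that bases are stored in 2 bits, that N is replaced with A, and that seeds of a fixed length are indexed at a fixed step.

```python
# Hypothetical 2-bit code table; the slides do not give the actual assignment.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_genome(seq):
    """Return a list of 2-bit codes, replacing every N with A first."""
    return [CODE[base] for base in seq.upper().replace("N", "A")]


def seed_hash(codes, start, seed_len):
    """Pack seed_len consecutive 2-bit codes into an integer in [0, 4**seed_len)."""
    value = 0
    for code in codes[start:start + seed_len]:
        value = (value << 2) | code
    return value


def indexed_seeds(codes, seed_len, genome_step):
    """Yield (seed, position) for every genome_step-th genome position."""
    for pos in range(0, len(codes) - seed_len + 1, genome_step):
        yield seed_hash(codes, pos, seed_len), pos
```

With a seed length of 15, every seed maps to an integer below 4^15, which is what allows the seed value itself to serve as an array index.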

  8. Index Construction
     Index arrays: locations array
     • Genome length, $N$
     • Stores the indexed locations, in order, for each seed
     • Locations array size $= \log_2(N) \times N / \Delta_G$ bits
     • Size ≈ 2.9 GB for hg19 with $\Delta_G = 4$

  9. Index Construction
     Index arrays: count array
     • Stores the number of occurrences of each seed
     • Size $= 4^{L_S} \times \log_2(N / \Delta_G)$ bits
     • Stores at most 255 locations per seed, so each count fits in one byte.
     • Seeds that appear more than 255 times are reduced by uniform selection.
     • Size = 1 GB for $L_S = 15$

  10. Index Construction
      Index arrays: ptrs array
      • Stores the starting index in the locs array for each group of seeds
      • Seed group size, $\delta$
      • Group id = seed / $\delta$ (integer division)
      • Size $= (4^{L_S} / \delta) \times \log_2(N / \Delta_G)$ bits
      • Size = 0.5 GB for $\delta = 8$, $\Delta_G = 4$

  11. Index Construction
      Index arrays
      • $L_S = 15$, $\Delta_G = 4$, $\delta = 8$, hg19
      • Total size of the index arrays = 2.9 + 1 + 0.5 = 4.4 GB
      • Space-time tradeoff
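
The sizes quoted for the three arrays can be checked with a few lines of arithmetic. The sketch below assumes a genome length of about 3.1 billion bases for hg19, 4-byte entries for the locs and ptrs arrays, and 1-byte counts; these widths are assumptions chosen to match the slide figures, and `index_sizes_gib` is a hypothetical helper.

```python
def index_sizes_gib(genome_len, seed_len, genome_step, group_size,
                    loc_bytes=4, count_bytes=1, ptr_bytes=4):
    """Return (locs, counts, ptrs) array sizes in GiB under the assumed entry widths."""
    gib = 2 ** 30
    locs = (genome_len // genome_step) * loc_bytes / gib       # one entry per indexed position
    counts = (4 ** seed_len) * count_bytes / gib               # one entry per possible seed
    ptrs = (4 ** seed_len // group_size) * ptr_bytes / gib     # one entry per seed group
    return locs, counts, ptrs


# Assumed genome length of roughly 3.1e9 bases for hg19.
locs, counts, ptrs = index_sizes_gib(3_100_000_000, seed_len=15, genome_step=4, group_size=8)
total = locs + counts + ptrs
```

Under these assumptions the total rounds to 4.4 GiB, in line with the slide. Increasing the indexing step or the group size shrinks the index at the cost of more work per lookup, which is the space-time tradeoff the slide refers to.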

  12. Index Construction
      Accessing the index
      Count array
      • Assume seed = $i + 4$
      • It belongs to seed group $(i, i + \delta - 1)$, with $\delta = 8$ and $i \bmod \delta = 0$
      • Seed index in group: $k = (i + 4) \bmod \delta$
      • $C_{k=4} = \text{count}[i + 4]$
      Ptrs array
      • $j = \text{seed} / \delta$
      • Locs group index: $\text{Lgi} = \text{ptrs}[j]$
      • Locs seed index: $\text{Lsi} = \text{Lgi} + \sum_{n=0}^{k-1} C_n$, where $C_n = \text{count}[i + n]$
      Locs array
      • Extract locations from $(\text{Lsi}, \text{Lsi} + C_k - 1)$
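
The three arrays and the lookup described on this slide can be sketched on a toy scale as follows. The names `build_index` and `lookup` are hypothetical, and the evenly spaced subsample is only one possible reading of "uniform selection". The sketch assumes the number of possible seeds is a multiple of the group size, and it uses Python lists, so it is practical only for small seed lengths.

```python
def build_index(seed_positions, seed_len, group_size, max_count=255):
    """Build (count, ptrs, locs) from an iterable of (seed, position) pairs."""
    num_seeds = 4 ** seed_len
    buckets = {}
    for seed, pos in seed_positions:
        buckets.setdefault(seed, []).append(pos)

    count = [0] * num_seeds
    ptrs = [0] * (num_seeds // group_size)
    locs = []
    for seed in range(num_seeds):
        if seed % group_size == 0:
            # First seed of a group: record where the group starts in locs.
            ptrs[seed // group_size] = len(locs)
        positions = buckets.get(seed, [])
        if len(positions) > max_count:
            # Assumed form of "uniform selection": an evenly spaced subsample.
            stride = len(positions) / max_count
            positions = [positions[int(n * stride)] for n in range(max_count)]
        count[seed] = len(positions)
        locs.extend(positions)
    return count, ptrs, locs


def lookup(seed, count, ptrs, locs, group_size):
    """Return the stored locations of a seed using the count and ptrs arrays."""
    k = seed % group_size                  # seed index inside its group
    group_start = seed - k                 # first seed of the group (i on the slide)
    lgi = ptrs[seed // group_size]         # locs group index
    lsi = lgi + sum(count[group_start:group_start + k])   # locs seed index
    return locs[lsi:lsi + count[seed]]
```

Storing one pointer per group rather than per seed means a lookup sums at most group_size - 1 counts, in exchange for a ptrs array that is group_size times smaller.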

  13. Index Construction
      [Figure: cumulative distribution of seed counts, Pr(count <= x), for x from 1 to 56; x-axis: seed count, y-axis: probability from 0.5 to 1.]

  14. Mapping
      Seed & hash
      • Read step size, $\Delta_R$
      • Read length, $L_R$
      • $N_{\text{seeds}} = \Delta_G \times (L_R - L_S) / \Delta_R$
      Locate candidate alignment locations (CALs)
      • Each thread is assigned to a specific seed.
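
One plausible reading of the seed-count formula is that, at every read step, the mapper takes as many consecutive seeds as the indexing step size, so that at least one of them lines up with an indexed genome position. The sketch below illustrates that reading only; it is an assumption rather than something the slides state, and `read_seed_offsets` is a hypothetical name.

```python
def read_seed_offsets(read_len, seed_len, read_step, genome_step):
    """Return read offsets of the seeds to hash, under the assumed windowing scheme."""
    offsets = []
    for window in range(0, read_len - seed_len, read_step):
        # genome_step consecutive offsets cover every residue modulo genome_step,
        # so one of them coincides with an indexed genome position.
        for shift in range(genome_step):
            start = window + shift
            if start + seed_len <= read_len:
                offsets.append(start)
    return offsets


offsets = read_seed_offsets(read_len=100, seed_len=15, read_step=10, genome_step=4)
```

With these example values the sketch yields 36 offsets, close to the 4 x (100 - 15) / 10 = 34 that the formula gives; the difference comes from how the last partial window is rounded.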

  15. Mapping
      Merge CALs and weights
      • When merging CALs, if two CALs are within a threshold distance, the weight of the second is added to the weight of the first.
      • For efficiency, Masher consists of two main loops.
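
The merge rule can be sketched as a single pass over CALs sorted by location. This is a sequential illustration, not the GPU implementation; the tuple layout `(location, weight)` and the name `merge_cals` are assumptions.

```python
def merge_cals(cals, threshold):
    """Merge (location, weight) pairs that lie within threshold of a kept CAL."""
    merged = []
    for loc, weight in sorted(cals):
        if merged and loc - merged[-1][0] <= threshold:
            # Close to the last kept CAL: fold this weight into it.
            kept_loc, kept_weight = merged[-1]
            merged[-1] = (kept_loc, kept_weight + weight)
        else:
            merged.append((loc, weight))
    return merged
```

For example, `merge_cals([(100, 1), (103, 1), (500, 2), (101, 1)], 8)` folds the three CALs near location 100 into one entry with weight 3 and leaves the CAL at 500 untouched.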

  16. Mapping
      Sorting and batching CALs
      • CALs are sorted and placed in batches according to their weights.
      • At this stage, CALs with low weight can be filtered out.

  17. Mapping
      Sorting and batching CALs
      • CALs are sorted and placed in batches according to their weights.
      • At this stage, CALs with low weight can be filtered out.
      Bounded local alignment
      • A parameterized variant of the Smith-Waterman (SW) algorithm supporting affine gap scoring.
      • In bounded alignment, only the matrix cells $(i, j)$ where $|i - j| \le w$ are visited and scored.
      • Masher does two passes and sets $w$ to 4 and 16, respectively.
      • Each GPU block performs multiple SW alignments in parallel.
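
A bounded (banded) Smith-Waterman with affine gaps can be sketched as below. This is a plain sequential version for illustration, not the GPU kernel; the scoring values (match 2, mismatch -3, gap open -5, gap extend -2) and the name `banded_sw` are assumptions, since the slides do not list Masher's parameters.

```python
NEG = float("-inf")


def banded_sw(read, ref, w, match=2, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Best local alignment score using only cells (i, j) with |i - j| <= w.

    The first gap character costs gap_open; each further one costs gap_extend.
    """
    n, m = len(read), len(ref)
    # H: best score ending at (i, j); E/F: best score ending in a gap.
    # Cells outside the band keep H = 0, which acts like a fresh local start.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

A narrow band is cheap but cannot follow alignments that drift far from the diagonal: with these assumed scores, `banded_sw("ACGTACGT", "ACGTGGACGT", 1)` misses the two-base gap that `banded_sw("ACGTACGT", "ACGTGGACGT", 2)` can span. This illustrates why a second pass with a wider band can recover alignments the first pass misses.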

  18. Experiments and Results
      Platform
      • Intel Core i7-960 CPU clocked at 3.2 GHz, with 4 Hyper-Threading cores and 24 GB of DDR3 memory.
      • Tesla K20c GPU with 4.8 GB of global memory.
      • CUDA 5.0 and GCC 4.2.4.
      Human genome and simulated reads
      • Human genome hg19
      • Wgsim simulator: 100K reads of length 100, 300, 500, and 1000, with error rates of 2%, 4%, 6%, and 8%.

  19. Experiments and Results
      Metrics for comparison
      • Sensitivity: the percentage of reads that are aligned.
      • Accuracy: the percentage of aligned reads that are correctly aligned to the simulator locations.
      • Execution time: only alignment time was measured.
      • The lower bound for a valid alignment score is set to $\text{score}_{LB} = L_R \times (1.9 - 0.5 \times \text{Error Rate})$
      Two modes of Masher
      • Normal mode, $\Delta_R = 0.7 L_R$
      • Fast mode, $\Delta_R = L_R$
      Comparison with
      • Bowtie2 (sensitive and fast), 8 threads
      • SOAP3-dp
      • CUSHAW2-GPU
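
The two quality metrics can be computed as in the following sketch. The dictionary layout, the `tolerance` parameter (how far a reported position may be from the simulated one and still count as correct), and the name `sensitivity_and_accuracy` are assumptions made for illustration; the slides do not say how a correct alignment is decided.

```python
def sensitivity_and_accuracy(total_reads, aligned, truth, tolerance=0):
    """Return (sensitivity %, accuracy %).

    aligned: {read_id: reported position} for reads the mapper aligned.
    truth:   {read_id: simulated position} for all reads.
    """
    correct = sum(
        1 for read_id, pos in aligned.items()
        if abs(pos - truth[read_id]) <= tolerance
    )
    sensitivity = 100.0 * len(aligned) / total_reads
    accuracy = 100.0 * correct / len(aligned) if aligned else 0.0
    return sensitivity, accuracy
```

Because accuracy is measured over aligned reads only, a mapper can trade sensitivity for accuracy by rejecting low-scoring alignments, which is why the results report both.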

  20. Experiments and Results: $L_R$ = 100 bp
      [Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, SOAP3-dp, and CUSHAW2-GPU.]

  21. Experiments and Results: $L_R$ = 500 bp
      [Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]

  22. Experiments and Results: $L_R$ = 1000 bp
      [Figure: sensitivity (%) and accuracy (%) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]

  23. Experiments and Results: $L_R$ = 100 bp
      [Figure: execution time in seconds (log scale) versus error rate (2%, 4%, 6%, 8%) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, SOAP3-dp, and CUSHAW2-GPU.]
