Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs

SLIDE 1

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs

Anas Abu-Doleh (1,2), Erik Saule (1), Kamer Kaya (1), and Ümit V. Çatalyürek (1,2)

(1) Department of Biomedical Informatics
(2) Department of Electrical and Computer Engineering

The Ohio State University

SLIDE 2

Outline

I. Introduction

  • Motivation
  • Contribution
  • Related Work

II. Masher Workflow

  • Index Construction
  • Mapping

III. Experiments and Results

IV. Conclusion and Future Work

A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs”

ACM-BCB13 23 Sep 2013

SLIDE 3

Motivation

  • The read length of next-generation sequencing (NGS) devices keeps increasing, so there is wide interest in efficient and accurate mapping of long(er) reads.
  • The parallel processing capabilities of GPUs can be used to improve the mapping of NGS reads.

SLIDE 4

Related Work and Contributions

Related Work

  • Burrows-Wheeler Transform (BWT)
    • Bowtie2
    • CUSHAW2
    • SOAP3-dp
  • Hash Indexing
    • SeqAlto
    • BFAST

Contribution

  • A novel hash-based indexing technique in which:
    • For large genomes, the memory footprint is small enough for the index to be stored on a restricted-memory device such as a GPU.
    • The index data structure is better suited to GPU parallelization.

SLIDE 5

Masher workflow

SLIDE 6

Index Construction

Processing genome file

  • Convert base pairs to a 2-bit format.
  • Replace each N with A.

SLIDE 7

Index Construction

Processing genome file

  • Convert base pairs to a 2-bit format.
  • Replace each N with A.

Indexing

  • Seed length, LS
  • Indexing step size, ∆G
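
The two indexing parameters can be illustrated with a small sketch that walks the genome in steps of ∆G and turns each window of LS bases into an integer seed value. The 2-bit code and the names `seed_value` and `genome_seeds` are assumptions for illustration only.

```python
# Assumed 2-bit code; N is mapped to A as in the preprocessing step.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def seed_value(seq: str) -> int:
    """Interpret a string of bases as a base-4 number (2 bits per base)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def genome_seeds(genome: str, ls: int, dg: int):
    """Yield (seed value, position) for every dg-th window of length ls."""
    clean = genome.upper().replace("N", "A")
    for pos in range(0, len(clean) - ls + 1, dg):
        yield seed_value(clean[pos:pos + ls]), pos
```

A seed of length LS takes one of 4^LS values, so a larger ∆G reduces the number of indexed positions while LS fixes the size of the seed space.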

SLIDE 8

Index Construction

Index arrays - Locations array

  • Genome length, N
  • Stores the indexed locations, in order, for each seed.
  • Locations array size = log2(N) × N/∆G bits
  • Size ≈ 2.9 GB for hg19 with ∆G = 4

SLIDE 9

Index Construction

Index arrays - Count array

  • Stores the number of occurrences of each seed.
  • Size = 4^LS × log2(N/∆G) bits
  • At most 255 locations are stored per seed.
  • For a seed that appears more than 255 times, a uniform selection of its locations is kept.
  • Size = 1 GB for LS = 15 (4^15 one-byte entries)

SLIDE 10

Index Construction

Index arrays - Ptrs array

  • Stores the starting index in the locs array for a group of seeds.
  • Seed group size, δ
  • Group id = seed/δ (integer division)
  • Size = 4^LS/δ × log2(N/∆G) bits
  • Size = 0.5 GB for δ = 8, ∆G = 4

SLIDE 11

Index Construction

Index arrays

  • LS = 15, ∆G = 4, δ = 8, hg19
  • Total size of the indexing arrays = 2.9 + 1 + 0.5 = 4.4 GB
  • Space-time tradeoff
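
The size figures above can be reproduced with a few lines of arithmetic. The sketch assumes a genome length of about 3.1 billion bases for hg19, 4-byte entries in the locs and ptrs arrays, 1-byte entries in the count array (consistent with the cap of 255), and sizes reported in units of 2^30 bytes; none of these choices is stated explicitly on the slides.

```python
def index_sizes_gib(n, ls, dg, delta, loc_bytes=4, count_bytes=1, ptr_bytes=4):
    """Return (locs, count, ptrs) array sizes in GiB under the assumed entry widths."""
    gib = 2 ** 30
    locs = (n // dg) * loc_bytes / gib           # one entry per indexed position
    count = (4 ** ls) * count_bytes / gib        # one entry per possible seed
    ptrs = (4 ** ls // delta) * ptr_bytes / gib  # one entry per seed group
    return locs, count, ptrs

# n is an approximate hg19 length, used here as an assumption.
locs, count, ptrs = index_sizes_gib(n=3_100_000_000, ls=15, dg=4, delta=8)
print(f"locs={locs:.1f} count={count:.1f} ptrs={ptrs:.1f} "
      f"total={locs + count + ptrs:.1f} GiB")
# prints: locs=2.9 count=1.0 ptrs=0.5 total=4.4 GiB
```

Increasing ∆G or δ shrinks the index but adds work at lookup time, which is the space-time tradeoff noted above.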

SLIDE 12

Index Construction

Accessing the Index

  • Count array
    • Assume seed = i + 4.
    • It belongs to seed group (i, i + δ − 1), with δ = 8 and i mod δ = 0.
    • Seed index in group, k = (i + 4) mod δ = 4
    • C_k = count[i + 4], with k = 4
  • Ptrs array
    • j = seed/δ
    • Locs group index (Lgi) = ptrs[j]
    • Locs seed index (Lsi) = Lgi + \sum_{n=0}^{k-1} C_n, where C_n = count[i + n]
  • Locs array
    • Extract locations from (Lsi, Lsi + C_k − 1)
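
The lookup described above can be tried out on a toy genome. The sketch below builds the three arrays with plain Python lists and then resolves a seed through ptrs, count, and locs. It is an illustration under assumptions: the names `build_index` and `lookup` are hypothetical, δ is assumed to divide 4^LS, and "uniform selection" is interpreted as taking evenly spaced locations.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def seed_hash(seq):
    """Base-4 value of a seed (2 bits per base)."""
    h = 0
    for b in seq:
        h = (h << 2) | CODE[b]
    return h

def build_index(genome, ls, dg, delta, cap=255):
    """Build locs, count, and ptrs for every dg-th seed of length ls."""
    buckets = defaultdict(list)
    for pos in range(0, len(genome) - ls + 1, dg):
        buckets[seed_hash(genome[pos:pos + ls])].append(pos)
    n_seeds = 4 ** ls
    count = [0] * n_seeds
    ptrs = [0] * (n_seeds // delta)      # assumes delta divides 4**ls
    locs = []
    for seed in range(n_seeds):
        if seed % delta == 0:            # first seed of a group
            ptrs[seed // delta] = len(locs)
        hits = buckets.get(seed, [])
        if len(hits) > cap:              # assumed meaning of "uniform selection"
            step = len(hits) / cap
            hits = [hits[int(i * step)] for i in range(cap)]
        count[seed] = len(hits)
        locs.extend(hits)
    return locs, count, ptrs

def lookup(seed, locs, count, ptrs, delta):
    """Return the stored locations of a seed: Lsi = Lgi + sum of earlier counts."""
    j, k = seed // delta, seed % delta
    i = seed - k                         # first seed of the group
    lsi = ptrs[j] + sum(count[i + n] for n in range(k))
    return locs[lsi:lsi + count[seed]]
```

For example, with `locs, count, ptrs = build_index("ACGTACGTACGT", 3, 1, 8)`, the call `lookup(seed_hash("ACG"), locs, count, ptrs, 8)` returns the three positions where ACG starts. Storing one pointer per group of δ seeds means a lookup sums at most δ − 1 counts, in exchange for a ptrs array δ times smaller.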

SLIDE 13

Index Construction

[Figure: cumulative distribution of seed counts, Pr(count <= x) on the y-axis (0.5 to 1) against seed count x on the x-axis (1 to 56).]

SLIDE 14

Mapping

Seed & hash

  • Read step size, ∆R
  • Read length, LR
  • Nseeds = ∆G × (LR − LS)/∆R

Locate candidate alignment locations (CALs)

  • Each thread is assigned to a specific seed.
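
A sequential sketch of the seed-and-locate step is given below. On the GPU each seed would be handled by its own thread; here a loop stands in for the threads, and a dictionary stands in for the index arrays. The seed layout (∆G consecutive offsets repeated every ∆R bases) is one assumed reading of the Nseeds formula, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_seed_table(genome, ls, dg):
    """Stand-in index: seed string -> genome positions, one seed every dg bases."""
    table = defaultdict(list)
    for pos in range(0, len(genome) - ls + 1, dg):
        table[genome[pos:pos + ls]].append(pos)
    return table

def read_seed_offsets(lr, ls, dg, dr):
    """Assumed layout: dg consecutive offsets, repeated every dr bases."""
    offsets = set()
    for start in range(0, lr - ls + 1, dr):
        for shift in range(dg):
            if start + shift <= lr - ls:
                offsets.add(start + shift)
    return sorted(offsets)

def locate_cals(read, table, ls, dg, dr):
    """Return a Counter mapping candidate read start (CAL) -> number of seed hits."""
    weights = Counter()
    for off in read_seed_offsets(len(read), ls, dg, dr):
        for loc in table.get(read[off:off + ls], ()):
            weights[loc - off] += 1      # project the hit back to the read start
    return weights
```

Taking ∆G consecutive offsets guarantees that at least one of them lines up with an indexed genome position, which is why the seed count per read grows with ∆G.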

SLIDE 15

Mapping

Merge CALs and weights

  • When merging CALs, if two CALs are within a threshold distance of each other, the weight of the second is added to the weight of the first.
  • For efficiency, Masher is organized as two main loops.
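
The merging rule can be sketched as a single pass over CALs sorted by location. This is an illustration only: the function name `merge_cals`, the inclusive threshold comparison, and measuring distance from the first CAL of each cluster are assumptions.

```python
def merge_cals(cals, threshold):
    """Merge (location, weight) pairs that lie within `threshold` of a kept CAL.

    The weight of each absorbed CAL is added to the first CAL of its cluster.
    """
    merged = []
    for loc, weight in sorted(cals):
        if merged and loc - merged[-1][0] <= threshold:
            merged[-1][1] += weight      # fold into the earlier CAL
        else:
            merged.append([loc, weight])
    return [tuple(m) for m in merged]
```

For example, `merge_cals([(100, 2), (103, 1), (500, 3)], 5)` folds the CAL at 103 into the one at 100 and leaves the CAL at 500 untouched.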

SLIDE 16

Mapping

Sorting and Batching CALs

  • The CALs are sorted and grouped into batches according to their weights.
  • At this stage, CALs with a low weight can be filtered out.
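
A minimal sketch of this stage, assuming that CALs are (location, weight) pairs, that heavier CALs should be aligned first, and that batches have a fixed size. The name `batch_cals` and the `min_weight` parameter are hypothetical.

```python
def batch_cals(cals, batch_size, min_weight=0):
    """Drop low-weight CALs, sort the rest by descending weight, and split into batches."""
    kept = [c for c in cals if c[1] >= min_weight]
    kept.sort(key=lambda c: c[1], reverse=True)
    return [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]
```

Processing the heaviest batch first gives the most promising candidates to the aligner before any lower-weight ones.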

SLIDE 17

Mapping

Sorting and Batching CALs

  • The CALs are sorted and grouped into batches according to their weights.
  • At this stage, CALs with a low weight can be filtered out.

Bounded local alignment

  • A parameterized variant of the Smith-Waterman (SW) algorithm supporting affine gap scoring.
  • Bounded alignment: only the matrix cells (i, j) where |i − j| <= w are visited and scored.
  • Masher does two passes, with w set to 4 and 16 respectively.
  • Each GPU block performs multiple SW alignments in parallel.
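
The bounded alignment can be sketched on the CPU as a banded Smith-Waterman with affine gaps (the Gotoh recurrences). This is not Masher's GPU kernel: the scoring values are placeholder assumptions, a gap of length g is assumed to cost gap_open + (g − 1) × gap_extend, and full matrices are allocated for clarity even though only the band is filled.

```python
NEG = float("-inf")

def banded_sw(a, b, w, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of a vs b, visiting only cells with |i - j| <= w."""
    n, m = len(a), len(b)
    # H: best score ending at (i, j); E: ending in a horizontal gap; F: vertical gap.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

A two-pass scheme like the one described above could call this first with w = 4 and fall back to w = 16 only for candidates whose score is still too low, so that most alignments pay for the narrow band only.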

SLIDE 18

Experiments and Results

Platform

  • Intel Core i7-960 CPU clocked at 3.2 GHz, 4 Hyper-Threading cores, 24 GB of DDR3 memory.
  • Tesla K20c GPU, 4.8 GB of global memory.
  • CUDA 5.0 and GCC 4.2.4.

Human genome and simulated reads

  • Human genome hg19
  • Wgsim simulator: 100K reads of length 100, 300, 500, and 1000, with error rates of 2%, 4%, 6%, and 8%.

SLIDE 19

Experiments and Results

Metrics for comparison

  • Sensitivity is the percentage of reads that are aligned.
  • Accuracy is the percentage of aligned reads that are correctly aligned to their simulator locations.
  • Execution time: only alignment time was measured.
  • The lower bound for a valid alignment score is set to scoreLB = LR × (1.9 − 0.5 × Error Rate)

Two modes of Masher

  • Normal mode, ∆R = 0.7 LR
  • Fast mode, ∆R = LR

Comparison with

  • Bowtie2 (sensitive and fast), 8 threads
  • SOAP3-dp
  • CUSHAW2-GPU
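
The two quality metrics can be computed as sketched below. The data layout (dictionaries keyed by read id) and the `tolerance` parameter for deciding that a reported location matches the simulated one are assumptions for illustration; the slides do not state how closely a location has to match.

```python
def sensitivity_and_accuracy(true_pos, aligned_pos, tolerance=0):
    """Return (sensitivity %, accuracy %).

    true_pos:    read id -> simulated location, for all reads
    aligned_pos: read id -> reported location, for aligned reads only
    """
    aligned = len(aligned_pos)
    correct = sum(1 for rid, pos in aligned_pos.items()
                  if abs(pos - true_pos[rid]) <= tolerance)
    sensitivity = 100.0 * aligned / len(true_pos)
    accuracy = 100.0 * correct / aligned if aligned else 0.0
    return sensitivity, accuracy
```

Because accuracy is measured over aligned reads only, a mapper can trade sensitivity for accuracy by refusing to report uncertain alignments, which is why the two figures are reported side by side.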

SLIDE 20

Experiments and Results

LR = 100 bps.

Sensitivity % by error rate

Tool            2%      4%      6%      8%
Masher          99.23   99.44   99.36   98.87
Masher-fast     97.55   96.81   94.5    89.83
Bowtie2         98.8    98      96      93.15
Bowtie2-fast    98      94.63   88.8    80.6
SOAP3-dp        98.5    92.5    81.7    67.7
CUSHAW2-GPU     99.9    99.9    98.8    96.2

Accuracy % by error rate

Tool            2%      4%      6%      8%
Masher          95.01   93.82   92.42   90.84
Masher-fast     95.49   94.44   93.07   91.49
Bowtie2         95.2    94      92.6    91.1
Bowtie2-fast    95      93.78   91.7    89.47
SOAP3-dp        96.2    95.5    94.5    93
CUSHAW2-GPU     95.2    94.3    93.2    91.9

SLIDE 21

Experiments and Results

LR = 500 bps.

Sensitivity % by error rate

Tool            2%      4%      6%      8%
Masher          99.89   99.84   99.74   99.62
Masher-fast     99.89   99.78   99.51   98.93
Bowtie2         99.9    99.9    99.9    99.9
Bowtie2-fast    99.9    99.8    99.34   97.7
SOAP3-dp        99.2    94.3    75.3    48.6

Accuracy % by error rate

Tool            2%      4%      6%      8%
Masher          97.69   97.2    96.83   96.25
Masher-fast     97.78   97.19   96.83   96.15
Bowtie2         98.2    98      97.8    97.4
Bowtie2-fast    98.1    97.8    97.6    97
SOAP3-dp        98.8    98.5    98.3    98

SLIDE 22

Experiments and Results

LR = 1000 bps.

Sensitivity % by error rate

Tool            2%      4%      6%      8%
Masher          100     100     100     100
Masher-fast     99.8    99.73   99.53   98.93
Bowtie2         99.99   99.9    99.9    99.8
Bowtie2-fast    99.9    99.9    99.8    99.5
SOAP3-dp        99.3    98.7    91.4    68.9

Accuracy % by error rate

Tool            2%      4%      6%      8%
Masher          98.5    98.28   97.86   97.41
Masher-fast     98.25   97.78   97.24   96.66
Bowtie2         98.5    98.3    97.5    96.43
Bowtie2-fast    98.5    98.1    97.3    96.1
SOAP3-dp        98.9    98.5    97.8    96

SLIDE 23

Experiments and Results

LR = 100 bps.

Execution time (sec.) by error rate (plotted in log scale)

Tool            2%      4%      6%      8%
Masher          9.1     8.6     9.3     9.4
Masher-fast     5       4.9     5.3     5.5
Bowtie2         11      10      9       9
Bowtie2-fast    5       5       5       5
SOAP3-dp        7.3     8.3     6.6     6.6
CUSHAW2-GPU     9.3     10.5    14.9    11.8

SLIDE 24

Experiments and Results

LR = 500 bps.

Execution time (sec.) by error rate (plotted in log scale)

Tool            2%      4%      6%      8%
Masher          15.1    19.1    24      31.7
Masher-fast     9.9     8.2     11      11.7
Bowtie2         134     160     165     180
Bowtie2-fast    100     111     117     123
SOAP3-dp        1010    734     522     333

SLIDE 25

Experiments and Results

LR = 1000 bps.

Execution time (sec.) by error rate (plotted in log scale)

Tool            2%      4%      6%      8%
Masher          17.6    18.5    20.4    21.8
Masher-fast     15.5    17.5    20.2    22
Bowtie2         456     567     662     752
Bowtie2-fast    345     403     452     491
SOAP3-dp        5607    4600    3206    2027

SLIDE 26

Experiments and Results

LR = 1000 bps, Error rate 2%

[Figure: two scatter plots of execution time (sec., log scale, 1 to 10000) against accuracy % and against sensitivity % (90 to 100) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]

SLIDE 27

Conclusion and future work

Conclusion

  • Masher is a fast and accurate short/long read mapper that uses a memory-efficient indexing scheme to reduce the size of a human genome index so that it fits in the memory of a GPU.
  • The results show that Masher produces accurate alignments.
  • Its speed is competitive with the tested state-of-the-art tools for reads of length less than 500 and an order of magnitude faster when the reads are longer than 500.

Future work

  • Making the software publicly available.
  • Improving Masher’s performance further through GPU-specific optimizations and better CPU/GPU pipelining.
  • Adding new features such as support for paired-end sequences or the FASTQ format.

SLIDE 28

Thanks

  • For more information, visit http://bmi.osu.edu/hpc
  • Acknowledgement of Support