lecture 16 mapping reads to a reference
play

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, - PowerPoint PPT Presentation

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, 2019 1 Next-Gen Sequencing Read Mapping Align reads to genome/transcriptome Input: Many reads, length l strings R 1 ,...,R m and approximate reference genome string G


  1. Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, 2019 1

  2. Next-Gen Sequencing Read Mapping — Align reads to genome/transcriptome — Input: Many reads, length l strings R 1 ,...,R m and approximate reference genome string G — Output: positions in G where reads match — Complications: ◦ Millions of reads to analyze ◦ Repetitive regions: non-unique mapping ◦ Mutations/errors: imperfect match, incorrect mapping ◦ Paired end reads ◦ Numbers: ~100M reads, |G| ~ 3x10 9 , read length: 100-200bp 2

  3. Solution – Suffix Tree — Build suffix tree for G — For each read R i ◦ Find matches of R i to G by tree traversal — Time complexity? 3

  4. Solution – Suffix Tree — Build suffix tree for G — For each read R i ◦ Find matches of R i to G by tree traversal — Time complexity: O( l m+|G|) for exact matching — Can store human genome text in 750M bytes (6G bits) but, need ~64G bytes for the tree — Large constants, hard to implement 4

  5. Solution – Hashed Index — Preprocessing: – Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H — For each R i report H(S i ) — Time complexity? 5

  6. Solution – Hashed Index — Preprocessing: – Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H — For each R i report H(S i ) — Time complexity: O( l( m+|G|)) — Problems: ◦ Only exact matching, memory O(|G|) ◦ Implemented by PASS Campagna D, Albiero A, Bilardi A, Caniato E, Forcato C, Manavski S, Vitulo N, Valle G. PASS: a program to align short sequences. Bioinformatics. 2009 Apr 1;25(7):967-8 6

  7. � Improvements � � — Pack strings into bit vectors � TGAGTGTGAC 11100010111011100001 — Do not hash every position 7

  8. Solution – Hashed Reads — Hash the reads, and scan using the genome. — Implemented by the MAQ tool — Memory usage: depends on number of reads Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. 2008 Nov;18(11):1851-8. 8

  9. Handling mismatches les. — If a read has one mismatch with respect to its correct position, then hash T 1 partitioning that read into two T 2 makes sure one part is still exactly � matched. � — For two mismatches: create six T 3 T 4 templates, some of which are T 5 noncontiguous, so that each and T 6 template covers half the read and has a partner template that complements it to form the full read. — Related to the idea of spaced seeds 9

  10. Solution - Burrows-Wheeler Transform (BWT) BWT $acaacg aacg$ac acaacg$ gc$aaac acaacg$ acg$aca caacg$a cg$acaa g$acaac Burrows-Wheeler Matrix (BWM)

  11. Solution - Burrows-Wheeler Transform (BWT) BWT $acaacg aacg$ac acaacg$ gc$aaac acaacg$ acg$aca caacg$a cg$acaa g$acaac BWT(G) is often highly compressible

  12. Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

  13. Burrows-Wheeler Matrix $acaacg aacg$ac 3 1 acaacg$ 4 acg$aca See the suffix array? 2 caacg$a 5 cg$acaa g$acaac 6

  14. Bowtie/BWA Bowtie/BWA: the first methods to use BWT for read mapping ◦ Optimized to run against mammalian sized genomes ◦ Genome is indexed using the Burrows-Wheeler algorithm to keep the memory footprint small – The human genome can be stored in ~2Gb BWA: Bowtie: http://bio-bwa.sourceforge.net http://bowtie-bio.sourceforge.net/ (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10:R25 (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760

  15. Lots of Read Mappers! DNA: blue, RNA: in red, miRNA: green Bisulphite: purple. See: http://wwwdev.ebi.ac.uk/fg/hts_mappers/ and the corresponding paper Tools for mapping high-throughput sequencing data Nuno A. Fonseca, Johan Rung, Alvis Brazma and John C. Marioni 15

  16. Phred Quality Scores Q s = -10 log 10 Pr(base is incorrect) 16

  17. Using Phred Scores — Mappers typically use Phred scores in one of two ways: ◦ Incorporate them into the score of a match ◦ Trim the 3’ end of a read when its quality is deemed too low (takes advantage of the fact that for the Illumina platform, sequencing accuracy decreases from 5’ to 3’) 17

  18. Read mapping software — General mappers (BLAT, SSAHA, Exonerate) were designed for aligning any sequences and the source of data is irrelevant. — Some mappers are designed for data coming from a specific platform (e.g. Illumina, SOLiD). — Some mappers exploit specific biases associated with a sequencing platform. — Some mappers use Phred quality scores for scoring a match. 18

  19. Ungapped vs. Spliced Alignment — Unspliced reads (Black) ◦ Map within exactly one exon — Spliced reads (Blue) ◦ Overlap two or more exons ◦ A more challenging problem than ungapped read mapping

  20. Splice Sites Acceptor splice site (3’) Donor splice site (5’) — Canonical Splice Site Consensus: GT---AG — GT Occurs about 1 billion times in the 3 Billion nucleotide long human DNA — Donor consensus ◦ A 64 G 73 G 100 T 100 A 62 A 68 G 84 T 63 — Acceptor consensus ◦ C 65 A 100 G 100 20

  21. PALMapper — Strategy: ◦ Identify unspliced alignments and seed regions for spliced reads using a fast indexing tool. ◦ Infer spliced alignments from seed regions. De Bona, F. et al., Optimal Spliced Alignments of Short Sequence Reads, ECCB08/Bioinformatics, 24 (16):i174, 2008. Images from http://www.raetschlab.org/lectures/qpalma-ismb08-short-SIG.pdf

  22. QPALMA:Optimal Spliced Alignments of Reads — Based on PALMA ◦ Uses Computational Splice Site Prediction — Q = Quality ◦ Uses a read’s quality information — Information Used in QPALMA ◦ Read quality ◦ Splice site models ◦ Intron length m odels

  23. Splice site prediction — Splice site classifiers (support vector machine) ◦ Predictions of d onor sites ◦ Predictions of acceptor sites — Training ◦ Known Acceptor & Donor Sites – Obtained by aligning EST and cDNA sequences using BLAT (or any program for spliced alignments for long sequences) – High quality alignments can be used as positive examples – Other decoy sites used as negative examples – Window of length 141 nt around the splice site

  24. PALMapper uses Smith Waterman — PALMapper is an “Intelligent” variant of SW that changes the scoring matrix in certain ways to incorporate ◦ Sequence Quality Scores ◦ Donor & Acceptor Splice Site Prediction Scores ◦ Intron Length Model — To perform spliced alignments of short read sequences against a genome 24

  25. Smith Waterman Source of Information Sequence matches Classical scoring f : Σ × Σ → R 25

  26. Extensions of SW Penalizes the scoring if the classifier does not consider that there is a plausible intron. Source of Information Sequence matches Computational splice site predictions 26

  27. Extensions of SW Penalizes the scoring if the intron length is not “expected” Source of Information Sequence matches Computational splice site predictions Intron length model 27

  28. Incorporating Quality Scores Source of Information Sequence matches Computational splice site predictions Intron length model Read quality information Quality scoring f : ( Σ × R ) × Σ → R 28

  29. Incorporating Quality Scores — Q E (i): the quality score of the read at position i

  30. The Extended SW g o = intron opening score — g e = intron extension score — f acc (i) = score of an acceptor site — f don (i) = score of a donor site — Intron scoring

  31. Training 31

  32. Performance of Various Extensions — (% of correct alignments)

  33. Error depends on where the read falls within an intron 33

  34. Summary — As read length becomes larger more reads require spliced alignment — There are many tools for this task: v TopHat (uses Bowtie for ungapped mapping) v MapSplice (uses Bowtie) v RUM (uses Bowtie and BLAT) v PASS v ARYANA 34

  35. A comparison of spliced alignment tools Figure from: Systematic evaluation of spliced alignment programs for RNA-seq data Nature Methods (2013) http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2722.html 35

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend