Lecture 16: Mapping Reads to a Reference
1
Fall 2019 November 12,14, 2019
Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, - - PowerPoint PPT Presentation
Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, 2019 1 Next-Gen Sequencing Read Mapping Align reads to genome/transcriptome Input: Many reads, length l strings R 1 ,...,R m and approximate reference genome string G
1
Fall 2019 November 12,14, 2019
Align reads to genome/transcriptome Input: Many reads, length l strings R1,...,Rm and approximate
reference genome string G
Output: positions in G where reads match Complications:
2
Build suffix tree for G For each read Ri
Time complexity?
3
Build suffix tree for G For each read Ri
Time complexity: O(lm+|G|) for exact matching Can store human genome text in 750M bytes (6G bits) but, need
~64G bytes for the tree
Large constants, hard to implement
4
Preprocessing:
– Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H
For each Ri report H(Si) Time complexity?
5
Preprocessing:
– Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H
For each Ri report H(Si) Time complexity: O(l(m+|G|)) Problems:
6
Campagna D, Albiero A, Bilardi A, Caniato E, Forcato C, Manavski S, Vitulo N, Valle
Pack strings into bit vectors Do not hash every position
7
11100010111011100001
Hash the reads, and scan using the genome. Implemented by the MAQ tool Memory usage: depends on number of reads
8
Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. 2008 Nov;18(11):1851-8.
If a read has one mismatch with
respect to its correct position, then partitioning that read into two makes sure one part is still exactly matched.
For two mismatches: create six
templates, some of which are noncontiguous, so that each template covers half the read and has a partner template that complements it to form the full read.
Related to the idea of spaced seeds
9
les. hash
T1 T2 T3 T4 T5 T6
$acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac
Burrows-Wheeler Matrix (BWM) BWT
$acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac
BWT(G) is often highly compressible BWT
See the suffix array?
Bowtie/BWA: the first methods to use BWT for read mapping
keep the memory footprint small
The human genome can be stored in ~2Gb
BWA: http://bio-bwa.sourceforge.net Bowtie: http://bowtie-bio.sourceforge.net/ (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10:R25 (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760
15
See: http://wwwdev.ebi.ac.uk/fg/hts_mappers/ and the corresponding paper Tools for mapping high-throughput sequencing data Nuno A. Fonseca, Johan Rung, Alvis Brazma and John C. Marioni DNA: blue, RNA: in red, miRNA: green Bisulphite: purple.
16
Qs = -10 log10 Pr(base is incorrect)
Mappers typically use Phred scores in one of two ways:
advantage of the fact that for the Illumina platform, sequencing accuracy decreases from 5’ to 3’)
17
General mappers (BLAT, SSAHA, Exonerate) were designed for
aligning any sequences and the source of data is irrelevant.
Some mappers are designed for data coming from a specific
platform (e.g. Illumina, SOLiD).
Some mappers exploit specific biases associated with a
sequencing platform.
Some mappers use Phred quality scores for scoring a match.
18
Unspliced reads (Black)
Spliced reads (Blue)
Canonical Splice Site Consensus: GT---AG GT Occurs about 1 billion times in the 3 Billion nucleotide long
human DNA
Donor consensus
Acceptor consensus
20 Donor splice site (5’) Acceptor splice site (3’)
Strategy:
fast indexing tool.
De Bona, F. et al., Optimal Spliced Alignments of Short Sequence Reads, ECCB08/Bioinformatics, 24 (16):i174, 2008. Images from http://www.raetschlab.org/lectures/qpalma-ismb08-short-SIG.pdf
QPALMA:Optimal Spliced Alignments of Reads
Based on PALMA
Q = Quality
Information Used in QPALMA
Splice site classifiers (support vector machine)
Training
Obtained by aligning EST and cDNA sequences using BLAT (or any program for spliced alignments for long sequences) High quality alignments can be used as positive examples Other decoy sites used as negative examples Window of length 141 nt around the splice site
PALMapper is an “Intelligent” variant of SW that changes
the scoring matrix in certain ways to incorporate
To perform spliced alignments of short read sequences
against a genome
24
25
Classical scoring f : Σ × Σ → R Source of Information Sequence matches
Source of Information Sequence matches Computational splice site predictions
26
Penalizes the scoring if the classifier does not consider that there is a plausible intron.
Source of Information Sequence matches Computational splice site predictions Intron length model
27
Penalizes the scoring if the intron length is not “expected”
Quality scoring f : (Σ × R) × Σ → R Source of Information Sequence matches Computational splice site predictions Intron length model Read quality information
28
QE(i): the quality score of the read at position i
go = intron opening score
ge = intron extension score
facc(i) = score of an acceptor site
fdon(i) = score of a donor site
Intron scoring
31
(% of correct alignments)
33
As read length becomes larger more reads require spliced
alignment
There are many tools for this task: v TopHat (uses Bowtie for ungapped mapping) v MapSplice (uses Bowtie) v RUM (uses Bowtie and BLAT) v PASS v ARYANA
34
35
Figure from: Systematic evaluation of spliced alignment programs for RNA-seq data Nature Methods (2013) http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2722.html