Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, - - PowerPoint PPT Presentation

lecture 16 mapping reads to a reference
SMART_READER_LITE
LIVE PREVIEW

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, - - PowerPoint PPT Presentation

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, 2019 1 Next-Gen Sequencing Read Mapping Align reads to genome/transcriptome Input: Many reads, length l strings R 1 ,...,R m and approximate reference genome string G


slide-1
SLIDE 1

Lecture 16: Mapping Reads to a Reference

1

Fall 2019 November 12,14, 2019

slide-2
SLIDE 2

Next-Gen Sequencing Read Mapping

— Align reads to genome/transcriptome — Input: Many reads, length l strings R1,...,Rm and approximate

reference genome string G

— Output: positions in G where reads match — Complications:

  • Millions of reads to analyze
  • Repetitive regions: non-unique mapping
  • Mutations/errors: imperfect match, incorrect mapping
  • Paired end reads
  • Numbers: ~100M reads, |G| ~ 3x109, read length: 100-200bp

2

slide-3
SLIDE 3

Solution – Suffix Tree

— Build suffix tree for G — For each read Ri

  • Find matches of Ri to G by tree traversal

— Time complexity?

3

slide-4
SLIDE 4

Solution – Suffix Tree

— Build suffix tree for G — For each read Ri

  • Find matches of Ri to G by tree traversal

— Time complexity: O(lm+|G|) for exact matching — Can store human genome text in 750M bytes (6G bits) but, need

~64G bytes for the tree

— Large constants, hard to implement

4

slide-5
SLIDE 5

Solution – Hashed Index

— Preprocessing:

– Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H

— For each Ri report H(Si) — Time complexity?

5

slide-6
SLIDE 6

Solution – Hashed Index

— Preprocessing:

– Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H

— For each Ri report H(Si) — Time complexity: O(l(m+|G|)) — Problems:

  • Only exact matching, memory O(|G|)
  • Implemented by PASS

6

Campagna D, Albiero A, Bilardi A, Caniato E, Forcato C, Manavski S, Vitulo N, Valle

  • G. PASS: a program to align short sequences. Bioinformatics. 2009 Apr 1;25(7):967-8
slide-7
SLIDE 7

Improvements

— Pack strings into bit vectors — Do not hash every position

7

  • TGAGTGTGAC

11100010111011100001

slide-8
SLIDE 8

Solution – Hashed Reads

— Hash the reads, and scan using the genome. — Implemented by the MAQ tool — Memory usage: depends on number of reads

8

Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. 2008 Nov;18(11):1851-8.

slide-9
SLIDE 9

Handling mismatches

— If a read has one mismatch with

respect to its correct position, then partitioning that read into two makes sure one part is still exactly matched.

— For two mismatches: create six

templates, some of which are noncontiguous, so that each template covers half the read and has a partner template that complements it to form the full read.

— Related to the idea of spaced seeds

9

les. hash

  • and

T1 T2 T3 T4 T5 T6

slide-10
SLIDE 10

Solution - Burrows-Wheeler Transform (BWT) acaacg$

$acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

gc$aaac

Burrows-Wheeler Matrix (BWM) BWT

slide-11
SLIDE 11

Solution - Burrows-Wheeler Transform (BWT) acaacg$

$acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

gc$aaac

BWT(G) is often highly compressible BWT

slide-12
SLIDE 12

Burrows-Wheeler Matrix

$acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

slide-13
SLIDE 13

Burrows-Wheeler Matrix

$acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

See the suffix array?

3 1 4 2 5 6

slide-14
SLIDE 14

Bowtie/BWA

Bowtie/BWA: the first methods to use BWT for read mapping

  • Optimized to run against mammalian sized genomes
  • Genome is indexed using the Burrows-Wheeler algorithm to

keep the memory footprint small

– The human genome can be stored in ~2Gb

BWA: http://bio-bwa.sourceforge.net Bowtie: http://bowtie-bio.sourceforge.net/ (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10:R25 (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760

slide-15
SLIDE 15

Lots of Read Mappers!

15

See: http://wwwdev.ebi.ac.uk/fg/hts_mappers/ and the corresponding paper Tools for mapping high-throughput sequencing data Nuno A. Fonseca, Johan Rung, Alvis Brazma and John C. Marioni DNA: blue, RNA: in red, miRNA: green Bisulphite: purple.

slide-16
SLIDE 16

Phred Quality Scores

16

Qs = -10 log10 Pr(base is incorrect)

slide-17
SLIDE 17

Using Phred Scores

— Mappers typically use Phred scores in one of two ways:

  • Incorporate them into the score of a match
  • Trim the 3’ end of a read when its quality is deemed too low (takes

advantage of the fact that for the Illumina platform, sequencing accuracy decreases from 5’ to 3’)

17

slide-18
SLIDE 18

Read mapping software

— General mappers (BLAT, SSAHA, Exonerate) were designed for

aligning any sequences and the source of data is irrelevant.

— Some mappers are designed for data coming from a specific

platform (e.g. Illumina, SOLiD).

— Some mappers exploit specific biases associated with a

sequencing platform.

— Some mappers use Phred quality scores for scoring a match.

18

slide-19
SLIDE 19

Ungapped vs. Spliced Alignment

— Unspliced reads (Black)

  • Map within exactly one exon

— Spliced reads (Blue)

  • Overlap two or more exons
  • A more challenging problem than ungapped read mapping
slide-20
SLIDE 20

Splice Sites

— Canonical Splice Site Consensus: GT---AG — GT Occurs about 1 billion times in the 3 Billion nucleotide long

human DNA

— Donor consensus

  • A64G73G100T100A62A68G84T63

— Acceptor consensus

  • C65A100G100

20 Donor splice site (5’) Acceptor splice site (3’)

slide-21
SLIDE 21

PALMapper

— Strategy:

  • Identify unspliced alignments and seed regions for spliced reads using a

fast indexing tool.

  • Infer spliced alignments from seed regions.

De Bona, F. et al., Optimal Spliced Alignments of Short Sequence Reads, ECCB08/Bioinformatics, 24 (16):i174, 2008. Images from http://www.raetschlab.org/lectures/qpalma-ismb08-short-SIG.pdf

slide-22
SLIDE 22

QPALMA:Optimal Spliced Alignments of Reads

— Based on PALMA

  • Uses Computational Splice Site Prediction

— Q = Quality

  • Uses a read’s quality information

— Information Used in QPALMA

  • Read quality
  • Splice site models
  • Intron length models
slide-23
SLIDE 23

Splice site prediction

— Splice site classifiers (support vector machine)

  • Predictions of donor sites
  • Predictions of acceptor sites

— Training

  • Known Acceptor & Donor Sites

– Obtained by aligning EST and cDNA sequences using BLAT (or any program for spliced alignments for long sequences) – High quality alignments can be used as positive examples – Other decoy sites used as negative examples – Window of length 141 nt around the splice site

slide-24
SLIDE 24

PALMapper uses Smith Waterman

— PALMapper is an “Intelligent” variant of SW that changes

the scoring matrix in certain ways to incorporate

  • Sequence Quality Scores
  • Donor & Acceptor Splice Site Prediction Scores
  • Intron Length Model

— To perform spliced alignments of short read sequences

against a genome

24

slide-25
SLIDE 25

Smith Waterman

25

Classical scoring f : Σ × Σ → R Source of Information Sequence matches

slide-26
SLIDE 26

Extensions of SW

Source of Information Sequence matches Computational splice site predictions

26

Penalizes the scoring if the classifier does not consider that there is a plausible intron.

slide-27
SLIDE 27

Extensions of SW

Source of Information Sequence matches Computational splice site predictions Intron length model

27

Penalizes the scoring if the intron length is not “expected”

slide-28
SLIDE 28

Incorporating Quality Scores

Quality scoring f : (Σ × R) × Σ → R Source of Information Sequence matches Computational splice site predictions Intron length model Read quality information

28

slide-29
SLIDE 29

Incorporating Quality Scores

— QE(i): the quality score of the read at position i

slide-30
SLIDE 30

The Extended SW

—

go = intron opening score

—

ge = intron extension score

—

facc(i) = score of an acceptor site

—

fdon(i) = score of a donor site

Intron scoring

slide-31
SLIDE 31

Training

31

slide-32
SLIDE 32

Performance of Various Extensions

— (% of correct alignments)

slide-33
SLIDE 33

Error depends on where the read falls within an intron

33

slide-34
SLIDE 34

Summary

— As read length becomes larger more reads require spliced

alignment

— There are many tools for this task: v TopHat (uses Bowtie for ungapped mapping) v MapSplice (uses Bowtie) v RUM (uses Bowtie and BLAT) v PASS v ARYANA

34

slide-35
SLIDE 35

A comparison of spliced alignment tools

35

Figure from: Systematic evaluation of spliced alignment programs for RNA-seq data Nature Methods (2013) http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2722.html