Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, - PowerPoint PPT Presentation

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, 2019 1

Next-Gen Sequencing Read Mapping Align reads to genome/transcriptome Input: Many reads, length l strings R 1 ,...,R m and approximate reference genome string G Output: positions in G where reads match Complications: ◦ Millions of reads to analyze ◦ Repetitive regions: non-unique mapping ◦ Mutations/errors: imperfect match, incorrect mapping ◦ Paired end reads ◦ Numbers: ~100M reads, |G| ~ 3x10 9 , read length: 100-200bp 2

Solution – Suffix Tree Build suffix tree for G For each read R i ◦ Find matches of R i to G by tree traversal Time complexity? 3

Solution – Suffix Tree Build suffix tree for G For each read R i ◦ Find matches of R i to G by tree traversal Time complexity: O( l m+|G|) for exact matching Can store human genome text in 750M bytes (6G bits) but, need ~64G bytes for the tree Large constants, hard to implement 4

Solution – Hashed Index Preprocessing: – Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H For each R i report H(S i ) Time complexity? 5

Solution – Hashed Index Preprocessing: – Create hash-table H – For each position p=l,...,|G| Hash (key=G[p,...,p + l -1], value=p) in H For each R i report H(S i ) Time complexity: O( l( m+|G|)) Problems: ◦ Only exact matching, memory O(|G|) ◦ Implemented by PASS Campagna D, Albiero A, Bilardi A, Caniato E, Forcato C, Manavski S, Vitulo N, Valle G. PASS: a program to align short sequences. Bioinformatics. 2009 Apr 1;25(7):967-8 6

� Improvements � � Pack strings into bit vectors � TGAGTGTGAC 11100010111011100001 Do not hash every position 7

Solution – Hashed Reads Hash the reads, and scan using the genome. Implemented by the MAQ tool Memory usage: depends on number of reads Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. 2008 Nov;18(11):1851-8. 8

Handling mismatches les. If a read has one mismatch with respect to its correct position, then hash T 1 partitioning that read into two T 2 makes sure one part is still exactly � matched. � For two mismatches: create six T 3 T 4 templates, some of which are T 5 noncontiguous, so that each and T 6 template covers half the read and has a partner template that complements it to form the full read. Related to the idea of spaced seeds 9

Solution - Burrows-Wheeler Transform (BWT) BWT $acaacg aacg$ac acaacg$ gc$aaac acaacg$ acg$aca caacg$a cg$acaa g$acaac Burrows-Wheeler Matrix (BWM)

Solution - Burrows-Wheeler Transform (BWT) BWT $acaacg aacg$ac acaacg$ gc$aaac acaacg$ acg$aca caacg$a cg$acaa g$acaac BWT(G) is often highly compressible

Burrows-Wheeler Matrix $acaacg aacg$ac acaacg$ acg$aca caacg$a cg$acaa g$acaac

Burrows-Wheeler Matrix $acaacg aacg$ac 3 1 acaacg$ 4 acg$aca See the suffix array? 2 caacg$a 5 cg$acaa g$acaac 6

Bowtie/BWA Bowtie/BWA: the first methods to use BWT for read mapping ◦ Optimized to run against mammalian sized genomes ◦ Genome is indexed using the Burrows-Wheeler algorithm to keep the memory footprint small The human genome can be stored in ~2Gb BWA: Bowtie: http://bio-bwa.sourceforge.net http://bowtie-bio.sourceforge.net/ (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10:R25 (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760

Lots of Read Mappers! DNA: blue, RNA: in red, miRNA: green Bisulphite: purple. See: http://wwwdev.ebi.ac.uk/fg/hts_mappers/ and the corresponding paper Tools for mapping high-throughput sequencing data Nuno A. Fonseca, Johan Rung, Alvis Brazma and John C. Marioni 15

Phred Quality Scores Q s = -10 log 10 Pr(base is incorrect) 16

Using Phred Scores Mappers typically use Phred scores in one of two ways: ◦ Incorporate them into the score of a match ◦ Trim the 3’ end of a read when its quality is deemed too low (takes advantage of the fact that for the Illumina platform, sequencing accuracy decreases from 5’ to 3’) 17

Read mapping software General mappers (BLAT, SSAHA, Exonerate) were designed for aligning any sequences and the source of data is irrelevant. Some mappers are designed for data coming from a specific platform (e.g. Illumina, SOLiD). Some mappers exploit specific biases associated with a sequencing platform. Some mappers use Phred quality scores for scoring a match. 18

Ungapped vs. Spliced Alignment Unspliced reads (Black) ◦ Map within exactly one exon Spliced reads (Blue) ◦ Overlap two or more exons ◦ A more challenging problem than ungapped read mapping

Splice Sites Acceptor splice site (3’) Donor splice site (5’) Canonical Splice Site Consensus: GT---AG GT Occurs about 1 billion times in the 3 Billion nucleotide long human DNA Donor consensus ◦ A 64 G 73 G 100 T 100 A 62 A 68 G 84 T 63 Acceptor consensus ◦ C 65 A 100 G 100 20

PALMapper Strategy: ◦ Identify unspliced alignments and seed regions for spliced reads using a fast indexing tool. ◦ Infer spliced alignments from seed regions. De Bona, F. et al., Optimal Spliced Alignments of Short Sequence Reads, ECCB08/Bioinformatics, 24 (16):i174, 2008. Images from http://www.raetschlab.org/lectures/qpalma-ismb08-short-SIG.pdf

QPALMA:Optimal Spliced Alignments of Reads Based on PALMA ◦ Uses Computational Splice Site Prediction Q = Quality ◦ Uses a read’s quality information Information Used in QPALMA ◦ Read quality ◦ Splice site models ◦ Intron length m odels

Splice site prediction Splice site classifiers (support vector machine) ◦ Predictions of d onor sites ◦ Predictions of acceptor sites Training ◦ Known Acceptor & Donor Sites Obtained by aligning EST and cDNA sequences using BLAT (or any program for spliced alignments for long sequences) High quality alignments can be used as positive examples Other decoy sites used as negative examples Window of length 141 nt around the splice site

PALMapper uses Smith Waterman PALMapper is an “Intelligent” variant of SW that changes the scoring matrix in certain ways to incorporate ◦ Sequence Quality Scores ◦ Donor & Acceptor Splice Site Prediction Scores ◦ Intron Length Model To perform spliced alignments of short read sequences against a genome 24

Smith Waterman Source of Information Sequence matches Classical scoring f : Σ × Σ → R 25

Extensions of SW Penalizes the scoring if the classifier does not consider that there is a plausible intron. Source of Information Sequence matches Computational splice site predictions 26

Extensions of SW Penalizes the scoring if the intron length is not “expected” Source of Information Sequence matches Computational splice site predictions Intron length model 27

Incorporating Quality Scores Source of Information Sequence matches Computational splice site predictions Intron length model Read quality information Quality scoring f : ( Σ × R ) × Σ → R 28

Incorporating Quality Scores Q E (i): the quality score of the read at position i

The Extended SW g o = intron opening score g e = intron extension score f acc (i) = score of an acceptor site f don (i) = score of a donor site Intron scoring

Training 31

Performance of Various Extensions (% of correct alignments)

Error depends on where the read falls within an intron 33

Summary As read length becomes larger more reads require spliced alignment There are many tools for this task: v TopHat (uses Bowtie for ungapped mapping) v MapSplice (uses Bowtie) v RUM (uses Bowtie and BLAT) v PASS v ARYANA 34

A comparison of spliced alignment tools Figure from: Systematic evaluation of spliced alignment programs for RNA-seq data Nature Methods (2013) http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2722.html 35

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, - PowerPoint PPT Presentation

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, 2019 1 Next-Gen Sequencing Read Mapping Align reads to genome/transcriptome Input: Many reads, length l strings R 1 ,...,R m and approximate reference genome string G

Texture and other Mappings Texture Mapping Texture Mapping Bump Mapping Bump Mapping

Strategies for Bulk RNA-seq Analysis Genome Transcriptome Assembly Mapping Mapping Reads

Image Warping Image Mapping Image Mapping - Examples Forward Mapping Forward Mapping -

TEXTURE MAPPING 1 OUTLINE Introduce Mapping Methods Texture Mapping Environment

lecture 17 - more on texture mapping - graphics pipeline - MIP mapping - procedural

Advanced Texturing Environment Mapping Environment Mapping reflections Environment Mapping

Texture Mapping Texture Mapping 1 Texture Mapping Texture Mapping Motivation Motivation:

Texture Mapping Surface mapping OpenGl and Implementation Details Texture mapping Bump

Lecture 9: Mapping Reads to a Reference Burrows Wheeler Transform and FM Index Spring 2020

Computer Graphics (CS 543) Lecture 10: Bump Mapping, Parallax, Relief, Alpha, Specular Mapping

RTK Mapping Process RTK Mapping Presentation RTK Mapping Presentation June 4, 2002 RTK

Mapping data Representing data with maps Geographic analysis tasks Mapping where things are

Texture and other Mappings Texture Mapping Bump Mapping Displacement Mapping Environment

lecture 16 Texture mapping Aliasing (and anti-aliasing) Texture (images) Texture Mapping Q:

Texture Mapping Texture (images) lecture 16 Texture mapping Aliasing (and anti-aliasing)

Alaska Reads Big Anna Bjartmarsdottir, UAA/APU Books of the Year Rayette Sterling, Anchorage

Mesh Data Structures how to store geometry & connectivity? compact storage file

The Spectrum of Physical States of the Dual Resonance Model Paolo Di Vecchia Niels Bohr

Christ Our Atonement Leviticus 17:11 For the life of the flesh is in the blood: and I have given

WELCOME TO VISION NEXT BASE CAMP! Vision Next Learning Journey A city of informed, engaged

Singular Monge-Ampre equations in geometry Rafe Mazzeo Stanford University June 18, 2012

1 Models from range data Models from range data (II) Cyberware whole body scanner, WB4

The projective line minus three fractional 3 kinds of integral points points Darmons M

GENERAL ITEMS Heidi Schellman, Oregon State University 3/24/19 Resource board meeting on the

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, - PowerPoint PPT Presentation

Lecture 16: Mapping Reads to a Reference Fall 2019 November 12,14, 2019 1 Next-Gen Sequencing Read Mapping Align reads to genome/transcriptome Input: Many reads, length l strings R 1 ,...,R m and approximate reference genome string G

Texture and other Mappings Texture Mapping Texture Mapping Bump Mapping Bump Mapping

Strategies for Bulk RNA-seq Analysis Genome Transcriptome Assembly Mapping Mapping Reads

Image Warping Image Mapping Image Mapping - Examples Forward Mapping Forward Mapping -

TEXTURE MAPPING 1 OUTLINE Introduce Mapping Methods Texture Mapping Environment

lecture 17 - more on texture mapping - graphics pipeline - MIP mapping - procedural

Advanced Texturing Environment Mapping Environment Mapping reflections Environment Mapping

Texture Mapping Texture Mapping 1 Texture Mapping Texture Mapping Motivation Motivation:

Texture Mapping Surface mapping OpenGl and Implementation Details Texture mapping Bump

Lecture 9: Mapping Reads to a Reference Burrows Wheeler Transform and FM Index Spring 2020

Computer Graphics (CS 543) Lecture 10: Bump Mapping, Parallax, Relief, Alpha, Specular Mapping

RTK Mapping Process RTK Mapping Presentation RTK Mapping Presentation June 4, 2002 RTK

Mapping data Representing data with maps Geographic analysis tasks Mapping where things are

Texture and other Mappings Texture Mapping Bump Mapping Displacement Mapping Environment

lecture 16 Texture mapping Aliasing (and anti-aliasing) Texture (images) Texture Mapping Q:

Texture Mapping Texture (images) lecture 16 Texture mapping Aliasing (and anti-aliasing)

Alaska Reads Big Anna Bjartmarsdottir, UAA/APU Books of the Year Rayette Sterling, Anchorage

Mesh Data Structures how to store geometry &amp; connectivity? compact storage file

The Spectrum of Physical States of the Dual Resonance Model Paolo Di Vecchia Niels Bohr

Christ Our Atonement Leviticus 17:11 For the life of the flesh is in the blood: and I have given

WELCOME TO VISION NEXT BASE CAMP! Vision Next Learning Journey A city of informed, engaged

Singular Monge-Ampre equations in geometry Rafe Mazzeo Stanford University June 18, 2012

1 Models from range data Models from range data (II) Cyberware whole body scanner, WB4

The projective line minus three fractional 3 kinds of integral points points Darmons M

GENERAL ITEMS Heidi Schellman, Oregon State University 3/24/19 Resource board meeting on the

Mesh Data Structures how to store geometry & connectivity? compact storage file