CSE 527 Phylogeny & RNA: Pfold Lectures 20-21 Autumn 2006

Phylogenies (aka Evolutionary Trees) “Nothing in biology makes sense, except in the light of evolution” -- Dobzhansky

Modeling Sequence Evolution Simple but useful models; assume: Independence of separate positions Independence of separate lineages Stationarity - e.g., nuc freqs aren’t changing Markov property - nuc at a given position is independent of nuc there t 2 years ago given nuc there t 1 < t 2 years ago.

Simple Example: Jukes-Cantor rate of C → T A C G T changes per unit A -3a a a a time Rate matrix R = C a -3a a a G a a -3a a T a a a -3a Consequences: diagonal s.t. row equilibrium nuc freqs π i all = 1/4 sums = 0 all changes equally likely

Multiplicativity Matrix P t [i,j]: prob of change i → j in time t P s+t [i,j] = Σ k P s [i,k] P t [k,j] I.e., P s+t = P s P t

Finding Change Probabilities For small time ε , transition probabilities P ε ≈ I + ε R By multiplicativity P t+ ε = P t P ε ≈ P t (I + ε R) (P t+ ε - P t ) / ε ≈ P t R I.e., solve system of diff eqns: d dt P t = P t R

Jukes-Cantor, cont. d dt P t = P t R Solving r s s s s r s s Gives P t = where s s r s r = (1+3 exp(-4at))/4 s s s r s = (1 - exp(-4at))/4

Other Models Jukes-Cantor is simple, but inaccurate for some uses. E.g., Many genomes deviate sharply from π i = 1/4 In fact, “transversions” (purine {A,G} ↔ pyrimidine {C,T}) less frequent than “transitions” (pur ↔ pur or pyr ↔ pyr). Various other models often used

General Reversible Model Model is reversible if for all i, j π i P[i,j] = π j P[j,i] I.e., i → j and j → i changes are equally frequent; statistically, the past looks like the future d dt P t = P t R No closed form solution for but numerically solvable using eigenvalues of rate matrix R

Evolutionary Models: Key points Given small number of parameters (e.g., 4 x 4 symmetric rate matrix, …), an evolutionary tree, and branch lengths, you can calculate probabilities of changes on the tree

Uses: Example 1 Probability of changes shown on this (given) tree: G P(t 1 ,G → G) * P(t 2 ,G → T) T G

Uses: Example 2 What if ancestral state unknown? ? Σ a π a P(t 1 , a → G) * P(t 2 , a → T) T G draw a at root from equilibrium distribution

Uses: Example 3 What if sequences at leaves and ancestral sequence unknown? n � � 1 ) P ( t 2 , a u � x u 2 ) P ( t 1 , a u � x u � a u u = 1 a u

Uses: Example 4 What if branch lengths also unknown? Can find MLE by numerical optimization of n � � 1 ) P ( t 2 , a u � x u 2 ) argmax t 1 , t 2 P ( t 1 , a u � x u � a u u = 1 a u

Reversible model; you can’t place root

Uses: Example 5 What if Tree also unknown? x 3 x 1 x 3 x 1 x 2 x 4 x 4 x 2 Can try MLE of tree topology, too (>> parsimony)

A Complex Question: Given data (sequences, anatomy, ...) infer the phylogeny A Simpler Question: Given data and a phylogeny, evaluate “how much change” is needed to fit data to tree

Parsimony General idea ~ Occam’s Razor: If change is rare, prefer explanations requiring few changes Human A T G A T ... Chimp A T G A T ... Gorilla A T G A G ... Rat A T G C G ... Mouse A T G C T ...

Parsimony General idea ~ Occam’s Razor: If change is rare, prefer explanations requiring few changes G Human A T G A T ... 0 changes G G Chimp A T G A T ... G G G Gorilla A T G A G ... Rat A T G C G ... G G G Mouse A T G C T ...

Parsimony General idea ~ Occam’s Razor: If change is rare, prefer explanations requiring few changes A Human A T G A T ... 1 change A A Chimp A T G A T ... A A A/C Gorilla A T G A G ... Rat A T G C G ... C C C Mouse A T G C T ...

Parsimony General idea ~ Occam’s Razor: If change is rare, prefer explanations requiring few changes T Human A T G A T ... 2 changes T G/T Chimp A T G A T ... T G G/T Gorilla A T G A G ... Rat A T G C G ... G T G/T Mouse A T G C T ...

Likelihood Given a statistical model of evolutionary change, prefer the explanation of maximum likelihood Human A T G A T ... t 1 t 3 Chimp A T G A T ... t 2 t 5 t 4 Gorilla A T G A G ... t 6 Rat A T G C G ... t 8 Mouse A T G C T ... t 7

Sankoff & Rousseau, ‘75 P u (s) = best parsimony score of subtree rooted at node u , assuming u is labeled by character s A C G T A C G T A C G T A C G T A C G T A C G T A C G T A C G T A C G T T T G G T

Sankoff-Rousseau Recurrence P u (s) = best parsimony score of subtree rooted at node u , assuming u is labeled by character s Time: O(alphabet 2 x tree size)

So, Parsimony easy; What about Likelihood? Straightforward generalization of “simple” formula for 2-leaf tree n � � 1 ) P ( t 2 , a u � x u 2 ) P ( t 1 , a u � x u � a u u = 1 a u is infeasible, since you need to consider all (exponentially many) labelings of non-leaf nodes. Fortunately, there’s a better way…

Felsenstein Recurrence L u ( s | θ ) = Likelihood of subtree rooted at node u , assuming u is labeled by character s, given θ For Leaf u: For Internal node u:

Another Application: RNA folding

Using Evolution for RNA Folding Assume you have 1. Training set of trusted RNA alignments build evo model for unpaired columns build evo model for paired columns 2. Alignment (& tree) for some RNAs presumed to have an (unknown) common structure look at every col pair - better fit to paired model or 2 indp unpaired models? (Alternative to mutual information, using evo)

Training Data Trusted alignments of 1968 tRNAs + 305 LSU rRNAs

Rate Matrix (Unpaired)

Rate Matrix (Paired) shown

What about Gaps? option 1: evo model for them - hard & slow option 2: treat “-” as a 5th character - they don’t “evolve” quite like others option 3: treat “-” as unknown - ditto - end up pairing? (drop if < 75%) + easy

Which Tree? KH-99 : try to find MLE tree (using SCFG et al.) good but slow KH-03 : est tree without structure average unpaired & (marginalized) paired rates calc pairwise distances between seqs tree topology from “neighbor joining” MLE tree branch lengths

Synopsis of last lecture Based on simplifying assumptions (stationarity, independence, Markov, reversible), there are simple sequence-evolution models with a modest number of parameters giving, e.g., Pr(G → T | 1.5 m yr), … It can model base-pairing in RNA, too Felsenstein allows ML estimation of probabilities, branch lengths, even trees,… in this model. (Somewhat like “parsimony” algorithm, but better.) Goal: Use all this for inference of RNA 2ary struct

Phylogeny vs Mutual Information CCGUAGAUUA CCGUAGAAUU CCGUAGAUUA CGGUACAAUU CGGUACAUUA CGGUACAAUU MI = 1 bit in both cases, but green pair is more compelling evidence of interaction: 3 events, not 1

The Glue: An SCFG S → LS | L L → s | dFd F → LS | dFd

Full SCFG S → LS | L (0.868534) (0.131466) L → s (0.894603*p(s)) | dFd (0.105397*p(dd)) F → LS | dFd (0.212360) (0.787640*p(dd)) Where p(s) & p(dd) are the probabilities of the single/paired alignment columns s/dd as calculated by the Felsenstein algorithm based on the fixed evolutionary model and the given tree topology and branch lengths.

What SCFG Gives “Prior” probabilities for fraction paired vs unpaired lengths of each frequency of bulges in stems etc., and Inherits column probabilities from evo model

Cocke-Kasami-Younger for CFG Suppose all rules of form A → BC or A → a (by mechanical grammar transform, or use orig grammar & mechanically transform alg below…) Given x = x 1 …x n , want M i,j = { A | A → x i+1 …x j } For j=2 to n M[j-1,j] = {A | A → x j is a rule} A for i = j-1 down to 1 B C M[i,j] = ∪ i < k < j M[i,k] ⊗ M[k,j] Where X ⊗ Y = {A | A → BC , B ∈ X, and C ∈ Y } i+1 k k+1 j

The “Inside” Algorithm for SCFG (analogous to HMM “forward” alg) Just like CKY, but instead of just recording possibility of A in M[i,j], record its probability: For each A, do sum instead of union, over all possible k and all possible A → BC rules, of products of their respective probabilities. Result: for each i, j, A, have Pr(A → x i+1 …x j ) (There’s also an “outside” alg, analogous to backward…)

The “Viterbi” algorithm for SCFGs Just like inside, but use max instead of sum.

So what’s the structure? The usual dynamic programming traceback: Starting from S in upper right corner of matrix, find which k and which S → BC gave max probability, then (recursively) find where that B and that C came from… (Really, you want to do it with the F → dFd grammar, and where those rules are used tells you where the base pairs are.)

Results & Validation KH-99: 4 bacterial RNAse P, 340-380 nt 3: Pseudomonas florescens 1: Klebsiella pneumoniae 2: Serratia marscens 4: Thiobacillus ferrooxidans

Good overall structure prediction Klebsiella pneumoniae RNAse P pseudoknot (unpredicted)

Good Overall Structure Prediction

Not bad, even with only one seq

CSE 527 Phylogeny & RNA: Pfold Lectures 20-21 Autumn 2006 - PowerPoint PPT Presentation

CSE 527 Phylogeny & RNA: Pfold Lectures 20-21 Autumn 2006 Phylogenies (aka Evolutionary Trees) Nothing in biology makes sense, except in the light of evolution -- Dobzhansky Modeling Sequence Evolution Simple but useful models;

Phylogeny Topic 7.9 Phylogeny Phylogeny is the evolutionary history of a species or a group

Phylogeny Phylogeny Evolutionary history of a species or a group of species Goal:

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

The Message " CSE 527 ! noncoding RNA " Cells make lots of RNA " Computational

CSE 527 Autumn 2006 Lectures 15-16 RNA Secondary Structure Prediction RNA Secondary Structure:

CSE 527 Autumn 2007 Lectures 17-18 RNA Secondary Structure Prediction RNA Secondary Structure:

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

Outline CSE 527 What is it Lecture 17, 11/24/04 How is it Represented RNA Secondary

RNA Secondary Structure CSE 417 W.L. Ruzzo The Double Helix Los Alamos Science The Central

The Double Helix RNA Secondary Structure CSE 417 W.L. Ruzzo Los Alamos Science The Central

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

The CHyM hydrological model applied to the produc5on of flood

The NVAP Global Water Vapor Climate Data Record: Plans for Improvement and Extension from

Leatherhead Flood Recovery Meeting Mark Douch: Flood & Coastal Risk Manager Sarah

NASA Applied Sciences Disasters Program Support Timeline for Historic Louisiana Floods August

Meeting 35: 22 November 2017 Karakia 2 Karakia Ko te tumanako Kia pai tenei r Kia tutuki i

Accurate river level predictions using a Wavenet-like model Shannon Doyle and Anastasia Borovykh

Low-Overhead LogGP Parameter Assessment for Modern Interconnection Networks T. Hoefler, A.

Monitoring Shorelines via High-Resolution Satellite Imagery and Deep Learning Venkatesh Ramesh