Sequence Analysis '15: lecture 5. Substitution matrices; multiple sequence alignment

A teacher's dilemma

To understand...                You first need to know...
Multiple sequence alignment     Substitution matrices
Substitution matrices           Phylogenetic trees
Phylogenetic trees              Multiple sequence alignment

We'll start with substitution matrices.

Substitution matrices
• Used to score aligned positions, usually of amino acids.
• Expressed as the log-likelihood ratio of mutation (or log-odds ratio).
• Derived from multiple sequence alignments.

Two commonly used matrices: PAM and BLOSUM
• PAM = percent accepted mutations (Dayhoff)
• BLOSUM = Blocks substitution matrix (Henikoff)

PAM (M. Dayhoff, 1978)
• Evolutionary time is measured in Percent Accepted Mutations, or PAMs.
• One PAM of evolution means that 1% of the residues/bases have changed, averaged over all 20 amino acids.
• To get the relative frequency of each type of mutation, we count the times it was observed in a database of multiple sequence alignments.
• Based on global alignments.
• Assumes a Markov model for evolution.
[Photo: Margaret Oakley Dayhoff]

BLOSUM (Henikoff & Henikoff, 1992)
• Based on a database of ungapped local alignments (BLOCKS).
• Alignments have lower similarity than PAM alignments.
• The BLOSUM number indicates the percent identity level of the sequences in the alignment. For example, for BLOSUM62, sequences with approximately 62% identity were counted.
• Some BLOCKS represent functional units, providing validation of the alignment.
[Photo: Steven Henikoff]

Multiple Sequence Alignment
A multiple sequence alignment is made using many pairwise sequence alignments.

Columns in an MSA have a common evolutionary history
By aligning the sequences, we are asserting that the aligned residues in each column had a common ancestor.
[Figure: a phylogenetic tree for one position in the alignment, showing the characters G, G, N, A, S, G, G and an unknown ancestor (?).]

A tree shows the evolutionary history of a single position
Ancestral characters can be inferred by parsimony analysis.
[Figure: a tree over the taxa goat, fish, bird, worm and clam, labelled with the characters G, W and N.]

Counting mutations without knowing ancestral sequences
Naive way: Assume any of the characters could be the ancestral one. Assume equal distance to the ancestor from each taxon.

[Figure: a seven-sequence alignment (rows below) with one highlighted column containing G, G, W, W, N, G, G, drawn as a star tree with G at the centre.]
L K F R L S K K P
L K F R L S K K P
L K F R L T K K P
L K F R L S K K P
L K F R L S R K P
L K F R L T R K P
L K F R L ~ K K P

If G was the ancestor, then it mutated to a W twice, to N once, and stayed G three times.

We could have picked W as the ancestor...
If W was the ancestor, then it mutated to a G four times, to N once, and stayed W once.
[Figure: the same alignment and column, drawn as a star tree with W at the centre.*]
*FYI: This is how you draw a phylogenetic tree when the branch order is not known.
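The two "pick an ancestor" counts above can be checked with a few lines of Python. This is only a sketch: the function name `changes_from_ancestor` is made up for illustration, and the column G, G, W, W, N, G, G is the one used in the slides.

```python
from collections import Counter


def changes_from_ancestor(column, ancestor_index):
    """Count what the assumed ancestor 'became' in every other sequence.

    column: list of one-letter residues, one per sequence.
    ancestor_index: position of the sequence we pretend is ancestral.
    """
    others = column[:ancestor_index] + column[ancestor_index + 1:]
    return Counter(others)


column = list("GGWWNGG")

# Assume the first G is the ancestor: G stays G 3 times, G->W twice, G->N once.
as_g = changes_from_ancestor(column, 0)

# Assume the first W is the ancestor: W->G 4 times, W->N once, W stays W once.
as_w = changes_from_ancestor(column, 2)

print("G as ancestor:", dict(as_g))
print("W as ancestor:", dict(as_w))
```

Each choice of ancestor gives a different tally, which is why the slides go on to try every sequence in turn.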

Substitution matrices are symmetrical
Since we don't know which sequence came first, we don't know whether W-->G or G-->W is correct. So we count this as one mutation of each type. P(G-->W) and P(W-->G) are the same number. (That's why we only show the upper triangle.)

Summing the substitution counts
We assume the ancestor is one of the observed amino acids, but we don't know which, so we try them all. For one column of an MSA (G, G, W, W, N, G, G), each sequence in turn is taken as the ancestor and compared with the sequences not yet used. The counts are added to a symmetrical matrix.

Ancestor    Compared with      Counts added
1st G       G W W N G G        G-G 3, G-N 1, G-W 2
2nd G       W W N G G          G-G 2, G-N 1, G-W 2   (we already counted the first G, so ignore it)
1st W       W N G G            G-W 2, N-W 1, W-W 1
2nd W       N G G              G-W 2, N-W 1, W-W 0
N           G G                G-N 2, N-N 0, N-W 0
3rd G       G                  G-G 1, G-N 0, G-W 0
4th G       (none)             no counts for the last sequence

Counting G as the ancestor as many times as it appears recognizes the increased likelihood that G (the most frequent amino acid at this position) is the true ancestor.

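Go to next column. Continue summing.
After the first column, the matrix holds (upper triangle only):

      G   N   W
G     6   4   8
N         0   2
W             1

TOTAL = 21

[Figure: the alignment now shows a second column (P, P, I, N, P, P, A) next to the first.]
Continue doing this for every column in every multiple sequence alignment...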
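Trying every sequence as the ancestor and comparing it only with the sequences not yet used is the same as counting every unordered pair of residues in a column once. A minimal Python sketch of that counting follows. The function names are illustrative, and gap characters are not handled.

```python
from collections import Counter
from itertools import combinations


def pair_counts(column):
    """Count every unordered residue pair in one alignment column.

    Keys are sorted tuples such as ("G", "W"), so G-W and W-G share one
    cell, which is what makes the matrix symmetrical.
    """
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


def count_alignment(columns):
    """Sum the pair counts over all columns of an alignment."""
    total = Counter()
    for column in columns:
        total += pair_counts(column)
    return total


first_column = pair_counts("GGWWNGG")
for pair, n in sorted(first_column.items()):
    print(pair, n)
print("TOTAL =", sum(first_column.values()))
```

For the column G, G, W, W, N, G, G this gives G-G 6, G-N 4, G-W 8, N-W 2 and W-W 1, for a total of 21 pairs, matching the matrix in the slides.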

Probability ratios are expressed as log odds
Substitutions (and many other things in bioinformatics) are expressed as a "likelihood ratio", or "odds ratio", of the observed data over the expected value. Here, likelihood and odds are used as synonyms for probability. So log odds is the log (usually base 2) of the odds ratio.

log odds ratio = log2(observed / expected)

How do you calculate log-odds?
P(G) = 4/7 = 0.57
Observed probability of G->G:  q_GG = P(G->G) = 6/21 = 0.29
Expected probability of G->G:  e_GG = 0.57 * 0.57 = 0.33
odds ratio = q_GG / e_GG = 0.29 / 0.33
log odds ratio = log2(q_GG / e_GG)

If the 'lod' is < 0, then the mutation is less likely than expected by chance. If it is > 0, it is more likely.
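The arithmetic above can be reproduced directly. The sketch below uses the slide's own numbers (4 G's among 7 residues, 6 G-G pairs out of 21). The helper name `log_odds` is an illustrative choice.

```python
import math


def log_odds(observed, expected):
    """Base-2 log of the odds ratio observed/expected."""
    return math.log2(observed / expected)


p_g = 4 / 7        # frequency of G in the column
q_gg = 6 / 21      # observed fraction of pairs that are G-G
e_gg = p_g * p_g   # expected fraction if residues paired at random

lod_gg = log_odds(q_gg, e_gg)
print(f"q_GG = {q_gg:.2f}, e_GG = {e_gg:.2f}, lod = {lod_gg:.2f}")
```

With unrounded fractions the odds ratio is 0.875 and the lod is about -0.19, so in this single column G-G pairs are slightly less common than chance would predict.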

Same amino acids, different distribution, different outcome.

G's concentrated:               P(G) = 0.50, e_GG = 0.25, q_GG = 21/42 = 0.50, lod = log2(0.50/0.25) = 1
G's spread over many columns:   P(G) = 0.50, e_GG = 0.25, q_GG = 9/42 = 0.21, lod = log2(0.21/0.25) = -0.2

[Figure: two small alignments built from the same residues (G, W, A, N), arranged differently.]
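To see how the same residues can give different scores, here is a small sketch that computes the G-G lod for two toy alignments. The columns are hypothetical, chosen only so that each alignment has 7 G's among 14 residues and the G-G pair counts come out at 21 and 9. The function name is also illustrative.

```python
import math
from itertools import combinations


def lod_same_residue(columns, residue):
    """lod for residue->residue, with e = P(residue)^2 as in the slides.

    columns: list of strings, one string per alignment column.
    Note: raises ValueError if the pair is never observed (log of zero).
    """
    residues = [r for column in columns for r in column]
    p = residues.count(residue) / len(residues)
    pairs = [pair for column in columns for pair in combinations(column, 2)]
    same = sum(1 for a, b in pairs if a == residue and b == residue)
    q = same / len(pairs)
    return math.log2(q / (p * p))


# Hypothetical columns: all seven G's in one column ...
concentrated = ["GGGGGGG", "AAAWWNA"]
# ... versus the same seven G's split 4 + 3 across two columns.
spread = ["GGGGAWN", "GGGAAWA"]

lod_concentrated = lod_same_residue(concentrated, "G")
lod_spread = lod_same_residue(spread, "G")
print(f"concentrated: {lod_concentrated:.2f}, spread: {lod_spread:.2f}")
```

The concentrated case gives q_GG = 21/42 and a lod of 1. The spread case gives q_GG = 9/42 and a lod of about -0.2.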

Different observations, same expectation
In both alignments P(G) = 0.50 and P(W) = 0.14, so e_GW = 0.07.

G's and W's not seen together:                     q_GW = 3/42 = 0.07, lod = log2(0.07/0.07) = 0
G and W seen together more often than expected:    q_GW = 7/42 = 0.17, lod = log2(0.17/0.07) = 1.3

[Figure: two small alignments with the same residue frequencies but different G-W pairings.]

In class exercise: Get the substitution value for P->Q

Sequence alignment database:
PQPP
QQQP
QQPP
QPPP
QQQP

Substitution counts: fill in the upper triangle of the matrix over P and Q.

Expected (e), versus observed (q) for P->Q:
P(P) = _____, P(Q) = _____
e_PQ = _____
q_PQ = ___/___ = _____
lod = log2(q_PQ / e_PQ) = ____

PAM assumes Markovian evolution
A Markov process is one where the likelihood of the next "state" depends only on the current state. The inference that evolution is Markovian assumes that base changes (or amino acid changes) occur at a constant rate and depend only on the identity of the current base (or amino acid).
[Figure: a chain of current amino acids G, G, A, V, V, G along a time axis in millions of years (MY), annotated with the transition likelihood per MY for G->G, G->A, G->V, V->V and V->G. Values shown: .9946, .9932, .0001, .0002, .0021.]

Markovian evolution is an extrapolation
Start with one sequence. One position. Say Gly.
Wait 1 million years. What amino acids are now found at that position? Apply PAM1.
Wait another million years. Apply PAM1 again.
But... PAM1 × PAM1 is just PAM1^2.

250 million years?
PAM1 × PAM1 × ••• × PAM1 (250 factors) = PAM1^250 = PAM250
The number after PAM denotes the power to which PAM1 was taken.
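Raising PAM1 to a power is ordinary repeated matrix multiplication. The sketch below shows this with a made-up 3-state transition matrix standing in for the real 20 × 20 PAM1 mutation probability matrix. The numbers in `toy_pam1` are hypothetical and only chosen so that each row sums to 1 and about 1% of residues change per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical 3-state stand-in for PAM1 (rows: current state, columns: next state).
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.003, 0.007, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print(["%.3f" % x for x in row])
```

Each row of the result is still a probability distribution, but the diagonal is much smaller than in the one-step matrix: after many steps a position is far less likely to show its original residue.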

NOTE OF CLARIFICATION:
• PAM does not stand for Plus A Million years (or anything like that). It stands for Percent Accepted Mutations.
• One PAM1 unit does not correspond to 1 million years of evolution. There is no timescale associated with PAM.
• PAM1 corresponds to 1% mutations (or 99% identity). The timescale depends on the species.

Differences between PAM and BLOSUM

PAM
• PAM matrices are based on global alignments of closely related proteins.
• PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
• Other PAM matrices are extrapolated from PAM1 using an assumed Markov chain.

BLOSUM
• BLOSUM matrices are based on local alignments.
• BLOSUM62 is a matrix calculated from comparisons of sequences with approximately 62% identity.
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
• BLOSUM62 is the default matrix in BLAST (the database search program). It is tailored for comparisons of moderately distant proteins. Alignment of distant relatives may be more accurate with a different matrix.

PAM250

BLOSUM62

In class exercise: Which substitution matrix favors...
(Answer PAM250 or BLOSUM62 for each.)
• conservation of polar residues
• conservation of non-polar residues
• conservation of C, Y, or W
• polar-to-nonpolar mutations
• polar-to-polar mutations

Protein versus DNA alignments
Are protein alignments better?
• Protein alphabet = 20, DNA alphabet = 4.
  – Protein alignment is more informative.
  – Less chance of homoplasy with proteins.
  – Homology detectable at greater edit distance.
• Better Gold Standard (G.S.) alignments are available for proteins.
  – Better statistics from G.S. alignments.
• On the other hand, DNA alignments are more sensitive to short evolutionary distances.

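DNA evolutionary models: P-distance
What is the relationship between time and the % identity?
p = D / L, where D is the number of positions at which two aligned sequences differ and L is the number of positions compared.
p is a good measure of time only when p is small.
[Figure: p, ranging from 0 to 1, plotted against evolutionary time.]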
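A minimal sketch of the p-distance calculation, assuming two already-aligned sequences of equal length with no gaps. The function name and the two example sequences are made up for illustration.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ (p = D / L)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Hypothetical aligned DNA sequences differing at 2 of 10 positions.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(f"p = {p:.2f}, identity = {100 * (1 - p):.0f}%")
```

Here p = 0.2, which corresponds to 80% identity. Since p can never exceed 1 no matter how much time passes, it can only track evolutionary time while it is small.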
