1 sequence data sequence data 2 align sequences align
play

* 1 Sequence data Sequence data * 2 Align Sequences Align - PowerPoint PPT Presentation

Inferring trees is difficult!!! MODELS OF PROTEIN EVOLUTION: 1. The method problem AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES A Method 1 Method 1 Dataset 1 Dataset 1 B ? C A Robert Hirt Method 2 Method 2 Department of Zoology,


  1. Inferring trees is difficult!!! MODELS OF PROTEIN EVOLUTION: 1. The method problem AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES A Method 1 Method 1 Dataset 1 Dataset 1 B ? C A Robert Hirt Method 2 Method 2 Department of Zoology, The C Dataset 1 Dataset 1 Natural History Museum, B London From DNA/protein sequences to trees Inferring trees is difficult!!! * 1 Sequence data Sequence data * 2 Align Sequences Align Sequences 2. The dataset problem 2. The dataset problem Phylogenetic signal? Phylogenetic signal? 3 * Patterns—>evolutionary Patterns >evolutionary processes? processes? A Method 1 Method 1 Distances methods Characters based methods Dataset 1 Dataset 1 B Distance calculation Choose Choose a a method method * 4 (which model?) ? C MB ML MP A Wheighting? Wheighting Model? Model? Model? Model? Method 1 Method 1 (sites, changes)? (sites, changes)? Optimality criterion Single tree C Dataset 2 Dataset 2 LS LS ME ME NJ NJ B Calculate or Calculate or estimate estimate best best fit fit tree tree 5 Test phylogenetic Test phylogenetic reliability reliability Modified from Hillis et al., (1993). Methods in Enzymology 224, 456-487 1

  2. Agenda Why protein phylogenies? Why protein phylogenies? • Some general considerations • For historical reasons - the first sequences • For historical reasons - the first sequences – why protein phylogenetics? • Most genes encode proteins • Most genes encode proteins – What are we comparing? Protein sequences - some basic features • To study protein structure, function and evolution • To study protein structure, function and evolution – Protein structure/function and its impact on patterns of mutations • Comparing DNA and protein based phylogenies can • Comparing DNA and protein based phylogenies can • Amino acid exchange matrices: where do they come from be useful be useful and when do we use them? – Different genes - e.g. 18S Different genes - e.g. 18S rRNA rRNA versus EF-2 protein versus EF-2 protein – – Database searches (Blast, FASTA) – Protein encoding gene - Protein encoding gene - codons codons versus amino acids versus amino acids – – Sequence alignment (ClustalX) – Phylogenetics (model based methods) Proteins were the first molecular Phylogenies from proteins sequences to be used for phylogenetic inference • Parsimony • Distance matrices • Fitch and Margoliash (1967). • Maximum likelihood Construction of phylogenetic trees. • Bayesian methods Science 155, 279-284. 2

  3. Evolutionary models for amino acid Character states in DNA and changes protein alignments • All methods have explicit or implicit evolutionary models • DNA sequences have four states (five): A, C, G, T, (and ± indels) • Can be in the form of simple formula – Kimura formula to estimate distances • Proteins have 20 states (21): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (and ± indels) • Most models for amino acid changes typically include – 20x20 rate matrix – Correction for rate heterogeneity among sites ( G [a]+ G [a]+ pinv) —> more information in DNA or protein alignments? – Assume neutrality - what if there are biases, or non neutral changes - such as selection? DNA—>Protein DNA->Protein: the code • The code is degenerate: • 3 nucleotides (a codon) code for one amino acid 20 amino acids are encoded by 61 possible codons (3 stop (61 codons! 61x61 rate matrices?) codons) • Complex patterns of changes among codons: • Degeneracy of the code: most amino acids are – Synonymous/non synonymous changes coded by several codons – Synonymous changes correspond to codon changes not affecting the coded amino acid —> more data/information in DNA? 3

  4. Codon degeneracy: protein alignments as a guide DNA->Protein: code usage for DNA alignments Glu- Glu -Gly Gly- -Ser Ser- -Ser Ser- -Trp Trp- -Leu Leu- -Leu Leu- -Leu Leu- -Gly Gly- -Ser Ser • Difference in codon usage can lead to large base Glu- Glu -Gly Gly- -Ser Ser- -Ser Ser- -Tyr Tyr- -Leu Leu- -Leu Leu- -Ile Ile- -Gly Gly- -Ser Ser composition bias - in which case one often needs to Asp Asp- -Gly Gly- -Ser Ser- -Ala Ala- -Trp Trp- -Leu Leu- -Leu Leu- -Leu Leu- -Gly Gly- -Ser Ser remove the 3rd codon, the more bias prone site… Asp Asp- -Gly Gly- -Ser Ser- -Ala Ala- -Tyr Tyr- -Leu Leu- -Leu Leu- -Ala Ala- -Gly Gly- -Ser Ser and possibly the 1st GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC • Comparing protein sequences can reduce the GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT compositional bias problem GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA —> more information in DNA or protein? Ask James for PUTGAPS… Evolutionary models for amino acid changes Models for DNA and Protein evolution • All methods have explicit or implicit evolutionary • DNA: 4 x 4 rate matrices models – Easy to estimate (can be combined with tree search) • Can be in the form of simple formula – Kimura formula to estimate distances • Protein: 20 x 20 matrices – More complex: time and estimation problems (rare changes?) -> • Most models for amino acid changes typically include empirical models from large datasets are typically used – 20x20 rate matrix – Correction for rate heterogeneity between sites ( G [a]+ G [a]+ pinv) 4

  5. Amino acid physico-chemical Proteins and amino acids properties • Proteins determine shape and structure of cells and – Major factor in protein folding carry most catalytic processes - 3D • Proteins are polymers of 20 different amino acids – Key to protein functions • Amino acids sequences determine the structure (2ndary, 3ary…) and function of the protein —> Major influence in pattern — > Major influence in pattern • Amino acids can be categorized by their side chain of amino acid mutations of amino acid mutations physicochemical properties As for Ts versus Tv in DNA sequences, some – Polarity (hydrophobic versus hydrophilic, +/- charges) – Size (small versus large) amino acid changes are more common than others: very important for sequence comparisons (alignment and phylogenetics!) Small <—> small > small <—> big Amino acid substitution matrices Estimation of relative rates of residue based on observed substitutions: replacement (models of evolution) “empirical models” • Differences/changes in protein alignments can be pooled and patterns of changes investigate. • Summarise the substitution pattern from large – Selected sequence, alignment and counting method dependent! Empirical amount of existing data models! • Patterns of changes give insights into the evolutionary processes • Based on a selection of proteins underlying protein diversification -> estimation of evolutionary – Globular proteins, membrane proteins? models – How general is such a model? – Mitochondrial proteins? • Choice of protein evolutionary models can be important for the • Uses a given counting method and the counted sequence analysis we perform (database searching, sequence alignment, phylogenetics) changes to be recorded – tree dependent/independent – restriction on the sequence divergence 5

  6. Taylor’s Venn diagram of amino acids Amino acid physico-chemical properties properties Tiny Small – Size P – Polarity A Aliphatic C S-S G S N Polar – Hydrophilic (polar, +/- charges) C S-H Q V D + T I E – Hydrophobic (non polar) L Charged K M - Y F H R W Hydrophobic Aromatic Amino acids categories 1: Amino acids categories 2 Doolittle (1985). Sci. Am. 253, 74-85. –Sulfhydryl: C –Small polar: S, G, D, N –Small hydrophilic: S, T, A, P, G –Small non-polar: T, A, P, C –Acid, amide: D, E, N, Q –Large polar: E, Q, K, R –Basic: H, R, K –Large non-polar: V, I, L, M, F –Small hydrophobic : M, I, L, V –Intermediate polarity: W, Y, H –Aromatic: F, Y, W 6

  7. Phylogenetic trees from protein alignments • Parsimony based methods - unweighted/weighted • Distance methods - model for distance estimation – probability of amino acid changes, site rate heterogeneity • Maximum likelihood and Bayesian methods- model for ML calculations – probability of amino acid changes, site rate heterogeneity —> Colour coding of different categories is useful for protein — > Colour coding of different categories is useful for protein alignment visual inspection alignment visual inspection Trees from protein alignment: Parsimony: unweighted matrix Parsimony methods - cost matrices for amino acid changes • All changes weighted equally – Ile -> Leu cost = 1 • Differential weighting of changes: an attempt to correct for homoplasy!: – Trp -> Asp cost = 1 – Based on the minimal number of amino acid substitutions, the genetic code matrix (PHYLIP -PROTPARS) – Ser -> Arg cost = 1 – Weights based on physico-chemical properties of amino acids – Weights based on observed frequency of amino acid substitutions in – Lys -> Asp cost = 1 alignments 7

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend