quantifying sequence similarity
play

Quantifying sequence similarity Analogy Homology Similar function - PDF document

22 Mar 15 Similarity Quantifying sequence similarity Analogy Homology Similar function Similar ancestry Homology can be used to determine evolutionary relationships predict functions Bas E. Dutilh


  1. 22 ‐ Mar ‐ 15 Similarity Quantifying sequence similarity • Analogy • Homology – Similar function – Similar ancestry • Homology can be used to… – …determine evolutionary relationships – …predict functions Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 19 th 2015 Archaeopteryx Evolution Biological sequences • Replication (copy) • Represented as 1 ‐ dimensional string • Nucleic acids – DNA: A, C, G, and T – RNA: A, C, G, and U • Mutation (modify) – N is “unknown” – Some other letters are ambiguities: Some other letters are ambiguities: • Proteins: • Selection (fitness) – X is “unknown” Time – All other letters except B, J, O, U, and Z are amino acids: Fasta files Fasta file extensions • Biological sequences are stored in Fasta files • The file extension of a Fasta file is .fa or .fasta • Fasta files are plain text files (open e.g. in ) • The preferred extension for protein Fasta files is .faa – Fasta Amino Acid Every new sequence entry starts with a “>” sign at the start of a line >protein_seque >protein_sequence_A nce_A MTQSSHAVAA FDL MTQSSHAVAA FDLGAALRQE GLTETDYSE GAALRQE GLTETDYSEI QRDPNRAELG TFGV I QRDPNRAELG TFGV Each sequence has an identifier >protein_seque >protein_sequence_B nce_B that has to be unique in the file MLTETDYSEI QRR MLTETDYSEI QRRLGRDPNR AELGMFGVM LGRDPNR AELGMFGVMN RAELGMFGY N RAELGMFGY >protein_sequence_A >protein_seque nce_A >protein_seque >protein_sequence_C nce_C MTQSSHAVAA FDL MT Q Q SSHAVAA FDLGAALR GAALRQE GLTETDYSE Q Q E GLTETDYSEI I QRDPNRAELG TFGV Q RDPNRAELG TFGV MHAVAAFDLG AALRQEGLTE TDYSEIQRR MHAVAAFDLG AALRQEGLTE TDYSEIQRR MHAVAAFDLG AAL MHAVAAFDLG AAL RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA L GRAMFGVMWS EHCCYRNDDA YRNDDA YRNDDA >protein_seque >protein_sequence_B nce_B RPLLRPIKSP FGA RPLLRPIKSP FGAWVVIV WVVIV MLTETDYSEI QRRLGRDPNR AELGMFGVM MLTETDYSEI QRR LGRDPNR AELGMFGVMN RAELGMFGY N RAELGMFGY >protein_sequence_C >protein_seque nce_C • The preferred extension for DNA Fasta files is .fna MHAVAAFDLG AALRQEGLTE TDYSEIQRR MHAVAAFDLG AAL RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA YRNDDA – Fasta Nucleic Acid RPLLRPIKSP FGAWVVIV RPLLRPIKSP FGA WVVIV >DNA_sequence_X >DNA_sequence_ GAGGAATTCA TAG GAGGAATTCA TAGCTGACGA GTCGAGTGA CTGACGA GTCGAGTGAA AACCGTGTCG TAAA A AACCGTGTCG TAAAAGA AGA >DNA_sequence_ >DNA_sequence_Y The sequence can be on one or more lines Spaces and newlines just make CTGACGAGTC GCC CTGACGAGTC GCCCCCCCCC ATAGAGTGG CCCCCCC ATAGAGTGGT TTCCGTTTCC GGAA T TTCCGTTTCC GGAAGGGTCG GGGTCG until the next “>” at the start of a new line sequences easier to read/count, >DNA_sequence_Z >DNA_sequence_ they do not have any meaning GAAGCTGACC CGTTTCCGGA AGAGGGAGG GAAGCTGACC CGT TTCCGGA AGAGGGAGG 1

  2. 22 ‐ Mar ‐ 15 Sequence alignment Identity versus similarity • How to determine if two sequences are homologous? • These three sequences are all 66.7% identical… • …but what are those colors? * – Some amino acids are more similar than others • How to quantify their similarity? • Make a sequence alignment! (* hint) • Aligned sequences allow you to calculate percent identity: – Di ff erences: 7 posi � ons → 32 posi � ons iden � cal – Alignment length: 39 positions 32 – The two sequences are 100 * = 82% identical Cysteine (C) Aspartic acid (D) Glutamic acid (E) 39 • Note: identity can be quantified, but homology cannot – Saying “82% homologous” is wrong! George Orwell George Orwell Animal Farm Animal Farm Amino acid properties The similarity signal • What is the most relevant definition of a “similarity signal” between amino acids? – Size? – Polarity? – Hydrophobicity? – Preferred protein fold? …any other measures that might be important? any other measures that might be important? – • Similarity signal of what, exactly? * – “Of the function of the amino acid in the protein” – “Of an evolutionary relationship” (* hint) • Evolution has tested this for us – So let’s use evolution to score similarity! http://swift.cmbi.ru.nl/teach/B1B/HTML/PosterA4_nbic_new.pdf BLOcks SUbstitution Matrix (BLOSUM) Are we looking at well ‐ aligned homologs? 1. Take some aligned homologous sequences You will learn how to do this next week – Unaligned sequences 2. Group highly identical sequences (e.g. >62% identity) To remove redundancy biases in the sequence database To remove redundancy biases in the sequence database – • Well ‐ aligned homologs show you how often two amino ll li d h l h h f i 3. Identify well ‐ aligned blocks acids really mutated into each other during evolution So that only real mutations are compared – 4. Count how two often amino acids mutated into each other This is the most relevant definition of “similarity”! – Well ‐ aligned homologs 2

  3. 22 ‐ Mar ‐ 15 How to quantify this similarity? An example • BLOSUM defines similarity between amino acids based on the • Let’s say that we have a well ‐ aligned block of: statistical probability that they align in well ‐ aligned homologs – 100 amino acids long – Observed/expected ratio (odds ratio) – 1,000 proteins “deep” • An obeserved/expected ratio of 2 means something happens twice as often as – With no gaps you expected by random chance 7,400 – Observed: aligned in well ‐ aligned homologs • 7.4% are alanine (A) → F A = = 0.074 100 ∙ 1,000 – Expected: “aligned” in unaligned sequences • 1.3% are tryptophan (W) → F W = 0.013 BLOSUM score: • Randomly, we expect a fraction of A ‐ W alignments of: Randomly, we expect a fraction of A W alignments of: log ( aligned in well ‐ aligned homologs observed ) → S i,j = 2 ∙ log 2 ( F i,j ) “aligned” in unaligned sequences expected F i ∙ F j F A ∙ F W = 0.074 ∙ 0.013 = 0.000962 • The observed/expected ratio for a whole alignment can be • In reality, we observe a fraction of A ‐ W alignments of: 0.00034 calculated by multiplying the ratios for all aligned amino acids – The A ‐ W mutation occurred less frequently in evolution than expected! – This is a lot of multiplying for long sequences – … so the substitution score must be negative • Logarithms allow us to sum the scores of aligned residues • The A ‐ W substitution score is: – Much easier to use for a scoring system S i,j = 2 ∙ log 2 ( F i,j ) S A,W = 2 ∙ log 2 ( 0.00034 ) = ‐ 3 • Hence: the BLOSUM score F i ∙ F j 0.000962 BLOSUM62 So what do the numbers mean? 2 (S i,j /2) = F i,j F i,j S i,j = 2 ∙ log 2 ( ) → F i ∙ F j F i ∙ F j Low for infrequent substitutions Small • A score of ‐ 3 for the substitution A ‐ W means that this substitution is High for frequent substitutions observed in well ‐ aligned homologs 0.35 times as often as expected: Low for common amino acids 2 ( ‐ 3/2) = 0.35 0.00034 Polar, non ‐ hydrophobic = 0.000962 1 High for rare amino acids – … or: = 2.83 times less often than expected 0.35 Positive • A score of 2 for the substitution D ‐ E means that this substitution is Hydrophobic observed in well ‐ aligned homologs twice as often as expected: 2 (2/2) = 2 Aromatic • Why are the highest numbers in each row/column on the • A score of 9 for the substitution C ‐ C means that this substitution is diagonal? observed in well ‐ aligned homologs 22.6 times as often as expected: – Because in well ‐ aligned homologs, every amino acid is most 2 (9/2) = 22.6 often aligned to itself (not mutated) Length of the alignment Odds ratio that sequences are well ‐ aligned homologs • As you see a more and more length of two sequences… • How likely is it that two sequences are well ‐ aligned homologs? – seqA – seqB : ‐ 1 + ‐ 3 + ‐ 2 = ‐ 6 – If two sequences are well ‐ aligned homologs, most of the aligned – seqA – seqC : ‐ 1 + ‐ 2 + ‐ 2 = ‐ 5 amino acids will have positive scores – seqB – seqC : 4 + 2 + 4 = 10 – … so the alignment score keeps increasing as you see more 2 ( ‐ 6/2) = 0.125 2 ( ‐ 5/2) = 0.177 2 (10/2) = 32 length of the sequences – … so the likelihood that they are well ‐ aligned homologs keeps increasing • When you observe RYD aligned to SDA, these sequences are 0.125 times more likely to be well ‐ aligned homologs than expected – … or 8 times less likely to be well ‐ aligned homologs than expected – If two sequences are not homologous, most of the aligned amino • When you observe RYD aligned to SEA, these sequences are 0.177 acids will have negative scores times more likely to be well ‐ aligned homologs than expected – … so the score keeps decreasing as you see more sequence – … or 5.6 times less likely to be well ‐ aligned homologs than expected – … so the likelihood that they are well ‐ aligned homologs keeps • When you observe SDA aligned to SEA, these sequences are 32 decreasing times more likely to be well ‐ aligned homologs than expected 3

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend