Quantifying sequence similarity Analogy Homology Similar function - - PDF document

quantifying sequence similarity
SMART_READER_LITE
LIVE PREVIEW

Quantifying sequence similarity Analogy Homology Similar function - - PDF document

22 Mar 15 Similarity Quantifying sequence similarity Analogy Homology Similar function Similar ancestry Homology can be used to determine evolutionary relationships predict functions Bas E. Dutilh


slide-1
SLIDE 1

22‐Mar‐15 1

Quantifying sequence similarity

Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 19th 2015

Similarity

  • Analogy

– Similar function

  • Homology

– Similar ancestry

  • Homology can be used to…

– …determine evolutionary relationships – …predict functions

Archaeopteryx

Evolution

  • Replication (copy)
  • Mutation (modify)

Time

  • Selection (fitness)

Biological sequences

  • Represented as 1‐dimensional string
  • Nucleic acids

– DNA: A, C, G, and T – RNA: A, C, G, and U – N is “unknown” Some other letters are ambiguities: – Some other letters are ambiguities:

  • Proteins:

– X is “unknown” – All other letters except B, J, O, U, and Z are amino acids:

  • Biological sequences are stored in Fasta files
  • Fasta files are plain text files (open e.g. in )

Fasta files

>protein_seque >protein_sequence_A nce_A MT MTQSSHAVAA FDL SSHAVAA FDLGAALR GAALRQE GLTETDYSE E GLTETDYSEI I QRDPNRAELG TFGV RDPNRAELG TFGV Every new sequence entry starts with a “>” sign at the start of a line Each sequence has an identifier that has to be unique in the file Q Q Q Q Q >protein_seque >protein_sequence_B nce_B MLTETDYSEI QRR MLTETDYSEI QRRLGRDPNR AELGMFGVM LGRDPNR AELGMFGVMN RAELGMFGY N RAELGMFGY >protein_seque >protein_sequence_C nce_C MHAVAAFDLG AAL MHAVAAFDLG AALRQEGLTE TDYSEIQRR RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA YRNDDA RPLLRPIKSP FGA RPLLRPIKSP FGAWVVIV WVVIV The sequence can be on one or more lines until the next “>” at the start of a new line Spaces and newlines just make sequences easier to read/count, they do not have any meaning

Fasta file extensions

  • The file extension of a Fasta file is .fa or .fasta
  • The preferred extension for protein Fasta files is .faa

– Fasta Amino Acid

>protein_seque >protein_sequence_A nce_A MTQSSHAVAA FDL MTQSSHAVAA FDLGAALRQE GLTETDYSE GAALRQE GLTETDYSEI QRDPNRAELG TFGV I QRDPNRAELG TFGV >protein_seque >protein_sequence_B nce_B MLTETDYSEI QRR MLTETDYSEI QRRLGRDPNR AELGMFGVM LGRDPNR AELGMFGVMN RAELGMFGY N RAELGMFGY >protein_seque >protein_sequence_C nce_C MHAVAAFDLG AAL MHAVAAFDLG AALRQEGLTE TDYSEIQRR RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA YRNDDA

  • The preferred extension for DNA Fasta files is .fna

– Fasta Nucleic Acid

MHAVAAFDLG AAL MHAVAAFDLG AALRQEGLTE TDYSEIQRR RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA YRNDDA RPLLRPIKSP FGA RPLLRPIKSP FGAWVVIV WVVIV >DNA_sequence_ >DNA_sequence_X GAGGAATTCA TAG GAGGAATTCA TAGCTGACGA GTCGAGTGA CTGACGA GTCGAGTGAA AACCGTGTCG TAAA A AACCGTGTCG TAAAAGA AGA >DNA_sequence_ >DNA_sequence_Y CTGACGAGTC GCC CTGACGAGTC GCCCCCCCCC ATAGAGTGG CCCCCCC ATAGAGTGGT TTCCGTTTCC GGAA T TTCCGTTTCC GGAAGGGTCG GGGTCG >DNA_sequence_ >DNA_sequence_Z GAAGCTGACC CGT GAAGCTGACC CGTTTCCGGA AGAGGGAGG TTCCGGA AGAGGGAGG

slide-2
SLIDE 2

22‐Mar‐15 2

Sequence alignment

  • How to determine if two sequences are homologous?
  • How to quantify their similarity?
  • Make a sequence alignment!
  • Aligned sequences allow you to calculate percent identity:

– Differences: 7 posions → 32 posions idencal – Alignment length: 39 positions – The two sequences are 100 * = 82% identical

  • Note: identity can be quantified, but homology cannot

– Saying “82% homologous” is wrong!

32 39

Identity versus similarity

  • These three sequences are all 66.7% identical…
  • …but what are those colors? *

– Some amino acids are more similar than others

Glutamic acid (E) Aspartic acid (D) Cysteine (C) George Orwell

Animal Farm

George Orwell

Animal Farm

(* hint)

Amino acid properties

http://swift.cmbi.ru.nl/teach/B1B/HTML/PosterA4_nbic_new.pdf

The similarity signal

  • What is the most relevant definition of a “similarity signal”

between amino acids?

– Size? – Polarity? – Hydrophobicity? – Preferred protein fold? – any other measures that might be important? …any other measures that might be important?

  • Similarity signal of what, exactly? *

– “Of the function of the amino acid in the protein” – “Of an evolutionary relationship”

  • Evolution has tested this for us

– So let’s use evolution to score similarity!

(* hint)

BLOcks SUbstitution Matrix (BLOSUM)

1. Take some aligned homologous sequences

– You will learn how to do this next week

2. Group highly identical sequences (e.g. >62% identity)

– To remove redundancy biases in the sequence database To remove redundancy biases in the sequence database

3. Identify well‐aligned blocks

– So that only real mutations are compared

4. Count how two often amino acids mutated into each other

– This is the most relevant definition of “similarity”!

Are we looking at well‐aligned homologs?

ll li d h l h h f i Unaligned sequences

  • Well‐aligned homologs show you how often two amino

acids really mutated into each other during evolution Well‐aligned homologs

slide-3
SLIDE 3

22‐Mar‐15 3

  • BLOSUM defines similarity between amino acids based on the

statistical probability that they align in well‐aligned homologs

– Observed/expected ratio (odds ratio)

  • An obeserved/expected ratio of 2 means something happens twice as often as

you expected by random chance

– Observed: aligned in well‐aligned homologs – Expected: “aligned” in unaligned sequences

BLOSUM score:

How to quantify this similarity?

  • The observed/expected ratio for a whole alignment can be

calculated by multiplying the ratios for all aligned amino acids

– This is a lot of multiplying for long sequences

  • Logarithms allow us to sum the scores of aligned residues

– Much easier to use for a scoring system

  • Hence: the BLOSUM score
  • bserved

expected aligned in well‐aligned homologs “aligned” in unaligned sequences

log (

)

Si,j = 2 ∙ log2(

)

Fi,j Fi ∙ Fj

An example

  • Let’s say that we have a well‐aligned block of:

– 100 amino acids long – 1,000 proteins “deep” – With no gaps

  • 7.4% are alanine (A)
  • 1.3% are tryptophan (W)
  • Randomly, we expect a fraction of A‐W alignments of:

7,400 100 ∙ 1,000

→ FA = = 0.074 → FW = 0.013 Randomly, we expect a fraction of A W alignments of:

  • In reality, we observe a fraction of A‐W alignments of: 0.00034

– The A‐W mutation occurred less frequently in evolution than expected! – … so the substitution score must be negative

  • The A‐W substitution score is:

Si,j = 2 ∙ log2(

)

Fi,j Fi ∙ Fj

SA,W = 2 ∙ log2(

)

0.00034 0.000962

= ‐3 FA ∙ FW = 0.074 ∙ 0.013 = 0.000962

BLOSUM62

Positive Polar, non‐hydrophobic Small

Low for infrequent substitutions High for frequent substitutions Low for common amino acids High for rare amino acids

Aromatic Hydrophobic

  • Why are the highest numbers in each row/column on the

diagonal?

– Because in well‐aligned homologs, every amino acid is most

  • ften aligned to itself (not mutated)

So what do the numbers mean?

  • A score of ‐3 for the substitution A‐W means that this substitution is
  • bserved in well‐aligned homologs 0.35 times as often as expected:

– … or: = 2.83 times less often than expected

2(‐3/2) = 0.35 Si,j = 2 ∙ log2(

)

Fi,j Fi ∙ Fj Fi,j Fi ∙ Fj

2(Si,j/2) = →

1 0.35 0.00034 0.000962

=

  • A score of 2 for the substitution D‐E means that this substitution is
  • bserved in well‐aligned homologs twice as often as expected:
  • A score of 9 for the substitution C‐C means that this substitution is
  • bserved in well‐aligned homologs 22.6 times as often as expected:

2(2/2) = 2 2(9/2) = 22.6

Odds ratio that sequences are well‐aligned homologs

  • How likely is it that two sequences are well‐aligned homologs?

– seqA – seqB : ‐1 + ‐3 + ‐2 = ‐6 – seqA – seqC : ‐1 + ‐2 + ‐2 = ‐5 – seqB – seqC : 4 + 2 + 4 = 10

2(‐6/2) = 0.125 2(‐5/2) = 0.177 2(10/2) = 32

  • When you observe RYD aligned to SDA, these sequences are 0.125

times more likely to be well‐aligned homologs than expected

– … or 8 times less likely to be well‐aligned homologs than expected

  • When you observe RYD aligned to SEA, these sequences are 0.177

times more likely to be well‐aligned homologs than expected

– … or 5.6 times less likely to be well‐aligned homologs than expected

  • When you observe SDA aligned to SEA, these sequences are 32

times more likely to be well‐aligned homologs than expected

Length of the alignment

  • As you see a more and more length of two sequences…

– If two sequences are well‐aligned homologs, most of the aligned amino acids will have positive scores – … so the alignment score keeps increasing as you see more length of the sequences – … so the likelihood that they are well‐aligned homologs keeps increasing – If two sequences are not homologous, most of the aligned amino acids will have negative scores – … so the score keeps decreasing as you see more sequence – … so the likelihood that they are well‐aligned homologs keeps decreasing

slide-4
SLIDE 4

22‐Mar‐15 4

BLOSUM62

Positive Polar, non‐hydrophobic Small Aromatic Hydrophobic

  • Calculate the similarity score between these sequences:

a)

KAWSADV

b)

HAMNWEQ

c)

RDWSACV

– 1 + 4 – 1 + 1 – 3 + 2 – 2 = 0 0 – 2 – 1 + 1 – 3 + 5 – 2 = – 2 2 – 2 + 11 + 4 + 4 – 3 + 4 = 20

BLOSUM numbers

  • BLOSUM removes redundancy biases from the blocks of

well‐aligned homologs

– The sequence database is highly biased (>2,000 E. coli genomes!) – Clusters of highly identical sequences are collapsed – One sequence per cluster is included in the block for calculating the BLOSUM matrix

  • High numbers compare and detect closely related proteins

High numbers compare and detect closely related proteins

  • Low numbers compare and detect more distant proteins
  • The most used matrices:

– BLOSUM45: distant homology – BLOSUM62: default – BLOSUM80: closely related

20 40 60 80 100 BLOSUM45 BLOSUM62 BLOSUM80

% protein identity

Collapsed Included

Other substitution matrices

  • Point Accepted Mutations (PAM)

– Developed by Margaret Dayhoff in 1978 – Based on counting 1,572 real mutations that

  • ccurred during the evolution of 71 proteins
  • Gonnet (1992)

– Similar to PAM, but based on more sequences Similar to PAM, but based on more sequences – Default in Clustal Omega

  • Physico‐chemical properties

How similar is similar?

  • Burkhart Rost

– Protein identity >30%

  • 90% of structures are similar

– Protein identity <25%

  • 10% of structures are similar
  • Conservation:

– Protein sequence is more conserved than nucleic acid sequence – The structure and function of a protein are more conserved than its sequence – The catalytic reaction of an enzyme is more conserved than its substrate

Two options for DNA substitution matrices

  • Same penalty for all mismatches

– Example: identity matrix

  • Lower penalty for transitions than for transversions

– Transitions occur about twice as often

A C G T A 1 C ‐1 1 G ‐1 ‐1 1 T ‐1 ‐1 ‐1 1 A C G T A 1 A C T G Transitions: A ↔ G C ↔ T Transversions: A ↔ C A ↔ T C ↔ G G ↔ T A 1 C ‐2 1 G ‐1 ‐2 1 T ‐2 ‐1 ‐2 1

Which matrix to use?

  • Henikoff and Henikoff compared several BLOSUM matrices

to PAM, PET, Overington, Gonnet, and PAM

– They scored how effectively the matrices detected known members of a protein family in a database

  • Overall, the BLOSUM62 matrix is the most effective
  • However, all substitution matrices perform better than

BLOSUM62 for some protein families BLOSUM62 for some protein families

  • No single matrix has the complete answer 
slide-5
SLIDE 5

22‐Mar‐15 5

Substitution matrices

  • …are used in (almost) all sequence analyses
  • …represent a particular model of evolution

– This will be further discussed next week

  • …can strongly influence your results
  • Bioinformatic programs or web servers often have default

parameter values and use a default substitution matrix

– These are not always the most optimal for your questions – If you have no idea, check the literature in your field what is considered reliable