[PDF] - Quantifying sequence similarity Analogy Homology Similar function PDF Document

SLIDE 1

22‐Mar‐15 1

Quantifying sequence similarity

Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 19th 2015

Similarity

Analogy

– Similar function

Homology

– Similar ancestry

Homology can be used to…

– …determine evolutionary relationships – …predict functions

Archaeopteryx

Evolution

Replication (copy)
Mutation (modify)

Time

Selection (fitness)

Biological sequences

Represented as 1‐dimensional string
Nucleic acids

– DNA: A, C, G, and T – RNA: A, C, G, and U – N is “unknown” Some other letters are ambiguities: – Some other letters are ambiguities:

Proteins:

– X is “unknown” – All other letters except B, J, O, U, and Z are amino acids:

Biological sequences are stored in Fasta files
Fasta files are plain text files (open e.g. in )

Fasta files

>protein_seque >protein_sequence_A nce_A MT MTQSSHAVAA FDL SSHAVAA FDLGAALR GAALRQE GLTETDYSE E GLTETDYSEI I QRDPNRAELG TFGV RDPNRAELG TFGV Every new sequence entry starts with a “>” sign at the start of a line Each sequence has an identifier that has to be unique in the file Q Q Q Q Q >protein_seque >protein_sequence_B nce_B MLTETDYSEI QRR MLTETDYSEI QRRLGRDPNR AELGMFGVM LGRDPNR AELGMFGVMN RAELGMFGY N RAELGMFGY >protein_seque >protein_sequence_C nce_C MHAVAAFDLG AAL MHAVAAFDLG AALRQEGLTE TDYSEIQRR RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA YRNDDA RPLLRPIKSP FGA RPLLRPIKSP FGAWVVIV WVVIV The sequence can be on one or more lines until the next “>” at the start of a new line Spaces and newlines just make sequences easier to read/count, they do not have any meaning

Fasta file extensions

The file extension of a Fasta file is .fa or .fasta
The preferred extension for protein Fasta files is .faa

– Fasta Amino Acid

>protein_seque >protein_sequence_A nce_A MTQSSHAVAA FDL MTQSSHAVAA FDLGAALRQE GLTETDYSE GAALRQE GLTETDYSEI QRDPNRAELG TFGV I QRDPNRAELG TFGV >protein_seque >protein_sequence_B nce_B MLTETDYSEI QRR MLTETDYSEI QRRLGRDPNR AELGMFGVM LGRDPNR AELGMFGVMN RAELGMFGY N RAELGMFGY >protein_seque >protein_sequence_C nce_C MHAVAAFDLG AAL MHAVAAFDLG AALRQEGLTE TDYSEIQRR RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA YRNDDA

The preferred extension for DNA Fasta files is .fna

– Fasta Nucleic Acid

MHAVAAFDLG AAL MHAVAAFDLG AALRQEGLTE TDYSEIQRR RQEGLTE TDYSEIQRRL GRAMFGVMWS EHCC L GRAMFGVMWS EHCCYRNDDA YRNDDA RPLLRPIKSP FGA RPLLRPIKSP FGAWVVIV WVVIV >DNA_sequence_ >DNA_sequence_X GAGGAATTCA TAG GAGGAATTCA TAGCTGACGA GTCGAGTGA CTGACGA GTCGAGTGAA AACCGTGTCG TAAA A AACCGTGTCG TAAAAGA AGA >DNA_sequence_ >DNA_sequence_Y CTGACGAGTC GCC CTGACGAGTC GCCCCCCCCC ATAGAGTGG CCCCCCC ATAGAGTGGT TTCCGTTTCC GGAA T TTCCGTTTCC GGAAGGGTCG GGGTCG >DNA_sequence_ >DNA_sequence_Z GAAGCTGACC CGT GAAGCTGACC CGTTTCCGGA AGAGGGAGG TTCCGGA AGAGGGAGG

SLIDE 2

22‐Mar‐15 2

Sequence alignment

How to determine if two sequences are homologous?
How to quantify their similarity?
Make a sequence alignment!
Aligned sequences allow you to calculate percent identity:

– Differences: 7 posions → 32 posions idencal – Alignment length: 39 positions – The two sequences are 100 * = 82% identical

Note: identity can be quantified, but homology cannot

– Saying “82% homologous” is wrong!

32 39

Identity versus similarity

These three sequences are all 66.7% identical…
…but what are those colors? *

– Some amino acids are more similar than others

Glutamic acid (E) Aspartic acid (D) Cysteine (C) George Orwell

Animal Farm

George Orwell

Animal Farm

(* hint)

Amino acid properties

http://swift.cmbi.ru.nl/teach/B1B/HTML/PosterA4_nbic_new.pdf

The similarity signal

What is the most relevant definition of a “similarity signal”

between amino acids?

– Size? – Polarity? – Hydrophobicity? – Preferred protein fold? – any other measures that might be important? …any other measures that might be important?

Similarity signal of what, exactly? *

– “Of the function of the amino acid in the protein” – “Of an evolutionary relationship”

Evolution has tested this for us

– So let’s use evolution to score similarity!

(* hint)

BLOcks SUbstitution Matrix (BLOSUM)

1. Take some aligned homologous sequences

– You will learn how to do this next week

2. Group highly identical sequences (e.g. >62% identity)

– To remove redundancy biases in the sequence database To remove redundancy biases in the sequence database

3. Identify well‐aligned blocks

– So that only real mutations are compared

4. Count how two often amino acids mutated into each other

– This is the most relevant definition of “similarity”!

Are we looking at well‐aligned homologs?

ll li d h l h h f i Unaligned sequences

Well‐aligned homologs show you how often two amino

acids really mutated into each other during evolution Well‐aligned homologs

SLIDE 3

22‐Mar‐15 3

BLOSUM defines similarity between amino acids based on the

statistical probability that they align in well‐aligned homologs

– Observed/expected ratio (odds ratio)

An obeserved/expected ratio of 2 means something happens twice as often as

you expected by random chance

– Observed: aligned in well‐aligned homologs – Expected: “aligned” in unaligned sequences

BLOSUM score:

How to quantify this similarity?

The observed/expected ratio for a whole alignment can be

calculated by multiplying the ratios for all aligned amino acids

– This is a lot of multiplying for long sequences

Logarithms allow us to sum the scores of aligned residues

– Much easier to use for a scoring system

Hence: the BLOSUM score
bserved

expected aligned in well‐aligned homologs “aligned” in unaligned sequences

log (

)

Si,j = 2 ∙ log2(

)

Fi,j Fi ∙ Fj

→

An example

Let’s say that we have a well‐aligned block of:

– 100 amino acids long – 1,000 proteins “deep” – With no gaps

7.4% are alanine (A)
1.3% are tryptophan (W)
Randomly, we expect a fraction of A‐W alignments of:

7,400 100 ∙ 1,000

→ FA = = 0.074 → FW = 0.013 Randomly, we expect a fraction of A W alignments of:

In reality, we observe a fraction of A‐W alignments of: 0.00034

– The A‐W mutation occurred less frequently in evolution than expected! – … so the substitution score must be negative

The A‐W substitution score is:

Si,j = 2 ∙ log2(

)

Fi,j Fi ∙ Fj

SA,W = 2 ∙ log2(

)

0.00034 0.000962

= ‐3 FA ∙ FW = 0.074 ∙ 0.013 = 0.000962

BLOSUM62

Positive Polar, non‐hydrophobic Small

Low for infrequent substitutions High for frequent substitutions Low for common amino acids High for rare amino acids

Aromatic Hydrophobic

Why are the highest numbers in each row/column on the

diagonal?

– Because in well‐aligned homologs, every amino acid is most

ften aligned to itself (not mutated)

So what do the numbers mean?

A score of ‐3 for the substitution A‐W means that this substitution is
bserved in well‐aligned homologs 0.35 times as often as expected:

– … or: = 2.83 times less often than expected

2(‐3/2) = 0.35 Si,j = 2 ∙ log2(

)

Fi,j Fi ∙ Fj Fi,j Fi ∙ Fj

2(Si,j/2) = →

1 0.35 0.00034 0.000962

=

A score of 2 for the substitution D‐E means that this substitution is
bserved in well‐aligned homologs twice as often as expected:
A score of 9 for the substitution C‐C means that this substitution is
bserved in well‐aligned homologs 22.6 times as often as expected:

2(2/2) = 2 2(9/2) = 22.6

Odds ratio that sequences are well‐aligned homologs

How likely is it that two sequences are well‐aligned homologs?

– seqA – seqB : ‐1 + ‐3 + ‐2 = ‐6 – seqA – seqC : ‐1 + ‐2 + ‐2 = ‐5 – seqB – seqC : 4 + 2 + 4 = 10

2(‐6/2) = 0.125 2(‐5/2) = 0.177 2(10/2) = 32

When you observe RYD aligned to SDA, these sequences are 0.125

times more likely to be well‐aligned homologs than expected

– … or 8 times less likely to be well‐aligned homologs than expected

When you observe RYD aligned to SEA, these sequences are 0.177

times more likely to be well‐aligned homologs than expected

– … or 5.6 times less likely to be well‐aligned homologs than expected

When you observe SDA aligned to SEA, these sequences are 32

times more likely to be well‐aligned homologs than expected

Length of the alignment

As you see a more and more length of two sequences…

– If two sequences are well‐aligned homologs, most of the aligned amino acids will have positive scores – … so the alignment score keeps increasing as you see more length of the sequences – … so the likelihood that they are well‐aligned homologs keeps increasing – If two sequences are not homologous, most of the aligned amino acids will have negative scores – … so the score keeps decreasing as you see more sequence – … so the likelihood that they are well‐aligned homologs keeps decreasing

SLIDE 4

22‐Mar‐15 4

BLOSUM62

Positive Polar, non‐hydrophobic Small Aromatic Hydrophobic

Calculate the similarity score between these sequences:

a)

KAWSADV

b)

HAMNWEQ

c)

RDWSACV

– 1 + 4 – 1 + 1 – 3 + 2 – 2 = 0 0 – 2 – 1 + 1 – 3 + 5 – 2 = – 2 2 – 2 + 11 + 4 + 4 – 3 + 4 = 20

BLOSUM numbers

BLOSUM removes redundancy biases from the blocks of

well‐aligned homologs

– The sequence database is highly biased (>2,000 E. coli genomes!) – Clusters of highly identical sequences are collapsed – One sequence per cluster is included in the block for calculating the BLOSUM matrix

High numbers compare and detect closely related proteins

High numbers compare and detect closely related proteins

Low numbers compare and detect more distant proteins
The most used matrices:

– BLOSUM45: distant homology – BLOSUM62: default – BLOSUM80: closely related

20 40 60 80 100 BLOSUM45 BLOSUM62 BLOSUM80

% protein identity

Collapsed Included

Other substitution matrices

Point Accepted Mutations (PAM)

– Developed by Margaret Dayhoff in 1978 – Based on counting 1,572 real mutations that

ccurred during the evolution of 71 proteins
Gonnet (1992)

– Similar to PAM, but based on more sequences Similar to PAM, but based on more sequences – Default in Clustal Omega

Physico‐chemical properties

How similar is similar?

Burkhart Rost

– Protein identity >30%

90% of structures are similar

– Protein identity <25%

10% of structures are similar
Conservation:

– Protein sequence is more conserved than nucleic acid sequence – The structure and function of a protein are more conserved than its sequence – The catalytic reaction of an enzyme is more conserved than its substrate

Two options for DNA substitution matrices

Same penalty for all mismatches

– Example: identity matrix

Lower penalty for transitions than for transversions

– Transitions occur about twice as often

A C G T A 1 C ‐1 1 G ‐1 ‐1 1 T ‐1 ‐1 ‐1 1 A C G T A 1 A C T G Transitions: A ↔ G C ↔ T Transversions: A ↔ C A ↔ T C ↔ G G ↔ T A 1 C ‐2 1 G ‐1 ‐2 1 T ‐2 ‐1 ‐2 1

Which matrix to use?

Henikoff and Henikoff compared several BLOSUM matrices

to PAM, PET, Overington, Gonnet, and PAM

– They scored how effectively the matrices detected known members of a protein family in a database

Overall, the BLOSUM62 matrix is the most effective
However, all substitution matrices perform better than

BLOSUM62 for some protein families BLOSUM62 for some protein families

No single matrix has the complete answer 

SLIDE 5

22‐Mar‐15 5

Substitution matrices

…are used in (almost) all sequence analyses
…represent a particular model of evolution

– This will be further discussed next week

…can strongly influence your results
Bioinformatic programs or web servers often have default