Sequence comparison: Score matrices
Genome 559: Introduction to Statistical and Computational Genomics
- Prof. James H. Thomas
FYI - informal inductive proof of best alignment path
Consider the last step in the best alignment path to node a. This path must come from one of the three upstream nodes, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. The last step is a match, a gap in sequence A, or a gap in sequence B:
X + match
Y + gap
Z + gap
The best-scoring path to a is the maximum of these three. BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED.
Example DNA score matrix (rows and columns A, C, G, T): each match on the diagonal scores 2.
$$F(i,j) = \max\{\, F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) + d,\; F(i,j-1) + d \,\}$$
(no arrow means no preceding alignment)
d = -5
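A minimal sketch of the three-way max fill, assuming a match score of 2 and d = -5 as on the slides. The mismatch score of -3 and the function name `best_alignment_score` are assumptions made for illustration only.

```python
def best_alignment_score(x, y, match=2, mismatch=-3, d=-5):
    """Best global alignment score of x and y (mismatch value is assumed)."""
    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + d          # leading gaps
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + d          # leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal: align x_i with y_j
                          F[i - 1][j] + d,       # gap in one sequence
                          F[i][j - 1] + d)       # gap in the other sequence
    return F[n][m]

score = best_alignment_score("ACGT", "AGT")
```

Each cell depends only on its three upstream neighbours, which is the inductive argument in code form.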
dot plot of two DNA sequences
DP alignment path
regular 20 amino acids
ambiguity codes and stop
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62
Good scores - chemically similar
Bad scores - chemically dissimilar
Hydrophobic: alanine (A), valine (V), glycine (G), leucine (L), isoleucine (I), methionine (M), proline (P), phenylalanine (F), tryptophan (W)
Polar: threonine (T), tyrosine (Y), serine (S), asparagine (N), glutamine (Q), cysteine (C)
Charged: lysine (K), arginine (R), histidine (H), aspartate (D), glutamate (E)
(side-chain structure drawings omitted)
Alignment block: 31 positions (columns), 61 sequences (rows)
computing a single BLOSUM matrix.
all blocks.
Sample column from an alignment block:
D E D N D D
6 D-D pairs, 4 D-E pairs, 4 D-N pairs, 1 E-N pair
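The pair counts for the sample column can be checked with a short sketch. The helper name `count_pairs` is made up for this illustration; it simply enumerates every pair of rows in one column.

```python
from collections import Counter
from itertools import combinations

def count_pairs(column):
    """Count unordered residue pairs over all pairs of rows in one column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1   # D-E and E-D are the same pair
    return counts

pairs = count_pairs("DEDNDD")
total = sum(pairs.values())   # 6 sequences give 6*5/2 pairs
```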
Actual aligned pair frequency:
$$q_{ij} = \frac{c_{ij}}{T}$$
where $c_{ij}$ is the count of $ij$ pairs and $T$ is the total pair count.
Randomly expected pair frequency:
$$e_{aa} = p_a p_a \qquad e_{ab} = p_a p_b + p_b p_a = 2 p_a p_b$$
where $p_a$ and $p_b$ are the overall probabilities (frequencies) of specific residues $a$ and $b$.
(A multiple alignment of N sequences is equivalent to all of its pairwise alignments, which number N(N-1)/2.)
this is called the sum
part of BLOSUM)
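For computational speed, log-odds scores are often rounded to the nearest integer, and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a “half-bit” score:
$$s_{ij} = 2 \log_2 \frac{q_{ij}}{e_{ij}}$$
where $q_{ij}$ is the actual aligned pair frequency and $e_{ij}$ is the randomly expected pair frequency.
(computers can add integers faster than floats)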
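A sketch of the half-bit calculation, assuming the score is twice the base-2 log of the observed over expected pair frequency, rounded to an integer. The function name and the numeric frequencies below are made up for illustration and are not BLOSUM data.

```python
import math

def half_bit_score(q, p_a, p_b, same_residue):
    """Rounded half-bit log-odds score for one residue pair."""
    # Expected frequency: p_a*p_a for identical residues, 2*p_a*p_b otherwise
    e = p_a * p_b if same_residue else 2 * p_a * p_b
    return round(2 * math.log2(q / e))

# Illustrative (made-up) numbers: a pair seen 4x more often than expected ...
favored = half_bit_score(q=0.02, p_a=0.05, p_b=0.05, same_residue=False)
# ... and a pair seen 4x less often than expected
disfavored = half_bit_score(q=0.00125, p_a=0.05, p_b=0.05, same_residue=False)
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative scores, with the same magnitude for the same fold change.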
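Reverse calculation of aligned C-C pair frequency in BLOSUM data set:
The C-C score is 9 half-bits = 4.5 bits, thus
$$\log_2 \frac{q_{cc}}{e_{cc}} = 4.5, \qquad e_{cc} = p_c^2, \qquad q_{cc} = e_{cc} \cdot 2^{4.5}$$
where $p_c$ is the frequency of the C residue (you have to look this up).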
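The reverse calculation can be sketched as follows. The function name is hypothetical, and the residue frequency 0.02 is only a placeholder, not the looked-up frequency of C.

```python
import math

def observed_pair_frequency(half_bits, p_a, p_b=None):
    """Invert a half-bit score to recover the aligned pair frequency q."""
    bits = half_bits / 2
    # Expected frequency: p_a**2 for an identical pair, 2*p_a*p_b otherwise
    e = p_a * p_a if p_b is None else 2 * p_a * p_b
    return e * 2 ** bits

p_c = 0.02                                   # placeholder value (assumption)
q_cc = observed_pair_frequency(9, p_c)       # 9 half-bits = 4.5 bits
recovered = 2 * math.log2(q_cc / (p_c * p_c))  # should give back 9
```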
Log-odds scores give a probabilistic interpretation to an alignment score.
VHRDLKPENLLLASK
VHRDLKPENLLLASK
(4+8+5+6+4+5+7+5+6+4+4+4+4+4+5)
This alignment has a score of 75 half-bits (37.5 bits), meaning that it is ~10^11 times more likely to be seen in a real alignment than in a random alignment(!!).
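The arithmetic can be checked directly, assuming the per-column scores are in half-bits:

```python
# Per-column scores for the alignment, taken from the sum above
scores = [4, 8, 5, 6, 4, 5, 7, 5, 6, 4, 4, 4, 4, 4, 5]

total_half_bits = sum(scores)        # total alignment score
total_bits = total_half_bits / 2     # half-bits to bits
odds = 2 ** total_bits               # odds ratio, real vs. random alignment
```

The odds ratio comes out on the order of 10^11.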
BLOSUM 62, meaning an alignment with this score
from a random alignment.
(BLOSUM62)
$$s_{ij} = 2 \log_2 \frac{q_{ij}}{e_{ij}}$$
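If a gap arises independently at each position with probability $g$ (probability of a gap at each position in the sequence), then the probability of a gap of length $n$ is
$$P_n = g^n$$
if $g = 1/2$ then $P_n = 2^{-n}$
On a log-linear plot this is a straight line. [note - the slope of the line on a log-linear plot will vary according to the frequency of gaps, but it will always be linear]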
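A small sketch of why such a model is linear on a log-linear plot, assuming a gap of length n has probability g^n (this model and the names below are assumptions for illustration):

```python
import math

def gap_length_probability(n, g):
    """Probability of a gap of length n if each position gaps independently."""
    return g ** n

g = 0.5
logs = [math.log2(gap_length_probability(n, g)) for n in range(1, 6)]
# Differences between consecutive log-probabilities: constant means a straight line
slopes = [b - a for a, b in zip(logs, logs[1:])]
```

Every step in n changes the log-probability by the same amount, log2(g), so changing g changes the slope but not the linearity.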
Real gap lengths are nowhere near linear on a log-linear plot, hence the use of affine gap penalties (ideally there would be several levels of decreasing affine penalties).
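A sketch contrasting a linear gap penalty with an affine one. The opening and extension values are hypothetical, and conventions differ between programs (some charge the extension for the first position as well).

```python
def linear_gap_penalty(n, d=-5):
    """Every gap position costs the same amount d."""
    return n * d

def affine_gap_penalty(n, gap_open=-11, gap_extend=-1):
    """First gap position costs gap_open, each further position gap_extend."""
    if n <= 0:
        return 0
    return gap_open + (n - 1) * gap_extend

# Long gaps are penalized far less under the affine scheme
long_linear = linear_gap_penalty(10)
long_affine = affine_gap_penalty(10)
```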