Sequence comparison: Score matrices

Oct 17, 2023

SLIDE 1

Sequence comparison: Score matrices

Genome 559: Introduction to Statistical and Computational Genomics

Prof. James H. Thomas

SLIDE 2

Informal inductive proof of best alignment path

Consider the last step in the best alignment path to node a below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node a by three possible paths: an A-B match, a gap in sequence A, or a gap in sequence B.

[Figure: node a and its three upstream nodes X, Y, and Z on a grid with axes seq A and seq B. The edge from X is labeled "match"; the edges from Y and Z are labeled "gap".]

The best-scoring path to a is the maximum of:

X + match
Y + gap
Z + gap

But the best paths to X, Y, and Z are analogously the maximum of their three upstream possibilities, and so on. Inductively, QED.
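The inductive argument translates directly into a dynamic-programming fill of the score matrix. The following is a minimal sketch, not the course's own code: the function name `best_global_score` and the scoring values (match = 2, mismatch = -1, gap = -5) are illustrative assumptions.

```python
def best_global_score(seq_a, seq_b, match=2, mismatch=-1, gap=-5):
    """Return the best global alignment score of seq_a against seq_b.

    F[i][j] holds the best cumulative score for aligning seq_a[:i] with
    seq_b[:j]. Scoring values are illustrative assumptions.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]

    # First column and first row: the only way to get here is with gaps.
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, cols):
        F[0][j] = F[0][j - 1] + gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Node a is the maximum of its three upstream possibilities.
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # X + match (or mismatch)
                F[i - 1][j] + gap,     # Y + gap
                F[i][j - 1] + gap,     # Z + gap
            )
    return F[-1][-1]
```

Because every cell already stores the best score to reach it, taking the maximum over three neighbors is enough; no path needs to be re-examined.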

SLIDE 3

Local alignment

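[Figure: dynamic-programming score matrix for two short DNA sequences, with example cell scores of 2 and 4. No arrow means no preceding alignment.]

Cell $F(i, j)$ is reached from $F(i-1, j-1)$ with substitution score $s(x_i, y_j)$, or from $F(i-1, j)$ or $F(i, j-1)$ with gap penalty $d$.

d = -5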

SLIDE 4

Local alignment

Two differences from global alignment:

– If a score is negative, replace with 0.
– Traceback from the highest score in the matrix and continue until you reach 0.

Global alignment algorithm: Needleman-Wunsch.

Local alignment algorithm: Smith-Waterman.
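The two differences can be seen in a short sketch of local alignment. This is an illustrative implementation under assumed scoring values (match = 2, mismatch = -1, gap = -5); the function name `smith_waterman` is a label chosen here, not taken from the slides.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-5):
    """Return (best score, aligned part of seq_a, aligned part of seq_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Difference 1: a negative score is replaced with 0.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: trace back from the highest score until reaching 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(seq_a[i - 1])
            bottom.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append(seq_a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("TTACGTT", "GGACGGG")
```

With these assumed scores, the shared core `ACG` is the only region that accumulates a positive score, so the traceback starts and stops around it.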

SLIDE 5

Protein score matrices

Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time.

All possible amino acid changes are represented (matrix of size at least 20 x 20).

Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence.

DNA score matrices are much simpler (and are conceptually similar).

SLIDE 6

BLOSUM62 Score Matrix

[Figure: the BLOSUM62 matrix, with the regular 20 amino acids marked separately from the ambiguity codes and stop.]

# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62

SLIDE 7

Amino acid structures

Hydrophobic Polar Charged

phenylalanine F

SLIDE 8

BLOSUM62 Score Matrix

Good scores – chemically similar
Bad scores – chemically dissimilar

SLIDE 9

Amino acid structures

Hydrophobic: alanine A, valine V, glycine G, leucine L, isoleucine I, methionine M, proline P, tryptophan W

Polar: threonine T, tyrosine Y, serine S, asparagine N, glutamine Q, cysteine C

Charged: lysine K, arginine R, histidine H, aspartate D, glutamate E

[Figure: structural formula of each amino acid side chain.]

SLIDE 10

Deriving BLOSUM scores

Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment).

Measure how often various amino acid pairs occur in the alignments.

Normalize this to the expected frequency of such pairs randomly in the same set of alignments.

Derive a log-odds score (often in half bits).

SLIDE 11

Example of alignment block

31 amino acids (columns), 61 sequences (rows)

Thousands of such blocks go into computing a single BLOSUM matrix.

Represent full diversity of sequences.

Results are summed over all columns of all blocks.

SLIDE 12

Pair frequency vs. expectation

Sample column from a multiple alignment:

D E D N D D

6 D-D pairs, 4 D-E pairs, 4 D-N pairs, 1 E-N pair

A multiple alignment of N sequences is the equivalent of all the pairwise alignments, which number (N)(N-1)/2.

Actual aligned pair frequency:

\[ q_{ij} = \frac{c_{ij}}{T} \]

where $c_{ij}$ is the count of $ij$ pairs and $T$ is the total pair count.

Randomly expected pair frequency:

\[ e_{aa} = p_a p_a, \qquad e_{ab} = p_a p_b + p_b p_a = 2 p_a p_b, \quad \text{etc.} \]

where $p_a$ and $p_b$ are the overall probabilities (frequencies) of specific residues $a$ and $b$.
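The pair counts for the sample column can be reproduced by enumerating all pairwise combinations of its residues. This is a small sketch; the helper names `column_pair_counts` and `pair_frequencies` are chosen here for illustration.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column):
    """Count unordered residue pairs in one alignment column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        # Sort so that D-E and E-D are counted as the same pair.
        counts["".join(sorted((x, y)))] += 1
    return counts


def pair_frequencies(counts):
    """Convert pair counts c_ij to frequencies q_ij = c_ij / T."""
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


counts = column_pair_counts("DEDNDD")
freqs = pair_frequencies(counts)
```

A column of 6 residues yields 6 x 5 / 2 = 15 pairs in total, which is the T used to normalize this single column. For a real matrix, the counts would first be summed over all columns of all blocks.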

SLIDE 13

Log-odds score calculation (so adding scores == multiplying probabilities)

\[ s_{ij} = \log_2 \frac{q_{ij}}{e_{ij}} \]

For computational speed, scores are often rounded to the nearest integer, and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a "half-bit" score:

\[ \text{matrixScore (rounded)} = 2 \log_2 \frac{q_{ij}}{e_{ij}} \]
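As a minimal sketch of the half-bit calculation, the function below takes an observed pair frequency q and an expected pair frequency e and returns the rounded score. The name `half_bit_score` and the example inputs are illustrative assumptions.

```python
import math


def half_bit_score(q_ij, e_ij):
    """Rounded log-odds score in half-bit units: round(2 * log2(q / e))."""
    return round(2 * math.log2(q_ij / e_ij))


# A pair seen 4x more often than expected by chance: log2(4) = 2 bits.
enriched = half_bit_score(0.04, 0.01)
# A pair seen 4x less often than expected by chance: log2(1/4) = -2 bits.
depleted = half_bit_score(0.0025, 0.01)
```

Pairs that occur more often than chance predicts get positive scores, pairs that occur less often get negative scores, and a pair at exactly chance frequency scores 0.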



SLIDE 14

BLOSUM62 matrix (half-bit scores)

[Figure: BLOSUM62 matrix with the C-C entry highlighted; its score is 9 half-bits = 4.5 bits.]

Frequency of C residue over all proteins: 0.0162 (you have to look this up)

Reverse calculation of aligned C-C pair frequency in BLOSUM data set:

\[ e_{CC} = 0.0162 \times 0.0162 = 0.000262 \]

\[ \frac{q_{CC}}{e_{CC}} = 2^{4.5} = 22.63 \]

thus

\[ q_{CC} = 22.63 \times 0.000262 \approx 0.00594 \]
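The reverse calculation can be checked numerically. This is a sketch using the slide's numbers; the function name `aligned_pair_frequency` is an assumption made for illustration.

```python
def aligned_pair_frequency(half_bits, p_a, p_b, same_residue=True):
    """Recover q from a half-bit score and residue background frequencies."""
    # Expected frequency: p_a * p_a for identical residues, 2 * p_a * p_b otherwise.
    e = p_a * p_b if same_residue else 2 * p_a * p_b
    # Half-bits to bits, then bits to an odds ratio q / e.
    odds = 2 ** (half_bits / 2)
    return odds * e


# C-C scores 9 half-bits in BLOSUM62; C frequency over all proteins is 0.0162.
q_cc = aligned_pair_frequency(9, 0.0162, 0.0162)
```

The result agrees with the slide's 0.00594 to within rounding: aligned C-C pairs are far more common than the background frequency of cysteine alone would predict.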

SLIDE 15

Constructing Blocks

1. Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long.

2. Cluster the members of each block according to their percent identity.

3. Make pair counts and score matrix from a large collection of similarly clustered blocks.

4. Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).
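The clustering step can be sketched as follows. The slides do not say how clusters are formed, so this sketch assumes a simple linkage rule (a segment joins any cluster containing at least one member at or above the cutoff); the names `percent_identity` and `cluster_block` are hypothetical.

```python
def percent_identity(seg_a, seg_b):
    """Percent of identical positions between two equal-length ungapped segments."""
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    return 100.0 * matches / len(seg_a)


def cluster_block(segments, cutoff):
    """Group the segments of one block by percent identity.

    Assumed rule: a segment is linked to a cluster if it reaches the
    cutoff against any member; all clusters it links to are merged.
    """
    clusters = []
    for seg in segments:
        linked, unlinked = [], []
        for cluster in clusters:
            hit = any(percent_identity(seg, m) >= cutoff for m in cluster)
            (linked if hit else unlinked).append(cluster)
        merged = [seg] + [m for cluster in linked for m in cluster]
        clusters = unlinked + [merged]
    return clusters


clusters = cluster_block(["AAAA", "AAAT", "GGGG"], cutoff=62)
```

Under this assumption, a higher cutoff leaves more segments in separate clusters, so more closely related pairs contribute to the counts.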

SLIDE 16

Probabilistic Interpretation of Scores (ungapped)

By converting scores back to probabilities, we can give a probabilistic interpretation to an alignment score.

\[ \text{matrixScore (rounded)} = 2 \log_2 \frac{q_{ij}}{e_{ij}} \quad \text{(BLOSUM62)} \]

FIAP
FLSP

This alignment has a score of 16 (6+2+1+7) by BLOSUM62, meaning an alignment with this score or more is $2^8$ (256) times more likely to be seen in a real alignment than in a random alignment.

VHRDLKPENLLLASK
VHRDLKPENLLLASK

(4+8+5+6+4+5+7+5+6+4+4+4+4+4+5)

This 15 amino acid alignment has a score of 75, meaning that it is ~$10^{11}$ times more likely to be seen in a real alignment than in a random alignment(!!).
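Both examples can be verified with a few lines of code. This sketch hard-codes only the BLOSUM62 entries that appear in the sums above (it is not the full matrix), and the names `BLOSUM62_SUBSET`, `ungapped_score`, and `odds_ratio` are chosen here for illustration.

```python
# Only the BLOSUM62 entries used in the two example alignments (half-bit units).
BLOSUM62_SUBSET = {
    ("F", "F"): 6, ("I", "L"): 2, ("A", "S"): 1, ("P", "P"): 7,
    ("V", "V"): 4, ("H", "H"): 8, ("R", "R"): 5, ("D", "D"): 6,
    ("L", "L"): 4, ("K", "K"): 5, ("E", "E"): 5, ("N", "N"): 6,
    ("A", "A"): 4, ("S", "S"): 4,
}


def ungapped_score(seq_a, seq_b, matrix):
    """Sum the substitution scores over each aligned column (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment needs equal-length sequences")
    total = 0
    for x, y in zip(seq_a, seq_b):
        # The matrix is symmetric, so look the pair up in either order.
        total += matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]
    return total


def odds_ratio(half_bit_total):
    """Convert a summed half-bit score to a real-vs-random likelihood ratio."""
    return 2 ** (half_bit_total / 2)


short_score = ungapped_score("FIAP", "FLSP", BLOSUM62_SUBSET)
long_score = ungapped_score("VHRDLKPENLLLASK", "VHRDLKPENLLLASK", BLOSUM62_SUBSET)
```

Adding half-bit scores corresponds to multiplying per-column odds, which is why a total of 75 half-bits (37.5 bits) gives a ratio on the order of 10^11.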

SLIDE 17

Randomly Distributed Gaps

if $p_g = k$ (probability of a gap at each position in the sequence), then

\[ P(g=1) = k, \quad P(g=2) = k^2, \quad \ldots, \quad P(g=n) = k^n \]

[note - the slope of the line on a log-linear plot will vary according to the frequency of gaps, but it will always be linear]
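The linearity claim follows from taking logarithms: log P(g=n) = n log k, so each extra gap position changes the log-probability by the same amount. A minimal numerical sketch, following the slide's expressions and using an assumed value k = 0.1 and hypothetical helper names:

```python
import math


def gap_length_probabilities(k, max_length):
    """P(g = n) = k**n for n = 1..max_length, as written on the slide."""
    return [k ** n for n in range(1, max_length + 1)]


def log_steps(probabilities):
    """Differences between successive log10 probabilities (the slope per step)."""
    logs = [math.log10(p) for p in probabilities]
    return [b - a for a, b in zip(logs, logs[1:])]


probs = gap_length_probabilities(0.1, 5)
steps = log_steps(probs)
```

Every step equals log10(k), so a different gap frequency changes the slope but not the straight-line shape.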

SLIDE 18

Distribution of alignment gap lengths in a large set of structurally-aligned proteins

[Figure: log-linear plot of gap length frequency.]

SLIDE 19

Summary

How a score matrix is derived
What the scores mean probabilistically
Why gap penalties should be affine
How to use scores in dynamic programming