
SLIDE 1

Sequence comparison: Score matrices

Genome 559: Introduction to Statistical and Computational Genomics

Prof. James H. Thomas

http://faculty.washington.edu/jht/GS559_2014/

SLIDE 2

FYI - informal inductive proof of best alignment path

Consider the last step in the best alignment path to node α below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node α by three possible paths: an A-B match, a gap in sequence A, or a gap in sequence B.

[Figure: nodes X, Y, and Z lead into node α by one match edge and two gap edges; the axes are seq A and seq B.]

The best-scoring path to α is the maximum of:

X + match
Y + gap
Z + gap

BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED.
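The max-of-three argument above translates directly into a dynamic programming fill. The following is a minimal sketch, not the course's own code: the function name and the match, mismatch, and gap values are illustrative assumptions.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of seq_a and seq_b.

    The match/mismatch/gap values are hypothetical defaults chosen
    only for illustration.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]

    # The first row and column can only be reached through gaps.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Best path to this node = max over its three upstream nodes.
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # X + match (or mismatch)
                F[i - 1][j] + gap,    # Y + gap
                F[i][j - 1] + gap,    # Z + gap
            )
    return F[-1][-1]
```

Because each cell already holds the best score of any path reaching it, the final cell holds the best score overall, which is the inductive claim in the slide.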

SLIDE 3

Local alignment - review

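Score matrix s(x_i, y_j):

      A    C    G    T
A     2   -7   -5   -7
C    -7    2   -7   -5
G    -5   -7    2   -7
T    -7   -5   -7    2

Gap penalty: d = -5

[Figure: partially filled DP matrix for the sequence letters A A G and A G C, with cell scores 2 and 4 shown. Cell F(i, j) is reached from F(i-1, j-1) by adding s(x_i, y_j), or from F(i-1, j) or F(i, j-1) by adding d. No arrow means no preceding alignment.]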

SLIDE 4

Local alignment - review

Two differences from global alignment:

– If a score is negative, replace it with 0.
– Traceback from the highest score in the matrix and continue until you reach 0.

Global alignment algorithm: Needleman-Wunsch.

Local alignment algorithm: Smith-Waterman.
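The two differences from global alignment can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, and the DNA scores (match 2, transition -5, transversion -7, gap d = -5) are assumed to be the values used on the review slide.

```python
PURINES = {"A", "G"}

def dna_score(x, y):
    """Assumed DNA score matrix: match 2, transition -5, transversion -7."""
    if x == y:
        return 2
    same_class = (x in PURINES) == (y in PURINES)
    return -5 if same_class else -7

def local_alignment(seq_x, seq_y, d=-5):
    """Return (best score, aligned piece of seq_x, aligned piece of seq_y)."""
    rows, cols = len(seq_x) + 1, len(seq_y) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Difference 1: negative scores are replaced with 0.
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + dna_score(seq_x[i - 1], seq_y[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: trace back from the highest score until reaching 0.
    i, j = best_pos
    out_x, out_y = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + dna_score(seq_x[i - 1], seq_y[j - 1]):
            out_x.append(seq_x[i - 1]); out_y.append(seq_y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + d:
            out_x.append(seq_x[i - 1]); out_y.append("-")
            i -= 1
        else:
            out_x.append("-"); out_y.append(seq_y[j - 1])
            j -= 1
    return best, "".join(reversed(out_x)), "".join(reversed(out_y))
```

For example, `local_alignment("AAG", "AGC")` returns a best score of 4 with the aligned segment `AG` against `AG`.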

SLIDE 5

Dot plot of two DNA sequences

Overlay of the global DP alignment path

SLIDE 6

Protein score matrices

Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time.

All possible amino acid changes are represented (matrix of size at least 20 x 20).

Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence.

DNA score matrices are simpler (and conceptually similar).

SLIDE 7

BLOSUM62 Score Matrix

[Figure: BLOSUM62 matrix, with the rows and columns for the regular 20 amino acids labeled separately from those for ambiguity codes and stop.]

# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62

SLIDE 8

Amino acid structures

Hydrophobic Polar Charged

phenylalanine F

SLIDE 9

BLOSUM62 Score Matrix

Good scores – chemically similar
Bad scores – chemically dissimilar

SLIDE 10

Amino acid structures

[Figure: chemical structures of the amino acids, grouped by side-chain type.]

Hydrophobic: alanine A, valine V, glycine G, leucine L, isoleucine I, methionine M, proline P, tryptophan W

Polar: threonine T, tyrosine Y, serine S, asparagine N, glutamine Q, cysteine C

Charged: lysine K, arginine R, histidine H, aspartate D, glutamate E

SLIDE 11

Deriving BLOSUM scores

– Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment).
– Measure how often various amino acid pairs occur in the alignments.
– Normalize this to the expected frequency of such pairs randomly in the same set of alignments.
– Derive a log-odds score for aligned vs. random.

SLIDE 12

Example of alignment block (the BLO part of BLOSUM)

31 positions (columns), 61 sequences (rows)

Thousands of such blocks go into computing a single BLOSUM matrix.

Represent full diversity of sequences.

Results are summed over all columns of all blocks.

SLIDE 13

Pair frequency vs. expectation

Actual aligned pair frequency:

$$q_{ij} = \frac{c_{ij}}{T}, \qquad T = \sum_{i \le j} c_{ij}$$

where $c_{ij}$ is the count of $ij$ pairs and $T$ is the total pair count.

Randomly expected pair frequency:

$$e_{aa} = p_a p_a, \qquad e_{ab} = p_a p_b + p_b p_a = 2 p_a p_b$$

where $p_a$ and $p_b$ are the overall probabilities (frequencies) of specific residues $a$ and $b$.

Sample column from an alignment block:

D
E
D
N
D
D

6 D-D pairs
4 D-E pairs
4 D-N pairs
1 E-N pair
etc.

(A multiple alignment of N sequences is the equivalent of all the pairwise alignments, which number (N)(N-1)/2.)

This is called the sum of pairs (the SUM part of BLOSUM).
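The sum-of-pairs count for the sample column can be checked with a short sketch. The function names here are hypothetical and only standard-library calls are used; it illustrates the counting idea and is not the actual BLOSUM construction code.

```python
from collections import Counter
from itertools import combinations

def count_pairs(column):
    """Count every unordered residue pair in one alignment column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        # Sort so that D-E and E-D are counted as the same pair.
        counts[tuple(sorted((a, b)))] += 1
    return counts

def expected_pair_frequency(p, a, b):
    """Randomly expected frequency of an a-b pair given residue frequencies p."""
    return p[a] * p[a] if a == b else 2 * p[a] * p[b]

# The sample column from the slide: 6 sequences give 6*5/2 = 15 pairs.
pairs = count_pairs("DEDNDD")
total = sum(pairs.values())
observed = {pair: n / total for pair, n in pairs.items()}  # q_ij = c_ij / T
```

In a real derivation the counts would be accumulated over all columns of all blocks before dividing by the total.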

SLIDE 14

Log-odds score calculation (so adding scores == multiplying probabilities)

$$s_{ij} = \log_2 \frac{q_{ij}}{e_{ij}}$$

where $q_{ij}$ is the counted pair frequency and $e_{ij}$ is the expected random pair frequency.

For computational speed, scores are often rounded to the nearest integer, and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a "half-bit" score:

$$\text{matrixScore} = \operatorname{round}\left(2 \log_2 \frac{q_{ij}}{e_{ij}}\right)$$

(computers can add integers faster than floats)
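As a small sketch of the two formulas, assuming hypothetical function names and made-up input frequencies chosen only so the arithmetic is easy to follow:

```python
import math

def log_odds_bits(q, e):
    """Log-odds score in bits: log2 of observed over expected frequency."""
    return math.log2(q / e)

def half_bit_score(q, e):
    """Score in half-bit units, rounded to the nearest integer."""
    return round(2 * math.log2(q / e))

# Illustrative values only: a pair seen 4 times as often as expected
# scores 2 bits, which is 4 half-bits.
example_bits = log_odds_bits(0.02, 0.005)
example_half_bits = half_bit_score(0.02, 0.005)
```

Multiplying by 2 before rounding keeps more of the fractional part of each score than rounding the score in whole bits would.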

SLIDE 15

BLOSUM62 matrix (half-bit scores)

C-C score: 9 half-bits = 4.5 bits

Frequency of C residue over all proteins: 0.0162 (you have to look this up)

Reverse calculation of aligned C-C pair frequency in BLOSUM data set:

$$\frac{q_{cc}}{e_{cc}} = 2^{4.5} = 22.63$$

$$e_{cc} = 0.0162 \times 0.0162 = 0.000262$$

thus

$$q_{cc} = 22.63 \times 0.000262 = 0.00594$$

(in words, C-C pairs are 22.6 times as frequent as you would expect)
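The reverse calculation can be reproduced numerically. This is a minimal sketch with a hypothetical function name; it handles only identical-residue pairs such as C-C, where the expected frequency is the residue frequency squared.

```python
def identical_pair_frequency(half_bits, p):
    """Recover (odds ratio, expected, observed) for an identical-residue pair.

    half_bits: matrix score in half-bit units (9 for C-C in BLOSUM62)
    p: background frequency of the residue (0.0162 for C on the slide)
    """
    odds = 2 ** (half_bits / 2)  # undo the log2 and the factor of 2
    expected = p * p             # e_cc = p_c * p_c
    observed = odds * expected   # q_cc = odds * e_cc
    return odds, expected, observed

odds, e_cc, q_cc = identical_pair_frequency(9, 0.0162)
```

Since the matrix score was rounded to an integer, the recovered frequency is only approximate.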

SLIDE 16

Constructing Blocks

1. Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long.
2. Cluster the members of each block according to their percent identity.
3. Make pair counts and a score matrix from a large collection of similarly clustered blocks.
4. Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).

SLIDE 17

Randomly Distributed Gaps

If $p_g = k$ (the probability of a gap at each position in the sequence), then

$$P(g_1) = k,\quad P(g_2) = k^2,\quad \ldots,\quad P(g_n) = k^n$$

where $g_n$ is a gap of length $n$.

[note - the slope of the line in this plot will vary according to the frequency of gaps, but it will always be linear]
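A quick numerical check shows why randomly distributed gaps give a straight line on a log scale. The function name and the value k = 0.1 are arbitrary assumptions for illustration.

```python
import math

def gap_length_probabilities(k, max_len):
    """P(gap of length n) = k**n for n = 1..max_len, per the random-gap model."""
    return [k ** n for n in range(1, max_len + 1)]

probs = gap_length_probabilities(0.1, 5)
log_probs = [math.log10(p) for p in probs]

# Successive differences of log P are all log10(k): a constant slope.
slopes = [b - a for a, b in zip(log_probs, log_probs[1:])]
```

Changing k changes the slope, which equals log10(k), but the points stay on a line.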

SLIDE 18

log-linear plot

Distribution of real alignment gap lengths in a large set of X-ray structure-aligned proteins

Nowhere near linear - hence the use of affine gap penalties (there ideally would be several levels of decreasing affine penalties)
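To make the contrast concrete, here is a hedged sketch of a linear versus an affine gap penalty. The parameter values and function names are hypothetical, and the affine form used (open + (length - 1) * extend) is one common convention among several.

```python
def linear_gap_penalty(length, d=5):
    """Every gap position costs the same: penalty grows as d * length."""
    return d * length

def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Opening a gap is expensive; each additional position is cheap.

    gap_open and gap_extend are illustrative values, not taken from the slides.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend
```

With these assumed values a single five-residue gap costs 14 under the affine scheme but 25 under the linear one, so long gaps are penalized less steeply, in line with the observed excess of long gaps.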

SLIDE 19

What you should know

How a score matrix is derived
What the scores mean probabilistically
Why gap penalties should be affine