Sequence comparison: Score matrices

SLIDE 1

Sequence comparison: Score matrices

Genome 559: Introduction to Statistical and Computational Genomics

  • Prof. James H. Thomas
SLIDE 2

Informal inductive proof of best alignment path

Consider the last step in the best alignment path to node a below. This path must come from one of the three nodes shown, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node a by three possible paths: an A-B match, a gap in sequence A, or a gap in sequence B.

[Figure: nodes X, Y, and Z lead into node a by a match step, a gap step, and a gap step; seq A and seq B label the two axes]

The best-scoring path to a is the maximum of:

X + match
Y + gap
Z + gap

BUT the best paths to X, Y, and Z are analogously the max of their three upstream possibilities, etc. Inductively QED.
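
The max-of-three rule above can be sketched as a table-filling loop. This is a minimal illustration, not the course's own code: the function name and the default match, mismatch, and gap values are assumptions chosen only for the example.

```python
def best_global_score(seq_a, seq_b, match=2, mismatch=-7, gap=-5):
    """Return the best global alignment score (score values are illustrative assumptions)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]

    # Border cells can only be reached by a run of gaps.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Node a is the maximum of X + match, Y + gap, Z + gap.
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F[-1][-1]


print(best_global_score("AAG", "AG"))
```

Because every cell already holds the best score of its own upstream paths, each cell is computed once, which is the inductive argument in code form.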

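SLIDE 3

Local alignment

Score matrix:

      A    C    G    T
 A    2   -7   -5   -7
 C   -7    2   -7   -5
 G   -5   -7    2   -7
 T   -7   -5   -7    2

d = -5

[Figure: dynamic programming cell F(i, j), reached from F(i-1, j-1) with score s(x_i, y_j), or from F(i-1, j) or F(i, j-1) with gap penalty d; a partially filled local alignment matrix with sequence letters A A G and A G C shows cell scores of 2 and 4]

(no arrow means no preceding alignment)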

SLIDE 4

Local alignment

  • Two differences from global alignment:
    – If a score is negative, replace it with 0.
    – Traceback starts from the highest score in the matrix and continues until you reach 0.
  • Global alignment algorithm: Needleman-Wunsch.
  • Local alignment algorithm: Smith-Waterman.
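
The two differences listed above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the scoring (match 2, transition -5, transversion -7, gap d = -5) is an assumption modeled on the values shown on the previous Local alignment slide.

```python
def score(x, y):
    """Assumed DNA scores: match 2, transition (A/G, C/T) -5, transversion -7."""
    if x == y:
        return 2
    transitions = {frozenset("AG"), frozenset("CT")}
    return -5 if frozenset((x, y)) in transitions else -7


def smith_waterman(seq_a, seq_b, d=-5):
    """Return (best score, aligned part of seq_a, aligned part of seq_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            # Difference 1: a negative score is replaced with 0.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: trace back from the highest score until a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            top.append(seq_a[i - 1])
            bottom.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + d:
            top.append(seq_a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("AAG", "AGC"))
```

When several cells tie for the highest score, this sketch keeps the first one found; a fuller implementation might report all of them.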

SLIDE 5

Protein score matrices

  • DNA score matrices are much simpler (and are conceptually similar).
  • Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time.
  • All possible amino acid changes are represented (matrix of size at least 20 x 20).
  • Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence.

SLIDE 6

BLOSUM62 Score Matrix

[Figure: BLOSUM62 matrix; the row and column labels cover the regular 20 amino acids, followed by ambiguity codes and stop]

# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62

SLIDE 7

Amino acid structures

Hydrophobic Polar Charged

phenylalanine F

SLIDE 8

BLOSUM62 Score Matrix

Good scores – chemically similar
Bad scores – chemically dissimilar

SLIDE 9

Amino acid structures

[Figure: side-chain structures of the amino acids, grouped by chemical class]

Hydrophobic: alanine A, valine V, glycine G, leucine L, isoleucine I, methionine M, proline P, tryptophan W
Polar: threonine T, tyrosine Y, serine S, asparagine N, glutamine Q, cysteine C
Charged: lysine K, arginine R, histidine H, aspartate D, glutamate E

SLIDE 10

Deriving BLOSUM scores

  • Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment).
  • Measure how often various amino acid pairs occur in the alignments.
  • Normalize this to the expected frequency of such pairs randomly in the same set of alignments.
  • Derive a log-odds score (often in half bits).

SLIDE 11

Example of alignment block

[Figure: alignment block of 31 amino acids (columns) by 61 sequences (rows)]

  • Thousands of such blocks go into computing a single BLOSUM matrix.
  • Represent full diversity of sequences.
  • Results are summed over all columns of all blocks.

SLIDE 12

Pair frequency vs. expectation

Sample column from a multiple alignment: D E D N D D

6 D-D pairs, 4 D-E pairs, 4 D-N pairs, 1 E-N pair

A multiple alignment of N sequences is the equivalent of all the pairwise alignments, which number (N)(N-1)/2.

Actual aligned pair frequency:

$$q_{ij} = \frac{c_{ij}}{T}$$

where $c_{ij}$ is the count of $ij$ pairs and $T$ is the total pair count.

Randomly expected pair frequency:

$$e_{aa} = p_a p_a, \qquad e_{ab} = p_a p_b + p_b p_a = 2 p_a p_b, \quad \text{etc.}$$

where $p_a$ and $p_b$ are the overall probabilities (frequencies) of specific residues $a$ and $b$.
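
The pair counts for the sample column can be checked with a short sketch. The helper names are hypothetical, and taking the residue frequencies from this single column is an assumption made only to keep the example small; the slide defines them over the whole data set.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column):
    """Count unordered residue pairs over all pairwise comparisons in one column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        counts[tuple(sorted((x, y)))] += 1
    return counts


column = "DEDNDD"
counts = column_pair_counts(column)
total = sum(counts.values())            # N(N-1)/2 pairs for N sequences

# Actual aligned pair frequency: pair count divided by total pair count.
q = {pair: c / total for pair, c in counts.items()}

# Residue frequencies, taken here from the column itself (illustrative assumption).
p = {res: n / len(column) for res, n in Counter(column).items()}


def expected(a, b):
    """Randomly expected pair frequency: p_a * p_a, or 2 * p_a * p_b when a != b."""
    return p[a] * p[a] if a == b else 2 * p[a] * p[b]


print(dict(counts), total)
print(q[("D", "D")], expected("D", "D"))
```

In a real derivation the counts would be summed over every column of every block before dividing by the total.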

SLIDE 13

Log-odds score calculation

$$s_{ij} = \log_2 \frac{q_{ij}}{e_{ij}}$$

(so adding scores == multiplying probabilities)

For computational speed, scores are often rounded to the nearest integer, and (to reduce round-off error) they are often multiplied by 2 (or more) first, giving a "half-bit" score:

$$\text{matrixScore} = 2 \log_2 \frac{q_{ij}}{e_{ij}} \quad \text{(rounded)}$$
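
A minimal sketch of the half-bit calculation follows. The function name and the input frequencies are made up for illustration and do not come from any real data set.

```python
import math


def half_bit_score(q, e):
    """Rounded half-bit log-odds score: round(2 * log2(q / e))."""
    return round(2 * math.log2(q / e))


# Hypothetical frequencies chosen only to show the behaviour of the formula.
print(half_bit_score(0.02, 0.01))    # pair seen twice as often as expected
print(half_bit_score(0.01, 0.01))    # pair seen exactly as often as expected
print(half_bit_score(0.005, 0.01))   # pair seen half as often as expected

# Adding log-odds scores corresponds to multiplying the odds ratios.
assert math.isclose(math.log2(4.0) + math.log2(8.0), math.log2(4.0 * 8.0))
```

Pairs that occur more often than chance get positive scores, pairs that occur less often get negative scores, and a pair at exactly its chance frequency scores 0.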

SLIDE 14

BLOSUM62 matrix (half-bit scores)

[Figure: BLOSUM62 matrix with the C-C entry highlighted]

C-C score: 9 (9 half-bits = 4.5 bits)

Frequency of C residue over all proteins: 0.0162 (you have to look this up)

Reverse calculation of aligned C-C pair frequency in BLOSUM data set:

$$e_{cc} = 0.0162 \times 0.0162 = 0.000262$$

$$\frac{q_{cc}}{e_{cc}} = 2^{4.5} = 22.63$$

thus

$$q_{cc} = 22.63 \times 0.000262 = 0.00594$$
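
The reverse calculation can be reproduced in a few lines. The function name is hypothetical; the inputs are the C-C score of 9 half-bits and the residue frequency 0.0162 quoted on this slide.

```python
def aligned_pair_frequency(half_bits, p_a, p_b=None):
    """Invert the half-bit score: q = e * 2 ** (half_bits / 2)."""
    # Expected frequency is p_a * p_a for identical residues, 2 * p_a * p_b otherwise.
    e = p_a * p_a if p_b is None else 2 * p_a * p_b
    return e * 2 ** (half_bits / 2)


q_cc = aligned_pair_frequency(9, 0.0162)
print(round(q_cc, 5))
```

The result is only approximate, because the matrix entry was rounded to an integer before being inverted.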

SLIDE 15

Constructing Blocks

  • Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long.
  • Cluster the members of each block according to their percent identity.
  • Make pair counts and score matrix from a large collection of similarly clustered blocks.
  • Each BLOSUM matrix is named for the percent identity cutoff in step 2 (e.g. BLOSUM70 for 70% identity).
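
The clustering step might look roughly like the sketch below. The slides do not say how clusters are formed, so the single-linkage rule used here (a sequence joins any cluster containing a member at or above the cutoff) is an assumption, and the function names and toy block are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length ungapped sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(block, cutoff):
    """Group block members by single linkage at the given percent identity cutoff."""
    clusters = []
    for seq in block:
        # Find every existing cluster that has a member similar enough to seq.
        linked = [c for c in clusters
                  if any(percent_identity(seq, member) >= cutoff for member in c)]
        merged = [seq]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


toy_block = ["ACDE", "ACDF", "WYWY"]   # made-up sequences, not real block data
print(cluster_block(toy_block, 62))
print(cluster_block(toy_block, 80))
```

A lower cutoff merges more sequences into each cluster, so the resulting pair counts reflect comparisons between more divergent sequences.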

SLIDE 16

Probabilistic Interpretation of Scores (ungapped)

  • By converting scores back to probabilities, we can give a probabilistic interpretation to an alignment score.

$$\text{matrixScore} = 2 \log_2 \frac{q_{ij}}{e_{ij}} \quad \text{(rounded)}$$

FIAP
FLSP

  • This alignment has a score of 16 (6+2+1+7) by BLOSUM62, meaning an alignment with this score or more is 2^8 (256) times more likely to be seen in a real alignment than in a random alignment.

VHRDLKPENLLLASK
VHRDLKPENLLLASK

(4+8+5+6+4+5+7+5+6+4+4+4+4+4+5) (BLOSUM62)

  • This 15 amino acid alignment has a score of 75, meaning that it is ~10^11 times more likely to be seen in a real alignment than in a random alignment(!!).
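
The conversion from a summed half-bit score back to an odds ratio can be sketched as below. The per-column scores are the ones quoted on this slide; the function name is hypothetical.

```python
def odds_ratio(total_half_bits):
    """Real-vs-random likelihood ratio implied by a summed half-bit score."""
    return 2 ** (total_half_bits / 2)


# Per-column BLOSUM62 scores as listed on the slide.
short_scores = [6, 2, 1, 7]                                   # FIAP vs FLSP
long_scores = [4, 8, 5, 6, 4, 5, 7, 5, 6, 4, 4, 4, 4, 4, 5]   # VHRDLKPENLLLASK vs itself

print(sum(short_scores), odds_ratio(sum(short_scores)))
print(sum(long_scores), f"{odds_ratio(sum(long_scores)):.2e}")
```

Dividing the score by 2 converts half-bits to bits, and each bit doubles the odds.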

SLIDE 17

Randomly Distributed Gaps

if $p_g = k$ (probability of a gap at each position in the sequence), then

$$P(g = 1) = k, \quad P(g = 2) = k^2, \quad \ldots, \quad P(g = n) = k^n$$

[note - the slope of the line on a log-linear plot will vary according to the frequency of gaps, but it will always be linear]
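
The linearity claim can be checked numerically. In this sketch the per-position gap probability k = 0.1 is an arbitrary assumption and the function name is hypothetical.

```python
import math


def gap_length_probability(n, k):
    """Probability of a gap of length n when each position gaps independently with probability k."""
    return k ** n


k = 0.1  # assumed per-position gap probability, for illustration only
log_probs = [math.log10(gap_length_probability(n, k)) for n in range(1, 6)]

# On a log-linear plot the step between consecutive lengths is constant: log10(k).
slopes = [b - a for a, b in zip(log_probs, log_probs[1:])]
print(slopes)
```

Changing k changes the slope but not the straight-line shape, which is the point made in the note above.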

SLIDE 18

[Figure: log-linear plot]

Distribution of alignment gap lengths in a large set of structurally aligned proteins

SLIDE 19

Summary

  • How a score matrix is derived
  • What the scores mean probabilistically
  • Why gap penalties should be affine
  • How to use scores in dynamic programming