Sequence comparison: Score matrices




slide-1
SLIDE 1

Sequence comparison: Score matrices

Genome 559: Introduction to Statistical and Computational Genomics

  • Prof. James H. Thomas

http://faculty.washington.edu/jht/GS559_2014/

slide-2
SLIDE 2

FYI - informal inductive proof of best alignment path

Consider the last step in the best alignment path to node α. This path must come from one of three neighboring nodes, where X, Y, and Z are the cumulative scores of the best alignments up to those nodes. We can reach node α by three possible paths: an A-B match, a gap in sequence A, or a gap in sequence B.

[Figure: alignment grid with seq A and seq B on the axes; nodes X, Y, and Z lead into node α by a match step and two gap steps.]

The best-scoring path to α is the maximum of:

  • X + match
  • Y + gap
  • Z + gap

But the best paths to X, Y, and Z are analogously the maximum of their three upstream possibilities, and so on. Inductively, QED.
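The inductive argument translates directly into a dynamic-programming fill. The following is a minimal sketch, not code from the course: the function name and the match, mismatch, and gap values are illustrative assumptions.

```python
def best_global_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of seq_a and seq_b.

    The scoring values are illustrative assumptions, not from the slides.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]

    # Border nodes can only be reached by a run of gaps.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            x = F[i - 1][j - 1] + s   # X + match (or mismatch)
            y = F[i - 1][j] + gap     # Y + gap
            z = F[i][j - 1] + gap     # Z + gap
            F[i][j] = max(x, y, z)    # best path into this node
    return F[-1][-1]
```

Because every node stores the best score of any path reaching it, the bottom-right node holds the score of the best full alignment.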

slide-3
SLIDE 3

Local alignment - review

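Score matrix:

        A    C    G    T
   A    2   -7   -5   -7
   C   -7    2   -7   -5
   G   -5   -7    2   -7
   T   -7   -5   -7    2

Gap penalty: d = -5

Each cell $F(i,j)$ can be reached by three moves:

  • from $F(i-1, j-1)$, adding $s(x_i, y_j)$
  • from $F(i-1, j)$, adding $d$
  • from $F(i, j-1)$, adding $d$

[Figure: partially filled DP matrix with labels A, A, G and A, G, C; visible cell scores 2, 4, and 2.]

(no arrow means no preceding alignment)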

slide-4
SLIDE 4

Local alignment - review

  • Two differences from global alignment:
      – If a score is negative, replace with 0.
      – Traceback from the highest score in the matrix and continue until you reach 0.
  • Global alignment algorithm: Needleman-Wunsch.
  • Local alignment algorithm: Smith-Waterman.
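The two differences can be seen in a short sketch of local alignment. This is an illustration under assumptions rather than the course's own code: the nucleotide scores are those read from the review slide (match 2, transition -5, transversion -7, d = -5), and the function names are hypothetical.

```python
PURINES = {"A", "G"}  # the pyrimidines are C and T

def score(x, y):
    """Nucleotide score as read from the review slide (an assumption)."""
    if x == y:
        return 2
    same_class = (x in PURINES) == (y in PURINES)
    return -5 if same_class else -7  # transition vs. transversion

def smith_waterman(seq_a, seq_b, d=-5):
    """Return (best score, aligned part of seq_a, aligned part of seq_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Difference 1: a negative score is replaced with 0.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: trace back from the highest score until reaching 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + d:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Removing the `0` from the `max` and tracing back from the bottom-right cell instead would turn this into the global (Needleman-Wunsch) version.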

slide-5
SLIDE 5

dot plot of two DNA sequences

overlay of the global DP alignment path

slide-6
SLIDE 6

Protein score matrices

  • Quantitatively represent the degree of conservation of typical amino acid residues over evolutionary time.
  • All possible amino acid changes are represented (matrix of size at least 20 x 20).
  • Most commonly used are several different BLOSUM matrices derived for different degrees of evolutionary divergence.
  • DNA score matrices are simpler (and conceptually similar).

slide-7
SLIDE 7

BLOSUM62 Score Matrix

# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62

[Figure: the BLOSUM62 matrix, with annotations marking the regular 20 amino acids and the ambiguity codes and stop.]

slide-8
SLIDE 8

Amino acid structures

Hydrophobic Polar Charged

phenylalanine F

slide-9
SLIDE 9

BLOSUM62 Score Matrix

Good scores – chemically similar
Bad scores – chemically dissimilar

slide-10
SLIDE 10

Amino acid structures

Hydrophobic: alanine (A), valine (V), glycine (G), leucine (L), isoleucine (I), methionine (M), proline (P), tryptophan (W)

Polar: threonine (T), tyrosine (Y), serine (S), asparagine (N), glutamine (Q), cysteine (C)

Charged: lysine (K), arginine (R), histidine (H), aspartate (D), glutamate (E)

[Figure: side-chain structure drawings for each amino acid.]

slide-11
SLIDE 11

Deriving BLOSUM scores

  • Find sets of sequences whose alignment is thought to be correct (this is partly bootstrapped by alignment).
  • Measure how often various amino acid pairs occur in the alignments.
  • Normalize this to the expected frequency of such pairs randomly in the same set of alignments.
  • Derive a log-odds score for aligned vs. random.

slide-12
SLIDE 12

Example of alignment block (the BLO part of BLOSUM)

[Figure: alignment block with 31 positions (columns) and 61 sequences (rows).]

  • Thousands of such blocks go into computing a single BLOSUM matrix.
  • Represent full diversity of sequences.
  • Results are summed over all columns of all blocks.

slide-13
SLIDE 13

Pair frequency vs. expectation

Actual aligned pair frequency:

$$q_{ij} = \frac{1}{T} \sum c_{ij}$$

where $c_{ij}$ is the count of $ij$ pairs and $T$ is the total pair count.

Randomly expected pair frequency:

$$e_{aa} = p_a p_a \qquad e_{ab} = p_a p_b + p_b p_a = 2 p_a p_b$$

where $p_a$ and $p_b$ are the overall probabilities (frequencies) of specific residues $a$ and $b$.

Sample column from an alignment block: D E D N D D

  • 6 D-D pairs
  • 4 D-E pairs
  • 4 D-N pairs
  • 1 E-N pair

etc.

(A multiple alignment of N sequences is the equivalent of all the pairwise alignments, which number (N)(N-1)/2.)

This is called the sum of pairs (the SUM part of BLOSUM).
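The sum-of-pairs count for the sample column can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative, and a real derivation would add up the counts over every column of every block before normalizing.

```python
from collections import Counter
from itertools import combinations

def pair_counts(column):
    """Count every unordered residue pair in one alignment column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts

def pair_frequencies(counts):
    """Divide each pair count by the total pair count T."""
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

counts = pair_counts("DEDNDD")
print(counts)                    # D-D, D-E, D-N and E-N pair counts
print(sum(counts.values()))      # N(N-1)/2 pairs for N = 6 sequences
print(pair_frequencies(counts))  # observed frequency of each pair
```

Sorting each pair makes D-E and E-D the same key, which is what "unordered pair" means here.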

slide-14
SLIDE 14

Log-odds score calculation

$$s_{ij} = \log_2 \frac{q_{ij}}{e_{ij}}$$

(so adding scores == multiplying probabilities)

where $q_{ij}$ is the counted pair frequency and $e_{ij}$ is the expected random pair frequency.

For computational speed, scores are often rounded to the nearest integer (computers can add integers faster than floats). To reduce round-off error, they are often multiplied by 2 (or more) first, giving a "half-bit" score:

$$\text{matrixScore}_{ij}\ (\text{rounded}) = 2 \log_2 \frac{q_{ij}}{e_{ij}}$$
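The half-bit calculation is small enough to sketch directly. The function names below are illustrative assumptions, and Python's built-in `round` stands in for whatever rounding rule was used to build the published matrices.

```python
import math

def expected_frequency(p_a, p_b, same_residue):
    """Randomly expected pair frequency: p_a * p_a for identical
    residues, 2 * p_a * p_b for two different residues."""
    return p_a * p_b if same_residue else 2 * p_a * p_b

def half_bit_score(q, e):
    """Log-odds score in half-bit units, rounded to an integer.

    q is the counted pair frequency, e the expected random frequency.
    round() rounds exact .5 ties to the even integer.
    """
    return round(2 * math.log2(q / e))

# A pair seen exactly as often as chance predicts scores 0;
# one seen four times as often scores 2 bits = 4 half-bits.
print(half_bit_score(0.01, 0.01))
print(half_bit_score(0.04, 0.01))
```

Positive scores mark pairs that occur more often than chance, negative scores pairs that occur less often.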

slide-15
SLIDE 15

BLOSUM62 matrix (half-bit scores)

[Figure: BLOSUM62 matrix with the C-C entry highlighted: 9 half-bits = 4.5 bits.]

Frequency of C residue over all proteins: 0.0162 (you have to look this up)

Reverse calculation of aligned C-C pair frequency in BLOSUM data set:

$$\frac{q_{cc}}{e_{cc}} = 2^{4.5} = 22.63$$

(in words, C-C pairs are 22.6 times as frequent as you would expect)

$$e_{cc} = 0.0162 \times 0.0162 = 0.000262$$

thus

$$q_{cc} = 22.63 \times 0.000262 = 0.00594$$
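The same reverse calculation can be written as a small function. This is a sketch: the function name is hypothetical, and because matrix scores are rounded, the recovered frequency is only approximate.

```python
def aligned_pair_frequency(half_bits, p_a, p_b, same_residue):
    """Recover the aligned pair frequency q from a half-bit score.

    half_bits is the matrix entry; p_a and p_b are the background
    residue frequencies (looked up separately).
    """
    odds_ratio = 2 ** (half_bits / 2)  # half-bits -> bits -> ratio q/e
    expected = p_a * p_b if same_residue else 2 * p_a * p_b
    return odds_ratio * expected

# C-C entry: 9 half-bits, with the C frequency of 0.0162 from the slide.
q_cc = aligned_pair_frequency(9, 0.0162, 0.0162, same_residue=True)
print(round(q_cc, 5))
```

The `same_residue` flag matters because the expected frequency of a mixed pair counts both orders, a-b and b-a.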

slide-16
SLIDE 16

Constructing Blocks

  • Blocks are ungapped alignments of multiple sequences, usually 20 to 100 amino acids long.
  • Cluster the members of each block according to their percent identity.
  • Make pair counts and score matrix from a large collection of similarly clustered blocks.
  • Each BLOSUM matrix is named for the percent identity cutoff used in the clustering step (e.g. BLOSUM70 for 70% identity).
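The clustering step can be illustrated with a short sketch. The slide does not say how clusters are formed, so the single-linkage rule used here (a sequence joins a cluster if it meets the cutoff against any member) is an assumption, as are the function names.

```python
def percent_identity(a, b):
    """Percent of positions that match between two ungapped block rows."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def cluster_block(rows, cutoff):
    """Group block rows by single linkage at the given percent identity.

    Assumed rule: a row joins (and merges) every cluster that contains
    at least one member with identity >= cutoff.
    """
    clusters = []
    for row in rows:
        linked, rest = [], []
        for cluster in clusters:
            hit = any(percent_identity(row, m) >= cutoff for m in cluster)
            (linked if hit else rest).append(cluster)
        merged = [row] + [m for cluster in linked for m in cluster]
        clusters = rest + [merged]
    return clusters

block = ["AAAA", "AAAT", "CCCC", "AATT"]
print(cluster_block(block, cutoff=75))
```

A higher cutoff leaves more, smaller clusters, so more closely related sequences still contribute separate pair counts.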

slide-17
SLIDE 17

Randomly Distributed Gaps

If $p_g = k$ (probability of a gap at each position in the sequence), then

$$P(g_1) = k,\quad P(g_2) = k^2,\quad \ldots,\quad P(g_n) = k^n$$

[note - the slope of the line in this plot will vary according to the frequency of gaps, but it will always be linear]
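Why the line is always straight follows from taking logarithms. The short derivation below assumes the plot shows log probability against gap length, and writes $P(g_n)$ for the probability of a gap of length $n$ when each position is gapped independently with probability $k$.

```latex
\begin{aligned}
P(g_n) &= k^n \\
\log P(g_n) &= n \log k
\end{aligned}
```

The log probability is linear in $n$ with slope $\log k$, which is negative because $k < 1$. Changing the gap frequency $k$ changes only the slope.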

slide-18
SLIDE 18

log-linear plot

Distribution of real alignment gap lengths in a large set of X-ray structure-aligned proteins

Nowhere near linear - hence the use of affine gap penalties (there ideally would be several levels of decreasing affine penalties)
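The contrast between a linear and an affine gap penalty can be sketched as follows. The penalty values are illustrative assumptions, and so is the convention that the opening cost covers the first gap position; other conventions exist.

```python
def linear_gap_penalty(length, per_position=-5):
    """Every gap position costs the same (matches randomly placed gaps)."""
    return per_position * length

def affine_gap_penalty(length, gap_open=-11, gap_extend=-1):
    """Opening a gap is expensive; extending it is cheap.

    Assumed convention: gap_open pays for the first position and
    gap_extend for each additional one. Values are hypothetical.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

for n in (1, 2, 5, 10):
    print(n, linear_gap_penalty(n), affine_gap_penalty(n))
```

With these values a single 10-residue gap costs -20 under the affine scheme but -50 under the linear one, which better reflects how often long gaps occur in real alignments.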

slide-19
SLIDE 19

What you should know

  • How a score matrix is derived
  • What the scores mean probabilistically
  • Why gap penalties should be affine