Substitution Matrices Michael Schroeder Biotechnology Center TU - - PowerPoint PPT Presentation

substitution matrices
SMART_READER_LITE
LIVE PREVIEW

Substitution Matrices Michael Schroeder Biotechnology Center TU - - PowerPoint PPT Presentation

Substitution Matrices Michael Schroeder Biotechnology Center TU Dresden Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align sequences?


slide-1
SLIDE 1

Michael Schroeder

Biotechnology Center TU Dresden

Substitution Matrices

slide-2
SLIDE 2

Contents

§ Why to compare and align sequences? § How to judge an alignment?

§ Z-score, E-value, P-value, structure and function

§ How to compare and align sequences?

§ Levensthein distance, scoring schemes, longest common subsequence, global and local alignment, substitution matrix,

§ How to compute an alignment?

§ Dynamic programming

§ How to compute an alignment fast?

§ Blast

§ How to align many sequences

§ Multiple sequence alignment, phylogenetic trees

§ Alignments and structure

§ How to predict protein structure from protein sequence

2

slide-3
SLIDE 3

Scoring Scheme

§ Consider the alignments:

TTATACGTAA TTA-CC-AAA vs. TTATACGTAA TTAC-CA-AA

§ with sm = 1, sr = sg = -1. § Are these alignments equally good?

2

  • 2
  • 2

C

  • 2

2

  • 2

G

  • 2

2

  • 2

T

  • 2
  • 2

2 A C G T A

3

slide-4
SLIDE 4

Properties of Amino Acids

Hydrophobic A G P I L V C W M F Acidic D E Polar S T N Q Y H Aromatic K R Basic

4

slide-5
SLIDE 5

BLOSUM: Blocks Substitution Matrix

Blosum62, Henikoff, 1992

5

slide-6
SLIDE 6

How to choose data?

§ BLOSUM: Blocks Substitution matrix, Henikoff 1992 § BLOSUM62: Maximal sequence identity is 62% § 3000 Blocks of 800 families § > 2300 occurrences of each substitution § General problem: How to choose representative data?

6

slide-7
SLIDE 7

Log Odds Ratio

§ How to define a score for mutation x-y:

§ Many substitutions x-y: high score x-y § Many occurrences of x: low score for x-y § Many occurrences of y: low score for x-y

§ Log odds ratio:

§ P(x,y) is probability for a substitution x ➞ y § P(x,y) = P(y,x) § P(x) is the probability for finding x

§ Where else could log odds ratios be useful?

log2 P(x, y) P(x)P(y)

7

slide-8
SLIDE 8

How to Count Substitutions?

ACGCTAFKI ACGCTAFKL ASGCTAFKL ACACTAFKL GCGCTAFKI GCGCTGFKI GCGCTLFKI

How many A ➞ G?

8

slide-9
SLIDE 9

Substitutions with and without Evolutionary Model

§ How many A ➞ G?

§ BLOSUM:

§ 4*3+1*6+5*1 = 23

§ PAM: 3 ACGCTAFKI ACGCTAFKL ASGCTAFKL ACACTAFKL GCGCTAFKI GCGCTGFKI GCGCTLFKI

9

slide-10
SLIDE 10

PAM: Evolutionary Model

ACGCTAFKI A➞G I➞L GCGCTAFKI ACGCTAFKL A➞G A➞L C➞S G➞A GCGCTGFKI GCGCTLFKI ASGCTAFKL ACACTAFKL ACGCTAFKI ACGCTAFKL ASGCTAFKL ACACTAFKL GCGCTAFKI GCGCTGFKI GCGCTLFKI

10

slide-11
SLIDE 11

How to Measure Quality of Substitution Matrices?

11

slide-12
SLIDE 12

Styczynski 2008: Error in BLOSUM62!

§ But: Revised BLOSUM does worse!

Red +1 of Blosum, green -1 12

slide-13
SLIDE 13

Do Substitution Matrices Reflect Physicochemical Properties?

13

slide-14
SLIDE 14

Needleman-Wunsch Algorithm

14

Global alignment of string a to b

Let a = a1 . . . am and b = b1 . . . bn be strings. Then needlea,b = needlea,b(m, n) is the global alignment score of a and b, where needlea,b(i, j) = 8 > > > > > > < > > > > > > : isg if j = 0, jsg if i = 0, max 8 > < > : needlea,b(i 1, j) + sg needlea,b(i, j 1) + sg needlea,b(i 1, j 1) + sm(ai=bj)

  • therwise,

for 0  i  m, 0  j  n, gap penalty sg < 0, mismatch penalty sr < 0, match reward sm > 0, and sm(ai=bj) := ( sm if (ai = bj), sr if (ai 6= bj)

slide-15
SLIDE 15

Needleman-Wunsch Algorithm with Substitution Matrix

15

Global alignment of string a to b Let a = a1 . . . am and b = b1 . . . bn be strings. Then needlea,b = needlea,b(m, n) is the global alignment score of a and b with substitution matrix, where needlea,b(i, j) =                isg if j = 0, jsg if i = 0, max      needlea,b(i − 1, j) + sg needlea,b(i, j − 1) + sg needlea,b(i − 1, j − 1) + ds(ai, bj)

  • therwise,

for 0 ≤ i ≤ m and 0 ≤ j ≤ n, substitution matrix ds(ai, bj), and gap penalty sg < 0.

slide-16
SLIDE 16

Global Alignment with Dynamic Programming

needle(a,b): let d be a matrix of size m+1 × n+1 for 0 ≤ i ≤ m: d[i,0] = i * sg for 1 ≤ j ≤ n: d[0,j] = j * sg for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: if ai == bj: s = sm else: s = sr d[i,j] = max(d[i-1,j ] + sg, d[i ,j-1] + sg, d[i-1,j-1] + s) return d[m,n]

16

slide-17
SLIDE 17

Global Alignment with Substitution Matrix and Dynamic Programming

needle(a,b,ds): let d be a matrix of size m+1 × n+1 for 0 ≤ i ≤ m: d[i,0] = i * sg for 1 ≤ j ≤ n: d[0,j] = j * sg for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: d[i,j] = max(d[i-1,j ] + sg, d[i ,j-1] + sg, d[i-1,j-1] + ds[ai,bj]) return d[m,n]

17

slide-18
SLIDE 18

Global Alignment with SM and DP: Mark the Path

needle(a,b,ds): let d be the distance matrix of size m+1 × n+1 let db be the backtracking matrix of size m+1 × n+1 d[0,0], db[0,0] = 0, ∅ for 1 ≤ i ≤ m: d[i,0] = i * sg db[i,0] = {'N'} for 1 ≤ j ≤ n: d[0,j] = j * sg db[0,j] = {'W'} for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: d[i,j], db[i,j] = max_dir(d[i-1,j ] + sg,
 d[i ,j-1] + sg, d[i-1,j-1] + ds[ai,bj]) return d[m,n], db

Legend: red=new code

18

slide-19
SLIDE 19

Global Alignment with DP: Determine Directions

max_dir(dN,dW,dNW): dir = {} if dN == max(dN,dW,dNW): dir.add('N') if dW == max(dN,dW,dNW): dir.add('W') if dNW == max(dN,dW,dNW): dir.add('NW') return max(dN,dW,dNW), dir

19

slide-20
SLIDE 20

Global Alignment with DP: Printing the Alignment

print-needle(a,b,db): return print-needle(a,b,m,n,db) print-needle(a,b,i,j,db): if 'NW' in db[i,j]: print-needle(a,i-1,j-1,db) print ai, bj else if 'W' in db[i,j]: print-needle(a,i,j-1,db) print '-', bj else if 'N' in db[i,j]: print-needle(a,i-1,j,db) print ai, '-'

20

slide-21
SLIDE 21

Global Alignment with Dynamic Programming

i \ j p e t e r p e d r

  • 21
slide-22
SLIDE 22

Build a Substitution Matrix

22

slide-23
SLIDE 23

Summary

§ BLOSUM (and PAM) § Log odds ratio § Role of evolutionary model § How to assess substitution matrices

23