Michael Schroeder
Biotechnology Center TU Dresden
Substitution Matrices Michael Schroeder Biotechnology Center TU - - PowerPoint PPT Presentation
Substitution Matrices Michael Schroeder Biotechnology Center TU Dresden Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align sequences?
Biotechnology Center TU Dresden
§ Why to compare and align sequences? § How to judge an alignment?
§ Z-score, E-value, P-value, structure and function
§ How to compare and align sequences?
§ Levensthein distance, scoring schemes, longest common subsequence, global and local alignment, substitution matrix,
§ How to compute an alignment?
§ Dynamic programming
§ How to compute an alignment fast?
§ Blast
§ How to align many sequences
§ Multiple sequence alignment, phylogenetic trees
§ Alignments and structure
§ How to predict protein structure from protein sequence
2
§ Consider the alignments:
TTATACGTAA TTA-CC-AAA vs. TTATACGTAA TTAC-CA-AA
§ with sm = 1, sr = sg = -1. § Are these alignments equally good?
2
C
2
G
2
T
2 A C G T A
3
Hydrophobic A G P I L V C W M F Acidic D E Polar S T N Q Y H Aromatic K R Basic
4
Blosum62, Henikoff, 1992
5
§ BLOSUM: Blocks Substitution matrix, Henikoff 1992 § BLOSUM62: Maximal sequence identity is 62% § 3000 Blocks of 800 families § > 2300 occurrences of each substitution § General problem: How to choose representative data?
6
§ How to define a score for mutation x-y:
§ Many substitutions x-y: high score x-y § Many occurrences of x: low score for x-y § Many occurrences of y: low score for x-y
§ Log odds ratio:
§ P(x,y) is probability for a substitution x ➞ y § P(x,y) = P(y,x) § P(x) is the probability for finding x
§ Where else could log odds ratios be useful?
log2 P(x, y) P(x)P(y)
7
ACGCTAFKI ACGCTAFKL ASGCTAFKL ACACTAFKL GCGCTAFKI GCGCTGFKI GCGCTLFKI
How many A ➞ G?
8
§ How many A ➞ G?
§ BLOSUM:
§ 4*3+1*6+5*1 = 23
§ PAM: 3 ACGCTAFKI ACGCTAFKL ASGCTAFKL ACACTAFKL GCGCTAFKI GCGCTGFKI GCGCTLFKI
9
ACGCTAFKI A➞G I➞L GCGCTAFKI ACGCTAFKL A➞G A➞L C➞S G➞A GCGCTGFKI GCGCTLFKI ASGCTAFKL ACACTAFKL ACGCTAFKI ACGCTAFKL ASGCTAFKL ACACTAFKL GCGCTAFKI GCGCTGFKI GCGCTLFKI
10
11
§ But: Revised BLOSUM does worse!
Red +1 of Blosum, green -1 12
13
14
Global alignment of string a to b
Let a = a1 . . . am and b = b1 . . . bn be strings. Then needlea,b = needlea,b(m, n) is the global alignment score of a and b, where needlea,b(i, j) = 8 > > > > > > < > > > > > > : isg if j = 0, jsg if i = 0, max 8 > < > : needlea,b(i 1, j) + sg needlea,b(i, j 1) + sg needlea,b(i 1, j 1) + sm(ai=bj)
for 0 i m, 0 j n, gap penalty sg < 0, mismatch penalty sr < 0, match reward sm > 0, and sm(ai=bj) := ( sm if (ai = bj), sr if (ai 6= bj)
15
Global alignment of string a to b Let a = a1 . . . am and b = b1 . . . bn be strings. Then needlea,b = needlea,b(m, n) is the global alignment score of a and b with substitution matrix, where needlea,b(i, j) = isg if j = 0, jsg if i = 0, max needlea,b(i − 1, j) + sg needlea,b(i, j − 1) + sg needlea,b(i − 1, j − 1) + ds(ai, bj)
for 0 ≤ i ≤ m and 0 ≤ j ≤ n, substitution matrix ds(ai, bj), and gap penalty sg < 0.
needle(a,b): let d be a matrix of size m+1 × n+1 for 0 ≤ i ≤ m: d[i,0] = i * sg for 1 ≤ j ≤ n: d[0,j] = j * sg for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: if ai == bj: s = sm else: s = sr d[i,j] = max(d[i-1,j ] + sg, d[i ,j-1] + sg, d[i-1,j-1] + s) return d[m,n]
16
needle(a,b,ds): let d be a matrix of size m+1 × n+1 for 0 ≤ i ≤ m: d[i,0] = i * sg for 1 ≤ j ≤ n: d[0,j] = j * sg for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: d[i,j] = max(d[i-1,j ] + sg, d[i ,j-1] + sg, d[i-1,j-1] + ds[ai,bj]) return d[m,n]
17
needle(a,b,ds): let d be the distance matrix of size m+1 × n+1 let db be the backtracking matrix of size m+1 × n+1 d[0,0], db[0,0] = 0, ∅ for 1 ≤ i ≤ m: d[i,0] = i * sg db[i,0] = {'N'} for 1 ≤ j ≤ n: d[0,j] = j * sg db[0,j] = {'W'} for 1 ≤ i ≤ m: for 1 ≤ j ≤ n: d[i,j], db[i,j] = max_dir(d[i-1,j ] + sg, d[i ,j-1] + sg, d[i-1,j-1] + ds[ai,bj]) return d[m,n], db
Legend: red=new code
18
max_dir(dN,dW,dNW): dir = {} if dN == max(dN,dW,dNW): dir.add('N') if dW == max(dN,dW,dNW): dir.add('W') if dNW == max(dN,dW,dNW): dir.add('NW') return max(dN,dW,dNW), dir
19
print-needle(a,b,db): return print-needle(a,b,m,n,db) print-needle(a,b,i,j,db): if 'NW' in db[i,j]: print-needle(a,i-1,j-1,db) print ai, bj else if 'W' in db[i,j]: print-needle(a,i,j-1,db) print '-', bj else if 'N' in db[i,j]: print-needle(a,i-1,j,db) print ai, '-'
20
i \ j p e t e r p e d r
22
§ BLOSUM (and PAM) § Log odds ratio § Role of evolutionary model § How to assess substitution matrices
23