Michael Schroeder
Biotechnology Center TU Dresden
BLAST
Contents
Why compare and align sequences?
How to judge an alignment? Z-score, E-value, P-value, structure and function
How to compare and align sequences? Levenshtein
and local alignment, substitution matrix,
Levenshtein Distance with Dynamic Programming:
i \ j      p  e  t  e  r
        0  1  2  3  4  5
  p     1  0  1  2  3  4
  e     2  1  0  1  2  3
  t     3  2  1  0  1  2
  r     4  3  2  1  1  1
  a     5  4  3  2  2  2
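The matrix above can be reproduced with a few lines of dynamic programming. The following is a minimal sketch, assuming unit costs for insertion, deletion and substitution; the function name `levenshtein_matrix` is chosen here for illustration and does not come from the slides.

```python
def levenshtein_matrix(a: str, b: str) -> list[list[int]]:
    """Return the full (len(a)+1) x (len(b)+1) edit-distance matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    # Border: turning a prefix into the empty string costs its length.
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # match or mismatch
            )
    return d


matrix = levenshtein_matrix("petra", "peter")
for row in matrix:
    print(" ".join(str(value) for value in row))
```

The bottom-right entry of the matrix is the Levenshtein distance between the two strings.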
For the two alignments, we only used 8 out of 36 cells. Can we discard the other cells beforehand? How?
Cell legend: not used, maybe used, used
[Figure: empty dynamic programming matrix for two strings of length 5; only the border row and column (1 to 5) are filled in.]
What is the worst possible distance for two strings of size 5? 5 mismatches. This means that all paths with a cost > 5 can be excluded.
All paths through red cells (those with |i - j| >= 3) have a cost > 5. Only 24 out of 36 cells can contribute to results.
Are all alignments useful? Only results with a reasonable edit distance are. For strings of size 5, let's say that is at most 3.
All paths through red cells (those with |i - j| >= 2) have a cost > 3. Only 16 out of 36 cells can contribute to results.
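The cell counts can be checked with a short calculation. For two strings of equal length n, any path through cell (i, j) needs at least |i - j| insertions or deletions to reach the cell and another |i - j| to return to the diagonal, so its cost is at least 2|i - j|. The sketch below counts the cells for which this lower bound does not exceed a threshold k; the function name `cells_within_threshold` is an assumption made for this example.

```python
def cells_within_threshold(n: int, k: int) -> int:
    """Count cells of the (n+1) x (n+1) matrix that can lie on a path of cost <= k."""
    count = 0
    for i in range(n + 1):
        for j in range(n + 1):
            # Lower bound on the cost of any path through (i, j).
            if 2 * abs(i - j) <= k:
                count += 1
    return count


print(cells_within_threshold(5, 5))  # threshold 5
print(cells_within_threshold(5, 3))  # threshold 3
```

For n = 5 this gives 24 cells for a threshold of 5 and 16 cells for a threshold of 3, matching the numbers on the slides.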
Ukkonen E. (1983) On approximate string matching. In: Karpinski M. (eds) Foundations of Computation Theory. FCT 1983. Lecture Notes in Computer Science, vol 158. Springer
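The threshold idea from the preceding slides can be turned into a restricted dynamic program. The following is a simplified sketch in the spirit of that cutoff, not the algorithm from the paper cited above: it fills the whole table for clarity and simply skips every cell whose cheapest possible path already exceeds k. The name `bounded_levenshtein` and the convention of returning `None` when the distance exceeds k are assumptions of this example.

```python
from typing import Optional


def bounded_levenshtein(a: str, b: str, k: int) -> Optional[int]:
    """Return the edit distance of a and b if it is at most k, otherwise None."""
    n, m = len(a), len(b)
    too_far = k + 1  # stands for "more than k"
    d = [[too_far] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            # Indels needed to reach (i, j) plus indels needed to reach (n, m).
            if abs(i - j) + abs((n - i) - (m - j)) > k:
                continue  # every path through this cell costs more than k
            if i == 0 and j == 0:
                d[i][j] = 0
                continue
            best = too_far
            if i > 0:
                best = min(best, d[i - 1][j] + 1)
            if j > 0:
                best = min(best, d[i][j - 1] + 1)
            if i > 0 and j > 0:
                best = min(best, d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
            d[i][j] = best
    return d[n][m] if d[n][m] <= k else None


print(bounded_levenshtein("petra", "peter", 3))
print(bounded_levenshtein("petra", "peter", 1))
```

A production version would store only the cells inside the band instead of the full table; the full table is kept here so that the skipped cells are easy to see.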
(with dynamic programming)
alignments for all sequences in the database?
hits introduce a ➞ threshold
will contain short stretches
extend them to connect them as best possible
string a in a target string b, both of length n), combining it to an alignment of a and b with no more than k mismatches
tuples between a and b (can be done in linear time by inserting them into a hash table)
match by extending it to the left and right until either the first k+1 mismatches are found or the beginning
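The two steps named above, collecting shared tuples through a hash table and extending each match to the left and right, can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure from the slides: the function names `shared_tuples` and `extend_hit` are hypothetical, the tuple length is called p, no gaps are allowed, and each direction is extended independently until its (k+1)-th mismatch or the end of a string. Combining several hits into one alignment with at most k mismatches overall is not attempted here.

```python
from collections import defaultdict


def shared_tuples(a: str, b: str, p: int) -> list[tuple[int, int]]:
    """Return all (i, j) with a[i:i+p] == b[j:j+p], using a hash table of a's tuples."""
    index = defaultdict(list)
    for i in range(len(a) - p + 1):
        index[a[i:i + p]].append(i)
    hits = []
    for j in range(len(b) - p + 1):
        for i in index.get(b[j:j + p], []):
            hits.append((i, j))
    return hits


def extend_hit(a: str, b: str, i: int, j: int, p: int, k: int) -> tuple[int, int, int, int]:
    """Extend the exact match a[i:i+p] == b[j:j+p] without gaps.

    Returns (start in a, start in b, length, mismatches) of the extended segment.
    """
    def run(pos_a: int, pos_b: int, step: int) -> tuple[int, int]:
        length = mismatches = 0
        while 0 <= pos_a < len(a) and 0 <= pos_b < len(b):
            if a[pos_a] != b[pos_b]:
                if mismatches == k:
                    break  # this would be mismatch number k+1
                mismatches += 1
            length += 1
            pos_a += step
            pos_b += step
        return length, mismatches

    left, mismatches_left = run(i - 1, j - 1, -1)
    right, mismatches_right = run(i + p, j + p, +1)
    return i - left, j - left, left + p + right, mismatches_left + mismatches_right


a, b = "GATTACAyy", "GATTTCAww"
hits = shared_tuples(a, b, p=4)
print(hits)
print(extend_hit(a, b, *hits[0], p=4, k=1))
```

Building the index takes one pass over a and looking up the tuples of b takes one pass over b, which is where the linear running time of the tuple search comes from.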
word length p (here: p = 4) no mismatches
grey areas
SWISS_PROT:C79A_HUMAN P11912 Search SWISSPROT for Ig-alpha:
Distribution of Hits:
>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A)
Length = 220
Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)

Query: 1   MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60
           MPGG          + LL  LS   LGPGCQAL +   P SL V+LGE+A   C  N+
Sbjct: 1   MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55

Query: 61  NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120
           N N+TWW  L  N TWPP  LGPG+   G L    VNK+ G    C+V E N   ++SCG
Sbjct: 56  NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114

Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180
           TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D  D
Sbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174

Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226
           +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP
Sbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220
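The summary fields of a hit like the one above can be pulled out with regular expressions, which helps when many hits have to be compared. The sketch below is an illustration only: the function name `parse_hit_stats` and the dictionary keys are assumptions of this example, and it handles just the field layout shown in this report, not every BLAST output format.

```python
import re


def parse_hit_stats(text: str) -> dict:
    """Extract score, E-value and count fields from a BLAST hit summary."""
    stats = {}
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    if score:
        stats["bit_score"] = float(score.group(1))
        stats["raw_score"] = int(score.group(2))
    expect = re.search(r"Expect\s*=\s*([0-9.]+(?:e[+-]?\d+)?)", text, re.IGNORECASE)
    if expect:
        stats["expect"] = float(expect.group(1))
    for name in ("Identities", "Positives", "Gaps"):
        field = re.search(name + r"\s*=\s*(\d+)/(\d+)", text)
        if field:
            # (count, alignment length)
            stats[name.lower()] = (int(field.group(1)), int(field.group(2)))
    return stats


summary = (
    "Score = 278 bits (711), Expect = 5e-75 "
    "Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)"
)
stats = parse_hit_stats(summary)
for name in ("identities", "positives", "gaps"):
    count, length = stats[name]
    print(name, count, length, f"{100 * count // length}%")
```

For this hit, the percentages printed in the report agree with integer truncation of count/length (66%, 73% and 2%).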
Lineage Report
root
. cellular organisms
. . Eukaryota [eukaryotes]
. . . Fungi/Metazoa group [eukaryotes]
. . . . Bilateria [animals]
. . . . . Coelomata [animals]
. . . . . . Gnathostomata [vertebrates]
. . . . . . . Tetrapoda [vertebrates]
. . . . . . . . Amniota [vertebrates]
. . . . . . . . . Eutheria [mammals]
. . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals]
. . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals]
. . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals]
. . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals]
. . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals]
. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals]
. . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds]
. . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds]
. . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians]
. . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays]
. . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies]
. . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes]
. . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes]
. . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts]
. . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria]
. Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses]
. Human adenovirus type 7 ................................... 30 1 hit [viruses]
many species
less than 10% identical residues
B C
better than score from direct comparison
50% Only 10% 50%
local matches
iterations
PSI-BLAST will relate A and C, although they do not share any domain.