BLAST Michael Schroeder Biotechnology Center TU Dresden Contents - - PowerPoint PPT Presentation

blast
SMART_READER_LITE
LIVE PREVIEW

BLAST Michael Schroeder Biotechnology Center TU Dresden Contents - - PowerPoint PPT Presentation

BLAST Michael Schroeder Biotechnology Center TU Dresden Contents Why to compare and align sequences? How to judge an alignment? Z-score, E-value, P-value, structure and function How to compare and align sequences? Levensthein


slide-1
SLIDE 1

Michael Schroeder

Biotechnology Center TU Dresden

BLAST

slide-2
SLIDE 2

Contents

  • Why to compare and align sequences?
  • How to judge an alignment?
  • Z-score, E-value, P-value, structure and function
  • How to compare and align sequences?
  • Levensthein distance, scoring schemes, longest common subsequence, global

and local alignment, substitution matrix,

  • How to compute an alignment?
  • Dynamic programming
  • How to compute an alignment fast?
  • BLAST
  • How to align many sequences
  • Multiple sequence alignment, phylogenetic trees
  • Alignments and structure
  • How to predict protein structure from protein sequence

2

slide-3
SLIDE 3

Motivation

  • Two sequences of length n
  • Dynamic programming:
  • matrix = time proportional to n2
  • Database of m sequences:
  • time proportional to m n2
  • Manual search: How long does it take?
  • 1 cell = 10 sec
  • 1.000.000 sequences
  • Sequence = 100 amino acids

3

slide-4
SLIDE 4

Motivation

  • How can we reduce the database size?
  • How can we reduce the matrix size?

4

slide-5
SLIDE 5

lev(petra, peter) = ?

5

slide-6
SLIDE 6

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

6

i \ j p e t e r 1 2 3 4 5 p 1 1 2 3 4 e 2 1 1 2 3 t 3 2 1 1 2 r 4 3 2 1 1 1 a 5 4 3 2 2 2

slide-7
SLIDE 7

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

7

i \ j p e t e r 1 2 3 4 5 p 1 1 2 3 4 e 2 1 1 2 3 t 3 2 1 1 2 r 4 3 2 1 1 1 a 5 4 3 2 2 2

For the two alignments, we only used 8 out of 36 cells. Can we discard the other cells beforehand? How?

not used maybe used used

slide-8
SLIDE 8

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

8

i \ j 1 2 3 4 5 1 1 2 2 3 3 4 4 5 5

What is the worst possible distance for two strings of size 5? 5 mismatches. This means all paths of length >5 can be excluded

slide-9
SLIDE 9

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

9

i \ j 1 2 3 4 5 1 1 2 2 3 3 4 4 5 5

Paths through red cells have all length >5 Only 24 out of 36 can contribute to results.

not used maybe used used

slide-10
SLIDE 10

Are we solving the right problem?

10

Are all alignments useful? Only results with reasonable edit distance. For size 5 strings, let‘s say that‘s 3.

slide-11
SLIDE 11

Levenshtein Distance with Dynamic Programming:

Are all cells in the matrix equally important?

11

i \ j 1 2 3 4 5 1 1 2 2 3 3 4 4 5 5

Paths through red cells have all length >3 Only 16 out of 36 can contribute to results.

not used maybe used used

slide-12
SLIDE 12

12

Ukkonen E. (1983) On approximate string matching. In: Karpinski M. (eds) Foundations of Computation Theory. FCT 1983. Lecture Notes in Computer Science, vol 158. Springer

slide-13
SLIDE 13

Are we Solving the Right Problem?

13

slide-14
SLIDE 14

Are we Solving the Right Problem?

  • Aligning completely unrelated sequences not needed
  • A reasonable alignment has some perfect matches
  • BLAST Idea:
  • 1. Identify small perfect matches (termed words)
  • 2. Extend perfect matches on the same diagonal
  • 3. Merge extended perfect matches

(with dynamic programming)

14

slide-15
SLIDE 15

BLAST Idea

  • Do we need the

alignments for all sequences in the database?

  • No, only for “reasonable”

hits introduce a ➞ threshold

  • A “reasonable” alignment

will contain short stretches

  • f perfect matches
  • Find these first, then

extend them to connect them as best possible

15

slide-16
SLIDE 16

Perfect Matches (Words) Formally

16

slide-17
SLIDE 17

BLAST: Filtration Algorithm

  • detects all matching words of length p (of a query

string a in a target string b, both of length n), combining it to an alignment of a and b with no more than k mismatches

  • Potential match detection: Find all matches of p-

tuples between a and b (can be done in linear time by inserting them into a hash table)

  • Potential match verification: Verify each potential

match by extending it to the left and right until either the first k+1 mismatches are found or the beginning

  • r end of the sequences are found

17

slide-18
SLIDE 18

BLAST Mechanism Summary

18

word length p (here: p = 4) no mismatches

  • r gaps allowed
  • nly within the

grey areas

slide-19
SLIDE 19

Zipf's Law

19

slide-20
SLIDE 20

Example for BLAST

20

SWISS_PROT:C79A_HUMAN P11912 Search SWISSPROT for Ig-alpha:

slide-21
SLIDE 21

Example for BLAST

21

slide-22
SLIDE 22

Example for BLAST

22

slide-23
SLIDE 23

Example for BLAST

23

Distribution of Hits:

slide-24
SLIDE 24

Example for BLAST: Alignment

>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A) Length = 220 Score = 278 bits (711), Expect = 5e-75 Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%) Query: 1 MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60 MPGG + LL LS LGPGCQAL + P SL V+LGE+A C N+ Sbjct: 1 MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55 Query: 61 NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120 N N+TWW L N TWPP LGPG+ G L VNK+ G C+V E N ++SCG Sbjct: 56 NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114 Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180 TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D D Sbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174 Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226 +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP Sbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220

24

slide-25
SLIDE 25

Example for BLAST: Taxonomy

Lineage Report root . cellular organisms . . Eukaryota [eukaryotes] . . . Fungi/Metazoa group [eukaryotes] . . . . Bilateria [animals] . . . . . Coelomata [animals] . . . . . . Gnathostomata [vertebrates] . . . . . . . Tetrapoda [vertebrates] . . . . . . . . Amniota [vertebrates] . . . . . . . . . Eutheria [mammals] . . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] . . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] . . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] . . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] . . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] . . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] . . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] . . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] . . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] . . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] . . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] . . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] . . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] . . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] . . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] . Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] . Human adenovirus type 7 ................................... 30 1 hit [viruses] 25

slide-26
SLIDE 26

PSI-Blast

  • Globin family (oxygen transport) of proteins occurs in

many species

  • Proteins have same function and structure and
  • But there are pairs of members of the family sharing

less than 10% identical residues

  • A

B C

  • PSI-BLAST idea: score via intermediaries may be

better than score from direct comparison

26

50% Only 10% 50%

slide-27
SLIDE 27

PSI-BLAST

  • PSI-BLAST
  • 1. BLAST
  • 2. Collect top hits
  • 3. Build multiple sequence alignment from significant

local matches

  • 4. Build profile
  • 5. Re-probe database with profile
  • 6. Go back to 2.

27

slide-28
SLIDE 28

PSI-BLAST

  • But beware of PSI-BLAST:
  • False positives propagate and spread through

iterations

  • If protein A consists of domains D and E, and protein B
  • f domains E and F and protein C of domain F, then

PSI-BLAST will relate A and C although they do not share any domain

28