
CSE182-L7 Dictionary matching Pattern matching October 09 CSE182


  1. CSE182-L7 Dictionary matching Pattern matching October 09 CSE182

  2. Dictionary Matching • Dictionary: 1:POTATO, 2:POTASSIUM, 3:TASTE • Database: P O T A S T P O T A T O • Q: Given k words (word s_i has length l_i) and a database of size n, find all matches to these words in the database string. • How fast can this be done? Fa05 CSE 182

  3. Dict. Matching & string matching • How fast can you do it if you only had one word of length m? – Trivial algorithm: O(nm) time – Pre-processing O(m), search O(n) time. • Dictionary matching – Trivial algorithm: O((l_1 + l_2 + l_3 + …) n) – Using a keyword tree: O(l_p n), where l_p is the length of the longest pattern – Aho-Corasick: O(n) after preprocessing O(l_1 + l_2 + …) • We will consider the most general case
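As a baseline for the bounds on this slide, the trivial dictionary algorithm can be sketched as below. This is only an illustration: the function name `naive_dictionary_match` and the 0-based start positions are assumptions, not part of the lecture.

```python
def naive_dictionary_match(words, db):
    """Trivial dictionary matching: try every word at every start position.

    Returns sorted (start, word_index) pairs, with 0-based start positions
    and 1-based word indices (as in 1:POTATO, 2:POTASSIUM, 3:TASTE).
    Cost is on the order of (l_1 + l_2 + ...) * n character comparisons.
    """
    hits = []
    for idx, word in enumerate(words, start=1):
        for start in range(len(db) - len(word) + 1):
            if db[start:start + len(word)] == word:
                hits.append((start, idx))
    return sorted(hits)


words = ["POTATO", "POTASSIUM", "TASTE"]
db = "POTASTPOTATO"
print(naive_dictionary_match(words, db))
```

On the slide's example only POTATO occurs in the database, starting at 0-based position 6.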

  4. Direct Algorithm [Figure: the pattern POTATO aligned at successive positions of the database P O T A S T P O T A T O] Observations: • When we mismatch, we (should) know something about where the next match will be. • When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

  5. The Trie Automaton • Construct an automaton A from the dictionary – A[v,x] describes the transition from node v to a node w upon reading x. – Example: A[u,'T'] = v, and A[u,'S'] = w – Special root node r – Some nodes are terminal, and labeled with the index of the dictionary word. [Figure: trie for 1:POTATO, 2:POTASSIUM, 3:TASTE, showing root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]
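One possible way to build the trie automaton is sketched below. The representation is an assumption made for illustration: nodes are integers, node 0 plays the role of the root r, `goto[v][x]` stands for A[v,x], and `terminal` maps a node to the index of the dictionary word that ends there.

```python
def build_trie(words):
    """Build a keyword tree (trie) for the dictionary.

    goto[v] is a dict mapping a character x to the node A[v, x].
    Node 0 is the root. terminal[v] is the 1-based index of the
    dictionary word spelled by the path from the root to v.
    """
    goto = [{}]
    terminal = {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})            # allocate a new node
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        terminal[v] = idx
    return goto, terminal


goto, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])

# Walk to the node u that spells "POTA"; it branches on 'T' and on 'S'.
u = 0
for x in "POTA":
    u = goto[u][x]
print(sorted(goto[u]))
```

POTATO and POTASSIUM share the path P-O-T-A, so the node reached after "POTA" is the node u of the slide, with one outgoing edge for 'T' and one for 'S'.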

  6. An O(l_p n) algorithm for keyword matching • Start with the first position in the db, and the root node. • If the transition succeeds – Increment the 'current' pointer – Move to the new node – If the node is terminal, report “success” • Else – Retract the 'current' pointer – Increment the 'start' pointer – Move to the root and repeat
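The steps above can be sketched as follows. This is a minimal illustration under assumed names (`build_trie`, `keyword_tree_search`) with 0-based positions; each start position walks at most l_p edges, which gives the O(l_p n) bound.

```python
def build_trie(words):
    """Keyword tree: goto[v][x] is the child of v on x, node 0 is the root."""
    goto, terminal = [{}], {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        terminal[v] = idx
    return goto, terminal


def keyword_tree_search(words, db):
    """Return (start, word_index) pairs; restarts from the root at each start."""
    goto, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):            # increment 'start' pointer
        v, cur = 0, start                   # retract 'current', move to root
        while cur < len(db) and db[cur] in goto[v]:
            v = goto[v][db[cur]]            # successful transition
            cur += 1                        # increment 'current' pointer
            if v in terminal:               # terminal node: success
                hits.append((start, terminal[v]))
    return hits


print(keyword_tree_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))
```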

  7. Illustration [Figure: 'start' pointer l and 'current' pointer c on the database P O T A S T P O T A T O, with the current node v in the trie for 1:POTATO, 2:POTASSIUM, 3:TASTE]

  8. Idea for improving the time • Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match – Then a prefix of pattern j must equal a suffix of the first c-l characters of pattern i. [Figure: database P O T A S T P O T A T O with pointers l and c; pattern i = POTASSIUM, pattern j = TASTE; dictionary 1:POTATO, 2:POTASSIUM, 3:TASTE]

  9. Failure function • Every node v corresponds to a string s_v that is a prefix of some pattern. • Define F[v] to be the node u such that s_u is the longest proper suffix of s_v that labels a node of the trie. • If we fail to match at v, we should jump to F[v], and commence matching from there. • Let lp[v] = |s_u| [Figure: trie with nodes n_1 to n_10 and current node v]
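A common way to compute F and lp is a breadth-first pass over the trie, sketched below. The slides do not say how F is computed, so this construction and the names `build_failure` and `node_of` are assumptions for illustration.

```python
from collections import deque


def build_trie(words):
    """Keyword tree: goto[v][x] is the child of v on x, node 0 is the root."""
    goto = [{}]
    for word in words:
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
    return goto


def build_failure(goto):
    """Return (F, lp): F[v] is the failure node, lp[v] = |s_F[v]|."""
    F = [0] * len(goto)
    depth = [0] * len(goto)
    queue = deque()
    for child in goto[0].values():          # depth-1 nodes fail to the root
        depth[child] = 1
        queue.append(child)
    while queue:
        v = queue.popleft()
        for x, w in goto[v].items():
            depth[w] = depth[v] + 1
            u = F[v]
            while u != 0 and x not in goto[u]:
                u = F[u]                    # try shorter and shorter suffixes
            F[w] = goto[u].get(x, 0)
            queue.append(w)
    lp = [depth[F[v]] for v in range(len(goto))]
    return F, lp


def node_of(goto, s):
    """Follow s from the root and return the node it reaches."""
    v = 0
    for x in s:
        v = goto[v][x]
    return v


goto = build_trie(["POTATO", "POTASSIUM", "TASTE"])
F, lp = build_failure(goto)
v = node_of(goto, "POTAS")
print(F[v] == node_of(goto, "TAS"), lp[v])
```

For the node spelling "POTAS", the longest proper suffix that is also a trie node is "TAS", so lp is 3. This agrees with the later illustration, where a failure at l = 1, c = 6 moves l to 3.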

  10. Illustration • What is F(n_10)? • What is F(n_5)? • F(n_3)? • lp(n_10)? [Figure: trie with nodes n_1 to n_10 and current node v]

  11. Illustration: l = 1, c = 1 [Figure: database P O T A S T P O T A T O and the trie with current node v]

  12. Illustration: l = 1, c = 2 [Figure: database P O T A S T P O T A T O and the trie with current node v]

  13. Illustration: l = 1, c = 6 [Figure: database P O T A S T P O T A T O and the trie with current node v]

  14. Illustration: l = 3, c = 6 [Figure: database P O T A S T P O T A T O and the trie with current node v]

  15. Illustration: l = 3, c = 7 [Figure: database P O T A S T P O T A T O and the trie with current node v]

  16. Illustration: l = 7, c = 7 [Figure: database P O T A S T P O T A T O and the trie with current node v]

  17. Illustration: l = 7, c = 8 [Figure: database P O T A S T P O T A T O and the trie with current node v]

  18. Illustration: l = 7, c = 7 [Figure: database P O T A S T P O T A T O and the trie with current node v]

  19. Time analysis • In each step, either c is incremented, or l is incremented. • Neither pointer is ever decremented (lp[v] < c-l). • l and c do not exceed n. • Total time <= 2n [Figure: pointers l and c on the database P O T A S T P O T A T O]
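The scan with failure links, and the 2n bound on its steps, can be checked with a sketch like the one below. Several details are assumptions not taken from the slides: positions are 0-based, the pointer l is kept implicit as c minus the depth of the current node, and an `out` list is added so that words ending inside a longer partial match are also reported.

```python
from collections import deque


def build_automaton(words):
    """Trie plus failure links F and output lists out[v] = [(index, length)]."""
    goto, out = [{}], [[]]
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                out.append([])
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        out[v].append((idx, len(word)))
    F = [0] * len(goto)
    queue = deque(goto[0].values())         # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for x, w in goto[v].items():
            u = F[v]
            while u != 0 and x not in goto[u]:
                u = F[u]
            F[w] = goto[u].get(x, 0)
            out[w] = out[w] + out[F[w]]     # inherit words that end at F[w]
            queue.append(w)
    return goto, F, out


def search(words, db):
    """Return (hits, steps); hits are (start, word_index) pairs."""
    goto, F, out = build_automaton(words)
    hits, steps = [], 0
    v, c = 0, 0                             # current node, 'current' pointer
    while c < len(db):
        steps += 1
        if db[c] in goto[v]:
            v = goto[v][db[c]]
            c += 1                          # c is incremented
            for idx, length in out[v]:
                hits.append((c - length, idx))
        elif v != 0:
            v = F[v]                        # l jumps forward to c - lp[v]
        else:
            c += 1                          # at the root: l and c both advance
    return hits, steps


db = "POTASTPOTATO"
hits, steps = search(["POTATO", "POTASSIUM", "TASTE"], db)
print(hits, steps, steps <= 2 * len(db))
```

Every loop iteration either advances c or moves l forward through a failure link, so the step count stays within 2n.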

  20. Blast: Putting it all together • Input: Query of length m, database of size n • Select word-size, scoring matrix, gap penalties, E-value cutoff • Blast

  21. Blast Steps 1. Generate an automaton of all query keywords. 2. Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits. 3. Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties. 4. For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached. 5. Output results.
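Steps 1 and 2 can be illustrated with a toy sketch. It makes several simplifying assumptions that differ from real BLAST: the keywords are only the exact length-w substrings of the query, a hash table lookup stands in for the automaton, and extension and E-value computation (steps 3 to 5) are omitted. The names `query_words` and `find_hits` are hypothetical.

```python
from collections import defaultdict


def query_words(query, w):
    """Map each length-w substring of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_hits(query, db, w):
    """Return (query_pos, db_pos) pairs where a query word occurs in db."""
    index = query_words(query, w)
    hits = []
    for j in range(len(db) - w + 1):
        # Dictionary lookup used here in place of the keyword automaton.
        for i in index.get(db[j:j + w], []):
            hits.append((i, j))
    return hits


print(find_hits("POTATO", "POTASTPOTATO", 3))
```

Each hit would then be the seed for the extension in step 3.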

  22. BLAST output • Look up Blast Results with RID – HA5YXH5C012

  23. Distant hits

  24. Protein Sequence Analysis • What can you do if BLAST does not return a hit? – Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity. • A: Accept hits at higher E-value. – This increases the probability that the sequence similarity is a chance event. – How can we get around this paradox? – Reformulated Q: Suppose two sequences B and C have the same level of sequence similarity to sequence A. If A and B are related in function, can we assume that A and C are? If not, how can we distinguish? [Figure: sequences B, A, C]

  25. Silly Quiz [Figure labels: Skin patterns, Facial Features]

  26. Not all features (residues) are important [Figure labels: Skin patterns, Facial Features]

  27. Diverged family members provide key features

  28. Protein sequence motifs • Premise: • The sequence of a protein gives clues about its structure and function. • Not all residues are equally important in determining function. • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. • How can we identify these key residues? [Figure: Fam(B), A, C]

  29. Prosite • In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function. (Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, "The PROSITE database, its status in 1999")

  30. Basic idea • It is a heuristic approach. Start with the following: – A collection of sequences with the same function. – Region/residues known to be significant for maintaining structure and function. • Develop a pattern of conserved residues around the residues of interest • Iterate for appropriate sensitivity and specificity

  31. Example: Zinc Finger domain

  32. Proteins containing zf domains • How can we find a motif corresponding to a zf domain?

  33. From alignment to regular expressions • Alignment: ALRDFATHDDF, SMTAEATHDSI, ECDQAATHEAS • Resulting pattern: ATH-[DE] • Search Swissprot with the resulting pattern • Refine pattern to eliminate false positives • Iterate
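The step from aligned columns to a pattern can be sketched as below, assuming the three sequences are aligned without gaps and that the conserved region is given as a 0-based column range. The function name `columns_to_pattern` is hypothetical, and the output uses PROSITE-style dashes between every position (A-T-H-[DE]), which denotes the same pattern as the slide's ATH-[DE].

```python
def columns_to_pattern(seqs, start, end):
    """Build a PROSITE-like pattern from alignment columns start..end-1.

    A column with a single residue becomes that residue; a column with
    several residues becomes a bracketed class such as [DE].
    """
    parts = []
    for col in range(start, end):
        residues = sorted({s[col] for s in seqs})
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "-".join(parts)


alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
print(columns_to_pattern(alignment, 5, 9))
```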

  34. The sequence analysis perspective • Zinc Finger motif: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H – 2 conserved C, and 2 conserved H • How can we search a database using these motifs? – The motif is described using a regular expression. What is a regular expression?
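One way to search with such a motif is to translate it into the syntax of a regular expression library. The sketch below uses Python's `re` module and handles only the pattern elements that appear on this slide plus `{...}` exclusions; the function `prosite_to_regex` is a hypothetical helper, and the test sequence is synthetic, not a real protein.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into Python regex syntax.

    Handles: single residues, x (any residue), [ABC] classes,
    {ABC} exclusions, and repeat counts such as (3) or (2,4).
    """
    out = []
    for elem in pattern.split("-"):
        core, _, rep = elem.partition("(")
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        if rep:
            core += "{" + rep.rstrip(")") + "}"
        out.append(core)
    return "".join(out)


zf = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zf)

# Synthetic sequence built to satisfy the motif (not a real protein).
seq = "GG" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(zf, seq)
print(match.start(), match.end())
```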

  35. Regular Expressions • Concise representation of a set of strings over an alphabet Σ. • Described by a string over {Σ, ·, *, +}. • R is a r.e. if and only if one of the following holds: – R = {ε} (base case) – R = {σ}, σ ∈ Σ – R = R_1 + R_2 (union of strings) – R = R_1 · R_2 (concatenation) – R = R_1* (0 or more repetitions)

  36. • End of L7

  37. Regular Expression • Q: Let ∑ = {A,C,E} – Is (A+C)*EEC* a regular expression? – *(A+C)? – AC*..E? • Q: When is a string s in a regular expression? – R = (A+C)*EEC* – Is CEEC in R? – AEC? – ACEE?
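The membership questions can be checked mechanically with Python's `re` module, as sketched below. Note the change of notation, which is an assumption of this sketch: the union written + on the slides is written | in Python, and `re.fullmatch` is used so that the whole string, not just a substring, must belong to R.

```python
import re

# R = (A+C)*EEC* in the slide's notation; '+' (union) becomes '|' in Python.
R = re.compile(r"(A|C)*EEC*")


def in_language(s):
    """True if the entire string s belongs to the set described by R."""
    return R.fullmatch(s) is not None


for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_language(s))
```

AEC is rejected because R requires two consecutive E's.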
