CSE182-L7 Dictionary matching Pattern matching October 09 CSE182

Dictionary Matching
Dictionary: 1:POTATO, 2:POTASSIUM, 3:TASTE
Database: POTASTPOTATO
• Q: Given k words (s_i has length l_i) and a database of size n, find all matches to these words in the database string.
• How fast can this be done?
Fa05 CSE 182

Dict. Matching & string matching
• How fast can you do it, if you only had one word of length m?
– Trivial algorithm: O(nm) time
– Pre-processing O(m), search O(n) time
• Dictionary matching
– Trivial algorithm: O((l_1 + l_2 + l_3 + …) n)
– Using a keyword tree: O(l_p n), where l_p is the length of the longest pattern
– Aho-Corasick: O(n) after preprocessing O(l_1 + l_2 + …)
• We will consider the most general case

Direct Algorithm
[Figure: the database POTASTPOTATO with the pattern POTATO aligned at successive start positions]
Observations:
• When we mismatch, we (should) know something about where the next match will be.
• When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

The Trie Automaton
• Construct an automaton A from the dictionary
– A[v,x] describes the transition from node v to a node w upon reading x.
– Example: A[u,'T'] = v, and A[u,'S'] = w
– Special root node r
– Some nodes are terminal, and labeled with the index of the dictionary word.
[Figure: keyword tree for 1:POTATO, 2:POTASSIUM, 3:TASTE with root r; u is the node reached by POTA, v by POTAT, and w by POTAS; terminal nodes are labeled 1, 2, 3]

An O(l_p n) algorithm for keyword matching
• Start with the first position in the db, and the root node.
• If the transition succeeds
– Increment the 'current' pointer
– Move to the new node
– If it is a terminal node, report "success"
• Else
– Retract the 'current' pointer
– Increment the 'start' pointer
– Move to the root & repeat
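The steps above can be sketched in Python. This is a minimal illustration, not the course's reference code: the names build_trie and keyword_tree_search are hypothetical, nodes are plain integers with node 0 as the root r, and positions are 0-based (the slides count from 1).

```python
def build_trie(words):
    """Build a keyword tree. Node 0 is the root r.

    goto[v] maps a character x to the child node A[v, x];
    terminal[v] is the 1-based index of the dictionary word ending at v.
    """
    goto = [{}]
    terminal = {}
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        terminal[v] = index
    return goto, terminal


def keyword_tree_search(db, words):
    """Return (start, word_index) for every dictionary word found in db."""
    goto, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):      # increment the 'start' pointer
        v = 0                         # move to the root
        current = start               # retract the 'current' pointer
        while current < len(db) and db[current] in goto[v]:
            v = goto[v][db[current]]  # successful transition
            current += 1
            if v in terminal:         # terminal node: "success"
                hits.append((start, terminal[v]))
    return hits


print(keyword_tree_search("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"]))
# → [(6, 1)]
```

Each start position walks at most l_p edges before failing or running out of tree, which is where the O(l_p n) bound comes from.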

Illustration
[Figure: database POTASTPOTATO with pointers l and c, and the keyword tree with the current node v marked]

Idea for improving the time
• Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match
– then a prefix of pattern j must equal a suffix of the first c-l characters of pattern i.
[Figure: database POTASTPOTATO with pointers l and c; pattern i = POTASSIUM and pattern j = TASTE aligned beneath it; dictionary 1:POTATO, 2:POTASSIUM, 3:TASTE]

Failure function
• Every node v corresponds to a string s_v that is a prefix of some pattern.
• Define F[v] to be the node u such that s_u is the longest proper suffix of s_v that is also a prefix of some pattern.
• If we fail to match at v, we should jump to F[v], and commence matching from there.
• Let lp[v] = |s_u|
[Figure: keyword tree for POTATO, POTASSIUM, TASTE with nodes labeled n1 to n10 and the current node v marked]
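One common way to compute F and lp is a breadth-first pass over the keyword tree, reusing the failure links of shallower nodes. The sketch below assumes that approach; the slides do not say how F is built. The names build_failure and node_of are hypothetical, and node 0 is the root.

```python
from collections import deque


def build_failure(words):
    """Return (goto, fail, lp) for the keyword tree of `words`."""
    goto, depth = [{}], [0]
    for word in words:
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                depth.append(depth[v] + 1)
                goto[v][x] = len(goto) - 1
            v = goto[v][x]

    fail = [0] * len(goto)            # F[v]; depth-1 nodes fail to the root
    queue = deque(goto[0].values())
    while queue:
        v = queue.popleft()
        for x, w in goto[v].items():
            u = fail[v]
            # Shorten the candidate suffix until it can be extended by x.
            while u and x not in goto[u]:
                u = fail[u]
            fail[w] = goto[u].get(x, 0)
            queue.append(w)

    lp = [depth[u] for u in fail]     # lp[v] = |s_u| where u = F[v]
    return goto, fail, lp


def node_of(goto, s):
    """Follow s from the root and return the node reached (s must be in the tree)."""
    v = 0
    for x in s:
        v = goto[v][x]
    return v


goto, fail, lp = build_failure(["POTATO", "POTASSIUM", "TASTE"])
v = node_of(goto, "POTAS")
print(fail[v] == node_of(goto, "TAS"), lp[v])   # → True 3
```

For the example dictionary, failing after POTAS lands on the node for TAS, so three already-read characters are kept instead of being rescanned.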

Illustration
• What is F(n10)?
• What is F(n5)?
• F(n3)?
• lp(n10)?
[Figure: keyword tree for POTATO, POTASSIUM, TASTE with nodes labeled n1 to n10 and the current node v marked]

Illustration: database POTASTPOTATO, l = 1, c = 1
[Figure: keyword tree with nodes n1 to n10; current node v marked]

Illustration: database POTASTPOTATO, l = 1, c = 6
[Figure: keyword tree with nodes n1 to n10; current node v marked]

Illustration: database POTASTPOTATO, l = 3, c = 6
[Figure: keyword tree with nodes n1 to n10; current node v marked]

Illustration: database POTASTPOTATO, l = 3, c = 7
[Figure: keyword tree with nodes n1 to n11; current node v marked]

Time analysis
• In each step, either c is incremented, or l is incremented.
• Neither pointer is ever decremented (lp[v] < c-l).
• l and c do not exceed n.
• Total time <= 2n
[Figure: database POTASTPOTATO with pointers l and c]
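The 2n bound can be checked empirically with a self-contained sketch of the scan that counts its own steps. Everything here is illustrative: the names build_automaton and search are hypothetical, positions are 0-based, and the out lists (which report a word that ends at a failure-link ancestor) are an addition beyond the slides, needed only when one dictionary word is a substring of another.

```python
from collections import deque


def build_automaton(words):
    """Keyword tree plus failure links; out[v] lists the words ending at v."""
    goto, depth, out = [{}], [0], [[]]
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                depth.append(depth[v] + 1)
                out.append([])
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        out[v].append(index)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        v = queue.popleft()
        for x, w in goto[v].items():
            u = fail[v]
            while u and x not in goto[u]:
                u = fail[u]
            fail[w] = goto[u].get(x, 0)
            out[w] = out[w] + out[fail[w]]   # inherit words ending at F[w]
            queue.append(w)
    return goto, depth, out, fail


def search(db, words):
    """Return (hits, steps); hits are (start, word_index) pairs."""
    goto, depth, out, fail = build_automaton(words)
    l = c = v = steps = 0            # invariant: db[l:c] == s_v
    hits = []
    while c < len(db):
        steps += 1
        if db[c] in goto[v]:         # match: c is incremented
            v = goto[v][db[c]]
            c += 1
            hits.extend((c - len(words[i - 1]), i) for i in out[v])
        elif v == 0:                 # mismatch at the root: skip the character
            l += 1
            c += 1
        else:                        # mismatch elsewhere: l jumps forward
            v = fail[v]
            l = c - depth[v]
    return hits, steps


hits, steps = search("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"])
print(hits, steps <= 2 * len("POTASTPOTATO"))   # → [(6, 1)] True
```

Every loop iteration either advances c or strictly advances l, and neither exceeds n, so the loop body runs at most 2n times.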

Blast: Putting it all together
• Input: Query of length m, database of size n
• Select word-size, scoring matrix, gap penalties, E-value cutoff
• Blast

Blast Steps
1. Generate an automaton of all query keywords.
2. Scan database using a "Dictionary Matching" algorithm (O(n) time). Identify all hits.
3. Extend each hit using a variant of "local alignment" algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
5. Output results.
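Steps 1 and 2 can be illustrated with a much-simplified sketch. It is not BLAST itself: it uses only exact words of the query (no scoring-matrix neighborhood), and it swaps the automaton for a plain Python dict lookup, which is simpler to read but is not the dictionary-matching scan described in the slides. The names query_words and find_hits and the word size w = 4 are assumptions made for the example.

```python
def query_words(query, w):
    """Map each length-w word of the query to the query offsets where it occurs."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words


def find_hits(query, db, w):
    """Return (query_offset, db_offset) for every exact word match."""
    words = query_words(query, w)
    hits = []
    for j in range(len(db) - w + 1):
        for i in words.get(db[j:j + w], []):
            hits.append((i, j))
    return hits


print(find_hits("POTATO", "POTASTPOTATO", 4))
# → [(0, 0), (0, 6), (1, 7), (2, 8)]
```

Each hit is a seed: step 3 would extend it in both directions with a local-alignment variant, which this sketch leaves out.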

BLAST output
• Look up Blast Results with RID – HA5YXH5C012

Distant hits

Protein Sequence Analysis
• What can you do if BLAST does not return a hit?
– Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
• A: Accept hits at higher E-value.
– This increases the probability that the sequence similarity is a chance event.
– How can we get around this paradox?
– Reformulated Q: Suppose two sequences B, C have the same level of sequence similarity to sequence A. If A and B are related in function, can we assume that A and C are? If not, how can we distinguish?
[Figure: sequences B, A, C]

Silly Quiz
[Figure panels: Skin patterns, Facial Features]

Not all features (residues) are important
[Figure panels: Skin patterns, Facial Features]

Diverged family members provide key features

Protein sequence motifs
• Premise:
– The sequence of a protein gives clues about its structure and function.
– Not all residues are equally important in determining function.
• Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
• How can we identify these key residues?
[Figure: family Fam(B) with sequences A and C]

Prosite
• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.
Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, The PROSITE database, its status in 1999

Basic idea
• It is a heuristic approach. Start with the following:
– A collection of sequences with the same function.
– Region/residues known to be significant for maintaining structure and function.
• Develop a pattern of conserved residues around the residues of interest.
• Iterate for appropriate sensitivity and specificity.

EX: Zinc Finger domain

Proteins containing zf domains
• How can we find a motif corresponding to a zf domain?

From alignment to regular expressions
Alignment:
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Pattern: ATH-[DE]
• Search Swissprot with the resulting pattern
• Refine pattern to eliminate false positives
• Iterate

The sequence analysis perspective
• Zinc Finger motif
– C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
– 2 conserved C, and 2 conserved H
• How can we search a database using these motifs?
– The motif is described using a regular expression. What is a regular expression?
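As a concrete illustration, the motif can be translated into a Python regular expression and run against a sequence. The sketch handles only the constructs that appear in this motif (single residues, x, bracketed residue sets, and (n) or (n,m) repeat counts), not the full PROSITE syntax. The function name prosite_to_regex is hypothetical, and the test sequence is synthetic, built to fit the motif rather than taken from a real protein.

```python
import re

# One motif element: a residue, x, or a [set], optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset only) into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = m.groups()
        atom = "." if residue == "x" else residue   # x matches any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(atom + repeat)
    return "".join(parts)


motif = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(motif)
print(regex)   # → C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# Synthetic sequence constructed to contain one occurrence of the motif.
sequence = "GG" + "CAACAAAL" + "A" * 8 + "HAAAH" + "GG"
print(re.search(regex, sequence) is not None)
```

Scanning a database then amounts to running re.search (or re.finditer) over each sequence.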

Regular Expressions
• Concise representation of a set of strings over alphabet Σ.
• Described by a string over {Σ, ⋅, ∗, +}
• R is a r.e. if and only if one of the following holds (R_1 and R_2 are regular expressions):
– R = {ε} (base case)
– R = {σ}, σ ∈ Σ (base case)
– R = R_1 + R_2 (union of strings)
– R = R_1 ⋅ R_2 (concatenation)
– R = R_1∗ (0 or more repetitions)

• End of L7

Regular Expression
• Q: Let Σ = {A,C,E}
– Is (A+C)*EEC* a regular expression?
– *(A+C)?
– AC*..E?
• Q: When is a string s in a regular expression?
– R = (A+C)*EEC*
– Is CEEC in R?
– AEC?
– ACEE?
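The membership questions can be checked mechanically. This is a minimal sketch using Python's re module, in which the slide's union operator + is written | and concatenation is implicit. The name in_R is hypothetical. re.fullmatch is used because membership requires the whole string to match, not just a substring.

```python
import re

# R = (A+C)*EEC* in the slide's notation, rewritten in Python regex syntax.
R = re.compile(r"(A|C)*EEC*")


def in_R(s):
    """True if the entire string s belongs to the set described by R."""
    return R.fullmatch(s) is not None


for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_R(s))
```

AEC is rejected because R requires two consecutive E's; the other two strings are accepted.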
