October 09 CSE182
CSE182-L7: Dictionary matching, Pattern matching
Fa05 CSE 182
Dictionary Matching
- Q: Given k words (s_i has length l_i), and a database of size n, find all matches to these words in the database string.
- How fast can this be done?
Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
Database: POTASTPOTATO
Dict. Matching & string matching
- How fast can you do it, if you only had one word of length m?
  – Trivial algorithm: O(nm) time
  – With pre-processing O(m), search takes O(n) time
- Dictionary matching
  – Trivial algorithm: O((l1+l2+l3+…)n) time
  – Using a keyword tree: O(lp·n) time (lp is the length of the longest pattern)
  – Aho-Corasick: O(n) search after O(l1+l2+…) preprocessing
- We will consider the most general case
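As a baseline for the bounds above, here is a minimal sketch of the trivial algorithm, which tries every word at every database position. The function name and the (word index, start position) output format are assumptions made for illustration.

```python
def naive_dictionary_match(words, db):
    """Return (word_index, start) pairs; word_index is 1-based, start is 0-based."""
    hits = []
    for start in range(len(db)):
        for idx, word in enumerate(words, start=1):
            # str.startswith(word, start) compares at most len(word) characters,
            # so the total work is O((l1 + l2 + ...) * n).
            if db.startswith(word, start):
                hits.append((idx, start))
    return hits


print(naive_dictionary_match(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))
# [(1, 6)]
```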
Direct Algorithm
[Figure: the database POTASTPOTATO with copies of the pattern POTATO aligned at successive positions]
Observations:
- When we mismatch, we (should) know something about where the next match will be.
- When there is a mismatch, we (should) know something about other patterns in the dictionary as well.
The Trie Automaton
- Construct an automaton A from the dictionary
  – A[v,x] describes the transition from node v to a node w upon reading x.
  – Example: A[u,'T'] = v, and A[u,'S'] = w
  – Special root node r
  – Some nodes are terminal, and labeled with the index of the dictionary word.
[Figure: trie for 1: POTATO, 2: POTASSIUM, 3: TASTE, with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]
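A minimal sketch of building the trie automaton for the example dictionary. Representing A as a list of per-node dictionaries, numbering the root as node 0, and the `terminal` map are assumptions of this sketch, not details taken from the slides.

```python
def build_trie(words):
    """Build the trie: A[v][x] is the node reached from node v on reading x."""
    A = [{}]        # node 0 is the root r
    terminal = {}   # terminal node -> index of the dictionary word (1-based)
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})            # create a new node
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal


A, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])
print(len(A))        # 17 nodes, including the root
print(sorted(A[4]))  # node 4 spells "POTA"; it branches on 'S' and 'T'
```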
An O(lp·n) algorithm for keyword matching
- Start with the first position in the db, and the root node.
- If successful transition
  – Increment 'current' pointer
  – Move to a new node
  – If terminal node, "success"
- Else
  – Retract 'current' pointer
  – Increment 'start' pointer
  – Move to root & repeat
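The steps above can be sketched as follows. The trie layout (list of dictionaries, root at node 0) and the function names are assumptions of this sketch. Each start position walks at most lp characters down the trie, which gives the O(lp·n) bound.

```python
def build_trie(words):
    A, terminal = [{}], {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal


def keyword_tree_search(words, db):
    """Return (word_index, start) pairs using the keyword tree, no failure links."""
    A, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):          # increment 'start' pointer
        v, current = 0, start             # move to root, retract 'current'
        while current < len(db) and db[current] in A[v]:
            v = A[v][db[current]]         # successful transition
            current += 1
            if v in terminal:             # terminal node: "success"
                hits.append((terminal[v], start))
    return hits


print(keyword_tree_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))
# [(1, 6)]
```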
Illustration:
[Figure: the trie and the database POTASTPOTATO, with pointers l and c and the current node v]
Idea for improving the time
- Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match
  – then a prefix of pattern j must equal a suffix of the first c-l characters of pattern i.
[Figure: database POTASTPOTATO with pointers l and c; pattern i = POTASSIUM, pattern j = TASTE; dictionary 1: POTATO, 2: POTASSIUM, 3: TASTE]
Failure function
[Figure: trie for the dictionary, with nodes labeled n1 to n10]
- Every node v corresponds to a string s_v that is a prefix of some pattern.
- Define F[v] to be the node u such that s_u is the longest proper suffix of s_v (among all nodes of the trie).
- If we fail to match at v, we should jump to F[v], and commence matching from there
- Let lp[v] = |s_u|
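One common way to compute F and lp is a breadth-first pass over the trie, reusing the failure links of shallower nodes. This is a sketch under assumptions: the node numbering (root = 0), the list-of-dictionaries trie, and the helper `node_of` are illustrative choices, not taken from the slides.

```python
from collections import deque


def build_trie(words):
    A = [{}]
    for word in words:
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
    return A


def failure_function(A):
    """Return (F, lp): F[v] is the failure node of v, lp[v] = |s_F[v]|."""
    F = [0] * len(A)
    depth = [0] * len(A)
    queue = deque()
    for child in A[0].values():       # depth-1 nodes fail to the root
        depth[child] = 1
        queue.append(child)
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            depth[w] = depth[v] + 1
            u = F[v]
            # Follow failure links until some node has an x-transition.
            while u != 0 and x not in A[u]:
                u = F[u]
            F[w] = A[u].get(x, 0)
            queue.append(w)
    lp = [depth[F[v]] for v in range(len(A))]
    return F, lp


def node_of(A, s):
    """Hypothetical helper: the node reached from the root by spelling s."""
    v = 0
    for x in s:
        v = A[v][x]
    return v


A = build_trie(["POTATO", "POTASSIUM", "TASTE"])
F, lp = failure_function(A)
v = node_of(A, "POTAS")
print(F[v] == node_of(A, "TAS"), lp[v])   # True 3
```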
Illustration
[Figure: the same trie, with nodes labeled n1 to n10]
- What is F(n10)?
- What is F(n5)?
- F(n3)?
- lp(n10)?
Illustration
[Figure, 8 animation frames: the trie (nodes labeled n1 to n11, current node v) scanning the database POTASTPOTATO. Pointer values shown in successive frames: l = 1, c = 1; l = 1, c = 2; l = 1, c = 6; l = 3, c = 6; l = 3, c = 7; l = 7, c = 7; l = 7, c = 8; l = 7, c = 7.]
Time analysis
- In each step, either c is incremented, or l is incremented
- Neither pointer is ever decremented: on a failure at v, l moves forward to c - lp[v], and lp[v] < c-l.
- l and c do not exceed n
- Total time <= 2n
[Figure: database POTASTPOTATO with pointers l and c]
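A compact sketch of the full search with failure links (the Aho-Corasick scheme named earlier in the lecture). The loop index c plays the role of the 'current' pointer, and l is implicit as c minus the depth of the current node. The `out` lists, which also report words ending at a node via its failure chain, are a standard addition assumed here rather than something shown on the slides.

```python
from collections import deque


def build_automaton(words):
    A, F, out = [{}], [0], [[]]
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                F.append(0)
                out.append([])
                A[v][x] = len(A) - 1
            v = A[v][x]
        out[v].append((idx, len(word)))
    queue = deque(A[0].values())          # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u != 0 and x not in A[u]:
                u = F[u]
            F[w] = A[u].get(x, 0)
            out[w] = out[w] + out[F[w]]   # words that end at a suffix of s_w
            queue.append(w)
    return A, F, out


def aho_corasick(words, db):
    """Return (word_index, start) pairs; the scan never moves backwards in db."""
    A, F, out = build_automaton(words)
    hits, v = [], 0
    for c, x in enumerate(db):
        while v != 0 and x not in A[v]:
            v = F[v]                      # failure: l jumps forward, c stays
        v = A[v].get(x, 0)
        for idx, length in out[v]:
            hits.append((idx, c - length + 1))
    return hits


print(aho_corasick(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))
# [(1, 6)]
```

Each database character triggers one forward transition, and every failure jump strictly increases l, which is where the 2n bound on pointer moves comes from.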
Blast: Putting it all together
- Input: Query of length m, database of size n
- Select word-size, scoring matrix, gap penalties, E-value cutoff
- Blast
Blast Steps
1. Generate an automaton of all query keywords.
2. Scan database using a "Dictionary Matching" algorithm (O(n) time). Identify all hits.
3. Extend each hit using a variant of the "local alignment" algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute the E-value and the P-value. Sort according to increasing E-value until the cut-off is reached.
5. Output results.
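A rough sketch of steps 1 and 2 only, under simplifying assumptions: the keywords are the exact length-w words of the query, and a hash table stands in for the automaton (a swapped-in technique, chosen for brevity). Real BLAST implementations differ, for example by also using similar "neighborhood" words for proteins. The function name and parameters are hypothetical.

```python
def seed_hits(query, db, w=3):
    """Return (query_pos, db_pos) pairs where a length-w query word occurs in db."""
    words = {}
    for i in range(len(query) - w + 1):
        # Step 1 (simplified): index every length-w word of the query.
        words.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(db) - w + 1):
        # Step 2 (simplified): one table lookup per database position.
        for i in words.get(db[j:j + w], []):
            hits.append((i, j))
    return hits


print(seed_hits("ACGT", "TTACGA", w=3))   # [(0, 2)]
```

Each hit would then be passed to the extension step (step 3), which is not sketched here.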
BLAST output
- Look up Blast Results with RID
  – HA5YXH5C012
Distant hits
Protein Sequence Analysis
- What can you do if BLAST does not return a hit?
  – Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
- A: Accept hits at higher E-value.
  – This increases the probability that the sequence similarity is a chance event.
  – How can we get around this paradox?
  – Reformulated Q: suppose two sequences B, C have the same level of sequence similarity to sequence A. If A & B are related in function, can we assume that A & C are? If not, how can we distinguish?
[Figure: sequences A, B, C]
Silly Quiz
[Figure panels: Skin patterns, Facial Features]
Not all features (residues) are important
[Figure panels: Skin patterns, Facial Features]
Diverged family members provide key features
Protein sequence motifs
- Premise:
- The sequence of a protein gives clues about its structure and function.
- Not all residues are equally important in determining function.
- Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
- How can we identify these key residues?
[Figure: sequences A, Fam(B), C]
Prosite
In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.
Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch
The PROSITE database, its status in 1999
Basic idea
- It is a heuristic approach. Start with the following:
  – A collection of sequences with the same function.
  – Region/residues known to be significant for maintaining structure and function.
- Develop a pattern of conserved residues around the residues of interest
- Iterate for appropriate sensitivity and specificity
EX: Zinc Finger domain
Proteins containing zf domains
How can we find a motif corresponding to a zf domain?
From alignment to regular expressions
[Alignment; a * in the original figure marks a column of interest]
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Resulting pattern: ATH-[DE]
- Search Swissprot with the resulting pattern
- Refine pattern to eliminate false positives
- Iterate
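A small sketch of the first step, turning aligned columns into a pattern. It assumes the conserved region has already been cut out of the alignment (here the four columns ATHD, ATHD, ATHE from the example), and it writes the result in PROSITE-style notation with a dash between positions. The function name is hypothetical.

```python
def columns_to_pattern(aligned):
    """Build a pattern from equal-length aligned segments, one element per column."""
    parts = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])                    # fully conserved column
        else:
            parts.append("[" + "".join(residues) + "]")  # set of alternatives
    return "-".join(parts)


print(columns_to_pattern(["ATHD", "ATHD", "ATHE"]))   # A-T-H-[DE]
```

The search and refinement steps would then tighten or loosen individual positions to remove false positives.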
The sequence analysis perspective
- Zinc Finger motif
  – C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
  – 2 conserved C, and 2 conserved H
- How can we search a database using these motifs?
  – The motif is described using a regular expression. What is a regular expression?
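One way to search with such a motif is to translate it into the regular-expression syntax of a programming language. The sketch below handles only the notation used in the zinc finger pattern, plus `{...}` for excluded residues, and ignores other PROSITE features such as the `<` and `>` anchors. The function name and the toy protein string are assumptions made for illustration.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden},
# optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, lo, hi = m.groups()
        if residue == "x":
            residue = "."                          # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # any residue except these
        if lo is not None:
            residue += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(residue)
    return "".join(parts)


zf = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zf)   # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# Made-up toy sequence, constructed so that it contains the motif.
toy = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(re.search(zf, toy) is not None)   # True
```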
Regular Expressions
- Concise representation of a set of strings over alphabet Σ.
- Described by a string over {Σ, ·, *, +}
- R is a r.e. if and only if
  R = {ε}              Base case
  R = {σ}, σ ∈ Σ
  R = R1 + R2          Union of strings
  R = R1 · R2          Concatenation
  R = R1*              0 or more repetitions
- End of L7
Regular Expression
- Q: Let Σ = {A,C,E}
  – Is (A+C)*EEC* a regular expression?
  – *(A+C)?
  – AC*..E?
- Q: When is a string s in a regular expression?
  – R = (A+C)*EEC*
  – Is CEEC in R?
  – AEC?
  – ACEE?
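The membership questions can be checked mechanically. In this sketch the lecture's union operator `+` is written `|`, as in Python's `re` module, and `fullmatch` requires the whole string to be in R. The helper name is hypothetical.

```python
import re

# (A+C)*EEC* in the lecture's notation; '+' (union) becomes '|' in Python.
R = re.compile(r"(A|C)*EEC*")


def in_language(s):
    """True if the entire string s belongs to the set described by R."""
    return R.fullmatch(s) is not None


for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_language(s))
# CEEC True
# AEC False
# ACEE True
```

AEC fails because R requires two consecutive E's.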
Regular Expression & Automata
- Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
  – The automaton has a start and end node
  – Each edge is labeled with a symbol from Σ, or ε
- Suppose R is described by automaton A
- s ∈ R if and only if there is a path from start to end in A, labeled with s.
Examples: Regular Expression & Automata
- (A+C)*EEC*
[Figure: automaton for (A+C)*EEC* with start and end nodes and edges labeled A, C, E, E, C]
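A sketch of checking membership by following labeled paths. The automaton below is one plausible reading of the figure: self-loops A and C on the start node, two E edges in sequence, and a self-loop C on the end node. The intermediate node name "mid" is an assumption. This automaton has no ε edges, so the simulation only tracks the set of nodes reachable after each symbol.

```python
# edges[v] is a list of (label, w) pairs; "mid" is a hypothetical node name.
EDGES = {
    "start": [("A", "start"), ("C", "start"), ("E", "mid")],
    "mid": [("E", "end")],
    "end": [("C", "end")],
}


def accepts(edges, start, end, s):
    """True if some path from start to end is labeled with s."""
    current = {start}
    for x in s:
        current = {w for v in current
                   for (label, w) in edges.get(v, [])
                   if label == x}
    return end in current


for s in ["CEEC", "AEC", "ACEE"]:
    print(s, accepts(EDGES, "start", "end", s))
# CEEC True
# AEC False
# ACEE True
```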
Constructing automata from R.E
- R = {ε}
- R = {σ}, σ ∈ ∑
- R = R1 + R2
- R = R1 · R2
- R = R1*
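The five cases can be implemented with the standard Thompson construction, sketched below. The slide figures are not available, so the exact wiring here (which ε edges are added for union and star) may differ from the lecture's drawings. All class and function names are assumptions. An edge label of None stands for ε.

```python
class NFA:
    def __init__(self):
        self.edges = []                  # edges[v] = list of (label, w)

    def new_node(self):
        self.edges.append([])
        return len(self.edges) - 1

    def add_edge(self, v, label, w):
        self.edges[v].append((label, w))


def symbol(nfa, sigma=None):
    """R = {sigma}; with sigma=None this is the base case R = {epsilon}."""
    s, e = nfa.new_node(), nfa.new_node()
    nfa.add_edge(s, sigma, e)
    return s, e


def union(nfa, r1, r2):
    s, e = nfa.new_node(), nfa.new_node()
    for sub_start, sub_end in (r1, r2):
        nfa.add_edge(s, None, sub_start)
        nfa.add_edge(sub_end, None, e)
    return s, e


def concat(nfa, r1, r2):
    nfa.add_edge(r1[1], None, r2[0])
    return r1[0], r2[1]


def star(nfa, r1):
    s, e = nfa.new_node(), nfa.new_node()
    nfa.add_edge(s, None, r1[0])
    nfa.add_edge(r1[1], None, e)
    nfa.add_edge(s, None, e)             # zero repetitions
    nfa.add_edge(r1[1], None, r1[0])     # repeat
    return s, e


def closure(nfa, nodes):
    """All nodes reachable from `nodes` using only epsilon edges."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        v = stack.pop()
        for label, w in nfa.edges[v]:
            if label is None and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def accepts(nfa, r, s):
    current = closure(nfa, {r[0]})
    for x in s:
        step = {w for v in current for label, w in nfa.edges[v] if label == x}
        current = closure(nfa, step)
    return r[1] in current


nfa = NFA()
r = star(nfa, union(nfa, symbol(nfa, "A"), symbol(nfa, "C")))   # (A+C)*
r = concat(nfa, r, symbol(nfa, "E"))
r = concat(nfa, r, symbol(nfa, "E"))
r = concat(nfa, r, star(nfa, symbol(nfa, "C")))                 # (A+C)*EEC*
print([accepts(nfa, r, s) for s in ["CEEC", "AEC", "ACEE"]])
# [True, False, True]
```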
End of L6
Protein structure basics
Side chains determine amino-acid type
- The residues may have different properties.
- Aspartic acid (D) and glutamic acid (E) are acidic residues
Bond angles form structural constraints
Various constraints determine 3d structure
- Constraints
  – Structural constraints due to physicochemical properties
  – Constraints due to bond angles
  – H-bond formation
- Surprisingly, a few conformations are seen over and over again.
Alpha-helix
- 3.6 residues per turn
- H-bonds between residue i and residue i+4 stabilize the structure.
- First discovered by Linus Pauling
Beta-sheet
- Each strand by itself has 2 residues per turn, and is not stable.
- Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel.
- Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.
Domains
- The basic structures (helix, strand, loop) combine to form complex 3D structures.
- Certain combinations are popular. Many sequences, but only a few folds
3D structure
- Predicting tertiary structure is an important problem in Bioinformatics.
- Premise: Clues to structure can be found in the sequence.
- While de novo tertiary structure prediction is hard, there are many intermediate and tractable goals.
- The PDB database is a compendium of structures
Searching structure databases
- Threading and other 3D alignments can be used to align structures.
- Database filtering is possible through geometric hashing.
Trivia Quiz
- What research won the Nobel prize in Chemistry in 2004?
- In 2002?
How are Proteins Sequenced? Mass Spec 101:
Nobel Citation 2002
Nobel Citation, 2002
Mass Spectrometry
Sample Preparation Enzymatic Digestion (Trypsin) + Fractionation
Single Stage MS
Mass Spectrometry LC-MS: 1 MS spectrum / second
Tandem MS
Secondary Fragmentation
Ionized parent peptide