CSE182-L7 Dictionary matching, Pattern matching, October 09, CSE182



SLIDE 1

October 09 CSE182

CSE182-L7

Dictionary matching, Pattern matching

SLIDE 2

Fa05 CSE 182

Dictionary Matching

  • Q: Given k words (s_i has length l_i) and a database of size n, find all matches to these words in the database string.
  • How fast can this be done?

Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
Database: POTASTPOTATO

SLIDE 3

Dictionary matching & string matching

  • How fast can you do it if you had only one word of length m?
– Trivial algorithm: O(nm) time
– Pre-processing O(m), search O(n) time
  • Dictionary matching
– Trivial algorithm: O((l1+l2+l3…)n)
– Using a keyword tree: O(lp·n) (lp is the length of the longest pattern)
– Aho-Corasick: O(n) after preprocessing O(l1+l2..)
  • We will consider the most general case

SLIDE 4

Direct Algorithm

[Figure: the pattern POTATO compared against the database POTASTPOTATO at successive shifts]

Observations:

  • When we mismatch, we (should) know something about where the next match will be.
  • When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

SLIDE 5

The Trie Automaton

  • Construct an automaton A from the dictionary
– A[v,x] describes the transition from node v to a node w upon reading x.
– A[u,’T’] = v, and A[u,’S’] = w
– Special root node r
– Some nodes are terminal, and labeled with the index of the dictionary word.

Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE

[Figure: trie of the dictionary with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]
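The trie automaton can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `build_trie`, the list-of-dicts layout for A, and the integer node numbering (root r = 0) are all assumptions made for the example.

```python
def build_trie(words):
    """Build the trie automaton A for a list of dictionary words.

    A[v] maps a character x to the node reached from v upon reading x.
    Node 0 plays the role of the root r. terminal[v] is the 1-based
    index of the dictionary word that ends at node v.
    """
    A = [{}]        # A[0] is the root
    terminal = {}   # node -> index of the dictionary word ending there
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})            # create a new node w
                A[v][x] = len(A) - 1    # A[v, x] = w
            v = A[v][x]
        terminal[v] = index
    return A, terminal


A, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])
u = A[A[A[A[0]["P"]]["O"]]["T"]]["A"]   # node u reached after reading POTA
print(sorted(A[u]))                      # u has transitions on 'S' and 'T'
```

With this numbering the node reached after POTA branches on 'T' and 'S', which mirrors the A[u,’T’] = v and A[u,’S’] = w example on the slide.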

SLIDE 6

An O(lp·n) algorithm for keyword matching

  • Start with the first position in the db, and the root node.
  • If successful transition
– Increment current pointer
– Move to a new node
– If terminal node “success”
  • Else
– Retract ‘current’ pointer
– Increment ‘start’ pointer
– Move to root & repeat
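The steps above can be sketched as follows. This is a hedged illustration under assumed names (`build_trie`, `keyword_match`) and 0-based positions; it restarts from the root at every start position, so each start costs at most lp transitions.

```python
def build_trie(words):
    """Trie automaton: A[v][x] is the child of node v on character x."""
    A, terminal = [{}], {}
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = index
    return A, terminal


def keyword_match(words, db):
    """Return (start, word_index) for every dictionary word found in db."""
    A, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):          # increment 'start' pointer
        v = 0                             # move to root
        current = start                   # retract 'current' pointer
        while current < len(db) and db[current] in A[v]:
            v = A[v][db[current]]         # successful transition
            current += 1
            if v in terminal:             # terminal node: success
                hits.append((start, terminal[v]))
    return hits


print(keyword_match(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))
```

On the slide's database only POTATO occurs, starting at 0-based position 6.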

SLIDE 7

Illustration:

[Figure: trie automaton for the dictionary (terminal nodes 1, 2, 3, current node v) with pointers l and c into the database POTASTPOTATO]

SLIDE 8

Idea for improving the time

  • Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match
– Then prefix(pattern j) = suffix(first c-l characters of pattern i)

Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE

[Figure: database POTASTPOTATO with pointers l and c; pattern i = POTASSIUM, pattern j = TASTE]

SLIDE 9

Failure function

  • Every node v corresponds to a string s_v that is a prefix of some pattern.
  • Define F[v] to be the node u such that s_u is the longest proper suffix of s_v that is itself a prefix of some pattern (the root if there is none)
  • If we fail to match at v, we should jump to F[v], and commence matching from there
  • Let lp[v] = |s_u|

[Figure: trie automaton for the dictionary, nodes labeled n1 to n10, current node v]
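One common way to compute F and lp is a breadth-first pass over the trie, reusing the failure links of shallower nodes. The sketch below assumes that approach; the slides do not say how F is built, and the names `build_trie` and `failure_function` and the integer node ids (root = 0) are illustrative only.

```python
from collections import deque


def build_trie(words):
    """Trie automaton: A[v][x] is the child of node v on character x."""
    A = [{}]
    for word in words:
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
    return A


def failure_function(A):
    """Return (F, lp): F[v] is the failure node of v, lp[v] = |s_F[v]|."""
    F = [0] * len(A)
    depth = [0] * len(A)             # depth[v] = |s_v|
    queue = deque()
    for w in A[0].values():          # children of the root fail to the root
        depth[w] = 1
        queue.append(w)
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            depth[w] = depth[v] + 1
            u = F[v]
            while u != 0 and x not in A[u]:
                u = F[u]             # try shorter and shorter suffixes
            F[w] = A[u].get(x, 0)
            queue.append(w)
    lp = [depth[F[v]] for v in range(len(A))]
    return F, lp


A = build_trie(["POTATO", "POTASSIUM", "TASTE"])
F, lp = failure_function(A)
```

With this insertion order the node for POTAS is 7 and the node for TAS is 14, so `F[7] == 14` and `lp[7] == 3`: after failing at POTAS, matching resumes inside TASTE.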

SLIDE 10

Illustration

[Figure: trie automaton for the dictionary, nodes labeled n1 to n10, current node v]

  • What is F(n10)?
  • What is F(n5)?
  • F(n3)?
  • lp(n10)?

SLIDE 11

Illustration

[Figure: scan of the database POTASTPOTATO over the trie automaton (nodes n1 to n10, current node v); l = 1, c = 1]

SLIDE 12

Illustration

[Figure: scan of the database POTASTPOTATO over the trie automaton (nodes n1 to n10, current node v); l = 1, c = 2]

SLIDE 13

Illustration

[Figure: scan of the database POTASTPOTATO over the trie automaton (nodes n1 to n10, current node v); l = 1, c = 6]

SLIDE 14

Illustration

[Figure: scan of the database POTASTPOTATO over the trie automaton (nodes n1 to n10, current node v); l = 3, c = 6]

SLIDE 15

Illustration

[Figure: scan of the database POTASTPOTATO over the trie automaton (nodes n1 to n11, current node v); l = 3, c = 7]

SLIDE 16

Illustration

[Figure: scan of the database POTASTPOTATO over the trie automaton (nodes n1 to n10, current node v); l = 7, c = 7]

SLIDE 17

Illustration

[Figure: scan of the database POTASTPOTATO over the trie automaton (nodes n1 to n10, current node v); l = 7, c = 8]

SLIDE 18

Illustration

[Figure: scan of the database POTASTPOTATO over the trie automaton (nodes n1 to n10, current node v); l = 7, c = 7]

SLIDE 19

Time analysis

  • In each step, either c is incremented, or l is incremented
  • Neither pointer is ever decremented (lp[v] < c-l).
  • l and c do not exceed n
  • Total time <= 2n

[Figure: database POTASTPOTATO with pointers l and c]
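The 2n bound can be checked with a small Aho-Corasick style scan that counts its own steps. This is a sketch under assumptions: the names `build_automaton` and `scan` are made up, positions are 0-based, and the `out` lists (which also report words ending at a failure-link ancestor) are an addition not described on the slides. The pointer l is implicit: it equals c minus the depth of the current node.

```python
from collections import deque


def build_automaton(words):
    """Trie A, failure links F, and out[v] = [(word_index, word_length)]."""
    A, out = [{}], [[]]
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                out.append([])
                A[v][x] = len(A) - 1
            v = A[v][x]
        out[v].append((index, len(word)))
    F = [0] * len(A)
    queue = deque(A[0].values())
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u != 0 and x not in A[u]:
                u = F[u]
            F[w] = A[u].get(x, 0)
            out[w] = out[w] + out[F[w]]   # words that end at a suffix of s_w
            queue.append(w)
    return A, F, out


def scan(words, db):
    """Return (hits, steps); hits are (start, word_index) pairs."""
    A, F, out = build_automaton(words)
    hits, steps, v, c = [], 0, 0, 0
    while c < len(db):
        steps += 1
        if db[c] in A[v]:
            v = A[v][db[c]]
            c += 1                        # c is incremented
            hits.extend((c - length, index) for index, length in out[v])
        elif v != 0:
            v = F[v]                      # l jumps forward, c stays put
        else:
            c += 1                        # at the root: l and c both advance
    return hits, steps


hits, steps = scan(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
assert steps <= 2 * len("POTASTPOTATO")
```

Every iteration either advances c or follows a failure link (which advances l), so the step counter never exceeds 2n.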

SLIDE 20

Blast: Putting it all together

  • Input: Query of length m, database of size n
  • Select word-size, scoring matrix, gap penalties, E-value cutoff
  • Blast

SLIDE 21

Blast Steps

1. Generate an automaton of all query keywords.
2. Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.
3. Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
5. Output results.
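Steps 1 and 2 can be illustrated with a deliberately simplified sketch. It is not how BLAST is implemented: the query keywords here are only the exact words of length w in the query, a hash table stands in for the automaton, and the names `query_words` and `find_hits` are hypothetical. Extension and scoring (steps 3 to 5) are left out.

```python
def query_words(query, w):
    """Step 1 (simplified): index every length-w word of the query."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words


def find_hits(query, db, w):
    """Step 2 (simplified): report (query_pos, db_pos) for each shared word.

    A dictionary lookup per database position replaces the dictionary
    matching automaton; both find the same exact word hits.
    """
    words = query_words(query, w)
    hits = []
    for j in range(len(db) - w + 1):
        for i in words.get(db[j:j + w], []):
            hits.append((i, j))
    return hits


print(find_hits("POTATO", "POTASTPOTATO", 3))
```

Each hit is a seed that step 3 would then try to extend into a longer local alignment.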

SLIDE 22

BLAST output

  • Look up Blast Results with RID

– HA5YXH5C012

SLIDE 23

Distant hits

SLIDE 24

Protein Sequence Analysis

  • What can you do if BLAST does not return a hit?
– Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
  • A: Accept hits at higher E-value.
– This increases the probability that the sequence similarity is a chance event.
– How can we get around this paradox?
– Reformulated Q: suppose two sequences B, C have the same level of sequence similarity to sequence A. If A & B are related in function, can we assume that A & C are? If not, how can we distinguish?

[Figure: sequences A, B, C]

SLIDE 25

Silly Quiz

Skin patterns Facial Features

SLIDE 26

Not all features (residues) are important

Skin patterns Facial Features

SLIDE 27

Diverged family members provide key features

SLIDE 28

Protein sequence motifs

  • Premise:
  • The sequence of a protein gives clues about its structure and function.
  • Not all residues are equally important in determining function.
  • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
  • How can we identify these key residues?

[Figure: A, Fam(B), C]

SLIDE 29

Prosite

  • In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.

Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, The PROSITE database, its status in 1999

SLIDE 30

Basic idea

  • It is a heuristic approach. Start with the following:
– A collection of sequences with the same function.
– Region/residues known to be significant for maintaining structure and function.
  • Develop a pattern of conserved residues around the residues of interest
  • Iterate for appropriate sensitivity and specificity

SLIDE 31

EX: Zinc Finger domain

SLIDE 32

Proteins containing zf domains

How can we find a motif corresponding to a zf domain?

SLIDE 33

From alignment to regular expressions

ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS

Pattern: ATH-[DE]

  • Search Swissprot with the resulting pattern
  • Refine pattern to eliminate false positives
  • Iterate
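The step from aligned sequences to a pattern can be sketched by collecting the residues seen in each column of a chosen window. This is only an illustration: the helper name `columns_to_pattern` and the 0-based column window (5 to 8, the conserved block in the slide's three sequences) are assumptions, and the output uses Python regular expression syntax rather than Prosite notation.

```python
import re


def columns_to_pattern(aligned, start, end):
    """Build a regex from alignment columns start..end-1 (0-based).

    A column with a single residue becomes that letter; a column with
    several residues becomes a character class such as [DE].
    """
    parts = []
    for col in range(start, end):
        residues = sorted({seq[col] for seq in aligned})
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)


aligned = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
pattern = columns_to_pattern(aligned, 5, 9)
print(pattern)                                   # ATH[DE]
print([re.search(pattern, s).start() for s in aligned])
```

Searching a larger database with such a pattern and tightening it when false positives appear is the refine-and-iterate loop listed on the slide.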

SLIDE 34

The sequence analysis perspective

  • Zinc Finger motif
– C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
– 2 conserved C, and 2 conserved H
  • How can we search a database using these motifs?
– The motif is described using a regular expression. What is a regular expression?
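One way to search with such a motif is to translate it into a regular expression for a standard regex engine. The converter below is a minimal sketch that handles only the constructs appearing in the zinc finger motif above (single residues, x, bracketed sets, and repeat counts); the function name `prosite_to_regex` and the test sequence are made up for illustration, and other Prosite syntax is not covered.

```python
import re

# One motif element: x, a residue, or a [set], optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported element: " + element)
        atom, low, high = m.groups()
        regex = "." if atom == "x" else atom      # x matches any residue
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


zf = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zf)   # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# Hypothetical sequence built to contain the motif, not a real protein.
toy = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(re.search(zf, toy).start())   # 2
```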

SLIDE 35

Regular Expressions

  • Concise representation of a set of strings over alphabet ∑.
  • Described by a string over {∑, ·, *, +}
  • R is a r.e. if and only if
– R = {ε} (base case)
– R = {σ}, σ ∈ ∑
– R = R1 + R2 (union of strings)
– R = R1 · R2 (concatenation)
– R = R1* (0 or more repetitions)

SLIDE 36

  • End of L7

SLIDE 37

Regular Expression

  • Q: Let ∑={A,C,E}
– Is (A+C)*EEC* a regular expression?
– *(A+C)?
– AC*..E?
  • Q: When is a string s in a regular expression?
– R = (A+C)*EEC*
– Is CEEC in R?
– AEC?
– ACEE?
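The membership questions can be checked mechanically with Python's `re` module. Note the notational difference: the slides write union as +, while Python writes it as |. The helper name `in_R` is an assumption for this example.

```python
import re

# (A+C)*EEC* in the slide's notation; '+' (union) becomes '|' in Python.
R = re.compile(r"(A|C)*EEC*")


def in_R(s):
    """True if the whole string s belongs to the set described by R."""
    return R.fullmatch(s) is not None


for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_R(s))
```

CEEC and ACEE are in R; AEC is not, because R requires two consecutive E symbols.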

SLIDE 38

Regular Expression & Automata

  • Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
– The automaton has a start and end node
– Each edge is labeled with a symbol from ∑, or ε
  • Suppose R is described by automaton A
  • s ∈ R if and only if there is a path from start to end in A, labeled with s.

SLIDE 39

Examples: Regular Expression & Automata

  • (A+C)*EEC*

[Figure: automaton for (A+C)*EEC* with start and end nodes and edges labeled A, C, E, E, C]
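The path condition can be tested by tracking the set of nodes reachable after each symbol. The edge table below is an assumed reading of the slide's figure (self-loops A and C on the start node, two E edges through a middle node, a self-loop C on the end node); the state numbers and the name `accepts` are illustrative.

```python
# Assumed automaton for (A+C)*EEC*: state 0 = start, state 2 = end.
EDGES = {
    (0, "A"): {0},
    (0, "C"): {0},
    (0, "E"): {1},
    (1, "E"): {2},
    (2, "C"): {2},
}


def accepts(s, start=0, end=2):
    """True if some path from start to end is labeled with s."""
    states = {start}
    for x in s:
        next_states = set()
        for q in states:
            next_states |= EDGES.get((q, x), set())
        states = next_states
    return end in states


for s in ["CEEC", "AEC", "ACEE"]:
    print(s, accepts(s))
```

This automaton has no ε edges, so a plain set of current states is enough; ε edges would additionally require an ε-closure step.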

SLIDE 40

Constructing automata from R.E

  • R = {ε}
  • R = {σ}, σ ∈ ∑
  • R = R1 + R2
  • R = R1 · R2
  • R = R1*
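The five cases map onto five small graph constructions joined by ε edges, often called Thompson's construction. The slides' own diagrams are not available here, so the sketch below uses the standard textbook version; the class name `NFA`, its method names, and the representation of ε as `None` are assumptions. Each fragment is a (start, end) pair and should be used only once when composing.

```python
class NFA:
    def __init__(self):
        self.edges = []                      # edges[q] = [(label, target)]; None is ε

    def _state(self):
        self.edges.append([])
        return len(self.edges) - 1

    def _edge(self, q, label, r):
        self.edges[q].append((label, r))

    def eps(self):                           # R = {ε}
        s, e = self._state(), self._state()
        self._edge(s, None, e)
        return s, e

    def sym(self, a):                        # R = {σ}
        s, e = self._state(), self._state()
        self._edge(s, a, e)
        return s, e

    def union(self, r1, r2):                 # R = R1 + R2
        s, e = self._state(), self._state()
        for start_i, end_i in (r1, r2):
            self._edge(s, None, start_i)
            self._edge(end_i, None, e)
        return s, e

    def concat(self, r1, r2):                # R = R1 · R2
        self._edge(r1[1], None, r2[0])
        return r1[0], r2[1]

    def star(self, r1):                      # R = R1*
        s, e = self._state(), self._state()
        self._edge(s, None, r1[0])
        self._edge(r1[1], None, e)
        self._edge(s, None, e)               # zero repetitions
        self._edge(r1[1], None, r1[0])       # one more repetition
        return s, e

    def _closure(self, states):
        seen, stack = set(states), list(states)
        while stack:
            q = stack.pop()
            for label, r in self.edges[q]:
                if label is None and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    def accepts(self, fragment, s):
        start, end = fragment
        states = self._closure({start})
        for x in s:
            moved = {r for q in states for label, r in self.edges[q] if label == x}
            states = self._closure(moved)
        return end in states


n = NFA()
a_or_c_star = n.star(n.union(n.sym("A"), n.sym("C")))
R = n.concat(n.concat(n.concat(a_or_c_star, n.sym("E")), n.sym("E")), n.star(n.sym("C")))
print([n.accepts(R, s) for s in ["CEEC", "AEC", "ACEE"]])
```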

SLIDE 41

End of L6

SLIDE 42

Protein structure basics

SLIDE 43

Side chains determine amino-acid type

  • The residues may have different properties.
  • Aspartic acid (D), and Glutamic Acid (E) are acidic residues

SLIDE 44

Bond angles form structural constraints

SLIDE 45

Various constraints determine 3d structure

  • Constraints
– Structural constraints due to physicochemical properties
– Constraints due to bond angles
– H-bond formation
  • Surprisingly, a few conformations are seen over and over again.

SLIDE 46

Alpha-helix

  • 3.6 residues per turn
  • H-bonds between residue i and residue i+4 stabilize the structure.
  • First discovered by Linus Pauling

SLIDE 47

Beta-sheet

  • Each strand by itself has 2 residues per turn, and is not stable.
  • Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel.
  • Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.

SLIDE 48

Domains

  • The basic structures (helix, strand, loop) combine to form complex 3D structures.
  • Certain combinations are popular. Many sequences, but only a few folds

SLIDE 49

3D structure

  • Predicting tertiary structure is an important problem in Bioinformatics.
  • Premise: Clues to structure can be found in the sequence.
  • While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals.
  • The PDB database is a compendium of structures

SLIDE 50

Searching structure databases

  • Threading, and other 3d Alignments can be used to align structures.
  • Database filtering is possible through geometric hashing.

SLIDE 51

Trivia Quiz

  • What research won the Nobel prize in Chemistry in 2004?
  • In 2002?

SLIDE 52

How are Proteins Sequenced? Mass Spec 101:

SLIDE 53

Nobel Citation 2002

SLIDE 54

Nobel Citation, 2002

SLIDE 55

Mass Spectrometry

SLIDE 56

Sample Preparation Enzymatic Digestion (Trypsin) + Fractionation

SLIDE 57

Single Stage MS

Mass Spectrometry LC-MS: 1 MS spectrum / second

SLIDE 58

Tandem MS

Secondary Fragmentation

Ionized parent peptide