

SLIDE 1

Pattern Discovery in Biosequences

SDM 2005 tutorial (Appendix)

Stefano Lonardi
University of California, Riverside

Index

  • Periodicity of Strings
  • Profiles for sequence classification
  • Sequence alignment
  • Expectation Maximization
  • Discovering profiles

– MEME
– Gibbs sampling

  • Megaprior heuristics for MEME
SLIDE 2

Periodicity of Strings

Periods of a string

  • Definition: a string y has period w if y is a non-empty prefix of w^k for some integer k ≥ 1

  • Definition: the period y of y is called the trivial period

  • Definition: the set of period lengths of y is P(y) = { |w| : w is a period of y, w ≠ y }

[Figure: y drawn as a prefix of the concatenation w1 w2 w3 … wk of k copies of w]

SLIDE 3

Example

y = abaababaabaababaababaabaababaababa   (|y| = 34)

P(y) = {13, 26, 31, 33}

Borders of a string

  • Definition: a border w of y is any non-empty string which is both a prefix and a suffix of y

  • Definition: the border y of y is called the trivial border

  • Fact: a string y of length m has a period of length d < m iff it has a non-trivial border of length m - d

SLIDE 4

Finding Borders/Periods

  • Borders can be found using the failure function of the string, as done, e.g., in the preprocessing step of the classical linear-time string search algorithm (Knuth, Morris, Pratt); see the sketch below

  • Borders can be computed in O(|y|) time, and so can periods
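A minimal sketch of this computation (function names are mine, not from the tutorial): the KMP failure function gives the longest border of every prefix, and following the border chain of y yields all of its period lengths.

def failure_function(y):
    # f[i] = length of the longest proper border of y[:i+1]
    f = [0] * len(y)
    k = 0
    for i in range(1, len(y)):
        while k > 0 and y[i] != y[k]:
            k = f[k - 1]
        if y[i] == y[k]:
            k += 1
        f[i] = k
    return f

def period_lengths(y):
    # P(y): every non-trivial period length |y| - b, where b runs over
    # the chain of borders starting from the longest one
    f = failure_function(y)
    periods = []
    b = f[-1]
    while b > 0:
        periods.append(len(y) - b)
        b = f[b - 1]
    return sorted(periods)

# the example string from the previous slide
y = "abaababaabaababaababaabaababaababa"
print(period_lengths(y))   # expected: [13, 26, 31, 33]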

Profiles for sequence classification

SLIDE 5

Profiles as classifiers

  • Profiles can be used directly to implement very simple classifiers

  • Suppose we have a sample S+ of known sites, and a sample S- of non-sites

  • Given a new sequence x, how do we classify x: in S+ or in S-?

Example: CRP binding sites

  • S+ = {TTGTGGC, ACGTGAT, CTGTGAC, TTTTGAT, ATGTGAG, ATGAGAC, AAGTGTC, TTGTGAG, TTGTGAG, ATTTGCA, CTGTAAC, CTGTGCG, CTGTAAC, ATGCAAA, TTGTGAC, GTGTTAA, GCCTGAC, ATTTGAA, TTGTGAT, TTGTGAT, TTGTGAT, ATTTATT, GTGTGAA}

Cyclic AMP receptor protein (CRP) binding sites in E. coli [Stormo & Hartzell, 89]

SLIDE 6

Training (CRP sites)

Position:   1      2      3      4      5      6      7
A:        0.350  0.043  0.000  0.043  0.130  0.830  0.260
C:        0.170  0.087  0.043  0.043  0.000  0.043  0.300
G:        0.350  0.870  0.170  0.910  0.043  0.087  0.260
T:        0.130  0.000  0.780  0.000  0.830  0.043  0.170

  • Assume the uniform Bernoulli model for the non-sites S-, that is pA = pC = pG = pT = 0.25 for all positions

  • Assume a Bernoulli model for each position of the sites (one column of the profile above per position)

Testing

  • Suppose you get x = GGTGTAC

  • Is x more likely to belong to S+ or to S-? In other words, is it more likely to be generated from the Bernoulli model for S+ or from the uniform Bernoulli model for S-?

  • Let's compute the probabilities:

P(x = GGTGTAC | S+) = .35 × .87 × .78 × .91 × .83 × .83 × .3 ≈ 0.045

P(x = GGTGTAC | S-) = (.25)^7 ≈ 0.000061

LR(x) = P(x | S+) / P(x | S-)
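A minimal sketch of this classification rule, assuming the profile from the Training table above and the uniform 0.25 background; classifying x as a site when LR(x) > 1 is the natural decision rule, which the slide leaves implicit.

# profile estimated from the CRP sites (columns = positions 1..7)
profile = {
    'A': [0.350, 0.043, 0.000, 0.043, 0.130, 0.830, 0.260],
    'C': [0.170, 0.087, 0.043, 0.043, 0.000, 0.043, 0.300],
    'G': [0.350, 0.870, 0.170, 0.910, 0.043, 0.087, 0.260],
    'T': [0.130, 0.000, 0.780, 0.000, 0.830, 0.043, 0.170],
}

def likelihood_ratio(x, profile, background=0.25):
    """LR(x) = P(x | S+) / P(x | S-): product of the profile entries
    for the observed symbols over the uniform background."""
    p_plus = 1.0
    for i, base in enumerate(x):
        p_plus *= profile[base][i]
    p_minus = background ** len(x)
    return p_plus / p_minus

x = "GGTGTAC"
print(likelihood_ratio(x, profile))   # LR >> 1: classify x as a site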

SLIDE 7

Sequence alignment

Global alignment

  • Clearly there are many other possible alignments

  • Which one is better?

  • We assign a score to each:

– match (e.g., +2)
– insertion/deletion (e.g., -1)
– substitution (e.g., -2)

  • Both previous alignments score 3:

4×2 + 3×(-1) + 1×(-2) = 3
4×2 + 1×(-1) + 2×(-2) = 3

SLIDE 8

Scoring function

  • Given two symbols a, b in Σ∪{-}, we define σ(a,b) as the score of aligning a and b, where a and b cannot both be "-"

  • In the previous example:

σ(a,b) = +2 if a = b
σ(a,b) = -2 if a ≠ b
σ(-,a) = σ(a,-) = -1

Global alignment

  • Definition: given two strings w and y, an alignment is a mapping of w, y into strings w', y' that may contain spaces, where |w'| = |y'| and the removal of the spaces from w' and y' returns w and y

  • Definition: the value of an alignment (w', y') is

Σ_{i=1}^{|w'|} σ(w'[i], y'[i])

SLIDE 9

Global alignment

  • Definition: an optimal global alignment of w and y is one that achieves the maximum score

  • How do we find it?

  • How about checking all possible alignments?

Checking all alignments

Let |w| = |y| = m.

for all i, 0 ≤ i ≤ m do
    for all subsequences A of w with |A| = i do
        for all subsequences B of y with |B| = i do
            form the alignment that matches A[j] with B[j] for all 1 ≤ j ≤ i,
            and matches all the others with spaces
SLIDE 10

Example

  • Given w = ATCTG and y = CATGA (m = 5)

  • Suppose i = 2

  • Suppose we choose A = CG, B = CT

  • We are considering the score of the following alignment (its length is 2m - i = 8):

ATCT-G--
--C-ATGA
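A minimal sketch of this exhaustive procedure (identifiers are mine): every pair of equal-length subsequences A of w and B of y defines one alignment, whose unpaired symbols are charged the gap score.

from itertools import combinations

MATCH, MISMATCH, GAP = 2, -2, -1

def column_score(a, b):
    if a == '-' or b == '-':
        return GAP
    return MATCH if a == b else MISMATCH

def best_alignment_bruteforce(w, y):
    """Try every pair of equal-length subsequences A (of w) and B (of y):
    A[j] shares a column with B[j]; every other symbol is aligned with a space."""
    best = None
    for i in range(min(len(w), len(y)) + 1):
        for pa in combinations(range(len(w)), i):      # positions of A in w
            for pb in combinations(range(len(y)), i):  # positions of B in y
                value = sum(column_score(w[a], y[b]) for a, b in zip(pa, pb))
                value += (len(w) + len(y) - 2 * i) * GAP   # unpaired symbols vs spaces
                best = value if best is None else max(best, value)
    return best

print(best_alignment_bruteforce("ATCTG", "CATGA"))   # w and y from the example above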

Time complexity

A string of length m has (m choose i) subsequences of length i. Thus, there are (m choose i)^2 pairs of subsequences, each of length i. The length of each alignment is 2m - i ≥ m. The total number of operations is therefore at least

Σ_{i=0}^{m} (2m - i)·(m choose i)^2  ≥  Σ_{i=0}^{m} (m choose i)^2  =  (2m choose m)  >  2^{2m} / (2m + 1)

SLIDE 11

Checking all alignments

  • The previous algorithm runs in O(2^{2m}) time

  • How bad is it?

  • Choose m = 1,000 and try waiting for your computer to run 2^{2,000} operations!

Needleman & Wunsch, 70

  • The first algorithm to efficiently solve the alignment of two sequences

  • Based on dynamic programming

  • Runs in O(m²) time

  • Uses O(m²) space
SLIDE 12

Alignment by dyn. programming

  • Let w and y be two strings, |w| = n, |y| = m

  • Define V(i,j) as the value of the optimal alignment of the strings w[1..i] and y[1..j]

  • The idea is to compute V(i,j) for all values 0 ≤ i ≤ n and 0 ≤ j ≤ m

  • In order to do that, we establish a recurrence relation between V(i,j) and V(i-1,j), V(i,j-1), V(i-1,j-1)

Alignment by dyn. programming

V(i,j) = max{ V(i-1,j-1) + σ(w[i],y[j]),
              V(i-1,j)   + σ(w[i],"-"),
              V(i,j-1)   + σ("-",y[j]) }

V(0,0) = 0
V(i,0) = V(i-1,0) + σ(w[i],"-")
V(0,j) = V(0,j-1) + σ("-",y[j])
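A minimal sketch of this recurrence as a small dynamic program (identifiers are mine; the default scores are those used in the example on the next slides).

def needleman_wunsch(w, y, match=1, mismatch=-1, gap=-2):
    """Global alignment value by dynamic programming:
    V[i][j] = value of the optimal alignment of w[:i] with y[:j]."""
    n, m = len(w), len(y)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if w[i - 1] == y[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s,   # match/substitution
                          V[i - 1][j] + gap,     # deletion
                          V[i][j - 1] + gap)     # insertion
    return V[n][m]

print(needleman_wunsch("AGC", "AAAC"))   # -1 with this scoring (see next slides)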

SLIDE 13

Example

DP matrix V for w = AGC (along the top) and y = AAAC (down the side):

        -    A    G    C
   -    0   -2   -4   -6
   A   -2    1   -1   -3
   A   -4   -1    0   -2
   A   -6   -3   -2   -1
   C   -8   -5   -4   -1

σ(a,a) = +1 [match]
σ(a,b) = -1, if a ≠ b [substitution]
σ(a,"-") = -2 [deletion]
σ("-",a) = -2 [insertion]

V(i,j) = max{ V(i-1,j-1) + σ(w[i],y[j]),
              V(i-1,j)   + σ(w[i],"-"),
              V(i,j-1)   + σ("-",y[j]) }
V(0,0) = 0
V(i,0) = V(i-1,0) + σ(w[i],"-")
V(0,j) = V(0,j-1) + σ("-",y[j])

Example

Tracing back from the bottom-right corner (value -1) gives an optimal alignment:

AG-C
AAAC

SLIDE 14

Example

A different traceback through the same DP matrix gives another optimal alignment:

A-GC
AAAC

Example

And a third one:

-AGC
AAAC

SLIDE 15

Variations

  • Local alignment [Smith, Waterman 81]

  • Multiple sequence alignment (local or global)

  • Theorem [Wang, Jiang 94]: the optimal sum-of-pairs multiple alignment problem is NP-complete

Expectation Maximization

SLIDE 16

Expectation maximization

The goal of EM is to find the model θ that maximizes the (log) likelihood

L(θ) = log P(x|θ) = log Σ_y P(x,y|θ)

Suppose our current estimate of the parameters is θ^t. We want to know what happens to L when we move from θ^t to θ:

L(θ) - L(θ^t) = log [ Σ_y P(x,y|θ) / Σ_y P(x,y|θ^t) ] = log [ Σ_y P(x|y,θ) P(y|θ) / Σ_y P(x|y,θ^t) P(y|θ^t) ]

Expectation maximization

After some (complex) algebraic manipulations one finally gets

L(θ) - L(θ^t) = Q(θ|θ^t) - Q(θ^t|θ^t) + Σ_y P(y|x,θ^t) log [ P(y|x,θ^t) / P(y|x,θ) ]

where Q(θ|θ^t) ≡ Σ_y P(y|x,θ^t) log P(x,y|θ)

SLIDE 17

Convergence

The last term is the relative entropy H( P(y|x,θ^t) || P(y|x,θ) ), which is always non-negative, and therefore

L(θ) - L(θ^t) ≥ Q(θ|θ^t) - Q(θ^t|θ^t)

with equality iff P(y|x,θ) = P(y|x,θ^t). Choosing θ^(t+1) = argmax_θ Q(θ|θ^t) makes the difference non-negative, and thus the likelihood of the new model θ^(t+1) at least as large as the likelihood of θ^t.

A step of EM

[Figure: the curve L(θ) and the lower bound L(θ^t) + Q(θ|θ^t) - Q(θ^t|θ^t), which touches L at θ^t and is maximized at θ^(t+1)]

SLIDE 18

Expectation maximization

EM iterates 1) and 2) until convergence:

1) E-step: compute the Q(θ|θ^t) function with respect to the current parameters θ^t

2) M-step: choose θ^(t+1) = argmax_θ Q(θ|θ^t)
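A toy illustration of the E/M loop (not from the tutorial): EM for a mixture of two biased coins, where the missing data y is which coin produced each block of flips.

import numpy as np

def em_two_coins(counts, n_flips, theta=(0.6, 0.5), n_iter=20):
    """Each entry of `counts` is the number of heads in a block of
    `n_flips` tosses of one of two coins; which coin was used is hidden.
    Returns the estimated head probabilities of the two coins."""
    theta = np.array(theta, dtype=float)
    counts = np.asarray(counts, dtype=float)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each coin for each block,
        # computed with the current parameters theta^t
        like = np.array([t ** counts * (1 - t) ** (n_flips - counts) for t in theta])
        resp = like / like.sum(axis=0)
        # M-step: re-estimate each coin's bias from the expected counts
        theta = (resp * counts).sum(axis=1) / (resp * n_flips).sum(axis=1)
    return theta

print(em_two_coins(counts=[5, 9, 8, 4, 7], n_flips=10))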

Expectation maximization

  • The likelihood increases at each step, so the procedure will always reach a maximum asymptotically

  • It has been proved that the number of iterations to convergence is linear in the input size

  • Each step, however, requires quadratic time in the size of the input

SLIDE 19

Expectation maximization

  • More importantly, EM can (easily) get stuck in local maxima

  • Standard techniques in combinatorial optimization can be used to alleviate this problem

EM for pattern discovery

  • The first attempt to use EM for pattern discovery was proposed by Lawrence and Reilly [Proteins, 1990]

  • Input: a multisequence {x1, x2, …, xk} and the pattern length m

  • Output: a matrix profile q_{i,b}, b ∈ Σ, 1 ≤ i ≤ m, and the positions s_j, 1 ≤ j ≤ k, of the profile

SLIDE 20

EM for pattern discovery

  • Assumption: there is exactly one occurrence of the profile in each sequence

  • The missing information in this case is the set of positions s_j of the motif in {x1, x2, …, xk} (in fact, if we knew the positions, the problem of finding the profile would be trivial)

Lawrence-Reilly EM

The objective is to maximize the following log likelihood

L(q) = Σ_{i=1}^{m} Σ_{b∈Σ} k f_i(b) log(q_{i,b}) + Σ_{b∈Σ} k(n-m) f_0(b) log(q_{0,b})

where q_{0,b} is the unknown distribution outside the site, q_{i,b} is the unknown distribution inside the site (the profile), f_0(b) is the observed count of b outside the site, and f_i(b) is the observed count of b in the site at position i

SLIDE 21

Lawrence-Reilly EM

The value of q that maximizes the log likelihood L is

q_{i,b} = f_i(b) / k
q_{0,b} = f_0(b) / (k(n-m))

which corresponds to the idea of computing the profile by counting the symbols column-by-column

Lawrence-Reilly EM

E-step: use the current parameters q^(t) to compute

P(observing x_i | profile starts at position s in x_i)

for all 1 ≤ i ≤ k, 1 ≤ s ≤ n - m + 1, and then

ρ_{i,s} = P(profile starts at position s in x_i)

using Bayes' rule, for all 1 ≤ i ≤ k, 1 ≤ s ≤ n - m + 1. Align the profile at each position (i,s) and, for each column 1 ≤ j ≤ m, accumulate in q̂_{j, x_i[s+j-1]} the contribution ρ_{i,s}. At the end, q̂ contains the expected count of each symbol in each position of the profile.

SLIDE 22

Lawrence-Reilly EM

M-step: use the expected count q̂ of each symbol in each position to compute the ML (re)estimate of the parameters

q^(t+1)_{i,b} = q̂_{i,b} / k,          b ∈ Σ, 1 ≤ i ≤ m
q^(t+1)_{0,b} = q̂_{0,b} / (k(n-m)),   b ∈ Σ

Termination: when ||q^(t+1) - q^(t)|| < ε or the maximum number of iterations is reached
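A minimal sketch of one E-step/M-step iteration under the one-occurrence-per-sequence model above; the function name, the (m × 4) array layout over ACGT, and the lack of pseudocounts are my simplifications, not the paper's.

import numpy as np

ALPHA = "ACGT"

def em_iteration(seqs, q_motif, q_bg, m):
    """One EM iteration: q_motif is an (m, 4) profile, q_bg a length-4
    background. Returns the re-estimated (q_motif, q_bg)."""
    idx = {c: i for i, c in enumerate(ALPHA)}
    counts_motif = np.zeros((m, 4))
    counts_bg = np.zeros(4)
    for x in seqs:
        x_idx = np.array([idx[c] for c in x])
        starts = len(x) - m + 1
        # E-step: P(x | profile starts at s) for every start position s
        lik = np.empty(starts)
        for s in range(starts):
            inside = np.prod(q_motif[np.arange(m), x_idx[s:s + m]])
            outside = np.prod(q_bg[np.delete(x_idx, np.arange(s, s + m))])
            lik[s] = inside * outside
        rho = lik / lik.sum()          # P(start = s | x) by Bayes (uniform prior)
        # accumulate expected symbol counts, weighting each start by rho[s]
        for s in range(starts):
            for j in range(m):
                counts_motif[j, x_idx[s + j]] += rho[s]
            for pos in np.delete(np.arange(len(x)), np.arange(s, s + m)):
                counts_bg[x_idx[pos]] += rho[s]
    # M-step: maximum-likelihood re-estimates
    q_motif = counts_motif / counts_motif.sum(axis=1, keepdims=True)
    q_bg = counts_bg / counts_bg.sum()
    return q_motif, q_bg

In practice pseudocounts are added and the iteration is repeated until ||q^(t+1) - q^(t)|| < ε or the maximum number of iterations is reached, as in the termination rule above.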

Lawrence-Reilly EM

  • Constraints on the structure of the profile can be easily incorporated (e.g., requiring it to be palindromic)

  • Variable-length gaps within the profile can be handled by adding new variables to the model (which, however, increases the complexity of the model)

SLIDE 23

Discovering Profiles

  • If one assumes that the unknown profile has been generated by a sequence of independent r.v.s, then the observed frequencies of the letters in the columns of the profile are the ML estimates of the distributions of the r.v.s

  • Unfortunately, we do not know the positions of the profile in the multisequence

SLIDE 24

Gibbs sampler

Gibbs sampling

  • Proposed by Lawrence et al. [Science, 1993]

  • Web servers at http://bayesweb.wadsworth.org/gibbs/gibbs.html and http://argon.cshl.org/ioschikz/gibbsDNA/

  • Input: a multisequence {x1, x2, …, xk} and the pattern length m

  • Output: a matrix profile q_{i,b}, b ∈ Σ, 1 ≤ i ≤ m, and the positions s_j, 1 ≤ j ≤ k, of the profile in the k sequences

SLIDE 25

Gibbs sampling

  • The algorithm maintains the background distribution pA, …, pT of the symbols not described by the profile

  • P(y) is the probability of y based on the background distribution p_b, b ∈ Σ

  • Q(y) is the probability of y based on the profile q_{i,b}, 1 ≤ i ≤ m, b ∈ Σ

Gibbs sampling

  • Idea: the profile is obtained by locating the positions which maximize Q(y)/P(y); once the positions are obtained, a new, more accurate version of the profile can be computed

  • Initialize the positions s_j randomly
SLIDE 26

Gibbs sampling

The Gibbs sampler iterates 1) and 2) until convergence:

1) Predictive update step: randomly choose one of the k sequences, say r. The matrix profile q_{i,b} and the background frequencies p_b are recomputed from the current positions s_j in all sequences excluding r

2) Sampling step: assign a weight z(y) = Q(y)/P(y) to each substring y of length m in sequence r. Select a substring y randomly with probability z(y)/Σ_y z(y), and then update s_r accordingly

Gibbs sampling

  • The more accurate the pattern description in step 1), the more accurate the determination of its position in step 2), and vice versa

  • Once some correct positions have been selected by chance, q_{i,b} begins to reflect, albeit imperfectly, the unknown pattern

  • This process tends to recruit further correct positions, which in turn improve the discriminating power of the evolving pattern

SLIDE 27

Gibbs sampling

  • How do we update the matrix profile q_{i,b} and the background frequencies p_b?

  • We set q_{i,b} = (f_i(b) + d_b) / (k - 1 + Σ_c d_c), where f_i(b) is the number of times we observe symbol b at position i of the profile (currently placed at the positions s_j), over all sequences except r (the d_b are pseudo-counts)

  • We set the background probabilities p_b = f(b)/Σ_c f(c) for all symbols at positions not covered by the profile
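A minimal sketch of this predictive-update / sampling loop for DNA; the pseudocount value, the fixed number of rounds, and all identifiers are my choices rather than the original implementation's.

import random

ALPHA = "ACGT"

def gibbs_sampler(seqs, m, n_iter=1000, pseudo=0.5):
    """Site-sampler sketch: one occurrence of length m per sequence.
    Maintains start positions s[j]; each round leaves one sequence out,
    rebuilds the profile, and re-samples that sequence's position."""
    k = len(seqs)
    s = [random.randrange(len(x) - m + 1) for x in seqs]   # random initial positions
    for _ in range(n_iter):
        r = random.randrange(k)                            # 1) predictive update: leave r out
        q = [{c: pseudo for c in ALPHA} for _ in range(m)]
        p = {c: pseudo for c in ALPHA}
        for j, x in enumerate(seqs):
            if j == r:
                continue
            for i in range(m):
                q[i][x[s[j] + i]] += 1
            for pos, c in enumerate(x):
                if not (s[j] <= pos < s[j] + m):
                    p[c] += 1
        for i in range(m):
            tot = sum(q[i].values()); q[i] = {c: v / tot for c, v in q[i].items()}
        tot = sum(p.values()); p = {c: v / tot for c, v in p.items()}
        # 2) sampling step: weight every start in sequence r by z(y) = Q(y)/P(y)
        x = seqs[r]
        weights = []
        for start in range(len(x) - m + 1):
            Q, P = 1.0, 1.0
            for i, c in enumerate(x[start:start + m]):
                Q *= q[i][c]
                P *= p[c]
            weights.append(Q / P)
        s[r] = random.choices(range(len(weights)), weights=weights)[0]
    return s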

Phase shift problem

  • Suppose that the "strongest" pattern begins, for example, at positions 7, 19, 8, 23, …

  • If Gibbs happens to choose s1 = 9, s2 = 21, it will most likely choose s3 = 10 and s4 = 25

  • The algorithm can get stuck in local maxima which are shifted forms of the optimal pattern

SLIDE 28

Phase shift problem

  • The problem can be alleviated by adding a step in which the current set of positions is compared with the sets of positions shifted left and right, up to a certain number of symbols

  • Probability ratios are calculated for all shifts, and a random selection is made with respect to the appropriate weights

Gibbs sampling

  • It can be generalized to:

– find also the length m of the pattern
– find a set of matrix profiles, instead of one
SLIDE 29

Gibbs sampling

  • Since the Gibbs sampler is a heuristic rather than a rigorous optimization procedure, one cannot guarantee the optimality of the result

  • It is good practice to run the algorithm several times from different random initial positions

Gibbs sampling vs. EM

  • Although EM and Gibbs are built on a common statistical foundation, the authors claim that Gibbs outperforms EM both in terms of time complexity and performance

  • "EM is deterministic and tends to get trapped by local optima which are avoided by Gibbs … HMMs permit arbitrary gaps … have greater flexibility, but suffer the same penalties …"

SLIDE 30

Expectation Maximization and MEME

Expectation maximization

  • EM was designed by Dempster, Laird, and Rubin [1977]

  • EM is a family of algorithms for maximum likelihood estimation of parameters with "missing data"

SLIDE 31

EM, when?

  • When we want to find the maximum likelihood estimate of the parameters of a model and

– the data is incomplete, or
– the optimization of the maximum likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of additional, missing parameter values

Expectation maximization

  • EM approaches the problem of missing information by iteratively solving a sequence of problems in which expected information is substituted for the missing information

SLIDE 32

Expectation maximization

  • All EM algorithms consist of two steps:

1) the expectation step (E-step)
2) the maximization step (M-step)

  • The expectation is taken with respect to the unknown underlying variables, using the current estimate of the parameters and conditioned upon the observation

Expectation maximization

  • The maximization step provides a new estimate of the parameters:

θ1 → θ2 → θ3 → … → θt → θt+1 → …

  • The two steps are iterated until convergence
SLIDE 33

General framework for EM

  • Suppose we want to find the parameters θ of a model (training)

  • We observe x (the training set)

  • The probability of x under θ is also determined by the missing data y

Incomplete data model

[Figure: complete data model (x,y) vs. incomplete data model x]

An occurrence of (x,y) implies an occurrence of x; however, only x can be observed. This observation reveals the subset {(x,y), for all y}

SLIDE 34

Expectation maximization

  • Example: for HMMs, x is the sequence we want to learn from, θ is the set of transition and emission probabilities, and y is the path through the model

  • Example: in the case of Random Projections, x are the subsequences corresponding to a cell with count higher than the threshold, θ are the parameters of a representation of the (m,d) pattern, and y are all the missing positions

MEME

  • Proposed by Bailey and Elkan [Machine Learning J., 1995]

  • "Multiple EM for Motif Elicitation" (MEME) is an improved version of the expectation maximization approach of Lawrence and Reilly [Proteins, 1990] (see appendix)

  • Designed to discover profiles (no gaps)

  • Server at http://meme.sdsc.edu/meme/
SLIDE 35

MEME

  • There are three main differences w.r.t. Lawrence et al.:

1) the initial profiles are not chosen randomly, but are substrings which actually occur in the sequences

2) the assumption that there is only one occurrence of the motif per sequence is dropped

3) once a profile has been found, it is reported, and the iterative process continues

Using substrings as starting points

  • Idea: substrings actually occurring in the sequences are better starting points than random choices

  • Each substring is converted into a profile, as in the sketch below

  • Assigning 1.0 to the occurring symbol and 0.0 to the others is a bad choice, because EM cannot move away from it

  • The authors arbitrarily assign probability 0.5 to the occurring symbol and 0.5/3 to each of the other three
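A small sketch of that conversion for DNA (the 0.5 and 0.5/3 values are the ones quoted above; the function name is hypothetical).

def substring_to_profile(y, alpha="ACGT", p_match=0.5):
    """Turn an occurring substring y into a starting profile: the observed
    symbol at each position gets probability 0.5, the remaining symbols
    share the rest (0.5/3 each for DNA), so EM can still move away from it."""
    other = (1.0 - p_match) / (len(alpha) - 1)
    return [{c: (p_match if c == y[i] else other) for c in alpha}
            for i in range(len(y))]

# e.g. the substring TTGTGA becomes a 6-column starting profile
profile0 = substring_to_profile("TTGTGA")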

SLIDE 36

Using substrings as starting points

  • It would be too expensive to run EM to convergence from each substring

  • It turns out that this is not necessary

  • EM converges very quickly from profiles obtained from substrings, and the best starting point can be found by running only one iteration

MEME algorithm

  • Repeat:

– For each substring y in {x1, x2, …, xk} do: run one EM iteration with the profile computed from y
– Choose the profile q with the highest likelihood
– Run EM until convergence, starting from q
– Report the profile q
– Erase the occurrences of q from the dataset

  • Until the maximum number of iterations is reached
SLIDE 37

Dealing with multiple occurrences

  • MEME allows dropping the "one-per-sequence" assumption

  • The basic idea is to require the user to supply an estimated number of occurrences of the unknown profile, and to use that estimate to normalize the estimation process of the EM algorithm

  • The authors claim that the exact value of the number of occurrences is not critical

Finding multiple profiles

  • MEME does not stop after finding the most likely profile

  • Once a profile is found and reported, it is "probabilistically erased" by changing some position-dependent weights

  • The process continues until a predetermined number of motifs has been found (see appendix for the mega-prior heuristic)
SLIDE 38

Megaprior heuristics for MEME

Convex combination problem

  • Bailey and Gribskov [ISMB, 1996] describe a problem common to all statistical methods (HMMs, Gibbs, MEME) which discover profiles in protein sequences

  • These algorithms are prone to producing profiles that are incorrect, because two or more distinct patterns can be incorrectly combined

SLIDE 39

Convex combination problem

  • MEME is likely to produce these profiles if the estimated number of occurrences is inaccurate or missing

  • MEME tends to select a profile that is a combination of two or more patterns, because the convex combination can maximize the objective function by explaining more of the data using fewer free parameters

Convex combination problem

  • The authors call such a profile a convex combination, because the parameters of the profile that erroneously combines distinct patterns are a weighted average of the parameters of the correct profiles, where the weights are positive and sum up to one – i.e., a convex combination

SLIDE 40

Example of convex combination

[Figure from Bailey and Gribskov [ISMB, 1996]]

Example

  • Suppose we generate a random sequence with a symmetric Bernoulli source and inject two substrings of size m, aa…aaa and bb…bbb

  • The following HMM would explain the sequence

[Figure from Bailey and Gribskov [ISMB, 1996]]

SLIDE 41

Example

  • One would expect that MEME finds either the model of the all-"a" component or the one modeling the all-"b" component

[Figure from Bailey and Gribskov [ISMB, 1996]]

Example

  • Unfortunately, the following convex combination sometimes has a higher likelihood

[Figure from Bailey and Gribskov [ISMB, 1996]]

SLIDE 42

Convex combination problem

  • Convex combinations are undesirable because they make unrelated sequence regions appear to be related

  • The problem gets worse and worse as the size of the alphabet, the length of the profile, or the size of the dataset increases

  • In fact, convex combinations are less of a problem with DNA sequences

Convex combination problem

  • Bailey and Gribskov propose a heuristic solution based on the use of prior distributions, called the megaprior heuristic

  • The megaprior heuristic is now part of MEME
SLIDE 43

Megaprior heuristic

  • The idea is to use our prior knowledge about the similarities among the amino acids

Megaprior heuristic

  • The heuristic is based on biological knowledge about what constitutes a "reasonable" column in a profile

  • The prior distribution favors amino acids of the same class appearing in the same column

  • Although it does not forbid two amino acids, say one hydrophobic and one hydrophilic, from being in the same column, it makes this less likely to happen