pattern discovery
play

Pattern Discovery EECS 458 CWRU Fall 2004 Roadmap Pattern types - PDF document

Pattern Discovery EECS 458 CWRU Fall 2004 Roadmap Pattern types Motivation Exhaustive searches TEIRESIAS, WINNOWER (Graph) Gibbs sampler EM algorithm (MEME) Problems Pattern matching: find the exact


  1. Pattern Discovery EECS 458 CWRU Fall 2004 Roadmap • Pattern types • Motivation • Exhaustive searches • TEIRESIAS, • WINNOWER (Graph) • Gibbs sampler • EM algorithm (MEME) Problems • Pattern matching: find the exact occurrences of a given pattern in a given structure ( string matching) • Pattern recognition: recognizing approximate occurrences of a given pattern in a given structure (image recognition) • Pattern discovery: identifying significant patterns in a given structure, when the patterns are unknown (promoter discovery) 1

  2. Data • Positive examples: a sample set S+ of sequences with unknown patterns • Negative examples: a sample set S- of sequences w/o patterns • Noise data Pattern discovery: the problem • Given a set of sequences S+ and a model of the source for S- • Find a set of patterns in S+ which have a support that is statistically significant w.r.t. the probabilistic model • If we are also given negative examples for S-, we must ensure that patterns do not appear in S- Motivation • Transcription factor binding sites: . . . Given a collection of genes with common expression, Find the TF-binding motif in common 2

  3. Characteristics of Regulatory Motifs • Tiny • Highly Variable • ~Constant Size – Because a constant-size transcription factor binds • Often repeated • Low-complexity Pattern discovery • Methods – Combinatorial: graph, suffix tree, – Learning: clustering, classification – Statistical: EM, Gibbs sampling • Type of patterns – Deterministic, rigid, flexible, profiles, … • Measure of significance Training Adjust parameters Pos example Learning system Output Model Machine Neg example Predictions New data 3

  4. Output • True positive: a pattern belonging to the positive set which has been correctly predicted • True negative: a pattern belonging to the negative set which has been correctly predicted • False positive: a pattern belonging to the negative set which has been predicted as positive • False negative: a pattern belonging to the positive set which has been predicted as negative Measuring pattern discovery • Complete: if no true pattern is missed – But may report too many patterns • Sound: if no false pattern is reported – But may miss true positives • Usually there is a tradeoff between soundness and completeness: if you increase one, you will decrease the other. Measuring the performance • Information retrieval measures: – Precision (sensitivity) = true pos/(true pos + false pos): the proportion of discovered patterns out of the total reported positive – Recall = true pos / (true pos + true neg): the proportion of discovered patterns out of the total true patterns 4

  5. Measuring the performance • Specificity = true neg / (true neg + false pos): proportion of true neg of all negative samples • Correlation coefficient = (tp tn –fp fn) / sqrt( (tp+fp)(tp+fn)(tn+fn)(tn+fp)): is 1 when there are no false pos and no fase net, is 0 if the systems chooses randomly and -1 when there are only false pos and false neg Types of patterns • Deterministic patterns • Rigid patterns – Hamming distance • Flexible patterns (F-x(5)-G-x(2-4)-G-*-H – Edit distance • Matrix profiles (Position weighted matrix PWM or Position-specific score matrix PSSM) Deterministic patterns • Def: Deterministic patterns are substring over alphabet S – E.g. “TATAAA” (TATA-box) • Discovery algorithms are faster on these types of patterns • Usually not flexible enough for the needs of molecular biology 5

  6. Rigid patterns • Def: Rigid patterns are patterns which allow substitutions/”don’t care” symbols – E.g. the patterns under IUPAC alphabet {A,C,G,T,U,M,R,W,S,Y,K,V,H,D,B,X,N} where for example: R=[A|G], Y=[C|T], etc. • Note that the size of the pattern is not allowed to change Flexible patterns • Def: Flexible patterns are patterns which allow substitutions/”don’t care” symbols and variable length gap – E.g. F-x(5)-G-x(2-4)-G-*-H. • Note that the size of the pattern is variable • Very expressive • Space of all pattern is huge Position weighted matrix 6

  7. Hamming distance • Def: given two string y and w with the same length, the Hamming distance h(y,w) is given by the number of mismatches between y and w • Example: – y=GATTACA – w=TATAATA – h(w,y)=h(y,w)=3 Hamming neighborhood • Def: given a string y, all strings at Hamming distance at most d from y are in its d-neighborhood • Fact: the size of N(m,d) of the d- neighborhood of a string y with length m is ⎛ ⎞ d m = ∑ ⎜ ⎟ Σ − ∈ Σ j d d ( , ) (| | 1 ) ( | | ) N m d ⎜ ⎟ O m ⎝ ⎠ j = 0 j Hamming neighborhood • Example: y=ATA, the 1-neighborhood is { CTA, GTA, TTA, AAA, ACA, AGA, ATC, ATG, ATT, ATA} • This set can be written as a rigid pattern: {NTA|ANA|ATN} 7

  8. Rigid pattern discovery • Note that we are going to observe occurrences of the neighbors of y, but we may never observe an occurrences of the pattern y y is the unknown pattern 2d d is the number of allowed w3 w1 mismatches d y W1, w2, w3 belong to the d- neighborhood of y w2 Hamming neighborhood • Fact: given two string w 1 and w 2 in the d- neighborhood of the pattern y, then h(w 1 ,w 2 ) ≤ 2d • The problem of finding y given w 1 , w 2 ,… is sometimes called Steiner sequence problem • Unfortunately, even we know all the w i in the neighborhood, there is no guarantee to find the unknown pattern y. Discrete formulations • Closest string problem: Given a multisequence set X= {x 1 , …, x k } each of length n, find a string y of the same length and the minimum d such that h(y,x i ) ≤ d. • Consensus pattern problem: Given a multisequence X= {x 1 , …, x k } each of length n and an integer m, find a string y of length m and substring t i of length m for each x i minimizing Sum i h(y,t i ), denote as d(y,X). 8

  9. Enumerative approach: idea • Define the search space • List exhaustively all the patterns in the search space • Compute the statistical significance of all of them • Report the pattern with the highest statistical significance Enumerative approach 1. Pattern-driven algorithm: (4 m possibilities) For t = AA…A to TT…T Find d( t, S ) Report y = argmin( d(t, S) ) Running time: O(m|S|4 m ) Advantage: Finds provably “best” motif y Time Disadvantage: Complexity results • Thm: the consensus pattern problem is NP-hard • Thm: the closest string problem is NP- hard. • And many of other formulations in pattern discovery turn out to be NP-hard also (Li et al. STOC 99) 9

  10. NP-hard, what to do? • Change the problem – E.g. relax the class of patterns • Accept the fact that methods may fail to find the optimal patterns – Heuristics – Randomized algorithms – Approximation schemes Exhaustive Searches 2. Sample-driven algorithm: For any m-long word occurring in some x i Find d( W, S ) Report W* = argmin( d( W, S ) ) or, Report a local improvement of W * Running time: O( m|S| 2 ) Time Advantage: If the true motif is weak and does not occur in data Disadvantage: then a random motif may score better than any instance of true motif Two naïve approaches • n=1,000,000, m=20, | ∑ |=4 • Approach 1: patterns to be tested O(nm| ∑ | m ), in this case: 1,000,000*20*4^20 • Approach 2: pattern to be tested O(n 2 ), in this case: 1,000,000^2 10

  11. Discovering rigid patterns • Some “recent” algorithms • Teiresias • Winnower • Projection • Weeder Teiresias Teiresias • Teiresias by Rigoustos and Floratos from IBM [Bioinformatics, 1998] • Server at http://cbcsrv.watson.ibm.com/Tspd.html • The worst case running time is exponential, but works reasonably fast on average 11

  12. Teiresias patterns • Teiresias searches for rigid patterns on the alphabet ∑ U{.}, where “.” is the don’t care symbol • However, there are some constrains on the density of “.” that can appear in a pattern <L,W> patterns • Def: given integers L and W, L ≤ W, y is a <L,W> pattern if – y is a string over ∑ U{.} – y starts and ends with a symbol from ∑ – any substring of y containing exactly L symbols from ∑ has to be shorter or equal to W (i.e. any substring containing exactly L symbols, the number of don’t care symbol is less than or equal to W-L) Example of <3,5> patterns • AT..GG..T is a <3,5> pattern • AT..GG..T. is not a <3,5> pattern • AT.G.G..T is not a <3,5> pattern 12

  13. Teiresias • Def: a pattern w is more specific than a pattern y, if w can be obtained from y by changing one or more “.” to symbols from ∑ , or by appending any sequence of ∑ U{.} to the ends of y • Example: given y = AT..GG..T, the following patterns are more specific than y: • ATC.GG..T, CAT..GG..T, A.AT..GGT.T.A Teiresias • Def: a pattern y is maximal w.r.t. the sequences {x 1 , …, x k } if there exists no pattern w which is more specific than y and have the same number of occurrences. • Given {x 1 , …, x k } and parameter L,W and K, Teiresias reports all the maximal <L,W> patterns that have at least K colors. Some notation • Given {x 1 , …, x k } and a pattern y • Occurrences: the total number of occurrences of y in {x 1 , …, x k } is denoted as f(y) • Colors: assume each sequence has a different color, the total number of sequences that contain y (thus the total number of different colors) is denoted as c(y). • These quantities are called the support of y. 13

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend