

SLIDE 1

Pattern Discovery

EECS 458 CWRU Fall 2004

Roadmap

  • Pattern types
  • Motivation
  • Exhaustive searches
  • TEIRESIAS
  • WINNOWER (Graph)
  • Gibbs sampler
  • EM algorithm (MEME)

Problems

  • Pattern matching: find the exact occurrences of a given pattern in a given structure (string matching)

  • Pattern recognition: recognizing approximate occurrences of a given pattern in a given structure (image recognition)

  • Pattern discovery: identifying significant patterns in a given structure, when the patterns are unknown (promoter discovery)

SLIDE 2

Data

  • Positive examples: a sample set S+ of sequences with unknown patterns

  • Negative examples: a sample set S- of sequences w/o patterns

  • Noise data

Pattern discovery: the problem

  • Given a set of sequences S+ and a model of the source for S-
  • Find a set of patterns in S+ which have a support that is statistically significant w.r.t. the probabilistic model

  • If we are also given negative examples S-, we must ensure that the patterns do not appear in S-

Motivation

  • Transcription factor binding sites: given a collection of genes with common expression, find the TF-binding motif they have in common

SLIDE 3

Characteristics of Regulatory Motifs

  • Tiny
  • Highly Variable
  • ~Constant Size

– Because a constant-size transcription factor binds

  • Often repeated
  • Low-complexity

Pattern discovery

  • Methods

– Combinatorial: graph, suffix tree
– Learning: clustering, classification
– Statistical: EM, Gibbs sampling

  • Type of patterns

– Deterministic, rigid, flexible, profiles, …

  • Measure of significance

Training

[Figure: training workflow: positive and negative examples feed a learning system that adjusts the model's parameters; the trained machine outputs predictions on new data.]

SLIDE 4

Output

  • True positive: a pattern belonging to the positive set which has been correctly predicted
  • True negative: a pattern belonging to the negative set which has been correctly predicted
  • False positive: a pattern belonging to the negative set which has been predicted as positive
  • False negative: a pattern belonging to the positive set which has been predicted as negative

Measuring pattern discovery

  • Complete: if no true pattern is missed
– But may report too many patterns
  • Sound: if no false pattern is reported
– But may miss true positives
  • Usually there is a tradeoff between soundness and completeness: if you increase one, you will decrease the other.

Measuring the performance

  • Information retrieval measures:
– Precision = true pos / (true pos + false pos): the proportion of true patterns among all reported positives
– Recall (sensitivity) = true pos / (true pos + false neg): the proportion of the true patterns that are discovered

SLIDE 5

Measuring the performance

  • Specificity = true neg / (true neg + false pos): the proportion of all negative samples that are correctly identified
  • Correlation coefficient = (tp·tn − fp·fn) / sqrt((tp+fp)(tp+fn)(tn+fn)(tn+fp)): it is 1 when there are no false positives and no false negatives, 0 if the system chooses randomly, and −1 when there are only false positives and false negatives

Types of patterns

  • Deterministic patterns
  • Rigid patterns
– Hamming distance
  • Flexible patterns (e.g. F-x(5)-G-x(2-4)-G-*-H)
– Edit distance
  • Matrix profiles (position weight matrix, PWM, or position-specific scoring matrix, PSSM)

Deterministic patterns

  • Def: deterministic patterns are substrings over the alphabet Σ
– E.g. “TATAAA” (TATA-box)
  • Discovery algorithms are faster on these types of patterns
  • Usually not flexible enough for the needs of molecular biology

SLIDE 6

Rigid patterns

  • Def: rigid patterns are patterns which allow substitutions / “don't care” symbols
– E.g. patterns over the IUPAC alphabet {A,C,G,T,U,M,R,W,S,Y,K,V,H,D,B,X,N}, where for example R=[A|G], Y=[C|T], etc.
  • Note that the size of the pattern is not allowed to change

Flexible patterns

  • Def: flexible patterns are patterns which allow substitutions / “don't care” symbols and variable-length gaps
– E.g. F-x(5)-G-x(2-4)-G-*-H
  • Note that the size of the pattern is variable
  • Very expressive
  • The space of all patterns is huge

Position weight matrix

SLIDE 7

Hamming distance

  • Def: given two strings y and w of the same length, the Hamming distance h(y,w) is the number of mismatches between y and w
  • Example:
– y=GATTACA
– w=TATAATA
– h(w,y)=h(y,w)=3
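
The definition translates directly into code; a minimal Python sketch (the function name is ours):

def hamming(y, w):
    """Number of mismatching positions between two equal-length strings."""
    assert len(y) == len(w)
    return sum(a != b for a, b in zip(y, w))

print(hamming("GATTACA", "TATAATA"))  # prints 3, matching the example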

Hamming neighborhood

  • Def: given a string y, all strings at Hamming distance at most d from y form its d-neighborhood

  • Fact: the size N(m,d) of the d-neighborhood of a string y of length m is

N(m,d) = Σ_{j=0}^{d} C(m,j) (|Σ|−1)^j = O(m^d |Σ|^d)

Hamming neighborhood

  • Example: y=ATA; the 1-neighborhood is {CTA, GTA, TTA, AAA, ACA, AGA, ATC, ATG, ATT, ATA}
  • This set can be written as a rigid pattern: {NTA|ANA|ATN}
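
A brute-force enumeration of the d-neighborhood, as a sketch (DNA alphabet assumed; the function name is ours):

from itertools import combinations, product

def neighborhood(y, d, alphabet="ACGT"):
    """All strings within Hamming distance at most d of y."""
    result = {y}
    for k in range(1, d + 1):
        for positions in combinations(range(len(y)), k):
            for letters in product(alphabet, repeat=k):
                s = list(y)
                for pos, c in zip(positions, letters):
                    s[pos] = c
                result.add("".join(s))
    return result

print(sorted(neighborhood("ATA", 1)))  # the 10 strings listed above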

SLIDE 8

Rigid pattern discovery

  • Note that we are going to observe occurrences of the neighbors of y, but we may never observe an occurrence of the pattern y itself

[Figure: y is the unknown pattern and d the number of allowed mismatches; w1, w2, w3 belong to the d-neighborhood of y, so any two of them are within distance 2d of each other.]

Hamming neighborhood

  • Fact: given two strings w1 and w2 in the d-neighborhood of the pattern y, h(w1,w2) ≤ 2d
  • The problem of finding y given w1, w2, … is sometimes called the Steiner sequence problem
  • Unfortunately, even if we know all the wi in the neighborhood, there is no guarantee of finding the unknown pattern y.

Discrete formulations

  • Closest string problem: given a multisequence set X = {x1, …, xk}, each sequence of length n, find a string y of the same length and the minimum d such that h(y,xi) ≤ d for all i.
  • Consensus pattern problem: given a multisequence X = {x1, …, xk}, each sequence of length n, and an integer m, find a string y of length m and a substring ti of length m in each xi minimizing Σi h(y,ti), denoted d(y,X).

SLIDE 9

Enumerative approach: idea

  • Define the search space
  • List exhaustively all the patterns in the search space
  • Compute the statistical significance of all of them
  • Report the pattern with the highest statistical significance

Enumerative approach

  • 1. Pattern-driven algorithm:

For t = AA…A to TT…T (4^m possibilities)
    compute d(t, S)
Report y = argmin d(t, S)

Running time: O(m |S| 4^m)
Advantage: finds provably “best” motif y
Disadvantage: time
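
A direct (and deliberately slow) sketch of the pattern-driven search; consensus_distance implements d(t, S) from the consensus pattern problem above, and the names are ours:

from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consensus_distance(t, seqs):
    """d(t, S): sum over sequences of the best match of t to any substring."""
    m = len(t)
    return sum(min(hamming(t, x[i:i + m]) for i in range(len(x) - m + 1))
               for x in seqs)

def pattern_driven(seqs, m):
    """Enumerate all 4^m candidates: provably optimal, exponential in m."""
    return min(("".join(t) for t in product("ACGT", repeat=m)),
               key=lambda t: consensus_distance(t, seqs))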

Complexity results

  • Thm: the consensus pattern problem is NP-hard
  • Thm: the closest string problem is NP-hard
  • Many other formulations of pattern discovery turn out to be NP-hard as well (Li et al., STOC 99)

SLIDE 10

NP-hard, what to do?

  • Change the problem
– E.g. relax the class of patterns
  • Accept the fact that methods may fail to find the optimal patterns
– Heuristics
– Randomized algorithms
– Approximation schemes

Exhaustive Searches

  • 2. Sample-driven algorithm:

For every m-long word W occurring in some xi
    compute d(W, S)
Report W* = argmin d(W, S), or report a local improvement of W*

Running time: O(m |S|^2)
Advantage: time
Disadvantage: if the true motif is weak and does not occur in the data, then a random motif may score better than any instance of the true motif
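
The sample-driven variant only scores m-mers that actually occur in the data; a self-contained sketch (helpers inlined, names ours):

def sample_driven(seqs, m):
    """Score only m-mers occurring in the data: O(m |S|^2) overall."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    def d(t):
        return sum(min(hamming(t, x[i:i + m]) for i in range(len(x) - m + 1))
                   for x in seqs)
    candidates = {x[i:i + m] for x in seqs for i in range(len(x) - m + 1)}
    return min(candidates, key=d)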

Two naïve approaches

  • n=1,000,000, m=20, |Σ|=4
  • Approach 1 (pattern-driven): cost O(n·m·|Σ|^m); in this case 1,000,000 × 20 × 4^20
  • Approach 2 (sample-driven): cost O(n^2); in this case 1,000,000^2

SLIDE 11

Discovering rigid patterns

  • Some “recent” algorithms
  • Teiresias
  • Winnower
  • Projection
  • Weeder

Teiresias

  • Teiresias, by Rigoutsos and Floratos from IBM [Bioinformatics, 1998]
  • Server at http://cbcsrv.watson.ibm.com/Tspd.html
  • The worst-case running time is exponential, but it works reasonably fast on average

SLIDE 12

Teiresias patterns

  • Teiresias searches for rigid patterns over the alphabet Σ∪{.}, where “.” is the don't-care symbol
  • However, there are some constraints on the density of “.” symbols that can appear in a pattern

<L,W> patterns

  • Def: given integers L and W, L ≤ W, y is a <L,W> pattern if
– y is a string over Σ∪{.}
– y starts and ends with a symbol from Σ
– any substring of y containing exactly L symbols from Σ has length at most W (i.e., in any substring containing exactly L symbols, the number of don't-care symbols is at most W−L)

Example of <3,5> patterns

  • AT..GG..T is a <3,5> pattern
  • AT..GG..T. is not a <3,5> pattern
  • AT.G.G..T is not a <3,5> pattern
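
The density constraint is easy to check mechanically; a sketch (our own helper, following the definition above):

def is_lw_pattern(y, L, W, alphabet="ACGT"):
    """True iff y is a <L,W> pattern: starts/ends with a residue, and every
    stretch containing exactly L residues spans at most W positions."""
    if not y or y[0] == "." or y[-1] == ".":
        return False
    if any(c != "." and c not in alphabet for c in y):
        return False
    solid = [i for i, c in enumerate(y) if c != "."]
    if len(solid) < L:
        return False
    return all(solid[j + L - 1] - solid[j] + 1 <= W
               for j in range(len(solid) - L + 1))

for p in ["AT..GG..T", "AT..GG..T.", "AT.G.G..T"]:
    print(p, is_lw_pattern(p, 3, 5))  # True, False, False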

SLIDE 13

Teiresias

  • Def: a pattern w is more specific than a pattern y if w can be obtained from y by changing one or more “.” to symbols from Σ, or by appending any sequence over Σ∪{.} to either end of y
  • Example: given y = AT..GG..T, the following patterns are more specific than y: ATC.GG..T, CAT..GG..T, A.AT..GGT.T.A

Teiresias

  • Def: a pattern y is maximal w.r.t. the sequences {x1, …, xk} if there exists no pattern w which is more specific than y and has the same number of occurrences.
  • Given {x1, …, xk} and parameters L, W and K, Teiresias reports all the maximal <L,W> patterns that have at least K colors.

Some notation

  • Given {x1, …, xk} and a pattern y
  • Occurrences: the total number of occurrences of y in {x1, …, xk} is denoted f(y)
  • Colors: assume each sequence has a different color; the total number of sequences that contain y (thus the total number of different colors) is denoted c(y).
  • These quantities are called the support of y.

SLIDE 14

Example

  • x1=CCATTATTCG
  • x2=AAATTAAAAC
  • x3=CCGGAAAGGG
  • x4=TTAACCGGAA
  • y=AAA, f(y)=4, c(y)=2

Teiresias algorithm

  • Idea: if y is a <L,W> pattern with at least K colors, then its substrings are also <L,W> patterns with at least K colors
  • Therefore, Teiresias assembles the maximal patterns from smaller patterns
  • Def: a pattern y is elementary if it is a <L,W> pattern containing exactly L symbols from Σ

Teiresias algorithm

  • Teiresias works in two phases
– Scanning: find all the elementary patterns with at least K colors; these become the initial set of patterns
– Convolution: repeatedly extend the patterns by gluing them together
  • Example: y=AT..CG.T and w=G.T.A can be merged to obtain AT..CG.T.A
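
A sketch of the gluing step, under the assumption that two patterns are merged when the last L−1 residues of one (with the interleaving don't-cares) coincide with the first L−1 residues of the other:

def glue(y, w, L):
    """Merge y and w if the suffix of y holding its last L-1 residues equals
    the prefix of w holding its first L-1 residues; else return None."""
    solid_y = [i for i, c in enumerate(y) if c != "."]
    solid_w = [i for i, c in enumerate(w) if c != "."]
    if len(solid_y) < L - 1 or len(solid_w) < L - 1:
        return None
    suffix = y[solid_y[-(L - 1)]:]       # from the (L-1)-th last residue on
    prefix = w[:solid_w[L - 2] + 1]      # up to the (L-1)-th residue
    return y + w[len(prefix):] if suffix == prefix else None

print(glue("AT..CG.T", "G.T.A", 3))     # AT..CG.T.A, as in the example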

SLIDE 15

Convolution phase

  • For each elementary pattern y, try to extend it with all the other elementary patterns
  • Any pattern that cannot be extended without losing support is potentially maximal

Convolution phase

  • To speed up this phase, one wants to avoid the all-against-all comparison
  • The authors devise two partial orderings, <pf and <sf, on all patterns
  • Using these orderings to schedule the convolution phase, they guarantee that
– all patterns are generated
– a maximal pattern y is generated before any non-maximal pattern subsumed by y

Partial ordering <pf

  • Def: given two patterns y and w, align them from the left and scan rightwards; at the first position where one pattern has a symbol from Σ and the other has the wild card “.”, the pattern with the symbol is prefix-wise smaller (<pf). The suffix-wise ordering <sf is defined analogously, aligning from the right.
  • Example:
– y=ASD...F
– w=SE.EFG
– y <pf w
– w <sf y

SLIDE 16

Teiresias algorithm

  • Initialize the stack with the elementary patterns having at least K support
  • Order the stack according to <pf and <sf
  • Repeat
– Repeat
  • Try to extend the top pattern to the right with all the others, in the prefix-wise ordering
  • If a new pattern is formed which has enough support, it becomes the new top
– Until the top can no longer be extended to the right
– Do the same for the left extension, using the ordering <sf
– Check the top for maximality; if it is maximal, pop it and output it
  • Until the stack is empty

Conclusion on Teiresias

  • It can be proved that Teiresias correctly reports all maximal <L,W> patterns
  • Pros:
– Provably correct
– Fast on average input
  • Cons:
– Exponential time complexity
– Limited to <L,W> patterns

Winnower

SLIDE 17

Winnower

  • Invented by Pevzner and Sze [ISMB 2000]
  • Initially designed to solve the (15,4)-motif challenge
  • Planted (m,d)-motif problem: determine an unknown pattern y of length m in a set of k DNA sequences, each of length n, each containing exactly one occurrence of a string w such that h(y,w)=d

Winnower

  • The authors show that the most popular algorithms (Consensus, GibbsDNA, MEME) fail to solve (most of the time) the (15,4)-motif problem [n=600, k=20]
  • Why is the (15,4)-motif problem difficult?
  • Because two strings in the class may differ in as many as 8 positions out of 15, a rather large number

Winnower

  • Idea: search for groups of strings of length m such that any two in a group differ in at most 2d positions
  • Remember, however, that this may not be sufficient (e.g. AAAA, GGAA, TTAA are pairwise within distance 2, yet no single string is within distance 1 of all three)

SLIDE 18

Winnower

  • How to find groups of patterns such that any two elements w1 and w2 in a group satisfy h(w1,w2) ≤ 2d?
  • One could generate C(k,2) pairwise alignments to find all pairs of substrings of length m that have at most 2d mismatches (Consensus [Hertz & Stormo 1999])

Winnower

  • Winnower builds a graph G in which
– each vertex corresponds to a distinct string of length m
– two vertices are connected by an edge if the Hamming distance between the corresponding strings is at most 2d and the strings do not come from the same sequence (only one occurrence of the unknown pattern per sequence)
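
A sketch of the graph construction (the vertex and edge representations are our own choices):

from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_graph(seqs, m, d):
    """Vertices: every m-mer occurrence (sequence index, offset, string).
    Edges: Hamming distance <= 2d between m-mers of different sequences."""
    vertices = [(i, j, x[j:j + m]) for i, x in enumerate(seqs)
                for j in range(len(x) - m + 1)]
    edges = [(u, v) for u, v in combinations(vertices, 2)
             if u[0] != v[0] and hamming(u[2], v[2]) <= 2 * d]
    return vertices, edges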

Winnower

  • For the (15,4)-problem, they report that for each signal edge there are about 20,000 spurious edges
  • Finding the signal among the noise is a “daunting task”

SLIDE 19

Winnower

  • Winnower searches the graph G for cliques, i.e. subsets of vertices that are pairwise connected
  • But the problem of finding large cliques in graphs is NP-hard.

Multipartite graphs

  • Def: a graph G is n-partite if its vertices can be partitioned into n sets such that there is no edge between any two vertices within a set
  • Fact: Winnower's graph is k-partite

Example

  • Given the sequences {abde, afcg, hbci, jbck}, we look for a (3,1)-motif

[Figure: the 4-partite graph on the 3-mers abd, bde, afc, fcg, hbc, bci, jbc, bck, one partition per sequence.]

SLIDE 20

Idea

  • Each vertex of the clique has to be in a different partition
  • And we look for cliques that have exactly one vertex in each partition

Extendable cliques

  • Def: a vertex u is a neighbor of a clique {v1, …, vs} of G, s<k, if {v1, …, vs, u} is also a clique
  • Def: a clique is called extendable if it has at least one neighbor.

Extendable cliques

  • Def: a clique with k vertices, each in a different partition, is called maximal
  • Consider a maximal clique and take a subset of t of its vertices; this subset is an extendable clique
  • Idea: remove edges that do not belong to extendable cliques

SLIDE 21

Extendable cliques

  • Fact: for any clique of size k, there are C(k,t) extendable cliques with t vertices
  • Fact: any edge belonging to a clique with k vertices is a member of at least C(k−2, t−2) extendable cliques of size t.
  • E.g. k=5, t=3: every edge of a 5-clique lies in C(3,1) = 3 extendable triangles

[Figure: a 5-clique on vertices 1-5.]

Idea

  • That means an edge that is not a member of at least C(k−2, t−2) extendable cliques of size t cannot be part of a maximal clique, and therefore it can be removed safely

t=1

  • For t=1, each vertex is a clique; it is extendable if it is connected to at least one vertex in each other partition
  • If a vertex is not extendable, delete all the edges connected to it.
  • Iterate
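
A sketch of the t=1 pruning pass, building on the build_graph sketch above (representations are ours):

def winnow_t1(vertices, edges, k):
    """Delete all edges of any vertex lacking a neighbor in some other
    partition, and iterate until no more deletions happen."""
    edges = set(frozenset(e) for e in edges)
    changed = True
    while changed:
        changed = False
        for u in vertices:
            incident = [e for e in edges if u in e]
            if not incident:
                continue
            reached = {next(iter(e - {u}))[0] for e in incident}
            if len(reached) < k - 1:            # u is not extendable
                edges -= set(incident)
                changed = True
    return edges

seqs = ["abde", "afcg", "hbci", "jbck"]
v, e = build_graph(seqs, 3, 1)          # from the earlier sketch
print(len(winnow_t1(v, e, 4)))          # edges surviving the t=1 pass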

SLIDE 22

Example

[Figure: three successive edge-removal iterations on the 4-partite example graph; edges incident to non-extendable vertices are deleted at each pass.]

SLIDE 23

Example

[Figure: the final edge-removal iterations; the surviving clique {abd, afc, hbc, jbc} spells out the planted (3,1)-motif.]

t=2

  • For t=2, each pair of vertices u, v such that there is an edge (u,v) is a clique
  • It is extendable if in each of the other k−2 partitions there is a vertex z such that (u,v,z) is a cycle of length 3
  • Each edge should belong to at least C(k−2, t−2) = C(k−2, 0) = 1 extendable clique of size 2

SLIDE 24

t>2

  • For t=3, Winnower removes edges that belong to fewer than k−2 extendable cliques of size 3
  • For t=4, Winnower removes edges that belong to fewer than (k−2)(k−3)/2 extendable cliques of size 4

Remarks on Winnower

  • Pros:
– More effective than MEME, Consensus, GibbsDNA on the (15,4) problem
  • Cons:
– Randomized
– Needs to know m and d in advance
– Assumes exactly one occurrence per sequence
– Time complexity can be very high (depends on t)

Projection

Buhler and Tompa, UW

SLIDE 25

Random Projection algorithm

  • Proposed by Buhler and Tompa [RECOMB 2001]
  • The algorithm was initially designed to solve the planted (m,d)-motif problem

Analysis on (m,d)-motif problem

  • Suppose A, C, G, T each have probability 1/4. Then the probability that a pattern of size m occurs at a given position is p0 = (1/4)^m.
  • If we allow up to one mismatch, the probability becomes p1 = p0 + m(3/4)(1/4)^(m-1).
  • If we allow at most two mismatches, it becomes p2 = p1 + C(m,2)(3/4)^2(1/4)^(m-2).
  • In general, if we allow up to d mismatches, p_d = Σ_{i=0}^{d} C(m,i)(3/4)^i(1/4)^(m-i).

Analysis on (m,d)-motif problem

  • If Z is the random variable for the number of occurrences in one sequence, then P(Z=0) = (1-p_d)^(n-m+1), and thus P(Z>0) = 1 - P(Z=0).
  • If we have k sequences, the probability that a particular pattern occurs at least once in each sequence is (P(Z>0))^k.
  • Therefore, the expected number of spurious patterns is E(n,m,k,d) = 4^m · (P(Z>0))^k = 4^m (1 - (1-p_d)^(n-m+1))^k.
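
These formulas can be evaluated directly; a small sketch (function name ours; Python 3.8+ for math.comb):

from math import comb

def expected_spurious(n, m, k, d):
    """E(n,m,k,d): expected number of (m,d)-motifs arising by chance in
    k i.i.d. uniform sequences of length n."""
    p_d = sum(comb(m, i) * (3 / 4) ** i * (1 / 4) ** (m - i)
              for i in range(d + 1))
    p_hit = 1 - (1 - p_d) ** (n - m + 1)        # P(Z > 0) for one sequence
    return 4 ** m * p_hit ** k

print(expected_spurious(600, 9, 20, 2))   # non-negligible: (9,2) is hopeless
print(expected_spurious(600, 15, 20, 4))  # tiny: the (15,4) motif stands out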

SLIDE 26

Stats of spurious (m,d)-motifs in simulated data (k=20,n=600)

Bottom-line: the (9,2)-, (11,3)-, (13,4)-, (15,5)- and (17,6)- motif problems are probably impossible to solve

Random Projections

  • Idea: select t random positions and, for each substring of length m in the text, hash its selected positions into a table
  • Hopefully, the cell corresponding to the planted motif will be the one with the highest count

Random Projections

  • Parameters: (m,d), n, k, s, possibly i
  • Set t < m−d and 4^t > k(n−m+1)
  • Build a table over all substrings of length m
  • Repeat i times
– Select t positions at random
– For every substring in the table, increase the count of the cell indexed by the t positions
  • Select all cells with count ≥ s
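
A sketch of one projection pass and the bucket-selection loop (names and structure are ours):

import random
from collections import defaultdict

def projection_pass(seqs, m, t):
    """Hash every m-mer occurrence by t randomly chosen positions."""
    positions = sorted(random.sample(range(m), t))
    buckets = defaultdict(list)
    for i, x in enumerate(seqs):
        for j in range(len(x) - m + 1):
            key = "".join(x[j + p] for p in positions)
            buckets[key].append((i, j))
    return buckets

def enriched_buckets(seqs, m, t, s, iterations):
    """Collect every bucket reaching the threshold s; these occurrence
    lists are the candidates handed to EM refinement."""
    hits = []
    for _ in range(iterations):
        for occ in projection_pass(seqs, m, t).values():
            if len(occ) >= s:
                hits.append(occ)
    return hits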

SLIDE 27

Random Projections

  • We want t < m−d because any occurrence may have d mismatched positions
  • The number of iterations i can be estimated from m, d and t

Random Projections

  • Since we are hashing k(n−m+1) substrings of size m into 4^t buckets, if 4^t > k(n−m+1) each bucket will contain on average less than one substring (set s=1)
  • The constraint is designed to filter out the noise
  • The bucket corresponding to the planted motif is expected to contain more motif instances than a random sequence would produce
  • If the constraint 4^t > k(n−m+1) cannot be enforced, the authors suggest setting t = m−d−1 and the threshold s = 2·k(n−m+1)/4^t (twice the average bucket size)

Motif refinement

  • The algorithm then tries to recover the unknown motif from each cell having at least s elements
  • The primary tool for motif refinement is expectation maximization (EM)

SLIDE 28

Experiments

  • Projection can handle the (15,4)-, (14,4)-, (16,5)- and (18,6)-motif problems (k=20, n=600)
  • Winnower fails on the (14,4)-, (16,5)- and (18,6)-motif problems

Results

  • k=20, n=600, projection (t=7,s=4)

Remarks about Projection

  • Pros:
– Fast and effective
  • Cons:
– Needs to know m and d in advance
– Randomized

SLIDE 29

Newer developments

  • COSMO: binding sites in coding regions. Blanchette, M. A comparative analysis method for detecting binding sites in coding regions. [RECOMB 2003]
  • FootPrinter: identifying regions of DNA that are unusually well conserved across a set of orthologous sequences. Blanchette, M. and Tompa, M. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research 12(5), May 2002, 739-748.
  • http://bio.cs.washington.edu/software.html

Weeder

Pavesi, Mauri & Pesole, U. Milan. Based on Sagot, INRIA

Weeder

  • Proposed by Pavesi, Mauri and Pesole [ISMB 2001]
  • It draws ideas from PRATT [Jonassen 95, Jonassen 97] and [Sagot 98]
  • It is an exhaustive approach for a particular class of rigid patterns

SLIDE 30

Exhaustive approach

  • Suppose that you want to spell out all possible (m,d) rigid patterns that have support at least q
  • One way to do it is to use a (generalized) suffix tree [Sagot 98]

Idea [Sagot 98]

  • Any deterministic pattern (substring) w corresponds to a path in the tree ending in a node u, called the locus of w; the number of leaves in the subtree rooted at u gives the support
  • Any model (rigid pattern) corresponds to a set of paths in the tree ending in nodes {u1, u2, …, ul}; the total number of leaves in the subtrees rooted at {u1, u2, …, ul} gives the support

Example

[Figure: a generalized suffix tree example with d=2 and q=2 under Hamming distance; the model ATA has c(ATA)=2 and f(ATA)=4.]

SLIDE 31

Exhaustive approach [Sagot 98]

  • Start with all paths of length d with enough support (they represent valid models)
  • At each path extension, keep track of the mismatches and the support
– if the number of allowed mismatches has not yet been reached, the model is extended by every symbol in Σ (so the number of models is scaled up by a factor |Σ|)
– otherwise we are only allowed to follow the arcs

Time complexity [Sagot 98]

  • Finding all the models with support = occurrences in a single sequence takes O(n·N(m,d)) = O(n·m^d·|Σ|^d)
  • Finding all the models with support = colors in a multisequence takes O(n·k^2·N(m,d)) = O(n·k^2·m^d·|Σ|^d)
  • Note that the complexity is exponential (in d)

Weeder

  • Pavesi et al. implemented the algorithm by Sagot, but it ran too slowly, so they decided to change the class of patterns
  • Weeder is designed to find rigid patterns whose number of mismatches is proportional to their length (the same constraint applies to all their prefixes)

SLIDE 32

Example: ε = 0.25

[Figure: with m=16 and d=4, prefixes of length 4, 8, 12, 16 may accumulate at most 1, 2, 3, 4 mismatches, respectively.]

Time complexity

  • By restricting the number of mismatches to εd for substrings of length εm, the time complexity becomes exponential in εd instead of d.

The (15,4)-motif challenge … again

  • Because of the restriction on the density of the mismatches, Weeder may miss motifs that do not satisfy the constraint. The authors report that Weeder has a probability of 0.6 of catching a (15,4)-motif in ONE sequence
  • The probability that Weeder gets the motif in all k (say 20) sequences is then almost zero
  • On the other hand, running Sagot's version is too time-consuming

SLIDE 33

Idea

  • Include more candidates at the beginning: find all the patterns (with the restriction) that occur in at least half of the k sequences.
  • The pool of candidates is then processed with Sagot's algorithm
  • The probability that the (15,4)-motif will be found rises to 0.98

Remarks about Weeder

  • Pros:
– Possibly exhaustive (if using Sagot's algorithm), but may do better
– The relative error rate ε may be more meaningful than d, and m need not be specified in advance
  • Cons:
– Very slow if run exhaustively

Footprinter

SLIDE 34

Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm

Mathieu Blanchette Martin Tompa

Computer Science & Engineering University of Washington

Phylogenetic Footprinting

(Tagle et al. 1988)

Functional sequences evolve slower than nonfunctional ones.

  • Consider a set of orthologous sequences from different species
  • Identify unusually well-conserved regions

Orthologous and paralogous genes

[Figure: rat genes 1 and 2 and mouse genes 1 and 2, related by speciation and duplication events.]

Two genes are said to be orthologous if they diverged after a speciation event; two genes are said to be paralogous if they diverged after a duplication event. Homologous genes are functionally related.

SLIDE 35

Substring Parsimony Problem

Given:

  • phylogenetic tree T,
  • set of orthologous sequences at leaves of T,
  • length k of motif
  • threshold d

Problem:

  • Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

This problem is NP-hard (exponential in k).

Small Example

AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat)

Size of motif sought: k = 4

Solution

Parsimony score: 1 mutation

[Figure: the selected 4-mers (ACGG and three ACGT occurrences shown) highlighted in the sequences above; their parsimony score on the tree is 1 mutation.]

SLIDE 36

CLUSTALW multiple sequence alignment (rbcS gene)

[Figure: CLUSTALW multiple alignment of rbcS promoter regions from Cotton, Pea, Tobacco, Ice-plant, Turnip, Wheat, Duckweed and Larch.]

An Exact Algorithm

(generalizing Sankoff and Rousseau 1975)

Wu [s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.

[Figure: the five leaf sequences with a DP table of 4^k entries at each tree node; entries such as ACGG and ACGT carry scores that propagate up to the root (e.g. ACGG: 2, ACGT: 1 at the root).]
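
A sketch of the naive post-order dynamic program in the spirit of the slide (the tree encoding, the unit mutation cost, and the names are our assumptions; this is the O(4^{2k})-per-node version before the speedups discussed below):

from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def substring_parsimony(children, leaf_seqs, root, k):
    """W_u[s] = best parsimony score of the subtree of u when u is labelled
    with k-mer s; a leaf admits only the k-mers occurring in its sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def table(u):
        if u in leaf_seqs:
            x = leaf_seqs[u]
            present = {x[i:i + k] for i in range(len(x) - k + 1)}
            return {s: (0 if s in present else float("inf")) for s in kmers}
        kids = [table(v) for v in children[u]]
        return {s: sum(min(W[t] + hamming(s, t) for t in kmers) for W in kids)
                for s in kmers}

    W = table(root)
    best = min(W, key=W.get)
    return best, W[best]

# Toy usage on the small example above (hypothetical star topology):
# children = {"root": ["human", "chimp", "rabbit", "mouse", "rat"]}
# leaf_seqs = {"human": "AGTCGTACGTGAC", ...}
# substring_parsimony(children, leaf_seqs, "root", 4)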

Time complexity

  • A better algorithm reduces the time from O(n k (4^{2k} + l)) to O(n k (4^k + l))
  • By restricting to motifs with parsimony score at most d, the number of table entries computed is greatly reduced (exponential in d, polynomial in k)
  • Amenable to many useful extensions (e.g., allowing insertions and deletions)

SLIDE 37

Motifs Absent from Some Species

  • Find motifs
– with small parsimony score
– that span a large part of the tree
  • Example: in a tree of 10 species spanning 760 Myrs, find all motifs with
– score 0 spanning at least 250 Myrs
– score 1 spanning at least 350 Myrs
– score 2 spanning at least 450 Myrs
– score 3 spanning at least 550 Myrs

Application to c-fos Gene

Asked for motifs of length 10, with
– 0 mutations over a tree of size 6
– 1 mutation over a tree of size 11
– 2 mutations over a tree of size 16
– 3 mutations over a tree of size 21
– 4 mutations over a tree of size 26

[Figure: phylogenetic tree over puffer fish, chicken, pig, mouse, hamster and human, with branch lengths.]

Found:
– 0 mutations over a tree of size 8
– 1 mutation over a tree of size 16
– 3 mutations over a tree of size 21
– 4 mutations over a tree of size 28

Application to c-fos Gene

Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     4      mammals
TTCCCGCCTCCCCTCCCC  4      mammals              yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   4      mammals              yes
TACTCCAACCGC        4      mammals

SLIDE 38

Conclusions

  • Guaranteed optimality for the question posed
  • Time linear in the number of species and the total sequence length, exponential in the parsimony score
  • Practical on real biological data sets
  • Discovered highly conserved regions, both known and not (yet) known
  • Available at http://bio.cs.washington.edu/software.html
  • Limitations: need to set k and d in advance

Statistical approaches: Discovering Profiles

SLIDE 39

Discovering Matrix Profiles

  • If we knew the positions of all the patterns in the input sequences, we could create a profile for those patterns: the frequencies of the letters in the columns are the ML estimates of the distribution, assuming the columns are independent.
  • But we do not know the positions of the patterns in the sequences

Gibbs sampling

  • Proposed in the context of pattern discovery and alignment by Lawrence et al. [Science, 1993]
  • Input: multisequence X = {x1, x2, …, xk}, pattern length m
  • Output: a matrix profile q_{i,b}, b∈{A,C,G,T}, 1≤i≤m, and a position s_j for the jth sequence (1≤j≤k).

Gibbs sampling

  • The algorithm also maintains the background distribution pA, …, pT of the symbols not described by the profile
  • For a string y of length m:
– P(y) is the probability of y under the background distribution
– Q(y) is the probability of y under the profile distribution

SLIDE 40

Gibbs sampling

  • Idea: the profile is obtained by locating the positions which maximize the ratio of the pattern probability to the background probability; once the positions are obtained, a new, more accurate version of the profile can be computed
  • The positions s_j are initialized randomly

Gibbs sampling

  • Gibbs sampling iterates 1) and 2) until convergence
1. Predictive update step: randomly choose one of the k sequences, say r. The matrix profile q_{i,b} and the background frequencies p_b are recomputed from the current positions s_j in all the other sequences
2. Sampling step: assign a weight z(y) = Q(y)/P(y) to each substring y of length m in x_r. Select a substring y of x_r at random with probability proportional to z(y), and update s_r accordingly.
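
A minimal sketch of the two-step loop (no phase-shift step, no convergence test; estimating the background from the left-out sequence alone is our simplification):

import random

def gibbs_sampler(seqs, m, iterations=1000, pseudo=1.0):
    """One motif of length m, one occurrence per sequence."""
    alphabet = "ACGT"
    k = len(seqs)
    s = [random.randrange(len(x) - m + 1) for x in seqs]  # random init
    for _ in range(iterations):
        r = random.randrange(k)                           # leave one out
        occ = [x[s[j]:s[j] + m] for j, x in enumerate(seqs) if j != r]
        # predictive update: profile q from the k-1 current occurrences
        q = [{b: (sum(w[i] == b for w in occ) + pseudo) / (k - 1 + 4 * pseudo)
              for b in alphabet} for i in range(m)]
        x = seqs[r]
        p = {b: (x.count(b) + pseudo) / (len(x) + 4 * pseudo) for b in alphabet}
        # sampling step: weight each window y of x_r by z(y) = Q(y)/P(y)
        z = []
        for j in range(len(x) - m + 1):
            w = 1.0
            for i, b in enumerate(x[j:j + m]):
                w *= q[i][b] / p[b]
            z.append(w)
        s[r] = random.choices(range(len(z)), weights=z)[0]
    return s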

Gibbs sampling

  • The more accurate the pattern description in step 1, the more accurate the determination of its position in step 2, and vice versa
  • Once some correct positions have been selected by chance, q_{i,b} begins to reflect (imperfectly) the unknown pattern
  • This process tends to recruit further correct positions, which in turn improve the discriminating power of the evolving pattern

SLIDE 41

Gibbs sampling

  • How are the matrix profile and the background frequencies updated?
  • q_{i,b} = (f_i(b) + d_b) / (k − 1 + Σ_c d_c), where f_i(b) counts symbol b at profile position i over all sequences except x_r and the d_b are pseudocounts
  • p_b = f(b) / (total length), counting only positions not covered by the current pattern positions

Phase shift problem

  • Suppose that the strongest pattern begins, for example, at positions 7, 19, 8, 23, …
  • If Gibbs happens to choose s1=9 and s2=21, it will most likely choose s3=10 and s4=25
  • The algorithm can get stuck in local maxima which are shifted forms of the optimal pattern
  • The problem can be alleviated by adding a step in which the current set of positions is compared with the sets of positions shifted left and right, up to a certain number of symbols
  • Probability ratios may be calculated for all such shifts, and a random selection is made with the appropriate weights

SLIDE 42

Gibbs sampling

  • It can be generalized to:
– also find the length m of the pattern
– find a set of matrix profiles, instead of one

Gibbs sampling

  • Since the Gibbs sampler is a heuristic rather than a rigorous optimization procedure, one cannot guarantee the optimality of the result
  • It is good practice to run the algorithm several times from different random initial positions

Gibbs sampling

  • Although EM and Gibbs are built on a common statistical foundation, the authors claim that Gibbs outperforms EM both in time complexity and in performance
  • http://bayesweb.wadsworth.org/gibbs/gibbs.html
  • ftp://ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
  • http://argon.cshl.org/ioschikz/gibbsDNA/

SLIDE 43

Expectation Maximization

  • EM was designed by Dempster, Laird and Rubin [1977]
  • It is one of the most widely used algorithms in statistics
  • 200-300 papers are published yearly on EM, or applications of EM
  • EM is a family of algorithms for maximum likelihood estimation of parameters with “missing data”

EM, when?

  • When we want to find the maximum likelihood estimate of the parameters and
– the data is incomplete, or
– the optimization of the maximum likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of additional, but missing, parameter values

SLIDE 44

Expectation maximization

  • EM approaches the problem of missing information by iteratively solving a sequence of problems in which expected information is substituted for the missing information

Expectation maximization

  • All EM algorithms consist of two steps:
1) the expectation step (E-step)
2) the maximization step (M-step)
  • The expectation is with respect to the unknown underlying variables, using the current estimate of the parameters and conditioned upon the observations
  • The maximization step provides a new estimate of the parameters
  • The two steps are iterated until convergence

General framework for EM

  • Suppose we want to train a model with parameters θ
  • We observe x (the training set)
  • The probability of x under model θ is determined by the missing data y

SLIDE 45

Expectation maximization

  • Example (HMMs): x is the sequence we want to learn from, θ is the transition and emission probabilities, y is the path through the model
  • Example (profile discovery): x are the sequences, θ are the parameters of the profile, y are the missing positions
  • Example (haplotype inference): x are the genotype data, θ are the haplotype frequencies in a population, y are the haplotype assignments for each individual.

Expectation maximization

  • The likelihood increases at each step, so the procedure will always reach a maximum asymptotically
  • It has been proved that the number of iterations to convergence is linear in the input size
  • More importantly, EM can (easily) get stuck in local maxima

EM for pattern discovery

  • The first attempt to use EM for pattern discovery was proposed by Lawrence and Reilly [Proteins, 1990]
  • Input: multisequence {x1, x2, …, xk}, pattern length m
  • Output: a matrix profile q_{i,b}, b∈{A,C,G,T}, 1≤i≤m, and positions s_j, 1≤j≤k, of the profile

SLIDE 46

EM for pattern discovery

  • Assumption: there is exactly one occurrence of the profile in each sequence
  • The missing information in this case is the set of positions of the motif in {x1, x2, …, xk} (in fact, if we knew the positions, the problem of finding the profile would be trivial)

Lawrence-Reilly EM

The objective is to maximize the following log likelihood, where q_{0,b} is the unknown distribution outside the profile, q_{i,b} is the unknown distribution inside the profile at position i, f_0(b) is the observed count of symbol b outside the profile, and f_i(b) is the observed count of b in the profile at position i:

log L(q) = Σ_{i=1}^{m} Σ_{b∈Σ} f_i(b) log(q_{i,b}) + Σ_{b∈Σ} f_0(b) log(q_{0,b})

Lawrence-Reilly EM

  • Recall: if we know the positions of the pattern in each sequence, we can easily calculate the parameters: q_{i,b} = f_i(b)/k and q_{0,b} = f_0(b)/(k(n−m))
  • So for the E-step, we need to calculate the expected count of each symbol in the profile and in the background.

SLIDE 47

Lawrence-Reilly EM

  • For each sequence x_i of length n, the profile has length m

[Figure: a sequence of length n with a profile window of length m starting at position s, 1 ≤ s ≤ n−m+1.]

Lawrence-Reilly EM

  • E-step: for each sequence x_i, we need to calculate the (log-)probability of the profile at each possible position s (1 ≤ s ≤ n−m+1), denoted ρ_{i,s}, which can be computed from the parameters of the previous iteration (t):

ρ_{i,s}^{(t)} = Σ_{j=s}^{s+m−1} log( q^{(t)}_{j−s+1, x_i[j]} )

  • The expected count Q is then just the summation over all sequences and over every position.

Lawrence-Reilly EM

  • M-step: use the expected count of each symbol in each position to compute the ML (re)estimate of the parameters:

q^{(t+1)}_{i,b} = Q_{i,b} / k, for b∈Σ, 1 ≤ i ≤ m
q^{(t+1)}_{0,b} = Q_{0,b} / (k(n−m)), for b∈Σ

  • Termination: when |q^{(t+1)} − q^{(t)}| < ε or the maximum number of iterations is reached.
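
A compact sketch of the whole loop (the pseudocounts and the fixed background estimated from overall composition are our simplifications):

def em_motif(seqs, m, iterations=50, pseudo=1.0):
    """Lawrence-Reilly-style EM: one occurrence per sequence; the missing
    data are the start positions, handled in expectation."""
    alphabet = "ACGT"
    q = [{b: 0.25 for b in alphabet} for _ in range(m)]   # profile q_{i,b}
    n_all = sum(len(x) for x in seqs)
    q0 = {b: (sum(x.count(b) for x in seqs) + pseudo) / (n_all + 4 * pseudo)
          for b in alphabet}                              # background q_{0,b}
    for _ in range(iterations):
        Q = [{b: pseudo for b in alphabet} for _ in range(m)]
        for x in seqs:
            # E-step: posterior weight of each start position s
            w = [1.0] * (len(x) - m + 1)
            for s in range(len(w)):
                for i in range(m):
                    w[s] *= q[i][x[s + i]] / q0[x[s + i]]
            total = sum(w)
            for s in range(len(w)):
                for i in range(m):                        # expected counts
                    Q[i][x[s + i]] += w[s] / total
        # M-step: re-estimate the profile from the expected counts
        for i in range(m):
            z = sum(Q[i].values())
            q[i] = {b: Q[i][b] / z for b in alphabet}
    return q, q0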

SLIDE 48

Lawrence-Reilly EM

  • Constraints on the structure of the profile can easily be incorporated
  • Variable-length gaps within the profile can be handled by adding new variables to the model

MEME

  • Proposed by Bailey and Elkan [Machine Learning J., 1995]
  • “Multiple EM for Motif Elicitation” (MEME) is an improved version of the expectation maximization approach of Lawrence and Reilly
  • It is designed to discover profiles (no gaps)

MEME

  • There are three main differences w.r.t. Lawrence et al.:
1) the initial profiles are not chosen randomly; they are substrings which actually occur in the sequences
2) the assumption that there is only one occurrence of the motif per sequence is dropped
3) once a profile has been found, it is reported and the iterative process continues

SLIDE 49

Using substrings as starting points

  • Idea: substrings actually occurring in the sequences are better starting points than random choices
  • Each substring is converted into a profile
  • Assigning probability 1.0 to the observed symbol and 0.0 to the others is a bad choice, because EM cannot move away from it
  • The authors arbitrarily assign probability 0.5 to the observed symbol and 0.5/3 to each of the other three
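
The conversion is a one-liner per column; a sketch (function name ours):

def substring_to_profile(y, match_prob=0.5):
    """Soft profile from a substring: 0.5 for the observed letter,
    0.5/3 for each of the other three."""
    off = (1 - match_prob) / 3
    return [{b: (match_prob if b == c else off) for b in "ACGT"} for c in y]

print(substring_to_profile("TATA")[0])  # T gets 0.5; A, C, G get ~0.1667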

Using substrings as starting points

  • It would be too expensive to run EM to convergence from each substring
  • It turns out that this is not necessary
  • EM converges very quickly from profiles obtained from substrings, and the best starting point can be found by running only one iteration

Dealing with multiple occurrences

  • MEME drops the “one-per-sequence” assumption
  • The basic idea is to require the user to supply an estimate of the number of occurrences of the unknown profile and to use it to normalize the estimation process of the EM algorithm
  • The authors claim that the exact value of the number of occurrences is not critical

SLIDE 50

Finding multiple profiles

  • MEME does not stop after finding the most likely profile
  • Once a profile is found and reported, it is “probabilistically erased” by changing some position-dependent weights
  • The process continues until a predetermined number of motifs has been found

MEME algorithm

– For each substring y in {x1, x2, …, xk} do
  • run EM for one iteration with the profile computed from y
  • Choose the profile q with the highest likelihood
  • Run EM until convergence, starting from q
  • Report the profile q
  • Erase the occurrences of q from the dataset; repeat until a predetermined number of motifs has been found

MEME & MetaMEME

http://meme.sdsc.edu/ http://metameme.sdsc.edu/