

SLIDE 1

Pattern Discovery

EECS 458 CWRU Fall 2004

Roadmap

  • Pattern types
  • Motivation
  • Exhaustive searches
  • TEIRESIAS
  • WINNOWER (Graph)
  • Gibbs sampler
  • EM algorithm (MEME)

Problems

  • Pattern matching: find the exact occurrences of a given pattern in a given structure (string matching)

  • Pattern recognition: recognizing approximate occurrences of a given pattern in a given structure (image recognition)

  • Pattern discovery: identifying significant patterns in a given structure, when the patterns are unknown (promoter discovery)

SLIDE 2

Data

  • Positive examples: a sample set S+ of sequences with unknown patterns

  • Negative examples: a sample set S- of sequences w/o patterns

  • Noise data

Pattern discovery: the problem

  • Given a set of sequences S+ and a model of the source for S-
  • Find a set of patterns in S+ which have a support that is statistically significant w.r.t. the probabilistic model

  • If we are also given negative examples S-, we must ensure that the patterns do not appear in S-

Motivation

  • Transcription factor binding sites: given a collection of genes with common expression, find the TF-binding motif they have in common

SLIDE 3

Characteristics of Regulatory Motifs

  • Tiny
  • Highly Variable
  • ~Constant Size

– Because a constant-size transcription factor binds

  • Often repeated
  • Low-complexity

Pattern discovery

  • Methods

– Combinatorial: graph, suffix tree
– Learning: clustering, classification
– Statistical: EM, Gibbs sampling

  • Type of patterns

– Deterministic, rigid, flexible, profiles, …

  • Measure of significance

Training

[Figure: training workflow: positive and negative examples feed a learning system that adjusts the model's parameters; the trained machine outputs predictions on new data.]

SLIDE 4

Output

  • True positive: a pattern belonging to the positive set which has been correctly predicted
  • True negative: a pattern belonging to the negative set which has been correctly predicted
  • False positive: a pattern belonging to the negative set which has been predicted as positive
  • False negative: a pattern belonging to the positive set which has been predicted as negative

Measuring pattern discovery

  • Complete: if no true pattern is missed
– But may report too many patterns
  • Sound: if no false pattern is reported
– But may miss true positives
  • Usually there is a tradeoff between soundness and completeness: if you increase one, you will decrease the other.

Measuring the performance

  • Information retrieval measures:
– Precision = true pos / (true pos + false pos): the proportion of true patterns among all reported positives
– Recall (sensitivity) = true pos / (true pos + false neg): the proportion of the true patterns that are discovered

SLIDE 5

Measuring the performance

  • Specificity = true neg / (true neg + false pos): the proportion of all negative samples that are correctly identified
  • Correlation coefficient = (tp·tn − fp·fn) / sqrt((tp+fp)(tp+fn)(tn+fn)(tn+fp)): it is 1 when there are no false positives and no false negatives, 0 if the system chooses randomly, and −1 when there are only false positives and false negatives

Types of patterns

  • Deterministic patterns
  • Rigid patterns
– Hamming distance
  • Flexible patterns (e.g. F-x(5)-G-x(2-4)-G-*-H)
– Edit distance
  • Matrix profiles (position weight matrix, PWM, or position-specific scoring matrix, PSSM)

Deterministic patterns

  • Def: deterministic patterns are substrings over the alphabet Σ
– E.g. “TATAAA” (TATA-box)
  • Discovery algorithms are faster on these types of patterns
  • Usually not flexible enough for the needs of molecular biology

SLIDE 6

Rigid patterns

  • Def: rigid patterns are patterns which allow substitutions / “don't care” symbols
– E.g. patterns over the IUPAC alphabet {A,C,G,T,U,M,R,W,S,Y,K,V,H,D,B,X,N}, where for example R=[A|G], Y=[C|T], etc.
  • Note that the size of the pattern is not allowed to change

Flexible patterns

  • Def: flexible patterns are patterns which allow substitutions / “don't care” symbols and variable-length gaps
– E.g. F-x(5)-G-x(2-4)-G-*-H
  • Note that the size of the pattern is variable
  • Very expressive
  • The space of all patterns is huge

Position weight matrix

SLIDE 7

Hamming distance

  • Def: given two strings y and w of the same length, the Hamming distance h(y,w) is the number of mismatches between y and w
  • Example:
– y=GATTACA
– w=TATAATA
– h(w,y)=h(y,w)=3
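
The definition translates directly into code; a minimal Python sketch (the function name is ours):

def hamming(y, w):
    """Number of mismatching positions between two equal-length strings."""
    assert len(y) == len(w)
    return sum(a != b for a, b in zip(y, w))

print(hamming("GATTACA", "TATAATA"))  # prints 3, matching the example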

Hamming neighborhood

  • Def: given a string y, all strings at Hamming distance at most d from y form its d-neighborhood

  • Fact: the size N(m,d) of the d-neighborhood of a string y of length m is

N(m,d) = Σ_{j=0}^{d} C(m,j) (|Σ|−1)^j = O(m^d |Σ|^d)

Hamming neighborhood

  • Example: y=ATA; the 1-neighborhood is {CTA, GTA, TTA, AAA, ACA, AGA, ATC, ATG, ATT, ATA}
  • This set can be written as a rigid pattern: {NTA|ANA|ATN}
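
A brute-force enumeration of the d-neighborhood, as a sketch (DNA alphabet assumed; the function name is ours):

from itertools import combinations, product

def neighborhood(y, d, alphabet="ACGT"):
    """All strings within Hamming distance at most d of y."""
    result = {y}
    for k in range(1, d + 1):
        for positions in combinations(range(len(y)), k):
            for letters in product(alphabet, repeat=k):
                s = list(y)
                for pos, c in zip(positions, letters):
                    s[pos] = c
                result.add("".join(s))
    return result

print(sorted(neighborhood("ATA", 1)))  # the 10 strings listed above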

SLIDE 8

Rigid pattern discovery

  • Note that we are going to observe occurrences of the neighbors of y, but we may never observe an occurrence of the pattern y itself

[Figure: y is the unknown pattern and d the number of allowed mismatches; w1, w2, w3 belong to the d-neighborhood of y, so any two of them are within distance 2d of each other.]

Hamming neighborhood

  • Fact: given two strings w1 and w2 in the d-neighborhood of the pattern y, h(w1,w2) ≤ 2d
  • The problem of finding y given w1, w2, … is sometimes called the Steiner sequence problem
  • Unfortunately, even if we know all the wi in the neighborhood, there is no guarantee of finding the unknown pattern y.

Discrete formulations

  • Closest string problem: given a multisequence set X = {x1, …, xk}, each sequence of length n, find a string y of the same length and the minimum d such that h(y,xi) ≤ d for all i.
  • Consensus pattern problem: given a multisequence X = {x1, …, xk}, each sequence of length n, and an integer m, find a string y of length m and a substring ti of length m in each xi minimizing Σi h(y,ti), denoted d(y,X).

SLIDE 9

Enumerative approach: idea

  • Define the search space
  • List exhaustively all the patterns in the search space
  • Compute the statistical significance of all of them
  • Report the pattern with the highest statistical significance

Enumerative approach

  • 1. Pattern-driven algorithm:

For t = AA…A to TT…T (4^m possibilities)
    compute d(t, S)
Report y = argmin d(t, S)

Running time: O(m |S| 4^m)
Advantage: finds provably “best” motif y
Disadvantage: time
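
A direct (and deliberately slow) sketch of the pattern-driven search; consensus_distance implements d(t, S) from the consensus pattern problem above, and the names are ours:

from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consensus_distance(t, seqs):
    """d(t, S): sum over sequences of the best match of t to any substring."""
    m = len(t)
    return sum(min(hamming(t, x[i:i + m]) for i in range(len(x) - m + 1))
               for x in seqs)

def pattern_driven(seqs, m):
    """Enumerate all 4^m candidates: provably optimal, exponential in m."""
    return min(("".join(t) for t in product("ACGT", repeat=m)),
               key=lambda t: consensus_distance(t, seqs))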

Complexity results

  • Thm: the consensus pattern problem is NP-hard
  • Thm: the closest string problem is NP-hard
  • Many other formulations of pattern discovery turn out to be NP-hard as well (Li et al., STOC 99)

SLIDE 10

NP-hard, what to do?

  • Change the problem
– E.g. relax the class of patterns
  • Accept the fact that methods may fail to find the optimal patterns
– Heuristics
– Randomized algorithms
– Approximation schemes

Exhaustive Searches

  • 2. Sample-driven algorithm:

For every m-long word W occurring in some xi
    compute d(W, S)
Report W* = argmin d(W, S), or report a local improvement of W*

Running time: O(m |S|^2)
Advantage: time
Disadvantage: if the true motif is weak and does not occur in the data, then a random motif may score better than any instance of the true motif
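
The sample-driven variant only scores m-mers that actually occur in the data; a self-contained sketch (helpers inlined, names ours):

def sample_driven(seqs, m):
    """Score only m-mers occurring in the data: O(m |S|^2) overall."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    def d(t):
        return sum(min(hamming(t, x[i:i + m]) for i in range(len(x) - m + 1))
                   for x in seqs)
    candidates = {x[i:i + m] for x in seqs for i in range(len(x) - m + 1)}
    return min(candidates, key=d)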

Two naïve approaches

  • n=1,000,000, m=20, |Σ|=4
  • Approach 1 (pattern-driven): cost O(n·m·|Σ|^m); in this case 1,000,000 × 20 × 4^20
  • Approach 2 (sample-driven): cost O(n^2); in this case 1,000,000^2

SLIDE 11

Discovering rigid patterns

  • Some “recent” algorithms
  • Teiresias
  • Winnower
  • Projection
  • Weeder

Teiresias

  • Teiresias, by Rigoutsos and Floratos from IBM [Bioinformatics, 1998]
  • Server at http://cbcsrv.watson.ibm.com/Tspd.html
  • The worst-case running time is exponential, but it works reasonably fast on average

SLIDE 12

Teiresias patterns

  • Teiresias searches for rigid patterns over the alphabet Σ∪{.}, where “.” is the don't-care symbol
  • However, there are some constraints on the density of “.” symbols that can appear in a pattern

<L,W> patterns

  • Def: given integers L and W, L ≤ W, y is a <L,W> pattern if
– y is a string over Σ∪{.}
– y starts and ends with a symbol from Σ
– any substring of y containing exactly L symbols from Σ has length at most W (i.e., in any substring containing exactly L symbols, the number of don't-care symbols is at most W−L)

Example of <3,5> patterns

  • AT..GG..T is a <3,5> pattern
  • AT..GG..T. is not a <3,5> pattern
  • AT.G.G..T is not a <3,5> pattern
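
The density constraint is easy to check mechanically; a sketch (our own helper, following the definition above):

def is_lw_pattern(y, L, W, alphabet="ACGT"):
    """True iff y is a <L,W> pattern: starts/ends with a residue, and every
    stretch containing exactly L residues spans at most W positions."""
    if not y or y[0] == "." or y[-1] == ".":
        return False
    if any(c != "." and c not in alphabet for c in y):
        return False
    solid = [i for i, c in enumerate(y) if c != "."]
    if len(solid) < L:
        return False
    return all(solid[j + L - 1] - solid[j] + 1 <= W
               for j in range(len(solid) - L + 1))

for p in ["AT..GG..T", "AT..GG..T.", "AT.G.G..T"]:
    print(p, is_lw_pattern(p, 3, 5))  # True, False, False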

SLIDE 13

Teiresias

  • Def: a pattern w is more specific than a pattern y if w can be obtained from y by changing one or more “.” to symbols from Σ, or by appending any sequence over Σ∪{.} to either end of y
  • Example: given y = AT..GG..T, the following patterns are more specific than y: ATC.GG..T, CAT..GG..T, A.AT..GGT.T.A

Teiresias

  • Def: a pattern y is maximal w.r.t. the sequences {x1, …, xk} if there exists no pattern w which is more specific than y and has the same number of occurrences.
  • Given {x1, …, xk} and parameters L, W and K, Teiresias reports all the maximal <L,W> patterns that have at least K colors.

Some notation

  • Given {x1, …, xk} and a pattern y
  • Occurrences: the total number of occurrences of y in {x1, …, xk} is denoted f(y)
  • Colors: assume each sequence has a different color; the total number of sequences that contain y (thus the total number of different colors) is denoted c(y).
  • These quantities are called the support of y.

SLIDE 14

Example

  • x1=CCATTATTCG
  • x2=AAATTAAAAC
  • x3=CCGGAAAGGG
  • x4=TTAACCGGAA
  • y=AAA, f(y)=4, c(y)=2

Teiresias algorithm

  • Idea: if y is a <L,W> pattern with at least K colors, then its substrings are also <L,W> patterns with at least K colors
  • Therefore, Teiresias assembles the maximal patterns from smaller patterns
  • Def: a pattern y is elementary if it is a <L,W> pattern containing exactly L symbols from Σ

Teiresias algorithm

  • Teiresias works in two phases
– Scanning: find all the elementary patterns with at least K colors; these become the initial set of patterns
– Convolution: repeatedly extend the patterns by gluing them together
  • Example: y=AT..CG.T and w=G.T.A can be merged to obtain AT..CG.T.A
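
A sketch of the gluing step, under the assumption that two patterns are merged when the last L−1 residues of one (with the interleaving don't-cares) coincide with the first L−1 residues of the other:

def glue(y, w, L):
    """Merge y and w if the suffix of y holding its last L-1 residues equals
    the prefix of w holding its first L-1 residues; else return None."""
    solid_y = [i for i, c in enumerate(y) if c != "."]
    solid_w = [i for i, c in enumerate(w) if c != "."]
    if len(solid_y) < L - 1 or len(solid_w) < L - 1:
        return None
    suffix = y[solid_y[-(L - 1)]:]       # from the (L-1)-th last residue on
    prefix = w[:solid_w[L - 2] + 1]      # up to the (L-1)-th residue
    return y + w[len(prefix):] if suffix == prefix else None

print(glue("AT..CG.T", "G.T.A", 3))     # AT..CG.T.A, as in the example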

SLIDE 15

Convolution phase

  • For each elementary pattern y, try to extend it with all the other elementary patterns
  • Any pattern that cannot be extended without losing support is potentially maximal

Convolution phase

  • To speed up this phase, one wants to avoid the all-against-all comparison
  • The authors devise two partial orderings, <pf and <sf, on all patterns
  • Using these orderings to schedule the convolution phase, they guarantee that
– all patterns are generated
– a maximal pattern y is generated before any non-maximal pattern subsumed by y

Partial ordering <pf

  • Def: given two patterns y and w, align them from the left and scan rightwards; at the first position where one pattern has a symbol from Σ and the other has the wild card “.”, the pattern with the symbol is prefix-wise smaller (<pf). The suffix-wise ordering <sf is defined analogously, aligning from the right.
  • Example:
– y=ASD...F
– w=SE.EFG
– y <pf w
– w <sf y

SLIDE 16

Teiresias algorithm

  • Initialize the stack with the elementary patterns having at least K support
  • Order the stack according to <pf and <sf
  • Repeat
– Repeat
  • Try to extend the top pattern to the right with all the others, in the prefix-wise ordering
  • If a new pattern is formed which has enough support, it becomes the new top
– Until the top can no longer be extended to the right
– Do the same for the left extension, using the ordering <sf
– Check the top for maximality; if it is maximal, pop it and output it
  • Until the stack is empty

Conclusion on Teiresias

  • It can be proved that Teiresias correctly reports all maximal <L,W> patterns
  • Pros:
– Provably correct
– Fast on average input
  • Cons:
– Exponential time complexity
– Limited to <L,W> patterns

Winnower

SLIDE 17

Winnower

  • Invented by Pevzner and Sze [ISMB 2000]
  • Initially designed to solve the (15,4)-motif challenge
  • Planted (m,d)-motif problem: determine an unknown pattern y of length m in a set of k DNA sequences, each of length n, each containing exactly one occurrence of a string w such that h(y,w)=d

Winnower

  • The authors show that the most popular algorithms (Consensus, GibbsDNA, MEME) fail to solve (most of the time) the (15,4)-motif problem [n=600, k=20]
  • Why is the (15,4)-motif problem difficult?
  • Because two strings in the class may differ in as many as 8 positions out of 15, a rather large number

Winnower

  • Idea: search for groups of strings of length m such that any two in a group differ in at most 2d positions
  • Remember, however, that this may not be sufficient (e.g. AAAA, GGAA, TTAA are pairwise within distance 2, yet no single string is within distance 1 of all three)

SLIDE 18

Winnower

  • How to find groups of patterns such that any two elements w1 and w2 in a group satisfy h(w1,w2) ≤ 2d?
  • One could generate C(k,2) pairwise alignments to find all pairs of substrings of length m that have at most 2d mismatches (Consensus [Hertz & Stormo 1999])

Winnower

  • Winnower builds a graph G in which
– each vertex corresponds to a distinct string of length m
– two vertices are connected by an edge if the Hamming distance between the corresponding strings is at most 2d and the strings do not come from the same sequence (only one occurrence of the unknown pattern per sequence)
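
A sketch of the graph construction (the vertex and edge representations are our own choices):

from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_graph(seqs, m, d):
    """Vertices: every m-mer occurrence (sequence index, offset, string).
    Edges: Hamming distance <= 2d between m-mers of different sequences."""
    vertices = [(i, j, x[j:j + m]) for i, x in enumerate(seqs)
                for j in range(len(x) - m + 1)]
    edges = [(u, v) for u, v in combinations(vertices, 2)
             if u[0] != v[0] and hamming(u[2], v[2]) <= 2 * d]
    return vertices, edges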

Winnower

  • For the (15,4)-problem, they report that for each signal edge there are about 20,000 spurious edges
  • Finding the signal among the noise is a “daunting task”

SLIDE 19

Winnower

  • Winnower searches the graph G for cliques, i.e. subsets of vertices that are pairwise connected
  • But the problem of finding large cliques in graphs is NP-hard.

Multipartite graphs

  • Def: a graph G is n-partite if its vertices can be partitioned into n sets such that there is no edge between any two vertices within a set
  • Fact: Winnower's graph is k-partite

Example

  • Given the sequences {abde, afcg, hbci, jbck}, we look for a (3,1)-motif

[Figure: the 4-partite graph on the 3-mers abd, bde, afc, fcg, hbc, bci, jbc, bck, one partition per sequence.]

SLIDE 20

Idea

  • Each vertex of the clique has to be in a different partition
  • And we look for cliques that have exactly one vertex in each partition

Extendable cliques

  • Def: a vertex u is a neighbor of a clique {v1, …, vs} of G, s<k, if {v1, …, vs, u} is also a clique
  • Def: a clique is called extendable if it has at least one neighbor.

Extendable cliques

  • Def: a clique with k vertices, each in a different partition, is called maximal
  • Consider a maximal clique and take a subset of t of its vertices; this subset is an extendable clique
  • Idea: remove edges that do not belong to extendable cliques

SLIDE 21

Extendable cliques

  • Fact: for any clique of size k, there are C(k,t) extendable cliques with t vertices
  • Fact: any edge belonging to a clique with k vertices is a member of at least C(k−2, t−2) extendable cliques of size t.
  • E.g. k=5, t=3: every edge of a 5-clique lies in C(3,1) = 3 extendable triangles

[Figure: a 5-clique on vertices 1-5.]

Idea

  • That means an edge that is not a member of at least C(k−2, t−2) extendable cliques of size t cannot be part of a maximal clique, and therefore it can be removed safely

t=1

  • For t=1, each vertex is a clique; it is extendable if it is connected to at least one vertex in each other partition
  • If a vertex is not extendable, delete all the edges connected to it.
  • Iterate
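
A sketch of the t=1 pruning pass, building on the build_graph sketch above (representations are ours):

def winnow_t1(vertices, edges, k):
    """Delete all edges of any vertex lacking a neighbor in some other
    partition, and iterate until no more deletions happen."""
    edges = set(frozenset(e) for e in edges)
    changed = True
    while changed:
        changed = False
        for u in vertices:
            incident = [e for e in edges if u in e]
            if not incident:
                continue
            reached = {next(iter(e - {u}))[0] for e in incident}
            if len(reached) < k - 1:            # u is not extendable
                edges -= set(incident)
                changed = True
    return edges

seqs = ["abde", "afcg", "hbci", "jbck"]
v, e = build_graph(seqs, 3, 1)          # from the earlier sketch
print(len(winnow_t1(v, e, 4)))          # edges surviving the t=1 pass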

SLIDE 22

Example

[Figure: three successive edge-removal iterations on the 4-partite example graph; edges incident to non-extendable vertices are deleted at each pass.]

SLIDE 23

Example

[Figure: the final edge-removal iterations; the surviving clique {abd, afc, hbc, jbc} spells out the planted (3,1)-motif.]

t=2

  • For t=2, each pair of vertices u, v such that there is an edge (u,v) is a clique
  • It is extendable if in each of the other k−2 partitions there is a vertex z such that (u,v,z) is a cycle of length 3
  • Each edge should belong to at least C(k−2, t−2) = C(k−2, 0) = 1 extendable clique of size 2

SLIDE 24

t>2

  • For t=3, Winnower removes edges that belong to fewer than k−2 extendable cliques of size 3
  • For t=4, Winnower removes edges that belong to fewer than (k−2)(k−3)/2 extendable cliques of size 4

Remarks on Winnower

  • Pros:
– More effective than MEME, Consensus, GibbsDNA on the (15,4) problem
  • Cons:
– Randomized
– Needs to know m and d in advance
– Assumes exactly one occurrence per sequence
– Time complexity can be very high (depends on t)

Projection

Buhler and Tompa, UW

SLIDE 25

Random Projection algorithm

  • Proposed by Buhler and Tompa [RECOMB 2001]
  • The algorithm was initially designed to solve the planted (m,d)-motif problem

Analysis on (m,d)-motif problem

  • Suppose A, C, G, T each have probability 1/4. Then the probability that a pattern of size m occurs at a given position is p0 = (1/4)^m.
  • If we allow up to one mismatch, the probability becomes p1 = p0 + m(3/4)(1/4)^(m-1).
  • If we allow at most two mismatches, it becomes p2 = p1 + C(m,2)(3/4)^2(1/4)^(m-2).
  • In general, if we allow up to d mismatches, p_d = Σ_{i=0}^{d} C(m,i)(3/4)^i(1/4)^(m-i).

Analysis on (m,d)-motif problem

  • If Z is the random variable for the number of occurrences in one sequence, then P(Z=0) = (1-p_d)^(n-m+1), and thus P(Z>0) = 1 - P(Z=0).
  • If we have k sequences, the probability that a particular pattern occurs at least once in each sequence is (P(Z>0))^k.
  • Therefore, the expected number of spurious patterns is E(n,m,k,d) = 4^m · (P(Z>0))^k = 4^m (1 - (1-p_d)^(n-m+1))^k.
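
These formulas can be evaluated directly; a small sketch (function name ours; Python 3.8+ for math.comb):

from math import comb

def expected_spurious(n, m, k, d):
    """E(n,m,k,d): expected number of (m,d)-motifs arising by chance in
    k i.i.d. uniform sequences of length n."""
    p_d = sum(comb(m, i) * (3 / 4) ** i * (1 / 4) ** (m - i)
              for i in range(d + 1))
    p_hit = 1 - (1 - p_d) ** (n - m + 1)        # P(Z > 0) for one sequence
    return 4 ** m * p_hit ** k

print(expected_spurious(600, 9, 20, 2))   # non-negligible: (9,2) is hopeless
print(expected_spurious(600, 15, 20, 4))  # tiny: the (15,4) motif stands out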

SLIDE 26

Stats of spurious (m,d)-motifs in simulated data (k=20,n=600)

Bottom-line: the (9,2)-, (11,3)-, (13,4)-, (15,5)- and (17,6)- motif problems are probably impossible to solve

Random Projections

  • Idea: select t random positions and, for each substring of length m in the text, hash its selected positions into a table
  • Hopefully, the cell corresponding to the planted motif will be the one with the highest count

Random Projections

  • Parameters: (m,d), n, k, s, possibly i
  • Set t < m−d and 4^t > k(n−m+1)
  • Build a table over all substrings of length m
  • Repeat i times
– Select t positions at random
– For every substring in the table, increase the count of the cell indexed by the t positions
  • Select all cells with count ≥ s
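
A sketch of one projection pass and the bucket-selection loop (names and structure are ours):

import random
from collections import defaultdict

def projection_pass(seqs, m, t):
    """Hash every m-mer occurrence by t randomly chosen positions."""
    positions = sorted(random.sample(range(m), t))
    buckets = defaultdict(list)
    for i, x in enumerate(seqs):
        for j in range(len(x) - m + 1):
            key = "".join(x[j + p] for p in positions)
            buckets[key].append((i, j))
    return buckets

def enriched_buckets(seqs, m, t, s, iterations):
    """Collect every bucket reaching the threshold s; these occurrence
    lists are the candidates handed to EM refinement."""
    hits = []
    for _ in range(iterations):
        for occ in projection_pass(seqs, m, t).values():
            if len(occ) >= s:
                hits.append(occ)
    return hits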

SLIDE 27

Random Projections

  • We want t < m−d because any occurrence may have d mismatched positions
  • The number of iterations i can be estimated from m, d and t

Random Projections

  • Since we are hashing k(n−m+1) substrings of size m into 4^t buckets, if 4^t > k(n−m+1) each bucket will contain on average less than one substring (set s=1)
  • The constraint is designed to filter out the noise
  • The bucket corresponding to the planted motif is expected to contain more motif instances than a random sequence would produce
  • If the constraint 4^t > k(n−m+1) cannot be enforced, the authors suggest setting t = m−d−1 and the threshold s = 2·k(n−m+1)/4^t (twice the average bucket size)

Motif refinement

  • The algorithm then tries to recover the unknown motif from each cell having at least s elements
  • The primary tool for motif refinement is expectation maximization (EM)

SLIDE 28

Experiments

  • Projection can handle the (15,4)-, (14,4)-, (16,5)- and (18,6)-motif problems (k=20, n=600)
  • Winnower fails on the (14,4)-, (16,5)- and (18,6)-motif problems

Results

  • k=20, n=600, projection (t=7,s=4)

Remarks about Projection

  • Pros:
– Fast and effective
  • Cons:
– Needs to know m and d in advance
– Randomized

SLIDE 29

Newer developments

  • COSMO: binding sites in coding regions. Blanchette, M. A comparative analysis method for detecting binding sites in coding regions. [RECOMB 2003]
  • FootPrinter: identifying regions of DNA that are unusually well conserved across a set of orthologous sequences. Blanchette, M. and Tompa, M. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research 12(5), May 2002, 739-748.
  • http://bio.cs.washington.edu/software.html

Weeder

Pavesi, Mauri & Pesole, U. Milan. Based on Sagot, INRIA

Weeder

  • Proposed by Pavesi, Mauri and Pesole [ISMB 2001]
  • It draws ideas from PRATT [Jonassen 95, Jonassen 97] and [Sagot 98]
  • It is an exhaustive approach for a particular class of rigid patterns

SLIDE 30

Exhaustive approach

  • Suppose that you want to spell out all possible (m,d) rigid patterns that have support at least q
  • One way to do it is to use a (generalized) suffix tree [Sagot 98]

Idea [Sagot 98]

  • Any deterministic pattern (substring) w corresponds to a path in the tree ending in a node u, called the locus of w; the number of leaves in the subtree rooted at u gives the support
  • Any model (rigid pattern) corresponds to a set of paths in the tree ending in nodes {u1, u2, …, ul}; the total number of leaves in the subtrees rooted at {u1, u2, …, ul} gives the support

Example

[Figure: a generalized suffix tree example with d=2 and q=2 under Hamming distance; the model ATA has c(ATA)=2 and f(ATA)=4.]

SLIDE 31

Exhaustive approach [Sagot 98]

  • Start with all paths of length d with enough support (they represent valid models)
  • At each path extension, keep track of the mismatches and the support
– if the number of allowed mismatches has not yet been reached, the model is extended by every symbol in Σ (so the number of models is scaled up by a factor |Σ|)
– otherwise we are only allowed to follow the arcs

Time complexity [Sagot 98]

  • Finding all the models with support = occurrences in a single sequence takes O(n·N(m,d)) = O(n·m^d·|Σ|^d)
  • Finding all the models with support = colors in a multisequence takes O(n·k^2·N(m,d)) = O(n·k^2·m^d·|Σ|^d)
  • Note that the complexity is exponential (in d)

Weeder

  • Pavesi et al. implemented the algorithm by Sagot, but it ran too slowly, so they decided to change the class of patterns
  • Weeder is designed to find rigid patterns whose number of mismatches is proportional to their length (the same constraint applies to all their prefixes)

SLIDE 32

Example: ε = 0.25

[Figure: with m=16 and d=4, prefixes of length 4, 8, 12, 16 may accumulate at most 1, 2, 3, 4 mismatches, respectively.]

Time complexity

  • By restricting the number of mismatches to εd for substrings of length εm, the time complexity becomes exponential in εd instead of d.

The (15,4)-motif challenge … again

  • Because of the restriction on the density of the mismatches, Weeder may miss motifs that do not satisfy the constraint. The authors report that Weeder has a probability of 0.6 of catching a (15,4)-motif in ONE sequence
  • The probability that Weeder gets the motif in all k (say 20) sequences is then almost zero
  • On the other hand, running Sagot's version is too time-consuming

SLIDE 33

Idea

  • Include more candidates at the beginning: find all the patterns (with the restriction) that occur in at least half of the k sequences.
  • The pool of candidates is then processed with Sagot's algorithm
  • The probability that the (15,4)-motif will be found rises to 0.98

Remarks about Weeder

  • Pros:
– Possibly exhaustive (if using Sagot's algorithm), but may do better
– The relative error rate ε may be more meaningful than d, and m need not be specified in advance
  • Cons:
– Very slow if run exhaustively

Footprinter

SLIDE 34

Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm

Mathieu Blanchette Martin Tompa

Computer Science & Engineering University of Washington

Phylogenetic Footprinting

(Tagle et al. 1988)

Functional sequences evolve slower than nonfunctional ones.

  • Consider a set of orthologous sequences from different species
  • Identify unusually well-conserved regions

Orthologous and paralogous genes

[Figure: rat genes 1 and 2 and mouse genes 1 and 2, related by speciation and duplication events.]

Two genes are said to be orthologous if they diverged after a speciation event; two genes are said to be paralogous if they diverged after a duplication event. Homologous genes are functionally related.

SLIDE 35

Substring Parsimony Problem

Given:

  • phylogenetic tree T,
  • set of orthologous sequences at leaves of T,
  • length k of motif
  • threshold d

Problem:

  • Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

This problem is NP-hard (exponential in k).

Small Example

AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat)

Size of motif sought: k = 4

Solution

Parsimony score: 1 mutation

[Figure: the selected 4-mers (ACGG and three ACGT occurrences shown) highlighted in the sequences above; their parsimony score on the tree is 1 mutation.]

SLIDE 36

CLUSTALW multiple sequence alignment (rbcS gene)

[Figure: CLUSTALW multiple alignment of rbcS promoter regions from Cotton, Pea, Tobacco, Ice-plant, Turnip, Wheat, Duckweed and Larch.]

An Exact Algorithm

(generalizing Sankoff and Rousseau 1975)

Wu [s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.

[Figure: the five leaf sequences with a DP table of 4^k entries at each tree node; entries such as ACGG and ACGT carry scores that propagate up to the root (e.g. ACGG: 2, ACGT: 1 at the root).]
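
A sketch of the naive post-order dynamic program in the spirit of the slide (the tree encoding, the unit mutation cost, and the names are our assumptions; this is the O(4^{2k})-per-node version before the speedups discussed below):

from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def substring_parsimony(children, leaf_seqs, root, k):
    """W_u[s] = best parsimony score of the subtree of u when u is labelled
    with k-mer s; a leaf admits only the k-mers occurring in its sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def table(u):
        if u in leaf_seqs:
            x = leaf_seqs[u]
            present = {x[i:i + k] for i in range(len(x) - k + 1)}
            return {s: (0 if s in present else float("inf")) for s in kmers}
        kids = [table(v) for v in children[u]]
        return {s: sum(min(W[t] + hamming(s, t) for t in kmers) for W in kids)
                for s in kmers}

    W = table(root)
    best = min(W, key=W.get)
    return best, W[best]

# Toy usage on the small example above (hypothetical star topology):
# children = {"root": ["human", "chimp", "rabbit", "mouse", "rat"]}
# leaf_seqs = {"human": "AGTCGTACGTGAC", ...}
# substring_parsimony(children, leaf_seqs, "root", 4)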

Time complexity

  • A better algorithm reduces the time from O(n k (4^{2k} + l)) to O(n k (4^k + l))
  • By restricting to motifs with parsimony score at most d, the number of table entries computed is greatly reduced (exponential in d, polynomial in k)
  • Amenable to many useful extensions (e.g., allowing insertions and deletions)

SLIDE 37

Motifs Absent from Some Species

  • Find motifs
– with small parsimony score
– that span a large part of the tree
  • Example: in a tree of 10 species spanning 760 Myrs, find all motifs with
– score 0 spanning at least 250 Myrs
– score 1 spanning at least 350 Myrs
– score 2 spanning at least 450 Myrs
– score 3 spanning at least 550 Myrs

Application to c-fos Gene

Asked for motifs of length 10, with
– 0 mutations over a tree of size 6
– 1 mutation over a tree of size 11
– 2 mutations over a tree of size 16
– 3 mutations over a tree of size 21
– 4 mutations over a tree of size 26

[Figure: phylogenetic tree over puffer fish, chicken, pig, mouse, hamster and human, with branch lengths.]

Found:
– 0 mutations over a tree of size 8
– 1 mutation over a tree of size 16
– 3 mutations over a tree of size 21
– 4 mutations over a tree of size 28

Application to c-fos Gene

Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     4      mammals
TTCCCGCCTCCCCTCCCC  4      mammals              yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   4      mammals              yes
TACTCCAACCGC        4      mammals

SLIDE 38

Conclusions

  • Guaranteed optimality for the question posed
  • Time linear in the number of species and the total sequence length, exponential in the parsimony score
  • Practical on real biological data sets
  • Discovered highly conserved regions, both known and not (yet) known
  • Available at http://bio.cs.washington.edu/software.html
  • Limitations: need to set k and d in advance

Statistical approaches: Discovering Profiles

SLIDE 39

Discovering Matrix Profiles

  • If we knew the positions of all the patterns in the input sequences, we could create a profile for those patterns: the frequencies of the letters in the columns are the ML estimates of the distribution, assuming the columns are independent.
  • But we do not know the positions of the patterns in the sequences

Gibbs sampling

  • Proposed in the context of pattern discovery and alignment by Lawrence et al. [Science, 1993]
  • Input: multisequence X = {x1, x2, …, xk}, pattern length m
  • Output: a matrix profile q_{i,b}, b∈{A,C,G,T}, 1≤i≤m, and a position s_j for the jth sequence (1≤j≤k).

Gibbs sampling

  • The algorithm also maintains the background distribution pA, …, pT of the symbols not described by the profile
  • For a string y of length m:
– P(y) is the probability of y under the background distribution
– Q(y) is the probability of y under the profile distribution

SLIDE 40

Gibbs sampling

  • Idea: the profile is obtained by locating the positions which maximize the ratio of the pattern probability to the background probability; once the positions are obtained, a new, more accurate version of the profile can be computed
  • The positions s_j are initialized randomly

Gibbs sampling

  • Gibbs sampling iterates 1) and 2) until convergence
1. Predictive update step: randomly choose one of the k sequences, say r. The matrix profile q_{i,b} and the background frequencies p_b are recomputed from the current positions s_j in all the other sequences
2. Sampling step: assign a weight z(y) = Q(y)/P(y) to each substring y of length m in x_r. Select a substring y of x_r at random with probability proportional to z(y), and update s_r accordingly.
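
A minimal sketch of the two-step loop (no phase-shift step, no convergence test; estimating the background from the left-out sequence alone is our simplification):

import random

def gibbs_sampler(seqs, m, iterations=1000, pseudo=1.0):
    """One motif of length m, one occurrence per sequence."""
    alphabet = "ACGT"
    k = len(seqs)
    s = [random.randrange(len(x) - m + 1) for x in seqs]  # random init
    for _ in range(iterations):
        r = random.randrange(k)                           # leave one out
        occ = [x[s[j]:s[j] + m] for j, x in enumerate(seqs) if j != r]
        # predictive update: profile q from the k-1 current occurrences
        q = [{b: (sum(w[i] == b for w in occ) + pseudo) / (k - 1 + 4 * pseudo)
              for b in alphabet} for i in range(m)]
        x = seqs[r]
        p = {b: (x.count(b) + pseudo) / (len(x) + 4 * pseudo) for b in alphabet}
        # sampling step: weight each window y of x_r by z(y) = Q(y)/P(y)
        z = []
        for j in range(len(x) - m + 1):
            w = 1.0
            for i, b in enumerate(x[j:j + m]):
                w *= q[i][b] / p[b]
            z.append(w)
        s[r] = random.choices(range(len(z)), weights=z)[0]
    return s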

Gibbs sampling

  • The more accurate the pattern description in step 1, the more accurate the determination of its position in step 2, and vice versa
  • Once some correct positions have been selected by chance, q_{i,b} begins to reflect (imperfectly) the unknown pattern
  • This process tends to recruit further correct positions, which in turn improve the discriminating power of the evolving pattern

SLIDE 41

Gibbs sampling

  • How are the matrix profile and the background frequencies updated?
  • q_{i,b} = (f_i(b) + d_b) / (k − 1 + Σ_c d_c), where f_i(b) counts symbol b at profile position i over all sequences except x_r and the d_b are pseudocounts
  • p_b = f(b) / (total length), counting only positions not covered by the current pattern positions

Phase shift problem

  • Suppose that the strongest pattern begins, for example, at positions 7, 19, 8, 23, …
  • If Gibbs happens to choose s1=9 and s2=21, it will most likely choose s3=10 and s4=25
  • The algorithm can get stuck in local maxima which are shifted forms of the optimal pattern
  • The problem can be alleviated by adding a step in which the current set of positions is compared with the sets of positions shifted left and right, up to a certain number of symbols
  • Probability ratios may be calculated for all such shifts, and a random selection is made with the appropriate weights

SLIDE 42

Gibbs sampling

  • It can be generalized to:
– also find the length m of the pattern
– find a set of matrix profiles, instead of one

Gibbs sampling

  • Since the Gibbs sampler is a heuristic rather than a rigorous optimization procedure, one cannot guarantee the optimality of the result
  • It is good practice to run the algorithm several times from different random initial positions

Gibbs sampling

  • Although EM and Gibbs are built on a common statistical foundation, the authors claim that Gibbs outperforms EM both in time complexity and in performance
  • http://bayesweb.wadsworth.org/gibbs/gibbs.html
  • ftp://ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
  • http://argon.cshl.org/ioschikz/gibbsDNA/

SLIDE 43

Expectation Maximization

  • EM was designed by Dempster, Laird and Rubin [1977]
  • It is one of the most widely used algorithms in statistics
  • 200-300 papers are published yearly on EM, or applications of EM
  • EM is a family of algorithms for maximum likelihood estimation of parameters with “missing data”

EM, when?

  • When we want to find the maximum likelihood estimate of the parameters and
– the data is incomplete, or
– the optimization of the maximum likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of additional, but missing, parameter values

SLIDE 44

Expectation maximization

  • EM approaches the problem of missing information by iteratively solving a sequence of problems in which expected information is substituted for the missing information

Expectation maximization

  • All EM algorithms consist of two steps:
1) the expectation step (E-step)
2) the maximization step (M-step)
  • The expectation is with respect to the unknown underlying variables, using the current estimate of the parameters and conditioned upon the observations
  • The maximization step provides a new estimate of the parameters
  • The two steps are iterated until convergence

General framework for EM

  • Suppose we want to train a model with parameters θ
  • We observe x (the training set)
  • The probability of x under model θ is determined by the missing data y

SLIDE 45

Expectation maximization

  • Example (HMMs): x is the sequence we want to learn from, θ is the transition and emission probabilities, y is the path through the model
  • Example (profile discovery): x are the sequences, θ are the parameters of the profile, y are the missing positions
  • Example (haplotype inference): x are the genotype data, θ are the haplotype frequencies in a population, y are the haplotype assignments for each individual.

Expectation maximization

  • The likelihood increases at each step, so the procedure will always reach a maximum asymptotically
  • It has been proved that the number of iterations to convergence is linear in the input size
  • More importantly, EM can (easily) get stuck in local maxima

EM for pattern discovery

  • The first attempt to use EM for pattern discovery was proposed by Lawrence and Reilly [Proteins, 1990]
  • Input: multisequence {x1, x2, …, xk}, pattern length m
  • Output: a matrix profile q_{i,b}, b∈{A,C,G,T}, 1≤i≤m, and positions s_j, 1≤j≤k, of the profile

SLIDE 46

EM for pattern discovery

  • Assumption: there is exactly one occurrence of the profile in each sequence
  • The missing information in this case is the set of positions of the motif in {x1, x2, …, xk} (in fact, if we knew the positions, the problem of finding the profile would be trivial)

Lawrence-Reilly EM

The objective is to maximize the following log likelihood, where q_{0,b} is the unknown distribution outside the profile, q_{i,b} is the unknown distribution inside the profile at position i, f_0(b) is the observed count of symbol b outside the profile, and f_i(b) is the observed count of b in the profile at position i:

log L(q) = Σ_{i=1}^{m} Σ_{b∈Σ} f_i(b) log(q_{i,b}) + Σ_{b∈Σ} f_0(b) log(q_{0,b})

Lawrence-Reilly EM

  • Recall: if we know the positions of the pattern in each sequence, we can easily calculate the parameters: q_{i,b} = f_i(b)/k and q_{0,b} = f_0(b)/(k(n−m))
  • So for the E-step, we need to calculate the expected count of each symbol in the profile and in the background.

SLIDE 47

Lawrence-Reilly EM

  • For each sequence x_i of length n, the profile has length m

[Figure: a sequence of length n with a profile window of length m starting at position s, 1 ≤ s ≤ n−m+1.]

Lawrence-Reilly EM

  • E-step: for each sequence x_i, we need to calculate the (log-)probability of the profile at each possible position s (1 ≤ s ≤ n−m+1), denoted ρ_{i,s}, which can be computed from the parameters of the previous iteration (t):

ρ_{i,s}^{(t)} = Σ_{j=s}^{s+m−1} log( q^{(t)}_{j−s+1, x_i[j]} )

  • The expected count Q is then just the summation over all sequences and over every position.

Lawrence-Reilly EM

  • M-step: use the expected count of each symbol in each position to compute the ML (re)estimate of the parameters:

q^{(t+1)}_{i,b} = Q_{i,b} / k, for b∈Σ, 1 ≤ i ≤ m
q^{(t+1)}_{0,b} = Q_{0,b} / (k(n−m)), for b∈Σ

  • Termination: when |q^{(t+1)} − q^{(t)}| < ε or the maximum number of iterations is reached.
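
A compact sketch of the whole loop (the pseudocounts and the fixed background estimated from overall composition are our simplifications):

def em_motif(seqs, m, iterations=50, pseudo=1.0):
    """Lawrence-Reilly-style EM: one occurrence per sequence; the missing
    data are the start positions, handled in expectation."""
    alphabet = "ACGT"
    q = [{b: 0.25 for b in alphabet} for _ in range(m)]   # profile q_{i,b}
    n_all = sum(len(x) for x in seqs)
    q0 = {b: (sum(x.count(b) for x in seqs) + pseudo) / (n_all + 4 * pseudo)
          for b in alphabet}                              # background q_{0,b}
    for _ in range(iterations):
        Q = [{b: pseudo for b in alphabet} for _ in range(m)]
        for x in seqs:
            # E-step: posterior weight of each start position s
            w = [1.0] * (len(x) - m + 1)
            for s in range(len(w)):
                for i in range(m):
                    w[s] *= q[i][x[s + i]] / q0[x[s + i]]
            total = sum(w)
            for s in range(len(w)):
                for i in range(m):                        # expected counts
                    Q[i][x[s + i]] += w[s] / total
        # M-step: re-estimate the profile from the expected counts
        for i in range(m):
            z = sum(Q[i].values())
            q[i] = {b: Q[i][b] / z for b in alphabet}
    return q, q0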

SLIDE 48

Lawrence-Reilly EM

  • Constraints on the structure of the profile can easily be incorporated
  • Variable-length gaps within the profile can be handled by adding new variables to the model

MEME

  • Proposed by Bailey and Elkan [Machine Learning J., 1995]
  • “Multiple EM for Motif Elicitation” (MEME) is an improved version of the expectation maximization approach of Lawrence and Reilly
  • It is designed to discover profiles (no gaps)

MEME

  • There are three main differences w.r.t. Lawrence et al.:
1) the initial profiles are not chosen randomly; they are substrings which actually occur in the sequences
2) the assumption that there is only one occurrence of the motif per sequence is dropped
3) once a profile has been found, it is reported and the iterative process continues

SLIDE 49

Using substrings as starting points

  • Idea: substrings actually occurring in the sequences are better starting points than random choices
  • Each substring is converted into a profile
  • Assigning probability 1.0 to the observed symbol and 0.0 to the others is a bad choice, because EM cannot move away from it
  • The authors arbitrarily assign probability 0.5 to the observed symbol and 0.5/3 to each of the other three
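
The conversion is a one-liner per column; a sketch (function name ours):

def substring_to_profile(y, match_prob=0.5):
    """Soft profile from a substring: 0.5 for the observed letter,
    0.5/3 for each of the other three."""
    off = (1 - match_prob) / 3
    return [{b: (match_prob if b == c else off) for b in "ACGT"} for c in y]

print(substring_to_profile("TATA")[0])  # T gets 0.5; A, C, G get ~0.1667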

Using substrings as starting points

  • It would be too expensive to run EM to convergence from each substring
  • It turns out that this is not necessary
  • EM converges very quickly from profiles obtained from substrings, and the best starting point can be found by running only one iteration

Dealing with multiple occurrences

  • MEME drops the “one-per-sequence” assumption
  • The basic idea is to require the user to supply an estimate of the number of occurrences of the unknown profile and to use it to normalize the estimation process of the EM algorithm
  • The authors claim that the exact value of the number of occurrences is not critical

SLIDE 50

Finding multiple profiles

  • MEME does not stop after finding the most likely profile
  • Once a profile is found and reported, it is “probabilistically erased” by changing some position-dependent weights
  • The process continues until a predetermined number of motifs has been found

MEME algorithm

– For each substring y in {x1, x2, …, xk} do
  • run EM for one iteration with the profile computed from y
  • Choose the profile q with the highest likelihood
  • Run EM until convergence, starting from q
  • Report the profile q
  • Erase the occurrences of q from the dataset; repeat until a predetermined number of motifs has been found

MEME & MetaMEME

http://meme.sdsc.edu/ http://metameme.sdsc.edu/