[PDF] - Probability Theory as Extended Logic: Probability Theory as Extended PDF Document

SLIDE 1

Probability Theory as Extended Logic: Probability Theory as Extended Logic:

Applications to motif finding and clustering Applications to motif finding and clustering

Erik van Nimwegen

Division of Bioinformatics Biozentrum, Universität Basel, Swiss Institute of Bioinformatics

Ab initio motif finding: regular expressions, MEME, Gibbs-Sampling.
Probabilistic clustering of sequences.
Discovering regulatory modules.
Regulatory motif discovery in phylogenetically related sequences.

Transcription Regulation Networks Transcription Regulation Networks

ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG…..

Genes Promoters Regulators (transcription factors)

SLIDE 2

Transcription Regulation Networks Transcription Regulation Networks

ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG…..

Genes Promoters Regulators (transcription factors) binding sites

Transcription Regulation Networks Transcription Regulation Networks

ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG….. ATG…..

Genes Promoters Regulators (transcription factors) binding sites Regulatory network To reconstruct the network we need to identify all binding sites genome-wide and the factor(s) that binds at each site.

SLIDE 3

Transcription Regulation Networks Transcription Regulation Networks

The number of transcription regulators increases roughly quadratically with the size
f the genome.
The number of regulators per gene thus increases linearly with the size of

the genome. From: E. van Nimwegen Trends in Genetics 19 479-484 (2003) metabolic genes transcription factors cell cycle related genes

Transcription Regulation Networks Transcription Regulation Networks

Knowledge from direct experimentation:

E. coli:
almost 200,000 papers in PubMed. Over 17,000 on transcription.
About 300 TFs.
Less than 100 TFs with at least 1 known binding site.
About 750 known sites in total. (of 2,500-8,000 ?)
S. cerevisiae:
Almost 60,000 papers in PubMed. Over 10,000 on transcription.
About 350 TFs.
About 65 TFs with at least 1 known binding site.
About 450 known sites in total. (of > 10,000 ?)

Even in intensely studied model organisms the majority of regulatory sites is not known.

SLIDE 4

Ab Ab initio initio discovery of regulatory sites discovery of regulatory sites

General Approaches:

1. Collect sets of (intergenic) sequences that are thought to contain

binding sites for a common regulatory factor. Examples:

Upstream regions of co-regulated genes.
Sequence fragments pulled down with ChrIP

then search for overrepresented short sequence motifs among them.

Microarray experiments (gene expression) Binding experiments (ChIP-on-chip) Other external biological knowledge Sets of sequences containing sites for a common regulatory factor.

Representation by consensus sequence Representation by consensus sequence

r regular expression
r regular expression

ACGCGT ACGCGT ACGCGA ACGCGT ACGCGA CCGCGT TCGCGA ACGCGT ACGCGT ACGCGT ACGCGT ACGCGT ACGCGT HCGCGW Consensus sequence: Regular expression:

The experimentally known binding sites of MBP1 (yeast TF): So called IUPAC symbols are used to represent sets

f nucleotides. For instance:

W = {A,T} and H = {A,C,T}

(take the majority base in each column) (take the IUPAC symbol for the sequences

ccurring in each column)

SLIDE 5

Scan for over Scan for over-

represented patterns

represented patterns

ATG….. ATG….. ATG….. ATG….. ATG…..

Gene A Gene B Gene C Gene D Gene E

Exhaustively go through all possible consensus sequences (or regular

expressions) s up to some length L.

Scan for over Scan for over-

represented patterns

represented patterns

ATG….. ATG….. ATG….. ATG….. ATG…..

Gene A Gene B Gene C Gene D Gene E

Exhaustively go through all possible consensus sequences (or regular

expressions) s up to some length L.

For a given motif, say s = WGCWCG, find all occurrences.

AGCTCG TGCTCG TGCTCG TGCACG AGCACG

SLIDE 6

Scan for over Scan for over-

represented patterns

represented patterns

ATG….. ATG….. ATG….. ATG….. ATG…..

Gene A Gene B Gene C Gene D Gene E

Exhaustively go through all possible consensus sequences (or regular

expressions) s up to some length L.

For a given motif, say s = WGCWCG, find all occurrences.
Determine the significance of the motif. Roughly speaking the significance

is given by the probability to get so many occurrences in random sequences, e.g. P(WGCWCG) = 0.034

AGCTCG TGCTCG TGCTCG TGCACG AGCACG

Scan for over Scan for over-

represented patterns

represented patterns

ATG….. ATG….. ATG….. ATG….. ATG…..

Gene A Gene B Gene C Gene D Gene E

Exhaustively go through all possible consensus sequences (or regular

expressions) s up to some length L.

For a given motif, say s = WGCWCG, find all occurrences.
Determine the significance of the motif. Roughly speaking the significance

is given by the probability to get so many occurrences in random sequences, e.g. P(WGCWCG) = 0.034

Rank all motifs by significance and report the motifs with highest significance.

AGCTCG TGCTCG TGCTCG TGCACG AGCACG

SLIDE 7

Over Over-

representation of consensus

representation of consensus and regular expression patterns and regular expression patterns

Example algorithms:

YMF (Sinha and Tompa)
Weeder (Pavesi et al.)

Advantages:

The search is exhaustive. If a significant motif exists it is guaranteed to be found.

Disadvantages:

Consensus sequences and regular expressions are not necessarily a good representation
f binding sites. (next slides)
The significant motifs are often partially redundant. For example:

ATTACTAT WWACTWTTA AATTAC ATTACGG

Now which motif is the “correct” motif?

The weight matrix representation of The weight matrix representation of regulatory motifs regulatory motifs

Alignment of known fruR binding sites:

AAGCTGAATCGATTTTATGATTTGGT AGGCTGAATCGTTTCAATTCAGCAAG CTGCTGAATTGATTCAGGTCAGGCCA GTGCTGAAACCATTCAAGAGTCAATT GTGGTGAATCGATACTTTACCGGTTG CGACTGAAACGCTTCAGCTAGGATAA TGACTGAAACGTTTTTGCCCTATGAG TTCTTGAAACGTTTCAGCGCGATCTT ACGGTGAATCGTTCAAGCAAATATAT GCACTGAATCGGTTAACTGTCCAGTC ATCGTTAAGCGATTCAGCACCTTACC ** **gcTGAAtCG gcTGAAtCG* *TTcAg TTcAg**c****** **c******

0.067 , 0.467 , 0.2 , 267 . : instance For . position at base finding

f

y Probabilit

3 3 3 3

= = = = =

T G C A i

w w w w i w α

α

Probability that a site for the TF represented by w will have sequence s:

∏

=

l i i si

w w s P

1

) | (

SLIDE 8

Ab Ab initio initio motif discovery with weight matrices motif discovery with weight matrices

Assume the input set of ‘co-regulated’ sequences is a mixture of “random” background sequence plus a number of samples from a weight matrix. Unknowns:

1. The weight matrix
2. The number of sites
3. The positions of the sites

ATG….. ATG….. ATG….. ATG….. ATG…..

MEME approach: Search the space of WMs for the WM that maximizes the likelihood

f the data (summing over all possible binding site configurations for each WM). The likelihood

is maximized using “Expectation Maximization”.

Probability of the sequence given a configuration Probability of the sequence given a configuration

Probability of sequence at positions (i+1) through (i+L) assuming it is a site for WM w:

∏

= + + +

+

= ≡

L k k s L i i i L i

k i

w w s s s P w s P

1 2 1 ] , [

) | ..... ( ) | (

i+1 i+L sequence s

Probability of a given configuration of sites ρ to a set of sequences S:

s

b b s P = ) | (

Probability of a base not in a site (“background”):

∏ ∏

∈ ∉

=

sites ] , [ sites

) | ( ) , | (

j L j i s

w s P b w S P

i

ρ

( ) ( )

nonsites sites 1

) (

n w n w

p p P − = ρ

Likelihood of a configuration: Prior of a configuration:

SLIDE 9

Probability of the sequence given a configuration Probability of the sequence given a configuration

The probability of the sequences given the WM is the sum over all configurations:

∑

=

ρ

ρ ρ ) ( ) , | ( ) | ( P w S P w S P

) | ] , 1 [ ( ) | .... ( ) | ] 1 , 1 [ ( ) 1 ( ) | ] , 1 [ (

1 1

w L j S P p w s s s P w j S P p b w j S P

w j j L j w s j

− + − − =

− + −

Let be the probability of the sum of all configurations up to base j

) | ] , 1 [ ( w j S P

Starting from we can now determine in linear time.

1

) | ] 1 , 1 [ ( b w S P = ) | ( w S P

Thus, for any WM w we can easily determine but we want to find the maximum We therefore differentiate with respect to each WM entry and set zero.

) | ( w S P

) | (

*

w S P

i

wα

Optimum weight matrix Optimum weight matrix

α

, constant, ) | ( i w w S P

i

∀ = ∂ ∂

At the optimum we have: A little derivation shows that this is satisfied when:

k k i i L i i

w s w S P w L i S P w s s P w i S P

α

α δ constant ) , ( ) | ( ) | ] end , 1 [ ( ) | ( ) | ] , 1 [ (

1

= + +

+ + +

∑

L

Or equivalently:

k i k i i

w w i P s w i P

α

α δ =

∑ ∑

+

) | at site ( ) , ( ) | at site (

Instead of solving this equation, we can update the WM iteratively:

) new ( ))

ld

( | at site ( ) , ( ))

ld

( | at site (

k i k i i

w w i P s w i P

α

α δ =

∑ ∑

+

This is called Expectation Maximization and is guaranteed to lead to a local optimum in w.

SLIDE 10

MEME approach MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.

167 . , 5 .

5 5 5 5

= = = =

T C A G

w w w w

MEME approach MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.
2. For each position i in each sequence s, calculate the probability P(s,i) that this

sequence is a binding site for the WM.

SLIDE 11

MEME approach MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.
2. For each position i in each sequence s, calculate the probability P(s,i) that this

sequence is a binding site for the WM.

3. Construct a new WM by averaging the potential sites at all possible positions (s,i),

weighing each potential site with the probability that it is a site for the previous WM.

MEME approach MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.
2. For each position i in each sequence s, calculate the probability P(s,i) that this

sequence is a binding site for the WM.

3. Construct a new WM by averaging the potential sites at all possible positions (s,i),

weighing each potential site with the probability that it is a site for the previous WM.

4. This is iterated until the WM does not longer change

SLIDE 12

MEME approach MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.
2. For each position i in each sequence s, calculate the probability P(s,i) that this

sequence is a binding site for the WM.

3. Construct a new WM by averaging the potential sites at all possible positions (s,i),

weighing each potential site with the probability that it is a site for the previous WM.

4. This is iterated until the WM does not longer change
5. The best WM over many seeds is reported.

Probability sequences deriving from same WM Probability sequences deriving from same WM

Given a WM w, the probability that a set of n sequences for the TF

represented by w will match the set S is:

( )

∏ ∏ ∏ ∏ ∏

∈ ∈ = =

⎥ ⎦ ⎤ ⎢ ⎣ ⎡ = ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ = =

S s S s l i n i l i i s

i i

w w w s P w S P

1 1

) | ( ) | (

α

α α

With the number of times base α occurs at position i among the

sequences in S.

i

nα

AAGCTGAATCGATTTTATGATTTGGT AGGCTGAATCGTTTCAATTCAGCAAG CTGCTGAATTGATTCAGGTCAGGCCA GTGCTGAAACCATTCAAGAGTCAATT GTGGTGAATCGATACTTTACCGGTTG CGACTGAAACGCTTCAGCTAGGATAA TGACTGAAACGTTTTTGCCCTATGAG TTCTTGAAACGTTTCAGCGCGATCTT ACGGTGAATCGTTCAAGCAAATATAT GCACTGAATCGGTTAACTGTCCAGTC ATCGTTAAGCGATTCAGCACCTTACC ** **gcTGAAtCG gcTGAAtCG* *TTcAg TTcAg**c****** **c******

6 1 4

9 9 9 9

= = = =

T G C A

n n n n

SLIDE 13

Probability sequences deriving from same WM Probability sequences deriving from same WM ( )

∏ ∏

=

⎥ ⎦ ⎤ ⎢ ⎣ ⎡ =

l i n i

i

w w S P

1

) | (

α

α α

What if we do not know the WM? The probability to obtain that a set of n sites from a TF with unknown weight matrix will match the set S:

( )

∫ ∏ ∫∏ ∫

=

⎥ ⎦ ⎤ ⎢ ⎣ ⎡ = = =

l i i i n i

dw w P w dw w P w S P dw w S P S P

i

1

) ( ) ( ) | ( ) , ( ) (

α α

α

is the prior probability over possible WMs, and we have used the fact that it factors into independent distributions for each column i. How to represent our information of the a priori likely WMs w?

) (w P ) (

i

w P

The prior is a density over the simplex

Prior over Prior over WMs WMs

1 =

A

w 1 =

C

w 1 =

G

w 1 =

T

w A column of the WM is a point in the 4D simplex

[ ] ( )

dw w dw w P

1 4

) ( ) 4 ( ) (

−

∏

Γ Γ =

γ α α

γ γ

The family of Dirichlet priors:

) , , , (

T G C A

w w w w P

The case corresponds to the uniform prior.
For more weight is on the corners and edges of the simplex, i.e.
ne expects a distribution heavily biased to one or two bases.
For more weight is on the middle of the simplex, i.e. one expects

all bases to have equal probabilities (unlikely to apply in practice).

1 = γ 1 < γ 1 > γ

SLIDE 14

Probability sequences deriving from same WM Probability sequences deriving from same WM

With the Dirichlet prior the integral becomes:

[ ] ( )

∏ ∏ ∏ ∫∏

= = − +

⎥ ⎦ ⎤ ⎢ ⎣ ⎡ Γ + Γ + Γ Γ = ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ Γ Γ =

l i i l i n

n n dw w S P

i

1 1 1 4

) ( ) ( ) 4 ( ) 4 ( ) ( ) 4 ( ) (

α α α γ α

γ γ γ γ γ γ

α

This gives the probability to obtain the sequences S from a common WM.
Now we can calculate all kinds of things like:
1. Given a set S and one more sequence s, what is the probability that s

comes from the same WM as the WM that the sequences in S came from.

2. Given two sets S1 and S2 what is the probability to obtain them both from

a single WM versus the probability to obtain them when sampling from two independent WMs.

Probability sequences deriving from same WM Probability sequences deriving from same WM

Probability the set of sequences S and single sequence s derive from a

common WM:

∏ ∏

=

⎥ ⎥ ⎦ ⎤ ⎢ ⎢ ⎣ ⎡ Γ + + Γ + + Γ Γ =

l i s i

i

S n n s S P

1

) ( ) ) ( ( ) 4 1 ( ) 4 ( ) , (

α α α

γ γ δ γ γ

therwise.

and if 1 in position at

ccurs

times

f

number ) (

i s i

s S i S n

i

= = = α δ α

α α

Probability for S and s coming from independent WMs:

∏ ∏

=

⎥ ⎦ ⎤ ⎢ ⎣ ⎡ Γ + Γ + Γ Γ ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ Γ + Γ + Γ Γ =

l i i S

n n s P S P

1

) ( ) 1 ( ) 4 1 ( ) 4 ( ) ( ) ) ( ( ) 4 ( ) 4 ( ) ( ) ( γ γ γ γ γ γ γ γ

α α

Assume prior probability s comes from same WM is π then the posterior is:

( )

∏

=

+ + = − + = − + =

l i i s l

n S n x x x S P s P s S P s S P S s P

i

1

4 ) ( , 4 1 ) 1 ( ) 1 )( ( ) ( ) , ( ) , ( ) as WM same ( γ γ π π π π π π

SLIDE 15

Ab Ab initio initio motif discovery with weight matrices motif discovery with weight matrices

Assume the input set of ‘co-regulated’ sequences is a mixture of “random” background sequence plus a number of samples from a weight matrix. Unknowns:

1. The weight matrix
2. The number of sites
3. The positions of the sites

ATG….. ATG….. ATG….. ATG….. ATG…..

MEME approach: Search the space of WMs for the WM that maximizes the likelihood

f the data (summing over all possible binding site configurations for each WM). The likelihood

is maximized using “Expectation Maximization”.

Gibbs Sampler approach: Search the space of binding site configurations for the

configuration that maximizes the likelihood of all sites deriving from a common WM (integrating over all possible WMs) and all other sequence deriving from background. The space of configurations is searched through “Gibbs Sampling”.

Gibbs Sampler Approach Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.

SLIDE 16

Gibbs Sampler Approach Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.

Gibbs Sampler Approach Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.
3. Scan the sequence from left to right and calculate at each position i the probability P(i)

that the sequence starting at that position is a site for the same WM as those from the

ther sequences S.

∏

= + + +

+ + = = ≡

+

l k k s l i i i

S n S n S P S s s s P S l i s P i P

k i

1 2 1

4 ) ( ) ( ) ( ) , ( ) | ] , [ ( ) ( γ γ L

SLIDE 17

Gibbs Sampler Approach Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.
3. Scan the sequence from left to right and calculate at each position i the probability P(i)

that the sequence starting at that position is a site for the same WM as those from the

ther sequences S.
4. Sample a position i in proportion to probability P(i)

Gibbs Sampler Approach Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.
3. Scan the sequence from left to right and calculate at each position i the probability P(i)

that the sequence starting at that position is a site for the same WM as those from the

ther sequences S.
4. Sample a position i in proportion to probability P(i)
5. Iterate while keeping track of:
The configuration with the highest probability P(S) that all sites stem

from a common WM.

The frequency of occurrence of each window in the sampled configurations.

SLIDE 18

Gibbs Sampling in General Gibbs Sampling in General

Given 1 site in each of the n sequences, each configuration is a a vector of

positions at which the sites occur.

the Gibbs sampler samples the joint probability distribution
At each step a sequence k is chosen and the conditional distribution
then a position is sampled from this conditional distribution.
One can show that, in the limit of many iterations, the number of times

that the system will be in configuration is proportional to

This is one example of what are generally called Monte-Carlo Markov chain

techniques for sampling a complex probability distribution on a computer.

) , , , (

2 1 n

i i i L ) , , , (

2 1 n

i i i P L

∑

+ − + −

=

j n k k n n k k k

i i j i i i P i i P i i i i i i P ) , , , , , , ( ) , , ( ) , , , , , | (

1 1 2 1 1 1 1 2 1

L L L L L

k

i ) , , , (

2 1 n

i i i N L ) , , , (

2 1 n

i i i L ) , , , (

2 1 n

i i i P L

Finding the maximum: Simulated Annealing Finding the maximum: Simulated Annealing

To sample from the distribution we pick a k and sample

from the conditional distribution:

In simulated annealing one introduces a parameter and samples instead

from:

One then slowly increases with time.
One can show that in the limit in which is increased infinitely slowly,
ne is guaranteed to find the global optimum of

.

One hopes that if one increases slowly that one is still likely to find the

global optimum.

) , , , (

2 1 n

i i i P L

∑

+ − + −

=

j n k k n n k k k

i i j i i i P i i P i i i i i i P ) , , , , , , ( ) , , ( ) , , , , , | (

1 1 2 1 1 1 1 2 1

L L L L L

k

i

β

∑

+ − + −

=

j n k k n n k k k

i i j i i i P i i P i i i i i i P

β β β

) , , , , , , ( ) , , ( ) , , , , , | (

1 1 2 1 1 1 1 2 1

L L L L L β β ) , , , (

2 1 n

i i i P L

β

SLIDE 19

Ab Ab initio initio motif discovery with weight matrices motif discovery with weight matrices

Expectation maximization approach:

MEME (http://meme.sdsc.edu/meme/intro.html) MDScan (http://ai.stanford.edu/~xsliu/MDscan/) Advantages:

Can flexibly treat variations in the number of sites per sequence.
Is likely to discover the motif if the WM is well-described by a consensus

sequence. Disadvantages: Can easily get stuck in local optima.

Gibbs Sampling approach:

The Gibbs Motif Sampler (http://bayesweb.wadsworth.org/gibbs/gibbs.html)

AlignAce (http://atlas.med.harvard.edu/) Advantages:

Searches the entire space of binding site configurations.
Is better suited to motifs that are very fuzzy and not well-described by a consensus.

Disadvantages:

Depends more on assumptions about the number of sites per sequences.

Probability sequences deriving from same WM Probability sequences deriving from same WM

Generalization to two sets of sequences S1 and S2 .

2 1

in position at

ccurs

times

f

number in position at

ccurs

times

f

number S i m S i n

i i

α α

= =

Again assuming a prior π for the probability that S1 and S2 come from the same WM, the posterior that S1 and S2 come from the same WM becomes:

∏ ∏

=

⎥ ⎦ ⎤ ⎢ ⎣ ⎡ + Γ + Γ Γ + + Γ Γ + + Γ + Γ + Γ = − + = − + =

l i i i i i

m n m n m n m n x x x S P S P S S P S S P S S P

1 2 1 2 1 2 1 2 1

) ( ) ( ) ( ) ( ) 4 ( ) 4 ( ) 4 ( ) 4 ( , 1 ) 1 )( ( ) ( ) , ( ) , ( ) as WM same (

α α α α α

γ γ γ γ γ γ γ γ π π π π π π In a minute we will generalize this to arbitrary numbers of groups.

SLIDE 20

Probability sequences deriving from same WM Probability sequences deriving from same WM

Probability to observe S when each sequence s stems from a different WM:

nl nl

S P ⎟ ⎠ ⎞ ⎜ ⎝ ⎛ = ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ Γ + Γ + Γ Γ = 4 1 ) ( ) 1 4 ( ) 1 ( ) 4 ( ) ( γ γ γ γ Number of possible base counts How many different ways the observed count can be realized.

Using Stirling’s approximation we can rewrite P(S) for as:

1 = γ

⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ − ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ + =

∑

= − l i i T i G i C i A

n n n n n n n n H n n S P

1 1

, , , exp 3 3 ) (

∑

⎥ ⎦ ⎤ ⎢ ⎣ ⎡ − = ⎥ ⎦ ⎤ ⎢ ⎣ ⎡

α α α

n n n n n n n n n n n n H

i i i T i G i C i A

log , , ,

Probabilistic clustering of sequences Probabilistic clustering of sequences

We are given a large dataset of sequences D.
We assume that each of the sequences was sampled from a WM.

van Nimwegen et al. PNAS 99: 7323-7328

SLIDE 21

Probabilistic clustering of sequences Probabilistic clustering of sequences

We are given a large dataset of sequences D.
We assume that each of the sequences was sampled from a WM.
We do not know how many different WMs were used to generate the data,

how many sequences were sampled from each WM, and which sequences were sampled from which WM.

We want to infer which sequences were drawn from the same WM.

van Nimwegen et al. PNAS 99: 7323-7328

Probabilistic clustering of sequences Probabilistic clustering of sequences

AAGCACTATATTGGTGCAACATTCACATCGTG GTGATGAACTGTTTTTTTATCCAGTATAATTT ACTCATCTGGTACGACCAGATCACCTTGCGGA AAGCACCATGTTGGTGCAATGACCTTTGGATA AAGCTGAATCGATTTTATGATTTGGTTCAATT AGGCTGAATCGTTTCAATTCAGCAAGAGAGGA CATTAACTCATCGGATCAGTTCAGTAACTATT CCTCTTTACTGTATATAAAACCAGTTTATACT TCCGAACTGATCGGACTTGTTCAGCGTACACG ACTCACAACTGTATATAAATACAGTTACAGAT GTGCTGAAACCATTCAAGAGTCAATTGGCGCG ATCAAGCTGGTATGATGAGTTAATATTATGTT TTCCAATACTGTATATTCATTCAGGTCAATTT GTGGTGAATCGATACTTTACCGGTTGAATTTG CAGCATAACTGTATATACACCCAGGGGGCGGA GCCTTTTGCTGTATATACTCACAGCATAACTG CAGCGGCTGGTCCGCTGTTTCTGCATTCTTAC ACGGTGAATCGTTCAAGCAAATATATTTTTTT AGTAATGACTGTATAAAACCACAGCCAATCAA ATCGTTAAGCGATTCAGCACCTTACCTCAGGC TGGATGTACTGTACATCCATACAGTAACTCAC ATGCACTAAAATGGTGCAACCTGTTCAGGAGA TATTTTACCTGTATAAATAACCAGTATATTCA CAGCAAATCTGTATATATACCCAGCTTTTTGG GCGCACCAGATTGGTGCCCCAGAATGGTGCAT ACAGACTACTGTATATAAAAACAGTATAACTT TCGCCACTGGTCTGATTTCTAAGATGTACCTC AGTTTATACTGTACACAATAACAGTAATGGTT CTGCTGAATTGATTCAGGTCAGGCCAAATGGC ACTTGATACTGTATGAGCATACAGTATAATTG TTCCAGCTGGTCCGACCTATACTCTCGCCACT TCGTTTTCCTGTATGAAAAACCATTACTGTTA TTACACTCCTGTTAATCCATACAGCAACAGTA CGACTGAAACGCTTCAGCTAGGATAAGCGAAA TGACTGAAACGTTTTTGCCCTATGAGCTCCGG CATATTTACTGATGATATATACAGGTATTTAG TTCTTGAAACGTTTCAGCGCGATCTTGTCTTT CTGTTACACTGGATAGATAACCAGCATTCGGA ATCCTTCGCTGGATATCTATCCAGCATTTTTT GCACTGAATCGGTTAACTGTCCAGTCGACGGC CCACAATATTGGCTGTTTATACAGTATTTCAG

SLIDE 22

Probabilistic clustering of sequences Probabilistic clustering of sequences

AAGCACTATATTGGTGCAACATTCACATCGTG GTGATGAACTGTTTTTTTATCCAGTATAATTT ACTCATCTGGTACGACCAGATCACCTTGCGGA AAGCACCATGTTGGTGCAATGACCTTTGGATA AAGCTGAATCGATTTTATGATTTGGTTCAATT AGGCTGAATCGTTTCAATTCAGCAAGAGAGGA CATTAACTCATCGGATCAGTTCAGTAACTATT CCTCTTTACTGTATATAAAACCAGTTTATACT TCCGAACTGATCGGACTTGTTCAGCGTACACG ACTCACAACTGTATATAAATACAGTTACAGAT GTGCTGAAACCATTCAAGAGTCAATTGGCGCG ATCAAGCTGGTATGATGAGTTAATATTATGTT TTCCAATACTGTATATTCATTCAGGTCAATTT GTGGTGAATCGATACTTTACCGGTTGAATTTG CAGCATAACTGTATATACACCCAGGGGGCGGA GCCTTTTGCTGTATATACTCACAGCATAACTG CAGCGGCTGGTCCGCTGTTTCTGCATTCTTAC ACGGTGAATCGTTCAAGCAAATATATTTTTTT AGTAATGACTGTATAAAACCACAGCCAATCAA ATCGTTAAGCGATTCAGCACCTTACCTCAGGC TGGATGTACTGTACATCCATACAGTAACTCAC ATGCACTAAAATGGTGCAACCTGTTCAGGAGA TATTTTACCTGTATAAATAACCAGTATATTCA CAGCAAATCTGTATATATACCCAGCTTTTTGG GCGCACCAGATTGGTGCCCCAGAATGGTGCAT ACAGACTACTGTATATAAAAACAGTATAACTT TCGCCACTGGTCTGATTTCTAAGATGTACCTC AGTTTATACTGTACACAATAACAGTAATGGTT CTGCTGAATTGATTCAGGTCAGGCCAAATGGC ACTTGATACTGTATGAGCATACAGTATAATTG TTCCAGCTGGTCCGACCTATACTCTCGCCACT TCGTTTTCCTGTATGAAAAACCATTACTGTTA TTACACTCCTGTTAATCCATACAGCAACAGTA CGACTGAAACGCTTCAGCTAGGATAAGCGAAA TGACTGAAACGTTTTTGCCCTATGAGCTCCGG CATATTTACTGATGATATATACAGGTATTTAG TTCTTGAAACGTTTCAGCGCGATCTTGTCTTT CTGTTACACTGGATAGATAACCAGCATTCGGA ATCCTTCGCTGGATATCTATCCAGCATTTTTT GCACTGAATCGGTTAACTGTCCAGTCGACGGC CCACAATATTGGCTGTTTATACAGTATTTCAG

Probabilistic clustering of sequences Probabilistic clustering of sequences

AAGCACTATATTGGTGCAACATTCACATCGTG AAGCACCATGTTGGTGCAATGACCTTTGGATA ATGCACTAAAATGGTGCAACCTGTTCAGGAGA GCGCACCAGATTGGTGCCCCAGAATGGTGCAT a*GCAC*A*atTGGTGCaac****t***g** ACTCATCTGGTACGACCAGATCACCTTGCGGA CATTAACTCATCGGATCAGTTCAGTAACTATT TCCGAACTGATCGGACTTGTTCAGCGTACACG ATCAAGCTGGTATGATGAGTTAATATTATGTT TTCCAGCTGGTCCGACCTATACTCTCGCCACT TCGCCACTGGTCTGATTTCTAAGATGTACCTC CAGCGGCTGGTCCGCTGTTTCTGCATTCTTAC ****a*CTGgTc*Gat**GT******t***** AAGCTGAATCGATTTTATGATTTGGTTCAATT AGGCTGAATCGTTTCAATTCAGCAAGAGAGGA CTGCTGAATTGATTCAGGTCAGGCCAAATGGC GTGCTGAAACCATTCAAGAGTCAATTGGCGCG GTGGTGAATCGATACTTTACCGGTTGAATTTG CGACTGAAACGCTTCAGCTAGGATAAGCGAAA TGACTGAAACGTTTTTGCCCTATGAGCTCCGG TTCTTGAAACGTTTCAGCGCGATCTTGTCTTT ACGGTGAATCGTTCAAGCAAATATATTTTTTT GCACTGAATCGGTTAACTGTCCAGTCGACGGC ATCGTTAAGCGATTCAGCACCTTACCTCAGGC ** **gcTGAAtCG gcTGAAtCG* *TTcAg TTcAg**c************ **c************ GTGATGAACTGTTTTTTTATCCAGTATAATTT TGGATGTACTGTACATCCATACAGTAACTCAC ACAGACTACTGTATATAAAAACAGTATAACTT CCTCTTTACTGTATATAAAACCAGTTTATACT ACTCACAACTGTATATAAATACAGTTACAGAT AGTTTATACTGTACACAATAACAGTAATGGTT ACTTGATACTGTATGAGCATACAGTATAATTG TTCCAATACTGTATATTCATTCAGGTCAATTT CAGCATAACTGTATATACACCCAGGGGGCGGA GCCTTTTGCTGTATATACTCACAGCATAACTG TATTTTACCTGTATAAATAACCAGTATATTCA CAGCAAATCTGTATATATACCCAGCTTTTTGG TCGTTTTCCTGTATGAAAAACCATTACTGTTA TTACACTCCTGTTAATCCATACAGCAACAGTA CATATTTACTGATGATATATACAGGTATTTAG CTGTTACACTGGATAGATAACCAGCATTCGGA ATCCTTCGCTGGATATCTATCCAGCATTTTTT CCACAATATTGGCTGTTTATACAGTATTTCAG AGTAATGACTGTATAAAACCACAGCCAATCAA ****t*tACTGTATATa*A*ACAG********

SLIDE 23

Probabilistic clustering of sequences Probabilistic clustering of sequences

We are given a large dataset of sequences D.
We assume that each of the sequences was sampled from a WM.
We do not know how many different WMs were used to generate the data,

how many sequences were sampled from each WM, and which sequences were sampled from which WM.

We want to infer which sequences were drawn from the same WM.
The hypothesis-space of all possible solutions to this problem consists of

all possible partitions of the dataset D into subsets.

For example, for 3 sequences A, B,and C the 5 possible solutions are:

A B C C A B C B A B C A C B A

van Nimwegen et al. PNAS 99: 7323-7328

Probabilistic clustering of sequences Probabilistic clustering of sequences

Assume a given partition C where the dataset of sequences D is divided

into subsets c (clusters) with Sc the set of sequences in cluster c.

The probability to observe the data given the partition (with

is):

∏ ∏ ∏ ∏

∈ = ∈

⎥ ⎥ ⎥ ⎦ ⎤ ⎢ ⎢ ⎢ ⎣ ⎡ + = =

C c l i i C c c

c n c n S P C D P

1

)! 3 ) ( ( )! ( ! 3 ) ( ) | (

α α

1 = γ

We now need to assign a prior probability to all partitions C.
With a completely ignorant uniform prior the posterior over partitions

the posterior is simply proportional to the likelihood:

) (C P

) | ( ) | ( C D P D C P ∝

constant ) ( = C P

SLIDE 24

Example Example

Assume the data set consists of only these four sequences: s1 = aaacgattcagttaggc s2 = tcaagctaggtattacc s3 = aaccgtttcgattcgga s4 = tcgagaaaggtatcagc

P(s1,s3) P(s2,s4) = 0.543 aaacgattcagttaggc tcaagctaggtattacc aaccgtttcgattcgga tcgagaaaggtatcagc P(s1,s3,s4) P(s2) = 0.185 aaacgattcagttaggc tcaagctaggtattacc aaccgtttcgattcgga tcgagaaaggtatcagc P(s1,s2,s4) P(s3) = 0.084 aaacgattcagttaggc aaccgtttcgattcgga tcaagctaggtattacc tcgagaaaggtatcagc

The probabilities P(C|D) for several partitions (ways of clustering) the data:

P(s1)P(s3) P(s2,s4) = 0.084 aaacgattcagttaggc aaccgtttcgattcgga tcaagctaggtattacc tcgagaaaggtatcagc P(s1,s3) P(s2) P(s4) = 0.086 aaacgattcagttaggc tcaagctaggtattacc tcgagaaaggtatcagc aaccgtttcgattcgga

Monte Monte-

Carlo Markov chain sampling

Carlo Markov chain sampling

f the space of partitions
f the space of partitions
Pick a sequence at random and remove it from its current cluster.
Calculate the probabilities that result when the sequence is put

in any of the existing clusters, or by itself in a new cluster.

Select one of the partitions C’ with probability proportional to .
This guarantees that, in the limit of long time, each partition C is visited

according to its probability .

) | ' ( D C P ) | ' ( D C P

) | ( D C P

SLIDE 25

Monte Monte-

Carlo Markov chain sampling

Carlo Markov chain sampling

f the space of partitions
f the space of partitions

) | ( D C P

How do we summarize the distribution that we sample?
As we sample the contents of each of the clusters are constantly shifting.
Clusters can evaporate complete and reform later. Other clusters never

go empty but change their contents.

Currently no really satisfactory solution (maybe none exists).

time

First find the partition with globally maximal probability

through simulated annealing.

We sample from the distribution and slowly increase β until

an optimal state is reached.

We use as a reference state and “track” to what extent its clusters

remain intact during the sampling.

Anneal+Track Anneal+Track Strategy Strategy

) | ( D C P

β

) | ( D C P

*

C

*

C

SLIDE 26

Predicting Predicting regulons regulons in in E. coli

E. coli
McCue et al. (Nucl. Acids Res. 2001) and Rajewsky et al. (Genome res. 2002)

collected thousands of putative binding sites in E. coli by phylogenetic footprinting of proteo-gamma bacteria.

We clustered 2000 putative binding sites from each dataset.
Results McCue et al. data: 115 stable clusters, 21 corresponding to known

regulons and 94 new putative regulons.

Results Rajewsky et al. data: 65 stable clusters, 25 known, 40 new.
In total 150 new sites for known regulons and 500 sites in new regulons.

htpp://www.physics.rockefeller.edu/regulons/index.html

Example Cluster Example Cluster

Probability that site belongs to cluster Genes downstream of site. Inferred weight matrix (sequence motif recognized by transcription factor.)

SLIDE 27

GntII GntII system: catabolism of L system: catabolism of L-

idonate

idonate with D with D-

gluconate

gluconate as an intermediate as an intermediate

Additional genes in cluster: idnT = L-idonate transporter idnR = regulator idnDOTR operon gntT,gntU = gluconate transporters. b2740 = homologue of gluconate permease Bausch et. al.

J. Bacteriol. Vol 180,
p. 3074 (1998)

Entner-Doudoroff pathway

Discovering regulatory modules Discovering regulatory modules

(from Arnone, M. I. and Davidson, E. H., Development, 124(10):1851-64, 1997.)

SLIDE 28

Discovering regulatory modules Discovering regulatory modules

Gather sets of TFs for which binding sites are known and that (ideally)

are also known to interact in regulatory modules.

Search genome-wide for relatively short sequence segments,

i.e. 200-500 bp, that have a surprisingly high concentration of sequences that ‘match’ these WMs (i.e. that look like David Arnosti’s billboards).

Berman et al., PNAS (2002) 99 757-762

(first publication presenting the idea using simple filtering methods)

Ahab: Rajewsky N, Vergassola M, Gaul U, Siggia ED, BMC Bioinformatics (2002) 3 30

(presentation of algorithm with applications to Fly)

Smash: Zavolan M, Rajewsky N, Socci N, Gaasterland T, ICSSB (2003)

(essentially same algorithm with applications to human/mouse)

Stubb: Sinha S, van Nimwegen E, Siggia ED, ISMB (2003)

(extensions taking multiple species and correlations in neighboring sites into account.)

Bicoid Caudal Hunchback Knirps Kruppel Tailles TorRE sequence

Set of Weight Matrices for the gap gene transcription factors known to be involved in early body-patterning in fly. A `parse’ ρ of the sequence S in terms of hypothesized binding sites.

) | ( ρ S P

Probability of the observed sequence given the parse.

Discovering regulatory modules Discovering regulatory modules

Given a sequence we want to consider all ways in which the sequence can be “parsed” into binding sites for the set of TFs and assign probabilities.

SLIDE 29

Probability of the sequence given a parse Probability of the sequence given a parse

Probability of sequence at positions (i+1) through (i+L) assuming it is a site for WM w:

∏

= + − + +

+

= ≡

L k k s i L i L i L i

k i

w w s s s P w s P

1 1 1 ] , [

) | ..... ( ) | (

i+1 i+L sequence s

sequence s

Probability of a given parse ρ

s

b b s P = ) | (

Probability of a base not in a site (“background”):

∏ ∏ ∏

∈ ∈ ∉

⎥ ⎦ ⎤ ⎢ ⎣ ⎡ =

WMs sites ] , [ sites

) | ( ) | (

w j L j i s

w w i

w s P b S P ρ

The prior probability of different parses assumes each WM w has sites occurring with probability at each position of the sequence s

( )

∏

∈

=

WMs

) (

w n w

w

p P ρ

w

p

Probability of the sequence given a parse Probability of the sequence given a parse

{ } ( ) ∑

=

ρ

ρ ρ ) ( ) | ( | P s P w s P

For a given sequence we now want to calculate the probability of the data given the set of WMs . For this we have to sum over all parses ρ:

{ } ( )

w s P |

{ }

w

The sum is over all non-overlapping configurations of binding sites. Let denote the sum over parses for the first j bases of the sequence.

) , 1 ( j Z

∑

− + − =

+ − − w w w L j j j s

L j Z p w s s s P j Z p b j Z

j

) , 1 ( ) | .... ( ) 1 , 1 ( ) , 1 (

1 1 bg

This way, can be calculated in time . The function still depends on the set of probabilities . Formally we should integrate over these probabilities. However, this is computationally infeasible. In practice we find the set of that maximizes . Finally we calculate the ratio:

) , 1 ( L Z ) ( w L O ) , 1 ( L Z ) , (

w bg p

p ) , (

w bg p

p ) , 1 ( L Z

{ } ( ) ( )

bg | | s P w s P R =

SLIDE 30

ATG…..

10-30Kbp 500bp

Discovering regulatory modules Discovering regulatory modules

Slide a window of length 500 over the sequence and calculate

at each position. This gives a profile of R. { } ( ) ( )

bg | | s P w s P R =

ATG…..

10-30Kbp Predicted locations of modules

cut-off

Profile of R

A predicted module occurs at every window where R exceeds a cut-off.

Discovering regulatory modules Discovering regulatory modules

Ahab given a set of 9 transcription factors: Bcd, Hb, Cad, TorRE, D-stat, Kr, Kni, Gt, Tll
Run on upstream regions of 29 genes with gap and pair-rule patterns (750,000bp total).

SLIDE 31

Stubb Stubb: multi : multi-

species and site correlations

species and site correlations

aligned areas (all other sequence segments are not aligned).

species A species B species C species D

Align the sequence in reference species A with orthologous sequences:

Stubb Stubb: multi : multi-

species and site correlations

species and site correlations

aligned areas (all other sequence segments are not aligned).

species A species B species C species D

window containing a potential binding site

) | (

] , [

w s P

L i

probability of sequence given w for unaligned sequence is the same as before. Scan along sequence of the reference species

SLIDE 32

Stubb Stubb: multi : multi-

species and site correlations

species and site correlations

aligned areas (all other sequence segments are not aligned).

species A species B species C species D

window containing a potential binding site Sites in an aligned block are extended to all species in the block These sequence segments in aligned blocks will be scored according to an evolutionary model:

) | (

] , [

w S P

L i

is the probability to observe the set of bases S at the leafs given the phylogenetic tree and the WM w.

Stubb Stubb: multi : multi-

species and site correlations

species and site correlations

species A acgtaactagtga species B acgttgctagatg species C tcgttgctataat species D aggtagcgagaag

Potential site in aligned sequence block: Probability of the set of bases S in each column is independent of the other columns.

species A species B species C species D x y z

Species phylogenetic tree

) | ( w S P

S

g g g x y z a

) | ( w S P

SLIDE 33

Stubb Stubb: multi : multi-

species and site correlations

species and site correlations

Probability along a single branch:

) , , | ( w t y x P

probability to end up with base x after time t, starting from base y.

x y t

∑

← − ← =

z

w t y x P w x z w t y z P w z x dt w t y x dP ) , , | ( ) | ( ) , , | ( ) | ( ) , , | ( μ μ General time evolution: Assumptions:

) | ( ) | ( w x w y x μ μ = ←

x

w w y x P = ∞ ) , , | (

Solution:

( )

t x t xy

e w e w t y x P

μ μ

δ

− −

− + = 1 ) , , | (

μ μ

x

w w x = ) | ( ( )

q w q w q y x P

x xy

− + = 1 ) , , | ( δ

In terms of no-mutation probability q:

Evolutionary model Evolutionary model

g g g x y z a

( )( )L

g yC yC yg g xD xD xg y xy z y x xy xy x

w q q w q q w q q w w S P ) 1 ( ) 1 ( ) ) 1 ( ( ) | (

, ,

− + − + − + = ∑ δ δ δ

The probability of the bases at the leafs is the product over the probabilities of each of the branches, summed over the possible bases at the internal nodes:

) | ( w S P

and similarly the probability given the background:

( )( )L

g yC yC yg g xD xD xg y xy z y x xy xy x

b q q b q q b q q b b S P ) 1 ( ) 1 ( ) ) 1 ( ( ) | (

, ,

− + − + − + = ∑ δ δ δ

∏

= + +

=

L k k i k k i L i

b S P w S P w S R

1 ] , [

) | ( ) | ( ) | ( Finally, the ratio of the probability of the whole block of aligned sequences under WM and background model:

SLIDE 34

Stubb Stubb: multi : multi-

species and site correlations

species and site correlations

Significant increase of the number of correct predictions genome-wide when using melanogaster/pseudoobscura alignments.

Ab Ab initio initio discovery of regulatory sites discovery of regulatory sites

General Approaches:

1. Collect sets of (intergenic) sequences that are thought to contain

binding sites for a common regulatory factor. Examples:

Upstream regions of co-regulated genes.
Sequence fragments pulled down with ChrIP

then search for overrepresented short sequence motifs.

2. Phylogenetic footprinting: create multiple alignments of
rthologous intergenic sequences and identify sequence

segments more conserved than “average”.

SLIDE 35

Phylogenetic Phylogenetic Footprinting Footprinting

From: Kellis et al. Nature 423 251-254 (2003)

Phylogenetic Phylogenetic Footprinting Footprinting

Scer ATGTTTTTTTAATGATATATGTAACGTACATTCTTTCCTCTACCACTGCCAATTCGGTATTATTTAATTGTGTTTAGCGCTATTTAC Spar -ATGTTTTTTAATGATATATGTAACGTACATTCTTC---CTACTGCTACCAAGTCGGTATTATTTAATTGTGTTTAGCGCTATTTAC Smik --------------------------TCTTTTCTCTA--CCACTACTACCAATTCGGTATTATTTAATTGTGTTTAGCACTATTTAC Sbay --ATGTTCTTAATGATATATATAACGTACATTTTTT---CCTCTACTAGCCAATCGGTGTTATTTAATTGTGTTTAGCTCTATTTAC * ** * * * ** * * ***** ******************* ******** Scer TAATTAACTAGAAACTCAATTTTTAAAGGCAAAGCTCGCTGACCT--TTCACTGATTTCGTGGATGTTATACTATCAGTTACTCTTC Spar CCACTAACTAGAAACTCGATTTTTAAAGGCAAAATTCAGTGTCCT--TTCACTAGTTTTGCAGATGTCCTGCTATCAGCTACTTCCC Smik TCACTAAC-AAAAACTCAATTTTGAAGGGCTGA-TTAAATATCCTCCTTTAATAGTTTTGCGCTTAGCCTGTTATCA--TATAAGTA Sbay TCACTTAACAAAAAAACCAACTTCAAAAGTATAATACAATAATTTC-TCCGTTGATCTTGTGAACTACATGCTATCACTTATTTGCC * * * * *** * * ** ** * * * * * * * * * * ***** ** Scer TGCAAAAAAAAA-----------TTGAGTCATATCGTAGCTTTGGGATTATTTTTCT-CTCTCTCCACGGCTAATTAGGTGATCATG Spar TGCAGAAAAGAAAAATA-----TTTGAGTCATATCATCGTCTAGGAAGTGTTTTTCT-CTCTCTCCACGGATAGTTAAGTGATCATG Smik TACAAAAAGAGAATAT------TTTGAGTCATATCATCGCCTAGGAAGTATTTTTTTTCTCTCTTCACGGTTAATTAGGTGATTTCT Sbay TGTAAAAAGAAAATCGTTTCGTTTTGAGTCATATCATGTTCTCATAA-TATTTTTTT--TTCCTTAGCGATTAA------------- * * *** * ************ * * * * ***** * * ** ** ** Scer AAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATC-GAAACATACATAA--GTTGATATTC-CTTTGATATCG-----ACGACTA Spar AAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATC-GAAACATACATAA--ATTGATATTC-CTTTAGCTTTT----AAAGACTA Smik GAAAAACGAAAAATTCATG-GAAAAGAGTCAACCGTC-GAAACATACATAA--ACCGATATTT-CTTTAGCTTTCGACAAAAATCTG Sbay GAAAAATAAAAAGTGATTG-GAAAAGAGTCAGATCTCCAAAACATACATAATAACAGGTTTTTACATTAGCTTTT----GAAAACTA ***** **** * ** *********** ** ************ * * ** * ** * * ** Scer CTCAATCAGG-TTTTAAAAGAAAAGAGGCA-GCTATTGAAGTAGCAGT-ATCCAGTTTAGGTTTTTTAATTATTTACAAGTAAA-GA Spar CTCAATCAA--GTTTAATAGAAGAAAGAGG-AAGGTTGAGATAGGTAT-ATCCAGTTTAGGTTTC--AATTATTTAATAATAAA-GG Smik CAATATTCATTATTCAAAACTCAAAAGAAG-AAGGTTCGAATTGGTGT-GTCCAGTTTAGGCTCT--AATTGTTGAATAATAAAAGG Sbay TCCACCACAA-ATTGAAGGTGAGGAAGAAACAAAGTTAAAGCAAGAATCGGCTTGTGTCTTTTTT--GATTGCGTATT--TGAAAGG ** ** ** ** * * ** * * *** * * ** * Scer AAAAGAGA-------------- Spar TAAAGAA--------------- Smik CGAAGAAATAACGATCCAAAAA Sbay TAAAGGAATACAACAAAAA--- ***

His7

GCN4 GCN4 ABF1

SLIDE 36

Ab Ab initio initio discovery of regulatory sites discovery of regulatory sites

General Approaches:

1. Collect sets of (intergenic) sequences that are thought to contain

binding sites for a common regulatory factor. Examples:

Upstream regions of co-regulated genes.
Sequence fragments pulled down with ChrIP

then search for overrepresented short sequence motifs.

2. Phylogenetic footprinting: create multiple alignments of
rthologous intergenic sequences and identify sequence

segments more conserved than “average”.

PhyloGibbs, combines these two approaches into a single procedure.

Ab Ab initio identification of regulatory motifs initio identification of regulatory motifs

Challenges:

Multiple alignments of intergenic DNA are often problematic.

We only align the unambiguous areas and consider all solutions consistent with the alignment of the unambiguous areas.

Sites occur both in aligned and non-alignable segments.

We simultaneously search for sites in both aligned blocks and unaligned segments.

The number of sites and the number of different kinds of sites is unknown.

We search over an arbitrary number of sites and motifs.

The entire phylogeny of the input sequences needs to be taken into account.

We use an explicit model for the evolution of regulatory sites on the phylogenetic tree.

The reliability of the predictions needs to be assessed internally.

We use Monte-Carlo Markov chain sampling to rigorously assign posterior probabilities to all reported sites and motifs.

SLIDE 37

AGAAGAAAGTAAttcttATGAGAAAATTTGCGGGAGTTCTTTGCCAGTGAGATAAAGtttttttt-------AATTTTAATCAACACAAAATACACATATTTATATAAACTGacgaaata- TGAAAAAAGTAAccttcATGAGATATATTGCGGAAGTCCATTACCAGTAAGTTAGAGttagaaaatttcgatcgacacaatttatacttcgatatatactggcaaaaaa------------ tgggaggaaaaaaaccattacctgtatgaaaaagattgcaaggattcctttgttagtgaactgaactTTAGGGATTTTAATCAACACAGTATATACATATatctttgtatactgacaaata agtaagctatatgaaaagtttcctttagcagtaaatttagagc------------------------TTAGGAATTTTGATCAAGACACAATATATATAGCTTTATATATTGtcaaata—-

Produce local multiple alignment to identify orthologous regions in upstream regions.

Algorithm developed with Rahul Siddharthan and Eric Siggia (Rockefeller)

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibbs phylogibbs algorithm algorithm

AGAAGAAAGTAAttcttATGAGAAAATTTGCGGGAGTTCTTTGCCAGTGAGATAAAGtttttttt-------AATTTTAATCAACACAAAATACACATATTTATATAAACTGacgaaata- TGAAAAAAGTAAccttcATGAGATATATTGCGGAAGTCCATTACCAGTAAGTTAGAGttagaaaatttcgatcgacacaatttatacttcgatatatactggcaaaaaa------------ tgggaggaaaaaaaccattacctgtatgaaaaagattgcaaggattcctttgttagtgaactgaactTTAGGGATTTTAATCAACACAGTATATACATATatctttgtatactgacaaata agtaagctatatgaaaagtttcctttagcagtaaatttagagc------------------------TTAGGAATTTTGATCAAGACACAATATATATAGCTTTATATATTGtcaaata—-

Produce local multiple alignment to identify orthologous regions in upstream regions.
Sample all configurations of assigning sites to the the upstream regions.

Algorithm developed with Rahul Siddharthan and Eric Siggia (Rockefeller)

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibbs phylogibbs algorithm algorithm

SLIDE 38

AGAAGAAAGTAAttcttATGAGAAAATTTGCGGGAGTTCTTTGCCAGTGAGATAAAGtttttttt-------AATTTTAATCAACACAAAATACACATATTTATATAAACTGacgaaata- TGAAAAAAGTAAccttcATGAGATATATTGCGGAAGTCCATTACCAGTAAGTTAGAGttagaaaatttcgatcgacacaatttatacttcgatatatactggcaaaaaa------------ tgggaggaaaaaaaccattacctgtatgaaaaagattgcaaggattcctttgttagtgaactgaactTTAGGGATTTTAATCAACACAGTATATACATATatctttgtatactgacaaata agtaagctatatgaaaagtttcctttagcagtaaatttagagc------------------------TTAGGAATTTTGATCAAGACACAATATATATAGCTTTATATATTGtcaaata—-

Produce local multiple alignment to identify orthologous regions in upstream regions.
Sample all configurations of assigning sites to the the upstream regions
Score sites in aligned regions according to phylogeny.

Algorithm developed with Rahul Siddharthan and Eric Siggia (Rockefeller)

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibbs phylogibbs algorithm algorithm

Sites in aligned region scored according to phylogeny

Phylogenetic Phylogenetic Scoring Scoring

Probability of a base mutating from x to y given proximity q:

) 1 ( ) , | (

xy y xy xy

q w q w q y x P − + = → δ

Probability for observed bases at leafs given phylogeny and WM w:

species A species B species C species D x y z g g g x y z a

( )( )L

g yC yC yg g xD xD xg y xy z y x xy xy x

w q q w q q w q q w w S P ) 1 ( ) 1 ( ) ) 1 ( ( ) | (

, ,

− + − + − + = ∑ δ δ δ

S

Similarly for the background:

( )( )L

g yC yC yg g xD xD xg y xy z y x xy xy x

b q q b q q b q q b b S P ) 1 ( ) 1 ( ) ) 1 ( ( ) | (

, ,

− + − + − + = ∑ δ δ δ

SLIDE 39

AGAAGAAAGTAAttcttATGAGAAAATTTGCGGGAGTTCTTTGCCAGTGAGATAAAGtttttttt-------AATTTTAATCAACACAAAATACACATATTTATATAAACTGacgaaata- TGAAAAAAGTAAccttcATGAGATATATTGCGGAAGTCCATTACCAGTAAGTTAGAGttagaaaatttcgatcgacacaatttatacttcgatatatactggcaaaaaa------------ tgggaggaaaaaaaccattacctgtatgaaaaagattgcaaggattcctttgttagtgaactgaactTTAGGGATTTTAATCAACACAGTATATACATATatctttgtatactgacaaata agtaagctatatgaaaagtttcctttagcagtaaatttagagc------------------------TTAGGAATTTTGATCAAGACACAATATATATAGCTTTATATATTGtcaaata—-

Produce local multiple alignment to identify orthologous regions in upstream regions.
Sample all configurations of assigning sites to the upstream regions
Score sites in aligned regions according to phylogeny.

Algorithm developed with Rahul Siddharthan and Eric Siggia (Rockefeller)

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibbs phylogibbs algorithm algorithm

Sites in aligned region scored according to phylogeny

) | ( ) ( ) | ( ) | ( ) ( ) ( b S P dw w P w S P b S P S P S R

∫

= =

S

AGAAGAAAGTAAttcttATGAGAAAATTTGCGGGAGTTCTTTGCCAGTGAGATAAAGtttttttt-------AATTTTAATCAACACAAAATACACATATTTATATAAACTGacgaaata- TGAAAAAAGTAAccttcATGAGATATATTGCGGAAGTCCATTACCAGTAAGTTAGAGttagaaaatttcgatcgacacaatttatacttcgatatatactggcaaaaaa------------ tgggaggaaaaaaaccattacctgtatgaaaaagattgcaaggattcctttgttagtgaactgaactTTAGGGATTTTAATCAACACAGTATATACATATatctttgtatactgacaaata agtaagctatatgaaaagtttcctttagcagtaaatttagagc------------------------TTAGGAATTTTGATCAAGACACAATATATATAGCTTTATATATTGtcaaata—-

Produce local multiple alignment to identify orthologous regions in upstream regions.
.Sample all configurations of assigning sites to the upstream regions.
Score sites in aligned regions according to phylogeny.

Algorithm developed with Rahul Siddharthan and Eric Siggia (Rockefeller)

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibss phylogibss algorithm algorithm

Scored according to independent model.

1

s

2

s

) | ( ) | ( ) ( ) | ( ) | ( ) | ( ) | ( ) , ( ) , (

2 1 2 1 2 1 2 1 2 1

b s P b s P dw w P w s P w s P b s P b s P s s P s s R

∫

= =

SLIDE 40

AGAAGAAAGTAAttcttATGAGAAAATTTGCGGGAGTTCTTTGCCAGTGAGATAAAGtttttttt-------AATTTTAATCAACACAAAATACACATATTTATATAAACTGacgaaata- TGAAAAAAGTAAccttcATGAGATATATTGCGGAAGTCCATTACCAGTAAGTTAGAGttagaaaatttcgatcgacacaatttatacttcgatatatactggcaaaaaa------------ tgggaggaaaaaaaccattacctgtatgaaaaagattgcaaggattcctttgttagtgaactgaactTTAGGGATTTTAATCAACACAGTATATACATATatctttgtatactgacaaata agtaagctatatgaaaagtttcctttagcagtaaatttagagc------------------------TTAGGAATTTTGATCAAGACACAATATATATAGCTTTATATATTGtcaaata—-

Produce local multiple alignment to identify orthologous regions in upstream regions.
Sample all configurations of assigning sites to the upstream regions.
Score sites in aligned regions according to phylogeny.

Algorithm developed with Rahul Siddharthan and Eric Siggia (Rockefeller)

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibbs phylogibbs algorithm algorithm

Phylogeny based score combined with site for same WM at phylogenetically unrelated position.

) | ( ) | ( ) ( ) | ( ) | ( ) | ( ) | ( ) , ( ) , ( b s P b S P dw w P w s P w S P b s P b S P s S P s S R

∫

= =

s S

AGAAGAAAGTAAttcttATGAGAAAATTTGCGGGAGTTCTTTGCCAGTGAGATAAAGtttttttt-------AATTTTAATCAACACAAAATACACATATTTATATAAACTGacgaaata- TGAAAAAAGTAAccttcATGAGATATATTGCGGAAGTCCATTACCAGTAAGTTAGAGttagaaaatttcgatcgacacaatttatacttcgatatatactggcaaaaaa------------ tgggaggaaaaaaaccattacctgtatgaaaaagattgcaaggattcctttgttagtgaactgaactTTAGGGATTTTAATCAACACAGTATATACATATatctttgtatactgacaaata agtaagctatatgaaaagtttcctttagcagtaaatttagagc------------------------TTAGGAATTTTGATCAAGACACAATATATATAGCTTTATATATTGtcaaata—-

Produce local multiple alignment to identify orthologous regions in upstream regions.
Sample all configurations of assigning sites to the upstream regions.
Score sites in aligned regions according to phylogeny.

Algorithm developed with Rahul Siddharthan and Eric Siggia (Rockefeller)

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibbs phylogibbs algorithm algorithm

Score of the binding site configuration C is product of the scores for each ‘color’:

) , ( ) , ( ) ( ) | (

red red blue 2 blue 1 yellow

S s R s s R S R C D R =

SLIDE 41

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibbs phylogibbs algorithm algorithm Motif finding strategy:

Anneal: Sample all configurations according to

slowly increase β.

Take configuration C* at end of anneal as reference configuration.
Sample all configurations according to

and track the average membership of each site group of the reference configuration.

β

) | ( D C P

Posterior probability configuration C:

. ion configurat

n

prior ) ( C C P =

) ( ) ( ) ( ) | ( D R C P C R D C P =

Test set: Test set: the the Saccharomyces Saccharomyces Cerevisiae Cerevisiae Promoter Database Promoter Database Sensu Sensu Stricto Stricto Saccharomyces Saccharomyces Species Species

After ‘clean up’ 437 known binding sites upstream of 200 cerevisiae genes.

Sensu stricto Saccharomyces species: S. cerevisiae
S. bayanus
S. paradoxus
S. miketae
S. kudriavzevii
803 sequences in total for the 200 genes. Thus about 4 orthologs per gene.
We run phylogibbs separately on the alignment of each upstream region.
As a function of posterior probability p we gather all predicted sites with probability >= p.
For each cut-off p we calculate:

specificity: fraction of predicted sites that hit known sites. sensitivity: fraction of known sites that are hit by at least 1 predicted site.

SLIDE 42

Identifying TF binding sites in species Identifying TF binding sites in species with different degrees of evolutionary relatedness: with different degrees of evolutionary relatedness: The The phylogibbs phylogibbs algorithm algorithm Motif finding strategy:

Anneal: Sample all configurations according to

slowly increase β.

Take configuration C* at end of anneal as reference configuration.
Sample all configurations according to

and track the average membership of each site group of the reference configuration.

β

) | ( D C P

Posterior probability configuration C:

. ion configurat

n

prior ) ( C C P =

) ( ) ( ) | ( ) | ( D P C P C D P D C P =

Test set: Test set: the the Saccharomyces Saccharomyces Cerevisiae Cerevisiae Promoter Database Promoter Database Sensu Sensu Stricto Stricto Saccharomyces Saccharomyces Species Species

After ‘clean up’ 437 known binding sites upstream of 200 cerevisiae genes.

Sensu stricto Saccharomyces species: S. cerevisiae
S. bayanus
S. paradoxus
S. miketae
S. kudriavzevii
803 sequences in total for the 200 genes. Thus about 4 orthologs per gene.
We run phylogibbs separately on the alignment of each upstream region.
As a function of posterior probability p we gather all predicted sites with probability >= p.
For each cut-off p we calculate:

specificity: fraction of predicted sites that hit known sites. sensitivity: fraction of known sites that are hit by at least 1 predicted site.

(Michael Zhang, Cold Spring Harbor)

SLIDE 43

Results on sites from SCPD Results on sites from SCPD

Estimated fraction of predicted sites matching known sites as a function of the

fraction of known sites covered by predictions.

The fractions vary by varying the cut-off of posterior site probability of the predictions.

PhyloGibbs PhyME EMnEM Gibbs sampler MEME

ChIP-on-chip for 203 yeast DNA binding proteins.
Ran 6 different ab initio motif finding algorithms on upstream regions that

were pulled down by a given protein.

Defined motifs for 116 DNA binding proteins.
We focus on 45 TFs that have between 3 and 25 annotated sites.
In 21 cases all computational methods failed to identify a motif and the

reported motif simply copies the motif reported in the literature.

Results on gene groups identified by ChIP-on-chip

SLIDE 44

Results on gene groups identified by ChIP-on-chip

24 TFs with computationally identified motifs: PhyloGibbs finds motif matching literature in 16 out of 21 cases.

Results on gene groups identified by ChIP-on-chip

21 TFs with WMs based on the experimental literature only:

SLIDE 45

Comparing literature with ChIP Comparing literature with ChIP-

on
n-
chip data

chip data

found literature motif 1 9 6 RLM1 found literature motif 3 3 GZF3 found literature motif 6 5 SKO1 found literature motif 1 5 4 MAC1 found literature motif 1 3 28 YOX1

3

YAP6 found literature motif 11 3 ROX1 literature motif not found 8 3 MOT3 found literature motif 8 4 DAL80 found literature motif 1 10 3 ADR1 found literature motif 4 5 MET31 found literature motif 1 4 8 GCR1 PhyloGibbs on lit. targets

verlap

ChIP-on-chip targets literature targets TF

The target genes in the literature show little overlap with targets predicted through ChIP-on-chip.
In 3 of 4 cases where the known motif was not found on the ChIP targets it was found when

PhyloGibbs was run on the literature targets.

Genome browser: http:// Genome browser: http://www.swissregulon.unibas.ch www.swissregulon.unibas.ch

More than 4000 predicted sites genome-wide.

Developed by Mikail Pachkov

SLIDE 46

Webserver Webserver: : http://www.phylogibbs.unibas.ch/cgi

http://www.phylogibbs.unibas.ch/cgi-

bin/phylogibbs.pl

bin/phylogibbs.pl

Initial results Initial results

Clustering 397 known sites for 54 different TFs from E. coli:

Annealing finds 30 clusters: 24 known regulons, 3 false positives. (finds 75% of regulons represented by more than 3 sites)

Clustering 2000 putative sites extracted from phylogenetic footprinting of

proteo-gamma bacteria: on average 350 clusters, only 18 (5%) significant.

Embedding known sites in set of 2000 putative sites:

Only 9 of 24 remain significant.

Surrogate data set: 250 clusters, 8 sites each with

Information scores same as for the known sites: 25 (10%) recovered by the algorithm. How does the cell tell the sites for different TFs apart if we cannot ?

E. van Nimwegen, Otto Warburg summer school, Aug 2005.

SLIDE 47

Clustering vs. Classifying Clustering vs. Classifying

Clustering: We are only given a set of sequences. Classifying: TFs know which motifs they bind most strongly. Thus, the cell possesses the centers of the clusters.

E. van Nimwegen, Otto Warburg summer school, Aug 2005.

Clusterability Clusterability and and Classifiability Classifiability

3 samples per cluster 5 samples per cluster 10 samples per cluster 15 samples per cluster

Dashed line: minimal sequence specificity for classifying.
Solid lines: minimal sequence specificity for clustering.
Grey bars: approximate range of specificities for known regulons.
E. van Nimwegen, Otto Warburg summer school, Aug 2005.

SLIDE 48

Evolution leads to Evolution leads to unclusterable unclusterable sequences. sequences.

Under mutation a site is likely to move outside its ball, i.e. the mutated site is no longer recognized. Sites are robust under

mutation. TFs are as

non-specific as allowed by the classifiability requirement.

Analogy from Information theory:

Optimally encoded mes- sages appear random to receivers that do not posses the code-book.

E. van Nimwegen, Otto Warburg summer school, Aug 2005.