
SLIDE 1

Regulatory Motif Prediction in DNA

Erik van Nimwegen

Division of Bioinformatics Biozentrum, Universität Basel, Swiss Institute of Bioinformatics

Introduction: toward transcription regulatory networks
Ab initio discovery of motifs by over-representation of regular expressions
The weight matrix representation of regulatory motifs.
Ab initio discovery with weight matrices: MEME and the Gibbs Sampler
Discovery of regulatory modules in higher eukaryotes.
Ab initio regulatory motif discovery in phylogenetically related sequences: PhyloGibbs

E. van Nimwegen, EMBnet Geneve, Feb 2006.

SLIDE 2

Transcription Regulation Networks

[Figure: genes (ATG…), their promoters, and regulators (transcription factors).]

SLIDE 3

Transcription Regulation Networks

[Figure: genes (ATG…), their promoters, regulators (transcription factors), and binding sites.]

SLIDE 4

Transcription Regulation Networks

[Figure: genes (ATG…), their promoters, regulators (transcription factors), binding sites, and the resulting regulatory network.]

To reconstruct the network we need to identify all binding sites genome-wide and the factor(s) that bind at each site.

SLIDE 5

Transcription Regulation Networks

The number of transcription regulators increases roughly quadratically with the size of the genome.
The number of regulators per gene thus increases linearly with the size of the genome.

[Figure legend: metabolic genes, transcription factors, cell cycle related genes.] From: E. van Nimwegen, Trends in Genetics 19, 479-484 (2003).
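The step from quadratic to linear scaling can be written out explicitly. The symbols here are introduced only for illustration: $G$ stands for the size of the genome (for example its number of genes) and $R$ for the number of transcription regulators.

```latex
R \propto G^{2} \quad\Longrightarrow\quad \frac{R}{G} \propto G
```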

SLIDE 6

Transcription Regulation Networks

Knowledge from direct experimentation:

E. coli:
Almost 200,000 papers in PubMed. Over 17,000 on transcription.
About 300 TFs.
Fewer than 100 TFs with at least 1 known binding site.
About 750 known sites in total (of 2,500-8,000?).
S. cerevisiae:
Almost 60,000 papers in PubMed. Over 10,000 on transcription.
About 350 TFs.
About 65 TFs with at least 1 known binding site.
About 450 known sites in total (of > 10,000?).

Even in intensely studied model organisms the majority of regulatory sites are not known.

SLIDE 7

Ab initio discovery of regulatory sites

General Approaches:

1. Collect sets of (intergenic) sequences that are thought to contain binding sites for a common regulatory factor, then search for overrepresented short sequence motifs among them. Examples:

Upstream regions of co-regulated genes.
Sequence fragments pulled down with ChIP.

[Figure: microarray experiments (gene expression), binding experiments (ChIP-on-chip), and other external biological knowledge each yield sets of sequences containing sites for a common regulatory factor.]

SLIDE 8

Representation by consensus sequence or regular expression

The experimentally known binding sites of MBP1 (yeast TF):

ACGCGT ACGCGT ACGCGA ACGCGT ACGCGA CCGCGT TCGCGA ACGCGT ACGCGT ACGCGT ACGCGT ACGCGT

Consensus sequence: ACGCGT (take the majority base in each column)
Regular expression: HCGCGW (take the IUPAC symbol for the sequences occurring in each column)

So-called IUPAC symbols are used to represent sets of nucleotides. For instance:

W = {A,T} and H = {A,C,T}
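The two column-wise rules can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the site list is the set of MBP1 sites as read off the slide (assuming the last two strings on the slide are the consensus and the regular expression, not sites).

```python
from collections import Counter

# IUPAC nucleotide codes, keyed by the set of bases each symbol stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def consensus(sites):
    """Majority base in each column of equal-length aligned sites."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def iupac_expression(sites):
    """IUPAC symbol for the set of bases occurring in each column."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*sites))

# Assumed reading of the slide: twelve known MBP1 sites.
MBP1_SITES = ["ACGCGT", "ACGCGT", "ACGCGA", "ACGCGT", "ACGCGA", "CCGCGT",
              "TCGCGA", "ACGCGT", "ACGCGT", "ACGCGT", "ACGCGT", "ACGCGT"]

print(consensus(MBP1_SITES))         # ACGCGT
print(iupac_expression(MBP1_SITES))  # HCGCGW
```

Note how the regular expression keeps rare variants (the single C and T in the first column) that the consensus discards.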

SLIDE 9

Scan for over-represented patterns

[Figure: upstream regions of Gene A through Gene E.]

Exhaustively go through all possible consensus sequences (or regular expressions) s up to some length L.

SLIDE 10

Scan for over-represented patterns

[Figure: upstream regions of Gene A through Gene E, with the occurrences AGCTCG, TGCTCG, TGCTCG, TGCACG, AGCACG marked.]

Exhaustively go through all possible consensus sequences (or regular expressions) s up to some length L.

For a given motif, say s = WGCWCG, find all occurrences.

SLIDE 11

Scan for over-represented patterns

[Figure: upstream regions of Gene A through Gene E, with the occurrences AGCTCG, TGCTCG, TGCTCG, TGCACG, AGCACG marked.]

Exhaustively go through all possible consensus sequences (or regular expressions) s up to some length L.

For a given motif, say s = WGCWCG, find all occurrences.
Determine the significance of the motif. Roughly speaking, the significance is given by the probability of getting this many occurrences in random sequences, e.g. P(WGCWCG) = 0.034

SLIDE 12

Scan for over-represented patterns

[Figure: upstream regions of Gene A through Gene E, with the occurrences AGCTCG, TGCTCG, TGCTCG, TGCACG, AGCACG marked.]

Exhaustively go through all possible consensus sequences (or regular expressions) s up to some length L.

For a given motif, say s = WGCWCG, find all occurrences.
Determine the significance of the motif. Roughly speaking, the significance is given by the probability of getting this many occurrences in random sequences, e.g. P(WGCWCG) = 0.034

Rank all motifs by significance and report the motifs with highest significance.
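A minimal sketch of the "find all occurrences, then assess significance" steps for one motif is given below. The significance model is an assumption made for illustration, not the one used by YMF or Weeder: bases are treated as independent and uniform, and the match count is approximated as Poisson. The function names are hypothetical.

```python
import math
import re

# Bases allowed by each IUPAC symbol.
IUPAC_SETS = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
              "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
              "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def find_occurrences(motif, sequences):
    """Return every (possibly overlapping) match of an IUPAC motif."""
    body = "".join("[" + IUPAC_SETS[c] + "]" for c in motif)
    pattern = re.compile("(?=(" + body + "))")  # lookahead keeps overlapping hits
    return [m.group(1) for seq in sequences for m in pattern.finditer(seq)]

def poisson_tail(n_observed, motif, sequences):
    """P(at least n_observed matches) under the assumed uniform background."""
    p_match = math.prod(len(IUPAC_SETS[c]) / 4 for c in motif)
    n_positions = sum(max(len(s) - len(motif) + 1, 0) for s in sequences)
    lam = p_match * n_positions  # expected number of matches
    below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                for k in range(n_observed))
    return 1.0 - below

toy = ["TTAGCTCGAA", "CCTGCACGTT"]          # made-up upstream regions
hits = find_occurrences("WGCWCG", toy)      # ['AGCTCG', 'TGCACG']
print(hits, poisson_tail(len(hits), "WGCWCG", toy))
```

Ranking motifs then amounts to sorting all candidate patterns by this tail probability, smallest first.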

SLIDE 13

Over-representation of consensus and regular expression patterns

Example algorithms:

YMF (Sinha and Tompa)
Weeder (Pavesi et al.)

Advantages:

The search is exhaustive. If a significant motif exists it is guaranteed to be found.

Disadvantages:

Consensus sequences and regular expressions are not necessarily a good representation of binding sites. (next slides)
The significant motifs are often partially redundant. For example:

ATTACTAT WWACTWTTA AATTAC ATTACGG

Now which motif is the “correct” motif?

SLIDE 14

The weight matrix representation of regulatory motifs

Alignment of known fruR binding sites:

CTGAATCGATTTTAT
CTGAATCGTTTCAAT
CTGAATTGATTCAGG
CTGAAACCATTCAAG
GTGAATCGATACTTT
CTGAAACGCTTCAGC
CTGAAACGTTTTTGC
TTGAAACGTTTCAGC
GTGAATCGTTCAAGC
CTGAATCGGTTAACT
GTTAAGCGATTCAGC
cTGAAtCG*TTcAg*

$w^i_\alpha$ = probability of finding base $\alpha$ at position $i$. For instance: $w^1_A = 0.07$, $w^1_C = 0.53$, $w^1_G = 0.27$, $w^1_T = 0.13$.

Probability that a site for the TF represented by w will have sequence s:

$P(s|w) = \prod_{i=1}^{l} w^i_{s_i}$
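As a sketch, the weight matrix can be estimated from the fruR alignment by counting bases per column. The pseudocount of 1 per base is an assumption; it is chosen because it reproduces the first-column values quoted on the slide (0.07, 0.53, 0.27, 0.13). Function names are illustrative.

```python
FRUR_SITES = [
    "CTGAATCGATTTTAT", "CTGAATCGTTTCAAT", "CTGAATTGATTCAGG", "CTGAAACCATTCAAG",
    "GTGAATCGATACTTT", "CTGAAACGCTTCAGC", "CTGAAACGTTTTTGC", "TTGAAACGTTTCAGC",
    "GTGAATCGTTCAAGC", "CTGAATCGGTTAACT", "GTTAAGCGATTCAGC",
]

def weight_matrix(sites, pseudocount=1.0):
    """One dict per position: w[i][base] = (count + pseudocount) / (n + 4 * pseudocount)."""
    n = len(sites)
    return [
        {b: (column.count(b) + pseudocount) / (n + 4 * pseudocount) for b in "ACGT"}
        for column in zip(*sites)
    ]

def site_probability(site, wm):
    """P(s|w): product over positions of the probability of the observed base."""
    p = 1.0
    for i, base in enumerate(site):
        p *= wm[i][base]
    return p

wm = weight_matrix(FRUR_SITES)
print({b: round(v, 2) for b, v in wm[0].items()})
print(site_probability("CTGAATCGATTCAGC", wm))
```

Unlike a consensus string, the matrix assigns a graded probability to every 15-mer, so near-matches are scored instead of being rejected outright.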

SLIDE 15

The weight matrix representation of regulatory motifs

Alignment of known fruR binding sites:

CTGAATCGATTTTAT
CTGAATCGTTTCAAT
CTGAATTGATTCAGG
CTGAAACCATTCAAG
GTGAATCGATACTTT
CTGAAACGCTTCAGC
CTGAAACGTTTTTGC
TTGAAACGTTTCAGC
GTGAATCGTTCAAGC
CTGAATCGGTTAACT
GTTAAGCGATTCAGC
cTGAAtCG*TTcAg*

The quality of an alignment of putative sites can be measured by the Information score I:

$f^i_\alpha = \frac{n^i_\alpha}{n}, \quad b_\alpha = \text{background}, \quad I = \sum_{i,\alpha} f^i_\alpha \log\left(\frac{f^i_\alpha}{b_\alpha}\right)$

$\frac{P(\text{sites from a WM})}{P(\text{sites from bg})} \approx e^{nI}$
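A short sketch of the information score follows. It assumes a uniform background unless one is supplied, uses natural logarithms (so that the $e^{nI}$ relation applies directly), and treats columns with $f = 0$ as contributing nothing; the function name is illustrative.

```python
import math

def information_score(sites, background=None):
    """I = sum over positions i and bases a of f[i][a] * log(f[i][a] / b[a])."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for base in "ACGT":
            f = column.count(base) / n
            if f > 0:  # the limit of f * log(f) as f -> 0 is 0
                total += f * math.log(f / background[base])
    return total

# A perfectly conserved alignment scores log(4) per column;
# a column with all four bases equally frequent scores 0.
print(information_score(["ACGT"] * 5))
print(information_score(["A", "C", "G", "T"]))
```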

SLIDE 16

Ab initio motif discovery with weight matrices

Assume the input set of ‘co-regulated’ sequences is a mixture of “random” background sequence plus a number of samples from a weight matrix. Unknowns:

1. The weight matrix
2. The number of sites
3. The positions of the sites

[Figure: upstream regions of five genes.]

MEME approach: Search the space of WMs for the WM that maximizes the likelihood of the data (summing over all possible binding site configurations for each WM). The likelihood is maximized using “Expectation Maximization”.

Gibbs Sampler approach: Search the space of binding site configurations for the configuration that maximizes the likelihood of all sites deriving from a common WM (integrating over all possible WMs) and all other sequence deriving from background. The space of configurations is searched through “Gibbs Sampling”.

SLIDE 17

MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix. For instance, at position 5 of the seed: $w^5_G = 0.5$, $w^5_A = w^5_C = w^5_T = 0.167$.

SLIDE 18

MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.
2. For each position i in each sequence s, calculate the probability P(s,i) that this sequence is a binding site for the WM.

$P(s,i) = \frac{P(s_{i+1} \ldots s_{i+l}|w)}{P(s_{i+1} \ldots s_{i+l}|w) + P(s_{i+1} \ldots s_{i+l}|b)}, \quad P(s_{i+1} \ldots s_{i+l}|w) = \prod_{k=1}^{l} w^k_{s_{i+k}}, \quad P(s_{i+1} \ldots s_{i+l}|b) = \prod_{k=1}^{l} b_{s_{i+k}}$

SLIDE 19

MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.
2. For each position i in each sequence s, calculate the probability P(s,i) that this sequence is a binding site for the WM.
3. Construct a new WM by averaging the potential sites at all possible positions (s,i), weighting each potential site with the probability that it is a site for the previous WM.

SLIDE 20

MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.
2. For each position i in each sequence s, calculate the probability P(s,i) that this sequence is a binding site for the WM.
3. Construct a new WM by averaging the potential sites at all possible positions (s,i), weighting each potential site with the probability that it is a site for the previous WM.
4. This is iterated until the WM no longer changes.

SLIDE 21

MEME approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. Choose a random segment and use its sequence to seed a weight matrix.
2. For each position i in each sequence s, calculate the probability P(s,i) that this sequence is a binding site for the WM.
3. Construct a new WM by averaging the potential sites at all possible positions (s,i), weighting each potential site with the probability that it is a site for the previous WM.
4. This is iterated until the WM no longer changes.
5. The best WM over many seeds is reported.
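Steps 1 to 4 can be sketched as a small expectation-maximization loop. This is a simplified illustration of the procedure as described on the slides, not MEME itself: it omits the prior on the number of sites that a full implementation would include, it uses a tiny pseudocount to avoid zero entries, and the seed probability of 0.5 and all function names are assumptions. Step 5 would wrap `run_em` in a loop over many seeds and keep the WM with the best likelihood.

```python
def seed_wm(segment, p_seed=0.5):
    """Step 1: WM seeded from a segment; the seed base gets p_seed, the others share the rest."""
    rest = (1.0 - p_seed) / 3
    return [{b: (p_seed if b == base else rest) for b in "ACGT"} for base in segment]

def em_step(seqs, wm, bg):
    """Steps 2 and 3: weight every window by P(s,i), then re-estimate the WM."""
    width = len(wm)
    counts = [{b: 1e-3 for b in "ACGT"} for _ in range(width)]  # small pseudocount
    for s in seqs:
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            p_w = p_b = 1.0
            for k, base in enumerate(window):
                p_w *= wm[k][base]
                p_b *= bg[base]
            post = p_w / (p_w + p_b)  # P(s,i) as on the slide
            for k, base in enumerate(window):
                counts[k][base] += post
    return [{b: c[b] / sum(c.values()) for b in "ACGT"} for c in counts]

def run_em(seqs, seed, bg=None, iterations=50, tol=1e-6):
    """Step 4: iterate until the WM no longer changes (within tol)."""
    bg = bg or {b: 0.25 for b in "ACGT"}
    wm = seed_wm(seed)
    for _ in range(iterations):
        new = em_step(seqs, wm, bg)
        delta = max(abs(new[k][b] - wm[k][b]) for k in range(len(wm)) for b in "ACGT")
        wm = new
        if delta < tol:
            break
    return wm

toy = ["AAATGACTCAAAA", "CCTGACTCACCGG", "GGGTTTGACTCAG"]  # made-up sequences
wm = run_em(toy, seed="TGACTCA")
print(["".join(max(col, key=col.get)) for col in wm])
```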

SLIDE 22

Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.

SLIDE 23

Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.

SLIDE 24

Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.
3. Construct a weight matrix from the other sites.

SLIDE 25

Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.
3. Construct a weight matrix from the other sites.
4. Scan the sequence s and calculate the probability P(s,i) of a site for this WM occurring at position i:

$P(s,i) = \prod_{k=1}^{l} w^k_{s_{i+k}} = \prod_{k=1}^{l} \frac{n^k_{s_{i+k}} + 1}{n + 4}$

where $n^k_\alpha$ is the number of the other sites that have base $\alpha$ at position k, and n is the number of other sites.

SLIDE 26

Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.
3. Construct a weight matrix from the other sites.
4. Scan the sequence s and calculate the probability P(s,i) of a site for this WM occurring at position i.
5. Sample a position i in proportion to probability P(s,i).

SLIDE 27

Gibbs Sampler Approach

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …Gene A …Gene B …Gene C …Gene D …Gene E …Gene E …Gene F

1. A binding site configuration (set of site positions) is chosen at random.
2. Pick a sequence at random and remove the site.
3. Construct a weight matrix from the other sites.
4. Scan the sequence s and calculate the probability P(s,i) of a site for this WM occurring at position i.
5. Sample a position i in proportion to probability P(s,i).
6. Iterate steps 2 through 5 and keep track of:
The best configuration.
The fraction of time each site occurs.
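The six steps translate fairly directly into code. The sketch below is a simplified illustration under explicit assumptions: exactly one site per sequence, at least two input sequences, a fixed site width, pseudocount 1, and it only records how often each position is sampled (tracking the best configuration would additionally require scoring each configuration). The function name and parameters are hypothetical, and positions are 0-based.

```python
import random

def gibbs_sampler(seqs, width, sweeps=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial configuration, one site start per sequence.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    visits = [[0] * (len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        # Step 2: pick a sequence at random and remove its site.
        j = rng.randrange(len(seqs))
        others = [s[p:p + width] for t, (s, p) in enumerate(zip(seqs, pos)) if t != j]
        n = len(others)
        # Step 3: WM from the remaining sites, (count + 1) / (n + 4).
        wm = [{b: (col.count(b) + 1) / (n + 4) for b in "ACGT"} for col in zip(*others)]
        # Step 4: P(s,i) for every start position i in the chosen sequence.
        weights = []
        for i in range(len(seqs[j]) - width + 1):
            p = 1.0
            for k in range(width):
                p *= wm[k][seqs[j][i + k]]
            weights.append(p)
        # Step 5: sample a new position in proportion to P(s,i).
        pos[j] = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 6: record how often each position is occupied.
        visits[j][pos[j]] += 1
    return pos, visits

toy = ["AAATGACTCAAAA", "CCTGACTCACCGG", "GGGTTTGACTCAG"]  # made-up sequences
final_pos, visits = gibbs_sampler(toy, width=7)
print(final_pos)
```

Dividing each entry of `visits` by the number of times its sequence was picked gives the fraction of time each site occurs.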

SLIDE 28

Ab initio motif discovery with weight matrices

Expectation maximization approach:

MEME (http://meme.sdsc.edu/meme/intro.html)
MDScan (http://ai.stanford.edu/~xsliu/MDscan/)

Advantages:

Can flexibly treat variations in the number of sites per sequence.
Is likely to discover the motif if the WM is well-described by a consensus sequence.

Disadvantages:

Can easily get stuck in local optima.

Gibbs Sampling approach:

The Gibbs Motif Sampler (http://bayesweb.wadsworth.org/gibbs/gibbs.html)
AlignAce (http://atlas.med.harvard.edu/)

Advantages:

Searches the entire space of binding site configurations.
Is better suited to motifs that are very fuzzy and not well-described by a consensus.

Disadvantages:

Depends more on assumptions about the number of sites per sequence.

SLIDE 29

Databases with collections of known binding sites and derived weight matrices

TRANSFAC: http://www.gene-regulation.com/pub/databases.html

Large commercial collection of binding sites and weight matrices for different eukaryotic organisms.

RegulonDB: http://regulondb.ccg.unam.mx/index.html

Known E. coli binding sites and matrices.

SCPD: http://rulai.cshl.edu/SCPD/

Known binding sites in Saccharomyces cerevisiae.

JASPAR: http://mordor.cgb.ki.se/cgi-bin/jaspar2005/jaspar_db.pl

Open source collection of binding sites and weight matrices

These matrices can be used to scan for sites genome-wide.


SLIDE 30

Scanning for weight matrix matches

Probability of the sequence at positions (i+1) through (i+L) assuming it is a site for WM w, compared to the probability under a background model:

Score: $R(s_{[i,L]}) = \prod_{k=1}^{L} \frac{w^k_{s_{i+k}}}{b_{s_{i+k}}}$

[Figure: profile of the score R along sequence s, with a cut-off line.]

Predict all sites over some cut-off in R.

Problem: Especially in higher eukaryotes the vast majority of predictions from WM scans are false positives.
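A scan of this kind can be sketched as below. For numerical convenience the sketch works with $\log R$ instead of $R$ (an implementation choice, not something stated on the slide), assumes a uniform background by default, and reports 0-based window starts; the function name and the toy matrix are illustrative.

```python
import math

def scan(sequence, wm, background=None, cutoff=0.0):
    """Return (start, log R) for every window whose log-odds score exceeds the cutoff."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(wm)
    hits = []
    for i in range(len(sequence) - width + 1):
        log_r = sum(
            math.log(wm[k][sequence[i + k]] / background[sequence[i + k]])
            for k in range(width)
        )
        if log_r > cutoff:
            hits.append((i, log_r))
    return hits

# Toy WM strongly preferring "ACG" (values made up for the example).
toy_wm = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "ACG"]
print(scan("TTACGTT", toy_wm))  # one hit, at start 2
```

The false-positive problem follows from the numbers: with a short matrix and a genome of millions of positions, even a stringent cut-off is exceeded by chance many times.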

SLIDE 31

Discovering regulatory modules

[Figure from Arnone, M. I. and Davidson, E. H., Development, 124(10):1851-64, 1997.]

SLIDE 32

Discovering regulatory modules

Gather sets of TFs for which binding sites are known and that (ideally) are also known to interact in regulatory modules.

Search genome-wide for relatively short sequence segments, i.e. 200-500 bp, that have a surprisingly high concentration of sequences that ‘match’ these WMs.

Berman et al., PNAS (2002) 99 757-762

(first publication presenting the idea using simple filtering methods)

Ahab: Rajewsky N, Vergassola M, Gaul U, Siggia ED, BMC Bioinformatics (2002) 3 30

(presentation of algorithm with applications to Fly)

Smash: Zavolan M, Rajewsky N, Socci N, Gaasterland T, ICSSB (2003)

(essentially same algorithm with applications to human/mouse)

Stubb: Sinha S, van Nimwegen E, Siggia ED, ISMB (2003)

(extensions taking multiple species and correlations in neighboring sites into account.)


SLIDE 33

Discovering regulatory modules

Given a sequence we want to consider all ways in which the sequence can be “parsed” into binding sites for the set of TFs and assign probabilities.

Set of weight matrices for gap gene transcription factors known to be involved in early body-patterning in fly: Bicoid, Caudal, Hunchback, Knirps, Kruppel, Tailless, TorRE.

[Figure: a ‘parse’ ρ of the sequence S in terms of hypothesized binding sites.]

$P(S|\rho)$: probability of the observed sequence given the parse.

SLIDE 34

Probability of the sequence given a parse

Probability of sequence at positions (i+1) through (i+L) assuming it is a site for WM w:

$P(s_{[i,L]}|w) \equiv P(s_{i+1} \ldots s_{i+L-1} s_{i+L}|w) = \prod_{k=1}^{L} w^k_{s_{i+k}}$

Probability of a base not in a site (“background”):

$P(s|b) = b_s$

Probability of the sequence given a parse ρ:

$P(S|\rho) = \left[\prod_{i \notin \text{sites}} b_{s_i}\right] \prod_{w \in \text{WMs}} \prod_{j \in \text{sites}_w} P(s_{[j,L_w]}|w)$

The prior probability of different parses assumes each WM w has sites occurring with probability $p_w$ at each position of the sequence s:

$P(\rho) = \prod_{w \in \text{WMs}} (p_w)^{n_w}$

SLIDE 35

Probability of the sequence given a parse

For a given sequence we now want to calculate the probability $P(S|\{w\})$ of the data given the set of WMs $\{w\}$. For this we have to sum over all parses ρ:

$P(S|\{w\}) = \sum_{\rho} P(S|\rho) P(\rho)$

The sum is over all non-overlapping configurations of binding sites. Let $Z(1,j)$ denote the sum over parses for the first j bases of the sequence.

$Z(1,j) = b_{s_j}\, p_{bg}\, Z(1,j-1) + \sum_{w} P(s_{j-L_w+1} \ldots s_j|w)\, p_w\, Z(1,j-L_w)$

This way, $Z(1,L)$ can be calculated in time $O(L\,|w|)$, where $|w|$ is the number of WMs. The function $Z(1,L)$ still depends on the set of probabilities $(p_{bg}, p_w)$. Formally we should integrate over these probabilities. However, this is computationally infeasible. In practice we find the set of $(p_{bg}, p_w)$ that maximizes $Z(1,L)$. Finally we calculate the ratio:

$R = \frac{P(S|\{w\})}{P(S|\text{bg})}$
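The recursion for $Z(1,j)$ is a standard forward dynamic program and can be sketched as follows. This is an illustration under stated assumptions, not the Ahab implementation: the probabilities $(p_{bg}, p_w)$ are passed in as fixed values instead of being optimized, site probabilities are recomputed per window (so the cost also scales with the WM widths), and all names are hypothetical.

```python
def forward_sum(seq, wms, p_site, p_bg, bg):
    """Z(1, L): sum over all non-overlapping parses of seq.

    wms:    dict name -> WM (list of per-position dicts)
    p_site: dict name -> prior probability p_w of a site for that WM
    p_bg:   prior probability of a background base
    bg:     dict base -> background probability b
    """
    Z = [0.0] * (len(seq) + 1)
    Z[0] = 1.0  # the empty prefix has exactly one (empty) parse
    for j in range(1, len(seq) + 1):
        # Case 1: base j is background.
        total = bg[seq[j - 1]] * p_bg * Z[j - 1]
        # Case 2: a site for WM w ends at base j.
        for name, wm in wms.items():
            width = len(wm)
            if j >= width:
                p = 1.0
                for k, base in enumerate(seq[j - width:j]):
                    p *= wm[k][base]
                total += p * p_site[name] * Z[j - width]
        Z[j] = total
    return Z[-1]

uniform = {b: 0.25 for b in "ACGT"}
toy_wm = [{b: (0.7 if b == c else 0.1) for b in "ACGT"} for c in "AC"]
z = forward_sum("AC", {"toy": toy_wm}, {"toy": 0.1}, 0.9, uniform)
print(z)  # (0.25 * 0.9)**2 + 0.7 * 0.7 * 0.1
```

Dividing the result by the same quantity computed with no WMs and `p_bg = 1` gives the ratio $R$.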

SLIDE 36

Discovering regulatory modules

Slide a window of length 500 over the sequence and calculate $R = P(s|\{w\}) / P(s|\text{bg})$ at each position. This gives a profile of R.

[Figure: a 500 bp window sliding over the 10-30 kbp region upstream of a gene (ATG…), and the resulting profile of R with a cut-off; the predicted locations of modules are marked.]

A predicted module occurs at every window where R exceeds a cut-off.

SLIDE 37

Discovering regulatory modules

Ahab given a set of 9 transcription factors: Bcd, Hb, Cad, TorRE, D-stat, Kr, Kni, Gt, Tll
Run on upstream regions of 29 genes with gap and pair-rule patterns (750,000 bp total).

SLIDE 38

Stubb: multi-species and site correlations

Align the sequence in reference species A with orthologous sequences:

[Figure: sequences of species A, B, C, and D with the aligned areas marked (all other sequence segments are not aligned).]

SLIDE 39

Stubb: multi-species and site correlations

[Figure: sequences of species A, B, C, and D with the aligned areas marked (all other sequence segments are not aligned), and a window containing a potential binding site.]

Scan along the sequence of the reference species. The probability of the sequence given w for unaligned sequence, $P(s_{[i,L]}|w)$, is the same as before.

SLIDE 40

Stubb: multi-species and site correlations

[Figure: sequences of species A, B, C, and D with the aligned areas marked (all other sequence segments are not aligned), and a window containing a potential binding site.]

Sites in an aligned block are extended to all species in the block. These sequence segments in aligned blocks will be scored according to an evolutionary model: $P(S_{[i,L]}|w)$

SLIDE 41

Stubb: multi-species and site correlations

Potential site in aligned sequence block:

species A acgtaactagtga
species B acgttgctagatg
species C tcgttgctataat
species D aggtagcgagaag

Probability of the set of bases S in each column is independent of the other columns.

[Figure: species phylogenetic tree with leaves species A, B, C, D and internal nodes x, y, z, shown with the bases of one alignment column (a, g, g, g) at the leaves.]

$P(S|w)$ is the probability to observe the set of bases S at the leafs given the phylogenetic tree and the WM w.

SLIDE 42

Stubb: multi-species and site correlations

Probability along a single branch:

$P(x|y,t,w)$: probability to end up with base x after time t, starting from base y.

General time evolution:

$\frac{dP(x|y,t,w)}{dt} = \sum_{z} \left[ \mu(x \leftarrow z|w)\, P(z|y,t,w) - \mu(z \leftarrow x|w)\, P(x|y,t,w) \right]$

Assumptions:

$\mu(x \leftarrow y|w) = \mu(x|w), \quad P(x|y,\infty,w) = w_x$

Solution:

$\mu(x|w) = \mu\, w_x, \quad P(x|y,t,w) = \delta_{xy}\, e^{-\mu t} + w_x \left(1 - e^{-\mu t}\right)$

In terms of no-mutation probability $q = e^{-\mu t}$:

$P(x|y,q,w) = \delta_{xy}\, q + w_x (1 - q)$
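The single-branch formula is small enough to check numerically. The sketch below encodes $P(x|y,q,w) = \delta_{xy} q + w_x(1-q)$ and shows two properties that follow from it: the probabilities over the end base x sum to one, and w is the stationary distribution. The function names and the example column are made up for illustration.

```python
import math

def no_mutation_probability(mu, t):
    """q = exp(-mu * t): probability that no mutation occurred along the branch."""
    return math.exp(-mu * t)

def branch_probability(x, y, q, w):
    """P(x | y, q, w): end at base x, starting from base y, for WM column w."""
    return (q if x == y else 0.0) + w[x] * (1.0 - q)

w = {"A": 0.1, "C": 0.2, "G": 0.6, "T": 0.1}  # example WM column (made up)
q = no_mutation_probability(mu=1.0, t=0.5)

# Normalization: summing over the end base gives 1 for any start base.
print(sum(branch_probability(x, "A", q, w) for x in "ACGT"))
# Stationarity: starting from w and evolving along the branch returns w.
print(sum(w[y] * branch_probability("G", y, q, w) for y in "ACGT"), w["G"])
```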

SLIDE 43

Stubb: multi-species and site correlations

[Figure: phylogenetic tree with internal nodes x, y, z and the bases of one alignment column at the leaves.]

The probability of the bases at the leafs is the product over the probabilities of each of the branches, summed over the possible bases at the internal nodes:

$P(S|w) = \sum_{x,y,z} w_x \left(\delta_{xy} q_{xy} + w_y (1-q_{xy})\right) \left(\delta_{xg} q_{xD} + w_g (1-q_{xD})\right) \left(\delta_{yg} q_{yC} + w_g (1-q_{yC})\right) \cdots$

and similarly the probability given the background:

$P(S|b) = \sum_{x,y,z} b_x \left(\delta_{xy} q_{xy} + b_y (1-q_{xy})\right) \left(\delta_{xg} q_{xD} + b_g (1-q_{xD})\right) \left(\delta_{yg} q_{yC} + b_g (1-q_{yC})\right) \cdots$

Finally, the ratio of the probability of the whole block of aligned sequences under WM and background model:

$R(S_{[i,L]}|w) = \prod_{k=1}^{L} \frac{P(S_{i+k}|w^k)}{P(S_{i+k}|b)}$
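The sum over internal nodes can be written out by brute force for a four-species tree. The sketch assumes the topology suggested by the visible terms of the formula: root x with children leaf D and node y, y with children leaf C and node z, and z with children leaves A and B. The branch labels "yz", "zA", and "zB" are hypothetical names for the factors the slide elides with an ellipsis, and all function names are illustrative.

```python
from itertools import product

BASES = "ACGT"

def branch(child, parent, q, w):
    """P(child | parent, q, w) = delta * q + w[child] * (1 - q)."""
    return (q if child == parent else 0.0) + w[child] * (1.0 - q)

def column_probability(leaves, q, w):
    """P(S|w) for one alignment column.

    leaves: dict species ("A".."D") -> observed base
    q:      dict branch name -> no-mutation probability
    w:      dict base -> probability (a WM column, or the background)
    """
    total = 0.0
    for x, y, z in product(BASES, repeat=3):  # bases at the internal nodes
        total += (w[x]
                  * branch(y, x, q["xy"], w)
                  * branch(leaves["D"], x, q["xD"], w)
                  * branch(leaves["C"], y, q["yC"], w)
                  * branch(z, y, q["yz"], w)
                  * branch(leaves["A"], z, q["zA"], w)
                  * branch(leaves["B"], z, q["zB"], w))
    return total

def block_ratio(columns, q, wm, bg):
    """R for an aligned block: product over columns of P(S|w^k) / P(S|b)."""
    r = 1.0
    for k, leaves in enumerate(columns):
        r *= column_probability(leaves, q, wm[k]) / column_probability(leaves, q, bg)
    return r

w = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}       # made-up WM column
bg = {b: 0.25 for b in BASES}
q = {name: 0.5 for name in ("xy", "xD", "yC", "yz", "zA", "zB")}
column = {"A": "A", "B": "G", "C": "G", "D": "G"}
print(block_ratio([column], q, [w], bg))
```

For larger trees the same quantity is usually computed recursively from the leaves upward, which avoids enumerating all internal-node assignments.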

SLIDE 44

Stubb: multi-species and site correlations

Significant increase of the number of correct predictions genome-wide when using melanogaster/pseudoobscura alignments.

SLIDE 45

Ab initio discovery of regulatory sites

General Approaches:

1. Collect sets of (intergenic) sequences that are thought to contain binding sites for a common regulatory factor, then search for overrepresented short sequence motifs. Examples:

Upstream regions of co-regulated genes.
Sequence fragments pulled down with ChIP.

2. Phylogenetic footprinting: create multiple alignments of orthologous intergenic sequences and identify sequence segments more conserved than “average”.

SLIDE 46

Phylogenetic Footprinting

From: Kellis et al. Nature 423 251-254 (2003)

SLIDE 47

Phylogenetic Footprinting

Scer ATGTTTTTTTAATGATATATGTAACGTACATTCTTTCCTCTACCACTGCCAATTCGGTATTATTTAATTGTGTTTAGCGCTATTTAC
Spar -ATGTTTTTTAATGATATATGTAACGTACATTCTTC---CTACTGCTACCAAGTCGGTATTATTTAATTGTGTTTAGCGCTATTTAC
Smik --------------------------TCTTTTCTCTA--CCACTACTACCAATTCGGTATTATTTAATTGTGTTTAGCACTATTTAC
Sbay --ATGTTCTTAATGATATATATAACGTACATTTTTT---CCTCTACTAGCCAATCGGTGTTATTTAATTGTGTTTAGCTCTATTTAC

Scer TAATTAACTAGAAACTCAATTTTTAAAGGCAAAGCTCGCTGACCT--TTCACTGATTTCGTGGATGTTATACTATCAGTTACTCTTC
Spar CCACTAACTAGAAACTCGATTTTTAAAGGCAAAATTCAGTGTCCT--TTCACTAGTTTTGCAGATGTCCTGCTATCAGCTACTTCCC
Smik TCACTAAC-AAAAACTCAATTTTGAAGGGCTGA-TTAAATATCCTCCTTTAATAGTTTTGCGCTTAGCCTGTTATCA--TATAAGTA
Sbay TCACTTAACAAAAAAACCAACTTCAAAAGTATAATACAATAATTTC-TCCGTTGATCTTGTGAACTACATGCTATCACTTATTTGCC

Scer TGCAAAAAAAAA-----------TTGAGTCATATCGTAGCTTTGGGATTATTTTTCT-CTCTCTCCACGGCTAATTAGGTGATCATG
Spar TGCAGAAAAGAAAAATA-----TTTGAGTCATATCATCGTCTAGGAAGTGTTTTTCT-CTCTCTCCACGGATAGTTAAGTGATCATG
Smik TACAAAAAGAGAATAT------TTTGAGTCATATCATCGCCTAGGAAGTATTTTTTTTCTCTCTTCACGGTTAATTAGGTGATTTCT
Sbay TGTAAAAAGAAAATCGTTTCGTTTTGAGTCATATCATGTTCTCATAA-TATTTTTTT--TTCCTTAGCGATTAA-------------

Scer AAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATC-GAAACATACATAA--GTTGATATTC-CTTTGATATCG-----ACGACTA
Spar AAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATC-GAAACATACATAA--ATTGATATTC-CTTTAGCTTTT----AAAGACTA
Smik GAAAAACGAAAAATTCATG-GAAAAGAGTCAACCGTC-GAAACATACATAA--ACCGATATTT-CTTTAGCTTTCGACAAAAATCTG
Sbay GAAAAATAAAAAGTGATTG-GAAAAGAGTCAGATCTCCAAAACATACATAATAACAGGTTTTTACATTAGCTTTT----GAAAACTA

Scer CTCAATCAGG-TTTTAAAAGAAAAGAGGCA-GCTATTGAAGTAGCAGT-ATCCAGTTTAGGTTTTTTAATTATTTACAAGTAAA-GA
Spar CTCAATCAA--GTTTAATAGAAGAAAGAGG-AAGGTTGAGATAGGTAT-ATCCAGTTTAGGTTTC--AATTATTTAATAATAAA-GG
Smik CAATATTCATTATTCAAAACTCAAAAGAAG-AAGGTTCGAATTGGTGT-GTCCAGTTTAGGCTCT--AATTGTTGAATAATAAAAGG
Sbay TCCACCACAA-ATTGAAGGTGAGGAAGAAACAAAGTTAAAGCAAGAATCGGCTTGTGTCTTTTTT--GATTGCGTATT--TGAAAGG

Scer AAAAGAGA--------------
Spar TAAAGAA---------------
Smik CGAAGAAATAACGATCCAAAAA
Sbay TAAAGGAATACAACAAAAA---

His7 promoter. Marked binding sites: GCN4, GCN4, ABF1.


SLIDE 48

Ab initio discovery of regulatory sites

General Approaches:

1. Collect sets of (intergenic) sequences that are thought to contain binding sites for a common regulatory factor, then search for overrepresented short sequence motifs. Examples:

Upstream regions of co-regulated genes.
Sequence fragments pulled down with ChIP.

2. Phylogenetic footprinting: create multiple alignments of orthologous intergenic sequences and identify sequence segments more conserved than “average”.

PhyloGibbs combines these two approaches into a single procedure.
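As a rough illustration of the second approach, the sketch below flags alignment windows that are more conserved than the alignment-wide average. It assumes the simplest possible conservation measure (a column counts as conserved only if all species share the same non-gap base) and uses a made-up toy alignment. The function names are hypothetical, and real footprinting tools use phylogeny-aware scores rather than plain identity.

```python
def column_identity(alignment):
    """Per column: 1.0 if every row has the same non-gap base, else 0.0."""
    length = len(alignment[0])
    assert all(len(row) == length for row in alignment)
    scores = []
    for col in zip(*alignment):
        conserved = col[0] != "-" and all(base == col[0] for base in col)
        scores.append(1.0 if conserved else 0.0)
    return scores


def conserved_windows(alignment, width):
    """(start, mean identity) for windows more conserved than the overall average."""
    scores = column_identity(alignment)
    average = sum(scores) / len(scores)
    hits = []
    for start in range(len(scores) - width + 1):
        window_mean = sum(scores[start:start + width]) / width
        if window_mean > average:
            hits.append((start, window_mean))
    return hits


# Toy alignment, invented for illustration (not taken from the slides).
toy = [
    "ACGTTGACTCATTTGA",
    "ATGTTGACTCATCAGA",
    "GCTATGACTCATGTCA",
    "TCGATGACTCAAGTGC",
]

best = max(conserved_windows(toy, 7), key=lambda hit: hit[1])
print(best)
```

The window width plays the role of an assumed site length; in practice it would be scanned over a range of plausible motif widths.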


SLIDE 49

Ab initio identification of regulatory motifs

Challenges:

Multiple alignments of intergenic DNA are often problematic.

We only align the unambiguous areas and consider all solutions consistent with the alignment of the unambiguous areas.

Sites occur both in aligned and non-alignable segments.

We simultaneously search for sites in both aligned blocks and unaligned segments.

The number of sites and the number of different kinds of sites is unknown.

We search over an arbitrary number of sites and motifs.

The entire phylogeny of the input sequences needs to be taken into account.

We use an explicit model for the evolution of regulatory sites on the phylogenetic tree.

The reliability of the predictions needs to be assessed internally.

We use Monte-Carlo Markov chain sampling to rigorously assign posterior probabilities to all reported sites and motifs.


SLIDE 50

Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

AGAAGAAAGTAAttcttATGAGAAAATTTGCGGGAGTTCTTTGCCAGTGAGATAAAGtttttttt-------AATTTTAATCAACACAAAATACACATATTTATATAAACTGacgaaata-
TGAAAAAAGTAAccttcATGAGATATATTGCGGAAGTCCATTACCAGTAAGTTAGAGttagaaaatttcgatcgacacaatttatacttcgatatatactggcaaaaaa------------
tgggaggaaaaaaaccattacctgtatgaaaaagattgcaaggattcctttgttagtgaactgaactTTAGGGATTTTAATCAACACAGTATATACATATatctttgtatactgacaaata
agtaagctatatgaaaagtttcctttagcagtaaatttagagc------------------------TTAGGAATTTTGATCAAGACACAATATATATAGCTTTATATATTGtcaaata--

Produce a local multiple alignment to identify orthologous regions in the upstream regions.


SLIDE 51


Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

Produce a local multiple alignment to identify orthologous regions in the upstream regions.
Sample all configurations of assigning sites to the upstream regions.


SLIDE 52


Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

Produce a local multiple alignment to identify orthologous regions in the upstream regions.
Sample all configurations of assigning sites to the upstream regions.
Score sites in aligned regions according to phylogeny.

Sites in aligned region scored according to phylogeny


SLIDE 53

Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

Example: star-phylogeny (ancestor a; leaves s1, s2, s3, s4)

$$P(S \mid T, w) = \sum_{a} w_a \prod_{i=1}^{4} \left[ q_i\, \delta_{a s_i} + (1 - q_i)\, w_{s_i} \right]$$

Background:

$$P(S \mid T, b) = \sum_{a} b_a \prod_{i=1}^{4} \left[ q_i\, \delta_{a s_i} + (1 - q_i)\, b_{s_i} \right]$$

Probability ratio two windows drawn from same WM vs. windows drawn from background (windows S and X; X has leaves x1, x2, x3, x4 and ancestor a):

$$\frac{P(S, X)}{P(S, X \mid B)} = \frac{\int dw\, P(S \mid T_S, w)\, P(X \mid T_X, w)}{P(S \mid T_S, b)\, P(X \mid T_X, b)}$$
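The star-phylogeny formula can be evaluated directly for a single alignment column. The sketch below assumes that q_i is the probability that leaf i kept the ancestral base unchanged, and that with probability 1 - q_i the leaf base is drawn afresh from the frequencies (w for a site, b for background). The function name and argument layout are hypothetical.

```python
BASES = "ACGT"


def star_column_prob(column, freqs, q):
    """P(S | T, w) for one alignment column under a star phylogeny.

    column: observed bases, one per leaf, e.g. "AAAT"
    freqs:  dict base -> probability (WM column w, or background b)
    q:      per-leaf probability of keeping the ancestral base (assumed meaning)
    """
    total = 0.0
    for ancestor in BASES:
        # Prior probability of the ancestral base.
        term = freqs[ancestor]
        for base, q_i in zip(column, q):
            same = 1.0 if base == ancestor else 0.0
            # Either inherited unchanged, or redrawn from freqs.
            term *= q_i * same + (1.0 - q_i) * freqs[base]
        total += term
    return total


uniform = {base: 0.25 for base in BASES}
print(star_column_prob("AA", uniform, [0.5, 0.5]))
```

Two limits give a quick sanity check: with all q_i = 0 the leaves become independent draws from the frequencies, and with all q_i = 1 only perfectly conserved columns have non-zero probability.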


SLIDE 54


Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

Produce a local multiple alignment to identify orthologous regions in the upstream regions.
Sample all configurations of assigning sites to the upstream regions.
Score sites in aligned regions according to phylogeny.

Sites in aligned region scored according to phylogeny:

$$R(S) = \frac{P(S)}{P(S \mid b)} = \frac{\int P(S \mid w)\, P(w)\, dw}{P(S \mid b)}$$


SLIDE 55


Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

Produce a local multiple alignment to identify orthologous regions in the upstream regions.
Sample all configurations of assigning sites to the upstream regions.
Score sites in aligned regions according to phylogeny.

Scored according to independent model (sites s1 and s2):

$$R(s_1, s_2) = \frac{P(s_1, s_2)}{P(s_1 \mid b)\, P(s_2 \mid b)} = \frac{\int P(s_1 \mid w)\, P(s_2 \mid w)\, P(w)\, dw}{P(s_1 \mid b)\, P(s_2 \mid b)}$$
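For the independent model the integral over w has a closed form if one assumes, as this sketch does, that P(w) is a symmetric Dirichlet prior with pseudocount gamma applied to each WM column separately. The slide does not state the prior, so that choice, the default gamma = 1, and the function names are assumptions made for illustration.

```python
from math import exp, lgamma, log

BASES = "ACGT"


def log_marginal_column(counts, gamma=1.0):
    """log of the integral over w of prod_a w_a**counts[a] under Dirichlet(gamma)."""
    n = sum(counts)
    k = len(counts)
    result = lgamma(k * gamma) - lgamma(n + k * gamma)
    for c in counts:
        result += lgamma(c + gamma) - lgamma(gamma)
    return result


def log_ratio_independent(sites, background, gamma=1.0):
    """log R(s1, ..., sn): equal-length sites as independent draws from one WM."""
    width = len(sites[0])
    assert all(len(site) == width for site in sites)
    log_r = 0.0
    for pos in range(width):
        column = [site[pos] for site in sites]
        counts = [column.count(base) for base in BASES]
        # Numerator: WM integrated out for this column.
        log_r += log_marginal_column(counts, gamma)
        # Denominator: each base drawn from the background instead.
        log_r -= sum(log(background[base]) for base in column)
    return log_r


uniform = {base: 0.25 for base in BASES}
print(exp(log_ratio_independent(["A", "A"], uniform)))  # two matching bases
print(exp(log_ratio_independent(["A", "C"], uniform)))  # two differing bases
```

Under these assumptions two matching bases give a ratio above 1 and two differing bases a ratio below 1, which is the behaviour the score R is meant to capture.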


SLIDE 56


Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

Produce a local multiple alignment to identify orthologous regions in the upstream regions.
Sample all configurations of assigning sites to the upstream regions.
Score sites in aligned regions according to phylogeny.

Phylogeny based score combined with site for same WM at phylogenetically unrelated position:

$$R(S, s) = \frac{P(S, s)}{P(S \mid b)\, P(s \mid b)} = \frac{\int P(S \mid w)\, P(s \mid w)\, P(w)\, dw}{P(S \mid b)\, P(s \mid b)}$$


SLIDE 57


Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

Produce a local multiple alignment to identify orthologous regions in the upstream regions.
Sample all configurations of assigning sites to the upstream regions.
Score sites in aligned regions according to phylogeny.

Score of the binding site configuration C is the product of the scores for each ‘color’:

$$R(D \mid C) = R(S_{\text{yellow}})\, R(s^{1}_{\text{blue}}, s^{2}_{\text{blue}})\, R(s_{\text{red}}, S_{\text{red}})$$


SLIDE 58

Identifying TF binding sites in species with different degrees of evolutionary relatedness: The PhyloGibbs algorithm

Motif finding strategy:

Anneal: Sample all configurations according to $P(C \mid D)^{\beta}$ and slowly increase β.
Take configuration C* at end of anneal as reference configuration.
Sample all configurations according to $P(C \mid D)$ and track the average membership of each site group of the reference configuration.

Posterior probability of configuration C:

$$P(C \mid D) = \frac{P(D \mid C)\, P(C)}{P(D)}$$

P(C) = prior on configuration C.
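The two-phase strategy can be sketched on a toy problem with a generic Metropolis sampler. Everything here is an assumption made for illustration: the configuration space is just ten integers on a ring, the log-posterior table is invented, and the move set is a symmetric step of plus or minus one. PhyloGibbs itself samples over site assignments with its own moves.

```python
import math
import random


def metropolis_step(state, log_post, propose, beta, rng):
    """One Metropolis move targeting a distribution proportional to P(C|D)**beta."""
    candidate = propose(state, rng)
    log_accept = beta * (log_post(candidate) - log_post(state))
    if log_accept >= 0 or rng.random() < math.exp(log_accept):
        return candidate
    return state


def anneal_then_track(start, log_post, propose, betas, track_steps, seed=0):
    rng = random.Random(seed)
    state = start
    # Phase 1: anneal with slowly increasing beta.
    for beta in betas:
        state = metropolis_step(state, log_post, propose, beta, rng)
    reference = state  # configuration C* at the end of the anneal
    # Phase 2: sample at beta = 1 and track how often C* is revisited.
    hits = 0
    for _ in range(track_steps):
        state = metropolis_step(state, log_post, propose, 1.0, rng)
        if state == reference:
            hits += 1
    return reference, hits / track_steps


# Toy posterior (unnormalised, invented): configuration 7 is strongly favoured.
LOG_WEIGHTS = [0.0] * 10
LOG_WEIGHTS[7] = 5.0


def toy_log_post(config):
    return LOG_WEIGHTS[config]


def toy_propose(config, rng):
    return (config + rng.choice((-1, 1))) % 10


betas = [0.1 + 4.9 * k / 1999 for k in range(2000)]
reference, membership = anneal_then_track(0, toy_log_post, toy_propose, betas, 5000)
print(reference, membership)
```

The tracked fraction plays the role of the posterior probability reported for the reference configuration; in the real algorithm it is tracked per site group rather than for the configuration as a whole.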


SLIDE 59

Test set: the Saccharomyces Cerevisiae Promoter Database
Sensu Stricto Saccharomyces Species

After ‘clean up’ 437 known binding sites upstream of 200 cerevisiae genes.

Sensu stricto Saccharomyces species:
S. cerevisiae
S. bayanus
S. paradoxus
S. mikatae
S. kudriavzevii

803 sequences in total for the 200 genes. Thus about 4 orthologs per gene.
We run PhyloGibbs separately on the alignment of each upstream region.
As a function of posterior probability p we gather all predicted sites with probability >= p.
For each cut-off p we calculate:

specificity: fraction of predicted sites that hit known sites.
sensitivity: fraction of known sites that are hit by at least 1 predicted site.

(Michael Zhang, Cold Spring Harbor)
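The two measures defined above can be computed as in the sketch below. It assumes that a prediction "hits" a known site when their coordinate intervals overlap by at least one base, with half-open (start, end) intervals. The slide does not define a hit precisely, so that criterion, the data layout, and the function name are assumptions.

```python
def evaluate(predictions, known_sites, cutoff):
    """Return (specificity, sensitivity) at a posterior cut-off.

    predictions: list of (start, end, posterior)
    known_sites: list of (start, end)
    Intervals are half-open; any overlap counts as a hit (assumed criterion).
    """
    kept = [(start, end) for start, end, post in predictions if post >= cutoff]

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    # Predicted sites that hit at least one known site.
    hitting = sum(any(overlaps(pred, known) for known in known_sites) for pred in kept)
    # Known sites hit by at least one predicted site.
    covered = sum(any(overlaps(known, pred) for pred in kept) for known in known_sites)

    specificity = hitting / len(kept) if kept else float("nan")
    sensitivity = covered / len(known_sites) if known_sites else float("nan")
    return specificity, sensitivity


# Invented example data.
predictions = [(0, 10, 0.9), (20, 30, 0.6), (50, 60, 0.3)]
known_sites = [(5, 12), (55, 65), (100, 110)]

for cutoff in (0.5, 0.2):
    print(cutoff, evaluate(predictions, known_sites, cutoff))
```

Lowering the cut-off admits more predictions, which typically trades specificity for sensitivity; sweeping the cut-off traces out the curve used to compare methods.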


SLIDE 60

Results on sites from SCPD

Estimated fraction of predicted sites matching known sites as a function of the fraction of known sites covered by predictions.

The fractions are varied by changing the cut-off on the posterior site probability of the predictions.

Methods compared: PhyloGibbs, PhyME, EMnEM, Gibbs sampler, MEME.


SLIDE 61

Example predictions in the His7 promoter

http://www.swissregulon.unibas.ch Developed by Mikhail Pachkov

Predictions based on alignment of single orthologous intergenic regions.
Genome-wide predictions available.

SLIDE 62

Results on gene groups identified by ChIP-on-chip

ChIP-on-chip for 203 yeast DNA binding proteins.
Ran 6 different ab initio motif finding algorithms on upstream regions that were pulled down by a given protein.
Defined motifs for 116 DNA binding proteins.
We focus on 45 TFs that have between 3 and 25 annotated sites.
In 21 cases all computational methods failed to identify a motif and the reported motif simply copies the motif reported in the literature.


SLIDE 63

Results on gene groups identified by ChIP-on-chip

PhyloGibbs results: a motif matching the literature motif was found for 16 out of 21 TFs.


SLIDE 64

Comparing literature with ChIP-on-chip data

TF      literature targets   ChIP-on-chip targets   overlap   PhyloGibbs on lit. targets
GCR1    8                    4                      1         found literature motif
MET31   5                    4                                found literature motif
MAC1    4                    5                      1         found literature motif
SKO1    5                    6                                found literature motif
RLM1    6                    9                      1         found literature motif
GZF3    3                    3                                found literature motif
ADR1    3                    10                     1         found literature motif
DAL80   4                    8                                found literature motif
MOT3    3                    8                                literature motif not found
ROX1    3                    11                               found literature motif

YAP6 3

YOX1

28 3 1 found literature motif

The target genes in the literature show little overlap with targets predicted through ChIP-on-chip.
In 3 of 4 cases where the known motif was not found on the ChIP targets it was found when PhyloGibbs was run on the literature targets.


SLIDE 65

Genome browser

http://www.swissregulon.unibas.ch

More than 4000 predicted sites genome-wide.
