SLIDE 1

Reverse engineering mammalian transcriptional regulatory circuits

Andrew D Smith Pavel Sumazin

Zhang Lab, CSHL & Califano Lab, Columbia

ISMB 2007

Smith & Sumazin (CSHL & Columbia) Transcriptional regulatory circuits ISMB’07 1 / 109

SLIDE 2

Outlines Part I: Lecture Format

Outline of part I

Introduction
  • Background on regulatory networks
  • Data available for analysis

Analysis methods
  • Identifying gene modules
  • Modeling regulatory elements
  • Predicting binding sites
  • Conservation of regulatory elements
  • Motif discovery
  • Cis-regulatory modules

SLIDE 3

Outlines Part II: Worked Examples

Outline of part II

Analyzing sets of co-regulated genes
  • An example gene module
  • Identifying enriched known motifs
  • Predicting functional binding sites

Analysis of transcription factor localization data
  • ChIP-chip data examples
  • Identifying enriched known motifs
  • Identifying co-factors
  • Discovering motifs de novo

SLIDE 4

Part I: Lecture Format

SLIDE 8

Introduction Background on regulatory networks

Assumed background

(Levine & Tjian, 2003)

  • Genes
  • Promoters
  • Transcription factors (TFs)
  • Transcription factor binding sites
  • Enhancers
  • cis-Regulatory modules
  • Gene expression microarrays
  • ChIP-chip TF localization

SLIDE 9

Introduction Background on regulatory networks

The goal: networks

  • Identifying regulatory relationships between genes
  • Understanding the underlying sequence-based mechanisms
  • Deriving specific hypotheses about transcription

SLIDE 10

Introduction Background on regulatory networks

Components of regulatory networks

Nodes correspond to genes

  • Networks include both regulators and targets
  • Edges are regulatory relationships
  • Most genes are only targets
  • Interesting subnetworks composed of regulators

[Figure: network diagram in which TF genes point to target genes]

SLIDE 14

Introduction Background on regulatory networks

Two kinds of regulatory networks

Direct networks

  • Edges: physical interaction
  • Interaction is specified in regulatory sequence of target

Influence networks

  • Edges: possibly indirect interaction
  • Interaction may be mediated by another gene

[Figure: a TF gene linked to a target gene, possibly through an unknown intermediate gene (?)]

SLIDE 15

Introduction Background on regulatory networks

Examples of what we can achieve

Understanding regulators

  • Which TFs are most important in a given context?
  • Do certain regulators appear to work together?
  • Possibly infer novel regulators or functions
  • Clues about regulatory mechanisms (e.g. TF binding specificity)

Understanding targets

  • The set of TFs that appear to regulate some gene
  • The condition-specific targets of particular TFs
  • Sequence features that are important to a gene’s transcription
  • Do targets appear under control of the same set of regulators?

SLIDE 17

Introduction Data available for analysis

Gene expression data

Sets of interesting genes

  • Function in the context being examined
  • Sets can be assembled gene-by-gene: very slow, but produces high-quality data

Microarray data

  • Lots of expression data, fast and easy
  • Lower quality than sets of genes collected manually
  • Ultimate test of understanding: Can we reliably predict high-throughput expression data?

SLIDE 18

Introduction Data available for analysis

TF binding data

TF binding behavior

  • Several ways to examine binding (individually or high-throughput)
  • Locations of binding sites tell much about TF function
  • Not all sites that bind are involved in regulation

ChIP-chip

  • Context-specific binding sites, genome-wide and in vivo
  • Familiar tradeoff: much more data, possibly low-quality
  • ChIP-seq: emerging technology

Wang, Snyder & Gerstein (2007)

SLIDE 19

Introduction Data available for analysis

Genomic sequence and annotations

Raw sequence data

  • High-quality genomes available
  • Available from various sources: UCSC, NCBI, ENSEMBL

Genome alignments

  • Describe cross-species conservation (important for sequence analysis in any context)

  • Pre-computed alignments: easy to use, improve constantly

Genome annotations

  • Locations of important genomic features
  • Examples for transcription: TSS, CDS and repeat locations
  • Increasing amount of annotation directly related to transcription

SLIDE 20

Introduction Data available for analysis

Other data about transcription

Databases of existing knowledge

  • PUBMED: first place to check (know what’s already known)
  • Databases about transcription (e.g. TRANSFAC, SCPD)
  • Useful databases of characterized networks (hopefully soon)

Chromatin structure data

  • Important for transcription, but less well understood
  • Chromatin structure is an important regulatory mechanism
  • Different modifications affect transcription differently
  • Some modifications render genes “poised” for transcription
  • Other modifications prevent transcription
  • Emerging technology: chromosome conformation capture (3C and 5C)

SLIDE 22

Analysis methods Identifying gene modules

Gene modules

What is a gene module?

  • Many possible definitions, but let's keep it informal
  • Usually a set of genes that function together
  • Think: the genes whose regulation you want to understand
  • Gene modules might have 10 genes, or 500 genes

SLIDE 23

Analysis methods Identifying gene modules

Differentially expressed genes

The context

  • Want to understand, for example:
      • Expression in diseased cells
      • Cells from a developmental state
  • Get expression from 2 conditions:
      • Before and after some perturbation
      • Samples taken at different time-points
      • Different types of cells

Simplest gene modules

  • Genes showing differential expression
  • Maybe interested only in genes “over-expressed” or “under-expressed”
  • Mann-Whitney U-test

[Figure: expression in Condition 1 vs Condition 2 for genes labelled pm1a, Six6os1, Six6, Six1, Six4, Mnat1, Trmt5, Tmem30b, Prkch, Hif1a, Snapc1, Syt16, Dbpht2, Kcnh5, Rhoj, Gphb5, Ppp2r5e, Sgpp1, Esr2, Ttc9, Tex21, Mthfd1, Zbtb25, Zbtb1]
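
The Mann-Whitney U-test named above can be sketched for a single gene as follows. This is a minimal illustration, not the presenters' code: the expression values are hypothetical, and the p-value uses the normal approximation without a tie correction, which is only rough for samples this small.

```python
import math

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    # Assign average ranks so that tied values share the same rank.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(xs), len(ys)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p

# Hypothetical expression levels of one gene in replicates of two conditions.
condition_1 = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3]
condition_2 = [3.5, 3.9, 3.1, 3.6, 3.3, 3.8]
u_stat, p_value = mann_whitney_u(condition_1, condition_2)
print(u_stat, p_value)
```

Running the test once per gene and keeping the genes with small p-values (after correcting for multiple testing) gives one simple way to assemble a module of differentially expressed genes.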

SLIDE 25

Analysis methods Identifying gene modules

Gene expression profiles

Gene expression matrix

  • Columns ⇔ experiments
  • Rows ⇔ genes
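  • $x_{i,j}$ ⇔ level of gene $i$ in experiment $j$

$$
\begin{pmatrix}
x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,m} \\
x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,m} \\
x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & x_{n,3} & \cdots & x_{n,m}
\end{pmatrix}
$$

  • $x_3 = (x_{3,1}, x_{3,2}, x_{3,3}, \ldots, x_{3,m})$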

Gene expression profile

  • Each gene has a profile: a row of the matrix
  • Statistical issues (e.g. normalization) outside current scope
  • More experiments means more information in each profile
  • Similar expression profiles suggest similar regulation

SLIDE 26

Analysis methods Identifying gene modules

Using data from multiple experiments

Clustering gene expression profiles

  • Get gene modules based on expression from multiple experiments
  • Cluster genes with similar or correlated expression profiles
  • Any clustering algorithm can be used (e.g. k-means, hierarchical)
  • Best algorithm depends on data and analysis goals

Measuring profile similarity

  • Examples: correlation, Euclidean distance, mutual information
  • Again, best measure depends on data and analysis goals
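
To illustrate why the choice of similarity measure matters, here is a small sketch comparing Pearson correlation and Euclidean distance. The gene names and profile values are hypothetical, chosen so that the two measures disagree.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles: one row of the expression matrix per gene.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # same trend as geneA at twice the scale
    "geneC": [4.0, 3.0, 2.0, 1.0],  # opposite trend to geneA
}

r_ab = pearson(profiles["geneA"], profiles["geneB"])
r_ac = pearson(profiles["geneA"], profiles["geneC"])
d_ab = euclidean(profiles["geneA"], profiles["geneB"])
d_ac = euclidean(profiles["geneA"], profiles["geneC"])
print(r_ab, r_ac, d_ab, d_ac)
```

Correlation treats geneA and geneB as perfectly similar and geneC as opposite, while Euclidean distance places geneC closer to geneA than geneB is. Which behavior is wanted depends on whether absolute expression level or the shape of the profile matters for the analysis.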

SLIDE 27

Analysis methods Identifying gene modules

Inferring influence networks

Obtaining the direction of a relationship

  • Clusters suggest association, but not causation
  • More interesting: infer which are regulators and which are targets
  • Need sophisticated tools and the right kind/amount of data
  • Examples of methods: Bayesian networks, ARACNE

How to use influence networks

  • Influence networks can provide a framework
  • Connections can be annotated with direct information

SLIDE 29

Analysis methods Modeling regulatory elements

Modeling binding sites

[Figure: a transcription factor bound at its binding site within the sequence ACGTGACACAATTGGCATACGATCTACGTACAA]

Binding sites

  • Genomic sequences recognized and bound by binding domains of TFs
  • Binding sites for same TF might be different from each other
  • Often 8-12bp, but examples can be found from 5bp to ∼30bp

SLIDE 30

Analysis methods Modeling regulatory elements

Modeling binding sites

[Figure: a set of sequences with the binding sites for one TF highlighted]

What is a motif?

  • Motifs are how we model the set of binding sites for a TF
  • Should describe information important for binding
  • Motifs ≠ binding sites (a motif is the model; binding sites are the sequences it describes)

SLIDE 31

Analysis methods Modeling regulatory elements

Consensus sequence representation

Alignment of binding sites:

    GCCATCTGT
    GCCATCCGC
    GCCATCTTG
    GCCATGTAC
    GCCATATTT
    GCCATCTTT
    GACATTTTG
    TCCATTTTG
    TCTAGGTTT
    GCTCCATTT
    TCCATGGTT
    GCCATCTTG
    GCCATTTTG
    GCCATCTTG
    ACCATGTCA
    TCCATGTGT
    GCCATCACA

Consensus sequence: GCCATCTTG

Consensus sequences

  • Pros: Easy to understand, easy to manipulate computationally
  • Cons: Does not express all important information

SLIDE 32

Analysis methods Modeling regulatory elements

Consensus sequence representation

Degenerate consensus of the same 17 aligned binding sites: DMYMBNNNN

Degenerate nucleotides:

    M ⇒ A or C      V ⇒ A, C or G
    R ⇒ A or G      H ⇒ A, C or T
    W ⇒ A or T      D ⇒ A, G or T
    S ⇒ C or G      B ⇒ C, G or T
    Y ⇒ C or T      N ⇒ A, C, G or T
    K ⇒ G or T

Degenerate consensus sequences

  • IUPAC degenerate nucleotide codes
  • Provides more flexible representation, but usually not enough
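
As a sketch of how a degenerate consensus can be derived, the snippet below maps the set of bases seen in each alignment column to its IUPAC code. The site list is the 17-site alignment from the slide; the function and variable names are illustrative.

```python
# The 17 aligned binding sites shown on the slide.
SITES = (
    "GCCATCTGT GCCATCCGC GCCATCTTG GCCATGTAC GCCATATTT GCCATCTTT "
    "GACATTTTG TCCATTTTG TCTAGGTTT GCTCCATTT TCCATGGTT GCCATCTTG "
    "GCCATTTTG GCCATCTTG ACCATGTCA TCCATGTGT GCCATCACA"
).split()

# IUPAC degenerate nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}

def degenerate_consensus(sites):
    """Return the IUPAC code covering every base observed in each column."""
    return "".join(BASES_TO_CODE[frozenset(column)] for column in zip(*sites))

print(degenerate_consensus(SITES))  # DMYMBNNNN
```

Because a single rare base is enough to widen a column's code, four of the nine positions collapse to N here, which illustrates why this representation is usually not enough.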

SLIDE 33

Analysis methods Modeling regulatory elements

Matrix-based representation

Count matrix for the 17 aligned binding sites (rows are nucleotides, columns are positions):

         1   2   3   4   5   6   7   8   9
    A    1   1   0  16   0   2   1   1   2
    C    0  16  15   1   1   7   1   2   2
    G   12   0   0   0   1   5   1   3   6
    T    4   0   2   0  15   3  14  11   7

What is the matrix representation?

  • Matrix columns correspond to positions in sites
  • Matrix rows correspond to nucleotides
  • Entries correspond to base counts at the site
  • Assumptions: independent positions, fixed width, no gaps

SLIDE 34

Analysis methods Modeling regulatory elements

Matrix-based representation

Counts

         1   2   3   4   5   6   7   8   9
    A    1   1   0  16   0   2   1   1   2
    C    0  16  15   1   1   7   1   2   2
    G   12   0   0   0   1   5   1   3   6
    T    4   0   2   0  15   3  14  11   7

Probabilities (normalized counts)

          1     2     3     4     5     6     7     8     9
    A   0.06  0.06  0.00  0.94  0.00  0.12  0.06  0.06  0.12
    C   0.00  0.94  0.88  0.06  0.06  0.41  0.06  0.12  0.12
    G   0.71  0.00  0.00  0.00  0.06  0.29  0.06  0.18  0.35
    T   0.24  0.00  0.12  0.00  0.88  0.18  0.82  0.65  0.41

Different kinds of matrices

  • Probability matrix: columns are position-specific nucleotide distributions
  • Many names: position-weight matrix (PWM), position-frequency matrix (PFM), profile, alignment matrix, etc.

  • We use PWM to refer to both count and probability matrices
  • Only 3 different kinds of matrices (we will see a scoring matrix later)
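
The step from aligned sites to a count matrix and then to a probability matrix can be sketched as below. The sites are the 17 from the slide; the helper names are illustrative and no pseudocount is applied at this stage.

```python
# The 17 aligned binding sites shown on the slide.
SITES = (
    "GCCATCTGT GCCATCCGC GCCATCTTG GCCATGTAC GCCATATTT GCCATCTTT "
    "GACATTTTG TCCATTTTG TCTAGGTTT GCTCCATTT TCCATGGTT GCCATCTTG "
    "GCCATTTTG GCCATCTTG ACCATGTCA TCCATGTGT GCCATCACA"
).split()

def count_matrix(sites):
    """Count each base at each position of equal-width, ungapped sites."""
    width = len(sites[0])
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

def probability_matrix(counts):
    """Normalize every column of a count matrix so that it sums to 1."""
    width = len(counts["A"])
    totals = [sum(counts[b][pos] for b in "ACGT") for pos in range(width)]
    return {b: [counts[b][pos] / totals[pos] for pos in range(width)] for b in "ACGT"}

counts = count_matrix(SITES)
probs = probability_matrix(counts)
for base in "ACGT":
    print(base, counts[base], [round(p, 2) for p in probs[base]])
```

Rounded to two decimals, the output reproduces the count and probability tables on the slide.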

SLIDE 36

Analysis methods Modeling regulatory elements

Sequence Logos

[Figure: sequence logo for the 9-position motif, drawn 5′ to 3′ with letter heights in bits (axis from 0 to 2); generated with weblogo.berkeley.edu]

Sequence Logos

  • Cartoon depiction of a motif
  • Size of base is proportional to frequency in matrix
  • Sometimes sizes are scaled by “information content” (not covered)

SLIDE 37

Analysis methods Modeling regulatory elements

Resources

Motif Databases

  • JASPAR (free) and TRANSFAC (BIOBASE)
  • Hundreds of known motifs and binding sites
  • Essential resources for regulatory sequence analysis

SLIDE 40

Analysis methods Predicting binding sites

Probability from a motif

          1     2     3     4     5     6     7     8     9
    A   0.06  0.06  0.00  0.94  0.00  0.12  0.06  0.06  0.12
    C   0.00  0.94  0.88  0.06  0.06  0.41  0.06  0.12  0.12
    G   0.71  0.00  0.00  0.00  0.06  0.29  0.06  0.18  0.35
    T   0.24  0.00  0.12  0.00  0.88  0.18  0.82  0.65  0.41

For the sequence TCTATGTTT, take the entry for each base at its position:

$$0.24 \times 0.94 \times 0.12 \times 0.94 \times 0.88 \times 0.29 \times 0.82 \times 0.65 \times 0.41 = 0.001419188$$

  • Possible to compute probability of a sequence from a motif
  • Multiply values corresponding to nucleotide at each position
  • This works because we assume positions are independent
  • In the example Pr(TCTATGTTT) = 0.001419188
  • ... but does that mean anything?
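
The multiplication above can be written out directly. This sketch hard-codes the rounded probability matrix from the slide; the function name is illustrative.

```python
# Probability matrix as printed on the slide (rounded to two decimals).
PROBS = {
    "A": [0.06, 0.06, 0.00, 0.94, 0.00, 0.12, 0.06, 0.06, 0.12],
    "C": [0.00, 0.94, 0.88, 0.06, 0.06, 0.41, 0.06, 0.12, 0.12],
    "G": [0.71, 0.00, 0.00, 0.00, 0.06, 0.29, 0.06, 0.18, 0.35],
    "T": [0.24, 0.00, 0.12, 0.00, 0.88, 0.18, 0.82, 0.65, 0.41],
}

def motif_probability(seq, probs):
    """Probability of seq under the motif, assuming independent positions."""
    p = 1.0
    for pos, base in enumerate(seq):
        p *= probs[base][pos]
    return p

p_site = motif_probability("TCTATGTTT", PROBS)
print(f"{p_site:.9f}")  # 0.001419188
```

Any sequence that uses a base with probability 0.00 at some position gets probability exactly zero, which is one reason pseudocounts matter once scores are computed.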

SLIDE 41

Analysis methods Predicting binding sites

Likelihood from motif vs base composition

Base composition, used at every position:

          1    2    3    4    5    6    7    8    9
    A    0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2
    C    0.3  0.3  0.3  0.3  0.3  0.3  0.3  0.3  0.3
    G    0.3  0.3  0.3  0.3  0.3  0.3  0.3  0.3  0.3
    T    0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2

For the same sequence TCTATGTTT:

$$0.2 \times 0.3 \times 0.2 \times 0.2 \times 0.2 \times 0.3 \times 0.2 \times 0.2 \times 0.2 = 0.000001152$$

  • Likelihood from motif was ≈ 0.00142
  • Assume each position sampled independently from base frequencies
  • Ratio of the likelihoods: 0.001419188/0.000001152 ≈ 1232
  • Match-score: obtained by taking log of this ratio
  • Positive match-score ⇒ sequence more likely from motif
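
A minimal sketch of the likelihood ratio and match-score for TCTATGTTT follows. It uses the rounded probabilities and the base composition from the slides; the base-2 logarithm is an assumption, since the slides only say "log". Note that the background product of seven 0.2 factors and two 0.3 factors evaluates to about 1.152e-06.

```python
import math

# Probability matrix and base composition as printed on the slides.
PROBS = {
    "A": [0.06, 0.06, 0.00, 0.94, 0.00, 0.12, 0.06, 0.06, 0.12],
    "C": [0.00, 0.94, 0.88, 0.06, 0.06, 0.41, 0.06, 0.12, 0.12],
    "G": [0.71, 0.00, 0.00, 0.00, 0.06, 0.29, 0.06, 0.18, 0.35],
    "T": [0.24, 0.00, 0.12, 0.00, 0.88, 0.18, 0.82, 0.65, 0.41],
}
BACKGROUND = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

def motif_likelihood(seq):
    """Likelihood of seq under the motif (independent positions)."""
    p = 1.0
    for pos, base in enumerate(seq):
        p *= PROBS[base][pos]
    return p

def background_likelihood(seq):
    """Likelihood of seq when each base is drawn from the base composition."""
    p = 1.0
    for base in seq:
        p *= BACKGROUND[base]
    return p

seq = "TCTATGTTT"
ratio = motif_likelihood(seq) / background_likelihood(seq)
match_score = math.log2(ratio)  # assumption: base-2 logarithm
print(background_likelihood(seq), ratio, match_score)
```

The match-score is positive here, so under these two models the sequence is more likely to have come from the motif than from the base composition.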

SLIDE 42

Analysis methods Predicting binding sites

Making a scoring matrix

Scoring matrix:

           1     2     3     4     5     6     7     8     9
    A   -1.6  -1.6  -4.2   2.2  -4.2  -0.7  -1.6  -1.6  -0.7
    C   -4.2   1.6   1.5  -2.1  -2.1   0.4  -2.1  -1.2  -1.2
    G    1.2  -4.2  -4.2  -4.2  -2.1  -0.0  -2.1  -0.7   0.2
    T    0.2  -4.2  -0.7  -4.2   2.1  -0.2   2.0   1.6   1.0

Each entry is a log ratio, for example for C at position 2:

$$\log\left(\frac{\text{probability from motif}}{\text{probability from base composition}}\right) = \log\left(\frac{0.94}{0.30}\right) = 1.6$$

Base composition: A 0.20, C 0.30, G 0.30, T 0.20

Probabilities from the motif:

          1     2     3     4     5     6     7     8     9
    A   0.06  0.06  0.00  0.94  0.00  0.12  0.06  0.06  0.12
    C   0.00  0.94  0.88  0.06  0.06  0.41  0.06  0.12  0.12
    G   0.71  0.00  0.00  0.00  0.06  0.29  0.06  0.18  0.35
    T   0.24  0.00  0.12  0.00  0.88  0.18  0.82  0.65  0.41
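
One way to build such a scoring matrix from counts is sketched below. The slide does not state the logarithm base or the pseudocount; base-2 logs with one pseudocount per column, split according to the base composition, are assumptions that happen to reproduce the printed values (including the finite -4.2 for bases never observed at a position).

```python
import math

# Count matrix for the 17 aligned sites and the base composition from the slides.
COUNTS = {
    "A": [1, 1, 0, 16, 0, 2, 1, 1, 2],
    "C": [0, 16, 15, 1, 1, 7, 1, 2, 2],
    "G": [12, 0, 0, 0, 1, 5, 1, 3, 6],
    "T": [4, 0, 2, 0, 15, 3, 14, 11, 7],
}
BACKGROUND = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}

def scoring_matrix(counts, background, pseudocount=1.0):
    """Log2 ratio of pseudocount-smoothed motif probability to background."""
    n_sites = sum(counts[base][0] for base in counts)
    scores = {}
    for base, row in counts.items():
        bg = background[base]
        scores[base] = [
            math.log2(((c + pseudocount * bg) / (n_sites + pseudocount)) / bg)
            for c in row
        ]
    return scores

scores = scoring_matrix(COUNTS, BACKGROUND)
for base in "ACGT":
    print(base, [round(s, 1) for s in scores[base]])
```

Changing the pseudocount or the base composition changes every entry, which is the sensitivity noted in the remarks on scoring matrices.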

SLIDE 43

Analysis methods Predicting binding sites

Scanning a sequence

Scoring matrix:

           1     2     3     4     5     6     7     8     9
    A   -1.6  -1.6  -4.2   2.2  -4.2  -0.7  -1.6  -1.6  -0.7
    C   -4.2   1.6   1.5  -2.1  -2.1   0.4  -2.1  -1.2  -1.2
    G    1.2  -4.2  -4.2  -4.2  -2.1   0.0  -2.1  -0.7   0.2
    T    0.2  -4.2  -0.7  -4.2   2.1  -0.2   2.0   1.6   1.0

Score of the window TCTATGTTT, one entry per position:

    0.2 + 1.6 - 0.7 + 2.2 + 2.1 + 0.0 + 2.0 + 1.6 + 1.0 = 10

Sequence being scanned: AGTATCACTCTATGTTTGTTGCACA

Basic steps

  • Slide matrix along sequence
  • Calculate score at each position
  • Keep scores that meet some criteria (e.g. above a cutoff)
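
The basic steps above can be sketched as a simple scan. The scoring matrix and sequence are the ones on the slide; the cutoff of 5.0 and the function name are illustrative assumptions, and only the forward strand is scanned.

```python
# Scoring matrix as printed on the slide (rounded to one decimal).
SCORES = {
    "A": [-1.6, -1.6, -4.2, 2.2, -4.2, -0.7, -1.6, -1.6, -0.7],
    "C": [-4.2, 1.6, 1.5, -2.1, -2.1, 0.4, -2.1, -1.2, -1.2],
    "G": [1.2, -4.2, -4.2, -4.2, -2.1, 0.0, -2.1, -0.7, 0.2],
    "T": [0.2, -4.2, -0.7, -4.2, 2.1, -0.2, 2.0, 1.6, 1.0],
}

def scan(sequence, scores, cutoff):
    """Slide the matrix along the sequence and keep windows scoring >= cutoff."""
    width = len(scores["A"])
    hits = []
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(scores[base][pos] for pos, base in enumerate(window))
        if score >= cutoff:
            hits.append((offset, window, score))
    return hits

hits = scan("AGTATCACTCTATGTTTGTTGCACA", SCORES, cutoff=5.0)
for offset, window, score in hits:
    print(offset, window, round(score, 1))
```

With this cutoff the only window kept is TCTATGTTT at 0-based offset 8, with a score of about 10. A fuller scan would typically also score the reverse complement of each window.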

SLIDE 45

Analysis methods Predicting binding sites

Remarks

About scoring matrices

  • Match-scores are sensitive to the base composition assumed
  • Also sensitive to pseudocount
  • Several algorithms exist for calculating scores fast
  • Statistical significance of matches can be measured multiple ways

About predicted sites

  • Provide mechanistic link between regulator and target in networks
  • High false positive rate: match-scores only tell part of the story
  • Should be combined with cross-species conservation (more later)

SLIDE 49

Analysis methods Predicting binding sites

What does enrichment mean?

Three desirable properties

  • 1. More total occurrences
  • 2. Stronger occurrences (i.e. higher scoring)
  • 3. More sequences containing an occurrence

But different assumptions are valid for different TFs/contexts

[Figure: two sets of sequences compared side by side (VS.)]

SLIDE 50

Analysis methods Predicting binding sites

Enrichment based on likelihood

  • OOPS: One Occurrence Per Sequence
  • ZOOPS: Zero Or One Occurrence Per Sequence
  • TCM: Two Component Mixture (any number per sequence)

  • Mixture models: rigorous statistical foundation for enrichment
  • These models capture the 3 aspects of enrichment: each sequence is a mixture of sites and non-sites

  • Likelihoods calculated for entire set of sequences
  • Necessary calculations closely related to match-scores
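
As a rough sketch of how such a likelihood relates to match-scores, the snippet below computes a ZOOPS-style log-likelihood ratio for a set of sequences. The formulation is an assumption rather than the presenters' method: a fraction `gamma` of sequences is taken to contain one site at a uniformly chosen position, and everything else follows the base composition. The parameter `gamma` and the example sequences are hypothetical.

```python
import math

# Probability matrix and base composition as printed on the slides.
PROBS = {
    "A": [0.06, 0.06, 0.00, 0.94, 0.00, 0.12, 0.06, 0.06, 0.12],
    "C": [0.00, 0.94, 0.88, 0.06, 0.06, 0.41, 0.06, 0.12, 0.12],
    "G": [0.71, 0.00, 0.00, 0.00, 0.06, 0.29, 0.06, 0.18, 0.35],
    "T": [0.24, 0.00, 0.12, 0.00, 0.88, 0.18, 0.82, 0.65, 0.41],
}
BACKGROUND = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
WIDTH = 9

def window_ratio(window):
    """Likelihood ratio (motif vs base composition) for one window."""
    ratio = 1.0
    for pos, base in enumerate(window):
        ratio *= PROBS[base][pos] / BACKGROUND[base]
    return ratio

def zoops_log_ratio(sequences, gamma=0.5):
    """Sum over sequences of log[(1 - gamma) + gamma * mean window ratio].

    gamma must be below 1 here: with zero-probability entries and no
    pseudocounts, a sequence without any possible site has mean ratio 0.
    """
    total = 0.0
    for seq in sequences:
        ratios = [window_ratio(seq[i:i + WIDTH]) for i in range(len(seq) - WIDTH + 1)]
        total += math.log((1 - gamma) + gamma * sum(ratios) / len(ratios))
    return total

with_site = ["AGTATCACTCTATGTTTGTTGCACA"]     # contains TCTATGTTT
without_site = ["AGTATCACAAAAAAAAAGTTGCACA"]  # hypothetical sequence with no site
print(zoops_log_ratio(with_site), zoops_log_ratio(without_site))
```

The per-window ratios are the same quantities whose logarithms give match-scores, so the set-level likelihood reflects how many occurrences there are, how strong they are, and how many sequences contain one.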

SLIDE 51

Analysis methods Predicting binding sites

Using a set of background sequences

[Figure: foreground sequences marked with occurrences of a yellow motif and a blue motif]

Which motif is more enriched?

  • Yellow motif occurs many times
  • Blue motif also occurs many times (and in consistent location)

  • Both may appear enriched

SLIDE 52

Analysis methods Predicting binding sites

Using a set of background sequences

[Figure: foreground sequences shown next to background sequences]

Why use a background set?

  • Statistical models of “random” promoters don’t work
  • Using a background can control many unknown variables
  • Different backgrounds can be used to examine different questions

SLIDE 53

Analysis methods Predicting binding sites

Selecting background sequences

Examples of desirable properties

  • Similar to foreground in terms of primary sequence features (e.g. GC-content, CpG-content)
  • Uniform length sequences (both FG and BG) can facilitate statistics
  • Share similar biological properties (e.g. compare promoters to other promoters)

Common mistakes

  • Comparing promoters to exons (very bad)
  • Comparing CpG-related promoters to non-CpG-related promoters
  • Having different repeat composition in background
  • Comparing sequences between species
  • Using too few sequences (results in over-fitting)

SLIDE 54

Analysis methods Predicting binding sites

Identifying enriched motifs

Why identify enriched motifs?

  • Identify motifs that are important regulators of a gene module
  • Obtain more information for connections in networks
  • Identify candidates for site prediction

Significance of motif enrichment

  • Enrichment scores more useful if p-values can be obtained
  • Empirical p-values can be obtained in multiple ways: shuffle sequences, permute sequence labels, permute matrix columns

  • Correct for multiple testing if evaluating enrichment of multiple motifs
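
One of the options listed, permuting sequence labels, can be sketched as follows. The enrichment statistic (difference in mean best match-score between foreground and background) and the scores themselves are illustrative assumptions; any enrichment score could be substituted.

```python
import random

def mean(values):
    return sum(values) / len(values)

def empirical_p_value(fg_scores, bg_scores, n_perm=2000, seed=0):
    """One-sided empirical p-value from permuting foreground/background labels."""
    rng = random.Random(seed)
    observed = mean(fg_scores) - mean(bg_scores)
    pooled = list(fg_scores) + list(bg_scores)
    n_fg = len(fg_scores)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = mean(pooled[:n_fg]) - mean(pooled[n_fg:])
        if stat >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)

# Hypothetical best match-score per sequence for one motif.
foreground = [9.1, 8.7, 10.2, 7.9, 9.5, 8.8]
background = [3.2, 4.1, 2.8, 5.0, 3.9, 4.4]
p_value = empirical_p_value(foreground, background)
print(p_value)
```

When many motifs are tested this way, each p-value still needs a multiple-testing correction, for example multiplying by the number of motifs tested (Bonferroni).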

SLIDE 56

Analysis methods Conservation of regulatory elements

Cross-species conservation

[Figure: UCSC Genome Browser view of the CKM gene on chr19 (positions 50518000 to 50519000), showing UCSC Known Genes (based on UniProt, RefSeq, and GenBank mRNA), RefSeq Genes, and the Vertebrate Multiz Alignment & Conservation (17 Species) track with rows for mouse, rat, rabbit, dog, armadillo, elephant, opossum, chicken, x_tropicalis and tetraodon]

Why do we use it?

  • Negative selection: things that are important will be conserved
  • Helps distinguish functional from non-functional sites

SLIDE 59

Analysis methods Conservation of regulatory elements

How to use conservation

Conserved regions

  • Search in pre-defined regions
  • e.g. Ultraconserved regions

Conservation profile

  • Assign conservation score to each individual base
  • e.g. phastCons scores

Use alignments directly

  • Much information in alignments
  • Requires more complex methods

[Figure: Vertebrate Multiz Alignment & PhastCons Conservation (28 Species) track showing the aligned bases and gaps for Human, Chimp, Rhesus, Bushbaby, TreeShrew, Mouse, Rat, GuineaPig, Rabbit, Shrew, Hedgehog, Dog, Cat, Horse, Cow, Armadillo, Elephant, Tenrec, Opossum and Platypus]

slide-60
SLIDE 60

Analysis methods Conservation of regulatory elements

Turnover and non-alignment methods

[Figure: species tree (Human, Chimp, Mouse, Rat, Dog, Cow, Frog) with binding sites marked as present-day or ancestral]

  • Functionally analogous sites (in different species) that do not align
  • Sites presumed to evolve under similar evolutionary constraints
  • Importance of turnover still not clear, but some evidence exists
  • Non-alignment methods in general less useful for predicting sites
  • Can indicate important motifs (cross-species enrichment)

slide-61
SLIDE 61

Analysis methods Conservation of regulatory elements

Things to consider

Which alignments to use

  • Precomputed alignments: multiz17way (recently 28-way), mlagan
  • Creating your own alignments raises many issues

Species to use

  • Understand the network being investigated
  • Make sure protein and function conserved in species compared
  • Accounts for compensatory substitutions in sites

slide-62
SLIDE 62

Analysis methods Motif discovery

Introduction Background on regulatory networks Data available for analysis Analysis methods Identifying gene modules Modeling regulatory elements Predicting binding sites Conservation of regulatory elements Motif discovery Cis-regulatory modules

slide-63
SLIDE 63

Analysis methods Motif discovery

Motif discovery

What is motif discovery?

  • Start with just sequences
  • Identify strongly enriched motifs de novo
  • Algorithmically one of the most challenging analysis tasks
  • Use it when you suspect important unknown motifs in your data

Motif discovery methods

  • Can be classified by motif representation
      • Word-based representation
      • Matrix-based representation
  • Also by algorithmic strategy
      • Discrete optimization
      • General statistical algorithms

slide-64
SLIDE 64

Analysis methods Motif discovery

Motif discovery by word counting

[Figure: a window slides along the input sequences; the current word (e.g. GAGTC, TTTGT) is looked up in a table of all words of width k (AAAAA, AAAAC, ..., TTTTT) and their occurrence counts]

For each word of width k:

  • Count number of occurrences
  • Apply statistics to counts
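
The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names and the binomial z-score are assumptions chosen for brevity, and real word-based methods use more careful statistics (for example Markov background models and corrections for overlapping words).

```python
from collections import Counter
from math import sqrt

def count_words(sequences, k):
    """Count every overlapping word of width k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def word_zscores(foreground, background, k):
    """Z-score of each foreground word count against a background frequency.

    Assumption: a simple binomial approximation with a pseudocount.
    """
    fg = count_words(foreground, k)
    bg = count_words(background, k)
    n_fg = sum(fg.values())
    n_bg = sum(bg.values())
    scores = {}
    for word, observed in fg.items():
        # The pseudocount keeps the background frequency away from zero.
        p = (bg[word] + 1) / (n_bg + 4 ** k)
        expected = n_fg * p
        scores[word] = (observed - expected) / sqrt(expected * (1 - p))
    return scores
```

Words with the highest scores are the candidates to inspect; the table of counts itself is the `Counter` returned by `count_words`.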

slide-65
SLIDE 65

Analysis methods Motif discovery

Gibbs Sampling

Start with a given motif and a set of occurrences

Occurrences: TCCATTTTG TCTAGGTTT GCTCCATTT TCCATGGTT GCCATCTTG GCCATTTTG GCCATCTTG ACCATGTCA GCCATGACA TCCATGTGT GACATTTTG GCCATCTTT

[Figure: count matrix with rows A, C, G, T built from the occurrences, and the input sequences with one occurrence marked in each]

slide-66
SLIDE 66

Analysis methods Motif discovery

Gibbs Sampling

Iterate these steps:

1) Sample a new occurrence from one sequence

Occurrences: TCCATTTTG TCTAGGTTT GCTCCATTT TCCATGGTT GCCATCTTG GCCATTTTG GCCATCTTG ACCATGTCA GCCATGACA TCCATGTGT GACATTTTG GCCATCTTT

[Figure: count matrix with rows A, C, G, T, and the input sequences with the candidate occurrence in one sequence being resampled]

Probability of selecting particular site related to strength of match to matrix

slide-67
SLIDE 67

Analysis methods Motif discovery

Gibbs Sampling

Iterate these steps:

1) Sample a new occurrence from one sequence
2) Update the matrix based on new occurrence

Occurrences: TCCATTTTG TCTAGGTTT GCTCCATTT TCCATGGTT GCCATCTTG GCCATTTTG GCCATCTTG ACCATGTCA GCCATGACA TCCATGTGT GACATTTTG GCCATCTTT

[Figure: count matrix with rows A, C, G, T after the update, and the input sequences with the newly sampled occurrence marked]

Usually the changes will move the matrix toward a stronger motif
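
The two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the published Gibbs Motif Sampler: it assumes exactly one occurrence per sequence, at least two sequences, a uniform background, uppercase A/C/G/T input, and a pseudocount of 1; all function names are hypothetical.

```python
import random

BASES = "ACGT"

def build_counts(sites, pseudocount=1.0):
    """Column-wise base counts (plus pseudocounts) for equal-width sites."""
    width = len(sites[0])
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for site in sites:
        for col, base in enumerate(site):
            counts[col][base] += 1
    return counts

def site_weight(counts, site):
    """Product of per-column base frequencies: strength of match to the matrix."""
    weight = 1.0
    for col, base in enumerate(site):
        weight *= counts[col][base] / sum(counts[col].values())
    return weight

def gibbs_step(sequences, positions, width, rng):
    """One iteration: resample the occurrence in one randomly chosen sequence."""
    i = rng.randrange(len(sequences))
    # Build the matrix from the occurrences in all other sequences.
    others = [s[p:p + width]
              for j, (s, p) in enumerate(zip(sequences, positions)) if j != i]
    counts = build_counts(others)
    seq = sequences[i]
    starts = list(range(len(seq) - width + 1))
    weights = [site_weight(counts, seq[p:p + width]) for p in starts]
    # Sites that match the matrix better are proportionally more likely.
    positions[i] = rng.choices(starts, weights=weights, k=1)[0]
    return positions

def gibbs_sample(sequences, width, iterations=200, seed=0):
    """Run the sampler from random starting positions; return sites and counts."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        gibbs_step(sequences, positions, width, rng)
    sites = [s[p:p + width] for s, p in zip(sequences, positions)]
    return positions, build_counts(sites, pseudocount=0.0)
```

Because the starting positions are random, different seeds can converge to different motifs, which is one reason to run such a sampler several times.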

slide-69
SLIDE 69

Analysis methods Motif discovery

Other techniques

Expectation Maximization (EM)

  • Instead of sampling sites with particular probability:
  • All possible sites contribute to the matrix
  • Contribution of each site related to probability (score)
  • Iterate through motifs instead of sites
  • Like deterministic version of Gibbs: no random choices after setting the starting point

Variants of EM or Gibbs

  • Gibbs Motif Sampler (Lawrence et al., 1993)
  • MEME (Bailey & Elkan, 1995)
  • AlignACE (Hughes et al., 2000)
  • MDscan (Liu et al., 2002)

Good starting points are critical for Gibbs and EM
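
For comparison with sampling, a single EM update can be sketched as below. It is an illustration only and not the MEME algorithm: it assumes exactly one site per sequence, a uniform background (which then cancels out), a starting matrix with no zero probabilities, and an arbitrary pseudocount of 0.1; `em_iteration` is a hypothetical name.

```python
BASES = "ACGT"

def em_iteration(sequences, pwm):
    """One EM update for a model with exactly one site per sequence.

    pwm: list of dicts, one per motif column, mapping base -> probability.
    Returns (new_pwm, posteriors), where posteriors[i][p] is the probability
    that the site in sequence i starts at position p.
    """
    width = len(pwm)
    pseudocount = 0.1  # assumed value; keeps probabilities away from zero
    expected = [{b: pseudocount for b in BASES} for _ in range(width)]
    posteriors = []
    for seq in sequences:
        # E-step: score every possible site, then normalize to probabilities.
        likelihoods = []
        for p in range(len(seq) - width + 1):
            like = 1.0
            for col in range(width):
                like *= pwm[col][seq[p + col]]
            likelihoods.append(like)
        total = sum(likelihoods)
        post = [like / total for like in likelihoods]
        posteriors.append(post)
        # M-step accumulation: every site contributes, weighted by probability.
        for p, weight in enumerate(post):
            for col in range(width):
                expected[col][seq[p + col]] += weight
    new_pwm = []
    for col in expected:
        col_total = sum(col.values())
        new_pwm.append({b: col[b] / col_total for b in BASES})
    return new_pwm, posteriors
```

Calling `em_iteration` repeatedly with the returned matrix is fully deterministic, so the result depends entirely on the starting matrix.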

slide-72
SLIDE 72

Analysis methods Motif discovery

Things to consider

Current status

  • Field starting to mature: many great algorithms exist!
  • Probably none will be “perfect” for your application
  • Try several algorithms, understand what they do

How to improve

  • Combine best aspects of different algorithms
  • Incorporate more biological knowledge

DME: Discriminating Motif Enumerator

  • Enumerative search strategy, matrix-based motifs
  • Smith, Sumazin & Zhang (PNAS, 2005)

slide-73
SLIDE 73

Analysis methods Cis-regulatory modules

Introduction Background on regulatory networks Data available for analysis Analysis methods Identifying gene modules Modeling regulatory elements Predicting binding sites Conservation of regulatory elements Motif discovery Cis-regulatory modules

slide-74
SLIDE 74

Analysis methods Cis-regulatory modules

What is a cis-regulatory module?

The IFNβ Enhancer

  • Figure from Maniatis et al. (CSHL Symposium 1998)
  • Critical property: sites that work together tend to cluster

slide-75
SLIDE 75

Analysis methods Cis-regulatory modules

What is a cis-regulatory module?

(Yuh et al., 2001)

Sea Urchin Endo16 promoter

  • Figure from Yuh et al. (2001)
  • Promoter logic: CRMs are autonomous units encoding regulation

slide-76
SLIDE 76

Analysis methods Cis-regulatory modules

Identifying cis-regulatory modules

[Figure: UCSC Genome Browser view of chr2 (approximately 236,860,000 to 236,885,000) showing the PReMod Predicted Regulatory Modules track with STS markers, UCSC Known Genes, RefSeq Genes, Exoniphy, ExonWalk, mRNA and spliced EST tracks, the vertebrate Multiz alignment and conservation track (mouse, rat, rabbit, dog, armadillo, elephant, opossum, chicken, x_tropicalis, tetraodon), SNPs, and RepeatMasker; genes shown include GBX2 and ASB18]

Annotations on the predicted module:

  • Far from gene
  • Occurrences tightly clustered
  • Strong occurrences of known motifs
  • Highly conserved region

PReMod (Blanchette et al, 2006)

slide-77
SLIDE 77

Analysis methods Cis-regulatory modules

Motif modules

What are they?

  • A set of motifs for sites that frequently work together
  • CRMs are the occurrences of motif modules
  • Often can predict expression better than individual motifs
  • Simplest kind: pair of sites for dimerizing TFs

Interesting properties

  • Relative order: some motifs must be beside each other
  • Total span and spacing of sites can be restricted
  • Relative orientation sometimes important
  • Weaker individual sites: combined affinity is important
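
The span restriction can be illustrated with a short sketch that looks for windows containing a site of every motif in a module. The function name and the motif names in the usage note are hypothetical, and the sketch deliberately ignores order, orientation, and combined affinity.

```python
def find_module_occurrences(sites, motifs, max_span):
    """Return (start, end) windows containing at least one site of every motif.

    sites: list of (position, motif_name) pairs for one sequence, where
           position is the start coordinate of the site.
    motifs: the set of motif names that make up the module.
    max_span: largest allowed distance between first and last site start.
    """
    wanted = set(motifs)
    hits = sorted((pos, name) for pos, name in sites if name in wanted)
    windows = []
    for i, (start, _) in enumerate(hits):
        seen = set()
        for pos, name in hits[i:]:
            if pos - start > max_span:
                break  # sites further right are too far from this start
            seen.add(name)
            if seen == wanted:
                windows.append((start, pos))
                break
    return windows
```

For example, with sites `[(10, "NFKB"), (40, "IRF"), (300, "NFKB"), (900, "IRF")]` and a maximum span of 100, only the window from 10 to 40 qualifies.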

slide-78
SLIDE 78

Analysis methods Cis-regulatory modules

Discovering motif modules

Library based

  • Given a library of motifs, construct modular motifs
  • Many known motifs have important interactions

De-novo discovery

  • Discover modular motifs from sequence alone
  • Currently no generally practical methods
  • Anchoring strategy: almost de novo, and can be useful
  • CisModule: one of the most sophisticated algorithms

slide-79
SLIDE 79

Part II Part II: Worked Examples

slide-80
SLIDE 80

Overview

Analyzing sets of co-regulated genes An example gene module Identifying enriched known motifs Predicting functional binding sites Analysis of transcription factor Localization data ChIP-chip data examples Identifying enriched known motifs Identifying co-factors Discovering motifs de novo

slide-81
SLIDE 81

Analyzing sets of co-regulated genes

Analyzing sets of co-regulated genes An example gene module Identifying enriched known motifs Predicting functional binding sites Analysis of transcription factor Localization data ChIP-chip data examples Identifying enriched known motifs Identifying co-factors Discovering motifs de novo

slide-82
SLIDE 82

Analyzing sets of co-regulated genes An example gene module

Analyzing sets of co-regulated genes An example gene module Identifying enriched known motifs Predicting functional binding sites Analysis of transcription factor Localization data ChIP-chip data examples Identifying enriched known motifs Identifying co-factors Discovering motifs de novo

slide-83
SLIDE 83

Analyzing sets of co-regulated genes An example gene module

Example gene module

LPS responsive genes

  • Bacterial LPS (lipopolysaccharide) stimulates B-cell activation, proliferation, and differentiation
  • Gene module compiled through individual experiments
  • Ramirez-Carrozzi et al. (Genes & Dev, 2006)

Selective and antagonistic functions of SWI/SNF and Mi-2β nucleosome remodeling complexes during an inflammatory response

Properties of the gene module

  • The gene module comprises 35 genes
  • Some are TFs (e.g. Irf1, Irf7, Junb, Fos, Nfkbiz, Egr1, Zfp369)
  • Several known binding sites in promoters of these genes (e.g. IFNβ enhancer)

slide-84
SLIDE 84

Analyzing sets of co-regulated genes An example gene module

Analysis tasks

Analysis tasks

  • Identify enriched known motifs
  • Use known motifs to predict functional binding sites

slide-85
SLIDE 85

Analyzing sets of co-regulated genes An example gene module

Obtaining promoter sequences

Promoter databases

  • Examples: EPD, DBTSS, CSHLmpd
  • Use when promoter choice really matters (e.g. small data sets, many alternative promoters)

UCSC Table Browser to get promoters

  • Start with set of RefSeq IDs for genes in module
  • Select the appropriate table (refGene for mm8)
  • Upload the RefSeq IDs
  • Select sequence output format
  • Select “upstream by 1000bp”

slide-86
SLIDE 86

Analyzing sets of co-regulated genes Identifying enriched known motifs

Analyzing sets of co-regulated genes An example gene module Identifying enriched known motifs Predicting functional binding sites Analysis of transcription factor Localization data ChIP-chip data examples Identifying enriched known motifs Identifying co-factors Discovering motifs de novo

slide-87
SLIDE 87

Analyzing sets of co-regulated genes Identifying enriched known motifs

The motifclass program

How it evaluates enrichment

  • Compares set of foreground sequences to background sequences
  • For a given motif, each sequence is assigned a score
  • The score is the maximum match-score of any site in the sequence
  • The scores are used to classify foreground and background sequences
  • Sequences with higher scores are classified as foreground
  • Better classification ability means greater enrichment
  • p-values obtained by randomly permuting sequence labels
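
The evaluation idea above can be sketched as follows. This is not the motifclass implementation: the function names are hypothetical, the inputs are assumed to be the per-sequence maximum match scores already computed for one motif, and averaging the two misclassification rates is an assumed error measure.

```python
import random

def balanced_error(fg_scores, bg_scores, cutoff):
    """Error rate when sequences scoring >= cutoff are called foreground."""
    sensitivity = sum(s >= cutoff for s in fg_scores) / len(fg_scores)
    specificity = sum(s < cutoff for s in bg_scores) / len(bg_scores)
    return 1.0 - (sensitivity + specificity) / 2.0

def best_error(fg_scores, bg_scores):
    """Lowest error over all cutoffs taken from the observed scores."""
    cutoffs = set(fg_scores) | set(bg_scores)
    return min(balanced_error(fg_scores, bg_scores, c) for c in cutoffs)

def permutation_pvalue(fg_scores, bg_scores, shuffles=1000, seed=0):
    """Fraction of label shuffles that classify at least as well as observed."""
    rng = random.Random(seed)
    observed = best_error(fg_scores, bg_scores)
    pooled = list(fg_scores) + list(bg_scores)
    n_fg = len(fg_scores)
    as_good = 0
    for _ in range(shuffles):
        # Shuffling the pooled scores is equivalent to permuting the labels.
        rng.shuffle(pooled)
        if best_error(pooled[:n_fg], pooled[n_fg:]) <= observed:
            as_good += 1
    # Adding one to both counts avoids reporting a p-value of exactly zero.
    return (as_good + 1) / (shuffles + 1)
```

A motif whose scores separate the two sets well yields a low error and a small p-value; the p-value resolution is limited by the number of shuffles.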

slide-88
SLIDE 88

Analyzing sets of co-regulated genes Identifying enriched known motifs

The motifclass program

Background sequences Foreground sequences

slide-89
SLIDE 89

Analyzing sets of co-regulated genes Identifying enriched known motifs

Using motifclass to evaluate motif enrichment

Sequence files

  • Foreground: the 35 proximal promoters
  • Background: 1000 random mm8 RefSeq promoters
  • Promoter sequences taken -1000 to -1 relative to the TSS
  • Sequences given in FASTA format

Motif library

  • Known motifs from the JASPAR database
  • Total of 123 motifs (some redundancy)
  • Motifs must be converted into CREAD motif format

slide-90
SLIDE 90

Analyzing sets of co-regulated genes Identifying enriched known motifs

The CREAD motif file format

AC: the accession

  • Identifier for each motif
  • Best to keep them unique

slide-91
SLIDE 91

Analyzing sets of co-regulated genes Identifying enriched known motifs

The CREAD motif file format

TY: the type of pattern

  • Type of this pattern is “Motif”
  • Just to tell programs what they are looking at

slide-92
SLIDE 92

Analyzing sets of co-regulated genes Identifying enriched known motifs

The CREAD motif file format

The matrix lines

  • This is the actual PWM
  • Transposed: one line per column
  • Either counts or probabilities

slide-93
SLIDE 93

Analyzing sets of co-regulated genes Identifying enriched known motifs

The CREAD motif file format

AT: the attributes

  • Annotate motifs with additional information
  • Attribute=value pairs
  • Usually optional
  • Some programs require certain attributes

slide-94
SLIDE 94

Analyzing sets of co-regulated genes Identifying enriched known motifs

The CREAD motif file format

BS: the binding site lines

  • To store sites for each motif
  • More details on this later

slide-95
SLIDE 95

Analyzing sets of co-regulated genes Identifying enriched known motifs

Running motifclass on LPS-responsive promoters

  • -r: use relative error as enrichment measure
  • -O: find the score cutoff optimizing that enrichment
  • -P 1000: report a p-value for each motif using 1000 shuffles
  • -v: print progress information while running

slide-96
SLIDE 96

Analyzing sets of co-regulated genes Identifying enriched known motifs

What the output looks like

Attributes from motifclass

  • Relative error rate
  • Sensitivity and specificity
  • Optimal score cutoff (Functional depth and threshold)

  • p-value and rank (in set of motifs)

slide-97
SLIDE 97

Analyzing sets of co-regulated genes Identifying enriched known motifs

Interpreting the results

     Name        Sn     Sp     Error  p-value
 1.  NFKB1       0.743  0.603  0.327
 2.  RELA        0.686  0.655  0.33
 3.  NF-kappaB   0.4    0.9    0.35   0.002
 4.  Dorsal 1    0.886  0.413  0.351
 5.  REL         0.314  0.956  0.365  0.008
 6.  En1         0.686  0.584  0.365  0.009
 7.  IRF2        0.371  0.872  0.378  0.015
 8.  TBP         0.371  0.867  0.381  0.018
 9.  Dorsal 2    0.429  0.798  0.387  0.032
10.  ZNF42 5-13  0.629  0.59   0.391  0.03

[Logo column: sequence logo for each motif]

slide-98
SLIDE 98

Analyzing sets of co-regulated genes Identifying enriched known motifs

Implications for the LPS network

NF-κB motif highly enriched

  • Top 5 motifs all NF-κB family members
  • Likely a master regulator
  • Expected to have multiple direct targets (next task)

Other motifs and TFs

  • IRF motif is important
  • Could be Irf1, Irf7 or some other Irf family member
  • Other IRF motifs ranked high

slide-99
SLIDE 99

Analyzing sets of co-regulated genes Predicting functional binding sites

Analyzing sets of co-regulated genes An example gene module Identifying enriched known motifs Predicting functional binding sites Analysis of transcription factor Localization data ChIP-chip data examples Identifying enriched known motifs Identifying co-factors Discovering motifs de novo

slide-100
SLIDE 100

Analyzing sets of co-regulated genes Predicting functional binding sites

Predicting functional binding sites

Inputs

  • Sequences: where we will search (e.g. promoters)
  • Genomic regions: where functional sites are not likely (e.g. inside CDS)
  • Alignments: for conservation in sequences searched
  • Motif library: known or novel motifs whose sites we want to identify

Steps

1) Identify candidate sites: scan sequences for sites scoring above the cutoff for each motif.
2) Filter by location: eliminate candidate sites occurring inside these regions.
3) Filter by conservation: eliminate candidate sites without desired conservation properties.

Predicted sites

  • Final set of predicted sites; to be evaluated experimentally

slide-101
SLIDE 101

Analyzing sets of co-regulated genes Predicting functional binding sites

Identifying candidate sites

About this step

  • Goal: identify sites that strongly match our motifs
  • Sequences: 1000bp promoters of the 35 LPS-responsive genes
  • Motif library: the JASPAR motifs
  • We will use the storm program for finding sites

The STORM program

  • Select a p-value cutoff
  • Indicate that the cutoff is a match-score p-value
  • Often difficult to select this
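
What scanning for candidate sites involves can be sketched generically. The code below is not STORM: it assumes uppercase A/C/G/T input, a pseudocount of 1, and a raw log-odds score threshold rather than a match-score p-value, and all names are hypothetical.

```python
from math import log2

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def scoring_matrix(counts, base_comp, pseudocount=1.0):
    """Log-odds scores from per-column base counts and a base composition."""
    matrix = []
    for col in counts:
        total = sum(col[b] + pseudocount for b in BASES)
        matrix.append({b: log2(((col[b] + pseudocount) / total) / base_comp[b])
                       for b in BASES})
    return matrix

def scan(sequence, matrix, threshold):
    """Return (position, strand, score) for every site scoring >= threshold."""
    width = len(matrix)
    hits = []
    for pos in range(len(sequence) - width + 1):
        site = sequence[pos:pos + width]
        # The reverse complement is scored to cover the opposite strand.
        revcomp = site.translate(COMPLEMENT)[::-1]
        for strand, word in (("+", site), ("-", revcomp)):
            score = sum(matrix[i][base] for i, base in enumerate(word))
            if score >= threshold:
                hits.append((pos, strand, score))
    return hits
```

Turning a p-value cutoff into a score threshold requires the score distribution under the background model, which this sketch leaves out.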

slide-102
SLIDE 102

Analyzing sets of co-regulated genes Predicting functional binding sites

Using the storm program

  • -C: give a base composition (used to build scoring matrices)
  • -t: specify the score threshold for sites
  • -p: indicate that the threshold is a match-score p-value
  • -v: print progress information while running

slide-103
SLIDE 103

Analyzing sets of co-regulated genes Predicting functional binding sites

The set of candidate binding sites

  • Figure: candidate binding sites in part of a storm output file
  • 1674 candidates identified for 123 motifs (13.6 sites/motif)
  • Additional candidates identified in larger -10K to -1001 region
  • Vast majority are false-positives, and must be filtered

slide-104
SLIDE 104

Analyzing sets of co-regulated genes Predicting functional binding sites

Excluding less important regions

About this step

  • Goal: eliminate candidates less likely to be functional
  • Regions to exclude: CDS and RepeatMasker repeats
  • Functional sites are less likely in those regions
  • Program: sitesifter from CREAD

The sitesifter program

  • Filters set of sites based on location
  • Identifies sites contained in, or excluded from, a set of regions
  • Can also filter set of sites based on scores (above/below some cutoff)
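
Location-based filtering amounts to an interval containment test. A minimal sketch follows; it is not the sitesifter code, the names are hypothetical, and intervals are assumed to be half-open `(start, end)` pairs on the same sequence.

```python
def contained_in(site, region):
    """True when the site interval lies entirely inside the region."""
    return region[0] <= site[0] and site[1] <= region[1]

def split_by_regions(sites, regions):
    """Partition sites into those inside some region and those outside all."""
    inside = [s for s in sites if any(contained_in(s, r) for r in regions)]
    outside = [s for s in sites if not any(contained_in(s, r) for r in regions)]
    return inside, outside
```

With CDS and repeat intervals as the regions, the `outside` list holds the candidates that survive this step. For large genomes a sorted interval structure would be faster than this quadratic comparison.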

slide-105
SLIDE 105

Analyzing sets of co-regulated genes Predicting functional binding sites

Using the sitesifter program

  • Running the program is straightforward
  • Figure: filtered 402 sites contained in repeat regions

slide-106
SLIDE 106

Analyzing sets of co-regulated genes Predicting functional binding sites

Filtering based on conservation

About this step

  • Goal: identify remaining candidate sites that appear conserved
  • Alignments: precomputed UCSC multiz17way alignments
  • Species: all vertebrate species in the alignment
  • We will use the multistorm program to evaluate site conservation

What multistorm does

  • Takes a set of candidate sites for some motifs
  • Evaluates the aligned sites in other species using same motif
  • Given a cutoff score, count species scoring greater at aligned sites
  • Final score is number of species scoring above cutoff at the site
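
The counting rule can be sketched as below. This is not the multistorm implementation: the function name is hypothetical, the scoring function is supplied by the caller, and treating any aligned site that contains gaps as not conserved is an assumption.

```python
def conservation_count(aligned_sites, score_site, width, cutoff):
    """Number of species whose aligned site scores above the cutoff.

    aligned_sites: dict mapping species name -> aligned sequence at the site
                   (gap characters '-' allowed).
    score_site: function returning the motif match score of an ungapped site.
    width: motif width; aligned sites of any other ungapped length are skipped.
    """
    count = 0
    for aligned in aligned_sites.values():
        site = aligned.replace("-", "")
        # Assumption: a site disrupted by gaps does not count as conserved.
        if len(site) == width and score_site(site) > cutoff:
            count += 1
    return count
```

Candidate sites can then be kept or dropped by comparing this count with a required number of species.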

slide-107
SLIDE 107

Analyzing sets of co-regulated genes Predicting functional binding sites

Using the multistorm program

  • -C: give a base composition (same as we used to get candidates)
  • -c: specify score p-value cutoff (also same value as for candidates)
  • -v: print progress information while running
  • Other parameters specify input (i.e. alignment) and output files
  • Used sitesifter to get the 317 sites conserved in 4 species

slide-108
SLIDE 108

Analyzing sets of co-regulated genes Predicting functional binding sites

The set of predicted functional sites

Properties of the predicted sites

  • 1. Each is a strong match to a known binding site motif
  • 2. None appear in CDS or repeat regions
  • 3. Each is conserved through multiple species

What did we find?

  • 317 total sites
  • Includes overlapping sites, and sites for redundant motifs
  • 26 unique high-confidence predicted sites for NFkB
  • 9 unique high-confidence predicted sites for IRFs

slide-109
SLIDE 109

Analysis of transcription factor localization data

Analyzing sets of co-regulated genes An example gene module Identifying enriched known motifs Predicting functional binding sites Analysis of transcription factor Localization data ChIP-chip data examples Identifying enriched known motifs Identifying co-factors Discovering motifs de novo

slide-110
SLIDE 110

Analysis of transcription factor localization data ChIP-chip data examples

Analyzing sets of co-regulated genes An example gene module Identifying enriched known motifs Predicting functional binding sites Analysis of transcription factor Localization data ChIP-chip data examples Identifying enriched known motifs Identifying co-factors Discovering motifs de novo

slide-111
SLIDE 111

Analysis of transcription factor localization data ChIP-chip data examples

ChIP arrays

Promoter arrays

  • Use long probes to cover proximal promoters
  • Probe coverage is sparse
  • Transcription factor localization evidence from few probes

Tiling arrays

  • Dense covering of proximal promoters, possibly including distal regions or even whole genome coverage
  • Varying coverage density
  • Transcription factor localization evidence from a set of probes

slide-112
SLIDE 112

Analysis of transcription factor localization data ChIP-chip data examples

E2F4 localization in primary human fibroblasts

E2F4 background

  • The E2F family of transcription factors is essential for cell cycle activity
  • E2F transcription factors are known to bind proximally to the TSS
  • E2F4 is known to regulate the G2/M phase

Our data

  • A set of probed promoters
  • A subset composed of promoters found to be localized with E2F4

Our task

  • Identify enriched motifs in the set of E2F4-localized promoters

slide-113
SLIDE 113

Analysis of transcription factor localization data ChIP-chip data examples

CTCF localization in primary human fibroblasts

CTCF background

  • CTCF is an 11-zinc-finger vertebrate nuclear insulator
  • CTCF binds far from transcription start sites
  • CTCF localization appears to be independent of cell type

Our data

  • A set of regions that were identified to be localized with CTCF

Our task

  • Discover and identify enriched motifs in the set of CTCF-localized regions


slide-114
SLIDE 114

Analysis of transcription factor localization data ChIP-chip data examples

Obtaining sequence sets

E2F foreground and background sequences

  • Over 10K probed promoters
  • Segment lengths range from 700 to 1000 and have to be normalized
  • 236 E2F4-localized promoters form the foreground
  • Background selected by sampling from the remaining promoters

CTCF foreground and background sequences

  • Over 15K CTCF-localized segments to form the foreground
  • Segment lengths range from 350 to 5150 and have to be normalized
  • We analyze a sample: 500 is plenty
  • Background constructed by either
      • shuffling the foreground to preserve base composition or dinucleotide composition
      • using non-overlapping same-size flanking regions
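
The sampling and length-normalization steps above can be sketched as follows. This is only an illustration, not the presenters' procedure: the slides do not say how lengths were normalized, so trimming every segment to a common window centred on its midpoint is an assumption, and `normalize_length` and `sample_segments` are hypothetical helper names.

```python
import random

def normalize_length(seq, width):
    """Trim seq to `width` bases centred on its midpoint (assumed scheme)."""
    if len(seq) <= width:
        return seq
    start = (len(seq) - width) // 2
    return seq[start:start + width]

def sample_segments(segments, n, width, seed=0):
    """Draw n segments without replacement and trim each to a common width."""
    rng = random.Random(seed)
    chosen = rng.sample(segments, min(n, len(segments)))
    return [normalize_length(s, width) for s in chosen]

# Toy segments of 400 to 556 bases stand in for real localized regions.
segments = ["ACGT" * k for k in range(100, 140)]
foreground = sample_segments(segments, n=10, width=350)
```

With real data, `n` would be the 500 mentioned on the slide and `width` would be chosen no larger than the shortest segment.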


slide-115
SLIDE 115

Analysis of transcription factor localization data Identifying enriched known motifs


slide-116
SLIDE 116

Analysis of transcription factor localization data Identifying enriched known motifs

Selecting foreground and background – CTCF

  • Sample 500 sequences from CTCF-localized segments
  • Identify non-overlapping flanking regions
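
One way to pick flanking regions is sketched below. It assumes half-open `(chrom, start, end)` coordinates and reads "non-overlapping" as "not overlapping any foreground segment"; both are assumptions, and the function names are hypothetical.

```python
def flanks(region):
    """Same-size windows immediately left and right of a region."""
    chrom, start, end = region
    size = end - start
    left = (chrom, start - size, start)
    right = (chrom, end, end + size)
    # Drop a left flank that would run off the start of the chromosome.
    return [f for f in (left, right) if f[1] >= 0]

def overlaps(a, b):
    """True if two half-open intervals on the same chromosome intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def flanking_background(foreground):
    """Collect flanks that do not overlap any foreground region."""
    background = []
    for region in foreground:
        for cand in flanks(region):
            if not any(overlaps(cand, fg) for fg in foreground):
                background.append(cand)
    return background

foreground = [("chr1", 1000, 1500), ("chr1", 1500, 2200), ("chr2", 100, 400)]
background = flanking_background(foreground)
```

The quadratic overlap check is fine for a sample of a few hundred regions; a sorted sweep would be the usual choice for the full set.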


slide-117
SLIDE 117

Analysis of transcription factor localization data Identifying enriched known motifs

Selecting foreground and background – CTCF

  • Shuffling to preserve base composition and dinucleotide composition
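
The two kinds of shuffling can be sketched as below. The dinucleotide version walks a random Eulerian path through the graph of adjacent base pairs, which preserves dinucleotide counts exactly but does not sample uniformly from all such sequences; it is a swapped-in illustration, not necessarily the method used to produce the slides.

```python
import random
from collections import Counter, defaultdict

def shuffle_bases(seq, rng):
    """Permute positions: preserves base (mononucleotide) composition only."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def shuffle_dinucleotides(seq, rng):
    """Random Eulerian walk over the sequence's dinucleotide graph.

    Each base is a node and each adjacent pair an edge, so any walk that
    uses every edge exactly once keeps all dinucleotide (and base) counts.
    """
    if len(seq) < 2:
        return seq
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    for nxt in successors.values():
        rng.shuffle(nxt)
    stack, walk = [seq[0]], []
    while stack:  # iterative Hierholzer traversal
        node = stack[-1]
        if successors[node]:
            stack.append(successors[node].pop())
        else:
            walk.append(stack.pop())
    return "".join(reversed(walk))

rng = random.Random(1)
original = "ACGCGTTACGGCATCGCGAT"
mono = shuffle_bases(original, rng)
dinuc = shuffle_dinucleotides(original, rng)
```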


slide-118
SLIDE 118

Analysis of transcription factor localization data Identifying enriched known motifs

Running motifclass

  • -r: use relative error as enrichment measure
  • -O: find the score cutoff optimizing that enrichment
  • -P 1000: report a p-value for each motif using 1000 shuffles
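
The idea behind an optimized score cutoff can be sketched in a few lines. The Err values in the result tables of this deck are consistent with 1 − (Sens + Spec)/2 (for example Sens 0.60 and Spec 0.70 with Err 0.348), so the sketch assumes that definition of relative error. It does not call motifclass, and the function names and toy scores are hypothetical.

```python
def relative_error(sens, spec):
    """Assumed definition: mean of false-negative and false-positive rates."""
    return 1.0 - (sens + spec) / 2.0

def best_cutoff(fg_scores, bg_scores):
    """Scan candidate cutoffs; a sequence counts as a hit if its best motif
    score is >= cutoff. Returns (error, cutoff, sens, spec) with lowest error."""
    best = None
    for cutoff in sorted(set(fg_scores) | set(bg_scores)):
        sens = sum(s >= cutoff for s in fg_scores) / len(fg_scores)
        spec = sum(s < cutoff for s in bg_scores) / len(bg_scores)
        cand = (relative_error(sens, spec), cutoff, sens, spec)
        if best is None or cand[0] < best[0]:
            best = cand
    return best

fg = [8.1, 7.4, 6.9, 5.2, 3.0]  # toy best-site score per foreground sequence
bg = [6.0, 4.1, 3.9, 3.5, 2.2]  # toy best-site score per background sequence
err, cutoff, sens, spec = best_cutoff(fg, bg)
```

Only observed scores need to be tried as cutoffs, because sensitivity and specificity change only at those values.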


slide-119
SLIDE 119

Analysis of transcription factor localization data Identifying enriched known motifs

uniqmotifs and matcompare

Programs to compare motifs

  • Consider all legal alignments specified using max overhang for the smaller matrix (-h)
  • Require that the average K-L divergence per aligned column is no greater than specified (-t)
  • uniqmotifs clusters a sorted list of similar motifs so that lower ranking motifs are listed below similar higher ranking motifs
  • matcompare queries a motif library to identify similar motifs to those in the input list
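
The comparison described above can be illustrated with a small sketch. It is not the matcompare or uniqmotifs implementation: the symmetrised divergence, the pseudocount, the default threshold and all function names are assumptions made for the example.

```python
import math

def smooth(col, pseudo=0.01):
    """Add a small pseudocount so no probability is zero (assumed handling)."""
    total = sum(col) + pseudo * len(col)
    return [(x + pseudo) / total for x in col]

def kl(p, q):
    """Kullback-Leibler divergence between two columns, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

def column_divergence(p, q):
    """Symmetrised K-L divergence of two smoothed columns."""
    p, q = smooth(p), smooth(q)
    return 0.5 * (kl(p, q) + kl(q, p))

def best_alignment(a, b, max_overhang):
    """Slide the shorter matrix along the longer one, letting at most
    `max_overhang` of its columns hang off either end. Returns the lowest
    average divergence per aligned column and the offset achieving it."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    best_score, best_offset = math.inf, None
    last = len(long_) - len(short) + max_overhang
    for offset in range(-max_overhang, last + 1):
        pairs = [(short[i], long_[i + offset]) for i in range(len(short))
                 if 0 <= i + offset < len(long_)]
        if not pairs:
            continue
        score = sum(column_divergence(p, q) for p, q in pairs) / len(pairs)
        if score < best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset

def similar(a, b, max_overhang=1, threshold=0.5):
    """Two matrices are called similar if some alignment is under threshold."""
    return best_alignment(a, b, max_overhang)[0] <= threshold

# Columns are probabilities for A, C, G, T.
motif = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
padded = [[0.25] * 4] + motif + [[0.25] * 4]
other = [[0.1, 0.1, 0.1, 0.7]] * 3
score, offset = best_alignment(motif, padded, max_overhang=1)
```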


slide-120
SLIDE 120

Analysis of transcription factor localization data Identifying enriched known motifs

Sorting and pruning

  • Sort by relative error rate
  • Cluster similar motifs
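
Sorting and pruning can be pictured as a greedy pass over the ranked list. The sketch below is an assumption about the general approach, not the uniqmotifs algorithm; the `group` labels are hypothetical stand-ins for a real matrix comparison, and the accession numbers and error rates are taken from the E2F enrichment table in this deck.

```python
def cluster_ranked(motifs, similar):
    """Sort by error rate, then attach each motif to the first
    higher-ranking cluster whose lead motif it resembles."""
    clusters = []
    for motif in sorted(motifs, key=lambda m: m["err"]):
        for cluster in clusters:
            if similar(cluster[0], motif):
                cluster.append(motif)
                break
        else:
            clusters.append([motif])
    return clusters

motifs = [
    {"acc": "MA0028", "err": 0.364, "group": "c"},
    {"acc": "MA0060", "err": 0.327, "group": "a"},
    {"acc": "MA0076", "err": 0.373, "group": "c"},
    {"acc": "MA0024", "err": 0.350, "group": "b"},
    {"acc": "MA0080", "err": 0.359, "group": "c"},
]

# Stand-in similarity test; a real run would compare the matrices themselves.
same_group = lambda a, b: a["group"] == b["group"]
clusters = cluster_ranked(motifs, same_group)
ranked = [[m["acc"] for m in c] for c in clusters]
```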


slide-121
SLIDE 121

Analysis of transcription factor localization data Identifying enriched known motifs

CTCF – enrichment against shuffled and flanking

Two tables appear on the slide; the extracted text does not label which background each one used. Logo column: images not reproduced.

Rank  Acc     TF              Err    Sens  Spec  p-val  FD
1     MA0045  HMG-IY          0.348  0.60  0.70  0.000  0.88
      MA0120  ID1             0.416  0.43  0.74  0.000  0.92
2     MA0041  Foxd3           0.355  0.72  0.57  0.000  0.89
      MA0042  FOXI1           0.386  0.76  0.47  0.000  0.86
3     MA0013  Broad-complex4  0.381  0.61  0.63  0.000  0.88
4     MA0010  Broad-complex1  0.383  0.67  0.56  0.000  0.85
5     MA0082  SQUA            0.385  0.61  0.62  0.000  0.88

Rank  Acc     TF              Err    Sens  Spec  p-val  FD
1     MA0123  ABI4            0.435  0.32  0.81  0.000  0.92
2     MA0003  TFAP2A          0.443  0.50  0.61  0.000  0.95
3     MA0048  NHLH1           0.445  0.62  0.49  0.000  0.84
4     MA0117  MafB            0.448  0.45  0.66  0.000  0.93
5     MA0028  ELK1            0.449  0.27  0.83  0.000  0.91


slide-122
SLIDE 122

Analysis of transcription factor localization data Identifying enriched known motifs

Selecting foreground and background – E2F

  • Sample 500 sequences from non-positive promoters


slide-123
SLIDE 123

Analysis of transcription factor localization data Identifying enriched known motifs

Running motifclass

  • -r: use relative error as enrichment measure
  • -O: find the score cutoff optimizing that enrichment
  • -P 1000: report a p-value for each motif using 1000 shuffles
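
A shuffle-based p-value can be sketched as below. The slide does not say what is shuffled, so permuting the foreground and background labels of the sequences is an assumption, as is the balanced-error enrichment measure; the function names are hypothetical and motifclass is not called.

```python
import random

def min_error(fg, bg):
    """Lowest balanced error over all score cutoffs (assumed measure)."""
    best = 1.0
    for cutoff in set(fg) | set(bg):
        sens = sum(s >= cutoff for s in fg) / len(fg)
        spec = sum(s < cutoff for s in bg) / len(bg)
        best = min(best, 1.0 - (sens + spec) / 2.0)
    return best

def shuffle_p_value(fg, bg, n_shuffles=1000, seed=0):
    """Fraction of label shuffles whose optimized error is at least as
    low as the error observed with the true labels."""
    rng = random.Random(seed)
    observed = min_error(fg, bg)
    pooled = list(fg) + list(bg)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_fg, shuffled_bg = pooled[:len(fg)], pooled[len(fg):]
        if min_error(shuffled_fg, shuffled_bg) <= observed + 1e-12:
            hits += 1
    return hits / n_shuffles

fg = [float(x) for x in range(10, 20)]  # toy scores, well separated from bg
bg = [float(x) for x in range(0, 10)]
p = shuffle_p_value(fg, bg, n_shuffles=200)
```

Re-optimizing the cutoff inside every shuffle matters: comparing against a fixed cutoff would understate how often chance alone yields a low error.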


slide-124
SLIDE 124

Analysis of transcription factor localization data Identifying enriched known motifs

E2F enrichment

Logo column: images not reproduced.

Rank  Acc     TF       Err    Sens  Spec  p-val  FD
1     MA0060  NF-Y     0.327  0.64  0.71  0.000  0.86
2     MA0024  E2F1     0.350  0.72  0.57  0.000  0.86
3     MA0080  SPI1     0.359  0.61  0.68  0.000  1.00
      MA0028  ELK1     0.364  0.58  0.69  0.000  0.93
      MA0076  ELK4     0.373  0.62  0.64  0.000  0.85
      MA0026  E74A     0.399  0.42  0.78  0.000  0.98
      MA0062  GABPA    0.412  0.64  0.54  0.000  0.85
4     MA0021  Dof3     0.375  0.45  0.80  0.000  1.00
      MA0020  Dof2     0.449  0.91  0.19  0.006  0.95
      MA0053  MNB1A    0.449  0.91  0.19  0.000  1.00
      MA0064  PBF      0.449  0.91  0.19  0.003  1.00
5     MA0123  ABI4     0.398  0.81  0.40  0.000  0.91
6     MA0018  CREB1    0.406  0.43  0.76  0.000  0.87
      MA0096  bZIP910  0.433  0.45  0.68  0.000  0.86
7     MA0034  GAMYB    0.420  0.59  0.57  0.000  0.90


slide-125
SLIDE 125

Analysis of transcription factor localization data Identifying enriched known motifs

Testing CpG-island influence

  • The positive set is highly CpG-enriched, so the analysis may be biased toward patterns common to special or merely active promoters
  • We compare foreground CpG-island promoters to background CpG-island promoters to eliminate this potential bias
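
Restricting both sets to CpG-island promoters needs some rule for calling a promoter CpG-rich. The slides do not give one, so the sketch below assumes the commonly used criteria of GC content of at least 0.5 and an observed/expected CpG ratio of at least 0.6; the function names and toy sequences are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_content, obs_exp

def is_cpg_rich(seq, min_gc=0.5, min_obs_exp=0.6):
    """Assumed thresholds; adjust to whatever island definition is in use."""
    gc_content, obs_exp = cpg_stats(seq)
    return gc_content >= min_gc and obs_exp >= min_obs_exp

def cpg_only(promoters):
    """Keep only the CpG-rich promoters of a set."""
    return [p for p in promoters if is_cpg_rich(p)]

foreground = ["CGCGCGGCGC" * 20, "ATATTTAACA" * 20]
background = ["GCGCCGCGTA" * 20, "TTTTAAAACA" * 20]
fg_cpg, bg_cpg = cpg_only(foreground), cpg_only(background)
```

Enrichment is then measured on `fg_cpg` against `bg_cpg`, so that CpG content alone cannot separate the two sets.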


slide-126
SLIDE 126

Analysis of transcription factor localization data Identifying enriched known motifs

E2F CpG-conditional enrichment

Logo column: images not reproduced.

Rank  Acc     TF       Err    Sens  Spec  p-val  FD
1     MA0060  NF-Y     0.321  0.64  0.71  0.000  0.86
2     MA0024  E2F1     0.344  0.73  0.58  0.000  0.86
3     MA0080  SPI1     0.347  0.63  0.68  0.000  1.00
      MA0028  ELK1     0.367  0.58  0.69  0.000  0.93
      MA0076  ELK4     0.373  0.54  0.72  0.000  0.86
      MA0062  GABPA    0.406  0.66  0.53  0.000  0.85
4     MA0021  Dof3     0.374  0.46  0.80  0.000  1.00
      MA0053  MNB1A    0.448  0.91  0.19  0.000  1.00
      MA0064  PBF      0.448  0.91  0.19  0.000  1.00
5     MA0123  ABI4     0.386  0.83  0.39  0.000  0.91
6     MA0018  CREB1    0.396  0.45  0.76  0.000  0.88
      MA0096  bZIP910  0.434  0.30  0.83  0.002  0.88
7     MA0034  GAMYB    0.407  0.62  0.57  0.000  0.90
      MA0100  Myb      0.424  0.57  0.58  0.000  0.91


slide-127
SLIDE 127

Analysis of transcription factor localization data Identifying co-factors


slide-128
SLIDE 128

Analysis of transcription factor localization data Identifying co-factors

Identifying co-factors

  • MA0060 (NF-Y) and MA0024 (E2F1) are the best localization predictors
  • To identify possible co-factors we
      • identify putative sites for the two motifs
      • get flanking regions to search for co-factor sites
      • identify enriched motifs in flanking regions
      • search only in CpG-island promoters to eliminate bias
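
The first two steps above, locating putative sites and cutting out their flanks, can be sketched as follows. The toy matrix, the log-odds scoring against a uniform background, the cutoff and the flank length are all assumptions for illustration, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

def toy_matrix(consensus, p=0.85):
    """Hypothetical matrix: probability p for the consensus base per column."""
    rest = (1.0 - p) / 3
    return [{b: (p if b == c else rest) for b in BASES} for c in consensus]

def log_odds(matrix, background=0.25):
    """Convert column probabilities to log2 odds against a uniform background."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in matrix]

def putative_sites(seq, scores, cutoff):
    """Start positions whose window score reaches the cutoff."""
    width = len(scores)
    hits = []
    for i in range(len(seq) - width + 1):
        total = sum(scores[j][seq[i + j]] for j in range(width))
        if total >= cutoff:
            hits.append(i)
    return hits

def flanking_windows(seq, hits, width, flank):
    """Left and right flanks of each site, clipped at the sequence ends."""
    return [(seq[max(0, i - flank):i], seq[i + width:i + width + flank])
            for i in hits]

seq = "ACGATTCGGTCA"
scores = log_odds(toy_matrix("TTCG"))
hits = putative_sites(seq, scores, cutoff=6.0)
windows = flanking_windows(seq, hits, width=4, flank=3)
```

The flanks collected from foreground promoters and from background promoters would then go through the same enrichment test used for known motifs.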


slide-129
SLIDE 129

Analysis of transcription factor localization data Identifying co-factors

Evaluating putative co-factors


slide-130
SLIDE 130

Analysis of transcription factor localization data Identifying co-factors

Enrichment in proximity to MA0024 sites

Logo column: images not reproduced.

Rank  Acc     TF              Err    Sens  Spec  p-val  FD
1     MA0060  NF-Y            0.390  0.37  0.85  0.000  0.82
2     MA0037  GATA3           0.417  0.65  0.52  0.000  0.90
      MA0036  GATA2           0.433  0.61  0.53  0.002  0.95
      MA0070  Pbx             0.437  0.70  0.42  0.016  0.70
      MA0094  Ubx             0.438  0.81  0.31  0.002  0.83
3     MA0011  Broad-complex2  0.419  0.89  0.27  0.000  0.73
      MA0082  SQUA            0.435  0.81  0.32  0.011  0.74
4     MA0110  ATHB5           0.421  0.70  0.45  0.000  0.70
      MA0075  Prrx2           0.428  0.77  0.38  0.000  0.79
      MA0008  Athb-1          0.430  0.91  0.23  0.001  0.67
5     MA0096  bZIP910         0.423  0.41  0.75  0.000  0.85
      MA0018  CREB1           0.428  0.72  0.43  0.001  0.78


slide-131
SLIDE 131

Analysis of transcription factor localization data Discovering motifs de novo


slide-132
SLIDE 132

Analysis of transcription factor localization data Discovering motifs de novo

Running DME

Overview

  • DME enumerates through a set of matrices to identify those with the greatest number of potential sites in the foreground relative to the background
  • DME restricts the type of matrices it evaluates
      • it evaluates matrices with width specified using -w
      • it evaluates only those matrices that have a minimum average information per column specified using -i
      • the number of matrices it reports is set using -n
      • it evaluates matrices corresponding to degenerate words with the level of degeneracy optionally specified using -g
      • it uses a 2-iteration scheme, refining discovered motifs to a higher degeneracy optionally specified using -r
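
The link between degenerate words and information per column can be made concrete with a short sketch. It does not reproduce DME: measuring information in bits against a uniform background, mapping each IUPAC code to equal probabilities, and the 1.6-bit example threshold are assumptions, and the helper names are hypothetical.

```python
import math

# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def word_to_matrix(word):
    """One column per letter, equal probability for each allowed base."""
    return [[1.0 / len(IUPAC[ch]) if b in IUPAC[ch] else 0.0 for b in "ACGT"]
            for ch in word]

def column_information(col):
    """Information content in bits relative to a uniform background."""
    return 2.0 + sum(p * math.log2(p) for p in col if p > 0)

def average_information(matrix):
    """Mean information per column, the quantity a minimum would apply to."""
    return sum(column_information(c) for c in matrix) / len(matrix)

matrix = word_to_matrix("TTTSGCGC")  # S allows C or G
avg_bits = average_information(matrix)
passes = avg_bits >= 1.6             # illustrative minimum, not a DME default
```

A fully specified base contributes 2 bits, a two-base code 1 bit and N contributes nothing, so raising the minimum forces the search toward less degenerate words.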


slide-133
SLIDE 133

Analysis of transcription factor localization data Discovering motifs de novo

Running DME


slide-134
SLIDE 134

Analysis of transcription factor localization data Discovering motifs de novo

Evaluating motif enrichment


slide-135
SLIDE 135

Analysis of transcription factor localization data Discovering motifs de novo

Motif enrichment

Logo column: images not reproduced.

Rank  Acc             Err    Sens  Spec  FD
1     DME-10-1.60-6   0.372  0.42  0.83  0.90
2     DME-10-1.60-10  0.373  0.42  0.83  0.98
3     DME-10-1.60-11  0.375  0.49  0.76  0.97
4     DME-10-1.60-28  0.422  0.60  0.56  0.90
5     DME-10-1.60-26  0.423  0.54  0.61  0.90
6     DME-10-1.60-39  0.423  0.42  0.73  0.90
7     DME-10-1.60-27  0.426  0.38  0.77  0.98
8     DME-10-1.60-23  0.427  0.54  0.61  0.90
9     DME-10-1.60-35  0.427  0.32  0.83  0.96
10    DME-10-1.60-13  0.429  0.36  0.78  0.97


slide-136
SLIDE 136

Analysis of transcription factor localization data Discovering motifs de novo

Summary – gene-module and ChIP-chip examples

The good news

  • ChIP-chip data can be used to describe binding affinity of sequence-specific transcription factors
  • Good tools exist to discover and evaluate motifs for their ability to predict expression and binding
  • Some tools exist for identifying co-factor binding affinity

Careful analysis is paramount

  • Select negative control carefully
  • Try to make certain that you are detecting DNA patterns associated with the phenomena under investigation
  • Reverse engineering regulatory circuits using sequence analysis is always a detective story: tooling is important, but experience shows that each case is special and requires specialized analysis
