Statistical analysis applied to genome and proteome analyses - - PowerPoint PPT Presentation

statistical analysis applied to genome and proteome
SMART_READER_LITE
LIVE PREVIEW

Statistical analysis applied to genome and proteome analyses - - PowerPoint PPT Presentation

Statistical analysis applied to genome and proteome analyses (lecture 5) Warning: These notes serve only as a scaffold and should be completed with notes taken during the class Motif identification with position dependent weight matrices


slide-1
SLIDE 1

Statistical analysis applied to genome and proteome analyses (lecture 5)

Motif identification with position dependent weight matrices Estimating background models and score distributions

felix.naef@epfl.ch

Warning: These notes serve only as a scaffold and should be completed with notes taken during the class

slide-2
SLIDE 2

Stripe 2 enhancer

slide-3
SLIDE 3

Transcription Factors involved

from Segal, nature 2008

1 2 3 4 5 5 1 1 5 2 2 5 3 6 9 1 2 1 5 1 2 3 4 5 0.0 0.1 0.2 0.3 0.4

  • 11 -7
  • 3

1 5 9 13 17 0.0 0.5 1.0 1.5 0.0 0.1 0.2 0.3

  • 11 -7
  • 3

1 5 9 13 17 1.0 1.5 2.0 2.5 3.0 3.5 0.0 0.1 0.2

  • 11 -7
  • 3

1 5 9 13 17 0.0 0.5 1.0 0.0 0.1 0.2

  • 11 -7
  • 3

1 5 9 13 19 5 10 15 20 25

Hunchback

Coop. 6.22 Expr.

  • 0.05

Conc. 0.008

Giant

Coop. 6.68 Expr.

  • 1.98

Conc. 0.01

Kruppel

Coop. 4.45 Expr.

  • 0.22

Conc. 0.004

Knirps

Coop. 1.03 Expr.

  • 1.15

Conc. 0.09

1 1 1 1

2 4 6 8 10 5 10 15 20 1 2 3 2 4 6 8 10

3 2 1 10 6 2 8 4 20 15 10 5 10 6 2 8 4 25 50 75 100 25 50 75 100 25 50 75 100 25 50 75 100 0-50 400-450 800-850 0-50 400-450 800-850 0-50 400-450 800-850 0-50 400-450 800-850

slide-4
SLIDE 4

Goals of this lecture

  • Get a flavor about modeling cis-regulatory modules

from the point of view of protein-DNA interactions

  • 1. bioinformatics method for modeling enhancers
  • 2. statistical assessment of likelihood scores for single

TFs

  • 3. discuss some recent examples
slide-5
SLIDE 5

Main questions

  • 1. Given the consensus for the TF involved, what are

the most favorable configurations for a given set of concentrations

  • 2. How is the above configuration translated into gene

expression, in other words how does the transcription apparatus read the configuration.

  • 3. Ultimate aim: can one predict expression from

sequence, once the TF activities are given (measured)

slide-6
SLIDE 6

Graphically

cell type 1 cell type 2

slide-7
SLIDE 7
  • 1. Affinity (binding energy, how many H-bonds are

formed)

  • 2. concentrations (local) of the TFs
  • 3. competition (steric hindrance)
  • 4. cooperativity
  • all these can be taken into accounts in HMM using

models (cf. the latest Segal paper)

Things that matter because of mass action

slide-8
SLIDE 8

Outline of the class

  • 1. Physical and statistical models for Protein DNA-

interactions and cis-regulatory modules

  • one and multi- component systems for one site
  • HMMs
  • 2. Case studies:
  • Circadian enhancers
  • “Predicting expression patterns from regulatory

sequence in Drosophila segmentation”, Eran Segal et al., Nature 2008

  • 3. Exercises
slide-9
SLIDE 9

Statistical and physical models for Protein DNA- interactions

  • 1. Weight matrices and likelihoods
  • W is distribution over the space of sequences of length L
  • simplest model: independent positions
  • interpretation: P(S|W) is the likelihood that the sequence S was drawn from the

distribution W

  • c P(S|W) is proportional to the weight of the state where the site S is occupied
  • 2. Binding energy (e.g. measured in kcal/mol)
  • simplest model: additive energies (~independence)
  • is proportional to the weight of the state where

the site S is occupied

E(S) =

  • j

j(Sj) e−(E(S)−µ)/kT = eµ/kT e−E(S)/kT

P(S|W) =

L

  • j=1

wj(Sj)

slide-10
SLIDE 10

Getting a sense of the numbers

  • 1 good H-bond contributes 1 kcal/mol
  • RT=0.6 kcal/mol
  • exp(1/0.6)~5.3
  • thus one H-bond contributes a factor ~5 in affinity (Kd)
slide-11
SLIDE 11

Site occupancies and the law of mass action

  • Consider a single TF
  • 2 state model
  • chemical kinetics
  • statistical mechanics

2 state 0,1

  • ccupancy of state 1:
  • probabilities

F + D∅ ↔ DF

DF = F F + Kd Kd = k−1 k1

n1 =

c1 ˆ c e−β(E1−E0)

1 + c1

ˆ c e−β(E1−E0) =

c1 Kd + c1 1 1 n1 =

c1 ˆ c P(S|W)

P(S|B) + c1

ˆ c P(S|W)

slide-12
SLIDE 12

Site occupancy and the law of mass action

  • N state model: N TFs compete for the same genomic site
  • generalize the above forms, the factor that wins is the one with the largest

product cK, if some of these products are similar the site will be shared among different factors

  • ne H-bond less can be compensated by a factor 5 in concentration

ni = Ki ci 1 + N

i=1 Ki ci

K = 1 Kd

slide-13
SLIDE 13

Long sequences with multiple sites

  • so far
  • nly room for
  • ne site
  • now: tile the

sequences and compute observable by weighted averaging over each possible state

  • ni is the color and pi the position of the site (sites cannot overlap!)

( n, p)

slide-14
SLIDE 14

Hidden Markov Models (HMMs)

  • Efficient methodology for computing the sums over

all possible configurations (cf. book by R. Durbin et al.)

P(S, n) = 1 Z e−β K

i=1 (Enk (Spk )−µnk,pk )e

−β

q (E0(Sq)) ,

= 1 Z e

−βK

k=1 Enk (Spk )− q E0(Sq)

  • P (S|

n)

k µnk,pk

  • P (

n)

,

Z(S) =

  • P(S,

n)

  • Is the partition function, its logarithm is the free energy or

the Forward score in HMM parlance

  • usually one compute the Forward score for a set of

matrices minus the Forward score for pure background

  • n
slide-15
SLIDE 15

Background models for likelihood scores (binding energies)

  • 1. Empirical models (brute force)
  • 2. Gaussian approximation
  • is most accurate for long weight and weakly

polarized weight matrices

  • uses the Central Limit Theorem (CLT)
  • 3. Saddle-point approximation
  • is a size correction to the former, can thus handle

shorter matrices

  • uses inverse Laplace transforms
slide-16
SLIDE 16

Goal (taken from Djorjevic, Gen Res 2003)

  • Occupancy vs. likelihood
  • distribution of -log-likelihood
slide-17
SLIDE 17

2 Background models

The purpose of this section is to introduce an analytical method for computing the expected distribution of likelihood scores from a PSSM model. This follows the desription in Djorjevic et al. To evaluate the statistical significance of a candidate binding site in a genomic sequence, i.e. a position where the log-likelihood E(S) =

j

  • α ǫj,Sj is high, we would like to compute the expected distribution
  • f scores for random sequences described by a background model with properties that resemble the real
  • sequence. For the sake of simplicity, we will consider a zero-th order background model with frequencies bα

for letter α and

α bα = 1.

2.1 Gaussian model

The log-likelihood E is a sum of independent random variables ǫj. In the large L limit, i.e. when the model has many sites, we can apply the central limit theorem (CLT). Namely we know in that limit that the distribution of E will be a Gaussian with mean ¯ E =

j ¯

ǫj, where ¯ ǫj =

α bαǫj,α, and with variance

χ2 =

j χ2 j = j

  • α(ǫj,α − ¯

ǫj)2 bα.

slide-18
SLIDE 18

Concrete example will be shown during the exercises

  • 2.2

Saddle-point approximation

One can do better by invoking the following trick. The background distribution ρ(E) can be thought of as a density of states in the space of possible sequences and thus be written as: ρ(E) =

  • S

P(S)δ(E − E(S)) = 1 2πi

  • S

P(S) i∞+γ

−i∞+γ

eβ(E−E(s))dβ = 1 2πi i∞+γ

−i∞+γ

eβEZ(β)dβ , (10) where the first equation use a representation of the δ−function as an inverse Laplace transform, and the second introduces the partition function Z(β) =

S P(S)e−βE(S). The last integral can be approximated by

identifying the saddle point in the integrand: β∗ = minβ real eβE+ln Z(β), and taking the complex integration through the saddle (γ = β∗). At this point ζ(β∗ + ib) = ζ(β∗) + 1

2 d2 dβ2 ζ(β∗)b2 + O(b3), where ζ = ln Z and

therefore we find after that ln ρ(E) ≈ ln Z(β∗) + β∗E − 1 2 ln(2π d2 dβ2 ln Z(β∗)) . (11) The extremum condition leads to an implicit equation for β∗: E = − d dβ Z(β∗) . (12) In an order zero model the partition function factorizes and so it is straight forward to find β∗ numerically, and subsequently the second derivative.

slide-19
SLIDE 19

Case studies

  • 1. Circadian Ebox signals (Paquet et al., 2008)
  • 2. cis-regulatory code in the segment polarity

network in Drosophila (Segal et al., 2008)

slide-20
SLIDE 20

Modeling an evolutionary conserved circadian cis-element

Eric Paquet, Guillaume Rey and Felix Naef (PLoS CB, 2008)

E1 converged (p1=2-11, p2=2-4) E2 converged

Case study I

slide-21
SLIDE 21

Circadian rhythms and oscillators

Circadian locomotor assay – period ~24 hrs (+/- 1 hrs) – found all over the tree of life – central and peripheral clocks – limit cycle oscillators from Konopka and Benzer 1971

slide-22
SLIDE 22

Circadian running-wheel assay for mouse

Environment → ??? → SCN pacemaker → ??? → behavior → ??? → → ??? → phase locked free running

slide-23
SLIDE 23

Molecular rhythms

Circadian profiles in fly heads (Claridge-Chang et al. 2001, Wijnen 2006)

172 Drosophila circadian genes

φ=0 24

phase [hr] 24 no enrichment for canonical E-boxes among cyclers

slide-24
SLIDE 24

– inter-locked (delayed) feedback loops – post-translational modifications and protein-proteins interactions – coupling with metabolism

Recurrent themes in oscillator designs

from Schibler and Naef, ‘05

Understand the green arrows Which genes are direct CLK targets ?

slide-25
SLIDE 25

Modeling CLK/CYC (CLOCK/Bmal1) target sites

  • Goal: infer binding specificity beyond E-boxes
  • start from period 69 bp enhancer by Hao et al. (1997)
slide-26
SLIDE 26
  • 3’ flanking bases of central E-boxes are

important (Lyons, 2001)

slide-27
SLIDE 27

A

E1 E2

Comparative genomics in Drosophila

per-RA variable spacer

slide-28
SLIDE 28

upstream (69 bp enhancer, ref. 9) upstream (E-box - TER1 box, ref. 11) upstream upstream 1rst intron

mel_cwo gri_cwo

Other five know CLK/CYC targets in flies have similar conserved sequences period, timeless, vrille, Pdp1, cwo (promoter bashing only for the first two)

tim-RA ref. 11

slide-29
SLIDE 29

E E B

5’ 3’ 5’ 3’

E

3’ 5’

E

3’ 5’

B B B

p1 1-p2 p1(1-p2) p1 p2/2 p2/2 E1 seed E2 seed

bits 2

E1 converged (p1=2-11, p2=2-4) E2 converged p2/2 p2/2 q q

Markov Model (HMM)

training set: all conserved CANNGT +/- 30 bases in +/- 2.5 kb around TSS (~600 nt per gene) in all 12 species

slide-30
SLIDE 30

Model validation: scanning the fly model onto the mouse

A B

  • here: sites with phastCons > 0.5
  • → fly model predicts known circadian mouse genes

cyclers (F24>0.1) score > 15 bits

slide-31
SLIDE 31

Per2

promoter

Known circadian enhancers in mouse

slide-32
SLIDE 32

Dbp intron 2

slide-33
SLIDE 33

Bandshift validations

weak strong

weak AAGGGATCCCTTAGCATATTCTACGCGGTCACGTTCGCCACGAGCTTGGACCTACATGGCATGGCGCGCC strong GGCGCGCCGCTACTTAGTCACTGCGGTCGCAAACGTGGTCAAAGGCGTGACTTAGGTGAAGGGATCCCTT

slide-34
SLIDE 34

Summary

  • We have identified a more specific circadian CLK/CYC

(CLOCK/BMAL1) element than commonly invoked E-box

  • Element involves partner sites (E1 & E2). Is this the

beginning of a combinatorial circadian cis-regulatory code?

  • Fly model predicts mouse genes
slide-35
SLIDE 35

ARTICLES

Predicting expression patterns from regulatory sequence in Drosophila segmentation

Eran Segal1*, Tali Raveh-Sadka1*, Mark Schroeder2, Ulrich Unnerstall2 & Ulrike Gaul2

Vol 451 |31 January 2008 |doi:10.1038/nature06496

Eve has seven stripes, each controlled by a different enhancer in the eve promoter

Text

Case study II

slide-36
SLIDE 36

The approach

CAD Target gene + + = 0.26 Module KNI 1w + 4w = –3 4w + 1w = -3 4w + 3w = 1

… … … …

…TAAATACTTGGCAA TACTCTAGGAATTTT CTAGACTTTACCTAC TGGCTAAATCGAAG… AP position 35 0.0018 0.0071 0.0012

t u p t u O u p n I t Th r e y d

  • m

g n i l l e d

  • m

c i m a n

CAD HB KNI

Binding preferences

0 %EL 100 80 60 40 20 (No HB binding) Transcript AP position 35

E x P(c) Configurations (c) Expression (E) P(c)

Binding site CAD w = 1 w = –1 200 400 600 bp AP position (% EL)

Configurations of factor binding to DNA and resulting expression Binding energy along module sequence, for a given AP position Module sequence Input factor distribution Fit parameters to maximize agreement between measured and predicted patterns Factor occupancy

  • ver all AP positions

Predicted module expression Measured module expression

0.002 x 0.0018 0.953 x 0.0071 0.269 x 0.0012

slide-37
SLIDE 37

The second layer: from promoter configuration to expression E

! ! " # $ $ % & ! ! " # $ $ % & ! " # $ % & ' ( ' ) ! " # $ % & ' )

* *

) ) k i i f k i i f

w w w w it log c E P

1 ) ( 1 ) (

exp 1 / 1 ) | ( ,

  • 1w + 4w = –3

!

"

#

C c

c E P c P E P ) | ( ) ( ) ( .

  • P(E) is the probability that the polymerase is bound
  • c ( ) is the configuration (not the concentration!)
  • n
slide-38
SLIDE 38

Inputs, parameters, outputs

  • Inputs
  • collection of enhancers
  • spatial TF factor concentrations
  • binding sites
  • Outputs
  • predict the spatial expression pattern of a new piece
  • f DNA
  • Parameters
  • cooperativities, the analog of (α’s in the Segal

paper), wf and eventually the weight matrices

ˆ c

slide-39
SLIDE 39

Scoring function for model optimization

fit experimental to predicted

! "

# #

$ $

% &

S s A a

a s O a s E P F

2

) , ( ) , | ( .

  • a Kr_CD2

gt_–6 kni_–5 hb_ant

  • c_+7

cnc_+5 hkb_ven K D h e

  • p

h

slide-40
SLIDE 40

Training and testing performance

a

Poor Fair Good Predicted Measured module expression Kr_CD2 gt_–6 kni_–5 hb_ant

  • c_+7

kni_+1

  • c_otd

tll_P2 gt_–10 gt_–1 cnc_+5 hkb_ven slp2_–3 knrl_+8 btd_hd Kr_CD1 D_+4 h_3_4 eve_37e run_3

  • dd_–5

prd_+4 h_15 AP position (%EL) AP position (%EL) AP position (%EL) cad_+14 h_6 eve_5

  • dd_–3

nub_–2 gt_–3 run_5 run_–9 kni_83 run_–17 eve_46

pdm2_+1

Kr_AD2 hb_cent eve_2 eve_1 run_1 ftz_+3 tll_K2 fkh_–2 h_7

b

kni_+1 slp2_–3 kni_–5 gt_–6 Kr_CD2 gt_–10 gt_–1 h_15 Kr_CD1 eve_37e nub_–2 kni_83 gt_–3 h_6 eve_1 gt_1 h_0 CG9571 mir7 prd_1 gt_23 tll_head slp_A D_body

bowl_col

slp_B

c

(52%) (43%) (38%) (41%) (50%) (44%) (44%) (53%) (57%) (51%) (50%) (41%) (41%) (61%) (66%)

Figure 2 | Predicted expression patterns and model validation. a–c, Comparison between measured module expression patterns (red) and those predicted by the model (blue) for all 44 modules used to fit the parameters (a), as well as for modules not used for parameter fitting (b, c): b, 11 recently identified anterior modules4 (note that gt_23, gt_1, prd_1 and D_body represent shorter delineations of our modules gt_210, gt_26, prd_14 and D_14, respectively); c, Fifteen modules from

  • D. pseudoobscura (S. Sinha et al., manuscript in

preparation). Sequence identity as determined by pairwise sequence alignment is indicated in parentheses; the orthologous D. melanogaster modules are marked by grey triangles in

  • a. Modules were subjectively classified into three

categories (good, fair, poor) on the basis of the quality of the match between measured and predicted pattern and the amount of spurious expression.

slide-41
SLIDE 41

Performance metrics using ROC curves

1 0.8 0.6 0.4 0.2

Random 0.5 0.6 0.7 Random 0.5 0.7 0.9 Random

Modules: 10 Fold Cross validation Comparison: Permuted matrices Modules: Pseudoobscura Comparison: Permuted matrices Modules: Ochoa-Espinoza et al. Comparison: Permuted matrices

0.5 0.7 0.9

True positive rate

0.88 Our model Permuted, 0.51±.04 Random, 0.58±.02 Swap, 0.55±.07 0.69 Our model Permuted 0.62±.01 Random, 0.54±.01 Swap, 0.54±.03 Our model 0.77 Permuted, 0.58±.01 Random, 0.53±.01 Swap 0.54±.01

  • False positive rate

False positive rate False positive rate

slide-42
SLIDE 42

Conclusion

As a further test, we conducted a standard tenfold cross validation assay, using an automated objective performance measure that scores the expression predictions at each antero-posterior position as ‘on’

  • r ‘off’, depending on whether they are above a certain threshold,

and iterates over all possible thresholds. The resulting sensitivity/ specificity plots reveal that our model performs much better than random expectation or models using various types of randomized weight matrices; similar results are obtained when applying this auto- mated performance assessment to the above two sets of held-out modules (Supplementary Fig. 5). Taken together, our validation tests provide strong evidence that the successful predictions are not the result of overfitting the input data, and thus suggest that our model indeed captures core principles governing pattern formation in the segmentation network.

  • It seems possible to predict expression patterns from the computational

modeling of cis-enhancers (by calibrating probabilistic models of protein- DNA interactions)

  • Modeling has not used comparative genomics information
  • How will this generalize to other systems (performance is still not
  • ptimal even for the best studied regulatory network in all biology...)