Statistical analysis applied to genome and proteome analyses - PowerPoint PPT Presentation

Statistical analysis applied to genome and proteome analyses (lecture 5) Warning: These notes serve only as a scaffold and should be completed with notes taken during the class Motif identification with position dependent weight matrices Estimating background models and score distributions felix.naef@epfl.ch

Stripe 2 enhancer

� �� Transcription Factors involved from Segal, nature 2008 2 5 10 0.2 25 1 0 1 Conc. 20 2 0 8 8 0.008 Hunchback 15 1 5 0.1 6 6 Expr. 10 1 0 4 4 5 -0.05 0.0 0 5 2 2 Coop. -11 -7 -3 1 5 9 13 19 0 0 0 0 6.22 0-50 400-450 800-850 100 75 50 25 0 3 5 3 1 0.2 1.0 Conc. 4 0.01 2 2 3 0.1 0.5 Expr. Giant 2 -1.98 1 1 0.0 0.0 1 Coop. -11 -7 -3 1 5 9 13 17 0 0 0 0 6.68 0-50 400-450 800-850 100 75 50 25 0 20 5 20 1 0.3 3.5 Conc. 3.0 4 0.004 15 15 0.2 2.5 Kruppel 3 Expr. 2.0 10 10 0.1 2 -0.22 1.5 5 5 0.0 1.0 1 Coop. -11 -7 -3 1 5 9 13 17 0 0 0 0 4.45 100 75 50 25 0 0-50 400-450 800-850 10 10 1 0.4 1.5 1 5 Conc. 8 8 0.3 1 2 0.09 1.0 0.2 9 6 6 Knirps Expr. 0.5 0.1 6 4 4 -1.15 0.0 0.0 3 2 2 Coop. -11 -7 -3 1 5 9 13 17 0 0 0 0 1.03 100 75 50 25 0 0-50 400-450 800-850

Goals of this lecture • Get a flavor about modeling cis-regulatory modules from the point of view of protein-DNA interactions 1. bioinformatics method for modeling enhancers 2. statistical assessment of likelihood scores for single TFs 3. discuss some recent examples

Main questions 1. Given the consensus for the TF involved, what are the most favorable configurations for a given set of concentrations 2. How is the above configuration translated into gene expression, in other words how does the transcription apparatus read the configuration. 3. Ultimate aim: can one predict expression from sequence, once the TF activities are given (measured)

Graphically cell type 1 cell type 2

Things that matter because of mass action 1. Affinity (binding energy, how many H-bonds are formed) 2. concentrations (local) of the TFs 3. competition (steric hindrance) 4. cooperativity • all these can be taken into accounts in HMM using models (cf. the latest Segal paper)

Outline of the class 1. Physical and statistical models for Protein DNA- interactions and cis-regulatory modules • one and multi- component systems for one site • HMMs 2. Case studies: • Circadian enhancers • “Predicting expression patterns from regulatory sequence in Drosophila segmentation”, Eran Segal et al., Nature 2008 3. Exercises

Statistical and physical models for Protein DNA- interactions 1. Weight matrices and likelihoods - W is distribution over the space of sequences of length L L - � P ( S | W ) = w j ( S j ) - j =1 simplest model: independent positions - interpretation: P(S|W) is the likelihood that the sequence S was drawn from the distribution W - c P(S|W) is proportional to the weight of the state where the site S is occupied 2. Binding energy (e.g. measured in kcal/mol) - � E ( S ) = � j ( S j ) j - simplest model: additive energies (~independence) - is proportional to the weight of the state where e − ( E ( S ) − µ ) /kT = e µ/kT e − E ( S ) /kT the site S is occupied

Getting a sense of the numbers • 1 good H-bond contributes 1 kcal/mol • RT=0.6 kcal/mol • exp(1/0.6)~5.3 • thus one H-bond contributes a factor ~5 in affinity (K d )

Site occupancies and the law of mass action • Consider a single TF - 2 state model • chemical kinetics F + D ∅ ↔ D F F K d = k − 1 D F = F + K d k 1 • statistical mechanics 2 state 0,1 c 1 c e − β ( E 1 − E 0 ) c 1 ˆ occupancy of state 1: � n 1 � = c e − β ( E 1 − E 0 ) = 1 + c 1 K d + c 1 ˆ 1 1 c P ( S | W ) c 1 • probabilities ˆ � n 1 � = P ( S | B ) + c 1 c P ( S | W ) ˆ

Site occupancy and the law of mass action • N state model: N TFs compete for the same genomic site - generalize the above forms, the factor that wins is the one with the largest product cK, if some of these products are similar the site will be shared among different factors K i c i 1 � n i � = K = 1 + � N K d i =1 K i c i - one H-bond less can be compensated by a factor 5 in concentration

Long sequences with multiple sites • so far only room for one site • now: tile the sequences and compute observable by weighted averaging over each possible state ( � p ) n, � • n i is the color and p i the position of the site (sites cannot overlap!)

Hidden Markov Models (HMMs) • Efficient methodology for computing the sums over all possible configurations (cf. book by R. Durbin et al.) − β � Z e − β � K 1 q ( E 0 ( S q )) , i =1 ( E nk ( S pk ) − µ nk,pk ) e P ( S, � n ) = − β �� K q E 0 ( S q ) � k =1 E nk ( S pk ) − � e β � 1 k µ nk,pk = Z e , � �� P ( S | � n ) P ( � n ) � Z ( S ) = P ( S, � n ) � � n - Is the partition function, its logarithm is the free energy or the Forward score in HMM parlance - usually one compute the Forward score for a set of matrices minus the Forward score for pure background

Background models for likelihood scores (binding energies) 1. Empirical models (brute force) 2. Gaussian approximation - is most accurate for long weight and weakly polarized weight matrices - uses the Central Limit Theorem (CLT) 3. Saddle-point approximation - is a size correction to the former, can thus handle shorter matrices - uses inverse Laplace transforms

Goal (taken from Djorjevic, Gen Res 2003) • • Occupancy vs. likelihood distribution of -log-likelihood

2 Background models The purpose of this section is to introduce an analytical method for computing the expected distribution of likelihood scores from a PSSM model. This follows the desription in Djorjevic et al . To evaluate the statistical significance of a candidate binding site in a genomic sequence, i.e. a position where the log-likelihood E ( S ) = � � α ǫ j,S j is high, we would like to compute the expected distribution j of scores for random sequences described by a background model with properties that resemble the real sequence. For the sake of simplicity, we will consider a zero-th order background model with frequencies b α for letter α and � α b α = 1. 2.1 Gaussian model The log-likelihood E is a sum of independent random variables ǫ j . In the large L limit, i.e. when the model has many sites, we can apply the central limit theorem (CLT). Namely we know in that limit that the distribution of E will be a Gaussian with mean ¯ E = � j ¯ ǫ j , where ¯ ǫ j = � α b α ǫ j, α , and with variance χ 2 = � ǫ j ) 2 b α . j χ 2 j = � � α ( ǫ j, α − ¯ j

� � � 2.2 Saddle-point approximation One can do better by invoking the following trick. The background distribution ρ ( E ) can be thought of as a density of states in the space of possible sequences and thus be written as: � i ∞ + γ � i ∞ + γ 1 1 � � e β ( E − E ( s )) d β = e β E Z ( β ) d β , ρ ( E ) = P ( S ) δ ( E − E ( S )) = P ( S ) (10) 2 π i 2 π i − i ∞ + γ − i ∞ + γ S S where the first equation use a representation of the δ − function as an inverse Laplace transform, and the S P ( S ) e − β E ( S ) . The last integral can be approximated by second introduces the partition function Z ( β ) = � identifying the saddle point in the integrand: β ∗ = min β real e β E +ln Z ( β ) , and taking the complex integration through the saddle ( γ = β ∗ ). At this point ζ ( β ∗ + ib ) = ζ ( β ∗ ) + 1 d β 2 ζ ( β ∗ ) b 2 + O ( b 3 ), where ζ = ln Z and d 2 2 therefore we find after that 2 ln(2 π d 2 ln ρ ( E ) ≈ ln Z ( β ∗ ) + β ∗ E − 1 d β 2 ln Z ( β ∗ )) . (11) The extremum condition leads to an implicit equation for β ∗ : E = − d d β Z ( β ∗ ) . (12) In an order zero model the partition function factorizes and so it is straight forward to find β ∗ numerically, and subsequently the second derivative. Concrete example will be shown during the exercises

Case studies 1. Circadian Ebox signals (Paquet et al., 2008) 2. cis-regulatory code in the segment polarity network in Drosophila (Segal et al., 2008)

Case study I Modeling an evolutionary conserved circadian cis-element Eric Paquet, Guillaume Rey and Felix Naef (PLoS CB, 2008) E1 converged ( p 1 =2 -11 , p 2 =2 -4 ) E2 converged

Circadian rhythms and oscillators – period ~24 hrs (+/- 1 hrs) – found all over the tree of life – central and peripheral clocks – limit cycle oscillators Circadian locomotor assay from Konopka and Benzer 1971

Circadian running-wheel assay for mouse Environment → ??? → SCN pacemaker → ??? → behavior → ??? → → ??? → phase locked free running

Molecular rhythms Circadian profiles in fly heads (Claridge-Chang et al. 2001, Wijnen 2006) 172 Drosophila circadian genes φ = 0 0 24 24 phase [hr] no enrichment for canonical E-boxes among cyclers

Recurrent themes in oscillator designs – inter-locked (delayed) feedback loops – post-translational modifications and protein-proteins interactions – coupling with metabolism Understand the green arrows Which genes are direct CLK targets ? from Schibler and Naef, ‘05

Statistical analysis applied to genome and proteome analyses - PowerPoint PPT Presentation

Statistical analysis applied to genome and proteome analyses (lecture 5) Warning: These notes serve only as a scaffold and should be completed with notes taken during the class Motif identification with position dependent weight matrices

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics & Computational

Genome Sequencing & Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Human Kidney Glomerulus Proteome and Proposition of a Method for Native Protein Profiling

Presented at the 16 th Human Proteome Organization World Congress Dublin Ireland 17-21 September

An Introduction to Analysis of Multiple Gene Expression Datasets Pratyaksha Wirapati Statistical

Self Study: Yeast Genome Comparison SESSION 4 MARTIN KRZYWINSKI Genome Sciences Centre BC

Genome Annotation The steps in genome sequencing Generate genome sequence Assembly ORF

Visualizing ENCODE Data in the UCSC Genome Browser Pauline Fujita, Ph.D. UCSC Genome Bioinformatics

The Mouse Genome The Mouse Genome Database (MGD) Database (MGD) Eppig J.T., et al. (2005). The

Genome 562 February 2015 Week 6 Genome 562 p.1/13 Julian Huxley (1887-1975) Oxford

Genome 562 January 2015 Week 1 Genome 562 p.1/6 Early workers in theoretical population

Genome assembly Mark Stenglein, Todos Santos 2018 Genome assembly is the process of attempting to

Motivations Pool genome-wide expression measurements from many experiments! Why to study a large

Current Topics in Genome Analysis Fall 2006 Week 4: Mining Genomic Sequence Data Tyra G.

Genetic variation: SNPs ATTGCAATCCGTGG...ATCGAGCCATACG ATTGCACGCCG Basics What is

Integrative modeling of ! General aspects of docking ! Information-driven docking with HADDOCK

Stratified Markov Chain Monte Carlo Brian Van Koten University of Massachusetts, Amherst

Principal Process Analysis of biological models Stefano Casagranda, Delphine Ropers, Jean-Luc

Light and Sound as serious pollutants Jan Hollan CzechGlobe Global Change Research

Click To Edit Master Title Style June 10, 2015 1:00 PM - 2:00 PM EDT 1-877-309-2074 Access

cer 2012 Can Cancer How might power frequency magnetic fields cause increased risk of childhood

Transitjon in subcritjcal shear fmows Invariant solutjons and the edge of chaos. Ashley P.

Janice Y.K. Lee and Joseph Bae www.yaddo.org By the numbers MORE THAN 7 , 500 ARTISTS HAVE WORKED

Statistical analysis applied to genome and proteome analyses - PowerPoint PPT Presentation

Statistical analysis applied to genome and proteome analyses (lecture 5) Warning: These notes serve only as a scaffold and should be completed with notes taken during the class Motif identification with position dependent weight matrices

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics &amp; Computational

Genome Sequencing &amp; Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference

Genome Reassembly From Fragments 7 January 2019 OSU CSE 1 Genome A genome is the encoding

Human Kidney Glomerulus Proteome and Proposition of a Method for Native Protein Profiling

Presented at the 16 th Human Proteome Organization World Congress Dublin Ireland 17-21 September

An Introduction to Analysis of Multiple Gene Expression Datasets Pratyaksha Wirapati Statistical

Self Study: Yeast Genome Comparison SESSION 4 MARTIN KRZYWINSKI Genome Sciences Centre BC

Genome Annotation The steps in genome sequencing Generate genome sequence Assembly ORF

Visualizing ENCODE Data in the UCSC Genome Browser Pauline Fujita, Ph.D. UCSC Genome Bioinformatics

The Mouse Genome The Mouse Genome Database (MGD) Database (MGD) Eppig J.T., et al. (2005). The

Genome 562 February 2015 Week 6 Genome 562 p.1/13 Julian Huxley (1887-1975) Oxford

Genome 562 January 2015 Week 1 Genome 562 p.1/6 Early workers in theoretical population

Genome assembly Mark Stenglein, Todos Santos 2018 Genome assembly is the process of attempting to

Motivations Pool genome-wide expression measurements from many experiments! Why to study a large

Current Topics in Genome Analysis Fall 2006 Week 4: Mining Genomic Sequence Data Tyra G.

Genetic variation: SNPs ATTGCAATCCGTGG...ATCGAGCCATACG ATTGCACGCCG Basics What is

Integrative modeling of ! General aspects of docking ! Information-driven docking with HADDOCK

Stratified Markov Chain Monte Carlo Brian Van Koten University of Massachusetts, Amherst

Principal Process Analysis of biological models Stefano Casagranda, Delphine Ropers, Jean-Luc

Light and Sound as serious pollutants Jan Hollan CzechGlobe Global Change Research

Click To Edit Master Title Style June 10, 2015 1:00 PM - 2:00 PM EDT 1-877-309-2074 Access

cer 2012 Can Cancer How might power frequency magnetic fields cause increased risk of childhood

Transitjon in subcritjcal shear fmows Invariant solutjons and the edge of chaos. Ashley P.

Janice Y.K. Lee and Joseph Bae www.yaddo.org By the numbers MORE THAN 7 , 500 ARTISTS HAVE WORKED

Whole Genome Analysis and Annotation Adam Siepel Biological Statistics & Computational

Genome Sequencing & Analysis Core Resource Olivier Fedrigo Friday, October 19, 12 Reference