Outline CSE 527 Previously: Learning from data MLE: Max Likelihood - PowerPoint PPT Presentation

CSE 527, Autumn 2009
5 – Motifs: Representation & Discovery

Outline
• Previously: Learning from data
  - MLE: Max Likelihood Estimators
  - EM: Expectation Maximization (MLE w/ hidden data)
• These Slides:
  - Bio: Expression & regulation
    Expression: creation of gene products
    Regulation: when/where/how much of each gene product; complex and critical
  - Comp: using MLE/EM to find regulatory motifs in biological sequence data

Gene Expression & Regulation

Gene Expression
• Recall a gene is a DNA sequence for a protein
• To say a gene is expressed means that it
  - is transcribed from DNA to RNA
  - the mRNA is processed in various ways
  - is exported from the nucleus (eukaryotes)
  - is translated into protein
• A key point: not all genes are expressed all the time, in all cells, or at equal levels

RNA Transcription
• Some genes heavily transcribed (many are not)
[Figure source: Alberts, et al.]

Regulation
• In most cells, pro- or eukaryote, easily a 10,000-fold difference between least- and most-highly expressed genes
• Regulation happens at all steps. E.g., some genes are highly transcribed, some are not transcribed at all, some transcripts can be sequestered then released, or rapidly degraded, some are weakly translated, some are very actively translated, ...
• Below, focus on 1st step only: transcriptional regulation

E. coli growth on glucose + lactose
http://en.wikipedia.org/wiki/Lac_operon

1965 Nobel Prize, Physiology or Medicine
François Jacob, Jacques Monod, André Lwoff

Sea Urchin - Endo16

DNA Binding Proteins
A variety of DNA binding proteins (so-called “transcription factors”; a significant fraction, perhaps 5-10%, of all human proteins) modulate transcription of protein coding genes

The Double Helix
[Figure source: Los Alamos Science]

In the groove
Different patterns of potential H bonds at edges of different base pairs, accessible esp. in major groove

Helix-Turn-Helix DNA Binding Motif

H-T-H Dimers
• Bind 2 DNA patches, ~1 turn apart
• Increases both specificity and affinity

Zinc Finger Motif

Leucine Zipper Motif
Homo-/hetero-dimers and combinatorial control
[Figure source: Alberts, et al.]

MyoD
http://www.rcsb.org/pdb/explore/jmol.do?structureId=1MDY&bionumber=1

Some Protein/DNA interactions well-understood

But the overall DNA binding “code” still defies prediction
[Figure: CAP]

Summary
• Proteins can bind DNA to regulate gene expression (i.e., production of other proteins & themselves)
• This is widespread
• Complex combinatorial control is possible
• But it’s not the only way to do this...

DNA binding site summary
• Complex “code”
• Short patches (4-8 bp)
• Often near each other (1 turn = 10 bp)
• Often reverse-complements
• Not perfect matches

Sequence Motifs
• Motif: “a recurring salient thematic element”
• Last few slides described structural motifs in proteins
• Equally interesting are the DNA sequence motifs to which these proteins bind - e.g., one leucine zipper dimer might bind (with varying affinities) to dozens or hundreds of similar sequences

E. coli Promoters
• “TATA Box” ~10bp upstream of transcription start
• How to define it?
  - Consensus is TATAAT
  - BUT all differ from it
  - Allow k mismatches?
  - Equally weighted?
  - Wildcards like R,Y? ({A,G}, {C,T}, resp.)
• Example sites: TACGAT, TAAAAT, TATACT, GATAAT, TATGAT, TATGTT

E. coli Promoters
• “TATA Box” - consensus TATAAT, ~10bp upstream of transcription start
• Not exact: of 168 studied (mid 80’s)
  - nearly all had 2/3 of TAxyzT
  - 80-90% had all 3
  - 50% agreed in each of x,y,z
  - no perfect match
• Other common features at -35, etc.

TATA Box Frequencies
(percent of sites with each base at each position)

  base \ pos     1     2     3     4     5     6
      A          2    95    26    59    51     1
      C          9     2    14    13    20     3
      G         10     1    16    15    13     0
      T         79     3    44    13    17    96

TATA Scores
A “Weight Matrix Model” or “WMM”

  base \ pos     1     2     3     4     5     6
      A        -36    19     1    12    10   -46
      C        -15   -36    -8    -9    -3   -31
      G        -13   -46    -6    -7    -9   -46 (?)
      T         17   -31     8    -9    -6    19
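The TATA Scores entries appear to be consistent with 10 x log2(frequency / 25%), rounded to the nearest integer. The sketch below recomputes the score table from the frequency table under that reading. The uniform 25% background, the 10x scale, and the 1% floor for the zero entry (G at position 6, the cell marked "(?)") are assumptions inferred from the numbers, not something the slides state at this point.

```python
import math

# TATA box base frequencies (percent) per position, copied from the slide.
FREQ_PERCENT = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 13, 0],
    "T": [79, 3, 44, 13, 17, 96],
}

BACKGROUND_PERCENT = 25.0  # assumption: uniform background
FLOOR_PERCENT = 1.0        # assumption: a 0% entry is scored as if it were 1%
SCALE = 10                 # assumption: scores are 10 x log2 ratio


def wmm_scores(freq_percent):
    """Turn a percent-frequency table into rounded, scaled log2-ratio scores."""
    scores = {}
    for base, column in freq_percent.items():
        scores[base] = [
            round(SCALE * math.log2(max(f, FLOOR_PERCENT) / BACKGROUND_PERCENT))
            for f in column
        ]
    return scores


SCORES = wmm_scores(FREQ_PERCENT)
for base in "ACGT":
    print(base, SCORES[base])
```

Under these assumptions every entry of the printed table matches the TATA Scores slide, including the -46 in the "(?)" cell.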

Scanning for TATA
Score each length-6 window of the sequence by summing the matrix entries for its bases.

  base \ pos     1     2     3     4     5     6
      A        -36    19     1    12    10   -46
      C        -15   -36    -8    -9    -3   -31
      G        -13   -46    -6    -7    -9   -46
      T         17   -31     8    -9    -6    19

Sequence: A C T A T A A T C G
• window CTATAA: score = -90
• window TATAAT: score = 85
• window ATAATC: score = -91

Scanning for TATA
[Figure: WMM score (axis -150 to 100) at each position along A C T A T A A T C G A T C G A T G C T A G C A T G C G G A T A T G A T]

Stormo, Ann. Rev. Biophys. Biophys Chem, 17, 1988, 241-263

Score Distribution (Simulated)
[Figure: histogram of WMM scores; horizontal axis: score, -150 to 90; vertical axis: count, 0 to 3500]

TATA Scan at 2 genes
[Figure: two panels, LacI and LacZ; WMM score (-150 to 50) vs. position, -400 to +400 around AUG]
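The scan can be sketched in a few lines of Python. The score matrix and the 10-base sequence are taken from the slide; the function name `scan` and the 0-based window offsets are illustrative choices, not part of the original.

```python
# TATA weight matrix scores from the slide: SCORES[base][position].
SCORES = {
    "A": [-36, 19, 1, 12, 10, -46],
    "C": [-15, -36, -8, -9, -3, -31],
    "G": [-13, -46, -6, -7, -9, -46],
    "T": [17, -31, 8, -9, -6, 19],
}


def scan(sequence, scores):
    """Return the WMM score of every full-width window of `sequence`."""
    width = len(next(iter(scores.values())))
    return [
        sum(scores[base][i] for i, base in enumerate(sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


window_scores = scan("ACTATAATCG", SCORES)
best_offset = max(range(len(window_scores)), key=window_scores.__getitem__)
print(window_scores)  # one score per window offset
print(best_offset)    # offset of the highest-scoring window
```

The windows at offsets 1, 2 and 3 are the three shown on the slide (CTATAA, TATAAT, ATAATC); the consensus window TATAAT scores highest.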

Neyman-Pearson
Given a sample $x_1, x_2, \ldots, x_n$ from a distribution $f(\ldots \mid \Theta)$ with parameter $\Theta$, want to test hypothesis $\Theta = \theta_1$ vs $\Theta = \theta_2$.
Might as well look at likelihood ratio (or log likelihood ratio):

\[ \frac{f(x_1, x_2, \ldots, x_n \mid \theta_1)}{f(x_1, x_2, \ldots, x_n \mid \theta_2)} > \tau \]

Weight Matrices: Statistics
Assume:
  $f_{b,i}$ = frequency of base $b$ in position $i$ in TATA
  $f_b$ = frequency of base $b$ in all sequences
Log likelihood ratio, given $S = B_1 B_2 \ldots B_6$:

\[ \log \frac{P(S \mid \text{“promoter”})}{P(S \mid \text{“nonpromoter”})} = \log \frac{\prod_{i=1}^{6} f_{B_i,i}}{\prod_{i=1}^{6} f_{B_i}} = \sum_{i=1}^{6} \log \frac{f_{B_i,i}}{f_{B_i}} \]

Assumes independence

Score Distribution (Simulated)
[Figure: histogram of WMM scores; horizontal axis: score, -150 to 90; vertical axis: count, 0 to 3500]

What’s best WMM?
• Given, say, 168 sequences $s_1, s_2, \ldots, s_k$ of length 6, assumed to be generated at random according to a WMM defined by 6 x (4-1) parameters $\theta$, what’s the best $\theta$?
• E.g., what’s MLE for $\theta$ given data $s_1, s_2, \ldots, s_k$?
• Answer: like coin flips or dice rolls, count frequencies per position (see HW).
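A minimal sketch of the "count frequencies per position" answer follows. The function name `mle_wmm` is hypothetical, and the six input sequences are the example promoter variants listed on the E. coli Promoters slide, used here only as toy data (the 168 studied sites are not given in the slides).

```python
from collections import Counter


def mle_wmm(sequences, alphabet="ACGT"):
    """MLE of a WMM: per-position relative frequency of each base."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    wmm = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)  # missing bases count as 0
        wmm.append({b: counts[b] / n for b in alphabet})
    return wmm


# Toy data (assumption): the six example sites from the promoter slide.
SITES = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]
wmm = mle_wmm(SITES)
for i, column in enumerate(wmm, start=1):
    print(i, column)
```

Each column has four frequencies that sum to 1, so only three are free, which is where the 6 x (4-1) parameter count comes from.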

Weight Matrices: Chemistry
Experiments show ~80% correlation of log likelihood weight matrix scores to measured binding energy of RNA polymerase to variations on TATAAT consensus [Stormo & Fields]

Another WMM example
8 Sequences: ATG, ATG, ATG, ATG, ATG, GTG, GTG, TTG

  Freq.   Col 1   Col 2   Col 3
    A     0.625     0       0
    C       0       0       0
    G     0.250     0       1
    T     0.125     1       0

Log-Likelihood Ratio:

\[ \log_2 \frac{f_{x_i,i}}{f_{x_i}}, \qquad f_{x_i} = \frac{1}{4} \]

  LLR     Col 1   Col 2   Col 3
    A      1.32    -∞      -∞
    C       -∞     -∞      -∞
    G       0      -∞     2.00
    T     -1.00   2.00     -∞

Non-uniform Background
• E. coli - DNA approximately 25% A, C, G, T
• M. jannaschii - 68% A-T, 32% G-C
LLR from previous example, assuming $f_A = f_T = 3/8$, $f_C = f_G = 1/8$:

  LLR     Col 1   Col 2   Col 3
    A      0.74    -∞      -∞
    C       -∞     -∞      -∞
    G      1.00    -∞     3.00
    T     -1.58   1.42     -∞

e.g., G in col 3 is 8 x more likely via WMM than background, so ($\log_2$) score = 3 (bits).

Relative Entropy
AKA Kullback-Leibler Distance/Divergence, AKA Information Content
Given distributions $P$, $Q$:

\[ H(P \| Q) = \sum_{x \in \Omega} P(x) \log \frac{P(x)}{Q(x)} \ge 0 \]

Notes:
• Let $P(x) \log \frac{P(x)}{Q(x)} = 0$ if $P(x) = 0$ [since $\lim_{y \to 0} y \log y = 0$]
• Undefined if $0 = Q(x) < P(x)$
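The two LLR tables can be reproduced from the eight sequences with a short sketch. The helper names `column_frequencies` and `llr_table` are illustrative; the two background distributions are the ones stated on the slides.

```python
import math

# The 8 example sequences from the slide.
SITES = ["ATG"] * 5 + ["GTG"] * 2 + ["TTG"]

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
NON_UNIFORM = {"A": 3 / 8, "C": 1 / 8, "G": 1 / 8, "T": 3 / 8}


def column_frequencies(sites, alphabet="ACGT"):
    """Per-column relative frequency of each base."""
    n = len(sites)
    return [
        {b: sum(s[i] == b for s in sites) / n for b in alphabet}
        for i in range(len(sites[0]))
    ]


def llr_table(freqs, background):
    """log2(f_{b,i} / f_b) per column; -inf where the base never occurs."""
    return [
        {b: math.log2(col[b] / background[b]) if col[b] > 0 else -math.inf
         for b in background}
        for col in freqs
    ]


freqs = column_frequencies(SITES)
uniform_llr = llr_table(freqs, UNIFORM)
non_uniform_llr = llr_table(freqs, NON_UNIFORM)
for name, table in (("uniform", uniform_llr), ("non-uniform", non_uniform_llr)):
    print(name)
    for b in "ACGT":
        print(" ", b, [round(col[b], 2) for col in table])
```

Representing the impossible bases as `-math.inf` mirrors the -∞ entries in the tables; a scanner built on this would score any window containing such a base as -∞.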

WMM: How “Informative”? Mean score of site vs bkg?
For any fixed length sequence $x$, let
  $P(x)$ = Prob. of $x$ according to WMM
  $Q(x)$ = Prob. of $x$ according to background
Relative Entropy:

\[ H(P \| Q) = \sum_{x \in \Omega} P(x) \log_2 \frac{P(x)}{Q(x)} \]

$H(P \| Q)$ is expected log likelihood score of a sequence randomly chosen from WMM;
$-H(Q \| P)$ is expected score of Background
Expected score difference: $H(P \| Q) + H(Q \| P)$

WMM Scores vs Relative Entropy
[Figure: histogram of WMM scores (horizontal axis: score, -150 to 90; vertical axis: count, 0 to 3500), with $-H(Q \| P) = -6.8$ and $H(P \| Q) = 5.0$ marked]
On average, foreground model scores > background by 11.8 bits (score difference of 118 on 10x scale used in examples above).

For a WMM:

\[ H(P \| Q) = \sum_i H(P_i \| Q_i) \]

where $P_i$ and $Q_i$ are the WMM/background distributions for column $i$.
Proof: exercise
Hint: Use the assumption of independence between WMM columns

WMM Example, cont.

  Freq.   Col 1   Col 2   Col 3
    A     0.625     0       0
    C       0       0       0
    G     0.250     0       1
    T     0.125     1       0

Uniform
  LLR     Col 1   Col 2   Col 3
    A      1.32    -∞      -∞
    C       -∞     -∞      -∞
    G       0      -∞     2.00
    T     -1.00   2.00     -∞
  RelEnt   0.70   2.00    2.00    (total 4.70)

Non-uniform
  LLR     Col 1   Col 2   Col 3
    A      0.74    -∞      -∞
    C       -∞     -∞      -∞
    G      1.00    -∞     3.00
    T     -1.58   1.42     -∞
  RelEnt   0.51   1.42    3.00    (total 4.93)
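The RelEnt rows can be checked with a small sketch that applies the column-wise sum directly. The function names are illustrative; the frequency table and the two backgrounds are the ones from the example, and terms with $P(x) = 0$ are taken to be 0, following the convention in the notes on relative entropy.

```python
import math

# Column frequencies of the 3-column example WMM.
FREQ = [
    {"A": 0.625, "C": 0.0, "G": 0.25, "T": 0.125},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
NON_UNIFORM = {"A": 3 / 8, "C": 1 / 8, "G": 1 / 8, "T": 3 / 8}


def relative_entropy(p, q):
    """H(P||Q) in bits; terms with P(x) = 0 contribute 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        if q[x] == 0:
            raise ValueError("H(P||Q) is undefined when 0 = Q(x) < P(x)")
        total += px * math.log2(px / q[x])
    return total


def wmm_relative_entropy(columns, background):
    """Per-column relative entropies and their sum (columns independent)."""
    per_column = [relative_entropy(col, background) for col in columns]
    return per_column, sum(per_column)


uniform_cols, uniform_total = wmm_relative_entropy(FREQ, UNIFORM)
non_uniform_cols, non_uniform_total = wmm_relative_entropy(FREQ, NON_UNIFORM)
print([round(h, 2) for h in uniform_cols], round(uniform_total, 2))
print([round(h, 2) for h in non_uniform_cols], round(non_uniform_total, 2))
```

Rounded to two decimals, the per-column values and totals agree with the RelEnt rows of the example (4.70 bits for the uniform background, 4.93 bits for the non-uniform one).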
