

SLIDE 1

CSE P 590 A, Autumn 2008

Lecture 5 Motifs: Representation & Discovery

George Palade

  • Nov. 19, 1912 -- Oct 8, 2008
  • 1966 Albert Lasker Award for Basic Medical Research
  • 1974 Nobel Prize in Physiology or Medicine (with Albert Claude and Christian de Duve)
  • Identified the function of mitochondria, ribosomes, and cellular secretion

Outline

Last week: Learning from data

  • MLE: Maximum Likelihood Estimators
  • EM: Expectation Maximization (MLE w/ hidden data)

Expression & regulation

  • Expression: creation of gene products
  • Regulation: when/where/how much of each gene product; complex and critical

Next: using MLE/EM to find regulatory motifs in biological sequence data

Gene Expression & Regulation

SLIDE 2

Gene Expression

Recall that a gene is a DNA sequence coding for a protein. To say a gene is expressed means that

  • it is transcribed from DNA to RNA
  • the mRNA is processed in various ways
  • it is exported from the nucleus (eukaryotes)
  • it is translated into protein

A key point: not all genes are expressed all the time, in all cells, or at equal levels.

Alberts, et al.

RNA Transcription

Some genes heavily transcribed (many are not)

Regulation

In most cells, pro- or eukaryote, there is easily a 10,000-fold difference between the least- and most-highly expressed genes. Regulation happens at all steps: e.g., some transcripts can be sequestered then released, or rapidly degraded; some are weakly translated, some very actively translated; some genes are highly transcribed, some not transcribed at all. Below, we focus on the 1st step only: transcriptional regulation.

  • E. coli growth on glucose + lactose

http://en.wikipedia.org/wiki/Lac_operon

SLIDE 3

1965 Nobel Prize

François Jacob and Jacques Monod

DNA Binding Proteins

A variety of DNA binding proteins (“transcription factors”; a significant fraction, perhaps 5-10%, of all human proteins) modulate transcription of protein coding genes

The Double Helix

Los Alamos Science

SLIDE 4

In the groove

Different patterns of potential H bonds at the edges of different base pairs are accessible, especially in the major groove.

Helix-Turn-Helix DNA Binding Motif

H-T-H dimers bind 2 DNA patches, ~1 turn apart; this increases both specificity and affinity.

Zinc Finger Motif

SLIDE 5

Leucine Zipper Motif

Homo-/hetero-dimers and combinatorial control

Alberts, et al.

Some protein/DNA interactions are well understood, but the overall DNA-binding “code” still defies prediction.

CAP

SLIDE 6

Bacterial Met Repressor

SAM (Met derivative)

Negative feedback loop: high Met level ⇒ repress Met synthesis genes

(a beta-sheet DNA binding domain)


Summary

  • Proteins can bind DNA to regulate gene expression (i.e., production of other proteins & of themselves)
  • This is widespread
  • Complex combinatorial control is possible
  • But it's not the only way to do this...

Sequence Motifs

Motif: “a recurring salient thematic element.” The last few slides described structural motifs in proteins. Equally interesting are the DNA sequence motifs to which these proteins bind; e.g., one leucine zipper dimer might bind (with varying affinities) to dozens or hundreds of similar sequences.

DNA binding site summary

  • Complex “code”
  • Short patches (4-8 bp)
  • Often near each other (1 turn = 10 bp)
  • Often reverse-complements
  • Not perfect matches

SLIDE 7
E. coli Promoters

“TATA Box” ~10 bp upstream of transcription start. How to define it? The consensus is TATAAT, BUT all of these differ from it:

TACGAT TAAAAT TATACT GATAAT TATGAT TATGTT

Allow k mismatches? Equally weighted? Wildcards like R, Y? ({A,G}, {C,T}, resp.)

E. coli Promoters

“TATA Box”: consensus TATAAT, ~10 bp upstream of transcription start. Not exact: of 168 promoters studied (mid 80's):
  • nearly all had 2/3 of TAxyzT
  • 80-90% had all 3
  • 50% agreed in each of x, y, z
  • no perfect match
Other common features at -35, etc.

TATA Box Frequencies

Frequencies (%) by position:

base \ pos    1    2    3    4    5    6
A             2   95   26   59   51    1
C             9    2   14   13   20    3
G            10    1   16   15   13    0
T            79    3   44   13   17   96

TATA Scores

base \ pos    1    2    3    4    5    6
A           -36   19    1   12   10  -46
C           -15  -36   -8   -9   -3  -31
G           -13  -46   -6   -7   -9  -46(?)
T            17  -31    8   -9   -6   19
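To make the frequency-to-score step concrete, here is a minimal sketch in Python (not from the lecture). The integer scores above are consistent with 10 × log₂(frequency/background) under a uniform 25% background; that scaling is inferred from the numbers, not stated on the slide.

```python
import math

BASES = "ACGT"
# Frequencies (%) per position, from the table above (G at position 6 is 0).
FREQ = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 13, 0],
    "T": [79, 3, 44, 13, 17, 96],
}

def score_matrix(freq=FREQ, background=0.25, scale=10):
    """Per-base, per-position log-likelihood-ratio scores. A zero frequency
    gives -infinity; see the pseudocount discussion later for the usual fix."""
    return {
        b: [scale * math.log2((f / 100) / background) if f > 0 else -math.inf
            for f in freq[b]]
        for b in BASES
    }

for b, row in score_matrix().items():
    print(b, [("-inf" if s == -math.inf else round(s)) for s in row])
# A [-36, 19, 1, 12, 10, -46] ... matches the table (G pos 6 is -inf here).
```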

SLIDE 8

Scanning for TATA

Stormo, Ann. Rev. Biophys. Biophys. Chem., 17, 1988, 241-263

[Figure: the 4 × 6 weight matrix is slid along the sequence ACTATAATCG, scoring each successive length-6 window.]

Scanning for TATA

[Figure: window scores plotted along the sequence ACTATAATCGATCGATGCTAGCATGCGGATATGAT; score axis from -150 to 100. Scores shown include -93, -95, 85, 23, 50, 66; the consensus window TATAAT stands out at 85.]
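A minimal sketch of the scanning step, reusing score_matrix() from the previous sketch:

```python
def scan(seq, scores, width=6):
    """Slide the weight matrix along seq; return (position, score) per window."""
    return [
        (j, sum(scores[base][i] for i, base in enumerate(seq[j:j + width])))
        for j in range(len(seq) - width + 1)
    ]

seq = "ACTATAATCGATCGATGCTAGCATGCGGATATGAT"
for j, s in scan(seq, score_matrix()):
    print(j, seq[j:j + 6], format(s, ".0f"))
# The consensus window TATAAT at position 2 scores 17+19+8+12+10+19 = 85
# (using the rounded table entries; the float computation gives ~86).
```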

Score Distribution

(Simulated)

[Figure: histogram of window scores on simulated background sequence; scores range roughly -150 to 90, counts up to ~3500.]

Weight Matrices: Statistics

Assume: f_{b,i} = frequency of base b in position i in TATA boxes; f_b = frequency of base b in all sequences. The log likelihood ratio, given S = B₁B₂...B₆:

log [ P(S | “promoter”) / P(S | “nonpromoter”) ]
  = log [ ∏_{i=1}^{6} f_{B_i,i} / ∏_{i=1}^{6} f_{B_i} ]
  = Σ_{i=1}^{6} log ( f_{B_i,i} / f_{B_i} )

Assumes independence

SLIDE 9

Neyman-Pearson

Given a sample x₁, x₂, ..., xₙ from a distribution f(·|θ) with parameter θ, we want to test the hypothesis θ = θ₁ vs θ = θ₂. Might as well look at the likelihood ratio, compared to a threshold τ:

f(x₁, x₂, ..., xₙ | θ₁) / f(x₁, x₂, ..., xₙ | θ₂) > τ

Score Distribution (Simulated)

[Figure: the same simulated score histogram as before.]

What’s best WMM?

Given, say, 168 sequences s₁, s₂, ..., s_k of length 6, assumed to be generated at random according to a WMM defined by 6 × (4-1) parameters θ, what's the best θ? I.e., what's the MLE for θ given the data s₁, s₂, ..., s_k? Answer: as with coin flips or dice rolls, count frequencies per position (see HW).

Weight Matrices: Chemistry

Experiments show ~80% correlation of log likelihood weight matrix scores to measured binding energy of RNA polymerase to variations on TATAAT consensus [Stormo & Fields]

SLIDE 10

Another WMM example

8 sequences:

ATG ATG ATG ATG ATG GTG GTG TTG

Freq    Col 1  Col 2  Col 3
A       0.625    -      -
C         -      -      -
G       0.250    -      1
T       0.125    1      -

Log-likelihood ratio, uniform background (f_{x_i} = 1/4): score(x_i, i) = log₂ ( f_{x_i,i} / f_{x_i} )

LLR     Col 1  Col 2  Col 3
A        1.32   -∞     -∞
C        -∞     -∞     -∞
G        0.00   -∞     2.00
T       -1.00   2.00   -∞

Non-uniform Background

  • E. coli DNA: approximately 25% each of A, C, G, T
  • M. jannaschii: 68% A+T, 32% G+C

LLR from the previous example with f_A = f_T = 3/8, f_C = f_G = 1/8. E.g., G in col 3 is 8× more likely via the WMM than background, so its (log₂) score = 3 (bits).

LLR     Col 1  Col 2  Col 3
A        0.74   -∞     -∞
C        -∞     -∞     -∞
G        1.00   -∞     3.00
T       -1.58   1.42   -∞

Relative Entropy

AKA Kullback-Leibler Distance/Divergence, AKA Information Content. Given distributions P, Q:

H(P||Q) = Σ_{x∈Ω} P(x) log ( P(x) / Q(x) )

Notes:
  • Undefined if 0 = Q(x) < P(x)
  • Let P(x) log (P(x)/Q(x)) = 0 if P(x) = 0 [since lim_{y→0} y log y = 0]
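A minimal sketch of H(P||Q) with the 0 log 0 = 0 convention from the notes above:

```python
import math

def relative_entropy(p, q):
    """H(P||Q) in bits over a common finite domain (dicts of probabilities)."""
    h = 0.0
    for x, px in p.items():
        if px == 0:
            continue                      # 0 * log 0 = 0 by convention
        qx = q.get(x, 0.0)
        if qx == 0:
            raise ValueError("undefined: Q(x) = 0 < P(x)")
        h += px * math.log2(px / qx)
    return h

# Column 3 of the earlier example vs. the non-uniform background:
p3 = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
bg = {"A": 3/8, "C": 1/8, "G": 1/8, "T": 3/8}
print(relative_entropy(p3, bg))           # 3.0 bits, matching the table
```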

WMM: How “Informative”? Mean score of site vs background?

For any fixed-length sequence x, let
  • P(x) = probability of x according to the WMM
  • Q(x) = probability of x according to the background

H(P||Q) = Σ_{x∈Ω} P(x) log₂ ( P(x) / Q(x) )

  • H(P||Q) is the expected log likelihood score of a sequence randomly chosen from the WMM
  • -H(Q||P) is the expected score of a sequence randomly chosen from the background
SLIDE 11

WMM Scores vs Relative Entropy

[Figure: the simulated score histogram, annotated with H(P||Q) = 5.0 (mean score under the WMM) and -H(Q||P) = -6.8 (mean score under the background).]

For a WMM, you can show (based on the assumption of independence between columns) that

H(P||Q) = Σ_i H(P_i||Q_i)

where P_i and Q_i are the WMM/background distributions for column i.

WMM Example, cont.

Uniform background:

Freq    Col 1  Col 2  Col 3
A       0.625    -      -
C         -      -      -
G       0.250    -      1
T       0.125    1      -

LLR     Col 1  Col 2  Col 3
A        1.32   -∞     -∞
C        -∞     -∞     -∞
G        0.00   -∞     2.00
T       -1.00   2.00   -∞
RelEnt   0.70   2.00   2.00   (total 4.70)

Non-uniform background:

LLR     Col 1  Col 2  Col 3
A        0.74   -∞     -∞
C        -∞     -∞     -∞
G        1.00   -∞     3.00
T       -1.58   1.42   -∞
RelEnt   0.51   1.42   3.00   (total 4.93)

Pseudocounts

Are the -∞'s a problem? If you're certain that a given residue never occurs in a given position, then -∞ is just right. Otherwise, it may be a small-sample artifact. Typical fix: add a pseudocount, a small constant (e.g., 0.5 or 1), to each observed count. Sounds ad hoc; there is a Bayesian justification.
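A minimal sketch of pseudocounting: estimate per-column frequencies from aligned motif instances, adding a small constant to every count so unseen bases get a small nonzero probability instead of -∞ scores.

```python
from collections import Counter

def column_freqs(instances, pseudocount=0.5):
    """instances: aligned, equal-length motif strings. Returns a list of
    {base: frequency} dicts, one per column."""
    width = len(instances[0])
    freqs = []
    for i in range(width):
        counts = Counter(seq[i] for seq in instances)
        total = len(instances) + 4 * pseudocount
        freqs.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return freqs

seqs = ["ATG"] * 5 + ["GTG"] * 2 + ["TTG"]
for col in column_freqs(seqs):
    print({b: round(f, 3) for b, f in col.items()})
# e.g., column 2: A now gets 0.5/10 = 0.05 instead of probability 0
```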

SLIDE 12

WMM Summary

Weight Matrix Model (aka Position Specific Scoring Matrix, PSSM, “possum”, 0th-order Markov model)

  • Simple statistical model assuming independence between adjacent positions
  • To build: count (+ pseudocount) letter frequencies per position; take log likelihood ratio to background
  • To scan: add LLRs per position, compare to threshold
  • Generalizations to higher-order models (i.e., letter frequency per position, conditional on neighbors) are also possible, with enough training data

How-to Questions

Given aligned motif instances, build model?

Frequency counts (above, maybe w/ pseudocounts)

Given a model, find (probable) instances?

Scanning, as above

Given unaligned strings thought to contain a motif, find it? (E.g., upstream regions of co-expressed genes.)

Hard ... rest of lecture.

Motif Discovery

Unfortunately, finding a site of max relative entropy in a set of unaligned sequences is NP-hard [Akutsu].

Motif Discovery: 4 example approaches

Brute Force Greedy search Expectation Maximization Gibbs sampler

SLIDE 13

Brute Force

Input:

Motif length L, plus sequences s1, s2, ..., sk (all of length n+L-1, say), each with one instance of an unknown motif

Algorithm:

Build all k-tuples of length-L subsequences, one from each of s₁, s₂, ..., s_k (n^k such tuples). Compute the relative entropy of each. Pick the best.

Brute Force, II

Input: motif length L, plus seqs s₁, s₂, ..., s_k (all of length n+L-1, say), each with one instance of an unknown motif.

Algorithm in more detail:
  • Build singletons: each length-L subsequence of each of s₁, s₂, ..., s_k (nk sets)
  • Extend to pairs: length-L subsequences of each pair of seqs (n²·C(k,2) sets)
  • Then triples: length-L subsequences of each triple of seqs (n³·C(k,3) sets)
  • Repeat until all have k sequences (n^k·C(k,k) sets)
  • Compute relative entropy of each; pick the best

Problem: astronomically sloooow.
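A minimal sketch of the brute-force idea (hypothetical helper names): try every way to pick one length-L window per sequence, score each alignment by relative entropy to background, keep the best. O(n^k) tuples, hence the slowness.

```python
from itertools import product
import math

def windows(s, L):
    return [s[j:j + L] for j in range(len(s) - L + 1)]

def rel_entropy(instances, bg=0.25, pseudo=0.5):
    """Sum over columns of H(P_i || background), in bits, with pseudocounts."""
    L, n = len(instances[0]), len(instances)
    total = 0.0
    for i in range(L):
        for b in "ACGT":
            f = (sum(w[i] == b for w in instances) + pseudo) / (n + 4 * pseudo)
            total += f * math.log2(f / bg)
    return total

def brute_force(seqs, L):
    # Every combination of one window per sequence: n^k candidate alignments.
    return max(product(*(windows(s, L) for s in seqs)), key=rel_entropy)

seqs = ["AATGCG", "CATGCA", "TTATGC"]
print(brute_force(seqs, L=4))   # picks the shared ATGC windows
```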

Greedy Best-First

[Hertz & Stormo]

Input: sequences s₁, s₂, ..., s_k; motif length L; “breadth” d, say d = 1000.
Algorithm: as in brute force, but at each stage discard all but the best d relative entropies.

Usual “greedy” problems: early pruning can discard the branch leading to the optimum.

[Figure: search tree with d = 2; X's mark pruned branches.]

Y_{i,j} = 1 if the motif in sequence i begins at position j, Y_{i,j} = 0 otherwise

Expectation Maximization

[MEME, Bailey & Elkan, 1995]

Input (as above): sequences s₁, s₂, ..., s_k; motif length l; background model; again assume one motif instance per sequence (variants are possible).
Algorithm: EM.
  • Visible data: the sequences
  • Hidden data: where's the motif?
  • Parameters θ: the WMM

SLIDE 14

MEME Outline

Typical EM algorithm:

Parameters θ_t at the t-th iteration are used to estimate where the motif instances are (the hidden variables). Use those estimates to re-estimate the parameters θ to maximize the likelihood of the observed data, giving θ_{t+1}. Repeat.

Key: given a few good matches to best motif, expect to pick out more

Expectation Step (where are the motif instances?)

Ŷ_{i,j} = E(Y_{i,j} | s_i, θ_t)
        = P(Y_{i,j} = 1 | s_i, θ_t)                                        [E = 0·P(0) + 1·P(1)]
        = P(s_i | Y_{i,j} = 1, θ_t) P(Y_{i,j} = 1 | θ_t) / P(s_i | θ_t)    [Bayes]
        = c · P(s_i | Y_{i,j} = 1, θ_t)
        = c · ∏_{k=1}^{l} P(s_{i,j+k-1} | θ_t)

where c is chosen so that Σ_j Ŷ_{i,j} = 1.

[Figure: the Ŷ_{i,j} form a probability distribution over candidate motif start positions 1, 3, 5, 7, 9, 11, ... in sequence i.]
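A minimal sketch of this E-step for one sequence (hypothetical helper name; assumes a uniform prior over start positions):

```python
def e_step_one_seq(seq, theta, bg):
    """theta: per-motif-position dicts {base: prob}; bg: background {base: prob}.
    Returns Yhat[j], the posterior that the motif starts at position j."""
    l = len(theta)
    weights = []
    for j in range(len(seq) - l + 1):
        p = 1.0
        for k in range(l):
            b = seq[j + k]
            # Background factors for positions outside the window are the same
            # for every j, so they cancel into the normalizing constant c.
            p *= theta[k][b] / bg[b]
        weights.append(p)
    c = sum(weights)                      # c chosen so the Yhat sum to 1
    return [w / c for w in weights]
```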

Maximization Step (what is the motif?)

Find θ maximizing the expected value:

Q(θ | θ_t) = E_{Y∼θ_t}[ log P(s, Y | θ) ]
           = E_{Y∼θ_t}[ log ∏_{i=1}^{k} P(s_i, Y_i | θ) ]
           = E_{Y∼θ_t}[ Σ_{i=1}^{k} log P(s_i, Y_i | θ) ]
           = E_{Y∼θ_t}[ Σ_{i=1}^{k} Σ_{j=1}^{|s_i|-l+1} Y_{i,j} log P(s_i, Y_{i,j} = 1 | θ) ]
           = E_{Y∼θ_t}[ Σ_{i=1}^{k} Σ_{j=1}^{|s_i|-l+1} Y_{i,j} log ( P(s_i | Y_{i,j} = 1, θ) P(Y_{i,j} = 1 | θ) ) ]
           = Σ_{i=1}^{k} Σ_{j=1}^{|s_i|-l+1} E_{Y∼θ_t}[Y_{i,j}] log P(s_i | Y_{i,j} = 1, θ) + C
           = Σ_{i=1}^{k} Σ_{j=1}^{|s_i|-l+1} Ŷ_{i,j} log P(s_i | Y_{i,j} = 1, θ) + C

Exercise: show this is maximized by “counting” letter frequencies over all possible motif instances, with counts weighted by Ŷ_{i,j}; again the “obvious” thing.

M-Step (cont.)

Q(θ | θ_t) = Σ_{i=1}^{k} Σ_{j=1}^{|s_i|-l+1} Ŷ_{i,j} log P(s_i | Y_{i,j} = 1, θ) + C

[Figure: every length-l window of every sequence contributes to the new counts, weighted by its Ŷ_{i,j}. E.g., from s₁ = ACGGATT... the windows ACGG (Ŷ₁,₁), CGGA (Ŷ₁,₂), GGAT (Ŷ₁,₃), ...; from s_k = GC...TCGGAC the windows ..., CGGA (Ŷ_{k,l-1}), GGAC (Ŷ_{k,l}).]
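A minimal sketch of the corresponding M-step and the overall EM loop, pairing with e_step_one_seq above (hypothetical names; one motif per sequence, pseudocounts as before; in practice initialization matters, as the next slide notes):

```python
def m_step(seqs, yhats, l, pseudo=0.5):
    """Re-estimate theta by counting letters over every window, each window
    weighted by its E-step responsibility Yhat[i][j]."""
    counts = [{b: pseudo for b in "ACGT"} for _ in range(l)]
    for seq, yhat in zip(seqs, yhats):
        for j, w in enumerate(yhat):
            for k in range(l):
                counts[k][seq[j + k]] += w
    return [{b: c / sum(col.values()) for b, c in col.items()} for col in counts]

# Overall EM: alternate the two steps until theta stops changing.
bg = {b: 0.25 for b in "ACGT"}
seqs = ["TTATGCA", "GATGCTT", "CCATGCG"]
theta = m_step(seqs, [[0.25] * 4 for _ in seqs], l=4)   # flat start
for t in range(50):
    yhats = [e_step_one_seq(s, theta, bg) for s in seqs]
    theta = m_step(seqs, yhats, l=4)
```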

SLIDE 15

Initialization

  1. For every motif-length substring of the input, build an initial WMM with, say, 80% of the weight on that substring's letters, the rest uniform
  2. Run a few iterations of EM from each
  3. Run the best few to convergence

(Having a supercomputer helps.)

The Gibbs Sampler

Lawrence, et al. “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Sequence Alignment,” Science 1993

Another Motif Discovery Approach

SLIDE 16

Some History

Josiah Willard Gibbs, 1839-1903, American physicist, a pioneer of thermodynamics
  • Metropolis, Rosenbluth, Rosenbluth, Teller & Teller, “Equations of State Calculations by Fast Computing Machines,” J. Chem. Phys. 1953
  • Hastings, Biometrika, 1970
  • Geman & Geman, IEEE PAMI 1984

An old problem: k random variables x₁, x₂, ..., x_k with joint distribution (p.d.f.) P(x₁, x₂, ..., x_k) and some function f(x₁, x₂, ..., x_k); we want the expected value E(f(x₁, x₂, ..., x_k)).

How to Average

Approach 1: direct integration (rarely solvable analytically, esp. in high dimension):

E(f(x₁, ..., x_k)) = ∫_{x₁} ∫_{x₂} ··· ∫_{x_k} f(x₁, ..., x_k) · P(x₁, ..., x_k) dx₁ dx₂ ... dx_k

Approach 2: numerical integration (often difficult, e.g., unstable, esp. in high dimension)

Approach 3: Monte Carlo integration; sample x⁽¹⁾, x⁽²⁾, ..., x⁽ⁿ⁾ ∼ P(x) and average:

E(f(x)) ≈ (1/n) Σ_{i=1}^{n} f(x⁽ⁱ⁾)
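A minimal sketch of approach 3 (assumed example: estimating E(x²) for x ∼ N(0,1), whose exact value is 1):

```python
import random

def mc_expectation(f, sampler, n=100_000):
    """Monte Carlo estimate of E(f(x)): average f over samples from P."""
    return sum(f(sampler()) for _ in range(n)) / n

print(mc_expectation(lambda x: x * x, lambda: random.gauss(0, 1)))  # ~1.0
```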

Markov Chain Monte Carlo (MCMC)

  • Independent sampling is also often hard, but it is not required for computing expectations
  • MCMC: use a Markov chain X_{t+1} ∼ P(X_{t+1} | X_t) whose stationary distribution is P
  • Simplest & most common: Gibbs sampling, which resamples one variable at a time from its full conditional P(x_i | x₁, x₂, ..., x_{i-1}, x_{i+1}, ..., x_k)
  • Algorithm:

    for t = 1 to ∞:
        for i = 1 to k:
            x_{t+1,i} ∼ P(x_i | x_{t+1,1}, ..., x_{t+1,i-1}, x_{t,i+1}, ..., x_{t,k})

SLIDE 17

Input: again assume sequences s₁, s₂, ..., s_k, with one length-w motif per sequence.
Motif model: WMM.
Parameters: where are the motifs? For 1 ≤ i ≤ k, have 1 ≤ x_i ≤ |s_i| - w + 1.

“Full conditional”: to calculate

P(x_i = j | x₁, x₂, ..., x_{i-1}, x_{i+1}, ..., x_k),

build a WMM from the motifs in all sequences except i, then calculate the probability that the motif in the i-th sequence occurs at j by the usual “scanning” algorithm.

[Figure: the full conditional is a distribution over positions 1, 3, 5, 7, 9, 11, ... of sequence i.]

Overall Gibbs Alg

Randomly initialize the x_i's
for t = 1 to ∞:
    for i = 1 to k:
        discard the motif instance from s_i; recalculate the WMM from the rest
        for j = 1 ... |s_i| - w + 1: calculate the probability that the i-th motif is at j:
            P(x_i = j | x₁, x₂, ..., x_{i-1}, x_{i+1}, ..., x_k)
        pick a new x_i according to that distribution

Similar to MEME, but MEME would average over this distribution rather than sample from it.

Issues

  • Burnin: how long must we run the chain to reach stationarity?
  • Mixing: how long a post-burnin sample must we take to get a good sample of the stationary distribution? (Recall that individual samples are not independent, and may not “move” freely through the sample space. Also, many isolated modes.)
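A minimal sketch of one pass of this sampler (hypothetical helper names; pseudocounted WMM as earlier; no attention paid to burnin or mixing):

```python
import random

def wmm(instances, pseudo=0.5):
    """Per-column {base: prob} estimates from the given motif instances."""
    w, n = len(instances[0]), len(instances)
    return [{b: (sum(s[i] == b for s in instances) + pseudo) / (n + 4 * pseudo)
             for b in "ACGT"} for i in range(w)]

def gibbs_pass(seqs, xs, w, bg=0.25):
    for i in range(len(seqs)):
        # Discard sequence i's instance; recalc the WMM from the rest.
        others = [s[x:x + w] for idx, (s, x) in enumerate(zip(seqs, xs))
                  if idx != i]
        theta = wmm(others)
        # "Scan": full-conditional weight for each start position j.
        weights = []
        for j in range(len(seqs[i]) - w + 1):
            p = 1.0
            for k in range(w):
                p *= theta[k][seqs[i][j + k]] / bg
            weights.append(p)
        # Sample (not argmax) the new position from that distribution.
        xs[i] = random.choices(range(len(weights)), weights=weights)[0]
    return xs

seqs = ["TTATGCA", "GATGCTT", "CCATGCG"]
xs = [random.randrange(len(s) - 4 + 1) for s in seqs]
for t in range(200):
    xs = gibbs_pass(seqs, xs, w=4)
print([s[x:x + 4] for s, x in zip(seqs, xs)])   # often the shared ATGC windows
```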

SLIDE 18

Variants & Extensions

  • “Phase shift”: the sampler may settle on a suboptimal solution that overlaps part of the motif. Periodically try moving all motif instances a few spaces left or right.
  • Algorithmic adjustment of pattern width: periodically add/remove flanking positions to maximize (roughly) the average relative entropy per position.
  • Multiple patterns per string.

SLIDE 19

Methodology

  • 13 tools
  • Real ‘motifs’ (Transfac)
  • 56 data sets (human, mouse, fly, yeast)
  • ‘Real’, ‘generic’, ‘Markov’ backgrounds
  • Expert users, top prediction only

[Figure: per-tool accuracy comparison; legend: $ = greedy, * = Gibbs, ^ = EM.]

Lessons

  • Evaluation is hard (esp. when “truth” is unknown)
  • Low accuracy partly reflects limitations in the evaluation methodology (e.g., 1 prediction per data set; results are better on synthetic data) and partly reflects a difficult task and limited knowledge (e.g., yeast > others)
  • No clear winner re methods or models

SLIDE 20

Motif Discovery Summary

  • Important problem: a key to understanding gene regulation
  • Hard problem: short, degenerate signals amidst much noise
  • Many variants have been tried for representation, search, and discovery; we looked at only a few: weight matrix models for representation & search; greedy, MEME, and Gibbs for discovery
  • Still much room for improvement; comparative genomics, i.e., cross-species comparison, is very promising