
The number of occurrences of a word (5.7) and motif (5.9) in a DNA sequence, allowing overlaps
Covariance (2.4) and indicators (2.9)

Prof. Tesler
Math 283, Fall 2016

Covariance

Let $X$ and $Y$ be random variables, possibly dependent. Then

\[
\begin{aligned}
\operatorname{Var}(X+Y) &= E\big((X+Y-\mu_X-\mu_Y)^2\big) \\
&= E\big(((X-\mu_X)+(Y-\mu_Y))^2\big) \\
&= E\big((X-\mu_X)^2\big) + E\big((Y-\mu_Y)^2\big) + 2\,E\big((X-\mu_X)(Y-\mu_Y)\big) \\
&= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y),
\end{aligned}
\]

where the covariance of $X$ and $Y$ is defined as

\[
\operatorname{Cov}(X,Y) = E\big((X-\mu_X)(Y-\mu_Y)\big).
\]

Expanding gives the alternate formula $\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y)$:

\[
\begin{aligned}
\operatorname{Cov}(X,Y) &= E\big((X-\mu_X)(Y-\mu_Y)\big) \\
&= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X\mu_Y \\
&= E(XY) - E(X)E(Y).
\end{aligned}
\]
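As a sanity check on the two covariance formulas and the variance identity, here is a minimal sketch using exact rational arithmetic. The joint distribution is hypothetical, chosen only for illustration; it does not come from the lecture.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y), for illustration only.
joint = {(0, 0): F(1, 2), (0, 1): F(1, 6), (1, 0): F(1, 12), (1, 1): F(1, 4)}

def expect(f):
    """E(f(X, Y)) under the joint distribution above."""
    return sum(prob * f(x, y) for (x, y), prob in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)

# Covariance from the definition and from the alternate formula.
cov_def = expect(lambda x, y: (x - mu_x) * (y - mu_y))
cov_alt = expect(lambda x, y: x * y) - mu_x * mu_y

# Variance of the sum, computed directly from its definition.
var_sum = expect(lambda x, y: (x + y - mu_x - mu_y) ** 2)

print(cov_def, cov_alt, var_sum, var_x + var_y + 2 * cov_def)
```

Because the arithmetic is exact, the two covariance values agree exactly, as do the two expressions for the variance of the sum.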

Covariance properties

$\operatorname{Cov}(X,X) = \operatorname{Var}(X)$

$\operatorname{Cov}(X,Y) = \operatorname{Cov}(Y,X)$

If $X$, $Y$ are independent then $\operatorname{Cov}(X,Y) = 0$ and $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$. Beware, this is not reversible: $\operatorname{Cov}(X,Y)$ could be 0 for dependent variables.

$\operatorname{Cov}(aX+b,\, cY+d) = ac\operatorname{Cov}(X,Y)$

\[
\operatorname{Var}(X_1 + X_2 + \cdots + X_n) = \operatorname{Var}(X_1) + \cdots + \operatorname{Var}(X_n) + 2\sum_{1 \le i < j \le n} \operatorname{Cov}(X_i, X_j)
\]

Sign of covariance

When $\operatorname{Cov}(X,Y)$ is positive: there is a tendency to have $X > \mu_X$ when $Y > \mu_Y$ and vice versa, and $X < \mu_X$ when $Y < \mu_Y$ and vice versa.

When $\operatorname{Cov}(X,Y)$ is negative: there is a tendency to have $X > \mu_X$ when $Y < \mu_Y$ and vice versa, and $X < \mu_X$ when $Y > \mu_Y$ and vice versa.
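To illustrate that zero covariance does not imply independence, here is a small sketch with a standard textbook example that is not taken from the lecture: $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$.

```python
from fractions import Fraction as F

# Assumed example: X is uniform on {-1, 0, 1} and Y = X**2 (fully determined by X).
xs = [-1, 0, 1]
p = F(1, 3)

e_x = sum(p * x for x in xs)            # E(X)
e_y = sum(p * x ** 2 for x in xs)       # E(Y) = E(X^2)
e_xy = sum(p * x * x ** 2 for x in xs)  # E(XY) = E(X^3)
cov = e_xy - e_x * e_y

# Dependence: P(X = 0, Y = 0) differs from P(X = 0) * P(Y = 0).
p_joint_00 = p         # Y = 0 exactly when X = 0
p_product_00 = p * p   # P(X = 0) = 1/3 and P(Y = 0) = 1/3

print(cov, p_joint_00, p_product_00)
```

The covariance is 0 even though $Y$ is a function of $X$, since the joint probability 1/3 does not equal the product 1/9.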

Occurrences of a word in a sequence: notation

Consider a (long) single-stranded nucleotide sequence $\tau = \tau_1 \ldots \tau_N$ and a (short) word $w = w_1 \ldots w_k$:

τ = τ_1 ... τ_19 = CTATAGATAGATAGACAGT
w = w_1 ... w_9 = ATAGATAGA

We say $w$ occurs in $\tau$ at position $j$ when a copy of $w$ in $\tau$ ends at position $j$:

j     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
τ_j   C  T  A  T  A  G  A  T  A  G  A  T  A  G  A  C  A  G  T

so $w$ occurs in $\tau$ at 11 (positions 3 to 11) and at 15 (positions 7 to 15).

Let

\[
I_j = \begin{cases} 1 & \text{if } w \text{ occurs in } \tau \text{ at } j; \\ 0 & \text{otherwise.} \end{cases}
\]

Here $I_{11} = I_{15} = 1$ and every other $I_j = 0$.

$I_j$ is an indicator variable (1 when a condition is true, 0 when false).

$Y = I_k + I_{k+1} + \cdots + I_N$ is the number of times $w$ occurs in $\tau$. Here, $Y = 2$.
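The indicator definition can be sketched directly in Python. The function name `occurrence_ends` is a hypothetical helper introduced for illustration; positions are 1-based end positions, matching the slide's convention.

```python
def occurrence_ends(tau, w):
    """Return the 1-based positions j at which w occurs in tau (w ends at j).

    Overlapping occurrences are all reported.
    """
    k = len(w)
    # tau[j - k:j] holds the letters at 1-based positions j-k+1 .. j.
    return [j for j in range(k, len(tau) + 1) if tau[j - k:j] == w]

tau = "CTATAGATAGATAGACAGT"
w = "ATAGATAGA"

ends = occurrence_ends(tau, w)
Y = len(ends)  # Y = I_k + ... + I_N
print(ends, Y)  # [11, 15] 2
```

A plain `str.count` would not work here, since it counts only non-overlapping matches.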

Computing mean number of occurrences $\mu = E(Y)$

Suppose $\tau$ is generated by $N$ independent rolls of a 4-sided die, whose sides have probabilities $p_A, p_C, p_G, p_T$ adding up to 1.

The probability of a word being generated by rolling such a die is the product of the probabilities of its nucleotides:

\[
\pi(w) = p_{w_1} \cdots p_{w_k}, \qquad \pi(\texttt{ATAGATAGA}) = p_A^5\, p_T^2\, p_G^2 .
\]

The probability of $w$ occurring at $j = k, k+1, \ldots, N$ is $\pi(w)$.

The $I_j$'s are indicator variables, so

\[
E(I_j) = 0 \cdot P(I_j = 0) + 1 \cdot P(I_j = 1) = P(I_j = 1) = \pi(w) \quad \text{for } j = k, k+1, \ldots, N.
\]

$Y = I_k + I_{k+1} + \cdots + I_N$, so the mean number of occurrences is

\[
\mu = E(Y) = E(I_k) + \cdots + E(I_N) = (N - k + 1)\,\pi(w).
\]
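A minimal sketch of $\pi(w)$ and $\mu = (N-k+1)\,\pi(w)$ follows. The helper names `word_prob` and `mean_occurrences` are hypothetical, and the uniform nucleotide probabilities are an assumption made only to keep the arithmetic easy to check.

```python
from fractions import Fraction as F
from math import prod

def word_prob(w, p):
    """pi(w): product of the per-letter probabilities of w."""
    return prod(p[c] for c in w)

def mean_occurrences(w, p, N):
    """E(Y) = (N - k + 1) * pi(w) for a random sequence of length N."""
    return (N - len(w) + 1) * word_prob(w, p)

# Assumed for illustration: all four nucleotides equally likely.
p_uniform = {c: F(1, 4) for c in "ACGT"}

pi_w = word_prob("ATAGATAGA", p_uniform)
mu = mean_occurrences("ATAGATAGA", p_uniform, 19)
print(pi_w, mu)
```

With $N = 19$ and $k = 9$ there are 11 possible end positions, so $\mu = 11 \cdot (1/4)^9$ under these assumed probabilities.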

Dependencies between positions

Occurrences at different positions have dependencies, because of how shifts of $w$ may overlap with each other.

w = ATAGATAGA cannot occur at both 14 and 15, because position 7 would have to be both T and A:

j     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
                     A  T  A  G  A  T  A  G  A
                        A  T  A  G  A  T  A  G  A

But $w$ can occur at both 11 and 15:

j     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
τ_j   C  T  A  T  A  G  A  T  A  G  A  T  A  G  A  C  A  G  T

This is equivalent to

\[
w_1 \ldots w_k\, w_{r+1} \ldots w_k = w_1 \ldots w_9\, w_6 \ldots w_9 = \texttt{ATAG}\,\texttt{ATAGA}\,\texttt{TAGA}
\]

occurring at 15, where $k = 9$ is the word length and $r = 5$ is the overlap length.

Chapter 5.8 considers counting occurrences without overlaps. Chapters 4 and 11 treat the more general problem of Markov chains.

Self-overlaps of a word

Define

\[
\varepsilon_r = \begin{cases} 1 & \text{if the first } r \text{ letters of } w \text{ equal the last } r \text{ letters of } w \text{ in the exact same order (string equality);} \\ 0 & \text{otherwise.} \end{cases}
\]

This lets us account for dependencies between $I_j$ and $I_{j+k-r}$. Shifting by $k - r$ positions corresponds to an overlap of size $r$.

For w = ATAGATAGA:

r   first r letters   last r letters   ε_r
9   ATAGATAGA         ATAGATAGA        1
8   ATAGATAG          TAGATAGA         0
7   ATAGATA           AGATAGA          0
6   ATAGAT            GATAGA           0
5   ATAGA             ATAGA            1
4   ATAG              TAGA             0
3   ATA               AGA              0
2   AT                GA               0
1   A                 A                1
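The definition of $\varepsilon_r$ translates into a prefix/suffix comparison. The sketch below uses a hypothetical helper name, `self_overlaps`, and reproduces the values for ATAGATAGA.

```python
def self_overlaps(w):
    """Map each r in 1..k to epsilon_r: 1 if the first r letters of w
    equal the last r letters of w, else 0."""
    k = len(w)
    return {r: int(w[:r] == w[k - r:]) for r in range(1, k + 1)}

eps = self_overlaps("ATAGATAGA")
overlap_lengths = [r for r, e in eps.items() if e == 1]
print(overlap_lengths)  # [1, 5, 9]
```

The case $r = k$ is always 1 (the word overlaps itself completely); only $r = 1, \ldots, k-1$ enter the covariance terms.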

Computing $\sigma^2 = \operatorname{Var}(Y)$

Since the $I_j$'s have dependencies, the variance of their sum $Y = I_k + \cdots + I_N$ is NOT necessarily the sum of their variances. We must consider covariance terms as well:

\[
\operatorname{Var}(Y) = \sum_{j=k}^{N} \operatorname{Var}(I_j) + 2 \sum_{j,\ell:\; k \le j < \ell \le N} \operatorname{Cov}(I_j, I_\ell)
\]

First sum: Note that $I_j^2 = I_j$ since $I_j = 0$ or $1$, so

\[
\operatorname{Var}(I_j) = E(I_j^2) - (E(I_j))^2 = \pi(w) - \pi(w)^2
\]

and the first sum in $\operatorname{Var}(Y)$ is

\[
\sum_{j=k}^{N} \operatorname{Var}(I_j) = (N - k + 1)\big(\pi(w) - \pi(w)^2\big).
\]

Second sum: next few slides.

Covariances $2\sum_{j,\ell:\; k \le j < \ell \le N} \operatorname{Cov}(I_j, I_\ell)$

The covariance sum is more complicated:

If $\ell - j \ge k$ then $I_j$, $I_\ell$ are independent and $\operatorname{Cov}(I_j, I_\ell) = 0$.

If $0 < \ell - j < k$, the words ending at $\ell$ and $j$ overlap by $r = k - (\ell - j)$ letters. Rewrite $\ell$ as $\ell = j + k - r$:

\[
\operatorname{Cov}(I_j, I_\ell) = \operatorname{Cov}(I_j, I_{j+k-r}) = E(I_j I_{j+k-r}) - E(I_j)\,E(I_{j+k-r})
\]

When $\varepsilon_r = 1$, $I_j I_{j+k-r} = 1$ iff $w_1 \ldots w_k\, w_{r+1} \ldots w_k$ occurs at position $j + k - r$ in $\tau$. When $\varepsilon_r = 0$, the two occurrences are incompatible and $I_j I_{j+k-r} = 0$ always.

E.g., $w_1 \ldots w_k\, w_{r+1} \ldots w_k = w_1 \ldots w_9\, w_6 \ldots w_9 = \texttt{ATAG}\,\texttt{ATAGA}\,\texttt{TAGA}$.

\[
E(I_j I_{j+k-r}) = \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k)
\]

\[
\operatorname{Cov}(I_j, I_{j+k-r}) = E(I_j I_{j+k-r}) - E(I_j)\,E(I_{j+k-r}) = \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2
\]

Note that this depends on $r$ but not $j$.

Covariances $2\sum_{j,\ell:\; k \le j < \ell \le N} \operatorname{Cov}(I_j, I_\ell)$

The covariance sum becomes

\[
\begin{aligned}
\sum_{j,\ell:\; k \le j < \ell \le N} \operatorname{Cov}(I_j, I_\ell)
&= \sum_{r=1}^{k-1} \sum_{j=k}^{N-k+r} \Big( \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2 \Big) \\
&= \sum_{r=1}^{k-1} (N - 2k + r + 1) \Big( \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2 \Big) \\
&= \left( \sum_{r=1}^{k-1} \varepsilon_r \cdot (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) \right) - \frac{\big((N - 2k + 2) + (N - k)\big)(k - 1)}{2}\, (\pi(w))^2
\end{aligned}
\]

Mean and variance of number of occurrences

Combining all the parts together and simplifying gives

Mean number of occurrences

\[
E(Y) = (N - k + 1)\,E(I_k) = (N - k + 1)\,\pi(w)
\]

Variance of number of occurrences

\[
\operatorname{Var}(Y) = (N - k + 1)\,\pi(w) - \big((2k - 1)N - 3k^2 + 4k - 1\big)(\pi(w))^2 + 2 \sum_{r=1}^{k-1} \varepsilon_r \cdot (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k)
\]
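The variance formula can be checked against exhaustive enumeration on a small case. This is a sketch under stated assumptions: the helper names, the test word ATA, the length $N = 7$, and the nucleotide probabilities are all hypothetical choices made for illustration. The pair count $N - 2k + r + 1$ is only meaningful when it is nonnegative, so the sketch assumes $N \ge 2k - 2$.

```python
from fractions import Fraction as F
from itertools import product
from math import prod

def word_prob(w, p):
    """pi(w): product of the per-letter probabilities of w."""
    return prod(p[c] for c in w)

def var_occurrences(w, p, N):
    """Var(Y) from the closed formula (assumes N >= 2k - 2)."""
    k = len(w)
    pi = word_prob(w, p)
    total = (N - k + 1) * pi - ((2 * k - 1) * N - 3 * k * k + 4 * k - 1) * pi ** 2
    for r in range(1, k):
        if w[:r] == w[k - r:]:  # epsilon_r = 1
            # w + w[r:] is the merged word w_1..w_k w_{r+1}..w_k
            total += 2 * (N - 2 * k + r + 1) * word_prob(w + w[r:], p)
    return total

def brute_force_mean_var(w, p, N):
    """Exact E(Y) and Var(Y) by summing over every sequence of length N."""
    k = len(w)
    m1 = m2 = 0
    for letters in product(p, repeat=N):
        tau = "".join(letters)
        prob = word_prob(tau, p)
        y = sum(tau[j - k:j] == w for j in range(k, N + 1))
        m1 += prob * y
        m2 += prob * y * y
    return m1, m2 - m1 ** 2

# Assumed, non-uniform probabilities chosen for illustration.
p = {"A": F(1, 2), "C": F(1, 8), "G": F(1, 8), "T": F(1, 4)}
mean_bf, var_bf = brute_force_mean_var("ATA", p, 7)
print(mean_bf == (7 - 3 + 1) * word_prob("ATA", p))
print(var_bf == var_occurrences("ATA", p, 7))
```

Exact fractions are used so that the comparison is an equality rather than a floating-point approximation; the enumeration visits $4^7 = 16384$ sequences, which is feasible only for such small $N$.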

Computation for $w = w_1 \ldots w_9 = \texttt{ATAGATAGA}$ ($k = 9$) over all $\tau$ of length $N$

$\pi(w) = p_A^5\, p_T^2\, p_G^2$, and $w$ self-overlaps at $r = 1, 5$.

\[
E(Y) = (N - k + 1)\,\pi(w) = (N - 8)\,\pi(w) = (N - 8)\, p_A^5\, p_T^2\, p_G^2
\]

\[
\begin{aligned}
\operatorname{Var}(Y) &= (N - k + 1)\,\pi(w) - \big((2k - 1)N - 3k^2 + 4k - 1\big)(\pi(w))^2 + 2 \sum_{r=1}^{k-1} \varepsilon_r \cdot (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) \\
&= (N - 8)\,\pi(w) - (17N - 208)(\pi(w))^2 + 2(N - 16)\,\pi(\texttt{ATAGATAG}\,\texttt{A}\,\texttt{TAGATAGA}) + 2(N - 12)\,\pi(\texttt{ATAG}\,\texttt{ATAGA}\,\texttt{TAGA}) \\
&= (N - 8)\, p_A^5\, p_T^2\, p_G^2 - (17N - 208)\, p_A^{10}\, p_T^4\, p_G^4 + 2(N - 16)\, p_A^9\, p_G^4\, p_T^4 + 2(N - 12)\, p_A^7\, p_G^3\, p_T^3
\end{aligned}
\]

Here $N - 16 = N - 2k + 2$ is the $r = 1$ term and $N - 12 = N - 2k + 6$ is the $r = 5$ term.

Frequencies of words and motifs in SARS

The genome of SARS described previously has N = 29751 bases:

Nucleotide   Frequency    Proportion
A            8481         p_A ≈ 0.2851
C            5940         p_C ≈ 0.1997
G            6187         p_G ≈ 0.2080
T            9143         p_T ≈ 0.3073
Total        N = 29751    1

These were used below to compute "Estimated" µ and σ. "Observed frequency" y was determined from the DNA sequence.

Word      Estimated µ   Estimated σ   Observed y   z = (y − µ)/σ   Φ(z)
GAGA      104.5456      10.6943       106          0.1360          0.5541
GCGA      73.2226       8.4830        37           −4.2700         … × 10^−5
TGCG      78.9381       8.8018        59           −2.2652         0.0118
motif M   256.7064      17.6583       202          −3.0980         … × 10^−3

(M consists of all three words; details on computing µ, σ are later.)
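The single-word rows of the table can be recomputed from the nucleotide counts with the mean and variance formulas. The sketch below is an illustration, not the lecture's own code: the helper names are hypothetical, $\Phi$ is evaluated with the standard identity $\Phi(z) = \tfrac12\big(1 + \operatorname{erf}(z/\sqrt2)\big)$, and the motif row is not covered since its computation is described later in the lecture.

```python
from math import erf, prod, sqrt

# Nucleotide counts from the table above.
counts = {"A": 8481, "C": 5940, "G": 6187, "T": 9143}
N = sum(counts.values())  # 29751
p = {c: n / N for c, n in counts.items()}

def word_prob(w):
    """pi(w) under the estimated nucleotide proportions."""
    return prod(p[c] for c in w)

def mean_sd(w):
    """Return (mu, sigma) for the number of occurrences of w, overlaps allowed."""
    k = len(w)
    pi = word_prob(w)
    mu = (N - k + 1) * pi
    var = mu - ((2 * k - 1) * N - 3 * k * k + 4 * k - 1) * pi ** 2
    for r in range(1, k):
        if w[:r] == w[k - r:]:  # epsilon_r = 1
            var += 2 * (N - 2 * k + r + 1) * word_prob(w + w[r:])
    return mu, sqrt(var)

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Observed counts y are taken from the table, not recomputed from the genome.
observed = {"GAGA": 106, "GCGA": 37, "TGCG": 59}
rows = {}
for w, y in observed.items():
    mu, sd = mean_sd(w)
    z = (y - mu) / sd
    rows[w] = (mu, sd, z, phi(z))
    print(f"{w}  mu={mu:.4f}  sigma={sd:.4f}  y={y}  z={z:.4f}  Phi(z)={phi(z):.4g}")
```

Only GAGA has a proper self-overlap (at $r = 2$), which is why its σ is noticeably larger relative to µ than for GCGA and TGCG.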
