
The number of occurrences of a word (5.7) and motif (5.9) in a DNA sequence, allowing overlaps
Covariance (2.4) and indicators (2.9)
Prof. Tesler, Math 283, Fall 2016



  2. Covariance

Let X and Y be random variables, possibly dependent.
\[
\begin{aligned}
\mathrm{Var}(X+Y) &= E\big((X+Y-\mu_X-\mu_Y)^2\big) \\
&= E\Big(\big((X-\mu_X)+(Y-\mu_Y)\big)^2\Big) \\
&= E\big((X-\mu_X)^2\big) + E\big((Y-\mu_Y)^2\big) + 2\,E\big((X-\mu_X)(Y-\mu_Y)\big) \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)
\end{aligned}
\]
where the covariance of X and Y is defined as
\[ \mathrm{Cov}(X,Y) = E\big((X-\mu_X)(Y-\mu_Y)\big). \]
Expanding gives an alternate formula, \( \mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y) \):
\[
\mathrm{Cov}(X,Y) = E\big((X-\mu_X)(Y-\mu_Y)\big)
= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y
= E(XY) - E(X)E(Y).
\]
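As a quick numerical sanity check, both covariance formulas and the Var(X+Y) identity can be verified exactly on a small joint distribution. The pmf values below are illustrative, not taken from the slides:

```python
# Exact check of Cov(X,Y) = E((X-mu_X)(Y-mu_Y)) = E(XY) - E(X)E(Y)
# and Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X,Y) on a small joint pmf.
# The pmf values are illustrative, not from the slides.
pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def E(f):
    """Expectation of f(X, Y) under the joint pmf."""
    return sum(f(x, y) * p for (x, y), p in pmf.items())

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
cov_def = E(lambda x, y: (x - mu_x) * (y - mu_y))   # definition
cov_alt = E(lambda x, y: x * y) - mu_x * mu_y       # expanded formula
var_x = E(lambda x, y: (x - mu_x) ** 2)
var_y = E(lambda x, y: (y - mu_y) ** 2)
var_sum = E(lambda x, y: (x + y - mu_x - mu_y) ** 2)

print(cov_def, cov_alt)                      # both 0.1 (up to float rounding)
print(var_sum, var_x + var_y + 2 * cov_def)  # both 0.69 (up to float rounding)
```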

  3. Covariance properties

- Cov(X, X) = Var(X)
- Cov(X, Y) = Cov(Y, X)
- If X and Y are independent, then Cov(X, Y) = 0 and Var(X + Y) = Var(X) + Var(Y). Beware, this is not reversible: Cov(X, Y) can be 0 for dependent variables.
- Cov(aX + b, cY + d) = ac Cov(X, Y)
- \[ \mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n) + 2 \sum_{1 \le i < j \le n} \mathrm{Cov}(X_i, X_j) \]

Sign of covariance
- When Cov(X, Y) is positive: there is a tendency to have X > μ_X when Y > μ_Y, and X < μ_X when Y < μ_Y.
- When Cov(X, Y) is negative: there is a tendency to have X > μ_X when Y < μ_Y, and X < μ_X when Y > μ_Y.

  4. Occurrences of a word in a sequence — notation

Consider a (long) single-stranded nucleotide sequence τ = τ_1 … τ_N and a (short) word w = w_1 … w_k:

    τ = τ_1 … τ_19 = CTATAGATAGATAGACAGT
    w = w_1 … w_9  = ATAGATAGA

Say w occurs in τ at position j when w appears in τ ending at position j:

    j    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
    τ_j  C T A T A G A T A G  A  T  A  G  A  C  A  G  T

so w occurs in τ at 11 and 15.

Let I_j = 1 if w occurs in τ at j, and I_j = 0 otherwise. Here I_11 = I_15 = 1 and all other I_j = 0. I_j is an indicator variable (1 when a condition is true, 0 when false).

Y = I_k + I_{k+1} + … + I_N is the number of times w occurs in τ. Here, Y = 2.
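The indicator setup translates directly into a few lines of Python. This is a sketch of the slide's example, keeping the 1-based ending positions j of the notation:

```python
# Count overlapping occurrences of w in tau via indicator variables,
# reproducing the slide's example (1-based ending positions j).
tau = "CTATAGATAGATAGACAGT"
w = "ATAGATAGA"
k, N = len(w), len(tau)

# I[j] = 1 if w occurs in tau ending at position j, else 0
I = {j: int(tau[j - k:j] == w) for j in range(k, N + 1)}
Y = sum(I.values())  # number of occurrences, allowing overlaps

print(sorted(j for j in I if I[j]), Y)  # [11, 15] 2
```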

  5. Computing the mean number of occurrences μ = E(Y)

Suppose τ is generated by N independent rolls of a 4-sided die whose sides have probabilities p_A, p_C, p_G, p_T adding up to 1.

The probability of a word being generated by rolling such a die is the product of the probabilities of its nucleotides:
\[ \pi(w) = p_{w_1} \cdots p_{w_k}, \qquad \pi(\mathtt{ATAGATAGA}) = p_A^5\, p_T^2\, p_G^2. \]

The probability of w occurring at j = k, k+1, …, N is π(w). The I_j's are indicator variables, so
\[ E(I_j) = 0 \cdot P(I_j = 0) + 1 \cdot P(I_j = 1) = P(I_j = 1) = \pi(w) \quad \text{for } j = k, k+1, \ldots, N. \]

Since Y = I_k + I_{k+1} + … + I_N, the mean number of occurrences is
\[ \mu = E(Y) = E(I_k) + \cdots + E(I_N) = (N - k + 1)\, \pi(w). \]
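A direct transcription of μ = (N − k + 1)π(w); the uniform base probabilities below are illustrative, not from the slides:

```python
# mu = E(Y) = (N - k + 1) * pi(w) under the i.i.d. die model.
def pi(word, p):
    """Probability that len(word) independent die rolls spell out word."""
    prob = 1.0
    for c in word:
        prob *= p[c]
    return prob

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # illustrative uniform die
w, N = "ATAGATAGA", 19
mu = (N - len(w) + 1) * pi(w, p)
print(mu)  # 11 * 0.25**9, about 4.2e-05
```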

  6. Dependencies between positions

Occurrences at different positions have dependencies, because of how shifts of w may overlap with each other.

w = ATAGATAGA cannot occur at both 14 and 15: occurrences ending one position apart would force the first 8 letters of w to equal its last 8 letters, which is false for this word.

But w can occur at both 11 and 15:

    j    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
    τ_j  C T A T A G A T A G  A  T  A  G  A  C  A  G  T

This is equivalent to w_1 … w_k w_{r+1} … w_k = w_1 … w_9 w_6 … w_9 = ATAG ATAGA TAGA occurring at 15, where k = 9 is the word length and r = 5 is the overlap length.

Chapter 5.8 considers counting occurrences without overlaps. Chapters 4 and 11 treat the more general problem of Markov chains.

  7. Self-overlaps of a word

Define
\[
\varepsilon_r =
\begin{cases}
1 & \text{if the first } r \text{ letters of } w \text{ equal the last } r \text{ letters of } w \text{ in the exact same order (string equality)};\\
0 & \text{otherwise.}
\end{cases}
\]
This lets us account for dependencies between I_j and I_{j+k−r}. Shifting by k − r positions corresponds to an overlap of size r.

For w = ATAGATAGA:

    r    first r letters   last r letters   ε_r
    9    ATAGATAGA         ATAGATAGA        1
    8    ATAGATAG          TAGATAGA         0
    7    ATAGATA           AGATAGA          0
    6    ATAGAT            GATAGA           0
    5    ATAGA             ATAGA            1
    4    ATAG              TAGA             0
    3    ATA               AGA              0
    2    AT                GA               0
    1    A                 A                1
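ε_r is a one-line string comparison; checking every r for the slide's word recovers exactly the table above:

```python
# eps(w, r) = 1 when the first r letters of w equal the last r letters.
def eps(w, r):
    return int(w[:r] == w[len(w) - r:])

w = "ATAGATAGA"
overlaps = [r for r in range(1, len(w) + 1) if eps(w, r)]
print(overlaps)  # [1, 5, 9]
```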

  8. Computing σ² = Var(Y)

Since the I_j's have dependencies, the variance of their sum Y = I_k + … + I_N is NOT necessarily the sum of their variances. We must consider covariance terms as well:
\[
\mathrm{Var}(Y) = \sum_{j=k}^{N} \mathrm{Var}(I_j) + 2 \sum_{j,\ell:\ k \le j < \ell \le N} \mathrm{Cov}(I_j, I_\ell)
\]

First sum: note that I_j² = I_j since I_j = 0 or 1, so
\[ \mathrm{Var}(I_j) = E(I_j^2) - (E(I_j))^2 = \pi(w) - \pi(w)^2 \]
and the first sum in Var(Y) is
\[ \sum_{j=k}^{N} \mathrm{Var}(I_j) = (N - k + 1)\big(\pi(w) - \pi(w)^2\big). \]

Second sum: next few slides.

  9. Covariances: the sum \( 2 \sum_{j,\ell:\ k \le j < \ell \le N} \mathrm{Cov}(I_j, I_\ell) \)

The covariance sum is complicated:

- If ℓ − j ≥ k, then I_j and I_ℓ are independent and Cov(I_j, I_ℓ) = 0.
- If 0 < ℓ − j < k, the words ending at ℓ and j overlap by r = k − (ℓ − j) letters. Rewrite ℓ as ℓ = j + k − r:
\[ \mathrm{Cov}(I_j, I_{j+k-r}) = E(I_j\, I_{j+k-r}) - E(I_j)\, E(I_{j+k-r}) \]
- I_j I_{j+k−r} = 1 iff w_1 … w_k w_{r+1} … w_k occurs at position j + k − r in τ. E.g., w_1 … w_k w_{r+1} … w_k = w_1 … w_9 w_6 … w_9 = ATAG ATAGA TAGA.
- So E(I_j I_{j+k−r}) = ε_r · π(w_1 … w_k w_{r+1} … w_k), and
\[
\mathrm{Cov}(I_j, I_{j+k-r}) = \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2.
\]

Note that this depends on r but not on j.

  10. Covariances: the sum \( 2 \sum_{j,\ell:\ k \le j < \ell \le N} \mathrm{Cov}(I_j, I_\ell) \), continued

The covariance sum becomes
\[
\begin{aligned}
\sum_{j,\ell:\ k \le j < \ell \le N} \mathrm{Cov}(I_j, I_\ell)
&= \sum_{r=1}^{k-1} \sum_{j=k}^{N-k+r} \Big( \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2 \Big) \\
&= \sum_{r=1}^{k-1} (N - 2k + r + 1) \Big( \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2 \Big) \\
&= \left( \sum_{r=1}^{k-1} \varepsilon_r \, (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) \right)
 - \frac{\big((N - 2k + 2) + (N - k)\big)(k - 1)}{2}\, (\pi(w))^2
\end{aligned}
\]

(For each r, the inner sum has N − 2k + r + 1 terms; the last term collects the arithmetic series \( \sum_{r=1}^{k-1} (N - 2k + r + 1) \).)

  11. Mean and variance of number of occurrences

Combining all the parts together and simplifying gives:

Mean number of occurrences
\[ E(Y) = (N - k + 1)\, E(I_k) = (N - k + 1)\, \pi(w) \]

Variance of number of occurrences
\[
\mathrm{Var}(Y) = (N - k + 1)\, \pi(w) - \big((2k - 1)N - 3k^2 + 4k - 1\big)\, (\pi(w))^2
+ 2 \sum_{r=1}^{k-1} \varepsilon_r \, (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k)
\]
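The boxed formulas can be transcribed directly into code. The helper names (pi, eps, mean_var) and the uniform probabilities are mine for illustration, not from the text:

```python
# Mean and variance of the overlapping occurrence count Y, transcribed
# from the formulas above. Helper names are illustrative.
def pi(word, p):
    """Probability of the word under the i.i.d. die model."""
    prob = 1.0
    for c in word:
        prob *= p[c]
    return prob

def eps(w, r):
    """1 if the first r letters of w equal its last r letters."""
    return int(w[:r] == w[len(w) - r:])

def mean_var(w, N, p):
    k = len(w)
    pw = pi(w, p)
    mu = (N - k + 1) * pw
    var = mu - ((2 * k - 1) * N - 3 * k**2 + 4 * k - 1) * pw**2
    for r in range(1, k):  # proper self-overlaps only, r = 1..k-1
        if eps(w, r):
            # w_1..w_k w_{r+1}..w_k is w followed by its last k - r letters
            var += 2 * (N - 2 * k + r + 1) * pi(w + w[r:], p)
    return mu, var

p = {b: 0.25 for b in "ACGT"}  # illustrative uniform die
mu, var = mean_var("ATAGATAGA", 1000, p)
print(mu, var)
```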

  12. Computation for w = w_1 … w_9 = ATAGATAGA (k = 9) over all τ of length N

Here π(w) = p_A^5 p_T^2 p_G^2, and w self-overlaps at r = 1, 5.
\[ E(Y) = (N - k + 1)\, \pi(w) = (N - 8)\, \pi(w) = (N - 8)\, p_A^5\, p_T^2\, p_G^2 \]
\[
\begin{aligned}
\mathrm{Var}(Y) &= (N - k + 1)\, \pi(w) - \big((2k - 1)N - 3k^2 + 4k - 1\big)\, (\pi(w))^2
+ 2 \sum_{r=1}^{k-1} \varepsilon_r \, (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) \\
&= (N - 8)\, \pi(w) - (17N - 208)\, (\pi(w))^2
+ 2 (N - 16)\, \pi(\mathtt{ATAGATAG\,A\,TAGATAGA}) + 2 (N - 12)\, \pi(\mathtt{ATAG\,ATAGA\,TAGA}) \\
&= (N - 8)\, p_A^5 p_T^2 p_G^2 - (17N - 208)\, p_A^{10} p_T^4 p_G^4
+ 2 (N - 16)\, p_A^9 p_G^4 p_T^4 + 2 (N - 12)\, p_A^7 p_G^3 p_T^3
\end{aligned}
\]
(The r = 1 term uses the 17-letter concatenation with N − 2k + 2 = N − 16 positions; the r = 5 term uses the 13-letter concatenation with N − 2k + 6 = N − 12 positions.)

  13. Frequencies of words and motifs in SARS

The genome of SARS described previously has N = 29751 bases:

    Nucleotide   Frequency   Proportion
    A            8481        p_A ≈ 0.2851
    C            5940        p_C ≈ 0.1997
    G            6187        p_G ≈ 0.2080
    T            9143        p_T ≈ 0.3073
    Total        29751       1

These were used to compute the "estimated" μ and σ below. The observed frequency y was determined from the DNA sequence.

    Word      Estimated μ   Estimated σ   Observed y   z = (y − μ)/σ   Φ(z)
    GAGA      104.5456      10.6943       106           0.1360         0.5541
    GCGA       73.2226       8.4830        37          −4.2700         ≈ 10^(−5)
    TGCG       78.9381       8.8018        59          −2.2652         0.0118
    motif M   256.7064      17.6583       202          −3.0980         ≈ 10^(−3)

(M consists of all three words; details on computing its μ and σ are later.)
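The z-scores and tail probabilities in the table can be rechecked from the listed μ, σ, and y, with the standard normal CDF Φ built from math.erf:

```python
# Recompute z = (y - mu) / sigma and Phi(z) for the rows of the table.
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

rows = {"GAGA":    (104.5456, 10.6943, 106),
        "GCGA":    (73.2226,  8.4830,  37),
        "TGCG":    (78.9381,  8.8018,  59),
        "motif M": (256.7064, 17.6583, 202)}

for word, (mu, sigma, y) in rows.items():
    z = (y - mu) / sigma
    print(f"{word}: z = {z:.4f}, Phi(z) = {Phi(z):.4g}")
```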
