Markov chains and the number of occurrences of a word in a sequence



  1. Markov chains and the number of occurrences of a word in a sequence (4.5–4.9, 11.1, 11.2, 11.4, 11.6). Prof. Tesler, Math 283, Fall 2018. Prof. Tesler, Markov Chains, Math 283 / Fall 2018, slide 1 / 44.

  2. Locating overlapping occurrences of a word

  Consider a (long) single-stranded nucleotide sequence τ = τ_1 … τ_N and a (short) word w = w_1 … w_k, e.g., w = GAGA.

      for i = 1 to N-3 {
          if (τ_i τ_{i+1} τ_{i+2} τ_{i+3} == GAGA) { ... }
      }

  The scan above takes up to ≈ 4N comparisons to locate all occurrences of GAGA (kN comparisons for a word w of length k). A finite state automaton (FSA) is a "machine" that can locate all occurrences while examining each letter of τ only once.
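A minimal Python sketch of the brute-force scan described above (the function name `naive_count` is illustrative, not from the slides):

```python
def naive_count(tau, w):
    """Brute-force scan as in the slide's pseudocode: at each start
    position, compare up to len(w) characters, so about k*N
    comparisons in the worst case."""
    k = len(w)
    return sum(1 for i in range(len(tau) - k + 1) if tau[i:i + k] == w)

# Counts overlapping occurrences, since every start position is checked.
print(naive_count("CAGAGGTCGAGAGT", "GAGA"))  # 1
```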

  3. Overlapping occurrences of GAGA

  [Figure: automaton M1 with states ∅, G, GA, GAG, GAGA and edges labeled by the letters A, C, G, T]

  The states are the nodes ∅, G, GA, GAG, GAGA (the prefixes of w). For w = w_1 w_2 ⋯ w_k, there are k + 1 states (one for each prefix). Start in the state ∅ (shown on the figure as 0). Scan τ = τ_1 τ_2 … τ_N one character at a time, left to right.

  Transition edges: when examining τ_j, move from the current state to the next state according to which edge τ_j is on. To build the edges: for each node u = w_1 ⋯ w_r and each letter x = A, C, G, T, determine the longest suffix s (possibly ∅) of w_1 ⋯ w_r x that is among the states, and draw an edge from u to s labeled x.

  The number of times we are in the state GAGA is the desired count of the number of occurrences.
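The edge-construction rule can be sketched in a few lines of Python (a hypothetical helper `build_fsa`, not from the slides):

```python
def build_fsa(w, alphabet="ACGT"):
    """Build the transition table of the prefix automaton for w.
    delta[(u, x)] is the longest suffix of u + x that is itself a
    prefix of w, exactly the edge rule described on the slide."""
    prefixes = [w[:i] for i in range(len(w) + 1)]  # k + 1 states
    delta = {}
    for u in prefixes:
        for x in alphabet:
            t = u + x
            while t not in prefixes:  # shorten from the left until a state
                t = t[1:]             # terminates: "" is always a prefix
            delta[(u, x)] = t
    return prefixes, delta

states, delta = build_fsa("GAGA")
```

For example, from state GAGA on reading G the automaton moves to GAG, which is what makes overlapping occurrences countable.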

  4. Overlapping occurrences of GAGA in τ = CAGAGGTCGAGAGT...

  [Figure: automaton M1 as on the previous slide]

       t: 1  2  3  4  5   6   7  8  9  10 11  12   13    14
     τ_t: C  A  G  A  G   G   T  C  G  A  G   A    G     T   ...

      t   state at t   τ_t        t   state at t   τ_t
      1   ∅            C          9   ∅            G
      2   ∅            A         10   G            A
      3   ∅            G         11   GA           G
      4   G            A         12   GAG          A
      5   GA           G         13   GAGA         G
      6   GAG          G         14   GAG          T
      7   G            T         15   ∅            ...
      8   ∅            C
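The single left-to-right scan in the table above can be reproduced in Python (a sketch; `count_overlapping` is an illustrative name):

```python
def count_overlapping(tau, w):
    """Scan tau once through the prefix automaton for w, counting
    visits to the accepting state w (overlaps allowed)."""
    prefixes = {w[:i] for i in range(len(w) + 1)}
    state, hits = "", 0
    for x in tau:
        t = state + x
        while t not in prefixes:  # longest-suffix transition rule
            t = t[1:]
        state = t
        if state == w:
            hits += 1
    return hits

# Reaches state GAGA exactly once (at time 13 in the table).
print(count_overlapping("CAGAGGTCGAGAGT", "GAGA"))  # 1
```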

  5. Non-overlapping occurrences of GAGA

  [Figure: automata M1 (overlaps allowed) and M2 (no overlaps); in M2 the outgoing edges of GAGA are copies of the outgoing edges of ∅]

  For non-overlapping occurrences of w: replace the outgoing edges from w by copies of the outgoing edges from ∅. On the previous slide, the time 13 → 14 transition GAGA → GAG (on reading G) changes to GAGA → G.
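The same one-pass scan with the M2 modification, sketched in Python (illustrative name `count_nonoverlapping`):

```python
def count_nonoverlapping(tau, w):
    """As on the slide: after reaching state w, the outgoing edges are
    those of the empty state, so a match cannot reuse its own letters."""
    prefixes = {w[:i] for i in range(len(w) + 1)}
    state, hits = "", 0
    for x in tau:
        if state == w:
            state = ""  # M2: copy of the empty state's outgoing edges
        t = state + x
        while t not in prefixes:
            t = t[1:]
        state = t
        if state == w:
            hits += 1
    return hits
```

On GAGAGA this counts 1 occurrence, where the overlapping version counts 2.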

  6. Motif { GAGA, GTGA }, overlaps permitted

  [Figure: automaton with states ∅, G, GA, GAG, GAGA, GT, GTG, GTGA]

  States: all prefixes of all words in the motif. If a prefix occurs in multiple words, only create one node for it. Transition edges may jump from one word of the motif to another, e.g., GTGA → GAG on reading G. Count the number of times we reach the states for any words in the motif (GAGA or GTGA).
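A sketch of the motif version in Python. This simple construction suits motifs like { GAGA, GTGA }, where no motif word is a proper suffix of another state; a fully general construction (Aho–Corasick) also tracks suffix matches.

```python
def count_motif(tau, motif):
    """States are all prefixes of all words in the motif; transitions
    go to the longest suffix that is again a prefix (possibly jumping
    between words). Counts visits to any full motif word."""
    prefixes = {w[:i] for w in motif for i in range(len(w) + 1)}
    state, hits = "", 0
    for x in tau:
        t = state + x
        while t not in prefixes:
            t = t[1:]
        state = t
        if state in motif:
            hits += 1
    return hits
```

For example, in CGTGAGAGT the automaton reaches GTGA, then jumps to GAG and reaches GAGA, for 2 hits.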

  7. Markov chains

  A Markov chain is similar to a finite state machine, but incorporates probabilities. Let S be a set of "states." We will take S to be a discrete finite set, such as S = { 1, 2, …, s }. Let t = 1, 2, … denote the "time." Let X_1, X_2, … denote a sequence of random variables with values in S. The X_t's form a (first order) Markov chain if they obey these rules:

  1. The probability of being in a certain state at time t+1 only depends on the state at time t, not on any earlier states:
         P(X_{t+1} = x_{t+1} | X_1 = x_1, …, X_t = x_t) = P(X_{t+1} = x_{t+1} | X_t = x_t)

  2. The probability of transitioning from state i at time t to state j at time t+1 only depends on i and j, not on the time t:
         P(X_{t+1} = j | X_t = i) = p_ij at all times t

  for some values p_ij, which form an s × s transition matrix.
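A minimal simulation sketch of the definition (not from the slides; `simulate_chain` and its arguments are illustrative): the next state is drawn using only the row of the transition matrix for the current state.

```python
import random

def simulate_chain(P, states, x0, n, seed=0):
    """Simulate X_1, ..., X_n from a first-order Markov chain.
    P is a list of rows (one per state); the next state depends only
    on the current state i, via the probabilities in row P[i]."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n - 1):
        i = states.index(xs[-1])
        xs.append(rng.choices(states, weights=P[i])[0])
    return xs

path = simulate_chain([[0.9, 0.1], [0.5, 0.5]], ["a", "b"], "a", 50)
```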

  8. Transition matrix

  The transition matrix P_1 of the Markov chain M1, with entry (P_1)_ij the probability of moving from state i to state j:

      From \ To:    1: ∅        2: G   3: GA  4: GAG  5: GAGA
      1: ∅          pA+pC+pT    pG     0      0       0
      2: G          pC+pT       pG     pA     0       0
      3: GA         pA+pC+pT    0      0      pG      0
      4: GAG        pC+pT       pG     0      0       pA
      5: GAGA       pA+pC+pT    0      0      pG      0

  Notice that the entries in each row sum up to pA + pC + pG + pT = 1. A matrix with all entries ≥ 0 and all row sums equal to 1 is called a stochastic matrix. The transition matrix of a Markov chain is always stochastic. "All row sums = 1" can be written P 1 = 1, where 1 denotes the column vector of all 1's, so 1 is a right eigenvector of P with eigenvalue 1.

  9. Transition matrices for GAGA

  [Figures: automata M1 and M2 with edge labels replaced by probabilities, e.g., pC + pT]

  The matrices are shown for the case that all nucleotides have equal probabilities 1/4:

           [ 3/4  1/4   0    0    0  ]          [ 3/4  1/4   0    0    0  ]
           [ 1/2  1/4  1/4   0    0  ]          [ 1/2  1/4  1/4   0    0  ]
     P1 =  [ 3/4   0    0   1/4   0  ]    P2 =  [ 3/4   0    0   1/4   0  ]
           [ 1/2  1/4   0    0   1/4 ]          [ 1/2  1/4   0    0   1/4 ]
           [ 3/4   0    0   1/4   0  ]          [ 3/4  1/4   0    0    0  ]

  P2 (no overlaps) is obtained from P1 (overlaps allowed) by replacing the last row with a copy of the first row.
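The stochastic-matrix properties from the previous slide can be checked numerically (a sketch using NumPy; state order ∅, G, GA, GAG, GAGA):

```python
import numpy as np

# P1 (overlaps allowed) for w = GAGA, all nucleotides equally likely
P1 = np.array([[3/4, 1/4, 0,   0,   0  ],
               [1/2, 1/4, 1/4, 0,   0  ],
               [3/4, 0,   0,   1/4, 0  ],
               [1/2, 1/4, 0,   0,   1/4],
               [3/4, 0,   0,   1/4, 0  ]])

# P2 (no overlaps): last row replaced by a copy of the first row
P2 = P1.copy()
P2[4] = P1[0]

# Both are stochastic: all rows sum to 1, i.e., P·1 = 1
ones = np.ones(5)
assert np.allclose(P1 @ ones, ones)
assert np.allclose(P2 @ ones, ones)
```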

  10. Other applications of automata

  Automata / state machines are also used in other applications in math and computer science. The transition weights may be defined differently, and the matrices usually aren't stochastic.

  Combinatorics: count walks through the automaton (instead of getting their probabilities) by setting each transition weight from u to s to 1.

  Computer science (formal languages, classifiers, ...): does the string τ contain GAGA? Output 1 if it does, 0 otherwise. Modify M1: remove the outgoing edges on GAGA. On reaching state GAGA, terminate with output 1. If the end of τ is reached, terminate with output 0. This is called a deterministic finite acceptor (DFA).

  Markov chains: instead of considering a specific string τ, we'll compute probabilities, expected values, ... over the sample space of all strings of length n.

  11. Other Markov chain examples

  A Markov chain is kth order if the probability of X_t = i depends on the values of X_{t-1}, …, X_{t-k}. It can be converted to a first order Markov chain by making new states that record more history.

  Positional independence: instead of a null hypothesis that a DNA sequence is generated by repeated rolls of a biased four-sided die, we could use a Markov chain. The simplest is a one-step transition matrix

          [ pAA  pAC  pAG  pAT ]
      P = [ pCA  pCC  pCG  pCT ]
          [ pGA  pGC  pGG  pGT ]
          [ pTA  pTC  pTG  pTT ]

  P could be the same at all positions. In a coding region, it could be different for the first, second, and third positions of codons.

  Nucleotide evolution: there are models of random point mutations over the course of evolution, given by Markov chains of the same form P as above, in which X_t is the state A, C, G, T of the nucleotide at a given position in a sequence at time (generation) t.
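Such a one-step matrix can be estimated from an observed sequence by counting letter pairs. A sketch (the helper `estimate_P` is illustrative, not from the slides):

```python
from collections import Counter

def estimate_P(seq, alphabet="ACGT"):
    """Maximum-likelihood estimate of the one-step transition matrix:
    p_xy = (# times x is immediately followed by y) / (# times x is
    followed by anything). Returns a nested dict P[x][y]."""
    pair_counts = Counter(zip(seq, seq[1:]))
    P = {}
    for x in alphabet:
        total = sum(pair_counts[(x, y)] for y in alphabet)
        P[x] = {y: (pair_counts[(x, y)] / total if total else 0.0)
                for y in alphabet}
    return P

P = estimate_P("GAGAGA")
```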

  12. Questions about Markov chains

  1. What is the probability of being in a particular state after n steps?
  2. What is the probability of being in a particular state as n → ∞?
  3. What is the "reverse" Markov chain?
  4. If you are in state i, what is the expected number of time steps until the next time you are in state j? What is the variance of this? What is the complete probability distribution?
  5. Starting in state i, what is the expected number of visits to state j before reaching state k?

  13. Transition probabilities after two steps

  [Figure: paths from state i at time t through each state r = 1, …, 5 at time t+1 to state j at time t+2, with edge weights P_ir and P_rj]

  To compute the probability of going from state i at time t to state j at time t+2, consider all the states it could go through at time t+1:

      P(X_{t+2} = j | X_t = i)
        = Σ_r P(X_{t+1} = r | X_t = i) · P(X_{t+2} = j | X_{t+1} = r, X_t = i)
        = Σ_r P(X_{t+1} = r | X_t = i) · P(X_{t+2} = j | X_{t+1} = r)
        = Σ_r P_ir P_rj = (P²)_ij
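The identity Σ_r P_ir P_rj = (P²)_ij is just the definition of matrix multiplication, which can be checked numerically (a sketch using the GAGA matrix P1 from slide 9):

```python
import numpy as np

P = np.array([[3/4, 1/4, 0,   0,   0  ],
              [1/2, 1/4, 1/4, 0,   0  ],
              [3/4, 0,   0,   1/4, 0  ],
              [1/2, 1/4, 0,   0,   1/4],
              [3/4, 0,   0,   1/4, 0  ]])

# (P^2)_ij = sum over intermediate states r at time t+1 of P_ir * P_rj
two_step = np.array([[sum(P[i, r] * P[r, j] for r in range(5))
                      for j in range(5)] for i in range(5)])
assert np.allclose(two_step, P @ P)
```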

  14. Transition probabilities after n steps

  For n ≥ 0, the transition matrix from time t to time t+n is P^n:

      P(X_{t+n} = j | X_t = i)
        = Σ_{r_1, …, r_{n-1}} P(X_{t+1} = r_1 | X_t = i) P(X_{t+2} = r_2 | X_{t+1} = r_1) ⋯ P(X_{t+n} = j | X_{t+n-1} = r_{n-1})
        = Σ_{r_1, …, r_{n-1}} P_{i r_1} P_{r_1 r_2} ⋯ P_{r_{n-1} j} = (P^n)_ij

  (sum over the possible states r_1, …, r_{n-1} at times t+1, …, t+(n-1))
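The n-step formula can be verified on a small example by summing over all paths of intermediate states and comparing with the matrix power (the 2-state matrix below is illustrative, not from the slides):

```python
import numpy as np
from itertools import product

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])  # a small illustrative 2-state chain

n = 4
# Brute-force sum over all intermediate states r_1, ..., r_{n-1}:
# each path i -> r_1 -> ... -> r_{n-1} -> j contributes the product
# of its n transition probabilities.
brute = np.zeros_like(P)
for i, j in product(range(2), repeat=2):
    for path in product(range(2), repeat=n - 1):
        seq = (i, *path, j)
        brute[i, j] += np.prod([P[a, b] for a, b in zip(seq, seq[1:])])

assert np.allclose(brute, np.linalg.matrix_power(P, n))
```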
