 
Whole Genome Analysis and Annotation
Adam Siepel
Biological Statistics & Computational Biology, Cornell University
2 The Challenge Whole Genome Analysis
3 Genome Browsers
6 Comparative Analysis of Complete Mammalian Genomes [Figure: species shown are human, chimp, macaque, mouse, rat, cow, dog, opossum, platypus, chicken, zebrafish, Tetraodon, fugu]
7 Detection of Functional Elements [Figure: species shown are human, mouse, rat, dog, chicken, fugu]
8 Conservation Track (Siepel, Bejerano, Pedersen, et al., Genome Res., 2005)
9 Conservation Track: GAL1 (Siepel, Bejerano, Pedersen, et al., Genome Res., 2005)
10 Solanaceae Browser
12 Possible Positive Selection: Chondrosarcoma associated gene 1 isoform a
13 “Human Accelerated Region 1” (HAR1) (Pollard, Salama, et al., Nature, 2006)
14 New Human RNA Structure [Figure: RNA structure panels labeled Human and Chimp] (Pollard, Salama, et al., Nature, 2006)
15 Exon Predictions: Data from E. Green & colleagues (Thomas et al., Nature 2003)
16 Whole Mount in situ Hybridizations to Zebrafish Embryos [Figure: panels for ch1.5081.18 and OTP at 48hpf and 72hpf, with labeled regions Telencephalon, Diencephalon, and Hindbrain] (Bruce Roe & colleagues)
17 [Figure: multiple alignment with columns annotated as non-coding, 3´ splice, codon positions (1, 2, 3), and 5´ splice]
human    C T C A G C A G … G A G G T A A G
mouse    T T A A G C A G … G A A G T G T G
rat      T T A A G C A G … G A A G T G T T
dog      A A T A G C A A … G A G G T C C A
chicken  C A C A G C A A … G A G G T C A A
18 Phylo-HMM Used by PhastCons (Siepel, Bejerano, Pedersen, et al., Genome Res., 2005)
Introduction to Hidden Markov Models, Phylogenetic Models, and Phylo-HMMs
2 A Markov Model (Chain)
• Suppose $Z = (Z_1, \ldots, Z_L)$ is a sequence of cloudy ($Z_i = 0$) or sunny ($Z_i = 1$) days
• We could assume days are iid with probability $\theta$ of sun, but cloudy and sunny days occur in runs
• We can capture the correlation between successive days by assuming a first-order Markov model:
  $$P(Z_1, \ldots, Z_L) = P(Z_1)\, P(Z_2 \mid Z_1)\, P(Z_3 \mid Z_2) \cdots P(Z_L \mid Z_{L-1})$$
  instead of complete independence:
  $$P(Z_1, \ldots, Z_L) = P(Z_1) \cdots P(Z_L)$$
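The contrast between the two factorizations can be sketched numerically. This is a minimal illustration, not part of the original slides: the initial and transition probabilities (`INIT`, `TRANS`) and the value `theta=0.5` are hypothetical, chosen only so that days tend to occur in runs.

```python
# Hypothetical parameters: the slides give no numeric values for the chain.
INIT = {0: 0.5, 1: 0.5}            # P(Z_1)
TRANS = {                          # TRANS[c][d] = P(Z_i = d | Z_{i-1} = c)
    0: {0: 0.8, 1: 0.2},
    1: {0: 0.2, 1: 0.8},
}


def markov_prob(z):
    """First-order Markov chain: P(z_1) * prod_{i=2..L} P(z_i | z_{i-1})."""
    p = INIT[z[0]]
    for prev, cur in zip(z, z[1:]):
        p *= TRANS[prev][cur]
    return p


def iid_prob(z, theta=0.5):
    """Complete independence: prod_i P(z_i), with P(sunny) = theta."""
    p = 1.0
    for zi in z:
        p *= theta if zi == 1 else 1.0 - theta
    return p


runs = [0, 0, 0, 1, 1, 1]          # weather in runs
alternating = [0, 1, 0, 1, 0, 1]   # weather flipping every day

print(markov_prob(runs), markov_prob(alternating))
print(iid_prob(runs), iid_prob(alternating))
```

Under these assumed values the Markov chain assigns much higher probability to the sequence with runs than to the alternating one, while the iid model with `theta=0.5` cannot tell them apart.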
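3 Three Views
1. $P(z) = P(z_1) \prod_{i=2}^{L} a_{z_{i-1},z_i}$, where $a_{c,d} = P(z_i = d \mid z_{i-1} = c)$
2. [Figure: chain $Z_1, Z_2, \ldots, Z_L$ with successive edges labeled $a_{z_1,z_2}$, $a_{z_2,z_3}$, ..., $a_{z_{L-1},z_L}$]
3. [Figure: state diagram with begin state $B$ and states 0 and 1, labeled with $P(z_1 = 0)$, $P(z_1 = 1)$, $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$]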
4 Process Interpretation
• Let’s add an end state and cap the sequence with $z_0 = B$, $z_{L+1} = E$, e.g. $z = B000011000E$
  [Figure: state diagram with states $B$, 0, 1, $E$ and transitions $a_{B,0}$, $a_{B,1}$, $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$, $a_{0,E}$, $a_{1,E}$]
• This is a probabilistic machine that generates sequences of any length. It is a stochastic finite state machine and defines a grammar.
• We can now simply say:
  $$P(z) = \prod_{i=0}^{L} a_{z_i,z_{i+1}}$$
  $P(z)$ is a probability distribution over all sequences (for given alphabet).
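The generative reading of the chain can be sketched as follows. This is an illustration under stated assumptions: the transition table `TRANS` holds hypothetical values (the slides leave every $a_{s,t}$ symbolic), and the helper names `path_prob` and `generate` are introduced here for the example only.

```python
import random
from itertools import product

# Hypothetical transition probabilities; TRANS[s][t] plays the role of a_{s,t}.
# Each row sums to 1, and both states can move to the end state "E".
TRANS = {
    "B": {0: 0.5, 1: 0.5},
    0: {0: 0.7, 1: 0.2, "E": 0.1},
    1: {0: 0.2, 1: 0.7, "E": 0.1},
}


def path_prob(z):
    """P(z) = prod_{i=0..L} a_{z_i, z_{i+1}} with z_0 = B and z_{L+1} = E."""
    capped = ["B", *z, "E"]
    p = 1.0
    for s, t in zip(capped, capped[1:]):
        p *= TRANS[s].get(t, 0.0)   # a missing transition has probability 0
    return p


def generate(rng):
    """Run the machine from B until it reaches E; return the emitted states."""
    z, state = [], "B"
    while True:
        targets = list(TRANS[state])
        weights = [TRANS[state][t] for t in targets]
        state = rng.choices(targets, weights=weights)[0]
        if state == "E":
            return z
        z.append(state)


rng = random.Random(0)              # fixed seed so the run is repeatable
sample = generate(rng)
print(sample, path_prob(sample))

# Total probability mass on all sequences of length exactly 2.
mass_len2 = sum(path_prob(list(z)) for z in product((0, 1), repeat=2))
print(mass_len2)
```

With these assumed values every state ends with probability 0.1, so the sequences of length n carry total mass 0.9^(n-1) × 0.1, and the masses over all lengths sum to 1, which is what makes $P(z)$ a distribution over sequences of any length.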
5 A Hidden Markov Model
• Let $X = (X_1, \ldots, X_L)$ indicate whether AS bikes on day $i$ ($X_i = 1$) or not ($X_i = 0$)
• Suppose AS bikes on day $i$ with probability $\theta_0 = 0.25$ if it is cloudy ($Z_i = 0$) and with probability $\theta_1 = 0.75$ if it is sunny ($Z_i = 1$)
• Further suppose the $Z_i$s are hidden; we see only $X = (X_1, \ldots, X_L)$
• This hidden Markov model is a mixture model in which the $Z_i$s are correlated
• We call $Z = (Z_1, \ldots, Z_L)$ the path
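6 HMM, cont.
• $Z$ is determined by the Markov chain:
  [Figure: state diagram with states $B$, 0, 1, $E$ and transitions $a_{B,0}$, $a_{B,1}$, $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$, $a_{0,E}$, $a_{1,E}$]
• The joint probability of $X$ and $Z$ is:
  $$P(x, z) = P(z)\, P(x \mid z) = a_{B,z_1} \prod_{i=1}^{L} e_{z_i,x_i}\, a_{z_i,z_{i+1}}$$
  where $e_{z_i,x_i} = P(x_i \mid z_i)$ and $z_{L+1} = E$
• The $X_i$s are conditionally independent given the $Z_i$s
  [Figure: graphical model with the chain $Z_1, Z_2, Z_3, \ldots, Z_L$ and observations $X_1, X_2, X_3, \ldots, X_L$]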
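The joint probability can be evaluated directly from its product form. In this sketch the emission table `EMIT` uses the biking probabilities stated in the slides (0.25 when cloudy, 0.75 when sunny); the transition table `TRANS` is hypothetical, since the slides leave the $a_{s,t}$ symbolic, and `joint_prob` is a name introduced here.

```python
# Hypothetical transition probabilities; TRANS[s][t] plays the role of a_{s,t}.
TRANS = {
    "B": {0: 0.5, 1: 0.5},
    0: {0: 0.7, 1: 0.2, "E": 0.1},
    1: {0: 0.2, 1: 0.7, "E": 0.1},
}
# EMIT[s][x] = e_{s,x} = P(x_i = x | z_i = s), from the slides' theta_0 and theta_1.
EMIT = {
    0: {0: 0.75, 1: 0.25},
    1: {0: 0.25, 1: 0.75},
}


def joint_prob(x, z):
    """P(x, z) = a_{B,z_1} * prod_{i=1..L} e_{z_i,x_i} * a_{z_i,z_{i+1}}, z_{L+1} = E."""
    if len(x) != len(z) or not z:
        raise ValueError("x and z must be non-empty and of equal length")
    p = TRANS["B"][z[0]]
    for i, (zi, xi) in enumerate(zip(z, x)):
        nxt = z[i + 1] if i + 1 < len(z) else "E"
        p *= EMIT[zi][xi] * TRANS[zi][nxt]
    return p


x = [1, 1, 0]   # bikes, bikes, does not bike
z = [1, 1, 0]   # sunny, sunny, cloudy
print(joint_prob(x, z))
```

For this short example the product is 0.5 × (0.75 × 0.7) × (0.75 × 0.2) × (0.75 × 0.1), one emission and one outgoing transition per position after the initial step out of $B$.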
7 Parameters of the Model
• Transition parameters: $a_{s_1,s_2}$ for all $s_1, s_2 \in S \cup \{B, E\}$
• Emission parameters: $e_{s,x}$ for all $s \in S$, $x \in A$
• The transition parameters define conditional distributions for state $s_2$ at position $i$ given state $s_1$ at position $i-1$
• The emission parameters define conditional distributions over observation $x$ given state $s$, both at position $i$
• The observations can be anything!
8 Key Questions
• Given the model (parameter values) and a sequence $X$, what is the most likely path?
  $$\hat{z} = \operatorname{argmax}_z P(x, z)$$
• What is the likelihood of the sequence?
  $$P(x) = \sum_z P(x, z)$$
• What is the posterior probability of $Z_i$ given $X$?
• What is the maximum likelihood estimate of all parameters?
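The likelihood and posterior questions are commonly answered with the forward and backward recursions, which the slides do not spell out; the sketch below swaps them in as one standard approach and checks the likelihood against the brute-force sum over all paths. The transition values in `TRANS` are hypothetical, the emissions in `EMIT` come from the biking example, and all function names are introduced here.

```python
from itertools import product

STATES = (0, 1)
# Hypothetical transition probabilities; TRANS[s][t] plays the role of a_{s,t}.
TRANS = {
    "B": {0: 0.5, 1: 0.5},
    0: {0: 0.7, 1: 0.2, "E": 0.1},
    1: {0: 0.2, 1: 0.7, "E": 0.1},
}
# EMIT[s][x] = e_{s,x}, using the slides' values 0.25 (cloudy) and 0.75 (sunny).
EMIT = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.25, 1: 0.75}}


def forward(x):
    """f[i][j] = P(x_1..x_{i+1}, state j at position i+1), 0-based i."""
    f = [{j: TRANS["B"][j] * EMIT[j][x[0]] for j in STATES}]
    for xi in x[1:]:
        prev = f[-1]
        f.append({j: EMIT[j][xi] * sum(prev[k] * TRANS[k][j] for k in STATES)
                  for j in STATES})
    return f


def backward(x):
    """b[i][k] = P(rest of x after position i+1, then end | state k there)."""
    b = [{k: TRANS[k]["E"] for k in STATES}]
    for xi in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(TRANS[k][j] * EMIT[j][xi] * nxt[j] for j in STATES)
                     for k in STATES})
    return b


def likelihood(x):
    """P(x) = sum_k f_L(k) * a_{k,E}."""
    return sum(forward(x)[-1][k] * TRANS[k]["E"] for k in STATES)


def posterior(x):
    """P(Z_i = k | x) for every position i and state k."""
    f, b, px = forward(x), backward(x), likelihood(x)
    return [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(len(x))]


def brute_force_likelihood(x):
    """Sum P(x, z) over all 2^L paths; only feasible for short sequences."""
    total = 0.0
    for z in product(STATES, repeat=len(x)):
        p = TRANS["B"][z[0]]
        for i, (zi, xi) in enumerate(zip(z, x)):
            nxt = z[i + 1] if i + 1 < len(z) else "E"
            p *= EMIT[zi][xi] * TRANS[zi][nxt]
        total += p
    return total


x = [0, 0, 1, 0, 0, 1, 0, 0]
print(likelihood(x), brute_force_likelihood(x))
print(posterior(x)[2])
```

The recursions cost time proportional to L times the square of the number of states, whereas the brute-force sum grows as 2^L; for long sequences an implementation would typically work in log space or rescale to avoid underflow.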
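9 Graph Interpretation of Most Likely Path
[Figure: graph over positions with observations $x_i$ = 0 0 1 0 0 1 0 0, hidden states $z_i \in \{0, 1\}$, begin state $B$, and end state $E$]
10 Graph Interpretation of Probability of x
[Figure: graph over positions with observations $x_i$ = 0 0 1 0 0 1 0 0, hidden states $z_i \in \{0, 1\}$, begin state $B$, and end state $E$]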
11 Viterbi Algorithm for Most Likely Path
• Let $v_{i,j}$ be the weight of the most likely path for $(x_1, \ldots, x_i)$ that ends in state $j$
• Base case: $v_{0,B} = 1$, $v_{i,B} = 0$ for $i > 0$
• Recurrence: $v_{i,j} = e_{j,x_i} \max_k v_{i-1,k}\, a_{k,j}$
• Termination: $P(x, \hat{z}) = \max_k v_{L,k}\, a_{k,E}$
• Keep back-pointers for traceback, as in alignment
• See Durbin et al. for algorithm
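The recurrence, termination, and traceback can be sketched as below. This is an illustration rather than a reference implementation: `TRANS` holds hypothetical transition values, `EMIT` uses the biking probabilities from the slides, and `viterbi` and `joint_prob` are names introduced here. The decoded path depends on the assumed transitions, so no particular output is claimed.

```python
STATES = (0, 1)
# Hypothetical transition probabilities; TRANS[s][t] plays the role of a_{s,t}.
TRANS = {
    "B": {0: 0.5, 1: 0.5},
    0: {0: 0.7, 1: 0.2, "E": 0.1},
    1: {0: 0.2, 1: 0.7, "E": 0.1},
}
# EMIT[s][x] = e_{s,x}, using the slides' values 0.25 (cloudy) and 0.75 (sunny).
EMIT = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.25, 1: 0.75}}


def viterbi(x):
    """Return (most likely path, P(x, path)) for a non-empty observation list x."""
    # First column: with v_{0,B} = 1, v_{1,j} = e_{j,x_1} * a_{B,j}.
    v = [{j: EMIT[j][x[0]] * TRANS["B"][j] for j in STATES}]
    back = []                                  # back[i][j] = best predecessor of j
    for xi in x[1:]:
        prev, col, ptr = v[-1], {}, {}
        for j in STATES:
            best = max(STATES, key=lambda k: prev[k] * TRANS[k][j])
            col[j] = EMIT[j][xi] * prev[best] * TRANS[best][j]
            ptr[j] = best
        v.append(col)
        back.append(ptr)
    # Termination: max_k v_{L,k} * a_{k,E}.
    last = max(STATES, key=lambda k: v[-1][k] * TRANS[k]["E"])
    prob = v[-1][last] * TRANS[last]["E"]
    # Traceback through the stored back-pointers.
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, prob


def joint_prob(x, z):
    """P(x, z), used here only to check the Viterbi result."""
    p = TRANS["B"][z[0]]
    for i, (zi, xi) in enumerate(zip(z, x)):
        nxt = z[i + 1] if i + 1 < len(z) else "E"
        p *= EMIT[zi][xi] * TRANS[zi][nxt]
    return p


x = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
path, prob = viterbi(x)
print(path, prob)
```

Products of many probabilities underflow on long sequences, so practical implementations usually replace the products with sums of logarithms; the plain form is kept here to mirror the recurrence.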
12 Example
[Figure: state diagram with states $B$, 0, 1, $E$ and transitions $a_{B,0}$, $a_{B,1}$, $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$, $a_{0,E}$, $a_{1,E}$]
$P(x_i = 1 \mid z_i = 0) = 0.25$
$P(x_i = 1 \mid z_i = 1) = 0.75$
Z = ? ? ? ? ? ? ? ? ? ? ? ?
X = 0 1 0 0 1 1 0 1 0 0 1 0
13 Example
[Figure: state diagram with states $B$, 0, 1, $E$ and transitions $a_{B,0}$, $a_{B,1}$, $a_{0,0}$, $a_{0,1}$, $a_{1,0}$, $a_{1,1}$, $a_{0,E}$, $a_{1,E}$]
$P(x_i = 1 \mid z_i = 0) = 0.25$
$P(x_i = 1 \mid z_i = 1) = 0.75$
Z = 0 0 0 0 1 1 1 1 0 0 0 0
X = 0 1 0 0 1 1 0 1 0 0 1 0
14 Why HMMs Are Cool
• Extremely general and flexible models for sequence modeling
• Efficient tools for parsing sequences
• Also proper probability models: allow maximum likelihood parameter estimation, likelihood ratio tests, etc.
• Inherently modular, accommodating of complexity
• In many cases, strike an ideal balance between simplicity and expressiveness