4/24/09
CSCI1950-Z Computational Methods for Biology, Lecture 20
Ben Raphael, April 15, 2009
http://cs.brown.edu/courses/csci1950-z/
Outline
- Finish aCGH + HMM.
- Introduction to networks.
CGH Analysis
(1) Divide genome into …
[Figure: log2(R/G) ratio plotted against genomic position; dips toward −0.5 mark deletions, rises toward +0.5 mark amplifications.]
The HMM for CGH analysis: K states, one per copy-number class, each with Gaussian emissions N(µk, σk):
1. Homozygous deletion (copy = 0)
2. Heterozygous deletion (copy = 1)
3. Normal (copy = 2)
4. Duplication (copy > 2)
[Figure: copy number vs. genome coordinate]
[Figure: HMM trellis, the K states repeated at each position of the sequence]
Definition: A hidden Markov model (HMM)
Alphabet: Σ = { b1, b2, …, bM }
Transition probabilities: aij = transition prob from state i to state j, with ai1 + … + aiK = 1 for all states i = 1…K
Initial probabilities: a01 + … + a0K = 1
Emission probabilities: ek(b) = P( xi = b | πi = k ), with ek(b1) + … + ek(bM) = 1 for all states k = 1…K
Simply multiply all the orange arrows (transition probs and emission probs): P(x, π) = a0π1 eπ1(x1) · aπ1π2 eπ2(x2) ··· aπN−1πN eπN(xN)
Fair die:   P(1|F) = P(2|F) = … = P(6|F) = 1/6
Loaded die: P(1|L) = P(2|L) = … = P(5|L) = 1/10, P(6|L) = 1/2
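Written out in code, this model is just three probability tables. A minimal sketch (the start distribution A0 is not given on the slides, so uniform is an assumption; the 0.95/0.05 transitions are the values used in the example later):

```python
# Dishonest-casino HMM from the slides, as plain Python dicts.
STATES = ["F", "L"]                        # F = fair die, L = loaded die
A0 = {"F": 0.5, "L": 0.5}                  # start probabilities (assumed uniform)
A = {"F": {"F": 0.95, "L": 0.05},          # transition probabilities
     "L": {"F": 0.05, "L": 0.95}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},                       # fair: uniform
     "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}      # loaded: 6 is likely

# every probability row must sum to 1
for table in (A0, *A.values(), *E.values()):
    assert abs(sum(table.values()) - 1.0) < 1e-12
```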
GIVEN A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs.
Prob ≈ 1.3 × 10⁻³⁵
GIVEN A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs. This is what we want to solve for CGH analysis.
FAIR LOADED FAIR
GIVEN A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs.
Prob(6) = 64%
Given that we end up in state k at step i, maximize the product to the left and to the right.
Input: x = x1…xN
Initialization: V0(0) = 1 (0 is the imaginary first position); Vk(0) = 0, for all k > 0
Iteration: Vj(i) = ej(xi) × maxk akj Vk(i−1);  Ptrj(i) = argmaxk akj Vk(i−1)
Termination: P(x, π*) = maxk Vk(N)
Traceback: πN* = argmaxk Vk(N);  πi−1* = Ptrπi*(i)
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π1 according to probability a0π1
2. Emit letter x1 according to probability eπ1(x1)
3. Go to state π2 according to probability aπ1π2
4. … until emitting xn
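The generation procedure above is a direct loop. A sketch using Python's `random.choices` and the casino model (uniform start distribution assumed):

```python
import random

STATES = ["F", "L"]
A0 = {"F": 0.5, "L": 0.5}                  # assumed uniform start
A = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.05, "L": 0.95}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def generate(n, rng=random):
    """Sample (x, pi): an emitted sequence of length n and its hidden state path."""
    x, pi = [], []
    state = rng.choices(STATES, weights=[A0[k] for k in STATES])[0]
    for _ in range(n):
        pi.append(state)
        faces = sorted(E[state])
        x.append(rng.choices(faces, weights=[E[state][b] for b in faces])[0])
        state = rng.choices(STATES, weights=[A[state][k] for k in STATES])[0]
    return x, pi

x, pi = generate(100)
```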
Given a sequence x, find the most likely path π*.
Example: the dishonest casino. Say x = 12341…23162616364616234112…21341. Most likely path: π = FF……F. However: marked letters are more likely to be L than unmarked letters.
P(box: FFFFFFFFFFF) = (1/6)^11 × 0.95^12 ≈ 2.76 × 10⁻⁹ × 0.54 = 1.49 × 10⁻⁹
P(box: LLLLLLLLLLL) = [ (1/2)^6 × (1/10)^5 ] × 0.95^10 × 0.05^2 ≈ 1.56 × 10⁻⁷ × 1.5 × 10⁻³ = 0.23 × 10⁻⁹
We can compute fk(i) for all k, i, using dynamic programming!
Initialization: f0(0) = 1; fk(0) = 0, for all k > 0
Iteration: fk(i) = ek(xi) Σl fl(i−1) alk
Termination: P(x) = Σk fk(N) ak0, where ak0 is the probability that the terminating state is k (usually = a0k)
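A sketch of the forward recursion on the casino model; as a simplification we assume no modelled end state (ak0 = 1), so P(x) = Σk fk(N):

```python
STATES = ["F", "L"]
A0 = {"F": 0.5, "L": 0.5}                  # assumed uniform start
A = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.05, "L": 0.95}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def forward(x):
    """f[i][k] = P(x1..xi, pi_i = k); returns (f, P(x))."""
    f = [{k: A0[k] * E[k][x[0]] for k in STATES}]
    for obs in x[1:]:
        f.append({k: E[k][obs] * sum(f[-1][j] * A[j][k] for j in STATES)
                  for k in STATES})
    return f, sum(f[-1].values())           # termination, with a_k0 = 1 assumed

# one roll of 6: P(x) = 0.5*(1/6) + 0.5*(1/2) = 1/3
_, px = forward([6])
```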
VITERBI
Initialization: V0(0) = 1; Vk(0) = 0, for all k > 0
Iteration: Vj(i) = ej(xi) maxk Vk(i−1) akj
Termination: P(x, π*) = maxk Vk(N)

FORWARD
Initialization: f0(0) = 1; fk(0) = 0, for all k > 0
Iteration: fl(i) = el(xi) Σk fk(i−1) akl
Termination: P(x) = Σk fk(N) ak0
We want to compute P(πi = k | x), the probability distribution on the ith position, given x.
We start by computing
P(πi = k, x) = P(x1…xi, πi = k, xi+1…xN)
             = P(x1…xi, πi = k) P(xi+1…xN | x1…xi, πi = k)
             = P(x1…xi, πi = k) P(xi+1…xN | πi = k)
Then P(πi = k | x) = P(πi = k, x) / P(x), where the first factor is the forward probability fk(i) and the second is the backward probability bk(i).
Define the backward probability: bk(i) = P(xi+1…xN | πi = k).
We can compute bk(i) for all k, i, using dynamic programming.
Initialization: bk(N) = ak0, for all k
Iteration: bk(i) = Σl el(xi+1) akl bl(i+1)
Termination: P(x) = Σl a0l el(x1) bl(1)
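A matching sketch of the backward recursion on the casino model, with the same no-end-state assumption (so bk(N) = 1 rather than ak0):

```python
STATES = ["F", "L"]
A0 = {"F": 0.5, "L": 0.5}                  # assumed uniform start
A = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.05, "L": 0.95}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def backward(x):
    """b[i][k] = P(x_{i+1}..x_N | pi_i = k)."""
    b = [{k: 1.0 for k in STATES}]          # b_k(N) = 1: no end state assumed
    for obs in reversed(x[1:]):
        b.insert(0, {k: sum(E[j][obs] * A[k][j] * b[0][j] for j in STATES)
                     for k in STATES})
    return b

def prob_x(x):
    """Termination: P(x) = sum_l a0l e_l(x1) b_l(1)."""
    b = backward(x)
    return sum(A0[l] * E[l][x[0]] * b[0][l] for l in STATES)
```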
What is the running time, and space, required for Forward and Backward?
Time: O(K²N); Space: O(KN), where K = number of states and N = length of the sequence.
Useful implementation technique to avoid underflows:
Viterbi: sum of logs
Forward/Backward: rescaling at each position by multiplying by a constant
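The rescaling trick in code: normalize each forward column to sum to 1 and accumulate the logs of the scaling constants; the accumulated sum is exactly log P(x) and never underflows. A sketch on the casino model (uniform start and no end state assumed):

```python
import math

STATES = ["F", "L"]
A0 = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.05, "L": 0.95}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def log_prob_x(x):
    """log P(x) via the forward algorithm with per-position rescaling."""
    col = {k: A0[k] * E[k][x[0]] for k in STATES}
    s = sum(col.values())                   # scaling constant for position 1
    col = {k: v / s for k, v in col.items()}
    log_px = math.log(s)
    for obs in x[1:]:
        col = {k: E[k][obs] * sum(col[j] * A[j][k] for j in STATES)
               for k in STATES}
        s = sum(col.values())               # rescale so the column sums to 1
        col = {k: v / s for k, v in col.items()}
        log_px += math.log(s)
    return log_px

# 10,000 rolls would underflow the plain forward algorithm, but not this one.
```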
Posterior decoding: π̂i = argmaxk P(πi = k | x)
Equivalently, P(πi = k | x) = Σπ P(π | x) · 1(πi = k), where 1(ψ) = 1 if ψ is true, 0 otherwise.
[Figure: the posterior probabilities P(πi = l | x) for each state l across positions x1 x2 x3 … xN]
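Forward and backward combine directly into the posteriors. A sketch on the casino model (uniform start and no end state assumed):

```python
STATES = ["F", "L"]
A0 = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.05, "L": 0.95}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def forward(x):
    f = [{k: A0[k] * E[k][x[0]] for k in STATES}]
    for obs in x[1:]:
        f.append({k: E[k][obs] * sum(f[-1][j] * A[j][k] for j in STATES)
                  for k in STATES})
    return f

def backward(x):
    b = [{k: 1.0 for k in STATES}]          # no end state assumed
    for obs in reversed(x[1:]):
        b.insert(0, {k: sum(E[j][obs] * A[k][j] * b[0][j] for j in STATES)
                     for k in STATES})
    return b

def posterior(x):
    """p[i][k] = P(pi_i = k | x) = f_k(i) * b_k(i) / P(x)."""
    f, b = forward(x), backward(x)
    px = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / px for k in STATES}
            for i in range(len(x))]

post = posterior([6, 6, 6, 6, 6, 1, 2, 3])  # sixes early: loaded is likely there
```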
VITERBI
Initialization: V0(0) = 1; Vk(0) = 0, for all k > 0
Iteration: Vl(i) = el(xi) maxk Vk(i−1) akl
Termination: P(x, π*) = maxk Vk(N)

FORWARD
Initialization: f0(0) = 1; fk(0) = 0, for all k > 0
Iteration: fl(i) = el(xi) Σk fk(i−1) akl
Termination: P(x) = Σk fk(N) ak0

BACKWARD
Initialization: bk(N) = ak0, for all k
Iteration: bl(i) = Σk ek(xi+1) alk bk(i+1)
Termination: P(x) = Σk a0k ek(x1) bk(1)
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the copy numbers
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
Examples:
GIVEN: tumor sample; we don't know the frequency and length of segments
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: Update the parameters θ of the model to maximize P(x|θ)
Given x = x1…xN for which the true π = π1…πN is known, define:
Akl = # times the k→l transition occurs in π
Ek(b) = # times state k in π emits b in x
We can show that the maximum likelihood parameters θ (maximizing P(x|θ)) are:
akl = Akl / Σl′ Akl′   and   ek(b) = Ek(b) / Σb′ Ek(b′)
Intuition: When we know the underlying states, the best estimate is the average frequency of transitions & emissions that occur in the training data.
Drawback: Given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable. 0 probabilities – VERY BAD.
Example: Given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: aFF = 1; aFL = 0
eF(1) = eF(3) = eF(6) = .2; eF(2) = .3; eF(4) = 0; eF(5) = .1
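The counting estimates, and the overfitting they produce, are easy to reproduce. A sketch that recomputes the 10-roll example (states never visited in π are simply left with zero rows):

```python
from collections import Counter

def ml_estimate(x, pi, states=("F", "L"), alphabet=range(1, 7)):
    """Maximum-likelihood transition/emission estimates from labelled data (x, pi)."""
    Akl = Counter(zip(pi, pi[1:]))          # transition counts A_kl
    Ekb = Counter(zip(pi, x))               # emission counts E_k(b)
    a, e = {}, {}
    for k in states:
        ta = sum(Akl[(k, l)] for l in states)
        te = sum(Ekb[(k, b)] for b in alphabet)
        a[k] = {l: Akl[(k, l)] / ta if ta else 0.0 for l in states}
        e[k] = {b: Ekb[(k, b)] / te if te else 0.0 for b in alphabet}
    return a, e

x = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
pi = ["F"] * 10
a, e = ml_estimate(x, pi)
# a["F"]["F"] = 1.0 and e["F"][4] = 0.0: the overfit zero probabilities
```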
Solution for small training sets: Add pseudocounts
Akl = # times k→l transition occurs in π + rkl
Ek(b) = # times state k in π emits b in x + rk(b)
rkl, rk(b) are pseudocounts representing our prior belief
Larger pseudocounts ⇒ strong prior belief
Small pseudocounts (ε < 1): just to avoid 0 probabilities
Example: dishonest casino
We will observe the player for one day: 600 rolls. Reasonable pseudocounts:
rFL = rLF = rFF = rLL = 1
rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair)
rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)
The numbers above are pretty arbitrary – assigning priors is an art.
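Pseudocounts change only the numerator and denominator of the estimate. A sketch applying an emission pseudocount of r = 1 to the 10-roll example from the overfitting slide:

```python
from collections import Counter

def smoothed_emissions(x, pi, state, alphabet=range(1, 7), r=1.0):
    """e_k(b) with pseudocounts: (E_k(b) + r) / sum_b' (E_k(b') + r)."""
    counts = Counter(b for k, b in zip(pi, x) if k == state)
    total = sum(counts[b] + r for b in alphabet)
    return {b: (counts[b] + r) / total for b in alphabet}

x = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
pi = ["F"] * 10
e_F = smoothed_emissions(x, pi, "F")
# e_F[4] is now 1/16 instead of an impossible 0
```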
Starting with our best guess of a model M and parameters θ: given x = x1…xN for which the true π = π1…πN is unknown, we can get to a provably more likely parameter set θ, i.e., a θ that increases the probability P(x | θ). Principle: EXPECTATION MAXIMIZATION
To estimate Akl (conditioning on θCURRENT in all formulas below):
At each position i of sequence x, find the probability that transition k→l is used:
P(πi = k, πi+1 = l | x) = Q / P(x), where
Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
  = P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
  = P(πi+1 = l, xi+1xi+2…xN | πi = k) fk(i)
  = P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
  = bl(i+1) el(xi+1) akl fk(i)
So: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x | θCURRENT)
and Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
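This expected count is the E-step of Baum-Welch. A sketch on the casino model (uniform start and no end state assumed); a good sanity check is that the expected transition counts sum to N − 1:

```python
STATES = ["F", "L"]
A0 = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.95, "L": 0.05},
     "L": {"F": 0.05, "L": 0.95}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def forward(x):
    f = [{k: A0[k] * E[k][x[0]] for k in STATES}]
    for obs in x[1:]:
        f.append({k: E[k][obs] * sum(f[-1][j] * A[j][k] for j in STATES)
                  for k in STATES})
    return f

def backward(x):
    b = [{k: 1.0 for k in STATES}]          # no end state assumed
    for obs in reversed(x[1:]):
        b.insert(0, {k: sum(E[j][obs] * A[k][j] * b[0][j] for j in STATES)
                     for k in STATES})
    return b

def expected_transitions(x):
    """A_kl = sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)."""
    f, b = forward(x), backward(x)
    px = sum(f[-1].values())
    return {k: {l: sum(f[i][k] * A[k][l] * E[l][x[i + 1]] * b[i + 1][l]
                       for i in range(len(x) - 1)) / px
                for l in STATES} for k in STATES}

rolls = [6, 6, 6, 1, 2, 3]
Akl = expected_transitions(rolls)
```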
Initialization: Pick the best-guess model parameters (or arbitrary ones)
Iteration:
1. Forward
2. Backward
3. Calculate Akl, Ek(b), given θCURRENT
4. Calculate new model parameters θNEW: akl, ek(b)
5. Calculate new log-likelihood P(x | θNEW) – guaranteed to be higher by expectation maximization
Until P(x | θ) does not change much
Initialization: Same
Iteration:
1. Perform Viterbi, to find π*
2. Calculate Akl, Ek(b) according to π* + pseudocounts
3. Calculate the new parameters akl, ek(b)
Until convergence
Notes:
– Not guaranteed to increase P(x | θ)
– Guaranteed to increase P(x, π* | θ)
– In general, worse performance than Baum-Welch
The HMM for CGH analysis (recap): K states = copy numbers, with Gaussian emissions N(µk, σk): homozygous deletion (copy = 0), heterozygous deletion (copy = 1), normal (copy = 2), duplication (copy > 2).
[Figure: copy number vs. genome coordinate]
Use the Viterbi algorithm to derive the segmentation; use forward/backward to compute posterior state probabilities.
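With Gaussian emissions, Viterbi changes only in the emission term: a log-density replaces the log of a table entry. A minimal sketch with three illustrative states; the means ±0.5, σ = 0.1, and the transition values are assumptions for illustration, not parameters from the lecture:

```python
import math

# Hypothetical 3-state copy-number model: deletion / normal / duplication.
STATES = ["del", "normal", "dup"]
MU = {"del": -0.5, "normal": 0.0, "dup": 0.5}   # assumed log2-ratio means
SIGMA = 0.1                                      # assumed common std dev
LOG_A0 = {k: math.log(1 / 3) for k in STATES}
LOG_A = {k: {l: math.log(0.99 if k == l else 0.005) for l in STATES}
         for k in STATES}

def log_gauss(x, mu, sigma=SIGMA):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def segment(ratios):
    """Viterbi over Gaussian emissions: one copy-number call per probe."""
    V = {k: LOG_A0[k] + log_gauss(ratios[0], MU[k]) for k in STATES}
    ptrs = []
    for r in ratios[1:]:
        newV, back = {}, {}
        for j in STATES:
            k = max(STATES, key=lambda s: V[s] + LOG_A[s][j])
            back[j] = k
            newV[j] = V[k] + LOG_A[k][j] + log_gauss(r, MU[j])
        V, ptrs = newV, ptrs + [back]
    path = [max(STATES, key=V.get)]
    for back in reversed(ptrs):
        path.append(back[path[-1]])
    return path[::-1]

calls = segment([0.0, 0.01, -0.02, -0.5, -0.48, -0.52, 0.02, 0.0])
# the dip toward -0.5 is called as a deletion segment
```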
Nodes = reactants; Edges = reactions, labeled by the enzyme (protein) that catalyzes the reaction.
[Figure: a small regulatory network: A “activates” B, A “represses” C]
CGH data. Journal of Multivariate Analysis, 2004.