
4/17/09

CSCI1950-Z Computational Methods for Biology, Lecture 19

Ben Raphael, April 13, 2009

http://cs.brown.edu/courses/csci1950-z/

Copy number Aberrations

Cancer cells: extensive gene duplication/deletion
http://www.coriell.org/images/gm10636fish.jpg

Normal cells: red and green probes to regions on the X chromosome; >2 copies of the green region
http://www.imppc.org/


DNA Microarrays

Comparative Genomic Hybridization (CGH)

Measuring Mutations in Cancer


CGH Analysis (1)

Divide genome into segments of equal copy number

[Figure: log2(R/G) (scale −0.5 to 0.5) vs. genomic position, with deletions and amplifications marked]

CGH Analysis (2)

Identify aberrations common to multiple samples

[Figure: Chromosome 3 of 26 lung tumor samples on a mid-density cDNA array, samples on the vertical axis. A common deletion is located in 3p21 and a common amplification in 3q.]


CGH Analysis (1)

Divide genome into segments of equal copy number

Segmentation: divide the copy number profile into segments. Numerous methods (e.g. clustering, hidden Markov models, Bayesian approaches, etc.)

Input: Xi = log2(Ti/Ri), clone i = 1, …, N
Output: an assignment s(Xi) ∈ {S1, …, SK}, where the Si represent copy number states

[Figure: copy number vs. genome coordinate]

An Approach to CGH Segmentation

  • Circular Binary Segmentation (CBS), Olshen et al. 2004
  • Uses a hypothesis test to compare the means of two intervals via a t-test (whiteboard)

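The mean comparison behind CBS can be sketched with a Welch two-sample t statistic. The function and toy data below are illustrative, not the actual CBS implementation of Olshen et al.:

```python
import math

def t_statistic(left, right):
    """Welch two-sample t statistic comparing the mean log-ratio of two
    probe intervals -- the kind of test CBS applies at a candidate breakpoint."""
    n1, n2 = len(left), len(right)
    m1, m2 = sum(left) / n1, sum(right) / n2
    v1 = sum((v - m1) ** 2 for v in left) / (n1 - 1)    # unbiased variances
    v2 = sum((v - m2) ** 2 for v in right) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# toy log-ratios with a mean shift after probe 4
probes = [0.0, 0.1, -0.1, 0.05, 1.0, 1.1, 0.9, 1.05]
t = t_statistic(probes[:4], probes[4:])   # strongly negative: the means differ
```

A large |t| at some split point suggests the two sides belong to different copy number segments.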


Interval Score

Assume:

  • the Xi are independent and normally distributed
  • µ and σ denote the mean and standard deviation of the normal genomic data

Given an interval I spanning k probes, we define its score as:
Lipson et al., J. Computational Biology, 2006
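The score formula itself did not survive extraction. A reconstruction consistent with the assumptions above (and with the interval score of Lipson et al. 2006) is the normalized interval sum:

```latex
S(I) \;=\; \frac{\sum_{i \in I} \left( X_i - \mu \right)}{\sigma \sqrt{k}}
```

Under the null hypothesis of no aberration, the sum of the k probe values is distributed as N(kµ, σ√k), so S(I) ~ N(0, 1) and a large |S(I)| flags an amplification or deletion.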

Significance of Interval Score

Assume:

  • Xi ~ N(µ, σ)

Lipson et al., J. Computational Biology, 2006


The MaxInterval Problem

Input: a vector X = (X1, …, Xn)
Output: an interval I ⊆ [1…n] that maximizes S(I)

Other intervals with high scores may be found by recursively calling this procedure. Exhaustive algorithm: O(n^2)
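A minimal sketch of the exhaustive O(n^2) algorithm, assuming the data are standardized (µ = 0, σ = 1) so the score of an interval is its sum divided by √k; the function and toy data are our own:

```python
import math

def max_interval(x):
    """Exhaustive MaxInterval: O(n^2) scan over all intervals.
    Assumes x is standardized (mu = 0, sigma = 1), so
    S(I) = (sum of entries of I) / sqrt(|I|).
    Returns (best_score, (start, end)) with end exclusive."""
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)      # prefix sums give O(1) interval sums
    best = (float("-inf"), (0, 1))
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n + 1):
            score = (prefix[j] - prefix[i]) / math.sqrt(j - i)
            best = max(best, (score, (i, j)))
    return best

# toy log-ratio vector with an amplified run at positions 2..4
x = [0.1, -0.2, 2.0, 2.2, 1.9, -0.1, 0.0]
s, (i, j) = max_interval(x)                # picks the interval [2, 5)
```

With prefix sums each interval is scored in O(1), so the n(n+1)/2 intervals dominate the running time.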

MaxInterval Algorithm I: LookAhead

Assume we are given:

  • m: an upper bound on the value of a single element Xi
  • t: a lower bound on the maximum score

Solve for the smallest r for which S(I′) may exceed t.

  Interval I = [i, …, i+k−1]: sum s = Σ_{j∈I} Xj, length k
  Extension I′ = [i, …, i+k+r−1]: sum ≤ s + m·r, length k + r

Complexity: expected O(n^1.5) (unproved)
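The LookAhead bound can be sketched as follows, again assuming standardized data (σ = 1) so that extending I by r elements can raise the score to at most (s + m·r)/√(k + r); the linear scan and the function name are our own:

```python
import math

def lookahead_skip(s, k, m, t, r_max=10**6):
    """Smallest r >= 1 such that extending an interval with sum s and length k
    by r elements could possibly reach score t, when no element exceeds m.
    Upper bound used: score(I') <= (s + m*r) / sqrt(k + r)   (sigma = 1).
    Returns None if no r <= r_max suffices."""
    for r in range(1, r_max + 1):
        if (s + m * r) / math.sqrt(k + r) >= t:
            return r
    return None
```

Positions inside the skipped stretch need not be evaluated individually, which is where the empirical O(n^1.5) behavior comes from.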


MaxInterval Algorithm II:

Geometric Family Approximation (GFA)

For ε>0 define the following geometric family of intervals:

[Figure: geometric families of intervals Ω(j1), Ω(j2), Ω(j3), with interval lengths kj and spacing Δj]

Theorem: Let I* be the optimal scoring interval, and let J be the leftmost longest interval of Ω fully contained in I*. Then S(J) ≥ S(I*)/α, where α ∝ (1 − ε^−2)^−1. Complexity: O(n)

Benchmarking

Linear regression suggests that the complexities of the Exhaustive, LookAhead, and GFA algorithms are O(n^2), O(n^1.5), and O(n), respectively.

[Benchmark on synthetic vectors of varying lengths]


Applications: Single Samples

Chromosome 16 of the HCT116 colon carcinoma cell line on a high-density oligo array (n=5,464).

Chromosome 17 of several breast carcinoma cell lines on a mid-density cDNA array (n=364).

[Figure: log2(ratio) vs. position (25, 50, 75 Mbp); highlighted loci include FRA16B and A2BP1 on chromosome 16 and ERBB2 on chromosome 17]

Another Approach to CGH Segmentation

Use a Hidden Markov Model (HMM) to "parse" the sequence of probes into copy number states


Hidden Markov Models

[Figure: HMM trellis — at each position, one of the hidden states 1…K emits the observed symbol; observations x1, x2, x3, …]

Example: The Dishonest Casino

A casino has two dice:

  • Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
  • Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

The casino player switches back-and-forth between the fair and loaded die about once every 20 turns.

Game:
  • 1. You bet $1
  • 2. You roll (always with a fair die)
  • 3. Casino player rolls (maybe with the fair die, maybe with the loaded die)
  • 4. Highest number wins $2

Question # 1 – Evaluation

GIVEN: A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION: How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs.

Prob ≈ 1.3 × 10^-35

Question # 2 – Decoding

GIVEN: A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs. This is what we want to solve for CGH analysis.

FAIR LOADED FAIR


Question # 3 – Learning

GIVEN: A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION: How "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs.

Prob(6) = 64%

Definition of a hidden Markov model

Definition: A hidden Markov model (HMM) consists of:

  • Alphabet Σ = { b1, b2, …, bM }
  • Set of states Q = { 1, …, K }
  • Transition probabilities between any two states: aij = transition probability from state i to state j, with ai1 + … + aiK = 1 for all states i = 1…K
  • Start probabilities a0i, with a01 + … + a0K = 1
  • Emission probabilities within each state: ek(b) = P( xi = b | πi = k ), with ek(b1) + … + ek(bM) = 1 for all states k = 1…K


The dishonest casino model

Transitions: P(Fair → Fair) = P(Loaded → Loaded) = 0.95; P(Fair → Loaded) = P(Loaded → Fair) = 0.05

Emissions:
  Fair:   P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
  Loaded: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2


An HMM is memory-less

At each time step t, the only thing that affects future states is the current state πt:

P(πt+1 = k | "whatever happened so far")
  = P(πt+1 = k | π1, π2, …, πt, x1, x2, …, xt)
  = P(πt+1 = k | πt)


A parse of a sequence

Given a sequence x = x1……xN, a parse of x is a sequence of states π = π1, ……, πN

Likelihood of a Parse

Simply multiply all the transition probabilities and emission probabilities along the path!


Likelihood of a parse

Given a sequence x = x1……xN and a parse π = π1, ……, πN, how likely is the parse (given our HMM)?

P(x, π) = P(x1, …, xN, π1, ……, πN)
        = P(xN, πN | πN−1) P(xN−1, πN−1 | πN−2) …… P(x2, π2 | π1) P(x1, π1)
        = P(xN | πN) P(πN | πN−1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)
        = a0π1 aπ1π2 …… aπN−1πN × eπ1(x1) …… eπN(xN)

A compact way to write a0π1 aπ1π2……aπN−1πN eπ1(x1)……eπN(xN): number all parameters aij and ei(b) — n parameters in total. Example: a0Fair : θ1; a0Loaded : θ2; …; eLoaded(6) : θ18. Then count in (x, π) the number of times each parameter j = 1, …, n occurs:

F(j, x, π) = # of times parameter θj occurs in (x, π)   (the "feature counts")

Then,

P(x, π) = Π_{j=1…n} θj^{F(j, x, π)} = exp[ Σ_{j=1…n} log(θj) × F(j, x, π) ]

Example: the dishonest casino

Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4

Then, what is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (say initial probs a0Fair = ½, a0Loaded = ½)

½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5.2 × 10^-9
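The arithmetic on this and the next slide can be checked in a few lines of Python (a quick sanity check; the variable names are our own):

```python
# All-Fair parse of x = 1,2,1,5,6,2,1,5,2,4:
# 1/2 (start) * (1/6)^10 (ten fair emissions) * 0.95^9 (nine Fair->Fair steps)
p_fair = 0.5 * (1 / 6) ** 10 * 0.95 ** 9

# All-Loaded parse of the same rolls
# (nine rolls that are not a 6 at 1/10 each, one 6 at 1/2):
p_loaded = 0.5 * (1 / 10) ** 9 * (1 / 2) * 0.95 ** 9

ratio = p_fair / p_loaded   # about 33: the all-Fair parse is "somewhat" more likely
```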


Example: the dishonest casino

So, the likelihood that the die was fair for the whole run is ≈ 5.2 × 10^-9. OK, but what is the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?

½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded) = ½ × (1/10)^9 × (1/2)^1 × (0.95)^9 = 0.00000000015756235243 ≈ 1.6 × 10^-10

Therefore, it is somewhat more likely (about 33×) that all the rolls were made with the fair die than that they were all made with the loaded die.

Example: the dishonest casino

Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6

Now, what is the likelihood of π = F, F, …, F? ½ × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9, same as before.

What is the likelihood of π = L, L, …, L? ½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 ≈ 4.9 × 10^-7

So, it is about 100 times more likely that the die is loaded.


HMM Model for CGH data

Fridlyand et al. (2004)

A model for CGH data

K states correspond to copy numbers; emissions are Gaussians:

  • S1: homozygous deletion (copy = 0), emissions ~ N(µ1, σ1)
  • S2: heterozygous deletion (copy = 1), emissions ~ N(µ2, σ2)
  • S3: normal (copy = 2), emissions ~ N(µ3, σ3)
  • S4: duplication (copy > 2), emissions ~ N(µ4, σ4)

[Figure: copy number vs. genome coordinate]


The three main questions on HMMs

  • 1. Evaluation
    GIVEN: HMM M and a sequence x. FIND: Prob[ x | M ]

  • 2. Decoding
    GIVEN: HMM M and a sequence x. FIND: the sequence π of states that maximizes P[ x, π | M ]

  • 3. Learning
    GIVEN: HMM M with unspecified transition/emission probabilities, and a sequence x. FIND: the parameters θ = (ei(.), aij) that maximize P[ x | θ ]

Let's not be confused by notation

P[ x | M ]: the probability that sequence x was generated by the model. The model is: architecture (#states, etc.) + parameters θ = (aij, ei(.)). So P[ x | M ] is the same as P[ x | θ ] and P[ x ] when the architecture and the parameters, respectively, are implied. Similarly, P[ x, π | M ], P[ x, π | θ ], and P[ x, π ] are the same when the architecture and parameters are implied. In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ* that maximizes P[ x | θ ].


Problem 1: Decoding

Find the most likely parse of a sequence

Decoding

GIVEN x = x1x2……xN, find π = π1, ……, πN to maximize P[ x, π ]:

π* = argmax_π P[ x, π ]

This maximizes a0π1 eπ1(x1) aπ1π2 …… aπN−1πN eπN(xN). Dynamic programming!

Vk(i) = max_{π1…πi−1} P[ x1…xi−1, π1, …, πi−1, xi, πi = k ]
      = probability of the most likely sequence of states ending at state πi = k

Given that we end up in state k at step i, maximize the product to the left and right.


Decoding – main idea

Inductive assumption: given that, for all states k and for a fixed position i,

Vk(i) = max_{π1…πi−1} P[ x1…xi−1, π1, …, πi−1, xi, πi = k ]

what is Vl(i+1)? From the definition,

Vl(i+1) = max_{π1…πi} P[ x1…xi, π1, …, πi, xi+1, πi+1 = l ]
        = max_{π1…πi} P(xi+1, πi+1 = l | x1…xi, π1, …, πi) P[ x1…xi, π1, …, πi ]
        = max_{π1…πi} P(xi+1, πi+1 = l | πi) P[ x1…xi−1, π1, …, πi−1, xi, πi ]
        = max_k [ P(xi+1, πi+1 = l | πi = k) max_{π1…πi−1} P[ x1…xi−1, π1, …, πi−1, xi, πi = k ] ]
        = max_k [ P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) Vk(i) ]
        = el(xi+1) max_k [ akl Vk(i) ]

The Viterbi Algorithm

Input: x = x1……xN

Initialization:
  V0(0) = 1   (0 is the imaginary first position)
  Vk(0) = 0, for all k > 0

Iteration:
  Vj(i) = ej(xi) × max_k [ akj Vk(i − 1) ]
  Ptrj(i) = argmax_k [ akj Vk(i − 1) ]

Termination:
  P(x, π*) = max_k Vk(N)

Traceback:
  πN* = argmax_k Vk(N)
  πi−1* = Ptr_{πi*}(i)


The Viterbi Algorithm

Similar to "aligning" a set of states to a sequence. Time: O(K^2 N). Space: O(KN).

[Figure: the K × N dynamic programming matrix Vj(i), states 1…K against positions x1…xN]

Viterbi Algorithm – a practical detail

Underflows are a significant problem:

P[ x1, …, xi, π1, …, πi ] = a0π1 aπ1π2 …… aπi−1πi eπ1(x1) …… eπi(xi)

These numbers become extremely small – underflow. Solution: take the logs of all values:

Vl(i) = log el(xi) + max_k [ Vk(i−1) + log akl ]
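The log-space recurrence above can be turned into a short Viterbi implementation for the dishonest casino (model parameters come from the earlier slides; the code itself is our sketch):

```python
import math

# Dishonest casino HMM (parameters from the slides)
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

def viterbi(x):
    """Most likely state path, computed entirely in log space:
    V_l(i) = log e_l(x_i) + max_k [ V_k(i-1) + log a_kl ]."""
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = []
    for obs in x[1:]:
        row, back = {}, {}
        for l in states:
            best, arg = max((V[-1][k] + math.log(trans[k][l]), k) for k in states)
            row[l] = math.log(emit[l][obs]) + best
            back[l] = arg
        V.append(row)
        ptr.append(back)
    # traceback from the best final state
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return "".join(reversed(path))

rolls = [1, 2, 3, 4, 5] * 3 + [6] * 15
path = viterbi(rolls)
```

On a run that starts 6-free and ends in a burst of 6s, the decoded path flips from F to L, mirroring the FAIR/LOADED answer to the decoding question.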


Example

Let x be a long sequence with a portion of ~1/6 6's, followed by a portion of ~1/2 6's…

x = 123456123456…12345 6626364656…1626364656

Then, it is not hard to show that the optimal parse is (exercise):

FFF…………………...F LLL………………………...L

Six characters "123456", parsed as F, contribute 0.95^6 × (1/6)^6 = 1.6 × 10^-5; parsed as L, they contribute 0.95^6 × (1/2)^1 × (1/10)^5 = 0.4 × 10^-5.
Six characters "162636", parsed as F, contribute 1.6 × 10^-5; parsed as L, they contribute 0.95^6 × (1/2)^3 × (1/10)^3 = 9.0 × 10^-5.

A model for CGH data

K states correspond to copy numbers; emissions are Gaussians:

  • S1: homozygous deletion (copy = 0), emissions ~ N(µ1, σ1)
  • S2: heterozygous deletion (copy = 1), emissions ~ N(µ2, σ2)
  • S3: normal (copy = 2), emissions ~ N(µ3, σ3)
  • S4: duplication (copy > 2), emissions ~ N(µ4, σ4)

[Figure: copy number vs. genome coordinate]

Use the Viterbi algorithm to derive the segmentation; use forward/backward to compute the posterior state probability of a single probe.
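The forward/backward step is not spelled out on the slides; below is a minimal sketch on a toy two-state version of the model (normal vs. deletion). All numeric parameters (µ = 0 vs. −1, σ = 0.3, switch probability 0.1) are illustrative assumptions, not values from Fridlyand et al.:

```python
import math

# Toy two-state CGH HMM: "normal" (copy = 2) vs. a "deletion" state.
# All numeric values below are illustrative assumptions.
states = ["normal", "deletion"]
start = {"normal": 0.5, "deletion": 0.5}
trans = {"normal": {"normal": 0.9, "deletion": 0.1},
         "deletion": {"normal": 0.1, "deletion": 0.9}}
mu = {"normal": 0.0, "deletion": -1.0}   # means of the Gaussian emissions
sigma = 0.3

def gauss(v, m, s):
    return math.exp(-((v - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def posteriors(x):
    """Scaled forward/backward: posterior state probability of each probe."""
    n = len(x)
    fwd, scale = [], []
    for i, obs in enumerate(x):
        f = {}
        for k in states:
            base = start[k] if i == 0 else sum(fwd[-1][j] * trans[j][k] for j in states)
            f[k] = base * gauss(obs, mu[k], sigma)
        c = sum(f.values())                 # per-probe scaling avoids underflow
        fwd.append({k: v / c for k, v in f.items()})
        scale.append(c)
    bwd = [{k: 1.0 for k in states}]
    for i in range(n - 2, -1, -1):
        bwd.insert(0, {k: sum(trans[k][j] * gauss(x[i + 1], mu[j], sigma) * bwd[0][j]
                              for j in states) / scale[i + 1] for k in states})
    post = []
    for f, b in zip(fwd, bwd):
        z = sum(f[k] * b[k] for k in states)
        post.append({k: f[k] * b[k] / z for k in states})
    return post

# log-ratio probes with a dip (candidate deletion) in the middle
probes = [0.05, -0.02, 0.1, -1.1, -0.9, -1.0, 0.0, 0.08]
post = posteriors(probes)
```

Each entry of `post` sums to 1 over the states, giving the per-probe posterior that the Viterbi path alone does not provide.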

Sources

  • Olshen et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 2004.
  • Lipson et al. Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis. Journal of Computational Biology, 2006. (MaxInterval slides)
  • Fridlyand et al. Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis, 2004.
  • http://ai.stanford.edu/~serafim/CS262_2006/ (HMM slides)