

SLIDE 1

A segmentation-clustering problem for the analysis of array CGH data

  • F. Picard, S. Robin, E. Lebarbier, J-J. Daudin

UMR INA-PG / INRA, Paris. Bio-Info-Math Workshop, Tehran, April 2005

SLIDE 2

Microarray CGH technology

  • Known effects of large chromosomal aberrations (e.g. trisomy). → Experimental tool: karyotype (resolution ∼ chromosome).

  • Change of scale: what are the effects of small DNA sequence deletions/amplifications? → Experimental tool: "conventional" CGH (resolution ∼ 10 Mb).

  • CGH = Comparative Genomic Hybridization: a method for the comparative measurement of relative DNA copy numbers between two samples (normal/disease, test/reference). → Application of the microarray technology to CGH: 1997. → Latest generation of chips: resolution ∼ 100 kb.

SLIDE 3

The principle of microarray technology

SLIDE 4

Interpretation of a CGH profile

[Figure: CGH profile (log2 ratio vs position) showing an amplified segment, a deleted segment, and a "normal" segment.]

A dot on the graph represents

  log2 [ ♯ copies of BAC(t) in the test genome / ♯ copies of BAC(t) in the reference genome ].

SLIDE 5

First step of the statistical analysis: break-point detection in a Gaussian signal

  • Let Y = (Y_1, ..., Y_n) be a random process such that Y_t ∼ N(µ_t, σ_t²).
  • Suppose that the parameters of the distribution of the Y_t's are affected by K − 1 abrupt changes at unknown coordinates T = (t_1, ..., t_{K−1}).
  • Those break-points define a partition of the data into K segments of size n_k: I_k = {t : t ∈ ]t_{k−1}, t_k]}, Y^k = {Y_t, t ∈ I_k}.
  • Suppose that those parameters are constant between two changes: ∀ t ∈ I_k, Y_t ∼ N(µ_k, σ_k²).
  • The parameters of this model are T = (t_1, ..., t_{K−1}) and Θ = (θ_1, ..., θ_K), with θ_k = (µ_k, σ_k²).
  • Break-point detection aims at studying the spatial structure of the signal.
SLIDE 6

Estimating the parameters in a model of abrupt-change detection

Log-likelihood:

  L_K(T, Θ) = Σ_{k=1..K} log f(y^k; θ_k) = Σ_{k=1..K} Σ_{t∈I_k} log f(y_t; θ_k)

Estimating the parameters with K fixed, by maximum likelihood:

  • Joint estimation of T and Θ with dynamic programming (sketched below).
  • Necessary property of the likelihood: additivity in K (sum of local likelihoods calculated on each segment).

Model selection: choice of K

  • Penalized likelihood: K̂ = Argmax_K { L̂_K − β × pen(K) }, with pen(K) = 2K.
  • β is adaptively estimated from the data (Lavielle (2003)).
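
To make the dynamic-programming step concrete, here is a minimal sketch of the recursion under the model of slide 5 (our illustration, not the authors' code; the function names and the variance guard are our choices):

```python
# Minimal sketch of break-point detection by dynamic programming
# (our illustration of slides 5-6, not the authors' code).
import numpy as np

def seg_loglik(y):
    """Gaussian log-likelihood of one segment at its MLEs (mu_k, sigma_k^2)."""
    n = len(y)
    var = max(np.var(y), 1e-8)                 # guard against zero variance
    return -0.5 * n * (np.log(2 * np.pi * var) + 1.0)

def segment(y, K_max):
    """Best log-likelihood L[K] and back-pointers for K = 1..K_max segments."""
    n = len(y)
    cost = np.full((n, n + 1), -np.inf)        # cost[i, j]: segment y[i:j]
    for i in range(n):
        for j in range(i + 2, n + 1):          # at least 2 points per segment
            cost[i, j] = seg_loglik(y[i:j])
    L = np.full((K_max + 1, n + 1), -np.inf)
    back = np.zeros((K_max + 1, n + 1), dtype=int)
    L[1] = cost[0]                             # one segment covering y[0:j]
    for K in range(2, K_max + 1):
        for j in range(2 * K, n + 1):
            cand = L[K - 1, :j] + cost[:j, j]  # additivity in K
            back[K, j] = int(np.argmax(cand))
            L[K, j] = cand[back[K, j]]
    return L[:, n], back

def breakpoints(back, K, n):
    """Recover the break-points t_1 < ... < t_{K-1} for K segments."""
    bps, j = [], n
    for k in range(K, 1, -1):
        j = back[k, j]
        bps.append(j)
    return sorted(bps)
```

The model-selection rule then reads K̂ = Argmax_K { L[K] − β × 2K }, with β calibrated adaptively as in Lavielle (2003).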
SLIDE 7

Example of segmentation on array CGH data

[Figure: two segmented CGH profiles, log2 ratio vs genomic position.]

BT474 chromosome 1, K̂ = 5. BT474 chromosome 9, K̂ = 4.

SLIDE 8

Considering the biologists' objective and the need for a new model

[Figure: the same signal under the two views. Left, segmentation (spatial structure of the signal): a structure on the t's only, with one parameter set per segment, θ_k = (µ_k, σ_k²), e.g. (µ_1, σ_1), (µ_2, σ_2), (µ_3, σ_3), (µ_4, σ_4). Right, segmentation/classification: a structure on the t's and on the y's, with segments sharing population parameters θ_p = (m_p, s_p²), so the same population can recur along the signal, e.g. (m_1, s_1), (m_2, s_2), (m_1, s_1).]

SLIDE 9

A new model for segmentation-clustering purposes

  • We suppose there exists a secondary underlying structure of the segments into P populations with weights π_1, ..., π_P (Σ_p π_p = 1).
  • We introduce hidden variables Z_kp, indicators of the population of origin of segment k.
  • Those variables are supposed independent, with multinomial distribution: (Z_k1, ..., Z_kP) ∼ M(1; π_1, ..., π_P).
  • Conditionally on the hidden variables, we know the distribution of Y: Y^k | Z_kp = 1 ∼ N(1_{n_k} m_p, s_p² I_{n_k}).
  • It is a model of segmentation/clustering.
  • The parameters of this model are T = (t_1, ..., t_{K−1}) and Θ = (π_1, ..., π_P; θ_1, ..., θ_P), with θ_p = (m_p, s_p²).

SLIDE 10

Likelihood and statistical units of the model

  • Mixture model of segments (see the sketch below):

    ⋆ the statistical units are the segments Y^k,
    ⋆ the density of Y^k is a mixture density:

      log L_KP(T, Θ) = Σ_{k=1..K} log f(y^k; Θ) = Σ_{k=1..K} log [ Σ_{p=1..P} π_p f(y^k; θ_p) ],

    ⋆ if the Y_t's are independent, we have:

      log L_KP(T, Θ) = Σ_{k=1..K} log [ Σ_{p=1..P} π_p Π_{t∈I_k} f(y_t; θ_p) ].

  • Classical mixture model:

    ⋆ the statistical units are the Y_t's:

      log L_P(Θ) = Σ_{k=1..K} Σ_{t∈I_k} log [ Σ_{p=1..P} π_p f(y_t; θ_p) ].
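
To make the contrast concrete, here is a minimal sketch of both likelihoods in the Gaussian case (our illustration; `bounds` holds the break-point coordinates t_0 = 0 < t_1 < ... < t_K = n, and all names are our assumptions):

```python
# Sketch of the two likelihoods above, Gaussian case (our illustration).
import numpy as np
from scipy.stats import norm

def seg_mixture_loglik(y, bounds, pi, m, s2):
    """log L_KP: the statistical units are the segments y^k = y[t_{k-1}:t_k]."""
    total = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        # log sum_p pi_p * prod_{t in I_k} f(y_t; theta_p), via logsumexp
        lp = [np.log(w) + norm.logpdf(y[a:b], mu, np.sqrt(v)).sum()
              for w, mu, v in zip(pi, m, s2)]
        total += np.logaddexp.reduce(lp)
    return total

def classical_mixture_loglik(y, pi, m, s2):
    """log L_P: the statistical units are the individual y_t."""
    lp = np.stack([np.log(w) + norm.logpdf(y, mu, np.sqrt(v))
                   for w, mu, v in zip(pi, m, s2)])
    return np.logaddexp.reduce(lp, axis=0).sum()
```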

SLIDE 11

A hybrid algorithm for the optimization of the likelihood

Alternate parameter estimation with K and P known:

1. When T is fixed, the EM algorithm estimates Θ:

   Θ̂^(ℓ+1) = Argmax_Θ log L_KP(Θ, T̂^(ℓ)), so that
   log L_KP(Θ̂^(ℓ+1); T̂^(ℓ)) ≥ log L_KP(Θ̂^(ℓ); T̂^(ℓ)).

2. When Θ is fixed, dynamic programming estimates T:

   T̂^(ℓ+1) = Argmax_T log L_KP(Θ̂^(ℓ+1), T), so that
   log L_KP(Θ̂^(ℓ+1); T̂^(ℓ+1)) ≥ log L_KP(Θ̂^(ℓ+1); T̂^(ℓ)).

This yields an increasing sequence of likelihoods (a skeleton of the loop follows):

   log L_KP(Θ̂^(ℓ+1); T̂^(ℓ+1)) ≥ log L_KP(Θ̂^(ℓ); T̂^(ℓ)).
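
The alternation itself reduces to a short loop. A minimal skeleton (our sketch; `em_estimate`, `dp_segment` and `loglik` are placeholders standing for the routines of the neighbouring slides, not a published API):

```python
# Skeleton of the hybrid EM / dynamic-programming algorithm
# (our sketch; the three callables are placeholders, not a published API).
import numpy as np

def hybrid(y, K, P, em_estimate, dp_segment, loglik, n_iter=100, tol=1e-8):
    # arbitrary initialization: K equal-length segments
    T = list(np.linspace(0, len(y), K + 1, dtype=int))
    old = -np.inf
    for _ in range(n_iter):
        theta = em_estimate(y, T, P)   # step 1: T fixed, EM updates Theta
        T = dp_segment(y, theta, K)    # step 2: Theta fixed, DP updates T
        new = loglik(y, T, theta)      # log L_KP can only increase
        if new - old < tol:            # stop when the gain is negligible
            break
        old = new
    return T, theta
```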

SLIDE 12

Mixture model when the segmentation is known

Mixture model parameter estimators:

  • the posterior probability that segment k comes from population p is

      τ̂_kp = π̂_p f(y^k; θ̂_p) / Σ_{ℓ=1..P} π̂_ℓ f(y^k; θ̂_ℓ),

  • the estimator of the mixing proportions is π̂_p = Σ_k τ̂_kp / K,
  • in the Gaussian case, θ_p = (m_p, s_p²):

      m̂_p = [ Σ_k τ̂_kp Σ_{t∈I_k} y_t ] / [ Σ_k τ̂_kp n_k ],
      ŝ_p² = [ Σ_k τ̂_kp Σ_{t∈I_k} (y_t − m̂_p)² ] / [ Σ_k τ̂_kp n_k ].

  • Large segments have a bigger impact on the estimation of the parameters, via the term Σ_k τ̂_kp n_k (see the EM sketch below).
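
These estimators translate directly into one EM iteration. A minimal sketch, Gaussian case (our naming; `bounds` again holds the break-point coordinates t_0 = 0 < ... < t_K = n):

```python
# One EM iteration for the mixture of segments (our sketch of the
# estimators above, Gaussian case).
import numpy as np
from scipy.stats import norm

def em_step(y, bounds, pi, m, s2):
    segs = list(zip(bounds[:-1], bounds[1:]))
    K, P = len(segs), len(pi)
    tau = np.zeros((K, P))
    # E-step: posterior tau_kp of segment k belonging to population p
    for k, (a, b) in enumerate(segs):
        lp = np.array([np.log(pi[p])
                       + norm.logpdf(y[a:b], m[p], np.sqrt(s2[p])).sum()
                       for p in range(P)])
        tau[k] = np.exp(lp - np.logaddexp.reduce(lp))
    # M-step: the estimators above; long segments weigh in via tau_kp * n_k
    n = np.array([b - a for a, b in segs], dtype=float)
    sums = np.array([y[a:b].sum() for a, b in segs])
    pi_new = tau.sum(axis=0) / K
    denom = (tau * n[:, None]).sum(axis=0)            # sum_k tau_kp n_k
    m_new = (tau * sums[:, None]).sum(axis=0) / denom
    s2_new = np.array([
        sum(tau[k, p] * ((y[a:b] - m_new[p]) ** 2).sum()
            for k, (a, b) in enumerate(segs)) / denom[p]
        for p in range(P)])
    return tau, pi_new, m_new, s2_new
```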

SLIDE 13

Influence of segment size on the assignment (MAP)

  • The density of Y^k can be written as follows, with ȳ_k the empirical mean of segment k and v_k its within-segment variance (mean of squares minus squared mean):

      f(y^k; θ_p) = exp{ −(n_k / 2) [ log(2π s_p²) + ( v_k + (ȳ_k − m_p)² ) / s_p² ] }

    ⋆ (ȳ_k − m_p)²: distance of the mean of segment k to population p,
    ⋆ v_k: intra-segment variability of segment k.

  • Large segments are assigned with certainty to the closest population p_0, while very small segments fall back on the prior weights (a toy numerical check of these limits follows):

      lim_{n_k→∞} τ_kp_0 = 1,   lim_{n_k→∞} τ_kp = 0 for p ≠ p_0,
      lim_{n_k→0} τ_kp_0 = π_p_0,   lim_{n_k→0} τ_kp = π_p.
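
Both limits are easy to check numerically. A toy illustration with made-up values (not data from the talk):

```python
# Toy check of the two limits (our illustration, arbitrary values):
# the posterior of a segment concentrates on the closest population
# as n_k grows, and stays near the prior pi_p when n_k is small.
import numpy as np
from scipy.stats import norm

pi = np.array([0.5, 0.5])
m, s = np.array([0.0, 1.0]), np.array([1.0, 1.0])
rng = np.random.default_rng(0)
for n_k in (1, 10, 100, 1000):
    y = rng.normal(0.2, 1.0, n_k)     # segment truly closest to m_p0 = 0
    lp = np.log(pi) + np.array([norm.logpdf(y, mu, sd).sum()
                                for mu, sd in zip(m, s)])
    tau = np.exp(lp - np.logaddexp.reduce(lp))
    print(n_k, tau.round(3))          # tau -> (1, 0) as n_k grows
```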

SLIDE 14

Segmentation with a fixed mixture: back to dynamic programming

  • The incomplete mixture log-likelihood can be written as a sum of local log-likelihoods:

      log L_KP(T, Θ) = Σ_k ℓ_kP(y^k; Θ).

  • The local log-likelihood of segment k corresponds to the mixture log-density of vector Y^k:

      ℓ_kP(y^k; Θ) = log [ Σ_{p=1..P} π_p Π_{t∈I_k} f(y_t; θ_p) ].

  • log L_KP(T, Θ) can therefore be optimized in T, with Θ fixed, by dynamic programming (see the sketch below).
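
Concretely, the only change with respect to the slide-6 recursion is the cost function. A minimal sketch of ℓ_kP in the Gaussian case (our illustration):

```python
# Local log-likelihood l_kP used as the dynamic-programming cost
# when Theta is fixed (our sketch, Gaussian case).
import numpy as np
from scipy.stats import norm

def local_loglik(y_seg, pi, m, s2):
    """l_kP(y^k; Theta) = log sum_p pi_p prod_{t in I_k} f(y_t; theta_p)."""
    lp = [np.log(w) + norm.logpdf(y_seg, mu, np.sqrt(v)).sum()
          for w, mu, v in zip(pi, m, s2)]
    return np.logaddexp.reduce(lp)

# Plugging cost[i, j] = local_loglik(y[i:j], pi, m, s2) into the
# slide-6 recursion then maximizes log L_KP in T with Theta fixed.
```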
SLIDE 15

A decreasing log-likelihood?

[Figure: a simulated signal and the evolution of the incomplete log-likelihood with respect to the number of segments, for f(y^k; Θ) = 0.5 N(0, 1) + 0.5 N(5, 1).]

SLIDE 16

What is going on?

[Figure: segmentations of the same simulated signal for three increasing values of K.]

Once the true number of segments (here 6) is reached, additional break-points cut existing segments at their edges.

SLIDE 17

Explaining the behavior of the likelihood

Optimization of the incomplete likelihood with dynamic programming:

  log L_KP(T; Θ) = Q_KP(T; Θ) − H_KP(T; Θ),
  Q_KP(T; Θ) = Σ_k Σ_p τ_kp log(π_p) + Σ_k Σ_p τ_kp log f(y^k; θ_p),
  H_KP(T; Θ) = Σ_k Σ_p τ_kp log(τ_kp).

Hypotheses:

1. We suppose that the true number of segments is K* and that the partitions are nested for K ≥ K*.

   ⋆ Segment Y^K is cut into (Y^K_1, Y^K_2): f(Y^K; θ_p) = f(Y^K_1; θ_p) × f(Y^K_2; θ_p).

2. We suppose that if Y^K belongs to population p, then (Y^K_1, Y^K_2) belong to p as well:

   ⋆ τ_p(Y^K) ≃ τ_p(Y^K_1) ≃ τ_p(Y^K_2) ≃ τ_p.

SLIDE 18

An intrinsic penalty

Under hypotheses 1-2, for all K ≥ K*:

  log L̂_{K+1,P} − log L̂_{K,P} ≃ Σ_p π̂_p log(π̂_p) − Σ_p τ̂_p log(τ̂_p) ≤ 0.

The log-likelihood is decomposed into two terms:

  • a term of fit that increases with K and is constant from a certain K* on (nested partitions):

      Σ_k Σ_p τ̂_kp log f(y^k; θ̂_p);

  • a term of differences of entropies that decreases with K and plays the role of a penalty for the choice of K:

      K Σ_p π̂_p log(π̂_p) − Σ_k Σ_p τ̂_kp log(τ̂_kp).

Choosing the number of segments K when P is fixed can therefore be done with a penalized likelihood (a sketch of the decomposition follows).
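
Given the fitted posteriors, the decomposition is immediate to compute. A minimal sketch (our illustration; it uses the MLE π̂_p = Σ_k τ̂_kp / K, under which the fit and entropy terms sum exactly to log L_KP):

```python
# Sketch of the fit / entropy-difference decomposition above
# (our illustration; logf[k, p] = log f(y^k; theta_p)).
import numpy as np

def decompose(tau, logf):
    K = tau.shape[0]
    pi_hat = tau.sum(axis=0) / K                   # MLE of the weights
    fit = (tau * logf).sum()                       # increases with K
    entropy_term = (K * (pi_hat * np.log(pi_hat)).sum()
                    - (tau * np.log(np.clip(tau, 1e-300, None))).sum())
    return fit, entropy_term      # log L_KP = fit + entropy_term at the
                                  # exact posteriors tau
```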

SLIDE 19

Incomplete Likelihood behavior with respect to the number of segments

[Figure: incomplete log-likelihood vs number of segments K, for P = 2, ..., 6.]

The incomplete log-likelihood decreases from K = 8 on:

  log L̂_KP(T̂; Θ̂) = Σ_k log [ Σ_p π̂_p f(y^k; θ̂_p) ].
SLIDE 20

Decomposition of the log-likelihood

[Figure: the two terms plotted against the number of segments K, for P = 2, ..., 6.]

  Term of fit: Σ_k Σ_p τ̂_kp log f(y^k; θ̂_p).
  Differences of entropies: K Σ_p π̂_p log(π̂_p) − Σ_k Σ_p τ̂_kp log(τ̂_kp).

SLIDE 21

Resulting clusters

[Figure: resulting clusters on a CGH profile, log2 ratio vs genomic position.]

Segmentation/clustering: P = 3, K = 8. Segmentation alone: K = 5.

SLIDE 22

Resulting clusters

[Figure: resulting clusters on the same CGH profile.]

Segmentation/clustering: P = 4, K = 8. Segmentation alone: K = 5.

SLIDE 23

Perspective: simultaneous choice of K and P

[Figure: incomplete log-likelihood as a function of K and P.]

SLIDE 24

This is the end

Conclusions:

  • Definition of a new model that takes into account the a priori knowledge we have about the biological phenomena under study.
  • Development of a hybrid algorithm (EM / dynamic programming) for the parameter estimation (problems linked to EM: initialization, local maxima, degeneracy).
  • Still waiting for another data set to assess the performance of the clustering.

Perspectives:

  • Modeling:
    ⋆ comparison with hidden Markov models.
  • Model choice:
    ⋆ develop an adaptive procedure for two components.
  • Other application fields:
    ⋆ DNA sequences (in progress).