A segmentation-clustering problem for the analysis of array CGH data


  1. A segmentation-clustering problem for the analysis of array CGH data
F. Picard, S. Robin, E. Lebarbier, J-J. Daudin (UMR INA-PG / INRA, Paris)
Bio-Info-Math Workshop, Tehran, April 2005

  2. Microarray CGH technology
- The effects of large chromosomal aberrations (e.g. trisomy) are known. Experimental tool: karyotype (resolution ∼ chromosome).
- Change of scale: what are the effects of small-size DNA sequence deletions/amplifications? Experimental tool: "conventional" CGH (resolution ∼ 10 Mb).
- CGH = Comparative Genomic Hybridization: a method for the comparative measurement of relative DNA copy numbers between two samples (normal/disease, test/reference).
- The microarray technology was first applied to CGH in 1997; the latest generation of chips reaches a resolution of ∼ 100 kb.

  3. The principle of microarray technology

  4. Interpretation of a CGH profile
[Figure: CGH profile, log2 ratio versus position, showing an amplified segment, a "normal" segment, and a deleted segment.]
A dot on the graph represents
$$\log_2 \left( \frac{\#\,\text{copies of BAC}(t)\ \text{in the test genome}}{\#\,\text{copies of BAC}(t)\ \text{in the reference genome}} \right)$$

  5. First step of the statistical analysis: break-point detection in a Gaussian signal
- $Y = (Y_1, \dots, Y_n)$ is a random process such that $Y_t \sim \mathcal{N}(\mu_t, \sigma_t^2)$.
- Suppose that the parameters of the distribution of the $Y_t$ are affected by $K-1$ abrupt changes at unknown coordinates $T = (t_1, \dots, t_{K-1})$.
- Those break-points define a partition of the data into $K$ segments of size $n_k$: $I_k = \{t : t \in \,]t_{k-1}, t_k]\}$, $Y^k = \{Y_t : t \in I_k\}$.
- Suppose that those parameters are constant between two changes: $\forall t \in I_k$, $Y_t \sim \mathcal{N}(\mu_k, \sigma_k^2)$.
- The parameters of this model are $T = (t_1, \dots, t_{K-1})$ and $\Theta = (\theta_1, \dots, \theta_K)$, with $\theta_k = (\mu_k, \sigma_k^2)$.
- Break-point detection aims at studying the spatial structure of the signal.
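
As a concrete illustration, the sketch below simulates such a piecewise Gaussian signal in Python; the break-point coordinates and the per-segment parameters are our own illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: n = 100 points, K = 3 segments, so T = (t_1, t_2) = (40, 70).
bounds = [0, 40, 70, 100]     # segment boundaries t_0 < t_1 < ... < t_K
mu     = [0.0, 1.0, -0.8]     # mu_k, constant within each segment
sigma  = [0.2, 0.3, 0.2]      # sigma_k, constant within each segment

# Y_t ~ N(mu_k, sigma_k^2) for t in I_k = ]t_{k-1}, t_k]
y = np.concatenate([rng.normal(m, s, b - a)
                    for a, b, m, s in zip(bounds[:-1], bounds[1:], mu, sigma)])
```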

  6. Estimating the parameters in a model of abrupt-change detection
Log-likelihood:
$$\mathcal{L}_K(T, \Theta) = \sum_{k=1}^{K} \log f(y^k; \theta_k) = \sum_{k=1}^{K} \sum_{t \in I_k} \log f(y_t; \theta_k)$$
Estimating the parameters with $K$ fixed, by maximum likelihood:
- Joint estimation of $T$ and $\Theta$ with dynamic programming.
- Necessary property of the likelihood: additivity in $K$ (sum of local likelihoods calculated on each segment).
Model selection, choice of $K$:
- Penalized likelihood: $\hat{K} = \underset{K}{\text{Argmax}} \left( \hat{\mathcal{L}}_K - \beta \times \text{pen}(K) \right)$, with $\text{pen}(K) = 2K$.
- $\beta$ is estimated adaptively from the data (Lavielle (2003)).
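
A minimal sketch of this estimation scheme, assuming Gaussian segments as above. The quadratic cost table, the variance floor, and the fixed β are simplifications for illustration; as the slide notes, β is in fact estimated adaptively (Lavielle (2003)).

```python
import numpy as np

def neg_loglik(seg):
    # -log L of one Gaussian segment at its MLE (mean, variance);
    # the variance floor is a guard for length-1 segments (our implementation choice).
    n, var = len(seg), max(np.var(seg), 1e-6)
    return 0.5 * n * (np.log(2 * np.pi * var) + 1)

def segment(y, K_max, beta=1.0):
    """Break-point detection: dynamic programming over all partitions (valid
    because the likelihood is additive over segments), then choice of K by
    penalized likelihood with pen(K) = 2K."""
    n = len(y)
    cost = {(i, j): neg_loglik(y[i:j]) for i in range(n) for j in range(i + 1, n + 1)}
    # D[k][j]: best cost of splitting y[:j] into k segments; arg[k][j]: last break-point
    D, arg = [{0: 0.0}], [{}]
    for k in range(1, K_max + 1):
        Dk, argk = {}, {}
        for j in range(k, n + 1):
            Dk[j], argk[j] = min((D[k - 1][i] + cost[i, j], i)
                                 for i in D[k - 1] if i < j)
        D.append(Dk); arg.append(argk)
    K_hat = min(range(1, K_max + 1), key=lambda k: D[k][n] + beta * 2 * k)
    bps, j = [], n                       # backtrack the estimated break-points
    for k in range(K_hat, 0, -1):
        bps.append(j); j = arg[k][j]
    return K_hat, sorted(bps)[:-1]       # drop the trivial end-point n
```

On the signal simulated in the previous sketch, `segment(y, K_max=6)` should recover three segments with break-points near 40 and 70 (a sanity check, not a guarantee).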

  7. Example of segmentation on array CGH data
[Figure: two segmented CGH profiles, log2 ratio versus genomic position. Left: BT474 chromosome 1, $\hat{K} = 5$. Right: BT474 chromosome 9, $\hat{K} = 4$.]

  8. Considering the biologists' objective and the need for a new model
[Figure: two schematic profiles. Left: segmentation captures the spatial structure of the signal (a structure on the $t$), with one parameter $\theta_k = (\mu_k, \sigma_k^2)$ per segment. Right: segmentation/clustering adds a structure on the $y$, with population parameters $\theta_p = (m_p, s_p^2)$ shared across segments.]

  9. A new model for segmentation-clustering purposes
- We suppose there exists a secondary underlying structure of the segments into $P$ populations with weights $\pi_1, \dots, \pi_P$ ($\sum_p \pi_p = 1$).
- We introduce hidden variables $Z_{kp}$, indicators of the population of origin of segment $k$.
- Those variables are supposed independent, with multinomial distribution: $(Z_{k1}, \dots, Z_{kP}) \sim \mathcal{M}(1; \pi_1, \dots, \pi_P)$.
- Conditionally on the hidden variables, we know the distribution of $Y$: $Y^k \mid Z_{kp} = 1 \sim \mathcal{N}(m_p \mathbb{1}_{n_k}, s_p^2 I_{n_k})$.
- It is a model of segmentation/clustering.
- The parameters of this model are $T = (t_1, \dots, t_{K-1})$ and $\Theta = (\pi_1, \dots, \pi_P; \theta_1, \dots, \theta_P)$, with $\theta_p = (m_p, s_p^2)$.
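
A minimal simulation of this two-level model; $P$, the weights, the population parameters, and the segment sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: K = 5 segments, P = 2 populations ("normal" / "altered").
pi        = [0.6, 0.4]                  # population weights, sum to 1
theta     = [(0.0, 0.05), (1.0, 0.05)]  # theta_p = (m_p, s_p^2)
seg_sizes = [20, 10, 30, 15, 25]        # n_k

# (Z_k1, ..., Z_kP) ~ M(1; pi): hidden population of origin of each segment
z = rng.choice(len(pi), size=len(seg_sizes), p=pi)

# Y^k | Z_kp = 1 ~ N(m_p 1, s_p^2 I): constant mean within a segment
y = np.concatenate([rng.normal(theta[p][0], np.sqrt(theta[p][1]), n)
                    for p, n in zip(z, seg_sizes)])
```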

  10. Likelihood and statistical units of the model
- Mixture model of segments:
  ⋆ the statistical units are the segments $Y^k$,
  ⋆ the density of $Y^k$ is a mixture density:
$$\log \mathcal{L}_{KP}(T, \Theta) = \sum_{k=1}^{K} \log f(y^k; \Theta) = \sum_{k=1}^{K} \log \left( \sum_{p=1}^{P} \pi_p f(y^k; \theta_p) \right)$$
  ⋆ if the $Y_t$ are independent, we have:
$$\log \mathcal{L}_{KP}(T, \Theta) = \sum_{k=1}^{K} \log \left( \sum_{p=1}^{P} \pi_p \prod_{t \in I_k} f(y_t; \theta_p) \right)$$
- Classical mixture model:
  ⋆ the statistical units are the $Y_t$:
$$\log \mathcal{L}_P(\Theta) = \sum_{k=1}^{K} \sum_{t \in I_k} \log \left( \sum_{p=1}^{P} \pi_p f(y_t; \theta_p) \right)$$
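
The contrast is simply where the logarithm sits relative to the sum over populations. A sketch of both likelihoods, assuming Gaussian densities and a segmentation passed as (start, end) pairs; the function names are ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def mixture_of_segments_loglik(y, seg, pi, theta):
    # log L_KP(T, Theta) = sum_k log sum_p pi_p * prod_{t in I_k} f(y_t; theta_p)
    total = 0.0
    for a, b in seg:
        log_fk = [norm.logpdf(y[a:b], m, np.sqrt(s2)).sum() for m, s2 in theta]
        total += logsumexp(np.log(pi) + np.array(log_fk))
    return total

def classical_mixture_loglik(y, pi, theta):
    # log L_P(Theta) = sum_t log sum_p pi_p * f(y_t; theta_p):
    # the units are the Y_t, so the segmentation plays no role here.
    log_ft = np.array([np.log(w) + norm.logpdf(y, m, np.sqrt(s2))
                       for w, (m, s2) in zip(pi, theta)])
    return logsumexp(log_ft, axis=0).sum()
```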

  11. A hybrid algorithm for the optimization of the likelihood
Alternate parameter estimation with $K$ and $P$ known:
1. When $T$ is fixed, the EM algorithm estimates $\Theta$:
$$\hat{\Theta}^{(\ell+1)} = \underset{\Theta}{\text{Argmax}} \ \log \mathcal{L}_{KP}(\Theta, \hat{T}^{(\ell)}), \quad \text{so that} \quad \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}; \hat{T}^{(\ell)}) \geq \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell)}; \hat{T}^{(\ell)})$$
2. When $\Theta$ is fixed, dynamic programming estimates $T$:
$$\hat{T}^{(\ell+1)} = \underset{T}{\text{Argmax}} \ \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}, T), \quad \text{so that} \quad \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}; \hat{T}^{(\ell+1)}) \geq \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}; \hat{T}^{(\ell)})$$
This yields an increasing sequence of likelihoods:
$$\log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell+1)}; \hat{T}^{(\ell+1)}) \geq \log \mathcal{L}_{KP}(\hat{\Theta}^{(\ell)}; \hat{T}^{(\ell)})$$
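
A compact sketch of the whole alternation, under the Gaussian model above. The even-split initialization, the single EM pass per iteration, and the quadratic table of local likelihoods are our own simplifications; the slides do not specify the authors' implementation details.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def seg_logdens(y, seg, theta):
    """log f(y^k; theta_p) for each segment k (rows) and population p (columns)."""
    return np.array([[norm.logpdf(y[a:b], m, np.sqrt(s2)).sum() for m, s2 in theta]
                     for a, b in seg])

def em_step(y, seg, pi, theta):
    """One EM pass for Theta = (pi, theta) with the segmentation T fixed."""
    lp = seg_logdens(y, seg, theta) + np.log(pi)                # (K, P)
    tau = np.exp(lp - logsumexp(lp, axis=1, keepdims=True))     # posterior tau_kp
    n_k = np.array([b - a for a, b in seg], dtype=float)
    sums = np.array([y[a:b].sum() for a, b in seg])
    w = np.maximum((tau * n_k[:, None]).sum(axis=0), 1e-12)     # sum_k tau_kp n_k
    m = (tau * sums[:, None]).sum(axis=0) / w
    sq = np.array([[((y[a:b] - mp) ** 2).sum() for mp in m] for a, b in seg])
    s2 = np.maximum((tau * sq).sum(axis=0) / w, 1e-6)           # degeneracy guard
    return tau.mean(axis=0), list(zip(m, s2))                   # (pi_hat, theta_hat)

def dp_step(y, K, pi, theta):
    """Dynamic programming for T with Theta fixed: maximize sum_k l_kP(y^k)."""
    n, logpi = len(y), np.log(pi)
    l = np.full((n, n + 1), -np.inf)          # local mixture log-lik of y[i:j]
    for i in range(n):
        for j in range(i + 1, n + 1):
            lf = [norm.logpdf(y[i:j], m, np.sqrt(s2)).sum() for m, s2 in theta]
            l[i, j] = logsumexp(logpi + np.array(lf))
    D = np.full((K + 1, n + 1), -np.inf); D[0, 0] = 0.0
    arg = np.zeros((K + 1, n + 1), dtype=int)
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            cand = D[k - 1, :j] + l[:j, j]
            arg[k, j] = int(np.argmax(cand)); D[k, j] = cand[arg[k, j]]
    seg, j = [], n                            # backtrack the segments
    for k in range(K, 0, -1):
        i = arg[k, j]; seg.append((i, j)); j = i
    return seg[::-1]

def hybrid(y, K, P, n_iter=10):
    seg = [(i * len(y) // K, (i + 1) * len(y) // K) for i in range(K)]  # even init
    pi = np.full(P, 1.0 / P)
    theta = list(zip(np.linspace(y.min(), y.max(), P), np.full(P, y.var())))
    for _ in range(n_iter):                   # neither step can decrease log L_KP
        pi, theta = em_step(y, seg, pi, theta)    # step 1: EM with T fixed
        seg = dp_step(y, K, pi, theta)            # step 2: DP with Theta fixed
    return seg, pi, theta
```

On data simulated from the slide-9 sketch, `hybrid(y, K=5, P=2)` alternates exactly the two steps of the slide; the EM pass cannot decrease the likelihood (standard EM property) and the DP pass maximizes it exactly in $T$.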

  12. Mixture model when the segmentation is known
Mixture model parameter estimators:
$$\hat{\tau}_{kp} = \frac{\hat{\pi}_p f(y^k; \hat{\theta}_p)}{\sum_{\ell=1}^{P} \hat{\pi}_\ell f(y^k; \hat{\theta}_\ell)}$$
- The estimator of the mixing proportions is $\hat{\pi}_p = \sum_k \hat{\tau}_{kp} / K$.
- In the Gaussian case, $\theta_p = (m_p, s_p^2)$:
$$\hat{m}_p = \frac{\sum_k \hat{\tau}_{kp} \sum_{t \in I_k} y_t}{\sum_k \hat{\tau}_{kp} n_k}, \qquad \hat{s}_p^2 = \frac{\sum_k \hat{\tau}_{kp} \sum_{t \in I_k} (y_t - \hat{m}_p)^2}{\sum_k \hat{\tau}_{kp} n_k}$$
- Large segments have a bigger impact on the parameter estimates, via the term $\sum_k \hat{\tau}_{kp} n_k$.
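
A small numeric illustration of the size-weighting remark: with hypothetical per-segment summaries and posteriors $\hat{\tau}$, the estimates are driven by the weights $\hat{\tau}_{kp} n_k$.

```python
import numpy as np

# Hypothetical E-step output for K = 3 segments and P = 2 populations:
# segment sizes n_k, per-segment sums of y, and posterior probabilities tau_kp.
n_k  = np.array([20.0, 5.0, 40.0])
sums = np.array([1.2, 4.8, 39.5])              # sum_{t in I_k} y_t
tau  = np.array([[0.90, 0.10],
                 [0.40, 0.60],
                 [0.05, 0.95]])

pi_hat = tau.mean(axis=0)                      # sum_k tau_kp / K
w = (tau * n_k[:, None]).sum(axis=0)           # sum_k tau_kp n_k
m_hat = (tau * sums[:, None]).sum(axis=0) / w  # weighted mean estimator

print(pi_hat)   # [0.45 0.55]
print(m_hat)    # the 40-point segment dominates the estimate of m_2
```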

  13. Influence of the vector size on the assignment (MAP)
- The density of $Y^k$ can be written as follows:
$$f(y^k; \theta_p) = \exp \left\{ -\frac{n_k}{2} \left[ \log(2\pi s_p^2) + \frac{(\overline{y^2}_k - \bar{y}_k^2) + (\bar{y}_k - m_p)^2}{s_p^2} \right] \right\}$$
  ⋆ $(\bar{y}_k - m_p)^2$: distance of the mean of vector $k$ to population $p$,
  ⋆ $(\overline{y^2}_k - \bar{y}_k^2)$: intra-vector variability of vector $k$.
- Large vectors are assigned with certainty to the closest population $p_0$, while the posteriors of very short vectors fall back to the prior weights:
$$\lim_{n_k \to \infty} \tau_{kp_0} = 1, \quad \lim_{n_k \to \infty} \tau_{kp} = 0, \qquad \lim_{n_k \to 0} \tau_{kp_0} = \pi_{p_0}, \quad \lim_{n_k \to 0} \tau_{kp} = \pi_p$$
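
The two limits can be checked numerically. In this sketch (illustrative populations and data, names ours), the posterior $\tau$ of a segment drifts from the prior weights to a hard assignment as $n_k$ grows.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Two illustrative populations and equal prior weights (assumptions of the sketch).
pi = np.array([0.5, 0.5])
theta = [(0.0, 1.0), (1.0, 1.0)]       # (m_p, s_p^2)

def tau(y_seg):
    """Posterior (tau_k1, tau_k2) of one segment under the mixture."""
    lp = np.array([np.log(w) + norm.logpdf(y_seg, m, np.sqrt(s2)).sum()
                   for w, (m, s2) in zip(pi, theta)])
    return np.exp(lp - logsumexp(lp))

rng = np.random.default_rng(2)
for n_k in (1, 10, 100, 1000):
    print(n_k, tau(rng.normal(0.3, 1.0, n_k)))   # data closer to population 1
# As n_k grows, tau tends to (1, 0): a hard assignment to the closest population.
# For tiny n_k, tau stays near the prior weights (pi_1, pi_2).
```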

  14. Segmentation with a fixed mixture
Back to dynamic programming:
- The incomplete mixture log-likelihood can be written as a sum of local log-likelihoods:
$$\log \mathcal{L}_{KP}(T, \Theta) = \sum_k \ell_{kP}(y^k; \Theta)$$
- The local log-likelihood of segment $k$ corresponds to the mixture log-density of vector $Y^k$:
$$\ell_{kP}(y^k; \Theta) = \log \left( \sum_{p=1}^{P} \pi_p \prod_{t \in I_k} f(y_t; \theta_p) \right)$$
- $\log \mathcal{L}_{KP}(T, \Theta)$ can therefore be optimized in $T$ with $\Theta$ fixed, by dynamic programming.

  15. A decreasing log-likelihood?
[Figure: simulated signal (top) and incomplete log-likelihood as a function of the number of segments (bottom).]
Evolution of the incomplete log-likelihood with respect to the number of segments, for $f(y^k; \Theta) = 0.5\,\mathcal{N}(0,1) + 0.5\,\mathcal{N}(5,1)$.

  16. What is going on?
[Figure: the same simulated signal segmented with increasing numbers of segments.]
When the true number of segments (6) is reached, additional segments are cut at the edges of existing ones.

  17. Explaining the behavior of the likelihood
Optimization of the incomplete likelihood with dynamic programming:
$$\log \mathcal{L}_{KP}(T; \Theta) = Q_{KP}(T; \Theta) - H_{KP}(T; \Theta)$$
$$Q_{KP}(T; \Theta) = \sum_k \sum_p \tau_{kp} \log(\pi_p) + \sum_k \sum_p \tau_{kp} \log f(y^k; \theta_p)$$
$$H_{KP}(T; \Theta) = \sum_k \sum_p \tau_{kp} \log \tau_{kp}$$
Hypotheses:
1. We suppose that the true number of segments is $K^*$ and that the partitions are nested for $K \geq K^*$.
  ⋆ Segment $Y^K$ is cut into $(Y^{K_1}, Y^{K_2})$: $f(Y^K; \theta_p) = f(Y^{K_1}; \theta_p) \times f(Y^{K_2}; \theta_p)$.
2. We suppose that if $Y^K \in p$ then $(Y^{K_1}, Y^{K_2}) \in p$: $\tau_p(Y^K) \simeq \tau_p(Y^{K_1}) \simeq \tau_p(Y^{K_2}) \simeq \tau_p$.

  18. An intrinsic penalty
Under hypotheses 1-2, for all $K \geq K^*$:
$$\log \hat{\mathcal{L}}_{(K+1),P} - \log \hat{\mathcal{L}}_{(K),P} \simeq \sum_p \hat{\pi}_p \log(\hat{\pi}_p) - \sum_p \hat{\tau}_p \log(\hat{\tau}_p) \leq 0$$
The log-likelihood is decomposed into two terms:
- A term of fit that increases with $K$ and is constant from a certain $K^*$ onwards (nested partitions):
$$\sum_k \sum_p \hat{\tau}_{kp} \log f(y^k; \hat{\theta}_p)$$
- A term of differences of entropies that decreases with $K$ and plays the role of a penalty for the choice of $K$:
$$K \sum_p \hat{\pi}_p \log(\hat{\pi}_p) - \sum_k \sum_p \hat{\tau}_{kp} \log \hat{\tau}_{kp}$$
Choosing the number of segments $K$ when $P$ is fixed can thus be done with a penalized likelihood.
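
The penalty can be made concrete: under hypothesis 2, splitting a segment duplicates its row of $\hat{\tau}$, which leaves the fit term unchanged but lowers the entropy-difference term. A small numeric sketch with an illustrative $\hat{\tau}$ matrix (the helper name is ours):

```python
import numpy as np

def entropy_terms(tau):
    """K * sum_p pi_hat_p log pi_hat_p  -  sum_k sum_p tau_kp log tau_kp."""
    K = len(tau)
    pi_hat = tau.mean(axis=0)
    return K * (pi_hat * np.log(pi_hat)).sum() - (tau * np.log(tau)).sum()

tau = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])   # illustrative tau_kp
split = np.vstack([tau, tau[-1]])   # hypothesis 2: both halves keep the parent's tau

print(entropy_terms(tau), entropy_terms(split))
# The term decreases when a segment is split: an intrinsic penalty on K.
```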

  19. Incomplete likelihood behavior with respect to the number of segments
[Figure: incomplete log-likelihood $\log \hat{\mathcal{L}}_{KP}(\hat{T}; \hat{\Theta}) = \sum_k \log \left( \sum_p \hat{\pi}_p f(y^k; \hat{\theta}_p) \right)$ as a function of the number of segments, for $P = 2, \dots, 6$.]
The incomplete log-likelihood is decreasing from $K = 8$ onwards.

  20. Decomposition of the log-likelihood
[Figure: term of fit, $\sum_k \sum_p \hat{\tau}_{kp} \log f(y^k; \hat{\theta}_p)$ (left), and differences of entropies, $K \sum_p \hat{\pi}_p \log(\hat{\pi}_p) - \sum_k \sum_p \hat{\tau}_{kp} \log \hat{\tau}_{kp}$ (right), as functions of the number of segments, for $P = 2, \dots, 6$.]
