

SLIDE 1

A segmentation-clustering problem for the analysis of array CGH data

  • F. Picard, S. Robin, E. Lebarbier, J-J. Daudin

UMR INA-PG / INRA, Paris. Bio-Info-Math Workshop, Tehran, April 2005

SLIDE 2

Microarray CGH technology

  • Known effects of large chromosomal aberrations (e.g. trisomy). → Experimental tool: karyotype (resolution ∼ chromosome).

  • Change of scale: what are the effects of small DNA sequence deletions/amplifications? → Experimental tool: "conventional" CGH (resolution ∼ 10 Mb).

  • CGH = Comparative Genomic Hybridization: a method for the comparative measurement of relative DNA copy numbers between two samples (normal/disease, test/reference). → Application of the microarray technology to CGH: 1997. → Latest generation of chips: resolution ∼ 100 kb.

SLIDE 3

The principle of microarray technology

SLIDE 4

Interpretation of a CGH profile

[Figure: CGH profile (log2 ratio vs position) showing an amplified segment, a deleted segment, and a "normal" segment.]

A dot on the graph represents

  log2 [ ♯ copies of BAC(t) in the test genome / ♯ copies of BAC(t) in the reference genome ].

SLIDE 5

First step of the statistical analysis: break-point detection in a Gaussian signal

  • Let Y = (Y_1, ..., Y_n) be a random process such that Y_t ∼ N(µ_t, σ_t²).
  • Suppose that the parameters of the distribution of the Y_t's are affected by K − 1 abrupt changes at unknown coordinates T = (t_1, ..., t_{K−1}).
  • Those break-points define a partition of the data into K segments of size n_k: I_k = {t : t ∈ ]t_{k−1}, t_k]}, Y^k = {Y_t, t ∈ I_k}.
  • Suppose that those parameters are constant between two changes: ∀ t ∈ I_k, Y_t ∼ N(µ_k, σ_k²).
  • The parameters of this model are T = (t_1, ..., t_{K−1}) and Θ = (θ_1, ..., θ_K), with θ_k = (µ_k, σ_k²).
  • Break-point detection aims at studying the spatial structure of the signal.
SLIDE 6

Estimating the parameters in a model of abrupt-change detection

Log-likelihood:

  L_K(T, Θ) = Σ_{k=1..K} log f(y^k; θ_k) = Σ_{k=1..K} Σ_{t∈I_k} log f(y_t; θ_k)

Estimating the parameters with K fixed, by maximum likelihood:

  • Joint estimation of T and Θ with dynamic programming (sketched below).
  • Necessary property of the likelihood: additivity in K (sum of local likelihoods calculated on each segment).

Model selection: choice of K

  • Penalized likelihood: K̂ = Argmax_K { L̂_K − β × pen(K) }, with pen(K) = 2K.
  • β is adaptively estimated from the data (Lavielle (2003)).
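
To make the dynamic-programming step concrete, here is a minimal sketch of the recursion under the model of slide 5 (our illustration, not the authors' code; the function names and the variance guard are our choices):

```python
# Minimal sketch of break-point detection by dynamic programming
# (our illustration of slides 5-6, not the authors' code).
import numpy as np

def seg_loglik(y):
    """Gaussian log-likelihood of one segment at its MLEs (mu_k, sigma_k^2)."""
    n = len(y)
    var = max(np.var(y), 1e-8)                 # guard against zero variance
    return -0.5 * n * (np.log(2 * np.pi * var) + 1.0)

def segment(y, K_max):
    """Best log-likelihood L[K] and back-pointers for K = 1..K_max segments."""
    n = len(y)
    cost = np.full((n, n + 1), -np.inf)        # cost[i, j]: segment y[i:j]
    for i in range(n):
        for j in range(i + 2, n + 1):          # at least 2 points per segment
            cost[i, j] = seg_loglik(y[i:j])
    L = np.full((K_max + 1, n + 1), -np.inf)
    back = np.zeros((K_max + 1, n + 1), dtype=int)
    L[1] = cost[0]                             # one segment covering y[0:j]
    for K in range(2, K_max + 1):
        for j in range(2 * K, n + 1):
            cand = L[K - 1, :j] + cost[:j, j]  # additivity in K
            back[K, j] = int(np.argmax(cand))
            L[K, j] = cand[back[K, j]]
    return L[:, n], back

def breakpoints(back, K, n):
    """Recover the break-points t_1 < ... < t_{K-1} for K segments."""
    bps, j = [], n
    for k in range(K, 1, -1):
        j = back[k, j]
        bps.append(j)
    return sorted(bps)
```

The model-selection rule then reads K̂ = Argmax_K { L[K] − β × 2K }, with β calibrated adaptively as in Lavielle (2003).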
SLIDE 7

Example of segmentation on array CGH data

[Figure: two segmented CGH profiles, log2 ratio vs genomic position.]

BT474 chromosome 1, K̂ = 5. BT474 chromosome 9, K̂ = 4.

SLIDE 8

Considering the biologists' objective and the need for a new model

[Figure: the same signal under the two views. Left, segmentation (spatial structure of the signal): a structure on the t's only, with one parameter set per segment, θ_k = (µ_k, σ_k²), e.g. (µ_1, σ_1), (µ_2, σ_2), (µ_3, σ_3), (µ_4, σ_4). Right, segmentation/classification: a structure on the t's and on the y's, with segments sharing population parameters θ_p = (m_p, s_p²), so the same population can recur along the signal, e.g. (m_1, s_1), (m_2, s_2), (m_1, s_1).]

SLIDE 9

A new model for segmentation-clustering purposes

  • We suppose there exists a secondary underlying structure of the segments into P populations with weights π_1, ..., π_P (Σ_p π_p = 1).
  • We introduce hidden variables Z_kp, indicators of the population of origin of segment k.
  • Those variables are supposed independent, with multinomial distribution: (Z_k1, ..., Z_kP) ∼ M(1; π_1, ..., π_P).
  • Conditionally on the hidden variables, we know the distribution of Y: Y^k | Z_kp = 1 ∼ N(1_{n_k} m_p, s_p² I_{n_k}).
  • It is a model of segmentation/clustering.
  • The parameters of this model are T = (t_1, ..., t_{K−1}) and Θ = (π_1, ..., π_P; θ_1, ..., θ_P), with θ_p = (m_p, s_p²).

SLIDE 10

Likelihood and statistical units of the model

  • Mixture model of segments (see the sketch below):

    ⋆ the statistical units are the segments Y^k,
    ⋆ the density of Y^k is a mixture density:

      log L_KP(T, Θ) = Σ_{k=1..K} log f(y^k; Θ) = Σ_{k=1..K} log [ Σ_{p=1..P} π_p f(y^k; θ_p) ],

    ⋆ if the Y_t's are independent, we have:

      log L_KP(T, Θ) = Σ_{k=1..K} log [ Σ_{p=1..P} π_p Π_{t∈I_k} f(y_t; θ_p) ].

  • Classical mixture model:

    ⋆ the statistical units are the Y_t's:

      log L_P(Θ) = Σ_{k=1..K} Σ_{t∈I_k} log [ Σ_{p=1..P} π_p f(y_t; θ_p) ].
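
To make the contrast concrete, here is a minimal sketch of both likelihoods in the Gaussian case (our illustration; `bounds` holds the break-point coordinates t_0 = 0 < t_1 < ... < t_K = n, and all names are our assumptions):

```python
# Sketch of the two likelihoods above, Gaussian case (our illustration).
import numpy as np
from scipy.stats import norm

def seg_mixture_loglik(y, bounds, pi, m, s2):
    """log L_KP: the statistical units are the segments y^k = y[t_{k-1}:t_k]."""
    total = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        # log sum_p pi_p * prod_{t in I_k} f(y_t; theta_p), via logsumexp
        lp = [np.log(w) + norm.logpdf(y[a:b], mu, np.sqrt(v)).sum()
              for w, mu, v in zip(pi, m, s2)]
        total += np.logaddexp.reduce(lp)
    return total

def classical_mixture_loglik(y, pi, m, s2):
    """log L_P: the statistical units are the individual y_t."""
    lp = np.stack([np.log(w) + norm.logpdf(y, mu, np.sqrt(v))
                   for w, mu, v in zip(pi, m, s2)])
    return np.logaddexp.reduce(lp, axis=0).sum()
```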

SLIDE 11

A hybrid algorithm for the optimization of the likelihood

Alternate parameter estimation with K and P known:

1. When T is fixed, the EM algorithm estimates Θ:

   Θ̂^(ℓ+1) = Argmax_Θ log L_KP(Θ, T̂^(ℓ)), so that
   log L_KP(Θ̂^(ℓ+1); T̂^(ℓ)) ≥ log L_KP(Θ̂^(ℓ); T̂^(ℓ)).

2. When Θ is fixed, dynamic programming estimates T:

   T̂^(ℓ+1) = Argmax_T log L_KP(Θ̂^(ℓ+1), T), so that
   log L_KP(Θ̂^(ℓ+1); T̂^(ℓ+1)) ≥ log L_KP(Θ̂^(ℓ+1); T̂^(ℓ)).

This yields an increasing sequence of likelihoods (a skeleton of the loop follows):

   log L_KP(Θ̂^(ℓ+1); T̂^(ℓ+1)) ≥ log L_KP(Θ̂^(ℓ); T̂^(ℓ)).
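
The alternation itself reduces to a short loop. A minimal skeleton (our sketch; `em_estimate`, `dp_segment` and `loglik` are placeholders standing for the routines of the neighbouring slides, not a published API):

```python
# Skeleton of the hybrid EM / dynamic-programming algorithm
# (our sketch; the three callables are placeholders, not a published API).
import numpy as np

def hybrid(y, K, P, em_estimate, dp_segment, loglik, n_iter=100, tol=1e-8):
    # arbitrary initialization: K equal-length segments
    T = list(np.linspace(0, len(y), K + 1, dtype=int))
    old = -np.inf
    for _ in range(n_iter):
        theta = em_estimate(y, T, P)   # step 1: T fixed, EM updates Theta
        T = dp_segment(y, theta, K)    # step 2: Theta fixed, DP updates T
        new = loglik(y, T, theta)      # log L_KP can only increase
        if new - old < tol:            # stop when the gain is negligible
            break
        old = new
    return T, theta
```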

SLIDE 12

Mixture model when the segmentation is known

Mixture model parameter estimators:

  • the posterior probability that segment k comes from population p is

      τ̂_kp = π̂_p f(y^k; θ̂_p) / Σ_{ℓ=1..P} π̂_ℓ f(y^k; θ̂_ℓ),

  • the estimator of the mixing proportions is π̂_p = Σ_k τ̂_kp / K,
  • in the Gaussian case, θ_p = (m_p, s_p²):

      m̂_p = [ Σ_k τ̂_kp Σ_{t∈I_k} y_t ] / [ Σ_k τ̂_kp n_k ],
      ŝ_p² = [ Σ_k τ̂_kp Σ_{t∈I_k} (y_t − m̂_p)² ] / [ Σ_k τ̂_kp n_k ].

  • Large segments have a bigger impact on the estimation of the parameters, via the term Σ_k τ̂_kp n_k (see the EM sketch below).
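
These estimators translate directly into one EM iteration. A minimal sketch, Gaussian case (our naming; `bounds` again holds the break-point coordinates t_0 = 0 < ... < t_K = n):

```python
# One EM iteration for the mixture of segments (our sketch of the
# estimators above, Gaussian case).
import numpy as np
from scipy.stats import norm

def em_step(y, bounds, pi, m, s2):
    segs = list(zip(bounds[:-1], bounds[1:]))
    K, P = len(segs), len(pi)
    tau = np.zeros((K, P))
    # E-step: posterior tau_kp of segment k belonging to population p
    for k, (a, b) in enumerate(segs):
        lp = np.array([np.log(pi[p])
                       + norm.logpdf(y[a:b], m[p], np.sqrt(s2[p])).sum()
                       for p in range(P)])
        tau[k] = np.exp(lp - np.logaddexp.reduce(lp))
    # M-step: the estimators above; long segments weigh in via tau_kp * n_k
    n = np.array([b - a for a, b in segs], dtype=float)
    sums = np.array([y[a:b].sum() for a, b in segs])
    pi_new = tau.sum(axis=0) / K
    denom = (tau * n[:, None]).sum(axis=0)            # sum_k tau_kp n_k
    m_new = (tau * sums[:, None]).sum(axis=0) / denom
    s2_new = np.array([
        sum(tau[k, p] * ((y[a:b] - m_new[p]) ** 2).sum()
            for k, (a, b) in enumerate(segs)) / denom[p]
        for p in range(P)])
    return tau, pi_new, m_new, s2_new
```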

SLIDE 13

Influence of segment size on the assignment (MAP)

  • The density of Y^k can be written as follows, with ȳ_k the empirical mean of segment k and v_k its within-segment variance (mean of squares minus squared mean):

      f(y^k; θ_p) = exp{ −(n_k / 2) [ log(2π s_p²) + ( v_k + (ȳ_k − m_p)² ) / s_p² ] }

    ⋆ (ȳ_k − m_p)²: distance of the mean of segment k to population p,
    ⋆ v_k: intra-segment variability of segment k.

  • Large segments are assigned with certainty to the closest population p_0, while very small segments fall back on the prior weights (a toy numerical check of these limits follows):

      lim_{n_k→∞} τ_kp_0 = 1,   lim_{n_k→∞} τ_kp = 0 for p ≠ p_0,
      lim_{n_k→0} τ_kp_0 = π_p_0,   lim_{n_k→0} τ_kp = π_p.
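
Both limits are easy to check numerically. A toy illustration with made-up values (not data from the talk):

```python
# Toy check of the two limits (our illustration, arbitrary values):
# the posterior of a segment concentrates on the closest population
# as n_k grows, and stays near the prior pi_p when n_k is small.
import numpy as np
from scipy.stats import norm

pi = np.array([0.5, 0.5])
m, s = np.array([0.0, 1.0]), np.array([1.0, 1.0])
rng = np.random.default_rng(0)
for n_k in (1, 10, 100, 1000):
    y = rng.normal(0.2, 1.0, n_k)     # segment truly closest to m_p0 = 0
    lp = np.log(pi) + np.array([norm.logpdf(y, mu, sd).sum()
                                for mu, sd in zip(m, s)])
    tau = np.exp(lp - np.logaddexp.reduce(lp))
    print(n_k, tau.round(3))          # tau -> (1, 0) as n_k grows
```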

SLIDE 14

Segmentation with a fixed mixture: back to dynamic programming

  • The incomplete mixture log-likelihood can be written as a sum of local log-likelihoods:

      log L_KP(T, Θ) = Σ_k ℓ_kP(y^k; Θ).

  • The local log-likelihood of segment k corresponds to the mixture log-density of vector Y^k:

      ℓ_kP(y^k; Θ) = log [ Σ_{p=1..P} π_p Π_{t∈I_k} f(y_t; θ_p) ].

  • log L_KP(T, Θ) can therefore be optimized in T, with Θ fixed, by dynamic programming (see the sketch below).
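
Concretely, the only change with respect to the slide-6 recursion is the cost function. A minimal sketch of ℓ_kP in the Gaussian case (our illustration):

```python
# Local log-likelihood l_kP used as the dynamic-programming cost
# when Theta is fixed (our sketch, Gaussian case).
import numpy as np
from scipy.stats import norm

def local_loglik(y_seg, pi, m, s2):
    """l_kP(y^k; Theta) = log sum_p pi_p prod_{t in I_k} f(y_t; theta_p)."""
    lp = [np.log(w) + norm.logpdf(y_seg, mu, np.sqrt(v)).sum()
          for w, mu, v in zip(pi, m, s2)]
    return np.logaddexp.reduce(lp)

# Plugging cost[i, j] = local_loglik(y[i:j], pi, m, s2) into the
# slide-6 recursion then maximizes log L_KP in T with Theta fixed.
```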
SLIDE 15

A decreasing log-likelihood?

[Figure: a simulated signal and the evolution of the incomplete log-likelihood with respect to the number of segments, for f(y^k; Θ) = 0.5 N(0, 1) + 0.5 N(5, 1).]

SLIDE 16

What is going on?

[Figure: segmentations of the same simulated signal for three increasing values of K.]

Once the true number of segments (here 6) is reached, additional break-points cut existing segments at their edges.

SLIDE 17

Explaining the behavior of the likelihood

Optimization of the incomplete likelihood with dynamic programming:

  log L_KP(T; Θ) = Q_KP(T; Θ) − H_KP(T; Θ),
  Q_KP(T; Θ) = Σ_k Σ_p τ_kp log(π_p) + Σ_k Σ_p τ_kp log f(y^k; θ_p),
  H_KP(T; Θ) = Σ_k Σ_p τ_kp log(τ_kp).

Hypotheses:

1. We suppose that the true number of segments is K* and that the partitions are nested for K ≥ K*.

   ⋆ Segment Y^K is cut into (Y^K_1, Y^K_2): f(Y^K; θ_p) = f(Y^K_1; θ_p) × f(Y^K_2; θ_p).

2. We suppose that if Y^K belongs to population p, then (Y^K_1, Y^K_2) belong to p as well:

   ⋆ τ_p(Y^K) ≃ τ_p(Y^K_1) ≃ τ_p(Y^K_2) ≃ τ_p.

SLIDE 18

An intrinsic penalty

Under hypotheses 1-2, for all K ≥ K*:

  log L̂_{K+1,P} − log L̂_{K,P} ≃ Σ_p π̂_p log(π̂_p) − Σ_p τ̂_p log(τ̂_p) ≤ 0.

The log-likelihood is decomposed into two terms:

  • a term of fit that increases with K and is constant from a certain K* on (nested partitions):

      Σ_k Σ_p τ̂_kp log f(y^k; θ̂_p);

  • a term of differences of entropies that decreases with K and plays the role of a penalty for the choice of K:

      K Σ_p π̂_p log(π̂_p) − Σ_k Σ_p τ̂_kp log(τ̂_kp).

Choosing the number of segments K when P is fixed can therefore be done with a penalized likelihood (a sketch of the decomposition follows).
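
Given the fitted posteriors, the decomposition is immediate to compute. A minimal sketch (our illustration; it uses the MLE π̂_p = Σ_k τ̂_kp / K, under which the fit and entropy terms sum exactly to log L_KP):

```python
# Sketch of the fit / entropy-difference decomposition above
# (our illustration; logf[k, p] = log f(y^k; theta_p)).
import numpy as np

def decompose(tau, logf):
    K = tau.shape[0]
    pi_hat = tau.sum(axis=0) / K                   # MLE of the weights
    fit = (tau * logf).sum()                       # increases with K
    entropy_term = (K * (pi_hat * np.log(pi_hat)).sum()
                    - (tau * np.log(np.clip(tau, 1e-300, None))).sum())
    return fit, entropy_term      # log L_KP = fit + entropy_term at the
                                  # exact posteriors tau
```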

SLIDE 19

Incomplete Likelihood behavior with respect to the number of segments

[Figure: incomplete log-likelihood vs number of segments K, for P = 2, ..., 6.]

The incomplete log-likelihood decreases from K = 8 on:

  log L̂_KP(T̂; Θ̂) = Σ_k log [ Σ_p π̂_p f(y^k; θ̂_p) ].
SLIDE 20

Decomposition of the log-likelihood

[Figure: the two terms plotted against the number of segments K, for P = 2, ..., 6.]

  Term of fit: Σ_k Σ_p τ̂_kp log f(y^k; θ̂_p).
  Differences of entropies: K Σ_p π̂_p log(π̂_p) − Σ_k Σ_p τ̂_kp log(τ̂_kp).

SLIDE 21

Resulting clusters

[Figure: resulting clusters on a CGH profile, log2 ratio vs genomic position.]

Segmentation/clustering: P = 3, K = 8. Segmentation alone: K = 5.

SLIDE 22

Resulting clusters

[Figure: resulting clusters on the same CGH profile.]

Segmentation/clustering: P = 4, K = 8. Segmentation alone: K = 5.

SLIDE 23

Perspective: simultaneous choice of K and P

[Figure: incomplete log-likelihood as a function of K and P.]

SLIDE 24

This is the end

Conclusions:

  • Definition of a new model that takes into account the a priori knowledge we have about the biological phenomena under study.
  • Development of a hybrid algorithm (EM / dynamic programming) for the parameter estimation (problems linked to EM: initialization, local maxima, degeneracy).
  • Still waiting for another data set to assess the performance of the clustering.

Perspectives:

  • Modeling:
    ⋆ comparison with hidden Markov models.
  • Model choice:
    ⋆ develop an adaptive procedure for two components.
  • Other application fields:
    ⋆ DNA sequences (in progress).