clustering

Infinite Mixture and Dirichlet Process: Clustering (Eric Xing) - PDF document



  1. School of Computer Science. Infinite Mixture and Dirichlet Process. Probabilistic Graphical Models (10-708), Lecture 20, Nov 28, 2007. Eric Xing.
     [Figure: signaling-pathway graphical model with nodes Receptor A ($X_1$), Receptor B ($X_2$), Kinase C ($X_3$), Kinase D ($X_4$), Kinase E ($X_5$), TF F ($X_6$), Gene G ($X_7$), Gene H ($X_8$).]
     Clustering (slide 2)

  2. Object Recognition and Tracking (slide 3)
     [Figure: tracked objects labeled with the coordinate triples (1.9, 9.0, 2.1), (1.8, 7.4, 2.3), (1.9, 6.1, 2.2), (0.7, 5.1, 3.2), (0.6, 5.9, 3.2), (0.9, 5.8, 3.1) across frames t=1, t=2, t=3.]
     Modeling The Mind (slide 4)
     - Latent brain processes: view picture, read sentence, decide whether consistent
     [Figure: fMRI scans from t=1 to t=T.]

  3. The Evolution of Science (slide 5)
     [Figure: research circles (Phy, Bio, CS), research topics, and PNAS papers on a timeline from 1900 through 2000 to "?".]
     Partially Observed, Open and Evolving Possible Worlds (slide 6)
     - Unbounded # of objects/trajectories
     - Changing attributes
     - Birth/death, merge/split
     - Relational ambiguity
     - The parametric paradigm (finite, structurally unambiguous):
       - Event model: $p(\{\phi_k^{0}\})$ or $p(\{\phi_k^{1:T}\})$
       - Motion model: $p(\{\phi_k^{t+1}\} \mid \{\phi_k^{t}\})$
       - Entity space: $\Xi^{*}_{t+1|t}$
       - Sensor model (observation space): $p(x \mid \{\phi_k\})$
     How to open it up?

  4. A Classical Approach (slide 7)
     - Clustering as Mixture Modeling
     - Then "model selection"
     Model Selection vs. Posterior Inference (slide 8)
     - Model selection
       - "intelligent" guess: ???
       - cross validation: data-hungry
       - information theoretic (AIC, TIC, MDL): $\arg\min \mathrm{KL}\big(f(\cdot) \,\|\, g(\cdot \mid \hat{\theta}_{ML}, K)\big)$; parsimony, Occam's Razor
       - Bayes factor: need to compute data likelihood
     - Posterior inference: we want to handle uncertainty of model complexity explicitly
       - $p(M \mid D) \propto p(D \mid M)\, p(M)$, with $M \equiv \{\theta, K\}$
       - we favor a distribution that does not constrain $M$ in a "closed" space!

  5. Two "Recent" Developments (slide 9)
     - First order probabilistic languages (FOPLs)
       - Examples: PRM, BLOG, ...
       - Lift graphical models to an "open" world (#rv, relation, index, lifespan, ...)
       - Focus on complete, consistent, and operating rules to instantiate possible worlds, and on a formal language for expressing such rules
       - Operational way of defining distributions over possible worlds, via sampling methods
     - Bayesian Nonparametrics
       - Examples: Dirichlet processes, stick-breaking processes, ...
       - From finite, to infinite mixture, to more complex constructions (hierarchies, spatial/temporal sequences, ...)
       - Focus on the laws and behaviors of both the generative formalisms and the resulting distributions
       - Often offer explicit expressions of distributions and expose their structure, which motivates various approximate schemes
     Clustering (slide 10)
     - How to label them?
     - How many clusters?

  6. Genetic Demography (slide 11)
     - Are there genetic prototypes among them?
     - What are they?
     - How many? (How many ancestors do we have?)
     Genetic Polymorphisms (slide 12)

  7. Biological Terms (slide 13)
     - Genetic polymorphism: a difference in DNA sequence among individuals, groups, or populations
     - Single Nucleotide Polymorphism (SNP): DNA sequence variation occurring when a single nucleotide (A, T, C, or G) differs between members of the species
       - Each variant is called an "allele"
       - Almost always bi-allelic
       - Account for most of the genetic diversity among different (normal) individuals, e.g. drug response, disease susceptibility
     From SNPs to Haplotypes (slide 14)
     - Alleles of adjacent SNPs on a chromosome form haplotypes
     - Powerful in the study of disease association or genetic evolution

  8. Haplotype and Genotype (slide 15)
     - Haplotype: a collection of alleles derived from the same chromosome
     [Figure: haplotype reconstruction. Genotypes list an unordered pair of alleles per locus (chromosome phase is unknown); haplotypes assign each allele to one of the two chromosomes (chromosome phase is known).]
     Ancestral Inference (slide 16)
     [Figure: graphical model with ancestral haplotypes $A_k$ and parameters $\theta_k$ (number of ancestors unknown), individual haplotypes $H_{n_1}$, $H_{n_2}$, and genotype $G_n$, in a plate over $N$ individuals.]
     Essentially a clustering problem, but ...
     - Better recovery of the ancestors leads to better haplotyping results (because of more accurate grouping of common haplotypes)
     - True haplotypes are obtainable only at high cost, but they can validate the model more objectively (as opposed to examining saliency of clustering)
     - Many other biological/scientific utilities
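
To make the "chromosome phase is unknown" point concrete, here is a minimal Python sketch that enumerates the haplotype pairs consistent with one unphased genotype. The function name `compatible_haplotype_pairs` and the toy allele values are hypothetical and are not part of the lecture.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: one unordered allele pair per locus, e.g. [(2, 13), (1, 6)].
    """
    pairs = set()
    # At each locus, either allele of the pair may sit on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(locus[c] for locus, c in zip(genotype, choice))
        h2 = tuple(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(frozenset((h1, h2)))  # unordered: (h1, h2) equals (h2, h1)
    return pairs

# Hypothetical toy genotype: three heterozygous loci and one homozygous locus.
genotype = [(2, 13), (1, 6), (9, 15), (18, 18)]
for pair in compatible_haplotype_pairs(genotype):
    print(sorted(pair))
```

With three heterozygous loci there are 2^(3-1) = 4 distinct phasings, which is why haplotype reconstruction needs extra information such as a shared pool of population haplotypes.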

  9. A Finite (Mixture of) Allele Model (slide 17)
     - The probability of a genotype $g$: $p(g) = \sum_{h_1, h_2 \in \mathcal{H}} p(h_1, h_2)\, p(g \mid h_1, h_2)$, where $\mathcal{H}$ is the population haplotype pool, $p(h_1, h_2)$ is the haplotype model, and $p(g \mid h_1, h_2)$ is the genotyping model
     - Standard settings:
       - $|\mathcal{H}| = K \ll 2^J$: fixed-sized population haplotype pool
       - $p(h_1, h_2) = p(h_1)\, p(h_2) = f_1 f_2$: Hardy-Weinberg equilibrium
     - Problem: $K$? $\mathcal{H}$?
     An Infinite (Mixture of) Allele Model (slide 18)
     [Figure: graphical model with infinitely many components $\{A_k, \theta_k\}$, individual haplotypes $H_{n_1}$, $H_{n_2}$, and genotype $G_n$, in a plate over $N$ individuals.]
     - How? Via a nonparametric hierarchical Bayesian formalism!
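
As a rough illustration of the finite model, the sketch below evaluates $p(g)$ by summing over all ordered haplotype pairs from a small pool under Hardy-Weinberg equilibrium. Several things here are assumptions made only for the example: alleles are coded 0/1, a genotype site is coded as the number of 1-alleles (0, 1, or 2), genotyping is noiseless, and the pool and frequencies are invented.

```python
from itertools import product

def genotype_prob(g, pool, freqs):
    """p(g) = sum over (h1, h2) in pool x pool of p(h1) p(h2) p(g | h1, h2)."""
    total = 0.0
    for (h1, f1), (h2, f2) in product(list(zip(pool, freqs)), repeat=2):
        # Assumed noiseless genotyping: g_j is the number of 1-alleles at site j,
        # so p(g | h1, h2) is 1 if every site matches and 0 otherwise.
        if all(a + b == gj for a, b, gj in zip(h1, h2, g)):
            total += f1 * f2  # Hardy-Weinberg: p(h1, h2) = f1 * f2
    return total

# Hypothetical pool with K = 3 haplotypes over J = 3 sites.
pool = [(0, 0, 1), (1, 0, 1), (1, 1, 0)]
freqs = [0.5, 0.3, 0.2]
print(genotype_prob((1, 0, 2), pool, freqs))
```

Only the pair (0, 0, 1) with (1, 0, 1) explains the genotype (1, 0, 2), in either order, so the sum is 2 x 0.5 x 0.3 = 0.3. The sketch also shows the problem the slide raises: both `pool` and its size K must be fixed in advance.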

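  10. Stick-breaking Process (slide 19)
     - $G = \sum_{k=1}^{\infty} \pi_k\, \delta(\theta_k)$
     - Location: $\theta_k \sim G_0$
     - Mass: $\sum_{k=1}^{\infty} \pi_k = 1$, with $\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j)$ and $\beta_k \sim \mathrm{Beta}(1, \alpha)$
     - Example: from a unit stick, $\beta_1 = 0.4$ gives $\pi_1 = 0.4$; of the remaining 0.6, $\beta_2 = 0.5$ gives $\pi_2 = 0.3$; of the remaining 0.3, $\beta_3 = 0.8$ gives $\pi_3 = 0.24$
     Graphical Model (slide 20)
     [Figure: graphical model with infinitely many components $\{A_k, \theta_k\}$, individual haplotypes $H_{n_1}$, $H_{n_2}$, and genotype $G_n$, in a plate over $N$ individuals.]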
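
The stick-breaking weights can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and sampling is truncated at a finite number of sticks, whereas the construction itself is infinite.

```python
import random

def stick_weights(betas):
    """Compute pi_k = beta_k * prod_{j<k} (1 - beta_j) for the given fractions."""
    weights, remaining = [], 1.0
    for b in betas:
        weights.append(b * remaining)  # break off a fraction b of what is left
        remaining *= 1.0 - b           # length of stick still unbroken
    return weights

def sample_stick_breaking(alpha, num_sticks, rng):
    """Draw truncated weights with beta_k ~ Beta(1, alpha)."""
    betas = [rng.betavariate(1.0, alpha) for _ in range(num_sticks)]
    return stick_weights(betas)

# Fractions 0.4, 0.5, 0.8 give weights close to 0.4, 0.3, 0.24.
print(stick_weights([0.4, 0.5, 0.8]))

weights = sample_stick_breaking(alpha=2.0, num_sticks=50, rng=random.Random(0))
print(sum(weights))  # approaches 1 as the number of sticks grows
```

A larger `alpha` makes each break smaller on average, so the mass is spread over more components.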

  11. Chinese Restaurant Process (slide 21)
     - $P(c_i = k \mid \mathbf{c}_{-i})$ for tables 1 ($\theta_1$), 2 ($\theta_2$), 3, ... as customers arrive:
       - customer 1: $1$, $0$, $0$
       - customer 2: $\frac{1}{1+\alpha}$, $\frac{\alpha}{1+\alpha}$, $0$
       - customer 3: $\frac{1}{2+\alpha}$, $\frac{1}{2+\alpha}$, $\frac{\alpha}{2+\alpha}$
       - customer 4: $\frac{1}{3+\alpha}$, $\frac{2}{3+\alpha}$, $\frac{\alpha}{3+\alpha}$
       - customer $i$: $\frac{m_1}{i+\alpha-1}$, $\frac{m_2}{i+\alpha-1}$, ..., $\frac{\alpha}{i+\alpha-1}$
     - The CRP defines an exchangeable distribution on partitions over an (infinite) sequence of samples; such a distribution is formally known as the Dirichlet Process (DP)
     The DP Mixture of Ancestral Haplotypes (slide 22)
     - The customers around a table form a cluster
       - associate a mixture component (i.e., a population haplotype) with a table
       - sample $\{a, \theta\}$ at each table from a base measure $G_0$ to obtain the population haplotype and nucleotide substitution frequency for that component
     [Figure: restaurant tables, each labeled $\{A, \theta\}$, with customers 1 to 9 seated around them.]
     - With $p(h \mid \{A, \theta\})$ and $p(g \mid h_1, h_2)$, the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide substitution frequencies)
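
The seating rule of the Chinese Restaurant Process can be simulated directly. The sketch below is a minimal illustration with hypothetical function names: an occupied table $k$ is chosen with probability $m_k / (i + \alpha - 1)$ and a new table with probability $\alpha / (i + \alpha - 1)$.

```python
import random

def crp_assignment_probs(counts, alpha):
    """Return P(c_i = k | c_{-i}) for each occupied table, then for a new table."""
    n = sum(counts)  # customers already seated, i.e. i - 1
    return [m / (n + alpha) for m in counts] + [alpha / (n + alpha)]

def sample_crp(num_customers, alpha, rng):
    """Seat customers one at a time and return the table occupancy counts."""
    counts = []
    for _ in range(num_customers):
        probs = crp_assignment_probs(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join an occupied table
    return counts

# Fourth customer, with tables of size 1 and 2 already occupied.
print(crp_assignment_probs([1, 2], alpha=1.0))
print(sample_crp(100, alpha=1.0, rng=random.Random(0)))
```

The number of occupied tables is not fixed in advance; it grows with the number of customers at a rate controlled by `alpha`.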

  12. DP-haplotyper (slide 23)
     [Figure: graphical model. $\alpha$ and $G_0$ define the DP, from which $G$ is drawn; $G$ supplies the infinite mixture components $\{A, \theta\}$ (for population haplotypes); the likelihood model (for individual haplotypes and genotypes) covers $H_{n_1}$, $H_{n_2}$, and $G_n$ in a plate over $N$ individuals.]
     - Inference: Markov Chain Monte Carlo (MCMC)
       - Gibbs sampling
       - Metropolis-Hastings
     Model components (slide 24)
     - Choice of base measure: $G_0 \sim \prod_j \mathrm{Unif}(a_j) \cdot \mathrm{Beta}(\theta_j)$
     - Nucleotide-substitution model: $p(h_i \mid \{a_k, \theta_k\}) = \prod_j p(h_{i,j} \mid a_{k,j}, \theta_{k,j})$, where $p(h_{i,j} \mid a_{k,j}, \theta_{k,j}) = \theta_{k,j}$ if $h_{i,j} = a_{k,j}$, and $1 - \theta_{k,j}$ if $h_{i,j} \neq a_{k,j}$
     - Noisy genotyping model: $p(g_i \mid h_{i_1}, h_{i_2}) = \prod_j p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j})$, where $p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j}) = \gamma$ if $h_{i_1,j} \oplus h_{i_2,j} = g_{i,j}$, and $\frac{1-\gamma}{2}$ if $h_{i_1,j} \oplus h_{i_2,j} \neq g_{i,j}$
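
The two likelihood components factor over sites, which makes them easy to sketch. The code below is an illustration under assumptions that are not in the slides: alleles are coded 0/1, the $\oplus$ operation is taken to mean the count of 1-alleles at a site (so a genotype site is 0, 1, or 2), and the function names are hypothetical.

```python
def haplotype_lik(h, a, theta):
    """p(h | a, theta): each site copies the founder allele with prob. theta_j."""
    lik = 1.0
    for hj, aj, tj in zip(h, a, theta):
        lik *= tj if hj == aj else 1.0 - tj
    return lik

def genotype_lik(g, h1, h2, gamma):
    """p(g | h1, h2): each site is read correctly with probability gamma."""
    lik = 1.0
    for gj, x, y in zip(g, h1, h2):
        # Assumed coding: the true genotype at a site is the number of 1-alleles.
        # An error lands on either of the two other values with equal probability.
        lik *= gamma if x + y == gj else (1.0 - gamma) / 2.0
    return lik

founder = (0, 1, 0)
theta = (0.9, 0.9, 0.9)
print(haplotype_lik((0, 1, 1), founder, theta))          # one mismatching site
print(genotype_lik((1, 2, 1), (0, 1, 1), (1, 1, 0), 0.95))
```

Dividing the genotyping error mass by 2 keeps each site's distribution over the three genotype values normalized.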

  13. Gibbs sampling (slide 25)
     Starting from some initial haplotype reconstruction $H^{(0)}$, pick a first table with an arbitrary $a_1^{(0)}$, and form the initial population-hap pool $A^{(0)} = \{a_1^{(0)}\}$:
     i) Choose an individual $i$ and one of his/her two haplotypes $t$, uniformly at random, from all ambiguous individuals;
     ii) Sample $c_{i_t}^{(t+1)}$ from $p(c_{i_t}^{(t+1)} \mid \mathbf{c}_{-i_t}^{(t)}, H^{(t)}, A^{(t)})$, update $\mathbf{c}^{(t+1)}$;
     iii) Sample $a_k^{(t+1)}$, where $k = c_{i_t}^{(t+1)}$, from $p(a_k^{(t+1)} \mid \forall\, h_{i'_{t'}}^{(t)} \text{ s.t. } c_{i'_{t'}}^{(t+1)} = k)$; update $A^{(t+1)}$;
     iv) Sample $h_{i_t}^{(t+1)}$ from $p(h_{i_t}^{(t+1)} \mid c_{i_t}^{(t+1)}, H_{-i_t}^{(t)}, A^{(t+1)})$, update $H^{(t+1)}$.
     Convergence of Ancestral Inference (slide 26)
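
Step ii) combines the CRP prior with the haplotype likelihood. The sketch below shows one way this conditional could be computed; it is not the lecture's implementation. It assumes binary alleles, a single fixed substitution parameter `theta` shared by all tables and sites, and a uniform prior on founder alleles so that a new table's founder can be integrated out. The names are hypothetical.

```python
import random

def assignment_weights(h, founders, counts, theta, alpha):
    """Unnormalized p(c = k | rest) for each occupied table, then a new table.

    counts must exclude the haplotype h that is being reassigned.
    """
    weights = []
    for a, m in zip(founders, counts):
        lik = 1.0
        for hj, aj in zip(h, a):
            lik *= theta if hj == aj else 1.0 - theta
        weights.append(m * lik)  # CRP prior (proportional to m_k) times likelihood
    # New table: with a uniform founder prior over {0, 1}, each site contributes
    # 0.5 * theta + 0.5 * (1 - theta) = 0.5 to the marginal likelihood.
    weights.append(alpha * 0.5 ** len(h))
    return weights

def sample_assignment(h, founders, counts, theta, alpha, rng):
    """Draw a table index; len(founders) means 'open a new table'."""
    w = assignment_weights(h, founders, counts, theta, alpha)
    return rng.choices(range(len(w)), weights=w)[0]

founders = [(0, 1), (1, 0)]
counts = [3, 1]
print(assignment_weights((0, 1), founders, counts, theta=0.9, alpha=1.0))
print(sample_assignment((0, 1), founders, counts, 0.9, 1.0, random.Random(0)))
```

The common denominator of the CRP probabilities cancels on normalization, so only the counts and `alpha` are needed. Steps iii) and iv) would then resample the founder at the chosen table and the haplotype itself.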

  14. Haplotyping Error (slide 27)
     [Figure: haplotyping error on the Gabriel data.]
