SWIFT: Scalable Weighted Iterative Flow-clustering Technique
Iftekhar Naim, Gaurav Sharma, Suprakash Datta, James S. Cavenaugh, Jyh-Chiang E. Wang, Jonathan A. Rebhahn, Sally A. Quataert, and Tim R. Mosmann


  1. SWIFT: Scalable Weighted Iterative Flow-clustering Technique
     Iftekhar Naim∗, Gaurav Sharma∗, Suprakash Datta†, James S. Cavenaugh∗, Jyh-Chiang E. Wang∗, Jonathan A. Rebhahn∗, Sally A. Quataert∗, and Tim R. Mosmann∗
     ∗ University of Rochester, Rochester, NY
     † York University, Toronto, ON
     FlowCAP Summit, 2010

  2. Outline
     1 Introduction
         Flow cytometry (FC) Data Analysis
         Automated Multivariate clustering of FC Data
     2 SWIFT Method for FC Data Analysis
         SWIFT Algorithm
         Weighted Iterative Sampling based EM
         Bimodality Splitting
         Graph-based Merging
     3 Does It Work?
         Does It Work?
         How Do We Know It Works?
     4 FlowCAP Contest
         Results on FlowCAP Datasets
         Few Thoughts for FlowCAP II
     5 Conclusion

  4. Flow Cytometry (FC) Overview
     ◮ Rapid multivariate analysis of individual cells.
     ◮ High throughput data generation (description of ∼1 million cells).
     ◮ High dimensionality (∼20 measurements per cell).
     Figure: Flow cytometry system, with labels Fluorochrome, Antibody, Antigen, Cell (Ref: http://probes.invitrogen.com)

  5. FC Data Analysis
     ◮ Traditionally, FC data is analyzed by manual gating:
        ◮ Subjective; scales poorly with increasing dimensions
        ◮ 1D/2D projections may not represent the full picture
        ◮ Inaccurate for overlapping clusters
     Figure: Manual gating for overlapping clusters. Panels: (a) Two overlapping clusters, (b) Combined view, (c) Manual gating.
     ◮ Automated multivariate clustering is desirable for FC data analysis:
        ◮ Repeatable, nonsubjective, comprehends multivariate structure.

  7. Challenges of Automated Clustering of FC Data
     ◮ Challenges of automated clustering:
        ◮ Large FC datasets
           ◮ ∼1 million events
           ◮ High dimensionality (20 or more dimensions)
        ◮ Very small clusters that are important in immunological analysis (100-200 cells out of millions)
        ◮ Overlapping clusters and background noise
     ◮ Our goal: design an automated clustering method capable of addressing these challenges

  8. Many Different Clustering Methods
     Partitional Clustering
        ◮ Soft
           ◮ Mixture Model
           ◮ Fuzzy Clustering
        ◮ Hard
           ◮ K-means
           ◮ Grid Based
           ◮ Spectral Clustering
           ◮ ....

  12. Model Based Clustering for FC Data
      ◮ Model based clustering offers several advantages:
         ◮ Soft clustering: comprehends overlapping clusters and background noise
         ◮ BUT it is computationally expensive, and the choice of model imposes limitations
      ◮ Recent proposals for statistical model based FC clustering (Chan et al. [2008], Lo et al. [2008], Finak et al. [2009], Pyne et al. [2009])
      ◮ We propose a computationally efficient model-based clustering method, SWIFT (Naim et al. [2010]), that offers two advantages:
         ◮ Scalability: faster computation + less memory usage
         ◮ Detection of small populations: ∼100 cells out of 1 million

  17. SWIFT Algorithm for FC Data Clustering
      SWIFT: a three-stage algorithm:
      1 Weighted Iterative Sampling based EM: Gaussian mixture model clustering + novel weighted iterative sampling
         ◮ Bayesian Information Criterion (BIC)
      2 Bimodality Splitting: split any cluster that is
         ◮ Bimodal in any dimension or any principal component.
         ◮ Useful for clustering high dimensional data.
      3 Graph-based Merging: merge overlapping Gaussians (Hennig [2009], Finak et al. [2009], Baudry et al. [2010]).
         ◮ Allows representation of non-Gaussian clusters
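
The BIC step in Stage 1 can be illustrated with a small sketch. It assumes the common definition BIC = -2 ln L + p ln N (lower is better) and a full-covariance GMM; the slides do not say which sign convention or covariance structure SWIFT uses. The function names and the log-likelihood values are hypothetical, chosen only for illustration.

```python
import math


def gmm_free_parameters(k, d):
    """Free parameters of a k-component, d-dimensional full-covariance GMM."""
    weights = k - 1                      # mixing weights sum to 1
    means = k * d
    covariances = k * d * (d + 1) // 2   # one symmetric d x d matrix per component
    return weights + means + covariances


def bic(log_likelihood, k, d, n):
    """BIC = -2 ln L + p ln N (assumed convention: lower is better)."""
    return -2.0 * log_likelihood + gmm_free_parameters(k, d) * math.log(n)


def select_k(log_likelihoods, d, n):
    """Return the candidate k with the lowest BIC."""
    return min(log_likelihoods, key=lambda k: bic(log_likelihoods[k], k, d, n))


# Made-up log-likelihoods for k = 2..5 on N = 10,000 points in d = 2.
example_ll = {2: -41000.0, 3: -39500.0, 4: -39480.0, 5: -39470.0}
k_hat = select_k(example_ll, d=2, n=10_000)
```

With these made-up values the likelihood gain beyond three components is smaller than the added parameter penalty, so the sketch settles on three Gaussians.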

  18. Clustering Strategy: SWIFT
      Flowchart, in order:
      1 GMM clustering with sampling for $k \in [K_{\min}, K_{\max}]$
      2 BIC to decide the number of Gaussians ($\hat{K}$)
      3 Split bimodal clusters until unimodal. Results in $K_{\text{split}}$ clusters
      4 Graph-based merging using overlap/entropy criteria. Results in $K_{\text{entropy}}$ clusters
      5 Soft clustering for $K_{\text{entropy}}$ clusters

  21. Stage 1: Gaussian Mixture Model Clustering
      ◮ Gaussian mixture model (GMM) clustering is chosen among the model based methods
         ◮ Faster than other model based clustering methods
         ◮ Closed form solution
      ◮ Expectation Maximization (EM) algorithm for parameter estimation
         ◮ Computational complexity of each iteration: $O(Nkd^2)$
            ◮ $N$ = the number of data vectors in the dataset
            ◮ $k$ = the number of Gaussian components
            ◮ $d$ = the dimension of each data vector
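
For reference, the standard EM updates for a GMM are written out below. These are the textbook equations, not taken from the slides, and the symbols $\pi_l$, $\mu_l$, $\Sigma_l$ and $\gamma_l^{(i)}$ (the responsibility of component $l$ for data vector $X^{(i)}$) are assumed notation. Each density evaluation and each term of the covariance sum costs $O(d^2)$, and both are done for all $N$ data vectors and all $k$ components, which is consistent with the $O(Nkd^2)$ per-iteration cost when $N \gg d$.

```latex
\begin{aligned}
\text{E-step:}\quad
\gamma_l^{(i)} &= \frac{\pi_l \,\mathcal{N}\!\left(X^{(i)} \mid \mu_l, \Sigma_l\right)}
                      {\sum_{j=1}^{k} \pi_j \,\mathcal{N}\!\left(X^{(i)} \mid \mu_j, \Sigma_j\right)} \\
\text{M-step:}\quad
N_l &= \sum_{i=1}^{N} \gamma_l^{(i)}, \qquad \pi_l = \frac{N_l}{N}, \\
\mu_l &= \frac{1}{N_l} \sum_{i=1}^{N} \gamma_l^{(i)} X^{(i)}, \\
\Sigma_l &= \frac{1}{N_l} \sum_{i=1}^{N} \gamma_l^{(i)}
            \left(X^{(i)} - \mu_l\right)\left(X^{(i)} - \mu_l\right)^{\top}
\end{aligned}
```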

  23. Stage 1: Sampling for Scalability
      ◮ Operate on a smaller subsample of the dataset for better computational performance.
      ◮ Challenge: poor representation of smaller clusters.
      Figure panels: (a) 4 Gaussians with 150K, 100K, 50K, and 150 datapoints; (b) After 10% sampling.
      ◮ Solution: weighted iterative sampling
         ◮ Faster computation
         ◮ Better detection of small clusters
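
The sampling challenge can be made concrete with a little arithmetic on the cluster sizes in the figure. The sketch below assumes uniform subsampling in which each point is kept independently with the same probability; the slides do not specify the sampling scheme, and the function names are hypothetical.

```python
import math


def expected_sample_counts(cluster_sizes, fraction):
    """Expected number of points kept per cluster under uniform subsampling."""
    return [size * fraction for size in cluster_sizes]


def sample_count_std(cluster_size, fraction):
    """Std. dev. of the kept count if each point is kept independently."""
    return math.sqrt(cluster_size * fraction * (1.0 - fraction))


sizes = [150_000, 100_000, 50_000, 150]   # cluster sizes from the figure
counts = expected_sample_counts(sizes, 0.10)
smallest_std = sample_count_std(150, 0.10)
```

Under these assumptions the smallest cluster keeps about 15 of its 150 points on average, with a standard deviation near 3.7, while the large clusters still keep thousands of points each.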

  26. Stage 1: Weighted Iterative Sampling based EM
      $F$ = set of clusters whose parameters are fixed. Initially $F = \emptyset$.
      Flowchart, in order:
      1 FCS dataset $X$
      2 Subsample $S$ from $X$
      3 GMM fitting to $S$ using EM
      4 Fix the $p$ largest clusters and add them to $F$
      5 All the clusters fixed?
         ◮ No: resample $S$ from $X$ with $P(X^{(i)} \text{ is selected in } S) = \sum_{l \notin F} \gamma_l^{(i)}$, then return to step 3
         ◮ Yes: perform a few EM iterations on $X$ and output the model parameters ($\theta$)
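
The resampling step can be sketched as follows. This is a minimal sketch, not the authors' implementation: it reads the selection probability on the slide literally as an independent inclusion probability per data vector, and it assumes $\gamma_l^{(i)}$ is the usual EM responsibility of cluster $l$ for point $i$, so that each row sums to 1. The actual SWIFT code may normalize differently or draw a fixed-size sample. The function names and the responsibility values are hypothetical.

```python
import random


def unfixed_weight(gamma_row, fixed):
    """Sum of responsibilities over clusters l that are not in F."""
    return sum(g for l, g in enumerate(gamma_row) if l not in fixed)


def weighted_resample(gamma, fixed, rng=None):
    """Return indices i kept with probability sum_{l not in F} gamma[i][l].

    gamma[i][l] is the responsibility of cluster l for point i. Each row is
    assumed to sum to 1, so every weight already lies in [0, 1].
    """
    rng = rng or random.Random()
    selected = []
    for i, row in enumerate(gamma):
        if rng.random() < unfixed_weight(row, fixed):
            selected.append(i)
    return selected


# Made-up responsibilities for 4 points and 3 clusters; cluster 0 is fixed.
gamma = [
    [1.0, 0.0, 0.0],     # entirely in the fixed cluster: never resampled
    [0.0, 1.0, 0.0],     # entirely in an unfixed cluster: always kept
    [0.0, 0.0, 1.0],     # same, for the other unfixed cluster
    [0.5, 0.25, 0.25],   # split: kept with probability 0.5
]
sample = weighted_resample(gamma, fixed={0}, rng=random.Random(0))
```

Points explained by already-fixed clusters drop out of the next subsample, so later EM rounds concentrate on the regions where the smaller, still-unfixed clusters live.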

  27. Stage 1: Weighted Iterative Sampling based EM
      Figure: 4 Gaussian clusters with 150K, 100K, 50K, and 150 datapoints
