
FLOCK: A Density-Based Clustering Method for Automated Identification and Comparison of Cell Populations in High-Dimensional Flow Cytometry Data



  1. FLOCK: A Density-Based Clustering Method for Automated Identification and Comparison of Cell Populations in High-Dimensional Flow Cytometry Data. Max Yu Qian, Ph.D., Division of Biomedical Informatics and Department of Pathology, University of Texas Southwestern Medical Center, Dallas, TX. September 21, 2010

  2. Why Computation Is Necessary • Segregating overlapping cell populations

  3. Solution: Clustering • Assumption: Cells of the same population express ALL biological markers similarly

  4. Related Work in Clustering • Density-based (such as DBSCAN) • Partitioning approaches (such as K-means) • Hierarchical approaches (such as HAC) • Grid-based approaches (such as STING). Source: J. Han, M. Kamber, A. K. H. Tung, “Spatial Clustering Methods in Data Mining: A Survey”. There is another category called model-based clustering, such as the EM method.

  5. Previous Methods Not Directly Applicable. FCM data requires the clustering method to be: 1) Efficient 2) Able to handle high dimensionality 3) Easy to parameterize

  6. Four populations on 2D display

  7. Let K=4; select random seeds

  8. Space partitioning based on centroids

  9. Recalculate centroids

  10. Repartition based on new centroids

  11. Repeat the procedure many times …

  12. Final centroids

  13. Final clustering results
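The K-means walkthrough on slides 7 to 13 (seed, partition, recalculate, repartition, repeat) can be sketched as follows. This is a minimal illustration of the standard Lloyd iteration, not code from the presentation; the function names `assign` and `kmeans`, the `seed` argument, and the stopping rule are assumptions made for the example.

```python
import numpy as np

def assign(points, centroids):
    # Space partitioning: each point goes to its nearest centroid.
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    # Hypothetical helper: random seeds are drawn from the data points.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Recalculate centroids; an empty cluster keeps its old centroid.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break  # centroids stopped moving
        centroids = updated
    return assign(points, centroids), centroids

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(points, k=2)
```

The result depends on both K and the random seeds, which is the weakness the following slides illustrate.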

  14. Let K=3

  15. Space partitioning based on centroids

  16. Recalculate centroids

  17. Repartition based on new centroids

  18. Repeat the procedure …

  19. Final centroids

  20. Final clustering results

  21. Seeds trapped in local optimum even if K is correct

  22. Non-spherical populations

  23. K-means Applied to High-Dimensional Data. Three different ways were used to generate random seeds. Number of iterations = 1000, K=2

  24. “For high dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minimum.” Sources: Ding C, He X, Zha H, Simon HD. Adaptive dimension reduction for clustering high dimensional data. In: Proceedings of IEEE International Conference on Data Mining. Bradley PS, Fayyad UM. Refining initial points for K-means clustering. In: Proceedings of the Fifteenth International Conference on Machine Learning. When the number of dimensions increases, there are more and more local-optimum traps. This is also called the Curse of Dimensionality.

  25. Therefore, dimensions need to be reduced. However, the relationship between dimension selection and clustering is chicken-and-egg: - to cluster high-dimensional data, dimensionality must be reduced (due to the curse of dimensionality) - it is more effective to select dimensions within individual data clusters than for the whole dataset

  26. The Procedure of FLOCK 1) Generate initial clusters (yes, chicken first!) - Parameter selection 2) Normalize dimensions within clusters 3) Select dimensions for initial clusters 4) Partition and merge the initial clusters in their selected subspaces 5) Output the final clusters. *Details of each step in the following slides

  27. Generation of Initial Clusters

  28. 2D example

  29. Divide with hyper-grids

  30. Find dense hyper-regions

  31. Merge neighboring dense hyper-regions

  32. Clustering based on region centers
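The four steps on slides 29 to 32 can be sketched as below. This is an illustration under stated assumptions, not the FLOCK implementation: the function name `initial_clusters` is hypothetical, the number of bins and the density threshold are passed in by hand, "neighboring" is taken to mean hyper-regions that differ by one bin along a single dimension, and a group's center is taken as the mean of the events in its dense hyper-regions.

```python
from collections import Counter, deque
import numpy as np

def initial_clusters(points, n_bins, threshold):
    n_dims = points.shape[1]
    # Divide with hyper-grids: map every event to a tuple of bin indices.
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum(((points - lo) / span * n_bins).astype(int), n_bins - 1)
    keys = [tuple(c) for c in cells.tolist()]
    # Find dense hyper-regions: cells holding at least `threshold` events.
    dense = {c for c, n in Counter(keys).items() if n >= threshold}
    # Merge neighboring dense hyper-regions (breadth-first search).
    group, n_groups = {}, 0
    for start in sorted(dense):
        if start in group:
            continue
        group[start] = n_groups
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for d in range(n_dims):
                for step in (-1, 1):
                    nb = cell[:d] + (cell[d] + step,) + cell[d + 1:]
                    if nb in dense and nb not in group:
                        group[nb] = n_groups
                        queue.append(nb)
        n_groups += 1
    if n_groups == 0:
        return np.full(len(points), -1), np.empty((0, n_dims))
    # Clustering based on region centers: nearest group center wins.
    cell_group = np.array([group.get(k, -1) for k in keys])
    centers = np.array([points[cell_group == g].mean(axis=0)
                        for g in range(n_groups)])
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1), centers

points = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.0, 0.2], [0.1, 0.0],
                   [10.0, 10.0], [9.9, 9.9], [9.8, 10.0], [10.0, 9.8], [9.9, 10.0],
                   [4.0, 4.0]])
labels, centers = initial_clusters(points, n_bins=4, threshold=3)
```

In this toy input the isolated event at (4, 4) falls in a sparse hyper-region, so it does not form a cluster of its own and is assigned to the nearer of the two dense groups.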

  33. Bin selection methods. Goal is to minimize the Mean Squared Error • Scott’s method • Stone’s method • Knuth’s method, to maximize
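As a rough illustration of two of the bin selection rules named above, the sketch below computes Scott's bin width by hand and then asks NumPy for the bin counts chosen by its built-in `"scott"` and `"stone"` estimators. The function names and the sample data are assumptions for the example, and the slide does not say that FLOCK uses NumPy or these exact variants. Knuth's method is not part of NumPy and is left out.

```python
import numpy as np

def scott_bin_width(values):
    # Scott's rule: h = 3.49 * sigma * n^(-1/3), which minimizes the
    # asymptotic mean integrated squared error for Gaussian-like data.
    values = np.asarray(values, dtype=float)
    return 3.49 * values.std() * len(values) ** (-1.0 / 3.0)

def bin_counts(values):
    # Number of bins chosen by NumPy's histogram bin estimators.
    return {rule: len(np.histogram_bin_edges(values, bins=rule)) - 1
            for rule in ("scott", "stone")}

values = np.arange(1000, dtype=float)
width = scott_bin_width(values)
counts = bin_counts(values)
```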

  34. Density threshold selection • Minimum description length

      $$\mu_s(i) = \frac{1}{i} \sum_{1 \le j \le i} x_j$$

      $$\mu_d(i) = \frac{1}{\sigma - i} \sum_{i+1 \le j \le \sigma} x_j$$

      $$L(i) = \log_2\big(\mu_s(i)\big) + \sum_{1 \le j \le i} \log_2\big(|x_j - \mu_s(i)|\big) + \log_2\big(\mu_d(i)\big) + \sum_{i+1 \le j \le \sigma} \log_2\big(|x_j - \mu_d(i)|\big)$$
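One way to read the description length L(i) is as a search for the cut point i that splits a sorted list of values into two groups, each cheap to encode around its own mean. The sketch below follows that reading. It assumes the x_j are hyper-region densities sorted in descending order and that σ is their count; the slide states neither. The name `mdl_cut` is hypothetical, and treating any quantity below 1 as costing zero bits is an added guard against log2(0).

```python
import math

def mdl_cut(values):
    xs = sorted(values, reverse=True)  # assumption: densities, largest first
    n = len(xs)                        # plays the role of sigma

    def bits(v):
        # Assumption: quantities below 1 cost zero bits (avoids log2(0)).
        return math.log2(v) if v >= 1 else 0.0

    best_i, best_len = None, math.inf
    for i in range(1, n):
        mu_s = sum(xs[:i]) / i          # mean of the first i values
        mu_d = sum(xs[i:]) / (n - i)    # mean of the remaining values
        length = (bits(mu_s) + sum(bits(abs(x - mu_s)) for x in xs[:i])
                  + bits(mu_d) + sum(bits(abs(x - mu_d)) for x in xs[i:]))
        if length < best_len:
            best_i, best_len = i, length
    # Return the cut index and the smallest value kept above the cut.
    return best_i, xs[best_i - 1]

cut, threshold = mdl_cut([100, 98, 95, 5, 4, 3])
```

For this toy input the shortest description separates the three large values from the three small ones, so 95 is the lowest density that still counts as dense.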

  35. Simulation Study. Birch dataset (Zhang et al., SIGMOD 1996)

  36. Two assumptions with the above model: 1) The center area is denser than the surrounding area in a population 2) There is only one group of adjacent hyper-regions in one population. When the number of dimensions increases: 1) Assumption 1 may not hold for a sparse population; further partitioning to identify the sparse population may be necessary 2) There could be multiple groups of adjacent hyper-regions within one population; they need to be merged. Merging and partitioning will be done in the reduced-dimensional space

  37. Density Variability in High-Dimensional Data Space. Fix the number of bins and the density threshold, and use a Gaussian simulator to simulate 2-d, …, 10-d data with 2 Gaussian clusters. [Figure: two plots against the number of dimensions (X-axis). Y-axis of one plot: number of groups of adjacent hyper-regions. Y-axis of the other: number of bins selected by Stone’s method.]

  38. Dimension Selection and Cluster Merging 1) 0-1 column-wise normalize each cluster 2) Select 3 dimensions for each cluster based on standard deviations (if the number of dimensions < 3, all dimensions are used) 3) Partition a cluster into two, if necessary (this step can be optionally repeated) 4) 0-1 column-wise normalize each pair of partitions 5) Select 3 dimensions for each pair of partitions 6) Starting from the pair that is closest in the 3-dimensional space, merge a pair of partitions, if necessary 7) Repeat Steps 4) to 6) until there is no pair to merge

  39. Merging/Partitioning Criteria. The most common approach is a nearest/mutual neighbor graph, but it is very slow (O(N^2)). [Figure panels: two partitions that should not be merged; two partitions that should be merged.]
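Steps 1) and 2) above can be sketched as follows. The slide says only that dimensions are selected "based on standard deviations", so whether the smallest or the largest are preferred is left as a flag here. The names `normalize_01` and `select_dims` and the `smallest` parameter are assumptions for the example, not part of FLOCK.

```python
import numpy as np

def normalize_01(cluster):
    # 0-1 column-wise normalization; constant columns map to 0.
    lo = cluster.min(axis=0)
    span = cluster.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)
    return (cluster - lo) / span

def select_dims(cluster, k=3, smallest=True):
    n_dims = cluster.shape[1]
    if n_dims < k:
        return np.arange(n_dims)  # fewer than k dimensions: use them all
    sd = normalize_01(cluster).std(axis=0)
    order = np.argsort(sd)
    # Assumption: `smallest` picks the direction of the ranking.
    chosen = order[:k] if smallest else order[-k:]
    return np.sort(chosen)

cluster = np.array([[0.0, 0.0, 0.0, 0.0],
                    [0.5, 0.0, 1.0, 0.4],
                    [1.0, 100.0, 1.0, 0.6],
                    [0.5, 100.0, 1.0, 1.0]])
low_sd = select_dims(cluster)
high_sd = select_dims(cluster, smallest=False)
```

Normalizing first matters: without it, the second column would dominate any ranking simply because it is measured on a larger scale.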

  40. Results

  41. FlowCAP Challenges • Challenge 1 (fully automated) • Challenge 2 (tuned parameters allowed) • Challenge 3 (number of clusters known) • Challenge 4 (manual gating results of a couple of files known). Evaluation criteria: manual gating. Data: diffuse large B-cell lymphoma, graft versus host disease, normal donors, symptomatic West Nile virus, and hematopoietic stem cell transplant

  42. FlowCAP Data

  43. Challenge 1 (auto)

  44. DLBCL_001

  45. X: FL2; Y: FL4. DLBCL_001, DLBCL_006

  46. GvHD_001

  47. High-Dimensional Data

  48. ND_001 CD56 CD8 CD45 CD45 CD3/CD14

  49. Challenge 2 (tuned). Compared with Challenge 1

  50. FLOCK in ImmPort (www.immport.org)

  51. Automated Identification of Cell Populations FCM data from Montgomery Lab, Yale Univ.

  52. Cross-Sample Comparison with FLOCK. Proportion change of PlasmaBlasts at different days with Tetanus study. FCM data from Sanz Lab, Univ. of Rochester

  53. Download FLOCK Results to Your Own Software. Casale FCM data from Immune Tolerance Network. Visualization software: Tableau

  54. Discussion • Computational analysis is most needed for high-dimensional datasets • Preprocessing is also important • FlowCAP2 can include cross-sample comparison, since the alignment and mapping is also challenging • From cluster to population

  55. Conclusions. FLOCK: FLOw Clustering without K o Identifies cell populations within multi-dimensional space o Automatically determines the number of unique populations present using a rapid binning approach o Can handle non-spherical hyper-shapes o Maps populations across independent samples o Calculates useful summary statistics o Reduces subjective factors in gating o Implemented in ImmPort and freely available
