FLOCK: A Density-Based Clustering Method for Automated Identification and Comparison of Cell Populations in High-Dimensional Flow Cytometry Data
Max
Why Computation Is Necessary
- Segregating overlapping cell populations

Solution: Clustering
- Assumption: Cells of the same population express ALL biological markers similarly
Related Work in Clustering
- Density-based (such as DBSCAN)
- Partitioning approaches (such as K-means)
- Hierarchical approaches (such as HAC)
- Grid-based approaches (such as STING)
- J. Han, M. Kamber, A. K. H. Tung, "Spatial Clustering Methods in Data Mining: A Survey"

There is another category called Model-based Clustering, such as the EM method.
Previous Methods Not Directly Applicable
FCM data requires the clustering method to be:
1) Efficient
2) Able to handle high dimensionality
3) Easy to set parameters
Four populations in a 2-D display

Let K=4; select random seeds
Space partitioning based on centroids
Recalculate centroids
Repartition based on new centroids
Repeat the procedure many times …
Final centroids
Final clustering results

Let K=3
Space partitioning based on centroids
Recalculate centroids
Repartition based on new centroids
Repeat the procedure …
Final centroids
Final clustering results

Seeds trapped in a local optimum even if K is correct
Non-spherical populations
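The iterative procedure above (random seeds, partition, recalculate, repeat) is plain Lloyd's K-means. A minimal sketch, not FLOCK code; the `kmeans` helper and the toy `pts` dataset are illustrative:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: select random seeds, partition the points
    by nearest centroid, recalculate centroids, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # select random seeds
    for _ in range(iters):
        # partition the space based on the current centroids
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # recalculate centroids as the mean of each partition
        new = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:                   # converged: final centroids
            break
        centroids = new
    return centroids, clusters

# four tight 2-D populations at the corners of a square
pts = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (0, 10), (10, 0), (10, 10)]
       for dx in (0.0, 1.0) for dy in (0.0, 1.0)]
cents, clusters = kmeans(pts, k=4, seed=1)
```

Depending on where the random seeds land, runs with different seeds can converge to different partitions — exactly the local-optimum behavior the slides illustrate.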
K-means Applied to High-Dimensional Data
Three different ways of generating random seeds; number of iterations = 1000, K=2
"For high-dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minimum"
- Ding C, He X, Zha H, Simon HD. Adaptive dimension reduction for clustering high dimensional data. In: Proceedings of IEEE International Conference on Data Mining.
- Bradley PS, Fayyad UM. Refining initial points for K-means clustering. In: Proceedings of the Fifteenth International Conference on Machine Learning.

When the number of dimensions increases, there are more and more local optimum traps. This is also called the Curse of Dimensionality.
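One standard illustration of the curse of dimensionality (a general demonstration, not the cited experiment): as the dimension grows, distances between random points concentrate, so the contrast between near and far neighbors shrinks and density structure becomes harder to detect. The `distance_contrast` helper is illustrative:

```python
import math
import random

def distance_contrast(dim, n=200, seed=0):
    """Ratio of the farthest to the nearest of n uniform random points
    (distance measured from the origin); the ratio shrinks toward 1
    as the dimension grows."""
    rng = random.Random(seed)
    norms = [math.sqrt(sum(rng.random() ** 2 for _ in range(dim)))
             for _ in range(n)]
    return max(norms) / min(norms)

# contrast is large in 2-D, much smaller in 100-D
low_d, high_d = distance_contrast(2), distance_contrast(100)
```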
Therefore, dimensions need to be reduced.

However, the relationship between dimension selection and clustering is a chicken-and-egg problem:
- to cluster high-dimensional data, dimensionality must be reduced (due to the curse of dimensionality)
- it is more effective to select dimensions within individual data clusters than for the whole dataset
The Procedure of FLOCK
1) Generate initial clusters (yes, chicken first!)
   - Parameter selection
2) Normalize dimensions within clusters
3) Select dimensions for initial clusters
4) Partition and merge the initial clusters in their selected subspaces
5) Output the final clusters

*Details of each step in following slides
Generation of Initial Clusters
2D example:
- Divide with hyper-grids
- Find dense hyper-regions
- Merge neighboring dense hyper-regions
- Clustering based on region centers
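The four steps above can be sketched for the 2-D case. This is a toy illustration with an assumed grid size and density threshold, not FLOCK's implementation; `initial_clusters` and its parameters are hypothetical:

```python
import random
from collections import defaultdict, deque

def initial_clusters(points, bins=10, min_count=5):
    """Hyper-grid sketch: grid the 2-D space, keep dense cells, merge
    8-neighboring dense cells into groups, and return group centers."""
    # divide with hyper-grids: assign each point to a grid cell
    cells = defaultdict(list)
    for x, y in points:
        cells[(min(int(x * bins), bins - 1),
               min(int(y * bins), bins - 1))].append((x, y))
    # find dense hyper-regions
    dense = {c for c, ps in cells.items() if len(ps) >= min_count}
    # merge neighboring dense hyper-regions via flood fill
    groups, seen = [], set()
    for c in dense:
        if c in seen:
            continue
        comp, queue = [], deque([c])
        seen.add(c)
        while queue:
            cx, cy = queue.popleft()
            comp.append((cx, cy))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        groups.append(comp)
    # one center per merged group of dense regions
    centers = []
    for comp in groups:
        ps = [p for cell in comp for p in cells[cell]]
        centers.append((sum(p[0] for p in ps) / len(ps),
                        sum(p[1] for p in ps) / len(ps)))
    return centers

rng = random.Random(0)
data = [(rng.gauss(0.25, 0.03), rng.gauss(0.25, 0.03)) for _ in range(200)] + \
       [(rng.gauss(0.75, 0.03), rng.gauss(0.75, 0.03)) for _ in range(200)]
centers = initial_clusters(data)   # two Gaussian blobs -> two centers
```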
Bin selection methods
Goal is to minimize the Mean Squared Error
- Scott's method
- Stone's method
- Knuth's method, to maximize the posterior probability of the binning
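For example, Scott's rule chooses the bin width h = 3.49 · σ · n^(−1/3), which minimizes the mean squared error of the histogram against a normal reference density. A small sketch; the `scott_bins` helper is illustrative:

```python
import math
import statistics

def scott_bins(values):
    """Scott's rule: bin width h = 3.49 * stdev * n**(-1/3); the number
    of bins is the data range divided by h, rounded up."""
    n = len(values)
    h = 3.49 * statistics.stdev(values) * n ** (-1 / 3)
    return max(1, math.ceil((max(values) - min(values)) / h))

vals = [i / 100 for i in range(1000)]  # 1000 evenly spaced values in [0, 10)
nbins = scott_bins(vals)
```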
Density threshold selection
- Minimum description length
Sort the hyper-regions by density, s(1) ≥ s(2) ≥ … ≥ s(d). A candidate threshold index i splits them into a dense group {1, …, i} and a sparse group {i+1, …, d}, with means

  μ(i) = ( Σ_{1 ≤ j ≤ i} s(j) ) / i
  σ(i) = ( Σ_{i+1 ≤ j ≤ d} s(j) ) / (d − i)

and description length

  L(i) = Σ_{1 ≤ j ≤ i} [ log2(s(j)) + log2(|s(j) − μ(i)|) ] + Σ_{i+1 ≤ j ≤ d} [ log2(s(j)) + log2(|s(j) − σ(i)|) ]

The threshold is the split i that minimizes L(i).
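A sketch of an MDL-style threshold search: score each candidate split of the density-sorted regions by how cheaply the two groups can be encoded around their own means, and keep the cheapest split. This is an assumed form, not FLOCK's published criterion, and the `+ 1` inside the logarithm is my addition to avoid log(0):

```python
import math

def mdl_threshold(densities):
    """MDL-style cut (assumed form): sort region densities descending and,
    for each split point i, sum the coding cost of each value plus its
    deviation from its own group mean; return the lowest density still
    counted as dense at the cheapest split."""
    s = sorted(densities, reverse=True)
    d = len(s)
    best_i, best_cost = 1, math.inf
    for i in range(1, d):
        mu1 = sum(s[:i]) / i          # mean of the dense group
        mu2 = sum(s[i:]) / (d - i)    # mean of the sparse group
        cost = sum(math.log2(v) + math.log2(abs(v - mu1) + 1) for v in s[:i]) \
             + sum(math.log2(v) + math.log2(abs(v - mu2) + 1) for v in s[i:])
        if cost < best_cost:
            best_i, best_cost = i, cost
    return s[best_i - 1]

# ten dense regions (~100 points each) and ten sparse ones (~5 points each)
dens = [100, 98, 103, 97, 101, 99, 102, 96, 104, 100,
        5, 6, 4, 5, 7, 3, 5, 6, 4, 5]
thr = mdl_threshold(dens)
```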
Simulation Study
Birch dataset (Zhang et al., SIGMOD 1996)
Two assumptions with the above model:
1) The center area is denser than the surrounding area in a population
2) There is only one group of adjacent hyper-regions in one population

When the number of dimensions increases:
1) Assumption 1 may not hold for a sparse population; further partitioning to identify the sparse population may be necessary
2) There could be multiple adjacent hyper-regions within one population; they need to be merged

Merging and partitioning will be done in the reduced-dimensional space.
Density Variability in High‐Dimensional Data Space
Fix the number of bins and density threshold, and use a Gaussian simulator to simulate 2-d, …, 10-d data with 2 Gaussian clusters.

[Plots] (1) X-axis: number of dimensions; Y-axis: number of groups of adjacent hyper-regions. (2) X-axis: number of dimensions; Y-axis: number of bins selected by Stone's Method.
Dimension Selection and Cluster Merging
1) 0-1 column-wise normalize each cluster
2) Select 3 dimensions for each cluster based on standard deviations (if number of dimensions < 3, all dimensions are used)
3) Partition a cluster into two, if necessary (this step can be optionally repeated)
4) 0-1 column-wise normalize each pair of partitions
5) Select 3 dimensions for each pair of partitions
6) Starting from the pair that are closest in the 3-dimensional space, merge a pair of partitions, if necessary
7) Repeat Steps 4) to 6) until there is no pair to merge
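The normalization and dimension-selection steps above can be sketched as follows; `normalize_01` and `top_dimensions` are illustrative helpers, not FLOCK code:

```python
import statistics

def normalize_01(cluster):
    """0-1 column-wise normalization: rescale each dimension of a cluster
    to span [0, 1] within that cluster."""
    cols = list(zip(*cluster))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in cluster]

def top_dimensions(cluster, k=3):
    """Select the k dimensions with the largest standard deviation after
    per-cluster normalization (all dimensions if there are fewer than k)."""
    cols = list(zip(*normalize_01(cluster)))
    if len(cols) <= k:
        return list(range(len(cols)))
    ranked = sorted(((statistics.pstdev(c), i) for i, c in enumerate(cols)),
                    reverse=True)
    return [i for _, i in ranked[:k]]

# column 0 is bimodal, column 2 is skewed, column 1 spreads evenly,
# column 3 is constant
cl = [[0, 0, 0, 0.5], [0, 1, 0, 0.5], [1, 2, 0, 0.5], [1, 3, 1, 0.5]]
dims = top_dimensions(cl)
```

Min-max normalization equalizes the ranges, so the ranking reflects how the values spread within [0, 1] rather than their raw scales.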
Merging/Partitioning Criteria
The most common approach is the nearest/mutual neighbor graph, but it is very slow (O(N^2)).
Two partitions should be merged Two partitions should not be merged
Results
FlowCAP Challenges
- Challenge 1 (fully automated)
- Challenge 2 (tuned parameters allowed)
- Challenge 3 (number of clusters known)
- Challenge 4 (manual gating results of a couple of files known)

Evaluation criteria: manual gating
Data: diffuse large B-cell lymphoma, graft-versus-host disease, normal donors, symptomatic West Nile virus, and hematopoietic stem cell transplant
FlowCAP Data
Challenge 1 (auto)
DLBCL_001
X: FL2; Y: FL4
DLBCL_006
GvHD_001
High-dimensional Data
ND_001
CD56 CD8 CD45 CD45 CD3/CD14
Challenge 2 (tuned)
Compared with Challenge 1
FLOCK in ImmPort (www.immport.org)
Automated Identification of Cell Populations
FCM data from Montgomery Lab, Yale Univ.
Cross-Sample Comparison with FLOCK
Proportion change of PlasmaBlasts at different days in the Tetanus study. FCM data from Sanz Lab, Univ. of Rochester.
Download FLOCK Results to Your Own Software
Casale FCM data from Immune Tolerance Network. Visualization software: Tableau.
Discussion
- Computational analysis most needed for high-dimensional datasets
- Preprocessing is also important
- FlowCAP2 can include cross-sample comparison, since the alignment and mapping is also challenging
- From cluster to population
Conclusions
FLOw Clustering without K — FLOCK:
- Identifies cell populations within multi-dimensional space
- Automatically determines the number of unique populations present using a rapid binning approach
- Can handle non-spherical hyper-shapes
- Maps populations across independent samples
- Calculates useful summary statistics
- Reduces subjective factors in gating
- Implemented in ImmPort and freely available
Acknowledgment

UT Southwestern
Richard Scheuermann, Megan Kong

Northrop Grumman
John Campbell, Yue Liu

Rochester
Iñaki Sanz, Chungwen Wei, Paula Guidry, David Dougall, Eva Sadat, Liz Thompson, Patrick Dunn, Jeff Wiser, Eun Hyung Lee, Tim Mosmann, Jessica Halliley, Jamie Lee, Jennifer Cai, Mike Atassi, Chris Tipton

Immune Tolerance Network
Jie Huang, Nishanth Marthandan, Diane Xiang, Dave Parrish, Keith Boyce, Tom Casale, Young Kim, Adam Seegmiller, Nitin Karandikar, Jason Liu

FlowCAP Organization Committee
Nitin Karandikar

Supported by NIH N01 AI40076 (BISC)