FLOCK: A Density-Based Clustering Method for Automated Identification and Comparison of Cell Populations in High-Dimensional Flow Cytometry Data


SLIDE 1

FLOCK: A Density-Based Clustering Method for Automated Identification and Comparison of Cell Populations in High-Dimensional Flow Cytometry Data

Max Yu Qian, Ph.D.
Division of Biomedical Informatics and Department of Pathology
University of Texas Southwestern Medical Center, Dallas, TX
September 21, 2010

SLIDE 2

Why Computation Is Necessary

  • Segregating overlapping cell populations

SLIDE 3

Solution: Clustering

  • Assumption: Cells of the same population express ALL biological markers similarly

SLIDE 4

Related Work in Clustering

  • Density-based approaches (such as DBSCAN)
  • Partitioning approaches (such as K-means)
  • Hierarchical approaches (such as HAC)
  • Grid-based approaches (such as STING)
  • J. Han, M. Kamber, A. K. H. Tung, "Spatial Clustering Methods in Data Mining: A Survey"

There is another category called model-based clustering, such as the EM method.

SLIDE 5

Previous Methods Not Directly Applicable

FCM data requires the clustering method to be:
1) Efficient
2) Able to handle high dimensionality
3) Easy to set parameters

SLIDE 6

Four populations in a 2D display

SLIDE 7

Let K=4; select random seeds

SLIDE 8

Space partitioning based on centroids

SLIDE 9

Recalculate centroids

SLIDE 10

Repartition based on new centroids

SLIDE 11

Repeat the procedure many times …

SLIDE 12

Final centroids

SLIDE 13

Final clustering results

SLIDE 14

Let K=3

SLIDE 15

Space partitioning based on centroids

SLIDE 16

Recalculate centroids

SLIDE 17

Repartition based on new centroids

SLIDE 18

Repeat the procedure …

SLIDE 19

Final centroids

SLIDE 20

Final clustering results

SLIDE 21

Seeds trapped in local optimum even if K is correct
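The seed sensitivity illustrated on the preceding slides can be reproduced with a short sketch of Lloyd's algorithm (the toy points and seeds below are made up for illustration, not from the talk): with one seed per population K-means recovers the three groups, but with two seeds dropped into the same far-away population it converges to a worse local optimum even though K is correct.

```python
# Minimal Lloyd's algorithm (K-means) sketch; toy data and seeds are illustrative.

def kmeans(points, seeds, iters=50):
    """Run Lloyd's algorithm from the given seed centroids."""
    centroids = [list(s) for s in seeds]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [[sum(q[i] for q in cl) / len(cl) for i in (0, 1)] if cl else c
               for cl, c in zip(clusters, centroids)]
        if new == centroids:      # converged
            break
        centroids = new
    inertia = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids)
                  for p in points)
    return centroids, inertia

# Three populations: two close pairs near the origin, one far away.
pts = [(0, 0), (0, 1), (3, 0), (3, 1), (20, 20), (20, 21)]

good_seeds = [(0, 0.5), (3, 0.5), (20, 20.5)]  # one seed per population
bad_seeds = [(1.5, 0.5), (20, 20), (20, 21)]   # two seeds in the far population

_, good = kmeans(pts, good_seeds)
_, bad = kmeans(pts, bad_seeds)
print(good, bad)  # bad seeding converges to a worse local optimum
```

With the bad seeding, the two near populations stay merged under one centroid while the far population is split in two, and no further Lloyd iteration can escape; this is exactly the trap the slide describes.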

SLIDE 22

Non-spherical populations

SLIDE 23

K-means Applied to High-Dimensional Data

Three different ways of generating random seeds; number of iterations = 1000, K=2

SLIDE 24

"For high-dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minimum."

Ding C, He X, Zha H, Simon HD. Adaptive dimension reduction for clustering high dimensional data. In: Proceedings of IEEE International Conference on Data Mining.
Bradley PS, Fayyad UM. Refining initial points for K-means clustering. In: Proceedings of the Fifteenth International Conference on Machine Learning.

When the number of dimensions increases, there are more and more local optimum traps. This is also called the Curse of Dimensionality.
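The dimensionality effect can be made concrete with a generic distance-concentration sketch (an illustration of the phenomenon, not code from the talk): as the number of dimensions grows, pairwise distances between random points become nearly equal, so the density contrast that clustering relies on fades.

```python
# Distance concentration in high dimensions: the relative spread
# (max - min) / min of pairwise distances shrinks as dimensionality grows.
import math
import random

def relative_spread(n_points, dim, seed=0):
    """Relative spread of pairwise Euclidean distances of random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

low, high = relative_spread(50, 2), relative_spread(50, 1000)
print(low, high)  # spread is far smaller in 1000-d than in 2-d
```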

SLIDE 25

Therefore

Dimensions need to be reduced.

However, the relationship between dimension selection and clustering is a chicken-and-egg problem:
- to cluster high-dimensional data, dimensionality must be reduced (due to the curse of dimensionality)
- it is more effective to select dimensions within individual data clusters than for the whole dataset

SLIDE 26

SLIDE 27

The Procedure of FLOCK

1) Generate initial clusters (yes, chicken first!)
   - Parameter selection
2) Normalize dimensions within clusters
3) Select dimensions for initial clusters
4) Partition and merge the initial clusters in their selected subspaces
5) Output the final clusters

*Details of each step in following slides

SLIDE 28

Generation of Initial Clusters

SLIDE 29

2D example

SLIDE 30

Divide with hyper-grids

SLIDE 31

Find dense hyper-regions

SLIDE 32

Merge neighboring dense hyper-regions

SLIDE 33

Clustering based on region centers
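The grid procedure on slides 29–33 can be sketched as follows (a simplified 2-D illustration with made-up bin count and density threshold, not the FLOCK implementation): bin events into a hyper-grid, keep bins whose event count exceeds the threshold, then merge adjacent dense bins into groups that seed the initial clusters.

```python
# Sketch of density-based initial clustering on a hyper-grid.
from collections import deque

def grid_clusters(points, nbins, threshold):
    """Return groups of adjacent dense grid bins (each group a list of bin indices)."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]

    def bin_of(p):
        # Map each coordinate onto a bin index in [0, nbins - 1].
        return tuple(min(nbins - 1, int((p[d] - lo[d]) / (hi[d] - lo[d]) * nbins))
                     for d in range(dims))

    counts = {}
    for p in points:
        b = bin_of(p)
        counts[b] = counts.get(b, 0) + 1

    # Keep only hyper-regions denser than the threshold.
    dense = {b for b, c in counts.items() if c >= threshold}

    # Merge neighbouring dense bins (indices differ by <= 1 in every coordinate).
    groups, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:
            b = queue.popleft()
            group.append(b)
            for other in dense - seen:
                if all(abs(b[d] - other[d]) <= 1 for d in range(dims)):
                    seen.add(other)
                    queue.append(other)
        groups.append(group)
    return groups

# Two well-separated blobs -> two groups of dense bins.
pts = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5),
       (10, 10), (10.5, 10), (10, 10.5), (10.5, 10.5)]
print(len(grid_clusters(pts, nbins=10, threshold=3)))  # 2
```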

SLIDE 34

Bin selection methods

Goal is to minimize the Mean Squared Error

  • Scott's method
  • Stone's method
  • Knuth's method, to maximize the posterior probability of a piecewise-constant density model
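Scott's method, for example, sets the bin width from the sample standard deviation, h = 3.49 · σ · n^(−1/3); a minimal sketch of the standard formula (not FLOCK code) follows.

```python
# Scott's rule for histogram bin width: h = 3.49 * sigma * n^(-1/3).
import math
from statistics import stdev

def scott_bins(data):
    """Number of bins suggested by Scott's rule for one dimension."""
    h = 3.49 * stdev(data) * len(data) ** (-1.0 / 3.0)
    return max(1, math.ceil((max(data) - min(data)) / h))

print(scott_bins(list(range(100))))  # 5
```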
SLIDE 35

Density threshold selection

  • Minimum description length

      =

≤ ≤

i x i s

i j j /

) ( ) (

1

µ

      − =

≤ ≤ +

) /( ) ( ) (

1

i x i

j i j d

σ µ

σ

∑ ∑

≤ ≤ + ≤ ≤

− + + − + =

σ

µ µ µ µ

j i d j d i j j

i x i i s x i s i L

1 2 2 1 2 2

|) ) ( (| log )) ( ( log |) ) ( (| log )) ( ( log ) (

SLIDE 36

Simulation Study

Birch dataset (Zhang et al, SIGMOD 1996)

SLIDE 37

Two assumptions with the above model

1) The center area is denser than the surrounding area in a population
2) There is only one group of adjacent hyper-regions in one population

When the number of dimensions increases:
1) Assumption 1 may not hold for a sparse population; further partitioning to identify the sparse population may be necessary
2) There could be multiple adjacent hyper-regions within one population; they need to be merged

Merging and partitioning will be done in the reduced-dimensional space

SLIDE 38

Density Variability in High-Dimensional Data Space

Fix the number of bins and density threshold, and use a Gaussian simulator to simulate 2-d, …, 10-d data with 2 Gaussian clusters

[Figure: two plots against the number of dimensions (X-axis); Y-axes: number of groups of adjacent hyper-regions, and number of bins selected by Stone's Method]

SLIDE 39

Dimension Selection and Cluster Merging

1) 0-1 column-wise normalize each cluster
2) Select 3 dimensions for each cluster based on standard deviations (if number of dimensions < 3, all dimensions are used)
3) Partition a cluster into two, if necessary (this step can be optionally repeated)
4) 0-1 column-wise normalize each pair of partitions
5) Select 3 dimensions for each pair of partitions
6) Starting from the pair that are closest in the 3-dimensional space, merge a pair of partitions, if necessary
7) Repeat Steps 4) to 6) until there is no pair to merge
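Steps 1) and 2) above can be sketched as follows (illustrative helpers assuming a row-major table of events; not the FLOCK source): rescale every column to [0, 1], then rank columns by their post-normalization standard deviation and keep the top 3.

```python
# Sketch of 0-1 column-wise normalization and dimension selection.
from statistics import stdev

def normalize_columns(rows):
    """Rescale every column of a row-major table to the [0, 1] range."""
    out = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        rng = hi - lo
        out.append([(v - lo) / rng if rng else 0.0 for v in col])
    return [list(r) for r in zip(*out)]

def select_dimensions(rows, k=3):
    """Indices of the k columns with the highest post-normalization std dev."""
    cols = list(zip(*normalize_columns(rows)))
    ranked = sorted(range(len(cols)), key=lambda i: stdev(cols[i]), reverse=True)
    return sorted(ranked[:k])

# Columns: bimodal, evenly spread, edge-heavy, middle-heavy (lowest spread).
rows = [
    [5, 100,  0, -2],
    [5, 120,  1,  0],
    [5, 140,  5,  0],
    [8, 160,  5,  0],
    [8, 180,  9,  0],
    [8, 200, 10,  2],
]
print(select_dimensions(rows))  # the middle-heavy column 3 is dropped
```

Note that after 0-1 normalization every column spans [0, 1], so the standard deviation ranks columns by how spread out their values are within that range, not by raw scale.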

SLIDE 40

Merging/Partitioning Criteria

The most common approach is the nearest/mutual neighbor graph, but it is very slow (O(N^2)).

[Figure: two partitions that should be merged vs. two partitions that should not be merged]
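The O(N²) mutual-neighbor computation referred to above can be sketched as follows (a generic brute-force version with toy points, not FLOCK's criterion): finding each point's nearest neighbor requires a scan over all other points, and two points form a mutual pair when each is the other's nearest neighbor.

```python
# Brute-force mutual nearest neighbours: O(N^2) distance evaluations.
import math

def mutual_nn_pairs(points):
    """Pairs (i, j) where points i and j are each other's nearest neighbour."""
    n = len(points)

    def nearest(i):
        return min((j for j in range(n) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(n)]  # N scans over N points -> O(N^2)
    return sorted({tuple(sorted((i, nn[i]))) for i in range(n)
                   if nn[nn[i]] == i})

pts = [(0, 0), (1, 0), (10, 0), (12, 0)]
print(mutual_nn_pairs(pts))  # [(0, 1), (2, 3)]
```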

SLIDE 41

Results

SLIDE 42

FlowCAP Challenges

  • Challenge 1 (fully automated)
  • Challenge 2 (tuned parameters allowed)
  • Challenge 3 (number of clusters known)
  • Challenge 4 (manual gating results of a couple of files known)

Evaluation criteria: manual gating
Data: diffuse large B-cell lymphoma, graft versus host disease, normal donors, symptomatic West Nile virus, and hematopoietic stem cell transplant

SLIDE 43

FlowCAP Data

SLIDE 44

Challenge 1 (auto)

SLIDE 45

DLBCL_001

SLIDE 46

X: FL2; Y: FL4

DLBCL_001, DLBCL_006

SLIDE 47

GvHD_001

SLIDE 48

High-dimensional Data

SLIDE 49

ND_001

CD56 CD8 CD45 CD45 CD3/CD14

SLIDE 50

Challenge 2 (tuned)

Compared with Challenge 1

SLIDE 51

FLOCK in ImmPort (www.immport.org)

SLIDE 52

Automated Identification of Cell Populations

FCM data from Montgomery Lab, Yale Univ.

SLIDE 53

Cross-Sample Comparison with FLOCK

Proportion change of plasmablasts at different days in a tetanus study
FCM data from Sanz Lab, Univ. of Rochester

SLIDE 54

Download FLOCK Results to Your Own Software

Casale FCM data from Immune Tolerance Network
Visualization software: Tableau

SLIDE 55

Discussion

  • Computational analysis is most needed for high-dimensional datasets
  • Preprocessing is also important
  • FlowCAP2 can include cross-sample comparison, since the alignment and mapping is also challenging
  • From cluster to population
SLIDE 56

Conclusions

FLOw Clustering without K - FLOCK

  • Identifies cell populations within multi-dimensional space
  • Automatically determines the number of unique populations present using a rapid binning approach
  • Can handle non-spherical hyper-shapes
  • Maps populations across independent samples
  • Calculates useful summary statistics
  • Reduces subjective factors in gating
  • Implemented in ImmPort and freely available
SLIDE 57

Acknowledgment

UT Southwestern
Richard Scheuermann, Megan Kong

Northrop Grumman
John Campbell, Yue Liu

Rochester
Iñaki Sanz, Chungwen Wei, Paula Guidry, David Dougall, Eva Sadat, Liz Thompson, Patrick Dunn, Jeff Wiser, Eun Hyung Lee, Tim Mosmann, Jessica Halliley, Jamie Lee, Jennifer Cai, Mike Atassi, Chris Tipton

Immune Tolerance Network
Jie Huang, Nishanth Marthandan, Diane Xiang, Dave Parrish, Keith Boyce, Tom Casale, Young Kim, Adam Seegmiller, Nitin Karandikar, Jason Liu

FlowCAP Organization Committee
Nitin Karandikar

Supported by NIH N01 AI40076 (BISC)