SLIDE 1

SWIFT: SCALABLE WEIGHTED ITERATIVE FLOW-CLUSTERING TECHNIQUE

Iftekhar Naim∗, Gaurav Sharma∗, Suprakash Datta†, James S. Cavenaugh∗, Jyh-Chiang E. Wang∗, Jonathan A. Rebhahn∗, Sally A. Quataert∗, and Tim R. Mosmann∗

∗University of Rochester, Rochester, NY    †York University, Toronto, ON

FlowCAP Summit, 2010

1 / 48 SWIFT

slide-2
SLIDE 2

OUTLINE

1 INTRODUCTION: Flow cytometry (FC) data analysis · Automated multivariate clustering of FC data
2 SWIFT METHOD FOR FC DATA ANALYSIS: SWIFT algorithm · Weighted iterative sampling based EM · Bimodality splitting · Graph-based merging
3 DOES IT WORK?: Does it work? · How do we know it works?
4 FLOWCAP CONTEST: Results on FlowCAP datasets · A few thoughts for FlowCAP II
5 CONCLUSION

SLIDE 4

FLOW CYTOMETRY (FC) OVERVIEW

◮ Rapid multivariate analysis of individual cells.
◮ High-throughput data generation (descriptions of ∼1 million cells).
◮ High dimensionality (∼20 measurements per cell).

[Diagram labels: antigen, cell, antibody, fluorochrome]

FIGURE: Flow cytometry system (Ref: http://probes.invitrogen.com)

SLIDE 5

FC DATA ANALYSIS

◮ Traditionally, FC data are analyzed by manual gating:
  ◮ Subjective; scales poorly with increasing dimensions.
  ◮ 1D/2D projections may not represent the full picture.
  ◮ Inaccurate for overlapping clusters.

FIGURE: Manual gating for overlapping clusters. (a) Two overlapping clusters; (b) combined view; (c) manual gating.

◮ Automated multivariate clustering is desirable for FC data analysis:
  ◮ Repeatable, nonsubjective, and comprehends multivariate structure.

SLIDE 7

CHALLENGES OF AUTOMATED CLUSTERING OF FC DATA

◮ Challenges of automated clustering:
  ◮ Large FC datasets (∼1 million events).
  ◮ High dimensionality (20 or more dimensions).
  ◮ Very small clusters that are important in immunological analysis (100–200 cells out of millions).
  ◮ Overlapping clusters and background noise.
◮ Our goal: design an automated clustering method capable of addressing these challenges.

SLIDE 8

MANY DIFFERENT CLUSTERING METHODS

[Diagram: taxonomy of clustering methods — partitional clustering divides into hard clustering (e.g., k-means) and soft clustering (e.g., fuzzy and mixture-model methods), alongside spectral and grid-based approaches.]

SLIDE 12

MODEL BASED CLUSTERING FOR FC DATA

◮ Model-based clustering offers several advantages:
  ◮ Soft clustering comprehends overlapping clusters and background noise.
  ◮ BUT it is computationally expensive, and the choice of model imposes limitations.
◮ Recent proposals for statistical model-based FC clustering (Chan et al. [2008], Lo et al. [2008], Finak et al. [2009], Pyne et al. [2009]).
◮ We propose SWIFT (Naim et al. [2010]), a computationally efficient model-based clustering method that offers two advantages:
  ◮ Scalability: faster computation + less memory usage.
  ◮ Detection of small populations: ∼100 cells out of 1 million.

SLIDE 17

SWIFT ALGORITHM FOR FC DATA CLUSTERING

SWIFT: a three-stage algorithm:

1 Weighted Iterative Sampling based EM: Gaussian mixture model clustering + novel weighted iterative sampling.
  ◮ Bayesian Information Criterion (BIC) to select the number of components.
2 Bimodality Splitting: split any cluster that is bimodal in any dimension or any principal component.
  ◮ Useful for clustering high-dimensional data.
3 Graph-based Merging: merge overlapping Gaussians (Hennig [2009], Finak et al. [2009], Baudry et al. [2010]).
  ◮ Allows representation of non-Gaussian clusters.

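The BIC step in Stage 1 can be illustrated with a small sketch (Python; a hedged illustration of the standard criterion, not the authors' implementation): the fitted log-likelihood is penalized by the number of free parameters of a full-covariance GMM, and K̂ is the k ∈ [Kmin, Kmax] that minimizes the criterion.

```python
import math

def gmm_free_params(k: int, d: int) -> int:
    # Free parameters of a k-component, d-dimensional full-covariance GMM:
    # (k - 1) mixing weights + k*d means + k*d*(d+1)/2 covariance entries.
    return (k - 1) + k * d + k * d * (d + 1) // 2

def bic(log_likelihood: float, k: int, d: int, n: int) -> float:
    # Bayesian Information Criterion (lower is better): the fitted
    # log-likelihood penalized by model size; K-hat = argmin over k.
    return -2.0 * log_likelihood + gmm_free_params(k, d) * math.log(n)
```

For example, a 3-component model in 2 dimensions has 17 free parameters, so for fixed fit quality the penalty grows quickly with k and d.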
SLIDE 18

CLUSTERING STRATEGY: SWIFT

[Flowchart] GMM clustering with sampling for k ∈ [Kmin, Kmax] → BIC to decide the number of Gaussians (K̂) → split bimodal clusters until unimodal (results in Ksplit clusters) → graph-based merging using overlap/entropy criteria (results in Kentropy clusters) → soft clustering into Kentropy clusters.

SLIDE 21

STAGE 1: GAUSSIAN MIXTURE MODEL CLUSTERING

◮ Gaussian mixture model (GMM) clustering is chosen among the model-based methods:
  ◮ Faster than other model-based clustering methods.
  ◮ Closed-form solutions for the parameter updates.
◮ Expectation Maximization (EM) algorithm for parameter estimation.
◮ Computational complexity of each iteration: O(Nkd²), where
  ◮ N = number of data vectors in the dataset,
  ◮ k = number of Gaussian components,
  ◮ d = dimension of each data vector.

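One EM iteration of the kind described above can be sketched as follows (a generic full-covariance GMM update in Python/NumPy, not SWIFT's code); the Mahalanobis terms in the E-step are what make the per-iteration cost O(Nkd²).

```python
import numpy as np

def em_step(X, weights, means, covs):
    """One EM iteration for a full-covariance Gaussian mixture model.
    X: (N, d) data; weights: (k,); means: (k, d); covs: (k, d, d)."""
    N, d = X.shape
    k = len(weights)
    log_r = np.empty((N, k))
    for j in range(k):  # E-step: log responsibilities, up to normalization
        diff = X - means[j]                        # (N, d)
        _, logdet = np.linalg.slogdet(covs[j])
        sol = np.linalg.solve(covs[j], diff.T)     # (d, N)
        maha = np.einsum('dn,dn->n', diff.T, sol)  # squared Mahalanobis distances
        log_r[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
    log_r -= log_r.max(axis=1, keepdims=True)      # stabilize the softmax
    gamma = np.exp(log_r)
    gamma /= gamma.sum(axis=1, keepdims=True)      # responsibilities gamma[i, j]
    Nk = gamma.sum(axis=0)                         # M-step: closed-form updates
    new_weights = Nk / N
    new_means = (gamma.T @ X) / Nk[:, None]
    new_covs = np.empty_like(covs)
    for j in range(k):
        diff = X - new_means[j]
        new_covs[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j]
    return new_weights, new_means, new_covs, gamma
```

Each E-step solves a d×d linear system per component per point, and the M-step updates are the usual responsibility-weighted averages.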
SLIDE 23

STAGE 1: SAMPLING FOR SCALABILITY

◮ Operate on a smaller subsample of the dataset for better computational performance.
◮ Challenge: poor representation of smaller clusters.

FIGURE: (a) 4 Gaussians with 150K, 100K, 50K and 150 datapoints; (b) after 10% sampling.

◮ Solution: weighted iterative sampling.
  ◮ Faster computation.
  ◮ Better detection of small clusters.

SLIDE 26

STAGE 1: WEIGHTED ITERATIVE SAMPLING BASED EM

[Flowchart] Start with FCS dataset X and F = ∅, where F = the set of clusters whose parameters are fixed. Subsample S from X; fit a GMM to S using EM; fix the p largest clusters and add them to F; resample S from X with

P(X^(i) is selected in S) = 1 − ∑_{l∈F} γ_l^(i),

where γ_l^(i) is the responsibility of fixed cluster l for point i. Repeat until all clusters are fixed; then perform a few EM iterations on the full dataset X and output the model parameters θ.

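The resampling rule in the flowchart can be sketched like this (illustrative Python; `resample_indices` is a hypothetical helper name, not the authors' code): once a cluster's parameters are fixed, points it explains well are unlikely to be drawn again, so subsequent EM rounds concentrate on the remaining, smaller clusters.

```python
import numpy as np

def resample_indices(gamma, fixed, rng):
    """Keep point i for the next subsample with probability
    1 - sum_{l in fixed} gamma[i, l], where gamma[i, l] is the
    responsibility of fixed cluster l for point i."""
    p_keep = 1.0 - gamma[:, sorted(fixed)].sum(axis=1)
    p_keep = np.clip(p_keep, 0.0, 1.0)   # guard against rounding error
    return np.nonzero(rng.random(len(gamma)) < p_keep)[0]
```

With hard responsibilities the behavior is deterministic: a point fully owned by a fixed cluster is never resampled, while a point the fixed clusters cannot explain is always kept.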
SLIDE 27

STAGE 1: WEIGHTED ITERATIVE SAMPLING BASED EM

FIGURE: 4 Gaussian clusters with 150K, 100K, 50K and 150 datapoints

SLIDE 28

WEIGHTED ITERATIVE SAMPLING: FIRST SAMPLE

FIGURE: (a) First sample; (b) clustering the first sample.

◮ Uniform random sampling.

SLIDE 29

WEIGHTED ITERATIVE SAMPLING: SECOND SAMPLE

FIGURE: (c) Second sample; (d) clustering the second sample.

◮ Sampling probability: 1 − ∑_{l∈{1}} γ_l^(i)

SLIDE 30

WEIGHTED ITERATIVE SAMPLING: THIRD SAMPLE

FIGURE: (e) Third sample; (f) clustering the third sample.

◮ Sampling probability: 1 − ∑_{l∈{1,2}} γ_l^(i)

SLIDE 31

WEIGHTED ITERATIVE SAMPLING: LAST SAMPLE

FIGURE: (g) Last sample; (h) final clustering.

◮ Sampling probability: 1 − ∑_{l∈{1,2,3}} γ_l^(i)

SLIDE 36

STAGE 2: BIMODALITY SPLITTING

◮ Motivated by biology:
  ◮ Separation along only one dimension can be significant.
◮ Clustering is challenging for high-dimensional data:
  ◮ Curse of dimensionality.
  ◮ Discrimination in one dimension can be obfuscated by strong similarity in other dimensions.

Problem:
◮ A Gaussian mixture model for high-dimensional data sometimes yields small clusters that are bimodal in one or two dimensions.

Solution:
◮ Detect bimodal clusters and split them.

SLIDE 37

STAGE 2: BIMODALITY SPLITTING

◮ Bimodality detection: detect clusters that are bimodal in
  ◮ any given dimension, or
  ◮ any principal component.
◮ Perform 1-D kernel density estimation and compute the number of modes.

[Plots: 1-D histogram and kernel density estimate of a bimodal cluster]

◮ Split each bimodal cluster until all subclusters are unimodal.

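The detection step above can be sketched as follows (Python; a simplified illustration with an arbitrary fixed bandwidth, not SWIFT's actual estimator): estimate the 1-D density with Gaussian kernels on a grid and count interior local maxima.

```python
import numpy as np

def count_modes(x, bandwidth=0.5, grid_size=256):
    """Count modes of a 1-D sample via Gaussian kernel density estimation.
    A cluster showing >= 2 modes along some dimension or principal
    component would be flagged for splitting."""
    grid = np.linspace(x.min() - 3 * bandwidth, x.max() + 3 * bandwidth, grid_size)
    # Unnormalized KDE: sum of Gaussian bumps centred at the data points.
    dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2).sum(axis=1)
    interior = dens[1:-1]  # a mode is an interior local maximum on the grid
    return int(np.sum((interior > dens[:-2]) & (interior >= dens[2:])))
```

In practice the bandwidth and the significance of a detected second mode matter; this sketch only conveys the density-then-count-modes idea.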
SLIDE 39

STAGE 3: GRAPH-BASED MERGING

◮ Merging of overlapping Gaussian components.
◮ Allows representing non-Gaussian clusters.

FIGURE: (k) After fitting 10 Gaussians; (l) after merging down to 2 clusters.

SLIDE 41

STAGE 3: GRAPH-BASED MERGING

Merging criterion: normalized overlap measure (NO)

◮ ∼ Jaccard index.
◮ Ei = ellipsoid approximating the i-th Gaussian.

    NO(i, j) = Vol(Ei ∩ Ej) / Vol(Ei ∪ Ej)    (1)

Merge the pair (i′, j′) such that

    (i′, j′) = argmax_(i,j) NO(i, j)    (2)

Stopping criterion: merge until no significant change in entropy (Finak et al. [2009], Baudry et al. [2010]):

    Ent(K) = − ∑_{i=1}^{n} ∑_{j=1}^{K} γ_j^(i) log(γ_j^(i))    (3)

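The stopping criterion of Eq. (3) is straightforward to compute from the (n × K) responsibility matrix (a small Python sketch):

```python
import numpy as np

def soft_entropy(gamma, eps=1e-12):
    """Ent(K) = - sum_i sum_j gamma[i, j] * log(gamma[i, j]) over the
    soft-assignment matrix; eps guards log(0) for points with zero
    responsibility in some cluster."""
    return float(-(gamma * np.log(np.clip(gamma, eps, 1.0))).sum())
```

Hard assignments give zero entropy, while a point split evenly over two clusters contributes log 2; merging proceeds while it still produces a significant change in this value.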
SLIDE 42

STAGE 3: GRAPH-BASED MERGING

◮ 5 Gaussian clusters.

[Figure: merging sequence over five numbered Gaussian clusters]

SLIDE 48

DOES IT WORK?

◮ Experiment: cluster high-dimensional FC data.
◮ Dataset: 544,000 events, 21 dimensions.
◮ SWIFT output:
  ◮ 191 Gaussians (Gaussian fitting + bimodality splitting).
  ◮ 143 clusters (post merging).

SLIDE 49

544,000 EVENTS, 21 DIMENSIONS, 143 CLUSTERS

SLIDE 52

HOW DO WE KNOW IT WORKS?

◮ Experiments to produce datasets with ground truth (Rochester Human Immunology Center).
◮ Electronic mixture of human cells and mouse cells:
  ◮ Two datafiles: human cells only and mouse cells only.
  ◮ Human datafile: 276,418 events; mouse datafile: 267,582 events; total: 544,000 events.
  ◮ Stained using both human and mouse antibodies.
  ◮ The human/mouse label is known for every cell.
◮ Examine every cluster: human only? mouse only? or both?

SLIDE 53

FRACTIONAL MEMBERSHIP OF HUMAN AND MOUSE

[Bar plot: fractional membership (human or mouse) for each cluster number, with a legend distinguishing human and mouse]

SLIDE 56

SMALL CLUSTER DETECTION

◮ Electronic human–mouse mixture with varying proportions of human cells.
  ◮ Five datasets: 50%, 25%, 10%, 1%, 0.1% human cells.
◮ Sensitivity analysis:
  ◮ Precision = TP / (TP + FP)
  ◮ Recall = TP / (TP + FN)

% of Human cells | Precision | Recall  | Human Clusters
50%              | 99.8351%  | 99.902% | 84
25%              | 99.5906%  | 99.728% | 59
10%              | 98.8121%  | 99.405% | 38
1%               | 96.0417%  | 92.2%   | 13
0.1%             | 71.8978%  | 98.5%   | 4

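The two measures in the table reduce to the following (Python sketch; counts are per event, with "human" treated as the positive class):

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision: of the events placed in human clusters, the fraction
    that are truly human. Recall: of the truly human events, the
    fraction recovered in human clusters."""
    return tp / (tp + fp), tp / (tp + fn)
```

This distinction explains the 0.1% row: recall stays high (98.5%) because the rare human events are still found, while precision drops (71.9%) because the small human clusters also capture some mouse events.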
SLIDE 57

ADVANTAGES OF SWIFT

◮ Scalable memory and computation time.
  ◮ Complexity of each EM iteration reduced from O(Nkd²) to O(nkd²), where n = sample size.
◮ Better resolution of small clusters.
◮ Capable of detecting non-Gaussian clusters.
◮ Works well for overlapping clusters (true for all model-based methods).

SLIDE 60

CLUSTERING RESULTS: GVHD DATASET

◮ GvHD dataset: Data Sample 001.fcs
  ◮ 13,831 events; 6 dimensions.
◮ 13 Gaussians selected using BIC.
◮ 11 clusters after merging.

SLIDE 61

CLUSTERING RESULTS: GVHD DATASET

[Scatter plot: FL2.H vs. FL1.H]

SLIDE 63

CLUSTERING RESULTS: GVHD DATASET

[Four scatter plots: SSC.H vs. FSC.H for selected clusters]

SLIDE 68

FEW THOUGHTS FOR FLOWCAP II

◮ FlowCAP-I datasets were relatively small:
  ◮ Fewer than 100,000 events; at most 12 dimensions; usually fewer than 25 clusters.
◮ Introduce larger datasets for FlowCAP II:
  ◮ 1 million events and 20 dimensions are common.
◮ Introduce different tasks and corresponding performance measures:
  ◮ Detection of very small clusters.
  ◮ Detection of overlapping populations.
◮ Gold standard for validation?
  ◮ Manual gating: focused rather than exhaustive; does not comprehend overlapping populations.
  ◮ Electronically mixed datasets (e.g., the human/mouse dataset) for objective evaluation.

SLIDE 73

CONCLUSION

◮ SWIFT: scalable algorithm for FC data clustering.
  ◮ Posterior-sampling-based EM + bimodality splitting + graph-based merging.
  ◮ Advantages: lower computational complexity + better small-cluster resolution.
◮ Extensible to other soft clustering methods:
  ◮ Mixtures of t or skewed-t distributions, or fuzzy clustering.
◮ Further speed-up can be achieved by combining with parallelization:
  ◮ Parallelization using GPUs (Suchard et al. [2010], Espenshade et al. [2009]).
◮ Future work:
  ◮ Improve stability and robustness.
  ◮ Cross-sample cluster matching for biological inference.

SLIDE 74

COLLABORATORS

SWIFT: Naim, Sharma, Datta, Cavenaugh, Rebhahn, Wang, Mosmann
GAFF: Rebhahn, Cavenaugh, Naim, Sharma, Mosmann
Acceleration: Pangborn, Cavenaugh, von Laszewski
Rochester Human Immunology Center: Quataert, Mosmann
NYICE Influenza: Treanor, Topham, Sant, Kim, Whittaker, Mosmann
RPBIP Immunocompromised: Sanz, Looney, Mosmann, Ritchlin, Anolik, Quataert
ACE Autoimmunity: Sanz, Fowell, Looney, Quataert, Mosmann
Asthma: Georas, Looney, Mosmann
Lymphoma: Bernstein, Quataert

SLIDE 75

REFERENCES

J.P. Baudry, A.E. Raftery, G. Celeux, K. Lo, and R. Gottardo. Combining mixture components for clustering. Journal of Computational and Graphical Statistics, 19(2):332–353, 2010.

C. Chan, F. Feng, J. Ottinger, D. Foster, M. West, and T.B. Kepler. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry Part A, (8), 2008.

J. Espenshade, A. Pangborn, G. von Laszewski, D. Roberts, and J.S. Cavenaugh. Accelerating partitional algorithms for flow cytometry on GPUs. pages 226–233, 2009.

G. Finak, R. Gottardo, R. Brinkman, et al. Merging mixture components for cell population identification in flow cytometry. Advances in Bioinformatics, 2009, 2009.

C. Hennig. Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification, pages 1–32, 2009.

K. Lo, R.R. Brinkman, and R. Gottardo. Automated gating of flow cytometry data via robust model-based clustering. Cytometry Part A, 73:321–332, 2008.

I. Naim, S. Datta, G. Sharma, J. Cavenaugh, and T. Mosmann. SWIFT: Scalable weighted iterative sampling for flow cytometry clustering. In Proc. IEEE Intl. Conf. Acoustics Speech and Sig. Proc., pages 509–512, Dallas, Texas, USA, Mar. 2010.

S. Pyne et al. Automated high-dimensional flow cytometric data analysis. PNAS, 106(21):8519, 2009.

M.A. Suchard, Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West. Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics, 19(2):419–438, 2010.
