SLIDE 1

On the Use of NMF and curvHDR to Cluster Flow Cytometry Data

José M. Maisog (1,2), Andrea A. Barbo (2), George Luta (2)

(1) Medical Numerics, Inc., Germantown, MD 20876
(2) Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC 20057

FlowCAP Summit, September 21-22, 2010

SLIDE 5

Outline

1. Non-Negative Matrix Factorization
2. curvHDR
3. Strategy for FlowCAP Challenge 2
4. Discussion

NMF and curvHDR September, 2010 2 / 21

SLIDE 13

Non-Negative Matrix Factorization

NMF

- A relatively new method of matrix decomposition [LS99]
- Given Y, an M × N non-negative matrix
- Find W and H such that:
  - Y ≈ W × H
  - W is M × k, H is k × N
  - W and H are non-negative
- Must specify k (cf. k-means clustering)
- Dimensionality reduction: k < M, N
- There are multiple variations, e.g., different optimization criteria
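To make "different optimization criteria" concrete, here are two objectives that are commonly used for NMF: the squared Frobenius error and a generalized Kullback-Leibler divergence. The slides do not say which variants the authors had in mind, so treat these as illustrative assumptions; the symbols Y, W, H, M, N follow the definitions above.

```latex
\begin{align*}
\min_{W \ge 0,\; H \ge 0} \; & \lVert Y - WH \rVert_F^2
  = \sum_{i=1}^{M} \sum_{j=1}^{N} \bigl( y_{ij} - (WH)_{ij} \bigr)^2 , \\
\min_{W \ge 0,\; H \ge 0} \; & D(Y \,\Vert\, WH)
  = \sum_{i=1}^{M} \sum_{j=1}^{N}
    \Bigl( y_{ij} \log \frac{y_{ij}}{(WH)_{ij}} - y_{ij} + (WH)_{ij} \Bigr).
\end{align*}
```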

SLIDE 17

Non-Negative Matrix Factorization

NMF: Algorithm

[Figure: the samples × variables (e.g., genes) data matrix Y is decomposed as Y = W • H + E. Based on a figure from [You09].]

- Initialize W and H with random values.
- Optimize so that $\sum_{i,j} (y_{ij} - (WH)_{ij})^2$ is minimized.
- The k rows of H define “metagenes”, while the ith row of W represents the “metagene expression pattern of the corresponding sample” [Dev08].
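A minimal sketch of the two steps above (random initialization, then reducing the squared error), assuming the widely used multiplicative update rules attributed to Lee and Seung as the optimizer. The slides do not say which optimizer the figure refers to, and the FlowCAP pipeline itself used an alternating least squares variant [KP07] in R. The function names `nmf`, `matmul`, `transpose`, and `squared_error` and the parameters `iterations`, `seed`, and `eps` are hypothetical and chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def squared_error(y, w, h):
    """Sum over all entries of (y_ij - (WH)_ij)^2."""
    wh = matmul(w, h)
    return sum((y[i][j] - wh[i][j]) ** 2
               for i in range(len(y)) for j in range(len(y[0])))


def nmf(y, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative M x N matrix y into W (M x k) and H (k x N)."""
    rng = random.Random(seed)
    m, n = len(y), len(y[0])
    # Step 1: initialize W and H with random positive values.
    w = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    # Step 2: multiplicative updates (an assumed optimizer, see lead-in).
    for _ in range(iterations):
        wt = transpose(w)
        num = matmul(wt, y)              # k x N
        den = matmul(matmul(wt, w), h)   # k x N
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(k)]
        ht = transpose(h)
        num = matmul(y, ht)              # M x k
        den = matmul(w, matmul(h, ht))   # M x k
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(m)]
    return w, h
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative without an explicit projection step. Calling `nmf(y, k=2)` on a samples × channels matrix returns one 2D encoding per sample in the rows of W.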

SLIDE 18

Non-Negative Matrix Factorization

NMF Results are “Sparse”

NMF has decomposed the face data into discrete “parts.” (Lee and Seung, Nature 1999 Oct 21;401(6755):788-91)

SLIDE 19

Non-Negative Matrix Factorization

PCA of Face Data

Principal components are “holistic” rather than discrete “parts.” (Lee and Seung, Nature 1999 Oct 21;401(6755):788-91)

SLIDE 20

Non-Negative Matrix Factorization

Comparison of Matrix Decomposition Methods

Method    Constraints                            Basis       Encodings
PCA/SVD   components are orthogonal              non-sparse  non-sparse
ICA       statistically independent components   sparse      non-sparse
NMF       data and factors are non-negative      sparse      sparse

SLIDE 28

curvHDR

- Unsupervised clustering with unknown number of clusters [NLW10]
- Algorithm:
  1. Remove excess boundary points and other debris.
  2. Obtain significant high negative curvature regions [DCKW08].
  3. Replace each significant curvature region by its convex hull.
  4. Grow each convex hull by a factor G.
  5. Obtain a kernel density estimate for the data within each grown region [DCKW08].
  6. The curvHDR gate is the union of the level-τ high density regions (HDRs).
- Currently only the 2D version is implemented, but a 3D version will be released soon.
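The last two steps (kernel density estimate, then a level-τ high density region) can be sketched in isolation. This is not the curvHDR implementation: it skips the curvature-significance, convex-hull, and growth steps entirely. It also assumes a fixed-bandwidth Gaussian kernel and the common convention that a level-τ HDR retains roughly a 1 − τ share of the points, namely those with the highest estimated density. The names `kde_at_points`, `hdr_members`, and `bandwidth` are hypothetical.

```python
import math


def kde_at_points(points, bandwidth):
    """Evaluate a 2D Gaussian kernel density estimate at each data point."""
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)
    densities = []
    for x0, y0 in points:
        total = 0.0
        for x1, y1 in points:
            d2 = (x0 - x1) ** 2 + (y0 - y1) ** 2
            total += math.exp(-d2 / (2.0 * bandwidth ** 2))
        densities.append(norm * total)
    return densities


def hdr_members(points, tau, bandwidth):
    """Return the points inside the estimated level-tau HDR (0 < tau < 1).

    The density threshold is the empirical tau-quantile of the estimated
    densities at the data points, so about a tau share of the points
    (those in the sparsest areas) falls outside the region.
    """
    densities = kde_at_points(points, bandwidth)
    cutoff_index = int(tau * len(points))
    threshold = sorted(densities)[cutoff_index]
    return [p for p, d in zip(points, densities) if d >= threshold]
```

For example, on a tight 5 × 5 grid of points plus two distant outliers, `hdr_members(points, tau=0.1, bandwidth=0.5)` keeps the grid and drops the outliers. Raising τ shrinks the region, in line with the HDR level 0.1 and 0.2 examples on the following slides.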

SLIDE 29

curvHDR

curvHDR: Illustration

(Naumann et al., BMC Bioinformatics 2010 Jan 22;11:44)

SLIDE 30

curvHDR

GvHD Data, Sample #7, HDR Level = 0.1

[Figure: curvHDR filter with HDR level = 0.1; axes SSC.H and FL2.H]

SLIDE 31

curvHDR

GvHD Data, Sample #7, HDR Level = 0.2

[Figure: curvHDR filter with HDR level = 0.2; axes SSC.H and FL2.H]

SLIDE 37

Strategy for FlowCAP Challenge 2

Processing Pipeline

[Diagram: input data → Variance Normalization → NMF → asinh → curvHDR → clustered data]

- All computations performed in R.
- Normalize the variance of each variable (channel) to 1.
- Use NMF to reduce the dimensionality to 2: http://cran.r-project.org/web/packages/NMF/index.html
- NMF algorithm: “left matrix” version of Kim and Park’s Alternating Least Squares method [KP07].
- Transform the NMF encodings (the W matrix) with the inverse hyperbolic sine function: $\operatorname{asinh}(x) = \ln\left(x + \sqrt{x^2 + 1}\right)$
- Then use the 2D version of curvHDR to perform clustering: http://www.uow.edu.au/~mwand/Rpacks.html
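The two simple transformations in this pipeline can be sketched as follows. The original work was done in R; this Python version is only an illustration. It assumes that "normalize variance" means dividing each channel by its sample standard deviation without centering, a choice that keeps the data non-negative for NMF but is not confirmed by the slides. The function names `normalize_variance` and `asinh_transform` are hypothetical.

```python
import math
import statistics


def normalize_variance(rows):
    """Scale each column (channel) so that its sample variance is 1.

    Columns are scaled but not centered, so non-negative input stays
    non-negative (an assumption; see the lead-in).
    """
    columns = list(zip(*rows))
    scales = [statistics.stdev(col) for col in columns]
    return [[value / s for value, s in zip(row, scales)] for row in rows]


def asinh_transform(rows):
    """Apply asinh(x) = ln(x + sqrt(x^2 + 1)) to every entry."""
    return [[math.log(v + math.sqrt(v * v + 1.0)) for v in row] for row in rows]
```

In the pipeline order given above, `normalize_variance` would be applied to the raw events before NMF, and `asinh_transform` to the rows of the resulting W matrix before clustering. The explicit logarithm form is written out only to mirror the formula on the slide; `math.asinh` computes the same function.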

SLIDE 38

Strategy for FlowCAP Challenge 2

Example of NMF-curvHDR Results

[Figure: GvHD, sample #7, HDR Level = 0.1; axes NMF Factor #1 and NMF Factor #2]

SLIDE 42

Discussion

- NMF is a relatively new matrix decomposition method that has been used in molecular pattern discovery, cross-platform and cross-species characterization (especially in gene and protein microarray data), biomedical informatics (e.g., text mining), and magnetic resonance spectroscopy [Dev08].
- NMF may have utility for flow cytometry data.
- We could have used some other method (e.g., PCA or ICA) for dimensionality reduction, but decided to use NMF because of its good performance with other types of data (and out of curiosity, too).
- curvHDR is an even newer method than NMF, designed specifically for clustering flow cytometry data.

SLIDE 43

Discussion

END

SLIDE 44

Discussion

References I

[Bre01] L Breiman. Random forests. Machine Learning, 45(1):5–32, March 2001.
[DCKW08] T Duong, A Cowling, I Koch, and MP Wand. Feature significance for multivariate kernel density estimation. Computational Statistics and Data Analysis, 52:4225–4242, May 2008.
[Dev08] K Devarajan. Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Computational Biology, 4(7):e1000029, July 2008.
[GEGMMI08] LA Garcia-Escudero, A Gordaliza, C Matran, and A Mayo-Iscar. A general trimming approach to robust cluster analysis. Annals of Statistics, 36(3):1324–1345, 2008.
[KP07] H Kim and H Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–502, June 2007.

SLIDE 45

Discussion

References II

[LS99] DD Lee and HS Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–91, October 1999.
[NLW10] U Naumann, G Luta, and MP Wand. The curvHDR method for gating flow cytometry samples. BMC Bioinformatics, 11:44, January 2010.
[You09] SS Young. Pre-processing HCS data using non-negative matrix factorization. In Midwest Biopharmaceutical Statistics Workshop, 2009.

SLIDE 52

Discussion

Strategy for FlowCAP Challenge 3: NMF

- Use NMF to define clusters; set k equal to the desired number of clusters.
- Problem #1: NMF is usually used to reduce dimensionality in high-dimensional data.
  - Sometimes the desired number of clusters is actually greater than the dimensionality.
  - E.g., after removing the FS and SS variables, the GvHD data had only four variables (FL1.H, FL2.H, FL3.H, and FL4.H).
  - But GvHD samples #7 and #9 had five groups by manual analysis!
  - Possible kludge: add an extra column for these cases.
- Problem #2: the algorithm fails if there is a row in the data matrix whose entries are all zero.
  - This situation does indeed arise in the FlowCAP data sets after the FS and SS variables are removed.
  - Possible kludge: if there is a row whose entries are all zero, add a small amount of non-negative noise to the entries of that row.
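A hedged sketch of the second kludge, together with one plausible way to turn NMF encodings into cluster labels. The slides do not state how clusters are read off the factorization; assigning each event to the factor with the largest encoding is a common convention and is assumed here. The function names `jitter_zero_rows` and `assign_clusters` and the parameters `scale` and `seed` are hypothetical.

```python
import random


def jitter_zero_rows(rows, scale=1e-6, seed=0):
    """Return a copy in which every all-zero row gets small positive noise.

    Rows with at least one non-zero entry are copied unchanged. The noise
    is drawn from [0.5 * scale, scale], so it is strictly positive.
    """
    rng = random.Random(seed)
    fixed = []
    for row in rows:
        if all(v == 0 for v in row):
            fixed.append([scale * rng.uniform(0.5, 1.0) for _ in row])
        else:
            fixed.append(list(row))
    return fixed


def assign_clusters(w):
    """Label each row of W with the index of its largest encoding.

    Ties go to the lowest index (an arbitrary choice).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in w]
```

With k set to the desired number of clusters, `assign_clusters(w)` yields one label in 0..k-1 per event. Note that this does nothing about Problem #1: with four channels, k = 5 still exceeds the dimensionality.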

SLIDE 56

Discussion

Strategy for FlowCAP Challenge 3: Trimmed Clustering

- Instead, try the trimmed clustering method (tCLUST; [GEGMMI08]).
- As in curvHDR, a fraction of the observations are discarded (“trimmed”) and considered not part of any group.
- Unlike curvHDR, the desired number of clusters is specified as input to the algorithm.
- http://cran.r-project.org/web/packages/tclust/index.html
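The trimming idea can be illustrated with trimmed k-means, the simplest member of this family. This is not the tclust algorithm, which is more general; it is a sketch that assumes Euclidean distance, caller-supplied starting centers, and a fixed number of iterations. The names `trimmed_kmeans`, `centers`, `alpha`, and `iterations` are hypothetical, with `alpha` playing the role of the trimming fraction.

```python
def trimmed_kmeans(points, centers, alpha, iterations=20):
    """Cluster points around the given starting centers, trimming outliers.

    In each iteration the int(alpha * n) points farthest from their nearest
    center are left unassigned (label -1), and each center is recomputed
    from the points that remain assigned to it.
    """
    n = len(points)
    n_keep = n - int(alpha * n)
    centers = [list(c) for c in centers]
    labels = [-1] * n
    for _ in range(iterations):
        nearest = []
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            nearest.append((dists[best], idx, best))
        nearest.sort()  # closest points first
        labels = [-1] * n
        for _, idx, best in nearest[:n_keep]:
            labels[idx] = best
        for j in range(len(centers)):
            members = [points[i] for i in range(n) if labels[i] == j]
            if members:  # leave a center in place if it lost all its points
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

As with the method described above, the number of clusters is fixed by the caller (here through the number of starting centers), and trimmed observations belong to no group.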

SLIDE 57

Discussion

Example of TCLUST Results (GvHD data, sample #7)

[Figure: classification with k = 5, alpha = 0.05; axes: first and second discriminant coordinates]

SLIDE 63

Discussion

Strategy for FlowCAP Challenge 4: Random Forests

- Classification and Regression Tree (CART), recursive partitioning (similar to Decision Trees and C4.5):
  - Interpretability
  - Can handle heterogeneous data (continuous/interval, categorical, ordinal)
  - Can handle missing data
- “Random forests” [Bre01] is an ensemble extension of the Classification and Regression Tree (CART) approach: http://cran.r-project.org/web/packages/randomForest/
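The ensemble idea can be shown with a deliberately tiny toy: each "tree" is a one-split decision stump fitted to a bootstrap sample on one randomly chosen feature, and the forest predicts by majority vote. Real random forests grow full trees and consider a random subset of features at every split, so this sketch only conveys the bagging-plus-randomization principle. The function names `fit_stump`, `fit_forest`, and `predict` and the parameters `n_trees` and `seed` are hypothetical and unrelated to the randomForest package's interface.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature with the fewest training errors."""
    best = None
    values = sorted(set(r[feature] for r in rows))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    if best is None:  # only one distinct value: always predict the majority
        majority = Counter(labels).most_common(1)[0][0]
        best = (0, feature, values[0], majority, majority)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], feature))
    return forest


def predict(forest, row):
    """Return the majority vote of all stumps for one observation."""
    votes = Counter(left if row[feature] <= threshold else right
                    for feature, threshold, left, right in forest)
    return votes.most_common(1)[0][0]
```

An odd `n_trees` avoids tied votes in two-class problems.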