SLIDE 1

Phenotyping and Robust Feature Selection for Flow Cytometry Data

Nima Aghaeepour

CIHR/MSFHR Strategic Training Program in Bioinformatics for Health Research,

University of British Columbia

Sep 22, 2011

1 / 24

SLIDE 2

Introduction

Problem statement: Find cell populations that correlate with an external variable (e.g., a clinical outcome).

Approach:
flowType: Phenotyping.
FeaLect: Feature Selection for Sample Classification.

SLIDE 3

Dataset

United States Military HIV Natural History Study

PBMCs of 466 HIV+ personnel and beneficiaries from Army, Navy, Marines, and Air Force.

13 surface markers and KI-67 (cell proliferation).

Clinical data: survival times including 135 events, where an event is defined as progression to AIDS or initiation of HAART.

SLIDE 4

Manual Gates

SLIDE 5

Manual Gating Results

Results:

1. Frequency of long-lived Memory Cells (CD127+) has a positive correlation.
2. Frequency of cells with high proliferation (KI-67+) has a negative correlation.

SLIDE 6

Results

Previous results:

1. Frequency of long-lived Memory Cells (CD127+) has a positive correlation.
2. Frequency of cells with high proliferation (KI-67+) has a negative correlation.

New results:

1. Frequency of “short-lived” cells with high proliferation (CD127−KI-67+) has a negative correlation.
2. Frequency of terminal effector T-cells has a negative correlation.
3. Frequency of transitional memory T-cells has a negative correlation.

[Figure: three Kaplan-Meier curves of event-free proportion vs. years from cell sample, each comparing the patients with the lowest and the highest frequency of a phenotype.
Curve 1: lowest 371 (86%), highest 59 (14%), p < 8.6e−13.
Curve 2: lowest 387 (90%), highest 43 (10%), p < 1.8e−06.
Curve 3: lowest 356 (83%), highest 74 (17%), p < 4.6e−10.]

SLIDE 7

Clustering

flowMeans was used in one dimension at a time.

A marker can only be positive or negative. Other (e.g., dim) populations will be resolved in other dimensions.

A marker can be neutral.

[Figure: CD28 vs. CD45RO plot with regions labeled Negative, Positive, and Neutral on each axis.]
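A minimal sketch of the three-state encoding described on this slide, in Python. It assumes each marker has already been split into a positive and a negative population by a single threshold; the talk uses flowMeans for that step, so the fixed `thresholds` dictionary, the function names, and the toy cells below are illustrative stand-ins, not the flowType implementation. A neutral marker is simply left out of the phenotype.

```python
from itertools import product

def enumerate_phenotypes(markers):
    """Yield every phenotype as a dict; neutral markers are left out."""
    for states in product("+-0", repeat=len(markers)):
        yield {m: s for m, s in zip(markers, states) if s != "0"}

def cell_matches(cell, phenotype, thresholds):
    """True if the cell agrees with every non-neutral marker of the phenotype."""
    for marker, state in phenotype.items():
        is_positive = cell[marker] >= thresholds[marker]
        if is_positive != (state == "+"):
            return False
    return True

def phenotype_frequency(cells, phenotype, thresholds):
    """Fraction of cells that fall inside the phenotype."""
    return sum(cell_matches(c, phenotype, thresholds) for c in cells) / len(cells)

# Toy data (made up for illustration): five cells, two markers.
thresholds = {"CD28": 0.5, "CD45RO": 0.5}
cells = [
    {"CD28": 0.9, "CD45RO": 0.1},
    {"CD28": 0.2, "CD45RO": 0.8},
    {"CD28": 0.1, "CD45RO": 0.9},
    {"CD28": 0.7, "CD45RO": 0.7},
    {"CD28": 0.3, "CD45RO": 0.2},
]

# CD28-CD45RO+ : both markers constrained.
freq_neg_pos = phenotype_frequency(cells, {"CD28": "-", "CD45RO": "+"}, thresholds)
# CD28- : CD45RO is neutral, so it is ignored.
freq_neg = phenotype_frequency(cells, {"CD28": "-"}, thresholds)

# Two markers with three states each give 3**2 = 9 phenotypes.
two_marker_phenotypes = list(enumerate_phenotypes(["CD28", "CD45RO"]))
```

With ten markers the same enumeration yields 3**10 = 59,049 phenotypes, including the all-neutral one that contains every cell.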

SLIDE 8

Phenotyping

1. 3^10 ≈ 60,000 phenotypes
2. Cox Proportional Hazards Regression
3. Log rank test
4. Multiple testing
5. Sensitivity analysis
6. 101 phenotypes remain statistically significant

#    Phenotype               p-value   p-value CI            adj p-value   CPHR Coef   Cell Freq
1    KI-67+CD8+CD27-         6.4e-07   (1.1e-12, 3.6e-03)    3e-04         35.2        0.00560
2    KI-67+CD8+CD57-         1.1e-06   (2.7e-13, 3.5e-03)    2e-06         28.3        0.00648
3    KI-67+CD45RO+           8.9e-07   (2.1e-14, 2.0e-03)    4e-05         15.4        0.01343
4    KI-67+CD28-CD8-         8.3e-08   (6.9e-14, 1.6e-03)    2e-04         44.2        0.00523
5    KI-67+CD28-CD27-        7.1e-08   (1.5e-13, 3.0e-03)    2e-05         26.3        0.00874
6    KI-67+CD28-             1.9e-07   (3.9e-13, 3.3e-03)    2e-05         18.3        0.01053
7    KI-67+CD28-CD27-CCR7-   3.3e-09   (6.6e-14, 8.6e-04)    4e-04         43.0        0.00647
8    KI-67+CD28-CCR7-        3.3e-09   (3.2e-13, 7.6e-04)    3e-03         37.7        0.00739
9    KI-67+CD57-CD27-CCR7-   1.2e-08   (1.3e-13, 3.4e-03)    1e-03         36.8        0.00762
10   KI-67+CD57-CCR7-        2.7e-08   (5.3e-15, 1.2e-02)    2e-05         26.6        0.01008
...
101  KI-67+CD8+CD27-         6.4e-07   (2.3e-14, 1.1e-02)    2e-02         35.2        0.00560
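The log rank test named in the list above compares the survival of two groups, for example patients with a low versus a high frequency of one phenotype. The following is a minimal pure-Python sketch of the standard two-group test, not the code used in the study; the function name and the toy survival times are assumptions made for illustration. Events are coded 1 (event observed) or 0 (censored).

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log rank test; returns (chi_square, p_value) with 1 degree of freedom."""
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk = [s for s in samples if s[0] >= t]
        n = len(at_risk)                               # subjects still at risk
        n_a = sum(1 for s in at_risk if s[2] == 0)     # of which in group A
        d = sum(1 for s in samples if s[0] == t and s[1])                 # events at t
        d_a = sum(1 for s in samples if s[0] == t and s[1] and s[2] == 0)  # events in A
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in group A at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Toy example (made-up times): group A has all its events early, group B late.
chi_sep, p_sep = logrank_test([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
# Identical groups give no evidence of a difference.
chi_same, p_same = logrank_test([1, 2, 3, 4], [1] * 4, [1, 2, 3, 4], [1] * 4)
```

Running such a test once per phenotype produces tens of thousands of p-values, which is why the multiple testing and sensitivity analysis steps follow it in the list.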

SLIDE 9

Clustering the Phenotypes

Pearson correlation:

[Figure: heatmap of the pairwise Pearson correlation between phenotypes (phenotypes on both axes), with a color key and density plot; color key values range from 0.2 to 1.]

SLIDE 10

Clustering the Phenotypes

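Pearson correlation:

[Figure: heatmap of the pairwise Pearson correlation between phenotypes (phenotypes on both axes), with a color key and density plot; color key values range from 0.2 to 1.]

Phenotype names:

[Figure: markers (KI-67, CD28, CD45RO, CD8, CD4, CD57, CCR5, CD27, CCR7, CD127) by phenotypes, with each marker shown as Positive, Neutral, or Negative.]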
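A small sketch of how phenotypes could be grouped by Pearson correlation. The slides do not say which vectors are correlated or which clustering method draws the heatmap, so this assumes each phenotype is described by its cell frequency in every sample and uses a simple greedy grouping as a stand-in; the function names, the 0.9 cutoff, and the toy frequencies are all illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)

def group_by_correlation(freqs, cutoff=0.9):
    """Greedy grouping: a phenotype joins the first group whose first member
    correlates with it at or above the cutoff, otherwise it starts a new group."""
    groups = []
    for name, values in freqs.items():
        for group in groups:
            if pearson(freqs[group[0]], values) >= cutoff:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Toy per-sample cell frequencies (made up), one list entry per sample.
freqs = {
    "KI-67+": [0.01, 0.02, 0.03, 0.04],
    "KI-67+CD28-": [0.005, 0.011, 0.014, 0.021],
    "CD57+CD27+": [0.04, 0.01, 0.03, 0.02],
}
groups = group_by_correlation(freqs)
```

Nested phenotypes such as KI-67+ and KI-67+CD28- tend to be highly correlated because one population contains the other, which is one reason grouping is useful before interpreting the list of significant phenotypes.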

SLIDE 11

Marker Impacts

Impact Value: Force each marker to be neutral (remove it). Measure its contribution to the results.

Interpretation:

1. Does removing the marker make the phenotype “less significant”?
2. We have to use an effect size (like root mean square error).

[Figure: three bar plots of marker impact for KI-67, CD28, CD45RO, CD8, CD4, CD57, CCR5, CD27, CCR7, and CD127; x-axis: Phenotype Name; legend: Positive, Mixed, Negative.]
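One way to read the impact value is sketched below: force a marker to neutral in every phenotype that uses it, and summarize how much a significance score changes with a root mean square. This is an interpretation under stated assumptions, not the formula from the talk. The `score` callable (here a lookup of made-up −log10 p-values) and all names are hypothetical.

```python
import math

def without_marker(phenotype, marker):
    """Copy of the phenotype with the given marker forced to neutral."""
    return frozenset((m, s) for m, s in phenotype if m != marker)

def marker_impact(phenotypes, marker, score):
    """Root mean square change in score when the marker is made neutral,
    taken over the phenotypes that actually use the marker."""
    diffs = [
        score(ph) - score(without_marker(ph, marker))
        for ph in phenotypes
        if any(m == marker for m, _ in ph)
    ]
    if not diffs:
        return 0.0
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Toy scores (made up, not study results): higher means more significant.
both = frozenset({("KI-67", "+"), ("CD127", "-")})
ki67_only = frozenset({("KI-67", "+")})
cd127_only = frozenset({("CD127", "-")})
toy_scores = {both: 9.0, ki67_only: 8.0, cd127_only: 2.0, frozenset(): 0.0}

phenotypes = [both, ki67_only]
impact_ki67 = marker_impact(phenotypes, "KI-67", toy_scores.get)
impact_cd127 = marker_impact(phenotypes, "CD127", toy_scores.get)
```

In this toy example removing KI-67 costs far more score than removing CD127, so KI-67 would be ranked as the more important marker.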

SLIDE 12

Marker Selection

Now we can select the markers with a significant impact:

[Figure: three bar plots of marker impact for KI-67, CD28, CD45RO, CD8, CD4, CD57, CCR5, CD27, CCR7, and CD127; x-axis: Phenotype Name; legend: Positive, Mixed, Negative.]

And extract 3 phenotypes:

#   Phenotype                                    p-value   p-value CI            adjusted p-value   Cell Frequency
1   KI-67+CD4-CCR5+CD127-                        1.7e-10   (0.0e+00, 1.0e-05)    1.7e-08            0.00704
2   CD45RO-CD8+CD4-CD57+CCR5-CD27+CCR7-CD127-    1.2e-07   (0.0e+00, 7.7e-05)    1.3e-05            0.00068
3   CD28-CD45RO+CD4-CD57-CD27-CD127-             6.5e-08   (2.2e-16, 1.9e-05)    6.5e-06            0.02456

SLIDE 13

Marker Elimination

#   Phenotype                                    p-value   p-value CI            adjusted p-value   Cell Frequency
1   KI-67+CD4-CCR5+CD127-                        1.7e-10   (0.0e+00, 1.0e-05)    1.7e-08            0.00704
2   CD45RO-CD8+CD4-CD57+CCR5-CD27+CCR7-CD127-    1.2e-07   (0.0e+00, 7.7e-05)    1.3e-05            0.00068
3   CD28-CD45RO+CD4-CD57-CD27-CD127-             6.5e-08   (2.2e-16, 1.9e-05)    6.5e-06            0.02456

Some of the markers are redundant.

Finding small cell populations is hard.

In clinics/developing world we can have a large number of measurements.

Do we need all of the markers?

[Figure: −log10(p-value) by phenotype name as markers are eliminated: KI-67+CD4−CCR5+CD127−, KI-67+CD4−CD127−, KI-67+CD127−, KI-67+.]

SLIDE 14

Are we "over-fitting"?

Resampling-based Sensitivity Analysis

[Figure: three bar plots (Group 1, Group 2, Group 3) of bootstrap percentage for the phenotypes in each group. Group 1 contains KI-67+ phenotypes (axis 20 to 80), Group 2 contains CD57+CD27+CD127− phenotypes (axis 10 to 40), and Group 3 contains CD28−CD57− phenotypes (axis 10 to 40).]
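The bootstrap percentage in the figure can be read as the share of resampled datasets in which a phenotype is still called significant. A minimal sketch of that idea follows; the resampling scheme, the function name, and the `is_significant` callable are assumptions, since the slides do not give the details. In practice `is_significant` would rerun the survival test on the resampled patients.

```python
import random

def bootstrap_percentage(samples, is_significant, n_boot=200, seed=0):
    """Percentage of bootstrap resamples (drawn with replacement, same size
    as the original) for which is_significant(resample) returns True."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        resample = [rng.choice(samples) for _ in samples]
        if is_significant(resample):
            hits += 1
    return 100.0 * hits / n_boot

# Toy stand-in for a significance test: "the mean is above zero".
def mean_is_positive(values):
    return sum(values) / len(values) > 0

always = bootstrap_percentage([0.2, 0.5, 0.9, 1.3], mean_is_positive)
never = bootstrap_percentage([-0.2, -0.5, -0.9, -1.3], mean_is_positive)
```

A phenotype that is significant in only a small share of resamples probably owes its p-value to a few influential patients, which is the over-fitting concern raised in the slide title.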

SLIDE 15

[Figure: overview of the pipeline, combining the plots from the previous slides into six panels.
(A) Population Identification: CD28 vs. CD45RO plot with Negative, Positive, and Neutral regions.
(B) Statistical Modeling: 1. Cox Proportional Hazards Regression, 2. Sensitivity Analysis, 3. Multiple Testing Correction.
(C) Grouping: correlation heatmap of the phenotypes with a color key and density plot, and the resulting phenotype groups.
(D) Marker Selection: marker impact for KI-67, CD28, CD45RO, CD8, CD4, CD57, CCR5, CD27, CCR7, and CD127 (legend: Positive, Mixed, Negative).
(E) Marker Elimination: −log10(p-value) and % cell frequency by phenotype name as markers are removed from one phenotype of each group.
(F) Kaplan-Meier Curves: event-free proportion vs. years from cell sample for the lowest and highest frequency groups (p < 8.6e−13, p < 1.8e−06, p < 4.6e−10).]

SLIDE 16

A Cell Hierarchy based on a Clinical Outcome

CD45RO-CD8+CD4-CD57+CCR5-CD27+CCR7-CD127-

A hierarchy based on the predictive power.

Explains the relationship between the markers.

Thickness of arrows: increase in the predictive power.

Yellow: high predictive power.

SLIDE 17

Phenotype Hierarchies

CD28-CD45RO+CD4-CD57-CD27-CD127-

SLIDE 18

Feature Selection (FeaLect)

Problem statement: flowType is a univariate exploratory analysis tool. It can be used to construct a multivariate predictive model.

FeaLect:

1. Use a mathematical model and bagging to score the phenotypes for a linear model.
2. Use the selected phenotypes to construct a linear model (L1 penalization).
3. Perform cross-validation.
4. Perform hold-out validation.
5. Label the test-set.
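Step 1 above can be illustrated with a simplified stand-in: fit an L1-penalized linear model on many random subsets of the samples and score each feature by how often it receives a nonzero coefficient. This is only a sketch of the general bagging idea, not the algorithm implemented in the FeaLect package. The coordinate-descent solver, the fixed penalty `lam`, the function names, and the synthetic data are all assumptions made for illustration.

```python
import random

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + lam*||w||_1, no intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = z = 0.0
            for i in range(n):
                # Prediction from all features except j (partial residual).
                others = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return w

def bagged_selection_scores(X, y, lam, n_bags=20, bag_fraction=0.75, seed=0):
    """Fraction of random subsets in which each feature gets a nonzero weight."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    size = max(2, int(bag_fraction * n))
    counts = [0] * p
    for _ in range(n_bags):
        idx = rng.sample(range(n), size)
        w = lasso_fit([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    return [c / n_bags for c in counts]

# Synthetic data: only feature 0 drives the response, the other three are noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]
y = [3.0 * row[0] + data_rng.gauss(0, 0.1) for row in X]
scores = bagged_selection_scores(X, y, lam=0.5)
```

Features with a high selection score across subsets are the ones worth keeping for the final L1-penalized model in step 2, which is then checked by cross-validation and hold-out validation.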

SLIDE 19

HVTNa Phenotype: CD4+CD8−IL2+IL4+ IFNg+TNFa+

[Figure: F-measures (axis 0.75 to 1.00) by algorithm: flowCore-flowStats, flowType-FeaLect, Kmeanssvm, PRAMS, SPADE, SWIFT, PBSC, PramSpheres, flowType.]

SLIDE 20

AML Phenotype: FS−SS−CD15− CD13+CD45−

[Figure: F-measures (axis 0.70 to 1.00) by algorithm: flowPeakssvm, flowType-FeaLect, SPADE, 2DhistsSVM, EMMIXCYTOM, flowType, RandomSpheres, flowBin, PBSC.]

SLIDE 21

HEU vs UE

flowType’s statistical tests and FeaLect’s cross-validation failed.

flowType-FeaLect is worse than random.

flowType scored better than flowType-FeaLect!

Have we been able to find something meaningful?

Cross-validation

Holdout validation (using other time points).

[Figure: F-measures (axis 0.0 to 0.6) by algorithm: SWIFT, flowType, PBSC, 2DhistsSVM, PramSpheres, flowType-FeaLect, flowBin.]

SLIDE 22

Conclusion

I presented a phenotyping approach that analyzes the data one marker at a time.

If a marker has more than two populations:

They will be resolved using other markers.

The pipeline can be modified to allow more than two populations per channel.

We can use any population identification algorithm, manual gates, or controls.

The ability to exclude the markers allows us to study the relationship between them and extract a cell hierarchy based on their ability to predict a clinical outcome.

Complex flow cytometry assays can be used to design simpler panels that can be used in resource-poor settings.

Availability

flowType is available through Bioconductor.

FeaLect is available through CRAN.

SLIDE 23

[Figure: three scatter plots of FS Lin vs. SS Log with the percentage of cells in the highlighted population. Normal: 0.9%. AML: 21%. Outlier: 17%.]

SLIDE 25

Acknowledgements

BCCA: Ryan Brinkman, Habil Zare, Kieran O’Neill, and Adrin Jalali.

UBC: Holger Hoos

NIH/USMIL: Mario Roederer, Pratip Chattopadhyay, Anurdha Ganesan

Funding: NIAID Intramural Research Program; NIH/NIBIB grant EB008400; an NSERC discovery grant held by HHH; National Cancer Institute; NIH (contract HSN261200800001E); Military Infectious Disease Research Program, US Army Medical Research and Materiel Command; Infectious Disease Clinical Research Program; Uniformed Services University of the Health Sciences.

Computing Resources: Western Canada Research Grid (WestGrid), Compute/Calcul Canada, and Canada’s Michael Smith Genome Sciences Center.