Classifying HIV Vaccination Status with Regularized Logistic Regression

SLIDE 1

Classifying HIV Vaccination Status with Regularized Logistic Regression

Ariful Azad, Arif Khan, Bartek Rajwa, Alex Pothen

Purdue University

FlowCAP-III, NIH, November 29-30, 2012

This research was supported by grant 1R21EB015707 from the National Institute of Biomedical Imaging and Bioengineering and NSF grant CCF-1218916

FlowCAP-III Ariful Azad, Arif Khan, Bartek Rajwa, Alex Pothen 1

SLIDE 2

Overview

Problem: Predict the vaccination status (pre- or post-vaccination) of samples from HIV patients. Half of the samples, with known vaccination status, are given as a training set.

Method: We used the fraction of cells in different combinations of Boolean gates, together with the median fluorescence intensity (MFI), as features or explanatory variables. We then trained a logistic regression model with Lasso regularization (RLR) on the training set and obtained a sparse model with four predictive features.

Results: The optimized RLR model performs well on the training set, with four misclassifications (out of 37). On the test set, the model classifies 29 out of 37 samples with high confidence.

SLIDE 3

Problem Description

Dataset

An HIV vaccine was administered to 74 subjects, each sampled at two time points (before and after vaccination): 37 subjects are in the training set and 37 in the test set. At each time point we have one POL-3 stimulated sample and two negative controls. Each sample has six markers. CD3, CD4, and CD8 identify T cell subpopulations. The remaining markers are the cytokines TNFa, IFNg, and IL2.

[Figure: samples for Subject 1. Before vaccination: two negative controls and a POL-3 stimulated sample. After vaccination: two negative controls and a POL-3 stimulated sample.]

SLIDE 4

Preprocessing

Automated CD4+ and CD8+ T cell gating

We used norm1filter and norm2filter from flowCore to perform the automated gating.

[Figure: six gating panels. Remove doublets (axes FSC.H, FSC.A); remove dead cells (axes ViViD, FSC.A); scatter gate (axes SSC.A, FSC.A); T cells (axes CD8, CD3); CD4+ T cells (axes CD8, CD4); CD8+ T cells (axes CD8, CD4).]

SLIDE 5

Preprocessing

Automated Cytokine gating

We applied patient-specific normalization to all six samples from a particular subject and used norm2filter to identify TNFa+, IFNg+, and IL2+ cells. Cytokine-positive cells are extremely rare among CD8+ T cells, so we used the CD8+ features mainly when the CD4+ features were unable to classify a pair of samples.

[Figure: cytokine gates on CD4+ T cells, three panels with axes SSC.A and TNFa, SSC.A and IFNg, and SSC.A and IL2.]

SLIDE 6

Feature Selection

For each sample, we computed a Boolean (positive/negative) gating for each of the three cytokines. The Boolean gates can then be combined in $3^3 = 27$ ways by considering positive, neutral, and negative levels of expression. However, we kept only those combinations with at least one positive cytokine. We consider the fraction of cells within a Boolean gate combination as a feature. In addition, we included the median fluorescence intensity (MFI) of the three cytokines as features in our model. Hence, we have about 21 features.
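The enumeration of Boolean gate combinations can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a "neutral" level means the cytokine is ignored in the gate, and the function name `boolean_gate_combinations` is hypothetical. ASCII `+` and `-` stand for positive and negative expression.

```python
from itertools import product

CYTOKINES = ("TNFa", "IFNg", "IL2")
# Assumption: "" encodes the neutral level, i.e. the cytokine is not
# part of the gate definition.
LEVELS = ("+", "-", "")


def boolean_gate_combinations():
    """List the gate combinations that contain at least one positive cytokine."""
    combos = []
    for levels in product(LEVELS, repeat=len(CYTOKINES)):
        if "+" not in levels:
            continue  # drop combinations without any positive cytokine
        name = "".join(c + s for c, s in zip(CYTOKINES, levels) if s)
        combos.append(name)
    return combos


if __name__ == "__main__":
    for name in boolean_gate_combinations():
        print(name)
```

Under this reading, 27 − 8 = 19 combinations survive the filter (the 8 dropped ones use only negative or neutral levels), so together with the three MFI features the count is close to the "about 21" quoted on the slide.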

SLIDE 7

Feature Selection

Model selection

The dependent variable is the vaccination status of a sample (vaccinated or not vaccinated). Therefore, this is a binary classification problem. We used logistic regression for this classification.

SLIDE 8

Logistic Regression Model

Logistic Regression

Widely used for binary classification, e.g., vaccinated vs. not vaccinated.
Explanatory variables $x_i$, such as the fraction of cells in a combination of Boolean gates, e.g., TNFa+IFNg−IL2+.
Dependent variable $y_i$: $y_i = 1$ for vaccinated and $y_i = 0$ for not vaccinated.
Probability of the $i$-th sample being vaccinated: $p_i$.
Log odds for the event $y_i = 1$:

\[ \operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) \]

SLIDE 9

Logistic Regression Model

Logistic Regression

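\[ \operatorname{logit}(p_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_d x_{id} = \beta_0 + x_i^T \beta \]

\[ p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^T \beta)}} \quad \text{(logistic function)} \]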
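The relationship between the log odds and the logistic function can be checked with a small sketch. The function names below are illustrative, and the coefficients in the usage note are made up for the example rather than taken from the fitted model.

```python
import math


def logistic(z):
    """Map a linear predictor z = beta0 + x^T beta to a probability."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    """Log odds of a probability p in (0, 1); the inverse of logistic()."""
    return math.log(p / (1.0 - p))


def predict_probability(beta0, beta, x):
    """Probability that a sample with feature vector x has y = 1."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return logistic(z)
```

For instance, `predict_probability(0.0, [1.0, -1.0], [2.0, 2.0])` has a linear predictor of zero and therefore returns a probability of 0.5.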

SLIDE 10

Logistic Regression Model

Maximum Likelihood Solution

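The dependent variable follows a binomial distribution, $y_i \sim \mathrm{Bin}(1, p_i)$.
Maximize the log likelihood:

\[ \max_{(\beta_0, \beta) \in \mathbb{R}^{d+1}} \sum_{i=1}^{n} \left\{ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right\} \]

which is equivalent to

\[ \max_{(\beta_0, \beta) \in \mathbb{R}^{d+1}} \sum_{i=1}^{n} \left\{ y_i (\beta_0 + x_i^T \beta) - \log\left(1 + e^{\beta_0 + x_i^T \beta}\right) \right\} \]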
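The equivalence of the two forms follows from $\log(1 - p_i) = -\log(1 + e^{\beta_0 + x_i^T \beta})$ and $\log(p_i / (1 - p_i)) = \beta_0 + x_i^T \beta$. A minimal NumPy sketch that evaluates both forms is given below; the function names are illustrative and any data passed in is up to the caller.

```python
import numpy as np


def loglik_probability_form(beta0, beta, X, y):
    """Sum of y_i*log(p_i) + (1 - y_i)*log(1 - p_i)."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def loglik_linear_form(beta0, beta, X, y):
    """Sum of y_i*eta_i - log(1 + exp(eta_i)), with eta_i = beta0 + x_i^T beta."""
    eta = beta0 + X @ beta
    # np.logaddexp(0, eta) computes log(exp(0) + exp(eta)) = log(1 + exp(eta)).
    return np.sum(y * eta - np.logaddexp(0.0, eta))
```

Both functions should agree up to floating-point error for any coefficients, which is a quick way to confirm the algebra.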

SLIDE 11

Logistic Regression Model

Lasso Regularization

Pick the predictive features by penalizing models with too many parameters [Friedman et al. 2009].
Maximize the penalized log likelihood:

\[ \max_{(\beta_0, \beta) \in \mathbb{R}^{d+1}} \sum_{i=1}^{n} \left\{ y_i (\beta_0 + x_i^T \beta) - \log\left(1 + e^{\beta_0 + x_i^T \beta}\right) \right\} - \lambda \lVert \beta \rVert_1 \]

The penalty selects a sparse solution with few non-zero coefficients $\beta_j$.

We used the R package glmnet by Jerome Friedman, Trevor Hastie, and Rob Tibshirani.
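To show how the L1 penalty drives coefficients to exactly zero, here is a minimal NumPy sketch. It is not glmnet and not the authors' code: it swaps in proximal gradient descent with soft-thresholding as the solver, and it assumes the log likelihood is averaged over the n samples, so `lam` is on a per-sample scale. The function names are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_lasso_logistic(X, y, lam, n_iter=2000):
    """Minimize (1/n) * negative log likelihood + lam * ||beta||_1.

    The intercept b0 is not penalized. Returns (b0, beta).
    """
    n, d = X.shape
    # Step size 1/L, where L bounds the curvature of the averaged
    # negative log likelihood: L = sigma_max([1, X])^2 / (4 n).
    X_aug = np.hstack([np.ones((n, 1)), X])
    step = 4.0 * n / np.linalg.norm(X_aug, 2) ** 2
    b0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(b0 + X @ beta)))
        residual = p - y  # gradient of the loss w.r.t. the linear predictor
        b0 -= step * residual.mean()
        beta = soft_threshold(beta - step * (X.T @ residual) / n, step * lam)
    return b0, beta
```

With a large `lam` every coefficient stays at exactly zero, and as `lam` shrinks more features enter the model, which is the sparsity behaviour the slide describes.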

SLIDE 12

Results

Model Parameter Selection

The model parameters to be selected are β0, β1, ..., βd and λ.
For a fixed λ, β0, β1, ..., βd are estimated by maximizing the penalized log likelihood.
λ is selected by n-fold cross-validation, minimizing the binomial deviance $\sum_i o_i \log(o_i / e_i)$.

[Figure: cross-validation curve of binomial deviance (y-axis) against log(Lambda) (x-axis); the top axis shows the number of features selected.]
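The selection of λ by cross-validated deviance can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the procedure the authors ran (in R, glmnet provides `cv.glmnet` for this): the solver is a simple proximal gradient loop, the fold count `k=5` is an arbitrary choice, and all function names are hypothetical.

```python
import numpy as np


def fit_l1_logistic(X, y, lam, n_iter=500):
    """Proximal-gradient fit of (1/n) * negative log likelihood + lam * ||beta||_1."""
    n, d = X.shape
    X_aug = np.hstack([np.ones((n, 1)), X])
    step = 4.0 * n / np.linalg.norm(X_aug, 2) ** 2
    b0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        residual = 1.0 / (1.0 + np.exp(-(b0 + X @ beta))) - y
        b0 -= step * residual.mean()
        z = beta - step * (X.T @ residual) / n
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b0, beta


def mean_deviance(X, y, b0, beta):
    """Mean binomial deviance: -2 * average log likelihood for 0/1 outcomes."""
    p = 1.0 / (1.0 + np.exp(-(b0 + X @ beta)))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    return -2.0 * np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def cv_select_lambda(X, y, lambdas, k=5, seed=0):
    """Return the lambda with the lowest k-fold mean deviance, and all scores."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for lam in lambdas:
        fold_scores = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            b0, beta = fit_l1_logistic(X[train], y[train], lam)
            fold_scores.append(mean_deviance(X[held_out], y[held_out], b0, beta))
        scores.append(np.mean(fold_scores))
    scores = np.array(scores)
    return lambdas[int(np.argmin(scores))], scores
```

Plotting `scores` against `np.log(lambdas)` would give a curve of the same kind as the cross-validation figure on this slide.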

SLIDE 13

Results

Significance of the selected features

A sparse solution with only four features being used.

Feature             Coefficient in the model
MFI TNFa+           2.293
TNFa+IFNg+IL2+      1.421
TNFa+IFNg−IL2−      0.397
TNFa−IFNg−IL2+      −0.844

Table: Optimal Solution of the Regularized (Lasso) Logistic Regression

SLIDE 14

Results

Model verification by incremental feature selection

Build logistic regression models by incrementally adding features, so that increasingly complex models are built from simpler ones. Misclassification decreases as we include more features.

Incremental model features   p-value    AIC     Training misclassifications
MFI TNFa+                    2.46e-07   79.95   8
TNFa+IFNg+IL2+               2.20e-08   73.33   6
TNFa+IFNg−IL2−               3.15e-08   72.81   5
TNFa−IFNg−IL2+               4.69e-09   67.93   4

SLIDE 15

Results

Predicting vaccination status

The RLR model predicts the probability of a sample being vaccinated: low probability for non-vaccinated samples and high probability for vaccinated samples.
From a pair of samples (before and after vaccination) from a patient, the sample with the higher probability is predicted as vaccinated.
Example: Let p(s1) and p(s2) be the probabilities predicted by a trained RLR model for a pair of samples, s1 and s2, from a patient. If p(s1) > p(s2), then the model predicts s1 as vaccinated, and vice versa. |p(s1) − p(s2)| indicates the confidence in the prediction.
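The pairwise decision rule can be written as a short sketch. The function name `predict_pair` is hypothetical, and the handling of ties is an assumption, since the slide does not say what happens when the two probabilities are equal.

```python
def predict_pair(p_s1, p_s2):
    """Decide which sample of a patient's pair is the post-vaccination one.

    Returns (index, confidence): index is 0 if s1 is predicted as
    vaccinated and 1 if s2 is; confidence is |p(s1) - p(s2)|.
    Assumption: a tie is resolved in favour of s2.
    """
    vaccinated_index = 0 if p_s1 > p_s2 else 1
    confidence = abs(p_s1 - p_s2)
    return vaccinated_index, confidence


if __name__ == "__main__":
    index, confidence = predict_pair(0.75, 0.25)
    print(f"sample s{index + 1} predicted as vaccinated, confidence {confidence}")
```

A pair with nearly equal probabilities yields a confidence near zero, which corresponds to the low-confidence predictions discussed for the test set.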

SLIDE 16

Results

Prediction in the training set

There are four misclassifications in the training set. Misclassified samples are marked with green circles.

SLIDE 17

Results

Prediction in the test set

Eight pairs of samples are predicted with low confidence (green circles). Thus 29 of the 37 pairs (about 78%) are classified with high confidence.

SLIDE 18

Summary

We used a logistic regression model with Lasso regularization (RLR) to classify samples into HIV vaccinated and not-vaccinated classes. The RLR model was able to automatically select the features predictive of the vaccination status. Results: The optimized RLR model performs well on the training set, with four misclassifications (out of 37). On the test set, the model classifies 29 out of 37 samples with high confidence.

SLIDE 19

Thanks

Thank you!

SLIDE 20

supporting slides

Preprocessing

Each sample is stained with six markers, in addition to ViViD for live-cell gating. The CD3, CD4, and CD8 markers are used to identify the CD4+ and CD8+ subpopulations of T cells. The remaining markers are the cytokines TNFa, IFNg, and IL2. Expression of the cytokines (quantity and quality) in T cells is known to be indicative of HIV progression (Seder et al. 2008, Betts et al. 2006). Therefore, we aim to identify cells with at least one cytokine expressed, in both CD4+ and CD8+ T cells, and combine them to predict the disease (or vaccination) status.

SLIDE 21

supporting slides

Model verification with single feature

Consider a single feature at a time and build a logistic regression model with it. Building incrementally more complex models from these simpler ones keeps almost the same set of features.

Model with feature   p-value    AIC
MFI TNFa+            2.46e-07   79.95
TNFa+IFNg+IL2+       7.47e-05   90.90
TNFa+IFNg−IL2−       1.68e-07   79.21
TNFa−IFNg−IL2+       5.00e-01   106.13
