

SLIDE 1

Analysis and evaluation of classification models for disease detection using human gut metagenomic data


Elena Kochkina, Fredrik Karlsson

Chalmers University of Technology

September 10, 2015

SLIDE 2

Colorectal cancer

◮ Colorectal cancer (CRC) is the development of a malignant tumor in the colon or rectum.

◮ 75-95% of colon cancer cases occur in people with low genetic risk.

◮ The standard test for CRC, analysis of the stool for hidden blood, is of limited practical value for diagnosis, so better alternatives for population screening are needed.

SLIDE 3

The gut microbiota

◮ The gut microbiota is the ecological community of microorganisms populating our intestine.

◮ The gut microbiota is an important modulator of the immune system and an important metabolic organ.

◮ In several diseases, the taxonomic and functional composition of the microbiota is altered compared to a normal healthy microbiota.

SLIDE 4

Data I: Zeller et al. 2014

SLIDE 5

Data II

Group F (N=156), France: Healthy 61; Adenoma (<1) 27, Adenoma (>1) 15; Colorectal cancer, early stage: I 15, II 7; Colorectal cancer, late stage: III 10, IV 21.

Group G (N=38), Germany: 25, 13.

Group H (N=297), Denmark, Spain, Germany: Healthy 297.

The datasets consist of fecal metagenomes (the collective genetic material of the microbiota) and provide functional and taxonomic features of the bacteria populating the human gut. Taxonomic features represent the relative abundance of 1753 different bacteria. Functional features represent gene functions and are divided into KEGG modules and CAZY families.

SLIDE 6

Classification

Training set (known labels) → Machine learning algorithm → Classification model

Test set (unknown labels) → Classification model → Predicted label

◮ LASSO

◮ Elastic Net

◮ Support Vector Machines

◮ Random Forests

SLIDE 7

LASSO - Logistic regression with L1 norm regularisation

The binary logistic model predicts a binary response (class) based on predictors or features by estimating the probability that an instance belongs to the 'positive' class. The probabilities are modeled using a logistic function:

$$\sigma(q) = P(y_i = 1 \mid x) = \frac{1}{1 + e^{-q}}$$

Given a set of input measurements $x_1, x_2, \ldots, x_p$ and an outcome measurement $y \in \{0, 1\}$ (the coding for which the log-likelihood below holds), $q$ can be a linear function of $x$:

$$q = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$$

The LASSO constraint is defined by:

$$\sum_{j=1}^{p} |\beta_j| \le t$$

We maximise the log-likelihood with an added penalty, where $q_i$ is the linear predictor of the $i$-th sample:

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmax}_{\beta} \left\{ \sum_{i=1}^{N} \left[ y_i q_i - \log\left(1 + e^{q_i}\right) \right] - \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$
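The penalised log-likelihood on this slide can be evaluated directly. The following is a minimal sketch, not the authors' implementation: it assumes labels coded as 0/1 (so that each sample contributes y_i q_i - log(1 + e^{q_i})), leaves the intercept unpenalised, and uses made-up toy data. The function names are hypothetical.

```python
import math

def sigmoid(q):
    """Logistic function: sigma(q) = 1 / (1 + exp(-q))."""
    return 1.0 / (1.0 + math.exp(-q))

def linear_predictor(beta0, beta, x):
    """q = beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def lasso_objective(beta0, beta, X, y, lam):
    """Log-likelihood minus the L1 penalty lam * sum_j |beta_j| (labels are 0/1)."""
    loglik = 0.0
    for xi, yi in zip(X, y):
        q = linear_predictor(beta0, beta, xi)
        loglik += yi * q - math.log(1.0 + math.exp(q))
    return loglik - lam * sum(abs(b) for b in beta)

# Toy data, made up for illustration: 4 samples, 2 features.
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [-0.2, -1.2]]
y = [1, 1, 0, 0]

obj_zero = lasso_objective(0.0, [0.0, 0.0], X, y, lam=0.1)  # all coefficients zero
obj_fit = lasso_objective(0.0, [1.0, 0.0], X, y, lam=0.1)   # uses the first feature only
```

With all coefficients at zero every sample contributes -log 2, so the objective equals -N log 2. A coefficient vector is preferred only if its gain in log-likelihood exceeds the penalty it pays, which is what drives uninformative coefficients to exactly zero.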

SLIDE 8

Elastic Net - Logistic regression with regularisation by a combination of L1 and L2 norms

The difference from LASSO is in the Elastic Net penalty:

$$\lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right),$$

where $\alpha$ is a compromise between Ridge ($\alpha = 1$) and LASSO ($\alpha = 0$). Therefore, the Elastic Net criterion has the following form:

$$\hat{\beta}^{\text{elastic net}} = \operatorname*{argmax}_{\beta} \left\{ \sum_{i=1}^{N} \left[ y_i q_i - \log\left(1 + e^{q_i}\right) \right] - \lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right) \right\}$$
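To make the role of α concrete, the penalty term can be computed on its own. This is a small sketch with a hypothetical function name and made-up coefficients. It follows the parameterisation written on this slide, where α weights the squared term; some software packages use the opposite convention.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * sum_j (alpha * beta_j**2 + (1 - alpha) * |beta_j|)."""
    return lam * sum(alpha * b * b + (1.0 - alpha) * abs(b) for b in beta)

beta = [0.5, -2.0, 0.0]  # made-up coefficients

l1_only = elastic_net_penalty(beta, lam=1.0, alpha=0.0)  # pure LASSO penalty
l2_only = elastic_net_penalty(beta, lam=1.0, alpha=1.0)  # pure Ridge penalty
mixed = elastic_net_penalty(beta, lam=1.0, alpha=0.5)    # equal mix of both
```

At α = 0 only the absolute values contribute, at α = 1 only the squares, and intermediate values interpolate linearly between the two.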

SLIDE 9

Support Vector Machines

Given training data $(x_i, y_i)$ for $i = 1, \ldots, N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(x)$ such that

$$f(x_i) \begin{cases} \ge 0, & y_i = +1 \\ < 0, & y_i = -1 \end{cases}$$

i.e. $y_i f(x_i) > 0$ for a correct classification. A linear classifier has the form

$$f(x) = w \cdot x + b$$

The margin is given by $\frac{2}{\|w\|}$, so the classifier is found by solving

$$\max_{w} \frac{2}{\|w\|} \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, N.$$

[Figure: two classes of points separated by the hyperplane $w \cdot x + b = 0$, with normal vector $w$ and margin $\frac{2}{\|w\|}$.]
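The quantities on this slide are easy to compute for a given hyperplane. The sketch below does not train an SVM; it only evaluates the decision function, the margin width 2/||w|| and the margin constraints for a hand-picked (w, b). The values of w, b and the data points are hypothetical.

```python
import math

def decision_value(w, b, x):
    """f(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def predict(w, b, x):
    """Class +1 if f(x) >= 0, otherwise -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

def satisfies_margin(w, b, X, y):
    """True if every training point fulfils y_i * (w . x_i + b) >= 1."""
    return all(yi * decision_value(w, b, xi) >= 1 for xi, yi in zip(X, y))

# Hypothetical hyperplane and toy points.
w, b = [3.0, 4.0], -5.0
X = [[2.0, 1.0], [0.0, 0.0], [1.0, 2.0]]
y = [1, -1, 1]

width = margin_width(w)
feasible = satisfies_margin(w, b, X, y)
```

Maximising 2/||w|| under these constraints is equivalent to minimising ||w||, which is why a shorter weight vector that still satisfies every constraint gives a wider margin.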

SLIDE 10

Random Forest

The Random Forest algorithm works by constructing an ensemble of decision trees. All the trees are constructed independently, using the Gini impurity criterion to choose partition attributes. Objects are classified by a majority voting scheme: every tree assigns the object to one of the classes, and the class that receives the highest number of votes wins.

[Figure: the input x is passed to Tree 1, Tree 2, ..., Tree n; their votes are combined into the prediction y.]
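The two ingredients named on this slide, Gini impurity and majority voting, can be sketched in a few lines. This is an illustration only: it does not grow trees, and the class labels and function names are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum_k p_k**2, where p_k is the share of class k among the labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

def majority_vote(tree_predictions):
    """Class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

node = ["CRC", "CRC", "healthy", "healthy"]
parent = gini_impurity(node)                                         # evenly mixed node
perfect = split_impurity(["CRC", "CRC"], ["healthy", "healthy"])     # pure children
forest_label = majority_vote(["CRC", "healthy", "CRC"])
```

A tree chooses the partition attribute that lowers the weighted impurity the most; a split into pure children drives it to zero.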

SLIDE 11

Pipeline

◮ Preprocessing: filtering, log-transform, normalisation

◮ Partition of the F set into test and training sets for 10-fold cross-validation

◮ Selection of the optimal hyperparameter(s) with nested 10-fold cross-validation and fitting the model

◮ Application of the fitted model to the test set of each fold and to the GH set

◮ Performance evaluation

◮ Model interpretation and important feature extraction

10 ×: the selection, fitting and application steps are repeated for each of the 10 folds.
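The nested cross-validation step can be sketched as two loops: the outer loop holds out a test fold, and the inner loop picks the hyperparameter using the remaining training data only. Everything below is an assumption made for illustration: the function names, the toy one-dimensional data, and the toy "model" (a threshold at the midpoint of the class means, shifted by a hyperparameter h). Preprocessing and the application to the GH set are left out.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv(n, candidates, fit, score, k_outer=10, k_inner=10):
    """Return one (chosen hyperparameter, test score) pair per outer fold."""
    results = []
    for test_idx in k_fold_indices(n, k_outer, seed=0):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        inner_folds = k_fold_indices(len(train_idx), k_inner, seed=1)

        def inner_score(h):
            # Mean validation score of hyperparameter h over the inner folds.
            total = 0.0
            for fold in inner_folds:
                val_idx = [train_idx[j] for j in fold]
                val_set = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val_set]
                total += score(fit(fit_idx, h), val_idx)
            return total / len(inner_folds)

        best = max(candidates, key=inner_score)
        # Refit on the whole outer training set, then score the held-out fold.
        results.append((best, score(fit(train_idx, best), test_idx)))
    return results

# Toy data, made up: class 1 values lie about 2 units above class 0 values.
labels = [0] * 20 + [1] * 20
values = [2 * labels[i] + ((i * 7) % 10) / 10 - 0.5 for i in range(40)]

def fit(idx, h):
    """Toy model: threshold at the midpoint of the two class means, shifted by h."""
    mean0 = sum(values[i] for i in idx if labels[i] == 0) / sum(1 for i in idx if labels[i] == 0)
    mean1 = sum(values[i] for i in idx if labels[i] == 1) / sum(1 for i in idx if labels[i] == 1)
    return (mean0 + mean1) / 2 + h

def score(threshold, idx):
    """Accuracy of predicting class 1 whenever the value exceeds the threshold."""
    return sum((values[i] > threshold) == (labels[i] == 1) for i in idx) / len(idx)

results = nested_cv(len(values), [-1.0, 0.0, 1.0], fit, score)
```

Because the held-out fold never takes part in choosing the hyperparameter, the outer scores are not inflated by the tuning step.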

SLIDE 12

Performance estimation

The result of classification is a set of predicted probabilities that a certain element belongs to the positive class (CRC). After choosing the decision boundary we can construct the confusion matrix:

                              Actual class
                              Positive   Negative
Predicted class   Positive    TP         FP
                  Negative    FN         TN

TP - True Positive; TN - True Negative; FP - False Positive; FN - False Negative.
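As a sketch of how the confusion matrix follows from predicted probabilities and a decision boundary, consider the snippet below. The function name, the 0/1 label coding (1 = CRC) and the toy probabilities are assumptions made for illustration.

```python
def confusion_matrix(y_true, p_pos, threshold=0.5):
    """Count TP, FP, FN, TN; an element is predicted positive if p >= threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, p in zip(y_true, p_pos):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        elif predicted == 0 and actual == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Toy labels (1 = CRC, 0 = healthy) and predicted probabilities, made up.
y_true = [1, 1, 1, 0, 0, 0]
p_pos = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]

at_half = confusion_matrix(y_true, p_pos, threshold=0.5)
at_quarter = confusion_matrix(y_true, p_pos, threshold=0.25)
```

Lowering the decision boundary from 0.5 to 0.25 turns the one false negative in this toy example into a true positive, while the false positive count stays the same.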

SLIDE 13

Performance estimation II

Performance metrics that can be calculated from the confusion matrix with a fixed decision boundary:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{FP + TN}$$
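The four formulas translate directly into code. The counts used below are made up for illustration and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }

# Made-up counts: 100 subjects, 45 of them true cases.
m = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

Recall is computed over the actual positives (TP + FN) and precision over the predicted positives (TP + FP), so the two can differ considerably when classes are imbalanced.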

SLIDE 14

Performance estimation III

To compare the classification models we use the ROC curve (Receiver Operating Characteristic) and the Area Under the Curve (AUC). The ROC curve shows the relation between Sensitivity (True Positive Rate) and 1 − Specificity (False Positive Rate) as the decision boundary is varied.

[Figure: ROC curve; x-axis: Specificity (1.0 to 0.0), y-axis: Sensitivity (0.0 to 1.0).]
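A ROC curve can be traced by sweeping the decision boundary over the observed scores, and the AUC then follows from the trapezoidal rule. This is a minimal sketch with hypothetical function names and toy scores; labels are assumed to be coded 1 (positive) and 0 (negative).

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the decision boundary through all scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy examples, made up.
perfect = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
one_swap = auc(roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

The AUC equals the fraction of (positive, negative) pairs in which the positive element receives the higher score: in the second example three of the four pairs are ordered correctly.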

SLIDE 15

Results I

                                                       AUC, F set
Classifier                                             Taxonomic   Functional   Taxonomic and functional
LASSO                                                  0.84        0.80         0.87
Elastic net                                            0.83        0.79         0.87
Support Vector Machines (SVM), no feature selection    0.82        0.76         0.85
Random forest                                          0.87        0.79         0.85

Table: Performance of different classification models on training set F using taxonomic and functional features

SLIDE 16

Results II

Classifier                                                                                 AUC (GH set)
LASSO                                                                                      0.85
Elastic net                                                                                0.85
Support Vector Machines (SVM) with feature selection using linear correlation criterion   0.89
Random forest                                                                              0.87

Table: Performance of different classification models on the test set GH

SLIDE 17

Important features

All classifiers and filters highlight the importance of the following bacteria: Fusobacterium nucleatum subsp. vincentii, Fusobacterium nucleatum subsp. animalis and Peptostreptococcus stomatis. These bacteria are oral pathogens. Other studies (Warren et al. (2013), Feng et al. (2014)) also point out these species as CRC-related bacteria.

It is still unclear whether they are the cause or a consequence of tumor growth.

SLIDE 18

Confounder assessment

[Figure panels, cases vs. controls: (A) gender proportions (female, male), Fisher test p-value = 0.86; (B) age, Wilcoxon test p-value = 0.0027; (C) BMI, Wilcoxon test p-value = 0.76.]

Figure: Boxplots. (A) Comparison of gender proportions between CRC patients and controls of study population F. (B) Comparison of patient age as a potential confounder. (C) Comparison of body mass index (BMI) as a potential confounder.

SLIDE 19

Confounder assessment II

[Figure panels: age of FN vs. TP predictions among cases and of FP vs. TN predictions among controls, Wilcoxon test. (A) F set: p-value = 0.76 and p-value = 0.57. (B) GH set: p-value = 0.83 and p-value = 0.75.]

Figure: Metagenomic CRC predictions are unbiased for patient age, despite an age bias between cases and controls in the training set. The classifier neither shows a significant enrichment of old subjects among its false positive (FP) relative to its true negative (TN) predictions, nor a significant enrichment of young subjects among its false negative (FN) relative to true positive (TP) predictions. This observation is consistent between study population F (A) used for cross validation and study populations G and H (B) used for external validation.

SLIDE 20

Multi-class classification

Overall (macro-averaged)

Set   Accuracy   Confidence interval   Recall   Precision   F-score
F     0.74       0.66 - 0.81           0.56     0.65        0.60
GH    0.8        0.75 - 0.84           0.55     0.50        0.52

Per class

      Recall                                  Precision
Set   Early stage   Late stage   Healthy      Early stage   Late stage   Healthy
F     0.18          0.55         0.95         0.57          0.59         0.80
GH    0.08          0.69         0.87         0.4           0.15         0.96

Table: Performance metrics of the Random Forest classifier for multi-class classification. Decision boundary = (0.3, 0.3, 0.4)
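Macro-averaging takes the unweighted mean of the per-class values. The sketch below recomputes the overall figures for set F from the per-class recall and precision reported in the table. Treating the F-score as the harmonic mean of the macro-averaged precision and recall is an assumption (the slides do not state the formula), although it reproduces the reported value.

```python
def macro_average(per_class_values):
    """Unweighted mean over classes."""
    return sum(per_class_values) / len(per_class_values)

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Per-class values for set F (early stage, late stage, healthy), taken from the table.
recall_f = [0.18, 0.55, 0.95]
precision_f = [0.57, 0.59, 0.80]

macro_recall = macro_average(recall_f)
macro_precision = macro_average(precision_f)
macro_f = f_score(macro_precision, macro_recall)
```

Because every class counts equally, the low early-stage recall pulls the macro-averaged recall well below the overall accuracy, which is dominated by the large healthy class.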
SLIDE 21

Conclusions

◮ We compared the performance of LASSO, Elastic Net, Support Vector Machines and Random Forests classifiers on a dataset containing taxonomic and functional features of the gut metagenome of CRC patients and healthy individuals.

◮ SVM with linear correlation feature selection as a preprocessing step shows the highest AUC, 0.89, on the test set GH.

◮ Fusobacterium nucleatum subsp. vincentii, Fusobacterium nucleatum subsp. animalis and Peptostreptococcus stomatis were identified as CRC markers.

◮ We ruled out confounding by age, gender and body mass index in this study.

◮ We tested the possibility of multi-class categorization between control, early-stage and late-stage cancer.

SLIDE 22

Thank you for your attention! Any questions?

E.Kochkina@warwick.ac.uk

Fredrik.Karlsson@metabogen.com

SLIDE 23

LASSO

Features               AUC     Accuracy   Recall   Precision
Taxonomic, F           0.835   0.8        0.94     0.79
Taxonomic, GH          0.85    0.74       0.80     0.88
CAZY                   0.76    0.73       0.89     0.73
KEGG                   0.77    0.73       0.88     0.75
CAZY+KEGG              0.80    0.79       0.93     0.78
CAZY+KEGG+taxonomic    0.87    0.80       0.89     0.80

SLIDE 24

Elastic Net

Features                     AUC    Accuracy   Recall   Precision
Taxonomic, F                 0.83   0.78       0.94     0.77
Taxonomic, GH                0.85   0.75       0.82     0.88
Functional, F                0.79   0.75       0.89     0.75
Taxonomic + functional, F    0.87   0.82       0.94     0.83

SLIDE 25

Support Vector Machines

Features                                        AUC    Accuracy   Recall   Precision
Taxonomic features, dataset F                   0.82   0.74       0.83     0.77
Taxonomic features, dataset GH                  0.81   0.74       0.74     0.95
Functional features, dataset F                  0.76   0.73       0.89     0.74
Functional and taxonomic features, dataset F    0.85   0.8        0.88     0.82

SLIDE 26

Support Vector Machines

Feature Selection methods

Method               Set   AUC    Accuracy   Recall   Precision
χ2                   F     0.84   0.79       0.95     0.77
                     GH    0.86   0.85       0.87     0.95
Information gain     F     0.83   0.78       0.95     0.77
                     GH    0.88   0.85       0.87     0.95
Linear correlation   F     0.87   0.83       0.93     0.82
                     GH    0.89   0.83       0.83     0.97
Wilcoxon test        F     0.86   0.75       0.83     0.78
                     GH    0.89   0.82       0.83     0.97
LASSO                F     0.87   0.82       0.86     0.84
                     GH    0.87   0.77       0.7      0.97

           Linear correlation   Information gain   LASSO   Wilcoxon test
p-value    0.04                 0.08               0.16    0.05

SLIDE 27

Random Forest

Features                      AUC    Accuracy   Recall   Precision
Taxonomic, F                  0.87   0.83       0.94     0.82
Taxonomic, GH                 0.87   0.86       0.81     0.95
Functional, F                 0.79   0.77       0.89     0.77
Taxonomic and functional, F   0.85   0.78       0.89     0.79

SLIDE 28

Comparison

Classifier                  p-value
Elastic net                 0.48
Random Forests              0.36
SVM (Linear correlation)    0.15
SVM (Wilcoxon test)         0.19
SVM (LASSO)                 0.41
SVM (Information gain)      0.26

Table: P-values for the comparison of ROC curves against LASSO (one-sided bootstrap test)

Classifier                        p-value
LASSO (>)                         0.36
SVM (no feature selection) (>)    0.14
SVM (Linear correlation) (<)      0.01
SVM (Wilcoxon test) (<)           0.10
SVM (Information gain) (<)        0.33

Table: P-values for the comparison of ROC curves against Random Forests (bootstrap test)
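The slides do not describe how the bootstrap comparison was carried out, so the following is only one common construction, given as a sketch under stated assumptions: both classifiers are scored on the same subjects, cases and controls are resampled separately with replacement, and the observed AUC difference is divided by the standard deviation of the bootstrap differences to obtain a one-sided normal p-value. All names and the toy scores are hypothetical.

```python
import math
import random
import statistics

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_test(y_true, scores_a, scores_b, n_boot=500, seed=0):
    """One-sided p-value for the alternative AUC(a) > AUC(b)."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(y_true) if y == 1]
    neg_idx = [i for i, y in enumerate(y_true) if y == 0]
    diffs = []
    for _ in range(n_boot):
        # Stratified resampling keeps the numbers of cases and controls fixed.
        idx = ([rng.choice(pos_idx) for _ in pos_idx] +
               [rng.choice(neg_idx) for _ in neg_idx])
        yb = [y_true[i] for i in idx]
        diffs.append(auc(yb, [scores_a[i] for i in idx]) -
                     auc(yb, [scores_b[i] for i in idx]))
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("bootstrap differences have zero variance")
    z = (auc(y_true, scores_a) - auc(y_true, scores_b)) / sd
    # Upper tail of the standard normal distribution.
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Toy scores, made up: classifier a separates the classes well, classifier b barely.
y_true = [1] * 10 + [0] * 10
scores_a = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.65, 0.55, 0.45,
            0.5, 0.4, 0.3, 0.2, 0.1, 0.35, 0.25, 0.15, 0.05, 0.42]
scores_b = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4, 0.6, 0.55,
            0.15, 0.85, 0.35, 0.65, 0.45, 0.25, 0.75, 0.05, 0.95, 0.52]

p_ab = bootstrap_auc_test(y_true, scores_a, scores_b)
p_ba = bootstrap_auc_test(y_true, scores_b, scores_a)
```

A small p-value supports the claim that the first classifier has the larger AUC; swapping the two classifiers tests the opposite direction.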