Analysis and evaluation of classification models for disease detection using human gut metagenomic data
Elena Kochkina, Fredrik Karlsson
Introduction
Colorectal cancer
◮ Colorectal cancer (CRC) is the development of a malignant tumor in the colon or rectum.
◮ 75-95% of colon cancer cases occur in people with low genetic risk.
◮ The standard test for CRC, the analysis of stool for hidden blood, is of limited practical value for diagnosis, so better alternatives for population screening are needed.
The gut microbiota
◮ The gut microbiota is the ecological community of microorganisms populating our intestine.
◮ The gut microbiota is an important modulator of the immune system and an important metabolic organ.
◮ In several diseases, the taxonomic and functional composition of the microbiota is altered compared to a normal, healthy microbiota.
Data
Data I: Zeller et al. 2014
Data II
Group       Healthy   Adenoma         Colorectal cancer             Country
                      (<1)    (>1)    Early stage    Late stage
                                      I      II      III    IV
F (N=156)   61        27      15      15     7       10     21      France
G (N=38)                              25             13             Germany
H (N=297)   297                                                     Denmark, Spain, Germany
The datasets comprise fecal metagenomes (the collective genetic material of the microbiota), which carry information about the functional and taxonomic features of the bacteria populating the human gut. Taxonomic features represent the relative abundance of 1753 different bacteria. Functional features represent gene functions and are divided into KEGG modules and CAZY families.
Methodology
Classification
[Diagram: Training set (known labels) → Machine learning algorithm → Classification model; Test set (unknown labels) → Classification model → Predicted label]
◮ LASSO
◮ Elastic Net
◮ Support Vector Machines
◮ Random Forests
LASSO - Logistic regression with L1 norm regularisation
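The binary logistic model predicts a binary response (class) based on predictors, or features, by estimating the probability that an instance belongs to the 'positive' class. The probabilities are modeled using the logistic function:

$$\sigma(q_i) = P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-q_i}}$$

Given input measurements $x_{i1}, x_{i2}, \dots, x_{ip}$ and an outcome measurement $y_i$ for each instance $i$ (coded as 0 or 1 in the log-likelihood below), $q_i$ can be a linear function of the inputs:

$$q_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$$

The LASSO constraint is defined by:

$$\sum_{j=1}^{p} |\beta_j| \le t$$

We maximise the log-likelihood with an added penalty:

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmax}_{\beta} \left\{ \sum_{i=1}^{N} \left[ y_i q_i - \log\left(1 + e^{q_i}\right) \right] - \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$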
Elastic Net - Logistic regression with regularisation by combination of L1 and L2 norms
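The difference from LASSO is in the Elastic Net penalty:

$$\lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right),$$

where $\alpha \in [0, 1]$ sets the compromise between the Ridge (L2) and LASSO (L1) penalties. The Elastic Net criterion therefore has the following form, where $q_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$ is the linear predictor of instance $i$:

$$\hat{\beta}^{\text{elastic net}} = \operatorname*{argmax}_{\beta} \left\{ \sum_{i=1}^{N} \left[ y_i q_i - \log\left(1 + e^{q_i}\right) \right] - \lambda \sum_{j=1}^{p} \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right) \right\}$$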
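The effect of α is easiest to see by computing the penalty at its two extremes. This is a small sketch using the parametrisation written on this slide, in which α weights the squared term; some software packages use the opposite convention, so the mapping to a given library is an assumption to check. The coefficient values are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * sum_j (alpha * beta_j^2 + (1 - alpha) * |beta_j|)."""
    return lam * sum(alpha * b * b + (1.0 - alpha) * abs(b) for b in beta)


beta = [0.5, -2.0, 0.0]  # hypothetical coefficients

lasso_like = elastic_net_penalty(beta, lam=1.0, alpha=0.0)  # pure L1 penalty
ridge_like = elastic_net_penalty(beta, lam=1.0, alpha=1.0)  # pure L2 penalty
mixed = elastic_net_penalty(beta, lam=1.0, alpha=0.5)       # equal mix

print(lasso_like, ridge_like, mixed)
```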
Support Vector Machines
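Given training data $(x_i, y_i)$ for $i = 1, \dots, N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(x)$ such that

$$f(x_i) \begin{cases} \ge 0, & y_i = +1 \\ < 0, & y_i = -1 \end{cases}$$

i.e. $y_i f(x_i) > 0$ for a correct classification. A linear classifier has the form

$$f(x) = w \cdot x + b$$

The margin is given by $2 / \lVert w \rVert$, and the classifier is obtained by solving

$$\max_{w} \frac{2}{\lVert w \rVert} \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N.$$

[Figure: data points (x) separated by the hyperplane w · x + b = 0, with normal vector w and margin 2/||w||]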
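The margin constraint can be checked numerically for a candidate hyperplane. The sketch below uses the f(x) = w · x + b convention; the weight vector, bias and points are hypothetical values chosen for illustration, and no optimisation is performed.

```python
import math


def decision(w, b, x):
    """Linear classifier f(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


def satisfies_margin(w, b, X, y):
    """True if every point fulfils y_i * f(x_i) >= 1."""
    return all(yi * decision(w, b, xi) >= 1.0 for xi, yi in zip(X, y))


# Hypothetical, linearly separable toy data
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [0.5, 0.5], -1.0

print(satisfies_margin(w, b, X, y))
print(margin_width(w))
```

Among all hyperplanes that satisfy the constraint, the SVM picks the one with the smallest ||w||, that is, the widest margin.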
Random Forest
The Random Forest algorithm works by constructing an ensemble of decision trees. All trees are constructed independently, using the Gini impurity criterion to choose the partition attributes. Objects are classified by majority voting: every tree assigns the object to one of the classes, and the class with the highest number of votes wins.

[Figure: input x is passed to Tree 1, Tree 2, ..., Tree n; the tree votes are combined (+) into the prediction y]
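The two ingredients named above, Gini impurity for choosing splits and majority voting for the final prediction, can be sketched in a few lines. This is an illustrative fragment, not a full Random Forest: tree construction and random feature sampling are omitted, and the labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


def majority_vote(tree_predictions):
    """Class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


print(gini_impurity(["CRC", "CRC", "healthy", "healthy"]))
print(split_impurity(["CRC", "CRC"], ["healthy", "healthy"]))
print(majority_vote(["CRC", "healthy", "CRC"]))
```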
Pipeline
◮ Partition the F set into training and test sets for 10-fold cross-validation.
◮ Preprocessing: filtering, log-transform, normalisation.
◮ Selection of the optimal hyperparameter(s) with nested 10-fold cross-validation, and fitting of the model.
◮ Application of the fitted model to the test set of each fold and to the GH set.
◮ Repeat 10 ×.
◮ Performance evaluation.
◮ Model interpretation and extraction of important features.
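The nesting of the two cross-validation loops can be sketched with index bookkeeping alone. This is a minimal illustration, assuming a hypothetical sample count and a simple shuffled, non-stratified split; model fitting and scoring are left as comments because the slides do not specify them.

```python
import random


def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split of n items."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


n_samples = 100  # hypothetical number of samples in the training population

for outer_train, outer_test in kfold_indices(n_samples, 10, seed=1):
    # Inner loop: uses outer_train only, so outer_test never influences
    # the choice of hyperparameter(s)
    for pos_train, pos_val in kfold_indices(len(outer_train), 10, seed=2):
        inner_train = [outer_train[p] for p in pos_train]
        inner_val = [outer_train[p] for p in pos_val]
        # fit each candidate hyperparameter on inner_train,
        # score it on inner_val (omitted)
    # refit on all of outer_train with the selected hyperparameter(s),
    # then predict outer_test and the external set (omitted)
```

Keeping hyperparameter selection inside the inner loop is what keeps the outer-fold performance estimate free of selection bias.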
Performance estimation
The result of classification is a set of predicted probabilities that each element belongs to the positive class (CRC). After choosing the decision boundary, we can construct the confusion matrix:

                               Actual class
                               Positive   Negative
Predicted class   Positive     TP         FP
                  Negative     FN         TN

TP - True Positive; TN - True Negative; FP - False Positive; FN - False Negative.
Performance estimation II
Performance metrics that can be calculated from the confusion matrix at a fixed decision boundary:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{FP + TN}$$
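As a worked example, the four metrics can be computed from the confusion-matrix counts. The counts below are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics at a fixed decision boundary; assumes non-zero denominators."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }


# Hypothetical counts for 100 samples
metrics = classification_metrics(tp=45, fp=10, fn=5, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Moving the decision boundary changes all four counts at once, which is why these metrics are only comparable between models at a stated boundary.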
Performance estimation III
To compare the classification models we use the ROC curve (Receiver Operating Characteristic) and the Area Under the Curve (AUC). The ROC curve shows the relation between Sensitivity (True Positive Rate) and 1 − Specificity (False Positive Rate) as the decision boundary is varied.

[Figure: ROC curve; axes: Specificity (1.0 to 0.0) and Sensitivity (0.0 to 1.0)]
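The construction of the curve can be sketched by sweeping the decision boundary through every predicted score and integrating with the trapezoidal rule. This is an illustrative implementation with hypothetical scores and labels (1 = positive class), not the code used for the results in these slides.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the decision boundary step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process tied scores together so they give a single ROC point
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical predicted probabilities and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

print(auc(roc_points(scores, labels)))
```

The AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one; here that is 8 of the 9 positive-negative pairs.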
Results
Results I
AUC, F set

Classifier                                            Taxonomic   Functional   Taxonomic and functional
LASSO                                                 0.84        0.80         0.87
Elastic net                                           0.83        0.79         0.87
Support Vector Machines (SVM), no feature selection   0.82        0.76         0.85
Random forest                                         0.87        0.79         0.85

Table: Performance of different classification models on training set F using taxonomic and functional features
Results II
Classifier                                                                                 AUC (GH set)
LASSO                                                                                      0.85
Elastic net                                                                                0.85
Support Vector Machines (SVM) with feature selection using linear correlation criterion   0.89
Random forest                                                                              0.87

Table: Performance of different classification models on the test set GH
Important features
All classifiers and filters highlight the importance of the following bacteria: Fusobacterium nucleatum subsp. vincentii, Fusobacterium nucleatum subsp. animalis and Peptostreptococcus stomatis. These bacteria are oral pathogens. Other studies (Warren et al. (2013), Feng et al. (2014)) also point out these species as CRC-related bacteria.
It is still unclear whether they are a cause or a consequence of tumor growth.
Confounder assessment
[Figure panels: (A) gender proportions (female, male) in cases and controls, Fisher test p-value = 0.86; (B) age in cases and controls, Wilcoxon test p-value = 0.0027; (C) BMI in cases and controls, Wilcoxon test p-value = 0.76]

Figure: Boxplots. (A) Comparison of gender proportions between CRC patients and controls of study population F. (B) Comparison of patient age as a potential confounder. (C) Comparison of body mass index (BMI) as a potential confounder.
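For reference, the Fisher test used for the gender comparison can be sketched from the hypergeometric distribution. The implementation below is a minimal two-sided version that sums the probabilities of all tables no more likely than the observed one; the counts in the example are hypothetical, since the slides report only the p-value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum over all tables that are at most as probable as the observed table
    total = sum(prob(k) for k in range(k_min, k_max + 1)
                if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: rows = cases / controls, columns = female / male
print(fisher_exact_two_sided(3, 1, 1, 3))
```

A large p-value, as reported for gender, means the observed proportions are compatible with no difference between cases and controls.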
Confounder assessment II
[Figure panels: age of FN vs TP predictions (cases) and FP vs TN predictions (controls), Wilcoxon test. (A) F set; (B) GH set. Reported p-values: 0.76, 0.57, 0.83, 0.75]

Figure: Metagenomic CRC predictions are unbiased for patient age, despite an age bias between cases and controls in the training set. The classifier neither shows a significant enrichment of old subjects among its false positive (FP) relative to its true negative (TN) predictions, nor a significant enrichment of young subjects among its false negative (FN) relative to true positive (TP) predictions. This observation is consistent between study population F (A) used for cross validation and study populations G and H (B) used for external validation.
Multi-class classification
Overall (macro-averaged)

Set   Accuracy   Confidence interval   Recall   Precision   F-score
F     0.74       0.66 - 0.81           0.56     0.65        0.60
GH    0.8        0.75 - 0.84           0.55     0.50        0.52

Per class

      Recall                                 Precision
Set   Early stage   Late stage   Healthy     Early stage   Late stage   Healthy
F     0.18          0.55         0.95        0.57          0.59         0.80
GH    0.08          0.69         0.87        0.4           0.15         0.96

Table: Performance metrics of the Random Forest classifier for multi-class classification. Decision boundary = (0.3, 0.3, 0.4)
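The macro-averaged values are the unweighted means of the per-class values, and the F-score is the harmonic mean of macro precision and macro recall. The short check below recomputes the overall figures for set F from the per-class numbers in the table; the helper names are illustrative.

```python
def macro_average(per_class):
    """Unweighted mean over classes."""
    return sum(per_class.values()) / len(per_class)


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class values for set F, taken from the table
recall_F = {"early": 0.18, "late": 0.55, "healthy": 0.95}
precision_F = {"early": 0.57, "late": 0.59, "healthy": 0.80}

macro_recall = macro_average(recall_F)
macro_precision = macro_average(precision_F)

print(round(macro_recall, 2))
print(round(macro_precision, 2))
print(round(f_score(macro_precision, macro_recall), 2))
```

Because every class counts equally, the low early-stage recall pulls the macro-averaged recall well below the overall accuracy, which is dominated by the large healthy class.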
Conclusions
◮ We compared the performance of LASSO, Elastic Net, Support Vector Machines and Random Forests classifiers on a dataset containing taxonomic and functional features of the gut metagenome of CRC patients and healthy individuals.
◮ SVM with linear correlation feature selection as a preprocessing step shows the highest AUC, 0.89, on the test set GH.
◮ Fusobacterium nucleatum subsp. vincentii, Fusobacterium nucleatum subsp. animalis and Peptostreptococcus stomatis were identified as CRC markers.
◮ We ruled out confounding by age, gender and body mass index in this study.
◮ We tested the possibility of multi-class classification between control, early-stage and late-stage cancer.
Thank you for your attention! Any questions? E.Kochkina@warwick.ac.uk Fredrik.Karlsson@metabogen.com
LASSO
Features              AUC     Accuracy   Recall   Precision
Taxonomic, F          0.835   0.8        0.94     0.79
Taxonomic, GH         0.85    0.74       0.80     0.88
CAZY                  0.76    0.73       0.89     0.73
KEGG                  0.77    0.73       0.88     0.75
CAZY+KEGG             0.80    0.79       0.93     0.78
CAZY+KEGG+taxonomic   0.87    0.80       0.89     0.80
Elastic Net
Features                    AUC    Accuracy   Recall   Precision
Taxonomic, F                0.83   0.78       0.94     0.77
Taxonomic, GH               0.85   0.75       0.82     0.88
Functional, F               0.79   0.75       0.89     0.75
Taxonomic + functional, F   0.87   0.82       0.94     0.83
(Support Vector Machines)
Features                                       AUC    Accuracy   Recall   Precision
Taxonomic features, dataset F                  0.82   0.74       0.83     0.77
Taxonomic features, dataset GH                 0.81   0.74       0.74     0.95
Functional features, dataset F                 0.76   0.73       0.89     0.74
Functional and taxonomic features, dataset F   0.85   0.8        0.88     0.82
Support Vector Machines
Feature Selection methods
Feature selection    Set   AUC    Accuracy   Recall   Precision
χ2                   F     0.84   0.79       0.95     0.77
                     GH    0.86   0.85       0.87     0.95
Information gain     F     0.83   0.78       0.95     0.77
                     GH    0.88   0.85       0.87     0.95
Linear correlation   F     0.87   0.83       0.93     0.82
                     GH    0.89   0.83       0.83     0.97
Wilcoxon test        F     0.86   0.75       0.83     0.78
                     GH    0.89   0.82       0.83     0.97
LASSO                F     0.87   0.82       0.86     0.84
                     GH    0.87   0.77       0.7      0.97

          Linear correlation   Information gain   LASSO   Wilcoxon test
p-value   0.04                 0.08               0.16    0.05
Random Forest
Features                      AUC    Accuracy   Recall   Precision
Taxonomic, F                  0.87   0.83       0.94     0.82
Taxonomic, GH                 0.87   0.86       0.81     0.95
Functional, F                 0.79   0.77       0.89     0.77
Taxonomic and functional, F   0.85   0.78       0.89     0.79
Comparison
Classifier                 p-value
Elastic net                0.48
Random Forests             0.36
SVM (Linear correlation)   0.15
SVM (Wilcoxon test)        0.19
SVM (LASSO)                0.41
SVM (Information gain)     0.26

Table: P-values for comparison of ROC curves with LASSO, one-sided bootstrap
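The slides state only that a one-sided bootstrap was used, so the exact procedure is not known. The sketch below shows one simple variant under that assumption: samples are resampled with replacement, the AUC of both classifiers is recomputed on each resample, and the p-value is the fraction of resamples in which classifier A is not better than classifier B. All names, scores and labels are hypothetical.

```python
import random


def auc_rank(scores, labels):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_pvalue(scores_a, scores_b, labels, n_boot=500, seed=0):
    """One-sided p-value for the alternative AUC(a) > AUC(b)."""
    rng = random.Random(seed)
    n = len(labels)
    not_better = used = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if sum(lab) in (0, n):
            continue  # resample contains a single class, AUC undefined
        diff = (auc_rank([scores_a[i] for i in idx], lab)
                - auc_rank([scores_b[i] for i in idx], lab))
        used += 1
        if diff <= 0:
            not_better += 1
    return not_better / used


# Hypothetical labels and scores of two classifiers on the same samples
labels = [1] * 10 + [0] * 10
scores_a = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.65, 0.88, 0.4,
            0.3, 0.2, 0.5, 0.1, 0.35, 0.25, 0.45, 0.15, 0.55, 0.05]
scores_b = [0.6, 0.4, 0.7, 0.3, 0.8, 0.5, 0.45, 0.35, 0.65, 0.2,
            0.55, 0.25, 0.5, 0.15, 0.6, 0.3, 0.4, 0.1, 0.7, 0.05]

print(bootstrap_auc_pvalue(scores_a, scores_b, labels))
```

Resampling the same indices for both classifiers keeps the comparison paired, which matters because both models are scored on the same individuals.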