Improving Disease Prediction Using Machine Learning Shelda Sajeev, - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Improving Disease Prediction Using Machine Learning

Shelda Sajeev, Stephanie Champion, Anthony Maeder

Flinders Digital Health Research Centre, College of Nursing and Health Sciences, Flinders University, Adelaide, Australia

slide-2
SLIDE 2

Introduction

  • Cardiovascular disease (CVD) is one of the leading causes of death worldwide (~30%) and is regarded as highly preventable (~90%) [1].

  • Primary prevention is thus a high priority and requires screening for risk factors and providing suitable interventions.

  • Clinicians need accurate and reliable disease prediction tools to identify people who are at increased risk of a cardiovascular event.

slide-3
SLIDE 3

Introduction (cont.)

  • Numerous CVD risk prediction models to estimate an individual’s likelihood of a CVD event are available [2].

  • Conventional predictive models typically use simple regression fitting over relatively few risk factors.

  • Regression approaches are simple, but they do not model any non-linearity in the contributions of the chosen factors.

  • In practice, many factors are correlated and have underlying non-linear relationships to the predicted outcome.

  • Regression models are generalised from broad-based population datasets, which can miss some subtle associations.

slide-4
SLIDE 4

Machine Learning

Overcome limitations of the conventional models.

Scale

Cater for a larger number of variables in the model.

Complexity

Address multivariate interactions and non-linear relationships.

Adaptivity

Support an adaptive approach for risk predictor revisions.

slide-5
SLIDE 5

Purpose

The aim of the work reported here was to investigate the plausibility of a machine learning approach by demonstrating its ability to derive prediction models for heart disease risk. This study discusses variations that can arise in the performance of some typical linear and more sophisticated non-linear machine learning prediction methods. The effects of different underlying populations on predictive performance, and the impact of combining cohorts to mimic a more general population, are considered.

slide-6
SLIDE 6

Methods

  • We used two datasets from the widely known University of California, Irvine (UCI) machine learning repository [3].

  • The two datasets were the Statlog heart dataset (270 participants) and Cleveland heart disease dataset (303 participants).

  • To provide a larger sample size, the two datasets were also combined over the 13 common risk factors (with no duplicates).

  • The machine learning study was conducted on the two datasets individually and on the combined dataset.
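The combination step can be sketched as follows, assuming both cohorts are already loaded as pandas DataFrames that share their risk-factor columns. The column names and the tiny inline tables are hypothetical stand-ins (not the UCI data), and the slides do not state which de-duplication rule was used; dropping rows that match exactly on every shared column is one plausible reading.

```python
import pandas as pd

# Hypothetical stand-in rows; the real cohorts share 13 risk factors.
statlog = pd.DataFrame({
    "age": [63, 67, 41],
    "chol": [233, 286, 204],
    "target": [0, 1, 0],
})
cleveland = pd.DataFrame({
    "age": [67, 56, 63],
    "chol": [286, 236, 233],
    "target": [1, 0, 0],
})

# Keep only the columns present in both cohorts.
common = [c for c in statlog.columns if c in cleveland.columns]

# Stack the cohorts, then drop rows identical on every common column
# (an assumed de-duplication rule, not confirmed by the slides).
combined = (
    pd.concat([statlog[common], cleveland[common]], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(len(combined), "unique participants")
```

In this toy example two Cleveland rows repeat Statlog rows, so four of the six stacked rows survive.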

slide-7
SLIDE 7

Study Population Characteristics

  • Average age: 54 years. Substantially fewer women than men (32% women, 68% men).

  • 14% had diabetes and 52% had high cholesterol (above 240).

  • 51% exhibited an abnormality in ECG results and 31% exhibited major vessel calcification in fluoroscopy.

  • 33% experienced exercise-induced angina.

  • 257 (45%) cases of heart disease, from 567 participants. Statlog cohort had 120/270 (44%) and Cleveland had 137/297 (46%).
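The reported case counts can be cross-checked with a few lines of arithmetic. The numbers below are taken from the slide; nothing else is assumed.

```python
# Heart disease cases and cohort sizes as reported on the slide.
cohorts = {"Statlog": (120, 270), "Cleveland": (137, 297)}

total_cases = sum(cases for cases, _ in cohorts.values())
total_n = sum(n for _, n in cohorts.values())

for name, (cases, n) in cohorts.items():
    print(f"{name}: {cases}/{n} = {cases / n:.0%}")
print(f"Combined: {total_cases}/{total_n} = {total_cases / total_n:.0%}")
```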

slide-8
SLIDE 8

Approach

  • Before applying machine learning algorithms, the data was normalized to zero mean and unit variance, to ensure each variable would have the same influence on the cost function in designing the classifier.

  • Data was then separated into Training (80%) and Testing (20%) subsets by random selection, independently and repeatedly for successive runs.

Figure: Overview of the machine learning approach.
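A minimal sketch of the normalisation and repeated 80/20 split, using scikit-learn. The data here is synthetic (generated with `make_classification` as a stand-in for the 13 risk factors), and the use of a different random seed per run is an assumption about how the independent splits were drawn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 risk factors (not the UCI data).
X, y = make_classification(n_samples=567, n_features=13, random_state=0)

# Normalise every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Repeat the random 80/20 split; each seed gives an independent split.
n_runs = 5  # kept small here; the slides report averages over 50 iterations
splits = []
for seed in range(n_runs):
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=seed
    )
    splits.append((X_train, X_test, y_train, y_test))

print(splits[0][0].shape, splits[0][1].shape)
```

This follows the order given on the slide (normalise, then split). A common alternative is to fit the scaler on the training subset only and apply it to the test subset, which keeps test-set statistics out of the preprocessing.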

slide-9
SLIDE 9

Experimental Setup

  • Four popular machine learning models were used:
    • Logistic regression (LR) [4]
    • Linear discriminant analysis (LDA) [5]
    • Support vector machine (SVM) with RBF kernel [6]
    • Random forest (RF) [7]

  • LR and LDA are simple linear classifiers; SVM and RF are more advanced machine learning models that support non-linear classification.

  • All the machine learning algorithms were implemented in Python using the Scikit-learn library.
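The four classifiers can be instantiated in scikit-learn roughly as below. This is a sketch on synthetic data, not the authors' code: the hyperparameters shown (library defaults, `max_iter=1000`, 100 trees) are assumptions, since the slides do not list the settings used.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with 13 features (not the UCI cohorts).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters are assumed; the slides do not specify them.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "SVM-RBF": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test subset
    print(f"{name}: {scores[name]:.3f}")
```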

slide-10
SLIDE 10

Results

  • A confusion matrix was used to review the performance of the classification algorithm, reporting four outcomes: True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN).

  • The performance measures extracted were sensitivity, specificity, precision and accuracy.

  • Sensitivity = TP / (TP + FN)

  • Specificity = TN / (TN + FP)

  • Precision = TP / (TP + FP)

  • Accuracy = (TP + TN) / (TP + TN + FN + FP)
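The four measures follow directly from the confusion-matrix counts. The helper below is an illustrative sketch; the function name and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, precision and accuracy from counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
    }

# Hypothetical counts for a 100-person test set (not from the study).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```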

slide-11
SLIDE 11

Prediction Accuracy – Individual Datasets

Algorithms                      Sensitivity  Specificity  Precision  Accuracy  AUC

Statlog Heart Dataset
Logistic Regression             0.807        0.859        0.821      0.836     0.910
Linear Discriminant Analysis    0.798        0.870        0.830      0.838     0.909
Support Vector Machine – RBF    0.807        0.849        0.849      0.830     0.907
Random Forest                   0.788        0.879        0.838      0.836     0.913

Cleveland Heart Dataset
Logistic Regression             0.794        0.869        0.841      0.834     0.903
Linear Discriminant Analysis    0.789        0.886        0.858      0.840     0.904
Support Vector Machine – RBF    0.773        0.867        0.867      0.828     0.900
Random Forest                   0.778        0.883        0.853      0.832     0.912

Comparison of the performance of the four machine learning models using 13 risk factors predicting heart disease incidence for individual datasets (Statlog and Cleveland). The reported values are the average of 50 iterations.

slide-12
SLIDE 12

Prediction Accuracy – Combined Dataset

Algorithms                      Sensitivity  Specificity  Precision  Accuracy  AUC
Logistic Regression             0.817        0.873        0.844      0.848     0.913
Linear Discriminant Analysis    0.800        0.888        0.857      0.848     0.911
Support Vector Machine – RBF    0.866        0.906        0.885      0.888     0.943
Random Forest                   0.890        0.955        0.943      0.933     0.963

Comparison of the performance of the four machine learning models using 13 risk factors predicting heart disease incidence for combined dataset (Statlog and Cleveland). The reported values are the average of 50 iterations.

slide-13
SLIDE 13

Results – Area Under Curve (ROC)

Figure: ROC curves for Logistic Regression (LR), Random Forest (RF), Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM) models for UCI study participants (Statlog and Cleveland cohorts combined). The ROC is drawn for one of the 50 iterations.
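An ROC curve and its AUC for a single run can be computed along these lines. This is a sketch on synthetic data with a random forest chosen as the example model; the slides do not describe how the curves were produced, so the use of `predict_proba` scores is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features (not the UCI cohorts).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# ROC analysis needs a continuous score: the predicted probability of class 1.
y_score = model.predict_proba(X_test)[:, 1]

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)
print(f"AUC = {auc:.3f} from {len(fpr)} ROC points")
```

Plotting `tpr` against `fpr` for each model on the same axes gives a figure of the kind described in the caption.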

slide-14
SLIDE 14

Discussion

  • The results for the two individual dataset cohorts from different sources show that even for a small dataset, machine learning models can produce good results.

  • Variations between these two comparable cohorts do not adversely affect performance.

  • When the cohorts are combined, the overall performance of the non-linear models increases substantially, while the results from the linear models remain similar.

slide-15
SLIDE 15

Conclusions

  • This work demonstrates that there is value in considering machine learning methods for disease prediction modelling, and indicates that modelling performance may improve as dataset size increases.

  • This suggests that the machine learning approach may be more effective for maintaining prediction accuracy for datasets which change over time, as well as for specialized cohorts within the overall population, for which prediction may be less accurate due to deviation from a standard model.

slide-16
SLIDE 16

References

  • 1. WHO: Prevention of cardiovascular disease: guidelines for assessment and management of total cardiovascular risk. World Health Organization (2007)
  • 2. Sajeev, S., Maeder, A.: Cardiovascular risk prediction models: A scoping review. In: Proceedings of the Australasian Computer Science Week Multiconference, p. 21, ACM (2019)
  • 3. Bache, K., Lichman, M.: UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science (2013), Available at http://archive.ics.uci.edu/ml
  • 4. Hosmer Jr, D.W., Lemeshow, S., Sturdivant, R.X.: Applied logistic regression, vol. 398, John Wiley & Sons (2013)
  • 5. Mika, S., Ratsch, G., Weston, J., Scholkopf, B., Mullers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41-48, IEEE (1999)
  • 6. Van Gestel, T., Suykens, J.A., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., Vandewalle, J.: Benchmarking least squares support vector machine classifiers. Machine Learning 54(1), pp. 5-32 (2004)
  • 7. Breiman, L.: Random forests. Machine Learning 45(1), pp. 5-32 (2001)