

SLIDE 1

COMP 364: Computer Tools for Life Sciences

Intro to machine learning with scikit-learn (part three) Christopher J.F. Cameron and Carlos G. Oliver


SLIDE 2

Key course information

TA office hours

◮ For this week only:

  ◮ Pouriya - Friday, 1:30-3:00 pm, TR 3104
  ◮ if the room is locked, check TR 3090

Course evaluations

◮ available now at the following link:

◮ https://horizon.mcgill.ca/pban1/twbkwbis.P_WWWLogin?ret_code=f

SLIDE 3

Recap - Titanic survival problem

In the last COMP 364 lecture

◮ implemented a support vector machine classifier (SVC)
◮ created a learned SVC model from training data
◮ calculated train and test mean squared error (MSE)

◮ MSE isn’t typically used for classification

Let’s implement a better evaluation metric for our classifier

◮ receiver operating characteristic (ROC)

SLIDE 4

ROC curves

A plot that represents the predictive capability of a classifier

◮ across various discrimination thresholds

ROC curves are created by plotting the true positive rate (TPR) against the false positive rate (FPR)

◮ wait... what are those?
◮ TPR is the proportion of true positives (TP)
◮ stop... what is a TP?

Let’s start with an example ROC plot

SLIDE 5

Example ROC plot

A, B, C, and C′

◮ are different methods
  ◮ i.e., different ML models

Top left of the plot

◮ perfect classification
◮ no incorrect predictions

Dashed red line

◮ result of random chance
  ◮ e.g., flipping a coin

SLIDE 6

True/false positives and negatives

True positive (TP): a positive example is predicted to be positive

◮ surviving Titanic passenger predicted to survive

False positive (FP): a negative example is predicted to be positive

◮ dead Titanic passenger predicted to survive

True negative (TN): a negative example is predicted to be negative

False negative (FN): a positive example is predicted to be negative
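The four definitions above can be sketched with plain Python counts. The labels below are a small hypothetical set (1 = survived, 0 = died), not the actual Titanic data:

```python
# Hypothetical ground-truth and predicted labels (1 = survived, 0 = died)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # survivor predicted to survive
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # dead passenger predicted to survive
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # dead passenger predicted dead
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # survivor predicted dead
print(tp, fp, tn, fn)  # prints: 3 1 3 1
```

Every example falls into exactly one of the four categories, so the counts always sum to the number of examples.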

SLIDE 7

Confusion matrices

A table describing the counts of TPs, FPs, TNs, and FNs.

In scikit-learn, we can get the confusion matrix for the SVC by:

```python
from sklearn import svm
from sklearn.metrics import confusion_matrix

clf = svm.SVC()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(tn, fp, fn, tp)
# prints: 86 33 41 49
```

These counts are for the current threshold used by the SVC

SLIDE 8

TPR vs FPR

True positive rate (TPR): the proportion of positive examples that are correctly classified

◮ i.e., surviving passengers predicted to survive

TPR = TP / (TP + FN) = 49 / (49 + 41) ≈ 0.54

False positive rate (FPR): the proportion of negative examples that are incorrectly labeled as positive

◮ i.e., dead passengers predicted to survive

FPR = FP / (FP + TN) = 33 / (33 + 86) ≈ 0.28
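Plugging the confusion-matrix counts from the previous slide into these two formulas is a quick sanity check:

```python
# Counts taken from the confusion matrix on the earlier slide
tp, fn = 49, 41
fp, tn = 33, 86

tpr = tp / (tp + fn)  # proportion of positives predicted correctly
fpr = fp / (fp + tn)  # proportion of negatives predicted incorrectly
print(round(tpr, 2), round(fpr, 2))  # prints: 0.54 0.28
```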

SLIDE 9

Creating a ROC curve

To create a ROC curve, we need to:

1. extract the score assigned by the SVC to each test example
2. calculate TPRs and FPRs at various thresholds
3. calculate the area under the curve (AUC) for the ROC
   ◮ the greater the area, the better the classifier
4. plot TPR vs. FPR

SLIDE 10

Extracting SVC scores

To extract the score assigned by the SVC to each test example:

```python
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)
print(scores[:5])
# prints: [-0.26781241  0.1145858   0.40117029
#          0.35895218  -1.07689094]
preds = clf.predict(X_test)
print(preds[:5])
# prints: [0 1 0 1 0]
```

What threshold is being used to convert scores to labels?
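For a binary SVC, predict() effectively thresholds the decision function at zero: a positive score maps to class 1, anything else to class 0. A minimal sketch of that thresholding, using hypothetical scores rather than the slide's output:

```python
# Hypothetical decision-function scores (stand-ins for clf.decision_function(X_test))
scores = [-0.9, 0.2, -0.1, 1.3, -2.0]

# Threshold at zero: positive score -> class 1, otherwise class 0
labels = [1 if s > 0 else 0 for s in scores]
print(labels)  # prints: [0, 1, 0, 1, 0]
```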

SLIDE 11

Calculating FPR, TPR, and AUC

To calculate FPRs, TPRs, and AUC for the SVC’s scores:

```python
from sklearn.metrics import roc_curve, auc

scores = clf.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
print(sorted(thresholds))
# prints:
# [-1.1008432176342331, -1.0168691423153751,
#  -1.0002881313288357, -0.98665888089289866,
#  ... 0.032799334871084884,
#  0.16940940621752093, 0.32341186208816985,
#  0.61088122361422137, 2.2800666503978029]
```
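auc simply computes the trapezoidal area under the points returned by roc_curve. A small check with hypothetical ROC points (not from the Titanic model):

```python
from sklearn.metrics import auc

# Hypothetical ROC points, sorted by increasing FPR
fpr = [0.0, 0.1, 0.4, 1.0]
tpr = [0.0, 0.6, 0.8, 1.0]

roc_auc = auc(fpr, tpr)
# the same area computed with the trapezoid rule by hand
manual = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
             for i in range(len(fpr) - 1))
print(round(roc_auc, 2), round(manual, 2))  # prints: 0.78 0.78
```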

SLIDE 12

Plotting ROC curves

```python
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, "b-", lw=2,
         label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], "k--", lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic for SVC")
plt.legend(loc="lower right")
plt.savefig("./roc_curve.png")
plt.close()
```

SLIDE 13

[Figure: ROC curve for the SVC]
SLIDE 14

Can we improve upon our predictor?

Let’s try applying a different ML algorithm to the data

◮ perhaps a decision tree?
◮ http://scikit-learn.org/stable/modules/tree.html
◮ remember to choose the classifier

```python
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
# similar to .decision_function()
dt_scores = clf.predict_proba(X_test)[:, 1]
```
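predict_proba returns one column per class, so [:, 1] selects the probability of the positive class. A tiny self-contained sketch with a hypothetical one-feature training set:

```python
from sklearn import tree

# Hypothetical toy training set: one feature, binary labels
X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba([[0], [3]])  # shape (2, 2): one column per class
dt_scores = proba[:, 1]                # probability of class 1
print(dt_scores)  # prints: [0. 1.]
```

Because a fully grown tree ends in pure leaves, these "probabilities" are exactly 0 or 1 here, which is why the next slide's ROC curve looks so coarse.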

SLIDE 15

Decision trees only provide class labels

SLIDE 16

Decision trees (DT)

Okay, why did we use a DT if it isn’t useful for plotting a ROC curve?

◮ DTs do not transform model input as heavily
◮ input examples of the SVC are transformed
  ◮ using a radial basis function
  ◮ which prevents interpretation of feature importance

In addition to making accurate predictions

◮ we would like to know which features contribute most to a model’s predictions
◮ i.e., feature importance for the ML model

SLIDE 17

Extracting feature importance

To implement the DT in scikit-learn

◮ we are working with a classifier object
◮ which has particular attributes:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

```python
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.feature_importances_)
# prints:
# [ 0.10237889  0.30678435  0.25202615
#   0.04502743  0.00862753  0.25493346
#   0.0302222 ]
```
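feature_importances_ holds one value per input feature, and the values sum to 1. A minimal sketch with hypothetical two-feature data:

```python
from sklearn import tree

# Hypothetical toy data: two features, binary labels
X_train = [[0, 1], [1, 1], [2, 0], [3, 0]]
y_train = [0, 0, 1, 1]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# one importance per feature; larger values mean the feature drives more splits
print(clf.feature_importances_)
print(clf.feature_importances_.sum())  # prints: 1.0
```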

SLIDE 18

[Figure: feature importances of the DT]
SLIDE 19

ML - closing comments

Now that we know the important features for the DT

◮ we could try improving our predictions by:

  ◮ filtering out low-weighted features
  ◮ trying a different transformation function in the SVC
  ◮ manipulating optional arguments of the DT and SVC
  ◮ choosing other ML algorithms to apply
  ◮ building our own ML algorithm?
  ◮ transforming input and output values
  ◮ and so on

But... we need to move on to our next topic in COMP 364

◮ digital image analysis/processing
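As one sketch of the first idea, scikit-learn's SelectFromModel can drop low-importance features using a fitted DT. The data here is hypothetical: the constant third feature gets importance 0 and is removed:

```python
from sklearn import tree
from sklearn.feature_selection import SelectFromModel

# Hypothetical data: the third feature is constant, so the DT never splits on it
X_train = [[0, 1, 5], [1, 1, 5], [2, 0, 5], [3, 0, 5]]
y_train = [0, 0, 1, 1]

clf = tree.DecisionTreeClassifier(random_state=0)
selector = SelectFromModel(clf, threshold="mean")  # keep features with importance >= mean
X_reduced = selector.fit_transform(X_train, y_train)
print(X_reduced.shape)  # only the informative feature survives
```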

SLIDE 20

What is digital image analysis?

Digital image analysis (DIA): the extraction of useful information from images

DIA is considered to consist of the following:

◮ pre-processing
◮ image enhancement
◮ classification (hmmm... sounds familiar, eh?)
  ◮ unsupervised
  ◮ supervised
  ◮ object-based
◮ change detection
◮ data merging

SLIDE 21

Think objects, not pixels

Object-based Image Analysis (OBIA) involves:

◮ segmentation of images into objects

SLIDE 22

OBIA

Then classifying objects resulting from segmentation

◮ to identify components of the image
◮ e.g., roads, grass, trees, etc.

SLIDE 23

Next time in COMP 364

Exploring the scikit-image module

scikit-image API: http://scikit-image.org/docs/dev/api/api.html

scikit-image tutorials: http://scikit-image.org/docs/dev/user_guide/tutorials.html
