  1. COMP 364: Computer Tools for Life Sciences Intro to machine learning with scikit-learn (part three) Christopher J.F. Cameron and Carlos G. Oliver 1 / 23

  2. Key course information TA office hours ◮ For this week only: ◮ Pouriya - Friday 1:30-3:00 pm TR 3104 ◮ if room is locked, check TR 3090 Course evaluations ◮ available now at the following link: ◮ https://horizon.mcgill.ca/pban1/twbkwbis.P_WWWLogin?ret_code=f 2 / 23

  3. Recap - Titanic survival problem In the last COMP 364 lecture ◮ implemented a support vector machine classifier (SVC) ◮ created a learned SVC model from training data ◮ calculated train and test mean squared error (MSE) ◮ MSE isn’t typically used for classification Let’s implement a better accuracy metric for our classifier ◮ receiver operating characteristic (ROC) 3 / 23

  4. ROC curves A plot that represents the predictive capability of a classifier ◮ across various discrimination thresholds ROC curves are created by plotting the true positive rate (TPR) against the false positive rate (FPR) ◮ wait....what are those? ◮ TPR is the proportion of true positives (TP) ◮ stop...what is a TP? Let’s start with an example ROC plot 4 / 23

  5. Example ROC plot A, B, C, and C′ ◮ are different methods ◮ i.e., different ML models Top left of the plot ◮ perfect classification ◮ no incorrect predictions Dashed red line ◮ result of random chance ◮ e.g., flipping a coin 5 / 23

  6. True/false positives and negatives True positive (TP) Positive example is predicted to be positive ◮ surviving Titanic passenger predicted to survive False positive (FP) Negative example is predicted to be positive ◮ dead Titanic passenger predicted to survive True negative (TN) Negative example is predicted to be negative False negative (FN) Positive example is predicted to be negative 6 / 23
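The four definitions above can be counted directly from paired true/predicted labels. A minimal sketch (the label lists here are illustrative, not the Titanic data; 1 = survived, 0 = did not survive):

```python
# Count TP, FP, TN, FN from paired true/predicted labels
# (illustrative example labels only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, fp, tn, fn)  # -> 3 1 3 1
```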

  7. Confusion matrices
A table describing the counts of TPs, FPs, TNs, and FNs
In scikit-learn, we can get the confusion matrix for the SVC by:

from sklearn import svm
from sklearn.metrics import confusion_matrix

clf = svm.SVC()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(tn, fp, fn, tp)
# prints: 86 33 41 49

These counts are for the current threshold used by the SVC 7 / 23

  8. TPR vs FPR
True positive rate (TPR)
The proportion of correctly classified true examples
◮ i.e., surviving passengers predicted to survive
TPR = TP / (TP + FN) = 49 / (49 + 41) ≈ 0.54
False positive rate (FPR)
The proportion of incorrectly labeled negative examples
◮ i.e., dead passengers predicted to survive
FPR = FP / (FP + TN) = 33 / (33 + 86) ≈ 0.28
8 / 23
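Plugging the confusion-matrix counts from the previous slide into these two formulas:

```python
# Confusion-matrix counts from the SVC on the previous slide
tp, fp, tn, fn = 49, 33, 86, 41

tpr = tp / (tp + fn)  # true positive rate
fpr = fp / (fp + tn)  # false positive rate

print(round(tpr, 2), round(fpr, 2))  # -> 0.54 0.28
```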

  9. Creating a ROC curve To create a ROC curve we need to: 1. extract the score assigned by the SVC for test examples 2. calculate TPRs and FPRs at various thresholds 3. calculate area under the curve (AUC) for the ROC ◮ greater the area, better the classifier 4. plot TPR vs. FPR 9 / 23

  10. Extracting SVC scores
To extract the score assigned by the SVC for each test example

from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)
print(scores[:5])
# prints: [-0.26781241  0.1145858  -0.40117029
#          0.35895218 -1.07689094]
preds = clf.predict(X_test)
print(preds[:5])
# prints: [0 1 0 1 0]

What threshold is being used to convert scores to labels? 10 / 23
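The answer: for a binary SVC, predict() labels an example positive exactly when its decision_function score exceeds zero. A sketch checking this on synthetic data (make_classification stands in for the Titanic features here):

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic two-class data standing in for the Titanic features
X, y = make_classification(n_samples=200, random_state=0)

clf = svm.SVC()
clf.fit(X, y)

scores = clf.decision_function(X)
preds = clf.predict(X)

# predict() is equivalent to thresholding the scores at 0
assert np.array_equal(preds, (scores > 0).astype(int))
```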

  11. Calculating FPR, TPR, and AUC
To calculate FPRs, TPRs, and AUC for the SVC’s scores:

from sklearn.metrics import roc_curve, auc

scores = clf.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
print(sorted(thresholds))
# prints:
# [-1.1008432176342331, -1.0168691423153751,
#  -1.0002881313288357, -0.98665888089289866,
#  ... 0.032799334871084884,
#  0.16940940621752093, 0.32341186208816985,
#  0.61088122361422137, 2.2800666503978029]

11 / 23

  12. Plotting ROC curves

import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, "b-", lw=2,
         label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], "k--", lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for SVC')
plt.legend(loc="lower right")
plt.savefig("./roc_curve.png")
plt.close()

12 / 23

  13. [figure: the resulting ROC curve for the SVC] 13 / 23

  14. Can we improve upon our predictor?
Let’s try applying a different ML algorithm to the data
◮ perhaps a decision tree?
◮ http://scikit-learn.org/stable/modules/tree.html
◮ remember to choose the classifier

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
# similar to .decision_function()
dt_scores = clf.predict_proba(X_test)[:, 1]

14 / 23
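Note that predict_proba() returns one row per example and one column per class, and column 1 holds the probability of the positive class. A sketch on synthetic data (again standing in for the Titanic features) also shows why the next slide's point holds: a fully grown tree has pure leaves, so on its training data the "scores" collapse to just 0 and 1 rather than a continuous range:

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in for the Titanic training data
X, y = make_classification(n_samples=100, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X)
print(proba.shape)  # -> (100, 2): one row per example, one column per class

# Column 1 is the probability of the positive class; a fully grown
# tree has pure leaves, so these are all 0.0 or 1.0 on training data
dt_scores = proba[:, 1]
print(sorted(set(dt_scores.tolist())))  # -> [0.0, 1.0]
```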

  15. Decision trees only provide class labels 15 / 23

  16. Decision trees (DT) Okay, why did we use a DT if it isn’t useful to plot a ROC curve? ◮ DTs do not transform model input as heavily ◮ input examples of the SVC are transformed ◮ using a radial basis function ◮ prevents interpretation of feature importance In addition to making accurate predictions ◮ we would like to know which features contribute most to a model’s predictions ◮ i.e., feature importance for the ML model 16 / 23

  17. Extracting feature importance
To implement the DT in scikit-learn
◮ we are working with a classifier object
◮ which has particular attributes:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.feature_importances_)
# prints:
# [ 0.10237889  0.30678435  0.25202615
#   0.04502743  0.00862753  0.25493346
#   0.0302222 ]

17 / 23
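To read these importances, it helps to pair them with feature names and sort. A sketch using the seven importances printed above; the column names are hypothetical stand-ins for whichever Titanic features were used, not taken from the lecture:

```python
# Hypothetical names for the seven feature columns (assumption,
# not from the lecture) paired with the printed importances
features = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
importances = [0.10237889, 0.30678435, 0.25202615, 0.04502743,
               0.00862753, 0.25493346, 0.0302222]

# Rank features from most to least important
ranked = sorted(zip(features, importances), key=lambda pair: pair[1],
                reverse=True)
for name, weight in ranked:
    print(f"{name}: {weight:.3f}")
```

Importances sum to 1, so each value is the fraction of the tree's impurity reduction attributable to that feature.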

  18. [figure omitted] 18 / 23

  19. ML - closing comments Now that we know the important features for the DT ◮ we could try improving our predictions by: ◮ filtering out low-weighted features ◮ using a different transformation function in the SVC ◮ manipulating the optional arguments of the DT and SVC ◮ applying other ML algorithms ◮ building our own ML algorithm? ◮ transforming input and output values ◮ and so on But... we need to move on to our next topic in COMP 364 ◮ digital image analysis/processing 19 / 23
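The first suggestion, filtering out low-weighted features, can be sketched with scikit-learn's SelectFromModel utility (from sklearn.feature_selection); the synthetic data below stands in for the Titanic features:

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in: 200 examples, 7 features like our Titanic table
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Keep only the features whose importance exceeds the mean importance
selector = SelectFromModel(clf, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

The reduced matrix could then be fed back into the SVC to see whether dropping the low-weight columns changes the AUC.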

  20. What is digital image analysis? Digital image analysis (DIA) : the extraction of useful information from images DIA is considered to consist of the following: ◮ pre-processing ◮ image enhancement ◮ classification (hmmm...sounds familiar, eh?) ◮ unsupervised ◮ supervised ◮ object-based ◮ change detection ◮ data merging 20 / 23

  21. Think objects, not pixels Object-based Image Analysis (OBIA) involves: ◮ segmentation of images into objects 21 / 23
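One simple way to preview segmentation is connected-component labelling, which groups touching foreground pixels into distinct objects; scipy.ndimage.label does this, and the tiny binary image below is purely illustrative:

```python
import numpy as np
from scipy import ndimage

# Toy binary image with two separate foreground "objects"
image = np.array([[1, 1, 0, 0, 0],
                  [1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1]])

# Labelling assigns each connected object its own integer id
labels, n_objects = ndimage.label(image)
print(n_objects)  # -> 2
```

Each object's pixels share an id in `labels`, so later steps (measuring, classifying objects as road/grass/tree, etc.) can operate per object instead of per pixel.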

  22. OBIA Then classifying objects resulting from segmentation ◮ to identify components of the image ◮ e.g., roads, grass, trees, etc. 22 / 23

  23. Next time in COMP 364 Exploring the scikit-image module scikit-image API http://scikit-image.org/docs/dev/api/api.html scikit-image tutorials http://scikit-image.org/docs/dev/user_guide/tutorials.html 23 / 23
