SLIDE 1

Binary Classification

Prof. Mike Hughes

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to:
Erik Sudderth (UCI), Finale Doshi-Velez (Harvard),
James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Logistics

  • Waitlist: We have some room; contact me
  • HW2 due TONIGHT (Wed 2/6 at 11:59pm)
    • What you submit: PDF and zip
    • Please annotate pages in Gradescope!
  • HW3 out later tonight, due a week from today
    • What you submit: PDF and zip
    • Please annotate pages in Gradescope!
  • Next recitation is Mon 2/11
    • Practical binary classifiers in Python with sklearn
    • Numerical issues and how to address them


Mike Hughes - Tufts COMP 135 - Spring 2019

SLIDE 3

Objectives: Classifier Overview

  • 3 steps of a classification task
    • Prediction
      • Making hard binary decisions
      • Predicting class probabilities
    • Training
    • Evaluation
  • Performance Metrics
  • A “taste” of 3 Methods
    • Logistic Regression
    • K-Nearest Neighbors
    • Decision Tree Classifier

SLIDE 4

What will we learn?

Supervised Learning (vs. Unsupervised Learning, Reinforcement Learning)

The task consumes data, label pairs {x_n, y_n}_{n=1}^N and a performance measure.

Three steps: Training, Prediction, Evaluation

SLIDE 5

Before: Regression

A supervised learning task: y is a numeric variable, e.g. sales in $.
(Figure: scatter of x vs. y with a fitted regression curve.)

SLIDE 6

Task: Binary Classification

A supervised learning task: y is a binary variable (red or blue).
(Figure: scatter of features x1 vs. x2, points colored by binary label y.)

SLIDE 7

Example: Hotdog or Not

SLIDE 8

Task: Multi-class Classification

A supervised learning task: y is a discrete variable (red or blue or green or purple).
(Figure: scatter of features x1 vs. x2, points colored by class label y.)

SLIDE 9

Classification Example: Swype

Many possible letters: multi-class classification.

SLIDE 10

Binary Prediction Step

Goal: Predict label (0 or 1) given features x

  • Input: feature vector x_i ≜ [x_i1, x_i2, . . . , x_if, . . . , x_iF]
    (aka “features”, “covariates”, “predictors”, “attributes”)
    Entries can be real-valued, or other numeric types (e.g. integer, binary).
  • Output: binary label y_i ∈ {0, 1}
    (aka “responses”, “labels”)

SLIDE 11

Binary Prediction Step

>>> # Given: pretrained classifier object `model`
>>> # Given: 2D array of features x_NF
>>> x_NF.shape
(N, F)
>>> yhat_N = model.predict(x_NF)
>>> yhat_N[:5]  # peek at predictions
[0, 0, 1, 0, 1]
>>> yhat_N.shape
(N,)

SLIDE 12

Types of binary predictions

TP : true positive
FP : false positive
TN : true negative
FN : false negative

SLIDE 13

Example: Which outcome is this? (Figure: a predicted label shown against the true label.)

SLIDE 14

Answer: True Positive (TP)

SLIDE 15

Example: Which outcome is this?

SLIDE 16

Answer: True Negative (TN)

SLIDE 17

Example: Which outcome is this?

SLIDE 18

Answer: False Negative (FN)

SLIDE 19

Example: Which outcome is this?

SLIDE 20

Answer: False Positive (FP)

SLIDE 21

Probability Prediction Step

Goal: Predict probability p(Y = 1) given features x

  • Input: feature vector x_i ≜ [x_i1, x_i2, . . . , x_if, . . . , x_iF]
    (aka “features”, “covariates”, “predictors”, “attributes”)
    Entries can be real-valued, or other numeric types (e.g. integer, binary).
  • Output: probability p̂_i between 0 and 1, e.g. 0.001, 0.513, 0.987
    (aka “probabilities”)

SLIDE 22

Probability Prediction Step

>>> # Given: pretrained classifier object `model`
>>> # Given: 2D array of features x_NF
>>> x_NF.shape
(N, F)
>>> yproba_N2 = model.predict_proba(x_NF)
>>> yproba_N2.shape
(N, 2)
>>> yproba_N2[:5, 1]  # peek at first 5
[0.003, 0.358, 0.987, 0.111, 0.656]

Column index 1 gives the probability of the positive label, p(Y = 1).

SLIDE 23

Thresholding to get Binary Decisions
Credit: Wattenberg, Viégas, Hardt
(Figure: classifier scores with a decision threshold.)

SLIDE 24

Thresholding to get Binary Decisions (continued)

SLIDE 25

Thresholding to get Binary Decisions (continued)

SLIDE 26

Pair Exercise

Interactive Demo: https://research.google.com/bigpicture/attacking-discrimination-in-ml/

Loan and pay back: +$300
Loan and not pay back: -$700

Goals:
  • What threshold maximizes accuracy?
  • What threshold maximizes profit?
  • What needs to be true of costs so the threshold is the same for profit and accuracy?
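The exercise above can be sketched numerically. The probabilities and labels below are invented toy data (the real ones live in the interactive demo); only the +$300 / -$700 payoffs come from the slide.

```python
import numpy as np

# Hypothetical predicted probabilities of repayment and true outcomes
# (1 = paid back). Payoffs from the slide: +$300 if repaid, -$700 if not.
proba = np.array([0.10, 0.30, 0.45, 0.60, 0.70, 0.85, 0.90, 0.95])
y     = np.array([0,    0,    1,    0,    1,    1,    1,    1])

def accuracy_at(thresh):
    yhat = (proba >= thresh).astype(int)  # grant a loan when proba >= thresh
    return np.mean(yhat == y)

def profit_at(thresh):
    granted = proba >= thresh
    return 300 * np.sum(granted & (y == 1)) - 700 * np.sum(granted & (y == 0))

thresholds = np.linspace(0.0, 1.0, 101)
best_acc_thresh = max(thresholds, key=accuracy_at)
best_profit_thresh = max(thresholds, key=profit_at)
```

In this toy example the profit-maximizing threshold sits above the accuracy-maximizing one, because a default costs more than a repayment earns.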

SLIDE 27

Classifier: Training Step

Goal: Given a labeled dataset, learn a function that can perform prediction well

  • Input: pairs of features and labels/responses {x_n, y_n}_{n=1}^N
  • Output: a prediction function ŷ(·) : R^F → {0, 1}

Useful to break into two steps:
1) Produce probabilities in [0, 1] OR real-valued scores
2) Threshold to make binary decisions

SLIDE 28

Classifier: Training Step

>>> # Given: 2D array of features x_NF
>>> # Given: 1D array of binary labels y_N
>>> y_N.shape
(N,)
>>> x_NF.shape
(N, F)
>>> model = BinaryClassifier()
>>> model.fit(x_NF, y_N)
>>> # Now can call predict or predict_proba
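The pseudocode above becomes runnable by substituting a concrete sklearn estimator for the generic BinaryClassifier; the toy dataset below is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-feature dataset: label is 1 when the features sum to a positive value
rng = np.random.RandomState(0)
x_NF = rng.randn(100, 2)
y_N = (x_NF.sum(axis=1) > 0).astype(int)

model = LogisticRegression()  # stands in for BinaryClassifier()
model.fit(x_NF, y_N)

yhat_N = model.predict(x_NF)           # hard 0/1 decisions, shape (N,)
yproba_N2 = model.predict_proba(x_NF)  # shape (N, 2); column 1 is p(Y = 1)
```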

SLIDE 29

Classifier: Evaluation Step

Goal: Assess quality of predictions

Many ways in practice:
1) Evaluate probabilities / scores directly: logistic loss, hinge loss, …
2) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, etc.
3) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve

SLIDE 30

Metric: Confusion Matrix

Counting mistakes in binary predictions:

                   predicted 0    predicted 1
  true y = 0          #TN            #FP
  true y = 1          #FN            #TP

#TP : num. true positive     #FP : num. false positive
#TN : num. true negative     #FN : num. false negative
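A small check of the four counts with sklearn (labels below are invented); note that sklearn orders the cells [[#TN, #FP], [#FN, #TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1])

# Rows are true labels, columns are predicted labels:
# [[#TN, #FP],
#  [#FN, #TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```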

SLIDE 31

Metric: Accuracy

accuracy = fraction of correct predictions
         = (#TP + #TN) / (#TP + #TN + #FN + #FP)

Potential problem: Suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts “negative”?

SLIDE 32

Metric: Accuracy

Answer: 99%!
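The slide's point checks out in a couple of lines:

```python
import numpy as np

# 1 positive example, 99 negatives, as in the slide
y_true = np.array([1] + [0] * 99)
yhat_always_negative = np.zeros(100, dtype=int)  # degenerate classifier

# 99% accuracy, despite never finding the one positive example
accuracy = np.mean(yhat_always_negative == y_true)
```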

SLIDE 33

Metrics for Binary Decisions

Emphasize the metrics appropriate for your application.

  • TPR (“sensitivity”, “recall”) = #TP / (#TP + #FN)
  • TNR (“specificity”, = 1 − FPR) = #TN / (#TN + #FP)
  • PPV (“precision”) = #TP / (#TP + #FP)
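These rates follow directly from the confusion-matrix counts; a minimal sketch (the helper name and counts are invented, and nonzero denominators are assumed):

```python
def binary_metrics(tp, tn, fp, fn):
    """Rates from confusion-matrix counts (assumes nonzero denominators)."""
    tpr = tp / (tp + fn)  # sensitivity / recall
    tnr = tn / (tn + fp)  # specificity = 1 - FPR
    ppv = tp / (tp + fp)  # precision
    return tpr, tnr, ppv

tpr, tnr, ppv = binary_metrics(tp=3, tn=2, fp=1, fn=1)
```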

SLIDE 34

Goal: App to classify cats vs. dogs from images
Which metric might be most important? Could we just use accuracy?

SLIDE 35

Goal: Classifier to find relevant tweets to list on website
Which metric might be most important? Could we just use accuracy?

SLIDE 36

Goal: Detector for tumors based on medical image
Which metric might be most important? Could we just use accuracy?

SLIDE 37

ROC Curve (across thresholds)

Axes: FPR (1 − specificity) on x, TPR (sensitivity) on y.
Each point on the curve corresponds to a specific threshold; the diagonal corresponds to a random guess, and the top-left corner to a perfect classifier.

SLIDE 38

Area under ROC curve (aka AUROC or AUC or “C statistic”)

Graphical: the area under the ROC curve (TPR vs. FPR).

Probabilistic:
  AUROC ≜ Pr( ŷ(x_i) > ŷ(x_j) | y_i = 1, y_j = 0 )

For a random pair of examples, one positive and one negative, what is the probability the classifier will rank the positive one higher?

Area varies from 0.0 to 1.0: 0.5 is a random guess, 1.0 is perfect.
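The two readings agree, which can be checked directly (labels and scores below are invented):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])

auroc = roc_auc_score(y_true, scores)

# Probabilistic reading: fraction of (positive, negative) pairs
# where the positive example receives the higher score
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise_frac = np.mean([p > n for p in pos for n in neg])
```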

SLIDE 39

Precision-Recall Curve

Axes: recall (= TPR) on x, precision on y.
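sklearn computes the curve's points from scores directly (same invented labels and scores as in the AUROC example):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])

# One (precision, recall) point per distinct score threshold,
# plus a final (precision=1, recall=0) endpoint appended by sklearn
precision, recall, thresholds = precision_recall_curve(y_true, scores)
```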

SLIDE 40

AUROC not always best choice

(Figure: two ROC curves, red and blue.) By AUROC, the red curve is better, but the blue one is much better for avoiding alarm fatigue.

SLIDE 41

Classifier: Evaluation Metrics

https://scikit-learn.org/stable/modules/model_evaluation.html

1) To evaluate predicted scores / probabilities
2) To evaluate specific binary decisions
3) To make ROC or PR curves and compute areas

SLIDE 42

Objectives: Classifier Overview

  • 3 steps of a classification task
    • Prediction
      • Making hard binary decisions
      • Predicting class probabilities
    • Training
    • Evaluation
  • Performance Metrics
  • A “taste” of 3 Methods
    • Logistic Regression
    • K-Nearest Neighbors
    • Decision Tree Classifier

SLIDE 43

Logistic Sigmoid Function

Goal: Transform real values into probabilities.

sigmoid(z) = 1 / (1 + e^(−z))
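A direct implementation of the formula, written to avoid overflow for large negative inputs (the split-by-sign trick is an implementation detail added here, not something from the slide):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), computed without overflow."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])  # safe: z < 0 here, so exp cannot overflow
    out[~pos] = ez / (1.0 + ez)
    return out

p = sigmoid(np.array([-800.0, -2.0, 0.0, 2.0, 800.0]))  # values in [0, 1]
```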

SLIDE 44

Logistic Regression

Parameters:
  weight vector  w = [w_1, w_2, . . . , w_f, . . . , w_F]
  bias scalar    b

Prediction:
  p̂(x_i, w, b) ≜ p(y_i = 1 | x_i) = sigmoid( Σ_{f=1}^{F} w_f x_if + b )

Training: find weights and bias that minimize error.

SLIDE 45

Measuring prediction quality for a probabilistic classifier

Use the log loss (aka “binary cross entropy”, related to “logistic loss”):

log_loss(y, p̂) = −y log p̂ − (1 − y) log(1 − p̂)

from sklearn.metrics import log_loss

Advantages: smooth, and easy to take derivatives!
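The slide's formula, averaged over examples, matches what sklearn's log_loss computes (labels and probabilities below are invented):

```python
import numpy as np
from sklearn.metrics import log_loss

# Invented labels and predicted probabilities of the positive class
y = np.array([0, 1, 1, 0])
p_hat = np.array([0.1, 0.8, 0.6, 0.3])

# Slide formula, averaged over examples
manual = np.mean(-y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat))
from_sklearn = log_loss(y, p_hat)
```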

SLIDE 46

Logistic Regression: Training

Optimization: Minimize total log loss on the training set

  min_{w,b}  Σ_{n=1}^{N} log_loss( y_n, p̂(x_n, w, b) )

Algorithm: Gradient descent. More in next class!
Avoid overfitting: Use L2 or L1 penalty on weights.
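Gradient descent is next class's topic; still, a minimal sketch of minimizing the mean log loss helps fix ideas. The toy one-feature data, learning rate, and step count are all invented, and no regularization penalty is included:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: a single feature, positive label exactly when x > 0
rng = np.random.RandomState(0)
x_N1 = rng.randn(200, 1)
y_N = (x_N1[:, 0] > 0).astype(float)

w, b, lr = np.zeros(1), 0.0, 0.5
for _ in range(500):
    p_N = sigmoid(x_N1 @ w + b)                  # predicted p(y=1 | x)
    w -= lr * (x_N1.T @ (p_N - y_N)) / len(y_N)  # gradient of mean log loss
    b -= lr * np.mean(p_N - y_N)

train_acc = np.mean((sigmoid(x_N1 @ w + b) >= 0.5) == (y_N == 1))
```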

SLIDE 47

Nearest Neighbor Classifier

Parameters: none

Prediction:
  • find the “nearest” training vector to the given input x
  • predict the y value of this neighbor

Training: none needed (use training data as lookup table)

SLIDE 48

K Nearest Neighbors Classifier

Parameters: K, the number of neighbors

Prediction:
  • find the K “nearest” training vectors to input x
  • predict: vote for the most common y in the neighborhood
  • predict_proba: report the fraction of each label among the K neighbors

Training: none needed (use training data as lookup table)
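A runnable version of this recipe with sklearn (the 1-D toy data is invented for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

x_NF = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_N = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_NF, y_N)  # "training" just stores the data

yhat = knn.predict([[1.5], [10.5]])          # majority vote among 3 neighbors
proba = knn.predict_proba([[1.5], [10.5]])   # fraction of neighbor labels
```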

SLIDE 49

Decision Tree Classifier

Goal: Does patient have heart disease?
Leaves make binary predictions! (but can be made probabilistic)

SLIDE 50

Decision Tree Classifier

Parameters:
  • at each internal node: x variable id and threshold
  • at each leaf: probability of positive y label

Prediction:
  • identify the rectangular region containing the input x
  • predict: most common y value in the region
  • predict_proba: report the fraction of each label in the region

Training:
  • minimize error on the training set
  • often, use greedy heuristics
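sklearn's implementation follows this recipe, with greedy top-down threshold selection (the 1-D toy data and the max_depth choice below are invented):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x_NF = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_N = np.array([0, 0, 0, 1, 1, 1])

# Greedy training: each internal node picks a feature id and threshold
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(x_NF, y_N)

yhat = tree.predict([[1.5], [10.5]])          # most common label in the region
proba = tree.predict_proba([[1.5], [10.5]])   # label fractions in the leaf
```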
slide-51
SLIDE 51

Summary of Methods

51

Mike Hughes - Tufts COMP 135 - Spring 2019

Function class flexibility Knobs to tune Interpret? Logistic Regression Linear L2/L1 penalty on weights Inspect weights Decision Tree Classifier Axis-aligned Piecewise constant

  • Max. depth
  • Min. leaf size

Goal criteria Inspect tree K Nearest Neighbors Classifier Piecewise constant Number of Neighbors Distance metric How neighbors vote Inspect neighbors