

SLIDE 1

Binary Classification

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), and James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Today’s objectives (day 07): Binary Classification Basics

  • 3 steps of a classification task
    • Prediction
      • Predicting probabilities of each binary class
      • Making hard binary decisions
    • Training
    • Evaluation (much more in next class)
  • A “taste” of 2 methods
    • Logistic Regression
    • K-Nearest Neighbors

SLIDE 3

What will we learn?

[Diagram: overview of the three ML paradigms (Supervised Learning, Unsupervised Learning, Reinforcement Learning), with Supervised Learning highlighted. A supervised task takes data/label pairs $\{x_n, y_n\}_{n=1}^N$ (data $x$, label $y$) and involves three steps, Training, Prediction, and Evaluation, each judged by a performance measure.]

SLIDE 4

Before: Regression

[Figure: supervised-learning paradigm diagram with the regression task highlighted. In regression, y is a numeric variable, e.g. sales in $$, predicted from input x.]

SLIDE 5

Task: Binary Classification

[Figure: supervised-learning paradigm diagram with the binary classification task highlighted, plus a scatterplot over features x1 and x2 in which y is a binary variable (red or blue).]

SLIDE 6

Example: Hotdog or Not

https://www.theverge.com/tldr/2017/5/14/15639784/hbo-silicon-valley-not-hotdog-app-download

SLIDE 7

Task: Multi-class Classification

[Figure: supervised-learning paradigm diagram with the multi-class classification task highlighted, plus a scatterplot over features x1 and x2 in which y is a discrete variable (red, blue, green, or purple).]

SLIDE 8

Binary Prediction Step

Goal: Predict label (0 or 1) given features x

  • Input: feature vector (“features”, “covariates”, “predictors”, “attributes”)
    $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$
    Entries can be real-valued, or other numeric types (e.g. integer, binary)
  • Output: binary label (“responses”, “labels”)
    $y_i \in \{0, 1\}$

SLIDE 9

Binary Prediction Step

>>> # Given: pretrained binary classifier model
>>> # Given: 2D array of features x_NF
>>> x_NF.shape
(N, F)
>>> yhat_N = model.predict(x_NF)
>>> yhat_N[:5]  # peek at predictions
[0, 0, 1, 0, 1]
>>> yhat_N.shape
(N,)
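For concreteness, here is a runnable version of this prediction step, using scikit-learn's LogisticRegression as a stand-in for the generic pretrained classifier (the model choice and toy data are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative toy data: N=6 examples, F=2 features (assumed)
x_NF = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5],
                 [3.0, 3.0], [4.0, 2.5], [5.0, 4.0]])
y_N = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(x_NF, y_N)  # training step, covered on a later slide
yhat_N = model.predict(x_NF)                 # hard 0/1 decisions, shape (N,)
print(yhat_N)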

SLIDE 10

Types of binary predictions

FP : false positive
TP : true positive
TN : true negative
FN : false negative

SLIDE 11

Probability Prediction Step

Goal: Predict probability of event y=1 given features x

  • Input: feature vector (“features”, “covariates”, “predictors”, “attributes”)
    $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$
    Entries can be real-valued, or other numeric types (e.g. integer, binary)
  • Output: probability $\hat{p}_i$ between 0 and 1 (“probabilities”), e.g. 0.001, 0.513, 0.987

SLIDE 12

Probability Prediction Step

>>> # Given: pretrained binary classifier model
>>> # Given: 2D array of features x_NF
>>> x_NF.shape
(N, F)
>>> yproba_N2 = model.predict_proba(x_NF)
>>> yproba_N2.shape
(N, 2)
>>> yproba_N2[:5, 1]  # peek at positive-class probabilities
[0.003, 0.358, 0.987, 0.111, 0.656]

Column index 1 gives the probability of the positive label given the input features: $p(Y = 1 \mid X)$

SLIDE 13–15

Thresholding to get Binary Decisions

[Figures: the same distribution of predicted scores shown with the decision threshold swept to different values, illustrating how the choice of threshold changes the resulting binary decisions. Credit: Wattenberg, Viégas, Hardt]
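In code, the thresholding step shown in these figures is a single comparison. A minimal sketch, reusing the predicted probabilities from SLIDE 12 (the alternative 0.3 threshold is an illustrative assumption):

import numpy as np

proba_N = np.array([0.003, 0.358, 0.987, 0.111, 0.656])  # p(y=1 | x)
print((proba_N >= 0.5).astype(int))  # default 0.5 threshold: [0 0 1 0 1]
print((proba_N >= 0.3).astype(int))  # lower threshold flips borderline cases: [0 1 1 0 1]

Lowering the threshold trades false negatives for false positives, which is exactly the trade-off the figures above illustrate.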

SLIDE 16

Classifier: Training Step

Goal: Given a labeled dataset, learn a function that can perform (probabilistic) prediction well

  • Input: pairs of features and labels/responses $\{x_n, y_n\}_{n=1}^N$
  • Output: a prediction function $\hat{y}(\cdot) : \mathbb{R}^F \to \{0, 1\}$

Useful to break into two steps:
1) Produce real-valued scores OR probabilities in [0, 1]
2) Threshold to make binary decisions

SLIDE 17

Classifier: Training Step

>>> # Given: 2D array of features x_NF
>>> # Given: 1D array of binary labels y_N
>>> x_NF.shape
(N, F)
>>> y_N.shape
(N,)
>>> model = BinaryClassifier()
>>> model.fit(x_NF, y_N)
>>> # Now can call predict or predict_proba
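A runnable version of the training step, with scikit-learn's LogisticRegression standing in for the generic BinaryClassifier (the toy data is an assumption); it also demonstrates the two-step breakdown from SLIDE 16, since thresholding the predicted probabilities at 0.5 reproduces the built-in hard decisions:

import numpy as np
from sklearn.linear_model import LogisticRegression

x_NF = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5],
                 [3.0, 3.0], [4.0, 2.5], [5.0, 4.0]])
y_N = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(x_NF, y_N)                          # training step

proba_N = model.predict_proba(x_NF)[:, 1]     # step 1: probabilities in [0, 1]
yhat_N = (proba_N >= 0.5).astype(int)         # step 2: threshold
assert (yhat_N == model.predict(x_NF)).all()  # matches the built-in decisions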

SLIDE 18

Classifier: Evaluation Step

Goal: Assess quality of predictions

Many ways in practice:
1) Evaluate probabilities / scores directly: cross entropy loss (aka log loss), hinge loss, …
2) Evaluate binary decisions at a specific threshold: accuracy, TPR, TNR, PPV, NPV, etc.
3) Evaluate across a range of thresholds: ROC curve, Precision-Recall curve
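A minimal sketch of all three evaluation styles using scikit-learn metrics (the label and probability arrays are illustrative assumptions):

import numpy as np
from sklearn.metrics import log_loss, accuracy_score, roc_auc_score

y_N = np.array([0, 0, 1, 1, 1])                          # true labels (assumed)
proba_N = np.array([0.003, 0.358, 0.987, 0.111, 0.656])  # predicted p(y=1 | x)

print(log_loss(y_N, proba_N))           # 1) evaluate probabilities directly
yhat_N = (proba_N >= 0.5).astype(int)   # binary decisions at threshold 0.5
print(accuracy_score(y_N, yhat_N))      # 2) evaluate decisions at one threshold
print(roc_auc_score(y_N, proba_N))      # 3) area under ROC curve, across thresholds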

SLIDE 19

Metric: Confusion Matrix

Counting mistakes in binary predictions

#TP : num. true positive
#FP : num. false positive
#TN : num. true negative
#FN : num. false negative

                Predicted 0    Predicted 1
    True 0          #TN            #FP
    True 1          #FN            #TP
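These four counts can be computed with scikit-learn (the toy labels and predictions are assumed):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1])  # illustrative true labels
y_pred = np.array([0, 1, 1, 0, 1])  # illustrative hard predictions

# scikit-learn convention: rows = true class, columns = predicted class,
# so the result is [[#TN, #FP], [#FN, #TP]]
print(confusion_matrix(y_true, y_pred))  # [[1 1]
                                         #  [1 2]]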

SLIDE 20

Metric: Accuracy

accuracy = fraction of correct predictions

$\text{accuracy} = \frac{\#TP + \#TN}{\#TP + \#TN + \#FN + \#FP}$

Potential problem: suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts “negative”?

SLIDE 21

Metric: Accuracy

accuracy = fraction of correct predictions

$\text{accuracy} = \frac{\#TP + \#TN}{\#TP + \#TN + \#FN + \#FP}$

Potential problem: suppose your dataset has 1 positive example and 99 negative examples. What is the accuracy of the classifier that always predicts “negative”? 99%! Here #TN = 99, #FN = 1, and #TP = #FP = 0, so accuracy = (0 + 99) / 100 = 0.99.

SLIDE 22

Objectives: Classifier Overview

  • 3 steps of a classification task
    • Prediction
      • Making hard binary decisions
      • Predicting class probabilities
    • Training
    • Evaluation
  • A “taste” of 2 methods
    • Logistic Regression
    • K-Nearest Neighbors

SLIDE 23

Logistic Sigmoid Function

Goal: Transform real values into probabilities

$\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$

[Figure: plot of the sigmoid function; the vertical axis is a probability between 0 and 1.]
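A one-line NumPy version of this transform (a sketch; the function name is mine, not from the slides):

import numpy as np

def sigmoid(z):
    """Map real values to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ~[0.018, 0.5, 0.982]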

SLIDE 24

Logistic Regression

Parameters:
  weight vector $w = [w_1, w_2, \ldots, w_f, \ldots, w_F]$
  bias scalar $b$

Prediction:
$\hat{p}(x_i, w, b) \triangleq p(y_i = 1 \mid x_i) = \text{sigmoid}\left( \sum_{f=1}^{F} w_f x_{if} + b \right)$

Training: find weights and bias that minimize “loss”
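This prediction rule is just a dot product followed by the sigmoid. A NumPy sketch (the toy weights, bias, and features are illustrative assumptions):

import numpy as np

def predict_proba(x_NF, w_F, b):
    """Compute p(y=1 | x) = sigmoid(w . x + b) for each row of x_NF."""
    return 1.0 / (1.0 + np.exp(-(x_NF @ w_F + b)))

w_F = np.array([1.5, -0.5])                # illustrative weight vector
b = 0.1                                    # illustrative bias scalar
x_NF = np.array([[0.0, 1.0], [2.0, 0.5]])
print(predict_proba(x_NF, w_F, b))         # ~[0.401, 0.945]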

SLIDE 25

Measuring prediction quality for a probabilistic classifier

Use the log loss (aka “binary cross entropy”):

$\text{log loss}(y, \hat{p}) = -y \log \hat{p} - (1 - y) \log(1 - \hat{p})$

from sklearn.metrics import log_loss

Advantages:
  • smooth
  • easy to take derivatives!
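A quick check of the formula against scikit-learn's implementation, which reports the mean log loss over examples (the toy values are assumed):

import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0])
p = np.array([0.9, 0.2])
# By hand: mean of [-log(0.9), -log(1 - 0.2)] = (0.105 + 0.223) / 2 ≈ 0.164
print(log_loss(y, p))  # ≈ 0.164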

SLIDE 26

Logistic Regression: Training

Optimization: minimize total log loss on the train set

$\min_{w, b} \; \sum_{n=1}^{N} \text{log loss}\big(y_n, \hat{p}(x_n, w, b)\big)$

Algorithm: gradient descent
Avoid overfitting: use an L2 or L1 penalty on the weights
Much more in depth in the next class
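A minimal gradient-descent sketch for this objective (the step size, iteration count, and use of the mean gradient are my assumptions; a real implementation would add the regularization penalty and a convergence check):

import numpy as np

def fit_logistic_regression(x_NF, y_N, step_size=0.1, n_iters=1000):
    """Minimize log loss over (w, b) with batch gradient descent."""
    N, F = x_NF.shape
    w_F, b = np.zeros(F), 0.0
    for _ in range(n_iters):
        p_N = 1.0 / (1.0 + np.exp(-(x_NF @ w_F + b)))  # current p(y=1 | x)
        err_N = p_N - y_N                  # gradient of log loss wrt the score
        w_F -= step_size * (x_NF.T @ err_N) / N   # mean gradient wrt weights
        b -= step_size * err_N.mean()             # mean gradient wrt bias
    return w_F, b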

SLIDE 27–28

Visualizing predicted probas for Logistic Regression

[Figures: predicted probability $p(y = 1 \mid x)$ over a 2D feature plane for trained logistic regression models; the probability varies smoothly and the 0.5 decision boundary is linear.]
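Plots like these can be reproduced by evaluating predict_proba on a dense 2D grid. A self-contained sketch (the synthetic data, grid range, and colormap are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic 2-feature training data (assumed)
rng = np.random.default_rng(0)
x_NF = rng.normal(size=(200, 2))
y_N = (x_NF[:, 0] + x_NF[:, 1] > 0).astype(int)
model = LogisticRegression().fit(x_NF, y_N)

# Evaluate p(y=1 | x) over a dense grid covering the feature plane
G = 200
g1, g2 = np.meshgrid(np.linspace(-3, 3, G), np.linspace(-3, 3, G))
grid_MF = np.column_stack([g1.ravel(), g2.ravel()])
proba_GG = model.predict_proba(grid_MF)[:, 1].reshape(G, G)

plt.contourf(g1, g2, proba_GG, levels=20, cmap="RdBu_r")
plt.colorbar(label="p(y=1 | x)")
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()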

SLIDE 29

Nearest Neighbor Classifier

Parameters: none

Prediction:
  • find the “nearest” training vector to the given input x
  • predict the y value of this neighbor

Training: none needed (use training data as a lookup table)

SLIDE 30

K Nearest Neighbor Classifier

Parameters:
  K : number of neighbors

Prediction:
  • find the K “nearest” training vectors to input x
  • predict: vote for the most common y in the neighborhood
  • predict_proba: report the fraction of each label among the K neighbors

Training: none needed (use training data as a lookup table)
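A sketch with scikit-learn's implementation (the toy data and K=3 are illustrative assumptions):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

x_NF = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                 [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_N = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3
knn.fit(x_NF, y_N)                         # just stores the training data
print(knn.predict([[0.5, 0.5]]))           # 3 nearest all have y=0 -> [0]
print(knn.predict_proba([[1.8, 1.8]]))     # 2 of 3 neighbors have y=0 -> [[0.667 0.333]]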

SLIDE 31–32

Visualizing predicted probas for K-Nearest Neighbors

[Figures: predicted probability $p(y = 1 \mid x)$ over a 2D feature plane for K-nearest-neighbor classifiers; the predicted probabilities are piecewise constant across the plane.]

SLIDE 33

Summary of Methods

Method                | Function class flexibility | Hyperparameters to select (control complexity) | Interpret?
Logistic Regression   | Linear                     | L2/L1 penalty on weights                       | Inspect weights
K Nearest Neighbors   | Piecewise constant         | Number of neighbors; distance metric           | Inspect neighbors