SLIDE 1

15-388/688 - Practical Data Science: Linear classification

  • J. Zico Kolter

Carnegie Mellon University Fall 2019

SLIDE 2

Outline

  • Example: classifying tumors
  • Classification in machine learning
  • Example classification algorithms
  • Libraries for machine learning

SLIDE 3

Outline

  • Example: classifying tumors
  • Classification in machine learning
  • Example classification algorithms
  • Libraries for machine learning

SLIDE 4

Classification tasks

  • Regression tasks: predicting a real-valued quantity y ∈ ℝ
  • Classification tasks: predicting a discrete-valued quantity y
  • Binary classification: y ∈ {βˆ’1, +1}
  • Multiclass classification: y ∈ {1, 2, …, k}

SLIDE 5

Example: breast cancer classification

Well-known classification example: using machine learning to diagnose whether a breast tumor is benign or malignant [Street et al., 1992]. Setting: a doctor extracts a sample of fluid from the tumor, stains the cells, then outlines several of the cells (image processing refines the outline). The system computes features for each cell such as area, perimeter, concavity, texture (10 total), then computes the mean/std/max of these features across all cells.

SLIDE 6

Example: breast cancer classification

Plot of two features: mean area vs. mean concave points, for two classes

SLIDE 7

Linear classification example

Linear classification ≑ "drawing a line separating the classes"

SLIDE 8

Outline

  • Example: classifying tumors
  • Classification in machine learning
  • Example classification algorithms
  • Libraries for machine learning

SLIDE 9

Formal setting

Input features: x^(i) ∈ ℝ^n, i = 1, …, m

  • E.g.: x^(i) = [Mean_Area^(i), Mean_Concave_Points^(i), 1]^T

Outputs: y^(i) ∈ 𝒴, i = 1, …, m

  • E.g.: y^(i) ∈ {βˆ’1 (benign), +1 (malignant)}

Model parameters: ΞΈ ∈ ℝ^n

Hypothesis function: h_ΞΈ : ℝ^n β†’ ℝ, aims for the same sign as the output (informally, a measure of confidence in our prediction)

  • E.g.: h_ΞΈ(x) = ΞΈ^T x, Ε· = sign(h_ΞΈ(x))
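To make the notation concrete, here is a minimal NumPy sketch (the function names are illustrative, not from the lecture), assuming X is an m Γ— n array whose rows are the inputs x^(i):

import numpy as np

def h(theta, X):
    # Linear hypothesis h_theta(x) = theta^T x, evaluated for every row of X
    return X @ theta

def predict(theta, X):
    # Predicted class yhat = sign(h_theta(x))
    return np.sign(h(theta, X))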

SLIDE 10

Understanding linear classification diagrams

Color shows the regions where h_ΞΈ(x) is positive; the separating boundary is given by the equation h_ΞΈ(x) = 0

SLIDE 11

Loss functions for classification

How do we define a loss function β„“ : ℝ Γ— {βˆ’1, +1} β†’ ℝ_+? What about just using squared loss?


[Figure: plots of y ∈ {βˆ’1, +1} vs. x, comparing a least-squares fit to a perfect classifier]

SLIDE 12

0/1 loss (i.e. error)

The loss we would like to minimize (0/1 loss, or just "error"):

β„“_{0/1}(h_ΞΈ(x), y) = { 0 if sign(h_ΞΈ(x)) = y; 1 otherwise } = 1{y Β· h_ΞΈ(x) ≀ 0}

SLIDE 13

Alternative losses

Unfortunately, the 0/1 loss is hard to optimize (it is NP-hard to find the classifier that minimizes 0/1 loss; this relates to a property of the function called convexity). A number of alternative losses for classification are typically used instead:


β„“_{0/1} = 1{y Β· h_ΞΈ(x) ≀ 0}
β„“_logistic = log(1 + exp(βˆ’y Β· h_ΞΈ(x)))
β„“_hinge = max{1 βˆ’ y Β· h_ΞΈ(x), 0}
β„“_exp = exp(βˆ’y Β· h_ΞΈ(x))
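For concreteness, all four losses can be written in terms of the margin y Β· h_ΞΈ(x); a minimal NumPy sketch (function names are illustrative), applied elementwise to an array of margins:

import numpy as np

# Each loss takes the margin z = y * h_theta(x) as an array
def loss_01(z):       return (z <= 0).astype(float)
def loss_logistic(z): return np.log(1 + np.exp(-z))
def loss_hinge(z):    return np.maximum(1 - z, 0)
def loss_exp(z):      return np.exp(-z)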

SLIDE 14

Poll: sensitivity to outliers

How sensitive would you estimate each of the following losses would be to outliers (i.e., points typically heavily misclassified)?

  • 1. 0/1 < Exp < {Hinge, Logistic}
  • 2. Exp < Hinge < Logistic < 0/1
  • 3. Hinge < 0/1 < Logistic < Exp
  • 4. 0/1 < {Hinge, Logistic} < Exp
  • 5. Outliers don't exist in classification because the output space is bounded

SLIDE 15

Machine learning optimization

With this notation, the "canonical" machine learning problem is written in the exact same way:

minimize_ΞΈ βˆ‘_{i=1}^m β„“(h_ΞΈ(x^(i)), y^(i))

Unlike least squares, there is no analytical solution to the zero-gradient condition for most classification losses. Instead, we solve these optimization problems using gradient descent (or an alternative optimization method, but we'll only consider gradient descent here):

Repeat: ΞΈ := ΞΈ βˆ’ Ξ± βˆ‘_{i=1}^m βˆ‡_ΞΈ β„“(h_ΞΈ(x^(i)), y^(i))

SLIDE 16

Outline

  • Example: classifying tumors
  • Classification in machine learning
  • Example classification algorithms
  • Libraries for machine learning

SLIDE 17

Support vector machine

A (linear) support vector machine (SVM) just solves the canonical machine learning optimization problem using the hinge loss and a linear hypothesis, plus an additional regularization term (more on this next lecture):

minimize_ΞΈ βˆ‘_{i=1}^m max{1 βˆ’ y^(i) Β· ΞΈ^T x^(i), 0} + (Ξ»/2) β€–ΞΈβ€–_2^2

Even more precisely, the "standard" SVM doesn't actually regularize the ΞΈ_i corresponding to the constant feature, but we'll ignore this here.

Updates using gradient descent:

ΞΈ := ΞΈ βˆ’ Ξ± βˆ‘_{i=1}^m (βˆ’y^(i) x^(i) 1{y^(i) Β· ΞΈ^T x^(i) ≀ 1}) βˆ’ Ξ±Ξ»ΞΈ

SLIDE 18

Support vector machine example

Running support vector machine on cancer dataset, with small regularization parameter (effectively zero)


πœ„ = 1.456 1.848 βˆ’0.189

SLIDE 19

SVM optimization progress

Optimization objective and error versus gradient descent iteration number

SLIDE 20

Logistic regression

Logistic regression just solves this problem using the logistic loss and a linear hypothesis function:

minimize_ΞΈ βˆ‘_{i=1}^m log(1 + exp(βˆ’y^(i) Β· ΞΈ^T x^(i)))

Gradient descent updates (can you derive these?):

ΞΈ := ΞΈ βˆ’ Ξ± βˆ‘_{i=1}^m βˆ’y^(i) x^(i) Β· 1/(1 + exp(y^(i) Β· ΞΈ^T x^(i)))

SLIDE 21

Logistic regression example

Running logistic regression on cancer data set, small regularization

SLIDE 22

Logistic regression example

Running logistic regression on cancer data set, small regularization

SLIDE 23

Multiclass classification

When the output is in {1, …, k} (e.g., digit classification), there are a few different approaches.

Approach 1: Build k different binary classifiers h_ΞΈ_i, each with the goal of predicting class i vs. all others, and output predictions as Ε· = argmax_i h_ΞΈ_i(x)

Approach 2: Use a hypothesis function h_ΞΈ : ℝ^n β†’ ℝ^k and define an alternative loss function β„“ : ℝ^k Γ— {1, …, k} β†’ ℝ_+

E.g., the softmax loss (also called cross-entropy loss):

β„“(h_ΞΈ(x), y) = log βˆ‘_{j=1}^k exp(h_ΞΈ(x)_j) βˆ’ h_ΞΈ(x)_y
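An illustrative sketch of this loss for a single example, assuming h is the length-k vector h_ΞΈ(x) and y is a 0-indexed class label; subtracting max(h) inside the exp is the standard numerical-stability trick:

import numpy as np

def softmax_loss(h, y):
    # log(sum_j exp(h_j)) - h_y, with max(h) shifted out of the exp
    # for numerical stability (the shift cancels in the final value)
    c = h.max()
    return c + np.log(np.exp(h - c).sum()) - h[y]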

SLIDE 24

Outline

  • Example: classifying tumors
  • Classification in machine learning
  • Example classification algorithms
  • Classification with Python libraries

SLIDE 25

Support vector machine in scikit-learn

Train a support vector machine and make predictions as in the code below. Note: scikit-learn is solving the problem with an inverted regularization term (C in place of Ξ»):

minimize_ΞΈ C βˆ‘_{i=1}^m max{1 βˆ’ y^(i) Β· ΞΈ^T x^(i), 0} + (1/2) β€–ΞΈβ€–_2^2


from sklearn.svm import LinearSVC, SVC

clf = SVC(C=1e4, kernel='linear')
# or: clf = LinearSVC(C=1e4, loss='hinge', max_iter=100000)
clf.fit(X, y)            # don't include constant features in X
y_pred = clf.predict(X)
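For a quick check of training accuracy, scikit-learn estimators also provide a score method (mean accuracy on the given data):

clf.score(X, y)          # fraction of correctly classified examples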

SLIDE 26

Native Python SVM

It's pretty easy to write a gradient-descent-based SVM too (see the code below). For the most part, ML algorithms are very simple and you can easily write them yourself, but it's fine to use libraries to quickly try many algorithms. Just watch out for idiosyncratic differences (e.g., C vs. Ξ», the fact that I'm using y ∈ {βˆ’1, +1}, not y ∈ {0, 1}, etc.)


import numpy as np

def svm_gd(X, y, lam=1e-5, alpha=1e-4, max_iter=5000):
    m, n = X.shape
    theta = np.zeros(n)
    Xy = X * y[:, None]   # row i is y^(i) * x^(i)
    for _ in range(max_iter):
        # hinge-loss subgradient plus gradient of the regularization term
        theta -= alpha * (-Xy.T.dot(Xy.dot(theta) <= 1) + lam * theta)
    return theta
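An illustrative usage example (not from the slides), assuming y is in {βˆ’1, +1}; since svm_gd has no separate intercept, append a constant feature to X first:

X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append constant feature
theta = svm_gd(X1, y)
y_pred = np.sign(X1 @ theta)                    # predictions in {-1, +1}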

SLIDE 27

Logistic regression in scikit-learn

An admittedly very nice element of scikit-learn is that we can easily try out other algorithms. For both this example and the SVM, you can access the resulting parameters using the fields shown below.


from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=10000.0)
clf.fit(X, y)
clf.coef_        # parameters other than weight on constant feature
clf.intercept_   # weight on constant feature