SLIDE 1

Binary Classification with Linear Models

CMSC 422 MARINE CARPUAT

marine@cs.umd.edu

Figures credit: Piyush Rai

SLIDE 2

Topics

  • Linear Models
    – Loss functions
    – Regularization
  • Gradient Descent
  • Calculus refresher
    – Convexity
    – Gradients

[CIML Chapter 6]

SLIDE 3

Binary classification via hyperplanes

  • A classifier is a hyperplane (w, b)
  • At test time, we check on which side of the hyperplane examples fall:

y = sign(wᵀx + b)

  • This is a linear classifier
    – Because the prediction is a linear combination of the feature values x
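To make the decision rule concrete, here is a minimal NumPy sketch of prediction with a linear classifier. The weight vector, bias, and test points are made-up illustrative values, not taken from the slides.

```python
import numpy as np

def predict(w, b, x):
    """Predict a label in {-1, +1} from the sign of w.x + b."""
    activation = np.dot(w, x) + b
    return 1 if activation >= 0 else -1

# Hypothetical 2D classifier: the hyperplane is the set of points where
# w.x + b = 0; the prediction depends only on which side x falls.
w = np.array([2.0, -1.0])
b = 0.5
print(predict(w, b, np.array([1.0, 1.0])))    # +1 (positive side)
print(predict(w, b, np.array([-1.0, 1.0])))   # -1 (negative side)
```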

SLIDE 4

SLIDE 5

Learning a Linear Classifier as an Optimization Problem

min over (w, b):   Σₙ 1[ yₙ (wᵀxₙ + b) ≤ 0 ]  +  λ R(w, b)

  • Indicator function 1[·]: 1 if (·) is true, 0 otherwise; the loss function above is called the 0-1 loss
  • Loss function: measures how well the classifier fits the training data
  • Regularizer: prefers solutions that generalize well
  • Loss plus regularizer together form the objective function
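A minimal NumPy sketch of this objective. The ℓ2 regularizer and the value of λ are illustrative assumptions; the slides leave the regularizer R generic.

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    """Number of training examples the hyperplane (w, b) misclassifies.

    X: (n, d) array of examples; y: length-n array of labels in {-1, +1}.
    """
    margins = y * (X @ w + b)      # positive iff an example is classified correctly
    return np.sum(margins <= 0)    # sum of indicators 1[y_n (w.x_n + b) <= 0]

def objective(w, b, X, y, lam=0.1):
    """0-1 loss plus an (assumed) squared-l2 regularizer weighted by lam."""
    return zero_one_loss(w, b, X, y) + lam * np.dot(w, w)
```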

SLIDE 6

Learning a Linear Classifier as an Optimization Problem

  • Problem: the 0-1 loss above is NP-hard to optimize
  • Solution: different loss function approximations and regularizers lead to specific algorithms (e.g., perceptron, support vector machines, logistic regression)

SLIDE 7

The 0-1 Loss

  • Small changes in w, b can lead to big changes in the loss value
  • The 0-1 loss is non-smooth and non-convex
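A tiny numeric illustration (made-up numbers) of the first point: for an example near the boundary, an arbitrarily small change in w flips the 0-1 loss from 0 to 1, so the loss is discontinuous in the parameters.

```python
import numpy as np

# One example sitting very close to the decision boundary.
x, y = np.array([1.0, 0.0]), +1

for w0 in (0.001, -0.001):        # two nearly identical weight vectors
    w = np.array([w0, 1.0])
    margin = y * np.dot(w, x)     # y * (w.x), taking b = 0
    print(w, int(margin <= 0))    # 0-1 loss jumps from 0 to 1
```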
SLIDE 8

Calculus refresher: Smooth functions, convex functions
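Standard definition, stated here for reference (this wording is mine, not the slide's): a function f is convex when every chord lies on or above its graph, and smooth, for this course's purposes, when it is differentiable everywhere.

```latex
% Convexity: for all points a, b and all t in [0, 1],
\[
  f\bigl(t\,a + (1 - t)\,b\bigr) \;\le\; t\,f(a) + (1 - t)\,f(b).
\]
```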

SLIDE 9

Approximating the 0-1 loss with surrogate loss functions

  • Examples (with b = 0):
    – Hinge loss
    – Log loss
    – Exponential loss
  • All are convex upper bounds on the 0-1 loss
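These surrogates can be written as functions of the margin m = yₙ(w·xₙ), with b = 0 as on the slide. A sketch; the 1/log 2 scaling of the log loss follows CIML's convention, which is what makes all three upper bounds on the 0-1 loss:

```python
import numpy as np

def hinge_loss(m):
    """max(0, 1 - m): zero once the margin exceeds 1."""
    return np.maximum(0.0, 1.0 - m)

def log_loss(m):
    """log2(1 + exp(-m)); the 1/log(2) factor keeps it >= the 0-1 loss."""
    return np.log(1.0 + np.exp(-m)) / np.log(2.0)

def exponential_loss(m):
    """exp(-m): grows very fast for badly misclassified points."""
    return np.exp(-m)

margins = np.array([-2.0, 0.0, 2.0])   # misclassified, on boundary, well classified
for name, loss in [("hinge", hinge_loss), ("log", log_loss), ("exp", exponential_loss)]:
    print(name, loss(margins))
```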

SLIDE 10

Approximating the 0-1 loss with surrogate loss functions

  • Examples (with b = 0):
    – Hinge loss
    – Log loss
    – Exponential loss
  • Q: Which of these loss functions is not smooth?

SLIDE 11

Approximating the 0-1 loss with surrogate loss functions

  • Examples (with b = 0):
    – Hinge loss
    – Log loss
    – Exponential loss
  • Q: Which of these loss functions is most sensitive to outliers?
SLIDE 12

Casting Linear Classification as an Optimization Problem

min over (w, b):   Σₙ 1[ yₙ (wᵀxₙ + b) ≤ 0 ]  +  λ R(w, b)

  • Indicator function 1[·]: 1 if (·) is true, 0 otherwise; the loss function above is called the 0-1 loss
  • Loss function: measures how well the classifier fits the training data
  • Regularizer: prefers solutions that generalize well
  • Loss plus regularizer together form the objective function

SLIDE 13

The regularizer term

  • Goal: find simple solutions (inductive bias)
  • Ideally, we want most entries of w to be zero, so prediction depends only on a small number of features
  • Formally, we want to minimize the number of nonzero weights: ‖w‖₀ = Σ_d 1[w_d ≠ 0]
  • That's NP-hard, so we use approximations instead
    – E.g., we encourage the w_d's to be small
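A quick illustration of the quantity being minimized (the weight vector is made up):

```python
import numpy as np

w = np.array([0.0, 3.2, 0.0, 0.0, -1.1])   # hypothetical weight vector
print(np.count_nonzero(w))                  # ||w||_0 = 2: only 2 features matter
```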

SLIDE 14

Norm-based Regularizers

  • ℓp norms can be used as regularizers

[Figure: contour plots of the ℓp ball for p = 2, p = 1, and p < 1]

SLIDE 15

Norm-based Regularizers

  • ℓp norms can be used as regularizers
  • Smaller p favors sparse vectors w
    – i.e., most entries of w are close or equal to 0
  • ℓ2 norm: convex, smooth, easy to optimize
  • ℓ1 norm: encourages sparse w; convex, but not smooth at axis points
  • p < 1: the norm becomes non-convex and hard to optimize
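A small sketch comparing the regularizer values these norms assign to a dense versus a sparse weight vector (both vectors are made up); with p ≤ 1 the sparse vector is cheaper, which is how smaller p encourages sparsity:

```python
import numpy as np

def lp_norm(w, p):
    """(sum_d |w_d|^p)^(1/p); a true norm only for p >= 1."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

dense  = np.array([0.5, 0.5, 0.5, 0.5])
sparse = np.array([1.0, 0.0, 0.0, 0.0])

for p in (2, 1, 0.5):
    print(p, lp_norm(dense, p), lp_norm(sparse, p))
# p=2: both cost 1.0; p=1: 2.0 vs 1.0; p=0.5: 8.0 vs 1.0
```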
SLIDE 16

Casting Linear Classification as an Optimization Problem

min over (w, b):   Σₙ 1[ yₙ (wᵀxₙ + b) ≤ 0 ]  +  λ R(w, b)

  • Indicator function 1[·]: 1 if (·) is true, 0 otherwise; the loss function above is called the 0-1 loss
  • Loss function: measures how well the classifier fits the training data
  • Regularizer: prefers solutions that generalize well
  • Loss plus regularizer together form the objective function

SLIDE 17

What is the perceptron optimizing?

  • Loss function is a variant of the hinge loss
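The variant is typically written as max(0, −yₙ(w·xₙ + b)): the hinge loss with the margin requirement removed. A sketch of this loss and the mistake-driven update that follows from its subgradient (the function names are mine, not the slides'):

```python
import numpy as np

def perceptron_loss(w, b, x, y):
    """0 when (x, y) is classified correctly; grows linearly when it is not."""
    return max(0.0, -y * (np.dot(w, x) + b))

def perceptron_update(w, b, x, y, eta=1.0):
    """One subgradient step: parameters change only on a mistake."""
    if y * (np.dot(w, x) + b) <= 0:
        w = w + eta * y * x
        b = b + eta * y
    return w, b
```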
SLIDE 18

Recap: Linear Models

  • General framework for binary classification
  • Cast learning as optimization problem
  • Optimization objective combines 2 terms
    – Loss function: measures how well classifier fits training data
    – Regularizer: measures how simple classifier is
  • Does not assume data is linearly separable
  • Lets us separate model definition from training algorithm

SLIDE 19

Calculus refresher: Gradients
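Standard definition, stated here for reference (this wording is mine, not the slide's):

```latex
% The gradient of f : R^d -> R collects all partial derivatives of f;
% it points in the direction in which f increases fastest.
\[
  \nabla_w f(w) \;=\;
  \left( \frac{\partial f}{\partial w_1},
         \frac{\partial f}{\partial w_2},
         \dots,
         \frac{\partial f}{\partial w_d} \right)
\]
```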

SLIDE 20

Gradient descent

  • A general solution for our optimization problem
  • Idea: take iterative steps to update the parameters in the direction of the negative gradient
SLIDE 21

Gradient descent algorithm
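The algorithm reduces to a short loop; a minimal sketch, assuming the caller supplies the gradient of the objective F (the step size, iteration count, and example objective below are illustrative):

```python
import numpy as np

def gradient_descent(grad_F, w0, eta=0.1, num_iters=100):
    """Minimize F by repeatedly stepping against its gradient.

    grad_F:    function returning the gradient of F at a point w
    w0:        initial parameter vector
    eta:       step size (learning rate)
    num_iters: number of gradient steps to take
    """
    w = w0.copy()
    for _ in range(num_iters):
        w = w - eta * grad_F(w)   # move in the direction of the negative gradient
    return w

# Hypothetical usage: F(w) = ||w - 3||^2 has gradient 2(w - 3),
# so gradient descent should drive every coordinate toward 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), np.zeros(2))
print(w_star)   # close to [3.0, 3.0]
```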

SLIDE 22

Recap: Linear Models

  • General framework for binary classification
  • Cast learning as optimization problem
  • Optimization objective combines 2 terms
    – Loss function: measures how well classifier fits training data
    – Regularizer: measures how simple classifier is
  • Does not assume data is linearly separable
  • Lets us separate model definition from training algorithm (Gradient Descent)