SLIDE 1

Machine Learning

Learning as Loss Minimization


SLIDE 2

Learning as loss minimization

  • The setup

– Examples x drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f

  • The ideal situation

– Define a function L that penalizes bad hypotheses
– Learning: Pick a function h ∈ H to minimize expected loss

But the distribution D is unknown!

  • Instead, minimize empirical loss on the training set
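In symbols, this is the distinction between expected and empirical loss. A standard formulation (the training-set notation S = {(xᵢ, yᵢ)} of size m is an assumption here, not taken from the slide):

```latex
% Ideal: minimize expected loss over the unknown distribution D
h^* = \arg\min_{h \in H} \; \mathbb{E}_{x \sim D}\!\left[ L\big(h(x), f(x)\big) \right]

% Practical: minimize empirical loss on a training set S = \{(x_i, y_i)\}_{i=1}^{m}
\hat{h} = \arg\min_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L\big(h(x_i), y_i\big)
```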

SLIDE 7

Empirical loss minimization

Learning = minimize empirical loss on the training set


Is there a problem here?

SLIDE 8

Empirical loss minimization

Learning = minimize empirical loss on the training set

Is there a problem here? Overfitting!

We need something that biases the learner towards simpler hypotheses

  • Achieved using a regularizer, which penalizes complex hypotheses

SLIDE 9

Regularized loss minimization

  • Learning: minimize the regularized empirical loss (formulated below)
  • With linear classifiers: minimize over the weight vector w
  • What is a loss function?

– Loss functions should penalize mistakes
– We are minimizing average loss over the training data

  • What is the ideal loss function for classification?

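A standard way to write the two objectives the first bullets refer to (λ for the regularization strength and R for the regularizer are notation assumed here):

```latex
% Learning: minimize regularized empirical loss over the hypothesis space
\min_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L\big(h(x_i), y_i\big) \; + \; \lambda \, R(h)

% With linear classifiers h(x) = sgn(w^T x): minimize over the weight vector w
\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} L\big(y_i, \, w^T x_i\big) \; + \; \lambda \, R(w)
```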


SLIDE 12

The 0-1 loss

Penalize classification mistakes between true label y and prediction y’

  • For linear classifiers, the prediction is y′ = sgn(wᵀx)

– Mistake if y wᵀx ≤ 0

Minimizing the 0-1 loss directly is intractable; we need surrogate losses

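As a minimal sketch (NumPy assumed; the function name and array conventions are illustrative), the 0-1 loss of a linear classifier:

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Average 0-1 loss of the linear classifier sgn(w^T x).

    w: weight vector of shape (d,)
    X: examples of shape (m, d)
    y: labels in {-1, +1} of shape (m,)
    """
    margins = y * (X @ w)          # y * w^T x for every example
    return np.mean(margins <= 0)   # mistake whenever y * w^T x <= 0
```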

SLIDE 13

The 0-1 loss


[Plot: the 0-1 loss as a function of y wᵀx. Zero loss when y wᵀx > 0 (no misclassification); loss of 1 when y wᵀx < 0 (misclassification)]

SLIDE 14

Compare to the hinge loss


[Plot: the hinge loss overlaid on the 0-1 loss, as a function of y wᵀx]

  • Penalizes predictions even if they are correct, but too close to the margin
  • More penalty as wᵀx gets farther away from the separator on the wrong side
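The hinge loss in the same illustrative style; unlike the 0-1 loss, it is nonzero even for correct predictions whose margin y wᵀx falls below 1:

```python
import numpy as np

def hinge_loss(w, X, y):
    """Average hinge loss max(0, 1 - y * w^T x) of a linear classifier."""
    margins = y * (X @ w)
    # Zero only when the example is correct with margin >= 1;
    # grows linearly as w^T x moves to the wrong side of the separator.
    return np.mean(np.maximum(0.0, 1.0 - margins))
```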

SLIDE 15

Support Vector Machines

  • SVM = linear classifier combined with regularization
  • Ideally, we would like to minimize 0-1 loss,

– But we can’t for computational reasons

  • SVM minimizes hinge loss

– Variants exist


SLIDE 17

SVM objective function


Regularization term:

  • Maximizes the margin
  • Imposes a preference over the hypothesis space and pushes for better generalization
  • Can be replaced with other regularization terms which impose other preferences

Empirical loss:

  • Hinge loss
  • Penalizes weight vectors that make mistakes
  • Can be replaced with other loss functions which impose other preferences

A hyper-parameter controls the tradeoff between a large margin and a small hinge loss
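Putting the two terms together, with C as the tradeoff hyper-parameter, the standard soft-margin SVM objective is:

```latex
\min_{w} \;
\underbrace{\tfrac{1}{2}\,\|w\|^2}_{\text{regularization (max margin)}}
\; + \;
C \sum_{i=1}^{m} \underbrace{\max\!\big(0,\; 1 - y_i\, w^T x_i\big)}_{\text{hinge loss on example } i}
```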

SLIDE 18

The loss function zoo

Many loss functions exist

– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)

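A sketch of the zoo as functions of the margin z = y wᵀx (these are common textbook forms; exact constants and scalings vary across sources):

```python
import numpy as np

# Each loss as a function of the margin z = y * w^T x.
def zero_one(z):    return (z <= 0).astype(float)      # what we'd like to minimize
def perceptron(z):  return np.maximum(0.0, -z)         # Perceptron
def hinge(z):       return np.maximum(0.0, 1.0 - z)    # SVM
def exponential(z): return np.exp(-z)                  # AdaBoost
def logistic(z):    return np.log(1.0 + np.exp(-z))    # logistic regression

# The surrogates are convex (hence tractable to minimize), unlike the 0-1 loss.
z = np.linspace(-2.0, 2.0, 5)
for name, loss in [("zero-one", zero_one), ("perceptron", perceptron),
                   ("hinge", hinge), ("exponential", exponential),
                   ("logistic", logistic)]:
    print(f"{name:12s}", np.round(loss(z), 3))
```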

SLIDES 19-24

The loss function zoo

[Plot, built up across these slides, of each loss as a function of y wᵀx: zero-one, hinge (SVM), perceptron, exponential (AdaBoost), and logistic regression]

SLIDE 25

Learning via Loss Minimization: Summary

  • Learning via Loss Minimization

– Write down a loss function
– Minimize empirical loss

  • Regularize to avoid overfitting

– Neural networks use other strategies such as dropout

  • Widely applicable, with different loss functions and regularizers
