

SLIDE 1

Introduction to Machine Learning CMU-10701

  • 10. Risk Minimization

Barnabás Póczos

SLIDE 2

  • 10. Risk Minimization

SLIDE 3

What have we seen so far?


Several algorithms that seem to work fine on training datasets:

  • Linear regression
  • Naïve Bayes classifier
  • Perceptron
  • Support Vector Machines

  • How good are these algorithms on unknown test sets?
  • How many training samples do we need to achieve small error?
  • What is the smallest possible error we can achieve?

⇒ Learning Theory

SLIDE 4

Outline

  • Risk and loss

    – Loss functions
    – Risk
    – Empirical risk vs. true risk
    – Empirical risk minimization

  • Underfitting and Overfitting
  • Classification
  • Regression


SLIDE 5

Supervised Learning Setup

Generative model of the data (train and test data): the pairs $(X_i, Y_i)$ are drawn i.i.d. from an unknown distribution $P(X, Y)$.

Regression: $Y \in \mathbb{R}$. Classification: $Y \in \{0, 1\}$.

SLIDE 6

Loss

Loss function: $L(f(x), y) \ge 0$.

It measures how good we are on a particular $(x, y)$ pair.

SLIDE 7

Loss Examples

Classification loss: $L(f(x), y) = \mathbf{1}\{f(x) \neq y\}$

L2 loss for regression: $L(f(x), y) = (f(x) - y)^2$

L1 loss for regression: $L(f(x), y) = |f(x) - y|$

Example (regression): predict house prices.

Minimizing the expected L2 loss ⇒ mean of $p(y \mid x)$; minimizing the expected L1 loss ⇒ median of $p(y \mid x)$.
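To see where the mean and median claims come from, here is the standard pointwise argument for the L2 case (a sketch; the L1/median case follows similarly via a subgradient argument):

```latex
% Minimize the expected L2 loss pointwise at x over the prediction c:
\mathbb{E}\left[ (Y - c)^2 \mid X = x \right]
  = \mathbb{E}\left[ Y^2 \mid X = x \right]
    - 2c \, \mathbb{E}\left[ Y \mid X = x \right] + c^2 .
% Setting the derivative in c to zero gives the conditional mean:
\frac{\partial}{\partial c} \, \mathbb{E}\left[ (Y - c)^2 \mid X = x \right]
  = 2c - 2 \, \mathbb{E}\left[ Y \mid X = x \right] = 0
  \;\Longrightarrow\; c^* = \mathbb{E}\left[ Y \mid X = x \right].
```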

SLIDE 8

Squared (L2) loss: $L(f(x), y) = (f(x) - y)^2$

[Figure: plot of the squared loss. Picture from Alex]

SLIDE 9

L1 loss: $L(f(x), y) = |f(x) - y|$

[Figure: plot of the L1 loss. Picture from Alex]

SLIDE 10

ε-insensitive loss: $L(f(x), y) = \max(0, |f(x) - y| - \varepsilon)$

[Figure: plot of the ε-insensitive loss. Picture from Alex]

SLIDE 11

Huber’s robust loss: quadratic for small residuals, linear for large ones. With residual $r = f(x) - y$:

$L(f(x), y) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \frac{\delta}{2} \right) & \text{otherwise} \end{cases}$

[Figure: plot of Huber’s loss. Picture from Alex]
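For concreteness, a minimal NumPy sketch of the four regression losses above (the function names and the default parameters eps=0.1 and delta=1.0 are our own choices, not from the slides):

```python
import numpy as np

def l2_loss(pred, y):
    """Squared (L2) loss: (f(x) - y)^2."""
    return (pred - y) ** 2

def l1_loss(pred, y):
    """Absolute (L1) loss: |f(x) - y|."""
    return np.abs(pred - y)

def eps_insensitive_loss(pred, y, eps=0.1):
    """epsilon-insensitive loss: zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(pred - y) - eps)

def huber_loss(pred, y, delta=1.0):
    """Huber's robust loss: quadratic for |r| <= delta, linear beyond."""
    r = pred - y
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))
```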

SLIDE 12

Risk

Risk of the classification/regression function $f$ = the expected loss:

$R(f) = \mathbb{E}_{(X, Y) \sim P}\left[ L(f(X), Y) \right]$

Why do we care about this?

SLIDE 13

Why do we care about risk?

Risk of the classification/regression function $f$: $R(f) = \mathbb{E}[L(f(X), Y)]$, the expected loss.

Our true goal is to minimize the loss on the test points! Usually we don’t know the test points and their labels in advance…, but by the law of large numbers (LLN) the average loss over $m$ i.i.d. test points converges to the risk:

$\frac{1}{m} \sum_{i=1}^{m} L(f(X_i^{test}), Y_i^{test}) \to R(f) \quad (m \to \infty)$

That is why our goal is to minimize the risk.
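A quick simulation of the LLN statement: the average loss on a growing i.i.d. test sample approaches the risk. The toy model (Y = X + Gaussian noise) and the predictor f(x) = x are illustrative choices, not from the slides; the L2 risk of this predictor is exactly the noise variance, 0.25:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A fixed predictor (illustrative choice); under Y = X + noise,
    # its L2 risk is the noise variance, 0.5**2 = 0.25.
    return x

for m in [10, 100, 10_000, 1_000_000]:
    x = rng.uniform(0, 1, size=m)
    y = x + rng.normal(0, 0.5, size=m)   # toy model: Y = X + Gaussian noise
    avg_loss = np.mean((f(x) - y) ** 2)  # average L2 loss on m test points
    print(f"m = {m:>9}: average test loss = {avg_loss:.4f}  (risk = 0.25)")
```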

SLIDE 14

Risk Examples

Risk: $R(f) = \mathbb{E}[L(f(X), Y)]$, the expected loss.

Classification loss: $L(f(x), y) = \mathbf{1}\{f(x) \neq y\}$. Risk of the classification loss: $R(f) = P(f(X) \neq Y)$, the probability of misclassification.

L2 loss for regression: $L(f(x), y) = (f(x) - y)^2$. Risk of the L2 loss: $R(f) = \mathbb{E}\left[ (f(X) - Y)^2 \right]$.

SLIDE 15

Bayes Risk

Definition: Bayes risk = the smallest possible expected loss,

$R^* = \inf_{f} R(f)$

where we consider all possible functions $f$ here.

We don’t know $P$, but we have i.i.d. training data $D = \{(X_i, Y_i)\}_{i=1}^{n}$ sampled from $P$!

Goal of learning: produce a function whose risk is close to the Bayes risk. The learning algorithm constructs this function $f_D$ from the training data.

SLIDE 16

Consistency of learning methods

The risk is a random variable: $R(f_D)$ depends on the random training set $D$.

Definition: a learning method is (universally) consistent if $R(f_D) \to R^*$ as the sample size $n \to \infty$, for every distribution $P(X, Y)$.

Stone’s theorem (1977): Many classification and regression algorithms are universally consistent for certain loss functions under certain conditions: kNN, Parzen kernel regression, SVM, …

Yayyy!!!

Wait! This doesn’t tell us anything about the rates…

SLIDE 17

No Free Lunch!

Devroye (1982): For every consistent learning method and for every fixed convergence rate $a_n$, there exists a distribution $P(X, Y)$ such that the convergence rate of this learning method on $P(X, Y)$-distributed data is slower than $a_n$.

What can we do now?

SLIDE 18

What do we mean by rate?

Notation (stochastic rate, stochastic little o and big O):

Definition (stochastically bounded): $X_n = O_P(a_n)$ if for every $\varepsilon > 0$ there is an $M > 0$ such that $\sup_n P\left( |X_n / a_n| > M \right) < \varepsilon$. Similarly, $X_n = o_P(a_n)$ if $X_n / a_n \to 0$ in probability.

Example (CLT): if $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and finite variance, then $\bar{X}_n - \mu = O_P(n^{-1/2})$.

SLIDE 19

Empirical Risk and True Risk

SLIDE 20

Empirical Risk

True risk of $f$ (deterministic): $R(f) = \mathbb{E}[L(f(X), Y)]$. Bayes risk: $R^* = \inf_f R(f)$.

We cannot compute the true risk, because $P$ is unknown; let us use its empirical counterpart instead.

Empirical risk: $\widehat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i)$, the average loss on the training set $D = \{(X_i, Y_i)\}_{i=1}^{n}$.

Shorthand: we write $\widehat{R}_n(f)$ for the empirical risk of $f$.

SLIDE 21

Empirical Risk Minimization

Empirical risk minimization (ERM): $\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f)$

Law of Large Numbers: for each fixed $f$, $\widehat{R}_n(f) \to R(f)$ almost surely as $n \to \infty$.

The empirical risk converges to the true risk, so minimizing the empirical risk is a natural surrogate for minimizing the (unknown) true risk.
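A minimal sketch of ERM over a small finite class, assuming a toy data model of our own (threshold classifiers on [0, 1]; the labels are flipped with probability 0.1, so the Bayes risk here is 0.1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (our choice): X ~ Uniform[0,1], Y = 1{X > 0.3}, with labels
# flipped with probability 0.1, so the Bayes risk is 0.1.
n = 200
X = rng.uniform(0, 1, size=n)
flips = rng.random(n) < 0.1
Y = ((X > 0.3).astype(int) + flips.astype(int)) % 2

# Finite class: threshold classifiers f_t(x) = 1{x > t} on a grid of t's.
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """Empirical 0/1 risk of f_t on the training sample."""
    return np.mean((X > t).astype(int) != Y)

# ERM: pick the classifier in the class with the smallest empirical risk.
risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold = {t_hat:.2f}, empirical risk = {risks.min():.3f}")
```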

SLIDE 22

Overfitting in Classification with ERM

Generative model:

Bayes classifier:

Bayes risk:

[Figure: picture from David Pal]

SLIDE 23

Overfitting in Classification with ERM

Function class: $n$-order thresholded polynomials. With a high enough order, ERM drives the empirical risk to zero on the training set, while the true risk stays far above the Bayes risk.

[Figure: picture from David Pal]

SLIDE 24

Overfitting in Regression with ERM

Is the following predictor a good one?

  • What is its empirical risk (performance on the training data)? Zero!
  • What about its true risk? Greater than zero.
  • It will predict very poorly on a new random test point: large generalization error!

SLIDE 25

Overfitting in Regression

If we allow very complicated predictors, we could overfit the training data.

Examples: regression with polynomials of degree $k-1$ ($k$ parameters): $k=1$ constant, $k=2$ linear, $k=3$ quadratic, $k=7$ sixth order.

[Figure: four panels of polynomial fits for k = 1, 2, 3, 7 (constant, linear, quadratic, 6th order)]
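A sketch of the polynomial-overfitting experiment in the spirit of the figure; the noisy sine data are our own illustrative choice, and numpy.polyfit fits the degree-(k-1) polynomial for each k:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(m):
    # Noisy sine data (illustrative choice, in the spirit of the figure).
    x = rng.uniform(0, 1, size=m)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=m)
    return x, y

x_train, y_train = sample(8)       # a small training set invites overfitting
x_test, y_test = sample(1000)

for k in [1, 2, 3, 7]:             # k parameters = polynomial of degree k - 1
    coeffs = np.polyfit(x_train, y_train, deg=k - 1)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"k = {k}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

With only 8 training points, the 6th-order fit drives the training error toward zero while the test error blows up, exactly the ERM overfitting effect described above.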

SLIDE 26

Solutions to Overfitting

SLIDE 27

Solutions to Overfitting: Structural Risk Minimization

Notation: true risk $R(f)$, empirical risk $\widehat{R}_n(f)$, function class $\mathcal{F}$.

1st issue: if the function class $\mathcal{F}$ is too small, even the best $f \in \mathcal{F}$ can have a risk far above the Bayes risk (model error, approximation error).

Solution: Structural Risk Minimization (SRM): minimize the empirical risk over a nested sequence of function classes $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ whose complexity grows with the sample size, trading empirical risk against class complexity.

SLIDE 28

Approximation error, Estimation error, PAC framework

Risk of the classifier $f_D$ relative to the Bayes risk $R^*$:

$R(f_D) - R^* = \underbrace{R(f_D) - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}} R(f) - R^*}_{\text{approximation error}}$

Probably Approximately Correct (PAC) learning framework: bound the estimation error with high probability, i.e. show $P\left( R(f_D) - \inf_{f \in \mathcal{F}} R(f) > \varepsilon \right) \le \delta$.

SLIDE 29

Solution to Overfitting

2nd issue: the empirical risk with the 0/1 classification loss is difficult to minimize directly (non-convex, non-smooth).

Solution: replace the classification loss with a convex surrogate, e.g. the hinge loss or the quadratic loss (next slide).

SLIDE 30

Approximation with the Hinge loss and quadratic loss

With labels $y \in \{-1, +1\}$: hinge loss $L(f(x), y) = \max(0, 1 - y f(x))$; quadratic loss $L(f(x), y) = (1 - y f(x))^2$. Both are convex upper bounds on the 0/1 loss.

[Figure: the 0/1 loss together with its hinge and quadratic approximations. Picture is taken from R. Herbrich]

SLIDE 31

Effect of Model Complexity

If we allow very complicated predictors, we could overfit the training data: beyond some complexity, the empirical risk is no longer a good indicator of the true risk.

[Figure: prediction error vs. model complexity for a fixed number of training data; the prediction error on the training data keeps decreasing, while the true risk eventually rises]

SLIDE 32

Underfitting

Bayes risk = 0.1

SLIDE 33

Underfitting

Best linear classifier:

The empirical risk of the best linear classifier:

SLIDE 34

Underfitting

Best quadratic classifier:

Its risk is the same as the Bayes risk ⇒ good fit!

SLIDE 35

Classification using the classification loss

SLIDE 36

The Bayes Classifier

Let $\eta(x) = P(Y = 1 \mid X = x)$. The Bayes classifier is $g^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\}$.

Lemma I: $R(g^*) = \mathbb{E}\left[ \min\left( \eta(X), 1 - \eta(X) \right) \right]$

Lemma II: for any classifier $g$, $R(g) - R(g^*) = \mathbb{E}\left[ \left| 2\eta(X) - 1 \right| \, \mathbf{1}\{ g(X) \neq g^*(X) \} \right] \ge 0$

SLIDE 37

Proofs

Lemma I: trivial from the definition. Lemma II: a surprisingly long calculation.
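For reference, the “trivial from definition” calculation behind Lemma I in its standard form (using $\eta(x) = P(Y = 1 \mid X = x)$ as above):

```latex
% Condition on X and expand the 0/1 risk of an arbitrary classifier g:
R(g) = P\left( g(X) \neq Y \right)
     = \mathbb{E}\left[ \eta(X) \, \mathbf{1}\{g(X) = 0\}
                      + (1 - \eta(X)) \, \mathbf{1}\{g(X) = 1\} \right].
% Pointwise, the integrand is minimized by predicting the more likely label,
% i.e. by g^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\}, which yields Lemma I:
R(g^*) = \mathbb{E}\left[ \min\left( \eta(X), \, 1 - \eta(X) \right) \right].
```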

SLIDE 38

The Bayes Classifier

We will need these definitions, please copy them!

  • $f_n = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f)$: this is what the learning algorithm (ERM) produces
  • $f_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} R(f)$: the best classifier in the class $\mathcal{F}$
  • $f^* = \arg\min_{f} R(f)$: the Bayes classifier, with Bayes risk $R^* = R(f^*)$

SLIDE 39

The Bayes Classifier

Theorem I:

$R(f_n) - \inf_{f \in \mathcal{F}} R(f) \le 2 \sup_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right|$

($R(f_n)$: the true risk of what the learning algorithm produces; $f_n$: what the learning algorithm produces.)

SLIDE 40

The Bayes Classifier

Theorem II:

$R(f_n) - R^* = \left( R(f_n) - \inf_{f \in \mathcal{F}} R(f) \right) + \left( \inf_{f \in \mathcal{F}} R(f) - R^* \right)$

(estimation error plus approximation error; $f_n$ is what the learning algorithm produces.)

SLIDE 41

Proofs

Theorem I: not-so-long calculations. Theorem II: trivial.

Main message: it’s enough to derive upper bounds for $\sup_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right|$.

Corollary: any high-probability upper bound on this supremum yields a high-probability upper bound on the estimation error $R(f_n) - \inf_{f \in \mathcal{F}} R(f)$.
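The “not-so-long calculation” behind a bound of Theorem I’s type is usually this three-term telescoping argument (a sketch, assuming Theorem I is the standard ERM deviation bound):

```latex
R(f_n) - R(f_{\mathcal{F}})
  = \underbrace{R(f_n) - \widehat{R}_n(f_n)}_{\le \, \sup_{f} |\widehat{R}_n(f) - R(f)|}
  + \underbrace{\widehat{R}_n(f_n) - \widehat{R}_n(f_{\mathcal{F}})}_{\le \, 0 \ \text{(ERM minimizes } \widehat{R}_n\text{)}}
  + \underbrace{\widehat{R}_n(f_{\mathcal{F}}) - R(f_{\mathcal{F}})}_{\le \, \sup_{f} |\widehat{R}_n(f) - R(f)|}
  \;\le\; 2 \sup_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| .
```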

SLIDE 42

Illustration of the Risks

SLIDE 43

Let us see why we have learned the tail bounds!

It’s enough to derive upper bounds for $\sup_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right|$.

SLIDE 44

Hoeffding’s inequality (1963)

Let $Z_1, \dots, Z_n$ be independent with $Z_i \in [a_i, b_i]$ almost surely. Then for all $\varepsilon > 0$,

$P\left( \left| \frac{1}{n}\sum_{i=1}^{n} (Z_i - \mathbb{E} Z_i) \right| \ge \varepsilon \right) \le 2 \exp\left( - \frac{2 n^2 \varepsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)$

Special case ($Z_i \in [0, 1]$): $P\left( \left| \bar{Z}_n - \mathbb{E} \bar{Z}_n \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}$
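A quick numerical sanity check of the special case via simulation (the parameters p, n, ε and the trial count are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative parameters (our choice): Bernoulli(p) variables, n per trial.
p, n, eps, trials = 0.3, 100, 0.1, 100_000

# Empirical frequency of a large deviation of the sample mean from p.
means = rng.binomial(n, p, size=trials) / n
empirical = np.mean(np.abs(means - p) >= eps)

# Hoeffding's bound for [0, 1]-valued variables: 2 exp(-2 n eps^2).
bound = 2 * np.exp(-2 * n * eps ** 2)
print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {bound:.4f}")
```

The empirical tail probability comes out well below the bound, as it must: Hoeffding holds for every distribution on [0, 1], so it is rarely tight for any particular one.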

SLIDE 45

Binomial distributions

Our goal is to bound $\left| \widehat{R}_n(f) - R(f) \right|$ for a fixed classifier $f$.

With the 0/1 loss, the losses $L(f(X_i), Y_i)$ are i.i.d. Bernoulli($p$) with $p = R(f)$, so $n \widehat{R}_n(f)$ is Binomial($n$, $p$).

Therefore, from Hoeffding we have:

$P\left( \left| \widehat{R}_n(f) - R(f) \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}$

Yuppie!!!

SLIDE 46

Inversion

From Hoeffding we have: $P\left( \left| \widehat{R}_n(f) - R(f) \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}$.

Set $\delta = 2 e^{-2 n \varepsilon^2}$ and solve for $\varepsilon$. Therefore, with probability at least $1 - \delta$,

$\left| \widehat{R}_n(f) - R(f) \right| \le \sqrt{\frac{\ln(2/\delta)}{2n}}$

SLIDE 47

Union Bound

Our goal is to bound the worst-case deviation over a finite class $\mathcal{F} = \{f_1, \dots, f_N\}$. We already know, for each fixed $f$: $P\left( |\widehat{R}_n(f) - R(f)| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}$.

Theorem [tail bound on the ‘deviation’ in the worst case]:

$P\left( \max_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| \ge \varepsilon \right) \le 2 N e^{-2 n \varepsilon^2}$

Proof: apply the union bound over the $N$ classifiers.

Note: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier $f$ is the furthest from its true risk.

SLIDE 48

Inversion of Union Bound

We already know: $P\left( \max_{f \in \mathcal{F}} |\widehat{R}_n(f) - R(f)| \ge \varepsilon \right) \le 2 N e^{-2 n \varepsilon^2}$.

Therefore, with probability at least $1 - \delta$,

$\max_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| \le \sqrt{\frac{\ln(2N/\delta)}{2n}}$

SLIDE 49

Inversion of Union Bound

  • The larger the N, the looser the bound.
  • This result is distribution-free: true for all P(X,Y) distributions.
  • It is useless if N is big or infinite… (e.g. all possible hyperplanes)

We will see later how to fix that. (Hint: We haven’t used McDiarmid yet)
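Meanwhile, the inverted bound is easy to use numerically; a small helper sketch (our own wrapper functions, not from the lecture):

```python
import numpy as np

def deviation_bound(n, N, delta):
    """With prob. >= 1 - delta: max_f |R_hat_n(f) - R(f)| <= returned value."""
    return np.sqrt(np.log(2 * N / delta) / (2 * n))

def samples_needed(eps, N, delta):
    """Smallest n making the deviation bound <= eps (by inverting it)."""
    return int(np.ceil(np.log(2 * N / delta) / (2 * eps ** 2)))

print(deviation_bound(n=1000, N=100, delta=0.05))   # about 0.064
print(samples_needed(eps=0.05, N=100, delta=0.05))  # 1659
```

Note the mild dependence on N: it enters only through a logarithm, so a much bigger class costs relatively few extra samples.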

SLIDE 50

The Expected Error

Our goal is to bound the expected worst-case deviation: $\mathbb{E}\left[ \max_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| \right]$.

We already know the tail bound (concentration inequality): $P\left( \max_{f \in \mathcal{F}} |\widehat{R}_n(f) - R(f)| \ge \varepsilon \right) \le 2 N e^{-2 n \varepsilon^2}$.

Theorem [expected ‘deviation’ in the worst case]:

$\mathbb{E}\left[ \max_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| \right] \le \sqrt{\frac{\ln(2N)}{2n}}$

Proof: integrate the tail bound we already know (from that we actually get a bit weaker inequality… oh well).

Note: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier $f$ is the furthest from its true risk.
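The “bit weaker inequality” remark refers to obtaining the expectation by integrating the tail bound; a sketch of that route, writing $Z$ for the worst-case deviation:

```latex
% Z := max_{f in F} |R_hat_n(f) - R(f)| has tail P(Z >= t) <= 2N e^{-2 n t^2}.
\mathbb{E}[Z] = \int_0^\infty P(Z \ge t) \, dt
  \le u + \int_u^\infty 2N e^{-2 n t^2} \, dt
  \le u + \frac{2N \, e^{-2 n u^2}}{4 n u}
  \qquad \text{(for any } u > 0\text{)},
% and the choice u = \sqrt{\ln(2N)/(2n)} makes 2N e^{-2 n u^2} = 1, so
\mathbb{E}[Z] \le \sqrt{\frac{\ln(2N)}{2n}} + \frac{1}{2\sqrt{2 n \ln(2N)}} .
```

The extra additive term is what makes this route slightly weaker than the clean $\sqrt{\ln(2N)/(2n)}$ in the theorem statement.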

SLIDE 51

Thanks for your attention!