Regression and generalization, CE-717: Machine Learning, Sharif University of Technology (PowerPoint presentation)



SLIDE 1

Regression and generalization

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2018

SLIDE 2

Topics

• Beyond linear regression models
• Evaluation & model selection
• Regularization
• Bias-variance

SLIDE 3

Recall: Linear regression (squared loss)

• Linear regression functions:

$f : \mathbb{R} \to \mathbb{R}$, $\quad f(x; \mathbf{w}) = w_0 + w_1 x$

$f : \mathbb{R}^d \to \mathbb{R}$, $\quad f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$

• Minimizing the squared loss for linear regression:

$J(\mathbf{w}) = \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert^2$

• We obtain

$\widehat{\mathbf{w}} = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{y}$

$\mathbf{w} = (w_0, w_1, \dots, w_d)^\top$ are the parameters we need to set.

SLIDE 4

Beyond linear regression

• How to extend linear regression to non-linear functions?

• Transform the data using basis functions
• Learn a linear regression on the new feature vectors (obtained by the basis functions)

SLIDE 5

Beyond linear regression

• $m$-th order polynomial regression (univariate $f : \mathbb{R} \to \mathbb{R}$):

$f(x; \mathbf{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$

• Solution: $\widehat{\mathbf{w}} = \left( \mathbf{X}'^\top \mathbf{X}' \right)^{-1} \mathbf{X}'^\top \mathbf{y}$

$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix} \qquad \mathbf{X}' = \begin{bmatrix} 1 & x^{(1)} & (x^{(1)})^2 & \cdots & (x^{(1)})^m \\ 1 & x^{(2)} & (x^{(2)})^2 & \cdots & (x^{(2)})^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x^{(N)} & (x^{(N)})^2 & \cdots & (x^{(N)})^m \end{bmatrix} \qquad \widehat{\mathbf{w}} = \begin{bmatrix} \widehat{w}_0 \\ \widehat{w}_1 \\ \vdots \\ \widehat{w}_m \end{bmatrix}$
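Building the polynomial design matrix $\mathbf{X}'$ and solving the least-squares problem can be sketched as follows; the cubic test function is an illustrative assumption, not from the slides:

```python
import numpy as np

def fit_polynomial(x, y, m):
    """Least-squares fit of an m-th order polynomial via the design matrix."""
    # Columns 1, x, x^2, ..., x^m (np.vander with increasing=True).
    X = np.vander(x, m + 1, increasing=True)
    # lstsq is numerically safer than inverting X^T X directly.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Data generated exactly from a cubic, so the fit should recover it.
x = np.linspace(-1, 1, 30)
y = 1 - 2 * x + 0.5 * x**3
print(fit_polynomial(x, y, 3))  # close to [1, -2, 0, 0.5]
```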

SLIDE 6

Polynomial regression: example

[Fits of polynomials with $m = 1$, $m = 3$, $m = 5$, and $m = 7$ to the same data.]

SLIDE 7

Generalized linear

• Linear combination of fixed non-linear functions of the input vector:

$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \dots + w_m \phi_m(\mathbf{x})$

$\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_j(\mathbf{x}) : \mathbb{R}^d \to \mathbb{R}$

SLIDE 8

Basis functions: examples

• Linear
• Polynomial (univariate)

SLIDE 9

Basis functions: examples

• Gaussian: $\phi_j(\mathbf{x}) = \exp\left( -\dfrac{\lVert \mathbf{x} - \boldsymbol{c}_j \rVert^2}{2\sigma_j^2} \right)$

• Sigmoid: $\phi_j(\mathbf{x}) = \sigma\left( \dfrac{\lVert \mathbf{x} - \boldsymbol{c}_j \rVert}{\sigma_j} \right)$, where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$

SLIDE 10

Radial Basis Functions: prototypes

• Predictions based on similarity to "prototypes":

$\phi_j(\mathbf{x}) = \exp\left( -\dfrac{1}{2\sigma_j^2} \lVert \mathbf{x} - \boldsymbol{c}_j \rVert^2 \right)$

• Measuring the similarity to the prototypes $\boldsymbol{c}_1, \dots, \boldsymbol{c}_m$
• $\sigma_j^2$ controls how quickly the basis function vanishes as a function of the distance to the prototype.
• The training examples themselves could serve as prototypes.
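A minimal sketch of regression with Gaussian RBF features; the sine target, the 10 evenly spaced prototypes, and the width 0.7 are illustrative assumptions, not tuned values from the slides:

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Gaussian RBF features phi_j(x) = exp(-(x - c_j)^2 / (2 sigma^2)) for scalar inputs."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))

# Fit a noisy sine with a linear model on RBF features.
rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + 0.05 * rng.standard_normal(50)

centers = np.linspace(0, 2 * np.pi, 10)   # prototypes c_1, ..., c_10
Phi = np.column_stack([np.ones_like(x), rbf_features(x, centers, sigma=0.7)])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

pred = Phi @ w
print(np.max(np.abs(pred - np.sin(x))))   # small residual against the true curve
```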

SLIDE 11

Generalized linear: optimization

$J(\mathbf{w}) = \sum_{i=1}^{N} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2 = \sum_{i=1}^{N} \left( y^{(i)} - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2$

$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix} \qquad \boldsymbol{\Phi} = \begin{bmatrix} 1 & \phi_1(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)}) \\ \vdots & \vdots & & \vdots \\ 1 & \phi_1(\mathbf{x}^{(N)}) & \cdots & \phi_m(\mathbf{x}^{(N)}) \end{bmatrix} \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$

$\widehat{\mathbf{w}} = \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$

SLIDE 12

Model complexity and overfitting

• With limited training data, models may achieve zero training error but a large test error.

• Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
• The model fails to generalize to unseen examples.

Training (empirical) loss: $\dfrac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2 \approx 0$

Expected (true) loss: $\mathbb{E}_{\mathbf{x}, y}\left[ \left( y - f(\mathbf{x}; \mathbf{w}) \right)^2 \right] \gg 0$

SLIDE 13

Polynomial regression

[Fits of polynomials with $m = 0$, $m = 1$, $m = 3$, and $m = 9$ to the same data. — Bishop]

SLIDE 14

Polynomial regression: training and test error

[Plot: training and test RMSE vs. polynomial order $m$, with $\mathrm{RMSE} = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2}$. — Bishop]
SLIDE 15

Over-fitting causes

• Model complexity
  • E.g., a model with a large number of parameters (degrees of freedom)

• Low number of training data
  • Small data size compared to the complexity of the model

SLIDE 16

Model complexity

• Example: polynomials with larger $m$ become increasingly tuned to the random noise on the target values.

[Fits with $m = 0$, $m = 1$, $m = 3$, and $m = 9$. — Bishop]

SLIDE 17

Number of training data & overfitting

• The over-fitting problem becomes less severe as the size of the training data increases.

[$m = 9$ fits with $N = 15$ and $N = 100$ training points. — Bishop]

SLIDE 18

How to evaluate the learner's performance?

• Generalization error: the true (or expected) error that we would like to optimize.

• Two ways to assess the generalization error:
  • Practical: use a separate data set to test the model
  • Theoretical: law of large numbers
    • Statistical bounds on the difference between training and expected errors

SLIDE 19

Avoiding over-fitting

• Determine a suitable value for model complexity (model selection)
  • Simple hold-out method
  • Cross-validation

• Regularization (Occam's razor)
  • Explicit preference towards simple models
  • Penalize model complexity in the objective function
  • Bayesian approach

SLIDE 20

Evaluation and model selection

• Evaluation:
  • We need to measure how well the learned function predicts the target for unseen examples.

• Model selection:
  • Most of the time we need to select among a set of models
  • Example: polynomials with different degree $m$
  • and thus we need to evaluate these models first.

SLIDE 21

Model Selection

• The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters).

• Hyperparameters are the tunable aspects of the model that the learning algorithm does not select.

This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

SLIDE 22

Model Selection

• Model selection is the process by which we choose the "best" model from among a set of candidates.
  • Assume access to a function capable of measuring the quality of a model.
  • Typically done "outside" the main training algorithm.

• Model selection / hyperparameter optimization is just another form of learning.

This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

SLIDE 23

Simple hold-out: model selection

• Steps:
  • Divide the training data into a training set and a validation set
  • Use only the training set to train a set of models
  • Evaluate each learned model on the validation set:

$J_v(\mathbf{w}) = \dfrac{1}{N_v} \sum_{i \in \text{val}} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$

  • Choose the best model based on the validation set error

• Usually too wasteful of valuable training data:
  • Training data may be limited.
  • On the other hand, a small validation set gives a relatively noisy estimate of performance.
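The steps above can be sketched as follows for polynomial degree selection; the quadratic ground truth, 30% validation fraction, and candidate degrees are illustrative assumptions:

```python
import numpy as np

def holdout_select(x, y, degrees, val_frac=0.3, seed=0):
    """Pick a polynomial degree by simple hold-out validation (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(len(x) * val_frac)
    val, tr = idx[:n_val], idx[n_val:]

    best_deg, best_err = None, np.inf
    for m in degrees:
        # Train on the training split only.
        X_tr = np.vander(x[tr], m + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X_tr, y[tr], rcond=None)
        # Evaluate validation MSE J_v on the held-out split.
        X_val = np.vander(x[val], m + 1, increasing=True)
        err = np.mean((y[val] - X_val @ w) ** 2)
        if err < best_err:
            best_deg, best_err = m, err
    return best_deg

# Quadratic ground truth: hold-out should not pick the underfitting line.
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 40)
y = 1 + x - 2 * x**2 + 0.05 * rng.standard_normal(40)
print(holdout_select(x, y, degrees=[1, 2, 5, 9]))
```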

SLIDE 24

Simple hold out: training, validation, and test sets

• Simple hold-out chooses the model that minimizes the error on the validation set.
• $J_v(\widehat{\mathbf{w}})$ is likely to be an optimistic estimate of the generalization error,
  • since an extra parameter (e.g., the degree of the polynomial) is fit to this set.

• Estimate the generalization error on the test set:
  • The performance of the selected model is finally evaluated on the test set.

[Data split: Training | Validation | Test]

SLIDE 25

Cross-Validation (CV): Evaluation

• $k$-fold cross-validation steps:
  • Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
  • For $j = 1$ to $k$:
    • Choose the $j$-th group as the held-out validation group
    • Train the model on all but the $j$-th group of data
    • Evaluate the model on the held-out group

• The performance scores of the model from the $k$ runs are averaged.
• The average error rate can be considered an estimate of the true performance.

[Diagram: fold assignments for the first, second, ..., $(k-1)$-th, and $k$-th runs.]
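The loop above can be sketched directly; the sine target, noise level, and candidate degrees below are illustrative assumptions:

```python
import numpy as np

def kfold_mse(x, y, m, k=5, seed=0):
    """Average validation MSE of an m-th order polynomial over k folds (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))            # shuffle, then partition into k groups
    folds = np.array_split(idx, k)
    errs = []
    for j in range(k):
        val = folds[j]                       # j-th group held out
        tr = np.concatenate([folds[i] for i in range(k) if i != j])
        X_tr = np.vander(x[tr], m + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X_tr, y[tr], rcond=None)
        X_val = np.vander(x[val], m + 1, increasing=True)
        errs.append(np.mean((y[val] - X_val @ w) ** 2))
    return np.mean(errs)                     # performance averaged over the k runs

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 50)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(50)
for m in [1, 3, 9]:
    print(m, kfold_mse(x, y, m))
```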

SLIDE 26

Cross-Validation (CV): Model Selection

• For each model, we first find the average error by CV.
• The model with the best average performance is selected.

SLIDE 27

Cross-validation: polynomial regression example

• 5-fold CV
• 100 runs
• averaged

$m = 1$: CV MSE $= 0.30$; $\quad m = 3$: CV MSE $= 1.45$; $\quad m = 5$: CV MSE $= 45.44$; $\quad m = 7$: CV MSE $= 31759$

SLIDE 28

Leave-One-Out Cross Validation (LOOCV)

• When data is particularly scarce, use cross-validation with $k = N$.
• Leave-one-out treats each training sample in turn as a test example and all other samples as the training set.

• Used for small datasets,
  • when training data is valuable.
• LOOCV can be time-expensive, as $N$ training steps are required.

SLIDE 29

Regularization

• Add a penalty term to the cost function to discourage the coefficients from reaching large values.

• Ridge regression (weight decay):

$J(\mathbf{w}) = \sum_{i=1}^{N} \left( y^{(i)} - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2 + \lambda \mathbf{w}^\top \mathbf{w}$

$\widehat{\mathbf{w}} = \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$

SLIDE 30

Polynomial order

• Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
• The magnitude of the coefficients typically gets larger as $m$ increases.

[Bishop]

SLIDE 31

Regularization parameter

[Table: fitted coefficients $\widehat{w}_0, \widehat{w}_1, \dots, \widehat{w}_9$ for $m = 9$ with $\ln \lambda = -\infty$ and $\ln \lambda = -18$. — Bishop]

SLIDE 32

Regularization parameter

• Generalization:
  • $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting.

[Bishop]

SLIDE 33

Choosing the regularization parameter

• Consider a set of models with different values of $\lambda$.
• Find $\widehat{\mathbf{w}}$ for each model based on the training data.
• Find $J_v(\widehat{\mathbf{w}})$ (or the cross-validation error) for each model:

$J_v(\mathbf{w}) = \dfrac{1}{N_v} \sum_{i \in \text{val}} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$

• Select the model with the best $J_v(\widehat{\mathbf{w}})$ (or cross-validation error).
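The procedure above can be sketched as a hold-out sweep over $\lambda$; the data, the degree-9 features, and the candidate $\lambda$ grid are illustrative assumptions (and the ridge helper from earlier is repeated so the block is self-contained):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution (repeated helper)."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

def select_lambda(x, y, lambdas, m=9, val_frac=0.3, seed=0):
    """Choose lambda by validation error J_v (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(len(x) * val_frac)
    val, tr = idx[:n_val], idx[n_val:]
    Phi = np.vander(x, m + 1, increasing=True)
    errs = {}
    for lam in lambdas:
        w = ridge_fit(Phi[tr], y[tr], lam)            # fit on training split
        errs[lam] = np.mean((y[val] - Phi[val] @ w) ** 2)  # J_v on validation split
    return min(errs, key=errs.get), errs

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
best, errs = select_lambda(x, y, lambdas=[1e-8, 1e-4, 1e-1, 10.0])
print(best, errs[best])  # heavy shrinkage (lambda = 10) underfits and loses
```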

SLIDE 34

The approximation-generalization trade-off

• A small true error shows that the learned function approximates the target well out of sample.

• More complex $\mathcal{H}$ ⇒ better chance of approximating the target function
• Less complex $\mathcal{H}$ ⇒ better chance of generalizing out of sample

SLIDE 35

Complexity of Hypothesis Space: Example

[Three price-vs-size fits:]
$w_0 + w_1 x$ (less complex $\mathcal{H}$) | $w_0 + w_1 x + w_2 x^2$ | $w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$ (more complex $\mathcal{H}$)

This example has been adapted from Prof. Andrew Ng's slides.

SLIDE 36

Complexity of Hypothesis Space: Example

[The same three price-vs-size fits:]
$w_0 + w_1 x$ (underfitting) | $w_0 + w_1 x + w_2 x^2$ | $w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$ (overfitting)

This example has been adapted from Prof. Andrew Ng's slides.

SLIDE 37

Complexity of Hypothesis Space: Example

[Plot: training and validation error vs. the degree of the polynomial $m$.]

$J_v(\mathbf{w}) = \dfrac{1}{N_v} \sum_{i \in \text{val}} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$

$J_{train}(\mathbf{w}) = \dfrac{1}{N_{train}} \sum_{i \in \text{train}} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$
slide-38
SLIDE 38

Complexity of Hypothesis Space

38

} Less complex โ„‹:

} ๐พ?โ€ขaSG(๐’™

9) โ‰ˆ ๐พ~(๐’™ 9) and ๐พ?โ€ขaSG(๐’™ 9) is very high

} More complex โ„‹:

} ๐พ?โ€ขaSG(๐’™

9) โ‰ช ๐พ~(๐’™ 9) and ๐พ?โ€ขaSG(๐’™ 9) is low

๐‘› egree of polynomial d error ๐พ~(๐’™ 9) ๐พ?โ€ขaSG(๐’™ 9)

slide-39
SLIDE 39

Size of training set

39

๐‘” ๐‘ฆ; ๐’™ = ๐‘ฅ- + ๐‘ฅ/๐‘ฆ + ๐‘ฅ8๐‘ฆ8 (training set size) error ๐‘œ ๐พ~ ๐’™ = 1 ๐‘œ_๐‘ค e ๐‘ง(S) โˆ’ ๐‘” ๐‘ฆ(S); ๐’™

8

  • Sโˆˆ~aโ€“_โ€ขโ‚ฌ?

๐พ?โ€ขaSG ๐’™ = 1 ๐‘œ_๐‘ข๐‘ ๐‘๐‘—๐‘œ e ๐‘ง(S) โˆ’ ๐‘” ๐‘ฆ(S); ๐’™

8

  • Sโˆˆ?โ€ขaSG_โ€ขโ‚ฌ?

๐พ~ ๐พ?โ€ขaSG This slide has been adapted from: Prof. Andrew Ngโ€™s slides

SLIDE 40

Less complex $\mathcal{H}$

$f(x; \mathbf{w}) = w_0 + w_1 x$

If the model is very simple, getting more training data will not (by itself) help much.

[Plot: price-vs-size fits and error curves; $J_v$ and $J_{train}$ both plateau at a high error as the training set size $N$ grows.]

This slide has been adapted from Prof. Andrew Ng's slides.

SLIDE 41

More complex $\mathcal{H}$

$f(x; \mathbf{w}) = w_0 + w_1 x + \dots + w_{10} x^{10}$

For more complex models, getting more training data usually helps.

[Plot: price-vs-size fits and error curves; a gap between $J_v$ and $J_{train}$ that narrows as the training set size $N$ grows.]

This slide has been adapted from Prof. Andrew Ng's slides.

SLIDE 42

Regularization: Example

$f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$

$J(\mathbf{w}) = \dfrac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2 + \lambda \mathbf{w}^\top \mathbf{w}$

[Three price-vs-size fits:]
• Large $\lambda$ (prefers simpler models): $w_1 = w_2 \approx 0$, underfits
• Intermediate $\lambda$: a good fit
• Small $\lambda$ (prefers more complex models): $\lambda = 0$ overfits

This example has been adapted from Prof. Andrew Ng's slides.

SLIDE 43

Theoretical Part

SLIDE 44

Model complexity: Bias-variance trade-off

• Least squares can lead to severe over-fitting if complex models are trained using data sets of limited size.

• A frequentist viewpoint of the model complexity issue is known as the bias-variance trade-off.

SLIDE 45

Formal discussion on bias, variance, and noise

• Best unrestricted regression function
• Noise
• Bias and variance
SLIDE 46

The learning diagram: deterministic target

[Diagram: unknown target $h: \mathcal{X} \to \mathcal{Y}$; training examples $(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(N)}, y^{(N)})$; learned hypothesis $f: \mathcal{X} \to \mathcal{Y}$.]

[Y.S. Abu-Mostafa, 2012]

SLIDE 47

The learning diagram including noisy target

[Diagram: target distribution $P(y|\mathbf{x})$ with deterministic part $h(\mathbf{x})$; input distribution $P(\mathbf{x})$ over the $d$ features; training examples $(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(N)}, y^{(N)})$ drawn from $P(\mathbf{x}, y) = P(\mathbf{x}) P(y|\mathbf{x})$; learned hypothesis $f: \mathcal{X} \to \mathcal{Y}$.]

[Y.S. Abu-Mostafa, 2012]

SLIDE 48

Best unrestricted regression function

• What if we know the joint distribution $P(\mathbf{x}, y)$ and put no constraints on the regression function?

• Cost function: mean squared error

$h^* = \underset{h: \mathbb{R}^d \to \mathbb{R}}{\operatorname{argmin}} \; \mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right]$

$h^*(\mathbf{x}) = \mathbb{E}_{y|\mathbf{x}}[y]$

SLIDE 49

Best unrestricted regression function: Proof

$\mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right] = \int\!\!\int \left( y - h(\mathbf{x}) \right)^2 p(\mathbf{x}, y) \, d\mathbf{x} \, dy$

• Minimize the loss for each $\mathbf{x}$ separately, since $h(\mathbf{x})$ can be chosen independently for each different $\mathbf{x}$:

$\dfrac{\partial \, \mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right]}{\partial h(\mathbf{x})} = -\int 2 \left( y - h(\mathbf{x}) \right) p(\mathbf{x}, y) \, dy = 0$

$\Rightarrow h(\mathbf{x}) = \dfrac{\int y \, p(\mathbf{x}, y) \, dy}{\int p(\mathbf{x}, y) \, dy} = \dfrac{\int y \, p(\mathbf{x}, y) \, dy}{p(\mathbf{x})} = \int y \, p(y|\mathbf{x}) \, dy = \mathbb{E}_{y|\mathbf{x}}[y]$

$\Longrightarrow h^*(\mathbf{x}) = \mathbb{E}_{y|\mathbf{x}}[y]$
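The result can be checked numerically: for squared loss, the best constant prediction for a random target is its mean, which mirrors $h^*(\mathbf{x}) = \mathbb{E}_{y|\mathbf{x}}[y]$ pointwise. The exponential distribution below is an arbitrary illustrative choice:

```python
import numpy as np

# Sample y from some skewed distribution and scan candidate predictions c.
rng = np.random.default_rng(6)
y = rng.exponential(scale=2.0, size=100_000)

candidates = np.linspace(0.0, 5.0, 201)
mse = [np.mean((y - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]
print(best, y.mean())  # the empirical MSE minimizer sits at the sample mean
```

This works because the empirical MSE is $\mathrm{var}(y) + (c - \bar{y})^2$, so the grid point nearest the sample mean always wins.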

SLIDE 50

Error decomposition

$E_{true}\left( f_{\mathcal{D}}(\mathbf{x}) \right) = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - y \right)^2 \right]$

$= \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) + h(\mathbf{x}) - y \right)^2 \right]$

$= \mathbb{E}_{\mathbf{x}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \mathbb{E}_{\mathbf{x}, y}\left[ \left( h(\mathbf{x}) - y \right)^2 \right] + 2\, \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right) \left( h(\mathbf{x}) - y \right) \right]$

• The cross term vanishes: for each $\mathbf{x}$, $f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x})$ does not depend on $y$, and $\mathbb{E}_{y|\mathbf{x}}\left[ h(\mathbf{x}) - y \right] = 0$ since $h(\mathbf{x})$ minimizes the expected loss, with $(\mathbf{x}, y) \sim P$.

SLIDE 51

Error decomposition

$E_{true}\left( f_{\mathcal{D}}(\mathbf{x}) \right) = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - y \right)^2 \right] = \mathbb{E}_{\mathbf{x}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \underbrace{\mathbb{E}_{\mathbf{x}, y}\left[ \left( h(\mathbf{x}) - y \right)^2 \right]}_{\text{noise}}$

• The noise term is the irreducible minimum value of the loss function.

$h(\mathbf{x})$ minimizes the expected loss, $(\mathbf{x}, y) \sim P$.

SLIDE 52

Expectation of true error

$E_{true}\left( f_{\mathcal{D}}(\mathbf{x}) \right) = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - y \right)^2 \right] = \mathbb{E}_{\mathbf{x}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \text{noise}$

$\mathbb{E}_{\mathcal{D}}\left[ \mathbb{E}_{\mathbf{x}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right] = \mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right]$

We now want to focus on $\mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$.

SLIDE 53

The average hypothesis

$\bar{f}(\mathbf{x}) \equiv \mathbb{E}_{\mathcal{D}}\left[ f_{\mathcal{D}}(\mathbf{x}) \right]$

$\bar{f}(\mathbf{x}) \approx \dfrac{1}{L} \sum_{l=1}^{L} f_{\mathcal{D}^{(l)}}(\mathbf{x})$

$L$ training sets (each of size $N$) sampled from $P(\mathbf{x}, y)$: $\mathcal{D}^{(1)}, \mathcal{D}^{(2)}, \dots, \mathcal{D}^{(L)}$

SLIDE 54

Using the average hypothesis

$\mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] = \mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x}) + \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$

$= \mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 + \left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2 + 2 \left( f_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)\left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right) \right]$

$= \mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right] + \left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2$

SLIDE 55

Bias and variance

$\mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] = \underbrace{\mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right]}_{\mathrm{var}(\mathbf{x})} + \underbrace{\left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2}_{\mathrm{bias}(\mathbf{x})}$

$\mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right] = \mathbb{E}_{\mathbf{x}}\left[ \mathrm{var}(\mathbf{x}) + \mathrm{bias}(\mathbf{x}) \right] = \mathrm{var} + \mathrm{bias}$

SLIDE 56

Bias-variance trade-off

$\mathrm{var} = \mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_{\mathcal{D}}\left[ \left( f_{\mathcal{D}}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right] \right] \qquad \mathrm{bias} = \mathbb{E}_{\mathbf{x}}\left[ \left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$

More complex $\mathcal{H}$ ⇒ lower bias but higher variance

[Y.S. Abu-Mostafa, 2012]

SLIDE 57

Example: sin target

• Only two training examples: $N = 2$
• Two models used for learning:
  • $\mathcal{H}_0$: $f(x) = b$
  • $\mathcal{H}_1$: $f(x) = ax + b$

• Which is better, $\mathcal{H}_0$ or $\mathcal{H}_1$?
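The bias and variance of the two classes can be estimated by Monte Carlo over many sampled training sets, as a sketch of the average-hypothesis machinery above. Following Abu-Mostafa's setup, we assume the target is $h(x) = \sin(\pi x)$ with inputs uniform on $[-1, 1]$ and noiseless targets:

```python
import numpy as np

rng = np.random.default_rng(7)
L = 10_000                                   # number of sampled training sets
xg = np.linspace(-1, 1, 200)                 # grid for the expectation over x
h = np.sin(np.pi * xg)                       # target h(x) = sin(pi x)

# Each training set: two inputs uniform on [-1, 1], noiseless targets.
x1, x2 = rng.uniform(-1, 1, (2, L))
y1, y2 = np.sin(np.pi * x1), np.sin(np.pi * x2)

# H0: constant hypothesis, the mean of the two targets.
f0 = np.tile(((y1 + y2) / 2)[:, None], (1, len(xg)))
# H1: the line through the two points.
a = (y2 - y1) / (x2 - x1)
f1 = a[:, None] * (xg[None, :] - x1[:, None]) + y1[:, None]

results = {}
for name, f in [("H0", f0), ("H1", f1)]:
    fbar = f.mean(axis=0)                    # average hypothesis f_bar(x)
    bias = np.mean((fbar - h) ** 2)          # E_x[(f_bar(x) - h(x))^2]
    var = np.mean((f - fbar[None, :]) ** 2)  # E_x E_D[(f_D(x) - f_bar(x))^2]
    results[name] = (bias, var)
    print(name, "bias ~", round(bias, 2), " var ~", round(var, 2))
```

The line has lower bias but far higher variance than the constant, so with only two points the simpler class wins on total error, which is the point of the next few slides.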

SLIDE 58

Learning from a training set

[Fits from a single two-point training set under $\mathcal{H}_0$ and $\mathcal{H}_1$. — Y.S. Abu-Mostafa, 2012]

SLIDE 59

Variance of $\mathcal{H}_0$

[Many constant fits, their average $\bar{f}(x)$, and the spread around it. — Y.S. Abu-Mostafa et al.]

SLIDE 60

Variance of $\mathcal{H}_1$

[Many line fits, their average $\bar{f}(x)$, and the much wider spread around it. — Y.S. Abu-Mostafa et al.]

SLIDE 61

Which is better?

[Average hypotheses $\bar{f}(x)$ with bias and variance shown for $\mathcal{H}_0$ and $\mathcal{H}_1$. — Y.S. Abu-Mostafa, 2012]

SLIDE 62

Lesson

Match the model complexity to the data resources, not to the complexity of the target function.

SLIDE 63

Expected training and true error curves

• The errors vary with the number of training samples.

Expected true error: $\mathbb{E}_{\mathcal{D}}\left[ E_{true}\left( f_{\mathcal{D}}(\mathbf{x}) \right) \right]$

Expected training error: $\mathbb{E}_{\mathcal{D}}\left[ E_{train}\left( f_{\mathcal{D}}(\mathbf{x}) \right) \right]$

[Learning curves of $E_{train}$ and $E_{true}$ for a simple and a complex model. — Y.S. Abu-Mostafa, 2012]

SLIDE 64

Regularization

[Figure from Y.S. Abu-Mostafa, 2012]

SLIDE 65

Regularization: bias and variance

[Average hypotheses $\bar{f}(x)$ with bias and variance, without and with regularization. — Y.S. Abu-Mostafa, 2012]

SLIDE 66

Winner of $\mathcal{H}_0$, $\mathcal{H}_1$, and $\mathcal{H}_1$ with regularization

[Average hypotheses $\bar{f}(x)$ for the three alternatives. — Y.S. Abu-Mostafa, 2012]

SLIDE 67

Regularization and bias/variance

[$L = 100$ data sets, $N = 25$ points, $m = 25$ Gaussian basis functions; fits for large, intermediate, and small $\lambda$. — Bishop]

SLIDE 68

Learning curves of bias, variance, and noise

[Bishop]

SLIDE 69

Bias-variance decomposition: summary

• The noise term is unavoidable.
• The terms we are interested in are bias and variance.
• The approximation-generalization trade-off is seen in the bias-variance decomposition.

SLIDE 70

Resources

• C. Bishop, "Pattern Recognition and Machine Learning", Chapters 1.1, 1.3, 3.1, 3.2.
• Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, "Learning from Data", Chapters 2.3, 3.2, 3.4.