SLIDE 1

Machine Learning: Chenhao Tan

University of Colorado Boulder

LECTURE 3
Slides adapted from Thorsten Joachims

Machine Learning: Chenhao Tan | Boulder | 1 of 29

SLIDE 2

Logistics

  • Homework assignments
  • Final project

SLIDE 3

Overview

  • Sample error and generalization error
  • Bias-variance tradeoff
  • Model selection

SLIDE 4

Sample error and generalization error

Outline

  • Sample error and generalization error
  • Bias-variance tradeoff
  • Model selection

SLIDE 5

Sample error and generalization error

Supervised learning

  • Training sample S_train → hypothesis h
  • Target function f: X → Y (f is unknown)
  • Goal: h approximates f

SLIDE 7

Sample error and generalization error

Problem Setup

  • Instances in a learning problem follow a probability distribution P(X, Y)
  • A sample S = {(x_1, y_1), . . . , (x_n, y_n)} is drawn independently and identically distributed (i.i.d.) according to P(X, Y)

Examples:

  • training sample S_train
  • test sample S_test

SLIDE 9

Sample error and generalization error

Sample Error vs. Generalization Error

  • Generalization error of a hypothesis h for a learning task P(X, Y):

    Err_P(h) = E[Δ(h(x), y)] = Σ_{x ∈ X, y ∈ Y} Δ(h(x), y) · P(X = x, Y = y)

  • Sample error of a hypothesis h for a sample S:

    Err_S(h) = (1/n) Σ_{i=1}^{n} Δ(h(x_i), y_i)
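As a sketch of the definition above, the sample error is just the average loss over the sample; the hypothesis h, toy sample S, and 0/1 loss Δ below are illustrative assumptions, not from the slides:

```python
def sample_error(h, S, delta=lambda yhat, y: float(yhat != y)):
    """Sample error Err_S(h): average loss of hypothesis h over sample S."""
    return sum(delta(h(x), y) for x, y in S) / len(S)

# Toy hypothesis: predict 1 when x >= 0 (an illustrative stand-in for a learned h).
h = lambda x: 1 if x >= 0 else 0
S = [(-2, 0), (-1, 0), (0, 1), (1, 1), (2, 0)]  # h errs only on the last example
print(sample_error(h, S))  # 0.2
```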

SLIDE 11

Sample error and generalization error

Training error vs. Test error

  • S_train → h
  • Training error = Err_{S_train}(h)
  • Test error = Err_{S_test}(h)

SLIDE 13

Sample error and generalization error

A concrete hypothetical example

  • Predict flu trends using search data
  • X: search data, Y: fraction of population with flu
  • S_train = all data before 2012
  • S_test = all data in 2012
  • What is the problem with estimating generalization error this way?

[Lazer et al., 2014]

SLIDE 14

Sample error and generalization error

Overfitting

[Friedman et al., 2001]

SLIDE 15

Bias-variance tradeoff

Outline

  • Sample error and generalization error
  • Bias-variance tradeoff
  • Model selection

SLIDE 18

Bias-variance tradeoff

Bias-Variance Tradeoff

Assume a simple model y = f(x) + ε, with E(ε) = 0 and Var(ε) = σ_ε².

Err(x_0) = E[(y − h(x_0))² | X = x_0]
         = σ_ε² + [E h(x_0) − f(x_0)]² + E[h(x_0) − E h(x_0)]²
         = σ_ε² + Bias²(h(x_0)) + Var(h(x_0))
         = Irreducible Error + Bias² + Variance
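The decomposition can be checked numerically. A minimal Monte Carlo sketch, assuming a toy target f(x) = x², Gaussian noise, and a deliberately simple "learner" that averages noisy labels near x_0 (all illustrative choices, not the lecture's setup):

```python
import random

random.seed(0)
f = lambda x: x * x          # toy target function (illustrative assumption)
sigma = 0.5                  # noise std, so Var(eps) = sigma^2
x0 = 1.0                     # query point

def train_and_predict():
    # "Learn" h from a fresh training sample: h(x0) is the mean label of
    # 10 noisy observations near x0 (a toy learner for illustration).
    ys = [f(x0 + random.uniform(-0.3, 0.3)) + random.gauss(0, sigma) for _ in range(10)]
    return sum(ys) / len(ys)

preds = [train_and_predict() for _ in range(20000)]   # h(x0) over many training sets
mean_h = sum(preds) / len(preds)
bias2 = (mean_h - f(x0)) ** 2
var_h = sum((p - mean_h) ** 2 for p in preds) / len(preds)

# Direct Monte Carlo estimate of Err(x0) = E[(y - h(x0))^2 | X = x0]
errs = [(f(x0) + random.gauss(0, sigma) - p) ** 2 for p in preds]
err = sum(errs) / len(errs)

print(f"Err(x0) ≈ {err:.3f}")
print(f"sigma^2 + Bias^2 + Var(h) ≈ {sigma**2 + bias2 + var_h:.3f}")
```

The two printed quantities agree up to Monte Carlo noise, matching the identity above term by term.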

Machine Learning: Chenhao Tan | Boulder | 12 of 29

slide-19
SLIDE 19

Bias-variance tradeoff

Example

[Figure: three panels plotting y (range −1 to 1) against x (range 1 to 2), showing polynomial fits of order 1, 5, and 9 to the same sample.]
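A sketch of how such panels could be produced, assuming numpy and a synthetic sinusoidal sample (the slide's actual data are not recoverable). Training error can only go down as the polynomial order grows, which is the point of the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1.0, 2.0, 12)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # synthetic sample

mse = {}
for order in (1, 5, 9):
    coeffs = np.polyfit(x, y, deg=order)     # least-squares polynomial fit
    mse[order] = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    print(f"order={order}: training MSE = {mse[order]:.4f}")
```

Low order underfits (high bias); an order-9 polynomial through 12 points tracks the noise (high variance).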

SLIDE 20

Bias-variance tradeoff

Revisit Overfitting

http://scott.fortmann-roe.com/docs/BiasVariance.html

SLIDE 21

Bias-variance tradeoff

K-NN Example

Err(x_0) = σ_ε² + [f(x_0) − (1/k) Σ_{l=1}^{k} f(x_(l))]² + σ_ε²/k

In homework 1!
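The σ_ε²/k variance term can be checked numerically: when the k neighbors share the same f value, the k-NN prediction is an average of k noisy labels, so its variance is σ_ε²/k. A sketch under that simplifying assumption (the constants are illustrative):

```python
import random

random.seed(1)
sigma, k = 0.5, 5
f_x0 = 2.0  # common f value of the k nearest neighbors (simplifying assumption)

# k-NN regression averages the labels of the k nearest training points; with the
# neighbors' f values held fixed, the only randomness left is the label noise.
preds = [sum(f_x0 + random.gauss(0, sigma) for _ in range(k)) / k
         for _ in range(100000)]
mean_h = sum(preds) / len(preds)
var_h = sum((p - mean_h) ** 2 for p in preds) / len(preds)
print(f"Var(h(x0)) ≈ {var_h:.4f}, theory sigma^2/k = {sigma**2 / k:.4f}")
```

Growing k shrinks the variance term but (for neighbors farther from x_0) inflates the bias term, which is the tradeoff the formula exposes.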

SLIDE 22

Model selection

Outline

  • Sample error and generalization error
  • Bias-variance tradeoff
  • Model selection

SLIDE 23

Model selection

Model Selection

  • Training: run the learning algorithm m times (e.g., a parameter search), obtaining hypotheses ĥ1, . . . , ĥm
  • Validation error: Err_{S_val}(ĥi) is an estimate of Err_P(ĥi)
  • Selection: use the ĥi with minimum Err_{S_val}(ĥi) for prediction
  • Evaluation: estimate its error on the n test examples
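The paradigm above starts from a random partition of the data; a minimal sketch (the split fractions and helper name are illustrative):

```python
import random

def train_val_test_split(S, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly partition a sample S into train/validation/test splits."""
    S = S[:]
    random.Random(seed).shuffle(S)
    n_test = int(len(S) * test_frac)
    n_val = int(len(S) * val_frac)
    return S[n_test + n_val:], S[n_test:n_test + n_val], S[:n_test]

S = [(x, x % 3) for x in range(100)]  # toy labeled sample
train, val, test = train_val_test_split(S)
print(len(train), len(val), len(test))  # 60 20 20
```

Validation data steers the choice among the m hypotheses; the test split stays untouched until the final error estimate.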

SLIDE 24

Model selection

Train-val-test

SLIDE 25

Model selection

K-fold cross validation

An estimate using all instances:

  • Input: a sample S and a learning algorithm A
  • Procedure: randomly split S into K equally-sized folds S_1, . . . , S_K; for each S_i, apply A to S_{−i} (all folds except S_i) to obtain ĥ_i, and compute Err_{S_i}(ĥ_i)
  • Performance estimate: (1/K) Σ_{i=1}^{K} Err_{S_i}(ĥ_i)
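The procedure above, sketched in Python; the majority-class "learning algorithm" is an illustrative stand-in for A:

```python
import random

def k_fold_cv(S, A, K=5, delta=lambda yhat, y: float(yhat != y), seed=0):
    """K-fold CV: train on S_{-i}, score on held-out fold S_i, average over folds."""
    S = S[:]
    random.Random(seed).shuffle(S)
    folds = [S[i::K] for i in range(K)]
    total = 0.0
    for i in range(K):
        rest = [ex for j in range(K) if j != i for ex in folds[j]]
        h = A(rest)                                   # apply A to S_{-i}
        total += sum(delta(h(x), y) for x, y in folds[i]) / len(folds[i])
    return total / K

# Toy learning algorithm: majority-class classifier (illustrative stand-in for A).
def majority(train):
    labels = [y for _, y in train]
    label = max(set(labels), key=labels.count)
    return lambda x: label

S = [(i, 0) for i in range(70)] + [(i, 1) for i in range(70, 100)]
print(k_fold_cv(S, majority, K=5))
```

Every instance is used once for evaluation and K − 1 times for training, which is what makes the estimate use all instances.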

SLIDE 27

Model selection

K-fold Cross Validation

Example use (Wrong!):

  • Find good features F using all of S_train
  • Split S_train into K folds
  • For each fold, use the remaining training data and features F to build a classifier, and estimate prediction error as the average error rate over the folds

This leaks information: F was selected using all of S_train, including each held-out fold, so the cross-validation estimate is optimistically biased.

SLIDE 28

Model selection

K-fold cross validation

  • Select the best model using training data only
  • Use nested cross-validation for performance estimation
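Nested cross-validation can be sketched as two loops: the inner loop sees only the outer training folds and selects a model, while the outer loop scores that whole selection procedure on untouched folds. The constant-predictor "learners" below are illustrative stand-ins:

```python
import random

def err(h, fold):
    """Error rate of hypothesis h on a fold of (x, y) pairs."""
    return sum(h(x) != y for x, y in fold) / len(fold)

def nested_cv(S, learners, K=5, seed=0):
    """Nested CV: inner folds (outer-train only) pick a learner; outer folds
    score the whole selection procedure on data it never saw."""
    S = S[:]
    random.Random(seed).shuffle(S)
    folds = [S[i::K] for i in range(K)]
    outer_errs = []
    for i in range(K):
        outer_train = [ex for j in range(K) if j != i for ex in folds[j]]
        inner = [outer_train[j::K] for j in range(K)]

        def inner_cv_err(A):
            return sum(
                err(A([e for m in range(K) if m != j for e in inner[m]]), inner[j])
                for j in range(K)) / K

        best = min(learners, key=inner_cv_err)        # selection: inner folds only
        outer_errs.append(err(best(outer_train), folds[i]))  # evaluation: held-out fold
    return sum(outer_errs) / K

# Toy "learning algorithms": constant predictors (illustrative stand-ins).
const = lambda label: (lambda train: (lambda x: label))
S = [(i, 0) for i in range(70)] + [(i, 1) for i in range(70, 100)]
print(nested_cv(S, [const(0), const(1)]))
```

Because selection never touches the outer held-out fold, the returned estimate avoids the leakage of the wrong example above.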

SLIDE 29

Model selection

Evaluating learned hypothesis

  • Goal: Find h with small prediction error Err_P(h) over P(X, Y)
  • Question: What is Err_P(ĥ) of the ĥ obtained from training data S_train?
  • Training error: Err_{S_train}(ĥ)
  • Test error: Err_{S_test}(ĥ) is an estimate of Err_P(ĥ)

SLIDE 32

Model selection

What is the True Error of a Hypothesis?

  • Apply ĥ to S_test; for each (x, y) ∈ S_test, observe Δ(ĥ(x), y)
  • Binomial distribution estimate: assume each toss is independent and the probability of heads is p; then the probability of observing x heads in a sample of n independent coin tosses is

    Pr(X = x | p, n) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)

  • Normal approximation: Err(ĥ) = p̂ = (1/n) Σ_{i=1}^{n} Δ(ĥ(x_i), y_i)
  • Confidence interval: p̂ ± z_α √(p̂(1 − p̂)/n)
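A sketch of the normal-approximation interval; the error count and z = 1.96 (95% confidence) are illustrative:

```python
import math

def error_confidence_interval(n_errors, n, z=1.96):
    """Normal-approximation confidence interval for the true error
    from the number of errors observed on n test examples."""
    p_hat = n_errors / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

lo, hi = error_confidence_interval(n_errors=30, n=200)
print(f"test error {30 / 200}, 95% CI ≈ ({lo:.3f}, {hi:.3f})")
```

The width shrinks as 1/√n, so a small test set gives only a loose bracket on Err_P(ĥ).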

SLIDE 34

Model selection

Is hypothesis ĥ1 better than ĥ2?

Same test sample

  • Apply ĥ1 and ĥ2 to S_test
  • Decide whether Err_P(ĥ1) = Err_P(ĥ2)
  • Null hypothesis: Err_{S_test}(ĥ1) and Err_{S_test}(ĥ2) come from binomial distributions with the same p
  • Test: Binomial Sign Test (McNemar's Test)
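A sketch of the exact binomial sign test on the discordant test examples (those where exactly one hypothesis errs); the counts below are hypothetical:

```python
from math import comb

def binomial_sign_test(b, c):
    """Two-sided exact sign test on the b + c discordant test examples
    (b: only h1 errs, c: only h2 errs); under H0 each discordant example
    favors either hypothesis with probability 1/2."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k), X ~ Bin(n, 1/2)
    return min(1.0, 2 * tail)

# Hypothetical counts: h1 errs alone on 8 examples, h2 errs alone on 20.
print(f"p-value ≈ {binomial_sign_test(8, 20):.4f}")
```

Only the discordant examples carry evidence; examples where both hypotheses agree cancel out of the comparison.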

SLIDE 36

Model selection

Is hypothesis ĥ1 better than ĥ2?

Different test samples

  • Apply ĥ1 to S_test1 and ĥ2 to S_test2
  • Decide whether Err_P(ĥ1) = Err_P(ĥ2)
  • Null hypothesis: Err_{S_test1}(ĥ1) and Err_{S_test2}(ĥ2) come from binomial distributions with the same p
  • Test: t-test

SLIDE 38

Model selection

Is Learning Algorithm A1 better than A2?

  • Given k samples S1, . . . , Sk of labeled instances from P(X, Y), randomly split each Si into S^i_train and S^i_test
  • For each i, train A1 and A2 on S^i_train to obtain ĥ^A1_i and ĥ^A2_i; apply both to S^i_test and compute Err_{S^i_test}(ĥ^A1_i) and Err_{S^i_test}(ĥ^A2_i)
  • Decide whether E_S[Err_P(A1(S_train))] = E_S[Err_P(A2(S_train))]
  • Null hypothesis: Err_{S_test}(A1(S_train)) and Err_{S_test}(A2(S_train)) come from the same distribution over samples S
  • Test: t-test or Wilcoxon Signed-Rank Test
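With matched splits, the k per-sample error differences can be summarized by a paired t statistic; a sketch with hypothetical error values (the p-value would then come from a t table with k − 1 degrees of freedom):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errs_a1, errs_a2):
    """Paired t statistic over per-sample test errors of two learning algorithms."""
    diffs = [a - b for a, b in zip(errs_a1, errs_a2)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-split test errors on k = 5 samples S_1 ... S_5.
errs_a1 = [0.12, 0.15, 0.11, 0.14, 0.13]
errs_a2 = [0.16, 0.18, 0.15, 0.17, 0.19]
t = paired_t_statistic(errs_a1, errs_a2)
print(f"t = {t:.2f} with {len(errs_a1) - 1} degrees of freedom")
```

Pairing by split removes the variation shared by both algorithms on the same S_i, which is what makes the comparison sensitive.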

SLIDE 39

Model selection

Summary

  • The goal is to optimize generalization error, not training error.
  • Look for a model that manages the bias-variance tradeoff.
  • Use the train-val-test paradigm and cross validation.

SLIDE 40

Model selection

References (1)

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.

David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. The parable of Google Flu: traps in big data analysis. Science, 343(6176):1203–1205, 2014.
