SLIDE 1

Geoff Gordon—Machine Learning—Fall 2013

Accuracy & confidence

  • Most of course so far: estimating stuff from data
  • Today: how much do we trust our estimates?
  • Last week: one answer to this question
  • prove ahead of time that training set estimate of prediction error will have accuracy ϵ w/ probability 1–δ

  • had to handle two issues:
  • limited data ⇒ can’t get exact error of single model
  • selection bias ⇒ we pick a “lucky” model rather than the right one


SLIDE 2

Selection bias

[Figure: CDF of the max of n samples of N(μ=2, σ²=1), representing error estimates for n models; curves for n = 1, 4, 30.]
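The effect in the figure can be reproduced with a short simulation (a sketch of mine, not from the slides): each error estimate has the same distribution, but the maximum of n of them drifts upward as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_of_n(n, reps=100_000):
    """Distribution of the max of n draws from N(mu=2, sigma^2=1)."""
    return rng.normal(loc=2.0, scale=1.0, size=(reps, n)).max(axis=1)

# Picking the "lucky" (largest) of n estimates biases the selected value upward:
means = {n: max_of_n(n).mean() for n in (1, 4, 30)}
```

With more candidate models, the selected (“best-looking”) estimate sits further above the true mean of 2, which is exactly the selection bias the slide illustrates.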

SLIDE 3

Overfitting

  • Overfitting = selection bias when fitting complex models to little/noisy data
  • to limit overfitting: limit noise in data, get more data, simplify model class
  • Today: not trying to limit overfitting
  • instead, try to evaluate accuracy of selected model (and recursively, accuracy of our accuracy estimate)
  • can lead to detection of overfitting

SLIDE 4

What is accuracy?

  • Simple problem: estimate μ and σ² for a Gaussian from samples x1, x2, … xN ~ Normal(μ, σ²)
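As a concrete sketch (my example values, not the slide's): draw samples and form the usual estimates of μ and σ².

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, N = 2.0, 1.0, 500

x = rng.normal(mu, np.sqrt(sigma2), size=N)

mu_hat = x.mean()                                   # sample mean, estimate of mu
s2_mle = ((x - mu_hat) ** 2).mean()                 # ML estimate of sigma^2 (divides by N, slightly biased)
s2_unbiased = ((x - mu_hat) ** 2).sum() / (N - 1)   # unbiased estimate (divides by N-1)
```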

SLIDE 5

Bias vs. variance vs. residual

  • Mean squared prediction error: predict xN+1
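The decomposition this slide develops can be written out as follows (reconstructed from the standard argument; the slide leaves space for a live derivation). Predicting xN+1 with the estimate μ̂, and using that xN+1 is independent of μ̂:

```latex
\begin{aligned}
E\!\left[(x_{N+1}-\hat\mu)^2\right]
 &= E\!\left[(x_{N+1}-\mu)^2\right] + E\!\left[(\mu-\hat\mu)^2\right] \\
 &= \underbrace{\sigma^2}_{\text{residual}}
  + \underbrace{(\mu-E[\hat\mu])^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}(\hat\mu)}_{\text{variance}}
\end{aligned}
```

For μ̂ the sample mean, the bias is 0 and Var(μ̂) = σ²/N.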

SLIDE 6

Bias-variance tradeoff

  • Can’t do much about residual, so we’re mostly concerned w/ estimation error = bias² + variance
  • Can trade bias v. variance to some extent: e.g., always estimate 0 ⇒ variance = 0, but bias big

  • Cramér-Rao bound on estimation error:
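The bound referred to here, stated from the standard result (the slide itself leaves the formula blank): for any unbiased estimator θ̂ from N i.i.d. samples,

```latex
\mathrm{Var}(\hat\theta) \;\ge\; \frac{1}{N\, I(\theta)},
\qquad
I(\theta) = E\!\left[\left(\tfrac{\partial}{\partial\theta}\log p(x\mid\theta)\right)^{2}\right]
```

For the Gaussian mean, I(μ) = 1/σ², so Var(μ̂) ≥ σ²/N, which the sample mean attains.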


SLIDE 7

Prediction error v. estimation error

  • Several ways to get at accuracy
  • prediction error (bias² + var + residual²)
  • talks only about predictions
  • estimation error (bias² + var)
  • same; tries to concentrate on error due to estimation
  • parameter error
  • talks about parameters rather than predictions
  • in simple case, numerically equal to estimation error
  • but only makes sense if our model class is right

Parameter error: E[(μ − μ̂)²]

SLIDE 8

Evaluating accuracy

  • In N(μ, σ²) example, we were able to derive bias, variance, and residual from first principles
  • In general, have to estimate prediction error, estimation error, or model error from data
  • Holdout data, tail bounds, normal theory (use CLT & tables of normal dist’n), and today’s topics: crossvalidation & bootstrap

SLIDE 9

Goal: estimate sampling variability

  • We’ve computed something from our sample
  • classification error rate, a parameter vector, mean squared prediction error, …
  • for simplicity, a single number (e.g., ith component of weight vector)
  • t = f(x1, x2, …, xN)
  • How much would t vary if we had taken a different sample?
  • For concreteness: f = sample mean (an estimate of population mean)

SLIDE 10

Gold standard: new samples

  • Get M independent data sets
  • Run our computation M times: t1, t2, … tM
  • tj = f(x1(j), x2(j), …, xN(j)), i.e., f applied to the jth data set
  • Look at distribution of tj
  • mean, variance, upper and lower 2.5% quantiles, …
  • A tad wasteful of data…
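A sketch of the gold standard for f = sample mean (the population and sizes are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, N, M = 1.5, 1.0, 50, 1000

# M genuinely independent data sets; recompute t = f(sample) on each.
t = np.array([rng.normal(mu, sigma, size=N).mean() for _ in range(M)])

spread = t.std(ddof=1)                   # sampling variability of the estimator
lo, hi = np.quantile(t, [0.025, 0.975])  # central 95% interval
# For the sample mean, theory says spread should be near sigma / sqrt(N)
```

Note the waste: this consumes M·N data points just to characterize an estimator built from N.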

SLIDE 11

Crossvalidation & bootstrap

  • CV and bootstrap: approximate the gold standard, but cheaper—spend computation instead of data
  • Work for nearly arbitrarily complicated models
  • Typically tighter than tail bounds, but involve difficult-to-verify approximations/assumptions
  • Basic idea: surrogate samples
  • rearrange/modify x1, …, xN to build each “new” sample
  • Getting something from nothing? (hence the name “bootstrap”)

SLIDE 12

For example

[Figure: a sample of 50 points from a distribution with true mean μ = 1.5; the sample mean is μ̂ = 1.6136.]

SLIDE 13

Basic bootstrap

  • Treat x1…xN as our estimate of true distribution
  • To get a new sample, draw N times from this

estimate (with replacement)

  • Do this M times
  • each original xi part of many samples (on average 1 − 1/e of them, about 63%)
  • each sample contains many repeated values (a single xi selected multiple times)
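A minimal implementation of the procedure above (the statistic and the numbers are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.5, 1.0, size=50)   # stand-in for the one sample we actually have
N, M = len(x), 2000

# Resample N points WITH replacement, M times; recompute the statistic each time.
idx = rng.integers(0, N, size=(M, N))
t_boot = x[idx].mean(axis=1)

se_boot = t_boot.std(ddof=1)                  # bootstrap standard error
ci = np.quantile(t_boot, [0.025, 0.975])      # percentile interval

# Each resample covers about 1 - 1/e of the distinct original points:
frac_distinct = np.mean([np.unique(row).size / N for row in idx])
```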

SLIDE 14

Basic bootstrap

[Figure: the original sample (μ̂ = 1.6136) alongside three bootstrap resamples, with means μ̂ = 1.6909, 1.6059, and 1.6507.]

SLIDE 15

What can go wrong?

  • Convergence is only asymptotic (large original sample)

  • here: what if original sample hits mostly the larger mode?
  • Original sample might not be i.i.d.
  • unmeasured covariate

SLIDE 16

Types of errors

  • “Conservative” estimate of uncertainty: tends to be high (too uncertain)
  • “Optimistic” estimate of uncertainty: tends to be low (too certain)

SLIDE 17

Should we worry?

  • New drug: mean outcome 1.327 [higher is better]
  • old one: outcome 1.242
  • Bootstrap underestimates σ: bootstrap estimate = .04
  • true σ = .08
  • Tell investors: new drug better than old one
  • Enter Phase III trials—cost $millions
  • Whoops, it isn’t better after all…

SLIDE 18

Blocked resampling

  • Partial fix for one issue (original sample not i.i.d.)
  • Divide sample into blocks that tend to share the unmeasured covariates, and resample blocks
  • e.g., time series: break up into blocks of adjacent times
  • assumes unmeasured covariates change slowly
  • e.g., matrix: break up by rows or columns
  • assumes unmeasured covariates are associated with rows or columns (e.g., user preferences in Netflix)
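A sketch of the time-series case (the AR(1) data and the block length are my assumptions, not from the slides): resampling contiguous blocks preserves the serial correlation that an i.i.d. resample would destroy.

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated series: adjacent times share an unmeasured, slowly changing covariate.
T = 400
x = np.empty(T)
x[0] = rng.normal()
for t in range(1, T):
    x[t] = 0.8 * x[t - 1] + rng.normal()

B = 20                      # block length; assumed longer than the correlation length
blocks = x.reshape(-1, B)   # T/B contiguous blocks of adjacent times
M = 2000

# Blocked bootstrap: resample whole blocks (with replacement), then take the mean.
pick = rng.integers(0, blocks.shape[0], size=(M, blocks.shape[0]))
se_block = blocks[pick].reshape(M, -1).mean(axis=1).std(ddof=1)

# Naive i.i.d. bootstrap on the same series, for comparison.
pick_iid = rng.integers(0, T, size=(M, T))
se_iid = x[pick_iid].mean(axis=1).std(ddof=1)
```

Here se_block comes out larger than se_iid: the i.i.d. version is an optimistic (too certain) estimate for correlated data.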

SLIDE 19

Further reading

  • http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf

  • Hesterberg et al. (2005). “Bootstrap methods and permutation tests.” In Moore & McCabe, Introduction to the Practice of Statistics.
