

SLIDE 1

Evaluating Hypotheses

IEEE Expert, October 1996

SLIDE 2

Evaluating Hypotheses

  • Sample error, true error
  • Confidence intervals for observed hypothesis error
  • Estimators
  • Binomial distribution, Normal distribution, Central Limit Theorem

  • Paired t tests
  • Comparing learning methods

SLIDE 3

Evaluating Hypotheses and Learners

Consider hypotheses H1 and H2 learned by learners L1 and L2

  • How to learn H and estimate accuracy with limited data?
  • How well does the observed accuracy of H over a limited sample estimate accuracy over unseen data?
  • If H1 outperforms H2 on the sample, will H1 outperform H2 in general?
  • Same conclusion for L1 and L2?

SLIDE 4

Two Definitions of Error

The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

    errorD(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)]

The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:

    errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))

where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

How well does errorS(h) estimate errorD(h)?
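
As a minimal sketch, the sample error is just the misclassification rate over S (here h, f, and S are hypothetical stand-ins for the hypothesis, target function, and sample):

```python
def sample_error(h, f, S):
    # errorS(h): fraction of instances x in S with h(x) != f(x).
    return sum(1 for x in S if h(x) != f(x)) / len(S)
```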

SLIDE 5

Problems Estimating Error

1. Bias: If S is the training set, errorS(h) is optimistically biased:

    bias ≡ E[errorS(h)] − errorD(h)

   For an unbiased estimate, h and S must be chosen independently.

2. Variance: Even with an unbiased S, errorS(h) may still vary from errorD(h).

SLIDE 6

Example

Hypothesis h misclassifies 12 of the 40 examples in S:

    errorS(h) = 12/40 = 0.30

What is errorD(h)?

SLIDE 7

Estimators

Experiment:

  1. Choose sample S of size n according to distribution D.
  2. Measure errorS(h).

errorS(h) is a random variable (i.e., the result of an experiment).

errorS(h) is an unbiased estimator for errorD(h).

Given an observed errorS(h), what can we conclude about errorD(h)?

SLIDE 8

Confidence Intervals

If

  • S contains n examples, drawn independently of h and of each other
  • n ≥ 30

Then

  • With approximately 95% probability, errorD(h) lies in the interval

    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )

SLIDE 9

Confidence Intervals

If

  • S contains n examples, drawn independently of h and of each other
  • n ≥ 30

Then

  • With approximately N% probability, errorD(h) lies in the interval

    errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )

where

    N%:  50%   68%   80%   90%   95%   98%   99%
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
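
As a sketch (assuming SciPy is available for the zN lookup), the interval can be computed directly; the example reproduces slide 6's numbers:

```python
import math
from scipy.stats import norm

def error_confidence_interval(error_s, n, confidence=0.95):
    # Normal approximation to the Binomial; appropriate for n >= 30.
    z = norm.ppf(1 - (1 - confidence) / 2)          # z_N, e.g. 1.96 for 95%
    half_width = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

print(error_confidence_interval(0.30, 40))          # roughly (0.16, 0.44)
```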

SLIDE 10

errorS(h) is a Random Variable

Rerun the experiment with different randomly drawn S (of size n). The probability of observing r misclassified examples:

[Figure: Binomial distribution for n = 40, p = 0.3]

    P(r) = ( n! / (r!(n − r)!) ) errorD(h)^r (1 − errorD(h))^(n−r)

SLIDE 11

Binomial Probability Distribution

[Figure: Binomial distribution for n = 40, p = 0.3]

    P(r) = ( n! / (r!(n − r)!) ) p^r (1 − p)^(n−r)

Probability P(r) of r heads in n coin flips, if p = Pr(heads).

  • Expected, or mean, value of X:

    E[X] ≡ Σ_{i=0..n} i P(i) = np

  • Variance of X:

    Var(X) ≡ E[(X − E[X])²] = np(1 − p)

  • Standard deviation of X:

    σX ≡ √( E[(X − E[X])²] ) = √( np(1 − p) )
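
A small sketch of this pmf using only the standard library (math.comb supplies the n!/(r!(n − r)!) term):

```python
from math import comb

def binomial_pmf(r, n, p):
    # P(r) = C(n, r) * p^r * (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Peak of the distribution plotted above (n = 40, p = 0.3):
print(binomial_pmf(12, 40, 0.3))   # about 0.14
```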

SLIDE 12

Normal Distribution Approximates Binomial

errorS(h) follows a Binomial distribution, with

  • mean µ_errorS(h) = errorD(h)
  • standard deviation

    σ_errorS(h) = √( errorD(h)(1 − errorD(h)) / n )

Approximate this by a Normal distribution with

  • mean µ_errorS(h) = errorD(h)
  • standard deviation

    σ_errorS(h) ≈ √( errorS(h)(1 − errorS(h)) / n )

SLIDE 13

Normal Probability Distribution

[Figure: Normal distribution with mean 0, standard deviation 1]

    p(x) = (1 / √(2πσ²)) e^( −(1/2) ((x − µ)/σ)² )

The probability that X will fall into the interval (a, b) is given by

    ∫_a^b p(x) dx

  • Expected, or mean, value of X: E[X] = µ
  • Variance of X: Var(X) = σ²
  • Standard deviation of X: σX = σ

SLIDE 14

Normal Probability Distribution

[Figure: Normal distribution with mean 0, standard deviation 1]

80% of the area (probability) lies in µ ± 1.28σ.
N% of the area (probability) lies in µ ± zN σ, where

    N%:  50%   68%   80%   90%   95%   98%   99%
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
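
The zN row of the table can be recovered from the inverse Normal CDF (a sketch assuming SciPy; the values match the table up to rounding, e.g. 68% gives 0.99 where the table uses the conventional 1.00):

```python
from scipy.stats import norm

# z_N such that N% of the probability mass lies within mu +/- z_N * sigma.
for n_pct in (50, 68, 80, 90, 95, 98, 99):
    print(n_pct, round(norm.ppf(0.5 + n_pct / 200), 2))
```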

SLIDE 15

Confidence Intervals, More Correctly

If

  • S contains n examples, drawn independently of h and of each other
  • n ≥ 30

Then

  • With approximately 95% probability, errorS(h) lies in the interval

    errorD(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )

  Equivalently, errorD(h) lies in the interval

    errorS(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )

  which is approximately

    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )

SLIDE 16

Two-Sided and One-Sided Bounds

[Figure: two-sided vs. one-sided bounds on a standard Normal distribution]

  • If µ − zN σ ≤ y ≤ µ + zN σ with confidence N = 100(1 − α)%
  • Then −∞ ≤ y ≤ µ + zN σ with confidence N = 100(1 − α/2)%, and µ − zN σ ≤ y ≤ +∞ with confidence N = 100(1 − α/2)%
  • Example: n = 40, r = 12
    – Two-sided, 95% confidence (α = 0.05): P(0.16 ≤ y ≤ 0.44) = 0.95
    – One-sided: P(y ≤ 0.44) = P(y ≥ 0.16) = 1 − α/2 = 0.975
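
Continuing the n = 40, r = 12 example as a sketch in plain Python:

```python
import math

error_s, n = 12 / 40, 40                        # errorS(h) = 0.30
sigma = math.sqrt(error_s * (1 - error_s) / n)

lower, upper = error_s - 1.96 * sigma, error_s + 1.96 * sigma
print(f"two-sided 95%: ({lower:.2f}, {upper:.2f})")     # (0.16, 0.44)
print(f"one-sided 97.5%: errorD(h) <= {upper:.2f}")     # same bound, alpha/2 tail
```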

SLIDE 17

Calculating Confidence Intervals

  1. Pick the parameter to estimate: errorD(h).
  2. Choose an estimator: errorS(h).
  3. Determine the probability distribution that governs the estimator: errorS(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30.
  4. Find the interval (L, U) such that N% of the probability mass falls in the interval: use the table of zN values.

SLIDE 18

Central Limit Theorem

Consider a set of independent, identically distributed random variables Y1 . . . Yn, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean

    Ȳ ≡ (1/n) Σ_{i=1..n} Yi

Central Limit Theorem: As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n.
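
A quick simulation sketch of the theorem (assuming NumPy; the Uniform(0, 1) source distribution is an arbitrary choice, with µ = 0.5 and σ² = 1/12):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 10_000

# Each row is one experiment: the mean of n i.i.d. Uniform(0, 1) draws.
sample_means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

print(sample_means.mean())   # close to mu = 0.5
print(sample_means.var())    # close to sigma^2 / n = (1/12) / 50, about 0.0017
```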

SLIDE 19

Difference Between Hypotheses

Test h1 on sample S1, test h2 on sample S2.

  1. Pick the parameter to estimate:

    d ≡ errorD(h1) − errorD(h2)

  2. Choose an estimator:

    d̂ ≡ errorS1(h1) − errorS2(h2)

  3. Determine the probability distribution that governs the estimator:

    σ_d̂ ≈ √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )

  4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

    d̂ ± zN √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
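
A sketch of steps 2-4 (plain Python apart from the fixed zN default; the inputs are the two sample errors and sample sizes):

```python
import math

def diff_error_interval(e1, n1, e2, n2, z=1.96):
    # N% confidence interval for d = errorD(h1) - errorD(h2).
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

# Next slide's example values:
print(diff_error_interval(0.30, 100, 0.20, 100))   # about (-0.02, 0.22)
```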

SLIDE 20

Hypothesis Testing

P(errorD(h1) > errorD(h2)) = ?

Example:

  • |S1| = |S2| = 100
  • errorS1(h1) = 0.30
  • errorS2(h2) = 0.20
  • d̂ = 0.10, σ_d̂ ≈ 0.061

P(d̂ < µ_d̂ + 0.10) is the probability that d̂ does not overestimate d by more than 0.10. Setting zN · σ_d̂ = 0.10 gives zN = 1.64, and

    P(d̂ < µ_d̂ + 1.64 σ_d̂) = 0.95

I.e., reject the null hypothesis at the 0.05 level of significance.
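
The same one-sided test as a sketch (SciPy's Normal CDF supplies the probability):

```python
import math
from scipy.stats import norm

e1, n1, e2, n2 = 0.30, 100, 0.20, 100
d_hat = e1 - e2                                             # 0.10
sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)  # ~0.061

# P(d_hat does not overestimate d by more than 0.10): ~0.95
print(norm.cdf(d_hat / sigma))
```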

SLIDE 21

Paired t test to compare hA, hB

  1. Partition the data into k disjoint test sets T1, T2, . . . , Tk of equal size, where this size is at least 30.
  2. For i from 1 to k, do:

    δi ← errorTi(hA) − errorTi(hB)

  3. Return the value δ̄, where

    δ̄ ≡ (1/k) Σ_{i=1..k} δi

N% confidence interval estimate for δ:

    δ̄ ± t_{N,k−1} s_δ̄

    s_δ̄ ≡ √( (1/(k(k−1))) Σ_{i=1..k} (δi − δ̄)² )

Note that δi is approximately Normally distributed.
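
A sketch of the interval computation (assuming SciPy for the t quantile; deltas holds the per-test-set δi values):

```python
import math
from scipy import stats

def paired_t_interval(deltas, confidence=0.95):
    # N% confidence interval: delta_bar +/- t_{N, k-1} * s_delta_bar.
    k = len(deltas)
    d_bar = sum(deltas) / k
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t = stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1)
    return d_bar - t * s, d_bar + t * s
```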

SLIDE 22

Comparing learning algorithms LA and LB

What we'd like to estimate:

    E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

where L(S) is the hypothesis output by learner L using training set S, i.e., the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.

But, given limited data D0, what is a good estimator?

  • We could partition D0 into training set S0 and test set T0, and measure

    errorT0(LA(S0)) − errorT0(LB(S0))

  • Even better, repeat this many times and average the results (next slide).

SLIDE 23

Comparing learning algorithms LA and LB

  1. Partition data D0 into k disjoint test sets T1, T2, . . . , Tk of equal size, where this size is at least 30.
  2. For i from 1 to k, do: use Ti for the test set and the remaining data for the training set Si:

    • Si ← {D0 − Ti}
    • hA ← LA(Si)
    • hB ← LB(Si)
    • δi ← errorTi(hA) − errorTi(hB)

  3. Return the value δ̄, where

    δ̄ ≡ (1/k) Σ_{i=1..k} δi
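
A sketch of the whole loop (learner_a, learner_b, and error_fn are hypothetical stand-ins for LA, LB, and the test-error measurement):

```python
def compare_learners(learner_a, learner_b, error_fn, data, k):
    # Returns delta_bar, the mean per-fold difference in test error.
    folds = [data[i::k] for i in range(k)]   # k disjoint test sets T_i
    deltas = []
    for i, test_set in enumerate(folds):
        # S_i <- D0 - T_i
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_a = learner_a(train_set)           # h_A <- L_A(S_i)
        h_b = learner_b(train_set)           # h_B <- L_B(S_i)
        deltas.append(error_fn(h_a, test_set) - error_fn(h_b, test_set))
    return sum(deltas) / k
```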

SLIDE 24

Comparing learning algorithms LA and LB

Notice we'd like to use the paired t test on δ̄ to obtain a confidence interval.

But this is not really correct, because the training sets in this algorithm are not independent (they overlap!).

It is more correct to view the algorithm as producing an estimate of

    E_{S⊂D0}[errorD(LA(S)) − errorD(LB(S))]

instead of

    E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

But even this approximation is better than no comparison.

SLIDE 25

More on t-tests

  • Good for comparing two learners
  • But not for multiple pairs
  • P(reject null hypothesis) grows rapidly with the number of pairs
  • Null hypothesis = learners perform equally

    αe ≃ 1 − (1 − αc)^m,   m = k(k − 1)/2

where

    αe = P(incorrectly rejecting "all means equal")
    αc = P(incorrectly rejecting the null hypothesis for one pair)
    k = number of learners
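
For instance (a quick check of the formula), with k = 5 learners and a per-pair αc = 0.05:

```python
k, alpha_c = 5, 0.05
m = k * (k - 1) // 2             # 10 pairwise comparisons
alpha_e = 1 - (1 - alpha_c) ** m
print(alpha_e)                   # ~0.40: a 40% chance of some spurious rejection
```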

SLIDE 26

Analysis of Variance (ANOVA)

  • P(means are not equal)
  • MS = Mean Square

    MSbetween = (1/dfb) Σ_j (x̄j − x̄)²

    MSwithin = (1/dfw) Σ_j Σ_k (xjk − x̄j)²

where

    j = number of groups (learners)
    k = number of trials per group
    x̄j = mean of the trials for group j (i.e., error)
    x̄ = mean of all trials for all groups
    dfb = degrees of freedom between groups, j − 1
    dfw = degrees of freedom within all groups, j(k − 1)

  • If there is no difference between groups, then MSwithin = MSbetween

    ANOVA: F = MSbetween / MSwithin

  • The F distribution maps (F, dfb, dfw) to the probability of rejecting the null hypothesis
  • Increased F leads to decreased P(means are equal)
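
A sketch of the test on per-trial error rates, using SciPy's one-way ANOVA routine rather than applying the formulas above by hand (the error lists are illustrative made-up numbers):

```python
from scipy.stats import f_oneway

# Hypothetical per-trial test errors for three learners.
errors_a = [0.30, 0.28, 0.33, 0.29]
errors_b = [0.22, 0.25, 0.21, 0.24]
errors_c = [0.31, 0.27, 0.30, 0.32]

f_stat, p_value = f_oneway(errors_a, errors_b, errors_c)
print(f_stat, p_value)   # large F, small p => reject "all means equal"
```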

SLIDE 27

Summary

  • Evaluating hypotheses
    – Sample error vs. true error
    – Confidence intervals for observed hypothesis error
    – Error estimators
    – Binomial distribution, Normal distribution, Central Limit Theorem
  • Comparing hypotheses and learners
    – Hypothesis testing
    – Paired t tests
    – Cross validation
    – Analysis of variance
