Evaluating Hypotheses (based on "Machine Learning", T. Mitchell, McGraw Hill, 1997, ch. 5)



SLIDE 1

Evaluating Hypotheses

Based on "Machine Learning", T. Mitchell, McGraw Hill, 1997, ch. 5. Acknowledgement: the present slides are an adaptation of slides drawn by T. Mitchell.

SLIDE 2

Main Questions in Evaluating Hypotheses

  • 1. How can we estimate the accuracy of a learned hypothesis h over the whole space of instances, given its observed accuracy over limited data?
  • 2. How can we estimate the probability that a hypothesis h1 is more accurate than another hypothesis h2 over D?
  • 3. If available data is limited, how can we use this data for both training and comparing the relative accuracy of two learned hypotheses?

SLIDE 3

Statistics Perspective (See Appendix for Details)

Problem: Given a property observed over some random sample D of the population X, estimate the proportion of X that exhibits that property.

  • Sample error, true error
  • Estimators
  • Binomial distribution, Normal distribution
  • Confidence intervals
  • Paired t tests

SLIDE 4

1. Two Definitions of Error

The sample error of hypothesis h with respect to the target function f and data sample S is the proportion of examples h misclassifies:

    errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))

where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

The true error of hypothesis h with respect to the target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

    errorD(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)]

Question: How well does errorS(h) estimate errorD(h)?
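The sample-error definition above translates directly into code. A minimal sketch (the target function `f`, hypothesis `h`, and sample `S` below are toy stand-ins, not from the slides):

```python
def sample_error(f, h, S):
    """errorS(h) = (1/n) * sum over x in S of delta(f(x) != h(x)):
    the fraction of examples in S that hypothesis h misclassifies."""
    n = len(S)
    # delta(f(x) != h(x)) contributes 1 on a misclassification, 0 otherwise
    return sum(1 for x in S if f(x) != h(x)) / n

# Toy check: the target concept is "x is positive"; the hypothesis
# thresholds at 1 instead of 0, so it misclassifies any x in (0, 1].
f = lambda x: x > 0
h = lambda x: x > 1
S = [-2, -1, 0.5, 1.5, 3]        # exactly one misclassified example: 0.5
print(sample_error(f, h, S))     # 0.2
```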

SLIDE 5

Problems in Estimating errorD(h)

    bias ≡ E[errorS(h)] − errorD(h)

  • 1. If S is the training set, then errorS(h) is optimistically biased, because h was learned using S. Therefore, for an unbiased estimate, h and S must be chosen independently.
  • 2. Even with an unbiased S (i.e., bias = 0), the variance of errorS(h) − errorD(h) may be nonzero.

SLIDE 6

Calculating Confidence Intervals for errorS(h): Preview/Example

Question: If hypothesis h misclassifies 12 of the 40 examples in S, what can we conclude about errorD(h)?

Answer: If the examples are drawn independently of h and of each other, then with approximately 95% probability, errorD(h) lies in the interval 0.30 ± 0.14.

(errorS(h) = 0.30, zN = 1.96, and 0.14 ≈ 1.96 × √(errorS(h)(1 − errorS(h))/n) ≈ 1.96 × 0.07)
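The 12-out-of-40 example can be checked numerically. A small sketch of the Normal-approximation interval from the slide:

```python
import math

def confidence_interval(r, n, z):
    """Approximate confidence interval for errorD(h), given r
    misclassifications out of n test examples and a z-value zN."""
    p = r / n                                  # errorS(h)
    sigma = math.sqrt(p * (1 - p) / n)         # approximate std. deviation
    return p - z * sigma, p + z * sigma

low, high = confidence_interval(12, 40, 1.96)  # 95% interval
print(round(low, 3), round(high, 3))           # roughly 0.158 .. 0.442
```

That is, the half-width 1.96 × √(0.3 × 0.7 / 40) comes out to about 0.14, as the slide states.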

SLIDE 7

Calculating Confidence Intervals for Discrete-Valued Hypotheses: A General Approach

  • 1. Pick the parameter p to estimate: errorD(h)
  • 2. Choose an estimator for the parameter p: errorS(h)
  • 3. Determine the probability distribution that governs the estimator: errorS(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30
  • 4. Find the interval (L, U) such that N% of the probability mass falls in this interval: use the table of zN values

SLIDE 8

Calculating Confidence Intervals for errorS(h): Proof Idea

  • We run the experiment with different randomly drawn samples S (of size n); therefore errorS(h) is a random variable. We will use errorS(h) to estimate errorD(h).
  • The probability of observing r misclassified examples follows the Binomial distribution:

        P(r) = (n! / (r!(n − r)!)) errorD(h)^r (1 − errorD(h))^(n−r)

  • For n sufficiently large, the Normal distribution approximates the Binomial distribution (see next slide).
  • N% of the area under the Normal distribution lies in the interval µ ± zN·σ, with µ and σ respectively the mean and the standard deviation.
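The Binomial probability above can be evaluated directly; a quick sketch using only the standard library:

```python
import math

def binomial_p(r, n, error_d):
    """P(r) = C(n, r) * errorD(h)^r * (1 - errorD(h))^(n - r)."""
    return math.comb(n, r) * error_d**r * (1 - error_d)**(n - r)

# Sanity checks for n = 40, errorD(h) = 0.3: the probabilities over all
# r sum to 1, and the most probable count of misclassified examples is
# r = 12 = n * errorD(h), matching the example on slide 6.
total = sum(binomial_p(r, 40, 0.3) for r in range(41))
best = max(range(41), key=lambda r: binomial_p(r, 40, 0.3))
print(round(total, 6), best)   # 1.0 12
```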

SLIDE 9

Normal Distribution Approximates errorS(h)

errorS(h) follows a Binomial distribution, with

  • mean µ_errorS(h) = errorD(h)
  • standard deviation σ_errorS(h) = √(errorD(h)(1 − errorD(h)) / n)

Approximate this by a Normal distribution with

  • mean µ_errorS(h) = errorD(h)
  • standard deviation σ_errorS(h) ≈ √(errorS(h)(1 − errorS(h)) / n)

SLIDE 10

Calculating Confidence Intervals for errorS(h): Full Proof Details

If

  • S contains n examples, drawn independently of h and of each other
  • n ≥ 30
  • errorS(h) is not too close to 0 or 1 (recommended: n × errorS(h) × (1 − errorS(h)) ≥ 5)

then with approximately N% probability, errorS(h) lies in the interval

    errorD(h) ± zN √(errorD(h)(1 − errorD(h)) / n)

Equivalently, errorD(h) lies in the interval

    errorS(h) ± zN √(errorD(h)(1 − errorD(h)) / n)

which is approximately

    errorS(h) ± zN √(errorS(h)(1 − errorS(h)) / n)

N%: 50%  68%  80%  90%  95%  98%  99%
zN:  0.67 1.00 1.28 1.64 1.96 2.33 2.58
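The zN table and the approximate interval combine into one helper. A sketch under the slide's own applicability conditions:

```python
import math

# zN values from the table above, keyed by the confidence level N%
Z_N = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def error_interval(error_s, n, confidence=95):
    """errorS(h) ± zN * sqrt(errorS(h)(1 - errorS(h)) / n), guarded by
    the slide's conditions for the Normal approximation."""
    if n < 30 or n * error_s * (1 - error_s) < 5:
        raise ValueError("Normal approximation not recommended here")
    z = Z_N[confidence]
    half = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half, error_s + half

low, high = error_interval(0.30, 40, confidence=95)
print(round(low, 2), round(high, 2))   # 0.16 0.44
```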

SLIDE 11

2. Estimate the Difference Between Two Hypotheses

Test h1 on sample S1, test h2 on sample S2.

  • 1. Pick the parameter to estimate: d ≡ errorD(h1) − errorD(h2)
  • 2. Choose an estimator: d̂ ≡ errorS1(h1) − errorS2(h2)
  • 3. Determine the probability distribution that governs the estimator. d̂ is approximately Normally distributed, with

        µ_d̂ = d
        σ_d̂ ≈ √(errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2)

  • 4. Find the confidence interval (L, U): N% of the probability mass falls in the interval µ_d̂ ± zN·σ_d̂
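The estimator d̂ and its interval can be sketched as follows (the sample sizes n1 = n2 = 100 are illustrative choices, not given on the slide):

```python
import math

def diff_estimate(e1, n1, e2, n2, z):
    """d_hat = errorS1(h1) - errorS2(h2), with the N% interval
    d_hat ± zN * sigma_d_hat from the slide's Normal approximation."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat, d_hat - z * sigma, d_hat + z * sigma

# Illustrative run: sample errors 0.30 and 0.20 on 100 examples each,
# at 95% confidence (z = 1.96).
d_hat, low, high = diff_estimate(0.30, 100, 0.20, 100, 1.96)
print(round(d_hat, 2), round(low, 3), round(high, 3))
```

Note that the resulting interval includes 0, so at 95% (two-sided) confidence these illustrative numbers alone would not establish which hypothesis is more accurate.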

SLIDE 12

Difference Between Two Hypotheses: An Example

Suppose errorS1(h1) = .30 and errorS2(h2) = .20.

Question: What is the estimated probability that errorD(h1) > errorD(h2)?

Answer:

Notation: d̂ = errorS1(h1) − errorS2(h2) = 0.10; d = errorD(h1) − errorD(h2)

Calculation: P(d > 0 | d̂ = .10) = P(d̂ < d + 0.10) = P(d̂ < µ_d̂ + 0.10)

σ_d̂ ≈ 0.061, and 0.10 ≈ 1.64 × σ_d̂, with z90 = 1.64

Conclusion (using the one-sided confidence interval): P(d̂ < µ_d̂ + 0.10) = 95%

Therefore, with 95% confidence, errorD(h1) > errorD(h2).
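The arithmetic of this example can be reproduced in a few lines. The sample sizes are not stated on the slide; n1 = n2 = 100 is an assumption here, chosen because it reproduces the slide's σ_d̂ ≈ 0.061:

```python
import math

e1, n1 = 0.30, 100   # errorS1(h1); n1 = 100 is assumed, not from the slide
e2, n2 = 0.20, 100   # errorS2(h2); n2 likewise assumed
d_hat = e1 - e2
sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
z = d_hat / sigma    # how many standard deviations d_hat lies above zero

# z comes out near 1.64 = z90, so by the one-sided rule of slide 19 the
# confidence that d > 0 is 90% + (100% - 90%)/2 = 95%.
print(round(sigma, 3), round(z, 2))   # 0.061 1.64
```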

SLIDE 13

3. Comparing Learning Algorithms LA and LB

We would like to estimate the expected difference in true error between the outputs of LA and LB:

    E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

where L(S) is the hypothesis output by learner L using the training set S drawn according to the distribution D.

When only limited data D0 is available, we instead produce an estimate of

    E_{S⊂D0}[errorD(LA(S)) − errorD(LB(S))]

  • partition D0 into a training set S0 and a test set T0, and measure errorT0(LA(S0)) − errorT0(LB(S0))
  • better: repeat this many times and average the results (next slide)
  • use the paired t test to get an (approximate) confidence interval

SLIDE 14

Comparing Learning Algorithms LA and LB

  • 1. Partition the data D0 into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
  • 2. For i from 1 to k, use Ti for the test set and the remaining data for the training set Si:
  • Si ← D0 − Ti
  • hA ← LA(Si)
  • hB ← LB(Si)
  • δi ← errorTi(hA) − errorTi(hB)
  • 3. Return the value δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

Note: We'd like to use the paired t test on δ̄ to obtain a confidence interval. This is not really correct, because the training sets in this algorithm are not independent (they overlap!). But even this approximation is better than no comparison.
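The k-fold procedure above can be sketched as a loop. Everything below is illustrative: the learners, data, and error metric are toy stand-ins (real learners would actually fit their training set Si):

```python
def compare_learners(learner_a, learner_b, data, k):
    """Mean difference delta_bar of test-set errors over k disjoint folds,
    following steps 1-3 of the slide. Returns delta_bar and the delta_i."""
    fold = len(data) // k
    deltas = []
    for i in range(k):
        T = data[i * fold:(i + 1) * fold]             # test set T_i
        S = data[:i * fold] + data[(i + 1) * fold:]   # training set S_i = D0 - T_i
        h_a, h_b = learner_a(S), learner_b(S)
        err = lambda h: sum(1 for x, y in T if h(x) != y) / len(T)
        deltas.append(err(h_a) - err(h_b))            # delta_i
    return sum(deltas) / k, deltas

# Toy run: labels follow "x >= 0"; "learner" A always returns a threshold
# at 0, "learner" B a threshold at 1 (both ignore S; illustration only).
data = [(x, x >= 0) for x in range(-30, 30)]
learner_a = lambda S: (lambda x: x >= 0)
learner_b = lambda S: (lambda x: x >= 1)
delta_bar, deltas = compare_learners(learner_a, learner_b, data, k=3)
print(delta_bar)   # negative: A's error is lower than B's on average
```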

SLIDE 15

APPENDIX: Statistics Issues

  • Binomial distribution, Normal distribution
  • Confidence intervals
  • Paired t tests

SLIDE 16

Binomial Probability Distribution

Probability P(r) of r heads in n coin flips, if p = Pr(heads):

    P(r) = (n! / (r!(n − r)!)) p^r (1 − p)^(n−r)

[Figure: Binomial distribution for n = 40, p = 0.3]

  • Expected, or mean value of X, E[X], is E[X] ≡ Σ_{i=0}^{n} i·P(i) = np
  • Variance of X is Var(X) ≡ E[(X − E[X])²] = np(1 − p)
  • Standard deviation of X, σX, is σX ≡ √(E[(X − E[X])²]) = √(np(1 − p))
  • For large n, the Normal distribution approximates very closely the Binomial distribution.
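The closed forms E[X] = np and Var(X) = np(1 − p) can be verified against the defining sums; a quick numeric check for n = 40, p = 0.3:

```python
import math

n, p = 40, 0.3
# P(i) = C(n, i) * p^i * (1 - p)^(n - i) for i = 0 .. n
P = [math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

mean = sum(i * P[i] for i in range(n + 1))               # E[X] = sum i*P(i)
var = sum((i - mean) ** 2 * P[i] for i in range(n + 1))  # E[(X - E[X])^2]

print(round(mean, 6), round(var, 6))   # 12.0 8.4, i.e. np and np(1 - p)
```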

SLIDE 17

Normal Probability Distribution

    p(x) = (1 / √(2πσ²)) e^(−(1/2)((x−µ)/σ)²)

[Figure: Normal distribution with mean 0, standard deviation 1]

  • Expected, or mean value of X, is E[X] = µ
  • Variance of X is Var(X) = σ²
  • Standard deviation of X is σX = σ
  • The probability that X falls into the interval (a, b) is ∫_a^b p(x) dx

SLIDE 18

Normal Probability Distribution (I)

[Figure: standard Normal density with the central N% region shaded]

N% of the area (probability) lies in µ ± zN·σ

N%: 50%  68%  80%  90%  95%  98%  99%
zN:  0.67 1.00 1.28 1.64 1.96 2.33 2.58

Example: 80% of the area (probability) lies in µ ± 1.28σ

SLIDE 19

Normal Probability Distribution (II)

[Figure: standard Normal density with the one-sided region shaded]

N% + (1/2)(100% − N%) of the area (probability) lies in (−∞, µ + zN·σ)

N%: 50%  68%  80%  90%  95%  98%  99%
zN:  0.67 1.00 1.28 1.64 1.96 2.33 2.58

Example: 90% of the area (probability) lies in the "one-sided" interval (−∞, µ + 1.28σ)

SLIDE 20

Paired t Test to Compare hA, hB

  • 1. Partition the data into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
  • 2. For i from 1 to k, do: δi ← errorTi(hA) − errorTi(hB). Note: δi is approximately Normally distributed.
  • 3. Return the value δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

The N% confidence interval estimate for d = errorD(hA) − errorD(hB) is

    δ̄ ± tN,k−1 · s_δ̄

where

    s_δ̄ ≡ √( (1/(k(k−1))) Σ_{i=1}^{k} (δi − δ̄)² )
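The interval above can be computed from a list of δi values. A sketch: the δi values below are made up for illustration, and the t-value would normally be looked up in a t table (2.776 is the standard 95% value for k − 1 = 4 degrees of freedom):

```python
import math

def paired_t_interval(deltas, t):
    """delta_bar ± t_{N,k-1} * s_delta_bar, with
    s_delta_bar = sqrt((1/(k(k-1))) * sum (delta_i - delta_bar)^2)."""
    k = len(deltas)
    delta_bar = sum(deltas) / k
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar - t * s, delta_bar + t * s

# Illustrative per-fold differences from k = 5 test sets
deltas = [0.05, 0.02, 0.04, 0.06, 0.03]
low, high = paired_t_interval(deltas, t=2.776)
print(round(low, 4), round(high, 4))
```

Since the whole interval here lies above 0, these illustrative δi would let us conclude, at roughly 95% confidence, that hA has higher true error than hB.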