

SLIDE 1

Evaluating Hypotheses

IEEE Expert, October 1996

SLIDE 2

Evaluating Hypotheses

  • Sample error, true error
  • Confidence intervals for observed hypothesis error
  • Estimators
  • Binomial distribution, Normal distribution, Central Limit Theorem

  • Paired t tests
  • Comparing learning methods

SLIDE 3

Evaluating Hypotheses and Learners

Consider hypotheses H1 and H2 learned by learners L1 and L2

  • How to learn H and estimate accuracy with limited data?
  • How well does the observed accuracy of H over a limited sample estimate accuracy over unseen data?
  • If H1 outperforms H2 on the sample, will H1 outperform H2 in general?
  • Same conclusion for L1 and L2?

SLIDE 4

Two Definitions of Error

The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

    errorD(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)]

The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:

    errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))

where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

How well does errorS(h) estimate errorD(h)?
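
As a minimal sketch, the sample error is just the misclassification rate over S (here h, f, and S are hypothetical stand-ins for the hypothesis, target function, and sample):

```python
def sample_error(h, f, S):
    # errorS(h): fraction of instances x in S with h(x) != f(x).
    return sum(1 for x in S if h(x) != f(x)) / len(S)
```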

SLIDE 5

Problems Estimating Error

1. Bias: If S is the training set, errorS(h) is optimistically biased:

    bias ≡ E[errorS(h)] − errorD(h)

   For an unbiased estimate, h and S must be chosen independently.

2. Variance: Even with an unbiased S, errorS(h) may still vary from errorD(h).

SLIDE 6

Example

Hypothesis h misclassifies 12 of the 40 examples in S:

    errorS(h) = 12/40 = 0.30

What is errorD(h)?

SLIDE 7

Estimators

Experiment:

  1. Choose sample S of size n according to distribution D.
  2. Measure errorS(h).

errorS(h) is a random variable (i.e., the result of an experiment).

errorS(h) is an unbiased estimator for errorD(h).

Given an observed errorS(h), what can we conclude about errorD(h)?

SLIDE 8

Confidence Intervals

If

  • S contains n examples, drawn independently of h and of each other
  • n ≥ 30

Then

  • With approximately 95% probability, errorD(h) lies in the interval

    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )

SLIDE 9

Confidence Intervals

If

  • S contains n examples, drawn independently of h and of each other
  • n ≥ 30

Then

  • With approximately N% probability, errorD(h) lies in the interval

    errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )

where

    N%:  50%   68%   80%   90%   95%   98%   99%
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
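
As a sketch (assuming SciPy is available for the zN lookup), the interval can be computed directly; the example reproduces slide 6's numbers:

```python
import math
from scipy.stats import norm

def error_confidence_interval(error_s, n, confidence=0.95):
    # Normal approximation to the Binomial; appropriate for n >= 30.
    z = norm.ppf(1 - (1 - confidence) / 2)          # z_N, e.g. 1.96 for 95%
    half_width = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

print(error_confidence_interval(0.30, 40))          # roughly (0.16, 0.44)
```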

SLIDE 10

errorS(h) is a Random Variable

Rerun the experiment with different randomly drawn S (of size n). The probability of observing r misclassified examples:

[Figure: Binomial distribution for n = 40, p = 0.3]

    P(r) = ( n! / (r!(n − r)!) ) errorD(h)^r (1 − errorD(h))^(n−r)

SLIDE 11

Binomial Probability Distribution

[Figure: Binomial distribution for n = 40, p = 0.3]

    P(r) = ( n! / (r!(n − r)!) ) p^r (1 − p)^(n−r)

Probability P(r) of r heads in n coin flips, if p = Pr(heads).

  • Expected, or mean, value of X:

    E[X] ≡ Σ_{i=0..n} i P(i) = np

  • Variance of X:

    Var(X) ≡ E[(X − E[X])²] = np(1 − p)

  • Standard deviation of X:

    σX ≡ √( E[(X − E[X])²] ) = √( np(1 − p) )
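
A small sketch of this pmf using only the standard library (math.comb supplies the n!/(r!(n − r)!) term):

```python
from math import comb

def binomial_pmf(r, n, p):
    # P(r) = C(n, r) * p^r * (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Peak of the distribution plotted above (n = 40, p = 0.3):
print(binomial_pmf(12, 40, 0.3))   # about 0.14
```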

SLIDE 12

Normal Distribution Approximates Binomial

errorS(h) follows a Binomial distribution, with

  • mean µ_errorS(h) = errorD(h)
  • standard deviation

    σ_errorS(h) = √( errorD(h)(1 − errorD(h)) / n )

Approximate this by a Normal distribution with

  • mean µ_errorS(h) = errorD(h)
  • standard deviation

    σ_errorS(h) ≈ √( errorS(h)(1 − errorS(h)) / n )

SLIDE 13

Normal Probability Distribution

[Figure: Normal distribution with mean 0, standard deviation 1]

    p(x) = (1 / √(2πσ²)) e^( −(1/2) ((x − µ)/σ)² )

The probability that X will fall into the interval (a, b) is given by

    ∫_a^b p(x) dx

  • Expected, or mean, value of X: E[X] = µ
  • Variance of X: Var(X) = σ²
  • Standard deviation of X: σX = σ

SLIDE 14

Normal Probability Distribution

[Figure: Normal distribution with mean 0, standard deviation 1]

80% of the area (probability) lies in µ ± 1.28σ.
N% of the area (probability) lies in µ ± zN σ, where

    N%:  50%   68%   80%   90%   95%   98%   99%
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
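
The zN row of the table can be recovered from the inverse Normal CDF (a sketch assuming SciPy; the values match the table up to rounding, e.g. 68% gives 0.99 where the table uses the conventional 1.00):

```python
from scipy.stats import norm

# z_N such that N% of the probability mass lies within mu +/- z_N * sigma.
for n_pct in (50, 68, 80, 90, 95, 98, 99):
    print(n_pct, round(norm.ppf(0.5 + n_pct / 200), 2))
```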

SLIDE 15

Confidence Intervals, More Correctly

If

  • S contains n examples, drawn independently of h and of each other
  • n ≥ 30

Then

  • With approximately 95% probability, errorS(h) lies in the interval

    errorD(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )

  Equivalently, errorD(h) lies in the interval

    errorS(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )

  which is approximately

    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )

SLIDE 16

Two-Sided and One-Sided Bounds

[Figure: two-sided vs. one-sided bounds on a standard Normal distribution]

  • If µ − zN σ ≤ y ≤ µ + zN σ with confidence N = 100(1 − α)%
  • Then −∞ ≤ y ≤ µ + zN σ with confidence N = 100(1 − α/2)%, and µ − zN σ ≤ y ≤ +∞ with confidence N = 100(1 − α/2)%
  • Example: n = 40, r = 12
    – Two-sided, 95% confidence (α = 0.05): P(0.16 ≤ y ≤ 0.44) = 0.95
    – One-sided: P(y ≤ 0.44) = P(y ≥ 0.16) = 1 − α/2 = 0.975
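
Continuing the n = 40, r = 12 example as a sketch in plain Python:

```python
import math

error_s, n = 12 / 40, 40                        # errorS(h) = 0.30
sigma = math.sqrt(error_s * (1 - error_s) / n)

lower, upper = error_s - 1.96 * sigma, error_s + 1.96 * sigma
print(f"two-sided 95%: ({lower:.2f}, {upper:.2f})")     # (0.16, 0.44)
print(f"one-sided 97.5%: errorD(h) <= {upper:.2f}")     # same bound, alpha/2 tail
```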

SLIDE 17

Calculating Confidence Intervals

  1. Pick the parameter to estimate: errorD(h).
  2. Choose an estimator: errorS(h).
  3. Determine the probability distribution that governs the estimator: errorS(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30.
  4. Find the interval (L, U) such that N% of the probability mass falls in the interval: use the table of zN values.

SLIDE 18

Central Limit Theorem

Consider a set of independent, identically distributed random variables Y1 . . . Yn, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean

    Ȳ ≡ (1/n) Σ_{i=1..n} Yi

Central Limit Theorem: As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n.
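
A quick simulation sketch of the theorem (assuming NumPy; the Uniform(0, 1) source distribution is an arbitrary choice, with µ = 0.5 and σ² = 1/12):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 10_000

# Each row is one experiment: the mean of n i.i.d. Uniform(0, 1) draws.
sample_means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

print(sample_means.mean())   # close to mu = 0.5
print(sample_means.var())    # close to sigma^2 / n = (1/12) / 50, about 0.0017
```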

SLIDE 19

Difference Between Hypotheses

Test h1 on sample S1, test h2 on sample S2.

  1. Pick the parameter to estimate:

    d ≡ errorD(h1) − errorD(h2)

  2. Choose an estimator:

    d̂ ≡ errorS1(h1) − errorS2(h2)

  3. Determine the probability distribution that governs the estimator:

    σ_d̂ ≈ √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )

  4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

    d̂ ± zN √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
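
A sketch of steps 2-4 (plain Python apart from the fixed zN default; the inputs are the two sample errors and sample sizes):

```python
import math

def diff_error_interval(e1, n1, e2, n2, z=1.96):
    # N% confidence interval for d = errorD(h1) - errorD(h2).
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

# Next slide's example values:
print(diff_error_interval(0.30, 100, 0.20, 100))   # about (-0.02, 0.22)
```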

SLIDE 20

Hypothesis Testing

P(errorD(h1) > errorD(h2)) = ?

Example:

  • |S1| = |S2| = 100
  • errorS1(h1) = 0.30
  • errorS2(h2) = 0.20
  • d̂ = 0.10, σ_d̂ ≈ 0.061

P(d̂ < µ_d̂ + 0.10) is the probability that d̂ does not overestimate d by more than 0.10. Setting zN · σ_d̂ = 0.10 gives zN = 1.64, and

    P(d̂ < µ_d̂ + 1.64 σ_d̂) = 0.95

I.e., reject the null hypothesis at the 0.05 level of significance.
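
The same one-sided test as a sketch (SciPy's Normal CDF supplies the probability):

```python
import math
from scipy.stats import norm

e1, n1, e2, n2 = 0.30, 100, 0.20, 100
d_hat = e1 - e2                                             # 0.10
sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)  # ~0.061

# P(d_hat does not overestimate d by more than 0.10): ~0.95
print(norm.cdf(d_hat / sigma))
```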

SLIDE 21

Paired t test to compare hA, hB

  1. Partition the data into k disjoint test sets T1, T2, . . . , Tk of equal size, where this size is at least 30.
  2. For i from 1 to k, do:

    δi ← errorTi(hA) − errorTi(hB)

  3. Return the value δ̄, where

    δ̄ ≡ (1/k) Σ_{i=1..k} δi

N% confidence interval estimate for δ:

    δ̄ ± t_{N,k−1} s_δ̄

    s_δ̄ ≡ √( (1/(k(k−1))) Σ_{i=1..k} (δi − δ̄)² )

Note that δi is approximately Normally distributed.
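
A sketch of the interval computation (assuming SciPy for the t quantile; deltas holds the per-test-set δi values):

```python
import math
from scipy import stats

def paired_t_interval(deltas, confidence=0.95):
    # N% confidence interval: delta_bar +/- t_{N, k-1} * s_delta_bar.
    k = len(deltas)
    d_bar = sum(deltas) / k
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t = stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1)
    return d_bar - t * s, d_bar + t * s
```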

SLIDE 22

Comparing learning algorithms LA and LB

What we'd like to estimate:

    E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

where L(S) is the hypothesis output by learner L using training set S, i.e., the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.

But, given limited data D0, what is a good estimator?

  • We could partition D0 into training set S0 and test set T0, and measure

    errorT0(LA(S0)) − errorT0(LB(S0))

  • Even better, repeat this many times and average the results (next slide).

SLIDE 23

Comparing learning algorithms LA and LB

  1. Partition data D0 into k disjoint test sets T1, T2, . . . , Tk of equal size, where this size is at least 30.
  2. For i from 1 to k, do: use Ti for the test set and the remaining data for the training set Si:

    • Si ← {D0 − Ti}
    • hA ← LA(Si)
    • hB ← LB(Si)
    • δi ← errorTi(hA) − errorTi(hB)

  3. Return the value δ̄, where

    δ̄ ≡ (1/k) Σ_{i=1..k} δi
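
A sketch of the whole loop (learner_a, learner_b, and error_fn are hypothetical stand-ins for LA, LB, and the test-error measurement):

```python
def compare_learners(learner_a, learner_b, error_fn, data, k):
    # Returns delta_bar, the mean per-fold difference in test error.
    folds = [data[i::k] for i in range(k)]   # k disjoint test sets T_i
    deltas = []
    for i, test_set in enumerate(folds):
        # S_i <- D0 - T_i
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_a = learner_a(train_set)           # h_A <- L_A(S_i)
        h_b = learner_b(train_set)           # h_B <- L_B(S_i)
        deltas.append(error_fn(h_a, test_set) - error_fn(h_b, test_set))
    return sum(deltas) / k
```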

SLIDE 24

Comparing learning algorithms LA and LB

Notice we'd like to use the paired t test on δ̄ to obtain a confidence interval.

But this is not really correct, because the training sets in this algorithm are not independent (they overlap!).

It is more correct to view the algorithm as producing an estimate of

    E_{S⊂D0}[errorD(LA(S)) − errorD(LB(S))]

instead of

    E_{S⊂D}[errorD(LA(S)) − errorD(LB(S))]

But even this approximation is better than no comparison.

SLIDE 25

More on t-tests

  • Good for comparing two learners
  • But not for multiple pairs
  • P(reject null hypothesis) grows rapidly with the number of pairs
  • Null hypothesis = learners perform equally

    αe ≃ 1 − (1 − αc)^m,   m = k(k − 1)/2

where

    αe = P(incorrectly rejecting "all means equal")
    αc = P(incorrectly rejecting the null hypothesis for one pair)
    k = number of learners
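
For instance (a quick check of the formula), with k = 5 learners and a per-pair αc = 0.05:

```python
k, alpha_c = 5, 0.05
m = k * (k - 1) // 2             # 10 pairwise comparisons
alpha_e = 1 - (1 - alpha_c) ** m
print(alpha_e)                   # ~0.40: a 40% chance of some spurious rejection
```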

SLIDE 26

Analysis of Variance (ANOVA)

  • P(means are not equal)
  • MS = Mean Square

    MSbetween = (1/dfb) Σ_j (x̄j − x̄)²

    MSwithin = (1/dfw) Σ_j Σ_k (xjk − x̄j)²

where

    j = number of groups (learners)
    k = number of trials per group
    x̄j = mean of the trials for group j (i.e., error)
    x̄ = mean of all trials for all groups
    dfb = degrees of freedom between groups, j − 1
    dfw = degrees of freedom within all groups, j(k − 1)

  • If there is no difference between groups, then MSwithin = MSbetween

    ANOVA: F = MSbetween / MSwithin

  • The F distribution maps (F, dfb, dfw) to the probability of rejecting the null hypothesis
  • Increased F leads to decreased P(means are equal)
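
A sketch of the test on per-trial error rates, using SciPy's one-way ANOVA routine rather than applying the formulas above by hand (the error lists are illustrative made-up numbers):

```python
from scipy.stats import f_oneway

# Hypothetical per-trial test errors for three learners.
errors_a = [0.30, 0.28, 0.33, 0.29]
errors_b = [0.22, 0.25, 0.21, 0.24]
errors_c = [0.31, 0.27, 0.30, 0.32]

f_stat, p_value = f_oneway(errors_a, errors_b, errors_c)
print(f_stat, p_value)   # large F, small p => reject "all means equal"
```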

SLIDE 27

Summary

  • Evaluating hypotheses
    – Sample error vs. true error
    – Confidence intervals for observed hypothesis error
    – Error estimators
    – Binomial distribution, Normal distribution, Central Limit Theorem
  • Comparing hypotheses and learners
    – Hypothesis testing
    – Paired t tests
    – Cross validation
    – Analysis of variance
