


  1. Evaluating Hypotheses

     IEEE Expert, October 1996

  2. Evaluating Hypotheses

     • Sample error, true error
     • Confidence intervals for observed hypothesis error
     • Estimators
     • Binomial distribution, Normal distribution, Central Limit Theorem
     • Paired t tests
     • Comparing learning methods

  3. Evaluating Hypotheses and Learners

     Consider hypotheses h1 and h2 learned by learners L1 and L2.

     • How can we learn h and estimate its accuracy with limited data?
     • How well does the observed accuracy of h over a limited sample estimate its accuracy over unseen data?
     • If h1 outperforms h2 on the sample, will h1 outperform h2 in general?
     • Can we draw the same conclusion for L1 and L2?

  4. Two Definitions of Error

     The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

         error_D(h) ≡ Pr_{x∈D}[ f(x) ≠ h(x) ]

     The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:

         error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))

     where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

     How well does error_S(h) estimate error_D(h)?
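
A minimal sketch of the sample-error computation (the data and names here are illustrative; any paired lists of target and predicted labels work):

```python
def sample_error(targets, predictions):
    """error_S(h): fraction of examples in S that h misclassifies."""
    n = len(targets)
    mistakes = sum(1 for f_x, h_x in zip(targets, predictions) if f_x != h_x)
    return mistakes / n

# Illustrative data: h misclassifies 12 of 40 examples (cf. the example slide).
targets = [0] * 40
predictions = [1] * 12 + [0] * 28
print(sample_error(targets, predictions))  # 0.3
```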

  5. Problems Estimating Error

     1. Bias: if S is the training set, error_S(h) is optimistically biased:

            bias ≡ E[error_S(h)] − error_D(h)

        For an unbiased estimate, h and S must be chosen independently.

     2. Variance: even with an unbiased S, error_S(h) may still vary from error_D(h).

  6. Example

     Hypothesis h misclassifies 12 of the 40 examples in S:

         error_S(h) = 12/40 = 0.30

     What is error_D(h)?

  7. Estimators

     Experiment:
     1. Choose a sample S of size n according to distribution D.
     2. Measure error_S(h).

     error_S(h) is a random variable (i.e., the result of an experiment).
     error_S(h) is an unbiased estimator for error_D(h).

     Given an observed error_S(h), what can we conclude about error_D(h)?

  8. Confidence Intervals

     If
     • S contains n examples, drawn independently of h and of each other, and
     • n ≥ 30,
     then
     • with approximately 95% probability, error_D(h) lies in the interval

           error_S(h) ± 1.96 √( error_S(h)(1 − error_S(h)) / n )

  9. Confidence Intervals

     If
     • S contains n examples, drawn independently of h and of each other, and
     • n ≥ 30,
     then
     • with approximately N% probability, error_D(h) lies in the interval

           error_S(h) ± z_N √( error_S(h)(1 − error_S(h)) / n )

     where

         N%:   50%   68%   80%   90%   95%   98%   99%
         z_N:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
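
The interval is mechanical to compute. A small sketch (the function name is illustrative; the z-value table is the one from the slides):

```python
import math

Z_N = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def error_confidence_interval(error_s, n, confidence=95):
    """Approximate N% confidence interval for error_D(h); valid for n >= 30."""
    half_width = Z_N[confidence] * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# The running example: 12 errors on 40 examples, 95% confidence.
print(error_confidence_interval(12 / 40, 40))  # ≈ (0.16, 0.44)
```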

  10. error_S(h) is a Random Variable

      Rerun the experiment with a different randomly drawn S (of size n). The probability of observing r misclassified examples is

          P(r) = ( n! / (r!(n−r)!) ) · error_D(h)^r · (1 − error_D(h))^(n−r)

      [Figure: Binomial distribution for n = 40, p = 0.3]

  11. Binomial Probability Distribution

      [Figure: Binomial distribution for n = 40, p = 0.3]

          P(r) = ( n! / (r!(n−r)!) ) · p^r · (1 − p)^(n−r)

      P(r) is the probability of r heads in n coin flips, if p = Pr(heads).

      • Expected, or mean, value of X: E[X] ≡ Σ_{i=0}^{n} i·P(i) = np
      • Variance of X: Var(X) ≡ E[(X − E[X])²] = np(1 − p)
      • Standard deviation of X: σ_X ≡ √(E[(X − E[X])²]) = √(np(1 − p))
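
These quantities are easy to reproduce numerically; a short sketch using only the standard library:

```python
from math import comb, sqrt

def binomial_pmf(r, n, p):
    """P(r): probability of exactly r heads (or misclassifications) out of n."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 40, 0.3
print(n * p)                   # mean np = 12.0
print(sqrt(n * p * (1 - p)))   # standard deviation ≈ 2.90
print(binomial_pmf(12, n, p))  # ≈ 0.137, the peak visible in the figure
```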

  12. Normal Distribution Approximates Binomial

      error_S(h) follows a Binomial distribution, with
      • mean µ_{error_S(h)} = error_D(h)
      • standard deviation σ_{error_S(h)} = √( error_D(h)(1 − error_D(h)) / n )

      Approximate this by a Normal distribution with
      • mean µ_{error_S(h)} = error_D(h)
      • standard deviation σ_{error_S(h)} ≈ √( error_S(h)(1 − error_S(h)) / n )

  13. Normal Probability Distribution

      [Figure: Normal distribution with mean 0, standard deviation 1]

          p(x) = (1 / √(2πσ²)) · e^(−(1/2)((x − µ)/σ)²)

      The probability that X will fall into the interval (a, b) is given by ∫_a^b p(x) dx.

      • Expected, or mean, value of X: E[X] = µ
      • Variance of X: Var(X) = σ²
      • Standard deviation of X: σ_X = σ
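
A sketch of the density and the interval probability. The closed-form CDF via math.erf is a standard identity, not something the slides spell out:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density p(x) of the Normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma**2)

def normal_interval_prob(a, b, mu=0.0, sigma=1.0):
    """P(a < X < b), i.e. the integral of p(x) over (a, b)."""
    cdf = lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
    return cdf(b) - cdf(a)

print(normal_interval_prob(-1.28, 1.28))  # ≈ 0.80, matching the z_N table
```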

  14. Normal Probability Distribution

      [Figure: Normal distribution with mean 0, standard deviation 1]

      80% of the area (probability) lies in µ ± 1.28σ.
      N% of the area (probability) lies in µ ± z_N σ:

          N%:   50%   68%   80%   90%   95%   98%   99%
          z_N:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

  15. Confidence Intervals, More Correctly

      If
      • S contains n examples, drawn independently of h and of each other, and
      • n ≥ 30,
      then
      • with approximately 95% probability, error_S(h) lies in the interval

            error_D(h) ± 1.96 √( error_D(h)(1 − error_D(h)) / n )

        equivalently, error_D(h) lies in the interval

            error_S(h) ± 1.96 √( error_D(h)(1 − error_D(h)) / n )

        which is approximately

            error_S(h) ± 1.96 √( error_S(h)(1 − error_S(h)) / n )

  16. Two-Sided and One-Sided Bounds

      [Figure: two standard Normal densities illustrating two-sided vs. one-sided bounds]

      • If µ − z_N σ ≤ y ≤ µ + z_N σ with confidence N = 100(1 − α)%,
      • then −∞ ≤ y ≤ µ + z_N σ with confidence N = 100(1 − α/2)%, and µ − z_N σ ≤ y ≤ +∞ with confidence N = 100(1 − α/2)%.
      • Example: n = 40, r = 12
        – Two-sided, 95% confidence (α = 0.05): P(0.16 ≤ y ≤ 0.44) = 0.95
        – One-sided: P(y ≤ 0.44) = P(y ≥ 0.16) = 1 − α/2 = 0.975
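
The conversion from a two-sided to a one-sided confidence level is just arithmetic on α; a tiny sketch using the example's numbers:

```python
# Two-sided 95% interval (alpha = 0.05) from the n = 40, r = 12 example.
lower, upper, alpha = 0.16, 0.44, 0.05

one_sided = 1 - alpha / 2  # each one-sided bound holds with confidence 0.975
print(f"P(y <= {upper}) = {one_sided}")
print(f"P(y >= {lower}) = {one_sided}")
```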

  17. Calculating Confidence Intervals

      1. Pick the parameter p to estimate:
         • error_D(h)
      2. Choose an estimator:
         • error_S(h)
      3. Determine the probability distribution that governs the estimator:
         • error_S(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30
      4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
         • use the table of z_N values

  18. Central Limit Theorem

      Consider a set of independent, identically distributed random variables Y1 … Yn, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean

          Ȳ ≡ (1/n) Σ_{i=1}^{n} Y_i

      Central Limit Theorem. As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n.
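
The theorem is easy to see empirically. A small simulation sketch, assuming a Uniform(0, 1) base distribution purely for illustration:

```python
import random, statistics

random.seed(0)
n = 50                 # size of each sample
mu, var = 0.5, 1 / 12  # mean and variance of Uniform(0, 1)

# Draw many sample means of a decidedly non-Normal distribution.
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(10_000)]

print(statistics.fmean(means))     # ≈ mu = 0.5
print(statistics.variance(means))  # ≈ var / n ≈ 0.00167
```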

  19. Difference Between Hypotheses

      Test h1 on sample S1, test h2 on sample S2.

      1. Pick the parameter to estimate:

             d ≡ error_D(h1) − error_D(h2)

      2. Choose an estimator:

             d̂ ≡ error_{S1}(h1) − error_{S2}(h2)

      3. Determine the probability distribution that governs the estimator:

             σ_d̂ ≈ √( error_{S1}(h1)(1 − error_{S1}(h1))/n1 + error_{S2}(h2)(1 − error_{S2}(h2))/n2 )

      4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

             d̂ ± z_N √( error_{S1}(h1)(1 − error_{S1}(h1))/n1 + error_{S2}(h2)(1 − error_{S2}(h2))/n2 )
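
A direct transcription of these formulas (the function name is illustrative):

```python
import math

def difference_interval(e1, n1, e2, n2, z_n=1.96):
    """Approximate N% interval for d = error_D(h1) - error_D(h2)."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z_n * sigma, d_hat + z_n * sigma

# The numbers used on the next slide: |S1| = |S2| = 100, errors 0.30 and 0.20.
print(difference_interval(0.30, 100, 0.20, 100))  # d_hat = 0.10, sigma ≈ 0.061
```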

  20. Hypothesis Testing

      P(error_D(h1) > error_D(h2)) = ?

      • Example:
        ◦ |S1| = |S2| = 100
        ◦ error_{S1}(h1) = 0.30
        ◦ error_{S2}(h2) = 0.20
        ◦ d̂ = 0.10
        ◦ σ_d̂ = 0.061
      • P(d̂ < µ_d̂ + 0.10) = probability that d̂ does not overestimate d by more than 0.10
        ◦ z_N · σ_d̂ = 0.10
        ◦ z_N = 1.64
      • P(d̂ < µ_d̂ + 1.64 σ_d̂) = 0.95
      • i.e., reject the null hypothesis at the 0.05 level of significance
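
The same arithmetic in code; the erf-based Normal CDF is a standard identity, not part of the slides:

```python
import math

def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

d_hat, sigma = 0.10, 0.061
z = d_hat / sigma        # ≈ 1.64
print(z, normal_cdf(z))  # ≈ 1.64, ≈ 0.95
```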

  21. Paired t Test to Compare h_A, h_B

      1. Partition the data into k disjoint test sets T1, T2, …, Tk of equal size, where this size is at least 30.
      2. For i from 1 to k, do:

             δ_i ← error_{Ti}(h_A) − error_{Ti}(h_B)

      3. Return the value δ̄, where

             δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i

      N% confidence interval estimate for d:

          δ̄ ± t_{N,k−1} · s_δ̄

          s_δ̄ ≡ √( (1/(k(k−1))) Σ_{i=1}^{k} (δ_i − δ̄)² )

      Note that the δ_i are approximately Normally distributed.
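
A sketch of the interval computation; the per-fold differences and the t value (looked up in a t-table with k − 1 degrees of freedom) are inputs:

```python
import math, statistics

def paired_t_interval(deltas, t_value):
    """delta_bar ± t_{N,k-1} * s_delta_bar from per-fold error differences."""
    k = len(deltas)
    delta_bar = statistics.fmean(deltas)
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar - t_value * s, delta_bar + t_value * s

# Hypothetical differences over k = 5 folds; 2.776 is t for 95%, 4 dof.
deltas = [0.02, 0.05, -0.01, 0.03, 0.04]
print(paired_t_interval(deltas, t_value=2.776))
```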

  22. Comparing Learning Algorithms L_A and L_B

      What we'd like to estimate:

          E_{S⊂D}[ error_D(L_A(S)) − error_D(L_B(S)) ]

      where L(S) is the hypothesis output by learner L using training set S; i.e., the expected difference in true error between the hypotheses output by learners L_A and L_B when trained on randomly selected training sets S drawn according to distribution D.

      But, given limited data D0, what is a good estimator?
      • We could partition D0 into a training set S0 and a test set T0, and measure error_{T0}(L_A(S0)) − error_{T0}(L_B(S0)).
      • Even better, repeat this many times and average the results (next slide).

  23. Comparing Learning Algorithms L_A and L_B

      1. Partition the data D0 into k disjoint test sets T1, T2, …, Tk of equal size, where this size is at least 30.
      2. For i from 1 to k, use Ti for the test set and the remaining data for the training set Si:
         • Si ← D0 − Ti
         • h_A ← L_A(Si)
         • h_B ← L_B(Si)
         • δ_i ← error_{Ti}(h_A) − error_{Ti}(h_B)
      3. Return the value δ̄, where

             δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i
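
A compact sketch of this k-fold procedure. The interfaces are assumptions: each learner is taken to be a function from a training set to a hypothesis, and a hypothesis maps an instance to a label:

```python
import statistics

def compare_learners(learner_a, learner_b, folds):
    """Return delta_bar and the per-fold differences delta_i.

    folds: list of k disjoint test sets, each a list of (x, y) pairs.
    """
    deltas = []
    for i, test_set in enumerate(folds):
        # S_i <- D_0 - T_i: train on everything outside the i-th fold.
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_a, h_b = learner_a(train_set), learner_b(train_set)
        error = lambda h: sum(h(x) != y for x, y in test_set) / len(test_set)
        deltas.append(error(h_a) - error(h_b))
    return statistics.fmean(deltas), deltas
```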
