Significance Testing Evaluation, session 6 CS6200: Information Retrieval

Statistical Significance
IR and other experimental sciences are concerned with measuring the effects of competing systems and deciding whether they are really different. For instance, “Does stemming improve my results enough that my search engine should use it?” Statistical hypothesis testing is a collection of principled methods for setting up these tests and drawing justified conclusions from their results.

Hypothesis Testing
In statistical hypothesis testing, we try to isolate the effect of a single change so we can decide whether it makes an impact. The test allows us to choose between the null hypothesis and an alternative hypothesis.
The hypotheses we’re testing:
• Null Hypothesis: what we believe by default, namely that the change did not improve performance.
• Alternative Hypothesis: the change improved performance.
The outcome of a hypothesis test does not tell us whether the alternative hypothesis is true. Instead, it tells us the probability that the null hypothesis could produce a “fake improvement” at least as extreme as the data you’re testing.

Test Steps
1. Prepare your experiment carefully, with only one difference between the two systems: the change whose effect you wish to measure. Choose a significance level α, used to make your decision.
2. Run each system many times (e.g. on many different queries), evaluating each run (e.g. with AP).
3. Calculate a test statistic for each system based on the distributions of evaluation metrics.
4. Use a statistical significance test to compare the test statistics (one for each system). This will give you a p-value: the probability of the null hypothesis producing a difference at least this large.
5. If the p-value is less than α, reject the null hypothesis.
The probability that you will correctly reject the null hypothesis (that is, reject it when it is actually false) using a particular statistical test is known as its power.
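The steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the AP values are made up, and the significance test is a sign-flip randomization test, which is not one of the tests covered in these slides. It was chosen because it computes the p-value directly from its definition: the share of outcomes the null hypothesis could produce that are at least as large as the observed difference.

```python
from itertools import product
from statistics import mean

ALPHA = 0.05  # step 1: significance level, fixed before looking at results


def sign_flip_p_value(baseline, changed):
    """One-sided p-value from an exact sign-flip randomization test.

    Under the null hypothesis the change has no effect, so each per-query
    difference is equally likely to be positive or negative. The p-value is
    the share of sign assignments whose mean difference is at least as
    large as the observed mean difference.
    """
    diffs = [c - b for b, c in zip(baseline, changed)]
    observed = mean(diffs)  # step 3: the test statistic
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        # Small tolerance so the observed assignment always counts itself.
        if mean(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / total


# Step 2: hypothetical per-query AP values for the two systems (same queries).
baseline_ap = [0.31, 0.42, 0.25, 0.50, 0.38, 0.44, 0.29, 0.35]
changed_ap = [0.36, 0.45, 0.31, 0.52, 0.43, 0.49, 0.33, 0.41]

p_value = sign_flip_p_value(baseline_ap, changed_ap)  # step 4
reject_null = p_value < ALPHA                         # step 5
print(f"p = {p_value:.4f}, reject null: {reject_null}")
```

The exact enumeration visits 2^n sign assignments, so it is only practical for small n; with 50 queries one would sample assignments at random instead.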

Error Types
Hypothesis testing involves balancing between two types of errors:
• Type I Errors, or false positives, occur when the null hypothesis is true, but you reject it.
• Type II Errors, or false negatives, occur when the null hypothesis is false, but you don’t reject it.
The probability of a type I error is α, the significance level. The probability of a type II error is β = (1 - power).
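Both error rates can be estimated by simulation. The sketch below is an illustration, not part of the original slides: it assumes normally distributed scores with a known standard deviation of 1 and uses a one-sided z-test, because that needs only the Python standard library. The sample size, effect size, and trial count are arbitrary choices.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05
STANDARD_NORMAL = NormalDist()


def rejects_null(sample, alpha=ALPHA):
    """One-sided z-test of 'true mean is 0' against 'true mean > 0'.

    Assumes the observations have a known standard deviation of 1.
    """
    z = mean(sample) * len(sample) ** 0.5
    p_value = 1.0 - STANDARD_NORMAL.cdf(z)
    return p_value < alpha


def rejection_rate(true_mean, n=25, trials=4000, seed=0):
    """Fraction of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    rejections = sum(
        rejects_null([rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return rejections / trials


# Null hypothesis true: every rejection is a Type I error (rate near alpha).
type_i_rate = rejection_rate(true_mean=0.0)

# Null hypothesis false: rejections are correct, so this rate is the power.
power = rejection_rate(true_mean=0.5)
type_ii_rate = 1.0 - power  # beta

print(f"Type I rate ~ {type_i_rate:.3f}, power ~ {power:.3f}, "
      f"Type II rate ~ {type_ii_rate:.3f}")
```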

What Can Go Wrong?
The power of a statistical test depends on:
• The number of independent runs (e.g. queries). In IR, we generally use 50 queries, but empirical studies suggest that 25 may be enough.
• Any bias in the experimental setup (are you using the wrong test collection?).
• Whether the true distribution of test statistic values matches the distribution assumed by your statistical test.
A common mistake is repeating a test until you get the p-value you want. Repeating a test this way inflates the probability of a Type I error well beyond α, so a “significant” result obtained after several attempts is much weaker evidence than it appears.
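To see why repeating a test until it passes is a problem, consider how quickly false positives accumulate. The sketch below assumes the null hypothesis is true and that the attempts are independent, which is a simplification; the function name is made up for illustration.

```python
def prob_any_false_positive(alpha, attempts):
    """Probability that at least one of `attempts` independent tests
    rejects a true null hypothesis at significance level `alpha`."""
    return 1.0 - (1.0 - alpha) ** attempts


for attempts in (1, 5, 10, 20):
    chance = prob_any_false_positive(0.05, attempts)
    print(f"{attempts:2d} attempts: {chance:.2f} chance of a false positive")
```

Under these assumptions, fourteen attempts at α = 0.05 are already more likely than not to produce at least one spurious "significant" result.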

Wrapping Up
For a very clear and detailed explanation of the subtleties of statistical testing, see the excellent guide “Statistics Done Wrong,” at: http://www.statisticsdonewrong.com. In the next two sessions, we’ll look at two specific significance tests.

T-tests Evaluation, session 7 CS6200: Information Retrieval

T-Tests
There are many types of T-Tests, but here we’ll focus on two:
• One-sample tests have a single distribution of test statistics, and compare its mean to some pre-determined value μ.
• Paired-sample tests compare the means of two systems on the same queries.
Each comes in two flavors:
• One-tailed tests ask whether the mean is > μ or < μ, but not both (or whether the mean of one group is greater/less than the mean of the other).
• Two-tailed tests ask whether the mean = μ (or whether the means of the two samples are equal).

One-sample T-tests
Suppose you were developing a new type of IR system for your company, and your management decided that you can release it if its precision is above 75%. To check this, run your system against 50 queries and record the mean of the precision values. Then calculate the t-value and p-value that correspond to your vector of precision values.
With $x$ the vector of precision values, $n$ its length, $s$ its sample standard deviation, and $\mu$ the target value:
$\bar{x} := \frac{1}{n} \sum_{i=1}^{n} x_i$
$t := \frac{\bar{x} - \mu}{s / \sqrt{n}}$
$p := \Pr(T > t)$
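A minimal sketch of this calculation using only the Python standard library follows. The precision values are invented for illustration, and the tail probability of the t distribution (with n - 1 degrees of freedom) is approximated by numerical integration, since the standard library has no built-in t distribution; a statistics package would normally be used instead.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate Pr(T > t) with Simpson's rule (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |t|
    # The density is symmetric about 0, so half the mass lies on each side.
    return 0.5 - area if t >= 0 else 0.5 + area


def one_sample_t_test(xs, mu):
    """Return (t, one-tailed p) for the alternative 'true mean > mu'."""
    n = len(xs)
    t = (mean(xs) - mu) / (stdev(xs) / math.sqrt(n))
    return t, t_upper_tail(t, n - 1)


# Hypothetical per-query precision values; mu is the 75% release target.
precision = [0.80, 0.78, 0.74, 0.82, 0.76, 0.79, 0.81, 0.77]
t_value, p_value = one_sample_t_test(precision, mu=0.75)
print(f"t = {t_value:.3f}, one-tailed p = {p_value:.4f}")
```

Because the t distribution is symmetric, the corresponding two-tailed p-value is twice `t_upper_tail(abs(t), df)`.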

Example: One-tailed T-test
[Worked example; the sample vector and numeric values on this slide were lost in extraction.] The slide substitutes the values of $\mu$, $n$, $\bar{x}$ and $s$ into $t := \frac{\bar{x} - \mu}{s / \sqrt{n}}$ and reports the one-tailed p-value $p := \Pr(T > t)$.

Example: Two-tailed T-test
[Worked example; the sample vector and numeric values on this slide were lost in extraction.] The t-value is computed exactly as in the one-tailed case, $t := \frac{\bar{x} - \mu}{s / \sqrt{n}}$, but the p-value now counts both tails: $p := \Pr(|T| > |t|)$. Only the p-value changes.

Paired-Sample T-tests
Suppose you have runs from two different IR systems: a baseline run using a standard implementation, and a test run using the changes you’re testing. You want to know whether your changes outperform the baseline. To test this, run both systems on the same 50 queries using the same document collections and compare the difference in AP values per query.
With $a$ and $b$ the per-query AP vectors of the two runs and $n$ the number of queries:
$d_i := a_i - b_i$
$\bar{d} := \frac{1}{n} \sum_{i=1}^{n} d_i$
$s_d :=$ sample standard deviation of the differences $d_i$
$t := \frac{\bar{d}}{s_d / \sqrt{n}}$
$p := \Pr(T > t)$
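The paired t statistic is simply a one-sample t statistic computed on the per-query differences, tested against zero. A small sketch with made-up AP values (the function name and data are illustrative assumptions) shows the computation. It stops at the t-value; the p-value would then be read from a t distribution with n - 1 degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(test_run, baseline_run):
    """Return (t, degrees of freedom) for paired per-query scores."""
    if len(test_run) != len(baseline_run):
        raise ValueError("both runs must cover the same queries")
    diffs = [a - b for a, b in zip(test_run, baseline_run)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical AP values for four queries (a real test would use more).
test_ap = [0.5, 0.6, 0.7, 0.8]
baseline_ap = [0.4, 0.4, 0.6, 0.7]

t_value, df = paired_t_statistic(test_ap, baseline_ap)
print(f"t = {t_value:.2f} with {df} degrees of freedom")
```

Pairing matters because it removes per-query difficulty from the comparison: only the differences, not the raw scores, enter the statistic.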

Example: Paired-Sample T-test
[Worked example; the two score vectors and numeric values on this slide were lost in extraction.] The slide computes the mean difference $\bar{d}$ and the standard error $s_d / \sqrt{n}$, forms $t := \frac{\bar{d}}{s_d / \sqrt{n}}$, and reports the resulting p-value.

Wrapping Up
It’s easy to glance at the data, see a bunch of bigger numbers, and conclude that your new system is working. You’re often fooling yourself when you do this. To really conclude that your new system is working, you need enough of the values to be “significantly” larger than the baseline values. A t-test will tell you whether the difference is big enough. Next, we’ll see what we can do if we don’t want to assume that our data are normally distributed.

Wilcoxon Signed Ranks Test Evaluation, session 8 CS6200: Information Retrieval
