Significance Testing (Evaluation, session 6), CS6200: Information Retrieval

Feb 13, 2024

Statistical Significance

IR and other experimental sciences are concerned with measuring the effects of competing systems and deciding whether they are really different. For instance, “Does stemming improve my results enough that my search engine should use it?” Statistical hypothesis testing is a collection of principled methods for setting up these tests and making justified conclusions from their results.

Hypothesis Testing

In statistical hypothesis testing, we try to isolate the effect of a single change so we can decide whether it makes an impact. The test allows us to choose between the null hypothesis and an alternative hypothesis.

The hypotheses we’re testing:
• Null Hypothesis: what we believe by default, namely that the change did not improve performance.
• Alternative Hypothesis: the change improved performance.

The outcome of a hypothesis test does not tell us whether the alternative hypothesis is true. Instead, it tells us the probability that the null hypothesis could produce a “fake improvement” at least as extreme as the one observed in the data you’re testing.
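
In symbols, the two hypotheses and the quantity the test reports might be written as below. This is only a sketch under assumed notation that the slides do not use: \mu_A and \mu_B stand for the true mean effectiveness of the baseline and the changed system, T for the test statistic, and t_obs for its observed value.

```latex
\begin{align*}
H_0 &: \mu_B \le \mu_A  && \text{(the change did not improve performance)} \\
H_1 &: \mu_B > \mu_A    && \text{(the change improved performance)} \\
p   &= \Pr\left(T \ge t_{\mathrm{obs}} \mid H_0\right)
\end{align*}
```

Note that p is a statement about the data under the null hypothesis, not the probability that either hypothesis is true.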

Test Steps

1. Prepare your experiment carefully, with only one difference between the two systems: the change whose effect you wish to measure. Choose a significance level α, used to make your decision.
2. Run each system many times (e.g. on many different queries), evaluating each run (e.g. with average precision, AP).
3. Calculate a test statistic for each system based on the distributions of evaluation metrics.
4. Use a statistical significance test to compare the test statistics (one for each system). This will give you a p-value: the probability of the null hypothesis producing a difference at least this large.
5. If the p-value is less than α, reject the null hypothesis.

The probability that you will correctly reject the null hypothesis (that is, reject it when it is in fact false) using a particular statistical test is known as its power.
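
The steps above can be sketched end to end in a few lines of Python. The slides do not name a specific test at this point, so the sketch below swaps in a paired sign-flip randomization test as one possible choice; the function name `paired_randomization_test`, the value of `ALPHA`, and the per-query AP scores are all made up for illustration.

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """One-sided sign-flip randomization test on paired per-query scores.

    Estimates the probability that the null hypothesis (system B is no
    better than system A) produces a mean improvement at least as large
    as the observed one.
    """
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(trials):
        # Under the null, the sign of each per-query difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
    # Add-one smoothing avoids reporting an impossible p-value of exactly 0.
    return (at_least_as_large + 1) / (trials + 1)


# Step 1: one change between the systems, and a significance level.
ALPHA = 0.05  # assumed value, chosen before looking at the data

# Step 2: hypothetical AP scores for the same ten queries on each system.
baseline = [0.21, 0.35, 0.40, 0.12, 0.55, 0.30, 0.48, 0.26, 0.33, 0.19]
stemmed = [0.25, 0.41, 0.44, 0.15, 0.61, 0.36, 0.50, 0.31, 0.39, 0.24]

# Steps 3-4: test statistic (mean difference) and p-value.
p_value = paired_randomization_test(baseline, stemmed)

# Step 5: decide.
print(f"p = {p_value:.4f}; reject null: {p_value < ALPHA}")
```

Fixing the seed keeps the estimate reproducible; more trials give a tighter estimate of the p-value at the cost of run time.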

Error Types

Hypothesis testing involves balancing between two types of errors:
• Type I Errors, or false positives, occur when the null hypothesis is true, but you reject it.
• Type II Errors, or false negatives, occur when the null hypothesis is false, but you don’t reject it.

The probability of a type I error is α, the significance level. The probability of a type II error is β = (1 - power).
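
Both error rates can be estimated by simulation, which may make the definitions more concrete. The sketch below is an illustration under stated assumptions, not a method from the slides: per-query score differences are drawn from a normal distribution with standard deviation 0.1, the test is an exact one-sided sign test, and the names `sign_test_p` and `rejection_rate` are hypothetical.

```python
import random
from math import comb


def sign_test_p(diffs):
    """Exact one-sided sign test: P(at least this many positive differences)."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(d > 0 for d in nonzero)
    # Under the null, the number of positive signs is Binomial(n, 0.5).
    return sum(comb(n, i) for i in range(positives, n + 1)) / 2 ** n


def rejection_rate(true_gain, n_queries=50, alpha=0.05,
                   experiments=2000, seed=1):
    """Fraction of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        # Assumed model: per-query gains are normal around the true gain.
        diffs = [rng.gauss(true_gain, 0.1) for _ in range(n_queries)]
        if sign_test_p(diffs) < alpha:
            rejections += 1
    return rejections / experiments


# Null is true (no gain): every rejection is a Type I error.
type_i_rate = rejection_rate(true_gain=0.0)

# Null is false (real gain of 0.05): rejections are correct, so this is power.
power = rejection_rate(true_gain=0.05)
type_ii_rate = 1 - power

print(f"Type I rate: {type_i_rate:.3f}")
print(f"Power: {power:.3f}, Type II rate: {type_ii_rate:.3f}")
```

Lowering `n_queries` in the second call shows how power falls as the number of independent runs shrinks, while the Type I rate stays at or below α.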

What Can Go Wrong?

The power of a statistical test depends on:
• The number of independent runs (e.g. queries). In IR, we generally use 50 queries, but empirical studies suggest that 25 may be enough.
• Any bias in the experimental setup (are you using the wrong test collection?).
• Whether the true distribution of test statistic values matches the distribution assumed by your statistical test.

A common mistake is repeating a test until you get the p-value you want. Every repetition is another chance of a false positive, so the real probability of a Type I error climbs well above α and a “significant” result can no longer be trusted.
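
To put a rough number on the cost of repeating a test: if the null hypothesis is true and each repetition is treated as an independent test at level α, the chance of at least one false positive in k tries is 1 - (1 - α)^k. The sketch below just evaluates that formula; independence is an assumption that reruns on overlapping data would not satisfy exactly, and the function name is hypothetical.

```python
def chance_of_false_positive(alpha, repetitions):
    """P(at least one false positive) over independent tests at level alpha."""
    if not 0 <= alpha <= 1 or repetitions < 1:
        raise ValueError("need 0 <= alpha <= 1 and at least one repetition")
    return 1 - (1 - alpha) ** repetitions


for k in (1, 5, 10, 20):
    rate = chance_of_false_positive(0.05, k)
    print(f"{k:2d} repetitions -> {rate:.2f}")
```

With a single test the rate is just α; it grows with every extra attempt, which is why the number of tests run should be fixed before looking at any p-value.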

Wrapping Up

For a very clear and detailed explanation of the subtleties of statistical testing, see the excellent guide “Statistics Done Wrong,” at: http://www.statisticsdonewrong.com. In the next two sessions, we’ll look at two specific significance tests.
