Safe Testing, Peter Grünwald, Centrum Wiskunde & Informatica

Peter Grünwald, December 2016. Safe Testing, talk at WADAPT 2016.

Safe Testing
Peter Grünwald
Centrum Wiskunde & Informatica, Amsterdam
Mathematisch Instituut, Universiteit Leiden
Partly based on joint work with Stéphanie van der Pas, Rianne de Heide, Wouter Koolen, Allard Hendriksen

Reproducibility Crisis
• Slate, Sep 10th: yet another classic finding in psychology, that you can smile your way to happiness, just blew up…
• Cover story of the Economist (2013), Wall Street Journal, Science (2012)

J. Berger (2003, IMS Medallion Lecture): Could Neyman, Fisher and Jeffreys have agreed on testing?
• Jerzy Neyman: alternative exists, "inductive behaviour"
• Sir Ronald Fisher: test statistic rather than alternative, p-value indicates "unlikeliness"
• Sir Harold Jeffreys: Bayesian, alternative exists, inductive behaviour; compression interpretation

80 years and still unresolved...
• Standard method is still p-value-based null hypothesis significance testing, an amalgam of Neyman-Pearson's and Fisher's 1930s methods
• everybody in psychology and medical sciences does it...
• ...most statisticians agree it's not o.k....
• ...but still can't agree on what to do instead!

P-value Problem #1: Combining Independent Tests
• Suppose two different research groups tested the same new medication. How to combine their test results?
• You can't multiply p-values! This will (wildly) overestimate evidence against the null hypothesis!
• Different valid p-value combination methods exist (Fisher's; Stouffer's) but give different results
• We will present a method in which evidences can be safely multiplied!

P-value Problem #2: Combining Dependent Tests
• Suppose research group A tests medication, gets 'almost significant' result.
• ...whence group B tries again on new data. How to combine their test results?
• Now Fisher's and Stouffer's methods don't work anymore: need complicated methods!
• In our method, despite dependence, evidences can still be safely multiplied
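To make P-value Problem #1 concrete, here is a minimal sketch comparing the naive product of two p-values with Fisher's and Stouffer's combination methods. The input p-values (0.1 and 0.1) are made up for illustration, the p-values are assumed to be one-sided and independent, and the helper names are hypothetical rather than taken from the talk.

```python
from math import exp, factorial, log, prod, sqrt
from statistics import NormalDist


def chi2_sf_even_df(x, df):
    """Survival function of a chi-square variable with an even number of
    degrees of freedom, using the closed-form finite sum."""
    half = df // 2
    return exp(-x / 2) * sum((x / 2) ** j / factorial(j) for j in range(half))


def fisher_combined(p_values):
    """Fisher's method: -2 * sum(log p_i) is chi-square with 2k df under H0."""
    statistic = -2 * sum(log(p) for p in p_values)
    return chi2_sf_even_df(statistic, 2 * len(p_values))


def stouffer_combined(p_values):
    """Stouffer's method (unweighted): average the z-scores, then convert back."""
    std = NormalDist()
    z = sum(std.inv_cdf(1 - p) for p in p_values) / sqrt(len(p_values))
    return 1 - std.cdf(z)


def naive_product(p_values):
    """Multiplying p-values directly: NOT a valid p-value."""
    return prod(p_values)


if __name__ == "__main__":
    p_values = [0.1, 0.1]  # illustrative inputs only
    print("naive product:", naive_product(p_values))
    print("Fisher:       ", fisher_combined(p_values))
    print("Stouffer:     ", stouffer_combined(p_values))
```

In this example the naive product is several times smaller than either valid combination, and the two valid methods disagree with each other, which is the point the slide makes.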

P-value Problem #2b: Extending Your Test
• Suppose research group A tests medication, gets 'almost significant' result.
• Sometimes group A can't resist testing a few more subjects themselves...
• In a recent survey, 55% of psychologists admitted to having succumbed to this practice [L. John et al., Psychological Science, 23(5), 2012]
• But isn't this just cheating?
• Not clear: what if you submit a paper and the referee asks you to test a couple more subjects? Should you refuse because it invalidates your p-values!?
• In our method, despite dependence, evidences can still be safely multiplied

Should we be Bayesian?
• These and several other problems with p-values attracted a lot of attention in the 1960s and...
• ...caused several people to become Bayesian
• and right now there's a Bayesian revolution in psychology...
• As we will see though, Bayesian methods don't fully resolve the issues at hand
• We propose a new method that does: Safe Testing
  • for simple $H_0$, all Bayes factor tests are also Safe Tests
  • for composite $H_0$, Bayes factor tests are usually not safe (t-test, independence testing)

Safe (i.e. adaptive) Testing
• We aim for a 'safe' or adaptive method that better suits the real-life research world where obviously either you yourself or another research group wants to, and will, study more data given preliminary test results that are promising but inconclusive!

Earlier Work
• The simple $H_0$ case (and related developments) was essentially covered in work by Volodya Vovk and collaborators (1993, 2001, 2011, ...)
  • see esp. Shafer, Shen, Vereshchagin, Vovk: Test Martingales, Bayes Factors and p-values, 2011
• Also Jim Berger and collaborators have earlier ideas in this direction (1994, 2001, ...)
• Both Berger and Vovk inspired by the great Jack Kiefer
• The only thing that is really radically new here is the treatment of composite $H_0$ and its relation to reverse-information projection
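The "Extending Your Test" problem can be illustrated with a small simulation. The sketch below assumes a two-sided z-test with known standard deviation 1, a first batch of 20 subjects, and an extension by 20 more subjects whenever the first p-value is 'almost significant' (between 0.05 and 0.15). All of these settings, the seed, and the function names are illustrative assumptions, not taken from the talk. Data are generated under the null hypothesis, so every rejection is a false positive.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(samples):
    """Two-sided z-test p-value for H0: mean 0, with known standard deviation 1."""
    z = sum(samples) / sqrt(len(samples))
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def run_study(rng, n_first=20, n_extra=20, alpha=0.05, almost=0.15):
    """Simulate one study under H0; return True if H0 is (falsely) rejected."""
    data = [rng.gauss(0.0, 1.0) for _ in range(n_first)]
    p = two_sided_p(data)
    if p < alpha:
        return True  # significant at the first look
    if p < almost:
        # 'Almost significant': test a few more subjects and re-test on all data.
        data += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
        return two_sided_p(data) < alpha
    return False


def false_positive_rate(n_studies=40000, seed=0):
    """Fraction of simulated null studies that end in a rejection."""
    rng = random.Random(seed)
    return sum(run_study(rng) for _ in range(n_studies)) / n_studies


if __name__ == "__main__":
    print("nominal level: 0.05")
    print("simulated false positive rate:", false_positive_rate())
```

Under these assumptions the first look alone has error probability 0.05, and every rejection after an extension adds to it, so the overall rate ends up above the nominal level.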

Menu
1. Some of the problems with p-values
2. Safe Testing
   • ...solves the adaptivity problem
   • gambling interpretation
3. Safe Testing, simple (singleton) $H_0$
   • relation to Bayes
   • relation to MDL (data compression)
4. Safe Testing, Composite $H_0$
   • Magic: RIPr (Reverse Information Projection)
   • Examples: Safe t-Test, Safe Independence Test

Null Hypothesis Testing: simple $H_0$
• Let $H_0 = \{ P_\theta \mid \theta \in \Theta_0 \}$ represent the null hypothesis
• For simplicity, assume data $X_1, X_2, \ldots$ are i.i.d. under all $P \in H_0$.
• Let $H_1 = \{ P_\theta \mid \theta \in \Theta_1 \}$ represent alternative hypothesis
• Example: testing whether a coin is fair
  Under $P_\theta$, data are i.i.d. Bernoulli($\theta$)
  $\Theta_0 = \{ \tfrac{1}{2} \}$, $\Theta_1 = [0,1] \setminus \{ \tfrac{1}{2} \}$
  Standard test would measure frequency of 1s

Null Hypothesis Testing: composite $H_0$
• Let $H_0 = \{ P_\theta \mid \theta \in \Theta_0 \}$ represent the null hypothesis
• For simplicity, assume data $X_1, X_2, \ldots$ are i.i.d. under all $P \in H_0$.
• Let $H_1 = \{ P_\theta \mid \theta \in \Theta_1 \}$ represent alternative hypothesis
• Example: t-test (most used test world-wide)
  $H_0$: $X_i \sim$ i.i.d. $N(0, \sigma^2)$ vs. $H_1$: $X_i \sim$ i.i.d. $N(\mu, \sigma^2)$ for some $\mu \neq 0$
  $\sigma^2$ unknown ('nuisance') parameter
  $H_0 = \{ P_\sigma \mid \sigma \in (0, \infty) \}$
  $H_1 = \{ P_{\sigma,\mu} \mid \sigma \in (0, \infty), \mu \in \mathbb{R} \setminus \{0\} \}$
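For the fair-coin example, the "standard test" based on the frequency of 1s can be sketched as an exact two-sided binomial test. This is one common choice of standard test, assumed here for illustration; the function name and the example counts are hypothetical.

```python
from math import comb


def fair_coin_p_value(n, k):
    """Exact two-sided p-value for H0: theta = 1/2 after observing k ones
    in n flips. Sums the null probabilities of all counts that are at most
    as likely as the observed count."""
    observed = comb(n, k)
    # Under theta = 1/2 each count j has probability comb(n, j) / 2**n,
    # so comparing binomial coefficients avoids floating-point ties.
    as_extreme = sum(comb(n, j) for j in range(n + 1) if comb(n, j) <= observed)
    return as_extreme / 2 ** n


if __name__ == "__main__":
    print("15 ones in 20 flips:", fair_coin_p_value(20, 15))
    print("10 ones in 20 flips:", fair_coin_p_value(20, 10))
```

Because the null hypothesis is a single distribution, the p-value can be computed exactly; for the composite t-test null above, no such single reference distribution for the raw data exists, which is why it needs separate treatment.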

Safe Test: General Definition
• Let $H_0 = \{ P_\theta \mid \theta \in \Theta_0 \}$ represent the null hypothesis
• Assume data $X_1, X_2, \ldots$ are i.i.d. under all $P \in H_0$.
• Let $H_1 = \{ P_\theta \mid \theta \in \Theta_1 \}$ represent alternative hypothesis
• A test is a function $M$ of the observed data $X^n = (X_1, \ldots, X_n)$, taking nonnegative values
• A safe test for sample size $n$ is a test such that for all $P_0 \in H_0$, we have
  $\mathbf{E}_{P_0}[ M(X^n) ] \leq 1$

General Definition
• Let $T$ be a positive-integer valued random variable
• A safe test for stopping time $T$ is a test such that for all $P_0 \in H_0$, we have
  $\mathbf{E}_{P_0}[ M(X^T) ] \leq 1$

First Interpretation: p-values
• Proposition: Let $M$ be a safe test. Then $M^{-1}(X^T)$ is a nonstrict p-value, i.e. a p-value with wiggle room:
• for all $P \in H_0$, all $0 \leq \alpha \leq 1$,
  $P( M^{-1}(X^T) \leq \alpha ) \leq \alpha$
• Proof: just Markov's inequality!
• Hence if we reject $H_0$ iff $M^{-1}(X^T) < 0.05$, then we have Type-I Error Bound of 0.05

Safe Tests are Safe ('Adaptive')
• Suppose we observe data $(X_1, Y_1), (X_2, Y_2), \ldots$
• $Y_i$: side information, independent of $X_i$'s
• Let $M_1, M_2, \ldots, M_k$ be an arbitrarily large collection of (potentially identical) safe tests for sample sizes $n_1, n_2, \ldots, n_k$ respectively.
• Suppose we first perform test $M_1$.
• If outcome is in certain range (e.g. promising but not conclusive) and $Y^{n_1}$ has certain values (e.g. 'boss has money to collect more data') then we perform test $M_2$; otherwise we stop.
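The definition and the adaptivity claim can be checked exactly on the fair-coin example. The sketch below assumes that a safe test is a nonnegative function M of the data with expectation at most 1 under every null distribution, and uses as M the Bayes factor of a uniform prior on the coin bias against the fair coin. That particular choice of M, the sample sizes (10 and 10), the continuation rule (continue when 1 < M_1 < 20), and all function names are illustrative assumptions. Exact rational arithmetic is used so that the checks are not affected by rounding.

```python
from fractions import Fraction
from math import comb


def null_prob(n, k):
    """Probability of seeing k ones in n flips of a fair coin."""
    return Fraction(comb(n, k), 2 ** n)


def bayes_factor(n, k):
    """M(x^n) for a sequence with k ones: uniform-prior marginal likelihood,
    1 / ((n + 1) * C(n, k)), divided by the fair-coin likelihood 2**-n."""
    return Fraction(2 ** n, (n + 1) * comb(n, k))


def expected_value(n):
    """E[M(X^n)] under the fair coin, by exact enumeration over counts."""
    return sum(null_prob(n, k) * bayes_factor(n, k) for k in range(n + 1))


def type_one_error(n, alpha):
    """P(1 / M(X^n) <= alpha) under the fair coin."""
    return sum(null_prob(n, k) for k in range(n + 1)
               if bayes_factor(n, k) >= 1 / alpha)


def optional_continuation(n1, n2, alpha):
    """Run M_1; if the outcome is promising but not conclusive, run M_2 on
    fresh data and multiply. Returns (expected final value, Type-I error)."""
    threshold = 1 / alpha
    expectation = Fraction(0)
    error = Fraction(0)
    for k1 in range(n1 + 1):
        m1 = bayes_factor(n1, k1)
        if 1 < m1 < threshold:  # promising but inconclusive: collect more data
            for k2 in range(n2 + 1):
                weight = null_prob(n1, k1) * null_prob(n2, k2)
                final = m1 * bayes_factor(n2, k2)
                expectation += weight * final
                error += weight if final >= threshold else 0
        else:  # stop after the first test
            expectation += null_prob(n1, k1) * m1
            error += null_prob(n1, k1) if m1 >= threshold else 0
    return expectation, error


if __name__ == "__main__":
    alpha = Fraction(1, 20)
    print("E[M] for n = 10:", expected_value(10))
    print("Type-I error for n = 10:", float(type_one_error(10, alpha)))
    expectation, error = optional_continuation(10, 10, alpha)
    print("E[final] with optional continuation:", expectation)
    print("Type-I error with optional continuation:", float(error))
```

In this setting the expectation stays at most 1 even though the decision to continue depends on the first outcome, so Markov's inequality still bounds the Type-I error by 0.05 for the multiplied result.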
