Magnitude-based Inference: A Statistical Review

August 2014 A.H. Welsh & E.J. Knight
“Magnitude-based Inference”: A Statistical Review
Alan Welsh and Emma Knight
The Australian National University
The Australian Institute of Sport

xParallelGroupsTrial.xls

Comparing change in two groups
Compare the Post1 - Pre2 measurements for the Control group with the Post1 - Pre2 measurements for the Exptal group to see whether there is a treatment effect.

Comparing two means
• Assume all 40 Post1 - Pre2 measurements are independent.
• The Post1 - Pre2 measurements for the Control group and the Post1 - Pre2 measurements for the Exptal group are approximately normally distributed.
• The problem is to make inferences about the effect of the treatment on a typical (randomly chosen) individual; this effect is summarized by the difference in the means of the separate (normal) populations represented by the experimental and control athletes.
• For simplicity, assume throughout that positive values of the Exptal population mean - Control population mean represent a positive or beneficial effect.
• The two normal populations are allowed to have different variances; this is called the Behrens-Fisher problem.

xParallelGroupsTrial.xls: Results

According to the papers ...: Confidence Intervals
Compute a standard approximate Student t confidence interval (default level: 90%) for the difference in population means.
Specify the smallest meaningful positive effect δ > 0; this defines three regions on the real line:
• the “negative or harmful” region (−∞, −δ),
• the “trivial or no effect” region [−δ, δ],
• the “positive or beneficial” region (δ, ∞).
The confidence interval is classified by the extent of its overlap with these three regions into one of the four categories “Beneficial”, “Trivial”, “Harmful” or “Unclear”, where the last category is used for confidence intervals that do not belong to any of the other categories.
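The slides do not write out the interval itself. One common reading of a “standard approximate Student t confidence interval” in the Behrens-Fisher setting is Welch's interval with Satterthwaite degrees of freedom; the sketch below assumes that reading. The symbols are notation introduced here, not taken from the slides: x̄_k, s_k and n_k are the sample mean, standard deviation and size of group k, with group 1 the Control group and group 2 the Exptal group, and t_{ν, 0.95} is the 0.95 quantile of the t distribution (giving the default 90% level).

```latex
\[
(\bar{x}_2 - \bar{x}_1) \;\pm\; t_{\nu,\,0.95}\,
\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},
\qquad
\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
           {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
\]
```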

For δ = 4.41, the xParallelGroupsTrial.xls data produce the third confidence interval: not significant but possibly beneficial.

xParallelGroupsTrial.xls: Classical Results

“It’s all in the spreadsheets ...”

“Chances” and “Qualitative Probabilities”
• p_b: chance of a “substantially positive (+ve) or beneficial” value
• 1 − p_b − p_h: chance of a “trivial” value
• p_h: chance of a “substantially negative (-ve) or harmful” value

xParallelGroupsTrial.xls: “Magnitude-based Inference” Results

“Clinical Inference” and “Mechanistic Inference”
Classify p_b and p_h into one of four categories:

             p_h small     p_h large
p_b small    trivial       harmful
p_b large    beneficial    unclear

Qualify the classifications “beneficial”, “harmful” and “trivial” by the corresponding classifications of p_b, p_h and 1 − p_b − p_h.
“Clinical inference” distinguishes positive and negative values; it needs thresholds for the “minimum chance of benefit” (default: η_b = 0.25) and the “maximum risk of harm” (default: η_h = 0.001).
“Mechanistic inference” applies when there is no direct clinical or practical application and positive and negative values represent equally important effects; it needs a single threshold (default: α = 0.1, obtained by setting η_b = η_h = 0.05).
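The two-by-two classification can be sketched in a few lines of Python. This is an illustration rather than the spreadsheet's implementation: the function name `classify` is hypothetical, and treating “large” as strictly greater than the threshold is an assumption, since the slides do not say how ties are handled.

```python
def classify(p_b, p_h, eta_b, eta_h):
    """Classify (p_b, p_h) using the two-by-two table.

    Assumption: "large" means strictly above the threshold
    (p_b > eta_b, p_h > eta_h); the slides do not specify ties.
    """
    benefit_large = p_b > eta_b
    harm_large = p_h > eta_h
    if benefit_large and harm_large:
        return "unclear"
    if benefit_large:
        return "beneficial"
    if harm_large:
        return "harmful"
    return "trivial"


# Hypothetical probabilities, chosen only to illustrate the role of eta_h.
p_b, p_h = 0.60, 0.03
print(classify(p_b, p_h, eta_b=0.05, eta_h=0.05))    # mechanistic thresholds
print(classify(p_b, p_h, eta_b=0.25, eta_h=0.001))   # clinical thresholds as stated above
```

With these illustrative inputs the same pair (p_b, p_h) lands in different cells depending only on η_h, which is why the threshold on the risk of harm matters so much.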

A graphical representation
ANIMATION 1: Constructing the ternary diagram to interpret “magnitude-based inference” and show the effect of changing the thresholds η_b and η_h

Interpretation
The “chance of benefit” p_b and “risk of harm” p_h cannot be derived as frequentist probabilities from the standard confidence interval; they can be derived from a Bayesian credibility interval if we switch to a Bayesian framework.
We can derive p_b and p_h as frequentist p-values. For δ ≥ 0:
• p_b is the one-sided p-value for testing the null hypothesis that µ_2 − µ_1 = δ against the alternative that µ_2 − µ_1 < δ;
• p_h is the one-sided p-value for testing the null hypothesis that µ_2 − µ_1 = −δ against the alternative that µ_2 − µ_1 > −δ;
• p, the usual p-value, is for the two-sided test of the null hypothesis that µ_2 − µ_1 = 0 against the alternative that µ_2 − µ_1 ≠ 0.
When δ = 0, p_b = 1 − p/2 and p_h = p/2, so small p corresponds to large p_b and small p_h.
For p in the range 0.05–0.15, moderate increases in δ shift the analysis towards a positive conclusion: we decrease p_h and p_b, but usually not by enough to lose the “evidence” for a positive effect (given that η_b is small; 0.25 compared to, say, 0.95).
The important threshold for obtaining a positive result is η_h.
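The p-value reading of p_b and p_h can be checked numerically. The sketch below uses only the Python standard library and evaluates the Student t distribution function by Simpson's rule; the function names, the summary inputs `diff` (estimated µ_2 − µ_1), `se` (its standard error) and `df` (degrees of freedom), and the example values are all hypothetical. The identity for δ = 0 is checked for a positive estimated difference, which is the case the slides appear to have in mind.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Distribution function of Student's t, by Simpson's rule on [0, |x|]."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def mbi_p_values(diff, se, df, delta):
    """Return (p_b, p_h, p) as the one-sided and two-sided p-values above."""
    # H0: mu2 - mu1 = delta vs. mu2 - mu1 < delta (lower-tail p-value)
    p_b = t_cdf((diff - delta) / se, df)
    # H0: mu2 - mu1 = -delta vs. mu2 - mu1 > -delta (upper-tail p-value)
    p_h = 1 - t_cdf((diff + delta) / se, df)
    # Usual two-sided p-value for H0: mu2 - mu1 = 0
    p = 2 * (1 - t_cdf(abs(diff) / se, df))
    return p_b, p_h, p


# Hypothetical summary statistics, for illustration only.
p_b0, p_h0, p = mbi_p_values(diff=3.0, se=2.0, df=30, delta=0.0)
p_b2, p_h2, _ = mbi_p_values(diff=3.0, se=2.0, df=30, delta=2.0)
print(p_b0, 1 - p / 2)   # these two agree when delta = 0
print(p_h0, p / 2)       # and so do these
print(p_b2 < p_b0, p_h2 < p_h0)  # increasing delta lowers both
```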

A graphical representation
ANIMATION 2: The effect of changing δ on p_b and p_h in the ternary diagram and on the probabilities of finding an effect when there is none
ANIMATION 3: The effect of changing δ on p_b and p_h, showing both the Frequentist and the Bayesian interpretations of these probabilities

The “Magnitude-based Inference” Test
“Magnitude-based inference” has not replaced tests by confidence intervals but is actually a test.
“Mechanistic inference” is a complicated and confusing way of increasing the level of the test; it does nothing to the power of the test. It is equivalent to using the usual p-value with a much larger threshold value (e.g. 50% instead of 5%).

“Clinical inference” in “magnitude-based inference” increases the level of the test and changes the thresholds.
• The increase in η_b (from 5% to 25%) looks spectacular but this is misleading because η_b is not actually important when the p-value is in the range 0.05–0.15.
• The decrease in η_h (from 5% to 0.5%) works against the other changes (in the p-value and δ), but is outweighed by the gains from the other two changes.
“Magnitude-based inference” is less conservative than other clinical inference procedures. If other researchers feel that clinical conclusions should be more conservative (“do no harm”) than mere statistical significance, what is the role for a method for clinical inference that is explicitly designed to be less conservative?

“I can’t be bothered addressing this kind of criticism. If you believe in God, no amount of evidence against His existence will disabuse you of your belief. Similarly, if you believe in null hypothesis testing, the evidence for a better method of making inferences about true effects means nothing to you. In any case, has this person read the evidence? I doubt it.”
Will Hopkins, quoted by Martin Buchheit, April 30, 2013

Sample size calculations

The standard formula (from significance testing) is

$$ n \approx \frac{\text{function of level (default: 5\%) and power (default: 80\%)}}{(\text{smallest difference you hope to detect})^2} $$

Without explanation or justification, Hopkins uses

$$ n \approx \frac{\text{function of } 2\eta_h \text{ (default: 1\%) and } 1 - \eta_b \text{ (default: 75\%)}}{4\,(\text{smallest difference you hope to detect})^2} $$

Calling η_h the “Type I clinical error rate” and η_b the “Type II clinical error rate” acknowledges (ironically) that “magnitude-based inference” is a test, but it does not justify their use in the standard sample size formula, because η_h is not the level of the test and η_b is not the probability of “not using an effect that is beneficial”.
There is no basis for the division by 4.
The changes to the numerator increase it by a factor of about 4/3; the division by 4 then reduces the sample size to about 1/3 of the standard value (4/3 × 1/4 = 1/3).
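The 4/3 and 1/3 factors can be reproduced with a short calculation. The slides only say “function of level and power”; the sketch below assumes the usual normal-approximation factor (z_{1−level/2} + z_{power})², and the function name `sample_size_factor` is hypothetical.

```python
from statistics import NormalDist


def sample_size_factor(level, power):
    """Assumed numerator: (z_{1 - level/2} + z_{power})^2."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return (z(1 - level / 2) + z(power)) ** 2


standard = sample_size_factor(level=0.05, power=0.80)  # 5% level, 80% power
hopkins = sample_size_factor(level=0.01, power=0.75)   # 2*eta_h = 1%, 1 - eta_b = 75%

numerator_ratio = hopkins / standard   # effect of changing the numerator
overall_ratio = numerator_ratio / 4    # after the extra division by 4

print(round(numerator_ratio, 3), round(overall_ratio, 3))
```

Under this assumption the numerator ratio comes out close to 4/3 and the overall ratio close to 1/3, matching the factors quoted above.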

Conclusion
The real motivation for “magnitude-based inference” is that significance tests (the use of p-values) and confidence intervals are seen as being too conservative.
“Magnitude-based inference” is promoted as an alternative to significance tests, but it is also a test. It is less conservative than standard tests because it inflates the level of the test to levels that should not be used.
The sample size calculations should not be used.
We sympathize with the frustration of the researcher finding that the evidence they have for an effect is weaker than they would like, but we have to recognize the limitations of the data and be careful about trying to strengthen weak evidence just because it suits us to do so.
We recommend being realistic about the limitations of the data and using confidence intervals (in preference to p-values).

“Should scientists accept and offer overconfidence, oversimplification, distortion and rhetoric disguised as quantified science ...?”
Sander Greenland
