Teesside University, Social Futures Institute, seminar, 18/11/2015 1
“It is better to observe than to criticise.”
– Bobby Wellins (Jazz Line-up, 13/2/2011)
“Best of all is to convey the magnitude of the effect and the degree of certainty explicitly.”
– Pinker (2014, p. 45)
“Usually what one wants to know is not whether the change makes any difference, but to know how likely it is that the change will be big enough.”
– Landauer (1997, p. 222)
Magnitude-based inference in behavioural research
Paul van Schaik
p.van-schaik@tees.ac.uk
http://sss-studnet.tees.ac.uk/psychology/staff/Paul_vs/index.htm
Outline
- Problem and proposed solution
- Quantification in behavioural research
- Statistical inference in behavioural research
- Magnitude-based inference
- The application of magnitude-based inference in behavioural research
- Other approaches
- Limitations
- Recommendations
The problem
A researcher conducts a study comparing two software designs in terms of their usability. She conducts usability tests with two groups, each using one of the designs, and collects various measures. These include perceived usability, error rate and time-on-task.
She then compares the two groups' mean scores on each measure using a t test. She finds that, although differences in mean scores are apparent, the test results are not statistically significant. What should the researcher conclude about the difference in usability between the two designs?
A proposed solution
As an alternative to null-hypothesis significance testing (NHST), use information about
- the uncertainty in the data,
- the observed value of the effect and
- the smallest substantial values for the effect
to make two kinds of magnitude-based inference: mechanistic and practical. Use the results of NHST as input. Use spreadsheets available on the Internet to generate the inferences. The approach was developed in, and has been influential in, sport and exercise science.
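As a minimal sketch of using NHST results as input, the hypothetical Python function below recovers the standard error from an observed effect and its two-sided p value, then computes the chances that the effect is substantially positive, trivial or substantially negative. It assumes a large-sample normal approximation and symmetric thresholds of +smallest and -smallest; for small samples a t distribution with the study's degrees of freedom would be more accurate, so the figures will not match the spreadsheets exactly.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def chances_from_p(effect: float, p_value: float, smallest: float) -> dict:
    """Chances (0-1) of a substantially +, trivial or substantially - effect.

    Hypothetical helper: assumes a two-sided p value, a normal sampling
    distribution and symmetric thresholds of +smallest and -smallest.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    if effect == 0.0 or smallest <= 0.0:
        raise ValueError("need a non-zero effect and a positive threshold")
    # The two-sided p value fixes the test statistic, which fixes the SE.
    z = _STD_NORMAL.inv_cdf(1.0 - p_value / 2.0)
    se = abs(effect) / z
    # Sampling distribution centred on the observed effect.
    sampling = NormalDist(mu=effect, sigma=se)
    positive = 1.0 - sampling.cdf(smallest)   # above +smallest
    negative = sampling.cdf(-smallest)        # below -smallest
    trivial = 1.0 - positive - negative
    return {"se": se, "positive": positive, "trivial": trivial, "negative": negative}


result = chances_from_p(effect=0.5, p_value=0.05, smallest=0.2)
print({k: round(v, 3) for k, v in result.items()})
```

With these illustrative inputs (a "just significant" effect of 0.5 and thresholds of ±0.2), the chance of a substantially positive effect is close to 88%, which says far more than "p = 0.05".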
Quantification in user research
- User research: “The systematic study of the goals, needs, and capabilities of users so as to specify design, construction, or improvement of tools to benefit how users work and live” (Schumacher, 2009, p. 6)
- Usability and user-experience data
  - e.g. psychometric data, error rate and time-on-task
- Formative research
  - users’ interaction with an artefact is studied to generate data that, when analysed, provide information to inform system improvement
- Summative research
  - establishes the quality of interaction with an artefact in comparison with another artefact or a benchmark
Statistical inference in user research
Usually, null-hypothesis significance testing (NHST) is used. Its limitations:
1. the null hypothesis of no effect is (almost) always false;
2. it ignores the smallest important effect, which plays no part in the inference made in NHST;
3. it does not address practical relevance, and does not clearly define or distinguish practical and mechanistic significance;
4. a non-significant result is inconclusive, and only a crude classification of inferences is used (reject or retain H0);
5. sample-size estimation is based on NHST.
Merits of magnitude-based inference
1. Requires the researcher to define the smallest important effect, rather than a null effect
2. Uses the smallest important effect as an integral part of the inference, so inferences are not an artefact of sample size
3. Provides a rigorous and principled approach to inferring practical significance, and a rigorous distinction between practical and mechanistic significance
More merits
4. Provides a more refined classification of the inferences that can be made than merely rejecting or retaining the null hypothesis
5. Bases estimates of the required sample size on practical or mechanistic significance and the researcher-defined smallest important effect
Inference of mechanistic significance (1)
- For descriptive purposes, an effect can be classified by its size
  - in relation to the smallest important positive (+) and negative (-) effect sizes
  - as positive, trivial or negative
- For inference proper, the chances of an effect being positive, negative or trivial are used
  - The chances of the effect being positive: the chances of the effect falling above the threshold of the smallest important + effect
  - The chances of the effect being negative: the chances of the effect falling below the threshold of the smallest important - effect
  - The chances of a trivial effect: 100% minus the sum of the chances of a + effect and those of a - effect
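The three chances above can be computed directly once the observed effect and its standard error are known. A minimal Python sketch, assuming a normal sampling distribution centred on the observed effect; the function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def mechanistic_chances(effect, se, smallest_pos, smallest_neg):
    """Chances (0-1) that the true effect is positive, trivial or negative.

    Hypothetical helper; smallest_neg is passed as a negative number.
    """
    if se <= 0 or not smallest_neg < 0 < smallest_pos:
        raise ValueError("need se > 0 and smallest_neg < 0 < smallest_pos")
    dist = NormalDist(mu=effect, sigma=se)
    positive = 1.0 - dist.cdf(smallest_pos)   # above the smallest important + effect
    negative = dist.cdf(smallest_neg)         # below the smallest important - effect
    trivial = 1.0 - (positive + negative)     # 100% minus the other two
    return positive, trivial, negative


pos, triv, neg = mechanistic_chances(effect=0.3, se=0.15, smallest_pos=0.2, smallest_neg=-0.2)
print(f"positive {pos:.1%}, trivial {triv:.1%}, negative {neg:.1%}")
```

With an observed effect of 0.3, a standard error of 0.15 and thresholds of ±0.2, this gives roughly a 75% chance of a positive effect and well under 1% chance of a negative one.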
Inference of mechanistic significance (2)
- An inference is then made from the chances of each of the three ranges of outcome (positivity, triviality and negativity), as follows:
  - Unclear effect: both the chances of the obtained effect being + and the chances of it being - are too large (e.g., both greater than the default value of 0.05 or other appropriate cut-offs).
  - Otherwise, clear effect: seen as substantially +, - or trivial, and considered to have the size of the observed value, with a qualification of probability.
- Proposed interpretation of probability ranges:
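The decision rule can be sketched as follows. This Python function is hypothetical: it takes chances computed as on the previous slide, uses the default cut-off of 0.05, and labels a clear effect by where the observed value falls relative to the thresholds, returning the chance of that outcome as the probability qualifier.

```python
def mechanistic_inference(effect, smallest_pos, smallest_neg,
                          p_positive, p_negative, cutoff=0.05):
    """Mechanistic inference from the chances of a substantial + or - effect.

    Hypothetical helper. Returns (label, probability of that label), or
    ("unclear", None) when the effect could be substantially + and -.
    """
    # Unclear: both substantial outcomes are still too plausible.
    if p_positive > cutoff and p_negative > cutoff:
        return "unclear", None
    # Clear: the effect takes the size of the observed value,
    # qualified by the chance of that outcome.
    if effect > smallest_pos:
        return "positive", p_positive
    if effect < smallest_neg:
        return "negative", p_negative
    return "trivial", 1.0 - p_positive - p_negative


print(mechanistic_inference(0.3, 0.2, -0.2, p_positive=0.75, p_negative=0.0004))
```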
Probability      Chances          Odds             The effect … (positive/trivial/negative or beneficial/negligible/harmful)
<0; 0.005]       <0; 0.5%]        <0; 1:199]       is almost certainly not …
<0.005; 0.05]    <0.5%; 5%]       <1:199; 1:19]    is very unlikely to be …
<0.05; 0.25]     <5%; 25%]        <1:19; 1:3]      is unlikely to be …, is probably not …
<0.25; 0.75]     <25%; 75%]       <1:3; 3:1]       is possibly (not) …, may (not) be …
<0.75; 0.95]     <75%; 95%]       <3:1; 19:1]      is likely to be …, is probably …
<0.95; 0.995]    <95%; 99.5%]     <19:1; 199:1]    is very likely to be …
<0.995; 1>       <99.5%; 100%>    <199:1; ∞>       is almost certainly …
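The mapping from probability to descriptor in the table is mechanical, so it can be written as a small lookup. A hypothetical Python sketch that returns the first phrase given in each row and converts a probability into the odds shown in the table:

```python
# Upper bound (inclusive) of each probability range in the table, paired with
# the first descriptor given for that range.
_RANGES = [
    (0.005, "is almost certainly not"),
    (0.05, "is very unlikely to be"),
    (0.25, "is unlikely to be"),
    (0.75, "is possibly"),
    (0.95, "is likely to be"),
    (0.995, "is very likely to be"),
    (1.0, "is almost certainly"),
]


def describe(probability: float) -> str:
    """Qualitative descriptor for the chance that an effect has a given size."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    for upper, phrase in _RANGES:
        if probability <= upper:
            return phrase
    raise AssertionError("unreachable: the last upper bound is 1.0")


def odds(probability: float) -> float:
    """Odds in favour, p / (1 - p); e.g. 0.95 corresponds to 19:1."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("odds are undefined for probability 1")
    return probability / (1.0 - probability)


print("The effect", describe(0.92), "beneficial")
```

For example, a 92% chance of benefit maps to "is likely to be", the wording used in the sport-science example.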
Inference of practical significance (1)
- For descriptive purposes, an effect can be classified by its size
  - in relation to the smallest important beneficial and harmful effect sizes
  - as beneficial, negligible or harmful
- For inference proper, the chances of an effect being beneficial, harmful or negligible are used
  - The chances of the effect being beneficial: the chances of the effect falling above the threshold of the smallest important beneficial effect
  - The chances of the effect being harmful: the chances of the effect falling below the threshold of the smallest important harmful effect
  - The chances of a negligible effect: 100% minus the sum of the chances of a beneficial effect and those of a harmful effect
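"Above" and "below" assume that higher values are better. For outcomes such as task time or error rate, where lower is better, benefit lies below the negative threshold instead. A hypothetical Python sketch that handles both directions, again assuming a normal sampling distribution centred on the observed effect:

```python
from statistics import NormalDist


def practical_chances(effect, se, smallest_benefit, smallest_harm,
                      higher_is_better=True):
    """Chances (0-1) that the true effect is beneficial, negligible or harmful.

    Hypothetical helper. Thresholds are given as positive magnitudes; the
    sampling distribution is assumed normal, centred on the observed effect.
    """
    if se <= 0 or smallest_benefit <= 0 or smallest_harm <= 0:
        raise ValueError("se and both thresholds must be positive")
    dist = NormalDist(mu=effect, sigma=se)
    if higher_is_better:       # e.g. perceived usability
        beneficial = 1.0 - dist.cdf(smallest_benefit)
        harmful = dist.cdf(-smallest_harm)
    else:                      # e.g. error rate or time-on-task
        beneficial = dist.cdf(-smallest_benefit)
        harmful = 1.0 - dist.cdf(smallest_harm)
    negligible = 1.0 - beneficial - harmful
    return beneficial, negligible, harmful


# Illustrative values: a 0.3-unit drop in time-on-task (SE 0.15), thresholds of 0.2.
print(practical_chances(-0.3, 0.15, 0.2, 0.2, higher_is_better=False))
```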
Inference of practical significance (2)
- Type-1 practical error
  - analogous to a Type-I error in NHST (rejecting the null hypothesis when it is true)
- Type-2 practical error
  - analogous to a Type-II error in NHST (retaining the null hypothesis when it is false)
- In the practical (‘clinical’) application of effects, the chance of using a harmful effect (a Type-1 practical error) needs to be far smaller than the chance of not using a beneficial effect (a Type-2 practical error)
Inference of practical significance (3)
- An inference is then made from the chances of each of the three ranges of outcome (benefit, negligibility and harm), as follows:
  - If the chances of benefit are greater than the suggested cut-off of 25% for a Type-2 practical error and the chances of harm are greater than the suggested cut-off of 0.5% for a Type-1 practical error, then the effect is unclear.
  - If the chances of benefit are greater than 25% and the chances of harm are smaller than 0.5%, then the effect is clearly beneficial.
  - Otherwise, the effect is clearly negligible or harmful.
- The proposed interpretation of probability ranges is as before
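The three rules translate directly into code. A hypothetical Python sketch using the suggested default cut-offs of 25% (benefit) and 0.5% (harm). The rules leave a harm chance of exactly 0.5% unassigned; the sketch treats it as acceptable.

```python
def practical_inference(p_benefit, p_harm, min_benefit=0.25, max_harm=0.005):
    """Practical inference with the suggested cut-offs (hypothetical helper).

    min_benefit: cut-off for a Type-2 practical error (default 25%)
    max_harm: cut-off for a Type-1 practical error (default 0.5%)
    """
    if p_benefit > min_benefit and p_harm > max_harm:
        return "unclear"               # could be beneficial and could be harmful
    if p_benefit > min_benefit:        # here p_harm <= max_harm
        return "clearly beneficial"
    return "clearly negligible or harmful"


print(practical_inference(p_benefit=0.92, p_harm=0.0))
```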
Example from sport science (1)
- I am grateful to Matt Weston for providing this example
- A sports researcher is interested in whether a new, commercially available nutritional supplement has a beneficial or harmful effect on elite cyclists’ 40 km time-trial performance (the faster the time, the better the performance)
- The researcher conducts an experiment to examine the effect of two different doses of the supplement (a low dose and a high dose)
- Experimental crossover design
  - all of the cyclists perform the time trial under three different conditions (placebo [no supplement], low dose and high dose), in a counterbalanced order
- The researcher’s experience led to the belief that the smallest worthwhile change in 40 km time-trial performance was -1% (a 1% reduction in time)
Example from sport science (2)
- The mean (± SD) performance times were
  - 59.5 ± 1.6 min (low dose),
  - 60.9 ± 2.2 min (high dose) and
  - 60.5 ± 1.9 min (placebo)
- Magnitude-based inferences
  - calculate the chances of benefit (or harm) with reference to the smallest worthwhile change of -1%
  - compared to placebo, the low dose improved performance by 1.7% (change in time -1.7%; 90% confidence interval -2.4 to -0.9%), with a 92% chance of benefit and a 0.0% chance of harm
  - a low dose of the supplement is therefore likely to be beneficial and is recommended
  - however, compared to placebo, the high dose impaired performance by 0.7% (change in time +0.7%; 90% confidence interval -0.1 to 1.5%), with a 0% chance of benefit and a 25% chance of harm
  - a high dose of the supplement is therefore most unlikely to be beneficial and is not recommended
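The chances for the two doses can be approximately re-derived from the reported effects and 90% confidence intervals. A hypothetical Python sketch with two stated assumptions: the smallest harmful change is taken to be +1% (the example gives only the -1% beneficial threshold), and each interval is treated as symmetric on the raw percent scale with a normal sampling distribution. The reported intervals are not quite symmetric, so the sketch lands about two percentage points away from the reported 92% and 25%.

```python
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.95)  # about 1.645: half-width of a 90% CI in SE units

# Thresholds for percent change in 40 km time (negative = faster = better).
BENEFIT_BELOW = -1.0  # smallest worthwhile change, from the example
HARM_ABOVE = 1.0      # assumption: symmetric smallest harmful change


def chances_from_ci(effect, lower, upper):
    """Chances of benefit and harm from an effect and its 90% CI (sketch)."""
    se = (upper - lower) / (2 * Z90)          # back out the SE from the CI width
    sampling = NormalDist(mu=effect, sigma=se)
    benefit = sampling.cdf(BENEFIT_BELOW)     # true change faster than -1%
    harm = 1.0 - sampling.cdf(HARM_ABOVE)     # true change slower than +1%
    return benefit, harm


results = {}
for dose, effect, lower, upper in [("low", -1.7, -2.4, -0.9), ("high", 0.7, -0.1, 1.5)]:
    results[dose] = chances_from_ci(effect, lower, upper)
    benefit, harm = results[dose]
    print(f"{dose} dose: chance of benefit {benefit:.1%}, chance of harm {harm:.1%}")
```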
Demonstration
- Example: unrelated t test
  - Mechanistic inference
  - Practical inference
- Spreadsheets available at http://www.sportsci.org/
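The demonstration uses spreadsheets, but the same steps can be sketched in Python for an unrelated (independent-samples) t test computed from summary statistics. Everything here is illustrative. The function names are hypothetical. The Student t CDF is a simple numerical integration (Simpson's rule) written for the sketch rather than a library routine. The usage numbers (two groups of usability scores and a smallest important difference of 3 points) are made up.

```python
import math


def t_cdf(x: float, df: int) -> float:
    """Student t CDF via Simpson's rule on the density (adequate for a sketch)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u: float) -> float:
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    n = 2000  # even number of Simpson intervals
    h = abs(x) / n
    inner = sum((4 if i % 2 else 2) * pdf(i * h) for i in range(1, n))
    area = (pdf(0.0) + pdf(abs(x)) + inner) * h / 3
    return 0.5 + math.copysign(area, x)


def unrelated_t_mbi(m1, sd1, n1, m2, sd2, n2, smallest):
    """Student's unrelated t test plus mechanistic chances for m1 - m2."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = m1 - m2
    p_value = 2 * (1 - t_cdf(abs(diff) / se, df))
    positive = 1 - t_cdf((smallest - diff) / se, df)   # true diff > +smallest
    negative = t_cdf((-smallest - diff) / se, df)      # true diff < -smallest
    return {"diff": diff, "se": se, "df": df, "p": p_value,
            "positive": positive, "trivial": 1 - positive - negative,
            "negative": negative}


# Hypothetical usability scores: design A vs design B, 20 users each.
res = unrelated_t_mbi(72, 12, 20, 66, 14, 20, smallest=3)
print({k: round(v, 3) for k, v in res.items()})
```

With these made-up numbers the t test is not significant (p ≈ 0.15), yet the chance of a positive effect is about 76% and that of a negative effect is under 5%. By the mechanistic rule the effect is therefore clear rather than inconclusive, which is the kind of answer the researcher in the opening problem lacked.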
Observations
- Practical and mechanistic inference, but not statistical inference, depend on the smallest worthwhile effect
- The range of practical and mechanistic inferences (e.g., “is very (un)likely to be harmful/trivial/beneficial”) is greater than that of statistical inference (which is dichotomous)
- The results of practical and mechanistic inference concur with those of statistical inference about half of the time; when they differ, statistical inference is the more conservative
- Practical and mechanistic inference mostly concur
Total sample size (N) and sample-size ratios (P = practical, M = mechanistic, S = statistical inference)

Smallest harmful/-ive d   Smallest beneficial/+ive d   N (P)   N (M)   N (S)   S/P    S/M    M/P
-0.2                      0.2                          268     274     788     2.94   2.88   1.02
-0.3                      0.3                          122     122     352     2.89   2.89   1.00
-0.4                      0.4                           70      70     198     2.83   2.83   1.00
-0.5                      0.5                           46      46     128     2.78   2.78   1.00
-0.6                      0.6                           34      32      90     2.65   2.81   0.94
-0.7                      0.7                           26      24      66     2.54   2.75   0.92
-0.8                      0.8                           22      20      52     2.36   2.60   0.91
-0.9                      0.9                           18      16      42     2.33   2.63   0.89
-1.0                      1.0                           14      14      34     2.43   2.43   1.00
-1.1                      1.1                           14      12      28     2.00   2.33   0.86
-1.2                      1.2                           14      10      24     1.71   2.40   0.71
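Approximate versions of these sample sizes can be reconstructed with normal-theory formulas for two equal groups with SD 1 and thresholds of -d and +d. The criteria in this sketch are assumptions about the underlying method, not the spreadsheet's code. S uses a two-sided 5% test with 80% power. M requires that a 90% confidence interval cannot overlap both substantial positive and substantial negative values, so its width is at most 2d. P requires that a 0.5% chance of harm and a 25% chance of benefit can occur together, so (z_0.995 + z_0.75) × SE ≤ 2d. At d = 0.2 it gives 266, 272 and 786, a few participants below the tabulated values. The discrepancy grows for the larger values of d, where samples are small. The S/M ratio of about 2.9 follows from (1.96 + 0.84)^2 / 1.645^2.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard-normal quantile function


def total_sample_size(d: float, inference: str) -> int:
    """Approximate total N (two equal groups, SD = 1) for thresholds -d and +d.

    Normal-theory sketch; the criteria are assumptions stated in the text.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    if inference == "S":    # NHST: two-sided alpha = 0.05, power = 0.80
        max_se = d / (z(0.975) + z(0.80))
    elif inference == "M":  # 90% CI half-width no larger than d
        max_se = d / z(0.95)
    elif inference == "P":  # harm 0.5% and benefit 25% reachable together
        max_se = 2 * d / (z(0.995) + z(0.75))
    else:
        raise ValueError("inference must be 'P', 'M' or 'S'")
    # SE of a difference in means with n per group and SD 1 is sqrt(2 / n).
    per_group = math.ceil(2 / max_se ** 2)
    return 2 * per_group


for d in (0.2, 0.5, 1.0):
    sizes = {k: total_sample_size(d, k) for k in "PMS"}
    print(d, sizes)
```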
Further alternatives to NHST
- Counter-null statistic (Rosenthal & Rubin, 1994)
- p_rep (Killeen, 2005)
- p-intervals (Cumming, 2008)
- Minimum-effect tests (Murphy & Myors, 1999)
- Equivalence testing (Tryon, 2001)
- Non-inferiority testing (Head et al., 2014)
- Bayesian statistics (Rouder et al., 2009)
Limitations
- Apparent
  - As in NHST, the researcher needs to make several choices or accept recommended choices:
    - the confidence level
    - the Type-1 and Type-2 practical-error rates
    - the smallest important effect
    - the mapping of quantitative probabilities onto qualitative descriptors
  - As in NHST, assumptions are made about the sampling distribution of the outcome statistic; bootstrapping can be used to relax them
- Substantive
  - The decision rules do not necessarily take all relevant factors into account, for example the (financial) value of inputs to and outputs from using a harmful or beneficial effect (Murphy & Myors, 1999)
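As a rough illustration of the bootstrapping option under Limitations, the chances can be estimated from the bootstrap distribution of the difference in means instead of from a normal or t sampling distribution. This Python sketch is hypothetical. It assumes raw data from two independent groups and treats the bootstrap distribution as a stand-in for the sampling distribution centred on the observed effect. The data in the usage lines are made up.

```python
import random
from statistics import mean


def bootstrap_chances(group_a, group_b, smallest, n_boot=5000, seed=1):
    """Chances that mean(a) - mean(b) is substantially +, trivial or substantially -.

    Each replicate resamples both groups with replacement; the chances are the
    proportions of replicate differences beyond +smallest and -smallest.
    """
    rng = random.Random(seed)  # fixed seed so results are reproducible
    diffs = []
    for _ in range(n_boot):
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(a) - mean(b))
    positive = sum(d > smallest for d in diffs) / n_boot
    negative = sum(d < -smallest for d in diffs) / n_boot
    return positive, 1.0 - positive - negative, negative


# Made-up time-on-task data (seconds) for two hypothetical designs.
design_a = [41, 38, 45, 50, 39, 44, 47, 42, 40, 46]
design_b = [48, 52, 44, 55, 50, 47, 53, 49, 51, 46]
print(bootstrap_chances(design_a, design_b, smallest=2.0))
```

Here a negative difference means design A is faster. With these numbers nearly all replicates fall below -2 s, so design A is very likely to be substantially faster.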
Recommendations
1. Plan sample size using magnitude-based inference.
2. Analyse data using NHST, and make better use of the results as input for magnitude-based inference.
3. Always analyse data using mechanistic inference; also use practical inference for effects where benefit and harm can be meaningfully defined.
4. Use appropriate spreadsheets for sample-size estimation and magnitude-based inference (http://www.sportsci.org/).
5. When preparing for journal publication, cogently argue why it is appropriate to use magnitude-based inference in your research; in your Data Analysis section, explain the specific magnitude-based inference that you have used (see, e.g., Barnes et al., 2015).
Some publications
Barnes, K. R., Hopkins, W. G., McGuigan, M. R., & Kilding, A. E. (2015). Warm-up with a weighted vest improves running performance via leg stiffness and running economy. Journal of Science and Medicine in Sport, 18, 103-108. doi:10.1016/j.jsams.2013.12.005
Batterham, A. M., & Hopkins, W. G. (2006). Making meaningful inferences about magnitudes. International Journal of Sports Physiology and Performance, 1(1), 50-57.
Hopkins, W. G. (2006). Estimating sample size for magnitude-based inference. Sportscience, 10, 63-70.
Hopkins, W. G. (2006). Spreadsheets for analysis of controlled trials, with adjustment for a subject characteristic. Sportscience, 10, 46-50.
Hopkins, W. G., Marshall, S. W., Batterham, A. M., & Hanin, J. (2009). Progressive statistics for studies in sports medicine and exercise science. Medicine and Science in Sports and Exercise, 41(1), 3-12. doi:10.1249/MSS.0b013e31818cb278
Schaik, P. van, & Weston, M. (2016). Magnitude-based inference and its application in user research. International Journal of Human-Computer Studies, 88, 38-50.