SLIDE 1
P-values, Randomization Tests, and Nonparametric Combinations of Tests
Tonix Virtual Retreat
Philip B. Stark 22 October 2020
University of California, Berkeley
SLIDE 2 Randomized experiments
- Subjects recruited at one or more centers
- Criteria to ensure they have the condition
- Randomized to treatment/control or treatment level, sometimes w/ constraints or “bias” to get balance.
- Randomization algorithms often proprietary
SLIDE 3 Analyzing the data
- Common to use things like ANOVA, t-tests, regression, logistic regression
- Assumptions generally have nothing to do with the experiment
SLIDE 4 Small example
11 pairs of rats, each pair from the same litter. Randomly (by coin toss) put one of each pair into an “enriched” environment; the other sib gets a “normal” environment. After 65 days, measure cortical mass (mg):

  enriched:      689  656  668  660  679  663  664  647  694  633  653
  impoverished:  657  623  652  654  658  646  600  640  605  635  642
  difference:     32   33   16    6   21   17   64    7   89   -2   11

Cartoon of Rosenzweig, M.R., E.L. Bennett, and M.C. Diamond, 1972. Brain changes in response to experience, Scientific American, 226, 22–29.
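These differences are the raw material for all the tests that follow, so it is worth checking the summary statistics directly. A minimal sketch in plain Python, with the data transcribed from the slide:

```python
from statistics import mean, stdev

enriched     = [689, 656, 668, 660, 679, 663, 664, 647, 694, 633, 653]
impoverished = [657, 623, 652, 654, 658, 646, 600, 640, 605, 635, 642]

# paired difference for each littermate pair (note one negative difference)
diffs = [e - i for e, i in zip(enriched, impoverished)]
```

The mean and sample SD of `diffs` match the values quoted later on the Student t-test slide (26.73 mg and 27.33 mg).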
SLIDE 5
Informal hypotheses
Null hypothesis: treatment has “no effect.”
Alternative hypothesis: treatment increases cortical mass.
Suggests a 1-sided test for an increase.
SLIDE 8 Test contenders
- 2-sample Student t-test:
  (mean(treatment) − mean(control)) / (pooled estimate of the SD of the difference of means)
- 1-sample Student t-test on the differences:
  mean(differences) / (SD(differences)/√11)
- randomization test using the t-statistic of the differences: same statistic, calibrate the probability differently
SLIDE 12 The Neyman “ticket” model (1930)
- S subjects, T treatments
- subject s represented by a ticket with T numbers on it, xs1, . . . , xsT, set before treatment is assigned (but unknown to the experimenter):

  resp to tx 1   resp to tx 2   · · ·   resp to tx T
  4              9.2            · · ·   −3.33

- xst is the response subject s will have if assigned treatment t
- if subject s is assigned to treatment t, observe xst
- no necessary connection of the numbers across subjects
- no assumption about the distribution of the numbers
- “non-interference” implicit
SLIDE 13 Generalizations
- subject s represented by a ticket with T J-vectors on it, xs1, . . . , xsT
- if subject s is assigned treatment ts, observe the vector xsts

  item   resp to tx 1   resp to tx 2   · · ·   resp to tx T
  1      4              9.2            · · ·
  2      2              1              · · ·   17
  ⋮      ⋮              ⋮                      ⋮
  J      5              42             · · ·   9
SLIDE 14 More generalizations
- subject s represented by a ticket with T probability distributions on it, Fs1, . . . , FsT
- if subject s is assigned treatment t, observe a draw from Fst
- Fst could be a multivariate distribution

  resp to tx 1   resp to tx 2   · · ·   resp to tx T
  F11(·)         F12(·)         · · ·   F1T(·)
SLIDE 15
Generic notation
xst could be a scalar, a vector, or a realization of a random variable or random vector.
ψ(·) is a test statistic: it maps the data x to a scalar.
SLIDE 18 The strong null hypothesis
- “treatment doesn’t matter at all”
- subject s’s response would have been the same, no matter what treatment was assigned
- xs1 = xs2 = · · · = xsT
- (but xst is not necessarily equal to xrt for r ≠ s)

  resp to tx 1   resp to tx 2   · · ·   resp to tx T
  4              4              · · ·   4
SLIDE 19
- if the null is true, we know what would have been observed if the random assignment had been different: every subject would have had the same response
- induces a null distribution for any test statistic ψ
- completely determined by the randomization: no additional assumptions
SLIDE 20
The rats: strong null
Treatment has no effect, as if each rat’s cortical mass was determined before randomization. Then it is equally likely that the rat with the heavier cortex is assigned to treatment or to control, independently across littermate pairs. Gives 2^11 = 2048 equally likely possibilities:
±32 ±33 ±16 ±6 ±21 ±17 ±64 ±7 ±89 ±2 ±11
SLIDE 21 Alternative hypotheses
- 1. Individual’s response depends only on that individual’s assignment
- Special cases: shift, scale, etc.
- 2. Interactions/Interference: my response could depend on your treatment
SLIDE 22 Assumptions of the tests
- 1. 2-sample t-test:
  - masses are an iid sample from a normal distribution, same unknown variance, same unknown mean.
  - Tests the “weak” null hypothesis (plus normality, independence, non-interference, etc.).
- 2. 1-sample t-test on the differences:
  - mass differences are an iid sample from a normal distribution, unknown variance, zero mean.
  - Tests the “weak” null hypothesis (plus normality, independence, non-interference, etc.).
- 3. randomization test:
  - randomization performed as claimed.
  - tests the strong null hypothesis.
Assumptions of the randomization test are true by fiat.
SLIDE 23
Student t-test calculations
Mean of differences: 26.73 mg
Sample SD of differences: 27.33 mg
t-statistic: 3.244 ≡ t0
P-value for 2-sided t-test: 0.0088
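The t-statistic on the slide can be reproduced with the standard library alone; the two-sided P-value would then come from the t distribution with 10 degrees of freedom (e.g., via `scipy.stats.ttest_1samp`, if SciPy is available). A sketch:

```python
from math import sqrt
from statistics import mean, stdev

diffs = [32, 33, 16, 6, 21, 17, 64, 7, 89, -2, 11]

# 1-sample t-statistic on the differences: mean / (SD / sqrt(n))
t0 = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```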
SLIDE 24
- Why do cortical weights have a normal distribution?
- Why is the variance of the difference between treatment and control the same for different litters?
- Treatment and control are dependent because assigning a rat to treatment excludes it from the control group, and vice versa.
- The P-value depends on assuming the differences are an iid sample from a normal distribution.
- If we reject the null, is that because there is a treatment effect, or because the other assumptions are wrong?
SLIDE 25
Randomization t-test calculations
Could enumerate all 2^11 = 2,048 equally likely possibilities. Calculate the t-statistic for each. The P-value is (# possibilities s.t. t ≥ t0)/2048 ≈ 0.0018.
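The enumeration is small enough to do exhaustively. A sketch of the sign-flip enumeration under the strong null (each pair's difference is ± its observed magnitude, all 2^11 assignments equally likely):

```python
from itertools import product
from math import sqrt

diffs = [32, 33, 16, 6, 21, 17, 64, 7, 89, -2, 11]
n = len(diffs)

def t_stat(xs):
    # 1-sample t-statistic: mean / (sample SD / sqrt(n))
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m / sqrt(var / len(xs))

t0 = t_stat(diffs)
mags = [abs(d) for d in diffs]
# count sign assignments whose t-statistic is at least the observed one
hits = sum(
    t_stat([s * a for s, a in zip(signs, mags)]) >= t0
    for signs in product((1, -1), repeat=n)
)
p = hits / 2 ** n
```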
SLIDE 26
SLIDE 27
“Statistical procedure and experimental design are only two different aspects of the same whole, and that whole is the logical requirements of the complete process of adding to natural knowledge by experimentation.”
SLIDE 28 “A Lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. · · · Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of what the test will consist, namely, that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order, that is in an order not determined arbitrarily by human choice, but by the actual manipulation of the physical apparatus used in games of chance, dice, cards, roulettes, etc., or, more expeditiously, from a published collection of random sampling numbers purporting to give the actual results of such manipulation. Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.”
SLIDE 31 Test statistic: number of correct IDs
(8 choose 4) = 70 equally likely ways to divide the 8 cups into two sets of 4.
(4 choose 3)(4 choose 1) = 16 ways to get exactly 3 correct.
P(all 4 correct) = 1/70 ≈ 0.014; P(at least 3 correct) = (16 + 1)/70 ≈ 0.243.
“At best the subject can judge rightly with every cup and, knowing that 4 are of each kind, this amounts to choosing, out of the 70 sets of 4 which might be chosen, that particular one which is correct. A subject without any faculty of discrimination would in fact divide the 8 cups correctly into two sets of 4 in one trial out of 70, or, more properly, with a frequency which would approach 1 in 70 more and more nearly the more often the test were repeated.”
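The counting arguments above can be checked directly with `math.comb`:

```python
from math import comb

total = comb(8, 4)                   # ways to pick which 4 cups are "milk first"
exactly_3 = comb(4, 3) * comb(4, 1)  # 3 of the right 4, plus 1 of the wrong 4
p_all_4 = 1 / total
p_at_least_3 = (exactly_3 + 1) / total
```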
SLIDE 32 “No such selection [of a significance level] can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly ‘significant,’ in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.”
SLIDE 33 “Tests of significance are of many different kinds, which need not be considered here. Here we are only concerned with the fact that the easy calculation in permutations which we encountered, and which gave us our test of significance, stands for something present in every possible experimental arrangement; or, at least, for something required in its interpretation.”
SLIDE 34
SLIDE 35
SLIDE 36
What’s a P-value?
Suppose X is a random variable s.t. P0{X ≤ p} ≤ p for all p ∈ [0, 1]. Then the observed value of X is a P-value.
SLIDE 38 Example: Lady Tasting Tea
X = 1/70 if 4 correct; 17/70 if 3 correct; 53/70 if 2 correct; 69/70 if 1 correct; 1 if 0 correct.
Then P0{X ≤ p} ≤ p: X is a P-value.
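The defining property P0{X ≤ p} ≤ p can be verified exactly for this example with rational arithmetic. A sketch, with the pmf counts taken from the hypergeometric counting on the earlier slides:

```python
from fractions import Fraction as F

# null pmf of the number of correct IDs: C(4,k)*C(4,4-k)/C(8,4)
pmf = {4: F(1, 70), 3: F(16, 70), 2: F(36, 70), 1: F(16, 70), 0: F(1, 70)}
# X is the upper-tail probability of the observed count
X = {k: sum(p for j, p in pmf.items() if j >= k) for k in pmf}
# check P0{X <= x} <= x at every attainable value x of X
sub_uniform = all(
    sum(pmf[j] for j in pmf if X[j] <= X[k]) <= X[k] for k in pmf
)
```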
SLIDE 39
If the P-value is p, either the null hypothesis is false or the null hypothesis is true and an event occurred that had chance no greater than p.
SLIDE 40
Disconnect
The distribution used to “calibrate” P-values (i.e., to find P{X ≤ p}) for parametric tests typically used in RCTs has nothing to do with the experiment actually performed.
SLIDE 45 Permutation tests
- exploit invariance of the distribution of the data under the action of some group when the null hypothesis is true
- generically, the “permutation” group, but can be any group
- every dataset in the orbit of the observed data is equally likely
- in principle, can find the P-value by enumeration
- too many in practice: use a (pseudo-)random sample of N “permutations”
- think of hits/N as an approximation to the exact P-value
- can use sequential methods to make inferences about the exact P-value
- (hits + 1)/(N + 1) is an exact P-value for a randomized test
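A sketch of the sampled version, as a 2-sample permutation test on the rat masses. This deliberately ignores the pairing, purely to show the mechanics: under the null the group labels are exchangeable, so every relabeling of the 22 masses is equally likely, and (hits + 1)/(N + 1) is the P-value for the randomized test:

```python
import random

enriched     = [689, 656, 668, 660, 679, 663, 664, 647, 694, 633, 653]
impoverished = [657, 623, 652, 654, 658, 646, 600, 640, 605, 635, 642]

def mean_diff(pool, n1):
    # difference of group means when the first n1 entries are "treatment"
    return sum(pool[:n1]) / n1 - sum(pool[n1:]) / (len(pool) - n1)

rng = random.Random(0)
pool = enriched + impoverished
t0 = mean_diff(pool, len(enriched))
N = 10_000
hits = sum(
    mean_diff(rng.sample(pool, len(pool)), len(enriched)) >= t0
    for _ in range(N)
)
p = (hits + 1) / (N + 1)   # exact P-value for the randomized test
```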
SLIDE 51 Randomization tests
- exploit the random assignment of subjects to treatments
- null distribution of the test statistic flows from the method of random assignment
- generally not analytically tractable, esp. if random assignment includes balancing
- approximate by simulation: re-run the random assignment N times
- hits/N is an approximation of the P-value
- can use sequential testing to make inferences about the “true” P-value
SLIDE 52 Test functions in randomization and permutation tests
- Can use any test function you want, including functions that come from parametric methods such as regression, ANOVA, logistic regression, etc.
- Calibrate P-values using the permutation or randomization distribution
- Choose the test function to have power against the scientifically interesting alternative(s)
SLIDE 53 Generic sketch
- Pick test statistic
- Collect data
- Find/simulate null distribution of test statistic conditional on the observed data
- P-value is tail probability of the test statistic under the null
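The generic sketch above, as a helper function. The names (`randomization_p`, `flip`) and the sign-flip reallocation in the usage example are illustrative, not from the slides:

```python
import random

def randomization_p(data, statistic, reallocate, N=10_000, seed=0):
    # P-value: tail probability of the test statistic under re-randomization,
    # conditional on the observed responses; (hits+1)/(N+1) keeps it a valid
    # P-value for the randomized test
    rng = random.Random(seed)
    t0 = statistic(data)
    hits = sum(statistic(reallocate(data, rng)) >= t0 for _ in range(N))
    return (hits + 1) / (N + 1)

# usage: paired rat differences under the strong null (sign flips)
diffs = [32, 33, 16, 6, 21, 17, 64, 7, 89, -2, 11]
flip = lambda d, rng: [x * rng.choice((1, -1)) for x in d]
p = randomization_p(diffs, lambda d: sum(d) / len(d), flip)
```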
SLIDE 54
Multivariate tests and intersection tests
Generally measure more than one “response” per subject. E.g., CAPS-5 has J = 20 items. Null: treatment has no effect on any of the J dimensions of measurement.
SLIDE 55 Combining functions
Let λ be a J-vector of statistics such that the distribution of λj is known if hypothesis H0j is true. Assume smaller values of λj are stronger evidence against H0j; e.g., λj might be the P-value of H0j for some test.
φ : [0, 1]^J → ℜ; λ = (λ1, . . . , λJ) ↦ φ(λ), s.t.:
- φ is non-increasing in every argument, i.e., φ(. . . , λj, . . .) ≥ φ(. . . , λ′j, . . .) if λj ≤ λ′j, j = 1, . . . , J.
- φ attains its maximum if any of its arguments equals 0.
- φ attains its minimum if all of its arguments equal 1.
SLIDE 56 Examples of combining functions
- Fisher: φF ≡ −2 Σ_{j=1}^J ln(λj)
- Liptak: φL ≡ Σ_{j=1}^J Φ−1(1 − λj), where Φ−1 is the inverse standard normal CDF
- Tippett: φT ≡ max_{1≤j≤J}(1 − λj)
- Direct combination: φD ≡ Σ_{j=1}^J fj(λj), where {fj} are suitable decreasing functions. E.g., if λj is the P-value for H0j corresponding to some test statistic ψj for which larger values are stronger evidence against H0j, could use φD = Σj ψj.
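A sketch of these combining functions (the names follow the standard NPC literature; `lam` is a list of item-level P-values):

```python
from math import log
from statistics import NormalDist

def fisher(lam):
    # larger when the P-values are smaller
    return -2 * sum(log(p) for p in lam)

def liptak(lam):
    inv = NormalDist().inv_cdf   # inverse standard normal CDF
    return sum(inv(1 - p) for p in lam)

def tippett(lam):
    return max(1 - p for p in lam)
```

Each is non-increasing in every argument, as the previous slide requires.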
SLIDE 57
Nonparametric combination tests (NPC)
Reallocate subjects K times; K + 1 allocations in all (the original allocation is k = 0).
ψ(k): J-vector of test statistics applied to the kth allocation.
ψj(k): test statistic for dimension j for the kth allocation.
SLIDE 58
[ψj(k)], j = 1, . . . , J; k = 0, . . . , K, is a J × (K + 1) matrix.
Columns correspond to random allocations of subjects; rows correspond to dimensions of measurement.
Transform: Pj(k) ≡ #{ℓ ∈ {0, . . . , K} : ψj(ℓ) ≥ ψj(k)} / (K + 1),
the simulated upper-tail probability of the kth observed value of the jth test statistic under the null. Entries are between 1/(K + 1) and 1; smaller entries are stronger evidence against the null (smaller item-level P-values).
SLIDE 59
Apply the combining function to each column of J numbers. This yields K + 1 numbers, f(k), k = 0, . . . , K, one for each allocation. The overall “Non-Parametric Combination Test” (NPC) P-value is
PNPC ≈ #{k ∈ {0, . . . , K} : f(k) ≥ f(0)} / (K + 1).
The test is exact if all allocations are equally likely; otherwise, it is approximate but conservatively biased.
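The whole NPC recipe can be sketched end to end. The helper names, the two toy dimensions, and the sign-flip reallocation below are illustrative assumptions, not from the slides; the structure (statistic matrix, tail-probability transform, combining function, comparison of f(k) to f(0)) follows the slides:

```python
import random
from math import log

def npc_p(data, test_stats, reallocate, combine, K=1000, seed=0):
    # original allocation is k = 0; K random reallocations follow
    rng = random.Random(seed)
    allocs = [data] + [reallocate(data, rng) for _ in range(K)]
    # J x (K+1) matrix of test statistics
    psi = [[stat(a) for a in allocs] for stat in test_stats]
    # Pj(k): simulated upper-tail probability of psi_j(k) within its row
    P = [[sum(x >= row[k] for x in row) / (K + 1) for k in range(K + 1)]
         for row in psi]
    # combine each column of J tail probabilities
    f = [combine([P[j][k] for j in range(len(test_stats))])
         for k in range(K + 1)]
    return sum(fk >= f[0] for fk in f) / (K + 1)

# toy usage: two dimensions summarizing the paired rat differences,
# Fisher's combining function, sign-flip reallocation under the strong null
diffs = [32, 33, 16, 6, 21, 17, 64, 7, 89, -2, 11]
flip = lambda d, rng: [x * rng.choice((1, -1)) for x in d]
stats = [lambda d: sum(d) / len(d),                  # mean difference
         lambda d: sum(x > 0 for x in d) / len(d)]   # fraction positive
fisher = lambda lams: -2 * sum(log(l) for l in lams)
p = npc_p(diffs, stats, flip, fisher)
```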