

slide-1
SLIDE 1

P-values, Randomization Tests, and Nonparametric Combinations of Tests

Tonix Virtual Retreat

Philip B. Stark 22 October 2020

University of California, Berkeley 1

slide-2
SLIDE 2

Randomized experiments

  • Subjects recruited at one or more centers
  • Criteria to ensure they have the condition
  • Randomized to treatment/control or treatment level, sometimes with constraints or “bias” to get balance
  • Randomization algorithms often proprietary

2

slide-3
SLIDE 3

Analyzing the data

  • Common to use things like ANOVA, t-tests, regression, logistic regression
  • Assumptions generally have nothing to do with the experiment

3

slide-4
SLIDE 4

Small example

11 pairs of rats, each pair from the same litter. Randomly (by coin toss) put one of each pair into an “enriched” environment; the other sib gets a “normal” environment. After 65 days, measure cortical mass (mg).

    enriched      689  656  668  660  679  663  664  647  694  633  653
    impoverished  657  623  652  654  658  646  600  640  605  635  642
    diff           32   33   16    6   21   17   64    7   89   -2   11

Cartoon of Rosenzweig, M.R., E.L. Bennett, and M.C. Diamond, 1972. Brain changes in response to experience, Scientific American, 226, 22–29.

4

slide-5
SLIDE 5

Informal Hypotheses

Null hypothesis: treatment has “no effect.” Alternative hypothesis: treatment increases cortical mass. Suggests a 1-sided test for an increase.

5


slide-8
SLIDE 8

Test contenders

  • 2-sample Student t-test:

    (mean(treatment) − mean(control)) / (pooled estimate of the SD of the difference of means)

  • 1-sample Student t-test on the differences:

    mean(differences) / (SD(differences) / √11)

  • randomization test using the t-statistic of the differences: same statistic, calibrate the probability differently

6


slide-12
SLIDE 12

The Neyman “ticket” model (1930)

  • S subjects, T treatments
  • subject s is represented by a ticket with T numbers on it, xs1, . . . , xsT, set before treatment is assigned (but unknown to the experimenter):

        resp to tx 1    resp to tx 2    · · ·    resp to tx T
             4               9.2        · · ·        3.33

  • xst is the response subject s will have if assigned treatment t
  • if subject s is assigned to treatment t, observe xst
  • no necessary connection of the numbers across subjects
  • no assumption about the distribution of the numbers
  • “non-interference” implicit

7

slide-13
SLIDE 13

Generalizations

  • subject s is represented by a ticket with T J-vectors on it, xs1, . . . , xsT
  • if subject s is assigned treatment ts, observe the vector xsts

        item    resp to tx 1    resp to tx 2    · · ·    resp to tx T
          1          4               9.2        · · ·        3.33
          2          2               1          · · ·        17
         ...
          J          5               42         · · ·        9

8

slide-14
SLIDE 14

More generalizations

  • subject s represented by a ticket with T probability distributions on it, Fs1, . . . , FsT.
  • if subject s is assigned treatment t, observe a draw from Fst
  • Fst could be a multivariate distribution

        resp to tx 1    resp to tx 2    · · ·    resp to tx T
           F11(·)          F12(·)       · · ·       F1T(·)

9

slide-15
SLIDE 15

Generic notation

xst could be a scalar, a vector, or a realization of a random variable or random vector. ψ(·) is a test statistic: it maps the data x to a scalar

10


slide-18
SLIDE 18

The strong null hypothesis

  • “treatment doesn’t matter at all”
  • subject s’s response would have been the same, no matter what treatment was

assigned

  • xs1 = xs2 = · · · = xsT
  • (but xst is not necessarily equal to xrt for r ≠ s)

        resp to tx 1    resp to tx 2    · · ·    resp to tx T
             4               4          · · ·        4

11

slide-19
SLIDE 19
  • if the null is true, we know what would have been observed had the random assignment been different: every subject would have had the same response
  • induces a null distribution for any test statistic ψ
  • completely determined by the randomization: no additional assumptions

12

slide-20
SLIDE 20

The rats: strong null

Treatment has no effect, as if each rat’s cortical mass were determined before randomization. Then it is equally likely that the rat with the heavier cortex was assigned to treatment or to control, independently across littermate pairs. Gives 2^11 = 2048 equally likely possibilities:

    ±32 ±33 ±16 ±6 ±21 ±17 ±64 ±7 ±89 ±2 ±11

13

slide-21
SLIDE 21

Alternative hypotheses

  1. Individual’s response depends only on that individual’s assignment
     • Special cases: shift, scale, etc.
  2. Interactions/interference: my response could depend on your treatment

14

slide-22
SLIDE 22

Assumptions of the tests

  1. 2-sample t-test:
     • masses are an iid sample from a normal distribution, same unknown variance, same unknown mean
     • tests the “weak” null hypothesis (plus normality, independence, non-interference, etc.)
  2. 1-sample t-test on the differences:
     • mass differences are an iid sample from a normal distribution, unknown variance, zero mean
     • tests the “weak” null hypothesis (plus normality, independence, non-interference, etc.)
  3. randomization test:
     • randomization performed as claimed
     • tests the strong null hypothesis

Assumptions of the randomization test are true by fiat.

15


slide-24
SLIDE 24

Student t-test calculations

Mean of differences: 26.73 mg
Sample SD of differences: 27.33 mg
t-statistic: 3.244 ≡ t0
P-value for the 2-sided t-test: 0.0088

  • Why would cortical weights have a normal distribution?
  • Why would the variance of the difference between treatment and control be the same for different litters?
  • Treatment and control are dependent, because assigning a rat to treatment excludes it from the control group, and vice versa.
  • The P-value depends on assuming the differences are an iid sample from a normal distribution.
  • If we reject the null, is that because there is a treatment effect, or because the other assumptions are wrong?

16

slide-25
SLIDE 25

Randomization t-test calculations

Could enumerate all 2^11 = 2,048 equally likely possibilities. Calculate the t-statistic for each. The P-value is (# possibilities s.t. t ≥ t0)/2048 ≈ 0.0018.

17
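The enumeration just described is easy to sketch in Python (not the speaker’s code; the 11 pair differences are read off the earlier data slide, with the tenth pair’s difference 633 − 635 = −2):

```python
from itertools import product
from math import sqrt

# treatment-minus-control cortical-mass differences for the 11 litter pairs
diffs = [32, 33, 16, 6, 21, 17, 64, 7, 89, -2, 11]
n = len(diffs)

def t_stat(d):
    """One-sample t-statistic: mean(d) / (SD(d) / sqrt(n))."""
    m = sum(d) / n
    var = sum((x - m) ** 2 for x in d) / (n - 1)
    return m / (sqrt(var) / sqrt(n))

t0 = t_stat(diffs)  # observed t-statistic, about 3.244

# Under the strong null, each pair keeps the magnitude of its difference but
# the sign is a fair coin flip, independently: 2**11 = 2048 equally likely
# tables. Count how many give a t-statistic at least as large as observed.
mags = [abs(d) for d in diffs]
count = sum(
    1
    for signs in product([1, -1], repeat=n)
    if t_stat([s * m for s, m in zip(signs, mags)]) >= t0 - 1e-9  # float ties
)
p_value = count / 2 ** n
print(f"t0 = {t0:.3f}, randomization P-value = {p_value:.4f}")
```

Because every sign assignment leaves the sum of squared differences unchanged, the t-statistic here is a monotone function of the sum of the signed differences, so the enumeration could be done on sums alone; the brute-force loop keeps the sketch closer to the slide.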

slide-26
SLIDE 26

18

slide-27
SLIDE 27

“Statistical procedure and experimental design are only two different aspects of the same whole, and that whole is the logical requirements of the complete process of adding to natural knowledge by experimentation.”

19

slide-28
SLIDE 28

“A Lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. · · · Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject

20

slide-29
SLIDE 29

Test statistic: number of correct IDs

8

4

= 70

21

slide-30
SLIDE 30

Test statistic: number of correct IDs

8

4

= 70 4

3

4

1

= 16

1/70 ≈ 0.014; (16 + 1)/70 ≈ 0.243

21

slide-31
SLIDE 31

Test statistic: number of correct IDs

    (8 choose 4) = 70
    (4 choose 3) × (4 choose 1) = 16

    1/70 ≈ 0.014;   (16 + 1)/70 ≈ 0.243

“At best the subject can judge rightly with every cup and, knowing that 4 are of each kind, this amounts to choosing, out of the 70 sets of 4 which might be chosen, that particular one which is correct. A subject without any faculty of discrimination would in fact divide the 8 cups correctly into two sets of 4 in one trial out of 70, or, more properly, with a frequency which would approach 1 in 70 more and more nearly the more often the test were repeated.”

21
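The counts on this slide can be checked with the standard library (`math.comb`, Python ≥ 3.8):

```python
from math import comb

total = comb(8, 4)                   # ways to choose which 4 cups are "milk first"
exactly_3 = comb(4, 3) * comb(4, 1)  # 3 of the milk-first cups right, 1 wrong
print(total, exactly_3)              # 70 16

p_perfect = 1 / total                   # P(all 4 correct) under the null
p_3_or_better = (exactly_3 + 1) / total # P(3 or more correct)
print(round(p_perfect, 3), round(p_3_or_better, 3))  # 0.014 0.243
```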

slide-32
SLIDE 32

“No such selection [of a significance level] can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly ‘significant,’ in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.”

22

slide-33
SLIDE 33

“Tests of significance are of many different kinds, which need not be considered here. Here we are only concerned with the fact that the easy calculation in permutations which we encountered, and which gave us our test of significance, stands for something present in every possible experimental arrangement; or, at least, for something required in its interpretation.”

23

slide-34
SLIDE 34

24

slide-35
SLIDE 35

25

slide-36
SLIDE 36

What’s a P-value?

Suppose X is a random variable such that P0{X ≤ p} ≤ p for all p ∈ [0, 1], where P0 is the probability distribution when the null hypothesis is true. Then the observed value of X is a P-value.

26


slide-38
SLIDE 38

Example: Lady Tasting Tea

    X = 1/70,   4 correct
        17/70,  3 correct
        53/70,  2 correct
        69/70,  1 correct
        1,      0 correct

Then P0{X ≤ p} ≤ p. X is a P-value.

27
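The defining property P0{X ≤ p} ≤ p can be verified by enumerating the null distribution of the number of correct identifications (a sketch, not from the slides):

```python
from math import comb

n_ways = comb(8, 4)  # 70 equally likely ways to split the cups under the null
# null probability of getting exactly m of the 4 milk-first cups right
prob = {m: comb(4, m) * comb(4, 4 - m) / n_ways for m in range(5)}
# X is the upper-tail probability of the observed count:
# its attainable values are 1/70, 17/70, 53/70, 69/70, 1
X = {m: sum(prob[k] for k in range(m, 5)) for m in range(5)}

# check P0{X <= p} <= p at every attainable value p
for p in X.values():
    cdf = sum(prob[m] for m in range(5) if X[m] <= p)
    assert cdf <= p + 1e-12
```

Here the inequality holds with equality at each attainable value, which is the best (least conservative) a discrete P-value can do.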

slide-39
SLIDE 39

If the P-value is p, either the null hypothesis is false or the null hypothesis is true and an event occurred that had chance no greater than p.

28

slide-40
SLIDE 40

Disconnect

The distribution used to “calibrate” P-values (i.e., to find P{X ≤ p}) for parametric tests typically used in RCTs has nothing to do with the experiment actually performed.

29


slide-45
SLIDE 45

Permutation tests

  • exploit invariance of the distribution of the data under the action of some group when the null hypothesis is true
  • generically, a “permutation” group, but it can be any group
  • every dataset in the orbit of the observed data is equally likely
  • in principle, can find the P-value by enumeration
  • too many in practice: use a (pseudo-)random sample of N “permutations”
  • think of hits/N as an approximation to the exact P-value
  • can use sequential methods to make inferences about the exact P-value
  • (hits + 1)/(N + 1) is an exact P-value for a randomized test

30
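A generic sampled permutation test might look like this (a sketch with a hypothetical helper, using the difference of group means as the test statistic and five pairs of the rat data as toy input):

```python
import random

def perm_p_value(x, y, stat, N=10_000, seed=0):
    """Sampled permutation test. Returns (hits + 1)/(N + 1), which is an
    exact P-value for the randomized test: the observed allocation itself
    counts as one of the N + 1 equally likely 'permutations'."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    t0 = stat(x, y)
    hits = 0
    for _ in range(N):
        rng.shuffle(pooled)  # uniformly random relabeling of the pooled data
        if stat(pooled[:len(x)], pooled[len(x):]) >= t0:
            hits += 1
    return (hits + 1) / (N + 1)

diff_means = lambda a, b: sum(a) / len(a) - sum(b) / len(b)
p = perm_p_value([689, 656, 668, 660, 679], [657, 623, 652, 654, 658], diff_means)
print(p)  # close to the exact enumeration value, 3/252 ≈ 0.012
```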


slide-51
SLIDE 51

Randomization tests

  • exploit the random assignment of subjects to treatments
  • null distribution of test statistic flows from method of random assignment
  • generally not analytically tractable, esp. if random assignment includes balancing
  • approximate by simulation: re-run the random assignment N times
  • hits/N is a approximation of P-value
  • can use sequential testing to make inferences about “true” P-value

31

slide-52
SLIDE 52

Test functions in randomization and permutation tests

  • Can use any test function you want, including functions that come from parametric methods such as regression, ANOVA, logistic regression, etc.
  • Calibrate P-values using the permutation or randomization distribution
  • Choose the test function to have power against the scientifically interesting alternative(s)

32

slide-53
SLIDE 53

Generic sketch

  • Pick test statistic
  • Collect data
  • Find/simulate null distribution of test statistic conditional on the observed data
  • P-value is tail probability of the test statistic under the null

33

slide-54
SLIDE 54

Multivariate tests and intersection tests

Generally measure more than one “response” per subject. E.g., CAPS-5 has J = 20 items. Null: treatment has no effect on any of the J dimensions of measurement.

34

slide-55
SLIDE 55

Combining functions

Let λ be a J-vector of statistics such that the distribution of λj is known when hypothesis H0j is true. Assume smaller values of λj are stronger evidence against H0j. E.g., λj might be the P-value of H0j for some test.

φ : [0, 1]^J → ℜ; λ = (λ1, . . . , λJ) → φ(λ), s.t.:

  • φ is non-increasing in every argument, i.e., φ(. . . , λj, . . .) ≥ φ(. . . , λ′j, . . .) if λj ≤ λ′j, j = 1, . . . , J.
  • φ attains its maximum if any of its arguments equals 0.
  • φ attains its minimum if all of its arguments equal 1.

35

slide-56
SLIDE 56
  • Fisher’s φF(λ) ≡ −2 ∑_{j=1}^J ln(λj)
  • Liptak’s φL(λ) ≡ ∑_{j=1}^J Φ−1(1 − λj), where Φ−1 is the inverse standard normal CDF
  • Tippett’s φT(λ) ≡ max_{j=1,…,J} (1 − λj)
  • Direct combination: φD ≡ ∑_{j=1}^J fj(λj), where {fj} are suitable decreasing functions. E.g., if λj is the P-value for H0j corresponding to some test statistic ψj for which larger values are stronger evidence against H0j, could use φD = ∑_j ψj.

36
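The first three combining functions are a few lines each (a sketch; `statistics.NormalDist` requires Python ≥ 3.8):

```python
import math
from statistics import NormalDist

def fisher(lam):
    """Fisher: -2 * sum of log partial P-values; large = strong evidence."""
    return -2 * sum(math.log(l) for l in lam)

def liptak(lam):
    """Liptak: sum of inverse-normal transforms of 1 - lambda_j."""
    inv = NormalDist().inv_cdf  # standard normal quantile function
    return sum(inv(1 - l) for l in lam)

def tippett(lam):
    """Tippett: max of 1 - lambda_j (driven by the smallest partial P)."""
    return max(1 - l for l in lam)

lams = [0.01, 0.20, 0.70]
print(fisher(lams), liptak(lams), tippett(lams))
```

All three are non-increasing in each argument, so larger combined values are stronger evidence against the intersection null; they differ in how much weight the smallest partial P-value carries.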

slide-57
SLIDE 57

Nonparametric combination tests (NPC)

Reallocate subjects K times; K + 1 allocations in all (the original allocation is k = 0).
ψ(k): J-vector of test statistics applied to the kth allocation.
ψj(k): test statistic for dimension j for the kth allocation.

37

slide-58
SLIDE 58

[ψj(k)], j = 1, . . . , J, k = 0, . . . , K, is a J by (K + 1) matrix.

Columns correspond to random allocations of the subjects. Rows correspond to dimensions of measurement.

Transform:

    Pj(k) ≡ #{ℓ ∈ {0, . . . , K} : ψj(ℓ) ≥ ψj(k)} / (K + 1),

the simulated upper-tail probability of the kth observed value of the jth test statistic under the null. Entries are between 1/(K + 1) and 1. Smaller entries are stronger evidence against the null (smaller item-level P-values).

38

slide-59
SLIDE 59

Apply the combining function to each column of J numbers. This yields K + 1 numbers, f(k), k = 0, . . . , K, one for each allocation. The overall “Non-Parametric Combination Test” (NPC) P-value is

    PNPC ≈ #{k ∈ {0, . . . , K} : f(k) ≥ f(0)} / (K + 1).

The test is exact if all allocations are equally likely; otherwise, it is approximate but conservatively biased.

39
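Putting the last three slides together, a self-contained sketch of NPC on synthetic data (toy sizes, an invented treatment effect, and Fisher’s combining function; `psi` stores allocations as rows rather than columns):

```python
import math
import random

random.seed(0)
S, J, K = 20, 3, 499          # subjects, endpoints, reallocations (toy sizes)

# synthetic data: first 10 subjects treated; treatment shifts every endpoint
labels = [1] * 10 + [0] * 10
data = [[random.gauss(2.0 if lab == 1 else 0.0, 1.0) for _ in range(J)]
        for lab in labels]

def stats(lab):
    """J-vector of test statistics: difference of group means per endpoint."""
    out = []
    for j in range(J):
        t = [data[s][j] for s in range(S) if lab[s] == 1]
        c = [data[s][j] for s in range(S) if lab[s] == 0]
        out.append(sum(t) / len(t) - sum(c) / len(c))
    return out

# allocation k = 0 is the observed one; k = 1..K are random reallocations
psi = [stats(labels)]
lab = labels[:]
for _ in range(K):
    random.shuffle(lab)
    psi.append(stats(lab))

# partial P-values: P_j(k) = #{l : psi_j(l) >= psi_j(k)} / (K + 1)
P = [[sum(1 for l in range(K + 1) if psi[l][j] >= psi[k][j]) / (K + 1)
      for j in range(J)] for k in range(K + 1)]

# Fisher's combining function per allocation, then the overall NPC P-value
f = [-2 * sum(math.log(p) for p in row) for row in P]
p_npc = sum(1 for k in range(K + 1) if f[k] >= f[0]) / (K + 1)
print(p_npc)
```

In a real trial the reallocations would re-run the actual (possibly constrained) assignment mechanism rather than a plain shuffle, and the endpoint statistics would be whatever has power against the alternatives of interest.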