Hypothesis testing, part 2



slide-1
SLIDE 1

1

Hypothesis testing, part 2

With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

slide-2
SLIDE 2

2

CATEGORICAL IV, NUMERIC DV

slide-3
SLIDE 3

3

Independent samples, one IV

# Conditions    Normal/Parametric    Non-parametric
Exactly 2       T-test               Mann-Whitney U, bootstrap
2+              One-way ANOVA        Kruskal-Wallis, bootstrap

slide-4
SLIDE 4

4

Is your data normal?

  • Skewness: asymmetry
  • Kurtosis: “peakedness” rel. to normal

– Both: within +- 2SE(s/u) is OK

  • Or use Shapiro-Wilk (null = normal)
  • Or look at Q-Q plot
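The skewness/kurtosis heuristic above can be sketched in plain Python. This is a minimal illustration, not the course's own code: the function names are made up, the moments are simple population moments, and the standard errors use the common large-sample approximations sqrt(6/n) and sqrt(24/n), which is an assumption on top of the slide.

```python
import math

def skew_kurtosis(data):
    """Return (skewness, excess kurtosis) from population moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    # Excess kurtosis subtracts 3 so that a normal distribution scores 0.
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def roughly_normal(data):
    """Heuristic: both statistics fall within +-2 standard errors of 0."""
    n = len(data)
    skew, kurt = skew_kurtosis(data)
    se_skew = math.sqrt(6.0 / n)   # large-sample approximation (assumption)
    se_kurt = math.sqrt(24.0 / n)  # large-sample approximation (assumption)
    return abs(skew) <= 2 * se_skew and abs(kurt) <= 2 * se_kurt

symmetric = [1, 2, 3, 4, 5]
one_big_outlier = [1] * 20 + [100]
print(roughly_normal(symmetric), roughly_normal(one_big_outlier))
```

A heuristic like this is only a screen; a Q-Q plot or Shapiro-Wilk test is still worth looking at for small samples.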
slide-5
SLIDE 5

5

T-test

  • Already talked about
  • Assumptions: normality, equal variances, independent samples
    – Can use Levene to test equal variance assumption
  • Post-test: check residuals for assumption fit
    – For a t-test this is the same pre or post
    – For other tests you check residual vs. fit post

slide-6
SLIDE 6

6

One way ANOVA

  • H0: m1 = m2 = m3
  • H1: at least one doesn’t match
  • NOT H1: m1 != m2 != m3
  • Assumptions: normality, common variance, independent errors
  • Intuition: F statistic
    – Variance between / Variance within
    – Under the (exact) null, F is approximately 1; F >> 1 rejects null

slide-7
SLIDE 7

7

One-way ANOVA

  • F = MSb / MSw
  • MSw = sum [sum[ (diff from condition mean)^2 ]] / dfw
    – dfw = N-k, where k = number of conditions
    – Sum over all conditions; sum per condition
  • MSb = sum [(condition mean - grand mean)^2] / dfb
    – dfb = k-1
    – Every observation goes in the sum
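The F computation above can be written out directly. The sketch below is illustrative only (the function name and the toy data are invented); it computes the statistic but not the p-value, which would need an F-distribution table or a statistics library.

```python
def one_way_anova_f(groups):
    """Return F = MSb / MSw for a list of samples, one list per condition."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Every observation contributes its condition's squared distance
        # from the grand mean, hence the factor len(g).
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in g)

    ms_between = ss_between / (k - 1)        # dfb = k - 1
    ms_within = ss_within / (n_total - k)    # dfw = N - k
    return ms_between / ms_within

# Toy data: three conditions, three observations each.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

With these toy numbers the between-condition sum of squares is 26 (df 2) and the within-condition sum of squares is 6 (df 6), so F is 13.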

slide-8
SLIDE 8

8

(example from Vibha Sazawal)

slide-9
SLIDE 9

9

slide-10
SLIDE 10

10

[Figure: F-distribution, with the rejection region labeled “rejected”]

slide-11
SLIDE 11

11

Now what? (Contrasts)

  • So we rejected the null. What did we learn?
    – What *didn’t* we learn?
    – At least one is different ... Which? All?
    – This is called an “omnibus test”
  • To answer our actual research question, we usually need pairwise contrasts

slide-12
SLIDE 12

12

The trouble with contrasts

  • Contrasts mess with your Type I bounds
    – One test: 95% confident
    – Three tests: 85.7% confident
    – 5 conditions, all pairs: 4 + 3 + 2 + 1 = 10 tests: 59.9%
    – UH OH
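The percentages above follow from multiplying the per-test confidence, which assumes the tests are independent. A small sketch (function names are illustrative):

```python
from math import comb

def familywise_confidence(num_tests, alpha=0.05):
    """Chance of making no Type I error across independent tests."""
    return (1 - alpha) ** num_tests

def all_pairs(num_conditions):
    """Number of pairwise comparisons among the conditions."""
    return comb(num_conditions, 2)

for m in (1, 3, all_pairs(5)):
    print(m, round(100 * familywise_confidence(m), 1))
```

This reproduces 95%, 85.7%, and 59.9% for 1, 3, and 10 tests.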

slide-13
SLIDE 13

13

Planned vs. post hoc

  • Planned: You have a theory.
    – Really, no cheating
    – You get n-1 pairwise comparisons for free
    – In theory, should not be control vs. all, but prob. OK
    – NO COMPARISONS unless omnibus passes
  • Post-hoc
    – Anything unplanned
    – More than n-1
    – Requires correction!
    – Doesn’t necessarily require omnibus first

slide-14
SLIDE 14

14

Correction

  • Adjust {p-values, alpha} to compensate for multiple testing post-hoc
  • Bonferroni (most conservative)
    – Assume all possible pairs: m = k(k-1)/2 (comb.)
    – alpha_c = alpha / m
    – Once you have looked, implication is you did all the comparisons implicitly!
  • Holm-Bonferroni is less conservative
    – Stepwise adjusting alpha as you go
  • Dunnett for specifically all vs. control, others
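To make the difference between the two corrections concrete, here is a minimal sketch that adjusts alpha (rather than the p-values). The function names and the example p-values are invented for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm-Bonferroni: test the smallest p first, relaxing alpha each step."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step thresholds are alpha/m, alpha/(m-1), ..., alpha/1.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p_values = [0.01, 0.02, 0.04]  # made-up p-values from three contrasts
print(bonferroni(p_values))    # [True, False, False]
print(holm(p_values))          # [True, True, True]
```

Holm never rejects fewer hypotheses than Bonferroni on the same p-values, which is why it is described as less conservative.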
slide-15
SLIDE 15

15

Independent samples, one IV

# Conditions    Normal/Parametric    Non-parametric
Exactly 2       T-test               Mann-Whitney U, bootstrap
2+              One-way ANOVA        Kruskal-Wallis, bootstrap

slide-16
SLIDE 16

16

Non-parametrics: MWU and K-W

  • Good for non-normal data, Likert data (ordinal, not actually numeric)
  • Assumptions: independent, at least ordinal
  • Null: P(X > Y) = P(Y > X) where X,Y are observations from the 2 distributions (MWU)
    – If assume same distribution shape, continuous, then this can be seen as comparing medians

slide-17
SLIDE 17

17

MWU and K-W continued

  • Essentially: rank order all data (both conditions)
    – Total ranks for condition 1, compare to “expected”
    – Various procedures to correct for ties
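The ranking step can be sketched as follows. This is an illustrative implementation for the two-condition (Mann-Whitney) case only: ties get the average of the ranks they span, and the function returns the U statistic for the first condition next to its expected value under the null. Names are made up, and no p-value is computed.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for x, expected U under the null)."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    expected = len(x) * len(y) / 2
    return u_x, expected

print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # (0.0, 4.5): x is always smaller
print(mann_whitney_u([1, 3, 5], [2, 4, 6]))  # (3.0, 4.5): nearly interleaved
```

A U far from its expected value in either direction is evidence against the null; Kruskal-Wallis applies the same ranking idea to more than two conditions.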

slide-18
SLIDE 18

18

Bootstrap

  • Resampling technique(s)
  • Intuition:
    – Create “null” distribution by e.g. subtracting means so mA = mB = 0
  • Now you have shifted samples A-hat and B-hat
    – Combine these to make a null distribution
    – Draw sample of size N, with replacement
  • Do it 1000 (or 10k) times
    – Use this to determine critical value (alpha = 0.05)
    – Compare this critical value to your real data for test
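One way to turn the recipe above into code is sketched below. The choices here are assumptions, not the only option: the test statistic is the absolute difference in means, the pooled shifted data serve as the null population, and the function name and seed are arbitrary.

```python
import random

def bootstrap_mean_diff_test(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Return (observed |mean diff|, bootstrap critical value, reject?)."""
    rng = random.Random(seed)
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    observed = abs(mean_a - mean_b)

    # Shift each sample to mean 0 (A-hat, B-hat) and pool them.
    pooled = [x - mean_a for x in a] + [x - mean_b for x in b]

    null_stats = []
    for _ in range(n_resamples):
        # Draw fake samples of the original sizes, with replacement.
        fake_a = [rng.choice(pooled) for _ in a]
        fake_b = [rng.choice(pooled) for _ in b]
        diff = sum(fake_a) / len(fake_a) - sum(fake_b) / len(fake_b)
        null_stats.append(abs(diff))

    null_stats.sort()
    index = min(n_resamples - 1, int(round((1 - alpha) * n_resamples)))
    critical = null_stats[index]
    return observed, critical, observed > critical

a = [10, 11, 12, 13, 14]
b = [20, 21, 22, 23, 24]
print(bootstrap_mean_diff_test(a, b))
```

In this toy example the observed difference is 10 while no resample from the pooled null data can differ by more than 4, so the null is rejected regardless of the seed.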

slide-19
SLIDE 19

19

Paired samples, one IV

# Conditions    Normal/Parametric                                            Non-parametric
Exactly 2       Paired T-test                                                Wilcoxon signed-rank
2+              2-way ANOVA w/ subject random factor; Mixed models (later)   Friedman

slide-20
SLIDE 20

20

Paired T-test

  • Two samples per participant / item
  • Test subtracts them
  • Then uses one-sample T-test with H0: m = 0 and H1: m != 0
  • Regular T-test assumptions, plus: does subtraction make sense here?
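The "subtract, then one-sample t-test" idea can be sketched as below. The function name and data are invented; the code returns the t statistic and its degrees of freedom and leaves the p-value lookup to a table or library.

```python
import math

def paired_t_statistic(before, after):
    """One-sample t statistic on the per-pair differences (H0: mean diff = 0)."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    standard_error = math.sqrt(var_d / n)
    return mean_d / standard_error, n - 1

t_stat, df = paired_t_statistic([1, 2, 3, 4], [2, 4, 5, 5])
print(t_stat, df)
```

Here the differences are 1, 2, 2, 1, giving a mean of 1.5, a standard error of about 0.289, and t of about 5.2 on 3 degrees of freedom.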

slide-21
SLIDE 21

21

Wilcoxon S.R. / Friedman

  • H0: difference btwn pairs is symmetric around 0
  • H1: … or not
  • Excludes no-change items
  • Essentially: rank by abs. difference; compare signs * ranks

  • (Friedman = 3+ generalization)
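A sketch of the signed-rank bookkeeping for the two-condition case follows. It drops no-change pairs, ranks the remaining absolute differences (ties share an average rank), and sums the ranks by sign. The function name is illustrative and no p-value is computed.

```python
def signed_rank_sums(before, after):
    """Return (W+, W-): rank sums of positive and negative differences."""
    # No-change pairs are excluded before ranking.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank for ties
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

print(signed_rank_sums([10, 12, 9, 11, 8], [12, 12, 8, 15, 11]))  # (9.0, 1.0)
```

Under the null the two sums should be similar; a large imbalance like 9 versus 1 points toward a systematic shift, though with only four usable pairs the test has very little power.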
slide-22
SLIDE 22

22

SIMPLE LINEAR REGRESSION

One numeric IV, numeric DV

slide-23
SLIDE 23

23

Simple linear regression

  • E(Y|x) = b0 + b1x … looks at populations
    – Population mean at this value of x
  • Key H0: b1 = 0
    – b0 usually not important for significance (obv. important in model fit)
  • b1 : slope -> change in Y per unit X
  • Best fit: Least squares, or maximum likelihood
    – LSq: minimize sum of squares of residuals
    – ML: max prob. of seeing this data with this model
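For a single predictor, the least-squares estimates have a closed form. A minimal sketch with an invented function name and toy data:

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx              # slope: change in Y per unit X
    b0 = mean_y - b1 * mean_x   # intercept: line passes through the means
    return b0, b1

print(least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```

With normally distributed, constant-variance errors, the maximum likelihood estimates of b0 and b1 coincide with these least-squares values.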

slide-24
SLIDE 24

24

Assumptions, caveats

  • Assumes:
    – linearity in Y ~ X
    – normally distributed error for each x, with constant variance at all x
    – Error measuring X is small compared to var. Y (fixed X)
  • Independent errors!
    – Serial correlation, data that is grouped, etc. (later)
  • Don’t interpret widely outside available x vals
  • Can transform for linearity!
    – Log(Y), sqrt(y), 1/y, y^2

slide-25
SLIDE 25

25

Assumption/residual checking

  • Before: Use scatterplot for plausible linearity
  • After: residual vs. fit
    – Residual on Y vs. predicted on X
    – Should be relatively evenly distributed around 0 (linear)
    – Should have relatively even vertical spread (eq. var)
  • After: quantile-normal of residuals
slide-26
SLIDE 26

26

Model interpretation

  • Interpret b1, interpret the p-value
  • CI: if it crosses 0, it’s not significant
  • R^2: fraction of total variation accounted for
    – Intuitively: explained variance / total variance
    – Explained = var(Y) – residual errors
  • f^2 = R^2 / (1 – R^2); small/medium/large: 0.02, 0.15, 0.35 (Cohen)
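The two effect-size quantities can be computed from observed and fitted values. The sketch below uses R^2 = 1 - SS_residual / SS_total, which matches "explained / total" for a least-squares fit with an intercept; the function names and numbers are illustrative.

```python
def r_squared(observed, predicted):
    """Fraction of total variation accounted for by the fitted values."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total

def cohen_f2(r2):
    """Cohen's f^2 effect size from R^2."""
    return r2 / (1 - r2)

def r2_from_f2(f2):
    """Inverse conversion: R^2 from f^2."""
    return f2 / (1 + f2)

r2 = r_squared([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5])
print(r2, cohen_f2(r2))
```

In this made-up example the residual sum of squares is 1 and the total is 5, so R^2 is 0.8 and f^2 is 4, far beyond the 0.35 "large" threshold.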
slide-27
SLIDE 27

27

Robustness

  • Brittle to linearity, independent errors
  • Somewhat brittle to fixed-X
  • Fairly robust to equal variance
  • Quite robust to normality
slide-28
SLIDE 28

28

CATEGORICAL OUTCOMES

slide-29
SLIDE 29

29

One Cat. IV, Cat. DV, independent

  • Contingency tables: how many people in each combination of categories

slide-30
SLIDE 30

30

Chi-square test of independence

  • H0: distribution of Var1 is the same at every level of Var2 (and vice versa)
    – Null dist. approaches X^2 when sample size grows
    – Heuristic: no cells < 5
    – Can use Fisher’s exact test (FET) instead
  • Intuition:
    – Sum over rows/columns: (observed – expected)^2 / expected
    – Expected: marginal % * count in other margin
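The statistic described above can be computed from a table of counts in a few lines. This is a sketch with an invented function name; it returns the statistic and degrees of freedom, not the p-value.

```python
def chi_square_independence(table):
    """Return (X^2 statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count: row's marginal share times the column total.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

print(chi_square_independence([[10, 20], [20, 10]]))
print(chi_square_independence([[10, 20], [20, 40]]))
```

The first table has every expected count equal to 15, giving a statistic of 20/3 (about 6.67) on 1 df; the second table has identical proportions in both rows, so the statistic is 0.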

slide-31
SLIDE 31

31

Paired 2x2 tables

  • Use McNemar’s test
    – Contingency table: matches and mismatches for each option.
  • H0: marginals are the same
  • Essentially a X^2 test on the agreement
    – Test stat: (b-c)^2 / (b+c)

              Cond1: Yes   Cond1: No
Cond2: Yes    a            b            a + b
Cond2: No     c            d            c + d
              a + c        b + d        N
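A sketch of the test statistic follows, using only the two mismatch cells b and c from the table. The p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper tail equals erfc(sqrt(x / 2)); no continuity correction is applied, and the function name and counts are illustrative.

```python
import math

def mcnemar(b, c):
    """Return (test statistic, p-value) from the two mismatch counts."""
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p_value = mcnemar(15, 5)
print(stat, round(p_value, 4))
```

With 15 mismatches one way and 5 the other, the statistic is 5.0 and the p-value is roughly 0.025. The matching cells a and d do not enter the statistic at all.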

slide-32
SLIDE 32

32

Paired, continued

  • Cochran’s Q: extended for more than two conditions

  • Other similar extensions for related tasks
slide-33
SLIDE 33

33

Critiques

  • Choose a paper that has one (or more) empirical experiments as a central contribution
    – Doesn’t have to be human subjects, but can be
    – Does have to have enough description of experiment
  • 10-12 minute presentation
  • Briefly: research questions, necessary background
  • Main: describe and critique methods
    – Experimental design, data collection, analysis
    – Good, bad, ugly, missing

  • Briefly, results?
slide-34
SLIDE 34

34

Logistic regression (logit)

  • Numeric IV, binary DV (or ordinal)
  • log( E(Y) / (1-E(Y)) ) = log( Pr(Y=1) / Pr(Y=0) ) = b0 + b1x
  • Log odds of success = linear function
    – Odds: 0 to inf., 1 is the middle
    – e.g.: odds = 5 = 5:1 … for five successes, one fail
    – Log odds: -inf to inf w/ 0 in the middle: good for regression
  • Modeled as binomial distribution
slide-35
SLIDE 35

35

Interpreting logistic regression

  • Take exp(coef) to get interpretable odds.
  • For each unit increase in x, odds are multiplied by exp(b1)
    – Note that this can make small coefs important!
  • Use e.g., Hosmer-Lemeshow test for goodness of fit (null = data fit the model)
    – But not a lot of power!
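The link between coefficients, probabilities, and odds can be checked numerically. The coefficients below are made up purely for illustration, and the function names are not from any library.

```python
import math

def odds(p):
    """Odds of success for a probability p."""
    return p / (1 - p)

def predicted_probability(b0, b1, x):
    """Invert the log-odds: Pr(Y=1) for log odds = b0 + b1 * x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

b0, b1 = -1.0, 0.5  # hypothetical fitted coefficients
p_at_2 = predicted_probability(b0, b1, 2)
p_at_3 = predicted_probability(b0, b1, 3)

# Moving x up by one unit multiplies the odds by exp(b1).
odds_ratio = odds(p_at_3) / odds(p_at_2)
print(p_at_2, odds_ratio, math.exp(b1))
```

At x = 2 the log odds are 0, so the probability is 0.5 and the odds are 1; one unit later the odds are exp(0.5), about 1.65. The multiplicative effect on odds is constant, while the change in probability depends on where you start.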

slide-36
SLIDE 36

36

MULTIVARIATE

slide-37
SLIDE 37

37

Multiple regression

  • Linear/logistic regression with more variables!
    – At least one numeric, 0+ categorical
  • Still: fixed x, normal errors w/ equal variance, independent errors (linear)
  • Linear relationship in E(Y) and one x, when other inputs held constant
    – Effects of each x are independent!
  • Still check q-n of residuals, residual vs. fit
slide-38
SLIDE 38

38

Model selection

  • Which covariates to keep? (more on this in a bit)
slide-39
SLIDE 39

39

Adding categorical vars

  • Indicator variables (everything is 0 or 1)
  • Need one fewer indicator than conditions
    – One condition is true; or none are true (baseline)
    – Coefs are *relative to baseline*!
  • Model selection: keep all or none for one factor
  • Called “ANCOVA” when at least one each numeric + categorical
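Indicator coding can be sketched in a few lines. The function name, the condition labels, and the choice of "control" as baseline are all illustrative.

```python
def indicator_columns(values, baseline):
    """Return (non-baseline levels, one 0/1 row per observation)."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

levels, rows = indicator_columns(["control", "A", "B", "A"], baseline="control")
print(levels)  # ['A', 'B']
print(rows)    # [[0, 0], [1, 0], [0, 1], [1, 0]]
```

Three conditions yield two indicators; the baseline is the row of all zeros, so each coefficient estimates that condition's difference from the baseline.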

slide-40
SLIDE 40

40

Interaction

  • What if your covariates *aren’t* independent?
  • E(Y) = b0 + b1x1 + b2x2 + b12x1x2
    – Slope for x1 is diff. for each value of x2
  • Superadditive: all in same direction, interaction makes effects stronger
  • Subadditive: interaction is in opposite direction
  • For indicator vars, all or none
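Rearranging the interaction model as E(Y) = b0 + (b1 + b12*x2)*x1 + b2*x2 shows why the slope for x1 depends on x2. A small numeric sketch with made-up coefficients:

```python
def predicted_mean(x1, x2, b0, b1, b2, b12):
    """E(Y) for a two-predictor model with an interaction term."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

def slope_for_x1(x2, b1, b12):
    """Change in E(Y) per unit of x1 at a given value of x2."""
    return b1 + b12 * x2

coefs = dict(b0=1.0, b1=2.0, b2=3.0, b12=0.5)  # hypothetical values
for x2 in (0, 4):
    step = predicted_mean(1, x2, **coefs) - predicted_mean(0, x2, **coefs)
    print(x2, step, slope_for_x1(x2, coefs["b1"], coefs["b12"]))
```

With these numbers the slope for x1 is 2 when x2 = 0 and 4 when x2 = 4. Because b1 and b12 have the same sign, this is the superadditive case.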
slide-41
SLIDE 41

41

Model selection!

  • Which covariates to keep?
  • From theory
  • Keep interaction only if it’s significant?
    – If keep interaction, should keep corresponding mains
  • “Adjusted” R^2?
    – Regular R^2 always higher w/ more covars
  • BIC and AIC
    – Take model likelihood and penalize for more params
    – Abs value not interpretable; lower is better
  • All combinations? Stepwise?
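The penalties mentioned above have standard closed forms: AIC = 2k - 2 ln L, BIC = k ln(n) - 2 ln L, and adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below just evaluates them; the log-likelihood and sample sizes are invented inputs, not fitted from data.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike information criterion: lower is better."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, n):
    """Bayesian information criterion: penalty grows with sample size n."""
    return num_params * math.log(n) - 2 * log_likelihood

def adjusted_r2(r2, n, num_predictors):
    """R^2 penalized for the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - num_predictors - 1)

print(aic(-100.0, 3))             # 206.0
print(bic(-100.0, 3, 100))
print(adjusted_r2(0.8, 11, 2))
```

Only differences between models fitted to the same data are meaningful. BIC penalizes extra parameters more heavily than AIC once n exceeds about 7, so it tends to prefer smaller models.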
slide-42
SLIDE 42

42

THINGS WE ARE ONLY GOING TO MENTION BRIEFLY

Know they exist; look them up if relevant

slide-43
SLIDE 43

43

Multi-way ANOVA

  • >1 cat IVs, 1 numeric DV
  • Normality, equal variance, indep. errors
  • With interaction: every combo of factor levels has its own population mean
  • Without interaction (additive): change in one var consistent at all fixed vals for others

  • Works basically like standard ANOVA, etc.
slide-44
SLIDE 44

44

Mixed models regression

  • Explicitly model correlations in data
  • Fixed effects: affect outcome for everyone
  • Random effects: deviations per data item, don’t want to model individually
  • Simplest example: repeated measures
    – Y ~ b0 + b1x1 + b2x2 …. + random ID intercept
    – Each participant has their own intercept adjustment

slide-45
SLIDE 45

45

POWER ANALYSIS

slide-46
SLIDE 46

46

What is power?

  • Null distribution: designed so that we’d only see a test statistic this extreme 5% of the time

  • This bounds type I but not type II
  • Power = 1 – type II error rate
  • Heuristic: 80% is “good enough”
slide-47
SLIDE 47

47

Alternative scenarios

  • One null, but infinitely many alternatives!
  • Alternative distribution: given some n, underlying variance, underlying diff. in pop. means, what is the distribution of test statistic
  • You know the critical value, so tells you how often your p will be above 0.05 when the “true” scenario is as you model

slide-48
SLIDE 48

48

Calculating power

  • A priori, to think about sample size and judge value of experiment
  • Inherently requires estimating the alternative scenario!
    – Maybe try a few
  • Statistic-specific, but in general:
    – Sample size, effect size, power, alpha
  • “Consider the smallest effect size that you consider interesting and try to achieve reasonable power for that effect size”

slide-49
SLIDE 49

49

Example from Seltman book

  • F statistic (ANOVA)
  • 3 treatments
  • 50 people each
  • Red: sigma = 10, means: 10, 12, 14
  • Blue: sigma = 10, means: 10, 13, 16
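Power for the two scenarios above can be estimated by simulation instead of reading it off the plotted distributions. The sketch below is an assumption-laden illustration, not the book's method: it simulates the null to get a critical F value, then counts how often each alternative scenario exceeds it. Function names, the seed, and the number of simulations are arbitrary.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F = MS_between / MS_within."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        m = sum(g) / len(g)
        ss_between += len(g) * (m - grand_mean) ** 2
        ss_within += sum((x - m) ** 2 for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def simulate_f(means, sigma, n_per_group, rng):
    """Draw one normally distributed data set and return its F statistic."""
    groups = [[rng.gauss(mu, sigma) for _ in range(n_per_group)]
              for mu in means]
    return f_statistic(groups)

def estimate_power(means, sigma=10.0, n_per_group=50,
                   n_sims=1000, alpha=0.05, seed=1):
    rng = random.Random(seed)
    # Null scenario: all population means are equal.
    null_stats = sorted(simulate_f([0.0] * len(means), sigma, n_per_group, rng)
                        for _ in range(n_sims))
    critical = null_stats[int(round((1 - alpha) * n_sims)) - 1]
    # Alternative scenario: how often does F exceed the critical value?
    hits = sum(simulate_f(means, sigma, n_per_group, rng) > critical
               for _ in range(n_sims))
    return hits / n_sims

power_red = estimate_power([10, 12, 14])
power_blue = estimate_power([10, 13, 16])
print(power_red, power_blue)
```

The blue scenario has the larger spread between means at the same sigma and sample size, so its estimated power comes out higher than the red one. Estimates from 1000 simulations are noisy, so treat the exact values as approximate.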

slide-50
SLIDE 50

50

Promoting power

  • (Review from earlier)
  • Raise sample size; reduce variance; aim for bigger effects

slide-51
SLIDE 51

51

Walkthrough: linear regression

  • u = model df -> number of params
  • v = F-test error df -> N – u – 1
  • f^2 = r^2 / (1 – r^2) … r^2 = f^2 / (1 + f^2)
slide-52
SLIDE 52

52

Retrospective power

  • Somewhat controversial
  • Calculate observed effect size, then determine what sample size would be needed
    – Whole new experiment, not just collect more
  • Not a good idea:
    – We didn’t find a significant effect, but if we had studied 12 more people …