STAT 113 Hypothesis Testing II, Colin Reimer Dawson, Oberlin College



SLIDE 1

Measuring the “Unlikelihood” of H0 Constructing a Randomization Distribution Decisions and Errors One vs. Two-Tailed

STAT 113 Hypothesis Testing II

Colin Reimer Dawson

Oberlin College

October 10, 2017

SLIDE 2

Outline

• Measuring the “Unlikelihood” of H0
• Constructing a Randomization Distribution
• Decisions and Errors
• One vs. Two-Tailed Tests

SLIDE 3

Two Main Goals of Inference

1. Assessing the strength of evidence about “yes/no” questions (hypothesis testing)
2. Estimating unknown quantities in a population using a sample (confidence intervals)

SLIDE 4

Statistics vs. Parameters

• Summary values (like mean, median, standard deviation) can be computed for populations or for samples.
• In a population, such a summary value is called a parameter.
• In a sample, these values are called statistics, and are used to estimate the corresponding parameter.

Value                 Population Parameter   Sample Statistic
Mean                  µ                      X̄
Proportion            p                      p̂
Correlation           ρ                      r
Slope of a Line       β1                     β̂1
Difference in Means   µ1 − µ2                X̄1 − X̄2

SLIDE 5

Quantifying H0 and H1

Identify the relevant population parameter for each of the following claims and state the null and alternative hypotheses (abbreviated H0 and H1), as statements about that parameter.

• Dr. Bristol can tell the difference between cups of tea more often than random guessing. H0: pcorrect = 0.5, H1: pcorrect > 0.5, where pcorrect is her “long run” success rate.
• There is a positive linear association between pH and mercury in Florida lakes. H0: ρ = 0, H1: ρ > 0, where ρ is the correlation coefficient between pH and Hg in all Florida lakes.
• Lab mice eat more on average when the room is light. H0: µlight − µdark = 0, H1: µlight − µdark > 0, where the µ are “long run”/population means for an appropriate measure of amount of food consumed.

SLIDE 6

Outline

• Measuring the “Unlikelihood” of H0
• Constructing a Randomization Distribution
• Decisions and Errors
• One vs. Two-Tailed Tests

SLIDE 7

Logic of Testing H0

• Logic: Don’t “confirm” H1; try to reject H0.
• If the data would be very unlikely assuming H0 were true, and would be less unlikely if H1 were true, we have evidence against H0 and hence in favor of H1.

SLIDE 8

What should we measure the likelihood of?

  • Suppose Dr. Bristol gets 9 out of 10 cups of tea right.
  • How unlikely is that?
  • What should count as “that”?


SLIDE 9

P-values

• “That” is all potential outcomes that favor H1 at least as much as the actual outcome.
• Sample: 9 of 10 correct. “That” =
• The collective probability of all of these outcomes is called the P-value for the sample.

SLIDE 10

P-value Definition

P-value

The probability of obtaining a result at least as “extreme” (i.e., as far from what’s expected under H0) as what was actually observed, assuming H0 is true, is called the P-value.
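The tea-tasting example can be worked exactly. A minimal sketch in Python (assuming an exact binomial calculation, rather than the card/StatKey simulation the slides use later): under H0, each of the 10 cups is an independent guess with success rate 0.5, so the P-value is the chance of getting at least 9 right.

```python
from math import comb

def binomial_p_value(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the chance of scoring at
    least as well as observed when each trial succeeds with rate p0."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

# Dr. Bristol: 9 of 10 cups correct; H0 says she is guessing (p0 = 0.5)
pval = binomial_p_value(9, 10, 0.5)
print(pval)  # 11/1024 ≈ 0.0107
```

Only two outcomes (9 or 10 correct) are at least as extreme as the observed one, so the P-value is (10 + 1)/2^10.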

SLIDE 11

Outline

• Measuring the “Unlikelihood” of H0
• Constructing a Randomization Distribution
• Decisions and Errors
• One vs. Two-Tailed Tests

SLIDE 12

Randomization distribution under H0

• Often we can simulate the world under H0 to find a P-value:
  • Cards
  • Computer simulation (e.g., R or StatKey)


SLIDE 13

Randomization Distribution

A randomization distribution is a simulated sampling distribution based on a hypothetical world where H0 is true.

• The randomization distribution shows what types of statistics would be observed, just by random chance, if the null hypothesis were true.
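One way to build such a distribution for a two-group comparison (like the light/dark mice example) is to shuffle the group labels, which enforces a world where H0 (“the labels don’t matter”) is true. A minimal sketch in Python; the data values below are invented for illustration and are not from the handout:

```python
import random

# Hypothetical food-consumption data (grams) -- illustrative numbers only
light = [5.9, 6.4, 7.1, 6.8, 7.5, 6.2]
dark  = [5.1, 5.8, 6.0, 5.5, 6.3, 5.7]

observed = sum(light) / len(light) - sum(dark) / len(dark)

# Under H0 the group labels are arbitrary, so repeatedly shuffle them
# and recompute the statistic to build the randomization distribution.
pooled = light + dark
rng = random.Random(113)  # seeded for reproducibility
randomization_dist = []
for _ in range(10_000):
    rng.shuffle(pooled)
    sim_light = pooled[:len(light)]
    sim_dark = pooled[len(light):]
    randomization_dist.append(sum(sim_light) / len(sim_light)
                              - sum(sim_dark) / len(sim_dark))

# One-tailed P-value: fraction of simulated statistics at least as
# large as the one actually observed
pval = sum(d >= observed for d in randomization_dist) / len(randomization_dist)
print(observed, pval)
```

Each shuffle produces one statistic “just by random chance”; the collection of 10,000 such statistics is the randomization distribution.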

SLIDE 14

Simulating a Randomization Distribution (Handout)


SLIDE 15

Outline

• Measuring the “Unlikelihood” of H0
• Constructing a Randomization Distribution
• Decisions and Errors
• One vs. Two-Tailed Tests

SLIDE 16

Statistical Significance

Statistical Significance

A finding in a sample (e.g., a correlation, or a difference between groups) is said to be statistically significant if the sample value (or one more extreme) would be very unlikely if H0 were true (i.e., the P-value is low).

SLIDE 17

What is low enough?

Significance level (α)

We need to decide for ourselves, in advance of collecting data, what we will count as a “low enough” P-value to achieve statistical significance. This threshold is called the significance level of the test. (Notation: α)


SLIDE 18

Making a Decision

Reject H0 or not?

(a) If P ≥ α: Do not reject H0. (The data wouldn’t be that surprising if H0 were true. H0 is “presumed innocent”.)
(b) If P < α: Reject H0. (The data would be too surprising if H0 were true. Beyond a “reasonable doubt”.)

Caution: We do not “accept H0”. We “fail to reject” it. (Not enough evidence to decide.)
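Once P and α are fixed, the decision rule is mechanical. A tiny sketch (the function name is mine, not the course’s):

```python
def decide(p_value, alpha):
    """Decision rule: reject H0 only when P < alpha; otherwise
    fail to reject (never 'accept') H0."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Tea example: P(at least 9 of 10 by guessing) = 11/1024 ≈ 0.0107
print(decide(0.0107, alpha=0.05))  # reject H0
print(decide(0.0107, alpha=0.01))  # fail to reject H0
```

Note that the same sample can be significant at one α and not at another, which is why α must be chosen before seeing the data.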

SLIDE 19

What if we’re wrong?

Example

H0: Drug is no better than a placebo
H1: Drug is better than a placebo

• We reject H0 if the data (or something even less consistent with H0) would be improbable in a world where H0 is true.
• But improbable things happen sometimes! This means that we will occasionally reject H0 incorrectly!
• E.g., we conclude that the drug works when in fact it doesn’t: we reject H0 by mistake.

SLIDE 20

Types of Errors

Example

H0: Drug is no better than a placebo
H1: Drug is better than a placebo

• We could prevent this from ever happening by never rejecting H0.
• Why not do this?


SLIDE 21

Types of Errors

2 × 2 table of possibilities: Is H0 actually false (does the treatment actually work)? Did we reject H0 (did we conclude that it works)?

                  H0 rejected        H0 not rejected
H0 is false       True Discovery     Missed Discovery
H0 is true        False Discovery    No Error

Table: Possible outcomes of a null hypothesis significance test

Which is worse? Pairs: What does increasing or decreasing α do to the likelihood of each possibility?

SLIDE 22

Type I vs. Type II Errors

• We can set α to whatever we want. The lower it is, the less often we make false discoveries (also called “Type I” errors).
• So why not make it really small?
• Tradeoff: Fewer false discoveries (Type I errors) → more missed discoveries (Type II errors).

SLIDE 23

Multiple Choice Test

• A professor writes a multiple choice “pretest” to assess whether students already know some of the course material when the semester starts.
• There are 20 questions, each with 4 options.
• For a particular student, we can ask: “Do they know anything about this material?”
• H0: pcorrect = 0.25, H1: pcorrect > 0.25

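Under H0 the pretest score follows a Binomial(20, 0.25) distribution, so the rejection threshold for any α can be computed directly: it is the smallest score whose upper-tail probability under H0 is at most α. A sketch using exact binomial arithmetic (standard library only):

```python
from math import comb

def tail_prob(k, n=20, p0=0.25):
    """P(X >= k) under H0: X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

def rejection_threshold(alpha, n=20, p0=0.25):
    """Smallest score k with P(X >= k | H0) <= alpha."""
    return next(k for k in range(n + 1) if tail_prob(k, n, p0) <= alpha)

for alpha in (0.15, 0.05, 0.01):
    print(alpha, rejection_threshold(alpha))
# thresholds: 8, 9, and 11 correct, respectively
```

These are the thresholds shown on the following slides: shrinking α pushes the threshold further into the tail of the H0 distribution.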

SLIDE 24

Type I vs. Type II Errors

Decreasing α moves the rejection threshold out toward the tail of the H0 distribution.

[Figure: distribution of the number correct (out of 20); blue spikes show the distribution of outcomes if H0 is true.]

• α = 0.15, threshold = 8
• α = 0.05, threshold = 9
• α = 0.01, threshold = 11

SLIDE 27

Type I vs. Type II Errors

We retain H0 when we do not exceed the threshold. But if H1 is correct, this is a Type II Error. A more stringent threshold → more missed discoveries.

[Figure: blue spikes show the distribution of outcomes if H0 is true; orange spikes show the distribution of outcomes for one possible parameter value under H1.]

• α = 0.15, threshold = 8
• α = 0.05, threshold = 9
• α = 0.01, threshold = 11
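The Type II error rate depends on which alternative is actually true; the orange distribution corresponds to one particular parameter value. As an illustration, assume (hypothetically) that under H1 the student answers each question correctly with probability 0.5. The Type II error rate at each threshold is then the chance of scoring below it:

```python
from math import comb

def binom_cdf(k, n=20, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Type II error rate = P(score falls below the rejection threshold
# | this particular H1), for the thresholds from the previous slides
for alpha, threshold in [(0.15, 8), (0.05, 9), (0.01, 11)]:
    beta = binom_cdf(threshold - 1)
    print(f"alpha={alpha}: Type II error rate = {beta:.3f}")
# 0.132, 0.252, 0.588: lower alpha -> more missed discoveries
```

This makes the tradeoff concrete: each cut in α buys fewer false discoveries at the price of a substantially higher miss rate against this alternative.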

SLIDE 30

Outline

• Measuring the “Unlikelihood” of H0
• Constructing a Randomization Distribution
• Decisions and Errors
• One vs. Two-Tailed Tests

SLIDE 31

Testing Fairness of a Coin

• I suspect a coin is biased, but I don’t know in what direction.
• What is my parameter of interest?
• What are my H0 and H1?
• I test it by flipping 100 times. What kinds of outcomes provide evidence against H0?
• I get 63 heads. What’s the “that” that should count toward the P-value? (StatKey)

SLIDE 32

Two-Tailed Tests

Two-Tailed Test

In a Two-Tailed Test, H1 does not specify the direction (sign) of a difference/correlation/slope, so outcomes at either extreme count in its favor. The P-value therefore uses outcomes at or past the observed one, but also the symmetric outcomes on the other “tail”.

We should prefer two-tailed tests unless only one side of the alternative is plausible a priori.
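For the 63-heads example, the two-tailed P-value counts 63 or more heads plus the mirror-image outcomes of 37 or fewer; by the symmetry of Binomial(100, 0.5) this is twice the upper tail. A sketch using exact binomial arithmetic in place of the StatKey simulation:

```python
from math import comb

def upper_tail(k, n=100, p0=0.5):
    """P(X >= k) under H0: X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

one_tailed_p = upper_tail(63)    # X >= 63 only
two_tailed_p = 2 * one_tailed_p  # X >= 63 or X <= 37, by symmetry
print(round(two_tailed_p, 4))    # ≈ 0.012
```

Note that doubling is only valid when the H0 distribution is symmetric, as it is for a fair coin; otherwise both tails must be summed directly.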

SLIDE 33

Dangers of Directional Tests

Cardiac Arrhythmia Suppression Trial

Three drugs were compared to a placebo, with the hope that they would reduce deaths. But some of them led to more deaths. We’d better be prepared to detect that.

SLIDE 34

Benefits of Non-directional Tests

Superconductors

In 1986, Alex Müller and Georg Bednorz created a brittle ceramic compound that superconducted at the highest temperature then known. What made this discovery so remarkable was that ceramics are normally insulators. An unexpected benefit of finding the opposite effect from what was expected!