Basic statistical concepts (Susanne Rosthøj, Section of Biostatistics)



SLIDE 1

University of Copenhagen, Department of Biostatistics

Faculty of Health Sciences

Basic statistical concepts

Susanne Rosthøj

Section of Biostatistics Department of Public Health University of Copenhagen sr@biostat.ku.dk

SLIDE 2

Statistical approaches

Descriptive statistics:

  • Summarizing observations
  • Represented
      • graphically
      • in tables
      • as summary statistics (single values)

Inferential statistics:

  • Procedures allowing us to conclude and generalize
  • Based on models, confidence intervals, hypotheses, tests
  • Need mathematical assumptions and results

SLIDE 3

Male height from Sundby data

[Figure: histogram "Height distribution (males)"; x-axis: Height (cm), y-axis: Frequency]

Median 180, IQR 175-185.
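A minimal sketch of how such summary statistics can be computed. The heights below are hypothetical values chosen for illustration (they are not the Sundby data), and the "inclusive" quartile method is one of several conventions, so other software may give slightly different quartiles.

```python
import statistics

# Hypothetical male heights in cm (illustration only, not the Sundby data).
heights_cm = [165, 171, 175, 178, 180, 182, 185, 189, 194]

median = statistics.median(heights_cm)

# Quartiles: n=4 splits the data into four parts and returns three cut points.
# The "inclusive" method is an assumed convention for this sketch.
q1, _, q3 = statistics.quantiles(heights_cm, n=4, method="inclusive")

print(f"Median {median}, IQR {q1}-{q3}")
```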

SLIDE 4

Descriptive illustration - box plot

[Figure: box plot of Height (males), axis from 160 to 200 cm]

SLIDE 5

The normal distribution

The normal distribution is the most important distribution for describing continuous variables. Examples:

  • Body temperature
  • Male height
  • Lung function indices

It is widely used in statistical inference because

  • it has many mathematically convenient properties
  • the Central Limit Theorem:

The average of a sufficiently large number of independent variables with the same distribution (and finite variance) will be approximately normally distributed.
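The Central Limit Theorem can be illustrated with a small simulation. This is a sketch under assumed settings: the exponential distribution, the sample size of 50, the number of replicates and the seed are all choices made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Average of n independent draws from a skewed (exponential) distribution."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 50             # observations per sample (assumed)
replicates = 5000  # number of simulated samples (assumed)
means = [sample_mean(n) for _ in range(replicates)]

# Exponential(1) has mean 1 and SD 1, so the averages should centre on 1
# with an SD close to 1 / sqrt(n).
centre = statistics.fmean(means)
spread = statistics.stdev(means)

# Share of averages within 1.96 SD of the centre; close to 0.95 if the
# averages are roughly normally distributed.
within = sum(abs(m - centre) <= 1.96 * spread for m in means) / replicates

print(centre, spread, within)
```

Although single exponential draws are strongly skewed, the averages behave much like a normal variable.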

SLIDE 6

The 95% reference interval

Reference range for normally distributed data: µ ± 1.96 · SD

[Figure: density of male height; x-axis: Height (cm), y-axis: Density]

Mean 179.9, SD=7.8. Reference range 164.6 to 195.2 cm.
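The reference range above follows directly from the formula µ ± 1.96 · SD. A short sketch of the arithmetic (the function name is hypothetical):

```python
def reference_range(mean, sd, z=1.96):
    """95% reference range mean ± z·SD for approximately normal data."""
    return mean - z * sd, mean + z * sd

# Mean and SD of male height as reported on the slide.
low, high = reference_range(179.9, 7.8)
print(f"Reference range {low:.1f} to {high:.1f} cm")
```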

SLIDE 7

Mean and standard deviation of the sample mean

We observe n observations X1, . . . , Xn drawn from a normal distribution N(µ, σ²). For the sample mean X̄ the following holds:

mean(X̄) = µ
SD(X̄) = σ / √n

This SD is also called the standard error of the mean (SE or SEM). The sample mean thus has a distribution of its own.
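A small sketch of how the standard error shrinks as n grows. The SD of 7.8 is the value reported for male height; the sample sizes are hypothetical.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 7.8  # SD of male height (cm)
for n in (10, 100, 1000):  # hypothetical sample sizes
    print(n, round(standard_error(sd, n), 2))
```

Multiplying the sample size by 100 reduces the standard error by a factor of 10.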

SLIDE 8

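The distribution of the sample mean

According to the CLT, the sample mean X̄ (approximately) follows a normal distribution:

[Figure: normal density of X̄ centred at µ; the central 95% lies between µ − 1.96 σ/√n and µ + 1.96 σ/√n, with 2.5% in each tail]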

SLIDE 9

The 95% confidence interval

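[Figure: normal density of X̄ centred at µ; the central 95% lies between µ − 1.96 σ/√n and µ + 1.96 σ/√n, with 2.5% in each tail; two example values of X̄ are marked]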

SLIDE 10

Understanding confidence intervals

The population mean µ is a fixed unknown number. The confidence intervals vary between samples:

[Figure: mean and 95% confidence interval for each of 27 samples; axes: Sample, Mean and 95% confidence interval]
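That the intervals vary between samples while µ stays fixed can be checked by simulation. The sketch below assumes a "true" population (µ = 180, σ = 8), a sample size of 100 and a fixed seed, all chosen for illustration.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

mu, sigma = 180.0, 8.0  # assumed "true" population mean and SD
n = 100                 # assumed sample size
replicates = 2000       # number of simulated samples

covered = 0
for _ in range(replicates):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    # The interval changes from sample to sample; mu does not.
    if mean - 1.96 * se <= mu <= mean + 1.96 * se:
        covered += 1

coverage = covered / replicates
print(coverage)
```

The share of intervals that contain µ should be close to 95%, which is the sense in which the method has 95% confidence.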

SLIDE 11

Interpretation of CI

The 95% CI for mean male height ranges from 179 to 181 cm. Which of the following statements are true?

  • A. There is a 95% probability that the population mean lies between 179 and 181 cm.
  • B. 95% of males are between 179 and 181 cm tall.
  • C. We are 95% confident that the interval from 179 to 181 cm contains the population mean.
  • D. If we were to repeat the experiment over and over, then 95% of the time the population mean falls between 179 and 181 cm.

SLIDE 12

Why do we need confidence intervals?

We want to estimate a parameter, e.g.

  • the mean height for males
  • the mean difference in lung function for boys and girls

Based on a sample we suggest a qualified guess (estimate)

  • we are uncertain about the guess and suggest an interval of plausible values
  • the interval has to be narrow
  • we want a large probability (95%) of guessing right.

SLIDE 13

Small sample confidence intervals

For small samples (n ≤ 60) the CIs are better approximated by the t-distribution with df = n − 1.

The 95% CI for µ is X̄ ± z′ · se, with z′ being the 97.5% quantile of the t-distribution with df = n − 1 (by symmetry, this is the lower 2.5% quantile with the sign reversed).

Find a selection of quantiles in KS table A3, or calculate quantiles in R with qt(p = 0.975, df = n - 1).

SLIDE 14

How to make conclusions based on data?

The purpose of most experiments is to prove or disprove a hypothesis. This is done by collecting data, analyzing it and drawing a conclusion. The original hypothesis is tested against the data to find out whether or not it is right.

SLIDE 15

Example of a hypothesis

636 children from Peru had their lung capacity examined.

Response: FEV (Forced Expiratory Volume, L/1s).

Scientific question: Do boys and girls have different lung capacity?

Hypothesis: H0: There is no difference in lung capacity for boys and girls.

We observe:

  • Girls: mean(FEV) = 1.54
  • Boys: mean(FEV) = 1.66

Observed difference = 0.12. What can we conclude?

SLIDE 16

Formulation of a hypothesis

We always formulate hypotheses as no difference or no association.

Comparison of two populations (two groups):

H0: The means are equal (i.e. µ1 − µ0 = 0)
HA: The means are not equal.

If there is sufficient evidence against the hypothesis, we reject H0.

SLIDE 17

Test statistics

We use test statistics to find evidence against the hypothesis. Often test statistics are given by

(estimate − hypothetical value) / SD(estimate)

We expect the test statistic to be

  • small if the hypothesis is true
  • large if the hypothesis is false.

SLIDE 18

Example: Lung capacity

Let Xi denote FEV for child i, i = 1, . . . , n = 636. Assume Xi is normally distributed with mean µ0 for girls, mean µ1 for boys and common variance σ².

Do boys and girls have different lung capacity?

Hypothesis: H0: µ0 = µ1.

µ1 − µ0 is the parameter we investigate; 0 is the hypothetical value.

SLIDE 19

Two sample t-test

Can be used when data are normally distributed∗, arise from two groups, the variances in the two groups are equal and all observations are independent.

Summary data:

  • Girls: n0, X̄0, SD0
  • Boys: n1, X̄1, SD1

Test statistic:

T = ((X̄1 − X̄0) − 0) / SD(X̄1 − X̄0)

∗ can be relaxed when n is large (roughly n ≥ 40).

SLIDE 20

Example: Lung capacity

          n     mean    SD
Girls    335   1.538   0.291
Boys     301   1.657   0.308

An estimate of the difference: X̄1 − X̄0 = 0.119.

The test statistic (formulas in KS Ch 7.4):

T = (0.119 − 0) / (0.299 × √(1/335 + 1/301)) = 5.01.

Small or large???
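The calculation can be retraced from the summary table. This sketch assumes that 0.299 is the pooled standard deviation of the two groups (the usual equal-variance formula); the function names are hypothetical.

```python
import math

def pooled_sd(n0, sd0, n1, sd1):
    """Pooled SD of two groups, assuming equal variances."""
    pooled_var = ((n0 - 1) * sd0**2 + (n1 - 1) * sd1**2) / (n0 + n1 - 2)
    return math.sqrt(pooled_var)

def two_sample_t(n0, mean0, sd0, n1, mean1, sd1):
    """Two-sample t statistic for H0: the two means are equal."""
    se_diff = pooled_sd(n0, sd0, n1, sd1) * math.sqrt(1 / n0 + 1 / n1)
    return (mean1 - mean0 - 0) / se_diff

# Summary data from the slide: girls are group 0, boys are group 1.
sp = pooled_sd(335, 0.291, 301, 0.308)
t = two_sample_t(335, 1.538, 0.291, 301, 1.657, 0.308)
print(round(sp, 3), round(t, 2))
```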

SLIDE 21

P values

We use p values to assess the size of test statistics.

If the hypothesis is true and we replicate the sampling many times: How often will we obtain a test statistic numerically larger than the observed test statistic?

The p-value

P(|test statistic| > |observed test statistic|)

is calculated assuming that the hypothesis is true. A small p-value corresponds to the observed test statistic being unlikely if the hypothesis is true.

SLIDE 22

Example: Lung capacity

If H0 is true, T follows a t-distribution with df = n0 + n1 − 2.

P-value:

P(|T| > 5.01) = P(T < −5.01) + P(T > 5.01) = 2 · 3.54 × 10⁻⁷ = 7.09 × 10⁻⁷

If there is no difference in the mean lung function for boys and girls, the observed test statistic of 5.01 is unlikely. We reject H0 and conclude that boys and girls have different lung function.
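The tail probability can be approximated without a statistics package by integrating the t density numerically. This is only a sketch: the function names, the Simpson-rule integration, the integration width and the step count are assumptions made for illustration, and dedicated software (for example pt in R) is the more usual route.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, width=60.0, steps=20000):
    """Approximate P(T > t) with Simpson's rule on [t, t + width].

    The truncation at t + width is an assumption that is adequate for
    large df, where the density is negligible far out in the tail.
    steps must be even.
    """
    h = width / steps
    total = t_pdf(t, df) + t_pdf(t + width, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(t + i * h, df)
    return total * h / 3

df = 335 + 301 - 2                       # n0 + n1 - 2
p_value = 2 * t_upper_tail(5.01, df)     # two-sided, by symmetry
print(p_value)
```

The result should be of the same order as the p-value on the slide; small differences can arise from rounding of the test statistic.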
