basic statistical concepts

Basic statistical concepts - Susanne Rosthøj, Section of Biostatistics - PowerPoint PPT Presentation


  1. University of Copenhagen, Department of Biostatistics, Faculty of Health Sciences
     Basic statistical concepts
     Susanne Rosthøj, Section of Biostatistics, Department of Public Health, University of Copenhagen
     sr@biostat.ku.dk

  2. Statistical approaches
     Descriptive statistics:
     • Summarizing observations
     • Represented graphically, in tables, or as summary statistics (single values)
     Inferential statistics:
     • Procedures allowing us to conclude and generalize
     • Based on models, confidence intervals, hypotheses, tests
     • Need mathematical assumptions and results

  3. Male height from Sundby data
     [Figure: histogram "Height distribution (males)"; x-axis: Height (cm), y-axis: Frequency.]
     Median 180, IQR 175-185.

  4. Descriptive illustration - box plot
     [Figure: box plot "Height (males)".]

  5. The normal distribution
     The normal distribution is the most important distribution for describing continuous variables. Examples:
     • Body temperature
     • Male height
     • Lung function indices
     It is widely used in statistical inference because
     • it has many mathematically convenient properties
     • of the Central Limit Theorem: the average of a sufficiently large number of independent variables with the same distribution will be approximately normally distributed.

  6. The 95% reference interval
     Reference range for normally distributed data: $\mu \pm 1.96 \cdot \mathrm{SD}$
     [Figure: normal density curve; x-axis: Height (cm), y-axis: Density.]
     Mean 179.9, SD = 7.8. Reference range 164.6 to 195.2 cm.
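
     The reference range on this slide can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `reference_range` is a hypothetical helper, and the mean and SD are the values quoted on the slide.

```python
from statistics import NormalDist

def reference_range(mean, sd, coverage=0.95):
    """Return (lower, upper) limits covering `coverage` of a normal population."""
    # Two-sided quantile of the standard normal; about 1.96 for 95% coverage.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return mean - z * sd, mean + z * sd

# Mean and SD of male height as quoted on the slide.
lower, upper = reference_range(179.9, 7.8)
print(f"Reference range: {lower:.1f} to {upper:.1f} cm")
# prints "Reference range: 164.6 to 195.2 cm"
```

     The range describes where about 95% of individuals fall, which is a statement about the population, not about the uncertainty of the mean.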

  7. Mean and standard deviation of the sample mean
     We observe $n$ observations $X_1, \dots, X_n$ drawn from a normal distribution $N(\mu, \sigma^2)$. For the sample mean $\bar{X}$ the following holds:
     $\operatorname{mean}(\bar{X}) = \mu$
     $\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
     This SD is also called the standard error of the mean (SE or SEM). The sample mean thus has a distribution of its own.
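
     A small simulation can make the result for the standard error concrete. The sketch below assumes a population with the mean and SD quoted for male height (179.9 and 7.8); the sample size of 25, the number of replicates, the seed, and the function name `simulate_sample_means` are illustrative assumptions, not part of the slides.

```python
import math
import random
import statistics

def simulate_sample_means(mu, sigma, n, replicates, seed=1):
    """Draw `replicates` samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(replicates)
    ]

means = simulate_sample_means(mu=179.9, sigma=7.8, n=25, replicates=5000)

# The mean of the sample means should be close to mu, and their
# standard deviation close to sigma / sqrt(n) = 7.8 / 5 = 1.56.
print(f"mean of sample means: {statistics.fmean(means):.2f}")
print(f"SD of sample means:   {statistics.stdev(means):.2f}")
print(f"sigma / sqrt(n):      {7.8 / math.sqrt(25):.2f}")
```

     The individual heights vary with SD 7.8, but the means of samples of 25 vary far less, which is why larger samples give more precise estimates.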

  8. The distribution of the sample mean
     According to the CLT, the sample mean $\bar{X}$ (approximately) follows a normal distribution.
     [Figure: normal density centered at $\mu$, with 95% of the area between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$ and 2.5% in each tail.]

  9. The 95% confidence interval
     [Figure: density of $\bar{X}$ centered at $\mu$, with 95% of the area between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$ and 2.5% in each tail; two observed values of $\bar{X}$ are marked as points.]

  10. Understanding confidence intervals
      The population mean $\mu$ is a fixed unknown number. The confidence intervals vary between samples.
      [Figure: "Mean and 95% confidence interval" for each of 20 samples; x-axis: Sample.]
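
      The idea that the interval varies while $\mu$ stays fixed can be checked by simulation. The sketch below is an illustration under stated assumptions: the population values ($\mu = 24$, $\sigma = 3$), the sample size, the seed, and the function names are hypothetical, and $\sigma$ is treated as known so that the interval is $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

def ci_known_sigma(sample, sigma, z=1.96):
    """95% confidence interval for the mean when sigma is treated as known."""
    center = statistics.fmean(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return center - half_width, center + half_width

def coverage(mu, sigma, n, replicates, seed=2):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(replicates):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = ci_known_sigma(sample, sigma)
        hits += lower <= mu <= upper
    return hits / replicates

rate = coverage(mu=24.0, sigma=3.0, n=25, replicates=4000)
print(f"Share of intervals containing mu: {rate:.3f}")  # expected to be near 0.95
```

      Each simulated interval either contains $\mu$ or it does not; the 95% refers to the long-run share of intervals that do.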

  11. Interpretation of CI
      The 95% CI for mean male height ranges from 179 to 181 cm. Which of the following statements are true?
      A. There is a 95% probability that the population mean lies between 179 and 181 cm.
      B. 95% of males are between 179 and 181 cm tall.
      C. We are 95% confident that the interval from 179 to 181 cm contains the population mean.
      D. If we were to repeat the experiment over and over, then 95% of the time the population mean falls between 179 and 181 cm.

  12. Why do we need confidence intervals?
      We want to estimate a parameter, e.g.
      • the mean height for males
      • the mean difference in lung function for boys and girls
      Based on a sample we suggest a qualified guess (estimate):
      • we are uncertain about the guess and suggest an interval of plausible values
      • the interval has to be narrow
      • we want a large probability (95%) of guessing right.

  13. Small sample confidence intervals
      For small samples ($n \le 60$) the CIs are better approximated by the t-distribution with df $= n - 1$.
      The 95% CI for $\mu$ is $\bar{X} \pm z' \cdot \mathrm{se}$, with $z'$ being the 97.5% quantile of the t-distribution with df $= n - 1$ (the lower 2.5% quantile has the same size with a negative sign).
      Find a selection of quantiles in KS table A3, or calculate quantiles in R with qt(p=0.975, df=n-1).

  14. How to make conclusions based on data?
      The purpose of most experiments is to prove or disprove a hypothesis. This is done by collecting data, analyzing it and drawing a conclusion. The original hypothesis is tested against the data to find out whether or not it is right.

  15. Example of a hypothesis
      636 children from Peru had their lung capacity examined.
      Response: FEV (forced expiratory volume, L/1 s).
      Scientific question: Do boys and girls have different lung capacity?
      Hypothesis: $H_0$: There is no difference in lung capacity for boys and girls.
      We observe: Girls: mean(FEV) = 1.54. Boys: mean(FEV) = 1.66. Observed difference = 0.12.
      What can we conclude?

  16. Formulation of a hypothesis
      We always formulate hypotheses as no difference or no association.
      Comparison of two populations (two groups):
      $H_0$: The means are equal (i.e. $\mu_1 - \mu_0 = 0$)
      $H_A$: The means are not equal.
      If there is sufficient evidence against the hypothesis, we reject $H_0$.

  17. Test statistics
      We use test statistics to find evidence against the hypothesis. Often test statistics are given by
      $\frac{\text{estimate} - \text{hypothetical value}}{\operatorname{SD}(\text{estimate})}$
      We expect the test statistic to be
      • numerically small if the hypothesis is true
      • numerically large if the hypothesis is false.

  18. Example: Lung capacity
      Let $X_i$ denote FEV for child $i$, $i = 1, \dots, n = 636$. Assume $X_i$ is normally distributed with mean $\mu_0$ for girls, mean $\mu_1$ for boys, and variance $\sigma^2$.
      Do boys and girls have different lung capacity?
      Hypothesis: $H_0$: $\mu_0 = \mu_1$.
      $\mu_1 - \mu_0$ is the parameter we investigate. 0 is the hypothetical value.

  19. Two sample t-test
      Can be used when data are normally distributed*, arise from two groups, the variances in the two groups are equal, and all observations are independent.
      Summary data:
      Girls: $n_0$, $\bar{X}_0$, $\mathrm{SD}_0$
      Boys: $n_1$, $\bar{X}_1$, $\mathrm{SD}_1$
      Test statistic:
      $T = \frac{(\bar{X}_1 - \bar{X}_0) - 0}{\operatorname{SD}(\bar{X}_1 - \bar{X}_0)}$
      * can be relaxed when $n$ is large (roughly $n \ge 40$).

  20. Example: Lung capacity
                 n     mean    SD
      Girls     335   1.538   0.291
      Boys      301   1.657   0.308
      An estimate of the difference: $\bar{X}_1 - \bar{X}_0 = 0.119$.
      The test statistic (formulas in KS Ch 7.4):
      $T = \frac{0.119 - 0}{0.299 \times \sqrt{\frac{1}{335} + \frac{1}{301}}} = 5.01$
      Small or large?
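
      The arithmetic behind this test statistic can be retraced from the summary table. The sketch below assumes the usual pooled-variance formula for the equal-variance two-sample t-test (the slide refers to KS Ch 7.4 for the formulas, so this is an assumption about what is used there, although it reproduces the 0.299 shown above); the function names are hypothetical.

```python
import math

def pooled_sd(n0, sd0, n1, sd1):
    """Pooled standard deviation of two groups, assuming equal variances."""
    pooled_var = ((n0 - 1) * sd0**2 + (n1 - 1) * sd1**2) / (n0 + n1 - 2)
    return math.sqrt(pooled_var)

def two_sample_t(n0, mean0, sd0, n1, mean1, sd1):
    """T = (difference in means - 0) / SD(difference in means)."""
    se_diff = pooled_sd(n0, sd0, n1, sd1) * math.sqrt(1 / n0 + 1 / n1)
    return (mean1 - mean0) / se_diff

# Summary data from the table: girls are group 0, boys are group 1.
sp = pooled_sd(335, 0.291, 301, 0.308)
t_stat = two_sample_t(335, 1.538, 0.291, 301, 1.657, 0.308)
print(f"pooled SD = {sp:.3f}, T = {t_stat:.2f}")
# prints "pooled SD = 0.299, T = 5.01"
```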

  21. P values
      We use p-values to assess the size of test statistics.
      If the hypothesis is true and we replicate the sampling many times: how often will we obtain a test statistic numerically larger than the observed test statistic?
      The p-value, $P(|\text{test statistic}| > |\text{observed test statistic}|)$, is calculated assuming that the hypothesis is true.
      A small p-value means that the observed test statistic would be unlikely if the hypothesis were true.

  22. Example: Lung capacity
      If $H_0$ is true, $T$ follows a t-distribution with df $= n_0 + n_1 - 2$.
      P-value: $P(|T| > 5.01) = P(T < -5.01) + P(T > 5.01) = 2 \cdot 3.54 \times 10^{-7} = 7.09 \times 10^{-7}$
      If there is no difference in the mean lung function for boys and girls, the observed test statistic of 5.01 is unlikely.
      We reject $H_0$ and conclude that boys and girls have different lung function.
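
      For readers who want to check this p-value without statistical software, the tail area of the t-distribution can be approximated by numerical integration of its density. The sketch below uses only the Python standard library; the function names, the truncation width, and the number of Simpson steps are assumptions chosen for this example, and the truncated integral is only adequate for large df such as the 634 used here.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, width=50.0, steps=20000):
    """Approximate P(T > x) by Simpson's rule on [x, x + width].

    Assumes the density beyond x + width is negligible (true for large df).
    `steps` must be even.
    """
    h = width / steps
    total = t_pdf(x, df) + t_pdf(x + width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(x + i * h, df)
    return total * h / 3

def two_sided_p(t_observed, df):
    """P(|T| > |t_observed|), using the symmetry of the t-distribution."""
    return 2 * t_upper_tail(abs(t_observed), df)

df = 335 + 301 - 2  # n0 + n1 - 2 = 634
p_value = two_sided_p(5.01, df)
print(f"two-sided p-value: {p_value:.2e}")  # should land close to the 7.09e-07 on the slide
```

      In practice a statistics package would be used for this step; the integration is shown only to make clear that the p-value is the area in both tails beyond the observed test statistic.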
