Basic statistical concepts - Susanne Rosthøj, Section of Biostatistics

University of Copenhagen, Department of Biostatistics
Faculty of Health Sciences
Basic statistical concepts
Susanne Rosthøj
Section of Biostatistics, Department of Public Health, University of Copenhagen
sr@biostat.ku.dk

Statistical approaches
Descriptive statistics:
• Summarizing observations
• Represented
  • graphically
  • in tables
  • as summary statistics (single values)
Inferential statistics:
• Procedures allowing us to conclude and generalize
• Based on models, confidence intervals, hypotheses, tests
• Need mathematical assumptions and results

Male height from Sundby data
[Figure: histogram "Height distribution (males)"; x-axis: Height (cm), y-axis: Frequency]
Median 180 cm, IQR 175-185 cm.

Descriptive illustration - box plot
[Figure: box plot "Height (males)"]

The normal distribution
The normal distribution is the most important distribution for describing continuous variables. Examples:
• Body temperature
• Male height
• Lung function indices
It is widely used in statistical inference because
• it has many mathematically convenient properties
• of the Central Limit Theorem: the average of a sufficiently large number of independent variables with the same distribution will be approximately normally distributed.
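
The Central Limit Theorem can be illustrated with a small simulation. The sketch below is not part of the original slides: the exponential distribution, the sample size n = 50 and the number of repetitions are all assumptions chosen for illustration. It compares the skewness of raw, clearly non-normal draws with the skewness of averages of such draws.

```python
import random
import statistics

random.seed(1)

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

n = 50          # observations per average (assumed value)
repeats = 5000  # number of averages to simulate (assumed value)

# Raw draws from a clearly skewed distribution (exponential with mean 1).
raw = [random.expovariate(1.0) for _ in range(20000)]

# Averages of n independent draws from the same distribution.
averages = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

skew_raw = skewness(raw)
skew_avg = skewness(averages)
print(f"skewness of raw draws: {skew_raw:.2f}")
print(f"skewness of averages:  {skew_avg:.2f}")
```

The raw draws are strongly right-skewed, while the averages are much closer to symmetric, as the theorem suggests.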

The 95% reference interval
Reference range for normally distributed data: $\mu \pm 1.96 \cdot \mathrm{SD}$
[Figure: normal density curve; x-axis: Height (cm), y-axis: Density]
Mean 179.9, SD = 7.8. Reference range 164.6 to 195.2 cm.
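
As a quick check of the arithmetic, the reference range can be recomputed from the mean and SD given on the slide. This is a minimal sketch using Python's standard library; the variable names are illustrative and not from the slides.

```python
from statistics import NormalDist

mean_height = 179.9  # cm, from the slide
sd_height = 7.8      # cm, from the slide

# 97.5% quantile of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.975)

lower = mean_height - z * sd_height
upper = mean_height + z * sd_height
print(f"95% reference range: {lower:.1f} to {upper:.1f} cm")
```

The result agrees with the range of 164.6 to 195.2 cm stated on the slide.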

Mean and standard deviation of the sample mean
We observe $n$ observations $X_1, \ldots, X_n$ drawn from a normal distribution $N(\mu, \sigma^2)$. For the sample mean $\bar{X}$:
$\mathrm{mean}(\bar{X}) = \mu$
$\mathrm{SD}(\bar{X}) = \sigma / \sqrt{n}$
This SD is also called the standard error of the mean (SE or SEM). The sample mean thus has a distribution of its own.
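
The formula for the standard error can be checked by simulation. In the sketch below, the mean and SD are borrowed from the height example, while the sample size n = 25 and the number of repetitions are assumptions made only for illustration.

```python
import math
import random
import statistics

random.seed(2)

mu, sigma = 179.9, 7.8  # height mean and SD in cm, borrowed from the height example
n = 25                  # sample size (assumed for illustration)
repeats = 5000          # number of simulated samples (assumed)

# Draw many samples of size n and record the mean of each one.
sample_means = [
    statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(repeats)
]

theoretical_se = sigma / math.sqrt(n)          # sigma / sqrt(n)
empirical_se = statistics.stdev(sample_means)  # SD of the simulated means
print(f"theoretical SE: {theoretical_se:.2f}")
print(f"empirical SE:   {empirical_se:.2f}")
```

The two numbers should be close, which shows that the sample mean varies far less than the individual observations do.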

The distribution of the sample mean
According to the CLT, the sample mean $\bar{X}$ (approximately) follows a normal distribution:
[Figure: normal density of $\bar{X}$ centered at $\mu$; the central 95% lies between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$, with 2.5% in each tail]

The 95% confidence interval
[Figure: normal density of $\bar{X}$ centered at $\mu$, with the central 95% between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$ and 2.5% in each tail; two observed values of $\bar{X}$ are marked]
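
A 95% confidence interval for the mean takes the form sample mean ± 1.96 · SE. The sketch below applies this to the height example. The sample size is not stated in the slides, so n = 250 is purely an assumption, and the sample SD is treated as if it were the true σ.

```python
import math
from statistics import NormalDist

mean_height = 179.9  # cm, sample mean from the height example
sd_height = 7.8      # cm, sample SD from the height example
n = 250              # ASSUMED sample size; the slides do not state it

se = sd_height / math.sqrt(n)    # standard error of the mean
z = NormalDist().inv_cdf(0.975)  # about 1.96

ci_lower = mean_height - z * se
ci_upper = mean_height + z * se
print(f"95% CI: {ci_lower:.1f} to {ci_upper:.1f} cm")
```

The confidence interval is much narrower than the reference range, because it describes uncertainty about the mean rather than the spread of individual heights.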

Understanding confidence intervals
The population mean $\mu$ is a fixed unknown number. The confidence intervals vary between samples:
[Figure: "Mean and 95% confidence interval" plotted for samples 1 to 20; x-axis: Sample]
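
The idea that the interval varies while the population mean stays fixed can be sketched with a simulation. Every number below (true mean, SD, sample size, number of repetitions) is an assumption for illustration and is not taken from the slides. To keep the sketch short, σ is treated as known.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(3)

mu, sigma = 24.0, 4.0  # ASSUMED true mean and SD; not taken from the slides
n = 30                 # observations per sample (assumed)
repeats = 5000         # number of simulated samples (assumed)
z = NormalDist().inv_cdf(0.975)
se = sigma / math.sqrt(n)  # sigma is treated as known to keep the sketch short

covered = 0
for _ in range(repeats):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(sample)
    # The interval moves with the sample; mu stays fixed.
    if xbar - z * se <= mu <= xbar + z * se:
        covered += 1

coverage = covered / repeats
print(f"share of intervals containing mu: {coverage:.3f}")
```

In the long run, roughly 95% of the intervals constructed this way contain the population mean.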

Interpretation of CI
The 95% CI for mean male height ranges from 179 to 181 cm. Which of the following statements are true?
A. There is a 95% probability that the population mean lies between 179 and 181 cm.
B. 95% of males are between 179 and 181 cm tall.
C. We are 95% confident that the interval from 179 to 181 cm contains the population mean.
D. If we were to repeat the experiment over and over, then 95% of the time the population mean falls between 179 and 181 cm.

Why do we need confidence intervals?
We want to estimate a parameter, e.g.
• the mean height for males
• the mean difference in lung function for boys and girls
Based on a sample we suggest a qualified guess (estimate)
• we are uncertain about the guess and suggest an interval of plausible values
• the interval has to be narrow
• we want a large probability (95%) of guessing right.

Small sample confidence intervals
For small samples ($n \le 60$) the CIs are better approximated by the t-distribution with df = $n - 1$. The 95% CI for $\mu$ is
$\bar{X} \pm z' \cdot \mathrm{se}$
with $z'$ being the lower 2.5% quantile of the t-distribution with df = $n - 1$. This quantile is negative; because of the $\pm$, only its absolute value matters, and it equals the 97.5% quantile. Find a selection of quantiles in KS table A3 or calculate quantiles in R with qt(p=0.025, df=n-1), or with qt(p=0.975, df=n-1) for the positive value.

How to make conclusions based on data?
The purpose of most experiments is to prove or disprove a hypothesis. This is done by collecting data, analyzing it and drawing a conclusion. The original hypothesis is tested against the data to find out whether or not it is right.

Example of a hypothesis
636 children from Peru had their lung capacity examined.
Response: FEV, Forced Expiratory Volume (L/1s).
Scientific question: Do boys and girls have different lung capacity?
Hypothesis: $H_0$: There is no difference in lung capacity for boys and girls.
We observe:
Girls: mean(FEV) = 1.54
Boys: mean(FEV) = 1.66
Observed difference = 0.12.
What can we conclude?

Formulation of a hypothesis
We always formulate hypotheses as no difference or no association.
Comparison of two populations (two groups):
$H_0$: The means are equal (i.e. $\mu_1 - \mu_0 = 0$)
$H_A$: The means are not equal.
If there is sufficient evidence against the hypothesis, we reject $H_0$.

Test statistics
We use test statistics to find evidence against the hypothesis. Often test statistics are given by
$\dfrac{\text{estimate} - \text{hypothetical value}}{\mathrm{SD}(\text{estimate})}$
We expect the test statistic to be
• numerically small if the hypothesis is true
• numerically large if the hypothesis is false.

Example: Lung capacity
Let $X_i$ denote FEV for child $i$, $i = 1, \ldots, n = 636$. Assume $X_i$ is normally distributed with mean $\mu_0$ for girls, mean $\mu_1$ for boys and variance $\sigma^2$.
Do boys and girls have different lung capacity?
Hypothesis: $H_0: \mu_0 = \mu_1$.
$\mu_1 - \mu_0$ is the parameter we investigate. 0 is the hypothetical value.

Two sample t-test
Can be used when data are normally distributed*, arise from two groups, the variances in the two groups are equal and all observations are independent.
Summary data:
Girls: $n_0$, $\bar{X}_0$, $\mathrm{SD}_0$
Boys: $n_1$, $\bar{X}_1$, $\mathrm{SD}_1$
Test statistic:
$T = \dfrac{(\bar{X}_1 - \bar{X}_0) - 0}{\mathrm{SD}(\bar{X}_1 - \bar{X}_0)}$
* can be relaxed when $n$ is large (roughly $\ge 40$).

Example: Lung capacity
        n     mean    SD
Girls   335   1.538   0.291
Boys    301   1.657   0.308
An estimate of the difference: $\bar{X}_1 - \bar{X}_0 = 0.119$.
The test statistic (formulas in KS Ch 7.4):
$T = \dfrac{0.119 - 0}{0.299 \times \sqrt{\frac{1}{335} + \frac{1}{301}}} = 5.01$
Small or large???
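
The test statistic can be recomputed from the summary data. The slide refers to KS Ch 7.4 for the formulas and does not show how 0.299 is obtained. The sketch below assumes it is the usual pooled standard deviation of the equal-variance two-sample t-test.

```python
import math

# Summary data from the slide.
n0, mean0, sd0 = 335, 1.538, 0.291  # girls
n1, mean1, sd1 = 301, 1.657, 0.308  # boys

# Pooled SD, assuming equal variances in the two groups
# (assumed to be the formula behind the 0.299 on the slide).
pooled_var = ((n0 - 1) * sd0**2 + (n1 - 1) * sd1**2) / (n0 + n1 - 2)
pooled_sd = math.sqrt(pooled_var)

# Standard error of the difference in means.
se_diff = pooled_sd * math.sqrt(1 / n0 + 1 / n1)

diff = mean1 - mean0
t_stat = (diff - 0) / se_diff  # hypothetical value under H0 is 0
print(f"pooled SD = {pooled_sd:.3f}, difference = {diff:.3f}, T = {t_stat:.2f}")
```

Under this assumption the pooled SD rounds to 0.299 and the test statistic to 5.01, in agreement with the slide.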

P values
We use p values to assess the size of test statistics. If the hypothesis is true and we replicate the sampling many times: How often will we obtain a test statistic numerically larger than the observed test statistic?
The p-value
$P(|\text{test statistic}| > |\text{observed test statistic}|)$
is calculated assuming that the hypothesis is true.
A small p-value corresponds to the observed test statistic being unlikely if the hypothesis is true.

Example: Lung capacity
If $H_0$ is true, $T$ follows a t-distribution with df = $n_0 + n_1 - 2$.
P-value:
$P(|T| > 5.01) = P(T < -5.01) + P(T > 5.01) = 2 \cdot 3.54 \times 10^{-7} = 7.09 \times 10^{-7}$
If there is no difference in the mean lung function for boys and girls, the observed test statistic of 5.01 is unlikely. We reject $H_0$ and conclude that boys and girls have different lung function.
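
The tail probability can be approximated without a statistics package by numerically integrating the density of the t-distribution. The sketch below uses only Python's standard library. The helper names, the integration limit and the number of steps are illustrative choices, and a statistical package (for example pt in R) would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, upper=60.0, steps=20000):
    """P(T > t) by Simpson's rule on [t, upper]; the mass beyond `upper` is ignored."""
    h = (upper - t) / steps  # steps must be even
    total = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

df = 335 + 301 - 2  # n0 + n1 - 2 = 634
one_sided = upper_tail(5.01, df)
p_value = 2 * one_sided  # two-sided: both tails count
print(f"P(T > 5.01) = {one_sided:.2e}, two-sided p = {p_value:.2e}")
```

The two-sided p-value is of the order of $10^{-7}$, the same order of magnitude as the value on the slide.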
