Statistical Analysis for Medical and Public Health Data


SLIDE 1

Statistical Analysis for Medical and Public Health Data

Qazvin University of Medical Sciences 2017

SLIDE 2

Workshop Schedule

1- Types of variables
2- Types of studies
3- Types of data summaries
4- Types of statistical inference
5- Statistical graphs and data analysis with STATA

SLIDE 3
  • 1. Types of variables
  • Qualitative variables: responses are not numbers
    – Nominal variable: sorts people into groups; the groups cannot be compared or ordered. Examples: gender, status (ill, healthy)
    – Ordinal variable: sorts people into groups; the groups can be ordered (<, =, >). Examples: education, social class (I, II, III, IV)

SLIDE 4
  • 1. Types of variables
  • Quantitative variables: responses are numbers

1. Interval variables: make groups, allow comparison, and have a zero point or origin that was set by scientists. A difference is meaningful but a ratio is not.

Examples: temperature (0 C = 32 F = 273 K), poverty line (Toman, $, …)

In Celsius: 20C – 10C = 10C and 20C / 10C = 2

F = 32 + 1.8 * C, so 20C = 32 + 1.8 * 20 = 68F and 10C = 32 + 1.8 * 10 = 50F

In Fahrenheit: 68F – 50F = 18F, which is the same difference as 10C (18 / 1.8 = 10), but 68F / 50F = 1.36, not 2
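The temperature arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function name `c_to_f` is an assumption introduced here, and the two temperatures are the ones used on the slide.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit with the slide's formula F = 32 + 1.8 * C."""
    return 32 + 1.8 * celsius


c_warm, c_cool = 20.0, 10.0
f_warm, f_cool = c_to_f(c_warm), c_to_f(c_cool)

# Differences keep their meaning across scales:
# an 18 F gap is the same gap as 10 C, because 18 / 1.8 = 10.
diff_c = c_warm - c_cool
diff_f = f_warm - f_cool
print(diff_c, diff_f, diff_f / 1.8)

# Ratios do not: the same two temperatures give different ratios
# depending on where the scale puts its zero.
ratio_c = c_warm / c_cool
ratio_f = f_warm / f_cool
print(ratio_c, round(ratio_f, 2))
```

The ratio changes with the unit because the zero of each scale is arbitrary, which is the reason ratios are not used for interval variables.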

SLIDE 5
  • 1. Types of variables
  • 2. Ratio variables: make groups, allow comparison, and have a zero point or origin that is a true zero. Both differences and ratios are meaningful.

Examples: age, weight, height

180cm – 170cm = 10cm and 180cm / 170cm = 1.06
180kg – 170kg = 10kg and 180kg / 170kg = 1.06

Statistical methods for interval and ratio variables are the same.

SLIDE 6
  • 1. Types of variables
  • Dependent variable (Y), also called the outcome, response, or end point: a function of many factors
  • Independent variables (X1, X2, …, Xk), also called predictors, factors, explanatory variables, or treatments: possible causes of Y

SLIDE 7
  • 2. Types of studies
  • Observational study:

Definition: An observational study

  • 1. draws inferences from a sample to a population
  • 2. has independent variables that are not under the control of the researcher because of ethical concerns or logistical constraints
  • 3. does not allow randomization of treatment
SLIDE 8

Types of observational studies

  • Case-control study: a study design originally developed in epidemiology, in which two existing groups differing in outcome are identified and compared on the basis of some supposed causal attribute.
  • Cross-sectional study: involves data collection from a population, or a representative subset, at one specific point in time.
  • Longitudinal study: a correlational research study that involves repeated observations of the same variables over long periods of time.
  • Cohort study or panel study: a particular form of longitudinal study in which a group of patients is closely monitored over a span of time.
  • Ecological study: an observational study in which at least one variable is measured at the group level.

SLIDE 9

Types of observational studies

Disadvantage:

Observational studies cannot be used as reliable sources for statements of fact about the "safety, efficacy, or effectiveness" of a practice.

Advantages:

1- provide information on “real world” use and practice
2- detect signals about the benefits and risks of practices in the general population
3- help formulate hypotheses to be tested in subsequent experiments
4- provide data needed to design more informative pragmatic clinical trials
5- inform clinical practice

SLIDE 10

Experimental Study

Definition: the investigator actively manipulates which groups receive the agent or exposure under study.

Randomized controlled trials (RCT)

The steps in an RCT are:

  • 1. State the hypothesis
  • 2. Select the participants. This step includes sample size, inclusion and exclusion criteria, and informed consent
  • 3. Allocate participants randomly to either the treatment or the control group (randomization)
  • 4. Administer the intervention in a blinded fashion (single blind or double blind)
  • 5. Monitor the outcomes at a pre-determined time
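Step 3 above (random allocation) might be sketched as follows. This is a minimal illustration of simple 1:1 randomization, not a trial-grade procedure; the function name `randomize`, the participant IDs 1 to 20, and the seed are all assumptions made for the example.

```python
import random


def randomize(participant_ids, seed):
    """Shuffle the participants and split them into two equal-sized arms.

    A fixed seed is used only so the allocation list can be reproduced
    and audited; the seed value itself is arbitrary.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


# Hypothetical trial with 20 enrolled participants.
groups = randomize(range(1, 21), seed=2017)
print(groups["treatment"])
print(groups["control"])
```

Real trials often use block or stratified randomization instead; the simple shuffle here only shows the basic idea that chance, not the investigator, decides the arm.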
SLIDE 11

3- Types of data summaries

  • Tables
  • Graphs
  • Descriptive statistics
SLIDE 12

3- Types of data summaries

One-way table: shows the distribution of one variable

Table 1. Distribution of blood group (the title should also state who, where, and when)

Blood group    Freq.    Percent
A                 25      18.52
B                 40      29.63
AB                55      40.74
O                 15      11.11
Total            135     100
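The percent column of the one-way table follows directly from the frequencies. A short Python sketch, using the frequencies from Table 1 (the variable names are assumptions for illustration):

```python
# Frequencies taken from Table 1.
freq = {"A": 25, "B": 40, "AB": 55, "O": 15}

total = sum(freq.values())
# Percent of each blood group = 100 * frequency / total.
percent = {group: 100 * n / total for group, n in freq.items()}

print(f"{'Blood group':<12}{'Freq.':>6}{'Percent':>9}")
for group, n in freq.items():
    print(f"{group:<12}{n:>6}{percent[group]:>9.2f}")
print(f"{'Total':<12}{total:>6}{sum(percent.values()):>9.2f}")
```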

SLIDE 13

3- Types of data summaries

Two-way table: shows the distribution of one variable by a second one

Table 2. Distribution of … by … (who, when, where)

               Disease Yes      Disease No       Total
Blood group    Freq.    %       Freq.    %       Freq.    %
A                 20                5               25
B                 20               20               40
AB                40               15               55
O                 10                5               15
Total             90               45              135

SLIDE 14

3- Types of data summaries

  • Three-way table

Application: effect of an exposure on an outcome after controlling for a confounder

Age group    Exposure    Disease +    Disease -
25 - 30      Yes
             No
…            …           …            …
>= 75        Yes
             No

SLIDE 15

Statistical Graphs

  • For qualitative variables:
  • 1. Simple bar chart
  • 2. Clustered bar chart
  • 3. Pie chart
  • 4. Clustered pie chart

SLIDE 16

Bar chart for race

[Figure: bar chart of race; y-axis: count of id; white = 96, black = 26, other = 67]

SLIDE 17

Distribution of low birth weight by race

[Figure: clustered bar chart of low birth weight within each race; y-axis: count of id; bar labels: white 73 and 23, black 15 and 11, other 42 and 25]

SLIDE 18

Distribution of race

[Figure: pie chart of race; white = 50.79%, black = 13.76%, other = 35.45%]

SLIDE 19

Distribution of low birth weight by race

[Figure: one pie chart per race (Graphs by race); slice labels: white 76.04% and 23.96%, black 57.69% and 42.31%, other 62.69% and 37.31%]

SLIDE 20

Statistical Graphs

  • For quantitative variables (continuous or discrete)
  • Histogram
  • Box plot
  • Scatter plot
  • Line plot
  • ROC (receiver operating characteristic) curve

SLIDE 21

Distribution of volume as a continuous variable

[Figure: histogram; x-axis: Volume (thousands); y-axis: Percent]

SLIDE 22

Distribution of Mileage as discrete variable

[Figure: histogram; x-axis: Mileage (mpg); y-axis: Percent]

SLIDE 23

Distribution of blood pressure (bp) by sex: effect of sex on bp

[Figure: blood pressure (y-axis, 120 to 180) shown for Male and Female]

SLIDE 24

Distribution of blood pressure (bp) by age groups and sex: effects of age group and sex on bp

[Figure: blood pressure (y-axis, 120 to 180) shown for Male and Female within the age groups 30-45, 46-59, and 60+]

SLIDE 25

Scatter plot of life expectancy by population growth

[Figure: scatter plot; x-axis: Avg. annual % growth; y-axis: Life expectancy at birth]

SLIDE 26

Line chart for life expectancy over years

[Figure: line chart; x-axis: Year (1900 to 1940); y-axis: life expectancy]

SLIDE 27

Line charts for life expectancy and inflation over years

[Figure: line chart with two series, life expectancy and inflation; x-axis: Year (1900 to 1940)]

SLIDE 28

Receiver Operating Characteristic (ROC) curve

  • Used to examine whether a clinical marker or a new clinical test is suitable for diagnosing a disease
  • Helps find a cutoff point, with its sensitivity and specificity, for a marker or a test
  • Gives the Area Under the Curve (AUC) and a p-value to examine the efficacy of the marker or test
  • An AUC above 0.5 and closer to 1.0 indicates an acceptable marker or test for diagnosis
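To make sensitivity, specificity, and AUC concrete, here is a small Python sketch. The marker values are hypothetical data invented for the example, higher values are assumed to point toward disease, and the AUC is computed through its pairwise interpretation (the share of diseased/healthy pairs in which the diseased subject has the higher marker value, ties counting one half), which gives the same number as the area under the empirical ROC curve.

```python
def auc(diseased, healthy):
    """AUC as the probability that a diseased subject scores higher than a healthy one."""
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5  # ties count as half a win
    return wins / (len(diseased) * len(healthy))


def sens_spec(diseased, healthy, cutoff):
    """Sensitivity and specificity when the test is called positive for value >= cutoff."""
    sensitivity = sum(d >= cutoff for d in diseased) / len(diseased)
    specificity = sum(h < cutoff for h in healthy) / len(healthy)
    return sensitivity, specificity


# Hypothetical marker values, not taken from the workshop data.
diseased = [6, 7, 8, 5, 9]
healthy = [2, 3, 5, 1, 4]

print(auc(diseased, healthy))
print(sens_spec(diseased, healthy, cutoff=5))
```

Swapping the two groups gives 1 minus the AUC, which is why a marker with an AUC well below 0.5 is ranking subjects in the wrong direction.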

SLIDE 29

An example of a bad marker

[Figure: ROC curve; y-axis: Sensitivity (0.00 to 1.00); x-axis: 1 - Specificity (0.00 to 1.00); Area under ROC curve = 0.3870]

             ROC                   -Asymptotic Normal--
    Obs      Area      Std. Err.   [95% Conf. Interval]
    189      0.3870    0.0452      0.29841    0.47564

SLIDE 30

ROC curve for a good marker

[Figure: ROC curve; y-axis: Sensitivity (0.00 to 1.00); x-axis: 1 - Specificity (0.00 to 1.00); Area under ROC curve = 0.9964]

             ROC                   -Asymptotic Normal--
    Obs      Area      Std. Err.   [95% Conf. Interval]
    2000     0.9964    0.0013      0.99390    0.99893

SLIDE 31

Choosing a Cutoff point

Detailed report of sensitivity and specificity

                                            Correctly
Cutpoint     Sensitivity    Specificity    Classified
( >= 1 )       100.00%          0.00%        50.00%
( >= 2 )        99.70%         94.20%        96.95%
( >= 3 )        99.50%         96.00%        97.75%
( >= 4 )        99.30%         97.60%        98.45%
( >= 5 )        98.80%         98.30%        98.55%
( >= 6 )        97.80%         98.50%        98.15%
( >= 7 )        97.30%         98.80%        98.05%
( >= 8 )        96.50%         99.70%        98.10%
( >  8 )         0.00%        100.00%        50.00%
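One common way to pick a cutoff from a report like this is to maximize Youden's index, J = sensitivity + specificity - 1. The slide does not say which criterion the workshop used, so the choice of Youden's index is an assumption; the numbers are copied from the report above.

```python
# (cutpoint, sensitivity %, specificity %) copied from the report above.
rows = [
    (">= 1", 100.00, 0.00),
    (">= 2", 99.70, 94.20),
    (">= 3", 99.50, 96.00),
    (">= 4", 99.30, 97.60),
    (">= 5", 98.80, 98.30),
    (">= 6", 97.80, 98.50),
    (">= 7", 97.30, 98.80),
    (">= 8", 96.50, 99.70),
    ("> 8", 0.00, 100.00),
]


def youden(row):
    """Youden's index J = sensitivity + specificity - 1, with both as proportions."""
    _, sensitivity, specificity = row
    return (sensitivity + specificity) / 100 - 1


best = max(rows, key=youden)
print(best[0], round(youden(best), 3))
```

For this table the same cutpoint also has the highest "Correctly Classified" percentage, so both criteria agree here; that need not hold in general.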

SLIDE 32

Fundamentals of Statistical Testing and Confidence Interval

SLIDE 33

Fundamentals of statistical testing

Research loop:

  • Population with unknown parameters
  • Representative sample
  • Statistics

SLIDE 34

Fundamentals of statistical testing

Methods for statistical inference:

1- Estimation
   1-1 Point estimation
   1-2 Confidence interval estimation
2- Statistical testing (test of hypothesis)

SLIDE 35

What is a point estimate?

A point estimate is a statistical measure calculated from the data obtained in a sample. Examples: sample mean, sample proportion, etc.

Population parameter              Point estimate
Mean = µ                          Xbar
Prop. = P                         X/n; X = number of successes, n = sample size
Standard deviation = σ            s
Standard Error = Std. Err.        s/√n
Coefficient of Variation = σ/µ    s/Xbar

SLIDE 36

Major problem with point estimates

  • To what extent can we be confident in generalizing a point estimate to its parameter in the population?
  • No specific answer!
  • A point estimate may have a confidence anywhere from 0% to 100%
  • The question is answered by building an interval with the desired confidence, centered on the point estimate

SLIDE 37

What is a confidence interval?

  • A confidence interval is an interval (lower, upper) of values that covers a parameter of interest with a specified confidence.
  • The parameters of interest are usually the mean or the proportion of a population
  • The specified confidence, 100(1 – α)%, is usually 95% or 99% in medical research

C.I.: (lower ---- par ---- upper)

SLIDE 38

General rule for making a C.I.

Lower = estimate - confidence coefficient * SE(estimate)
Upper = estimate + confidence coefficient * SE(estimate)

  • SE = Standard Error
  • Confidence coefficient: Z = 1.96 (95%) or Z = 2.58 (99%), from the Normal distribution, if sample size >= 25
  • Confidence coefficient: t(n-1), from the T distribution, if sample size < 25; n-1 is called the degrees of freedom (df)

SLIDE 39

95% C.I. for the Mean

If n >= 25:
Lower = sample mean – 1.96 * SE(sample mean)
Upper = sample mean + 1.96 * SE(sample mean)
SE(sample mean) = SD/sqrt(n)

If n < 25:
Lower = sample mean – t(n-1) * SE(sample mean)
Upper = sample mean + t(n-1) * SE(sample mean)
n-1 is the degrees of freedom; df = n-1

SLIDE 40
SLIDE 41

Confidence interval for the Mean (n>=25)

How should these CIs be interpreted?

Variable    Obs    Mean        Std. Err.    [95% Conf. Interval]
mpg         74     21.2973     .6725511     19.9569     22.63769
price       74     6165.257    342.8719     5481.914    6848.6

SLIDE 42

Confidence interval for the Mean (n < 25)

. summarize obs

Variable    Obs    Mean    Std. Dev.    Min    Max
obs         10     33.7    19.22412     12     66

. display r(mean) - invttail(r(N)-1,0.025)*r(sd)/sqrt(r(N))
19.947895

. display r(mean) + invttail(r(N)-1,0.025)*r(sd)/sqrt(r(N))
47.452105
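The two rules for the mean can be retraced in Python from the summary statistics shown in the STATA output. This is a sketch under stated assumptions: the t coefficient 2.262157 for 9 degrees of freedom is a table value typed in by hand (the quantity STATA returns for invttail(9, 0.025)), and the helper name `mean_ci` is introduced only for this example.

```python
import math


def mean_ci(mean, se, coefficient):
    """Lower and upper limits: estimate -/+ confidence coefficient * SE."""
    margin = coefficient * se
    return mean - margin, mean + margin


# Large sample (n >= 25): price from the n = 74 output, with Z = 1.96.
price_low, price_high = mean_ci(6165.257, 342.8719, 1.96)
print(round(price_low, 3), round(price_high, 3))

# Small sample (n < 25): the n = 10 example, with t(9).
n, mean, sd = 10, 33.7, 19.22412
T_975_DF9 = 2.262157  # assumed table value for t with 9 df, 95% two-sided
se = sd / math.sqrt(n)
small_low, small_high = mean_ci(mean, se, T_975_DF9)
print(round(small_low, 4), round(small_high, 4))
```

The small-sample limits match the two `display` results. For price, the Z-based interval comes out slightly narrower than the interval in the STATA table, because STATA's output for a mean uses the t distribution (here with 73 degrees of freedom) rather than 1.96.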

SLIDE 43

Confidence Interval for Proportion

  • General rule: Sample prop. ± Z(1-a/2) * SE(sample prop.)
  • Sample prop. = X/n = p; X = number of successes, n = sample size
  • Z(1-a/2) = 1.96 for 95% confidence; Z(1-a/2) = 2.58 for 99% confidence
  • SE(sample prop.) = sqrt(p*(1-p)/n); sqrt = √

SLIDE 44

C.I. for proportion in STATA

Command line: ci y, binomial

                                        Binomial Exact
Variable    Obs    Mean    Std. Err.    [95% Conf. Interval]
y           25     .4      .0979796     .2112548    .6133465
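For comparison with the STATA output, the general rule for a proportion can be applied by hand to the same numbers (p = 0.4, n = 25). A minimal Python sketch; the variable names are assumptions:

```python
import math

n = 25
p = 0.4  # sample proportion X/n from the STATA output

# SE(sample prop.) = sqrt(p * (1 - p) / n)
se = math.sqrt(p * (1 - p) / n)

# 95% interval from the general rule, with Z = 1.96.
z = 1.96
low, high = p - z * se, p + z * se

print(round(se, 7))
print(round(low, 5), round(high, 5))
```

The standard error agrees with the Std. Err. column, but the limits differ from the ones STATA prints, because the `binomial` option reports an exact binomial interval rather than this normal approximation. The gap is noticeable at a sample size as small as 25.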

SLIDE 45

Statistical Analysis for Medical and Public Health Data

Qazvin University of Medical Sciences 2017

SLIDE 46

Statistical Testing

Test of Hypothesis

SLIDE 47

Test of Hypothesis

1. Populations have parameters, such as the mean or a proportion, whose values are fixed but unknown.
2. We want to estimate unknown parameters.
3. We want to examine a hypothesis about a parameter with a minimum level of error.
4. H0, the null hypothesis, is the statement opposed to the investigator’s idea about a parameter.
5. H1, the alternative hypothesis, is the investigator’s idea about a parameter.
6. If we reject H0 while it is true, we make a type I error.
7. If we do not reject H0 while it is false, we make a type II error.
8. In health and medical research, the chances of type I and type II errors should be less than 5% and 20%, respectively.

SLIDE 48

Type of Errors

                 H0 TRUE                      H0 FALSE
Reject H0        Type I error (α)             Power (1-β)
Not reject H0    Correct decision (1-α)       Type II error (β)

α = probability of type I error = significance level
β = probability of type II error
1-α = confidence level
1-β = power of the test

SLIDE 49

Test of Hypothesis

For a test of hypothesis we have to follow these steps:

  • 1. Write the H0 and H1 hypotheses
  • 2. Calculate the proper statistics from the sample data
  • 3. Calculate the test value
  • 4. Compare the test value with a critical value to make the decision
  • 5. Find the p-value: the probability, if H0 is true, of getting a test value at least as extreme as the one observed. The smaller the p-value, the stronger the evidence against H0. Usually H0 is rejected when the p-value is less than 0.05.

SLIDE 50

Classical tests of hypotheses

  • I. One sample mean comparison
  • II. Two-group mean-comparison test
  • III. Mean-comparison test (paired data)
  • IV. One sample proportion test
  • V. Two-group proportion test

SLIDE 51
  • I. One sample mean comparison

One-sample t test

            Obs    Mean    Std. Err.    Std. Dev.    [95% Conf. Interval]
x           25     95.6    1.3          6.5          92.91693    98.28307

mean = mean(x)                                      t = -3.3846
Ho: mean = 100                     degrees of freedom = 24

Ha: mean < 100           Ha: mean != 100            Ha: mean > 100
Pr(T < t) = 0.0012       Pr(|T| > |t|) = 0.0024     Pr(T > t) = 0.9988
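The test value and the confidence interval in this output can be retraced from the summary statistics. A sketch in Python, assuming a hand-entered critical value: 2.063899 is the table value of t with 24 degrees of freedom for a two-sided 5% test. The p-values are not recomputed here, since that needs the t distribution function.

```python
import math

# Summary statistics from the output above.
n, mean, sd = 25, 95.6, 6.5
mu0 = 100.0  # value of the mean under H0

se = sd / math.sqrt(n)  # standard error of the mean
t = (mean - mu0) / se   # test value

T_CRIT_DF24 = 2.063899  # assumed table value: t(24), two-sided 5%
reject_h0 = abs(t) > T_CRIT_DF24
ci = (mean - T_CRIT_DF24 * se, mean + T_CRIT_DF24 * se)

print(round(se, 4), round(t, 4), reject_h0)
print(round(ci[0], 5), round(ci[1], 5))
```

The hypothesized value 100 lies outside the 95% interval, which is the same decision as rejecting H0 at the 5% level.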

SLIDE 52
  • II. Two-group mean-comparison

Two-sample t test with equal variances

            Obs    Mean        Std. Err.    Std. Dev.    [95% Conf. Interval]
x           25     78.4        1.06         5.3          76.21227     80.58773
y           30     90.3        1.314534     7.2          87.61148     92.98852
combined    55     84.89091    1.176161     8.722646     82.53285     87.24897
diff               -11.9       1.735777                  -15.38153    -8.418473

diff = mean(x) - mean(y)                            t = -6.8557
Ho: diff = 0                       degrees of freedom = 53

Ha: diff < 0             Ha: diff != 0              Ha: diff > 0
Pr(T < t) = 0.0000       Pr(|T| > |t|) = 0.0000     Pr(T > t) = 1.0000
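The standard error and t value of the equal-variance test come from the pooled variance of the two groups. A short Python sketch that retraces them from the group summaries in the output (the function name `pooled_t` is an assumption for illustration):

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t test value assuming equal variances in both groups."""
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se_diff
    return t, se_diff, df


# Group summaries from the output above (x first, then y).
t, se_diff, df = pooled_t(25, 78.4, 5.3, 30, 90.3, 7.2)
print(round(t, 4), round(se_diff, 6), df)
```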

SLIDE 53
  • III. Mean-comparison test (paired data)

Paired t test

Variable    Obs    Mean     Std. Err.    Std. Dev.    [95% Conf. Interval]
mpg1        12     21       .7881701     2.730301     19.26525    22.73475
mpg2        12     22.75    .9384465     3.250874     20.68449    24.81551
diff        12     -1.75    .7797144     2.70101      -3.46614    -.0338602

mean(diff) = mean(mpg1 - mpg2)                      t = -2.2444
Ho: mean(diff) = 0                 degrees of freedom = 11

Ha: mean(diff) < 0       Ha: mean(diff) != 0        Ha: mean(diff) > 0
Pr(T < t) = 0.0232       Pr(|T| > |t|) = 0.0463     Pr(T > t) = 0.9768

SLIDE 54
  • IV. One sample proportion test

One-sample test of proportion               x: Number of obs = 100

Variable    Mean    Std. Err.    [95% Conf. Interval]
x           .2      .04          .1216014    .2783986

p = proportion(x)                                   z = -6.0000
Ho: p = 0.5

Ha: p < 0.5              Ha: p != 0.5               Ha: p > 0.5
Pr(Z < z) = 0.0000       Pr(|Z| > |z|) = 0.0000     Pr(Z > z) = 1.0000
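The one-sample proportion test uses two different standard errors: one computed under H0 for the test value, and one computed from the sample proportion for the confidence interval. The Python sketch below retraces both from the output; it uses `statistics.NormalDist` from the standard library for the normal probabilities, and 1.96 as a rounded confidence coefficient.

```python
import math
from statistics import NormalDist

n, p_hat = 100, 0.2  # sample size and sample proportion from the output
p0 = 0.5             # proportion under H0

# Test value: the standard error is computed under H0.
se_h0 = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_h0
p_two_sided = 2 * NormalDist().cdf(-abs(z))

# Confidence interval: the standard error uses the sample proportion.
se_ci = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se_ci, p_hat + 1.96 * se_ci)

print(round(z, 4), p_two_sided < 0.0001)
print(round(se_ci, 4), round(ci[0], 4), round(ci[1], 4))
```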

SLIDE 55
  • V. Two-group proportion test

Two-sample test of proportions              x: Number of obs = 25
                                            y: Number of obs = 40

Variable     Mean    Std. Err.    z       P>|z|    [95% Conf. Interval]
x            .75     .0866025                      .5802621     .9197379
y            .65     .0754155                      .5021883     .7978117
diff         .1      .1148368                      -.1250761    .3250761
under Ho:            .1180735     0.85    0.397

diff = prop(x) - prop(y)                            z = 0.8469
Ho: diff = 0

Ha: diff < 0             Ha: diff != 0              Ha: diff > 0
Pr(Z < z) = 0.8015       Pr(|Z| > |z|) = 0.3970     Pr(Z > z) = 0.1985
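The "under Ho" standard error in this output is based on the pooled proportion of the two groups. A Python sketch that retraces the z value and the two-sided p-value from the group summaries (the function name `two_prop_z` is an assumption for illustration):

```python
import math
from statistics import NormalDist


def two_prop_z(n1, p1, n2, p2):
    """z test for the difference of two proportions, pooling them under H0."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_h0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_h0
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, se_h0, p_two_sided


# Group summaries from the output above (x first, then y).
z, se_h0, p_two_sided = two_prop_z(25, 0.75, 40, 0.65)
print(round(z, 4), round(se_h0, 7), round(p_two_sided, 4))
```

With a two-sided p-value of about 0.40, the observed difference of 0.1 gives no evidence against H0.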

SLIDE 56

Statistical Models for Observational and Interventional Studies

SLIDE 57

Statistical Models for Observational and Interventional Studies

Aim of Statistical Modeling

To explain and describe the behavior of a phenomenon by a mathematical-statistical formula

Mathematical Biology
Statistical Biology

SLIDE 58

Statistical Models for Observational and Interventional Studies

  • Cell Differentiation

A T-cell is a type of blood cell that is a key component of the immune system. T-cells differentiate into either TH1 or TH2 cells that have different functions. The decision to which cell type to differentiate depends on the concentration of transcription factors T-bet (x1) and GATA-3 (x2) within the cell.

SLIDE 59

Statistical Models for Observational and Interventional Studies

A T-cell will differentiate into TH1 if x1 is high and x2 is low, and into TH2 if x1 is low and x2 is high. To know how a T-cell differentiates, we need the values of x1 and x2 at each moment of time, i.e. the dynamics of x1 and x2.

           X2-Low    X2-High
X1-Low               TH2
X1-High    TH1

SLIDE 60

Statistical Models for Observational and Interventional Studies

  • A statistical-mathematical model was developed by Yates et al. Accordingly, the xi evolve by:

$$\frac{dx_i}{dt} = -m\,x_i + a_i\,\frac{x_i^{n}}{k_i^{n} + x_i^{n}} + c_i\,\frac{S_i}{r_i + S_i}$$

This mathematical-statistical model predicts the dynamics of a T-cell in terms of its x1 and x2.
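To show what "predicting the dynamics" means in practice, the equation on this slide can be stepped forward in time numerically. The sketch below treats a single factor x with a simple Euler scheme. Every parameter value, the signal level S, the starting value, and the step size are hypothetical numbers chosen only to make the example run; they are not taken from Yates et al.

```python
def rate(x, s, m, a, k, n, c, r):
    """Right-hand side of the slide's equation for one factor:
    dx/dt = -m*x + a*x^n/(k^n + x^n) + c*S/(r + S)
    """
    self_activation = a * x ** n / (k ** n + x ** n)
    signal = c * s / (r + s)
    return -m * x + self_activation + signal


def simulate(x0, s, steps, dt, **params):
    """Euler integration: x(t + dt) = x(t) + dt * dx/dt."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x + dt * rate(x, s, **params)
        path.append(x)
    return path


# Hypothetical parameter values, for illustration only.
params = dict(m=1.0, a=2.0, k=1.0, n=2, c=1.0, r=1.0)
path = simulate(x0=0.1, s=1.0, steps=2000, dt=0.01, **params)

print(round(path[0], 3), round(path[-1], 3))
```

With these made-up values the concentration rises from its low start and settles at a steady level, where decay and production balance and the rate of change is close to zero.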

SLIDE 61

Regression Statistical Models

  • Multiple Linear Regression
  • Logistic Regression (Binomial Regression)
  • Log-Linear Regression (Poisson Regression)
  • Polynomial models
  • Nonlinear models
  • Mixed effects models
  • Bayesian Network models
SLIDE 62

Multiple Linear Regression

  • Yi is a continuous variable (i = 1, 2, …, n)
  • Xi1, Xi2, …, Xik are independent variables
  • Regression model equation:

Ybar = b0 + b1*X1 + … + bk*Xk

  • Hint 1: the left side of the equation is the mean of Y
  • Hint 2: Ybar ranges from –∞ to +∞
  • Interpretation of bi: a one-unit increase in Xi changes Ybar by bi units, with the other X's held fixed
  • Each X may be nominal, ordinal, or continuous
  • If X is binary, it must be coded 0 and 1
  • If X is categorical (more than two groups), it should be converted to dummy variables

SLIDE 63

Example

  • sysuse auto

Y = mpg (Mileage)
X1 = foreign (car type: domestic, foreign), X2 = weight (lbs.)

Regression equation: mpg = b0 + b1*foreign + b2*weight

SLIDE 64

Regression in STATA: output of regression

Source      SS            df    MS             Number of obs =     74
                                               F( 2, 71)     =  69.75
Model       1619.2877      2    809.643849     Prob > F      = 0.0000
Residual    824.171761    71    11.608053      R-squared     = 0.6627
                                               Adj R-squared = 0.6532
Total       2443.45946    73    33.4720474     Root MSE      = 3.4071

mpg          Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
1.foreign    -1.650029    1.075994     -1.53     0.130    -3.7955      .4954422
weight       -.0065879    .0006371     -10.34    0.000    -.0078583    -.0053175
_cons        41.6797      2.165547     19.25     0.000    37.36172     45.99768
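One way to read this output is to plug the coefficients back into the regression equation. The Python sketch below does that for a car weighing 3,000 lbs, a weight chosen arbitrarily for the example, and also retraces R-squared and F from the sums of squares in the table. The function and variable names are assumptions.

```python
# Coefficients copied from the output above.
B0 = 41.6797           # _cons
B_FOREIGN = -1.650029  # 1.foreign (0 = domestic, 1 = foreign)
B_WEIGHT = -0.0065879  # weight, per lb


def predict_mpg(weight_lbs, foreign):
    """Predicted mean mileage: mpg = b0 + b1*foreign + b2*weight."""
    return B0 + B_FOREIGN * foreign + B_WEIGHT * weight_lbs


# Hypothetical car weight, used only to illustrate the equation.
domestic_mpg = predict_mpg(3000, foreign=0)
foreign_mpg = predict_mpg(3000, foreign=1)
print(round(domestic_mpg, 3), round(foreign_mpg, 3))

# R-squared = Model SS / Total SS; F = Model MS / Residual MS.
r_squared = 1619.2877 / 2443.45946
f_value = 809.643849 / 11.608053
print(round(r_squared, 4), round(f_value, 2))
```

At equal weight, the two predictions differ by exactly the coefficient of foreign, which is how a 0/1 coded variable is interpreted; in this output that difference is not statistically significant (P = 0.130).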

SLIDE 65

Logistic Regression

  • Y is a binary variable that takes only two values (0, 1); 1 = event of interest
  • Y = 1 with probability p
  • Y = 0 with probability 1-p
  • What is Ybar (the mean of Y) here?
  • Ybar = 1*p + 0*(1-p) = p
  • Regression equation:

p = b0 + b1*X1 + … + bk*Xk

SLIDE 66

Logistic Regression

  • Problem: p does not range from -∞ to +∞; instead 0 <= p <= 1
  • What about p/(1-p) instead of p? 0 <= p/(1-p) < +∞, so (-∞, 0) is still not covered. p/(1-p) is called the Odds.
  • What about log(p/(1-p)) instead of p? -∞ < log(p/(1-p)) < +∞; that is OK.
  • In mathematics, log(p/(1-p)) is called the logit function; its inverse is the logistic function.
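The ranges described above can be checked numerically. A small Python sketch; the function names are assumptions, and the probabilities are arbitrary illustration values:

```python
import math


def odds(p):
    """Odds = p / (1 - p); ranges from 0 up to +infinity."""
    return p / (1 - p)


def logit(p):
    """Log of the odds; ranges over the whole real line."""
    return math.log(odds(p))


def inverse_logit(x):
    """Maps any real number back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-x))


for p in (0.1, 0.5, 0.9):
    print(p, round(odds(p), 3), round(logit(p), 3))

# Going to the log-odds scale and back recovers the probability.
print(round(inverse_logit(logit(0.9)), 6))
```

Probabilities below 0.5 give negative log-odds and probabilities above 0.5 give positive ones, so the left side of the equation is no longer restricted in range.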

SLIDE 67

Logistic Regression

  • Logistic regression equation in two forms:

$$\log\frac{p}{1-p} = b_0 + b_1 X_1 + \dots + b_k X_k$$

$$\frac{p}{1-p} = e^{\,b_0 + b_1 X_1 + \dots + b_k X_k} \quad \text{(gives the Odds)}$$

Interpretation of $e^{b_1}$

Example: $\frac{p}{1-p} = e^{\,b_0 + b_1 X_1}$, where $X_1 = 0$ or $1$

$X_1 = 0$: $\frac{p}{1-p} = e^{b_0}$
$X_1 = 1$: $\frac{p}{1-p} = e^{\,b_0 + b_1} = e^{b_0} \cdot e^{b_1}$

Odds Ratio (OR) $= \frac{e^{b_0} \cdot e^{b_1}}{e^{b_0}} = e^{b_1}$

If OR = 1, the two groups have the same Odds
If OR > 1, X1 is a risk factor
If OR < 1, X1 is a protective factor

SLIDE 68

Output logistic

Logistic regression                         Number of obs = 189
                                            LR chi2(8)    = 33.22
                                            Prob > chi2   = 0.0001
Log likelihood = -100.724                   Pseudo R2     = 0.1416

low      Odds Ratio    Std. Err.    z        P>|z|    [95% Conf. Interval]
age      .9732636      .0354759     -0.74    0.457    .9061578    1.045339
lwt      .9849634      .0068217     -2.19    0.029    .9716834    .9984249
smoke    2.517698      1.00916      2.30     0.021    1.147676    5.523162
ptl      1.719161      .5952579     1.56     0.118    .8721455    3.388787
ht       6.249602      4.322408     2.65     0.008    1.611152    24.24199
ui       2.1351        .9808153     1.65     0.099    .8677528    5.2534
race
  2      3.534767      1.860737     2.40     0.016    1.259736    9.918406
  3      2.368079      1.039949     1.96     0.050    1.001356    5.600207
_cons    1.586014      1.910496     0.38     0.702    .1496092    16.8134
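The columns of this output are linked through the coefficient b = log(OR). The Python sketch below retraces the z value and the confidence interval for the smoke row. It relies on the delta-method relation SE(OR) = OR * SE(b), which is how STATA derives the standard error it shows for an odds ratio, and on 1.959964 as the 95% normal coefficient.

```python
import math

# The smoke row from the output above.
odds_ratio = 2.517698
se_or = 1.00916

b = math.log(odds_ratio)   # coefficient on the log-odds scale
se_b = se_or / odds_ratio  # delta method: SE(OR) = OR * SE(b)
z = b / se_b               # test value for H0: b = 0, i.e. OR = 1

Z_95 = 1.959964
# The interval is built on the log-odds scale, then exponentiated.
low = math.exp(b - Z_95 * se_b)
high = math.exp(b + Z_95 * se_b)

print(round(b, 4), round(se_b, 4), round(z, 2))
print(round(low, 4), round(high, 4))
```

Because the interval is computed on the log scale, it is not symmetric around the odds ratio. Here it excludes 1, which agrees with the p-value of 0.021.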