Introduction to Inferential Statistics
Jaranit Kaewkungwal, Ph.D.
Faculty of Tropical Medicine Mahidol University
Data & Variables
Types of Data
QUALITATIVE
Data expressed by type
Data that has been described
QUANTITATIVE
Data classified by numeric value
Data that has been measured or counted
QUALITATIVE and QUANTITATIVE data are not mutually exclusive
Adapted from: Dr. Craig Jackson, University of Central England
NOMINAL DATA
values that the data may have do not have specific order
values act as labels with no real meaning
e.g. Health status: healthy = 1, sick = 2
e.g. Treatment: new regimen = 1, standard regimen = 2
e.g. hair colour: brown = 1, blond = 2, black = 100
ORDINAL DATA
values with some kind of ordering
data that has been measured or counted
e.g. social class: upper = 1, middle = 2, working = 3
e.g. glioblastoma tumor grade: 1, 2, 3, 4, 5
e.g. position in a race: 1st, 2nd, 3rd
DISCRETE
distinct or separate parts, with no finite detail
e.g. children in family
CONTINUOUS
between any two values, there would be a third
e.g. between meters there are centimetres
INTERVAL
equal intervals between values and an arbitrary zero on the scale
e.g. temperature gradient
RATIO
equal intervals between values and an absolute zero
e.g. body mass index
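The difference between an interval and a ratio scale can be illustrated with temperature. The sketch below is only an illustration: the Celsius values are taken from the temperature slide, and the use of the Kelvin scale as a ratio-scale counterpart is an assumption added here, not something the slides state.

```python
# Interval vs. ratio scale, illustrated with temperature.
# Celsius has an arbitrary zero (interval scale); Kelvin has an absolute zero
# (ratio scale). The Kelvin comparison is an added illustration, not from the slides.

def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15

low_c, high_c = 20.0, 40.0
low_k, high_k = celsius_to_kelvin(low_c), celsius_to_kelvin(high_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = high_c - low_c
diff_k = high_k - low_k

# Ratios are only meaningful on the ratio scale: 40 C is not "twice as hot" as 20 C.
ratio_c = high_c / low_c   # numerically 2.0, but not interpretable
ratio_k = high_k / low_k   # about 1.07, the physically meaningful ratio

print(f"difference: {diff_c:.1f} C = {diff_k:.1f} K")
print(f"ratio on Celsius scale: {ratio_c:.2f}; ratio on Kelvin scale: {ratio_k:.2f}")
```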
[Figure: Temperature described at different levels of measurement: White Hot, Red Hot, Cold; "Dangerous", "Unpleasant", "Uncomfortable", "Tolerable", "Comfortable", "Cold"; 80, 60, 40, 20, 10 °C; Unsafe, Safe]
1 2 3 4 88 99
Exclude from Analysis?
1 2 99
INDEPENDENT (syn: treatment, experimental, predictor, input, exposure, explanatory variable) is a stimulus or activity that is identified or manipulated to predict the dependent variable; independent variables are considered the causal factors, or the factors that you may manipulate.
e.g. new drug, working hours, exposure, worker attitudes, policies
DEPENDENT (syn: effect, criterion, criterion measure, outcome, output variable) is a response that the researcher wants to predict; dependent variables are considered the outcome variables.
e.g. symptomatology, productivity, accident rates, attitudes, health status, performance on neuropsychological test
CONTROLLED
e.g. working hours, temperatures, extraneous exposure, diet, class, income, ambient noise and temperature in testing room
[Diagram: X (independent) and Y (dependent); X (independent), X2 (independent) and Y (dependent)]
STD rate by condom use (all subjects)
Condom Use   STD rate
Yes          55/95 (58%)
No           45/105 (43%)

STD rate by condom use, stratified by number of partners
# Partners   Condom Use   STD rate
< 5          Yes          5/15 (33%)
< 5          No           30/82 (37%)
> 5          Yes          50/80 (62%)
> 5          No           15/23 (65%)
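The crude and stratified comparisons can be recomputed with a short sketch. The counts are copied from the condom-use tables; the function and variable names are hypothetical and only serve the illustration. It shows how pooling the two partner strata produces a crude difference that is not present within either stratum.

```python
# Crude vs. stratified STD rates by condom use.
# (cases, total) per cell, copied from the slide; names are illustrative only.
strata = {
    "< 5 partners": {"condom yes": (5, 15), "condom no": (30, 82)},
    "> 5 partners": {"condom yes": (50, 80), "condom no": (15, 23)},
}

def rate(cases, total):
    """STD rate as a percentage."""
    return 100 * cases / total

def crude(group):
    """Pool both strata for one condom-use group and return (cases, total)."""
    cases = sum(cells[group][0] for cells in strata.values())
    total = sum(cells[group][1] for cells in strata.values())
    return cases, total

# Crude comparison: condom users appear to have the higher STD rate.
for group in ("condom yes", "condom no"):
    cases, total = crude(group)
    print(f"crude, {group}: {cases}/{total} = {rate(cases, total):.1f}%")

# Stratified comparison: within each stratum, condom users have the lower rate.
for stratum, cells in strata.items():
    for group, (cases, total) in cells.items():
        print(f"{stratum}, {group}: {cases}/{total} = {rate(cases, total):.1f}%")
```

The reversal arises because most condom users fall in the stratum with more than 5 partners, where the STD rate is high regardless of condom use.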
Dependent Var: Infant/Child Growth
Independent Var: Adult Fatness
Extraneous Var: Adult Age, Socio-economic, Smoking, Physical Activity, etc.
Sex (male & female)
(Bias) (Chance)
Johnson AF. Beneath the technological fix. J Chron Dis 1985 (38), 957-961
Standard Score Raw Score
Sampling Techniques Generalization/ Inferential Statistics
X1 X2 X3 X4 X5 X6 X7 X8 X9
Sample mean: $\bar{X} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
X1 X2 X3 X4 X5 X6 X7 X8 X9
( 1 2 2 2 2 3 3 4 5)
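As a minimal sketch, the nine values above can be summarised with Python's standard `statistics` module. The slide does not say which summaries it computes from these values; using them to illustrate the mean, median, and mode is an assumption.

```python
from statistics import mean, median, mode

# The nine observations X1..X9 listed on the slide.
values = [1, 2, 2, 2, 2, 3, 3, 4, 5]

sample_mean = mean(values)      # sum of the values divided by n
sample_median = median(values)  # middle value of the sorted data
sample_mode = mode(values)      # most frequent value

print(f"n = {len(values)}")
print(f"mean = {sample_mean:.2f}, median = {sample_median}, mode = {sample_mode}")
```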
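[Figure: bar chart of Count by Gender (Male, Female)]
Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard deviation: $S = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$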
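The population and sample standard deviations can be sketched as follows. The data are the nine values listed earlier in the slides, reused here purely for illustration; the n - 1 divisor for the sample version is the conventional definition and is assumed here.

```python
from math import sqrt
from statistics import pstdev, stdev

values = [1, 2, 2, 2, 2, 3, 3, 4, 5]

def population_sd(xs):
    """Square root of the mean squared deviation from the mean (divide by N)."""
    mu = sum(xs) / len(xs)
    return sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def sample_sd(xs):
    """Same sum of squared deviations, but divided by n - 1 (assumed convention)."""
    x_bar = sum(xs) / len(xs)
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1))

print(f"population SD = {population_sd(values):.4f}")
print(f"sample SD     = {sample_sd(values):.4f}")

# The standard library provides the same two quantities.
print(f"statistics.pstdev = {pstdev(values):.4f}, statistics.stdev = {stdev(values):.4f}")
```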
Negatively skewed
to include the true value in population
not among different groups
Sampling Method
Inferential Statistics
$\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$
Confidence Limits of μ: $\bar{X} \pm Z_{\alpha/2,\nu} \, S_{\bar{X}}$
95% CI of μ: $\bar{X} \pm 1.96 \times (SD/\sqrt{n})$, where $SD/\sqrt{n}$ is the standard error
Example: $25 \pm 1.96 \times (12.2/\sqrt{100})$
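The worked example (mean 25, SD 12.2, n = 100) can be evaluated with a small sketch. The function name `ci95` is hypothetical, and the fixed multiplier 1.96 is the one used on the slide.

```python
from math import sqrt

def ci95(mean, sd, n, z=1.96):
    """Return (lower, upper) limits of mean +/- z * (sd / sqrt(n))."""
    standard_error = sd / sqrt(n)
    margin = z * standard_error
    return mean - margin, mean + margin

# Values from the slide: sample mean 25, SD 12.2, n = 100.
lower, upper = ci95(mean=25, sd=12.2, n=100)
print(f"standard error = {12.2 / sqrt(100):.2f}")
print(f"95% CI of the mean: {lower:.2f} to {upper:.2f}")
```

With a standard error of 1.22, the interval runs from about 22.61 to about 27.39.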
Point Estimates: Single values (Mean, Variance, Correlation, treatment effect, relative risk, etc.) representing characteristics in the whole population
Interval Estimates: Ranges of values, usually centered around point estimates, indicating bounds within which we expect the true values for the whole population to lie (stability
95% CI (from Sample 2) 95% CI (from Sample 1) 95% CI (from Sample 3)
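[Figure: normal curves showing rejection regions. Two-tailed: 2.5% in each tail, 95% in the middle. One-tailed: 5% in one tail, 95% in the remainder. For α/2 = 0.005, the critical values are −2.576 and 2.576.]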
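The critical values behind these tail areas can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch; the helper names `two_tailed_z` and `one_tailed_z` are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def two_tailed_z(alpha):
    """Critical z that leaves alpha/2 in each tail of the standard normal."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)

def one_tailed_z(alpha):
    """Critical z that leaves alpha in the upper tail of the standard normal."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha)

for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: two-tailed z = ±{two_tailed_z(alpha):.3f}, "
          f"one-tailed z = {one_tailed_z(alpha):.3f}")
```

For α = 0.05 the two-tailed value is about 1.960, and for α = 0.01 (α/2 = 0.005) it is about 2.576, matching the values shown.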
p-value = 0.04
(G1=G2)
(G1<>G2)
0.01, 0.05 0.99, 0.95 0.10, 0.20 0.90, 0.80
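A minimal sketch of the decision rule follows, under two assumptions that the extracted slide text does not state explicitly: that (G1=G2) is the null hypothesis and (G1<>G2) the alternative, and that 0.01 and 0.05 are significance levels (α). The function name is hypothetical.

```python
def reject_null(p_value, alpha):
    """Reject the null hypothesis (assumed: G1 = G2) when p-value < alpha."""
    return p_value < alpha

p_value = 0.04  # value shown on the slide

# Assumed significance levels: the 0.05 and 0.01 listed on the slide.
for alpha in (0.05, 0.01):
    decision = "reject" if reject_null(p_value, alpha) else "do not reject"
    print(f"alpha = {alpha}: {decision} the null hypothesis")
```

The same p-value leads to different conclusions depending on the significance level chosen before the analysis.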
Goal / Type of Outcome Data

Describe Value of Data (1 Group)
  Continuous (from Gaussian Population): Mean, SD
  Continuous (Non-Gaussian): Median, Interquartile range
  Categorical / Binomial: Proportion (Percent)
  Survival analysis (Time): Kaplan-Meier survival curve
Compare Value of Data vs. Hypothetical Value (1 Group)
  Continuous (from Gaussian Population): One-sample t-test
  Continuous (Non-Gaussian): Wilcoxon's test
  Categorical / Binomial: Chi-square (χ2) or Binomial/Runs test
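Goal / Type of Outcome Data

Compare Values, Independent Groups, 2 Groups
  Continuous (from Gaussian Population): Unpaired t-test
  Continuous (Non-Gaussian): Mann-Whitney test
  Categorical / Binomial: χ2 test, Fisher's Exact
  Survival analysis (Time): Log-rank / Mantel-Haenszel
Compare Values, Independent Groups, >2 Groups
  Continuous (from Gaussian Population): One-way ANOVA
  Continuous (Non-Gaussian): Kruskal-Wallis test
  Categorical / Binomial: χ2 test
  Survival analysis (Time): Cox Proportional Hazards Regression
Compare Values, Paired Groups/Variables, 2 Groups
  Continuous (from Gaussian Population): Paired t-test
  Continuous (Non-Gaussian): Wilcoxon's test
  Categorical / Binomial: McNemar's χ2 test
  Survival analysis (Time): Conditional Proportional Hazards Regression
Compare Values, Paired Groups/Variables, >2 Groups
  Continuous (from Gaussian Population): Repeated measures ANOVA
  Continuous (Non-Gaussian): Friedman's test
  Categorical / Binomial: Cochran's Q test
  Survival analysis (Time): Conditional Proportional Hazards Regression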
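The group-comparison guide above can be expressed as a simple lookup. This is only a sketch: the test names are those listed in the slide, while the dictionary keys (`"independent"`, `"paired"`, `"gaussian"`, and so on) and the function `choose_test` are hypothetical names introduced for the illustration.

```python
# Lookup of the comparison tests listed on the slide.
# Keys are hypothetical short labels for the slide's rows and columns.
TESTS = {
    ("independent", "2"): {
        "gaussian": "Unpaired t-test",
        "non_gaussian": "Mann-Whitney test",
        "categorical": "Chi-square test or Fisher's exact test",
        "survival": "Log-rank / Mantel-Haenszel",
    },
    ("independent", ">2"): {
        "gaussian": "One-way ANOVA",
        "non_gaussian": "Kruskal-Wallis test",
        "categorical": "Chi-square test",
        "survival": "Cox proportional hazards regression",
    },
    ("paired", "2"): {
        "gaussian": "Paired t-test",
        "non_gaussian": "Wilcoxon's test",
        "categorical": "McNemar's chi-square test",
        "survival": "Conditional proportional hazards regression",
    },
    ("paired", ">2"): {
        "gaussian": "Repeated measures ANOVA",
        "non_gaussian": "Friedman's test",
        "categorical": "Cochran's Q test",
        "survival": "Conditional proportional hazards regression",
    },
}

def choose_test(design, groups, outcome):
    """Return the test the slide lists for a design, group count, and outcome type."""
    try:
        return TESTS[(design, groups)][outcome]
    except KeyError:
        raise ValueError(f"no entry for {design!r}, {groups!r}, {outcome!r}") from None

print(choose_test("independent", "2", "gaussian"))
print(choose_test("paired", ">2", "categorical"))
```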
Goal / Type of Outcome Data

Quantify Association (Values of Two Variables)
  Continuous (from Gaussian Population): Pearson's Correlation
  Continuous (Non-Gaussian): Spearman's Correlation
  Categorical / Binomial: Contingency coefficient, Crude Odds Ratio, Relative Risk
Predict Value of Outcome Var: from 1 Var (Simple Reg.), from > 2 Vars (Multiple Reg.)
  Continuous (from Gaussian Population): Linear or Non-linear Regression
  Continuous (Non-Gaussian): Non-parametric Regression
  Categorical / Binomial: Logistic Regression
  Survival analysis (Time): Cox's Proportional
Goal / Type of Outcome Data
(Continuous from Gaussian Population; Continuous Non-Gaussian; Categorical / Binomial; Survival analysis Time)

Measures of Agreement (Values from Two Raters/Methods)
  Pearson's Correlation; Weighted Kappa (κ); Weighted Kappa (κ), Agreement rate, Cohen's κ; ICC
Measures of Validity (Values from Two Raters/Methods)
  ANOVA, Factor Analysis; Non-parametric ANOVA, Factor Analysis; χ2, Sensitivity, Specificity, ROC curve
[Figure 2, panels A and B: Survival (%) versus Months (0 to 120); panel B shows separate curves for <10,000, 10,000-100,000, and >100,000]
Figure 2. Survival from time of human immunodeficiency virus (HIV) infection of 194 CSWs. A, Overall. B, By serum virus load (HIV type 1 RNA copies/mL).
Each curve is truncated when <10 women remain in that group.
Characteristics                   Patients   Died (%)    % (95% CI)*         (95% CI)**
Age at infection, years
  <= 19                           105        31 (29.5)   72.3 (62.1-80.1)    Referent
  >= 20                           89         35 (39.3)   63.3 (50.5-73.6)    1.50 (0.92-2.45)
Sex work
  Brothel                         159        54 (34.0)   69.6 (61.1-76.6)    1.34 (0.71-2.52)
  Nonbrothel                      35         12 (34.3)   62.9 (41.6-78.2)    Referent
Oral contraceptive use
  Yes                             112        36 (32.1)   69.6 (59.2-77.9)    0.83 (0.51-1.36)
  No                              82         30 (36.6)   67.1 (54.7-76.8)    Referent
Depot medroxyprogesterone use
  Yes                             55         18 (32.7)   70.9 (55.7-81.7)    0.78 (0.45-1.37)
  No                              139        48 (34.5)   67.8 (58.4-75.5)    Referent
Infection status
  Seroconverted                   34         7 (20.6)    ***                 1.42 (0.63-3.22)
  Seropositive at enrollment      160        59 (36.9)   69.6 (61.4-76.4)    Referent
Viral load, HIV-1 RNA copies/mL
  >100,000                        34         24 (70.6)   34.5 (18.8-50.9)    15.40 (5.2-45.2)
  10,000-100,000                  113        38 (33.6)   70.3 (60.1-78.4)    4.63 (1.64-13.1)
  <10,000                         47         4 (8.5)     92.5 (78.4-97.5)    Referent
Total                             194        66 (34.0)   68.7 (61.0-75.2)

* Survival analysis
** Cox proportional hazard model
*** Insufficient follow-up time for this more recently converted group; 5-year survival = 77.8 (56.8-89.5) %
Table 2. Survival from time of infection of 194 HIV-infected CSWs
Risk factors for tuberculosis (Distant from Outcome): Crowding, Malnutrition, Vaccination, Genetic
Mechanism of Tuberculosis (Proximal to Outcome): Exposure to Mycobacterium, Susceptible Host, Infection, Tissue Invasion and Reaction, Tuberculosis
Example: Relationship between risk factors and disease: hypertension (BP) and congestive heart failure (CHF). Hypertension causes many diseases, including congestive heart failure, and congestive heart failure has many causes, including hypertension.
Characteristics                       Patients   Died (%)    % (95% CI)*         (95% CI)**          (95% CI)***
Initial CD4 lymphocyte, cells/μL
  <200                                15         14 (93.3)   0                   20.9 (9.00-48.7)    15.5 (6.46-37.4)
  200-500                             88         34 (38.6)   63.4 (51.8-72.9)    2.46 (1.21-5.01)    1.42 (0.67-3.00)
  >500                                54         10 (18.5)   84.7 (70.4-92.4)    Referent            Referent
Viral load, HIV-1 RNA copies/mL
  >100,000                            30         23 (76.7)   26.7 (12.6-43.0)    13.9 (4.78-40.6)    12.5 (4.09-38.2)
  10,000-100,000                      89         31 (34.8)   65.0 (53.1-74.5)    3.87 (1.36-11.0)    3.42 (1.19-9.81)
  <10,000                             38         4 (10.5)    96.7 (78.6-99.5)    Referent            Referent
Total                                 157        58 (36.9)   64.6 (56.0-71.9)

* Survival analysis
** Cox proportional hazard model
*** Cox proportional hazard model adjusted for initial CD4 lymphocyte count and virus load
Table 3. Mortality from time of first CD4 T lymphocyte count of 157 HIV-infected CSWs (125 women were HIV seropositive at study enrollment and 32 seroconverted during study)
$\kappa = \dfrac{\text{Observed agreement} - \text{Expected agreement}}{1 - \text{Expected agreement}}$
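The ratio above is the usual form of Cohen's κ, and it can be sketched for a square table of counts from two raters. The 2 × 2 counts below are hypothetical and are not taken from the slides; the function name `cohens_kappa` is likewise an illustration.

```python
def cohens_kappa(table):
    """Kappa for a square table: table[i][j] counts items that rater A put in
    category i and rater B put in category j."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Observed agreement: proportion of items on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Expected (chance) agreement from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2

    return (observed - expected) / (1 - expected)

# Hypothetical counts for two raters classifying 100 items as positive/negative.
ratings = [
    [40, 10],  # rater A positive: rater B positive, rater B negative
    [5, 45],   # rater A negative: rater B positive, rater B negative
]

print(f"kappa = {cohens_kappa(ratings):.2f}")
```

In this hypothetical table the observed agreement is 0.85 and the expected agreement is 0.50, so κ = (0.85 - 0.50) / (1 - 0.50) = 0.70.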
[Figure: systolic (87 to 190) by age group (<=20, 21-40, >=41)]
[Figure: CHD Mortality per 10,000 versus Cigarette Consumption per Adult per Day, 1890-1962]