Introduction to data display
Useful questions to ask when considering how to display information
- What do you want to show?
- What methods are available for this?
- Is the method chosen the best? Would
another have been better?
Recommendations for the presentation of numbers
- When summarizing categorical data, both
frequencies and percentages can be used. However, if percentages are reported, it is important that the denominator (i.e. total number of observations) is given.
- To summarize continuous numerical data, one
should use the mean and standard deviation,
or, if the data have a skewed distribution, use
the median and range or interquartile range.
Recommendations when presenting data and results in tables
- Tables, including column and row headings,
should be clearly labeled, and a brief summary
of the contents of a table should always be
given in words, either as part of the title or in the main body of the text.
- The amount of information should be
maximized for the minimum amount of ink.
Recommendations for construction of graphs
- The amount of information should be maximized
for the minimum amount of ink.
- Each graph should have a title explaining what is
being displayed.
- Axes should be clearly labeled.
- Gridlines should be kept to a minimum.
- Avoid three-dimensional graphs as these can be
difficult to read.
- The number of observations should be included.
Table or graph?
Examples for badly displayed data
Describing categorical data
Clustered data
Displaying quantitative data
Tables for multiple outcome measures
Stem and leaf plot
Histogram
Showing distribution
Skewed data
Box–whisker plots
Displaying the relationship between two continuous variables
Regression
ROC curve
Tabulating categorical outcomes
Tabulating the results of logistic regression analysis
Tabulating quantitative outcomes
Tabulating the results of regression analyses
Patient flow diagram
Forest plots
Funnel plots
Survival
Displaying results in presentations
- Keep slides simple.
- Text is meant to be read. Ensure that your slides are legible.
- For slides use light text on a dark background.
- Keep information layout, colors, patterns, text styles, and
transitions and build effects consistent for all slides in a presentation.
- Maximum of six lines per slide and six words per line.
- Use graphics and animation effects sparingly.
- Sans serif fonts such as Arial are more legible on slides.
- Use a minimum font size of 28 points for titles and 18
points for the body of text.
Thanks for your attention
Statistical Methods
Descriptive vs. Inferential
- Descriptive statistics summarize your group.
– average age 78.5, 89.3% white.
- Inferential statistics use the theory of probability to
make inferences about larger populations from your sample.
– White patients were significantly older than black and Hispanic patients, P<0.001.
Enter your data with statistical analysis in mind.
- For small projects enter data into Microsoft
Excel or directly into SPSS.
- For large projects, create a database with
Microsoft Access.
- Keep variable names in the first row, with <=8
characters and no internal spaces.
- Enter as little text as possible and use codes
for categories, such as 1=male, 2=female.
Screen your data thoroughly for errors and inconsistencies before doing ANY analyses.
- Check the lowest and highest value for each variable.
– For example, age 1-777.
- Look at histograms to detect typos.
- Cross-check variables to detect impossible
combinations.
– For example, pregnant males, survivors discharged to the morgue, patients in the ICU for 25 days with no complications.
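The screening steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the records, the field names, and the 0 to 120 plausibility limits for age are all assumptions made up for the example.

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "age": 81, "sex": "F", "pregnant": False,
     "survival": "SURVIVED", "disposition": "HOME"},
    {"id": 2, "age": 777, "sex": "M", "pregnant": False,
     "survival": "EXPIRED", "disposition": "MORGUE"},
    {"id": 3, "age": 64, "sex": "M", "pregnant": True,
     "survival": "SURVIVED", "disposition": "HOME"},
    {"id": 4, "age": 90, "sex": "F", "pregnant": False,
     "survival": "SURVIVED", "disposition": "MORGUE"},
]

# Step 1: check the lowest and highest value of a variable.
ages = [r["age"] for r in records]
age_min, age_max = min(ages), max(ages)

# Step 2: flag values outside an assumed plausible range (0 to 120 years).
out_of_range = [r["id"] for r in records if not 0 <= r["age"] <= 120]

# Step 3: cross-check variables for impossible combinations.
impossible = [
    r["id"] for r in records
    if (r["sex"] == "M" and r["pregnant"])
    or (r["survival"] == "SURVIVED" and r["disposition"] == "MORGUE")
]

print("age range:", age_min, "-", age_max)
print("ids with implausible age:", out_of_range)
print("ids with impossible combinations:", impossible)
```

Flagged records still need to be checked against the source documents; the code only points to where to look.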
Analyze, descriptive statistics, frequencies, select the variable
Statistics: AGE
N (valid)        933
Mean             79.292
Median           81.300
Mode             90.0
Std. Deviation   26.537
Range            763.0
Minimum          14.0
Maximum          777.0
[Histogram: AGE, Std. Dev = 26.54]
Analyze, Descriptive Statistics, Crosstabs
SURVIVAL * 48-DISPOSITION Crosstabulation (Count)
48-DISPOSITION                         EXPIRED  SURVIVED  Total
HOME                                         0       224    224
REHABILITATION FACILITY                      0        56     56
OTHER HOSPITAL                               0        12     12
MORGUE                                      63         0     63
SKILLED NURSING FACILITY                     0       201    201
HOME WITH ASSISTANCE                         0       236    236
AMA DISCHARGE AGAINST MEDICAL ADVICE         0         3      3
8                                            0       138    138
Total                                       63       870    933
Correct the data in the original database or spreadsheet and import a revised version into the statistical package.
- The age of 777 should be checked and
changed to the correct age.
- Suspicious values, such as an age of 106,
should be checked. In this case it is correct.
Run descriptive statistics to summarize your data.
SURVIVAL
            Frequency  Percent  Valid Percent  Cumulative Percent
EXPIRED            63      6.8            6.8                 6.8
SURVIVED          870     93.2           93.2               100.0
Total             933    100.0          100.0
Statistics: 49-DAYS IN HOSPITAL
N (valid)        933
Mean             23.34
Median           19.00
Mode             20
Std. Deviation   18.03
Range            236
Minimum          1
Maximum          237
[Histogram: 49-DAYS IN HOSPITAL, Std. Dev = 18.03]
P Value
- A P value is an estimate of the probability that results
such as yours could have occurred by chance alone if there truly was no difference or association.
- P < 0.05 = 5% chance, 1 in 20.
- P <0.01 = 1% chance, 1 in 100.
- Alpha is the threshold. If P is < this threshold, you
consider it statistically significant.
Univariate vs. Multivariate
- Univariate analysis usually refers to one
predictor variable and one outcome variable
– Is gender a predictor of pneumonia?
- Multivariate analysis usually refers to more
than one predictor variable or more than one
outcome variable being evaluated
simultaneously.
– After adjusting for age, is gender a predictor of pneumonia?
Difference vs. Association
- Some tests are designed to assess whether there are
statistically significant differences between groups.
– Is there a statistically significant difference between the age of patients with and without pneumonia?
- Some tests are designed to assess whether there are
statistically significant associations between variables.
– Is the age of the patient associated with the number of days in the hospital?
Unmatched vs. Matched
- Some statistical tests are designed to assess
groups that are unmatched or independent.
– Is the admission systolic blood pressure different between men and women?
- Some statistical tests are designed to assess
groups that are matched or data that are paired.
– Is the systolic blood pressure different between admission and discharge?
Level of Measurement
- Categorical vs. continuous variables
– If you take the average of a continuous variable, it has meaning.
- Average age, blood pressure, days in the hospital.
– If you take the average of a categorical variable, it has no meaning.
- Average gender, race, smoker.
Level of Measurement
- Nominal - categorical
– gender, race, hypertensive
- Ordinal - categories that can be ranked
– none, light, moderate, heavy smoker
- Interval - continuous
– blood pressure, age, days in the hospital
Examples of Normal and Skewed
[Histogram: 44-DAYS IN ICU, Std. Dev = 3.99]
[Histogram: 35-SYSTOLIC BLOOD PRESSURE FIRST ER, Std. Dev = 27.74]
Commonly used statistical methods
- 1. Chi-square
- 2. Logistic regression
- 3. Student's t-test
- 4. Fisher's exact test
- 5. Kaplan-Meier method
- 7. Wilcoxon rank-sum test
- 8. Log-rank test
- 9. Linear regression analysis
Commonly used statistical methods
- 10. One-way analysis of variance (ANOVA)
- 12. Mann-Whitney U test
- 13. Kruskal-Wallis test
- 14. Repeated-measures analysis of variance
- 15. Paired t-test
- 16. Wilcoxon signed-rank test
Chi-square
- The most commonly used statistical test.
- Used to test if two or more percentages are
different.
- For example, suppose that in a study of 933
patients with a hip fracture, 10% of the men (22/219) develop pneumonia compared with 5% of the women (36/714).
- What is the probability that this could happen
by chance alone?
- Univariate, difference, unmatched, nominal,
>=2 groups, n>=20.
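As a rough illustration, the hip fracture example above (22/219 men and 36/714 women with pneumonia) can be checked by hand. The sketch below computes the Pearson chi-square statistic for a 2 by 2 table without a continuity correction, so the result may differ slightly from what a statistical package reports. The function name is an assumption made for this example.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if row and column were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: men, women. Columns: pneumonia, no pneumonia.
chi2, p_value = chi_square_2x2(22, 219 - 22, 36, 714 - 36)
print(f"chi-square = {chi2:.2f}, P = {p_value:.4f}")
```

The P value is then compared with the chosen alpha (for example 0.05) to decide whether the difference in percentages is statistically significant.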
Chi-square example
[SPSS output: crosstabulation and Chi-Square Tests table]
Fisher’s Exact Test
- This test can be used for 2 by 2 tables when
the number of cases is too small to satisfy the assumptions of the chi-square.
– Total number of cases is <20 or
– The expected number of cases in any cell is <1 or
– More than 25% of the cells have expected frequencies <5.
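For a small 2 by 2 table the exact probability can be computed directly from the hypergeometric distribution. The sketch below is one common way to get a two-sided P value (summing the probabilities of all tables that are no more likely than the observed one); the function name and the example counts are assumptions, not data from the study.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability of x cases in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical table with only 8 cases in total.
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(f"P = {p_value:.4f}")
```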
[SPSS output: crosstabulation and Chi-Square Tests table, including Fisher's Exact Test]
Student’s t-test
- Used to compare the average (mean) in one
group with the average in another group.
- Is the average age of patients significantly
different between those who developed pneumonia and those who did not?
- Univariate, Difference, Unmatched, Interval,
Normal, 2 groups.
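A minimal sketch of the calculation behind the test, assuming equal variances in the two groups (the classic pooled-variance form). The ages are made up for illustration and the function name is an assumption. The sketch returns only the t statistic and degrees of freedom; turning them into a P value needs a t distribution table or a statistical package.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group1, group2):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / standard_error
    return t, n1 + n2 - 2


# Hypothetical ages of patients with and without pneumonia.
age_pneumonia = [84, 79, 88, 91, 76]
age_no_pneumonia = [72, 80, 69, 75, 78, 71]
t, df = student_t(age_pneumonia, age_no_pneumonia)
print(f"t = {t:.2f}, df = {df}")
```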
[SPSS output: Independent Samples Test table]
Mann-Whitney U test
- Same as the Wilcoxon rank-sum test
- Used in place of the Student’s
t-test when the data are skewed.
- A nonparametric test that uses
the rank of the value rather than the actual value.
- Univariate, Difference,
Unmatched, Interval, Nonnormal, 2 groups.
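To show what "uses the rank of the value" means, the sketch below ranks the pooled values and derives the U statistic for each group. The lengths of stay are hypothetical and the function names are assumptions. Only the U statistics are computed; the P value comes from tables or a statistical package.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(group1, group2):
    """Return (U1, U2) for two independent groups."""
    ranks = average_ranks(list(group1) + list(group2))
    n1, n2 = len(group1), len(group2)
    rank_sum_1 = sum(ranks[:n1])
    u1 = rank_sum_1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return u1, u2


# Hypothetical days in hospital; the value 40 is an outlier that would
# distort a mean but only counts as "the largest" once ranked.
u1, u2 = mann_whitney_u([3, 5, 8, 40], [4, 6, 7, 9, 12])
print(u1, u2)
```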
Paired t-test
- Used to compare the average for
measurements made twice within the same person - before vs. after.
- Used to compare a treatment group and a
matched control group.
- For example, did the systolic blood pressure
change significantly from the scene of the injury to admission?
- Univariate, Difference, Matched, Interval,
Normal, 2 groups.
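The paired test works on the within-person differences rather than on the two sets of values separately. A minimal sketch, with made-up blood pressures and an assumed function name; as with the unpaired test, only the t statistic and degrees of freedom are computed here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """t statistic and degrees of freedom for paired measurements."""
    differences = [s - f for f, s in zip(first, second)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical systolic blood pressure for the same five patients.
sbp_scene = [150, 142, 160, 138, 155]
sbp_admission = [140, 139, 150, 136, 149]
t, df = paired_t(sbp_scene, sbp_admission)
print(f"t = {t:.2f}, df = {df}")
```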
Wilcoxon signed-rank test
- Used to compare two skewed continuous variables
that are paired or matched.
- Nonparametric equivalent of the paired t-test.
- For example, “Was the Glasgow Coma Scale score
different between the scene and admission?”
- Univariate, Difference, Matched, Interval,
Nonnormal, 2 groups.
ANOVA
- One-way: used to compare 3 or more means from independent groups.
– “Is the age different between White, Black, and Hispanic patients?”
- Two-way: used to compare 2 or more means by 2 or more factors.
– “Is the age different between males and females, with and without pneumonia?”
Kruskal-Wallis One-Way ANOVA
- Used to compare a continuous variable that
is not normally distributed between more than 2 groups.
- Nonparametric equivalent to the one-way
ANOVA.
- Is the length of stay different by ethnicity?
- Analyze, nonparametric tests, K independent
samples.
Repeated-Measures ANOVA
- Used to assess the change in 2 or more continuous
measurements made on the same person. Can also compare groups and adjust for covariates.
- Do changes in the vital signs within the first 24 hours
of a hip fracture predict which patients will develop
pneumonia?
- Analyze, General Linear Model, Repeated Measures.
Pearson Correlation
- Used to assess the linear association between
two continuous variables.
– r=1.0 perfect correlation
– r=0.0 no correlation
– r=-1.0 perfect inverse correlation
- Univariate, Association, Interval
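The three reference values of r listed above can be reproduced with a short calculation. This is a sketch using the standard formula for the Pearson correlation coefficient; the function name and the tiny data sets are assumptions chosen only to hit r = 1, r = -1, and r = 0.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


r_perfect = pearson_r([1, 2, 3], [2, 4, 6])        # y rises with x
r_inverse = pearson_r([1, 2, 3], [6, 4, 2])        # y falls as x rises
r_none = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])   # no linear trend
print(r_perfect, r_inverse, r_none)
```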
[SPSS output: Pearson correlation matrix with two-tailed significance values]
Spearman rank-order correlation
- Used to assess the relationship between two
ordinal variables or two skewed continuous
variables.
- Nonparametric equivalent of the Pearson
correlation.
- Univariate, Association, Ordinal (or skewed).
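One way to see the link between the two correlations: the Spearman coefficient is the Pearson coefficient computed on ranks instead of raw values. The sketch below follows that definition; function names and data are assumptions for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    ordered = sorted(values)
    # First position of v (1-based) plus half the number of extra ties.
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sum_xx = sum((a - mean_x) ** 2 for a in rx)
    sum_yy = sum((b - mean_y) ** 2 for b in ry)
    return sum_xy / sqrt(sum_xx * sum_yy)


# A curved but always-increasing relationship: the ranks agree perfectly.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
print(rho)
```

Because only the ordering matters, a single extreme value has much less influence than it would on the Pearson correlation.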
[SPSS output: Spearman correlation matrix with two-tailed significance values]
Summary of Inferential Tests
Unpaired vs. Paired
Unpaired:
- Student’s t-test
- Chi-square
- One-way ANOVA
- Mann-Whitney U test
- Kruskal-Wallis H test
Paired:
- Paired t-test
- McNemar’s test
- Repeated-measures ANOVA
- Wilcoxon signed-rank test
- Friedman ANOVA
Parametric vs. Nonparametric
Parametric:
- Student’s t-test
- One-way ANOVA
- Paired t-test
- Pearson correlation
- Correlated F ratio (repeated-measures ANOVA)
Nonparametric:
- Mann-Whitney U test
- Kruskal-Wallis test
- Wilcoxon signed-rank test
- Spearman’s r
- Friedman ANOVA
A Good Rule to Follow
- Always check your results with a
nonparametric test.
- If you test your null hypothesis with a
Student’s t-test, also check it with a Mann-Whitney U test.
- It will only take an extra 25 seconds.
Linear Regression
- Used to assess how one or more predictor
variables can be used to predict a continuous
outcome variable.
- “Do age, number of comorbidities, or
admission vital signs predict the length of stay in the hospital after a hip fracture?”
- Multivariate, Association, Interval/Ordinal
dependent variable.
[SPSS output: linear regression coefficients table]
Logistic Regression
- Used to assess the predictive value of one or more
variables on an outcome that is a yes/no question.
- “Do age, gender, and comorbidities predict which hip
fracture patients will develop pneumonia?”
- Multivariate, Difference, Nominal dependent
variable, not time-dependent, 2 groups.
[SPSS output: logistic regression variables]
1 Total number of comorbidities
2 Cirrhosis
3 COPD
4 Gender
5 Age
Draw Conclusions
- We reject the null hypothesis.
- Patients who are at high risk of developing
pneumonia during their hospitalization for a hip fracture can be identified by:
– total number of pre-existing conditions
– cirrhosis
– COPD
– male gender
Survival Analysis
- Kaplan-Meier method
– Used to plot cumulative survival
- Log-rank test
– Used to compare survival curves
- Cox proportional-hazards
– Used to adjust for covariates in survival analysis
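A minimal sketch of how the Kaplan-Meier (product-limit) estimate of cumulative survival is built up, assuming each patient has a follow-up time and a flag saying whether death was observed or the patient was censored. The data and the function name are made up for illustration.

```python
def kaplan_meier(times, events):
    """Return [(time, cumulative survival)] at each time with at least one death.

    times:  follow-up time for each patient
    events: True if the patient died at that time, False if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, died in zip(times, events) if ti == t and died)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up in days for five patients.
follow_up = [2, 3, 3, 5, 8]
died = [True, True, False, True, False]
for day, surv in kaplan_meier(follow_up, died):
    print(f"day {day}: cumulative survival {surv:.2f}")
```

Censored patients count in the denominator up to their last follow-up and then drop out, which is what distinguishes this estimate from a simple proportion surviving.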
Thanks for your attention
Introduction to Statistics
Descriptive Analysis
Review of Descriptive Stats.
- Descriptive Statistics are used to present
quantitative descriptions in a manageable form.
- This method works by reducing lots of data
into a simpler summary.
Univariate Analysis
- This is the examination across cases of one
variable at a time.
- Frequency distributions are used to group
data.
- One may set up margins that allow us to
group cases into categories.
- Examples include:
– age categories
– price categories
– temperature categories.
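Grouping cases into categories and counting them can be sketched as follows. The ages, the cut-points, and the category labels are assumptions chosen for the example, not the categories used elsewhere in these slides.

```python
from collections import Counter


def age_category(age):
    """Assign an age to one of three illustrative categories."""
    if age < 40:
        return "<40"
    if age < 60:
        return "40-59"
    return "60+"


# Hypothetical ages of ten cases.
ages = [23, 35, 41, 47, 52, 58, 61, 66, 70, 44]
counts = Counter(age_category(a) for a in ages)
total = sum(counts.values())

# Report both frequency and percentage, with the denominator visible.
for category in ("<40", "40-59", "60+"):
    n = counts[category]
    print(f"{category:6} {n:3} {100 * n / total:5.1f}%  (n={total})")
```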
Distributions
Two ways to describe a univariate distribution
- a table
- a graph (histogram, bar chart)
Distributions (cont.)
Distribution of participants of the research methodology workshop by sex
[Bar chart: percentage of participants by sex, Men and Women]
Sex     No.   %
Men     12    60
Women   8     40
Total   20    100
Distributions (cont.)
Workshop participants by specialty
[Bar chart: percentage of workshop participants by specialty (Microbiology, Environmental sciences, Fishery, Nursing, Other)]
Distributions (cont.)
Category   Percent
Under 35   9
36-45      21
46-55      45
56-65      19
66+        6
A Frequency Distribution Table
Distributions (cont.)
[Histogram: percent by category (Under 35, 36-45, 46-55, 56-65, 66+)]
A Histogram
Central Tendency
- An estimate of the “center” of a distribution
- Three different types of estimates:
– Mean
– Median
– Mode
Mean
- The most commonly used method of describing
central tendency.
- One basically totals all the results and then divides
by the number of units or “n” of the sample.
- Example: The pretest mean was determined by the
sum of all the scores divided by the number of students taking the exam.
Working Example (mean)
- Let's take the set of scores:
11,10,8,9,12,11,6,13
- The Mean would be 80/8=10
Median
- The median is the score found at the exact
middle of the set.
- One must list all scores in numerical order,
and then locate the score in the center of the sample.
- Example: if there are 500 scores in the list,
the median is the average of scores #250 and #251.
- Unlike the mean, the median is not pulled toward outliers.
Working Example (median)
- Let's take the set of scores:
11,10,8,9,12,11,6,13
- First line up the scores.
6, 8, 9, 10, 11, 11, 12, 13
- There are 8 scores, so the halfway point lies between
scores #4 and #5 (10 and 11). The median is their average, 10.5.
Mode
- The mode is the most repeated score in the
set of results.
- Let's take the set of scores:
11,10,8,9,12,11,6,13
- Again we first line up the scores:
6, 8, 9, 10, 11, 11, 12, 13
- The score 11 is the most repeated and is therefore the mode.
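The three worked examples above can be checked with Python's standard `statistics` module. This is only a quick verification of the arithmetic on the same eight scores.

```python
from statistics import mean, median, mode

scores = [11, 10, 8, 9, 12, 11, 6, 13]

score_mean = mean(scores)      # 80 / 8 = 10
score_median = median(scores)  # average of the 4th and 5th sorted scores = 10.5
score_mode = mode(scores)      # most repeated score = 11

print(score_mean, score_median, score_mode)
```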
Dispersion
- Three estimates:
– Range
– Mean Absolute Deviation
– Standard Deviation
- Standard Deviation is more accurate/detailed,
because an outlier can greatly extend the range.
Range
- The range is used to identify the highest and
lowest scores.
- Let's take the set of scores:
6, 8, 9, 10, 11, 11, 12, 13
- The range would be 6-13. This identifies the
fact that 7 points separate the highest from the lowest score.
Standard Deviation
- The Standard Deviation is a value that shows
how far individual scores typically fall from the mean of the sample.
- If scores are said to be standardized to a
normal curve then there are several statistical manipulations that can be performed to analyze the data set.
Standard Dev. (cont.)
- Assumptions may be made about the percentage of
scores as they deviate from the mean.
- If scores are normally distributed, then one can
assume that approximately 68% of the scores in the sample fall within one standard deviation of the
mean. Approximately 95% of the scores would then
fall within two standard deviations of the mean.
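The 68% and 95% figures come from the standard normal curve and can be reproduced with `statistics.NormalDist` from the Python standard library. A short check:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of a normal distribution within 1 and 2 standard deviations.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"within 1 SD: {within_one_sd:.1%}")
print(f"within 2 SD: {within_two_sd:.1%}")
```

These proportions hold only when the scores are approximately normally distributed; for skewed data they can be far off.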
Working Example (stand. dev.)
- Let's take the set of scores:
11,10,8,9,12,11,6,13
- The mean of this sample was found to be 10.
- Again we use the scores
11,10,8,9,12,11,6,13.
- 11-10=1, 10-10=0, 8-10=-2, 9-10=-1,12-10=2,
11-10=1, 6-10=-4,13-10=3
Working Ex. (Stand. dev. cont.)
- Square these values.
1, 0, 4, 1, 4, 1, 16, 9
- Total these values: 36.
- Divide 36 by 7 (the number of scores minus one): 5.14
- Take the square root of 5.14: 2.27
- 2.27 is your Standard Deviation.
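The steps of the worked example can be written out in code to check the arithmetic. The sketch follows the same sequence: deviations from the mean, squares, total, divide by n - 1, square root.

```python
from math import sqrt

scores = [11, 10, 8, 9, 12, 11, 6, 13]

score_mean = sum(scores) / len(scores)            # 10
deviations = [s - score_mean for s in scores]     # 1, 0, -2, -1, 2, 1, -4, 3
squared = [d ** 2 for d in deviations]            # 1, 0, 4, 1, 4, 1, 16, 9
total = sum(squared)                              # 36
sample_variance = total / (len(scores) - 1)       # 36 / 7
standard_deviation = sqrt(sample_variance)

print(f"variance = {sample_variance:.2f}, SD = {standard_deviation:.2f}")
```

Dividing by n - 1 rather than n gives the sample standard deviation, which is what statistical packages report by default for a sample.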
Interquartile range
- The median is the same as the 50th percentile.
- The 25th and 75th percentiles are called the lower and
upper quartiles.
Interquartile range
Definition: A set of n measurements on the variable x has been arranged in order of magnitude.
- The lower quartile (first quartile), Q1, is the value of x that
exceeds one-fourth of the measurements and is less than the remaining 3/4.
- The second quartile is the median.
- The upper quartile (third quartile), Q3, is the value of x
that exceeds three-fourths of the measurements and is less than the remaining one-fourth.
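The interquartile range is the distance between the upper and lower quartiles, Q3 - Q1. As a sketch, the quartiles of the eight scores used in the earlier worked examples can be computed with `statistics.quantiles`. Note the assumption: this uses the function's default interpolation method, and other software (or hand methods) may place the quartiles slightly differently for small samples.

```python
from statistics import quantiles

scores = [11, 10, 8, 9, 12, 11, 6, 13]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(scores, n=4)
interquartile_range = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {interquartile_range}")
```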
Thanks