Frequency Distribution and Summary Statistics
Dongmei Li
Department of Public Health Sciences, Office of Public Health Studies, University of Hawaii at Mānoa
Outline
- 1. Stemplot
- 2. Frequency table
- 3. Summary statistics
2
- 1. Stem-and-leaf plots (stemplots)
Always start by looking at the data with
graphs and plots
Our favorite technique for looking at a
single variable is the stemplot
A stemplot is a graphical technique that
organizes data into a histogram-like display
“You can observe a lot by looking” – Yogi Berra
3
Stemplot Illustrative Example
Select an SRS of 10 ages List data as an ordered array
05 11 21 24 27 28 30 42 50 52
Divide each data point into a stem-value
and leaf-value
In this example the “tens place” will be
the stem-value and the “ones place” will be the leaf-value, e.g., 21 has a stem value
of 2 and a leaf value of 1
4
Stemplot illustration (cont.)
Draw an axis for the stem-values, then place leaves next to their stem value. Here 21 is plotted:
0|
1|
2|1
3|
4|
5|
×10 (axis multiplier; important!)
5
Stemplot illustration continued …
Plot all data points and rearrange in rank order:
0|5
1|1
2|1478
3|0
4|2
5|02
×10
Here is the plot horizontally (for demonstration purposes):
        8
        7
        4           2
5   1   1   0   2   0
0   1   2   3   4   5
Rotated stemplot
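The stem-and-leaf construction described on these slides can be sketched in a few lines of Python. This is only an illustration: the function name `stemplot` and its `leaf_unit` parameter are hypothetical, not part of the course materials (the course itself uses JMP).

```python
from collections import defaultdict

def stemplot(values, leaf_unit=1):
    """Return stemplot lines: each value is rounded to the leaf unit,
    then split into a stem (all leading digits) and a one-digit leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        scaled = round(v / leaf_unit)      # value expressed in leaf units
        stem, leaf = divmod(scaled, 10)    # last digit is the leaf
        leaves[stem].append(str(leaf))
    lo, hi = min(leaves), max(leaves)
    # Empty stems are kept so that gaps in the data stay visible.
    return [f"{stem}|{''.join(leaves.get(stem, []))}" for stem in range(lo, hi + 1)]

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
lines = stemplot(ages)
print("\n".join(lines))
```

Passing `leaf_unit=0.1` makes the ones place the stem and the tenths place the leaf, which suits data with one decimal place.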
6
Interpreting Stemplots
Shape
- Symmetry
- Modality (number of peaks)
- Kurtosis (width of tails)
- Departures (outliers)
Location
- Gravitational center → mean
- Middle value → median
Spread
- Range and inter-quartile range
- Standard deviation and variance
7
Shape
“Shape” refers to the pattern when
plotted
Here’s the silhouette of our data
        X
        X
        X           X
X   X   X   X   X   X
0   1   2   3   4   5
Consider: symmetry, modality, kurtosis
8
Shape: Idealized Density Curve
A large dataset is introduced
A density curve is superimposed to better discuss shape
9
Symmetrical Shapes
10
Asymmetrical shapes
11
Modality (no. of peaks)
12
Kurtosis (width of tails)
Mesokurtic (medium)
Platykurtic (flat): skinny tails
Leptokurtic (steep): fat tails
Kurtosis is not easily judged by eye
13
Stemplot – Second Example
Data: 1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
Stem = ones-place; leaves = tenths-place
Round to keep one digit after the decimal point (e.g., 1.47 → 1.5)
Do not plot the decimal point
1|5
2|14
3|4789
4|4
(×1)
Shape: asymmetric, skewed to the left,
unimodal, no outliers
14
Draw a stemplot using JMP
Analyze---Distribution---Data---Stem and Leaf
15
Open the JMP data set named Stem_and_leaf_plot.jmp
Third Illustrative Example (n = 26)
Age data set from 26 subjects {14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28,
29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38}
16
Data set: Stem_and_leaf_plot_example2.jmp Distribution of the age variable?
- 2. Frequency Table
Frequency = count
Relative frequency = proportion or %
Cumulative frequency = % less than or equal to level
AGE   | Freq  Rel.Freq  Cum.Freq.
------+--------------------------
    3 |    2     0.3%      0.3%
    4 |    9     1.4%      1.7%
    5 |   28     4.3%      6.0%
    6 |   37     5.7%     11.6%
    7 |   54     8.3%     19.9%
    8 |   85    13.0%     32.9%
    9 |   94    14.4%     47.2%
   10 |   81    12.4%     59.6%
   11 |   90    13.8%     73.4%
   12 |   57     8.7%     82.1%
   13 |   43     6.6%     88.7%
   14 |   25     3.8%     92.5%
   15 |   19     2.9%     95.4%
   16 |   13     2.0%     97.4%
   17 |    8     1.2%     98.6%
   18 |    6     0.9%     99.5%
   19 |    3     0.5%    100.0%
------+--------------------------
Total |  654   100.0%
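The relative and cumulative columns follow directly from the counts. A minimal Python sketch, assuming only the frequencies in the table (the raw ages are not given in the slides):

```python
from itertools import accumulate

# Frequencies by age, copied from the table
counts = {3: 2, 4: 9, 5: 28, 6: 37, 7: 54, 8: 85, 9: 94, 10: 81, 11: 90,
          12: 57, 13: 43, 14: 25, 15: 19, 16: 13, 17: 8, 18: 6, 19: 3}

total = sum(counts.values())
running = list(accumulate(counts.values()))   # cumulative counts

# One row per age: (age, frequency, relative %, cumulative %)
table = [(age, freq, 100 * freq / total, 100 * cum / total)
         for (age, freq), cum in zip(counts.items(), running)]

for age, freq, rel, cum in table:
    print(f"{age:>5} | {freq:>4} {rel:>6.1f}% {cum:>6.1f}%")
print(f"Total | {total:>4}")
```

Accumulating counts first and dividing once per row avoids the drift that comes from adding up already-rounded percentages.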
17
Frequency Table with Class Intervals
When data are sparse, group data into class intervals
Create 4 to 12 class intervals
Classes can be uniform or non-uniform
End-point convention: e.g., a first class interval of 0 to 10 will include 0 but exclude 10 (0 to 9.99)
Tally frequencies
Calculate relative frequency
Calculate cumulative frequency
18
Class Intervals
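Class   | Freq  Relative Freq (%)  Cumulative Freq (%)
--------+---------------------------------------------
0 – 9   |    1          10                  10
10 – 19 |    1          10                  20
20 – 29 |    4          40                  60
30 – 39 |    1          10                  70
40 – 49 |    1          10                  80
50 – 59 |    2          20                 100
--------+---------------------------------------------
Total   |   10         100
Uniform class intervals table (width 10) for data:
05 11 21 24 27 28 30 42 50 52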
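The class-interval table can be reproduced with a short Python sketch. It assumes the end-point convention stated earlier (lower bound included, upper bound excluded); the variable names are illustrative only.

```python
ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
width = 10
n = len(ages)

# Tally: each class includes its lower bound and excludes its upper bound
freq = {}
for lower in range(0, 60, width):
    freq[lower] = sum(lower <= a < lower + width for a in ages)

# Build rows of (class label, frequency, relative %, cumulative %)
rows = []
cum = 0
for lower, f in freq.items():
    cum += f
    rows.append((f"{lower}-{lower + width - 1}", f, 100 * f / n, 100 * cum / n))

for label, f, rel, cum_pct in rows:
    print(f"{label:>7} | {f:>2} {rel:>5.0f} {cum_pct:>5.0f}")
```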
19
Histogram
[Figure: histogram; horizontal axis: Age Class (0-9, 10-19, 20-29, 30-39, 40-49, 50-59); vertical axis ticks 1 to 5]
A histogram is a frequency chart for a quantitative measurement. Notice how the bars touch.
20
Bar Chart
[Figure: bar chart; horizontal axis: School-level (Pre-, Elem., Middle, High); vertical axis ticks 50 to 500]
A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals
21
- 3. Summary Statistics
Central location
- Mean
- Median
- Mode
Spread
- Range and interquartile range (IQR)
- Variance and standard deviation
22
Location: Mean
“Eye-ball method”: visualize where the plot would balance
Arithmetic method: sum the values and divide by n
        8
        7
        4           2
5   1   1   0   2   0
0   1   2   3   4   5
          ^
     Grav. center
Eye-ball method: around 25 to 30 (takes practice)
Arithmetic method: mean = 290 / 10 = 29
23
Notation
n = sample size
X = the variable (e.g., ages of subjects)
xi = the value of individual i for variable X
Σ = sum all values (capital sigma)
Illustrative data (ages of participants):
21 42 5 11 30 50 28 27 24 52
n = 10
X = AGE variable
x1 = 21, x2 = 42, …, x10 = 52
Σxi = x1 + x2 + … + x10 = 21 + 42 + … + 52 = 290
24
Central Location: Sample Mean
“Arithmetic average”
Traditional measure of central location
Sum the values and divide by n
“xbar” refers to the sample mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right)$$
25
Example: Sample Mean
Ten individuals selected at random have the following ages:
21 42 5 11 30 50 28 27 24 52
Note that n = 10, Σxi = 21 + 42 + … + 52 = 290, and
$$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(290) = 29$$
[Figure: the data plotted on an axis from 10 to 60, with the mean marked at 29]
The sample mean is the gravitational center of a distribution
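The arithmetic can be checked with a few lines of Python. The sketch also illustrates the "gravitational center" idea: deviations from the mean cancel out exactly.

```python
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

n = len(ages)        # sample size
total = sum(ages)    # sum of all x_i
xbar = total / n     # sample mean

# The mean balances the data: deviations above and below it sum to zero
balance = sum(x - xbar for x in ages)

print(n, total, xbar, balance)
```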
26
Uses of the Sample Mean
The sample mean can be used to predict:
The value of an observation drawn at
random from the sample
The value of an observation drawn at
random from the population
The population mean
27
Population Mean
Same operation as the sample mean, except based on the entire population (N ≡ population size)
Conceptually important
Usually not available in practice
Sometimes referred to as the expected value
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
28
Central Location: Median
Ordered array:
05 11 21 24 27 28 30 42 50 52
When n is even, the median is the average of the (n ÷ 2)th and the (n ÷ 2 + 1)th values.
When n is odd, the median is the ((n + 1) ÷ 2)th value.
For the illustrative data, n = 10, so the median falls between the 5th and 6th values (27 and 28).
Average the adjacent values: M = (27 + 28) ÷ 2 = 27.5
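The even/odd rule translates directly into code. A minimal sketch (the function is written out for illustration; Python's built-in `statistics.median` applies the same rule):

```python
def median(values):
    """Median by the even/odd rule: sort first, then take the middle
    value (n odd) or average the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # the ((n + 1) / 2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of (n/2)th and (n/2 + 1)th

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
print(median(ages))
```

Sorting inside the function guards against the common mistake of taking the middle of an unordered list.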
29
More Examples of Medians
Example A: 2 4 6
Median = 4
Example B: 2 4 6 8
Median = 5 (average of 4 and 6)
Example C: 6 2 4
Median ≠ 2 (values must be ordered first; the ordered array 2 4 6 has a median of 4)
30
The Median is Robust
The median is more resistant to skews and outliers than the mean; it is more robust.
This data set has a mean of 1600: 1362 1439 1460 1614 1666 1792 1867
Here’s the same data set with a data entry error “outlier” (the last value). This data set has a mean of 2743: 1362 1439 1460 1614 1666 1792 9867
The median is 1614 in both instances, demonstrating its robustness in the face of outliers.
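Computing both statistics for the two versions of the data set is a quick way to check the quoted figures and to see the robustness directly. The sketch uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

clean = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = [1362, 1439, 1460, 1614, 1666, 1792, 9867]  # last value mistyped

for label, data in (("clean", clean), ("with error", with_error)):
    print(f"{label:>10}: mean = {mean(data):.0f}, median = {median(data)}")

# How far each statistic moves when the outlier is introduced
mean_shift = mean(with_error) - mean(clean)
median_shift = median(with_error) - median(clean)
```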
31
Mode
The mode is the most commonly
encountered value in the dataset
This data set has a mode of 7
{4, 7, 7, 7, 8, 8, 9}
This data set has no mode
{4, 6, 7, 8} (each point appears only once)
The mode is useful only in large data sets
with repeating values
32
Comparison of Mean, Median, Mode
Note how the mean gets pulled toward the longer tail more than the median
mean = median → symmetrical distribution
mean > median → positive skew
mean < median → negative skew
33
Spread: Quartiles
Two distributions can be quite
different yet can have the same mean
These data compare particulate
matter in air samples (μg/m³) at two sites. Both sites have a mean of 36, but Site 1 exhibits much greater variability. We would miss the high pollution days if we relied solely on the mean.
Site 1| |Site 2
    42|2|
     8|2|
     2|3|234
    86|3|6689
     2|4|0
      |4|
      |5|
      |5|
      |6|
     8|6|
×10
34
Spread: Range
Range = maximum – minimum
Illustrative example:
Site 1 range = 68 – 22 = 46
Site 2 range = 40 – 32 = 8
Beware: the sample range will
tend to underestimate the population range.
Always supplement the range
with at least one additional measure of spread
Site 1| |Site 2
    42|2|
     8|2|
     2|3|234
    86|3|6689
     2|4|0
      |4|
      |5|
      |5|
      |6|
     8|6|
×10
35
Spread: Quartiles
Quartile 1 (Q1): cuts off the bottom quarter of the data
= median of the lower half of the data set
Quartile 3 (Q3): cuts off the top quarter of the data
= median of the upper half of the data set
Interquartile Range (IQR) = Q3 – Q1
covers the middle 50% of the distribution
05 11 21 24 27 | 28 30 42 50 52
Lower half: Q1 = 21; median at the bar (27.5); upper half: Q3 = 42
IQR = 42 – 21 = 21
36
Quartiles (Tukey’s Hinges) – Example 2
Data are metabolic rates (cal/day), n = 7
1362 1439 1460 1614 1666 1792 1867 (median = 1614)
When n is odd, include the median in both halves of the data set.
Bottom half: 1362 1439 1460 1614, which has a median of 1449.5 (Q1)
Top half: 1614 1666 1792 1867, which has a median of 1729 (Q3)
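The hinge rule used in both quartile examples can be sketched as follows. The function name `tukey_hinges` is hypothetical. Note that statistical software often uses other quartile definitions, so its results may differ slightly from these hand calculations.

```python
from statistics import median

def tukey_hinges(values):
    """Q1 and Q3 as medians of the lower and upper halves of the data.
    When n is odd, the overall median is included in both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2            # size of each half (shares the median if n is odd)
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(upper)

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]

print(tukey_hinges(ages))
print(tukey_hinges(rates))
```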
37
Five-Point Summary
Q0 (the minimum)
Q1 (25th percentile)
Q2 (median)
Q3 (75th percentile)
Q4 (the maximum)
38
Boxplots
1. Calculate the 5-point summary. Draw a box from Q1 to Q3 with a line at the median.
2. Calculate the IQR and the fences as follows (do not draw the fences):
   FenceLower = Q1 – 1.5(IQR)
   FenceUpper = Q3 + 1.5(IQR)
3. Determine if any values lie outside the fences (outside values). If so, plot these separately.
4. Determine the values inside the fences (inside values). Draw a whisker from Q3 to the upper inside value. Draw a whisker from Q1 to the lower inside value.
39
Illustrative Example: Boxplot
Data: 05 11 21 24 27 28 30 42 50 52
1. 5-point summary: {5, 21, 27.5, 42, 52}; box from 21 to 42 with a line at 27.5
2. IQR = 42 – 21 = 21
   FU = Q3 + 1.5(IQR) = 42 + (1.5)(21) = 73.5
   FL = Q1 – 1.5(IQR) = 21 – (1.5)(21) = –10.5
3. No values above the upper fence; no values below the lower fence
4. Upper inside value = 52; lower inside value = 5; draw whiskers
[Figure: boxplot on an axis from 10 to 60; upper inside = 52, Q3 = 42, Q2 = 27.5, Q1 = 21, lower inside = 5]
40
Illustrative Example: Boxplot 2
Data: 3 21 22 24 25 26 28 29 31 51
1. 5-point summary: 3, 22, 25.5, 29, 51: draw the box
2. IQR = 29 – 22 = 7
   FU = Q3 + 1.5(IQR) = 29 + (1.5)(7) = 39.5
   FL = Q1 – 1.5(IQR) = 22 – (1.5)(7) = 11.5
3. One value above the top fence (51); one value below the bottom fence (3)
4. Upper inside value is 31; lower inside value is 21; draw whiskers
[Figure: boxplot on an axis from 10 to 60; outside value (51), upper inside value (31), upper hinge (29), median (25.5), lower hinge (22), lower inside value (21), outside value (3)]
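The four boxplot steps can be sketched in Python as below. This assumes quartiles are computed as Tukey's hinges, as in the slides; the function name `boxplot_summary` and the dictionary keys are illustrative only.

```python
from statistics import median

def boxplot_summary(values):
    """Hinges, fences, outside values, and whisker ends for a boxplot."""
    x = sorted(values)
    n = len(x)
    half = (n + 1) // 2                      # halves share the median when n is odd
    q1, q3 = median(x[:half]), median(x[n - half:])
    iqr = q3 - q1
    fence_lower = q1 - 1.5 * iqr
    fence_upper = q3 + 1.5 * iqr
    inside = [v for v in x if fence_lower <= v <= fence_upper]
    outside = [v for v in x if v < fence_lower or v > fence_upper]
    return {
        "five_point": (x[0], q1, median(x), q3, x[-1]),
        "iqr": iqr,
        "fences": (fence_lower, fence_upper),
        "outside": outside,                      # plotted as separate points
        "whiskers": (inside[0], inside[-1]),     # lower and upper inside values
    }

result = boxplot_summary([3, 21, 22, 24, 25, 26, 28, 29, 31, 51])
for key, value in result.items():
    print(key, value)
```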
41
Illustrative Example: Boxplot 3
Seven metabolic rates: 1362 1439 1460 1614 1666 1792 1867
Data source: Moore, 2000
1. 5-point summary: 1362, 1449.5, 1614, 1729, 1867
2. IQR = 1729 – 1449.5 = 279.5
   FU = Q3 + 1.5(IQR) = 1729 + (1.5)(279.5) = 2148.25
   FL = Q1 – 1.5(IQR) = 1449.5 – (1.5)(279.5) = 1030.25
3. None outside
4. Whiskers end at 1867 and 1362
[Figure: boxplot of the N = 7 values on an axis from 1300 to 1900]
42
Boxplots: Interpretation
Location
- Position of median
- Position of box
Spread
- Hinge-spread (IQR)
- Whisker-to-whisker spread
- Range
Shape
- Symmetry or direction of skew
- Long whiskers (tails) indicate leptokurtosis (Long
tails?)
43
Side-by-side boxplots
Boxplots are especially useful when comparing groups
44
Spread: Standard Deviation
Most common
descriptive measures of spread
Based on deviations
around the mean.
This figure
demonstrates the deviations of two of its values
This data set has a mean of 36. The data point 33 has a deviation of 33 – 36 = −3. The data point 40 has a deviation of 40 – 36 = 4.
45
Variance and Standard Deviation
Deviation = $x_i - \bar{x}$
Sum of squared deviations: $SS = \sum (x_i - \bar{x})^2$
Sample variance: $s^2 = \dfrac{SS}{n - 1}$
Sample standard deviation: $s = \sqrt{s^2}$
46
Standard deviation (formula)
$$s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$$
The term $\sum (x_i - \bar{x})^2$ is the sum of squares.
The sample standard deviation s is the estimator of the population standard deviation σ.
47
Illustrative Example: Standard Deviation
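Observation (xi) | Deviation (xi − x̄) | Squared deviation (xi − x̄)²
-----------------+--------------------+-----------------------------
              36 | 36 − 36 = 0        | 0² = 0
              38 | 38 − 36 = 2        | 2² = 4
              39 | 39 − 36 = 3        | 3² = 9
              40 | 40 − 36 = 4        | 4² = 16
              36 | 36 − 36 = 0        | 0² = 0
              34 | 34 − 36 = −2       | (−2)² = 4
              33 | 33 − 36 = −3       | (−3)² = 9
              32 | 32 − 36 = −4       | (−4)² = 16
-----------------+--------------------+-----------------------------
SUMS             | 0*                 | SS = 58
* Sum of deviations always equals zero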
48
Illustrative Example (cont.)
Sample variance (s²):
$$s^2 = \frac{SS}{n-1} = \frac{58}{8-1} = 8.286\ (\mu\text{g/m}^3)^2$$
Standard deviation (s):
$$s = \sqrt{s^2} = \sqrt{8.286} = 2.88\ \mu\text{g/m}^3$$
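The same calculation, step by step, as a Python sketch. The variable names are illustrative. The built-in `statistics.variance` and `statistics.stdev` use the same n − 1 divisor and can serve as a cross-check.

```python
import math

observations = [36, 38, 39, 40, 36, 34, 33, 32]   # particulate matter, μg/m³

n = len(observations)
xbar = sum(observations) / n                      # sample mean
deviations = [x - xbar for x in observations]     # x_i - xbar
ss = sum(d ** 2 for d in deviations)              # sum of squared deviations
variance = ss / (n - 1)                           # sample variance, (μg/m³)²
sd = math.sqrt(variance)                          # sample standard deviation, μg/m³

print(f"mean = {xbar}, SS = {ss}, s^2 = {variance:.3f}, s = {sd:.2f}")
```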
49
Interpretation of Standard Deviation
Measure of spread (e.g., if group 1 has s1 = 15 and group 2 has s2 = 10, group 1 has more spread, i.e., variability)
68-95-99.7 rule (next slide)
Chebychev’s rule (two slides hence)
50
68-95-99.7 Rule
Normal Distributions Only!
68% of data in the range μ ± σ
95% of data in the range μ ± 2σ
99.7% of data in the range μ ± 3σ
Example. Suppose a variable has a Normal distribution with μ = 30 and σ = 10. Then:
- 68% of values are between 30 ± 10 = 20 to 40
- 95% are between 30 ± (2)(10) = 30 ± 20 = 10 to 50
- 99.7% are between 30 ± (3)(10) = 30 ± 30 = 0 to 60
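The three percentages are rounded values of Normal probabilities. They can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sketch uses the example's μ = 30 and σ = 10.

```python
from statistics import NormalDist

mu, sigma = 30, 10
dist = NormalDist(mu=mu, sigma=sigma)

# Probability of falling within k standard deviations of the mean
coverage = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    coverage[k] = dist.cdf(high) - dist.cdf(low)
    print(f"{low} to {high}: {100 * coverage[k]:.1f}%")
```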
51
Chebychev’s Rule All Distributions
Chebychev’s rule says that at least 75% of
the values will fall in the range μ ± 2σ (for any shaped distribution)
Example: A distribution with μ = 30 and σ
= 10 has at least 75% of the values in the range 30 ± (2)(10) = 10 to 50
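Chebychev's rule can be checked empirically on any data set. The sketch below treats each data set as the whole distribution, so μ and σ are computed with the population formulas (`statistics.pstdev`). The function name `share_within` and the deliberately skewed second data set are made up for illustration. The general form of the bound, at least 1 − 1/k² within μ ± kσ, is not stated on the slide and is included as background.

```python
from statistics import mean, pstdev

def share_within(values, k=2):
    """Proportion of values lying within k standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)   # population SD: data treated as the full distribution
    inside = [v for v in values if mu - k * sigma <= v <= mu + k * sigma]
    return len(inside) / len(values)

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 50]   # hypothetical, strongly skewed data

bound = 1 - 1 / 2 ** 2                     # Chebychev's lower bound for k = 2
for data in (ages, skewed):
    print(share_within(data), ">=", bound)
```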
52
Rules for Rounding
Carry at least four significant digits during
calculations.
Round at last step of operation Always report units
Always use common sense and good judgment.
53
Choosing Summary Statistics
Always report a measure of central
location, a measure of spread, and the sample size
Symmetrical mound-shaped
distributions → report mean and standard deviation
Odd-shaped distributions → report 5-point
summaries (or median and IQR)
54
Software and Calculators
Use software and calculators to check work.
55
Excel Data Analysis ToolPak
Data set: Boxplot.xlsx Summary statistics using Excel
56
Excel Data Analysis ToolPak
Get summary statistics using Excel
57
Boxplot using Excel
YouTube link for drawing a boxplot in Excel: http://www.youtube.com/watch?v=s8ZW4PVarwE&feature=related
58
Boxplot and summary statistics by JMP
Data set: Boxplot.jmp
[Figure: side-by-side boxplots of female age and male age on an axis from 10 to 70]
What do you say about the comparison of the distributions of ages for females and males?
59
60
Exercise
Surgical times. Durations of surgeries
(hours) for 15 patients receiving artificial hearts are shown here. Create a stemplot of these
data. Describe the distribution. Are there any
outliers? What is the standard deviation of this
data set? Draw a boxplot based on this data set.
7.0 6.5 3.5 3.1 2.8 2.5 3.8 2.6 2.4 2.1 1.8 2.3 3.1 3.0 2.5
Data set: Presentation2_Exercise.xlsx Presentation2_exercise.jmp
61