[PPT] - Frequency Distribution and Summary Statistics Dongmei Li PowerPoint Presentation

SLIDE 1

Frequency Distribution and Summary Statistics

Dongmei Li Department of Public Health Sciences Office of Public Health Studies University of Hawai’i at Mānoa

SLIDE 2

Outline

1. Stemplot
2. Frequency table
3. Summary statistics

2

SLIDE 3

1. Stem-and-leaf plots (stemplots)

• Always start by looking at the data with graphs and plots
• Our favorite technique for looking at a single variable is the stemplot
• A stemplot is a graphical technique that organizes data into a histogram-like display

"You can observe a lot by looking" – Yogi Berra

3

SLIDE 4

Stemplot Illustrative Example

• Select an SRS (simple random sample) of 10 ages
• List data as an ordered array:

05 11 21 24 27 28 30 42 50 52

• Divide each data point into a stem-value and leaf-value
• In this example the “tens place” will be the stem-value and the “ones place” will be the leaf-value, e.g., 21 has a stem-value of 2 and a leaf-value of 1

4

SLIDE 5

Stemplot illustration (cont.)

• Draw an axis for the stem-values:

0|
1|
2|
3|
4|
5|
×10  ← axis multiplier (important!)

• Place leaves next to their stem-value; for example, 21 is plotted as leaf 1 on stem 2:

2|1

5

SLIDE 6

Stemplot illustration continued …

• Plot all data points and rearrange in rank order:

0|5
1|1
2|1478
3|0
4|2
5|02
×10

• Here is the plot horizontally (for demonstration purposes):

        8
        7
        4           2
5   1   1   0   2   0
---------------------
0   1   2   3   4   5
Rotated stemplot
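The construction steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `stemplot` and the `leaf_unit` parameter are assumptions made here, and the sketch handles non-negative values only.

```python
from collections import defaultdict

def stemplot(values, leaf_unit=1):
    """Return stemplot lines; leaf_unit is the value of one leaf digit (hypothetical helper)."""
    stems = defaultdict(list)
    for v in sorted(values):
        scaled = int(round(v / leaf_unit))   # e.g. 21 -> 21 when leaf_unit = 1
        stem, leaf = divmod(scaled, 10)      # 21 -> stem 2, leaf 1
        stems[stem].append(leaf)
    lines = []
    # Walk every stem on the axis, including stems that have no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem}|{leaves}")
    return lines

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
plot = stemplot(ages)
print("\n".join(plot))
```

Calling `stemplot(values, leaf_unit=0.1)` rounds to the tenths place first, which matches the "round to keep one digit after the decimal point" convention used for decimal data.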

6

SLIDE 7

Interpreting Stemplots

• Shape
  - Symmetry
  - Modality (number of peaks)
  - Kurtosis (width of tails)
  - Departures (outliers)
• Location
  - Gravitational center → mean
  - Middle value → median
• Spread
  - Range and inter-quartile range
  - Standard deviation and variance

7

SLIDE 8

Shape

• “Shape” refers to the pattern when plotted
• Here’s the silhouette of our data:

        X
        X
        X           X
X   X   X   X   X   X
---------------------
0   1   2   3   4   5

• Consider: symmetry, modality, kurtosis

8

SLIDE 9

Shape: Idealized Density Curve

A large dataset is introduced

A density curve is superimposed to better discuss shape

9

SLIDE 10

Symmetrical Shapes

10

SLIDE 11

Asymmetrical shapes

11

SLIDE 12

Modality (no. of peaks)

12

SLIDE 13

Kurtosis (width of tails)

Mesokurtic (medium)
Platykurtic (flat) → skinny tails
Leptokurtic (steep) → fat tails

Kurtosis is not easily judged by eye

13

SLIDE 14

Stemplot – Second Example

• Data: 1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
• Stem = ones-place
• Leaves = tenths-place
• Round to keep one digit after the decimal point (e.g., 1.47 → 1.5)
• Do not plot the decimal point

|1|5
|2|14
|3|4789
|4|4
(×1)

• Shape: asymmetric, skewed to the left, unimodal, no outliers

14

SLIDE 15

Draw a stemplot using JMP

Open the JMP data set named Stem_and_leaf_plot.jmp

Analyze → Distribution → Data → Stem and Leaf

SLIDE 16

Third Illustrative Example (n = 26)

• Age data set from 26 subjects:
{14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38}
• Data set: Stem_and_leaf_plot_example2.jmp
• What is the distribution of the age variable?

SLIDE 17

2. Frequency Table

• Frequency = count
• Relative frequency = proportion or %
• Cumulative frequency → % less than or equal to level

  AGE | Freq  Rel.Freq  Cum.Freq.
------+--------------------------
    3 |    2     0.3%      0.3%
    4 |    9     1.4%      1.7%
    5 |   28     4.3%      6.0%
    6 |   37     5.7%     11.6%
    7 |   54     8.3%     19.9%
    8 |   85    13.0%     32.9%
    9 |   94    14.4%     47.2%
   10 |   81    12.4%     59.6%
   11 |   90    13.8%     73.4%
   12 |   57     8.7%     82.1%
   13 |   43     6.6%     88.7%
   14 |   25     3.8%     92.5%
   15 |   19     2.9%     95.4%
   16 |   13     2.0%     97.4%
   17 |    8     1.2%     98.6%
   18 |    6     0.9%     99.5%
   19 |    3     0.5%    100.0%
------+--------------------------
Total |  654   100.0%
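As a rough check on the table above, relative and cumulative frequencies can be recomputed from the counts alone. The sketch below is illustrative only: the function name `frequency_table` is an assumption made here, and the counts are copied from the Freq column.

```python
# Counts by age, copied from the Freq column of the table above.
counts = {3: 2, 4: 9, 5: 28, 6: 37, 7: 54, 8: 85, 9: 94, 10: 81, 11: 90,
          12: 57, 13: 43, 14: 25, 15: 19, 16: 13, 17: 8, 18: 6, 19: 3}

def frequency_table(counts):
    """Return (level, freq, relative %, cumulative %) rows in level order (hypothetical helper)."""
    total = sum(counts.values())
    rows, running = [], 0
    for level in sorted(counts):
        running += counts[level]              # running total up to and including this level
        rows.append((level, counts[level],
                     100 * counts[level] / total,
                     100 * running / total))
    return rows

rows = frequency_table(counts)
for level, freq, rel, cum in rows:
    print(f"{level:>5} | {freq:>4} {rel:>7.1f}% {cum:>7.1f}%")
```

The cumulative column is just a running sum of the counts divided by the total, so the last row always reaches 100%.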

17

SLIDE 18

Frequency Table with Class Intervals

• When data are sparse, group data into class intervals
• Create 4 to 12 class intervals
• Classes can be uniform or non-uniform
• End-point convention: e.g., a first class interval of 0 to 10 will include 0 but exclude 10 (0 to 9.99)
• Tally frequencies
• Calculate relative frequency
• Calculate cumulative frequency

18

SLIDE 19

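Class Intervals

Uniform class intervals table (width 10) for data:

05 11 21 24 27 28 30 42 50 52

Class    | Freq | Relative Freq. (%) | Cumulative Freq. (%)
---------+------+--------------------+---------------------
 0 – 9   |   1  |         10         |          10
10 – 19  |   1  |         10         |          20
20 – 29  |   4  |         40         |          60
30 – 39  |   1  |         10         |          70
40 – 49  |   1  |         10         |          80
50 – 59  |   2  |         20         |         100
Total    |  10  |        100         |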
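The tallying step for uniform class intervals can be sketched as follows. This is a minimal illustration, assuming classes of width 10 starting at 0 and the end-point convention described earlier (lower end-point included, upper excluded); the function name `class_counts` and its parameters are assumptions, not part of the slides.

```python
from itertools import accumulate

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]

def class_counts(values, width=10, start=0, n_classes=6):
    """Tally values into uniform classes [start, start+width), ... (hypothetical helper)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)   # 0-9 -> class 0, 10-19 -> class 1, ...
        counts[index] += 1
    return counts

counts = class_counts(ages)
relative = [100 * c / len(ages) for c in counts]    # relative frequency in %
cumulative = list(accumulate(relative))             # cumulative frequency in %

for i, (c, r, cum) in enumerate(zip(counts, relative, cumulative)):
    print(f"{10 * i:>2} - {10 * i + 9:<2} | {c:>2} | {r:>5.0f} | {cum:>5.0f}")
```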

19

SLIDE 20

Histogram

[Figure: histogram of frequency (axis ticks 1 to 5) by age class: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59]

A histogram is a frequency chart for a quantitative measurement. Notice how the bars touch.

20

SLIDE 21

Bar Chart

[Figure: bar chart of counts (axis 50 to 500) by school level: Pre-, Elem., Middle, High]

A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals.

21

SLIDE 22

3. Summary Statistics

• Central location
  - Mean
  - Median
  - Mode
• Spread
  - Range and interquartile range (IQR)
  - Variance and standard deviation

22

SLIDE 23

Location: Mean

• “Eye-ball method” → visualize where the plot would balance
• Arithmetic method = sum the values and divide by n

        8
        7
        4           2
5   1   1   0   2   0
---------------------
0   1   2   3   4   5
           ^
      Grav. center

• Eye-ball method → around 25 to 30 (takes practice)
• Arithmetic method: mean = 290 / 10 = 29

23

SLIDE 24

Notation

• n ≡ sample size
• X ≡ the variable (e.g., ages of subjects)
• xi ≡ the value of individual i for variable X
• Σ ≡ sum all values (capital sigma)
• Illustrative data (ages of participants):

21 42 5 11 30 50 28 27 24 52

n = 10
X = AGE variable
x1 = 21, x2 = 42, …, x10 = 52
Σxi = x1 + x2 + … + x10 = 21 + 42 + … + 52 = 290

24

SLIDE 25

Central Location: Sample Mean

• “Arithmetic average”
• Traditional measure of central location
• Sum the values and divide by n
• “xbar” refers to the sample mean

$$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$$

25

SLIDE 26

Example: Sample Mean

Ten individuals selected at random have the following ages:

21 42 5 11 30 50 28 27 24 52

Note that n = 10, Σxi = 21 + 42 + … + 52 = 290, and

$$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(290) = 29.$$

[Figure: number line from 10 to 60 with the mean marked at 29]

The sample mean is the gravitational center of a distribution

26

SLIDE 27

Uses of the Sample Mean

The sample mean can be used to predict:

• The value of an observation drawn at random from the sample
• The value of an observation drawn at random from the population
• The population mean

27

SLIDE 28

Population Mean

• Same operation as the sample mean except based on the entire population (N ≡ population size)
• Conceptually important
• Usually not available in practice
• Sometimes referred to as the expected value

$$\mu = \frac{\sum x_i}{N} = \frac{1}{N}\sum x_i$$

28

SLIDE 29

Central Location: Median

• Ordered array:

05 11 21 24 27 28 30 42 50 52

• When n is even, the median is the average of the (n ÷ 2)th and the (n ÷ 2 + 1)th values in the ordered array.
• When n is odd, the median is the ((n + 1) ÷ 2)th value in the ordered array.
• For the illustrative data: n = 10 → the median falls between the 5th and 6th values, 27 and 28, so M = (27 + 28) ÷ 2 = 27.5

05 11 21 24 27 28 30 42 50 52
              ↑
            median
Average the adjacent values: M = 27.5
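The mean and median rules above translate directly into code. The following is a minimal sketch; the function names `sample_mean` and `sample_median` are chosen here for illustration and are not from the slides.

```python
def sample_mean(values):
    """Arithmetic average: sum the values and divide by n."""
    return sum(values) / len(values)

def sample_median(values):
    """Middle value of the ordered array (average of the two middle values when n is even)."""
    ordered = sorted(values)          # values must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n + 1) / 2)th value, 1-based
    return (ordered[mid - 1] + ordered[mid]) / 2   # (n / 2)th and (n / 2 + 1)th values

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
print(sample_mean(ages))     # 29.0
print(sample_median(ages))   # 27.5
```

Note that `sample_median` sorts its input, so an unordered list such as 6, 2, 4 still gives the correct median of 4.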

29

SLIDE 30

More Examples of Medians

• Example A: 2 4 6
  Median = 4
• Example B: 2 4 6 8
  Median = 5 (average of 4 and 6)
• Example C: 6 2 4
  Median ≠ 2 (values must be ordered first)

30

SLIDE 31

The Median is Robust

This data set has a mean of 1600:
1362 1439 1460 1614 1666 1792 1867

Here’s the same data set with a data entry error “outlier” (the last value, 9867). This data set has a mean of 2743:
1362 1439 1460 1614 1666 1792 9867

The median is 1614 in both instances, demonstrating its robustness in the face of outliers.

The median is more resistant to skews and outliers than the mean; it is more robust.
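The robustness claim is easy to verify numerically. The sketch below uses Python's standard `statistics` module on the two data sets from this slide; the variable names are illustrative.

```python
from statistics import mean, median

rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
rates_with_error = [1362, 1439, 1460, 1614, 1666, 1792, 9867]  # last value mistyped

# The mean shifts substantially when one value is corrupted...
mean_clean = mean(rates)
mean_error = mean(rates_with_error)

# ...while the median stays where it was.
median_clean = median(rates)
median_error = median(rates_with_error)

print(mean_clean, round(mean_error), median_clean, median_error)
```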

31

SLIDE 32

Mode

• The mode is the most commonly encountered value in the dataset
• This data set has a mode of 7:
  {4, 7, 7, 7, 8, 8, 9}
• This data set has no mode:
  {4, 6, 7, 8} (each point appears only once)
• The mode is useful only in large data sets with repeating values

32

SLIDE 33

Comparison of Mean, Median, Mode

Note how the mean gets pulled toward the longer tail more than the median:

• mean = median → symmetrical distribution
• mean > median → positive skew
• mean < median → negative skew

33

SLIDE 34

Spread: Quartiles

• Two distributions can be quite different yet can have the same mean
• These data compare particulate matter in air samples (μg/m³) at two sites. Both sites have a mean of 36, but Site 1 exhibits much greater variability. We would miss the high pollution days if we relied solely on the mean.

Site 1| |Site 2
    42|2|
     8|2|
     2|3|234
    86|3|6689
     2|4|0
      |4|
      |5|
      |5|
      |6|
     8|6|
      ×10

34

SLIDE 35

Spread: Range

• Range = maximum – minimum
• Illustrative example:
  Site 1 range = 68 – 22 = 46
  Site 2 range = 40 – 32 = 8
• Beware: the sample range will tend to underestimate the population range.
• Always supplement the range with at least one additional measure of spread

Site 1| |Site 2
    42|2|
     8|2|
     2|3|234
    86|3|6689
     2|4|0
      |4|
      |5|
      |5|
      |6|
     8|6|
      ×10

35

SLIDE 36

Spread: Quartiles

• Quartile 1 (Q1): cuts off the bottom quarter of the data = median of the lower half of the data set
• Quartile 3 (Q3): cuts off the top quarter of the data = median of the upper half of the data set
• Interquartile Range (IQR) = Q3 – Q1; covers the middle 50% of the distribution

05 11 21 24 27 28 30 42 50 52
      ↑       ↑      ↑
      Q1    median   Q3

Q1 = 21, Q3 = 42, and IQR = 42 – 21 = 21

36

SLIDE 37

Quartiles (Tukey’s Hinges) – Example 2

Data are metabolic rates (cal/day), n = 7:

1362 1439 1460 1614 1666 1792 1867
                ↑
              median

When n is odd, include the median in both halves of the data set.
• Bottom half: 1362 1439 1460 1614, which has a median of 1449.5 (Q1)
• Top half: 1614 1666 1792 1867, which has a median of 1729 (Q3)
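The hinge rule (split the ordered data in half, keeping the median in both halves when n is odd) can be sketched as below. The function name `tukey_hinges` is an assumption made for this illustration. Many software packages use other quartile definitions, so their output may differ slightly from these hinges.

```python
from statistics import median

def tukey_hinges(values):
    """Return (Q1, Q3) as medians of the lower and upper halves (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2            # size of each half; includes the median when n is odd
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(upper)

rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]

q1_rates, q3_rates = tukey_hinges(rates)
q1_ages, q3_ages = tukey_hinges(ages)
print(q1_rates, q3_rates)   # hinges for the metabolic rates
print(q1_ages, q3_ages)     # hinges for the ages
```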

37

SLIDE 38

Five-Point Summary

• Q0 (the minimum)
• Q1 (25th percentile)
• Q2 (median)
• Q3 (75th percentile)
• Q4 (the maximum)

38

SLIDE 39

Boxplots

1. Calculate the 5-point summary. Draw a box from Q1 to Q3 with a line at the median.
2. Calculate the IQR and fences as follows:
   FenceLower = Q1 – 1.5(IQR)
   FenceUpper = Q3 + 1.5(IQR)
   Do not draw the fences.
3. Determine if any values lie outside the fences (outside values). If so, plot these separately.
4. Determine the values inside the fences (inside values). Draw a whisker from Q3 to the upper inside value (the largest value not above the upper fence). Draw a whisker from Q1 to the lower inside value (the smallest value not below the lower fence).

39

SLIDE 40

Illustrative Example: Boxplot

Data: 05 11 21 24 27 28 30 42 50 52

1. 5-point summary: {5, 21, 27.5, 42, 52}; box from 21 to 42 with line at 27.5
2. IQR = 42 – 21 = 21
   FU = Q3 + 1.5(IQR) = 42 + (1.5)(21) = 73.5
   FL = Q1 – 1.5(IQR) = 21 – (1.5)(21) = –10.5
3. No values above the upper fence; no values below the lower fence
4. Upper inside value = 52; lower inside value = 5. Draw whiskers.

[Figure: boxplot on an axis from 10 to 60, labeled: upper inside = 52, Q3 = 42, Q2 = 27.5, Q1 = 21, lower inside = 5]

40

SLIDE 41

Illustrative Example: Boxplot 2

Data: 3 21 22 24 25 26 28 29 31 51

1. 5-point summary: 3, 22, 25.5, 29, 51: draw box
2. IQR = 29 – 22 = 7
   FU = Q3 + 1.5(IQR) = 29 + (1.5)(7) = 39.5
   FL = Q1 – 1.5(IQR) = 22 – (1.5)(7) = 11.5
3. One value above the top fence (51); one value below the bottom fence (3)
4. Upper inside value is 31; lower inside value is 21. Draw whiskers.

[Figure: boxplot on an axis from 10 to 60, labeled: outside value (51), upper inside value (31), upper hinge (29), median (25.5), lower hinge (22), lower inside value (21), outside value (3)]
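The four boxplot steps can be checked with a short script. This is a sketch under the same hinge convention used in these slides; the function name `boxplot_summary` and the dictionary keys are assumptions chosen for illustration.

```python
from statistics import median

def boxplot_summary(values, k=1.5):
    """Compute hinges, fences, outside values, and whisker ends (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2                      # median joins both halves when n is odd
    q1 = median(ordered[:half])
    q3 = median(ordered[n - half:])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outside = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "five_point": (ordered[0], q1, median(ordered), q3, ordered[-1]),
        "fences": (lower_fence, upper_fence),
        "outside": outside,                   # plotted as separate points
        "whiskers": (inside[0], inside[-1]),  # lower and upper inside values
    }

data = [3, 21, 22, 24, 25, 26, 28, 29, 31, 51]
summary = boxplot_summary(data)
print(summary)
```

For this data set the fences are 11.5 and 39.5, so 3 and 51 are outside values and the whiskers stop at 21 and 31.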

41

SLIDE 42

Illustrative Example: Boxplot 3

Seven metabolic rates (N = 7): 1362 1439 1460 1614 1666 1792 1867

Data source: Moore, 2000

1. 5-point summary: 1362, 1449.5, 1614, 1729, 1867
2. IQR = 1729 – 1449.5 = 279.5
   FU = Q3 + 1.5(IQR) = 1729 + (1.5)(279.5) = 2148.25
   FL = Q1 – 1.5(IQR) = 1449.5 – (1.5)(279.5) = 1030.25
3. None outside
4. Whiskers end at 1867 and 1362

[Figure: boxplot of the metabolic rates on an axis from 1300 to 1900]

42

SLIDE 43

Boxplots: Interpretation

• Location
  - Position of median
  - Position of box
• Spread
  - Hinge-spread (IQR)
  - Whisker-to-whisker spread
  - Range
• Shape
  - Symmetry or direction of skew
  - Long whiskers (tails) indicate leptokurtosis (long tails?)

43

SLIDE 44

Side-by-side boxplots

Boxplots are especially useful when comparing groups

44

SLIDE 45

Spread: Standard Deviation

• Most common descriptive measure of spread
• Based on deviations around the mean
• This figure demonstrates the deviations of two of its values

This data set has a mean of 36. The data point 33 has a deviation of 33 – 36 = −3. The data point 40 has a deviation of 40 – 36 = 4.

45

SLIDE 46

Variance and Standard Deviation

Deviation = $x_i - \bar{x}$

Sum of squared deviations = $SS = \sum (x_i - \bar{x})^2$

Sample variance = $s^2 = \dfrac{SS}{n - 1}$

Sample standard deviation = $s = \sqrt{s^2}$

46

SLIDE 47

Standard deviation (formula)

$$s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$

The term $\sum (x_i - \bar{x})^2$ is the Sum of Squares (SS).

Sample standard deviation s is the estimator of population standard deviation σ.

47

SLIDE 48

Illustrative Example: Standard Deviation

Observation (xi) | Deviation (xi − xbar) | Squared deviation (xi − xbar)²
-----------------+-----------------------+-------------------------------
36               | 36 − 36 = 0           | 0² = 0
38               | 38 − 36 = 2           | 2² = 4
39               | 39 − 36 = 3           | 3² = 9
40               | 40 − 36 = 4           | 4² = 16
36               | 36 − 36 = 0           | 0² = 0
34               | 34 − 36 = −2          | (−2)² = 4
33               | 33 − 36 = −3          | (−3)² = 9
32               | 32 − 36 = −4          | (−4)² = 16
SUMS             | 0*                    | SS = 58

* Sum of deviations always equals zero

48

SLIDE 49

Illustrative Example (cont.)

Sample variance (s²):

$$s^2 = \frac{SS}{n - 1} = \frac{58}{8 - 1} = 8.286\ (\mu\text{g/m}^3)^2$$

Standard deviation (s):

$$s = \sqrt{s^2} = \sqrt{8.286} = 2.88\ \mu\text{g/m}^3$$
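The worked example can be reproduced step by step: deviations, sum of squares, variance, then standard deviation. The sketch below uses the eight observations from the deviation table (mean 36); the variable names are illustrative.

```python
import math

site2 = [36, 38, 39, 40, 36, 34, 33, 32]   # particulate matter, in μg/m³

n = len(site2)
mean = sum(site2) / n
deviations = [x - mean for x in site2]      # x_i - xbar
ss = sum(d ** 2 for d in deviations)        # sum of squared deviations (SS)
variance = ss / (n - 1)                     # sample variance s², in (μg/m³)²
sd = math.sqrt(variance)                    # sample standard deviation s, in μg/m³

print(f"mean={mean}, SS={ss}, s2={variance:.3f}, s={sd:.2f}")
```

Dividing by n − 1 rather than n is what makes this the sample variance; `statistics.variance` and `statistics.stdev` in the standard library follow the same convention.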

49

SLIDE 50

Interpretation of Standard Deviation

• Measure spread (e.g., if group 1 has s1 = 15 and group 2 has s2 = 10, group 1 has more spread, i.e., variability)
• 68-95-99.7 rule (next slide)
• Chebychev’s rule (two slides hence)

50

SLIDE 51

68-95-99.7 Rule

Normal Distributions Only!

• 68% of data in the range μ ± σ
• 95% of data in the range μ ± 2σ
• 99.7% of data in the range μ ± 3σ
• Example. Suppose a variable has a Normal distribution with μ = 30 and σ = 10. Then:
  - 68% of values are between 30 ± 10 = 20 to 40
  - 95% are between 30 ± (2)(10) = 30 ± 20 = 10 to 50
  - 99.7% are between 30 ± (3)(10) = 30 ± 30 = 0 to 60

51

SLIDE 52

Chebychev’s Rule: All Distributions

• Chebychev’s rule says that at least 75% of the values will fall in the range μ ± 2σ (for any shaped distribution)
• Example: A distribution with μ = 30 and σ = 10 has at least 75% of the values in the range 30 ± (2)(10) = 10 to 50
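Both rules can be compared numerically. The sketch below relies on two standard results that are not spelled out in the slides: for a Normal distribution the proportion within k standard deviations of the mean is erf(k/√2), and Chebychev's general bound is at least 1 − 1/k² for any k > 1 (the 75% figure is the k = 2 case). The function names are assumptions chosen for illustration.

```python
import math

def normal_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def chebychev_lower_bound(k):
    """Minimum proportion within k standard deviations for any distribution (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2

mu, sigma = 30, 10
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    print(f"{low} to {high}: Normal {100 * normal_within(k):.1f}%")

print(f"Any distribution, 10 to 50: at least {100 * chebychev_lower_bound(2):.0f}%")
```

The Normal figures (about 68%, 95%, and 99.7%) are exact properties of that one distribution, whereas Chebychev's 75% is only a guaranteed minimum, which is why it is so much lower.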

52

SLIDE 53

Rules for Rounding

• Carry at least four significant digits during calculations.
• Round at the last step of the operation
• Always report units

Always use common sense and good judgment.

53

SLIDE 54

Choosing Summary Statistics

• Always report a measure of central location, a measure of spread, and the sample size
• Symmetrical mound-shaped distributions → report mean and standard deviation
• Odd shaped distributions → report 5-point summaries (or median and IQR)

54

SLIDE 55

Software and Calculators

Use software and calculators to check work.

55

SLIDE 56

Excel Data Analysis ToolPak

• Data set: Boxplot.xlsx
• Summary statistics using Excel

56

SLIDE 57

Excel Data Analysis ToolPak

• Get summary statistics using Excel

57

SLIDE 58

Boxplot using Excel

• YouTube link for drawing a boxplot in Excel:
  http://www.youtube.com/watch?v=s8ZW4PVarwE&feature=related

[Figure: side-by-side boxplots of female age and male age on an axis from 10 to 70]

SLIDE 59

Boxplot and summary statistics by JMP

• Data set: Boxplot.jmp
• What do you say about the comparison of distributions of ages for females and males?

59

SLIDE 60

60

SLIDE 61

Exercise

• Surgical times. Durations of surgeries (hours) for 15 patients receiving artificial hearts are shown here. Create a stemplot of these data. Describe the distribution. Are there any outliers? What is the standard deviation of this data set? Draw a boxplot based on this data set.

7.0 6.5 3.5 3.1 2.8 2.5 3.8 2.6 2.4 2.1 1.8 2.3 3.1 3.0 2.5

• Data sets: Presentation2_Exercise.xlsx, Presentation2_exercise.jmp

61