Chapter 4
Numerical Methods for Describing Data
1
Population characteristic - a fixed value about a population; typically unknown
Suppose we want to know the MEAN length of all the fish in Lake Lewisville.
Is this a value that is known? Can we find it out? At any given point in time, how many values are there for the mean length of fish in the lake?
2
Statistic - a value computed from a sample
Suppose we want to know the MEAN length of all the fish in Lake Lewisville. What can we do to estimate this unknown population characteristic?
3
Measures of Central Tendency
Mode - the observation that occurs the most often
– Can be more than one mode
– If all values occur only once – there is no mode
– Not used as often as mean & median
4
Measures of Central Tendency
Median - the middle value of the data; it divides the observations in half
To find: list the observations in numerical order
sample median = the single middle value, if n is odd
sample median = the average of the two middle values, if n is even
Where n = sample size
5
Suppose we catch a sample of 5 fish from the lake. The lengths of the fish (in inches) are listed below. Find the median length of fish.
3 4 5 8 10
The numbers are in order & n is odd – so find the middle observation. The median length of fish is 5 inches.
6
Suppose we caught a sample of 6 fish from the lake. Find the median length of fish.
3 4 5 6 8 10
The numbers are in order & n is even – so find the middle two observations (5 and 6).
Now, average these two values: (5 + 6)/2 = 5.5
The median length is 5.5 inches.
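The odd/even rule for the median can be sketched in a few lines of Python. This is only an illustration, not part of the original slides; the function name `median` and the variable names are chosen here for the example.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)      # list the observations in numerical order
    n = len(ordered)              # n = sample size
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:                # n is odd: single middle value
        return ordered[mid]
    # n is even: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


five_fish = [3, 4, 5, 8, 10]
six_fish = [3, 4, 5, 6, 8, 10]
print(median(five_fish))   # 5
print(median(six_fish))    # 5.5
```

Sorting inside the function means the observations do not have to be entered in order.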
7
Measures of Central Tendency
Mean is the arithmetic average.
– Use $\mu$ to represent a population mean (a population characteristic)
– Use $\bar{x}$ to represent a sample mean (a statistic)
Formula: $\bar{x} = \frac{\sum x}{n}$
$\Sigma$ is the capital Greek letter sigma – it means to sum the values that follow
$\mu$ is the lower case Greek letter mu
8
Suppose we caught a sample of 6 fish from the lake. Find the mean length of the fish.
3 4 5 6 8 10
To find the mean length of fish - add the observations and divide by n.
$\bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 10}{6} = \frac{36}{6} = 6$ inches
9
Now find how each observation deviates from the mean.
This is the deviation from the mean. For the first fish: 3 - 6 = -3
Find the rest of the deviations from the mean.
x      $(x - \bar{x})$
3      -3
4      -2
5      -1
6       0
8       2
10      4
Sum     0
What is the sum of the deviations from the mean? 0
Will this sum always equal zero? YES
The mean is considered the balance point of the distribution because it “balances” the positive and negative deviations.
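A short Python sketch can confirm that the deviations from the mean cancel out for the six fish. The helper name `deviations_from_mean` is made up for this illustration.

```python
def deviations_from_mean(values):
    """Return the mean and the list of deviations (x - mean)."""
    mean = sum(values) / len(values)
    return mean, [x - mean for x in values]


lengths = [3, 4, 5, 6, 8, 10]
mean, deviations = deviations_from_mean(lengths)
print(mean)             # 6.0
print(deviations)       # [-3.0, -2.0, -1.0, 0.0, 2.0, 4.0]
print(sum(deviations))  # 0.0
```

With other data the computed sum may come out as a tiny nonzero number because of floating-point rounding; mathematically it is always zero.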
10
Imagine a ruler with pennies placed at 3”, 4”, 5”, 6”, 8” and 10”.
To balance the ruler on your finger, you would need to place your finger at the mean
The mean is the balance point of a distribution
11
What happens to the median & mean if the length of 10 inches was 15 inches?
3 4 5 6 8 15
The median is . . . 5.5
The mean is . . . $\bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 15}{6} = \frac{41}{6} \approx 6.833$
What happened?
12
What happens to the median & mean if the 15 inches was 20?
3 4 5 6 8 20
The median is . . . 5.5
The mean is . . . $\bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 20}{6} = \frac{46}{6} \approx 7.667$
What happened?
13
Resistant - describes statistics that are not affected by extreme values
Is the median affected by extreme values? NO
Is the mean affected by extreme values? YES
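To see resistance in action, the three versions of the fish sample can be compared side by side. This sketch uses Python's standard `statistics` module; the dictionary labels are invented for the example.

```python
from statistics import mean, median

# The same sample with the largest fish changed from 10 to 15 to 20 inches
samples = {
    "largest = 10": [3, 4, 5, 6, 8, 10],
    "largest = 15": [3, 4, 5, 6, 8, 15],
    "largest = 20": [3, 4, 5, 6, 8, 20],
}

# Store (median, mean rounded to 3 decimal places) for each version
summary = {
    label: (median(data), round(mean(data), 3))
    for label, data in samples.items()
}

for label, (med, avg) in summary.items():
    print(f"{label}: median = {med}, mean = {avg}")
```

The median stays at 5.5 in every version, while the mean follows the extreme value upward.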
14
Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.)
3 5 6 10 6 7 7 8 4 5 6 4 7 5 9 9 8 7 6 8
Calculate the mean and median.
Mean = 6.5 Median = 6.5
Look at the placement of the mean and median in this symmetrical distribution.
15
Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.)
3 5 6 10 15 7 3 3 4 5 6 4 12 5 3 4 8 13 11 9
Calculate the mean and median.
Mean = 6.8 Median = 5.5
Look at the placement of the mean and median in this skewed distribution.
16
Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.)
3 5 6 10 10 7 10 8 9 5 6 4 9 10 9 9 10 7 10 8
Calculate the mean and median.
Mean = 7.75 Median = 8.5
Look at the placement of the mean and median in this skewed distribution.
17
Recap:
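– In a symmetrical distribution, the mean and median are equal.
– In a skewed distribution, the mean is pulled in the direction of the skewness.
– In a symmetrical distribution, you should report the mean!
– In a skewed distribution, the median should be reported as the measure of center!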
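The three samples of 20 fish can be checked with a few lines of Python. The variable names below (including the labels "right" and "left" for the direction of skew) are this example's own; the slides only call the last two samples skewed.

```python
from statistics import mean, median

symmetrical = [3, 5, 6, 10, 6, 7, 7, 8, 4, 5, 6, 4, 7, 5, 9, 9, 8, 7, 6, 8]
skewed_right = [3, 5, 6, 10, 15, 7, 3, 3, 4, 5, 6, 4, 12, 5, 3, 4, 8, 13, 11, 9]
skewed_left = [3, 5, 6, 10, 10, 7, 10, 8, 9, 5, 6, 4, 9, 10, 9, 9, 10, 7, 10, 8]


def center_summary(data):
    """Return (mean, median) so the two can be compared."""
    return mean(data), median(data)


for name, data in [("symmetrical", symmetrical),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    avg, med = center_summary(data)
    print(f"{name}: mean = {avg}, median = {med}")
```

In the sample with a long upper tail the mean (6.8) sits above the median (5.5); in the sample with a long lower tail the mean (7.75) sits below the median (8.5).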
18
Trimmed mean:
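Purpose is to remove outliers from a data set
To calculate a trimmed mean:
– Multiply the trimming percentage by n to find how many observations to remove
– Remove that many observations from BOTH ends of the distribution (when listed in order)
– Calculate the mean of the remaining data set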
19
Find the mean of the following set of data.
12 14 19 20 22 24 25 26 26 50
Mean = 23.8
Find a 10% trimmed mean.
10%(10) = 1 So remove one observation from each side!
$\bar{x}_T = \frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = \frac{176}{8} = 22$
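The trimming steps can be sketched in Python as follows. The function name `trimmed_mean` is hypothetical, and rounding the number of trimmed observations down when the percentage of n is not a whole number is an assumption made for this sketch; the slides only cover the case where it comes out even.

```python
def trimmed_mean(values, trim_percent):
    """Mean after removing trim_percent of the observations from EACH end."""
    ordered = sorted(values)                 # list the data in order
    n = len(ordered)
    k = int(n * trim_percent / 100)          # assumption: round down
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:n - k]                  # drop k from both ends
    return sum(kept) / len(kept)


data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]
print(trimmed_mean(data, 0))    # 23.8 (ordinary mean)
print(trimmed_mean(data, 10))   # 22.0 (10% trimmed mean)
```

Removing the 12 and the 50 pulls the mean from 23.8 down to 22, which shows how much the single large value was influencing it.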
20
What values are used to describe categorical data?
Suppose that each person in a sample of 15 cell phone users is asked if he or she is satisfied with the cell phone service. What would be the possible responses?
Here are the responses: Y N Y Y Y N N Y Y N Y Y Y N N
Find the sample proportion of the people who answered “yes”:
$\hat{p} = \frac{\text{number of successes}}{n}$ (pronounced p-hat)
$\hat{p} = \frac{9}{15} = 0.6$
60% of the sample was satisfied with their cell phone service.
The population proportion is denoted by the letter p.
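Counting successes and dividing by n is easy to check in Python. The function name `sample_proportion` and its `success` parameter are invented for this illustration.

```python
def sample_proportion(responses, success="Y"):
    """Number of successes divided by the sample size n."""
    return responses.count(success) / len(responses)


responses = "Y N Y Y Y N N Y Y N Y Y Y N N".split()
p_hat = sample_proportion(responses)
print(len(responses))   # 15
print(p_hat)            # 0.6
```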
21
Why is the study of variability important?
– Variability lets us distinguish usual values from unusual values
– A measure of center alone doesn’t provide a complete picture of the distribution. Does this can of soda contain exactly 12 ounces?
22
[Figure: three data sets plotted on the same scale, from 20 to 70]
Notice that these three data sets all have the same mean and median (at 45), but they have very different amounts of variability.
23
Measures of Variability
The simplest numeric measure of variability is range. Range = largest observation – smallest observation
[Figure: the same three data sets, plotted on a scale from 20 to 70]
The first two data sets have a range of 50 (70 - 20), but the third data set has a much smaller range.
24
Measures of Variability
Another measure of the variability in a data set uses the deviations from the mean $(x - \bar{x})$.
Remember the sample of 6 fish that we caught from the lake . . . They were the following lengths: 3”, 4”, 5”, 6”, 8”, 10”
The mean length was 6 inches. Recall that we calculated the deviations from the mean. What was the sum of these deviations?
Can we find an average deviation? What can we do to the deviations so that we could find an average?
The estimated average of the deviations squared is called the variance.
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
The denominator, n – 1, is the degrees of freedom.
Population variance is denoted by $\sigma^2$.
25
When calculating sample variance, we use degrees of freedom (n – 1) in the denominator instead of n because this tends to produce better estimates. Degrees of freedom will be revisited in Chapter 8.
Suppose that everyone in the class caught a sample of 6 fish from the lake. Would each of our samples contain the same fish? Would our mean lengths be the same? The samples would also have different ranges!
26
Remember the sample of 6 fish that we caught from the lake . . . Find the variance of the length of fish.
The average of the deviations would always equal 0! First square the deviations.
x      $(x - \bar{x})$   $(x - \bar{x})^2$
3      -3      9
4      -2      4
5      -1      1
6       0      0
8       2      4
10      4      16
Sum     0      34
What is the sum of the deviations squared? 34
Divide this by 5.
$s^2 = \frac{34}{5} = 6.8$
27
Measures of Variability
The square root of variance is called standard deviation.
A typical deviation from the mean is the standard deviation.
$s^2 = 6.8$ inches$^2$, so $s = \sqrt{6.8} \approx 2.608$ inches
The fish in our sample deviate from the mean of 6 by an average of 2.608 inches.
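The variance and standard deviation of the fish sample can be reproduced with a short Python sketch. The function names are chosen for this example; the calculation squares the deviations, sums them, divides by n - 1, and takes the square root.

```python
import math


def sample_variance(values):
    """Sum of squared deviations divided by the degrees of freedom (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


lengths = [3, 4, 5, 6, 8, 10]
print(sample_variance(lengths))            # 6.8
print(round(sample_std_dev(lengths), 3))   # 2.608
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which also divide by n - 1.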
28
Calculation of standard deviation of a sample
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population standard deviation is denoted by $\sigma$ (where n is used in the denominator instead of n – 1).
The most commonly used measures of center and variability are the mean and standard deviation, respectively.
29
Measures of Variability
Interquartile range (iqr) is the range of the middle half of the data.
Lower quartile (Q1) is the median of the lower half of the data
Upper quartile (Q3) is the median of the upper half of the data
iqr = Q3 – Q1
What advantage does the interquartile range have over the standard deviation?
The iqr is resistant to extreme values
30
The Chronicle of Higher Education (2009-2010 issue) published the accompanying data on the percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia.
21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20 27 35 38 25 31 19 24 27 27 23 34 25 32 26 24 22 28 26 30 23 25 22 25 29 33 34 30 17 25 23 34 26
Find the interquartile range for this set of data.
31
21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20 27 35 38 25 31 19 24 27 27 23 34 25 32 26 24 22 28 26 30 23 25 22 25 29 33 34 30 17 25 23 34 26
First put the data in order & find the median.
17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47
Median = 26
Find the lower quartile (Q1) by finding the median of the lower half. Q1 = 24
Find the upper quartile (Q3) by finding the median of the upper half. Q3 = 30
iqr = 30 – 24 = 6
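The quartile steps can be sketched in Python. This follows the "median of each half" definition used here and, as an assumption of the sketch, leaves the median itself out of both halves when n is odd. Library routines such as `statistics.quantiles` use interpolation rules that can give slightly different quartiles.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # observations below the median position
    upper = ordered[half + (n % 2):]   # observations above the median position
    return median(lower), median(ordered), median(upper)


percentages = [21, 27, 26, 19, 30, 35, 35, 26, 47, 26, 27, 30, 24, 29, 22, 24, 29,
               20, 20, 27, 35, 38, 25, 31, 19, 24, 27, 27, 23, 34, 25, 32, 26, 24,
               22, 28, 26, 30, 23, 25, 22, 25, 29, 33, 34, 30, 17, 25, 23, 34, 26]

q1, med, q3 = quartiles(percentages)
iqr = q3 - q1
print(q1, med, q3, iqr)   # 24 26 30 6
```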
32
Another graph - Boxplots
What are some advantages of boxplots?
– . . . histograms)
– Used with moderate to large data sets (n > 10)
33
Boxplots
When to Use
– Univariate numerical data
– Use for moderate to large data sets; avoid with data sets of n < 10
How to construct a Skeleton Boxplot
– Calculate the five-number summary (the minimum value, first quartile, median, third quartile, and maximum value)
– Draw a horizontal (or vertical) scale
– Construct a rectangular box from the lower quartile (Q1) to the upper quartile (Q3)
– Draw a line in the box at the median
– Draw lines from the lower quartile to the smallest observation and from the upper quartile to the largest observation
To describe
– comment on the center, spread, and shape of the distribution and if there are any unusual features
34
Remember the data on the percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47
[Figure: skeleton boxplot drawn on a Percentages scale from 10 to 50]
First draw a scale
Draw a box from Q1 to Q3
Draw a line for the median
Draw lines for the whiskers
35
Modified boxplots
To display outliers:
An observation is an outlier if it is more than 1.5(iqr) away from the nearest quartile. The fences for outliers are at
$Q_1 - 1.5(iqr)$ and $Q_3 + 1.5(iqr)$
An outlier is extreme if it is more than 3(iqr) away from the nearest quartile. The fences for extreme outliers are at
$Q_1 - 3(iqr)$ and $Q_3 + 3(iqr)$
Modified boxplots are generally preferred because they provide more information about the data distribution.
36
Remember the data on the percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47
[Figure: modified boxplot drawn on a Percentages scale from 10 to 50]
First, draw the scale, box and the line for the median.
Next calculate the fences for outliers.
24 - 1.5(6) = 15
30 + 1.5(6) = 39
30 + 3(6) = 48
There is one outlier at the upper end of the distribution, but none at the lower end. Is it extreme? No: 47 is beyond 39 but not beyond 48.
Place a solid dot for the outlier.
Draw lines for the whiskers.
To describe: The distribution of percent of the population with a bachelor’s degree or higher for the U.S. states and District of Columbia is positively skewed with an outlier at 47%. The median percentage is at 26% with a range of 30%.
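The fence calculations can be written as a small Python sketch. The function names and the wording of the returned labels are invented for this illustration; the cutoffs are the 1.5(iqr) and 3(iqr) rules for modified boxplots.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) fences for outliers and for extreme outliers."""
    iqr = q3 - q1
    return {
        "outlier": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(value, q1, q3):
    """Label one observation using the fences."""
    fences = outlier_fences(q1, q3)
    low_extreme, high_extreme = fences["extreme"]
    low_outlier, high_outlier = fences["outlier"]
    if value < low_extreme or value > high_extreme:
        return "extreme outlier"
    if value < low_outlier or value > high_outlier:
        return "outlier"
    return "not an outlier"


fences = outlier_fences(24, 30)
print(fences["outlier"])      # (15.0, 39.0)
print(fences["extreme"])      # (6.0, 48.0)
print(classify(47, 24, 30))   # outlier
print(classify(17, 24, 30))   # not an outlier
```

The smallest value, 17, is above the lower fence of 15, so the lower end has no outliers; 47 is past 39 but short of 48, so it is an outlier that is not extreme.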
37
Symmetrical boxplots
Notice that all 3 boxplots are identical, but their corresponding histograms are very different. . . . determine the number . . . boxplot?
Approximately symmetrical boxplot
Notice that the range of the lower half and the range of the upper half of this distribution are approximately equal, so we can say that it is approximately symmetrical.
Skewed boxplot
However, the ranges of the two halves of this distribution are definitely different sizes, so it would be skewed in the direction of the longer half.
38
The 2009-2010 salaries of NBA players published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams. Discuss the similarities and differences.
39
Interpreting Center & Variability
Chebyshev’s Rule –
The percentage of observations that are within k standard deviations of the mean is at least
$100\left(1 - \frac{1}{k^2}\right)\%$, where k > 1
If k = 2, then at least 75% of the observations are within 2 standard deviations of the mean.
$100\left(1 - \frac{1}{2^2}\right)\% = 75\%$
This rule can be used with any distribution – no matter its shape!
40
For a sample of families with one preschool child, it was reported that the mean child care time per week was approximately 36 hours with a standard deviation of approximately 12 hours.
Using Chebyshev’s rule, at least 75% of the sample observations must be between 12 and 60 hours (within 2 standard deviations of the mean).
At most, what percent of the observations are above 72 hours?
At least 89% of the observations are between 0 & 72 hours (within 3 standard deviations of the mean). Since time can’t be negative, at most 11% of the observations are above 72 hours.
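Chebyshev's rule is simple enough to check with a few lines of Python. The function names here are made up for the sketch; the child care figures (mean 36 hours, standard deviation 12 hours) come from the example.

```python
def chebyshev_min_percent(k):
    """At least this percent of observations lie within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 100 * (1 - 1 / k ** 2)


def interval(mean, sd, k):
    """Endpoints that are k standard deviations either side of the mean."""
    return mean - k * sd, mean + k * sd


for k in (2, 3):
    low, high = interval(36, 12, k)
    percent = chebyshev_min_percent(k)
    print(f"k = {k}: at least {percent:.1f}% between {low} and {high} hours")
```

For k = 3 the bound is about 88.9%, which the example rounds to 89%, leaving at most about 11% of the observations outside 0 to 72 hours.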
41
What’s my area?
Input the following command into a graphing calculator in order to graph a normal curve with a mean of 20 and standard deviation of 3.
Y1 = normalpdf(X,20,3) (Window x: [10,30] y: [0,0.2])
Use the command 2nd trace, 7 to find the area under the curve for the following: (Round to 3 decimal places.)
Lower limit: 17 Upper limit: 23 Area: ________
Lower limit: 14 Upper limit: 26 Area: ________
Lower limit: 11 Upper limit: 29 Area: ________
42
What’s my area?
Graph a normal curve with a mean of 50 and standard deviation of 5.
Y1 = normalpdf(X,50,5) (x: [30,70] y: [0,0.1])
Find the area under the curve for the following:
Lower limit: 45 Upper limit: 55 Area: ________
Lower limit: 40 Upper limit: 60 Area: ________
Lower limit: 35 Upper limit: 65 Area: ________
What pattern do you notice?
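For readers without a graphing calculator, the same areas can be approximated in Python. This sketch swaps the calculator's numerical integration for the normal cumulative distribution written with `math.erf` from the standard library; the function names are chosen for the example.

```python
import math


def normal_cdf(x, mean, sd):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


def area_between(lower, upper, mean, sd):
    """Area under the normal curve between two limits."""
    return normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)


def areas_within_k_sd(mean, sd, ks=(1, 2, 3)):
    """Areas within k standard deviations of the mean, rounded to 3 places."""
    return [round(area_between(mean - k * sd, mean + k * sd, mean, sd), 3)
            for k in ks]


print(areas_within_k_sd(20, 3))   # limits 17-23, 14-26, 11-29
print(areas_within_k_sd(50, 5))   # limits 45-55, 40-60, 35-65
```

Both calls print the same three areas, because each pair of limits sits 1, 2, and 3 standard deviations from the mean regardless of which normal curve is used.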
43
Interpreting Center & Variability
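Empirical Rule -
– Approximately 68% of the observations are within 1 standard deviation of the mean
– Approximately 95% of the observations are within 2 standard deviations of the mean
– Approximately 99.7% of the observations are within 3 standard deviations of the mean
Can ONLY be used with distributions that are mound shaped!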
44
The height of male students at PWSH is approximately normally distributed with a mean of 71 inches and standard deviation of 2.5 inches.
a) What percent of the male students are shorter than 66 inches? About 2.5%
b) Taller than 73.5 inches? About 16%
c) Between 66 & 73.5 inches? About 81.5%
45
Measures of Relative Standing
Z-score
A z-score tells us how many standard deviations the value is from the mean.
$z\text{-score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$
A z-score is one example of a standardized score.
46
What do these z-scores mean?
z = -2.3: 2.3 standard deviations below the mean
z = 1.8: 1.8 standard deviations above the mean
z = -4.3: 4.3 standard deviations below the mean
47
Sally is taking two different math achievement tests with different means and standard deviations. The mean score on test A was 56 with a standard deviation of 3.5, while the mean score on test B was 65 with a standard deviation of 2.8. Sally scored a 62 on test A and a 69 on test B. On which test did Sally score the best?
Z-score on test A: $z = \frac{62 - 56}{3.5} = 1.714$
Z-score on test B: $z = \frac{69 - 65}{2.8} = 1.429$
She did better on test A.
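Sally's comparison can be reproduced with a short Python sketch. The function name `z_score` is chosen for the example; the scores, means, and standard deviations are the ones used in the two calculations.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd


z_a = z_score(62, 56, 3.5)   # test A
z_b = z_score(69, 65, 2.8)   # test B
print(round(z_a, 3))   # 1.714
print(round(z_b, 3))   # 1.429
print("A" if z_a > z_b else "B")   # A
```

Although 69 is the higher raw score, the z-scores put both tests on the same scale, and 62 on test A is further above its mean in standard deviation units.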
48
Measures of Relative Standing
Percentiles
A percentile is a value in the data set where r percent of the observations fall AT or BELOW that value
49
In addition to weight and length, head circumference is another measure of health in newborn babies. The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys.
Percentile                5     10    25    50    75    90    95
Head circumference (cm)   32.2  33.2  34.5  35.8  37.0  38.2  38.6
What percent of newborn boys had head circumferences greater than 37.0 cm? 25%
10% of newborn babies have head circumferences bigger than what value? 38.2 cm
50