STAT 113: Variability
Colin Reimer Dawson, Oberlin College
September 14, 2017
Outline
- Last Time: Shape and Center
- Variability
- Boxplots and the IQR
- Variance and Standard Deviation
- Transformations
Distribution of a Quantitative Variable
The distribution of a quantitative variable is characterized by:
- A. Shape (symmetric, skewed, bimodal, etc.)
- B. Center (mean, median)
- C. Spread (Interquartile Range, Standard Deviation)
- D. Outliers (if any)
Skewness
- A distribution is skewed when the extreme values on one side are more extreme than those on the other.
- We call a distribution right-skewed when the longer “tail” is on the right, and left-skewed when the longer tail is on the left.
Resistance/Robustness
- The mean is strongly affected by skew and by outliers: it is pulled toward the extreme values.
- In these cases, we generally prefer a measure of central tendency that is resistant to the influence of extreme values (also called robust).
- The median is a resistant (robust) measure of center.
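A small sketch of what resistance means in practice. The numbers are made up for illustration (they are not from the slides): one extreme value drags the mean a long way but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical data: five typical values, then the same data
# with the largest value replaced by an extreme one.
typical = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 96]

print(mean(typical), median(typical))            # 4 4
print(mean(with_outlier), median(with_outlier))  # 22 4
```

The mean jumps from 4 to 22, while the median stays at 4.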
Measures of Variability
- We want to quantify the consistency, or lack thereof, of the data.
- A general term for “lack of consistency” is variability.
- We will look at:
  - Range
  - Interquartile Range
  - Variance / Standard Deviation
The Range
The range is easy to compute, but not very reliable.
Figure: Historical Annual Returns for Two Hypothetical Index Funds (panels: Fund C1, Fund C2)
Figure: Annual Returns for 3 random samples of 5 years (panels: Fund E (Full Data Set), Fund Sample 1, Fund Sample 2, Fund Sample 3)
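To see why the range is unreliable for samples, here is a minimal sketch that draws three random samples of 5 values and compares their ranges with the range of the full data. The returns below are hypothetical, not the data behind the figures, and `value_range` is a helper name introduced only for this example.

```python
import random

def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical annual returns (percent); not the data behind the figures.
returns = [-9.5, -4.0, -1.2, 0.5, 2.1, 3.3, 4.8, 5.0, 6.7, 7.4, 9.9, 14.6]

rng = random.Random(113)  # fixed seed so the sketch is repeatable
sample_ranges = [value_range(rng.sample(returns, 5)) for _ in range(3)]

print("full data range:", value_range(returns))
print("ranges of three samples of 5:", sample_ranges)
```

A sample's range can never exceed the range of the full data, and it depends entirely on whether the sample happens to include the most extreme values.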
Robust Measures of Variability
- We’d like a more robust measure of variability, which is not affected so much by extreme values.
- Analogous to the median: describe the “middle” part of the data.
- The idea: find the “middle half” of the data, and then take its range.
- Specifically, exclude the lowest 25% and the highest 25%, and take the difference between the highest and lowest remaining values.
Quartiles
- The median divides the data in two.
- Percentiles divide the data into 100 pieces.
- Quartiles divide the data into four pieces. The kth quartile (written Qk) is the point below which k quarters of the data lie.
- So, in terms of quartiles, the median is Q2, the minimum value is Q0, and the maximum value is Q4.
- We can calculate the range using quartiles as Q4 − Q0.
Figure: Height (in.), with the quartiles Q0, Q1, Q2, Q3, and Q4 marked
The Inter-Quartile Range (IQR)
The Inter-Quartile Range (or IQR) is the distance between the first and third quartiles:
IQR = Q3 − Q1
Pedantic Note
The IQR is a single number, not the two quartiles themselves.
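A minimal sketch of the IQR calculation using Python's standard library. The heights are hypothetical (not the data plotted on the slides), and the quartile rule is an assumption: textbooks and software use slightly different conventions, and `method="inclusive"` is just one common choice.

```python
from statistics import quantiles

# Hypothetical heights in inches (not the data plotted on the slides).
heights = [25, 30, 33, 35, 37, 39, 41, 44, 49]

# Quartile conventions differ between textbooks and software;
# method="inclusive" is one common choice and is an assumption here.
q1, q2, q3 = quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1  # a single number: the width of the middle half

print(q1, q3, iqr)  # 33.0 41.0 8.0
```

Note that `iqr` is one number (8.0), not the pair (33.0, 41.0).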
Figure: Height (in.), with the quartiles Q0 through Q4, the Range, and the IQR marked
The Five-Number Summary
- The quartiles are very natural to report together to describe the center and spread of a distribution.
- Q0 through Q4 collectively form the five-number summary of a quantitative distribution.
Five-Number Summary = (xmin, Q1, Median, Q3, xmax) = (Q0, Q1, Q2, Q3, Q4)
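The five-number summary can be sketched in a few lines. The function name `five_number_summary` and the data are hypothetical, and the quartile rule (`method="inclusive"`) is an assumption, since other conventions give slightly different Q1 and Q3.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (Q0, Q1, Q2, Q3, Q4) = (min, Q1, median, Q3, max)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

heights = [25, 30, 33, 35, 37, 39, 41, 44, 49]  # hypothetical heights (in.)
q0, q1, q2, q3, q4 = five_number_summary(heights)

print(five_number_summary(heights))  # (25, 33.0, 37.0, 41.0, 49)
print("range =", q4 - q0, " IQR =", q3 - q1)
```

Both the range (Q4 − Q0) and the IQR (Q3 − Q1) can be read directly off the summary.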
Box-and-Whisker Plots
From the five-number summary, we construct a graph called a box-and-whisker plot (or just box plot, for short):
1. Draw an axis.
2. Draw a rectangle (box) from Q1 to Q3.
3. Draw a line across the box (or place a dot) at Q2.
4. Draw lines (whiskers) extending outward from the box on both sides to either
   (a) (Simplest version) xmin and xmax.
   (b) (R default) the most extreme data values within 1.5 IQR of the box, that is, no lower than Q1 − 1.5 IQR and no higher than Q3 + 1.5 IQR.
5. In version (b), plot points beyond the whiskers individually.
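Version (b) of the recipe can be sketched as follows. This is an illustration under stated assumptions, not R's exact algorithm: the function name `whiskers_and_outliers` and the data are hypothetical, and the quartile rule (`method="inclusive"`) may differ slightly from the one a given plotting package uses.

```python
from statistics import quantiles

def whiskers_and_outliers(values):
    """Whisker ends and individually plotted points for a 1.5 * IQR box plot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data values inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    # Anything beyond the fences is plotted as an individual point.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return (min(inside), max(inside)), outliers

values = [25, 30, 33, 35, 37, 39, 41, 44, 70]  # hypothetical; 70 is far out
print(whiskers_and_outliers(values))  # ((25, 44), [70])
```

Here Q1 = 33 and Q3 = 41, so the fences sit at 21 and 53; the upper whisker ends at 44 and the value 70 is drawn as a separate point.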
Box-and-Whisker Plot: Version 1
Figure: Height (in.), with Q0 through Q4, the Range, and the IQR marked, together with the corresponding box plot
Box-and-Whisker Plot: Version 2
Figure: Height (in.), with Q0 through Q4, the Range, and the IQR marked, together with the corresponding box plot
Box-and-Whisker Plot: Right Skew
Figure: Density of 2001 Household Income (Thousands of 2016$), with the corresponding box-and-whisker plot
Matching Graphs to Variables
Handout
Deviations
Rather than simply measuring the distance between extremes, we can develop measures based on distance from “center”.
Deviation Scores
For each data point, its deviation score is its “distance” from the mean:
$\text{Deviation}_i = x_i - \bar{x}$, for each $i = 1, \dots, n$
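A short sketch of deviation scores on hypothetical data (chosen so the mean is a whole number; these are not the heights from the slides). It also shows why the raw deviations cannot simply be averaged: they always sum to zero.

```python
from statistics import mean

heights = [25, 30, 33, 35, 37, 39, 41, 44, 49]  # hypothetical heights (in.)

x_bar = mean(heights)                       # 37
deviations = [x - x_bar for x in heights]   # signed distance from the mean

print(deviations)       # [-12, -7, -4, -2, 0, 2, 4, 7, 12]
print(sum(deviations))  # 0: positive and negative deviations cancel
```

Because the deviations cancel, a useful summary has to remove the signs first, for example by squaring.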
Figure: Height (in.), with the mean (36.76) marked and two example deviations shown (Deviation = 6.24 and Deviation = −12.76)
How can we use these for an overall measure of spread?
Variance
- If we square all the deviations from the mean and average them, we get the variance.
The variance, written $s^2$, is the average of the squared deviations from the mean. That is,
$$s^2 = \frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
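The formula can be followed step by step in code. This is a sketch on hypothetical data; the last line compares the hand computation with `statistics.variance`, which also uses the n − 1 denominator.

```python
from statistics import mean, variance

heights = [25, 30, 33, 35, 37, 39, 41, 44, 49]  # hypothetical heights (in.)

x_bar = mean(heights)
squared_devs = [(x - x_bar) ** 2 for x in heights]

# Sum of squared deviations, divided by n - 1 (not n).
s2 = sum(squared_devs) / (len(heights) - 1)

print(sum(squared_devs))  # 426
print(s2)                 # 53.25
print(variance(heights))  # library sample variance, same n - 1 convention
```

The units of `s2` are squared inches, which is why the standard deviation is usually reported instead.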
What’s with that denominator?
- With an average, you’re supposed to divide by the number of things, aren’t you? Why n − 1?
- Usually we are working with a sample, and are interested in estimating the population variability.
- We get no information about variability from the first observation, so there are only n − 1 “degrees of freedom” in the sample.
- Interesting math side fact: the variance (with the n − 1 denominator) equals half the average squared distance between all distinct pairs of data points.
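The pairwise side fact can be checked numerically. In this sketch (hypothetical data), the average squared distance over all distinct pairs comes out to exactly twice the n − 1 variance, so the match requires a factor of one half.

```python
from itertools import combinations
from statistics import variance

heights = [25, 30, 33, 35, 37, 39, 41, 44, 49]  # hypothetical heights (in.)

# Every distinct pair of data points, counted once each.
pairs = list(combinations(heights, 2))
avg_sq_dist = sum((a - b) ** 2 for a, b in pairs) / len(pairs)

print(len(pairs))         # 36 pairs from 9 points
print(avg_sq_dist / 2)    # 53.25
print(variance(heights))  # 53.25
```

No mean appears anywhere in the pairwise calculation, which is one way to see that the variance measures how far the data points are from each other.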
Standard Deviation
- Variance ($s^2$) is in squared units relative to the data.
- No problem: just take the square root.
$s = \sqrt{s^2}$ is the standard deviation:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
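A brief sketch showing that the standard deviation is just the square root of the sample variance, using hypothetical data and the standard library's `stdev` for comparison.

```python
from math import sqrt
from statistics import stdev, variance

heights = [25, 30, 33, 35, 37, 39, 41, 44, 49]  # hypothetical heights (in.)

s2 = variance(heights)  # squared inches
s = sqrt(s2)            # back in inches

print(round(s, 2))                # 7.3
print(round(stdev(heights), 2))   # same value from the library function
```

Since `s` is in the original units, it can be read as a typical distance of an observation from the mean.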
Same range, different s
Figure: Fund C1 (s = 18.2) and Fund C2 (s = 8.1)
The standard deviation uses all the data.
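The same point can be reproduced with made-up numbers. The two lists below are hypothetical and are not the Fund C1 and Fund C2 data, so the standard deviations will not match the values in the figure; they only illustrate two data sets with an identical range but very different spread.

```python
from statistics import stdev

# Hypothetical annual returns (percent); both run from -20 to 30.
fund_a = [-20, -20, -20, 30, 30, 30]  # values piled up at the extremes
fund_b = [-20, 3, 5, 5, 7, 30]        # values mostly near the middle

for name, returns in [("A", fund_a), ("B", fund_b)]:
    value_range = max(returns) - min(returns)
    print(name, "range =", value_range, " s =", round(stdev(returns), 1))
```

The range sees only the two most extreme values, while the standard deviation responds to where every observation sits.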
Outliers
- Skewness can be an important feature of a distribution.
- But sometimes a few unusual data points make an otherwise “well-behaved” distribution look skewed/multimodal.
- When not part of the overall pattern, these are called outliers.
  - Sometimes they reflect measurement errors (e.g., a misplaced decimal)
  - Sometimes they represent genuinely unusual observations
On-Base Percentage
A common statistic for batters in baseball is On-Base Percentage.
Figure: Distribution of On-Base Percentage for major-league hitters with at least 100 Plate Appearances in 2002 (Skewness = 0.630; Barry Bonds labeled)
Distribution without Bonds
Figure: Distribution of On-Base Percentage without Bonds (Skewness = 0.199)
Visualizing Outliers
Figure: Two plots of On-Base Percentage
Problems with $s$ and $s^2$
- These measures, even more than the mean itself, are heavily influenced by extreme values.
Figure: Density of 2001 Household Income (Thousands of 2016$)
Figure: Density of 2001 Household Income (Thousands of 2016$) in three panels (full data, Top 0.1% Excluded, Top 1% Excluded), shown first on an axis running to 2000 and then on an axis running to 300
Variance-Stabilizing Transformations
- The mean and standard deviation are unstable in the presence of skew.
- However, they have such useful properties otherwise that it is often better to try to “remove” skew, rather than fall back on other measures.
- The most common way to remove skew is by a nonlinear transformation of the underlying scale.
- Take the original variable, x, and define a new variable y = f(x), where f is a one-to-one function.
- Most common case: right-skewed data with positive values
  - Logarithmic transform (take y = log(x))
  - Square root (take y = √x)
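A minimal sketch of the logarithmic transform on right-skewed data. The incomes are invented for illustration, and `rough_skew` is a hypothetical helper: it reports how far the mean sits above the median, in standard deviation units, which is only a crude indicator of skew and not the skewness statistic quoted elsewhere in these slides.

```python
from math import log10
from statistics import mean, median, stdev

def rough_skew(values):
    """(mean - median) / s: positive when the mean is pulled to the right."""
    return (mean(values) - median(values)) / stdev(values)

# Hypothetical incomes in thousands of dollars, strongly right-skewed.
incomes = [20, 30, 40, 50, 60, 80, 100, 150, 400, 2000]
log_incomes = [log10(x) for x in incomes]  # y = log(x), base 10

print("original scale:", round(rough_skew(incomes), 2))
print("log scale:     ", round(rough_skew(log_incomes), 2))
```

On the log scale the largest value is no longer many standard deviations from the rest, so the mean moves back toward the median. The transform requires positive values, as the list above notes.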
Original vs. Logarithmic Income Distribution:
Figure: Density of 2001 Household Income on the original scale (Thousands of 2016$) and on a logarithmic scale (2016$, axis from 10^2 to 10^6)
Summary
Quantitative Data
Visualizing a quantitative variable
- Dot Plots
- Box-and-Whisker Plots
- Histograms
- Density curves
Describing the distribution of a numeric variable
- Shape (symmetry, skew, modes)
- Center (mean, median)
- Spread (IQR, standard deviation)
- Outliers (if any)
Shape and Center
- A distribution is skewed when the extreme values on one end are more extreme than on the other
- We say that it is skewed in the direction of the more extreme values (e.g., right-skewed if there are a few very large values)
- The mean is the “balance point of the data”, written $\bar{x}$.
- Mean has nice math properties, but is affected by skew
- The median divides the cases in half
- It is resistant to outliers/skewness
Variability
- The range is unstable for a sample, and is extremely vulnerable to outliers/skew
- The Interquartile Range (IQR) is the range of the “middle half” of the data, and is “resistant” (like the median)
- The variance is the “average” of the squared deviations of the observations from the mean
- The standard deviation is the square root of the variance, in order to restore units to the original scale
- Nonlinear transformations (log, square root, etc.) can be used