STAT 113 Variability (Colin Reimer Dawson, Oberlin College)


SLIDE 1

Last Time: Shape and Center | Variability | Variance and Standard Deviation | Transformations

STAT 113 Variability

Colin Reimer Dawson

Oberlin College

September 14, 2017

SLIDE 2

Outline

  • Last Time: Shape and Center
  • Variability
  • Boxplots and the IQR
  • Variance and Standard Deviation
  • Transformations

SLIDE 3

Distribution of a Quantitative Variable

The distribution of a quantitative variable is characterized by:

  • A. Shape (symmetric, skewed, bimodal, etc.)
  • B. Center (mean, median)
  • C. Spread (Interquartile Range, Standard Deviation)
  • D. Outliers (if any)

SLIDE 4

Skewness

  • A distribution is skewed when the extreme values on one side are more extreme than those on the other.
  • We call a distribution right-skewed when the longer “tail” is on the right, and left-skewed when the longer tail is on the left.

SLIDE 5

Distribution of a Quantitative Variable

The distribution of a numeric variable is characterized by:

  • A. Shape (symmetric, skewed, bimodal, etc.)
  • B. Center (mean, median)
  • C. Spread (Interquartile Range, Standard Deviation)
  • D. Outliers (if any)

SLIDE 6

Resistance/Robustness

  • The mean is strongly affected by skew and by outliers.
  • The mean is pulled toward the extreme values.
  • In these cases, we generally prefer a measure of central tendency which is resistant to the influence of extreme values (also called robust).
  • The median is a resistant/robust measure of center.
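A minimal sketch of resistance, using made-up income values (not data from the lecture): adding one extreme value drags the mean a long way but barely moves the median.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars (illustrative values only).
incomes = [30, 35, 40, 45, 50]
with_extreme = incomes + [1000]  # same data plus one extreme value

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_extreme), median(with_extreme)

print("mean:  ", mean_before, "->", mean_after)      # 40 -> 200
print("median:", median_before, "->", median_after)  # 40 -> 42.5
```

The mean is pulled toward the extreme value, while the median only shifts to the midpoint of the two middle observations.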

SLIDE 8

Distribution of a Quantitative Variable

The distribution of a numeric variable is characterized by:

  • A. Shape (symmetric, skewed, bimodal, etc.)
  • B. Center (mean, median)
  • C. Spread (Interquartile Range, Standard Deviation)
  • D. Outliers (if any)

SLIDE 9

Measures of Variability

  • We want to quantify the consistency, or lack thereof, of the data.
  • A general term for “lack of consistency” is variability.
  • We will look at:
      • Range
      • Interquartile Range
      • Variance / Standard Deviation

SLIDE 10

The Range

The range is easy to compute, but not very reliable.

Figure: Historical Annual Returns for Two Hypothetical Index Funds (panels: Fund C1, Fund C2)

SLIDE 11

The Range

The range is easy to compute, but not very reliable.

Figure: Annual Returns for 3 random samples of 5 years (panels: Fund E (Full Data Set), Sample 1, Sample 2, Sample 3)
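The instability of the range under sampling can be sketched as follows. The "full data set" here is simulated (an assumption for illustration, not the Fund E data), and `value_range` is a hypothetical helper name.

```python
import random

def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

random.seed(113)  # fixed seed so the sketch is repeatable

# Simulated annual returns (percent); purely illustrative.
full_data = [random.gauss(5, 6) for _ in range(30)]

# Three random samples of 5 years each, as in the figure.
samples = [random.sample(full_data, 5) for _ in range(3)]
sample_ranges = [value_range(s) for s in samples]

print("full range:   ", round(value_range(full_data), 1))
print("sample ranges:", [round(r, 1) for r in sample_ranges])
```

A sample's range can never exceed the range of the full data set, and it typically changes from one sample to the next, which is why the range is a shaky estimate of spread.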

SLIDE 13

Robust Measures of Variability

  • We’d like a more robust measure of variability, which is not affected so much by extreme values.
  • Analogous to the median: describe the “middle” part of the data.
  • The idea: find the “middle half” of the data, and then take its range.
  • Specifically, exclude the lowest 25% and the highest 25%, and take the difference between the highest and lowest remaining values.

SLIDE 14

Quartiles

  • The median divides the data in two.
  • Percentiles divide the data into 100 pieces.
  • Quartiles divide the data into four pieces. The kth quartile (written Qk) is the point below which k quarters of the data lie.
  • So, in terms of quartiles, the median is Q2, the minimum value is Q0, the maximum value is Q4.
  • We can calculate the range using quartiles as Q4 − Q0.

SLIDE 15

Quartiles

Figure: Height (in.) data with the quartiles Q0, Q1, Q2, Q3, Q4 marked

SLIDE 16

The Inter-Quartile Range (IQR)

The Inter-Quartile Range (or IQR) is the distance between the first and third quartiles:

IQR = Q3 − Q1

Pedantic Note

The IQR is a single number, not the two quartiles themselves.
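A small sketch of IQR = Q3 − Q1 on made-up heights. It assumes one particular quartile convention (linear interpolation, via `statistics.quantiles` with `method="inclusive"`); other software may use slightly different rules. The helper name `iqr` is hypothetical.

```python
from statistics import quantiles

def iqr(values):
    """Inter-quartile range: distance between the first and third quartiles."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical heights in inches (illustrative values only).
heights = [22, 30, 33, 35, 37, 38, 40, 43, 49]
heights_extreme = [22, 30, 33, 35, 37, 38, 40, 43, 490]  # misplaced decimal

print(iqr(heights), max(heights) - min(heights))
print(iqr(heights_extreme), max(heights_extreme) - min(heights_extreme))
```

The extreme value inflates the range but leaves the IQR unchanged, because it falls outside the middle half of the data.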

SLIDE 17

The Inter-Quartile Range (IQR)

Figure: Height (in.) data with Q0 through Q4 marked, showing the Range and the IQR

SLIDE 18

The Five-Number Summary

Five-number Summary

  • The quartiles are very natural to report together to describe the center and spread of a distribution.
  • Q0 through Q4 collectively form the five-number summary of a quantitative distribution.

Five Number Summary = (xmin, Q1, Median, Q3, xmax) = (Q0, Q1, Q2, Q3, Q4)
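The five-number summary can be sketched in a few lines. The data are made up, `five_number_summary` is a hypothetical helper, and the quartiles use the linear-interpolation convention (`method="inclusive"`), which is one of several in common use.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (Q0, Q1, Q2, Q3, Q4) = (min, Q1, median, Q3, max)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical heights in inches (illustrative values only).
heights = [22, 30, 33, 35, 37, 38, 40, 43, 49]
summary = five_number_summary(heights)
print(summary)  # (22, 33.0, 37.0, 40.0, 49)
```

Both the range (Q4 − Q0) and the IQR (Q3 − Q1) can be read directly off this tuple.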

SLIDE 19

Box-and-Whisker Plots

From the five-number summary, we construct a graph called a box-and-whisker plot (or just box plot, for short):

  1. Draw an axis.
  2. Draw a rectangle (box) from Q1 to Q3.
  3. Draw a line across the box (or place a dot) at Q2.
  4. Draw lines (whiskers) extending outward from the box on both sides to either
     (a) (Simplest version) xmin and xmax, or
     (b) (R default) the most extreme data points that lie within Q1 − 1.5 IQR and Q3 + 1.5 IQR.
  5. In version (b), plot points beyond the whiskers individually.
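The numbers behind a version (b) box plot can be sketched without any plotting. This is an illustration under stated assumptions: made-up data, a hypothetical helper `boxplot_summary`, and linear-interpolation quartiles, so the results may differ slightly from what R reports.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Box edges, whisker ends, and individually plotted points (1.5 * IQR rule)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    beyond = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "box": (q1, q2, q3),
        # Whiskers stop at the most extreme data points inside the limits.
        "whiskers": (min(inside), max(inside)),
        "beyond": beyond,
    }

# Hypothetical heights in inches (illustrative values only).
heights = [22, 30, 33, 35, 37, 38, 40, 43, 75]
result = boxplot_summary(heights)
print(result)
```

Here the limits are 22.5 and 50.5, so the whiskers end at 30 and 43, and the values 22 and 75 are plotted individually.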

SLIDE 20

Box-and-Whisker Plot: Version 1

Figure: Height (in.) data with Q0 through Q4, the Range, and the IQR marked, above the corresponding box-and-whisker plot

SLIDE 21

Box-and-Whisker Plot: Version 2

Figure: Height (in.) data with Q0 through Q4, the Range, and the IQR marked, above the corresponding box-and-whisker plot

SLIDE 22

Box-and-Whisker Plot: Right Skew

Figure: 2001 Household Income (Thousands of 2016$), shown as a density plot with a box-and-whisker plot on the same scale

SLIDE 23

Box-and-Whisker Plot: Right Skew

Figure: 2001 Household Income (Thousands of 2016$), shown as a density plot with a box-and-whisker plot on the same scale

SLIDE 24

Matching Graphs to Variables

Handout

SLIDE 26

Deviations

Rather than simply measuring the distance between extremes, we can develop measures based on distance from “center”.

Deviation Scores

For each data point, its deviation score is its “distance” from the mean.

$\text{Deviation}_i = x_i - \bar{x}$, for each $i = 1, \ldots, n$
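A quick sketch of deviation scores on made-up heights (not the lecture's data). It also shows why the raw deviations cannot simply be averaged: they always sum to zero.

```python
from statistics import mean

# Hypothetical heights in inches (illustrative values only).
heights = [24, 31, 36, 40, 44]

x_bar = mean(heights)                       # 35
deviations = [x - x_bar for x in heights]   # signed distance from the mean

print(deviations)        # [-11, -4, 1, 5, 9]
print(sum(deviations))   # 0
```

Positive and negative deviations cancel exactly, so some way of removing the sign (such as squaring) is needed before averaging.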

SLIDE 27

Deviations

Figure: Height (in.) data with the mean marked (mean = 36.76)

SLIDE 29

Deviations

Figure: Height (in.) data with the mean marked (mean = 36.76) and one deviation shown (Deviation = 6.24)

SLIDE 30

Deviations

Figure: Height (in.) data with the mean marked (mean = 36.76) and two deviations shown (Deviation = 6.24 and Deviation = −12.76)

How can we use these for an overall measure of spread?

SLIDE 31

Variance

  • If we square all the deviations from the mean and average them, we get the variance.

Variance

The variance, written $s^2$, is the average of the squared deviations from the mean. That is,

$$s^2 = \frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
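The variance formula translates almost directly into code. This sketch uses made-up heights and a hypothetical helper `sample_variance`, and checks the result against the standard library's `statistics.variance`, which also divides by n − 1.

```python
from statistics import mean, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - 1)

# Hypothetical heights in inches (illustrative values only).
heights = [24, 31, 36, 40, 44]

s2 = sample_variance(heights)
print(s2)                 # 61.0  (= 244 / 4)
print(variance(heights))  # same value from the standard library
```

The squared deviations are 121, 16, 1, 25, and 81, which sum to 244; dividing by n − 1 = 4 gives 61 square inches.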

SLIDE 32

What’s with that denominator?

  • With an average, you’re supposed to divide by the number of things, aren’t you? Why n − 1?
  • Usually we are working with a sample, and are interested in estimating the population variability.
  • We get no information about variability from the first observation, so there are only n − 1 “degrees of freedom” in the sample.
  • Interesting math side fact: the variance equals half the average squared distance between all distinct pairs of data points.
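The pairwise side fact can be checked numerically. Precisely, the n − 1 variance equals one half of the average squared difference over all distinct pairs. The data are made up and the helper name is hypothetical.

```python
from itertools import combinations
from statistics import variance

def half_mean_squared_pair_distance(values):
    """Half the average of (a - b)^2 over all distinct pairs of data points."""
    pairs = list(combinations(values, 2))
    mean_squared_distance = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return mean_squared_distance / 2

# Hypothetical heights in inches (illustrative values only).
heights = [24, 31, 36, 40, 44]

pairwise = half_mean_squared_pair_distance(heights)
print(pairwise, variance(heights))  # both equal 61
```

Note that this version never refers to the mean at all, which is one way to see that variance measures how far the data points are from each other.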

SLIDE 33

Standard Deviation

  • Variance ($s^2$) is in squared units relative to the data.
  • No problem: just take the square root.

Standard Deviation

$s = \sqrt{s^2}$ is the standard deviation:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

SLIDE 34

Same range, different s

Figure: Historical annual returns for Fund C1 (s = 18.2) and Fund C2 (s = 8.1)

The standard deviation uses all the data.
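A sketch of "same range, different s" with two made-up funds (these are not the Fund C1 and C2 values from the figure). Both have the same minimum and maximum, but one piles its returns at the extremes while the other clusters near the center.

```python
from statistics import stdev

# Hypothetical annual returns in percent (illustrative values only).
fund_a = [-20, -20, -20, 30, 30, 30]   # returns sit at the extremes
fund_b = [-20, 4, 5, 5, 6, 30]         # returns cluster near the middle

range_a = max(fund_a) - min(fund_a)
range_b = max(fund_b) - min(fund_b)
s_a, s_b = stdev(fund_a), stdev(fund_b)

print("ranges:", range_a, range_b)
print("standard deviations:", round(s_a, 1), round(s_b, 1))
```

The range sees only the two most extreme observations, so it cannot tell these funds apart; the standard deviation can, because every observation contributes to it.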

SLIDE 35

Distribution of a Quantitative Variable

The distribution of a numeric variable is characterized by:

  • A. Shape (symmetric, skewed, bimodal, etc.)
  • B. Center (mean, median)
  • C. Spread (Interquartile Range, Standard Deviation)
  • D. Outliers (if any)

SLIDE 36

Outliers

  • Skewness can be an important feature of a distribution.
  • But sometimes a few unusual data points make an otherwise “well-behaved” distribution look skewed/multimodal.
  • When not part of the overall pattern, these are called outliers.
      • Sometimes they reflect measurement errors (e.g., misplaced decimal)
      • Sometimes they represent genuinely unusual observations

SLIDE 37

On-Base Percentage

A common statistic for batters in baseball is On-Base Percentage.

Figure: Distribution of major-league hitters with at least 100 Plate Appearances in 2002 (density of On-Base Percentage; Skewness = 0.630; Barry Bonds labeled)

SLIDE 38

On-Base Percentage

Distribution without Bonds

Figure: Density of On-Base Percentage with Bonds removed (Skewness = 0.199)

SLIDE 39

Visualizing Outliers

Figure: Two plots of On-Base Percentage on the same axis

SLIDE 41

Problems with s and s2

  • These measures, even more than the mean itself, are heavily influenced by extreme values.

Figure: Density of 2001 Household Income (Thousands of 2016$)
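One made-up case illustrating the point (a sketch, not the income data from the figure): adding a single extreme value multiplies the standard deviation by far more than it multiplies the mean.

```python
from statistics import mean, stdev

# Hypothetical incomes in thousands of dollars (illustrative values only).
incomes = [30, 35, 40, 45, 50]
with_extreme = incomes + [1000]  # same data plus one extreme value

mean_ratio = mean(with_extreme) / mean(incomes)   # 200 / 40
sd_ratio = stdev(with_extreme) / stdev(incomes)

print("mean grows by a factor of", mean_ratio)
print("standard deviation grows by a factor of", round(sd_ratio, 1))
```

Because deviations are squared before averaging, one very large deviation dominates the sum, so s and s² react more strongly than the mean does.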

SLIDE 42

Problems with s and s2

Figure: Density of 2001 Household Income (Thousands of 2016$) in three panels: full data, top 0.1% excluded, and top 1% excluded (horizontal axis up to 2000)

SLIDE 43

Problems with s and s2

Figure: Density of 2001 Household Income (Thousands of 2016$) in three panels: full data, top 0.1% excluded, and top 1% excluded (horizontal axis up to 300)

SLIDE 44

Variance-Stabilizing Transformations

  • The mean and standard deviation are unstable in the presence of skew.
  • However, they have such useful properties otherwise that it is often better to try to “remove” skew, rather than fall back on other measures.
  • The most common way to remove skew is by a nonlinear transformation of the underlying scale.
  • Take the original variable, x, and define a new variable y = f(x), where f is a one-to-one function.
  • Most common case: right-skewed data with positive values
      • Logarithmic transform (take y = log(x))
      • Square Root (take y = √x)
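A deliberately tidy sketch of the logarithmic transform, using made-up incomes that span several orders of magnitude. On the raw scale the mean sits far above the median (a sign of right skew); on the log scale the two coincide. Real data will rarely be this clean.

```python
import math
from statistics import mean, median

# Hypothetical incomes in dollars (illustrative values only).
incomes = [100, 1_000, 10_000, 100_000, 1_000_000]
log_incomes = [math.log10(x) for x in incomes]  # y = log10(x)

raw_gap = mean(incomes) - median(incomes)
log_gap = mean(log_incomes) - median(log_incomes)

print("raw scale: mean - median =", raw_gap)
print("log scale: mean - median =", round(log_gap, 6))
```

The transformation is one-to-one, so no information is lost: each y value maps back to exactly one x via x = 10^y.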

SLIDE 45

Variance-Stabilizing Transformations

Original vs. Logarithmic Income Distribution:

Figure: Density of 2001 Household Income on the original scale (Thousands of 2016$) and on a logarithmic scale (2016$, axis from 10^2 to 10^6)

SLIDE 46

Summary

Quantitative Data

Visualizing a quantitative variable

  • Dot Plots
  • Box-and-Whisker Plots
  • Histograms
  • Density curves

Describing the distribution of a numeric variable

  • Shape (symmetry, skew, modes)
  • Center (mean, median)
  • Spread (IQR, standard deviation)
  • Outliers (if any)

SLIDE 47

Summary

Shape and Center

  • A distribution is skewed when the extreme values on one end are more extreme than on the other
  • We say that it is skewed in the direction of the more extreme values (e.g., right-skewed if there are a few very large values)
  • The mean is the “balance point of the data”, written $\bar{x}$.
      • Mean has nice math properties, but is affected by skew
  • The median divides the cases in half
      • It is resistant to outliers/skewness

SLIDE 48

Summary

Variability

  • The range is unstable for a sample, and is extremely vulnerable to outliers/skew
  • The Interquartile Range (IQR) is the range of the “middle half” of the data, and is “resistant” (like the median)
  • The variance is the “average” of the squared deviations of the observations from the mean
  • The standard deviation is the square root of the variance, in order to restore units to the original scale
  • Nonlinear transformations (log, square root, etc.) can be used when appropriate to reduce skew and stabilize variance