Marc Mehlman
Descriptive Statistics
Marc H. Mehlman
marcmehlman@yahoo.com
University of New Haven
Marc Mehlman (University of New Haven) Descriptive Statistics 1 / 44
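Table of Contents
1 Data Distributions
2 Graphical Representation of Distributions
3 Measuring the Center
4 Measuring the Spread
5 Normal Distribution
6 Order Statistics
7 Chapter #1 R Assignment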
Data Distributions
In any graph of data, look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern by its shape, center, and spread.
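A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.
[Figure: three example distributions labeled Symmetric, Skewed-left, and Skewed-right.]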
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
The overall pattern is fairly symmetric except for two states, Alaska and Florida, that clearly do not belong to the main pattern: both have unusual representation of the elderly in their population. A large gap in the distribution is typically a sign of an outlier.
1 # classes should be between 5 and 20 inclusive.
2 class width ≈ $\frac{\text{max value} - \text{min value}}{\#\ \text{classes}}$
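The two rules of thumb above can be tried out in R. This is only a sketch: it assumes the built-in `trees` data used elsewhere in these slides and an arbitrarily chosen class count `k = 6`; the names `k`, `width`, `breaks`, and `freq` are illustrative.

```r
# Rule-of-thumb frequency table for tree girth (built-in `trees` data).
x <- trees$Girth
k <- 6                                  # assumed number of classes (between 5 and 20)
width <- (max(x) - min(x)) / k          # class width = range / number of classes
breaks <- min(x) + width * (0:k)        # k + 1 equally spaced class boundaries
breaks[k + 1] <- max(x)                 # guard against floating-point round-off
# Count the observations falling in each class (lowest boundary included).
freq <- table(cut(x, breaks = breaks, include.lowest = TRUE))
print(width)
print(freq)
```

In practice the class width is often rounded up to a convenient value, which is why the rule is stated with ≈ rather than =.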
Graphical Representation of Distributions
To examine a single variable, we graphically display its distribution.
The proper choice of graph depends on the nature of the variable.
Categorical variable: pie chart, bar graph
Quantitative variable: histogram, stemplot
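The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into each category.
Pie charts show the distribution of a categorical variable as a "pie" whose slices are sized by the counts or percents for the categories.
Bar graphs represent each category as a bar whose height shows the category counts or percents.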
> pie.sales = c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
> lbls = c("Blueberry", "Cherry", "Apple", "Boston Cream", "Other", "Vanilla Cream")
> pie(pie.sales, labels = lbls, main="Pie Sales")
[Figure: pie chart titled "Pie Sales" with slices for Blueberry, Cherry, Apple, Boston Cream, Other, and Vanilla Cream.]
[Figure: bar graph titled "Favorite Colors" with bars for Red, Blue, Green, and Brown; the vertical axis is marked from 10 to 40.]
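A bar graph of a categorical variable such as favorite color can be drawn with base R's `barplot`. The counts below are hypothetical values invented for illustration (the data behind the figure are not given in the slides), and the names `fav` and `pct` are likewise assumptions.

```r
# Hypothetical counts of favorite colors (illustrative values only).
fav <- c(Red = 40, Blue = 30, Green = 20, Brown = 10)
pct <- 100 * fav / sum(fav)             # convert counts to percents
# One bar per category; bar height is the category count.
barplot(fav, main = "Favorite Colors", ylab = "Count")
print(pct)
```

Passing `pct` instead of `fav` to `barplot` would draw the same shape with a percent scale.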
The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.
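Histograms show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.
Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.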
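Histograms are used for quantitative variables that take many values and/or for large datasets.
1 Divide the possible values into classes (equal widths).
2 Count how many observations fall into each interval (counts may be changed to percents).
3 Draw a picture representing the distribution: bar heights are equivalent to the number (percent) of observations in each interval.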
> hist(trees$Girth, main="Girth of Black Cherry Trees", xlab="Diameter in Inches")
[Figure: histogram titled "Girth of Black Cherry Trees"; horizontal axis "Diameter in Inches" (8 to 22), vertical axis "Frequency" (2 to 12).]
To construct a stemplot:
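1 Separate each observation into a stem (the first part of the number) and a leaf (the remaining part of the number).
2 Write the stems in a vertical column and draw a vertical line to the right of the stems.
3 Write each leaf in the row to the right of its stem; order the leaves if desired.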
> Girth=trees$Girth
> stem(Girth) # stem and leaf plot

  The decimal point is at the |

   8 | 368
  10 | 57800123447
  12 | 099378
  14 | 025
  16 | 03359
  18 | 00
  20 | 6

> stem(Girth, scale=2)

  The decimal point is at the |

   8 | 368
   9 |
  10 | 578
  11 | 00123447
  12 | 099
  13 | 378
  14 | 025
  15 |
  16 | 03
  17 | 359
  18 | 00
  19 |
  20 | 6
For each individual studied, we record data on two variables. We then examine whether there is a relationship between these two variables: do changes in one variable tend to be associated with specific changes in the other variable?
Here we have two quantitative variables recorded for each of 16 students: the number of beers and the blood alcohol content (BAC).

Student ID   Number of Beers   Blood Alcohol Content
 1           5                 0.1
 2           2                 0.03
 3           9                 0.19
 4           8                 0.12
 5           3                 0.04
 6           7                 0.095
 7           3                 0.07
 8           5                 0.06
 9           3                 0.02
10           5                 0.05
11           4                 0.07
12           6                 0.1
13           5                 0.085
14           7                 0.09
15           1                 0.01
16           4                 0.05
A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.
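A scatterplot of the student data can be sketched in base R as follows. The values are the beers/BAC pairs from the table in these slides, entered in order of student ID; the vector names, the choice of beers for the horizontal axis, and the plot title are assumptions made for illustration.

```r
# Beers and blood alcohol content (BAC) for the 16 students,
# entered in order of student ID (1 through 16).
beers <- c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4)
bac   <- c(0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
           0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05)
# Each student is one point: beers on the x-axis, BAC on the y-axis.
plot(beers, bac, xlab = "Beers", ylab = "BAC", main = "BAC vs. beers")
```

Putting beers on the horizontal axis reflects the usual convention of plotting the explanatory variable there; swapping the two arguments would simply transpose the picture.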
[Figure: scatterplot titled "girth vs height", with axes labeled trees$Height (70 to 85) and trees$Girth (8 to 20).]
Measuring the Center
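The sample mean of $x_1, \dots, x_n$ is $\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$.
The population mean is $\mu = \frac{1}{N}\sum_{j=1}^{N} x_j$.
The mode is the most frequently occurring value.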
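As a rough illustration, the mean and the mode can be computed in R for the beer counts from the student table in these slides. Base R's `mode()` reports an object's storage type rather than the statistical mode, so the helper `stat.mode` below is a hypothetical function written for this sketch.

```r
# Number of beers for the 16 students (from the table in these slides).
x <- c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4)
xbar <- mean(x)                         # sample mean: sum of the values divided by n

# Hypothetical helper: return the most frequently occurring value(s).
stat.mode <- function(v) {
  counts <- table(v)                    # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])
}
m <- stat.mode(x)
print(xbar)
print(m)
```

If several values tie for the highest frequency, `stat.mode` returns all of them.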
The mean and median measure center in different ways, and both are useful. The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is usually farther out in the long tail than is the median.
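A small numerical check of this behavior, using two made-up data sets (the vectors `sym` and `skew` are invented for illustration and do not come from the slides):

```r
# A symmetric sample and a right-skewed sample (illustrative values only).
sym  <- c(1, 2, 3, 4, 5)
skew <- c(1, 2, 2, 3, 3, 3, 4, 4, 20)   # one large value creates a long right tail

# Symmetric data: mean and median coincide.
print(c(mean = mean(sym), median = median(sym)))
# Right-skewed data: the mean is pulled toward the long tail.
print(c(mean = mean(skew), median = median(skew)))
```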
Measuring the Spread
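The population variance is $\sigma^2 = \frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2$ and the population standard deviation is $\sigma = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(x_j - \mu)^2}$.
The sample variance is $s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2$ and the sample standard deviation is $s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2}$.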
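The difference between the two divisors can be seen numerically. The sketch below uses the five-value sample from the worked example in these slides; treating those same five values as a whole population, to show the divisor $N$, is purely an assumption for illustration.

```r
x <- c(4.5, 3.7, 2.8, 5.3, 4.6)
n <- length(x)
ss <- sum((x - mean(x))^2)              # sum of squared deviations from the mean

s2 <- ss / (n - 1)                      # sample variance (divisor n - 1)
s  <- sqrt(s2)                          # sample standard deviation
pop.var <- ss / n                       # variance if x were the entire population

print(c(s2 = s2, s = s, pop.var = pop.var))
```

R's `var` and `sd` use the $n-1$ divisor, so they agree with `s2` and `s` here rather than with `pop.var`.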
Example
Suppose our random sample is 4.5, 3.7, 2.8, 5.3, 4.6. Then
$\bar{x} = \frac{1}{5}[4.5 + 3.7 + 2.8 + 5.3 + 4.6] = 4.18$
$s^2 = \frac{1}{5-1}\left[(4.5-4.18)^2 + (3.7-4.18)^2 + (2.8-4.18)^2 + (5.3-4.18)^2 + (4.6-4.18)^2\right] = 0.917$
$s = \sqrt{0.917} = 0.9576012$.
Using R:
> dat=c(4.5,3.7,2.8,5.3,4.6)
> mean(dat)
[1] 4.18
> var(dat)
[1] 0.917
> sd(dat)
[1] 0.9576012
1 n−1
j=1(x − ¯
n
j=1(x − ¯
n
j=1(x − ¯
n s2.
1 $s^2 = \frac{n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2}{n(n-1)}$ and $s = \sqrt{\frac{n\sum_{j=1}^{n} x_j^2 - \left(\sum_{j=1}^{n} x_j\right)^2}{n(n-1)}}$.
2 $s^2 = 0 \Leftrightarrow s = 0 \Rightarrow$ all values in the random sample are the same.
3
4 $s^2$ is an (unbiased) estimator of $\sigma^2$.
5 $s$ is a (biased) estimator of $\sigma$.
6 $s$ measures the amount the data are dispersed about the mean.
7 $s$ has the same units of measurement as the data.
8 $s$ is sensitive to the existence of outliers.
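The shortcut formula for $s^2$ in the first property can be checked against R's `var`. This is a sketch using the five-value sample from the earlier example; the variable names are illustrative.

```r
x <- c(4.5, 3.7, 2.8, 5.3, 4.6)
n <- length(x)

# Shortcut form: (n * sum of squares - square of the sum) / (n * (n - 1)).
s2.shortcut <- (n * sum(x^2) - sum(x)^2) / (n * (n - 1))
print(s2.shortcut)
print(var(x))

# A sample whose values are all equal has variance (and standard deviation) zero.
same <- rep(3, 4)
print(var(same))
```

The shortcut form is convenient by hand but can lose precision in floating-point arithmetic when the mean is large relative to the spread, so the comparison below uses a tolerance rather than exact equality.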
172 = 0.17, which is less than the CV for dogs, 8/25 = 0.32.
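Assuming CV here denotes the coefficient of variation, that is, the standard deviation divided by the mean, it can be computed in R as sketched below. The helper `cv` is a hypothetical function (base R has no built-in of that name), and the sample `x` is the five-value example from earlier in these slides rather than the data behind the comparison above.

```r
# Hypothetical helper: coefficient of variation = standard deviation / mean.
cv <- function(v) sd(v) / mean(v)

dogs.cv <- 8 / 25                       # the ratio quoted for dogs
x <- c(4.5, 3.7, 2.8, 5.3, 4.6)
x.cv <- cv(x)                           # CV of the five-value example sample
print(c(dogs = dogs.cv, example = x.cv))
```

Because the numerator and denominator share the same units, the CV is unit-free, which is what makes comparisons across differently scaled measurements meaningful.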
Chebyshev's Theorem: for any $K > 1$, at least $1 - \frac{1}{K^2}$ of the data is within $K$ standard deviations of the mean.
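The statement above says that at least $1 - 1/K^2$ of the data lie within $K$ standard deviations of the mean (Chebyshev's inequality). The following sketch checks this empirically for the built-in `trees` girth data with an assumed choice of `K = 2`; the variable names are illustrative.

```r
x <- trees$Girth
K <- 2                                  # assumed number of standard deviations

# Proportion of observations within K standard deviations of the mean.
within <- mean(abs(x - mean(x)) <= K * sd(x))
bound  <- 1 - 1 / K^2                   # guaranteed minimum proportion (0.75 for K = 2)
print(c(within = within, bound = bound))
```

`sd` uses the $n-1$ divisor, which only makes the interval slightly wider than with the divisor $n$, so the guarantee still applies. For most data sets the observed proportion is well above the bound, which is a worst-case figure.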
Normal Distribution
Order Statistics
Choosing Measures of Center and Spread
We now have a choice between two descriptions for center and spread:
Mean and Standard Deviation
Median and Interquartile Range
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers. Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!
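To see why the median and IQR are preferred when outliers are present, the sketch below adds one extreme value to the five-value sample from earlier in these slides. The outlier `50` is a made-up value chosen for illustration, and the helper `summarize` is hypothetical.

```r
x <- c(4.5, 3.7, 2.8, 5.3, 4.6)         # sample from the earlier example
y <- c(x, 50)                           # same data plus one hypothetical outlier

# Hypothetical helper collecting both pairs of summaries.
summarize <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}
print(rbind(original = summarize(x), with.outlier = summarize(y)))
```

The mean and standard deviation change a great deal when the outlier is added, while the median and IQR change only slightly.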
To construct a boxplot:
1 Draw and label a vertical number line that includes the range of the data.
2 Draw a box from height Q1 to Q3.
3 Draw a horizontal line inside the box at the height of the median.
4 Draw vertical line segments (whiskers) from the bottom and top of the box to the smallest and largest observations that are not outliers.
5 Sometimes outliers are identified with ◦’s (R does this).
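The numbers behind a boxplot can be inspected directly in R. This sketch assumes the built-in `trees` data; `fivenum` and `boxplot.stats` are base R functions, while the variable names are illustrative.

```r
x <- trees$Height

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(x)
print(five)

# What boxplot() actually draws: whisker ends, hinges, and median in $stats,
# and any points flagged as outliers in $out.
bs <- boxplot.stats(x)
print(bs$stats)
print(bs$out)

boxplot(x, main = "Heights of Black Cherry Trees")
```

By default R flags as outliers any points lying more than 1.5 box lengths beyond the box, and `fivenum` reports Tukey's hinges, which can differ slightly from the quartiles returned by `quantile`.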
> boxplot(trees$Height, main="Heights of Black Cherry Trees")
> boxplot(USJudgeRatings$DMNR,USJudgeRatings$DILG,
+ main="Lawyers' Demeanor/Diligence ratings of US Superior Court state judges")
[Figure: boxplot titled "Heights of Black Cherry Trees" (axis 65 to 85) and side-by-side boxplots titled "Lawyers' Demeanor/Diligence ratings of US Superior Court state judges" (axis 6 to 9).]
Chapter #1 R Assignment
1 Create a barplot and pie chart of eye color from the sailor sample.
2 Create a histogram and a stemplot of the height of loblolly trees
3 Find the mean, median, five number summary, variance and standard
4 Create a scatterplot of weight versus quarter mile times for the