Descriptive Statistics Marc H. Mehlman marcmehlman@yahoo.com - - PowerPoint PPT Presentation

SLIDE 1

Marc Mehlman

Descriptive Statistics

Marc H. Mehlman

marcmehlman@yahoo.com

University of New Haven

Marc Mehlman (University of New Haven) Descriptive Statistics 1 / 44

SLIDE 2

Table of Contents

  1. Data Distributions
  2. Graphical Representation of Distributions
  3. Measuring the Center
  4. Measuring the Spread
  5. Normal Distribution
  6. Order Statistics
  7. Chapter #1 R Assignment

SLIDE 3

Data Distributions

SLIDE 4

Examining Distributions

In any graph of data, look for the overall pattern and for striking deviations from that pattern.

  • You can describe the overall pattern by its shape, center, and spread.
  • An important kind of deviation is an outlier, an individual that falls outside the overall pattern.

SLIDE 5

Examining Distributions

  • A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.
  • A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
  • It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.

[Figure: three example distributions labeled Symmetric, Skewed-left, and Skewed-right]

SLIDE 6

Outliers

An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

[Figure: distribution across states, with Alaska and Florida labeled]

The overall pattern is fairly symmetrical except for two states that clearly do not belong to the main trend. Alaska and Florida have unusual representation of the elderly in their population. A large gap in the distribution is typically a sign of an outlier.

SLIDE 7

In a class of 200 students, let x_i = number of points (out of 500 possible) that student i gets.

  Frequency Table        Relative Frequency Table    Cumulative Frequency Table
  Class        Count     Class        Percent        Class      Count
  0 – 99         10      0 – 99          5%          ≤ 99         10
  100 – 199       8      100 – 199       4%          ≤ 199        18
  200 – 299      22      200 – 299      11%          ≤ 299        40
  300 – 399     100      300 – 399      50%          ≤ 399       140
  400 – 500      60      400 – 500      30%          ≤ 500       200

  1. The number of classes should be between 5 and 20 inclusive.
  2. class width ≈ (max value − min value) / (# of classes)
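The relative and cumulative columns follow mechanically from the frequency column. The sketch below is an illustration added to the transcript, not part of the original slides; the variable names are made up here, and the class-width line assumes the scores span the full 0 to 500 range.

```r
# Class counts copied from the frequency table on this slide.
freq <- c("0-99" = 10, "100-199" = 8, "200-299" = 22,
          "300-399" = 100, "400-500" = 60)

rel_freq <- freq / sum(freq)   # relative frequencies (as proportions)
cum_freq <- cumsum(freq)       # running totals for the cumulative table

print(data.frame(frequency = freq, relative = rel_freq, cumulative = cum_freq))

# Rule-of-thumb class width for 5 classes, assuming scores range from 0 to 500.
class_width <- (500 - 0) / 5
```

Multiplying `rel_freq` by 100 gives the percentages shown in the relative frequency table.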

SLIDE 8

Graphical Representation of Distributions

SLIDE 9

Distribution of a Variable

To examine a single variable, we graphically display its distribution.

  • The distribution of a variable tells us what values it takes and how often it takes these values.
  • Distributions can be displayed using a variety of graphical tools. The proper choice of graph depends on the nature of the variable.

  Categorical variable:   pie chart, bar graph
  Quantitative variable:  histogram, stemplot

SLIDE 10

Categorical Variables

The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into each category.

  • Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories.
  • Bar graphs represent each category as a bar whose height shows the category count or percent.

SLIDE 11

> pie.sales = c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)  # share of sales per flavor
> lbls = c("Blueberry", "Cherry", "Apple", "Boston Cream", "Other", "Vanilla Cream")
> pie(pie.sales, labels = lbls, main = "Pie Sales")

[Figure: pie chart titled "Pie Sales" with slices Blueberry, Cherry, Apple, Boston Cream, Other, Vanilla Cream]

SLIDE 12

> counts = c(40, 30, 20, 10)
> colors = c("Red", "Blue", "Green", "Brown")
> barplot(counts, names.arg = colors, main = "Favorite Colors")

[Figure: bar graph titled "Favorite Colors" with bars for Red, Blue, Green, Brown; vertical axis from 10 to 40]

SLIDE 13

Quantitative Variables

The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.

  • Histograms show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.
  • Stemplots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.

SLIDE 14

Histograms

For quantitative variables that take many values and/or large datasets.

  • Divide the possible values into classes (equal widths).
  • Count how many observations fall into each interval (may change to percents).
  • Draw a picture representing the distribution: bar heights are equivalent to the number (percent) of observations in each interval.

SLIDE 15

> hist(trees$Girth, main = "Girth of Black Cherry Trees", xlab = "Diameter in Inches")

[Figure: histogram titled "Girth of Black Cherry Trees"; x-axis "Diameter in Inches" (8 to 22), y-axis "Frequency" (2 to 12)]
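The class counts behind a histogram can also be inspected without drawing it. The following sketch is an added illustration, not from the original slides; it assumes classes 2 inches wide from 8 to 22, which matches the axis ticks recovered from the figure.

```r
# Class boundaries every 2 inches (an assumption based on the plot's axis ticks).
breaks <- seq(8, 22, by = 2)

# cut() assigns each girth to a class; table() counts observations per class.
girth_classes <- cut(trees$Girth, breaks = breaks)
class_counts <- table(girth_classes)
print(class_counts)

# hist() returns the same counts when asked not to plot.
h <- hist(trees$Girth, breaks = breaks, plot = FALSE)
print(h$counts)
```

Dividing `class_counts` by `sum(class_counts)` converts the counts to the proportions mentioned in the construction steps.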

SLIDE 16

Stemplots

To construct a stemplot:

  • Separate each observation into a stem (first part of the number) and a leaf (the remaining part of the number).
  • Write the stems in a vertical column; draw a vertical line to the right of the stems.
  • Write each leaf in the row to the right of its stem; order leaves if desired.

SLIDE 17

> Girth = trees$Girth
> stem(Girth)  # stem and leaf plot

  The decimal point is at the |

   8 | 368
  10 | 57800123447
  12 | 099378
  14 | 025
  16 | 03359
  18 | 00
  20 | 6

> stem(Girth, scale = 2)

  The decimal point is at the |

   8 | 368
   9 |
  10 | 578
  11 | 00123447
  12 | 099
  13 | 378
  14 | 025
  15 |
  16 | 03
  17 | 359
  18 | 00
  19 |
  20 | 6

SLIDE 18

Bivariate Data

Bivariate data comes from measuring two aspects of the same item or individual. For instance, (70, 178), (72, 192), (74, 184), (68, 181) is a random sample of size four obtained from four male college students. The bivariate data gives the height in inches and the weight in pounds of each of the four students. The third student sampled is 74 inches tall and weighs 184 pounds. Can one variable be used to predict the other? Do tall people tend to weigh more?

Definition
A response (or dependent) variable measures the outcome of a study. The explanatory (or independent) variable is the one that predicts the response variable.

SLIDE 19

Bivariate data

  Student ID   Number of Beers   Blood Alcohol Content
       1              5                 0.1
       2              2                 0.03
       3              9                 0.19
       6              7                 0.095
       7              3                 0.07
       9              3                 0.02
      11              4                 0.07
      13              5                 0.085
       4              8                 0.12
       5              3                 0.04
       8              5                 0.06
      10              5                 0.05
      12              6                 0.1
      14              7                 0.09
      15              1                 0.01
      16              4                 0.05

Here we have two quantitative variables recorded for each of 16 students:

  1. how many beers they drank
  2. their resulting blood alcohol content (BAC)

  • For each individual studied, we record data on two variables.
  • We then examine whether there is a relationship between these two variables: Do changes in one variable tend to be associated with specific changes in the other variable?

SLIDE 20

Scatterplots

  Student   Beers   BAC
     1        5     0.1
     2        2     0.03
     3        9     0.19
     6        7     0.095
     7        3     0.07
     9        3     0.02
    11        4     0.07
    13        5     0.085
     4        8     0.12
     5        3     0.04
     8        5     0.06
    10        5     0.05
    12        6     0.1
    14        7     0.09
    15        1     0.01
    16        4     0.05

A scatterplot is used to display quantitative bivariate data. Each variable makes up one axis. Each individual is a point on the graph.
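As an illustration (added here, not part of the original slides), the student data can be plotted with beers as the explanatory variable and BAC as the response. The vector names and axis labels are chosen for this sketch.

```r
# Beers and BAC for the 16 students, in the order listed in the table.
beers <- c(5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4)
bac   <- c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
           0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05)

# Explanatory variable (beers) on the x-axis, response (BAC) on the y-axis.
plot(bac ~ beers, main = "BAC vs. beers",
     xlab = "Number of beers", ylab = "Blood alcohol content")
```

Each of the 16 points is one student; the formula `bac ~ beers` reads "BAC as a function of beers".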

SLIDE 21

> plot(trees$Girth ~ trees$Height, main = "girth vs height")

[Figure: scatterplot titled "girth vs height"; x-axis trees$Height (65 to 85), y-axis trees$Girth (8 to 20)]

SLIDE 22

Measuring the Center

SLIDE 23

Measures of the Center

Definition
Given \(x_1, x_2, \cdots, x_n\), the sample mean is
\[ \bar{x} \overset{\text{def}}{=} \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{j=1}^{n} x_j. \]
The population mean is
\[ \mu \overset{\text{def}}{=} \frac{1}{N} \sum_{j=1}^{N} x_j. \]
The mode is the most frequent data value (it need not be unique). If one orders the data from smallest to largest, the median is
\[ M \overset{\text{def}}{=} \begin{cases} \text{middle value of data} & \text{if } n \text{ is odd} \\ \text{the average of the middle two values of data} & \text{if } n \text{ is even.} \end{cases} \]
Laymen refer to the mean as the average.

Example
The median sales price of a house in Milford was $212,175 for Feb–Apr 2013. If Bill Gates buys a house in Milford for $100 million, what will that do to the mean cost of a house in Milford? To the median cost of a house in Milford? What is a better measure of the cost of buying a house in Milford, the mean or the median?
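One way to explore the question is with made-up numbers. In the sketch below (an added illustration, not from the slides) the five sale prices are hypothetical, chosen only so that their median equals the quoted $212,175; they are not real Milford data.

```r
# Hypothetical sale prices (illustrative only, not real Milford data).
prices <- c(150000, 180000, 212175, 250000, 400000)
with_gates <- c(prices, 100e6)   # add one $100 million purchase

mean_before   <- mean(prices)
mean_after    <- mean(with_gates)
median_before <- median(prices)
median_after  <- median(with_gates)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

In this toy sample the single extreme purchase moves the mean by tens of millions of dollars, while the median shifts only to the midpoint of two ordinary prices.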

SLIDE 24

“Statistically, if you lie with your head in the oven and your feet in the fridge, on average you will be comfortably warm.” – Anonymous

“Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates

“The average human has one breast and one testicle.” – humorist Des McHale

SLIDE 25

Comparing Mean and Median

The mean and median measure center in different ways, and both are useful.

  • The mean and median of a roughly symmetric distribution are close together.
  • If the distribution is exactly symmetric, the mean and median are exactly the same.
  • In a skewed distribution, the mean is usually farther out in the long tail than is the median.
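The built-in `trees` dataset gives a quick, informal check of these statements. This sketch is an added illustration; it assumes, from the summary statistics, that `Volume` is right-skewed and that `Height` is roughly symmetric.

```r
# Gap between mean and median for a right-skewed variable (Volume)
# and for a roughly symmetric one (Height).
vol_gap    <- mean(trees$Volume) - median(trees$Volume)
height_gap <- mean(trees$Height) - median(trees$Height)

print(c(volume = vol_gap, height = height_gap))
```

A clearly positive gap for `Volume` means its mean sits farther out in the long right tail than its median does.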

SLIDE 26

Measuring the Spread

SLIDE 27

Definition
The population variance is
\[ \sigma^2 \overset{\text{def}}{=} \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2 \]
and the population standard deviation is
\[ \sigma \overset{\text{def}}{=} \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2}. \]

However, one often has only a random sample to examine, not the entire population. With only a random sample, one cannot calculate the population mean, \(\mu\), so the best one can do is use the sample mean, \(\bar{x}\), instead.

Definition
The sample variance is
\[ s^2 \overset{\text{def}}{=} \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2 \]
and the sample standard deviation is
\[ s \overset{\text{def}}{=} \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2}. \]
Notice the use of \(n - 1\) instead of \(n\) for the sample variance and standard deviation.

SLIDE 28

Example
Suppose our random sample is 4.5, 3.7, 2.8, 5.3, 4.6. Then
\[ \bar{x} = \frac{1}{5} \left[ 4.5 + 3.7 + 2.8 + 5.3 + 4.6 \right] = 4.18 \]
\[ s^2 = \frac{1}{5-1} \left[ (4.5 - 4.18)^2 + (3.7 - 4.18)^2 + (2.8 - 4.18)^2 + (5.3 - 4.18)^2 + (4.6 - 4.18)^2 \right] = 0.917 \]
\[ s = \sqrt{0.917} = 0.9576012. \]

Using R:
> dat = c(4.5, 3.7, 2.8, 5.3, 4.6)
> mean(dat)
[1] 4.18
> var(dat)
[1] 0.917
> sd(dat)
[1] 0.9576012

SLIDE 29

Why \( s^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2 \) and not \( s^2 = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^2 \)?

Ans: \(s^2\) is unbiased: on average \(s^2\) will be \(\sigma^2\). In contrast, \( \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^2 \) is biased: it equals \( \frac{n-1}{n} s^2 \), so on average it will be \( \frac{n-1}{n} \sigma^2 \).

Note:

  1. \[ s^2 = \frac{n \sum_{j=1}^{n} x_j^2 - \left( \sum_{j=1}^{n} x_j \right)^2}{n(n-1)} \quad \text{and} \quad s = \sqrt{\frac{n \sum_{j=1}^{n} x_j^2 - \left( \sum_{j=1}^{n} x_j \right)^2}{n(n-1)}} \]
  2. \( s^2 = 0 \Leftrightarrow s = 0 \Rightarrow \) all of the random sample is the same.
  3. \(\bar{x}\) is an estimator of \(\mu\).
  4. \(s^2\) is an (unbiased) estimator of \(\sigma^2\).
  5. \(s\) is a (biased) estimator of \(\sigma\).
  6. \(s\) measures the amount the data is dispersed about the mean.
  7. \(s\) has the same units of measurement as the data.
  8. \(s\) is sensitive to the existence of outliers.
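The shortcut formula in note 1 can be checked numerically against R's `var()`. This sketch is an added illustration that reuses the five-value sample from the earlier worked example; the variable names are chosen here.

```r
dat <- c(4.5, 3.7, 2.8, 5.3, 4.6)   # sample from the earlier worked example
n <- length(dat)

# Shortcut formula from note 1.
s2_shortcut <- (n * sum(dat^2) - sum(dat)^2) / (n * (n - 1))

# Dividing by n instead of n - 1 gives exactly (n - 1) / n times s^2.
v_n <- sum((dat - mean(dat))^2) / n

print(c(shortcut = s2_shortcut, var = var(dat), divide_by_n = v_n))
```

The first two printed values agree, and the third is smaller by the factor (n − 1)/n.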

SLIDE 30

Definition (Coefficient of Variation, CV)
\[ \text{sample CV} \overset{\text{def}}{=} \frac{s}{\bar{x}} \times 100, \qquad \text{population CV} \overset{\text{def}}{=} \frac{\sigma}{\mu} \times 100. \]
CV standardizes standard deviation according to the mean.

Example
There are 250 dogs at a dog show that weigh an average of 25 pounds, with a standard deviation of 8 pounds. The 250 human owners had an average weight of 172 lbs with a standard deviation of 29 lbs. Do the humans or the dogs vary more in weight?

Sol: It depends on how the question is interpreted. The standard deviation of weight for the owners is 29 lbs, which is greater than the standard deviation for the dogs, which is 8 lbs. However, for humans CV = (29/172) × 100 ≈ 17, which is less than the CV for dogs, which is (8/25) × 100 = 32.
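The dog show comparison can be reproduced in a few lines. The helper name `cv` is hypothetical (it is not a base R function); it simply applies the definition above, standard deviation over mean times 100.

```r
# Hypothetical helper implementing the definition: CV = sd / mean * 100.
cv <- function(sd, mean) sd / mean * 100

cv_dogs   <- cv(8, 25)      # dogs: sd 8 lb, mean 25 lb
cv_humans <- cv(29, 172)    # owners: sd 29 lb, mean 172 lb

print(c(dogs = cv_dogs, humans = cv_humans))
```

Relative to their means, the dogs' weights vary about twice as much as the owners' weights, even though the owners have the larger standard deviation.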

SLIDE 31

Theorem (Chebyshev's Theorem)
For any \(K > 1\), at least \( 1 - \frac{1}{K^2} \) of data is within \(K\) standard deviations of the mean.

Example
  • at least 3/4 of data is within 2 standard deviations of the mean
  • at least 8/9 of data is within 3 standard deviations of the mean
  • at least 15/16 of data is within 4 standard deviations of the mean
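The bound can be compared with what actually happens in a dataset. The sketch below is an added illustration; it uses `trees$Volume` purely as a convenient example and the sample standard deviation `sd()` as the yardstick.

```r
K <- 2:4
bound <- 1 - 1 / K^2   # Chebyshev's guaranteed minimum proportion

# Observed proportion of tree volumes within K standard deviations of the mean.
x <- trees$Volume
observed <- sapply(K, function(k) mean(abs(x - mean(x)) <= k * sd(x)))

print(data.frame(K = K, bound = bound, observed = observed))
```

The observed proportions are never below the bound; Chebyshev's theorem is a worst-case guarantee, so real data usually beat it comfortably.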

SLIDE 32

Normal Distribution

SLIDE 33

Density Curves

A density curve is a curve that:

  • is always on or above the horizontal axis
  • has an area of exactly 1 underneath it

A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.
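Both properties can be checked numerically for a specific density curve. This added sketch uses the standard normal density, `dnorm`, as the example; the choice of the range from −1 to 1 is arbitrary.

```r
# Total area under the standard normal density curve (should be 1).
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Area above the range [-1, 1]: the proportion of observations in that range.
area_within_1 <- integrate(dnorm, lower = -1, upper = 1)$value

print(c(total = total_area, within_1 = area_within_1))
```

The second value agrees with `pnorm(1) - pnorm(-1)`, the same proportion computed from the cumulative distribution function.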

SLIDE 34

Advantage of \(s\) over \(s^2\): \(s\) uses the same unit of measurement as the data.

Definition (Standardized Scores (z-scores))
\[ \text{sample standardized score: } z \overset{\text{def}}{=} \frac{x - \bar{x}}{s}, \qquad \text{population standardized score: } z \overset{\text{def}}{=} \frac{x - \mu}{\sigma}. \]
Advantage of z-scores: unit independent.

Example (from Book)
  • z-score of Michael Jordan's height = 3.21
  • z-score of Rebecca Lobo's height = 4.96

Conclusion: Jordan is 3.21 standard deviations taller than the average male height. Lobo is 4.96 standard deviations taller than the average female height.

Note: negative z-scores ⇒ smaller than average.
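A short sketch (added here, using the five-value sample from the variance example rather than the basketball data, which the slides do not give) shows how sample z-scores are computed and why they are unit independent.

```r
dat <- c(4.5, 3.7, 2.8, 5.3, 4.6)

# Sample z-scores: distance from the mean in units of standard deviations.
z <- (dat - mean(dat)) / sd(dat)

# Rescaling the data (for example, a change of units) leaves z unchanged.
dat10 <- 10 * dat
z_rescaled <- (dat10 - mean(dat10)) / sd(dat10)

print(round(z, 3))
```

Base R's `scale(dat)` performs the same centering and scaling and returns the result as a one-column matrix.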

SLIDE 35

Order Statistics

SLIDE 36

Definition
  • percentile of x = percentage of data less than x.
  • Pk = the kth percentile.
  • Q1 = 1st quartile = P25 = border of bottom 25% and top 75% = median of all values ≤ overall median.
  • Q2 = 2nd quartile = P50 = median.
  • Q3 = 3rd quartile = P75 = border of top 25% and bottom 75% = median of all values ≥ overall median.
  • 5-number summary = min, Q1, Q2, Q3, max.
  • Interquartile Range = IQR = Q3 − Q1.

How to find Pk, the kth percentile, according to the book: Order \(x_1, x_2, \cdots, x_n\) in increasing order. Count \(L\) in to get Pk, where \(L\) is the smallest integer \(\geq \frac{k}{100} n\).

Example
A 5 year old is in the 90th percentile, weight-wise, for his age group.
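The book's counting rule translates directly into code. In this added sketch the function name `book_percentile` is hypothetical, and the guard for k = 0 is an addition made here so the function always returns a data value.

```r
# Hypothetical helper implementing the rule above: sort the data, then take
# the L-th value, where L is the smallest integer >= (k / 100) * n.
book_percentile <- function(x, k) {
  sorted <- sort(x)
  L <- ceiling(k / 100 * length(sorted))
  sorted[max(L, 1)]   # guard so that k = 0 returns the minimum
}

dat <- c(4.5, 3.7, 2.8, 5.3, 4.6)
q1 <- book_percentile(dat, 25)
q2 <- book_percentile(dat, 50)
q3 <- book_percentile(dat, 75)

print(c(Q1 = q1, Q2 = q2, Q3 = q3))
```

R's own `quantile()` interpolates between data values by default, so its results can differ from this counting rule on small samples.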

SLIDE 37

Choosing Measures of Center and Spread

We now have a choice between two descriptions for center and spread:

  • Mean and Standard Deviation
  • Median and Interquartile Range

The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers. Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.

NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!

SLIDE 38

R Commands:

Example
> mean(trees$Volume)
[1] 30.17097
> median(trees$Volume)
[1] 24.2
> summary(trees$Volume)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.20   19.40   24.20   30.17   37.30   77.00
> IQR(trees$Volume)
[1] 17.9
> var(trees$Volume)
[1] 270.2028
> sd(trees$Volume)
[1] 16.43785

SLIDE 39

Outliers

Definition (The 1.5 × IQR Rule for Outliers)
Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Outliers have effects on means and variances far greater than their numbers would suggest.

SLIDE 40

Example
The number of eggs laid by Farmer John’s chickens (hens) in September 2016 is given below.

  18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7.

Are there any outlier hens?

Solution: One figures out the quartiles from the ordered data:

  3, 5, 7, 9, 11, 13, 15, 16, 18, 23, 35
  Q1 = 7, M = 13, Q3 = 18

Since IQR = 18 − 7 = 11 and 35 − Q3 = 17 > 1.5 × IQR = 16.5, one identifies 35 as an outlier.
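The same check can be scripted. This added sketch locates the quartiles with the counting rule from the percentile definition (position L = smallest integer ≥ (k/100)·n) rather than with R's default `quantile()` method, so that it reproduces Q1 = 7 and Q3 = 18; the variable names are chosen here.

```r
eggs <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)
sorted <- sort(eggs)
n <- length(sorted)

# Quartiles located by counting in to position ceiling(k / 100 * n).
q1 <- sorted[ceiling(0.25 * n)]
q3 <- sorted[ceiling(0.75 * n)]
iqr <- q3 - q1

# Observations beyond these fences are flagged by the 1.5 x IQR rule.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- eggs[eggs < lower_fence | eggs > upper_fence]

print(outliers)
```

Only the hen that laid 35 eggs lies beyond a fence, in agreement with the solution above.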

SLIDE 41

Boxplots

Definition
Given \(x_1, x_2, \cdots, x_n\), to create a boxplot (also called a box and whiskers plot):

  1. draw and label a vertical number line that includes the range of the distribution.
  2. draw a box from height Q1 to Q3.
  3. draw a horizontal line inside the box at the height of the median.
  4. draw vertical line segments (whiskers) from the bottom and top of the box to the minimum and maximum data values that are not outliers.
  5. sometimes outliers are identified with ◦’s (R does this).

Boxplots are often useful when comparing the values of two different variables.
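The numbers a boxplot is drawn from can be inspected with `boxplot.stats()`. This added sketch applies it to the egg counts from the outlier example; note that R builds the box from "hinges", which can differ slightly from Q1 and Q3 as defined earlier.

```r
eggs <- c(18, 13, 3, 16, 9, 35, 5, 15, 23, 11, 7)

# boxplot.stats() returns the values used to draw the plot:
#   $stats = lower whisker end, lower hinge, median, upper hinge, upper whisker end
#   $out   = points flagged as outliers (more than 1.5 box lengths from the box)
b <- boxplot.stats(eggs)
print(b$stats)
print(b$out)

boxplot(eggs, main = "Eggs per hen")
```

The whiskers stop at the most extreme values that are not outliers, and the flagged point is drawn separately as a small circle.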

SLIDE 42

> boxplot(trees$Height, main = "Heights of Black Cherry Trees")
> boxplot(USJudgeRatings$DMNR, USJudgeRatings$DILG,
+         main = "Lawyers' Demeanor/Diligence ratings of US Superior Court state judges")

[Figure: boxplot titled "Heights of Black Cherry Trees"; vertical axis from 65 to 85]
[Figure: two side-by-side boxplots titled "Lawyers' Demeanor/Diligence ratings of US Superior Court state judges"; vertical axis from 5 to 9]

SLIDE 43

Chapter #1 R Assignment

SLIDE 44

Fifty-eight sailors are sampled and their eye color is noted as below:

  blue   brown   green   hazel   red
   11     32       8       5      2

  1. Create a barplot and pie chart of eye color from the sailor sample.
  2. Create a histogram and a stemplot of the height of loblolly trees from the dataset “Loblolly”. The dataset “Loblolly” comes with R, just as “trees” does. To observe “Loblolly”, type “Loblolly” at the R prompt (without the quotes). To learn more about the dataset, type “help(Loblolly)” at the R prompt.
  3. Find the mean, median, five number summary, variance and standard deviation from the sample of heights in the dataset “Loblolly”.
  4. Create a scatterplot of weight versus quarter mile times for the dataset “mtcars”. Assume that the independent variable is the quarter mile time and the dependent variable is the weight.