Descriptive Statistics Observed data are at the heart of every - - PowerPoint PPT Presentation

SLIDE 1

ST 370 Probability and Statistics for Engineers

Descriptive Statistics

Observed data are at the heart of every application of statistics. We need tools for working with and describing data. To quickly see the main features of a set of data, we need summaries: Numerical summaries: e.g. means, standard deviations; Graphical summaries: e.g. histograms, box-and-whisker plots.

SLIDE 2

Numerical Summaries of Data

The 8 measurements of pull-off force for 3/32 inch nylon connectors, in lbf, were 12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1. We could describe these by saying that they are around 13 lbf, generally plus or minus around 0.5 lbf. That is, we give a typical value and an indication of dispersion around that typical value.

SLIDE 3

Sample mean

For observed values $x_1, x_2, \ldots, x_n$, the most widely used typical value is the sample mean:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

The mean of the pull-off forces (n = 8) is 13.0 lbf.
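The sample mean of the pull-off forces can be checked in R. This is a minimal sketch; the vector name pulloff is chosen here for illustration and does not appear in the slides.

```r
# Pull-off forces (lbf) listed on Slide 2; `pulloff` is an illustrative name
pulloff <- c(12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1)

n <- length(pulloff)

# Directly from the definition: sum of the values divided by n
xbar_by_hand <- sum(pulloff) / n

# Built-in equivalent
xbar <- mean(pulloff)

print(c(by_hand = xbar_by_hand, builtin = xbar))
```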

SLIDE 4

Sample standard deviation

When the sample mean is used as the typical value, dispersion around it is almost always measured by the sample standard deviation $s = \sqrt{s^2}$, where $s^2$ is the sample variance:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

The standard deviation of the pull-off forces is 0.48 lbf.

Note

The sample variance is almost the average of the n values $(x_1 - \bar{x})^2, (x_2 - \bar{x})^2, \ldots, (x_n - \bar{x})^2$; it differs only in the divisor (n − 1), instead of n.
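As a rough check of the variance formula, the sketch below computes the sample variance from the definition and compares it with R's var() and sd(). The vector name pulloff is an illustrative choice, not something from the slides.

```r
# Pull-off forces (lbf) listed on Slide 2; `pulloff` is an illustrative name
pulloff <- c(12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1)

n <- length(pulloff)

# Sum of squared deviations from the mean, divided by (n - 1)
s2_by_hand <- sum((pulloff - mean(pulloff))^2) / (n - 1)

# Built-in sample variance and standard deviation (both use the n - 1 divisor)
s2 <- var(pulloff)
s <- sd(pulloff)

print(round(c(variance = s2, std_dev = s), 2))
```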

SLIDE 5

Sample median

An alternative typical value is the sample median: one half of the observations fall below the median, and one half fall above.

The ordered pull-off forces are 12.3, 12.6, 12.6, 12.9, 13.1, 13.4, 13.5, 13.6, so any value between 12.9 and 13.1 could be the median; by convention, we use the midpoint, which happens to be the same as the sample mean, 13.0 lbf. In general, the mean and the median will not be the same.
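The midpoint convention for an even number of observations can be sketched in R as follows; pulloff is again an illustrative name for the data from Slide 2.

```r
# Pull-off forces (lbf) listed on Slide 2; `pulloff` is an illustrative name
pulloff <- c(12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1)

ordered <- sort(pulloff)
n <- length(ordered)

# n is even here, so take the midpoint of the two middle order statistics
med_by_hand <- (ordered[n / 2] + ordered[n / 2 + 1]) / 2

# Built-in median follows the same midpoint convention
med <- median(pulloff)

print(c(by_hand = med_by_hand, builtin = med))
```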

SLIDE 6

Trimmed mean

In some areas, a trimmed mean is used: for some k < n/2, delete the k highest values and the k lowest values; the trimmed mean is the average of the remaining data.

Examples

In many sports involving a panel of judges, the highest and lowest scores are omitted (k = 1).

The LIBOR benchmark interest rate is found by averaging rates submitted by 18 banks, with the highest and lowest 4 submissions omitted (n = 18, k = 4).

If all but the middle one or two values are trimmed (k ≈ n/2), the average is the median.
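The trimming rule can be sketched as a small R function. The helper trimmed_mean() and the vector pulloff are hypothetical names introduced here for illustration; they are applied to the pull-off forces from Slide 2 rather than to judging scores or LIBOR submissions.

```r
# Hypothetical helper: drop the k lowest and k highest values, average the rest
trimmed_mean <- function(x, k) {
  stopifnot(k >= 0, 2 * k < length(x))
  ordered <- sort(x)
  kept <- ordered[(k + 1):(length(ordered) - k)]
  mean(kept)
}

# Pull-off forces (lbf) listed on Slide 2; `pulloff` is an illustrative name
pulloff <- c(12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1)

tm0 <- trimmed_mean(pulloff, k = 0)  # no trimming: the ordinary mean
tm1 <- trimmed_mean(pulloff, k = 1)  # drop one value from each end
tm3 <- trimmed_mean(pulloff, k = 3)  # only the middle two values remain

print(c(k0 = tm0, k1 = tm1, k3 = tm3))
```

With k = 3 only the two middle values are left, so the result coincides with the median. R's built-in mean() also accepts a trim argument, expressed as a fraction of observations trimmed from each end rather than as a count k.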

SLIDE 7

Quantiles and Percentiles

Recall that the median divides the data values in half: one half fall below, and one half fall above. More generally, for any 0 ≤ p ≤ 1, the pth quantile divides the data into a fraction p falling below the quantile and a fraction (1 − p) falling above the quantile. The pth quantile is also called the (100p)th percentile.

SLIDE 8

Quartiles

The most frequently used quantiles are the median (p = 0.5) and the quartiles: lower quartile (p = 0.25) and upper quartile (p = 0.75). The quartiles of the pull-off forces are 12.60 and 13.42 lbf.

Interquartile range

The difference between the upper and lower quartiles is another measure of the dispersion of the data values. It is called the interquartile range (IQR). For the pull-off forces, the IQR is 0.82 lbf.
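The quartiles and IQR of the pull-off forces can be computed in R as sketched below, using the default settings of quantile() and IQR(). The name pulloff is illustrative. Before rounding, the upper quartile comes out as 13.425 and the IQR as 0.825.

```r
# Pull-off forces (lbf) listed on Slide 2; `pulloff` is an illustrative name
pulloff <- c(12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1)

# Lower and upper quartiles with R's default quantile calculation
q <- quantile(pulloff, probs = c(0.25, 0.75))

# Interquartile range: upper quartile minus lower quartile
iqr_by_hand <- unname(q[2] - q[1])
iqr <- IQR(pulloff)

print(q)
print(iqr)
```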

SLIDE 9

Calculation of quantiles

For a given data set size n and a given fraction p, how to divide the values into a fraction p and the complementary fraction (1 − p) may not be obvious. Many different suggestions have been made for the precise calculation.

The R function quantile() offers the choice of nine types of calculation; the definition given by Montgomery and Runger appears to be type = 6; the default in quantile() is type = 7.
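To see how much the choice of type can matter on a small sample, the sketch below compares type = 6 and type = 7 on the pull-off forces from Slide 2 (pulloff is an illustrative name). For these data the two types agree on the lower quartile but differ on the upper quartile.

```r
# Pull-off forces (lbf) listed on Slide 2; `pulloff` is an illustrative name
pulloff <- c(12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1)

probs <- c(0.25, 0.75)

# type = 6: the calculation that appears to match Montgomery and Runger
q6 <- quantile(pulloff, probs = probs, type = 6)

# type = 7: the default in quantile()
q7 <- quantile(pulloff, probs = probs, type = 7)

print(rbind(type6 = q6, type7 = q7))
```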

SLIDE 10

Stem and Leaf

The stem-and-leaf plot is a device for sorting and binning a set of data values. It is a useful pencil-and-paper method, but irrelevant in computer-based analysis.

Example: Compressive strength of Al-Li alloy specimens

Compressive strength (psi) of 80 specimens of an aeronautical alloy.

alloy <- read.csv("Data/Table-06-02.csv")$Strength
stem(alloy, scale = 2)

The number of leaves on each stem is the count from which a histogram is constructed. The outline of the display is a (rotated) histogram. Because the leaves are ordered, the order statistics can be read off from the display.

SLIDE 11

Frequency Distributions and Histograms

The histogram is a display showing the frequency with which data values fall in various ranges. Example: Compressive strength of Al-Li alloy specimens

# alloy <- read.csv("Data/Table-06-02.csv")$Strength
hist(alloy)
# To match Figure 6-7, use some non-default options:
hist(alloy, breaks = seq(from = 70, to = 250, by = 20), right = FALSE, col = "wheat")

SLIDE 12

The height of each bar is its “Frequency”: the number of data values that fall in the corresponding “bin”.

Variations of histogram

Sometimes the height of the bar is the relative frequency: the fraction of data values that fall in the bin, instead of the number.

Sometimes the bins are of different widths; in that case, the height of the bar is usually chosen so that the area of the bar is the relative frequency. Then the total area of the histogram bars is 1.
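The relationship between counts, relative frequencies, and bar areas can be sketched with hist(..., plot = FALSE), which returns the bin counts and densities without drawing anything. The example uses the pull-off forces from Slide 2 under the illustrative name pulloff; the unequal break points are an assumption made here purely to show the area rule.

```r
# Pull-off forces (lbf) listed on Slide 2; `pulloff` is an illustrative name
pulloff <- c(12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1)

# Hypothetical unequal-width bins: [12, 13), [13, 13.5), [13.5, 14]
breaks <- c(12, 13, 13.5, 14)
h <- hist(pulloff, breaks = breaks, right = FALSE, plot = FALSE)

# Frequency and relative frequency of each bin
counts <- h$counts
rel_freq <- counts / sum(counts)

# Bar height chosen so that area = relative frequency
widths <- diff(h$breaks)
heights <- rel_freq / widths

# Total area of the bars
total_area <- sum(heights * widths)

print(data.frame(counts, rel_freq, widths, heights))
print(total_area)
```

The heights computed this way agree with the density component that hist() returns.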

SLIDE 13

Box Plots

The histogram shows the distribution of the data values in some detail. We often need a display that summarizes the data more succinctly. The box-and-whisker plot (or boxplot) shows principally: The extremes: lowest and highest values; The lower and upper quartiles; The median.

SLIDE 14

In R:

# alloy <- read.csv("Data/Table-06-02.csv")$Strength
boxplot(alloy)

The central box goes from the lower quartile to the upper quartile, and the median is shown by a line. Some of the more extreme values may be flagged as outliers, and are shown individually. The whiskers connect the box to the most extreme data point that is not flagged as a possible outlier.
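The numbers behind a boxplot can be inspected without drawing it, using boxplot.stats(). The sketch below applies it to the pull-off forces from Slide 2 (illustrative name pulloff), then appends one made-up reading of 16.0 lbf, which is not from the slides, to show how a point gets flagged. Note that boxplot.stats() uses hinges for the box ends, which can differ slightly from the quartiles returned by quantile().

```r
# Pull-off forces (lbf) listed on Slide 2; `pulloff` is an illustrative name
pulloff <- c(12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1)

# stats: lower whisker end, lower hinge, median, upper hinge, upper whisker end
# out:   points flagged as possible outliers
bs <- boxplot.stats(pulloff)
print(bs$stats)
print(bs$out)

# Hypothetical extra reading (NOT from the slides) to illustrate flagging
with_extra <- c(pulloff, 16.0)
bs_extra <- boxplot.stats(with_extra)
print(bs_extra$stats)  # upper whisker stops at the largest unflagged value
print(bs_extra$out)    # the made-up reading is flagged
```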

SLIDE 15

Comparative boxplots

A boxplot of a single set of data is a useful graphical tool for displaying the key characteristics of the data: The typical value, represented by the median; The dispersion, represented by the IQR (interquartile range), which is the length of the box; The extreme values, including some that may be highlighted as outliers.

Boxplots are much more valuable when comparing more than one set of data, such as the pull-off strengths of the two types of nylon connector.

SLIDE 16

Example: strength of paper

The percentage of hardwood fiber affects the tearing strength of paper. Six test sheets were prepared and tested for each of four levels of hardwood content.

In R:

paper <- read.csv("Data/Table-13-01.csv")
boxplot(Strength ~ Hardwood, paper)

The boxplots show: The typical strength increases progressively as the hardwood content increases; The dispersion of strength does not change greatly; No test sheets were out of line with the rest of their sample.
