SLIDE 1
SUMMARIZING DATA
Business Statistics
CONTENTS
▪ Data summaries ▪ Univariate summaries ▪ Bivariate summaries ▪ Statistical summaries ▪ Further study
SLIDE 2
SLIDE 3
Summarizing data = losing information. So why lose information? ▪ to see the essential features at a glance DATA SUMMARIES
SLIDE 4
Many options, depending on: ▪ nature of variables
▪ numerical vs. categorical ▪ numerical: discrete vs. continuous ▪ categorical: binary (dichotomous) or not
▪ number of variables
▪ univariate ▪ bivariate ▪ multivariate
▪ range of data/number of categories ▪ level of detail and precision ▪ audience DATA SUMMARIES
SLIDE 5
Summarizing means data reduction, so losing information ▪ sometimes a bit ▪ sometimes a lot DATA SUMMARIES
SLIDE 6
Often important first step: sorting subjects (rows)
▪ original: $x_1, x_2, \ldots, x_n$ ▪ sorted: $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, such that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$
DATA SUMMARIES
[Figure labels: $x_{12}$, $x_{(12)}$]
SLIDE 7
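Some definitions on ordered data vectors
▪ median: $M = Q_2 = p_{50} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & (n \text{ odd}) \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & (n \text{ even}) \end{cases}$
▪ first quartile: $Q_1 = p_{25} \approx x_{(0.25(n+1))}$ ▪ third quartile: $Q_3 = p_{75} \approx x_{(0.75(n+1))}$ ▪ quintiles, deciles, percentiles ▪ minimum: $x_{\min} = x_{(1)}$ ▪ maximum: $x_{\max} = x_{(n)}$
We use a simplified rule for the $k$th percentile: the position is $\frac{k}{100}(n+1)$
▪ precisely between two points: take the average of these two points ▪ otherwise: round to the nearest data point
DATA SUMMARIES
Why ≈? Consider the case $n = 4$.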
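The simplified percentile rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simple_percentile`, the example data, and the clamping of positions outside 1..n are not from the slides, and $k$ is assumed to be an integer. With $n = 4$ the first quartile position is $0.25 \cdot 5 = 1.25$, which is not a data point, so $Q_1$ can only be approximated by $x_{(1)}$.

```python
from fractions import Fraction
import math

def simple_percentile(data, k):
    """k-th percentile via the simplified rule: position k/100 * (n + 1)."""
    xs = sorted(data)
    n = len(xs)
    pos = Fraction(k * (n + 1), 100)            # exact arithmetic, integer k assumed
    # Assumption (not on the slide): clamp positions that fall outside 1..n.
    pos = min(max(pos, Fraction(1)), Fraction(n))
    lower = math.floor(pos)
    frac = pos - lower
    if frac == Fraction(1, 2):
        # Precisely between two points: average them.
        return (xs[lower - 1] + xs[lower]) / 2
    # Otherwise round to the nearest data point (positions are 1-based).
    nearest = lower if frac < Fraction(1, 2) else lower + 1
    return xs[nearest - 1]

sample = [2, 4, 6, 8]                           # hypothetical data with n = 4
for k in (25, 50, 75):
    print(k, simple_percentile(sample, k))
```

Statistical software usually interpolates between neighbouring points instead of rounding, so its quartiles may differ slightly from this rule.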
SLIDE 8
Given are the following data: 9,7,11,5,9,8. Find
- a. the mean
- b. the median
- c. the first quartile
EXERCISE 1
SLIDE 9
Two cases:
▪ numerical data ▪ categorical data
Two forms:
▪ graphical summaries ▪ statistical summaries
UNIVARIATE SUMMARIES
SLIDE 10
Important concepts – numerical data
▪ measures of centrality ▪ measures of dispersion ▪ measures of range ▪ measures of shape
Important concepts – categorical data
▪ measures of frequency ▪ proportion, odds (2 categories only)
UNIVARIATE SUMMARIES
SLIDE 11
Numerical data – graphical summaries
▪ “A picture is worth a thousand words”
UNIVARIATE SUMMARIES
Note our approximation of a “dot plot” (we misused a bivariate scatterplot, plotting the values against their index)
SLIDE 12
Numerical data – log-transforming variables
▪ to make highly skewed data less skewed ▪ to make patterns more visible ▪ to meet assumptions of inferential statistics
UNIVARIATE SUMMARIES
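A small sketch of the first point, using made-up right-skewed values. The data and the helper `mean_median_gap` are assumptions for illustration; the gap (mean minus median, divided by the standard deviation) is a crude asymmetry indicator chosen here for simplicity, not a measure taken from the slides.

```python
import math
import statistics

incomes = [1, 2, 2, 3, 4, 5, 8, 20, 100]      # hypothetical, strongly right-skewed
logged = [math.log(v) for v in incomes]       # natural log; requires positive data

def mean_median_gap(values):
    """(mean - median) / standard deviation: a crude indicator of asymmetry."""
    gap = statistics.mean(values) - statistics.median(values)
    return gap / statistics.stdev(values)

gap_raw = mean_median_gap(incomes)
gap_log = mean_median_gap(logged)
print(f"raw: {gap_raw:.2f}   log-transformed: {gap_log:.2f}")
```

For these values the gap shrinks after the transformation, which is the sense in which the logged data are "less skewed".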
SLIDE 13
Numerical data – statistical summaries
▪ center (central tendency): mean, median, etc. ▪ variability (dispersion): variance, standard deviation, etc. ▪ range: minimum, maximum, etc.
UNIVARIATE SUMMARIES
SLIDE 14
Numerical data – boxplots ▪ agreement on the “box” ($Q_1$, $Q_2$, and $Q_3$) ▪ but different conventions on the “whiskers”
▪ $x_{\min}$ and $x_{\max}$ ▪ $Q_1 - 1.5(Q_3 - Q_1)$ and $Q_3 + 1.5(Q_3 - Q_1)$ (next page) ▪ with outliers indicated by symbols (* and/or ⁰)
UNIVARIATE SUMMARIES
SLIDE 15
Numerical data – boxplots
▪ minimum, first quartile, median, third quartile, maximum ▪ individual outliers
UNIVARIATE SUMMARIES
[Figure: boxplot annotated with the IQR (the box), 1.5 IQR on either side of the box, mild outliers, and extreme outliers] IQR: see below. Whiskers are always at a data point.
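The whisker and outlier conventions can be made concrete with a short sketch. The data are hypothetical, and the split into mild outliers (between 1.5 and 3 IQR from the box) and extreme outliers (beyond 3 IQR) is the common Tukey convention, assumed here because the slide does not state the cut-off explicitly.

```python
import statistics

data = sorted([4, 10, 11, 12, 13, 14, 15, 16, 17, 28, 40])  # hypothetical values

# Quartiles at positions 0.25(n+1) and 0.75(n+1); the default 'exclusive'
# method interpolates, which coincides with the slide rule for n = 11.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # assumed cut-off for "extreme"

# Whiskers end at the most extreme data points still inside the inner fences.
inside = [v for v in data if inner_lo <= v <= inner_hi]
whiskers = (min(inside), max(inside))

mild = [v for v in data if outer_lo <= v < inner_lo or inner_hi < v <= outer_hi]
extreme = [v for v in data if v < outer_lo or v > outer_hi]

print("box:", q1, q3, "whiskers:", whiskers)
print("mild outliers:", mild, "extreme outliers:", extreme)
```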
SLIDE 16
Numerical data – histograms
▪ frequency distribution
UNIVARIATE SUMMARIES
Note the effect of bin size on the shape
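The effect of bin size can be seen without any plotting. The following sketch counts the same hypothetical values with two bin widths; the function `histogram_counts` and the data are illustrative assumptions.

```python
def histogram_counts(values, bin_width, start=0):
    """Count values per bin [left, left + bin_width), keyed by the left edge."""
    counts = {}
    for v in values:
        j = int((v - start) // bin_width)     # index of the bin containing v
        left = start + j * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

ages = [21, 22, 23, 25, 29, 31, 34, 35, 38, 41, 47, 52, 58, 63]  # hypothetical

for width in (5, 20):
    print(f"bin width {width}:", histogram_counts(ages, width))
```

With narrow bins the counts look irregular; with wide bins the same data collapse into a few bars and most of the detail disappears.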
SLIDE 17
Categorical data – graphical summaries
▪ less useful?
UNIVARIATE SUMMARIES
SLIDE 18
Categorical data – graphical summaries
▪ frequency distribution ▪ as a piechart ▪ as a barchart (histogram)
UNIVARIATE SUMMARIES
SLIDE 19
Categorical data – statistical summaries
▪ frequency distribution
Expressed in
▪ counts (frequencies) ▪ proportions ▪ percentages
UNIVARIATE SUMMARIES
Or preferably using value labels: 0 = non-member, 1 = member
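A frequency distribution in counts, proportions, and percentages might be computed as below. The coded observations are made up; only the value labels (0 = non-member, 1 = member) come from the slide. The odds are included because the variable has exactly two categories.

```python
from collections import Counter

labels = {0: "non-member", 1: "member"}          # value labels from the slide
membership = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]      # hypothetical coded data

counts = Counter(labels[code] for code in membership)
n = sum(counts.values())

for label, count in counts.items():
    print(f"{label}: count = {count}, proportion = {count / n:.2f}, "
          f"percentage = {100 * count / n:.0f}%")

# Odds of membership: members per non-member (two categories only).
odds_member = counts["member"] / counts["non-member"]
print("odds of membership:", odds_member)
```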
SLIDE 20
Numerical data – centrality ▪ mean
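▪ $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
▪ median
▪ sorting of $x$: $x_i \to x_{(i)}$ ▪ $M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & n \text{ even} \end{cases}$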
UNIVARIATE SUMMARIES
[Figure labels: mean, median]
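As a quick illustration of the two definitions, the sketch below uses Python's `statistics` module on hypothetical data, once with an even number of observations and once with a single large value added.

```python
import statistics

values = [3, 5, 6, 8, 9, 11]          # hypothetical data, n even
with_outlier = values + [100]         # one hypothetical outlier added, n odd

mean_a, median_a = statistics.mean(values), statistics.median(values)
mean_b, median_b = statistics.mean(with_outlier), statistics.median(with_outlier)

print("without outlier: mean =", mean_a, " median =", median_a)
print("with outlier:    mean =", round(mean_b, 2), " median =", median_b)
```

The single extra value moves the mean a long way, while the median barely changes.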
SLIDE 21
Numerical data – centrality (less used) ▪ mode
▪ most frequently occurring value ▪ not very useful for continuous data
▪ geometric mean
▪ only for positive data ▪ $\sqrt[n]{x_1 x_2 \cdots x_n} = e^{\frac{1}{n} \sum_{i=1}^{n} \ln x_i}$
▪ midrange
▪ $\frac{1}{2}\left(\max_i x_i + \min_i x_i\right) = \frac{1}{2}\left(x_{(1)} + x_{(n)}\right)$
▪ $k$% trimmed mean
▪ mean, skipping the highest $k$% and the lowest $k$%
UNIVARIATE SUMMARIES
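The less-used measures of centrality are easy to compute by hand. The sketch below is one possible reading of the definitions; in particular, rounding the number of trimmed observations down is an assumption, since the slide does not say how to handle a $k$% that is not a whole number of points. The data are hypothetical.

```python
import math
import statistics

def geometric_mean(xs):
    """exp of the mean of the logs; only defined for positive data."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def midrange(xs):
    return (min(xs) + max(xs)) / 2

def trimmed_mean(xs, k):
    """Mean after skipping the lowest and highest k% (count rounded down; k < 50)."""
    xs = sorted(xs)
    cut = int(len(xs) * k / 100)
    kept = xs[cut:len(xs) - cut]
    return sum(kept) / len(kept)

data = [1, 2, 2, 4, 8, 8, 8, 16, 32, 64]      # hypothetical positive data

print("mode:", statistics.mode(data))
print("geometric mean:", round(geometric_mean(data), 3))
print("midrange:", midrange(data))
print("10% trimmed mean:", trimmed_mean(data, 10))
```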
SLIDE 22
Numerical data – centrality UNIVARIATE SUMMARIES
Statistic | Properties | Example: life expectancy
Mean | much used, employs all information, a bit sensitive to outliers | 64.5 yr
Median | much used, discards some information, insensitive to outliers | 69 yr
Mode | not useful for continuous data, discards some information, sensitive to “binning” | 71.5 yr
Geometric mean | only positive data, more difficult to interpret, useful for index numbers and growth rates | 63.1 yr
Midrange | easy to calculate, discards a lot of information, very sensitive to outliers | 58.2 yr
Trimmed mean | discards some information, insensitive to outliers, depends on value of $k$ | 65.1 yr (with $k = 5$)
SLIDE 23
Numerical data – dispersion ▪ variance
▪ $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
▪ standard deviation
▪ $s = \sqrt{s^2}$
▪ interquartile range (width of box in boxplot)
▪ $IQR = Q_3 - Q_1$
▪ coefficient of variation
▪ $CV = \frac{s}{\bar{x}}$ (provided $\bar{x} \neq 0$)
UNIVARIATE SUMMARIES
[Figure labels: $s$, $IQR$]
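These four measures of dispersion map directly onto Python's `statistics` module, as in the sketch below. The data are hypothetical. Note that `statistics.quantiles` interpolates between data points, so for other sample sizes its quartiles can differ slightly from the simplified rounding rule used earlier in these slides.

```python
import statistics

data = [4, 8, 6, 5, 3, 9, 7]                  # hypothetical values

mean = statistics.mean(data)
variance = statistics.variance(data)          # divides by n - 1
sd = statistics.stdev(data)                   # square root of the variance
q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles at positions p * (n + 1)
iqr = q3 - q1
cv = sd / mean                                # only meaningful when mean != 0

print(f"s^2 = {variance:.3f}, s = {sd:.3f}, IQR = {iqr}, CV = {cv:.3f}")
```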
SLIDE 24
Numerical data – dispersion (less used) ▪ range
▪ $\max_i x_i - \min_i x_i = x_{(n)} - x_{(1)}$
▪ mean absolute deviation
▪ $\frac{1}{n} \sum_{i=1}^{n} \left| x_i - \bar{x} \right|$
UNIVARIATE SUMMARIES
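Both measures take only a line each; a minimal sketch with hypothetical data:

```python
data = [4, 8, 6, 5, 3, 9, 7]                  # hypothetical values
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)            # x_(n) - x_(1)
mad = sum(abs(x - mean) for x in data) / n    # mean absolute deviation from the mean

print("range:", data_range, " mean absolute deviation:", round(mad, 3))
```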
SLIDE 25
Numerical data – dispersion UNIVARIATE SUMMARIES
Statistic | Properties | Example: life expectancy
Variance | much used, square units, “additive” | 163.7 yr²
Standard deviation | much used | 12.8 yr
Interquartile range | discards some information, insensitive to outliers | 20.5 yr
Range | easy to calculate, discards a lot of information, very sensitive to outliers | 45.4 yr
Mean absolute deviation | easy to interpret | 10.8 yr
Coefficient of variation | much used, dimensionless, problematic for mean close to 0 | 0.20
SLIDE 26
Given are two data vectors, $\mathbf{x}$ and $\mathbf{y}$ [data shown on the slide]. Which has the larger coefficient of variation? EXERCISE 2
SLIDE 27
Numerical data – shape ▪ skewness
▪ a measure of asymmetry ▪ approximately $M_3 \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right)^3$
▪ mainly used as a benchmark for normality or symmetry
UNIVARIATE SUMMARIES
More complicated formula in book
[Figure panel: symmetric distribution]
SLIDE 28
Numerical data – shape ▪ kurtosis
▪ a measure of flatness ▪ approximately $M_4 \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right)^4 - 3$ ▪ mainly used as a benchmark for normality ▪ also known as excess kurtosis ▪ $M_4 = 0$ for “normal” data
UNIVARIATE SUMMARIES
platykurtic (platypus): $M_4 < 0$; leptokurtic (leaping kangaroos): $M_4 > 0$. More complicated formula in book
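The approximate shape measures $M_3$ (skewness) and $M_4$ (excess kurtosis) can be sketched as standardized moments. This follows the approximate formulas on the slides, not the more complicated textbook versions; the function names and the two small data sets are illustrative assumptions.

```python
import statistics

def standardized_moment(xs, power):
    """Average of ((x - mean) / s) ** power, with s the sample standard deviation."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)
    return sum(((x - m) / s) ** power for x in xs) / len(xs)

def skewness(xs):
    return standardized_moment(xs, 3)          # approximate M3

def excess_kurtosis(xs):
    return standardized_moment(xs, 4) - 3      # approximate M4

symmetric = [1, 2, 3, 4, 5, 6, 7]              # hypothetical, evenly spread
right_skewed = [1, 1, 2, 2, 3, 4, 15]          # hypothetical, long right tail

print("skewness:", round(skewness(symmetric), 3), round(skewness(right_skewed), 3))
print("excess kurtosis (symmetric):", round(excess_kurtosis(symmetric), 3))
```

The evenly spread data have skewness of essentially zero and a negative excess kurtosis (platykurtic), while the long right tail gives a positive skewness.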
SLIDE 29
▪ Several data types → different options:
▪ two numerical data vectors ▪ data vectors of “similar” type ▪ data vectors of “different” type ▪ one numerical and one categorical data vector ▪ two categorical data vectors
▪ Numerical summaries and graphical summaries BIVARIATE SUMMARIES
SLIDE 30
▪ Two numerical data vectors of “similar” type
▪ treat as one numerical difference vector
BIVARIATE SUMMARIES
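Treating two paired vectors as one difference vector reduces the problem to a univariate summary. A minimal sketch, assuming hypothetical before/after measurements on the same five subjects:

```python
import statistics

before = [72, 80, 65, 90, 77]     # hypothetical first measurement
after = [75, 78, 70, 94, 80]      # hypothetical second measurement, same subjects

# One numerical difference vector, element by element.
diff = [a - b for a, b in zip(after, before)]

print("differences:", diff)
print("mean difference:", statistics.mean(diff))
print("standard deviation of differences:", round(statistics.stdev(diff), 3))
```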
SLIDE 31
▪ Two numerical data vectors of “similar” type
▪ correlation analysis ▪ scatterplot
BIVARIATE SUMMARIES
SLIDE 32
▪ Two numerical data vectors of “different” type
▪ correlation analysis ▪ scatterplot
BIVARIATE SUMMARIES
SLIDE 33
▪ One numerical and one categorical data vector
▪ split numerical data vector into several data vectors
BIVARIATE SUMMARIES
Note: we actually have two (or more) groups/populations
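Splitting a numerical vector by a categorical one and summarizing each group separately could look like this. The group codes and values are made up for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical (group code, numerical value) pairs.
records = [(0, 10), (1, 14), (0, 12), (1, 20), (0, 11), (1, 17)]

groups = defaultdict(list)
for group, value in records:
    groups[group].append(value)    # one numerical data vector per category

summary = {g: (statistics.mean(v), statistics.stdev(v)) for g, v in groups.items()}

for g, (mean, sd) in sorted(summary.items()):
    print(f"group {g}: n = {len(groups[g])}, mean = {mean}, sd = {sd:.2f}")
```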
SLIDE 34
▪ Two categorical data vectors
▪ cross tables (contingency tables) ▪ cells contain “counts” (frequencies)
BIVARIATE SUMMARIES
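A cross table is just a count of every combination of the two categories. The sketch below builds one with `collections.Counter`; the variables and their values are hypothetical.

```python
from collections import Counter

# Hypothetical paired categorical observations (one pair per subject).
gender = ["f", "m", "f", "f", "m", "m", "f", "m"]
member = ["yes", "no", "yes", "no", "no", "yes", "yes", "no"]

table = Counter(zip(gender, member))           # cell counts keyed by (row, column)
rows, cols = sorted(set(gender)), sorted(set(member))

print("      " + "  ".join(f"{c:>4}" for c in cols))
for r in rows:
    print(f"{r:>4}  " + "  ".join(f"{table[(r, c)]:>4}" for c in cols))
```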
SLIDE 35
Consider [figure shown on the slide]. Which group (GATT = 0 or GATT = 1) has the higher mean, and which group has the higher standard deviation? EXERCISE 3
SLIDE 36
▪ Many choices
▪ centrality ▪ dispersion ▪ association
▪ What to use depends on
▪ the nature of the problem ▪ the nature of the data ▪ the audience to address
STATISTICAL SUMMARIES
SLIDE 37
▪ Remember that a summary summarizes ... ▪ ... and that data reduction reduces the amount of information ▪ Example: Anscombe’s quartet:
▪ $n_X = 11$, $\bar{X} = 9$, $s_X^2 = 11$ ▪ $n_Y = 11$, $\bar{Y} = 7.50$, $s_Y^2 = 4.12$ ▪ $r_{X,Y} = 0.816$ ▪ OLS regression: $Y = 3.00 + 0.500X$
STATISTICAL SUMMARIES
These summaries are too drastic!
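The point can be checked numerically. The sketch below uses the data values of Anscombe's published quartet and computes, for each of the four data sets, the mean and variance of X and Y, the correlation, and the OLS line. The helper `summarize` is an illustrative name, not part of the course material.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared X of data sets I to III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return mean/variance of x and y, correlation, OLS slope and intercept."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    r = cov / (var_x * var_y) ** 0.5
    slope = cov / var_x
    intercept = my - slope * mx
    return mx, var_x, my, var_y, r, slope, intercept

for name, (x, y) in quartet.items():
    print(name, [round(v, 2) for v in summarize(x, y)])
```

All four data sets give nearly the same numbers, yet their scatterplots look completely different, which is why a plot should accompany the summary.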
SLIDE 38
Almost all jokes on statistics and statisticians are based on this
▪ but there is some truth in it
STATISTICAL SUMMARIES
SLIDE 39