Business Statistics CONTENTS Data summaries Univariate summaries - - PowerPoint PPT Presentation



slide-1
SLIDE 1

SUMMARIZING DATA

Business Statistics

slide-2
SLIDE 2

▪ Data summaries
▪ Univariate summaries
▪ Bivariate summaries
▪ Statistical summaries
▪ Further study

CONTENTS

slide-3
SLIDE 3

Summarizing data = losing information. So why lose information?
▪ to see the essential features at a glance

DATA SUMMARIES

slide-4
SLIDE 4

Many options, depending on:
▪ nature of variables
  ▪ numerical vs. categorical
  ▪ numerical: discrete vs. continuous
  ▪ categorical: binary (dichotomous) or not
▪ number of variables
  ▪ univariate
  ▪ bivariate
  ▪ multivariate
▪ range of data/number of categories
▪ level of detail and precision
▪ audience

DATA SUMMARIES

slide-5
SLIDE 5

Summarizing means data reduction, so losing information
▪ sometimes a bit
▪ sometimes a lot

DATA SUMMARIES

slide-6
SLIDE 6

Often important first step: sorting subjects (rows)

▪ original: $x_1, x_2, \ldots, x_n$
▪ sorted: $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, such that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$

DATA SUMMARIES

Figure labels: $x_{12}$ (12th value in the original order) and $x_{(12)}$ (12th value in the sorted order)

slide-7
SLIDE 7

Some definitions on ordered data vectors
▪ median: $M = Q_2 = p_{50} = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & n \text{ even} \end{cases}$
▪ first quartile: $Q_1 = p_{25} \approx x_{(0.25(n+1))}$
▪ third quartile: $Q_3 = p_{75} \approx x_{(0.75(n+1))}$
▪ quintiles, deciles, percentiles
▪ minimum: $x_{\min} = x_{(1)}$
▪ maximum: $x_{\max} = x_{(n)}$

We use a simplified rule for the $k$th percentile: its position is $\frac{k}{100}(n+1)$
▪ precisely between two points: take the average of these two points
▪ otherwise round to the nearest data point

DATA SUMMARIES

Why ≈? Consider the case $n = 4$.
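The simplified percentile rule can be sketched in a few lines of Python. This is only an illustration: the function name `percentile_simplified`, the clamping of positions that fall outside the data, and the sample values are assumptions, not part of the slides.

```python
import math

def percentile_simplified(data, k):
    """k-th percentile via the simplified rule: position = k/100 * (n + 1)."""
    xs = sorted(data)
    n = len(xs)
    pos = k / 100 * (n + 1)        # 1-based position in the sorted data
    pos = min(max(pos, 1), n)      # assumption: clamp positions outside 1..n
    lower = math.floor(pos)
    frac = pos - lower
    if math.isclose(frac, 0.5):    # precisely between two points: average them
        return (xs[lower - 1] + xs[lower]) / 2
    nearest = lower + 1 if frac > 0.5 else lower  # otherwise: nearest data point
    return xs[nearest - 1]

# Hypothetical data with n = 4: the Q1 position is 0.25 * 5 = 1.25,
# which is not a data point, hence the "approximately" sign.
sample = [8, 2, 6, 4]
q1 = percentile_simplified(sample, 25)
med = percentile_simplified(sample, 50)
q3 = percentile_simplified(sample, 75)
```

With $n = 4$ the quartile positions are 1.25, 2.5 and 3.75, so only the median falls exactly halfway between two points; the quartiles are rounded to a neighbouring data point.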

slide-8
SLIDE 8

Given are the following data: 9,7,11,5,9,8. Find

  • a. the mean
  • b. the median
  • c. the first quartile

EXERCISE 1

slide-9
SLIDE 9

Two cases:

▪ numerical data ▪ categorical data

Two forms:

▪ graphical summaries ▪ statistical summaries

UNIVARIATE SUMMARIES

slide-10
SLIDE 10

Important concepts – numerical data

▪ measures of centrality ▪ measures of dispersion ▪ measures of range ▪ measures of shape

Important concepts – categorical data

▪ measures of frequency ▪ proportion, odds (2 categories only)

UNIVARIATE SUMMARIES

slide-11
SLIDE 11

Numerical data – graphical summaries

▪ “A picture is worth a thousand words”

UNIVARIATE SUMMARIES

index Note our approximation of a “dot plot” (we misued bivariate scatterplot)

slide-12
SLIDE 12

Numerical data – log-transforming variables

▪ to make highly skewed data less skewed ▪ to make patterns more visible ▪ to meet assumptions of inferential statistics

UNIVARIATE SUMMARIES
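A small sketch of what a log transform does to skewed data. The values are hypothetical and chosen to span several orders of magnitude; the base-10 logarithm is an arbitrary choice here.

```python
import math
from statistics import mean, median

# Hypothetical, highly right-skewed data (not from the slides).
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(x) for x in raw]   # requires strictly positive values

# Gap between mean and median as a rough indicator of skew.
raw_gap = mean(raw) - median(raw)        # large: the mean is pulled up by the tail
log_gap = mean(logged) - median(logged)  # close to zero: roughly symmetric
```

On the raw scale the mean lies far above the median; after the transform the two essentially coincide.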

index index

slide-13
SLIDE 13

Numerical data – statistical summaries

▪ center (central tendency): mean, median, etc. ▪ variability (dispersion): variance, standard deviation, etc. ▪ range: minimum, maximum, etc.

UNIVARIATE SUMMARIES

slide-14
SLIDE 14

Numerical data – boxplots
▪ agreement on the “box” ($Q_1$, $Q_2$, and $Q_3$)
▪ but different conventions on the “whiskers”
  ▪ $x_{\min}$ and $x_{\max}$
  ▪ $Q_1 - 1.5(Q_3 - Q_1)$ and $Q_3 + 1.5(Q_3 - Q_1)$ (next page)
  ▪ with outliers indicated by symbols (* and/or ⁰)

UNIVARIATE SUMMARIES

slide-15
SLIDE 15

Numerical data – boxplots

▪ minimum, first quartile, median, third quartile, maximum ▪ individual outliers

UNIVARIATE SUMMARIES

Figure labels: IQR, 1.5 IQR, 1.5 IQR, mild outliers, extreme outliers. IQR: see below. Whiskers are always at a data point.
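The whisker and outlier logic can be sketched as follows. Two assumptions are made that the slides do not confirm: quartiles come from Python's `statistics.quantiles` (which interpolates at position $p(n+1)$ instead of rounding to a data point), and "extreme" means more than 3 IQR beyond the box, a common convention. The function name and data are hypothetical.

```python
from statistics import quantiles

def boxplot_summary(data):
    """Box, whiskers and outliers under the 1.5*IQR whisker convention."""
    xs = sorted(data)
    q1, q2, q3 = quantiles(xs, n=4)   # assumption: interpolating quartile rule
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # assumed "extreme" cutoff
    inside = [x for x in xs if inner_lo <= x <= inner_hi]
    return {
        "box": (q1, q2, q3),
        # Whiskers end at the most extreme data points inside the inner fences.
        "whiskers": (inside[0], inside[-1]),
        "mild": [x for x in xs
                 if outer_lo <= x < inner_lo or inner_hi < x <= outer_hi],
        "extreme": [x for x in xs if x < outer_lo or x > outer_hi],
    }

# Hypothetical data with one mild and one extreme outlier.
summary = boxplot_summary([10, 12, 13, 14, 15, 16, 17, 18, 20, 35, 60])
```

Note that the upper whisker stops at a data point, not at the fence value itself.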

slide-16
SLIDE 16

Numerical data – histograms

▪ frequency distribution

UNIVARIATE SUMMARIES

Note the effect of bin size on the shape
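To see how bin size changes the shape, the same data can be counted with two bin widths. A minimal sketch; the helper `histogram_counts` and the data are hypothetical, and the helper assumes all values are at or above `start`.

```python
import math
from collections import Counter

def histogram_counts(data, width, start=0):
    """Counts per bin [start + b*width, start + (b+1)*width), b = 0, 1, ..."""
    bins = Counter(math.floor((x - start) / width) for x in data)
    return [bins.get(b, 0) for b in range(max(bins) + 1)]

# Hypothetical data with two clusters.
values = [1, 2, 2, 3, 7, 8, 8, 9]
wide = histogram_counts(values, width=5)    # two equal bars: looks flat
narrow = histogram_counts(values, width=2)  # gap in the middle: two clusters visible
```

Wide bins hide the gap between the clusters; narrow bins reveal it, at the cost of a more ragged picture.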
slide-17
SLIDE 17

Categorical data – graphical summaries

▪ less useful?

UNIVARIATE SUMMARIES


slide-18
SLIDE 18

Categorical data – graphical summaries

▪ frequency distribution
  ▪ as a pie chart
  ▪ as a bar chart (histogram)

UNIVARIATE SUMMARIES

slide-19
SLIDE 19

Categorical data – statistical summaries

▪ frequency distribution

Expressed in

▪ counts (frequencies) ▪ proportions ▪ percentages

UNIVARIATE SUMMARIES

Or preferably using value labels: 0 = non-member, 1 = member
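A frequency distribution with value labels might be computed as below. The coded observations are made up for illustration; only the labels 0 = non-member and 1 = member come from the slide.

```python
from collections import Counter

value_labels = {0: "non-member", 1: "member"}
codes = [1, 0, 0, 1, 1, 0, 1, 1]   # hypothetical coded observations

counts = Counter(value_labels[c] for c in codes)            # frequencies
n = sum(counts.values())
proportions = {label: c / n for label, c in counts.items()}
percentages = {label: 100 * p for label, p in proportions.items()}

# Odds are defined for two categories only: members per non-member.
odds_member = counts["member"] / counts["non-member"]
```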

slide-20
SLIDE 20

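Numerical data – centrality
▪ mean
  ▪ $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
▪ median
  ▪ sorting of $x$: $x_i \to x_{(i)}$
  ▪ $M = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & n \text{ even} \end{cases}$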

UNIVARIATE SUMMARIES
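The two definitions translate directly into code. A minimal sketch with hypothetical data; in practice `statistics.mean` and `statistics.median` from the standard library do the same job.

```python
def mean(xs):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # x_((n+1)/2), shifted to 0-based indexing
    return (s[mid - 1] + s[mid]) / 2   # average of x_(n/2) and x_(n/2+1)

odd_sample = [3, 1, 4, 1, 5]    # hypothetical, n odd
even_sample = [3, 1, 4, 1]      # hypothetical, n even
```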


slide-21
SLIDE 21

Numerical data – centrality (less used)
▪ mode
  ▪ most frequently occurring value
  ▪ not very useful for continuous data
▪ geometric mean
  ▪ only for positive data
  ▪ $\sqrt[n]{x_1 x_2 \cdots x_n} = e^{\frac{1}{n}\sum_{i=1}^{n} \ln x_i}$
▪ midrange
  ▪ $\frac{1}{2}\left(\max_i x_i + \min_i x_i\right) = \frac{1}{2}\left(x_{(1)} + x_{(n)}\right)$
▪ $k$% trimmed mean
  ▪ mean, skipping the highest $k$% and the lowest $k$%

UNIVARIATE SUMMARIES
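The less-used measures can be sketched as below. The function names and data are hypothetical. One assumption is flagged in the code: when $k$% of $n$ is not a whole number, the number of trimmed values is rounded down, which the slides do not specify.

```python
import math
from statistics import mode

def geometric_mean(xs):
    """exp of the mean log; defined for strictly positive data only."""
    if any(x <= 0 for x in xs):
        raise ValueError("geometric mean requires positive data")
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def midrange(xs):
    """Halfway between the minimum and the maximum."""
    return (min(xs) + max(xs)) / 2

def trimmed_mean(xs, k):
    """Mean after dropping the lowest k% and the highest k% (k < 50)."""
    s = sorted(xs)
    drop = int(len(s) * k / 100)   # assumption: round down
    kept = s[drop:len(s) - drop]
    return sum(kept) / len(kept)

sample = [1, 2, 2, 4, 8]           # hypothetical data
most_common = mode(sample)         # most frequently occurring value
```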

slide-22
SLIDE 22

Numerical data – centrality

Statistic | Properties | Example: life expectancy
Mean | much used, employs all information, a bit sensitive to outliers | 64.5 yr
Median | much used, discards some information, insensitive to outliers | 69 yr
Mode | not useful for continuous data, discards some information, sensitive to “binning” | 71.5 yr
Geometric mean | only positive data, more difficult to interpret, useful for index numbers and growth rates | 63.1 yr
Midrange | easy to calculate, discards a lot of information, very sensitive to outliers | 58.2 yr
Trimmed mean | discards some information, insensitive to outliers, depends on value of $k$ | 65.1 yr (with $k = 5$)

UNIVARIATE SUMMARIES

slide-23
SLIDE 23

Numerical data – dispersion
▪ variance
  ▪ $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
▪ standard deviation
  ▪ $s = \sqrt{s^2}$
▪ interquartile range (width of box in boxplot)
  ▪ $IQR = Q_3 - Q_1$
▪ coefficient of variation
  ▪ $CV = \frac{s}{\bar{x}}$ (provided $\bar{x} \neq 0$)

UNIVARIATE SUMMARIES
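A sketch of the four dispersion measures. The function names and data are hypothetical, and the quartiles are taken from `statistics.quantiles`, whose interpolation differs slightly from the simplified rounding rule used earlier in these slides.

```python
import math
from statistics import quantiles

def variance(xs):
    """Sample variance with divisor n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    return math.sqrt(variance(xs))

def iqr(xs):
    """Interquartile range Q3 - Q1 (assumption: interpolating quartiles)."""
    q1, _, q3 = quantiles(xs, n=4)
    return q3 - q1

def coeff_of_variation(xs):
    """Standard deviation relative to the mean; undefined for a zero mean."""
    m = sum(xs) / len(xs)
    if m == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return std_dev(xs) / m

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean 5
```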


slide-24
SLIDE 24

Numerical data – dispersion (less used)
▪ range
  ▪ $\max_i x_i - \min_i x_i = x_{(n)} - x_{(1)}$
▪ mean absolute deviation
  ▪ $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$

UNIVARIATE SUMMARIES
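Both measures are one-liners in code. A minimal sketch with hypothetical function names and data:

```python
def data_range(xs):
    """Largest minus smallest value."""
    return max(xs) - min(xs)

def mean_absolute_deviation(xs):
    """Average absolute distance to the mean (divisor n)."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean 5
```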

slide-25
SLIDE 25

Numerical data – dispersion

Statistic | Properties | Example: life expectancy
Variance | much used, square units, “additive” | 163.7 yr²
Standard deviation | much used | 12.8 yr
Interquartile range | discards some information, insensitive to outliers | 20.5 yr
Range | easy to calculate, discards a lot of information, very sensitive to outliers | 45.4 yr
Mean absolute deviation | easy to interpret | 10.8 yr
Coefficient of variation | much used, dimensionless, problematic for mean close to 0 | 0.20

UNIVARIATE SUMMARIES

slide-26
SLIDE 26

Given are two data vectors, $\mathbf{x}$ and $\mathbf{y}$ (data shown on the slide). Which has a larger coefficient of variation?

EXERCISE 2

slide-27
SLIDE 27

Numerical data – shape
▪ skewness
  ▪ a measure of asymmetry
  ▪ approximately $M_3 \approx \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)^3$
  ▪ mainly used as a benchmark for normality or symmetry

UNIVARIATE SUMMARIES

More complicated formula in book
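The approximate skewness formula can be sketched as follows. This implements only the approximation on this slide, not the more complicated textbook formula; the function name and data are hypothetical.

```python
import math

def skewness_approx(xs):
    """Average cubed z-score, with s the sample standard deviation (divisor n - 1)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return sum(((x - m) / s) ** 3 for x in xs) / n

symmetric = [1, 2, 3, 4, 5]       # hypothetical: skewness near 0
right_tail = [1, 1, 1, 1, 10]     # hypothetical: long right tail, positive
left_tail = [1, 10, 10, 10, 10]   # hypothetical: long left tail, negative
```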


slide-28
SLIDE 28

Numerical data – shape
▪ kurtosis
  ▪ a measure of flatness
  ▪ approximately $M_4 \approx \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)^4 - 3$
  ▪ mainly used as a benchmark for normality
  ▪ also known as excess kurtosis
  ▪ $M_4 = 0$ for “normal” data

UNIVARIATE SUMMARIES

platykurtic (platypus): $M_4 < 0$
leptokurtic (leaping kangaroos): $M_4 > 0$
More complicated formula in book
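A sketch of the approximate excess kurtosis, again only the simplified formula from this slide. The function name and both data sets are hypothetical.

```python
import math

def excess_kurtosis_approx(xs):
    """Average fourth-power z-score minus 3, with s the sample standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return sum(((x - m) / s) ** 4 for x in xs) / n - 3

flat = [1, 2, 3, 4, 5]                        # evenly spread: platykurtic
peaked = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]    # peak with heavy tails: leptokurtic
```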

slide-29
SLIDE 29

▪ Several data types → different options:
  ▪ two numerical data vectors
    ▪ data vectors of “similar” type
    ▪ data vectors of “different” type
  ▪ one numerical and one categorical data vector
  ▪ two categorical data vectors
▪ Numerical summaries and graphical summaries

BIVARIATE SUMMARIES

slide-30
SLIDE 30

▪ Two numerical data vectors of “similar” type

▪ treat as one numerical difference vector

BIVARIATE SUMMARIES
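When two numerical vectors measure the same thing on the same subjects, the pairwise differences reduce the problem to a univariate one. A sketch with made-up before/after scores:

```python
from statistics import mean, stdev

# Hypothetical paired measurements on the same five subjects.
before = [72, 80, 65, 90, 78]
after = [75, 82, 64, 95, 80]

# One difference per subject; summarize it like any single numerical vector.
diff = [a - b for a, b in zip(after, before)]
mean_diff = mean(diff)
sd_diff = stdev(diff)
```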

slide-31
SLIDE 31

▪ Two numerical data vectors of “similar” type

▪ correlation analysis ▪ scatterplot

BIVARIATE SUMMARIES

slide-32
SLIDE 32

▪ Two numerical data vectors of “different” type

▪ correlation analysis ▪ scatterplot

BIVARIATE SUMMARIES
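The slides do not give a formula for the correlation. The sketch below assumes the usual Pearson coefficient, the sample covariance divided by the product of the two sample standard deviations; the function name and data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sxy / (sx * sy)

# Hypothetical data: exact increasing and decreasing linear relations.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

The coefficient is dimensionless, so it works equally well when the two vectors are measured in different units.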

slide-33
SLIDE 33

▪ One numerical and one categorical data vector

▪ split numerical data vector into several data vectors

BIVARIATE SUMMARIES

Note: we actually have two (or more) groups/populations
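Splitting a numerical vector by a categorical one and summarizing each group might look like this. The helper name and data are hypothetical, and each group needs at least two observations for the standard deviation.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_group(values, groups):
    """Split values by group label and summarize each part separately."""
    split = defaultdict(list)
    for v, g in zip(values, groups):
        split[g].append(v)
    return {g: {"n": len(vs), "mean": mean(vs), "sd": stdev(vs)}
            for g, vs in split.items()}

# Hypothetical numerical variable with a binary grouping variable.
values = [10, 12, 14, 20, 30]
groups = [0, 0, 0, 1, 1]
by_group = summarize_by_group(values, groups)
```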

slide-34
SLIDE 34

▪ Two categorical data vectors

▪ cross tables (contingency tables) ▪ cells contain “counts” (frequencies)

BIVARIATE SUMMARIES
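A cross table is a count of each combination of categories. A sketch with hypothetical variables (membership status and region) and a hypothetical helper name:

```python
from collections import Counter

def cross_table(rows, cols):
    """Nested dict of counts: table[row_level][col_level]."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    return {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}

# Hypothetical categorical data for six subjects.
member = [1, 0, 1, 1, 0, 0]
region = ["N", "N", "S", "S", "S", "N"]
table = cross_table(member, region)
```

Combinations that never occur appear as a count of 0, so the table is always rectangular.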

slide-35
SLIDE 35

Consider the data shown on the slide. Which group (GATT=0 or GATT=1) has a higher mean, and which group has a higher standard deviation?

EXERCISE 3

slide-36
SLIDE 36

▪ Many choices

▪ centrality ▪ dispersion ▪ association

▪ What to use depends on

▪ the nature of the problem ▪ the nature of the data ▪ the audience to address

STATISTICAL SUMMARIES

slide-37
SLIDE 37

▪ Remember that a summary summarizes ...
▪ ... and that data reduction reduces the amount of information
▪ Example: Anscombe’s quartet:
  ▪ $n_X = 11$, $\bar{X} = 9$, $s_X^2 = 11$
  ▪ $n_Y = 11$, $\bar{Y} = 7.50$, $s_Y^2 = 4.12$
  ▪ $r_{X,Y} = 0.816$
  ▪ OLS regression: $Y = 3.00 + 0.500X$

STATISTICAL SUMMARIES

These summaries are too drastic!
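The point can be checked numerically. The sketch below uses the published values of Anscombe's quartet (they are not reproduced on the slide) and computes the same summaries for each of the four data sets: sample variances of about 11 and 4.12, a correlation of about 0.816, and the same regression line. The helper name `summarize` is hypothetical.

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Means, sample variances, correlation and OLS line for one data set."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    slope = cov / var_x
    return {"n": n, "mean_x": mx, "mean_y": my, "var_x": var_x, "var_y": var_y,
            "r": cov / math.sqrt(var_x * var_y),
            "intercept": my - slope * mx, "slope": slope}

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
```

All four summaries agree to about two decimals, yet scatterplots of the four sets look completely different, which is why a plot should accompany the numbers.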

slide-38
SLIDE 38

Almost all jokes on statistics and statisticians are based on this

▪ but there is some truth in it

STATISTICAL SUMMARIES

slide-39
SLIDE 39

Doane & Seward 5/E: 3.1-3.9, 4.1-4.3, 4.5-4.6, 4.8
Tutorial exercises week 1: data summaries, box plots and histograms

FURTHER STUDY