SLIDE 1
DATA
Business Statistics
CONTENTS
▪ The role of data ▪ The data matrix ▪ Data types ▪ Aspects of data ▪ Obtaining data ▪ Further study
SLIDE 2
SLIDE 3
Data refers to observed facts
▪ “there are 82 persons on this train” ▪ “the weight of this pizza is 283 grams” ▪ “this museum hosts paintings by Picasso”
Data helps
▪ to suggest theories (“pizzas with a high price are less popular”) ▪ to test hypotheses (“advertising increases sales”) ▪ to calibrate coefficients of theories (“𝑞 = 𝑎 − 𝑏𝑝, but what are 𝑎 and 𝑏?”)
THE ROLE OF DATA
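As a rough illustration of calibrating the coefficients of a linear theory from data, the sketch below fits an intercept and a slope by ordinary least squares. The price and sales figures are invented, and the function name `fit_line` is hypothetical; it is one possible approach, not a prescribed method.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Invented observations: price per pizza and number of pizzas sold.
prices = [6.0, 7.0, 8.0, 9.0, 10.0]
sold = [88.0, 79.0, 72.0, 61.0, 50.0]

intercept, slope = fit_line(prices, sold)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

A negative fitted slope would be in line with the theory that higher-priced pizzas sell less; the data here were made up to show that pattern.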
SLIDE 4
Columns: variables (may have an identifying name like “age”)
Rows: subjects/cases (may have an identifying name like “John”)
Cells: observations
Entire table: data matrix
THE DATA MATRIX
[Figure: data matrix with an observation, a subject/case, and a variable marked]
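One way to picture the data matrix in code is as a list of rows, each row holding one value per variable. This is a minimal sketch in plain Python; the names and values are invented, and using `None` for a missing observation is an assumption of this example.

```python
# Each row is one subject/case; each key is one variable.
# Names and values are invented for illustration.
data_matrix = [
    {"name": "John", "age": 34, "pet": "dog"},
    {"name": "Mary", "age": 28, "pet": "cat"},
    {"name": "Ahmed", "age": 45, "pet": None},  # None marks a missing observation
]

variables = list(data_matrix[0].keys())     # the columns (variable names)
ages = [row["age"] for row in data_matrix]  # one column: every observation of "age"
john = data_matrix[0]                       # one row: every observation for one subject
cell = data_matrix[1]["pet"]                # one cell: a single observation

print(variables)
print(ages)
print(cell)
```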
SLIDE 5
THE DATA MATRIX
[Figure: annotated data matrix with labels for variable name, variable unit, subject name, numerical data, nominal data, ordinal data, binary data, and a missing observation]
SLIDE 6
Information to extract from a data matrix ▪ One variable
▪ mean age at inauguration ▪ odds of Republicans vs. Democrats ▪ univariate analysis
▪ Two variables
▪ association between handedness and party ▪ correlation between age and number of terms ▪ bivariate analysis
▪ Many variables
▪ predict number of terms as a function of height and handedness ▪ multivariate analysis
THE DATA MATRIX
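The difference between univariate and bivariate summaries can be sketched as below. All values are invented for five hypothetical subjects (they are not real presidential data), and the helper `pearson` is a hand-written illustration rather than a library function.

```python
from collections import Counter
from math import sqrt
from statistics import mean

# Invented values for five hypothetical subjects.
age = [46, 51, 55, 61, 64]
terms = [2, 1, 2, 1, 1]
party = ["R", "D", "D", "R", "D"]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Univariate: each summary uses a single variable.
mean_age = mean(age)
counts = Counter(party)
odds_r_vs_d = counts["R"] / counts["D"]

# Bivariate: the summary relates two variables.
r_age_terms = pearson(age, terms)

print(mean_age, odds_r_vs_d, round(r_age_terms, 3))
```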
SLIDE 7
The data matrix can represent: ▪ all data (the population)
▪ a list of all US presidents
▪ a non-random selection of data
▪ a list of all US presidents since 1969
▪ a random selection of data (a sample)
▪ a subset of randomly picked presidents from the full list
▪ descriptive statistics is applicable to all three cases ▪ inferential statistics focuses on how to draw conclusions about a population on the basis of information from a random sample
THE DATA MATRIX
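The three kinds of data set (population, non-random selection, random sample) can be mimicked with a small simulation. This is only a sketch: the population of body heights is generated artificially, and the seed, size, and distribution parameters are arbitrary assumptions.

```python
import random
from statistics import mean

# Invented population of 1,000 body heights in cm (simulated, not real data).
rng = random.Random(42)  # fixed seed so that repeated runs give the same draw
population = [rng.gauss(175, 8) for _ in range(1000)]

non_random = population[:50]         # non-random selection: chosen by position, not by chance
sample = rng.sample(population, 50)  # random selection without replacement: a sample

# Descriptive statistics can be computed for all three data sets.
for label, data in [("population", population),
                    ("non-random selection", non_random),
                    ("random sample", sample)]:
    print(f"{label}: n = {len(data)}, mean = {mean(data):.1f}")
```

Only the random sample supports the usual inferential reasoning about the population mean; the other two means are descriptive summaries of the data at hand.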
SLIDE 8
You find data on the body size of 5 men and 5 women.
Organize these data in a data matrix.
EXERCISE 1
SLIDE 9
▪ Type of data
▪ categorical, numerical
▪ Countability
▪ discrete, continuous
▪ Range
▪ restricted, infinite, semi-infinite
▪ Coded
▪ numbers for text
▪ Recoded
▪ text for ranges of numbers (or ranges of texts)
ASPECTS OF DATA
SLIDE 10
Type of data ▪ categorical
▪ e.g., dog, cat, horse
▪ numerical (cardinal)
▪ e.g., 12, 45.29
Has consequences for:
▪ transformations (income per capita vs. car type per capita) ▪ statistical summaries (average income vs. average car type)
Special cases
▪ Likert scale (5- or 7-point scale: “strongly agree”, “somewhat agree”, etc.) ▪ binary variable (0/1, yes/no, Dutch/foreign)
ASPECTS OF DATA
SLIDE 11
Countability ▪ discrete
▪ e.g., eggs
▪ (semi-)continuous
▪ e.g., waiting time
Has consequences for:
▪ recoding (“binning”) ▪ statistical summaries (modal income vs. median income)
ASPECTS OF DATA
SLIDE 12
Range ▪ (semi-)infinite
▪ e.g., income
▪ restricted
▪ e.g., percentage of satisfied customers
Has consequences for:
▪ dealing with outliers (exceptional data points)
ASPECTS OF DATA
SLIDE 13
Coding ▪ replacing nominal categories by numbers
▪ e.g., Ford=1, Audi=2, Volkswagen=3, Opel=4
▪ replacing ordinal categories by numbers
▪ e.g., tiny=1, small=2, normal=3, big=4, huge=5
Has consequences for:
▪ preventing recording mistakes (e.g., Vlokswgaen) ▪ preparing for statistical calculations (SPSS, Stata, R, etc.)
ASPECTS OF DATA
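Coding can be sketched as a simple lookup table. The codes below are the ones from the slide; the list of observations and the function name `encode` are invented for illustration. Rejecting unknown spellings is one possible way to catch recording mistakes at entry time.

```python
# Codes taken from the slide's example; the observations are invented.
BRAND_CODES = {"Ford": 1, "Audi": 2, "Volkswagen": 3, "Opel": 4}

def encode(brand):
    """Return the numeric code for a brand, rejecting unknown spellings."""
    try:
        return BRAND_CODES[brand]
    except KeyError:
        raise ValueError(f"unknown brand: {brand!r}") from None

observed = ["Audi", "Ford", "Volkswagen", "Audi"]
coded = [encode(b) for b in observed]
print(coded)  # [2, 1, 3, 2]

# A misspelling such as "Vlokswgaen" is not in the table, so encode() raises ValueError.
```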
SLIDE 14
Recoding ▪ grouping categorical data
▪ e.g., “Volkswagen”+“Audi”+“Opel”=“German car”
▪ grouping numerical data
▪ e.g., 𝑥 ∈ [20,000, 25,000] = “middle income”
Has consequences for:
▪ statistical summaries (histograms, modal values)
ASPECTS OF DATA
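Recoding can be sketched as a function that maps each value to a coarser label. Only the “middle income” range and the “German car” grouping come from the slide; the “low income”, “high income”, and “other car” labels, the treatment of the range boundaries, and the sample values are assumptions made for this example.

```python
from collections import Counter

def income_group(income):
    """Recode a numeric income into a text label (binning)."""
    if income < 20_000:
        return "low income"      # assumed label
    if income <= 25_000:
        return "middle income"   # range from the slide, both ends included here
    return "high income"         # assumed label

GERMAN_BRANDS = {"Volkswagen", "Audi", "Opel"}  # grouping from the slide

def car_origin(brand):
    """Recode a brand into a coarser category."""
    return "German car" if brand in GERMAN_BRANDS else "other car"

incomes = [18_500, 21_000, 24_999, 25_000, 31_200]  # invented values
groups = [income_group(x) for x in incomes]

# Counting the recoded labels gives the bar heights of a histogram and the modal group.
print(Counter(groups))
print(car_origin("Audi"), "/", car_origin("Ford"))
```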
SLIDE 15
Coding of categories into numbers
ASPECTS OF DATA
SLIDE 16
Coding of categories into several binary variables
▪ using dummy variables (or dummies for short) ▪ 𝑛_dummies = 𝑛_categories (redundant!) ▪ 𝑛_dummies = 𝑛_categories − 1 (with omitted category)
ASPECTS OF DATA
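Dummy coding, with and without an omitted category, can be sketched as follows. The function name `make_dummies` and the observations are hypothetical; the brand list reuses the car brands from the coding example.

```python
def make_dummies(values, categories, omit=None):
    """Turn one categorical variable into 0/1 dummy variables.

    With omit=None there is one dummy per category (redundant: each row sums to 1).
    With omit set, that category is the reference category and gets no dummy.
    """
    kept = [c for c in categories if c != omit]
    return [{c: int(v == c) for c in kept} for v in values]

categories = ["Ford", "Audi", "Volkswagen", "Opel"]
cars = ["Audi", "Opel", "Ford"]  # invented observations

full = make_dummies(cars, categories)                  # 4 dummies for 4 categories
reduced = make_dummies(cars, categories, omit="Ford")  # 3 dummies, "Ford" omitted

for row in reduced:
    print(row)
```

In the reduced coding, a row of all zeros identifies the omitted category, so no information is lost by dropping one dummy.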
SLIDE 17
Some pitfalls: ▪ missing data
▪ blank? 0? 99?
▪ treating coded categories or number-like categories as numbers
▪ e.g., if Volkswagen=1, Audi=2, BMW=3, is the average car in this street 1.92?
▪ units of data
▪ see Math course
▪ decimals
▪ see Math course
ASPECTS OF DATA
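Two of these pitfalls are easy to reproduce. The sketch below assumes that 99 was used as a code for a missing age and that three car brands were coded as 1, 2, 3; all values are invented.

```python
from statistics import mean

# Invented ages; 99 is used here as a hypothetical "missing" code.
MISSING = 99
raw_ages = [23, 31, 99, 27, 99]

naive_mean = mean(raw_ages)                    # wrongly treats 99 as a real age
valid = [a for a in raw_ages if a != MISSING]  # keep observed values only
clean_mean = mean(valid)

# Coded categories: the arithmetic runs, but the result has no meaning.
car_codes = [1, 2, 3, 1, 2]  # codes standing for three brands
meaningless = mean(car_codes)

print(naive_mean, clean_mean, meaningless)
```

Software will not warn about either mistake, which is why the meaning of each code has to be documented alongside the data.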
SLIDE 18
Describe the appropriate data characteristics (categorical, ordinal, nominal, numerical, continuous, discrete, dummy, etc.) for
a. body size (171, 184, etc.)
b. pet (cat, dog, rabbit)
c. right-handedness (0, 1)
d. income group (low, medium, high)
e. number of children (0, 1, 2, etc.)
EXERCISE 2
SLIDE 19
Typing
▪ from books, etc.
Downloading
▪ from online databases (like CBS) ▪ from general webpages (like Wikipedia)
OBTAINING DATA
SLIDE 20
Purchasing
▪ commercial databases
OBTAINING DATA
SLIDE 21
Generating
▪ from secondary sources
▪ combining multiple sources
▪ by primary research
▪ doing interviews ▪ doing observations ▪ doing experiments
OBTAINING DATA
SLIDE 22