SLIDE 1
CONTINGENCY TABLES: TESTS
Business Statistics
SLIDE 2
CONTENTS
▪ Contingency tables
▪ Independence of categorical variables
▪ 2 × 2-tables
▪ Old exam question
▪ Further study
SLIDE 3
Contingency table (see “Summarizing data”):
▪ matrix with counts
▪ rows represent levels of categorical variable $X$ (= 1, 2)
▪ columns represent levels of categorical variable $Y$ (= 1, 2, 3)
▪ “margins” contain totals
CONTINGENCY TABLES
SLIDE 4
Simple example: election poll (sample)
▪ three cities
▪ four parties
▪ count data
▪ is political preference the same in these cities?
CONTINGENCY TABLES
SLIDE 5
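Notation for contingency table
▪ count in cell $(j, k)$: $n_{jk}$
▪ total in row $j$: $n_{j\cdot} = \sum_{k=1}^{c} n_{jk}$
▪ total in column $k$: $n_{\cdot k} = \sum_{j=1}^{r} n_{jk}$
▪ “total total”: $n_{\cdot\cdot} = \sum_{k=1}^{c} \sum_{j=1}^{r} n_{jk}$

        1               2               3               tot
1       $n_{11}$        $n_{12}$        $n_{13}$        $n_{1\cdot}$
2       $n_{21}$        $n_{22}$        $n_{23}$        $n_{2\cdot}$
tot     $n_{\cdot 1}$   $n_{\cdot 2}$   $n_{\cdot 3}$   $n_{\cdot\cdot}$

CONTINGENCY TABLES
there are $r$ rows and $c$ columns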
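As a minimal sketch of this notation, the row totals, column totals and “total total” can be computed directly from a matrix of counts. The counts below are hypothetical, chosen only for illustration, and the variable names are not from the slides.

```python
# Hypothetical 2 x 3 table of counts: 2 rows, 3 columns (not from the slides).
counts = [
    [20, 30, 50],
    [30, 20, 50],
]

row_totals = [sum(row) for row in counts]        # one total per row
col_totals = [sum(col) for col in zip(*counts)]  # one total per column
grand_total = sum(row_totals)                    # the "total total"

# Adding up the row totals or the column totals gives the same grand total.
assert grand_total == sum(col_totals)

print(row_totals, col_totals, grand_total)
```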
SLIDE 6
What does it mean when categorical variable $X$ is independent of categorical variable $Y$?
▪ knowledge of $x_i$ doesn’t help you to predict $y_i$
▪ $P(Y = k \mid X = j) = P(Y = k)$
▪ $P(X = j \cap Y = k) = P(X = j)\,P(Y = k)$
▪ (for all $j$ and $k$)
Can we calculate a statistic for independence?
INDEPENDENCE OF CATEGORICAL VARIABLES
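The product rule can be checked cell by cell on a joint probability table. The sketch below uses two hypothetical joint distributions with the same margins (both are assumptions for illustration): one built as a product of the margins, and one that is not. The helper name `is_independent` is also hypothetical.

```python
# Hypothetical marginal distributions (assumptions for illustration only).
p_x = [0.4, 0.6]        # probabilities of the two row levels
p_y = [0.2, 0.3, 0.5]   # probabilities of the three column levels

# Independent by construction: every joint probability is a product of margins.
independent_joint = [[px * py for py in p_y] for px in p_x]

# A dependent joint distribution with the same margins.
dependent_joint = [
    [0.20, 0.15, 0.05],
    [0.00, 0.15, 0.45],
]


def is_independent(joint, tol=1e-9):
    """Return True if every cell equals (row margin) x (column margin)."""
    row_probs = [sum(row) for row in joint]
    col_probs = [sum(col) for col in zip(*joint)]
    return all(
        abs(joint[j][k] - row_probs[j] * col_probs[k]) <= tol
        for j in range(len(row_probs))
        for k in range(len(col_probs))
    )


print(is_independent(independent_joint))
print(is_independent(dependent_joint))
```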
SLIDE 7
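Again

        1               2               3               tot
1       $n_{11}$        $n_{12}$        $n_{13}$        $n_{1\cdot}$
2       $n_{21}$        $n_{22}$        $n_{23}$        $n_{2\cdot}$
tot     $n_{\cdot 1}$   $n_{\cdot 2}$   $n_{\cdot 3}$   $n_{\cdot\cdot}$

For the totals we have:
▪ for row $j$: $n_{j\cdot} = n_{j1} + n_{j2} + n_{j3} = \sum_{k=1}^{c} n_{jk}$
▪ for column $k$: $n_{\cdot k} = n_{1k} + n_{2k} = \sum_{j=1}^{r} n_{jk}$
▪ for entire table: $n_{\cdot\cdot} = \sum_{j=1}^{r} n_{j\cdot} = \sum_{k=1}^{c} n_{\cdot k}$
INDEPENDENCE OF CATEGORICAL VARIABLES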
SLIDE 8
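▪ There are $n_{j\cdot}$ (out of $n_{\cdot\cdot}$) cases in row $j$
▪ and $n_{\cdot k}$ (out of $n_{\cdot\cdot}$) cases in column $k$
▪ A fraction $\frac{n_{j\cdot}}{n_{\cdot\cdot}}$ of the cases is in row $j$
▪ and a fraction $\frac{n_{\cdot k}}{n_{\cdot\cdot}}$ of the cases is in column $k$
▪ So, if there is no dependence between row $j$ and column $k$, we expect to have a fraction $\frac{n_{j\cdot}}{n_{\cdot\cdot}} \times \frac{n_{\cdot k}}{n_{\cdot\cdot}}$ of the cases in cell $(j, k)$
▪ which gives an expected count of $\frac{n_{j\cdot}}{n_{\cdot\cdot}} \times \frac{n_{\cdot k}}{n_{\cdot\cdot}} \times n_{\cdot\cdot} = \frac{n_{j\cdot} \times n_{\cdot k}}{n_{\cdot\cdot}}$
$$e_{jk} = \frac{n_{j\cdot} \times n_{\cdot k}}{n_{\cdot\cdot}}$$
INDEPENDENCE OF CATEGORICAL VARIABLES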
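The expected-count rule (row total times column total, divided by the grand total) can be sketched as a small function. The counts are hypothetical and the function name `expected_counts` is an assumption for illustration.

```python
def expected_counts(counts):
    """Expected count per cell under independence:
    (row total) x (column total) / (grand total)."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical 2 x 3 table of observed counts (not from the slides).
counts = [
    [20, 30, 50],
    [30, 20, 50],
]

expected = expected_counts(counts)
for row in expected:
    print(row)
```

Note that the expected counts keep the same row and column totals as the observed table; only the distribution over the cells changes.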
SLIDE 9
Given is the following contingency table:
If the two variables (party preference and state) were independent, what would be the expected count for Democrat/Utah?
EXERCISE 1
SLIDE 10
▪ Compare expected count ($e_{jk}$) in cell $(j, k)$ to observed count ($n_{jk}$)
▪ Discrepancy between observed count and expected count under the null hypothesis of independence: $n_{jk} - e_{jk}$
▪ Can we aggregate this over all cells ($j = 1, \ldots, r$ and $k = 1, \ldots, c$)?
▪ Yes, but (again!) positive and negative deviations would cancel each other (summed over the whole table they add up to exactly zero)
▪ so we aggregate the squared discrepancy $(n_{jk} - e_{jk})^2$ over all cells
INDEPENDENCE OF CATEGORICAL VARIABLES
SLIDE 11
Still one thing to do:
▪ a discrepancy of 5 when 8 is expected is much worse than a discrepancy of 5 when 1000 is expected
▪ so we “standardize” by the expected count: $\frac{(n_{jk} - e_{jk})^2}{e_{jk}}$
Therefore, we measure the overall discrepancy between expected and observed frequencies as:
$$\sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(n_{jk} - e_{jk})^2}{e_{jk}}$$
INDEPENDENCE OF CATEGORICAL VARIABLES
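A minimal sketch of this overall discrepancy measure, again on hypothetical counts. The first two lines reproduce the slide's point about a discrepancy of 5 against an expected count of 8 versus 1000; the function name `chi_square_statistic` is an assumption for illustration.

```python
# The same discrepancy of 5 weighs very differently after standardizing.
small_expected = 5 ** 2 / 8      # 3.125
large_expected = 5 ** 2 / 1000   # 0.025


def chi_square_statistic(counts):
    """Sum over all cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    total = 0.0
    for j, row in enumerate(counts):
        for k, observed in enumerate(row):
            expected = row_totals[j] * col_totals[k] / grand_total
            total += (observed - expected) ** 2 / expected
    return total


# Hypothetical 2 x 3 table of observed counts (not from the slides).
counts = [
    [20, 30, 50],
    [30, 20, 50],
]
chi2_calc = chi_square_statistic(counts)
print(chi2_calc)
```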
SLIDE 12
It can be shown that under $H_0$ (independence of $X$ (rows) and $Y$ (columns)):
$$\sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(n_{jk} - e_{jk})^2}{e_{jk}} \sim \chi^2_{(r-1)(c-1)}$$
provided that all $e_{jk} \geq 5$
▪ Therefore, we call our test value $\chi^2_{\text{calc}}$
▪ Reject $H_0$ at $\alpha$ when $\chi^2_{\text{calc}} > \chi^2_{(r-1)(c-1);\alpha}$
▪ for large values of $\chi^2_{\text{calc}}$ only
INDEPENDENCE OF CATEGORICAL VARIABLES
1-tailed, but 2-sided ...
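The decision rule can be sketched as follows, assuming a significance level of 0.05 and a small lookup of upper-tail critical values copied from a standard chi-square table. The counts, the lookup dictionary and the function name are assumptions for illustration, not part of the slides.

```python
# Upper-tail critical values of the chi-square distribution at alpha = 0.05,
# taken from a standard chi-square table (only a few df are included).
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def independence_test(counts):
    """Return (chi2_calc, df, chi2_crit, reject_h0) at alpha = 0.05."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Requirement for the chi-square approximation: all expected counts >= 5.
    if min(min(row) for row in expected) < 5:
        raise ValueError("some expected counts are below 5")

    chi2_calc = sum(
        (counts[j][k] - expected[j][k]) ** 2 / expected[j][k]
        for j in range(len(row_totals))
        for k in range(len(col_totals))
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    chi2_crit = CRITICAL_VALUES_05[df]
    # Reject only for large values of the test statistic (upper tail).
    return chi2_calc, df, chi2_crit, chi2_calc > chi2_crit


# Hypothetical 2 x 3 table of observed counts (not from the slides).
result = independence_test([[20, 30, 50], [30, 20, 50]])
print(result)
```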
SLIDE 13
Example:
Calculations:
▪ $n_{\text{GATT,Christ}} = 55$, etc.
▪ $e_{\text{GATT,Christ}} = \frac{119 \times 75}{153} = 58.3$, etc.
▪ $\chi^2_{\text{calc}} = \frac{(55 - 58.3)^2}{58.3} + \cdots = 1.88$ ($2 \times 3 = 6$ terms)
▪ $\chi^2_{\text{crit;upper}} = \chi^2_{2;0.05} = 5.991$
▪ do not reject $H_0$ and conclude that there is no evidence of dependence between religion and GATT membership
INDEPENDENCE OF CATEGORICAL VARIABLES
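The arithmetic for the one cell shown on the slide can be checked as below. Only the GATT/Christ cell is reproduced, because the other five observed counts are not given in the text; the value 1.88 is taken from the slide as reported. The critical value uses the fact that, for 2 degrees of freedom, the upper-tail probability of the chi-square distribution is exp(-x/2). Variable names are assumptions.

```python
import math

observed = 55                   # observed count for GATT/Christ (from the slide)
margin_a, margin_b = 119, 75    # the two margin totals for that cell (from the slide)
grand_total = 153               # grand total (from the slide)

expected = margin_a * margin_b / grand_total
first_term = (observed - expected) ** 2 / expected

# For df = 2: P(chi2 > x) = exp(-x / 2), so the critical value is -2 * ln(alpha).
alpha = 0.05
chi2_crit = -2 * math.log(alpha)

chi2_calc = 1.88                # sum of all six terms, as reported on the slide
reject_h0 = chi2_calc > chi2_crit

print(round(expected, 1), round(first_term, 2), round(chi2_crit, 3), reject_h0)
```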
SLIDE 14
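Observed and (under $H_0$) expected counts
$\chi^2$ calculations and standardized residuals
▪ cell contributions: $\frac{(n_{jk} - e_{jk})^2}{e_{jk}}$
▪ standardized residuals: $\frac{n_{jk} - e_{jk}}{\sqrt{e_{jk}}}$
▪ sum of the cell contributions: $\chi^2_{\text{calc}}$
INDEPENDENCE OF CATEGORICAL VARIABLES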
SLIDE 15
Find the critical value (at $\alpha = 5\%$) of the appropriate distribution for a contingency table of 3 rows and 4 columns (without the totals).
EXERCISE 2
SLIDE 16
▪ Step 1:
  ▪ $H_0$: GATT membership and religion are independent; $H_1$: GATT membership and religion are dependent; $\alpha = 0.05$
▪ Step 2:
  ▪ sample statistic $\chi^2 = \sum \frac{(n_{\text{obs}} - n_{\text{exp}})^2}{n_{\text{exp}}}$; reject for large values
▪ Step 3:
  ▪ under $H_0$: $\chi^2 \sim \chi^2_{df=2}$
  ▪ requirement: all expected counts ≥ 5
▪ Step 4:
  ▪ $\chi^2_{\text{calc}} = 1.88$; $\chi^2_{\text{crit}} = \chi^2_{2;0.05} = 5.991$
▪ Step 5:
  ▪ do not reject $H_0$ at $\alpha = 0.05$ and conclude that ...
INDEPENDENCE OF CATEGORICAL VARIABLES
SLIDE 17
Suppose there are only two rows and two columns
▪ e.g., GATT/no GATT and Christ/no Christ
▪ or male/female and right-handed/left-handed
We can still use contingency tables to check for independence
But there is a more versatile way:
▪ test for two proportions (see: comparing two $\pi$s)
2 × 2-TABLES
SLIDE 18
Using a contingency table
▪ $\chi^2 = 0.7576$
▪ $\chi^2_{\text{crit}} = \chi^2_{1;0.05} = 3.841$
▪ $p$-value $= 0.384$
▪ independence not rejected
2 × 2-TABLES
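These numbers can be reproduced with a short sketch, assuming the 2 × 2 table is the gender by handedness table implied by the two-proportions calculations (12 of 120 females and 24 of 180 males left-handed). The p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(sqrt(x/2)).

```python
import math

# Rows: female, male. Columns: left-handed, right-handed.
# Assumed from the counts 12/120 and 24/180 used in the two-proportions test.
counts = [
    [12, 108],
    [24, 156],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

chi2_calc = 0.0
for j, row in enumerate(counts):
    for k, observed in enumerate(row):
        expected = row_totals[j] * col_totals[k] / grand_total
        chi2_calc += (observed - expected) ** 2 / expected

# Upper-tail probability of a chi-square distribution with df = 1.
p_value = math.erfc(math.sqrt(chi2_calc / 2))

print(round(chi2_calc, 4), round(p_value, 3))
```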
SLIDE 19
2 × 2-TABLES
SLIDE 20
▪ Approach 1: (see also next lecture)
▪ group 1 = female; group 2 = male
▪ “success” = left-handed
▪ $H_0: \pi_1 = \pi_2$
▪ $p_1 = \frac{12}{120} = 0.10$; $p_2 = \frac{24}{180} = 0.13$
▪ pooled proportion: $\bar{p} = \frac{12 + 24}{120 + 180} = 0.12$
▪ $z_{\text{calc}} = \frac{0.10 - 0.13}{\sqrt{0.12\,(1 - 0.12)\left(\frac{1}{120} + \frac{1}{180}\right)}} = -0.87$
▪ $z_{\text{crit}} = z_{0.025} = -1.96$
▪ $p$-value $= 2 \times 0.192 = 0.384$
▪ there is no indication that the proportion of left-handed persons depends on gender
2 × 2-TABLES
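The steps of Approach 1 can be sketched with the Python standard library. The sketch keeps the proportions unrounded, which is what yields -0.87 (the rounded values 0.10 and 0.13 would give a slightly different result). Variable names are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 12, 120   # left-handed females, all females
x2, n2 = 24, 180   # left-handed males, all males

p1 = x1 / n1
p2 = x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)

# Standard error under H0 (equal proportions), using the pooled proportion.
standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z_calc = (p1 - p2) / standard_error

# Two-sided p-value: twice the tail area beyond |z_calc|.
p_value = 2 * NormalDist().cdf(-abs(z_calc))

print(round(z_calc, 2), round(p_value, 3))
```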
SLIDE 21
▪ Approach 2: (see also next lecture)
▪ group 1 = left-handed; group 2 = right-handed
▪ “success” = female
▪ $H_0: \pi_1 = \pi_2$
▪ $p_1 = \frac{12}{36} = 0.33$; $p_2 = \frac{108}{264} = 0.41$
▪ pooled proportion: $\bar{p} = \frac{12 + 108}{36 + 264} = 0.40$
▪ $z_{\text{calc}} = \frac{0.33 - 0.41}{\sqrt{0.40\,(1 - 0.40)\left(\frac{1}{36} + \frac{1}{264}\right)}} = -0.87$
▪ $z_{\text{crit}} = z_{0.025} = -1.96$
▪ $p$-value $= 2 \times 0.192 = 0.384$
▪ there is no indication that the proportion of females depends on handedness
2 × 2-TABLES
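Both groupings lead to the same test value, and its square equals the chi-square value 0.7576 from the contingency-table approach. The sketch below checks this numerically; the helper name `two_proportion_z` is an assumption for illustration.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: the two population proportions are equal."""
    p_pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / standard_error


# Grouping by gender: "success" = left-handed.
z_by_gender = two_proportion_z(12, 120, 24, 180)

# Grouping by handedness: "success" = female.
z_by_handedness = two_proportion_z(12, 36, 108, 264)

print(round(z_by_gender, 4), round(z_by_handedness, 4))
print(round(z_by_gender ** 2, 4))  # compare with the chi-square value
```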
SLIDE 22
Why is this method more versatile?
▪ It also allows testing hypotheses other than “no relation” or “$\pi_1 = \pi_2$”
▪ For instance
  ▪ $\pi_1 \geq \pi_2$
  ▪ $\pi_1 = \pi_2 + 0.2$
▪ The $\chi^2$-test can only test independence (= “no relation”)
▪ but it has the benefit of also working for larger tables than 2 × 2
2 × 2-TABLES
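A rough sketch of the two alternative hypotheses, reusing the handedness counts. Part (a) is the one-sided version of the pooled test. Part (b) tests a hypothesized difference of 0.2 and, as an assumption not stated on the slides, uses the unpooled standard error, because pooling is only justified when the null hypothesis says the proportions are equal. The data are reused purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 12, 120
x2, n2 = 24, 180
p1, p2 = x1 / n1, x2 / n2

# (a) One-sided: H0: pi1 >= pi2 against H1: pi1 < pi2.
# Same pooled z as the two-sided test, but only the left tail counts.
p_pooled = (x1 + x2) / (n1 + n2)
z_calc = (p1 - p2) / sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
p_left = NormalDist().cdf(z_calc)

# (b) H0: pi1 = pi2 + 0.2 against a two-sided alternative.
# ASSUMPTION: unpooled standard error, since H0 no longer states equality.
hypothesized_difference = 0.2
se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_shifted = (p1 - p2 - hypothesized_difference) / se_unpooled
p_two_sided = 2 * NormalDist().cdf(-abs(z_shifted))

print(round(p_left, 3), round(z_shifted, 2), p_two_sided)
```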
SLIDE 23
26 March 2015, Q1k OLD EXAM QUESTION
SLIDE 24