 
              CONTINGENCY TABLES: TESTS Business Statistics
CONTENTS
▪ Contingency tables
▪ Independence of categorical variables
▪ 2 × 2-tables
▪ Old exam question
▪ Further study
CONTINGENCY TABLES
Contingency table (see “Summarizing data”):
▪ matrix with counts
▪ rows represent levels of categorical variable $X$ (= 1, 2)
▪ columns represent levels of categorical variable $Y$ (= 1, 2, 3)
▪ “margins” contain totals
CONTINGENCY TABLES
Simple example: election poll (sample)
▪ three cities
▪ four parties
▪ count data
▪ question: is the political preference the same in these cities?
CONTINGENCY TABLES
Notation for contingency table
▪ count in cell $(j,k)$: $n_{jk}$
▪ total in row $j$: $n_{j\cdot} = \sum_{k=1}^{c} n_{jk}$
▪ total in column $k$: $n_{\cdot k} = \sum_{j=1}^{r} n_{jk}$
▪ “total total”: $n_{\cdot\cdot} = \sum_{j=1}^{r} \sum_{k=1}^{c} n_{jk}$
▪ there are $r$ rows and $c$ columns

        1          2          3          tot
1       $n_{11}$   $n_{12}$   $n_{13}$   $n_{1\cdot}$
2       $n_{21}$   $n_{22}$   $n_{23}$   $n_{2\cdot}$
tot     $n_{\cdot 1}$   $n_{\cdot 2}$   $n_{\cdot 3}$   $n_{\cdot\cdot}$
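The margins can be computed mechanically from the cell counts. The following is a minimal sketch in Python; the function name `margins` and the 2 × 3 table of counts are hypothetical and chosen only for illustration.

```python
def margins(table):
    """Return (row_totals, column_totals, grand_total) for a table of counts.

    `table` is a list of rows; every row has the same number of columns.
    """
    row_totals = [sum(row) for row in table]        # one total per row
    col_totals = [sum(col) for col in zip(*table)]  # one total per column
    grand_total = sum(row_totals)                   # the "total total"
    return row_totals, col_totals, grand_total


# Hypothetical counts: 2 rows, 3 columns (not data from the slides).
counts = [
    [10, 20, 30],
    [5, 15, 20],
]
row_totals, col_totals, grand_total = margins(counts)
print(row_totals, col_totals, grand_total)
```

Summing the row totals and summing the column totals both give the same grand total, which is a quick sanity check on any contingency table.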
INDEPENDENCE OF CATEGORICAL VARIABLES
What does it mean when categorical variable $X$ is independent of categorical variable $Y$?
▪ knowledge of $x_i$ doesn’t help you to predict $y_i$
▪ $P(Y = k \mid X = j) = P(Y = k)$
▪ $P(X = j \cap Y = k) = P(X = j)\,P(Y = k)$
▪ (for all $j$ and $k$)
Can we calculate a test statistic for independence?
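INDEPENDENCE OF CATEGORICAL VARIABLES
Again:

        1          2          3          tot
1       $n_{11}$   $n_{12}$   $n_{13}$   $n_{1\cdot}$
2       $n_{21}$   $n_{22}$   $n_{23}$   $n_{2\cdot}$
tot     $n_{\cdot 1}$   $n_{\cdot 2}$   $n_{\cdot 3}$   $n_{\cdot\cdot}$

For the totals we have:
▪ for row $j$: $n_{j\cdot} = n_{j1} + n_{j2} + n_{j3} = \sum_{k=1}^{c} n_{jk}$
▪ for column $k$: $n_{\cdot k} = n_{1k} + n_{2k} = \sum_{j=1}^{r} n_{jk}$
▪ for entire table: $n_{\cdot\cdot} = \sum_{j=1}^{r} n_{j\cdot} = \sum_{k=1}^{c} n_{\cdot k}$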
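INDEPENDENCE OF CATEGORICAL VARIABLES
▪ There are $n_{j\cdot}$ (out of $n_{\cdot\cdot}$) cases in row $j$
▪ and $n_{\cdot k}$ (out of $n_{\cdot\cdot}$) cases in column $k$
▪ A fraction $\frac{n_{j\cdot}}{n_{\cdot\cdot}}$ of the cases is in row $j$
▪ and a fraction $\frac{n_{\cdot k}}{n_{\cdot\cdot}}$ of the cases is in column $k$
▪ So, if there is no dependence between row $j$ and column $k$, we expect to have a fraction $\frac{n_{j\cdot}}{n_{\cdot\cdot}} \times \frac{n_{\cdot k}}{n_{\cdot\cdot}}$ of the cases in cell $(j,k)$
▪ which gives an expected count of $e_{jk} = n_{\cdot\cdot} \times \frac{n_{j\cdot}}{n_{\cdot\cdot}} \times \frac{n_{\cdot k}}{n_{\cdot\cdot}} = \frac{n_{j\cdot} \times n_{\cdot k}}{n_{\cdot\cdot}}$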
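The rule "expected count = row total × column total / grand total" can be applied to every cell at once. Below is a minimal sketch; the function name `expected_counts` and the table of counts are hypothetical illustrations, not data from the slides.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns.

    Each expected count is (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [
        [row_total * col_total / grand_total for col_total in col_totals]
        for row_total in row_totals
    ]


# Hypothetical counts: 2 rows, 3 columns.
counts = [
    [10, 20, 30],
    [5, 15, 20],
]
expected = expected_counts(counts)
for row in expected:
    print([round(e, 2) for e in row])
```

The expected counts reproduce the observed margins: each row of `expected` sums to the corresponding observed row total, and likewise for the columns.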
EXERCISE 1
Given is the following contingency table:
If the two variables (party preference and state) were independent, what would be the expected count for Democrat/Utah?
INDEPENDENCE OF CATEGORICAL VARIABLES
▪ Compare the expected count ($e_{jk}$) in cell $(j,k)$ to the observed count ($n_{jk}$)
▪ Discrepancy between observed count and expected count under the null hypothesis of independence: $n_{jk} - e_{jk}$
▪ Can we aggregate this over all cells ($j = 1, \dots, r$ and $k = 1, \dots, c$)?
▪ Yes, but (again!) positive and negative deviations would cancel each other (summed over all cells, the discrepancies add up to zero)
▪ so we aggregate the squared discrepancy $(n_{jk} - e_{jk})^2$ over all cells
INDEPENDENCE OF CATEGORICAL VARIABLES
Still one thing to do:
▪ a discrepancy of 5 when 8 is expected is much worse than a discrepancy of 5 when 1000 is expected
▪ so we “standardize” by the expected count: $\frac{(n_{jk} - e_{jk})^2}{e_{jk}}$
Therefore, we measure the overall discrepancy between expected and observed frequencies as:
$$\sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(n_{jk} - e_{jk})^2}{e_{jk}}$$
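The overall discrepancy measure can be sketched in a few lines of Python: for every cell, square the difference between observed and expected count, divide by the expected count, and add everything up. The function name `chi_square_statistic` and the 2 × 2 counts are hypothetical and only serve as an illustration.

```python
def chi_square_statistic(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for j, row in enumerate(table):
        for k, observed in enumerate(row):
            # Expected count under independence for cell (j, k).
            expected = row_totals[j] * col_totals[k] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: every expected count is 15, every discrepancy is +5 or -5.
counts = [
    [10, 20],
    [20, 10],
]
stat = chi_square_statistic(counts)
print(round(stat, 4))
```

In this table each of the four cells contributes 25/15, so the statistic equals 100/15 = 20/3.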
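INDEPENDENCE OF CATEGORICAL VARIABLES
It can be shown that under $H_0$ (independence of $X$ (rows) and $Y$ (columns)):
$$\sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(n_{jk} - e_{jk})^2}{e_{jk}} \sim \chi^2_{(r-1)(c-1)}$$
provided that all $e_{jk} \geq 5$
▪ Therefore, we call our test value $\chi^2_{\text{calc}}$
▪ Reject $H_0$ at $\alpha$ when $\chi^2_{\text{calc}} > \chi^2_{(r-1)(c-1);\alpha}$
▪ i.e., for large values of $\chi^2_{\text{calc}}$ only: the rejection region is 1-tailed (upper tail), but the alternative is 2-sided ...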
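Two small bookkeeping steps precede the test: the degrees of freedom, (rows − 1) × (columns − 1), and the check that all expected counts are at least 5. A minimal sketch follows; both function names and the example numbers are hypothetical.

```python
def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1), margins excluded."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)


def expected_counts_large_enough(expected, minimum=5):
    """True when every expected count meets the rule-of-thumb minimum."""
    return all(e >= minimum for row in expected for e in row)


# Hypothetical 2 x 3 table: df = (2 - 1) * (3 - 1) = 2.
counts = [
    [10, 20, 30],
    [5, 15, 20],
]
df = degrees_of_freedom(counts)
ok = expected_counts_large_enough([[9.0, 21.0, 30.0], [6.0, 14.0, 20.0]])
print(df, ok)
```

The table passed to `degrees_of_freedom` must contain only the cell counts, without the row and column totals.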
INDEPENDENCE OF CATEGORICAL VARIABLES
Example:
Calculations:
▪ $n_{\text{GATT,Christ}} = 55$, etc.
▪ $e_{\text{GATT,Christ}} = \frac{119 \times 75}{153} = 58.3$, etc.
▪ $\chi^2_{\text{calc}} = \frac{(55 - 58.3)^2}{58.3} + \cdots = 1.88$ ($2 \times 3 = 6$ terms)
▪ $\chi^2_{\text{crit;upper}} = \chi^2_{2;0.05} = 5.991$
▪ do not reject $H_0$ and conclude that there is no evidence of dependence between religion and GATT membership
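For 2 degrees of freedom the chi-square distribution has a simple closed form: the upper-tail probability beyond x is exp(−x/2), so the critical value at level α is −2 ln α. The sketch below uses this to check the critical value 5.991 and to attach a p-value to the test value 1.88. The function names are hypothetical, and the shortcut holds for 2 degrees of freedom only.

```python
import math


def chi2_upper_tail_df2(x):
    """P(chi-square with 2 df > x); this closed form is valid for df = 2 only."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Upper critical value for 2 df: solves exp(-x / 2) = alpha."""
    return -2 * math.log(alpha)


# Expected count for the cell shown in the example: 119 * 75 / 153.
expected_cell = 119 * 75 / 153

critical = chi2_critical_df2(0.05)
p_value = chi2_upper_tail_df2(1.88)   # test value taken from the example
reject = 1.88 > critical

print(round(expected_cell, 1), round(critical, 3), round(p_value, 3), reject)
```

The p-value is well above 0.05, which agrees with the decision not to reject independence.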
INDEPENDENCE OF CATEGORICAL VARIABLES
[Tables not reproduced: observed and (under $H_0$) expected counts; $\chi^2$ calculations per cell, $\frac{(n_{jk} - e_{jk})^2}{e_{jk}}$, adding up to $\chi^2_{\text{calc}}$; standardized residuals]
EXERCISE 2
Find the critical value (at $\alpha = 5\%$) of the appropriate distribution for a contingency table of 3 rows and 4 columns (without the totals).
INDEPENDENCE OF CATEGORICAL VARIABLES
▪ Step 1:
  ▪ $H_0$: GATT membership and religion are independent; $H_1$: GATT membership and religion are dependent; $\alpha = 0.05$
▪ Step 2:
  ▪ sample statistic $\chi^2 = \sum \frac{(n_{\text{obs}} - n_{\text{exp}})^2}{n_{\text{exp}}}$; reject for large values
▪ Step 3:
  ▪ under $H_0$: $\chi^2 \sim \chi^2_{df=2}$
  ▪ requirement: all expected counts $\geq 5$
▪ Step 4:
  ▪ $\chi^2_{\text{calc}} = 1.88$; $\chi^2_{\text{crit}} = \chi^2_{2;0.05} = 5.991$
▪ Step 5:
  ▪ do not reject $H_0$ at $\alpha = 0.05$ and conclude that ...
2 × 2-TABLES
Suppose there are only two rows and two columns
▪ e.g., GATT/no GATT and Christ/no Christ
▪ or male/female and right-handed/left-handed
We can still use contingency tables to check for independence
But there is a more versatile way:
▪ test for two proportions (see: comparing two $\pi$s)
2 × 2-TABLES
Using a contingency table
▪ $\chi^2 = 0.7576$
▪ $\chi^2_{\text{crit}} = \chi^2_{1;0.05} = 3.841$
▪ $p$-value $= 0.384$
▪ independence not rejected
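These numbers can be reproduced from the handedness counts used in the two-proportion calculations of this section (12 of 120 women and 24 of 180 men are left-handed). The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper-tail probability is erfc(√(x/2)).

```python
import math


def chi_square_statistic(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return sum(
        (obs - row_totals[j] * col_totals[k] / grand_total) ** 2
        / (row_totals[j] * col_totals[k] / grand_total)
        for j, row in enumerate(table)
        for k, obs in enumerate(row)
    )


def chi2_upper_tail_df1(x):
    """P(chi-square with 1 df > x); valid for df = 1 only."""
    return math.erfc(math.sqrt(x / 2))


# Rows: female, male. Columns: left-handed, right-handed.
handedness = [
    [12, 108],
    [24, 156],
]
chi2_calc = chi_square_statistic(handedness)
p_value = chi2_upper_tail_df1(chi2_calc)
print(round(chi2_calc, 4), round(p_value, 3))
```

The test value stays below the critical value 3.841 and the p-value exceeds 0.05, so independence is not rejected.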
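2 × 2-TABLES
Approach 1: (see also next lecture)
▪ group 1 = female; group 2 = male
▪ “success” = left-handed
▪ $H_0$: $\pi_1 = \pi_2$
▪ $p_1 = \frac{12}{120} = 0.10$; $p_2 = \frac{24}{180} = 0.13$
▪ pooled proportion: $\bar{p} = \frac{12 + 24}{120 + 180} = 0.12$
▪ $z_{\text{calc}} = \frac{0.10 - 0.13}{\sqrt{0.12\,(1 - 0.12)\left(\frac{1}{120} + \frac{1}{180}\right)}} = -0.87$ (computed with the unrounded proportions)
▪ $z_{\text{crit}} = -z_{0.025} = -1.96$ (lower critical value; the upper one is $+1.96$)
▪ $p$-value $= 2 \times 0.192 = 0.384$
▪ there is no indication that the proportion of left-handed persons depends on gender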
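The calculation of Approach 1 can be sketched as a small function. The name `two_proportion_z_test` is hypothetical; the two-sided p-value is obtained from the standard normal distribution via `math.erfc`, since 2·P(Z > |z|) = erfc(|z|/√2).

```python
import math


def two_proportion_z_test(successes_1, n_1, successes_2, n_2):
    """Pooled z-test of equal proportions; returns (z, two-sided p-value)."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * P(Z > |z|)
    return z, p_value


# Group 1 = female (12 left-handed of 120); group 2 = male (24 of 180).
z_calc, p_value = two_proportion_z_test(12, 120, 24, 180)
print(round(z_calc, 2), round(p_value, 3))
```

Working with the unrounded proportions matters here: plugging the rounded values 0.10 and 0.13 into the numerator would give about −0.78 instead of −0.87.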
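2 × 2-TABLES
Approach 2: (see also next lecture)
▪ group 1 = left-handed; group 2 = right-handed
▪ “success” = female
▪ $H_0$: $\pi_1 = \pi_2$
▪ $p_1 = \frac{12}{36} = 0.33$; $p_2 = \frac{108}{264} = 0.41$
▪ pooled proportion: $\bar{p} = \frac{12 + 108}{36 + 264} = 0.40$
▪ $z_{\text{calc}} = \frac{0.33 - 0.41}{\sqrt{0.40\,(1 - 0.40)\left(\frac{1}{36} + \frac{1}{264}\right)}} = -0.87$ (computed with the unrounded proportions)
▪ $z_{\text{crit}} = -z_{0.025} = -1.96$ (lower critical value; the upper one is $+1.96$)
▪ $p$-value $= 2 \times 0.192 = 0.384$
▪ there is no indication that the proportion of females depends on handedness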
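Both ways of forming the groups give the same test value, and its square equals the chi-square test value 0.7576 of the contingency-table approach. The sketch below checks this numerically; the function name `pooled_z` is hypothetical.

```python
import math


def pooled_z(successes_1, n_1, successes_2, n_2):
    """z statistic for H0: equal proportions, using the pooled proportion."""
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    return (successes_1 / n_1 - successes_2 / n_2) / standard_error


# Groups by gender, "success" = left-handed.
z_by_gender = pooled_z(12, 120, 24, 180)
# Groups by handedness, "success" = female.
z_by_handedness = pooled_z(12, 36, 108, 264)

print(round(z_by_gender, 4), round(z_by_handedness, 4), round(z_by_gender ** 2, 4))
```

This is why the two-sided z-test and the chi-square test on a 2 × 2 table report the same p-value.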
2 × 2-TABLES
Why is this method more versatile?
▪ It also allows testing hypotheses other than “no relation” or “$\pi_1 = \pi_2$”
▪ For instance
  ▪ $\pi_1 \geq \pi_2$
  ▪ $\pi_1 = \pi_2 + 0.2$
▪ The $\chi^2$-test can only test independence (= “no relation”)
▪ but it has the benefit of also working for tables larger than 2 × 2
OLD EXAM QUESTION 26 March 2015, Q1k
FURTHER STUDY
Doane & Seward 5/E 15.1
Tutorial exercises week 5 $\chi^2$-test comparing two proportions