
Statistical Issues Associated With Multi-way Contingency Tables - PowerPoint PPT Presentation



  1. Statistical Issues Associated With Multi-way Contingency Tables & Links to Algebraic Geometry
     Stephen E. Fienberg
     CyLab, Department of Statistics, & Machine Learning Department, Carnegie Mellon University & IMA
     (Joint work with A. Dobra, A. Rinaldo, & Y. Zhou)

  2. Preliminaries
     • I am an “A” at IMA for Applications of Algebraic Geometry.
     • This talk:
       – Continuation from last week’s seminar by Serkan Hosten.
         • I won’t provide a notational translation table but I will overlap and give links.
       – Introduction to a number of statistical problems for the analysis of categorical data.

  3. Overview
     Three data examples and two statistical problems:
     1. Bounds for cell counts in contingency tables given marginals.
     2. Maximum likelihood estimation for log-linear models and large sparse contingency tables.
     How are they interrelated?
     Where do algebraic and other geometry tools fit in?
     Scaling up computations to deal with large sparse tables.

  4. Ex. 1: Risk Factors for Coronary Heart Disease
     • 1841 Czech auto workers
       Edwards and Havránek (1985) Biometrika
     • Selection of 6 binary variables
     • $2^6$ table
       – “0” cell
       – population unique, “1”
       – 2 cells with “2”
     [Figure: graph on the six variables: a = Smoke (Y/N), b = Mental work, c = Phys. work, d = Syst. BP, e = Lipo ratio, f = Anamnesis]

  5. Ex. 1: The Data

                                   B = no           B = yes
     F     E     D       C      A = no  A = yes  A = no  A = yes
     neg   < 3   < 140   no       44      40      112      67
                         yes     129     145       12      23
                 ≥ 140   no       35      12       80      33
                         yes     109      67        7       9
           ≥ 3   < 140   no       23      32       70      66
                         yes      50      80        7      13
                 ≥ 140   no       24      25       73      57
                         yes      51      63        7      16
     pos   < 3   < 140   no        5       7       21       9
                         yes       9      17        1       4
                 ≥ 140   no        4       3       11       8
                         yes      14      17        5       2
           ≥ 3   < 140   no        7       3       14      14
                         yes       9      16        2       3
                 ≥ 140   no        4       0       13      11
                         yes       5      14        4       4

  6. R-U Confidentiality Map
     [Figure: R-U confidentiality map, disclosure risk plotted against data utility, with labels No Data, Released Data, Original Data, and Maximum Tolerable Risk (Duncan, et al. 2004)]

  7. Disclosure Limitation for Sparse Count Data
     • Uniqueness in population table ⇔ cell count of “1”:
       – Uniqueness allows an intruder to match characteristics in the table with other databases that include the same variables to learn confidential information.
     • Utility typically tied to usefulness of marginal totals for statistical inference.
     • Risk concerned with small cell counts.
       – Assess using bounds for cell counts given marginal totals.

  8. Marginals as Data Releases
     • Simple summaries corresponding to subsets of variables.
     • Traditional mode of reporting for statistical agencies and others.
     • Useful in statistical modeling: role of log-linear models.
     • National Institute of Statistical Sciences Project and some of my former students have dealt with other models and other types of releases.

  9. Ex. 2: Genetics Linkage
     • Data come from a barley mildew experiment.
       – Edwards (1992). Comp. Stat. Data Anal.
       – 37 binary variables (genes) and 81 cases (5% missing data).
     • Subset of 6 genes that appear closely linked on basis of marginal distributions?
     • On same chromosome?

  10. Ex. 2: The Data

  11. Ex. 3: Australian Census Data
      • 10-dimensional highly sparse contingency table extracted from 1981 Australian population census (based on 10 million people):

        Variable   BPL  SEX  AGE  REL  MST  DUR  QAL  INC  FIN  TIS
        # Categ.   102    2   11   27    5   62   11   15   16   18

      • 892,533,945,600 cells!

  12. Collapsed Tables
      • Collapsed 5-way table with 105,600 cells of which 65% are zero

        Variable   BPL  MST  QAL  INC  FIN
        # Categ.     8    5   11   15   16

      • Collapsed 6-way table with 48,400 cells of which 41% are zero

        Variable   BPL  SEX  AGE  REL  MST  QAL
        # Categ.     8    2   11    5    5   11
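The cell counts quoted for the full and collapsed census tables are just products of the category counts. A minimal sketch that recomputes them from the numbers listed on the slides (the dictionary and function names are illustrative, not from the talk):

```python
from math import prod

# Category counts as listed on the slides.
full_table = {"BPL": 102, "SEX": 2, "AGE": 11, "REL": 27, "MST": 5,
              "DUR": 62, "QAL": 11, "INC": 15, "FIN": 16, "TIS": 18}
collapsed_5way = {"BPL": 8, "MST": 5, "QAL": 11, "INC": 15, "FIN": 16}
collapsed_6way = {"BPL": 8, "SEX": 2, "AGE": 11, "REL": 5, "MST": 5, "QAL": 11}


def n_cells(categories):
    """Number of cells in the table = product of the category counts."""
    return prod(categories.values())


for name, cats in [("10-way", full_table),
                   ("5-way", collapsed_5way),
                   ("6-way", collapsed_6way)]:
    print(f"{name}: {n_cells(cats):,} cells")
```

With the listed category counts, the 6-way product comes to 48,400.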

  13. Two Faces of Algebraic Statistics & Contingency Tables
      1. Representation of statistical models for cell probabilities: description of parameter space.
         A. Characterizing joint distributions.
         B. Log-linear models including those with “graphical representation” via conditional independencies.
      2. Statistical inference: studying and characterizing portions of sample space:
         A. Minimal sufficient statistics (sufficient data summaries) for models: marginal totals.
         B. Maximum likelihood estimation.
         C. Distribution over all possible tables having given marginals (“exact distribution”), and related bounds.

  14. It’s All About Geometry
      • Polyhedral Geometry: virtually all data-related quantities can be described by polyhedra.
        [Figures: Polyhedral Cone; Polytope]
      • Algebraic Geometry: a statistical model is specified by a polynomial map. The set of probability distributions is a hyper-surface of points satisfying polynomial equations.
        [Figure: Algebraic (Toric) Variety]

  15. 2 × 2 Table: The Model
      • We are interested in the distribution of the 4 cells in the table, specified by the vector of log probabilities:
        $\begin{array}{cc|c} p_{11} & p_{12} & p_{1+} \\ p_{21} & p_{22} & p_{2+} \\ \hline p_{+1} & p_{+2} & 1 \end{array}$
      • Model of independence: $p_{ij} = p_{i+} p_{+j}$, that is,
        $\log(p_{11}, p_{12}, p_{21}, p_{22})^\top = A^\top \theta$, with $\theta = \log(p_{1+}, p_{2+}, p_{+1}, p_{+2})^\top$.
      • Every probability distribution in the model of independence must satisfy one polynomial equation, $p_{11} p_{22} - p_{12} p_{21} = 0$ (the Segre variety), and so belongs to the surface of independence.
        [Figure: surface of independence]

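  16. 2 × 2 Table: The Data
      • Model of independence: $p_{ij} = p_{i+} p_{+j}$
      • Observed counts:
        $\begin{array}{cc} n_{11} & n_{12} \\ n_{21} & n_{22} \end{array}$
      • MSS (margins): $t_1 = n_{1+}$, $t_2 = n_{2+}$, $t_3 = n_{+1}$, $t_4 = n_{+2}$
      • Design matrix, with $t = An$ for $n = (n_{11}, n_{12}, n_{21}, n_{22})^\top$:
        $A = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}$
      • The set of all tables having margins $t$ consists of the integer points inside the polytope $\{x \in \mathbb{R}^4_{\ge 0} : Ax = t\}$; these form the fiber $\{n \in \mathbb{Z}^4_{\ge 0} : An = t\}$.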
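To make the fiber concrete, here is a small brute-force sketch that computes $t = An$ for a 2 × 2 table and lists every non-negative integer table with the same margins. The observed counts are hypothetical, the function names are illustrative, and exhaustive enumeration is only sensible for tiny tables.

```python
from itertools import product

# Design matrix from the slide: rows give n1+, n2+, n+1, n+2
# for a table flattened as (n11, n12, n21, n22).
A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]


def margins(n):
    """Compute t = A n for a flattened 2x2 table n."""
    return tuple(sum(a * x for a, x in zip(row, n)) for row in A)


def fiber(t):
    """All non-negative integer tables n with A n = t (brute force)."""
    total = t[0] + t[1]          # grand total n++ = n1+ + n2+
    cells = range(total + 1)     # no cell can exceed the grand total
    return [n for n in product(cells, repeat=4) if margins(n) == tuple(t)]


observed = (3, 1, 2, 4)          # hypothetical counts, not from the talk
t = margins(observed)
tables = fiber(t)
for n in tables:
    print(n)
```

For these margins the fiber is a one-parameter family: fixing $n_{11}$ determines the other three cells.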

  17. Design Matrix A
      [Diagram: Sample Space and Parameter Space, with the MLE between them]
      • Sample Space:
        – A identifies the fiber: the set of all tables having the same margins: $\{x \ge 0, Ax = t\}$.
        – Leads to the generalized hypergeometric probability distribution.
        – [Set of all tables are lattice points in the simplex.]
      • Parameter Space:
        – A specifies the set of polynomial equations that encode the dependence among the variables.
        – All probability vectors satisfy binomial equations: $p^{u^+} - p^{u^-} = 0$ for all integer $u \in \mathrm{kernel}(A)$.
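For the 2 × 2 independence model the integer kernel of $A$ is spanned by $u = (1, -1, -1, 1)$, and the corresponding binomial is $p_{11} p_{22} - p_{12} p_{21}$. The sketch below checks this on two hypothetical probability vectors, one independent and one not. Helper names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
u = (1, -1, -1, 1)   # spans the integer kernel of A for the 2x2 case


def matvec(matrix, v):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def binomial(p, u):
    """Evaluate p^{u+} - p^{u-}, splitting u into positive and negative parts."""
    plus = minus = Fraction(1)
    for p_i, u_i in zip(p, u):
        if u_i > 0:
            plus *= p_i ** u_i
        elif u_i < 0:
            minus *= p_i ** (-u_i)
    return plus - minus


# Hypothetical independent distribution built from its margins.
row = (Fraction(1, 4), Fraction(3, 4))
col = (Fraction(2, 5), Fraction(3, 5))
p_indep = tuple(r * c for r in row for c in col)

# Hypothetical dependent distribution (mass concentrated on the diagonal).
p_dep = (Fraction(2, 5), Fraction(1, 10), Fraction(1, 10), Fraction(2, 5))

print(matvec(A, u))            # u lies in the kernel of A
print(binomial(p_indep, u))    # vanishes on the independence model
print(binomial(p_dep, u))      # non-zero off the model
```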

  18. Maximum Likelihood Estimation
      • Distribution for $n$ given $p$: $f(n \mid p) \propto p_{11}^{n_{11}} p_{12}^{n_{12}} p_{21}^{n_{21}} p_{22}^{n_{22}}$
      • For model of independence, $p_{ij} = p_{i+} p_{+j}$, minimal sufficient statistics for parameters are:
        – $t = An = (n_{1+}, n_{2+}, n_{+1}, n_{+2})$
      • Maximum likelihood equations:
        – $\hat p_{i+} = n_{i+}/n$, $i = 1, 2$; $\hat p_{+j} = n_{+j}/n$, $j = 1, 2$.
      • Solution (MLEs): $\hat p_{ij} = n_{i+} n_{+j} / n^2$.
      • Rescale by total $n$ to count scale, $n p_{ij} = m_{ij}$: $\hat m_{ij} = n_{i+} n_{+j} / n$.
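As a quick illustration of the closed form $\hat m_{ij} = n_{i+} n_{+j}/n$, the following sketch fits the independence model to a hypothetical two-way table. The function name is illustrative, and exact fractions keep the check that the fitted margins reproduce the observed ones free of rounding.

```python
from fractions import Fraction


def independence_fit(table):
    """Fitted counts m_ij = n_i+ * n_+j / n for a two-way table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return [[Fraction(ri * cj, n) for cj in col_tot] for ri in row_tot]


counts = [[3, 1], [2, 4]]            # hypothetical counts, not from the talk
fitted = independence_fit(counts)

for row in fitted:
    print([str(m) for m in row])

# The fitted table has the same margins as the observed one.
print([sum(r) for r in fitted], [sum(c) for c in zip(*fitted)])
```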

  19. Two-Way Fréchet Bounds
      • For 2 × 2 tables of counts $\{n_{ij}\}$ given the marginal totals $\{n_{1+}, n_{2+}\}$ and $\{n_{+1}, n_{+2}\}$:
        $\begin{array}{cc|c} n_{11} & n_{12} & n_{1+} \\ n_{21} & n_{22} & n_{2+} \\ \hline n_{+1} & n_{+2} & n \end{array}$
        $\min(n_{i+}, n_{+j}) \ge n_{ij} \ge \max(n_{i+} + n_{+j} - n, 0)$
      • Link to independence: $\hat m_{ij} = n_{i+} n_{+j} / n$.
      • Interested in multi-way generalizations involving higher-order, overlapping margins.
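The two-way bounds are simple enough to compute directly. A minimal sketch, using hypothetical margins and an illustrative function name, that returns the (lower, upper) pair for every cell:

```python
def frechet_bounds(row_tot, col_tot):
    """(lower, upper) bounds on each cell of a two-way table given its margins."""
    n = sum(row_tot)
    if n != sum(col_tot):
        raise ValueError("row and column totals must have the same grand total")
    return [[(max(r + c - n, 0), min(r, c)) for c in col_tot] for r in row_tot]


# Hypothetical margins: n1+ = 4, n2+ = 6, n+1 = 5, n+2 = 5.
bounds = frechet_bounds([4, 6], [5, 5])
for row in bounds:
    print(row)
```

In a disclosure-limitation setting, a cell whose lower and upper bounds are close together is one an intruder can pin down from the released margins alone.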

  20. Log-linear Models for $2^3$ Tables
      • In 3-way table of counts, $\{n_{ijk}\}$, we model logarithms of expectations, $E(n_{ijk}) = m_{ijk} > 0$:
        $\log(m_{ijk}) = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)}$
      • MSSs are margins corresponding to highest order $u$-terms: $\{n_{ij+}\}$, $\{n_{i+k}\}$, $\{n_{+jk}\}$.
        – MSSs describe simplicial complex: [12][13][23].
      • Alternative ways to write model:
        $m_{ijk} = \alpha_{ij} \beta_{ik} \gamma_{jk}$
        $\dfrac{m_{111} m_{221}}{m_{121} m_{211}} = \dfrac{m_{112} m_{222}}{m_{122} m_{212}}$
        $m_{111} m_{221} m_{122} m_{212} - m_{121} m_{211} m_{112} m_{222} = 0$
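The equivalence between the product form of the no-three-factor-interaction model and the equal-odds-ratio condition can be checked numerically. In the sketch below the three two-way factors are called `alpha`, `beta`, and `gamma` (placeholder names with arbitrary hypothetical values); the code builds $m_{ijk}$ as their product and evaluates the odds ratio in each layer and the degree-four binomial.

```python
from fractions import Fraction
from itertools import product

# Hypothetical positive two-way factors, indexed 1..2 as on the slide.
alpha = {(1, 1): 2, (1, 2): 3, (2, 1): 5, (2, 2): 7}   # depends on (i, j)
beta = {(1, 1): 1, (1, 2): 4, (2, 1): 6, (2, 2): 2}    # depends on (i, k)
gamma = {(1, 1): 3, (1, 2): 1, (2, 1): 2, (2, 2): 5}   # depends on (j, k)

m = {(i, j, k): alpha[i, j] * beta[i, k] * gamma[j, k]
     for i, j, k in product((1, 2), repeat=3)}


def odds_ratio(m, k):
    """Odds ratio of the 2x2 layer with third index k."""
    return Fraction(m[1, 1, k] * m[2, 2, k], m[1, 2, k] * m[2, 1, k])


def quartic(m):
    """m111 m221 m122 m212 - m121 m211 m112 m222."""
    return (m[1, 1, 1] * m[2, 2, 1] * m[1, 2, 2] * m[2, 1, 2]
            - m[1, 2, 1] * m[2, 1, 1] * m[1, 1, 2] * m[2, 2, 2])


print(odds_ratio(m, 1), odds_ratio(m, 2))   # equal across layers
print(quartic(m))                           # zero on the model

# Perturbing a single cell moves the table off the model.
perturbed = dict(m)
perturbed[1, 1, 1] += 1
print(quartic(perturbed))
```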

  21. Log-linear Models (cont.)
      • Maximum likelihood estimates (MLEs) found by setting MSSs equal to their expectations:
        $\hat m_{ij+} = n_{ij+}$ for $i = 1, 2$, $j = 1, 2$,
        $\hat m_{+jk} = n_{+jk}$ for $j = 1, 2$, $k = 1, 2$,
        $\hat m_{i+k} = n_{i+k}$ for $i = 1, 2$, $k = 1, 2$.
      • Set: $m_{ijk} = n_{ijk} \pm \delta$
      • Solve cubic equation for $\delta$:
        $m_{111} m_{221} m_{122} m_{212} - m_{121} m_{211} m_{112} m_{222} = 0$
      • When do we get positive solutions for $\{m_{ijk}\}$?
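A minimal numerical sketch of this recipe follows. It assumes the sign pattern that keeps all three sets of two-way margins fixed: $+\delta$ on cells 111, 221, 122, 212 and $-\delta$ on cells 121, 211, 112, 222. On the interval of $\delta$ where every fitted cell stays positive, the left-hand side of the cubic is increasing, so plain bisection is used here in place of a closed-form cubic solver. The counts and function names are hypothetical.

```python
PLUS = [(1, 1, 1), (2, 2, 1), (1, 2, 2), (2, 1, 2)]    # cells receiving +delta
MINUS = [(1, 2, 1), (2, 1, 1), (1, 1, 2), (2, 2, 2)]   # cells receiving -delta


def cubic(n, delta):
    """m111 m221 m122 m212 - m121 m211 m112 m222 with m = n +/- delta."""
    plus = minus = 1.0
    for cell in PLUS:
        plus *= n[cell] + delta
    for cell in MINUS:
        minus *= n[cell] - delta
    return plus - minus


def solve_delta(n, iterations=200):
    """Root of the cubic with all fitted cells positive, or None if there is none."""
    lo = -min(n[cell] for cell in PLUS)    # below this a "+" cell goes negative
    hi = min(n[cell] for cell in MINUS)    # above this a "-" cell goes negative
    if lo >= hi:
        return None                        # no delta keeps every cell positive
    for _ in range(iterations):            # cubic(lo) < 0 < cubic(hi)
        mid = (lo + hi) / 2
        if cubic(n, mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def fitted_counts(n, delta):
    """Apply the +/- delta adjustment to every cell."""
    m = {cell: n[cell] + delta for cell in PLUS}
    m.update({cell: n[cell] - delta for cell in MINUS})
    return m


# Hypothetical 2x2x2 counts keyed by (i, j, k).
n = {(1, 1, 1): 10, (1, 2, 1): 20, (2, 1, 1): 15, (2, 2, 1): 5,
     (1, 1, 2): 8, (1, 2, 2): 12, (2, 1, 2): 9, (2, 2, 2): 30}
delta = solve_delta(n)
m_hat = fitted_counts(n, delta)
print(delta)
print(m_hat)
```

If `solve_delta` returns `None`, the admissible interval for $\delta$ is empty and no strictly positive solution exists, which is the situation the last bullet asks about.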

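  22. Existence of MLEs for 2 × 2 × 2 Table
      Layer $k = 1$:
      $\begin{array}{cc|c} 0 + \delta & n_{121} - \delta & n_{1+1} \\ n_{211} - \delta & n_{221} + \delta & n_{2+1} \\ \hline n_{+11} & n_{+21} & n_{++1} \end{array}$
      Layer $k = 2$:
      $\begin{array}{cc|c} n_{112} - \delta & n_{122} + \delta & n_{1+2} \\ n_{212} + \delta & 0 - \delta & n_{2+2} \\ \hline n_{+12} & n_{+22} & n_{++2} \end{array}$
      Margin over $k$:
      $\begin{array}{cc} n_{11+} & n_{12+} \\ n_{21+} & n_{22+} \end{array}$
      Delta must be zero and MLE doesn’t exist.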
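The conclusion can be spelled out in one line, assuming the two observed zeros are $n_{111} = n_{222} = 0$ and that $\delta$ is added to cell 111 and subtracted from cell 222 (the sign pattern that preserves the two-way margins). Positivity of the two fitted cells would require

```latex
\hat m_{111} = n_{111} + \delta = \delta > 0
\qquad\text{and}\qquad
\hat m_{222} = n_{222} - \delta = -\delta > 0 .
```

These two inequalities cannot hold together. The only margin-preserving candidate is $\delta = 0$, which leaves $\hat m_{111} = \hat m_{222} = 0$, so no strictly positive solution exists even though every two-way margin can be positive.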

  23. Two Other 3-Way Examples With [12][13][23]
      • $3^3$ table where MLE exists
      • $4^3$ table where MLE does not exist

  24. MLEs for Log-Linear Models for k-Way Tables
      • Log-linear models and algebraic geometry representations generalize.
      • Sampling distributions for $f(n \mid p)$ are key!
        – ML equations then have similar form.
      • Existence of MLEs linked to pattern of zeros:
        – Discoverable by defining basis for models and using algebraic and polyhedral geometry.
        – Examples discovered using Polymake.
      • General theorem in Haberman (1974) and “constructive” version in Rinaldo (2005).

  25. Graphical & Decomposable Log-linear Models
      • Graphical log-linear models: defined by simultaneous conditional independence relationships:
        – Absence of edges in graph.
      • Decomposable models correspond to triangulated graphs.
      • Ex. 1: Czech autoworkers
        – Graph has 3 cliques: [ADE][ABCE][BF]
      • “Interesting” decomposable log-linear model for data!
      [Figure: graph on the six variables: a = Smoke (Y/N), b = Mental work, c = Phys. work, d = Syst. BP, e = Lipo ratio, f = Anamnesis]
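Decomposable models have closed-form MLEs: the product of the clique margins divided by the product of the separator margins. The sketch below applies this standard result to the Ex. 1 counts with the cliques [ADE][ABCE][BF]. The separators [AE] and [B] are not stated on the slide; they are derived here as the intersections of consecutive cliques. The row and column ordering of the data and all function names are likewise assumptions of this sketch.

```python
from fractions import Fraction
from itertools import product

# Counts from the Ex. 1 data slide. Assumed row order: F in (neg, pos),
# E in (<3, >=3), D in (<140, >=140), C in (no, yes), with C varying fastest.
# Assumed column order: (B=no, A=no), (B=no, A=yes), (B=yes, A=no), (B=yes, A=yes).
ROWS = [
    (44, 40, 112, 67), (129, 145, 12, 23), (35, 12, 80, 33), (109, 67, 7, 9),
    (23, 32, 70, 66), (50, 80, 7, 13), (24, 25, 73, 57), (51, 63, 7, 16),
    (5, 7, 21, 9), (9, 17, 1, 4), (4, 3, 11, 8), (14, 17, 5, 2),
    (7, 3, 14, 14), (9, 16, 2, 3), (4, 0, 13, 11), (5, 14, 4, 4),
]

VARS = "ABCDEF"
CLIQUES = ["ADE", "ABCE", "BF"]
SEPARATORS = ["AE", "B"]   # assumed: intersections of consecutive cliques

# counts[(a, b, c, d, e, f)], each index 0 for the first level, 1 for the second.
counts = {}
for row, (f, e, d, c) in zip(ROWS, product((0, 1), repeat=4)):
    for value, (b, a) in zip(row, product((0, 1), repeat=2)):
        counts[(a, b, c, d, e, f)] = value


def pick(cell, names):
    """Coordinates of `cell` for the variables listed in `names`."""
    return tuple(cell[VARS.index(v)] for v in names)


def margin(table, names):
    """Sum the table over every variable not listed in `names`."""
    out = {}
    for cell, value in table.items():
        key = pick(cell, names)
        out[key] = out.get(key, 0) + value
    return out


def decomposable_fit(table):
    """Fitted counts: product of clique margins over product of separator margins."""
    clique_m = {c: margin(table, c) for c in CLIQUES}
    sep_m = {s: margin(table, s) for s in SEPARATORS}
    fitted = {}
    for cell in table:
        num = Fraction(1)
        for c in CLIQUES:
            num *= clique_m[c][pick(cell, c)]
        den = 1
        for s in SEPARATORS:
            den *= sep_m[s][pick(cell, s)]
        fitted[cell] = num / den
    return fitted


fitted = decomposable_fit(counts)
print(sum(counts.values()), sum(fitted.values()))
print(float(fitted[(0, 0, 0, 0, 0, 0)]))
```

The fitted table reproduces each clique margin of the observed table, which is what setting the minimal sufficient statistics equal to their expectations requires.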
