Advances in Visualizing Categorical Data Using the vcd, gnm and - - PowerPoint PPT Presentation

advances in visualizing categorical data using the vcd
SMART_READER_LITE
LIVE PREVIEW

Advances in Visualizing Categorical Data Using the vcd, gnm and - - PowerPoint PPT Presentation

intro vcd gnm 3D odds References Advances in Visualizing Categorical Data Using the vcd, gnm and vcdExtra Packages in R Michael Friendly 1 Heather Turner 2 David Firth 2 Achim Zeileis 3 1 Psychology Department York University 2 University of


slide-1
SLIDE 1

intro vcd gnm 3D

  • dds

References

Advances in Visualizing Categorical Data Using the vcd, gnm and vcdExtra Packages in R

Michael Friendly1 Heather Turner2 David Firth2 Achim Zeileis3

1Psychology Department

York University

2University of Warwick, UK 3Department of Statistics

Universit¨ at Innsbruck

CARME 2011 Rennes, February 9–11, 2011

Slides: http://datavis.ca/papers/adv-vcd-4up.pdf

1 / 67

slide-2
SLIDE 2

intro vcd gnm 3D

  • dds

References

Co-conspirators

Heather Turner

University of Warwick

David Firth

University of Warwick

Achim Zeileis

Universit¨ at Innsbruck

2 / 67

slide-3
SLIDE 3

intro vcd gnm 3D

  • dds

References

Outline

1

Introduction

2

Generalized Mosaic Displays: vcd Package

3

Generalized Nonlinear Models: gnm & vcdExtra Packages

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios

3 / 67

slide-4
SLIDE 4

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

4 / 67

slide-5
SLIDE 5

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Brief History of VCD

Hartigan and Kleiner (1981, 1984): representing an n-way contingency table by a “mosaic display,” showing a (recursive) decomposition of frequencies by “tiles”, area ∼ cell frequency. e.g., a 4-way table of viewing TV programs

Freq ~Day + Week + Time + Network

5 / 67

slide-6
SLIDE 6

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Brief History of VCD

Friendly (1994): developed the connection between mosaic displays and loglinear models

Showed how mosaic displays could be used to visualize both

  • bserved frequency (area) and residuals (shading) from some

model. 1st presented at CARME 1995 (thx: Michael & J¨

  • rg!)

6 / 67

slide-7
SLIDE 7

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Brief History of VCD

Visualizing Categorical Data (Friendly, 2000) But: mosaic-like displays have a long history (Friendly, 2002)! von Mayr (1877) Birch (1964) 2002: vcd project at TU & WU, Vienna (Kurt Hornik, David Meyer, Achim Zeileis) → vcd package

7 / 67

slide-8
SLIDE 8

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Brief History of VCD

Visualizing Categorical Data (Friendly, 2000) But: mosaic-like displays have a long history (Friendly, 2002)! von Mayr (1877) Birch (1964) 2002: vcd project at TU & WU, Vienna (Kurt Hornik, David Meyer, Achim Zeileis) → vcd package

8 / 67

slide-9
SLIDE 9

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Brief History of VCD

Visualizing Categorical Data (Friendly, 2000) But: mosaic-like displays have a long history (Friendly, 2002)! von Mayr (1877) Birch (1964) 2002: vcd project at TU & WU, Vienna (Kurt Hornik, David Meyer, Achim Zeileis) → vcd package

9 / 67

slide-10
SLIDE 10

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

10 / 67

slide-11
SLIDE 11

Data, pictures, models & stories

data story

Two paths to enlightenment

slide-12
SLIDE 12

Data, pictures, models & stories

data story visualization

E x p l

  • r

a t

  • r

y

Two paths to enlightenment

slide-13
SLIDE 13

Data, pictures, models & stories

data story model visualization

E x p l

  • r

a t

  • r

y M

  • d

e l

  • b

a s e d

Two paths to enlightenment

slide-14
SLIDE 14

Data, pictures, models & stories

data story model visualization summary

E x p l

  • r

a t

  • r

y M

  • d

e l

  • b

a s e d

Two paths to enlightenment

slide-15
SLIDE 15

Data, pictures, models & stories

data story model visualization summary inference

E x p l

  • r

a t

  • r

y M

  • d

e l

  • b

a s e d

Two paths to enlightenment

slide-16
SLIDE 16

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Visual overview: Models for frequency tables

Related models: logistic regression, polytomous regression, log

  • dds models, ...

Goals: Connect all with visualization methods

11 / 67

slide-17
SLIDE 17

intro vcd gnm 3D

  • dds

References VCD History Visual overview

Visual overview: R packages

12 / 67

slide-18
SLIDE 18

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

13 / 67

slide-19
SLIDE 19

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Extending mosaic-like displays

Initial ideas for mosaic displays were extended in a variety of ways: pairs plots and trellis-like layouts for marginal, conditional and partial views (Friendly 1999). varying the shape attributes of bar plots and mosaic displays

double-decker plots (Hofmann 2001), spine plots and spinograms (Hofmann & Theus 2005)

residual-based shadings to emphasize pattern of association in log-linear models or to visualize significance (Zeileis et al., 2007). dynamic interactive versions (ViSta, MANET, Mondrian):

linking of several graphs and models selection and highlighting across graphs and models interactive modification of the visualized models

14 / 67

slide-20
SLIDE 20

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Generalized mosaic displays

vcd package and the strucplot framework

Various displays for n-way frequency tables

flat (two-way) tables of frequencies fourfold displays mosaic displays sieve diagrams association plots doubledecker plots spine plots and spinograms

Commonalities

All have to deal with representing n-way tables in 2D All graphical methods use area to represent frequency Some are model-based — designed as a visual representation

  • f an underlying statistical model

Graphical methods use visual attributes (color, shading, etc.) to highlight relevant statistical aspects

15 / 67

slide-21
SLIDE 21

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Generalized mosaic displays

vcd package and the strucplot framework

Various displays for n-way frequency tables

flat (two-way) tables of frequencies fourfold displays mosaic displays sieve diagrams association plots doubledecker plots spine plots and spinograms

Commonalities

All have to deal with representing n-way tables in 2D All graphical methods use area to represent frequency Some are model-based — designed as a visual representation

  • f an underlying statistical model

Graphical methods use visual attributes (color, shading, etc.) to highlight relevant statistical aspects

16 / 67

slide-22
SLIDE 22

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Familiar example: UCB Admissions

Data on admission to graduate programs at UC Berkeley, by Dept, Gender and Admission

> structable(Dept ~ Gender + Admit, UCBAdmissions) Dept A B C D E F Gender Admit Male Admitted 512 353 120 138 53 22 Rejected 313 207 205 279 138 351 Female Admitted 89 17 202 131 94 24 Rejected 19 8 391 244 299 317

  • r, as a two-way table (collapsed over Dept),

> structable(~Gender + Admit, UCBAdmissions) Admit Admitted Rejected Gender Male 1198 1493 Female 557 1278

17 / 67

slide-23
SLIDE 23

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Fourfold displays for 2 × 2 tables

General ideas: Model-based graphs can show both data and model tests (or

  • ther statistical features)

Visual attributes tuned to support perception of relevant statistical comparisons

Gender: Male Admit: Admitted Gender: Female Admit: Rejected 1198 557 1493 1278

Quarter circles: radius ∼ √nij ⇒ area ∼ frequency Independence: Adjoining quadrants ≈ align Odds ratio: ratio of areas of diagonally opposite cells Confidence rings: Visual test of H0 : θ = 1 ↔ adjoining rings

  • verlap

18 / 67

slide-24
SLIDE 24

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Fourfold displays for 2 × 2 ×k tables

Stratified analysis: one fourfold display for each department Each 2 × 2 table standardized to equate marginal frequencies Shading: highlight departments for which Ha : θi = 1

Gender: Male Admit: Admitted Gender: Female Admit: Rejected

Dept: A

512 89 313 19 Gender: Male Admit: Admitted Gender: Female Admit: Rejected

Dept: B

353 17 207 8 Gender: Male Admit: Admitted Gender: Female Admit: Rejected

Dept: C

120 202 205 391 Gender: Male Admit: Admitted Gender: Female Admit: Rejected

Dept: D

138 131 279 244 Gender: Male Admit: Admitted Gender: Female Admit: Rejected

Dept: E

53 94 138 299 Gender: Male Admit: Admitted Gender: Female Admit: Rejected

Dept: F

22 24 351 317

19 / 67

slide-25
SLIDE 25

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Mosaic displays

Tiles: Area ∼ observed frequencies, nijk Friendly shading (highlight association pattern):

Residuals: rijk = (nijk − ˆ mijk)/

  • ( ˆ

mijk) Color— blue: r > 0, red: r < 0 Saturation: |r| < 2 (none), > 4 (max), else (middle)

(Other shadings highlight significance) (Other color schemes: HSV, HCL, . . . )

Model: ~Dept+Gender+Admit

Gender Admit Dept F Admitted Rejected Admitted Rejected E D C B A Male Female

Model: ~(Dept*Gender) + Admit

Gender Admit Dept F Admitted Rejected Admitted Rejected E D C B A Male Female

Model: ~(Admit + Gender) * Dept

Gender Admit Dept F Admitted Rejected Admitted Rejected E D C B A Male Female

20 / 67

slide-26
SLIDE 26

Mosaic displays: Fitting & visualizing models

Mutual independence model: Dept ⊥ Gender ⊥ Admit

> berk.mod0 <- loglm(~Dept + Gender + Admit, data = UCB) > mosaic(berk.mod0, gp = shading_Friendly, ...)

−14.0 −4.0 −2.0 0.0 2.0 4.0 20.2 Pearson residuals:

Model: ~Dept+Gender+Admit

Gender Admit Dept F Admitted Rejected Admitted Rejected E D C B A Male Female

slide-27
SLIDE 27

Mosaic displays: Fitting & visualizing models

Joint independence model: Admit ⊥ (Gender, Dept)

> berk.mod1 <- loglm(~Admit + (Gender * Dept), data = UCB) > mosaic(berk.mod1, gp = shading_Friendly, ...)

−10.2 −4.0 −2.0 0.0 2.0 4.0 10.7 Pearson residuals:

Model: ~Admit + (Gender*Dept)

Gender Admit Dept F Admitted Rejected Admitted Rejected E D C B A Male Female

slide-28
SLIDE 28

Mosaic displays: Fitting & visualizing models

Conditional independence model: Admit ⊥ Gender | Dept

> berk.mod2 <- loglm(~(Admit + Gender) * Dept, data = UCB) > mosaic(berk.mod2, gp = shading_Friendly, ...)

−3.13 −2.00 0.00 2.00 2.33 Pearson residuals:

Model: ~(Admit + Gender) * Dept

Gender Admit Dept F Admitted Rejected Admitted Rejected E D C B A Male Female

slide-29
SLIDE 29

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Double decker plots

Visualize dependence of one categorical (typically binary) variable on predictors Formally: mosaic plots with vertical splits for all predictor dimensions, highlighting response

Dept Gender A Male Female B Male F C Male Female D Male Female E MaleFemale F Male Female Admitted Rejected Admit

24 / 67

slide-30
SLIDE 30

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

25 / 67

slide-31
SLIDE 31

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

The strucplot framework

A general, flexible system for visualizing n-way frequency tables: integrates tabular displays, mosaic displays, association plots, sieve plots, etc. in a common framework. n-way tables: variables partitioned into row and column variables in a “flat” 2D display using model formulae arguments allow for fitting any loglinear model via loglm() in the MASS package. high-level functions for all-pairwise views (pairs()), conditional views (cotabplot()). low-level functions control all aspects of labeling, shading, spacing, etc.

26 / 67

slide-32
SLIDE 32

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

The strucplot framework

Components of the strucplot framework:

27 / 67

slide-33
SLIDE 33

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Pairwise bivariate plots

Visualize all 2-way views of different independence models in n-way tables: type=

"pairwise": Burt matrix: bivariate, marginal views "total": pairwise plots for mutual independence "conditional": marginal independence, given all others "joint": joint independence of all pairs from other variables

Panel functions for upper, lower, diagonal panels

upper, lower: mosaic, assoc, sieve, ... diagonal: barplot, text, mosaic, ...

Admit

Admitted Rejected

Gender

Male Female

Dept

A B C D E F 500 1000 1500 2000 2500 3000 Admitted Rejected Admit 500 1000 1500 2000 2500 3000 Male Female Gender 200 400 600 800 1000 A B C D E F Dept 500 1000 1500 2000 2500 3000 Admitted Admit 500 1000 1500 2000 2500 3000 Male Female Gender 200 400 600 800 1000 A B C D E F Dept

28 / 67

slide-34
SLIDE 34

intro vcd gnm 3D

  • dds

References Extending mosaic displays The strucplot framework

Pairwise bivariate plots

> pairs(UCBAdmissions, shade=TRUE, space=0.2, + diag_panel = pairs_diagonal_mosaic(offset_varnames=-3, ...))

Admit

Admitted Rejected

Gender

Male Female

Dept

A B C D E F

29 / 67

slide-35
SLIDE 35

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

30 / 67

slide-36
SLIDE 36

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Loglinear models and generalized linear models

Loglinear models

Model fitting in the vcd package is based on loglinear models log(mij) = µ + λA

i + λB j ≡ [A][B] ≡∼ A + B

log(mij) = µ + λA

i + λB j + λAB ij

≡ [AB] ≡∼ A * B Fit using iterative proportional fitting (loglm()) → No standard errors, limited syntax for expressing models

Generalized linear models

Link function: E(y | x) = g(µ) = η(x) = β0 + β1x1 + · · · βkxk Variance function: Var(y | x) = f(µ) Loglinear models as special cases with log link, Poisson distn → Var(y | x) = µ

31 / 67

slide-37
SLIDE 37

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Loglinear models and generalized linear models

Loglinear models

Model fitting in the vcd package is based on loglinear models log(mij) = µ + λA

i + λB j ≡ [A][B] ≡∼ A + B

log(mij) = µ + λA

i + λB j + λAB ij

≡ [AB] ≡∼ A * B Fit using iterative proportional fitting (loglm()) → No standard errors, limited syntax for expressing models

Generalized linear models

Link function: E(y | x) = g(µ) = η(x) = β0 + β1x1 + · · · βkxk Variance function: Var(y | x) = f(µ) Loglinear models as special cases with log link, Poisson distn → Var(y | x) = µ

32 / 67

slide-38
SLIDE 38

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

33 / 67

slide-39
SLIDE 39

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Generalized nonlinear models: gnm package

A generalized non-linear model (GNM) is the same as a GLM, except that we allow g(µ) = η(x; β) where η(x; β) is nonlinear in the parameters β. GNMs are very general, combining:

classical nonlinear models standard link and variance functions for GLM families

In the context of models for categorical data, GNMs provide:

parsimonious models for structured association models for multiplicative association (e.g., Goodman’s RC(1) model) multiple instances of multiplicative terms (RC(m) models) user-defined functions for custom models

34 / 67

slide-40
SLIDE 40

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Generalized nonlinear models: gnm package

Some models for structured associations in square tables quasi-independence (ignore diagonals)

> gnm(Freq ~ row + col + Diag(row, col), family = poisson)

symmetry (λRC

ij

= λRC

ji )

> gnm(Freq ~ Symm(row, col), family = poisson)

quasi-symmetry = quasi + symmetry

> gnm(Freq ~ row + col + Symm(row, col), family = poisson)

fully-specified “topological” association patterns

> gnm(Freq ~ row + col + Topo(row, col, spec = RCmatrix), ...)

All of these are actually GLMs, but the gnm package provides convienence functions Diag, Symm, and Topo to facilitate model specification.

35 / 67

slide-41
SLIDE 41

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Nonlinear models

Nonlinear terms are specified in model formulae by functions

  • f class "nonlin"

Basic nonlinear functions: Exp(), Inv(), Mult() Nonlinear terms can be nested. e.g. for a UNIDIFF model: log µijk = αik + βjk + exp(γk)δij the exponentiated multiplier is specified as Mult(Exp(C), A:B) Multiple instances. e.g., Goodman’s RC(2) model: log µrc = αr + βc + γr1δc1 + γr2δc2 specified using: instances(Mult(A,B), 2) user-defined functions of class "nonlin" allow further extensions All of these are fully general, providing residuals, fitted values, etc.

36 / 67

slide-42
SLIDE 42

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Generalized nonlinear models: vcdExtra package

Provides glue, extending the vcd package visualization methods for glm and gnm models mosaic.glm() → mosaic methods for class "glm" and class "gnm" objects sieve.glm(), assoc.glm() → sieve diagrams and association plots Generalized residual types:

Pearson deviance standard (adjusted) — unit asymptotic variance

Model lists:

glmlist() — methods for collecting, summarizing and visualizing a list of related models Kway() — generate & fit models of form ~(A+B+...)k.

37 / 67

slide-43
SLIDE 43

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

38 / 67

slide-44
SLIDE 44

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Models for ordered categories

Consider an R × C table having ordered categories In many cases, the RC association may be described more simply by assigning numeric scores to the row & column categories. For simplicity, we consider only integer scores, 1, 2, . . . here These models are easily extended to stratified tables R:C model µRC

ij

df Formula Uniform association i × j × γ 1

i:j

Row effects αi × j (I − 1)

R:j

Col effects i × βj (J − 1)

i:C

Row+Col eff jαi + iβj I + J − 3

R:j + i:C

RC(1) φiψj × γ I + J − 3

Mult(R, C)

Unstructured (R:C) µRC

ij

(I − 1)(J − 1)

R:C

39 / 67

slide-45
SLIDE 45

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Example: Social mobility in US, UK & Japan

Data from Yamaguchi (1987): Cross-national comparison of

  • ccupational mobility in the U.S., U.K. and Japan. Re-analysis by

Xie (1992).

> Yama.tab <- xtabs(Freq ~ Father + Son + Country, data = Yamaguchi87) > structable(Country + Son ~ Father, Yama.tab[, , 1:2]) Country US UK Son UpNM LoNM UpM LoM Farm UpNM LoNM UpM LoM Farm Father UpNM 1275 364 274 272 17 474 129 87 124 11 LoNM 1055 597 394 443 31 300 218 171 220 8 UpM 1043 587 1045 951 47 438 254 669 703 16 LoM 1159 791 1323 2046 52 601 388 932 1789 37 Farm 666 496 1031 1632 646 76 56 125 295 191

See: demo("yamaguchi-xie", package="vcdExtra")

40 / 67

slide-46
SLIDE 46

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

First thought: try MCA

> library(ca) > Yama.dft <- expand.dft(Yamaguchi87) > yama.mjca <- mjca(Yama.dft) > plot(yama.mjca, what = c("none", "all"))

Yamaguchi data: Mobility in US, UK and Japan, MCA

Dim 1: Farm vs. Other (52.6%) Dim 2: Occ. Status (28.0%)

−0.2 0.0 0.2 0.4 0.6 0.8 1.0 −0.6 −0.4 −0.2 0.0 0.2 0.4

SonFarm SonLoM SonLoNM SonUpM SonUpNM FatherFarm FatherLoM FatherLoNM FatherUpM FatherUpNM CountryJapan CountryUK CountryUS
  • Dimensions seem to

have reasonable interpretations 2nd glance: do they? How do they relate to theories of social mobility? How to understand Country effects?

41 / 67

slide-47
SLIDE 47

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Models for stratified mobility tables

Baseline models: Perfect mobility: Freq ~(R+C)*L Quasi-perfect mobility: Freq ~(R+C)*L + Diag(R, C) Layer models: Homogeneous: no layer effects Heterogeneous: e.g., µRCL

ijk

= δRC

ij

exp(γL

k )

Extended models: Baseline ⊕ Layer model( R:C model ) Layer model R:C model Homogeneous log multiplicative Row effects

~.+ R:j ~.+ Mult(R:j, Exp(L))

Col effects

~.+ i:C ~.+ Mult(i:C, Exp(L))

Row+Col eff

~.+ R:j + i:C ~.+ Mult(R:j + i:C, Exp(L))

RC(1)

~.+ Mult(R, C) ~.+ Mult(R, C, Exp(L))

Full R:C

~.+ R:C ~.+ Mult(R:C, Exp(L)

42 / 67

slide-48
SLIDE 48

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Yamaguchi data: Baseline models

Minimal, null model asserts Father ⊥ Son | Country

> yamaNull <- gnm(Freq ~ (Father + Son) * Country, data = Yamaguchi87, + family = poisson) > mosaic(yamaNull, ~Country + Son + Father, condvars = "Country", ...)

−17.0 −4.0 −2.0 0.0 2.0 4.0 34.5 Pearson residuals: p−value = <2e−16

[FC][SC] Null [FS] association (perfect mobility)

Son's status Country Father's status Japan Farm LoM UpM LoNM UpNM UK Farm LoM UpM LoNM UpNM US UpNM LoNM UpM LoM Farm Farm LoM UpM LoNM UpNM

43 / 67

slide-49
SLIDE 49

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Yamaguchi data: Baseline models

But, theory → ignore diagonal cells

> yamaDiag <- update(yamaNull, ~. + Diag(Father, Son):Country) > mosaic(yamaDiag, ~Country + Son + Father, condvars = "Country", ...)

−11.9 −4.0 −2.0 0.0 2.0 4.0 17.1 Pearson residuals:

[FC][SC] Quasi perfect mobility, +Diag(F ,S)

Son's status Country Father's status Japan Farm LoM UpM LoNM UpNM UK Farm LoM UpM LoNM UpNM US UpNM LoNM UpM LoM Farm Farm LoM UpM LoNM UpNM

44 / 67

slide-50
SLIDE 50

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Yamaguchi data: Fit models for homogeneous association

gnm package makes it easy to fit collections of models, with simple update() methods

> Rscore <- as.numeric(Yamaguchi87$Father) > Cscore <- as.numeric(Yamaguchi87$Son) > yamaRo <- update(yamaDiag, ~. + Father:Cscore) > yamaCo <- update(yamaDiag, ~. + Rscore:Son) > yamaRpCo <- update(yamaDiag, ~. + Father:Cscore + Rscore:Son) > yamaRCo <- update(yamaDiag, ~. + Mult(Father, Son)) > yamaFIo <- update(yamaDiag, ~. + Father:Son)

Model Ro: homogeneous row effects, +Father:j Son's status Country Father's status Japan Farm LoM UpM LoNM UpNM UK Farm LoM UpM LoNM UpNM US UpNM LoNM UpM LoM Farm Farm LoM UpM LoNM UpNM Model Co: homogeneous col effects, +i:Son Son's status Country Father's status Japan Farm LoM UpM LoNM UpNM UK Farm LoM UpM LoNM UpNM US UpNM LoNM UpM LoM Farm Farm LoM UpM LoNM UpNM Model RCo: homogeneous RC(1) Son's status Country Father's status Japan Farm LoM UpM LoNM UpNM UK Farm LoM UpM LoNM UpNM US UpNM LoNM UpM LoM Farm Farm LoM UpM LoNM UpNM

45 / 67

slide-51
SLIDE 51

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Yamaguchi data: Models for heterogeneous association

Log-multiplicative (UNIDIFF) models:

> yamaRx <- update(yamaDiag, ~ . + Mult(Father:Cscore, Exp(Country))) > yamaCx <- update(yamaDiag, ~ . + Mult(Rscore:Son, Exp(Country))) > yamaRpCx <- update(yamaDiag, ~ . + Mult(Father:Cscore + + Rscore:Son, Exp(Country))) > yamaRCx <- update(yamaDiag, ~ . + Mult(Father,Son, Exp(Country))) > yamaFIx <- update(yamaDiag, ~ . + Mult(Father:Son, Exp(Country)))

GNM model methods: Summary methods: print(model), summary(model), . . . Extractor methods: coef(model), residuals(model), . . . Visualization: Diagnostics: plot(model) Mosaics, etc: mosaic(model)

46 / 67

slide-52
SLIDE 52

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Yamaguchi data: Comparing models

glmlist() and related methods facilitate model comparison

> models <- glmlist(yamaNull, yamaDiag, + yamaRo, yamaRx, yamaCo, yamaCx, yamaRpCo, + yamaRpCx, yamaRCo, yamaRCx, yamaFIo, yamaFIx) > summarise(models) Model Summary: LR Chisq Df Pr(>Chisq) AIC BIC yamaNull 5592 48 0.0000 5496 5099 yamaDiag 1336 33 0.0000 1270 997 yamaRo 156 29 0.0000 98 -142 yamaRx 148 27 0.0000 94 -130 yamaCo 68 29 0.0001 10 -230 yamaCx 59 27 0.0004 5 -219 yamaRpCo 39 26 0.0509

  • 13 -228

yamaRpCx 33 24 0.1034

  • 15 -213

yamaRCo 38 26 0.0642

  • 14 -229

yamaRCx 32 24 0.1240

  • 16 -214

yamaFIo 36 22 0.0288

  • 8 -190

yamaFIx 31 20 0.0560

  • 9 -174

47 / 67

slide-53
SLIDE 53

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Yamaguchi data: Comparing models

glmlist() and related methods facilitate model comparison

> BIC <- matrix(summarise(models)$BIC[-(1:2)], 5, 2, byrow = TRUE)

−220 −200 −180 −160 −140

Yamaguchi−Xie models: R:C model by Layer model Summary

Father−Son model BIC

  • row eff

col eff row+col RC(1) R:C homogeneous log multiplicative

Country model

Homogeneous models all preferred by BIC (Xie preferred heterogeneous models) Little diffce among Col, Row+Col and RC(1) models → R:C association ∼ Row scores (Father’s status)

48 / 67

slide-54
SLIDE 54

intro vcd gnm 3D

  • dds

References loglms and glms gnms

  • rdered

Yamaguchi data: Comparing models

glmlist() and related methods facilitate model comparison

> AIC <- matrix(summarise(models)$AIC[-(1:2)], 5, 2, byrow = TRUE)

−20 20 40 60 80 100

Yamaguchi−Xie models: R:C model by Layer model Summary

Father−Son model AIC

  • row eff

col eff row+col RC(1) R:C homogeneous log multiplicative

Country model

AIC prefers heterogeneous models Row+Col and RC(1) model fit best → R:C association ∼ Father’s status, not just scores Model summary plots provide sensitive comparisons!

49 / 67

slide-55
SLIDE 55

intro vcd gnm 3D

  • dds

References

3D mosaic displays

Loglinear models rely on log(nijk) ∼ linear model

→ nijk ∼ multiplicative model

Mosaic displays rely on (nested) use of Area = Height × Width to represent frequencies in n-way tables How to take this to 3D?

−4.19 −2.00 0.00 2.00 4.00 8.02 Pearson residuals: p−value = <2e−16

Mutual independence: ~Hair+Eye+Sex

Eye color Hair color Sex Blond FemaleMale Red Female Male Brown Female Male Black Brown Hazel Green Blue Female Male −4.19 −2.00 0.00 2.00 4.00 8.02 Pearson residuals: p−value = <2e−16

Mutual independence: Expected frequencies

Eye color Hair color Sex Blond FemaleMale Red Female Male Brown Female Male Black Brown Hazel Green Blue Female Male

50 / 67

slide-56
SLIDE 56

intro vcd gnm 3D

  • dds

References

3D mosaic displays

mosaic3d() in the vcdExtra package partitition unit cube → nested set of 3D tiles, Volume ∼ frequency uses rgl package: interactive, 3D graphs

> mosaic3d(HEC) > mosaic3d(HEC, type="expected")

51 / 67

slide-57
SLIDE 57

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

52 / 67

slide-58
SLIDE 58

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Log odds ratios

In any two-way, R × C table, all associations can be represented by a set of (R − 1) × (C − 1) odds ratios, θij = nij/ni+1,j ni,j+1/ni+1,j+1 = nij × ni+1,j+1 ni+1,j × ni,j+1 ln(θij) =

  • 1

−1 −1 1

  • ln
  • nij

ni+1,j ni,j+1 ni+1,j+1 T

53 / 67

slide-59
SLIDE 59

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Log odds ratios

In any two-way, R × C table, all associations can be represented by a set of (R − 1) × (C − 1) odds ratios, θij = nij/ni+1,j ni,j+1/ni+1,j+1 = nij × ni+1,j+1 ni+1,j × ni,j+1 ln(θij) =

  • 1

−1 −1 1

  • ln
  • nij

ni+1,j ni,j+1 ni+1,j+1 T

54 / 67

slide-60
SLIDE 60

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Log odds ratios

ln θij ∼ N(0, σ2), with estimated asymptotic standard error:

  • σ(ln θij) = (n−1

ij + n−1 i+1,j + n−1 i,j+1 + n−1 i+1,j+1)1/2

This extends naturally to θij | k in higher-way tables, stratified by one or more “control” variables. Many models have a simpler form expressed in terms of ln(θij).

e.g., Uniform association model ln(mij) = µ + λA

i + λB j + γaibj ≡ ln(θij) = γ

Direct visualization of log odds ratios permits more sensitive comparisons than area-based displays.

55 / 67

slide-61
SLIDE 61

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Models for log odds ratios: Computation

Consider an R × C × K1 × K2 × . . . frequency table nij···, with factors K1, K2 . . . considered as strata. Let n = vec(nij···) be the N × 1 vectorization of the table. Then, all log odds ratios and their asymptotic covariance matrix can be calculated as:

ln( θ) = C ln(n) S = Var[ln(θ)] = C diag(n)−1 CT

where C is an N-column matrix containing all zeros, except for two +1 elements and two −1 elements in each row. e.g., for a 2 × 2 table, C =

  • 1

−1 −1 1

  • With strata, C can be calculated as

C = CRC ⊗ IK1 ⊗ IK2 ⊗ · · · loddsratio() in vcdExtra package provides generic methods (coef(), vcov(), confint(), . . . )

56 / 67

slide-62
SLIDE 62

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Models for log odds ratios: Estimation

A log odds ratio linear model for the ln(θ) is ln(θ) = Xβ where X is the design matrix of covariates The (asymptotic) ML estimates β are obtained by GLS via

  • β =
  • XTS−1X

−1 XTS−1 ln θ where S = Var[ln(θ)] is the estimated covariance matrix → Standard diagnostic and graphical methods can be adapted to this case.

diagnostics: influence plots, added-variable plots, . . . visualization: effect plots, . . .

57 / 67

slide-63
SLIDE 63

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Outline

1

Introduction Brief History of VCD Visual overview

2

Generalized Mosaic Displays: vcd Package Extending mosaic-like displays The strucplot framework

3

Generalized Nonlinear Models: gnm & vcdExtra Packages Loglinear models and generalized linear models Generalized nonlinear models: gnm package Models for ordered categories

4

3D Mosaics: vcdExtra Package

5

Models and Visualization for Log Odds Ratios Log odds ratios Examples

58 / 67

slide-64
SLIDE 64

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Example: Breathlessness & Wheeze in Coal Miners

> fourfold(CoalMiners, mfcol = c(2, 4), fontsize = 18)

Wheeze: W Breathlessness: B Wheeze: NoW Breathlessness: NoB

Age: 25−29

23 105 9 1654 Wheeze: W Breathlessness: B Wheeze: NoW Breathlessness: NoB

Age: 30−34

54 177 19 1863 Wheeze: W Breathlessness: B Wheeze: NoW Breathlessness: NoB

Age: 35−39

121 257 48 2357 Wheeze: W Breathlessness: B Wheeze: NoW Breathlessness: NoB

Age: 40−44

169 273 54 1778 Wheeze: W Breathlessness: B Wheeze: NoW Breathlessness: NoB

Age: 45−49

269 324 88 1712 Wheeze: W Breathlessness: B Wheeze: NoW Breathlessness: NoB

Age: 50−54

404 245 117 1324 Wheeze: W Breathlessness: B Wheeze: NoW Breathlessness: NoB

Age: 55−59

406 225 152 967 Wheeze: W Breathlessness: B Wheeze: NoW Breathlessness: NoB

Age: 60−64

372 132 106 526

There is a strong + association at all ages But can you see the trend?

59 / 67

slide-65
SLIDE 65

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Example: Breathlessness & Wheeze in Coal Miners

> (lor.CM <- loddsratio(CoalMiners)) log odds ratios for Wheeze and Breathlessness by Age 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 3.695 3.398 3.141 3.015 2.782 2.926 2.441 2.638

Fit linear and quadratic models in Age using WLS:

> lor.CM.df <- as.data.frame(lor.CM) > age <- seq(25, 60, by = 5) > CM.mod1 <- lm(LOR ~ age, weights=1/ASE^2, data=lor.CM.df) > CM.mod2 <- lm(LOR ~ poly(age,2), weights=1/ASE^2, data=lor.CM.df) > anova(CM.mod1, CM.mod2) Analysis of Variance Table Model 1: LOR ~ age Model 2: LOR ~ poly(age, 2) Res.Df RSS Df Sum of Sq F Pr(>F) 1 6 6.34 2 5 5.60 1 0.742 0.66 0.45

60 / 67

slide-66
SLIDE 66

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Example: Breathlessness & Wheeze in Coal Miners

Plot log odds ratios and fitted regressions: The trend is now clear!

  • 2.5

3.0 3.5 4.0

CoalMiners data: Log odds ratio plot

Age Log odds ratio: Wheeze x Breathlessness 25−29 30−34 35−39 40−44 45−49 50−54 55−59 60−64

61 / 67

slide-67
SLIDE 67

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Attitudes toward corporal punishment

A four-way table, classifying 1,456 persons in Denmark (Punishment data in vcd package). Attitude: approves moderate punishment of children (moderate), or refuses any punishment (no) Memory: Person recalls having been punished as a child? Education: highest level (elementary, secondary, high) Age group: (15-24, 25-39, 40+)

Age 15–24 25–39 40+ Education Attitude Memory Yes No Yes No Yes No Elementary No 1 26 3 46 20 109 Moderate 21 93 41 119 143 324 Secondary No 2 23 8 52 4 44 Moderate 5 45 20 84 20 56 High No 2 26 6 24 1 13 Moderate 1 19 4 26 8 17

62 / 67

slide-68
SLIDE 68

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Attitudes toward corporal punishment

Fourfold plots: Association of Attitude with Memory

> cotabplot(punish, panel = cotab_fourfold)

age = 15−24 education = elementary

memory: yes attitude: no memory: no attitude: moderate 1 26 21 93

age = 25−39 education = elementary

memory: yes attitude: no memory: no attitude: moderate 3 46 41 119

age = 40+ education = elementary

memory: yes attitude: no memory: no attitude: moderate 20 109 143 324

age = 15−24 education = secondary

memory: yes attitude: no memory: no attitude: moderate 2 23 5 45

age = 25−39 education = secondary

memory: yes attitude: no memory: no attitude: moderate 8 52 20 84

age = 40+ education = secondary

memory: yes attitude: no memory: no attitude: moderate 4 44 20 56

age = 15−24 education = high

memory: yes attitude: no memory: no attitude: moderate 2 26 1 19

age = 25−39 education = high

memory: yes attitude: no memory: no attitude: moderate 6 24 4 26

age = 40+ education = high

memory: yes attitude: no memory: no attitude: moderate 1 13 8 17

63 / 67

slide-69
SLIDE 69

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Log odds ratio plot

> (lor.pun <- loddsratio(punish)) log odds ratios for memory and attitude by age, education education age elementary secondary high 15-24

  • 1.7700
  • 0.2451

0.3795 25-39

  • 1.6645
  • 0.4367

0.4855 40+

  • 0.8777
  • 1.3683 -1.8112

Attitudes toward corporal punishment

Education Log odds ratio: Attitude x Memory

elementary secondary high −3 −2 −1 1 2

  • 15−24

25−39 40+

Age

Structure now completely clear Little diffce between younger groups Opposite pattern for the 40+ Need to fit an LOR model to confirm appearences (SEs large) (These methods are under development)

64 / 67

slide-70
SLIDE 70

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Summary

Effective data analysis for categorical data depends on:

Flexible models, with syntax to specify possibly complex models — easily Flexible visualization tools to help understand data, models, lack of fit, etc. — easily

The vcd package provides very general visualization methods via the strucplot framework The gnm package extends the class of applicable models for contingency tables considerably

Parsimonious models for structured associations Multiplicative and other nonlinear terms

The vcdExtra package provides glue, and a testbed for new visualization methods

65 / 67

slide-71
SLIDE 71

intro vcd gnm 3D

  • dds

References Log odds ratios Examples

Further information

vcd Zeileis A, Meyer D & Hornik K (2006). The Strucplot Framework: Visualizing Multi-Way Contingency Tables with vcd. Journal of Statistical Software, 17(3), 1–48. http://www.jstatsoft.org/v17/i03/

vignette("strucplot", package="vcd").

gnm Turner H & Firth D (2010). Generalized nonlinear models in R: An overview of the gnm package. http://CRAN.R-project.org/package=gnm

vignette("gnmOverview", package="gnm").

vcdExtra Friendly M & others (2010). vcdExtra: vcd

  • additions. http:

//CRAN.R-project.org/package=vcdExtra.

vignette("vcd-tutorial").

66 / 67

slide-72
SLIDE 72

intro vcd gnm 3D

  • dds

References

References I

Friendly, M. (1994). Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, 89, 190–200. URL http://www.jstor.org/stable/2291215. Friendly, M. (2000). Visualizing Categorical Data. Cary, NC: SAS Institute. Friendly, M. (2002). A brief history of the mosaic display. Journal of Computational and Graphical Statistics, 11(1), 89–107. Hartigan, J. A. and Kleiner, B. (1981). Mosaics for contingency tables. In W. F. Eddy (Ed.), Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, (pp. 268–273). New York, NY: Springer-Verlag. Hartigan, J. A. and Kleiner, B. (1984). A mosaic of television ratings. The American Statistician, 38, 32–35. Zeileis, A., Meyer, D., and Hornik, K. (2007). Residual-based shadings for visualizing (conditional) independence. Journal of Computational and Graphical Statistics, 16(3), 507–525.

67 / 67