STAT 113 Describing Categorical Data Colin Reimer Dawson Oberlin - - PowerPoint PPT Presentation

SLIDE 1

Categorical Data Contingency Tables

STAT 113 Describing Categorical Data

Colin Reimer Dawson

Oberlin College

September 7, 2017

SLIDE 2

Frequency Tables

How can we display a data set with categorical values? One option is simply a frequency table.

Daily  Weekly  Monthly  Semesterly  N/R  Total
    9      28       18          23   13     91

Table: Results of a Survey of College Students on Frequency of Video Game Playing (via Nolan and Speed, 2000)

SLIDE 3

Relative Frequency Tables

If we use proportions or percentages, we have a relative frequency table.

 Daily  Weekly  Monthly  Semesterly     N/R   Total
0.0989  0.3077   0.1978      0.2527  0.1429  1.0000

Table: Results of a Survey of College Students on Frequency of Video Game Playing (via Nolan and Speed, 2000)
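
As a rough sketch of how the relative frequencies follow from the counts, the Python snippet below divides each count by the total. The variable names (counts, rel_freq) are made up for this illustration; only the numbers come from the frequency table.

```python
# Counts from the video game survey frequency table.
counts = {"Daily": 9, "Weekly": 28, "Monthly": 18, "Semesterly": 23, "N/R": 13}

total = sum(counts.values())  # number of respondents

# Relative frequency = count in the category / total count.
rel_freq = {category: n / total for category, n in counts.items()}

for category, p in rel_freq.items():
    print(f"{category:<10} {p:.4f}")
print(f"{'Total':<10} {sum(rel_freq.values()):.4f}")
```

The relative frequencies should sum to 1 (up to rounding), which is a quick check that no category was dropped.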

SLIDE 4

Graphical Displays

Sometimes a chart is more effective than a table.

Figure: http://www.xkcd.com

SLIDE 5

Pie Charts

The pie chart is a popular choice for proportion data...

[Pie chart slices: Daily (9.9%), Weekly (30.8%), Monthly (19.8%), Semesterly (25.3%), N/R (14.3%)]

Figure: Results of a Survey of College Students on Frequency of Video Game Playing (via Nolan and Speed, 2000)

SLIDE 6

Pie Charts...

Figure: http://www.businessinsider.com/pie-charts-are-the-worst-2013-6

  • “Pie charts are the Nickelback of data visualization.”
  • “Pie charts are the Aquaman of data visualization.”

SLIDE 7

Bar Charts vs. Pie Charts

  • Pie charts fail to convey anything useful with more than maybe 4 categories.
  • Even with few categories, it’s difficult to judge differences in slice size.
  • People are better at judging linear size than area.

SLIDE 8

Bar Chart > Pie Chart

  • Even in situations ideally suited to pie charts, it’s probably still better to use bar charts.

SLIDE 9

The Verdict on Pie Charts

SLIDE 10

The One Exception

SLIDE 11

Bar Charts

Much easier to see differences between categories.

[Bar plots. Categories: Daily, Weekly, Monthly, Semesterly, N/R. Vertical axis of the right panel: % of respondents, 0 to 50.]

Figure: Two bar plots of Video Game data showing frequency (left) and percentages (right)
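
To make the idea concrete without the original figure, here is a minimal text-only sketch of a bar chart for the percentage data. The function name text_bar_chart and its parameters are invented for this example, and a plotting library would normally be used instead. Each bar starts at 0 and its length is proportional to the value.

```python
# Percentages of respondents, as labeled on the pie chart slide.
percentages = {
    "Daily": 9.9,
    "Weekly": 30.8,
    "Monthly": 19.8,
    "Semesterly": 25.3,
    "N/R": 14.3,
}


def text_bar_chart(values, width=50, max_value=50):
    """Return one text line per category.

    Bars start at 0 and their length is proportional to the value:
    `width` characters correspond to `max_value`.
    """
    lines = []
    for label, value in values.items():
        bar = "#" * round(value / max_value * width)
        lines.append(f"{label:<10} | {bar} {value:.1f}%")
    return lines


chart = text_bar_chart(percentages)
print("\n".join(chart))
```

With the default settings one character stands for one percentage point, so differences between categories can be read directly off the bar lengths.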

SLIDE 12

Bar Charts

What’s this bar chart telling us?

Figure: A Fair and Balanced Bar Chart (from FOX News, 8/9/12)

SLIDE 13

Bar Chart Tips

  • The cardinal rule of bar charts: Ratios in area = ratios in value
  • The y-axis must start at 0!
  • Equal distances = equal differences
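
The effect of a truncated axis can be sketched numerically. In the snippet below, the two values and the baseline of 34 are hypothetical numbers chosen only for illustration, and apparent_ratio is a helper made up for this example. It compares the ratio of the drawn bar heights to the ratio of the underlying values.

```python
def apparent_ratio(a, b, baseline):
    """Ratio of the drawn bar heights for values a and b
    when the y-axis starts at `baseline` instead of 0."""
    return (b - baseline) / (a - baseline)


# Hypothetical values, chosen only for illustration.
a, b = 35.0, 39.6

true_ratio = b / a                      # ratio of the actual values
honest = apparent_ratio(a, b, 0.0)      # y-axis starts at 0
truncated = apparent_ratio(a, b, 34.0)  # y-axis starts at 34

print(f"true ratio of values:        {true_ratio:.2f}")
print(f"bar ratio, axis from 0:      {honest:.2f}")
print(f"bar ratio, axis from 34:     {truncated:.2f}")
```

Only the axis that starts at 0 keeps the ratio of bar sizes equal to the ratio of values. With the truncated axis, the second bar looks several times larger even though the values differ by only about 13%.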

SLIDE 14

Summary

Categorical Data

  • Count occurrences of each value. Represent counts with a frequency table, proportions with a relative frequency table.

  • Pie charts are pretty useless. Prefer bar charts.
  • Cardinal rule of bar charts: Ratios in area = ratios in value
  • The y-axis must start at 0!
  • Equal distances = equal differences

SLIDE 15

Contingency Tables

  • Recall: with one categorical variable, we summarized by counting the observations in each category.
  • With more than one variable, we do the same thing, but we keep track of combinations.
  • With two variables, we can store the counts in a two-way table (also known as a contingency table).

SLIDE 16

A Simple Contingency Table

Student  Sex  Computer
1        M    PC
2        F    Mac
3        F    PC
4        M    PC
5        F    PC
6        F    Mac
7        M    Mac
8        M    PC

⇒

       Computer
Sex    PC  Mac
M       3    1
F       2    2
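
The step from raw cases to a two-way table can be sketched in a few lines of Python. This is only an illustration using the standard library's collections.Counter; the names students and table are made up here, and the data are the eight students listed above.

```python
from collections import Counter

# One (sex, computer) pair per student, in the order listed above.
students = [
    ("M", "PC"), ("F", "Mac"), ("F", "PC"), ("M", "PC"),
    ("F", "PC"), ("F", "Mac"), ("M", "Mac"), ("M", "PC"),
]

# Count how often each combination of values occurs.
table = Counter(students)

# Print the counts laid out as a two-way table.
computers = ["PC", "Mac"]
print("Sex  " + "  ".join(f"{c:>3}" for c in computers))
for sex in ["M", "F"]:
    row = "  ".join(f"{table[(sex, c)]:>3}" for c in computers)
    print(f"{sex:<5}{row}")
```

Each cell of the contingency table is simply the count of one (sex, computer) combination.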

SLIDE 17

Proportions in a Context

Sometimes we want to ask about proportions within particular subsets of the data.

Example: Driving While Black/Brown

Armentrout et al. (2007) [1] report data on a variety of outcomes related to traffic stops by the Los Angeles Police Department (LAPD). Two of the variables recorded are the racial category of the driver and whether or not the vehicle was searched.

Question of interest: Does the proportion of stops that lead to a search differ across racial categories?

[1] Armentrout, M., Goodrich, A., Nguyen, J., Ortega, L., Smith, L., & Khadjavi, L.S. (2007). Cops and stops: Racial profiling and a preliminary statistics analysis of Los Angeles police department traffic stops and searches. Retrieved from http://www.public.asu.edu/~etcamach/AMSSI/reports/copsnstops.pdf

SLIDE 18

Proportions in a Context

Question of interest: Does the proportion of stops that lead to a search differ across racial categories?

              Hisp./Lat.  White  Black  Asian  Others  Total
Searched             510    109    240     16       7    882
Not Searched        1826   2081   1008    486     104   5505
Total               2336   2190   1248    502     111   6387

Pairs (3 min.):

  • Identify the cases and the population the cases are drawn from.
  • How would you address this question using this data?

SLIDE 19

Conditional Proportions

              Hisp./Lat.  White  Black  Asian  Others  Total
Searched             510    109    240     16       7    882
Not Searched        1826   2081   1008    486     104   5505
Total               2336   2190   1248    502     111   6387

  • We can group the cases according to one variable (e.g., driver race), and look at the distribution of the other (searched or not) within each group.
  • The resulting proportions are called conditional proportions: proportions are computed within a context, i.e., cases that satisfy a certain condition.

SLIDE 20

Conditional Proportions

                       Searched
Driver Race       Yes         No  Total
Hisp./Lat.   510/2336  1826/2336   2336
White        109/2190  2081/2190   2190
Black        240/1248  1008/1248   1248
Asian          16/502    486/502    502
Others          7/111    104/111    111
Total        882/6387  5505/6387   6387

SLIDE 21

Conditional Proportions

                Searched
Driver Race    Yes     No  Total
Hisp./Lat.   0.218  0.782   2336
White        0.050  0.950   2190
Black        0.192  0.808   1248
Asian        0.032  0.968    502
Others       0.063  0.937    111
Total        0.138  0.862   6387
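
The conditional proportions above can be reproduced from the counts with a short Python sketch. The dictionary names (searched, not_searched, cond) are invented for this illustration. The counts are taken from the traffic stop table, and each proportion is computed within one racial category.

```python
# Counts from the LAPD traffic stop table.
searched = {"Hisp./Lat.": 510, "White": 109, "Black": 240, "Asian": 16, "Others": 7}
not_searched = {"Hisp./Lat.": 1826, "White": 2081, "Black": 1008, "Asian": 486, "Others": 104}

# Conditional proportion of "searched" given driver race:
# divide by the number of stops in that racial category only.
cond = {}
for race in searched:
    stops = searched[race] + not_searched[race]
    cond[race] = searched[race] / stops

# Overall (marginal) proportion of stops that led to a search.
all_searched = sum(searched.values())
all_stops = all_searched + sum(not_searched.values())
overall = all_searched / all_stops

for race, p in cond.items():
    print(f"{race:<11} {p:.3f}  {1 - p:.3f}")
print(f"{'Total':<11} {overall:.3f}  {1 - overall:.3f}")
```

Comparing each group's proportion with the overall proportion is one way to start answering the question of interest.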

SLIDE 22

Conditional Proportions to Measure Association

Pairs, 1 min.: What do these conditional proportions tell us?

              Grade Expected
Frequency     A     B     C  Total
Daily      0.56  0.33  0.11      9
Weekly     0.50  0.50  0.00     28
Monthly    0.11  0.67  0.22     18
Sem’ly     0.30  0.61  0.09     23
Total      0.36  0.55  0.09     78

Table: Results of a survey of college students on frequency of video game playing and expected grade in a stats class (Nolan and Speed, 2000)

SLIDE 23

A Three-Way Table

Figure: Religious Affiliation by Political Party, 2006 vs 2016 (via 538)

SLIDE 24

Religious and Political Affiliation

“In 2016, a whopping 35 percent of Republicans were white evangelical Protestants, 18 percent were white mainline Protestants, and 16 percent were white Catholics; together, those groups account for nearly 70 percent of the Republican base. But since 2006, the proportion of Americans identifying as white evangelical Protestant, white Catholic, and white mainline Protestant have all dropped by 5 or 6 percentage points.”

From “America’s Shifting Religious Makeup Could Spell Trouble For Both Parties,” FiveThirtyEight.com

SLIDE 25

Age Breakdown by Religion

“Religious minorities like Muslims, Hindus and Buddhists; the religiously unaffiliated; and Hispanic Protestants and Catholics all have significant numbers of followers under the age of 30. And all of these groups disproportionately identify as Democrats. This youth and diversity might seem like a gift to the Democratic Party, but it also presents a serious challenge for politicians hoping to present a compelling vision to voters who have a wide range of values and priorities.”

SLIDE 26

Example: Medical Diagnosis

A test for a rare disease (affecting about 1 in 10,000 people) has been developed and shown to have high accuracy: 99% of those with the disease test positive, and 99% of those without it test negative.

Pairs (3 min.): If you test positive, what are the chances you have the disease?

Hint: Construct a contingency table with 1,000,000 people who may or may not have the disease and may or may not test positive.
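
One way to follow the hint is to fill in the table with a few lines of Python. This sketch assumes the rates apply exactly to a population of 1,000,000 (exactly 1 in 10,000 have the disease, exactly 99% of each group are classified correctly). The variable names are made up for this illustration.

```python
population = 1_000_000

# Row totals: 1 in 10,000 people have the disease.
diseased = population // 10_000   # 100
healthy = population - diseased   # 999,900

# 99% of those with the disease test positive.
true_pos = diseased * 99 // 100   # 99
false_neg = diseased - true_pos   # 1

# 99% of those without the disease test negative.
true_neg = healthy * 99 // 100    # 989,901
false_pos = healthy - true_neg    # 9,999

# Conditional proportion: disease given a positive test.
all_pos = true_pos + false_pos
p_disease_given_pos = true_pos / all_pos

print(f"{'':<12}{'Positive':>10}{'Negative':>10}{'Total':>10}")
print(f"{'Disease':<12}{true_pos:>10}{false_neg:>10}{diseased:>10}")
print(f"{'No disease':<12}{false_pos:>10}{true_neg:>10}{healthy:>10}")
print(f"P(disease | positive) = {true_pos}/{all_pos} = {p_disease_given_pos:.4f}")
```

Under these assumptions, fewer than 1 in 100 people who test positive actually have the disease, because the false positives among the very large healthy group far outnumber the true positives.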

SLIDE 27

Summary

Two Categorical Variables

  • A two-way table, called a contingency table, contains the number of times each combination of values appears.
  • The conditional proportion of x given y is the proportion of the time x occurs in the context of y.
  • Conditional proportions are not symmetric. Direction of conditioning can make a big difference to the apparent pattern when base rates are very different.
  • We can use stacked or grouped bar plots to visualize joint frequencies or conditional proportions.
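
The asymmetry of conditional proportions can be checked against the traffic stop table from earlier in the deck. The sketch below uses the counts for one cell and its row and column totals; the variable names are made up for this illustration.

```python
# Counts from the LAPD traffic stop table.
black_and_searched = 240   # stops of Black drivers that led to a search
black_total = 1248         # all stops of Black drivers (column total)
searched_total = 882       # all stops that led to a search (row total)

# Same joint count, two different contexts (denominators).
p_searched_given_black = black_and_searched / black_total
p_black_given_searched = black_and_searched / searched_total

print(f"P(searched | Black)  = {p_searched_given_black:.3f}")
print(f"P(Black | searched)  = {p_black_given_searched:.3f}")
```

The numerator is the same in both cases. Only the denominator changes, and the two proportions answer different questions.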