Show the Right Numbers: ggplot's Flow of Action



slide-1
SLIDE 1

Show the Right Numbers

slide-2
SLIDE 2

ggplot’s FLOW OF ACTION

slide-3
SLIDE 3

Data, Aesthetic Mappings, Geoms and/or Stats: we always have to specify these to draw a plot.

Coordinates and Scales, Guides, Themes: will be handled automatically unless we say otherwise.
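A minimal sketch of this flow, using the midwest data that ships with ggplot2. The first plot supplies only the three required pieces; the second overrides some of the defaults that would otherwise be handled automatically. The object names p_min and p_custom are illustrative and not from the slides.

```r
library(ggplot2)

# Required: data, aesthetic mappings, and a geom.
p_min <- ggplot(data = midwest,
                mapping = aes(x = percollege, y = percbelowpoverty)) +
  geom_point()

# Optional: scales, guides (labels), and themes are filled in
# automatically, but any of them can be overridden.
p_custom <- p_min +
  scale_x_log10() +
  labs(x = "Percent college educated", y = "Percent below poverty") +
  theme_minimal()

p_custom
```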
slide-4
SLIDE 4

Grouped Data and the group aesthetic

slide-5
SLIDE 5

p + geom_line(color = "gray70", mapping = aes(group = country)) +
    geom_smooth(size = 1.1, method = "loess", se = FALSE) +
    scale_y_log10(labels = scales::dollar) +
    facet_wrap(~ continent, ncol = 5) +
    labs(x = "Year",
         y = "GDP per capita",
         title = "GDP per capita on Five Continents")

The labs() function lets you name labels, title, subtitle, etc.

slide-6
SLIDE 6
slide-7
SLIDE 7

geoms CAN TRANSFORM DATA

slide-8
SLIDE 8

gss_sm

A subset of General Social Survey Questions from 2016

slide-9
SLIDE 9

with(gss_sm, table(religion))
##
## Protestant   Catholic     Jewish       None      Other
##       1371        649         51        619        159

slide-10
SLIDE 10

Just the one aesthetic mapping, to x.

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar()

slide-11
SLIDE 11
slide-12
SLIDE 12

The y-axis variable, count, is not in the data. Instead, ggplot has calculated it for us. It does this using the default stat_ function associated with geom_bar(), stat_count(). This function can compute two new variables, count and prop (short for proportion). The count statistic is the default one used.
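One way to see the computed variables is to ask ggplot for the data behind a layer. The sketch below uses the midwest data from ggplot2 rather than gss_sm so that it runs without extra packages; the object name computed is illustrative.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = state))

# layer_data() returns the data frame that the stat computed for a layer.
computed <- layer_data(p + geom_bar())

# `count` and `prop` were created by stat_count(); neither is in midwest.
computed[, c("x", "count", "prop")]
```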

slide-13
SLIDE 13

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))

slide-14
SLIDE 14
slide-15
SLIDE 15

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))
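Why group = 1 matters: with a discrete x, each bar is treated as its own group, so each proportion is computed within a single bar. A sketch using the midwest data from ggplot2 (an assumption made so the example is self-contained). It uses after_stat(prop), which is how ggplot2 3.3.0 and later spell ..prop..; the object names are illustrative.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = state))

# Without a group, each bar is its own group, so every proportion is 1.
no_group <- layer_data(p + geom_bar(mapping = aes(y = after_stat(prop))))

# With group = 1, proportions are taken over all rows and sum to 1.
one_group <- layer_data(
  p + geom_bar(mapping = aes(y = after_stat(prop), group = 1))
)

no_group$prop
one_group$prop
```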

slide-16
SLIDE 16
slide-17
SLIDE 17

p + geom_bar()
p + stat_count()

geom_ functions call their default stat_ functions behind the scenes. (And vice versa)
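A quick check of that equivalence, sketched with the midwest data from ggplot2 (an assumption; the slides use gss_sm). The object names are illustrative.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = state))

from_geom <- layer_data(p + geom_bar())
from_stat <- layer_data(p + stat_count())

# Both layers compute the same counts for the same bars.
identical(from_geom$count, from_stat$count)
```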

slide-18
SLIDE 18

p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar()

p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar()

p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill = "none")

p <- ggplot(data = gss_sm, mapping = aes(x = religion, color = religion))
p + geom_bar()

slide-19
SLIDE 19

p <- ggplot(data = gss_sm, mapping = aes(x = religion, color = religion))
p + geom_bar()

p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill = "none")

slide-20
SLIDE 20

HISTOGRAMS & KERNEL DENSITIES

slide-21
SLIDE 21

midwest

County-Level Census Data for Midwestern States

slide-22
SLIDE 22

p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram()

## `stat_bin()` using `bins = 30`.
## Pick better value with `binwidth`.

The default stat for this geom has to make a choice, and is letting us know we might want to override it.
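A sketch of the two ways to override that choice: set the number of bins, or set the width of each bin. The binwidth of 0.01 is an arbitrary value picked for illustration, and the object names are not from the slides.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = area))

# Choose the number of bins ...
by_bins <- layer_data(p + geom_histogram(bins = 10))

# ... or the width of each bin, in the units of x.
by_width <- layer_data(p + geom_histogram(binwidth = 0.01))

# Either way, every county lands in exactly one bin.
c(sum(by_bins$count), sum(by_width$count), nrow(midwest))
```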

slide-23
SLIDE 23
slide-24
SLIDE 24

p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram(bins = 10)

slide-25
SLIDE 25
slide-26
SLIDE 26

oh_wi <- c("OH", "WI")

p <- ggplot(data = subset(midwest, state %in% oh_wi),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(position = "identity", alpha = 0.4, bins = 20)

subset(): subset our data on the fly

%in%: a convenient, built-in operator

position = "identity": just plot x by its values on the scale, don’t stack or dodge
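A small sketch of what %in% and subset() do here, run on their own. The name oh_wi_counties is illustrative.

```r
library(ggplot2)  # provides the midwest data

oh_wi <- c("OH", "WI")

# %in% returns TRUE for each left-hand element found on the right.
c("IL", "OH", "WI") %in% oh_wi

# subset() keeps only the rows where the condition is TRUE.
oh_wi_counties <- subset(midwest, state %in% oh_wi)
sort(unique(oh_wi_counties$state))
```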
slide-27
SLIDE 27
slide-28
SLIDE 28

p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_density()

geom_histogram()’s continuous counterpart, geom_density()

slide-29
SLIDE 29
slide-30
SLIDE 30

p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3)

slide-31
SLIDE 31
slide-32
SLIDE 32

AVOIDING TRANSFORMATIONS WHEN NECESSARY

slide-33
SLIDE 33

> titanic
##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6

No counting up required? Then stat = "identity"

slide-34
SLIDE 34

p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent, fill = sex))
p + geom_bar(stat = "identity", position = "dodge") +
    theme(legend.position = "top")

The theme() function controls parts of the plot that don’t belong to its “grammatical” structure

slide-35
SLIDE 35

p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent, fill = sex))
p + geom_col(position = "dodge") +
    theme(legend.position = "top")

Even better: for convenience when not counting up, just use geom_col()
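A sketch showing that the two spellings draw the same bars. The table is rebuilt by hand from the numbers on the slide, using sex as the column name to match the plotting code; titanic_tbl, with_bar and with_col are illustrative names.

```r
library(ggplot2)

# Summary table rebuilt from the slide; the object name is illustrative.
titanic_tbl <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(data = titanic_tbl,
            mapping = aes(x = fate, y = percent, fill = sex))

with_bar <- layer_data(p + geom_bar(stat = "identity", position = "dodge"))
with_col <- layer_data(p + geom_col(position = "dodge"))

# Bar heights are the percent values themselves, whichever geom is used.
identical(with_bar$y, with_col$y)
```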

slide-36
SLIDE 36
slide-37
SLIDE 37
oecd_sum

## # A tibble: 57 x 5
## # Groups:   year [57]
##     year other   usa  diff hi_lo
##    <int> <dbl> <dbl> <dbl> <chr>
##  1  1960  68.6  69.9 1.30  Below
##  2  1961  69.2  70.4 1.20  Below
##  3  1962  68.9  70.2 1.30  Below
##  4  1963  69.1  70.0 0.900 Below
##  5  1964  69.5  70.3 0.800 Below
##  6  1965  69.6  70.3 0.700 Below
##  7  1966  69.9  70.3 0.400 Below
##  8  1967  70.1  70.7 0.600 Below
##  9  1968  70.1  70.4 0.300 Below
## 10  1969  70.1  70.6 0.500 Below
## # ... with 47 more rows

slide-38
SLIDE 38

p <- ggplot(data = oecd_sum,
            mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() + guides(fill = "none") +
    labs(x = NULL,
         y = "Difference in Years",
         title = "The US Life Expectancy Gap",
         subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
         caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 27th 2017.")

slide-39
SLIDE 39
slide-40
SLIDE 40

CROSSTABULATION THE AWKWARD WAY

slide-41
SLIDE 41

WARNING!

There’s nothing wrong with the code on the next few slides. If you go searching online for how to make a proportional bar chart with ggplot, you’ll see answers like this. But doing it this way is confusing, and I find it much easier to work in a slightly different way. So I won’t cover this approach in class. I’m including it here so you can see why it’s awkward.

slide-42
SLIDE 42

p <- ggplot(data = gss_sm, mapping = aes(x = religion, color = religion))
p + geom_bar()

p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill = "none")

slide-43
SLIDE 43

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()

Counts are easy

slide-44
SLIDE 44

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")

Position adjustments don’t give us the view we want

slide-45
SLIDE 45

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop..))

slide-46
SLIDE 46

Nope

slide-47
SLIDE 47

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = religion))

slide-48
SLIDE 48

Still not right!

Also: hard to read

slide-49
SLIDE 49

p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = bigregion)) +
    facet_wrap(~ bigregion, ncol = 2)

Time to take a step back

slide-50
SLIDE 50
slide-51
SLIDE 51

SURELY THINGS CAN BE EASIER THAN THIS?

slide-52
SLIDE 52

TRANSFORM AND SUMMARIZE FIRST THEN SEND CLEAN TABLES TO ggplot

slide-53
SLIDE 53

CROSSTABULATION

slide-54
SLIDE 54

Column percents / Column Marginals

          Protestant Catholic Jewish  None Other    NA
Northeast       11.5     25.0   52.9  18.1  17.6   5.6
Midwest         23.7     26.5    5.9  25.4  20.8  27.8
South           47.4     24.7   21.6  27.5  31.4  61.1
West            17.4     23.9   19.6  29.1  30.2   5.6
Total            100      100    100   100   100   100

Row percents / Row Marginals

          Protestant Catholic Jewish  None Other    NA Total
Northeast       32.4     33.2    5.5  23.0   5.7   0.2   100
Midwest         46.8     24.7    0.4  22.6   4.7   0.7   100
South           61.8     15.2    1.0  16.2   4.8   1.0   100
West            37.7     24.5    1.6  28.5   7.6   0.2   100

Total percents

          Protestant Catholic Jewish  None Other
Northeast        5.5      5.7    0.9   3.9     1
Midwest         11.3        6    0.1   5.5   1.2
South           22.7      5.6    0.4   5.9   1.7
West             8.3      5.4    0.3   6.3   1.7
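The three kinds of percentage differ only in what each cell is divided by. A base R sketch on a small made-up table of counts (the numbers and the names counts, col_pct, row_pct and total_pct are invented for illustration, not taken from the GSS):

```r
# A small made-up table of counts (rows: region, columns: religion).
counts <- matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(region = c("North", "South"),
                  religion = c("A", "B"))
)

col_pct   <- 100 * prop.table(counts, margin = 2)  # each column sums to 100
row_pct   <- 100 * prop.table(counts, margin = 1)  # each row sums to 100
total_pct <- 100 * prop.table(counts)              # all cells sum to 100

round(row_pct, 1)
```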

slide-55
SLIDE 55

dplyr lets you manipulate tables in a series of steps, or pipeline

slide-56
SLIDE 56

dplyr draws on the logic of database queries, where the focus is managing and summarizing tables

slide-57
SLIDE 57

group_by()

Group the data at the level we want, such as "Religion by Region" or "Authors by Publications by Year".

filter() rows, select() columns

Filter or select pieces of the data. This gets us the subset of the table we want to work on.

mutate()

Mutate the data by creating new variables at the current level of grouping. Mutating adds new columns to the table.

summarize()

Summarize the grouped data. This creates new variables at a higher level of grouping. For example, we might calculate means with mean() or counts with n(). This results in a smaller, summary table, which we might do more things with if we want.
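A sketch of the four verbs on a tiny made-up table. The data and the names survey, adults_40, centered and by_region are invented for illustration.

```r
library(dplyr)

# Made-up data for illustration only.
survey <- tibble(
  region   = c("North", "North", "North", "South", "South", "South"),
  religion = c("A", "A", "B", "A", "B", "B"),
  age      = c(30, 40, 50, 20, 60, 70)
)

# filter() keeps rows; select() keeps columns.
adults_40 <- survey %>%
  filter(age >= 40) %>%
  select(region, age)

# mutate() adds a column at the current grouping level; rows are unchanged.
centered <- survey %>%
  group_by(region) %>%
  mutate(age_diff = age - mean(age))

# summarize() collapses each group to one row.
by_region <- survey %>%
  group_by(region) %>%
  summarize(n = n(), mean_age = mean(age))

by_region
```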

slide-58
SLIDE 58

%>%

Create a pipeline of tabular transformations with the pipe operator
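A sketch of what the pipe does: it passes the result on its left as the first argument of the function on its right, so a nested call can be read top to bottom instead of inside out. The vector x and the names nested and piped are illustrative.

```r
library(dplyr)

x <- c(4, 9, 16)

# These two lines do the same thing: the pipe passes its left-hand side
# as the first argument of the function on its right.
nested <- round(mean(sqrt(x)), 1)
piped  <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```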

slide-59
SLIDE 59

REORGANIZING TABLES WITH dplyr

slide-60
SLIDE 60

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(n = n()) %>%
    mutate(freq = n / sum(n),
           pct = round((freq*100), 1))

slide-61
SLIDE 61

rel_by_region <- gss_sm

> rel_by_region
# A tibble: 2,867 x 32
    year    id ballot   age childs  sibs degree race  sex   region income16 relig marital padeg madeg partyid polviews
   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <fct>  <fct> <fct> <fct>  <fct>    <fct> <fct>   <fct> <fct> <fct>   <fct>
 1  2016     1      1    47      3     2 Bache… White Male  New E… $170000… None  Married Grad… High… Indepe… Moderate
 2  2016     2      2    61      0     3 High … White Male  New E… $50000 … None  Never … Lt H… High… Ind,ne… Liberal
 3  2016     3      3    72      2     3 Bache… White Male  New E… $75000 … Cath… Married High… Lt H… Not St… Conserv…
 4  2016     4      1    43      4     3 High … White Fema… New E… $170000… Cath… Married NA    High… Not St… Moderate
 5  2016     5      3    55      2     2 Gradu… White Fema… New E… $170000… None  Married Bach… High… Not St… Slightl…
 6  2016     6      2    53      2     2 Junio… White Fema… New E… $60000 … None  Married NA    High… Not St… Slightl…
 7  2016     7      1    50      2     2 High … White Male  New E… $170000… None  Married High… High… Not St… Slightl…
 8  2016     8      3    23      3     6 High … Other Fema… Middl… $30000 … Cath… Married Lt H… Lt H… Ind,ne… Slightl…
 9  2016     9      1    45      3     5 High … Black Male  Middl… $60000 … Prot… Married Lt H… Lt H… Strong… NA
10  2016    10      3    71      4     1 Junio… White Male  Middl… $60000 … None  Divorc… High… High… Strong… Conserv…
# … with 2,857 more rows, and 15 more variables: happy <fct>, partners <fct>, grass <fct>, zodiac <fct>, pres12 <dbl>,
#   wtssall <dbl>, income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>
slide-62
SLIDE 62

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion)

> rel_by_region
# A tibble: 2,867 x 32
# Groups:   bigregion, religion [24]
    year    id ballot   age childs  sibs degree race  sex   region income16 relig marital padeg madeg partyid polviews
   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <fct>  <fct> <fct> <fct>  <fct>    <fct> <fct>   <fct> <fct> <fct>   <fct>
 1  2016     1      1    47      3     2 Bache… White Male  New E… $170000… None  Married Grad… High… Indepe… Moderate
 2  2016     2      2    61      0     3 High … White Male  New E… $50000 … None  Never … Lt H… High… Ind,ne… Liberal
 3  2016     3      3    72      2     3 Bache… White Male  New E… $75000 … Cath… Married High… Lt H… Not St… Conserv…
 4  2016     4      1    43      4     3 High … White Fema… New E… $170000… Cath… Married NA    High… Not St… Moderate
 5  2016     5      3    55      2     2 Gradu… White Fema… New E… $170000… None  Married Bach… High… Not St… Slightl…
 6  2016     6      2    53      2     2 Junio… White Fema… New E… $60000 … None  Married NA    High… Not St… Slightl…
 7  2016     7      1    50      2     2 High … White Male  New E… $170000… None  Married High… High… Not St… Slightl…
 8  2016     8      3    23      3     6 High … Other Fema… Middl… $30000 … Cath… Married Lt H… Lt H… Ind,ne… Slightl…
 9  2016     9      1    45      3     5 High … Black Male  Middl… $60000 … Prot… Married Lt H… Lt H… Strong… NA
10  2016    10      3    71      4     1 Junio… White Male  Middl… $60000 … None  Divorc… High… High… Strong… Conserv…
# … with 2,857 more rows, and 15 more variables: happy <fct>, partners <fct>, grass <fct>, zodiac <fct>, pres12 <dbl>,
#   wtssall <dbl>, income_rc <fct>, agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
#   bigregion <fct>, partners_rc <fct>, obama <dbl>
slide-63
SLIDE 63

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(n = n())

> rel_by_region
# A tibble: 24 x 3
# Groups:   bigregion [4]
   bigregion religion       n
   <fct>     <fct>      <int>
 1 Northeast Protestant   158
 2 Northeast Catholic     162
 3 Northeast Jewish        27
 4 Northeast None         112
 5 Northeast Other         28
 6 Northeast NA             1
 7 Midwest   Protestant   325
 8 Midwest   Catholic     172
 9 Midwest   Jewish         3
10 Midwest   None         157
# … with 14 more rows

n(): a function to count how many items there are in the current group

n: the result of the calculation

slide-64
SLIDE 64

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(n = n()) %>%
    mutate(freq = n / sum(n),
           pct = round((freq*100), 1))

> rel_by_region
# A tibble: 24 x 5
# Groups:   bigregion [4]
   bigregion religion       n    freq   pct
   <fct>     <fct>      <int>   <dbl> <dbl>
 1 Northeast Protestant   158 0.324    32.4
 2 Northeast Catholic     162 0.332    33.2
 3 Northeast Jewish        27 0.0553    5.5
 4 Northeast None         112 0.230    23
 5 Northeast Other         28 0.0574    5.7
 6 Northeast NA             1 0.00205   0.2
 7 Midwest   Protestant   325 0.468    46.8
 8 Midwest   Catholic     172 0.247    24.7
 9 Midwest   Jewish         3 0.00432   0.4
10 Midwest   None         157 0.226    22.6
# … with 14 more rows

mutate() operations add columns to existing tables

slide-65
SLIDE 65

Objects in a pipeline carry forward some assumptions about context

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(n = n()) %>%
    mutate(freq = n / sum(n),
           pct = round((freq*100), 1))

slide-66
SLIDE 66

Grouping with group_by() carries forward; summary calculations are applied to the innermost group

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(n = n()) %>%
    mutate(freq = n / sum(n),
           pct = round((freq*100), 1))
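A sketch of how the grouping carries forward, on a tiny made-up table (the data and the names survey, counts and freqs are invented for illustration). The .groups = "drop_last" argument, available in dplyr 1.0.0 and later, only spells out the default behavior so that no message is printed.

```r
library(dplyr)

# Made-up data for illustration only.
survey <- tibble(
  bigregion = c("North", "North", "North", "South", "South"),
  religion  = c("A", "A", "B", "A", "B")
)

counts <- survey %>%
  group_by(bigregion, religion) %>%
  summarize(n = n(), .groups = "drop_last")

# summarize() dropped the innermost group (religion); bigregion remains.
group_vars(counts)

# So sum(n) is taken within each bigregion, and freq sums to 1 per region.
freqs <- counts %>%
  mutate(freq = n / sum(n))

freqs
```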

slide-67
SLIDE 67

mutate() adds or modifies columns; it doesn’t change the grouping level or the number of rows in the table

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(n = n()) %>%
    mutate(freq = n / sum(n),
           pct = round((freq*100), 1))

slide-68
SLIDE 68

Notice how we can create variables on the fly and use them immediately

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(n = n()) %>%
    mutate(freq = n / sum(n),
           pct = round((freq*100), 1))
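A minimal sketch of that point, stripped of the grouping: inside a single mutate() call, pct can use freq even though freq was created one argument earlier. The table and the names tallies and out are made up for illustration.

```r
library(dplyr)

# Made-up counts for illustration only.
tallies <- tibble(n = c(1, 3))

# freq is created first and then used immediately to compute pct.
out <- tallies %>%
  mutate(freq = n / sum(n),
         pct  = round(freq * 100, 1))

out
```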

slide-69
SLIDE 69

rel_by_region

## Source: local data frame [24 x 5]
## Groups: bigregion [4]
##
## # A tibble: 24 x 5
##    bigregion   religion     n       freq   pct
##       <fctr>     <fctr> <int>      <dbl> <dbl>
##  1 Northeast Protestant   158 0.32377049  32.4
##  2 Northeast   Catholic   162 0.33196721  33.2
##  3 Northeast     Jewish    27 0.05532787   5.5
##  4 Northeast       None   112 0.22950820  23.0
##  5 Northeast      Other    28 0.05737705   5.7
##  6 Northeast         NA     1 0.00204918   0.2
##  7   Midwest Protestant   325 0.46762590  46.8
##  8   Midwest   Catholic   172 0.24748201  24.7
##  9   Midwest     Jewish     3 0.00431655   0.4
## 10   Midwest       None   157 0.22589928  22.6
## # ... with 14 more rows