Coding Lab: Visualizing data with ggplot2, Ari Anisfeld, Summer 2020

SLIDE 1

Coding Lab: Visualizing data with ggplot2

Ari Anisfeld Summer 2020

1 / 36

SLIDE 2

How to use ggplot

◮ How to map data to aesthetics with aes() (and what that means)
◮ How to visualize the mappings with geoms
◮ How to get more out of your data by using multiple aesthetics
◮ How to use facets to add dimensionality

There are whole books on how to use ggplot. This is a quick introduction!

SLIDE 3

Understanding ggplot()

By itself, ggplot() tells R to prepare to make a plot.

texas_annual_sales <-
  texas_housing_data %>%
  group_by(year) %>%
  summarize(total_volume = sum(volume, na.rm = TRUE))

ggplot(data = texas_annual_sales)

SLIDE 4

Adding a mapping

Adding mapping = aes() says how the data will map to “aesthetics”.

◮ e.g. tell R to make x-axis year and y-axis total_volume.
◮ Each row of the data has (year, total_volume).
◮ R will map that to the coordinate pair (x, y).
◮ Look at the data before moving on!

ggplot(data = texas_annual_sales,
       mapping = aes(x = year, y = total_volume))

[Figure: axes only, no data drawn yet; x-axis year, y-axis total_volume]
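
One way to follow the “look at the data” advice is sketched below. It assumes that texas_housing_data is the txhousing data set that ships with ggplot2; the slides do not say where the data comes from, so treat that as an assumption rather than the author's setup.

```r
library(dplyr)
library(ggplot2)

# Assumption: texas_housing_data is ggplot2's built-in txhousing data set.
texas_housing_data <- ggplot2::txhousing

# Same summary as on slide 3: one row per year with the total sales volume.
texas_annual_sales <- texas_housing_data %>%
  group_by(year) %>%
  summarize(total_volume = sum(volume, na.rm = TRUE))

# Inspect the columns and the first rows before mapping them to x and y.
glimpse(texas_annual_sales)
head(texas_annual_sales)
```

Each printed row is one (year, total_volume) pair, which is exactly what aes(x = year, y = total_volume) turns into an (x, y) coordinate.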

SLIDE 5

Visualizing the mapping with a geom

geom_<name> tells R what type of visualization to produce. Here we see points.

◮ Each row of the data has (year, total_volume).
◮ R will map that to the coordinate pair (x, y).

ggplot(data = texas_annual_sales,
       mapping = aes(x = year, y = total_volume)) +
  geom_point()

[Figure: points, x-axis year, y-axis total_volume]

SLIDE 6

Visualizing the mapping with a geom

Here we see bars.

◮ Each row of the data has (year, total_volume).
◮ R will map that to the coordinate pair (x, y).

ggplot(data = texas_annual_sales,
       mapping = aes(x = year, y = total_volume)) +
  geom_col()

[Figure: bars, x-axis year, y-axis total_volume]

SLIDE 7

Visualizing the mapping with a geom

Here we see a line connecting each (x, y) pair.

ggplot(data = texas_annual_sales,
       mapping = aes(x = year, y = total_volume)) +
  geom_line()

[Figure: line, x-axis year, y-axis total_volume]

SLIDE 8

Visualizing the mapping with a geom

Here we see a smooth line. R does a statistical transformation!

◮ Now R doesn’t visualize the mapping (year, total_volume) to each (x, y) pair
◮ Instead it fits a model to the (x, y) and then plots the “smooth” line

ggplot(data = texas_annual_sales,
       mapping = aes(x = year, y = total_volume)) +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

[Figure: smooth line, x-axis year, y-axis total_volume]
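
The message above shows that geom_smooth() picked a loess model on its own. As a small sketch of choosing the model explicitly, the example below asks for a straight-line fit with method = "lm". The toy_sales data frame is made up for illustration and only stands in for texas_annual_sales.

```r
library(ggplot2)

# Hypothetical stand-in for texas_annual_sales (values are invented).
toy_sales <- data.frame(
  year = 2000:2009,
  total_volume = c(10, 12, 11, 15, 16, 18, 17, 21, 22, 25)
)

# method = "lm" fits a linear model instead of the default loess smoother;
# se = FALSE hides the confidence band so only the fitted line is drawn.
p_linear <- ggplot(toy_sales, aes(x = year, y = total_volume)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_linear
```

The plotted line comes from the fitted model, not from the rows of the data, which is the statistical transformation the slide refers to.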

SLIDE 9

Visualizing the mapping with a geom

We can overlay several geoms.

ggplot(data = texas_annual_sales,
       mapping = aes(x = year, y = total_volume)) +
  geom_smooth() +
  geom_point()

[Figure: smooth line with points overlaid, x-axis year, y-axis total_volume]

SLIDE 10

Visualizing the mapping with a geom

◮ We saw that we can visualize a relationship between two variables by mapping data to x and y.
◮ The data can be visualized with different geoms that can be composed (+) together.
◮ We can even calculate new variables with statistics and plot those on the fly.

Next: Now we’ll look at aesthetics that go beyond x and y axes.

SLIDE 11

Using aesthetics to explore data.

We’ll use midwest data and start with only mapping to x and y.

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty)) +
  geom_point()

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty]

SLIDE 12

Using aesthetics to explore data.

◮ color maps data to the color of points or lines.
  ◮ Each state is assigned a color.
  ◮ This works with discrete data and continuous data.

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty, color = state)) +
  geom_point()

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty; color legend state: IL, IN, MI, OH, WI]

SLIDE 13

Using aesthetics to explore data.

◮ shape maps data to the shape of points.
  ◮ Each state is assigned a shape.
  ◮ This works with discrete data only.

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty, shape = state)) +
  geom_point()

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty; shape legend state: IL, IN, MI, OH, WI]

SLIDE 14

Using aesthetics to explore data.

◮ alpha maps data to the transparency of points.
  ◮ Here we map the total population (poptotal) to alpha.

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty, alpha = poptotal)) +
  geom_point()

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty; alpha legend poptotal]

SLIDE 15

Using aesthetics to explore data.

◮ size maps data to the size of points and width of lines.
  ◮ Here we map the total population (poptotal) to size.

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty, size = poptotal)) +
  geom_point()

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty; size legend poptotal]

SLIDE 16

Using aesthetics to explore data.

We can combine any and all aesthetics, and even map the same variable to multiple aesthetics.

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty,
             alpha = percpovertyknown,
             size = poptotal,
             color = state)) +
  geom_point()

SLIDE 17

Using aesthetics to explore data.

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty; legends include percpovertyknown and state]

SLIDE 18

Using aesthetics to explore data

Different geoms have specific aesthetics that go with them.

◮ Use ? to see which aesthetics a geom accepts (e.g. ?geom_point).
  ◮ The bold aesthetics are required.
◮ The ggplot cheatsheet shows all the geoms with their associated aesthetics.
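
Besides the help page, the same information can be read from R directly. The sketch below assumes GeomPoint, the ggproto object that ggplot2 exports as the implementation behind geom_point(), and inspects its required_aes field and aesthetics() method. Note that ggplot2 reports the British spelling colour internally.

```r
library(ggplot2)

# Aesthetics that geom_point() cannot work without.
required <- GeomPoint$required_aes

# Every aesthetic that geom_point() understands, required or optional.
accepted <- GeomPoint$aesthetics()

print(required)
print(accepted)
```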

SLIDE 19

Facets

Facets provide an additional tool to explore multidimensional data.

midwest %>%
  ggplot(aes(x = log(poptotal), y = percbelowpoverty)) +
  geom_point() +
  facet_wrap(vars(state))

[Figure: scatter plots faceted by state (IL, IN, MI, OH, WI); x-axis log(poptotal), y-axis percbelowpoverty]

SLIDE 20

discrete vs continuous data

aes          discrete                     continuous
             limited number of classes    unlimited number of classes
             usually chr or lgl           numeric
x, y         yes                          yes
color, fill  yes                          yes
shape        yes (6 or fewer categories)  no
size, alpha  not advised                  yes
facet        yes                          not advised

Here, discrete and continuous have a different meaning than in math.

◮ For ggplot, the meaning is more fluid.
◮ If you do group_by with the var and there are fewer than 6 to 10 groups, discrete visualizations can work.
◮ If your “discrete” data is numeric, use as.character() or as_factor() to enforce the decision.
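
As an illustration of the last bullet, the sketch below uses the built-in mtcars data (not part of the original slides), where cyl is stored as a number but only takes the values 4, 6 and 8. Wrapping it in as.character() makes ggplot treat it as discrete.

```r
library(ggplot2)

# cyl is numeric, so ggplot uses a continuous color gradient.
p_continuous <- ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point()

# as.character() forces a discrete scale: one distinct color per class.
p_discrete <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.character(cyl))) +
  geom_point() +
  labs(color = "cyl")

p_continuous
p_discrete
```

forcats::as_factor() would serve the same purpose when the order of the classes matters.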

SLIDE 21

color can be continuous

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty,
             color = percpovertyknown)) +
  geom_point()

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty; continuous color legend percpovertyknown]

SLIDE 22

shape does not play well with many categories

◮ Will only map to 6 categories, the rest become NA.
◮ We can override this behavior and get up to 25 distinct shapes.

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty, shape = county)) +
  geom_point() +
  # legend off, otherwise it overwhelms
  theme(legend.position = "none")

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty]
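
The slide does not show how to override the six-shape limit. One possible way, offered here as an assumption about what is meant, is scale_shape_manual() with an explicit vector of shape codes. The sketch keeps only ten Illinois counties (a hypothetical subset chosen for illustration) so that every county can get its own shape.

```r
library(dplyr)
library(ggplot2)

# Hypothetical subset: the first ten Illinois counties in the midwest data.
ten_counties <- midwest %>%
  filter(state == "IL") %>%
  head(10)

# Supply one shape code per county instead of relying on the default palette.
p_shapes <- ten_counties %>%
  ggplot(aes(x = percollege, y = percbelowpoverty, shape = county)) +
  geom_point() +
  scale_shape_manual(values = 0:9)

p_shapes
```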

SLIDE 23

alpha and size can be misleading with discrete data

midwest %>%
  ggplot(aes(x = percollege, y = percbelowpoverty, alpha = state)) +
  geom_point()

## Warning: Using alpha for a discrete variable is not advised.

[Figure: scatter plot, x-axis percollege, y-axis percbelowpoverty; alpha legend state: IL, IN, MI, OH, WI]

SLIDE 24

Adding vertical lines

texas_annual_sales %>%
  ggplot(aes(x = year, y = total_volume)) +
  geom_point() +
  geom_vline(aes(xintercept = 2007), linetype = "dotted")

[Figure: points with a dotted vertical line at year 2007; x-axis year, y-axis total_volume]

◮ add horizontal lines with geom_hline()
◮ add any linear fit using geom_abline() by providing a slope and intercept.
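
The two bullets above can be sketched as follows. The toy data frame and the chosen line positions are invented for illustration; only geom_hline() and geom_abline() come from the slide.

```r
library(ggplot2)

# Hypothetical data, used only to have something to draw lines over.
toy <- data.frame(x = 1:10, y = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9))

p_lines <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # Horizontal reference line at y = 5.
  geom_hline(yintercept = 5, linetype = "dashed") +
  # Straight line y = 0 + 1 * x, given by its intercept and slope.
  geom_abline(intercept = 0, slope = 1)

p_lines
```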

SLIDE 25

Key takeaways

◮ ggplot starts by mapping data to “aesthetics”.
  ◮ e.g. what data shows up on x and y axes and how color, size and shape appear on the plot.
  ◮ We need to be aware of ‘continuous’ vs. ‘discrete’ variables.
◮ Then, we use geoms to create a visualization based on the mapping.
  ◮ Again we need to be aware of ‘continuous’ vs. ‘discrete’ variables.
◮ Making quick plots helps us understand data and makes us aware of data issues.

Resources: R for Data Science chap. 3 (r4ds.had.co.nz); RStudio’s ggplot cheatsheet.

SLIDE 26

Appendix: Some graphs you made along the way


SLIDE 27

lab 0: a map

geom_path is like geom_line, but connects (x, y) pairs in the order they appear in the data set.

storms %>%
  group_by(name, year) %>%
  filter(max(category) == 5) %>%
  ggplot(aes(x = long, y = lat, color = name)) +
  geom_path() +
  borders("world") +
  coord_quickmap(xlim = c(-130, -60), ylim = c(20, 50))

SLIDE 28

lab 0: a map

[Figure: map of storm paths, x-axis long, y-axis lat; color legend name: Dean, Emily, Felix, Gilbert, Hugo, Isabel, Ivan, Katrina, Mitch]

SLIDE 29

lab 1: a line plot

french_data <-
  wid_data %>%
  filter(type == "Net personal wealth", country == "France") %>%
  mutate(perc_national_wealth = value * 100)

french_data %>%
  ggplot(aes(y = perc_national_wealth, x = year, color = percentile)) +
  geom_line()

SLIDE 30

lab 1: a line plot

[Figure: line plot, x-axis year, y-axis perc_national_wealth; color legend percentile: p0p50, p50p90, p90p100, p99p100]

SLIDE 31

lab 2: distributions

◮ geom_density() only requires an x aesthetic and it calculates the distribution to plot.
◮ We can set the aesthetics manually, independent of data, for nicer graphs.

chi_sq_samples <-
  tibble(x = c(rchisq(100000, 2),
               rchisq(100000, 3),
               rchisq(100000, 4)),
         df = rep(c("2", "3", "4"), each = 1e5))

chi_sq_samples %>%
  ggplot(aes(x = x, fill = df)) +
  geom_density(alpha = .5) +
  labs(fill = "df", x = "sample")
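
In the code above, fill is mapped to data inside aes(), while alpha = .5 is set by hand outside aes(). A minimal sketch of that difference, using the built-in mtcars data rather than the lab data:

```r
library(ggplot2)

# Mapped: fill is inside aes(), so it varies with the data and gets a legend.
p_mapped <- ggplot(mtcars, aes(x = mpg, fill = as.character(cyl))) +
  geom_density(alpha = 0.5) +
  labs(fill = "cyl")

# Set: fill and alpha are outside aes(), so every density looks the same
# and no legend is drawn for them.
p_set <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "steelblue", alpha = 0.5)

p_mapped
p_set
```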

SLIDE 32

lab 2: distributions

[Figure: overlapping density curves, x-axis sample, y-axis density; fill legend df: 2, 3, 4]

SLIDE 33

lab 4: grouped bar graphs

◮ position = "dodge2" tells R to put bars next to each other, rather than stacked on top of each other.
◮ Notice we use fill and not color because we’re “filling” an area.

mean_share_per_country %>%
  ggplot(aes(y = country, x = mean_share, fill = percentile)) +
  geom_col(position = "dodge2") +
  labs(x = "Mean share of national wealth",
       y = "",
       fill = "Wealth\npercentile")

SLIDE 34

lab 4: grouped bar graphs

[Figure: grouped horizontal bars; y-axis country (China, France, India, Korea, Russia, S Africa, UK, USA), x-axis Mean share of national wealth; fill legend Wealth percentile: p90p100, p99p100]

SLIDE 35

lab 4: faceted bar graph

◮ Notice that we manipulate our data to the right specification before making this graph.
◮ Using facet_wrap we get a distinct graph for each time period.

mean_share_per_country_with_time %>%
  ggplot(aes(x = country, y = mean_share, fill = percentile)) +
  geom_col(position = "dodge2") +
  facet_wrap(vars(time_period))

SLIDE 36

lab 4: faceted bar graph

[Figure: grouped bars faceted by time_period (1959 and earlier, 1960 to 1979, 1980 to 1999, 2000 to present); x-axis country (China, India, USA), y-axis mean_share; fill legend percentile: p90p100, p99p100]