Introduction to ggplot2 (R Pruim, July 2014)



slide-1
SLIDE 1

Introduction to ggplot2

R Pruim July, 2014

slide-2
SLIDE 2

Goals

What I will try to do

◮ give a tour of ggplot2
◮ explain how to think about plots the ggplot2 way
◮ prepare/encourage you to learn more later

What I can’t do in one session

◮ show every bell and whistle
◮ make you an expert at using ggplot2

slide-3
SLIDE 3

The Births78 data set – revised edition

    require(dplyr)
    require(mosaic)
    require(lubridate)
    Births2 <- Births78 %>%
      mutate(
        date = mdy(date) - years(100),            # y2k fix
        wd   = wday(date),                        # as a number
        wday = wday(date, label=TRUE, abbr=TRUE)  # as text (abbrev)
      )
    head(Births2, 2)
    ##         date births dayofyear wd wday
    ## 1 1978-01-01   7701         1  1  Sun
    ## 2 1978-01-02   7527         2  2  Mon

slide-4
SLIDE 4

The grammar of graphics

geom: the geometric “shape” used to display data (glyph)

◮ bar, point, line, ribbon, text, etc.

aesthetic: an attribute controlling how geom is displayed

◮ x position, y position, color, fill, shape, size, etc.

stat: a transformation applied to data before geom gets it

◮ example: histograms work on binned data

scale: conversion of raw data to visual display

◮ particular assignment of colors, shapes, sizes, etc.

guide: helps user convert visual data back into raw data (legends, axes)
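
As a rough sketch of how these five terms show up in code, the example below labels each piece of a small plot. The data frame `toy` and its columns are made up for illustration and are not part of the talk's data.

```r
library(ggplot2)

# Hypothetical toy data (not from the slides)
toy <- data.frame(
  x     = c(1, 2, 3, 4, 5, 6),
  y     = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2),
  group = rep(c("a", "b"), times = 3)
)

p <- ggplot(data = toy, aes(x = x, y = y, color = group)) +    # aesthetics: x, y, color
  geom_point(stat = "identity") +                              # geom: point; stat: identity (do nothing)
  scale_color_manual(values = c(a = "navy", b = "orange")) +   # scale: which color stands for which group
  guides(colour = guide_legend(title = "Group"))               # guide: the legend

# layer_data() shows the data after stats and scales have been applied
layer_data(p)[, c("x", "y", "colour")]
```

Printing `p` draws the plot; `layer_data(p)` is a convenient way to inspect what the geom actually receives.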

slide-7
SLIDE 7

How do we make this plot?

[Plot: births vs. date]

What does R need to know?

◮ data source
◮ aesthetics
◮ geom – dots

slide-11
SLIDE 11

How do we make this plot?

[Plot: births vs. date]

What does R need to know?

◮ data frame containing the data: ggplot(data=)

    ggplot(data=Births2)

◮ how we want to map our aesthetics: aes()

    ggplot(data=Births2, aes(x=date, y=births))

◮ what geom we want to use: + geom_point()

    ggplot(data=Births2, aes(x=date, y=births)) + geom_point()

slide-13
SLIDE 13

How do we make this plot?

[Plot: births vs. date; legend: wday (Sun, Mon, Tues, Wed, Thurs, Fri, Sat)]

What information has changed?

◮ new aesthetic: mapping color to day of week

    ggplot(data=Births2, aes(x=date, y=births, color=wday)) + geom_point()

slide-15
SLIDE 15

How do we make this plot?

[Plot: births vs. date; legend: wday (Sun, Mon, Tues, Wed, Thurs, Fri, Sat)]

This time we use lines instead of dots:

    ggplot(data=Births2, aes(x=date, y=births, color=wday)) + geom_line()

slide-18
SLIDE 18

How do we make this plot?

[Plot: births vs. date; legend: wday (Sun, Mon, Tues, Wed, Thurs, Fri, Sat)]

This time we have two layers, one with points and one with lines:

    ggplot(data=Births2, aes(x=date, y=births, color=wday)) +
      geom_point() + geom_line()

◮ The layers are placed one on top of the other: the points are below and the lines are above. Sometimes the order of the layers can be important because of overplotting.

slide-19
SLIDE 19

Alternative Syntax

    Births2 %>%
      ggplot(aes(x=date, y=births, color=wday)) +
      geom_point() + geom_line()

[Plot: births vs. date; legend: wday (Sun, Mon, Tues, Wed, Thurs, Fri, Sat)]

slide-21
SLIDE 21

What does this do?

    Births2 %>%
      ggplot(aes(x=date, y=births, color="navy")) +
      geom_point()

[Plot: births vs. date; legend titled "navy" with the single entry navy]

This is mapping the color aesthetic to a new variable with only one value (“navy”). So all the dots get set to the same color, but it’s not navy.

slide-22
SLIDE 22

Setting vs. Mapping

If we want to set the color to be navy for all of the dots, we do it this way:

    Births2 %>%
      ggplot(aes(x=date, y=births)) +  # map these
      geom_point(color = "navy")       # set this

[Plot: births vs. date]

◮ Note that color = "navy" is now outside of the aesthetics list. That’s how ggplot2 distinguishes between mapping and setting.
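
One way to see the difference between mapping and setting is to inspect the colors ggplot2 actually assigns to the points. This is a minimal sketch on a made-up data frame `toy` (an assumption, not the Births data), using `layer_data()` to look at the computed layer.

```r
library(ggplot2)

# Hypothetical toy data (not from the slides)
toy <- data.frame(x = 1:5, y = c(3, 1, 4, 1, 5))

# Mapping: "navy" inside aes() is treated as data (a constant, one-level variable),
# so the color scale chooses the color and a legend appears.
p_map <- ggplot(toy, aes(x = x, y = y, color = "navy")) + geom_point()

# Setting: color outside aes() is used literally.
p_set <- ggplot(toy, aes(x = x, y = y)) + geom_point(color = "navy")

unique(layer_data(p_map)$colour)  # a color picked by the default discrete scale
unique(layer_data(p_set)$colour)  # the literal value "navy"
```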

slide-24
SLIDE 24

How do we make this plot?

[Plot: births vs. date; legend: wday (Sun, Mon, Tues, Wed, Thurs, Fri, Sat)]

    Births2 %>%
      ggplot(aes(x=date, y=births)) +
      geom_line(aes(color=wday)) +  # map color here
      geom_point(color="navy")      # set color here

◮ ggplot() establishes the default data and aesthetics for the geoms, but each geom may change these defaults.
◮ good practice: put into ggplot() the things that affect all (or most) of the layers; rest in geom_blah()
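
The sketch below illustrates the same idea on a made-up data frame `toy` (hypothetical, not the Births data): x and y are shared defaults in `ggplot()`, while each layer handles color in its own way.

```r
library(ggplot2)

# Hypothetical toy data with two groups (not from the slides)
toy <- data.frame(
  x = rep(1:4, times = 2),
  y = c(1, 3, 2, 4, 2, 4, 3, 5),
  g = rep(c("a", "b"), each = 4)
)

p <- ggplot(toy, aes(x = x, y = y)) +  # defaults shared by every layer
  geom_line(aes(color = g)) +          # this layer adds a mapping of its own
  geom_point(color = "navy")           # this layer sets a constant instead

# Inspect each layer separately: layer 1 is the lines, layer 2 the points
line_colors  <- unique(layer_data(p, 1)$colour)
point_colors <- unique(layer_data(p, 2)$colour)
```

Here `line_colors` holds one scale-chosen color per group, while `point_colors` is just "navy".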

slide-25
SLIDE 25

Other geoms

    apropos("^geom_")
     [1] "geom_abline"       "geom_area"         "geom_bar"
     [4] "geom_bin2d"        "geom_blank"        "geom_boxplot"
     [7] "geom_contour"      "geom_crossbar"     "geom_density"
    [10] "geom_density2d"    "geom_dotplot"      "geom_errorbar"
    [13] "geom_errorbarh"    "geom_freqpoly"     "geom_hex"
    [16] "geom_histogram"    "geom_hline"        "geom_jitter"
    [19] "geom_line"         "geom_linerange"    "geom_map"
    [22] "geom_path"         "geom_point"        "geom_pointrange"
    [25] "geom_polygon"      "geom_quantile"     "geom_rangeframe"
    [28] "geom_raster"       "geom_rect"         "geom_ribbon"
    [31] "geom_rug"          "geom_segment"      "geom_smooth"
    [34] "geom_step"         "geom_text"         "geom_tile"
    [37] "geom_tufteboxplot" "geom_violin"       "geom_vline"

Help pages will tell you their aesthetics and default stats:

    ?geom_area  # for example

slide-27
SLIDE 27

Let’s try geom_area

    Births2 %>%
      ggplot(aes(x=date, y=births, fill=wday)) +
      geom_area()

[Plot: births vs. date as filled areas; legend: wday (Sun, Mon, Tues, Wed, Thurs, Fri, Sat)]

This is not a good plot

◮ overplotting is hiding much of the data
◮ extending y-axis to 0 may or may not be desirable.

slide-28
SLIDE 28

Side note: what makes a plot good?

Most (all?) graphics are intended to help us make comparisons

◮ How does something change over time?
◮ Do my treatments matter? How much?
◮ Do men and women respond the same way?

Key plot metric: Does my plot make the comparisons I am interested in

◮ easily, and
◮ accurately?

slide-29
SLIDE 29

Time for some different data

HELPrct: Health Evaluation and Linkage to Primary care randomized clinical trial

    ?HELPrct

slide-32
SLIDE 32

Why are these people in the study?

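    HELPrct %>%
      ggplot(aes(x=substance)) +
      geom_bar()

[Plot: bar chart of count by substance (alcohol, cocaine, heroin)]

◮ Hmm. What’s up with y?

◮ stat_bin() is being applied to the data before the geom_bar() gets to do its thing. Binning creates the y values. (Since ggplot2 2.0.0, geom_bar() uses stat_count() for this job; stat_bin() is reserved for continuous data.)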

slide-33
SLIDE 33

Data Flow

original data → [stat] → statified → [aesthetics] → aesthetic data → [scales] → scaled data

Simplifications:

◮ aesthetics get computed twice, once before the stat and again after. Examples: bar charts, histograms
    ◮ we need to look at the aesthetics to figure out which variable to bin
    ◮ then the stat does the binning
    ◮ bin counts become part of the aesthetics for geom: y=..count.. (written after_stat(count) in ggplot2 3.3.0 and later)
◮ This process happens in each layer
◮ stat_identity() is the “do nothing” stat.
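
To make the flow concrete, the sketch below builds a bar chart from a made-up data frame `toy` (hypothetical, standing in for HELPrct) and then looks at the data after the stat has run. In recent ggplot2 versions the stat behind `geom_bar()` is `stat_count()`.

```r
library(ggplot2)

# Hypothetical toy data: one row per subject (not from the slides)
toy <- data.frame(
  substance = c("alcohol", "cocaine", "alcohol", "heroin", "alcohol", "cocaine")
)

# Only x is mapped; y is left for the stat to fill in
p <- ggplot(toy, aes(x = substance)) + geom_bar()

# layer_data() returns the data as the geom sees it, after the stat has counted
built <- layer_data(p)
built[, c("x", "count")]

# The same counts computed by hand, for comparison
table(toy$substance)
```

The `count` column in `built` is what ends up on the y axis, even though the plot call never mentions y.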

slide-35
SLIDE 35

How old are people in the HELP study?

    HELPrct %>%
      ggplot(aes(x=age)) +
      geom_histogram()

[Plot: histogram of count by age]

Notice the messages

◮ stat_bin: Histograms are not mapping the raw data but binned data. stat_bin() performs the data transformation.
◮ binwidth: a default binwidth has been selected, but we should really choose our own.

slide-36
SLIDE 36

Setting the binwidth manually

    HELPrct %>%
      ggplot(aes(x=age)) +
      geom_histogram(binwidth=2)

[Plot: histogram of count by age, binwidth 2]
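
A quick way to check what binwidth does is to look at the bins stat_bin() produces. This sketch uses simulated ages in a made-up data frame `toy` (an assumption; the real HELPrct ages will differ).

```r
library(ggplot2)

# Hypothetical simulated ages (not the HELPrct data)
set.seed(42)
toy <- data.frame(age = round(rnorm(200, mean = 36, sd = 8)))

p <- ggplot(toy, aes(x = age)) + geom_histogram(binwidth = 2)

# Each row is one bin: its edges and how many observations fell into it
bins <- layer_data(p)
head(bins[, c("xmin", "xmax", "count")])
```

Every bin spans exactly two years, and the counts add up to the number of rows in `toy`.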

slide-37
SLIDE 37

How old are people in the HELP study? – Other geoms

    HELPrct %>%
      ggplot(aes(x=age)) +
      geom_freqpoly(binwidth=2)

[Plot: frequency polygon of count by age]

    HELPrct %>%
      ggplot(aes(x=age)) +
      geom_density()

[Plot: density of age]

slide-38
SLIDE 38

Selecting stat and geom manually

Every geom comes with a default stat

◮ for simple cases, the stat is stat_identity() which does nothing
◮ we can mix and match geoms and stats however we like

    HELPrct %>%
      ggplot(aes(x=age)) +
      geom_line(stat="density")

[Plot: density of age, drawn as a line]

slide-39
SLIDE 39

Selecting stat and geom manually

Every stat comes with a default geom

◮ we can specify stats instead of geom, if we prefer
◮ we can mix and match geoms and stats however we like

    HELPrct %>%
      ggplot(aes(x=age)) +
      stat_density(geom="line")

[Plot: density of age, drawn as a line]
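
The two spellings, geom first or stat first, should describe the same computation. As a small check on simulated data (the data frame `toy` is made up for illustration), the sketch below compares the density values each version computes.

```r
library(ggplot2)

# Hypothetical simulated ages (not the HELPrct data)
set.seed(1)
toy <- data.frame(age = rnorm(100, mean = 36, sd = 8))

p_geom <- ggplot(toy, aes(x = age)) + geom_line(stat = "density")   # pick the geom, override its stat
p_stat <- ggplot(toy, aes(x = age)) + stat_density(geom = "line")   # pick the stat, override its geom

# Compare the density estimates computed for each plot
d_geom <- layer_data(p_geom)
d_stat <- layer_data(p_stat)
all.equal(d_geom$density, d_stat$density)
```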

slide-40
SLIDE 40

More combinations

    HELPrct %>%
      ggplot(aes(x=age)) +
      geom_point(stat="bin", binwidth=3) +
      geom_line(stat="bin", binwidth=3)

[Plot: binned count by age, drawn as points and a line]

    HELPrct %>%
      ggplot(aes(x=age)) +
      geom_area(stat="bin", binwidth=3)

[Plot: binned count by age, drawn as a filled area]

slide-41
SLIDE 41

Your turn: How much do they drink? (i1)

Create a plot that shows the distribution of the average daily alcohol consumption in the past 30 days (i1).

slide-42
SLIDE 42

How much do they drink? (i1)

    HELPrct %>%
      ggplot(aes(x=i1)) +
      geom_histogram()

[Plot: histogram of count by i1]

    HELPrct %>%
      ggplot(aes(x=i1)) +
      geom_area(stat="density")

[Plot: density of i1, drawn as a filled area]

slide-43
SLIDE 43

Covariates: Adding in more variables

Q. How does alcohol consumption (or age, your choice) differ by sex and substance (alcohol, cocaine, heroin)?

Decisions:

◮ How will we display the variables: i1 (or age), sex, substance
◮ What comparisons are we most interested in?

Give it a try.

◮ Note: I’m cheating a bit. You may want to do some things I haven’t shown you yet. (Feel free to ask.)

slide-44
SLIDE 44

Covariates: Adding in more variables

Using color and linetype:

    HELPrct %>%
      ggplot(aes(x=i1, color=substance, linetype=sex)) +
      geom_line(stat="density")

[Plot: density of i1; legends: substance (alcohol, cocaine, heroin) and sex (female, male)]

Using color and facets:

    HELPrct %>%
      ggplot(aes(x=i1, color=substance)) +
      geom_line(stat="density") +
      facet_grid( . ~ sex )

[Plot: density of i1, faceted by sex (female, male); legend: substance]

slide-45
SLIDE 45

Boxplots

Boxplots use stat_boxplot(), which computes a five-number summary (roughly the minimum, the three quartiles, and the maximum of the data) and uses them to define a “box” and “whiskers”. The quantitative variable must be y, and there must be an additional x variable.

    HELPrct %>%
      ggplot(aes(x=substance, y=age, color=sex)) +
      geom_boxplot()

[Plot: boxplots of age by substance (alcohol, cocaine, heroin); legend: sex (female, male)]
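
For intuition about what goes into a box and its whiskers, here is a base R sketch of a five-number summary using the common 1.5 × IQR whisker rule. The helper `box_summary()` and the vector `ages` are made up for illustration; this is meant to approximate what a boxplot stat computes, not to reproduce ggplot2's internals exactly.

```r
# Hypothetical ages (not the HELPrct data)
ages <- c(19, 23, 25, 28, 31, 33, 35, 36, 38, 41, 44, 47, 52, 60)

# Hypothetical helper: quartiles for the box, whiskers reaching to the most
# extreme observations within coef * IQR of the box
box_summary <- function(v, coef = 1.5) {
  q   <- quantile(v, probs = c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  inside <- v[v >= q[1] - coef * iqr & v <= q[3] + coef * iqr]
  c(ymin = min(inside), lower = q[1], middle = q[2], upper = q[3], ymax = max(inside))
}

s <- box_summary(ages)
s
```

Points outside the whisker range would be drawn individually as outliers; a large `coef` (such as the `coef = 10` used later in the talk) stretches the whiskers so that almost nothing counts as an outlier.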

slide-46
SLIDE 46

Horizontal boxplots

Horizontal boxplots are obtained by flipping the coordinate system:

    HELPrct %>%
      ggplot(aes(x=substance, y=age, color=sex)) +
      geom_boxplot() +
      coord_flip()

[Plot: horizontal boxplots of age by substance; legend: sex (female, male)]

◮ coord_flip() may be used with other plots as well to reverse the roles of x and y on the plot.

slide-47
SLIDE 47

Give me some space

We’ve triggered a new feature: dodge (for dodging things left/right). We can control how much if we set the dodge manually.

    HELPrct %>%
      ggplot(aes(x=substance, y=age, color=sex)) +
      geom_boxplot(position=position_dodge(width=1))

[Plot: boxplots of age by substance with wider dodging; legend: sex (female, male)]

slide-48
SLIDE 48

Issues with bigger data

    dim(NHANES)
    ## [1] 31126    53

    NHANES %>%
      ggplot(aes(x=waist, y=weight)) +
      geom_point() +
      facet_grid( sex ~ pregnant )

[Plot: weight vs. waist, faceted by sex (rows: male, female) and pregnant (columns: yes, no)]

◮ Although we can see a generally positive association (as we would expect), the overplotting may be hiding information.

slide-49
SLIDE 49

Using alpha (opacity)

One way to deal with overplotting is to set the opacity low.

    NHANES %>%
      ggplot(aes(x=waist, y=weight)) +
      geom_point(alpha=0.01) +
      facet_grid( sex ~ pregnant )

[Plot: weight vs. waist with nearly transparent points, faceted by sex and pregnant]

slide-50
SLIDE 50

geom_density2d

Alternatively (or simultaneously) we might prefer a different geom altogether.

    NHANES %>%
      ggplot(aes(x=waist, y=weight)) +
      geom_density2d() +
      facet_grid( sex ~ pregnant )

[Plot: 2-d density contours of weight vs. waist, faceted by sex and pregnant]

slide-51
SLIDE 51

geom_hex

    NHANES %>%
      ggplot(aes(x=waist, y=weight)) +
      geom_hex() +
      facet_grid( sex ~ pregnant )

[Plot: hexagonal bins of weight vs. waist, faceted by sex and pregnant; legend: count]

slide-52
SLIDE 52

Multiple layers

    ggplot( data=HELPrct, aes(x=sex, y=age)) +
      geom_boxplot(outlier.size=0) +
      geom_jitter(alpha=.6) +
      coord_flip()

[Plot: horizontal boxplots of age by sex (female, male) with jittered points on top]

slide-53
SLIDE 53

Labeling

    NHANES %>%
      ggplot(aes(x=waist, y=weight)) +
      geom_hex() +
      facet_grid( sex ~ pregnant ) +
      labs(x="waist (m)", y="weight (kg)", title="weight vs waist")

[Plot titled "weight vs waist": hexagonal bins of weight (kg) vs. waist (m), faceted by sex and pregnant; legend: count]

slide-54
SLIDE 54

Things I haven’t mentioned (much)

◮ scales (fine tuning mapping from data to plot)
◮ guides (so reader can map from plot to data)
◮ coords (coord_flip() is good to know about)
◮ themes (for customizing appearance)

    require(ggthemes)
    qplot( x=date, y=births, data=Births2) + theme_wsj()

[Plot: births vs. date, styled with theme_wsj()]

slide-55
SLIDE 55

Things I haven’t mentioned (much)

◮ scales (fine tuning mapping from data to plot)
◮ guides (so reader can map from plot to data)
◮ coords (coord_flip() is good to know about)
◮ themes (for customizing appearance)

    require(xkcd)
    qplot( x=date, y=births, data=Births2, color=wday,
           geom="smooth", se=FALSE) + theme_xkcd()

slide-56
SLIDE 56

Things I haven’t mentioned (much)

◮ scales (fine tuning mapping from data to plot)
◮ guides (so reader can map from plot to data)
◮ coords (coord_flip() is good to know about)
◮ themes (for customizing appearance)
◮ position (position_dodge() can be used for side by side bars)

    ggplot( data=HELPrct, aes(x=substance, y=age, color=sex)) +
      geom_violin(coef = 10, position=position_dodge()) +
      geom_point(aes(color=sex, fill=sex), position=position_jitterdodge
    [remainder of the call is cut off on the slide]

[Plot: violins of age by substance with jittered points; legend: sex (female, male)]

slide-57
SLIDE 57

Things I haven’t mentioned (much)

◮ scales (fine tuning mapping from data to plot)
◮ guides (so reader can map from plot to data)
◮ themes (for customizing appearance)
◮ position (position_dodge(), position_jitterdodge(), position_stack(), etc.)

slide-58
SLIDE 58

A little bit of everything

    ggplot( data=HELPrct, aes(x=substance, y=age, color=sex)) +
      geom_boxplot(coef = 10, position=position_dodge(width=1)) +
      geom_point(aes(fill=sex), alpha=.5,
                 position=position_jitterdodge(dodge.width=1)) +
      facet_wrap(~homeless)

[Plot: boxplots of age by substance with jittered points, faceted by homeless (homeless, housed); legend: sex (female, male)]

slide-59
SLIDE 59

Some short cuts

1. qplot() provides “quick plots” for ggplot2

    qplot(length, width, data=KidsFeet)

[Plot: width vs. length]

2. mplot(dataframe) provides an interactive plotting tool for both ggplot2 and lattice.

    mplot(HELPrct)

◮ quickly make several plots from a data frame

slide-60
SLIDE 60

Want to learn more?

◮ docs.ggplot2.org/
◮ Winston Chang’s R Graphics Cookbook

slide-61
SLIDE 61

What’s around the corner?

ggvis

◮ dynamic graphics (brushing, sliders, tooltips, etc.)
◮ uses Vega (D3) to animate plots in a browser
◮ similar structure to ggplot2 but different syntax and names
◮ version 0.3 just released to GitHub

Dynamic documents

◮ combination of RMarkdown, ggvis, and shiny
◮ beta testing now