Graphics using ggplot2, Steve Bagley, somgen223.stanford.edu

SLIDE 1

Graphics using ggplot2

Steve Bagley

somgen223.stanford.edu 1

SLIDE 2

Setup data

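library(tidyverse)  # provides read_csv, str_c, gather, %>%, and ggplot2

data_dir <- "https://web.stanford.edu/class/somgen223/data/"
gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv"))
# reshape from wide to tall: one row per gene and condition
gene_tall <- gather(gene_exp1, condition, expression_level, control:treatment)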

SLIDE 3

ggplot2: a package for the grammar of graphics

  • The grammar of graphics is the idea that graphs are composed of known elements in specific ways, in the same way that types of words are assembled through the rules of English syntax to form sentences.
  • ggplot2 is a package for doing this in R.
  • There are other plotting packages in R, but they do not follow this conceptual model.

SLIDE 4

ggplot: the main function

  • ggplot (in the package ggplot2) is the main function for constructing a graph.
  • Nearly every aspect of the graph can be changed. Usually, the defaults are pretty good.
  • You combine the graph produced by ggplot with plot specifications to add to or modify the graph.

SLIDE 5

How to call ggplot

ggplot(data = BOD, mapping = aes(x = Time, y = demand)) + geom_point()

  • ggplot is the main plotting function.
  • data = BOD: this tells which data frame contains the data to be plotted
  • mapping = aes(x = Time, y = demand): put the Time column on the x-axis and the demand column on the y-axis
  • geom_point(): plot the data as points
  • Note that you can use positional instead of named arguments to make this expression shorter: ggplot(BOD, aes(Time, demand)) + geom_point()

  • The use of “+” to glue these operations together will be explained later.

SLIDE 6

Use geom_point for scatterplot

BOD %>% ggplot(aes(Time, demand)) + geom_point()

[Figure: scatterplot of demand (y-axis) versus Time (x-axis)]

  • aes specifies the aesthetic mapping from the data (columns) to some aspect of the graph (x, y position).

  • There are 6 rows in the BOD data frame.
  • There are 6 points in the graph, one for each row.
  • This alignment of rows and points applies through much of ggplot2.
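The row-to-point alignment can be checked directly. This is a minimal sketch, assuming ggplot2 is installed; it uses layer_data(), a ggplot2 helper that does not appear in the slides, and the names p and drawn are illustrative only.

```r
library(ggplot2)

# Build the scatterplot from the slide (BOD is a built-in data frame).
p <- ggplot(BOD, aes(Time, demand)) + geom_point()

# layer_data() returns the data drawn by a layer, one row per point here.
drawn <- layer_data(p)

nrow(BOD)    # number of rows in the data
nrow(drawn)  # number of points in the graph
```

If the two counts agree, each row of BOD has produced exactly one point.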

SLIDE 7

Example: geom_line

BOD %>% ggplot(aes(Time, demand)) + geom_line()

[Figure: line plot of demand versus Time]

SLIDE 8

Example: points and lines

BOD %>% ggplot(aes(Time, demand)) + geom_point() + geom_line()

[Figure: demand versus Time, drawn as points connected by a line]

SLIDE 9

Giving arguments to plot specifications: change the size

BOD %>% ggplot(aes(Time, demand)) + geom_point(size = 5)

[Figure: scatterplot of demand versus Time with larger points]

SLIDE 10

Giving arguments to plot specifications: change the color

BOD %>% ggplot(aes(Time, demand)) + geom_point(color = "red")

[Figure: scatterplot of demand versus Time with red points]

SLIDE 11

Exercise: orange trees

Using the data in Orange:

  1. Pull out the data for tree 2 only
  2. Plot circumference versus age for those data

SLIDE 12

Answer: orange trees

Orange %>%
  filter(Tree == 2) %>%
  ggplot(aes(age, circumference)) +
  geom_point()

[Figure: scatterplot of circumference versus age for tree 2]

SLIDE 13

ggplot and + operator

  • ggplot(...) + geom_point() is a strange expression: it uses the + operator to add things (plots and plot specifications), which are not numbers.
  • This uses a feature called generic functions: the types of the arguments to + determine which piece of code, called a method, to run.

  • ggplot2 relies on this feature heavily.

SLIDE 14

Using + in ggplot2

plot1 <- ggplot(BOD, aes(Time, demand))
spec1 <- geom_point()
plot1 + spec1
spec2 <- geom_line(color = "blue")
plot1 + spec2

  • Note you can save parts of the graph specification and then add them together.
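To see what + does with saved pieces, one can count the layers stored in each plot object. This is only a sketch, assuming ggplot2 is installed; plot2 and plot3 are illustrative names, and inspecting the layers element of a plot object is an assumption about ggplot2 internals rather than something the slides cover.

```r
library(ggplot2)

plot1 <- ggplot(BOD, aes(Time, demand))  # a plot with no layers yet
spec1 <- geom_point()
spec2 <- geom_line(color = "blue")

# Each "+" returns a new plot object; plot1 itself is not modified.
plot2 <- plot1 + spec1
plot3 <- plot2 + spec2

length(plot1$layers)  # layers in the bare plot
length(plot2$layers)  # one more after adding the points
length(plot3$layers)  # one more again after adding the line
```

Because plot1 is left unchanged, the same base plot can be reused with different specifications.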

SLIDE 15

Adding a smoother

BOD %>% ggplot(aes(Time, demand)) + geom_point() + geom_smooth(method = "lm")

[Figure: scatterplot of demand versus Time with a fitted straight line]

  • lm means linear model (best fit, least-squares regression)
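The link between the smoother and lm can be illustrated by fitting the model directly and comparing. This is a hedged sketch, assuming ggplot2 is installed; layer_data() and the explicit formula = y ~ x argument are not in the slides, and the names fit, smooth, and predicted are illustrative.

```r
library(ggplot2)

# Fit the same linear model directly.
fit <- lm(demand ~ Time, data = BOD)
coef(fit)  # intercept and slope of the fitted line

p <- ggplot(BOD, aes(Time, demand)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smoother; compare its y values with the model's predictions.
smooth <- layer_data(p, 2)
predicted <- predict(fit, newdata = data.frame(Time = smooth$x))
all.equal(as.numeric(smooth$y), as.numeric(predicted))
```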

SLIDE 16

Making the plot specification depend on the data

## fixed size, default
BOD %>% ggplot(aes(Time, demand)) + geom_point()
## fixed size, given as size argument
BOD %>% ggplot(aes(Time, demand)) + geom_point(size = 5)
## size of each point depends on value of Time column for that point
BOD %>% ggplot(aes(Time, demand)) + geom_point(aes(size = Time))
## THIS CAUSES AN ERROR! Outside aes(), Time is not looked up in BOD
BOD %>% ggplot(aes(Time, demand)) + geom_point(size = Time)
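The difference between a fixed and a mapped size can be inspected in the data each layer draws. This is a minimal sketch, assuming ggplot2 is installed and that no object named Time exists in the workspace; layer_data() and tryCatch() are not used in the slides, and the names base, fixed, mapped, and result are illustrative.

```r
library(ggplot2)

base <- ggplot(BOD, aes(Time, demand))

# Fixed size: a constant passed as an argument to the geom.
fixed <- layer_data(base + geom_point(size = 5))
unique(fixed$size)  # a single value shared by every point

# Mapped size: inside aes(), Time refers to the column of BOD.
mapped <- layer_data(base + geom_point(aes(size = Time)))
length(unique(mapped$size))  # one size per distinct Time value

# Outside aes(), Time is evaluated as an ordinary variable and is not found
# (assumption: no object named Time exists in the workspace).
result <- tryCatch(
  geom_point(size = Time),
  error = function(e) conditionMessage(e)
)
result  # the error message, captured as text
```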

SLIDE 17

Combining numbers and text in a graph

gene_exp1
# A tibble: 3 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 ABC123       1
2 DEF234      10         3
3 GKK7        12        13

SLIDE 18

Plotting treatment vs control

gene_exp1 %>% ggplot(aes(control, treatment)) + geom_point()

[Figure: scatterplot of treatment (y-axis) versus control (x-axis)]

SLIDE 19

Plotting treatment vs control with gene labels

gene_exp1 %>% ggplot(aes(control, treatment)) + geom_point() + geom_text(aes(label = gene))

[Figure: scatterplot of treatment versus control, each point labeled with its gene name (ABC123, DEF234, GKK7)]

SLIDE 20

Control placement of text

gene_exp1 %>%
  ggplot(aes(control, treatment)) +
  geom_point() +
  geom_text(aes(label = gene), hjust = "left", vjust = "bottom")

[Figure: scatterplot of treatment versus control, with gene labels (ABC123, DEF234, GKK7) placed beside the points]

SLIDE 21

Grouping

gene_exp1
# A tibble: 3 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 ABC123       1
2 DEF234      10         3
3 GKK7        12        13

  • Let’s graph the control and treatment values separately for each gene.
  • We’ll need the data in tall format.
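As an illustration of what tall format means, here is a small sketch, assuming the tidyr package is installed. The data frame wide and its values are made up for illustration and are not the course data. The setup slide uses gather() for this step; pivot_longer() is the tidyr function that superseded it and is shown here as an alternative.

```r
library(tidyr)

# Hypothetical wide data: gene names and values are made up for illustration.
wide <- data.frame(
  gene = c("geneA", "geneB"),
  control = c(1.0, 4.0),
  treatment = c(2.5, 3.0)
)

# Tall format: one row per gene and condition.
tall <- pivot_longer(
  wide,
  cols = control:treatment,
  names_to = "condition",
  values_to = "expression_level"
)
tall
```

In the tall version, condition is an ordinary column, so it can be mapped to color, group, or a facet.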

SLIDE 22

Grouping in a graph

gene_tall %>% ggplot(aes(gene, expression_level)) + geom_point()

[Figure: scatterplot of expression_level versus gene (ABC123, DEF234, GKK7)]

  • It would be nice if the data for each condition were grouped together (color? line?).

SLIDE 23

Use the mapping to assign color to the grouping variable

gene_tall %>% ggplot(aes(gene, expression_level)) + geom_point(aes(color = condition))

[Figure: scatterplot of expression_level versus gene, points colored by condition (legend: control, treatment)]

SLIDE 24

Use group to form groups for geom_line

gene_tall %>% ggplot(aes(gene, expression_level)) + geom_line(aes(group = condition, color = condition))

[Figure: expression_level versus gene, one colored line per condition (legend: control, treatment)]

SLIDE 25

When to use group?

  • You need to include group when the number of graphical objects is not the same as the number of observations to graph.

  • With geom_line, there are n endpoints, but only n-1 lines between them.
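The effect of group on geom_line can be seen by counting the groups ggplot2 assigns. This is a sketch under the assumption that ggplot2 is installed; it uses the built-in Orange data rather than gene_tall, and layer_data() plus the names base, ungrouped, and grouped are illustrative additions.

```r
library(ggplot2)

base <- ggplot(Orange, aes(age, circumference))

# Without group: all 35 observations are joined into a single line.
ungrouped <- layer_data(base + geom_line())
length(unique(ungrouped$group))

# With group = Tree: the observations are split into one line per tree.
grouped <- layer_data(base + geom_line(aes(group = Tree)))
length(unique(grouped$group))
```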

SLIDE 26

Facets

  • Most explorations of data involve making comparisons to highlight an important difference between subsets.
  • One way to do this visually is to put the data for each condition in a separate graph, called a “facet”.

  • ggplot can do this, making sure that the facet axes are nicely lined up.
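A small sketch of faceting, assuming ggplot2 is installed. It uses ToothGrowth, a built-in data frame chosen here only as a stand-in (it is not part of the slides), and layer_data() to show which facet each point is drawn in.

```r
library(ggplot2)

# ToothGrowth is a built-in data frame; supp has two levels (OJ and VC).
p <- ggplot(ToothGrowth, aes(dose, len)) +
  geom_point() +
  facet_wrap(vars(supp))

# Each facet is a panel; PANEL records which panel a point is drawn in.
panels <- layer_data(p)$PANEL
table(panels)
```

By default facet_wrap() uses the same x and y scales in every panel, which is what keeps the facets comparable.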

SLIDE 27

Facet example

gene_tall %>% ggplot(aes(gene, expression_level)) + geom_point() + facet_wrap(vars(condition))

[Figure: expression_level versus gene (ABC123, DEF234, GKK7) in two facets, control and treatment]

SLIDE 28

Exercise: Orange trees

ggplot(Orange, aes(age, circumference)) + geom_point()

[Figure: scatterplot of circumference versus age for all trees]

  • It would be better if we visually distinguished each tree’s data.
  • What is the visual equivalent of group_by?

SLIDE 29

Answer: Orange trees, using facets

ggplot(Orange, aes(age, circumference)) + geom_point() + facet_wrap(vars(Tree))

[Figure: circumference versus age in five facets, one per tree]

SLIDE 30

Answer: Orange trees, using grouping

ggplot(Orange, aes(age, circumference)) +
  geom_point(aes(color = Tree)) +
  geom_line(aes(color = Tree, group = Tree))

[Figure: circumference versus age, points and lines colored by Tree (legend order: 3, 1, 5, 2, 4)]

SLIDE 31

Labeling the graph

ggplot(Orange, aes(age, circumference)) +
  geom_point(aes(color = Tree)) +
  geom_line(aes(color = Tree, group = Tree)) +
  labs(x = "Age (days)",
       y = "Circumference (mm)",
       title = "Circumference vs. age for orange trees",
       subtitle = "Data from built-in data frame Orange")

[Figure: "Circumference vs. age for orange trees", subtitle "Data from built-in data frame Orange"; x-axis Age (days), y-axis Circumference (mm), colored by Tree (legend order: 3, 1, 5, 2, 4)]

SLIDE 32

Including the origin (0, 0)

ggplot(Orange, aes(age, circumference, group = Tree)) +
  geom_point(aes(color = Tree)) +
  geom_line(aes(color = Tree, group = Tree)) +
  expand_limits(x = 0, y = 0)

[Figure: circumference versus age with both axes extended to include the origin, colored by Tree (legend order: 3, 1, 5, 2, 4)]
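The effect of expand_limits() on the axis limits can be checked numerically. This is a hedged sketch, assuming ggplot2 is installed; layer_scales() and the get_limits() method are ggplot2 features not covered in the slides, and the names base and with_origin are illustrative.

```r
library(ggplot2)

base <- ggplot(Orange, aes(age, circumference)) + geom_point()

# Default limits come from the data alone.
range(Orange$age)
layer_scales(base)$x$get_limits()

# expand_limits() extends the limits so that they include the given values.
with_origin <- base + expand_limits(x = 0, y = 0)
layer_scales(with_origin)$x$get_limits()
layer_scales(with_origin)$y$get_limits()
```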

SLIDE 33

Reading

  • Read: 3 Data visualisation | R for Data Science (sections 3.1 to 3.4)
  • Read: ggplot2 book 1 Introduction | ggplot2: Elegant Graphics for Data Analysis
  • Read: ggplot2 book 2 Getting started with ggplot2 | ggplot2: Elegant Graphics for Data Analysis
