Working with pipes Computational Pipelines R.W. Oldford Pipes - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Working with pipes

Computational Pipelines R.W. Oldford

slide-2
SLIDE 2

Pipes

French surrealist painter Rene Magritte’s 1929 painting “The Treachery of Images” “The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it’s just a representation, is it not? So if I had written on my picture ‘This is a pipe’, I’d have been lying!”

slide-3
SLIDE 3

Pipes

Plumbing These too are representations of pipes, and pipe connectors

slide-4
SLIDE 4

Pipes, connectors, pipelines

Put them together and you get pipelines, with a variety of different connectors. The resulting pipelines can be large and complex.

slide-5
SLIDE 5

Computational pipes

Have some function/module which takes some input, performs some actions on it (transformations, summarizing, adding information, etc.) and produces output: If we have several of these, we can connect them one to another in sequence to produce a “pipeline” of modules or steps in the processing of the original input:

slide-6
SLIDE 6

Computational pipes

Have some function/module which takes some input, performs some actions on it (transformations, summarizing, adding information, etc.) and produces output: If we have several of these, we can connect them one to another in sequence to produce a “pipeline” of modules or steps in the processing of the original input: The connected components form a “pipeline” through which the original input “flows”, with some processing/transformation of the data occurring at each step.

slide-7
SLIDE 7

Computational pipelines

A simple metaphor (viz. that of laying pipes end to end):

◮ data passes through and is processed by a set of computational steps, serially linked so that the output of one becomes the input of the next

◮ E.g. the Unix operator | is called a “pipe”:

ls -R Notes | grep ".pdf" | sort | more

◮ E.g. a graphics rendering pipeline (from Kaufman, Fan and Petkov (2009), “Implementing the lattice Boltzmann model on commodity graphics hardware”, J. Stat. Mech.)
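The metaphor can be sketched in a few lines of base R. The three step functions and the helper run_pipeline() below are hypothetical names invented for this illustration; the only point is that each step's output becomes the next step's input.

```r
# Three small, hypothetical steps; each takes some input and returns output.
drop_missing <- function(x) x[!is.na(x)]
square       <- function(x) x^2
total        <- function(x) sum(x)

# Lay the pipes end to end: the output of each step is fed to the next one.
run_pipeline <- function(input, steps) {
  Reduce(function(value, step) step(value), steps, input)
}

result <- run_pipeline(c(1, NA, 2, 3), list(drop_missing, square, total))
result   # same value as total(square(drop_missing(c(1, NA, 2, 3))))
```

Reordering or swapping an element of the list changes the pipeline without touching the other steps.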

slide-8
SLIDE 8

magrittr

The (CRAN) package authored by Stefan Milton Bache (and later joined by Hadley Wickham)

“The magrittr (to be pronounced with a sophisticated french accent) is a package with two aims: to decrease development time and to improve readability and maintainability of code. Or even shortr: to make your code smokin’ (puff puff)!” (See outrageous French accent.)

“To archive its humble aims, magrittr (remember the accent) provides a new pipe-like operator, %>%, with which you may pipe a value forward into an expression or function call; something along the lines of x %>% f, rather than f(x).”

library(magrittr)

slide-9
SLIDE 9

magrittr - a pipe changes program control to program flow

The basic idea is simple. Instead of writing f(x), write x %>% f

In magrittr, %>% is a binary operator which pipes the output of the first operand and provides it as the first argument of the second operand.

So, instead of

head(mtcars, n = 3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

With magrittr we can use the pipe %>% to do the same thing

mtcars %>% head(n = 3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
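The equivalence can be checked directly. This is a small sketch using only magrittr and the built-in mtcars data; the variable names direct, piped and placed are made up for the illustration.

```r
library(magrittr)

direct <- head(mtcars, n = 3)        # ordinary function call
piped  <- mtcars %>% head(n = 3)     # LHS becomes the first argument of head()
identical(direct, piped)

# A bare function name on the right-hand side also works.
n_rows <- mtcars %>% nrow

# With the dot placeholder the LHS can go somewhere other than first.
placed <- 3 %>% head(mtcars, n = .)
identical(direct, placed)
```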

slide-10
SLIDE 10

magrittr - the payoff

By joining pipes end to end in a pipeline, through which data (output) flows and is treated by different operations along the way, a more natural program flow can often be had. For example, Hadley Wickham likes to illustrate this point using the nursery rhyme “Little Bunny foo foo” (sung to the tune of the traditional Canadian children’s song “Alouette”).

Little bunny foo foo
hopping through the forest
scooping up the field mice
bopping them on the head.

slide-11
SLIDE 11

magrittr - the payoff

How do we represent this natural language (English) expression

Little bunny foo foo / hopping through the forest / scooping up the field mice / bopping them on the head

in code?

# Start with a little bunny
# use forward assignment to name it
little_bunny() -> foo_foo

Which is a more natural expression? This standard procedural version?

# Without pipes:
bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse),
  head)

Or this pipelined version?

# Or with pipes:
foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

Note: the assignment little_bunny() -> foo_foo appeared in neither expression.
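The functions in the rhyme do not exist anywhere, so to run the comparison one has to invent them. In the sketch below every function is a hypothetical stand-in that simply appends text to a string, and the objects of each verb are passed as character strings rather than as the bare names used on the slide.

```r
library(magrittr)

# Hypothetical stand-ins: each verb just records what happened.
little_bunny <- function() "Little bunny Foo Foo"
hop_through  <- function(who, where) paste(who, "hopping through the", where)
scoop_up     <- function(who, what)  paste(who, "scooping up the", what)
bop_on       <- function(who, where) paste(who, "bopping them on the", where)

little_bunny() -> foo_foo

# Without pipes: read from the inside out.
nested <- bop_on(scoop_up(hop_through(foo_foo, "forest"), "field mice"), "head")

# With pipes: read from top to bottom.
piped <- foo_foo %>%
  hop_through("forest") %>%
  scoop_up("field mice") %>%
  bop_on("head")

identical(nested, piped)
```

Both forms compute the same value; the difference is only in the order in which a reader meets the steps.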

slide-12
SLIDE 12

magrittr - simplified program control with pipes

An example adapted from the magrittr package vignette:

mtcars %>%
  subset(am == 0) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251))
##   cyl   mpg   disp     hp drat   wt  qsec vs am gear carb      kpl
## 1   4 22.90 135.87  84.67 3.77 2.94 20.97  1  0 3.67 1.67 9.734790
## 2   6 19.12 204.55 115.25 3.42 3.39 19.21  1  0 3.50 2.50 8.127912
## 3   8 15.05 357.62 194.17 3.12 4.10 17.14  0  0 3.00 3.08 6.397755

Note (adapted from vignette):

  • 1. By default the left-hand side (LHS) will be piped in as the first argument of the function appearing on the right-hand side (RHS). This is the case in the subset and transform expressions.

  • 2. %>% may be used in a nested fashion, e.g. it may appear in expressions within arguments. This is used in the mpg to kpl conversion.

  • 3. When the LHS is needed at a position other than the first, one can use the dot (.) as placeholder. This is used in the aggregate expression.

slide-13
SLIDE 13

magrittr - simplified program control with pipes

# Again
mtcars %>%
  subset(am == 0) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251))

Note (continued from previous slide):

  • 4. Note that if a dot appears naturally as part of a normal R expression (e.g. in a formula), it is not confused with marking where the data will enter the pipe component (e.g. as it appears in the aggregate expression).

  • 5. If the dot (.) appears as the LHS of a pipeline it creates a single argument function around the pipeline. E.g. this defines the aggregator function for FUN = above.

  • 6. Note that magrittr has built in some functions like multiply_by() to get around purely binary operators like *, though this is not strictly necessary (e.g. `*`(2, 3) is the same as 2 * 3).
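Notes 4 to 6 can each be tried in isolation. The following sketch uses only magrittr and mtcars; the names round_mean and by_cyl are made up for the illustration.

```r
library(magrittr)

# Note 5: a pipeline that starts with the dot is itself a function of one argument.
round_mean <- . %>% mean %>% round(2)
is.function(round_mean)
round_mean(c(1, 2, 2.5))     # the mean is 1.8333..., rounded to two decimals

# Note 4: the dot inside the formula is left alone; `data = .` receives the LHS.
by_cyl <- mtcars %>% aggregate(. ~ cyl, data = ., FUN = round_mean)
nrow(by_cyl)                 # one row per distinct value of cyl

# Note 6: three spellings of the same multiplication.
multiply_by(2, 3)
`*`(2, 3)
2 %>% `*`(3)
```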

slide-14
SLIDE 14

magrittr - forward assignment -> at end

The result can be assigned to another variable using forward assignment

mtcars %>%
  subset(am == 0) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) ->   # FORWARD assignment
  new_mtcars

which preserves the direction of the data flow along the pipe. Unfortunately, the pipe flow ends with the forward assignment ->. E.g.

mtcars %>%
  subset(am == 0) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) ->   # forward assignment
  new_mtcars %>%   # pipe cannot continue after assignment
  head

Error in new_mtcars %>% head <- mtcars %>% subset(am == 0) %>% aggregate(. ~ : could not find function “%>%<-”

which seems like an oversight on the part of magrittr. (The immediate cause is R's operator precedence: -> binds less tightly than %>%, so R reads the last two lines as an assignment to the expression new_mtcars %>% head.)
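How R parses such an expression can be inspected with quote(), which needs no pipe package at all. The sketch below also shows one possible workaround, wrapping the assignment in parentheses so that its value can flow on; the names parsed, autos and first_two are made up for the illustration.

```r
library(magrittr)

# The parser turns `x -> target` into `target <- x`, and %>% binds tighter
# than ->, so the "target" here is the whole call `y %>% head`.
parsed <- quote(x -> y %>% head)
as.character(parsed[[1]])    # the top-level operation is an assignment
deparse(parsed[[2]])         # what R thinks is being assigned to

# A parenthesised assignment returns its value, so the pipe can continue.
(mtcars %>% subset(am == 0) -> autos) %>%
  head(2) -> first_two

nrow(autos)
nrow(first_two)
```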

slide-15
SLIDE 15

magrittr - forward assignment ends the pipe

Instead, use the assign() function (with the dot .) to maintain a sense of flow.

mtcars %>%
  subset(am == 0) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%   # pipe
  assign("new_mtcars", . ) %>%                        # use assign and dot
  head
##   cyl   mpg   disp     hp drat   wt  qsec vs am gear carb      kpl
## 1   4 22.90 135.87  84.67 3.77 2.94 20.97  1  0 3.67 1.67 9.734790
## 2   6 19.12 204.55 115.25 3.42 3.39 19.21  1  0 3.50 2.50 8.127912
## 3   8 15.05 357.62 194.17 3.12 4.10 17.14  0  0 3.00 3.08 6.397755

Little bunny foo-foo again:

# Or with pipes:
little_bunny() %>%
  assign("foo_foo", . ) %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

slide-16
SLIDE 16

magrittr - usual back assignment <- at start

Alternatively, one could reserve the variable name at the beginning:

new_mtcars <-   # Use assignment operation up front <-
  mtcars %>%
  subset(am == 0) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251))

head(new_mtcars)
##   cyl   mpg   disp     hp drat   wt  qsec vs am gear carb      kpl
## 1   4 22.90 135.87  84.67 3.77 2.94 20.97  1  0 3.67 1.67 9.734790
## 2   6 19.12 204.55 115.25 3.42 3.39 19.21  1  0 3.50 2.50 8.127912
## 3   8 15.05 357.62 194.17 3.12 4.10 17.14  0  0 3.00 3.08 6.397755

This comes at the conceptual cost of breaking the pipeline flow metaphor (a bit).

slide-17
SLIDE 17

magrittr - compound assignment

Or, perhaps most perversely, a different pipe connector could be used: the so-called compound assignment pipe-operator %<>%. N.B. this has the side-effect that the original data will be changed. For illustration, first copy mtcars

new_mtcars <- mtcars   # make a copy
head(new_mtcars, 2)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Now use the compound assignment %<>% on new_mtcars

new_mtcars %<>%   # Use compound assignment
  subset(am == 0) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251))

# what has happened to new_mtcars?
head(new_mtcars, 2)
##   cyl   mpg   disp     hp drat   wt  qsec vs am gear carb      kpl
## 1   4 22.90 135.87  84.67 3.77 2.94 20.97  1  0 3.67 1.67 9.734790
## 2   6 19.12 204.55 115.25 3.42 3.39 19.21  1  0 3.50 2.50 8.127912

It’s as if the data got passed through the pipe and bounced back at the end of the flow!
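The “bounce back” is easiest to see on a tiny vector. In this sketch x and y are throwaway names; the compound assignment is compared with the explicit reassignment it abbreviates.

```r
library(magrittr)

x <- c(1, 4, 9)
x %<>% sqrt          # pipe x through sqrt() and assign the result back to x
x

# The long-hand equivalent with the standard pipe and ordinary assignment.
y <- c(1, 4, 9)
y <- y %>% sqrt
identical(x, y)
```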

slide-18
SLIDE 18

magrittr - simplified program control with pipes

%>% works with any function provided it accepts the output from the pipe as its first argument. For example, we could also create a plot using pipes

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %>%
  with(plot(x = wt, y = lp100k,
            col = "firebrick",
            xlab = "Weight",
            ylab = "litres per 100 km."))

[Plot: litres per 100 km. against Weight]

Note that data.frame() here appended the column lp100k to its input data (e.g. try data.frame(mtcars, mtcars)).

slide-19
SLIDE 19

magrittr - pipes and with()

Note the use of with() to move things along through the pipe. For example, will the following work?

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %>%
  with(plot(x = wt, y = lp100k,
            col = "firebrick",
            xlab = "Weight",
            ylab = "litres per 100 km.")) %>%
  with(lines(x = range(wt), y = range(lp100k),
             col = "steelblue", lwd = 2))

The penultimate piece of pipe passed NULL, the output of the with(plot(...)), on to the final with(lines(...)). This does NOT work because of the NULL: lines() cannot determine the values of wt and of lp100k.

slide-20
SLIDE 20

magrittr - pipes and with()

Note the use of with() to move things along through the pipe. For example, will the following work?

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %>%
  with(plot(x = wt, y = lp100k,
            col = "firebrick",
            xlab = "Weight",
            ylab = "litres per 100 km.")) %>%
  axis(side = 3)

The penultimate piece of pipe passed on NULL as the output of the with(plot(...)). This works because axis() will happily accept (and ignore) the NULL: since the argument side was named and specified, the NULL is matched to the next argument, at, whose default is NULL anyway. Had side not been named, the call would have failed.

slide-21
SLIDE 21

magrittr - pipes and with()

Note the use of with() to move things along through the pipe. For example, will the following work?

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %>%
  with({
    plot(x = wt, y = lp100k,
         col = "firebrick",
         xlab = "Weight",
         ylab = "litres per 100 km.")
    lines(x = range(wt), y = range(lp100k),
          col = "steelblue", lwd = 2)
  }) %>%
  axis(side = 3)

The last piece of pipe passed on NULL, the output of the with({...}), to axis(). This works because axis(side = 3) is called on the active graphics device, wherever it might be, which is pretty bad programming style.

slide-22
SLIDE 22

magrittr - pipes and with()

The preferred use of with():

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %>%
  with({
    plot(x = wt, y = lp100k,
         col = "firebrick",
         xlab = "Weight",
         ylab = "litres per 100 km.")
    lines(x = range(wt), y = range(lp100k),
          col = "steelblue", lwd = 2)
    axis(side = 3, col = "blue", col.ticks = "forestgreen")
  })

This time the value of the with(...) would be that of axis(), which returns the locations of the axis tick marks. You really need to know what each piece of a pipe passes on to the next!

slide-23
SLIDE 23

magrittr - the exposition pipe %$%

Instead of using with(), an exposition pipe %$% will do nearly the same. It exposes the names from the data on the LHS to be used in the RHS expression. For example, the following allows plot() to refer to the names wt and lp100k of its input data.frame.

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %$%   # exposition pipe
  plot(wt, lp100k,
       col = "firebrick",
       xlab = "Weight",
       ylab = "litres per 100 km.")

[Plot: litres per 100 km. against Weight]
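The relationship between with() and %$% can be checked without any graphics. This sketch uses cor() on mtcars; the names via_with, via_expose and attempt are made up for the illustration.

```r
library(magrittr)

via_with   <- with(mtcars, cor(mpg, wt))
via_expose <- mtcars %$% cor(mpg, wt)      # column names are exposed to cor()
identical(via_with, via_expose)

# The standard pipe would instead call cor(mtcars, mpg, wt),
# where mpg is not a visible object, so the call fails.
attempt <- tryCatch(mtcars %>% cor(mpg, wt), error = function(e) e)
inherits(attempt, "error")

# The exposition pipe can follow ordinary pipes.
auto_cor <- mtcars %>% subset(am == 0) %$% cor(mpg, wt)
```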

slide-24
SLIDE 24

magrittr - the exposition pipe %$%

Note, however, that the pipe has ended with the plot (no further piping). We cannot, for example, now pipe to lines() or to axis(), and expect to continue to refer to the names of the dataset.

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %$%   # exposition pipe
  plot(wt, lp100k,
       col = "firebrick",
       xlab = "Weight",
       ylab = "litres per 100 km.") %>%
  lines(x = range(mtcars$wt), y = range(mtcars$lp100k),
        col = "steelblue", lwd = 2) %>%
  axis(side = 3, col = "blue", col.ticks = "forestgreen")

[Plot: litres per 100 km. against Weight, with an additional axis along the top]

Instead, as above, we had to reintroduce the data mtcars, which breaks the pipe metaphor.

slide-25
SLIDE 25

magrittr - the exposition pipe %$%

Fortunately, this problem is easily resolved using braces {}.

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %$%   # exposition pipe
  {plot(x = wt, lp100k,
        col = "firebrick",
        xlab = "Weight",
        ylab = "litres per 100 km.")
   lines(x = range(wt), y = range(lp100k),
         col = "steelblue", lwd = 2)
   axis(side = 3, col = "blue", col.ticks = "forestgreen")
  }

[Plot: litres per 100 km. against Weight, with an additional axis along the top]

slide-26
SLIDE 26

magrittr - the exposition pipe %$% and braces {}

Can the piping continue? Of course. Just be mindful of the last output . . . it might not be what you need. Remember,

mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %$%   # exposition pipe
  {plot(x = wt, lp100k,
        col = "firebrick",
        xlab = "Weight",
        ylab = "litres per 100 km.")
   lines(x = range(wt), y = range(lp100k),
         col = "steelblue", lwd = 2)
   axis(side = 3, col = "blue", col.ticks = "forestgreen")
  } %>%
  print

[Plot: litres per 100 km. against Weight, with an additional axis along the top]

## [1] 2.5 3.0 3.5 4.0 4.5 5.0 5.5

which are just . . . the numeric tick locations from axis().

slide-27
SLIDE 27

magrittr - the tee pipe %T>%

The tee pipe %T>% is like the usual pipe %>% except that it returns the value of the LHS instead of the value of the RHS. This can be very handy. For example,

library(knitr)
mtcars %$%   # exposition pipe
  lm(mpg ~ wt) %T>%   # tee pipe, fit is passed on through the next piece
  {plot(x = .$model$wt, y = .$model$mpg,
        col = "firebrick",
        main = "1974 Motor Trend magazine",
        xlab = "Weight",
        ylab = "miles per US gallon")
   abline(.$coef, col = "steelblue")
  } %>%            # standard pipe
  coefficients %>% # standard pipe
  round %>%        # standard pipe
  kable

[Plot: “1974 Motor Trend magazine”, miles per US gallon against Weight, with fitted line]

             |  x
(Intercept)  | 37
wt           | -5
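The difference between the two pipes shows up as soon as the right-hand side is called only for its side effect. A minimal sketch, using cat() (which prints its arguments and returns NULL) in place of a plotting function; kept and lost are made-up names.

```r
library(magrittr)

# Tee pipe: cat() prints the values, but the LHS itself is what flows on to sqrt().
kept <- c(1, 4, 9) %T>% cat("\n") %>% sqrt
kept

# Standard pipe: what flows on is the return value of cat(), which is NULL.
lost <- c(1, 4, 9) %>% cat("\n")
is.null(lost)
```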
slide-28
SLIDE 28

magrittr - the tee pipe %T>% and loon

The tee pipe %T>% is especially handy with loon plots. For example,

set.seed(314159)
library(loon)
mtcars %$%   # exposition pipe
  l_plot(x = wt, y = mpg,
         color = cyl,
         glyph = c("ocircle", "ccircle")[am+1],
         size = hp/5,
         itemLabel = rownames(.),
         title = "1974 Motor Trend magazine",
         xlabel = "Weight (1000s of lbs)",
         ylabel = "miles per US gallon") %>%   # standard pipe, plot passed on
  l_configure("selected" = sample(c(TRUE, rep(FALSE, 5)),
                                  length(.["x"]),
                                  replace = TRUE)) %T>%
  # tee pipe, because l_scaleto_selected returns NULL, and plot needs to be passed on
  l_scaleto_selected %>%   # standard pipe since l_configure returns plot
  l_configure("showGuides" = TRUE, "showScales" = TRUE) %>%   # standard pipe
  l_configure("showItemLabels" = TRUE) %T>%
  # tee pipe because a layer would be returned by l_layer_line
  l_layer_line(x = sort(.["x"]),
               y = predict(lm(mpg ~ wt,
                              data = data.frame(wt = .["x"], mpg = .["y"])),
                           newdata = data.frame(wt = sort(.["x"]))),
               color = l_getOption("select-color"),
               linewidth = 4) ->   # forward assignment of plot to
  p

slide-29
SLIDE 29

magrittr - the tee pipe %T>% and loon

produces

plot(p)

[Plot: “1974 Motor Trend magazine”, miles per US gallon against Weight (1000s of lbs)]

slide-30
SLIDE 30

magrittr - lots of ways to write the code

For example,

set.seed(314159)
library(loon)
mtcars %$% {   # exposition pipe to several statements
  l_hist(disp,
         xlabel = "Displacement (cubic inches)",
         linkingGroup = "mtcars") ->> h   # global assignment
  l_plot(x = wt, y = mpg,
         linkingGroup = "mtcars",
         color = cyl,
         glyph = c("ocircle", "ccircle")[am+1],
         size = hp/5,
         itemLabel = rownames(.),
         title = "1974 Motor Trend magazine",
         xlabel = "Weight (1000s of lbs)",
         ylabel = "miles per US gallon",
         showGuides = TRUE, showScales = TRUE,
         showItemLabels = TRUE,
         selected = disp < median(disp))
  } %T>%   # tee pipe because next expression returns NULL
  l_scaleto_selected %T>%   # tee pipe again ... why?
  l_layer_line(x = sort(.["x"][.["selected"]]),
               y = predict(loess(mpg ~ wt,
                                 data = data.frame(wt = .["x"], mpg = .["y"]),
                                 # fit only the selected observations
                                 subset = .["selected"]),
                           newdata = data.frame(wt = sort(.["x"][.["selected"]]))),
               color = l_getOption("select-color"),
               linewidth = 4, dash = c(10,4),
               index = "end"   # bottommost layer
               ) ->   # forward assignment
  p

slide-31
SLIDE 31

magrittr - the tee pipe %T>% and loon

From the previous plots we can still make adjustments and then export the results to a grid graphic.

h["showStackedColors"] <- TRUE
l_scaleto_world(p)
# Get the grobs necessary for grid.arrange
gh <- loonGrob(h)
gp <- loonGrob(p)
library(gridExtra)   # contains some friendly grid functions like
grid.arrange(gh, gp, nrow = 1)

[Two plots side by side: a histogram (Frequency against Displacement (cubic inches)) and “1974 Motor Trend magazine” (miles per US gallon against Weight (1000s of lbs))]

slide-32
SLIDE 32

magrittr - same approach (sort of) with base graphics

For example,

set.seed(314159)
mtcars %$% {   # exposition pipe to several statements
  savePar <- par(mfrow = c(1,2))
  hist(disp, xlab = "Displacement (cubic inches)")
  plot(x = wt, y = mpg,
       col = cyl,
       pch = c(19, 21)[am+1],
       cex = hp/50,   # divide by 50 now
       main = "1974 Motor Trend magazine",
       xlab = "Weight (1000s of lbs)",
       ylab = "miles per US gallon")
  orderX <- order(wt)
  lines(x = sort(wt),   # fit all observations
        y = predict(loess(mpg ~ wt,
                          data = data.frame(wt = wt[orderX], mpg = mpg[orderX]))),
        col = "grey", lwd = 4, lty = 2)
  par(savePar)
}   # no assignment, no tee pipe

[Two plots side by side: “Histogram of disp” (Frequency against Displacement (cubic inches)) and “1974 Motor Trend magazine” (miles per US gallon against Weight (1000s of lbs))]

slide-33
SLIDE 33

On using magrittr pipes

  • 1. Pipes connect a left hand side expression, LHS, to a right hand side expression, RHS, as in

LHS %pipe% RHS

where the result of the LHS expression is passed as the first argument to the RHS expression. The result of the LHS can be referenced as a dot . in the RHS.

  • 2. Pipelines are a series of connected pipes:

expr_1 %pipe% expr_2 %pipe% expr_3 %pipe% ... %pipe% expr_k

These are evaluated left to right in pairs as in

((...((expr_1 %pipe% expr_2) %pipe% expr_3) %pipe% ...) %pipe% expr_k)
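The left-to-right pairing can be confirmed on a short vector. In this sketch chained, grouped and nested are made-up names for three ways of writing the same computation.

```r
library(magrittr)

x <- c(3, 1, 2)

chained <- x %>% sort %>% rev %>% head(2)          # as usually written
grouped <- ((x %>% sort) %>% rev) %>% head(2)      # explicit left-to-right pairs
nested  <- head(rev(sort(x)), 2)                   # ordinary nested calls

identical(chained, grouped)
identical(chained, nested)
```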

slide-34
SLIDE 34

On using magrittr pipes

  • 3. There are four different pipes: %>%, %T>%, %$%, and %<>%

◮ Standard: LHS %>% RHS
    ◮ result of RHS is passed on
◮ Tee: LHS %T>% RHS
    ◮ result of LHS is passed on from RHS
◮ Exposition: LHS %$% RHS
    ◮ names of result of LHS are exposed to RHS
    ◮ result of RHS is passed on
◮ Compound assignment: LHS %<>% RHS
    ◮ result of RHS is passed on
    ◮ result of pipeline is finally assigned to LHS
    ◮ changes LHS by side-effect (e.g. try iris[,1:4] %<>% sqrt)

slide-35
SLIDE 35

On using magrittr pipes

  • 4. Pipelines are most easily understood when it is essentially the same object being pushed through the pipes.

◮ Example: data construction pipeline

mtcars %>%
  subset(am == 0) %>%
  transform(lp100k = 100 /(mpg * 0.4251)) ->
  autoTransData

◮ Example: model pipeline

autoTransData %$%
  lm(lp100k ~ wt) %>%
  predict(interval = "prediction") ->
  autoTransFit

slide-36
SLIDE 36

On using magrittr pipes

  • 4. Continued

◮ Example: plot pipeline (mainly implicit pipeline)

data.frame(autoTransData, autoTransFit)[order(autoTransData[, "wt"]),] %T>%
  with({
    plot(wt, lp100k,
         ylim = extendrange(c(lwr, fit, upr)),
         xlab = "Weight (1000s of lbs)",
         ylab = "Litres per 100 kilometres",
         main = "Cars with automatic transmissions",
         col = "firebrick", pch = 19)
    lines(wt, fit, col = "steelblue")
    lines(wt, lwr, col = "firebrick", lty = 2)
    lines(wt, upr, col = "firebrick", lty = 2)
  }) ->
  autoTransDataFit

[Plot: “Cars with automatic transmissions”, Litres per 100 kilometres against Weight (1000s of lbs)]

slide-37
SLIDE 37

On using magrittr pipes

  • 4. Continued

◮ Example: loon plot pipeline (more obvious pipeline)

data.frame(autoTransData, autoTransFit)[order(autoTransData[, "wt"]),] %>%
  with({
    l_plot(x = wt, y = lp100k,
           linkingGroup = "automatic transmissions",
           xlabel = "Weight (1000s of lbs)",
           ylabel = "Litres per 100 kilometres",
           title = "Cars with automatic transmissions",
           color = "firebrick", glyph = "circle") %T>%
      l_layer_line(x = wt, y = fit, color = "steelblue", index = "end") %T>%
      l_layer_line(x = wt, y = lwr, dash = c(5,5), color = "firebrick", index = "end") %T>%
      l_layer_line(x = wt, y = upr, dash = c(5,5), color = "firebrick", index = "end") %T>%
      l_scaleto_world
  }) ->
  p

plot(p)

[Plot: “Cars with automatic transmissions”, Litres per 100 kilometres against Weight (1000s of lbs)]

slide-38
SLIDE 38

On using magrittr pipes

When should you use pipelines?

◮ Not as a general programming style.
◮ More for data analysis, data wrangling, . . .
    ◮ track your analysis in easy to read pieces
    ◮ use many sets of small pipelines
    ◮ helps you understand and identify chunks of analysis
    ◮ interrupt the pipe anywhere to make sure you are getting what you intended
    ◮ edit and re-run (true for any commands in a file)
    ◮ provides reusable chunks that might be adapted and applied to different data and analyses
    ◮ if the pipeline becomes too difficult to read, it probably needs to be separated into different pieces
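“Interrupting the pipe” might look like the following sketch, which rebuilds the data construction pipeline from point 4 one piece at a time. The intermediate names step1 and step2 and the particular checks are made up for the illustration.

```r
library(magrittr)

# Run only the first piece and check that it did what was intended.
step1 <- mtcars %>% subset(am == 0)
stopifnot(all(step1$am == 0))

# Add the next piece and check again.
step2 <- step1 %>% transform(lp100k = 100 / (mpg * 0.4251))
stopifnot("lp100k" %in% names(step2))

# Once every piece checks out, join them into a single short pipeline.
mtcars %>%
  subset(am == 0) %>%
  transform(lp100k = 100 / (mpg * 0.4251)) ->
  autoTransData

identical(step2, autoTransData)
```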

slide-39
SLIDE 39

A pipeline model for statistical graphics

Lee Wilkinson’s monumental The Grammar of Graphics begins with a pipeline model for constructing statistical graphics: Each step in the pipeline transforms its input to produce output for the next step. The order of steps is essential, though not all need be there for every plot. Because the pipeline consists of separate components, the final graphic that is rendered can be simply and sometimes dramatically changed by making changes to a single component in the pipeline.

slide-40
SLIDE 40

ggplot2 – a grammar of graphics for R

library(ggplot2)

Inspired by Wilkinson’s “Grammar of Graphics”, ggplot2 is a “layered” grammar of graphics.

Much like Wilkinson’s original grammar, ggplot2 uses a pipeline model for its graphics construction in that a plot is built in an ordered series of steps, where each step operates on the output of its immediate predecessor in the line. Departing from the grammar, ggplot2 slightly mixes metaphors in that each step in the pipeline can (typically) be thought of as adding a layer to all that preceded it. From the ggplot2 book:

"The layered grammar of graphics (Wickham 2009) builds on Wilkinson’s grammar, focussing on the primacy of layers and adapting it for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic."

Notationally, the components of the pipeline appear in sequence connected one to the next via an intervening + sign, thus emphasizing each as an addition of a layer (or of some further processing of the plot). Note that the + sign mixes the concepts of layer and of pipeline.

slide-41
SLIDE 41

Example - South African heart disease

Consider the SAheart data from the package ElemStatLearn. This is a sample from a retrospective study of heart disease in males from a high-risk region of the Western Cape, South Africa. There are 462 cases and 10 variates. The first few observations (cases) are shown below.

sbp  tobacco  ldl   adiposity  famhist  typea  obesity  alcohol  age  chd
160  12.00    5.73  23.11      Present  49     25.30    97.20    52   1
144   0.01    4.41  28.61      Absent   55     28.87     2.06    63   1
118   0.08    3.48  32.28      Present  52     29.14     3.81    46   0
170   7.50    6.41  38.03      Present  51     31.99    24.26    58   1
134  13.60    3.50  27.78      Present  60     25.99    57.34    49   1
132   6.20    6.47  36.21      Present  62     30.77    14.14    45   0

For example, sbp denotes “systolic blood pressure”, ldl “low density lipoprotein cholesterol”, famhist “family history of heart disease”, age “age at onset” (in years), and chd indicates whether the patient has coronary heart disease or not (a response).

(see help(SAheart, package="ElemStatLearn") for details)

slide-42
SLIDE 42

Constructing a plot - the pipeline

In the grammar of graphics, a plot processes each component in turn

ggplot(data = SAheart)

First the data

slide-43
SLIDE 43

Constructing a plot - pipeline

In the grammar of graphics, a plot processes each component in turn

ggplot(data = SAheart) + aes( x = age, y = chd)

[Plot: age on the horizontal axis, chd on the vertical axis]

Then the mapping of the data to plot “aesthetics”

slide-44
SLIDE 44

Constructing a plot - pipeline

In the grammar of graphics, a plot processes each component in turn

ggplot(data = SAheart) + aes( x = age, y = chd) + geom_point()

[Plot: chd against age]

Then the geometry.

slide-45
SLIDE 45

Constructing a plot - pipeline

In the grammar of graphics, a plot processes each component in turn

ggplot(data = SAheart) + aes( x = age, y = chd) + geom_point() + geom_smooth()

[Plot: chd against age]

Which can have several further steps in the pipeline

slide-46
SLIDE 46

Constructing a plot

Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.

ggplot(data = SAheart, mapping = aes(x = age, y = chd))

[Plot: age on the horizontal axis, chd on the vertical axis]

The base display with mapping.

slide-47
SLIDE 47

Constructing a plot

Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.

ggplot(data = SAheart, mapping = aes(x = age, y = chd)) + geom_point()

[Plot: chd against age]

Here the + is adding layers.

slide-48
SLIDE 48

Constructing a plot

Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.

ggplot(data = SAheart, mapping = aes(x = age, y = chd)) + geom_point() + geom_smooth()

[Plot: chd against age]

Here the + is adding layers.

slide-49
SLIDE 49

Constructing a plot - separate mappings

Alternatively, we could deliberately associate only the data with the plot, forcing the mapping of the data to aesthetics within each individual component layer:

ggplot(data = SAheart) + geom_point(mapping = aes(x = age, y = chd))

[Plot: chd against age]

The mapping is explicit for each layer.

slide-50
SLIDE 50

Constructing a plot - separate mappings

What would the following plot look like?

ggplot(data = SAheart) + geom_point(mapping = aes(x = age, y = chd)) + geom_smooth()

slide-51
SLIDE 51

Constructing a plot - separate mappings

What would the following plot look like?

ggplot(data = SAheart) + geom_point(mapping = aes(x = age, y = chd)) + geom_smooth()

It fails . . . why? How could it be fixed?

slide-52
SLIDE 52

Constructing a plot - separate mappings

What would the following plot look like?

ggplot(data = SAheart) + geom_point(mapping = aes(x = age, y = chd)) + geom_smooth()

It fails . . . why? How could it be fixed? Cautionary note: the ggplot2 grammar mixes the two metaphors of “layers” and “pipes”. Just because an aesthetic precedes a component in the pipeline does not mean that it is available for use.

slide-53
SLIDE 53

Constructing a plot - separate mappings

Solution 1: explicitly, give the mapping for each layer:

ggplot(data = SAheart) + geom_point(mapping = aes(x = age, y = chd)) + geom_smooth(mapping = aes(x = age, y = chd))

[Plot: chd against age]

slide-54
SLIDE 54

Constructing a plot - separate mappings

Solution 2: provide aesthetics upstream:

ggplot(data = SAheart) + geom_point(mapping = aes(x = age, y = chd)) + aes(x = age, y = chd) + geom_smooth()

[Plot: chd against age]
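Where a mapping lives can be inspected on the plot object itself, without drawing anything. Because ElemStatLearn may not be installed, this sketch substitutes the built-in mtcars data (wt and mpg in place of age and chd); the names layer_only and shared are made up, and reading the mapping and layers components of the object is an inspection convenience rather than anything shown on the slides.

```r
library(ggplot2)

# Mapping given only inside the point layer (mtcars stands in for SAheart).
layer_only <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))

names(layer_only$mapping)               # aesthetics shared by the whole plot
names(layer_only$layers[[1]]$mapping)   # aesthetics private to the point layer

# Adding aes() to the plot makes the mapping available to every layer,
# including a geom_smooth() that supplies no mapping of its own.
shared <- layer_only + aes(x = wt, y = mpg) + geom_smooth()
names(shared$mapping)
length(shared$layers)
```

A layer with no mapping of its own can only draw on the plot-level aesthetics, which is why the position of an earlier layer's mapping in the “pipeline” does not help it.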

slide-55
SLIDE 55

Constructing a plot - separate mappings

ggplot(data = SAheart) + geom_point(mapping = aes(x = age, y = chd, col = famhist)) + geom_smooth(mapping = aes(x = age, y = chd))

[Plot: chd against age; legend: famhist (Absent, Present)]

slide-56
SLIDE 56

Constructing a plot - shared and separate mappings

ggplot(data = SAheart) + aes(group = famhist) + geom_point(mapping = aes(x = age, y = chd)) + geom_smooth(mapping = aes(x = age, y = chd))

[Plot: chd against age]

slide-57
SLIDE 57

Constructing a plot - shared and separate mappings

ggplot(data = SAheart, mapping = aes(group = famhist)) + geom_point(mapping = aes(x = age, y = chd, col = famhist)) + geom_smooth(mapping = aes(x = age, y = chd))

[Plot: chd against age; legend: famhist (Absent, Present)]

slide-58
SLIDE 58

Constructing a plot - shared and separate mappings

ggplot(data = SAheart, mapping = aes(group = famhist, col = famhist)) + geom_point(mapping = aes(x = age, y = chd)) + geom_smooth(mapping = aes(x = age, y = chd))

[Plot: chd against age; legend: famhist (Absent, Present)]

slide-59
SLIDE 59

Constructing a plot

Alternatively, we could split the plot into two pieces by facetting:

ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
  geom_point(col = "steelblue", size = 3, alpha = 0.4) +
  geom_smooth(method = "loess", col = "steelblue") +
  facet_wrap(~famhist)

[Plot: chd versus age, in two panels (Absent, Present)]

slide-60
SLIDE 60

Components of the layered grammar

In the grammar of ggplot2, a plot is a structured combination of:

◮ a dataset,
◮ a set of mappings from variates to aesthetics,
◮ one or more layers, each composed of
  ◮ a geometric object,
  ◮ a statistical transformation,
  ◮ a position adjustment, and
  ◮ (optionally) its own dataset and aesthetic mappings
◮ a scale for each aesthetic mapping,
◮ a coordinate system,
◮ a facetting specification
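As a rough illustration of how these components line up with code, the sketch below spells out each one explicitly. It assumes ggplot2 is installed and uses the built-in mtcars data rather than SAheart; layer() is the low-level constructor that the geom_*() functions wrap, and the particular variables chosen are only for illustration.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                      # a dataset
            mapping = aes(x = wt, y = mpg)) +   # mappings from variates to aesthetics
  layer(geom = "point",                         # a geometric object
        stat = "identity",                      # a statistical transformation
        position = "identity") +                # a position adjustment
  scale_x_continuous() +                        # a scale for the x aesthetic
  scale_y_continuous() +                        # a scale for the y aesthetic
  coord_cartesian() +                           # a coordinate system
  facet_wrap(~ am)                              # a facetting specification

# The data actually drawn by the single layer, one row per point.
built <- layer_data(p, 1)
```

In everyday use most of these pieces are left at their defaults, which is why a call such as ggplot(...) + geom_point() is usually enough.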

slide-61
SLIDE 61

Geometric objects

There are a variety of geometric objects that can be added to a plot

◮ geom_abline(), geom_hline(), geom_vline(), geom_curve(), geom_segment(), geom_step()
◮ geom_label(), geom_text()
◮ geom_point(), geom_smooth(), geom_crossbar(), geom_errorbar(), geom_errorbarh(), geom_linerange(), geom_pointrange()
◮ geom_rect(), geom_raster(), geom_area(), geom_ribbon(), geom_tile()
◮ geom_bar(), geom_col()
◮ geom_dotplot(), geom_boxplot(), geom_histogram(), geom_freqpoly(), geom_density(), geom_violin(), geom_quantile(), geom_qq()
◮ geom_bin2d(), geom_density2d(), geom_hex()
◮ geom_contour()
◮ geom_map(), geom_polygon()

Each of these has its own arguments, including mapping, data, stat, et cetera.

slide-62
SLIDE 62

Geometric objects - adding to plots

Beginning with a plot different geometric objects may be added. For example:

p <- ggplot(data = SAheart, mapping = aes(x = tobacco, y = sbp))
p

[Plot: sbp versus tobacco]

slide-63
SLIDE 63

Geometric objects - points and density

Beginning with a plot different geometric objects may be added. For example:

p + geom_point() +
  geom_density_2d(lwd = 1.5, col = "steelblue")

[Plot: sbp versus tobacco]

slide-64
SLIDE 64

Geometric objects - histogram

h <- ggplot(data = SAheart, mapping = aes(x = adiposity))
h + geom_histogram(bins = 10, fill = "steelblue", col = "black", alpha = 0.5)

[Plot: count versus adiposity]

Note that had we tried to layer the histogram on top of p, it would have inherited from p a y aesthetic (namely y = sbp) which does not make sense for a histogram.

slide-65
SLIDE 65

Geometric objects - histogram

h + geom_histogram(mapping = aes(y = ..density..),
                   bins = 10, fill = "steelblue", col = "black", alpha = 0.5)

[Plot: density versus adiposity]

A y aesthetic that does make sense for a histogram is ..density.. which forces the scaling of the vertical axis so that the histogram has unit area. Note that the x aesthetic was inherited from h.
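The unit-area claim can be verified numerically. This is a small sketch, assuming ggplot2 is installed and using mtcars$mpg in place of adiposity; it writes the computed variable as after_stat(density), which recent versions of ggplot2 offer as the newer spelling of ..density.., and reads the bar geometry back with layer_data().

```r
library(ggplot2)

h <- ggplot(data = mtcars, mapping = aes(x = mpg))
dens_hist <- h + geom_histogram(mapping = aes(y = after_stat(density)), bins = 10)

# One row per bar: y is the bar height (the density), xmin/xmax its edges.
bars <- layer_data(dens_hist, 1)

# Total area of the bars = sum of height * width.
bar_area <- sum(bars$y * (bars$xmax - bars$xmin))
```

Up to floating-point rounding, bar_area comes out as 1, which is what makes the density scale comparable with an overlaid geom_density().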

slide-66
SLIDE 66

Geometric objects - density scale histogram

Provided we supply a y aesthetic mapping (and, here, a new x as well), a histogram could therefore be added to p as well.

p + geom_histogram(mapping = aes(x = adiposity, y = ..density..),
                   bins = 10, fill = "steelblue", col = "black", alpha = 0.5)

[Plot: density histogram of adiposity; axes still labelled tobacco and sbp]

Note:

◮ the change in vertical scale matches the histogram
◮ the axes labels do not match the aesthetics of the histogram (though the tick marks and values happen to)

Because this is only a grammar, it is as easy to make silly visualizations as it is silly sentences.

slide-67
SLIDE 67

Geometric objects - layering effect

The order of layering (on top of h now) matters:

h + geom_histogram(mapping = aes(y = ..density..),
                   bins = 10, fill = "steelblue", col = "black", alpha = 0.5) +
  geom_density(mapping = aes(y = ..density..),
               fill = "grey", alpha = 0.5)

[Plot: density versus adiposity]

Note that the y aesthetic had to be repeated here . . .

slide-68
SLIDE 68

Geometric objects - layering effect

Switch the order of addition:

h + geom_density(mapping = aes(y = ..density..),
                 fill = "grey", alpha = 0.5) +
  geom_histogram(mapping = aes(y = ..density..),
                 bins = 10, fill = "steelblue", col = "black", alpha = 0.5)

[Plot: density versus adiposity]

Note that the aesthetics need to be repeated here . . .

slide-69
SLIDE 69

Geometric objects - bar charts

ggplot(SAheart) +
  geom_bar(mapping = aes(x = factor(chd), fill = famhist)) +
  labs(x = "chd", title = "South African heart disease") +
  coord_flip()

[Plot: "South African heart disease"; horizontal bars of count by chd; fill legend famhist (Absent, Present)]

Which makes you wonder how the data were collected.

slide-70
SLIDE 70

Geometric objects

A different scatterplot

p2 <- ggplot(data = SAheart, mapping = aes(x = sqrt(age), y = sbp))
p2 + geom_point()

[Plot: sbp versus sqrt(age)]

slide-71
SLIDE 71

Geometric objects

Note that each geometric object has its own arguments and properties that can be set.

p2 + geom_point(col = "red", size = 3, pch = 21, fill = "yellow", alpha = 0.5) +
  geom_smooth(method = "loess", col = "steelblue", lty = 2, lwd = 1.5, alpha = 0.2)

[Plot: sbp versus sqrt(age)]

slide-72
SLIDE 72

Geometric objects

Aesthetics apply to every point individually.

p2 + geom_point(mapping = aes(size = obesity),
                fill = "steelblue", col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(method = "loess", col = "yellow", lty = 2, lwd = 1.5, alpha = 0.2)

[Plot: sbp versus sqrt(age); size legend obesity]

slide-73
SLIDE 73

Geometric objects

Aesthetics apply to every point individually.

p2 + geom_point(mapping = aes(size = obesity, fill = tobacco),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(method = "loess", col = "yellow", lty = 2, lwd = 1.5, alpha = 0.2)

[Plot: sbp versus sqrt(age); size legend obesity; fill legend tobacco]

slide-74
SLIDE 74

Geometric objects

The data may change with each layer

heartAttack <- SAheart[, "chd"] == 1
hAplot <- p2 +
  geom_point(data = SAheart[heartAttack, ],
             mapping = aes(size = obesity),
             alpha = 0.4, col = "black", pch = 21, fill = "steelblue")
hAplot

[Plot: sbp versus sqrt(age); size legend obesity]

slide-75
SLIDE 75

Geometric objects

The data may change with each layer

qboth <- hAplot +
  geom_point(data = SAheart[!heartAttack, ],  # Not heartAttack
             mapping = aes(size = obesity),
             alpha = 0.4, col = "black", pch = 21, fill = "firebrick")
qboth

[Plot: sbp versus sqrt(age); size legend obesity]

slide-76
SLIDE 76

Geometric objects

The data may change with each layer

qboth +
  geom_smooth(data = SAheart[heartAttack, ],
              method = "loess", col = "steelblue", alpha = 0.4) +
  geom_smooth(data = SAheart[!heartAttack, ],
              method = "loess", col = "firebrick", alpha = 0.4)

[Plot: sbp versus sqrt(age); size legend obesity]

slide-77
SLIDE 77

Geometric objects

The data may change with each layer

qboth + geom_smooth(method = "loess")

[Plot: sbp versus sqrt(age); size legend obesity]

Note smooth is using all of the data here.

slide-78
SLIDE 78

Geometric objects

The data may change with each layer

qboth + geom_smooth(mapping = aes(color = factor(chd)), method = "loess")

[Plot: sbp versus sqrt(age); colour legend factor(chd); size legend obesity]

Here the smooth is separate for each colour given by chd as factor. Note ggplot’s default colours.

slide-79
SLIDE 79

Geometric objects

The colours can be coordinated by relying on the original data and using chd as a factor:

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2)

[Plot: sbp versus sqrt(age); fill and colour legend factor(chd); size legend obesity]

Here the smooth is separate for each colour given by chd as factor.

slide-80
SLIDE 80

Scales

A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, . . . ).

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick"))

[Plot: sbp versus sqrt(age); fill and colour legend chd; size legend obesity]

. . . gets your own “scale” values for colour and for fill.
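To see what a manual scale does to the data, the mapped values can be read back from the built plot. The sketch below assumes ggplot2 is installed and uses mtcars, with factor(am) playing the role of factor(chd); the values are assigned to the factor levels in order.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(mapping = aes(fill = factor(am)), pch = 21, size = 3) +
  scale_fill_manual("am", values = c("steelblue", "firebrick"))

# After the scale is applied, each point carries a concrete fill colour:
# level "0" is drawn in "steelblue" and level "1" in "firebrick".
fills <- layer_data(p, 1)$fill
```

The scale is thus the piece of the grammar that turns a data value (a factor level) into an aesthetic value (a colour name).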

slide-81
SLIDE 81

Scales

A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, . . . ).

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick")) +
  scale_size("obesity", breaks = seq(0, 100, 5))

[Plot: sbp versus sqrt(age); fill and colour legend chd; size legend obesity]

. . . additionally gets your own “scale” values for point size (which is proportional to area).

slide-82
SLIDE 82

Scales

A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, . . . ).

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick")) +
  scale_size_area("obesity", breaks = seq(0, 100, 5))

[Plot: sbp versus sqrt(age); fill and colour legend chd; size legend obesity]

. . . Now a zero value gives a zero area.

slide-83
SLIDE 83

Position scales

There are two position scales: horizontal (x) and vertical (y).

p + geom_point(alpha = 0.5, size = 1) +
  scale_x_continuous(limits = c(0, 40)) +
  scale_y_continuous(limits = c(75, 225))

[Plot: sbp versus tobacco]

slide-84
SLIDE 84

Position scales

There are two position scales: horizontal (x) and vertical (y).

p + geom_point(alpha = 0.5, size = 1) +
  xlim(0, 40) +
  ylim(75, 225)

[Plot: sbp versus tobacco]

slide-85
SLIDE 85

Position scales

There are two position scales: horizontal (x) and vertical (y).

p + aes(x = tobacco + 1) +
  geom_point(alpha = 0.5, size = 1) +
  scale_x_log10()

[Plot: sbp versus tobacco + 1, with a log10 x scale]

slide-86
SLIDE 86

Coordinates

This is the coordinate system in which the positions are to be plotted. We have already seen coord_flip() which swaps the x and y axes. There are many others; the aspect ratio, for example, is fixed using coord_fixed():

ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  coord_fixed(ratio = 1)

[Plot: adiposity versus obesity]

Here the aspect ratio (y over x) is fixed so that one unit on the y axis is drawn the same length as one unit on the x axis.

slide-87
SLIDE 87

Coordinates

This is the coordinate system in which the positions are to be plotted. We have already seen coord_flip() which swaps the x and y axes. There are many others; the aspect ratio, for example, is fixed using coord_fixed():

ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  coord_fixed(ratio = 0.5)

[Plot: adiposity versus obesity]

Here the aspect ratio (y over x) is fixed so that one unit on the y axis is drawn only half as long as one unit on the x axis.

slide-88
SLIDE 88

Coordinates

One coordinate system that is used is called coord_polar() which, contrary to what its name suggests, does not calculate a polar coordinate transformation but rather treats one of the two positions as defining an angle and the other as defining the radius.

ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  geom_smooth() +
  coord_polar(theta = "x")

[Plot: adiposity versus obesity in polar coordinates]

which, arguably, is a pretty weird plot but does demonstrate how coordinate systems are abstracted out as part of the grammar. Consequently coord_polar() should be used with considerable caution.
slide-89
SLIDE 89

Coordinates

Arguably overly complicated, one use of coord_polar() is to construct a pie chart. This is just a bar chart expressed using coord_polar(). First the bar chart

barChart <- ggplot(SAheart, aes(x = factor(1), fill = famhist)) +
  geom_bar(width = 1) +
  xlab("")
barChart

[Plot: a single stacked bar of count; fill legend famhist (Absent, Present)]

slide-90
SLIDE 90

Coordinates

Arguably overly complicated, one use of coord_polar() is to construct a pie chart. This is just a bar chart expressed using coord_polar(). Now the pie chart

barChart + coord_polar(theta = "y")

[Plot: pie chart of count; fill legend famhist (Absent, Present)]

Be careful with coord_polar() and bar charts; it is easy to produce some very silly pointless charts.

slide-91
SLIDE 91

Positions

A bar chart with two variates. Default position is “stack”

barChart2 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "stack") +
  xlab("chd")
barChart2

[Plot: count by chd, stacked bars; fill legend famhist (Absent, Present)]

which stacks the two colours in the same bar.

slide-92
SLIDE 92

Positions

To place the colours beside each other rather than stack them, change the position to “dodge”

barChart3 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "dodge") +
  xlab("chd")
barChart3

[Plot: count by chd, side-by-side bars; fill legend famhist (Absent, Present)]
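The difference between the two position adjustments shows up in the rectangle coordinates that ggplot2 computes for each bar. The following sketch assumes ggplot2 is installed and uses mtcars (cyl and am in place of chd and famhist, for illustration only).

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))
stacked <- layer_data(base + geom_bar(position = "stack"), 1)
dodged  <- layer_data(base + geom_bar(position = "dodge"), 1)

# Stacked: within each x the bars sit on top of one another,
# so the highest ymax at each x equals the total count for that x.
stack_tops <- tapply(stacked$ymax, as.numeric(stacked$x), max)

# Dodged: every bar starts at zero and is shifted sideways instead.
dodge_bases <- unique(dodged$ymin)
```

Here stack_tops reproduces table(mtcars$cyl), while dodge_bases contains only 0.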

slide-93
SLIDE 93

Statistical transformations - stat

These transformations often summarize data in some manner (e.g. by counting, by averaging, etc.). Some statistical functions operate “behind the scenes”:

◮ stat_count() in geom_bar()
◮ stat_bin() in geom_freqpoly(), geom_histogram()
◮ stat_bin2d() in geom_bin2d()
◮ stat_bindot() in geom_dotplot()
◮ stat_binhex() in geom_hex()
◮ stat_boxplot() in geom_boxplot()
◮ stat_contour() in geom_contour()
◮ stat_quantile() in geom_quantile()
◮ stat_smooth() in geom_smooth()
◮ stat_sum() in geom_count()

but may also be called directly (outside the geom_)
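As a hedged illustration of calling a stat directly, the sketch below builds the same bar layer two ways: through geom_bar(), whose default stat in current ggplot2 is the counting stat, and through stat_count() with the geom named explicitly. It assumes ggplot2 is installed and uses mtcars for the data.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = factor(cyl)))
via_geom <- base + geom_bar()                 # geom first, stat = "count" by default
via_stat <- base + stat_count(geom = "bar")   # stat first, geom named explicitly

# Both layers compute the same counts per category.
counts_geom <- layer_data(via_geom, 1)$count
counts_stat <- layer_data(via_stat, 1)$count
```

Which form to use is largely a matter of emphasis: geom_*() when the shape is the point, stat_*() when the summary is.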

slide-94
SLIDE 94

Statistical transformations - stat

Other stats have no corresponding geom_ function:

◮ stat_ecdf(): compute an empirical cumulative distribution plot.
◮ stat_function(): compute y values from a function of x values.
◮ stat_summary(): summarise y values at distinct x values.
◮ stat_summary2d(), stat_summary_hex(): summarise binned values.
◮ stat_qq(): perform calculations for a quantile-quantile plot.
◮ stat_spoke(): convert angle and radius to position.
◮ stat_unique(): remove duplicated rows.

slide-95
SLIDE 95

Statistical transformations - example

Adding some statistical summary information to the scatterplot p2

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  stat_summary(geom = "point", fun = "median",
               orientation = "x",
               col = "yellow", size = 2, pch = 19)

[Plot: sbp versus sqrt(age); fill legend factor(chd); size legend obesity]

Adds the median of the ys at each observed x.

slide-96
SLIDE 96

Statistical transformations - example

Alternatively use stat = "summary" in geom_point(). Also add connecting lines to the scatterplot p2

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_point(stat = "summary", fun = "median",
             orientation = "x",
             col = "yellow", size = 2, pch = 19) +
  stat_summary(geom = "line", fun = "median",
               orientation = "x",
               col = "yellow", size = 1, pch = 19)

[Plot: sbp versus sqrt(age); fill legend factor(chd); size legend obesity]

Adds the median of the ys at each observed x.
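What stat_summary() computes can be checked against a calculation by hand. This is a minimal sketch, assuming ggplot2 is installed; it uses mtcars (mpg summarised at each value of cyl) rather than SAheart, and passes the function median itself rather than its name.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point() +
  stat_summary(geom = "point", fun = median, col = "red", size = 3)

# Layer 2 is the summary layer: one row per distinct x value.
summ <- layer_data(p, 2)
summ <- summ[order(summ$x), ]

# The same medians computed directly from the data.
by_hand <- tapply(mtcars$mpg, mtcars$cyl, median)
```

The y column of summ agrees with by_hand, confirming that the stat reduces the y values to one summary per observed x.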

slide-97
SLIDE 97

Miscellaneous

◮ Can also facet in a matrix grid using facet_grid()
◮ position can also be “jitter” (best for scatterplots)
◮ there is a function called theme() which is how the appearance of all non-data plot components is changed.
  ◮ E.g. it is possible to turn that grey background grid off via theme() (though it seems a lot of work)
◮ there is a function qplot() or quickplot() which is more like a base graphics plot() call and so may be easier to use than following the ggplot2 grammar via ggplot() + ...
◮ ggsave() will save the most recent ggplot.

slide-98
SLIDE 98

Miscellaneous

Note: to plot time series (objects of class ts) you need the ggfortify package and then use autoplot().

library(ggfortify)
autoplot(sunspots)

[Plot: the sunspots time series]

Similarly, library(ggmap) for raster maps from get_map()

slide-99
SLIDE 99

Working with magrittr pipes

The grammar model of ggplot has + behaving much like a pipe in magrittr and can be used with the pipes of magrittr.

library(magrittr)
mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %>%
  ggplot(aes(x = wt, y = lp100k)) +
  geom_point(mapping = aes(col = vs)) +
  ylab("Litres per 100 kilometres") +
  ggtitle("Gas usage") -> p
p

[Plot: "Gas usage"; Litres per 100 kilometres versus wt; colour legend vs]

slide-100
SLIDE 100

Working with magrittr pipes

Note that unlike the base graphics plots, but like grid and loon plots, ggplots are structures that can be passed on with the pipes.

library(magrittr)
mtcars %>%
  subset(am == 0) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  data.frame(lp100k = 100/.$kpl) %>%
  ggplot(aes(x = wt, y = lp100k)) +
  geom_point(mapping = aes(col = vs)) +
  ylab("Litres per 100 kilometres") +
  ggtitle("Gas usage") %>% print
## $title
## [1] "Gas usage"
##
## attr(,"class")
## [1] "labels"

[Plot: "Gas usage"; Litres per 100 kilometres versus wt; colour legend vs]

Note that typically a ggplot data structure is not completely constructed until it has been printed (or forced).
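The printed output above shows only the title labels because of R's operator precedence: %>% binds more tightly than +, so print receives just the result of ggtitle(), not the whole plot. The sketch below, which assumes ggplot2 and magrittr are installed and simplifies the pipeline to the raw mtcars variables, shows the precedence on plain numbers and one way to pipe the finished plot instead.

```r
library(ggplot2)
library(magrittr)

# Parsed as 1 + (2 %>% multiply_by(10)), i.e. 1 + 20.
precedence_demo <- 1 + 2 %>% multiply_by(10)

# Parentheses around the whole "+" chain make the complete plot
# the left-hand side of the final pipe.
gas_plot <- (mtcars %>%
  subset(am == 0) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point() +
  ggtitle("Gas usage"))

gas_plot %>% print
```

Assigning the plot first and piping it afterwards, as in the last line, avoids the precedence question altogether.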

slide-101
SLIDE 101

Interactive ggplots via loon.ggplot package

Coming soon to CRAN!! ggplots can be made interactive using loon.ggplot() Github: “https://github.com/great-northern-diver/loon.ggplot”

slide-102
SLIDE 102

A different piping package pipeR

An alternative pipeline package is pipeR which has a single pipe operator %>>%

◮ simplifies syntax (no looking for the one or two symbols that are different between %)
◮ pipes to first argument
◮ pipes to dot . (handles formula as well)
◮ single piece of new syntax ~ allows s
◮ “tee pipe” by identifying pipe components as only for side effect
◮ allows assignment within the pipeline
◮ allows simple printing of comment strings

All in all, I think a better pipelining package. On CRAN and github. Github: “https://renkun-ken.github.io/pipeR/”
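A brief, hedged sketch of the first two behaviours listed for %>>% follows. It assumes the pipeR package is installed from CRAN and illustrates only piping to the first argument and piping to the dot inside parentheses; the choice of mtcars and of a linear model is purely for illustration.

```r
library(pipeR)

fit_coefs <- mtcars %>>%
  subset(am == 0) %>>%            # piped in as the first argument of subset()
  (lm(mpg ~ wt, data = .)) %>>%   # enclosed in ( ): piped to the dot
  coef                            # a bare function name also takes the value first
```

The same operator covers both cases; the enclosing parentheses, rather than a different %...% symbol, signal that the value goes to the dot.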