Data visualization with ggplot2, R.W. Oldford

slide-1
SLIDE 1

Data visualization with ggplot2

R.W. Oldford

slide-8
SLIDE 8

Computational pipelines

Have some function/module which takes some input, performs some actions on it (transformations, summarizing, adding information, etc.) and produces output.

If we have several of these, we can connect them one to another in sequence to produce a “pipeline” of modules or steps in the processing of the original input.

The connected components form a “pipeline” through which the original input “flows”, with some processing/transformation of the data occurring at each step.

slide-16
SLIDE 16

Computational pipelines

A simple metaphor (viz. that of laying pipes end to end):

  • data passes through and is processed by a set of computational steps serially linked so that the output of one becomes the input of the next
  • the Unix “|” operator is called a “pipe”: ls -R Notes | grep ".pdf" | sort | more
  • a graphics rendering pipeline (from Kaufman, Fan and Petkov (2009), Implementing the lattice Boltzmann model on commodity graphics hardware, J. Stat. Mech.)
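
The same serial linking can be sketched in R itself. This is only an illustration: the three helper functions and the file names are hypothetical, and the native pipe |> assumes R 4.1 or later.

```r
# Three small steps; each takes input and produces output.
# (Hypothetical helpers, named only for this sketch.)
keep_pdf   <- function(files) grep("\\.pdf$", files, value = TRUE)
sort_names <- function(files) sort(files)
first_n    <- function(files, n = 2) head(files, n)

# Made-up file names standing in for a directory listing.
files <- c("notes/b.pdf", "notes/a.txt", "notes/c.pdf", "notes/a.pdf")

# Nested calls must be read from the inside out ...
nested <- first_n(sort_names(keep_pdf(files)))

# ... whereas the pipe reads left to right, like the Unix example:
# the output of one step becomes the input of the next.
piped <- files |> keep_pdf() |> sort_names() |> first_n()

identical(nested, piped)
```

Both forms compute the same thing; the pipe only changes how the sequence of steps is written down.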

slide-17
SLIDE 17

Wilkinson’s Grammar of Graphics pipeline

Lee Wilkinson’s monumental The Grammar of Graphics begins with a pipeline model for constructing statistical graphics: Each step in the pipeline transforms its input to produce output for the next step. The order of steps is essential, though not all need be there for every plot. Because the pipeline consists of separate components, the final graphic that is rendered can be simply and sometimes dramatically changed by making changes to a single component in the pipeline.

slide-18
SLIDE 18

ggplot2 – a grammar of graphics for R

Inspired by Wilkinson’s “Grammar of Graphics”, Hadley Wickham (in his 2008 Iowa State PhD thesis: Practical tools for exploring data and models) developed a “layered grammar of graphics.” This is implemented as ggplot2 in R.

library(ggplot2)

Much like Wilkinson’s original grammar, ggplot2 uses a pipeline model for its graphics construction in that a plot is built in an ordered series of steps, where each step operates on the output of its immediate predecessor in the line. Departing from the grammar, ggplot2 slightly mixes metaphors in that each step in the pipeline can (typically) be thought of as adding a layer to all that preceded it. From the ggplot2 book:

"The layered grammar of graphics (Wickham 2009) builds on Wilkinson’s grammar, focussing on the primacy of layers and adapting it for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic."

Notationally, the components of the pipeline appear in sequence connected one to the next via an intervening + sign, thus emphasizing each as an addition of a layer (or of some further processing of the plot).
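
A minimal sketch of what the + sign does, assuming ggplot2 is installed. The built-in mtcars data is used as a stand-in (an assumption of this sketch, not the data used in these slides), and the object names are illustrative.

```r
library(ggplot2)

# mtcars is a stand-in dataset for this sketch only.
step0 <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))  # data and mapping
step1 <- step0 + geom_point()                                    # add one layer
step2 <- step1 + geom_smooth(method = "lm", formula = y ~ x)     # add another

# Every intermediate result is itself a plot object that can be printed,
# stored, or extended; each + here appends one layer.
n_layers <- c(length(step0$layers), length(step1$layers), length(step2$layers))
n_layers
```

Because each partial sum is a usable plot, a common base (such as step0) can be stored once and extended in different directions.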

slide-19
SLIDE 19

Data - South African heart disease

Consider the ‘SAheart’ data from the package ‘ElemStatLearn’. This is a sample from a retrospective study of heart disease in males from a high-risk region of the Western Cape, South Africa. There are 462 cases and 10 variates. The first few observations (cases) are shown below.

sbp  tobacco  ldl   adiposity  famhist  typea  obesity  alcohol  age  chd
160  12.00    5.73  23.11      Present  49     25.30    97.20    52   1
144   0.01    4.41  28.61      Absent   55     28.87     2.06    63   1
118   0.08    3.48  32.28      Present  52     29.14     3.81    46   0
170   7.50    6.41  38.03      Present  51     31.99    24.26    58   1
134  13.60    3.50  27.78      Present  60     25.99    57.34    49   1
132   6.20    6.47  36.21      Present  62     30.77    14.14    45   0

For example, sbp denotes “systolic blood pressure”, ldl “low density lipoprotein cholesterol”, famhist “family history of heart disease”, age “age at onset” (in years), and chd indicates whether the patient has coronary heart disease or not (a response).

(see help(SAheart, package="ElemStatLearn") for details)

slide-20
SLIDE 20

Constructing a plot - the pipeline

In the grammar of graphics, a plot processes each component in turn

ggplot(data = SAheart)

First the data

slide-21
SLIDE 21

Constructing a plot - pipeline

In the grammar of graphics, a plot processes each component in turn

ggplot(data = SAheart) +
  aes(x = age, y = chd)

[Figure: empty plot panel; x axis: age, y axis: chd]

Then the mapping of the data to plot “aesthetics”

slide-22
SLIDE 22

Constructing a plot - pipeline

In the grammar of graphics, a plot processes each component in turn

ggplot(data = SAheart) +
  aes(x = age, y = chd) +
  geom_point()

[Figure: scatterplot of chd against age]

Then the geometry.

slide-23
SLIDE 23

Constructing a plot - pipeline

In the grammar of graphics, a plot processes each component in turn

ggplot(data = SAheart) +
  aes(x = age, y = chd) +
  geom_point() +
  geom_smooth()

[Figure: scatterplot of chd against age with a smooth added]

Which can have several further steps in the pipeline

slide-24
SLIDE 24

Constructing a plot

Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.

ggplot(data = SAheart, mapping = aes(x = age, y = chd))

[Figure: empty plot panel; x axis: age, y axis: chd]

The base display with mapping.

slide-25
SLIDE 25

Constructing a plot

Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.

ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
  geom_point()

[Figure: scatterplot of chd against age]

Here the + is adding layers.

slide-26
SLIDE 26

Constructing a plot

Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.

ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
  geom_point() +
  geom_smooth()

[Figure: scatterplot of chd against age with a smooth added]

Here the + is adding layers.

slide-27
SLIDE 27

Constructing a plot - separate mappings

Alternatively, we could deliberately associate only the data with the plot, forcing the mapping of the data to aesthetics within each individual component layer:

ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd))

[Figure: scatterplot of chd against age]

The mapping is explicit for each layer.

slide-30
SLIDE 30

Constructing a plot - separate mappings

What would the following plot look like?

ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd)) +
  geom_smooth()

It fails . . . why? How could it be fixed?

Cautionary note: the ggplot2 grammar mixes the two metaphors of “layers” and “pipes”. Just because an aesthetic precedes a component in the pipeline does not mean that it is available for use.
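
The failure can be reproduced without drawing anything. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in for SAheart (an assumption made so that it runs without ElemStatLearn); method = "lm" is chosen only to keep the smooth quiet and fast.

```r
library(ggplot2)

# The mapping lives inside geom_point() only, so the smooth layer
# has no x or y to work with. (mtcars is a stand-in dataset.)
bad <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +
  geom_smooth(method = "lm", formula = y ~ x)

# ggplot_build() does the computation that printing would do.
build_bad <- try(ggplot_build(bad), silent = TRUE)
fails <- inherits(build_bad, "try-error")

# Putting the mapping in ggplot() makes it the default for every layer.
good <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

build_good <- try(ggplot_build(good), silent = TRUE)
works <- !inherits(build_good, "try-error")

c(bad_fails = fails, good_works = works)
```

A mapping given to one layer belongs to that layer alone; only mappings attached to the plot itself are shared.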

slide-31
SLIDE 31

Constructing a plot - separate mappings

Solution 1: explicitly, give the mapping for each layer:

ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd)) +
  geom_smooth(mapping = aes(x = age, y = chd))

[Figure: scatterplot of chd against age with a smooth added]

slide-32
SLIDE 32

Constructing a plot - separate mappings

Solution 2: provide aesthetics upstream:

ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd)) +
  aes(x = age, y = chd) +
  geom_smooth()

[Figure: scatterplot of chd against age with a smooth added]

slide-33
SLIDE 33

Constructing a plot - separate mappings

ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd, col = famhist)) +
  geom_smooth(mapping = aes(x = age, y = chd))

[Figure: chd against age; points coloured by famhist (Absent, Present), with a single smooth]

slide-34
SLIDE 34

Constructing a plot - shared and separate mappings

ggplot(data = SAheart) +
  aes(group = famhist) +
  geom_point(mapping = aes(x = age, y = chd)) +
  geom_smooth(mapping = aes(x = age, y = chd))

[Figure: chd against age; points with a separate smooth for each famhist group]

slide-35
SLIDE 35

Constructing a plot - shared and separate mappings

ggplot(data = SAheart, mapping = aes(group = famhist)) +
  geom_point(mapping = aes(x = age, y = chd, col = famhist)) +
  geom_smooth(mapping = aes(x = age, y = chd))

[Figure: chd against age; points coloured by famhist (Absent, Present), with a separate smooth for each famhist group]

slide-36
SLIDE 36

Constructing a plot - shared and separate mappings

ggplot(data = SAheart, mapping = aes(group = famhist, col = famhist)) +
  geom_point(mapping = aes(x = age, y = chd)) +
  geom_smooth(mapping = aes(x = age, y = chd))

[Figure: chd against age; points and smooths both coloured by famhist (Absent, Present)]
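
The difference between a shared and a separate mapping can be checked from the computed layer data. This is a sketch under assumptions: mtcars stands in for SAheart, factor(cyl) plays the role of famhist, and method = "lm" is used only to keep the example quiet.

```r
library(ggplot2)

# Colour mapped once, in ggplot(): every layer inherits it.
shared <- ggplot(mtcars, aes(colour = factor(cyl))) +
  geom_point(aes(x = wt, y = mpg)) +
  geom_smooth(aes(x = wt, y = mpg), method = "lm", formula = y ~ x)

# Colour mapped only inside geom_point(): the smooth does not see it.
separate <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_smooth(aes(x = wt, y = mpg), method = "lm", formula = y ~ x)

# layer_data(plot, i) returns the data computed for layer i;
# layer 2 is the smooth in both plots.
n_smooths <- c(
  shared   = length(unique(layer_data(shared, 2)$group)),
  separate = length(unique(layer_data(separate, 2)$group))
)
n_smooths
```

With the shared mapping the smooth is fitted once per colour group; with the separate mapping there is a single smooth through all of the points.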

slide-37
SLIDE 37

Constructing a plot

Alternatively, we could split the plot into two pieces by facetting:

ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
  geom_point(col = "steelblue", size = 3, alpha = 0.4) +
  geom_smooth(method = "loess", col = "steelblue") +
  facet_wrap(~famhist)

[Figure: chd against age with a loess smooth, in two facets: famhist Absent and famhist Present]

slide-38
SLIDE 38

Components of the layered grammar

In the grammar of ggplot2, a plot is a structured combination of:

◮ a dataset,
◮ a set of mappings from variates to aesthetics,
◮ one or more layers, each composed of
    ◮ a geometric object,
    ◮ a statistical transformation,
    ◮ a position adjustment, and
    ◮ (optionally) its own dataset and aesthetic mappings
◮ a scale for each aesthetic mapping,
◮ a coordinate system,
◮ a facetting specification
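
The composition of a layer can be made explicit with ggplot2's layer() function, which takes the geom, stat and position by name. A minimal sketch, assuming ggplot2 is installed and again using mtcars as a stand-in dataset:

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))  # stand-in data and mapping

# The usual shorthand ...
short <- base + geom_point()

# ... and the same layer spelled out component by component.
long <- base + layer(geom = "point", stat = "identity", position = "identity")

# Both layers compute the same positions.
same_xy <- isTRUE(all.equal(
  layer_data(short)[, c("x", "y")],
  layer_data(long)[, c("x", "y")]
))
same_xy
```

The geom_*() and stat_*() functions are convenient wrappers that fill in sensible defaults for the components not named.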

slide-39
SLIDE 39

Geometric objects

There are a variety of geometric objects that can be added to a plot

◮ geom_abline(), geom_hline(), geom_vline(), geom_curve(), geom_segment(), geom_step()
◮ geom_label(), geom_text()
◮ geom_point(), geom_smooth(), geom_crossbar(), geom_errorbar(), geom_errorbarh(), geom_linerange(), geom_pointrange()
◮ geom_rect(), geom_raster(), geom_area(), geom_ribbon(), geom_tile()
◮ geom_bar(), geom_col()
◮ geom_dotplot(), geom_boxplot(), geom_histogram(), geom_freqpoly(), geom_density(), geom_violin(), geom_quantile(), geom_qq()
◮ geom_bin2d(), geom_density2d(), geom_hex()
◮ geom_contour()
◮ geom_map(), geom_polygon()

Each of these has its own arguments, including mapping, data, stat, et cetera.

slide-40
SLIDE 40

Geometric objects - adding to plots

Beginning with a plot, different geometric objects may be added. For example:

p <- ggplot(data = SAheart, mapping = aes(x = tobacco, y = sbp))
p

[Figure: empty plot panel; x axis: tobacco, y axis: sbp]

slide-41
SLIDE 41

Geometric objects - points and density

Beginning with a plot, different geometric objects may be added. For example:

p + geom_point() +
  geom_density_2d(lwd = 1.5, col = "steelblue")

[Figure: scatterplot of sbp against tobacco overlaid with 2d density contours]

slide-42
SLIDE 42

Geometric objects - histogram

h <- ggplot(data = SAheart, mapping = aes(x = adiposity))
h + geom_histogram(bins = 10, fill = "steelblue", col = "black", alpha = 0.5)

[Figure: histogram of adiposity; y axis: count]

Note that had we tried to layer the histogram on top of p, it would have inherited from p a y aesthetic (namely y = sbp) which does not make sense for a histogram.

slide-43
SLIDE 43

Geometric objects - histogram

h + geom_histogram(mapping = aes(y = ..density..), bins = 10,
                   fill = "steelblue", col = "black", alpha = 0.5)

[Figure: histogram of adiposity; y axis: density]

A y aesthetic that does make sense for a histogram is ..density.. which forces the scaling of the vertical axis so that the histogram has unit area. Note that the x aesthetic was inherited from h.
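
The unit-area claim can be checked numerically. In this sketch mtcars stands in for SAheart, and the density is requested with after_stat(density), which assumes ggplot2 3.3.0 or later (there it plays the same role as ..density..).

```r
library(ggplot2)

# Density-scale histogram of a stand-in variable (mpg from mtcars).
hist_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 10)

# layer_data() exposes what stat_bin() computed for each bar.
bars <- layer_data(hist_plot)

# Area of the histogram = sum over bars of height times width.
area <- sum(bars$density * (bars$xmax - bars$xmin))
area
```

The total area comes out as 1 (up to floating point rounding), which is what places the histogram on the same vertical scale as a density estimate.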

slide-44
SLIDE 44

Geometric objects - density scale histogram

Provided we supply a y aesthetic mapping, a histogram could therefore be added to p as well.

p + geom_histogram(mapping = aes(x = adiposity, y = ..density..), bins = 10,
                   fill = "steelblue", col = "black", alpha = 0.5)

[Figure: density-scale histogram of adiposity drawn on the plot p; the axis labels still read tobacco (x) and sbp (y)]

Note:

◮ the change in vertical scale matches the histogram
◮ the axis labels do not match the aesthetics of the histogram (though the tick marks and values happen to)

Because this is only a grammar, it is as easy to make silly visualizations as it is silly sentences.

slide-45
SLIDE 45

Geometric objects - layering effect

The order of layering (on top of h now) matters:

h + geom_histogram(mapping = aes(y = ..density..), bins = 10,
                   fill = "steelblue", col = "black", alpha = 0.5) +
  geom_density(mapping = aes(y = ..density..), fill = "grey", alpha = 0.5)

[Figure: density-scale histogram of adiposity with a grey density estimate drawn on top]

Note that the y aesthetic had to be repeated here . . .

slide-46
SLIDE 46

Geometric objects - layering effect

Switch the order of addition:

h + geom_density(mapping = aes(y = ..density..), fill = "grey", alpha = 0.5) +
  geom_histogram(mapping = aes(y = ..density..), bins = 10,
                 fill = "steelblue", col = "black", alpha = 0.5)

[Figure: grey density estimate of adiposity with the density-scale histogram drawn on top]

Note that the aesthetics need to be repeated here . . .

slide-47
SLIDE 47

Geometric objects - bar charts

ggplot(SAheart) +
  geom_bar(mapping = aes(x = factor(chd), fill = famhist)) +
  labs(x = "chd", title = "South African heart disease") +
  coord_flip()

[Figure: horizontal stacked bar chart titled “South African heart disease”; one bar per chd value, bar length is count, fill is famhist (Absent, Present)]

Which makes you wonder how the data were collected.

slide-48
SLIDE 48

Geometric objects

A different scatterplot

p2 <- ggplot(data = SAheart, mapping = aes(x = sqrt(age), y = sbp))
p2 + geom_point()

[Figure: scatterplot of sbp against sqrt(age)]

slide-49
SLIDE 49

Geometric objects

Note that each geometric object has its own arguments and properties that can be set.

p2 + geom_point(col = "red", size = 3, pch = 21, fill = "yellow", alpha = 0.5) +
  geom_smooth(method = "loess", col = "steelblue", lty = 2, lwd = 1.5, alpha = 0.2)

[Figure: scatterplot of sbp against sqrt(age) with a dashed loess smooth]

slide-50
SLIDE 50

Geometric objects

Aesthetics apply to every point individually.

p2 + geom_point(mapping = aes(size = obesity),
                fill = "steelblue", col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(method = "loess", col = "yellow", lty = 2, lwd = 1.5, alpha = 0.2)

[Figure: scatterplot of sbp against sqrt(age) with a loess smooth; point size shows obesity (size legend)]

slide-51
SLIDE 51

Geometric objects

Aesthetics apply to every point individually.

p2 + geom_point(mapping = aes(size = obesity, fill = tobacco),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(method = "loess", col = "yellow", lty = 2, lwd = 1.5, alpha = 0.2)

[Figure: scatterplot of sbp against sqrt(age) with a loess smooth; point size shows obesity and point fill shows tobacco (two legends)]
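
The distinction at work here is between mapping an aesthetic inside aes(), where it varies point by point with the data, and setting it outside aes(), where it is one constant for the whole layer. A sketch with mtcars as a stand-in dataset (hp plays the role of obesity):

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Mapped: size is computed per point from the hp column (through a scale).
mapped <- base + geom_point(mapping = aes(size = hp))

# Set: size is the single constant 3 for every point.
set_const <- base + geom_point(size = 3)

sizes_mapped <- layer_data(mapped)$size
sizes_set    <- layer_data(set_const)$size

c(distinct_mapped = length(unique(sizes_mapped)),
  distinct_set    = length(unique(sizes_set)))
```

Only mapped aesthetics get a scale and a legend; values that are set are used as given.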

slide-52
SLIDE 52

Geometric objects

The data may change with each layer

heartAttack <- SAheart[, "chd"] == 1
hAplot <- p2 +
  geom_point(data = SAheart[heartAttack, ],
             mapping = aes(size = obesity),
             alpha = 0.4, col = "black", pch = 21, fill = "steelblue")
hAplot

[Figure: sbp against sqrt(age) for the chd == 1 cases only; point size shows obesity]

slide-53
SLIDE 53

Geometric objects

The data may change with each layer

qboth <- hAplot +
  geom_point(data = SAheart[!heartAttack, ],  # Not heartAttack
             mapping = aes(size = obesity),
             alpha = 0.4, col = "black", pch = 21, fill = "firebrick")
qboth

[Figure: sbp against sqrt(age) for all cases; steelblue points for chd == 1, firebrick points for the rest; point size shows obesity]

slide-54
SLIDE 54

Geometric objects

The data may change with each layer

qboth +
  geom_smooth(data = SAheart[heartAttack, ], method = "loess",
              col = "steelblue", alpha = 0.4) +
  geom_smooth(data = SAheart[!heartAttack, ], method = "loess",
              col = "firebrick", alpha = 0.4)

[Figure: the qboth scatterplot with a separate loess smooth for each of the two groups; point size shows obesity]

slide-55
SLIDE 55

Geometric objects

The data may change with each layer

qboth + geom_smooth(method = "loess")

[Figure: the qboth scatterplot with a single loess smooth; point size shows obesity]

Note smooth is using all of the data here.
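
How the data are chosen for each layer can be verified by counting rows. In this sketch mtcars is a stand-in dataset and the logical vector manual is a hypothetical analogue of heartAttack.

```r
library(ggplot2)

manual <- mtcars$am == 1          # hypothetical subset indicator

q <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(data = mtcars[manual, ], col = "steelblue") +  # own data
  geom_point(pch = 1, size = 3)                             # inherits plot data

# Rows actually used by each layer.
rows_per_layer <- c(nrow(layer_data(q, 1)), nrow(layer_data(q, 2)))
rows_per_layer
```

A layer given its own data argument uses only that data; a layer without one falls back on the data supplied to ggplot(), which is why the smooth above uses all of the cases.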

slide-56
SLIDE 56

Geometric objects

The data may change with each layer

qboth + geom_smooth(mapping = aes(color = factor(chd)), method = "loess")

[Figure: the qboth scatterplot with one loess smooth per chd value; colour legend factor(chd) (0, 1) and size legend obesity]

Here the smooth is separate for each colour given by chd as factor. Note ggplot’s default colours.

slide-57
SLIDE 57

Geometric objects

The colours can be coordinated by relying on the original data and using chd as a factor:

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2)

[Figure: sbp against sqrt(age); point fill and smooth colour both given by factor(chd) (0, 1); point size shows obesity]

Here the smooth is separate for each colour given by chd as factor.

slide-58
SLIDE 58

Scales

A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, . . . ).

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick"))

[Figure: sbp against sqrt(age); steelblue and firebrick used for chd (0, 1) in both points and smooths; point size shows obesity]

. . . gets your own “scale” values for colour and for fill.

slide-59
SLIDE 59

Scales

A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, . . . ).

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick")) +
  scale_size("obesity", breaks = seq(0, 100, 5))

[Figure: as before, with the obesity size legend now showing breaks at 15, 20, 25, 30, 35, 40, 45]

. . . additionally gets your own “scale” values for point size (which is proportional to area).

slide-60
SLIDE 60

Scales

A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, . . . ).

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick")) +
  scale_size_area("obesity", breaks = seq(0, 100, 5))

[Figure: as before, with point areas scaled by scale_size_area(); obesity size legend with breaks at 15, 20, 25, 30, 35, 40, 45]

. . . Now a zero value gives a zero area.
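
A scale's effect can be read off the computed layer data: the data values go in and the aesthetic values come out. A sketch, with mtcars as a stand-in dataset and factor(am) in the role of factor(chd):

```r
library(ggplot2)

scaled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(mapping = aes(fill = factor(am)), pch = 21, size = 3) +
  scale_fill_manual("am", values = c("steelblue", "firebrick"))

# The fill column holds the colours the scale assigned to each point.
fills <- unique(layer_data(scaled)$fill)
fills
```

The two levels of the factor are mapped, in order, to the two colours supplied in values.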

slide-61
SLIDE 61

Position scales

There are two position scales: horizontal (x) and vertical (y).

p + geom_point(alpha = 0.5, size = 1) +
  scale_x_continuous(limits = c(0, 40)) +
  scale_y_continuous(limits = c(75, 225))

[Figure: scatterplot of sbp against tobacco; x axis from 0 to 40, y axis from 75 to 225]

slide-62
SLIDE 62

Position scales

There are two position scales: horizontal (x) and vertical (y).

p + geom_point(alpha = 0.5, size = 1) +
  xlim(0, 40) +
  ylim(75, 225)

[Figure: scatterplot of sbp against tobacco; x axis from 0 to 40, y axis from 75 to 225]

slide-63
SLIDE 63

Position scales

There are two position scales: horizontal (x) and vertical (y).

p + aes(x = tobacco + 1) +
  geom_point(alpha = 0.5, size = 1) +
  scale_x_log10()

[Figure: scatterplot of sbp against tobacco + 1 on a log10 horizontal scale]

slide-64
SLIDE 64

Coordinates

This is the coordinate system in which the positions are to be plotted. We have already seen coord_flip() which swaps the x and y axes. There are many others; the aspect ratio, for example, is fixed using coord_fixed():

ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  coord_fixed(ratio = 1)

[Figure: scatterplot of adiposity against obesity with a fixed aspect ratio of 1]

Here the aspect ratio is fixed so that one unit on the y axis is drawn the same length as one unit on the x axis.

slide-65
SLIDE 65

Coordinates

This is the coordinate system in which the positions are to be plotted. We have already seen coord_flip() which swaps the x and y axes. There are many others; the aspect ratio, for example, is fixed using coord_fixed():

ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  coord_fixed(ratio = 0.5)

[Figure: scatterplot of adiposity against obesity with a fixed aspect ratio of 0.5]

Here the aspect ratio is fixed so that one unit on the y axis is drawn half as long as one unit on the x axis.

slide-66
SLIDE 66

Coordinates

One coordinate system that is used is called coord_polar() which, contrary to what its name suggests, does not calculate a polar coordinate transformation but rather treats one of the two positions as defining an angle and the other as defining the radius.

ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  geom_smooth() +
  coord_polar(theta = "x")

[Figure: adiposity against obesity with a smooth, drawn with obesity as the angle and adiposity as the radius]

which, arguably, is a pretty weird plot but does demonstrate how coordinate systems are abstracted out as part of the grammar. Consequently coord_polar() should be used with considerable caution.

slide-67
SLIDE 67

Coordinates

Arguably overly complicated, one use of coord_polar() is to construct a pie chart. This is just a bar chart expressed using coord_polar(). First the bar chart

barChart <- ggplot(SAheart, aes(x = factor(1), fill = famhist)) +
  geom_bar(width = 1) +
  xlab("")
barChart

[Figure: a single stacked bar; height is count, fill is famhist (Absent, Present)]

slide-68
SLIDE 68

Coordinates

Arguably overly complicated, one use of coord_polar() is to construct a pie chart. This is just a bar chart expressed using coord_polar(). Now the pie chart

barChart + coord_polar(theta = "y")

[Figure: pie chart of count by famhist (Absent, Present)]
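
That the pie chart really is the same bar chart can be seen by comparing what each plot computes. A sketch, with mtcars as a stand-in dataset and factor(am) in the role of famhist:

```r
library(ggplot2)

bar <- ggplot(mtcars, aes(x = factor(1), fill = factor(am))) +
  geom_bar(width = 1) +
  xlab("")

pie <- bar + coord_polar(theta = "y")

# The counted data behind the two plots are identical;
# only the coordinate system used to draw them differs.
same_counts <- identical(layer_data(bar)$count, layer_data(pie)$count)
same_counts
```

With theta = "y" the stacked count becomes the angle, so each bar segment turns into a wedge.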

slide-70
SLIDE 70

Coordinates

What’s going on here?

barChart + coord_polar(theta = "x")

[Figure: the bar chart drawn with x as the angle; count is the radius, fill is famhist (Absent, Present)]

Perhaps a little “too clever by half”?

Be careful with coord_polar(); it is easy to produce a plot that is very difficult to interpret.

slide-71
SLIDE 71

Positions

A bar chart with two variates. Default position is “stack”

barChart2 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "stack") +
  xlab("chd")
barChart2

[Figure: stacked bar chart; one bar per chd value, height is count, fill is famhist (Absent, Present)]

which stacks the two colours in the same bar.

slide-73
SLIDE 73

Positions

What should this look like?

barChart2 + coord_polar(theta = "y")

[Figure: concentric rings, one per chd value; arc length is count, fill is famhist (Absent, Present)]

Thickness is the bar width, each ring is chd, arc length is count. Again, coord_polar() can be confusing.

slide-75
SLIDE 75

Positions

What should this look like?

barChart2 + coord_polar(theta = "x")

[Figure: wedges, one per chd value; radial extent is count, fill is famhist (Absent, Present)]

Angle is now the bar width, each wedge is chd, thickness is count. Again, coord_polar() can be confusing.

slide-76
SLIDE 76

Positions

To place the colours beside each other rather than stack them, change the position to “dodge”

barChart3 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "dodge") +
  xlab("chd")
barChart3

[Figure: dodged bar chart; for each chd value, two bars side by side, one per famhist (Absent, Present); height is count]

which places the two colours side by side rather than in the same bar.
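
The effect of the position adjustment shows up in where the bars start. In this sketch mtcars is a stand-in dataset, with factor(am) in the role of chd and factor(vs) in the role of famhist.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = factor(am), fill = factor(vs)))

stacked <- base + geom_bar(position = "stack")
dodged  <- base + geom_bar(position = "dodge")

# Count the distinct left edges (xmin) of the drawn rectangles.
n_left_edges <- c(
  stack = length(unique(layer_data(stacked)$xmin)),
  dodge = length(unique(layer_data(dodged)$xmin))
)
n_left_edges
```

Stacked rectangles share a left edge within each x value (two distinct edges here), whereas dodged rectangles each get their own (four here).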

slide-78
SLIDE 78

Positions

What should this look like?

barChart3 + coord_polar(theta = "y")

[Figure: barChart3 drawn with coord_polar(theta = "y"); count is the angle, chd the radial position, fill is famhist (Absent, Present)]

Explain.

slide-80
SLIDE 80

Positions

Now what should this look like?

barChart3 + coord_polar(theta = "x")

[Figure: barChart3 drawn with coord_polar(theta = "x"); chd is the angle, count the radius, fill is famhist (Absent, Present)]

Explain.

slide-81
SLIDE 81

Positions and facets

A bar chart with two variates. Use facets

barChart4 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "stack") +
  xlab("chd") +
  facet_wrap(~chd)
barChart4

[Figure: stacked bar chart of count by chd, filled by famhist (Absent, Present), with one facet per chd value]

Exercise: What should barChart4 + coord_polar(theta = "y") look like? What about barChart4 + coord_polar(theta = "x")?

slide-82
SLIDE 82

Positions and facets

A bar chart with two variates. Use facets

barChart5 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "dodge") +
  xlab("chd") +
  facet_wrap(~chd)
barChart5

[Figure: dodged bar chart of count by chd, filled by famhist (Absent, Present), with one facet per chd value]

Exercise: What should barChart5 + coord_polar(theta = "y") look like? What about barChart5 + coord_polar(theta = "x")?

slide-83
SLIDE 83

Positions and facets

A bar chart with two variates. Use both variates as facets

barChart6 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "dodge") +
  xlab("chd") +
  facet_wrap(famhist~chd)
barChart6

[Figure: bar chart of count by chd, filled by famhist (Absent, Present), with one facet per combination of famhist and chd]

Exercise: What should barChart6 + coord_polar(theta = "y") look like? What about barChart6 + coord_polar(theta = "x")?

slide-84
SLIDE 84

Statistical transformations - stat

These transformations often summarize data in some manner (e.g. by counting, by averaging, etc.). Some statistical functions operate “behind the scenes”:

◮ stat_count() in geom_bar()
◮ stat_bin() in geom_freqpoly(), geom_histogram()
◮ stat_bin2d() in geom_bin2d()
◮ stat_bindot() in geom_dotplot()
◮ stat_binhex() in geom_hex()
◮ stat_boxplot() in geom_boxplot()
◮ stat_contour() in geom_contour()
◮ stat_quantile() in geom_quantile()
◮ stat_smooth() in geom_smooth()
◮ stat_sum() in geom_count()

but may also be called directly (outside the geom_)

slide-85
SLIDE 85

Statistical transformations - stat

Other stats have no corresponding geom_ function:

◮ stat_ecdf(): compute an empirical cumulative distribution plot.
◮ stat_function(): compute y values from a function of x values.
◮ stat_summary(): summarise y values at distinct x values.
◮ stat_summary2d(), stat_summary_hex(): summarise binned values.
◮ stat_qq(): perform calculations for a quantile-quantile plot.
◮ stat_spoke(): convert angle and radius to position.
◮ stat_unique(): remove duplicated rows.

slide-86
SLIDE 86

Statistical transformations - example

Adding some statistical summary information to the scatterplot p2

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  stat_summary(geom = "point", fun.y = "median",
               col = "yellow", size = 2, pch = 19)

[Figure: sbp against sqrt(age); point fill by factor(chd) (0, 1), point size by obesity, with yellow points marking the medians]

Adds the median of the ys at each observed x.

slide-87
SLIDE 87

Statistical transformations - example

Alternatively use stat = "summary" in geom_point(). Also add connecting lines to the scatterplot p2

p2 + geom_point(mapping = aes(size = obesity, fill = factor(chd)),
                col = "black", pch = 21, alpha = 0.4) +
  geom_point(stat = "summary", fun.y = "median",
             col = "yellow", size = 2, pch = 19) +
  stat_summary(geom = "line", fun.y = "median",
               col = "yellow", size = 1, pch = 19)

[Figure: sbp against sqrt(age); point fill by factor(chd) (0, 1), point size by obesity, with yellow median points joined by a yellow line]

Adds the median of the ys at each observed x.
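
What stat_summary() computes can be compared with a summary done by hand. This sketch makes two assumptions: mtcars stands in for SAheart, and the summary function is passed through the fun argument, which requires ggplot2 3.3.0 or later (the slides use the older fun.y spelling).

```r
library(ggplot2)

med_plot <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point(alpha = 0.4) +
  stat_summary(geom = "point", fun = median, col = "firebrick", size = 3)

# Layer 2 holds one row per distinct x, with y the summary value.
drawn <- layer_data(med_plot, 2)
drawn <- drawn[order(drawn$x), ]

# The same medians computed directly from the data.
by_hand <- tapply(mtcars$mpg, mtcars$cyl, median)

cbind(x = drawn$x, stat_summary = drawn$y, by_hand = as.numeric(by_hand))
```

The stat reduces the y values at each distinct x to a single number, and the geom named in geom = decides how that number is drawn.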

slide-88
SLIDE 88

Miscellaneous

◮ Can also facet in a matrix grid using facet_grid()
◮ position can also be “jitter” (best for scatterplots)
◮ there is a function called theme() which is how the appearance of all non-data plot components is changed.
    ◮ E.g. it is possible to turn that grey background grid off via theme() (though it seems a lot of work)
◮ there is a function qplot() or quickplot() which is more like a base graphics plot() call and so may be easier to use than following the ggplot2 grammar via ggplot() + ...
◮ ggsave() will save the most recent ggplot.
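
A short sketch of the last two points, under assumptions: mtcars is a stand-in dataset, the output file name is hypothetical, and the plot is written to R's temporary directory as a PDF.

```r
library(ggplot2)

# Turn the grey background and its grid off via theme().
plain <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme(panel.background = element_blank(),
        panel.grid       = element_blank())

# Hypothetical output location, chosen only for this sketch.
out_file <- file.path(tempdir(), "plain_scatter.pdf")

# Passing the plot explicitly avoids relying on "the most recent" plot.
ggsave(filename = out_file, plot = plain, width = 5, height = 4)

file.exists(out_file)
```

When plot is omitted, ggsave() falls back on the last plot displayed; naming the plot is the safer habit in scripts.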

slide-89
SLIDE 89

Miscellaneous

Note: to plot time series (objects of class ts) you need the ggfortify package and then use autoplot().

library(ggfortify)
autoplot(sunspots)

[Figure: time series plot of sunspots, roughly 1750 to 1980]

Similarly, library(ggmap) for raster maps from get_map()