Data visualization with ggplot2
R.W. Oldford
Computational pipelines
Suppose we have some function/module that takes some input, performs some actions on it (transformations, summarizing, adding information, etc.) and produces output.
If we have several of these, we can connect them one to another in sequence to produce a “pipeline” of modules or steps in the processing of the original input.
The connected components form a “pipeline” through which the original input “flows”, with some processing/transformation of the data occurring at each step.
A simple metaphor (viz. that of laying pipes end to end):
- data passes through and is processed by a set of computational steps serially linked so that the output of one becomes the input of the next
- the Unix operator | is called a “pipe”: ls -R Notes | grep ".pdf" | sort | more
- a graphics rendering pipeline (from Kaufman, Fan and Petkov (2009), Implementing the lattice Boltzmann model on commodity graphics hardware, J. Stat. Mech.)
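The same idea can be sketched in a few lines of base R. This is only an illustration: the step names and the helper run_pipeline() are hypothetical and not part of any package.

```r
# Each step is an ordinary function: one input, one output.
# (The step names are made up for this illustration.)
steps <- list(
  take_roots   = sqrt,
  round_off    = function(x) round(x, 1),
  put_in_order = sort
)

# Hypothetical helper: feed the output of each step into the next, left to right.
run_pipeline <- function(input, steps) {
  Reduce(function(value, step) step(value), steps, input)
}

result <- run_pipeline(c(16, 4, 9), steps)
print(result)
```

Swapping, dropping or replacing a single element of steps changes the final result without touching the other steps, which is the property the pipeline metaphor is meant to convey.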
Wilkinson’s Grammar of Graphics pipeline
Lee Wilkinson’s monumental The Grammar of Graphics begins with a pipeline model for constructing statistical graphics.
Each step in the pipeline transforms its input to produce output for the next step. The order of steps is essential, though not all need be there for every plot.
Because the pipeline consists of separate components, the final graphic that is rendered can be simply and sometimes dramatically changed by making changes to a single component in the pipeline.
ggplot2 – a grammar of graphics for R
Inspired by Wilkinson’s “Grammar of Graphics”, Hadley Wickham (in his 2008 Iowa State PhD thesis: Practical tools for exploring data and models) developed a “layered grammar of graphics.” This is implemented as ggplot2 in R.
library(ggplot2)
Much like Wilkinson’s original grammar, ggplot2 uses a pipeline model for its graphics construction in that a plot is built in an ordered series of steps, where each step operates on the output of its immediate predecessor in the line. Departing from the grammar, ggplot2 slightly mixes metaphors in that each step in the pipeline can (typically) be thought of as adding a layer to all that preceded it. From the ggplot2 book:
"The layered grammar of graphics (Wickham 2009) builds on Wilkinson’s grammar, focussing on the primacy of layers and adapting it for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that make up a graphic."
Notationally, the components of the pipeline appear in sequence connected one to the next via an intervening + sign, thus emphasizing each as an addition of a layer (or of some further processing of the plot).
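A minimal sketch of the + notation at work, using a small made-up data frame (toy is a hypothetical stand-in, not the data used in these slides). It assumes the plot object exposes its layers as the list element layers, which holds for the ggplot2 versions this sketch was written against.

```r
library(ggplot2)

# Hypothetical stand-in data, for illustration only.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

base_plot  <- ggplot(data = toy, mapping = aes(x = x, y = y))  # data and mapping, no layers
one_layer  <- base_plot + geom_point()                         # + adds a first layer
two_layers <- one_layer + geom_line()                          # + adds a second layer

# Each intermediate object is itself a plot that can be kept, printed or extended.
n_layers <- c(length(base_plot$layers),
              length(one_layer$layers),
              length(two_layers$layers))
print(n_layers)
```

Nothing is drawn until a plot object is printed, so intermediate objects such as base_plot can be reused as the starting point for several different plots.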
Data - South African heart disease
Consider the SAheart data from the package ElemStatLearn. This is a sample from a retrospective study of heart disease in males from a high-risk region of the Western Cape, South Africa. There are 462 cases and 10 variates. The first few observations (cases) are shown below.
sbp  tobacco  ldl   adiposity  famhist  typea  obesity  alcohol  age  chd
160  12.00    5.73  23.11      Present  49     25.30    97.20    52   1
144   0.01    4.41  28.61      Absent   55     28.87     2.06    63   1
118   0.08    3.48  32.28      Present  52     29.14     3.81    46   0
170   7.50    6.41  38.03      Present  51     31.99    24.26    58   1
134  13.60    3.50  27.78      Present  60     25.99    57.34    49   1
132   6.20    6.47  36.21      Present  62     30.77    14.14    45   0
For example, sbp denotes “systolic blood pressure”, ldl “low density lipoprotein cholesterol”, famhist “family history of heart disease”, age “age at onset” (in years), and chd indicates whether the patient has coronary heart disease or not (a response).
(see help(SAheart, package="ElemStatLearn") for details)
Constructing a plot - the pipeline
In the grammar of graphics, a plot processes each component in turn.
First the data:
ggplot(data = SAheart)
Then the mapping of the data to plot “aesthetics”:
ggplot(data = SAheart) +
  aes(x = age, y = chd)
[Figure: age (x), chd (y)]
Then the geometry:
ggplot(data = SAheart) +
  aes(x = age, y = chd) +
  geom_point()
[Figure: age (x), chd (y)]
Which can have several further steps in the pipeline:
ggplot(data = SAheart) +
  aes(x = age, y = chd) +
  geom_point() +
  geom_smooth()
[Figure: age (x), chd (y)]
Constructing a plot
Alternatively, in the grammar of ggplot2, a plot is also a sum of component layers.
The base display with mapping:
ggplot(data = SAheart, mapping = aes(x = age, y = chd))
[Figure: age (x), chd (y)]
Here the + is adding layers:
ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
  geom_point()
[Figure: age (x), chd (y)]
ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
  geom_point() +
  geom_smooth()
[Figure: age (x), chd (y)]
Constructing a plot - separate mappings
Alternatively, we could deliberately associate only the data with the plot, forcing the mapping of the data to aesthetics within each individual component layer:
ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd))
[Figure: age (x), chd (y)]
The mapping is explicit for each layer.
What would the following plot look like?
ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd)) +
  geom_smooth()
It fails . . . why? How could it be fixed?
Cautionary note: the ggplot2 grammar mixes the two metaphors of “layers” and “pipes”. Just because an aesthetic precedes a component in the pipeline does not mean that it is available for use.
Solution 1: explicitly give the mapping for each layer:
ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd)) +
  geom_smooth(mapping = aes(x = age, y = chd))
[Figure: age (x), chd (y)]
Solution 2: provide aesthetics upstream:
ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd)) +
  aes(x = age, y = chd) +
  geom_smooth()
[Figure: age (x), chd (y)]
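The failing plot and the two solutions can be checked without drawing anything. The sketch below uses a small made-up data frame in place of SAheart, a hypothetical helper builds(), and method = "lm" in the smooth only to keep the example small; none of these come from the slides.

```r
library(ggplot2)

# Hypothetical stand-in for SAheart with just the two variates needed here.
toy <- data.frame(age = c(20, 25, 31, 38, 42, 47, 53, 58, 61, 64),
                  chd = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1))

# Hypothetical helper: TRUE if every layer finds the aesthetics it requires.
builds <- function(plot) {
  tryCatch({ ggplot_build(plot); TRUE }, error = function(e) FALSE)
}

point_map <- aes(x = age, y = chd)

# The mapping given inside geom_point() belongs to that layer only.
broken <- ggplot(toy) + geom_point(mapping = point_map) +
  geom_smooth(method = "lm")
# Solution 1: repeat the mapping in the smooth layer.
fixed1 <- ggplot(toy) + geom_point(mapping = point_map) +
  geom_smooth(mapping = point_map, method = "lm")
# Solution 2: adding aes() to the plot makes it a default for every layer.
fixed2 <- ggplot(toy) + geom_point(mapping = point_map) + point_map +
  geom_smooth(method = "lm")

status <- c(broken = builds(broken), fixed1 = builds(fixed1), fixed2 = builds(fixed2))
print(status)
```

A mapping supplied to a layer stays local to that layer, whereas a mapping added to the plot itself is inherited by all layers regardless of where it appears in the sum.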
ggplot(data = SAheart) +
  geom_point(mapping = aes(x = age, y = chd, col = famhist)) +
  geom_smooth(mapping = aes(x = age, y = chd))
[Figure: age (x), chd (y); legend: famhist (Absent, Present)]
Constructing a plot - shared and separate mappings
ggplot(data = SAheart) +
  aes(group = famhist) +
  geom_point(mapping = aes(x = age, y = chd)) +
  geom_smooth(mapping = aes(x = age, y = chd))
[Figure: age (x), chd (y)]
ggplot(data = SAheart, mapping = aes(group = famhist)) +
  geom_point(mapping = aes(x = age, y = chd, col = famhist)) +
  geom_smooth(mapping = aes(x = age, y = chd))
[Figure: age (x), chd (y); legend: famhist (Absent, Present)]
ggplot(data = SAheart, mapping = aes(group = famhist, col = famhist)) +
  geom_point(mapping = aes(x = age, y = chd)) +
  geom_smooth(mapping = aes(x = age, y = chd))
[Figure: age (x), chd (y); legend: famhist (Absent, Present)]
Constructing a plot
Alternatively, we could split the plot into two pieces by facetting:
ggplot(data = SAheart, mapping = aes(x = age, y = chd)) +
  geom_point(col = "steelblue", size = 3, alpha = 0.4) +
  geom_smooth(method = "loess", col = "steelblue") +
  facet_wrap(~famhist)
[Figure: panels Absent and Present; age (x), chd (y)]
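What facetting does to the data can be inspected with layer_data(), which returns the data frame a layer is drawn from. A minimal sketch on a made-up data frame (toy is a hypothetical stand-in for SAheart):

```r
library(ggplot2)

# Hypothetical stand-in data with a two-level grouping variate.
toy <- data.frame(age     = c(20, 25, 31, 38, 42, 47, 53, 58),
                  chd     = c(0, 0, 1, 0, 1, 1, 0, 1),
                  famhist = rep(c("Absent", "Present"), times = 4))

faceted <- ggplot(toy, aes(x = age, y = chd)) +
  geom_point() +
  facet_wrap(~famhist)

# Every row of the layer's data is assigned to exactly one panel.
panel_of_each_point <- layer_data(faceted, 1)$PANEL
points_per_panel <- table(panel_of_each_point)
print(points_per_panel)
```

Each panel receives only the rows for its own level of famhist, so any statistic computed in a layer (such as a smooth) is computed separately within each panel.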
Components of the layered grammar
In the grammar of ggplot2, a plot is a structured combination of:
◮ a dataset,
◮ a set of mappings from variates to aesthetics,
◮ one or more layers, each composed of
    ◮ a geometric object,
    ◮ a statistical transformation,
    ◮ a position adjustment, and
    ◮ (optionally) its own dataset and aesthetic mappings
◮ a scale for each aesthetic mapping,
◮ a coordinate system,
◮ a facetting specification
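To make the list concrete, the sketch below writes out every component explicitly with layer() and compares the result with the usual shorthand. The data frame toy is made up for illustration, and the particular defaults spelled out in long_form are assumptions about what the shorthand supplies for a simple scatterplot.

```r
library(ggplot2)

# Hypothetical data, for illustration only.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(3, 1, 4, 2))

# The usual shorthand: geom_point() fills in the stat and position itself.
short_form <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# The same plot with each component of the grammar named explicitly.
long_form <- ggplot(data = toy, mapping = aes(x = x, y = y)) +   # dataset, mappings
  layer(geom = "point", stat = "identity", position = "identity") +  # one layer
  scale_x_continuous() + scale_y_continuous() +                  # a scale per position
  coord_cartesian() +                                            # coordinate system
  facet_null()                                                   # no facetting

same_positions <- isTRUE(all.equal(layer_data(short_form, 1)[, c("x", "y")],
                                   layer_data(long_form, 1)[, c("x", "y")]))
print(same_positions)
```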
Geometric objects
There are a variety of geometric objects that can be added to a plot:
◮ geom_abline(), geom_hline(), geom_vline(), geom_curve(), geom_segment(), geom_step()
◮ geom_label(), geom_text()
◮ geom_point(), geom_smooth(), geom_crossbar(), geom_errorbar(), geom_errorbarh(), geom_linerange(), geom_pointrange()
◮ geom_rect(), geom_raster(), geom_area(), geom_ribbon(), geom_tile()
◮ geom_bar(), geom_col()
◮ geom_dotplot(), geom_boxplot(), geom_histogram(), geom_freqpoly(), geom_density(), geom_violin(), geom_quantile(), geom_qq()
◮ geom_bin2d(), geom_density2d(), geom_hex()
◮ geom_contour()
◮ geom_map(), geom_polygon()
Each of these has its own arguments including mapping, data, stat, et cetera.
Geometric objects - adding to plots
Beginning with a plot, different geometric objects may be added. For example:
p <- ggplot(data = SAheart, mapping = aes(x = tobacco, y = sbp))
p
[Figure: tobacco (x), sbp (y)]
Geometric objects - points and density
p +
  geom_point() +
  geom_density_2d(lwd = 1.5, col = "steelblue")
[Figure: tobacco (x), sbp (y)]
Geometric objects - histogram
h <- ggplot(data = SAheart, mapping = aes(x = adiposity))
h + geom_histogram(bins = 10, fill = "steelblue", col = "black", alpha = 0.5)
[Figure: adiposity (x), count (y)]
Note that had we tried to layer the histogram on top of p, it would have inherited from p a y aesthetic (namely y = sbp) which does not make sense for a histogram.
h + geom_histogram(mapping = aes(y = ..density..),
                   bins = 10, fill = "steelblue", col = "black", alpha = 0.5)
[Figure: adiposity (x), density (y)]
A y aesthetic that does make sense for a histogram is ..density.. which forces the scaling of the vertical axis so that the histogram has unit area. Note that the x aesthetic was inherited from h.
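The unit-area claim can be verified numerically from the data that the histogram layer computes. This sketch uses made-up adiposity values and writes the density mapping as after_stat(density), the spelling that more recent ggplot2 releases recommend in place of ..density..; both name the same computed variable.

```r
library(ggplot2)

# Hypothetical adiposity values, for illustration only.
toy <- data.frame(adiposity = c(12, 15, 18, 21, 22, 24, 25, 27, 30, 33, 36, 41))

dens_hist <- ggplot(toy, aes(x = adiposity)) +
  geom_histogram(mapping = aes(y = after_stat(density)), bins = 5)

# One row per bar: xmin/xmax are the bin edges, density is the bar height.
bars <- layer_data(dens_hist, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
print(total_area)
```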
Geometric objects - density scale histogram
Provided we supply a y aesthetic mapping, a histogram could therefore be added to p as well.
p + geom_histogram(mapping = aes(x = adiposity, y = ..density..),
                   bins = 10, fill = "steelblue", col = "black", alpha = 0.5)
[Figure: axes labelled tobacco (x) and sbp (y)]
Note:
◮ the change in vertical scale matches the histogram
◮ the axes labels do not match the aesthetics of the histogram (though the tick marks and values happen to)
Because this is only a grammar, it is as easy to make silly visualizations as it is silly sentences.
Geometric objects - layering effect
The order of layering (on top of h now) matters:
h +
  geom_histogram(mapping = aes(y = ..density..),
                 bins = 10, fill = "steelblue", col = "black", alpha = 0.5) +
  geom_density(mapping = aes(y = ..density..), fill = "grey", alpha = 0.5)
[Figure: adiposity (x), density (y)]
Note that the y aesthetic had to be repeated here . . .
Switch the order of addition:
h +
  geom_density(mapping = aes(y = ..density..), fill = "grey", alpha = 0.5) +
  geom_histogram(mapping = aes(y = ..density..),
                 bins = 10, fill = "steelblue", col = "black", alpha = 0.5)
[Figure: adiposity (x), density (y)]
Note that the aesthetics need to be repeated here . . .
Geometric objects - bar charts
ggplot(SAheart) +
  geom_bar(mapping = aes(x = factor(chd), fill = famhist)) +
  labs(x = "chd", title = "South African heart disease") +
  coord_flip()
[Figure: titled “South African heart disease”; axes count and chd; legend: famhist (Absent, Present)]
Which makes you wonder how the data were collected.
Geometric objects
A different scatterplot:
p2 <- ggplot(data = SAheart, mapping = aes(x = sqrt(age), y = sbp))
p2 + geom_point()
[Figure: sqrt(age) (x), sbp (y)]
Note that each geometric object has its own arguments and properties that can be set.
p2 +
  geom_point(col = "red", size = 3, pch = 21, fill = "yellow", alpha = 0.5) +
  geom_smooth(method = "loess", col = "steelblue", lty = 2, lwd = 1.5, alpha = 0.2)
[Figure: sqrt(age) (x), sbp (y)]
Aesthetics apply to every point individually.
p2 +
  geom_point(mapping = aes(size = obesity),
             fill = "steelblue", col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(method = "loess", col = "yellow", lty = 2, lwd = 1.5, alpha = 0.2)
[Figure: sqrt(age) (x), sbp (y); legend: obesity]
p2 +
  geom_point(mapping = aes(size = obesity, fill = tobacco),
             col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(method = "loess", col = "yellow", lty = 2, lwd = 1.5, alpha = 0.2)
[Figure: sqrt(age) (x), sbp (y); legends: tobacco, obesity]
Geometric objects
The data may change with each layer.
heartAttack <- SAheart[, "chd"] == 1
hAplot <- p2 +
  geom_point(data = SAheart[heartAttack, ],
             mapping = aes(size = obesity),
             alpha = 0.4, col = "black", pch = 21, fill = "steelblue")
hAplot
[Figure: sqrt(age) (x), sbp (y); legend: obesity]
qboth <- hAplot +
  geom_point(data = SAheart[!heartAttack, ],  # Not heartAttack
             mapping = aes(size = obesity),
             alpha = 0.4, col = "black", pch = 21, fill = "firebrick")
qboth
[Figure: sqrt(age) (x), sbp (y); legend: obesity]
qboth +
  geom_smooth(data = SAheart[heartAttack, ],
              method = "loess", col = "steelblue", alpha = 0.4) +
  geom_smooth(data = SAheart[!heartAttack, ],
              method = "loess", col = "firebrick", alpha = 0.4)
[Figure: sqrt(age) (x), sbp (y); legend: obesity]
qboth + geom_smooth(method = "loess")
[Figure: sqrt(age) (x), sbp (y); legend: obesity]
Note smooth is using all of the data here.
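Which data each layer actually sees can be confirmed by counting the rows it is built from. In this sketch toy is a made-up stand-in for SAheart, and geom_line() stands in for the smooth so that the row count of the inheriting layer is easy to read off.

```r
library(ggplot2)

# Hypothetical stand-in for SAheart (3 cases with chd = 1, 5 with chd = 0).
toy <- data.frame(age = c(20, 25, 31, 38, 42, 47, 53, 58),
                  sbp = c(118, 124, 132, 128, 144, 150, 160, 170),
                  chd = c(0, 0, 0, 1, 0, 1, 0, 1))
had_chd <- toy$chd == 1

p <- ggplot(toy, aes(x = sqrt(age), y = sbp)) +
  geom_point(data = toy[had_chd, ], colour = "steelblue") +   # layer 1: its own data
  geom_point(data = toy[!had_chd, ], colour = "firebrick") +  # layer 2: its own data
  geom_line()                                                 # layer 3: inherits all of toy

rows_per_layer <- c(nrow(layer_data(p, 1)),
                    nrow(layer_data(p, 2)),
                    nrow(layer_data(p, 3)))
print(rows_per_layer)
```

A layer given no data argument falls back on the data supplied to ggplot(), which is why a smooth added without its own data uses every case.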
qboth + geom_smooth(mapping = aes(color = factor(chd)), method = "loess")
[Figure: sqrt(age) (x), sbp (y); legends: factor(chd), obesity]
Here the smooth is separate for each colour given by chd as factor. Note ggplot’s default colours.
The colours can be coordinated by relying on the original data and using chd as a factor:
p2 +
  geom_point(mapping = aes(size = obesity, fill = factor(chd)),
             col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2)
[Figure: sqrt(age) (x), sbp (y); legends: factor(chd), obesity]
Here the smooth is separate for each colour given by chd as factor.
Scales
A map from the domain of data values to the range of some aesthetic (e.g. colour, size, axis ranges, . . . ).
p2 +
  geom_point(mapping = aes(size = obesity, fill = factor(chd)),
             col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick"))
[Figure: sqrt(age) (x), sbp (y); legends: chd, obesity]
This gets you your own “scale” values for colour and for fill.
p2 +
  geom_point(mapping = aes(size = obesity, fill = factor(chd)),
             col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick")) +
  scale_size("obesity", breaks = seq(0, 100, 5))
[Figure: sqrt(age) (x), sbp (y); legends: chd, obesity]
This additionally gets you your own “scale” values for point size (which is proportional to area).
p2 +
  geom_point(mapping = aes(size = obesity, fill = factor(chd)),
             col = "black", pch = 21, alpha = 0.4) +
  geom_smooth(mapping = aes(col = factor(chd)),
              method = "loess", lwd = 1.5, alpha = 0.2) +
  scale_fill_manual("chd", values = c("steelblue", "firebrick")) +
  scale_color_manual("chd", values = c("steelblue", "firebrick")) +
  scale_size_area("obesity", breaks = seq(0, 100, 5))
[Figure: sqrt(age) (x), sbp (y); legends: chd, obesity]
Now a zero value gives a zero area.
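The difference between scale_size() and scale_size_area() shows up in the sizes each one assigns. The sketch below uses a made-up variate, weight, that includes a zero; the sizes at the two ends rely on the default range of scale_size() and the default max_size of scale_size_area().

```r
library(ggplot2)

# Hypothetical data; the first value of weight is zero.
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(2, 4, 3, 5),
                  weight = c(0, 10, 20, 40))

base <- ggplot(toy, aes(x = x, y = y, size = weight)) + geom_point()

# Sizes actually assigned to the four points under each scale.
size_default <- layer_data(base + scale_size(), 1)$size
size_area    <- layer_data(base + scale_size_area(), 1)$size

print(rbind(weight = toy$weight,
            scale_size = size_default,
            scale_size_area = size_area))
```

With scale_size() the smallest observed value is still drawn at the lower end of the size range, whereas scale_size_area() anchors the scale at zero.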
Position scales
There are two position scales: horizontal (x) and vertical (y).
p +
  geom_point(alpha = 0.5, size = 1) +
  scale_x_continuous(limits = c(0, 40)) +
  scale_y_continuous(limits = c(75, 225))
[Figure: tobacco (x), sbp (y)]
p +
  geom_point(alpha = 0.5, size = 1) +
  xlim(0, 40) +
  ylim(75, 225)
[Figure: tobacco (x), sbp (y)]
p +
  aes(x = tobacco + 1) +
  geom_point(alpha = 0.5, size = 1) +
  scale_x_log10()
[Figure: tobacco + 1 (x, log scale), sbp (y)]
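A position scale such as scale_x_log10() transforms the data before anything is drawn. A minimal sketch, on made-up values standing in for SAheart, that reads the transformed positions back from the layer:

```r
library(ggplot2)

# Hypothetical stand-in values for tobacco and sbp.
toy <- data.frame(tobacco = c(0, 0.5, 2, 9, 31),
                  sbp     = c(118, 132, 144, 160, 170))

logged <- ggplot(toy, aes(x = tobacco + 1, y = sbp)) +
  geom_point() +
  scale_x_log10()

# The x positions stored in the layer are already on the log10 scale.
plotted_x <- layer_data(logged, 1)$x
print(cbind(original = toy$tobacco + 1, plotted = plotted_x))
```

Because the transformation happens first, any statistic computed in a layer (a smooth, say) is fitted to the transformed values rather than to the original ones.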
Coordinates
This is the coordinate system in which the positions are to be plotted. We have already seen coord_flip() which swaps the x and y axes. There are many others; the aspect ratio, for example, is fixed using coord_fixed():
ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  coord_fixed(ratio = 1)
[Figure: obesity (x), adiposity (y)]
Here the aspect ratio is fixed so that one unit on the y axis is drawn the same length as one unit on the x axis.
ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  coord_fixed(ratio = 0.5)
[Figure: obesity (x), adiposity (y)]
Here the aspect ratio is fixed so that one unit on the y axis is drawn only half as long as one unit on the x axis.
Coordinates
One coordinate system that is used is called coord_polar() which, despite what its name suggests, does not calculate a polar coordinate transformation but rather treats one of the two positions as defining an angle and the other as defining the radius.
ggplot(SAheart, aes(obesity, adiposity)) +
  geom_point() +
  geom_smooth() +
  coord_polar(theta = "x")
[Figure: obesity (angle), adiposity (radius)]
which, arguably, is a pretty weird plot but does demonstrate how coordinate systems are abstracted out as part of the grammar. Consequently coord_polar() should be used with considerable caution.
Coordinates
Arguably overly complicated, one use of coord_polar() is to construct a pie chart. This is just a bar chart expressed using coord_polar(). First the bar chart:
barChart <- ggplot(SAheart, aes(x = factor(1), fill = famhist)) +
  geom_bar(width = 1) +
  xlab("")
barChart
[Figure: count (y); legend: famhist (Absent, Present)]
Now the pie chart:
barChart + coord_polar(theta = "y")
[Figure: count (angle); legend: famhist (Absent, Present)]
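The arithmetic behind the pie chart is simple: with theta = "y" the stacked counts are spread around the full circle, so each slice's angle is its share of the total count. A sketch on made-up famhist values (toy is a hypothetical stand-in for SAheart):

```r
library(ggplot2)

# Hypothetical stand-in: three cases Absent, one Present.
toy <- data.frame(famhist = c("Absent", "Absent", "Absent", "Present"))

bar <- ggplot(toy, aes(x = factor(1), fill = famhist)) +
  geom_bar(width = 1)

# Counts computed by the bar layer, one per fill colour.
counts <- layer_data(bar, 1)$count

# Each slice's angle, in degrees, as its proportion of the whole circle.
angles_deg <- 360 * counts / sum(counts)
print(angles_deg)
```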
Coordinates
What’s going on here?
barChart + coord_polar(theta = "x")
[Figure: count (radius); legend: famhist (Absent, Present)]
Perhaps a little “too clever by half”?
Be careful with coord_polar(); it’s easy to have it make a very difficult to interpret plot.
Positions
A bar chart with two variates. The default position is “stack”
barChart2 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "stack") +
  xlab("chd")
barChart2
[Figure: chd (x), count (y); legend: famhist (Absent, Present)]
which stacks the two colours in the same bar.
What should this look like?
barChart2 + coord_polar(theta = "y")
[Figure: axes count and chd; legend: famhist (Absent, Present)]
Thickness is the bar width, each ring is chd, arc length is count. Again, coord_polar() can be confusing.
What should this look like?
barChart2 + coord_polar(theta = "x")
[Figure: axes chd and count; legend: famhist (Absent, Present)]
Angle is now the bar width, each wedge is chd, thickness is count. Again, coord_polar() can be confusing.
Positions
To place the colours beside each other rather than stack them, change the position to “dodge”
barChart3 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "dodge") +
  xlab("chd")
barChart3
[Figure: chd (x), count (y); legend: famhist (Absent, Present)]
which places the two colours side by side for each value of chd.
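The effect of a position adjustment can be read off the rectangle coordinates each bar ends up with. A minimal sketch comparing “stack” and “dodge” on a made-up data frame (toy is a hypothetical stand-in for SAheart):

```r
library(ggplot2)

# Hypothetical stand-in: 4 cases at each value of chd.
toy <- data.frame(
  chd     = c(0, 0, 0, 1, 1, 0, 1, 1),
  famhist = c("Absent", "Absent", "Present", "Present",
              "Present", "Absent", "Absent", "Present")
)

base <- ggplot(toy, aes(x = factor(chd), fill = famhist))

stacked <- layer_data(base + geom_bar(position = "stack"), 1)
dodged  <- layer_data(base + geom_bar(position = "dodge"), 1)

# Stacking moves bars vertically: some start above zero.
print(stacked[, c("x", "xmin", "xmax", "ymin", "ymax")])
# Dodging moves bars horizontally: all start at zero but are narrower.
print(dodged[, c("x", "xmin", "xmax", "ymin", "ymax")])
```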
What should this look like?
barChart3 + coord_polar(theta = "y")
[Figure: axes count and chd; legend: famhist (Absent, Present)]
Explain.
Now what should this look like?
barChart3 + coord_polar(theta = "x")
[Figure: axes chd and count; legend: famhist (Absent, Present)]
Explain.
Positions and facets
A bar chart with two variates. Use facets:
barChart4 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "stack") +
  xlab("chd") +
  facet_wrap(~chd)
barChart4
[Figure: panels by chd; chd (x), count (y); legend: famhist (Absent, Present)]
Exercise: What should barChart4 + coord_polar(theta = "y") look like? What about barChart4 + coord_polar(theta = "x")?
barChart5 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "dodge") +
  xlab("chd") +
  facet_wrap(~chd)
barChart5
[Figure: panels by chd; chd (x), count (y); legend: famhist (Absent, Present)]
Exercise: What should barChart5 + coord_polar(theta = "y") look like? What about barChart5 + coord_polar(theta = "x")?
A bar chart with two variates. Use both variates as facets:
barChart6 <- ggplot(SAheart, aes(x = factor(chd), fill = famhist)) +
  geom_bar(position = "dodge") +
  xlab("chd") +
  facet_wrap(famhist~chd)
barChart6
[Figure: panels by famhist (Absent, Present) and chd; chd (x), count (y); legend: famhist (Absent, Present)]
Exercise: What should barChart6 + coord_polar(theta = "y") look like? What about barChart6 + coord_polar(theta = "x")?
Statistical transformations - stat
These transformations often summarize data in some manner (e.g. by counting, by averaging, etc.). Some statistical functions operate “behind the scenes”:
◮ stat_count() in geom_bar()
◮ stat_bin() in geom_freqpoly(), geom_histogram()
◮ stat_bin2d() in geom_bin2d()
◮ stat_bindot() in geom_dotplot()
◮ stat_binhex() in geom_hex()
◮ stat_boxplot() in geom_boxplot()
◮ stat_contour() in geom_contour()
◮ stat_quantile() in geom_quantile()
◮ stat_smooth() in geom_smooth()
◮ stat_sum() in geom_count()
but may also be called directly (outside the geom_).
Other stats have no corresponding geom_ function:
◮ stat_ecdf(): compute an empirical cumulative distribution plot.
◮ stat_function(): compute y values from a function of x values.
◮ stat_summary(): summarise y values at distinct x values.
◮ stat_summary2d(), stat_summary_hex(): summarise binned values.
◮ stat_qq(): perform calculations for a quantile-quantile plot.
◮ stat_spoke(): convert angle and radius to position.
◮ stat_unique(): remove duplicated rows.
Statistical transformations - example
Adding some statistical summary information to the scatterplot p2:
p2 +
  geom_point(mapping = aes(size = obesity, fill = factor(chd)),
             col = "black", pch = 21, alpha = 0.4) +
  stat_summary(geom = "point", fun.y = "median",
               col = "yellow", size = 2, pch = 19)
[Figure: sqrt(age) (x), sbp (y); legends: factor(chd), obesity]
Adds the median of the ys at each observed x.
Alternatively use stat = "summary" in geom_point(). Also add connecting lines to the scatterplot p2:
p2 +
  geom_point(mapping = aes(size = obesity, fill = factor(chd)),
             col = "black", pch = 21, alpha = 0.4) +
  geom_point(stat = "summary", fun.y = "median",
             col = "yellow", size = 2, pch = 19) +
  stat_summary(geom = "line", fun.y = "median",
               col = "yellow", size = 1, pch = 19)
[Figure: sqrt(age) (x), sbp (y); legends: factor(chd), obesity]
Adds the median of the ys at each observed x.
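That the summary layer really holds the median of the ys at each x can be checked against a calculation by hand. This sketch uses made-up data and the argument name fun, which more recent ggplot2 releases use in place of fun.y.

```r
library(ggplot2)

# Hypothetical data: three y values at each of three x values.
toy <- data.frame(x = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  y = c(5, 1, 3, 10, 20, 60, 7, 7, 8))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  stat_summary(geom = "point", fun = median, colour = "yellow")

# Layer 2 holds one row per distinct x, with y replaced by its summary.
summary_layer <- layer_data(p, 2)
summary_layer <- summary_layer[order(summary_layer$x), ]

# The same medians computed directly from the data.
by_hand <- tapply(toy$y, toy$x, median)
print(cbind(x = summary_layer$x, from_plot = summary_layer$y, by_hand = by_hand))
```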
Miscellaneous
◮ Can also facet in a matrix grid using facet_grid()
◮ position can also be “jitter” (best for scatterplots)
◮ there is a function called theme() which is how the appearance of all non-data plot components are changed.
    ◮ E.g. it is possible to turn that grey background grid off via theme() (though it seems a lot of work)
◮ there is a function qplot() or quickplot() which is more like a base graphics plot() call and so may be easier to use than following the ggplot2 grammar via ggplot() + ...
◮ ggsave() will save the most recent ggplot.
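Two of these points in a short sketch: turning off the grey background and grid with theme(), and saving a plot with ggsave(). The data frame and the output file name are hypothetical, and the plot is passed to ggsave() explicitly rather than relying on the most recent plot.

```r
library(ggplot2)

# Hypothetical data, for illustration only.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(3, 1, 4, 2))

# Blank out the panel background and the grid lines via theme().
plain <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  theme(panel.background = element_blank(),
        panel.grid = element_blank())

# Save to a temporary PDF (hypothetical file name); width and height are in inches.
out_file <- file.path(tempdir(), "plain_plot.pdf")
ggsave(filename = out_file, plot = plain, width = 4, height = 3)
print(file.exists(out_file))
```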
Note: to plot time series (objects of class ts) you need the ggfortify package and then use autoplot().
library(ggfortify)
autoplot(sunspots)
[Figure: time series plot of sunspots]
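If ggfortify is not available, one alternative (a sketch, not the approach used in these slides) is to convert the ts object to a data frame by hand and plot it with geom_line(). The column names time and sunspots are chosen here for illustration.

```r
library(ggplot2)

# Convert the ts object to a plain data frame: one row per observation.
sunspot_df <- data.frame(time     = as.numeric(time(sunspots)),
                         sunspots = as.numeric(sunspots))

# An ordinary ggplot2 line plot of the series against time.
ts_plot <- ggplot(sunspot_df, aes(x = time, y = sunspots)) +
  geom_line()

print(head(sunspot_df, 3))
```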