Data Manipulation in R
R.W. Oldford
Control flow
Recall the base data structures in R:

  dimensionality   homogeneous contents   heterogeneous contents
  1d               Atomic vector          List
  2d               Matrix                 Data frame
  nd               Array

Subsets and individual elements can be selected from these using the accessors [] (with as many indices as there are dimensions), and [[]] and $ for lists. Subsets can be identified by index, name, or as the result of logical expressions (see Logical). Familiar conditional program control (see ?Control)
# Conditionals
if (cond) expr
if (cond) cons.expr else alt.expr
can also be used with logical conditions cond to evaluate arbitrary expressions expr, cons.expr, or alt.expr. When a list of alternatives is available to choose from, the function switch(expr, ...) can be used. It evaluates expr and matches it to the corresponding element of the list ... following the expression (see ?switch).
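For instance, a small sketch of switch() dispatching on a string (the function centre() and its data are purely illustrative, not from the original slides):

```r
# switch() evaluates `type` and returns the value of the matching branch
centre <- function(x, type) {
  switch(type,
         mean = mean(x),
         median = median(x),
         trimmed = mean(x, trim = 0.1))
}
x <- c(1, 2, 3, 100)
centre(x, "mean")    # 26.5
centre(x, "median")  # 2.5
```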
Control flow - iteration and the lapply() family
Standard language constructs for, while, and repeat provide simple iteration:
# Looping
for (var in seq) expr
while (cond) expr
repeat expr
break
next
Three common ways to construct the sequence seq being looped over:
◮ elements of the vector: for (value in values) {}
◮ numeric indices of the vector: for (i in seq_along(x)) {}
◮ names in a data structure: for (name in names(x)) {}
Each is just a different case of elements of a vector. While easy to read (and write), and hence good for maintainable code, for() (and other) loops are inefficient in a scripting language like R.
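As a small concrete illustration of the three idioms (x here is just a made-up named vector):

```r
x <- c(a = 1, b = 4, c = 9)
total <- 0
for (value in x) total <- total + value           # over the elements
squares <- numeric(length(x))
for (i in seq_along(x)) squares[i] <- x[i]^2      # over the numeric indices
labels <- character(0)
for (name in names(x)) labels <- c(labels, name)  # over the names
total                                             # 14
```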
Control flow - iteration and the lapply() family
When the number of iterations is large, the above looping constructs are best avoided; vectorized alternatives should be used instead. These are usually implemented in C, C++, or some other compiled programming language. The lapply() family of functions (see ?lapply) is one set of vectorized functions implemented in C:
lapply(X, FUN, ...)
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
These apply a function FUN to the elements of X with ... supplying any optional arguments of FUN. lapply returns a list of the same length as X
lapply(3:4, function(x) x^2)
## [[1]]
## [1] 9
##
## [[2]]
## [1] 16
Control flow - vector output from sapply() and vapply()
sapply() is a user-friendly version of lapply which produces simplified results:
sapply(3:4, function(x) x^2)
## [1] 9 16
sapply(3:4, function(x) x^2, simplify = FALSE)
## [[1]]
## [1] 9
##
## [[2]]
## [1] 16
and vapply() is much like sapply() but with a pre-specified type for the return value of the function:
vapply(3:4, function(x) paste(c(x, 3*x)^2),
       FUN.VALUE = character(length = 2))
##      [,1] [,2]
## [1,] "9"  "16"
## [2,] "81" "144"
Note that sapply() tries to be clever and so might return values in a list instead of a vector (e.g. when FUN returns results of different lengths). In contrast, vapply() always wants to return a vector and so will generate an error if it cannot (e.g. when a list would work). Consequently vapply() may be preferred for programmatic use so that errors are generated; use sapply() within a function with care.
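A quick sketch of the difference (the function f here is made up for illustration): when FUN returns results of varying length, sapply() silently falls back to a list, while vapply() signals an error.

```r
f <- function(x) rep(1.5, x)         # result length varies with x
res <- sapply(1:3, f)                # cannot simplify: quietly returns a list
is.list(res)                         # TRUE
out <- try(vapply(1:3, f, FUN.VALUE = numeric(1)), silent = TRUE)
inherits(out, "try-error")           # TRUE: vapply insists on length-1 doubles
```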
Control flow - apply()
apply() is used to apply a function to the margins of an array or a matrix:
(x <- array(1:8, dim = c(1, 4, 2)))
## , , 1
##
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
##
## , , 2
##
##      [,1] [,2] [,3] [,4]
## [1,]    5    6    7    8
apply(x, MARGIN = 1, FUN = max)
## [1] 8
apply(x, MARGIN = 2, FUN = max)
## [1] 5 6 7 8
apply(x, MARGIN = 3, FUN = max)
## [1] 4 8
apply(x, MARGIN = c(1,3), FUN = max)
##      [,1] [,2]
## [1,]    4    8
Control flow - sweep()
Often we want to remove a summary statistic from the elements of an array. sweep() takes summary statistics computed over specified margins of an array x and removes (or sweeps) them out of the array.
sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
Here x is the array to be swept, STATS the summary statistics to be swept out, MARGIN the vector of indices identifying that part of x being summarized, and FUN is the method by which the summary statistics are swept from x. For example, we could subtract the median values from every column of the data.matrix of the data frame quakes:
x <- data.matrix(quakes)
summaryStat <- apply(quakes, 2, median)
head(sweep(quakes, MARGIN = 2, STATS = summaryStat, FUN = "-"))
##     lat  long depth  mag stations
## 1 -0.12  0.21   315  0.2       14
## 2 -0.32 -0.38   403 -0.4      -12
## 3 -5.70  2.69  -205  0.8       16
## 4  2.33  0.25   379 -0.5       -8
## 5 -0.12  0.55   402 -0.6      -16
## 6  0.62  2.90   -52 -0.6      -15
Note: For other applications, a different FUN (or “broom”) might be used to sweep out the value.
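For example, column maxima could be swept out by division, rescaling each column to a maximum of 1 (a hypothetical toy matrix, not from the original slides):

```r
x <- matrix(1:6, nrow = 2)            # columns (1,2), (3,4), (5,6)
colMax <- apply(x, MARGIN = 2, FUN = max)
scaled <- sweep(x, MARGIN = 2, STATS = colMax, FUN = "/")
apply(scaled, 2, max)                 # every column now peaks at 1
```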
Control flow - tapply()
tapply() applies a function FUN to values of an atomic vector X given by each combination of factors supplied as INDEX. The array is “ragged” in the sense that not all combinations need have a corresponding subset of values in X.
tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
For example, with the SAheart data from ElemStatLearn
library(ElemStatLearn)
with(SAheart, tapply(sbp, INDEX = list(chd, famhist), FUN = mean))
##     Absent  Present
## 0 134.9806 136.4896
## 1 142.8594 144.3229
with(SAheart,
     tapply(sbp,
            INDEX = list(age = cut(age, 3), tobacco = cut(tobacco, 3)),
            FUN = mean))
##              tobacco
## age           (-0.0312,10.4] (10.4,20.8] (20.8,31.2]
##   (15,31.3]         128.3846          NA          NA
##   (31.3,47.7]       134.8561    142.3333          NA
##   (47.7,64]         145.6821    145.9355         168

Note that there is not a lot of tobacco use (in cumulative kg) amongst younger people. If there are no observations in X for a group defined by the factors of INDEX, then NA is returned for that group.
Control flow - by()
by(data, INDICES, FUN, ..., simplify = TRUE)
by() is a convenience wrapper function for tapply() which applies the function FUN to values of a data frame data given by each combination of factors supplied as INDICES. Here ... are further arguments to FUN.
by(SAheart[, c("age", "tobacco", "obesity")],
   INDICES = list(chd = SAheart$chd),
   FUN = summary)
## chd: 0
##       age           tobacco          obesity
##  Min.   :15.00   Min.   : 0.000   Min.   :17.75
##  1st Qu.:27.00   1st Qu.: 0.000   1st Qu.:22.60
##  Median :40.00   Median : 1.035   Median :25.57
##  Mean   :38.85   Mean   : 2.635   Mean   :25.74
##  3rd Qu.:50.75   3rd Qu.: 4.200   3rd Qu.:28.07
##  Max.   :64.00   Max.   :20.000   Max.   :46.58
## --------------------------------------------------------
## chd: 1
##       age           tobacco          obesity
##  Min.   :17.00   Min.   : 0.000   Min.   :14.70
##  1st Qu.:42.75   1st Qu.: 1.500   1st Qu.:23.64
##  Median :53.00   Median : 4.130   Median :26.48
##  Mean   :50.29   Mean   : 5.525   Mean   :26.62
##  3rd Qu.:59.00   3rd Qu.: 8.200   3rd Qu.:28.78
##  Max.   :64.00   Max.   :31.200   Max.   :45.72
Control flow - by()
savePar <- par(mfrow = c(2,2), mar = rep(2,4))
output <- by(SAheart[, c("tobacco", "sbp")],
             INDICES = list(age = cut(SAheart$age, 2), chd = SAheart$chd),
             FUN = function(data) {
               x <- data[,1]
               y <- data[,2]
               ordered_x <- sort(x)
               fit <- loess(y ~ x, data.frame(x = x, y = y))
               pred <- predict(fit, newdata = data.frame(x = ordered_x))
               plot(data, pch = 19, col = adjustcolor("grey", 0.5))
               lines(ordered_x, pred, col = "blue")
             })
[Figure: four panels of sbp versus tobacco, one per combination of age group and chd, each with a blue loess curve.]
par(savePar)
Control flow - aggregate() for data.frames
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
aggregate() splits the data into groups, computes summary statistics for each, and returns the result in a convenient form.
aggregate(SAheart[, c("age", "tobacco", "obesity")],
          by = list(chd = SAheart$chd),
          FUN = summary)
##   chd age.Min. age.1st Qu. age.Median age.Mean age.3rd Qu. age.Max.
## 1   0 15.00000    27.00000   40.00000 38.85430    50.75000 64.00000
## 2   1 17.00000    42.75000   53.00000 50.29375    59.00000 64.00000
##   tobacco.Min. tobacco.1st Qu. tobacco.Median tobacco.Mean tobacco.3rd Qu.
## 1     0.000000        0.000000       1.035000     2.634735        4.200000
## 2     0.000000        1.500000       4.130000     5.524875        8.200000
##   tobacco.Max. obesity.Min. obesity.1st Qu. obesity.Median obesity.Mean
## 1    20.000000     17.75000        22.60250       25.57000     25.73745
## 2    31.200000     14.70000        23.63500       26.47500     26.62294
##   obesity.3rd Qu. obesity.Max.
## 1        28.06500     46.58000
## 2        28.78000     45.72000
Control flow - mapply()
mapply() is a multi-variable version of sapply():
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
where FUN is a function to be applied to the elements of some number of arguments, which appear in .... The argument MoreArgs is a list of other arguments to FUN to be used at each call.
mapply(rep, 1:3, 3:1)
## [[1]]
## [1] 1 1 1
##
## [[2]]
## [1] 2 2
##
## [[3]]
## [1] 3
mapply(rep, times = 1:3, x = 3:1)  # using named arguments
## [[1]]
## [1] 3
##
## [[2]]
## [1] 2 2
##
## [[3]]
## [1] 1 1 1
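MoreArgs supplies arguments that stay fixed across calls; a small sketch (the vectors here are made up for illustration):

```r
# paste each letter to its number; the separator stays fixed via MoreArgs
out <- mapply(paste, c("a", "b", "c"), 1:3, MoreArgs = list(sep = "-"))
unname(out)   # "a-1" "b-2" "c-3"
```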
Control flow - functional programming constructs
A number of programming constructs that are common in functional programming are also available (see ?Map). These are found in many such languages (e.g. dating back to at least Smalltalk and Lisp). In R, these are:
Map(f, ...)
Reduce(f, x, init, right = FALSE, accumulate = FALSE)
Filter(f, x)
Find(f, x, right = FALSE, nomatch = NULL)
Position(f, x, right = FALSE, nomatch = NA_integer_)
Negate(f)
Here f is a function (of the appropriate arity: binary for Reduce, unary for Filter, Find, and Position, and k-ary for Map), x is a vector, and ... any number of vectors. Negate(f) returns a function which, when applied to its arguments, simply logically negates whatever would be returned by the (predicate) function f (i.e. Negate(f) returns the function !f).
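A tiny sketch of Negate() combined with Filter() (the predicates are made up for illustration):

```r
is_even <- function(x) x %% 2 == 0
is_odd  <- Negate(is_even)   # behaves like function(x) !is_even(x)
Filter(is_odd, 1:10)         # 1 3 5 7 9
```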
Note: Various functions (e.g. clusterMap() and mcmapply()) from the package parallel provide parallel versions of Map().
Control flow - Map() on a data.frame
Map(summary, quakes)
## $lat
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  -38.59  -23.47  -20.30  -20.64  -17.64  -10.72
##
## $long
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   165.7   179.6   181.4   179.5   183.2   188.1
##
## $depth
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    40.0    99.0   247.0   311.4   543.0   680.0
##
## $mag
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    4.00    4.30    4.60    4.62    4.90    6.40
##
## $stations
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.00   18.00   27.00   33.42   42.00  132.00
Control flow - Map() on samples
Suppose x is a vector (i.e. not a list nor a data frame):
Bootstrap <- function(x, fun, nreps) {
  n <- length(x)
  Map(function(i) {
        fun(x[sample(seq_along(x), size = n, replace = TRUE)])
      },
      1:nreps)
}
BootVals <- Bootstrap(quakes$mag, mean, 1000)
Map() returns a list (no simplification) which makes BootVals a bit inconvenient for further data analysis:
hist(unlist(BootVals), col = "grey", breaks = 15,
     main = "Bootstrap distribution",
     xlab = "average magnitude")

[Figure: histogram of the bootstrapped average magnitudes, ranging roughly from 4.58 to 4.66.]
Control flow - Reduce()
Reduce(f, x, init, right = FALSE, accumulate = FALSE)
f is a function of some number of arguments, x is a vector, init is an initial value of the same kind as the elements of x, accumulate is a logical indicating whether the results of every reduction step are to be returned or just the last (the default), and right is a logical indicating whether the reduction should begin at the end (right) of x and move backwards. A simple example might be the calculation of a sample variance, which requires evaluating

  ∑_{i=1}^n (x_i − x̄)²

which requires two passes through x: one to get x̄ = ∑_{i=1}^n x_i / n, the other to gather the squared differences:
twoPass <- function (x) {
  n <- length(x)
  xbar <- Reduce(function(x0, x1) {x0 + x1}, x, init = 0) / n
  result <- Reduce(function(x0, x1) {x0 + (x1 - xbar)^2}, x, init = 0)
  result
}
n <- 1000 ; x <- rnorm(n)
twoPass(x) / (n-1)
## [1] 0.9849108
Control flow - Reduce()
It is not uncommon to see authors suggest a single-pass algorithm based on the mathematically equivalent form:

  ∑_{i=1}^n (x_i − x̄)² = ∑_{i=1}^n x_i² − n x̄² = ∑_{i=1}^n x_i² − (1/n) (∑_{i=1}^n x_i)²

which is implementable using a single Reduce() when we gather two results as we go:
onePass <- function (x) {
  n <- length(x)
  result <- Reduce(function(x0, x1) {x0 + c(x1^2, x1)}, x, init = c(0,0))
  result[1] - (result[2]^2 / n)
}
onePass(x) / (n-1)
## [1] 0.9849108
Unfortunately, the one-pass algorithm is not numerically sound. It can even produce negative values! The following corrected two-pass method, based on both (replace x_i by x_i − x̄ in the one-pass), is numerically sounder than either.
twoPassCorrected <- function (x) {
  n <- length(x)
  xbar <- Reduce(function(x0, x1) {x0 + x1}, x, init = 0) / n
  result <- Reduce(function(x0, x1) {x0 + c((x1 - xbar)^2, (x1 - xbar))},
                   x, init = c(0,0))
  result[1] - (result[2]^2 / n)
}
twoPassCorrected(x) / (n-1)
## [1] 0.9849108
Control flow - Filter()
Filter(f, x)
Filter() selects that subset of the elements of x which return TRUE to the predicate function f. Note: it is NOT to be confused with the base R function filter() which applies "linear filtering" to a time series. Suppose, for example, we want only those variates in a dataset which either are factors or have at most 3 different values and hence are potential factors.
data(SAheart, package = "ElemStatLearn")
possibleFactors <- Filter(f = function (var) {
                            is.factor(var) |
                            (!is.factor(var) & length(unique(var)) <= 3)},
                          SAheart)
possibleFactors[1:3,]
##   famhist chd
## 1 Present   1
## 2  Absent   1
## 3 Present   0
Control flow - Find()
Find(f, x, right = FALSE, nomatch = NULL)
Find() searches for and returns the first element in x that returns TRUE to the predicate function f.
possibleFactor <- Find(f = function (var) {
                         !is.factor(var) & length(unique(var)) <= 3},
                       SAheart)
str(possibleFactor)
## int [1:462] 1 1 0 1 1 0 0 1 0 1 ...
Find() returns NULL if no such element is found.
Find(f = function (var) {is.factor(var) & (nlevels(var) >= 5)}, SAheart)
## NULL
Control flow - Position()
Position(f, x, right = FALSE, nomatch = NA_integer_)
Position() searches for and returns the position of the first element in x that returns TRUE to the predicate function f.
Position(f = function (var) {
           is.factor(var) | (!is.factor(var) & length(unique(var)) <= 3)},
         SAheart)
## [1] 5
Position() returns the value of nomatch if no such element is found.
Position(f = function (var) {is.factor(var) & (nlevels(var) >= 5)}, SAheart)
## [1] NA
Simplifying data manipulation
The lapply(), sapply(), etc. family of functions, as well as the functional programming functions Map(), Reduce(), etc., provide a powerful suite of tools that can be used to get data into a form for analysis. However, because this set has grown over time (beginning with S), the lapply(), sapply(), Map(), Filter(), ... suite of functions uses a variety of methods, arguments, and argument names to accomplish what have become more critically important tasks in the age of large and diverse data sources. Were the language being defined now, greater consideration would be given to rationalizing these functions so as to simplify what are now more routine data transformations. Of course, backward compatibility precludes redefining these functions.
Simplifying program control
When much of the data analysis cycle is sequential, in that analysis decisions are often based on examination of the results of a previous data manipulation or display, a pipe metaphor does not seem out of place. In data manipulation, transformation, and display, it could be very convenient to be able to express the usual mapping, filtering, selection, plotting, and related functions as pipes joined one to the next through which the data are passed and processed.
Simplifying data structure
Similarly, much mathematical manipulation is based on manipulating matrices and arrays. The S language (and hence R) extends these to the statistical idea of a data.frame, where rows correspond to multivariate observations (or cases) and variates appear as columns. This mix of matrix-like and list-like behaviour gives a data structure well suited as input to statistical methods. And, knowing this structure, interfaces to all kinds of statistical modelling can be enormously simplified via a formula notation.
As the base statistical data structure, then, it is not surprising that certain constraints are imposed on the contents of a data.frame to ease its manipulation for modelling purposes. For example, coercing a character vector to be a factor makes sense for routine model building so that fitting algorithms can automatically adapt to the nature of the data. Unfortunately, for general data analysis, such coercions can sometimes get in the way and be more a nuisance than a help. Borrowing from relational databases, a simpler and arguably more general representation for data would be as a "table".
The tidyverse
Because backwards compatibility precludes redefinition of the base S (and hence R) language, data structures and functionality to simplify data analysis must be supplemental to the original language. The tidyverse is a collection of complementary yet related packages, each specialized to simplify some data analysis tasks and all of which work together as a coherent tool set. The tidyverse package consists of several other packages: ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats.
library(maps)
library(magrittr)  # this package provides a richer set of "pipes"
library(tidyverse)
Note: loading tidyverse might mask functions having the same names that are already loaded from other packages. For example, dplyr masks at least filter() and lag() from the base R stats package. To access these masked functions, stats::filter() and stats::lag() must be used as filter() and lag() will now refer to the functions from the dplyr package.
The tidyverse - simplified data manipulation with dplyr
The dplyr package provides a "grammar for data manipulation" which, as its name suggests, is a refactoring of the lapply() family functionality for data manipulation, together with the functional programming mapping functions Map(), Reduce(), Filter(), etc. According to its help information, dplyr has three main goals:
◮ "Identify the most important data manipulation verbs and make them easy to use from" R.
◮ "Provide blazing fast performance for in-memory data by writing key pieces in" C++ (using Rcpp).
◮ "Use the same interface to work with data no matter where it's stored, whether in a data frame, a data table or database."
The tidyverse - simplified data manipulation with dplyr
Key dplyr functions:
◮ filter() to choose observations based on their values
◮ arrange() to reorder the observations (rows)
◮ select() to select variates by name
◮ mutate() to create new variates from existing ones
◮ summarize() (or summarise()) to collapse many values down to a single summary
◮ group_by() to return a new data structure which organizes the entire data set into groups

Think of these six functions as verbs for a language of data manipulation. Each verb operates on a data frame which is always its first argument. Remaining arguments describe what to do with the data frame. Variable names within the data frame can be used without quotes. The result of each verb is a new data frame, which makes it easy to chain together multiple steps to get a new data frame formed from the original.
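As a sketch of chaining the verbs by nesting (using a made-up toy data frame, not the SAheart data; pipes come later):

```r
library(dplyr)
df <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 10))
# filter, then group, then summarize: each verb returns a new data frame
res <- summarize(group_by(filter(df, x > 1), g), ave = mean(x))
res$ave   # 3 10
```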
dplyr - filtering observations (rows)
head(SAheart)
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
## 4 170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1
## 5 134   13.60 3.50     27.78 Present    60   25.99   57.34  49   1
## 6 132    6.20 6.47     36.21 Present    62   30.77   14.14  45   0
head(filter(SAheart, chd == 1 & famhist == "Present"))
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
## 2 170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1
## 3 134   13.60 3.50     27.78 Present    60   25.99   57.34  49   1
## 4 114    4.08 4.59     14.60 Present    62   23.11    6.72  58   1
## 5 132    0.00 5.80     30.96 Present    69   30.11    0.00  53   1
## 6 134   14.10 4.44     22.39 Present    65   23.09    0.00  40   1
dplyr - arranging observations (rows)
head(arrange(SAheart, sbp))
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 101    0.48 7.26     13.00  Absent    50   19.82    5.19  16   0
## 2 102    0.40 3.41     17.22 Present    56   23.59    2.06  39   1
## 3 103    0.03 4.21     18.96  Absent    48   22.94    2.62  18   0
## 4 106    1.61 1.74     12.32  Absent    74   20.92   13.37  20   1
## 5 106    1.08 4.37     26.08  Absent    67   24.07   17.74  28   1
## 6 106    5.60 3.20     12.30  Absent    49   20.29    0.00  39   0
head(arrange(SAheart, sbp, desc(ldl)))
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 101    0.48 7.26     13.00  Absent    50   19.82    5.19  16   0
## 2 102    0.40 3.41     17.22 Present    56   23.59    2.06  39   1
## 3 103    0.03 4.21     18.96  Absent    48   22.94    2.62  18   0
## 4 106    1.08 4.37     26.08  Absent    67   24.07   17.74  28   1
## 5 106    5.60 3.20     12.30  Absent    49   20.29    0.00  39   0
## 6 106    1.61 1.74     12.32  Absent    74   20.92   13.37  20   1

Note: desc() for descending order.
dplyr - selecting variates (columns)
head(select(SAheart, sbp))
##   sbp
## 1 160
## 2 144
## 3 118
## 4 170
## 5 134
## 6 132
head(select(SAheart, sbp, ldl))
##   sbp  ldl
## 1 160 5.73
## 2 144 4.41
## 3 118 3.48
## 4 170 6.41
## 5 134 3.50
## 6 132 6.47
dplyr - selecting variates (columns)
head(select(SAheart, -sbp))
##   tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
## 2    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
## 3    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
## 4    7.50 6.41     38.03 Present    51   31.99   24.26  58   1
## 5   13.60 3.50     27.78 Present    60   25.99   57.34  49   1
## 6    6.20 6.47     36.21 Present    62   30.77   14.14  45   0
head(select(SAheart, famhist, starts_with("a")))
##   famhist adiposity alcohol age
## 1 Present     23.11   97.20  52
## 2  Absent     28.61    2.06  63
## 3 Present     32.28    3.81  46
## 4 Present     38.03   24.26  58
## 5 Present     27.78   57.34  49
## 6 Present     36.21   14.14  45
dplyr - renaming variates (columns)
head(rename(SAheart, coronary = chd))
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age coronary
## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52        1
## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63        1
## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46        0
## 4 170    7.50 6.41     38.03 Present    51   31.99   24.26  58        1
## 5 134   13.60 3.50     27.78 Present    60   25.99   57.34  49        1
## 6 132    6.20 6.47     36.21 Present    62   30.77   14.14  45        0
head(rename(SAheart, systolic = sbp, lipid = ldl))
##   systolic tobacco lipid adiposity famhist typea obesity alcohol age chd
## 1      160   12.00  5.73     23.11 Present    49   25.30   97.20  52   1
## 2      144    0.01  4.41     28.61  Absent    55   28.87    2.06  63   1
## 3      118    0.08  3.48     32.28 Present    52   29.14    3.81  46   0
## 4      170    7.50  6.41     38.03 Present    51   31.99   24.26  58   1
## 5      134   13.60  3.50     27.78 Present    60   25.99   57.34  49   1
## 6      132    6.20  6.47     36.21 Present    62   30.77   14.14  45   0
dplyr - adding and dropping variates (columns)
mutate() adds new variables and preserves existing; transmute() keeps only the new variables and drops existing variables.
head(mutate(SAheart, ldl2 = ldl^2))
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd    ldl2
## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1 32.8329
## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1 19.4481
## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0 12.1104
## 4 170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1 41.0881
## 5 134   13.60 3.50     27.78 Present    60   25.99   57.34  49   1 12.2500
## 6 132    6.20 6.47     36.21 Present    62   30.77   14.14  45   0 41.8609
head(transmute(SAheart, ldl2 = ldl^2, logBP = log(sbp)))
##      ldl2    logBP
## 1 32.8329 5.075174
## 2 19.4481 4.969813
## 3 12.1104 4.770685
## 4 41.0881 5.135798
## 5 12.2500 4.897840
## 6 41.8609 4.882802
dplyr - grouped summaries
summarize() (or summarise()) computes summaries of variates; group_by() is used to calculate summaries by group; ungroup() undoes a grouping.
summarize(SAheart, ave_sbp = mean(sbp), ave_ldl = mean(ldl), count = n())
##    ave_sbp  ave_ldl count
## 1 138.3268 4.740325   462
summarize(group_by(SAheart, chd, famhist),
          ave_sbp = mean(sbp), ave_ldl = mean(ldl), count = n())
## # A tibble: 4 x 5
## # Groups: chd [?]
##     chd famhist ave_sbp ave_ldl count
##   <int> <fct>     <dbl>   <dbl> <int>
## 1     0 Absent      135    4.32   206
## 2     0 Present     136    4.41    96
## 3     1 Absent      143    4.92    64
## 4     1 Present     144    5.87    96

Some handy summary functions (in addition to the usual ones) include n() for a count, n_distinct(), first(x), last(x), nth(x, 3), sum(x), etc.
magrittr - simplified program control with pipes
%>% is a binary operator which pipes the output of the first operand and provides it as the first argument of the second operand.
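That is, x %>% f(y) is equivalent to f(x, y); a minimal sketch (toy values, magrittr assumed loaded):

```r
library(magrittr)
4 %>% sqrt()                     # same as sqrt(4), i.e. 2
c(1, 4, 9) %>% sum() %>% sqrt()  # same as sqrt(sum(c(1, 4, 9)))
```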
SAheart %>%
  filter(famhist == "Absent", chd == 0) %>%
  arrange(sbp) %>%
  head()
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 101    0.48 7.26     13.00  Absent    50   19.82    5.19  16   0
## 2 103    0.03 4.21     18.96  Absent    48   22.94    2.62  18   0
## 3 106    5.60 3.20     12.30  Absent    49   20.29    0.00  39   0
## 4 108    0.00 2.74     11.17  Absent    53   22.61    0.95  20   0
## 5 108   15.00 4.91     34.65  Absent    41   27.96   14.40  56   0
## 6 108    0.00 1.43     26.26  Absent    42   19.38    0.00  16   0
magrittr - simplified program control with pipes
The pipe and layering style of ggplot2 fits well with the end of a pipeline. The ggplot pipeline using + simply follows the program control pipeline using %>%:
SAheart %>%
  mutate(ltob = log(tobacco), lsbp = log(sbp)) %>%
  filter(age < 50) %>%
  ggplot(aes(x = ltob, y = lsbp)) +
  geom_point() +
  facet_wrap(~chd)
[Figure: scatterplots of lsbp versus ltob, facetted by chd.]
magrittr - simplified program control with pipes
%>% works with any function provided it accepts the output from the pipe as its first argument. For example, we could also create a loon plot using pipes
library(loon)
# Assignment can be made at the beginning or at the end of the pipe
SAHealthy <- SAheart %>%
  filter(famhist == "Absent", chd == 0) %>%
  arrange(sbp)
# Or at the end
SAHealthy %>%
  select(age, ldl) %>%
  l_plot(linkingGroup = "Healthy heart data") -> p
Now p is a loon plot.
magrittr - simplified program control with pipes
There is another pipe from magrittr, %T>% or "tee pipe", that can be handy. Like %>%, the tee pipe %T>% pipes a value forward into a function or call expression, but where %>% returns the result of the function (or call expression), %T>% instead returns the original value.
This can be handy in many instances, such as when piping through layers in loon.
SAHealthy %>%
  select(age, ldl) %>%
  l_plot(showGuides = TRUE, linkingGroup = "Healthy heart data") %T>%
  l_layer_line(x = c(2,100), y = c(0,12), color = "red",
               dash = c(20,10), linewidth = 2) %T>%
  l_scaleto_world() -> p

In the above, the data are passed along using %>% to l_plot(), which produces the loon plot, and %T>% passes this plot along throughout until it is finally assigned to p.
magrittr - simplified program control with pipes
With pipes the value being passed can always be accessed as . (the dot). This can be helpful whenever other arguments need to access information from the value being passed:
data("SAheart", package = "ElemStatLearn")
SAheart %>%
  select(age, ldl, obesity, chd, famhist) %>%
  l_plot(showGuides = TRUE, linkingGroup = "Healthy heart data",
         glyph = c("ocircle", "otriangle")[1 + .$chd],
         size = .$obesity,
         selected = .$chd & (.$famhist == "Present"),
         color = .$famhist) %T>%
  l_layer_line(x = c(2,100), y = c(0,12), color = "red",
               dash = c(20,10), linewidth = 2) %T>%
  l_scaleto_selected() -> p

In the above, the data are passed along using %>%, whose variates are accessed using the dot . for the data frame as in .$chd; the pipeline then switches to passing the loon plot along using %T>% until it is finally assigned to p.
The tidyverse - simplified data with tibble
In the tidyverse, specialized data frames are introduced that behave more like a table (as in a relational database). Instead of a long (yet clear) name like data.frame.tbl, in the tidyverse this specialized data frame is called a tibble (as in pronouncing "tbl").
(x <- tibble(a = LETTERS, b = 1, c = rnorm(26), d = c^2 + b))
## # A tibble: 26 x 4
##    a         b       c     d
##    <chr> <dbl>   <dbl> <dbl>
##  1 A      1.00  0.383   1.15
##  2 B      1.00  1.42    3.00
##  3 C      1.00  0.241   1.06
##  4 D      1.00 -1.14    2.29
##  5 E      1.00 -0.486   1.24
##  6 F      1.00 -1.09    2.18
##  7 G      1.00 -0.0603  1.00
##  8 H      1.00  0.118   1.01
##  9 I      1.00 -0.779   1.61
## 10 J      1.00  1.09    2.18
## # ... with 16 more rows
class(x)
## [1] "tbl_df"     "tbl"        "data.frame"
The function data_frame() is an alias for tibble(). tibbles differ from standard data frames in: printing, subsetting, and recycling rules (it will only recycle a singleton).
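A sketch of the stricter recycling rule (toy columns, tibble assumed loaded): a singleton is recycled, anything else is an error.

```r
library(tibble)
tb <- tibble(x = 1:4, y = 1)    # fine: the singleton y is recycled to length 4
out <- try(tibble(x = 1:4, y = 1:2), silent = TRUE)  # length 2 is not recycled
inherits(out, "try-error")      # TRUE
```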
tibbles - coercion
as_tibble(SAheart)
## # A tibble: 462 x 10
##      sbp tobacco   ldl adiposity famhist typea obesity alcohol   age   chd
##  * <int>   <dbl> <dbl>     <dbl> <fct>   <int>   <dbl>   <dbl> <int> <int>
##  1   160 12.0     5.73      23.1 Present    49    25.3   97.2     52     1
##  2   144  0.0100  4.41      28.6 Absent     55    28.9    2.06    63     1
##  3   118  0.0800  3.48      32.3 Present    52    29.1    3.81    46     0
##  4   170  7.50    6.41      38.0 Present    51    32.0   24.3     58     1
##  5   134 13.6     3.50      27.8 Present    60    26.0   57.3     49     1
##  6   132  6.20    6.47      36.2 Present    62    30.8   14.1     45     0
##  7   142  4.05    3.38      16.2 Absent     59    20.8    2.62    38     0
##  8   114  4.08    4.59      14.6 Present    62    23.1    6.72    58     1
##  9   114  0       3.83      19.4 Present    49    24.9    2.49    29     0
## 10   132  0       5.80      31.0 Present    69    30.1    0       53     1
## # ... with 452 more rows

Note the data types:
◮ <chr> for character vectors or strings
◮ <dbl> for doubles or reals
◮ <int> for integers
◮ <fct> for factors
◮ <lgl> for logicals
◮ <date> for dates
◮ <dttm> for date-times
A tibble makes very few changes to the input. It never:

◮ changes types (e.g. never converts strings to factors),
◮ changes names of variables, or
◮ creates row names.
tibbles - names and subsetting
It is possible for a tibble to not have valid R variable names:
(x <- tibble(`1990` = 1:3, `2000` = 4:6,
             `%` = seq(15L, 25L, 5L),
             `;-}` = c("silly", "wink", "grin"),
             z = rnorm(3)))
## # A tibble: 3 x 5
##   `1990` `2000`   `%` `;-}`      z
##    <int>  <int> <int> <chr>  <dbl>
## 1      1      4    15 silly  0.492
## 2      2      5    20 wink  -0.824
## 3      3      6    25 grin   0.582
x$"1990"
## [1] 1 2 3
x[["%"]]
## [1] 15 20 25
x[, "2000"]
## # A tibble: 3 x 1
##   `2000`
##    <int>
## 1      4
## 2      5
## 3      6
Note that the last one, unlike with standard data.frames, preserves its shape. By purposely being less clever than data.frame, a tibble becomes handier to use in many instances.
tibbles - coercion to data.frame
Much code will work as well with a tibble as with a standard data.frame, but some will not. For the latter code, tibbles can be coerced to standard data frames.
(x <- tibble(`1990` = 1:3, `2000` = 4:6,
             `%` = seq(15L, 25L, 5L),
             `;-}` = c("silly", "wink", "grin"),
             z = rnorm(3)))
## # A tibble: 3 x 5
##   `1990` `2000`   `%` `;-}`      z
##    <int>  <int> <int> <chr>  <dbl>
## 1      1      4    15 silly  0.547
## 2      2      5    20 wink  -0.606
## 3      3      6    25 grin  -0.108
(y <- as.data.frame(x))
##   1990 2000  %   ;-}          z
## 1    1    4 15 silly  0.5474800
## 2    2    5 20 wink  -0.6056142
## 3    3    6 25 grin  -0.1084397
Note the different printing. Note also that while y[["%"]] will work as expected, y[[";-)"]] now returns NULL! Because of its illegal variable name, its contents are accessible only through a numerical index as y[,3] or y[[3]]. The corresponding character vector was also not coerced to be a factor.
is.factor(y[[3]]) ## [1] FALSE
tibble - transposed input via tribble()
Occasionally, especially for data entry, it is convenient to give the contents of a tibble by row. The transposed tibble function tribble() is used for this purpose:
tribble(
  # Comments are ignored but can be handy to record information
  # Variable names must begin with tilde as in a "formula"
  ~id,    ~name,              ~address,                      ~sbp,
  # stylistically, it might be easier to read with fixed widths
  # ------//---------------------//-----------------------------//---//
  "0001", "Joe Blow",         "200 University Avenue, West",  123,
  "0025", "Jack Spratt",      "123 Weber St.",                110,
  "0035", "Dinah Wontyoublo", "987 King Street North",        150
)
## # A tibble: 3 x 4
##   id    name             address                       sbp
##   <chr> <chr>            <chr>                       <dbl>
## 1 0001  Joe Blow         200 University Avenue, West   123
## 2 0025  Jack Spratt      123 Weber St.                 110
## 3 0035  Dinah Wontyoublo 987 King Street North         150
purrr - functional programming package
The data appear as the first argument to ease their use with pipes (in the same way as the functions of dplyr).

Corresponding to base Map():

# map a function to each element of a vector
map(.x, .f, ...)
map_if(.x, .p, .f, ...)
map_at(.x, .at, .f, ...)
map_lgl(.x, .f, ...)
map_chr(.x, .f, ...)
map_int(.x, .f, ...)
map_dbl(.x, .f, ...)
map_dfr(.x, .f, ..., .id = NULL)
map_dfc(.x, .f, ...)
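As a reminder of the base analogues, a small sketch using only base R: lapply() plays the role of map(), vapply() of the type-stable variants like map_dbl(), and mapply()/Map() of mapping over several vectors at once:

```r
x <- list(a = 1:3, b = 4:6)

lapply(x, mean)               # returns a list, like map(x, mean)
vapply(x, mean, numeric(1))   # named double vector, like map_dbl(x, mean)
mapply(`+`, 1:3, 4:6)         # element-wise over two vectors: 5 7 9
```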
purrr - functional programming package
Corresponding to base Reduce():

# reduce a list to a single value by
# iteratively applying a binary function.
reduce(.x, .f, ..., .init)
reduce_right(.x, .f, ..., .init)
reduce2(.x, .y, .f, ..., .init)
reduce2_right(.x, .y, .f, ..., .init)

# accumulate applies a function recursively over a list (from the left)
accumulate(.x, .f, ..., .init)
accumulate_right(.x, .f, ..., .init)

Note that the map functions declare their return type in the name of the function. This is also the case for many other purrr functions (of which only the first will be mentioned in what follows).
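A base-R sketch of the correspondence: Reduce() mirrors purrr's reduce(), and Reduce() with accumulate = TRUE mirrors accumulate():

```r
Reduce(`+`, 1:5)                      # 15, like reduce(1:5, `+`)
Reduce(`+`, 1:5, accumulate = TRUE)   # 1 3 6 10 15, like accumulate(1:5, `+`)

# right = TRUE folds from the right, like the *_right() variants
Reduce(`+`, 1:5, right = TRUE)
```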
purrr - functional programming package
Other functions like base R functional programming functions:

# keep and discard are opposites.
# compact is a handy wrapper that removes all elements that are NULL.
keep(.x, .p, ...)
discard(.x, .p, ...)
compact(.x, .p = identity)

# Find the value or position of the first match.
detect(.x, .f, ..., .right = FALSE, .p)
detect_index(.x, .f, ..., .right = FALSE, .p)

# Do every or some elements of a list satisfy a predicate?
every(.x, .p, ...)
some(.x, .p, ...)
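Base R has close counterparts to several of these; a minimal sketch: Filter() for keep()/discard(), Position() for detect_index(), and all()/any() with sapply() for every()/some():

```r
x <- list(1, NULL, 3, NULL, 5)

Filter(Negate(is.null), x)     # like compact(x): drops the NULL elements

# like detect_index(): position of first element > 2 (skipping NULLs)
Position(function(e) !is.null(e) && e > 2, x)   # 3

all(sapply(c(2, 4, 6), function(e) e %% 2 == 0))   # like every(): TRUE
any(sapply(c(1, 3, 4), function(e) e %% 2 == 0))   # like some():  TRUE
```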
purrr - functional programming package
Many other functions:

# Modify elements selectively
modify(.x, .f, ...)

# Produce all combinations of list elements
cross(.l, .filter = NULL)
cross2(.x, .y, .filter = NULL)
cross3(.x, .y, .z, .filter = NULL)
cross_df(.l, .filter = NULL)

# These functions remove a level hierarchy from a list.
flatten(.x)

# Apply a function to each element of a vector, and its index
imap(.x, .f, ...)

# Apply a function to list-elements of a list
lmap(.x, .f, ...)

# Coerce a list to a vector
as_vector(.x, .type = NULL)
simplify(.x, .type = NULL)
simplify_all(.x, .type = NULL)
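A few of these have rough base-R counterparts; a sketch (the correspondences are approximate): expand.grid() for cross_df(), unlist(recursive = FALSE) for flatten(), and Map() over values and indices for imap():

```r
# All combinations, like cross_df(): a 4-row data frame
expand.grid(x = 1:2, y = c("a", "b"))

# Remove one level of list hierarchy, like flatten()
unlist(list(1, list(2, 3)), recursive = FALSE)   # a length-3 list

# Map over elements and their indices, like imap()
v <- c(10, 20, 30)
Map(function(val, i) paste0(i, ":", val), v, seq_along(v))
```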
purrr - functional programming package
Useful functions whose first argument is a function (or list of functions):

# Negate a predicate function
negate(.p, .default = FALSE)

# Invoke functions.
invoke(.f, .x = NULL, ..., .env = NULL)
invoke_map(.f, .x = list(NULL), ..., .env = NULL)
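Base R's Negate() and do.call() play similar roles to negate() and invoke(); a minimal sketch:

```r
is_odd  <- function(n) n %% 2 == 1
is_even <- Negate(is_odd)    # like negate(is_odd)
is_even(4)                   # TRUE

# do.call() invokes a function on a list of arguments, like invoke()
do.call(paste, list("a", "b", sep = "-"))   # "a-b"
```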
readr - importing “rectangular” data
Often data are stored in a “rectangular” format of rows and columns, for example as a .csv file. The readr package of the tidyverse has a number of functions which facilitate reading such data into R and turning them into data frames (tibbles):
◮ read_csv() for comma separated files,
◮ read_csv2() for semicolon separated files,
◮ read_tsv() for tab separated files,
◮ read_delim() for an arbitrary delimiter separating entries,
◮ read_fwf() for fixed width files,
◮ read_table() for fixed width files with white space separating columns, and
◮ read_log() for Apache style log files (see also the webreadr package for reading log files).
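For comparison, base R's read.csv() covers the simplest case. A self-contained sketch using inline text (the data here are made up for illustration):

```r
# read.csv(text = ...) reads delimited data from a string instead of a file.
# colClasses keeps the zero-padded ids as character, as read_csv's
# col_types argument would.
txt <- "id,name,sbp
0001,Joe Blow,123
0025,Jack Spratt,110"

d <- read.csv(text = txt, colClasses = c(id = "character"))
d$id    # "0001" "0025" -- leading zeros preserved
d$sbp   # 123 110
```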
readr - importing “rectangular” data with read_csv()
The following data was downloaded March 28, 2017 from the Smithsonian’s “Global Volcanism Program” at http://volcano.si.edu/search_eruption.cfm
(Data source suggested from a poster by Kelly McConville: https://www.causeweb.org/cause/sites/default/files/ecots/ecots16/posters/Kelly_McConvilleAre_Volcanic_Eruptions_Increasing.pdf)
filename <- path_concat(dataDirectory, "Volcanoes/GVP_Eruption_Eruptions.csv")
(eruptions <- read_csv(filename))
## # A tibble: 11,108 x 24
##    `Volcano Number` `Volcano Name`    `Eruption Number` `Eruption Categor~
##               <int> <chr>                         <int> <chr>
##  1           282090 Kirishimayama                 22257 Confirmed Eruption
##  2           352090 Sangay                        22259 Confirmed Eruption
##  3           267020 Karangetang                   22256 Confirmed Eruption
##  4           283120 Kusatsu-Shiranes~             22258 Confirmed Eruption
##  5           343100 San Miguel                    22251 Confirmed Eruption
##  6           273030 Mayon                         22250 Confirmed Eruption
##  7           251002 Kadovar                       22246 Confirmed Eruption
##  8           272020 Kanlaon                       22249 Confirmed Eruption
##  9           264020 Agung                         22241 Confirmed Eruption
## 10           261230 Dempo                         22248 Confirmed Eruption
## # ... with 11,098 more rows, and 20 more variables: `Area of
## #   Activity` <chr>, VEI <int>, `VEI Modifier` <chr>, `Start Year
## #   Modifier` <chr>, `Start Year` <int>, `Start Year Uncertainty` <int>,
## #   `Start Month` <int>, `Start Day Modifier` <chr>, `Start Day` <int>,
## #   `Start Day Uncertainty` <int>, `Evidence Method (dating)` <chr>, `End
## #   Year Modifier` <chr>, `End Year` <int>, `End Year Uncertainty` <chr>,
## #   `End Month` <int>, `End Day Modifier` <chr>, `End Day` <int>, `End Day
## #   Uncertainty` <int>, Latitude <dbl>, Longitude <dbl>

which, as a tibble, already hints at what details this data might contain.

Question: What are the target and study populations here?
eruptions - Some exploratory analysis
Some sense of its contents is had by a few simple summaries, beginning with the variate names:
names(eruptions)
##  [1] "Volcano Number"           "Volcano Name"
##  [3] "Eruption Number"          "Eruption Category"
##  [5] "Area of Activity"         "VEI"
##  [7] "VEI Modifier"             "Start Year Modifier"
##  [9] "Start Year"               "Start Year Uncertainty"
## [11] "Start Month"              "Start Day Modifier"
## [13] "Start Day"                "Start Day Uncertainty"
## [15] "Evidence Method (dating)" "End Year Modifier"
## [17] "End Year"                 "End Year Uncertainty"
## [19] "End Month"                "End Day Modifier"
## [21] "End Day"                  "End Day Uncertainty"
## [23] "Latitude"                 "Longitude"
eruptions - Some exploratory analysis
And a summary of each variate:
eruptions %>% select_if(is.numeric) %>% summary
##  Volcano Number   Eruption Number      VEI         Start Year
##  Min.   :210010   Min.   :10001   Min.   :0.00   Min.   :-10450
##  1st Qu.:263310   1st Qu.:12796   1st Qu.:1.00   1st Qu.:   650
##  Median :290056   Median :15604   Median :2.00   Median :  1846
##  Mean   :300374   Mean   :15614   Mean   :1.95   Mean   :   617
##  3rd Qu.:343030   3rd Qu.:18397   3rd Qu.:2.00   3rd Qu.:  1949
##  Max.   :600000   Max.   :22260   Max.   :7.00   Max.   :  2018
##                                   NA's   :2911   NA's   :1
##  Start Year Uncertainty  Start Month       Start Day
##  Min.   :    1.0        Min.   : 0.000   Min.   : 0.000
##  1st Qu.:   50.0        1st Qu.: 0.000   1st Qu.: 0.000
##  Median :  100.0        Median : 1.000   Median : 0.000
##  Mean   :  292.7        Mean   : 3.436   Mean   : 6.966
##  3rd Qu.:  200.0        3rd Qu.: 7.000   3rd Qu.:15.000
##  Max.   :14000.0        Max.   :12.000   Max.   :31.000
##  NA's   :9024           NA's   :184      NA's   :187
##  Start Day Uncertainty    End Year      End Month         End Day
##  Min.   :  1           Min.   :-475   Min.   : 0.000   Min.   : 0.00
##  1st Qu.: 15           1st Qu.:1894   1st Qu.: 3.000   1st Qu.: 3.00
##  Median : 15           Median :1956   Median : 6.000   Median :15.00
##  Mean   : 64           Mean   :1916   Mean   : 6.194   Mean   :13.25
##  3rd Qu.: 45           3rd Qu.:1991   3rd Qu.: 9.000   3rd Qu.:21.00
##  Max.   :730           Max.   :2018   Max.   :12.000   Max.   :31.00
##  NA's   :10247         NA's   :6846   NA's   :6849     NA's   :6852
##  End Day Uncertainty    Latitude         Longitude
##  Min.   :  1.00      Min.   :-77.530   Min.   :-179.97
##  1st Qu.:  5.00      1st Qu.: -6.102   1st Qu.: -77.66
##  Median : 15.00      Median : 18.130   Median :  55.71
##  Mean   : 23.88      Mean   : 16.848   Mean   :  31.48
##  3rd Qu.: 15.00      3rd Qu.: 40.821   3rd Qu.: 139.39
##  Max.   :365.00      Max.   : 85.608   Max.   : 179.58
##  NA's   :10416
eruptions - Some exploratory analysis
And a summary of each variate:
eruptions %>% select_if(negate(is.numeric)) %>% summary
##  Volcano Name       Eruption Category  Area of Activity
##  Length:11108       Length:11108       Length:11108
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##  VEI Modifier       Start Year Modifier Start Day Modifier
##  Length:11108       Length:11108        Length:11108
##  Class :character   Class :character    Class :character
##  Mode  :character   Mode  :character    Mode  :character
##  Evidence Method (dating) End Year Modifier  End Year Uncertainty
##  Length:11108             Length:11108       Length:11108
##  Class :character         Class :character   Class :character
##  Mode  :character         Mode  :character   Mode  :character
##  End Day Modifier
##  Length:11108
##  Class :character
##  Mode  :character
eruptions - Some exploratory analysis
In RStudio, all contents can be seen using the View() function. This gives an interactive spreadsheet view of the data which, for example, allows sorting, searching, and filtering.
View(eruptions)
eruptions - Some exploratory analysis
We might see how many eruptions have been recorded for each volcano:
eruptions %>%
  group_by(`Volcano Number`) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = `Volcano Number`, y = count)) +
  geom_point(col = "steelblue", alpha = 0.5) +
  ggtitle("Number of Eruptions") -> gp
gp
[Scatterplot "Number of Eruptions": count versus Volcano Number]

Some volcanoes have tens and even hundreds of recorded observations. There also seems to be a large gap in volcano numbers. Beyond this gap there are very few (possibly only 1?) volcano numbers. If there is only one, it is recording about 75 eruptions.
eruptions - Some exploratory analysis
On the curiously large gap in volcano numbers, with no numbers beginning with either 4 or 5.
From the website: "GVP policy had been to embed significant geographical, historical, and age information in the numbers. As a result GVP often changed VNums, most frequently to accommodate newly recognized volcanoes in a particular geographical region, which over time undermined the goal of preventing ambiguity. . . . None of the new numbers start with 0 or 1 to avoid confusion with the legacy system." The website also says that the digits encode geographic information.
eruptions - Some exploratory analysis
We could focus on the largest values of Volcano Number
eruptions %>% filter(`Volcano Number` > 5e+05)
## # A tibble: 77 x 24
##    `Volcano Number` `Volcano Name` `Eruption Number` `Eruption Category`
##               <int> <chr>                      <int> <chr>
##  1           600000 Unknown Source             13307 Confirmed Eruption
##  2           600000 Unknown Source             13306 Confirmed Eruption
##  3           600000 Unknown Source             13305 Confirmed Eruption
##  4           600000 Unknown Source             13304 Confirmed Eruption
##  5           600000 Unknown Source             13303 Confirmed Eruption
##  6           600000 Unknown Source             13302 Confirmed Eruption
##  7           600000 Unknown Source             13301 Confirmed Eruption
##  8           600000 Unknown Source             13299 Confirmed Eruption
##  9           600000 Unknown Source             13298 Confirmed Eruption
## 10           600000 Unknown Source             13297 Confirmed Eruption
## # ... with 67 more rows, and 20 more variables: `Area of Activity` <chr>,
## #   VEI <int>, `VEI Modifier` <chr>, `Start Year Modifier` <chr>, `Start
## #   Year` <int>, `Start Year Uncertainty` <int>, `Start Month` <int>,
## #   `Start Day Modifier` <chr>, `Start Day` <int>, `Start Day
## #   Uncertainty` <int>, `Evidence Method (dating)` <chr>, `End Year
## #   Modifier` <chr>, `End Year` <int>, `End Year Uncertainty` <chr>, `End
## #   Month` <int>, `End Day Modifier` <chr>, `End Day` <int>, `End Day
## #   Uncertainty` <int>, Latitude <dbl>, Longitude <dbl>
These seem to be eruptions where the volcano has not been identified. To check whether these are all the same Volcano Number:
eruptions %>%
  filter(`Volcano Number` > 5e+05) %>%
  select(`Volcano Number`) %>%
  range()
## [1] 600000 600000
eruptions - Some exploratory analysis
How does the number of eruptions behave over time?
eruptions %>%
  ggplot(aes(x = `Start Year`)) +
  geom_histogram(fill = "steelblue", col = "white")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
[Histogram: count versus Start Year]

Looks like the number of eruptions over time is increasing. Likely this is more because of increasing recordings than a more volcanic world.
eruptions - Some exploratory analysis
How does the number of eruptions behave over time? What if we separate the counts by Eruption Category?
eruptions %>%
  ggplot(aes(x = `Start Year`)) +
  geom_histogram(fill = "steelblue", col = "white") +
  facet_wrap(~`Eruption Category`)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
[Histograms of count versus Start Year, faceted by Eruption Category: Confirmed, Discredited, and Uncertain Eruption]
Still looks like the number of eruptions over time is increasing, especially for confirmed eruptions.
eruptions - Some exploratory analysis
Look only at the confirmed eruptions.
eruptions %>%
  filter(`Eruption Category` == "Confirmed Eruption") -> confirmed_eruptions
The variate VEI for each eruption stands for the eruption's "Volcanic Explosivity Index" (see Wikipedia on VEI). This index is a measure of how explosive an eruption is and is correlated with the volume of material ejected. Roughly, this is a logarithmic scale (base 10), with each value above 2 representing a tenfold increase in the volume of material ejected. Below 2, however, this is not the case. A VEI of 0 is non-explosive (< 10^4 m^3 of material ejected), a VEI of 1 has between 10^4 m^3 and 10^6 m^3 of material ejected, and a VEI of 2 between 10^6 m^3 and 10^7 m^3.
confirmed_eruptions %>%
  ggplot(aes(x = VEI)) +
  geom_bar(fill = "steelblue") +
  facet_wrap(~`VEI Modifier`)
## Warning: Removed 2277 rows containing non-finite values (stat_count).
[Bar charts of count versus VEI, faceted by VEI Modifier: C, P, NA, ?, ^, +]
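The rough scale described above can be written down directly. A sketch with a hypothetical helper (vei_lower_bound() is not from any package; the bounds follow the description of the index given above):

```r
# Lower bound on ejecta volume (cubic metres) implied by a VEI value:
#   VEI 0: non-explosive, < 10^4 m^3 (no lower bound stated, use 0)
#   VEI 1: 10^4 m^3
#   VEI >= 2: 10^(VEI + 4) m^3, so each step up is a tenfold increase
vei_lower_bound <- function(vei) {
  ifelse(vei >= 2, 10^(vei + 4), ifelse(vei == 1, 1e4, 0))
}

vei_lower_bound(2)                       # 1e+06
vei_lower_bound(3) / vei_lower_bound(2)  # 10 -- one VEI step = 10x volume
```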
eruptions - Some exploratory analysis
VEI over time (Start Year)
confirmed_eruptions %>%
  ggplot(aes(x = `Start Year`, y = VEI)) +
  geom_point(col = "steelblue", alpha = 0.5) +
  facet_wrap(~`VEI Modifier`)
## Warning: Removed 2277 rows containing missing values (geom_point).
[Scatterplots of VEI versus Start Year, faceted by VEI Modifier: C, P, NA, ?, ^, +]
It's not clear from the data what each of these (non-NA) modifiers means: ? likely indicates uncertainty; for + and ^, either might indicate the value is a lower bound; and P and C remain a mystery (perhaps a type of material?). No information was found on the source website.
eruptions - Some exploratory analysis
Average VEI over time (Start Year)
confirmed_eruptions %>%
  group_by(`Start Year`) %>%
  summarise(VEIAve = mean(VEI, na.rm = TRUE)) %>%
  ggplot(aes(x = `Start Year`, y = VEIAve)) +
  geom_point(col = "steelblue", alpha = 0.5)
## Warning: Removed 404 rows containing missing values (geom_point).
[Scatterplot of VEIAve versus Start Year]
Question: What might this plot suggest about the study population?
eruptions - Some exploratory analysis
Average VEI over time (Era of 50 years)
confirmed_eruptions %>%
  mutate(Ranges = cut_interval(`Start Year`, length = 50),
         Era = as.numeric(Ranges)) %>%
  group_by(Era) %>%
  summarise(VEIAve = mean(VEI, na.rm = TRUE)) %>%
  ggplot(aes(x = Era, y = VEIAve)) +
  geom_point(col = "steelblue", alpha = 0.5) +
  geom_smooth()
[Scatterplot with smooth: VEIAve versus Era]
Question: What might this plot suggest about study error in this case? How might you reduce it?
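The era binning used above can be mimicked in base R with cut(); a small sketch with assumed breakpoints (cut_interval(length = 50) chooses its own 50-year breaks):

```r
# Bin years into 50-year intervals and recover an integer "era" index,
# the base-R analogue of cut_interval() followed by as.numeric().
years <- c(1901, 1949, 1951, 1999)
era   <- cut(years, breaks = seq(1900, 2000, by = 50))
as.numeric(era)   # 1 1 2 2 -- era index within (1900,1950], (1950,2000]
```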
eruptions - Some exploratory analysis
Each eruption has information on when it started and when it ended (year, month, day, modifier, uncertainty). This should give us some information on each eruption’s duration and hence also of its importance. To determine durations, there are a number of functions in base R but here we will introduce some from the lubridate package.
library(lubridate)
confirmed_eruptions %>%
  mutate(start = ymd(paste(`Start Year`, `Start Month`, `Start Day`, sep = "-")),
         end = ymd(paste(`End Year`, `End Month`, `End Day`, sep = "-")),
         duration = end - start + 1) %>%
  ggplot(aes(VEI, duration)) +
  geom_point(col = "steelblue", alpha = 0.5)
eruptions - Some exploratory analysis
## Warning: 5416 failed to parse.
## Warning: 6594 failed to parse.
## Warning: Removed 6731 rows containing missing values (geom_point).
[Scatterplot of duration (days) versus VEI]
Note warnings – perhaps dates are not as clean as we might like.
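The same parsing behaviour can be seen in base R; a sketch in which as.Date() stands in for lubridate's ymd(): an impossible date fails to parse and yields NA, and durations come from subtracting dates:

```r
# Paste year, month, day into "Y-M-D" and parse; a valid date parses,
# an impossible one (here February 30) gives NA.
good <- as.Date(paste(2010, 3, 15, sep = "-"), format = "%Y-%m-%d")
bad  <- as.Date(paste(2010, 2, 30, sep = "-"), format = "%Y-%m-%d")
bad   # NA -- the base-R analogue of lubridate's "failed to parse"

# Duration in days, counting both endpoints as on the slides:
as.numeric(good - as.Date("2010-03-01")) + 1   # 15
```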
eruptions - Some exploratory analysis
Try just duration in years
confirmed_eruptions %>%
  mutate(start = `Start Year`,
         end = `End Year`,
         duration = end - start + 1) %>%
  ggplot(aes(VEI, duration)) +
  geom_point(col = "steelblue", alpha = 0.5)
## Warning: Removed 5968 rows containing missing values (geom_point).
[Scatterplot of duration (years) versus VEI]
Wow! Some durations are in the hundreds of years.
eruptions - Some exploratory analysis
Which volcanoes have been erupting for hundreds of years?
confirmed_eruptions %>%
  mutate(start = `Start Year`,
         end = `End Year`,
         duration = end - start + 1) %>%
  filter(duration > 100) %>%
  select(`Volcano Number`, `Volcano Name`, VEI, start, end, duration,
         Longitude, Latitude) -> longEruptions
longEruptions
## # A tibble: 4 x 8
##   `Volcano Number` `Volcano Name`   VEI start   end duration Longitude
##              <int> <chr>          <int> <int> <int>    <dbl>     <dbl>
## 1           257100 Yasur              3  1774  2018      245     169
## 2           352090 Sangay             3  1728  1916      189     -78.3
## 3           211040 Stromboli          3  1558  1857      300      15.2
## 4           384010 Fogo               1  1500  1761      262     -24.4
## # ... with 1 more variable: Latitude <dbl>
eruptions - Some exploratory analysis
library(lubridate)
confirmed_eruptions %>%
  select(`Start Year`, `Start Month`, `Start Day`,
         `End Year`, `End Month`, `End Day`) %>%
  summary()
##    Start Year        Start Month       Start Day         End Year
##  Min.   :-10450.0   Min.   : 0.000   Min.   : 0.000   Min.   :-475
##  1st Qu.:   180.0   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:1894
##  Median :  1833.0   Median : 1.000   Median : 0.000   Median :1954
##  Mean   :   472.5   Mean   : 3.405   Mean   : 7.007   Mean   :1915
##  3rd Qu.:  1947.0   3rd Qu.: 7.000   3rd Qu.:15.000   3rd Qu.:1990
##  Max.   :  2018.0   Max.   :12.000   Max.   :31.000   Max.   :2018
##  NA's   :1          NA's   :182      NA's   :183      NA's   :5897
##    End Month         End Day
##  Min.   : 0.000   Min.   : 0.00
##  1st Qu.: 3.000   1st Qu.: 4.00
##  Median : 6.000   Median :15.00
##  Mean   : 6.202   Mean   :13.33
##  3rd Qu.: 9.000   3rd Qu.:21.00
##  Max.   :12.000   Max.   :31.00
##  NA's   :5899     NA's   :5900

Lots of missing values. We would have to impute some values to continue.
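The per-column NA counts reported by summary() can also be computed directly; a minimal base-R sketch with a toy data frame:

```r
# colSums(is.na(...)) counts missing values in each column,
# the same information as the NA's rows of summary().
d <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
colSums(is.na(d))   # a: 1, b: 2
```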
eruptions - Some exploratory analysis
Given the difficulties with older volcanic information in this data set, we might focus on the most recent and higher quality dates. The Global Volcanism Program points out that satellite coverage able to monitor volcanic gas emissions began in 1978. So we could focus on this range of dates.
library(lubridate)
confirmed_eruptions %>%
  filter(`Start Year` >= 1978) %>%
  mutate(start = ymd(paste(`Start Year`, `Start Month`, `Start Day`, sep = "-")),
         end = ymd(paste(`End Year`, `End Month`, `End Day`, sep = "-")),
         duration = end - start + 1) -> duration_eruptions
## Warning: 100 failed to parse.
eruptions - Some exploratory analysis
library(lubridate)
duration_eruptions %>%
  ggplot(aes(VEI, duration)) +
  geom_point(col = "steelblue", alpha = 0.5)
## Warning: Removed 136 rows containing missing values (geom_point).
[Scatterplot of duration versus VEI, eruptions starting in 1978 or later]
eruptions - Some exploratory analysis
How about the average VEI per year since 1978?
confirmed_eruptions %>%
  filter(`Start Year` >= 1978) %>%
  group_by(`Start Year`) %>%
  summarise(VEIAve = mean(VEI, na.rm = TRUE)) %>%
  ggplot(aes(x = `Start Year`, y = VEIAve)) +
  ylim(0, 4) +
  geom_point(col = "steelblue", alpha = 0.5) +
  geom_smooth(method = "loess")
[Scatterplot with loess smooth: VEIAve versus Start Year, 1978 onward]
Comments? Note that by using Start Year we are only looking at new eruptions in this time period.
eruptions - Some exploratory analysis
Number of new eruptions per year since 1978:
confirmed_eruptions %>%
  filter(`Start Year` >= 1978) %>%
  group_by(`Start Year`) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = `Start Year`, y = count)) +
  ylim(0, 60) +
  geom_point(col = "steelblue", alpha = 0.5) +
  geom_smooth(method = "loess") +
  ggtitle("Number of new eruptions", subtitle = "World wide")
[Scatterplot with loess smooth, "Number of new eruptions, World wide": count versus Start Year]

Comments?
eruptions - Some exploratory analysis
Number of new eruptions per year since 1978:
Comments? The rightmost point is based on only a few months. We should also have filtered Start Year < 2018.
eruptions - Some exploratory analysis
Alternatively: Number of new eruptions per year since 1950 and before 2018:
confirmed_eruptions %>%
  filter(`Start Year` >= 1950 & `Start Year` < 2018) %>%
  group_by(`Start Year`) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = `Start Year`, y = count)) +
  ylim(0, 60) +
  geom_point(col = "steelblue", alpha = 0.5) +
  geom_smooth(method = "loess") +
  ggtitle("Number of new eruptions", subtitle = "World wide")
[Scatterplot with loess smooth, "Number of new eruptions, World wide": count versus Start Year, 1950-2017]

Looks like about 30-40 new eruptions per year.
eruptions - Some exploratory analysis
There are lots of other plots we might look at: duration of each eruption
confirmed_eruptions %>%
  filter(`End Year` >= 1978) %>%
  ggplot(aes(x = `Start Year`, y = `Eruption Number`)) +
  xlim(1970, 2020) +
  aes(xend = `End Year`, yend = `Eruption Number`) +
  geom_segment(col = "steelblue", alpha = 0.5) +
  geom_vline(xintercept = c(1978, 2018), col = "grey", alpha = 0.5) +
  ggtitle("Duration of eruptions", subtitle = "Ending in 1978 or later")
[Segment plot "Duration of eruptions, Ending in 1978 or later": Eruption Number versus Start Year through End Year]
eruptions - Some exploratory analysis
There are lots of other plots we might look at: VEI and duration of each eruption
confirmed_eruptions %>%
  filter(`End Year` >= 1978) %>%
  mutate(jitter_VEI = jitter(VEI, factor = 2)) %>%
  ggplot(aes(x = `Start Year`, y = `jitter_VEI`)) +
  xlim(1970, 2020) +
  aes(xend = `End Year`, yend = `jitter_VEI`) +
  geom_segment(col = "steelblue", alpha = 0.5) +
  geom_vline(xintercept = c(1978, 2018), col = "grey", alpha = 0.5) +
  ylab("VEI (jittered)") +
  ggtitle("Duration of eruptions", subtitle = "Ending in 1978 or later")
[Segment plot "Duration of eruptions, Ending in 1978 or later": VEI (jittered) versus Start Year through End Year]
eruptions - Some exploratory analysis
Interactive loon maps of all known locations:
# load the maps library before tidyverse, then access the map function
# from the maps package
world <- maps::map(fill = TRUE, col = "cornsilk", plot = FALSE)
library(loon)
confirmed_eruptions %>%
  group_by(`Volcano Number`) %>%
  summarize(longitude = first(Longitude),
            latitude = first(Latitude),
            count = n(),
            name = first(`Volcano Name`)) %>%
  select(longitude, latitude, count, name) %>%
  l_plot(linkingGroup = "Eruptions", glyph = "ocircle", color = "red",
         size = .$count, itemLabel = .$name, showItemLabels = TRUE,
         title = "Volcanos with confirmed eruptions",
         showGuides = TRUE) %T>%
  l_layer(world, color = "cornsilk", asSingleLayer = TRUE,
          label = "world map", index = "last") -> p
eruptions - Some exploratory analysis
[Interactive loon map of volcano locations on the world map]
readr - importing “rectangular” data with read_csv()
The same Smithsonian site on “Global Volcanism Program” at http://volcano.si.edu/search_eruption.cfm contains other .csv files on volcanic eruptions (downloaded the same time).
# The events
filename <- path_concat(dataDirectory, "Volcanoes/GVP_Eruption_Events.csv")
(events <- read_csv(filename))
## # A tibble: 40,980 x 14
##    `Volcano Number` `Volcano Name`     `Eruption Number` `Eruption Start ~
##               <int> <chr>                          <int>             <int>
##  1           210020 Chaine des Puys                10011             -4040
##  2           210020 Chaine des Puys                10011             -4040
##  3           210020 Chaine des Puys                10011             -4040
##  4           210020 Chaine des Puys                10011             -4040
##  5           210020 Chaine des Puys                10011             -4040
##  6           210020 Chaine des Puys                10011             -4040
##  7           210040 Calatrava Volcani~             10012             -3600
##  8           210040 Calatrava Volcani~             10012             -3600
##  9           211001 Larderello                     10013              1282
## 10           211001 Larderello                     10013              1282
## # ... with 40,970 more rows, and 10 more variables: `Event Number` <int>,
## #   `Event Type` <chr>, `Event Remarks` <chr>, `Event Date Year
## #   Modifier` <chr>, `Event Date Year` <int>, `Event Date Year
## #   Uncertainty` <chr>, `Event Date Month` <int>, `Event Date Day
## #   Modifier` <chr>, `Event Date Day` <int>, `Event Date Day
## #   Uncertainty` <int>
readr - importing “rectangular” data with read_csv()
The same Smithsonian site on “Global Volcanism Program” at http://volcano.si.edu/search_eruption.cfm contains other .csv files on volcanic eruptions (downloaded the same time).
# The list of volcanoes
filename <- path_concat(dataDirectory, "Volcanoes/GVP_Volcano_List.csv")
(volcano <- read_csv(filename))
## # A tibble: 963 x 26
##   `Volcano Number` `Volcano Name`  `Primary Volcano T~ `Last Eruption Ye~
##              <int> <chr>           <chr>               <chr>
## 1           283001 Abu             Shield(s)           -6850
## 2           355096 Acamarachi      Stratovolcano       Unknown
## 3           342080 Acatenango      Stratovolcano(es)   1972
## 4           213004 Acigol-Nevsehir Caldera             -2080