Data analysis pipelines Reading and tidying tables R.W. Oldford - - PowerPoint PPT Presentation
Data analysis pipelines Reading and tidying tables R.W. Oldford - - PowerPoint PPT Presentation
Data analysis pipelines Reading and tidying tables R.W. Oldford readr - importing rectangular data Often data are stored in a rectangular format of rows and columns, for example as a .csv file. The readr package of the tidyverse has a
readr - importing “rectangular” data
Often data are stored in a “rectangular” format of rows and columns, for example as a .csv file. The readr package of the tidyverse has a number of functions which facilitate reading such data into R and turning them into data frames (tibbles):
◮ read_csv() for comma separated files, ◮ read_csv2() for semicolon separated files, ◮ read_tsv() for tab separated files, ◮ read_delim() for arbitrary delimiter separating entries, ◮ read_fwf() for fixed width files, ◮ read_table() for fixed width files with white space separating columns,
and
◮ read_log() for Apache style log files (see also the ‘webreadr‘ package for
reading log files). Why use these?
readr - importing “rectangular” data
Why read_csv() instead of say read.csv()?
◮ has a lot in common with read.csv()
◮ flexible about metadata, comments, NA spec, . . .
◮ clever
◮ about numbers (e.g. 3.14159 versus 3,14159) ◮ tries to infer data types (e.g. from first 1,000 lines) ◮ identifies problems encountered
◮ much much faster ◮ always returns a tibble ◮ usually more portable code
readr - a little cleverer
Suppose we have the following files test1.csv 2,4,6 3.1, 14, 59 test2.csv a,b 1,2,3 4,5,6 test3.csv x,y, 10, 9, 8 3.1,14,59 each being in some dataDirectory Note: each of the above files ends with a new line.
test1.csv - read.csv versus read_csv
test1.csv 2,4,6 3.1, 14, 59 # a data.frame (x <- read.csv(path_concat(dataDirectory, "test1.csv"))) ## X2 X4 X6 ## 1 3.1 14 59 (x <- read.csv(path_concat(dataDirectory, "test1.csv"), header = FALSE)) ## V1 V2 V3 ## 1 2.0 4 6 ## 2 3.1 14 59
test1.csv - read.csv versus read_csv
test1.csv 2,4,6 3.1, 14, 59 # a tibble (y <- read_csv(path_concat(dataDirectory, "test1.csv"))) ## # A tibble: 1 x 3 ## `2` `4` `6` ## <dbl> <dbl> <dbl> ## 1 3.1 14 59 (y <- read_csv(path_concat(dataDirectory, "test1.csv"), col_names = FALSE)) # can also supply names ## # A tibble: 2 x 3 ## X1 X2 X3 ## <dbl> <dbl> <dbl> ## 1 2 4 6 ## 2 3.1 14 59
test2.csv - first using read.csv
test2.csv a,b 1,2,3 4,5,6
# a data.frame (x <- read.csv(path_concat(dataDirectory, "test2.csv"))) ## a b ## 1 2 3 ## 4 5 6 row.names(x) ## [1] "1" "4" (x <- read.csv(path_concat(dataDirectory, "test2.csv"), header = FALSE)) ## V1 V2 V3 ## 1 a b NA ## 2 1 2 3 ## 3 4 5 6 class(x$V1) ; class(x$V3) ## [1] "factor" ## [1] "integer"
test2.csv - now using read_csv
test2.csv a,b 1,2,3 4,5,6
# a tibble (y <- read_csv(path_concat(dataDirectory, "test2.csv"))) ## # A tibble: 2 x 2 ## a b ## <dbl> <dbl> ## 1 1 2 ## 2 4 5 problems(y) ## # A tibble: 2 x 5 ## row col expected actual file ## <int> <chr> <chr> <chr> <chr> ## 1 1 <NA> 2 columns 3 columns './data/test2.csv' ## 2 2 <NA> 2 columns 3 columns './data/test2.csv'
test2.csv - again with read_csv
test2.csv a,b 1,2,3 4,5,6
(y <- read_csv(path_concat(dataDirectory, "test2.csv"), col_names = FALSE)) ## # A tibble: 3 x 2 ## X1 X2 ## <chr> <chr> ## 1 a b ## 2 1 2 ## 3 4 5 problems(y) ## # A tibble: 2 x 5 ## row col expected actual file ## <int> <chr> <chr> <chr> <chr> ## 1 2 <NA> 2 columns 3 columns './data/test2.csv' ## 2 3 <NA> 2 columns 3 columns './data/test2.csv'
test3.csv - read.csv versus read_csv
test3.csv x,y, 10, 9, 8 3.1,14,59
(x <- read.csv(path_concat(dataDirectory, "test3.csv"))) ## x y X ## 1 10.0 9 8 ## 2 3.1 14 59 class(x$x) ## [1] "numeric" (x <- read.csv(path_concat(dataDirectory, "test3.csv"), header = FALSE)) ## V1 V2 V3 ## 1 x y NA ## 2 10 9 8 ## 3 3.1 14 59 class(x$V3) ## [1] "integer"
test3.csv - read.csv versus read_csv
test3.csv x,y, 10, 9, 8 3.1,14,59
(y <- read_csv(path_concat(dataDirectory, "test3.csv"))) ## # A tibble: 2 x 3 ## x y X3 ## <dbl> <dbl> <dbl> ## 1 10 9 8 ## 2 3.1 14 59 spec(y) ; is.null(problems(y)) ## cols( ## x = col_double(), ## y = col_double(), ## X3 = col_double() ## ) ## [1] FALSE
test3.csv - read.csv versus read_csv
test3.csv x,y, 10, 9, 8 3.1,14,59
(y <- read_csv(path_concat(dataDirectory, "test3.csv"), col_names = FALSE)) ## # A tibble: 3 x 3 ## X1 X2 X3 ## <chr> <chr> <dbl> ## 1 x y NA ## 2 10 9 8 ## 3 3.1 14 59 problems(y) ## [1] row col expected actual ## <0 rows> (or 0-length row.names)
readr - parsing files
All depends on how readr package parses a vector. From vignette("readr"): "The key problem that readr solves is parsing a flat file into a tibble. Parsing is the process of taking a text file and turning it into a rectangular tibble where each column is the appropriate part. Parsing takes place in three basic stages:
- 1. The flat file is parsed into a rectangular matrix of strings.
- 2. The type of each column is determined.
- 3. Each column of strings is parsed into a vector of a more specific type.
It’s easiest to learn how this works in the opposite order Below, you’ll learn how the:
- 1. Vector parsers turn a character vector in to a more specific type.
- 2. Column specification describes the type of each column and the strategy
readr uses to guess types so you don’t need to supply them all.
- 3. Rectangular parsers turn a flat file into a matrix of rows and columns.
Each parse_*() is coupled with a col_*() function, which will be used in the process of parsing a complete tibble." You can override defaults.
Tidy data - tidyr
So everything is a “table” of some sort, a rectangular tibble. tidyr allows messy data to be reshaped into “tidier” data, that is easier for analysis. For example, the tidyr package has a small table, table1, containing the number of cases of tuberculosis (TB) in Afghanistan, Brazil, and China in the years 1999 and 2000. It also records the population of those countries in each year. The data are
table1 ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
Think of how we might have actually collected this data. I imagine visiting each country (its records at least) for each year and recording the number of cases of TB and its population.
Tidy data - tidyr
Here is the same data but stored in a different tabular form.
table2 ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583
How do you imagine this data was collected? Does each
◮ row feel like an observation measured on several variates? ◮ column feel like a variate collected on several individuals? ◮ cell feel like a measurement on a single variate for a single case?
Rows, columns, and cells ideally all “feel right” in that one can imagine collecting the data in this way. table2 does not feel as natural as table1 in the context of this problem.
Tidy data - tidyr
Natural questions for this data:
◮ How does the rate of TB per 100,000 people compare between countries?. ◮ How does the rate compare from one year to the next for each country?
To compute the rate from table1
table1 %>% mutate(rate = (cases / population) * 100000) ## # A tibble: 6 x 5 ## country year cases population rate ## <chr> <int> <int> <int> <dbl> ## 1 Afghanistan 1999 745 19987071 3.73 ## 2 Afghanistan 2000 2666 20595360 12.9 ## 3 Brazil 1999 37737 172006362 21.9 ## 4 Brazil 2000 80488 174504898 46.1 ## 5 China 1999 212258 1272915272 16.7 ## 6 China 2000 213766 1280428583 16.7
Tidy data - tidyr
To compare the rates using table1
table1 %>% mutate(rate = (cases / population) * 100000) %>% ggplot(aes(x=year, y=rate)) + ggtitle("Tuberculosis rates per 100,000") + geom_line(aes(group = country, colour = country), lwd = 1.5) + geom_point(aes(colour = country), size = 5) + scale_x_continuous(breaks=c(1999, 2000))
10 20 30 40 1999 2000
year rate country
Afghanistan Brazil China
Tuberculosis rates per 100,000
How about beginning with table2?
Tidy data - tidyr
To compute the rate from table2. First get the cases
table2 %>% filter(type == "cases") ## # A tibble: 6 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 2000 cases 2666 ## 3 Brazil 1999 cases 37737 ## 4 Brazil 2000 cases 80488 ## 5 China 1999 cases 212258 ## 6 China 2000 cases 213766
OK . . . don’t need type anymore and should rename count
table2 %>% filter(type == "cases") %>% select(-type) %>% rename(cases = count) ## # A tibble: 6 x 3 ## country year cases ## <chr> <int> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766
Looks good!
Tidy data - tidyr
Do the same for the population!
table2 %>% filter(type == "population") %>% select(-type) %>% rename(population = count) ## # A tibble: 6 x 3 ## country year population ## <chr> <int> <int> ## 1 Afghanistan 1999 19987071 ## 2 Afghanistan 2000 20595360 ## 3 Brazil 1999 172006362 ## 4 Brazil 2000 174504898 ## 5 China 1999 1272915272 ## 6 China 2000 1280428583
Now we can get a rate from these two tables! As in building a new tibble.
◮ need to save each table ◮ probably should make sure both tables have their rows in the same order.
Tidy data - tidyr
Arrange so that the order is the same and name the tables.
table2 %>% filter(type == "cases") %>% select(-type) %>% rename(cases = count) %>% arrange(country, year) -> table2_cases table2 %>% filter(type == "population") %>% select(-type) %>% rename(population = count) %>% arrange(country, year) -> table2_pop
And a new data set for the rates constructed using both tables.
tibble(country = table2_cases$country, year = table2_cases$year, cases = table2_cases$cases, population = table2_pop$population) %>% mutate(rate = (cases / population) * 100000) -> table2_rates
Now proceed as we did with table.
Tidy data - tidyr
Imagine tuberculosis cases come from one source (or compiled from many separate sources) over time and the populations from another. We might then have two tables, one for the cases
table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
where columns are added over time as the data comes in and another of the population over the same years:
table4b ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583
Question: Is either table “tidy”? Why? Why not?
Tidy data - what is “tidy”?
So, what is “tidy”?
◮ Both table4a and table4b appear to be realistic. i.e. you can imagine the
data being collected like that
◮ But, so too with this table (call it cases)
## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766
and this one (call it populations)
## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 19987071 172006362 1272915272 ## 2 2000 20595360 174504898 1280428583
Are these “tidy”? Either could be long and thin, or short and wide, or whatever. Questions: How would you create these two tables from table2? From table4a and table4b? (More on this later.)
Tidy data - what is “tidy”?
So, “tidy”? Or not “tidy”?
◮ Tidy: the tables of either representation (table4a and table4b or cases
and populations) are arguably “tidy” in their own right.
◮ e.g. imagine a time series for each country so each table records country
values of a particular quantity over time
◮ What if there were, say, 100 years of data on three countries? Or, two years
- f data on 190 countries?
Would your opinion change?
◮ Not tidy: try doing the previous analysis to get a plot of TB rates using
these tables. So, what’s tidy?
◮ imagine what are the “natural” variates for your problem?
These are your columns.
◮ imagine what are the “natural” units of study (cases) for your problem?
These are your rows.
Tidying data
Why data might not be “tidy” (for your problem):
◮ data are not organized for analysis, but for something else e.g.
◮ for transaction (selecting, updating cells) ◮ for ease of data entry (e.g. simply just add another year by row or column, or
another country)
◮ data repository is not constructed by data analysts, but by others having no
appreciation of analysis So, even though the data is provably consistent, complete, coherent, and error free, data analysts will still often have to “tidy” data to get it into a shape for analysis. The first thing to do is to understand (for your problem) what are the variates (i.e. what is being measured) and what are the study units or observations or cases (viz. on what are the measurements being taken)? Two common problems often remain
◮ one or more variates are spread across multiple columns ◮ one or more observations are scattered across multiple rows
In the tidyr package, gather() and spread()are designed to help address these.
Note These functions can now been replaced by more general functions pivot_longer() and pivot_wider().
Tidying data - gathering
Recall table4a contains the number of TB cases in three countries for two years
table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
From our previous analyses, the variates were country, year, and cases.
◮ Only country appears as a column in table4a. ◮ year shows up as the names of the remaining columns
◮ need to gather the values naming the two columns (1999 nd 2000) into a
single variate (column)
◮ cases appear spread across rows and columns in the interior of table4a
◮ each row of has case data for more than one observation ◮ the two columns across which the cases are spread, need to be gathered
together
◮ cases would be the column into which the values are gathered.
Tidying data - gathering
Gather the values contained in the two columns (1999 and 2000) into another pair of variates.
table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
table4a %>% gather(`1999`, `2000`, # columns that represent values, # not variates. key = "year", # name the variate whose values are # the column names. value = "cases") # name the values being gathered.
## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766
Question: How would you characterise the “units” here?
Any other comments? (year is a character vector . . . why?)
Tidying data - gathering
Again, gather the values contained in the two columns (1999 and 2000) into a different pair of variates. But this time, fix the type of value for the key = "year" variable:
table4a %>% gather(`1999`, `2000`, # columns that represent values, # not variates. key = "year", # name the variate whose values are # the column names. value = "cases", # name the values being gathered. convert = TRUE # run type.convert() on the key ) -> # forward assign to a variable tidy4a tidy4a
## # A tibble: 6 x 3 ## country year cases ## <chr> <int> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766
Tidying data - gathering
Do the same for table4b.
table4b ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583
But fix the the type of value for year:
table4b %>% gather(`1999`, `2000`, # columns that represent values, # not variates. key = "year", # name the variate whose values # are the column names. value = "population", # name the values being gathered. convert = TRUE # run type.convert() on the key. ) -> tidy4b tidy4b
## # A tibble: 6 x 3 ## country year population ## <chr> <int> <int> ## 1 Afghanistan 1999 19987071 ## 2 Brazil 1999 172006362 ## 3 China 1999 1272915272 ## 4 Afghanistan 2000 20595360 ## 5 Brazil 2000 174504898 ## 6 China 2000 1280428583
Tidying data - gathering
We only need to match rows and join columns of tidy4a and tidy4b full_join(tidy4a, tidy4b)
## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
Which looks a lot like table1. Notes:
- 1. Rows were matched on values of country and year
- 2. Columns which differed were joined to the common columns
- 3. Where there are not matching values, NA is returned for the missing
column values (none missing here).
- 4. More on joining tables later.
Tidying data - gathering to pivoting.
In the most recent releases of tidyr it has been announced that gather() is being “retired” (or no longer developed). In its place is a new and more “featureful” function called pivot_longer(). The metaphor here seems to be based on the effect on the data structure. Gathering results in longer data table; in the example, table4a was a 3 × 3 table and two of its columns were gathered together into a single (and hence longer) column. The resulting table, tidy4a, is longer, now being 6 × 3. Generally, the result of gathering is a longer narrower (no wider) table. The table was “lengthened”. In some sense, some columns were turned or pivoted into rows.
Tidying data - pivot_longer
Turning table4a into tidy4a would now look like
table4a %>% pivot_longer(cols = c(`1999`, `2000`), # column selection names_to = "year", # new column name values_to = "cases", # name of values column names_ptypes = integer() # prototype for names ) -> tidy4a tidy4a
## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766
There numerous other arguments, including values_ptypes, which make this a much more powerful function than gather().
◮ e.g. cols can be defined using general selection specs ◮ e.g. values_drop_na can drop rows having only NAs in the value_to
column.
◮ lots more . . .
Tidying data - spreading
Spreading is the opposite of gathering. The values of one or more variates are spread across two or more variates (columns) of the new table, reducing the number of rows and (possibly) increasing the number of columns. Recall
table2 ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583
Observations are scattered over more than one row. We want to spread the values of type across the columns as two new variates: cases and population whose values are given by count.
Tidying data - spreading
We want to spread the values of type across the columns as two new variates: cases and population whose values are given by count. table2 %>% spread(key = "type", value = "count")
## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
◮ the column giving the new variate names is the key to the spreading. ◮ the column giving the values of these variates is the value
Tidying data - spreading
This suggests how to create the table of cases mentioned on an earlier slide. From table2 spread the countries across the columns
table2 %>% spread(key = country, value = count)
## # A tibble: 4 x 5 ## year type Afghanistan Brazil China ## <int> <chr> <int> <int> <int> ## 1 1999 cases 745 37737 212258 ## 2 1999 population 19987071 172006362 1272915272 ## 3 2000 cases 2666 80488 213766 ## 4 2000 population 20595360 174504898 1280428583
and filter for the cases
table2 %>% spread(key = country, value = count) %>% filter(type == "cases")
## # A tibble: 2 x 5 ## year type Afghanistan Brazil China ## <int> <chr> <int> <int> <int> ## 1 1999 cases 745 37737 212258 ## 2 2000 cases 2666 80488 213766
We only need to get rid of type.
Tidying data - spreading
From table2
## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583
We need only add removing type to the pipeline:
table2 %>% spread(key = country, value = count) %>% filter(type == "cases") %>% select(-type) # everything but type
## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766
Which is the table cases (from an earlier slide) which we imagined might even have been the
- riginal recording table of the cases.
Tidying data - spread() to pivot_wider()
Again, spread() is being “retired” or no longer developed in tidyr and a more “featureful” function pivot_wide() being introduced to do the same job and more. To return to the last example:
table2 %>% pivot_wider(names_from = country, values_from = count) %>% filter(type == "cases") %>% select(-type) ## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766
Again, the idea is the we are pivoting from a longer data table to a wider one.
See help("pivot_wider") for complete functionality
Tidying data - gathering and spreading example
Here’s a challenge: how can we get from table4a
table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
to the table of cases?
## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766
Tidying data - gathering and spreading example
First need to gather together the year columns. The following are equivalent
table4a %>% gather("1999", "2000", key = year, value = value) # or gather all columns but take valued from all but country table4a %>% gather(key = year, value = value, -country)
## # A tibble: 6 x 3 ## country year value ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766
Then spread the country variable values across the columns
table4a %>% gather("1999", "2000", key = year, value = value) %>% spread(key = country, value = value)
## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <chr> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766
tidyr - separating and uniting
Sometimes the data values themselves must be wrangled. Common tasks include separating values into pieces or, conversely, uniting separate pieces into a single value. For example, consider
table5 ## # A tibble: 6 x 4 ## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272 ## 6 China 20 00 213766/1280428583
The rate variate is a character vector whose values contain both the number of cases and the population separated by “/”. These values can be separated and assigned to new variates using separate(). table5 %>% separate(rate, into = c("cases", "population"), sep ="/", convert = TRUE)
## # A tibble: 6 x 5 ## country century year cases population ## <chr> <chr> <chr> <int> <int> ## 1 Afghanistan 19 99 745 19987071 ## 2 Afghanistan 20 00 2666 20595360 ## 3 Brazil 19 99 37737 172006362 ## 4 Brazil 20 00 80488 174504898 ## 5 China 19 99 212258 1272915272 ## 6 China 20 00 213766 1280428583
Note: Unless specified with sep, every non-alphanumeric character will mark the separation.
tidyr - separating and uniting
Conversely, the two variates century and year might be united into a single variate
table5 ## # A tibble: 6 x 4 ## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272 ## 6 China 20 00 213766/1280428583
This is done using the unite() function: table5 %>% unite(col = "Cent:Year", century, year, sep = ":")
## # A tibble: 6 x 3 ## country `Cent:Year` rate ## <chr> <chr> <chr> ## 1 Afghanistan 19:99 745/19987071 ## 2 Afghanistan 20:00 2666/20595360 ## 3 Brazil 19:99 37737/172006362 ## 4 Brazil 20:00 80488/174504898 ## 5 China 19:99 212258/1272915272 ## 6 China 20:00 213766/1280428583
tidyr - separating and uniting
Or, all at once and preserving the rate
table5 %>% separate(rate, into = c("cases", "population"), sep ="/", convert = TRUE, remove = FALSE) %>% mutate(rate = 100000 *cases/population) %>% unite(col = "year", century, year, sep = "") %>% mutate(year = as.numeric(year)) ## # A tibble: 6 x 5 ## country year rate cases population ## <chr> <dbl> <dbl> <int> <int> ## 1 Afghanistan 1999 3.73 745 19987071 ## 2 Afghanistan 2000 12.9 2666 20595360 ## 3 Brazil 1999 21.9 37737 172006362 ## 4 Brazil 2000 46.1 80488 174504898 ## 5 China 1999 16.7 212258 1272915272 ## 6 China 2000 16.7 213766 1280428583
Note that year had to be explicitly changed to numerical (no convert with unite).
tidyr - missing values
Donald Rumsfeld, US Secretary of Defense, February 12, 2002: This also applies to data. There are
◮ known knowns; the data for which we have recorded values, ◮ known unknowns; data we know to be missing (recorded as NA for “Not
Available”), and
◮ unknown unknowns; data that we do not know are missing.
And it is the last of these that “tend to be the difficult ones.”
tidyr - missing values
For example, suppose we have the record of the number of courses completed by a University of Waterloo student: courseCount <- tibble( year = c(2013, 2014, 2014, 2016, 2016, 2017, 2018, 2018), term = c("Fall", "Winter", "Spring", "Winter", "Spring", "Fall", "Spring", "Fall"), count = c(5, 4, 1, 5, 1, NA, 2, 5) ) courseCount
## # A tibble: 8 x 3 ## year term count ## <dbl> <chr> <dbl> ## 1 2013 Fall 5 ## 2 2014 Winter 4 ## 3 2014 Spring 1 ## 4 2016 Winter 5 ## 5 2016 Spring 1 ## 6 2017 Fall NA ## 7 2018 Spring 2 ## 8 2018 Fall 5
In this case, there is some explicitly missing data recorded as NA for the "Fall" term of
- 2017. This is an example of a “known unknown”; we know we do not know how many
courses were completed by that student in that term. Are there any “unknown unknowns”?
Missing values - “unknown unknowns”
For example,
- 1. Notice that there are no values for year 2015.
◮ This is an “unknown unknown”. Perhaps the student never registered for any
courses in 2015? Perhaps they did but never completed any? Or maybe they completed some but the record is missing entirely? Who knows?
- 2. Notice also that the values of term correspond to the names of the seasons, but
that “Summer” does not appear.
◮ This too may be an “unknown unknown”. “Summer” may or may not be
missing.
◮ This “unknown unknown” is suggested by the context, and what we know
about the meaning of the variates and their values
- 3. There may also be some implicitly missing data.
◮ These are “unknowns” which might be inferred from the structure of the
data.
◮ The structure suggests there might be “unknown unknowns” that can be
identified as “known unknowns” and so marked as NA for known to be missing. To turn any of these “unknown unknowns” into “known unknowns” or even “known knowns” requires a deeper understanding of the data context including the meaning and possible values of every variate and more generally how the data were collected (e.g. how the observations were selected, the variates defined and their values determined), and how the data were recorded.
Missing values - turning “unknown unknowns” into “known unknowns”
Dealing with each of the previous “unknown unknowns” in order
- 1. Missing year 2015?
◮ cannot really tell whether this is missing or not. ◮ if we think it is, then we could add a row to our data
courseCount %>% add_case(year = 2015) %>% arrange(year)
## # A tibble: 9 x 3 ## year term count ## <dbl> <chr> <dbl> ## 1 2013 Fall 5 ## 2 2014 Winter 4 ## 3 2014 Spring 1 ## 4 2015 <NA> NA ## 5 2016 Winter 5 ## 6 2016 Spring 1 ## 7 2017 Fall NA ## 8 2018 Spring 2 ## 9 2018 Fall 5
And we now have some explicitly missing values, or “known unknowns”. Only the problem context will help us decide whether to include this missing data or not.
Missing values - turning “unknown unknowns” into “known unknowns”
- 2. Perhaps the variate term is missing “Summer”.
◮ We could formally indicate that by it into a factor having 4 levels, each
corresponding to a season.
courseCount %>% mutate(term = factor(term, level = c("Fall", "Winter", "Spring", "Summer")))
## # A tibble: 8 x 3 ## year term count ## <dbl> <fct> <dbl> ## 1 2013 Fall 5 ## 2 2014 Winter 4 ## 3 2014 Spring 1 ## 4 2016 Winter 5 ## 5 2016 Spring 1 ## 6 2017 Fall NA ## 7 2018 Spring 2 ## 8 2018 Fall 5
Only change to the data is term changing from character to factor.
◮ Besides, this is a Waterloo student. ◮ Each term is exactly 4 months long ◮ There are no “Summer” terms at Waterloo, so no missing Summer term ◮ Better:
courseCount %>% mutate(term = factor(term))
to register the complete set of possible values and to resolve the possible “unknown unknowns”.
Missing values - turning “unknown unknowns” into “known unknowns”
- 3. There may also be some implicitly missing data.
◮ We might, for example, wish to assert that every year should have all three
terms in the record. The complete() function will do this for us. courseCount %>% complete(year, term)
## # A tibble: 15 x 3 ## year term count ## <dbl> <chr> <dbl> ## 1 2013 Fall 5 ## 2 2013 Spring NA ## 3 2013 Winter NA ## 4 2014 Fall NA ## 5 2014 Spring 1 ## 6 2014 Winter 4 ## 7 2016 Fall NA ## 8 2016 Spring 1 ## 9 2016 Winter 5 ## 10 2017 Fall NA ## 11 2017 Spring NA ## 12 2017 Winter NA ## 13 2018 Fall 5 ## 14 2018 Spring 2 ## 15 2018 Winter NA
◮ Now have many implicitly missing data values. ◮ Can we turn some of these NAs into “known knowns”? ◮ Are all NAs alike here? ◮ Context provides a lot of info: a term is 4 months; academic order is Fall,
Winter, Spring; co-op alternates academic and work terms; etc.
◮ Note that there are no entries for 2015, so still some possibly implicitly
missing data.
Missing values - turning “unknown unknowns” into “known unknowns”
- 3. There may also be some implicitly missing data.
◮ Alternatively, this might have happened had we switched up the
representations courseCount %>% spread(year, count) ## # A tibble: 3 x 6 ## term `2013` `2014` `2016` `2017` `2018` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Fall 5 NA NA NA 5 ## 2 Spring NA 1 1 NA 2 ## 3 Winter NA 4 5 NA NA
◮ Are all NAs alike here? ◮ Which, if any, would you guess might be zero? ◮ Note again absence of 2015.