Data analysis pipelines Reading and tidying tables R.W. Oldford - - PowerPoint PPT Presentation

data analysis pipelines
SMART_READER_LITE
LIVE PREVIEW

Data analysis pipelines Reading and tidying tables R.W. Oldford - - PowerPoint PPT Presentation

Data analysis pipelines Reading and tidying tables R.W. Oldford readr - importing rectangular data Often data are stored in a rectangular format of rows and columns, for example as a .csv file. The readr package of the tidyverse has a


slide-1
SLIDE 1

Data analysis pipelines

Reading and tidying tables R.W. Oldford

slide-2
SLIDE 2

readr - importing “rectangular” data

Often data are stored in a “rectangular” format of rows and columns, for example as a .csv file. The readr package of the tidyverse has a number of functions which facilitate reading such data into R and turning them into data frames (tibbles):

◮ read_csv() for comma separated files, ◮ read_csv2() for semicolon separated files, ◮ read_tsv() for tab separated files, ◮ read_delim() for arbitrary delimiter separating entries, ◮ read_fwf() for fixed width files, ◮ read_table() for fixed width files with white space separating columns,

and

◮ read_log() for Apache style log files (see also the ‘webreadr‘ package for

reading log files). Why use these?

slide-3
SLIDE 3

readr - importing “rectangular” data

Why read_csv() instead of say read.csv()?

◮ has a lot in common with read.csv()

◮ flexible about metadata, comments, NA spec, . . .

◮ clever

◮ about numbers (e.g. 3.14159 versus 3,14159) ◮ tries to infer data types (e.g. from first 1,000 lines) ◮ identifies problems encountered

◮ much much faster ◮ always returns a tibble ◮ usually more portable code

slide-4
SLIDE 4

readr - a little cleverer

Suppose we have the following files test1.csv 2,4,6 3.1, 14, 59 test2.csv a,b 1,2,3 4,5,6 test3.csv x,y, 10, 9, 8 3.1,14,59 each being in some dataDirectory Note: each of the above files ends with a new line.

slide-5
SLIDE 5

test1.csv - read.csv versus read_csv

test1.csv 2,4,6 3.1, 14, 59 # a data.frame (x <- read.csv(path_concat(dataDirectory, "test1.csv"))) ## X2 X4 X6 ## 1 3.1 14 59 (x <- read.csv(path_concat(dataDirectory, "test1.csv"), header = FALSE)) ## V1 V2 V3 ## 1 2.0 4 6 ## 2 3.1 14 59

slide-6
SLIDE 6

test1.csv - read.csv versus read_csv

test1.csv 2,4,6 3.1, 14, 59 # a tibble (y <- read_csv(path_concat(dataDirectory, "test1.csv"))) ## # A tibble: 1 x 3 ## `2` `4` `6` ## <dbl> <dbl> <dbl> ## 1 3.1 14 59 (y <- read_csv(path_concat(dataDirectory, "test1.csv"), col_names = FALSE)) # can also supply names ## # A tibble: 2 x 3 ## X1 X2 X3 ## <dbl> <dbl> <dbl> ## 1 2 4 6 ## 2 3.1 14 59

slide-7
SLIDE 7

test2.csv - first using read.csv

test2.csv a,b 1,2,3 4,5,6

# a data.frame (x <- read.csv(path_concat(dataDirectory, "test2.csv"))) ## a b ## 1 2 3 ## 4 5 6 row.names(x) ## [1] "1" "4" (x <- read.csv(path_concat(dataDirectory, "test2.csv"), header = FALSE)) ## V1 V2 V3 ## 1 a b NA ## 2 1 2 3 ## 3 4 5 6 class(x$V1) ; class(x$V3) ## [1] "factor" ## [1] "integer"

slide-8
SLIDE 8

test2.csv - now using read_csv

test2.csv a,b 1,2,3 4,5,6

# a tibble (y <- read_csv(path_concat(dataDirectory, "test2.csv"))) ## # A tibble: 2 x 2 ## a b ## <dbl> <dbl> ## 1 1 2 ## 2 4 5 problems(y) ## # A tibble: 2 x 5 ## row col expected actual file ## <int> <chr> <chr> <chr> <chr> ## 1 1 <NA> 2 columns 3 columns './data/test2.csv' ## 2 2 <NA> 2 columns 3 columns './data/test2.csv'

slide-9
SLIDE 9

test2.csv - again with read_csv

test2.csv a,b 1,2,3 4,5,6

(y <- read_csv(path_concat(dataDirectory, "test2.csv"), col_names = FALSE)) ## # A tibble: 3 x 2 ## X1 X2 ## <chr> <chr> ## 1 a b ## 2 1 2 ## 3 4 5 problems(y) ## # A tibble: 2 x 5 ## row col expected actual file ## <int> <chr> <chr> <chr> <chr> ## 1 2 <NA> 2 columns 3 columns './data/test2.csv' ## 2 3 <NA> 2 columns 3 columns './data/test2.csv'

slide-10
SLIDE 10

test3.csv - read.csv versus read_csv

test3.csv x,y, 10, 9, 8 3.1,14,59

(x <- read.csv(path_concat(dataDirectory, "test3.csv"))) ## x y X ## 1 10.0 9 8 ## 2 3.1 14 59 class(x$x) ## [1] "numeric" (x <- read.csv(path_concat(dataDirectory, "test3.csv"), header = FALSE)) ## V1 V2 V3 ## 1 x y NA ## 2 10 9 8 ## 3 3.1 14 59 class(x$V3) ## [1] "integer"

slide-11
SLIDE 11

test3.csv - read.csv versus read_csv

test3.csv x,y, 10, 9, 8 3.1,14,59

(y <- read_csv(path_concat(dataDirectory, "test3.csv"))) ## # A tibble: 2 x 3 ## x y X3 ## <dbl> <dbl> <dbl> ## 1 10 9 8 ## 2 3.1 14 59 spec(y) ; is.null(problems(y)) ## cols( ## x = col_double(), ## y = col_double(), ## X3 = col_double() ## ) ## [1] FALSE

slide-12
SLIDE 12

test3.csv - read.csv versus read_csv

test3.csv x,y, 10, 9, 8 3.1,14,59

(y <- read_csv(path_concat(dataDirectory, "test3.csv"), col_names = FALSE)) ## # A tibble: 3 x 3 ## X1 X2 X3 ## <chr> <chr> <dbl> ## 1 x y NA ## 2 10 9 8 ## 3 3.1 14 59 problems(y) ## [1] row col expected actual ## <0 rows> (or 0-length row.names)

slide-13
SLIDE 13

readr - parsing files

All depends on how readr package parses a vector. From vignette("readr"): "The key problem that readr solves is parsing a flat file into a tibble. Parsing is the process of taking a text file and turning it into a rectangular tibble where each column is the appropriate part. Parsing takes place in three basic stages:

  • 1. The flat file is parsed into a rectangular matrix of strings.
  • 2. The type of each column is determined.
  • 3. Each column of strings is parsed into a vector of a more specific type.

It’s easiest to learn how this works in the opposite order Below, you’ll learn how the:

  • 1. Vector parsers turn a character vector in to a more specific type.
  • 2. Column specification describes the type of each column and the strategy

readr uses to guess types so you don’t need to supply them all.

  • 3. Rectangular parsers turn a flat file into a matrix of rows and columns.

Each parse_*() is coupled with a col_*() function, which will be used in the process of parsing a complete tibble." You can override defaults.

slide-14
SLIDE 14

Tidy data - tidyr

So everything is a “table” of some sort, a rectangular tibble. tidyr allows messy data to be reshaped into “tidier” data, that is easier for analysis. For example, the tidyr package has a small table, table1, containing the number of cases of tuberculosis (TB) in Afghanistan, Brazil, and China in the years 1999 and 2000. It also records the population of those countries in each year. The data are

table1 ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583

Think of how we might have actually collected this data. I imagine visiting each country (its records at least) for each year and recording the number of cases of TB and its population.

slide-15
SLIDE 15

Tidy data - tidyr

Here is the same data but stored in a different tabular form.

table2 ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583

How do you imagine this data was collected? Does each

◮ row feel like an observation measured on several variates? ◮ column feel like a variate collected on several individuals? ◮ cell feel like a measurement on a single variate for a single case?

Rows, columns, and cells ideally all “feel right” in that one can imagine collecting the data in this way. table2 does not feel as natural as table1 in the context of this problem.

slide-16
SLIDE 16

Tidy data - tidyr

Natural questions for this data:

◮ How does the rate of TB per 100,000 people compare between countries?. ◮ How does the rate compare from one year to the next for each country?

To compute the rate from table1

table1 %>% mutate(rate = (cases / population) * 100000) ## # A tibble: 6 x 5 ## country year cases population rate ## <chr> <int> <int> <int> <dbl> ## 1 Afghanistan 1999 745 19987071 3.73 ## 2 Afghanistan 2000 2666 20595360 12.9 ## 3 Brazil 1999 37737 172006362 21.9 ## 4 Brazil 2000 80488 174504898 46.1 ## 5 China 1999 212258 1272915272 16.7 ## 6 China 2000 213766 1280428583 16.7

slide-17
SLIDE 17

Tidy data - tidyr

To compare the rates using table1

table1 %>% mutate(rate = (cases / population) * 100000) %>% ggplot(aes(x=year, y=rate)) + ggtitle("Tuberculosis rates per 100,000") + geom_line(aes(group = country, colour = country), lwd = 1.5) + geom_point(aes(colour = country), size = 5) + scale_x_continuous(breaks=c(1999, 2000))

10 20 30 40 1999 2000

year rate country

Afghanistan Brazil China

Tuberculosis rates per 100,000

How about beginning with table2?

slide-18
SLIDE 18

Tidy data - tidyr

To compute the rate from table2. First get the cases

table2 %>% filter(type == "cases") ## # A tibble: 6 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 2000 cases 2666 ## 3 Brazil 1999 cases 37737 ## 4 Brazil 2000 cases 80488 ## 5 China 1999 cases 212258 ## 6 China 2000 cases 213766

OK . . . don’t need type anymore and should rename count

table2 %>% filter(type == "cases") %>% select(-type) %>% rename(cases = count) ## # A tibble: 6 x 3 ## country year cases ## <chr> <int> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766

Looks good!

slide-19
SLIDE 19

Tidy data - tidyr

Do the same for the population!

table2 %>% filter(type == "population") %>% select(-type) %>% rename(population = count) ## # A tibble: 6 x 3 ## country year population ## <chr> <int> <int> ## 1 Afghanistan 1999 19987071 ## 2 Afghanistan 2000 20595360 ## 3 Brazil 1999 172006362 ## 4 Brazil 2000 174504898 ## 5 China 1999 1272915272 ## 6 China 2000 1280428583

Now we can get a rate from these two tables! As in building a new tibble.

◮ need to save each table ◮ probably should make sure both tables have their rows in the same order.

slide-20
SLIDE 20

Tidy data - tidyr

Arrange so that the order is the same and name the tables.

table2 %>% filter(type == "cases") %>% select(-type) %>% rename(cases = count) %>% arrange(country, year) -> table2_cases table2 %>% filter(type == "population") %>% select(-type) %>% rename(population = count) %>% arrange(country, year) -> table2_pop

And a new data set for the rates constructed using both tables.

tibble(country = table2_cases$country, year = table2_cases$year, cases = table2_cases$cases, population = table2_pop$population) %>% mutate(rate = (cases / population) * 100000) -> table2_rates

Now proceed as we did with table.

slide-21
SLIDE 21

Tidy data - tidyr

Imagine tuberculosis cases come from one source (or compiled from many separate sources) over time and the populations from another. We might then have two tables, one for the cases

table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766

where columns are added over time as the data comes in and another of the population over the same years:

table4b ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583

Question: Is either table “tidy”? Why? Why not?

slide-22
SLIDE 22

Tidy data - what is “tidy”?

So, what is “tidy”?

◮ Both table4a and table4b appear to be realistic. i.e. you can imagine the

data being collected like that

◮ But, so too with this table (call it cases)

## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766

and this one (call it populations)

## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 19987071 172006362 1272915272 ## 2 2000 20595360 174504898 1280428583

Are these “tidy”? Either could be long and thin, or short and wide, or whatever. Questions: How would you create these two tables from table2? From table4a and table4b? (More on this later.)

slide-23
SLIDE 23

Tidy data - what is “tidy”?

So, “tidy”? Or not “tidy”?

◮ Tidy: the tables of either representation (table4a and table4b or cases

and populations) are arguably “tidy” in their own right.

◮ e.g. imagine a time series for each country so each table records country

values of a particular quantity over time

◮ What if there were, say, 100 years of data on three countries? Or, two years

  • f data on 190 countries?

Would your opinion change?

◮ Not tidy: try doing the previous analysis to get a plot of TB rates using

these tables. So, what’s tidy?

◮ imagine what are the “natural” variates for your problem?

These are your columns.

◮ imagine what are the “natural” units of study (cases) for your problem?

These are your rows.

slide-24
SLIDE 24

Tidying data

Why data might not be “tidy” (for your problem):

◮ data are not organized for analysis, but for something else e.g.

◮ for transaction (selecting, updating cells) ◮ for ease of data entry (e.g. simply just add another year by row or column, or

another country)

◮ data repository is not constructed by data analysts, but by others having no

appreciation of analysis So, even though the data is provably consistent, complete, coherent, and error free, data analysts will still often have to “tidy” data to get it into a shape for analysis. The first thing to do is to understand (for your problem) what are the variates (i.e. what is being measured) and what are the study units or observations or cases (viz. on what are the measurements being taken)? Two common problems often remain

◮ one or more variates are spread across multiple columns ◮ one or more observations are scattered across multiple rows

In the tidyr package, gather() and spread()are designed to help address these.

Note These functions can now been replaced by more general functions pivot_longer() and pivot_wider().

slide-25
SLIDE 25

Tidying data - gathering

Recall table4a contains the number of TB cases in three countries for two years

table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766

From our previous analyses, the variates were country, year, and cases.

◮ Only country appears as a column in table4a. ◮ year shows up as the names of the remaining columns

◮ need to gather the values naming the two columns (1999 nd 2000) into a

single variate (column)

◮ cases appear spread across rows and columns in the interior of table4a

◮ each row of has case data for more than one observation ◮ the two columns across which the cases are spread, need to be gathered

together

◮ cases would be the column into which the values are gathered.

slide-26
SLIDE 26

Tidying data - gathering

Gather the values contained in the two columns (1999 and 2000) into another pair of variates.

table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766

table4a %>% gather(`1999`, `2000`, # columns that represent values, # not variates. key = "year", # name the variate whose values are # the column names. value = "cases") # name the values being gathered.

## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766

Question: How would you characterise the “units” here?

Any other comments? (year is a character vector . . . why?)

slide-27
SLIDE 27

Tidying data - gathering

Again, gather the values contained in the two columns (1999 and 2000) into a different pair of variates. But this time, fix the type of value for the key = "year" variable:

table4a %>% gather(`1999`, `2000`, # columns that represent values, # not variates. key = "year", # name the variate whose values are # the column names. value = "cases", # name the values being gathered. convert = TRUE # run type.convert() on the key ) -> # forward assign to a variable tidy4a tidy4a

## # A tibble: 6 x 3 ## country year cases ## <chr> <int> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766

slide-28
SLIDE 28

Tidying data - gathering

Do the same for table4b.

table4b ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583

But fix the the type of value for year:

table4b %>% gather(`1999`, `2000`, # columns that represent values, # not variates. key = "year", # name the variate whose values # are the column names. value = "population", # name the values being gathered. convert = TRUE # run type.convert() on the key. ) -> tidy4b tidy4b

## # A tibble: 6 x 3 ## country year population ## <chr> <int> <int> ## 1 Afghanistan 1999 19987071 ## 2 Brazil 1999 172006362 ## 3 China 1999 1272915272 ## 4 Afghanistan 2000 20595360 ## 5 Brazil 2000 174504898 ## 6 China 2000 1280428583

slide-29
SLIDE 29

Tidying data - gathering

We only need to match rows and join columns of tidy4a and tidy4b full_join(tidy4a, tidy4b)

## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583

Which looks a lot like table1. Notes:

  • 1. Rows were matched on values of country and year
  • 2. Columns which differed were joined to the common columns
  • 3. Where there are not matching values, NA is returned for the missing

column values (none missing here).

  • 4. More on joining tables later.
slide-30
SLIDE 30

Tidying data - gathering to pivoting.

In the most recent releases of tidyr it has been announced that gather() is being “retired” (or no longer developed). In its place is a new and more “featureful” function called pivot_longer(). The metaphor here seems to be based on the effect on the data structure. Gathering results in longer data table; in the example, table4a was a 3 × 3 table and two of its columns were gathered together into a single (and hence longer) column. The resulting table, tidy4a, is longer, now being 6 × 3. Generally, the result of gathering is a longer narrower (no wider) table. The table was “lengthened”. In some sense, some columns were turned or pivoted into rows.

slide-31
SLIDE 31

Tidying data - pivot_longer

Turning table4a into tidy4a would now look like

table4a %>% pivot_longer(cols = c(`1999`, `2000`), # column selection names_to = "year", # new column name values_to = "cases", # name of values column names_ptypes = integer() # prototype for names ) -> tidy4a tidy4a

## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766

There numerous other arguments, including values_ptypes, which make this a much more powerful function than gather().

◮ e.g. cols can be defined using general selection specs ◮ e.g. values_drop_na can drop rows having only NAs in the value_to

column.

◮ lots more . . .

slide-32
SLIDE 32

Tidying data - spreading

Spreading is the opposite of gathering. The values of one or more variates are spread across two or more variates (columns) of the new table, reducing the number of rows and (possibly) increasing the number of columns. Recall

table2 ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583

Observations are scattered over more than one row. We want to spread the values of type across the columns as two new variates: cases and population whose values are given by count.

slide-33
SLIDE 33

Tidying data - spreading

We want to spread the values of type across the columns as two new variates: cases and population whose values are given by count. table2 %>% spread(key = "type", value = "count")

## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583

◮ the column giving the new variate names is the key to the spreading. ◮ the column giving the values of these variates is the value

slide-34
SLIDE 34

Tidying data - spreading

This suggests how to create the table of cases mentioned on an earlier slide. From table2 spread the countries across the columns

table2 %>% spread(key = country, value = count)

## # A tibble: 4 x 5 ## year type Afghanistan Brazil China ## <int> <chr> <int> <int> <int> ## 1 1999 cases 745 37737 212258 ## 2 1999 population 19987071 172006362 1272915272 ## 3 2000 cases 2666 80488 213766 ## 4 2000 population 20595360 174504898 1280428583

and filter for the cases

table2 %>% spread(key = country, value = count) %>% filter(type == "cases")

## # A tibble: 2 x 5 ## year type Afghanistan Brazil China ## <int> <chr> <int> <int> <int> ## 1 1999 cases 745 37737 212258 ## 2 2000 cases 2666 80488 213766

We only need to get rid of type.

slide-35
SLIDE 35

Tidying data - spreading

From table2

## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583

We need only add removing type to the pipeline:

table2 %>% spread(key = country, value = count) %>% filter(type == "cases") %>% select(-type) # everything but type

## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766

Which is the table cases (from an earlier slide) which we imagined might even have been the

  • riginal recording table of the cases.
slide-36
SLIDE 36

Tidying data - spread() to pivot_wider()

Again, spread() is being “retired” or no longer developed in tidyr and a more “featureful” function pivot_wide() being introduced to do the same job and more. To return to the last example:

table2 %>% pivot_wider(names_from = country, values_from = count) %>% filter(type == "cases") %>% select(-type) ## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766

Again, the idea is the we are pivoting from a longer data table to a wider one.

See help("pivot_wider") for complete functionality

slide-37
SLIDE 37

Tidying data - gathering and spreading example

Here’s a challenge: how can we get from table4a

table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766

to the table of cases?

## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <int> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766

slide-38
SLIDE 38

Tidying data - gathering and spreading example

First need to gather together the year columns. The following are equivalent

table4a %>% gather("1999", "2000", key = year, value = value) # or gather all columns but take valued from all but country table4a %>% gather(key = year, value = value, -country)

## # A tibble: 6 x 3 ## country year value ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Brazil 1999 37737 ## 3 China 1999 212258 ## 4 Afghanistan 2000 2666 ## 5 Brazil 2000 80488 ## 6 China 2000 213766

Then spread the country variable values across the columns

table4a %>% gather("1999", "2000", key = year, value = value) %>% spread(key = country, value = value)

## # A tibble: 2 x 4 ## year Afghanistan Brazil China ## <chr> <int> <int> <int> ## 1 1999 745 37737 212258 ## 2 2000 2666 80488 213766

slide-39
SLIDE 39

tidyr - separating and uniting

Sometimes the data values themselves must be wrangled. Common tasks include separating values into pieces or, conversely, uniting separate pieces into a single value. For example, consider

table5 ## # A tibble: 6 x 4 ## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272 ## 6 China 20 00 213766/1280428583

The rate variate is a character vector whose values contain both the number of cases and the population separated by “/”. These values can be separated and assigned to new variates using separate(). table5 %>% separate(rate, into = c("cases", "population"), sep ="/", convert = TRUE)

## # A tibble: 6 x 5 ## country century year cases population ## <chr> <chr> <chr> <int> <int> ## 1 Afghanistan 19 99 745 19987071 ## 2 Afghanistan 20 00 2666 20595360 ## 3 Brazil 19 99 37737 172006362 ## 4 Brazil 20 00 80488 174504898 ## 5 China 19 99 212258 1272915272 ## 6 China 20 00 213766 1280428583

Note: Unless specified with sep, every non-alphanumeric character will mark the separation.

slide-40
SLIDE 40

tidyr - separating and uniting

Conversely, the two variates century and year might be united into a single variate

table5 ## # A tibble: 6 x 4 ## country century year rate ## * <chr> <chr> <chr> <chr> ## 1 Afghanistan 19 99 745/19987071 ## 2 Afghanistan 20 00 2666/20595360 ## 3 Brazil 19 99 37737/172006362 ## 4 Brazil 20 00 80488/174504898 ## 5 China 19 99 212258/1272915272 ## 6 China 20 00 213766/1280428583

This is done using the unite() function: table5 %>% unite(col = "Cent:Year", century, year, sep = ":")

## # A tibble: 6 x 3 ## country `Cent:Year` rate ## <chr> <chr> <chr> ## 1 Afghanistan 19:99 745/19987071 ## 2 Afghanistan 20:00 2666/20595360 ## 3 Brazil 19:99 37737/172006362 ## 4 Brazil 20:00 80488/174504898 ## 5 China 19:99 212258/1272915272 ## 6 China 20:00 213766/1280428583

slide-41
SLIDE 41

tidyr - separating and uniting

Or, all at once and preserving the rate

table5 %>% separate(rate, into = c("cases", "population"), sep ="/", convert = TRUE, remove = FALSE) %>% mutate(rate = 100000 *cases/population) %>% unite(col = "year", century, year, sep = "") %>% mutate(year = as.numeric(year)) ## # A tibble: 6 x 5 ## country year rate cases population ## <chr> <dbl> <dbl> <int> <int> ## 1 Afghanistan 1999 3.73 745 19987071 ## 2 Afghanistan 2000 12.9 2666 20595360 ## 3 Brazil 1999 21.9 37737 172006362 ## 4 Brazil 2000 46.1 80488 174504898 ## 5 China 1999 16.7 212258 1272915272 ## 6 China 2000 16.7 213766 1280428583

Note that year had to be explicitly changed to numerical (no convert with unite).

slide-42
SLIDE 42

tidyr - missing values

Donald Rumsfeld, US Secretary of Defense, February 12, 2002: This also applies to data. There are

◮ known knowns; the data for which we have recorded values, ◮ known unknowns; data we know to be missing (recorded as NA for “Not

Available”), and

◮ unknown unknowns; data that we do not know are missing.

And it is the last of these that “tend to be the difficult ones.”

slide-43
SLIDE 43

tidyr - missing values

For example, suppose we have the record of the number of courses completed by a University of Waterloo student: courseCount <- tibble( year = c(2013, 2014, 2014, 2016, 2016, 2017, 2018, 2018), term = c("Fall", "Winter", "Spring", "Winter", "Spring", "Fall", "Spring", "Fall"), count = c(5, 4, 1, 5, 1, NA, 2, 5) ) courseCount

## # A tibble: 8 x 3 ## year term count ## <dbl> <chr> <dbl> ## 1 2013 Fall 5 ## 2 2014 Winter 4 ## 3 2014 Spring 1 ## 4 2016 Winter 5 ## 5 2016 Spring 1 ## 6 2017 Fall NA ## 7 2018 Spring 2 ## 8 2018 Fall 5

In this case, there is some explicitly missing data recorded as NA for the "Fall" term of

  • 2017. This is an example of a “known unknown”; we know we do not know how many

courses were completed by that student in that term. Are there any “unknown unknowns”?

slide-44
SLIDE 44

Missing values - “unknown unknowns”

For example,

  • 1. Notice that there are no values for year 2015.

◮ This is an “unknown unknown”. Perhaps the student never registered for any

courses in 2015? Perhaps they did but never completed any? Or maybe they completed some but the record is missing entirely? Who knows?

  • 2. Notice also that the values of term correspond to the names of the seasons, but

that “Summer” does not appear.

◮ This too may be an “unknown unknown”. “Summer” may or may not be

missing.

◮ This “unknown unknown” is suggested by the context, and what we know

about the meaning of the variates and their values

  • 3. There may also be some implicitly missing data.

◮ These are “unknowns” which might be inferred from the structure of the

data.

◮ The structure suggests there might be “unknown unknowns” that can be

identified as “known unknowns” and so marked as NA for known to be missing. To turn any of these “unknown unknowns” into “known unknowns” or even “known knowns” requires a deeper understanding of the data context including the meaning and possible values of every variate and more generally how the data were collected (e.g. how the observations were selected, the variates defined and their values determined), and how the data were recorded.

slide-45
SLIDE 45

Missing values - turning “unknown unknowns” into “known unknowns”

Dealing with each of the previous “unknown unknowns” in order

  • 1. Missing year 2015?

◮ cannot really tell whether this is missing or not. ◮ if we think it is, then we could add a row to our data

courseCount %>% add_case(year = 2015) %>% arrange(year)

## # A tibble: 9 x 3 ## year term count ## <dbl> <chr> <dbl> ## 1 2013 Fall 5 ## 2 2014 Winter 4 ## 3 2014 Spring 1 ## 4 2015 <NA> NA ## 5 2016 Winter 5 ## 6 2016 Spring 1 ## 7 2017 Fall NA ## 8 2018 Spring 2 ## 9 2018 Fall 5

And we now have some explicitly missing values, or “known unknowns”. Only the problem context will help us decide whether to include this missing data or not.

slide-46
SLIDE 46

Missing values - turning “unknown unknowns” into “known unknowns”

  • 2. Perhaps the variate term is missing “Summer”.

◮ We could formally indicate that by it into a factor having 4 levels, each

corresponding to a season.

courseCount %>% mutate(term = factor(term, level = c("Fall", "Winter", "Spring", "Summer")))

## # A tibble: 8 x 3 ## year term count ## <dbl> <fct> <dbl> ## 1 2013 Fall 5 ## 2 2014 Winter 4 ## 3 2014 Spring 1 ## 4 2016 Winter 5 ## 5 2016 Spring 1 ## 6 2017 Fall NA ## 7 2018 Spring 2 ## 8 2018 Fall 5

Only change to the data is term changing from character to factor.

◮ Besides, this is a Waterloo student. ◮ Each term is exactly 4 months long ◮ There are no “Summer” terms at Waterloo, so no missing Summer term ◮ Better:

courseCount %>% mutate(term = factor(term))

to register the complete set of possible values and to resolve the possible “unknown unknowns”.

slide-47
SLIDE 47

Missing values - turning “unknown unknowns” into “known unknowns”

  • 3. There may also be some implicitly missing data.

◮ We might, for example, wish to assert that every year should have all three

terms in the record. The complete() function will do this for us. courseCount %>% complete(year, term)

## # A tibble: 15 x 3 ## year term count ## <dbl> <chr> <dbl> ## 1 2013 Fall 5 ## 2 2013 Spring NA ## 3 2013 Winter NA ## 4 2014 Fall NA ## 5 2014 Spring 1 ## 6 2014 Winter 4 ## 7 2016 Fall NA ## 8 2016 Spring 1 ## 9 2016 Winter 5 ## 10 2017 Fall NA ## 11 2017 Spring NA ## 12 2017 Winter NA ## 13 2018 Fall 5 ## 14 2018 Spring 2 ## 15 2018 Winter NA

◮ Now have many implicitly missing data values. ◮ Can we turn some of these NAs into “known knowns”? ◮ Are all NAs alike here? ◮ Context provides a lot of info: a term is 4 months; academic order is Fall,

Winter, Spring; co-op alternates academic and work terms; etc.

◮ Note that there are no entries for 2015, so still some possibly implicitly

missing data.

slide-48
SLIDE 48

Missing values - turning “unknown unknowns” into “known unknowns”

  • 3. There may also be some implicitly missing data.

◮ Alternatively, this might have happened had we switched up the

representations courseCount %>% spread(year, count) ## # A tibble: 3 x 6 ## term `2013` `2014` `2016` `2017` `2018` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Fall 5 NA NA NA 5 ## 2 Spring NA 1 1 NA 2 ## 3 Winter NA 4 5 NA NA

◮ Are all NAs alike here? ◮ Which, if any, would you guess might be zero? ◮ Note again absence of 2015.

slide-49
SLIDE 49

Missing values - complete()

If we want all years and all terms: courseCount %>% add_row(year = 2015, term = "Winter") %>% mutate(term = factor(term)) %>% complete(year,term) %>% arrange(year, desc(term))

## # A tibble: 18 x 3 ## year term count ## <dbl> <fct> <dbl> ## 1 2013 Winter NA ## 2 2013 Spring NA ## 3 2013 Fall 5 ## 4 2014 Winter 4 ## 5 2014 Spring 1 ## 6 2014 Fall NA ## 7 2015 Winter NA ## 8 2015 Spring NA ## 9 2015 Fall NA ## 10 2016 Winter 5 ## 11 2016 Spring 1 ## 12 2016 Fall NA ## 13 2017 Winter NA ## 14 2017 Spring NA ## 15 2017 Fall NA ## 16 2018 Winter NA ## 17 2018 Spring 2 ## 18 2018 Fall 5 ◮ Are all NAs alike here? ◮ Which, if any, would you guess might be zero?

slide-50
SLIDE 50

Missing values - complete()

Match the time order of the terms with their factor levels: courseCount %>% mutate(term = factor(term, levels = c("Winter", "Spring", "Fall"))) %>% complete(year, term) %>% arrange(year, term) # Note, now ascending term

## # A tibble: 15 x 3 ## year term count ## <dbl> <fct> <dbl> ## 1 2013 Winter NA ## 2 2013 Spring NA ## 3 2013 Fall 5 ## 4 2014 Winter 4 ## 5 2014 Spring 1 ## 6 2014 Fall NA ## 7 2016 Winter 5 ## 8 2016 Spring 1 ## 9 2016 Fall NA ## 10 2017 Winter NA ## 11 2017 Spring NA ## 12 2017 Fall NA ## 13 2018 Winter NA ## 14 2018 Spring 2 ## 15 2018 Fall 5

slide-51
SLIDE 51

Missing values - complete()

All missing counts are made 0, except the one we know to be missing (the one “known unknown”) courseCount %>% mutate(term = factor(term, levels = c("Winter", "Spring", "Fall"))) %>% complete(year, term, fill = list(count = 0)) %>% mutate(count = ifelse(year == 2017 & term == "Fall", NA, count)) %>% arrange(year, term)

## # A tibble: 15 x 3 ## year term count ## <dbl> <fct> <dbl> ## 1 2013 Winter ## 2 2013 Spring ## 3 2013 Fall 5 ## 4 2014 Winter 4 ## 5 2014 Spring 1 ## 6 2014 Fall ## 7 2016 Winter 5 ## 8 2016 Spring 1 ## 9 2016 Fall ## 10 2017 Winter ## 11 2017 Spring ## 12 2017 Fall NA ## 13 2018 Winter ## 14 2018 Spring 2 ## 15 2018 Fall 5 ◮ Which looks better. (Note: 2015 could have been added.)

slide-52
SLIDE 52

Missing values - systematic and intentional

Suppose we have the following data in grades.csv

lastname, firstname, grad, course, grade Dahun, Attila, yes, STAT847, 83 ,, no, STAT442, 87 ,, , STAT331, 90 Wohnyablo, Dina, yes, STAT847, 78 ,, , STAT842, 80 ,, no, STAT331, 85 Shoruncle, Bob, no, , 73 ,, , STAT231, 85 ,, , , 44

This might have been entered in this way so that redundant information did not have to be repeatedly entered or stored (e.g. names). We need to know when information can be repeated and when it cannot.

slide-53
SLIDE 53

Missing values - systematic and intentional

We read in grades.csv and get a bunch of missing values:

grades <- read_csv(path_concat(dataDirectory, "grades.csv")) grades

## # A tibble: 9 x 5 ## lastname firstname grad course grade ## <chr> <chr> <chr> <chr> <dbl> ## 1 Dahun Attila yes STAT847 83 ## 2 <NA> <NA> no STAT442 87 ## 3 <NA> <NA> <NA> STAT331 90 ## 4 Wohnyablo Dina yes STAT847 78 ## 5 <NA> <NA> <NA> STAT842 80 ## 6 <NA> <NA> no STAT331 85 ## 7 Shoruncle Bob no <NA> 73 ## 8 <NA> <NA> <NA> STAT231 85 ## 9 <NA> <NA> <NA> <NA> 44

Again, are all these NAs the same? Do we know the values of some? Are some truly unknown? tidyr provides the function fill() to “fill” in the missing values of specified variates from those rows directly above (or below if .direction = "up")

slide-54
SLIDE 54

Missing values - fill()

grades ## # A tibble: 9 x 5 ## lastname firstname grad course grade ## <chr> <chr> <chr> <chr> <dbl> ## 1 Dahun Attila yes STAT847 83 ## 2 <NA> <NA> no STAT442 87 ## 3 <NA> <NA> <NA> STAT331 90 ## 4 Wohnyablo Dina yes STAT847 78 ## 5 <NA> <NA> <NA> STAT842 80 ## 6 <NA> <NA> no STAT331 85 ## 7 Shoruncle Bob no <NA> 73 ## 8 <NA> <NA> <NA> STAT231 85 ## 9 <NA> <NA> <NA> <NA> 44

Perhaps we know that names and grad status are intentionally never repeated in the file.

grades %>% fill(firstname, lastname, grad)

## # A tibble: 9 x 5 ## lastname firstname grad course grade ## <chr> <chr> <chr> <chr> <dbl> ## 1 Dahun Attila yes STAT847 83 ## 2 Dahun Attila no STAT442 87 ## 3 Dahun Attila no STAT331 90 ## 4 Wohnyablo Dina yes STAT847 78 ## 5 Wohnyablo Dina yes STAT842 80 ## 6 Wohnyablo Dina no STAT331 85 ## 7 Shoruncle Bob no <NA> 73 ## 8 Shoruncle Bob no STAT231 85 ## 9 Shoruncle Bob no <NA> 44

slide-55
SLIDE 55

Missing values - fill()

## # A tibble: 9 x 5 ## lastname firstname grad course grade ## <chr> <chr> <chr> <chr> <dbl> ## 1 Dahun Attila yes STAT847 83 ## 2 Dahun Attila no STAT442 87 ## 3 Dahun Attila no STAT331 90 ## 4 Wohnyablo Dina yes STAT847 78 ## 5 Wohnyablo Dina yes STAT842 80 ## 6 Wohnyablo Dina no STAT331 85 ## 7 Shoruncle Bob no <NA> 73 ## 8 Shoruncle Bob no STAT231 85 ## 9 Shoruncle Bob no <NA> 44

But what about course? Are the NAs the same? Likely the second NA could be filled in, but perhaps not the first. Separate out Bob and use fill() on course for Bob

grades %>% fill(firstname, lastname, grad) %>% filter(firstname == "Bob", lastname == "Shoruncle") %>% fill(course) -> bobShoruncle bobShoruncle

## # A tibble: 3 x 5 ## lastname firstname grad course grade ## <chr> <chr> <chr> <chr> <dbl> ## 1 Shoruncle Bob no <NA> 73 ## 2 Shoruncle Bob no STAT231 85 ## 3 Shoruncle Bob no STAT231 44

slide-56
SLIDE 56

Missing values - fill()

And now everyone else

grades %>% fill(firstname, lastname, grad) %>% filter(firstname != "Bob" & lastname != "Shoruncle") %>% fill(course) -> everyoneElse everyoneElse

## # A tibble: 6 x 5 ## lastname firstname grad course grade ## <chr> <chr> <chr> <chr> <dbl> ## 1 Dahun Attila yes STAT847 83 ## 2 Dahun Attila no STAT442 87 ## 3 Dahun Attila no STAT331 90 ## 4 Wohnyablo Dina yes STAT847 78 ## 5 Wohnyablo Dina yes STAT842 80 ## 6 Wohnyablo Dina no STAT331 85

And put them together again using dplyr’s bind_rows() to join the two tables together.

grades <- bind_rows(bobShoruncle, everyoneElse)

dplyr also provides bind_cols() as a means to join tibbles.

slide-57
SLIDE 57

Missing values - fill()

The tibble can now be arranged in the order preferred:

grades <- grades %>% arrange(lastname, firstname, desc(course), desc(grade)) grades

## # A tibble: 9 x 5 ## lastname firstname grad course grade ## <chr> <chr> <chr> <chr> <dbl> ## 1 Dahun Attila yes STAT847 83 ## 2 Dahun Attila no STAT442 87 ## 3 Dahun Attila no STAT331 90 ## 4 Shoruncle Bob no STAT231 85 ## 5 Shoruncle Bob no STAT231 44 ## 6 Shoruncle Bob no <NA> 73 ## 7 Wohnyablo Dina yes STAT847 78 ## 8 Wohnyablo Dina yes STAT842 80 ## 9 Wohnyablo Dina no STAT331 85

The missing values are replaced by correct values and the only remaining NA is the one “known unknown”.