Data analysis pipelines Reading and tidying tables R.W. Oldford

readr - importing “rectangular” data Often data are stored in a “rectangular” format of rows and columns, for example as a .csv file. The readr package of the tidyverse has a number of functions which facilitate reading such data into R and turning them into data frames (tibbles): ◮ read_csv() for comma separated files, ◮ read_csv2() for semicolon separated files, ◮ read_tsv() for tab separated files, ◮ read_delim() for arbitrary delimiter separating entries, ◮ read_fwf() for fixed width files, ◮ read_table() for fixed width files with white space separating columns, and ◮ read_log() for Apache style log files (see also the ‘webreadr‘ package for reading log files). Why use these?

readr - importing “rectangular” data Why read_csv() instead of say read.csv() ? ◮ has a lot in common with read.csv() ◮ flexible about metadata, comments, NA spec, . . . ◮ clever ◮ about numbers (e.g. 3.14159 versus 3,14159 ) ◮ tries to infer data types (e.g. from first 1,000 lines) ◮ identifies problems encountered ◮ much much faster ◮ always returns a tibble ◮ usually more portable code

readr - a little cleverer Suppose we have the following files test1.csv 2,4,6 3.1, 14, 59 test2.csv a,b 1,2,3 4,5,6 test3.csv x,y, 10, 9, 8 3.1,14,59 each being in some dataDirectory Note: each of the above files ends with a new line.

test1.csv - read.csv versus read_csv test1.csv 2,4,6 3.1, 14, 59 # a data.frame (x <- read.csv ( path_concat (dataDirectory, "test1.csv"))) ## X2 X4 X6 ## 1 3.1 14 59 (x <- read.csv ( path_concat (dataDirectory, "test1.csv"), header = FALSE)) ## V1 V2 V3 ## 1 2.0 4 6 ## 2 3.1 14 59

test1.csv - read.csv versus read_csv test1.csv 2,4,6 3.1, 14, 59 # a tibble (y <- read_csv ( path_concat (dataDirectory, "test1.csv"))) ## # A tibble: 1 x 3 ## `2` `4` `6` ## <dbl> <dbl> <dbl> ## 1 3.1 14 59 (y <- read_csv ( path_concat (dataDirectory, "test1.csv"), col_names = FALSE)) # can also supply names ## # A tibble: 2 x 3 ## X1 X2 X3 ## <dbl> <dbl> <dbl> ## 1 2 4 6 ## 2 3.1 14 59

test2.csv - first using read.csv test2.csv a,b 1,2,3 4,5,6 # a data.frame (x <- read.csv ( path_concat (dataDirectory, "test2.csv"))) ## a b ## 1 2 3 ## 4 5 6 row.names (x) ## [1] "1" "4" (x <- read.csv ( path_concat (dataDirectory, "test2.csv"), header = FALSE)) ## V1 V2 V3 ## 1 a b NA ## 2 1 2 3 ## 3 4 5 6 class (x $ V1) ; class (x $ V3) ## [1] "factor" ## [1] "integer"

test2.csv - now using read_csv test2.csv a,b 1,2,3 4,5,6 # a tibble (y <- read_csv ( path_concat (dataDirectory, "test2.csv"))) ## # A tibble: 2 x 2 ## a b ## <dbl> <dbl> ## 1 1 2 ## 2 4 5 problems (y) ## # A tibble: 2 x 5 ## row col expected actual file ## <int> <chr> <chr> <chr> <chr> ## 1 1 <NA> 2 columns 3 columns './data/test2.csv' ## 2 2 <NA> 2 columns 3 columns './data/test2.csv'

test2.csv - again with read_csv test2.csv a,b 1,2,3 4,5,6 (y <- read_csv ( path_concat (dataDirectory, "test2.csv"), col_names = FALSE)) ## # A tibble: 3 x 2 ## X1 X2 ## <chr> <chr> ## 1 a b ## 2 1 2 ## 3 4 5 problems (y) ## # A tibble: 2 x 5 ## row col expected actual file ## <int> <chr> <chr> <chr> <chr> ## 1 2 <NA> 2 columns 3 columns './data/test2.csv' ## 2 3 <NA> 2 columns 3 columns './data/test2.csv'

test3.csv - read.csv versus read_csv test3.csv x,y, 10, 9, 8 3.1,14,59 (x <- read.csv ( path_concat (dataDirectory, "test3.csv"))) ## x y X ## 1 10.0 9 8 ## 2 3.1 14 59 class (x $ x) ## [1] "numeric" (x <- read.csv ( path_concat (dataDirectory, "test3.csv"), header = FALSE)) ## V1 V2 V3 ## 1 x y NA ## 2 10 9 8 ## 3 3.1 14 59 class (x $ V3) ## [1] "integer"

test3.csv - read.csv versus read_csv test3.csv x,y, 10, 9, 8 3.1,14,59 (y <- read_csv ( path_concat (dataDirectory, "test3.csv"))) ## # A tibble: 2 x 3 ## x y X3 ## <dbl> <dbl> <dbl> ## 1 10 9 8 ## 2 3.1 14 59 spec (y) ; is.null ( problems (y)) ## cols( ## x = col_double(), ## y = col_double(), ## X3 = col_double() ## ) ## [1] FALSE

test3.csv - read.csv versus read_csv test3.csv x,y, 10, 9, 8 3.1,14,59 (y <- read_csv ( path_concat (dataDirectory, "test3.csv"), col_names = FALSE)) ## # A tibble: 3 x 3 ## X1 X2 X3 ## <chr> <chr> <dbl> ## 1 x y NA ## 2 10 9 8 ## 3 3.1 14 59 problems (y) ## [1] row col expected actual ## <0 rows> (or 0-length row.names)

readr - parsing files All depends on how readr package parses a vector. From vignette("readr") : "The key problem that readr solves is parsing a flat file into a tibble. Parsing is the process of taking a text file and turning it into a rectangular tibble where each column is the appropriate part. Parsing takes place in three basic stages: 1. The flat file is parsed into a rectangular matrix of strings. 2. The type of each column is determined. 3. Each column of strings is parsed into a vector of a more specific type. It’s easiest to learn how this works in the opposite order Below, you’ll learn how the: 1. Vector parsers turn a character vector in to a more specific type. 2. Column specification describes the type of each column and the strategy readr uses to guess types so you don’t need to supply them all. 3. Rectangular parsers turn a flat file into a matrix of rows and columns. Each parse_*() is coupled with a col_*() function, which will be used in the process of parsing a complete tibble." You can override defaults.

Tidy data - tidyr So everything is a “table” of some sort, a rectangular tibble . tidyr allows messy data to be reshaped into “tidier” data, that is easier for analysis. For example, the tidyr package has a small table, table1 , containing the number of cases of tuberculosis (TB) in Afghanistan, Brazil, and China in the years 1999 and 2000. It also records the population of those countries in each year. The data are table1 ## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 Think of how we might have actually collected this data. I imagine visiting each country (its records at least) for each year and recording the number of cases of TB and its population.

Tidy data - tidyr Here is the same data but stored in a different tabular form. table2 ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 How do you imagine this data was collected? Does each ◮ row feel like an observation measured on several variates? ◮ column feel like a variate collected on several individuals? ◮ cell feel like a measurement on a single variate for a single case? Rows, columns, and cells ideally all “feel right” in that one can imagine collecting the data in this way. table2 does not feel as natural as table1 in the context of this problem.

Tidy data - tidyr Natural questions for this data: ◮ How does the rate of TB per 100,000 people compare between countries?. ◮ How does the rate compare from one year to the next for each country? To compute the rate from table1 table1 %>% mutate (rate = (cases / population) * 100000) ## # A tibble: 6 x 5 ## country year cases population rate ## <chr> <int> <int> <int> <dbl> ## 1 Afghanistan 1999 745 19987071 3.73 ## 2 Afghanistan 2000 2666 20595360 12.9 ## 3 Brazil 1999 37737 172006362 21.9 ## 4 Brazil 2000 80488 174504898 46.1 ## 5 China 1999 212258 1272915272 16.7 ## 6 China 2000 213766 1280428583 16.7

Tidy data - tidyr To compare the rates using table1 table1 %>% mutate (rate = (cases / population) * 100000) %>% ggplot ( aes (x=year, y=rate)) + ggtitle ("Tuberculosis rates per 100,000") + geom_line ( aes (group = country, colour = country), lwd = 1.5) + geom_point ( aes (colour = country), size = 5) + scale_x_continuous (breaks= c (1999, 2000)) Tuberculosis rates per 100,000 40 country 30 Afghanistan rate Brazil 20 China 10 1999 2000 year How about beginning with table2 ?

Data analysis pipelines Reading and tidying tables R.W. Oldford - PowerPoint PPT Presentation

Data analysis pipelines Reading and tidying tables R.W. Oldford readr - importing rectangular data Often data are stored in a rectangular format of rows and columns, for example as a .csv file. The readr package of the tidyverse has a

Licensed Pipelines & the Planning System Council Briefing 2019 Critical Infrastructure

COMPLETED PIPELINES FT Completed Pipelines SNOWSWICK BLUNSDEN - 2019 Instalcom for Thames

UK COMPLETED PIPELINES FT Completed Pipelines WING PIPELINE ANGLIAN WATER 1000mm water

Planning Near Transmission Pipelines Planning Near Transmission Pipelines Meghan Thoreau, planner

CS 104 Computer Organization and Design Fancy Pipelines: not just scalar in-order CS104: Fancy

Pipelines on Pipelines: Creating Agile CI/CD Workflows for Airflow DAGs By Victor Shafran CPO

Pipelines and Informed Planning Alliance (PIPA) Pipelines and Informed Planning Alliance (PIPA)

Building highly reliable data pipelines @ Datadog Quentin FRANCOIS Team Lead, Data Engineering

Pipelines for data analysis in R Hadley Wickham @hadleywickham Chief Scientist, RStudio

Building Data Pipelines in Python Marco Bonzanini QCon London 2017 Nice to meet you R&D

Introduction to Data Pipelines CLEAN IN G DATA W ITH P YS PARK Mike Metzger Data Engineering

Greg Neiheisel CTO Astronomer Data Engineering Platform Streaming data Data pipelines Code

Building Data applications with Go from Bloom filters to Data pipelines Sergii Khomenko, Data

Symposium Co-locating Nuclear Plants with Natural Gas Pipelines Paul Blanch, Energy Consultant

Princeton Hydro LLC. Pipelines in the Landscape Both photographs attributed to Delaware

Safety of Gas Gathering Pipelines RIN: 2137-AF38 Docket: PHMSA 2011 0023 Gas Pipeline

Importing data into R Workshop 3 2 Objectives By doing this workshop and carrying out the

Maureen P. Walsh Open Repositories 2013 Charlottetown, PEI

Topics for today Introduction to R Graphics: Getting started with R g U i R t t fi

UNIX Data Tools Bualo Chapter 7 1 / 37 Overview In Chapter 3 we learned the basic operations

Data Modeling and Database Design Yuri Takhteyev Faculty of Information University of Toronto

Introduction to C Programming File Input/Output Waseda University Todays Topics

UHTS: Raw data Jacques Rougemont Bioinformatics and Biostatistics Core Facility EPFL Objectives

What if you need real-time read/write for large datasets? 2 Lecture based on these two books.