Tidying and exploring data, Workshop 5 - PowerPoint PPT Presentation

1 Tidying and exploring data Workshop 5

2 Objectives
By doing this workshop and carrying out the independent study, the successful student will be able to:
- Explain what is meant by ‘tidy data’
- Devise reproducible strategies to tidy imported data
A short talk outlines some of the possibilities, followed by opportunities for you to apply and combine ideas.
Remember to apply what you know about reproducibility.

3 Outline
Owes much to Hadley Wickham:
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:http://dx.doi.org/10.18637/jss.v059.i10
“tidy datasets are all alike but every messy dataset is messy in its own way”
Difficult to be comprehensive!

4 Cleaning, tidying, exploring = 80-85%
Iterative
- Cleaning (content): NAs, factor levels, variable names
- Tidying (organisation): key concept is variables in columns, cases in rows
Clean and tidy datasets are easy to work with.
Exploring will help you see if it’s clean and tidy.
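As a rough illustration of how exploring feeds back into cleaning, the sketch below uses a small made-up data frame (the `survey` object, its column names and its values are all hypothetical, not from the workshop data). It shows one possible sequence: look at structure, level counts and missing values, then fix variable names and inconsistent labels.

```r
# Hypothetical data with typical problems: an awkward variable name,
# inconsistent labels ("male" vs "Male") and missing values.
survey <- data.frame(
  "Body Mass (g)" = c(21.5, NA, 19.8, 22.1, NA),
  sex = c("male", "Male", "female", "female", "male"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Explore: structure, counts per label, missing values per column
str(survey)
table(survey$sex)
na_counts <- colSums(is.na(survey))
na_counts

# Clean: simple variable names and consistent factor levels
names(survey) <- c("mass_g", "sex")
survey$sex <- factor(tolower(survey$sex))
levels(survey$sex)
```

Running `table()` before and after the clean-up is a quick way to confirm that the stray "Male" label has gone.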

5 What is tidy?
- Each variable is in a named column
- Each row is an observation
Easy to explore, plot, model, report.
Easy way to think about data.
Several powerful packages exist.

6 Tidy buoy #44025 data
One example, indicative of the types of manipulation possible - the sophisticated way

    # read the first line in
    vars <- readLines(file, n = 1)
    vars
    "#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"
    # (a single string)

    # we can split the line into separate strings using strsplit.
    # Here we want to split on any number of white spaces.
    # We also use unlist to store the result as a character vector instead of a list
    coln <- unlist(strsplit(vars, split = "\\s+", fixed = FALSE))
    coln
     [1] "#YY"  "MM"   "DD"   "hh"   "mm"   "WDIR" "WSPD" "GST"  "WVHT" "DPD"  "APD"  "MWD"  "PRES" "ATMP"
    [15] "WTMP" "DEWP" "VIS"  "TIDE"

    names(mydata) <- coln
    str(mydata)
    'data.frame': 6358 obs. of 18 variables:
     $ #YY : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
     $ MM  : int 12 1 1 1 1 1 1 1 1 1 ...
     $ DD  : int 31 1 1 1 1 1 1 1 1 1 ...
     $ hh  : int 23 0 1 2 3 4 5 6 7 8 ...

Key point: almost all things are scriptable ... Google!
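The header-splitting approach above can be tried end to end on a small stand-in file. This is a minimal sketch under stated assumptions: the file contents, the `buoy_file` name and the use of `read.table()` with `skip = 1` are illustrative choices, not the workshop's actual import step, and stripping the leading "#" from the first name is an optional extra.

```r
# Write a small stand-in file: one header line starting with "#",
# then whitespace-separated values. The values are hypothetical.
buoy_file <- tempfile(fileext = ".txt")
writeLines(c(
  "#YY  MM DD hh mm WDIR WSPD",
  "2010 12 31 23 50  222  7.2",
  "2011  1  1  0 50  233  6.0"
), buoy_file)

# Read the header line and split it on runs of white space
vars <- readLines(buoy_file, n = 1)
coln <- unlist(strsplit(vars, split = "\\s+"))

# Optional: strip the leading "#" so that every name is syntactic
coln <- sub("^#", "", coln)

# Read the data, skipping the header line, then attach the names
mydata <- read.table(buoy_file, skip = 1, header = FALSE)
names(mydata) <- coln
str(mydata)
```

Removing the "#" means the first column can be used as `mydata$YY` without backticks.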

7 Tidy buoy #44025 data
Key point: almost all things are scriptable ... even if you have to fudge it a bit ... Be creative

    # less sophisticated alternative
    names(mydata) <- c("YY", "MM", "DD", "hh", "mm", "WDIR", "WSPD", "GST",
                       "WVHT", "DPD", "APD", "MWD", "PRES", "ATMP",
                       "WTMP", "DEWP", "VIS", "TIDE")

8 Useful tidying packages
Untidy: one variable in several columns; multiple obs in a row
Tidy:

    library(tidyr)
    biomass2 <- gather(data = biomass, fertiliser, mass)

    library(reshape2)
    biomass2 <- melt(biomass, measure.vars = 1:6)

Key point: TMTOWTDI (there's more than one way to do it)
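For comparison, the same wide-to-long step can be sketched in base R with `stack()`, which needs no extra packages. This swaps in a different technique from the `gather()` and `melt()` calls on the slide; the data values and the `biomass_wide` and `biomass_long` names are hypothetical. In current versions of tidyr, `pivot_longer()` is the newer interface that supersedes `gather()`.

```r
# Hypothetical wide data: one column of plant mass per fertiliser treatment
biomass_wide <- data.frame(
  WaterControl = c(350, 324, 359),
  A = c(159, 146, 116),
  B = c(150, 154, 70)
)

# stack() puts all the values in one column ("values") and the name of
# the column each value came from in another ("ind")
biomass_long <- stack(biomass_wide)
names(biomass_long) <- c("mass", "fertiliser")
biomass_long <- biomass_long[, c("fertiliser", "mass")]
str(biomass_long)
```

Each row of `biomass_long` is now a single observation: one fertiliser treatment and one mass.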

9 Untidy: data in rows and columns; one obs per cell
Tidy:

    library(reshape2)
    fungi2 <- melt(fungi, id.vars = "Temperature",
                   measure.vars = c("A", "B", "C", "D"))
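When one column identifies the row (here `Temperature`) and the others hold the measurements, base R's `reshape()` can do a similar job to `melt()` with `id.vars`. The sketch below is an alternative under stated assumptions: the data values and the `growth` and `strain` column names are hypothetical, and `idvar` requires each `Temperature` value to appear only once in the wide data.

```r
# Hypothetical measurements: one row per temperature, one column per strain
fungi <- data.frame(
  Temperature = c(10, 20, 30),
  A = c(1.2, 2.3, 3.1),
  B = c(0.9, 2.0, 2.8),
  C = c(1.5, 2.6, 3.3),
  D = c(1.1, 2.2, 2.9)
)

strains <- c("A", "B", "C", "D")
fungi_long <- reshape(
  fungi,
  direction = "long",
  idvar = "Temperature",  # identifies each row of the wide data
  varying = strains,      # columns to be stacked
  v.names = "growth",     # name for the stacked values (hypothetical)
  timevar = "strain",     # name for the column recording the source column
  times = strains
)
rownames(fungi_long) <- NULL
fungi_long
```

The result has one row per temperature and strain combination, so every cell of the original grid becomes its own observation.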

10 Untidy: data are not factors
Example from the Importing data slides (read_sav() and as_factor() come from the haven package)

    > mydata <- read_sav("../data/prac9a.sav")
    > str(mydata)
    Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 6 variables:
     $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
      ..- attr(*, "label")= chr "Territory size (Ha)"
      ..- attr(*, "format.spss")= chr "F8.3"
     $ country :Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
      .. ..- attr(*, "label")= chr "Country"
      .. ..- attr(*, "format.spss")= chr "F8.0"
      .. ..- attr(*, "labels")= Named num [1:3] 1 2 3
      .. .. ..- attr(*, "names")= chr [1:3] "U.K" "France" "Germany"
     $ woodtype:Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
      .. ..- attr(*, "label")= chr "Wood Type"
      .. ..- attr(*, "format.spss")= chr "F8.0"
      .. ..- attr(*, "labels")= Named num [1:2] 1 2
      .. .. ..- attr(*, "names")= chr [1:2] "Deciduous" "Mixed"
     .... .... ....

Tidier

    > mydata$country <- as_factor(mydata$country)
    > mydata$woodtype <- as_factor(mydata$woodtype)
    > str(mydata)
    Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 6 variables:
     $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
      ..- attr(*, "label")= chr "Territory size (Ha)"
      ..- attr(*, "format.spss")= chr "F8.3"
     $ country : Factor w/ 3 levels "U.K","France",..: 1 1 1 1 1 1 1 1 1 1 ...
      ..- attr(*, "label")= chr "Country"
     $ woodtype: Factor w/ 2 levels "Deciduous","Mixed": 1 1 1 1 1 1 1 1 1 1 ...
      ..- attr(*, "label")= chr "Wood Type"
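If the data arrive as plain numeric codes without attached labels, the same tidying can be sketched with base R's `factor()`, supplying the codes and labels by hand. The code vectors below are hypothetical; the labels mirror those shown in the str() output above.

```r
# Hypothetical numeric codes, as they might arrive from a statistics package
country_code  <- c(1, 1, 2, 3, 2)
woodtype_code <- c(1, 2, 2, 1, 1)

# Attach the labels explicitly: levels are the codes, labels are the names
country  <- factor(country_code,  levels = 1:3,
                   labels = c("U.K", "France", "Germany"))
woodtype <- factor(woodtype_code, levels = 1:2,
                   labels = c("Deciduous", "Mixed"))

str(country)
table(country, woodtype)
```

Writing the code-to-label mapping in the script keeps the step reproducible, although it has to be kept in step with the source file by hand.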

11 Adding and deleting rows and columns
To delete, use selection. TMTOWTDI

    str(biomass)
    'data.frame': 10 obs. of 6 variables:
     $ WaterControl: num 350 324 359 255 208 ...
     $ A           : num 159 146 116 135 137 ...
     $ B           : num 150.1 154.4 69.5 150.7 212.6 ...
     $ C           : num 80 266.4 161.2 161.4 51.2 ...
     $ D           : num 267 110 221 160 198 ...
     $ E           : num 350 320 359 255 208 ...

    # delete the second column
    biomass3 <- biomass[,c(1,3:6)]
    # or
    biomass3 <- biomass[,-2]
    str(biomass3)
    'data.frame': 10 obs. of 5 variables:
     $ WaterControl: num 350 324 359 255 208 ...
     $ B           : num 150.1 154.4 69.5 150.7 212.6 ...
     $ C           : num 80 266.4 161.2 161.4 51.2 ...
     $ D           : num 267 110 221 160 198 ...
     $ E           : num 350 320 359 255 208 ...

    # delete the 2nd row and 5th row
    biomass4 <- biomass[c(1,3,4,6:10),]
    # or
    biomass4 <- biomass[c(-2,-5),]
    # or commonly on a conditional statement
    str(biomass4)
    'data.frame': 8 obs. of 6 variables:
     $ WaterControl: num 350 359 255 326 295 ...
     $ A           : num 159.1 116.3 135.2 81.8 115.7 ...
     $ B           : num 150.1 69.5 150.7 144 149.8 ...
     $ C           : num 80 161 161 184 176 ...
     $ D           : num 267 221 160 270 224 ...
     $ E           : num 350 359 255 326 295 ...

    # adding a column
    biomass$addedcol <- 1
    # or
    biomass$addedcol2 <- biomass$WaterControl - biomass$A
    str(biomass)
    'data.frame': 10 obs. of 8 variables:
     $ WaterControl: num 350 324 359 255 208 ...
     $ A           : num 159 146 116 135 137 ...
     $ B           : num 150.1 154.4 69.5 150.7 212.6 ...
     $ C           : num 80 266.4 161.2 161.4 51.2 ...
     $ D           : num 267 110 221 160 198 ...
     $ E           : num 350 320 359 255 208 ...
     $ addedcol    : num 1 1 1 1 1 1 1 1 1 1
     $ addedcol2   : num 190.7 178.2 242.2 120.1 71.9 ...
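The slide mentions deleting rows "on a conditional statement" without showing it. One way this might look is sketched below; the data values, the thresholds and the `biomass_sub`, `biomass_sub2` and `high_B` names are hypothetical.

```r
# Hypothetical data in the same shape as the biomass example
biomass <- data.frame(
  WaterControl = c(350, 324, 359, 255, 208),
  A = c(159, 146, 116, 135, 137),
  B = c(150, 154, 70, 151, 213)
)

# Keep only the rows where A is at least 130 (all other rows are dropped)
biomass_sub <- biomass[biomass$A >= 130, ]

# subset() expresses the same selection and is convenient interactively
biomass_sub2 <- subset(biomass, A >= 130)

# Add a column whose value depends on a condition
biomass$high_B <- ifelse(biomass$B > 150, "high", "low")
biomass
```

One difference worth knowing: if the condition evaluates to NA for some rows, logical indexing with `[` returns rows of NAs for them, whereas `subset()` drops them.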

12 Additional useful functions
- droplevels {base}: used to drop unused levels from a factor
- is.na {base}: indicates which elements are missing
- complete.cases {stats}: returns a logical vector indicating which cases are complete, i.e., have no missing values
Reordering factor levels:

    seek1$hiqual <- factor(seek1$hiqual, levels(seek1$hiqual)[c(5,4,1,6,2,3)])
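The functions listed above can be combined in a short clean-up sequence. The sketch below uses a made-up data frame (`d`, `qual` and `score` are hypothetical names and values) and follows the slide's pattern for reordering levels by position.

```r
# Hypothetical data with a missing value and a factor
d <- data.frame(
  qual = factor(c("degree", "none", "school", "degree", "school")),
  score = c(12, NA, 9, 15, 11)
)

is.na(d$score)                 # which scores are missing
complete <- complete.cases(d)  # which rows have no missing values
d2 <- d[complete, ]

levels(d2$qual)                # "none" is still a level, although unused
d2 <- droplevels(d2)
levels(d2$qual)                # the unused level has been dropped

# Reorder the levels by position, as in the slide's example
d2$qual <- factor(d2$qual, levels(d2$qual)[c(2, 1)])
levels(d2$qual)
```

Reordering by position is compact but fragile: if the levels change, the index vector silently points at different levels, so naming the levels explicitly can be safer.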
