Tidying and exploring data: Workshop 5
Objectives
By doing this workshop and carrying out the independent study the successful student will be able to:
- Explain what is meant by ‘tidy data’
- Devise reproducible strategies to tidy imported data
A short talk outlines some of the possibilities, followed by opportunities for you to apply and combine ideas.
Remember to apply what you know about reproducibility.
Outline
Owes much to Hadley Wickham:
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:http://dx.doi.org/10.18637/jss.v059.i10
“tidy datasets are all alike but every messy dataset is messy in its own way”
Difficult to be comprehensive!
Cleaning, tidying, exploring = 80-85%
Iterative
- Cleaning: content (NAs, factor levels, variable names)
- Tidying: organisation
Key concept: variables in columns, cases in rows
Clean and tidy datasets are easy to work with
Exploring will help you see if it’s clean and tidy
What is tidy?
Each variable is in a named column
Each row is an observation
Easy to explore, plot, model, report. Easy way to think about data. Several powerful packages exist.
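To make the definition concrete, here is a minimal sketch using a made-up data frame. The names `tidy`, `fertiliser` and `mass`, and all the values, are illustrative assumptions rather than workshop data. Because each variable is a named column and each row is an observation, a grouped summary needs only one call.

```r
# Hypothetical tidy data: one column per variable, one row per observation
tidy <- data.frame(
  fertiliser = rep(c("A", "B"), each = 3),
  mass       = c(10, 12, 14, 20, 22, 24)
)

# Mean mass for each fertiliser, referring to the columns by name
group_means <- aggregate(mass ~ fertiliser, data = tidy, FUN = mean)
print(group_means)
```

The same formula, `mass ~ fertiliser`, could be passed unchanged to plotting or modelling functions, which is much of the appeal of the tidy layout.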
Tidy buoy #44025 data
One example, indicative of the types of manipulation possible: the sophisticated way
# read the first line in
vars <- readLines(file, n = 1)
vars
"#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"
# vars is a single string
# we can split the line into separate strings using strsplit. Here we want to split on any number of
# white spaces. We also use unlist to store the result as a character vector instead of a list
coln <- unlist(strsplit(vars, split = "\\s+", fixed = FALSE))
coln
 [1] "#YY" "MM" "DD" "hh" "mm" "WDIR" "WSPD" "GST" "WVHT" "DPD" "APD" "MWD" "PRES" "ATMP"
[15] "WTMP" "DEWP" "VIS" "TIDE"
names(mydata) <- coln
str(mydata)
'data.frame': 6358 obs. of 18 variables:
 $ #YY : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ MM : int 12 1 1 1 1 1 1 1 1 1 ...
 $ DD : int 31 1 1 1 1 1 1 1 1 1 ...
 $ hh : int 23 0 1 2 3 4 5 6 7 8 ...
Key point: almost all things are scriptable ... Google!
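The slide does not show how `mydata` was read in, so the following is a self-contained sketch under stated assumptions: the file is a small temporary stand-in with invented values and only five columns, and the data are read with `read.table`, which by default treats lines starting with `#` as comments. Stripping the leading `#` from the first name is an optional extra step that is not in the slide.

```r
# Write a tiny stand-in file (hypothetical values) with a '#' header line
file <- tempfile(fileext = ".txt")
writeLines(c("#YY  MM DD hh WSPD",
             "2010 12 31 23 7.2",
             "2011  1  1  0 6.8"), file)

# Read the header line as a single string, then split on runs of whitespace
vars <- readLines(file, n = 1)
coln <- unlist(strsplit(vars, split = "\\s+"))

# Optional extra step (not in the slide): drop the leading '#'
coln <- sub("^#", "", coln)

# read.table skips lines starting with '#' by default (comment.char = "#")
mydata <- read.table(file)
names(mydata) <- coln
str(mydata)
```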
Tidy buoy #44025 data
# less sophisticated alternative
names(mydata) <- c("YY", "MM", "DD", "hh", "mm", "WDIR", "WSPD", "GST", "WVHT",
                   "DPD", "APD", "MWD", "PRES", "ATMP", "WTMP", "DEWP", "VIS", "TIDE")
Key point: almost all things are scriptable ... even if you have to fudge it a bit ... Be creative
Useful tidying packages
Untidy: one variable in several columns; multiple obs in a row
Tidy, using either package:
library(tidyr)
biomass2 <- gather(data = biomass, fertiliser, mass)
# or
library(reshape2)
biomass2 <- melt(biomass, measure.vars = 1:6)
Key point: TMTOWTDI (there's more than one way to do it)
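In the spirit of TMTOWTDI, a third route needs no extra packages: base R's `stack()`. This is a sketch with a made-up data frame; `biomass_demo` and its values are assumptions for illustration, and `stack()` is an alternative to the `gather` and `melt` calls above, not the method used in the workshop.

```r
# Hypothetical wide data: one column of masses per fertiliser treatment
biomass_demo <- data.frame(
  A = c(159.1, 146.0),
  B = c(150.1, 154.4),
  C = c(80.0, 266.4)
)

# stack() puts all values in one column and the source column name in another
biomass_long <- stack(biomass_demo)

# stack() names its columns "values" and "ind"; rename them to match the slide
names(biomass_long) <- c("mass", "fertiliser")
print(biomass_long)
```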
Untidy: data in rows and columns; one obs per cell
Tidy:
library(reshape2)
fungi2 <- melt(fungi, id.vars = "Temperature", measure.vars = c("A", "B", "C", "D"))
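When one column identifies the rows (here `Temperature`) and the rest hold measurements, base R's `reshape()` can do a similar job to `melt`. The sketch below uses an invented `fungi_demo` data frame; the output column names `strain` and `growth` are hypothetical choices, since the slides do not say what A to D represent.

```r
# Hypothetical wide data: one row per temperature, one column per group A-D
fungi_demo <- data.frame(
  Temperature = c(10, 20, 30),
  A = c(1.1, 2.1, 3.1),
  B = c(1.2, 2.2, 3.2),
  C = c(1.3, 2.3, 3.3),
  D = c(1.4, 2.4, 3.4)
)

groups <- c("A", "B", "C", "D")

# Long format: one row per Temperature/group combination
fungi_long <- reshape(
  fungi_demo,
  direction = "long",
  varying   = groups,        # columns to gather
  v.names   = "growth",      # name for the gathered values
  timevar   = "strain",      # name for the column recording the source
  times     = groups,        # labels to store in that column
  idvar     = "Temperature"
)
rownames(fungi_long) <- NULL
str(fungi_long)
```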
Untidy: data are not factors
> library(haven)
> mydata <- read_sav("../data/prac9a.sav")
> str(mydata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 6 variables:
 $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
  ..- attr(*, "label")= chr "Territory size (Ha)"
  ..- attr(*, "format.spss")= chr "F8.3"
 $ country :Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "label")= chr "Country"
  .. ..- attr(*, "format.spss")= chr "F8.0"
  .. ..- attr(*, "labels")= Named num [1:3] 1 2 3
  .. .. ..- attr(*, "names")= chr [1:3] "U.K" "France" "Germany"
 $ woodtype:Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "label")= chr "Wood Type"
  .. ..- attr(*, "format.spss")= chr "F8.0"
  .. ..- attr(*, "labels")= Named num [1:2] 1 2
  .. .. ..- attr(*, "names")= chr [1:2] "Deciduous" "Mixed"
 ....
 ....
 ....
Tidier: convert the labelled columns to factors
> mydata$country <- as_factor(mydata$country)
> mydata$woodtype <- as_factor(mydata$woodtype)
> str(mydata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 6 variables:
 $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
  ..- attr(*, "label")= chr "Territory size (Ha)"
  ..- attr(*, "format.spss")= chr "F8.3"
 $ country : Factor w/ 3 levels "U.K","France",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Country"
 $ woodtype: Factor w/ 2 levels "Deciduous","Mixed": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Wood Type"
Example from the Importing data slides
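If the codes arrive as a plain numeric column with no label attributes (for example, from a CSV export), base R's `factor()` gives the same kind of result. This is a sketch with an invented vector `country_code`; the labels are the ones shown in the `str` output above.

```r
# Hypothetical numeric codes with no label attributes attached
country_code <- c(1, 1, 2, 3, 2)

# Map each code to its label explicitly
country <- factor(country_code,
                  levels = c(1, 2, 3),
                  labels = c("U.K", "France", "Germany"))
str(country)
table(country)
```

Writing the mapping out in the script keeps the coding scheme visible and reproducible.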
Adding and deleting rows and columns
To delete, use selection
# delete the second column
biomass3 <- biomass[, c(1, 3:6)]
# or
biomass3 <- biomass[, -2]
str(biomass3)
'data.frame': 10 obs. of 5 variables:
 $ WaterControl: num 350 324 359 255 208 ...
 $ B : num 150.1 154.4 69.5 150.7 212.6 ...
 $ C : num 80 266.4 161.2 161.4 51.2 ...
 $ D : num 267 110 221 160 198 ...
 $ E : num 350 320 359 255 208 ...
# delete the 2nd row and 5th row
biomass4 <- biomass[c(1, 3, 4, 6:10), ]
# or
biomass4 <- biomass[c(-2, -5), ]
# or, commonly, using a conditional statement
str(biomass4)
'data.frame': 8 obs. of 6 variables:
 $ WaterControl: num 350 359 255 326 295 ...
 $ A : num 159.1 116.3 135.2 81.8 115.7 ...
 $ B : num 150.1 69.5 150.7 144 149.8 ...
 $ C : num 80 161 161 184 176 ...
 $ D : num 267 221 160 270 224 ...
 $ E : num 350 359 255 326 295 ...
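The slide mentions deleting rows with a conditional statement but does not show one. A minimal sketch, assuming a cut-down stand-in `biomass_demo` and an arbitrary threshold of 300:

```r
# Hypothetical stand-in for the biomass data (two columns only)
biomass_demo <- data.frame(
  WaterControl = c(350, 324, 359, 255, 208),
  A            = c(159, 146, 116, 135, 137)
)

# Keep only the rows that meet a condition; the threshold 300 is arbitrary
biomass_high <- biomass_demo[biomass_demo$WaterControl > 300, ]

# subset() is an equivalent form that is often easier to read
biomass_high2 <- subset(biomass_demo, WaterControl > 300)

str(biomass_high)
```

Selecting on a condition rather than on row numbers means the script still does the right thing if rows are added to or removed from the data file.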
# adding a column
biomass$addedcol <- 1
# or
biomass$addedcol2 <- biomass$WaterControl - biomass$A
str(biomass)
'data.frame': 10 obs. of 8 variables:
 $ WaterControl: num 350 324 359 255 208 ...
 $ A : num 159 146 116 135 137 ...
 $ B : num 150.1 154.4 69.5 150.7 212.6 ...
 $ C : num 80 266.4 161.2 161.4 51.2 ...
 $ D : num 267 110 221 160 198 ...
 $ E : num 350 320 359 255 208 ...
 $ addedcol : num 1 1 1 1 1 1 1 1 1 1
 $ addedcol2 : num 190.7 178.2 242.2 120.1 71.9 ...
# for comparison, the original data before the columns were added
str(biomass)
'data.frame': 10 obs. of 6 variables:
 $ WaterControl: num 350 324 359 255 208 ...
 $ A : num 159 146 116 135 137 ...
 $ B : num 150.1 154.4 69.5 150.7 212.6 ...
 $ C : num 80 266.4 161.2 161.4 51.2 ...
 $ D : num 267 110 221 160 198 ...
 $ E : num 350 320 359 255 208 ...
TMTOWTDI
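The slide covers deleting rows and columns and adding columns; for completeness, here is a sketch of adding a row with `rbind()` and of deleting a column by name. The data frame `biomass_demo`, the column name `diff`, and all values are invented for illustration.

```r
# Hypothetical stand-in with two rows
biomass_demo <- data.frame(WaterControl = c(350, 324), A = c(159, 146))

# Add a row: the new row needs the same column names as the data frame
new_row <- data.frame(WaterControl = 359, A = 116)
biomass_demo <- rbind(biomass_demo, new_row)

# Add a calculated column, then delete a column by name by assigning NULL
biomass_demo$diff <- biomass_demo$WaterControl - biomass_demo$A
biomass_demo$A <- NULL

str(biomass_demo)
```

Deleting by name rather than by position is less likely to break if the column order changes.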
Additional useful functions
droplevels {base}: drops unused levels from a factor
is.na {base}: indicates which elements are missing
complete.cases {stats}: returns a logical vector indicating which cases are complete, i.e., have no missing values
Reordering factor levels:
seek1$hiqual <- factor(seek1$hiqual, levels = levels(seek1$hiqual)[c(5, 4, 1, 6, 2, 3)])
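The four functions above can be combined in a short cleaning sequence. This is a sketch on an invented data frame `survey`; only the column name `hiqual` echoes the slide, and its levels here are made up.

```r
# Hypothetical data with missing values in both columns
survey <- data.frame(
  hiqual = factor(c("degree", "none", "school", NA, "degree")),
  age    = c(34, NA, 51, 28, 45)
)

# is.na(): which ages are missing?
is.na(survey$age)

# complete.cases(): keep only rows with no missing values in any column
complete <- survey[complete.cases(survey), ]

# "none" no longer occurs in the data but is still a level; drop it
complete$hiqual <- droplevels(complete$hiqual)

# Reorder the remaining levels by indexing levels(), as in the slide
complete$hiqual <- factor(complete$hiqual,
                          levels = levels(complete$hiqual)[c(2, 1)])
levels(complete$hiqual)
```

Reordering levels changes the order used in plots and tables without changing the data values themselves.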