Tidying and exploring data: Workshop 5



SLIDE 1

Tidying and exploring data

Workshop 5

SLIDE 2

Objectives

By doing this workshop and carrying out the independent study the successful student will be able to:

  • Explain what is meant by ‘tidy data’
  • Devise reproducible strategies to tidy imported data

A short talk will outline some of the possibilities, followed by opportunities for you to apply and combine ideas. Remember to apply what you know about reproducibility.

SLIDE 3

Outline

Owes much to Hadley Wickham:

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:http://dx.doi.org/10.18637/jss.v059.i10

“tidy datasets are all alike but every messy dataset is messy in its own way”

Difficult to be comprehensive!

SLIDE 4

Cleaning, tidying and exploring = 80-85% of the work

  • Iterative
  • Cleaning (content): NAs, factor levels, variable names
  • Tidying (organisation): the key concept is variables in columns, cases in rows
  • Clean and tidy datasets are easy to work with
  • Exploring will help you see if the data are clean and tidy
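
As a rough sketch of the cleaning steps named above (variable names, NAs, factor levels) followed by a quick exploration, the example below uses a small made-up data frame. All names and values, including the use of -99 as a missing-value code, are assumptions for illustration only.

```r
# Hypothetical messy data: awkward names, -99 used as a missing-value code,
# and inconsistent spelling in what should be a factor
messy <- data.frame(
  "Plant Height" = c(12.1, -99, 15.3, 14.8),
  "Treat"        = c("control", "Control", "fert", "fert"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Cleaning: variable names
names(messy) <- c("height", "treatment")

# Cleaning: NAs (recode the -99 placeholder as a true missing value)
messy$height[messy$height == -99] <- NA

# Cleaning: factor levels (make the spelling consistent, then convert)
messy$treatment <- factor(tolower(messy$treatment))

# Exploring: check that the result looks clean
str(messy)
summary(messy)
```

Keeping these steps in a script, rather than editing the file by hand, is what makes the cleaning reproducible.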

SLIDE 5

What is tidy?

  • Each variable is in a named column
  • Each row is an observation
  • Easy to explore, plot, model and report
  • An easy way to think about data
  • Several powerful packages exist

SLIDE 6

Tidy buoy #44025 data

One example, indicative of the types of manipulation possible: the sophisticated way.

# read the first line in
vars <- readLines(file, n = 1)
vars
"#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"

Note that vars is a single string.

# we can split the line into separate strings using strsplit. Here we want
# to split on any number of white spaces. We also use unlist to store the
# result as a character vector instead of a list
coln <- unlist(strsplit(vars, split = "\\s+", fixed = FALSE))
coln
 [1] "#YY" "MM" "DD" "hh" "mm" "WDIR" "WSPD" "GST" "WVHT" "DPD" "APD" "MWD" "PRES" "ATMP"
[15] "WTMP" "DEWP" "VIS" "TIDE"
names(mydata) <- coln
str(mydata)
'data.frame': 6358 obs. of 18 variables:
 $ #YY : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ MM : int 12 1 1 1 1 1 1 1 1 1 ...
 $ DD : int 31 1 1 1 1 1 1 1 1 1 ...
 $ hh : int 23 0 1 2 3 4 5 6 7 8 ...

Key point: almost all things are scriptable... Google!
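
The header-reading approach can be tried end to end with a small stand-in file. This is only a sketch: the file contents, the units line and the step that strips the leading "#" are assumptions made for illustration, not part of the original buoy example.

```r
# Hypothetical stand-in for the buoy file: a header line starting with '#',
# a units line starting with '#', then whitespace-separated data
file <- tempfile(fileext = ".txt")
writeLines(c(
  "#YY  MM DD hh mm WDIR WSPD",
  "#yr  mo dy hr mn degT m/s",
  "2010 12 31 23 50 222  7.2",
  "2011 01 01 00 50 233  6.0"
), file)

# Read the data; by default read.table treats lines starting with '#'
# as comments and skips them
mydata <- read.table(file, header = FALSE)

# Read the first line separately and split it on runs of white space
vars <- readLines(file, n = 1)
coln <- unlist(strsplit(vars, split = "\\s+"))

# Optionally strip the leading '#' so the first name is easier to use
coln <- sub("^#", "", coln)

names(mydata) <- coln
str(mydata)
```

Because the names are taken from the file itself, the script keeps working if columns are added or reordered in a later download.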

SLIDE 7

Tidy buoy #44025 data

# less sophisticated alternative
names(mydata) <- c("YY", "MM", "DD", "hh", "mm", "WDIR", "WSPD", "GST",
                   "WVHT", "DPD", "APD", "MWD", "PRES", "ATMP", "WTMP",
                   "DEWP", "VIS", "TIDE")

Key point: almost all things are scriptable, even if you have to fudge it a bit. Be creative.

SLIDE 8

Useful tidying packages

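Untidy: one variable in several columns; multiple obs in a row

Tidy: biomass2 has one variable per column and one obs per row

Key point: TMTOWTDI (there's more than one way to do it)

# gather() still works but is superseded by pivot_longer() in current tidyr
library(tidyr)
biomass2 <- gather(data = biomass, fertiliser, mass)

library(reshape2)
biomass2 <- melt(biomass, measure.vars = 1:6)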
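
In the spirit of TMTOWTDI, the same wide-to-long step can also be sketched in base R with stack(), which needs no extra packages. The data frame below is a cut-down, made-up stand-in for biomass; its values and the choice of three columns are assumptions for illustration.

```r
# Hypothetical wide data: one column per fertiliser treatment
biomass <- data.frame(
  A = c(159.1, 146.2, 116.3),
  B = c(150.1, 154.4, 69.5),
  C = c(80.0, 266.4, 161.2)
)

# stack() puts all the values in one column ('values') and the name of
# the column each value came from in another ('ind', a factor)
biomass2 <- stack(biomass)
names(biomass2) <- c("mass", "fertiliser")

str(biomass2)
```

Each row of biomass2 is now a single observation: one mass and the fertiliser it belongs to.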

SLIDE 9

Untidy: data in rows and columns; one obs per cell

library(reshape2)
fungi2 <- melt(fungi, id.vars = "Temperature",
               measure.vars = c("A", "B", "C", "D"))

Tidy: fungi2 has one obs per row
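
When an identifying column such as Temperature has to be kept next to every measurement, base R's reshape() is one possible alternative to melt(). This is a sketch only: the fungi values are invented, and the output column names "variable" and "value" are chosen here to resemble melt()'s defaults.

```r
# Hypothetical wide data: growth of four strains (A-D) at each temperature
fungi <- data.frame(
  Temperature = c(10, 15, 20),
  A = c(1.2, 2.3, 3.1),
  B = c(0.9, 2.0, 2.8),
  C = c(1.5, 2.6, 3.4),
  D = c(1.1, 2.1, 3.0)
)

strains <- c("A", "B", "C", "D")
fungi2 <- reshape(
  fungi,
  direction = "long",
  varying   = strains,       # columns to gather into one
  v.names   = "value",       # name for the gathered measurements
  timevar   = "variable",    # name for the column holding A, B, C, D
  times     = strains,
  idvar     = "Temperature"  # kept alongside every measurement
)
rownames(fungi2) <- NULL

str(fungi2)
```

The result has 12 rows (3 temperatures by 4 strains), one per original cell.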

SLIDE 10

Untidy: data are not factors

Tidier: labelled variables converted to factors

> library(haven)
> mydata <- read_sav("../data/prac9a.sav")
> str(mydata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 6 variables:
 $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
  ..- attr(*, "label")= chr "Territory size (Ha)"
  ..- attr(*, "format.spss")= chr "F8.3"
 $ country :Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "label")= chr "Country"
  .. ..- attr(*, "format.spss")= chr "F8.0"
  .. ..- attr(*, "labels")= Named num [1:3] 1 2 3
  .. .. ..- attr(*, "names")= chr [1:3] "U.K" "France" "Germany"
 $ woodtype:Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "label")= chr "Wood Type"
  .. ..- attr(*, "format.spss")= chr "F8.0"
  .. ..- attr(*, "labels")= Named num [1:2] 1 2
  .. .. ..- attr(*, "names")= chr [1:2] "Deciduous" "Mixed"
 .... .... ....
> mydata$country <- as_factor(mydata$country)
> mydata$woodtype <- as_factor(mydata$woodtype)
> str(mydata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 6 variables:
 $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
  ..- attr(*, "label")= chr "Territory size (Ha)"
  ..- attr(*, "format.spss")= chr "F8.3"
 $ country : Factor w/ 3 levels "U.K","France",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Country"
 $ woodtype: Factor w/ 2 levels "Deciduous","Mixed": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Wood Type"

Example from the Importing data slides
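
If the codes arrive as plain numbers without the SPSS labels attached, the same idea can be sketched in base R with factor(). The vector below is made up; the labels are taken from the country variable shown in the output above.

```r
# Hypothetical numeric codes, as they might arrive without labels
country <- c(1, 1, 2, 3, 2)

# Attach the labels by converting to a factor
country <- factor(country,
                  levels = 1:3,
                  labels = c("U.K", "France", "Germany"))

str(country)
table(country)
```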

SLIDE 11

Adding and deleting rows and columns

To delete rows or columns, use selection (indexing).

# delete the second column
biomass3 <- biomass[, c(1, 3:6)]
# or
biomass3 <- biomass[, -2]
str(biomass3)
'data.frame': 10 obs. of 5 variables:
 $ WaterControl: num 350 324 359 255 208 ...
 $ B : num 150.1 154.4 69.5 150.7 212.6 ...
 $ C : num 80 266.4 161.2 161.4 51.2 ...
 $ D : num 267 110 221 160 198 ...
 $ E : num 350 320 359 255 208 ...

# delete the 2nd row and 5th row
biomass4 <- biomass[c(1, 3, 4, 6:10), ]
# or
biomass4 <- biomass[c(-2, -5), ]
# or commonly on a conditional statement
str(biomass4)
'data.frame': 8 obs. of 6 variables:
 $ WaterControl: num 350 359 255 326 295 ...
 $ A : num 159.1 116.3 135.2 81.8 115.7 ...
 $ B : num 150.1 69.5 150.7 144 149.8 ...
 $ C : num 80 161 161 184 176 ...
 $ D : num 267 221 160 270 224 ...
 $ E : num 350 359 255 326 295 ...
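
Deleting rows "on a conditional statement" could look something like the sketch below. The cut-down data frame and the threshold of 120 are assumptions for illustration.

```r
# Hypothetical cut-down version of the biomass data
biomass <- data.frame(
  WaterControl = c(350, 324, 359, 255),
  A            = c(159.1, 146.2, 116.3, 135.2)
)

# keep only the rows where A is above a (hypothetical) threshold of 120
biomass5 <- biomass[biomass$A > 120, ]

# equivalent, using subset()
biomass6 <- subset(biomass, A > 120)

str(biomass5)
```

The two forms differ when the condition is NA: logical indexing returns a row of NAs, whereas subset() drops the row.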

# adding a column
biomass$addedcol <- 1
# or
biomass$addedcol2 <- biomass$WaterControl - biomass$A
str(biomass)
'data.frame': 10 obs. of 8 variables:
 $ WaterControl: num 350 324 359 255 208 ...
 $ A : num 159 146 116 135 137 ...
 $ B : num 150.1 154.4 69.5 150.7 212.6 ...
 $ C : num 80 266.4 161.2 161.4 51.2 ...
 $ D : num 267 110 221 160 198 ...
 $ E : num 350 320 359 255 208 ...
 $ addedcol : num 1 1 1 1 1 1 1 1 1 1
 $ addedcol2 : num 190.7 178.2 242.2 120.1 71.9 ...

# for comparison: the original biomass data frame, before any columns were added
str(biomass)
'data.frame': 10 obs. of 6 variables:
 $ WaterControl: num 350 324 359 255 208 ...
 $ A : num 159 146 116 135 137 ...
 $ B : num 150.1 154.4 69.5 150.7 212.6 ...
 $ C : num 80 266.4 161.2 161.4 51.2 ...
 $ D : num 267 110 221 160 198 ...
 $ E : num 350 320 359 255 208 ...

TMTOWTDI

SLIDE 12

Additional useful functions

  • droplevels {base}: drops unused levels from a factor
  • is.na {base}: indicates which elements are missing
  • complete.cases {stats}: returns a logical vector indicating which cases are complete, i.e., have no missing values
  • Reordering factor levels:

seek1$hiqual <- factor(seek1$hiqual, levels = levels(seek1$hiqual)[c(5, 4, 1, 6, 2, 3)])
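
A minimal sketch of how these functions might be combined is given below. The data frame dat and its values are invented for illustration.

```r
# Hypothetical data with one missing value and a factor
dat <- data.frame(
  site = factor(c("north", "south", "east", "north")),
  mass = c(2.1, NA, 3.4, 2.9)
)

is.na(dat$mass)        # which elements are missing
complete.cases(dat)    # which rows have no missing values

# keep complete rows only; the factor still remembers the level "south"
dat2 <- dat[complete.cases(dat), ]
levels(dat2$site)      # all three levels still listed

# drop the levels that are no longer used
dat2 <- droplevels(dat2)
levels(dat2$site)

# reorder the remaining levels explicitly (the default order is alphabetical)
dat2$site <- factor(dat2$site, levels = c("north", "east"))
levels(dat2$site)
```

Giving the levels by name, as here, is less fragile than reordering by position if the set of levels changes later.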