workshop 2 4 data manipulation
play

Workshop 2.4: Data manipulation Murray Logan April 9, 2016 Table - PDF document

-1- Workshop 2.4: Data manipulation Murray Logan April 9, 2016 Table of contents 1 Data manipulation 1 2 Sorting data 2 3 Manipulating factors 7 4 Filtering 10 5 Reshaping data 15 6 Merging data 19 7 VLOOKUP 24 8


  1. -1- Workshop 2.4: Data manipulation Murray Logan April 9, 2016 Table of contents 1 Data manipulation 1 2 Sorting data 2 3 Manipulating factors 7 4 Filtering 10 5 Reshaping data 15 6 Merging data 19 7 VLOOKUP 24 8 Aggregating data (pivot tables) 26 9 Transformations 36 1. Data manipulation 1.1. Important data manipulation libraries Task Function Package Sorting base order() plyr arrange() Re-ordering factor levels base factor(,levels=) gdata reorder(,new.order=) Re-labelling base factor(,lab=) car recode() plyr revalue(,replace=) Re-naming columns base colnames() dplyr rename(,replace=) Filtering/Subsetting indexing base base subset(,subset=,select=) select(,...) dplyr Reshaping data melt() , cast() reshape2 1.2. Important data manipulation libraries Task Function Package Merging/joining base merge() plyr join() Aggregating base tapply()

  2. -2- Task Function Package ddply() plyr reshape2 cast() doBy summaryBy() Transformations transform() , within() base plyr mutate() Adding columns base within() mutate() plyr 1.3. The grammar of data manipulation 1.3.1. Verbs • arrange() - sorting data • select() - subset columns • filter() - subset rows • mutate() - adding colums • summarise() - aggregate 1.4. Data files > source("http://www.flutterbys.com.au/stats/downloads/ + data/manipulationDatasets.R") > #OR > load(url("http://www.flutterbys.com.au/stats/downloads/ + data/manipulationDatasets.RData")) Plot Cond Between Temp LAT LONG P1 H A1 15.74 17.26 146.2 P2 H A1 23.84 14.07 144.9 P3 H A1 13.64 20.75 144.7 P4 H A2 37.95 18.41 142.1 P1 M A2 25.3 18.47 144 P2 M A2 13.8 20.39 145.8 P3 M A3 26.87 20.14 147.7 P4 M A3 29.38 19.69 144.8 P1 L A3 27.76 20.34 145.8 P2 L A4 18.95 20.06 144.9 P3 L A4 37.12 18.65 142.2 P4 L A4 25.9 14.52 144.2 2. Sorting data

  3. -3- 2.1. Sorting data Sorting by LAT > order(data.1$LAT) [1] 2 12 1 4 5 11 8 10 7 9 6 3 > data.1[order(data.1$LAT),] Plot Cond Between Temp LAT 2 P2 H A1 23.83643 14.07060 12 P4 L A4 25.89843 14.52130 1 P1 H A1 15.73546 17.25752 4 P4 H A2 37.95281 18.41013 5 P1 M A2 25.29508 18.46762 11 P3 L A4 37.11781 18.64913 8 P4 M A3 29.38325 19.68780 10 P2 L A4 18.94612 20.06427 7 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 6 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 LONG 2 144.8877 12 144.1700 1 146.2397 4 142.0585 5 144.0437 11 142.2459 8 144.7944 10 144.8924 7 147.7174 9 145.7753 6 145.8359 3 144.6884 2.2. Sorting data Sorting by LAT > #OR > library(plyr) #OR library(dplyr) > arrange(data.1, LAT) Plot Cond Between Temp LAT 1 P2 H A1 23.83643 14.07060 2 P4 L A4 25.89843 14.52130 3 P1 H A1 15.73546 17.25752 4 P4 H A2 37.95281 18.41013 5 P1 M A2 25.29508 18.46762 6 P3 L A4 37.11781 18.64913 7 P4 M A3 29.38325 19.68780 8 P2 L A4 18.94612 20.06427 9 P3 M A3 26.87429 20.14244 10 P1 L A3 27.75781 20.33795

  4. -4- 11 P2 M A2 13.79532 20.38767 12 P3 H A1 13.64371 20.74986 LONG 1 144.8877 2 144.1700 3 146.2397 4 142.0585 5 144.0437 6 142.2459 7 144.7944 8 144.8924 9 147.7174 10 145.7753 11 145.8359 12 144.6884 2.3. Sorting data Sorting by Cond and then TEMP > library(plyr) #OR library(dplyr) > arrange(data.1, Cond,Temp) Plot Cond Between Temp LAT 1 P3 H A1 13.64371 20.74986 2 P1 H A1 15.73546 17.25752 3 P2 H A1 23.83643 14.07060 4 P4 H A2 37.95281 18.41013 5 P2 L A4 18.94612 20.06427 6 P4 L A4 25.89843 14.52130 7 P1 L A3 27.75781 20.33795 8 P3 L A4 37.11781 18.64913 9 P2 M A2 13.79532 20.38767 10 P1 M A2 25.29508 18.46762 11 P3 M A3 26.87429 20.14244 12 P4 M A3 29.38325 19.68780 LONG 1 144.6884 2 146.2397 3 144.8877 4 142.0585 5 144.8924 6 144.1700 7 145.7753 8 142.2459 9 145.8359 10 144.0437 11 147.7174 12 144.7944 > #OR > data.1[order(data.1$Cond,data.1$Temp),] Plot Cond Between Temp LAT 3 P3 H A1 13.64371 20.74986 1 P1 H A1 15.73546 17.25752

  5. -5- 2 P2 H A1 23.83643 14.07060 4 P4 H A2 37.95281 18.41013 10 P2 L A4 18.94612 20.06427 12 P4 L A4 25.89843 14.52130 9 P1 L A3 27.75781 20.33795 11 P3 L A4 37.11781 18.64913 6 P2 M A2 13.79532 20.38767 5 P1 M A2 25.29508 18.46762 7 P3 M A3 26.87429 20.14244 8 P4 M A3 29.38325 19.68780 LONG 3 144.6884 1 146.2397 2 144.8877 4 142.0585 10 144.8924 12 144.1700 9 145.7753 11 142.2459 6 145.8359 5 144.0437 7 147.7174 8 144.7944 2.4. Sorting data Sort by the sum of Temp and LAT > library(plyr) > arrange(data.1,Temp+LAT) Plot Cond Between Temp LAT 1 P1 H A1 15.73546 17.25752 2 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 4 P2 H A1 23.83643 14.07060 5 P2 L A4 18.94612 20.06427 6 P4 L A4 25.89843 14.52130 7 P1 M A2 25.29508 18.46762 8 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 10 P4 M A3 29.38325 19.68780 11 P3 L A4 37.11781 18.64913 12 P4 H A2 37.95281 18.41013 LONG 1 146.2397 2 145.8359 3 144.6884 4 144.8877 5 144.8924 6 144.1700 7 144.0437 8 147.7174 9 145.7753 10 144.7944 11 142.2459 12 142.0585

  6. -6- > #OR > data.1[order(data.1$Temp+data.1$LAT),] Plot Cond Between Temp LAT 1 P1 H A1 15.73546 17.25752 6 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 2 P2 H A1 23.83643 14.07060 10 P2 L A4 18.94612 20.06427 12 P4 L A4 25.89843 14.52130 5 P1 M A2 25.29508 18.46762 7 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 8 P4 M A3 29.38325 19.68780 11 P3 L A4 37.11781 18.64913 4 P4 H A2 37.95281 18.41013 LONG 1 146.2397 6 145.8359 3 144.6884 2 144.8877 10 144.8924 12 144.1700 5 144.0437 7 147.7174 9 145.7753 8 144.7944 11 142.2459 4 142.0585 2.5. Your turn • sort by Condition and then the ratio of Temp to LAT 2.6. Your turn • sort by Condition and then the ratio of Temp to LAT > library(plyr) > arrange(data.1,Cond,Temp/LAT) Plot Cond Between Temp LAT LONG 1 P3 H A1 13.64371 20.74986 144.6884 2 P1 H A1 15.73546 17.25752 146.2397 3 P2 H A1 23.83643 14.07060 144.8877 4 P4 H A2 37.95281 18.41013 142.0585 5 P2 L A4 18.94612 20.06427 144.8924 6 P1 L A3 27.75781 20.33795 145.7753 7 P4 L A4 25.89843 14.52130 144.1700 8 P3 L A4 37.11781 18.64913 142.2459 9 P2 M A2 13.79532 20.38767 145.8359 10 P3 M A3 26.87429 20.14244 147.7174 11 P1 M A2 25.29508 18.46762 144.0437 12 P4 M A3 29.38325 19.68780 144.7944

  7. -7- > #OR > data.1[order(data.1$Cond, data.1$Temp/data.1$LAT),] Plot Cond Between Temp LAT LONG 3 P3 H A1 13.64371 20.74986 144.6884 1 P1 H A1 15.73546 17.25752 146.2397 2 P2 H A1 23.83643 14.07060 144.8877 4 P4 H A2 37.95281 18.41013 142.0585 10 P2 L A4 18.94612 20.06427 144.8924 9 P1 L A3 27.75781 20.33795 145.7753 12 P4 L A4 25.89843 14.52130 144.1700 11 P3 L A4 37.11781 18.64913 142.2459 6 P2 M A2 13.79532 20.38767 145.8359 7 P3 M A3 26.87429 20.14244 147.7174 5 P1 M A2 25.29508 18.46762 144.0437 8 P4 M A3 29.38325 19.68780 144.7944 3. Manipulating factors 3.1. Re-levelling (sorting) factors (data.2) Plot Cond Between Temp P1 H A1 15.74 P2 H A1 23.84 P3 H A1 13.64 P4 H A2 37.95 P1 M A2 25.3 P2 M A2 13.8 P3 M A3 26.87 P4 M A3 29.38 P1 L A3 27.76 P2 L A4 18.95 P3 L A4 37.12 P4 L A4 25.9 > levels(data.2$Cond) [1] "H" "L" "M" 3.2. Re-levelling (sorting) factors > data.2$Cond <- factor(data.2$Cond, levels=c("L","M","H")) > #OR > library(gdata) > data.2$Cond <- reorder(data.2$Cond, new.order=c("L","M","H")) > > levels(data.2$Cond)

  8. -8- [1] "L" "M" "H" 3.3. Re-labelling factors > #create a copy of data2 and call it data.3 > data.3 <- data.2 > #reorder AND rename the factor levels > data.3$Cond <- factor(data.3$Cond, levels=c("L","M","H"), + lab=c("Low","Medium","High")) > data.3 Plot Cond Between Temp 1 P1 High A1 15.73546 2 P2 High A1 23.83643 3 P3 High A1 13.64371 4 P4 High A2 37.95281 5 P1 Medium A2 25.29508 6 P2 Medium A2 13.79532 7 P3 Medium A3 26.87429 8 P4 Medium A3 29.38325 9 P1 Low A3 27.75781 10 P2 Low A4 18.94612 11 P3 Low A4 37.11781 12 P4 Low A4 25.89843 3.4. Re-labelling factors > #restore data.3 from the original data.2 again > data.3 <- data.2 > library(plyr) > data.3$Cond <- revalue(data.3$Cond,c("L"="Low", + "M"="Medium","H"="High")) > data.3 Plot Cond Between Temp 1 P1 High A1 15.73546 2 P2 High A1 23.83643 3 P3 High A1 13.64371 4 P4 High A2 37.95281 5 P1 Medium A2 25.29508 6 P2 Medium A2 13.79532 7 P3 Medium A3 26.87429 8 P4 Medium A3 29.38325 9 P1 Low A3 27.75781 10 P2 Low A4 18.94612 11 P3 Low A4 37.11781 12 P4 Low A4 25.89843 3.5. Re-naming columns (vectors) 3.5.1. By position > data.3 <- data.2 > colnames(data.3) <- c("Plot","Condition","Treatment", + "Temperature") > data.3 Plot Condition Treatment Temperature

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend