Workshop 2.4: Data manipulation Murray Logan April 9, 2016 Table - - PDF document

workshop 2 4 data manipulation
SMART_READER_LITE
LIVE PREVIEW

Workshop 2.4: Data manipulation Murray Logan April 9, 2016 Table - - PDF document

-1- Workshop 2.4: Data manipulation Murray Logan April 9, 2016 Table of contents 1 Data manipulation 1 2 Sorting data 2 3 Manipulating factors 7 4 Filtering 10 5 Reshaping data 15 6 Merging data 19 7 VLOOKUP 24 8


slide-1
SLIDE 1
  • 1-

Workshop 2.4: Data manipulation

Murray Logan

April 9, 2016

Table of contents

1 Data manipulation 1 2 Sorting data 2 3 Manipulating factors 7 4 Filtering 10 5 Reshaping data 15 6 Merging data 19 7 VLOOKUP 24 8 Aggregating data (pivot tables) 26 9 Transformations 36

  • 1. Data manipulation

1.1. Important data manipulation libraries

Task Function Package Sorting

  • rder()

base arrange() plyr Re-ordering factor levels factor(,levels=) base reorder(,new.order=) gdata Re-labelling factor(,lab=) base recode() car revalue(,replace=) plyr Re-naming columns colnames() base rename(,replace=) dplyr Filtering/Subsetting indexing base subset(,subset=,select=) base select(,...) dplyr Reshaping data melt(), cast() reshape2

1.2. Important data manipulation libraries

Task Function Package Merging/joining merge() base join() plyr Aggregating tapply() base

slide-2
SLIDE 2
  • 2-

Task Function Package ddply() plyr cast() reshape2 summaryBy() doBy Transformations transform(), within() base mutate() plyr Adding columns within() base mutate() plyr

1.3. The grammar of data manipulation

1.3.1. Verbs

  • arrange() - sorting data
  • select() - subset columns
  • filter() - subset rows
  • mutate() - adding colums
  • summarise() - aggregate

1.4. Data files

> source("http://www.flutterbys.com.au/stats/downloads/ + data/manipulationDatasets.R") > #OR > load(url("http://www.flutterbys.com.au/stats/downloads/ + data/manipulationDatasets.RData"))

Plot Cond Between Temp LAT LONG P1 H A1 15.74 17.26 146.2 P2 H A1 23.84 14.07 144.9 P3 H A1 13.64 20.75 144.7 P4 H A2 37.95 18.41 142.1 P1 M A2 25.3 18.47 144 P2 M A2 13.8 20.39 145.8 P3 M A3 26.87 20.14 147.7 P4 M A3 29.38 19.69 144.8 P1 L A3 27.76 20.34 145.8 P2 L A4 18.95 20.06 144.9 P3 L A4 37.12 18.65 142.2 P4 L A4 25.9 14.52 144.2

  • 2. Sorting data
slide-3
SLIDE 3
  • 3-

2.1. Sorting data

Sorting by LAT

> order(data.1$LAT)

[1] 2 12 1 4 5 11 8 10 7 9 6 3

> data.1[order(data.1$LAT),]

Plot Cond Between Temp LAT 2 P2 H A1 23.83643 14.07060 12 P4 L A4 25.89843 14.52130 1 P1 H A1 15.73546 17.25752 4 P4 H A2 37.95281 18.41013 5 P1 M A2 25.29508 18.46762 11 P3 L A4 37.11781 18.64913 8 P4 M A3 29.38325 19.68780 10 P2 L A4 18.94612 20.06427 7 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 6 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 LONG 2 144.8877 12 144.1700 1 146.2397 4 142.0585 5 144.0437 11 142.2459 8 144.7944 10 144.8924 7 147.7174 9 145.7753 6 145.8359 3 144.6884

2.2. Sorting data

Sorting by LAT

> #OR > library(plyr) #OR library(dplyr) > arrange(data.1, LAT)

Plot Cond Between Temp LAT 1 P2 H A1 23.83643 14.07060 2 P4 L A4 25.89843 14.52130 3 P1 H A1 15.73546 17.25752 4 P4 H A2 37.95281 18.41013 5 P1 M A2 25.29508 18.46762 6 P3 L A4 37.11781 18.64913 7 P4 M A3 29.38325 19.68780 8 P2 L A4 18.94612 20.06427 9 P3 M A3 26.87429 20.14244 10 P1 L A3 27.75781 20.33795

slide-4
SLIDE 4
  • 4-

11 P2 M A2 13.79532 20.38767 12 P3 H A1 13.64371 20.74986 LONG 1 144.8877 2 144.1700 3 146.2397 4 142.0585 5 144.0437 6 142.2459 7 144.7944 8 144.8924 9 147.7174 10 145.7753 11 145.8359 12 144.6884

2.3. Sorting data

Sorting by Cond and then TEMP

> library(plyr) #OR library(dplyr) > arrange(data.1, Cond,Temp)

Plot Cond Between Temp LAT 1 P3 H A1 13.64371 20.74986 2 P1 H A1 15.73546 17.25752 3 P2 H A1 23.83643 14.07060 4 P4 H A2 37.95281 18.41013 5 P2 L A4 18.94612 20.06427 6 P4 L A4 25.89843 14.52130 7 P1 L A3 27.75781 20.33795 8 P3 L A4 37.11781 18.64913 9 P2 M A2 13.79532 20.38767 10 P1 M A2 25.29508 18.46762 11 P3 M A3 26.87429 20.14244 12 P4 M A3 29.38325 19.68780 LONG 1 144.6884 2 146.2397 3 144.8877 4 142.0585 5 144.8924 6 144.1700 7 145.7753 8 142.2459 9 145.8359 10 144.0437 11 147.7174 12 144.7944

> #OR > data.1[order(data.1$Cond,data.1$Temp),]

Plot Cond Between Temp LAT 3 P3 H A1 13.64371 20.74986 1 P1 H A1 15.73546 17.25752

slide-5
SLIDE 5
  • 5-

2 P2 H A1 23.83643 14.07060 4 P4 H A2 37.95281 18.41013 10 P2 L A4 18.94612 20.06427 12 P4 L A4 25.89843 14.52130 9 P1 L A3 27.75781 20.33795 11 P3 L A4 37.11781 18.64913 6 P2 M A2 13.79532 20.38767 5 P1 M A2 25.29508 18.46762 7 P3 M A3 26.87429 20.14244 8 P4 M A3 29.38325 19.68780 LONG 3 144.6884 1 146.2397 2 144.8877 4 142.0585 10 144.8924 12 144.1700 9 145.7753 11 142.2459 6 145.8359 5 144.0437 7 147.7174 8 144.7944

2.4. Sorting data

Sort by the sum of Temp and LAT

> library(plyr) > arrange(data.1,Temp+LAT)

Plot Cond Between Temp LAT 1 P1 H A1 15.73546 17.25752 2 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 4 P2 H A1 23.83643 14.07060 5 P2 L A4 18.94612 20.06427 6 P4 L A4 25.89843 14.52130 7 P1 M A2 25.29508 18.46762 8 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 10 P4 M A3 29.38325 19.68780 11 P3 L A4 37.11781 18.64913 12 P4 H A2 37.95281 18.41013 LONG 1 146.2397 2 145.8359 3 144.6884 4 144.8877 5 144.8924 6 144.1700 7 144.0437 8 147.7174 9 145.7753 10 144.7944 11 142.2459 12 142.0585

slide-6
SLIDE 6
  • 6-

> #OR > data.1[order(data.1$Temp+data.1$LAT),]

Plot Cond Between Temp LAT 1 P1 H A1 15.73546 17.25752 6 P2 M A2 13.79532 20.38767 3 P3 H A1 13.64371 20.74986 2 P2 H A1 23.83643 14.07060 10 P2 L A4 18.94612 20.06427 12 P4 L A4 25.89843 14.52130 5 P1 M A2 25.29508 18.46762 7 P3 M A3 26.87429 20.14244 9 P1 L A3 27.75781 20.33795 8 P4 M A3 29.38325 19.68780 11 P3 L A4 37.11781 18.64913 4 P4 H A2 37.95281 18.41013 LONG 1 146.2397 6 145.8359 3 144.6884 2 144.8877 10 144.8924 12 144.1700 5 144.0437 7 147.7174 9 145.7753 8 144.7944 11 142.2459 4 142.0585

2.5. Your turn

  • sort by Condition and then the ratio of Temp to LAT

2.6. Your turn

  • sort by Condition and then the ratio of Temp to LAT

> library(plyr) > arrange(data.1,Cond,Temp/LAT)

Plot Cond Between Temp LAT LONG 1 P3 H A1 13.64371 20.74986 144.6884 2 P1 H A1 15.73546 17.25752 146.2397 3 P2 H A1 23.83643 14.07060 144.8877 4 P4 H A2 37.95281 18.41013 142.0585 5 P2 L A4 18.94612 20.06427 144.8924 6 P1 L A3 27.75781 20.33795 145.7753 7 P4 L A4 25.89843 14.52130 144.1700 8 P3 L A4 37.11781 18.64913 142.2459 9 P2 M A2 13.79532 20.38767 145.8359 10 P3 M A3 26.87429 20.14244 147.7174 11 P1 M A2 25.29508 18.46762 144.0437 12 P4 M A3 29.38325 19.68780 144.7944

slide-7
SLIDE 7
  • 7-

> #OR > data.1[order(data.1$Cond, data.1$Temp/data.1$LAT),]

Plot Cond Between Temp LAT LONG 3 P3 H A1 13.64371 20.74986 144.6884 1 P1 H A1 15.73546 17.25752 146.2397 2 P2 H A1 23.83643 14.07060 144.8877 4 P4 H A2 37.95281 18.41013 142.0585 10 P2 L A4 18.94612 20.06427 144.8924 9 P1 L A3 27.75781 20.33795 145.7753 12 P4 L A4 25.89843 14.52130 144.1700 11 P3 L A4 37.11781 18.64913 142.2459 6 P2 M A2 13.79532 20.38767 145.8359 7 P3 M A3 26.87429 20.14244 147.7174 5 P1 M A2 25.29508 18.46762 144.0437 8 P4 M A3 29.38325 19.68780 144.7944

  • 3. Manipulating factors

3.1. Re-levelling (sorting) factors

(data.2) Plot Cond Between Temp P1 H A1 15.74 P2 H A1 23.84 P3 H A1 13.64 P4 H A2 37.95 P1 M A2 25.3 P2 M A2 13.8 P3 M A3 26.87 P4 M A3 29.38 P1 L A3 27.76 P2 L A4 18.95 P3 L A4 37.12 P4 L A4 25.9

> levels(data.2$Cond)

[1] "H" "L" "M"

3.2. Re-levelling (sorting) factors

> data.2$Cond <- factor(data.2$Cond, levels=c("L","M","H")) > #OR > library(gdata) > data.2$Cond <- reorder(data.2$Cond, new.order=c("L","M","H")) > > levels(data.2$Cond)

slide-8
SLIDE 8
  • 8-

[1] "L" "M" "H"

3.3. Re-labelling factors

> #create a copy of data2 and call it data.3 > data.3 <- data.2 > #reorder AND rename the factor levels > data.3$Cond <- factor(data.3$Cond, levels=c("L","M","H"), + lab=c("Low","Medium","High")) > data.3

Plot Cond Between Temp 1 P1 High A1 15.73546 2 P2 High A1 23.83643 3 P3 High A1 13.64371 4 P4 High A2 37.95281 5 P1 Medium A2 25.29508 6 P2 Medium A2 13.79532 7 P3 Medium A3 26.87429 8 P4 Medium A3 29.38325 9 P1 Low A3 27.75781 10 P2 Low A4 18.94612 11 P3 Low A4 37.11781 12 P4 Low A4 25.89843

3.4. Re-labelling factors

> #restore data.3 from the original data.2 again > data.3 <- data.2 > library(plyr) > data.3$Cond <- revalue(data.3$Cond,c("L"="Low", + "M"="Medium","H"="High")) > data.3

Plot Cond Between Temp 1 P1 High A1 15.73546 2 P2 High A1 23.83643 3 P3 High A1 13.64371 4 P4 High A2 37.95281 5 P1 Medium A2 25.29508 6 P2 Medium A2 13.79532 7 P3 Medium A3 26.87429 8 P4 Medium A3 29.38325 9 P1 Low A3 27.75781 10 P2 Low A4 18.94612 11 P3 Low A4 37.11781 12 P4 Low A4 25.89843

3.5. Re-naming columns (vectors)

3.5.1. By position

> data.3 <- data.2 > colnames(data.3) <- c("Plot","Condition","Treatment", + "Temperature") > data.3

Plot Condition Treatment Temperature

slide-9
SLIDE 9
  • 9-

1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281 5 P1 M A2 25.29508 6 P2 M A2 13.79532 7 P3 M A3 26.87429 8 P4 M A3 29.38325 9 P1 L A3 27.75781 10 P2 L A4 18.94612 11 P3 L A4 37.11781 12 P4 L A4 25.89843

3.6. Re-naming columns (vectors)

3.6.1. By name

> library(plyr) > data.3 <- rename(data.2, replace=c("Plot"="Plot", + "Cond"="Condition","Between"="Treatment", + "Temp"="Temperature")) > data.3

Plot Condition Treatment Temperature 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281 5 P1 M A2 25.29508 6 P2 M A2 13.79532 7 P3 M A3 26.87429 8 P4 M A3 29.38325 9 P1 L A3 27.75781 10 P2 L A4 18.94612 11 P3 L A4 37.11781 12 P4 L A4 25.89843

> #It is not necessary to redefine all of the columns, nor do the renaming rules need to be in the same order (position) as columns > data.3<-rename(data.2, replace=c("Temp"="Temperature","Cond"="Condition","Between"="Treatment")) > data.3

Plot Condition Treatment Temperature 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281 5 P1 M A2 25.29508 6 P2 M A2 13.79532 7 P3 M A3 26.87429 8 P4 M A3 29.38325 9 P1 L A3 27.75781 10 P2 L A4 18.94612 11 P3 L A4 37.11781 12 P4 L A4 25.89843

slide-10
SLIDE 10
  • 10-

3.7. Re-naming columns (vectors)

3.7.1. By name

> library(plyr) > #It is not necessary to redefine all of the columns, > # nor do the renaming rules need to be in the same order > # (position) as columns > data.3<-rename(data.2, replace=c("Temp"="Temperature", + "Cond"="Condition","Between"="Treatment")) > data.3

Plot Condition Treatment Temperature 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281 5 P1 M A2 25.29508 6 P2 M A2 13.79532 7 P3 M A3 26.87429 8 P4 M A3 29.38325 9 P1 L A3 27.75781 10 P2 L A4 18.94612 11 P3 L A4 37.11781 12 P4 L A4 25.89843

  • 4. Filtering

4.1. Effects of Subsetting (filtering)

(data.2) Plot Cond Between Temp P1 H A1 15.74 P2 H A1 23.84 P3 H A1 13.64 P4 H A2 37.95 P1 M A2 25.3 P2 M A2 13.8 P3 M A3 26.87 P4 M A3 29.38 P1 L A3 27.76 P2 L A4 18.95 P3 L A4 37.12 P4 L A4 25.9

> #examine the levels of the Cond factor > levels(data.2$Cond)

[1] "H" "L" "M"

slide-11
SLIDE 11
  • 11-

4.2. Effects of Subsetting (filtering)

> #subset the dataset to just Cond H > data.3<-subset(data.2,Cond=="H") > #examine subset data > data.3

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281

> #examine the levels of the Cond factor > levels(data.3$Cond)

[1] "H" "L" "M"

> levels(data.3$Between)

[1] "A1" "A2" "A3" "A4"

4.3. Effects of Subsetting (filtering)

4.3.1. Correction - all factors

> #subset the dataset to just Cond H > data.3<-subset(data.2,Cond=="H") > #drop the unused factor levels from all factors > data.3<-droplevels(data.3) > #examine the levels of each factor > levels(data.3$Cond)

[1] "H"

> levels(data.3$Between)

[1] "A1" "A2"

4.4. Effects of Subsetting (filtering)

4.4.1. Correction - single factor

> #subset the dataset to just Cond H > data.3<-subset(data.2,Cond=="H") > #drop the unused factor levels from Cond > data.3$Cond<-factor(data.3$Cond) > #examine the levels of each factor > levels(data.3$Cond)

[1] "H"

> levels(data.3$Between)

[1] "A1" "A2" "A3" "A4"

slide-12
SLIDE 12
  • 12-

4.5. Filtering/subsetting

4.5.1. Via indexing

> data.2[data.2$Cond=="H",]

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281

> #OR > library(dplyr) > slice(data.2,1:4)

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281

4.6. Filtering/subsetting

4.6.1. Better still - via subset

> subset(data.2,Cond=="H")

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281 4.6.2. OR dplyr

> library(dplyr) > filter(data.2,Cond=="H")

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281

4.7. Filtering/subsetting

Multiple cases

> subset(data.2,Cond %in% c("H","M"))

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371

slide-13
SLIDE 13
  • 13-

4 P4 H A2 37.95281 5 P1 M A2 25.29508 6 P2 M A2 13.79532 7 P3 M A3 26.87429 8 P4 M A3 29.38325

> #OR > filter(data.2,Cond %in% c("H","M"))

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P4 H A2 37.95281 5 P1 M A2 25.29508 6 P2 M A2 13.79532 7 P3 M A3 26.87429 8 P4 M A3 29.38325

4.8. Filtering/subsetting

Multiple conditions Keep only the observations with H OR M condition AND Temp < 30

> subset(data.2,Cond %in% c("H","M") & Temp<30)

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 5 P1 M A2 25.29508 6 P2 M A2 13.79532 7 P3 M A3 26.87429 8 P4 M A3 29.38325

> #OR > library(dplyr) > filter(data.2, Cond %in% c("H","M") & Temp<30)

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371 4 P1 M A2 25.29508 5 P2 M A2 13.79532 6 P3 M A3 26.87429 7 P4 M A3 29.38325

> #or > filter(data.2, Cond %in% c("H","M"), Temp<30)

Plot Cond Between Temp 1 P1 H A1 15.73546 2 P2 H A1 23.83643 3 P3 H A1 13.64371

slide-14
SLIDE 14
  • 14-

4 P1 M A2 25.29508 5 P2 M A2 13.79532 6 P3 M A3 26.87429 7 P4 M A3 29.38325

4.9. Filtering/subsetting

Keep only the Plot and Temp columns

> subset(data.2,select=c(Plot,Temp))

Plot Temp 1 P1 15.73546 2 P2 23.83643 3 P3 13.64371 4 P4 37.95281 5 P1 25.29508 6 P2 13.79532 7 P3 26.87429 8 P4 29.38325 9 P1 27.75781 10 P2 18.94612 11 P3 37.11781 12 P4 25.89843

> #OR > library(dplyr) > select(data.2,c(Plot,Temp))

Plot Temp 1 P1 15.73546 2 P2 23.83643 3 P3 13.64371 4 P4 37.95281 5 P1 25.29508 6 P2 13.79532 7 P3 26.87429 8 P4 29.38325 9 P1 27.75781 10 P2 18.94612 11 P3 37.11781 12 P4 25.89843

4.10. Filtering/subsetting

Keep only the Plot and Temp columns and observations with H OR M condition AND Temp < 30

> subset(data.2,Cond %in% c("H","M") & Temp<30, + select=c(Plot,Temp))

Plot Temp 1 P1 15.73546 2 P2 23.83643 3 P3 13.64371 5 P1 25.29508 6 P2 13.79532 7 P3 26.87429 8 P4 29.38325

slide-15
SLIDE 15
  • 15-

> #OR > subset(data.2,Cond %in% c("H","M") & Temp<30, + select=c(-Cond,-Between))

Plot Temp 1 P1 15.73546 2 P2 23.83643 3 P3 13.64371 5 P1 25.29508 6 P2 13.79532 7 P3 26.87429 8 P4 29.38325

> #OR via indexing > data.2[data.2$Cond %in% c("H","M") & data.2$Temp<30, + c("Plot","Temp")]

Plot Temp 1 P1 15.73546 2 P2 23.83643 3 P3 13.64371 5 P1 25.29508 6 P2 13.79532 7 P3 26.87429 8 P4 29.38325

4.11. Filtering/subsetting

Keep only the Plot and Temp columns and observations with H OR M condition AND Temp < 30

> select(filter(data.2, Cond %in% c("H","M") & Temp<30), + c(Plot, Temp))

Plot Temp 1 P1 15.73546 2 P2 23.83643 3 P3 13.64371 4 P1 25.29508 5 P2 13.79532 6 P3 26.87429 7 P4 29.38325

  • 5. Reshaping data

5.1. Reshaping data frames

5.1.1. Wide data Between Plot Time.0 Time.1 Time.2 R1 A1 P1 8 14 14 R2 A1 P2 10 12 11 R3 A2 P3 7 11 8 R4 A2 P4 11 9 2

slide-16
SLIDE 16
  • 16-

5.1.2. Wide to long

> library(reshape) > melt(data.w,id=1:2,measure.var=3:5,var="Time")

Between Plot Time value 1 A1 P1 Time.0 8 2 A1 P2 Time.0 10 3 A2 P3 Time.0 7 4 A2 P4 Time.0 11 5 A1 P1 Time.1 14 6 A1 P2 Time.1 12 7 A2 P3 Time.1 11 8 A2 P4 Time.1 9 9 A1 P1 Time.2 14 10 A1 P2 Time.2 11 11 A2 P3 Time.2 8 12 A2 P4 Time.2 2

5.2. Reshaping data frames

5.2.1. Wide to long

> #OR > library(reshape) > melt(data.w,id=c("Between","Plot"),measure.var=c("Time.0", + "Time.1","Time.2"),var="Time")

Between Plot Time value 1 A1 P1 Time.0 8 2 A1 P2 Time.0 10 3 A2 P3 Time.0 7 4 A2 P4 Time.0 11 5 A1 P1 Time.1 14 6 A1 P2 Time.1 12 7 A2 P3 Time.1 11 8 A2 P4 Time.1 9 9 A1 P1 Time.2 14 10 A1 P2 Time.2 11 11 A2 P3 Time.2 8 12 A2 P4 Time.2 2

5.3. Reshaping data frames

5.3.1. Long data Resp1 Resp2 Between Plot Subplot Within 8 17 A1 P1 S1 B1 10 18 A1 P1 S1 B2 7 17 A1 P1 S2 B1 11 21 A1 P1 S2 B2 14 19 A2 P2 S3 B1 12 13 A2 P2 S3 B2

slide-17
SLIDE 17
  • 17-

Resp1 Resp2 Between Plot Subplot Within 11 24 A2 P2 S4 B1 9 18 A2 P2 S4 B2 14 25 A3 P3 S5 B1 11 18 A3 P3 S5 B2 8 27 A3 P3 S6 B1 2 22 A3 P3 S6 B2 8 17 A1 P4 S7 B1 10 22 A1 P4 S7 B2 7 16 A1 P4 S8 B1 12 13 A1 P4 S8 B2 11 23 A2 P5 S9 B1 12 19 A2 P5 S9 B2 12 23 A2 P5 S10 B1 10 21 A2 P5 S10 B2 3 17 A3 P6 S11 B1 11 16 A3 P6 S11 B2 13 26 A3 P6 S12 B1 7 28 A3 P6 S12 B2

5.4. Reshaping data frames

5.4.1. Widen (cast) Widen Resp1 for repeated measures (Within)

> library(reshape) > cast(data,Between+Plot+Subplot~Within,value="Resp1")

Between Plot Subplot B1 B2 1 A1 P1 S1 8 10 2 A1 P1 S2 7 11 3 A1 P4 S7 8 10 4 A1 P4 S8 7 12 5 A2 P2 S3 14 12 6 A2 P2 S4 11 9 7 A2 P5 S9 11 12 8 A2 P5 S10 12 10 9 A3 P3 S5 14 11 10 A3 P3 S6 8 2 11 A3 P6 S11 3 11 12 A3 P6 S12 13 7

5.5. Reshaping data frames

5.5.1. Widen (cast) Widen (cast) both Resp1 and Resp2 for repeated measures (Within)

slide-18
SLIDE 18
  • 18-

> library(reshape) > d1<-cast(data,Between+Plot+Subplot~Within,value="Resp1") > d2<-cast(data,Between+Plot+Subplot~Within,value="Resp2") > merge(d1,d2,by=c("Between","Plot","Subplot"), + suffixes=c("Resp1","Resp2"))

Between Plot Subplot B1Resp1 B2Resp1 1 A1 P1 S1 8 10 2 A1 P1 S2 7 11 3 A1 P4 S7 8 10 4 A1 P4 S8 7 12 5 A2 P2 S3 14 12 6 A2 P2 S4 11 9 7 A2 P5 S10 12 10 8 A2 P5 S9 11 12 9 A3 P3 S5 14 11 10 A3 P3 S6 8 2 11 A3 P6 S11 3 11 12 A3 P6 S12 13 7 B1Resp2 B2Resp2 1 17 18 2 17 21 3 17 22 4 16 13 5 19 13 6 24 18 7 23 21 8 23 19 9 25 18 10 27 22 11 17 16 12 26 28

5.6. Reshaping data frames

5.6.1. Widen (cast) Widen (cast) both Resp1 and Resp2 for repeated measures (Within)

> library(reshape) > tmp <- melt(data,id=3:6) > #OR > tmp <- melt(data,id=c('Between','Plot','Subplot','Within')) > cast(tmp, Between+Plot+Subplot~Within+variable)

Between Plot Subplot B1_Resp1 B1_Resp2 1 A1 P1 S1 8 17 2 A1 P1 S2 7 17 3 A1 P4 S7 8 17 4 A1 P4 S8 7 16 5 A2 P2 S3 14 19 6 A2 P2 S4 11 24 7 A2 P5 S9 11 23 8 A2 P5 S10 12 23 9 A3 P3 S5 14 25 10 A3 P3 S6 8 27 11 A3 P6 S11 3 17 12 A3 P6 S12 13 26

slide-19
SLIDE 19
  • 19-

B2_Resp1 B2_Resp2 1 10 18 2 11 21 3 10 22 4 12 13 5 12 13 6 9 18 7 12 19 8 10 21 9 11 18 10 2 22 11 11 16 12 7 28

  • 6. Merging data

6.1. Merging data frames

Bio data (missing Subplot 3) Resp1 Resp2 Between Plot Subplot 1 8 18 A1 P1 S1 2 10 21 A1 P1 S2 4 11 23 A1 P2 S4 5 14 22 A2 P3 S5 6 12 24 A2 P3 S6 7 11 23 A2 P4 S7 8 9 20 A2 P4 S8 9 14 11 A3 P5 S9 10 11 22 A3 P5 S10 11 8 24 A3 P6 S11 12 2 16 A3 P6 S12 Physio-chemical data (missing S7) Chem1 Chem2 Between Plot Subplot 1 1.453 0.8858 A1 P1 S1 2 3.266 0.18 A1 P1 S2 3 1.179 5.078 A1 P2 S3 4 13.4 1.576 A1 P2 S4 5 3.779 1.622 A2 P3 S5 6 1.197 4.237 A2 P3 S6 8 5.688 2.986 A2 P4 S8 9 4.835 4.133 A3 P5 S9 10 2.003 3.604 A3 P5 S10

slide-20
SLIDE 20
  • 20-

Chem1 Chem2 Between Plot Subplot 11 12.33 1.776 A3 P6 S11 12 4.014 0.2255 A3 P6 S12

6.2. Merging data frames

Merge bio and chem data (only keep full matches - an inner join)

> merge(data.bio,data.chem,by=c("Between","Plot","Subplot"))

Between Plot Subplot Resp1 Resp2 Chem1 1 A1 P1 S1 8 18 1.452878 2 A1 P1 S2 10 21 3.266253 3 A1 P2 S4 11 23 13.400350 4 A2 P3 S5 14 22 3.779183 5 A2 P3 S6 12 24 1.196657 6 A2 P4 S8 9 20 5.687807 7 A3 P5 S10 11 22 2.002931 8 A3 P5 S9 14 11 4.834518 9 A3 P6 S11 8 24 12.326867 10 A3 P6 S12 2 16 4.014221 Chem2 1 0.8858208 2 0.1800177 3 1.5762780 4 1.6222430 5 4.2369184 6 2.9859003 7 3.6043314 8 4.1328919 9 1.7763576 10 0.2255188

  • S3 and S7 absent
  • order of subplots

6.3. Merging data frames

Merge bio and chem data (only keep full matches - an inner join)

> #OR > library(plyr) > join(data.bio,data.chem,by=c("Between","Plot","Subplot"), + type="inner")

Resp1 Resp2 Between Plot Subplot Chem1 1 8 18 A1 P1 S1 1.452878 2 10 21 A1 P1 S2 3.266253 3 11 23 A1 P2 S4 13.400350 4 14 22 A2 P3 S5 3.779183 5 12 24 A2 P3 S6 1.196657 6 9 20 A2 P4 S8 5.687807 7 14 11 A3 P5 S9 4.834518

slide-21
SLIDE 21
  • 21-

8 11 22 A3 P5 S10 2.002931 9 8 24 A3 P6 S11 12.326867 10 2 16 A3 P6 S12 4.014221 Chem2 1 0.8858208 2 0.1800177 3 1.5762780 4 1.6222430 5 4.2369184 6 2.9859003 7 4.1328919 8 3.6043314 9 1.7763576 10 0.2255188

  • S3 and S7 absent
  • order of subplots

6.4. Merging data frames

Merge bio and chem data (keep all data - outer join)

> join(data.bio,data.chem,by=c("Between","Plot","Subplot"), + type="full")

Resp1 Resp2 Between Plot Subplot Chem1 1 8 18 A1 P1 S1 1.452878 2 10 21 A1 P1 S2 3.266253 3 11 23 A1 P2 S4 13.400350 4 14 22 A2 P3 S5 3.779183 5 12 24 A2 P3 S6 1.196657 6 11 23 A2 P4 S7 NA 7 9 20 A2 P4 S8 5.687807 8 14 11 A3 P5 S9 4.834518 9 11 22 A3 P5 S10 2.002931 10 8 24 A3 P6 S11 12.326867 11 2 16 A3 P6 S12 4.014221 12 NA NA A1 P2 S3 1.178652 Chem2 1 0.8858208 2 0.1800177 3 1.5762780 4 1.6222430 5 4.2369184 6 NA 7 2.9859003 8 4.1328919 9 3.6043314 10 1.7763576 11 0.2255188 12 5.0780682

> merge(data.bio,data.chem,by=c("Between","Plot","Subplot"), + all=T)

Between Plot Subplot Resp1 Resp2 Chem1

slide-22
SLIDE 22
  • 22-

1 A1 P1 S1 8 18 1.452878 2 A1 P1 S2 10 21 3.266253 3 A1 P2 S3 NA NA 1.178652 4 A1 P2 S4 11 23 13.400350 5 A2 P3 S5 14 22 3.779183 6 A2 P3 S6 12 24 1.196657 7 A2 P4 S7 11 23 NA 8 A2 P4 S8 9 20 5.687807 9 A3 P5 S9 14 11 4.834518 10 A3 P5 S10 11 22 2.002931 11 A3 P6 S11 8 24 12.326867 12 A3 P6 S12 2 16 4.014221 Chem2 1 0.8858208 2 0.1800177 3 5.0780682 4 1.5762780 5 1.6222430 6 4.2369184 7 NA 8 2.9859003 9 4.1328919 10 3.6043314 11 1.7763576 12 0.2255188

6.5. Merging data frames

Merge bio and chem data (only keep full BIO matches - left join)

> join(data.bio,data.chem,by=c("Between","Plot","Subplot"), + type="left")

Resp1 Resp2 Between Plot Subplot Chem1 1 8 18 A1 P1 S1 1.452878 2 10 21 A1 P1 S2 3.266253 3 11 23 A1 P2 S4 13.400350 4 14 22 A2 P3 S5 3.779183 5 12 24 A2 P3 S6 1.196657 6 11 23 A2 P4 S7 NA 7 9 20 A2 P4 S8 5.687807 8 14 11 A3 P5 S9 4.834518 9 11 22 A3 P5 S10 2.002931 10 8 24 A3 P6 S11 12.326867 11 2 16 A3 P6 S12 4.014221 Chem2 1 0.8858208 2 0.1800177 3 1.5762780 4 1.6222430 5 4.2369184 6 NA 7 2.9859003 8 4.1328919 9 3.6043314 10 1.7763576 11 0.2255188

slide-23
SLIDE 23
  • 23-

> merge(data.bio,data.chem,by=c("Between","Plot","Subplot"), + all.x=T)

Between Plot Subplot Resp1 Resp2 Chem1 1 A1 P1 S1 8 18 1.452878 2 A1 P1 S2 10 21 3.266253 3 A1 P2 S4 11 23 13.400350 4 A2 P3 S5 14 22 3.779183 5 A2 P3 S6 12 24 1.196657 6 A2 P4 S7 11 23 NA 7 A2 P4 S8 9 20 5.687807 8 A3 P5 S9 14 11 4.834518 9 A3 P5 S10 11 22 2.002931 10 A3 P6 S11 8 24 12.326867 11 A3 P6 S12 2 16 4.014221 Chem2 1 0.8858208 2 0.1800177 3 1.5762780 4 1.6222430 5 4.2369184 6 NA 7 2.9859003 8 4.1328919 9 3.6043314 10 1.7763576 11 0.2255188

6.6. Merging data frames

Merge bio and chem data (only keep full CHEM matches - right join)

> join(data.bio,data.chem,by=c("Between","Plot","Subplot"), + type="right")

Between Plot Subplot Resp1 Resp2 Chem1 1 A1 P1 S1 8 18 1.452878 2 A1 P1 S2 10 21 3.266253 3 A1 P2 S3 NA NA 1.178652 4 A1 P2 S4 11 23 13.400350 5 A2 P3 S5 14 22 3.779183 6 A2 P3 S6 12 24 1.196657 7 A2 P4 S8 9 20 5.687807 8 A3 P5 S9 14 11 4.834518 9 A3 P5 S10 11 22 2.002931 10 A3 P6 S11 8 24 12.326867 11 A3 P6 S12 2 16 4.014221 Chem2 1 0.8858208 2 0.1800177 3 5.0780682 4 1.5762780 5 1.6222430 6 4.2369184 7 2.9859003 8 4.1328919

slide-24
SLIDE 24
  • 24-

9 3.6043314 10 1.7763576 11 0.2255188

> merge(data.bio,data.chem,by=c("Between","Plot","Subplot"), + all.y=T)

Between Plot Subplot Resp1 Resp2 Chem1 1 A1 P1 S1 8 18 1.452878 2 A1 P1 S2 10 21 3.266253 3 A1 P2 S3 NA NA 1.178652 4 A1 P2 S4 11 23 13.400350 5 A2 P3 S5 14 22 3.779183 6 A2 P3 S6 12 24 1.196657 7 A2 P4 S8 9 20 5.687807 8 A3 P5 S9 14 11 4.834518 9 A3 P5 S10 11 22 2.002931 10 A3 P6 S11 8 24 12.326867 11 A3 P6 S12 2 16 4.014221 Chem2 1 0.8858208 2 0.1800177 3 5.0780682 4 1.5762780 5 1.6222430 6 4.2369184 7 2.9859003 8 4.1328919 9 3.6043314 10 1.7763576 11 0.2255188

  • 7. VLOOKUP

7.1. VLOOKUP

Biological data set (data.bio) Resp1 Resp2 Between Plot Subplot 1 8 18 A1 P1 S1 2 10 21 A1 P1 S2 4 11 23 A1 P2 S4 5 14 22 A2 P3 S5 6 12 24 A2 P3 S6 7 11 23 A2 P4 S7 8 9 20 A2 P4 S8 9 14 11 A3 P5 S9 10 11 22 A3 P5 S10 11 8 24 A3 P6 S11 12 2 16 A3 P6 S12 Geographical data set (lookup table) (data.geo) Plot LAT LONG

slide-25
SLIDE 25
  • 25-

1 P1 17.9605 145.4326 2 P2 17.5210 146.1983 3 P3 17.0011 146.3839 4 P4 18.2350 146.7934 5 P5 18.9840 146.0345 6 P6 20.1154 146.4672

7.2. VLOOKUP

Incorporate (merge) the lat/longs into the bio data

> join(data.bio,data.geo,by=c("Plot"), type="left")

Resp1 Resp2 Between Plot Subplot LAT 1 8 18 A1 P1 S1 17.9605 2 10 21 A1 P1 S2 17.9605 3 11 23 A1 P2 S4 17.5210 4 14 22 A2 P3 S5 17.0011 5 12 24 A2 P3 S6 17.0011 6 11 23 A2 P4 S7 18.2350 7 9 20 A2 P4 S8 18.2350 8 14 11 A3 P5 S9 18.9840 9 11 22 A3 P5 S10 18.9840 10 8 24 A3 P6 S11 20.1154 11 2 16 A3 P6 S12 20.1154 LONG 1 145.4326 2 145.4326 3 146.1983 4 146.3839 5 146.3839 6 146.7934 7 146.7934 8 146.0345 9 146.0345 10 146.4672 11 146.4672

> merge(data.bio,data.geo,by=c("Plot"),all.x=T)

Plot Resp1 Resp2 Between Subplot LAT 1 P1 8 18 A1 S1 17.9605 2 P1 10 21 A1 S2 17.9605 3 P2 11 23 A1 S4 17.5210 4 P3 14 22 A2 S5 17.0011 5 P3 12 24 A2 S6 17.0011 6 P4 11 23 A2 S7 18.2350 7 P4 9 20 A2 S8 18.2350 8 P5 14 11 A3 S9 18.9840 9 P5 11 22 A3 S10 18.9840 10 P6 8 24 A3 S11 20.1154 11 P6 2 16 A3 S12 20.1154 LONG 1 145.4326 2 145.4326

slide-26
SLIDE 26
  • 26-

3 146.1983 4 146.3839 5 146.3839 6 146.7934 7 146.7934 8 146.0345 9 146.0345 10 146.4672 11 146.4672

  • 8. Aggregating data (pivot tables)

8.1. Aggregating data frames

(data.a) Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 3 7 26 A2 P3 4 11 21 A2 P4 5 14 17 A3 P5 6 12 18 A3 P6

  • 1. Via tapply

> tapply(data.a$Resp1,data.a$Between,mean)

A1 A2 A3 9 9 13

8.2. Aggregating data frames

Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 3 7 26 A2 P3 4 11 21 A2 P4 5 14 17 A3 P5 6 12 18 A3 P6

  • 1. Via tapply

> tapply(data.a$Resp1,data.a$Between,mean)

A1 A2 A3 9 9 13

> tapply(data.a$Resp1,data.a$Between,function(x) { + print(x) + return(mean(x)) + })

[1] 8 10 [1] 7 11 [1] 14 12 A1 A2 A3 9 9 13

slide-27
SLIDE 27
  • 27-

8.3. Aggregating data frames

Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 3 7 26 A2 P3 4 11 21 A2 P4 5 14 17 A3 P5 6 12 18 A3 P6

  • 2. Via cast

> library(reshape) > cast(data.a,~Between,value="Resp1",fun=mean)

value A1 A2 A3 1 (all) 9 9 13

8.4. Aggregating data frames

Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 3 7 26 A2 P3 4 11 21 A2 P4 5 14 17 A3 P5 6 12 18 A3 P6

  • 2. Via cast

> library(reshape) > cast(data.a,~Between,value="Resp1",fun=mean)

value A1 A2 A3 1 (all) 9 9 13

> library(reshape) > cast(data.a,~Between,value="Resp1",fun=function(x) { + print(x) + return(mean(x)) + })

integer(0) [1] 8 10 [1] 7 11 [1] 14 12 value A1 A2 A3 1 (all) 9 9 13

8.5. Aggregating data frames

Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 3 7 26 A2 P3 4 11 21 A2 P4 5 14 17 A3 P5 6 12 18 A3 P6

slide-28
SLIDE 28
  • 28-
  • 3. Via ddply

> library(plyr) > ddply(data.a,~Between,numcolwise(mean))

Between Resp1 Resp2 1 A1 9 20.0 2 A2 9 23.5 3 A3 13 17.5

> library(plyr) > ddply(data.a,~Between,function(x) { + print(x); return(mean(x)) + })

Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 Resp1 Resp2 Between Plot 1 7 26 A2 P3 2 11 21 A2 P4 Resp1 Resp2 Between Plot 1 14 17 A3 P5 2 12 18 A3 P6 Between V1 1 A1 NA 2 A2 NA 3 A3 NA

8.6. Aggregating data frames

Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 3 7 26 A2 P3 4 11 21 A2 P4 5 14 17 A3 P5 6 12 18 A3 P6

  • 4. Via dply

> library(dplyr) > group_by(data.a, Between) %.% summarize(mean=mean(Resp1), + sd=sd(Resp1))

Source: local data frame [3 x 3] Between mean sd 1 A1 9 1.414214 2 A2 9 2.828427 3 A3 13 1.414214

slide-29
SLIDE 29
  • 29-

8.7. Aggregating data frames

Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 3 7 26 A2 P3 4 11 21 A2 P4 5 14 17 A3 P5 6 12 18 A3 P6

  • 4. Via dply

> library(dplyr) > data.a %>% group_by(Between) %>% summarize(mean=mean(Resp1), + sd=sd(Resp1))

Source: local data frame [3 x 3] Between mean sd 1 A1 9 1.414214 2 A2 9 2.828427 3 A3 13 1.414214

8.8. Aggregating data frames

Resp1 Resp2 Between Plot 1 8 22 A1 P1 2 10 18 A1 P2 3 7 26 A2 P3 4 11 21 A2 P4 5 14 17 A3 P5 6 12 18 A3 P6

  • 5. Via summaryBy

> library(doBy) > summaryBy(Resp1~Between, data.a, FUN=mean)

Between Resp1.mean 1 A1 9 2 A2 9 3 A3 13

> library(doBy) > summaryBy(Resp1~Between, data.a, FUN=function(x) {print(x); mean(x)})

[1] 8 10 [1] 7 11 [1] 14 12 Between 1 A1 2 A2 3 A3 Resp1.function(x) { print(x) mean(x) } 1 9 2 9 3 13

slide-30
SLIDE 30
  • 30-

8.9. Aggregating data

Variables passed Passed as Functions Returns tapply Single specified Vectors Single Matrix cast Single specified Vectors List of Dataframe summaryBy All specified Vectors List of Dataframe ddply All Dataframes Multiple Dataframe group_by All Dataframes Dataframe

8.10. Aggregating data

A typical data set (data) Resp1 Resp2 Between Plot Subplot Within 1 8 17 A1 P1 S1 B1 2 10 18 A1 P1 S1 B2 3 7 17 A1 P1 S2 B1 4 11 21 A1 P1 S2 B2 5 14 19 A2 P2 S3 B1 6 12 13 A2 P2 S3 B2 7 11 24 A2 P2 S4 B1 8 9 18 A2 P2 S4 B2 9 14 25 A3 P3 S5 B1 10 11 18 A3 P3 S5 B2 11 8 27 A3 P3 S6 B1 12 2 22 A3 P3 S6 B2 13 8 17 A1 P4 S7 B1 14 10 22 A1 P4 S7 B2 15 7 16 A1 P4 S8 B1 16 12 13 A1 P4 S8 B2 17 11 23 A2 P5 S9 B1 18 12 19 A2 P5 S9 B2 19 12 23 A2 P5 S10 B1 20 10 21 A2 P5 S10 B2 21 3 17 A3 P6 S11 B1 22 11 16 A3 P6 S11 B2 23 13 26 A3 P6 S12 B1 24 7 28 A3 P6 S12 B2

8.11. Aggregating data

8.11.1. Pivot tables, data summaries

  • 1. Via tapply

> tapply(data$Resp1,data$Between,mean)

A1 A2 A3 9.125 11.375 8.625

slide-31
SLIDE 31
  • 31-

8.12. Aggregating data

8.12.1. Pivot tables, data summaries

  • 2. Via cast

> library(plyr) > cast(data,~Between,value="Resp1",mean)

value A1 A2 A3 1 (all) 9.125 11.375 8.625

8.13. Aggregating data frames

8.13.1. Pivot tables, data summaries

  • 3. Via ddply

> library(plyr) > ddply(data,~Between,summarize, Mean=mean(Resp1))

Between Mean 1 A1 9.125 2 A2 11.375 3 A3 8.625

> ddply(data,~Between,function(x) mean(x$Resp1))

Between V1 1 A1 9.125 2 A2 11.375 3 A3 8.625

> ddply(data,~Between,numcolwise(mean))

Between Resp1 Resp2 1 A1 9.125 17.625 2 A2 11.375 20.000 3 A3 8.625 22.375

8.14. Aggregating data frames

8.14.1. Pivot tables, data summaries

  • 4. Via group_by

> library(dplyr) > data %>% group_by(Between) %>% summarize(Mean=mean(Resp1))

Source: local data frame [3 x 2] Between Mean 1 A1 9.125 2 A2 11.375 3 A3 8.625

slide-32
SLIDE 32
  • 32-

8.15. Aggregating data frames

8.15.1. Pivot tables, data summaries

  • 5. Via summaryBy

> library(doBy) > summaryBy(Resp1~Between, data, FUN=mean)

Between Resp1.mean 1 A1 9.125 2 A2 11.375 3 A3 8.625

8.16. Aggregating data frames

Resp1 means and sd

> library(plyr) > ddply(data,~Between,function(x) { + data.frame(mean=mean(x$Resp1), sd=sd(x$Resp1)) + })

Between mean sd 1 A1 9.125 1.885092 2 A2 11.375 1.505941 3 A3 8.625 4.438066

> #OR > library(reshape) > cast(data,Between~., value='Resp1', fun=c(mean,sd))

Between mean sd 1 A1 9.125 1.885092 2 A2 11.375 1.505941 3 A3 8.625 4.438066

8.17. Aggregating data frames

Resp1 means and sd

> library(plyr) > mstat <- function(x,...) { + data.frame(mean=mean(x,...), sd=sd(x,...)) + } > ddply(data,~Between,function(x) { + mstat(x$Resp1) + })

Between mean sd 1 A1 9.125 1.885092 2 A2 11.375 1.505941 3 A3 8.625 4.438066

slide-33
SLIDE 33
  • 33-

8.18. Aggregating data frames

Resp1 means and sd

> library(dplyr) > data %>% group_by(Between) %>% summarize(Mean=mean(Resp1), + SD=sd(Resp1))

Source: local data frame [3 x 3] Between Mean SD 1 A1 9.125 1.885092 2 A2 11.375 1.505941 3 A3 8.625 4.438066

8.19. Aggregating data frames

Resp1 means and sd (Between/Within)

> library(plyr) > ddply(data,~Between+Within,function(x) { + data.frame(mean=mean(x$Resp1), sd=sd(x$Resp1)) + })

Between Within mean sd 1 A1 B1 7.50 0.5773503 2 A1 B2 10.75 0.9574271 3 A2 B1 12.00 1.4142136 4 A2 B2 10.75 1.5000000 5 A3 B1 9.50 5.0662281 6 A3 B2 7.75 4.2720019

> #OR > library(reshape) > cast(data, Between+Within~., value="Resp1", fun=c(mean,sd))

Between Within mean sd 1 A1 B1 7.50 0.5773503 2 A1 B2 10.75 0.9574271 3 A2 B1 12.00 1.4142136 4 A2 B2 10.75 1.5000000 5 A3 B1 9.50 5.0662281 6 A3 B2 7.75 4.2720019

8.20. Aggregating data frames

Resp1 means and sd (Between/Within)

> library(dplyr) > data %>% group_by(Between,Within) %>% summarize(mean=mean(Resp1), + sd=sd(Resp1))

Source: local data frame [6 x 4] Groups: Between Between Within mean sd 1 A1 B1 7.50 0.5773503 2 A1 B2 10.75 0.9574271

slide-34
SLIDE 34
  • 34-

3 A2 B1 12.00 1.4142136 4 A2 B2 10.75 1.5000000 5 A3 B1 9.50 5.0662281 6 A3 B2 7.75 4.2720019

8.21. Aggregating data frames

Resp1 and Resp2 mean and sd for each Between/Within combination

> library(plyr) > ddply(data,~Between+Within,function(x) { + data.frame(Mean1=mean(x$Resp1), sd1=sd(x$Resp1), + Mean2=mean(x$Resp2), sd2=sd(x$Resp2)) + })

Between Within Mean1 sd1 Mean2 sd2 1 A1 B1 7.50 0.5773503 16.75 0.500000 2 A1 B2 10.75 0.9574271 18.50 4.041452 3 A2 B1 12.00 1.4142136 22.25 2.217356 4 A2 B2 10.75 1.5000000 17.75 3.403430 5 A3 B1 9.50 5.0662281 23.75 4.573474 6 A3 B2 7.75 4.2720019 21.00 5.291503

8.22. Aggregating data frames

Resp1 and Resp2 mean and sd for each Between/Within combination

> library(plyr) > ddply(data,~Between+Within,function(x) { + data.frame(Mean=numcolwise(mean)(x), + SD=numcolwise(sd)(x)) + })

Between Within Mean.Resp1 Mean.Resp2 SD.Resp1 SD.Resp2 1 A1 B1 7.50 16.75 0.5773503 0.500000 2 A1 B2 10.75 18.50 0.9574271 4.041452 3 A2 B1 12.00 22.25 1.4142136 2.217356 4 A2 B2 10.75 17.75 1.5000000 3.403430 5 A3 B1 9.50 23.75 5.0662281 4.573474 6 A3 B2 7.75 21.00 4.2720019 5.291503

8.23. Aggregating data frames

Resp1 and Resp2 mean and sd for each Between/Within combination

> library(dplyr) > data %>% group_by(Between,Within) %>% summarise_each(funs(mean,sd))

Source: local data frame [6 x 10] Groups: Between Between Within Resp1_mean Resp2_mean Plot_mean 1 A1 B1 7.50 16.75 2.5 2 A1 B2 10.75 18.50 2.5 3 A2 B1 12.00 22.25 3.5 4 A2 B2 10.75 17.75 3.5 5 A3 B1 9.50 23.75 4.5 6 A3 B2 7.75 21.00 4.5 Variables not shown: Subplot_mean (dbl), Resp1_sd (dbl), Resp2_sd (dbl), Plot_sd (dbl), Subplot_sd (dbl)

slide-35
SLIDE 35
  • 35-

8.24. Aggregating data frames

Resp1 and Resp2 mean and sd for each Between/Within combination

> library(reshape) > cast(melt(data, meas=c('Resp1','Resp2')), + Between+Within~variable, fun=c(mean,sd))

Between Within Resp1_mean Resp1_sd Resp2_mean Resp2_sd 1 A1 B1 7.50 0.5773503 16.75 0.500000 2 A1 B2 10.75 0.9574271 18.50 4.041452 3 A2 B1 12.00 1.4142136 22.25 2.217356 4 A2 B2 10.75 1.5000000 17.75 3.403430 5 A3 B1 9.50 5.0662281 23.75 4.573474 6 A3 B2 7.75 4.2720019 21.00 5.291503

8.25. Aggregating data frames

Resp1 and Resp2 means for each Between/Within combination in long format

> library(plyr) > library(reshape) > melt(ddply(data,~Between+Within,summarize, + Resp1=mean(Resp1),Resp2=mean(Resp2) + ), + measure.vars=c("Resp1","Resp2"),variable_name="Resp")

Between Within Resp value 1 A1 B1 Resp1 7.50 2 A1 B2 Resp1 10.75 3 A2 B1 Resp1 12.00 4 A2 B2 Resp1 10.75 5 A3 B1 Resp1 9.50 6 A3 B2 Resp1 7.75 7 A1 B1 Resp2 16.75 8 A1 B2 Resp2 18.50 9 A2 B1 Resp2 22.25 10 A2 B2 Resp2 17.75 11 A3 B1 Resp2 23.75 12 A3 B2 Resp2 21.00

> #OR > ddply(data,~Between+Within,function(x){ + rbind(data.frame(Resp="Resp1",value=mean(x$Resp1)), + data.frame(Resp="Resp2",value=mean(x$Resp2))) + })

Between Within Resp value 1 A1 B1 Resp1 7.50 2 A1 B1 Resp2 16.75 3 A1 B2 Resp1 10.75 4 A1 B2 Resp2 18.50 5 A2 B1 Resp1 12.00 6 A2 B1 Resp2 22.25 7 A2 B2 Resp1 10.75 8 A2 B2 Resp2 17.75 9 A3 B1 Resp1 9.50 10 A3 B1 Resp2 23.75 11 A3 B2 Resp1 7.75 12 A3 B2 Resp2 21.00

slide-36
SLIDE 36
  • 36-

8.26. Aggregating data frames

Resp1 and Resp2 means for each Between/Within combination in long format

> library(plyr) > #OR > ddply(data,~Between+Within,function(x){ + rbind(data.frame(Resp="Resp1",value=mean(x$Resp1)), + data.frame(Resp="Resp2",value=mean(x$Resp2))) + })

Between Within Resp value 1 A1 B1 Resp1 7.50 2 A1 B1 Resp2 16.75 3 A1 B2 Resp1 10.75 4 A1 B2 Resp2 18.50 5 A2 B1 Resp1 12.00 6 A2 B1 Resp2 22.25 7 A2 B2 Resp1 10.75 8 A2 B2 Resp2 17.75 9 A3 B1 Resp1 9.50 10 A3 B1 Resp2 23.75 11 A3 B2 Resp1 7.75 12 A3 B2 Resp2 21.00

8.27. Aggregating data frames

Resp1 and Resp2 means and standard deviations (Between/Within) long format

> library(plyr) > ddply(data,~Between+Within,function(x){ + rbind(data.frame(Resp="Resp1",Mean=mean(x$Resp1), Sd=sd(x$Resp1)), + data.frame(Resp="Resp2",Mean=mean(x$Resp2), Sd=sd(x$Resp2))) + })

Between Within Resp Mean Sd 1 A1 B1 Resp1 7.50 0.5773503 2 A1 B1 Resp2 16.75 0.5000000 3 A1 B2 Resp1 10.75 0.9574271 4 A1 B2 Resp2 18.50 4.0414519 5 A2 B1 Resp1 12.00 1.4142136 6 A2 B1 Resp2 22.25 2.2173558 7 A2 B2 Resp1 10.75 1.5000000 8 A2 B2 Resp2 17.75 3.4034296 9 A3 B1 Resp1 9.50 5.0662281 10 A3 B1 Resp2 23.75 4.5734742 11 A3 B2 Resp1 7.75 4.2720019 12 A3 B2 Resp2 21.00 5.2915026

  • 9. Transformations

9.1. Transformations and Derivatives

A typical data set (data.s) Between Plot Resp1 Resp2 1 A1 P1 8 13 2 A1 P2 10 22 3 A2 P3 7 23 4 A2 P4 11 22

slide-37
SLIDE 37
  • 37-
  • 1. manual transform

> data.s$logResp1 <- log(data.s$Resp1) > data.s$logResp2 <- log(data.s$Resp2) > data.s

Between Plot Resp1 Resp2 logResp1 logResp2 1 A1 P1 8 13 2.079442 2.564949 2 A1 P2 10 22 2.302585 3.091042 3 A2 P3 7 23 1.945910 3.135494 4 A2 P4 11 22 2.397895 3.091042

9.2. Transformations and Derivatives

A typical data set (data.s) Between Plot Resp1 Resp2 1 A1 P1 8 13 2 A1 P2 10 22 3 A2 P3 7 23 4 A2 P4 11 22

  • 2. via transform

> transform(data.s, logResp1=log(Resp1), logRes2=log(Resp2))

Between Plot Resp1 Resp2 logResp1 logRes2 1 A1 P1 8 13 2.079442 2.564949 2 A1 P2 10 22 2.302585 3.091042 3 A2 P3 7 23 1.945910 3.135494 4 A2 P4 11 22 2.397895 3.091042

9.3. Transformations and Derivatives

A typical data set (data.s) Between Plot Resp1 Resp2 1 A1 P1 8 13 2 A1 P2 10 22 3 A2 P3 7 23 4 A2 P4 11 22

  • 3. via within

> within(data.s, logResp1<-log(Resp1), logRes2<-log(Resp2))

Between Plot Resp1 Resp2 logResp1 1 A1 P1 8 13 2.079442 2 A1 P2 10 22 2.302585 3 A2 P3 7 23 1.945910 4 A2 P4 11 22 2.397895

slide-38
SLIDE 38
  • 38-

9.4. Transformations and Derivatives

A typical data set (data.s) Between Plot Resp1 Resp2 1 A1 P1 8 13 2 A1 P2 10 22 3 A2 P3 7 23 4 A2 P4 11 22

  • 4. via mutate (plyr)

> mutate(data.s, logResp1=log(Resp1), logResp2=log(Resp2), + diff=logResp1-logResp2)

Between Plot Resp1 Resp2 logResp1 logResp2 diff 1 A1 P1 8 13 2.079442 2.564949 -0.4855078 2 A1 P2 10 22 2.302585 3.091042 -0.7884574 3 A2 P3 7 23 1.945910 3.135494 -1.1895841 4 A2 P4 11 22 2.397895 3.091042 -0.6931472

9.5. Transformations and Derivatives

Transforms at a time Returns Notes manually Single Single vector transform Multiple New dataframe within Multiple New dataframe Similar, yet more flexible and faster than transform mutate Multiple New dataframe Works with each column sequentially, can derive new columns summaryBy Multiple New dataframe

9.6. Transformations and Derivatives

Typical data set (data.s) Between Plot Resp1 Resp2 1 A1 P1 8 13 2 A1 P2 10 22 3 A2 P3 7 23 4 A2 P4 11 22 Scale (log) transform a single variable - Resp1

9.7. Transformations and Derivatives

Scale (log) transform a single variable - Resp1

slide-39
SLIDE 39
  • 39-

> (data.s$Resp1 <- log(data.s$Resp1)) > #OR > transform(data.s,logResp1=log(Resp1)) > #OR > within(data.s,logResp1<-log(Resp1))

Between Plot Resp1 Resp2 logResp1 1 A1 P1 8 13 2.079442 2 A1 P2 10 22 2.302585 3 A2 P3 7 23 1.945910 4 A2 P4 11 22 2.397895

> library(plyr) > mutate(data.s,logResp1=log(Resp1))

Between Plot Resp1 Resp2 logResp1 1 A1 P1 8 13 2.079442 2 A1 P2 10 22 2.302585 3 A2 P3 7 23 1.945910 4 A2 P4 11 22 2.397895

9.8. Transformations and Derivatives

Typical data set Between Plot Resp1 Resp2 1 A1 P1 8 13 2 A1 P2 10 22 3 A2 P3 7 23 4 A2 P4 11 22 Scale (log) transform multiple variables - Resp1 and Resp2

9.9. Transformations and Derivatives

Scale (log) transform multiple variables - Resp1 and Resp2

> within(data.s, logResp1<-log(Resp1),logResp2<-log(Resp2)) > #OR > transform(data.s,logResp1=log(Resp1),logResp2=log(Resp2))

Between Plot Resp1 Resp2 logResp1 logResp2 1 A1 P1 8 13 2.079442 2.564949 2 A1 P2 10 22 2.302585 3.091042 3 A2 P3 7 23 1.945910 3.135494 4 A2 P4 11 22 2.397895 3.091042

> library(plyr) > mutate(data.s,logResp1=log(Resp1),logResp2=log(Resp2))

Between Plot Resp1 Resp2 logResp1 logResp2 1 A1 P1 8 13 2.079442 2.564949 2 A1 P2 10 22 2.302585 3.091042 3 A2 P3 7 23 1.945910 3.135494 4 A2 P4 11 22 2.397895 3.091042

slide-40
SLIDE 40
  • 40-

9.10. Transformations and Derivatives

Center a single variable (Resp2)

> scale(data.s$Resp2,scale=FALSE) > within(data.s, cResp2<-Resp2-mean(Resp2)) > #OR > transform(data.s,cResp2=Resp2-mean(Resp2))

Between Plot Resp1 Resp2 cResp2 1 A1 P1 8 13

  • 7

2 A1 P2 10 22 2 3 A2 P3 7 23 3 4 A2 P4 11 22 2

> library(plyr) > mutate(data.s,cResp2=Resp2-mean(Resp2))

Between Plot Resp1 Resp2 cResp2 1 A1 P1 8 13

  • 7

2 A1 P2 10 22 2 3 A2 P3 7 23 3 4 A2 P4 11 22 2

9.11. Transformations and Derivatives

Center a both variables (Resp1 and Resp2)

> within(data.s, cResp1<-Resp1-mean(Resp1), + cResp2<-Resp2-mean(Resp2)) > #OR > transform(data.s,cResp1=Resp1-mean(Resp1), + cResp2=Resp2-mean(Resp2))

Between Plot Resp1 Resp2 cResp1 cResp2 1 A1 P1 8 13

  • 1
  • 7

2 A1 P2 10 22 1 2 3 A2 P3 7 23

  • 2

3 4 A2 P4 11 22 2 2

> library(plyr) > mutate(data.s,cResp1=Resp1-mean(Resp1), + cResp2=Resp2-mean(Resp2))

Between Plot Resp1 Resp2 cResp1 cResp2 1 A1 P1 8 13

  • 1
  • 7

2 A1 P2 10 22 1 2 3 A2 P3 7 23

  • 2

3 4 A2 P4 11 22 2 2

9.12. Transformations and Derivatives

Center a both variables (Resp1 and Resp2) and calculate the difference

> library(plyr) > mutate(data.s,cResp1=Resp1-mean(Resp1), + cResp2=Resp2-mean(Resp2), diff=cResp1-cResp2)

slide-41
SLIDE 41
  • 41-

Between Plot Resp1 Resp2 cResp1 cResp2 diff 1 A1 P1 8 13

  • 1
  • 7

6 2 A1 P2 10 22 1 2

  • 1

3 A2 P3 7 23

  • 2

3

  • 5

4 A2 P4 11 22 2 2

9.13. Transformations and Derivatives

Center a single variable (Resp2) within each level of the Between factor

> library(plyr) > ddply(data.s, ~Between, mutate, cResp2=Resp2-mean(Resp2))

Between Plot Resp1 Resp2 cResp2 1 A1 P1 8 13

  • 4.5

2 A1 P2 10 22 4.5 3 A2 P3 7 23 0.5 4 A2 P4 11 22

  • 0.5

9.14. Transformations and Derivatives

Difference in means centered within Between (derivative of derivatives)

> library(plyr) > ddply(data.s,~Between,mutate,cResp1=Resp1-mean(Resp1), + cResp2=Resp2-mean(Resp2),cDiff=cResp1-cResp2)

Between Plot Resp1 Resp2 cResp1 cResp2 cDiff 1 A1 P1 8 13

  • 1
  • 4.5

3.5 2 A1 P2 10 22 1 4.5

  • 3.5

3 A2 P3 7 23

  • 2

0.5

  • 2.5

4 A2 P4 11 22 2

  • 0.5

2.5

9.15. Transformations and Derivatives

Difference in scaled versions of two variables within Between (derivative of derivatives)

> library(plyr) > ddply(data.s,~Between,mutate, + sResp1=Resp1/max(Resp1),sResp2=Resp2/max(Resp2), + sDiff=sResp1-sResp2, + logsResp1=log(sResp1),logsResp2=log(sResp2), + sLog=logsResp1-logsResp2 + )

Between Plot Resp1 Resp2 sResp1 sResp2 1 A1 P1 8 13 0.8000000 0.5909091 2 A1 P2 10 22 1.0000000 1.0000000 3 A2 P3 7 23 0.6363636 1.0000000 4 A2 P4 11 22 1.0000000 0.9565217 sDiff logsResp1 logsResp2 sLog 1 0.20909091 -0.2231436 -0.52609310 0.30294954 2 0.00000000 0.0000000 0.00000000 0.00000000 3 -0.36363636 -0.4519851 0.00000000 -0.45198512 4 0.04347826 0.0000000 -0.04445176 0.04445176