data munging with r
play

Data Munging with R Rob Kabacoff, Ph.D. Topics Single dataset - PowerPoint PPT Presentation

Data Munging with R Rob Kabacoff, Ph.D. Topics Single dataset subsetting data sorting data creating new variables renaming variables aggregating Multiple datasets merging data Additional topics reshaping


  1. Data Munging with R Rob Kabacoff, Ph.D.

  2. Topics  Single dataset  subsetting data  sorting data  creating new variables  renaming variables  aggregating  Multiple datasets  merging data  Additional topics  reshaping  working with dates  cleaning text 2

  3. Data Management with a single dataset

  4. dplyr functions  filter – select rows  select – select columns  arrange – reorder rows  mutate – create new columns  rename – rename columns  group_by and summarize - aggregate be sure to issue library(dplyr) to make these available 4

  5. filter subset data by selecting rows df1 <- filter (mtcars, cyl==4, mpg > 20) df2 <- filter (mtcars, cyl==4 & mpg > 20) # same df3 <- filter (mtcars, cyl %in% c(4, 6) | am ==1) 5

  6. Logical Operators Operator Description < Less than <= Less than or equal to > Greater than >= Greater than or equal to == Exactly equal to != Not equal to !x Not x x | y x or y x & y x and y isTRUE(x) Test if x is TRUE 6

  7. select subset data by selecting columns (variables) df1 <- select (mtcars, mpg, cyl, wt) df2 <- select (mtcars, mpg:qsec, carb) df3 <- select (mtcars, -am, -carb) 7

  8. arrange reorder rows df1 <- arrange (mtcars, cyl) df2 <- arrange (mtcars, cyl, mpg) df3 <- arrange (mtcars, cyl, desc(mpg)) 8

  9. mutate create new variables (add new columns) df1 <- mutate(mtcars, power = disp * hp, am = factor(am, levels=c(0, 1), labels = c("automatic", "manual")) ) 9

  10. Arithmetic Operators Operator Description + Addition - Subtraction * Multiplication / Division ^ Exponentiation 10

  11. rename rename variables (columns) df <- rename (mtcars, displacement = disp, transmission = am) 11

  12. group_by and summarize aggregate data by groups df <- group_by (mtcars, cyl, gear) df2 <- summarise (df, disp_n = n(), disp_mean = mean(disp), disp_sd = sd(disp) ) 12

  13. group_by and summarize (2) aggregate data by groups df <- group_by (mtcars, cyl, gear) df2 <- summarise_each (df, funs(mean)) df3 <- summarise_each (df, funs(min, max)) 13

  14. Putting it all together df <- select(mtcars, cyl, disp, mpg) df <- filter(df, mpg > 20) df <- arrange(df, cyl, desc(mpg)) df <- select(mtcars, cyl, disp, mpg) %>% filter(mpg > 20) %>% arrange(cyl, desc(mpg)) 14

  15. Calculating percentages mtcars %>% group_by(cyl) %>% cyl n pct <dbl> <int> <dbl> summarise(n = n()) %>% 1 4 11 0.34375 mutate(pct = n/sum(n)) 2 6 7 0.21875 3 8 14 0.43750 as.data.frame(mtcars %>% group_by(cyl) %>% summarise(n = n()) %>% mutate(pct = paste0(round(100 * n/sum(n), 0), "%"))) cyl n pct 1 4 11 34% 2 6 7 22% 3 8 14 44% 15

  16. Calculating percentages as.data.frame(mtcars %>% group_by(cyl, gear) %>% summarise(n = n()) %>% mutate(pct = paste0(round(100 * n/sum(n), 0), "%"))) cyl gear n pct 1 4 3 1 9% 2 4 4 8 73% 3 4 5 2 18% 4 6 3 2 29% 5 6 4 4 57% 6 6 5 1 14% 7 8 3 12 86% 8 8 5 2 14% 16

  17. Windows functions (min_rank) # what are the 2 automatic transmission cars and # 2 manual transmission cars that have the lowest gas mileage? mtcars$name <- row.name(mtcars) mtcars %>% group_by(am) %>% filter(min_rank(mpg) <= 2) %>% select(name, am, mpg) # have the highest gas mileage? mtcars %>% group_by(am) %>% filter(min_rank( desc (mpg)) <= 2) %>% select(name, am, mpg)

  18. Merging Datasets

  19. Start with some data monitors <- read.table(header=TRUE, text=' monitorid lat long 1 42.467573 -87.810047 2 42.049148 -88.273029 3 39.110539 -90.324080 ') pollutants <- read.table(header=TRUE, text=' pollutant duration monitorid ozone 1h 1 so2 1h 1 ozone 8h 2 no2 1h 4 ') example from https://rpubs.com/NateByers/Merging 19

  20. Inner join library(dplyr) inner_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid lat long 1 ozone 1h 1 42.46757 -87.81005 2 so2 1h 1 42.46757 -87.81005 3 ozone 8h 2 42.04915 -88.27303 pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 20

  21. Left join library(dplyr) left_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid lat long 1 ozone 1h 1 42.46757 -87.81005 2 so2 1h 1 42.46757 -87.81005 3 ozone 8h 2 42.04915 -88.27303 4 no2 1h 4 NA NA pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 21

  22. Full join library(dplyr) full_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid lat long 1 ozone 1h 1 42.46757 -87.81005 2 so2 1h 1 42.46757 -87.81005 3 ozone 8h 2 42.04915 -88.27303 4 no2 1h 4 NA NA 5 <NA> <NA> 3 39.11054 -90.32408 pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 22

  23. Filtering with semi_join library(dplyr) semi_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid 1 ozone 1h 1 keep pollutants rows 2 so2 1h 1 that have a match in 3 ozone 8h 2 monitors pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 23

  24. Filtering with anti_join library(dplyr) anti_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid keep pollutants rows 1 no2 1h 4 that don't have a match in monitors pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 24

  25. Working with dates

  26. Reading dates  Dates come in as a character variable  Convert to a date variable  Use the lubridate package 26

  27. Reading dates  Dates come in as a character variable  Convert to a date variable  Use the lubridate package example: say date variable is stored as a character variable in the form "mm-dd-yyyy" convert it to a date variable using function mdy( ) mdy("12-01-2010") 27

  28. Reading dates example: data <- read.table(header=TRUE, text=' First Last birthday John Smith 12-01-2010 Bill Doe 1/9/1963 R doesn't know these are dates Jane Williams 05/19/08 ') library(lubridate) R knows these are dates data$DOB <- mdy(dates$birthday) First Last birthday DOB 1 John Smith 12-01-2010 2010-12-01 2 Bill Doe 1/9/1963 1963-01-09 3 Jane Williams 05/19/08 2008-05-19 28

  29. Reading dates 29

  30. Accessing data parts

  31. Accessing date parts data$year <- year(data$DOB) data$month <- month(data$DOB, label = TRUE) data$day <- day(data$DOB) data$weekday <- wday(data$DOB, label=TRUE, abbr = FALSE) First Last birthday DOB year month day weekday 1 John Smith 12-01-2010 2010-12-01 2010 Dec 1 Wednesday 2 Bill Doe 1/9/1963 1963-01-09 1963 Jan 9 Wednesday 3 Jane Williams 05/19/08 2008-05-19 2008 May 19 Monday 31

  32. Date arithmetic data$age <- difftime(now(), data$DOB) First Last DOB age 1 John Smith 2010-12-01 2281.775 days 2 Bill Doe 1963-01-09 19774.775 days 3 Jane Williams 2008-05-19 3207.775 days data$ageyrs <- as.numeric(data$age) / 365.25 First Last DOB age ageyrs 1 John Smith 2010-12-01 2281.775 days 6.247 2 Bill Doe 1963-01-09 19774.775 days 54.140 3 Jane Williams 2008-05-19 3207.775 days 8.782

  33. Manipulating Text

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend