Data Munging with R Rob Kabacoff, Ph.D. Topics Single dataset - PowerPoint PPT Presentation

Data Munging with R Rob Kabacoff, Ph.D.

Topics  Single dataset  subsetting data  sorting data  creating new variables  renaming variables  aggregating  Multiple datasets  merging data  Additional topics  reshaping  working with dates  cleaning text 2

Data Management with a single dataset

dplyr functions  filter – select rows  select – select columns  arrange – reorder rows  mutate – create new columns  rename – rename columns  group_by and summarize - aggregate be sure to issue library(dplyr) to make these available 4

filter subset data by selecting rows df1 <- filter (mtcars, cyl==4, mpg > 20) df2 <- filter (mtcars, cyl==4 & mpg > 20) # same df3 <- filter (mtcars, cyl %in% c(4, 6) | am ==1) 5

Logical Operators Operator Description < Less than <= Less than or equal to > Greater than >= Greater than or equal to == Exactly equal to != Not equal to !x Not x x | y x or y x & y x and y isTRUE(x) Test if x is TRUE 6

select subset data by selecting columns (variables) df1 <- select (mtcars, mpg, cyl, wt) df2 <- select (mtcars, mpg:qsec, carb) df3 <- select (mtcars, -am, -carb) 7

arrange reorder rows df1 <- arrange (mtcars, cyl) df2 <- arrange (mtcars, cyl, mpg) df3 <- arrange (mtcars, cyl, desc(mpg)) 8

mutate create new variables (add new columns) df1 <- mutate(mtcars, power = disp * hp, am = factor(am, levels=c(0, 1), labels = c("automatic", "manual")) ) 9

Arithmetic Operators Operator Description + Addition - Subtraction * Multiplication / Division ^ Exponentiation 10

rename rename variables (columns) df <- rename (mtcars, displacement = disp, transmission = am) 11

group_by and summarize aggregate data by groups df <- group_by (mtcars, cyl, gear) df2 <- summarise (df, disp_n = n(), disp_mean = mean(disp), disp_sd = sd(disp) ) 12

group_by and summarize (2) aggregate data by groups df <- group_by (mtcars, cyl, gear) df2 <- summarise_each (df, funs(mean)) df3 <- summarise_each (df, funs(min, max)) 13

Putting it all together df <- select(mtcars, cyl, disp, mpg) df <- filter(df, mpg > 20) df <- arrange(df, cyl, desc(mpg)) df <- select(mtcars, cyl, disp, mpg) %>% filter(mpg > 20) %>% arrange(cyl, desc(mpg)) 14

Calculating percentages mtcars %>% group_by(cyl) %>% cyl n pct <dbl> <int> <dbl> summarise(n = n()) %>% 1 4 11 0.34375 mutate(pct = n/sum(n)) 2 6 7 0.21875 3 8 14 0.43750 as.data.frame(mtcars %>% group_by(cyl) %>% summarise(n = n()) %>% mutate(pct = paste0(round(100 * n/sum(n), 0), "%"))) cyl n pct 1 4 11 34% 2 6 7 22% 3 8 14 44% 15

Calculating percentages as.data.frame(mtcars %>% group_by(cyl, gear) %>% summarise(n = n()) %>% mutate(pct = paste0(round(100 * n/sum(n), 0), "%"))) cyl gear n pct 1 4 3 1 9% 2 4 4 8 73% 3 4 5 2 18% 4 6 3 2 29% 5 6 4 4 57% 6 6 5 1 14% 7 8 3 12 86% 8 8 5 2 14% 16

Windows functions (min_rank) # what are the 2 automatic transmission cars and # 2 manual transmission cars that have the lowest gas mileage? mtcars$name <- row.name(mtcars) mtcars %>% group_by(am) %>% filter(min_rank(mpg) <= 2) %>% select(name, am, mpg) # have the highest gas mileage? mtcars %>% group_by(am) %>% filter(min_rank( desc (mpg)) <= 2) %>% select(name, am, mpg)

Merging Datasets

Start with some data monitors <- read.table(header=TRUE, text=' monitorid lat long 1 42.467573 -87.810047 2 42.049148 -88.273029 3 39.110539 -90.324080 ') pollutants <- read.table(header=TRUE, text=' pollutant duration monitorid ozone 1h 1 so2 1h 1 ozone 8h 2 no2 1h 4 ') example from https://rpubs.com/NateByers/Merging 19

Inner join library(dplyr) inner_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid lat long 1 ozone 1h 1 42.46757 -87.81005 2 so2 1h 1 42.46757 -87.81005 3 ozone 8h 2 42.04915 -88.27303 pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 20

Left join library(dplyr) left_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid lat long 1 ozone 1h 1 42.46757 -87.81005 2 so2 1h 1 42.46757 -87.81005 3 ozone 8h 2 42.04915 -88.27303 4 no2 1h 4 NA NA pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 21

Full join library(dplyr) full_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid lat long 1 ozone 1h 1 42.46757 -87.81005 2 so2 1h 1 42.46757 -87.81005 3 ozone 8h 2 42.04915 -88.27303 4 no2 1h 4 NA NA 5 <NA> <NA> 3 39.11054 -90.32408 pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 22

Filtering with semi_join library(dplyr) semi_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid 1 ozone 1h 1 keep pollutants rows 2 so2 1h 1 that have a match in 3 ozone 8h 2 monitors pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 23

Filtering with anti_join library(dplyr) anti_join(pollutants, monitors, by = "monitorid") pollutant duration monitorid keep pollutants rows 1 no2 1h 4 that don't have a match in monitors pollutants monitors pollutant duration monitorid monitorid lat long 1 ozone 1h 1 1 1 42.46757 -87.81005 2 so2 1h 1 2 2 42.04915 -88.27303 3 ozone 8h 2 3 3 39.11054 -90.32408 4 no2 1h 4 24

Working with dates

Reading dates  Dates come in as a character variable  Convert to a date variable  Use the lubridate package 26

Reading dates  Dates come in as a character variable  Convert to a date variable  Use the lubridate package example: say date variable is stored as a character variable in the form "mm-dd-yyyy" convert it to a date variable using function mdy( ) mdy("12-01-2010") 27

Reading dates example: data <- read.table(header=TRUE, text=' First Last birthday John Smith 12-01-2010 Bill Doe 1/9/1963 R doesn't know these are dates Jane Williams 05/19/08 ') library(lubridate) R knows these are dates data$DOB <- mdy(dates$birthday) First Last birthday DOB 1 John Smith 12-01-2010 2010-12-01 2 Bill Doe 1/9/1963 1963-01-09 3 Jane Williams 05/19/08 2008-05-19 28

Reading dates 29

Accessing data parts

Accessing date parts data$year <- year(data$DOB) data$month <- month(data$DOB, label = TRUE) data$day <- day(data$DOB) data$weekday <- wday(data$DOB, label=TRUE, abbr = FALSE) First Last birthday DOB year month day weekday 1 John Smith 12-01-2010 2010-12-01 2010 Dec 1 Wednesday 2 Bill Doe 1/9/1963 1963-01-09 1963 Jan 9 Wednesday 3 Jane Williams 05/19/08 2008-05-19 2008 May 19 Monday 31

Date arithmetic data$age <- difftime(now(), data$DOB) First Last DOB age 1 John Smith 2010-12-01 2281.775 days 2 Bill Doe 1963-01-09 19774.775 days 3 Jane Williams 2008-05-19 3207.775 days data$ageyrs <- as.numeric(data$age) / 365.25 First Last DOB age ageyrs 1 John Smith 2010-12-01 2281.775 days 6.247 2 Bill Doe 1963-01-09 19774.775 days 54.140 3 Jane Williams 2008-05-19 3207.775 days 8.782

Manipulating Text

Data Munging with R Rob Kabacoff, Ph.D. Topics Single dataset - PowerPoint PPT Presentation

Data Munging with R Rob Kabacoff, Ph.D. Topics Single dataset subsetting data sorting data creating new variables renaming variables aggregating Multiple datasets merging data Additional topics reshaping

Practical R: Data Ingestion and Munging Practical R: Data Ingestion and Munging Abhijit Dasgupta

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Diagnose data for cleaning Cleaning Data in Python Cleaning data Prepare data for analysis

CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data

Data Preparation Discretization Data cleaning (Data pre-processing) Data

Business Statistics CONTENTS The role of data The data matrix Data types Aspects of data

Data Preparation Data cleaning Discretization (Data preprocessing) Data

Data Presentation and Collection Week 2 Prepared by: Nurazrin Jupri Types of data Data

Big Data Analytics Armistead Boyd SVP, Product & Data Partnerships October 25, 2016 What is

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

Data Preparation Data cleaning Data integration and transformation (Data

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Data types and data structures Book definition An abstract data type is a programmer-defined data

arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master Code Code Code Data Data

Lets Fight Air Inequality with Open Data Christa Hasenkopf, PhD Co-Founder & CEO of

Koszul Duality and Geometric Satake for SL 2 ( R ) Oliver Straser Annual conference of the DFG

Lists Advanced Concepts Dennis Komm Programming and Problem-Solving The modules numpy ,

Connected components, universal covers and the Lascar group Anand Pillay University of Leeds

translators and localisers Luca Morado Vzquez Silvia Rodrguez Vzquez University of

SOBA: Secrecy-preserving Observable Ballot-level Audit Josh Benaloh, Microsoft Research Douglas

Auditability and Verifiability of Elec4ons Ronald L. Rivest MIT UC Davis December 1, 2016 Have

(t s) & ((. I * R?v, eu.., srt fdkr A.t) bft) X tt, Jot. of t r..i u). Jf /tt

Sambuz

Useful Links

Newsletter

Mail Us

Data Munging with R Rob Kabacoff, Ph.D. Topics Single dataset - PowerPoint PPT Presentation

Data Munging with R Rob Kabacoff, Ph.D. Topics Single dataset subsetting data sorting data creating new variables renaming variables aggregating Multiple datasets merging data Additional topics reshaping

Practical R: Data Ingestion and Munging Practical R: Data Ingestion and Munging Abhijit Dasgupta

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Diagnose data for cleaning Cleaning Data in Python Cleaning data Prepare data for analysis

CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data

Data Preparation Discretization Data cleaning (Data pre-processing) Data

Business Statistics CONTENTS The role of data The data matrix Data types Aspects of data

Data Preparation Data cleaning Discretization (Data preprocessing) Data

Data Presentation and Collection Week 2 Prepared by: Nurazrin Jupri Types of data Data

Big Data Analytics Armistead Boyd SVP, Product &amp; Data Partnerships October 25, 2016 What is

61A Lecture 30 Announcements Data Processing Data Processing 4 Data Processing Many data sets

Data Preparation Data cleaning Data integration and transformation (Data

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Data types and data structures Book definition An abstract data type is a programmer-defined data

arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master Code Code Code Data Data

Lets Fight Air Inequality with Open Data Christa Hasenkopf, PhD Co-Founder &amp; CEO of

Koszul Duality and Geometric Satake for SL 2 ( R ) Oliver Straser Annual conference of the DFG

Lists Advanced Concepts Dennis Komm Programming and Problem-Solving The modules numpy ,

Connected components, universal covers and the Lascar group Anand Pillay University of Leeds

translators and localisers Luca Morado Vzquez Silvia Rodrguez Vzquez University of

SOBA: Secrecy-preserving Observable Ballot-level Audit Josh Benaloh, Microsoft Research Douglas

Auditability and Verifiability of Elec4ons Ronald L. Rivest MIT UC Davis December 1, 2016 Have

(t s) &amp; ((. I * R?v, eu.., srt fdkr A.t) bft) X tt, Jot. of t r..i u). Jf /tt

Sambuz

Useful Links

Newsletter

Mail Us

Big Data Analytics Armistead Boyd SVP, Product & Data Partnerships October 25, 2016 What is

Lets Fight Air Inequality with Open Data Christa Hasenkopf, PhD Co-Founder & CEO of

(t s) & ((. I * R?v, eu.., srt fdkr A.t) bft) X tt, Jot. of t r..i u). Jf /tt