Transforming new features
FEATURE ENGINEERING IN R
Jose Hernandez
Data Scientist, University of Washington
Addressing skewed variables
ggplot(online_retail, aes(x = Quantity)) + geom_density()
ggplot(transformed, aes(x = Quantity)) + geom_density()
ggplot(online_retail, aes(x = Quantity)) + geom_histogram(stat = "count")
# Transforming with caret
library(caret)
library(dplyr)

retail_vars <- online_retail %>% select(Quantity)
processed_vars <- preProcess(retail_vars, method = c("YeoJohnson"))

Data frame: retail_vars
Method: "BoxCox" or "YeoJohnson"
Transforms all numeric variables in the data frame

transformed <- predict(processed_vars, online_retail)

transformed contains the transformed variables
ggplot(transformed, aes(x = Quantity)) + geom_density()
Box-Cox: strictly positive numeric features
Yeo-Johnson: numeric features that include zero or negative values
Both transform variables so that they more closely fit a normal distribution
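
The difference in accepted inputs is easier to see from the formulas themselves. The following is a minimal base R sketch, assuming a fixed, hand-picked lambda (caret's preProcess() estimates lambda from the data instead). The helper names box_cox() and yeo_johnson() and the sample values are hypothetical and not part of caret or the course data.

```r
# Hypothetical helpers for illustration only; lambda is fixed by hand here,
# whereas caret's preProcess() estimates it from the data.

# Box-Cox: defined only for strictly positive values.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Yeo-Johnson: also handles zeros and negative values.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  nonneg <- y >= 0
  # Non-negative branch: a Box-Cox-style transform applied to y + 1.
  out[nonneg] <- if (abs(lambda) < 1e-8) {
    log1p(y[nonneg])
  } else {
    ((y[nonneg] + 1)^lambda - 1) / lambda
  }
  # Negative branch: mirrored transform with exponent 2 - lambda.
  out[!nonneg] <- if (abs(lambda - 2) < 1e-8) {
    -log1p(-y[!nonneg])
  } else {
    -(((1 - y[!nonneg])^(2 - lambda) - 1) / (2 - lambda))
  }
  out
}

quantities <- c(-5, 0, 1, 3, 8, 120)   # made-up values, including a negative
yj_values <- yeo_johnson(quantities, lambda = 0)
bc_values <- box_cox(c(1, 3, 8, 120), lambda = 0)
```

With lambda = 1 the Yeo-Johnson transform leaves the values unchanged, which is a quick sanity check on the implementation.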
raw range = 0 to 100
new range = 0 to 1
Useful when:
  Variables have known upper and lower bounds
  There are few outliers
  Data is approximately uniform across the range
adult_incomes %>% select(age) %>% range()
17 90

adult_incomes <- adult_incomes %>%
  mutate(scaled_age = (age - min(age)) / (max(age) - min(age)))

R finds the minimum for you with min()
R also finds the maximum with max()
mutate() creates your new column
adult_incomes %>% select(age, scaled_age) %>% summary()

      age          scaled_age
 1st Qu.:28.00   1st Qu.:0.1507
 Median :37.00   Median :0.2740
 Mean   :38.58   Mean   :0.2956
 3rd Qu.:48.00   3rd Qu.:0.4247
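
The same min-max formula can be wrapped in a small reusable function. This is a minimal base R sketch; min_max_scale() is a hypothetical helper name, and the sample ages are illustrative values picked from the range and quartiles reported above rather than the full data set. The guard for a constant column is an added assumption about how a zero range should be handled.

```r
# Hypothetical helper: rescale a numeric vector to the interval [0, 1].
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant column has zero range; return zeros instead of dividing by 0.
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

ages <- c(17, 28, 37, 48, 90)   # illustrative values, not the real data set
scaled_ages <- min_max_scale(ages)
round(scaled_ages, 4)
```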
income_vars <- adult_incomes %>% select(age, educational_num)
processed_vars <- preProcess(income_vars, method = c("range"))
transformed <- predict(processed_vars, adult_incomes)

transformed %>% select(age, educational_num) %>% summary()

      age          educational_num
 1st Qu.:0.1507   1st Qu.:0.5333
 Median :0.2740   Median :0.6000
 Mean   :0.2956   Mean   :0.6054
 3rd Qu.:0.4247   3rd Qu.:0.7333
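
preProcess() and predict() split the work into a step that learns the minimum and maximum of each column and a step that applies them. The base R sketch below imitates that split so the stored ranges can be reused on rows that were not seen when the ranges were learned. fit_range() and apply_range() are hypothetical names, the data frames are made up, and this is not caret's actual implementation.

```r
# Hypothetical helpers mimicking the learn-then-apply split of
# preProcess() and predict(); not caret's actual implementation.
fit_range <- function(df) {
  list(min = sapply(df, min), max = sapply(df, max))
}

apply_range <- function(params, df) {
  cols <- names(params$min)
  # Rescale each column with the stored minimum and maximum.
  df[cols] <- Map(function(col, lo, hi) (col - lo) / (hi - lo),
                  df[cols], params$min, params$max)
  df
}

# Made-up training rows and new rows.
train <- data.frame(age = c(17, 37, 90), educational_num = c(1, 10, 16))
new_rows <- data.frame(age = c(28, 48), educational_num = c(9, 12))

params <- fit_range(train)
scaled_new <- apply_range(params, new_rows)
```

Because the new rows are scaled with the training minimum and maximum, values outside the training range would fall below 0 or above 1.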
# Center on the mean, then divide by the range
adult_incomes <- adult_incomes %>%
  mutate(mscale_age = (age - mean(age)) / (max(age) - min(age)))

adult_incomes %>% select(age, mscale_age) %>% summary()

      age          mscale_age
 1st Qu.:28.00   1st Qu.:-0.14495
 Median :37.00   Median :-0.02167
 Mean   :38.58   Mean   : 0.00000
 3rd Qu.:48.00   3rd Qu.: 0.12902
adult_incomes %>% select(age, hours_per_week) %>% summary()

      age         hours_per_week
 Median :37.00   Median :40.00
 Mean   :38.58   Mean   :40.44
 3rd Qu.:48.00   3rd Qu.:45.00

processed_vars <- preProcess(adult_incomes %>% select(age, hours_per_week),
                             method = c("center"))

transformed <- predict(processed_vars, adult_incomes)
transformed %>% select(age, hours_per_week) %>% summary()

      age          hours_per_week
 Median :-1.582   Median :-0.4375
 Mean   : 0.000   Mean   : 0.0000
 3rd Qu.: 9.418   3rd Qu.: 4.5625
Scaling between 0 and 1:
  Well-defined upper and lower bounds
  Not a lot of outliers
Centering around the mean:
  Helpful when you have outliers
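
The reason min-max scaling works best with few outliers can be seen in a small experiment. The sketch below uses made-up working hours and a hypothetical min_max() helper; adding one extreme value stretches the range so that the remaining values are squeezed into a narrow band near 0.

```r
# Made-up working hours; the second vector adds one extreme value.
hours <- c(20, 35, 40, 45, 60)
hours_outlier <- c(hours, 400)

# Hypothetical helper implementing (x - min) / (max - min).
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

scaled_clean <- min_max(hours)
scaled_squeezed <- min_max(hours_outlier)[1:5]   # the original five values

round(scaled_clean, 3)
round(scaled_squeezed, 3)
```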
Useful when:
  You have some outliers
  Measurements are on different scales of magnitude

Mean centering changes the values but not the scale of the variables
Z-score standardization changes the scale to unit variance
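
The contrast between the two statements can be checked numerically. This is a minimal base R sketch on simulated, right-skewed values (an assumption standing in for the real data): centering moves the mean to 0 but leaves the standard deviation untouched, while the z-score also rescales the standard deviation to 1.

```r
set.seed(42)
quantity <- rexp(1000, rate = 1 / 7)   # simulated, right-skewed values

centered <- quantity - mean(quantity)                     # mean centering
z_scored <- (quantity - mean(quantity)) / sd(quantity)    # z-score

# Compare the spread before and after each transformation.
c(sd_raw = sd(quantity), sd_centered = sd(centered), sd_z = sd(z_scored))
```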
online_retail <- online_retail %>%
  mutate(z_quantity = (Quantity - mean(Quantity)) / sd(Quantity))

Use the mean() function and subtract the result from the original variable
Use the sd() function to calculate the standard deviation

online_retail %>% select(Quantity, z_quantity) %>% summary()

    Quantity         z_quantity
 1st Qu.: 1.000   1st Qu.:-0.53561
 Median : 3.000   Median :-0.35481
 Mean   : 6.925   Mean   : 0.00000
 3rd Qu.: 8.000   3rd Qu.: 0.09717
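
Base R also ships a scale() function that performs the same calculation. The sketch below, on made-up quantities, checks that the manual formula and scale() agree; as.numeric() is used because scale() returns a one-column matrix rather than a plain vector.

```r
quantity <- c(1, 3, 8, 2, 24, 6)   # made-up values

# Manual z-score, as in the mutate() call.
z_manual <- (quantity - mean(quantity)) / sd(quantity)

# Built-in equivalent; scale() returns a one-column matrix.
z_scale <- as.numeric(scale(quantity))

all.equal(z_manual, z_scale)
```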
online_retail %>% select(Quantity, UnitPrice) %>% summary()

    Quantity         UnitPrice
 1st Qu.: 1.000   1st Qu.: 1.250
 Median : 3.000   Median : 2.510
 Mean   : 6.925   Mean   : 4.137
 3rd Qu.: 8.000   3rd Qu.: 4.250

processed_vars <- preProcess(online_retail %>% select(Quantity, UnitPrice),
                             method = c("center", "scale"))

Use methods "center" and "scale"
transformed <- predict(processed_vars, online_retail)
transformed %>% select(Quantity, UnitPrice) %>% summary()
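
Combining "center" and "scale" amounts to storing each column's mean and standard deviation and then applying them. A rough base R analogue, assuming made-up training rows and new rows, uses the attributes that scale() attaches to its result. This only illustrates the idea and is not how caret implements it.

```r
# Made-up data standing in for the retail columns.
train <- data.frame(Quantity = c(1, 3, 8, 12, 6),
                    UnitPrice = c(1.25, 2.51, 4.25, 0.85, 9.95))
new_rows <- data.frame(Quantity = c(2, 10), UnitPrice = c(3.00, 1.00))

scaled_train <- scale(train)                     # center and scale each column
centers <- attr(scaled_train, "scaled:center")   # training means
scales  <- attr(scaled_train, "scaled:scale")    # training standard deviations

# Reuse the training statistics on the new rows.
scaled_new <- scale(new_rows, center = centers, scale = scales)
```

After this step both columns are expressed in standard deviations from their training mean, so they can be compared despite their different original magnitudes.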