Transforming new features: Feature Engineering in R

Oct 18, 2022

SLIDE 1

Transforming new features

FEATURE ENGINEERING IN R

Jose Hernandez

Data Scientist, University of Washington

SLIDE 2

Addressing skewed variables

ggplot(online_retail, aes(x = Quantity)) + geom_density()

SLIDE 3

Power transformations in statistics

SLIDE 4

Using power transformations

ggplot(online_retail, aes(x = Quantity)) + geom_density()

SLIDE 5

Box-Cox transformations

ggplot(transformed, aes(x = Quantity)) + geom_density()
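
For reference, the standard textbook definition of the Box-Cox transformation is sketched below. The symbols $y$ (the original value) and $\lambda$ (the power parameter) are notation introduced here, not taken from the slides. The transformation is defined only for strictly positive $y$.

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln y, & \lambda = 0
\end{cases}
\qquad (y > 0)
```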

SLIDE 6

Yeo-Johnson transformation

ggplot(online_retail, aes(x = Quantity)) + geom_histogram(stat = "count")

SLIDE 7

Yeo-Johnson transformation

ggplot(transformed, aes(x = Quantity)) + geom_density()
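
For reference, the standard definition of the Yeo-Johnson transformation is sketched below, using $y$ for the original value and $\lambda$ for the power parameter (notation introduced here, not taken from the slides). Unlike Box-Cox, it has separate branches for non-negative and negative values, so it is defined for any real $y$.

```latex
\psi(\lambda, y) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y \geq 0 \\[6pt]
\ln(y + 1), & \lambda = 0,\ y \geq 0 \\[6pt]
-\dfrac{(1 - y)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ y < 0 \\[6pt]
-\ln(1 - y), & \lambda = 2,\ y < 0
\end{cases}
```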

SLIDE 8

# Transforming with caret
retail_vars <- online_retail %>% select(Quantity)
processed_vars <- preProcess(retail_vars, method = c("YeoJohnson"))

Data frame: retail_vars
Method: "BoxCox" or "YeoJohnson"
Transforms all numeric variables in the data frame

transformed <- predict(processed_vars, online_retail)

transformed contains the transformation of variables

SLIDE 9

Plotting the transformation results

ggplot(transformed, aes(x = Quantity)) + geom_density()

SLIDE 10

Box-Cox vs. Yeo-Johnson

Box-Cox: strictly positive numeric features
Yeo-Johnson: numeric features that include zero or negative values
Both transform variables to better fit a normal distribution
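
To make the difference concrete, here is a minimal base-R sketch of both transformations with a fixed lambda. The helper names box_cox() and yeo_johnson() and the quantity vector are hypothetical and introduced only for illustration. caret estimates lambda from the data, which this sketch does not attempt.

```r
# Box-Cox with a fixed lambda; only defined for strictly positive x.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Yeo-Johnson with a fixed lambda; defined for any real x.
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Branch for non-negative values
  if (lambda != 0) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(x[pos])
  }
  # Branch for negative values
  if (lambda != 2) {
    out[!pos] <- -(((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log1p(-x[!pos])
  }
  out
}

# Hypothetical quantities, including negative values (e.g. returns)
quantity <- c(-5, -1, 0, 1, 3, 8, 99)

yj <- yeo_johnson(quantity, lambda = 0)             # works on all values
bc <- box_cox(quantity[quantity > 0], lambda = 0)   # positives only
```

Calling box_cox() on the full quantity vector stops with an error because of the zero and negative entries, which is the practical reason to reach for Yeo-Johnson on such data.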

SLIDE 11

Your turn!

SLIDE 12

Normalization techniques: Scaling and Centering

Jose Hernandez

Data Scientist, University of Washington

SLIDE 13

Scaling to a range

raw range = 0 to 100
new range = 0 to 1
Useful when:
Variables have known upper and lower bounds
There are few outliers
Data is approximately uniform across the range

SLIDE 14

adult_incomes %>% select(age) %>% range()
17 90

SLIDE 15

adult_incomes <- adult_incomes %>% mutate(scaled_age = (age - min(age)) / (max(age) - min(age)))

R finds the minimum for you with min()
R also finds the maximum with max()

mutate() creates your new column

adult_incomes %>% select(age, scaled_age) %>% summary()

      age          scaled_age
 Min.   :17.00   Min.   :0.0000
 1st Qu.:28.00   1st Qu.:0.1507
 Median :37.00   Median :0.2740
 Mean   :38.58   Mean   :0.2956
 3rd Qu.:48.00   3rd Qu.:0.4247
 Max.   :90.00   Max.   :1.0000
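
The same min-max formula can be wrapped in a small reusable function. The sketch below is illustrative only: rescale01() is a hypothetical helper name, and the age vector simply reuses the five quantile values from the age column of the summary above rather than the real adult_incomes data.

```r
# Min-max scaling: map the smallest value to 0 and the largest to 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)          # rng[1] = min, rng[2] = max
  (x - rng[1]) / (rng[2] - rng[1])
}

# Quantiles of age taken from the summary output (17 to 90)
age <- c(17, 28, 37, 48, 90)

scaled <- rescale01(age)
round(scaled, 4)
# 0, 0.1507, 0.2740, 0.4247, 1
```

Because the minimum and maximum are the same as in the full column, the rescaled quantiles reproduce the scaled_age values in the summary: for example, (37 - 17) / (90 - 17) = 20 / 73, which rounds to 0.2740.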

SLIDE 16

income_vars <- adult_incomes %>% select(age, educational_num)
processed_vars <- preProcess(income_vars, method = c("range"))
transformed <- predict(processed_vars, adult_incomes)
transformed %>% select(age, educational_num) %>% summary()

      age          educational_num
 Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.1507   1st Qu.:0.5333
 Median :0.2740   Median :0.6000
 Mean   :0.2956   Mean   :0.6054
 3rd Qu.:0.4247   3rd Qu.:0.7333
 Max.   :1.0000   Max.   :1.0000

SLIDE 17

Mean centering

SLIDE 18

Coding example

adult_incomes <- adult_incomes %>% mutate(mscale_age = scaled_age - mean(scaled_age))
adult_incomes %>% select(age, mscale_age) %>% summary()

      age          mscale_age
 Min.   :17.00   Min.   :-0.29564
 1st Qu.:28.00   1st Qu.:-0.14495
 Median :37.00   Median :-0.02167
 Mean   :38.58   Mean   : 0.00000
 3rd Qu.:48.00   3rd Qu.: 0.12902
 Max.   :90.00   Max.   : 0.70436

SLIDE 19

Using caret and centering

adult_incomes %>% select(age, hours_per_week) %>% summary()

      age          hours_per_week
 Min.   :17.00   Min.   : 1.00
 Median :37.00   Median :40.00
 Mean   :38.58   Mean   :40.44
 3rd Qu.:48.00   3rd Qu.:45.00
 Max.   :90.00   Max.   :99.00

processed_vars <- preProcess(adult_incomes %>% select(age, hours_per_week), method = c("center"))

SLIDE 20

Using caret and centering

transformed <- predict(processed_vars, adult_incomes)
transformed %>% select(age, hours_per_week) %>% summary()

      age            hours_per_week
 Min.   :-21.582   Min.   :-39.4375
 Median : -1.582   Median : -0.4375
 Mean   :  0.000   Mean   :  0.0000
 3rd Qu.:  9.418   3rd Qu.:  4.5625
 Max.   : 51.418   Max.   : 58.5625

SLIDE 21

Normalization techniques summary

Scaling between 0 and 1:
Well-defined upper and lower bounds
Not a lot of outliers
Centering around the mean:
Helpful when you have outliers

SLIDE 22

It's your turn!

SLIDE 23

Z-score standardization

Jose Hernandez

Data Scientist, University of Washington

SLIDE 24

Z-score standardization

Useful when:
You have some outliers
Measurements are on different scales of magnitude

SLIDE 25

Mean centering vs. z-score standardization

Mean centering shifts the values so that the mean is 0, but does not change the scale of the variables
Z-score standardization also changes the scale to unit variance
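
The contrast can be checked numerically with a few lines of base R. This is a minimal sketch on a hypothetical age vector (not the course data): centering leaves the standard deviation untouched, while z-scoring rescales it to 1.

```r
# Hypothetical ages, used only to illustrate the two transformations
age <- c(17, 28, 37, 48, 90)

# Mean centering: subtract the mean, keep the original units
centered <- age - mean(age)

# Z-score standardization: subtract the mean, then divide by the sd
z_score <- (age - mean(age)) / sd(age)

# Compare the spread before and after each transformation
spread <- c(raw = sd(age), centered = sd(centered), z_score = sd(z_score))
spread

# Base R's scale() performs the same z-score computation by default
z_from_scale <- as.numeric(scale(age))
```

In the spread vector the raw and centered entries are identical, and the z_score entry is 1.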

SLIDE 26

online_retail <- online_retail %>%
  mutate(z_quantity = (Quantity - mean(Quantity)) / sd(Quantity))

Use the mean() function and subtract the mean from the original variable
Use the sd() function to calculate the standard deviation, then divide by it

online_retail %>%
  select(Quantity, z_quantity) %>% summary()

    Quantity        z_quantity
 Min.   : 1.000   Min.   :-0.53561
 1st Qu.: 1.000   1st Qu.:-0.53561
 Median : 3.000   Median :-0.35481
 Mean   : 6.925   Mean   : 0.00000
 3rd Qu.: 8.000   3rd Qu.: 0.09717
 Max.   :99.000   Max.   : 8.32327

SLIDE 27

Standardizing multiple variables

online_retail %>%
  select(Quantity, UnitPrice) %>% summary()

    Quantity        UnitPrice
 Min.   : 1.000   Min.   :  0.000
 1st Qu.: 1.000   1st Qu.:  1.250
 Median : 3.000   Median :  2.510
 Mean   : 6.925   Mean   :  4.137
 3rd Qu.: 8.000   3rd Qu.:  4.250
 Max.   :99.000   Max.   :950.990

SLIDE 28

processed_vars <- preProcess(online_retail %>% select(Quantity, UnitPrice), method = c("center", "scale"))

Use methods "center" and "scale"

online_retail <- predict(processed_vars, online_retail)
online_retail %>%
  select("Quantity", "UnitPrice") %>% summary()

After this step, Quantity and UnitPrice each have mean 0 and standard deviation 1.
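
As a quick sanity check on what centering plus scaling should produce, the sketch below standardizes two columns with base R's scale() instead of caret. The retail data frame and its values are hypothetical stand-ins for online_retail. Every standardized column should end up with mean 0 and standard deviation 1, whatever its original units.

```r
# Hypothetical stand-in for the Quantity and UnitPrice columns
retail <- data.frame(
  Quantity  = c(1, 1, 3, 8, 99),
  UnitPrice = c(0.5, 1.25, 2.51, 4.25, 950.99)
)

# Center each column on its mean and divide by its standard deviation
standardized <- as.data.frame(scale(retail, center = TRUE, scale = TRUE))

# Verify the result: means are (numerically) 0, standard deviations are 1
col_means <- colMeans(standardized)
col_sds   <- apply(standardized, 2, sd)

round(col_means, 10)
col_sds
```

A summary() of the standardized columns that still shows the original minimum, mean, and maximum is a sign that the transformation was not applied to the data being summarized.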

SLIDE 29

Let's get standardizing!
