Transforming new features: Feature Engineering in R

SLIDE 1

Transforming new features

FEATURE ENGINEERING IN R

Jose Hernandez

Data Scientist, University of Washington

SLIDE 2

Addressing skewed variables

ggplot(online_retail, aes(x = Quantity)) + geom_density()

SLIDE 3

Power transformations in statistics

SLIDE 4

Using power transformations

ggplot(online_retail, aes(x = Quantity)) + geom_density()

SLIDE 5

Box-Cox transformations

ggplot(transformed, aes(x = Quantity)) + geom_density()

SLIDE 6

Yeo-Johnson transformation

ggplot(online_retail, aes(x = Quantity)) + geom_histogram(stat = "count")

SLIDE 7

Yeo-Johnson transformation

ggplot(transformed, aes(x = Quantity)) + geom_density()

SLIDE 8

Transforming with caret

# Requires library(caret) and library(dplyr)
# Keep only the variable(s) to transform
retail_vars <- online_retail %>% select(Quantity)
# Estimate the transformation from the data
processed_vars <- preProcess(retail_vars, method = c("YeoJohnson"))

Data frame: retail_vars
Method: "BoxCox" or "YeoJohnson"
Transforms all numeric variables in the data frame

# Apply the estimated transformation
transformed <- predict(processed_vars, online_retail)

transformed contains the transformed variables

SLIDE 9

Plotting the transformation results

ggplot(transformed, aes(x = Quantity)) + geom_density()

SLIDE 10

Box-Cox vs. Yeo-Johnson

Box-Cox: strictly positive numeric features
Yeo-Johnson: numeric features that may include zero or negative values
Both transform variables to better fit a normal distribution
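
To make the difference concrete, here is a minimal base-R sketch of the two transformations for a fixed parameter lambda. The function names box_cox and yeo_johnson and the input vectors are hypothetical, and lambda is fixed by hand here, whereas in practice it is estimated from the data. This is an illustration of the textbook formulas, not caret's internal code.

```r
# Box-Cox for a fixed lambda; only defined for strictly positive x.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Yeo-Johnson for a fixed lambda; also handles zero and negative x.
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Non-negative branch: Box-Cox applied to x + 1
  if (lambda == 0) {
    out[pos] <- log1p(x[pos])
  } else {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  }
  # Negative branch: mirrored transform with parameter 2 - lambda
  if (lambda == 2) {
    out[!pos] <- -log1p(-x[!pos])
  } else {
    out[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

x_pos   <- c(1, 2, 10, 100)   # hypothetical strictly positive values
x_mixed <- c(-5, 0, 3, 50)    # hypothetical values including zero and negatives

bc <- box_cox(x_pos, lambda = 0)          # lambda = 0 is the log transform
yj <- yeo_johnson(x_mixed, lambda = 0.5)  # works despite the negative value
```

Calling box_cox(x_mixed, 0.5) stops with an error because of the non-positive values, which is the practical reason to reach for Yeo-Johnson on such data.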

SLIDE 11

Your turn!

SLIDE 12

Normalization techniques: Scaling and Centering

SLIDE 13

Scaling to a range

raw range = 0 to 100
new range = 0 to 1

Useful when:
  Variables have known upper and lower bounds
  There are few outliers
  Data is approximately uniform across the range
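
As a small sketch of the 0-to-100 example, the helper below rescales values with known bounds to a new range. The name rescale_range and the sample scores are hypothetical and only serve the illustration.

```r
# Rescale x from a known [from_min, from_max] range to [to_min, to_max].
rescale_range <- function(x, from_min, from_max, to_min = 0, to_max = 1) {
  stopifnot(from_max > from_min)
  (x - from_min) / (from_max - from_min) * (to_max - to_min) + to_min
}

scores <- c(0, 25, 50, 100)  # hypothetical values on a 0 to 100 scale
scaled <- rescale_range(scores, from_min = 0, from_max = 100)
print(scaled)
```

Passing the bounds explicitly, rather than taking min() and max() of the data, fits the case where the limits are known in advance.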

SLIDE 14

adult_incomes %>% select(age) %>% range()

17 90

SLIDE 15

adult_incomes <- adult_incomes %>%
  mutate(scaled_age = (age - min(age)) / (max(age) - min(age)))

R finds the minimum for you with min()
R also finds the maximum with max()
mutate() creates your new column

adult_incomes %>% select(age, scaled_age) %>% summary()

      age          scaled_age
 Min.   :17.00   Min.   :0.0000
 1st Qu.:28.00   1st Qu.:0.1507
 Median :37.00   Median :0.2740
 Mean   :38.58   Mean   :0.2956
 3rd Qu.:48.00   3rd Qu.:0.4247
 Max.   :90.00   Max.   :1.0000

SLIDE 16

income_vars <- adult_incomes %>% select(age, educational_num)
processed_vars <- preProcess(income_vars, method = c("range"))
transformed <- predict(processed_vars, adult_incomes)
transformed %>% select(age, educational_num) %>% summary()

      age          educational_num
 Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.1507   1st Qu.:0.5333
 Median :0.2740   Median :0.6000
 Mean   :0.2956   Mean   :0.6054
 3rd Qu.:0.4247   3rd Qu.:0.7333
 Max.   :1.0000   Max.   :1.0000
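
The two-step pattern above (estimate the scaling once, then apply it with predict()) can be mimicked in base R to show why the steps are separate. The helpers fit_range and apply_range and both toy data frames are hypothetical; this is a sketch of the idea, not caret's implementation.

```r
# Step 1: learn the per-column minimum and maximum from one data frame.
fit_range <- function(df) {
  list(min = sapply(df, min), max = sapply(df, max))
}

# Step 2: apply the stored minimum and maximum to any data frame
# that has the same columns.
apply_range <- function(params, df) {
  cols <- names(params$min)
  df[cols] <- Map(function(col, lo, hi) (col - lo) / (hi - lo),
                  df[cols], params$min, params$max)
  df
}

# Hypothetical toy data, not the adult_incomes data set
train <- data.frame(age = c(17, 28, 37, 48, 90),
                    educational_num = c(1, 9, 10, 12, 16))
new_rows <- data.frame(age = c(20, 95),
                       educational_num = c(8, 13))

params       <- fit_range(train)
train_scaled <- apply_range(params, train)
new_scaled   <- apply_range(params, new_rows)
```

Because the bounds come from train, a new value outside them (age 95 here) lands outside the 0 to 1 interval, which is worth checking for when new data is scaled with previously estimated parameters.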
SLIDE 17

Mean centering

SLIDE 18

Coding example

adult_incomes <- adult_incomes %>% mutate(mscale_age = age - mean(age))
adult_incomes %>% select(age, mscale_age) %>% summary()

      age          mscale_age
 Min.   :17.00   Min.   :-21.582
 1st Qu.:28.00   1st Qu.:-10.582
 Median :37.00   Median : -1.582
 Mean   :38.58   Mean   :  0.000
 3rd Qu.:48.00   3rd Qu.:  9.418
 Max.   :90.00   Max.   : 51.418
SLIDE 19

Using caret and centering

adult_incomes %>% select(age, hours_per_week) %>% summary()

      age          hours_per_week
 Min.   :17.00   Min.   : 1.00
 Median :37.00   Median :40.00
 Mean   :38.58   Mean   :40.44
 3rd Qu.:48.00   3rd Qu.:45.00
 Max.   :90.00   Max.   :99.00

processed_vars <- preProcess(adult_incomes %>% select(age, hours_per_week),
                             method = c("center"))

SLIDE 20

Using caret and centering

transformed <- predict(processed_vars, adult_incomes)
transformed %>% select(age, hours_per_week) %>% summary()

      age           hours_per_week
 Min.   :-21.582   Min.   :-39.4375
 Median : -1.582   Median : -0.4375
 Mean   :  0.000   Mean   :  0.0000
 3rd Qu.:  9.418   3rd Qu.:  4.5625
 Max.   : 51.418   Max.   : 58.5625
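
For comparison, base R's scale() can centre columns without rescaling them. The sketch below uses a small hypothetical data frame rather than adult_incomes and checks that the column means become zero while the spread is unchanged.

```r
# Hypothetical toy data standing in for two numeric columns
toy <- data.frame(age = c(17, 28, 37, 48, 90),
                  hours_per_week = c(1, 40, 40, 45, 99))

# center = TRUE subtracts each column mean; scale = FALSE leaves the spread alone
centered <- as.data.frame(scale(toy, center = TRUE, scale = FALSE))

colMeans(centered)   # effectively zero for every column
sapply(centered, sd) # same standard deviations as the original columns
```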
SLIDE 21

Normalization techniques summary

Scaling between 0 and 1:
  Well-defined upper and lower bounds
  Not a lot of outliers
Centering around the mean:
  Helpful when you have outliers

SLIDE 22

It's your turn!

SLIDE 23

Z-score standardization

SLIDE 24

Z-score standardization

Useful when:
  You have some outliers
  Measurements are on different scales of magnitude

SLIDE 25

Mean centering vs. z-score standardization

Mean centering shifts the values so the mean is 0 but does not change the scale of the variables
Z-score standardization also rescales each variable to unit variance
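
A quick numeric check of this contrast, using a hypothetical skewed vector: centering leaves the standard deviation as it was, while the z-score brings it to 1.

```r
x <- c(1, 1, 3, 8, 99)           # hypothetical skewed values

centered <- x - mean(x)          # mean centering only
z <- (x - mean(x)) / sd(x)       # z-score standardization

# Compare the spread before and after each transformation
c(sd_raw = sd(x), sd_centered = sd(centered), sd_z = sd(z))
```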

SLIDE 26

online_retail <- online_retail %>%
  mutate(z_quantity = (Quantity - mean(Quantity)) / sd(Quantity))

Use the mean() function and subtract it from the original variable
Use the sd() function to calculate the standard deviation

online_retail %>%
  select(Quantity, z_quantity) %>%
  summary()

    Quantity         z_quantity
 Min.   : 1.000   Min.   :-0.53561
 1st Qu.: 1.000   1st Qu.:-0.53561
 Median : 3.000   Median :-0.35481
 Mean   : 6.925   Mean   : 0.00000
 3rd Qu.: 8.000   3rd Qu.: 0.09717
 Max.   :99.000   Max.   : 8.32327
SLIDE 27

Standardizing multiple variables

online_retail %>%
  select(Quantity, UnitPrice) %>%
  summary()

    Quantity         UnitPrice
 Min.   : 1.000   Min.   :  0.000
 1st Qu.: 1.000   1st Qu.:  1.250
 Median : 3.000   Median :  2.510
 Mean   : 6.925   Mean   :  4.137
 3rd Qu.: 8.000   3rd Qu.:  4.250
 Max.   :99.000   Max.   :950.990
SLIDE 28

processed_vars <- preProcess(online_retail %>% select(Quantity, UnitPrice),
                             method = c("center", "scale"))

Use methods "center" and "scale"

online_retail <- predict(processed_vars, online_retail)
online_retail %>%
  select("Quantity", "UnitPrice") %>%
  summary()

    Quantity         UnitPrice
 Min.   : 1.000   Min.   :  0.000
 1st Qu.: 1.000   1st Qu.:  1.250
 Median : 3.000   Median :  2.510
 Mean   : 6.925   Mean   :  4.137
 3rd Qu.: 8.000   3rd Qu.:  4.250
 Max.   :99.000   Max.   :950.990

Note: the summary on this slide repeats the untransformed values from slide 27. After centering and scaling, each column has mean 0 and standard deviation 1, so Quantity should match z_quantity on slide 26.
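
As a cross-check that does not depend on caret, the same standardization can be sketched with base R's scale() on a hypothetical toy data frame. The column names mirror the slides, but the values are made up, and Country is added only to show that non-selected columns are left untouched.

```r
# Hypothetical toy data, not the online_retail data set
toy <- data.frame(
  Quantity  = c(1, 1, 3, 8, 99),
  UnitPrice = c(0.5, 1.25, 2.51, 4.25, 950.99),
  Country   = c("UK", "UK", "FR", "DE", "UK")
)

num_cols <- c("Quantity", "UnitPrice")

# scale() centers and scales by default: (x - mean(x)) / sd(x) per column
standardized <- toy
standardized[num_cols] <- as.data.frame(scale(toy[num_cols]))

colMeans(standardized[num_cols])   # effectively 0 for both columns
sapply(standardized[num_cols], sd) # 1 for both columns
```

If the standardized summary still shows the original minimum and maximum, the transformation was most likely not applied to the object being summarized.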
SLIDE 29

Let's get standardizing!