Feature crossing (Feature Engineering in R)

SLIDE 1

Feature crossing

FEATURE ENGINEERING IN R

Jose Hernandez

Data Scientist, University of Washington

SLIDE 2

Examples of crossing features

Housing price prediction: [ location x number of bedrooms ]
Human attributes: [ gender x height ]
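A minimal sketch of what crossing two categorical features can look like in base R. The data frame `homes` and its values are hypothetical and not part of the course data; `interaction()` is one of several ways to build the crossed feature.

```r
# Hypothetical toy data; the values are made up for illustration.
homes <- data.frame(
  location = factor(c("north", "south", "north", "south")),
  bedrooms = factor(c(2, 2, 3, 3))
)

# interaction() builds a single factor with one level per
# location/bedroom combination.
homes$location_x_bedrooms <- interaction(homes$location, homes$bedrooms, sep = "_")

levels(homes$location_x_bedrooms)
table(homes$location_x_bedrooms)
```

Each level of the crossed factor can then be dummy-coded like any other categorical feature.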

SLIDE 3

Crossing categorical features

discipline_logs %>%
  select(infraction) %>%
  table()

 academic dishonesty              alcohol
                1294                  746
  disruptive conduct failure to cooperate
                1031                 4072
            fighting       minor incident
                2135                  522
          plagiarism            vandalism
                 112                   88

discipline_logs %>%
  select(gender) %>%
  table()

Female   Male
  3055   6945

SLIDE 4

Exploring visually

discipline_logs %>%
  group_by(infraction, gender) %>%
  summarize(n = n()) %>%
  ggplot(aes(infraction, n, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge")

SLIDE 6

Exploring crossed features

discipline_logs %>%
  select(gender, infraction) %>%
  table()

        infraction
gender   academic dishonesty alcohol disruptive conduct
  Female                 393     222                330
  Male                   901     524                701
        infraction
gender   failure to cooperate fighting minor incident
  Female                 1258      638            150
  Male                   2814     1497            372
        infraction
gender   plagiarism vandalism
  Female         39        25
  Male           73        63
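Raw counts are hard to compare because there are more than twice as many male as female records. One possible follow-up, sketched here with the counts copied from this slide (the object names `counts` and `row_share` are made up), is to convert each row to proportions:

```r
# Counts copied from the cross-tabulation on this slide.
counts <- matrix(
  c(393, 222, 330, 1258,  638, 150, 39, 25,
    901, 524, 701, 2814, 1497, 372, 73, 63),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Female", "Male"),
    infraction = c("academic dishonesty", "alcohol", "disruptive conduct",
                   "failure to cooperate", "fighting", "minor incident",
                   "plagiarism", "vandalism")
  )
)

# Share of each infraction within each gender; every row sums to 1.
row_share <- prop.table(counts, margin = 1)
round(row_share, 3)
```

Comparing the two rows of `row_share` gives a rough sense of whether the infraction mix differs by gender, which is one informal signal that the crossed feature might carry extra information.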

SLIDE 7

dmy <- dummyVars( ~ gender:infraction, data = discipline_logs)
out_df <- predict(dmy, newdata = discipline_logs)

The formula term gender:infraction crosses the two features.

glimpse(out_df)

Observations: 10,000
Variables: 16
$ genderFemale.infractionacademic.dishonesty  <dbl> 0, 0, 0, 0, 0...
$ genderMale.infractionacademic.dishonesty    <dbl> 0, 0, 0, 0, 0...
$ genderFemale.infractionalcohol              <dbl> 0, 0, 0, 0, 0...
$ genderMale.infractionalcohol                <dbl> 0, 0, 0, 0, 0...
$ genderFemale.infractiondisruptive.conduct   <dbl> 0, 1, 0, 0, 0...
$ genderMale.infractiondisruptive.conduct     <dbl> 1, 0, 0, 0, 0...
$ genderFemale.infractionfailure.to.cooperate <dbl> 0, 0, 0, 0, 0...
$ genderMale.infractionfailure.to.cooperate   <dbl> 0, 0, 1, 1, 0...
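For readers without caret installed, a similar set of crossed indicator columns can be sketched with base R's `model.matrix()`. This is a swapped-in technique, not the slide's `dummyVars()` call, and the `toy` data frame is hypothetical.

```r
# Hypothetical toy data standing in for discipline_logs.
toy <- data.frame(
  gender     = factor(c("Female", "Male", "Male", "Female")),
  infraction = factor(c("alcohol", "fighting", "alcohol", "fighting"))
)

# "gender:infraction" requests only the crossed term; "- 1" drops the
# intercept so every gender/infraction combination gets its own 0/1 column.
crossed <- model.matrix(~ gender:infraction - 1, data = toy)

colnames(crossed)
rowSums(crossed)  # each row belongs to exactly one crossed category
```

With two levels per feature this yields four columns, the same pattern that gives 2 x 8 = 16 columns on the slide.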

SLIDE 8

Things to consider

Many categories = possibly sparse features
Prior knowledge of what might interact is needed in some regression contexts
Be sure to explore the different methods available to determine what to cross
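The sparsity point can be illustrated with a rough sketch: the number of crossed levels is the product of the level counts, so cells fill up slowly. The data below are simulated, and the threshold of 5 rows is an arbitrary assumption chosen only for illustration.

```r
set.seed(1)
n <- 200

# Simulated features with 10 and 8 levels; fixing the levels guarantees
# that all 80 combinations appear in the table even if unobserved.
toy <- data.frame(
  a = factor(sample(letters[1:10], n, replace = TRUE), levels = letters[1:10]),
  b = factor(sample(LETTERS[1:8],  n, replace = TRUE), levels = LETTERS[1:8])
)

cell_counts <- table(toy$a, toy$b)

length(cell_counts)    # number of possible crossed levels (10 x 8)
mean(cell_counts < 5)  # share of crossed levels with fewer than 5 rows
```

Crossed levels with very few rows produce indicator columns that are almost entirely zero, which is what the first bullet warns about.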

SLIDE 9

It's time for you to try!

SLIDE 10

Principal component analysis

Jose Hernandez

Data Scientist, University of Washington

SLIDES 11-12

PCA for feature engineering

SLIDES 13-16

PCA with 2 variables

SLIDE 17

Performing PCA using prcomp

glass_x <- glass_df %>%
  select(-ID, -glass_type)
glass_pca <- prcomp(glass_x, center = TRUE, scale. = TRUE)

Center = mean 0
Scale = unit variance
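A small check of what `center = TRUE` and `scale. = TRUE` do, sketched on simulated data (the data frame `x` is made up and is not the glass data). `prcomp()` stores the column means and standard deviations it used in `$center` and `$scale`.

```r
set.seed(42)

# Simulated features on very different scales.
x <- data.frame(
  a = rnorm(100, mean = 50, sd = 10),
  b = runif(100, min = 0, max = 1)
)

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Re-apply the stored centering and scaling to the raw data.
x_std <- scale(x, center = pca$center, scale = pca$scale)

colMeans(x_std)       # approximately 0 for every column
apply(x_std, 2, sd)   # 1 for every column
```

Without scaling, the feature with the largest raw variance (`a` here) would dominate the first component.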

SLIDE 18

print(glass_pca)

Standard deviations (1, .., p=9):
[1] 1.58466518 1.43180731 1.18526115 1.07604017 0.95603465 0.72638502
[7] 0.60741950 0.25269141 0.04011007

Rotation (n x k) = (9 x 9):
          PC1         PC2           PC3         PC4          PC5
RI -0.5451766  0.28568318 -0.0869108293 -0.14738099  0.073542700
Na  0.2581256  0.27035007  0.3849196197 -0.49124204 -0.153683304
Mg -0.1108810 -0.59355826 -0.0084179590 -0.37878577 -0.123509124
Al  0.4287086  0.29521154 -0.3292371183  0.13750592 -0.014108879
Si  0.2288364 -0.15509891  0.4587088382  0.65253771 -0.008500117
K   0.2193440 -0.15397013 -0.6625741197  0.03853544  0.307039842
Ca -0.4923061  0.34537980  0.0009847321  0.27644322  0.188187742
Ba  0.2503751  0.48470218 -0.0740547309 -0.13317545 -0.251334261
Fe -0.1858415 -0.06203879 -0.2844505524  0.23049202 -0.873264047
           PC6         PC7         PC8         PC9
RI -0.11528772 -0.08186724 -0.75221590 -0.02573194
Na  0.55811757 -0.14858006 -0.12769315  0.31193718
Mg -0.30818598  0.20604537 -0.07689061  0.57727335
Al  0.01885731  0.69923557 -0.27444105  0.19222686
Si -0.08609797 -0.21606658 -0.37992298  0.29807321
K   0.24363237 -0.50412141 -0.10981168  0.26050863
Ca  0.14866937  0.09913463  0.39870468  0.57932321
Ba -0.65721884 -0.35178255  0.14493235  0.19822820
Fe  0.24304431 -0.07372136 -0.01627141  0.01466944
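The rotation matrix holds the loadings: each column gives the weights that combine the standardized features into one component. A minimal sketch on simulated data (the matrix `x` and its column names are hypothetical) showing how the component scores in `$x` follow from the loadings:

```r
set.seed(7)

# Simulated data: 20 rows, 3 features.
x <- matrix(rnorm(60), ncol = 3,
            dimnames = list(NULL, c("v1", "v2", "v3")))

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Scores = standardized data multiplied by the rotation (loadings) matrix.
scores_manual <- scale(x, center = pca$center, scale = pca$scale) %*% pca$rotation

max(abs(scores_manual - pca$x))  # effectively zero
colSums(pca$rotation^2)          # each loading vector has unit length
```

The columns of `pca$x` are the engineered features that would replace the original variables in a model.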

SLIDE 19

It's your turn!

SLIDE 20

Interpreting PCA output

Jose Hernandez

Data Scientist, University of Washington

SLIDE 21

Determining the variation explained

summary(glass_pca)

Importance of components:
                         PC1    PC2    PC3    PC4    PC5     PC6    PC7
Standard deviation     1.585 1.4318 1.1853 1.0760 0.9560 0.72639 0.6074
Proportion of Variance 0.279 0.2278 0.1561 0.1286 0.1016 0.05863 0.0410
Cumulative Proportion  0.279 0.5068 0.6629 0.7915 0.8931 0.95173 0.9927
                           PC8     PC9
Standard deviation     0.25269 0.04011
Proportion of Variance 0.00709 0.00018
Cumulative Proportion  0.99982 1.00000

PC1 and PC2 together account for about 51% of the variance
PC1 through PC6 together account for about 95% of the variance
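The proportions in the summary come directly from the standard deviations: each component's variance is its squared standard deviation, divided by the total. A short sketch using the standard deviations printed for `glass_pca` on slide 18 (the 95% cutoff is just the figure quoted above, not a general rule):

```r
# Standard deviations copied from the print(glass_pca) output.
sdev <- c(1.58466518, 1.43180731, 1.18526115, 1.07604017, 0.95603465,
          0.72638502, 0.60741950, 0.25269141, 0.04011007)

prop_var <- sdev^2 / sum(sdev^2)  # proportion of variance per component
cum_var  <- cumsum(prop_var)      # cumulative proportion

round(prop_var, 4)
round(cum_var, 4)

# Smallest number of components whose cumulative proportion reaches 95%.
which(cum_var >= 0.95)[1]
```

Because the nine features were scaled to unit variance, the squared standard deviations sum to roughly 9, the number of features.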

SLIDE 22

Creating a tibble for plotting

# n() is only available inside dplyr verbs, so index the components directly
prop_var <- tibble(sdev = glass_pca$sdev,
                   pca_comp = seq_along(glass_pca$sdev))

prop_var <- prop_var %>%
  mutate(pcVar = sdev^2,
         propVar_ex = pcVar / sum(pcVar),
         pca_comp = as.character(pca_comp))

SLIDE 23

Plotting the results

ggplot(prop_var, aes(pca_comp, propVar_ex, group = 1)) +
  geom_line() +
  geom_point()

SLIDE 24

Exploring the outcome labels

library(ggfortify)  # provides autoplot() for prcomp objects
autoplot(glass_pca, data = glass_df, colour = 'glass_type')

SLIDE 25

PCA considerations

Useful when you have a lot of correlated features
Each component is uncorrelated
PCA is good when there is a linear relationship with the response
Kernel PCA can account for non-linearity
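The first two bullets can be checked with a brief sketch on simulated data (the features `f1`, `f2`, and `f3` are made up): two inputs are built to be strongly correlated, yet the resulting component scores are uncorrelated with one another.

```r
set.seed(3)
n <- 200
base <- rnorm(n)

# f1 and f2 share a common signal, so they are strongly correlated;
# f3 is independent noise.
x <- cbind(
  f1 = base + rnorm(n, sd = 0.3),
  f2 = base + rnorm(n, sd = 0.3),
  f3 = rnorm(n)
)

round(cor(x), 2)  # correlations among the original features

pca <- prcomp(x, center = TRUE, scale. = TRUE)
score_cor <- cor(pca$x)

# Largest off-diagonal correlation between component scores.
max(abs(score_cor[upper.tri(score_cor)]))
```

The off-diagonal correlations of the scores are zero up to floating-point error, which is why principal components can help models that struggle with collinear inputs.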

SLIDE 26

Give it a try!

SLIDE 27

Wrap-up

Jose M Hernandez

Data Scientist, University of Washington

SLIDE 28

Course wrap-up: Chapter 1

Categorical data
Numerical representations (0, 1)

SLIDE 29

Course wrap-up: Chapter 2

Numerical features and their uses
Bucketing/binning
Date stamps to features

SLIDE 30

Course wrap-up: Chapter 3

Box-Cox and Yeo-Johnson
Scaling features
Mean centering
Z-score standardization

SLIDE 31

Course wrap-up: Chapter 4

Crossing features for better model performance
PCA as a useful feature engineering method

SLIDE 32

Course wrap-up: Functions we used

tidyverse packages like dplyr and ggplot2
caret

SLIDE 33

Course wrap-up: Extension to feature engineering

Feature engineering for text and images

SLIDE 34

Congratulations!