Feature crossing
FEATURE ENGINEERING IN R
Jose Hernandez
Data Scientist, University of Washington
Examples of crossing features
Housing price prediction [location x number of bedrooms]
Human attributes [gender x height]
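To make the housing example concrete, here is a minimal base R sketch of a crossed feature. The data frame `homes`, its column names, and its values are made up for illustration; `interaction()` is one simple way to combine two categorical columns into a single crossed factor.

```r
# Toy data; the column names and values are assumptions for illustration.
homes <- data.frame(
  location = c("urban", "rural", "urban", "rural"),
  bedrooms = c("2", "2", "3", "3")
)

# Cross the two features: every location/bedroom pairing becomes its own level.
homes$location_x_bedrooms <- interaction(homes$location, homes$bedrooms,
                                         sep = "_x_")

levels(homes$location_x_bedrooms)   # the crossed categories
table(homes$location_x_bedrooms)    # rows per crossed category
```

A model given the crossed column can learn a separate effect for, say, a 3-bedroom urban home, which the two original columns on their own cannot express additively.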
discipline_logs %>%
  select(infraction) %>%
  table()

 academic dishonesty              alcohol
                1294                  746
  disruptive conduct failure to cooperate
                1031                 4072
            fighting       minor incident
                2135                  522
          plagiarism            vandalism
                 112                   88

discipline_logs %>%
  select(gender) %>%
  table()

Female   Male
  3055   6945
discipline_logs %>%
  group_by(infraction, gender) %>%
  summarize(n = n()) %>%
  ggplot(., aes(infraction, n, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge")
discipline_logs %>%
  select(gender, infraction) %>%
  table()

        infraction
gender   academic dishonesty alcohol disruptive conduct
  Female                 393     222                330
  Male                   901     524                701
        infraction
gender   failure to cooperate fighting minor incident
  Female                 1258      638            150
  Male                   2814     1497            372
        infraction
gender   plagiarism vandalism
  Female         39        25
  Male           73        63
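Because the Male group is more than twice the size of the Female group, raw counts are hard to compare across rows. A small sketch, re-entering the counts from the table above by hand, normalizes within gender with `prop.table()` so the two infraction profiles can be compared directly. The object names `counts` and `row_props` are illustrative.

```r
# Counts copied from the gender-by-infraction table on the slide.
counts <- matrix(
  c(393, 222, 330, 1258,  638, 150, 39, 25,
    901, 524, 701, 2814, 1497, 372, 73, 63),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Female", "Male"),
    infraction = c("academic dishonesty", "alcohol", "disruptive conduct",
                   "failure to cooperate", "fighting", "minor incident",
                   "plagiarism", "vandalism")
  )
)

# Row proportions: the distribution of infractions within each gender.
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)

# The margins recover the two one-way tables shown earlier.
rowSums(counts)   # totals per gender
colSums(counts)   # totals per infraction
```

If the rows of `row_props` look nearly identical, crossing gender with infraction may add little beyond the two separate features.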
library(caret)  # provides dummyVars()
library(dplyr)  # provides glimpse()

# gender:infraction crosses the two factors: one indicator column per combination
dmy <- dummyVars( ~ gender:infraction, data = discipline_logs)

# Assumed step (not shown on the slide): apply the encoder to build out_df
out_df <- data.frame(predict(dmy, newdata = discipline_logs))

glimpse(out_df)

Observations: 10,000
Variables: 16
$ genderFemale.infractionacademic.dishonesty  <dbl> 0, 0, 0, 0, 0...
$ genderMale.infractionacademic.dishonesty    <dbl> 0, 0, 0, 0, 0...
$ genderFemale.infractionalcohol              <dbl> 0, 0, 0, 0, 0...
$ genderMale.infractionalcohol                <dbl> 0, 0, 0, 0, 0...
$ genderFemale.infractiondisruptive.conduct   <dbl> 0, 1, 0, 0, 0...
$ genderMale.infractiondisruptive.conduct     <dbl> 1, 0, 0, 0, 0...
$ genderFemale.infractionfailure.to.cooperate <dbl> 0, 0, 0, 0, 0...
$ genderMale.infractionfailure.to.cooperate   <dbl> 0, 0, 1, 1, 0...
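For readers without caret installed, the same kind of crossed indicator columns can be sketched in base R with `model.matrix()`. This is a swapped-in alternative, not the method used on the slide, and `toy_logs` is a made-up stand-in for `discipline_logs`.

```r
# Toy stand-in for discipline_logs; the values are assumptions for illustration.
toy_logs <- data.frame(
  gender     = factor(c("Male", "Female", "Male", "Male", "Female")),
  infraction = factor(c("fighting", "alcohol", "alcohol", "fighting", "fighting"))
)

# `gender:infraction` requests only the interaction term, and `- 1` drops the
# intercept, so the result has one 0/1 column per gender/infraction combination.
crossed <- model.matrix(~ gender:infraction - 1, data = toy_logs)

colnames(crossed)
rowSums(crossed)   # every row belongs to exactly one crossed category
```

With 2 genders and 2 infractions the toy matrix has 4 columns; the slide's data has 2 x 8 = 16, matching the `Variables: 16` line in the `glimpse()` output.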
Many categories = potentially sparse features
Prior knowledge of what might interact is needed in some regression contexts
Be sure to explore the different methods available to determine what to cross
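The sparsity warning can be illustrated with simulated data. In this sketch the number of levels (8 and 12), the sample size (200), and the 1% threshold are all arbitrary assumptions; the point is only that the number of crossed categories is the product of the level counts, so many combinations end up with few or no rows.

```r
set.seed(42)

# Simulated factors; level counts and sample size are assumptions.
n <- 200
a <- factor(sample(paste0("a", 1:8),  n, replace = TRUE), levels = paste0("a", 1:8))
b <- factor(sample(paste0("b", 1:12), n, replace = TRUE), levels = paste0("b", 1:12))

crossed <- interaction(a, b)
nlevels(crossed)   # 8 * 12 = 96 possible combinations

# Share of rows that fall into each combination (unused levels count as 0).
share <- as.numeric(table(crossed)) / n
summary(share)

# Combinations seen in fewer than 1% of rows (the threshold is arbitrary).
rare <- levels(crossed)[share < 0.01]
length(rare)
```

Rare combinations like these become indicator columns that are almost entirely zero, which is one reason to cross only the features that prior knowledge or exploration suggests interact.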
Principal component analysis
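glass_x <- glass_df %>% select(-ID, -glass_type)

# center = TRUE: shift each feature to mean 0
# scale. = TRUE: rescale each feature to unit variance
# (the call is cut off on the slide; scale. = TRUE is inferred from the
#  slide note "Scale = Unit variance")
glass_pca <- prcomp(glass_x, center = TRUE, scale. = TRUE)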
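What centering and scaling do can be checked directly. The sketch below uses the built-in `iris` measurements as a stand-in for `glass_x` (an assumption, since the glass data is not included here) and shows that `prcomp()` with both options on behaves like PCA on a pre-standardized matrix.

```r
# Built-in iris measurements stand in for glass_x (assumption for illustration).
x <- iris[, 1:4]

# Centering subtracts each column mean; scaling divides by each column's
# standard deviation.
x_std <- scale(x, center = TRUE, scale = TRUE)

round(colMeans(x_std), 10)   # column means after standardizing
apply(x_std, 2, sd)          # column standard deviations after standardizing

# PCA with centering and scaling switched on matches PCA on x_std.
pca_a <- prcomp(x, center = TRUE, scale. = TRUE)
pca_b <- prcomp(x_std, center = FALSE, scale. = FALSE)
all.equal(pca_a$sdev, pca_b$sdev)
```

Without scaling, features measured in larger units would dominate the first components simply because their variances are larger.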
print(glass_pca)

Standard deviations (1, .., p=9):
[1] 1.58466518 1.43180731 1.18526115 1.07604017 0.95603465 0.72638502
[7] 0.60741950 0.25269141 0.04011007

Rotation (n x k) = (9 x 9):
          PC1         PC2           PC3         PC4          PC5
RI -0.5451766  0.28568318 -0.0869108293 -0.14738099  0.073542700
Na  0.2581256  0.27035007  0.3849196197 -0.49124204 -0.153683304
Mg -0.1108810 -0.59355826 -0.0084179590 -0.37878577 -0.123509124
Al  0.4287086  0.29521154 -0.3292371183  0.13750592 -0.014108879
Si  0.2288364 -0.15509891  0.4587088382  0.65253771 -0.008500117
K   0.2193440 -0.15397013 -0.6625741197  0.03853544  0.307039842
Ca -0.4923061  0.34537980  0.0009847321  0.27644322  0.188187742
Ba  0.2503751  0.48470218 -0.0740547309 -0.13317545 -0.251334261
Fe -0.1858415 -0.06203879 -0.2844505524  0.23049202 -0.873264047
           PC6         PC7         PC8         PC9
RI -0.11528772 -0.08186724 -0.75221590 -0.02573194
Na  0.55811757 -0.14858006 -0.12769315  0.31193718
Mg -0.30818598  0.20604537 -0.07689061  0.57727335
Al  0.01885731  0.69923557 -0.27444105  0.19222686
Si -0.08609797 -0.21606658 -0.37992298  0.29807321
K   0.24363237 -0.50412141 -0.10981168  0.26050863
Ca  0.14866937  0.09913463  0.39870468  0.57932321
Ba -0.65721884 -0.35178255  0.14493235  0.19822820
Fe  0.24304431 -0.07372136 -0.01627141  0.01466944
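The rotation matrix printed above holds the loadings: each column gives the weights that combine the standardized features into one component. A minimal sketch, again using `iris` as stand-in data (an assumption), shows how the loadings, the scores in `$x`, and the standard deviations in `$sdev` relate to one another.

```r
x <- iris[, 1:4]   # stand-in data (assumption)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# The loadings are orthonormal: t(R) %*% R is the identity matrix.
R <- pca$rotation
round(crossprod(R), 10)

# The scores are the standardized data projected onto the loadings.
scores_manual <- scale(x) %*% R
all.equal(unname(scores_manual), unname(pca$x))

# sdev is the standard deviation of each column of scores.
all.equal(unname(apply(pca$x, 2, sd)), pca$sdev)
```

Reading a column of the rotation matrix this way, a large absolute loading (for example RI and Ca on PC1 in the glass output) marks a feature that contributes strongly to that component.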
summary(glass_pca)

Importance of components:
                         PC1    PC2    PC3    PC4    PC5     PC6    PC7
Standard deviation     1.585 1.4318 1.1853 1.0760 0.9560 0.72639 0.6074
Proportion of Variance 0.279 0.2278 0.1561 0.1286 0.1016 0.05863 0.0410
Cumulative Proportion  0.279 0.5068 0.6629 0.7915 0.8931 0.95173 0.9927
                           PC8     PC9
Standard deviation     0.25269 0.04011
Proportion of Variance 0.00709 0.00018
Cumulative Proportion  0.99982 1.00000
PC1 and PC2 together account for about 51% of the variance
PC1 through PC6 together account for about 95% of the variance
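These percentages follow from the standard deviations alone: each component's variance is its standard deviation squared, divided by the total. The sketch below re-enters the values from `print(glass_pca)` by hand; the 0.95 cutoff and the names `cum_var` and `n_keep` are illustrative choices, not rules from the slides.

```r
# Standard deviations copied from the print(glass_pca) output.
sdev <- c(1.58466518, 1.43180731, 1.18526115, 1.07604017, 0.95603465,
          0.72638502, 0.60741950, 0.25269141, 0.04011007)

prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_var  <- cumsum(prop_var)       # cumulative proportion
round(cum_var, 4)

# Smallest number of components whose cumulative share reaches the cutoff.
n_keep <- which(cum_var >= 0.95)[1]
n_keep
```

The squared standard deviations sum to roughly 9, the number of features, which is what one expects when every feature has been scaled to unit variance.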
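library(dplyr)  # tibble(), mutate(), %>%

# n() only works inside dplyr verbs, so number the components with seq_along()
prop_var <- tibble(sdev = glass_pca$sdev,
                   pca_comp = seq_along(sdev))

prop_var <- prop_var %>%
  mutate(pcVar = sdev^2,                    # variance of each component
         propVar_ex = pcVar / sum(pcVar),   # proportion of variance explained
         pca_comp = as.character(pca_comp))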
ggplot(prop_var, aes(pca_comp, propVar_ex, group = 1)) +
  geom_line() +
  geom_point()
library(ggfortify)  # supplies the autoplot() method for prcomp objects

autoplot(glass_pca, data = glass_df, colour = 'glass_type')
Useful when you have many correlated features
The principal components are uncorrelated with one another
PCA works well when the relationship with the response is linear
Kernel PCA can account for non-linearity
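The first two points can be seen side by side in a short sketch. It uses `iris` as stand-in data with strongly correlated columns (an assumption), and keeping two components is an arbitrary choice for illustration.

```r
x <- iris[, 1:4]   # stand-in data with correlated columns (assumption)
round(cor(x), 2)   # several large off-diagonal correlations

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Correlation matrix of the component scores: off-diagonals are zero.
score_cor <- cor(pca$x)
round(score_cor, 10)

# Keep the first two components as replacement features for a model.
reduced <- as.data.frame(pca$x[, 1:2])
head(reduced)
```

The `reduced` columns can stand in for the original correlated features in a downstream model, at the cost of being harder to interpret than the raw measurements.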
Course recap
Categorical data: numerical representations (0, 1)
Numerical features: bucketing/binning and turning date stamps into features
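As a reminder of what these two techniques look like, here is a minimal base R sketch. The ages, the bin edges and labels, and the dates are made up for illustration.

```r
# Bucketing/binning: turn a numeric feature into ordered categories.
age <- c(5, 17, 18, 34, 65, 80)                     # made-up values
age_bin <- cut(age,
               breaks = c(0, 18, 65, Inf),          # assumed bin edges
               labels = c("child", "adult", "senior"),
               right = FALSE)                       # intervals are [lower, upper)
table(age_bin)

# Date stamps to features: pull out parts of the date as separate columns.
stamps <- as.Date(c("2019-01-15", "2019-07-04", "2020-12-31"))
date_features <- data.frame(
  year  = as.integer(format(stamps, "%Y")),
  month = as.integer(format(stamps, "%m")),
  day   = as.integer(format(stamps, "%d"))
)
date_features
```

With `right = FALSE` a value exactly on a bin edge, such as 18 or 65, falls into the higher bin.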
Box-Cox and Yeo-Johnson transformations
Scaling features
Mean centering
Z-score standardization
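The transformations listed here can be written out by hand to show what they compute. The sketch below implements the standard Box-Cox and Yeo-Johnson formulas for a fixed `lambda`; the function names and sample values are illustrative, and in practice `lambda` is usually estimated from the data rather than chosen manually.

```r
# Box-Cox: defined for strictly positive x; lambda = 0 is the log case.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Yeo-Johnson: extends the same idea to zero and negative values.
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  out[pos] <- if (lambda == 0) log1p(x[pos]) else
    ((x[pos] + 1)^lambda - 1) / lambda
  out[!pos] <- if (lambda == 2) -log1p(-x[!pos]) else
    -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  out
}

# Mean centering and z-score standardization.
x <- c(1, 2, 4, 8, 16)          # made-up, right-skewed values
centered <- x - mean(x)
z <- (x - mean(x)) / sd(x)

box_cox(x, lambda = 0)          # same as log(x)
yeo_johnson(c(-2, 0, 2), lambda = 0.5)
round(c(mean(z), sd(z)), 10)
```

Box-Cox requires positive inputs, which is why Yeo-Johnson is the usual choice when a feature contains zeros or negative values.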
Crossing features for better model performance
PCA as a useful feature engineering method
tidyverse packages such as dplyr and ggplot2
caret
Feature engineering for text and images