

SLIDE 1

Feature Engineering


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)

x → transform → φ(x)

SLIDE 2

Logistics

  • Project 1 is out! (due in two weeks)
  • Start early! Work required is about 2 HWs
  • HW4 will be out next Wed
  • Due two weeks later (1 week after the project)
  • More time to learn the required material
  • Class TOMORROW 3pm (Tufts runs a Monday schedule this Thursday)

SLIDE 3

Objectives Today: Feature Engineering

Concept check-in:
  • How should I preprocess my features?
  • How can I select a subset of important features?
  • What to do if features are missing?

SLIDE 4

Check-in Q1: logsumexp


  • What scalar value should these calls produce?
  • What happens instead on a real computer?
  • What is the fix?

SLIDE 5

logsumexp explained


$$\begin{aligned}
\mathrm{logsumexp}([-100, -97, -101]) &= \log(e^{-100} + e^{-97} + e^{-101}) \\
&= \log\!\big(e^{-97}(e^{-3} + e^{0} + e^{-4})\big) \\
&= -97 + \log\underbrace{(e^{-3} + e^{0} + e^{-4})}_{1 \,\le\, \text{sum} \,\le\, 3}
\end{aligned}$$

Factor out the MAX of −97.
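A minimal NumPy sketch of the trick (the toy values mirror the slide; scipy.special.logsumexp is the library version):

import numpy as np

def logsumexp(scores):
    # Naive np.log(np.sum(np.exp(scores))) underflows to log(0) = -inf
    # for scores like [-100, -97, -101]; factoring out the max fixes it.
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))  # shifted terms lie in (0, 1]

print(logsumexp(np.asarray([-100.0, -97.0, -101.0])))  # approx -96.93, not -inf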

SLIDE 6

Check-in Q2: Gradient steps


  • How can I diagnose step size choices?
  • What are three ways to improve step size selection?

SLIDE 7

Check-in Q2: Gradient steps


How can I diagnose step size choices?
  • Trace plots of loss, gradient norm, and parameters
  • Explore like "Goldilocks": find one step size too small and one too big

What are three ways to improve step size selection?
  • Use a decaying step size
  • Use line search to find a step size that reduces the loss
  • Use second-order methods (Newton, L-BFGS)
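A minimal sketch of the first fix, a step size that decays as 1/t (the function names and toy objective are illustrative, not from the slides):

import numpy as np

def gd_with_decay(calc_grad, w_init, base_step=0.5, n_steps=100):
    # Step size shrinks as base_step / t: early steps explore,
    # later steps settle. One of the three fixes listed above.
    w = np.asarray(w_init, dtype=np.float64)
    for t in range(1, n_steps + 1):
        w = w - (base_step / t) * calc_grad(w)
    return w

# Toy check on f(w) = 0.5 * ||w||^2, whose gradient is w itself:
print(gd_with_decay(lambda w: w, [5.0, -3.0]))  # slowly approaches [0, 0]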

SLIDE 8

What will we learn?


[Diagram: the three paradigms (supervised, unsupervised, reinforcement learning). Supervised learning: a task mapping data $x$ to label $y$, data-label pairs $\{x_n, y_n\}_{n=1}^N$, and a performance measure; the stages are training, prediction, and evaluation.]

SLIDE 9


Transformations of Features

SLIDE 10

Fitting a line isn’t always ideal

SLIDE 11

Can fit linear functions to nonlinear features


$$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$

A nonlinear function of $x$ can be written as a linear function of $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]$:

$$\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \, \phi_g(x_i) = \theta^T \phi(x_i)$$

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data.
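A minimal sklearn sketch of this idea (the noise-free cubic target is a toy of mine; PolynomialFeatures builds exactly the φ above):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x_N1 = np.linspace(-2, 2, 50).reshape(-1, 1)
y_N = 1.0 - 2.0 * x_N1[:, 0] + 0.5 * x_N1[:, 0] ** 3   # toy cubic target

phi_NG = PolynomialFeatures(degree=3).fit_transform(x_N1)       # [1, x, x^2, x^3]
model = LinearRegression(fit_intercept=False).fit(phi_NG, y_N)  # bias lives in phi
print(model.coef_)  # approx [1.0, -2.0, 0.0, 0.5]: linear in theta, cubic in x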

SLIDE 12

What feature transform to use?

  • Anything that works for your data!
  • sin / cos for periodic data
  • polynomials for high-order dependencies
  • interactions between feature dimensions
  • Many other choices possible


$\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \ldots]$ (polynomials)

$\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \ldots]$ (interactions)
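A hand-rolled sketch of such transforms (my illustrative example; sklearn's PolynomialFeatures(interaction_only=True) also builds pairwise interactions):

import numpy as np

def make_feats(x_NF):
    # Bias, periodic transforms of column 0, and an interaction of columns 0, 1.
    return np.column_stack([
        np.ones(x_NF.shape[0]),    # constant "1" feature
        np.sin(x_NF[:, 0]),        # sin / cos for periodic data
        np.cos(x_NF[:, 0]),
        x_NF[:, 0] * x_NF[:, 1],   # interaction between feature dimensions
    ])

print(make_feats(np.random.rand(5, 2)).shape)  # (5, 4)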

SLIDE 13

Standard Pipeline


[Diagram: the standard pipeline. Task: data $x$ to label $y$; data-label pairs $\{x_n, y_n\}_{n=1}^N$; a performance measure.]

SLIDE 14

Feature Transform Pipeline


[Diagram: the same pipeline with a feature transform $\phi(x)$ inserted, so the data-label pairs $\{x_n, y_n\}_{n=1}^N$ become feature-label pairs $\{\phi(x_n), y_n\}_{n=1}^N$ before training.]

SLIDE 15

What features to use here?

SLIDE 16

Reasons for Feature Transform

  • Improve prediction quality
  • Improve interpretability
  • Reduce computational costs
  • Fewer features means fewer parameters
  • Improve numerical performance of training

SLIDE 17

Recall from HW2: Polynomial Features

SLIDE 18

Error vs. Degree (orig. poly.)

SLIDE 19

Error vs. Degree (rescaled poly)

SLIDE 20

Weight histograms (orig. poly.)

SLIDE 21

Weight histograms (rescaled poly.)

SLIDE 22

Scikit-Learn Transformer API


# Construct a "transformer"
>>> t = Transformer()

# Train any parameters needed
>>> t.fit(x_NF)  # y optional, often unused

# Apply to extract new features
>>> feat_NG = t.transform(x_NF)

SLIDE 23

Example 1: Sum of features

from sklearn.base import TransformerMixin
import numpy as np

class SumFeatureExtractor(TransformerMixin):
    """ Extracts *sum* of feature vector as new feat """
    def __init__(self):
        pass
    def fit(self, x_NF):
        return self
    def transform(self, x_NF):
        return np.sum(x_NF, axis=1)[:, np.newaxis]
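A quick check of the class above on a toy array (this usage example is mine, not from the slides):

x_NF = np.asarray([[1., 2., 3.],
                   [4., 5., 6.]])
t = SumFeatureExtractor()
feat_N1 = t.fit(x_NF).transform(x_NF)  # fit returns self, so the calls chain
print(feat_N1)  # [[ 6.] [15.]] -- shape (N, 1), one new feature per example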

SLIDE 24

Example 2: Square features

from sklearn.base import TransformerMixin

class SquareFeatureExtractor(TransformerMixin):
    """ Extracts *square* of feature vector as new feat """
    def fit(self, x_NF):
        return self
    def transform(self, x_NF):
        pass  # TODO

SLIDE 25

Example 2: Square features

from sklearn.base import TransformerMixin
import numpy as np

class SquareFeatureExtractor(TransformerMixin):
    """ Extracts *square* of feature vector as new feat """
    def fit(self, x_NF):
        return self
    def transform(self, x_NF):
        return np.square(x_NF)  # OR: return np.power(x_NF, 2)

SLIDE 26

SLIDE 27

Feature Rescaling

Input: Each numeric feature has arbitrary min/max

  • Some in [0, 1], some in [−5, 5], some in [−3333, −2222]

Transformed feature vector

  • Set each feature value f to have [0, 1] range
  • min_f = minimum observed in training set
  • max_f = maximum observed in training set


$$\phi(x_n)_f = \frac{x_{nf} - \min_f}{\max_f - \min_f}$$

SLIDE 28

Example 3: Rescaling features

from sklearn.base import TransformerMixin

class MinMaxScaleFeatureExtractor(TransformerMixin):
    """ Rescales features between 0 and 1 """
    def fit(self, x_NF):
        self.min_F = None  # TODO
        self.max_F = None  # TODO
    def transform(self, x_NF):
        pass  # TODO

SLIDE 29

Example 3: Rescaling features

from sklearn.base import TransformerMixin
import numpy as np

class MinMaxFeatureRescaler(TransformerMixin):
    """ Rescales each feature column to be within [0, 1].
    Uses training-data min/max. """
    def fit(self, x_NF):
        self.min_1F = np.min(x_NF, axis=0, keepdims=1)
        self.max_1F = np.max(x_NF, axis=0, keepdims=1)
        return self  # added: fit should return self so calls can chain
    def transform(self, x_NF):
        feat_NF = ((x_NF - self.min_1F)
                   / (self.max_1F - self.min_1F))
        return feat_NF
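For comparison, sklearn ships the same transform as sklearn.preprocessing.MinMaxScaler; a quick sanity check on toy data of mine:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_NF = np.asarray([[1., 10.],
                   [3., 20.],
                   [2., 30.]])
print(MinMaxScaler().fit_transform(x_NF))
# Each column now spans [0, 1]:
# [[0.   0. ]
#  [1.   0.5]
#  [0.5  1. ]]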

SLIDE 30

Feature Standardization

Input: each feature is numeric, with arbitrary scale.

Transformed feature vector:
  • Set each feature value f to have zero mean, unit variance

$$\phi(x_n)_f = \frac{x_{nf} - \mu_f}{\sigma_f}$$

where $\mu_f$ is the empirical mean and $\sigma_f$ the empirical standard deviation observed in the training set.

SLIDE 31

Feature Standardization

  • Treats each feature as "Normal(0, 1)"
  • Typical range will be −3 to +3 (if original data is approximately normal)
  • Also called the z-score transform


$$\phi(x_n)_f = \frac{x_{nf} - \mu_f}{\sigma_f}$$
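A minimal sketch with sklearn's built-in sklearn.preprocessing.StandardScaler (the toy data is mine):

import numpy as np
from sklearn.preprocessing import StandardScaler

x_NF = np.asarray([[1., 100.],
                   [2., 200.],
                   [3., 300.]])
scaler = StandardScaler().fit(x_NF)         # learns mu_f, sigma_f per column
z_NF = scaler.transform(x_NF)
print(z_NF.mean(axis=0), z_NF.std(axis=0))  # approx [0. 0.] and [1. 1.]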

SLIDE 32

Feature Scaling with Outliers

  • What happens to standard scaling when training data has outliers?
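One hedged answer sketch: a single extreme value inflates $\mu_f$ and $\sigma_f$, so the inliers get squashed together; a median/IQR scaler such as sklearn's RobustScaler resists this (toy data mine):

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x_N1 = np.asarray([[1.], [2.], [3.], [4.], [5.], [1000.]])  # one big outlier
print(StandardScaler().fit_transform(x_N1).ravel())
# inliers clump within ~0.01 of each other near -0.45
print(RobustScaler().fit_transform(x_N1).ravel())
# median/IQR scaling keeps the inliers spread out (about -1.0 to 0.6)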

SLIDE 33

Feature Scaling with Outliers

SLIDE 34

Combining several transformers
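The slide's figure is not recoverable here; as a hedged sketch of the idea, sklearn's Pipeline chains transformers end to end (FeatureUnion runs them side by side):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

pipe = Pipeline([
    ("rescale", MinMaxScaler()),             # map each feature to [0, 1] first
    ("poly", PolynomialFeatures(degree=2)),  # then expand polynomial features
    ("regress", LinearRegression()),         # final step is the predictor
])
x_NF = np.random.rand(30, 2)
y_N = x_NF[:, 0] + x_NF[:, 1] ** 2
pipe.fit(x_NF, y_N)                          # fit/transform cascade automatically
print(pipe.predict(x_NF[:3]))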

SLIDE 35

Categorical Features

Numerical encoding


["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"] "uses Firefox” à 1 “uses Safari” à 3

SLIDE 36

Categorical Features

One-hot vector


["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]

[ 0 0 1 0 ] [ 1 0 0 0 ]

Firefox Chrome Safari Internet Explorer "uses Firefox” “uses Safari”
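A sketch with sklearn's OneHotEncoder reproducing the vectors above (the explicit category order is mine; sparse=False matches the 2019-era API, later renamed sparse_output):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

browsers = np.asarray([["uses Firefox"], ["uses Safari"]])
enc = OneHotEncoder(categories=[["uses Firefox", "uses Chrome",
                                 "uses Safari", "uses Internet Explorer"]],
                    sparse=False)  # dense output
print(enc.fit_transform(browsers))
# [[1. 0. 0. 0.]   "uses Firefox"
#  [0. 0. 1. 0.]]  "uses Safari"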

SLIDE 37

Feature Selection or “Pruning”

SLIDE 38

Best Subset Selection

SLIDE 39

Problem: Too many subsets! (With F features there are $2^F$ candidate subsets to evaluate.)

SLIDE 40

Forward Stepwise Selection


Start with the zero-feature model (guess the mean); store as M_0.
Add the best-scoring single feature (searching among all F); store as M_1.
For each size k = 2, ..., F:
    Try each possible not-yet-included feature (there are F − k + 1).
    Add the best-scoring one to M_{k−1}; store as M_k.
Pick the best among M_0, M_1, ..., M_F on a validation set.
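A hedged Python sketch of this loop (function and variable names are mine; cross-validated R² stands in for the slide's unspecified score):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_stepwise(x_NF, y_N):
    # Greedily grow the feature set; record the best model at each size k.
    F = x_NF.shape[1]
    included, models = [], []
    for k in range(1, F + 1):
        remaining = [f for f in range(F) if f not in included]
        scores = [cross_val_score(LinearRegression(),
                                  x_NF[:, included + [f]], y_N).mean()
                  for f in remaining]
        included.append(remaining[int(np.argmax(scores))])
        models.append((list(included), max(scores)))  # this is M_k
    # Pick the best size on held-out score (M_0, the mean-only model, omitted).
    return max(models, key=lambda m: m[1])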

SLIDE 41

Best vs Forward Stepwise


It is easy to find cases where forward stepwise's greedy approach does not deliver the best possible subset.

SLIDE 42

Backwards Stepwise Selection

Start with all features. At each step, test all models with one feature removed and drop the feature whose removal scores best. Repeat.

SLIDE 43

Other Feature Selection Methods

  • Remove features with low variance
  • Select to maximize mutual information
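Minimal sketches of both, via sklearn's feature_selection module (toy data mine):

import numpy as np
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                       mutual_info_regression)

x_NF = np.random.rand(100, 5)
x_NF[:, 0] = 1.0                              # constant, zero-variance column
y_N = x_NF[:, 1] + 0.1 * np.random.randn(100)

x_var = VarianceThreshold(threshold=1e-3).fit_transform(x_NF)  # drops column 0
x_mi = SelectKBest(mutual_info_regression, k=2).fit_transform(x_NF, y_N)
print(x_var.shape, x_mi.shape)                # (100, 4) (100, 2)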

SLIDE 44

Missing Data: Imputation

  • https://scikit-learn.org/stable/modules/impute.html#impute
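A minimal sketch with sklearn's SimpleImputer from the page linked above (toy data mine):

import numpy as np
from sklearn.impute import SimpleImputer

x_NF = np.asarray([[1.0, 2.0],
                   [np.nan, 3.0],
                   [7.0, np.nan]])
imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
print(imputer.fit_transform(x_NF))
# column 0: nan -> (1 + 7) / 2 = 4.0 ; column 1: nan -> (2 + 3) / 2 = 2.5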

SLIDE 45

Properties of Good Features


  • Informative
  • Independent
  • Monotonic with predictive probability
  • If monotonic, linear decision boundaries are possible