

SLIDE 1

Feature Engineering


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)

x → transform → φ(x)

SLIDE 2

Logistics

  • Project 1 is out! (due in two weeks)
  • Start early! Work required is about 2 HWs
  • HW4 will be out next Wed
  • Due two weeks later (1 week after the project)
  • More time to learn the required material
  • Class TOMORROW 3pm (Tufts runs a Monday schedule this Thursday)

SLIDE 3

Objectives Today: Feature Engineering

Concept check-in:
  • How should I preprocess my features?
  • How can I select a subset of important features?
  • What to do if features are missing?

SLIDE 4

Check-in Q1: logsumexp


  • What scalar value should these calls produce?
  • What happens instead on a real computer?
  • What is the fix?

SLIDE 5

logsumexp explained


$$\begin{aligned}
\mathrm{logsumexp}([-100, -97, -101]) &= \log(e^{-100} + e^{-97} + e^{-101}) \\
&= \log\!\big(e^{-97}(e^{-3} + e^{0} + e^{-4})\big) \\
&= -97 + \log\underbrace{(e^{-3} + e^{0} + e^{-4})}_{1 \,\le\, \text{sum} \,\le\, 3}
\end{aligned}$$

Factor out the MAX of −97.
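A minimal NumPy sketch of the trick (the toy values mirror the slide; scipy.special.logsumexp is the library version):

import numpy as np

def logsumexp(scores):
    # Naive np.log(np.sum(np.exp(scores))) underflows to log(0) = -inf
    # for scores like [-100, -97, -101]; factoring out the max fixes it.
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))  # shifted terms lie in (0, 1]

print(logsumexp(np.asarray([-100.0, -97.0, -101.0])))  # approx -96.93, not -inf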

SLIDE 6

Check-in Q2: Gradient steps


  • How can I diagnose step size choices?
  • What are three ways to improve step size selection?

SLIDE 7

Check-in Q2: Gradient steps


How can I diagnose step size choices?
  • Trace plots of loss, gradient norm, and parameters
  • Explore like "Goldilocks": find one step size too small and one too big

What are three ways to improve step size selection?
  • Use a decaying step size
  • Use line search to find a step size that reduces the loss
  • Use second-order methods (Newton, L-BFGS)
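A minimal sketch of the first fix, a step size that decays as 1/t (the function names and toy objective are illustrative, not from the slides):

import numpy as np

def gd_with_decay(calc_grad, w_init, base_step=0.5, n_steps=100):
    # Step size shrinks as base_step / t: early steps explore,
    # later steps settle. One of the three fixes listed above.
    w = np.asarray(w_init, dtype=np.float64)
    for t in range(1, n_steps + 1):
        w = w - (base_step / t) * calc_grad(w)
    return w

# Toy check on f(w) = 0.5 * ||w||^2, whose gradient is w itself:
print(gd_with_decay(lambda w: w, [5.0, -3.0]))  # slowly approaches [0, 0]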

SLIDE 8

What will we learn?


[Diagram: the three paradigms (supervised, unsupervised, reinforcement learning). Supervised learning: a task mapping data $x$ to label $y$, data-label pairs $\{x_n, y_n\}_{n=1}^N$, and a performance measure; the stages are training, prediction, and evaluation.]

SLIDE 9


Transformations of Features

SLIDE 10

Fitting a line isn’t always ideal

SLIDE 11

Can fit linear functions to nonlinear features


$$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$

A nonlinear function of $x$ can be written as a linear function of $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]$:

$$\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \, \phi_g(x_i) = \theta^T \phi(x_i)$$

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data.
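A minimal sklearn sketch of this idea (the noise-free cubic target is a toy of mine; PolynomialFeatures builds exactly the φ above):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x_N1 = np.linspace(-2, 2, 50).reshape(-1, 1)
y_N = 1.0 - 2.0 * x_N1[:, 0] + 0.5 * x_N1[:, 0] ** 3   # toy cubic target

phi_NG = PolynomialFeatures(degree=3).fit_transform(x_N1)       # [1, x, x^2, x^3]
model = LinearRegression(fit_intercept=False).fit(phi_NG, y_N)  # bias lives in phi
print(model.coef_)  # approx [1.0, -2.0, 0.0, 0.5]: linear in theta, cubic in x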

SLIDE 12

What feature transform to use?

  • Anything that works for your data!
  • sin / cos for periodic data
  • polynomials for high-order dependencies
  • interactions between feature dimensions
  • Many other choices possible


$\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \ldots]$ (polynomials)

$\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \ldots]$ (interactions)
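A hand-rolled sketch of such transforms (my illustrative example; sklearn's PolynomialFeatures(interaction_only=True) also builds pairwise interactions):

import numpy as np

def make_feats(x_NF):
    # Bias, periodic transforms of column 0, and an interaction of columns 0, 1.
    return np.column_stack([
        np.ones(x_NF.shape[0]),    # constant "1" feature
        np.sin(x_NF[:, 0]),        # sin / cos for periodic data
        np.cos(x_NF[:, 0]),
        x_NF[:, 0] * x_NF[:, 1],   # interaction between feature dimensions
    ])

print(make_feats(np.random.rand(5, 2)).shape)  # (5, 4)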

SLIDE 13

Standard Pipeline


[Diagram: the standard pipeline. Task: data $x$ to label $y$; data-label pairs $\{x_n, y_n\}_{n=1}^N$; a performance measure.]

SLIDE 14

Feature Transform Pipeline


[Diagram: the same pipeline with a feature transform $\phi(x)$ inserted, so the data-label pairs $\{x_n, y_n\}_{n=1}^N$ become feature-label pairs $\{\phi(x_n), y_n\}_{n=1}^N$ before training.]

SLIDE 15

What features to use here?

SLIDE 16

Reasons for Feature Transform

  • Improve prediction quality
  • Improve interpretability
  • Reduce computational costs
  • Fewer features means fewer parameters
  • Improve numerical performance of training

SLIDE 17

Recall from HW2: Polynomial Features

SLIDE 18

Error vs. Degree (orig. poly.)

SLIDE 19

Error vs. Degree (rescaled poly)

SLIDE 20

Weight histograms (orig. poly.)

SLIDE 21

Weight histograms (rescaled poly.)

SLIDE 22

Scikit-Learn Transformer API


# Construct a "transformer"
>>> t = Transformer()

# Train any parameters needed
>>> t.fit(x_NF)  # y optional, often unused

# Apply to extract new features
>>> feat_NG = t.transform(x_NF)

SLIDE 23

Example 1: Sum of features

from sklearn.base import TransformerMixin
import numpy as np

class SumFeatureExtractor(TransformerMixin):
    """ Extracts *sum* of feature vector as new feat """
    def __init__(self):
        pass
    def fit(self, x_NF):
        return self
    def transform(self, x_NF):
        return np.sum(x_NF, axis=1)[:, np.newaxis]
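A quick check of the class above on a toy array (this usage example is mine, not from the slides):

x_NF = np.asarray([[1., 2., 3.],
                   [4., 5., 6.]])
t = SumFeatureExtractor()
feat_N1 = t.fit(x_NF).transform(x_NF)  # fit returns self, so the calls chain
print(feat_N1)  # [[ 6.] [15.]] -- shape (N, 1), one new feature per example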

SLIDE 24

Example 2: Square features

from sklearn.base import TransformerMixin

class SquareFeatureExtractor(TransformerMixin):
    """ Extracts *square* of feature vector as new feat """
    def fit(self, x_NF):
        return self
    def transform(self, x_NF):
        pass  # TODO

SLIDE 25

Example 2: Square features

from sklearn.base import TransformerMixin
import numpy as np

class SquareFeatureExtractor(TransformerMixin):
    """ Extracts *square* of feature vector as new feat """
    def fit(self, x_NF):
        return self
    def transform(self, x_NF):
        return np.square(x_NF)  # OR: return np.power(x_NF, 2)

SLIDE 26

SLIDE 27

Feature Rescaling

Input: Each numeric feature has arbitrary min/max

  • Some in [0, 1], some in [−5, 5], some in [−3333, −2222]

Transformed feature vector

  • Set each feature value f to have [0, 1] range
  • min_f = minimum observed in training set
  • max_f = maximum observed in training set


$$\phi(x_n)_f = \frac{x_{nf} - \min_f}{\max_f - \min_f}$$

SLIDE 28

Example 3: Rescaling features

from sklearn.base import TransformerMixin

class MinMaxScaleFeatureExtractor(TransformerMixin):
    """ Rescales features between 0 and 1 """
    def fit(self, x_NF):
        self.min_F = None  # TODO
        self.max_F = None  # TODO
    def transform(self, x_NF):
        pass  # TODO

SLIDE 29

Example 3: Rescaling features

from sklearn.base import TransformerMixin
import numpy as np

class MinMaxFeatureRescaler(TransformerMixin):
    """ Rescales each feature column to be within [0, 1].
    Uses training-data min/max. """
    def fit(self, x_NF):
        self.min_1F = np.min(x_NF, axis=0, keepdims=1)
        self.max_1F = np.max(x_NF, axis=0, keepdims=1)
        return self  # added: fit should return self so calls can chain
    def transform(self, x_NF):
        feat_NF = ((x_NF - self.min_1F)
                   / (self.max_1F - self.min_1F))
        return feat_NF
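For comparison, sklearn ships the same transform as sklearn.preprocessing.MinMaxScaler; a quick sanity check on toy data of mine:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_NF = np.asarray([[1., 10.],
                   [3., 20.],
                   [2., 30.]])
print(MinMaxScaler().fit_transform(x_NF))
# Each column now spans [0, 1]:
# [[0.   0. ]
#  [1.   0.5]
#  [0.5  1. ]]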

SLIDE 30

Feature Standardization

Input: each feature is numeric, with arbitrary scale.

Transformed feature vector:
  • Set each feature value f to have zero mean, unit variance

$$\phi(x_n)_f = \frac{x_{nf} - \mu_f}{\sigma_f}$$

where $\mu_f$ is the empirical mean and $\sigma_f$ the empirical standard deviation observed in the training set.

SLIDE 31

Feature Standardization

  • Treats each feature as "Normal(0, 1)"
  • Typical range will be −3 to +3 (if original data is approximately normal)
  • Also called the z-score transform


$$\phi(x_n)_f = \frac{x_{nf} - \mu_f}{\sigma_f}$$
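A minimal sketch with sklearn's built-in sklearn.preprocessing.StandardScaler (the toy data is mine):

import numpy as np
from sklearn.preprocessing import StandardScaler

x_NF = np.asarray([[1., 100.],
                   [2., 200.],
                   [3., 300.]])
scaler = StandardScaler().fit(x_NF)         # learns mu_f, sigma_f per column
z_NF = scaler.transform(x_NF)
print(z_NF.mean(axis=0), z_NF.std(axis=0))  # approx [0. 0.] and [1. 1.]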

SLIDE 32

Feature Scaling with Outliers

  • What happens to standard scaling when training data has outliers?
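One hedged answer sketch: a single extreme value inflates $\mu_f$ and $\sigma_f$, so the inliers get squashed together; a median/IQR scaler such as sklearn's RobustScaler resists this (toy data mine):

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x_N1 = np.asarray([[1.], [2.], [3.], [4.], [5.], [1000.]])  # one big outlier
print(StandardScaler().fit_transform(x_N1).ravel())
# inliers clump within ~0.01 of each other near -0.45
print(RobustScaler().fit_transform(x_N1).ravel())
# median/IQR scaling keeps the inliers spread out (about -1.0 to 0.6)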

SLIDE 33

Feature Scaling with Outliers

SLIDE 34

Combining several transformers
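The slide's figure is not recoverable here; as a hedged sketch of the idea, sklearn's Pipeline chains transformers end to end (FeatureUnion runs them side by side):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

pipe = Pipeline([
    ("rescale", MinMaxScaler()),             # map each feature to [0, 1] first
    ("poly", PolynomialFeatures(degree=2)),  # then expand polynomial features
    ("regress", LinearRegression()),         # final step is the predictor
])
x_NF = np.random.rand(30, 2)
y_N = x_NF[:, 0] + x_NF[:, 1] ** 2
pipe.fit(x_NF, y_N)                          # fit/transform cascade automatically
print(pipe.predict(x_NF[:3]))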

SLIDE 35

Categorical Features

Numerical encoding


["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"] "uses Firefox” à 1 “uses Safari” à 3

SLIDE 36

Categorical Features

One-hot vector


["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]

[ 0 0 1 0 ] [ 1 0 0 0 ]

Firefox Chrome Safari Internet Explorer "uses Firefox” “uses Safari”
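A sketch with sklearn's OneHotEncoder reproducing the vectors above (the explicit category order is mine; sparse=False matches the 2019-era API, later renamed sparse_output):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

browsers = np.asarray([["uses Firefox"], ["uses Safari"]])
enc = OneHotEncoder(categories=[["uses Firefox", "uses Chrome",
                                 "uses Safari", "uses Internet Explorer"]],
                    sparse=False)  # dense output
print(enc.fit_transform(browsers))
# [[1. 0. 0. 0.]   "uses Firefox"
#  [0. 0. 1. 0.]]  "uses Safari"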

SLIDE 37

Feature Selection or “Pruning”

SLIDE 38

Best Subset Selection

SLIDE 39

Problem: Too many subsets! (With F features there are $2^F$ candidate subsets to evaluate.)

SLIDE 40

Forward Stepwise Selection


Start with the zero-feature model (guess the mean); store as M_0.
Add the best-scoring single feature (searching among all F); store as M_1.
For each size k = 2, ..., F:
    Try each possible not-yet-included feature (there are F − k + 1).
    Add the best-scoring one to M_{k−1}; store as M_k.
Pick the best among M_0, M_1, ..., M_F on a validation set.
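A hedged Python sketch of this loop (function and variable names are mine; cross-validated R² stands in for the slide's unspecified score):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_stepwise(x_NF, y_N):
    # Greedily grow the feature set; record the best model at each size k.
    F = x_NF.shape[1]
    included, models = [], []
    for k in range(1, F + 1):
        remaining = [f for f in range(F) if f not in included]
        scores = [cross_val_score(LinearRegression(),
                                  x_NF[:, included + [f]], y_N).mean()
                  for f in remaining]
        included.append(remaining[int(np.argmax(scores))])
        models.append((list(included), max(scores)))  # this is M_k
    # Pick the best size on held-out score (M_0, the mean-only model, omitted).
    return max(models, key=lambda m: m[1])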

SLIDE 41

Best vs Forward Stepwise


It is easy to find cases where forward stepwise's greedy approach does not deliver the best possible subset.

SLIDE 42

Backwards Stepwise Selection

Start with all features. At each step, test all models with one feature removed and drop the feature whose removal scores best. Repeat.

SLIDE 43

Other Feature Selection Methods

  • Remove features with low variance
  • Select to maximize mutual information
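Minimal sketches of both, via sklearn's feature_selection module (toy data mine):

import numpy as np
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                       mutual_info_regression)

x_NF = np.random.rand(100, 5)
x_NF[:, 0] = 1.0                              # constant, zero-variance column
y_N = x_NF[:, 1] + 0.1 * np.random.randn(100)

x_var = VarianceThreshold(threshold=1e-3).fit_transform(x_NF)  # drops column 0
x_mi = SelectKBest(mutual_info_regression, k=2).fit_transform(x_NF, y_N)
print(x_var.shape, x_mi.shape)                # (100, 4) (100, 2)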

SLIDE 44

Missing Data: Imputation

  • https://scikit-learn.org/stable/modules/impute.html#impute
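A minimal sketch with sklearn's SimpleImputer from the page linked above (toy data mine):

import numpy as np
from sklearn.impute import SimpleImputer

x_NF = np.asarray([[1.0, 2.0],
                   [np.nan, 3.0],
                   [7.0, np.nan]])
imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
print(imputer.fit_transform(x_NF))
# column 0: nan -> (1 + 7) / 2 = 4.0 ; column 1: nan -> (2 + 3) / 2 = 2.5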

SLIDE 45

Properties of Good Features


  • Informative
  • Independent
  • Monotonic with predictive probability
  • If monotonic, linear decision boundaries are possible