

SLIDE 1

Regression

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to:
  • Prof. Mike Hughes
  • Erik Sudderth (UCI)
  • Finale Doshi-Velez (Harvard)
  • James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Logistics

  • HW0 due TONIGHT (Wed 1/23 at 11:59pm)
  • HW1 out later tonight, due a week from today
  • What you submit: PDF and zip
  • Next recitation is Mon 1/28
      • Multivariate Calculus review
      • The gory math behind linear regression

Mike Hughes - Tufts COMP 135 - Spring 2019
SLIDE 3

Regression Unit Objectives

  • 3 steps of a regression task
      • Training
      • Prediction
      • Evaluation
  • Metrics
  • Splitting data into train/valid/test
  • A “taste” of 3 Methods
      • Linear Regression
      • K-Nearest Neighbors
      • Decision Tree Regression

SLIDE 4

What will we learn?

[Diagram: overview of Supervised Learning, Unsupervised Learning, and Reinforcement Learning. A supervised task takes data–label pairs $\{x_n, y_n\}_{n=1}^N$ and a performance measure, and proceeds in three steps: Training, Prediction, Evaluation.]

SLIDE 5

Task: Regression

[Diagram: within Supervised Learning (vs. Unsupervised and Reinforcement Learning), the regression task maps features x to a response y, where y is a numeric variable, e.g. sales in $$.]

SLIDE 6

Regression Example: Uber

SLIDE 7

Regression Example: Uber

SLIDE 8

Regression Example: Uber

SLIDE 9

Try it!

What should happen here? What info did you use to make that guess?
SLIDE 10

Regression: Prediction Step

Goal: Predict response y well given features x

  • Input: “features”, “covariates”, “predictors”, “attributes”
    $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$
    Entries can be real-valued, or other numeric types (e.g. integer, binary)
  • Output: “responses”, “labels”
    $\hat{y}(x_i) \in \mathbb{R}$
    A scalar value like 3.1 or -133.7
SLIDE 11

Regression: Prediction Step

>>> # Given: pretrained regression object `model`
>>> # Given: 2D array of features x_NF
>>> x_NF.shape
(N, F)
>>> yhat_N = model.predict(x_NF)
>>> yhat_N.shape
(N,)

SLIDE 12

Regression: Training Step

Goal: Given a labeled dataset, learn a function that can perform prediction well

  • Input: Pairs of features and labels/responses $\{x_n, y_n\}_{n=1}^N$
  • Output: Prediction function $\hat{y}(\cdot) : \mathbb{R}^F \to \mathbb{R}$

SLIDE 13

Regression: Training Step

>>> # Given: 2D array of features x_NF
>>> # Given: 1D array of responses/labels y_N
>>> x_NF.shape
(N, F)
>>> y_N.shape
(N,)
>>> model = RegressionModel()
>>> model.fit(x_NF, y_N)
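A concrete, runnable version of the snippets above, using scikit-learn's LinearRegression as a stand-in for the generic RegressionModel (the toy data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: N=4 examples, F=2 features, following y = 2*x1 + 3*x2 + 1 exactly
x_NF = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_N = 2.0 * x_NF[:, 0] + 3.0 * x_NF[:, 1] + 1.0

model = LinearRegression()   # stand-in for RegressionModel()
model.fit(x_NF, y_N)         # training step

yhat_N = model.predict(x_NF) # prediction step
print(yhat_N.shape)          # (4,), matching the promised (N,) shape
```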

SLIDE 14

Regression: Evaluation Step

Goal: Assess quality of predictions

  • Input: Pairs of predicted and “true” responses $\{\hat{y}(x_n), y_n\}_{n=1}^N$
  • Output: Scalar measure of error/quality
      • Measuring Error: lower is better
      • Measuring Quality: higher is better

SLIDE 15

Visualizing errors

SLIDE 16

Regression: Evaluation Metrics

  • mean squared error: $\frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2$
  • mean absolute error: $\frac{1}{N} \sum_{n=1}^N |y_n - \hat{y}_n|$
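Both metrics can be sketched directly in NumPy (scikit-learn also provides mean_squared_error and mean_absolute_error in sklearn.metrics):

```python
import numpy as np

def mean_squared_error(y_N, yhat_N):
    """Average squared residual; penalizes large errors heavily."""
    return np.mean((y_N - yhat_N) ** 2)

def mean_absolute_error(y_N, yhat_N):
    """Average absolute residual; less sensitive to outliers."""
    return np.mean(np.abs(y_N - yhat_N))

y_N = np.array([1.0, 2.0, 3.0])
yhat_N = np.array([1.0, 2.0, 6.0])       # one large error of 3.0
print(mean_squared_error(y_N, yhat_N))   # (0 + 0 + 9) / 3 = 3.0
print(mean_absolute_error(y_N, yhat_N))  # (0 + 0 + 3) / 3 = 1.0
```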

SLIDE 17

Discuss

  • Which error metric is more sensitive to outliers?
  • Which error metric is the easiest to take derivatives of?

SLIDE 18

Regression: Evaluation Metrics


https://scikit-learn.org/stable/modules/model_evaluation.html

SLIDE 19

How to model y given x?

SLIDE 20

Is the model constant?

SLIDE 21

Is the model linear?

SLIDE 22

Is the model polynomial?

SLIDE 23

Generalize: sample to population

SLIDE 24

Generalize: sample to population

SLIDE 25

Labeled dataset

[Table: columns x and y. Each row represents one example. Assume rows are arranged “uniformly at random” (order doesn’t matter).]

SLIDE 26

Split into train and test

[Diagram: the labeled table of (x, y) rows is partitioned into a train portion and a test portion.]
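One way to sketch this split, using scikit-learn's train_test_split (the 70/30 fraction and random_state are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x_NF = np.arange(20, dtype=float).reshape(10, 2)  # N=10 examples, F=2 features
y_N = x_NF[:, 0] + x_NF[:, 1]

# Hold out 30% of rows for testing; shuffling means row order doesn't matter
x_tr, x_te, y_tr, y_te = train_test_split(
    x_NF, y_N, test_size=0.3, random_state=0)

print(x_tr.shape, x_te.shape)  # (7, 2) (3, 2)
```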

SLIDE 27

Model Complexity vs Error

[Plot: error vs. model complexity, with underfitting and overfitting regimes labeled.]

SLIDE 28

How to fit best model?

Option 1: Fit on train, select on test
1) Fit each model to training data
2) Evaluate each model on test data
3) Select model with lowest test error

SLIDE 29

How to fit best model?

Option 1: Fit on train, select on test — Avoid!

Problems:
  • Fitting procedure used test data
  • Not a fair assessment of how the model will do on unseen data

SLIDE 30

How to fit best model?

Option 2: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set

SLIDE 31

How to fit best model?

Option 2: Fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select model with lowest validation error
4) Report error on test set

Concerns:
  • Will train be too small?
  • Can we make better use of the data?
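The fit-on-train, select-on-validation protocol can be sketched as follows (the candidate K values, split fractions, and toy data are all illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x_NF = rng.uniform(-1, 1, size=(200, 1))
y_N = np.sin(3 * x_NF[:, 0]) + 0.1 * rng.randn(200)

# Split: 60% train, 20% validation, 20% test
x_tr, x_rest, y_tr, y_rest = train_test_split(x_NF, y_N, test_size=0.4, random_state=0)
x_va, x_te, y_va, y_te = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)

# 1-2) Fit each candidate model on train, evaluate on validation
candidates = {k: KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
              for k in [1, 5, 25]}
val_err = {k: mean_squared_error(y_va, m.predict(x_va))
           for k, m in candidates.items()}

# 3) Select the model with lowest validation error
best_k = min(val_err, key=val_err.get)

# 4) Report error on the untouched test set
test_err = mean_squared_error(y_te, candidates[best_k].predict(x_te))
print(best_k, round(test_err, 3))
```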
SLIDE 32

Linear Regression

Parameters:
  weight vector $w = [w_1, w_2, \ldots, w_f, \ldots, w_F]$
  bias scalar $b$

Prediction:
  $\hat{y}(x_i) \triangleq \sum_{f=1}^F w_f x_{if} + b$

Training: find weights and bias that minimize error
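The prediction rule can be sketched in NumPy (the weight, bias, and input values are illustrative):

```python
import numpy as np

def predict(x_NF, w_F, b):
    """Linear regression prediction: yhat_i = sum_f w_f * x_if + b."""
    return x_NF @ w_F + b

w_F = np.array([2.0, -1.0])
b = 0.5
x_NF = np.array([[1.0, 1.0], [0.0, 3.0]])
print(predict(x_NF, w_F, b))  # 2*1 - 1*1 + 0.5 = 1.5; 2*0 - 1*3 + 0.5 = -2.5
```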

SLIDE 33

Sales vs. Ad Budgets

SLIDE 34

Linear Regression: Training

Optimization problem: “Least Squares”

$\min_{w,b} \sum_{n=1}^N \left( y_n - \hat{y}(x_n, w, b) \right)^2$

SLIDE 35

Linear Regression: Training

Optimization problem: “Least Squares”

$\min_{w,b} \sum_{n=1}^N \left( y_n - \hat{y}(x_n, w, b) \right)^2$

Exact formulas for the optimal values of w, b exist! With only one feature (F=1):

$w = \frac{\sum_{n=1}^N (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^N (x_n - \bar{x})^2}$

$b = \bar{y} - w \bar{x}$

where $\bar{x} = \text{mean}(x_1, \ldots, x_N)$ and $\bar{y} = \text{mean}(y_1, \ldots, y_N)$.

We will derive these in next class
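A minimal NumPy sketch of the F=1 formulas (the toy data is illustrative; it follows y = 2x + 1 exactly, so the fit should recover w = 2, b = 1):

```python
import numpy as np

def fit_least_squares_1d(x_N, y_N):
    """Closed-form least squares for a single feature: returns (w, b)."""
    xbar, ybar = x_N.mean(), y_N.mean()
    w = np.sum((x_N - xbar) * (y_N - ybar)) / np.sum((x_N - xbar) ** 2)
    b = ybar - w * xbar
    return w, b

x_N = np.array([0.0, 1.0, 2.0, 3.0])
y_N = 2.0 * x_N + 1.0
w, b = fit_least_squares_1d(x_N, y_N)
print(w, b)  # 2.0 1.0
```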

SLIDE 36

Linear Regression: Training

Optimization problem: “Least Squares”

$\min_{w,b} \sum_{n=1}^N \left( y_n - \hat{y}(x_n, w, b) \right)^2$

Exact formulas for the optimal values of w, b exist! With many features (F >= 1):

$[w_1 \ldots w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$

$\tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}$

We will derive these in next class
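A NumPy sketch of the closed form for general F (solving the normal equations with np.linalg.solve rather than forming the inverse explicitly; the toy data is illustrative):

```python
import numpy as np

def fit_least_squares(x_NF, y_N):
    """Solve (Xt^T Xt) theta = Xt^T y, where Xt appends a column of ones."""
    N = x_NF.shape[0]
    xtilde = np.hstack([x_NF, np.ones((N, 1))])  # Xt : shape (N, F+1)
    theta = np.linalg.solve(xtilde.T @ xtilde, xtilde.T @ y_N)
    return theta[:-1], theta[-1]  # (weight vector w, bias b)

# Toy data following y = 2*x1 + 3*x2 + 1 exactly
x_NF = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_N = 2.0 * x_NF[:, 0] + 3.0 * x_NF[:, 1] + 1.0
w_F, b = fit_least_squares(x_NF, y_N)
print(w_F, b)  # w close to [2, 3], b close to 1
```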

SLIDE 37

Nearest Neighbor Regression


Parameters: none

Prediction:
  • find “nearest” training vector to given input x
  • predict y value of this neighbor

Training: none needed (use training data as lookup table)
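A minimal 1-nearest-neighbor sketch in NumPy (Euclidean distance is an assumption here; other metrics are possible):

```python
import numpy as np

def predict_1nn(x_train_NF, y_train_N, x_F):
    """Return the y value of the single nearest training vector."""
    dists = np.sqrt(np.sum((x_train_NF - x_F) ** 2, axis=1))
    return y_train_N[np.argmin(dists)]

x_train_NF = np.array([[0.0], [1.0], [2.0]])
y_train_N = np.array([10.0, 20.0, 30.0])
print(predict_1nn(x_train_NF, y_train_N, np.array([0.9])))  # 20.0 (nearest is x=1)
```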

SLIDE 38

Distance metrics

  • Euclidean: $\text{dist}(x, x') = \sqrt{\sum_{f=1}^F (x_f - x'_f)^2}$
  • Manhattan: $\text{dist}(x, x') = \sum_{f=1}^F |x_f - x'_f|$
  • Many others are possible
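Both distance metrics sketched in NumPy (the example vectors are illustrative):

```python
import numpy as np

def euclidean_dist(x_F, xprime_F):
    """Square root of summed squared per-feature differences."""
    return np.sqrt(np.sum((x_F - xprime_F) ** 2))

def manhattan_dist(x_F, xprime_F):
    """Sum of absolute per-feature differences."""
    return np.sum(np.abs(x_F - xprime_F))

x_F = np.array([0.0, 0.0])
xprime_F = np.array([3.0, 4.0])
print(euclidean_dist(x_F, xprime_F))  # sqrt(9 + 16) = 5.0
print(manhattan_dist(x_F, xprime_F))  # 3 + 4 = 7.0
```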

SLIDE 39

Nearest Neighbor “Prediction functions” are piecewise constant

SLIDE 40

K nearest neighbor regression


Parameters: K (number of neighbors)

Prediction:
  • find K “nearest” training vectors to input x
  • predict average y of this neighborhood

Training: none needed (use training data as lookup table)
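A sketch using scikit-learn's KNeighborsRegressor (K=2 and the toy data are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

x_train_NF = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train_N = np.array([10.0, 20.0, 30.0, 40.0])

model = KNeighborsRegressor(n_neighbors=2)  # K = 2
model.fit(x_train_NF, y_train_N)            # just stores the lookup table

# Query at x = 0.9: the 2 nearest training vectors are x=1 and x=0,
# so the prediction is the neighborhood average mean(20, 10) = 15
print(model.predict(np.array([[0.9]])))  # [15.]
```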

SLIDE 41

Error vs Model Complexity

Credit: Fig 2.4 ESL textbook
SLIDE 42

Salary prediction for Hitters data

SLIDE 43

SLIDE 44

Decision Tree Regression

SLIDE 45


Decision tree regression

Parameters:

  • at each internal node: x variable id and threshold
  • at each leaf: scalar y value to predict

Prediction assumption:

  • x space is divided into rectangular regions
  • y is similar within “region”

Training assumption:

  • minimize error on training set
  • often, use greedy heuristics
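A sketch with scikit-learn's DecisionTreeRegressor (max_depth=1 is an illustrative choice: one internal node with a feature threshold, and two leaves each predicting a scalar):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Step-shaped toy data: y jumps from 0 to 10 at x = 2
x_NF = np.array([[0.0], [1.0], [2.0], [3.0]])
y_N = np.array([0.0, 0.0, 10.0, 10.0])

model = DecisionTreeRegressor(max_depth=1)  # a single split ("stump")
model.fit(x_NF, y_N)

# The learned tree thresholds near x = 1.5 and predicts the mean y per region
print(model.predict(np.array([[0.5], [2.5]])))  # leaf means: 0 and 10
```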
SLIDE 46

Ideal Training for Decision Tree

$\min_{R_1, \ldots, R_J} \sum_{j=1}^J \sum_{n : x_n \in R_j} (y_n - \hat{y}_{R_j})^2$

Search space is too big! Hard to solve exactly…

SLIDE 47

Greedy Top-Down Training

Given a big region, find the best binary split (feature j, threshold s) into two subregions:

$\min_{j,\, s,\, \hat{y}_{R_1},\, \hat{y}_{R_2}} \;\; \sum_{n : x_{nj} \le s} (y_n - \hat{y}_{R_1})^2 + \sum_{n : x_{nj} > s} (y_n - \hat{y}_{R_2})^2$

Stop when:
  • number of examples assigned to a leaf is too small
  • maximum depth is exceeded
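One greedy split search for a single feature can be sketched by brute force over candidate thresholds (best_split_1d is a hypothetical helper, not the lecture's implementation; x is assumed sorted):

```python
import numpy as np

def best_split_1d(x_N, y_N):
    """Find threshold s minimizing total squared error to the two region means."""
    best_err, best_s = np.inf, None
    for s in (x_N[:-1] + x_N[1:]) / 2.0:   # midpoints as candidate thresholds
        left, right = y_N[x_N <= s], y_N[x_N > s]
        err = (np.sum((left - left.mean()) ** 2)
               + np.sum((right - right.mean()) ** 2))
        if err < best_err:
            best_err, best_s = err, s
    return best_err, best_s

x_N = np.array([0.0, 1.0, 2.0, 3.0])
y_N = np.array([0.0, 0.0, 10.0, 10.0])
print(best_split_1d(x_N, y_N))  # best threshold 1.5 achieves zero error
```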

SLIDE 48

Greedy Tree for Hitters Data

SLIDE 49

Summary of Methods

Method                         | Function class flexibility      | Knobs to tune                                             | Interpret?
Linear Regression              | Linear                          | Include bias? Penalize weights (more next week)           | Inspect weights
Decision Tree Regression       | Axis-aligned piecewise constant | Max. depth, min. leaf size, goal criteria                 | Inspect tree
K Nearest Neighbors Regression | Piecewise constant              | Number of neighbors, distance metric, how neighbors vote  | Inspect neighbors

SLIDE 50

Discuss:

We are studying data with ~5 features. Which method’s predictions change most across versions of the feature representation?

  • Version A: [x1 x2 x3 x4 x5]
  • Version B: [10·x1 x2 x3 x4 x5]

SLIDE 51

Discuss:

We are studying data with ~3 features. Which method’s predictions change most across versions of the feature representation?

  • Version A: [x1 x2]
  • Version B: [x1 x2 noise noise noise]

SLIDE 52

Regression Unit Objectives

  • 3 steps of a regression task
      • Training
      • Prediction
      • Evaluation
  • Metrics
  • Splitting into train/valid/test
  • “Taste” of 3 Methods
      • Linear Regression
      • K-Nearest Neighbors
      • Decision Tree Regression
  • Chosen performance metric should be integrated at training
  • Mean squared error is “easy”, but not always the right thing to do