Model Selection: Matt Gormley, Lecture 4, January 29, 2018



slide-1
SLIDE 1

Model Selection

10-601 Introduction to Machine Learning

Matt Gormley Lecture 4 January 29, 2018

Machine Learning Department School of Computer Science Carnegie Mellon University

slide-2
SLIDE 2

Q&A

Q: How do we deal with ties in k-Nearest Neighbors (e.g. even k or equidistant points)?

A: I would ask you all for a good solution!

Q: How do we define a distance function when the features are categorical (e.g. weather takes values {sunny, rainy, overcast})?

A: Step 1: Convert from categorical attributes to numeric features (e.g. binary)
   Step 2: Select an appropriate distance function (e.g. Hamming distance)
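
The two steps in the last answer can be sketched in Python. This is only an illustration under stated assumptions: the helper names `one_hot` and `hamming` and the ordering of the `WEATHER` list are made up for this sketch and do not come from the lecture.

```python
def one_hot(value, categories):
    """Step 1: encode a categorical value as a binary indicator vector."""
    return [1 if value == c else 0 for c in categories]


def hamming(u, v):
    """Step 2: Hamming distance = number of positions where u and v differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


WEATHER = ["sunny", "rainy", "overcast"]  # category order is an arbitrary choice

sunny = one_hot("sunny", WEATHER)  # [1, 0, 0]
rainy = one_hot("rainy", WEATHER)  # [0, 1, 0]

print(hamming(sunny, rainy))  # 2: two indicator bits differ
print(hamming(sunny, sunny))  # 0: identical values
```

With this encoding any two distinct weather values are at distance 2, so no category is treated as closer to one value than to another.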

slide-3
SLIDE 3

Reminders

  • Homework 2: Decision Trees
    – Out: Wed, Jan 24
    – Due: Mon, Feb 5 at 11:59pm

  • 10601 Notation Crib Sheet

slide-4
SLIDE 4

K-NEAREST NEIGHBORS

slide-5
SLIDE 5

k-Nearest Neighbors

Chalkboard:

– KNN for binary classification
– Distance functions
– Efficiency of KNN
– Inductive bias of KNN
– KNN Properties

slide-6
SLIDE 6

KNN ON FISHER IRIS DATA

slide-7
SLIDE 7

Fisher Iris Dataset

Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris versicolor (1), Iris virginica (2), collected by Anderson (1936)

Full dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

Species  Sepal Length  Sepal Width  Petal Length  Petal Width
0        4.3           3.0          1.1           0.1
0        4.9           3.6          1.4           0.1
0        5.3           3.7          1.5           0.2
1        4.9           2.4          3.3           1.0
1        5.7           2.8          4.1           1.3
1        6.3           3.3          4.7           1.6
1        6.7           3.0          5.0           1.7

slide-8
SLIDE 8

Fisher Iris Dataset

Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris versicolor (1), Iris virginica (2), collected by Anderson (1936)

Full dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

Species  Sepal Length  Sepal Width
0        4.3           3.0
0        4.9           3.6
0        5.3           3.7
1        4.9           2.4
1        5.7           2.8
1        6.3           3.3
1        6.7           3.0

Deleted two of the four features, so that input space is 2D
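
To make the 2D setting concrete, here is a minimal k-NN sketch on the seven rows shown on the slide. It is an illustration, not the lecture's code: the function name `knn_predict`, the query point, and the tie-breaking rule are assumptions, and the label 0 for the first three rows is inferred from the row order.

```python
import math
from collections import Counter

# (sepal length, sepal width) -> species label, from the seven rows on the slide.
# Assumption: the first three rows belong to species 0.
TRAIN = [
    ((4.3, 3.0), 0),
    ((4.9, 3.6), 0),
    ((5.3, 3.7), 0),
    ((4.9, 2.4), 1),
    ((5.7, 2.8), 1),
    ((6.3, 3.3), 1),
    ((6.7, 3.0), 1),
]


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (Euclidean)."""
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in by_distance[:k])
    # Assumed tie-break: among labels with equal counts, the one whose
    # nearest member is closest to x wins (Counter keeps insertion order).
    return votes.most_common(1)[0][0]


query = (5.0, 3.5)
print(knn_predict(TRAIN, query, k=1))           # nearest neighbor only
print(knn_predict(TRAIN, query, k=len(TRAIN)))  # every point votes
```

Setting k = 1 gives the nearest-neighbor special case, while setting k to the size of the training set reduces the classifier to a majority vote over all labels, regardless of the query.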

slide-9
SLIDES 9-33

KNN on Fisher Iris Data

[Figures not recovered: 25 slides of plots sharing the title above. Slides 10 and 13 carry the label "Special Case: Nearest Neighbor"; slides 11 and 33 carry the label "Special Case: Majority Vote".]

slide-34
SLIDE 34

KNN ON GAUSSIAN DATA

slide-35
SLIDES 35-59

KNN on Gaussian Data

[Figures not recovered: 25 slides of plots sharing the title above.]

slide-60
SLIDE 60

K-NEAREST NEIGHBORS

slide-61
SLIDE 61

Questions

  • How could k-Nearest Neighbors (KNN) be applied to regression?
  • Can we do better than majority vote? (e.g. distance-weighted KNN)
  • Where does the Cover & Hart (1967) Bayes error rate bound come from?
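
One possible answer to the first two questions, sketched in Python. These are common textbook variants rather than the lecture's own solutions: the function names, the inverse-distance weighting `1 / (d + eps)`, and the toy data are all assumptions of this sketch.

```python
import math
from collections import defaultdict


def k_nearest(train, x, k):
    """Return the k (distance, target) pairs closest to x (Euclidean)."""
    scored = ((math.dist(xi, x), yi) for xi, yi in train)
    return sorted(scored, key=lambda pair: pair[0])[:k]


def knn_regress(train, x, k):
    """k-NN regression: average the targets of the k nearest neighbors."""
    neighbors = k_nearest(train, x, k)
    return sum(y for _, y in neighbors) / len(neighbors)


def weighted_knn_classify(train, x, k, eps=1e-12):
    """Distance-weighted vote: each neighbor votes with weight 1 / (d + eps)."""
    scores = defaultdict(float)
    for d, y in k_nearest(train, x, k):
        scores[y] += 1.0 / (d + eps)
    return max(scores, key=scores.get)


# Toy 1D regression data: the target equals the input.
reg_data = [((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0), ((10.0,), 10.0)]
print(knn_regress(reg_data, (2.0,), k=3))  # mean of targets 2.0, 1.0, 3.0

# Toy 1D classification data: one close "a", two farther "b" points.
cls_data = [((0.0,), "a"), ((3.0,), "b"), ((4.0,), "b")]
print(weighted_knn_classify(cls_data, (1.0,), k=3))
```

In the classification example a plain majority vote with k = 3 would pick "b" (two votes to one), whereas the weighted vote picks "a" because the single "a" point is much closer to the query.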

slide-62
SLIDE 62

KNN Learning Objectives

You should be able to…

  • Describe a dataset as points in a high dimensional space [CIML]
  • Implement k-Nearest Neighbors with O(N) prediction
  • Describe the inductive bias of a k-NN classifier and relate it to feature scale [à la CIML]
  • Sketch the decision boundary for a learning algorithm (compare k-NN and DT)
  • State Cover & Hart (1967)'s large sample analysis of a nearest neighbor classifier
  • Invent "new" k-NN learning algorithms capable of dealing with even k
  • Explain computational and geometric examples of the curse of dimensionality
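
For reference, the Cover & Hart (1967) large-sample result is usually stated as below. The statement is taken from the standard literature, not from the slide, and the symbols are notation introduced here: R^* is the Bayes error rate, R_NN is the error rate of the 1-nearest-neighbor classifier in the limit of infinite training data, and M is the number of classes.

```latex
% As the number of training examples N \to \infty:
R^{*} \;\le\; R_{\mathrm{NN}} \;\le\; R^{*}\left(2 - \frac{M}{M-1}\,R^{*}\right) \;\le\; 2R^{*}
% For binary classification (M = 2) the middle bound is 2R^{*}(1 - R^{*}).
```

Informally, with enough data the nearest neighbor rule makes at most twice as many errors as the best possible classifier.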

slide-63
SLIDE 63

k-Nearest Neighbors

But how do we choose k?

slide-64
SLIDE 64

MODEL SELECTION

slide-65
SLIDE 65

Model Selection

WARNING:

  • In some sense, our discussion of model selection is premature.
  • The models we have considered thus far are fairly simple.
  • The models and the many decisions available to the data scientist wielding them will grow to be much more complex than what we’ve seen so far.

slide-66
SLIDE 66

Model Selection

Statistics

  • Def: a model defines the data generation process (i.e. a set or family of parametric probability distributions)
  • Def: model parameters are the values that give rise to a particular probability distribution in the model family
  • Def: learning (aka. estimation) is the process of finding the parameters that best fit the data
  • Def: hyperparameters are the parameters of a prior distribution over parameters

Machine Learning

  • Def: (loosely) a model defines the hypothesis space over which learning performs its search
  • Def: model parameters are the numeric values or structure selected by the learning algorithm that give rise to a hypothesis
  • Def: the learning algorithm defines the data-driven search over the hypothesis space (i.e. search for good parameters)
  • Def: hyperparameters are the tunable aspects of the model that the learning algorithm does not select

slide-67
SLIDE 67

Model Selection

Example: Decision Tree

  • model = set of all possible trees, possibly restricted by some hyperparameters (e.g. max depth)
  • parameters = structure of a specific decision tree
  • learning algorithm = ID3, CART, etc.
  • hyperparameters = max-depth, threshold for splitting criterion, etc.

slide-68
SLIDE 68

Model Selection

Example: k-Nearest Neighbors

  • model = set of all possible nearest neighbors classifiers
  • parameters = none (KNN is an instance-based or non-parametric method)
  • learning algorithm = for naïve setting, just storing the data
  • hyperparameters = k, the number of neighbors to consider

slide-69
SLIDE 69

Model Selection

Example: Perceptron

  • model = set of all linear separators
  • parameters = vector of weights (one for each feature)
  • learning algorithm = mistake-based updates to the parameters
  • hyperparameters = none (unless using some variant such as averaged perceptron)
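
The perceptron entries above can be read off a short sketch. This is a minimal version written for illustration: the bias term `b`, the `epochs` cap on the number of passes, the function names, and the toy data are assumptions of the sketch, not details given on the slide.

```python
def perceptron_train(data, epochs=10):
    """Mistake-driven perceptron.

    data: list of (features, label) pairs with label in {-1, +1}.
    Returns the learned parameters (w, b).
    """
    n_features = len(data[0][0])
    w = [0.0] * n_features  # parameters: one weight per feature
    b = 0.0                 # bias term (an addition in this sketch)
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # mistake, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: stop early
            break
    return w, b


def perceptron_predict(w, b, x):
    """The hypothesis: a linear separator defined by (w, b)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical linearly separable toy data.
toy = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
w, b = perceptron_train(toy)
print(w, b)
print([perceptron_predict(w, b, x) for x, _ in toy])
```

The weights change only when an example is misclassified, which is what "mistake-based updates" refers to; the set of all possible `(w, b)` pairs corresponds to the set of all linear separators.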

slide-70
SLIDE 70

Model Selection

If “learning” is all about picking the best parameters, how do we pick the best hyperparameters?

slide-71
SLIDE 71

Model Selection

  • Two very similar definitions:
    – Def: model selection is the process by which we choose the “best” model from among a set of candidates
    – Def: hyperparameter optimization is the process by which we choose the “best” hyperparameters from among a set of candidates (could be called a special case of model selection)
  • Both assume access to a function capable of measuring the quality of a model
  • Both are typically done “outside” the main training algorithm; typically training is treated as a black box

slide-72
SLIDE 72

Example of Hyperparameter Opt.

Chalkboard:

– Special cases of k-Nearest Neighbors
– Choosing k with validation data
– Choosing k with cross-validation
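
The chalkboard content is not part of the transcript, so the following is only a plausible sketch of choosing k with validation data. The function names, the candidate list, and the 1D toy data (including one deliberately mislabeled training point) are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(train, heldout, k):
    """0/1 error of k-NN (using train as memory) measured on heldout."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in heldout)
    return wrong / len(heldout)


def choose_k(train, validation, candidates):
    """Return the candidate k with the lowest validation error (first wins ties)."""
    return min(candidates, key=lambda k: error_rate(train, validation, k))


# Hypothetical data: class 0 near 0-2, class 1 near 5-7, one noisy point at 1.5.
train = [((0.0,), 0), ((1.0,), 0), ((2.0,), 0), ((1.5,), 1),
         ((5.0,), 1), ((6.0,), 1), ((7.0,), 1)]
validation = [((1.4,), 0), ((0.2,), 0), ((6.1,), 1), ((5.2,), 1)]

for k in (1, 3, 7):
    print(k, error_rate(train, validation, k))
print("chosen k:", choose_k(train, validation, [1, 3, 7]))
```

Here k = 1 is misled by the noisy point, k = 7 (all training points) collapses to a majority vote, and the intermediate value does best on the validation set. The test set plays no role in this choice.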

slide-73
SLIDE 73

Cross-Validation

Cross validation is a method of estimating loss on held out data

Input: training data, learning algorithm, loss function (e.g. 0/1 error)
Output: an estimate of loss function on held-out data
Key idea: rather than just a single “validation” set, use many! (Error is more stable. Slower computation.)

D = {(x(1), y(1)), (x(2), y(2)), ..., (x(N), y(N))}, divided into Fold 1, Fold 2, Fold 3, Fold 4

Algorithm: Divide data into folds (e.g. 4)
1. Train on folds {1,2,3} and predict on {4}
2. Train on folds {1,2,4} and predict on {3}
3. Train on folds {1,3,4} and predict on {2}
4. Train on folds {2,3,4} and predict on {1}
Concatenate all the predictions and evaluate loss (almost equivalent to averaging loss over the folds)
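
The cross-validation algorithm on this slide can be sketched as follows. The interface is an assumption of the sketch: `train_fn` takes training data and returns a prediction function, `loss_fn` compares one true label with one prediction, and the round-robin assignment of examples to folds is just one simple way to split.

```python
def cross_validation_error(data, n_folds, train_fn, loss_fn):
    """Estimate held-out loss by n-fold cross-validation.

    train_fn(train_data) -> predict function
    loss_fn(y_true, y_pred) -> float
    """
    folds = [data[i::n_folds] for i in range(n_folds)]  # round-robin split
    truths, predictions = [], []
    for held_out in range(n_folds):
        # Train on every fold except the held-out one.
        train = [ex for j, fold in enumerate(folds) if j != held_out for ex in fold]
        predict = train_fn(train)
        for x, y in folds[held_out]:
            truths.append(y)
            predictions.append(predict(x))
    # Concatenate all predictions, then evaluate the loss once over all of them.
    return sum(loss_fn(t, p) for t, p in zip(truths, predictions)) / len(truths)


def zero_one_loss(y_true, y_pred):
    return float(y_true != y_pred)


def one_nn(train):
    """1-nearest-neighbor learner for 1D inputs (toy learner for this sketch)."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def always_zero(train):
    """Baseline learner that ignores the data and always predicts 0."""
    return lambda x: 0


# Hypothetical 1D data: class 0 near 0, class 1 near 1.
data = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0),
        (1.0, 1), (1.1, 1), (1.2, 1), (1.3, 1)]
print(cross_validation_error(data, 4, one_nn, zero_one_loss))
print(cross_validation_error(data, 4, always_zero, zero_one_loss))
```

When all folds have the same size, the concatenated loss equals the average of the per-fold losses; with unequal folds the two can differ slightly, which is why the slide says "almost equivalent".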
slide-74
SLIDE 74

Model Selection

WARNING (again):

– This section is only scratching the surface!
– Lots of methods for hyperparameter optimization: (to talk about later)
  • Grid search
  • Random search
  • Bayesian optimization
  • Graduate-student descent

Main Takeaway:

– Model selection / hyperparameter optimization is just another form of learning
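
The first two methods in the list are simple enough to sketch. The slide only names them, so everything below is an assumption: the function names, the hypothetical hyperparameters `k` and `p`, and the `toy_validation_error` function, which stands in for "train a model with this configuration, then measure validation error".

```python
import itertools
import random


def grid_search(score_fn, grid):
    """Evaluate every combination in the grid; lower score is better."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if best is None or score < best[1]:
            best = (config, score)
    return best


def random_search(score_fn, samplers, n_trials, seed=0):
    """Evaluate n_trials randomly sampled configurations; lower is better."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {name: sample(rng) for name, sample in samplers.items()}
        score = score_fn(config)
        if best is None or score < best[1]:
            best = (config, score)
    return best


def toy_validation_error(config):
    """Hypothetical stand-in for training plus validation; minimized at k=5, p=2."""
    return (config["k"] - 5) ** 2 + (config["p"] - 2) ** 2


grid = {"k": [1, 3, 5, 7], "p": [1, 2, 3]}
samplers = {
    "k": lambda rng: rng.randint(1, 9),
    "p": lambda rng: rng.choice([1, 2, 3]),
}

print(grid_search(toy_validation_error, grid))
print(random_search(toy_validation_error, samplers, n_trials=20))
```

Both routines treat training as a black box behind `score_fn`, and both "learn" a configuration by searching for the lowest score, which is the sense in which hyperparameter optimization is another form of learning.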

slide-75
SLIDE 75

Model Selection Learning Objectives

You should be able to…

  • Plan an experiment that uses training, validation, and test datasets to predict the performance of a classifier on unseen data (without cheating)
  • Explain the difference between (1) training error, (2) validation error, (3) cross-validation error, (4) test error, and (5) true error
  • For a given learning technique, identify the model, learning algorithm, parameters, and hyperparameters
  • Define "instance-based learning" or "nonparametric methods"
  • Select an appropriate algorithm for optimizing (aka. learning) hyperparameters
