Introduction to Machine Learning, Eric Medvet, 16/3/2017 (PowerPoint presentation)



SLIDE 1

Introduction to Machine Learning

Eric Medvet 16/3/2017

SLIDE 2

Outline

◮ Machine Learning: what and why?
  ◮ Motivating example
◮ Tree-based methods
  ◮ Regression trees
  ◮ Trees aggregation

SLIDE 3

Teachers

◮ Eric Medvet
◮ Dipartimento di Ingegneria e Architettura (DIA)
◮ http://medvet.inginf.units.it/

SLIDE 4

Section 1 Machine Learning: what and why?

SLIDE 5

What is Machine Learning?

Definition

Machine Learning is the science of getting computers to learn without being explicitly programmed.

Definition

Data Mining is the science of discovering patterns in data.

SLIDE 6

In practice

A set of mathematical and statistical tools for:

◮ building a model that allows one to predict an output, given an input (supervised learning)
◮ learning relationships and structures in data (unsupervised learning)

SLIDE 7

Machine Learning everyday

Example problem: spam

Discriminate between spam and non-spam emails.

Figure: Spam filtering in Gmail.

SLIDE 8

Machine Learning everyday

Example problem: image understanding

Recognize objects in images.

Figure: Object recognition in Google Photos.

SLIDE 9

Why ML/DM “today”?

◮ we collect more and more data (big data)
◮ we have more and more computational power

Figure: From http://www.mkomo.com/cost-per-gigabyte-update.

SLIDE 10

ML/DM is popular!

Figure: Popular areas of interest, from the Skill Up 2016: Developer Skills Report (https://techcus.com/p/r1zSmbXut/top-5-highest-paying-programming-languages-of-2016/).

SLIDE 14

What does the Machine Learning practitioner do?

Be able to:

1. design
2. implement
3. assess experimentally

an end-to-end Machine Learning or Data Mining system.

◮ Which is the problem to be solved? What are the input and output? Which are the most suitable algorithms? How should data be prepared? Does computation time matter?
◮ Write some code!
◮ How to measure solution quality? How to compare solutions? Is my solution general?

SLIDE 15

Subsection 1 Motivating example

SLIDE 16

The amateur botanist friend

He likes to collect Iris plants. He “realized” that there are 3 species, in particular, that he likes: Iris setosa, Iris virginica, and Iris versicolor. He’d like to have a tool to automatically classify collected samples in one of the 3 species.

Figure: Iris versicolor.

How to help him?

SLIDE 25

Let’s help him

◮ Which is the problem to be solved?
  ◮ Assign exactly one species to a sample.
◮ What are the input and output?
  ◮ Output: one species among I. setosa, I. virginica, I. versicolor.
  ◮ Input: the plant sample. . .
    ◮ a description in natural language?
    ◮ a digital photo?
    ◮ DNA sequences?
    ◮ some measurements of the sample!

SLIDE 26

Iris: input and output

Figure: Sepal and petal.

Input: sepal length and width, petal length and width (in cm)
Output: the class
Example: (5.1, 3.5, 1.4, 0.2) → I. setosa

SLIDE 27

Other information

The botanist friend asked a senior botanist to inspect several samples and label them with the corresponding species.

Sepal length  Sepal width  Petal length  Petal width  Species
5.1           3.5          1.4           0.2          I. setosa
4.9           3.0          1.4           0.2          I. setosa
7.0           3.2          4.7           1.4          I. versicolor
6.0           2.2          5.0           1.5          I. virginica
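Such a labeled dataset is exactly the classic Iris data, which ships with scikit-learn (listed later in these slides as one software option). A minimal loading sketch; names and shapes are sklearn’s, not the slides’:

```python
# The classic Iris dataset: the same four measurements and three species
# as in the table above, bundled with scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target            # X: n-by-p matrix, y: n labels
print(X.shape)                           # (150, 4): n = 150 observations, p = 4 variables
print(list(iris.target_names))           # ['setosa', 'versicolor', 'virginica']
print(X[0], iris.target_names[y[0]])     # [5.1 3.5 1.4 0.2] setosa
```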
SLIDE 28

Notation and terminology

◮ Sepal length, sepal width, petal length, and petal width are input variables (or independent variables, or features, or attributes).
◮ Species is the output variable (or dependent variable, or response).

SLIDE 31

Notation and terminology

        | x_{1,1}  x_{1,2}  ...  x_{1,p} |        | y_1 |
        | x_{2,1}  x_{2,2}  ...  x_{2,p} |        | y_2 |
    X = |   ...      ...    ...    ...   |    y = | ... |
        | x_{n,1}  x_{n,2}  ...  x_{n,p} |        | y_n |

◮ x_1^T = (x_{1,1}, x_{1,2}, . . . , x_{1,p}) is an observation (or instance, or data point), composed of p variable values; y_1 is the corresponding output variable value
◮ x_2 = (x_{1,2}, x_{2,2}, . . . , x_{n,2}) is the vector of all the n values for the 2nd variable (X_2)

SLIDE 32

Notation and terminology

Different communities (e.g., statistical learning vs. machine learning vs. artificial intelligence) use different terms and notation:

◮ x_j^(i) instead of x_{i,j} (hence x^(i) instead of x_i)
◮ m instead of n, and n instead of p
◮ . . .

Focus on the meaning!

SLIDE 37

Iris: visual interpretation

Simplification: forget petal (variables) and I. virginica → 2 variables, 2 species (binary classification problem).

◮ Problem: given any new observation, we want to automatically assign the species.
◮ Sketch of a possible solution:
  1. learn a model (classifier)
  2. “use” the model on new observations

Figure: scatter plot of sepal length vs. sepal width for I. setosa and I. versicolor samples.
SLIDE 38

“A” model?

There could be many possible models:

◮ how to choose?
◮ how to compare?

SLIDE 39

Choosing the model

The choice of the model/tool/algorithm to be used is determined by many factors:

◮ problem size (n and p)
◮ availability of an output variable (y)
◮ computational effort (when learning or “using”)
◮ explicability of the model
◮ . . .

We will see many options.

SLIDE 41

Comparing many models

Experimentally: does the model work well on (new) data?

Define “works well”:

◮ a single performance index?
◮ how to measure?
◮ repeatability/reproducibility. . .

We will see/discuss many options.

SLIDE 42

It does not work well. . .

Why?

◮ the data is not informative
◮ the data is not representative
◮ the data has changed
◮ the data is too noisy

We will see/discuss these issues.

SLIDE 43

ML is not magic

Problem: find birth town from height/weight.

Figure: scatter plot of weight [kg] vs. height [cm] for people born in Trieste and in Udine.

Q: which is the data issue here?

SLIDE 44

Implementation

When “solving” a problem, we usually need to:

◮ explore/visualize data
◮ apply one or more learning algorithms
◮ assess learned models

“By hand?” No, with software!

SLIDE 45

ML/DM software

Many options:

◮ libraries for general purpose languages:
  ◮ Java: e.g., http://haifengl.github.io/smile/
  ◮ Python: e.g., http://scikit-learn.org/stable/
  ◮ . . .
◮ specialized sw environments:
  ◮ Octave: https://en.wikipedia.org/wiki/GNU_Octave
  ◮ R: https://en.wikipedia.org/wiki/R_(programming_language)
◮ from scratch

SLIDE 46

ML/DM software: which one?

◮ production/prototype
◮ platform constraints
◮ degree of (data) customization
◮ documentation availability/community size
◮ . . .
◮ previous knowledge/skills

SLIDE 47

Section 2 Tree-based methods

SLIDE 48

The carousel robot attendant

Problem: replace the carousel attendant with a robot which automatically decides who can ride the carousel.

SLIDE 53

Carousel: data

Observed human attendant’s decisions.

Figure: scatter plot of age a [year] vs. height h [cm], with “cannot ride” and “can ride” decisions.

How can the robot take the decision?

◮ if younger than 10 → can’t!
◮ otherwise:
  ◮ if shorter than 120 → can’t!
  ◮ otherwise → can!

A decision tree:

a < 10?
  T → can’t
  F → h < 120?
        T → can’t
        F → can

SLIDE 54

How to build a decision tree

Divide and conquer (“dividi-et-impera”), recursively:

◮ find a cut variable and a cut value
◮ for the left branch, divide and conquer
◮ for the right branch, divide and conquer

SLIDE 55

How to build a decision tree: detail

Recursive binary splitting:

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

◮ Recursive binary splitting
◮ Top down (start from the “big” problem)
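The pseudocode above can be turned into a small runnable sketch (ours, not the author’s; Python is used since the slides list it among the software options). ShouldStop, BestBranch, and the classification-error criterion E follow the next slides; the toy data mimics the carousel example:

```python
from collections import Counter

def most_common(ys):
    return Counter(ys).most_common(1)[0][0]

def error(ys):
    # classification error E(y): fraction of labels differing from the most common one
    return 1 - Counter(ys).most_common(1)[0][1] / len(ys)

def best_branch(X, y):
    # greedy: pick (variable i, threshold t) minimizing E(y|x_i<t) + E(y|x_i>=t)
    best, best_e = None, None
    for i in range(len(X[0])):
        for t in sorted({row[i] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[i] < t]
            right = [lab for row, lab in zip(X, y) if row[i] >= t]
            if not left or not right:
                continue
            e = error(left) + error(right)
            if best_e is None or e < best_e:
                best, best_e = (i, t), e
    return best

def build_tree(X, y, k_min=1):
    # ShouldStop: only one class left, or fewer than k_min observations
    if len(set(y)) == 1 or len(y) < k_min:
        return ("leaf", most_common(y))
    branch = best_branch(X, y)
    if branch is None:                       # no valid split exists
        return ("leaf", most_common(y))
    i, t = branch
    lo = [(row, lab) for row, lab in zip(X, y) if row[i] < t]
    hi = [(row, lab) for row, lab in zip(X, y) if row[i] >= t]
    return ("node", i, t,
            build_tree([r for r, _ in lo], [l for _, l in lo], k_min),
            build_tree([r for r, _ in hi], [l for _, l in hi], k_min))

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, i, t, left, right = tree
    return predict(left, x) if x[i] < t else predict(right, x)

# carousel-like toy data: (age, height) -> can ride?
X = [(8, 130), (9, 110), (12, 115), (13, 140), (15, 170), (11, 150)]
y = ["no", "no", "no", "yes", "yes", "yes"]
tree = build_tree(X, y)
print(predict(tree, (14, 160)))   # yes
```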

SLIDE 56

Best branch

function BestBranch(X, y)
  (i*, t*) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i*, t*)
end function

Classification error on a subset:

E(y) = |{y ∈ y : y ≠ ŷ}| / |y|,  with ŷ = the most common class in y

◮ Greedy (choose the split that minimizes the error now, not in later steps)

SLIDE 57

Best branch

(i*, t*) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)

The formula says what is done, not how it is done!

Q: can different “hows” differ? How?

SLIDE 58

Stopping criterion

function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Another possible criterion:

◮ tree depth larger than d_max

SLIDE 59

Categorical independent variables

◮ Trees can work with categorical variables
◮ The branch node is x_i = c or x_i ∈ C′ ⊂ C (c is a class)
◮ Categorical and numeric variables can be mixed

SLIDE 60

Stopping criterion: role of k_min

Suppose k_min = 1 (never stop because of the size of y).

Figure: the carousel data (age a [year] vs. height h [cm]) partitioned by a deep tree with splits h < 120, a < 9.0, a < 9.1, a < 9.4, a < 9.6, a < 10.

Q: what’s wrong?

SLIDE 61

Tree complexity

When the tree is “too complex”:

◮ it is less readable/understandable/explicable
◮ maybe there was noise in the data

Q: what’s the noise in the carousel data?

The tree complexity issue is not related (only) to k_min.

SLIDE 62

Tree complexity: other interpretation

◮ maybe there was noise in the data

The tree fits the learning data too much:

◮ it overfits (overfitting)
◮ it does not generalize (high variance: the model varies if the learning data varies)

SLIDE 65

High variance

“The model varies if the learning data varies”: what? Why does the data vary?

◮ learning data is about the system/phenomenon/nature S
  ◮ a collection of observations of S
  ◮ a point of view on S
◮ learning is about understanding/knowing/explaining S
◮ if I change the point of view on S, my knowledge about S should remain the same!

SLIDE 66

Fighting overfitting

◮ large k_min (large w.r.t. what?)
◮ when building, limit the depth
◮ when building, don’t split if the overall impurity decrease is low
◮ after building, prune

(bias and variance will be detailed later)

SLIDE 67

Evaluation: k-fold cross-validation

How to estimate the predictor performance on new (unavailable) data?

1. split the learning data (X and y) into k equal slices (each of n/k observations/elements)
2. for each split (i.e., each i ∈ {1, . . . , k}):
   2.1 learn on all but the i-th slice
   2.2 compute the classification error on the unseen i-th slice
3. average the k classification errors

In essence:

◮ can the learner generalize on the available data?
◮ how will the learned artifact behave on unseen data?
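The three steps can be sketched in Python (a toy sketch; the “learner” here is a trivial majority-class predictor, invented just to make the folding logic concrete):

```python
from collections import Counter

def k_fold_error(X, y, k, learn, predict):
    # 1. split the data into k equal slices
    n = len(y)
    size = n // k
    errors = []
    for i in range(k):
        lo, hi = i * size, (i + 1) * size
        # 2.1 learn on all but the i-th slice
        model = learn(X[:lo] + X[hi:], y[:lo] + y[hi:])
        # 2.2 classification error on the unseen i-th slice
        wrong = sum(predict(model, xx) != yy
                    for xx, yy in zip(X[lo:hi], y[lo:hi]))
        errors.append(wrong / (hi - lo))
    # 3. average the k errors
    return sum(errors) / k

# trivial "learner": always predict the most common training class
learn = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda model, x: model

X = [[0], [1], [2], [3], [4], [5]]
y = ["a", "a", "a", "a", "b", "b"]
err = k_fold_error(X, y, k=3, learn=learn, predict=predict)
print(err)   # 1/3: the fold holding all the "b"s is predicted entirely wrong
```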

SLIDE 68

Evaluation: k-fold cross-validation

folding 1 → accuracy_1
folding 2 → accuracy_2
folding 3 → accuracy_3
folding 4 → accuracy_4
folding 5 → accuracy_5

accuracy = (1/k) Σ_{i=1}^{k} accuracy_i

Or with the classification error rate, or any other meaningful (effectiveness) measure.

Q: how should data be split?

SLIDE 69

Subsection 1 Regression trees

SLIDE 70

Regression with trees

Trees can be used for regression, instead of classification.

Decision tree vs. regression tree.

SLIDE 72

Tree building: decision → regression

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← ȳ                                ⊲ mean of y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

Q: what should we change? The terminal node now holds the mean ȳ instead of the most common class.
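As a sketch with scikit-learn’s DecisionTreeRegressor (our choice of library; the toy data is invented, and max_depth is sklearn’s counterpart of the slides’ d_max):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D data: two clusters of x values with different mean y
X = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

# a depth-1 regression tree: one split, each terminal node predicts the mean y
reg = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(reg.predict([[2.5]]))    # ~1.0, the mean of {1.0, 1.2, 0.8}
print(reg.predict([[11.5]]))   # ~5.0, the mean of {5.0, 5.2, 4.8}
```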

SLIDE 73

Interpretation

SLIDE 74

Regression and overfitting

Image from F. Daolio

SLIDE 75

Trees in summary

Pros:

◮ easily interpretable/explicable
◮ learning and regression/classification easily understandable
◮ can handle both numeric and categorical values

Cons:

◮ not so accurate (Q: always?)

SLIDE 76

Tree accuracy?

Image from An Introduction to Statistical Learning

SLIDE 77

Subsection 2 Trees aggregation

SLIDE 78

Weakness of the tree

Figure: a noisy 1-D regression dataset, with a “curve” part on the left.

Small tree:

◮ low complexity
◮ will hardly fit the “curve” part
◮ high bias, low variance

Big tree:

◮ high complexity
◮ may overfit the noise on the right part
◮ low bias, high variance

SLIDE 79

The trees view

Small tree:

◮ “a car is something that moves”

Big tree:

◮ “a car is a made-in-Germany blue object with 4 wheels, 2 doors, chromed fenders, and a curved rear enclosing the engine”

SLIDE 80

Big tree view

A big tree:

◮ has a detailed view of the learning data (high complexity)
◮ “trusts too much” the learning data (high variance)

What if we “combine” different big tree views and ignore the details on which they disagree?
SLIDE 81

Wisdom of the crowds

What if we “combine” different big tree views and ignore the details on which they disagree?

◮ many views
◮ independent views
◮ aggregation of views

≈ the wisdom of the crowds: a collective opinion may be better than a single expert’s opinion

SLIDE 85

Wisdom of the trees

◮ many views
  ◮ just use many trees
◮ independent views
  ◮ ??? learning is deterministic: same data ⇒ same tree ⇒ same view
◮ aggregation of views
  ◮ just average the predictions (regression) or take the most common prediction (classification)

SLIDE 86

Independent views

Independent views ≡ different points of view ≡ different learning data.

But we have only one learning dataset!

SLIDE 89

Independent views: idea!

Like in cross-validation, consider only a part of the data, but:

◮ instead of a subset
◮ a sample with repetitions

X   = (x_1^T, x_2^T, x_3^T, x_4^T, x_5^T)    original learning data
X_1 = (x_1^T, x_5^T, x_3^T, x_2^T, x_5^T)    sample 1
X_2 = (x_4^T, x_2^T, x_3^T, x_1^T, x_1^T)    sample 2
X_i = . . .                                  sample i

◮ (y omitted for brevity)
◮ the learning data size is not a limitation (differently than with a subset)

Bagging of trees (bootstrap, more in general)
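Drawing one such sample with repetitions (a bootstrap sample) is one line with Python’s standard library; a toy sketch with stand-in observations:

```python
import random

random.seed(0)  # only for repeatability of the sketch

X = ["x1", "x2", "x3", "x4", "x5"]   # stand-ins for the observations x_i^T
# a bootstrap sample: same size as X, drawn with replacement, so some
# observations appear twice and others not at all
X1 = random.choices(X, k=len(X))
print(X1)   # a list of 5 items drawn from X, repetitions allowed
```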

SLIDE 90

Tree bagging

When learning:

1. repeat B times:
   1.1 take a sample of the learning data
   1.2 learn a tree (unpruned)

When predicting:

1. for each i ∈ {1, . . . , B}:
   1.1 get a prediction from the i-th learned tree
2. predict the average (or most common) prediction

For classification, other aggregations can be done: majority voting (most common) is the simplest.
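A sketch with scikit-learn, whose BaggingClassifier uses a decision tree as its default base estimator (library choice ours; n_estimators plays the role of B):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# B = 100 trees, each learned on a bootstrap sample of (X, y);
# the prediction is the most common class among the trees' predictions
bag = BaggingClassifier(n_estimators=100, bootstrap=True,
                        random_state=0).fit(X, y)
print(bag.predict([[5.1, 3.5, 1.4, 0.2]]))   # [0], i.e. setosa
```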

SLIDE 92

How many trees?

B is a parameter:

◮ when there is a parameter, there is the problem of finding a good value
◮ remember k_min and the depth (Q: impact on?)
◮ it has been shown (experimentally) that
  ◮ for “large” B, bagging is better than a single tree
  ◮ increasing B does not cause overfitting
  ◮ (for us: the default B is ok! “large” ≈ hundreds)

Q: how much better? At which cost?

SLIDE 93

Bagging

Figure: test error (×10^-2) vs. the number B of trees (100–500).

SLIDE 94

Independent view: improvement

Despite being learned on different samples, bagged trees may be correlated, hence the views are not very independent:

◮ e.g., one variable is much more important than the others for predicting (a strong predictor)

Idea: force point-of-view differentiation by “hiding” variables.

SLIDE 95

Random forest

When learning:

1. repeat B times:
   1.1 take a sample of the learning data
   1.2 consider only m of the p independent variables
   1.3 learn a tree (unpruned)

When predicting:

1. for each i ∈ {1, . . . , B}:
   1.1 get a prediction from the i-th learned tree
2. predict the average (or most common) prediction

◮ (observations and) variables are randomly chosen. . .
◮ . . . to learn a forest of trees

Q: are missing variables a problem?
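With scikit-learn (a sketch; max_features plays the role of m, n_estimators of B, and "sqrt" gives m = √p, the value the next slide recommends for classification):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# B = 200 trees; max_features="sqrt" gives m = sqrt(p) (here p = 4, so m = 2)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)
pred = rf.predict([[5.1, 3.5, 1.4, 0.2]])
proba = rf.predict_proba([[5.1, 3.5, 1.4, 0.2]])
print(pred)    # [0], i.e. setosa
print(proba)   # per-class voting fractions, concentrated on class 0
```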

SLIDE 96

Random forest: parameter m

How to choose the value for m?

◮ m = p → bagging
◮ it has been shown (experimentally) that
  ◮ m does not relate to overfitting
  ◮ m = √p is good for classification
  ◮ m = p/3 is good for regression
◮ (for us, the default m is ok!)

SLIDE 97

Random forest

Experimentally shown: one of the “best” multi-purpose supervised classification methods.

◮ Manuel Fernández-Delgado et al. “Do we need hundreds of classifiers to solve real world classification problems?”. In: J. Mach. Learn. Res. 15.1 (2014), pp. 3133–3181

. . . but. . .
SLIDE 98

No free lunch!

“Any two optimization algorithms are equivalent when their performance is averaged across all possible problems”

◮ David H. Wolpert. “The lack of a priori distinctions between learning algorithms”. In: Neural Computation 8.7 (1996), pp. 1341–1390

Why “free lunch”?

◮ many restaurants, many items on the menus, many possible prices for each item: where to go to eat?
◮ no general answer
◮ but, if you are a vegan, or like pizza, then a best choice could exist

Q: problem? algorithm?

SLIDE 100

Nature of the prediction

Consider classification:

◮ tree → the class
  ◮ “virginica” is just “virginica”
◮ forest → the class, as resulting from a voting
  ◮ “241 virginica, 170 versicolor, 89 setosa” is different than “478 virginica, 10 versicolor, 2 setosa”

Is this information useful/exploitable?

SLIDE 101

Confidence/tunability

Voting outcome:

◮ in classification, a measure of the confidence of the decision
◮ in binary classification, the voting threshold can be tuned to adjust the bias towards one class (sensitivity)

Q: in regression?

SLIDE 102

Binary classification

Consider the problem of classifying a person(’s data) as suffering or not suffering from a disease X.

◮ positive: an observation of the “suffering” class
◮ negative: an observation of the “not suffering” class

In other problems, positive may mean a different thing: define it!

SLIDE 103

FPR, FNR

Given some labeled data and a classifier for the disease-X problem, we can measure:

◮ the number of negative observations wrongly classified as positive: False Positives (FP)
◮ the number of positive observations wrongly classified as negative: False Negatives (FN)

To decouple FP and FN from the data size:

FPR = FP / N = FP / (FP + TN)
FNR = FN / P = FN / (FN + TP)
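A toy computation of the two rates (labels and predictions invented for illustration):

```python
def rates(y_true, y_pred, positive="pos"):
    # count the four outcomes of a binary classifier
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fpr = fp / (fp + tn)   # FP / N: negatives wrongly flagged as positive
    fnr = fn / (fn + tp)   # FN / P: positives wrongly flagged as negative
    return fpr, fnr

y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "neg", "pos", "neg"]
fpr, fnr = rates(y_true, y_pred)
print(fpr, fnr)   # 0.2 (1 FP out of 5 negatives), 1/3 (1 FN out of 3 positives)
```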

SLIDE 104

Accuracy and error rate

Accuracy = 1 − Error Rate

Error Rate = (FN + FP) / (P + N)

Q: Error Rate =? (FPR + FNR) / 2

SLIDE 105

FPR, FNR and sensitivity

◮ Suppose FPR = 0.06 and FNR = 0.04 with the threshold set to 0.5 (the default for RF)
◮ One could be interested in “limiting” the FNR. . .

Experimentally:

Figure: FPR and FNR (error rate) as functions of the threshold t.
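Sweeping the threshold can be sketched as follows (a toy sketch; the scores are invented, standing for the fraction of trees voting “positive”):

```python
def fpr_fnr(scores, labels, t):
    # classify as positive when the voting score reaches the threshold t
    preds = [s >= t for s in scores]
    p = sum(labels)
    n = len(labels) - p
    fp = sum(pr and not lb for pr, lb in zip(preds, labels))
    fn = sum(not pr and lb for pr, lb in zip(preds, labels))
    return fp / n, fn / p

# invented voting scores and the corresponding true labels
scores = [0.9, 0.8, 0.55, 0.45, 0.3, 0.2, 0.6, 0.4]
labels = [True, True, True, True, False, False, False, False]

for t in (0.3, 0.5, 0.7):
    print(t, fpr_fnr(scores, labels, t))
# lowering t trades a higher FPR for a lower FNR, and vice versa
```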

SLIDE 106

Receiver operating characteristic (ROC)

Figure: left, FPR and FNR vs. the threshold t; right, the ROC curve (TPR vs. FPR). The equal error rate (EER) is marked on both.

◮ Equal error rate (EER): the error rate at the threshold where FPR = FNR

SLIDE 107

. . . is better than

Figure: FPR and FNR as functions of the threshold t.

◮ which is the best?
◮ robustness w.r.t. t?

SLIDE 108

ROC and comparison

Figure: ROC curves (TPR vs. FPR) for Classifier C1, Classifier C2, and a random classifier.

C1 is better than C2: by how much?

◮ EER
◮ Area under the curve (AUC)

SLIDE 109

Bagging/RF/boosting in summary

A comparison of Tree vs. Bagging vs. RF vs. Boosting along these dimensions:

◮ interpretability
◮ numeric/categorical
◮ accuracy
◮ test error estimate
◮ variable importance
◮ confidence/tunability
◮ fast to learn*
◮ (almost) non-parametric

*: Q: how much faster? When? Does it matter?