Introduction to Machine Learning

Andrea De Lorenzo, A.Y. 2020

Section 1 General information

Lecturers

◮ Andrea De Lorenzo
◮ Dipartimento di Ingegneria e Architettura (DIA)
◮ http://delorenzo.inginf.units.it/

Course materials

◮ Lecturer’s slides
  ◮ http://delorenzo.inginf.units.it/project/introduction-to-machine-learning-2020
◮ Suggested textbooks (for further reading)
  ◮ Gareth James et al. An introduction to statistical learning. Vol. 6. Springer, 2013
◮ Other material
  ◮ I might point you to some scientific papers for discussing examples of application or specific details, just a “chat”

Everything you are required to know is in the lecturer’s slides.

Section 2 Introduction

What is Machine Learning?

Definition: Machine Learning is the science of getting computers to learn without being explicitly programmed.

Definition: Data Mining/Analytics is the science of discovering patterns in data.

In practice

A set of mathematical and statistical tools for:
◮ building a model which allows one to predict an output, given an input (supervised learning)
  ◮ example (input, output) pairs are available
◮ learning relationships and structures in data (unsupervised learning)

Machine Learning: a computer science perspective

Machine Learning everyday

Example problem: spam

Discriminate between spam and non-spam emails.

Figure: Spam filtering in Gmail.

Machine Learning everyday

Example problem: flight trajectories

Do flights over the same (origin, destination) pair follow the “same” trajectory? Why?

Figure: Clustering of flight trajectories.

Machine Learning everyday

Example problem: image understanding

Recognize objects in images.

Figure: Object recognition in Google Photos.

Machine Learning everyday

Q: what type of learning (supervised/unsupervised) is in the examples?
◮ spam
◮ image understanding
◮ flight trajectories

Why ML/DM “today”?

◮ we collect more and more data (big data)
◮ we have more and more computational power

Figure: From http://www.mkomo.com/cost-per-gigabyte-update.

ML/DM is popular!

Figure: Popular areas of interest, from the Skill Up 2016: Developer Skills Report (https://techcus.com/p/r1zSmbXut/top-5-highest-paying-programming-languages-of-2016/).

Aims of the course

Be able to:
1. design
2. implement
3. assess experimentally
an end-to-end Machine Learning or Data Mining system.
◮ Which is the problem to be solved? Which are the input and output? Which are the most suitable techniques? How should data be prepared? Does computation time matter?
◮ Write some code!
◮ How to measure solution quality? How to compare solutions? Is my solution general?
◮ The assessment itself requires design and implementation

Aims of the course: communication

Be able to:
1. design
2. implement
3. assess experimentally
an end-to-end Machine Learning or Data Mining system. And be able to convince the “client” that it is:
◮ technically sound
◮ economically viable
◮ in its larger context

Subsection 1 Motivating example

The amateur botanist friend

He likes to collect Iris plants. He “realized” that there are 3 species, in particular, that he likes: Iris setosa, Iris virginica, and Iris versicolor. He’d like a tool to automatically classify collected samples into one of the 3 species.

Figure: Iris versicolor.

How to help him?

Let’s help him

◮ Which is the problem to be solved?
  ◮ Assign exactly one species to a sample.
◮ Which are the input and output?
  ◮ Output: one species among I. setosa, I. virginica, I. versicolor.
  ◮ Input: the plant sample...
    ◮ a description in natural language?
    ◮ a digital photo?
    ◮ DNA sequences?
    ◮ some measurements of the sample!

Iris: input and output

Figure: Sepal and petal.

Input: sepal length and width, petal length and width (in cm)
Output: the class
Example: (5.1, 3.5, 1.4, 0.2) → I. setosa

Other information

The botanist friend asked a senior botanist to inspect several samples and label them with the corresponding species.

  Sepal length  Sepal width  Petal length  Petal width  Species
  5.1           3.5          1.4           0.2          I. setosa
  4.9           3.0          1.4           0.2          I. setosa
  7.0           3.2          4.7           1.4          I. versicolor
  6.0           2.2          5.0           1.5          I. virginica
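These measurements are the classic Iris dataset, which ships with base R; a minimal sketch to inspect it:

  data(iris)                # built-in data.frame: Sepal.Length, Sepal.Width,
                            # Petal.Length, Petal.Width, Species
  head(iris)                # the first observations, one row per sample
  summary(iris$Species)     # 50 observations per species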
Notation and terminology

◮ Sepal length, sepal width, petal length, and petal width are input variables (or independent variables, or features, or attributes).
◮ Species is the output variable (or dependent variable, or response).

Notation and terminology

  X = ( x_{1,1}  x_{1,2}  · · ·  x_{1,p} )      y = ( y_1 )
      ( x_{2,1}  x_{2,2}  · · ·  x_{2,p} )          ( y_2 )
      (   ...      ...    · · ·    ...   )          ( ... )
      ( x_{n,1}  x_{n,2}  · · ·  x_{n,p} )          ( y_n )

◮ x^T_1 = (x_{1,1}, x_{1,2}, . . . , x_{1,p}) is an observation (or instance, or data point), composed of p variable values; y_1 is the corresponding output variable value
◮ x^T_2 = (x_{1,2}, x_{2,2}, . . . , x_{n,2}) is the vector of all the n values for the 2nd variable (X_2)

Notation and terminology

Different communities (e.g., statistical learning vs. machine learning vs. artificial intelligence) use different terms and notation:
◮ x^(i)_j instead of x_{i,j} (hence x^(i) instead of x_i)
◮ m instead of n, and n instead of p
◮ ...
Focus on the meaning!

Iris: visual interpretation

Simplification: forget the petal variables and I. virginica → 2 variables, 2 species (binary classification problem).
◮ Problem: given any new observation, we want to automatically assign the species.
◮ Sketch of a possible solution:
  1. learn a model (classifier)
  2. “use” the model on new observations

Figure: scatter plot of sepal length vs. sepal width for I. setosa and I. versicolor.
“A” model?

There could be many possible models:
◮ how to choose?
◮ how to compare?
Q: a model of what?

Choosing the model

The choice of the model/tool/technique to be used is determined by many factors:
◮ problem size (n and p)
◮ availability of an output variable (y)
◮ computational effort (when learning or “using”)
◮ explicability of the model
◮ ...
We will see some options.

Comparing many models

Experimentally: does the model work well on (new) data? Define “works well”:
◮ a single performance index?
◮ how to measure?
◮ repeatability/reproducibility...
  ◮ Q: what’s the difference?
We will see/discuss some options.

It does not work well. . .

Why?
◮ the data is not informative
◮ the data is not representative
◮ the data has changed
◮ the data is too noisy
We will see/discuss these issues.

ML is not magic

Problem: find the birth town from height/weight.

Figure: scatter plot of weight [kg] vs. height [cm] for people born in Trieste and in Udine.

Q: which is the data issue here?

Implementation

When “solving” a problem, we usually need to:
◮ explore/visualize data
◮ apply one or more ML techniques
◮ assess learned models
“By hand?” No, with software!

ML/DM software

Many options:
◮ libraries for general-purpose languages:
  ◮ Java: e.g., http://haifengl.github.io/smile/
  ◮ Python: e.g., http://scikit-learn.org/stable/
  ◮ ...
◮ specialized sw environments:
  ◮ Octave: https://en.wikipedia.org/wiki/GNU_Octave
  ◮ R: https://en.wikipedia.org/wiki/R_(programming_language)
◮ from scratch

ML/DM software: which one?

◮ production/prototype
◮ platform constraints
◮ degree of (data) customization
◮ documentation availability/community size
◮ ...
◮ previous knowledge/skills

ML/DM software: why?

In all cases, sw allows one to be more productive and concise. E.g., learn and use a model for classification, in Java+Smile:

  double[][] instances = ...;
  int[] labels = ...;
  RandomForest classifier = (new RandomForest.Trainer()).train(instances, labels);
  double[] newInstance = ...;
  int newLabel = classifier.predict(newInstance);

In R:

  d = ...
  classifier = randomForest(label~., d)
  newD = ...
  newLabels = predict(classifier, newD)

Section 3 Plotting data: an overview

Advanced plotting

◮ many packages (e.g., ggplot2)
◮ many options
Which is the most proper chart to support a thesis? (A minimal example follows.)
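A minimal ggplot2 sketch (assuming the ggplot2 package is installed) of the Iris scatter plot used earlier:

  library(ggplot2)
  # one point per observation; colour encodes the species
  ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
    geom_point()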

Aim of a plot: examples

Figure: a sequence of example plots (four slides), each supporting a different aim.

Section 4 Tree-based methods

The carousel robot attendant

Problem: replace the carousel attendant with a robot which automatically decides who can ride the carousel.

Carousel: data

Observed human attendant’s decisions.

Figure: scatter plot of age a [year] vs. height h [cm], with “cannot ride” and “can ride” decisions.

How can the robot take the decision?
◮ if younger than 10 → can’t!
◮ otherwise:
  ◮ if shorter than 120 → can’t!
  ◮ otherwise → can!
A decision tree!

  a < 10?
  ├─ true  → cannot ride
  └─ false → h < 120?
             ├─ true  → cannot ride
             └─ false → can ride

How to build a decision tree

Divide et impera (recursively):
◮ find a cut variable and a cut value
◮ for the left branch, divide et impera
◮ for the right branch, divide et impera

How to build a decision tree: detail

Recursive binary splitting:

  function BuildDecisionTree(X, y)
    if ShouldStop(y) then
      ŷ ← most common class in y
      return new terminal node with ŷ
    else
      (i, t) ← BestBranch(X, y)
      n ← new branch node with (i, t)
      append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
      append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
      return n
    end if
  end function

◮ recursive binary splitting
◮ top down (start from the “big” problem)
(A from-scratch sketch follows.)
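A from-scratch R sketch of this recursion, assuming numeric input variables and the misclassification count as split criterion (the names build_tree and predict_tree are mine, not from the slides):

  # grow a tree by recursive binary splitting; X: data.frame of numeric
  # features, y: factor of class labels
  build_tree <- function(X, y, k_min = 5) {
    if (length(unique(y)) == 1 || length(y) < k_min)
      return(list(leaf = TRUE, class = names(which.max(table(y)))))
    err <- function(y) length(y) - max(table(y))  # misclassified under majority
    best <- NULL
    for (i in seq_len(ncol(X)))                   # try every variable...
      for (t in unique(X[[i]])) {                 # ...and every observed cut value
        l <- X[[i]] < t
        if (!any(l) || all(l)) next
        e <- err(y[l]) + err(y[!l])
        if (is.null(best) || e < best$e) best <- list(i = i, t = t, e = e)
      }
    if (is.null(best))                            # no valid cut: make a leaf
      return(list(leaf = TRUE, class = names(which.max(table(y)))))
    l <- X[[best$i]] < best$t
    list(leaf = FALSE, i = best$i, t = best$t,
         left  = build_tree(X[l, , drop = FALSE],  y[l],  k_min),
         right = build_tree(X[!l, , drop = FALSE], y[!l], k_min))
  }

  predict_tree <- function(node, x) {             # x: one observation (a row)
    if (node$leaf) return(node$class)
    if (x[[node$i]] < node$t) predict_tree(node$left, x)
    else predict_tree(node$right, x)
  }

  tree <- build_tree(iris[1:4], iris$Species, k_min = 10)
  predict_tree(tree, iris[1, 1:4])   # expected: "setosa"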

Best branch

  function BestBranch(X, y)
    (i⋆, t⋆) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)
    return (i⋆, t⋆)
  end function

Classification error on a subset:

  E(y) = |{y ∈ y : y ≠ ŷ}| / |y|,   ŷ = the most common class in y

◮ greedy (choose the split that minimizes the error now, not in later steps)

Best branch

  (i⋆, t⋆) ← arg min_{i,t} E(y|x_i≥t) + E(y|x_i<t)

The formula says what is done, not how it is done!
Q: “how” can different methods differ?

Stopping criterion

  function ShouldStop(y)
    if y contains only one class then return true
    else if |y| < k_min then return true
    else return false
    end if
  end function

Other possible criterion:
◮ tree depth larger than d_max
(A concrete counterpart follows.)
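As a concrete counterpart, a sketch with the rpart package, where (roughly) minsplit plays the role of k_min and maxdepth the role of d_max:

  library(rpart)
  # a classification tree on the Iris data, grown by recursive binary splitting
  tree <- rpart(Species ~ ., data = iris, method = "class",
                control = rpart.control(minsplit = 10,  # don't split small nodes (~ k_min)
                                        maxdepth = 5))  # limit the tree depth (~ d_max)
  print(tree)
  predict(tree, iris[c(1, 51, 101), ], type = "class")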

Best branch criteria

Classification error E(·) works, but has been shown to be “not sufficiently sensitive for tree-growing”.

  E(y) = |{y ∈ y : y ≠ ŷ}| / |y| = 1 − max_c |{y ∈ y : y = c}| / |y| = 1 − max_c p_{y,c}

Two other options:
◮ Gini index: G(y) = Σ_c p_{y,c} (1 − p_{y,c})
◮ cross-entropy: D(y) = −Σ_c p_{y,c} log p_{y,c}
For all indexes, the lower the better (node impurity). (A small sketch follows.)
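A small R sketch computing the three impurity indexes on a vector of class labels (the helper name impurities is mine):

  impurities <- function(y) {
    p <- as.vector(table(y)) / length(y)           # class proportions p_c
    c(error         = 1 - max(p),                  # classification error
      gini          = sum(p * (1 - p)),            # Gini index
      cross_entropy = -sum(ifelse(p > 0, p * log(p), 0)))
  }
  impurities(iris$Species)   # three balanced classes: maximally impure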

Best branch criteria: binary classification

Figure: classification error E, Gini index G, and (rescaled) cross-entropy D as functions of p_{y,c}, for binary classification.

Q: what happens with multiclass problems?

Categorical independent variables

◮ trees can work with categorical variables
◮ the branch-node test is x_i = c or x_i ∈ C′ ⊂ C (c is one of the categories)
◮ categorical and numeric variables can be mixed

Stopping criterion: role of kmin

Suppose k_min = 1 (never stop because of the size of y).

Figure: the carousel data again, now partitioned by a deep tree with splits h < 120, a < 9.0, a < 9.1, a < 9.4, a < 9.6, a < 10.

Q: what’s wrong? (recall: “a model of what?”)

Tree complexity

When the tree is “too complex”:
◮ it is less readable/understandable/explicable
◮ maybe there was noise in the data
Q: what’s the noise in the carousel data?
Tree complexity is not related (only) to k_min, but also to the data.

Tree complexity: other interpretation

◮ maybe there was noise in the data
The tree fits the learning data too much:
◮ it overfits (overfitting)
◮ it does not generalize (high variance: the model varies if the learning data varies)

High variance

“The model varies if the learning data varies”: what? why does the data vary?
◮ learning data is about a system/phenomenon/nature S
  ◮ a collection of observations of S
  ◮ a point of view on S
◮ learning is about understanding/knowing/explaining S
◮ if I change the point of view on S, my knowledge about S should remain the same!

Spotting overfitting

Figure: learning error and test error as functions of model complexity.

Test error: error on unseen data.

k-fold cross-validation

Where can I find “unseen data”? Pretend to have it!
1. split the learning data (X and y) in k equal slices (each of n/k observations/elements)
2. for each split (i.e., each i ∈ {1, . . . , k}):
  2.1 learn on all but the i-th slice
  2.2 compute the classification error on the unseen i-th slice
3. average the k classification errors
In essence:
◮ can the learner generalize beyond the available data?
◮ how will the learned artifact behave on unseen data?

k-fold cross-validation

folding 1 → error_1
folding 2 → error_2
folding 3 → error_3
folding 4 → error_4
folding 5 → error_5

  error = (1/k) Σ_{i=1}^{k} error_i

Or with any other meaningful (effectiveness) measure.
Q: how should data be split? (A from-scratch sketch follows.)
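A from-scratch sketch of k-fold cross-validation in R, assuming rpart as the learner:

  library(rpart)
  k    <- 10
  fold <- sample(rep(1:k, length.out = nrow(iris)))   # random fold assignment
  errors <- sapply(1:k, function(i) {
    learn <- iris[fold != i, ]                        # all but the i-th slice
    test  <- iris[fold == i, ]                        # the unseen i-th slice
    fit   <- rpart(Species ~ ., data = learn, method = "class")
    mean(predict(fit, test, type = "class") != test$Species)
  })
  mean(errors)   # CV estimate of the error on unseen data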

Fighting overfitting with trees

◮ a large k_min (large w.r.t. what?)
◮ when building, limit the depth
◮ when building, don’t split if the overall impurity decrease is low
◮ after building, prune

Pruning: high level idea

1. learn a full tree t_0
2. build from t_0 a sequence T = {t_0, t_1, . . . , t_n} of trees such that:
  ◮ t_i is a root-subtree of t_{i−1} (t_i ⊂ t_{i−1})
  ◮ t_i is always less complex than t_{i−1}
3. choose the t ∈ T with minimum classification error, estimated with k-fold cross-validation
(An rpart-based sketch follows.)
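A sketch of this idea with rpart, which implements cost-complexity pruning (grow a full tree, then pick the subtree whose complexity parameter minimizes the cross-validated error):

  library(rpart)
  full <- rpart(Species ~ ., data = iris, method = "class",
                control = rpart.control(minsplit = 2, cp = 0))  # ~ the full tree t_0
  # cptable reports, for each subtree in the sequence, the cross-validated
  # error (xerror); pick the cp that minimizes it, then prune
  best_cp <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]
  pruned  <- prune(full, cp = best_cp)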

k-fold cross-validation: data splitting

Q: how should data be split? Example: Android malware detection.
◮ Gerardo Canfora et al. “Effectiveness of opcode ngrams for detection of multi family android malware”. In: Availability, Reliability and Security (ARES), 2015 10th International Conference on. IEEE. 2015, pp. 333–340
◮ Gerardo Canfora et al. “Detecting android malware using sequences of system calls”. In: Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile. ACM. 2015, pp. 13–20

Using cross-validation (CV) for assessment (I)

How will the learned artifact behave on unseen data? More precisely: how will an artifact learned with this learning technique behave on unseen data?
Using CV for assessment (II)

“This learning technique” = BuildDecisionTree() with k_min = 10:
1. repeat k times:
  1.1 BuildDecisionTree() with k_min = 10 on all but one slice (each X passed to BuildDecisionTree() has ((k−1)/k) n observations)
  1.2 compute the classification error on the left-out slice
2. average the computed classification errors
→ k invocations of BuildDecisionTree()

Using CV for assessment (III)

“This learning technique” = BuildDecisionTree() with k_min chosen automatically with a 10-fold CV. For assessing this technique, we do two nested CVs:
1. repeat k times:
  1.1 choose k_min among m values with a 10-fold CV (repeat BuildDecisionTree() 10m times) on all but one slice (each X passed to BuildDecisionTree() has ((k−1)/k) (9/10) n observations!)
  1.2 compute the classification error on the left-out slice
    ◮ usually, a new tree is then built on ((k−1)/k) n observations
2. average the computed classification errors
→ (10m + 1)k invocations of BuildDecisionTree()

Using CV for assessment: “cheating”

“This learning technique” = BuildDecisionTree() with k_min chosen automatically with a 10-fold CV. Using just one CV is cheating (cherry picking)!
◮ k_min is chosen exactly to minimize the error on the full dataset
◮ conceptually, this way of “fitting” k_min is similar to the way we build the tree

Subsection 1 Regression trees

Regression with trees

Trees can be used for regression, instead of classification: decision tree vs. regression tree.

Tree building: decision → regression

Q: what should we change? In a terminal node, predict the mean of y instead of the most common class:

  function BuildDecisionTree(X, y)
    if ShouldStop(y) then
      ŷ ← ȳ    ⊲ mean of y
      return new terminal node with ŷ
    else
      (i, t) ← BestBranch(X, y)
      n ← new branch node with (i, t)
      append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
      append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
      return n
    end if
  end function

Best branch

Q: what should we change? Minimize the sum of the residual sums of squares (RSS):

  function BestBranch(X, y)
    (i⋆, t⋆) ← arg min_{i,t} Σ_{y_j ∈ y|x_i≥t} (y_j − ȳ)² + Σ_{y_j ∈ y|x_i<t} (y_j − ȳ)²
    return (i⋆, t⋆)
  end function

(the two ȳ are different: each is the mean of its own subset)

Stopping criterion

Q: what should we change? Stop when the RSS is 0, instead of when y contains only one class:

  function ShouldStop(y)
    if the RSS of y is 0 then return true
    else if |y| < k_min then return true
    else return false
    end if
  end function

(A regression-tree sketch follows.)
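A regression-tree sketch with rpart (method = "anova" grows the tree by minimizing the RSS), on the built-in cars data:

  library(rpart)
  reg <- rpart(dist ~ speed, data = cars, method = "anova",
               control = rpart.control(minsplit = 10))
  predict(reg, data.frame(speed = c(5, 15, 25)))  # piecewise-constant predictions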

Interpretation

Figure: a one-variable regression example; the tree’s prediction is a piecewise-constant function of the input.

Regression and overfitting

Image from F. Daolio

Trees in summary

Pros:
◮ easily interpretable/explicable
◮ learning and regression/classification easily understandable
◮ can handle both numeric and categorical values
Cons:
◮ not so accurate (Q: always?)

Tree accuracy?

Image from An Introduction to Statistical Learning

Subsection 2 Trees aggregation

Weakness of the tree

Figure: a one-variable regression dataset with a curved part and a noisy part.

Small tree:
◮ low complexity
◮ will hardly fit the “curve” part
◮ high bias, low variance
Big tree:
◮ high complexity
◮ may overfit the noise on the right part
◮ low bias, high variance

The trees’ view

Small tree:
◮ “a car is something that moves”
Big tree:
◮ “a car is a made-in-Germany blue object with 4 wheels, 2 doors, chromed fenders, and a curved rear enclosing the engine”

Big tree view

A big tree:
◮ has a detailed view of the learning data (high complexity)
◮ “trusts too much” the learning data (high variance)
What if we “combine” different big-tree views and ignore the details on which they disagree?
Wisdom of the crowds

What if we “combine” different big-tree views and ignore the details on which they disagree?
◮ many views
◮ independent views
◮ aggregation of views
≈ the wisdom of the crowds: a collective opinion may be better than a single expert’s opinion

Wisdom of the trees

◮ many views
  ◮ just use many trees
◮ independent views
  ◮ ??? learning is deterministic: same data ⇒ same tree ⇒ same view
◮ aggregation of views
  ◮ just average the predictions (regression) or take the most common prediction (classification)

Independent views

Independent views ≡ different points of view ≡ different learning data. But we have only one learning dataset!

Independent views: idea! (Bootstrap)

Like in cross-validation, consider only a part of the data, but:
◮ instead of a subset
◮ a sample with repetitions

  X   = (x^T_1, x^T_2, x^T_3, x^T_4, x^T_5)   original learning data
  X_1 = (x^T_1, x^T_5, x^T_3, x^T_2, x^T_5)   sample 1
  X_2 = (x^T_4, x^T_2, x^T_3, x^T_1, x^T_1)   sample 2
  X_i = . . .                                 sample i

◮ (y omitted for brevity)
◮ the learning data size is not a limitation (differently than with a subset)
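A bootstrap sample in R (a minimal sketch):

  n    <- nrow(iris)
  idx  <- sample(n, n, replace = TRUE)   # a sample with repetitions
  boot <- iris[idx, ]                    # some rows repeated, some rows left out
  mean(!(1:n %in% idx))                  # fraction of left-out rows, ≈ 1/e ≈ 0.37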

Tree bagging

When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 learn a tree (unpruned)
When predicting:
1. repeat B times:
  1.1 get a prediction from the i-th learned tree
2. predict the average (or most common) prediction
For classification, other aggregations can be done: majority voting (most common) is the simplest. Using independent, possibly different classifiers together: an ensemble of classifiers. (A from-scratch sketch follows.)
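A from-scratch sketch of tree bagging for classification, assuming rpart as the tree learner (the names trees and bagged_predict are mine):

  library(rpart)
  B <- 100
  trees <- lapply(1:B, function(b) {
    idx <- sample(nrow(iris), replace = TRUE)   # bootstrap sample of the rows
    rpart(Species ~ ., data = iris[idx, ], method = "class")
  })
  bagged_predict <- function(trees, newdata) {
    # one column of predicted labels per tree; majority voting per row
    votes <- sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
    apply(votes, 1, function(v) names(which.max(table(v))))
  }
  bagged_predict(trees, iris[c(1, 51, 101), ])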
How many trees?

B is a parameter:
◮ when there is a parameter, there is the problem of finding a good value
◮ remember k_min, depth (Q: impact on?)
◮ it has been shown (experimentally) that:
  ◮ for “large” B, bagging is better than a single tree
  ◮ increasing B does not cause overfitting
  ◮ (for us: the default B is ok! “large” ≈ hundreds)
Q: how much better? at which cost?

Bagging: impact of B

Figure: test error vs. the number B of trees.

Independent view: improvement

Despite being learned on different samples, bagged trees may be correlated, hence the views are not very independent:
◮ e.g., one variable is much more important than the others for predicting (strong predictor)
Idea: force point-of-view differentiation by “hiding” variables.

Random forest

When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 consider only m of the p independent variables
  1.3 learn a tree (unpruned)
When predicting:
1. repeat B times:
  1.1 get a prediction from the i-th learned tree
2. predict the average (or most common) prediction
◮ (observations and) variables are randomly chosen. . .
◮ . . . to learn a forest of trees
Q: are missing variables a problem? (A usage sketch follows.)
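The randomForest package implements this; m corresponds to the mtry parameter (a usage sketch):

  library(randomForest)
  rf <- randomForest(Species ~ ., data = iris,
                     ntree = 500,   # B
                     mtry  = 2)     # m (the default is ~ sqrt(p) for classification;
                                    # note: randomForest draws the m variables at each split)
  predict(rf, iris[c(1, 51, 101), ])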

Random forest: parameter m

How to choose the value for m?
◮ m = p → bagging
◮ it has been shown (experimentally) that:
  ◮ m does not relate with overfitting
  ◮ m = √p is good for classification
  ◮ m = p/3 is good for regression
◮ (for us, the default m is ok!)

Random forest

Experimentally shown: one of the “best” multi-purpose supervised classification methods.
◮ Manuel Fernández-Delgado et al. “Do we need hundreds of classifiers to solve real world classification problems”. In: J. Mach. Learn. Res 15.1 (2014), pp. 3133–3181
. . . but. . .
No free lunch!

“Any two optimization algorithms are equivalent when their performance is averaged across all possible problems”
◮ David H. Wolpert. “The lack of a priori distinctions between learning algorithms”. In: Neural Computation 8.7 (1996), pp. 1341–1390
Why “free lunch”?
◮ many restaurants, many items on the menus, many possible prices for each item: where to go to eat?
◮ no general answer
◮ but, if you are a vegan, or like pizza, then a best choice could exist
Q: problem? algorithm?

Observation sampling

When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 consider only m of the p independent variables (only for RF)
  1.3 learn a tree (unpruned)
Each learned tree uses only a portion of the observations in the learning data:
◮ for each observation, ≈ B/3 trees did not consider it when learning
◮ those observations were unseen for those trees, like in cross-validation (OOB = out-of-bag)

Bonus 1: OOB error

◮ for each observation there are ≈ B/3 predictions from trees that did not see it
◮ we can “average” the predictions, among trees and observations, and obtain an estimate of the test error (OOB error)
  ◮ like with cross-validation
  ◮ for free! (see the sketch below)
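With randomForest, the OOB error estimate indeed comes for free (a sketch):

  library(randomForest)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500)
  print(rf)                      # reports the OOB estimate of the error rate
  rf$err.rate[rf$ntree, "OOB"]   # the same estimate, after all B trees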

OOB error

Image from An Introduction to Statistical Learning

Why estimate the test error?

Because the test data, in the real world, is not available!
◮ will my ML solution work?

Bagging/RF and explicability

◮ trees are easily understandable → explicability
◮ hundreds of trees are not!

Image from F. Daolio

Bagging/RF and explicability: idea!

While learning:
1. for each tree, at each split:
  1.1 keep note of the split variable
  1.2 keep note of the RSS/Gini reduction
2. for each variable, sum the reductions
The larger the reduction, the more important the variable!

Bonus 2: variable importance

Instead of explicability based on the tree shape:
◮ importance of variables, based on the RSS/Gini reduction (see the sketch below)
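With randomForest (a sketch): importance() reports the total Gini decrease per variable, and varImpPlot() plots it:

  library(randomForest)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500)
  importance(rf)   # MeanDecreaseGini, one row per input variable
  varImpPlot(rf)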

Nature of the prediction

Consider classification:
◮ tree → the class
  ◮ “virginica” is just “virginica”
◮ forest → the class, as resulting from a voting
  ◮ “241 virginica, 170 versicolor, 89 setosa” is different than “478 virginica, 10 versicolor, 2 setosa”
Different confidence in the prediction.

Bonus 3: confidence/tunability

Voting outcome:
◮ in classification, a measure of confidence of the decision
◮ in binary classification, the voting threshold can be tuned to adjust the bias towards one class (sensitivity)
Q: in regression?

Subsection 3 Binary classification

Binary classification

Binary classification:
◮ one of the most common classes of problems
◮ (comparative) evaluation is important!

Binary classification: evaluation

Consider the problem of classifying a person(’s data) as suffering or not suffering from a disease X.

Suppose we have “an accuracy of 99.99%”. Q: is it good?

Binary classification: positives/negatives

Consider the problem of classifying a person(’s data) as suffering or not suffering from a disease X.
◮ positive: an observation of the “suffering” class
◮ negative: an observation of the “not suffering” class
In other problems, positive may mean a different thing: define it!

Effectiveness indexes: FPR, FNR

Given some labeled data and a classifier for the disease X problem, we can measure:
◮ the number of negative observations wrongly classified as positive: False Positives (FP)
◮ the number of positive observations wrongly classified as negative: False Negatives (FN)
To decouple FP and FN from the data size:

  FPR = FP / N = FP / (FP + TN)
  FNR = FN / P = FN / (FN + TP)

(A small sketch follows.)
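A small sketch computing FPR and FNR from predicted and true binary labels (the helper name rates is mine):

  rates <- function(pred, truth, positive) {
    fp <- sum(pred == positive & truth != positive)  # false positives
    fn <- sum(pred != positive & truth == positive)  # false negatives
    c(FPR = fp / sum(truth != positive),             # FP / N
      FNR = fn / sum(truth == positive))             # FN / P
  }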

Accuracy and error rate

Relation of FPR and FNR with accuracy and error rate:

  Accuracy = 1 − Error Rate
  Error Rate = (FN + FP) / (P + N)

Q: Error Rate =? (FPR + FNR) / 2

slide-142
SLIDE 142

106/122

FPR, FNR and sensitivity

◮ Suppose FPR = 0.06, FNR = 0.04 with threshold set to 0.5 (default for RF) ◮ One could be interested in “limiting” the FNR → change the threshold Experimentally: 0.2 0.4 0.6 0.8 1 0.2 0.4 Threshold t Error rate FPR FNR

slide-143
SLIDE 143

107/122

Comparing classifiers with FPR, FNR

◮ Classifier A: FPR = 0.06, FNR = 0.04 ◮ Classifier B: FPR = 0.10, FNR = 0.01 Which one is the better? We’d like to have one single index → EER, AUC

slide-144
SLIDE 144

108/122

Equal Error Rate (EER)

FPR, FNR vs. t 0.5 1 0.2 0.4 EER Threshold t Error rate FPR FNR EER: the FPR at the value of t for which FPR = FNR

slide-145
SLIDE 145

109/122

AUC: Area Under the Curve

TPR vs. FPR 0.2 0.4 0.6 0.8 1 EER FPR TPR AUC: the area under the TPR vs. FPR curve, plotted for different values of threshold t ◮ the curve is called the Receiver operating characteristic (ROC)

slide-146
SLIDE 146

110/122

ROC and comparison

0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 FPR TPR Classifier C1 Classifier C2 Random classifier Q: what does the bisector represent?

slide-147
SLIDE 147

111/122

Other issues: robustness w.r.t. the threshold

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.2 0.4 0.6 0.8 1 t FPR, FNR “Same” with other parameters

slide-148
SLIDE 148

112/122

Other issues: robustness w.r.t. random components

Consider A vs. B, AUC measured with cross-fold validation: ◮ A: 0.85, 0.73, 0.91, · · · → µ = 0.83, σ = 0.15 ◮ B: 0.81, 0.78, 0.79, · · · → µ = 0.81, σ = 0.03 Can we say that A is better than B? (for effectiveness only) In general, other sources of performance variability: ◮ random seed ◮ subclass of problem class (e.g., image recognition of dogs, cats, . . . )

slide-149
SLIDE 149

113/122

Comparing techniques

Technique A, B; different index (e.g., AUC) values: ◮ A → (x1

a , x2 a , . . . ) → random variable Xa

◮ B → (x1

b, x2 b, . . . ) → random variable Xb

Do Xa, Xb follow different distributions? ◮ yes: A and B are different (concerning the AUC) ◮ no: difference in µa, µb might be due to randomness → A, B are not significantly different

slide-150
SLIDE 150

114/122

Statistical significance in a nutshell

Just the way of thinking:

  • 1. State a set of assumptions (the null hypothesis H0), e.g.:

◮ Xa, Xb are normally distributed and independent ◮ ¯ xa = ¯ xb (or ¯ xa ≥ ¯ xb) ◮ any other assumption in the statistical model

slide-151
SLIDE 151

114/122

Statistical significance in a nutshell

Just the way of thinking:

  • 1. State a set of assumptions (the null hypothesis H0), e.g.:

◮ Xa, Xb are normally distributed and independent ◮ ¯ xa = ¯ xb (or ¯ xa ≥ ¯ xb) ◮ any other assumption in the statistical model

  • 2. Perform a statistical test, appropriate choice depending on

many factors, e.g.:

◮ Wilcoxon test (many versions) ◮ Friedman (many versions) ◮ . . .

slide-152
SLIDE 152

114/122

Statistical significance in a nutshell

Just the way of thinking:

  • 1. State a set of assumptions (the null hypothesis H0), e.g.:

◮ Xa, Xb are normally distributed and independent ◮ ¯ xa = ¯ xb (or ¯ xa ≥ ¯ xb) ◮ any other assumption in the statistical model

  • 2. Perform a statistical test, appropriate choice depending on

many factors, e.g.:

◮ Wilcoxon test (many versions) ◮ Friedman (many versions) ◮ . . .

  • 3. . . . which outputs a p-value ∈ [0, 1]

◮ 0 is “good”, 1 is “bad”

slide-153
SLIDE 153

115/122

p-value: meaning

0 is “good”, 1 is “bad” The p-value is the degree to which the data conform to the pattern predicted by the null hypothesis ◮ p-value = P(x1

a , x2 a , . . . , x1 b, x2 b, . . . |H0)

If p-value is low: ◮ we’ve been very (un)lucky in having observed x1

a , x2 a , . . . , x1 b, x2 b, . . .

◮ “maybe” because H0 is not true

slide-154
SLIDE 154

115/122

p-value: meaning

0 is “good”, 1 is “bad” The p-value is the degree to which the data conform to the pattern predicted by the null hypothesis ◮ p-value = P(x1

a , x2 a , . . . , x1 b, x2 b, . . . |H0)

If p-value is low: ◮ we’ve been very (un)lucky in having observed x1

a , x2 a , . . . , x1 b, x2 b, . . .

◮ “maybe” because H0 is not true

◮ Warning! Any part of H0, not necessarily the ¯ xa = ¯ xb part!

slide-155
SLIDE 155

116/122

Statistical significance

Things are much more complex than this. . . Some interesting papers: ◮ Joaqu´

ın Derrac et al. “A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms”. In: Swarm and Evolutionary Computation 1.1 (2011), pp. 3–18

◮ C´

edric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. “How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments”. In: arXiv preprint arXiv:1806.08295 (2018)

◮ Sander Greenland et al. “Statistical tests, P values, confidence intervals,

and power: a guide to misinterpretations”. In: European journal of epidemiology 31.4 (2016), pp. 337–350

slide-156
SLIDE 156

117/122

Subsection 4 Boosting

slide-157
SLIDE 157

118/122

Many views and aggregation

In bagging/RF (regression): ◮ many views are different samples ◮ aggregation is average Alternative: ◮ many views are subsequent residuals ◮ aggregation is the sum

slide-158
SLIDE 158

119/122

Boosting

When learning:

  • 1. Current data is learning data
  • 2. Repeat B times

2.1 learn a tree on current data 2.2 current data becomes residuals of learned tree (y − ˆ y)

When predicting:

  • 1. Repeat B times

1.1 get a prediction from ith learned tree

  • 2. sum prediction

Q: implementation differences w.r.t. RF?

slide-159
SLIDE 159

120/122

Boosting (regression)

function BoostTrees(X, y) t(X) ← 0 for i ∈ {1, 2, . . . , B} do ti ← BuildRegressionTree(X, y, d) t(X) ← t(X) + λti(X) y ← y − λti(X) end for return t end function ◮ Each learned tree should be simple (maximum splits d) ◮ λ slows down learning Trickier with classification.

slide-160
SLIDE 160

121/122

Boosting parameters

◮ λ usually set to 0.01 or 0.001 ◮ λ and B interact: for small λ, B should be large ◮ large B can lead to overfitting (unlike bagging/RF, Q: why) Find a good value for B with cross-validation (Both boosting and bagging general techniques)

slide-161
SLIDE 161

122/122

Bagging/RF/boosting in summary

Tree Bagging RF Boosting interpretability

  • numeric/categorical
  • accuracy
  • test error estimate
  • variable importance
  • confidence/tunability
  • fast to learn

  • (almost) non-parametric
  • ∗: Q: how faster? when? does it matter?