Classification - Department Biosysteme, Karsten Borgwardt - Data Mining Course Basel, Fall Semester 2016

SLIDE 1

Classification

SLIDE 2

What is Classification?

Problem: Given an object x, predict its class label y. Which class of objects does it belong to?

Examples:
- Computer vision: Is this object a chair?
- Credit cards: Is this customer to be trusted?
- Marketing: Will this customer buy/like our product?
- Function prediction: Is this protein an enzyme?
- Gene finding: Does this sequence contain a splice site?
- Personalized medicine: Will this patient respond to drug treatment?

SLIDE 3

What is Classification?

Setting

Classification is usually performed in a supervised setting: we are given a training dataset. A training dataset is a dataset of pairs $\{(x_i, y_i)\}_{i=1}^{n}$, that is, objects and their known class labels. The test set is a dataset of test points $\{x'_i\}_{i=1}^{d}$ with unknown class labels. The task is to predict the class label $y'_i$ of $x'_i$ via a function $f$.

Role of y
- if $y \in \{0, 1\}$: we are dealing with a binary classification problem
- if $y \in \{1, \ldots, n\}$ ($3 \le n \in \mathbb{N}$): a multiclass classification problem
- if $y \in \mathbb{R}$: a regression problem

SLIDE 4

Evaluating Classifiers

The Contingency Table

In a binary classification problem, one can represent the accuracy of the predictions in a contingency table:

             y = 1    y = -1
f(x) = 1      TP       FP
f(x) = -1     FN       TN

Here, T refers to True, F to False, P to Positive (prediction) and N to Negative (prediction).

SLIDE 5

Evaluating Classifiers

Accuracy

The accuracy of a classifier is defined as

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Accuracy measures which percentage of the predictions is correct. It is the most common criterion for reporting the performance of a classifier. Still, it has a fundamental shortcoming: if the classes are unbalanced, the accuracy on the entire dataset may look high while being low on the smaller class.

SLIDE 6

Evaluating Classifiers

Precision-Recall

If the positive class is much smaller than the negative class, one should rather use precision and recall to evaluate the classifier.

The precision of a classifier is defined as $\frac{TP}{TP + FP}$.

The recall of a classifier is defined as $\frac{TP}{TP + FN}$.

SLIDE 7

Evaluating Classifiers

Trade-off between Precision and Recall

There is a trade-off between precision and recall:
- By predicting all points to be positive ($f(x) = 1$ for all $x$), one can guarantee that the recall is 1. However, the precision will then be poor.
- By only predicting points to be members of the positive class when one is highly confident about the prediction, one will increase precision but lower recall.

One workaround is to report the precision-recall break-even point, that is, the value at which precision and recall are identical.

SLIDE 8

Evaluating Classifiers

Dependence on Classification Threshold

TP, TN, FP, FN depend on $f(x)$ for $x \in D$. The most common definition of $f(x)$ is

$f(x) = \begin{cases} 1 & \text{if } s(x) \ge \theta \\ -1 & \text{if } s(x) < \theta \end{cases}$

where $s: D \to \mathbb{R}$ is a scoring function and $\theta \in \mathbb{R}$ is a threshold. As the predictions based on $f$ vary with $\theta$, so do TP, TN, FP, FN, and all evaluation criteria based on them. It is therefore important to report results as a function of $\theta$ whenever possible, not just for one fixed choice of $\theta$.

SLIDE 9

Evaluating Classifiers

How to report results as a function of θ

An efficient strategy to compute all solutions as a function of θ is to rank all t test points x by their score s(x). This ranking is a vector r of length t, whose ith element is r(i). We then perform the following steps:

For i = 1 to t − 1:
- Define the positive predictions P to be the set {r(1), . . . , r(i)}.
- Define the negative predictions N to be the set {r(i + 1), . . . , r(t)}.
- Compute the evaluation criteria e(i) of interest for P and N.

Return vector e.

The common strategy is to compute two evaluation criteria e1 and e2 and then to visualize the result in a 2-D plot, as in the sketch below.
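The sweep over all thresholds can be done in one pass once the points are sorted by score. A minimal Python sketch (not part of the slides; function and variable names are our own) that computes precision, recall and the false positive rate at every cut-off i:

```python
import numpy as np

def evaluation_curves(scores, labels):
    """At cut-off i, the top-i ranked points are the positive predictions P
    and the remaining points are the negative predictions N."""
    order = np.argsort(-np.asarray(scores))   # rank points by descending score
    y = np.asarray(labels)[order]             # labels in ranked order (+1 / -1)
    tp = np.cumsum(y == 1)                    # true positives among the top i
    fp = np.cumsum(y == -1)                   # false positives among the top i
    precision = tp / (tp + fp)
    recall = tp / (y == 1).sum()              # = true positive rate
    fpr = fp / (y == -1).sum()                # = false positive rate
    return precision, recall, fpr
```

Plotting (recall, precision) gives the precision-recall curve discussed below; plotting (fpr, recall) gives the ROC curve.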

SLIDE 10

Evaluating Classifiers

ROC curves

One popular such 2-D plot is the Receiver Operating Characteristic (ROC) curve, which represents the true positive rate versus the false positive rate.

The true positive rate (or sensitivity) is identical to the recall: $\frac{TP}{TP + FN}$. That is, the fraction of positive points that were correctly classified.

The false positive rate (or 1 − specificity) is $\frac{FP}{FP + TN}$. That is, the fraction of negative points that were misclassified.

SLIDE 11

Evaluating Classifiers

ROC curves

Each ROC curve starts at (0,0): if no point is predicted to be positive, then there are no true positives and no false positives. Each ROC curve ends at (1,1): if all points are predicted to be positive, then there are no true negatives and no false negatives.

SLIDE 12

Evaluating Classifiers

ROC curves

The ROC curve of a perfect classifier runs through the point (0,1): it correctly classifies all negative points (FP = 0) and correctly classifies all positive points (FN = 0). While the ROC curve does not depend on an arbitrarily chosen threshold θ, it seems difficult to summarize the performance of a classifier in terms of an ROC curve.

SLIDE 13

Evaluating Classifiers

ROC curves

The solution to this problem is the Area under the Receiver Operating Characteristic curve (AUC), a number between 0 and 1. The AUC can be interpreted as follows: when we present one negative and one positive test point to the classifier, the AUC is the probability with which the classifier will assign a larger score to the positive than to the negative point. The larger the AUC, the better the classifier. The AUC of a perfect classifier can be shown to be 1. The AUC of a random classifier (guessing the prediction) is 0.5. The AUC of a 'stupid' classifier (misclassifying all points) is 0.
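The probabilistic interpretation translates directly into a computation: average, over all positive-negative pairs, how often the positive point scores higher. A small sketch under that interpretation (our own code; ties are counted as one half):

```python
import numpy as np

def auc_pairwise(scores, labels):
    """AUC = P(score of a random positive > score of a random negative)."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels)
    pos, neg = s[y == 1], s[y == -1]
    greater = (pos[:, None] > neg[None, :]).sum()   # pairs ranked correctly
    ties = (pos[:, None] == neg[None, :]).sum()     # tied pairs count 1/2
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```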

SLIDE 14

Evaluating Classifiers

Summarizing PR values

2-D plot of (recall, precision) values for different values of θ:
- The curve starts at (0,1): full precision, no recall.
- The precision-recall break-even point is the point at which the precision-recall curve intersects the bisecting line.
- The area under the precision-recall curve (AUPRC) is another statistic to quantify the performance of a classifier. It is 1 for a perfect classifier, that is, one that reaches 100% precision and 100% recall at the same time.

SLIDE 15

Evaluating Classifiers

Example: The Good and the Bad

We are given 102 test points, 2 positive and 100 negative. Our prediction ranks ten negative points first, then the two positive points, then the remaining 90 negative points.

[Figure: precision-recall curve (recall vs. precision) and ROC curve (false positive rate vs. true positive rate) for this ranking]
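The example can be reproduced numerically. The following sketch (our own construction of the stated ranking) shows why the ROC curve looks good while precision is poor:

```python
import numpy as np

# Ranking from the slide: 10 negatives first, then the 2 positives, then 90 negatives.
labels = np.array([-1] * 10 + [1] * 2 + [-1] * 90)
tp = np.cumsum(labels == 1)
fp = np.cumsum(labels == -1)
precision = tp / (tp + fp)
recall = tp / (labels == 1).sum()
fpr = fp / (labels == -1).sum()

# After the top 12 predictions: recall = 1.0 and FPR = 10/100 = 0.1 (ROC looks fine),
# but precision = 2/12 ≈ 0.17 (the PR curve reveals the weakness on the small class).
print(recall[11], fpr[11], precision[11])
```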

SLIDE 16

Evaluating Classifiers

What to do if we only have one dataset for training and test?

If only one dataset is available for training and testing, it is essential not to train and test on the same instances, but rather to split the available data into training and test data.
- Splitting the dataset into k subsets, using one of them for testing and the rest for training, and rotating through all k choices of the test subset is referred to as k-fold cross-validation (see the sketch below).
- If k = n, cross-validation is referred to as leave-one-out cross-validation.
- Randomly sampling subsets of the data for training and testing and averaging over the results is called bootstrapping.
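A minimal sketch of the k-fold split itself (our own helper, reused below for parameter optimization):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1, split them into k folds, and yield
    (train, test) index arrays; each fold serves as the test set once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```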

SLIDE 17

Evaluating Classifiers

Illustration of cross-validation: 10-fold cross-validation

[Figure: 10-fold cross-validation; the data is split into folds 1-10, and in step i, fold i is used for testing while the remaining nine folds are used for training]

SLIDE 18

Evaluating Classifiers

How to optimize the parameters of a classifier?

Most classifiers use parameters c that have to be set (more on these later). It is wrong to optimize these parameters by trying out different values and picking those that perform best on the test set: such parameters are overfit to this particular test dataset and may not generalize to other datasets. Instead, one needs an internal cross-validation on the training data to optimize c, as in the sketch below.
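One common realization of this internal cross-validation is the nested scheme illustrated on the following slides. A sketch (our own code, reusing the k_fold_indices helper above; fit and score stand for any classifier's training and evaluation routines):

```python
import numpy as np

def nested_cross_validation(X, y, candidates, fit, score, k=10):
    """Inner loop selects the parameter c on the outer training folds only;
    the outer loop estimates generalization with the selected c."""
    outer_scores = []
    for train, test in k_fold_indices(len(y), k):
        best_c, best_mean = None, -np.inf
        for c in candidates:                      # repeat the inner loop per value of c
            inner = [score(fit(X[train[tr]], y[train[tr]], c),
                           X[train[va]], y[train[va]])
                     for tr, va in k_fold_indices(len(train), k - 1)]
            if np.mean(inner) > best_mean:
                best_c, best_mean = c, np.mean(inner)
        model = fit(X[train], y[train], best_c)   # refit with the chosen c
        outer_scores.append(score(model, X[test], y[test]))
    return np.mean(outer_scores)
```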

SLIDE 19

Evaluating Classifiers

Illustration of cross-validation with parameter optimization: Step 1

[Figure: nested cross-validation, outer loop step 1; one fold is held out as the outer test fold, and an inner cross-validation (inner loop steps 1 to 9) runs over the remaining nine folds]

Repeat the inner loop for different values of c.

SLIDE 20

Evaluating Classifiers

Illustration of cross-validation with parameter optimization: Step 10

[Figure: nested cross-validation, outer loop step 10; a different fold is held out as the outer test fold, and the inner cross-validation again runs over the remaining nine folds]

Repeat the inner loop for different values of c.

SLIDE 21

Choosing Classifiers

Criteria for a good classifier

- Accuracy
- Runtime and scalability
- Interpretability
- Flexibility

SLIDE 22

Nearest Neighbour Classification

SLIDE 23

Nearest Neighbour

The actual classification

Given x, we predict its label y via the nearest point in the training set D:

$x_i = \arg\min_{x' \in D} \|x - x'\|_2 \quad \Rightarrow \quad f(x) = y_i$

The predicted label of x is that of the point closest to it, that is, its nearest neighbour.

SLIDE 24

Nearest Neighbour

k-NN classification

An object is classified to the class most common among its k nearest neighbours (k ∈ N). If k = 1, the object is assigned to the class of its single nearest neighbour.
- k is a hyperparameter that is chosen by the user.
- The larger k, the lower the influence of single noisy points on the classification.
- In binary classification problems, one usually chooses odd k to avoid ties.
- k-NN is an instance of instance-based learning and lazy learning.
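A direct sketch of the voting rule (our own minimal implementation with brute-force distances):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x as the majority class among its k nearest
    training points under the Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```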

SLIDE 25

Nearest Neighbour

Runtime

Naively, one has to compute the distance to all n points in the training dataset for each test point:
- O(n) to find the nearest neighbour in 1-NN
- O(n + n log n) to find the k nearest neighbours in k-NN (compute all n distances, then sort them)

SLIDE 26

Nearest Neighbour

Challenges in Nearest Neighbour Classification

- Speed of prediction
- Choice of hyperparameter k
- Choice of distance function and weights of dimensions

SLIDE 27

Nearest Neighbour

How to speed Nearest Neighbour up

Exploit the triangle inequality:

$d(x_1, x_2) + d(x_2, x_3) \ge d(x_1, x_3)$

This holds for any metric d.

SLIDE 28

Nearest Neighbour

How to speed Nearest Neighbour up

Rewrite the triangle inequality:

$d(x_1, x_2) \ge d(x_1, x_3) - d(x_2, x_3)$

That means that if you know d(x1, x3) and d(x2, x3), you can provide a lower bound on d(x1, x2). If you already know a point that is closer to x1 than d(x1, x3) − d(x2, x3), then you can avoid computing d(x1, x2).
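One common way to use this bound is pivot-based pruning: precompute every training point's distance to a fixed pivot point, and skip candidates whose lower bound already exceeds the best distance found so far. A sketch under that assumption (the pivot construction and all names are our own):

```python
import numpy as np

def nn_with_pruning(X_train, x, pivot):
    """1-NN search that skips points whose triangle-inequality lower bound
    |d(x, pivot) - d(x_i, pivot)| cannot beat the current best distance."""
    d_to_pivot = np.linalg.norm(X_train - pivot, axis=1)  # precomputable once
    d_x_pivot = np.linalg.norm(x - pivot)
    best_i, best_d = -1, np.inf
    for i, xi in enumerate(X_train):
        if abs(d_x_pivot - d_to_pivot[i]) >= best_d:
            continue                                      # lower bound already too large
        d = np.linalg.norm(x - xi)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d
```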

SLIDE 29

Nearest Neighbour

How to set the hyperparameter k: Bootstrapping and cross-validation

Cross-validation on the training data, or bootstrapping on the training data:
1. Split the training dataset into two subsets T1 and T2.
2. For different choices of k, compute the accuracy on T2, using T1 as training set.
3. Repeat the above two steps several times.
4. Pick the k with the highest average accuracy across all iterations.

SLIDE 30

Nearest Neighbour

How to weight the dimensions

The Mahalanobis distance $d_M$ takes the d × d covariance structure Σ between features into account:

$d_M(x, x') = \sqrt{(x - x')^\top \Sigma^{-1} (x - x')}$

where the covariance matrix is defined as

$\Sigma_{ij} = \mathrm{cov}(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right]$

Here $X_i$ is a random variable representing feature i and $\mu_i$ is its mean.
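A short sketch of the weighted distance (our own code; Σ is estimated from the training data):

```python
import numpy as np

def mahalanobis(x, x_prime, Sigma):
    """Mahalanobis distance sqrt((x - x')^T Sigma^{-1} (x - x'))."""
    diff = x - x_prime
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

# Covariance estimated from training data with features in columns:
# Sigma = np.cov(X_train, rowvar=False)
```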

SLIDE 31

Naive Bayes Classifier

SLIDE 32

Naive Bayes

Bayes’ Rule (named after Thomas Bayes, 1701-1761)

$P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)}$

Here, X is the random variable that describes the features of an object, and Y the random variable that describes its class label. x is an instantiation of X and y an instantiation of Y.

SLIDE 33

Naive Bayes

Naive Bayes Classification

Classify x into one of m classes y1, . . . , ym:

$\arg\max_{y_i} P(Y = y_i \mid X = x) = \arg\max_{y_i} \frac{P(X = x \mid Y = y_i)\, P(Y = y_i)}{P(X = x)}$

where
- $P(Y = y_i)$ is the prior probability,
- $P(X = x)$ is the evidence,
- $P(X = x \mid Y = y_i)$ is the likelihood,
- $P(Y = y_i \mid X = x)$ is the posterior probability.

SLIDE 34

Naive Bayes

Simplifications

P(X = x) is the same for all classes, so we can ignore this term. That means:

$P(Y = y_i \mid X = x) \propto P(Y = y_i)\, P(X = x \mid Y = y_i)$

If x is multidimensional, that is, if x contains d features x = (x1, . . . , xd), we further assume that the features are conditionally independent given the class label:

$P(X = x \mid Y = y_i) = P(X = (x_1, \ldots, x_d) \mid Y = y_i) = \prod_{j=1}^{d} P(X_j = x_j \mid Y = y_i)$

SLIDE 35

Naive Bayes

The actual classification

The actual classification is performed by computing

$\arg\max_{y_i} P(Y = y_i \mid X = x) \propto P(Y = y_i) \prod_{j=1}^{d} P(X_j = x_j \mid Y = y_i)$

SLIDE 36

Naive Bayes

How to train the model

One has to estimate P(Y). Typically, one assumes that all classes have the same probability, or one infers the class probabilities from the class frequencies in the training dataset.

One has to estimate P(X|Y). Popular choices for x = (x1, . . . , xd) are
- for binary data: $P(X = x \mid Y = y_i) = \prod_{j=1}^{d} \mathrm{Ber}(x_j \mid \theta_{j,i})$, where $\theta_{j,i}$ is the probability that feature $x_j$ has value 1 in class $y_i$,
- for continuous data: $P(X = x \mid Y = y_i) = \prod_{j=1}^{d} N(x_j \mid \mu_{j,i}, \sigma^2_{j,i})$, where $\mu_{j,i}$ is the mean of feature $x_j$ in objects of class $i$, and $\sigma^2_{j,i}$ is the variance.
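For the continuous case, the training recipe above fits into a few lines. A minimal Gaussian Naive Bayes sketch (our own implementation; class priors from frequencies, one univariate Gaussian per feature and class, no smoothing of zero variances):

```python
import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])   # mu_{j,i}
        self.var = np.array([X[y == c].var(axis=0) for c in self.classes])   # sigma^2_{j,i}
        return self

    def predict(self, X):
        # log posterior ∝ log prior + sum_j log N(x_j | mu_{j,i}, sigma^2_{j,i})
        log_lik = -0.5 * (np.log(2 * np.pi * self.var)[None, :, :]
                          + (X[:, None, :] - self.mu[None, :, :]) ** 2
                          / self.var[None, :, :]).sum(axis=2)
        return self.classes[np.argmax(np.log(self.prior) + log_lik, axis=1)]
```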

SLIDE 37

Naive Bayes

Fundamental Probability Distributions

The Bernoulli distribution:

$\mathrm{Ber}(x \mid \theta) = \begin{cases} \theta & \text{if } x = 1 \\ 1 - \theta & \text{if } x = 0 \end{cases}$

The Gaussian (normal) distribution:

$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$

The multivariate Gaussian distribution:

$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}$
SLIDE 38

Naive Bayes

$P(Y = y_i \mid X = x) \propto P(Y = y_i) \prod_{j=1}^{d} P(X_j = x_j \mid Y = y_i)$

[Figure: a set of aligned DNA sequences (strings over A, C, G, T), as used for estimating per-position nucleotide probabilities]

SLIDE 39

Naive Bayes

Advantages

- Speed: The effort of prediction for one test point is O(md), as we have to compute the class posterior for all m classes.
- Ability to deal with missing data: Missing features xj can simply be dropped when evaluating the class posteriors, by dropping P(xj|yi).
- Ability to combine discrete and continuous features: Use discrete or continuous probability distributions for each attribute.
- Practical performance: Despite the unrealistic independence assumption on the features, Naive Bayes often provides good results in practice.

SLIDE 41

Linear Discriminant Analysis

based on: Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press 2014, Chapter 24.3

SLIDE 42

Linear Discriminant Analysis

Underlying assumptions of LDA

The goal is to predict a label y ∈ {0, 1} from a vector x of d features, that is, x = (x1, . . . , xd).

We assume that $P(Y = 1) = P(Y = 0) = \frac{1}{2}$.

We assume that the conditional probability of X given Y is a multivariate Gaussian distribution. The covariance matrix Σ is the same for both classes, Y = 0 and Y = 1, but the classes differ in their means, µ0 and µ1.

SLIDE 43

Linear Discriminant Analysis

Log-likelihood ratio

The density is then:

$N(x \mid \mu_y, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_y)^\top \Sigma^{-1} (x-\mu_y)}$

We would like to find:

$\arg\max_{y \in \{0,1\}} P(Y = y)\, P(X = x \mid Y = y)$

We predict f(x) = 1 if the log-likelihood ratio exceeds zero:

$\log\left(\frac{P(Y = 1)\, P(X = x \mid Y = 1)}{P(Y = 0)\, P(X = x \mid Y = 0)}\right) > 0$

SLIDE 44

Linear Discriminant Analysis

Link to linear regression

The log likelihood ratio is equivalent to 1 2((x − µ0)⊤Σ−1(x − µ0)) − 1 2((x − µ1)⊤Σ−1(x − µ1)). This can be rewritten as w, x + b = Σd

i=1wixi + b,

where w = (µ1 − µ0)⊤Σ−1 and b = 1 2(µ⊤

0 Σ−1µ0 − µ⊤ 1 Σ−1µ1).

SLIDE 45

Linear Discriminant Analysis

Link to linear regression

This shows that, under the aforementioned assumptions, the Bayes-optimal classifier is a linear classifier of the same form as in linear regression. To perform LDA, all we have to do is estimate µ0, µ1 and Σ from the data and use the equations above to determine w and b. The classification is then performed via

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b)$

This mathematical elegance goes hand in hand with the difficulty of computing $\Sigma^{-1}$ in very high-dimensional spaces.
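The estimation step is short enough to sketch in full (our own code, under the slide's assumptions of a shared covariance matrix and equal class priors):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate mu0, mu1 and the pooled Sigma, then form w and b as above."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sigma = np.cov(np.vstack([X0 - mu0, X1 - mu1]), rowvar=False)  # shared covariance
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)
    return w, b

def lda_predict(X, w, b):
    return (X @ w + b > 0).astype(int)   # f(x) = 1 iff <w, x> + b > 0
```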

SLIDE 46

Logistic Regression

SLIDE 47

Logistic Regression

Concept (David Cox, 1958)

Logistic Regression is a classification model for binary output variables, y ∈ {−1, 1}. We define an auxiliary target variable z, which is expressed as a linear function of the input variable x, that is,

$z = \sum_{i=1}^{d} w_i x_i + w_0$

The logistic function f now maps z to the interval [0, 1]:

$f(z) = \frac{\exp(z)}{\exp(z) + 1} = \frac{1}{1 + \exp(-z)}$

SLIDE 48

Logistic function

[Figure: the logistic function f(z) plotted for z from −15 to 15; f rises from 0 to 1 with f(0) = 0.5]

SLIDE 49

Logistic Regression

Concept (David Cox, 1958)

The logistic function can now be written as:

$f_w(x) = f(\langle w, x \rangle) = \frac{1}{1 + e^{-(w_0 + \sum_{i=1}^{d} w_i x_i)}}$

$f_w(x)$ is the probability that x is in class 1.

SLIDE 50

Logistic Regression

Concept (David Cox, 1958)

One can now define the inverse of the logistic function, $g = f^{-1}$, the logit or log-odds function:

$g(f_w(x)) = \ln\left(\frac{f_w(x)}{1 - f_w(x)}\right) = w_0 + \sum_{i=1}^{d} w_i x_i$

$g \circ f_w$ clarifies the link to linear regression.

SLIDE 51

Logistic Regression

Training the model

While the link to linear regression is now clear, how to train the model remains unclear. We need an objective that is optimized in the training step to learn the weights w.

SLIDE 52

Logistic Regression

$P(y = 1 \mid X = x) = \frac{1}{1 + \exp(-\langle w, x \rangle)}$

[Figure: P(Y = 1 | x) plotted against ⟨w, x⟩, a sigmoid rising from 0 to 1]

SLIDE 53

Logistic Regression

$P(y = -1 \mid X = x) = \frac{1}{1 + \exp(\langle w, x \rangle)}$

[Figure: P(Y = −1 | x) plotted against ⟨w, x⟩, a sigmoid falling from 1 to 0]

SLIDE 54

Logistic Regression

Training the model

The log probability of each point is therefore:

$\log\left(\frac{1}{1 + \exp(-y\langle w, x \rangle)}\right) = \log\left((1 + \exp(-y\langle w, x \rangle))^{-1}\right) = -\log(1 + \exp(-y\langle w, x \rangle))$

To train the logistic regression model, we minimize the total negative log probability over all points, the logistic loss function, which is convex in w:

$\arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i \langle w, x_i \rangle))$

Several approaches for optimizing this objective exist, e.g. Maximum Likelihood Estimation, and software packages that implement them, e.g. liblinear.
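Since the objective is convex, plain gradient descent already works as a simple optimization approach. A sketch (our own code; liblinear and similar packages use far more refined solvers):

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Minimize (1/n) sum_i log(1 + exp(-y_i <w, x_i>)) by gradient descent.
    Labels must be in {-1, +1}; append a constant-1 column to X for w0."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ w)
        # gradient of the loss: mean over i of -y_i * x_i / (1 + exp(margin_i))
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w
```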

SLIDE 55

Logistic Regression

Discussion (Kevin Murphy, MIT Press 2012)

Logistic regression models are easy to fit: the algorithms are easy to implement and fast. There are methods to train logistic regression models that take time linear in the number of non-zero features in the data, which is the minimal possible time. Logistic regression models are easy to interpret, as their output represents the log-odds ratio between the positive and the negative class. Several important extensions (multiclass classification, nonlinear decision boundaries, other types of output data) are possible.

SLIDE 56

Decision Trees

SLIDE 57

Decision Tree

Key idea

Recursively split the data space into regions that contain a single class only

[Figure: example decision tree; the root tests Age (less than 12 vs. 12+), further nodes test Pregnancy and Heart problems, and leaves assign Treatment, No Treatment, or treatment with drug B/C]

SLIDE 58

Decision Tree

Concept

A decision tree is a flowchart-like tree structure with
- a root: the uppermost node,
- internal nodes: these represent tests on an attribute,
- branches: these represent outcomes of a test,
- leaf nodes: these hold a class label.

[Figure: the same example decision tree as above]

SLIDE 59

Decision Tree

Classification

Given a test point x:
1. Perform the test on the attributes of x at the root.
2. Follow the branch that corresponds to the outcome of this test.
3. Repeat this procedure until you reach a leaf node.
4. Predict the label of x to be the label of that leaf node.

[Figure: the same example decision tree as above]

SLIDE 60

Decision Tree

Popularity

Decision trees are popular because they
- require no domain knowledge,
- are easy to interpret,
- are fast to construct and to use for prediction.

But how to construct a decision tree?

[Figure: the same example decision tree as above]

SLIDE 61

Decision Tree

Construction of Decision Trees

Training Procedure

1. Start with all training examples.
2. Select the attribute and threshold that give the "best" split.
3. Create child nodes based on the split.
4. Repeat steps 2 and 3 on each child using its data, until a stopping criterion is reached:
   - all examples are in the same class,
   - the number of examples in a node is too small,
   - the tree gets too large.

Central problem: How to choose the "best" attribute and threshold? Via a cost function.

SLIDE 62

Decision Tree

function DecisionTree(D)
    if stopping criterion fulfilled:
        predicted class for points in D is the majority class in D
    otherwise:
        (A, θ) = arg max over (A, θ) of [cost(D) − cost(D split by A, θ)]
        D_{A,<θ} = {x ∈ D | A < θ}
        D_{A,≥θ} = {x ∈ D | A ≥ θ}
        create two child nodes of D, containing D_{A,<θ} and D_{A,≥θ}
        DecisionTree(D_{A,<θ})
        DecisionTree(D_{A,≥θ})
end
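A runnable version of this recursion (our own sketch; class labels are assumed to be non-negative integers, and the cost is the weighted entropy of the children, anticipating the impurity measures defined on the next slides):

```python
import numpy as np
from collections import Counter

def entropy(y):
    p = np.bincount(y) / len(y)          # requires non-negative integer labels
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def decision_tree(X, y, min_samples=5):
    """Return a leaf label, or (attribute, threshold, left_subtree, right_subtree)."""
    if len(set(y.tolist())) == 1 or len(y) < min_samples:   # stopping criterion
        return Counter(y.tolist()).most_common(1)[0][0]     # majority class
    best = None
    for a in range(X.shape[1]):                             # search attribute A and theta
        for theta in np.unique(X[:, a])[1:]:
            left = X[:, a] < theta
            cost = (left.mean() * entropy(y[left])
                    + (~left).mean() * entropy(y[~left]))
            if best is None or cost < best[0]:
                best = (cost, a, theta)
    if best is None:                                        # no valid split remained
        return Counter(y.tolist()).most_common(1)[0][0]
    _, a, theta = best
    left = X[:, a] < theta
    return (a, theta, decision_tree(X[left], y[left], min_samples),
                      decision_tree(X[~left], y[~left], min_samples))

def tree_predict(tree, x):
    while isinstance(tree, tuple):
        a, theta, l, r = tree
        tree = l if x[a] < theta else r
    return tree
```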

SLIDE 63

Decision Tree

Information gain (Quinlan, 1986)

ID3 (short for Iterative Dichotomizer 3) uses the information gain as its attribute selection measure. The information content is defined as:

$\mathrm{Info}(D) = -\sum_{i=1}^{m} p(y = y_i \mid x \in D) \log_2 p(y = y_i \mid x \in D)$

where $p(y = y_i \mid x \in D)$ is the probability that an arbitrary tuple in D belongs to class $y_i$, estimated by

$\frac{|\{(x_j, y_j) \mid x_j \in D \wedge y_j = y_i\}|}{|D|}$

This is also known as the Shannon entropy of D.
SLIDE 64

Decision Tree

Information gain

Assume that attribute A was used to split D into v partitions or subsets, {D1, D2, . . . , Dv}, where Dj contains those tuples in D that have outcome aj of A. Ideally, the Dj would provide a perfect classification, but they seldom do. How much more information do we need to arrive at an exact classification? This is quantified by

$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \mathrm{Info}(D_j) \quad (1)$

SLIDE 65

Decision Tree

Information gain

The information gain is the loss of entropy (increase in information) that is caused by splitting with respect to attribute A:

$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \quad (2)$

We pick the A that maximises this gain.

SLIDE 66

Decision Tree

Gain ratio (Quinlan, 1993)

The information gain is biased towards attributes with a large number of values. For example, an ID attribute maximises the information gain! Hence C4.5 uses an extension of information gain: the gain ratio. The gain ratio is based on the split information

$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right) \quad (3)$

and is defined as

$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)} \quad (4)$

SLIDE 67

Decision Tree

Gain ratio

- The attribute with maximum gain ratio is selected as the splitting attribute.
- The ratio becomes unstable as the split information approaches zero.
- A constraint is added to ensure that the information gain of the test selected is at least as great as the average gain over all tests examined.

SLIDE 68

Decision Tree

Gini index (Breiman, Friedman, Olshen and Stone, 1984)

Attribute selection measure in the CART system. The Gini index measures class impurity as

$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p(y = y_i)^2$

If we split via attribute A into partitions {D1, D2, . . . , Dv}, the Gini index of this partitioning is defined as

$\mathrm{Gini}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \mathrm{Gini}(D_j) \quad (5)$

and the reduction in impurity by a split on A is

$\Delta\mathrm{Gini}(D) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D) \quad (6)$
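The same pattern as for the information gain, as a sketch (our own code; labels as non-negative integers, attribute A discrete):

```python
import numpy as np

def gini(y):
    """Gini(D) = 1 - sum_i p(y = y_i)^2."""
    p = np.bincount(y) / len(y)
    return 1.0 - (p ** 2).sum()

def delta_gini(y, a_values):
    """Delta Gini(D) = Gini(D) - sum_j |D_j|/|D| * Gini(D_j) for a split by A."""
    y, a_values = np.asarray(y), np.asarray(a_values)
    gini_a = sum((a_values == a).mean() * gini(y[a_values == a])
                 for a in np.unique(a_values))
    return gini(y) - gini_a
```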

SLIDE 69

Decision Tree - Advantages I

- Simple to understand and to interpret. Trees can be visualised.
- Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
- Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable.
- Able to handle multi-output problems.
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., an artificial neural network), results may be more difficult to interpret.

SLIDE 70

Decision Tree - Advantages II

- Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
- Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

Source: http://scikit-learn.org/stable/modules/tree.html

SLIDE 71

Decision Tree - Disadvantages I

- Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by

SLIDE 72

Decision Tree - Disadvantages II

  training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

Source: http://scikit-learn.org/stable/modules/tree.html

SLIDE 73

Decision Trees - XOR Problem

[Figure: scatter plot of the XOR problem in two dimensions; diagonally opposite quadrants share a class, so no single axis-aligned split separates the classes]

SLIDE 74

Decision Trees

Random Forests (Breiman, 2001)

To minimize overfitting and the large variability of decision trees, one may use an ensemble of several decision trees. Breiman (2001) introduced Random Forests, a collection of k decision trees.

The idea is to subsample a subset of n′ instances and a subset of d′ features of the training dataset k times. On each of these k samples, a decision tree is constructed. Subsequently, classification is performed by taking a majority vote over the trees. Three key parameters: the number of instances in the subsets n′, the number of features in the subsets d′, and the number of trees k.
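A compact sketch of the procedure (our own code, reusing the decision_tree and tree_predict sketches from the decision-tree slides):

```python
import numpy as np

def random_forest(X, y, k=100, n_sub=None, d_sub=None, seed=0):
    """Train k trees, each on n' bootstrapped instances and d' sampled features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_sub = n_sub or n
    d_sub = d_sub or max(1, int(np.sqrt(d)))
    forest = []
    for _ in range(k):
        rows = rng.choice(n, size=n_sub, replace=True)    # subsample instances
        cols = rng.choice(d, size=d_sub, replace=False)   # subsample features
        forest.append((cols, decision_tree(X[np.ix_(rows, cols)], y[rows])))
    return forest

def forest_predict(forest, x):
    votes = [tree_predict(tree, x[cols]) for cols, tree in forest]
    return max(set(votes), key=votes.count)               # majority vote
```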

SLIDE 75

Support Vector Machines

SLIDE 76

Support Vector Machines

Hyperplane classifiers

Vapnik et al. (1974) defined a family of classifiers for binary classification problems. This family is the class of hyperplanes in some dot product space H,

$\langle w, x \rangle + b = 0 \quad (7)$

where $w \in \mathcal{H}$, $b \in \mathbb{R}$. These correspond to decision functions ('classifiers'):

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b) \quad (8)$

Vapnik et al. proposed a learning algorithm for determining this f from the training dataset.

SLIDE 77

Support Vector Machines

The optimal hyperplane

maximises the margin of separation between any training point and the hyperplane:

$\max_{w \in \mathcal{H},\, b \in \mathbb{R}} \min \{ \|x - x_i\| \mid x \in \mathcal{H},\ \langle w, x \rangle + b = 0,\ i \in \{1, \ldots, n\} \} \quad (9)$

SLIDE 78

Support Vector Machines

Hard-margin SVM

$\min_{w \in \mathcal{H},\, b \in \mathbb{R}} \frac{1}{2}\|w\|^2 \quad (10)$

subject to $y_i(\langle w, x_i \rangle + b) \ge 1$ for all $i \in \{1, \ldots, n\}$.

Why minimise $\frac{1}{2}\|w\|^2$? The size of the margin is $\frac{2}{\|w\|}$: the smaller $\|w\|$, the larger the margin.

Why do we have to obey the constraints $y_i(\langle w, x_i \rangle + b) \ge 1$? They ensure that all training data points of the same class are on the same side of the hyperplane and outside the margin.

SLIDE 79

Support Vector Machines

The Lagrangian

We form the Lagrangian:

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i(\langle x_i, w \rangle + b) - 1 \right) \quad (11)$

The Lagrangian is minimised with respect to the primal variables w and b, and maximised with respect to the dual variables αi.

SLIDE 80

Support Vector Machines

Support Vectors

At optimality,

$\frac{\partial}{\partial b} L(w, b, \alpha) = 0 \quad \text{and} \quad \frac{\partial}{\partial w} L(w, b, \alpha) = 0 \quad (12)$

such that

$\sum_{i=1}^{n} \alpha_i y_i = 0 \quad \text{and} \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i \quad (13)$

Hence the solution vector w, the crucial parameter of the SVM classifier, has an expansion in terms of the training points and their labels. Those training points with αi > 0 are the Support Vectors.

SLIDE 81

Support Vector Machines

The dual problem

Plugging (13) into the Lagrangian (11), we obtain the dual optimization problem that is solved in practice:

$\max_{\alpha \in \mathbb{R}^n} W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad (14)$

subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.

SLIDE 82

Support Vector Machines

The kernel trick

The key insight is that (14) accesses the training data only in terms of inner products $\langle x_i, x_j \rangle$. We can plug in an inner product of our choice from any space H here! This is referred to as a kernel k:

$k(x_i, x_j) = \langle x_i, x_j \rangle_{\mathcal{H}} \quad (15)$

SLIDE 83

Support Vector Machines

Definition of a kernel

We assume that data points x are mapped to a space H via a mapping φ. Then the kernel is defined as an inner product $\langle \cdot, \cdot \rangle : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ between data points in this space H:

$k(x, x') = \langle \phi(x), \phi(x') \rangle$

SLIDE 84

Support Vector Machines

Prediction

The decision function for Support Vector Machine classification then becomes

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i \langle \phi(x), \phi(x_i) \rangle + b \right) = \mathrm{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i k(x, x_i) + b \right) \quad (16, 17)$

The prediction effort is linear in the number of non-zero entries of α, that is, in the worst case O(n). In practice, the number of support vectors is often much smaller than n.

SLIDE 85

Support Vector Machines

[Figure: a soft-margin separating hyperplane with two margin errors, annotated with their slack variables ξi and ξj]

SLIDE 86

Support Vector Machines

Soft-margin SVM: C-SVM (Cortes and Vapnik, 1995)

Points are now allowed to lie inside the margin or in the wrong halfspace (margin errors):

$\min_{w \in \mathcal{H},\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad (18)$

subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \forall i \in \{1, \ldots, n\}, \quad \xi_i \ge 0 \quad (19)$

C ∈ R is a penalty parameter that determines the trade-off between maximizing the margin and minimizing margin errors. ξ is a slack variable, which measures the degree of misclassification of each margin error.

SLIDE 87

Support Vector Machines

Soft-margin SVM: ν-SVM (Schölkopf et al., 2000)

An alternative formulation of soft-margin SVMs uses the parameter ν ∈ (0, 1]:

$\min_{w \in \mathcal{H},\, b, \rho \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \xi_i - \nu\rho \quad (20)$

subject to $y_i(\langle w, x_i \rangle + b) \ge \rho - \xi_i, \quad \forall i \in \{1, \ldots, n\}, \quad \xi_i \ge 0, \quad \rho \ge 0 \quad (21)$

ν can be shown to be a lower bound for the fraction of support vectors and an upper bound for the fraction of margin errors among all training points.

SLIDE 88

Support Vector Machines

Class imbalance: SVM with different error costs (Veropoulos et al., 1999)

Margin errors in different classes are penalized differently (two parameters C⁺ and C⁻ instead of one):

$\min_{w \in \mathcal{H},\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2}\|w\|^2 + C^{+}\sum_{i \mid y_i = 1} \xi_i + C^{-}\sum_{i \mid y_i = -1} \xi_i \quad (22)$

subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \forall i \in \{1, \ldots, n\}, \quad \xi_i \ge 0 \quad (23)$

The idea is that margin errors in the smaller class should be more expensive than in the larger class.

SLIDE 89

Kernels

SLIDE 90

Kernels

Building kernels in practice

In practical applications, one constructs kernels by

1. proposing a kernel function and explicitly defining the corresponding mapping φ, and/or
2. combining known kernels in ways that obey the closure properties of kernels.

Knowing these closure properties and a set of basic kernel functions is of utmost importance when working with kernels in practice.

SLIDE 91

Kernels

Closure properties of kernels

- Assume we are given two kernels k1 and k2. Then k1 + k2 is a kernel as well.
- Assume we are given two kernel functions k1 and k2. Then their product k1k2 is a kernel as well.
- Assume we are given a kernel function k and a positive scalar λ ∈ R⁺. Then λk is a kernel as well.
- Assume that a kernel k is only defined on a set D. Then its zero-extension k0 is also a kernel:

$k_0(x, x') = \begin{cases} k(x, x') & \text{if } x \in D \text{ and } x' \in D \\ 0 & \text{otherwise} \end{cases}$

SLIDE 92

Kernels

Some prominent kernels

linear kernel:

$k(x, x') = \sum_{l=1}^{d} x_l x'_l = x^\top x' \quad (24)$

polynomial kernel:

$k(x, x') = (x^\top x' + c)^d, \quad \text{where } c, d \in \mathbb{R}$

Gaussian Radial Basis Function (RBF) kernel:

$k(x, x') = \exp\left(-\frac{1}{2\sigma^2}\|x - x'\|^2\right) \quad (25)$

where σ ∈ R.
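These kernels, and the closure properties of the previous slide, are each a line or two of code. A sketch (our own implementation):

```python
import numpy as np

def linear_kernel(x, x2):
    return float(x @ x2)

def polynomial_kernel(x, x2, c=1.0, degree=3):
    return (x @ x2 + c) ** degree

def rbf_kernel(x, x2, sigma=1.0):
    return np.exp(-np.sum((x - x2) ** 2) / (2.0 * sigma ** 2))

# Closure properties in action: a positively weighted sum of kernels is a kernel.
def combined_kernel(x, x2):
    return 0.5 * linear_kernel(x, x2) + 2.0 * rbf_kernel(x, x2, sigma=2.0)
```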

SLIDE 93

Kernels

Some useful kernels

the constant 'all-ones' kernel: $k(x, x') = 1$

the delta (Dirac) kernel:

$k(x, x') = \begin{cases} 1 & \text{if } x = x' \\ 0 & \text{otherwise} \end{cases}$

SLIDE 94

Kernels on structured data

R-convolution kernels (Haussler, 1999)

R-convolution kernels are a famous recipe for constructing kernels on structured data. They are based on decomposing objects X and X′ via a relation R into sets of substructures S and S′. Their simplest and most widely used instance kR compares all pairs of these substructures of X and X′:

$k_R(X, X') = \sum_{s \in S,\, s' \in S'} k_{\mathrm{base}}(s, s')$

For instance, the substructures could be the elements of a set, the nodes of a graph or the substrings of a string. $k_{\mathrm{base}}$ is an arbitrary vectorial kernel, very often simply the delta kernel.

SLIDE 95

Kernels on strings

String Kernels

The spectrum kernel counts all pairs of matching substrings in two strings. That is, X and X′ are strings and S and S′ are the sets of all of their substrings:

$k_R(X, X') = \sum_{s \in S,\, s' \in S'} k_{\mathrm{base}}(s, s')$

Here $k_{\mathrm{base}}$ is a delta kernel, that is, for every pair of matching substrings the kernel increases by one. Naively, one has to enumerate all substrings (quadratic effort $O(|X|^2 + |X'|^2)$). Via a special data structure called suffix trees, the kernel can be computed in linear time $O(|X| + |X'|)$ (Vishwanathan and Smola, 2003).
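A naive quadratic-time sketch for a fixed substring length k (our own code; the suffix-tree algorithm achieves the linear bound):

```python
from collections import Counter

def spectrum_kernel(x, x_prime, k=3):
    """Count matching pairs of length-k substrings of the two strings."""
    c1 = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    c2 = Counter(x_prime[i:i + k] for i in range(len(x_prime) - k + 1))
    return sum(c1[s] * c2[s] for s in c1 if s in c2)

print(spectrum_kernel("ACGTACGT", "ACGTT"))   # shared 3-mers: ACG, CGT -> 4
```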

SLIDE 96

Kernels on graphs

Random Walk Kernel

Count matching walks in G and G′: the more matching walks there are, the higher the similarity between G and G′. Matching walks are sequences of nodes and edges with matching node labels.

Elegant Computation (Gärtner et al., 2003)

Compute the direct product graph of G and G′ and count all walks of length k in it. Every walk in the product graph corresponds to one walk in G and one walk in G′:

$k_\times(G, G') = \sum_{i,j=1}^{|V_\times|} \left[ \sum_{k=0}^{\infty} \lambda^k A_\times^k \right]_{ij} = e^\top (I - \lambda A_\times)^{-1} e$

where $A_\times$ is the adjacency matrix of the product graph and e is the all-ones vector.
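A direct (unoptimized) sketch of this geometric random walk kernel via the product graph (our own code; λ must be small enough for the geometric series to converge):

```python
import numpy as np

def random_walk_kernel(A1, labels1, A2, labels2, lam=0.01):
    """Build the direct product graph over label-matching node pairs
    and evaluate e^T (I - lambda * A_x)^{-1} e."""
    pairs = [(u, v) for u in range(len(labels1)) for v in range(len(labels2))
             if labels1[u] == labels2[v]]
    n = len(pairs)
    Ax = np.zeros((n, n))
    for a, (u, v) in enumerate(pairs):
        for b, (u2, v2) in enumerate(pairs):
            Ax[a, b] = A1[u][u2] * A2[v][v2]   # edge iff present in both graphs
    e = np.ones(n)
    return float(e @ np.linalg.inv(np.eye(n) - lam * Ax) @ e)
```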

SLIDE 97

Kernels on graphs

! " # !! "!

X

!! !! !! "!

"" " "!

!! !! !! "! "! !! "! "! #! !! #! "!

""

Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2016 162 / 164

slide-98
SLIDE 98

References I

- D. R. Cox, J Roy Stat Soc B 20, 215 (1958).
- J. R. Quinlan, Mach. Learn. 1, 81 (1986).
- J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993).
- L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees (Wadsworth, 1984).
- V. N. Vapnik, A. Chervonenkis, Theory of Pattern Recognition: Statistical Problems of Learning [Russian] (Moscow: Nauka, 1974).
- C. Cortes, V. Vapnik, Machine Learning 20, 273 (1995).
- K. Veropoulos, C. Campbell, N. Cristianini, Proceedings of the International Joint Conference on AI (1999), pp. 55–60.

SLIDE 99

References II

- D. Haussler, Convolution Kernels on Discrete Structures (1999).
- K. P. Murphy, Machine Learning: A Probabilistic Perspective, Adaptive Computation and Machine Learning series (MIT Press, Cambridge (Mass.), 2012).
- L. Breiman, Machine Learning 45, 5 (2001).
- B. Schölkopf, A. J. Smola, R. C. Williamson, P. L. Bartlett, Neural Comput. 12, 1207 (2000).
- S. V. N. Vishwanathan, A. J. Smola, Advances in Neural Information Processing Systems 15 (NIPS 2002, December 9–14, 2002, Vancouver, British Columbia, Canada), pp. 569–576.
- T. Gärtner, SIGKDD Explorations 5, 49 (2003).
