Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] – PowerPoint PPT Presentation



SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Supervised Learning

SLIDE 2

Overview

Regression Logistic regression K-NN Decision and regression trees


SLIDE 3

The analytics process


SLIDE 4

Recall

Supervised learning

• You have a labelled data set at your disposal
• Correlate features to target
• Common case: predict the future based on patterns observed now (predictive)
• Classification (categorical) versus regression (continuous)

Unsupervised learning

• Describe patterns in data
• Clustering, association rules, sequence rules
• No labelling required
• Common case: descriptive, explanatory

For supervised learning, our data set will contain a label

SLIDE 5

Recall

Most classification use cases use a binary categorical variable:

• Churn prediction: churn yes/no
• Credit scoring: default yes/no
• Fraud detection: suspicious yes/no
• Response modeling: customer buys yes/no
• Predictive maintenance: needs check yes/no

Regression: continuous label
Classification: categorical label

For classification:

• Binary classification (positive/negative outcome)
• Multiclass classification (more than two possible outcomes)
• Ordinal classification (target is ordinal)
• Multilabel classification (multiple outcomes are possible)

For regression:

• Absolute values
• Delta values
• Quantile regression

Single- versus multi-output models are possible as well (definitions in literature and documentation can differ a bit)

SLIDE 6

Defining your target

Recommender system: a form of multiclass? Multilabel?
Survival analysis: instead of yes/no, predict the "time until yes occurs"

Oftentimes, different approaches are possible:

• Regression, quantile regression, mean residuals regression?
• Or: predicting the absolute value or the change?
• Or: convert manually to a number of bins and perform classification?
• Or: reduce the groups to two outcomes?
• Or: sequential binary classification ("classifier chaining")?
• Or: perform segmentation first and build a model per segment?

SLIDE 7

Regression


SLIDE 8

Regression

https://xkcd.com/605/


SLIDE 9

Linear regression

Not much new here…

y = β0 + β1x1 + β2x2 + … + ϵ  with  ϵ ∼ N(0, σ)

Example: Price = 100000 + 100000 × number of bedrooms

• β0: mean response when x = 0 (y-intercept)
• β1: change in mean response when x1 increases by one unit

How to determine the parameters?

argmin_β ∑_{i=1}^n (yi − ŷi)²  : minimize the sum of squared errors (SSE)

With standard error σ = √(SSE/n)

OLS: "Ordinary Least Squares". Why SSE though?
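As a quick illustration of fitting these parameters (my own minimal sketch, not from the slides; scikit-learn on made-up house price data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: number of bedrooms vs. price, with some noise
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 6, size=100).reshape(-1, 1)
price = 100_000 + 100_000 * bedrooms[:, 0] + rng.normal(0, 20_000, size=100)

# OLS: LinearRegression minimizes the sum of squared errors
model = LinearRegression().fit(bedrooms, price)
print(model.intercept_, model.coef_)  # should recover roughly (100000, 100000)
```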

SLIDE 10

Logistic Regression


SLIDE 11

Logistic regression

Classification is solved as well?

Customer   Age   Income   Gender   …   Response
John       30    1200     M            No → 0
Sophie     24    2200     F            No → 0
Sarah      53    1400     F            Yes → 1
David      48    1900     M            No → 0
Seppe      35    800      M            Yes → 1

ŷ = β0 + β1 age + β2 income + β3 gender

• But no guarantee that the output is 0 or 1
• Okay fine, a probability then – but no guarantee that the outcome is between 0 and 1 either
• Target and errors also not normally distributed (assumption of OLS violated)

SLIDE 12

Logistic regression

We use a bounding function to limit the outcome between 0 and 1:

f(z) = 1 / (1 + e^(−z))  (logistic, sigmoid)

Same basic formula, but now with the goal of binary classification:

• Two possible outcomes: either 0 or 1, no or yes – a categorical, binary label, not continuous
• Logistic regression is thus a technique for classification rather than regression
• Though the predictions are still continuous: between [0, 1]

SLIDE 13

Logistic regression

Linear regression with a transformation such that the output is always between 0 and 1, and can thus be interpreted as a probability (e.g. response or churn probability):

P(response = yes | age, income, gender) = 1 − P(response = no | age, income, gender)
= 1 / (1 + e^(−(β0 + β1 age + β2 income + β3 gender)))

Or ("logit" – natural logarithm of the odds):

ln( P(response = yes | age, income, gender) / P(response = no | age, income, gender) ) = β0 + β1 age + β2 income + β3 gender
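A minimal sketch of fitting such a model (my own illustration, not from the slides; scikit-learn on the toy customer data, with a made-up gender encoding):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy customer data: [age, income, gender (M=1, F=0)]
X = np.array([[30, 1200, 1], [24, 2200, 0], [53, 1400, 0],
              [48, 1900, 1], [35, 800, 1]])
y = np.array([0, 0, 1, 0, 1])  # response no/yes

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])  # P(response = yes) = 1 / (1 + e^-(b0 + b.x))
```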

SLIDE 14

Logistic regression

Our first predictive model: a formula. Not very spectacular, but note:

• Easy to understand
• Easy to construct
• Easy to implement

In some settings, the end result will be a logistic model "extracted" from more complex approaches

Customer   Age   Income   Gender   …   Response
John       30    1200     M            No → 0
Sophie     24    2200     F            No → 0
Sarah      53    1400     F            Yes → 1
David      48    1900     M            No → 0
Seppe      35    800      M            Yes → 1

↓  apply 1 / (1 + e^(−(0.10 + 0.22 age + 0.05 income − 0.80 gender)))  ↓

Customer   Age   Income   Gender   …   Response   Score
Will       44    1500     M                       0.76
Emma       28    1000     F                       0.44

SLIDE 15

Logistic regression

P(response = yes) = 1 / (1 + e^(−(β0 + β1 age + β2 income + β3 gender)))

If Xi increases by 1:

logit|Xi+1 = logit|Xi + βi
odds|Xi+1 = odds|Xi · e^(βi)

e^(βi): "odds-ratio": multiplicative increase in odds when Xi increases by 1 (other variables constant)

• βi > 0 → e^(βi) > 1 → odds/probability increase with Xi
• βi < 0 → e^(βi) < 1 → odds/probability decrease with Xi

Doubling amount:

• Amount of change required for doubling the primary outcome odds
• Doubling amount for Xi = log(2)/βi
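Continuing the hypothetical fit from the earlier sketch, the coefficients translate into odds-ratios and doubling amounts like so:

```python
import numpy as np

# e^(beta_i): multiplicative change in the odds of "yes" per unit of X_i
odds_ratios = np.exp(clf.coef_[0])
# log(2)/beta_i: change in X_i needed to double the odds
doubling_amounts = np.log(2) / clf.coef_[0]
print(dict(zip(["age", "income", "gender"], odds_ratios)))
```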

SLIDE 16

Logistic regression

• Easy to interpret and understand
• Statistical rigor; a "well-calibrated" classifier
• Linear decision boundary, though interaction effects can be taken into the model (and explicitly, which allows for investigation)
• Sensitive to outliers
• Categorical variables need to be converted (e.g. using dummy encoding as the most common approach, though recall the ways to reduce a large number of dummies)
• Somewhat sensitive to the curse of dimensionality…

SLIDE 17

Regularization


SLIDE 18

Stepwise approaches

Statisticians love "parsimonious" models:

• If a "smaller" model works just as well as a "larger" one, prefer the smaller one
• Also: "curse of dimensionality"

Makes sense: most statistical techniques don't like dumping in your whole feature set all at once

Selection based approaches (build up the final model step-by-step):

• Forward selection
• Backward selection
• Hybrid (stepwise) selection

See MASS::stepAIC, leaps::regsubsets, caret, or simply step in R. Not implemented by default in Python (neither scikit-learn nor statsmodels); a sketch of the idea follows below… What's going on?
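Since Python has no built-in stepwise selection, here is a minimal forward-selection sketch on AIC (a hypothetical helper of my own, not a library function; assumes pandas-style inputs):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y: pd.Series) -> list:
    """Greedily add the feature that most improves (lowers) the AIC."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic  # intercept-only baseline
    while remaining:
        # Score every single-feature extension of the current model
        scores = [(sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic, c)
                  for c in remaining]
        aic, col = min(scores)
        if aic >= best_aic:  # no candidate improves the AIC: stop
            break
        best_aic = aic
        selected.append(col)
        remaining.remove(col)
    return selected
```

As the next two slides stress, anything selected this way must still be validated on held-out data.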

SLIDE 19

Stepwise approaches

Trying to get the best, smallest model given some information about a large number of variables is reasonable

Many sources cover stepwise selection methods. However, this is not really a legitimate situation.

Frank Harrell (1996):

It yields R-squared values that are badly biased to be high. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution. The method yields confidence intervals for effects and predicted values that are falsely narrow (Altman and Andersen, 1989). It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; Tibshirani, 1996). It has severe problems in the presence of collinearity. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses. Increasing the sample size does not help very much (Derksen and Keselman, 1992). It uses a lot of paper.

(https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/)

SLIDE 20

Stepwise approaches

Some of these issues have been or can be fixed (e.g. using proper tests), but… this actually already reveals something we'll visit again when talking about evaluation!

Take-away: use a proper train-test setup!

Developing and confirming a model based on the same dataset is called data dredging. Although there is some underlying relationship amongst the variables, and stronger relationships are expected to yield stronger scores, these are random variables and the realized values contain error. Thus, when you select variables based on having better realized values, they may be such because of their underlying true value, error, or both. True, using the AIC is better than using p-values, because it penalizes the model for complexity, but the AIC is itself a random variable (if you run a study several times and fit the same model, the AIC will bounce around just like everything else).

SLIDE 21

Regularization

SSE_Model1 = (1 − 1)² + (2 − 2)² + (3 − 3)² + (8 − 4)² = 16
SSE_Model2 = (1 − (−1))² + (2 − 2)² + (3 − 5)² + (8 − 8)² = 8

SLIDE 22

Regularization

Key insight: introduce a penalty on the size of the weights

• Constrained, instead of fewer parameters!
• Makes the model less sensitive to outliers, improves generalization

Lasso and ridge regression:

• Standard: y = β0 + β1x1 + … + βpxp + ϵ  with  argmin_β ∑_{i=1}^n (yi − ŷi)²
• Lasso regression (L1 regularization): argmin_β ∑_{i=1}^n (yi − ŷi)² + λ ∑_{j=1}^p |βj|
• Ridge regression (L2 regularization): argmin_β ∑_{i=1}^n (yi − ŷi)² + λ ∑_{j=1}^p βj²

No penalization on the intercept! Obviously: standardization/normalization required!
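A minimal scikit-learn sketch (my own illustration, not from the slides); note the scaler in the pipeline, since the penalty must affect all variables in a comparable manner, and note that scikit-learn calls λ "alpha":

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# Lasso drives many coefficients exactly to zero; ridge only shrinks them
print(sum(lasso[-1].coef_ == 0), sum(ridge[-1].coef_ == 0))
```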

SLIDE 23

Lasso and ridge regression

https://newonlinecourses.science.psu.edu/stat508/lesson/5/5.1


SLIDE 24

Lasso and ridge regression

Lasso will force coefficients to become zero; ridge only keeps them within bounds

• Variable selection for free with lasso
• Why ridge, then? Easier to implement (slightly) and faster to compute (slightly), or when you have a limited number of variables to begin with
• Lasso will also not consider grouping effects (e.g. it picks a variable at random when variables are correlated), and will not work when the number of instances is less than the number of features

In practice, however, lasso is preferred; it tends to work well even with small sample sizes

How to pick a good value for λ: cross-validation! (See later)

Works both for linear and logistic regression; the concept of L1 and L2 regularization also pops up with other model types (e.g. SVMs, neural networks) and fields: Tikhonov regularization (Andrey Tikhonov), ridge regression (statistics), weight decay (machine learning), the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization

Need to normalize variables beforehand to ensure that the regularisation term regularises/affects the variables involved in a similar manner!

MATLAB always uses the centred and scaled variables for the computations within ridge. It just back-transforms them before returning them.

SLIDE 25

Lasso and ridge regression


SLIDE 26

Elastic net

Every time you have two similar approaches, there's an easy paper opportunity in proposing to combine them (and giving it a new name)…

argmin_β ∑_{i=1}^n (yi − ŷi)² + λ1 ∑_{j=1}^p |βj| + λ2 ∑_{j=1}^p βj²

• Combine L1 and L2 penalties
• Retains the benefit of introducing sparsity
• Good at getting grouping effects
• Implemented in R and Python (check the documentation: everybody disagrees on how to call λ1 and λ2; see the sketch below)
• Grid search on two parameters necessary
• The lasso parameter will be the most pronounced in most practical settings
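A hedged sketch of the naming issue: scikit-learn, for instance, reparametrizes λ1/λ2 as an overall strength "alpha" plus a mixing weight "l1_ratio", and ElasticNetCV grid-searches both via cross-validation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# l1_ratio mixes the penalties: 1.0 = pure lasso, 0.0 = pure ridge
enet = make_pipeline(StandardScaler(),
                     ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5))
enet.fit(X, y)
print(enet[-1].alpha_, enet[-1].l1_ratio_)  # the tuned parameter pair
```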

SLIDE 27

Some Other Forms of Regression


SLIDE 28

Non-parametric regression

"Non-parametric" being a fancy name for "no underlying distribution assumed, purely data-driven"

• "Smoothers" such as LOESS (locally weighted scatterplot smoothing)
• Does not require specification of a function to fit a model, only a "smoothing" parameter
• Very flexible, but requires large data samples (because LOESS relies on the local data structure to provide local fitting), and does not produce a regression function
• Take care when using this as a "model" – it is more an exploratory means!
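For instance, a minimal sketch with the LOWESS smoother shipped in statsmodels (my illustration; frac is the smoothing parameter mentioned above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, size=x.size)

# Returns smoothed (x, fitted y) pairs, not a regression function or coefficients
smoothed = sm.nonparametric.lowess(y, x, frac=0.2)
print(smoothed[:5])
```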

SLIDE 29

Generalized additive models

(GAMs) – a similar concept as normal regression, but uses splines and other given smoothing functions in a linear combination:

y = β0 + f1(x1) + … + fp(xp) + ϵ

• Benefit: capture non-linearities by smooth functions
• Functions can be parametric, non-parametric, polynomial, local weighted mean, …
• Very flexible, best-of-both-worlds approach
• Danger of overfitting: stringent validation required
• Theoretical relation to boosting (which we'll discuss later)

Very nice technique but not that well known

aerosolve – Machine learning for humans: "A machine learning library designed from the ground up to be human friendly. A general additive linear piecewise spline model. The training is done at a higher resolution specified by num_buckets between the min and max of a feature's range. At the end of each iteration we attempt to project the linear piecewise spline into a lower dimensional function such as a polynomial spline with Dirac delta endpoints."
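A minimal sketch, assuming the third-party pyGAM package (not mentioned on the slides), where s() adds one spline term per feature:

```python
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.2, size=300)

# y = b0 + f1(x1) + f2(x2): one smooth spline function per feature
gam = LinearGAM(s(0) + s(1)).fit(X, y)
gam.summary()
```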

SLIDE 30

Generalized additive models

Henckaerts, Antonio, et al., 2017:

log(E(nclaims)) = log(exp) + β0 + β1 coveragePO + β2 coverageFO + β3 fueldiesel + f1(ageph) + f2(power) + f3(bm) + f4(ageph, power) + f5(long, lat)

SLIDE 31

Multinomial and ordinal logistic regression

Extension for non-binary categorical outcomes

K possible outcomes, M features. For K outcomes, construct K − 1 binary logistic regression models:

ln( P(yi = 1 | Xi) / P(yi = K | Xi) ) = β_{·,1} · Xi
ln( P(yi = 2 | Xi) / P(yi = K | Xi) ) = β_{·,2} · Xi
…
ln( P(yi = K−1 | Xi) / P(yi = K | Xi) ) = β_{·,K−1} · Xi

with β_{·,k} · Xi = β_{0,k} + β_{1,k} x_{1,i} + ⋯ + β_{M,k} x_{M,i}, and thus:

P(yi = k | Xi) = P(yi = K | Xi) · e^(β_{·,k} · Xi)

P(yi = K | Xi) = 1 − ∑_{k=1}^{K−1} P(yi = k | Xi) = 1 − ∑_{k=1}^{K−1} P(yi = K | Xi) e^(β_{·,k} · Xi)

so that:

P(yi = K | Xi) = 1 / (1 + ∑_{k=1}^{K−1} e^(β_{·,k} · Xi))

SLIDE 32

Multinomial and ordinal logistic regression

Extension for ordered categorical outcomes (e.g. ratings AAA > AA > A > B > C > D):

ln( P(yi ≤ R | Xi) / (1 − P(yi ≤ R | Xi)) ) = −θR + β1x1 + ⋯ + βnxn

P(yi = D | Xi) = P(yi ≤ D | Xi)
P(yi = C | Xi) = P(yi ≤ C | Xi) − P(yi ≤ D | Xi)
P(yi = B | Xi) = P(yi ≤ B | Xi) − P(yi ≤ C | Xi)
P(yi = A | Xi) = P(yi ≤ A | Xi) − P(yi ≤ B | Xi)
P(yi = AA | Xi) = P(yi ≤ AA | Xi) − P(yi ≤ A | Xi)
P(yi = AAA | Xi) = 1 − P(yi ≤ AA | Xi), since P(yi ≤ AAA | Xi) = 1 (θAAA = ∞)

The logit functions for all ratings are parallel since they only differ in the intercept (proportional odds model)
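A hedged sketch of both extensions in Python (my own, on made-up data): scikit-learn's LogisticRegression handles the multinomial case directly, and recent statsmodels versions ship a proportional-odds model as OrderedModel:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y_multi = rng.integers(0, 4, size=300)  # K = 4 unordered classes
y_ord = pd.Series(pd.Categorical.from_codes(rng.integers(0, 3, size=300),
                                            ["C", "B", "A"], ordered=True))

# Multinomial: one set of coefficients per class
multi = LogisticRegression().fit(X, y_multi)
print(multi.coef_.shape)  # (4, 3)

# Ordinal (proportional odds): one shared beta vector, K-1 thresholds theta_R
ordinal = OrderedModel(y_ord, X, distr="logit").fit(method="bfgs", disp=False)
print(ordinal.params)
```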

SLIDE 33

PCR and PLS

Principal Component Regression (PCR):

• Key idea: perform PCA on the features first and then perform normal regression
• Number of components to be tuned using cross-validation (see later)
• Standardization required, as PCA is scaling-sensitive

Partial Least Squares (PLS) regression:

• PCR does not take the response into account
• PLS performs PCA but now includes the target as well
• The variance aspect often dominates, so PLS will behave closely to PCR in many settings
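A minimal sketch of both (my illustration): PCR as a scikit-learn pipeline, PLS via the built-in PLSRegression:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, noise=5.0, random_state=0)

# PCR: PCA (unsupervised) first, then ordinary regression on the components
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)

# PLS: components are chosen to also covary with the target
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=5))
pls.fit(X, y)
print(pcr.score(X, y), pls.score(X, y))
```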

SLIDE 34

Decision Trees


SLIDE 35

Decision trees

Both for classification and regression – we'll discuss classification first

Based on recursively partitioning the data:

• Splitting decision: how to split a node? (E.g. age < 30, income < 1000, status = married?)
• Stopping decision: when to stop splitting, i.e. when to stop growing the tree?
• Assignment decision: how to assign a label outcome in the leaf nodes, i.e. which class to assign to a leaf node?

SLIDE 36

Terminology


SLIDE 37

ID3

ID3 (Iterative Dichotomiser 3)

• Most basic decision tree algorithm, by Ross Quinlan (1986)
• Begin with the original set S as the root node
• On each iteration of the algorithm, iterate through every unused attribute of the set S and calculate a measure for that attribute, e.g. entropy H(S) and information gain IG(A, S)
• Select the best attribute and split on it to produce subsets
• Continue to recurse on each subset, considering only attributes not selected before (for this particular branch of the tree)
• Recursion stops when every element in a subset S belongs to the same class label, or there are no more attributes to be selected, or there are no instances left in the subset

SLIDE 38

ID3


SLIDE 39

ID3


SLIDE 40

ID3


SLIDE 41

Impurity measures

Which measure? Based on impurity.

(Figure: three example node compositions)
• Minimal impurity
• Also minimal impurity
• Maximal impurity

SLIDE 42

Impurity measures

Intuitively, it's easy to see that some splits are better than others (the slide illustrates this with colored dots: a split yielding two pure subsets beats one yielding two mixed subsets). But what about a split like:

•••••• → •••• + ••

We need a measure…

SLIDE 43

Entropy

Entropy is a measure of the amount of uncertainty in a data set (information theory):

H(S) = − ∑_{x∈X} p(x) log2(p(x))

with S the data (sub)set, X the classes (e.g. {yes, no}), and p(x) the proportion of elements with class x over |S|

When H(S) = 0, the set is completely pure (all elements belong to the same class)
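As a quick check of the formula (my own sketch), reproducing the 0.94 computed on the next slide for a set of 9 yes / 5 no:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.94
```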

SLIDE 44

Entropy

For the original data set, we get:

#yes   #no   x = yes   x = no   Entropy
9      5     0.41      0.53     0.94

(the x = yes / x = no columns give the per-class terms −p(x) log2(p(x)))

SLIDE 45

Information gain

•••••• → •••• + ••

We can calculate the entropy of the original set and all the subsets, but how do we measure the improvement?

Information gain: a measure of the difference in impurity before and after the split – how much uncertainty was reduced by a particular split?

IG(A, S) = H(S) − ∑_{t∈T} p(t) H(t)

with T the set of subsets obtained by splitting the original set S on attribute A, and p(t) = |t| / |S|
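Continuing the sketch from above, a hedged pandas-based helper (column names are my own; reuses the entropy function defined earlier):

```python
import pandas as pd

def information_gain(df: pd.DataFrame, attribute: str, target: str = "play"):
    """IG(A, S) = H(S) - sum over subsets t of (|t|/|S|) * H(t)."""
    n = len(df)
    weighted = sum(len(sub) / n * entropy(list(sub[target]))
                   for _, sub in df.groupby(attribute))
    return entropy(list(df[target])) - weighted
```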

SLIDE 46

Information gain

Original set S:

#yes   #no   x = yes   x = no   Entropy
9      5     0.41      0.53     0.94

Calculate the entropy for all subsets created by all candidate splitting features:

Attribute     Subset     #yes   #no   x = yes   x = no   Entropy
Outlook       Sunny      2      3     0.53      0.44     0.97
              Overcast   4      0                        0.00
              Rain       3      2     0.44      0.53     0.97
Temperature   Hot        2      2     0.50      0.50     1
              Mild       4      2     0.39      0.53     0.92
              Cool       3      1     0.31      0.5      0.81
Humidity      High       3      4     0.52      0.46     0.99
              Normal     6      1     0.19      0.4      0.59
Wind          Strong     3      3     0.50      0.50     1
              Weak       6      2     0.31      0.5      0.81

SLIDE 47

(The entropy tables from SLIDE 46 are repeated here.)

Information gain

IG(A, S) = H(S) − ∑_{t∈T} p(t) H(t)

IG(outlook, S) = 0.94 − (5/14 · 0.97 + 4/14 · 0.00 + 5/14 · 0.97) = 0.94 − 0.69 = 0.25 ← highest IG
IG(temperature, S) = 0.03
IG(humidity, S) = 0.15
IG(wind, S) = 0.05
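The counts in these tables match the classic play-tennis toy data, so the sketch below (reusing the helpers defined earlier; the column names are my own) should reproduce the four gains:

```python
import pandas as pd

weather = pd.DataFrame({
    "outlook":     ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
                    "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                    "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "humidity":    ["high", "high", "high", "high", "normal", "normal", "normal",
                    "high", "normal", "normal", "normal", "high", "normal", "high"],
    "wind":        ["weak", "strong", "weak", "weak", "weak", "strong", "strong",
                    "weak", "weak", "weak", "strong", "strong", "weak", "strong"],
    "play":        ["no", "no", "yes", "yes", "yes", "no", "yes",
                    "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

for attr in ["outlook", "temperature", "humidity", "wind"]:
    print(attr, round(information_gain(weather, attr), 2))
# outlook 0.25, temperature 0.03, humidity 0.15, wind 0.05 -> split on outlook
```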

SLIDE 48

ID3: after one split


SLIDE 49

ID3: continue

Recursion stops when every element in a subset belongs to the same class label, or there are no more attributes to be selected, or there are no instances left in the subset

SLIDE 50

ID3: final tree

Assign labels to the leaf nodes: easy – just pick the most common class

SLIDE 51

Impurity measures

Entropy (Shannon index) is not the only measure of impurity that can be used:

• Entropy: H(S) = − ∑_{x∈X} p(x) log2(p(x))
• Gini diversity index: Gini(S) = 1 − ∑_{x∈X} p(x)²

Not very different: Gini works a bit better for continuous variables (see after) and is a little faster; most implementations default to this approach

• Classification error: ClassErr(S) = 1 − max_{x∈X} p(x)

Something to think about: why not use accuracy directly? Or another metric of interest such as AUC, precision, F1, …?
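For comparison with the entropy helper above, a Gini sketch (my own):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum over classes x of p(x)^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes"] * 9 + ["no"] * 5))  # ~0.46; 0.5 would be maximal impurity
```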

SLIDE 52

Summary so far

Using the tree to predict new labels is easy: just follow the questions in the tree and look at the outcome

• Easy to understand, easy (for a computer) to construct
• Can easily be expressed as simple IF…THEN rules as well, and can hence be easily implemented in existing programs (even as a SQL procedure) – see the sketch below

Fun as background research: algorithms exist which directly try to induce prediction models in the form of a rule base, with RIPPER (Repeated Incremental Pruning to Produce Error Reduction) being the most well known. It's just as old and leads to very similar models, but has some interesting differences and is (nowadays) not widely implemented or known anymore
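As an illustration of the IF…THEN view (my own sketch): scikit-learn can print a fitted tree as nested rules via export_text, here on a one-hot encoding of the weather data from earlier:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# One-hot encode the categorical features (sklearn trees need numeric input)
X = pd.get_dummies(weather[["outlook", "temperature", "humidity", "wind"]])
y = weather["play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```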

SLIDE 53

https://christophm.github.io/interpretable-ml-book/rules.html

See also RuleFit (https://github.com/christophM/rulefit) and Skope-Rules (https://github.com/scikit-learn-contrib/skope-rules) for interesting, newer approaches

SLIDE 54

Problems still to solve

ID3 is greedy: it never backtracks (i.e. retraces previous steps) during the construction of the tree, it only moves forward

• This means that global optimality of the tree is not guaranteed
• Algorithms exist which overcome this (see e.g. the evtree package for R), though they're often slow and do not give much better results (so greedy is good enough)

A bigger problem, however, is the fact that we do not have a way to tackle continuous variables, such as temperature = 21, 24, 26, 27, 30, …

Another big problem is that the "grow for as long as you can" strategy leads to trees which will be horribly overfitting!

SLIDE 55

Spotting overfitting

(Note that this is a good motivating case to illustrate the difference between "supervised methods for predictive analytics" and "for descriptive analytics")

SLIDE 56

C4.5

• Also by Ross Quinlan
• Extension of ID3: still uses information gain
• Main contribution: dealing with continuous variables
• Can also deal with missing values: the original paper describes that you just ignore them when calculating the impurity measure and information gain – though most implementations do not implement this!
• Also allows setting importance weights on attributes (biasing the information gain, basically)
• Describes methods to prune trees

SLIDE 57

C4.5: continuous variables

Say we want to split on temperature = 21, 24, 26, 27, 30, …

Obviously, using the values as-is would not be a good idea:

• It would lead to a lot of subsets, many of which potentially having only a few instances
• When applying the tree on new data, the chance of encountering a value which was unseen during training is much higher than for categoricals – e.g. what if temperature is 22?

Instead, enforce binary splits by:

• Splitting on temperature <= 21 → two subsets (yes, no) – and calculate the information gain
• Splitting on temperature <= 24 → two subsets (yes, no) – and calculate the information gain
• Splitting on temperature <= 26 → two subsets (yes, no) – and calculate the information gain
• Splitting on temperature <= 27 → two subsets (yes, no) – and calculate the information gain
• And so on… (see the sketch below)

Important: only the distinct set of values seen in the training set are considered (others wouldn't change the information gain), though some papers propose changes to make this a little more stable
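A minimal sketch of that threshold search (my own helper, with made-up labels; reuses the entropy function from before):

```python
def best_numeric_split(values, labels):
    """Try 'value <= v' for each distinct v and keep the split with highest IG."""
    base, n = entropy(labels), len(labels)
    best_ig, best_threshold = 0.0, None
    for v in sorted(set(values)):
        left = [lab for val, lab in zip(values, labels) if val <= v]
        right = [lab for val, lab in zip(values, labels) if val > v]
        if not left or not right:
            continue  # degenerate split, skip
        ig = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if ig > best_ig:
            best_ig, best_threshold = ig, v
    return best_threshold, best_ig

print(best_numeric_split([21, 24, 26, 27, 30], ["no", "no", "yes", "yes", "no"]))
```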

SLIDE 58

C4.5: continuous variables

Note that each of our "temperature <= …" splits leads to a yes/no outcome. Couldn't we do the same for categorical features as well?

• Humidity = "high" → two subsets (yes, no) – and calculate the information gain
• Humidity = "medium" → two subsets (yes, no) – and calculate the information gain
• Humidity = "low" → two subsets (yes, no) – and calculate the information gain

Turns out that constructing such a binary tree is better:

• The information gain measure is biased towards preferring splits that lead to more subsets (avoided now: always two subsets)
• Obviously, we can now re-use attributes throughout the tree, as each attribute leads to multiple binary subsets
• The tree can be deeper, but less wide
• Weka is a weird exception

SLIDE 59

C4.5: preventing overfitting

Early stopping: stop based on a stop criterion

• E.g. when the number of instances in a subset goes below a threshold
• Or when the depth gets too high
• Or: set aside a validation set during training and stop when performance on this set starts to decrease (if you have enough training data to begin with)
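In scikit-learn these criteria map onto constructor parameters (a hedged sketch on synthetic data; the thresholds are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# max_depth caps the tree's depth; min_samples_leaf refuses splits that would
# create leaves with too few instances -- both act as early-stopping criteria
clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20).fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```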

SLIDE 60

C4.5: preventing overfitting

Pruning: grow the full tree, but then reduce it

• Merge back leaf nodes if they add little power to the classification accuracy
• "Inside every big tree is a small, perfect tree waiting to come out." – Dan Steinberg
• Many forms exist: weakest-link pruning, cost-complexity pruning (common), etc.
• Oftentimes governed by a "complexity" parameter in most implementations

Only recently in sklearn: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py

Scoring of new instances can now return a "probability"
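A hedged sketch of cost-complexity pruning with scikit-learn's ccp_alpha parameter (synthetic data; the choice of alpha here is arbitrary and would normally come from cross-validation):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Enumerate the candidate complexity parameters for the fully grown tree...
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# ...then refit with a chosen alpha: larger alphas give smaller trees
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(pruned.get_depth(), pruned.get_n_leaves())
```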

SLIDE 61

C5 (See5)

• Also by Ross Quinlan
• Commercial offering
• Faster, memory efficient, smaller trees, weighting of cases, and supplying misclassification costs
• Not widely adopted… only recently made open source; better open source implementations available
• Support for boosting (see next course)

Outdated

SLIDE 62

A few final insights…

Conditional inference trees:

• Another method, which instead of using information gain uses a significance test procedure in order to select variables

Preprocessing:

• Decision trees are robust to outliers
• Only missing value treatment needed
• Some implementations have proposed three-way splits (yes / no / NA)

Multiclass:

• The concept of decision trees is easily extended to the multiclass setting

SLIDE 63

A few final insights…

Categorization:

• Recall preprocessing: it is possible to run a decision tree on one continuous variable only, to suggest a good binning based on the leaf nodes

Interaction effects and nonlinearities:

• Considered by default by decision trees
• CHAID (Chi-square automatic interaction detection): chi-square based test to split trees

Variable selection:

• Based on features that pop up earlier in the tree

Not a well-calibrated, and an unstable, classifier:

• Sensitive to changes in training data: a small change can cause your tree to look different

SLIDE 64

A few final insights

Remember to prevent overfitting trees

• But a deep tree does not necessarily mean that you have a problem
• And a short tree does not necessarily mean that it's not overfitting

However, note that it is now very likely that your leaf nodes will not be completely pure (i.e. not containing 100% yes cases or no cases)

SLIDE 65

A few final insights

Together with logistic regression, decision trees are in the top-3 of predictive techniques on tabular data. They're simple to understand and present, require very little data preparation, and can learn non-linear relationships

In fact, many ways exist to take your favorite implementation and extend it:

• E.g. when domain experts have their favorite set of features, you can easily constrain the tree to only consider a subset of the features in the first n levels
• You might even consider playing with more candidates to generate binary split rules, e.g. "feature X between A and B?", "euclidean dist(X, Y) <= t", …

SLIDE 66

A few final insights

Regression trees:

• As made popular by CART: Classification And Regression Trees
• "Tree structured regression offers an interesting alternative for looking at regression type problems. It has sometimes given clues to data structure not apparent from a linear regression analysis"
• Instead of counting the #yes's vs. #no's to get the predicted class label, take the mean of the continuous label and use that as the outcome

But how to select the splitting criterion?

• Squared residuals minimization: the expected sum of variances for the two resulting nodes should be minimized
• Based on the sum of squared errors: find the split that produces the greatest separation in SSE between the subsets
• Find nodes with minimal within variance… and therefore greatest between variance (a little bit like k-means, a clustering technique we'll visit later)
• Important: the comments regarding pruning still apply, though a regression tree will typically need to be deeper than a classification one
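A minimal sketch of a regression tree (my own, on made-up data); each leaf predicts the mean of the training labels that land in it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=300)  # a linear relationship

# Each leaf predicts the mean target of its training instances, so the fitted
# function is a step function -- which is why trees struggle with linear trends
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[2.0], [2.4]]))  # nearby inputs may share one leaf's mean
```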

SLIDE 67

A few final insights

Visualizing:

• R: use fancyRpartPlot() from the rattle package for nicer visualizations
• Python: use dtreeviz (https://github.com/parrt/dtreeviz) for nicer visualizations

SLIDE 68

Decision trees versus (logistic) regression

Decision boundary of decision trees: squares orthogonal to a dimension

SLIDE 69

Decision trees versus (logistic) regression

Decision trees can struggle to capture linear relationships:

• E.g. the best they can do is a step function approximation of a linear relationship
• This is strictly related to how decision trees work: they split the input features into several "orthogonal" regions and assign a prediction value to each region
• Here, a deeper tree would be necessary to approximate the linear relationship
• Or apply a transformation first (e.g. PCA)

SLIDE 70

K-nearest Neighbors


SLIDE 71

K-nearest neighbors (K-NN)

• A non-parametric method used for classification and regression
• "The data is the model"
• Trivially easy
• Based on the concept of distances (so normalization/standardization required)

SLIDE 72

K-nearest neighbors (K-NN)

Has some appealing properties:

• Easy to understand
• Fun to tweak: custom distances, different values for k, dynamic k-values, custom distance measures… – even a recommender system can be built using this approach
• Provides surprisingly good results, given enough data
• Regression: e.g. use a (weighted) average of the k nearest neighbors, weighted by the inverse of their distance

The main disadvantage of k-NN is that it is a "lazy learner": it does not learn anything from the training data and simply uses the training data itself as a model (see the sketch below)

• This means you don't get a formula or a tree as a summarizing model
• And you need to keep the training data around
• Relatively slow
• Not as stable for noisy data, might be hard to generalize, unstable with regard to the k setting
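A minimal scikit-learn sketch (my own, on synthetic data), with the standardization step the slides call for since k-NN is distance-based:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# Distances only make sense on comparable scales, hence the scaler;
# "fitting" k-NN merely stores the training data (lazy learner)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)
print(knn.predict(X[:5]), knn.predict_proba(X[:5])[:, 1])
```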

SLIDE 73

K-nearest neighbors (K-NN)

https://towardsdatascience.com/scanned-digits-recognition-using-k-nearest-neighbor-k-nn-d1a1528f0dea


SLIDE 74

Wrap up

Linear regression, logistic regression, and decision trees are still amongst the most widely used techniques:

• White box, statistical rigor, easy to construct and interpret

K-NN can provide a quick baseline and is very extensible

The decision between regression and (which) classification is not always a clear-cut choice!

• Neither is the decision between unsupervised and supervised
• Iteration and domain expert involvement required