SLIDE 1

Lecture 4: Rule-based classification and regression

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data 1st April 2019

SLIDE 2

Amendment: Bias-Variance Tradeoff

Bias-Variance Decomposition

𝑆 = π”½π‘ž(𝒰,𝐲,𝑧) [(𝑧 βˆ’ Λ† 𝑔(𝐲))2] Total expected prediction error = 𝜏2 Irreducible Error + π”½π‘ž(𝐲) [(𝑔(𝐲) βˆ’ π”½π‘ž(𝒰) [ Λ† 𝑔(𝐲)])

2

] Bias2 averaged over 𝐲 + π”½π‘ž(𝐲) [Varπ‘ž(𝒰) [ Λ† 𝑔(𝐲)]] Variance of Λ† 𝑔 averaged over 𝐲

𝑆

[Figure: total error vs. model complexity — bias² decreases and variance increases with complexity, irreducible error is constant; low complexity underfits, high complexity overfits]

SLIDE 3

Observations

β–Ά Irreducible error cannot be changed
β–Ά Bias and variance of $\hat{f}$ are sample-size dependent
β–Ά For a consistent estimator $\hat{f}$: $\mathbb{E}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})] \to f(\mathbf{x})$ for increasing sample size
β–Ά In many cases: $\operatorname{Var}_{p(\mathcal{T})}(\hat{f}(\mathbf{x})) \to 0$ for increasing sample size
β–Ά Caution: Theoretical guarantees often depend on the number of variables $p$ staying fixed while the sample size $n$ increases. This might not be fulfilled in reality.

SLIDE 4

Amendment: Leave-One-Out Cross-validation (LOOCV)

Cross-validation with 𝑑 = π‘œ is called leave-one-out cross-validation.

β–Ά Popular because explicit formulas (or approximations)

exist for many special cases (e.g. regularized regression)

β–Ά Uses the most data for training possible β–Ά More variable than 𝑑-fold CV for 𝑑 < π‘œ since only one data

point is used for testing and the training sets are very similar

β–Ά In praxis: Try out different values for 𝑑. Be cautious if

results vary drastically with 𝑑. Maybe the underlying model assumptions are not appropriate.
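A minimal scikit-learn sketch comparing LOOCV with $K$-fold CV for a few values of $K$; the ridge model and the diabetes dataset are placeholders, not part of the lecture:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0)

# LOOCV: K-fold CV with as many folds as observations (K = n)
loo = cross_val_score(model, X, y, cv=LeaveOneOut(),
                      scoring="neg_mean_squared_error")
print("LOOCV MSE:", -loo.mean())

# Try out different values for K; be cautious if results vary drastically
for k in (5, 10, 20):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    print(f"{k}-fold MSE:", -scores.mean())
```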

SLIDE 5

Classification and Partitions

SLIDE 6

Classification and Partitions

A classification algorithm constructs a partition of feature space and assigns a class to each region.

β–Ά kNN creates local neighbourhoods in feature space and assigns a class in each
β–Ά Logistic regression divides feature space implicitly by modelling $p(k|\mathbf{x})$ and determines decision boundaries through Bayes' rule
β–Ά Discriminant analysis creates an explicit model of the feature space conditional on the class. It models $p(\mathbf{x}, k)$ by assuming that $p(\mathbf{x}|k)$ is a normal distribution and either estimates $p(k)$ from data or through prior knowledge.

SLIDE 7

New point-of-view: Rectangular Partitioning

Idea: Create an explicit partition by dividing feature space into rectangular regions and assign a constant conditional mean (regression) or constant conditional class probability (classification) to each region.

Given regions $R_m$ for $m = 1, \dots, M$, a classification rule for classes $k \in \{1, \dots, K\}$ is
$$
\hat{c}(\mathbf{x}) = \arg\max_{1 \le k \le K} \sum_{m=1}^{M} \mathbb{1}(\mathbf{x} \in R_m) \left( \sum_{\mathbf{x}_i \in R_m} \mathbb{1}(y_i = k) \right)
$$
and a regression function is given by
$$
\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} \left( \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} y_i \right) \mathbb{1}(\mathbf{x} \in R_m)
$$
(Derivations are similar to kNN, with regions instead of neighbourhoods.)

SLIDE 8

Classification and Regression Trees (CART)

β–Ά Complexity of partitioning:
  Arbitrary partition > rectangular partition > partition from a sequence of binary splits
β–Ά Classification and Regression Trees create a sequence of binary axis-parallel splits in order to reduce the variability of values/classes in each region

[Figure: two-class data in the $(x_1, x_2)$-plane and the corresponding tree: the root splits at $x_2 \ge 2.2$, one child splits at $x_1 \ge 3.5$, and each leaf shows its class proportions and share of the data]

SLIDE 9

CART: Tree building/growing

  • 1. Start with all data in a root node
  • 2. Binary splitting
    2.1 Consider each feature $x_{\cdot j}$ for $j = 1, \dots, p$. Choose a threshold $t_j$ (for continuous features) or a partition of the feature categories (for categorical features) that results in the greatest improvement in node purity, splitting the samples into $\{y_i : x_{ij} > t_j\}$ and $\{y_i : x_{ij} \le t_j\}$
    2.2 Choose the feature $j$ that led to the best split of the data and create a new child node for each subset
  • 3. Repeat Step 2 on all child nodes until the tree reaches a stopping criterion

All nodes without descendants are called leaf nodes. The sequence of splits preceding them defines the regions $R_m$.
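scikit-learn's DecisionTreeClassifier implements this greedy growing scheme (for continuous features); a small sketch on the iris data, which is only a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# Greedy binary splitting: at each node, pick the feature/threshold with the
# greatest improvement in node purity, until a stopping criterion is reached
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5,
                              random_state=0).fit(X, y)

# The printed sequence of splits defines the leaf regions R_m
print(export_text(tree, feature_names=data.feature_names))
```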

SLIDE 10

Measures of node purity

Use Λ† πœŒπ‘—π‘› = 1 |𝑆𝑛| βˆ‘

π²π‘šβˆˆπ‘†π‘›

1(π‘—π‘š = 𝑗)

β–Ά Three common measures to determine impurity in a

region 𝑆𝑛 are (for classification trees) Misclassification error: 1 βˆ’ max𝑗 Λ† πœŒπ‘—π‘› Gini impurity: βˆ‘

𝐿 𝑗=1 Λ†

πœŒπ‘—π‘›(1 βˆ’ Λ† πœŒπ‘—π‘›) Entropy/deviance: βˆ’ βˆ‘

𝐿 𝑗=1 Λ†

πœŒπ‘—π‘› log Λ† πœŒπ‘—π‘›

β–Ά All criteria are zero when only one class is present and

maximal when all classes are equally common.

β–Ά For regression trees the decrease in mean squared error

after a split can be used as an impurity measure.
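The measures written out as small numpy functions of the vector of class proportions $\hat{p}_{m\cdot}$ (a sketch, not library code):

```python
import numpy as np

def misclassification(p):
    # 1 - max_k p_mk
    return 1.0 - np.max(p)

def gini(p):
    # sum_k p_mk * (1 - p_mk)
    return np.sum(p * (1.0 - p))

def entropy(p):
    # -sum_k p_mk * log(p_mk), using the convention 0 * log 0 = 0
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_pure = np.array([1.0, 0.0, 0.0])     # one class present: all measures zero
p_mixed = np.array([1/3, 1/3, 1/3])    # all classes equally common: maximal
for measure in (misclassification, gini, entropy):
    print(measure.__name__, measure(p_pure), measure(p_mixed))
```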

SLIDE 11

Node impurity in two class case

Example for a two-class problem ($k = 0$ or $1$). $\hat{p}_{0m}$ is the empirical frequency of class 0 in a region $R_m$.

[Figure: impurity as a function of $\hat{p}_{0m} \in [0, 1]$ for entropy, Gini impurity and misclassification error]

Only Gini impurity and entropy are used in practice (the misclassification error leads to problems when averaging).

SLIDE 12

Stopping criteria

β–Ά Minimum size of leaf nodes (e.g. 5 samples per leaf node)
β–Ά Minimum decrease in impurity (e.g. cutoff at 1%)
β–Ά Maximum tree depth, i.e. number of splits (e.g. at most 30 splits from the root node)
β–Ά Maximum number of leaf nodes

Running CART until one of these criteria is fulfilled generates a max tree.
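In scikit-learn these criteria correspond directly to hyperparameters of the tree classes; the values below simply mirror the examples above and are not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# One argument per stopping criterion from the list above
max_tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # minimum size of leaf nodes
    min_impurity_decrease=0.01,  # minimum (weighted) decrease in impurity
    max_depth=30,                # maximum tree depth
    max_leaf_nodes=None,         # maximum number of leaf nodes (unlimited here)
)
```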

SLIDE 13

Summary of CART

β–Ά Pro: Outcome is easily interpretable
β–Ά Pro: Can easily handle missing data
β–Ά Neutral: Only suitable for axis-parallel decision boundaries
β–Ά Con: Features with more potential splits have a higher chance of being picked
β–Ά Con: Prone to overfitting/unstable (only the best feature is used for splitting, and which feature is best can change with small changes in the data)

SLIDE 14

CART and overfitting

How can overfitting be avoided?

β–Ά Tuning of stopping criteria: These can easily lead to early stopping, since a weak split might lead to a strong split later
β–Ά Pruning: Build a max tree first, then reduce its size by collapsing internal nodes. This can be more effective, since weak splits are allowed during tree building. ("The silly certainty of hindsight")
β–Ά Ensemble methods: Examples are bagging, boosting, stacking, …

SLIDE 15

A note on pruning

β–Ά A common strategy is cost-complexity pruning.
β–Ά For a given $\alpha > 0$ and a tree $T$, its cost-complexity is defined as
$$
C_\alpha(T) = \underbrace{\sum_{R_m \in T} \left( \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}(y_i \ne \hat{c}(\mathbf{x}_i)) \right)}_{\text{Cost}} + \underbrace{\alpha \, |T|}_{\text{Complexity}}
$$
where $(y_i, \mathbf{x}_i)$ is the training data, $\hat{c}$ the CART classification rule and $|T|$ the number of leaf nodes/regions defined by the tree.
β–Ά It can be shown that successive subtrees $T_l$ of the max tree $T_{\max}$ can be found such that each tree $T_l$ minimizes $C_{\alpha_l}(T_l)$, where $\alpha_1 \ge \dots \ge \alpha_L$
β–Ά The tree with the lowest cost-complexity is chosen

A sketch of this procedure follows below.
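scikit-learn implements this scheme: cost_complexity_pruning_path returns the sequence of effective $\alpha$ values for the successive subtrees, and ccp_alpha selects one. A sketch that picks $\alpha$ by cross-validation (the dataset is a placeholder):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Alphas of the successive subtrees T_l of the max tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Evaluate each subtree by cross-validation and keep the best alpha
cv_scores = [cross_val_score(
                 DecisionTreeClassifier(ccp_alpha=alpha, random_state=0),
                 X, y, cv=5).mean()
             for alpha in path.ccp_alphas]
print("chosen alpha:", path.ccp_alphas[int(np.argmax(cv_scores))])
```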

SLIDE 16

Re-cap of the bootstrap and variance reduction

SLIDE 17

The Bootstrap – A short recapitulation (I)

Given a sample $x_i$, $i = 1, \dots, n$ from an underlying population, estimate a statistic $\theta$ by $\hat{\theta} = \hat{\theta}(x_1, \dots, x_n)$. What is the uncertainty of $\hat{\theta}$?

Solution: Find confidence intervals (CIs) quantifying the variability of $\hat{\theta}$. Computation:

β–Ά Through theoretical results (e.g. linear models) if distributional assumptions are fulfilled
β–Ά Linearisation for more complex models (e.g. nonlinear or generalized linear models)
β–Ά Nonparametric approaches using the data (e.g. bootstrap)

All of these approaches require fairly large sample sizes.

SLIDE 18

The Bootstrap – A short recapitulation (II)

Nonparametric bootstrap: Given a sample $x_1, \dots, x_n$, bootstrapping performs, for $b = 1, \dots, B$:

  • 1. Sample $\tilde{x}_1, \dots, \tilde{x}_n$ with replacement from the original sample
  • 2. Calculate $\hat{\theta}_b = \hat{\theta}(\tilde{x}_1, \dots, \tilde{x}_n)$

β–Ά $B$ should be large (in the 1000s–10000s)
β–Ά The distribution of the $\hat{\theta}_b$ approximates the sampling distribution of $\hat{\theta}$
β–Ά The bootstrap makes exactly one strong assumption: the data is discrete and values not seen in the data are impossible.¹

¹Check out this blog post!
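A minimal numpy sketch of the two steps for the sample mean; the exponential data and the value of $B$ are placeholders matching the example on the next slide:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=3.0, size=200)   # original sample, n = 200

B = 5000
theta_b = np.empty(B)
for b in range(B):
    # Step 1: sample n values with replacement from the original sample
    x_tilde = rng.choice(x, size=x.size, replace=True)
    # Step 2: calculate the statistic on the bootstrap sample
    theta_b[b] = x_tilde.mean()

# The theta_b approximate the sampling distribution of the sample mean
print("bootstrap mean:", theta_b.mean(), "bootstrap s.e.:", theta_b.std(ddof=1))
```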

SLIDE 19

CI for statistics of an exponential random variable

[Figure: histogram of the original sample with the true density overlaid and one bootstrapped sample outlined]

Data ($n = 200$) simulated from $x \sim \operatorname{Exp}(1/3)$, i.e. $\mathbb{E}[x] = 3$

β–Ά Orange histogram shows the original sample
β–Ά Blue line is the true density
β–Ά Black outlined histogram shows a bootstrapped sample
β–Ά Vertical lines are the mean of $x$ (dashed) and the 99% quantile (dotted) [red = empirical, blue = theoretical]

SLIDE 20

CI calculation: Normal approximation and percentile method

  • 1. Normal approximation: Set $\bar{\theta} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b$ and estimate the standard error of $\hat{\theta}$ as
$$
\widehat{se} = \sqrt{\frac{\sum_{b=1}^{B} (\hat{\theta}_b - \bar{\theta})^2}{B - 1}}
$$
Assume the distribution of $\hat{\theta}$ is approximately $N(\hat{\theta}, \widehat{se}^2)$, giving the CI $\hat{\theta} \pm z_{1-\alpha/2} \, \widehat{se}$
  • 2. Percentile/quantile method: Take the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap estimates $\hat{\theta}_b$ as the boundaries of the CI

Both constructions are sketched below.

SLIDE 21

CI calculation: Applied to example

[Figure: bootstrap distributions of $\hat{\theta}_{b,\text{mean}}$ (left) and $\hat{\theta}_{b,0.99}$ (right)]

Based on $B = 1000$ bootstrap samples.

For the mean value, the normal approximation seems reasonable. 95% CIs:
  Normal approx.: (2.68, 3.65)
  Percentile method: (2.71, 3.67)

For the 99% quantile, bootstrapping requires a much larger $n$ and shows high uncertainty.

SLIDE 22

Modifications to nonparametric bootstrap

β–Ά Different sampling strategies. Some examples:
  β–Ά $m$-out-of-$n$ bootstrap: Draw $m < n$ samples without replacement
  β–Ά Draw from a smooth density estimate of the data
  β–Ά Draw from a parametric distribution fitted to the original data
β–Ά The normal approximation doesn't always apply, and the percentile method is unstable for complicated statistics. Example of an alternative:
  β–Ά Bootstrap-t: Instead of normal quantiles, estimate quantiles from $(\hat{\theta}_b - \hat{\theta}) / \widehat{se}_b$, where $\widehat{se}_b$ is an estimate of the standard error
β–Ά Many other alternatives exist …

SLIDE 23

Limitations of the bootstrap

β–Ά The number of samples needs to be quite large
β–Ά Extreme values (minimum, maximum, very small or large quantiles) can be hard to estimate since they might not even appear in the data
β–Ά Many basic CI estimation algorithms assume that the bootstrap distribution is approximately normal (often not the case in reality)

SLIDE 24

Bootstrap aggregation (bagging)

  • 1. Given a training sample $(y_i, \mathbf{x}_i)$, $i = 1, \dots, n$ (regression or classification), we want to fit a predictive model $\hat{f}(\mathbf{x})$
  • 2. For $b = 1, \dots, B$, form bootstrap samples of the training data and fit the model, resulting in $\hat{f}_b(\mathbf{x})$
  • 3. Define
$$
\hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x})
$$
where $\hat{f}_b(\mathbf{x})$ is a continuous value for a regression problem or a vector of class probabilities for a classification problem

Majority vote can be used for classification problems instead of averaging.
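A sketch of bagged regression trees written out by hand (scikit-learn's BaggingRegressor wraps the same loop); X and y are assumed to be numpy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, B=100, seed=0):
    """Fit B trees, each on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    models = []
    for b in range(B):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    # f_bag(x) = (1/B) * sum_b f_b(x)
    return np.mean([m.predict(X) for m in models], axis=0)
```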

SLIDE 25

Bagging and variance reduction

β–Ά Bagging using averages approximates
$$
f_{\text{ag}}(\mathbf{x}) = \mathbb{E}_{p(\mathcal{T})}\big[\hat{f}(\mathbf{x})\big]
$$
β–Ά For the conditional expected error in squared error loss
$$
\mathbb{E}_{p(\mathcal{T}, y|\mathbf{x})}\big[(y - \hat{f}(\mathbf{x}))^2\big] \ge \mathbb{E}_{p(\mathcal{T}, y|\mathbf{x})}\big[(y - f_{\text{ag}}(\mathbf{x}))^2\big]
$$
β–Ά Some notes:
  β–Ά Remember the kNN graphs from last lecture: noisy individually, more stable (less variable) on average
  β–Ά Bagging shows no effect on linear models

SLIDE 26

Correlation and bagged variance

Recall: For identically distributed (i.d.) random variables $x_i$, $i = 1, \dots, n$,
$$
\operatorname{Var}\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) = \frac{1 - \rho}{n} \sigma^2 + \rho \sigma^2
$$
where $\rho \in [0, 1)$ is the (positive) pairwise correlation coefficient and $\sigma^2$ is the variance of each $x_i$.

β–Ά Bootstrap samples are correlated, which increases the total variance
β–Ά Decreasing the correlation between bootstrap samples would decrease the variance of a bagging estimate

A small simulation of this formula is sketched below.
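A numpy check of the variance formula for equicorrelated normal variables; n, rho and sigma2 are arbitrary toy values:

```python
import numpy as np

n, rho, sigma2 = 20, 0.5, 1.0
# Equicorrelation covariance: sigma2 on the diagonal, rho * sigma2 off it
cov = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))

rng = np.random.default_rng(2)
draws = rng.multivariate_normal(np.zeros(n), cov, size=100_000)
sample_means = draws.mean(axis=1)

print("empirical variance:", sample_means.var())
print("formula:           ", (1 - rho) / n * sigma2 + rho * sigma2)
```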

SLIDE 27

Random Forests

SLIDE 28

Random Forests

  • 1. Given a training sample with $p$ features, do for $b = 1, \dots, B$:
    1.1 Draw a bootstrap sample of size $n$ from the training data (with replacement)
    1.2 Grow a tree $T_b$ until each node reaches the minimal node size $n_{\min}$:
      1.2.1 Randomly select $m$ variables from the $p$ available
      1.2.2 Find the best splitting variable among these $m$
      1.2.3 Split the node
  • 2. For a new $\mathbf{x}$ predict
    Regression: $\hat{f}_{\text{rf}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})$
    Classification: Majority vote at $\mathbf{x}$ across trees

Note: Step 1.2.1 leads to less correlation between trees built on bootstrapped data.
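The same algorithm as implemented in scikit-learn; here n_estimators plays the role of $B$, max_features of $m$ and min_samples_leaf of $n_{\min}$ (dataset and values are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=500,     # B trees, each grown on a bootstrap sample of size n
    max_features="sqrt",  # m randomly selected candidate features per split
    min_samples_leaf=1,   # n_min: grow the trees deep
    random_state=0,
)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```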

SLIDE 29

Variable importance

  • 1. Impurity index: Splitting on a feature leads to a reduction of node impurity. Summing all improvements over all trees per feature gives a measure of variable importance.
  • 2. Out-of-bag error:
    β–Ά During bootstrapping, for large enough $n$, each sample has a chance of about 63% to be selected
    β–Ά For bagging, the remaining samples are out-of-bag
    β–Ά These out-of-bag samples for tree $T_b$ can be used as a test set for that particular tree, since they were not used during training, resulting in test error $e_0$
    β–Ά Permute variable $j$ in the out-of-bag samples and calculate the test error again, $e_1^{(j)}$
    β–Ά The increase in error $e_1^{(j)} - e_0 \ge 0$ serves as an importance measure for variable $j$

Both measures are sketched below.
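In scikit-learn, feature_importances_ gives the impurity index and permutation_importance permutes one variable at a time. Here the permutation measure is computed on a held-out set instead of per-tree out-of-bag samples, a common simplification:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# 1. Impurity index: impurity reductions summed per feature over all trees
print(forest.feature_importances_)

# 2. Permutation importance: increase in error after permuting variable j
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```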

SLIDE 30

Take-home message

β–Ά Direct partitioning of feature space is a complex task
β–Ά Simplifications in the form of binary splits, resulting in tree models, work well
β–Ά High interpretability of CART, but also high variability
β–Ά Random Forests tackle variance reduction through bagging and random selection of splitting features
