Data Mining II: Regression
Heiko Paulheim
Heiko Paulheim 2
Regression
- Classification
– covered in Data Mining I – predict a label from a finite collection – e.g., true/false, low/medium/high, ...
- Regression
– predict a numerical value – from a possibly infinite set of possible values
- Examples
– temperature – sales figures – stock market prices – ...
Heiko Paulheim 3
Contents
- A closer look at the problem
– e.g., interpolation vs. extrapolation – measuring regression performance
- Revisiting classifiers we already know
– which can also be used for regression
- Adaptation of classifiers for regression
– model trees – support vector machines – artificial neural networks
- Other methods of regression
– linear regression and variants – isotonic regression – local regression
Heiko Paulheim 4
The Regression Problem
- Classification
– algorithm “knows” all possible labels, e.g. yes/no, low/medium/high – all labels appear in the training data – the prediction is always one of those labels
- Regression
– algorithm “knows” some possible values, e.g., 18°C and 21°C – prediction may also be a value not in the training data, e.g., 20°C
Heiko Paulheim 5
Interpolation vs. Extrapolation
- Training data:
– weather observations for current day – e.g., temperature, wind speed, humidity, … – target: temperature on the next day – training values between -15°C and 32°C
- Interpolating regression
– only predicts values from the interval [-15°C,32°C]
- Extrapolating regression
– may also predict values outside of this interval
Heiko Paulheim 6
Interpolation vs. Extrapolation
- Interpolating regression is regarded as “safe”
– i.e., only reasonable/realistic values are predicted
http://xkcd.com/605/
Heiko Paulheim 7
Interpolation vs. Extrapolation
- Sometimes, however, only extrapolation is interesting
– how far will the sea level have risen by 2050? – will there be a nuclear meltdown in my power plant?
http://i1.ytimg.com/vi/FVfiujbGLfM/hqdefault.jpg
Heiko Paulheim 8
Baseline Prediction
- For classification: predict most frequent label
- For regression: predict average value
– or median – or mode – in any case: only interpolating regression
- Often a strong baseline
http://xkcd.com/937/
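As an illustration, such a baseline can be sketched with scikit-learn's DummyRegressor (not mentioned on the slides; the toy data is made up). The prediction is always the mean of the training targets, so the baseline only interpolates:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# made-up training data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([10.0, 12.0, 11.0, 13.0])

# baseline: always predict the mean of the training targets
# (strategy="median" would predict the median instead)
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)

print(baseline.predict(np.array([[100.0]])))  # [11.5], regardless of the input
```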
Heiko Paulheim 9
k Nearest Neighbors Revisited
- Problem
– find out what the weather is in a certain place – where there is no weather station – how could you do that?
Heiko Paulheim 10
k Nearest Neighbors Revisited
- Idea: use the average of the
nearest stations
- Example:
– 3x sunny – 2x cloudy – result: sunny
- Approach is called
– “k nearest neighbors” – where k is the number of neighbors to consider – in the example: k=5 – in the example: “near” denotes geographical proximity
Heiko Paulheim 11
k Nearest Neighbors for Regression
- Idea: use the numeric
average of the nearest stations
- Example:
– 18°C, 20°C, 21°C, 22°C, 21°C
- Compute the average
– again: k=5 – (18+20+21+22+21)/5 – prediction: 20.4°C
- Only interpolating regression!
(Figure: map with the query location x and nearby stations measuring 20°C, 21°C, 22°C, 22°C, 18°C, and 21°C.)
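A minimal sketch of k-NN regression, e.g., with scikit-learn's KNeighborsRegressor (station coordinates and temperatures below are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# made-up station coordinates (x, y) and measured temperatures
stations = np.array([[0, 0], [1, 0], [0, 1], [2, 2], [3, 1], [4, 4]])
temps = np.array([18.0, 20.0, 21.0, 22.0, 21.0, 22.0])

# predict the temperature at a new location as the average of the k=5 nearest stations
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(stations, temps)

print(knn.predict(np.array([[1.5, 1.5]])))  # average of the 5 closest stations
```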
Heiko Paulheim 12
Performance Measures
- Recap: measuring performance for classification:
- If we use the numbers 0 and 1 for class labels, we can reformulate
this as shown below. Why?
– the numerator sums the errors: correctly classified examples contribute 0
- i.e., the difference of the prediction and the actual label is 0
– the denominator is the total number of examples
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy = 1 − ( ∑_all examples |predicted − actual| ) / N
Heiko Paulheim 13
Mean Absolute Error
- We have
- For an arbitrary numerical target, we can define
- Mean Absolute Error
– intuition: how much does the prediction differ from the actual value
on average?
Accuracy = 1 − ( ∑_all examples |predicted − actual| ) / N

MAE = ( ∑_all examples |predicted − actual| ) / N
Heiko Paulheim 14
(Root) Mean Squared Error
- Mean Squared Error:
- Root Mean Squared Error:
- More severe errors are weighted higher by MSE and RMSE
MSE = ( ∑_all examples (predicted − actual)² ) / N

RMSE = √( ( ∑_all examples (predicted − actual)² ) / N )
Heiko Paulheim 15
Correlation
- Pearson's correlation coefficient
- Scores well if
– high actual values get high predictions – low actual values get low predictions
- Caution: PCC is scale-invariant!
– actual income: $1, $2, $3 – predicted income: $1,000, $2,000, $3,000 → PCC = 1
PCC = ∑_all examples (pred − mean(pred)) · (act − mean(act))
      / ( √( ∑_all examples (pred − mean(pred))² ) · √( ∑_all examples (act − mean(act))² ) )
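A small sketch computing these measures directly from the definitions above (the prediction/actual vectors are made up):

```python
import numpy as np

# made-up predictions and actual values
predicted = np.array([20.4, 18.0, 25.1, 30.2])
actual    = np.array([21.0, 17.5, 27.0, 29.0])

mae  = np.mean(np.abs(predicted - actual))
mse  = np.mean((predicted - actual) ** 2)
rmse = np.sqrt(mse)

# Pearson's correlation coefficient
pcc = np.sum((predicted - predicted.mean()) * (actual - actual.mean())) / (
    np.sqrt(np.sum((predicted - predicted.mean()) ** 2))
    * np.sqrt(np.sum((actual - actual.mean()) ** 2))
)

print(mae, rmse, pcc)
```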
Heiko Paulheim 16
Linear Regression
- Assumption: target variable y is (approximately)
linearly dependent on attributes
– for visualization: one attribute x – in reality: x1, ..., xn
Heiko Paulheim 17
Linear Regression
- Target: find a linear function f: f(x)=w0 + w1x1 + w2x2 + … + wnxn
– so that the error is minimized – i.e., for all examples (x1,...xn,y), f(x) should be a correct prediction for y – given a performance measure
Heiko Paulheim 18
Linear Regression
- Typical performance measure used: Mean Squared Error
- Task: find w0....wn so that
is minimized
- note: we omit the denominator N
∑_all examples (w0 + w1·x1 + w2·x2 + ... + wn·xn − y)²
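This least-squares problem has a closed-form solution; a minimal sketch with numpy's least-squares solver (the training data below is made up):

```python
import numpy as np

# made-up training data: two attributes x1, x2 and a numeric target y
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.1, 3.9, 5.2, 7.8])

# prepend a column of ones so that w0 becomes the intercept
X1 = np.column_stack([np.ones(len(X)), X])

# solve for w = (w0, w1, w2) minimizing the sum of squared errors
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(w)        # fitted weights w0, w1, w2
print(X1 @ w)   # predictions on the training data
```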
Heiko Paulheim 19
Linear Regression: Multi Dimensional Example
Heiko Paulheim 20
Linear Regression vs. k-NN Regression
- Recap: Linear regression extrapolates, k-NN interpolates
(Figure: a query point x for which we want a prediction; the three nearest neighbors, the prediction of 3-NN, and the prediction of linear regression are marked.)
Heiko Paulheim 21
Linear Regression Examples
Heiko Paulheim 22
Linear Regression and Overfitting
- Given two regression models
– One using five variables to explain a phenomenon – Another one using 100 variables
- Which one do you prefer?
- Recap: Occam’s Razor
– out of two theories explaining the same phenomenon, prefer the simpler one
Heiko Paulheim 23
Ridge Regression
- Linear regression only minimizes the errors on the training data
– i.e.,
- With many variables, we can have a large set of very small wi
– this might be a sign of overfitting!
- Ridge Regression:
– introduces regularization – creates a simpler model by penalizing large weights: minimize
∑_all examples (w0 + w1·x1 + w2·x2 + ... + wn·xn − y)²

∑_all examples (w0 + w1·x1 + w2·x2 + ... + wn·xn − y)² + λ · ∑_all variables wi²
Heiko Paulheim 24
Lasso Regression
- Ridge Regression optimizes
- Lasso Regression optimizes
- Observation
– Predictive performance is pretty similar – Ridge Regression yields small, but non-zero coefficients – Lasso Regression drives some coefficients to exactly zero
∑_all examples (w0 + w1·x1 + w2·x2 + ... + wn·xn − y)² + λ · ∑_all variables wi²

∑_all examples (w0 + w1·x1 + w2·x2 + ... + wn·xn − y)² + λ · ∑_all variables |wi|
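A hedged sketch comparing the two with scikit-learn's Ridge and Lasso (the data and the regularization strengths are made up; the alpha parameter corresponds to λ above):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# made-up data: 20 attributes, but only the first two actually matter
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # penalizes the sum of squared weights
lasso = Lasso(alpha=0.1).fit(X, y)   # penalizes the sum of absolute weights

print(np.round(ridge.coef_, 3))  # many small, non-zero coefficients
print(np.round(lasso.coef_, 3))  # most coefficients are exactly zero
```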
Heiko Paulheim 25
…but what about Non-linear Problems?
Heiko Paulheim 26
Isotonic Regression
- Special case:
– Target function is monotone (non-decreasing)
- i.e., f(x1) ≤ f(x2) for x1 < x2
– For that class of problems, efficient algorithms exist
- Simplest: Pool Adjacent Violators Algorithm (PAVA)
Heiko Paulheim 27
Isotonic Regression
- Identify adjacent violators, i.e., f(xi) > f(xi+1)
- Replace them with new values f'(xi) = f'(xi+1)
so that the sum of squared errors is minimized
– ...and pool them, i.e., they are going to be handled as one point
- Repeat until no more adjacent violators are left
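A compact sketch of the pool adjacent violators idea (the helper name pava and the sample input are my own):

```python
import numpy as np

def pava(y):
    """Pool Adjacent Violators: non-decreasing fit to y minimizing squared error."""
    # each block stores [mean, number of pooled points]
    blocks = []
    for value in y:
        blocks.append([float(value), 1])
        # pool as long as the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    # expand pooled blocks back to one fitted value per input point
    return np.concatenate([np.full(n, m) for m, n in blocks])

print(pava([1, 3, 2, 4, 5, 5, 2]))  # [1. 2.5 2.5 4. 4. 4. 4.]
```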
Heiko Paulheim 28
Isotonic Regression
- After pooling, the adjusted values satisfy f'(xi) ≤ f'(xi+1) for every i
– Connect the points with a piecewise linear function
Heiko Paulheim 32
Isotonic Regression
- Comparison to the original points
– Plateaus exist where the points are not monotone – Overall, the mean squared error is minimized
Heiko Paulheim 33
…but what about non-linear, non-monotone Problems?
Heiko Paulheim 34
Possible Option: New Attributes
- The attributes X for linear regression can be:
– Original attributes X – Transformations of original attributes, e.g. log, exp, square root, square, etc. – Polynomial transformation
- example: y = w0 + w1·x + w2·x² + w3·x³
– Basis expansions – Interactions between variables
- example: x3 = x1 · x2
- This allows use of linear regression techniques to fit
much more complicated non-linear datasets.
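A sketch of this idea, e.g., with scikit-learn's PolynomialFeatures and LinearRegression (the data and the chosen degree are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# made-up non-linear data: y is roughly cubic in x
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + np.random.default_rng(0).normal(scale=0.5, size=50)

# expand x into [1, x, x², x³], then fit an ordinary linear regression on these features
X_poly = PolynomialFeatures(degree=3).fit_transform(x)
model = LinearRegression().fit(X_poly, y)

print(model.coef_, model.intercept_)  # fitted weights for the expanded features
```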
Heiko Paulheim 35
Example with Polynomially Transformed Attributes
Heiko Paulheim 36
Support Vector Machines Revisited
- Find the hyperplane that maximizes the margin => B1 is better than B2
(Figure: hyperplanes B1 and B2 with margin boundaries b11, b12 and b21, b22; B1 has the larger margin.)
Heiko Paulheim 37
Linear Regression and SVM
- Linear Regression
– find a linear function that minimizes the distance to data points w.r.t. the attribute to predict
- Support Vector Machine
– find a linear function that maximizes the distance to data points (from different classes)
- Both problems are similar
– hence, many SVMs also support regression
Heiko Paulheim 38
Support Vector Regression
- Maximum margin hyperplane only applies to
classification
- However, idea of support vectors and kernel
functions can be used for regression
- Basic method same as in linear regression: want to
minimize error – Difference A: ignore errors smaller than ε and use absolute error instead of squared error – Difference B: simultaneously aim to maximize flatness of function
- User-specified parameter ε defines the “tube”
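A brief sketch using scikit-learn's SVR (the data and parameter values are made up; the epsilon parameter corresponds to ε above):

```python
import numpy as np
from sklearn.svm import SVR

# made-up one-dimensional regression data
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 10, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# epsilon controls the width of the tube: errors smaller than epsilon are ignored
svr_linear = SVR(kernel="linear", epsilon=0.5).fit(X, y)
svr_rbf    = SVR(kernel="rbf", epsilon=0.1).fit(X, y)

print(svr_linear.predict([[5.0]]), svr_rbf.predict([[5.0]]))
```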
Heiko Paulheim 39
Examples
(Figures: SVR fits with ε = 2, ε = 1, and ε = 0.5.)
Heiko Paulheim 40
Local Regression
- Assumption: non-linear problems are approximately linear
in local areas
– idea: use linear regression locally – only for the data point at hand (lazy learning)
Heiko Paulheim 41
Local Regression
- A combination of
– k nearest neighbors – local regression
- Given a data point
– retrieve the k nearest neighbors – compute a regression model using those neighbors – locally weighted regression: uses distance as weight for error computation
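A minimal numpy sketch of locally weighted regression for a single query point (the helper name, the inverse-distance weighting, and the data are my own choices):

```python
import numpy as np

def local_regression(X, y, query, k=7):
    """Fit a weighted linear model on the k nearest neighbors of `query`
    and return the prediction for `query`."""
    dist = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(dist)[:k]
    # weight closer neighbors more strongly (inverse distance, one choice among many)
    w = 1.0 / (dist[nearest] + 1e-9)
    Xn = np.column_stack([np.ones(k), X[nearest]])  # add intercept column
    # weighted least squares via the sqrt-weight trick
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Xn * sw[:, None], y[nearest] * sw, rcond=None)
    return np.concatenate([[1.0], query]) @ coef

# made-up non-linear data
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()
print(local_regression(X, y, np.array([3.0])))  # close to sin(3.0)
```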
Heiko Paulheim 42
Local Regression
- Advantage: fits non-linear models well
– good local approximation – often more exact than pure k-NN
- Disadvantage
– runtime – for each test example:
- find k nearest neighbors
- compute a local model
Heiko Paulheim 43
Combining Decision Trees and Regression
- Idea: split data first so that it becomes “more linear”
- example: fuel consumption by car weight
(Figure: scatter plot of fuel consumption vs. car weight.)
Heiko Paulheim 44
Combining Decision Trees and Regression
- Idea: split data first so that it becomes “more linear”
- example: fuel consumption by car weight
(Figure: fuel consumption vs. car weight, with the data split by fuel type into benzine and diesel.)
Heiko Paulheim 45
Combining Decision Trees and Regression
- Observation:
– by cleverly splitting the data, we get more accurate linear models
- Regression trees:
– decision tree for splitting data – constants as leaves
- Model trees:
– more advanced – linear functions as leaves
(Figure: model tree splitting on fuel type, with linear leaf models y = 0.005x + 1 and y = 0.01x + 2 for diesel and benzine.)
Heiko Paulheim 46
Regression Trees
- Differences to classification decision trees:
– Splitting criterion: minimize intra-subset variation – Termination criterion: standard deviation becomes small – Pruning criterion: based on numeric error measure – Prediction: leaf predicts the average target value of its instances
- Easy to interpret
- Resulting model: piecewise constant function
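As an illustration, scikit-learn's DecisionTreeRegressor builds such a piecewise constant model (data and parameters below are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# made-up data: fuel consumption as a noisy function of car weight
rng = np.random.default_rng(2)
weight = rng.uniform(800, 2500, size=(200, 1))
fuel = 3.0 + 0.004 * weight.ravel() + rng.normal(scale=0.3, size=200)

# limit depth / leaf size to avoid overfitting
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10).fit(weight, fuel)

print(tree.predict([[1000.0], [2000.0]]))  # piecewise constant predictions
```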
Heiko Paulheim 47
Model Trees
- Build a regression tree
– For each leaf learn linear regression function
- Need linear regression function at each node
- Prediction: go down tree, then apply function
- Resulting model: piecewise linear function
Heiko Paulheim 48
Local Regression and Regression/Model Trees
- Assumption: non-linear problems are approximately linear
in local areas
– idea: use linear regression locally – only for the data point at hand (lazy learning)
(Figure: piecewise constant fit of a regression tree vs. piecewise linear fit of a model tree.)
Heiko Paulheim 49
Building the Tree
- Splitting: standard deviation reduction
- Termination:
– Standard deviation < 5% of its value on full training set – Too few instances remain (e.g. < 4)
- Pruning:
– Proceed bottom up:
- Compute LR model at internal node
- Compare LR model error to error of subtree
- Prune if the subtree's error is not significantly smaller
– Heavy pruning: single model may replace whole subtree
SDR = sd(T) − ∑_i ( |Ti| / |T| ) · sd(Ti)
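A small sketch computing the standard deviation reduction for a candidate split (the helper name and the data are made up):

```python
import numpy as np

def sdr(y, left_mask):
    """Standard deviation reduction of splitting targets y into left/right subsets."""
    subsets = [y[left_mask], y[~left_mask]]
    return np.std(y) - sum(len(s) / len(y) * np.std(s) for s in subsets)

# made-up targets and a candidate split for x = 1..10
x = np.arange(1, 11)
y = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.5, 6.0])
print(sdr(y, x <= 9))  # a higher SDR means a better split
```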
Heiko Paulheim 50
Model Tree Learning Illustrated
- Standard deviation of complete value set: 3.08
- Standard deviation of two subsets after split x>9: 1.22
– Standard deviation reduction: 1.86 – This is the best split
Heiko Paulheim 51
Model Tree Learning Illustrated
- Assume that we have split further (min. 4 instances per leaf)
– Standard deviation reduction for the new splits is still 0.57
- Resulting model tree:
- The error of the linear models at the inner nodes is the same as
that of the subtrees below them → prune
(Figure: model tree with root split x<9, inner splits x<4.5 and x<13.5, and leaf models y=0.5x, y=0.5x, y=0.5x+1, y=0.5x+1; the linear models at the inner nodes are y=0.5x and y=0.5x+1.)
Heiko Paulheim 52
Model Tree Learning Illustrated
- Assume that we have split further (min. 4 instances per leaf)
– Standard deviation reduction for the new splits is still 0.57
- Resulting model tree:
- The error of the root node is larger than
that of the leaf nodes → keep leaf nodes
(Figure: pruned tree with root split x<9 and leaf models y=0.5x and y=0.5x+1; the linear model at the root would be y=0.59x − 0.29.)
Heiko Paulheim 53
Model Tree Learning Illustrated
(Figure: final model tree: root split x<9 with leaf models y=0.5x and y=0.5x+1.)
Heiko Paulheim 54
Rules from Model Trees
- Recap: PART algorithm generates classification rules by building
partial decision trees
- M5Rules uses the same method to build rule sets for regression
– Use model trees instead of decision trees – Use variance instead of entropy to choose node to expand when building partial tree
- Rules will have linear models on right-hand side
Heiko Paulheim 55
Comparison
Heiko Paulheim 56
Comparison – Linear and Isotonic Regression
Heiko Paulheim 57
Comparison – SVM with Linear and RBF Kernel
Heiko Paulheim 58
Comparison – M5’ Regression and Model Tree
Heiko Paulheim 59
k-NN and Local Polynomial Regression (k=7)
Heiko Paulheim 60
Artificial Neural Networks Revisited
X1 X2 X3 | Y
 0  0  0 | 0
 0  0  1 | 0
 0  1  0 | 0
 0  1  1 | 1
 1  0  0 | 0
 1  0  1 | 1
 1  1  0 | 1
 1  1  1 | 1
(Figure: black box with inputs X1, X2, X3 and output Y.)
Output Y is 1 if at least two of the three inputs are equal to 1.
Heiko Paulheim 61
Artificial Neural Networks Revisited
(Figure: the same truth table; the black box is a perceptron with input nodes X1, X2, X3, weights 0.3 each, and an output node with threshold t = 0.4.)
Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0), where I(z) = 1 if z is true, 0 otherwise
Heiko Paulheim 62
Artificial Neural Networks Revisited
- This final function was used to separate two classes:
- However, we may simply use it to predict a numerical value
(between 0 and 1) by changing it to:
Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0), where I(z) = 1 if z is true, 0 otherwise

Y = 0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4
Heiko Paulheim 63
Artificial Neural Networks for Regression
- What has changed:
– we do not use a cutoff for 0/1 predictions – but leave the numbers as they are
- Training examples:
– attribute vectors – not with a class label, but numerical target
- Error measure:
– Not classification error, but mean squared error
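A tiny sketch of training such a linear unit for regression with gradient descent on the mean squared error (the learning rate and data below are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# made-up data: target is a linear combination of three inputs plus noise
X = rng.uniform(0, 1, size=(200, 3))
y = 0.3 * X[:, 0] + 0.3 * X[:, 1] + 0.3 * X[:, 2] - 0.4 + rng.normal(scale=0.01, size=200)

w = np.zeros(3)
b = 0.0
lr = 0.1

for _ in range(2000):
    pred = X @ w + b                 # no cutoff: the raw value is the prediction
    grad = pred - y                  # derivative of 0.5 * squared error
    w -= lr * X.T @ grad / len(y)    # gradient descent step for the weights
    b -= lr * grad.mean()            # ... and for the bias

print(w, b)  # should end up close to (0.3, 0.3, 0.3) and -0.4
```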
Heiko Paulheim 64
Artificial Neural Networks for Regression
(Figure: the same perceptron with weights 0.3, 0.3, 0.3 and threshold t = 0.4, now used without the cutoff.)
Y = 0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4
Heiko Paulheim 65
Artificial Neural Networks for Regression
- Given that our target formula is of the form
- we can learn only linear problems
– i.e., the target variable is a linear combination of the input variables
- More complex regression problems can be approximated
– by combining several perceptrons
- in neural networks: hidden layers
– this allows for arbitrary functions
- Hear more about ANNs in a few weeks!
Y = 0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4
Heiko Paulheim 66
Summary
- Regression
– predict numerical values instead of classes
- Performance measuring
– absolute or relative error, correlation, …
- Methods
– k nearest neighbors – linear regression – isotonic regression – SVMs – model trees – artificial neural networks
Heiko Paulheim 67