Overfitting. Validation process. Ettore Lanzarone - PowerPoint PPT Presentation
SLIDE 1

MEDICAL SUPPORT SYSTEMS FOR CHRONIC DISEASES
Engineering and Management for Health, University of Bergamo

Lesson 3: Overfitting. Validation process.
Ettore Lanzarone, March 18, 2020

Overfitting

SLIDE 2

Overfitting

Linear regression

R² = 0.6624

A decent model fit, even though R² is not close to 1. The residuals are normally distributed, and this confirms the goodness of fit.

Overfitting

Polynomial regression with a high-order polynomial

R² = 0.72

Is it a better model? No: the model also captures the variability of the data due to the random errors associated with the observations.
SLIDE 3

Overfitting

Modeling techniques tend to overfit the data. For example, let us consider multiple regression:

  • Every new variable added to the regression model increases the R² (a small sketch after this list illustrates this).
  • Interpretation: every additional predictive variable appears to explain yet more of the variability of the data. But this apparent gain cannot be genuine!

Solutions:

  • Expert evaluation.
  • Proper validation techniques, not only simple metrics such as the R².
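As a minimal illustration of the first point in the list above (a hedged sketch with simulated, hypothetical data; none of it comes from the lesson's material), adding a predictor that is pure noise still increases the in-sample R²:

```r
# Hypothetical illustration: the in-sample R-squared never decreases when a
# predictor is added, even if the new predictor is pure noise.
set.seed(1)
n  <- 30
x1 <- rnorm(n)
y  <- 2 * x1 + rnorm(n)   # the "true" relationship depends on x1 only
x2 <- rnorm(n)            # irrelevant variable, unrelated to y

summary(lm(y ~ x1))$r.squared       # baseline R-squared
summary(lm(y ~ x1 + x2))$r.squared  # slightly larger, although x2 is noise
```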

Overfitting

The solution is to assess the capability of the model to predict new data, which is actually the goal of estimating a model. You estimate the model on the black points, and then new blue points arrive. One fit has a lower R², the other a higher R²: which one do you prefer?

SLIDE 4

Overfitting

  • As models are used for prediction purposes, it is useful to evaluate their capacity to make accurate predictions for new samples of data. This process is referred to as "validation".
  • One way is to literally obtain new observations. Once the model is developed from the original sample, a new study is conducted (replicating the original one as closely as possible) and the new data are used to assess the predictive validity of the model.
  • This procedure is usually viewed as impractical because of the requirement to conduct a new study, as well as the difficulty in truly replicating the original study.
  • An alternative, more practical procedure is cross-validation.

Cross-validation


SLIDE 5

Cross-validation

In cross-validation the original dataset is split into two parts:

  • Training set: the part of the dataset used to develop the model.
  • Testing (or validation) set: the part of the dataset used to assess the goodness of fit.

A valid model should show good predictive accuracy. While the R² offers no protection against overfitting, cross-validation inherently offers such protection.

How to divide the dataset?

Cross-validation

QUESTION 1: Which portion of the sample should be in each part?

  • If the sample size is very large, it is often best to split the sample in half.
  • For smaller samples, it is more conventional to split the sample such that 2/3 of the observations are in the training set and 1/3 are in the testing set.
  • This also depends on the cross-validation technique.
SLIDE 6

Cross-validation

QUESTION 2: How should the sample be split?

  • The most common approach is to randomly divide the dataset, thus theoretically eliminating any systematic differences.
  • One alternative is to define matched pairs of subjects in the original dataset and to assign one member of each pair to the training set and the other to the testing set.
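As a minimal sketch, the most common (random) split can be obtained in R as follows; the data frame name `dataset` and the 2/3 - 1/3 proportion are illustrative assumptions, not the lesson's own data:

```r
# Randomly divide a dataset into a training set (2/3) and a testing set (1/3).
set.seed(123)                                     # for reproducibility
n         <- nrow(dataset)
train_idx <- sample(seq_len(n), size = round(2 * n / 3))

training_set <- dataset[train_idx, ]              # used to develop the model
testing_set  <- dataset[-train_idx, ]             # used to assess predictive accuracy
```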

Cross-validation

1. Model development (propose structure and parameters)
2. Cross-validation
ITERATE UNTIL THE VALIDATION RESULTS ARE ACCEPTABLE
3. Model fitting considering the entire dataset: once the model is validated, all the data can be used to determine the model parameters
4. Model applicable in real cases (e.g., integrated in a machine)

SLIDE 7

Cross-validation

Cross-Validation metrics

Let us consider each point i in the testing set. The quality of the prediction is measured through the error between the observed value and the value predicted by the model.

Cross-validation

Cross-Validation metrics

There exist several indicators based on this error. They can be used to compare alternative models. The best model is associated with the lowest value of the adopted indicator.
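The exact formulas of these indicators are shown graphically on the slides; as a hedged example, a small R helper computing three common ones (the MSE is the indicator used in the LOOCV example below, while RMSE and MAE are listed here only for illustration):

```r
# Cross-validation metrics computed from the errors on the testing set:
# y_obs are the observed values, y_hat the values predicted by the model.
cv_metrics <- function(y_obs, y_hat) {
  e <- y_obs - y_hat                # prediction error for each test point
  c(MSE  = mean(e^2),               # mean squared error
    RMSE = sqrt(mean(e^2)),         # root mean squared error
    MAE  = mean(abs(e)))            # mean absolute error
}
```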

SLIDE 8

Cross-validation

Cross-Validation metrics


Cross-validation

Cross-Validation metrics

SLIDE 9

Cross-validation

Leave-one-out Cross Validation (LOOCV)

Cross-validation

Example: Model 1 – linear regression

SLIDE 10

Cross-validation

Example: Model 2 – quadratic regression

This second model is associated with a lower MSE; thus, it provides a better fit.

Cross-validation

Example: Model 3 – join the dots

A bad model, included just to provide an example. Even though this model passes exactly through all the observed points (the in-sample error is zero), it is the worst one.

SLIDE 11

Cross-validation

LOOCV is a good approach because, at every iteration, the training set is "similar" to the entire dataset (it contains almost the same number of points). As the quality of the fit depends on the amount of data, this preserves that information. However, LOOCV is computationally expensive: the model has to be estimated as many times as the cardinality of the dataset (the number of observations in the dataset).
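A minimal LOOCV sketch in R, echoing the example above by comparing a linear and a quadratic regression; the data frame `dataset` and its columns `x` and `y` are illustrative placeholders:

```r
# Leave-one-out cross-validation: estimate the model n times, each time leaving
# one observation out and predicting it with the model fitted on the rest.
loocv_mse <- function(formula, dataset) {
  n      <- nrow(dataset)
  errors <- numeric(n)
  for (i in seq_len(n)) {
    fit       <- lm(formula, data = dataset[-i, ])        # training set: all points but i
    pred      <- predict(fit, newdata = dataset[i, ])     # prediction for the left-out point
    errors[i] <- dataset$y[i] - pred                      # error on the left-out point
  }
  mean(errors^2)                                          # LOOCV mean squared error
}

loocv_mse(y ~ x, dataset)            # Model 1: linear regression
loocv_mse(y ~ x + I(x^2), dataset)   # Model 2: quadratic regression
```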

Cross-validation

K-Fold Cross Validation

1. Split the sample into k subsets of equal size.
2. For each fold, estimate a model on all the subsets except one.
3. Use the left-out subset to test the model, calculating a cross-validation metric of choice.
4. Average the CV metric across subsets to get the cross-validation error.

K-fold validation reduces the computational effort (fewer replications are required) while, at the same time, it tries to keep a relevant number of observations in the training set at each replication.

SLIDE 12

Cross-validation

K-Fold Cross Validation

Example:
1. Split the data into 5 samples.
2. Fit a model to the training subsets and use the testing subset to calculate a cross-validation metric.
3. Repeat the process for the next sample, until all samples have been used to test the model.
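A hedged 5-fold sketch in R following the steps above; the data frame `dataset`, its response `y`, and the model formula are illustrative assumptions:

```r
# K-fold cross-validation (here k = 5): each fold is used exactly once as testing set.
set.seed(123)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dataset)))    # random fold assignment

fold_mse <- sapply(1:k, function(j) {
  fit  <- lm(y ~ x, data = dataset[folds != j, ])         # train on the other k-1 folds
  pred <- predict(fit, newdata = dataset[folds == j, ])   # test on the left-out fold
  mean((dataset$y[folds == j] - pred)^2)                  # fold-specific MSE
})

mean(fold_mse)   # cross-validation error: the CV metric averaged across folds
```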

Cross-validation

Monte Carlo cross validation

The dataset is randomly split into training and validation data, and the results are then averaged over the splits. In a stratified variant of this approach, the random samples are generated in such a way that the mean response value (i.e., the dependent variable in the regression) is equal in the training and testing sets.
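A minimal sketch of the (non-stratified) Monte Carlo variant in R, with repeated random 2/3 - 1/3 splits; names and proportions are again illustrative:

```r
# Monte Carlo cross-validation: repeat a random training/testing split several
# times and average the resulting metric over the splits.
set.seed(123)
n_splits <- 100
mse <- replicate(n_splits, {
  idx  <- sample(seq_len(nrow(dataset)), size = round(2 * nrow(dataset) / 3))
  fit  <- lm(y ~ x, data = dataset[idx, ])          # training part
  pred <- predict(fit, newdata = dataset[-idx, ])   # validation part
  mean((dataset$y[-idx] - pred)^2)
})
mean(mse)   # result averaged over the random splits
```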

SLIDE 13

Cross-validation

Other cross-validation approaches exist. All of them are based on the same idea of dividing the dataset and computing the metrics mentioned above.

Simulated and real data


SLIDE 14

Simulated and real data

When the data are complex, it is also useful to first use simulated data. The overall process is:

1. Validate the model on simulated data
2. Validate the model on real acquired data
   • In-sample validation
   • Out-of-sample validation (the actual cross-validation)

Some steps can be omitted depending on the features of the case.

Simulated and real data

Validate the model on simulated data

1. Assume a set of realistic model parameters
2. Define the model for the dataset

SLIDE 15

Simulated and real data

Validate the model on simulated data

1. Assume a set of realistic model parameters
2. Define the model for the dataset
3. Generate a simulated dataset considering the model and the assumed parameters

Simulated and real data

Validate the model on simulated data

1. Assume a set of realistic model parameters
2. Define the model for the dataset
3. Generate a simulated dataset considering the model and the assumed parameters
4. Estimate the model parameters based on the simulated data
5. Compare the estimated parameters with the assumed ones
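A hedged sketch in R of this simulated-data check; the linear model structure and all numbers below are illustrative assumptions, not the models used in the lesson:

```r
# Steps 1-3: assume realistic parameters, define the model, generate a simulated dataset.
set.seed(42)
a_true <- 1.5; b_true <- 0.8; sigma_true <- 0.3      # assumed "true" parameters
x <- runif(100, 0, 10)
y <- a_true + b_true * x + rnorm(100, sd = sigma_true)

# Step 4: estimate the model parameters on the simulated data.
fit <- lm(y ~ x)

# Step 5: compare the estimated parameters with the assumed ones.
coef(fit)      # point estimates, to be compared with a_true and b_true
confint(fit)   # do the confidence intervals contain the assumed values?
```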

SLIDE 16

Simulated and real data

Compare the estimated parameters with the assumed ones

  • When we have point estimates, we can evaluate the error between the estimated value and the assumed value.
  • When we have a probability density function, we can consider the position of the assumed value with respect to the confidence interval of the estimate.

This analysis allows us to assess the effectiveness of the estimation approach:

  • Is the set of coefficients identifiable?
  • Is the approach suitable?

However, it assumes that the model perfectly describes the real problem and the real data (the observations are mathematically generated from the same model).

Simulated and real data

Compare the estimated parameters with the assumed ones

It neglects the gap between the model and reality: each model is an abstraction of reality, which includes assumptions and simplifications. This analysis is therefore only preliminary, devoted to assessing the technical functioning of the modeling and estimation framework.

SLIDE 17

Simulated and real data

Validate the model on real acquired data

This analysis includes the gap between reality and the mathematical description. The best option is to consider the out-of-sample validation (i.e., the cross-validation approaches mentioned above). If the results of the out-of-sample validation are not satisfying, analyze the reasons; the in-sample validation can provide useful insights. For example, if the in-sample validation is satisfying, it could mean that a larger number of observations is required but that the structure of the model and the approach are appropriate.

R

A statistical tool: R

SLIDE 18

R

R is an open-source platform to perform statistical analyses. It includes several packages for any type of statistical analysis. Here we will see how to use R with reference to two specific examples:
1. Linear regression with variable selection (AIC)
2. Bayesian analysis

R

1) Linear regression with variable selection (AIC)

Example from a medical study: data from ascending aorta aneurysmal tissues. The goal is to model and estimate the coefficients of a stress-stretch curve, the ultimate stress, and the ultimate stretch based on the patient's characteristics:

  • Gender (male or female)
  • Age
  • BMI = Weight / Height²
  • Presence of hypertension
  • Presence of bicuspid aortic valve
  • Presence of aortic prosthesis
  • Diameter of the aortic valve

DATA SOURCE: Auricchio, F., Ferrara, A., Lanzarone, E., Morganti, S., & Totaro, P. (2017). A regression method based on noninvasive clinical data to predict the mechanical behavior of ascending aorta aneurysmal tissue. IEEE Transactions on Biomedical Engineering, 64(11), 2607-2617.
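A hedged sketch of AIC-based variable selection in R with step(); the data frame `aorta`, the response name, and the predictor names below are illustrative placeholders, not the actual files used in the lesson:

```r
# Stepwise selection driven by the AIC, starting from the full linear model
# with all the patient characteristics listed above as candidate predictors.
full_model <- lm(UltimateStress ~ Gender + Age + BMI + Hypertension +
                   BicuspidValve + Prosthesis + ValveDiameter,
                 data = aorta)

selected_model <- step(full_model, direction = "both")   # AIC-guided selection
summary(selected_model)                                   # coefficients of the retained variables
```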

SLIDE 19

R

Input observations

R

Output observations

The model coefficients have been precomputed with a direct fitting on the curve for each patient.

SLIDE 20

R

The RStudio interface: script editing, console, and additional windows.

R

OPEN THE PROGRAM AND PLAY WITH …

Free download:
  • R program: www.r-project.org
  • RStudio (advanced editor): www.rstudio.com

SLIDE 21

R

The linear model can also be used to model other functions which are linear at least in the coefficients.

Example: if you want to include an additive quadratic term a·xi², the model is not linear with respect to the input variables, but it is linear with respect to the coefficients → compute the observations xi² based on xi and insert them as additional input variables.

The same can be applied to the output variable. Example: if you know that the variable is positive → compute the logarithmic observations log(yi) based on yi and use them.
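A short sketch of both tricks in R (illustrative data frame and variable names):

```r
# Additive quadratic term: the model remains linear in the coefficients, so the
# squared observations are simply included as an additional input variable.
fit_quad <- lm(y ~ x + I(x^2), data = dataset)

# Positive output variable: model its logarithm instead of the variable itself.
fit_log <- lm(log(y) ~ x, data = dataset)
```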

R

2) Bayesian approach

Example from an industrial application: power consumption of a machine. The goal is to estimate the coefficients of the model for the tested machine. We assume that the observations follow a density centered on the value computed from the model.

SLIDE 22

R

No need to write the likelihood: the RSTAN package does it for you.

R

The observations are provided in an associated data file. Then, the computation is launched with a script.
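A hedged RSTAN sketch with a generic linear model and normally distributed observations; the actual machine model, priors, and data file from the lesson are not reproduced here, so every name below is an illustrative assumption:

```r
library(rstan)

# Stan program passed as a string: the likelihood is declared, not coded by hand.
model_code <- "
data {
  int<lower=1> N;
  vector[N] x;      // input quantity (e.g., machine operating condition)
  vector[N] y;      // observed power consumption
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  y ~ normal(a + b * x, sigma);   // observations centered on the model value
}
"

# The observations would normally be read from the associated data file.
data_list <- list(N = length(y_obs), x = x_obs, y = y_obs)

fit <- stan(model_code = model_code, data = data_list, chains = 4, iter = 2000)
print(fit)   # posterior summaries of a, b and sigma
```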

SLIDE 23

R

OPEN THE PROGRAM AND PLAY WITH …

Free download:
  • R program: www.r-project.org
  • RStudio (advanced editor): www.rstudio.com