Lesson 3: Validation process. Overfitting
MEDICAL SUPPORT SYSTEMS FOR CHRONIC DISEASES
Engineering and Management for Health, University of Bergamo
Ettore Lanzarone, March 18, 2020
Linear regression
A decent model fit, even though R2 is not close to 1. Residuals are normally distributed, and this confirms the goodness of fit.
Polynomial regression with a high degree
Is it a better model? No: the model also includes the variability of the data due to random errors associated with the observations.
Modeling techniques tend to overfit the data. For example, in multiple regression adding predictors can only increase R2, so the model ends up explaining part of the variability of the data that is due to noise (variance). But this variability cannot be true structure!
The solution is to assess the capability of the model to predict new data; this is, after all, the goal of estimating a model. In the figure, the model is estimated on the black points, and then new blue points arrive: one model has a lower R2, the other a higher R2. Which one do you prefer?
A valid model should provide accurate predictions for new samples of data. This process is referred to as "validation".
In external validation, new data are collected and used to assess the predictive validity of the model. The main drawbacks are the cost and effort of a new study, as well as the difficulty in truly replicating the original study.
In cross-validation the original dataset is split into two parts:
- Training set: the part of the dataset used to develop the model
- Testing set: the part of the dataset used to assess the goodness of fit
A valid model should show good predictive accuracy. While R2 offers no protection against overfitting, cross-validation inherently offers protection against it. How to divide the dataset?
QUESTION 1 Which portion of the sample should be in each part?
QUESTION 2 How should the sample be split?
The sample should be split at random, so that the two parts do not show any systematic differences.
1. Model development (propose structure and parameters)
2. Cross-validation
ITERATE STEPS 1-2 UNTIL VALIDATION RESULTS ARE ACCEPTABLE
3. Model fitting considering the entire dataset: once validated, all the data can be used to determine the model parameters
4. Model applicable in real cases (e.g., integrated in a machine)
Let us consider each point i in the testing set. The quality of the prediction is measured through the error e_i = y_i - ŷ_i, i.e., the difference between the observed value and the value predicted by the model.
There exist several indicators based on this error. They can be used to compare alternative models. The best model is associated with the lowest value of the adopted indicator.
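As an illustration of such indicators, here is a minimal Python sketch (the data values are invented for illustration) computing the mean squared error (MSE) and mean absolute error (MAE) over a testing set, and using them to compare two alternative models:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over the testing set."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(e ** 2))

def mae(y_true, y_pred):
    """Mean absolute error over the testing set."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(e)))

# Observed test values and predictions from two hypothetical models
y_test = [2.0, 4.1, 6.0, 8.2]
pred_model1 = [2.2, 3.9, 6.3, 7.8]
pred_model2 = [2.1, 4.0, 6.1, 8.1]

# The best model is the one with the lowest indicator value
print(mse(y_test, pred_model1), mse(y_test, pred_model2))
```

Here the second model has the lower MSE, so it would be preferred under this indicator.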
Example Model 1 – linear regression
Example Model 2 – quadratic regression This second model is associated with a lower MSE. Thus, it provides a better goodness of fit.
Example Model 3 – join the dots. A bad model, just to provide an example. Even though it fits the training points exactly, this model is the worst one under cross-validation.
The LOOCV is a good approach because each time the data in the training set are "similar" to those of the entire dataset (about the same quantity). As the quality of the fitting depends on the amount of data, this preserves this information. However, the LOOCV is computationally expensive! The model has to be identified as many times as the cardinality of the dataset (number of observations).
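A minimal sketch of LOOCV in Python, assuming a polynomial model fitted with numpy (the dataset is a toy example, not from the lesson):

```python
import numpy as np

def loocv_mse(x, y, degree=1):
    """Leave-one-out CV: fit the model n times, each time leaving out
    one observation, and predict the left-out point with the fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sq_errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i              # training indices
        coeffs = np.polyfit(x[mask], y[mask], degree)
        y_hat = np.polyval(coeffs, x[i])           # predict the left-out point
        sq_errors.append((y[i] - y_hat) ** 2)
    return float(np.mean(sq_errors))               # LOOCV mean squared error

# Toy dataset: a noisy line; with n = 6 the model is fitted 6 times
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1 + np.array([0.1, -0.2, 0.05, 0.1, -0.1, 0.05])
print(loocv_mse(x, y, degree=1))
```

The same function can be called with different `degree` values to compare candidate models: the one with the lowest LOOCV error is preferred.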
1. Split the sample into k subsets of equal size
2. For each fold, estimate a model on all the subsets except one
3. Use the left-out subset to test the model, by calculating a cross-validation metric of choice
4. Average the CV metric across subsets to get the cross-validation error
This k-fold validation reduces the computational effort (fewer replications required) while at the same time keeping a relevant number of observations in the training set at each replication.
Example:
1. Split the data into 5 samples
2. Fit a model to the training subsets and use the testing subset to calculate a cross-validation metric
3. Repeat the process for the next sample, until all samples have been used to test the model
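The k-fold procedure above can be sketched in Python as follows (a simple linear model on invented data; the fold assignment is a random permutation split into k blocks):

```python
import numpy as np

def kfold_mse(x, y, k=5, degree=1, seed=0):
    """k-fold CV: randomly split into k folds; fit on k-1 folds,
    test on the remaining one; average the metric across folds."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    idx = np.random.default_rng(seed).permutation(len(x))
    fold_errors = []
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)    # all other observations
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        e = y[test_idx] - np.polyval(coeffs, x[test_idx])
        fold_errors.append(np.mean(e ** 2))
    return float(np.mean(fold_errors))             # cross-validation error

x = np.linspace(0.0, 10.0, 50)
y = 3 * x - 2 + np.random.default_rng(1).normal(0.0, 0.5, size=50)
print(kfold_mse(x, y, k=5))
```

With 50 observations and k = 5, only 5 model fits are needed, instead of the 50 required by LOOCV.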
Repeated random splits: the dataset is randomly split into training and validation data, and the results are then averaged over the splits. In a stratified variant of this approach, the random samples are generated in such a way that the mean response value (i.e., the dependent variable in the regression) is equal in the training and testing sets.
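A sketch of the repeated-random-splits idea in Python (the split fraction and number of repetitions are arbitrary choices for illustration; stratification is not implemented here):

```python
import numpy as np

def random_split_cv(x, y, n_splits=20, test_frac=0.3, degree=1, seed=0):
    """Repeated random splits: average the test MSE over several
    random train/test partitions of the dataset."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * len(x)))
    mses = []
    for _ in range(n_splits):
        idx = rng.permutation(len(x))              # a fresh random split
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        e = y[test_idx] - np.polyval(coeffs, x[test_idx])
        mses.append(np.mean(e ** 2))
    return float(np.mean(mses))                    # averaged over the splits

x = np.linspace(0.0, 10.0, 50)
y = 3 * x - 2 + np.random.default_rng(1).normal(0.0, 0.5, size=50)
print(random_split_cv(x, y))
```

Unlike k-fold, the same observation may appear in several test sets, so the number of repetitions can be chosen independently of the dataset size.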
Other cross-validation approaches exist. All of them are based on the same idea of dividing the dataset and computing the metrics mentioned above.
When the data are complex, it is also useful to first use simulated data. The overall process is:
1. Validate the model on simulated data
2. Validate the model on real acquired data
Some parts can be omitted depending on the features of the case.
Validate the model on simulated data:
1. Assume a set of realistic model parameters
2. Define the model for the dataset
3. Generate a simulated dataset considering the model and the assumed parameters
4. Estimate the model parameters based on the simulated dataset
5. Compare the estimated parameters with the assumed ones
Compare the estimated parameters with the assumed ones, through:
- the difference between the estimated value and the assumed value;
- the position of the assumed value with respect to the confidence interval of the estimate.
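The whole simulated-data check can be sketched in Python (the "true" parameter values, noise level, and sample size are hypothetical, and ordinary least squares stands in for whatever estimation method is actually used):

```python
import numpy as np

# 1.-2. Assume realistic "true" parameters and define the model y = a + b*x
a_true, b_true, sigma = 1.0, 2.5, 0.3

# 3. Generate a simulated dataset from the model and the assumed parameters
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, 200)
y = a_true + b_true * x + rng.normal(0.0, sigma, size=200)

# 4. Estimate the parameters on the simulated data (ordinary least squares)
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# 5. Compare the estimates with the assumed values: with 200 points and
# moderate noise they should be close to (a_true, b_true) = (1.0, 2.5)
print(beta_hat)
```

If the estimates land far from the assumed values, or the assumed values fall outside the confidence intervals far more often than expected, the estimation framework itself is suspect.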
This analysis allows us to assess the effectiveness of the estimation approach. However, it assumes that the model perfectly describes the real problem and the real data (observations are mathematically generated considering the same model).
It neglects the gap between the model and reality. Each model is an abstraction of reality, which includes assumptions and simplifications. This analysis is thus only preliminary, devoted to assessing the technical functioning of the modeling and estimation framework.
Validate the model on real acquired data. This analysis includes the gap between reality and the mathematical description. The best option is out-of-sample validation (i.e., the cross-validation approaches mentioned above). If the results of the out-of-sample validation are not very satisfying, analyze the reasons; the in-sample validation can provide useful insights. For example, if the in-sample validation is satisfying, it could mean that a larger amount of data is needed.
R is an open-source platform to perform statistical analyses. It includes packages for virtually any type of statistical analysis. Here we will see how to use R with reference to two specific examples:
1. Linear regression with variable selection (AIC)
2. Bayesian analysis
1) Linear regression with variable selection (AIC). Example from a medical study: data from ascending aorta aneurysmal tissues. The goal is to model and estimate the coefficients of a stress-stretch curve, the ultimate stress, and the ultimate stretch based on the patient's characteristics:
DATA SOURCE Auricchio, F., Ferrara, A., Lanzarone, E., Morganti, S., & Totaro, P. (2017). A regression method based on noninvasive clinical data to predict the mechanical behavior of ascending aorta aneurysmal tissue. IEEE Transactions on Biomedical Engineering, 64(11), 2607-2617.
Input observations
Output observations. Model coefficients have been precomputed with a direct fitting on the curve for each patient.
RStudio interface: script editor, console, and additional windows.
Free download: R program: www.r-project.org Rstudio (advanced editor): www.rstudio.com
The linear model can also be used to model other functions which are linear at least in the coefficients. Example: if you want to include an additive quadratic term a*x_i^2, the model is not linear with respect to the input variables, but it is linear with respect to the coefficients: compute the observations x_i^2 based on x_i and insert them as additional input variables.
The same can be applied to the output variable. Example: if you know that the variable is positive, compute the logarithmic observations log(y_i) based on y_i and use them as the response.
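Both transformations can be sketched as follows (in Python rather than R, for concreteness; the data and coefficient values are invented):

```python
import numpy as np

# Quadratic term in the inputs: compute x_i**2 and add it as an extra
# column; the model remains linear in the coefficients.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, 100)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 0.1, size=100)

X = np.column_stack([np.ones_like(x), x, x**2])   # x^2 as a new input variable
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # should be close to the generating values (1.0, 2.0, 0.5)

# Transforming the output: if y is known to be positive, one can
# regress on log(y) instead of y (here y > 0 by construction).
z = np.log(y)
```

In R the same effect is obtained by adding the transformed columns to the data frame, or by using `I(x^2)` and `log(y)` directly in the model formula.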
2) Bayesian approach. Example from an industrial application: power consumption of a machine. The goal is to estimate the coefficients of the model for the tested machine. We assume that the observations follow a density centred on the value computed from the model.
There is no need to write the likelihood; the RSTAN package does it for you.
The observations are provided in an associated data file. Then, the computation is launched with a script.
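RStan itself is not reproduced here, but the underlying Bayesian idea, a posterior proportional to prior times likelihood, with the likelihood centred on the model value, can be illustrated with a simple grid approximation in Python (the model y = b*x, the data, the prior, and the noise level are all hypothetical):

```python
import numpy as np

# Simulated observations from y = b*x + eps, eps ~ N(0, sigma^2)
b_true, sigma = 1.8, 0.5
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 5.0, 40)
y = b_true * x + rng.normal(0.0, sigma, size=40)

# Grid over candidate values of b, with a vague normal prior N(0, 10^2)
b_grid = np.linspace(0.0, 4.0, 2001)
log_prior = -0.5 * (b_grid / 10.0) ** 2
# Log-likelihood: observations are normal, centred on the model value b*x
log_lik = np.array([-0.5 * np.sum(((y - b * x) / sigma) ** 2) for b in b_grid])

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())          # unnormalized posterior
post /= post.sum()                                # normalize over the grid

b_mean = float(np.sum(b_grid * post))             # posterior mean of b
print(b_mean)
```

RStan replaces the explicit grid with Markov chain Monte Carlo sampling, which scales to models with many coefficients where a grid would be infeasible.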