
A major risk in classification: overfitting (PowerPoint PPT Presentation)



  1. A major risk in classification: overfitting

  2. Assume we have a small data set

  3. We fit a model that separates red and blue

  4. When more data becomes available, we see that the model is poor

  5. A simpler model might have worked better

  6. A predictor always works best on the data set on which it was trained!

  7. Solution: divide data into training and test sets

  8. Solution: divide data into training and test sets Training data Best model for training data

  9. Solution: divide data into training and test sets Test data Evaluate model on test data

  10. Frequently used approach: k-fold cross-validation
      • Divide data into k equal parts
      • Use k-1 parts as training set, 1 as test set
      • Repeat k times, so each part has been used once as test set
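The k-fold procedure above can be sketched in R as follows. This is a minimal illustration; the small data frame `data` stands in for the slides' data table, and the model-fitting step is left as a comment.

```r
set.seed(42)  # for reproducibility of the random fold assignment

# Illustrative stand-in for the slides' data table `data`
data <- data.frame(x = rnorm(20), y = rnorm(20))

k <- 5

# Randomly assign each observation to one of k folds of equal size
folds <- sample(rep(1:k, length.out = nrow(data)))

for (i in 1:k) {
  train_data <- data[folds != i, ]  # k-1 parts for training
  test_data  <- data[folds == i, ]  # 1 part for testing
  # fit model on train_data and evaluate it on test_data here
}
```

Each observation lands in exactly one fold, so every data point is used for testing exactly once across the k repetitions.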

  11. Also: Leave-one-out cross-validation
      • Fit model on n-1 data points
      • Evaluate on remaining data point
      • Repeat n times, so each point has been left out once
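Leave-one-out cross-validation is the special case k = n. A minimal R sketch (again with an illustrative data frame in place of the slides' `data`, and the model fit left as a comment):

```r
set.seed(42)  # for reproducibility

# Illustrative stand-in for the slides' data table `data`
data <- data.frame(x = rnorm(10), y = rnorm(10))

n <- nrow(data)

for (i in 1:n) {
  train_data <- data[-i, ]               # n-1 points for fitting
  test_point <- data[i, , drop = FALSE]  # the single left-out point
  # fit model on train_data and evaluate it on test_point here
}
```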

  12. And: Repeated random sub-sampling validation
      • Randomly split data into training and test data sets
      • Train model on training set, evaluate on test set
      • Repeat multiple times, average over results

  13.–17. Random sub-sampling in R

  # We assume our data are stored in a data table called `data`.

  # Fraction of data used for training purposes (here: 40%)
  train_fraction <- 0.4

  # Number of observations in training set
  train_size <- floor(train_fraction * nrow(data))

  # Indices of observations to be used for training
  train_indices <- sample(1:nrow(data), size = train_size)

  # Extract training and test data
  train_data <- data[train_indices, ]   # get training data
  test_data  <- data[-train_indices, ]  # get test data
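Following slide 12, the split above can be repeated several times and the test errors averaged. The sketch below does this with a toy linear model; the simulated data frame, the use of `lm`, and the mean-squared-error metric are all illustrative choices, not part of the original slides.

```r
set.seed(42)  # for reproducibility

# Illustrative simulated data: y depends linearly on x plus noise
data <- data.frame(x = rnorm(100))
data$y <- 2 * data$x + rnorm(100)

train_fraction <- 0.4
n_repeats <- 10
errors <- numeric(n_repeats)

for (r in 1:n_repeats) {
  # Random training/test split, as on slides 13-17
  train_size <- floor(train_fraction * nrow(data))
  train_indices <- sample(1:nrow(data), size = train_size)
  train_data <- data[train_indices, ]
  test_data  <- data[-train_indices, ]

  # Fit on the training set, evaluate on the test set
  model <- lm(y ~ x, data = train_data)
  pred <- predict(model, newdata = test_data)
  errors[r] <- mean((test_data$y - pred)^2)  # test mean squared error
}

mean(errors)  # average test error over all repeats
```

Averaging over repeated random splits reduces the dependence of the error estimate on any single lucky or unlucky split.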
