A major risk in classification: overfitting



SLIDE 1

A major risk in classification: overfitting

SLIDE 2

Assume we have a small data set

SLIDE 3

We fit a model that separates red and blue

[Figure legend: red, blue]

SLIDE 4

When more data becomes available, we see that the model is poor

[Figure legend: red, blue]

SLIDE 5

A simpler model might have worked better

[Figure legend: red, blue]

SLIDE 6

A predictor almost always performs best on the data set on which it was trained!
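
The effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, and the 1-nearest-neighbour classifier (`predict_1nn`, a hypothetical helper written here in base R) is a stand-in for any very flexible model. Because it memorizes the training points, it classifies them perfectly, while its accuracy on held-out points can only be equal or lower.

```r
set.seed(1)  # reproducible simulated data (all values here are made up)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)

# Noisy labels: the class depends on the first feature plus random noise,
# so no classifier can be perfect on new data
label <- ifelse(x[, 1] + rnorm(n) > 0, "red", "blue")

# Split the observations into two halves
train_idx <- sample(seq_len(n), size = n / 2)
x_train <- x[train_idx, , drop = FALSE]
y_train <- label[train_idx]
x_test <- x[-train_idx, , drop = FALSE]
y_test <- label[-train_idx]

# 1-nearest-neighbour: predict the label of the closest training point
predict_1nn <- function(x_new) {
  apply(x_new, 1, function(p) {
    squared_dist <- rowSums(sweep(x_train, 2, p)^2)
    y_train[which.min(squared_dist)]
  })
}

# Each training point is its own nearest neighbour, so this is exactly 1
train_accuracy <- mean(predict_1nn(x_train) == y_train)

# Accuracy on points the classifier has never seen
test_accuracy <- mean(predict_1nn(x_test) == y_test)
```

Comparing `train_accuracy` with `test_accuracy` shows why performance measured on the training data is an overly optimistic estimate.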

SLIDE 7

Solution: divide data into training and test sets

SLIDE 8

Solution: divide data into training and test sets

Training data: find the best model for the training data

SLIDE 9

Solution: divide data into training and test sets

Test data: evaluate the model on the test data

SLIDE 10

Frequently used approach: k-fold cross-validation

  • Divide data into k equal parts
  • Use k–1 parts as training set, 1 as test set
  • Repeat k times, so each part has been used once as test set
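
The three steps above can be sketched in base R as follows. Everything specific here is an assumption made for illustration: the example data (two species from the built-in `iris` data set), the choice of k = 5, and logistic regression via `glm()` as a stand-in for whatever model is being evaluated.

```r
# Hypothetical example data: two iris species, so the task is binary
data <- droplevels(subset(iris, Species != "setosa"))

set.seed(1)  # make the random fold assignment reproducible
k <- 5

# Assign every row to one of k folds of (nearly) equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  train_data <- data[fold_id != i, ]  # k-1 parts for training
  test_data <- data[fold_id == i, ]   # 1 part for testing

  # Stand-in model: logistic regression on two predictors
  fit <- glm(Species ~ Sepal.Length + Sepal.Width,
             data = train_data, family = binomial)

  # Predicted probability of the second factor level ("virginica")
  prob <- predict(fit, newdata = test_data, type = "response")
  predicted <- ifelse(prob > 0.5, "virginica", "versicolor")

  fold_accuracy[i] <- mean(predicted == test_data$Species)
}

# Cross-validated accuracy: average over the k test folds
cv_accuracy <- mean(fold_accuracy)
```

Shuffling the fold labels with `sample()` matters when the rows are sorted by class, as they are in `iris`; without it, some folds would contain only one class.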

SLIDE 11

Also: Leave-one-out cross-validation

  • Fit model on n–1 data points
  • Evaluate on remaining data point
  • Repeat n times, so each point has been left out once
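
A minimal sketch of this procedure in base R, under the same kind of assumptions as any toy example: the data (two species from the built-in `iris` data set) and the model (logistic regression via `glm()`) are illustrative stand-ins, not part of the method itself.

```r
# Hypothetical example data: two iris species, so the task is binary
data <- droplevels(subset(iris, Species != "setosa"))
n <- nrow(data)

correct <- logical(n)
for (i in seq_len(n)) {
  # Fit the model on all data points except point i
  fit <- glm(Species ~ Sepal.Length + Sepal.Width,
             data = data[-i, ], family = binomial)

  # Evaluate on the single left-out point
  prob <- predict(fit, newdata = data[i, , drop = FALSE], type = "response")
  predicted <- ifelse(prob > 0.5, "virginica", "versicolor")
  correct[i] <- predicted == data$Species[i]
}

# Fraction of left-out points that were classified correctly
loo_accuracy <- mean(correct)
```

Leave-one-out is k-fold cross-validation with k = n, so the model is refit n times; for large data sets or slow models this can become expensive.
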
SLIDE 12

And: Repeated random sub-sampling validation

  • Randomly split data into training and test data sets
  • Train model on training set, evaluate on test set
  • Repeat multiple times, average over result
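
The whole loop can be sketched in base R as below. The example data, the number of repetitions (`n_repeats`), the helper name `accuracy_once`, and the logistic-regression model are all assumptions chosen for illustration.

```r
# Hypothetical example data: two iris species, so the task is binary
data <- droplevels(subset(iris, Species != "setosa"))

set.seed(42)           # reproducible random splits
train_fraction <- 0.4  # fraction of data used for training
n_repeats <- 50        # number of random splits (arbitrary choice)

# One round: random split, train on one part, evaluate on the other
accuracy_once <- function() {
  train_size <- floor(train_fraction * nrow(data))
  train_indices <- sample(seq_len(nrow(data)), size = train_size)
  train_data <- data[train_indices, ]
  test_data <- data[-train_indices, ]

  # Stand-in model: logistic regression on two predictors
  fit <- glm(Species ~ Sepal.Length + Sepal.Width,
             data = train_data, family = binomial)
  prob <- predict(fit, newdata = test_data, type = "response")
  predicted <- ifelse(prob > 0.5, "virginica", "versicolor")
  mean(predicted == test_data$Species)
}

# Repeat multiple times and average over the results
accuracies <- replicate(n_repeats, accuracy_once())
mean_accuracy <- mean(accuracies)
```

Unlike k-fold cross-validation, the random splits may overlap, so some observations can appear in many test sets and others in none.
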
SLIDE 13

Random sub-sampling in R

# We assume our data are stored in a data table called `data`.

SLIDE 14

Random sub-sampling in R

# We assume our data are stored in a data table called `data`.
# Fraction of data used for training purposes (here: 40%)
train_fraction <- 0.4

SLIDE 15

Random sub-sampling in R

# We assume our data are stored in a data table called `data`.
# Fraction of data used for training purposes (here: 40%)
train_fraction <- 0.4
# Number of observations in training set
train_size <- floor(train_fraction * nrow(data))

SLIDE 16

Random sub-sampling in R

# We assume our data are stored in a data table called `data`.
# Fraction of data used for training purposes (here: 40%)
train_fraction <- 0.4
# Number of observations in training set
train_size <- floor(train_fraction * nrow(data))
# Indices of observations to be used for training
train_indices <- sample(seq_len(nrow(data)), size = train_size)

SLIDE 17

Random sub-sampling in R

# We assume our data are stored in a data table called `data`.
# Fraction of data used for training purposes (here: 40%)
train_fraction <- 0.4
# Number of observations in training set
train_size <- floor(train_fraction * nrow(data))
# Indices of observations to be used for training
train_indices <- sample(seq_len(nrow(data)), size = train_size)
# Extract training and test data
train_data <- data[train_indices, ] # get training data
test_data <- data[-train_indices, ] # get test data