SLIDE 1
ECE 4524 Artificial Intelligence and Engineering Applications
Lecture 23: Learning Theory
Reading: AIAMA 18.4-18.5
Today's Schedule: Evaluating Hypotheses/Models; PAC Learning and Sample Complexity; Assumptions about Training and
SLIDE 2
SLIDE 3
Error Rate
We define the error rate as the proportion of mistakes made by h over a set of N examples:

    error(h) = (1/N) ∑_{i=1}^{N} 1[y_i ≠ h(x_i)]

where 1[·] is the indicator function.
◮ When this error rate is zero over the training set, h is said to
be consistent.
◮ It is always possible to find a hypothesis space H complex
enough so that some h ∈ H is consistent.
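The error-rate definition and the consistency check above can be sketched as follows (a minimal sketch; the function names and the toy hypothesis are illustrative, not from the lecture):

```python
def error_rate(h, examples):
    """Proportion of examples (x, y) on which hypothesis h makes a mistake."""
    mistakes = sum(1 for x, y in examples if h(x) != y)
    return mistakes / len(examples)

def is_consistent(h, examples):
    """h is consistent when its error rate over the training set is zero."""
    return error_rate(h, examples) == 0.0

# Toy labeled set and a simple threshold hypothesis h(x) = (x >= 0)
examples = [(-2, False), (-1, False), (0, True), (1, True), (3, False)]
h = lambda x: x >= 0
print(error_rate(h, examples))    # 0.2 (one mistake out of five)
print(is_consistent(h, examples)) # False
```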
SLIDE 4
Test Error Rate
Thus we are more concerned with the test error rate.
◮ A low test error indicates h generalizes well.
◮ Often a consistent hypothesis has worse generalization than a
less-complex one.
◮ This trade-off between the complexity of H and the test
performance is the core of supervised machine learning.
SLIDE 5
Cross-Validation
◮ So, the test error is the final word on the performance of h,
but recall that we can only use the test set once. Otherwise we are said to be peeking.
◮ However, if we use the entire training set for training we will
likely over-train.
◮ The answer is to use cross-validation to estimate the
generalization performance of h. We partition the training set into a training and validation set.
◮ holdout cross-validation - reserve a percentage (typically 1/3)
from D for validation.
◮ k-fold cross-validation - partition D into k disjoint subsets (folds);
each fold is held out once for validation while training on the rest,
giving k estimates of generalization performance
◮ when k = N this is called leave-one-out cross-validation.
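The k-fold scheme above can be sketched as follows (a minimal sketch; `learn` stands for any learning algorithm and the mean-predicting "learner" in the usage example is a hypothetical stand-in):

```python
import random

def k_fold_split(data, k, seed=0):
    """Partition data into k disjoint folds of approximately equal size."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    return [[data[i] for i in idx[f::k]] for f in range(k)]

def cross_validate(learn, loss, data, k):
    """Hold out each fold once for validation, training on the remaining
    k-1 folds; return the k validation-error estimates."""
    folds = k_fold_split(data, k)
    errors = []
    for f in range(k):
        val = folds[f]
        train = [ex for g in range(k) if g != f for ex in folds[g]]
        h = learn(train)
        errors.append(sum(loss(y, h(x)) for x, y in val) / len(val))
    return errors

# Hypothetical usage: a "learner" that always predicts the training mean of y.
mean_learner = lambda train: (lambda x, m=sum(y for _, y in train) / len(train): m)
sq_loss = lambda y, yhat: (y - yhat) ** 2
data = [(x, 2.0 * x) for x in range(10)]
scores = cross_validate(mean_learner, sq_loss, data, k=5)
print(sum(scores) / len(scores))  # average validation-error estimate
```

Setting k = len(data) gives leave-one-out cross-validation.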
SLIDE 6
SLIDE 7
Selecting Hypothesis Complexity
So, to select an optimal h we need a learning algorithm, a way to
optimize the parameters over a given set H.
◮ Define the size of H as some parameter which adjusts the
complexity of H.
◮ For increasing values of size use cross-validation and the
learning algorithm to give an estimate of the training and validation error.
◮ stop when h is consistent or the training error has converged
◮ search backwards to find the size with the smallest validation
error
◮ finally, train h at the optimal size using the full training set.
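The size-selection loop above can be sketched as follows (a minimal sketch; the error curves in the usage example are illustrative stand-ins, not results from a real learner):

```python
def select_size(learn_at_size, train_err, val_err, max_size):
    """Sweep the complexity parameter 'size' upward, recording training and
    validation error; stop when a consistent h is found, then search
    backwards for the size with the smallest validation error."""
    history = []
    for size in range(1, max_size + 1):
        h = learn_at_size(size)
        history.append((size, train_err(h), val_err(h)))
        if train_err(h) == 0.0:  # h is consistent; stop growing H
            break
    best_size = min(reversed(history), key=lambda t: t[2])[0]
    return best_size, history

# Illustrative stand-ins: training error falls with size, validation
# error is U-shaped (overfitting sets in past size 3).
train_curve = {1: 0.30, 2: 0.20, 3: 0.10, 4: 0.05, 5: 0.0}
val_curve   = {1: 0.35, 2: 0.25, 3: 0.15, 4: 0.22, 5: 0.30}
best, hist = select_size(lambda s: s, train_curve.get, val_curve.get, 5)
print(best)  # 3: the size with the smallest validation error
```

The final step, retraining at the chosen size on the full training set, would then use `best` with the real learning algorithm.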
SLIDE 8
SLIDE 9
Loss Functions
Minimizing the error rate assumes that all errors matter equally to the success of the agent. From our discussion of Utility we know this is not true.
◮ In ML it is traditional to work with a cost rather than a utility,
via a loss function:

    L(x, y, ŷ) = U(result of y given x) − U(result of ŷ given x)

where y = f(x) and ŷ = h(x).
◮ We often assume the loss has no dependence on x, so we just
have L(y, ŷ).
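The standard x-independent losses can be written directly (a minimal sketch of the usual 0/1, absolute-value, and squared-error losses):

```python
def l01(y, yhat):
    """0/1 loss: 1 if the prediction is wrong, else 0."""
    return 0 if y == yhat else 1

def l1(y, yhat):
    """Absolute-value loss |y - yhat|."""
    return abs(y - yhat)

def l2(y, yhat):
    """Squared-error loss (y - yhat)^2."""
    return (y - yhat) ** 2

print(l01(1, 0), l1(2.0, 3.5), l2(2.0, 3.5))  # 1 1.5 2.25
```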
SLIDE 10
Empirical Loss
◮ We would like to minimize the expected loss over the
validation set:

    ∑_{i=1}^{N} L(y_i, h(x_i)) P(x_i, y_i)

however we don’t know the joint probability P(x, y).
◮ Instead we assume a uniform distribution and optimize the
empirical loss:

    (1/N) ∑_{i=1}^{N} L(y_i, h(x_i))
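The empirical loss is just the average loss over the observed examples (a minimal sketch; the example data and hypothesis are illustrative):

```python
def empirical_loss(L, h, examples):
    """Average loss of h over the examples: the expected loss under a
    uniform distribution on the observed (x, y) pairs."""
    return sum(L(y, h(x)) for x, y in examples) / len(examples)

# Example with squared-error loss and hypothesis h(x) = 2x
sq = lambda y, yhat: (y - yhat) ** 2
examples = [(1, 2.0), (2, 4.5), (3, 5.5)]
print(empirical_loss(sq, lambda x: 2 * x, examples))  # (0 + 0.25 + 0.25) / 3
```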
SLIDE 11
Probably Approximately Correct Learning
For Boolean functions (binary classifiers) define the error as the
expected 0/1 loss:

    error(h) = ∑_x ∑_y L_{0/1}(y, h(x)) P(x, y)

A hypothesis h is approximately correct if error(h) ≤ ε. For any
consistent h to be probably approximately correct — approximately
correct with probability at least 1 − δ — the number of training
examples must satisfy the sample-complexity bound:

    N ≥ (1/ε) (ln(1/δ) + ln |H|)
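The sample-complexity bound above is easy to evaluate numerically (a minimal sketch; the function name and the choice of hypothesis space in the example are illustrative):

```python
from math import ceil, log

def pac_sample_complexity(epsilon, delta, H_size):
    """Smallest integer N with N >= (1/epsilon) * (ln(1/delta) + ln|H|):
    enough examples that, with probability at least 1 - delta, any
    hypothesis consistent with all N examples has error(h) <= epsilon."""
    return ceil((1.0 / epsilon) * (log(1.0 / delta) + log(H_size)))

# Example: all Boolean functions of n = 10 attributes, |H| = 2^(2^n).
n = 10
print(pac_sample_complexity(0.05, 0.05, 2 ** (2 ** n)))
```

Note how the ln |H| term dominates for rich hypothesis spaces: the bound grows with the logarithm of the number of hypotheses, which motivates restricting H.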
SLIDE 12