Learning From Data Lecture 7 Approximation Versus Generalization
The VC Dimension Approximation Versus Generalization Bias and Variance The Learning Curve
- M. Magdon-Ismail
CSCI 4100/6100
recap: The Vapnik-Chervonenkis Bound (VC Bound)
P[ |Ein(g) − Eout(g)| > ε ] ≤ 4 mH(2N) e^(−ε²N/8)          ← VC bound
P[ |Ein(g) − Eout(g)| > ε ] ≤ 2 |H| e^(−2ε²N)               ← finite H

P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 4 mH(2N) e^(−ε²N/8)       ← VC bound
P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 2 |H| e^(−2ε²N)            ← finite H

Equivalently, with probability at least 1 − δ:

Eout(g) ≤ Ein(g) + sqrt( (8/N) log( 4 mH(2N) / δ ) )        ← VC bound
Eout(g) ≤ Ein(g) + sqrt( (1/(2N)) log( 2 |H| / δ ) )         ← finite H

If k is a break point for H, the growth function is polynomially bounded:

mH(N) ≤ Σ_{i=0}^{k−1} C(N, i)
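To get a feel for the size of these error bars, here is a minimal numeric sketch (not part of the slides): it evaluates both bars for illustrative values of N, dvc, |H| and δ, using the polynomial bound mH(2N) ≤ (2N)^dvc + 1 that also drives the sample-complexity estimate later in the lecture; the function names are mine.

import numpy as np

def vc_error_bar(N, dvc, delta):
    # VC bar: Eout <= Ein + sqrt((8/N) * log(4 * mH(2N) / delta)),
    # with the polynomial bound mH(2N) <= (2N)**dvc + 1 (log is the natural log).
    mH_2N = (2.0 * N) ** dvc + 1.0
    return np.sqrt(8.0 / N * np.log(4.0 * mH_2N / delta))

def finite_H_error_bar(N, M, delta):
    # Finite-H bar: Eout <= Ein + sqrt((1/(2N)) * log(2 * M / delta)), with M = |H|.
    return np.sqrt(np.log(2.0 * M / delta) / (2.0 * N))

# Illustrative numbers: N = 1000 examples, dvc = 3, 90% confidence (delta = 0.1).
print(vc_error_bar(N=1000, dvc=3, delta=0.1))         # roughly 0.46 -- a loose bar
print(finite_H_error_bar(N=1000, M=1000, delta=0.1))  # roughly 0.07 for |H| = 1000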
VC dimension
The VC dimension dvc(H) is the largest N for which H can shatter some set of N points.

N ≤ dvc: H could shatter your data (H can shatter some set of N points).
N > dvc: N is a break point for H; H cannot possibly shatter your data.
dvc versus number of parameters
dvc for perceptron
For the perceptron in d dimensions (d + 1 parameters, counting the bias weight), dvc = d + 1. The proof has two steps.

Step 1: show dvc ≥ d + 1. What needs to be shown?
(a) There is a set of d + 1 points that can be shattered.
(b) There is a set of d + 1 points that cannot be shattered.
(c) Every set of d + 1 points can be shattered.
(d) Every set of d + 1 points cannot be shattered.

Step 2: show dvc ≤ d + 1. What needs to be shown?
(a) There is a set of d + 1 points that can be shattered.
(b) There is a set of d + 2 points that cannot be shattered.
(c) Every set of d + 2 points can be shattered.
(d) Every set of d + 1 points cannot be shattered.
(e) Every set of d + 2 points cannot be shattered.
Step 1 answer: (a). To show dvc ≥ d + 1, it is enough to exhibit one set of d + 1 points that the perceptron can shatter (see the numerical check below).
Step 2 answer: (e). To show dvc ≤ d + 1, every set of d + 2 points must fail to be shattered; one unshatterable set would not rule out some other set of d + 2 points being shattered.
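A small numerical check of Step 1 (not from the slides): take the standard set of d + 1 points, the origin together with the d unit vectors, each with the constant coordinate x0 = 1 prepended, and verify that a perceptron realizes every dichotomy on them. The value d = 3 and the variable names are illustrative.

import itertools
import numpy as np

d = 3  # input dimension; the perceptron's dvc should be d + 1

# d + 1 points in homogeneous coordinates: x0 = 1 prepended to 0, e_1, ..., e_d.
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros(d), np.eye(d)])])

# X is invertible, so for any target dichotomy y we can solve X w = y exactly,
# and then sign(X w) = y, i.e. the perceptron shatters these d + 1 points.
for y in itertools.product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = np.linalg.solve(X, y)
    assert np.array_equal(np.sign(X @ w), y)

print(f"All {2 ** (d + 1)} dichotomies realized: the perceptron shatters these {d + 1} points")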
dvc characterizes complexity in error bar
[Figure: the out-of-sample error bound, in-sample error plus a model-complexity term, plotted against model complexity. Left: complexity measured by |H|, with an optimal |H|*. Right: complexity measured by the VC dimension dvc, with an optimal d*vc. The in-sample error falls and the model-complexity term rises as the model grows.]
Sample complexity
A simple iterative method works well. For ε = 0.1, δ = 0.1 and dvc = 3, we need

N ≥ (8/ε²) log( (4 (2N)^dvc + 4) / δ ).

Trying N = 1000, we get N ≈ (8/0.1²) log( (4 (2000)³ + 4) / 0.1 ) ≈ 21,000.

We continue iteratively, and converge to N ≈ 30,000. For dvc = 4, N ≈ 40,000; for dvc = 5, N ≈ 50,000.

(N grows in proportion to dvc, but these are gross overestimates.)
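A minimal sketch of that iteration (not from the slides), assuming the natural log and the same bound 4 mH(2N) ≤ 4(2N)^dvc + 4 used above; the function name and the starting guess N = 1000 are illustrative.

import numpy as np

def sample_complexity(dvc, eps=0.1, delta=0.1, N=1000.0, iters=100):
    # Fixed-point iteration of N >= (8 / eps^2) * log((4 * (2N)**dvc + 4) / delta).
    for _ in range(iters):
        N = 8.0 / eps ** 2 * np.log((4.0 * (2.0 * N) ** dvc + 4.0) / delta)
    return N

for dvc in (3, 4, 5):
    print(dvc, int(sample_complexity(dvc)))   # converges near 30,000 / 40,000 / 50,000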
Theory versus practice
The VC analysis in theory:
– A single parameter, dvc, characterizes the complexity of H.
– dvc depends only on H.
– Ein can reach outside D to Eout when dvc is finite.

Why the bound is loose in practice:
– The Hoeffding inequality is itself a worst-case bound.
– mH(N) is a worst-case number of dichotomies, not the average case or the likely case.
– The polynomial bound on mH(N) is loose.
Test set
You can use |H| = 1 in the Hoeffding bound with Etest, because the final hypothesis g is fixed before the test data is seen.
The training set has an optimistic bias due to selection – fitting the data. A test set has no such bias.
Etest ≈ Eout, but now Etest (and Eout itself) may be bad: data set aside for testing is not available for training.
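A quick numeric illustration (mine, not the slides'): the two-sided Hoeffding bar for a test set, with |H| = 1; the test-set size and δ below are illustrative.

import numpy as np

def test_set_error_bar(n_test, delta=0.05):
    # With a single, fixed hypothesis (|H| = 1), Hoeffding gives, w.p. >= 1 - delta:
    # |Etest - Eout| <= sqrt(log(2 / delta) / (2 * n_test)).
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n_test))

print(test_set_error_bar(1000))   # about 0.043: a tight bar from a modest test set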
Approximation versus Generalization
You are better off buying a lottery ticket.
This holds independent of f, P(x), and the learning algorithm.
Bias-variance analysis
Bias-variance analysis asks: how well can the learning approximate f out of sample (Eout)?
. . . as opposed to how well the learning approximated f in sample (Ein).
. . . as opposed to how close Ein is to Eout (generalization).
Different learning algorithms can have different Eout when applied to the same H!
Sin example
[Figure: a target sine curve and a small data set, fit using H0 (constant hypotheses h(x) = b) and H1 (linear hypotheses h(x) = ax + b).]
Many data sets
[Figure: the fits g^D obtained from many different data sets D, overlaid, for H0 and for H1.]
Average behavior
[Figure: the average hypothesis ḡ(x) plotted against the sin target, with a shaded band of width ±√var(x), for H0 and for H1.]

For a data set D, the learned hypothesis is g^D(x). Averaging over K data sets,

ḡ(x) ≈ (1/K)( g^{D1}(x) + · · · + g^{DK}(x) )   ← your average prediction on x

var(x) = E_D[ (g^D(x) − ḡ(x))² ] = E_D[ g^D(x)² ] − ḡ(x)²   ← how variable your prediction on x is
Error on out-of-sample test point
[Figure: a test point x, the target value f(x), and the learned value g^D(x), which depends on the data set D.]

On a test point x, the squared error is (g^D(x) − f(x))². Since g^D depends on the particular data set D, average over data sets:

Eout(x) = E_D[ (g^D(x) − f(x))² ]
Bias-variance decomposition
Eout(x) = E_D[ (g^D(x) − f(x))² ]
        = E_D[ g^D(x)² − 2 g^D(x) f(x) + f(x)² ]
        = E_D[ g^D(x)² ] − 2 ḡ(x) f(x) + f(x)²              ← understand this; the rest is just algebra
        = E_D[ g^D(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)²
        = E_D[ g^D(x)² − ḡ(x)² ] + (ḡ(x) − f(x))²
        = E_D[ (g^D(x) − ḡ(x))² ] + (ḡ(x) − f(x))²
        =          var(x)          +     bias(x)

Taking the expectation over the test point x: Eout = E_x[bias(x)] + E_x[var(x)] = bias + var.

[Figure: a very small model has few hypotheses, possibly far from f (high bias, low var); a very large model has many hypotheses spread around f (low bias, high var).]
Back to sin example
[Figure: the average hypothesis ḡ(x) with its ±√var(x) band against the sin target, for H0 and for H1.]
2 versus 5 data points
Compare learning from 2 data points with learning from 5 data points (same sin target; H0: constants, H1: lines):

N = 2:   H0: bias = 0.50, var = 0.25, Eout = 0.75.    H1: bias = 0.21, var = 1.69, Eout = 1.90.
N = 5:   H0: bias = 0.50, var = 0.10, Eout = 0.60.    H1: bias = 0.21, var = 0.21, Eout = 0.42.

With very little data the simple model H0 wins; with more data the variance of H1 drops and H1 wins.
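A small Monte Carlo sketch (not from the slides) that estimates these bias and var numbers; it assumes the textbook setting: target f(x) = sin(πx), x sampled uniformly on [−1, 1], noiseless data, H0 fit by the mean of the y's and H1 by least squares. The printed estimates should land near (not exactly on) the values quoted above.

import numpy as np

rng = np.random.default_rng(0)

def fit_h0(x, y, x_test):
    # H0: constant hypothesis h(x) = b; the least-squares constant is the mean of the y's.
    return np.full_like(x_test, y.mean())

def fit_h1(x, y, x_test):
    # H1: line h(x) = a*x + b, fit by least squares.
    a, b = np.polyfit(x, y, 1)
    return a * x_test + b

def bias_variance(fit, N, n_datasets=20000, n_test=400):
    # Estimate bias and var for learning f(x) = sin(pi*x) from N points.
    x_test = rng.uniform(-1, 1, n_test)
    f_test = np.sin(np.pi * x_test)
    preds = np.empty((n_datasets, n_test))
    for k in range(n_datasets):
        x = rng.uniform(-1, 1, N)
        preds[k] = fit(x, np.sin(np.pi * x), x_test)
    g_bar = preds.mean(axis=0)                # average hypothesis g_bar(x)
    bias = np.mean((g_bar - f_test) ** 2)     # E_x[(g_bar(x) - f(x))^2]
    var = np.mean(preds.var(axis=0))          # E_x[E_D[(g_D(x) - g_bar(x))^2]]
    return bias, var

for N in (2, 5):
    for name, fit in (("H0", fit_h0), ("H1", fit_h1)):
        bias, var = bias_variance(fit, N)
        print(f"N={N} {name}: bias={bias:.2f}  var={var:.2f}  Eout={bias + var:.2f}")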
Learning curves
[Figure: learning curves. Expected Ein and Eout versus the number of data points N, shown for two models; Eout decreases and Ein increases with N, and the two curves approach each other as N grows.]
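A sketch (mine, not the slides') of how such a curve can be generated empirically for the H1 (line) model on the sin target from the earlier example; the data-set sizes and trial counts are illustrative.

import numpy as np

rng = np.random.default_rng(1)

def expected_errors(N, trials=2000, n_test=500):
    # Average Ein and Eout over many data sets of size N, fitting a line to f(x) = sin(pi*x).
    x_test = rng.uniform(-1, 1, n_test)
    y_test = np.sin(np.pi * x_test)
    ein_sum = eout_sum = 0.0
    for _ in range(trials):
        x = rng.uniform(-1, 1, N)
        y = np.sin(np.pi * x)
        a, b = np.polyfit(x, y, 1)                 # least-squares line
        ein_sum += np.mean((a * x + b - y) ** 2)
        eout_sum += np.mean((a * x_test + b - y_test) ** 2)
    return ein_sum / trials, eout_sum / trials

for N in (2, 5, 10, 20, 50, 100):
    ein, eout = expected_errors(N)
    print(f"N={N:3d}  Ein={ein:.2f}  Eout={eout:.2f}")   # Ein rises, Eout falls, toward a common asymptote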
Decomposing the learning curve
[Figure: decomposing the learning curve in two ways. Left: the VC analysis splits the expected Eout into in-sample error plus generalization error. Right: the bias-variance analysis splits it into bias plus variance.]

VC analysis: pick H that can generalize and has a good chance to fit the data.
Bias-variance analysis: pick (H, A) to approximate f and not behave wildly after seeing the data.
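The same picture in symbols (a sketch; the expectations over data sets and the labels under each term are spelled out here, assuming the noiseless squared-error setting of the sin example):

\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\begin{align*}
\text{VC analysis:}\qquad
\mathbb{E}_{\mathcal{D}}\!\left[E_{\text{out}}(g^{\mathcal{D}})\right]
 &= \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[E_{\text{in}}(g^{\mathcal{D}})\right]}_{\text{in-sample error}}
  + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[E_{\text{out}}(g^{\mathcal{D}}) - E_{\text{in}}(g^{\mathcal{D}})\right]}_{\text{generalization error}}\\[6pt]
\text{bias-variance:}\qquad
\mathbb{E}_{\mathcal{D}}\!\left[E_{\text{out}}(g^{\mathcal{D}})\right]
 &= \underbrace{\mathbb{E}_{x}\!\left[(\bar{g}(x) - f(x))^2\right]}_{\text{bias}}
  + \underbrace{\mathbb{E}_{x}\!\left[\mathbb{E}_{\mathcal{D}}\!\left[(g^{\mathcal{D}}(x) - \bar{g}(x))^2\right]\right]}_{\text{var}}
\end{align*}
\end{document}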