Learning From Data, Lecture 7: Approximation Versus Generalization (PowerPoint presentation)



SLIDE 1

Learning From Data Lecture 7 Approximation Versus Generalization

The VC Dimension Approximation Versus Generalization Bias and Variance The Learning Curve

M. Magdon-Ismail

CSCI 4100/6100

SLIDE 2

recap: The Vapnik-Chervonenkis Bound (VC Bound)

$$P\big[\,|E_{\rm in}(g) - E_{\rm out}(g)| > \epsilon\,\big] \;\le\; 4\,m_{\mathcal H}(2N)\,e^{-\epsilon^2 N/8}, \quad\text{for any } \epsilon > 0.$$

(finite $\mathcal H$: $P[\,|E_{\rm in}(g) - E_{\rm out}(g)| > \epsilon\,] \le 2|\mathcal H|\,e^{-2\epsilon^2 N}$)

$$P\big[\,|E_{\rm in}(g) - E_{\rm out}(g)| \le \epsilon\,\big] \;\ge\; 1 - 4\,m_{\mathcal H}(2N)\,e^{-\epsilon^2 N/8}, \quad\text{for any } \epsilon > 0.$$

(finite $\mathcal H$: $P[\,|E_{\rm in}(g) - E_{\rm out}(g)| \le \epsilon\,] \ge 1 - 2|\mathcal H|\,e^{-2\epsilon^2 N}$)

$$E_{\rm out}(g) \;\le\; E_{\rm in}(g) + \sqrt{\frac{8}{N}\log\frac{4\,m_{\mathcal H}(2N)}{\delta}}, \quad\text{w.p. at least } 1 - \delta.$$

(finite $\mathcal H$: $E_{\rm out}(g) \le E_{\rm in}(g) + \sqrt{\frac{1}{2N}\log\frac{2|\mathcal H|}{\delta}}$)

$$m_{\mathcal H}(N) \;\le\; \sum_{i=0}^{k-1}\binom{N}{i} \;\le\; N^{k-1} + 1, \quad\text{where } k \text{ is a break point.}$$
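The Sauer bound above is easy to check numerically. A small sketch (the helper name `sauer_bound` is ours, not from the lecture):

```python
from math import comb

def sauer_bound(N, k):
    """Sauer bound on the growth function when k is a break point:
    m_H(N) <= sum_{i=0}^{k-1} C(N, i)."""
    return sum(comb(N, i) for i in range(k))

# The sum is in turn bounded by the polynomial N^(k-1) + 1.
for N in range(1, 50):
    for k in range(1, 8):
        assert sauer_bound(N, k) <= N ** (k - 1) + 1
```

With k = 2 the bound gives N + 1, which is exactly the growth function of the 1-D positive ray.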

© Creator: Malik Magdon-Ismail

Approximation Versus Generalization: 2/22

SLIDE 3

The VC Dimension dvc

$m_{\mathcal H}(N) \sim N^{k-1}$. The tightest bound is obtained with the smallest break point $k^*$.

Definition [VC Dimension]: $d_{\rm vc} = k^* - 1$. Equivalently, the VC dimension is the largest $N$ that can be shattered ($m_{\mathcal H}(N) = 2^N$).

$N \le d_{\rm vc}$: $\mathcal H$ could shatter your data ($\mathcal H$ can shatter some $N$ points).
$N > d_{\rm vc}$: $N$ is a break point for $\mathcal H$; $\mathcal H$ cannot possibly shatter your data.

$$m_{\mathcal H}(N) \le N^{d_{\rm vc}} + 1 \sim N^{d_{\rm vc}}, \qquad E_{\rm out}(g) \le E_{\rm in}(g) + O\!\left(\sqrt{\frac{d_{\rm vc}\log N}{N}}\right).$$

SLIDE 4

The VC-dimension is an Effective Number of Parameters

    m_H(N) at N =             1  2  3  4    5    ...   #Params   d_vc
    2-D perceptron            2  4  8  14        ...   3         3
    1-D positive ray          2  3  4  5         ...   1         1
    2-D positive rectangles   2  4  8  16   <2^5 ...   4         4
    2-D positive convex sets  2  4  8  16   32   ...   ∞         ∞

There are models with few parameters but infinite d_vc. There are models with redundant parameters but small d_vc.
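The simpler rows of this table can be recovered by brute force. A sketch for the 1-D positive ray, h(x) = sign(x − a) (the function name `ray_dichotomies` is ours); it confirms m_H(N) = N + 1 and hence d_vc = 1:

```python
def ray_dichotomies(xs):
    """All dichotomies the 1-D positive ray h(x) = sign(x - a) can produce on xs."""
    xs = sorted(xs)
    # One threshold below all points, one between each adjacent pair, one above all.
    cuts = [xs[0] - 1] + [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)] + [xs[-1] + 1]
    return {tuple(1 if x > a else -1 for x in xs) for a in cuts}

for N in range(1, 8):
    assert len(ray_dichotomies(list(range(N)))) == N + 1   # m_H(N) = N + 1

# d_vc = 1: one point is shattered (2 = 2^1 dichotomies), two points are not (3 < 2^2).
```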

SLIDE 5

VC-dimension of the Perceptron in R^d is d + 1

This can be shown in two steps:

1. d_vc ≥ d + 1. What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. d_vc ≤ d + 1. What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
   (e) Every set of d + 2 points cannot be shattered.

SLIDE 6

VC-dimension of the Perceptron in R^d is d + 1

Step 1 answer: to show d_vc ≥ d + 1, you need (a): exhibit one set of d + 1 points that can be shattered. (Take the origin together with the d standard basis vectors; the resulting input matrix is invertible, so for any dichotomy you can solve for weights that realize it.)

SLIDE 7

VC-dimension of the Perceptron in R^d is d + 1

Step 2 answer: to show d_vc ≤ d + 1, you need (e): every set of d + 2 points cannot be shattered. (Any d + 2 input vectors in R^{d+1}, after adding the bias coordinate, are linearly dependent; the dependence yields a dichotomy that no perceptron can realize.)

SLIDE 8

A Single Parameter Characterizes Complexity

For finite $\mathcal H$:
$$E_{\rm out}(g) \;\le\; \underbrace{E_{\rm in}(g)}_{\text{in-sample error}} \;+\; \underbrace{\sqrt{\tfrac{1}{2N}\log\tfrac{2|\mathcal H|}{\delta}}}_{\text{model complexity}}$$

(figure: in-sample error, model-complexity penalty, and out-of-sample error versus $|\mathcal H|$; the out-of-sample error is minimized at some $|\mathcal H|^*$)

For general $\mathcal H$, via the VC dimension:
$$E_{\rm out}(g) \;\le\; \underbrace{E_{\rm in}(g)}_{\text{in-sample error}} \;+\; \underbrace{\sqrt{\tfrac{8}{N}\log\tfrac{4\left((2N)^{d_{\rm vc}}+1\right)}{\delta}}}_{\text{penalty for model complexity, } \Omega(d_{\rm vc})}$$

(figure: the same curves versus the VC dimension $d_{\rm vc}$; the out-of-sample error is minimized at some $d_{\rm vc}^*$)

SLIDE 9

Sample Complexity: How Many Data Points Do You Need?

Set the error bar at $\epsilon$:
$$\epsilon = \sqrt{\frac{8}{N}\ln\frac{4\left((2N)^{d_{\rm vc}}+1\right)}{\delta}}$$

Solve for $N$:
$$N = \frac{8}{\epsilon^2}\ln\frac{4\left((2N)^{d_{\rm vc}}+1\right)}{\delta} = O(d_{\rm vc}\ln N)$$

Example: $d_{\rm vc} = 3$; error bar $\epsilon = 0.1$; confidence 90% ($\delta = 0.1$).

A simple iterative method works well. Trying $N = 1000$ gives
$$N \approx \frac{8}{0.1^2}\,\ln\frac{4\left((2000)^3+1\right)}{0.1} \approx 21{,}192.$$
Continuing iteratively, $N$ converges to $\approx 30{,}000$. If $d_{\rm vc} = 4$, $N \approx 40{,}000$; for $d_{\rm vc} = 5$, $N \approx 50{,}000$.

($N \propto d_{\rm vc}$, but these are gross overestimates.)

Practical Rule of Thumb: $N = 10 \times d_{\rm vc}$
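The iteration on this slide is a simple fixed-point computation. A sketch (the function name and starting point are our choices):

```python
import math

def sample_complexity(dvc, eps, delta, n0=1000.0, iters=100):
    """Iterate N <- (8/eps^2) * ln(4*((2N)^dvc + 1)/delta) to a fixed point."""
    n = n0
    for _ in range(iters):
        n = (8 / eps ** 2) * math.log(4 * ((2 * n) ** dvc + 1) / delta)
    return n
```

Starting from N = 1000 with d_vc = 3, ε = 0.1, δ = 0.1, the first step lands near 21,000 and the iteration settles around 29,000 to 30,000, matching the slide; d_vc = 4 and 5 give roughly 40,000 and 50,000.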

SLIDE 10

Theory Versus Practice

The VC analysis allows us to reach outside the data for general H.

– a single parameter, d_vc, characterizes the complexity of H;
– d_vc depends only on H;
– E_in can reach outside D to E_out when d_vc is finite.

In Practice . . .

The VC bound is loose:
– the Hoeffding inequality is itself loose;
– m_H(N) is a worst-case number of dichotomies, not the average or likely case;
– the polynomial bound on m_H(N) is loose.

  • It is a good guide – models with small dvc are good.
  • Roughly 10 × dvc examples needed to get good generalization.

SLIDE 11

The Test Set

  • Another way to estimate Eout(g) is using a test set to obtain Etest(g).
  • Etest is better than Ein: you don’t pay the price for fitting.

You can use |H| = 1 in the Hoeffding bound with Etest.

  • Both a test and training set have variance.

The training set has optimistic bias due to selection – fitting the data. A test set has no bias.

The price for a test set is fewer training examples. (Why is this bad?)

E_test ≈ E_out, but now E_out itself may be bad: g was trained on fewer examples.

SLIDE 12

VC Bound Quantifies Approximation Versus Generalization

For pure approximation, the best H is H = {f}; but the chance that your chosen H contains f is nil: you are better off buying a lottery ticket.

d_vc ↑ ⟹ better chance of approximating f (E_in ≈ 0).
d_vc ↓ ⟹ better chance of generalizing out of sample (E_in ≈ E_out).

$$E_{\rm out} \le E_{\rm in} + \Omega(d_{\rm vc}).$$

VC analysis depends only on H.

Independent of f, P(x), learning algorithm.

SLIDE 13

Bias-Variance Analysis

Another way to quantify the tradeoff:

1. How well can the learning approximate f?

. . . as opposed to how well did the learning approximate f in-sample (Ein).

2. How close can you get to that approximation with a finite data set?

. . . as opposed to how close is Ein to Eout.

Bias-variance analysis applies to squared error (for both classification and regression). Bias-variance analysis can take the learning algorithm into account.

Different learning algorithms can have different Eout when applied to the same H!

SLIDE 14

A Simple Learning Problem

2 data points. Two hypothesis sets:
H0: h(x) = b
H1: h(x) = ax + b

(figures: the two models fit to a 2-point dataset; axes x and y)

SLIDE 15

Let’s Repeat the Experiment Many Times

(figures: the fits g^D from many repeated datasets, for H0 and H1; axes x and y)

For each data set D, you get a different g^D. So, for a fixed x, g^D(x) is a random value depending on D.

SLIDE 16

What’s Happening on Average

(figures: the average hypothesis ḡ(x) and the target sin(x), for H0 and H1; axes x and y)

We can define:

$g^D(x)$ ← a random value, depending on $D$

$$\bar g(x) = \mathbb E_D\big[g^D(x)\big] \approx \frac{1}{K}\big(g^{D_1}(x) + \cdots + g^{D_K}(x)\big) \quad\leftarrow\text{ your average prediction on } x$$

$$\mathrm{var}(x) = \mathbb E_D\big[(g^D(x) - \bar g(x))^2\big] = \mathbb E_D\big[g^D(x)^2\big] - \bar g(x)^2 \quad\leftarrow\text{ how variable is your prediction?}$$
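These definitions can be estimated by simulation. A minimal sketch for H0 on the sin target, assuming (as in the textbook example) f(x) = sin(πx) with x uniform on [−1, 1] and N = 2; the helper names are ours:

```python
import math
import random

random.seed(0)

def f(x):
    return math.sin(math.pi * x)

def fit_h0():
    """Learn from one dataset D of two points; for H0 the best constant is the mean."""
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    return (f(x1) + f(x2)) / 2            # g^D(x) = b for all x

K = 20000
bs = [fit_h0() for _ in range(K)]         # one g^D per dataset
g_bar = sum(bs) / K                       # average prediction (constant in x for H0)
var = sum((b - g_bar) ** 2 for b in bs) / K   # E_D[(g^D(x) - g_bar(x))^2]
```

Here g_bar comes out near 0 and var near 0.25, the H0 values used on the later slides.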

SLIDE 17

Eout on Test Point x for Data D

(figures: the squared error between g^D(x) and f(x) at a test point x, for H0 and H1)

$$E_{\rm out}^{D}(x) = \big(g^D(x) - f(x)\big)^2 \quad\leftarrow\text{ squared error, a random value depending on } D$$

$$E_{\rm out}(x) = \mathbb E_D\big[E_{\rm out}^{D}(x)\big] \quad\leftarrow\text{ expected } E_{\rm out}(x) \text{ before seeing } D$$

SLIDE 18

The Bias-Variance Decomposition

$$
\begin{aligned}
E_{\rm out}(x) &= \mathbb E_D\big[(g^D(x) - f(x))^2\big]\\
&= \mathbb E_D\big[g^D(x)^2 - 2\,g^D(x) f(x) + f(x)^2\big]\\
&= \mathbb E_D\big[g^D(x)^2\big] - 2\,\bar g(x) f(x) + f(x)^2 \quad\leftarrow\text{ understand this; the rest is just algebra}\\
&= \mathbb E_D\big[g^D(x)^2\big] - \bar g(x)^2 + \bar g(x)^2 - 2\,\bar g(x) f(x) + f(x)^2\\
&= \underbrace{\mathbb E_D\big[g^D(x)^2\big] - \bar g(x)^2}_{\mathrm{var}(x)} + \underbrace{\big(\bar g(x) - f(x)\big)^2}_{\mathrm{bias}(x)}
\end{aligned}
$$

$$E_{\rm out}(x) = \mathrm{bias}(x) + \mathrm{var}(x)$$

(figures: relative to f, a very small model has large bias and small var; a very large model has small bias and large var)

If you take the average over x: $E_{\rm out} = \mathrm{bias} + \mathrm{var}$.

SLIDE 19
Back to H0 and H1; and, our winner is . . .

(figures: the average hypothesis ḡ(x) and the target sin(x), for H0 and H1; axes x and y)

H0: bias = 0.50, var = 0.25 ⟹ E_out = 0.75.
H1: bias = 0.21, var = 1.69 ⟹ E_out = 1.90.
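These numbers can be reproduced by Monte Carlo, assuming the textbook setup (f(x) = sin(πx), x uniform on [−1, 1], two data points). A sketch; the averaging over the test point x is done analytically using E_x[x] = 0, E_x[x²] = 1/3, E_x[sin²(πx)] = 1/2, and E_x[x sin(πx)] = 1/π:

```python
import math
import random

random.seed(1)
K = 200000

a_s, b0_s, b1_s = [], [], []
for _ in range(K):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    y1, y2 = math.sin(math.pi * x1), math.sin(math.pi * x2)
    b0_s.append((y1 + y2) / 2)            # H0: best constant fit to the two points
    a = (y2 - y1) / (x2 - x1)             # H1: line through the two points
    a_s.append(a)
    b1_s.append(y1 - a * x1)

def mean_var(vals):
    m = sum(vals) / len(vals)
    return m, sum((v - m) ** 2 for v in vals) / len(vals)

mb0, vb0 = mean_var(b0_s)
ma, va = mean_var(a_s)
mb1, vb1 = mean_var(b1_s)

# H0: g_bar(x) = mb0, so bias = E_x[(mb0 - sin(pi x))^2] = mb0^2 + 1/2; var = Var(b).
bias_h0, var_h0 = mb0 ** 2 + 0.5, vb0
# H1: g_bar(x) = ma*x + mb1; the cross terms vanish since E_x[x] = 0.
bias_h1 = ma ** 2 / 3 + mb1 ** 2 + 0.5 - 2 * ma / math.pi
var_h1 = va / 3 + vb1
```

The estimates land near the slide's values: bias ≈ 0.50 and var ≈ 0.25 for H0, bias ≈ 0.21 and var ≈ 1.69 for H1.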

SLIDE 20

Match Learning Power to Data, . . . Not to f

2 Data Points:
(figures: fits, ḡ(x), and sin(x) for each model)
H0: bias = 0.50, var = 0.25 ⟹ E_out = 0.75.
H1: bias = 0.21, var = 1.69 ⟹ E_out = 1.90.

5 Data Points:
(figures: fits, ḡ(x), and sin(x) for each model)
H0: bias = 0.50, var = 0.1 ⟹ E_out = 0.6.
H1: bias = 0.21, var = 0.21 ⟹ E_out = 0.42.

SLIDE 21

Learning Curves: When Does the Balance Tip?

Simple Model versus Complex Model

(learning-curve figures: expected error versus number of data points N, with the E_in and E_out curves for a simple model and for a complex model)

$$E_{\rm out} = \mathbb E_x\big[E_{\rm out}(x)\big]$$

SLIDE 22

Decomposing The Learning Curve

VC Analysis versus Bias-Variance Analysis

(figures: the learning curve decomposed two ways; VC analysis splits the expected error into in-sample error plus generalization error, bias-variance analysis splits it into bias plus variance)

VC: pick H that can generalize and has a good chance to fit the data.
Bias-variance: pick (H, A) to approximate f and not behave wildly after seeing the data.
