 
Learning From Data
Lecture 7
Approximation Versus Generalization

The VC Dimension
Approximation Versus Generalization
Bias and Variance
The Learning Curve

M. Magdon-Ismail
CSCI 4100/6100
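recap: The Vapnik-Chervonenkis Bound (VC Bound)

For any $\epsilon > 0$:

$$P\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 4\, m_{\mathcal{H}}(2N)\, e^{-\epsilon^2 N/8}$$
$$P\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 2\,|\mathcal{H}|\, e^{-2\epsilon^2 N} \quad \leftarrow \text{finite } \mathcal{H}$$

Equivalently:

$$P\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| \le \epsilon\,\right] \ge 1 - 4\, m_{\mathcal{H}}(2N)\, e^{-\epsilon^2 N/8}$$
$$P\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| \le \epsilon\,\right] \ge 1 - 2\,|\mathcal{H}|\, e^{-2\epsilon^2 N} \quad \leftarrow \text{finite } \mathcal{H}$$

As an error bar, with probability at least $1 - \delta$:

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{8}{N} \log \frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$
$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N} \log \frac{2\,|\mathcal{H}|}{\delta}} \quad \leftarrow \text{finite } \mathcal{H}$$

If $k$ is a break point:

$$m_{\mathcal{H}}(N) \le \sum_{i=0}^{k-1} \binom{N}{i} \le N^{k-1} + 1$$

© AML Creator: Malik Magdon-Ismail. Approximation Versus Generalization: 2/22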
The VC Dimension $d_{\text{vc}}$

$m_{\mathcal{H}}(N) \sim N^{k-1}$. The tightest bound is obtained with the smallest break point $k^*$.

Definition [VC Dimension]. $d_{\text{vc}} = k^* - 1$.
The VC dimension is the largest $N$ which can be shattered ($m_{\mathcal{H}}(N) = 2^N$).

$N \le d_{\text{vc}}$: $\mathcal{H}$ could shatter your data ($\mathcal{H}$ can shatter some $N$ points).
$N > d_{\text{vc}}$: $N$ is a break point for $\mathcal{H}$; $\mathcal{H}$ cannot possibly shatter your data.

$$m_{\mathcal{H}}(N) \le N^{d_{\text{vc}}} + 1 \sim N^{d_{\text{vc}}}$$
$$E_{\text{out}}(g) \le E_{\text{in}}(g) + O\!\left(\sqrt{\frac{d_{\text{vc}} \log N}{N}}\right)$$
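The polynomial bound on the growth function can be checked numerically. The sketch below is illustrative only: the function names `sauer_bound` and `polynomial_bound` are made up here, and it assumes the standard binomial-sum bound with break point $k = d_{\text{vc}} + 1$.

```python
from math import comb

def sauer_bound(n, d_vc):
    """Binomial-sum bound on m_H(N): sum of C(N, i) for i = 0..d_vc."""
    return sum(comb(n, i) for i in range(d_vc + 1))

def polynomial_bound(n, d_vc):
    """Looser polynomial bound N^d_vc + 1."""
    return n ** d_vc + 1

if __name__ == "__main__":
    d_vc = 3  # e.g. the 2-D perceptron
    for n in range(1, 9):
        # Columns: N, binomial-sum bound, polynomial bound, 2^N
        print(n, sauer_bound(n, d_vc), polynomial_bound(n, d_vc), 2 ** n)
```

For $N \le d_{\text{vc}}$ the binomial sum equals $2^N$, so the bound says nothing yet; from $N = d_{\text{vc}} + 1$ onward it falls below $2^N$ and grows only polynomially.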
The VC-dimension is an Effective Number of Parameters

Growth function m_H(N), number of parameters, and d_vc:

                        N = 1    2    3    4     5      ...    #Param   d_vc
2-D perceptron              2    4    8    14    ...           3        3
1-D pos. ray                2    3    4    5     ...           1        1
2-D pos. rectangles         2    4    8    16    < 2^5  ...    4        4
pos. convex sets            2    4    8    16    32     ...    ∞        ∞

There are models with few parameters but infinite $d_{\text{vc}}$.
There are models with redundant parameters but small $d_{\text{vc}}$.
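The 1-D positive ray row of the table can be reproduced by brute force. This is a minimal sketch, assuming the positive ray is $h(x) = \text{sign}(x - a)$ with a single threshold $a$; the function name is hypothetical.

```python
def positive_ray_dichotomies(points):
    """Return the set of dichotomies a positive ray h(x) = sign(x - a) realizes on points."""
    xs = sorted(points)
    # One threshold below all points, one between each pair of neighbours, one above all.
    thresholds = [xs[0] - 1.0]
    thresholds += [(lo + hi) / 2.0 for lo, hi in zip(xs, xs[1:])]
    thresholds += [xs[-1] + 1.0]
    return {tuple(1 if x > a else -1 for x in points) for a in thresholds}

if __name__ == "__main__":
    for n in range(1, 6):
        pts = [float(i) for i in range(n)]
        # Prints N and the number of dichotomies, which is N + 1 for distinct points.
        print(n, len(positive_ray_dichotomies(pts)))
```

Only $N = 1$ gives $2^N$ dichotomies, which matches $d_{\text{vc}} = 1$ for this model.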
VC-dimension of the Perceptron in $\mathbb{R}^d$ is $d + 1$

This can be shown in two steps:

1. $d_{\text{vc}} \ge d + 1$. What needs to be shown?
   (a) There is a set of $d + 1$ points that can be shattered.
   (b) There is a set of $d + 1$ points that cannot be shattered.
   (c) Every set of $d + 1$ points can be shattered.
   (d) Every set of $d + 1$ points cannot be shattered.

2. $d_{\text{vc}} \le d + 1$. What needs to be shown?
   (a) There is a set of $d + 1$ points that can be shattered.
   (b) There is a set of $d + 2$ points that cannot be shattered.
   (c) Every set of $d + 2$ points can be shattered.
   (d) Every set of $d + 1$ points cannot be shattered.
   (e) Every set of $d + 2$ points cannot be shattered.
VC-dimension of the Perceptron in $\mathbb{R}^d$ is $d + 1$: answer to step 1

1. $d_{\text{vc}} \ge d + 1$. Answer: (a) There is a set of $d + 1$ points that can be shattered.
   One shattered set suffices, because $d_{\text{vc}}$ is the largest $N$ for which some set of $N$ points can be shattered.
VC-dimension of the Perceptron in $\mathbb{R}^d$ is $d + 1$: answer to step 2

2. $d_{\text{vc}} \le d + 1$. Answer: (e) Every set of $d + 2$ points cannot be shattered, that is, no set of $d + 2$ points can be shattered.
   A single unshatterable set of $d + 2$ points is not enough: $d + 2$ must be a break point.
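Step 1 can be checked constructively. The sketch below assumes one particular point set, the origin plus the $d$ standard basis vectors; this is a common textbook choice and is not taken from the slides. For these points the weights $w_0 = y_0$, $w_i = y_i - y_0$ reproduce any labeling exactly, so every dichotomy is realized. All function names are hypothetical.

```python
from itertools import product

def shattering_points(d):
    """Assumed point set: the origin and the d standard basis vectors in R^d."""
    origin = [0.0] * d
    units = [[1.0 if j == i else 0.0 for j in range(d)] for i in range(d)]
    return [origin] + units

def weights_for(labels):
    """Weights (bias first) with w0 + w.x equal to the label on every assumed point."""
    w0 = labels[0]
    return [w0] + [y - w0 for y in labels[1:]]

def perceptron(w, x):
    """h(x) = sign(w0 + w1*x1 + ... + wd*xd)."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

def is_shattered(d):
    """Check that all 2^(d+1) dichotomies on the assumed points are realized."""
    pts = shattering_points(d)
    return all(
        [perceptron(weights_for(labels), x) for x in pts] == list(labels)
        for labels in product([-1, 1], repeat=d + 1)
    )

if __name__ == "__main__":
    print(all(is_shattered(d) for d in range(1, 6)))
```

This only demonstrates $d_{\text{vc}} \ge d + 1$. Step 2 needs an argument that holds for every set of $d + 2$ points, which a finite check on one point set cannot provide.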
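A Single Parameter Characterizes Complexity

Finite $\mathcal{H}$:

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N} \log \frac{2\,|\mathcal{H}|}{\delta}}$$

[Figure: Error versus $|\mathcal{H}|$, with curves for in-sample error, model complexity, and out-of-sample error; $|\mathcal{H}|^*$ is marked on the horizontal axis.]

General $\mathcal{H}$, using $d_{\text{vc}}$:

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \underbrace{\sqrt{\frac{8}{N} \log \frac{4\left((2N)^{d_{\text{vc}}} + 1\right)}{\delta}}}_{\text{penalty for model complexity, } \Omega(d_{\text{vc}})}$$

[Figure: Error versus VC dimension $d_{\text{vc}}$, with curves for in-sample error, model complexity, and out-of-sample error; $d_{\text{vc}}^*$ is marked on the horizontal axis.]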
Sample Complexity: How Many Data Points Do You Need?

Set the error bar at $\epsilon$:

$$\epsilon = \sqrt{\frac{8}{N} \ln \frac{4\left((2N)^{d_{\text{vc}}} + 1\right)}{\delta}}$$

Solve for $N$:

$$N = \frac{8}{\epsilon^2} \ln \frac{4\left((2N)^{d_{\text{vc}}} + 1\right)}{\delta} = O(d_{\text{vc}} \ln N)$$

Example. $d_{\text{vc}} = 3$; error bar $\epsilon = 0.1$; confidence 90% ($\delta = 0.1$).
$N$ appears on both sides, so a simple iterative method works well. Trying $N = 1000$ on the right-hand side we get

$$N \approx \frac{8}{0.1^2} \ln \frac{4(2000)^3 + 4}{0.1} \approx 21193.$$

We continue iteratively, and converge to $N \approx 30000$.
If $d_{\text{vc}} = 4$, $N \approx 40000$; for $d_{\text{vc}} = 5$, $N \approx 50000$.
($N \propto d_{\text{vc}}$, but these are gross overestimates.)

Practical Rule of Thumb: $N = 10 \times d_{\text{vc}}$
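The iterative method in the example can be sketched as a fixed-point iteration on the sample-complexity equation. The function names, the starting guess `n0`, and the stopping tolerance `tol` are illustrative choices, not part of the lecture.

```python
import math

def next_estimate(n, d_vc, eps, delta):
    """One step: plug the current N into the right-hand side of the equation."""
    return (8.0 / eps ** 2) * math.log(4.0 * ((2.0 * n) ** d_vc + 1.0) / delta)

def sample_complexity(d_vc, eps, delta, n0=1000.0, tol=1.0, max_iter=1000):
    """Iterate N <- (8/eps^2) ln(4((2N)^d_vc + 1)/delta) until N changes by less than tol."""
    n = n0
    for _ in range(max_iter):
        n_next = next_estimate(n, d_vc, eps, delta)
        if abs(n_next - n) < tol:
            return n_next
        n = n_next
    return n

if __name__ == "__main__":
    print(round(next_estimate(1000.0, 3, 0.1, 0.1)))  # first iterate from N = 1000
    for d_vc in (3, 4, 5):
        print(d_vc, round(sample_complexity(d_vc, 0.1, 0.1)))
```

The fixed points land slightly below the rounded figures of 30000, 40000, and 50000 quoted in the example, and far above the $10 \times d_{\text{vc}}$ rule of thumb.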
Theory Versus Practice

The VC analysis allows us to reach outside the data for general $\mathcal{H}$.
– a single parameter characterizes the complexity of $\mathcal{H}$;
– $d_{\text{vc}}$ depends only on $\mathcal{H}$;
– $E_{\text{in}}$ can reach outside $\mathcal{D}$ to $E_{\text{out}}$ when $d_{\text{vc}}$ is finite.

In Practice . . .
• The VC bound is loose.
  – The Hoeffding bound it starts from is already loose.
  – $m_{\mathcal{H}}(N)$ is a worst-case number of dichotomies, not an average case or likely case.
  – The polynomial bound on $m_{\mathcal{H}}(N)$ is loose.
• It is a good guide: models with small $d_{\text{vc}}$ are good.
• Roughly $10 \times d_{\text{vc}}$ examples are needed to get good generalization.
The Test Set

• Another way to estimate $E_{\text{out}}(g)$ is using a test set to obtain $E_{\text{test}}(g)$.
• $E_{\text{test}}$ is better than $E_{\text{in}}$: you don't pay the price for fitting. You can use $|\mathcal{H}| = 1$ in the Hoeffding bound with $E_{\text{test}}$.
• Both a test set and a training set have variance. The training set has an optimistic bias due to selection (fitting the data). A test set has no bias.
• The price for a test set is fewer training examples. (Why is this bad?) $E_{\text{test}} \approx E_{\text{out}}$, but now $E_{\text{test}}$ itself may be bad.
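To see what using $|\mathcal{H}| = 1$ buys, the finite-$\mathcal{H}$ Hoeffding error bar $\sqrt{\frac{1}{2N} \ln \frac{2|\mathcal{H}|}{\delta}}$ can be evaluated directly. This is a minimal sketch; the function name and the example sizes are illustrative assumptions.

```python
import math

def hoeffding_error_bar(n, delta, num_hypotheses=1):
    """Error bar sqrt(ln(2M/delta) / (2n)) for M hypotheses and n examples."""
    return math.sqrt(math.log(2.0 * num_hypotheses / delta) / (2.0 * n))

if __name__ == "__main__":
    n, delta = 1000, 0.05
    # Test set: the final hypothesis is fixed before seeing the test data, so M = 1.
    print(hoeffding_error_bar(n, delta))
    # Training set with a hypothetical finite H of 1000 hypotheses, for comparison.
    print(hoeffding_error_bar(n, delta, num_hypotheses=1000))
```

The comparison holds the number of examples fixed; in practice the test set is carved out of the data, which is the price the last bullet refers to.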
VC Bound Quantifies Approximation Versus Generalization

The best $\mathcal{H}$ is $\mathcal{H} = \{f\}$. You are better off buying a lottery ticket.

$d_{\text{vc}} \uparrow \implies$ better chance of approximating $f$ ($E_{\text{in}} \approx 0$).
$d_{\text{vc}} \downarrow \implies$ better chance of generalizing to out of sample ($E_{\text{in}} \approx E_{\text{out}}$).

$$E_{\text{out}} \le E_{\text{in}} + \Omega(d_{\text{vc}})$$

VC analysis only depends on $\mathcal{H}$. It is independent of $f$, $P(\mathbf{x})$, and the learning algorithm.
Bias-Variance Analysis

Another way to quantify the tradeoff:

1. How well can the learning approximate $f$,
   . . . as opposed to how well did the learning approximate $f$ in-sample ($E_{\text{in}}$).
2. How close can you get to that approximation with a finite data set,
   . . . as opposed to how close is $E_{\text{in}}$ to $E_{\text{out}}$.

Bias-variance analysis applies to squared errors (classification and regression).
Bias-variance analysis can take into account the learning algorithm.
Different learning algorithms can have different $E_{\text{out}}$ when applied to the same $\mathcal{H}$!
A Simple Learning Problem

2 Data Points. 2 hypothesis sets:

$\mathcal{H}_0$: $h(x) = b$
$\mathcal{H}_1$: $h(x) = ax + b$

[Figure: two panels, one per hypothesis set, with axes $x$ and $y$.]
Let's Repeat the Experiment Many Times

[Figure: two panels, one per hypothesis set, with axes $x$ and $y$.]

For each data set $\mathcal{D}$, you get a different $g^{\mathcal{D}}$.
So, for a fixed $\mathbf{x}$, $g^{\mathcal{D}}(\mathbf{x})$ is a random value, depending on $\mathcal{D}$.
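What's Happening on Average

[Figure: two panels with axes $x$ and $y$, each showing $\bar{g}(x)$ against $\sin(x)$.]

We can define:

$g^{\mathcal{D}}(\mathbf{x})$ ← random value, depending on $\mathcal{D}$

$$\bar{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[g^{\mathcal{D}}(\mathbf{x})\right] \approx \frac{1}{K}\left(g^{\mathcal{D}_1}(\mathbf{x}) + \cdots + g^{\mathcal{D}_K}(\mathbf{x})\right) \quad \leftarrow \text{your average prediction on } \mathbf{x}$$

$$\text{var}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[\left(g^{\mathcal{D}}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^2\right] = \mathbb{E}_{\mathcal{D}}\left[g^{\mathcal{D}}(\mathbf{x})^2\right] - \bar{g}(\mathbf{x})^2 \quad \leftarrow \text{how variable is your prediction?}$$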
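The average prediction and its variance can be estimated by simulation. The sketch below makes several assumptions that the slides do not state: the target is taken to be $f(x) = \sin(\pi x)$, inputs are drawn uniformly from $[-1, 1]$, the constant model fits the midpoint of the two $y$ values, and the line passes through both points. All names are hypothetical.

```python
import math
import random

def f(x):
    # Assumed target; the slides only label the curve sin(x).
    return math.sin(math.pi * x)

def fit_constant(x1, y1, x2, y2):
    """H0: h(x) = b, with b the midpoint of the two y values."""
    b = (y1 + y2) / 2.0
    return lambda x: b

def fit_line(x1, y1, x2, y2):
    """H1: h(x) = a*x + b, the line through the two data points."""
    a = (y2 - y1) / (x2 - x1)
    b = y1 - a * x1
    return lambda x: a * x + b

def average_and_variance(fit, x, K=20000, seed=0):
    """Estimate g_bar(x), var(x), and E_D[g^D(x)^2] from K random two-point data sets."""
    rng = random.Random(seed)
    preds = []
    while len(preds) < K:
        x1, x2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x1 == x2:
            continue  # the line through two identical inputs is undefined
        preds.append(fit(x1, f(x1), x2, f(x2))(x))
    g_bar = sum(preds) / K
    var = sum((p - g_bar) ** 2 for p in preds) / K
    second_moment = sum(p * p for p in preds) / K
    return g_bar, var, second_moment

if __name__ == "__main__":
    for name, fit in (("H0", fit_constant), ("H1", fit_line)):
        g_bar, var, _ = average_and_variance(fit, x=0.5)
        print(name, g_bar, var)
```

The third return value lets you confirm numerically that the two expressions for $\text{var}(\mathbf{x})$ agree.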
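$E_{\text{out}}$ on Test Point $\mathbf{x}$ for Data $\mathcal{D}$

[Figure: two panels with horizontal axis $x$, each showing $f(\mathbf{x})$, $g^{\mathcal{D}}(\mathbf{x})$, and the gap $E_{\text{out}}^{\mathcal{D}}(\mathbf{x})$ at a test point.]

$$E_{\text{out}}^{\mathcal{D}}(\mathbf{x}) = \left(g^{\mathcal{D}}(\mathbf{x}) - f(\mathbf{x})\right)^2 \quad \leftarrow \text{squared error, a random value depending on } \mathcal{D}$$

$$E_{\text{out}}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\left[E_{\text{out}}^{\mathcal{D}}(\mathbf{x})\right] \quad \leftarrow \text{expected } E_{\text{out}}(\mathbf{x}) \text{ before seeing } \mathcal{D}$$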