

  1. Learning From Data Lecture 7: Approximation Versus Generalization
     The VC Dimension
     Approximation Versus Generalization
     Bias and Variance
     The Learning Curve
     M. Magdon-Ismail, CSCI 4100/6100

  2. recap: The Vapnik-Chervonenkis Bound (VC Bound)

     For any $\epsilon > 0$,
     $P\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 4\,m_{\mathcal{H}}(2N)\,e^{-\epsilon^2 N/8}$
     (finite H:  $P\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\right] \le 2\,|\mathcal{H}|\,e^{-2\epsilon^2 N}$)

     Equivalently, for any $\epsilon > 0$,
     $P\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| \le \epsilon\,\right] \ge 1 - 4\,m_{\mathcal{H}}(2N)\,e^{-\epsilon^2 N/8}$
     (finite H:  $P\left[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| \le \epsilon\,\right] \ge 1 - 2\,|\mathcal{H}|\,e^{-2\epsilon^2 N}$)

     With probability at least $1 - \delta$,
     $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\tfrac{8}{N}\ln\tfrac{4\,m_{\mathcal{H}}(2N)}{\delta}}$
     (finite H:  $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\tfrac{1}{2N}\ln\tfrac{2|\mathcal{H}|}{\delta}}$)

     If k is a break point for H,
     $m_{\mathcal{H}}(N) \le \sum_{i=0}^{k-1}\binom{N}{i} \le N^{k-1} + 1.$
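A minimal sketch (not part of the lecture) of the last inequality: for a hypothesis set with break point k, the Sauer bound keeps the growth function polynomial in N, far below 2^N. The function name and the choice k = 4 (the 2-D perceptron's break point) are illustrative.

```python
# Minimal sketch: the Sauer bound on the growth function when k is a break point.
from math import comb

def m_H_bound(N: int, k: int) -> int:
    """m_H(N) <= sum_{i=0}^{k-1} C(N, i), which is also <= N^(k-1) + 1."""
    return sum(comb(N, i) for i in range(k))

if __name__ == "__main__":
    k = 4                                  # e.g. the 2-D perceptron has break point 4
    for N in (5, 10, 100, 1000):
        print(N, m_H_bound(N, k), N ** (k - 1) + 1, 2 ** N)   # polynomial vs. exponential
```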

  3. The VC Dimension d_vc

     $m_{\mathcal{H}}(N) \sim N^{k-1}$, so the tightest bound is obtained with the smallest break point $k^*$.

     Definition [VC Dimension]: $d_{\text{vc}} = k^* - 1$. Equivalently, the VC dimension is the largest N that can be shattered ($m_{\mathcal{H}}(N) = 2^N$).

     N <= d_vc:  H could shatter your data (H can shatter some set of N points).
     N > d_vc:   N is a break point for H; H cannot possibly shatter your data.

     $m_{\mathcal{H}}(N) \le N^{d_{\text{vc}}} + 1 \sim N^{d_{\text{vc}}}$

     $E_{\text{out}}(g) \le E_{\text{in}}(g) + O\!\left(\sqrt{\tfrac{d_{\text{vc}}\log N}{N}}\right)$

  4. The VC-dimension is an Effective Number of Parameters

     m_H(N) for N =        1   2   3   4    5     ...   d_vc   #Param
     2-D perceptron        2   4   8   14         ...    3       3
     1-D pos. ray          2   3   4   5          ...    1       1
     2-D pos. rectangles   2   4   8   16   <2^5  ...    4       4
     pos. convex sets      2   4   8   16   32    ...    ∞       ∞

     There are models with few parameters but infinite d_vc.
     There are models with redundant parameters but small d_vc.
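As a check on one row of the table, here is a minimal sketch (not part of the lecture) that brute-forces the dichotomies a 1-D positive ray, h(x) = sign(x - a), can generate on N points. The count comes out to N + 1, which first falls below 2^N at N = 2, so d_vc = 1. The function name is illustrative.

```python
# Minimal sketch: count the dichotomies positive rays generate on N points.
def positive_ray_dichotomies(points):
    points = sorted(points)
    thresholds = [points[0] - 1.0] + [p + 0.5 for p in points]   # one threshold per gap
    dichotomies = {tuple(+1 if x > a else -1 for x in points) for a in thresholds}
    return len(dichotomies)

if __name__ == "__main__":
    for N in range(1, 6):
        pts = list(range(N))                              # any N distinct points work
        print(N, positive_ray_dichotomies(pts), 2 ** N)   # N + 1 versus 2^N
```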

  5. VC-dimension of the Perceptron in R^d is d + 1

     This can be shown in two steps:

     1. d_vc >= d + 1. What needs to be shown?
        (a) There is a set of d + 1 points that can be shattered.
        (b) There is a set of d + 1 points that cannot be shattered.
        (c) Every set of d + 1 points can be shattered.
        (d) Every set of d + 1 points cannot be shattered.

     2. d_vc <= d + 1. What needs to be shown?
        (a) There is a set of d + 1 points that can be shattered.
        (b) There is a set of d + 2 points that cannot be shattered.
        (c) Every set of d + 2 points can be shattered.
        (d) Every set of d + 1 points cannot be shattered.
        (e) Every set of d + 2 points cannot be shattered.

  6. VC-dimension of the Perceptron in R^d is d + 1

     This can be shown in two steps:

     1. d_vc >= d + 1. What needs to be shown?
        ✓ (a) There is a set of d + 1 points that can be shattered.
          (b) There is a set of d + 1 points that cannot be shattered.
          (c) Every set of d + 1 points can be shattered.
          (d) Every set of d + 1 points cannot be shattered.

     2. d_vc <= d + 1. What needs to be shown?
        (a) There is a set of d + 1 points that can be shattered.
        (b) There is a set of d + 2 points that cannot be shattered.
        (c) Every set of d + 2 points can be shattered.
        (d) Every set of d + 1 points cannot be shattered.
        (e) Every set of d + 2 points cannot be shattered.

  7. VC-dimension of the Perceptron in R^d is d + 1

     This can be shown in two steps:

     1. d_vc >= d + 1. What needs to be shown?
        ✓ (a) There is a set of d + 1 points that can be shattered.
          (b) There is a set of d + 1 points that cannot be shattered.
          (c) Every set of d + 1 points can be shattered.
          (d) Every set of d + 1 points cannot be shattered.

     2. d_vc <= d + 1. What needs to be shown?
          (a) There is a set of d + 1 points that can be shattered.
          (b) There is a set of d + 2 points that cannot be shattered.
          (c) Every set of d + 2 points can be shattered.
          (d) Every set of d + 1 points cannot be shattered.
        ✓ (e) Every set of d + 2 points cannot be shattered.
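A minimal sketch (not part of the lecture) of step 1(a): the standard construction takes the origin together with the d unit vectors in R^d. With the bias coordinate prepended, the (d+1)×(d+1) matrix of points is invertible, so every one of the 2^(d+1) labelings is realized exactly by solving for the weights. The function names are illustrative.

```python
# Minimal sketch: verify that d + 1 well-chosen points are shattered by a perceptron.
from itertools import product
import numpy as np

def can_shatter(X_aug: np.ndarray) -> bool:
    """Check that every labeling of the rows of X_aug is realized by some weight vector."""
    n = X_aug.shape[0]
    for y in product([-1.0, 1.0], repeat=n):
        y = np.array(y)
        # Solve X w = y; the solution is exact when X_aug is invertible.
        w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
        if not np.array_equal(np.sign(X_aug @ w), y):
            return False
    return True

if __name__ == "__main__":
    d = 4
    pts = np.vstack([np.zeros(d), np.eye(d)])         # the origin plus d unit vectors
    X_aug = np.hstack([np.ones((d + 1, 1)), pts])     # prepend the bias coordinate
    print(can_shatter(X_aug))                         # True: these d + 1 points are shattered
```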

  8. A Single Parameter Characterizes Complexity

     For finite H:
     $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\tfrac{1}{2N}\ln\tfrac{2|\mathcal{H}|}{\delta}}$
     [Figure: error versus |H| -- the in-sample error falls and the model-complexity penalty rises, so the out-of-sample error is minimized at some |H|*.]

     For general H, bounding the growth function via the VC dimension:
     $E_{\text{out}}(g) \le E_{\text{in}}(g) + \underbrace{\sqrt{\tfrac{8}{N}\ln\tfrac{4\left((2N)^{d_{\text{vc}}}+1\right)}{\delta}}}_{\Omega(d_{\text{vc}}),\ \text{penalty for model complexity}}$
     [Figure: the same tradeoff plotted against the VC dimension, with the out-of-sample error minimized at some d_vc*.]
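A minimal sketch (not part of the lecture) of the penalty term: evaluate Omega(d_vc) = sqrt((8/N) ln(4((2N)^d_vc + 1)/delta)) at a fixed N for a few VC dimensions, showing how the error bar grows with model complexity. The particular N, delta, and d_vc values are illustrative.

```python
# Minimal sketch: the model-complexity penalty Omega(d_vc) from the VC bound.
from math import log, sqrt

def omega(N: int, d_vc: int, delta: float) -> float:
    return sqrt(8.0 / N * log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta))

if __name__ == "__main__":
    N, delta = 10000, 0.1
    for d_vc in (1, 3, 5, 10, 20):
        print(d_vc, round(omega(N, d_vc, delta), 3))   # the error bar grows with d_vc
```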

  9. Sample Complexity: How Many Data Points Do You Need?

     Set the error bar at $\epsilon$:
     $\epsilon = \sqrt{\tfrac{8}{N}\ln\tfrac{4\left((2N)^{d_{\text{vc}}}+1\right)}{\delta}}$

     Solve for N:
     $N = \tfrac{8}{\epsilon^2}\ln\tfrac{4\left((2N)^{d_{\text{vc}}}+1\right)}{\delta} = O(d_{\text{vc}}\ln N)$

     Example: d_vc = 3; error bar eps = 0.1; confidence 90% (delta = 0.1).
     N appears on both sides, but a simple iterative method works well. Trying N = 1000:
     $N \approx \tfrac{8}{0.1^2}\ln\tfrac{4(2000)^3 + 4}{0.1} \approx 21{,}192.$
     We continue iteratively with the new value of N, and converge to N ≈ 30,000.
     If d_vc = 4, N ≈ 40,000; for d_vc = 5, N ≈ 50,000. (N ∝ d_vc, but these are gross overestimates.)

     Practical rule of thumb: N = 10 × d_vc.
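A minimal sketch (not part of the lecture) of the fixed-point iteration described above: start from a guess for N, plug it into the right-hand side, and repeat until the estimate stabilizes. The function name and starting guess are illustrative.

```python
# Minimal sketch: iterate N = (8/eps^2) * ln(4*((2N)^d_vc + 1)/delta) to convergence.
from math import ceil, log

def sample_complexity(d_vc: int, eps: float, delta: float, N: float = 1000.0) -> int:
    for _ in range(100):                     # converges in a handful of iterations
        N_new = 8.0 / eps ** 2 * log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta)
        if abs(N_new - N) < 1.0:
            break
        N = N_new
    return ceil(N)

if __name__ == "__main__":
    for d_vc in (3, 4, 5):
        print(d_vc, sample_complexity(d_vc, eps=0.1, delta=0.1))   # roughly 30k, 40k, 50k
```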

  10. Theory Versus Practice

      The VC analysis allows us to reach outside the data for general H:
      - a single parameter, d_vc, characterizes the complexity of H;
      - d_vc depends only on H;
      - E_in can reach outside D to E_out when d_vc is finite.

      In practice...
      * The VC bound is loose:
        - the Hoeffding inequality is itself loose;
        - m_H(N) is a worst-case number of dichotomies, not the average or likely case;
        - the polynomial bound on m_H(N) is loose.
      * It is still a good guide: models with small d_vc are good.
      * Roughly 10 × d_vc examples are needed to get good generalization.

  11. The Test Set

      * Another way to estimate E_out(g) is to use a test set, which gives E_test(g).
      * E_test is a better estimate than E_in: you don't pay the price for fitting,
        so you can use |H| = 1 in the Hoeffding bound with E_test.
      * Both the test and training errors have variance, but the training error also has an
        optimistic bias due to selection -- fitting the data. A test error has no such bias.
      * The price for a test set is fewer training examples. (Why is this bad?)
        E_test ≈ E_out, but now E_test itself may be large because g was learned from less data.
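A minimal sketch (not part of the lecture) of the |H| = 1 point: with a test set of size N_test, the finite-H bound from slide 2 reduces to E_out ≤ E_test + sqrt(ln(2/δ)/(2 N_test)) with probability at least 1 − δ, with no growth function or d_vc involved.

```python
# Minimal sketch: the Hoeffding error bar on E_test, using |H| = 1.
from math import log, sqrt

def test_set_error_bar(N_test: int, delta: float) -> float:
    return sqrt(log(2.0 / delta) / (2.0 * N_test))

if __name__ == "__main__":
    for N_test in (100, 1000, 10000):
        print(N_test, round(test_set_error_bar(N_test, delta=0.1), 4))
```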

  12. VC Bound Quantifies Approximation Versus Generalization

      The best H for approximating f is H = {f} itself -- but you have about as much chance
      of picking that H as of winning the lottery.

      d_vc ↑  ⇒  better chance of approximating f (E_in ≈ 0).
      d_vc ↓  ⇒  better chance of generalizing out of sample (E_in ≈ E_out).

      E_out ≤ E_in + Ω(d_vc).

      The VC analysis depends only on H; it is independent of f, P(x), and the learning algorithm.

  13. Bias-Variance Analysis

      Another way to quantify the tradeoff:
      1. How well can the learning approximate f?
         (as opposed to how well the learning approximated f in-sample, E_in)
      2. How close can you get to that approximation with a finite data set?
         (as opposed to how close E_in is to E_out)

      Bias-variance analysis applies to squared error (both classification and regression).
      Bias-variance analysis can take the learning algorithm into account:
      different learning algorithms can have different E_out when applied to the same H!

  14. A Simple Learning Problem

      Two data points. Two hypothesis sets:
      H0: h(x) = b
      H1: h(x) = ax + b
      [Figures: the two data points, fit with H0 (a constant) and with H1 (a line).]
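A minimal sketch (not part of the lecture) of the fits behind the figures: least-squares fits of H0 and H1 to two data points sampled from the target. The target f(x) = sin(πx) on [-1, 1] is an assumption taken from the textbook's version of this example; the slide only shows a sine curve.

```python
# Minimal sketch: fit H0 (a constant) and H1 (a line) to two sampled points.
import numpy as np

rng = np.random.default_rng(0)

def fit_H0(x, y):
    """h(x) = b: the best constant under squared error is the mean of the two targets."""
    return np.mean(y)

def fit_H1(x, y):
    """h(x) = a x + b: the line through the two points (an exact fit)."""
    a = (y[1] - y[0]) / (x[1] - x[0])
    return a, y[0] - a * x[0]

if __name__ == "__main__":
    x = rng.uniform(-1, 1, size=2)
    y = np.sin(np.pi * x)
    print("H0 fit: b =", fit_H0(x, y))
    print("H1 fit: (a, b) =", fit_H1(x, y))
```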

  15. Let's Repeat the Experiment Many Times

      [Figures: the fits obtained from many different two-point data sets, for H0 and for H1.]

      For each data set D, you get a different g^D.
      So, for a fixed x, g^D(x) is a random value, depending on D.

  16. What's Happening on Average

      [Figures: the average hypothesis ḡ(x) plotted against sin(x), for H0 and for H1.]

      We can define:
      g^D(x)  ← a random value, depending on D
      $\bar g(x) = \mathbb{E}_D\!\left[g^D(x)\right] \approx \tfrac{1}{K}\left(g^{D_1}(x) + \cdots + g^{D_K}(x)\right)$  ← your average prediction on x
      $\text{var}(x) = \mathbb{E}_D\!\left[\left(g^D(x) - \bar g(x)\right)^2\right] = \mathbb{E}_D\!\left[g^D(x)^2\right] - \bar g(x)^2$  ← how variable is your prediction?
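A minimal sketch (not part of the lecture) of the averages just defined: repeat the two-point experiment K times for H1 and estimate ḡ(x) and var(x) at a few test points. As before, the target f(x) = sin(πx) on [-1, 1] is an assumption taken from the textbook's version of this example.

```python
# Minimal sketch: estimate g_bar(x) and var(x) for H1 over K repeated data sets.
import numpy as np

rng = np.random.default_rng(0)
K = 10000                                    # number of data sets
x_grid = np.linspace(-1, 1, 5)               # a few test points x

def fit_H1(x, y):
    a = (y[1] - y[0]) / (x[1] - x[0])
    return a, y[0] - a * x[0]

preds = np.empty((K, x_grid.size))
for k in range(K):
    x = rng.uniform(-1, 1, size=2)           # a fresh two-point data set D_k
    y = np.sin(np.pi * x)
    a, b = fit_H1(x, y)
    preds[k] = a * x_grid + b                # g^{D_k}(x) at the test points

g_bar = preds.mean(axis=0)                   # estimates E_D[g^D(x)]
var = preds.var(axis=0)                      # estimates var(x) = E_D[(g^D(x) - g_bar(x))^2]
print(np.round(g_bar, 3))
print(np.round(var, 3))
```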

  17. E_out on a Test Point x for Data Set D

      [Figures: the target f(x), the learned hypothesis g^D(x), and the squared deviation E^D_out(x) at the test point x.]

      $E^D_{\text{out}}(x) = \left(g^D(x) - f(x)\right)^2$  ← squared error, a random value depending on D
      $E_{\text{out}}(x) = \mathbb{E}_D\!\left[E^D_{\text{out}}(x)\right]$  ← the expected E_out(x) before seeing D
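A minimal sketch (not part of the lecture) of this definition: a Monte Carlo estimate of E_out(x) at a single test point, averaging the squared error (g^D(x) − f(x))^2 over many data sets D, again assuming the textbook target f(x) = sin(πx) and the H1 model.

```python
# Minimal sketch: estimate E_out(x_test) = E_D[(g^D(x_test) - f(x_test))^2] for H1.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)
x_test, K = 0.5, 10000

sq_errors = []
for _ in range(K):
    x = rng.uniform(-1, 1, size=2)           # a fresh two-point data set D
    y = f(x)
    a = (y[1] - y[0]) / (x[1] - x[0])        # H1: line through the two points
    b = y[0] - a * x[0]
    sq_errors.append((a * x_test + b - f(x_test)) ** 2)   # E^D_out(x_test)

print(np.mean(sq_errors))                    # estimates E_out(x_test)
```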
