Learning From Data, Lecture 7: Approximation Versus Generalization (PowerPoint presentation)



SLIDE 1

Learning From Data Lecture 7 Approximation Versus Generalization

The VC Dimension Approximation Versus Generalization Bias and Variance The Learning Curve

M. Magdon-Ismail

CSCI 4100/6100

SLIDE 2

recap: The Vapnik-Chervonenkis Bound (VC Bound)

$$P\big[\,|E_{\rm in}(g) - E_{\rm out}(g)| > \epsilon\,\big] \;\le\; 4\,m_{\mathcal H}(2N)\,e^{-\epsilon^2 N/8}, \quad\text{for any } \epsilon > 0.$$

(finite $\mathcal H$: $P[\,|E_{\rm in}(g) - E_{\rm out}(g)| > \epsilon\,] \le 2|\mathcal H|\,e^{-2\epsilon^2 N}$)

$$P\big[\,|E_{\rm in}(g) - E_{\rm out}(g)| \le \epsilon\,\big] \;\ge\; 1 - 4\,m_{\mathcal H}(2N)\,e^{-\epsilon^2 N/8}, \quad\text{for any } \epsilon > 0.$$

(finite $\mathcal H$: $P[\,|E_{\rm in}(g) - E_{\rm out}(g)| \le \epsilon\,] \ge 1 - 2|\mathcal H|\,e^{-2\epsilon^2 N}$)

$$E_{\rm out}(g) \;\le\; E_{\rm in}(g) + \sqrt{\frac{8}{N}\log\frac{4\,m_{\mathcal H}(2N)}{\delta}}, \quad\text{w.p. at least } 1 - \delta.$$

(finite $\mathcal H$: $E_{\rm out}(g) \le E_{\rm in}(g) + \sqrt{\frac{1}{2N}\log\frac{2|\mathcal H|}{\delta}}$)

$$m_{\mathcal H}(N) \;\le\; \sum_{i=0}^{k-1}\binom{N}{i} \;\le\; N^{k-1} + 1, \quad\text{where } k \text{ is a break point.}$$
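The Sauer bound above is easy to check numerically. A small sketch (the helper name `sauer_bound` is ours, not from the lecture):

```python
from math import comb

def sauer_bound(N, k):
    """Sauer bound on the growth function when k is a break point:
    m_H(N) <= sum_{i=0}^{k-1} C(N, i)."""
    return sum(comb(N, i) for i in range(k))

# The sum is in turn bounded by the polynomial N^(k-1) + 1.
for N in range(1, 50):
    for k in range(1, 8):
        assert sauer_bound(N, k) <= N ** (k - 1) + 1
```

With k = 2 the bound gives N + 1, which is exactly the growth function of the 1-D positive ray.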

© Creator: Malik Magdon-Ismail

Approximation Versus Generalization: 2/22

SLIDE 3

The VC Dimension dvc

$m_{\mathcal H}(N) \sim N^{k-1}$. The tightest bound is obtained with the smallest break point $k^*$.

Definition [VC Dimension]: $d_{\rm vc} = k^* - 1$. Equivalently, the VC dimension is the largest $N$ that can be shattered ($m_{\mathcal H}(N) = 2^N$).

$N \le d_{\rm vc}$: $\mathcal H$ could shatter your data ($\mathcal H$ can shatter some $N$ points).
$N > d_{\rm vc}$: $N$ is a break point for $\mathcal H$; $\mathcal H$ cannot possibly shatter your data.

$$m_{\mathcal H}(N) \le N^{d_{\rm vc}} + 1 \sim N^{d_{\rm vc}}, \qquad E_{\rm out}(g) \le E_{\rm in}(g) + O\!\left(\sqrt{\frac{d_{\rm vc}\log N}{N}}\right).$$

SLIDE 4

The VC-dimension is an Effective Number of Parameters

    m_H(N) at N =             1  2  3  4    5    ...   #Params   d_vc
    2-D perceptron            2  4  8  14        ...   3         3
    1-D positive ray          2  3  4  5         ...   1         1
    2-D positive rectangles   2  4  8  16   <2^5 ...   4         4
    2-D positive convex sets  2  4  8  16   32   ...   ∞         ∞

There are models with few parameters but infinite d_vc. There are models with redundant parameters but small d_vc.
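The simpler rows of this table can be recovered by brute force. A sketch for the 1-D positive ray, h(x) = sign(x − a) (the function name `ray_dichotomies` is ours); it confirms m_H(N) = N + 1 and hence d_vc = 1:

```python
def ray_dichotomies(xs):
    """All dichotomies the 1-D positive ray h(x) = sign(x - a) can produce on xs."""
    xs = sorted(xs)
    # One threshold below all points, one between each adjacent pair, one above all.
    cuts = [xs[0] - 1] + [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)] + [xs[-1] + 1]
    return {tuple(1 if x > a else -1 for x in xs) for a in cuts}

for N in range(1, 8):
    assert len(ray_dichotomies(list(range(N)))) == N + 1   # m_H(N) = N + 1

# d_vc = 1: one point is shattered (2 = 2^1 dichotomies), two points are not (3 < 2^2).
```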

SLIDE 5

VC-dimension of the Perceptron in R^d is d + 1

This can be shown in two steps:

1. d_vc ≥ d + 1. What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. d_vc ≤ d + 1. What needs to be shown?
   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
   (e) Every set of d + 2 points cannot be shattered.

SLIDE 6

VC-dimension of the Perceptron in R^d is d + 1

Step 1 answer: to show d_vc ≥ d + 1, you need (a): exhibit one set of d + 1 points that can be shattered. (Take the origin together with the d standard basis vectors; the resulting input matrix is invertible, so for any dichotomy you can solve for weights that realize it.)

SLIDE 7

VC-dimension of the Perceptron in R^d is d + 1

Step 2 answer: to show d_vc ≤ d + 1, you need (e): every set of d + 2 points cannot be shattered. (Any d + 2 input vectors in R^{d+1}, after adding the bias coordinate, are linearly dependent; the dependence yields a dichotomy that no perceptron can realize.)

SLIDE 8

A Single Parameter Characterizes Complexity

For finite $\mathcal H$:
$$E_{\rm out}(g) \;\le\; \underbrace{E_{\rm in}(g)}_{\text{in-sample error}} \;+\; \underbrace{\sqrt{\tfrac{1}{2N}\log\tfrac{2|\mathcal H|}{\delta}}}_{\text{model complexity}}$$

(figure: in-sample error, model-complexity penalty, and out-of-sample error versus $|\mathcal H|$; the out-of-sample error is minimized at some $|\mathcal H|^*$)

For general $\mathcal H$, via the VC dimension:
$$E_{\rm out}(g) \;\le\; \underbrace{E_{\rm in}(g)}_{\text{in-sample error}} \;+\; \underbrace{\sqrt{\tfrac{8}{N}\log\tfrac{4\left((2N)^{d_{\rm vc}}+1\right)}{\delta}}}_{\text{penalty for model complexity, } \Omega(d_{\rm vc})}$$

(figure: the same curves versus the VC dimension $d_{\rm vc}$; the out-of-sample error is minimized at some $d_{\rm vc}^*$)

SLIDE 9

Sample Complexity: How Many Data Points Do You Need?

Set the error bar at $\epsilon$:
$$\epsilon = \sqrt{\frac{8}{N}\ln\frac{4\left((2N)^{d_{\rm vc}}+1\right)}{\delta}}$$

Solve for $N$:
$$N = \frac{8}{\epsilon^2}\ln\frac{4\left((2N)^{d_{\rm vc}}+1\right)}{\delta} = O(d_{\rm vc}\ln N)$$

Example: $d_{\rm vc} = 3$; error bar $\epsilon = 0.1$; confidence 90% ($\delta = 0.1$).

A simple iterative method works well. Trying $N = 1000$ gives
$$N \approx \frac{8}{0.1^2}\,\ln\frac{4\left((2000)^3+1\right)}{0.1} \approx 21{,}192.$$
Continuing iteratively, $N$ converges to $\approx 30{,}000$. If $d_{\rm vc} = 4$, $N \approx 40{,}000$; for $d_{\rm vc} = 5$, $N \approx 50{,}000$.

($N \propto d_{\rm vc}$, but these are gross overestimates.)

Practical Rule of Thumb: $N = 10 \times d_{\rm vc}$
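The iteration on this slide is a simple fixed-point computation. A sketch (the function name and starting point are our choices):

```python
import math

def sample_complexity(dvc, eps, delta, n0=1000.0, iters=100):
    """Iterate N <- (8/eps^2) * ln(4*((2N)^dvc + 1)/delta) to a fixed point."""
    n = n0
    for _ in range(iters):
        n = (8 / eps ** 2) * math.log(4 * ((2 * n) ** dvc + 1) / delta)
    return n
```

Starting from N = 1000 with d_vc = 3, ε = 0.1, δ = 0.1, the first step lands near 21,000 and the iteration settles around 29,000 to 30,000, matching the slide; d_vc = 4 and 5 give roughly 40,000 and 50,000.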

SLIDE 10

Theory Versus Practice

The VC analysis allows us to reach outside the data for general H.

– a single parameter, d_vc, characterizes the complexity of H;
– d_vc depends only on H;
– E_in can reach outside D to E_out when d_vc is finite.

In Practice . . .

The VC bound is loose:
– the Hoeffding inequality is itself loose;
– m_H(N) is a worst-case number of dichotomies, not the average or likely case;
– the polynomial bound on m_H(N) is loose.

  • It is a good guide – models with small dvc are good.
  • Roughly 10 × dvc examples needed to get good generalization.

SLIDE 11

The Test Set

  • Another way to estimate Eout(g) is using a test set to obtain Etest(g).
  • Etest is better than Ein: you don’t pay the price for fitting.

You can use |H| = 1 in the Hoeffding bound with Etest.

  • Both a test and training set have variance.

The training set has optimistic bias due to selection – fitting the data. A test set has no bias.

The price for a test set is fewer training examples. (Why is this bad?)

E_test ≈ E_out, but now E_out itself may be bad: g was trained on fewer examples.

SLIDE 12

VC Bound Quantifies Approximation Versus Generalization

For pure approximation, the best H is H = {f}; but the chance that your chosen H contains f is nil: you are better off buying a lottery ticket.

d_vc ↑ ⟹ better chance of approximating f (E_in ≈ 0).
d_vc ↓ ⟹ better chance of generalizing out of sample (E_in ≈ E_out).

$$E_{\rm out} \le E_{\rm in} + \Omega(d_{\rm vc}).$$

VC analysis depends only on H.

Independent of f, P(x), learning algorithm.

SLIDE 13

Bias-Variance Analysis

Another way to quantify the tradeoff:

1. How well can the learning approximate f?

. . . as opposed to how well did the learning approximate f in-sample (Ein).

2. How close can you get to that approximation with a finite data set?

. . . as opposed to how close is Ein to Eout.

Bias-variance analysis applies to squared error (for both classification and regression). Bias-variance analysis can take the learning algorithm into account.

Different learning algorithms can have different Eout when applied to the same H!

SLIDE 14

A Simple Learning Problem

2 data points. Two hypothesis sets:
H0: h(x) = b
H1: h(x) = ax + b

(figures: the two models fit to a 2-point dataset; axes x and y)

SLIDE 15

Let’s Repeat the Experiment Many Times

(figures: the fits g^D from many repeated datasets, for H0 and H1; axes x and y)

For each data set D, you get a different g^D. So, for a fixed x, g^D(x) is a random value depending on D.

SLIDE 16

What’s Happening on Average

(figures: the average hypothesis ḡ(x) and the target sin(x), for H0 and H1; axes x and y)

We can define:

$g^D(x)$ ← a random value, depending on $D$

$$\bar g(x) = \mathbb E_D\big[g^D(x)\big] \approx \frac{1}{K}\big(g^{D_1}(x) + \cdots + g^{D_K}(x)\big) \quad\leftarrow\text{ your average prediction on } x$$

$$\mathrm{var}(x) = \mathbb E_D\big[(g^D(x) - \bar g(x))^2\big] = \mathbb E_D\big[g^D(x)^2\big] - \bar g(x)^2 \quad\leftarrow\text{ how variable is your prediction?}$$
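These definitions can be estimated by simulation. A minimal sketch for H0 on the sin target, assuming (as in the textbook example) f(x) = sin(πx) with x uniform on [−1, 1] and N = 2; the helper names are ours:

```python
import math
import random

random.seed(0)

def f(x):
    return math.sin(math.pi * x)

def fit_h0():
    """Learn from one dataset D of two points; for H0 the best constant is the mean."""
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    return (f(x1) + f(x2)) / 2            # g^D(x) = b for all x

K = 20000
bs = [fit_h0() for _ in range(K)]         # one g^D per dataset
g_bar = sum(bs) / K                       # average prediction (constant in x for H0)
var = sum((b - g_bar) ** 2 for b in bs) / K   # E_D[(g^D(x) - g_bar(x))^2]
```

Here g_bar comes out near 0 and var near 0.25, the H0 values used on the later slides.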

SLIDE 17

Eout on Test Point x for Data D

(figures: the squared error between g^D(x) and f(x) at a test point x, for H0 and H1)

$$E_{\rm out}^{D}(x) = \big(g^D(x) - f(x)\big)^2 \quad\leftarrow\text{ squared error, a random value depending on } D$$

$$E_{\rm out}(x) = \mathbb E_D\big[E_{\rm out}^{D}(x)\big] \quad\leftarrow\text{ expected } E_{\rm out}(x) \text{ before seeing } D$$

SLIDE 18

The Bias-Variance Decomposition

$$
\begin{aligned}
E_{\rm out}(x) &= \mathbb E_D\big[(g^D(x) - f(x))^2\big]\\
&= \mathbb E_D\big[g^D(x)^2 - 2\,g^D(x) f(x) + f(x)^2\big]\\
&= \mathbb E_D\big[g^D(x)^2\big] - 2\,\bar g(x) f(x) + f(x)^2 \quad\leftarrow\text{ understand this; the rest is just algebra}\\
&= \mathbb E_D\big[g^D(x)^2\big] - \bar g(x)^2 + \bar g(x)^2 - 2\,\bar g(x) f(x) + f(x)^2\\
&= \underbrace{\mathbb E_D\big[g^D(x)^2\big] - \bar g(x)^2}_{\mathrm{var}(x)} + \underbrace{\big(\bar g(x) - f(x)\big)^2}_{\mathrm{bias}(x)}
\end{aligned}
$$

$$E_{\rm out}(x) = \mathrm{bias}(x) + \mathrm{var}(x)$$

(figures: relative to f, a very small model has large bias and small var; a very large model has small bias and large var)

If you take the average over x: $E_{\rm out} = \mathrm{bias} + \mathrm{var}$.

SLIDE 19
Back to H0 and H1; and, our winner is . . .

(figures: the average hypothesis ḡ(x) and the target sin(x), for H0 and H1; axes x and y)

H0: bias = 0.50, var = 0.25 ⟹ E_out = 0.75.
H1: bias = 0.21, var = 1.69 ⟹ E_out = 1.90.
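These numbers can be reproduced by Monte Carlo, assuming the textbook setup (f(x) = sin(πx), x uniform on [−1, 1], two data points). A sketch; the averaging over the test point x is done analytically using E_x[x] = 0, E_x[x²] = 1/3, E_x[sin²(πx)] = 1/2, and E_x[x sin(πx)] = 1/π:

```python
import math
import random

random.seed(1)
K = 200000

a_s, b0_s, b1_s = [], [], []
for _ in range(K):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    y1, y2 = math.sin(math.pi * x1), math.sin(math.pi * x2)
    b0_s.append((y1 + y2) / 2)            # H0: best constant fit to the two points
    a = (y2 - y1) / (x2 - x1)             # H1: line through the two points
    a_s.append(a)
    b1_s.append(y1 - a * x1)

def mean_var(vals):
    m = sum(vals) / len(vals)
    return m, sum((v - m) ** 2 for v in vals) / len(vals)

mb0, vb0 = mean_var(b0_s)
ma, va = mean_var(a_s)
mb1, vb1 = mean_var(b1_s)

# H0: g_bar(x) = mb0, so bias = E_x[(mb0 - sin(pi x))^2] = mb0^2 + 1/2; var = Var(b).
bias_h0, var_h0 = mb0 ** 2 + 0.5, vb0
# H1: g_bar(x) = ma*x + mb1; the cross terms vanish since E_x[x] = 0.
bias_h1 = ma ** 2 / 3 + mb1 ** 2 + 0.5 - 2 * ma / math.pi
var_h1 = va / 3 + vb1
```

The estimates land near the slide's values: bias ≈ 0.50 and var ≈ 0.25 for H0, bias ≈ 0.21 and var ≈ 1.69 for H1.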

SLIDE 20

Match Learning Power to Data, . . . Not to f

2 Data Points:
(figures: fits, ḡ(x), and sin(x) for each model)
H0: bias = 0.50, var = 0.25 ⟹ E_out = 0.75.
H1: bias = 0.21, var = 1.69 ⟹ E_out = 1.90.

5 Data Points:
(figures: fits, ḡ(x), and sin(x) for each model)
H0: bias = 0.50, var = 0.1 ⟹ E_out = 0.6.
H1: bias = 0.21, var = 0.21 ⟹ E_out = 0.42.

SLIDE 21

Learning Curves: When Does the Balance Tip?

Simple Model versus Complex Model

(learning-curve figures: expected error versus number of data points N, with the E_in and E_out curves for a simple model and for a complex model)

$$E_{\rm out} = \mathbb E_x\big[E_{\rm out}(x)\big]$$

SLIDE 22

Decomposing The Learning Curve

VC Analysis versus Bias-Variance Analysis

(figures: the learning curve decomposed two ways; VC analysis splits the expected error into in-sample error plus generalization error, bias-variance analysis splits it into bias plus variance)

VC: pick H that can generalize and has a good chance to fit the data.
Bias-variance: pick (H, A) to approximate f and not behave wildly after seeing the data.
