SLIDE 1
Class 2 & 3: Overfitting & Regularization
Carlo Ciliberto, Department of Computer Science, UCL. October 18, 2017
SLIDE 2 Last Class
The goal of Statistical Learning Theory is to find a "good" estimator $f_n : X \to Y$, approximating the lowest expected risk
$$\inf_{f : X \to Y} E(f), \qquad E(f) = \int \ell(f(x), y) \, d\rho(x, y),$$
given only a finite number of (training) examples $(x_i, y_i)_{i=1}^n$ sampled independently from the unknown distribution $\rho$.
SLIDE 3
Last Class: The SLT Wishlist
What does "good" estimator mean? Low excess risk $E(f_n) - E(f^*)$.
◮ Consistency. Does $E(f_n) - E(f^*) \to 0$ as $n \to +\infty$
– in expectation? – in probability?
with respect to a training set $S = (x_i, y_i)_{i=1}^n$ of points randomly sampled from $\rho$.
◮ Learning Rates. How “fast” is consistency achieved?
Nonasymptotic bounds: finite sample complexity, tail bounds, error bounds...
SLIDE 4 Last Class (Expected Vs Empirical Risk)
Approximate the expected risk of $f : X \to Y$ via its empirical risk
$$E_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$$
◮ Expectation:
$$\mathbb{E}\,|E_n(f) - E(f)| \le \sqrt{\frac{V_f}{n}}$$
◮ Probability (e.g. using Chebyshev):
$$P\big(|E_n(f) - E(f)| \ge \epsilon\big) \le \frac{V_f}{n\epsilon^2} \quad \forall \epsilon > 0$$
where $V_f = \mathrm{Var}_{(x,y)\sim\rho}\big(\ell(f(x), y)\big)$.
SLIDE 5
Last Class (Empirical Risk Minimization)
Idea: if $E_n$ is a good approximation to $E$, then we could use
$$f_n = \operatorname*{argmin}_{f \in F} \; E_n(f)$$
to approximate $f^*$. This is known as empirical risk minimization (ERM).
Note: if we sample the points in $S = (x_i, y_i)_{i=1}^n$ independently from $\rho$, the corresponding $f_n = f_S$ is a random variable and we have
$$\mathbb{E}\big[E(f_n) - E(f^*)\big] \le \mathbb{E}\big[E(f_n) - E_n(f_n)\big].$$
Question: does $\mathbb{E}\big[E(f_n) - E_n(f_n)\big]$ go to zero as $n$ increases?
SLIDE 6 Issues with ERM
Assume $X = Y = \mathbb{R}$, $\rho$ with dense support¹ and $\ell(y, y) = 0$ $\forall y \in Y$. For any set $(x_i, y_i)_{i=1}^n$ s.t. $x_i \ne x_j$ $\forall i \ne j$, let $f_n : X \to Y$ be such that
$$f_n(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i \in \{1, \dots, n\} \\ 0 & \text{otherwise.} \end{cases}$$
Then, for any number $n$ of training points:
◮ $\mathbb{E}\big[E_n(f_n)\big] = 0$
◮ $\mathbb{E}\big[E(f_n)\big] = E(0)$, which is greater than zero (unless $f^* \equiv 0$)
Therefore $\mathbb{E}\big[E(f_n) - E_n(f_n)\big] = E(0) \not\to 0$ as $n$ increases!
¹and such that every pair $(x, y)$ has measure zero according to $\rho$
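As a concrete illustration (not from the slides), here is a minimal sketch of this memorizing estimator on an assumed synthetic distribution $\rho$: its empirical risk is exactly zero, while a large held-out sample approximates the expected risk $E(0) > 0$.

```python
# Minimal sketch of the "memorizing" estimator above; the data-generating
# distribution rho (uniform inputs, noisy sine target) is an assumption.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=n)
    return x, y

def f_n(x_query, x_train, y_train):
    # Return y_i if x_query coincides with a training point x_i, and 0 otherwise.
    out = np.zeros_like(x_query)
    for xi, yi in zip(x_train, y_train):
        out[x_query == xi] = yi
    return out

def risk(pred, y):
    return np.mean((pred - y) ** 2)  # squared loss, so l(y, y) = 0

x_tr, y_tr = sample(200)
x_te, y_te = sample(100_000)  # large sample approximates the expected risk

print("empirical risk:", risk(f_n(x_tr, x_tr, y_tr), y_tr))  # exactly 0
print("expected risk ~", risk(f_n(x_te, x_tr, y_tr), y_te))  # ~ E(0) > 0
```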
SLIDE 7
Overfitting
An estimator fn is said to overfit the training data if for any n ∈ N:
◮ $\mathbb{E}\big[E(f_n) - E(f^*)\big] > C$ for a constant $C > 0$, and
◮ $\mathbb{E}\big[E_n(f_n) - E_n(f^*)\big] \le 0$
According to this definition ERM overfits...
SLIDE 8 ERM on Finite Hypotheses Spaces
Is ERM hopeless? Consider the case of $X$ and $Y$ finite. Then $F = Y^X = \{f : X \to Y\}$ is finite as well (albeit possibly large), and therefore:
$$\mathbb{E}\,|E_n(f_n) - E(f_n)| \le \mathbb{E} \sup_{f \in F} |E_n(f) - E(f)| \le \sum_{f \in F} \mathbb{E}\,|E_n(f) - E(f)| \le |F| \sqrt{\frac{V_F}{n}}$$
where $V_F = \sup_{f \in F} V_f$ and $|F|$ denotes the cardinality of $F$. Then ERM works! Namely: $\lim_{n \to +\infty} \mathbb{E}\,|E(f_n) - E(f^*)| = 0$.
SLIDE 9 ERM on Finite Hypotheses (Sub) Spaces
The same argument holds in general: let $H \subset F$ be a finite space of candidate hypotheses. Then
$$\mathbb{E}\,|E_n(f_n) - E(f_n)| \le |H| \sqrt{\frac{V_H}{n}}.$$
In particular, if $f^* \in H$, then
$$\mathbb{E}\,|E(f_n) - E(f^*)| \le |H| \sqrt{\frac{V_H}{n}}$$
and ERM is a good estimator for the problem considered.
SLIDE 10 Example: Threshold functions
Consider a binary classification problem $Y = \{0, 1\}$. Someone has told us that the minimizer of the risk is a "threshold function" $f_{a^*}(x) = 1_{[a^*, +\infty)}(x)$ with $a^* \in [-1, 1]$.
[Figure: two threshold functions with thresholds $a$ and $b$ on the real line]
We can learn on $H = \{f_a \mid a \in [-1, 1]\}$. However, on a computer we can only represent real numbers up to a given precision.
SLIDE 11 Example: Threshold Functions (with precision p)
Discretization: given a $p > 0$, we can consider
$$H_p = \{f_a \mid a \in [-1, 1],\; a \cdot 10^p = [a \cdot 10^p]\}$$
with $[a]$ denoting the integer part (i.e. the closest integer) of a scalar $a$. The value $p$ can be interpreted as the "precision" of our space of functions $H_p$. Note that $|H_p| = 2 \cdot 10^p$. If $f^* \in H_p$, then we automatically have that
$$\mathbb{E}\,|E(f_n) - E(f^*)| \le |H_p| \sqrt{\frac{V_{H_p}}{n}}$$
($V_{H_p} \le 1$ since $\ell$ is the 0-1 loss and therefore $|\ell(f(x), y)| \le 1$ for any $f \in H_p$)
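As an illustration of ERM over $H_p$ by brute-force enumeration of the grid thresholds, here is a minimal sketch; the data-generating process (a noise-free threshold rule at a hypothetical $a^* = 0.3$) is an assumption, not part of the slides.

```python
# Sketch: ERM over the discretized threshold space H_p (assumed synthetic data).
import numpy as np

rng = np.random.default_rng(0)
a_star, n, p = 0.3, 500, 2                      # hypothetical threshold, sample size, precision

x = rng.uniform(-1, 1, n)
y = (x >= a_star).astype(int)                   # labels from the true threshold rule

grid = np.arange(-10**p, 10**p + 1) / 10**p     # all thresholds representable with precision p
emp_risk = lambda a: np.mean((x >= a).astype(int) != y)   # 0-1 empirical risk

risks = np.array([emp_risk(a) for a in grid])
a_n = grid[int(np.argmin(risks))]
print(f"ERM threshold a_n = {a_n:+.2f}, empirical risk = {risks.min():.3f}")
```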
SLIDE 12
Rates in Expectation Vs Probability
In practice, even for small values of $p$, the bound $\mathbb{E}\,|E(f_n) - E(f^*)| \le 10^p / \sqrt{n}$ requires a very large $n$ in order to be meaningful. Interestingly, we can get much better constants (not rates though!) by working in probability...
SLIDE 13 Hoeffding’s Inequality
Let $X_1, \dots, X_n$ be independent random variables s.t. $X_i \in [a_i, b_i]$. Let $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$. Then, for any $\epsilon > 0$,
$$P\big(|\bar{X} - \mathbb{E}\bar{X}| \ge \epsilon\big) \le 2 \exp\left(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
SLIDE 14
Applying Hoeffding’s inequality
Assume that $\forall f \in H$, $x \in X$, $y \in Y$ the loss is bounded, $|\ell(f(x), y)| \le M$, by some constant $M > 0$. Then, for any $f \in H$ we have
$$P\big(|E_n(f) - E(f)| \ge \epsilon\big) \le 2 \exp\left(-\frac{n \epsilon^2}{2 M^2}\right)$$
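As a quick sanity check of this tail bound, the following sketch simulates the empirical risk of a fixed classifier under the 0-1 loss (so $M = 1$); the true risk value 0.3 is an arbitrary assumption.

```python
# Sketch: Monte Carlo check of the Hoeffding tail bound for a fixed f with 0-1 loss.
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 200, 0.1, 20_000
true_risk = 0.3                                   # hypothetical E(f)

losses = rng.random((trials, n)) < true_risk      # i.i.d. 0-1 losses, one row per trial
deviations = np.abs(losses.mean(axis=1) - true_risk)

empirical_tail = np.mean(deviations >= eps)
hoeffding_bound = 2 * np.exp(-n * eps**2 / (2 * 1**2))
print(f"P(|E_n(f) - E(f)| >= {eps}) ~ {empirical_tail:.4f}  <=  bound {hoeffding_bound:.4f}")
```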
SLIDE 15 Controlling the Generalization Error
We would like to control the generalization error $E_n(f_n) - E(f_n)$ of our estimator in probability. One possible way to do that is by controlling the generalization error of the whole set $H$:
$$P\big(|E_n(f_n) - E(f_n)| \ge \epsilon\big) \le P\left(\sup_{f \in H} |E_n(f) - E(f)| \ge \epsilon\right)$$
The latter term is the probability that at least one of the events $|E_n(f) - E(f)| \ge \epsilon$ occurs for $f \in H$; in other words, the probability of the union of such events. Therefore
$$P\left(\sup_{f \in H} |E_n(f) - E(f)| \ge \epsilon\right) \le \sum_{f \in H} P\big(|E_n(f) - E(f)| \ge \epsilon\big)$$
by the so-called union bound.
SLIDE 16 Hoeffding the Generalization Error
By applying Hoeffding's inequality,
$$P\big(|E_n(f_n) - E(f_n)| \ge \epsilon\big) \le 2 |H| \exp\left(-\frac{n \epsilon^2}{2 M^2}\right)$$
Or, equivalently, for any $\delta \in (0, 1]$,
$$|E_n(f_n) - E(f_n)| \le M \sqrt{\frac{2 \log\frac{2|H|}{\delta}}{n}}$$
with probability at least $1 - \delta$.
SLIDE 17 Example: Threshold Functions (in Probability)
Going back to the $H_p$ space of threshold functions...
$$|E_n(f_n) - E(f_n)| \le \sqrt{\frac{2\big(2 + 3p + \log\frac{1}{\delta}\big)}{n}}$$
since $M = 1$ and $\log 2|H_p| = \log(4 \cdot 10^p) = \log 4 + p \log 10 \le 2 + 3p$. For example, let $\delta = 0.001$. We can say that
$$|E_n(f_n) - E(f_n)| \le \sqrt{\frac{18 + 6p}{n}}$$
holds at least 99.9% of the time.
SLIDE 18 Bounds in Expectation Vs Probability
Comparing the two bounds:
$$\mathbb{E}\,|E_n(f_n) - E(f_n)| \le \frac{10^p}{\sqrt{n}} \qquad \text{(Expectation)}$$
while, with probability greater than 99.9%,
$$|E_n(f_n) - E(f_n)| \le \sqrt{\frac{18 + 6p}{n}} \qquad \text{(Probability)}$$
Although we cannot be 100% sure of it, we can be quite confident that the generalization error will be much smaller than what the bound in expectation tells us...
Rates: note however that the rates of convergence to 0 are the same (i.e. $O(1/\sqrt{n})$).
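A small numerical comparison of the two bounds (the probability-bound constant $\sqrt{(18 + 6p)/n}$ follows the reconstruction above, so treat it as indicative only):

```python
# Sketch: same O(1/sqrt(n)) rate, very different constants.
import numpy as np

p = 3
for n in [10**3, 10**6, 10**9]:
    expectation_bound = 10**p / np.sqrt(n)
    probability_bound = np.sqrt((18 + 6 * p) / n)   # holds with probability >= 99.9%
    print(f"n = {n:>10}:  expectation {expectation_bound:10.4f}   probability {probability_bound:10.4f}")
```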
SLIDE 19 Improving the bound in Expectation
Exploiting the bound in probability and the knowledge that on $H_p$ the excess risk is bounded by a constant, we can improve the bound in expectation... Let $X$ be a random variable s.t. $|X| < M$ for some constant $M > 0$. Then, for any $\epsilon > 0$ we have
$$\mathbb{E}\,|X| \le \epsilon\, P(|X| \le \epsilon) + M\, P(|X| > \epsilon)$$
Applying this to our problem: for any $\delta \in (0, 1]$,
$$\mathbb{E}\,|E_n(f_n) - E(f_n)| \le (1 - \delta)\, M \sqrt{\frac{2 \log\frac{2|H_p|}{\delta}}{n}} + \delta M$$
Therefore only $\log |H_p|$ appears (no $|H_p|$ alone).
SLIDE 20 Infinite Hypotheses Spaces
What if $f^* \in H \setminus H_p$ for any $p > 0$? ERM on $H_p$ will never minimize the expected risk: there will always be a gap $E(f_{n,p}) - E(f^*) > 0$. For $p \to +\infty$ it is natural to expect such a gap to decrease... BUT if $p$ increases too fast (with respect to the number $n$ of examples) we cannot control the generalization error anymore!
$$|E_n(f_n) - E(f_n)| \le M \sqrt{\frac{2 \log\frac{2|H_p|}{\delta}}{n}} \to +\infty \quad \text{for } p \to +\infty$$
Therefore we need to increase $p$ gradually, as a function $p(n)$ of the number of training examples. This approach is known as regularization.
SLIDE 21 Approximation Error for Threshold Functions
Let's consider $f_p = 1_{[a_p, +\infty)} = \operatorname*{argmin}_{f \in H_p} E(f)$ with $a_p \in [-1, 1]$. Consider the error decomposition of the excess risk $E(f_n) - E(f^*)$:
$$\big[E(f_n) - E_n(f_n)\big] + \big[E_n(f_n) - E_n(f_p)\big] + \big[E_n(f_p) - E(f_p)\big] + \big[E(f_p) - E(f^*)\big]$$
We already know how to control the generalization of $f_n$ (via the supremum over $H_p$) and of $f_p$ (since it is a single function). Moreover, we have that the approximation error is
$$E(f_p) - E(f^*) \le |a_p - a^*| \le 10^{-p} \quad \text{(why?)}$$
Note that it does not depend on the training data!
SLIDE 22 Approximation Error for Threshold Functions II
Putting everything together we have that, for any $\delta \in [0, 1)$ and $p \ge 0$,
$$E(f_n) - E(f^*) \le 2 \sqrt{\frac{2 \log\frac{2|H_p|}{\delta}}{n}} + 10^{-p} = \phi(n, \delta, p)$$
holds with probability greater than or equal to $1 - \delta$. In particular, for any $n$ and $\delta$, we can choose the best precision as
$$p(n, \delta) = \operatorname*{argmin}_{p \ge 0} \; \phi(n, \delta, p)$$
which leads to an error bound $\epsilon(n, \delta) = \phi(n, \delta, p(n, \delta))$ holding with probability greater than or equal to $1 - \delta$.
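A sketch of this selection step, scanning integer precisions $p$ and minimizing $\phi$ numerically; the exact constants of $\phi$ below follow the reconstruction above and are therefore an assumption.

```python
# Sketch: choosing the precision p(n, delta) that minimizes phi(n, delta, p).
import numpy as np

def phi(n, delta, p):
    H_p = 2 * 10.0**p                                     # |H_p|
    gen = 2 * np.sqrt(2 * np.log(2 * H_p / delta) / n)    # generalization part (M = 1)
    return gen + 10.0**(-p)                               # + approximation part

delta = 0.001
for n in [10**2, 10**4, 10**6]:
    ps = np.arange(0, 10)
    vals = [phi(n, delta, p) for p in ps]
    best = int(ps[int(np.argmin(vals))])
    print(f"n = {n:>8}: best p = {best}, bound = {min(vals):.3f}")
```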
SLIDE 23 Regularization
Most hypotheses spaces are "too" large and therefore prone to overfitting. Regularization is the process of controlling the "freedom" of an estimator as a function of the number of training examples.
Idea. Parametrize $H$ as a union $H = \cup_{\gamma > 0} H_\gamma$ of hypotheses spaces $H_\gamma$ that are not prone to overfitting (e.g. finite spaces). $\gamma$ is known as the regularization parameter (e.g. the precision $p$ in our examples). Assume $H_\gamma \subset H_{\gamma'}$ if $\gamma \le \gamma'$.
Regularization Algorithm. Given $n$ training points, find an estimator $f_{\gamma,n}$ on $H_\gamma$ (e.g. ERM on $H_\gamma$). Let $\gamma = \gamma(n)$ increase as $n \to +\infty$.
SLIDE 24 Regularization and Decomposition of the Excess Risk
Let $\gamma > 0$ and $f_\gamma = \operatorname*{argmin}_{f \in H_\gamma} E(f)$. We can decompose the excess risk $E(f_{\gamma,n}) - E(f^*)$ as
$$\big[E(f_{\gamma,n}) - E(f_\gamma)\big] + \Big[E(f_\gamma) - \inf_{f \in H} E(f)\Big] + \Big[\inf_{f \in H} E(f) - E(f^*)\Big]$$
SLIDE 25
Irreducible Error
$$\inf_{f \in H} E(f) - E(f^*)$$
Recall: H is the “largest” possible Hypotheses space we are considering. If the irreducible error is zero, H is called universal (e.g. the RKHS induced by the Gaussian kernel is a universal Hypotheses space).
SLIDE 26
Approximation Error
$$E(f_\gamma) - \inf_{f \in H} E(f)$$
◮ Does not depend on the dataset (deterministic).
◮ Does depend on the distribution $\rho$.
◮ Also referred to as bias.
SLIDE 27
Convergence of the Approximation Error
Under mild assumptions,
$$\lim_{\gamma \to +\infty} E(f_\gamma) - \inf_{f \in H} E(f) = 0$$
SLIDE 28
Density Results
$$\lim_{\gamma \to +\infty} E(f_\gamma) - E(f^*) = 0$$
follows from combining:
◮ convergence of the approximation error,
+
◮ a universal hypotheses space.
Note: it corresponds to a density property of the space $H$ in $F = \{f : X \to Y\}$.
SLIDE 29
Approximation error bounds
$$E(f_\gamma) - \inf_{f \in H} E(f) \le A(\rho, \gamma)$$
◮ No rates without assumptions – related to the so-called No Free Lunch Theorem.
◮ Studied in Approximation Theory using tools such as Kolmogorov n-width, K-functionals, interpolation spaces...
Prototypical result: if $f^*$ has "smoothness"² $s$, then $A(\rho, \gamma) = c\gamma^{-s}$.
²Some abstract notion of regularity parametrizing the class of target functions. Typical example: $f^*$ in a Sobolev space $W^{s,2}$.
SLIDE 30
Sample Error
$$E(f_{\gamma,n}) - E(f_\gamma)$$
A random quantity depending on the data. Two main ways to study it:
◮ Capacity/complexity estimates on $H_\gamma$.
◮ Stability.
SLIDE 31 Sample Error Decomposition
We have seen how to decompose the sample error $E(f_{\gamma,n}) - E(f_\gamma)$ into
$$\big[E(f_{\gamma,n}) - E_n(f_{\gamma,n})\big] + \big[E_n(f_{\gamma,n}) - E_n(f_\gamma)\big] + \big[E_n(f_\gamma) - E(f_\gamma)\big]$$
SLIDE 32 Generalization Error(s)
As we have observed, $E(f_{\gamma,n}) - E_n(f_{\gamma,n})$ and $E_n(f_\gamma) - E(f_\gamma)$ can be controlled by studying the empirical process
$$\sup_{f \in H_\gamma} |E_n(f) - E(f)|$$
Example: we have already observed that for a finite space $H_\gamma$,
$$P\left(\sup_{f \in H_\gamma} |E_n(f) - E(f)| \ge \epsilon\right) \le 2 |H_\gamma| \exp\left(-\frac{n \epsilon^2}{2 M^2}\right)$$
SLIDE 33
ERM on Finite Spaces and Computational Efficiency
The strategy used for threshold functions can be generalized to any $H$ for which it is possible to find a finite discretization $H_p$ with respect to the $L^1(X, \rho_X)$ norm (e.g. $H$ compact with respect to such a norm). However, in general, it could be computationally very expensive to find the empirical risk minimizer on a discretization $H_p$, since in principle it could be necessary to evaluate $E_n(f)$ for every $f \in H_p$. As it turns out, ERM on, e.g., convex (thus dense) spaces is often much more amenable to computation, but we have observed that on infinite hypotheses spaces it is difficult to control the generalization error. Interestingly, we can leverage the discretization argument to control the generalization error of ERM also for special dense hypotheses spaces.
SLIDE 34 Risks for Continuous functions
Let $X \subset \mathbb{R}^d$ be a compact space and $C(X)$ be the space of continuous functions. Let $\|\cdot\|_\infty$ be defined for any $f \in C(X)$ as $\|f\|_\infty = \sup_{x \in X} |f(x)|$. If the loss function $\ell : Y \times Y \to \mathbb{R}$ is such that $\ell(\cdot, y)$ is uniformly Lipschitz with constant $C > 0$ for any $y \in Y$, we have that
1) $|E(f_1) - E(f_2)| \le C \|f_1 - f_2\|_{L^1(X, \rho_X)} \le C \|f_1 - f_2\|_\infty$, and
2) $|E_n(f_1) - E_n(f_2)| \le \frac{1}{n} \sum_{i=1}^n |\ell(f_1(x_i), y_i) - \ell(f_2(x_i), y_i)| \le C \|f_1 - f_2\|_\infty$.
Therefore, functions that are "close" in $\|\cdot\|_\infty$ will have similar expected and empirical risks!
SLIDE 35 Compact Spaces in C(X)
Idea. If $H \subset C(X)$ admits a finite discretization $H_p = \{h_1, \dots, h_N\}$ of precision $p$ with respect to $\|\cdot\|_\infty$ (e.g. $H$ is compact with respect to $\|\cdot\|_\infty$), then we can control the generalization error over it as
$$\sup_{f \in H} |E_n(f) - E(f)| \le \sup_{f \in H} \Big( |E_n(f) - E_n(h_f)| + |E_n(h_f) - E(h_f)| + |E(h_f) - E(f)| \Big) \le 2L \cdot 10^{-p} + \sup_{h \in H_p} |E_n(h) - E(h)|$$
where we have denoted $h_f = \operatorname*{argmin}_{h \in H_p} \|h - f\|_\infty$.
Note. We know how to control $\sup_{h \in H_p} |E_n(h) - E(h)|$ since $H_p$ is finite!
SLIDE 36 Covering numbers
We define the covering number of $H$ of radius $\eta > 0$ as the cardinality of a minimal cover of $H$ with balls of radius $\eta$:
$$N(H, \eta) = \inf\Big\{m \;\Big|\; H \subseteq \bigcup_{i=1}^m B_\eta(h_i), \; h_i \in H\Big\}$$
Image credits: Lorenzo Rosasco.
Example: if $H = B_R(0)$ is a ball of radius $R$ in $\mathbb{R}^d$, then $N(B_R(0), \eta) = (4R/\eta)^d$.
SLIDE 37 Example: Covering numbers (continued)
Putting the two together, we have that for any $\delta \in [0, 1)$,
$$\sup_{f \in H} |E_n(f) - E(f)| \le 2L\eta + M \sqrt{\frac{2 \log\frac{2 N(H, \eta)}{\delta}}{n}}$$
holds with probability at least $1 - \delta$. For $\eta \to 0$ the covering number $N(H, \eta) \to +\infty$; however, for fixed $\eta$, the second term tends to zero as $n \to +\infty$. It is typically possible to show that there exists an $\eta(n) \to 0$ for which the whole bound tends to zero as $n \to +\infty$.
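The following sketch makes this trade-off concrete for a hypothetical class $H = B_R(0) \subset \mathbb{R}^d$, plugging the covering number $(4R/\eta)^d$ from the previous slide into the (reconstructed) bound above and scanning $\eta$; all constants are assumptions chosen for illustration.

```python
# Sketch: trading off the discretization term 2*L*eta against the estimation term.
import numpy as np

L, M, R, d, delta = 1.0, 1.0, 1.0, 5, 0.01        # hypothetical constants

def bound(eta, n):
    N = (4 * R / eta) ** d                         # covering number of B_R(0)
    return 2 * L * eta + M * np.sqrt(2 * np.log(2 * N / delta) / n)

etas = np.logspace(-4, 0, 200)
for n in [10**3, 10**5, 10**7]:
    vals = bound(etas, n)
    i = int(np.argmin(vals))
    print(f"n = {n:>9}: best eta ~ {etas[i]:.4f}, bound ~ {vals[i]:.3f}")
```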
SLIDE 38
Complexity Measures
In general, the error
$$\sup_{f \in H_\gamma} |E_n(f) - E(f)|$$
can be controlled via capacity/complexity measures:
◮ covering numbers,
◮ combinatorial dimensions, e.g. VC-dimension, fat-shattering dimension,
◮ Rademacher complexities,
◮ Gaussian complexities,
◮ ...
SLIDE 39 Prototypical Results
A prototypical result (under suitable assumptions, e.g. regularity of $f^*$):
$$E(f_{\gamma,n}) - E(f^*) \le \underbrace{E(f_{\gamma,n}) - E(f_\gamma)}_{\text{(Variance)}} + \underbrace{E(f_\gamma) - E(f^*)}_{\text{(Bias)}}$$
Goal: find the $\gamma(n)$ achieving the best bias-variance trade-off.
SLIDE 40
Choosing γ(n) in practice
The best $\gamma(n)$ depends on the unknown distribution $\rho$. So how can we choose this parameter in practice? This problem is known as model selection. Possible approaches:
◮ cross validation,
◮ complexity regularization / structural risk minimization,
◮ balancing principles,
◮ ...
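As a concrete instance of the first approach, here is a hold-out cross-validation sketch for choosing the regularization parameter, reusing the threshold class $H_p$ from earlier (so $\gamma = p$); the synthetic data distribution and label-noise level are assumptions.

```python
# Sketch: model selection by hold-out validation over candidate precisions p.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, a_star=0.3, noise=0.1):
    x = rng.uniform(-1, 1, n)
    y = (x >= a_star).astype(int)
    flip = rng.random(n) < noise                   # label noise
    return x, np.where(flip, 1 - y, y)

def erm_threshold(x, y, p):
    grid = np.arange(-10**p, 10**p + 1) / 10**p
    errs = [np.mean((x >= a).astype(int) != y) for a in grid]
    return grid[int(np.argmin(errs))]

x, y = sample(400)
x_tr, y_tr, x_val, y_val = x[:300], y[:300], x[300:], y[300:]

best_p, best_err = None, np.inf
for p in range(0, 4):                              # candidate regularization parameters
    a_hat = erm_threshold(x_tr, y_tr, p)
    val_err = np.mean((x_val >= a_hat).astype(int) != y_val)
    print(f"p = {p}: threshold = {a_hat:+.3f}, validation error = {val_err:.3f}")
    if val_err < best_err:
        best_p, best_err = p, val_err
print(f"selected p = {best_p}")
```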
SLIDE 41
Abstract Regularization
We just got our first introduction to the concept of regularization: controlling the expressiveness of the hypotheses space according to the number of training examples, in order to guarantee good prediction performance and consistency. There are many ways to implement this strategy in practice (we will see some of them in this course):
◮ Tikhonov (and Ivanov) regularization
◮ Spectral filtering
◮ Early stopping
◮ Random sampling
◮ ...
SLIDE 42
Wrapping Up
This class:
◮ Overfitting
◮ Controlling the generalization error
◮ Abstract regularization
Next class: Tikhonov Regularization