

1. Class 2 & 3: Overfitting & Regularization
Carlo Ciliberto, Department of Computer Science, UCL. October 18, 2017.

2. Last Class
The goal of Statistical Learning Theory is to find a "good" estimator $f_n : X \to Y$, approximating the lowest expected risk $\inf_{f : X \to Y} \mathcal{E}(f)$, where
$\mathcal{E}(f) = \int_{X \times Y} \ell(f(x), y) \, d\rho(x, y)$,
given only a finite number of (training) examples $(x_i, y_i)_{i=1}^n$ sampled independently from the unknown distribution $\rho$.

3. Last Class: The SLT Wishlist
What does "good" estimator mean? Low excess risk $\mathcal{E}(f_n) - \mathcal{E}(f^*)$.
◮ Consistency. Does $\mathcal{E}(f_n) - \mathcal{E}(f^*) \to 0$ as $n \to +\infty$ – in expectation? – in probability? with respect to a training set $S = (x_i, y_i)_{i=1}^n$ of points randomly sampled from $\rho$.
◮ Learning rates. How "fast" is consistency achieved? Nonasymptotic bounds: finite sample complexity, tail bounds, error bounds...

4. Last Class (Expected vs Empirical Risk)
Approximate the expected risk of $f : X \to Y$ via its empirical risk
$\mathcal{E}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$
◮ Expectation: $\mathbb{E}\,|\mathcal{E}_n(f) - \mathcal{E}(f)| \le \sqrt{V_f / n}$
◮ Probability (e.g. using Chebyshev): $\mathbb{P}(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon) \le \frac{V_f}{n \epsilon^2}$ for all $\epsilon > 0$
where $V_f = \mathrm{Var}_{(x, y) \sim \rho}(\ell(f(x), y))$.
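A minimal numerical sketch of the statement above. The distribution $\rho$, the squared loss, and the fixed hypothesis $f$ below are illustrative assumptions (not from the slides); the point is only to compare the observed spread of $\mathcal{E}_n(f)$ around $\mathcal{E}(f)$ with the Chebyshev bound $V_f / (n\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative distribution rho: x ~ Uniform[0, 1], y = x + Gaussian noise.
def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = x + 0.1 * rng.normal(size=n)
    return x, y

f = lambda x: np.full_like(x, 0.5)        # a fixed hypothesis (not learned from data)
loss = lambda fx, y: (fx - y) ** 2        # squared loss

n, eps, trials = 50, 0.05, 20_000

# Monte Carlo proxies for the expected risk E(f) and the variance V_f = Var(l(f(x), y)).
x_big, y_big = sample(1_000_000)
ell = loss(f(x_big), y_big)
E_f, V_f = ell.mean(), ell.var()

# Empirical risks E_n(f) over many independent training sets of size n.
E_n = []
for _ in range(trials):
    x, y = sample(n)
    E_n.append(loss(f(x), y).mean())
E_n = np.array(E_n)

print("observed  P(|E_n(f) - E(f)| >= eps):", np.mean(np.abs(E_n - E_f) >= eps))
print("Chebyshev bound V_f / (n * eps^2):  ", V_f / (n * eps ** 2))
```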

  5. Last Class (Empirical Risk Minimization) Idea : if E n is a good approximation to E , then we could use E n ( f ) f n = argmin f ∈F to approximate f ∗ . This is known as empirical risk minimization (ERM) Note : If we sample the points in S = ( x i , y i ) n i =1 independently from ρ , the corresponding f n = f S is a random variable and we have E E ( f n ) − E ( f ∗ ) ≤ E E ( f n ) − E n ( f n ) Question : does E E ( f n ) − E n ( f n ) go to zero as n increases?
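A minimal sketch of ERM over a finite candidate set: $f_n$ is simply the hypothesis with the smallest empirical risk on the training sample. The finite class of constant predictors, the squared loss, and the synthetic data below are illustrative choices, not from the slides.

```python
import numpy as np

def erm(hypotheses, loss, x, y):
    """Return the hypothesis with the smallest empirical risk E_n on the sample (x, y)."""
    risks = [np.mean(loss(h(x), y)) for h in hypotheses]
    best = int(np.argmin(risks))
    return hypotheses[best], risks[best]

# Illustrative finite class: constant predictors f_c(x) = c on a grid of values c.
hypotheses = [(lambda x, c=c: np.full_like(x, c)) for c in np.linspace(-1.0, 1.0, 21)]
loss = lambda fx, y: (fx - y) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 30)
y = 0.3 + 0.1 * rng.normal(size=30)       # the best constant predictor is close to 0.3
f_n, emp_risk = erm(hypotheses, loss, x, y)
print("empirical risk E_n(f_n):", emp_risk)
```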

6. Issues with ERM
Assume $X = Y = \mathbb{R}$, $\rho$ with dense support¹ and $\ell(y, y) = 0$ for all $y \in Y$. For any set $(x_i, y_i)_{i=1}^n$ such that $x_i \neq x_j$ for all $i \neq j$, let $f_n : X \to Y$ be such that
$f_n(x) = y_i$ if $x = x_i$ for some $i \in \{1, \dots, n\}$, and $f_n(x) = 0$ otherwise.
Then, for any number $n$ of training points:
◮ $\mathbb{E}\,\mathcal{E}_n(f_n) = 0$
◮ $\mathbb{E}\,\mathcal{E}(f_n) = \mathcal{E}(0)$, which is greater than zero (unless $f^* \equiv 0$)
Therefore $\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)] = \mathcal{E}(0) \not\to 0$ as $n$ increases!
¹ and such that every pair $(x, y)$ has measure zero according to $\rho$
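A short sketch of the memorizing estimator above, under an illustrative choice of $\rho$ and the squared loss: it reproduces the training labels exactly (zero empirical risk) but predicts 0 everywhere else, so its expected risk stays near $\mathcal{E}(0)$ no matter how large $n$ is.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):                             # illustrative rho: x ~ N(0, 1), y = sin(x)
    x = rng.normal(size=n)
    return x, np.sin(x)

def memorizer(x_train, y_train):
    """f_n(x) = y_i if x equals a training point x_i, and 0 otherwise."""
    lookup = dict(zip(x_train.tolist(), y_train.tolist()))
    return lambda x: np.array([lookup.get(v, 0.0) for v in x.tolist()])

x_tr, y_tr = sample(200)
f_n = memorizer(x_tr, y_tr)

emp_risk = np.mean((f_n(x_tr) - y_tr) ** 2)   # exactly 0: the training labels are reproduced
x_te, y_te = sample(100_000)                  # fresh points, almost surely never seen before
exp_risk = np.mean((f_n(x_te) - y_te) ** 2)   # ~ E(0) = E[y^2]; does not shrink as n grows
print("E_n(f_n) =", emp_risk, "  E(f_n) ~=", exp_risk)
```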

7. Overfitting
An estimator $f_n$ is said to overfit the training data if for any $n \in \mathbb{N}$:
◮ $\mathbb{E}\,\mathcal{E}(f_n) - \mathcal{E}(f^*) > C$ for a constant $C > 0$, and
◮ $\mathbb{E}\,[\mathcal{E}_n(f_n) - \mathcal{E}_n(f^*)] \le 0$
According to this definition, ERM overfits...

8. ERM on Finite Hypothesis Spaces
Is ERM hopeless? Consider the case where $X$ and $Y$ are finite. Then $\mathcal{F} = Y^X = \{f : X \to Y\}$ is finite as well (albeit possibly large), and therefore:
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \mathbb{E}\,\sup_{f \in \mathcal{F}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le \sum_{f \in \mathcal{F}} \mathbb{E}\,|\mathcal{E}_n(f) - \mathcal{E}(f)| \le |\mathcal{F}| \sqrt{V_\mathcal{F}/n}$
where $V_\mathcal{F} = \sup_{f \in \mathcal{F}} V_f$ and $|\mathcal{F}|$ denotes the cardinality of $\mathcal{F}$.
Then ERM works! Namely: $\lim_{n \to +\infty} \mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| = 0$

9. ERM on Finite Hypothesis (Sub)Spaces
The same argument holds in general: let $\mathcal{H} \subset \mathcal{F}$ be a finite space of hypotheses. Then
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le |\mathcal{H}| \sqrt{V_\mathcal{H}/n}$
In particular, if $f^* \in \mathcal{H}$, then
$\mathbb{E}\,|\mathcal{E}(f_n) - \mathcal{E}(f^*)| \le |\mathcal{H}| \sqrt{V_\mathcal{H}/n}$
and ERM is a good estimator for the problem considered.

10. Example: Threshold Functions
Consider a binary classification problem with $Y = \{0, 1\}$. Someone has told us that the minimizer of the risk is a "threshold function" $f_{a^*}(x) = 1_{[a^*, +\infty)}(x)$ with $a^* \in [-1, 1]$.
[Figure: two example threshold functions with thresholds $a$ and $b$, plotted on the interval $[-1.5, 1.5]$.]
We can learn on $\mathcal{H} = \{f_a \mid a \in [-1, 1]\}$. However, on a computer we can only represent real numbers up to a given precision.

11. Example: Threshold Functions (with precision p)
Discretization: given $p > 0$, we can consider
$\mathcal{H}_p = \{f_a \mid a \in [-1, 1],\ a \cdot 10^p = [a \cdot 10^p]\}$
with $[a]$ denoting the integer part (i.e. the closest integer) of a scalar $a$. The value $p$ can be interpreted as the "precision" of our space of functions $\mathcal{H}_p$. Note that $|\mathcal{H}_p| = 2 \cdot 10^p$.
If $f^* \in \mathcal{H}_p$, then we automatically have
$\mathbb{E}\,|\mathcal{E}(f_n) - \mathcal{E}(f^*)| \le |\mathcal{H}_p| \sqrt{V_\mathcal{H}/n} \le 10^p/\sqrt{n}$
($V_\mathcal{H} \le 1/4$ since $\ell$ is the 0-1 loss and therefore $|\ell(f(x), y)| \le 1$ for any $f \in \mathcal{H}$).
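A small sketch of ERM on the discretized class $\mathcal{H}_p$ with the 0-1 loss. The data-generating threshold $a^*$ and sample size below are illustrative assumptions: candidate thresholds are the multiples of $10^{-p}$ in $[-1, 1]$, and ERM picks the one with the lowest empirical misclassification rate.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 2
# Grid of admissible thresholds: multiples of 10^{-p} in [-1, 1] (about 2 * 10^p of them).
thresholds = np.arange(-1.0, 1.0 + 1e-12, 10.0 ** (-p))

a_star = 0.137                                 # illustrative "true" threshold a*
n = 500
x = rng.uniform(-1.0, 1.0, n)
y = (x >= a_star).astype(int)                  # labels generated by f_{a*}

# Empirical 0-1 risk of every f_a in H_p, and the ERM threshold a_n.
emp_risk = np.array([np.mean((x >= a).astype(int) != y) for a in thresholds])
a_n = thresholds[np.argmin(emp_risk)]
print("|H_p| ~", len(thresholds), "  ERM threshold a_n =", a_n)
```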

12. Rates in Expectation vs Probability
In practice, even for small values of $p$, the bound
$\mathbb{E}\,|\mathcal{E}(f_n) - \mathcal{E}(f^*)| \le 10^p/\sqrt{n}$
requires a very large $n$ in order to say anything meaningful about the expected error. Interestingly, we can get much better constants (not rates though!) by working in probability...

13. Hoeffding's Inequality
Let $X_1, \dots, X_n$ be independent random variables such that $X_i \in [a_i, b_i]$. Let $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$. Then
$\mathbb{P}\left(\left|\bar{X} - \mathbb{E}\,\bar{X}\right| \ge \epsilon\right) \le 2 \exp\left(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$
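A quick numerical check of Hoeffding's inequality. The Bernoulli variables and parameters below are illustrative: for variables in $[0, 1]$, $\sum_i (b_i - a_i)^2 = n$ and the bound reduces to $2\exp(-2 n \epsilon^2)$, which the sketch compares with the observed deviation probability of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 200_000
q = 0.3                                        # X_i ~ Bernoulli(q), so X_i in [0, 1]

# Sample means of n i.i.d. Bernoulli variables, repeated over many trials.
means = rng.binomial(n, q, size=trials) / n
observed = np.mean(np.abs(means - q) >= eps)

# For X_i in [0, 1]: sum_i (b_i - a_i)^2 = n, so the bound is 2 exp(-2 n eps^2).
hoeffding = 2.0 * np.exp(-2.0 * n * eps ** 2)
print("observed deviation probability:", observed, "  Hoeffding bound:", hoeffding)
```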

14. Applying Hoeffding's Inequality
Assume that for all $f \in \mathcal{H}$, $x \in X$, $y \in Y$ the loss is bounded, $|\ell(f(x), y)| \le M$, by some constant $M > 0$. Then, for any $f \in \mathcal{H}$ we have
$\mathbb{P}(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon) \le 2 \exp\left(-\frac{n \epsilon^2}{2 M^2}\right)$

15. Controlling the Generalization Error
We would like to control the generalization error $\mathcal{E}_n(f_n) - \mathcal{E}(f_n)$ of our estimator in probability. One possible way to do that is by controlling the generalization error of the whole set $\mathcal{H}$:
$\mathbb{P}(|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \ge \epsilon) \le \mathbb{P}\left(\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\right)$
The latter term is the probability that at least one of the events $|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon$ occurs for $f \in \mathcal{H}$, in other words the probability of the union of such events. Therefore
$\mathbb{P}\left(\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\right) \le \sum_{f \in \mathcal{H}} \mathbb{P}(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon)$
by the so-called union bound.

16. Hoeffding the Generalization Error
By applying Hoeffding's inequality,
$\mathbb{P}(|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \ge \epsilon) \le 2 |\mathcal{H}| \exp\left(-\frac{n \epsilon^2}{2 M^2}\right)$
or, equivalently, for any $\delta \in (0, 1]$,
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{2 M^2 \log(2 |\mathcal{H}|/\delta)}{n}}$
with probability at least $1 - \delta$.
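A one-function sketch of the inverted bound above, assuming $M$, $|\mathcal{H}|$, $n$ and the confidence level $\delta$ are known; the numbers in the example call are illustrative. It returns the radius $\sqrt{2 M^2 \log(2|\mathcal{H}|/\delta)/n}$ inside which $|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)|$ falls with probability at least $1 - \delta$.

```python
import numpy as np

def generalization_radius(n, card_H, M=1.0, delta=0.05):
    """Bound on |E_n(f_n) - E(f_n)| holding with probability at least 1 - delta,
    for a finite hypothesis space of cardinality card_H and a loss bounded by M."""
    return np.sqrt(2.0 * M ** 2 * np.log(2.0 * card_H / delta) / n)

# Illustrative numbers: 200 hypotheses, 1000 samples, loss bounded by 1, 95% confidence.
print(generalization_radius(n=1000, card_H=200, M=1.0, delta=0.05))
```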

17. Example: Threshold Functions (in Probability)
Going back to the space $\mathcal{H}_p$ of threshold functions...
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{4 + 6p - 2 \log \delta}{n}}$
since $M = 1$ and $\log(2|\mathcal{H}_p|) = \log(4 \cdot 10^p) = \log 4 + p \log 10 \le 2 + 3p$.
For example, let $\delta = 0.001$. We can say that
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}}$
holds at least 99.9% of the time.

18. Bounds in Expectation vs Probability
Comparing the two bounds:
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le 10^p/\sqrt{n}$ (expectation)
while, with probability greater than 99.9%,
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}}$ (probability)
Although we cannot be 100% sure of it, we can be quite confident that the generalization error will be much smaller than what the bound in expectation tells us...
Rates: note however that the rates of convergence to 0 are the same (i.e. $O(1/\sqrt{n})$).
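A tiny numerical comparison of the two bounds above, for a few illustrative values of $p$ and $n$: the expectation bound $10^p/\sqrt{n}$ versus the 99.9%-probability bound $\sqrt{(6p + 18)/n}$.

```python
import numpy as np

for p in (1, 2, 3):
    for n in (10_000, 1_000_000):
        expectation_bound = 10.0 ** p / np.sqrt(n)
        probability_bound = np.sqrt((6 * p + 18) / n)   # holds with probability >= 99.9%
        print(f"p={p}, n={n}: expectation {expectation_bound:.3f}, probability {probability_bound:.3f}")
```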

19. Improving the Bound in Expectation
Exploiting the bound in probability and the knowledge that on $\mathcal{H}_p$ the excess risk is bounded by a constant, we can improve the bound in expectation...
Let $X$ be a random variable such that $|X| \le M$ for some constant $M > 0$. Then, for any $\epsilon > 0$ we have
$\mathbb{E}\,|X| \le \epsilon\, \mathbb{P}(|X| \le \epsilon) + M\, \mathbb{P}(|X| > \epsilon)$
Applying this to our problem: for any $\delta \in (0, 1]$,
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{2 M^2 \log(2 |\mathcal{H}_p|/\delta)}{n}}\,(1 - \delta) + \delta M$
Therefore only $\log |\mathcal{H}_p|$ appears (no $|\mathcal{H}_p|$ alone).
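A short sketch evaluating the improved expectation bound above for the threshold class; the values of $n$, $p$, $M$ and $\delta$ are illustrative, and one could additionally minimize over $\delta$.

```python
import numpy as np

def improved_expectation_bound(n, p, M=1.0, delta=0.001):
    """(1 - delta) * sqrt(2 M^2 log(2 |H_p| / delta) / n) + delta * M for the threshold class H_p."""
    card_Hp = 2 * 10 ** p
    radius = np.sqrt(2.0 * M ** 2 * np.log(2.0 * card_Hp / delta) / n)
    return (1.0 - delta) * radius + delta * M

n, p = 10_000, 2
print("improved:", improved_expectation_bound(n, p), "  direct 10^p/sqrt(n):", 10 ** p / np.sqrt(n))
```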

20. Infinite Hypothesis Spaces
What if $f^* \in \mathcal{H} \setminus \mathcal{H}_p$ for every $p > 0$? ERM on $\mathcal{H}_p$ will never minimize the expected risk: there will always be a gap $\mathcal{E}(f_{n,p}) - \mathcal{E}(f^*)$. For $p \to +\infty$ it is natural to expect this gap to decrease... BUT if $p$ increases too fast (with respect to the number $n$ of examples) we cannot control the generalization error anymore:
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}} \to +\infty$ for $p \to +\infty$
Therefore we need to increase $p$ gradually, as a function $p(n)$ of the number of training examples. This approach is known as regularization.

21. Approximation Error for Threshold Functions
Let us consider $f_p = 1_{[a_p, +\infty)} = \mathrm{argmin}_{f \in \mathcal{H}_p} \mathcal{E}(f)$, with $a_p \in [-1, 1]$. Consider the error decomposition of the excess risk $\mathcal{E}(f_n) - \mathcal{E}(f^*)$:
$[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)] + [\mathcal{E}_n(f_n) - \mathcal{E}_n(f_p)] + [\mathcal{E}_n(f_p) - \mathcal{E}(f_p)] + [\mathcal{E}(f_p) - \mathcal{E}(f^*)]$
The second term is at most 0, since $f_n$ minimizes the empirical risk over $\mathcal{H}_p$. We already know how to control the generalization of $f_n$ (via the supremum over $\mathcal{H}_p$) and of $f_p$ (since it is a single function). Moreover, we have that the approximation error is
$\mathcal{E}(f_p) - \mathcal{E}(f^*) \le |a_p - a^*| \le 10^{-p}$ (why?)
Note that it does not depend on the training data!
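A quick numerical check of the grid-spacing part of the claim above. It does not verify the risk inequality itself; it only assumes (as a simplification) that the best threshold available in $\mathcal{H}_p$ is the grid point closest to $a^*$, and checks that $|a_p - a^*| \le 10^{-p}$. The value of $a^*$ is illustrative.

```python
import numpy as np

p = 3
a_star = 0.123456                              # illustrative true threshold in [-1, 1]
grid = np.arange(-1.0, 1.0 + 1e-12, 10.0 ** (-p))
a_p = grid[np.argmin(np.abs(grid - a_star))]   # closest threshold available in H_p
print(abs(a_p - a_star), "<=", 10.0 ** (-p))   # |a_p - a*| <= 10^{-p}
```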

22. Approximation Error for Threshold Functions II
Putting everything together, we have that for any $\delta \in (0, 1]$ and $p \ge 0$,
$\mathcal{E}(f_n) - \mathcal{E}(f^*) \le 2\sqrt{\frac{4 + 6p - 2 \log \delta}{n}} + 10^{-p} = \varphi(n, \delta, p)$
holds with probability greater than or equal to $1 - \delta$. In particular, for any $n$ and $\delta$, we can choose the best precision as
$p(n, \delta) = \mathrm{argmin}_{p \ge 0}\ \varphi(n, \delta, p)$
which leads to an error bound $\epsilon(n, \delta) = \varphi(n, \delta, p(n, \delta))$ holding with probability larger than or equal to $1 - \delta$.
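A minimal sketch of this final step: minimizing $\varphi(n, \delta, p)$ over a grid of $p$ values. The grid, the choice $\delta = 0.001$, and the sample sizes are illustrative assumptions.

```python
import numpy as np

def phi(n, delta, p):
    """The bound from the slide: estimation term 2*sqrt((4 + 6p - 2 log delta)/n) plus 10^{-p}."""
    return 2.0 * np.sqrt((4.0 + 6.0 * p - 2.0 * np.log(delta)) / n) + 10.0 ** (-p)

def best_precision(n, delta, p_grid=np.arange(0.0, 10.0, 0.01)):
    values = phi(n, delta, p_grid)
    i = int(np.argmin(values))
    return p_grid[i], values[i]                # p(n, delta) and eps(n, delta) = phi(n, delta, p(n, delta))

for n in (10 ** 3, 10 ** 5, 10 ** 7):
    p_star, eps = best_precision(n, delta=0.001)
    print(f"n={n}: p(n, delta) = {p_star:.2f},  eps(n, delta) = {eps:.3f}")
```

The minimizing precision grows slowly with $n$, which is exactly the gradual schedule $p(n)$ that the previous slide calls regularization.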
