Overparametrization and the bias-variance dilemma


  1. Overparametrization and the bias-variance dilemma
     Johannes Schmidt-Hieber, joint work with Alexis Derumigny
     https://arxiv.org/abs/2006.00278.pdf

  2. double descent and implicit regularization
     overparametrization generalizes well ⇒ implicit regularization

  3. can we defy the bias-variance trade-off?
     Geman et al. '92: "the fundamental limitations resulting from the bias-variance dilemma apply to all nonparametric inference methods, including neural networks"
     Because of the double descent phenomenon, there is some doubt whether this statement is true. Recent work includes ...

  4. lower bounds on the bias-variance trade-off
     Similar to minimax lower bounds, we want to establish a general mathematical framework to derive lower bounds on the bias-variance trade-off that hold for all estimators.
     Given such bounds we can answer many interesting questions:
     • are there methods (e.g. deep learning) that can defy the bias-variance trade-off?
     • lower bounds for the U-shaped curve of the classical bias-variance trade-off

  5. related literature
     • Low '95 provides a complete characterization of the bias-variance trade-off for functionals in the Gaussian white noise model
     • Pfanzagl '99 shows that estimators of functionals satisfying an asymptotic unbiasedness property must have unbounded variance
     No general treatment of lower bounds for the bias-variance trade-off yet.

  6. Cramér-Rao inequality
     for parametric problems:
     $V(\theta) \ge \dfrac{(1 + B'(\theta))^2}{F(\theta)}$
     • $V(\theta)$ the variance
     • $B'(\theta)$ the derivative of the bias
     • $F(\theta)$ the Fisher information
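
As a quick illustration (not part of the talk), the following Python sketch checks this inequality for a toy Gaussian location model with a linear shrinkage estimator; bias, variance and Fisher information all have closed forms there, and the bound is attained.

```python
# Illustrative check of the biased Cramer-Rao bound V(theta) >= (1 + B'(theta))^2 / F(theta)
# in a toy model not taken from the talk: X ~ N(theta, 1) and the shrinkage estimator
# hat(theta) = c * X.  Then B(theta) = (c - 1) * theta, so B'(theta) = c - 1, F(theta) = 1,
# and the bound (1 + B'(theta))^2 / F(theta) = c^2 equals the exact variance c^2.
import numpy as np

rng = np.random.default_rng(0)
theta, c, n_mc = 2.0, 0.7, 200_000

x = rng.normal(theta, 1.0, size=n_mc)   # n_mc independent draws of X ~ N(theta, 1)
est = c * x                             # shrinkage estimator hat(theta) = c X

variance = est.var()                    # Monte Carlo estimate of V(theta)
bound = (1.0 + (c - 1.0)) ** 2 / 1.0    # (1 + B'(theta))^2 / F(theta) = c^2

print(f"variance ≈ {variance:.4f}")     # ≈ 0.49
print(f"CR bound = {bound:.4f}")        # = 0.49; the bound is attained in this example
```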

  7. change of expectation inequalities
     • probability measures $P_0, \dots, P_M$
     • $\chi^2(P_0, \dots, P_M)$ the matrix with entries $\chi^2(P_0, \dots, P_M)_{j,k} = \int \frac{dP_j \, dP_k}{dP_0} - 1$
     • any random variable $X$
     • $\Delta := \big(E_{P_1}[X] - E_{P_0}[X], \dots, E_{P_M}[X] - E_{P_0}[X]\big)^\top$
     then
     $\Delta^\top \, \chi^2(P_0, \dots, P_M)^{-1} \, \Delta \le \operatorname{Var}_{P_0}(X)$
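
A minimal numerical sketch of this inequality, under assumptions of my own choosing: the measures are unit-variance Gaussians $P_j = \mathcal{N}(\mu_j, 1)$ with $P_0 = \mathcal{N}(0,1)$, and $X$ is simply the observation itself, so that both the $\chi^2$-matrix and $\Delta$ have closed forms.

```python
# Numerical sanity check of the change-of-expectation inequality
#   Delta^T  chi2(P_0,...,P_M)^{-1}  Delta  <=  Var_{P_0}(X)
# in a toy setting of my choosing: P_j = N(mu_j, 1) and X the identity map (the observation).
# For unit-variance Gaussians the matrix entries have the closed form
#   chi2_{j,k} = int dP_j dP_k / dP_0 - 1 = exp(mu_j * mu_k) - 1.
import numpy as np

mu = np.array([0.5, 1.0, -0.8])              # means of P_1, ..., P_M (P_0 has mean 0)
chi2 = np.exp(np.outer(mu, mu)) - 1.0        # chi-square divergence matrix
delta = mu.copy()                            # E_{P_j}[X] - E_{P_0}[X] = mu_j for X = identity
var_p0 = 1.0                                 # Var_{P_0}(X) for X ~ N(0, 1)

lhs = delta @ np.linalg.solve(chi2, delta)   # Delta^T chi2^{-1} Delta
print(f"lhs = {lhs:.4f}  <=  Var_P0(X) = {var_p0}")   # the slide's inequality guarantees lhs <= 1
```

For $M = 1$ this reduces to the familiar Hammersley-Chapman-Robbins type bound $(\mu_1)^2 / (e^{\mu_1^2} - 1) \le 1$.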

  8. pointwise estimation
     Gaussian white noise model: we observe $(Y_x)_x$ with $dY_x = f(x)\,dx + n^{-1/2}\,dW_x$
     • estimate $f(x_0)$ for a fixed $x_0$
     • $C^\beta(R)$ denotes a ball of Hölder $\beta$-smooth functions
     • for any estimator $\hat f(x_0)$, we obtain the bias-variance lower bound
       $\inf_{\hat f} \Big( \sup_{f \in C^\beta(R)} \big| \operatorname{Bias}_f\big(\hat f(x_0)\big) \big| \Big)^{1/\beta} \, \sup_{f \in C^\beta(R)} \operatorname{Var}_f\big(\hat f(x_0)\big) \gtrsim \frac{1}{n}$
     • bound is attained by most estimators
     • generates U-shaped curve
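
A rough numerical sketch of how such a bound produces the U-shaped curve, using the textbook kernel-smoothing proxies $|\operatorname{Bias}| \approx R h^\beta$ and $\operatorname{Var} \approx 1/(nh)$ for a bandwidth $h$; these proxies are an assumption made for illustration, not the talk's derivation.

```python
# Sketch of the U-shaped curve implied by the bound, using the kernel-smoothing proxies
# (an assumption, not the talk's argument):
#   |Bias| ~ R * h^beta   and   Var ~ 1 / (n * h)   for bandwidth h.
# Squared bias + variance is U-shaped in h, while the product |Bias|^(1/beta) * Var is
# of order 1/n for every h, consistent with the lower bound on the slide.
import numpy as np

n, beta, R = 1000, 2.0, 1.0
h = np.logspace(-3, 0, 13)            # candidate bandwidths

bias = R * h ** beta                  # proxy for sup |Bias_f(hat f(x0))|
var = 1.0 / (n * h)                   # proxy for sup Var_f(hat f(x0))

risk = bias ** 2 + var                # U-shaped in h, minimized near h ~ n^(-1/(2*beta+1))
product = bias ** (1.0 / beta) * var  # equals R^(1/beta) / n for every h

print("risk        :", np.round(risk, 4))
print("product * n :", np.round(product * n, 4), "(constant across bandwidths)")
```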

  9. high-dimensional models
     Gaussian sequence model:
     • observe independent $X_i \sim \mathcal{N}(\theta_i, 1)$, $i = 1, \dots, n$
     • $\Theta(s)$ the space of $s$-sparse vectors (here: $s \le \sqrt{n}/2$)
     • bias-variance decomposition
       $E_\theta\big[\|\hat\theta - \theta\|^2\big] = \underbrace{\big\| E_\theta[\hat\theta] - \theta \big\|^2}_{B^2(\theta)} + \sum_{i=1}^{n} \operatorname{Var}_\theta(\hat\theta_i)$
     • bias-variance lower bound: if $B^2(\theta) \le \gamma \, s \log(n/s^2)$, then
       $\sum_{i=1}^{n} \operatorname{Var}_0(\hat\theta_i) \gtrsim n \Big( \frac{s^2}{n} \Big)^{4\gamma}$
     • bound is matched (up to a factor in the exponent) by soft thresholding
     • bias-variance trade-off more extreme than U-shape
     • results also extend to high-dimensional linear regression
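
A minimal Monte Carlo sketch of soft thresholding in this model; the threshold $\lambda = \sqrt{2\log(n/s^2)}$ and the signal values are my own illustrative choices, not taken from the talk.

```python
# Minimal Monte Carlo illustration of the bias-variance behaviour of soft thresholding
# in the Gaussian sequence model X_i ~ N(theta_i, 1).  The threshold
# lambda = sqrt(2 * log(n / s^2)) is a common choice, used here purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, s, n_mc = 400, 8, 10_000                   # s <= sqrt(n)/2 as on the slide
lam = np.sqrt(2.0 * np.log(n / s ** 2))

theta = np.zeros(n)
theta[:s] = 5.0                               # an s-sparse signal (illustrative values)

def soft_threshold(x, lam):
    """Soft thresholding applied coordinatewise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def bias_and_variance(theta, lam):
    """Monte Carlo estimates of B^2(theta) and sum_i Var_theta(hat theta_i)."""
    x = theta + rng.normal(size=(n_mc, theta.size))
    est = soft_threshold(x, lam)
    b2 = np.sum((est.mean(axis=0) - theta) ** 2)
    var = np.sum(est.var(axis=0))
    return b2, var

for name, t in [("s-sparse theta", theta), ("theta = 0", np.zeros(n))]:
    b2, var = bias_and_variance(t, lam)
    print(f"{name:15s}  squared bias ≈ {b2:8.2f}   total variance ≈ {var:8.2f}")
```

The sparse signal pays a bias of roughly $\lambda$ per active coordinate, while at $\theta = 0$ the total variance is far below $n$, which is the extreme trade-off the slide refers to.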

  10. $L^2$-loss
      Gaussian white noise model: we observe $(Y_x)_x$ with $dY_x = f(x)\,dx + n^{-1/2}\,dW_x$
      • bias-variance decomposition
        $\operatorname{MISE}_f\big(\hat f\big) := E_f\big[\|\hat f - f\|_{L^2[0,1]}^2\big] = \int_0^1 \operatorname{Bias}_f^2\big(\hat f(x)\big)\,dx + \int_0^1 \operatorname{Var}_f\big(\hat f(x)\big)\,dx =: \operatorname{IBias}_f^2(\hat f) + \operatorname{IVar}_f(\hat f)$
      • is there a bias-variance trade-off between $\operatorname{IBias}_f^2(\hat f)$ and $\operatorname{IVar}_f(\hat f)$?
      • turns out to be a very hard problem
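
A small Monte Carlo check of this decomposition, written in the standard sequence-space form of the white noise model (observed coefficients $y_j = \theta_j + n^{-1/2}\varepsilon_j$) with a simple projection estimator; the representation and the choice of coefficients are assumptions made for illustration.

```python
# Monte Carlo check of MISE = IBias^2 + IVar, using the sequence-space form of the white
# noise model (an equivalent representation assumed here for illustration): observed
# coefficients y_j = theta_j + n^(-1/2) eps_j with eps_j ~ N(0, 1), and a projection
# estimator that keeps the first J coefficients and sets the rest to zero.
import numpy as np

rng = np.random.default_rng(2)
n, J, n_coef, n_mc = 500, 20, 200, 20_000

j = np.arange(1, n_coef + 1)
theta = j ** (-1.5)                                  # illustrative smooth-signal coefficients

y = theta + rng.normal(size=(n_mc, n_coef)) / np.sqrt(n)
est = np.where(j <= J, y, 0.0)                       # projection estimator

mise = np.mean(np.sum((est - theta) ** 2, axis=1))   # E_f ||hat f - f||^2 via Parseval
ibias2 = np.sum((est.mean(axis=0) - theta) ** 2)     # integrated squared bias
ivar = np.sum(est.var(axis=0))                       # integrated variance

print(f"MISE ≈ {mise:.5f}   IBias^2 + IVar ≈ {ibias2 + ivar:.5f}")  # the two agree
```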

  11. $L^2$-loss (ctd.)
      • we propose a two-fold reduction scheme
        • reduction to a simpler model
        • reduction to a smaller class of estimators
      • $S^\beta(R)$ Sobolev space of $\beta$-smooth functions
      Bias-variance lower bound: for any estimator $\hat f$,
        $\Big( \sup_{f \in S^\beta(R)} \operatorname{IBias}_f\big(\hat f\big) \Big)^{1/\beta} \, \sup_{f \in S^\beta(R)} \operatorname{IVar}_f\big(\hat f\big) \ge \frac{1}{8n}$
      • many estimators $\hat f$ can be found with upper bound $\lesssim 1/n$
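
A back-of-the-envelope sketch (not the talk's two-fold reduction argument) of why $1/n$ is the right order for the matching upper bound: for a cutoff-$J$ projection estimator in the sequence-space form, the worst-case integrated bias over a Sobolev-type ball is of order $R J^{-\beta}$ and the integrated variance is $J/n$, so the product is of order $1/n$ for every cutoff. The decay assumptions below are mine.

```python
# Back-of-the-envelope check that (sup IBias)^(1/beta) * IVar ~ 1/n for a projection
# estimator.  Assumptions (mine, for illustration): sequence-space form of the white
# noise model, worst-case integrated bias over the ball of order R * J^(-beta) for
# cutoff J, and integrated variance exactly J / n.
import numpy as np

n, beta, R = 1000, 2.0, 1.0
J = np.arange(1, 200)

sup_ibias = R * J.astype(float) ** (-beta)     # worst-case integrated bias over S^beta(R)
ivar = J / n                                   # integrated variance of the projection estimator

product = sup_ibias ** (1.0 / beta) * ivar     # equals R^(1/beta) / n for every cutoff J
print("product * n :", np.round(product * n, 6)[:5], "... (constant = R**(1/beta))")
```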

  12. mean absolute deviation
      • several extensions of the bias-variance trade-off have been proposed in the literature, e.g. for classification
      • the mean absolute deviation (MAD) of an estimator $\hat\theta$ is $E_\theta\big[|\hat\theta - m|\big]$ with $m$ either the mean or the median of $\hat\theta$
      can the general framework be extended to lower bounds on the trade-off between bias and MAD?
      • derived a change of expectation inequality
      • this can be used to obtain a partial answer for pointwise estimation in the Gaussian white noise model
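
A small Monte Carlo sketch of the MAD for a single Gaussian mean estimated by soft thresholding; the estimator and all numerical choices are mine, purely to make the definition concrete.

```python
# Monte Carlo sketch of the MAD E_theta[|hat theta - m|] with m the mean or the median of
# hat theta, for an illustrative soft-threshold estimator of one Gaussian mean (my choice
# of estimator, not one analysed on this slide).
import numpy as np

rng = np.random.default_rng(3)
theta, lam, n_mc = 1.0, 1.0, 200_000

x = rng.normal(theta, 1.0, size=n_mc)
est = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # soft thresholding at level lam

bias = est.mean() - theta
mad_mean = np.mean(np.abs(est - est.mean()))          # MAD around the mean of hat theta
mad_median = np.mean(np.abs(est - np.median(est)))    # MAD around the median of hat theta

print(f"bias ≈ {bias:.3f}   MAD(mean) ≈ {mad_mean:.3f}   MAD(median) ≈ {mad_median:.3f}")
```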

  13. Summary
      • general framework to derive bias-variance lower bounds
      • leads to matching bias-variance lower bounds for standard models in nonparametric and high-dimensional statistics
      • different types of the bias-variance trade-off occur
      • can machine learning methods defy the bias-variance trade-off? No, there are universal lower bounds that no method can avoid
      for details and more results consult the preprint https://arxiv.org/abs/2006.00278.pdf
