Nonparametric regression using deep neural networks with ReLU activation function


  1. Nonparametric regression using deep neural networks with ReLU activation function
     Johannes Schmidt-Hieber, February 2018, Caltech

  2. - Many impressive results in applications ...
     - Lack of theoretical understanding ...

  3. Algebraic definition of a deep net
     A network architecture (L, p) consists of
     - a positive integer L, called the number of hidden layers (depth)
     - a width vector p = (p_0, ..., p_{L+1}) ∈ N^{L+2}.
     A neural network with architecture (L, p) is a function
         f : R^{p_0} → R^{p_{L+1}},   x ↦ f(x) = W_{L+1} σ_{v_L} W_L σ_{v_{L-1}} ··· W_2 σ_{v_1} W_1 x.
     Network parameters:
     - W_i is a p_i × p_{i-1} matrix
     - v_i ∈ R^{p_i}
     Activation function: we study the ReLU activation function σ(x) = max(x, 0).
     (A small evaluation sketch in code follows below.)
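The definition translates directly into a forward pass. Below is a minimal NumPy sketch, not part of the slides: the shift convention σ_v(y) = σ(y + v) and the random parameters are assumptions made for the illustration, with parameters drawn from [−1, 1] to match the restriction introduced on slide 6.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def network(x, weights, shifts):
    """Evaluate f(x) = W_{L+1} sigma_{v_L} W_L ... sigma_{v_1} W_1 x.

    weights: [W_1, ..., W_{L+1}], W_i of shape (p_i, p_{i-1})
    shifts:  [v_1, ..., v_L],     v_i of shape (p_i,)
    The shift convention sigma_v(y) = relu(y + v) is an assumption here.
    """
    h = x
    for W, v in zip(weights[:-1], shifts):
        h = relu(W @ h + v)   # hidden layer: affine map, then componentwise ReLU
    return weights[-1] @ h    # output layer: linear, no activation

# Example with the architecture of the next slide: L = 2, p = (4, 3, 3, 2)
p = (4, 3, 3, 2)
rng = np.random.default_rng(0)
weights = [rng.uniform(-1, 1, size=(p[i + 1], p[i])) for i in range(len(p) - 1)]
shifts = [rng.uniform(-1, 1, size=p[i + 1]) for i in range(len(p) - 2)]
print(network(rng.standard_normal(p[0]), weights, shifts))  # a point in R^2
```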

  4. Equivalence to graphical representation
     Figure: representation as a directed graph of a network with two hidden layers (L = 2) and width vector p = (4, 3, 3, 2).

  5. Characteristics of modern deep network architectures
     - Networks are deep
       - a version of ResNet has 152 hidden layers
       - networks become deeper
     - The number of network parameters is larger than the sample size
       - AlexNet uses 60 million parameters for 1.2 million training samples
     - There is some sort of sparsity on the parameters
     - ReLU activation function (σ(x) = max(x, 0))

  6. The large parameter trick
     - If we allow the network parameters to be arbitrarily large, then we can approximate the indicator function via x ↦ σ(ax) − σ(ax − 1) for large a (see the numerical sketch below)
     - it is common in approximation theory to use networks with network parameters tending to infinity
     - in our analysis, we instead restrict all network parameters to be at most one in absolute value
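A quick numerical check of the trick (my illustration, not from the slides): as the weight a grows, σ(ax) − σ(ax − 1) turns into the indicator of (0, ∞).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def approx_indicator(x, a):
    """sigma(a*x) - sigma(a*x - 1): 0 for x <= 0, 1 for x >= 1/a,
    and a linear ramp in between, so it approaches 1{x > 0} as a grows."""
    return relu(a * x) - relu(a * x - 1)

x = np.array([-0.5, -0.01, 0.0, 0.001, 0.01, 0.5])
for a in (10, 100, 10_000):
    print(a, approx_indicator(x, a))
# Only for huge a does the output look like an indicator function --
# which is exactly why bounding the parameters by one is a real restriction.
```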

  7. Statistical analysis
     - we want to study the statistical performance of a deep network ⇒ do nonparametric regression
     - we observe n i.i.d. copies (X_1, Y_1), ..., (X_n, Y_n) with
         Y_i = f(X_i) + ε_i,   ε_i ∼ N(0, 1)
     - X_i ∈ R^d, Y_i ∈ R
     - the goal is to reconstruct the function f : R^d → R
     - this problem has been studied extensively (kernel smoothing, wavelets, splines, ...)

  8. The estimator
     - denote by F(L, p, s) the class of all networks with
       - architecture (L, p)
       - number of active (i.e. non-zero) parameters equal to s
     - choose the network architecture (L, p) and the sparsity s
     - least-squares estimator
         f̂_n ∈ argmin_{f ∈ F(L, p, s)} Σ_{i=1}^n (Y_i − f(X_i))²
       (this is the global minimizer, which is not computable)
     - prediction error
         R(f̂_n, f) := E_f[(f̂_n(X) − f(X))²],
       with X =_d X_1 independent of the sample
     - study the dependence of R(f̂_n, f) on n
     (A sketch of the objective and the sparsity count in code follows below.)
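For concreteness, here is a small sketch (the toy data-generating function, the architecture p = (4, 3, 3, 1), and the random parameters are assumptions of mine, not from the slides) that evaluates the least-squares objective and counts the active parameters of one candidate network. It does not compute the global minimizer, which the slide notes is not computable.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def net(x, weights, shifts):
    h = x
    for W, v in zip(weights[:-1], shifts):
        h = relu(W @ h + v)
    return (weights[-1] @ h).item()   # scalar output, p_{L+1} = 1

def empirical_risk(weights, shifts, X, Y):
    """The least-squares objective sum_i (Y_i - f(X_i))^2 from the slide."""
    preds = np.array([net(x, weights, shifts) for x in X])
    return float(np.sum((Y - preds) ** 2))

def sparsity(weights, shifts):
    """s = number of active (non-zero) parameters over all W_i and v_i."""
    return sum(int(np.count_nonzero(a)) for a in (*weights, *shifts))

# Synthetic data Y_i = f(X_i) + eps_i with d = 4, n = 50 (toy choice of f)
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
Y = np.sin(X[:, 0]) + rng.standard_normal(50)

p = (4, 3, 3, 1)
weights = [rng.uniform(-1, 1, (p[i + 1], p[i])) for i in range(len(p) - 1)]
shifts = [rng.uniform(-1, 1, p[i + 1]) for i in range(len(p) - 2)]
print(empirical_risk(weights, shifts, X, Y), sparsity(weights, shifts))
```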

  9. Function class
     - classical idea: assume that the regression function is β-smooth
     - the optimal nonparametric estimation rate is then n^{−2β/(2β+d)}
     - this rate suffers from the curse of dimensionality
     - to understand deep learning this setting is therefore useless
     - ⇒ make a good structural assumption on f

  10. Hierarchical structure
     - Important: only few objects are combined on a deeper abstraction level
       - few letters in one word
       - few words in one sentence

  11. Function class
     - We assume that f = g_q ∘ ... ∘ g_0 with g_i : R^{d_i} → R^{d_{i+1}}
     - each of the d_{i+1} components of g_i is β_i-smooth and depends only on t_i variables
     - t_i can be much smaller than d_i
     - we show that the rate depends on the pairs (t_i, β_i), i = 0, ..., q

  12. Example: additive models
     - In an additive model
         f(x) = Σ_{i=1}^d f_i(x_i)
     - This can be written as f = g_1 ∘ g_0 with
         g_0(x) = (f_i(x_i))_{i=1,...,d},   g_1(y) = Σ_{i=1}^d y_i.
       Hence t_0 = 1 and d_1 = t_1 = d.
     - This decomposes an additive function into
       - one function that can be non-smooth, but every component is one-dimensional
       - one function that has a high-dimensional input but is smooth
     (A small code illustration of this decomposition follows below.)
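A tiny sketch of this decomposition (the component functions f_i below are made up for illustration): the additive model is literally the composition of a coordinate-wise layer g_0 and a linear summation layer g_1.

```python
import numpy as np

# Hypothetical univariate components f_i (possibly non-smooth, e.g. |t|)
components = [np.abs, np.sin, np.cos, lambda t: np.sqrt(np.abs(t))]

def g0(x):
    """First layer: apply each f_i to its own coordinate (t_0 = 1)."""
    return np.array([f_i(x_i) for f_i, x_i in zip(components, x)])

def g1(y):
    """Second layer: the smooth linear summation with d-dimensional input."""
    return float(np.sum(y))

def f(x):
    """The additive model f(x) = sum_i f_i(x_i), written as g_1 o g_0."""
    return g1(g0(x))

x = np.array([-1.0, 0.5, 2.0, -3.0])
assert np.isclose(f(x), sum(f_i(x_i) for f_i, x_i in zip(components, x)))
```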

  13. The effective smoothness
     For nonparametric regression with f = g_q ∘ ... ∘ g_0, define the effective smoothness
         β*_i := β_i · Π_{ℓ=i+1}^q (β_ℓ ∧ 1).
     β*_i is the smoothness induced on f by g_i.
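The definition is easy to evaluate; here is a short helper (my illustration; the smoothness indices in the example are made up).

```python
import math

def effective_smoothness(betas):
    """beta*_i = beta_i * prod_{l = i+1}^{q} min(beta_l, 1), for i = 0, ..., q."""
    q = len(betas) - 1
    return [betas[i] * math.prod(min(b, 1.0) for b in betas[i + 1:])
            for i in range(q + 1)]

# Illustrative composition with q = 2 and smoothness indices (2, 0.5, 3):
print(effective_smoothness([2.0, 0.5, 3.0]))  # -> [1.0, 0.5, 3.0]
```

A non-smooth layer (β_ℓ < 1) drags down the effective smoothness of every layer before it, which is exactly the point of the definition.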

  14. Main result
     Theorem: If
     (i) depth ≍ log n,
     (ii) width ≍ n^C with C ≥ 1,
     (iii) network sparsity s ≍ max_{i=0,...,q} n^{t_i/(2β*_i + t_i)} log n,
     then
         R(f̂_n, f) ≲ max_{i=0,...,q} n^{−2β*_i/(2β*_i + t_i)} log² n.
     (A small numerical illustration of this rate follows below.)
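A small numerical illustration (the pairs (t_i, β_i) below are invented for the example) of how the rate in the theorem is computed:

```python
import math

def convergence_rate(n, betas, ts):
    """max_i n^(-2*beta*_i / (2*beta*_i + t_i)) * (log n)^2, as in the theorem."""
    q = len(betas) - 1
    worst = 0.0
    for i in range(q + 1):
        beta_star = betas[i] * math.prod(min(b, 1.0) for b in betas[i + 1:])
        worst = max(worst, n ** (-2 * beta_star / (2 * beta_star + ts[i])))
    return worst * math.log(n) ** 2

# Example: q = 1 with (t_0, beta_0) = (1, 2) and (t_1, beta_1) = (5, 3)
for n in (10**3, 10**4, 10**5):
    print(n, convergence_rate(n, betas=[2.0, 3.0], ts=[1, 5]))
```

In this example the second pair attains the maximum, i.e. the rate is driven by the hardest layer of the composition.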

  15. Remarks on the rate
     Rate:
         R(f̂_n, f) ≲ max_{i=0,...,q} n^{−2β*_i/(2β*_i + t_i)} log² n.
     Remarks:
     - t_i can be seen as an effective dimension
     - there is a strong heuristic that this is the optimal rate (up to the log² n factor)
     - other methods such as wavelets likely do not achieve these rates

  16. Consequences
     - the assumption that depth ≍ log n appears naturally
       - in particular, the depth scales with the sample size
     - the networks can have many more parameters than the sample size
     - what is important for statistical performance is not the size of the network but the amount of regularization
       - here, the number of active parameters

  17. Consequences (ctd.)
     A paradox:
     - we obtain good rates for all smoothness indices
     - existing piecewise linear methods only give good rates up to smoothness two
     - here the non-linearity of the function class helps ⇒ non-linearity is essential!

  18. On the proof
     - Oracle inequality (roughly):
         R(f̂_n, f) ≲ inf_{f* ∈ F(L, p, s, F)} ‖f* − f‖²_∞ + (s log n)/n.
       This shows the trade-off between the approximation error and the number of active parameters s.
     - Approximation theory:
       - builds on work by Telgarsky (2016), Liang and Srikant (2016), Yarotsky (2017)
       - network parameters bounded by one
       - explicit bounds on the network architecture and sparsity

  19. Additive models (ctd.)
     - Consider again the additive model f(x) = Σ_{i=1}^d f_i(x_i)
     - suppose that each function f_i is β-smooth
     - the theorem then gives the rate
         R(f̂_n, f) ≲ n^{−2β/(2β+1)} log² n
       (a short derivation from the main theorem is sketched below)
     - this rate is known to be optimal up to the log² n factor
     The function class considered here contains other structural constraints, such as generalized additive models, as special cases, and it can be shown that the resulting rates are optimal up to the log² n factor.
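A rough sketch (my reconstruction, not on the slide) of how this follows from the main theorem, using the decomposition f = g_1 ∘ g_0 from slide 12 and glossing over the precise Hölder-class bookkeeping:

```latex
% Additive model as a composition (slide 12):
%   g_0(x) = (f_1(x_1),\dots,f_d(x_d)), \quad t_0 = 1,\ \beta_0 = \beta,
%   g_1(y) = \textstyle\sum_{i=1}^{d} y_i, \quad t_1 = d,\ \beta_1 \text{ arbitrarily large (linear map)}.
\[
  \beta_0^* = \beta_0\,(\beta_1 \wedge 1) = \beta, \qquad \beta_1^* = \beta_1,
\]
\[
  R(\hat f_n, f) \;\lesssim\;
  \max\Bigl( n^{-\frac{2\beta}{2\beta+1}},\; n^{-\frac{2\beta_1}{2\beta_1+d}} \Bigr)\log^2 n
  \;\lesssim\; n^{-\frac{2\beta}{2\beta+1}}\log^2 n,
\]
% since for \beta_1 large enough the second term decays faster than the first,
% leaving the one-dimensional rate of the non-smooth inner layer.
```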

  20. Extensions
     Some extensions are useful. To name a few:
     - high-dimensional input
     - include stochastic gradient descent
     - classification
     - CNNs, recurrent neural networks, ...
