
Adaptivity of deep ReLU network and its generalization error analysis - PowerPoint PPT Presentation



  1. Adaptivity of deep ReLU network and its generalization error analysis. Taiji Suzuki†‡ (†The University of Tokyo, Department of Mathematical Informatics; ‡AIP-RIKEN). 22nd/Feb/2019, The 2nd Korea-Japan Machine Learning Workshop. 1 / 50

  2. Deep learning model: f(x) = η(W_L η(W_{L−1} ··· W_2 η(W_1 x + b_1) + b_2 ···)). High performance learning system. Many applications: Deepmind, Google, Facebook, Open AI, Baidu, ... 2 / 50

  3. Deep learning model: f(x) = η(W_L η(W_{L−1} ··· W_2 η(W_1 x + b_1) + b_2 ···)). High performance learning system. Many applications: Deepmind, Google, Facebook, Open AI, Baidu, ... We need theories. 2 / 50
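Not part of the slides: a minimal NumPy sketch of the model above. The layer sizes, random weights, and function names are illustrative choices of this note, not the talk's.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def deep_relu_net(x, weights, biases):
    # f(x) = eta(W_L eta(W_{L-1} ... W_2 eta(W_1 x + b_1) + b_2 ...)), eta = ReLU
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# Toy instantiation: depth 3, input dimension 4, hidden width 8, scalar output.
rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]
print(deep_relu_net(rng.standard_normal(4), weights, biases))
```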

  4. Outline of this talk Why does deep learning perform so well? “Adaptivity” of deep neural network: Adaptivity to the shape of the target function. Adaptivity to the dimensionality of the input data. → sparsity, non-convexity 3 / 50

  5. Outline of this talk. Why does deep learning perform so well? “Adaptivity” of deep neural network: adaptivity to the shape of the target function; adaptivity to the dimensionality of the input data. → sparsity, non-convexity. Approach: estimation error analysis on a Besov space (spatial inhomogeneity of smoothness; avoiding the curse of dimensionality). It will be shown that any linear estimator, such as a kernel method, is outperformed by DL. 3 / 50

  6. Reference: Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. ICLR 2019, to appear (arXiv:1810.08033). 4 / 50

  7. Outline
     1 Literature overview
     2 Approximating and estimating functions in Besov space and related spaces
       - Deep NN representation for Besov space
       - Function class with more explicit sparsity
       - Deep NN representation for “mixed smooth” Besov space
     5 / 50

  8. Universal approximator. Two layer neural network: f(x) = Σ_{j=1}^m v_j η(w_j^⊤ x + b_j). As m → ∞, the two layer network can approximate an arbitrary function with an arbitrary precision: f̂(x) = Σ_{j=1}^m v_j η(w_j^⊤ x + b_j) ≃ f_o(x) = ∫ h_o(w, b) η(w^⊤ x + b) dw db (Sonoda & Murata, 2015).
     Year  Authors               Basis function        Space
     1987  Hecht-Nielsen         –                     C(R^d)
     1988  Gallant & White       cos                   L_2(K)
     1988  Irie & Miyake         integrable            L_2(R^d)
     1989  Carroll & Dickinson   continuous sigmoidal  L_2(K)
     1989  Cybenko               continuous sigmoidal  C(K)
     1989  Funahashi             monotone & bounded    C(K)
     1993  Mhaskar & Micchelli   polynomial growth     C(K)
     2015  Sonoda & Murata       admissible            L_1, L_2
     6 / 50

  9. Universal approximator. Two layer neural network: f(x) = Σ_{j=1}^m v_j η(w_j^⊤ x + b_j). As m → ∞, the two layer network can approximate an arbitrary function with an arbitrary precision: f̂(x) = Σ_{j=1}^m v_j η(w_j^⊤ x + b_j) ≃ f_o(x) = ∫ h_o(w, b) η(w^⊤ x + b) dw db (Sonoda & Murata, 2015). Activation functions: ReLU: η(u) = max{u, 0}; Sigmoid: η(u) = 1/(1 + exp(−u)). 6 / 50
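Not from the slides: a small numerical sketch of the universal-approximation statement for the two-layer model f(x) = Σ_j v_j η(w_j^⊤ x + b_j). The target sin(2πx), the random draw of (w_j, b_j), and the least-squares fit of the v_j are assumptions of this sketch; the point is only that the grid error shrinks as the width m grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 512)
target = np.sin(2 * np.pi * x)                        # illustrative target function

def two_layer_grid_error(m):
    w = rng.uniform(-10.0, 10.0, size=m)              # inner weights w_j (random)
    b = rng.uniform(-10.0, 10.0, size=m)              # biases b_j (random)
    Phi = np.maximum(np.outer(x, w) + b, 0.0)         # ReLU features eta(w_j x + b_j)
    v, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # least squares for the v_j
    return np.max(np.abs(Phi @ v - target))           # sup error on the grid

for m in [5, 20, 100, 500]:
    print(f"m = {m:4d}, grid sup-error = {two_layer_grid_error(m):.4f}")
```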

  10. Expressive power of deep neural network:
     - Combinatorics / hyperplane arrangements (Montufar et al., 2014): number of linear regions (ReLU)
     - Polynomial expansions, tensor analysis (Cohen et al., 2016; Cohen & Shashua, 2016): number of monomials (sum-product)
     - Algebraic topology (Bianchini & Scarselli, 2014): Betti numbers (Pfaffian)
     - Riemannian geometry + dynamic mean field theory (Poole et al., 2016): extrinsic curvature
     Deep neural network has exponentially large expressive power with respect to the number of layers. 7 / 50

  11. Depth separation between 2 and 3 layers. A 2 layer NN is already a universal approximator; when is a deeper network useful? There is a function f_o(x) = g(∥x∥²) = g(x_1² + ··· + x_d²) that can be approximated much better by a 3 layer NN than by a 2 layer NN (c.f. Eldan and Shamir (2016)). With d_x the dimension of the input x: 3 layers: O(poly(d_x, ε^{−1})) internal nodes are sufficient; 2 layers: at least Ω(1/ε^{d_x}) internal nodes are required. → DL can avoid the curse of dimensionality. 8 / 50

  12. Non-smooth function. For estimating a non-smooth function, deep is better (Imaizumi & Fukumizu, 2018): f_o(x) = Σ_{k=1}^K 1_{R_k}(x) h_k(x), where R_k is a region with smooth boundary and h_k is a smooth function. 9 / 50
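Not from the slides: a toy instance of the piecewise-smooth target f_o(x) = Σ_k 1_{R_k}(x) h_k(x). The two regions, the sinusoidal boundary, and the smooth pieces h_1, h_2 below are illustrative choices of this note.

```python
import numpy as np

def f_o(x):
    """Piecewise-smooth target on [0, 1]^2: two regions split by a smooth boundary."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    in_R1 = x[:, 1] > 0.5 + 0.2 * np.sin(2 * np.pi * x[:, 0])   # smooth boundary curve
    h1 = np.sin(3.0 * x[:, 0]) + x[:, 1] ** 2                   # smooth piece on R_1
    h2 = np.cos(5.0 * x[:, 1]) - x[:, 0]                        # smooth piece on R_2
    return np.where(in_R1, h1, h2)                              # jump across the boundary

print(f_o([[0.2, 0.9], [0.2, 0.1]]))   # same x_1, opposite sides of the boundary
```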

  13. Depth separation What makes difference between deep and shallow methods? 10 / 50

  14. Depth separation What makes difference between deep and shallow methods? → Non-convexity of the model (sparseness) 10 / 50

  15. Easy example: linear activation. Reduced rank regression: Y_i = U V X_i + ξ_i (i = 1, ..., n), where U ∈ R^{M×r}, V ∈ R^{r×N} (r ≪ M, N), and Y_i ∈ R^M, X_i ∈ R^N. Linear (shallow) estimator f̂(x) = Σ_{i=1}^n Y_i φ_i(X_1, ..., X_n, x): rate MN/n. Deep learning f̂(x) = Û V̂ x: rate r(M+N)/n ≪ MN/n. Non-convexity is essential → sparsity. 11 / 50
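Not from the slides: a numerical sketch of this reduced rank regression example. The rank-constrained estimator below is a simple SVD truncation of the least-squares solution (a stand-in for the non-convex low-rank fit, not the talk's exact estimator), and all dimensions and noise levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, r, n = 30, 30, 2, 200
U = rng.standard_normal((M, r))
V = rng.standard_normal((r, N))
X = rng.standard_normal((n, N))
Y = X @ (U @ V).T + 0.5 * rng.standard_normal((n, M))   # Y_i = U V X_i + xi_i

B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)           # (N, M): unconstrained linear fit
Ps, s, Qt = np.linalg.svd(B_ols, full_matrices=False)
B_rank_r = (Ps[:, :r] * s[:r]) @ Qt[:r, :]              # keep only the top-r directions

truth = (U @ V).T
print("full-rank (shallow) error:", np.linalg.norm(B_ols - truth))
print("rank-r   ('deep')  error:", np.linalg.norm(B_rank_r - truth))
```

Forcing rank r discards most of the noise directions, which informally mirrors the r(M+N)/n versus MN/n comparison on the slide.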

  16. Nonlinear regression problem: y_i = f_o(x_i) + ξ_i (i = 1, ..., n), where ξ_i ∼ N(0, σ²) and x_i ∼ P_X([0, 1]^d) (i.i.d.). We want to estimate f_o from data (x_i, y_i)_{i=1}^n. Least squares estimator: f̂ = argmin_{f ∈ F} (1/n) Σ_{i=1}^n (y_i − f(x_i))², where F is a neural network model. 12 / 50
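Not from the slides: a sketch of this regression setting with a least-squares estimator over a simple network-type class (a ReLU network with a frozen random first layer, so the empirical risk minimization is convex). The target f_o, σ, d, and the class F are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, width = 500, 2, 0.3, 50

def f_o(x):                                          # illustrative true regression function
    return np.sin(2 * np.pi * x[:, 0]) * x[:, 1]

X = rng.uniform(0.0, 1.0, size=(n, d))               # x_i ~ P_X([0, 1]^d), i.i.d.
y = f_o(X) + sigma * rng.standard_normal(n)          # y_i = f_o(x_i) + xi_i

W = rng.standard_normal((width, d))
b = rng.standard_normal(width)

def features(x):                                     # fixed random ReLU feature map
    return np.maximum(x @ W.T + b, 0.0)

v, *_ = np.linalg.lstsq(features(X), y, rcond=None)  # least-squares estimator over F

X_test = rng.uniform(0.0, 1.0, size=(5000, d))       # Monte Carlo estimate of the L2(P) error
print(np.sqrt(np.mean((features(X_test) @ v - f_o(X_test)) ** 2)))
```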

  17. Bias and variance trade-off: ∥f_o − f̂∥_{L_2(P)} ≤ ∥f_o − f̌∥_{L_2(P)} + ∥f̌ − f̂∥_{L_2(P)}, i.e. estimation error ≤ approximation error (bias) + sample deviation (variance), where f̌ is the best approximation of the true f_o within the model. Large model: small approximation error, large sample deviation. Small model: large approximation error, small sample deviation. → Bias and variance trade-off. 13 / 50

  18. Outline
     1 Literature overview
     2 Approximating and estimating functions in Besov space and related spaces
       - Deep NN representation for Besov space
       - Function class with more explicit sparsity
       - Deep NN representation for “mixed smooth” Besov space
     14 / 50

  19. Agenda of this talk. Deep learning can make use of sparsity. Appropriate function class with non-convexity: Q: A typical setting is Hölder space. Can we generalize it? A: Besov space and mixed-smooth Besov space (tensor product space). Curse of dimensionality: Q: Deep learning can suffer from the curse of dimensionality. Can we ease the effect of dimensionality under a suitable condition? A: Yes, if the true function is included in the mixed-smooth Besov space. 15 / 50

  20. Outline
     1 Literature overview
     2 Approximating and estimating functions in Besov space and related spaces
       - Deep NN representation for Besov space
       - Function class with more explicit sparsity
       - Deep NN representation for “mixed smooth” Besov space
     16 / 50

  21. Minimax optimal framework. What is a “good” estimator? Minimax optimal rate: inf_{f̂: estimator} sup_{f_o ∈ F} E[∥f̂ − f_o∥²_{L_2(P)}] ≤ n^{−?}. → If an estimator f̂ achieves the minimax optimal rate, then it can be seen as a “good” estimator. What kind of F do we consider? 17 / 50
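Not on this slide, but for context on the "n^{−?}" question: over a Besov ball the classical nonparametric answer is, up to logarithmic factors and under standard conditions on (s, p, q, d),

```latex
% Minimax rate over a Besov ball (classical result; conditions such as
% s > d(1/p - 1/2)_+ are assumed so that the dense-zone rate applies).
\inf_{\hat f}\ \sup_{f_o \in \{f :\, \|f\|_{B^s_{p,q}([0,1]^d)} \le 1\}}
  \mathbb{E}\big[\|\hat f - f_o\|_{L_2(P)}^2\big]\ \asymp\ n^{-\frac{2s}{2s+d}} .
```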

  22. Hölder, Sobolev, Besov. Let Ω = [0, 1]^d ⊂ R^d.
     Hölder space C^β(Ω) (β = m + γ, m ∈ N_0, 0 < γ ≤ 1):
       ∥f∥_{C^β} = max_{|α| ≤ m} ∥∂^α f∥_∞ + max_{|α| = m} sup_{x, y ∈ Ω, x ≠ y} |∂^α f(x) − ∂^α f(y)| / |x − y|^{β − m}.
     Sobolev space W^k_p(Ω):
       ∥f∥_{W^k_p} = (Σ_{|α| ≤ k} ∥D^α f∥^p_{L_p(Ω)})^{1/p}.
     Besov space B^s_{p,q}(Ω) (0 < p, q ≤ ∞, 0 < s ≤ m):
       ω_m(f, t)_p := sup_{∥h∥ ≤ t} ∥ Σ_{j=0}^m (m choose j) (−1)^{m−j} f(· + jh) ∥_{L_p(Ω)},
       ∥f∥_{B^s_{p,q}(Ω)} = ∥f∥_{L_p(Ω)} + (∫_0^∞ [t^{−s} ω_m(f, t)_p]^q dt/t)^{1/q}.
     18 / 50
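Not from the slides: a small numerical sketch of the modulus of smoothness ω_m(f, t)_p used in the Besov norm, approximated on a grid of [0, 1]. The test function |x − 1/2| and the choices m = 2, p = 1 are assumptions of this note; for this example the printed values decay roughly like t².

```python
import numpy as np
from math import comb

def modulus_of_smoothness(f_vals, grid, m, p, t):
    """Approximate omega_m(f, t)_p = sup_{0 < h <= t} || Delta_h^m f ||_{L_p} on a grid."""
    dx = grid[1] - grid[0]
    best = 0.0
    for k in range(1, int(t / dx) + 1):              # step h = k * dx
        L = len(grid) - m * k                        # points where x + m*h stays inside [0, 1]
        if L <= 0:
            break
        diff = np.zeros(L)                           # Delta_h^m f(x) = sum_j C(m,j)(-1)^(m-j) f(x + j h)
        for j in range(m + 1):
            diff += comb(m, j) * (-1) ** (m - j) * f_vals[j * k : j * k + L]
        best = max(best, (np.sum(np.abs(diff) ** p) * dx) ** (1.0 / p))
    return best

grid = np.linspace(0.0, 1.0, 2001)
f_vals = np.abs(grid - 0.5)                          # smooth except for one kink at x = 1/2
for t in [0.2, 0.1, 0.05, 0.025]:
    print(t, modulus_of_smoothness(f_vals, grid, m=2, p=1, t=t))
```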

  23. Relation between the spaces. Suppose Ω = [0, 1]^d ⊂ R^d. For m ∈ N: B^m_{p,1} ↪ W^m_p ↪ B^m_{p,∞}, and B^m_{2,2} = W^m_2. For 0 < s < ∞ with s ∉ N: C^s = B^s_{∞,∞}. 19 / 50

  24. Continuous regime: s > d/p ⇒ B^s_{p,q} ↪ C^0. L_r-integrability: s > d(1/p − 1/r)_+ ⇒ B^s_{p,q} ↪ L_r. (If d/p ≥ s, the elements are not necessarily continuous.) [Figure: continuous vs. discontinuous regimes along the smoothness axis s.] Example: B^1_{1,1}([0, 1]) ⊂ {bounded total variation} ⊂ B^1_{1,∞}([0, 1]). 20 / 50

  25. Properties of Besov space: discontinuity (d/p > s); spatial inhomogeneity of smoothness (small p). [Figure: a function that is rough in some regions and smooth in others.] Question: Can deep learning capture these properties? 21 / 50

  26. Connection to sparsity. [Figure: curves for p = 0.5, 1, 2 on axes from −1 to 1.] Multiresolution expansion: f = Σ_{k=0}^∞ Σ_{j ∈ J(k)} α_{k,j} ψ(2^k x − j), with ∥f∥_{B^s_{p,q}} ≃ (Σ_{k=0}^∞ {2^{sk} (2^{−kd} Σ_{j ∈ J(k)} |α_{k,j}|^p)^{1/p}}^q)^{1/q}. Sparse coefficients → spatial inhomogeneity of smoothness. 22 / 50
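Not from the slides: the sequence-space expression above evaluated on synthetic coefficients, to illustrate the sparsity remark. The coefficient arrays and the parameters (s = 1, p = 0.5, q = 1, d = 1) are illustrative; with the same per-level energy, coefficients concentrated on a few locations give a smaller B^s_{p,q} norm when p is small.

```python
import numpy as np

def besov_seq_norm(alphas, s, p, q, d=1):
    """( sum_k [ 2^{s k} (2^{-k d} sum_j |alpha_{k,j}|^p)^{1/p} ]^q )^{1/q}."""
    total = 0.0
    for k, a in enumerate(alphas):
        lp = (2.0 ** (-k * d) * np.sum(np.abs(a) ** p)) ** (1.0 / p)
        total += (2.0 ** (s * k) * lp) ** q
    return total ** (1.0 / q)

rng = np.random.default_rng(0)
K = 10
dense = [rng.uniform(0.1, 1.0, 2 ** k) * 2.0 ** (-1.5 * k) for k in range(K)]
sparse = []
for k in range(K):                                   # one big coefficient per level,
    a = np.zeros(2 ** k)                             # same l2 energy as the dense level
    a[rng.integers(2 ** k)] = np.linalg.norm(dense[k])
    sparse.append(a)

for name, coef in [("dense ", dense), ("sparse", sparse)]:
    print(name, besov_seq_norm(coef, s=1.0, p=0.5, q=1.0))
```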

  27. Deep learning model: f(x) = (W^(L) η(·) + b^(L)) ∘ (W^(L−1) η(·) + b^(L−1)) ∘ ··· ∘ (W^(1) x + b^(1)). F(L, W, S, B): deep networks with depth L, width W, sparsity S, norm bound B. η is the ReLU activation: η(u) = max{u, 0} (currently the most popular). 23 / 50
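Not from the slides: a sketch of one member of the class F(L, W, S, B), i.e. a depth-L, width-W ReLU network whose parameters are bounded by B in magnitude and of which only S are nonzero. The thresholding rule and all the numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, width, S, B, d = 4, 16, 200, 1.0, 3              # depth, width, sparsity, bound, input dim

dims = [d] + [width] * (L - 1) + [1]
layers = [np.clip(rng.standard_normal((dims[l + 1], dims[l] + 1)), -B, B)   # [W | b] per layer
          for l in range(L)]

# Enforce sparsity S: keep only the S largest-magnitude parameters across all layers.
flat = np.concatenate([p.ravel() for p in layers])
mask = np.zeros_like(flat)
mask[np.argsort(np.abs(flat))[-S:]] = 1.0
flat *= mask
pos, sparse_layers = 0, []
for p in layers:
    sparse_layers.append(flat[pos:pos + p.size].reshape(p.shape))
    pos += p.size

def f(x, layers):
    h = np.asarray(x, dtype=float)
    for l, Wb in enumerate(layers):
        h = Wb[:, :-1] @ h + Wb[:, -1]               # affine map W^(l) h + b^(l)
        if l < len(layers) - 1:
            h = np.maximum(h, 0.0)                   # ReLU on hidden layers
    return h

print("nonzero parameters:", int(mask.sum()), "of", flat.size)
print("f(x) =", f([0.2, -0.5, 0.7], sparse_layers))
```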
