 
Adaptivity of deep ReLU network and its generalization error analysis
Taiji Suzuki †‡
† The University of Tokyo, Department of Mathematical Informatics
‡ AIP-RIKEN
22nd/Feb/2019, The 2nd Korea-Japan Machine Learning Workshop
1 / 50
Deep learning model
$$f(x) = \eta(W_L \eta(W_{L-1} \cdots W_2 \eta(W_1 x + b_1) + b_2 \cdots))$$
High performance learning system
Many applications: Deepmind, Google, Facebook, Open AI, Baidu, ...
We need theories.
2 / 50
Outline of this talk
Why does deep learning perform so well?
"Adaptivity" of deep neural network:
  Adaptivity to the shape of the target function.
  Adaptivity to the dimensionality of the input data.
  → sparsity, non-convexity
Approach: Estimation error analysis on a Besov space.
  spatial inhomogeneity of smoothness
  avoiding curse of dimensionality
It will be shown that deep learning outperforms any linear estimator, such as kernel methods.
3 / 50
Reference
Taiji Suzuki: Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. ICLR 2019, to appear (arXiv:1810.08033).
4 / 50
Outline
1 Literature overview
2 Approximating and estimating functions in Besov space and related spaces
    Deep NN representation for Besov space
    Function class with more explicit sparsity
    Deep NN representation for "mixed smooth" Besov space
5 / 50
Universal approximator
Two layer neural network:
$$f(x) = \sum_{j=1}^{m} v_j \eta(w_j^\top x + b_j).$$
As $m \to \infty$, the two layer network can approximate an arbitrary function with an arbitrary precision.
$$f^o(x) = \int h^o(w, b)\, \eta(w^\top x + b)\, dw\, db \simeq \sum_{j=1}^{m} v_j \eta(w_j^\top x + b_j) = \hat{f}(x)$$
(Sonoda & Murata, 2015)

Year  Author               Basis function        Space
1987  Hecht-Nielsen        –                     C(R^d)
1988  Gallant & White      Cos                   L^2(K)
1988  Irie & Miyake        integrable            L^2(R^d)
1989  Carroll & Dickinson  Continuous sigmoidal  L^2(K)
1989  Cybenko              Continuous sigmoidal  C(K)
1989  Funahashi            Monotone & bounded    C(K)
1993  Mhaskar & Micchelli  Polynomial growth     C(K)
2015  Sonoda & Murata      admissible            L^1, L^2

Activation functions:
  ReLU: $\eta(u) = \max\{u, 0\}$
  Sigmoid: $\eta(u) = \frac{1}{1 + \exp(-u)}$
6 / 50
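The approximation statement above can be made concrete in one dimension. The following is a minimal sketch, not the construction used in the cited works: it builds a two layer ReLU network of the form $\sum_j v_j \eta(w_j x + b_j)$ that linearly interpolates a target on $[0, 1]$. The helper names (`fit_two_layer_relu`, `two_layer`), the equispaced knots, and the test target $\sin(2\pi x)$ are assumptions chosen for illustration.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def fit_two_layer_relu(target, n_pieces):
    """Return (v, w, b) so that sum_j v_j * relu(w_j * x + b_j) linearly
    interpolates `target` at n_pieces + 1 equispaced knots on [0, 1]."""
    knots = np.linspace(0.0, 1.0, n_pieces + 1)
    values = target(knots)
    slopes = np.diff(values) / np.diff(knots)
    # Hinge unit j sits at knot t_j and contributes the change in slope there.
    v_hinge = np.diff(slopes, prepend=0.0)
    # One extra unit with w = 0, b = 1 represents the constant target(0).
    v = np.concatenate([[values[0]], v_hinge])
    w = np.concatenate([[0.0], np.ones(n_pieces)])
    b = np.concatenate([[1.0], -knots[:-1]])
    return v, w, b

def two_layer(x, v, w, b):
    # f(x) = sum_j v_j * relu(w_j * x + b_j), evaluated for a vector of inputs x
    return relu(np.outer(x, w) + b) @ v

def target(x):
    return np.sin(2.0 * np.pi * x)

grid = np.linspace(0.0, 1.0, 1001)
err_small = np.max(np.abs(two_layer(grid, *fit_two_layer_relu(target, 8)) - target(grid)))
err_large = np.max(np.abs(two_layer(grid, *fit_two_layer_relu(target, 64)) - target(grid)))
```

The sup-norm error shrinks as the number of units grows, which is the one-dimensional picture behind "arbitrary precision as $m \to \infty$". In higher dimensions the number of units needed can grow quickly, which is what the later slides on depth and dimensionality address.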
Expressive power of deep neural network
Combinatorics/Hyperplane Arrangements (Montufar et al., 2014): Number of linear regions (ReLU)
Polynomial expansions, tensor analysis (Cohen et al., 2016; Cohen & Shashua, 2016): Number of monomials (Sum product)
Algebraic topology (Bianchini & Scarselli, 2014): Betti numbers (Pfaffian)
Riemannian geometry + Dynamic mean field theory (Poole et al., 2016): Extrinsic curvature
The expressive power of a deep neural network grows exponentially with the number of layers.
7 / 50
Depth separation between 2 and 3 layers
A 2 layer NN is already a universal approximator. When is a deeper network useful?
There is a function of the form
$$f^o(x) = g(\|x\|^2) = g(x_1^2 + \cdots + x_d^2)$$
that can be approximated better by a 3 layer NN than by a 2 layer NN (cf. Eldan and Shamir (2016)).
$d_x$: the dimension of the input $x$
  3 layers: $O(\mathrm{poly}(d_x, \epsilon^{-1}))$ internal nodes are sufficient.
  2 layers: At least $\Omega(1/\epsilon^{d_x})$ internal nodes are required.
→ DL can avoid curse of dimensionality.
8 / 50
Non-smooth function
For estimating a non-smooth function, deep is better (Imaizumi & Fukumizu, 2018):
$$f^o(x) = \sum_{k=1}^{K} 1_{R_k}(x)\, h_k(x)$$
where $R_k$ is a region with smooth boundary and $h_k$ is a smooth function.
9 / 50
Depth separation
What makes the difference between deep and shallow methods?
→ Non-convexity of the model (sparseness)
10 / 50
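Easy example: Linear activation
Reduced rank regression:
$$Y_i = U V X_i + \xi_i \quad (i = 1, \dots, n)$$
where $U \in \mathbb{R}^{M \times r}$, $V \in \mathbb{R}^{r \times N}$ ($r \ll M, N$), and $Y_i \in \mathbb{R}^M$, $X_i \in \mathbb{R}^N$.
  Linear estimator: $\hat{f}(x) = \sum_{i=1}^{n} Y_i\, \varphi(X_1, \dots, X_n, x)$
  Deep learning: $\hat{f}(x) = \hat{U} \hat{V} x$
$$\underbrace{\frac{r(M+N)}{n}}_{\text{Deep}} \ll \underbrace{\frac{MN}{n}}_{\text{Shallow}}$$
[Figure: linear network diagram mapping $X_i$ through $V$ and then $U$ to $Y_i$]
Non-convexity is essential. → sparsity.
11 / 50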
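To make the parameter counts on this slide concrete, here is a minimal sketch comparing an unconstrained least squares fit (MN free parameters) with a rank-r fit (r(M+N) parameters). It uses the classical reduced rank regression solution, projecting the least squares fit onto the top-r right singular vectors of the fitted values. This is a stand-in for the talk's "deep learning" estimator $\hat{U}\hat{V}x$, not the author's procedure. The sizes, noise level, seed, and function names are all assumptions.

```python
import numpy as np

def ols(X, Y):
    """Unconstrained least squares. X: (n, N), Y: (n, M); returns B (N, M) with Y ~ X @ B."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B

def reduced_rank(X, Y, r):
    """Rank-r fit: project the OLS solution onto the top-r right singular
    vectors of the fitted values X @ B_ols."""
    B = ols(X, Y)
    _, _, Vt = np.linalg.svd(X @ B, full_matrices=False)
    Q = Vt[:r].T                      # (M, r) orthonormal columns
    return B @ Q @ Q.T

# Hypothetical problem sizes, chosen only for illustration.
rng = np.random.default_rng(0)
M, N, r, n = 20, 30, 2, 200
U = rng.normal(size=(M, r))
V = rng.normal(size=(r, N))
X = rng.normal(size=(n, N))
Y = X @ (U @ V).T + 0.5 * rng.normal(size=(n, M))   # rows are Y_i = U V X_i + xi_i

B_true = (U @ V).T
B_full = ols(X, Y)
B_low = reduced_rank(X, Y, r)

params_low = r * (M + N)    # parameters of the factorized model
params_full = M * N         # parameters of the unconstrained model
err_full = np.sum((B_full - B_true) ** 2)
err_low = np.sum((B_low - B_true) ** 2)
```

With these sizes the factorized model has 100 parameters against 600 for the unconstrained one. One would expect `err_low` to be noticeably smaller than `err_full` in a setting like this, in line with the $r(M+N)/n$ versus $MN/n$ comparison, though the exact values depend on the random draw.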
Nonlinear regression problem
Nonlinear regression problem:
$$y_i = f^o(x_i) + \xi_i \quad (i = 1, \dots, n),$$
where $\xi_i \sim N(0, \sigma^2)$, and $x_i \sim P_X([0,1]^d)$ (i.i.d.).
We want to estimate $f^o$ from data $(x_i, y_i)_{i=1}^{n}$.
Least squares estimator:
$$\hat{f} = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$$
where $\mathcal{F}$ is a neural network model.
12 / 50
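The least squares estimator above is an empirical risk minimizer. The sketch below simulates the regression model and picks the minimizer over a small finite candidate set, which stands in for the neural network class $\mathcal{F}$ purely for illustration. The true function, the uniform input distribution, the candidates, and the seed are assumptions, not part of the talk.

```python
import numpy as np

def empirical_risk(f, x, y):
    """(1/n) * sum_i (y_i - f(x_i))^2"""
    return float(np.mean((y - f(x)) ** 2))

def least_squares_estimator(candidates, x, y):
    """Index of the candidate with the smallest empirical risk, plus all risks."""
    risks = [empirical_risk(f, x, y) for f in candidates]
    return int(np.argmin(risks)), risks

# Hypothetical true function on [0, 1]^2.
def f_true(x):
    return np.sin(2.0 * np.pi * x[:, 0]) * x[:, 1]

rng = np.random.default_rng(0)
n, d, sigma = 200, 2, 0.1
x = rng.uniform(size=(n, d))                    # x_i drawn i.i.d. on [0, 1]^d
y = f_true(x) + sigma * rng.normal(size=n)      # y_i = f_o(x_i) + xi_i

candidates = [
    f_true,                                     # the truth itself
    lambda z: f_true(z) + 1.0,                  # shifted by a constant
    lambda z: np.zeros(len(z)),                 # the zero function
]
best, risks = least_squares_estimator(candidates, x, y)
```

Even for the true function the empirical risk does not vanish: it concentrates around the noise variance $\sigma^2$.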
Bias and variance trade-off
[Figure: schematic of the true function, the model, and the estimator, with the approximation error (bias) and the sample deviation (variance) marked]
$$\underbrace{\|f^o - \hat{f}\|_{L^2(P)}}_{\text{Estimation error}} \le \underbrace{\|f^o - \check{f}\|_{L^2(P)}}_{\text{Approximation error (bias)}} + \underbrace{\|\check{f} - \hat{f}\|_{L^2(P)}}_{\text{Sample deviation (variance)}}$$
Large model: small approximation error, large sample deviation
Small model: large approximation error, small sample deviation
→ Bias and variance trade-off
13 / 50
Agenda of this talk
Deep learning can make use of sparsity.
Appropriate function class with non-convexity:
  Q: A typical setting is Hölder space. Can we generalize it?
  A: Besov space and mixed-smooth Besov space (tensor product space)
Curse of dimensionality:
  Q: Deep learning can suffer from curse of dimensionality. Can we ease the effect of dimensionality under a suitable condition?
  A: Yes, if the true function is included in mixed-smooth Besov space.
15 / 50
Minimax optimal framework
What is a "good" estimator?
Minimax optimal rate:
$$\inf_{\hat{f}:\,\text{estimator}} \ \sup_{f^o \in \mathcal{F}} \mathrm{E}\big[\|\hat{f} - f^o\|_{L^2(P)}^2\big] \le n^{-?}$$
→ If an estimator $\hat{f}$ achieves the minimax optimal rate, then it can be regarded as a "good" estimator.
What kind of $\mathcal{F}$ should we consider?
17 / 50
Hölder, Sobolev, Besov
$\Omega = [0,1]^d \subset \mathbb{R}^d$
Hölder space ($C^\beta(\Omega)$)
$$\|f\|_{C^\beta} = \max_{|\alpha| \le m} \|\partial^\alpha f\|_\infty + \max_{|\alpha| = m} \sup_{x, y \in \Omega} \frac{|\partial^\alpha f(x) - \partial^\alpha f(y)|}{|x - y|^{\beta - m}}$$
Sobolev space ($W_p^k(\Omega)$)
$$\|f\|_{W_p^k} = \Big( \sum_{|\alpha| \le k} \|D^\alpha f\|_{L^p(\Omega)}^p \Big)^{1/p}$$
Besov space ($B_{p,q}^s(\Omega)$) ($0 < p, q \le \infty$, $0 < s \le m$)
$$\omega_m(f, t)_p := \sup_{\|h\| \le t} \Big\| \sum_{j=0}^{m} \binom{m}{j} (-1)^{m-j} f(\cdot + jh) \Big\|_{L^p(\Omega)},$$
$$\|f\|_{B_{p,q}^s(\Omega)} = \|f\|_{L^p(\Omega)} + \Big( \int_0^\infty \big[ t^{-s} \omega_m(f, t)_p \big]^q \frac{dt}{t} \Big)^{1/q}.$$
18 / 50
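The modulus of smoothness $\omega_m(f, t)_p$ in the Besov norm can be approximated numerically in one dimension. The sketch below is a grid-based approximation under stated assumptions: $f$ is sampled on a uniform grid over $[0, 1]$, the m-th order finite difference sums over $j = 0, \dots, m$ (the standard convention), only points where all shifted arguments stay inside the domain are used, and the $L^p$ norm is replaced by a Riemann sum. The function names are hypothetical.

```python
import numpy as np
from math import comb

def finite_difference(f_vals, m, shift):
    """m-th order forward difference sum_{j=0}^m C(m,j) (-1)^(m-j) f(x + j*h),
    with h equal to `shift` grid steps, at points where x + m*h stays in the grid."""
    valid = len(f_vals) - m * shift
    out = np.zeros(valid)
    for j in range(m + 1):
        out += comb(m, j) * (-1) ** (m - j) * f_vals[j * shift : j * shift + valid]
    return out

def modulus_of_smoothness(f_vals, dx, m, t, p):
    """Grid approximation of omega_m(f, t)_p: the largest discrete L^p norm
    of the m-th order difference over step sizes 0 < h <= t."""
    max_shift = int(np.floor(t / dx + 1e-9))
    best = 0.0
    for shift in range(1, max_shift + 1):
        if len(f_vals) - m * shift <= 0:
            break
        diff = finite_difference(f_vals, m, shift)
        norm = (np.sum(np.abs(diff) ** p) * dx) ** (1.0 / p)
        best = max(best, norm)
    return best

x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
step = (x >= 0.5).astype(float)   # discontinuous function with a unit jump
linear = x.copy()                 # smooth function

omega_step = modulus_of_smoothness(step, dx, m=1, t=0.1, p=1)
omega_linear = modulus_of_smoothness(linear, dx, m=2, t=0.1, p=1)
```

For the unit step with $p = 1$ and $m = 1$, the difference $f(x + h) - f(x)$ is nonzero only on an interval of length $h$, so the modulus scales like $t$ even though the function is discontinuous. The second-order modulus of a linear function is zero up to rounding.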
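Relation between the spaces
Suppose $\Omega = [0,1]^d \subset \mathbb{R}^d$.
For $m \in \mathbb{N}$,
$$B_{p,1}^m \hookrightarrow W_p^m \hookrightarrow B_{p,\infty}^m, \qquad B_{2,2}^m = W_2^m.$$
For $0 < s < \infty$ and $s \notin \mathbb{N}$,
$$C^s = B_{\infty,\infty}^s.$$
19 / 50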
Continuous regime: $s > d/p$
$$B_{p,q}^s \hookrightarrow C^0$$
$L^r$-integrability: $s > d(1/p - 1/r)_+$
$$B_{p,q}^s \hookrightarrow L^r$$
(If $d/p \ge s$, the elements are not necessarily continuous.)
[Figure: the $s$ axis from 0 to ∞, with regions labeled "Continuous" and "Discontinuous"]
Example: $B_{1,1}^1([0,1]) \subset \{\text{bounded total variation}\} \subset B_{1,\infty}^1([0,1])$
20 / 50
Properties of Besov space
[Figure: example function with regions labeled "rough" and "smooth"]
Discontinuity: $d/p > s$
Spatial inhomogeneity of smoothness: small $p$
Question: Can deep learning capture these properties?
21 / 50
Connection to sparsity
[Figure: plot with curves labeled p = 0.5, p = 1, p = 2; both axes range from -1 to 1]
Multiresolution expansion
$$f = \sum_{k \in \mathbb{N}_+} \sum_{j \in J(k)} \alpha_{k,j}\, \psi(2^k x - j),$$
$$\|f\|_{B_{p,q}^s} \simeq \Big[ \sum_{k=0}^{\infty} \Big\{ 2^{sk} \Big( 2^{-kd} \sum_{j \in J(k)} |\alpha_{k,j}|^p \Big)^{1/p} \Big\}^q \Big]^{1/q}$$
Sparse coefficients → spatial inhomogeneity of smoothness
22 / 50
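The sequence-norm equivalence above shows why small $p$ rewards sparse coefficients. The sketch below evaluates the right-hand side for finite $p$ and $q$ and compares a dense and a sparse coefficient pattern at one resolution level. The function name, the list-of-arrays layout for $\alpha_{k,j}$, and the example coefficients are assumptions made for illustration.

```python
import numpy as np

def besov_sequence_norm(coeffs, s, p, q, d=1):
    """Evaluate [ sum_k { 2^(s*k) * (2^(-k*d) * sum_j |alpha_kj|^p)^(1/p) }^q ]^(1/q)
    for finite p, q > 0. coeffs[k] holds the coefficients alpha_{k,j} at level k."""
    level_terms = []
    for k, alpha in enumerate(coeffs):
        alpha = np.asarray(alpha, dtype=float)
        inner = (2.0 ** (-k * d) * np.sum(np.abs(alpha) ** p)) ** (1.0 / p)
        level_terms.append(2.0 ** (s * k) * inner)
    return float(np.sum(np.array(level_terms) ** q) ** (1.0 / q))

# Two coefficient patterns at level k = 2 with the same squared l2 norm (= 1).
dense = [[0.0], [0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
sparse = [[0.0], [0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

norms = {
    p: (besov_sequence_norm(dense, s=1.0, p=p, q=p),
        besov_sequence_norm(sparse, s=1.0, p=p, q=p))
    for p in (2.0, 1.0, 0.5)
}
```

At $p = 2$ the two patterns have the same norm, while for $p = 1$ and $p = 0.5$ the sparse pattern has a strictly smaller norm. A function whose energy is concentrated in a few coefficients (smooth in most places, rough in a few) therefore stays inside a small Besov ball when $p$ is small, which is the link to spatial inhomogeneity of smoothness.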
Deep learning model
$$f(x) = (W^{(L)} \eta(\cdot) + b^{(L)}) \circ (W^{(L-1)} \eta(\cdot) + b^{(L-1)}) \circ \cdots \circ (W^{(1)} x + b^{(1)})$$
$\mathcal{F}(L, W, S, B)$: deep networks with depth $L$, width $W$, sparsity $S$, norm bound $B$.
$\eta$ is ReLU activation: $\eta(u) = \max\{u, 0\}$ (currently most popular).
23 / 50
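The composition above and the four quantities indexing $\mathcal{F}(L, W, S, B)$ can be illustrated with a small sketch. The way the quantities are measured here is one plausible reading of the slide and is an assumption: depth is the number of affine layers, width is the largest layer dimension, sparsity is the number of nonzero weights and biases, and the norm bound is the largest absolute parameter value. The function names and the example network are hypothetical.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def forward(layers, x):
    """f(x) = (W_L relu(.) + b_L) o ... o (W_1 x + b_1).
    `layers` is a list of (W, b) pairs, first layer first."""
    W, b = layers[0]
    h = W @ x + b                      # first layer acts on x directly
    for W, b in layers[1:]:
        h = W @ relu(h) + b            # later layers apply ReLU to their input
    return h

def complexity(layers):
    """Return (depth L, width W, sparsity S, norm bound B) under the reading above."""
    depth = len(layers)
    width = max(max(W.shape) for W, _ in layers)
    sparsity = sum(int(np.count_nonzero(W)) + int(np.count_nonzero(b)) for W, b in layers)
    bound = max(max(float(np.abs(W).max()), float(np.abs(b).max())) for W, b in layers)
    return depth, width, sparsity, bound

# Example: |x| = relu(x) + relu(-x) written as a depth-2 ReLU network.
abs_net = [
    (np.array([[1.0], [-1.0]]), np.zeros(2)),
    (np.array([[1.0, 1.0]]), np.zeros(1)),
]
value = forward(abs_net, np.array([-3.0]))
L, W_width, S, B = complexity(abs_net)
```

Here the network computes $|x|$ with 4 nonzero parameters. Constraining $S$ while allowing the nonzero pattern to vary is what makes the model class non-convex, the property the talk ties to sparsity.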