
Adaptivity of deep ReLU network and its generalization error analysis - PowerPoint PPT Presentation



  1. Adaptivity of deep ReLU network and its generalization error analysis. Taiji Suzuki†‡ (†The University of Tokyo, Department of Mathematical Informatics; ‡AIP-RIKEN). 22nd/Feb/2019, The 2nd Korea-Japan Machine Learning Workshop. 1 / 50

  2. Deep learning model: f(x) = η(W_L η(W_{L−1} ··· W_2 η(W_1 x + b_1) + b_2 ···)). High performance learning system. Many applications: Deepmind, Google, Facebook, Open AI, Baidu, ... 2 / 50

  3. Deep learning model: f(x) = η(W_L η(W_{L−1} ··· W_2 η(W_1 x + b_1) + b_2 ···)). High performance learning system. Many applications: Deepmind, Google, Facebook, Open AI, Baidu, ... We need theories. 2 / 50
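Not part of the slides: a minimal NumPy sketch of the model above. The layer sizes, random weights, and function names are illustrative choices of this note, not the talk's.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def deep_relu_net(x, weights, biases):
    # f(x) = eta(W_L eta(W_{L-1} ... W_2 eta(W_1 x + b_1) + b_2 ...)), eta = ReLU
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# Toy instantiation: depth 3, input dimension 4, hidden width 8, scalar output.
rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]
print(deep_relu_net(rng.standard_normal(4), weights, biases))
```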

  4. Outline of this talk Why does deep learning perform so well? “Adaptivity” of deep neural network: Adaptivity to the shape of the target function. Adaptivity to the dimensionality of the input data. → sparsity, non-convexity 3 / 50

  5. Outline of this talk. Why does deep learning perform so well? “Adaptivity” of deep neural network: adaptivity to the shape of the target function; adaptivity to the dimensionality of the input data. → sparsity, non-convexity. Approach: estimation error analysis on a Besov space (spatial inhomogeneity of smoothness; avoiding the curse of dimensionality). It will be shown that any linear estimator, such as a kernel method, is outperformed by DL. 3 / 50

  6. Reference: Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. ICLR 2019, to appear (arXiv:1810.08033). 4 / 50

  7. Outline
     1 Literature overview
     2 Approximating and estimating functions in Besov space and related spaces
       - Deep NN representation for Besov space
       - Function class with more explicit sparsity
       - Deep NN representation for “mixed smooth” Besov space
     5 / 50

  8. Universal approximator. Two layer neural network: f(x) = Σ_{j=1}^m v_j η(w_j^⊤ x + b_j). As m → ∞, the two layer network can approximate an arbitrary function with an arbitrary precision: f̂(x) = Σ_{j=1}^m v_j η(w_j^⊤ x + b_j) ≃ f_o(x) = ∫ h_o(w, b) η(w^⊤ x + b) dw db (Sonoda & Murata, 2015).
     Year  Authors               Basis function        Space
     1987  Hecht-Nielsen         –                     C(R^d)
     1988  Gallant & White       cos                   L_2(K)
     1988  Irie & Miyake         integrable            L_2(R^d)
     1989  Carroll & Dickinson   continuous sigmoidal  L_2(K)
     1989  Cybenko               continuous sigmoidal  C(K)
     1989  Funahashi             monotone & bounded    C(K)
     1993  Mhaskar & Micchelli   polynomial growth     C(K)
     2015  Sonoda & Murata       admissible            L_1, L_2
     6 / 50

  9. Universal approximator. Two layer neural network: f(x) = Σ_{j=1}^m v_j η(w_j^⊤ x + b_j). As m → ∞, the two layer network can approximate an arbitrary function with an arbitrary precision: f̂(x) = Σ_{j=1}^m v_j η(w_j^⊤ x + b_j) ≃ f_o(x) = ∫ h_o(w, b) η(w^⊤ x + b) dw db (Sonoda & Murata, 2015). Activation functions: ReLU: η(u) = max{u, 0}; Sigmoid: η(u) = 1/(1 + exp(−u)). 6 / 50
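Not from the slides: a small numerical sketch of the universal-approximation statement for the two-layer model f(x) = Σ_j v_j η(w_j^⊤ x + b_j). The target sin(2πx), the random draw of (w_j, b_j), and the least-squares fit of the v_j are assumptions of this sketch; the point is only that the grid error shrinks as the width m grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 512)
target = np.sin(2 * np.pi * x)                        # illustrative target function

def two_layer_grid_error(m):
    w = rng.uniform(-10.0, 10.0, size=m)              # inner weights w_j (random)
    b = rng.uniform(-10.0, 10.0, size=m)              # biases b_j (random)
    Phi = np.maximum(np.outer(x, w) + b, 0.0)         # ReLU features eta(w_j x + b_j)
    v, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # least squares for the v_j
    return np.max(np.abs(Phi @ v - target))           # sup error on the grid

for m in [5, 20, 100, 500]:
    print(f"m = {m:4d}, grid sup-error = {two_layer_grid_error(m):.4f}")
```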

  10. Expressive power of deep neural network:
     - Combinatorics / hyperplane arrangements (Montufar et al., 2014): number of linear regions (ReLU)
     - Polynomial expansions, tensor analysis (Cohen et al., 2016; Cohen & Shashua, 2016): number of monomials (sum-product)
     - Algebraic topology (Bianchini & Scarselli, 2014): Betti numbers (Pfaffian)
     - Riemannian geometry + dynamic mean field theory (Poole et al., 2016): extrinsic curvature
     Deep neural network has exponentially large expressive power with respect to the number of layers. 7 / 50

  11. Depth separation between 2 and 3 layers. A 2 layer NN is already a universal approximator; when is a deeper network useful? There is a function f_o(x) = g(∥x∥²) = g(x_1² + ··· + x_d²) that can be approximated much better by a 3 layer NN than by a 2 layer NN (c.f. Eldan and Shamir (2016)). With d_x the dimension of the input x: 3 layers: O(poly(d_x, ε^{−1})) internal nodes are sufficient; 2 layers: at least Ω(1/ε^{d_x}) internal nodes are required. → DL can avoid the curse of dimensionality. 8 / 50

  12. Non-smooth function. For estimating a non-smooth function, deep is better (Imaizumi & Fukumizu, 2018): f_o(x) = Σ_{k=1}^K 1_{R_k}(x) h_k(x), where R_k is a region with smooth boundary and h_k is a smooth function. 9 / 50
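Not from the slides: a toy instance of the piecewise-smooth target f_o(x) = Σ_k 1_{R_k}(x) h_k(x). The two regions, the sinusoidal boundary, and the smooth pieces h_1, h_2 below are illustrative choices of this note.

```python
import numpy as np

def f_o(x):
    """Piecewise-smooth target on [0, 1]^2: two regions split by a smooth boundary."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    in_R1 = x[:, 1] > 0.5 + 0.2 * np.sin(2 * np.pi * x[:, 0])   # smooth boundary curve
    h1 = np.sin(3.0 * x[:, 0]) + x[:, 1] ** 2                   # smooth piece on R_1
    h2 = np.cos(5.0 * x[:, 1]) - x[:, 0]                        # smooth piece on R_2
    return np.where(in_R1, h1, h2)                              # jump across the boundary

print(f_o([[0.2, 0.9], [0.2, 0.1]]))   # same x_1, opposite sides of the boundary
```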

  13. Depth separation What makes difference between deep and shallow methods? 10 / 50

  14. Depth separation What makes difference between deep and shallow methods? → Non-convexity of the model (sparseness) 10 / 50

  15. Easy example: linear activation. Reduced rank regression: Y_i = U V X_i + ξ_i (i = 1, ..., n), where U ∈ R^{M×r}, V ∈ R^{r×N} (r ≪ M, N), and Y_i ∈ R^M, X_i ∈ R^N. Linear (shallow) estimator f̂(x) = Σ_{i=1}^n Y_i φ_i(X_1, ..., X_n, x): rate MN/n. Deep learning f̂(x) = Û V̂ x: rate r(M+N)/n ≪ MN/n. Non-convexity is essential → sparsity. 11 / 50
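Not from the slides: a numerical sketch of this reduced rank regression example. The rank-constrained estimator below is a simple SVD truncation of the least-squares solution (a stand-in for the non-convex low-rank fit, not the talk's exact estimator), and all dimensions and noise levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, r, n = 30, 30, 2, 200
U = rng.standard_normal((M, r))
V = rng.standard_normal((r, N))
X = rng.standard_normal((n, N))
Y = X @ (U @ V).T + 0.5 * rng.standard_normal((n, M))   # Y_i = U V X_i + xi_i

B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)           # (N, M): unconstrained linear fit
Ps, s, Qt = np.linalg.svd(B_ols, full_matrices=False)
B_rank_r = (Ps[:, :r] * s[:r]) @ Qt[:r, :]              # keep only the top-r directions

truth = (U @ V).T
print("full-rank (shallow) error:", np.linalg.norm(B_ols - truth))
print("rank-r   ('deep')  error:", np.linalg.norm(B_rank_r - truth))
```

Forcing rank r discards most of the noise directions, which informally mirrors the r(M+N)/n versus MN/n comparison on the slide.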

  16. Nonlinear regression problem: y_i = f_o(x_i) + ξ_i (i = 1, ..., n), where ξ_i ∼ N(0, σ²) and x_i ∼ P_X([0, 1]^d) (i.i.d.). We want to estimate f_o from data (x_i, y_i)_{i=1}^n. Least squares estimator: f̂ = argmin_{f ∈ F} (1/n) Σ_{i=1}^n (y_i − f(x_i))², where F is a neural network model. 12 / 50
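Not from the slides: a sketch of this regression setting with a least-squares estimator over a simple network-type class (a ReLU network with a frozen random first layer, so the empirical risk minimization is convex). The target f_o, σ, d, and the class F are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, width = 500, 2, 0.3, 50

def f_o(x):                                          # illustrative true regression function
    return np.sin(2 * np.pi * x[:, 0]) * x[:, 1]

X = rng.uniform(0.0, 1.0, size=(n, d))               # x_i ~ P_X([0, 1]^d), i.i.d.
y = f_o(X) + sigma * rng.standard_normal(n)          # y_i = f_o(x_i) + xi_i

W = rng.standard_normal((width, d))
b = rng.standard_normal(width)

def features(x):                                     # fixed random ReLU feature map
    return np.maximum(x @ W.T + b, 0.0)

v, *_ = np.linalg.lstsq(features(X), y, rcond=None)  # least-squares estimator over F

X_test = rng.uniform(0.0, 1.0, size=(5000, d))       # Monte Carlo estimate of the L2(P) error
print(np.sqrt(np.mean((features(X_test) @ v - f_o(X_test)) ** 2)))
```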

  17. Bias and variance trade-off: ∥f_o − f̂∥_{L_2(P)} ≤ ∥f_o − f̌∥_{L_2(P)} + ∥f̌ − f̂∥_{L_2(P)}, i.e. estimation error ≤ approximation error (bias) + sample deviation (variance), where f̌ is the best approximation of the true f_o within the model. Large model: small approximation error, large sample deviation. Small model: large approximation error, small sample deviation. → Bias and variance trade-off. 13 / 50

  18. Outline
     1 Literature overview
     2 Approximating and estimating functions in Besov space and related spaces
       - Deep NN representation for Besov space
       - Function class with more explicit sparsity
       - Deep NN representation for “mixed smooth” Besov space
     14 / 50

  19. Agenda of this talk. Deep learning can make use of sparsity. Appropriate function class with non-convexity: Q: A typical setting is Hölder space. Can we generalize it? A: Besov space and mixed-smooth Besov space (tensor product space). Curse of dimensionality: Q: Deep learning can suffer from the curse of dimensionality. Can we ease the effect of dimensionality under a suitable condition? A: Yes, if the true function is included in the mixed-smooth Besov space. 15 / 50

  20. Outline
     1 Literature overview
     2 Approximating and estimating functions in Besov space and related spaces
       - Deep NN representation for Besov space
       - Function class with more explicit sparsity
       - Deep NN representation for “mixed smooth” Besov space
     16 / 50

  21. Minimax optimal framework. What is a “good” estimator? Minimax optimal rate: inf_{f̂: estimator} sup_{f_o ∈ F} E[∥f̂ − f_o∥²_{L_2(P)}] ≤ n^{−?}. → If an estimator f̂ achieves the minimax optimal rate, then it can be seen as a “good” estimator. What kind of F do we consider? 17 / 50
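Not on this slide, but for context on the "n^{−?}" question: over a Besov ball the classical nonparametric answer is, up to logarithmic factors and under standard conditions on (s, p, q, d),

```latex
% Minimax rate over a Besov ball (classical result; conditions such as
% s > d(1/p - 1/2)_+ are assumed so that the dense-zone rate applies).
\inf_{\hat f}\ \sup_{f_o \in \{f :\, \|f\|_{B^s_{p,q}([0,1]^d)} \le 1\}}
  \mathbb{E}\big[\|\hat f - f_o\|_{L_2(P)}^2\big]\ \asymp\ n^{-\frac{2s}{2s+d}} .
```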

  22. Hölder, Sobolev, Besov. Let Ω = [0, 1]^d ⊂ R^d.
     Hölder space C^β(Ω) (β = m + γ, m ∈ N_0, 0 < γ ≤ 1):
       ∥f∥_{C^β} = max_{|α| ≤ m} ∥∂^α f∥_∞ + max_{|α| = m} sup_{x, y ∈ Ω, x ≠ y} |∂^α f(x) − ∂^α f(y)| / |x − y|^{β − m}.
     Sobolev space W^k_p(Ω):
       ∥f∥_{W^k_p} = (Σ_{|α| ≤ k} ∥D^α f∥^p_{L_p(Ω)})^{1/p}.
     Besov space B^s_{p,q}(Ω) (0 < p, q ≤ ∞, 0 < s ≤ m):
       ω_m(f, t)_p := sup_{∥h∥ ≤ t} ∥ Σ_{j=0}^m (m choose j) (−1)^{m−j} f(· + jh) ∥_{L_p(Ω)},
       ∥f∥_{B^s_{p,q}(Ω)} = ∥f∥_{L_p(Ω)} + (∫_0^∞ [t^{−s} ω_m(f, t)_p]^q dt/t)^{1/q}.
     18 / 50
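Not from the slides: a small numerical sketch of the modulus of smoothness ω_m(f, t)_p used in the Besov norm, approximated on a grid of [0, 1]. The test function |x − 1/2| and the choices m = 2, p = 1 are assumptions of this note; for this example the printed values decay roughly like t².

```python
import numpy as np
from math import comb

def modulus_of_smoothness(f_vals, grid, m, p, t):
    """Approximate omega_m(f, t)_p = sup_{0 < h <= t} || Delta_h^m f ||_{L_p} on a grid."""
    dx = grid[1] - grid[0]
    best = 0.0
    for k in range(1, int(t / dx) + 1):              # step h = k * dx
        L = len(grid) - m * k                        # points where x + m*h stays inside [0, 1]
        if L <= 0:
            break
        diff = np.zeros(L)                           # Delta_h^m f(x) = sum_j C(m,j)(-1)^(m-j) f(x + j h)
        for j in range(m + 1):
            diff += comb(m, j) * (-1) ** (m - j) * f_vals[j * k : j * k + L]
        best = max(best, (np.sum(np.abs(diff) ** p) * dx) ** (1.0 / p))
    return best

grid = np.linspace(0.0, 1.0, 2001)
f_vals = np.abs(grid - 0.5)                          # smooth except for one kink at x = 1/2
for t in [0.2, 0.1, 0.05, 0.025]:
    print(t, modulus_of_smoothness(f_vals, grid, m=2, p=1, t=t))
```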

  23. Relation between the spaces. Suppose Ω = [0, 1]^d ⊂ R^d. For m ∈ N: B^m_{p,1} ↪ W^m_p ↪ B^m_{p,∞}, and B^m_{2,2} = W^m_2. For 0 < s < ∞ with s ∉ N: C^s = B^s_{∞,∞}. 19 / 50

  24. Continuous regime: s > d/p ⇒ B^s_{p,q} ↪ C^0. L_r-integrability: s > d(1/p − 1/r)_+ ⇒ B^s_{p,q} ↪ L_r. (If d/p ≥ s, the elements are not necessarily continuous.) [Figure: continuous vs. discontinuous regimes along the smoothness axis s.] Example: B^1_{1,1}([0, 1]) ⊂ {bounded total variation} ⊂ B^1_{1,∞}([0, 1]). 20 / 50

  25. Properties of Besov space: discontinuity (d/p > s); spatial inhomogeneity of smoothness (small p). [Figure: a function that is rough in some regions and smooth in others.] Question: Can deep learning capture these properties? 21 / 50

  26. Connection to sparsity. [Figure: curves for p = 0.5, 1, 2 on axes from −1 to 1.] Multiresolution expansion: f = Σ_{k=0}^∞ Σ_{j ∈ J(k)} α_{k,j} ψ(2^k x − j), with ∥f∥_{B^s_{p,q}} ≃ (Σ_{k=0}^∞ {2^{sk} (2^{−kd} Σ_{j ∈ J(k)} |α_{k,j}|^p)^{1/p}}^q)^{1/q}. Sparse coefficients → spatial inhomogeneity of smoothness. 22 / 50
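Not from the slides: the sequence-space expression above evaluated on synthetic coefficients, to illustrate the sparsity remark. The coefficient arrays and the parameters (s = 1, p = 0.5, q = 1, d = 1) are illustrative; with the same per-level energy, coefficients concentrated on a few locations give a smaller B^s_{p,q} norm when p is small.

```python
import numpy as np

def besov_seq_norm(alphas, s, p, q, d=1):
    """( sum_k [ 2^{s k} (2^{-k d} sum_j |alpha_{k,j}|^p)^{1/p} ]^q )^{1/q}."""
    total = 0.0
    for k, a in enumerate(alphas):
        lp = (2.0 ** (-k * d) * np.sum(np.abs(a) ** p)) ** (1.0 / p)
        total += (2.0 ** (s * k) * lp) ** q
    return total ** (1.0 / q)

rng = np.random.default_rng(0)
K = 10
dense = [rng.uniform(0.1, 1.0, 2 ** k) * 2.0 ** (-1.5 * k) for k in range(K)]
sparse = []
for k in range(K):                                   # one big coefficient per level,
    a = np.zeros(2 ** k)                             # same l2 energy as the dense level
    a[rng.integers(2 ** k)] = np.linalg.norm(dense[k])
    sparse.append(a)

for name, coef in [("dense ", dense), ("sparse", sparse)]:
    print(name, besov_seq_norm(coef, s=1.0, p=0.5, q=1.0))
```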

  27. Deep learning model: f(x) = (W^(L) η(·) + b^(L)) ∘ (W^(L−1) η(·) + b^(L−1)) ∘ ··· ∘ (W^(1) x + b^(1)). F(L, W, S, B): deep networks with depth L, width W, sparsity S, norm bound B. η is the ReLU activation: η(u) = max{u, 0} (currently the most popular). 23 / 50
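Not from the slides: a sketch of one member of the class F(L, W, S, B), i.e. a depth-L, width-W ReLU network whose parameters are bounded by B in magnitude and of which only S are nonzero. The thresholding rule and all the numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, width, S, B, d = 4, 16, 200, 1.0, 3              # depth, width, sparsity, bound, input dim

dims = [d] + [width] * (L - 1) + [1]
layers = [np.clip(rng.standard_normal((dims[l + 1], dims[l] + 1)), -B, B)   # [W | b] per layer
          for l in range(L)]

# Enforce sparsity S: keep only the S largest-magnitude parameters across all layers.
flat = np.concatenate([p.ravel() for p in layers])
mask = np.zeros_like(flat)
mask[np.argsort(np.abs(flat))[-S:]] = 1.0
flat *= mask
pos, sparse_layers = 0, []
for p in layers:
    sparse_layers.append(flat[pos:pos + p.size].reshape(p.shape))
    pos += p.size

def f(x, layers):
    h = np.asarray(x, dtype=float)
    for l, Wb in enumerate(layers):
        h = Wb[:, :-1] @ h + Wb[:, -1]               # affine map W^(l) h + b^(l)
        if l < len(layers) - 1:
            h = np.maximum(h, 0.0)                   # ReLU on hidden layers
    return h

print("nonzero parameters:", int(mask.sum()), "of", flat.size)
print("f(x) =", f([0.2, -0.5, 0.7], sparse_layers))
```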
