SLIDE 1

Nonparametric regression using deep neural networks with ReLU activation function

Johannes Schmidt-Hieber, February 2018, Caltech

SLIDE 2

◮ Many impressive results in applications ...
◮ Lack of theoretical understanding ...

SLIDE 3

Algebraic definition of a deep net

Network architecture (L, p) consists of

◮ a positive integer L, called the number of hidden layers or depth
◮ a width vector p = (p_0, ..., p_{L+1}) ∈ N^{L+2}

A neural network with network architecture (L, p) is a function f : R^{p_0} → R^{p_{L+1}} of the form

  f(x) = W_{L+1} σ_{v_L} W_L σ_{v_{L−1}} · · · W_2 σ_{v_1} W_1 x.

Network parameters:

◮ W_i is a p_i × p_{i−1} weight matrix
◮ v_i ∈ R^{p_i} is a shift vector; σ_{v_i} applies the activation componentwise after subtracting v_i

Activation function:

◮ We study the ReLU activation function σ(x) = max(x, 0).
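A minimal NumPy sketch of this forward map (the function names and the toy architecture are my own; σ_v is taken to be the componentwise shifted ReLU σ_v(y) = σ(y − v)):

```python
import numpy as np

def relu(x):
    # ReLU activation sigma(x) = max(x, 0), applied componentwise
    return np.maximum(x, 0.0)

def forward(x, weights, shifts):
    """Evaluate f(x) = W_{L+1} sigma_{v_L} W_L ... sigma_{v_1} W_1 x.

    weights: [W_1, ..., W_{L+1}], W_i of shape (p_i, p_{i-1})
    shifts:  [v_1, ..., v_L],     v_i of shape (p_i,)
    """
    h = x
    for W, v in zip(weights[:-1], shifts):
        h = relu(W @ h - v)          # shifted ReLU layer sigma_{v_i} W_i
    return weights[-1] @ h           # final linear layer W_{L+1}

# toy architecture: L = 2 hidden layers, width vector p = (4, 3, 3, 2),
# all parameters drawn from [-1, 1] to respect the boundedness used later
rng = np.random.default_rng(0)
p = (4, 3, 3, 2)
Ws = [rng.uniform(-1, 1, size=(p[i + 1], p[i])) for i in range(len(p) - 1)]
vs = [rng.uniform(-1, 1, size=p[i + 1]) for i in range(len(p) - 2)]
print(forward(rng.normal(size=p[0]), Ws, vs))   # a point in R^{p_3} = R^2
```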

SLIDE 4

Equivalence to graphical representation

Figure: Representation as a directed graph of a network with two hidden layers (L = 2) and width vector p = (4, 3, 3, 2).

SLIDE 5

Characteristics of modern deep network architectures

◮ Networks are deep
  ◮ a version of ResNet has 152 hidden layers
  ◮ networks keep getting deeper
◮ The number of network parameters is larger than the sample size
  ◮ AlexNet uses 60 million parameters for 1.2 million training samples
◮ There is some sort of sparsity on the parameters
◮ ReLU activation function (σ(x) = max(x, 0))

SLIDE 6

The large parameter trick

◮ If we allow the network parameters to be arbitrarily large, then we can approximate the indicator function via x → σ(ax) − σ(ax − 1) by letting a → ∞
◮ it is common in approximation theory to use networks with network parameters tending to infinity
◮ in our analysis, we restrict all network parameters to be at most one in absolute value
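A quick numerical sketch of this trick (purely illustrative): the map x → σ(ax) − σ(ax − 1) is 0 for x ≤ 0, equals 1 for x ≥ 1/a, and ramps linearly in between, so it approaches the indicator of (0, ∞) as a grows.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def indicator_approx(x, a):
    # sigma(a x) - sigma(a x - 1): a piecewise-linear ramp of width 1/a
    return relu(a * x) - relu(a * x - 1.0)

xs = np.array([-0.5, 0.0, 0.001, 0.01, 0.5])
for a in (1.0, 10.0, 1000.0):
    print(f"a = {a:6.0f}:", indicator_approx(xs, a))
# as a grows, the output approaches the indicator 1{x > 0} at every fixed x != 0,
# which is exactly what restricting the parameters to [-1, 1] rules out
```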

SLIDE 7

Statistical analysis

◮ we want to study the statistical performance of a deep network
◮ we do nonparametric regression
◮ we observe n i.i.d. copies (X_1, Y_1), ..., (X_n, Y_n) with

  Y_i = f(X_i) + ε_i,   ε_i ∼ N(0, 1)

◮ X_i ∈ R^d, Y_i ∈ R
◮ the goal is to reconstruct the function f : R^d → R
◮ this problem has been studied extensively (kernel smoothing, wavelets, splines, ...)
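A minimal simulation of this model, to fix ideas (the particular regression function below is an arbitrary placeholder, not a function from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3

def f_true(x):
    # hypothetical regression function f : R^d -> R, for illustration only
    return np.sin(np.pi * x[0]) + x[1] * x[2]

X = rng.uniform(0.0, 1.0, size=(n, d))            # covariates X_i in R^d
eps = rng.normal(0.0, 1.0, size=n)                # noise eps_i ~ N(0, 1)
Y = np.array([f_true(x) for x in X]) + eps        # responses Y_i = f(X_i) + eps_i
```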

SLIDE 8

The estimator

◮ denote by F(L, p, s) the class of all networks with
  ◮ architecture (L, p)
  ◮ at most s active (i.e. non-zero) parameters
◮ choose a network architecture (L, p) and a sparsity level s
◮ least-squares estimator:

  f̂_n ∈ argmin_{f ∈ F(L, p, s)} Σ_{i=1}^{n} (Y_i − f(X_i))^2

◮ this is the global minimizer [not computable]
◮ prediction error:

  R(f̂_n, f) := E_f [ (f̂_n(X) − f(X))^2 ],   with X distributed as X_1 and independent of the sample

◮ we study the dependence of R(f̂_n, f) on n
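The global minimizer over F(L, p, s) is not computable in practice, but the objective and the sparsity constraint are easy to write down. A sketch with names of my own choosing:

```python
import numpy as np

def least_squares_objective(f, X, Y):
    # sum_{i=1}^n (Y_i - f(X_i))^2, the criterion minimized over F(L, p, s)
    preds = np.array([f(x) for x in X])
    return float(np.sum((Y - preds) ** 2))

def sparsity(weights, shifts):
    # number of active (non-zero) parameters s of a network given by (W_i, v_i)
    return sum(int(np.count_nonzero(a)) for a in list(weights) + list(shifts))
```

Minimizing the first quantity over all networks with the given architecture and sparsity at most s would yield the estimator f̂_n.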

SLIDE 9

Function class

◮ classical idea: assume that the regression function is β-smooth
◮ the optimal nonparametric estimation rate is n^(−2β/(2β+d))
◮ this suffers from the curse of dimensionality
◮ to understand deep learning this setting is therefore useless
◮ make a good structural assumption on f
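To see the curse of dimensionality concretely, here is the classical rate n^(−2β/(2β+d)) evaluated for a few input dimensions (illustrative numbers only):

```python
n, beta = 10_000, 2.0
for d in (1, 5, 10, 50):
    rate = n ** (-2 * beta / (2 * beta + d))
    print(f"d = {d:2d}: n^(-2*beta/(2*beta+d)) = {rate:.4f}")
# for d = 50 the bound is still of order one, even with n = 10,000 samples
```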

SLIDE 10

Hierarchical structure

◮ Important: only a few objects are combined at each deeper abstraction level
  ◮ few letters in one word
  ◮ few words in one sentence

SLIDE 11

Function class

◮ We assume that f = g_q ◦ ... ◦ g_0 with
  ◮ g_i : R^{d_i} → R^{d_{i+1}}
  ◮ each of the d_{i+1} components of g_i is β_i-smooth and depends only on t_i variables
◮ t_i can be much smaller than d_i
◮ we show that the rate depends on the pairs (t_i, β_i), i = 0, ..., q

SLIDE 12

Example

Example: Additive models

◮ In an additive model,

  f(x) = Σ_{i=1}^{d} f_i(x_i)

◮ This can be written as f = g_1 ◦ g_0 with

  g_0(x) = (f_i(x_i))_{i=1,...,d},   g_1(y) = Σ_{i=1}^{d} y_i.

  Hence t_0 = 1 and d_1 = t_1 = d.

◮ This decomposes additive functions into
  ◮ one function that can be non-smooth but whose components are all one-dimensional
  ◮ one function that has a high-dimensional input but is smooth
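A small sketch of the additive model written as the composition f = g_1 ◦ g_0 (the univariate components below are arbitrary placeholders):

```python
import numpy as np

# hypothetical univariate components f_1, ..., f_d
components = [np.sin, np.cos, np.tanh, np.square]
d = len(components)

def g0(x):
    # g_0 : R^d -> R^d, each output coordinate depends on t_0 = 1 variable
    return np.array([f_i(x_i) for f_i, x_i in zip(components, x)])

def g1(y):
    # g_1 : R^d -> R, the summation step: high-dimensional input but very smooth
    return float(np.sum(y))

def f(x):
    return g1(g0(x))   # f(x) = sum_i f_i(x_i)

x = np.array([0.1, 0.2, 0.3, 0.4])
assert np.isclose(f(x), sum(f_i(x_i) for f_i, x_i in zip(components, x)))
```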

SLIDE 13

The effective smoothness

For nonparametric regression with f = g_q ◦ ... ◦ g_0, define the effective smoothness

  β*_i := β_i Π_{ℓ=i+1}^{q} (β_ℓ ∧ 1).

β*_i is the smoothness induced on f by g_i.
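The effective smoothness is straightforward to compute; a short sketch (the example smoothness indices are made up):

```python
import numpy as np

def effective_smoothness(betas):
    # beta*_i = beta_i * prod_{l = i+1}^{q} min(beta_l, 1)
    betas = np.asarray(betas, dtype=float)
    return np.array([betas[i] * np.prod(np.minimum(betas[i + 1:], 1.0))
                     for i in range(len(betas))])

# example with q = 2 and (beta_0, beta_1, beta_2) = (1.5, 0.5, 3.0)
print(effective_smoothness([1.5, 0.5, 3.0]))   # -> [0.75, 0.5, 3.0]
```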

SLIDE 14

Main result

Theorem: If
(i) depth ≍ log n,
(ii) width ≍ n^C with C ≥ 1,
(iii) network sparsity ≍ max_{i=0,...,q} n^(t_i/(2β*_i + t_i)) log n,

then

  R(f̂_n, f) ≲ max_{i=0,...,q} n^(−2β*_i/(2β*_i + t_i)) log^2 n.
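Ignoring the log^2 n factor, the bound in the theorem is easy to evaluate for given pairs (t_i, β*_i); a sketch with made-up values:

```python
def rate_bound(n, betas_star, ts):
    # max_i n^(-2*beta*_i / (2*beta*_i + t_i)), the theorem's bound up to log^2 n
    return max(n ** (-2.0 * b / (2.0 * b + t)) for b, t in zip(betas_star, ts))

# hypothetical example: two layers with effective smoothness (2, 2) and
# effective dimensions (1, 4); the second pair is the bottleneck here
print(rate_bound(10_000, [2.0, 2.0], [1, 4]))   # = 10_000 ** -0.5 = 0.01
```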

SLIDE 15

Remarks on the rate

Rate: R(f̂_n, f) ≲ max_{i=0,...,q} n^(−2β*_i/(2β*_i + t_i)) log^2 n.

Remarks:

◮ t_i can be seen as an effective dimension
◮ there is a strong heuristic that this is the optimal rate (up to the log^2 n factor)
◮ other methods such as wavelets likely do not achieve these rates

SLIDE 16

Consequences

◮ the assumption that depth ≍ log n appears naturally
◮ in particular, the depth scales with the sample size
◮ the networks can have many more parameters than the sample size
◮ what matters for statistical performance is not the size of the network but the amount of regularization
◮ here, regularization means the number of active parameters

SLIDE 17

Consequences (ctd.)

A paradox:

◮ good rates for all smoothness indices
◮ existing piecewise linear methods only give good rates up to smoothness two
◮ here the non-linearity of the function class helps

Non-linearity is essential!

SLIDE 18

On the proof

◮ Oracle inequality (roughly):

  R(f̂_n, f) ≲ inf_{f* ∈ F(L, p, s, F)} ( ||f* − f||_∞^2 + s log n / n )

◮ this shows the trade-off between the approximation error and the number of active parameters s (a numerical sketch follows this list)
◮ Approximation theory:
  ◮ builds on work by Telgarsky (2016), Liang and Srikant (2016), Yarotsky (2017)
  ◮ network parameters bounded by one
  ◮ explicit bounds on network architecture and sparsity

SLIDE 19

Additive models (ctd.)

◮ Consider again the additive model

  f(x) = Σ_{i=1}^{d} f_i(x_i)

◮ suppose that each function f_i is β-smooth
◮ the theorem gives the rate

  R(f̂_n, f) ≲ n^(−2β/(2β+1)) log^2 n

◮ this rate is known to be optimal up to the log^2 n factor

The function class considered here contains other structural constraints, such as generalized additive models, as special cases, and it can be shown that the rates are optimal up to the log^2 n factor.
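Plugging the additive structure into the general rate, a quick check with made-up values of β and n:

```python
beta, n = 2.0, 100_000

# i = 0: each f_i is beta-smooth and depends on t_0 = 1 variable
# i = 1: the summation step is linear, hence arbitrarily smooth, and never dominates
rate = n ** (-2.0 * beta / (2.0 * beta + 1.0))
print(f"additive-model rate (up to log^2 n): n^(-2*beta/(2*beta+1)) = {rate:.1e}")
# note: the rate does not depend on the input dimension d
```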

SLIDE 20

Extensions

Several extensions would be useful. To name a few:

◮ high-dimensional input
◮ include stochastic gradient descent
◮ classification
◮ CNNs, recurrent neural networks, ...
