

  1. Deep Neural Networks and Partial Differential Equations: Approximation Theory and Structural Properties
     Philipp Christian Petersen

  2. Joint work with:
     ◮ Helmut Bölcskei (ETH Zürich)
     ◮ Philipp Grohs (University of Vienna)
     ◮ Joost Opschoor (ETH Zürich)
     ◮ Gitta Kutyniok (TU Berlin)
     ◮ Mones Raslan (TU Berlin)
     ◮ Christoph Schwab (ETH Zürich)
     ◮ Felix Voigtlaender (KU Eichstätt-Ingolstadt)

  3. Today’s Goal
     Goal of this talk: Discuss the suitability of neural networks as an ansatz system for the solution of PDEs. Two threads:
     Approximation theory:
     ◮ universal approximation
     ◮ optimal approximation rates for all classical function spaces
     ◮ reduced curse of dimension
     Structural properties:
     ◮ non-convex, non-closed ansatz spaces
     ◮ parametrization not stable
     ◮ very hard to optimize over
     [Figure: surface plots of example network realizations]

  4. Outline
     Neural networks
       Introduction to neural networks
       Approaches to solve PDEs
     Approximation theory of neural networks
       Classical results
       Optimality
       High-dimensional approximation
     Structural results
       Convexity
       Closedness
       Stable parametrization

  5. Neural networks
     We consider neural networks as a special kind of function:
     ◮ $d = N_0 \in \mathbb{N}$: input dimension,
     ◮ $L$: number of layers,
     ◮ $\varrho : \mathbb{R} \to \mathbb{R}$: activation function,
     ◮ $T_\ell : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $\ell = 1, \dots, L$: affine-linear maps.
     Then $\Phi_\varrho : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by
     $$\Phi_\varrho(x) = T_L(\varrho(T_{L-1}(\varrho(\cdots \varrho(T_1(x)))))), \quad x \in \mathbb{R}^d,$$
     is called a neural network (NN). The sequence $(d, N_1, \dots, N_L)$ is called the architecture of $\Phi_\varrho$.
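
To make the definition concrete, here is a minimal sketch in Python/NumPy (an illustration of ours, not from the slides): the `network` helper realizes $\Phi_\varrho$ as the alternating composition of affine maps and the activation, with ReLU as an example choice of $\varrho$ and random weights standing in for a trained network.

```python
import numpy as np

def relu(x):
    # one common choice of activation function rho
    return np.maximum(x, 0.0)

def network(x, layers, rho=relu):
    """Realize Phi_rho(x) = T_L(rho(T_{L-1}(... rho(T_1(x)) ...))).

    `layers` is a list of (A_l, b_l) pairs defining the affine maps
    T_l(x) = A_l x + b_l; the activation acts between layers but is
    not applied after the final map T_L.
    """
    for A, b in layers[:-1]:
        x = rho(A @ x + b)
    A, b = layers[-1]
    return A @ x + b

# example with architecture (d, N_1, N_2) = (2, 3, 1) and random weights
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
          (rng.standard_normal((1, 3)), rng.standard_normal(1))]
print(network(np.array([0.5, -1.0]), layers))
```

Note that, matching the definition, the activation is not applied after the final affine map $T_L$.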

  6. Why are neural networks interesting? - I
     Deep Learning: Deep learning describes a variety of techniques based on data-driven adaptation of the affine-linear maps in a neural network.
     Overwhelming success:
     ◮ Image classification [Ren, He, Girshick, Sun; 2015]
     ◮ Text understanding
     ◮ Game intelligence
     ⇒ Hardware design of the future!

  7. Why are neural networks interesting? - II
     Expressibility: Neural networks constitute a very powerful architecture.
     Theorem (Cybenko; 1989, Hornik; 1991, Pinkus; 1999). Let $d \in \mathbb{N}$, $K \subset \mathbb{R}^d$ compact, $f : K \to \mathbb{R}$ continuous, and $\varrho : \mathbb{R} \to \mathbb{R}$ continuous and not a polynomial. Then for every $\varepsilon > 0$ there exists a two-layer NN $\Phi_\varrho$ with $\|f - \Phi_\varrho\|_\infty \le \varepsilon$.
     Efficient expressibility: $\mathbb{R}^M \ni \theta \mapsto (T_1, \dots, T_L) \mapsto \Phi^\theta_\varrho$ yields a parametrized system of functions. In a sense, this parametrization is optimally efficient (more on this below).
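
The "efficient expressibility" point is the map $\theta \mapsto \Phi^\theta_\varrho$ itself. A hedged sketch of how a flat vector $\theta \in \mathbb{R}^M$ parametrizes the affine maps, reusing `np` and `network` from the block above (`unflatten` is a hypothetical helper name of ours):

```python
def unflatten(theta, arch):
    """Unpack a flat parameter vector theta into affine maps (A_l, b_l)
    for an architecture arch = (d, N_1, ..., N_L)."""
    layers, pos = [], 0
    for n_in, n_out in zip(arch[:-1], arch[1:]):
        A = theta[pos:pos + n_in * n_out].reshape(n_out, n_in)
        pos += n_in * n_out
        b = theta[pos:pos + n_out]
        pos += n_out
        layers.append((A, b))
    return layers

arch = (2, 3, 1)  # (d, N_1, N_L)
M = sum(i * o + o for i, o in zip(arch[:-1], arch[1:]))  # here M = 13
theta = np.random.default_rng(1).standard_normal(M)
print(network(np.array([0.5, -1.0]), unflatten(theta, arch)))
```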

  8. How can we apply NNs to solve PDEs?
     PDE problem: For $D \subset \mathbb{R}^d$, $d \in \mathbb{N}$, find $u$ such that
     $$G(x, u(x), \nabla u(x), \nabla^2 u(x)) = 0 \quad \text{for all } x \in D.$$
     Approach of [Lagaris, Likas, Fotiadis; 1998]: Let $(x_i)_{i \in I} \subset D$ and find a NN $\Phi^\theta_\varrho$ such that
     $$G(x_i, \Phi^\theta_\varrho(x_i), \nabla \Phi^\theta_\varrho(x_i), \nabla^2 \Phi^\theta_\varrho(x_i)) = 0 \quad \text{for all } i \in I.$$
     Standard methods can be used to find the parameters $\theta$.
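
A minimal sketch of this collocation idea in Python/PyTorch, for the 1D toy problem $-u''(x) = \pi^2 \sin(\pi x)$ on $(0,1)$ with zero boundary values, whose exact solution is $u(x) = \sin(\pi x)$. The network size, optimizer, and least-squares penalty form of the residual are our illustrative choices, not prescribed by [Lagaris, Likas, Fotiadis; 1998]:

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(  # a small Phi_theta : R -> R
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.linspace(0.0, 1.0, 100).reshape(-1, 1)  # collocation points x_i

for step in range(2000):
    xi = x.clone().requires_grad_(True)
    u = net(xi)
    # u' and u'' via automatic differentiation
    du = torch.autograd.grad(u.sum(), xi, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), xi, create_graph=True)[0]
    # residual of -u'' = pi^2 sin(pi x), plus a boundary penalty
    residual = -d2u - torch.pi**2 * torch.sin(torch.pi * xi)
    bc = net(torch.tensor([[0.0], [1.0]]))
    loss = residual.pow(2).mean() + bc.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float((net(x) - torch.sin(torch.pi * x)).abs().max()))  # sup error
```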

  9. Approaches to solve PDEs - Examples
     General framework: Deep Ritz Method [E, Yu; 2017]: NNs serve as trial functions; SGD naturally replaces quadrature.
     High-dimensional PDEs [Sirignano, Spiliopoulos; 2017]: Let $\Omega \subset \mathbb{R}^d$, $d \ge 100$, and find $u$ such that
     $$\frac{\partial u}{\partial t}(t, x) + H(u)(t, x) = 0, \quad (t, x) \in [0, T] \times \Omega,$$
     plus boundary and initial conditions. As the number of parameters of the NN increases, the minimizer of the associated energy approaches the true solution. No mesh generation required!
     [Berner, Grohs, Hornung, Jentzen, von Wurstemberger; 2017]: Phrasing the problem as empirical risk minimization provably avoids the curse of dimension, both in the approximation problem and in the number of samples.
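
To illustrate "SGD naturally replaces quadrature" in the Deep Ritz Method: a sketch (same toy problem $-u'' = f$, $f(x) = \pi^2\sin(\pi x)$, and `torch` import as above; all hyperparameters are ours) that minimizes the energy $\int_0^1 \tfrac{1}{2}|u'|^2 - f\,u \,\mathrm{d}x$, estimating the integral from a fresh random batch in each step instead of a fixed quadrature rule:

```python
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    # fresh random sample each step: the SGD batch doubles as quadrature
    xi = torch.rand(256, 1, requires_grad=True)
    u = net(xi)
    du = torch.autograd.grad(u.sum(), xi, create_graph=True)[0]
    f = torch.pi**2 * torch.sin(torch.pi * xi)
    energy = (0.5 * du.pow(2) - f * u).mean()  # Monte Carlo estimate
    bc = net(torch.tensor([[0.0], [1.0]]))     # boundary penalty
    loss = energy + 100.0 * bc.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```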

 10. How can we apply NNs to solve PDEs?
     Deep learning and PDEs: Both approaches above are based on two ideas.
     ◮ Neural networks are highly efficient in representing solutions of PDEs, hence the complexity of the problem can be greatly reduced.
     ◮ There exist black-box methods from machine learning that solve the optimization problem.
     This talk:
     ◮ We will show exactly how efficient the representations are.
     ◮ We will raise doubts that the black box can produce reliable results in general.

 11. Approximation theory of neural networks

 12. Complexity of neural networks
     Recall: $\Phi_\varrho(x) = T_L(\varrho(T_{L-1}(\varrho(\cdots \varrho(T_1(x)))))), \; x \in \mathbb{R}^d$.
     Each affine-linear map $T_\ell$ is defined by a matrix $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and a translation $b_\ell \in \mathbb{R}^{N_\ell}$ via $T_\ell(x) = A_\ell x + b_\ell$.
     The number of weights $W(\Phi_\varrho)$ and the number of neurons $N(\Phi_\varrho)$ are
     $$W(\Phi_\varrho) = \sum_{j \le L} \big(\|A_j\|_{\ell^0} + \|b_j\|_{\ell^0}\big) \quad \text{and} \quad N(\Phi_\varrho) = \sum_{j=0}^{L} N_j.$$
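
A small sketch of these two complexity measures, reusing the `(A, b)` layer representation and the `layers` example from the NumPy block above; counting nonzero entries plays the role of the $\|\cdot\|_{\ell^0}$ "norms":

```python
def complexity(layers, arch):
    """Number of weights W (nonzero entries of all A_l and b_l) and
    number of neurons N (sum of all layer widths N_0, ..., N_L)."""
    W = sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in layers)
    N = sum(arch)
    return W, N

print(complexity(layers, (2, 3, 1)))  # dense random weights: W = 13, N = 6
```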

 13. Power of the architecture — Exemplary results
     Given $f$ from some class of functions, how many weights/neurons does an $\varepsilon$-approximating NN need to have? Not so many...
     Theorem (Maiorov, Pinkus; 1999). There exists an activation function $\varrho_{\mathrm{weird}} : \mathbb{R} \to \mathbb{R}$ that
     ◮ is analytic and strictly increasing,
     ◮ satisfies $\lim_{x \to -\infty} \varrho_{\mathrm{weird}}(x) = 0$ and $\lim_{x \to \infty} \varrho_{\mathrm{weird}}(x) = 1$,
     such that for any $d \in \mathbb{N}$, any $f \in C([0,1]^d)$, and any $\varepsilon > 0$, there is a 3-layer $\varrho_{\mathrm{weird}}$-network $\Phi^{\varepsilon}_{\varrho_{\mathrm{weird}}}$ with $\|f - \Phi^{\varepsilon}_{\varrho_{\mathrm{weird}}}\|_{L^\infty} \le \varepsilon$ and $N(\Phi^{\varepsilon}_{\varrho_{\mathrm{weird}}}) = 9d + 3$.

 14. Power of the architecture — Exemplary results
     ◮ Barron; 1993: Approximation rate for functions with one finite Fourier moment, using shallow networks with activation function $\varrho$ sigmoidal of order zero.
     ◮ Mhaskar; 1993: Let $\varrho$ be a sigmoidal function of order $k \ge 2$. For $f \in C^s([0,1]^d)$, we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim N(\Phi^\varrho_n)^{-s/d}$ and $L(\Phi^\varrho_n) = L(d, s, k)$.
     ◮ Yarotsky; 2017: For $f \in C^s([0,1]^d)$ and $\varrho(x) = x_+$ (called ReLU), we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim W(\Phi^\varrho_n)^{-s/d}$ and $L(\Phi^\varrho_n) \asymp \log(n)$.
     ◮ Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer NNs.
     ◮ He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019: ReLU NNs reproduce the approximation rates of $h$-, $p$-, and $hp$-FEM.
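
To give a taste of how the ReLU rates in [Yarotsky; 2017] arise: the core building block is an approximation of $x \mapsto x^2$ on $[0,1]$ by sums of composed "sawtooth" (hat) functions, each exactly representable by a small ReLU network; approximate multiplication and polynomial approximation then follow. A sketch of that building block (function names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # ReLU realization of the hat function g on [0, 1]
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def square_approx(x, m):
    """Depth-O(m) ReLU approximation of x**2 on [0, 1]:
    x - sum_{s=1}^{m} g_s(x) / 4**s, where g_s is the s-fold
    composition of the hat function; error is at most 4**-(m+1)."""
    out, g = x.copy(), x.copy()
    for s in range(1, m + 1):
        g = hat(g)  # one more sawtooth composition
        out -= g / 4.0**s
    return out

x = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 6):
    print(m, np.max(np.abs(square_approx(x, m) - x**2)))  # ~ 4**-(m+1)
```

Each further composition of the hat function doubles the number of "teeth" while the correction term shrinks by a factor of 4, which is where the logarithmic depth and the fast error decay come from.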

 15. Lower bounds
     Optimal approximation rates: Lower bounds on the required network size exist only under additional assumptions (recall the networks based on $\varrho_{\mathrm{weird}}$). Options:
     (A) Place restrictions on the activation function (e.g., only consider the ReLU), thereby excluding pathological examples like $\varrho_{\mathrm{weird}}$. (⇝ VC-dimension bounds)
     (B) Place restrictions on the weights. (⇝ Information-theoretic bounds, entropy arguments)
     (C) Use still other concepts, like continuous $N$-widths.

