SLIDE 1

Deep Neural Networks for PDEs

Philipp Grohs

DL and Vis, September 2018

SLIDE 2

Short Reading List

1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep Learning; MIT Press, 2016
2. Aurélien Géron: Hands-On Machine Learning with Scikit-Learn and TensorFlow; O'Reilly, 2017
3. Brian Steele, John Chandler, and Swarna Reddy: Algorithms for Data Science; Springer, 2017
4. Alan Jeffrey: Applied Partial Differential Equations – An Introduction; Academic Press, 2002

SLIDE 3

Syllabus

1. PDEs and the Curse of Dimensionality
2. A Crash Course in Statistical Learning Theory (including a Detour to Variational Autoencoders)
3. PDEs as Learning Problems
4. Solving Linear Kolmogorov Equations by Means of Neural Network Based Learning

SLIDE 4

PDEs and the Curse of Dimensionality

SLIDE 5

PDEs

A PDE for the function u(x₁, …, x_d) is an equation of the form

F(x₁, …, x_d, u, ∂u/∂x₁, …, ∂u/∂x_d, ∂²u/∂x₁∂x₁, …, ∂²u/∂x₁∂x_d, …) = 0,

together with suitable boundary conditions.

SLIDE 6

Heat Equation

∂u/∂t(t, x) = ∂²u/∂x₁² + ∂²u/∂x₂² + ∂²u/∂x₃² + g(t, x), u(0, x) = φ(x), t ∈ (0, ∞), x ∈ ℝ³; d = 4.

SLIDES 7–8

Explicit Solution of the Heat Equation if g = 0

Let u(t, x) satisfy

∂u/∂t(t, x) = ∂²u/∂x₁² + ∂²u/∂x₂² + ∂²u/∂x₃², u(0, x) = φ(x), t ∈ (0, ∞), x ∈ ℝ³; d = 4.

Then

u(t, x) = (4πt)^(−3/2) ∫_{ℝ³} φ(y) exp(−|x − y|²/4t) dy.
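A quick numerical sanity check (a sketch, not part of the original deck; all variable names are hypothetical): the kernel above is the density of a Gaussian with mean x and per-coordinate variance 2t, so the integral equals E[φ(Z)] with Z ∼ N(x, 2t·I) and can be estimated by Monte Carlo. For φ(y) = |y|² the exact solution is u(t, x) = |x|² + 6t, since Δ|x|² = 6 in three dimensions.

```python
# Hypothetical check of the heat-kernel formula (not from the slides).
# (4*pi*t)^(-3/2) * exp(-|x-y|^2/(4t)) is the N(x, 2t*I) density, so the
# integral is E[phi(Z)] with Z ~ N(x, 2t*I).
import numpy as np

rng = np.random.default_rng(0)
t, x = 0.5, np.array([1.0, -2.0, 0.5])

z = x + np.sqrt(2 * t) * rng.standard_normal((10**6, 3))  # samples of Z ~ N(x, 2t*I)
mc_estimate = np.mean(np.sum(z**2, axis=1))               # Monte Carlo estimate of E[|Z|^2]
exact = np.sum(x**2) + 6 * t                              # u(t, x) = |x|^2 + 6t

print(mc_estimate, exact)  # the two values should agree to a few decimal places
```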
SLIDE 9

Fluid Dynamics

∂u/∂t(t, x, v) + v · ∇u(t, x, v) = Qu(t, x, v), t ∈ (0, ∞), x, v ∈ ℝ³; d = 7.

SLIDE 10

Schrödinger Equation

Wave function of a non-relativistic quantum mechanical system of N electrons in a field of K nuclei of charge Z_ν at fixed positions R_ν ∈ ℝ³:

i ∂Ψ/∂t(r₁, …, r_N; t) = −(1/2) ∑_{ξ=1}^{N} Δ_ξ Ψ(r₁, …, r_N; t)
  − ∑_{ξ=1}^{N} ∑_{ν=1}^{K} (Z_ν / |r_ξ − R_ν|) Ψ(r₁, …, r_N; t)
  + (1/2) ∑_{ξ=1}^{N} ∑_{η=1}^{N} ((1 − δ_{ξ,η}) / |r_ξ − r_η|) Ψ(r₁, …, r_N; t),

t ∈ (0, ∞), r₁, …, r_N ∈ ℝ³; d = 3N + 1.

SLIDE 11

Black-Scholes Equation

Pricing a portfolio of N financial derivatives:

∂u/∂t(t, x) = (1/2) ∑_{i,j=1}^{N} x_i x_j β_i β_j ⟨ς_i, ς_j⟩_{ℝ^N} ∂²u/∂x_i∂x_j(t, x) + ∑_{i=1}^{N} µ_i x_i ∂u/∂x_i(t, x),

u(0, x) = max{K − ∑_{i=1}^{N} c_i x_i, 0}, t ∈ (0, ∞), x ∈ ℝ^N; d = N + 1.

SLIDE 12

Learning the PDE [Rudy et al. (2017)]

SLIDES 13–16

Finite Difference Approach

Want to approximate u(x) for x ∈ [0, 1]^d. Let

u_{i₁,…,i_d} ∼ u(i₁ε, …, i_dε), (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d,

(u_{i₁,…,i_l+1,…,i_d} − u_{i₁,…,i_l,…,i_d}) / ε ∼ (∂/∂x_l) u(i₁ε, …, i_dε), (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d,

and so on, and solve the discrete system

F(i₁ε, …, i_dε, u_{i₁,…,i_d}, (u_{i₁+1,…,i_d} − u_{i₁,…,i_d})/ε, …) = 0, (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d.
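As an illustration (a minimal sketch, not from the deck; all names and parameter values are hypothetical), the snippet below sets up this discretization for the 1-d Poisson problem −u'' = f on [0, 1] and then prints how the number of grid unknowns, (⌊ε⁻¹⌋ + 1)^d, explodes with the dimension d:

```python
# Minimal finite-difference sketch (hypothetical example, not from the slides).
import numpy as np

eps = 0.05
n = int(1 / eps) + 1                      # grid points per coordinate: {0, ..., floor(1/eps)}
grid = np.linspace(0.0, 1.0, n)

# Solve -u'' = f on [0, 1] with u(0) = u(1) = 0 and f = pi^2 sin(pi x),
# whose exact solution is u(x) = sin(pi x).
f = np.pi**2 * np.sin(np.pi * grid[1:-1])
A = (2 * np.eye(n - 2) - np.eye(n - 2, k=1) - np.eye(n - 2, k=-1)) / eps**2
u_inner = np.linalg.solve(A, f)
print("max FD error:", np.max(np.abs(u_inner - np.sin(np.pi * grid[1:-1]))))

# The same construction in d dimensions has n**d unknowns:
for d in (1, 2, 3, 10, 100):
    print(f"d = {d:3d}: number of unknowns n^d = {float(n**d):.3e}")
```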

SLIDES 17–19

Curse of Dimensionality

The system

F(i₁ε, …, i_dε, u_{i₁,…,i_d}, (u_{i₁+1,…,i_d} − u_{i₁,…,i_d})/ε, …) = 0, (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d,

requires us to solve an equation in u_{i₁,…,i_d} for every (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d.

Exponential Dependence on the Dimension: Let ε = 1/2 (take two samples in each coordinate). Then these are 2^d unknowns: intractable for high-dimensional problems!
SLIDES 20–22

Curse of Dimensionality

The complexity of approximating a general d-dimensional function scales exponentially in d.

Suppose we have a problem where we aim to approximate a d-dimensional function. An algorithm to solve the problem suffers from the curse of dimensionality if its computational complexity depends exponentially on the dimension d.

SLIDES 23–26

Black-Scholes Equation

Pricing a portfolio of N financial derivatives:

∂u/∂t(t, x) = (1/2) ∑_{i,j=1}^{N} x_i x_j β_i β_j ⟨ς_i, ς_j⟩_{ℝ^N} ∂²u/∂x_i∂x_j(t, x) + ∑_{i=1}^{N} µ_i x_i ∂u/∂x_i(t, x),

u(0, x) = max{K − ∑_{i=1}^{N} c_i x_i, 0}, t ∈ (0, ∞), x ∈ ℝ^N; d = N + 1.

Realistic values: d = 100–1000.

Complexity of the finite difference method: 2^100 – 2^1000.

Number of atoms in the universe: roughly 2^250.

SLIDES 27–29

Black-Scholes Equation

Option pricing is extremely relevant and has to be done every day in the financial industry. All algorithms for the solution of the Black-Scholes equation suffer from the curse of dimensionality!

SLIDES 30–34

MNIST

MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/

Every image is given as a 28 × 28 matrix x ∈ ℝ^{28×28} ∼ ℝ^784. Every label is given as a 10-dimensional vector y ∈ ℝ^10 describing the 'probability' of each digit.
SLIDES 35–39

MNIST

[Figure: handwritten digit image → ConvNet → predicted label "5"]

This is a 784-dimensional function. Apparently, deep learning does not suffer from the curse of dimensionality for certain classification problems! Can this also be used for the solution of PDEs?

SLIDE 40

A Crash Course in Statistical Learning Theory

SLIDES 41–44

Data Generating Distribution

Suppose that there exists a probability distribution on ℝ^784 that randomly generates handwritten digits.

Variational Autoencoder Demo

SLIDES 45–51

A New Look

Suppose that our training data consists of samples drawn according to a given data distribution (X, Y). If we knew the data distribution (X, Y), the best functional relation between X and Y would simply be E[Y | X = x]! But we only have samples and do not know the distribution (X, Y).

A mathematical learning problem seeks to infer the regression function E[Y | X = x] from random samples (x_i, y_i)_{i=1}^m of (X, Y).

SLIDES 52–55

Mathematical Formulation

Let (Ω, F, P) be a probability space and let X : Ω → ℝ^d and Y : Ω → ℝ^n be random vectors. Find the best functional relationship Û : ℝ^d → ℝ^n between these vectors in the sense that

Û = argmin_{U : ℝ^d → ℝ^n} ∫ |U(X(ω)) − Y(ω)|² dP(ω) = argmin_{U : ℝ^d → ℝ^n} E[|U(X) − Y|²].

We have Û(x) = E[Y | X = x]. Û is called the regression function.
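A small numerical illustration of this fact (a hypothetical sketch, not from the deck): among all functions of X, the conditional expectation minimizes the mean squared error. Discretizing X into bins makes "functions of X" just one value per bin; with skewed noise (so that mean and median differ) the per-bin mean beats another natural candidate, the per-bin median.

```python
# Hypothetical illustration (not from the slides): E[Y | X = x] minimizes E|U(X) - Y|^2.
import numpy as np

rng = np.random.default_rng(1)
m = 200_000
x = rng.uniform(0.0, 1.0, m)
# Skewed, centered noise: exponential minus its mean, so median != mean.
y = np.sin(2 * np.pi * x) + rng.exponential(0.5, m) - 0.5   # E[Y | X = x] = sin(2*pi*x)

bins = np.linspace(0.0, 1.0, 51)
idx = np.digitize(x, bins) - 1          # bin index of each sample, 0..49

# Candidate 1: per-bin sample mean (empirical conditional expectation).
bin_mean = np.array([y[idx == b].mean() for b in range(50)])
mse_mean = np.mean((bin_mean[idx] - y) ** 2)

# Candidate 2: per-bin median, a different function of X.
bin_med = np.array([np.median(y[idx == b]) for b in range(50)])
mse_med = np.mean((bin_med[idx] - y) ** 2)

print(mse_mean, mse_med)  # mse_mean is strictly smaller (up to sampling noise)
```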

SLIDES 56–60

Statistical Learning Theory

Let z = ((x₁, y₁), …, (x_m, y_m)) be m realizations of samples independently drawn according to (X, Y). For a function U : ℝ^d → ℝ^k define the empirical risk of U by

E_z(U) = (1/m) ∑_{i=1}^{m} |U(x_i) − y_i|².

Empirical Risk Minimization (ERM) picks a hypothesis class H ⊂ C(ℝ^d, ℝ^k) and computes the empirical regression function

Û_{H,z} ∈ argmin_{U ∈ H} E_z(U).

Example: H = {polynomials of degree ≤ p}.
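For the polynomial example, ERM is just least-squares fitting. Here is a hypothetical sketch (not from the deck; target function and parameter values invented for illustration) that previews the under-/overfitting behaviour of the next slides:

```python
# Hypothetical ERM sketch: least-squares polynomial fitting is exactly ERM over
# H = {polynomials of degree <= p}. (numpy may warn that the degree-25 fit is
# poorly conditioned; that is part of the point.)
import numpy as np

rng = np.random.default_rng(2)
m = 30
x = np.sort(rng.uniform(-1.0, 1.0, m))
y = np.cos(2.0 * x) + 0.1 * rng.standard_normal(m)   # noisy samples of a smooth target

x_test = np.linspace(-1.0, 1.0, 1000)
y_test = np.cos(2.0 * x_test)

for p in (1, 4, 25):
    coeffs = np.polyfit(x, y, deg=p)                 # ERM over degree-<=p polynomials
    emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"p = {p:2d}: empirical risk {emp_risk:.4f}, error on fresh points {true_err:.4f}")
# Low degree underfits (both errors large); very high degree overfits
# (empirical risk tiny, error on fresh points large).
```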

SLIDES 61–62

Degree too low: underfitting. Degree too high: overfitting!

SLIDES 63–64

Figure: Error vs. Polynomial Degree

Bias-Variance Problem: the "capacity" of the hypothesis space has to be adapted to the complexity of the target function and the sample size!
SLIDES 65–68

Bias-Variance Decomposition

Let (X, Y) be the data-generating random variables and Û the regression function. Let z = (x_i, y_i)_{i=1}^m be i.i.d. samples, H a hypothesis class, and Û_{H,z} the empirical regression function. We seek to understand the error

ε := E(Û_{H,z}) − E(Û) = E|Û_{H,z}(X) − Û(X)|².

Bias-Variance Decomposition: Let U_H := argmin_{U ∈ H} E|U(X) − Û(X)|², let ε_approx := E|U_H(X) − Û(X)|² be the approximation error and ε_generalize := E(Û_{H,z}) − E(U_H) the generalization error. Then ε = ε_approx + ε_generalize.

Main Theorem [e.g., Cucker-Zhou (2007)]: If m ≳ ln(N(H, c·η))/η² (and very strong conditions hold), then ε_generalize ≤ η w.h.p., where N(H, s) is the s-covering number of H w.r.t. L∞.

Problems for Data Science Applications:
- The assumption that the data is i.i.d. is debatable.
- Deep learning operates in a different asymptotic regime (where often #DOFs >> #training samples).
- Without knowing P_(X,Y) it is impossible to control the approximation error.

SLIDE 69

PDEs as Learning Problems

SLIDES 70–76

Explicit Solution of the Heat Equation if g = 0

Let u(t, x) satisfy

∂u/∂t(t, x) = ∂²u/∂x₁² + ∂²u/∂x₂² + ∂²u/∂x₃², u(0, x) = φ(x), t ∈ (0, ∞), x ∈ ℝ³; d = 4.

Then

u(t, x) = ∫_{ℝ³} φ(y) (4πt)^(−3/2) exp(−|x − y|²/4t) dy.

In other words, u(t, x) = E[φ(Z_t^x)], where Z_t^x ∼ N(x, 2t·I) (the Gaussian with mean x and covariance 2t·I, matching the kernel above).

In other words, for x ∈ [u, v]³, X ∼ U[u, v]³ and Y = φ(Z_t^X), we have u(t, x) = E[Y | X = x].

The solution u(t, x) of the PDE can be interpreted as the solution to the learning problem with data distribution (X, Y), where X ∼ U[u, v]³, Y = φ(Z_t^X), and Z_t^x ∼ N(x, 2t·I)!

Contrary to conventional ML problems, the data distribution is now explicitly known: we can simulate as much training data as we want!

We will see in a minute that similar properties hold for a much more general class of PDEs!
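To make "simulate as much training data as we want" concrete, here is a hypothetical data-generation sketch (not from the deck; box, time horizon, and initial condition are invented), reading the heat kernel as the N(X, 2t·I) distribution as above:

```python
# Hypothetical training-data generation for the heat-equation learning problem.
import numpy as np

rng = np.random.default_rng(3)
t = 0.25
lo, hi = 0.0, 1.0                       # the box [u, v]^3, renamed to avoid clashing with u
phi = lambda y: np.sum(y**2, axis=1)    # initial condition phi(y) = |y|^2

m = 1_000_000
X = rng.uniform(lo, hi, (m, 3))                       # features: X ~ U[lo, hi]^3
Z = X + np.sqrt(2 * t) * rng.standard_normal((m, 3))  # Z ~ N(X, 2t*I)
Y = phi(Z)                                            # labels: Y = phi(Z)

# Regression-function check at one point: E[Y | X = x0] should equal |x0|^2 + 6t.
x0 = np.array([0.5, 0.5, 0.5])
near = np.all(np.abs(X - x0) < 0.05, axis=1)          # crude local average around x0
print(Y[near].mean(), np.sum(x0**2) + 6 * t)          # agreement to a few percent
```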

SLIDES 77–79

Linear Kolmogorov Equations

Given Σ : ℝ^d → ℝ^{d×d}, µ : ℝ^d → ℝ^d and an initial value φ : ℝ^d → ℝ, find u : ℝ₊ × ℝ^d → ℝ with

∂u/∂t(t, x) = (1/2) Trace(Σ(x) Σᵀ(x) Hessₓ u(t, x)) + µ(x) · ∇ₓ u(t, x), (t, x) ∈ [0, T] × ℝ^d, u(0, x) = φ(x).

Examples include convection-diffusion equations and the Black-Scholes equation.

Standard methods such as sparse grid methods, sparse tensor product methods, spectral methods, finite element methods or finite difference methods are incapable of solving such equations in high dimensions (d = 100)!

SLIDES 80–84

Special Case: Pricing of Financial Derivatives

Given a portfolio consisting of d assets with values (x_i(t))_{i=1}^d.

European Max Option: At time T, exercise the option and receive

G(x) := max{ max_{i=1,…,d} (x_i − K_i), 0 }.

(Black-Scholes (1973)): in the absence of correlations the portfolio value u(t, x) satisfies

∂u/∂t(t, x) + (µ/2) ∑_{i=1}^{d} x_i ∂u/∂x_i(t, x) + (σ²/2) ∑_{i=1}^{d} |x_i|² ∂²u/∂x_i²(t, x) = 0,

u(T, x) = G(x).

Pricing Problem: u(0, x) = ??
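A Monte Carlo sketch of this pricing problem (hypothetical, not from the deck; constant µ, σ, strikes, and initial values are invented, and discounting is omitted for simplicity). In the uncorrelated model each asset follows a geometric Brownian motion, which can be sampled exactly: x_i(T) = x_i(0) · exp((µ − σ²/2)T + σ√T ξ_i) with ξ_i standard normal.

```python
# Hypothetical Monte Carlo pricing sketch for the European max option.
import numpy as np

rng = np.random.default_rng(4)
d, T, mu, sigma = 100, 1.0, 0.05, 0.2
K = np.full(d, 100.0)           # strikes K_i (invented values)
x0 = np.full(d, 100.0)          # initial asset values (invented)

m = 50_000
xi = rng.standard_normal((m, d))
xT = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * xi)  # exact GBM samples
payoff = np.maximum(np.max(xT - K, axis=1), 0.0)  # G(x) = max{max_i (x_i - K_i), 0}

print("Monte Carlo estimate of E[G(x(T))]:", payoff.mean())
```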

SLIDES 85–87

Kolmogorov PDEs as Learning Problems

For x ∈ ℝ^d and t ∈ ℝ₊ let

Z_t^x := x + ∫₀ᵗ µ(Z_s^x) ds + ∫₀ᵗ Σ(Z_s^x) dW_s.

Then (Feynman-Kac) u(T, x) = E[φ(Z_T^x)].

Lemma (Beck-Becker-G-Jaafari-Jentzen (2018)): Let X ∼ U[a, b]^d and let Y = φ(Z_T^X). The solution Û of the mathematical learning problem with data distribution (X, Y) is given by Û(x) = u(T, x), x ∈ [a, b]^d, where u solves the corresponding Kolmogorov equation.
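The SDE above generally has no closed form, but the training pairs (X, Y) can be generated approximately with the Euler-Maruyama scheme. A hypothetical sketch (not from the deck; the drift, diffusion, payoff, and all parameter values are invented stand-ins, here resembling an uncorrelated Black-Scholes model):

```python
# Hypothetical Euler-Maruyama generation of training pairs (X, Y), Y = phi(Z_T^X).
import numpy as np

rng = np.random.default_rng(5)
d, T, n_steps = 100, 1.0, 100
a, b = 90.0, 110.0
dt = T / n_steps

mu = lambda z: 0.05 * z            # drift mu(z), affine in z
sigma = lambda z: 0.2 * z          # diagonal diffusion: Sigma(z) = 0.2 * diag(z)
phi = lambda z: np.maximum(np.max(z - 100.0, axis=1), 0.0)  # payoff / initial condition

m = 10_000
X = rng.uniform(a, b, (m, d))      # features: X ~ U[a, b]^d

Z = X.copy()
for _ in range(n_steps):           # Euler-Maruyama: Z += mu(Z) dt + Sigma(Z) dW
    dW = np.sqrt(dt) * rng.standard_normal((m, d))
    Z = Z + mu(Z) * dt + sigma(Z) * dW

Y = phi(Z)                         # labels: Y ~ phi(Z_T^X)
print(X.shape, Y.shape, Y.mean())
```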

SLIDE 88

Solving linear Kolmogorov Equations by means of Neural Network Based Learning

SLIDES 89–95

The Vanilla DL Paradigm

Every image is given as a 28 × 28 matrix x ∈ ℝ^{28×28} ∼ ℝ^784. Every label is given as a 10-dimensional vector y ∈ ℝ^10 describing the 'probability' of each digit.

Given labeled training data (x_i, y_i)_{i=1}^m ⊂ ℝ^784 × ℝ^10.

Fix a network architecture, e.g., the number of layers (for example L = 3) and the numbers of neurons (N₁ = 30, N₂ = 30).

The learning goal is to find the empirical regression function f_z ∈ H^σ_{(784,30,30,10)}.

Typically solved by stochastic first-order optimization methods.
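In modern Keras this paradigm is a few lines. A hypothetical sketch (not from the deck; the activation, optimizer, and the random stand-in data are assumptions) of the architecture H^σ_{(784,30,30,10)}:

```python
# Hypothetical Keras sketch of the (784, 30, 30, 10) architecture, trained by a
# stochastic first-order method (SGD) on the empirical squared risk.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=(784,)),  # N1 = 30
    tf.keras.layers.Dense(30, activation="relu"),                      # N2 = 30
    tf.keras.layers.Dense(10),                                         # output, NL = 10
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="mse")   # empirical risk: mean squared error

# Random stand-in data with MNIST shapes; real training would use the pairs (x_i, y_i).
x_train = np.random.rand(1000, 784).astype("float32")
y_train = np.random.rand(1000, 10).astype("float32")
model.fit(x_train, y_train, batch_size=32, epochs=2, verbose=0)
print(model.predict(x_train[:1], verbose=0).shape)  # (1, 10)
```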
SLIDE 96

Description of Image Content

ImageNet Challenge

SLIDES 97–100

Deep Learning Algorithm

1. Generate training data z = (x_i, y_i)_{i=1}^m i.i.d. ∼ (X, φ(Z_T^X)) by simulating Z_T^X with the Euler-Maruyama scheme.
2. Apply the Deep Learning Paradigm to this training data, meaning that
   (i) we pick a network architecture (N₀ = d, N₁, …, N_L = 1) and let H = H^σ_{(N₀,…,N_L)}, and
   (ii) attempt to approximately compute Û_{H,z} = argmin_{U ∈ H} (1/m) ∑_{i=1}^{m} (U(x_i) − y_i)² in TensorFlow.

A minimal end-to-end sketch combining both steps follows below.
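The following sketch is hypothetical (not from the deck): it instantiates both steps for the d-dimensional heat equation ∂u/∂t = Δu with φ(x) = |x|², where µ = 0, Σ = √2·I, and the exact solution u(T, x) = |x|² + 2dT is available for checking. Architecture, optimizer, and all parameter values are invented.

```python
# Hypothetical end-to-end sketch of the two-step algorithm:
# (1) simulate Y = phi(Z_T^X) with Euler-Maruyama, (2) fit a network by ERM.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(6)
d, T, n_steps, m = 10, 1.0, 50, 20_000
a, b = 0.0, 1.0
dt = T / n_steps
phi = lambda z: np.sum(z**2, axis=1, keepdims=True)  # initial condition |x|^2

# Step 1: training data. For the heat equation u_t = Laplacian(u): mu = 0, Sigma = sqrt(2)*I.
X = rng.uniform(a, b, (m, d)).astype("float32")
Z = X.copy()
for _ in range(n_steps):
    Z = Z + np.sqrt(2.0 * dt) * rng.standard_normal((m, d)).astype("float32")
Y = phi(Z).astype("float32")

# Step 2: ERM over a small ReLU network H^sigma_(d, 50, 50, 1).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu", input_shape=(d,)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, batch_size=256, epochs=10, verbose=0)

# Check against the exact solution u(T, x) = |x|^2 + 2dT.
x_test = rng.uniform(a, b, (5, d)).astype("float32")
pred = model.predict(x_test, verbose=0).ravel()
exact = np.sum(x_test**2, axis=1) + 2 * d * T
print(np.abs(pred - exact) / exact)  # relative errors; should shrink with more training
```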

SLIDE 101

[Plot: estimated relative L¹-, L²-, and L∞-errors on [0, 1]^d versus the number of iterations]

Steps     Rel. L¹ error   Rel. L² error   Rel. L∞ error   Runtime (s)
–         0.998253        0.998254        1.003524        0.5
10000     0.957464        0.957536        0.993083        44.6
50000     0.786743        0.786806        0.828184        220.8
100000    0.574013        0.574060        0.605283        440.8
150000    0.361564        0.361594        0.384105        661.0
200000    0.001419        0.001784        0.010423        880.8
500000    0.001419        0.001784        0.010423        2200.7
750000    0.001419        0.001784        0.010423        3300.6

Figure: Estimated errors (relative L¹-, L²-, and L∞-errors w.r.t. λ_{[0,1]^d}) associated to the solution u(1, ·) of the 100-dimensional parabolic PDE ∂u/∂t(t, x) = Δₓu(t, x), u(0, x) = |x|², x ∈ [0, 1]^100.

SLIDE 102

Steps     Rel. L¹ error   Rel. L² error   Rel. L∞ error   Runtime (s)
–         1.004285        1.004286        1.009524        1
25000     0.842938        0.843021        0.87884         110.2
50000     0.684955        0.685021        0.719826        219.5
100000    0.371515        0.371551        0.387978        437.9
150000    0.064605        0.064628        0.072259        656.2
250000    0.001220        0.001538        0.010039        1092.6
500000    0.000949        0.001187        0.005105        2183.8
750000    0.000902        0.001129        0.006028        3275.1

Figure: Estimated errors (relative L¹-, L²-, and L∞-errors w.r.t. λ_{[90,110]^d}) associated to the solution u(T, ·) of the 100-dimensional uncorrelated Black-Scholes PDE

∂u/∂t(t, x) = (1/2) ∑_{i=1}^{d} |σ_i x_i|² ∂²u/∂x_i²(t, x) + ∑_{i=1}^{d} µ_i x_i ∂u/∂x_i(t, x),

u(0, x) = exp(−rT) max{ max_{i∈{1,…,d}} x_i − 100, 0 }, x ∈ [90, 110]^100.
SLIDES 103–104

Steps     Rel. L¹ error   Rel. L² error   Rel. L∞ error   Runtime (s)
–         1.003383        1.003385        1.011662        0.8
25000     0.631420        0.631429        0.640633        112.1
50000     0.269053        0.269058        0.275114        223.3
100000    0.000752        0.000948        0.00553         445.8
150000    0.000694        0.00087         0.004662        668.2
250000    0.000604        0.000758        0.006483        1119.3
500000    0.000493        0.000615        0.002774        2292.8
750000    0.000471        0.00059         0.002862        3466.8

Figure: Estimated errors (relative L¹-, L²-, and L∞-errors w.r.t. λ_{[90,110]^d}) associated to the solution u(T, ·) of the 100-dimensional correlated Black-Scholes PDE

∂u/∂t(t, x) = (1/2) ∑_{i,j=1}^{d} x_i x_j β_i β_j ⟨ς_i, ς_j⟩_{ℝ^d} ∂²u/∂x_i∂x_j(t, x) + ∑_{i=1}^{d} µ_i x_i ∂u/∂x_i(t, x),

u(0, x) = exp(−µT) max{ 110 − min_{i∈{1,…,d}} x_i, 0 }, x ∈ [90, 110]^100.

All computations were performed in single precision (float32) on an NVIDIA GeForce GTX 1080 GPU with 1974 MHz core clock and 8 GB GDDR5X memory with 1809.5 MHz clock rate. The underlying system consisted of an Intel Core i7-6800K CPU with 64 GB DDR4-2133 memory running TensorFlow 1.5 on Ubuntu 16.04.

SLIDE 105

Some Theoretical Results

SLIDES 106–108

Linear Affine Kolmogorov Equations

Given affine Σ : ℝ^d → ℝ^{d×d} and µ : ℝ^d → ℝ^d and an initial value φ : ℝ^d → ℝ, find u : ℝ₊ × ℝ^d → ℝ with

∂u/∂t(t, x) = (1/2) Trace(Σ(x) Σᵀ(x) Hessₓ u(t, x)) + µ(x) · ∇ₓ u(t, x), (t, x) ∈ [0, T] × ℝ^d, u(0, x) = φ(x).

Includes the Black-Scholes equation with correlations!

Theorem [G-Hornung-Jentzen-von Wurstemberger (2018)], simplified version: Suppose that φ ∈ H^σ_{(N₀,…,N_L)} (or can be well approximated by NNs). Then for all ε > 0 there is Φ_ε with size(Φ_ε) ≲ size(φ) · ε⁻² and

sup_{x ∈ [a,b]^d} |u(T, x) − R_σ(Φ_ε)(x)| ≤ ε.

The implicit constant depends at most polynomially on the dimension d = N₀.

SLIDES 109–110

Option Pricing without Curse of Dimensionality

Theorem [Berner-G-Jentzen (2018)], very special case: Let φ(x) = min{max{max_i (x_i − K_i), 0}, R} or φ(x) = min{max{∑_{i=1}^{d} x_i − K, 0}, R} (or any typical option). Then for all ε > 0 there is Φ_ε ∈ H^{ReLU}_{(N₀,…,N_L)} with size(Φ_ε) = O(ε⁻²) and

(1/(b − a)^{d/2}) ( ∫_{[a,b]^d} |u(T, x) − R_σ(Φ_ε)(x)|² dx )^{1/2} ≤ ε.

Such networks can be found by solving the ERM problem with m ∼ ε⁻⁴ samples. The implicit constants depend at most polynomially on the dimension d = N₀!

Due to the compositional structure of NNs, all results also hold for options operating on options...
SLIDES 111–115

Wrap Up

Several PDEs can be reformulated as learning problems.

Neural network based numerical solution of high-dimensional PDEs is extremely promising both empirically and mathematically, and it is possible to prove real theorems!

Specifically, we can prove that these methods are capable of overcoming the curse of dimensionality for an important class of PDEs arising in computational finance.

We can observe these properties in simulations.

SLIDE 116

Thank You!

Questions?

SLIDE 117

Literature

Beck, Becker, G, Jaafari, Jentzen. Solving Stochastic Differential Equations and Kolmogorov Equations by Means of Deep Learning. arXiv:1806.00421.

Elbrächter, G, Jentzen, Schwab. DNN Expression Rate Analysis of High Dimensional PDEs: Applications in Option Pricing. arXiv:1806.xxxxx.

Perekrestenko, G, Elbrächter, Bölcskei. The universal approximation power of finite-width deep ReLU networks. arXiv:1806.01528.

G, Hornung, Jentzen, von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.xxxxx.

Berner, G, Jentzen. Empirical risk minimization over deep neural network hypothesis classes breaks the curse of dimensionality for the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.xxxxx.