
Deep Neural Networks for PDEs. Philipp Grohs, DL and Vis, September 2018.

Short Reading List
1. Ian Goodfellow, Yoshua Bengio and Aaron Courville: Deep Learning; MIT Press, 2016.
2. Aurélien Géron: Hands-On Machine Learning with Scikit-Learn and TensorFlow.


  1-4. Data Generating Distribution: Suppose that there exists a probability distribution on $\mathbb{R}^{784}$ that randomly generates handwritten digits. (Variational Autoencoder Demo)

  5-11. A New Look: Suppose that our training data consists of samples according to a given data distribution $(X, Y)$. If we knew the data distribution $(X, Y)$, the best functional relation between $X$ and $Y$ would simply be $E[Y \mid X = x]$! But we only have samples and do not know the distribution $(X, Y)$. A mathematical learning problem seeks to infer the regression function $E[Y \mid X = x]$ from random samples $(x_i, y_i)_{i=1}^m$ of $(X, Y)$.

  12-15. Mathematical Formulation: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $X : \Omega \to \mathbb{R}^d$ and $Y : \Omega \to \mathbb{R}^n$ be random vectors. Find the best functional relationship $\hat{U} : \mathbb{R}^d \to \mathbb{R}^n$ between these vectors in the sense that
  $$\hat{U} = \operatorname*{argmin}_{U : \mathbb{R}^d \to \mathbb{R}^n} \int_\Omega |U(X(\omega)) - Y(\omega)|^2 \, dP(\omega) = \operatorname*{argmin}_{U : \mathbb{R}^d \to \mathbb{R}^n} E\big[|U(X) - Y|^2\big].$$
  We have $\hat{U}(x) = E[Y \mid X = x]$. $\hat{U}$ is called the regression function.
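To make the claim $\hat{U}(x) = E[Y \mid X = x]$ concrete, here is a small numerical sketch (not from the slides; the toy distribution with $Y = \sin(2\pi X) + \text{noise}$ is an illustrative assumption): binned conditional means of samples approximate the regression function.

```python
# Minimal sketch (assumption: toy distribution, not from the slides).
# For Y = sin(2*pi*X) + noise, the L2-optimal predictor is E[Y | X = x] = sin(2*pi*x);
# we approximate it by binned conditional means of samples.
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x = rng.uniform(0.0, 1.0, size=m)                          # samples of X ~ U[0, 1]
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(m)   # Y = sin(2 pi X) + noise

bins = np.linspace(0.0, 1.0, 21)
idx = np.digitize(x, bins) - 1                             # bin index of each sample
cond_mean = np.array([y[idx == k].mean() for k in range(len(bins) - 1)])
centers = 0.5 * (bins[:-1] + bins[1:])

# The binned conditional means track the true regression function sin(2 pi x).
print(np.max(np.abs(cond_mean - np.sin(2 * np.pi * centers))))  # small (~1e-2)
```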

  16-18. Statistical Learning Theory: Let $z = \big((x_1, y_1), \ldots, (x_m, y_m)\big)$ be $m$ realizations of samples independently drawn according to $(X, Y)$. For a function $U : \mathbb{R}^d \to \mathbb{R}^k$ define the empirical risk of $U$ by
  $$\mathcal{E}_z(U) = \frac{1}{m} \sum_{i=1}^m |U(x_i) - y_i|^2.$$
  Empirical Risk Minimization (ERM) picks a hypothesis class $\mathcal{H} \subset C(\mathbb{R}^d, \mathbb{R}^k)$ and computes the empirical regression function
  $$\hat{U}_{\mathcal{H}, z} \in \operatorname*{argmin}_{U \in \mathcal{H}} \mathcal{E}_z(U).$$
  Example: $\mathcal{H} = \{\text{polynomials of degree} \le p\}$.
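As an illustration of ERM over the polynomial hypothesis class (a sketch, not from the slides; the toy data distribution and the degree $p = 5$ are assumptions), `numpy.polyfit` returns exactly the least-squares, i.e. empirical-risk, minimizer in one dimension.

```python
# Minimal ERM sketch (assumptions: toy data, degree choice).
# For H = {polynomials of degree <= p} in one dimension, the empirical risk
# minimizer is ordinary least squares, which numpy.polyfit computes directly.
import numpy as np

rng = np.random.default_rng(1)
m = 50
x = rng.uniform(-1.0, 1.0, size=m)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(m)   # toy data distribution (assumption)

p = 5                                              # degree of the hypothesis class
coeffs = np.polyfit(x, y, deg=p)                   # argmin of the empirical risk over H
U_hat = np.poly1d(coeffs)

empirical_risk = np.mean((U_hat(x) - y) ** 2)
print(f"empirical risk E_z(U_hat) = {empirical_risk:.4f}")
```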

  19-21. Bias-Variance Problem: Degree too low: underfitting. Degree too high: overfitting! The "capacity" of the hypothesis space has to be adapted to the complexity of the target function and the sample size. (Figure: error as a function of the polynomial degree.)

  22-25. Bias-Variance Decomposition: Let $(X, Y)$ be the data-generating random variables and $\hat{U}$ the regression function. Let $z = (x_i, y_i)_{i=1}^m$ be i.i.d. samples, $\mathcal{H}$ a hypothesis class and $\hat{U}_{\mathcal{H}, z}$ the empirical regression function. We seek to understand the error
  $$\epsilon := \mathcal{E}(\hat{U}_{\mathcal{H}, z}) - \mathcal{E}(\hat{U}) = E|\hat{U}_{\mathcal{H}, z}(X) - \hat{U}(X)|^2.$$

  Bias-Variance Decomposition: Let $U_{\mathcal{H}} := \operatorname*{argmin}_{U \in \mathcal{H}} E|U(X) - \hat{U}(X)|^2$, let $\epsilon_{\mathrm{approx}} := E|U_{\mathcal{H}}(X) - \hat{U}(X)|^2$ be the approximation error and $\epsilon_{\mathrm{generalize}} := \mathcal{E}(\hat{U}_{\mathcal{H}, z}) - \mathcal{E}(U_{\mathcal{H}})$ the generalization error. Then
  $$\epsilon = \epsilon_{\mathrm{approx}} + \epsilon_{\mathrm{generalize}}.$$

  Main Theorem [e.g., Cucker-Zhou (2007)]: If $m \gtrsim \ln(N(\mathcal{H}, c \cdot \eta)) / \eta^2$ (and very strong conditions hold), then $\epsilon_{\mathrm{generalize}} \le \eta$ w.h.p., where $N(\mathcal{H}, s)$ is the $s$-covering number of $\mathcal{H}$ w.r.t. $L^\infty$.

  Problems for data science applications:
  - The assumption that the data is i.i.d. is debatable.
  - Deep learning operates in a different asymptotic regime (often #DOFs >> #training samples).
  - Without knowing $P_{(X, Y)}$ it is impossible to control the approximation error.
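A small experiment (not from the slides; the toy distribution, sample size and degree grid are assumptions) makes the trade-off visible: the training error keeps decreasing with the degree, while the test error, a Monte Carlo proxy for the risk, eventually grows again.

```python
# Minimal sketch (assumptions: toy data, degree grid, sample sizes).
# Train/test error of polynomial ERM as the degree p grows, illustrating the
# bias-variance trade-off.
import numpy as np

rng = np.random.default_rng(2)

def sample(m):
    x = rng.uniform(-1.0, 1.0, size=m)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(m)   # toy (X, Y) (assumption)
    return x, y

x_train, y_train = sample(30)
x_test, y_test = sample(10_000)                         # proxy for the true risk

for p in [1, 3, 5, 10, 15]:
    U = np.poly1d(np.polyfit(x_train, y_train, deg=p))
    train_err = np.mean((U(x_train) - y_train) ** 2)
    test_err = np.mean((U(x_test) - y_test) ** 2)
    print(f"degree {p:2d}: train {train_err:.4f}  test {test_err:.4f}")
```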

  26. PDEs as Learning Problems

  27-33. Explicit Solution of the Heat Equation (g = 0): Let $u(t, x)$ satisfy
  $$\frac{\partial u}{\partial t}(t, x) = \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2} + \frac{\partial^2 u}{\partial x_3^2}, \qquad u(0, x) = \varphi(x),$$
  for $t \in (0, \infty)$, $x \in \mathbb{R}^3$; $d = 4$. Then
  $$u(t, x) = \int_{\mathbb{R}^3} \varphi(y) \, \frac{1}{(4 \pi t)^{3/2}} \exp\big(-|x - y|^2 / 4t\big) \, dy.$$
  In other words, $u(t, x) = E[\varphi(Z_t^x)]$ with $Z_t^x \sim \mathcal{N}(x, t^{1/2} I)$. In other words, for $x \in [u, v]^3$, $X \sim \mathcal{U}_{[u, v]^3}$ and $Y = \varphi(Z_t^X)$ we have $u(t, x) = E[Y \mid X = x]$.

  The solution $u(t, x)$ of the PDE can be interpreted as the solution to the learning problem with data distribution $(X, Y)$, where $X \sim \mathcal{U}_{[u, v]^3}$, $Y = \varphi(Z_t^X)$ and $Z_t^X \sim \mathcal{N}(x, t^{1/2} I)$!

  Contrary to conventional ML problems, the data distribution is now explicitly known: we can simulate as much training data as we want!

  We will see in a minute that similar properties hold for a much more general class of PDEs!
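A minimal simulation sketch (not from the slides): reading the per-coordinate variance $2t$ off the heat kernel displayed above, we can generate pairs $(X, Y)$ and verify the probabilistic representation for an illustrative initial value $\varphi(z) = |z|^2$, for which $u(t, 0) = 6t$ exactly.

```python
# Minimal sketch (assumptions: choice of phi, cube [u, v]^3, sample size).
# Simulate training pairs (X, Y) for the 3-d heat equation above and check the
# representation u(t, x) = E[phi(Z_t^x)].  The kernel (4 pi t)^{-3/2} exp(-|x-y|^2/4t)
# is a Gaussian with per-coordinate variance 2t, which is what we sample here.
import numpy as np

rng = np.random.default_rng(3)
t, u_lo, v_hi = 1.0, -1.0, 1.0
phi = lambda z: np.sum(z ** 2, axis=-1)        # illustrative initial value phi(z) = |z|^2

m = 200_000
X = rng.uniform(u_lo, v_hi, size=(m, 3))       # X ~ U[u, v]^3
Z = X + np.sqrt(2 * t) * rng.standard_normal((m, 3))
Y = phi(Z)                                     # Y = phi(Z_t^X)

# Monte Carlo estimate of u(t, 0) = E[phi(Z_t^0)]; exactly |0|^2 + 3 * 2t = 6t.
Z0 = np.sqrt(2 * t) * rng.standard_normal((m, 3))
print(phi(Z0).mean(), "vs exact", 6 * t)
```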

  34-36. Linear Kolmogorov Equations: Given $\Sigma : \mathbb{R}^d \to \mathbb{R}^{d \times d}$, $\mu : \mathbb{R}^d \to \mathbb{R}^d$ and an initial value $\varphi : \mathbb{R}^d \to \mathbb{R}$, find $u : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}$ with
  $$\frac{\partial u}{\partial t}(t, x) = \frac{1}{2} \operatorname{Trace}\big(\Sigma(x) \Sigma^T(x) \operatorname{Hess}_x u(t, x)\big) + \mu(x) \cdot \nabla_x u(t, x), \quad (t, x) \in [0, T] \times \mathbb{R}^d,$$
  $$u(0, x) = \varphi(x).$$
  Examples include convection-diffusion equations and the Black-Scholes equation.

  Standard methods such as sparse grid methods, sparse tensor product methods, spectral methods, finite element methods or finite difference methods are incapable of solving such equations in high dimensions ($d = 100$)!
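To preview the Black-Scholes example on the next slides, here is a short worked check (not from the slides): after reversing time, the Black-Scholes equation becomes a Kolmogorov equation with linear coefficients. Take
$$\mu(x) = \mu x, \qquad \Sigma(x) = \sigma \operatorname{diag}(x_1, \ldots, x_d), \qquad \varphi = G,$$
so that
$$\frac{1}{2} \operatorname{Trace}\big(\Sigma(x) \Sigma^T(x) \operatorname{Hess}_x u\big) = \frac{\sigma^2}{2} \sum_{i=1}^d |x_i|^2 \frac{\partial^2 u}{\partial x_i^2}, \qquad \mu(x) \cdot \nabla_x u = \mu \sum_{i=1}^d x_i \frac{\partial u}{\partial x_i}.$$
Setting $v(t, x) := u(T - t, x)$ turns the terminal-value Black-Scholes problem into an initial-value Kolmogorov equation for $v$ with $v(0, x) = G(x)$.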

  37-41. Special Case: Pricing of Financial Derivatives: Given a portfolio consisting of $d$ assets with values $(x_i(t))_{i=1}^d$.

  European max option: at time $T$, exercise the option and receive
  $$G(x) := \max\Big(\max_{i=1,\ldots,d}(x_i - K_i),\, 0\Big).$$

  Black-Scholes (1973): in the absence of correlations the portfolio value $u(t, x)$ satisfies
  $$\frac{\partial}{\partial t} u(t, x) + \mu \sum_{i=1}^d x_i \frac{\partial}{\partial x_i} u(t, x) + \frac{\sigma^2}{2} \sum_{i=1}^d |x_i|^2 \frac{\partial^2}{\partial x_i^2} u(t, x) = 0, \qquad u(T, x) = G(x).$$

  Pricing problem: $u(0, x) = \,?$
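A plain Monte Carlo sketch of the pricing problem (not from the slides; the number of assets, parameter values, strikes and sample size are illustrative assumptions): under the dynamics behind the PDE above, each asset is an independent geometric Brownian motion with drift $\mu$ and volatility $\sigma$, and $u(0, x) = E[G(S_T)]$ with $S_T$ started at $x$ (the PDE contains no discounting term).

```python
# Minimal sketch (assumptions: d, T, mu, sigma, x0, K, sample size).
# Monte Carlo estimate of u(0, x) for the Black-Scholes PDE above: each asset is
# an independent geometric Brownian motion with drift mu and volatility sigma.
import numpy as np

rng = np.random.default_rng(4)
d, T, mu, sigma = 100, 1.0, 0.05, 0.2
x0 = np.full(d, 100.0)                 # initial asset values
K = np.full(d, 110.0)                  # strikes K_i

m = 100_000
xi = rng.standard_normal((m, d))
S_T = x0 * np.exp((mu - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * xi)
payoff = np.maximum(np.max(S_T - K, axis=1), 0.0)   # G(S_T) for the max option

price = payoff.mean()
stderr = payoff.std(ddof=1) / np.sqrt(m)
print(f"u(0, x0) ≈ {price:.3f} ± {1.96 * stderr:.3f}")
```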

  42-44. Kolmogorov PDEs as Learning Problems: For $x \in \mathbb{R}^d$ and $t \in \mathbb{R}_+$ let
  $$Z_t^x := x + \int_0^t \mu(Z_s^x) \, ds + \int_0^t \Sigma(Z_s^x) \, dW_s.$$
  Then (Feynman-Kac) $u(T, x) = E[\varphi(Z_T^x)]$.

  Lemma (Beck-Becker-G-Jaafari-Jentzen (2018)): Let $X \sim \mathcal{U}_{[a, b]^d}$ and let $Y = \varphi(Z_T^X)$. The solution $\hat{U}$ of the mathematical learning problem with data distribution $(X, Y)$ is given by
  $$\hat{U}(x) = u(T, x), \qquad x \in [a, b]^d,$$
  where $u$ solves the corresponding Kolmogorov equation.
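A minimal Euler-Maruyama sketch for generating training pairs $(X, \varphi(Z_T^X))$ (not from the slides): the drift $\mu$, the diffusion $\Sigma$ (taken diagonal here), the payoff $\varphi$, the interval $[a, b]$ and the step count are illustrative assumptions.

```python
# Minimal Euler-Maruyama sketch (assumptions: mu, Sigma, phi, [a, b], step count).
# Simulate Z_T^x for dZ = mu(Z) dt + Sigma(Z) dW and return pairs (X, phi(Z_T^X)).
import numpy as np

rng = np.random.default_rng(5)
d, T, n_steps = 10, 1.0, 50
dt = T / n_steps

mu = lambda z: 0.05 * z                          # drift mu(z) (assumption)
sigma_diag = lambda z: 0.2 * z                   # Sigma(z) = 0.2 * diag(z) (assumption)
phi = lambda z: np.maximum(np.max(z - 110.0, axis=1), 0.0)   # e.g. a max-option payoff

def simulate_training_data(m, a=90.0, b=110.0):
    X = rng.uniform(a, b, size=(m, d))           # X ~ U[a, b]^d
    Z = X.copy()
    for _ in range(n_steps):                     # Euler-Maruyama time stepping
        dW = np.sqrt(dt) * rng.standard_normal((m, d))
        Z = Z + mu(Z) * dt + sigma_diag(Z) * dW  # diagonal Sigma acts elementwise
    return X, phi(Z)

X_train, Y_train = simulate_training_data(100_000)
print(X_train.shape, Y_train.shape, Y_train.mean())
```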

  45. Solving linear Kolmogorov Equations by means of Neural Network Based Learning

  46-52. The Vanilla DL Paradigm:
  - Every image is given as a $28 \times 28$ matrix $x \in \mathbb{R}^{28 \times 28} \simeq \mathbb{R}^{784}$.
  - Every label is given as a 10-dimensional vector $y \in \mathbb{R}^{10}$ describing the 'probability' of each digit.
  - Given labeled training data $(x_i, y_i)_{i=1}^m \subset \mathbb{R}^{784} \times \mathbb{R}^{10}$.
  - Fix a network architecture, e.g., the number of layers (for example $L = 3$) and the numbers of neurons ($N_1 = 30$, $N_2 = 30$).
  - The learning goal is to find the empirical regression function $f_z \in \mathcal{H}_\sigma(784, 30, 30, 10)$.
  - Typically solved by stochastic first-order optimization methods.
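A minimal sketch of this paradigm in TensorFlow/Keras (not from the slides): the hypothesis class $\mathcal{H}_\sigma(784, 30, 30, 10)$ realized as a fully connected network and fitted by a stochastic first-order method. The activation functions, loss, optimizer and epoch count are illustrative choices, and MNIST-style data is assumed to be available via `tf.keras.datasets`.

```python
# Minimal sketch (assumptions: activations, loss, optimizer, epochs, data source).
# The architecture H_sigma(784, 30, 30, 10) as a fully connected network, trained
# by stochastic gradient descent on the empirical (squared) risk.
import tensorflow as tf

(x_tr, y_tr), _ = tf.keras.datasets.mnist.load_data()
x_tr = x_tr.reshape(-1, 784).astype("float32") / 255.0      # flatten 28x28 -> 784
y_tr = tf.keras.utils.to_categorical(y_tr, 10)               # labels as 10-dim vectors

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(30, activation="relu"),             # N_1 = 30
    tf.keras.layers.Dense(30, activation="relu"),             # N_2 = 30
    tf.keras.layers.Dense(10, activation="softmax"),          # 10 output 'probabilities'
])
model.compile(optimizer="sgd", loss="mse")                    # empirical squared risk
model.fit(x_tr, y_tr, batch_size=32, epochs=5)
```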

  53. Description of Image Content (ImageNet Challenge)

  54-57. Deep Learning Algorithm
  1. Generate training data $z = (x_i, y_i)_{i=1}^m \overset{\text{iid}}{\sim} (X, \varphi(Z_T^X))$ by simulating $Z_T^X$ with the Euler-Maruyama scheme.
  2. Apply the deep learning paradigm to this training data, meaning that (i) we pick a network architecture $(N_0 = d, N_1, \ldots, N_L = 1)$ and let $\mathcal{H} = \mathcal{H}_\sigma(N_0, \ldots, N_L)$, and (ii) attempt to approximately compute
  $$\hat{U}_{\mathcal{H}, z} = \operatorname*{argmin}_{U \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m (U(x_i) - y_i)^2$$
  in TensorFlow.
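A compact end-to-end sketch of the two steps (not from the slides): simulate training data with Euler-Maruyama and minimize the empirical risk over a fully connected network in TensorFlow. The drift, diffusion, payoff, architecture and all hyperparameters are illustrative assumptions, not values prescribed by the presentation.

```python
# Minimal end-to-end sketch (assumptions: drift, diffusion, payoff, architecture,
# optimizer, all hyperparameters).  Step 1 simulates (X, phi(Z_T^X)); step 2 trains
# a network U_hat so that U_hat(x) ≈ u(T, x) on the sampling cube [a, b]^d.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(6)
d, T, n_steps = 10, 1.0, 50
dt = T / n_steps

def simulate(m, a=90.0, b=110.0):
    # Step 1: X ~ U[a, b]^d and Y = phi(Z_T^X) via Euler-Maruyama (coefficients assumed)
    X = rng.uniform(a, b, size=(m, d)).astype("float32")
    Z = X.copy()
    for _ in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal((m, d)).astype("float32")
        Z = Z + 0.05 * Z * dt + 0.2 * Z * dW
    Y = np.maximum(np.max(Z - 110.0, axis=1), 0.0).astype("float32")
    return X, Y

X_train, Y_train = simulate(500_000)

# Step 2: empirical risk minimization over a fully connected architecture in TensorFlow
model = tf.keras.Sequential([
    tf.keras.Input(shape=(d,)),                      # N_0 = d inputs
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dense(1),                        # N_L = 1 output: U(x) ≈ u(T, x)
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
model.fit(X_train, Y_train, batch_size=1024, epochs=10)

print(model.predict(X_train[:5]).ravel(), Y_train[:5])
```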
