Structure preservation in (some) deep learning architectures




1. Structure preservation in (some) deep learning architectures
Brynjulf Owren, Department of Mathematical Sciences, NTNU, Trondheim, Norway
LMS-Bath Symposium 2020
Joint work with: Martin Benning, Elena Celledoni, Matthias Ehrhardt, Christian Etmann, Carola-Bibiane Schönlieb and Ferdia Sherry

2. Main sources for this talk
• Benning, Martin; Celledoni, Elena; Ehrhardt, Matthias J.; Owren, Brynjulf; Schönlieb, Carola-Bibiane. Deep Learning as Optimal Control Problems: Models and Numerical Methods. J. Comput. Dyn. 6 (2019), no. 2, 171–198.
• Celledoni, Elena; Ehrhardt, Matthias J.; Etmann, Christian; McLachlan, Robert I.; Owren, Brynjulf; Schönlieb, Carola-Bibiane; Sherry, Ferdia. Structure preserving deep learning. arXiv:2006.03364 (June 2020).

3. Neural networks as a discrete dynamical system
Neural network layers: $\varphi_k : X_k \times \Theta_k \to X_{k+1}$, where $\Theta_k$ is the parameter space of layer $k$ and $X_k$ is the $k$-th feature space. The full neural network $\Psi : X \times \Theta \to Y$, $(x, \theta) \mapsto z_K$, can then be defined via the iteration
$$z_0 = x, \qquad z_{k+1} = \varphi_k(z_k, \theta_k), \quad k = 0, \dots, K-1.$$
An extra final layer may be needed: $\eta : X_K \times \Theta_K \to Y$. In this talk, $X_k = X$ for all $k$.
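A minimal NumPy sketch of this iteration; the concrete layer map `phi` and the parameter layout are illustrative assumptions, not choices made in the talk:

```python
import numpy as np

def network(x, layers):
    """Evaluate Psi(x, theta) via z_0 = x, z_{k+1} = phi_k(z_k, theta_k)."""
    z = x
    for phi, theta in layers:   # one (phi_k, theta_k) pair per layer
        z = phi(z, theta)
    return z

# Illustrative layer: phi(z, (A, b)) = tanh(A z + b)
phi = lambda z, theta: np.tanh(theta[0] @ z + theta[1])
```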

4. Training the neural network
Training data: $(x_n, y_n)_{n=1}^{N} \subset X \times Y$. Training the network amounts to minimising the loss function
$$\min_{\theta \in \Theta} E(\theta) = \frac{1}{N} \sum_{n=1}^{N} L_n(\Psi(x_n, \theta)) + R(\theta),$$
where
• $L_n(y) : Y \to \mathbb{R}_\infty$ is the loss for a specific data point,
• $R : \Theta \to \mathbb{R}_\infty$ acts as a regulariser which penalises and constrains unwanted solutions.
We can define the loss over a batch of $N$ data points in terms of the final layer as
$$E(z; \theta) = \frac{1}{N} \sum_{n=1}^{N} L_n(\eta(z_n, \theta)) + R(\theta).$$
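As a sketch, the batch loss can be coded directly from the formula; the squared-error choice of $L_n$ and the stand-ins `Psi` and `R` below are assumptions, not the talk's choices:

```python
import numpy as np

def batch_loss(theta, data, Psi, R):
    """E(theta) = (1/N) sum_n L_n(Psi(x_n, theta)) + R(theta),
    with L_n taken here as squared error (an assumption)."""
    return np.mean([0.5 * np.sum((Psi(x, theta) - y) ** 2)
                    for x, y in data]) + R(theta)
```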

5. ResNet model (He et al. (2016))
$\Psi : X \times \Theta \to X$, $\Psi(x, \theta) = z_K$, given by the iteration
$$z_0 = x, \qquad z_{k+1} = z_k + \sigma(A_k z_k + b_k), \quad k = 0, \dots, K-1, \qquad y = \eta(w^T z_K + \mu).$$
• $\sigma$ is a nonlinear activation function, a scalar function acting element-wise on vectors.
• $\theta_k = (A_k, b_k)$ for $k \le K-1$, and $\theta_K = (w, \mu)$.
The ResNet layers can be seen as a time stepper for the ODE
$$\dot z = \sigma(A(t) z + b(t)), \quad t \in [0, T].$$
It is the explicit Euler method with stepsize $h = 1$.
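Read as a numerical method, each ResNet layer is one explicit Euler step; a sketch with a general stepsize $h$, where $h = 1$ recovers the ResNet above (names are illustrative):

```python
import numpy as np

def resnet_forward(x, params, h=1.0, sigma=np.tanh):
    """Each layer z_{k+1} = z_k + h*sigma(A_k z_k + b_k) is an explicit Euler
    step of z' = sigma(A(t) z + b(t)); the slide's ResNet has h = 1."""
    z = x
    for A, b in params:
        z = z + h * sigma(A @ z + b)
    return z
```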

6. Activations – examples
$$\sigma_1(x) = \tanh x, \qquad \sigma_2(x) = \max(0, x) \quad \text{(ReLU)}.$$
[Figure: plots of $\sigma_1(x) = \tanh x$ with $\sigma_1'(x) = 1 - \tanh^2 x$, and $\sigma_2(x) = \max(0, x)$ with $\sigma_2'(x) = \mathrm{Heaviside}(x)$, on $[-4, 4]$.]
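The two example activations and their (a.e.) derivatives from the figure, written out in NumPy:

```python
import numpy as np

sigma1  = np.tanh                             # sigma_1(x) = tanh x
dsigma1 = lambda x: 1.0 - np.tanh(x) ** 2     # sigma_1'(x) = 1 - tanh^2 x
sigma2  = lambda x: np.maximum(0.0, x)        # sigma_2(x) = max(0, x), ReLU
dsigma2 = lambda x: np.heaviside(x, 0.0)      # sigma_2'(x) = Heaviside(x) a.e.
```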

7. The continuous optimal control problem – summarised
$$\min_{(\theta, z) \in \Theta \times X^N} E(\theta, z) = \frac{1}{N} \sum_{n=1}^{N} L_n(z_n(T)) + R(\theta)$$
such that
$$\dot z_n = f(z_n, \theta(t)), \qquad z_n(0) = x_n, \quad n = 1, \dots, N.$$

8. Training as an Optimal Control Problem
The first order optimality conditions can be phrased as a Hamiltonian boundary value problem (Benning et al. (2020)). Define
$$H(z, p; \theta) = \langle p, f(z; \theta) \rangle.$$
Solve
$$\dot z = \frac{\partial H}{\partial p}, \qquad \dot p = -\frac{\partial H}{\partial z}, \qquad 0 = \frac{\partial H}{\partial \theta},$$
with boundary conditions
$$z(0) = x, \qquad p(T) = \left. \frac{\partial L}{\partial z} \right|_{t=T}.$$
For ResNet, $f(z; \theta) = \sigma(A(t) z + b(t))$, and we shall discuss other alternative vector fields $f$.
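For the ResNet vector field, the right-hand sides of this Hamiltonian system take a concrete form; a sketch (the tanh activation and freezing $A$, $b$ at one time are assumptions for illustration):

```python
import numpy as np

def hamiltonian_rhs(z, p, A, b):
    """For H(z, p; theta) = <p, sigma(A z + b)> with sigma = tanh:
    z' =  dH/dp = sigma(A z + b),
    p' = -dH/dz = -A^T (sigma'(A z + b) * p)."""
    s = A @ z + b
    zdot = np.tanh(s)
    pdot = -A.T @ ((1.0 - np.tanh(s) ** 2) * p)
    return zdot, pdot
```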

9. Solving the HBVP
Standard procedure:
  Initial guess $\theta^{(0)}$
  while not converged:
    Sweep forward $\dot z = f(z; \theta^{(i)})$ to get $z_1, \dots, z_K$, with $z_k = \varphi(z_{k-1})$
    Backpropagate on $\dot p = -Df(z)^T p$ to obtain $\nabla_\theta E$ (see the sketch below)
    Update by some descent method, e.g. $\theta^{(i+1)} = \theta^{(i)} - \tau \nabla_\theta E(\theta^{(i)})$
• Chen et al. (2018) suggest using a black-box solver: obtain $z(T)$ and then solve for $(z(t), p(t))$ backwards in time simultaneously to save memory.
• This is problematic for various reasons: no explicit solver satisfies the first order optimality conditions, and there are stability issues.
• Gholami et al. (2019) amend the problem with a checkpointing method, so that only forward sweeps through the feature spaces are needed. Again, first order optimality is not so clear.
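A minimal sketch of one iteration of this procedure for the ResNet model, using explicit Euler, squared-error loss, and no regulariser (all assumptions); the backward sweep is exact backpropagation through the forward steps:

```python
import numpy as np

def train_step(x, y, As, bs, h=1.0, tau=0.1):
    """Forward sweep, adjoint (backward) sweep, and one gradient descent update
    for z_{k+1} = z_k + h*tanh(A_k z_k + b_k), with L(z_K) = 0.5*||z_K - y||^2."""
    K = len(As)
    zs = [x]
    for k in range(K):                       # forward sweep
        zs.append(zs[-1] + h * np.tanh(As[k] @ zs[-1] + bs[k]))
    p = zs[-1] - y                           # p_K = dL/dz_K
    for k in reversed(range(K)):             # backward (adjoint) sweep
        s = As[k] @ zs[k] + bs[k]
        w = (1.0 - np.tanh(s) ** 2) * p      # sigma'(s_k) * p_{k+1}
        p_prev = p + h * As[k].T @ w         # p_k, using the pre-update A_k
        As[k] = As[k] - tau * h * np.outer(w, zs[k])   # dE/dA_k = h * w z_k^T
        bs[k] = bs[k] - tau * h * w                    # dE/db_k = h * w
        p = p_prev
    return As, bs
```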

10. DTO vs OTD
Two options:
1. DTO. Discretise the forward ODE $\dot z = f(z; \theta)$ by some numerical method $\varphi$. Then solve the discrete optimisation problem, based on the gradients $\nabla_{\theta_k} E(z_K; \theta_K)$.
2. OTD. Solve the Hamiltonian boundary value problem by a numerical method $\bar\varphi : (z_k, p_k) \mapsto (\varphi(z_k), p_{k+1})$ and compute $\frac{\partial \varphi}{\partial \theta}(z_k, \theta_k)^T p_{k+1}$ for each $k$.
Theorem (Benning et al. 2020, Sanz-Serna 2015). DTO and OTD are equivalent if the overall method $\bar\varphi$ for the Hamiltonian boundary value problem preserves quadratic invariants (a.k.a. symplectic). That is,
$$\nabla_{\theta_k} E(z_K; \theta_K) = \frac{\partial \varphi}{\partial \theta}(z_k, \theta_k)^T p_{k+1}.$$

11. An illustration
[Figure.]

12. Generalisation mode – the forward problem
Once the network has been trained, the parameters $\theta(t)$ are known. Generalisation (the forward problem) becomes a non-autonomous initial value problem
$$\dot z = \bar f(t, z) := f(z; \theta(t)), \qquad z(0) = x.$$
• Arguably, one may ask for good "stability properties" for the forward problem. Haber & Ruthotto (2017), Zhang & Schaeffer (2020).
• Stability may also be desired in "backward time", Chang et al. (2018).
What is our freedom in choosing good models?
• Restrict the parameter space $\Theta$ ($A$ skew-symmetric, negative definite, manifold-valued, ...)
• Alter the structure of the vector field $f$ (Hamiltonian, dissipative, measure preserving, ...)
• Apply an integrator with good stability properties

13. Notions of stability
• Linear stability analysis (Haber and Ruthotto): for a nonlinear vector field $f(t, z)$, look at the spectrum of $J(t, z) := \frac{\partial f}{\partial z}(t, z)$ and require $\operatorname{Re} \lambda_i \le 0$. Works only locally and only with autonomous vector fields.
• Nonlinear stability analysis: look at norm contractivity/growth,
$$\|z_2(t) - z_1(t)\| \le C(t)\, \|z_2(0) - z_1(0)\|.$$
Such conditions can be ensured by imposing Lipschitz-type conditions. E.g. for inner product spaces: if for some $\nu \in \mathbb{R}$
$$\langle f(t, z_2) - f(t, z_1),\, z_2 - z_1 \rangle \le \nu\, \|z_2 - z_1\|_2^2 \qquad \forall\, z_1, z_2,\ t \in [0, T],$$
then
$$\|z_2(t) - z_1(t)\| \le e^{\nu t}\, \|z_2(0) - z_1(0)\|.$$
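A quick empirical illustration of the one-sided Lipschitz condition: for a linear field $f(z) = Mz$, the sharp $\nu$ is the largest eigenvalue of the symmetric part of $M$, which a sampling estimate approaches from below (the field and sample size below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3)) - 2.0 * np.eye(3)   # illustrative linear field
f = lambda z: M @ z

# <f(z2) - f(z1), z2 - z1> / ||z2 - z1||^2, maximised over random pairs
pairs = [(rng.standard_normal(3), rng.standard_normal(3)) for _ in range(10000)]
nu_est = max((f(z2) - f(z1)) @ (z2 - z1) / np.sum((z2 - z1) ** 2)
             for z1, z2 in pairs)
nu_exact = np.linalg.eigvalsh((M + M.T) / 2).max()  # sharp nu for linear f
print(nu_est, nu_exact)                             # nu_est <= nu_exact
```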

14. Example of a stability result (Celledoni et al. (2020))
We consider for simplicity the ODE model
$$\dot z = -A(t)^T \sigma(A(t) z + b(t)) = f(t, z).$$
Here $\dot z = -\nabla_z V$ with $V = \gamma(A(t) z + b(t))^T \mathbf{1}$, where $\gamma' = \sigma$.
Theorem.
1. Let $V(t, z)$ be twice differentiable and convex in the second argument. Then the vector field $f(t, z) = -\nabla V(t, z)$ satisfies a one-sided Lipschitz condition with $\nu \le 0$.
2. Suppose that $\sigma(s)$ is absolutely continuous and $0 \le \sigma'(s) \le 1$ a.e. in $\mathbb{R}$. Then the one-sided Lipschitz condition holds for any $A(t)$ and $b(t)$ with $-\mu_*^2 \le \nu_\sigma \le 0$, where $\mu_* = \min_t \mu(t)$ and $\mu(t)$ is the smallest singular value of $A(t)$. In particular, $\nu_\sigma = -\mu_*^2$ is obtained when $\sigma(s) = s$.
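A numerical check of the contractivity this gives, using tanh (so $0 \le \sigma' \le 1$ as in part 2 of the theorem) and frozen $A$, $b$, both assumptions for the demo; with a small enough Euler step the distance between two trajectories should not grow:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)
f = lambda z: -A.T @ np.tanh(A @ z + b)   # the gradient-flow field above

h, steps = 0.01, 500
z1, z2 = rng.standard_normal(4), rng.standard_normal(4)
d0 = np.linalg.norm(z2 - z1)
for _ in range(steps):                    # explicit Euler on both trajectories
    z1, z2 = z1 + h * f(z1), z2 + h * f(z2)
print(np.linalg.norm(z2 - z1) <= d0)      # expected True: nu <= 0, e^{nu t} <= 1
```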

15. Hamiltonian architectures (Chang et al. (2018))
Let $H(t, z, p) = T(t, p) + V(t, z)$. Let $\gamma_i : \mathbb{R} \to \mathbb{R}$ be such that $\gamma_i' = \sigma_i$, $i = 1, 2$, and set
$$T(t, p) = \gamma_1(A_1(t) p + b_1(t))^T \mathbf{1}, \qquad V(t, z) = \gamma_2(A_2(t) z + b_2(t))^T \mathbf{1},$$
where $\mathbf{1} = (1, \dots, 1)^T$. This leads to models of the form
$$\dot z = \partial_p H = A_1(t)^T \sigma_1(A_1(t) p + b_1(t)),$$
$$\dot p = -\partial_z H = -A_2(t)^T \sigma_2(A_2(t) z + b_2(t)).$$
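Because this Hamiltonian is separable, the symplectic Euler method is explicit for it; a sketch of one step, freezing $A_i$, $b_i$ over the step (an assumption about the time discretisation):

```python
import numpy as np

def symplectic_euler_step(z, p, A1, b1, A2, b2, h, s1=np.tanh, s2=np.tanh):
    """One symplectic Euler step for H = T(p) + V(z):
    update p with the old z, then z with the new p."""
    p = p - h * A2.T @ s2(A2 @ z + b2)    # p' = -dH/dz
    z = z + h * A1.T @ s1(A1 @ p + b1)    # z' =  dH/dp
    return z, p
```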

16. Two particular Hamiltonian cases
1. A simple case is obtained by choosing $\sigma_1(s) := s$, $A_1(t) \equiv I$, $b_1(t) \equiv 0$ and $\sigma_2(s) := \sigma(s)$; then $\dot z = p$, and eliminating $p$ yields the second order ODE
$$\ddot z = -\partial_z V = -A(t)^T \sigma(A(t) z + b(t)).$$
2. A second example:
$$\dot z = A(t)^T \sigma(A(t) p + b(t)), \qquad \dot p = -A(t)^T \sigma(A(t) z + b(t)).$$
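For case 1, the second order ODE can be integrated with Störmer–Verlet (leapfrog), which is symplectic; a sketch, again freezing $A$, $b$ over the step:

```python
import numpy as np

def verlet_step(z, p, A, b, h, sigma=np.tanh):
    """One Stormer-Verlet step for z'' = -A^T sigma(A z + b),
    written as z' = p, p' = -A^T sigma(A z + b)."""
    p_half = p - 0.5 * h * A.T @ sigma(A @ z + b)
    z_new  = z + h * p_half
    p_new  = p_half - 0.5 * h * A.T @ sigma(A @ z_new + b)
    return z_new, p_new
```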

17. Non-autonomous Hamiltonian problems
Autonomous problems:
• Two important geometric properties:
  • the flow preserves the Hamiltonian;
  • the flow is symplectic.
• Numerical schemes can be symplectic or energy preserving, with excellent long-time behaviour.
Non-autonomous Hamiltonian problems:
• The situation is less clear; there are at least two ways to interpret the dynamics:
1. 'Autonomise' by adding time as a dependent variable (contact manifold). A preserved two-form can be introduced, $\omega = dp \wedge dq - dH \wedge dt$, but the Hamiltonian is not preserved along the flow.
2. Extend the system by adding time and a conjugate momentum variable $p_t$. Define the extended Hamiltonian $K(q, p, t, p_t) = H(q, p, t) + p_t$ and symplectic form $\Omega = dp \wedge dq + dp_t \wedge dt$.

18. The extended system
$$\dot z = \partial_p H, \qquad \dot p = -\partial_z H, \qquad \dot t = 1, \qquad \dot p_t = -\partial_t H.$$
• An obvious strategy would be to study the dynamics of the extended autonomous Hamiltonian system.
• Unfortunately, it does not give a lot of information: any level set of $K$ is unbounded.
• Chang et al. (2018) report good numerical results with this type of model; I am not aware of any theoretical justification.
• Asorey et al. (1983) contains a number of results on the relations between the dynamics on the contact manifold and the extended manifold [more work to be done in this direction].
• L. O. Jay (2020), Marthinsen & O. (2016) provide conditions for numerical integrators to be canonical in the non-autonomous case.
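A toy illustration of the extended system; the scalar Hamiltonian below is an arbitrary choice, just to show that $K = H + p_t$ stays (approximately) constant along the flow while $H$ itself does not:

```python
import numpy as np

# Toy non-autonomous Hamiltonian H(q, p, t) = 0.5*p^2 + 0.5*cos(t)*q^2
dHdq = lambda q, p, t: np.cos(t) * q
dHdp = lambda q, p, t: p
dHdt = lambda q, p, t: -0.5 * np.sin(t) * q ** 2

q, p, t, pt = 1.0, 0.0, 0.0, 0.0           # K = H + p_t = 0.5 initially
h = 1e-3
for _ in range(10_000):                    # explicit Euler, structure only
    q, p, t, pt = (q + h * dHdp(q, p, t),
                   p - h * dHdq(q, p, t),
                   t + h,
                   pt - h * dHdt(q, p, t))
H = 0.5 * p ** 2 + 0.5 * np.cos(t) * q ** 2
print(H + pt)                              # approx. 0.5, up to O(h) drift
```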
