Neural Networks with Cheap Differential Operators

  1. Neural Networks with Cheap Differential Operators Ricky T. Q. Chen, David Duvenaud

  2. Differential Operators
  • Want to compute operators such as the divergence of a neural network $f : \mathbb{R}^d \to \mathbb{R}^d$: $\nabla \cdot f = \sum_{i=1}^{d} \frac{\partial f_i(x)}{\partial x_i}$
  • Applications: solving PDEs, fitting SDEs, finding fixed points, continuous normalizing flows

  3. Automatic Differentiation (AD)
  • Reverse-mode AD gives cheap vector-Jacobian products: $v^\top \left[\frac{d}{dx} f(x)\right] = \sum_{i=1}^{d} v_i \frac{\partial f_i(x)}{\partial x}$
  • For the full Jacobian, we need d separate passes.
  • In general, the Jacobian diagonal has the same cost as the full Jacobian!
  • We restrict the architecture to allow one-pass diagonal computations.
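For concreteness, here is a minimal PyTorch sketch (my own illustration, not code from the talk) of the naive approach: extracting the Jacobian diagonal of a generic network costs one vector-Jacobian product, i.e. one backward pass, per dimension.

```python
import torch

def jacobian_diagonal_naive(f, x):
    """Return [df_i/dx_i for i = 1..d] using d backward passes."""
    x = x.clone().requires_grad_(True)
    y = f(x)
    diag = []
    for i in range(y.numel()):
        e_i = torch.zeros_like(y)
        e_i[i] = 1.0
        # v^T J with v = e_i recovers the i-th row of the Jacobian.
        row_i, = torch.autograd.grad(y, x, grad_outputs=e_i, retain_graph=True)
        diag.append(row_i[i])
    return torch.stack(diag)

# Example: f(x) = [x_0^2, x_0 * x_1] has Jacobian diagonal [2*x_0, x_0].
f = lambda x: torch.stack([x[0] ** 2, x[0] * x[1]])
print(jacobian_diagonal_naive(f, torch.tensor([3.0, 5.0])))  # tensor([6., 3.])
```

The divergence is then `jacobian_diagonal_naive(f, x).sum()`, and this d-pass cost is exactly what the HollowNet construction below avoids.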

  4. HollowNets
  • Allow efficient computation of dimension-wise derivatives of order k, $D^k_{\text{dim}} f(x)$, with only k backward passes, regardless of the dimension d.
  • Example (k = 1): $D_{\text{dim}} f(x)$ is the Jacobian diagonal.

  5. HollowNet Architecture
  HollowNets are composed of two sub-networks:
  • Hidden units that do not depend on their respective input: $h_i = c_i(x_{-i})$
  • Output units that depend only on their respective input and hidden units: $f_i(x) = \tau_i([x_i, h_i])$
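A rough sketch of this structure (my own simplification, sharing conditioner and transformer weights across dimensions rather than the authors' implementation): the conditioner only ever sees x with coordinate i zeroed out, so h_i carries no dependence on x_i.

```python
import torch
import torch.nn as nn

class HollowNet(nn.Module):
    """Sketch: f_i(x) = tau([x_i, h_i]) with h_i = c(x_{-i})."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.dim = dim
        # Conditioner c: maps the masked input x_{-i} to hidden units h_i.
        self.conditioner = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, hidden))
        # Transformer tau: maps [x_i, h_i] to the scalar output f_i.
        self.transformer = nn.Sequential(
            nn.Linear(1 + hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def hidden(self, x):
        # Row i of the mask zeroes out coordinate i, so dh_i/dx_i = 0.
        masks = 1.0 - torch.eye(self.dim)                # (d, d)
        return self.conditioner(masks * x.unsqueeze(0))  # (d, hidden)

    def forward(self, x):
        h = self.hidden(x)                               # (d, hidden)
        xi = x.unsqueeze(-1)                             # (d, 1)
        return self.transformer(torch.cat([xi, h], -1)).squeeze(-1)  # (d,)
```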

  6. HollowNet Jacobians
  • Can get exact dimension-wise derivatives by disconnecting some dependencies in the backward pass, i.e. detach in PyTorch or stop_gradient in TensorFlow.
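Continuing the sketch above (again my own illustration): detaching h before it enters the transformer removes all cross-dimension dependencies from the graph, so a single backward pass through the sum of the outputs returns the exact Jacobian diagonal.

```python
def jacobian_diagonal_hollow(net, x):
    """Exact Jacobian diagonal of a HollowNet in one backward pass."""
    x = x.clone().requires_grad_(True)
    h = net.hidden(x).detach()            # cut the x -> h -> f dependencies
    xi = x.unsqueeze(-1)
    f = net.transformer(torch.cat([xi, h], -1)).squeeze(-1)
    # Each f_i now depends on x only through x_i, so the gradient of
    # sum(f) w.r.t. x is exactly [df_1/dx_1, ..., df_d/dx_d].
    diag, = torch.autograd.grad(f.sum(), x, create_graph=True)
    return diag
```

Since h_i never depends on x_i, this detached partial derivative equals the true diagonal entry, and higher-order dimension-wise derivatives follow by differentiating `diag` again, one extra backward pass per order (hence `create_graph=True` in the sketch).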

  7. HollowNet Jacobians
  Can factor the Jacobian into a diagonal matrix (dimension-wise dependencies) plus a hollow matrix (all interactions):
  $\frac{d}{dx} f = \underbrace{\frac{\partial}{\partial x} \tau(x, h)}_{\text{diagonal}} + \underbrace{\frac{\partial}{\partial h} \tau(x, h)\, \frac{\partial}{\partial x} h(x)}_{\text{hollow}}$

  8. Application I: Finding Fixed Points
  Root-finding problems ($f(x) = 0$) can be solved using Jacobi-Newton, $x_{t+1} = x_t - [D_{\text{dim}} f(x_t)]^{-1} \odot f(x_t)$, instead of the plain fixed-point update $x_{t+1} = x_t - f(x_t)$.
  • Same solution with faster convergence.
  • We applied this to implicit ODE solvers for solving stiff equations.
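A small sketch of how the cheap diagonal slots into the Jacobi-Newton update (my own illustration, reusing the functions above; `eps` is just a hypothetical guard against division by zero):

```python
def jacobi_newton(net, x, steps=50, eps=1e-8):
    """Solve net(x) = 0 with per-dimension Newton steps."""
    for _ in range(steps):
        x = x.detach()
        fx = net(x)
        diag = jacobian_diagonal_hollow(net, x)
        # x_{t+1} = x_t - [D_dim f(x_t)]^{-1} f(x_t), applied elementwise
        x = x - fx / (diag + eps)
    return x.detach()
```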

  9. Application II: Continuous Normalizing Flows
  • Transforms distributions through an ODE: $\frac{dx(t)}{dt} = f(x(t), t)$
  • Change in density is given by the divergence: $\frac{d \log p(x, t)}{dt} = -\mathrm{tr}\!\left(\frac{d}{dx} f(x)\right) = -\sum_{i=1}^{d} \left[D_{\text{dim}} f(x)\right]_i$
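The divergence term in the density update is then just the sum of the cheap Jacobian diagonal, e.g. (reusing the sketches above; a full flow would plug this into an ODE solver such as torchdiffeq's odeint):

```python
net = HollowNet(dim=2)
x = torch.randn(2)
dlogp_dt = -jacobian_diagonal_hollow(net, x).sum()   # = -tr(df/dx) = -div f
```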

  10. Learning Stochastic Diff Eqs
  • The Fokker-Planck equation describes the density change using $D_{\text{dim}}$ and $D^2_{\text{dim}}$:
  $\frac{\partial p(t, x)}{\partial t} = \sum_{i=1}^{d} \Big[ -(D_{\text{dim}} f)\, p - (\nabla p) \odot f + \tfrac{1}{2} \big( (D^2_{\text{dim}} \operatorname{diag}(g)^2)\, p + 2\, (D_{\text{dim}} \operatorname{diag}(g)^2) \odot (\nabla p) + \operatorname{diag}(g)^2 \odot (D_{\text{dim}} \nabla p) \big) \Big]_i$
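For context, this is the standard Fokker-Planck equation for an SDE $dx = f(x,t)\,dt + \operatorname{diag}(g(x,t))\,dW$ with diagonal diffusion, expanded per dimension (my reconstruction of the intermediate step, not shown on the slide):

```latex
\frac{\partial p}{\partial t}
  = \sum_{i=1}^{d}\Big[-\frac{\partial}{\partial x_i}\big(f_i\,p\big)
      + \tfrac{1}{2}\,\frac{\partial^2}{\partial x_i^2}\big(g_i^2\,p\big)\Big]
  = \sum_{i=1}^{d}\Big[-\frac{\partial f_i}{\partial x_i}\,p
      - f_i\,\frac{\partial p}{\partial x_i}
      + \tfrac{1}{2}\Big(\frac{\partial^2 g_i^2}{\partial x_i^2}\,p
      + 2\,\frac{\partial g_i^2}{\partial x_i}\,\frac{\partial p}{\partial x_i}
      + g_i^2\,\frac{\partial^2 p}{\partial x_i^2}\Big)\Big]
```

Every bracketed term is a dimension-wise derivative, which is why $D_{\text{dim}}$ and $D^2_{\text{dim}}$ evaluate the right-hand side with a constant number of backward passes.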

  11. Takeaways
  • Dimension-wise derivatives are costly for general functions.
  • Restricting to hollow Jacobians gives cheap diagonal gradients.
  • Useful for PDEs, SDEs, normalizing flows, and optimization.
