SLIDE 1

Neural Networks with Cheap Differential Operators

Ricky T. Q. Chen, David Duvenaud

SLIDE 2

Differential Operators

  • Want to compute operators such as divergence:

$\nabla \cdot f = \sum_{i=1}^{d} \frac{\partial f_i(x)}{\partial x_i}$, where $f : \mathbb{R}^d \to \mathbb{R}^d$ is a neural net. (A naive autograd baseline is sketched after the list below.)

  • Solving PDEs
  • Finding fixed points
  • Fitting SDEs
  • Continuous normalizing flows
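
As a baseline, the divergence above can be computed with plain reverse-mode AD, but only at the cost of one backward pass per dimension. A minimal PyTorch sketch (the toy `f` and the `divergence_naive` helper are illustrative, not from the paper):

```python
import torch

def divergence_naive(f, x):
    """Divergence of f at x: d backward passes, one per output dimension."""
    x = x.detach().requires_grad_(True)
    y = f(x)
    div = 0.0
    for i in range(x.shape[0]):
        # Gradient of the i-th output w.r.t. all inputs; retain the graph
        # so the remaining outputs can still be differentiated.
        grad_i = torch.autograd.grad(y[i], x, retain_graph=True)[0]
        div = div + grad_i[i]  # keep only the diagonal entry
    return div

# Toy example: f(x) = [x0^2, x0*x1] has divergence 2*x0 + x0 = 3 at x = (1, 2).
f = lambda x: torch.stack([x[0] ** 2, x[0] * x[1]])
print(divergence_naive(f, torch.tensor([1.0, 2.0])))  # tensor(3.)
```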
SLIDE 3

Automatic Differentiation (AD)

Reverse-mode AD gives cheap vector-Jacobian products:

  • For the full Jacobian, we need d separate passes.
  • In general, the Jacobian diagonal has the same cost as the full Jacobian!
  • We restrict the architecture to allow one-pass diagonal computations.

$v^\top \left[\frac{d}{dx} f(x)\right] = \sum_{i=1}^{d} v_i\, \frac{\partial f_i(x)}{\partial x}$

(a weighted sum of the rows of the Jacobian, in a single backward pass)
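
For contrast, one vector-Jacobian product costs one backward pass. A minimal sketch (names are illustrative):

```python
import torch

def vjp(f, x, v):
    """One reverse-mode pass: returns v^T (df/dx)."""
    x = x.detach().requires_grad_(True)
    y = f(x)
    return torch.autograd.grad(y, x, grad_outputs=v)[0]

f = lambda x: torch.stack([x[0] ** 2, x[0] * x[1]])
print(vjp(f, torch.tensor([1.0, 2.0]), torch.tensor([1.0, 0.0])))  # Jacobian row 1: [2., 0.]
```

Extracting the diagonal this way needs the d one-hot vectors $e_1, \ldots, e_d$, hence d passes.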

SLIDE 4

HollowNets

Allow efficient computation of dimension-wise derivatives of order k:

$D^k_{\text{dim}} f(x) := \left[\frac{\partial^k f_1(x)}{\partial x_1^k}, \ldots, \frac{\partial^k f_d(x)}{\partial x_d^k}\right]^\top$

with only $k$ backward passes, regardless of dimension. Example ($k = 1$): $D^{1}_{\text{dim}} f(x)$ is the diagonal of the Jacobian $\frac{d}{dx} f(x)$.

SLIDE 5

HollowNet Architecture

HollowNets are composed of two sub-networks (sketched in code below):

  • Hidden units which don’t depend on their respective input: $h_i = c_i(x_{-i})$
  • Output units which depend only on their respective input and hidden units: $f_i(x) = \tau_i([x_i, h_i])$
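
A minimal sketch of the architecture, assuming one shared conditioner applied under d masks (the layer sizes, masking scheme, and weight sharing are illustrative choices, not the paper's exact setup):

```python
import torch
import torch.nn as nn

class HollowNet(nn.Module):
    """f_i(x) = tau_i([x_i, h_i]) with h_i = c_i(x_{-i})."""

    def __init__(self, d, width=64):
        super().__init__()
        self.d = d
        # Conditioner c: fed x with the i-th coordinate masked to zero,
        # so h_i cannot depend on x_i.
        self.cond = nn.Sequential(nn.Linear(d, width), nn.Tanh(), nn.Linear(width, 1))
        # Transformer tau: maps [x_i, h_i] to f_i(x), shared across dimensions.
        self.trans = nn.Sequential(nn.Linear(2, width), nn.Tanh(), nn.Linear(width, 1))

    def hidden(self, x):
        masks = 1.0 - torch.eye(self.d)        # row i zeroes out x_i
        h = self.cond(masks * x.unsqueeze(0))  # (d, 1): one hidden unit per dim
        return h.squeeze(-1)                   # (d,)

    def forward(self, x, h=None):
        if h is None:
            h = self.hidden(x)
        return self.trans(torch.stack([x, h], dim=-1)).squeeze(-1)  # (d,)
```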

SLIDE 6

HollowNet Jacobians

Can get exact dimension-wise derivatives by disconnecting some dependencies in the backward pass, i.e. detach in PyTorch or stop_gradient in TensorFlow.
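
A sketch of this trick using the HollowNet class above: detach h once, and a single backward pass of the summed output returns the exact Jacobian diagonal:

```python
import torch

def jacobian_diagonal(net, x):
    """Exact diag(df/dx) in one backward pass via the detach trick."""
    x = x.detach().requires_grad_(True)
    h = net.hidden(x).detach()  # disconnect gradients through h(x)
    y = net(x, h)               # f_i now reaches x only through x_i
    # With h detached, d(sum_i f_i)/dx_j = df_j/dx_j, the Jacobian diagonal.
    return torch.autograd.grad(y.sum(), x)[0]
```

Detaching h is exact here, not an approximation: since $h_i$ never depends on $x_i$, the blocked paths contribute nothing to the diagonal.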

SLIDE 7

HollowNet Jacobians

Can factor Jacobian into:

  • A diagonal matrix (dimension-wise dependencies).
  • A hollow matrix (all cross-dimension interactions).

$\frac{d}{dx} f = \underbrace{\frac{\partial}{\partial x} \tau(x, h)}_{\text{diagonal}} + \underbrace{\frac{\partial}{\partial h} \tau(x, h)\,\frac{\partial}{\partial x} h(x)}_{\text{hollow}}$

SLIDE 8

Application I: Finding Fixed Points

Root finding problems can be solved using Jacobi-Newton:

(f(x) = 0)

xt+1 = xt − f(x)

  • Same solution with faster

convergence.

  • We applied to implicit ODE

solvers for solving stiff equations.

xt+1 = xt − [Ddimf(x)]

−1 f(x)
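
A sketch of the resulting solver, reusing the `jacobian_diagonal` helper above (step count and tolerance are illustrative; it assumes the diagonal stays away from zero):

```python
import torch

def jacobi_newton(net, x0, steps=50, tol=1e-6):
    """Solve f(x) = 0 by x <- x - f(x) / diag(df/dx), elementwise."""
    x = x0.clone()
    for _ in range(steps):
        fx = net(x)
        if fx.abs().max() < tol:
            break
        diag = jacobian_diagonal(net, x)  # one backward pass (slide 6)
        x = x - fx / diag                 # elementwise Newton step
    return x
```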

SLIDE 9

Application II: Continuous Normalizing Flows

  • Transforms distributions through an ODE: $\frac{dx(t)}{dt} = f(x(t), t)$
  • Change in density given by the divergence:

$\frac{d \log p(x(t))}{dt} = -\,\mathrm{tr}\!\left(\frac{d}{dx} f(x)\right) = -\sum_{i=1}^{d} \left[D_{\text{dim}} f(x)\right]_i$
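
With the one-pass diagonal, the divergence term costs a single backward pass per solver step. A sketch of one explicit Euler step of the density dynamics (the Euler discretization, step size, and time-independent `net` are illustrative; in practice this is handed to an ODE solver):

```python
import torch

def cnf_euler_step(net, x, logp, dt=0.01):
    """One Euler step of a CNF: evolve the state and its log-density."""
    dx = net(x)                            # dynamics dx/dt = f(x)
    div = jacobian_diagonal(net, x).sum()  # tr(df/dx) = sum_i [D_dim f(x)]_i
    return x + dt * dx, logp - dt * div    # d log p / dt = -tr(df/dx)
```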

SLIDE 10

Learning Stochastic Diff Eqs

  • Fokker-Planck describes the density change of an SDE with drift $f$ and diffusion $g$, using $D_{\text{dim}}$ and $D^2_{\text{dim}}$:

$\frac{\partial p(t,x)}{\partial t} = \sum_{i=1}^{d} \Big[ -(D_{\text{dim}} f)\,p - (\nabla p) \odot f + (D^2_{\text{dim}}\,\mathrm{diag}(g))\,p + 2\,(D_{\text{dim}}\,\mathrm{diag}(g)) \odot (\nabla p) + \tfrac{1}{2}\,\mathrm{diag}(g)^2 \odot (D_{\text{dim}} \nabla p) \Big]_i$
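
The second-order terms follow the same pattern: with h detached, two backward passes give both $D_{\text{dim}} f$ and $D^2_{\text{dim}} f$ regardless of dimension. A sketch (create_graph keeps the first pass differentiable; helper names are illustrative):

```python
import torch

def dim_derivatives_order2(net, x):
    """(D_dim f, D^2_dim f): two backward passes, independent of d."""
    x = x.detach().requires_grad_(True)
    h = net.hidden(x).detach()
    y = net(x, h)
    # First pass: Jacobian diagonal, kept differentiable via create_graph.
    d1 = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    # Second pass: with h detached, d1_j depends on x only through x_j,
    # so differentiating the sum again yields d^2 f_j / dx_j^2.
    d2 = torch.autograd.grad(d1.sum(), x)[0]
    return d1, d2
```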

SLIDE 11

Takeaways

  • Dimension-wise derivatives are costly for general functions.
  • Restricting to hollow Jacobians gives cheap diagonal grads.
  • Useful for PDEs, SDEs, normalizing flows, and optimization.