SLIDE 1
Neural Networks with Cheap Differential Operators
Ricky T. Q. Chen, David Duvenaud
SLIDE 2
Differential Operators
- Want to compute operators such as divergence:
∇ · f = ∑_{i=1}^{d} ∂f_i(x)/∂x_i,   where f : ℝ^d → ℝ^d is a neural net.
- Solving PDEs
- Finding fixed points
- Fitting SDEs
- Continuous normalizing flows
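For intuition, here is a minimal sketch (assuming PyTorch; the small fully-connected net is just a stand-in for f) of the straightforward way to get the divergence of a generic f: build the full Jacobian and take its trace.

```python
import torch

# Toy stand-in for a neural net f : R^d -> R^d.
d = 5
f = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.Tanh(), torch.nn.Linear(d, d))
x = torch.randn(d)

jac = torch.autograd.functional.jacobian(f, x)   # full (d, d) Jacobian
divergence = torch.trace(jac)                    # sum_i of df_i/dx_i
print(divergence)
```

This works, but materializing the full Jacobian is exactly the cost this talk is about avoiding.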
SLIDE 3
Automatic Differentiation (AD)
Reverse-mode AD gives cheap vector-Jacobian products:

  vᵀ [∂f(x)/∂x] = ∑_{i=1}^{d} v_i ∂f_i(x)/∂x = v_1 ∂f_1(x)/∂x + ⋯ + v_d ∂f_d(x)/∂x

- For the full Jacobian, we need d separate backward passes.
- In general, the Jacobian diagonal has the same cost as the full Jacobian!
- We restrict the architecture to allow one-pass diagonal computation.
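As a concrete illustration (a sketch assuming PyTorch; the helper name jacobian_diagonal_naive is mine, not the authors'), extracting just the diagonal of the Jacobian for a generic f still takes one vector-Jacobian product per dimension:

```python
import torch

def jacobian_diagonal_naive(f, x):
    """diag(df/dx) via one vector-Jacobian product per dimension: O(d) backward passes."""
    x = x.clone().requires_grad_(True)
    y = f(x)
    diag = []
    for i in range(y.numel()):
        v = torch.zeros_like(y)
        v[i] = 1.0                                             # selects row i of the Jacobian
        (row_i,) = torch.autograd.grad(y, x, v, retain_graph=True)
        diag.append(row_i[i])                                  # keep only the diagonal entry
    return torch.stack(diag)

d = 5
f = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.Tanh(), torch.nn.Linear(d, d))
print(jacobian_diagonal_naive(f, torch.randn(d)))
```

HollowNets (next slides) are designed so that this loop collapses to a single backward pass.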
SLIDE 4
HollowNets
- Allow efficient computation of dimension-wise derivatives of order k:

  D^k_dim f(x) := [∂^k f_1(x)/∂x_1^k, …, ∂^k f_d(x)/∂x_d^k]ᵀ

  with only k backward passes, regardless of dimension.
- Example: for k = 1, D_dim f(x) is the diagonal of the Jacobian (rather than the full Jacobian).
SLIDE 5
HollowNet Architecture
HollowNets are composed of two sub-networks:
- Hidden units which don’t depend on their respective input: h_i = c_i(x_{−i})
- Output units which depend only on their respective input and hidden units: f_i(x) = τ_i([x_i, h_i])
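Below is a minimal sketch of this architecture (assuming PyTorch; the layer sizes and the masking used to build x_{−i} are illustrative choices, not the authors' implementation):

```python
import torch
import torch.nn as nn

class HollowNet(nn.Module):
    """Sketch: conditioner c sees x with x_i masked out; transformer tau sees only [x_i, h_i]."""
    def __init__(self, d, hidden=32):
        super().__init__()
        self.d = d
        self.conditioner = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.transformer = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x, detach_h=False):
        # x: (d,). Row i of x_minus_i is x with its i-th coordinate zeroed out,
        # so h_i has no dependence on x_i (the Jacobian of h w.r.t. x is hollow).
        x_minus_i = (1.0 - torch.eye(self.d)) * x
        h = self.conditioner(x_minus_i)                  # (d, 1), h_i = c(x_{-i})
        if detach_h:
            h = h.detach()                               # used later for one-pass derivatives
        xi = x.unsqueeze(-1)                             # (d, 1)
        return self.transformer(torch.cat([xi, h], dim=-1)).squeeze(-1)   # f_i = tau([x_i, h_i])

net = HollowNet(d=4)
print(net(torch.randn(4)))
```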
SLIDE 6
HollowNet Jacobians
- Can get exact dimension-wise derivatives by disconnecting some dependencies in the backward pass,
  i.e. detach in PyTorch or stop_gradient in TensorFlow.
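A sketch of the trick (assuming PyTorch; hollow_f is a toy closed-form hollow function, not the authors' network): detaching h removes the cross-dimension paths, so a single backward pass through the sum of f returns exactly the Jacobian diagonal.

```python
import torch

def hollow_f(x, detach_h=False):
    h = x.sum() - x                       # h_i = sum_{j != i} x_j  (no dependence on x_i)
    if detach_h:
        h = h.detach()                    # cut the h -> x dependencies in the backward pass
    return torch.tanh(x * h)              # f_i = tanh(x_i * h_i)

x = torch.randn(4, requires_grad=True)
(diag,) = torch.autograd.grad(hollow_f(x, detach_h=True).sum(), x)   # one backward pass

# Check against the diagonal of the full Jacobian (which takes d backward passes).
full_jac = torch.autograd.functional.jacobian(hollow_f, x)
print(torch.allclose(diag, torch.diagonal(full_jac)))                # True
```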
SLIDE 7
HollowNet Jacobians
Can factor the Jacobian into:
- A diagonal matrix (dimension-wise dependencies).
- A hollow matrix with zero diagonal (all cross-dimension interactions).

  df/dx = ∂τ(x, h)/∂x + ∂τ(x, h)/∂h · ∂h(x)/∂x  =  diagonal + hollow
SLIDE 8
Application I: Finding Fixed Points
- Root-finding problems (f(x) = 0) can be solved using Jacobi-Newton:

  x_{t+1} = x_t − [D_dim f(x_t)]^{−1} f(x_t)

- Same solution as the plain update x_{t+1} = x_t − f(x_t), with faster convergence.
- We applied this to implicit ODE solvers for solving stiff equations.
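A sketch of the iteration (assuming PyTorch; the diagonal is computed naively from the full Jacobian here for clarity, and the toy elementwise f just keeps the example tiny):

```python
import torch

def jacobi_newton(f, x0, steps=20):
    """Solve f(x) = 0 by dividing each residual by the matching Jacobian diagonal entry."""
    x = x0.clone()
    for _ in range(steps):
        diag = torch.diagonal(torch.autograd.functional.jacobian(f, x))
        x = x - f(x) / diag               # per-dimension Newton step
    return x

f = lambda x: torch.tanh(x) - 0.5         # root at atanh(0.5) ~ 0.5493 in every dimension
print(jacobi_newton(f, torch.zeros(3)))
```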
SLIDE 9
Application II: Continuous Normalizing Flows
- Transforms distributions through an ODE: dx(t)/dt = f(x(t), t).
- Change in density is given by the divergence:

  d log p(x(t), t)/dt = −tr(df/dx) = −∑_{i=1}^{d} [D_dim f(x)]_i
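A sketch of how the one-pass divergence slots into a CNF step (assuming PyTorch; hollow_f is the same toy hollow function as above, and the explicit Euler discretization is purely for illustration):

```python
import torch

def hollow_f(x, detach_h=False):
    h = x.sum() - x                        # toy hollow conditioner
    if detach_h:
        h = h.detach()
    return torch.tanh(x * h)

def cnf_euler_step(x, logp, dt):
    x = x.detach().requires_grad_(True)
    f = hollow_f(x, detach_h=True)
    (ddim,) = torch.autograd.grad(f.sum(), x)      # D_dim f in one backward pass
    # d log p / dt = -divergence = -sum_i [D_dim f]_i
    return (x + dt * f).detach(), logp - dt * ddim.sum()

x, logp = torch.randn(4), torch.tensor(0.0)
for _ in range(10):
    x, logp = cnf_euler_step(x, logp, dt=0.1)
print(x, logp)
```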
SLIDE 10
Learning Stochastic Diff Eqs
- Fokker-Planck describes the change in density using D_dim and D²_dim:

  ∂p(t, x)/∂t = ∑_{i=1}^{d} [ −(D_dim f) p − (∇p) ⊙ f + (D²_dim diag(g)) p + 2 (D_dim diag(g)) ⊙ (∇p) + ½ diag(g)² ⊙ (D_dim ∇p) ]_i
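Both D_dim and D²_dim above come out of the same detach trick; a sketch (assuming PyTorch, reusing the toy hollow_f from earlier) of the second-order dimension-wise derivative with two backward passes:

```python
import torch

def hollow_f(x):
    h = (x.sum() - x).detach()            # h_i = sum_{j != i} x_j, already detached
    return torch.tanh(x * h)

x = torch.randn(4, requires_grad=True)
f = hollow_f(x)
(d1,) = torch.autograd.grad(f.sum(), x, create_graph=True)   # D_dim f   (1st backward pass)
(d2,) = torch.autograd.grad(d1.sum(), x)                     # D^2_dim f (2nd backward pass)
print(d1, d2)
```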
SLIDE 11
Takeaways
- Dimension-wise derivatives are costly for general functions.
- Restricting to hollow Jacobians gives cheap diagonal grads.
- Useful for PDEs, SDEs, normalizing flows, and optimization.