SLIDE 1

Compiling Deep Nets

Scott Sanner

SLIDE 2

Goal of this talk

  • Will not evangelize deep networks / successes

– Go to ICML, NIPS, Silicon Valley, read tech news
– “Just believe”

  • But deep nets do not solve all problems

– Yet
– Lack techniques for handling arbitrary queries
– With compilations, that could change

SLIDE 3

Probabilistic Inference with Arbitrary Queries

Why Deep Nets and not Graphical Models?

SLIDE 4

Graphical Models Revisited

  • HMM / Chain-CRF
  • (Cond.) Ising Model

(Figure: deep-net counterparts shown alongside – LSTM-based RNN and Convolutional NN)

SLIDE 5

Graphical Models vs. Deep Nets

Graphical Models

  • Structured
  • Convex (parameter learning, if exponential family and no latent variables)

  • Latent models are more niche

– Mixture models, LDA, Bayesians

  • Arbitrary Exact Inference: P(Q|E)

– Intractable (unless compiled)

Deep (Generative) Neural Networks

  • Also structured
  • Convex? What’s that?

– Adam, RMSProp work well in practice; see (Neural Taylor Expansion, ICML-17)

  • It’s all about latent (hidden)

– Massively overparameterized hidden layer representation
– Helps with non-convexity
– Exacerbates overfitting – need novel regularizers (dropout, batch norm)

  • Arbitrary Exact Inference: P(Q|E)

– Unknown (can we compile?)

Not just learning, but also planning/control (Wu, Say, Sanner; NIPS-17). Maybe we could cross-pollinate back to GMs?

SLIDE 6

Should we all switch to Deep Nets?

  • Not quite yet…
  • Deep nets are much more specialized than the general motivation for graphical models
  • In order to answer general P(Q|E)

– First need a deep generative model
– But most currently do inference via sampling
– How to do arbitrary exact inference?

  • Compilations required to support such inference – many flavors

SLIDE 7

Remainder of Talk

  • Deep Generative Models
  • Arithmetic and Continuous Decision Diagrams

– Where my focus has been
– Support marginalization for queries
– Though it is really hard to bound inference complexity

  • Not the only option, but we need continuous compilations (treewidth is a discrete graphical-model notion)
  • Compiling Deep Generative Models to DDs

SLIDE 8

Deep Generative Models

Alphabet Soup: GANs, VAEs, etc.

SLIDE 9

Vanilla ReLU Deep Network Structure

(Rectified Linear Units)

Note: ReLU is just a piecewise linear function

(Figure: input layer, hidden layers, output layer; input/output and hidden units)

Slide from Buser Say
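To make this concrete, here is a tiny numpy sketch (my own illustration, not code from the talk) of a one-hidden-layer ReLU net; its output is linear everywhere except at the “kinks” where hidden units switch on or off:

```python
import numpy as np

# one input, two hidden ReLU units, one linear output
W1, b1 = np.array([[1.0], [-1.0]]), np.array([-1.0, -1.0])
w2, b2 = np.array([1.0, 1.0]), 0.0

def relu_net(x):
    h = np.maximum(0.0, W1 @ x + b1)  # each hidden unit is a PWL "hinge"
    return float(w2 @ h + b2)         # a sum of hinges is still PWL

print([relu_net(np.array([x])) for x in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0)])
# [2.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0] -> slope changes only at x = -1 and x = 1
```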

SLIDE 10

Generative Adversarial Networks (GANs)

  • Generator + Discriminator framework

– “Fake Data” is from generative model

  • Can capture complex distributions through “refined” backpropagation

– For fictitious image generation, can generate clearer images than autoencoders minimizing RMSE

Slide from Ga Wu
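For reference, the minimax objective this generator/discriminator game optimizes (the standard GAN formulation of Goodfellow et al., 2014):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$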

SLIDE 11

Variational Auto-Encoders (VAEs)

  • Optimize variational lower bound of P(X)
  • Two-way mapping

– Encoder: P(Z|X)
– Decoder: P(X|Z) – the generative model

  • Re-parameterization Trick

– N(µ,σ) = µ + σ·N(0,1)
– Separates the deterministic computation from the stochastic part

Slide from Ga Wu
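A small numpy sketch of the re-parameterization trick (my own illustration, not code from the talk): sampling z ~ N(µ,σ²) is rewritten as a deterministic function of (µ,σ) plus parameter-free noise, so gradients can flow through µ and σ while the randomness stays outside:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, sigma, n=1):
    eps = rng.standard_normal((n,) + mu.shape)  # stochastic part: N(0, 1)
    return mu + sigma * eps                     # deterministic in (mu, sigma)

mu, sigma = np.array([0.5, -1.0]), np.array([0.1, 2.0])
z = sample_z(mu, sigma, n=100_000)
print(z.mean(axis=0).round(2), z.std(axis=0).round(2))  # ~mu and ~sigma
```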

SLIDE 12

Deep Autoregressive Networks

  • Standard Graphical Model

– Except that the conditional probabilities are deep networks

  • Some recent, more complex variants

– WaveNet, PixelCNN, PixelRNN

  • Note: cannot use standard message-passing algorithms with deep net factors

– But we might use decision diagrams

(Figure: a directed graphical model over variables A, B, E, X)

Slide from Ga Wu
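The underlying factorization is just the chain rule of probability, with each conditional parameterized by a deep network:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p\big(x_i \mid x_1, \ldots, x_{i-1}\big)$$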

SLIDE 13

Decision Diagrams

Alphabet Soup: ADDs, AADDs, XADDs

SLIDE 14

Function Representation (ADDs)

  • Why not a directed acyclic graph (DAG)?

Algebraic Decision Diagram (ADD): exploits context-specific independence (CSI) and shared substructure.

  a  b  c   F(a,b,c)
  0  0  0   0.00
  0  0  1   0.00
  0  1  0   0.00
  0  1  1   1.00
  1  0  0   0.00
  1  0  1   1.00
  1  1  0   0.00
  1  1  1   1.00

(Figure: the corresponding ADD testing a, b, c)
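As a toy illustration (my own, not the talk's code) of the shared substructure an ADD exploits: with hash-consing, identical (var, high, low) triples are built only once, so the repeated sub-function over c in the table above collapses to a single shared node:

```python
_unique = {}  # the "unique table": one canonical node per (var, high, low)

def make_leaf(value):
    return _unique.setdefault(('leaf', value), ('leaf', value))

def make_node(var, high, low):
    if high is low:                       # redundant test: skip the node
        return high
    key = ('node', var, id(high), id(low))
    return _unique.setdefault(key, ('node', var, high, low))

# F(a,b,c) from the table is "c AND (a OR b)": 3 internal nodes, 2 leaves
zero, one = make_leaf(0.0), make_leaf(1.0)
c_node = make_node('c', one, zero)        # shared by the a- and b-branches
b_node = make_node('b', c_node, zero)
f_root = make_node('a', c_node, b_node)
print(f_root[2] is b_node[2])             # True: one shared c node
```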

SLIDE 15

Trees vs. ADDs

  • AND, OR, XOR
  • Trees can compactly represent AND / OR

– But not XOR (linear as ADD, exponential as tree)
– Why? Trees must represent every path

(Figure: tree vs. ADD representations over x1, x2, x3)

SLIDE 16

Binary Operations (ADDs)

  • Why do we order variable tests?
  • Enables us to do efficient binary operations…
  • Operations can avoid state enumeration

(Figure: Apply of two ADDs over tests a, b, c; the result is a new ADD)
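A minimal sketch of the classic Apply recursion (my own toy Python, not the talk's implementation): because both operands test variables in the same global order, Apply recurses on matched sub-diagrams and memoizes results, never enumerating the 2^n joint states:

```python
class ADD:
    """Terminal (value) or internal test (var, high, low), fixed var order."""
    def __init__(self, var=None, high=None, low=None, value=None):
        self.var, self.high, self.low, self.value = var, high, low, value

def apply_op(f, g, op, memo=None):
    memo = {} if memo is None else memo
    key = (id(f), id(g))
    if key not in memo:
        if f.value is not None and g.value is not None:
            memo[key] = ADD(value=op(f.value, g.value))
        else:
            # descend on the earliest tested variable in the shared order
            var = min(v for v in (f.var, g.var) if v is not None)
            fh, fl = (f.high, f.low) if f.var == var else (f, f)
            gh, gl = (g.high, g.low) if g.var == var else (g, g)
            memo[key] = ADD(var=var, high=apply_op(fh, gh, op, memo),
                            low=apply_op(fl, gl, op, memo))
    return memo[key]

# f = if (a) then 1 else 0 ;  g = if (b) then 2 else 0 ;  h = f + g
f = ADD(var='a', high=ADD(value=1.0), low=ADD(value=0.0))
g = ADD(var='b', high=ADD(value=2.0), low=ADD(value=0.0))
h = apply_op(f, g, lambda u, v: u + v)
```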

SLIDE 17

ADD Inefficiency

  • Are ADDs enough?
  • Or do we need more compactness?
  • Ex. 1: Additive reward/utility functions

– R(a,b,c) = R(a) + R(b) + R(c) = 4a + 2b + c

  • Ex. 2: Multiplicative value functions

– V(a,b,c) = V(a) ⋅ V(b) ⋅ V(c) = γ^(4a + 2b + c)

(Figures: the ADDs for both functions test every variable and have exponentially many distinct leaves – 7, 6, …, 1, 0 and γ^7, γ^6, …, γ^0)

SLIDE 18

Affine ADD (AADD) (Sanner, McAllester; IJCAI-05)

  • Define a new decision diagram – the Affine ADD
  • Edges labeled by offset (c) and multiplier (b)
  • Semantics: if (a) then (c1 + b1·F1) else (c2 + b2·F2)

(Figure: node a with edge ⟨c1,b1⟩ to F1 and edge ⟨c2,b2⟩ to F2)

SLIDE 19

Affine ADD (AADD)

  • Maximize sharing by normalizing nodes to [0,1]
  • Example: if (a) then (4) else (2)

(Figure: the node with edges ⟨4,0⟩ and ⟨2,0⟩ normalizes to a node a with edges ⟨1,0⟩ and ⟨0,0⟩ under the top-level transform ⟨2,2⟩)

Need the top-level affine transform to recover the original range
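The arithmetic behind the example, written out: if (a) then 4 else 2 = 2 + 2 · (if (a) then 1 else 0), so the top-level affine transform is ⟨c,b⟩ = ⟨2,2⟩ and the node itself is normalized to range over [0,1].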
SLIDE 20
AADD Examples

  • Back to our previous examples…
  • Ex. 1: Additive reward/utility functions

– R(a,b) = R(a) + R(b) = 2a + b

  • Ex. 2: Multiplicative value functions

– V(a,b) = V(a) ⋅ V(b) = γ^(2a + b), γ < 1

(Figures: the AADDs over a and b, with affine edge labels such as ⟨2/3, 1/3⟩ and ⟨γ³, 1−γ³⟩ – automatically constructed!)

SLIDE 21

ADDs vs. AADDs

  • Additive functions: ∑_{i=1..n} x_i

Note: no context-specific independence, but subdiagrams are shared: result size O(n²)

SLIDE 22

ADDs vs. AADDs

  • Additive functions: ∑_i 2^i x_i

– Best case result for ADD (exp.) vs. AADD (linear)

SLIDE 23

ADDs vs. AADDs

  • Additive functions: ∑_{i=0..n-1} F(x_i, x_{(i+1) mod n})

Pairwise factoring evident in AADD structure

(Figure: a ring of variables x1, …, x7 with pairwise factors)

SLIDE 24

But we want to compile deep networks

Hidden layers are continuous

SLIDE 25

ReLU Deep Nets are Piecewise Linear!

(Rectified Linear Units)

Note: ReLU is just a piecewise linear function

E.g., see MILP compilation of ReLU deep nets for optimization (Say, Wu, Zhou, Sanner; IJCAI-17)

Slide from Buser Say


SLIDE 26

Case → XADD

$$V = \begin{cases}
x_1 + k > 100 \;\wedge\; x_2 + k > 100 &: \ldots \\
x_1 + k > 100 \;\wedge\; x_2 + k \le 100 &: x_2 \\
x_1 + k \le 100 \;\wedge\; x_2 + k > 100 &: x_1 \\
x_1 + x_2 + k > 100 \;\wedge\; x_1 + k \le 100 \;\wedge\; x_2 + k \le 100 \;\wedge\; x_2 > x_1 &: x_2 \\
x_1 + x_2 + k > 100 \;\wedge\; x_1 + k \le 100 \;\wedge\; x_2 + k \le 100 \;\wedge\; x_2 \le x_1 &: x_1 \\
x_1 + x_2 + k \le 100 &: x_1 + x_2 \\
\quad\vdots & \quad\vdots
\end{cases}$$

Sanner et al. (UAI-11); Sanner and Abbasnejad (AAAI-12); Zamani, Sanner et al. (AAAI-12)

SLIDE 27

Compactness of (X)ADDs

  • XADD is linear in the # of decisions φi
  • Case version has an exponential number of partitions!

(Figure: an XADD over decisions φ1 … φ5 with shared substructure)

SLIDE 28

XADD Maximization

(Figure: max of two XADDs over the tests y > 0 and x > 0; the result introduces the new test x > y)

May introduce new decision tests

Operations exploit structure: O(|f||g|)
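A toy sketch (my own, not the talk's XADD code) of why symbolic max introduces new decisions: the max of two linear leaves f and g requires the comparison f − g > 0, which need not appear anywhere in the operands:

```python
from dataclasses import dataclass

@dataclass
class Lin:
    """A linear leaf a*x + b*y + c."""
    a: float
    b: float
    c: float
    def __sub__(self, other):
        return Lin(self.a - other.a, self.b - other.b, self.c - other.c)

def case_max(f, g):
    """XADD-style symbolic max: if (f - g > 0) then f else g."""
    return ('test', f - g, f, g)  # the decision f - g > 0 is brand new

f = Lin(1, 0, 0)  # f(x, y) = x
g = Lin(0, 1, 0)  # g(x, y) = y
print(case_max(f, g))  # introduces the new decision test x - y > 0
```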

SLIDE 29

Maintaining XADD Orderings

  • Max may get decisions out of order
  • Decision ordering (root→leaf): x > y, y > 0, x > 0

(Figure: the max(·,·) from the previous slide, with the new x > y test appearing below y > 0 in the result)

Newly introduced node is out of order!

SLIDE 30

Maintaining XADD Orderings

  • Substitution may get decisions out of order
  • Decision ordering (root→leaf): x > y, y > 0, x > z

(Figure: applying the substitution σ = { z/y } rewrites the x > z tests below y > 0 into x > y tests)

Substituted nodes are now out of order!

SLIDE 31

Correcting XADD Ordering

  • Obtain an ordered XADD from an unordered XADD

– Key idea: binary operations maintain orderings

(Figure: an out-of-order test z with children ID1 and ID0 is rebuilt from ID1, ID0, and indicator diagrams that test only z – the result will have z in order!)

Inductively assume ID1 and ID0 are ordered. All operands are then ordered, so applying ⊗, ⊕ produces an ordered result!
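In symbols, with ⊕/⊗ the XADD Apply operations and [·] a 0/1 indicator diagram testing only z: F = if (z) then ID1 else ID0 = ([z] ⊗ ID1) ⊕ ([¬z] ⊗ ID0). Every operand on the right is ordered, so the ordered Apply operations place z correctly in the result.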
SLIDE 32

Maintaining Minimality

(Figure: an XADD with tests y > 0 and x > 0 above a test x + y < 0)

Node unreachable – x + y < 0 is always false if x > 0 and y > 0

If decisions are linear, this can be detected with the feasibility checker of an LP solver and pruned. More subtle prunings exist as well.
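A small sketch of such a feasibility check (my own illustration, assuming scipy is available): the branch constraints {x > 0, y > 0, x + y < 0} are relaxed by a small ε and handed to an LP solver; infeasibility means the node is unreachable and can be pruned:

```python
from scipy.optimize import linprog

eps = 1e-6
# constraints in A_ub @ [x, y] <= b_ub form:
#   x >= eps  ->  -x <= -eps ;  y >= eps  ->  -y <= -eps ;  x + y <= -eps
A_ub = [[-1, 0], [0, -1], [1, 1]]
b_ub = [-eps, -eps, -eps]
res = linprog(c=[0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
print(res.status)  # 2 = infeasible -> the x + y < 0 branch is unreachable
```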

SLIDE 33

What’s the Minimal Diagram?

Search through all possible node rotations to find it? Canonicity is still an open question!

(Figure: equivalent diagrams over the tests x > 6, x > 7, x > 8 with leaves 1, 2, 3, related by node rotations)

SLIDE 34

Affine XADD?

We’re working on it (affine can be defined in different ways)

SLIDE 35

Compiling Deep Nets

SLIDE 36

Key idea: Compile with XADD Apply!

(Figure: a deep ReLU net with learned weights mapping the state at time t to the state at time t+1; input/output and hidden units)

Build bottom-up… each node is an “Apply” sum and max operation of its children!

Many more details depend on the source model, but this is the key idea permitting compilation and automated inference w.r.t. the deep generative model source.
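To make the bottom-up Apply idea concrete, here is a 1-D toy (my own construction, not the paper's XADD code) where a piecewise-linear function is a list of cases (lo, hi, a, b) meaning a·x + b on [lo, hi); compiling one ReLU neuron is then an Apply of “sum” followed by an Apply of “max with 0”:

```python
def pwl_sum(f, g):
    """Apply(+): refine both case partitions, then add the linear pieces."""
    cuts = sorted({c for lo, hi, _, _ in f + g for c in (lo, hi)})
    out = []
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2
        af, bf = next((a, b) for l, h, a, b in f if l <= mid < h)
        ag, bg = next((a, b) for l, h, a, b in g if l <= mid < h)
        out.append((lo, hi, af + ag, bf + bg))
    return out

def pwl_relu(f):
    """Apply(max(0, .)): split cases at zero crossings, zero out negatives."""
    pieces = []
    for lo, hi, a, b in f:
        if a != 0 and lo < -b / a < hi:  # the piece changes sign inside
            r = -b / a
            pieces += [(lo, r, a, b), (r, hi, a, b)]
        else:
            pieces.append((lo, hi, a, b))
    return [(lo, hi, a, b) if a * (lo + hi) / 2 + b >= 0 else (lo, hi, 0.0, 0.0)
            for lo, hi, a, b in pieces]

# one neuron h(x) = max(0, 2*x + 1), built bottom-up on the domain [-10, 10)
x    = [(-10.0, 10.0, 1.0, 0.0)]                       # the input itself
wx   = [(lo, hi, 2 * a, 2 * b) for lo, hi, a, b in x]  # weight * input
bias = [(-10.0, 10.0, 0.0, 1.0)]
h = pwl_relu(pwl_sum(wx, bias))
print(h)  # [(-10.0, -0.5, 0.0, 0.0), (-0.5, 10.0, 2.0, 1.0)]
```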

SLIDE 37

Open Questions

  • XADD Compilation

– Works for vanilla autoregressive nets

  • Best decision order?
  • Need to multiply and marginalize polynomials (LinPWPoly)
  • Can add hidden variables as explicit variables

– Saves space, but how to formalize inference?
– Message passing for non-probabilistic functions?
  » Need to examine the algebra
– Need extensions to handle GAN/VAE noise source inputs
– Hard to do exact message passing

  • Treewidth bounds do not apply in the continuous case
  • Alternative reductions to stochastic optimization?
  • Other directions

– Continuous arithmetic circuits? – Where autodiff = inference?