

  1. Compiling Deep Nets Scott Sanner

  2. Goal of this talk
     • Will not evangelize deep networks / successes
       – Go to ICML, NIPS, Silicon Valley, read tech news
       – “Just believe”
     • But deep nets do not solve all problems
       – Yet
       – Lack techniques for handling arbitrary queries
       – With compilations, that could change

  3. Probabilistic Inference with Arbitrary Queries
     Why Deep Nets and not Graphical Models?

  4. Graphical Models Revisited
     • HMM / Chain-CRF  ↔  LSTM-based RNN
     • (Cond.) Ising Model  ↔  Convolutional NN

  5. Graphical Models vs. Deep Nets
     (Not just learning, also planning/control: Wu, Say, Sanner; NIPS-17)
     Graphical Models:
     • Structured
     • Convex parameter learning (if exponential family and no latent variables)
     • Latent models are more niche – mixture models, LDA, Bayesians
     • Arbitrary exact inference P(Q|E): intractable (unless compiled)
     Deep (Generative) Neural Networks:
     • Also structured
     • Convex? What’s that? – Adam, RMSProp work well; see (Neural Taylor Expansion, ICML-17)
     • It’s all about the latent (hidden) representation – massively overparameterized hidden layers
       – Helps with non-convexity
       – Exacerbates overfitting – needs novel regularizers (dropout, batch norm)
     • Arbitrary exact inference P(Q|E): unknown (can we compile?)
     (Maybe we could cross-pollinate back to GMs?)

  6. Should we all switch to Deep Nets?
     • Not quite yet…
     • Deep nets are much more specialized than the general motivation for graphical models
     • In order to answer general P(Q|E):
       – First need a deep generative model (many flavors)
       – But most currently do inference via sampling
       – How to do arbitrary exact inference?
     • Compilations required to support such inference

  7. Remainder of Talk
     • Deep Generative Models
     • Arithmetic and Continuous Decision Diagrams
       – Where my focus has been
       – Support marginalization for queries, though it is really hard to bound inference complexity
         (treewidth is a discrete graphical model notion)
       – Not the only option, but we need continuous compilations
     • Compiling Deep Generative Models to DDs

  8. Deep Generative Models
     Alphabet Soup: GANs, VAEs, etc.

  9. Vanilla ReLU Deep Network Structure
     [Figure: input layer → hidden layer of ReLU units (Rectified Linear Units) → output layer]
     • Note: ReLU is just a piecewise linear function
     Slide from Buser Say
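     A minimal sketch (not from the talk, with hypothetical weights) of the point made above: each ReLU unit computes max(0, affine function of its inputs), so the whole network is a piecewise linear map.

```python
# Sketch: a ReLU unit is just a piecewise linear function of its inputs.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # piecewise linear: 0 for z < 0, z otherwise

# Hypothetical weights for a tiny net: 2 inputs -> 3 hidden ReLU units -> 1 output
W1 = np.array([[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]])
b1 = np.array([0.0, -1.0, 0.5])
w2 = np.array([1.0, -1.0, 2.0])

def forward(x):
    h = relu(W1 @ x + b1)              # each unit: max(0, affine function of x)
    return w2 @ h                      # the output is again piecewise linear in x

print(forward(np.array([0.3, -0.7])))
```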

  10. Generative Adversarial Networks (GANs)
      • Generator + Discriminator framework
        – “Fake Data” comes from the generative model
      • Can capture complex distributions through “refined” backpropagation
        – For fictitious image generation, can generate clearer images than autoencoders minimizing RMSE
      Slide from Ga Wu

  11. Variational Auto-Encoders (VAEs)
      • Optimize a variational lower bound of P(X)
      • Two-way mapping
        – Encoder: P(Z|X)
        – Decoder: P(X|Z) – the generative model
      • Re-parameterization trick
        – N(µ, σ) = µ + σ·N(0, 1)
        – Separates the deterministic reasoning from the stochastic part
      Slide from Ga Wu
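     A minimal sketch (my own illustration, not the slide's code) of the re-parameterization trick stated above: a sample from N(µ, σ) is rewritten as a deterministic function of (µ, σ) plus standard Gaussian noise, so gradients can flow through µ and σ.

```python
# Sketch: re-parameterization trick z = mu + sigma * eps, eps ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, sigma):
    eps = rng.standard_normal(mu.shape)   # stochastic part, carries no parameters
    return mu + sigma * eps               # deterministic part carries the gradient

mu, sigma = np.array([0.0, 1.0]), np.array([1.0, 0.5])
print(sample_z(mu, sigma))
```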

  12. Deep Autoregressive Networks
      • Standard graphical model structure
        – Except that the conditional probabilities are deep networks
        [Figure: small Bayesian-network example over variables E, B, A, X]
      • Some recent, more complex variants
        – WaveNet, PixelCNN, PixelRNN
      • Note: cannot use standard message-passing algorithms with deep net factors
        – But we might use decision diagrams
      Slide from Ga Wu

  13. Decision Diagrams
      Alphabet Soup: ADDs, AADDs, XADDs

  14. Function Representation (ADDs)
      • Why not a directed acyclic graph (DAG)?
      Truth table:
        a b c | F(a,b,c)
        0 0 0 | 0.00
        0 0 1 | 0.00
        0 1 0 | 0.00
        0 1 1 | 1.00
        1 0 0 | 0.00
        1 0 1 | 1.00
        1 1 0 | 0.00
        1 1 1 | 1.00
      [Figure: the corresponding Algebraic Decision Diagram (ADD) with tests on a, b, c and terminals 1, 0]
      • Exploits context-specific independence (CSI) and shared substructure.
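     A minimal sketch (my own, with made-up helper names `mk_node`, `build`, `unique`) of the reduction rules behind an ADD: structurally identical subgraphs are shared via a "unique table" and redundant tests are skipped, turning the exponential tree into a compact DAG.

```python
# Sketch: build a reduced ADD-like DAG for the truth table on this slide.
unique = {}

def mk_node(var, low, high):
    if low == high:                      # redundant test: both branches equal
        return low
    key = (var, low, high)
    return unique.setdefault(key, key)   # share structurally identical nodes

def build(var_order, f, assignment=()):
    """Build a reduced diagram for f over the boolean vars in var_order."""
    if not var_order:
        return f(dict(assignment))       # leaf: a numeric value
    v, rest = var_order[0], var_order[1:]
    low  = build(rest, f, assignment + ((v, 0),))
    high = build(rest, f, assignment + ((v, 1),))
    return mk_node(v, low, high)

# F(a,b,c) from the table above: 1 iff c and (a or b)
root = build(('a', 'b', 'c'),
             lambda x: float(x['c'] and (x['a'] or x['b'])))
print(root)
print(len(unique), "internal nodes")     # 3 decision nodes, matching the slide's ADD
```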

  15. Trees vs. ADDs
      [Figure: decision diagrams for AND, OR, and XOR over x1, x2, x3 with terminals 1 and 0]
      • Trees can compactly represent AND / OR
        – But not XOR (linear as ADD, exponential as tree)
        – Why? Trees must represent every path

  16. Binary Operations (ADDs)
      • Why do we order variable tests?
      • Enables us to do efficient binary operations…
      [Figure: two ordered ADDs over a, b, c combined into a result ADD]
      • Result: ADD operations can avoid state enumeration
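     A minimal sketch (assumed, not the talk's code) of the "Apply" idea behind ADD binary operations: because both operands test variables in the same shared order, the operation recurses over pairs of nodes with memoization instead of enumerating every joint assignment.

```python
# Sketch: Apply(f, g, op) over tuple-encoded ADD nodes (var, low, high).
def mk_node(var, low, high):
    return low if low == high else (var, low, high)     # skip redundant tests

def is_leaf(n):
    return not isinstance(n, tuple)

def apply_op(f, g, op, cache=None):
    cache = {} if cache is None else cache
    if (f, g) in cache:
        return cache[(f, g)]
    if is_leaf(f) and is_leaf(g):
        res = op(f, g)                                   # both leaves: combine values
    else:
        # Recurse on the earliest variable in the shared (here: alphabetical) ordering.
        var = min(n[0] for n in (f, g) if not is_leaf(n))
        fl, fh = (f[1], f[2]) if not is_leaf(f) and f[0] == var else (f, f)
        gl, gh = (g[1], g[2]) if not is_leaf(g) and g[0] == var else (g, g)
        res = mk_node(var, apply_op(fl, gl, op, cache), apply_op(fh, gh, op, cache))
    cache[(f, g)] = res
    return res

# Pointwise sum of two small ADDs: F = 2*a and G = b
F = ('a', 0.0, 2.0)
G = ('b', 0.0, 1.0)
print(apply_op(F, G, lambda u, v: u + v))   # ('a', ('b', 0.0, 1.0), ('b', 2.0, 3.0))
```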

  17. ADD Inefficiency
      • Are ADDs enough? Or do we need more compactness?
      • Ex. 1: Additive reward/utility functions
        – R(a,b,c) = R(a) + R(b) + R(c) = 4a + 2b + c
        [Figure: ADD with leaves 7, 6, 5, 4, 3, 2, 1, 0 – one per assignment]
      • Ex. 2: Multiplicative value functions
        – V(a,b,c) = V(a) · V(b) · V(c) = γ^(4a + 2b + c)
        [Figure: ADD with leaves γ^7, γ^6, γ^5, γ^4, γ^3, γ^2, γ^1, γ^0]

  18. Affine ADD (AADD)  (Sanner, McAllester; IJCAI-05)
      • Define a new decision diagram: the Affine ADD
      • Edges are labeled by an offset (c) and a multiplier (b): a decision node a with edge <c1, b1> to subdiagram F1 and edge <c2, b2> to subdiagram F2
      • Semantics: if (a) then (c1 + b1·F1) else (c2 + b2·F2)
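     A minimal sketch (my own, following the semantics stated on this slide; the dict-based node layout is a hypothetical encoding) of evaluating one AADD decision node with <offset, multiplier> edge labels.

```python
# Sketch: evaluate an AADD node -- if (a) then c1 + b1*F1 else c2 + b2*F2.
def eval_aadd(node, assignment):
    if not isinstance(node, dict):                 # terminal: a plain value
        return node
    c, b, child = (node['then'] if assignment[node['var']]
                   else node['else'])              # pick edge <c, b> and subdiagram
    return c + b * eval_aadd(child, assignment)

# Hypothetical node for "if (a) then 4 else 2" over the single terminal 0:
node = {'var': 'a', 'then': (4.0, 0.0, 0.0), 'else': (2.0, 0.0, 0.0)}
print(eval_aadd(node, {'a': True}), eval_aadd(node, {'a': False}))   # 4.0 2.0
```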

  19. Affine ADD (AADD)
      • Maximize sharing by normalizing nodes to the range [0, 1]
        – Need a top-level affine transform to recover the original range
      • Example: if (a) then (4) else (2)
        – Unnormalized: node a with edges <4, 0> and <2, 0> to terminal 0
        – Normalized: top-level transform <2, 2>; node a with edges <1, 0> and <0, 0> to terminal 0

  20. AADD Examples (Automatically Constructed!)
      • Back to our previous examples…
      • Ex. 1: Additive reward/utility functions
        – R(a,b) = R(a) + R(b) = 2a + b
        – AADD: top-level transform <0, 3>; node a with edges <2/3, 1/3> and <0, 1/3> to a shared node b, whose edges <1, 0> and <0, 0> point to terminal 0
      • Ex. 2: Multiplicative value functions
        – V(a,b) = V(a) · V(b) = γ^(2a + b), γ < 1
        – AADD: top-level transform <γ³, 1−γ³>; node a with edges <0, (γ²−γ³)/(1−γ³)> and <(γ−γ³)/(1−γ³), (1−γ)/(1−γ³)> to the shared node b over terminal 0

  21. ADDs vs. AADDs
      • Additive functions: Σ_{i=1..n} x_i
      • Note: no context-specific independence, but subdiagrams are shared: result size O(n²)

  22. ADDs vs. AADDs
      • Additive functions: Σ_i 2^i · x_i
        – Best-case separation: ADD is exponential vs. AADD linear (every assignment has a distinct value, so the ADD needs 2^n leaves)

  23. ADDs vs. AADDs
      • Additive functions: Σ_{i=0..n-1} F(x_i, x_{(i+1) mod n})
        [Figure: ring of variables x_1 … x_7]
      • Pairwise factoring is evident in the AADD structure

  24. But we want to compile deep networks
      • Hidden layers are continuous

  25. ReLU Deep Nets are Piecewise Linear!
      • E.g., see the MILP compilation of ReLU deep nets for optimization (Say, Wu, Zhou, Sanner; IJCAI-17)
      [Figure: input/output and hidden ReLU units; note: ReLU is just a piecewise linear function (Rectified Linear Units)]
      Slide from Buser Say
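     For context, a standard big-M encoding of a single ReLU unit (a common textbook formulation, not necessarily the exact one used in the cited IJCAI-17 paper): with a the unit's affine pre-activation, h its output, and M a bound with |a| ≤ M,

        h ≥ a,   h ≥ 0,
        h ≤ a + M·(1 − z),   h ≤ M·z,   z ∈ {0, 1}

     Stacking one such encoding per hidden unit turns the whole ReLU network into a mixed-integer linear program, which is what makes exact optimization over the learned network possible.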

  26. Case → XADD
      V = case of
        x1 + k > 100 ∧ x2 + k > 100                                 : 0
        x1 + k > 100 ∧ x2 + k ≤ 100                                 : x2
        x1 + k ≤ 100 ∧ x2 + k > 100                                 : x1
        x1 + x2 + k > 100 ∧ x1 + k ≤ 100 ∧ x2 + k ≤ 100 ∧ x2 > x1   : x2
        x1 + x2 + k > 100 ∧ x1 + k ≤ 100 ∧ x2 + k ≤ 100 ∧ x2 ≤ x1   : x1
        x1 + x2 + k ≤ 100                                           : x1 + x2
        …
      Sanner et al. (UAI-11); Sanner and Abbasnejad (AAAI-12); Zamani, Sanner et al. (AAAI-12)

  27. Compactness of (X)ADDs
      [Figure: XADD chaining decisions φ1 … φ5 down to terminals 1 and 0]
      • The XADD is linear in the number of decisions φ_i
      • The case version has an exponential number of partitions!

  28. XADD Maximization
      [Figure: max of two XADDs with decisions x > 0 and y > 0; the result tests x > 0, y > 0, and a newly introduced decision x > y, with leaves x and y]
      • Max may introduce new decision tests
      • Operations exploit structure: O(|f|·|g|)
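     A minimal sketch (an assumed list-of-partitions representation, not the talk's implementation) of the casemax operation in the spirit of this slide: each pair of partitions contributes two new partitions split on the fresh comparison f > g, which is exactly why max may introduce new decision tests. Trivially infeasible partitions would be pruned as on slide 32.

```python
# Sketch: symbolic max of two case functions, each a list of (constraints, value).
import sympy as sp

x, y = sp.symbols('x y')

def case_max(F, G):
    result = []
    for cond_f, val_f in F:
        for cond_g, val_g in G:
            joint = cond_f + cond_g
            result.append((joint + [sp.Gt(val_f, val_g)], val_f))   # new decision test
            result.append((joint + [sp.Le(val_f, val_g)], val_g))
    return result

# Hypothetical example: max( if x>0 then x else 0 ,  if y>0 then y else 0 )
F = [([sp.Gt(x, 0)], x), ([sp.Le(x, 0)], sp.Integer(0))]
G = [([sp.Gt(y, 0)], y), ([sp.Le(y, 0)], sp.Integer(0))]
for conds, val in case_max(F, G):
    print(conds, '->', val)
```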

  29. Maintaining XADD Orderings
      • Max may get decisions out of order
      • Decision ordering (root → leaf): x > y, y > 0, x > 0
      [Figure: the max result places a newly introduced x > y node below y > 0 and x > 0, so it is out of order!]

  30. Maintaining XADD Orderings
      • Substitution may get decisions out of order
      • Decision ordering (root → leaf): x > y, y > 0, x > z
      [Figure: applying σ = { z/y } turns x > z nodes into x > y nodes below y > 0, so the substituted nodes are now out of order!]

  31. Correcting XADD Ordering
      • Obtain an ordered XADD from an unordered XADD
        – Key idea: binary operations maintain orderings
      • For an out-of-order decision z with children ID1 and ID0:
        – Inductively assume ID1 and ID0 are ordered
        – Rebuild the node as (indicator(z) ⊗ ID1) ⊕ (indicator(¬z) ⊗ ID0)
        – All operands are ordered, so applying ⊗ and ⊕ produces a result with z in order!

  32. Maintaining Minimality
      [Figure: under path constraints x > 0 and y > 0, a lower decision x + y < 0 is always false, so the node below it is unreachable]
      • If the decisions are linear, this can be detected with the feasibility checker of an LP solver, and the node pruned
      • More subtle prunings exist as well
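     A minimal sketch (assumed, not the talk's code) of the pruning test described above: an LP feasibility check shows that the path constraints x > 0, y > 0, x + y < 0 are jointly unsatisfiable. The small epsilon margin is my own approximation of the strict inequalities.

```python
# Sketch: LP feasibility check for the path constraints x > 0, y > 0, x + y < 0.
from scipy.optimize import linprog

# Constraints in A_ub @ [x, y] <= b_ub form, with a small margin for strictness:
#   x > 0      ->  -x     <= -eps
#   y > 0      ->  -y     <= -eps
#   x + y < 0  ->   x + y <= -eps
eps = 1e-6
A_ub = [[-1, 0], [0, -1], [1, 1]]
b_ub = [-eps, -eps, -eps]

res = linprog(c=[0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
print("path feasible?", res.success)   # False: the branch is unreachable, prune it
```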

  33. What’s the Minimal Diagram?
      [Figure: equivalent XADDs testing x > 6, x > 7, x > 8 in different orders, all representing the same step function with values 1, 2, 3 at thresholds 6, 7, 8]
      • Search through all possible node rotations to find it?
      • Canonicity is still an open question!

  34. Affine XADD?
      • We’re working on it (affine can be defined in different ways)

  35. Compiling Deep Nets

  36. Key Idea: Compile with XADD Apply!
      • The deep learned weights map the state at time t to the state at time t+1 (input/output; hidden ReLU units)
      • Build bottom-up: each node is an “Apply” sum-and-max operation of its children!
      • Many more details depend on the source model, but this is the key idea permitting compilation and automated inference w.r.t. the deep generative model source.
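     A minimal sketch (my own illustration of the stated key idea, not the talk's system) of the bottom-up construction: each hidden unit is a sum (the affine part) followed by a max (the ReLU), here stood in for by sympy expressions rather than XADD Apply operations; the weights are hypothetical.

```python
# Sketch: compile a tiny ReLU net bottom-up into one piecewise-linear symbolic function.
import sympy as sp

x1, x2 = sp.symbols('x1 x2')                 # state at time t (network inputs)

# Hypothetical learned weights: one hidden layer of two ReLU units, one output.
h1 = sp.Max(0, 1.0 * x1 - 2.0 * x2 + 0.5)    # sum then max, built from the children
h2 = sp.Max(0, 0.5 * x1 + 0.5 * x2 - 1.0)
out = 1.0 * h1 - 1.0 * h2 + 0.25             # state at time t+1 (network output)

# The symbolic result supports downstream queries; here, just evaluation:
print(out.subs({x1: 0.3, x2: -0.7}))
print(sp.simplify(out))
```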
