SLIDE 1

Compiling Deep Nets

Scott Sanner

SLIDE 2

Goal of this talk

  • Will not evangelize deep networks / successes

– Go to ICML, NIPS, Silicon Valley, read tech news
– “Just believe”

  • But deep nets do not solve all problems

– Yet
– Lack techniques for handling arbitrary queries
– With compilations, that could change

SLIDE 3

Probabilistic Inference with Arbitrary Queries

Why Deep Nets and not Graphical Models?

SLIDE 4

Graphical Models Revisited

  • HMM / Chain-CRF
  • (Cond.) Ising Model

(Figure: deep-net counterparts shown alongside – LSTM-based RNN and Convolutional NN)

SLIDE 5

Graphical Models vs. Deep Nets

Graphical Models

  • Structured
  • Convex (parameter learning, if exponential family and no latent variables)

  • Latent models are more niche

– Mixture models, LDA, Bayesians

  • Arbitrary Exact Inference: P(Q|E)

– Intractable (unless compiled)

Deep (Generative) Neural Networks

  • Also structured
  • Convex? What’s that?

– Adam, RMSProp work well in practice; see (Neural Taylor Expansion, ICML-17)

  • It’s all about latent (hidden)

– Massively overparameterized hidden layer representation
– Helps with non-convexity
– Exacerbates overfitting – need novel regularizers (dropout, batch norm)

  • Arbitrary Exact Inference: P(Q|E)

– Unknown (can we compile?)

Not just learning, but also planning/control (Wu, Say, Sanner; NIPS-17). Maybe we could cross-pollinate back to GMs?

SLIDE 6

Should we all switch to Deep Nets?

  • Not quite yet…
  • Deep nets are much more specialized than the general motivation for graphical models
  • In order to answer general P(Q|E)

– First need a deep generative model
– But most currently do inference via sampling
– How to do arbitrary exact inference?

  • Compilations required to support such inference – many flavors

SLIDE 7

Remainder of Talk

  • Deep Generative Models
  • Arithmetic and Continuous Decision Diagrams

– Where my focus has been
– Support marginalization for queries
– Though it is really hard to bound inference complexity

  • Not the only option, but we need continuous compilations (treewidth is a discrete graphical-model notion)
  • Compiling Deep Generative Models to DDs

SLIDE 8

Deep Generative Models

Alphabet Soup: GANs, VAEs, etc.

SLIDE 9

Vanilla ReLU Deep Network Structure

(Rectified Linear Units)

Note: ReLU is just a piecewise linear function

(Figure: input layer, hidden layers, output layer; input/output and hidden units)

Slide from Buser Say
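To make this concrete, here is a tiny numpy sketch (my own illustration, not code from the talk) of a one-hidden-layer ReLU net; its output is linear everywhere except at the “kinks” where hidden units switch on or off:

```python
import numpy as np

# one input, two hidden ReLU units, one linear output
W1, b1 = np.array([[1.0], [-1.0]]), np.array([-1.0, -1.0])
w2, b2 = np.array([1.0, 1.0]), 0.0

def relu_net(x):
    h = np.maximum(0.0, W1 @ x + b1)  # each hidden unit is a PWL "hinge"
    return float(w2 @ h + b2)         # a sum of hinges is still PWL

print([relu_net(np.array([x])) for x in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0)])
# [2.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0] -> slope changes only at x = -1 and x = 1
```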

SLIDE 10

Generative Adversarial Networks (GANs)

  • Generator + Discriminator framework

– “Fake Data” is from generative model

  • Can capture complex distributions through “refined” backpropagation

– For fictitious image generation, can generate clearer images than autoencoders minimizing RMSE

Slide from Ga Wu
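For reference, the minimax objective this generator/discriminator game optimizes (the standard GAN formulation of Goodfellow et al., 2014):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$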

SLIDE 11

Variational Auto-Encoders (VAEs)

  • Optimize variational lower bound of P(X)
  • Two-way mapping

– Encoder: P(Z|X)
– Decoder: P(X|Z) – the generative model

  • Re-parameterization Trick

– N(µ,σ) = µ + σ·N(0,1)
– Separates the deterministic computation from the stochastic part

Slide from Ga Wu
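A small numpy sketch of the re-parameterization trick (my own illustration, not code from the talk): sampling z ~ N(µ,σ²) is rewritten as a deterministic function of (µ,σ) plus parameter-free noise, so gradients can flow through µ and σ while the randomness stays outside:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, sigma, n=1):
    eps = rng.standard_normal((n,) + mu.shape)  # stochastic part: N(0, 1)
    return mu + sigma * eps                     # deterministic in (mu, sigma)

mu, sigma = np.array([0.5, -1.0]), np.array([0.1, 2.0])
z = sample_z(mu, sigma, n=100_000)
print(z.mean(axis=0).round(2), z.std(axis=0).round(2))  # ~mu and ~sigma
```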

SLIDE 12

Deep Autoregressive Networks

  • Standard Graphical Model

– Except that the conditional probabilities are deep networks

  • Some recent, more complex variants

– WaveNet, PixelCNN, PixelRNN

  • Note: cannot use standard message-passing algorithms with deep net factors

– But we might use decision diagrams

(Figure: a directed graphical model over variables A, B, E, X)

Slide from Ga Wu
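The underlying factorization is just the chain rule of probability, with each conditional parameterized by a deep network:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p\big(x_i \mid x_1, \ldots, x_{i-1}\big)$$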

SLIDE 13

Decision Diagrams

Alphabet Soup: ADDs, AADDs, XADDs

SLIDE 14

Function Representation (ADDs)

  • Why not a directed acyclic graph (DAG)?

Algebraic Decision Diagram (ADD): exploits context-specific independence (CSI) and shared substructure.

  a  b  c   F(a,b,c)
  0  0  0   0.00
  0  0  1   0.00
  0  1  0   0.00
  0  1  1   1.00
  1  0  0   0.00
  1  0  1   1.00
  1  1  0   0.00
  1  1  1   1.00

(Figure: the corresponding ADD testing a, b, c)
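As a toy illustration (my own, not the talk's code) of the shared substructure an ADD exploits: with hash-consing, identical (var, high, low) triples are built only once, so the repeated sub-function over c in the table above collapses to a single shared node:

```python
_unique = {}  # the "unique table": one canonical node per (var, high, low)

def make_leaf(value):
    return _unique.setdefault(('leaf', value), ('leaf', value))

def make_node(var, high, low):
    if high is low:                       # redundant test: skip the node
        return high
    key = ('node', var, id(high), id(low))
    return _unique.setdefault(key, ('node', var, high, low))

# F(a,b,c) from the table is "c AND (a OR b)": 3 internal nodes, 2 leaves
zero, one = make_leaf(0.0), make_leaf(1.0)
c_node = make_node('c', one, zero)        # shared by the a- and b-branches
b_node = make_node('b', c_node, zero)
f_root = make_node('a', c_node, b_node)
print(f_root[2] is b_node[2])             # True: one shared c node
```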

SLIDE 15

Trees vs. ADDs

  • AND, OR, XOR
  • Trees can compactly represent AND / OR

– But not XOR (linear as ADD, exponential as tree)
– Why? Trees must represent every path

(Figure: tree vs. ADD representations over x1, x2, x3)

SLIDE 16

Binary Operations (ADDs)

  • Why do we order variable tests?
  • Enables us to do efficient binary operations…
  • Operations can avoid state enumeration

(Figure: Apply of two ADDs over tests a, b, c; the result is a new ADD)
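A minimal sketch of the classic Apply recursion (my own toy Python, not the talk's implementation): because both operands test variables in the same global order, Apply recurses on matched sub-diagrams and memoizes results, never enumerating the 2^n joint states:

```python
class ADD:
    """Terminal (value) or internal test (var, high, low), fixed var order."""
    def __init__(self, var=None, high=None, low=None, value=None):
        self.var, self.high, self.low, self.value = var, high, low, value

def apply_op(f, g, op, memo=None):
    memo = {} if memo is None else memo
    key = (id(f), id(g))
    if key not in memo:
        if f.value is not None and g.value is not None:
            memo[key] = ADD(value=op(f.value, g.value))
        else:
            # descend on the earliest tested variable in the shared order
            var = min(v for v in (f.var, g.var) if v is not None)
            fh, fl = (f.high, f.low) if f.var == var else (f, f)
            gh, gl = (g.high, g.low) if g.var == var else (g, g)
            memo[key] = ADD(var=var, high=apply_op(fh, gh, op, memo),
                            low=apply_op(fl, gl, op, memo))
    return memo[key]

# f = if (a) then 1 else 0 ;  g = if (b) then 2 else 0 ;  h = f + g
f = ADD(var='a', high=ADD(value=1.0), low=ADD(value=0.0))
g = ADD(var='b', high=ADD(value=2.0), low=ADD(value=0.0))
h = apply_op(f, g, lambda u, v: u + v)
```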

SLIDE 17

ADD Inefficiency

  • Are ADDs enough?
  • Or do we need more compactness?
  • Ex. 1: Additive reward/utility functions

– R(a,b,c) = R(a) + R(b) + R(c) = 4a + 2b + c

  • Ex. 2: Multiplicative value functions

– V(a,b,c) = V(a) ⋅ V(b) ⋅ V(c) = γ^(4a + 2b + c)

(Figures: the ADDs for both functions test every variable and have exponentially many distinct leaves – 7, 6, …, 1, 0 and γ^7, γ^6, …, γ^0)

SLIDE 18

Affine ADD (AADD) (Sanner, McAllester; IJCAI-05)

  • Define a new decision diagram – the Affine ADD
  • Edges labeled by offset (c) and multiplier (b)
  • Semantics: if (a) then (c1 + b1·F1) else (c2 + b2·F2)

(Figure: node a with edge ⟨c1,b1⟩ to F1 and edge ⟨c2,b2⟩ to F2)

SLIDE 19

Affine ADD (AADD)

  • Maximize sharing by normalizing nodes to [0,1]
  • Example: if (a) then (4) else (2)

(Figure: the node with edges ⟨4,0⟩ and ⟨2,0⟩ normalizes to a node a with edges ⟨1,0⟩ and ⟨0,0⟩ under the top-level transform ⟨2,2⟩)

Need the top-level affine transform to recover the original range
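The arithmetic behind the example, written out: if (a) then 4 else 2 = 2 + 2 · (if (a) then 1 else 0), so the top-level affine transform is ⟨c,b⟩ = ⟨2,2⟩ and the node itself is normalized to range over [0,1].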
SLIDE 20
AADD Examples

  • Back to our previous examples…
  • Ex. 1: Additive reward/utility functions

– R(a,b) = R(a) + R(b) = 2a + b

  • Ex. 2: Multiplicative value functions

– V(a,b) = V(a) ⋅ V(b) = γ^(2a + b), γ < 1

(Figures: the AADDs over a and b, with affine edge labels such as ⟨2/3, 1/3⟩ and ⟨γ³, 1−γ³⟩ – automatically constructed!)

SLIDE 21

ADDs vs. AADDs

  • Additive functions: ∑_{i=1..n} x_i

Note: no context-specific independence, but subdiagrams are shared: result size O(n²)

SLIDE 22

ADDs vs. AADDs

  • Additive functions: ∑_i 2^i x_i

– Best case result for ADD (exp.) vs. AADD (linear)

SLIDE 23

ADDs vs. AADDs

  • Additive functions: ∑_{i=0..n-1} F(x_i, x_{(i+1) mod n})

Pairwise factoring evident in AADD structure

(Figure: a ring of variables x1, …, x7 with pairwise factors)

SLIDE 24

But we want to compile deep networks

Hidden layers are continuous

SLIDE 25

ReLU Deep Nets are Piecewise Linear!

(Rectified Linear Units)

Note: ReLU is just a piecewise linear function

E.g., see MILP compilation of ReLU deep nets for optimization (Say, Wu, Zhou, Sanner; IJCAI-17)

Slide from Buser Say


SLIDE 26

Case → XADD

$$V = \begin{cases}
x_1 + k > 100 \;\wedge\; x_2 + k > 100 &: \ldots \\
x_1 + k > 100 \;\wedge\; x_2 + k \le 100 &: x_2 \\
x_1 + k \le 100 \;\wedge\; x_2 + k > 100 &: x_1 \\
x_1 + x_2 + k > 100 \;\wedge\; x_1 + k \le 100 \;\wedge\; x_2 + k \le 100 \;\wedge\; x_2 > x_1 &: x_2 \\
x_1 + x_2 + k > 100 \;\wedge\; x_1 + k \le 100 \;\wedge\; x_2 + k \le 100 \;\wedge\; x_2 \le x_1 &: x_1 \\
x_1 + x_2 + k \le 100 &: x_1 + x_2 \\
\quad\vdots & \quad\vdots
\end{cases}$$

Sanner et al. (UAI-11); Sanner and Abbasnejad (AAAI-12); Zamani, Sanner et al. (AAAI-12)

SLIDE 27

Compactness of (X)ADDs

  • XADD is linear in the # of decisions φi
  • Case version has an exponential number of partitions!

(Figure: an XADD over decisions φ1 … φ5 with shared substructure)

SLIDE 28

XADD Maximization

(Figure: max of two XADDs over the tests y > 0 and x > 0; the result introduces the new test x > y)

May introduce new decision tests

Operations exploit structure: O(|f||g|)
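A toy sketch (my own, not the talk's XADD code) of why symbolic max introduces new decisions: the max of two linear leaves f and g requires the comparison f − g > 0, which need not appear anywhere in the operands:

```python
from dataclasses import dataclass

@dataclass
class Lin:
    """A linear leaf a*x + b*y + c."""
    a: float
    b: float
    c: float
    def __sub__(self, other):
        return Lin(self.a - other.a, self.b - other.b, self.c - other.c)

def case_max(f, g):
    """XADD-style symbolic max: if (f - g > 0) then f else g."""
    return ('test', f - g, f, g)  # the decision f - g > 0 is brand new

f = Lin(1, 0, 0)  # f(x, y) = x
g = Lin(0, 1, 0)  # g(x, y) = y
print(case_max(f, g))  # introduces the new decision test x - y > 0
```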

SLIDE 29

Maintaining XADD Orderings

  • Max may get decisions out of order
  • Decision ordering (root→leaf): x > y, y > 0, x > 0

(Figure: the max(·,·) from the previous slide, with the new x > y test appearing below y > 0 in the result)

Newly introduced node is out of order!

SLIDE 30

Maintaining XADD Orderings

  • Substitution may get decisions out of order
  • Decision ordering (root→leaf): x > y, y > 0, x > z

(Figure: applying the substitution σ = { z/y } rewrites the x > z tests below y > 0 into x > y tests)

Substituted nodes are now out of order!

SLIDE 31

Correcting XADD Ordering

  • Obtain an ordered XADD from an unordered XADD

– Key idea: binary operations maintain orderings

(Figure: an out-of-order test z with children ID1 and ID0 is rebuilt from ID1, ID0, and indicator diagrams that test only z – the result will have z in order!)

Inductively assume ID1 and ID0 are ordered. All operands are then ordered, so applying ⊗, ⊕ produces an ordered result!
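In symbols, with ⊕/⊗ the XADD Apply operations and [·] a 0/1 indicator diagram testing only z: F = if (z) then ID1 else ID0 = ([z] ⊗ ID1) ⊕ ([¬z] ⊗ ID0). Every operand on the right is ordered, so the ordered Apply operations place z correctly in the result.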
SLIDE 32

Maintaining Minimality

(Figure: an XADD with tests y > 0 and x > 0 above a test x + y < 0)

Node unreachable – x + y < 0 is always false if x > 0 and y > 0

If decisions are linear, this can be detected with the feasibility checker of an LP solver and pruned. More subtle prunings exist as well.
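A small sketch of such a feasibility check (my own illustration, assuming scipy is available): the branch constraints {x > 0, y > 0, x + y < 0} are relaxed by a small ε and handed to an LP solver; infeasibility means the node is unreachable and can be pruned:

```python
from scipy.optimize import linprog

eps = 1e-6
# constraints in A_ub @ [x, y] <= b_ub form:
#   x >= eps  ->  -x <= -eps ;  y >= eps  ->  -y <= -eps ;  x + y <= -eps
A_ub = [[-1, 0], [0, -1], [1, 1]]
b_ub = [-eps, -eps, -eps]
res = linprog(c=[0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
print(res.status)  # 2 = infeasible -> the x + y < 0 branch is unreachable
```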

SLIDE 33

What’s the Minimal Diagram?

Search through all possible node rotations to find it? Canonicity is still an open question!

(Figure: equivalent diagrams over the tests x > 6, x > 7, x > 8 with leaves 1, 2, 3, related by node rotations)

SLIDE 34

Affine XADD?

We’re working on it (affine can be defined in different ways)

SLIDE 35

Compiling Deep Nets

SLIDE 36

Key idea: Compile with XADD Apply!

(Figure: a deep ReLU net with learned weights mapping the state at time t to the state at time t+1; input/output and hidden units)

Build bottom-up… each node is an “Apply” sum and max operation of its children!

Many more details depend on the source model, but this is the key idea permitting compilation and automated inference w.r.t. the deep generative model source.
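To make the bottom-up Apply idea concrete, here is a 1-D toy (my own construction, not the paper's XADD code) where a piecewise-linear function is a list of cases (lo, hi, a, b) meaning a·x + b on [lo, hi); compiling one ReLU neuron is then an Apply of “sum” followed by an Apply of “max with 0”:

```python
def pwl_sum(f, g):
    """Apply(+): refine both case partitions, then add the linear pieces."""
    cuts = sorted({c for lo, hi, _, _ in f + g for c in (lo, hi)})
    out = []
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2
        af, bf = next((a, b) for l, h, a, b in f if l <= mid < h)
        ag, bg = next((a, b) for l, h, a, b in g if l <= mid < h)
        out.append((lo, hi, af + ag, bf + bg))
    return out

def pwl_relu(f):
    """Apply(max(0, .)): split cases at zero crossings, zero out negatives."""
    pieces = []
    for lo, hi, a, b in f:
        if a != 0 and lo < -b / a < hi:  # the piece changes sign inside
            r = -b / a
            pieces += [(lo, r, a, b), (r, hi, a, b)]
        else:
            pieces.append((lo, hi, a, b))
    return [(lo, hi, a, b) if a * (lo + hi) / 2 + b >= 0 else (lo, hi, 0.0, 0.0)
            for lo, hi, a, b in pieces]

# one neuron h(x) = max(0, 2*x + 1), built bottom-up on the domain [-10, 10)
x    = [(-10.0, 10.0, 1.0, 0.0)]                       # the input itself
wx   = [(lo, hi, 2 * a, 2 * b) for lo, hi, a, b in x]  # weight * input
bias = [(-10.0, 10.0, 0.0, 1.0)]
h = pwl_relu(pwl_sum(wx, bias))
print(h)  # [(-10.0, -0.5, 0.0, 0.0), (-0.5, 10.0, 2.0, 1.0)]
```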

SLIDE 37

Open Questions

  • XADD Compilation

– Works for vanilla autoregressive nets

  • Best decision order?
  • Need to multiply and marginalize polynomials (LinPWPoly)
  • Can add hidden variables as explicit variables

– Saves space, but how to formalize inference?
– Message passing for non-probabilistic functions?
  » Need to examine the algebra
– Need extensions to handle GAN/VAE noise source inputs
– Hard to do exact message passing

  • Treewidth bounds do not apply in the continuous case
  • Alternative reductions to stochastic optimization?
  • Other directions

– Continuous arithmetic circuits? – Where autodiff = inference?