Compiling Deep Nets
Scott Sanner
Goal of this talk
- Will not evangelize deep networks / successes
– Go to ICML, NIPS, Silicon Valley, read tech news
– “Just believe”
- But deep nets do not solve all problems
– Yet
– Lack techniques for handling arbitrary queries
– With compilations, that could change
Probabilistic Inference with Arbitrary Queries
Why Deep Nets and not Graphical Models?
Graphical Models Revisited
- HMM / Chain-CRF
- (Cond.) Ising Model
- LSTM-based RNN
- Convolutional NN
Graphical Models vs. Deep Nets
Graphical Models
- Structured
- Convex (parameter learning, if exp. family and not latent)
- Latent models are more niche
– Mixture models, LDA, Bayesians
- Arbitrary Exact Inference: P(Q|E)
– Intractable (unless compiled)
Deep (Generative) Neural Networks
- Also structured
- Convex? What’s that?
– Adam, RMSProp work well; see Neural Taylor Approximations (ICML-17)
- It’s all about latent (hidden)
– Massively overparameterized hidden layer representation
– Helps with non-convexity
– Exacerbates overfitting – need novel regularizers (dropout, batch norm)
- Arbitrary Exact Inference: P(Q|E)
– Unknown (can we compile?)
Not just learning, also planning/control (Wu, Say, Sanner; NIPS-17)
Maybe we could cross-pollinate back to GMs?
Should we all switch to Deep Nets?
- Not quite yet…
- Deep nets are much more specialized than the general motivation for graphical models
- In order to answer general P(Q|E)
– First need a deep generative model
– But most currently do inference via sampling
– How to do arbitrary exact inference?
- Compilations required to support such inference
– Many flavors
Remainder of Talk
- Deep Generative Models
- Arithmetic and Continuous Decision Diagrams
– Where my focus has been
– Support marginalization for queries
- Though really hard to bound inference complexity
- Not the only option, but we need continuous compilations
- Compiling Deep Generative Models to DDs
Treewidth is a discrete graphical model notion
Deep Generative Models
Alphabet Soup: GANs, VAEs, etc.
Vanilla ReLU Deep Network Structure
(Rectified Linear Units)
[Figure: feedforward network – input layer, hidden layer, output layer; input/output nodes and hidden units]
Note: ReLU is just a piecewise linear function
Slide from Buser Say
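A quick aside (a minimal numpy sketch, not from the slides; all weights are made up): ReLU's two-case piecewise-linear semantics, and a toy one-hidden-layer net that is therefore piecewise linear in its input.

    import numpy as np

    def relu(v):
        # ReLU as an explicit two-case piecewise-linear function
        return np.where(v > 0, v, 0.0)

    # Toy one-hidden-layer net: within any fixed on/off pattern of the
    # hidden units, the output is exactly linear in the input x.
    W1, b1 = np.array([[1.0], [-1.0]]), np.array([0.0, 1.0])
    w2, b2 = np.array([1.0, 2.0]), 0.5

    def net(x):
        h = relu(W1 @ np.array([x]) + b1)   # hidden layer
        return w2 @ h + b2                  # linear output layer

    print(net(-2.0), net(0.5))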
Generative Adversarial Networks (GANs)
- Generator + Discriminator framework
– “Fake Data” is from generative model
- Can capture complex distributions through “refined” backpropagation
– For fictitious image generation, can generate clearer images than autoencoders minimizing RMSE
Slide from Ga Wu
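For reference (standard result, not on the slide): the minimax objective this generator/discriminator game optimizes (Goodfellow et al., 2014):

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]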
Variational Auto-Encoders (VAEs)
- Optimize variational lower bound of P(X)
- Two-way mapping
– Encoder: P(Z|X)
– Decoder: P(X|Z) – generative model
- Re-parameterization Trick
– N(µ,σ) = µ + σ·N(0,1), i.e. z = µ + σ·ε with ε ~ N(0,1)
– Separates deterministic reasoning from the stochastic part
Slide from Ga Wu
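A minimal numpy sketch of the re-parameterization trick (names are illustrative, not from the slides):

    import numpy as np

    def sample_z(mu, sigma, rng):
        # z ~ N(mu, sigma^2), rewritten as a deterministic function of
        # (mu, sigma) plus parameter-free noise eps ~ N(0, 1), so
        # gradients can flow through mu and sigma.
        eps = rng.standard_normal(mu.shape)
        return mu + sigma * eps

    rng = np.random.default_rng(0)
    mu, sigma = np.array([0.0, 2.0]), np.array([1.0, 0.5])
    z = sample_z(mu, sigma, rng)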
Deep Autoregressive Networks
- Standard Graphical Model
– Except that conditional probabilities are deep networks
- Some recent more complex variants
– WaveNet, PixelCNN, PixelRNN
- Note: cannot use standard message passing algorithms with deep net factors
– But we might use decision diagrams
[Figure: autoregressive graphical model over variables A, E, B, X]
Slide from Ga Wu
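A toy sketch of this factorization (a random, untrained ReLU MLP stands in for each deep-net conditional; everything here is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    def make_conditional(n_parents, width=8):
        # one (random, untrained) ReLU MLP per factor P(x_i = 1 | x_<i)
        W1 = rng.standard_normal((width, max(n_parents, 1)))
        w2 = rng.standard_normal(width)
        def p_one(parents):
            inp = parents if n_parents else np.ones(1)
            return sigmoid(w2 @ np.maximum(0.0, W1 @ inp))
        return p_one

    conds = [make_conditional(i) for i in range(4)]

    def joint(x):
        # P(x) = prod_i P(x_i | x_<i) -- a graphical model whose
        # conditional probabilities are deep networks
        p = 1.0
        for i, c in enumerate(conds):
            pi = c(np.asarray(x[:i], dtype=float))
            p *= pi if x[i] == 1 else 1.0 - pi
        return p

    print(joint([1, 0, 1, 1]))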
Decision Diagrams
Alphabet Soup: ADDs, AADDs, XADDs
Exploits context-specific independence (CSI) and shared substructure.
Algebraic Decision Diagram (ADD)
Function Representation (ADDs)
- Why not a directed acyclic graph (DAG)?
a b c | F(a,b,c)
0 0 0 | 0.00
0 0 1 | 0.00
0 1 0 | 0.00
0 1 1 | 1.00
1 0 0 | 0.00
1 0 1 | 1.00
1 1 0 | 0.00
1 1 1 | 1.00
[Figure: corresponding ADD over decisions a, b, c]
- AND OR XOR
- Trees can compactly represent AND / OR
– But not XOR (linear as ADD, exponential as tree)
– Why? Trees must represent every path
Trees vs. ADDs
[Figure: the same function as a decision tree (repeated x2, x3 subtrees) vs. as an ADD with shared subdiagrams over x1, x2, x3]
Binary Operations (ADDs)
- Why do we order variable tests?
- Enables us to do efficient binary operations…
[Figure: Apply operation combining two ADDs over decisions a, b, c]
Result: ADD operations can avoid state enumeration
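A minimal sketch of the classic Apply recursion these ordered operations rely on (tuple-encoded nodes; a real package would also memoize results and reduce/hash-cons nodes):

    def apply_op(f, g, op):
        # Leaves are floats; internal nodes are (var, hi, lo) with vars
        # tested in a fixed global order -- the shared ordering is what
        # lets the recursion descend both diagrams in lockstep instead
        # of enumerating all joint states.
        if isinstance(f, float) and isinstance(g, float):
            return op(f, g)
        vf = f[0] if isinstance(f, tuple) else None
        vg = g[0] if isinstance(g, tuple) else None
        v = min(x for x in (vf, vg) if x is not None)
        fh, fl = (f[1], f[2]) if vf == v else (f, f)
        gh, gl = (g[1], g[2]) if vg == v else (g, g)
        return (v, apply_op(fh, gh, op), apply_op(fl, gl, op))

    # f = if a then 1 else 0 ;  g = if b then 2 else 0  (order: a < b)
    f = ('a', 1.0, 0.0)
    g = ('b', 2.0, 0.0)
    print(apply_op(f, g, lambda x, y: x + y))   # pointwise sum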
- Are ADDs enough?
- Or do we need more compactness?
- Ex. 1: Additive reward/utility functions
– R(a,b,c) = R(a) + R(b) + R(c) = 4a + 2b + c
- Ex. 2: Multiplicative value functions
– V(a,b,c) = V(a) ⋅ V(b) ⋅ V(c) = γ^(4a + 2b + c)
ADD Inefficiency
[Figure: ADD for 4a + 2b + c with leaves 7, 6, 5, …, 0 and ADD for γ^(4a+2b+c) with leaves γ^7 … γ^0 – every leaf distinct, so no shared structure and exponentially many leaves]
- Define a new decision diagram – Affine ADD
- Edges labeled by offset (c) and multiplier (b):
- Semantics: if (a) then (c1+b1F1) else (c2+b2F2)
Affine ADD (AADD)
(Sanner & McAllester, IJCAI-05)
[Figure: AADD node testing a, with edges ⟨c1,b1⟩ and ⟨c2,b2⟩ to subdiagrams F1 and F2]
- Maximize sharing by normalizing nodes to [0,1]
- Example: if (a) then (4) else (2)
Affine ADD (AADD)
Normalize
[Figure: edges ⟨4,0⟩ / ⟨2,0⟩ on test a normalize to ⟨1,0⟩ / ⟨0,0⟩ under a top-level transform ⟨2,2⟩]
Need top-level affine transform to recover original range
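A sketch of the normalization arithmetic for a single node (assuming, as in the AADD definition, that both child subfunctions are already normalized to [0,1]):

    def normalize(c1, b1, c2, b2):
        # Edge <c,b> denotes c + b*F, with each child F already
        # normalized to range [0,1]. (Constant nodes, where r == 0,
        # need a special case omitted here.)
        lo = min(c1, c2)               # node minimum
        hi = max(c1 + b1, c2 + b2)     # node maximum
        r = hi - lo
        true_edge  = ((c1 - lo) / r, b1 / r)
        false_edge = ((c2 - lo) / r, b2 / r)
        return (lo, r), true_edge, false_edge   # top-level <lo, r>

    # Slide example: if (a) then 4 else 2
    # -> ((2.0, 2.0), (1.0, 0.0), (0.0, 0.0)), i.e. <2,2> over <1,0>/<0,0>
    print(normalize(4.0, 0.0, 2.0, 0.0))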
- Back to our previous examples…
- Ex. 1: Additive reward/utility functions
- R(a,b) = R(a) + R(b) = 2a + b
- Ex. 2: Multiplicative value functions
- V(a,b) = V(a) ⋅ V(b) = γ^(2a + b); γ < 1
AADD Examples
[Figure: AADDs for 2a + b (edges ⟨2/3,1/3⟩, ⟨0,1/3⟩, ⟨1,0⟩, ⟨0,0⟩ under top-level ⟨0,3⟩) and for γ^(2a+b) (edges involving γ^2, γ^3, normalized by 1-γ^3) – automatically constructed!]
ADDs vs. AADDs
- Additive functions: Σ_{i=1..n} x_i
Note: no context-specific independence, but subdiagrams shared: result size O(n2)
ADDs vs. AADDs
- Additive functions: Σ_i 2^i x_i
– Best-case separation: ADD exponential vs. AADD linear
ADDs vs. AADDs
- Additive functions: Σ_{i=0..n-1} F(x_i, x_{(i+1) mod n})
Pairwise factoring evident in AADD structure
But we want to compile deep networks
Hidden layers are continuous
ReLU Deep Nets are Piecewise Linear!
(Rectified Linear Units)
Note: ReLU is just a piecewise linear function
E.g., see MILP compilation of ReLU deep nets for optimization (Say, Wu, Zhou, Sanner; IJCAI-17)
Slide from Buser Say
V =
  x1 + k > 100 ∧ x2 + k > 100 : …
  x1 + k > 100 ∧ x2 + k ≤ 100 : x2
  x1 + k ≤ 100 ∧ x2 + k > 100 : x1
  x1 + x2 + k > 100 ∧ x1 + k ≤ 100 ∧ x2 + k ≤ 100 ∧ x2 > x1 : x2
  x1 + x2 + k > 100 ∧ x1 + k ≤ 100 ∧ x2 + k ≤ 100 ∧ x2 ≤ x1 : x1
  x1 + x2 + k ≤ 100 : x1 + x2
  …
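To see such cases emerge mechanically, a sketch using sympy to fold a sum of two hypothetical ReLU units into one explicit piecewise-linear case function (the expressions are made up for illustration, mirroring the structure of V above):

    import sympy as sp

    x1, x2, k = sp.symbols('x1 x2 k')

    def relu(e):
        # ReLU written directly as a two-case piecewise-linear function
        return sp.Piecewise((e, e > 0), (0, True))

    # The sum folds into a single case statement whose partitions
    # conjoin the decisions of the individual units.
    h1 = relu(x1 + k - 100)
    h2 = relu(x2 + k - 100)
    print(sp.piecewise_fold(h1 + h2))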
Case → XADD
Sanner et al (UAI-11); Sanner and Abbasnejad (AAAI-12); Zamani, Sanner et al (AAAI-12)
Compactness of (X)ADDs
- XADD is linear in # of decisions φi
- Case version has exponential number of partitions!
[Figure: XADD over decisions φ1 … φ5 with shared subdiagrams]
XADD Maximization
max( if (y > 0) then y else 0 , if (x > 0) then x else 0 ) =
[Figure: resulting XADD with tests y > 0, x > 0 and the newly introduced test x > y]
May introduce new decision tests
Operations exploit structure: O(|f||g|)
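A sympy sketch of this casemax (hand-rolled, not the slides' XADD package, to make the fresh decision tests visible):

    import sympy as sp

    x, y = sp.symbols('x y')

    def casemax(f, g):
        # max of two case functions: cross partitions pairwise; inside
        # each joint region a NEW decision test (fv > gv) picks the value
        parts = []
        for fv, fc in f.args:
            for gv, gc in g.args:
                parts.append((fv, sp.And(fc, gc, fv > gv)))
                parts.append((gv, sp.And(fc, gc)))
        return sp.Piecewise(*parts)

    f = sp.Piecewise((y, y > 0), (0, True))   # if y > 0 then y else 0
    g = sp.Piecewise((x, x > 0), (0, True))   # if x > 0 then x else 0
    print(casemax(f, g))   # partitions now include the new test y > x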
Maintaining XADD Orderings
- Max may get decisions out of order
- Decision ordering (root→leaf):
– x > y
– y > 0
– x > 0
max( if (y > 0) then y else 0 , if (x > 0) then x else 0 ) =
[Figure: result with the new x > y node appearing below y > 0 and x > 0]
Newly introduced node is out of order!
Maintaining XADD Orderings
- Substitution may get decisions out of order
- Decision ordering (root→leaf):
– x > y
– y > 0
– x > z
[Figure: XADD with root y > 0 over tests x > z; applying substitution σ = { z/y } replaces z with y, turning the x > z tests into x > y]
Substituted nodes are now out of order!
Correcting XADD Ordering
- Obtain ordered XADD from unordered XADD
– key idea: binary operations maintain orderings
[Figure: a node testing z (children ID1, ID0) where z is out of order is rewritten as
  (ID1 ⊗ (if z then 1 else 0)) ⊕ (ID0 ⊗ (if z then 0 else 1))
– the result will have z in order!]
Inductively assume ID1 and ID0 are ordered. All operands are ordered, so applying ⊗, ⊕ produces an ordered result!
Maintaining Minimality
[Figure: XADD with decisions x > 0, y > 0 above a test x + y < 0, with leaves x + y and y]
Node unreachable – x + y < 0 always false if x > 0 & y > 0
- If linear, can detect with the feasibility checker of an LP solver & prune
- More subtle prunings as well
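A sketch of that pruning test with scipy's LP solver (strict inequalities relaxed by a small epsilon, a standard practical workaround):

    import numpy as np
    from scipy.optimize import linprog

    def feasible(A_ub, b_ub):
        # Feasibility-only LP: minimize 0 subject to A_ub @ v <= b_ub
        res = linprog(c=np.zeros(A_ub.shape[1]), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * A_ub.shape[1])
        return res.status == 0

    eps = 1e-6
    # Path constraints x > 0, y > 0 plus candidate decision x + y < 0,
    # written as -x <= -eps, -y <= -eps, x + y <= -eps:
    A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
    b = np.array([-eps, -eps, -eps])
    print(feasible(A, b))   # False -> branch unreachable, prune it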
What’s the Minimal Diagram?
- Search through all possible node rotations to find it?
- Canonicity is still an open question!
[Figure: two equivalent XADDs over tests x > 6, x > 7, x > 8 with leaves 1, 2, 3 – different root tests, same function; number line marking thresholds 6, 7, 8]
Affine XADD?
We’re working on it (affine can be defined in different ways)
Compiling Deep Nets
Key idea: Compile with XADD Apply!
[Figure: deep network with learned weights (rectified linear units), mapping state at time t to state at time t+1]
Build bottom-up… each node is an “Apply” sum and max operation of its children!
Many more details depend on the source model, but this is the key idea permitting compilation and automated inference w.r.t. the deep generative model source.
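To make the bottom-up Apply construction concrete, a toy sympy sketch (the weights are made up; a real XADD library would keep the diagrams reduced and ordered):

    import sympy as sp

    s = sp.Symbol('s')   # state at time t

    def relu_case(e):
        # the "max" Apply: max(0, e) as an explicit case function
        return sp.Piecewise((e, e > 0), (0, True))

    # Hidden units: a weighted-sum Apply followed by the max Apply
    h1 = relu_case(2*s - 1)
    h2 = relu_case(-s + 3)

    # Output: a final sum Apply over the hidden children; folding
    # yields one case function for the state at time t+1
    out = sp.piecewise_fold(0.5*h1 + 1.5*h2)
    print(out)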
Open Questions
- XADD Compilation
– Works for vanilla autoregressive nets
- Best decision order?
- Need to multiply and marginalize polynomials (LinPWPoly)
- Can add hidden variables as explicit variables
– Saves space, but how to formalize inference?
– Message passing for non-probabilistic functions?
» Need to examine the algebra
– Need extensions to handle GAN/VAE noise source inputs
– Hard to do exact message passing
- Treewidth bounds do not apply in continuous case
- Alternative reductions to stochastic optimization?
- Other directions