SLIDE 1 Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning
Guy Van den Broeck
Stanford Nov 14, 2018
SLIDE 2
Foundation: Logical Circuit Languages
SLIDE 3 Negation Normal Form Circuits
[Darwiche 2002]
Δ = (sun ∧ rain ⇒ rainbow)
SLIDE 4 Decomposable Circuits
Decomposable
[Darwiche 2002]
SLIDE 5 Tractable for Logical Inference
- Is there a solution? (SAT)
– SAT(β ∨ γ) iff SAT(β) or SAT(γ) (always)
– SAT(β ∧ γ) iff SAT(β) and SAT(γ) (decomposable)
- How many solutions are there? (#SAT)
- Complexity linear in circuit size
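As a concrete illustration (not from the slides), here is a minimal Python sketch of the two SAT rules above on a toy NNF circuit representation; the class names are assumptions, and ⊥/⊤ constant nodes are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Lit:            # leaf: a literal such as rain or ¬rain
    var: str
    positive: bool = True

@dataclass
class Or:             # OR gate
    children: Tuple

@dataclass
class And:            # decomposable AND gate: children mention disjoint variables
    children: Tuple

def sat(node) -> bool:
    """Satisfiability check; linear in circuit size (memoize shared nodes)."""
    if isinstance(node, Lit):
        return True                                # a lone literal is always satisfiable
    if isinstance(node, Or):
        return any(sat(c) for c in node.children)  # SAT(β ∨ γ) iff SAT(β) or SAT(γ)
    if isinstance(node, And):
        return all(sat(c) for c in node.children)  # SAT(β ∧ γ) iff SAT(β) and SAT(γ); needs decomposability
    raise TypeError(node)
```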
SLIDE 6 Deterministic Circuits
Deterministic
[Darwiche 2002]
SLIDE 7
How many solutions are there? (#SAT)
SLIDE 8 How many solutions are there? (#SAT)
Arithmetic Circuit
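Counting follows the same recursion: OR gates become +, AND gates become ×, and literal leaves count 1. A hedged sketch reusing the toy classes from the earlier SAT example; it additionally assumes the circuit is deterministic and smooth (every OR's inputs mention the same variables).

```python
def model_count(node) -> int:
    """#SAT via the arithmetic-circuit view: + for OR, × for AND, 1 for literals.
    Correct for decomposable, deterministic, smooth circuits."""
    if isinstance(node, Lit):
        return 1
    if isinstance(node, Or):
        # determinism: children's model sets are disjoint, so counts add
        return sum(model_count(c) for c in node.children)
    if isinstance(node, And):
        # decomposability: children range over disjoint variables, so counts multiply
        count = 1
        for c in node.children:
            count *= model_count(c)
        return count
    raise TypeError(node)
```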
SLIDE 9 Tractable for Logical Inference
- Is there a solution? (SAT)
- How many solutions are there? (#SAT)
- Stricter languages (e.g., BDD, SDD):
– Equivalence checking
– Conjoin/disjoin/negate circuits
- Complexity linear in circuit size
- Compilation into circuit language by either
– ↓ exhaustive SAT solver
– ↑ conjoin/disjoin/negate
SLIDE 10
Learning with Logical Constraints
SLIDE 11 Motivation: Video
[Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos.]
SLIDE 12 Motivation: Robotics
[Wong, L. L., Kaelbling, L. P., & Lozano-Perez, T., Collision-free state estimation. ICRA 2012]
SLIDE 13 Motivation: Language
At least one verb in each sentence
If a modifier is kept, its subject is also kept
- Information extraction
- Semantic role labeling
… and many more!
[Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge],…, [Chang, M. W., Ratinov, L., & Roth, D. (2012). Structured learning with constrained conditional models.], [https://en.wikipedia.org/wiki/Constrained_conditional_model]
SLIDE 14 Motivation: Deep Learning
[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al.. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]
SLIDE 15 Courses:
- Logic (L)
- Knowledge Representation (K)
- Probability (P)
- Artificial Intelligence (A)
Data
- Must take at least one of
Probability or Logic.
- Probability is a prerequisite for AI.
- The prerequisite for KR is
either AI or Logic.
Constraints
Running Example
SLIDE 16
[Table: all 16 truth assignments to L, K, P, A (unstructured) vs. the 9 that satisfy the constraints (structured)]
Structured Space
7 out of 16 instantiations are impossible
- Must take at least one of
Probability (P) or Logic (L).
- Probability is a prerequisite
for AI (A).
- The prerequisite for KR (K) is
either AI or Logic.
SLIDE 17
[Table repeated from the previous slide: 16 unstructured instantiations vs. the 9 structured ones]
Boolean Constraints
7 out of 16 instantiations are impossible
SLIDE 18 Learning in Structured Spaces
Data + Constraints (Background Knowledge, Physics) → Learn → ML Model
Today's machine learning tools don't take knowledge as input!
SLIDE 19
Deep Learning with Logical Constraints
SLIDE 20 Deep Learning with Logical Knowledge
Data + Constraints → Learn → Deep Neural Network
Diagram: Input → Neural Network → Logical Constraint → Output
Output is probability vector p, not Boolean logic!
SLIDE 21 Semantic Loss
Q: How close is output p to satisfying the constraint α?
A: Semantic loss function L(α, p)
Axioms:
– If p is Boolean then L(p, p) = 0
– If p is Boolean and satisfies α then L(α, p) = 0
– If α implies β then L(α, p) ≥ L(β, p) (α is stricter)
– If α is equivalent to β then L(α, p) = L(β, p)
SEMANTIC Loss!
SLIDE 22 Semantic Loss: Definition
Theorem: the axioms imply a unique semantic loss (up to a multiplicative constant):
L(α, p) ∝ −log Σ_{x ⊨ α} Π_{i: x ⊨ Xᵢ} pᵢ Π_{i: x ⊨ ¬Xᵢ} (1 − pᵢ)
Each term is the probability of getting state x after flipping coins with probabilities p; the sum is the probability of satisfying α after flipping coins with probabilities p.
SLIDE 23 Example: Exactly-One
- Data must have some label
We agree this must be one of the 10 digits:
→ For 3 classes:
y₂ ∨ y₃ ∨ y₄   ¬y₂ ∨ ¬y₃   ¬y₃ ∨ ¬y₄   ¬y₂ ∨ ¬y₄
(Each term: probability that only yⱼ is true after flipping coins; summed: probability that exactly one y is true after flipping coins)
SLIDE 24 Semi-Supervised Learning
- Intuition: Unlabeled data must have some label
- Cf. entropy constraints, manifold learning
- Minimize exactly-one semantic loss on unlabeled data
Train with existing loss + w ∙ semantic loss
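A hedged PyTorch-style sketch of this objective for the exactly-one constraint, whose semantic loss has the closed form −log Σⱼ pⱼ Πₖ≠ⱼ (1 − pₖ); the network `net` and the weight w = 0.05 are illustrative assumptions, not values from the talk.

```python
import torch
import torch.nn.functional as F

def exactly_one_semantic_loss(p, eps=1e-12):
    """Semantic loss of the exactly-one constraint: -log of the probability
    that exactly one y_j comes up true when y_j is a coin with bias p_j."""
    # p_j * prod_{k != j} (1 - p_k), vectorized over a batch of rows
    per_class = p * torch.prod(1 - p, dim=-1, keepdim=True) / (1 - p).clamp_min(eps)
    return -torch.log(per_class.sum(dim=-1).clamp_min(eps)).mean()

def training_loss(net, x_labeled, y_labeled, x_unlabeled, w=0.05):
    """Existing loss on labeled data + w * semantic loss on unlabeled data."""
    supervised = F.cross_entropy(net(x_labeled), y_labeled)
    semantic = exactly_one_semantic_loss(torch.sigmoid(net(x_unlabeled)))
    return supervised + w * semantic
```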
SLIDE 25
MNIST Experiment
Competitive with state of the art in semi-supervised deep learning
SLIDE 26 FASHION Experiment
Outperforms Ladder Nets!
Same conclusion on CIFAR10
SLIDE 27 What about real constraints? Paths cf. Nature paper
Good variable assignments (represent a route): 184
Bad variable assignments (do not represent a route): 16,777,032
Unstructured probability space: 184 + 16,777,032 = 16,777,216 = 2²⁴
The structured space is easily encoded in logical constraints [Nishino et al.]
SLIDE 28 How to Compute Semantic Loss?
- In general: #P-hard
- With a logical circuit for α: Linear!
- Example: exactly-one constraint:
- Why? Decomposability and determinism!
L(α, p) = −log Σⱼ pⱼ Πₖ≠ⱼ (1 − pₖ)   (the circuit for the exactly-one constraint evaluates this sum in one bottom-up pass)
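Concretely, the bottom-up pass looks as follows. A hedged sketch reusing the toy Lit/Or/And classes from the earlier SAT example, assuming the circuit for α is decomposable, deterministic, and smooth.

```python
import math

def prob_satisfied(node, p):
    """Probability that α is satisfied when each variable v is an independent
    coin with bias p[v]; one bottom-up pass, linear in circuit size."""
    if isinstance(node, Lit):
        return p[node.var] if node.positive else 1.0 - p[node.var]
    if isinstance(node, Or):
        # determinism: children describe disjoint events, probabilities add
        return sum(prob_satisfied(c, p) for c in node.children)
    if isinstance(node, And):
        # decomposability: children are independent, probabilities multiply
        result = 1.0
        for c in node.children:
            result *= prob_satisfied(c, p)
        return result
    raise TypeError(node)

def semantic_loss(node, p):
    return -math.log(prob_satisfied(node, p))
```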
SLIDE 29 Predict Shortest Paths
Add semantic loss for path constraint
Metrics: Is the output a path? Are individual edge predictions correct? Is the prediction the shortest path? (This is the real task!)
(Same conclusion for predicting sushi preferences; see paper.)
SLIDE 30
Probabilistic Circuits
SLIDE 31
[Figure: logical circuit over L, K, P, A compiled from the course constraints]
Logical Circuits
Can we represent a distribution over the solutions to the constraint?
SLIDE 32
[Figure: the same logical circuit; each AND gate's input circuits mention disjoint variables]
Recall: Decomposability
AND gates have disjoint input circuits
SLIDE 33 Recall: Determinism
[Figure: the circuit evaluated on input L, K, P, A true, with literal leaves L, ¬L, K, ¬K, P, ¬P, A, ¬A and ⊥ nodes]
Input: L, K, P, A are true and ¬L, ¬K, ¬P, ¬A are false
Property: OR gates have at most one true input wire
SLIDE 34
[Figure: the circuit with a normalized parameter on each OR-gate input: 0.1/0.6/0.3 at the root and 0.6/0.4, 0.8/0.2, 0.25/0.75, 0.9/0.1 at inner gates; deterministic wires carry 1]
PSDD: Probabilistic SDD
Syntax: assign a normalized probability to each OR gate input
SLIDE 35
[Figure: the PSDD evaluated bottom-up, multiplying the parameters on the live wires]
Input: L, K, P, A are true
Pr(L, K, P, A) = 0.3 × 1 × 0.8 × 0.4 × 0.25 = 0.024
PSDD: Probabilistic SDD
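The evaluation above is a single bottom-up pass: by determinism, at most one input of each OR gate is live, so the probability is the product of the parameters on the live wires. A minimal sketch with a toy PSDD representation (class and field names are assumptions):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PLit:                 # leaf literal, e.g. L or ¬L
    var: str
    positive: bool = True

@dataclass
class PAnd:                 # decomposable AND gate
    children: Tuple

@dataclass
class POr:                  # deterministic OR gate with one parameter per input
    inputs: Tuple           # pairs (theta, child); thetas sum to 1 per gate

def psdd_prob(node, x):
    """Pr(x) for a complete assignment x = {var: bool}; linear in circuit size."""
    if isinstance(node, PLit):
        return 1.0 if x[node.var] == node.positive else 0.0
    if isinstance(node, PAnd):
        result = 1.0
        for c in node.children:
            result *= psdd_prob(c, x)
        return result
    if isinstance(node, POr):
        # determinism: at most one weighted term is nonzero
        return sum(theta * psdd_prob(c, x) for theta, c in node.inputs)
    raise TypeError(node)
```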
SLIDE 36
[Figure: the PSDD with each node annotated by the normalized distribution it defines over its variables]
Can read probabilistic independences off the circuit structure
Each node represents a normalized distribution!
SLIDE 37 Tractable for Probabilistic Inference
- Find the most-likely assignment to x given y (MAP)
(otherwise NP-hard)
- Compute conditional probabilities Pr(x|y)
(otherwise #P-hard)
- Sample from Pr(x|y)
- Algorithms linear in circuit size
(pass up, pass down, similar to backprop)
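For instance, conditional probabilities reduce to two upward passes: Pr(x|y) = Pr(x, y) / Pr(y), where unobserved literals evaluate to 1 so they are summed out (this relies on smoothness). A sketch reusing the toy PSDD classes above:

```python
def psdd_marginal(node, evidence):
    """Pr(evidence) for partial evidence {var: bool} in one upward pass."""
    if isinstance(node, PLit):
        if node.var not in evidence:
            return 1.0                     # unobserved: marginalized out
        return 1.0 if evidence[node.var] == node.positive else 0.0
    if isinstance(node, PAnd):
        result = 1.0
        for c in node.children:
            result *= psdd_marginal(c, evidence)
        return result
    if isinstance(node, POr):
        return sum(theta * psdd_marginal(c, evidence) for theta, c in node.inputs)
    raise TypeError(node)

def psdd_conditional(root, x, y):
    """Pr(x | y) = Pr(x, y) / Pr(y)."""
    return psdd_marginal(root, {**x, **y}) / psdd_marginal(root, y)
```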
SLIDE 38
[Figure: the PSDD with parameters annotated in plain language: "student takes course L", "student takes course P", "probability of course P given L"]
Parameters are Interpretable
Explainable AI DARPA Program
SLIDE 39
Learning Probabilistic Circuit Parameters
SLIDE 40 Learning Algorithms
Maximum likelihood from complete data
- One pass over data to estimate Pr(x|y)
- Where does the structure come from?
For now: simply compiled from constraint…
Not a lot to say: very easy!
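A hedged sketch of that single counting pass over the toy PSDD classes from above: each example follows its unique live input at every OR gate it reaches (determinism), and each parameter is the fraction of examples that took that wire. Smoothing of zero counts is omitted.

```python
from collections import defaultdict

def ml_parameters(root, data):
    """Closed-form max-likelihood estimates: theta = wire count / gate count."""
    gate_count, wire_count = defaultdict(int), defaultdict(int)

    def live(node, x):                     # does the sub-circuit accept x?
        if isinstance(node, PLit):
            return x[node.var] == node.positive
        if isinstance(node, PAnd):
            return all(live(c, x) for c in node.children)
        return any(live(c, x) for _, c in node.inputs)

    def route(node, x):
        if isinstance(node, PAnd):
            for c in node.children:
                route(c, x)
        elif isinstance(node, POr):
            gate_count[id(node)] += 1
            for i, (_, c) in enumerate(node.inputs):
                if live(c, x):             # determinism: at most one live input
                    wire_count[(id(node), i)] += 1
                    route(c, x)
                    break

    for x in data:                          # one pass over complete data
        route(root, x)
    return {w: wire_count[w] / gate_count[w[0]] for w in wire_count}
```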
SLIDE 41 Combinatorial Objects: Rankings
10 items: 3,628,800 rankings
20 items: 2,432,902,008,176,640,000 rankings
Ranking 1 (rank: sushi): 1 fatty tuna, 2 sea urchin, 3 salmon roe, 4 shrimp, 5 tuna, 6 squid, 7 tuna roll, 8 sea eel, 9 egg, 10 cucumber roll
Ranking 2 (rank: sushi): 1 shrimp, 2 sea urchin, 3 salmon roe, 4 fatty tuna, 5 tuna, 6 squid, 7 tuna roll, 8 sea eel, 9 egg, 10 cucumber roll
SLIDE 42 Combinatorial Objects: Rankings
Example ranking (rank: sushi): 1 fatty tuna, 2 sea urchin, 3 salmon roe, 4 shrimp, 5 tuna, 6 squid, 7 tuna roll, 8 sea eel, 9 egg, 10 cucumber roll
- Predict Boolean Variables:
A_ij: item i is at position j
- Each item i is assigned a unique position (n constraints)
- Each position j is assigned a unique item (n constraints; a CNF sketch follows)
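A small sketch (not from the talk) that generates these 2n exactly-one constraints as CNF clauses over DIMACS-style integer variables:

```python
from itertools import combinations

def exactly_one(lits):
    """CNF for 'exactly one of lits': at-least-one plus pairwise at-most-one."""
    return [list(lits)] + [[-a, -b] for a, b in combinations(lits, 2)]

def ranking_cnf(n):
    """Constraints over A[i][j] ('item i is at position j'), variables 1..n*n."""
    A = [[i * n + j + 1 for j in range(n)] for i in range(n)]
    clauses = []
    for i in range(n):
        clauses += exactly_one(A[i])                         # item i: unique position
    for j in range(n):
        clauses += exactly_one([A[i][j] for i in range(n)])  # position j: unique item
    return clauses
```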
SLIDE 43 Learning Preference Distributions
Special-purpose distribution: mixture-of-Mallows
– number of components from 1 to 20
– EM with 10 random seeds
– implementation of Lu & Boutilier
PSDD: circuit structure does not even depend on the data!
SLIDE 44
Learning Probabilistic Circuit Structure
SLIDE 45
Structure Learning Primitive
SLIDE 46
Structure Learning Primitive
Primitives maintain PSDD properties and constraint of root!
SLIDE 47 LearnPSDD Algorithm
1. (Vtree learning)*
2. Construct the most naïve PSDD
3. LearnPSDD (search for better structure)
Search loop: generate candidate operations → simulate them → execute the best one
Works with or without logical constraint.
SLIDE 48
PSDDs …are Sum-Product Networks …are Arithmetic Circuits
[Figure: a PSDD OR gate with parameters θ₁ … θₙ over inputs p₁·s₁ … pₙ·sₙ maps to an AC sum node Σᵢ θᵢ · pᵢ · sᵢ]
SLIDE 49
Experiments on 20 datasets
Compared to SPN learners, LearnPSDD achieves comparable performance with smaller circuits
SLIDE 50 Learn Mixtures of PSDDs
State of the art
Q: “Help! I need to learn a discrete probability distribution…” A: Learn mixture of PSDDs! Strongly outperforms
- Bayesian network learners
- Markov network learners
Competitive with
- SPN learners
- Cutset network learners
SLIDE 51
Logistic Circuits
SLIDE 52 What if I only want to classify Y?
Pr(𝑍, 𝐵, 𝐶, 𝐷, 𝐸)
SLIDE 53 Logistic Circuits
Represents Pr(Z | B, C, D, E)
- Take all 'hot' wires
- Sum their weights
- Push through logistic function
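A minimal sketch (toy representation, assumed names) of that three-step classification rule; a wire is 'hot' when it lies on the unique live path selected by the input:

```python
import math
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CLit:
    var: str
    positive: bool = True

@dataclass
class CAnd:
    children: Tuple

@dataclass
class COr:
    inputs: Tuple            # pairs (weight, child): one weight per input wire

def live(node, x):
    """Does the sub-circuit evaluate to true under complete input x?"""
    if isinstance(node, CLit):
        return x[node.var] == node.positive
    if isinstance(node, CAnd):
        return all(live(c, x) for c in node.children)
    return any(live(c, x) for _, c in node.inputs)

def predict(root, x):
    """Pr(Z = 1 | x): sum the weights of the hot wires, apply the logistic."""
    total = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, CAnd):
            stack.extend(node.children)
        elif isinstance(node, COr):
            for weight, child in node.inputs:
                if live(child, x):          # determinism: one hot wire per OR gate
                    total += weight
                    stack.append(child)
                    break
    return 1.0 / (1.0 + math.exp(-total))
```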
SLIDE 54 Logistic vs. Probabilistic Circuits
Probabilities become log-odds: the logistic circuit represents Pr(Z | B, C, D, E) rather than the joint Pr(Z, B, C, D, E)
SLIDE 55 Parameter Learning
Reduce to logistic regression:
Features associated with each wire “Global Circuit Flow” features
Learning parameters θ is convex optimization!
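Since Pr(Z | x) is the logistic of a weighted sum of 0/1 wire features, fitting θ is plain logistic regression on a precomputed feature matrix. A hedged scikit-learn sketch, where X_flow[i, k] = 1 iff wire k is hot for example i (computed, e.g., with the live-wire walk from the previous sketch):

```python
from sklearn.linear_model import LogisticRegression

def fit_wire_weights(X_flow, z):
    """X_flow: (num_examples, num_wires) 0/1 'circuit flow' features;
    z: 0/1 class labels. Returns one learned weight per wire."""
    model = LogisticRegression(fit_intercept=False)   # convex; global optimum
    model.fit(X_flow, z)
    return model.coef_.ravel()
```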
SLIDE 56 Logistic Circuit Structure Learning
Search loop: generate candidate operations → calculate gradient variance → execute the best operation
Similar to LearnPSDD structure learning
SLIDE 57
Comparable Accuracy with Neural Nets
SLIDE 58
Significantly Smaller in Size
Better Data Efficiency
SLIDE 59
Interpretable?
SLIDE 60
Reasoning with Probabilistic Circuits
SLIDE 61 Compilation target for probabilistic reasoning
Bayesian networks, factor graphs, probabilistic databases, relational Bayesian networks, probabilistic programs, Markov logic → Probabilistic Circuits
SLIDE 62
Compilation for Prob. Inference
SLIDE 63 Collapsed Compilation
To sample a circuit (sketched in code below):
1. Compile bottom up until you reach the size limit
2. Pick a variable you want to sample
3. Sample it according to its marginal distribution in the current circuit
4. Condition on the sampled value
5. (Repeat)
Asymptotically unbiased importance sampler
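A loose Python sketch of that loop; all helpers (compile_factor, conjoin, circuit_size, marginal, condition, TRUE_CIRCUIT) are hypothetical stand-ins for a knowledge-compilation library, not APIs from the paper:

```python
import random

def collapsed_sample(factors, size_limit, pick_var):
    """One collapsed sample: a circuit over the remaining variables plus the
    proposal probability of the values sampled along the way."""
    circuit, q = TRUE_CIRCUIT, 1.0                           # hypothetical helpers
    for factor in factors:
        circuit = conjoin(circuit, compile_factor(factor))   # 1. compile bottom up
        while circuit_size(circuit) > size_limit:
            var = pick_var(circuit)                          # 2. pick a variable
            p = marginal(circuit, var)                       # 3. marginal in current circuit
            value = random.random() < p
            q *= p if value else 1.0 - p                     # track proposal probability
            circuit = condition(circuit, var, value)         # 4. condition; 5. repeat
    return circuit, q                 # q determines the importance weight
```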
SLIDE 64
Circuits + importance weights approximate any query
SLIDE 65
Experiments
Competitive with state-of-the-art approximate inference in graphical models. Outperforms it on several benchmarks!
SLIDE 66
Reasoning About Classifiers
SLIDE 67 Classifier Trimming
[Figure: classifier β over features F1, F2, F3, F4 with threshold U is trimmed to classifier γ over F2, F3 with threshold U′]
Trim features while maintaining classification behavior
SLIDE 68
How to measure Similarity?
What is the expected probability that classifier β will agree with its trimming γ?
“Expected Classification Agreement”
SLIDE 69 Solving PP^PP problems with constrained SDDs
[Figure: a constrained SDD f over L, K, P, A; in steps 1-3 it is conditioned on M ∧ L to obtain g|(M ∧ L)]
SLIDE 70
SDD method faster than traditional jointree inference
SLIDE 71
Classification agreement and accuracy
Higher agreement tends to yield higher accuracy
An additional dimension for feature selection
SLIDE 72 Conclusions
Statistical ML ("Probability") + Symbolic AI ("Logic") + Connectionism ("Deep")
Circuits
SLIDE 73 Questions?
PSDD with 15,000 nodes
SLIDE 74 References
- Doga Kisa, Guy Van den Broeck, Arthur Choi and Adnan Darwiche. Probabilistic Sentential Decision Diagrams, In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), 2014.
- Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Tractable Learning for Structured Probability Spaces: A Case Study in Learning Preference Distributions, In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
- Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Probability Distributions over Structured Spaces, In Proceedings of the AAAI Spring Symposium on KRR, 2015.
- Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche and Guy Van den Broeck. Tractable Learning for Complex Probability Queries, In Advances in Neural Information Processing Systems 28 (NIPS), 2015.
- YooJung Choi, Adnan Darwiche and Guy Van den Broeck. Optimal Feature Selection for Decision Robustness in Bayesian Networks, In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.
SLIDE 75 References
- Yitao Liang, Jessa Bekker and Guy Van den Broeck. Learning the Structure of Probabilistic Sentential Decision Diagrams, In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
- Yitao Liang and Guy Van den Broeck. Towards Compact Interpretable Models: Shrinking of Learned Probabilistic Sentential Decision Diagrams, In IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 2017.
- YooJung Choi and Guy Van den Broeck. On Robust Trimming of Bayesian Network Classifiers, In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018.
- Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang and Guy Van den Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge, In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
- Tal Friedman and Guy Van den Broeck. Approximate Knowledge Compilation by Online Collapsed Importance Sampling, In Advances in Neural Information Processing Systems 31 (NIPS), 2018.
- Yitao Liang and Guy Van den Broeck. Learning Logistic Circuits, In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI), 2019.