SLIDE 1 Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning
Guy Van den Broeck
Stanford Nov 14, 2018
SLIDE 2
Foundation: Logical Circuit Languages
SLIDE 3 Negation Normal Form Circuits
[Darwiche 2002]
Δ = (sun ∧ rain ⇒ rainbow)
SLIDE 4 Decomposable Circuits
Decomposable
[Darwiche 2002]
SLIDE 5 Tractable for Logical Inference
- Is there a solution? (SAT)
– SAT(β ∨ γ) iff SAT(β) or SAT(γ) (always)
– SAT(β ∧ γ) iff SAT(β) and SAT(γ) (decomposable)
- How many solutions are there? (#SAT)
- Complexity linear in circuit size
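As a concrete illustration (not from the slides), here is a minimal Python sketch of the two SAT rules above on a toy NNF circuit representation; the class names are assumptions, and ⊥/⊤ constant nodes are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Lit:            # leaf: a literal such as rain or ¬rain
    var: str
    positive: bool = True

@dataclass
class Or:             # OR gate
    children: Tuple

@dataclass
class And:            # decomposable AND gate: children mention disjoint variables
    children: Tuple

def sat(node) -> bool:
    """Satisfiability check; linear in circuit size (memoize shared nodes)."""
    if isinstance(node, Lit):
        return True                                # a lone literal is always satisfiable
    if isinstance(node, Or):
        return any(sat(c) for c in node.children)  # SAT(β ∨ γ) iff SAT(β) or SAT(γ)
    if isinstance(node, And):
        return all(sat(c) for c in node.children)  # SAT(β ∧ γ) iff SAT(β) and SAT(γ); needs decomposability
    raise TypeError(node)
```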
SLIDE 6 Deterministic Circuits
Deterministic
[Darwiche 2002]
SLIDE 7
How many solutions are there? (#SAT)
SLIDE 8 How many solutions are there? (#SAT)
Arithmetic Circuit
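Counting follows the same recursion: OR gates become +, AND gates become ×, and literal leaves count 1. A hedged sketch reusing the toy classes from the earlier SAT example; it additionally assumes the circuit is deterministic and smooth (every OR's inputs mention the same variables).

```python
def model_count(node) -> int:
    """#SAT via the arithmetic-circuit view: + for OR, × for AND, 1 for literals.
    Correct for decomposable, deterministic, smooth circuits."""
    if isinstance(node, Lit):
        return 1
    if isinstance(node, Or):
        # determinism: children's model sets are disjoint, so counts add
        return sum(model_count(c) for c in node.children)
    if isinstance(node, And):
        # decomposability: children range over disjoint variables, so counts multiply
        count = 1
        for c in node.children:
            count *= model_count(c)
        return count
    raise TypeError(node)
```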
SLIDE 9 Tractable for Logical Inference
- Is there a solution? (SAT)
- How many solutions are there? (#SAT)
- Stricter languages (e.g., BDD, SDD):
– Equivalence checking
– Conjoin/disjoin/negate circuits
- Complexity linear in circuit size
- Compilation into circuit language by either
– ↓ exhaustive SAT solver
– ↑ conjoin/disjoin/negate
SLIDE 10
Learning with Logical Constraints
SLIDE 11 Motivation: Video
[Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos.]
SLIDE 12 Motivation: Robotics
[Wong, L. L., Kaelbling, L. P., & Lozano-Perez, T., Collision-free state estimation. ICRA 2012]
SLIDE 13 Motivation: Language
At least one verb in each sentence
If a modifier is kept, its subject is also kept
- Information extraction
- Semantic role labeling
… and many more!
[Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge],…, [Chang, M. W., Ratinov, L., & Roth, D. (2012). Structured learning with constrained conditional models.], [https://en.wikipedia.org/wiki/Constrained_conditional_model]
SLIDE 14 Motivation: Deep Learning
[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al.. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]
SLIDE 15 Courses:
- Logic (L)
- Knowledge Representation (K)
- Probability (P)
- Artificial Intelligence (A)
Data
- Must take at least one of
Probability or Logic.
- Probability is a prerequisite for AI.
- The prerequisite for KR is
either AI or Logic.
Constraints
Running Example
SLIDE 16
[Table: all 16 truth assignments to L, K, P, A (unstructured) vs. the 9 that satisfy the constraints (structured)]
Structured Space
7 out of 16 instantiations are impossible
- Must take at least one of
Probability (P) or Logic (L).
- Probability is a prerequisite
for AI (A).
- The prerequisite for KR (K) is
either AI or Logic.
SLIDE 17
[Table repeated from the previous slide: 16 unstructured instantiations vs. the 9 structured ones]
Boolean Constraints
7 out of 16 instantiations are impossible
SLIDE 18 Learning in Structured Spaces
Data + Constraints (Background Knowledge, Physics) → Learn → ML Model
Today's machine learning tools don't take knowledge as input!
SLIDE 19
Deep Learning with Logical Constraints
SLIDE 20 Deep Learning with Logical Knowledge
Data + Constraints → Learn → Deep Neural Network
Diagram: Input → Neural Network → Logical Constraint → Output
Output is probability vector p, not Boolean logic!
SLIDE 21 Semantic Loss
Q: How close is output p to satisfying the constraint α?
A: Semantic loss function L(α, p)
Axioms:
– If p is Boolean then L(p, p) = 0
– If p is Boolean and satisfies α then L(α, p) = 0
– If α implies β then L(α, p) ≥ L(β, p) (α is stricter)
– If α is equivalent to β then L(α, p) = L(β, p)
SEMANTIC Loss!
SLIDE 22 Semantic Loss: Definition
Theorem: the axioms imply a unique semantic loss (up to a multiplicative constant):
L(α, p) ∝ −log Σ_{x ⊨ α} Π_{i: x ⊨ Xᵢ} pᵢ Π_{i: x ⊨ ¬Xᵢ} (1 − pᵢ)
Each term is the probability of getting state x after flipping coins with probabilities p; the sum is the probability of satisfying α after flipping coins with probabilities p.
SLIDE 23 Example: Exactly-One
- Data must have some label
We agree this must be one of the 10 digits:
→ For 3 classes:
y₂ ∨ y₃ ∨ y₄   ¬y₂ ∨ ¬y₃   ¬y₃ ∨ ¬y₄   ¬y₂ ∨ ¬y₄
(Each term: probability that only yⱼ is true after flipping coins; summed: probability that exactly one y is true after flipping coins)
SLIDE 24 Semi-Supervised Learning
- Intuition: Unlabeled data must have some label
- Cf. entropy constraints, manifold learning
- Minimize exactly-one semantic loss on unlabeled data
Train with existing loss + w ∙ semantic loss
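A hedged PyTorch-style sketch of this objective for the exactly-one constraint, whose semantic loss has the closed form −log Σⱼ pⱼ Πₖ≠ⱼ (1 − pₖ); the network `net` and the weight w = 0.05 are illustrative assumptions, not values from the talk.

```python
import torch
import torch.nn.functional as F

def exactly_one_semantic_loss(p, eps=1e-12):
    """Semantic loss of the exactly-one constraint: -log of the probability
    that exactly one y_j comes up true when y_j is a coin with bias p_j."""
    # p_j * prod_{k != j} (1 - p_k), vectorized over a batch of rows
    per_class = p * torch.prod(1 - p, dim=-1, keepdim=True) / (1 - p).clamp_min(eps)
    return -torch.log(per_class.sum(dim=-1).clamp_min(eps)).mean()

def training_loss(net, x_labeled, y_labeled, x_unlabeled, w=0.05):
    """Existing loss on labeled data + w * semantic loss on unlabeled data."""
    supervised = F.cross_entropy(net(x_labeled), y_labeled)
    semantic = exactly_one_semantic_loss(torch.sigmoid(net(x_unlabeled)))
    return supervised + w * semantic
```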
SLIDE 25
MNIST Experiment
Competitive with state of the art in semi-supervised deep learning
SLIDE 26 FASHION Experiment
Outperforms Ladder Nets!
Same conclusion on CIFAR10
SLIDE 27 What about real constraints? Paths cf. Nature paper
Good variable assignments (represent a route): 184
Bad variable assignments (do not represent a route): 16,777,032
Unstructured probability space: 184 + 16,777,032 = 16,777,216 = 2²⁴
The structured space is easily encoded in logical constraints [Nishino et al.]
SLIDE 28 How to Compute Semantic Loss?
- In general: #P-hard
- With a logical circuit for α: Linear!
- Example: exactly-one constraint:
- Why? Decomposability and determinism!
L(α, p) = −log Σⱼ pⱼ Πₖ≠ⱼ (1 − pₖ)   (the circuit for the exactly-one constraint evaluates this sum in one bottom-up pass)
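Concretely, the bottom-up pass looks as follows. A hedged sketch reusing the toy Lit/Or/And classes from the earlier SAT example, assuming the circuit for α is decomposable, deterministic, and smooth.

```python
import math

def prob_satisfied(node, p):
    """Probability that α is satisfied when each variable v is an independent
    coin with bias p[v]; one bottom-up pass, linear in circuit size."""
    if isinstance(node, Lit):
        return p[node.var] if node.positive else 1.0 - p[node.var]
    if isinstance(node, Or):
        # determinism: children describe disjoint events, probabilities add
        return sum(prob_satisfied(c, p) for c in node.children)
    if isinstance(node, And):
        # decomposability: children are independent, probabilities multiply
        result = 1.0
        for c in node.children:
            result *= prob_satisfied(c, p)
        return result
    raise TypeError(node)

def semantic_loss(node, p):
    return -math.log(prob_satisfied(node, p))
```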
SLIDE 29 Predict Shortest Paths
Add semantic loss for path constraint
Metrics: Is the output a path? Are individual edge predictions correct? Is the prediction the shortest path? (This is the real task!)
(Same conclusion for predicting sushi preferences; see paper.)
SLIDE 30
Probabilistic Circuits
SLIDE 31
[Figure: logical circuit over L, K, P, A compiled from the course constraints]
Logical Circuits
Can we represent a distribution over the solutions to the constraint?
SLIDE 32
[Figure: the same logical circuit; each AND gate's input circuits mention disjoint variables]
Recall: Decomposability
AND gates have disjoint input circuits
SLIDE 33 Recall: Determinism
[Figure: the circuit evaluated on input L, K, P, A true, with literal leaves L, ¬L, K, ¬K, P, ¬P, A, ¬A and ⊥ nodes]
Input: L, K, P, A are true and ¬L, ¬K, ¬P, ¬A are false
Property: OR gates have at most one true input wire
SLIDE 34
[Figure: the circuit with a normalized parameter on each OR-gate input: 0.1/0.6/0.3 at the root and 0.6/0.4, 0.8/0.2, 0.25/0.75, 0.9/0.1 at inner gates; deterministic wires carry 1]
PSDD: Probabilistic SDD
Syntax: assign a normalized probability to each OR gate input
SLIDE 35
[Figure: the PSDD evaluated bottom-up, multiplying the parameters on the live wires]
Input: L, K, P, A are true
Pr(L, K, P, A) = 0.3 × 1 × 0.8 × 0.4 × 0.25 = 0.024
PSDD: Probabilistic SDD
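The evaluation above is a single bottom-up pass: by determinism, at most one input of each OR gate is live, so the probability is the product of the parameters on the live wires. A minimal sketch with a toy PSDD representation (class and field names are assumptions):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PLit:                 # leaf literal, e.g. L or ¬L
    var: str
    positive: bool = True

@dataclass
class PAnd:                 # decomposable AND gate
    children: Tuple

@dataclass
class POr:                  # deterministic OR gate with one parameter per input
    inputs: Tuple           # pairs (theta, child); thetas sum to 1 per gate

def psdd_prob(node, x):
    """Pr(x) for a complete assignment x = {var: bool}; linear in circuit size."""
    if isinstance(node, PLit):
        return 1.0 if x[node.var] == node.positive else 0.0
    if isinstance(node, PAnd):
        result = 1.0
        for c in node.children:
            result *= psdd_prob(c, x)
        return result
    if isinstance(node, POr):
        # determinism: at most one weighted term is nonzero
        return sum(theta * psdd_prob(c, x) for theta, c in node.inputs)
    raise TypeError(node)
```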
SLIDE 36
[Figure: the PSDD with each node annotated by the normalized distribution it defines over its variables]
Can read probabilistic independences off the circuit structure
Each node represents a normalized distribution!
SLIDE 37 Tractable for Probabilistic Inference
- Find the most-likely assignment to x given y (MAP)
(otherwise NP-hard)
- Compute conditional probabilities Pr(x|y)
(otherwise #P-hard)
- Sample from Pr(x|y)
- Algorithms linear in circuit size
(pass up, pass down, similar to backprop)
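For instance, conditional probabilities reduce to two upward passes: Pr(x|y) = Pr(x, y) / Pr(y), where unobserved literals evaluate to 1 so they are summed out (this relies on smoothness). A sketch reusing the toy PSDD classes above:

```python
def psdd_marginal(node, evidence):
    """Pr(evidence) for partial evidence {var: bool} in one upward pass."""
    if isinstance(node, PLit):
        if node.var not in evidence:
            return 1.0                     # unobserved: marginalized out
        return 1.0 if evidence[node.var] == node.positive else 0.0
    if isinstance(node, PAnd):
        result = 1.0
        for c in node.children:
            result *= psdd_marginal(c, evidence)
        return result
    if isinstance(node, POr):
        return sum(theta * psdd_marginal(c, evidence) for theta, c in node.inputs)
    raise TypeError(node)

def psdd_conditional(root, x, y):
    """Pr(x | y) = Pr(x, y) / Pr(y)."""
    return psdd_marginal(root, {**x, **y}) / psdd_marginal(root, y)
```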
SLIDE 38
[Figure: the PSDD with parameters annotated in plain language: "student takes course L", "student takes course P", "probability of course P given L"]
Parameters are Interpretable
Explainable AI DARPA Program
SLIDE 39
Learning Probabilistic Circuit Parameters
SLIDE 40 Learning Algorithms
Maximum likelihood from complete data
- One pass over data to estimate Pr(x|y)
- Where does the structure come from?
For now: simply compiled from constraint…
Not a lot to say: very easy!
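A hedged sketch of that single counting pass over the toy PSDD classes from above: each example follows its unique live input at every OR gate it reaches (determinism), and each parameter is the fraction of examples that took that wire. Smoothing of zero counts is omitted.

```python
from collections import defaultdict

def ml_parameters(root, data):
    """Closed-form max-likelihood estimates: theta = wire count / gate count."""
    gate_count, wire_count = defaultdict(int), defaultdict(int)

    def live(node, x):                     # does the sub-circuit accept x?
        if isinstance(node, PLit):
            return x[node.var] == node.positive
        if isinstance(node, PAnd):
            return all(live(c, x) for c in node.children)
        return any(live(c, x) for _, c in node.inputs)

    def route(node, x):
        if isinstance(node, PAnd):
            for c in node.children:
                route(c, x)
        elif isinstance(node, POr):
            gate_count[id(node)] += 1
            for i, (_, c) in enumerate(node.inputs):
                if live(c, x):             # determinism: at most one live input
                    wire_count[(id(node), i)] += 1
                    route(c, x)
                    break

    for x in data:                          # one pass over complete data
        route(root, x)
    return {w: wire_count[w] / gate_count[w[0]] for w in wire_count}
```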
SLIDE 41 Combinatorial Objects: Rankings
10 items: 3,628,800 rankings
20 items: 2,432,902,008,176,640,000 rankings
Ranking 1 (rank: sushi): 1 fatty tuna, 2 sea urchin, 3 salmon roe, 4 shrimp, 5 tuna, 6 squid, 7 tuna roll, 8 sea eel, 9 egg, 10 cucumber roll
Ranking 2 (rank: sushi): 1 shrimp, 2 sea urchin, 3 salmon roe, 4 fatty tuna, 5 tuna, 6 squid, 7 tuna roll, 8 sea eel, 9 egg, 10 cucumber roll
SLIDE 42 Combinatorial Objects: Rankings
Example ranking (rank: sushi): 1 fatty tuna, 2 sea urchin, 3 salmon roe, 4 shrimp, 5 tuna, 6 squid, 7 tuna roll, 8 sea eel, 9 egg, 10 cucumber roll
- Predict Boolean Variables:
A_ij: item i is at position j
- Each item i is assigned a unique position (n constraints)
- Each position j is assigned a unique item (n constraints; a CNF sketch follows)
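A small sketch (not from the talk) that generates these 2n exactly-one constraints as CNF clauses over DIMACS-style integer variables:

```python
from itertools import combinations

def exactly_one(lits):
    """CNF for 'exactly one of lits': at-least-one plus pairwise at-most-one."""
    return [list(lits)] + [[-a, -b] for a, b in combinations(lits, 2)]

def ranking_cnf(n):
    """Constraints over A[i][j] ('item i is at position j'), variables 1..n*n."""
    A = [[i * n + j + 1 for j in range(n)] for i in range(n)]
    clauses = []
    for i in range(n):
        clauses += exactly_one(A[i])                         # item i: unique position
    for j in range(n):
        clauses += exactly_one([A[i][j] for i in range(n)])  # position j: unique item
    return clauses
```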
SLIDE 43 Learning Preference Distributions
Special-purpose distribution: mixture-of-Mallows
– number of components from 1 to 20
– EM with 10 random seeds
– implementation of Lu & Boutilier
PSDD: circuit structure does not even depend on the data!
SLIDE 44
Learning Probabilistic Circuit Structure
SLIDE 45
Structure Learning Primitive
SLIDE 46
Structure Learning Primitive
Primitives maintain PSDD properties and constraint of root!
SLIDE 47 LearnPSDD Algorithm
1. (Vtree learning)*
2. Construct the most naïve PSDD
3. LearnPSDD (search for better structure)
Search loop: generate candidate operations → simulate them → execute the best one
Works with or without logical constraint.
SLIDE 48
PSDDs …are Sum-Product Networks …are Arithmetic Circuits
[Figure: a PSDD OR gate with parameters θ₁ … θₙ over inputs p₁·s₁ … pₙ·sₙ maps to an AC sum node Σᵢ θᵢ · pᵢ · sᵢ]
SLIDE 49
Experiments on 20 datasets
Compared to SPN learners, LearnPSDD achieves comparable performance with smaller circuits
SLIDE 50 Learn Mixtures of PSDDs
State of the art
Q: “Help! I need to learn a discrete probability distribution…” A: Learn mixture of PSDDs! Strongly outperforms
- Bayesian network learners
- Markov network learners
Competitive with
- SPN learners
- Cutset network learners
SLIDE 51
Logistic Circuits
SLIDE 52 What if I only want to classify Y?
Pr(𝑍, 𝐵, 𝐶, 𝐷, 𝐸)
SLIDE 53 Logistic Circuits
Represents Pr(Z | B, C, D, E)
- Take all 'hot' wires
- Sum their weights
- Push through logistic function
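A minimal sketch (toy representation, assumed names) of that three-step classification rule; a wire is 'hot' when it lies on the unique live path selected by the input:

```python
import math
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CLit:
    var: str
    positive: bool = True

@dataclass
class CAnd:
    children: Tuple

@dataclass
class COr:
    inputs: Tuple            # pairs (weight, child): one weight per input wire

def live(node, x):
    """Does the sub-circuit evaluate to true under complete input x?"""
    if isinstance(node, CLit):
        return x[node.var] == node.positive
    if isinstance(node, CAnd):
        return all(live(c, x) for c in node.children)
    return any(live(c, x) for _, c in node.inputs)

def predict(root, x):
    """Pr(Z = 1 | x): sum the weights of the hot wires, apply the logistic."""
    total = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, CAnd):
            stack.extend(node.children)
        elif isinstance(node, COr):
            for weight, child in node.inputs:
                if live(child, x):          # determinism: one hot wire per OR gate
                    total += weight
                    stack.append(child)
                    break
    return 1.0 / (1.0 + math.exp(-total))
```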
SLIDE 54 Logistic vs. Probabilistic Circuits
Probabilities become log-odds: the logistic circuit represents Pr(Z | B, C, D, E) rather than the joint Pr(Z, B, C, D, E)
SLIDE 55 Parameter Learning
Reduce to logistic regression:
Features associated with each wire “Global Circuit Flow” features
Learning parameters θ is convex optimization!
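Since Pr(Z | x) is the logistic of a weighted sum of 0/1 wire features, fitting θ is plain logistic regression on a precomputed feature matrix. A hedged scikit-learn sketch, where X_flow[i, k] = 1 iff wire k is hot for example i (computed, e.g., with the live-wire walk from the previous sketch):

```python
from sklearn.linear_model import LogisticRegression

def fit_wire_weights(X_flow, z):
    """X_flow: (num_examples, num_wires) 0/1 'circuit flow' features;
    z: 0/1 class labels. Returns one learned weight per wire."""
    model = LogisticRegression(fit_intercept=False)   # convex; global optimum
    model.fit(X_flow, z)
    return model.coef_.ravel()
```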
SLIDE 56 Logistic Circuit Structure Learning
Search loop: generate candidate operations → calculate gradient variance → execute the best operation
Similar to LearnPSDD structure learning
SLIDE 57
Comparable Accuracy with Neural Nets
SLIDE 58
Significantly Smaller in Size
Better Data Efficiency
SLIDE 59
Interpretable?
SLIDE 60
Reasoning with Probabilistic Circuits
SLIDE 61 Compilation target for probabilistic reasoning
Bayesian networks, factor graphs, probabilistic databases, relational Bayesian networks, probabilistic programs, Markov logic → Probabilistic Circuits
SLIDE 62
Compilation for Prob. Inference
SLIDE 63 Collapsed Compilation
To sample a circuit (sketched in code below):
1. Compile bottom up until you reach the size limit
2. Pick a variable you want to sample
3. Sample it according to its marginal distribution in the current circuit
4. Condition on the sampled value
5. (Repeat)
Asymptotically unbiased importance sampler
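A loose Python sketch of that loop; all helpers (compile_factor, conjoin, circuit_size, marginal, condition, TRUE_CIRCUIT) are hypothetical stand-ins for a knowledge-compilation library, not APIs from the paper:

```python
import random

def collapsed_sample(factors, size_limit, pick_var):
    """One collapsed sample: a circuit over the remaining variables plus the
    proposal probability of the values sampled along the way."""
    circuit, q = TRUE_CIRCUIT, 1.0                           # hypothetical helpers
    for factor in factors:
        circuit = conjoin(circuit, compile_factor(factor))   # 1. compile bottom up
        while circuit_size(circuit) > size_limit:
            var = pick_var(circuit)                          # 2. pick a variable
            p = marginal(circuit, var)                       # 3. marginal in current circuit
            value = random.random() < p
            q *= p if value else 1.0 - p                     # track proposal probability
            circuit = condition(circuit, var, value)         # 4. condition; 5. repeat
    return circuit, q                 # q determines the importance weight
```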
SLIDE 64
Circuits + importance weights approximate any query
SLIDE 65
Experiments
Competitive with state-of-the-art approximate inference in graphical models. Outperforms it on several benchmarks!
SLIDE 66
Reasoning About Classifiers
SLIDE 67 Classifier Trimming
[Figure: classifier β over features F1, F2, F3, F4 with threshold U is trimmed to classifier γ over F2, F3 with threshold U′]
Trim features while maintaining classification behavior
SLIDE 68
How to measure Similarity?
What is the expected probability that classifier β will agree with its trimming γ?
“Expected Classification Agreement”
SLIDE 69 Solving PP^PP problems with constrained SDDs
[Figure: a constrained SDD f over L, K, P, A; in steps 1-3 it is conditioned on M ∧ L to obtain g|(M ∧ L)]
SLIDE 70
SDD method faster than traditional jointree inference
SLIDE 71
Classification agreement and accuracy
Higher agreement tends to yield higher accuracy
An additional dimension for feature selection
SLIDE 72 Conclusions
Statistical ML ("Probability") + Symbolic AI ("Logic") + Connectionism ("Deep")
Circuits
SLIDE 73 Questions?
PSDD with 15,000 nodes
SLIDE 74 References
- Doga Kisa, Guy Van den Broeck, Arthur Choi and Adnan Darwiche. Probabilistic Sentential Decision Diagrams, In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), 2014.
- Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Tractable Learning for Structured Probability Spaces: A Case Study in Learning Preference Distributions, In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
- Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Probability Distributions over Structured Spaces, In Proceedings of the AAAI Spring Symposium on KRR, 2015.
- Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche and Guy Van den Broeck. Tractable Learning for Complex Probability Queries, In Advances in Neural Information Processing Systems 28 (NIPS), 2015.
- YooJung Choi, Adnan Darwiche and Guy Van den Broeck. Optimal Feature Selection for Decision Robustness in Bayesian Networks, In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.
SLIDE 75 References
- Yitao Liang, Jessa Bekker and Guy Van den Broeck. Learning the Structure of Probabilistic Sentential Decision Diagrams, In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
- Yitao Liang and Guy Van den Broeck. Towards Compact Interpretable Models: Shrinking of Learned Probabilistic Sentential Decision Diagrams, In IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 2017.
- YooJung Choi and Guy Van den Broeck. On Robust Trimming of Bayesian Network Classifiers, In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018.
- Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang and Guy Van den Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge, In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
- Tal Friedman and Guy Van den Broeck. Approximate Knowledge Compilation by Online Collapsed Importance Sampling, In Advances in Neural Information Processing Systems 31 (NIPS), 2018.
- Yitao Liang and Guy Van den Broeck. Learning Logistic Circuits, In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI), 2019.