SLIDE 1

Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning

Guy Van den Broeck

HRL/ACTIONS @ KR Oct 28, 2018

SLIDE 2

Foundation: Logical Circuit Languages

SLIDE 3

Negation Normal Form Circuits

[Darwiche 2002]

Δ = (sun ∧ rain ⇒ rainbow)

SLIDE 4

Decomposable Circuits

Decomposable

[Darwiche 2002]

SLIDE 5

Tractable for Logical Inference

  • Is there a solution? (SAT)

– SAT(𝛽 ∨ 𝛾) iff SAT(𝛽) or SAT(𝛾) (always)
– SAT(𝛽 ∧ 𝛾) iff SAT(𝛽) and SAT(𝛾) (decomposable)

  • How many solutions are there? (#SAT)
  • Complexity linear in circuit size (see the counting sketch below)
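Why those two properties make counting linear: a decomposable AND multiplies its children's model counts, and a deterministic OR adds them (padding each child up to the gate's variable set). A minimal sketch on the running constraint; the circuit encoding below is my own illustration, not a canonical compilation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                      # "lit", "and", or "or"
    var: str = ""                  # variable name, for literals
    neg: bool = False              # negated literal?
    children: List["Node"] = field(default_factory=list)
    vars: frozenset = frozenset()  # variables mentioned at or below this node

def lit(v, neg=False):
    return Node("lit", var=v, neg=neg, vars=frozenset([v]))

def gate(kind, *children):
    return Node(kind, children=list(children),
                vars=frozenset().union(*(c.vars for c in children)))

def model_count(n):
    """#SAT over n.vars, assuming decomposability and determinism."""
    if n.kind == "lit":
        return 1                   # one satisfying assignment of {var}
    if n.kind == "and":            # decomposable: children share no variables
        count = 1
        for child in n.children:
            count *= model_count(child)
        return count
    # deterministic OR: children are mutually exclusive, so counts add;
    # pad each child's count up to the OR gate's full variable set
    return sum(model_count(c) * 2 ** len(n.vars - c.vars) for c in n.children)

# Δ = (sun ∧ rain ⇒ rainbow), written as three mutually exclusive cases:
# ¬sun  ∨  (sun ∧ ¬rain)  ∨  (sun ∧ rain ∧ rainbow)
delta = gate("or",
             lit("sun", neg=True),
             gate("and", lit("sun"), lit("rain", neg=True)),
             gate("and", lit("sun"), lit("rain"), lit("rainbow")))
print(model_count(delta))  # 7 of the 2^3 = 8 assignments satisfy Δ
```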

SLIDE 6

Deterministic Circuits

Deterministic

[Darwiche 2002]

SLIDE 7

How many solutions are there? (#SAT)

SLIDE 8

How many solutions are there? (#SAT)

Arithmetic Circuit

SLIDE 9

Tractable for Logical Inference

  • Is there a solution? (SAT)
  • How many solutions are there? (#SAT)
  • Stricter languages (e.g., BDD, SDD):

– Equivalence checking
– Conjoin/disjoin/negate circuits

  • Complexity linear in circuit size
  • Compilation into circuit language by either

– top-down (↓): exhaustive SAT solver
– bottom-up (↑): conjoin/disjoin/negate


SLIDE 10

Learning with Logical Constraints

SLIDE 11

Motivation: Video

[Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos.]

SLIDE 12

Motivation: Robotics

[Wong, L. L., Kaelbling, L. P., & Lozano-Perez, T., Collision-free state estimation. ICRA 2012]

SLIDE 13

Motivation: Language

  • Non-local dependencies:

At least one verb in each sentence

  • Sentence compression

If a modifier is kept, its subject is also kept

  • Information extraction
  • Semantic role labeling

… and many more!

[Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge],…, [Chang, M. W., Ratinov, L., & Roth, D. (2012). Structured learning with constrained conditional models.], [https://en.wikipedia.org/wiki/Constrained_conditional_model]

SLIDE 14

Motivation: Deep Learning

[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al.. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]

SLIDE 15

Courses:

  • Logic (L)
  • Knowledge Representation (K)
  • Probability (P)
  • Artificial Intelligence (A)

Data

  • Must take at least one of Probability or Logic.

  • Probability is a prerequisite for AI.
  • The prerequisite for KR is either AI or Logic.

Constraints

Running Example

SLIDE 16

[Figure: the 16 instantiations of L, K, P, A as a truth table; the unstructured space allows all of them, the structured space crosses out the impossible ones]

Structured Space

7 out of 16 instantiations are impossible

  • Must take at least one of Probability (P) or Logic (L).
  • Probability is a prerequisite for AI (A).
  • The prerequisite for KR (K) is either AI or Logic.

SLIDE 17

[Figure: the same truth table; the Boolean constraints rule out the 7 impossible instantiations]

Boolean Constraints

7 out of 16 instantiations are impossible

SLIDE 18

Learning in Structured Spaces

Data + Constraints (background knowledge, physics) → Learn → ML Model

Today's machine learning tools don't take knowledge as input!

SLIDE 19

Deep Learning with Logical Constraints

SLIDE 20

Deep Learning with Logical Knowledge

Data + Constraints → Learn → Deep Neural Network

Pipeline: Input → Neural Network → Output, subject to a Logical Constraint

Output is a probability vector p, not Boolean logic!

SLIDE 21

Semantic Loss

Q: How close is output p to satisfying the constraint?
A: Semantic loss function L(α,p)

  • Axioms, for example:

– If p is Boolean then L(p,p) = 0
– If α implies β then L(α,p) ≥ L(β,p) (α more strict)

  • Properties:

– If α is equivalent to β then L(α,p) = L(β,p)
– If p is Boolean and satisfies α then L(α,p) = 0

SEMANTIC Loss!

SLIDE 22

Semantic Loss: Definition

Theorem: Axioms imply unique semantic loss:

L(α, p) ∝ −log Σ_{x ⊨ α} Π_{i: x ⊨ Xᵢ} pᵢ · Π_{i: x ⊨ ¬Xᵢ} (1 − pᵢ)

(Each product is the probability of getting state x after flipping coins with probabilities p; the sum over all x ⊨ α is the probability of satisfying α after flipping those coins.)

SLIDE 23

Example: Exactly-One

  • Data must have some label

We agree this must be one of the 10 digits:

  • Exactly-one constraint

→ For 3 classes:

  • Semantic loss:

y₂ ∨ y₃ ∨ y₄,  ¬y₂ ∨ ¬y₃,  ¬y₃ ∨ ¬y₄,  ¬y₂ ∨ ¬y₄

The loss is −log Pr(exactly one y true after flipping coins with probabilities p); each summand covers the case where only one yⱼ comes up true. (A brute-force sketch follows.)
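A brute-force sketch of the definition (my own illustration, not the paper's implementation): enumerate all states, keep those that satisfy the constraint, and sum their probabilities under independent coin flips with probabilities p:

```python
import itertools
import math

def semantic_loss(p, satisfies):
    """L(α, p) = -log Σ_{x ⊨ α} Π_i p_i^{x_i} (1 - p_i)^{1 - x_i}.
    Exponential enumeration: fine for illustration, #P-hard in general."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=len(p)):
        if satisfies(x):
            prob = 1.0
            for pi, xi in zip(p, x):
                prob *= pi if xi else (1.0 - pi)
            total += prob
    return -math.log(total)

exactly_one = lambda x: sum(x) == 1

p = [0.9, 0.3, 0.2]                                  # network output, 3 classes
print(semantic_loss(p, exactly_one))                 # ~0.61: p is nearly one-hot
print(semantic_loss([0.5, 0.5, 0.5], exactly_one))   # ~0.98: higher loss
```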

SLIDE 24

Semi-Supervised Learning

  • Intuition: Unlabeled data must have some label
  • Minimize exactly-one semantic loss on unlabeled data

Train with: existing loss + w ∙ semantic loss
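For the exactly-one constraint the sum collapses to a closed form, so the slide's objective can be sketched directly. A hedged PyTorch sketch; the weight value and the stand-in network outputs are placeholders, not the paper's settings:

```python
import torch

def exactly_one_semantic_loss(p, eps=1e-12):
    """Closed form for exactly-one: L = -log Σ_j p_j Π_{k≠j} (1 - p_k)."""
    q = (1.0 - p).clamp_min(eps)
    prod_all = q.prod(dim=-1, keepdim=True)   # Π_k (1 - p_k)
    sat = (prod_all * p / q).sum(dim=-1)      # Σ_j p_j Π_{k≠j} (1 - p_k)
    return -torch.log(sat.clamp_min(eps))

# objective from the slide: existing loss + w * semantic loss, with the
# semantic term on the unlabeled batch; w is a small hyperparameter
w = 0.0005
p_unlabeled = torch.sigmoid(torch.randn(32, 10))   # stand-in network outputs
semantic = exactly_one_semantic_loss(p_unlabeled).mean()
# total_loss = supervised_loss + w * semantic
```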

SLIDE 25

MNIST Experiment

Competitive with state of the art in semi-supervised deep learning

SLIDE 26

FASHION Experiment

Outperforms Ladder Nets!

Same conclusion on CIFAR10

SLIDE 27

What about real constraints? Paths, cf. the Nature paper [Graves et al. 2016]

Good variable assignments (represent a route): 184
Bad variable assignments (do not represent a route): 16,777,032

Unstructured probability space: 184 + 16,777,032 = 2²⁴
The structured space is easily encoded in logical constraints [Nishino et al.]

SLIDE 28

How to Compute Semantic Loss?

  • In general: #P-hard
  • With a logical circuit for α: Linear!
  • Example: exactly-one constraint:
  • Why? Decomposability and determinism!

L(α, p) = L(circuit for α, p) = −log(the arithmetic circuit's evaluation at p)

SLIDE 29

Predict Shortest Paths

Add semantic loss for path constraint

– Is the output a path?
– Are individual edge predictions correct?
– Is the prediction the shortest path? This is the real task!
(Same conclusion for predicting sushi preferences; see the paper.)

SLIDE 30

Probabilistic Circuits

SLIDE 31
[Figure: a logical circuit over L, K, P, A: literals at the leaves, alternating AND and OR gates]

Logical Circuits

Can we represent a distribution over the solutions to the constraint?
SLIDE 32
[Figure: the same logical circuit over L, K, P, A]

Recall: Decomposability

AND gates have disjoint input circuits

SLIDE 33

Recall: Determinism

[Figure: the circuit evaluated on input L, K, P, A = true; wires carrying value 1 are highlighted]

Input: L, K, P, A are true and ¬L, ¬K, ¬P, ¬A are false
Property: OR gates have at most one true input wire

SLIDE 34

[Figure: the circuit with a normalized probability on each OR-gate input wire, e.g. 0.6/0.4, 0.8/0.2, 0.25/0.75, 0.9/0.1, and 0.1/0.6/0.3 at the root]

PSDD: Probabilistic SDD

Syntax: assign a normalized probability to each OR gate input

SLIDE 35

[Figure: the PSDD evaluated on input L, K, P, A = true; the parameters on the true wires are multiplied]

Input: L, K, P, A are true

Pr(L,K,P,A) = 0.3 × 1 × 0.8 × 0.4 × 0.25 = 0.024

PSDD: Probabilistic SDD
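A minimal evaluator sketch showing that multiplication pattern. The circuit below is a hand-made stand-in carrying the slide's parameters on the relevant branch; it is not the exact circuit from the figure:

```python
# Minimal PSDD evaluation on a complete assignment: literals evaluate to 0/1,
# AND multiplies its children, OR sums theta_i * child_i (determinism makes
# at most one term nonzero).
def psdd_pr(node, assignment):
    kind = node[0]
    if kind == "lit":                      # ("lit", var, is_positive)
        _, var, pos = node
        return 1.0 if assignment[var] == pos else 0.0
    if kind == "and":                      # ("and", child, child, ...)
        p = 1.0
        for child in node[1:]:
            p *= psdd_pr(child, assignment)
        return p
    # ("or", (theta, child), (theta, child), ...)
    return sum(theta * psdd_pr(child, assignment) for theta, child in node[1:])

circuit = ("or",
    (0.3, ("and",
        ("lit", "L", True),
        ("or", (0.8, ("lit", "K", True)), (0.2, ("lit", "K", False))),
        ("or",
            (0.4, ("and",
                ("lit", "P", True),
                ("or", (0.25, ("lit", "A", True)), (0.75, ("lit", "A", False))))),
            (0.6, ("and", ("lit", "P", False), ("lit", "A", False)))))),
    (0.7, ("and",
        ("lit", "L", False), ("lit", "K", False),
        ("lit", "P", True), ("lit", "A", True))))

x = {"L": True, "K": True, "P": True, "A": True}
print(psdd_pr(circuit, x))   # 0.3 * 0.8 * 0.4 * 0.25 = 0.024, as on the slide
```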

SLIDE 36
[Figure: the full PSDD with all parameters shown]

Can read probabilistic independences off the circuit structure

Each node represents a normalized distribution!

SLIDE 37

Tractable for Probabilistic Inference

  • MAP inference:

Find most-likely assignment to x given y

(otherwise NP-hard)

  • Computing conditional probabilities Pr(x|y)

(otherwise #P-hard)

  • Sample from Pr(x|y)
  • Algorithms linear in circuit size

(pass up, pass down, similar to backprop)

SLIDE 38
[Figure: the PSDD with individual parameters highlighted: the probability that a student takes course L, that a student takes course P, and the probability of course P given L]

Parameters are Interpretable

Explainable AI DARPA Program

SLIDE 39

Learning Probabilistic Circuit Parameters

SLIDE 40

Learning Algorithms

  • Closed form: max likelihood from complete data

  • One pass over data to estimate Pr(x|y)
  • Where does the structure come from?

For now: simply compiled from constraint…

Not a lot to say: very easy! (A toy sketch follows.)
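A toy sketch of why this is easy: each OR-gate parameter is a relative frequency, estimable in a single pass over the data. The conditioning event here ("records where L holds") is a simplified stand-in for the circuit's contexts:

```python
# complete data over the running example's variables (made-up records)
data = [
    {"L": 1, "K": 1, "P": 1, "A": 1},
    {"L": 1, "K": 0, "P": 1, "A": 0},
    {"L": 0, "K": 0, "P": 1, "A": 1},
    {"L": 1, "K": 1, "P": 0, "A": 0},
]

# parameter for the wire "K is true" inside the context "L is true":
# count(L and K) / count(L) -- closed form, one pass over the data
ctx = [r for r in data if r["L"] == 1]
theta_K_given_L = sum(r["K"] for r in ctx) / len(ctx)
print(theta_K_given_L)   # 2/3
```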

SLIDE 41

Combinatorial Objects: Rankings

10 items: 3,628,800 rankings
20 items: 2,432,902,008,176,640,000 rankings

Two example rankings (rank: sushi):
1: fatty tuna, 2: sea urchin, 3: salmon roe, 4: shrimp, 5: tuna, 6: squid, 7: tuna roll, 8: sea eel, 9: egg, 10: cucumber roll
1: shrimp, 2: sea urchin, 3: salmon roe, 4: fatty tuna, 5: tuna, 6: squid, 7: tuna roll, 8: sea eel, 9: egg, 10: cucumber roll

SLIDE 42

Combinatorial Objects: Rankings

Example ranking (rank: sushi): 1: fatty tuna, 2: sea urchin, 3: salmon roe, 4: shrimp, 5: tuna, 6: squid, 7: tuna roll, 8: sea eel, 9: egg, 10: cucumber roll

  • Predict Boolean Variables:

Aij - item i at position j

  • Constraints:

– each item i is assigned a unique position (n constraints)
– each position j is assigned a unique item (n constraints)
(a CNF sketch follows)
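One way those 2n exactly-one constraints could be emitted as CNF clauses over indicator variables A[i][j] (pairwise encoding; the variable numbering is mine):

```python
def exactly_one(vs):
    """CNF clauses forcing exactly one of the given literals to be true."""
    clauses = [list(vs)]                               # at least one
    clauses += [[-a, -b] for i, a in enumerate(vs)     # at most one (pairwise)
                for b in vs[i + 1:]]
    return clauses

n = 3
A = [[i * n + j + 1 for j in range(n)] for i in range(n)]  # DIMACS-style ids
cnf = []
for i in range(n):
    cnf += exactly_one(A[i])                         # item i: unique position
for j in range(n):
    cnf += exactly_one([A[i][j] for i in range(n)])  # position j: unique item
print(len(cnf))   # 6 "at least one" + 18 pairwise "at most one" = 24 for n=3
```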

SLIDE 43

Learning Preference Distributions

Special-purpose distribution: Mixture-of-Mallows
– number of components from 1 to 20
– EM with 10 random seeds
– implementation of Lu & Boutilier

PSDD: circuit structure does not even depend on the data!

SLIDE 44

Learning Probabilistic Circuit Structure

SLIDE 45

Structure Learning Primitive

SLIDE 46

Structure Learning Primitive

The primitives maintain the PSDD properties and the constraint of the root!

SLIDE 47

LearnPSDD Algorithm

1. (Vtree learning)*
2. Construct the most naïve PSDD
3. LearnPSDD (search for better structure), repeatedly:
– Generate candidate operations
– Simulate operations
– Execute the best operation

Works with or without logical constraint.

SLIDE 48

PSDDs …are Sum-Product Networks …are Arithmetic Circuits

[Figure: a PSDD OR gate with inputs (p1, s1), …, (pn, sn) and parameters θ1, …, θn maps to an AC '+' node over '×' nodes θi · pi · si]

SLIDE 49

Experiments on 20 datasets

Compare with O-SPN: smaller size in 14, better LL in 11, win on both in 6
Compare with L-SPN: smaller size in 14, better LL in 6, win on both in 2

Compared to SPN learners, LearnPSDD gives comparable performance yet smaller size

SLIDE 50

Learn Mixtures of PSDDs

Q: “Help! I need to learn a discrete probability distribution…”
A: Learn a mixture of PSDDs! State of the art on 6 datasets!

Strongly outperforms

  • Bayesian network learners
  • Markov network learners

Competitive with

  • SPN learners
  • Cutset network learners
SLIDE 51

Logistic Circuits

SLIDE 52

What if I only want to classify Y?

Pr(Y, B, C, D, E)

SLIDE 53

Logistic Circuits

Represents Pr(Y | B, C, D, E)

  • Take all “hot” wires
  • Sum their weights
  • Push through the logistic function (sketch below)
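The three bullets translate almost directly into code. A hedged sketch: which wires are "hot" for an input is determined by evaluating the logical circuit, which is omitted here, so the hot-wire set and weights are toy placeholders:

```python
import math

def predict(wire_weights, hot_wires):
    """Pr(Y | x) = sigmoid( sum of weights of wires that are 'hot' for x )."""
    z = sum(wire_weights[w] for w in hot_wires)
    return 1.0 / (1.0 + math.exp(-z))

# toy numbers: weights are learned log-odds, one per wire
wire_weights = {"w1": 1.2, "w7": -0.4, "w9": 2.1}
print(predict(wire_weights, ["w1", "w9"]))  # sigmoid(3.3) ≈ 0.96
```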
SLIDE 54

Logistic vs. Probabilistic Circuits

Probabilities become log-odds.
Probabilistic circuit: Pr(Y, B, C, D, E). Logistic circuit: Pr(Y | B, C, D, E).

SLIDE 55

Parameter Learning

Reduce to logistic regression:

Features associated with each wire: “global circuit flow” features

Learning parameters θ is convex optimization!
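Since every example induces a flow value on each wire, the wire weights can be fit with any off-the-shelf logistic regression solver. A sketch with scikit-learn; random placeholders stand in for the real circuit-flow feature extractor:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X_flows[i, w] = flow of wire w for example i (computed from the circuit;
# random placeholders here stand in for the real feature extractor)
rng = np.random.default_rng(0)
X_flows = rng.integers(0, 2, size=(200, 30)).astype(float)
y = rng.integers(0, 2, size=200)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_flows, y)
wire_weights = clf.coef_[0]   # one learned log-odds weight per wire
```

The convexity claim carries over directly: with the flows fixed, the model is linear in θ.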

SLIDE 56

Logistic Circuit Structure Learning

Generate candidate operations → Calculate gradient variance → Execute the best operation

Similar to LearnPSDD structure learning

SLIDE 57

Comparable Accuracy with Neural Nets

SLIDE 58

Significantly Smaller in Size, Better Data Efficiency

SLIDE 59

Reasoning with Probabilistic Circuits

SLIDE 60

Compilation target for probabilistic reasoning

Bayesian networks, factor graphs, probabilistic databases, relational Bayesian networks, probabilistic programs, Markov Logic → Probabilistic Circuits

SLIDE 61

Compilation for Prob. Inference

SLIDE 62

Collapsed Compilation

To sample a circuit:

1. Compile bottom-up until you reach the size limit
2. Pick a variable you want to sample
3. Sample it according to its marginal distribution in the current circuit
4. Condition on the sampled value
5. (Repeat)

Asymptotically unbiased importance sampler. (A toy sketch follows.)
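Setting the circuit machinery aside, the estimator underneath is collapsed importance sampling: sample a few variables, sum out ("collapse") the rest exactly, and correct with an importance weight. A runnable toy version on an explicit distribution; in the real algorithm the proposal is the current circuit's marginal rather than uniform, and the exact sum is done by the compiled circuit:

```python
import itertools
import random

# toy unnormalized distribution over 6 binary variables
def score(x):
    return 1.0 + sum(x) + 2.0 * (x[0] == x[3])

def collapsed_estimate(n_samples=2000, n_sampled_vars=3, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # sample a few variables from a proposal (uniform here),
        # keeping the importance weight 1/q = 2^k
        xs = [rng.random() < 0.5 for _ in range(n_sampled_vars)]
        weight = 2.0 ** n_sampled_vars
        # "collapse" the rest: sum them out exactly (the circuit's job)
        rest = itertools.product([0, 1], repeat=6 - n_sampled_vars)
        total += weight * sum(score(list(xs) + list(r)) for r in rest)
    return total / n_samples   # unbiased estimate of the partition function

exact = sum(score(list(x)) for x in itertools.product([0, 1], repeat=6))
print(exact, collapsed_estimate())   # 320.0 vs. an estimate close to it
```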

SLIDE 63

Circuits + importance weights approximate any query

SLIDE 64

Experiments

Competitive with state-of-the-art approximate inference in graphical models. Outperforms it on several benchmarks!

SLIDE 65

Reasoning About Classifiers

SLIDE 66

Classifier Trimming

[Figure: classifier β uses features F1, F2, F3, F4 with threshold U; its trimming, classifier γ, uses only F2, F3 with adjusted threshold U′]

Trim features while maintaining classification behavior

SLIDE 67

How to measure Similarity?

What is the expected probability that a classifier α will agree with its trimming β?

“Expected Classification Agreement”
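Brute force over a small feature space makes the quantity concrete. The distribution and classifiers below are toys of my own; the paper computes this expectation with circuits rather than enumeration:

```python
import itertools

def agreement(classify_full, classify_trimmed, pr):
    """E over features of [full classifier agrees with its trimming]."""
    return sum(pr(f) for f in itertools.product([0, 1], repeat=4)
               if classify_full(f) == classify_trimmed(f))

pr = lambda f: 0.5 ** 4                            # uniform over F1..F4
full = lambda f: f[0] + f[1] + f[2] + f[3] >= 2    # threshold U, all features
trimmed = lambda f: f[1] + f[2] >= 1               # threshold U', F2 and F3 only
print(agreement(full, trimmed, pr))                # 13/16 = 0.8125
```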

SLIDE 68

Solving PP^PP problems with constrained SDDs

[Figure: a constrained SDD over L, K, P, A; a function f is evaluated under conditioning, e.g. g | (M ∧ L)]

SLIDE 69

SDD method faster than traditional jointree inference

SLIDE 70

Classification agreement and accuracy

Higher agreement tends to yield higher accuracy
An additional dimension for feature selection

SLIDE 71

Conclusions

Statistical ML (“Probability”) + Symbolic AI (“Logic”) + Connectionism (“Deep”) → Circuits

SLIDE 72

Questions?

PSDD with 15,000 nodes

SLIDE 73

References

  • Doga Kisa, Guy Van den Broeck, Arthur Choi and Adnan Darwiche. Probabilistic Sentential Decision Diagrams. In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), 2014.
  • Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Tractable Learning for Structured Probability Spaces: A Case Study in Learning Preference Distributions. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
  • Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Probability Distributions over Structured Spaces. In Proceedings of the AAAI Spring Symposium on KRR, 2015.
  • Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche and Guy Van den Broeck. Tractable Learning for Complex Probability Queries. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.
  • YooJung Choi, Adnan Darwiche and Guy Van den Broeck. Optimal Feature Selection for Decision Robustness in Bayesian Networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.

SLIDE 74

References

  • Yitao Liang, Jessa Bekker and Guy Van den Broeck. Learning the Structure of Probabilistic Sentential Decision Diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
  • Yitao Liang and Guy Van den Broeck. Towards Compact Interpretable Models: Shrinking of Learned Probabilistic Sentential Decision Diagrams. In IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 2017.
  • YooJung Choi and Guy Van den Broeck. On Robust Trimming of Bayesian Network Classifiers. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018.
  • Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang and Guy Van den Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
  • Yitao Liang and Guy Van den Broeck. Learning Logistic Circuits. In Proceedings of the UAI 2018 Workshop: Uncertainty in Deep Learning, 2018.
  • Tal Friedman and Guy Van den Broeck. Approximate Knowledge Compilation by Online Collapsed Importance Sampling. In Advances in Neural Information Processing Systems 31 (NIPS), 2018.