SLIDE 1

Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning

Guy Van den Broeck

RelationalAI ArrowCon Feb 5, 2019

SLIDE 2

Which method to choose?

Classical AI Methods:

[Figure: a small decision tree with nodes "Hungry?", "$25?", "Restaurant?", "Sleep?"]

  • Clear modeling assumptions
  • Well understood

Neural Networks:

  • "Black box"
  • Good performance in image classification
SLIDE 3

Outline

  • Adding knowledge to deep learning
  • Probabilistic circuits
  • Logistic circuits for image classification
SLIDE 4

Outline

  • Adding knowledge to deep learning
  • Probabilistic circuits
  • Logistic circuits for image classification
SLIDE 5

Motivation: Video

[Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos.]

SLIDE 6

Motivation: Robotics

[Wong, L. L., Kaelbling, L. P., & Lozano-Perez, T., Collision-free state estimation. ICRA 2012]

SLIDE 7

Motivation: Language

  • Non-local dependencies:

At least one verb in each sentence

  • Sentence compression

If a modifier is kept, its subject is also kept

  • Information extraction
  • Semantic role labeling

… and many more!

[Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge],…, [Chang, M. W., Ratinov, L., & Roth, D. (2012). Structured learning with constrained conditional models.], [https://en.wikipedia.org/wiki/Constrained_conditional_model]

SLIDE 8

Motivation: Deep Learning

[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al.. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]

SLIDE 9

Running Example

Courses:

  • Logic (L)
  • Knowledge Representation (K)
  • Probability (P)
  • Artificial Intelligence (A)

Data: which courses each student takes.

Constraints:

  • Must take at least one of Probability or Logic.
  • Probability is a prerequisite for AI.
  • The prerequisite for KR is either AI or Logic.
SLIDE 10

Structured Space

[Two tables on the slide: the unstructured space lists all 16 instantiations of L, K, P, A; the structured space strikes out the impossible ones.]

7 out of 16 instantiations are impossible (see the brute-force check after the constraints below):

L K P A   possible?
0 0 0 0   no
0 0 0 1   no
0 0 1 0   yes
0 0 1 1   yes
0 1 0 0   no
0 1 0 1   no
0 1 1 0   no
0 1 1 1   yes
1 0 0 0   yes
1 0 0 1   no
1 0 1 0   yes
1 0 1 1   yes
1 1 0 0   yes
1 1 0 1   no
1 1 1 0   yes
1 1 1 1   yes

  • Must take at least one of Probability (P) or Logic (L).
  • Probability is a prerequisite for AI (A).
  • The prerequisite for KR (K) is either AI or Logic.
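As a sanity check on the "7 out of 16" claim, here is a minimal brute-force sketch; the Boolean encoding of the three constraints is ours:

```python
import itertools

# The three course constraints over Booleans (L, K, P, A):
#   at least one of P or L;  A implies P;  K implies (A or L)
def valid(l, k, p, a):
    return (p or l) and (p or not a) and (a or l or not k)

bad = sum(not valid(*x) for x in itertools.product([False, True], repeat=4))
print(bad)  # 7 of the 16 instantiations are impossible
```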

SLIDE 11

Boolean Constraints

[The same unstructured/structured tables, now captured as Boolean constraints: P ∨ L, A ⇒ P, K ⇒ (A ∨ L).]

7 out of 16 instantiations are impossible.

SLIDE 12

Learning in Structured Spaces

Data + Constraints (background knowledge, physics) → Learn → ML Model

Today's machine learning tools don't take knowledge as input!
SLIDE 13

Deep Learning with Logical Knowledge

Data + Constraints → Learn → Deep Neural Network

Pipeline: Input → Neural Network → Output, checked against a logical constraint.

Output is a probability vector p, not Boolean logic!
SLIDE 14

Semantic Loss

Q: How close is the output p to satisfying the constraint α? A: The semantic loss function L(α, p).

  • Axioms, for example:
    – If p is Boolean, then L(p, p) = 0
    – If α implies β, then L(α, p) ≥ L(β, p)   (α is more strict)
  • Properties:
    – If α is equivalent to β, then L(α, p) = L(β, p)
    – If p is Boolean and satisfies α, then L(α, p) = 0

Semantic loss!
SLIDE 15

Semantic Loss: Definition

Theorem: The axioms imply a unique semantic loss (up to a constant factor):

    L(α, p) ∝ −log Σ_{x ⊨ α} Π_{i : x ⊨ Xᵢ} pᵢ · Π_{i : x ⊨ ¬Xᵢ} (1 − pᵢ)

Each product is the probability of getting state x after flipping coins with probabilities p; the sum is the probability of satisfying α after flipping those coins.
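As a concrete (if exponential) reading of this definition, a minimal brute-force sketch; the `constraint` encoding is our own illustration, not code from the paper:

```python
import itertools
import math

def semantic_loss(constraint, p):
    """L(alpha, p) = -log of the probability that independent coin
    flips with probabilities p produce a state satisfying alpha."""
    wmc = 0.0  # weighted model count of the constraint
    for x in itertools.product([False, True], repeat=len(p)):
        if constraint(x):
            weight = 1.0
            for xi, pi in zip(x, p):
                weight *= pi if xi else (1.0 - pi)
            wmc += weight
    return -math.log(wmc)

# Example: "at least one of P or L" with p = (0.3, 0.6)
print(semantic_loss(lambda x: x[0] or x[1], (0.3, 0.6)))  # ~0.329
```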

SLIDE 16

Example: Exactly-One

  • Data must have some label; we agree it must be one of the 10 digits.
  • Exactly-one constraint, e.g. for 3 classes:

    (y₁ ∨ y₂ ∨ y₃) ∧ (¬y₁ ∨ ¬y₂) ∧ (¬y₂ ∨ ¬y₃) ∧ (¬y₁ ∨ ¬y₃)

  • Semantic loss: the probability of exactly one true y after flipping coins; each summand is the probability that only yⱼ is true. See the sketch below.
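For exactly-one, the sum over satisfying states has one term per class, giving a simple closed form; a small sketch (the function name is ours):

```python
import math

def exactly_one_loss(p):
    """-log sum_j p_j * prod_{k != j} (1 - p_k): the probability that
    exactly one y_j comes up true after flipping coins with probs p."""
    total = 0.0
    for j, pj in enumerate(p):
        term = pj
        for k, pk in enumerate(p):
            if k != j:
                term *= 1.0 - pk
        total += term
    return -math.log(total)

print(exactly_one_loss([0.9, 0.05, 0.05]))  # ~0.20: near-one-hot, low loss
print(exactly_one_loss([0.5, 0.5, 0.5]))    # ~0.98: uncertain, high loss
```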

SLIDE 17

Semi-Supervised Learning

  • Intuition: Unlabeled data must have some label
  • Cf. entropy constraints, manifold learning
  • Minimize exactly-one semantic loss on unlabeled data

Train with: existing loss + w · semantic loss
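A sketch of that objective in PyTorch; the helper names are ours, and the weight w is a tuning knob rather than a value from the slide:

```python
import torch
import torch.nn.functional as F

def exactly_one_semantic_loss(logits):
    # Treat each class output as an independent coin; the loss is
    # -log Pr(exactly one class on) = -log sum_j p_j prod_{k!=j}(1 - p_k).
    p = torch.sigmoid(logits)
    q = (1.0 - p).clamp_min(1e-12)                      # avoid division by zero
    prob = (p * q.prod(dim=1, keepdim=True) / q).sum(dim=1)
    return -prob.log().mean()

def total_loss(logits_labeled, labels, logits_unlabeled, w=0.0005):
    # existing (supervised) loss + w * semantic loss on unlabeled data
    return F.cross_entropy(logits_labeled, labels) \
        + w * exactly_one_semantic_loss(logits_unlabeled)
```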

SLIDE 18

MNIST Experiment

Competitive with state of the art in semi-supervised deep learning

SLIDE 19

FASHION Experiment

Outperforms Ladder Nets!

Same conclusion on CIFAR10

SLIDE 20

What about real constraints? Paths (cf. the Nature paper)

  • Good variable assignments (represent a route): 184
  • Bad variable assignments (do not represent a route): 16,777,032
  • Unstructured probability space: 184 + 16,777,032 = 2²⁴

The structured space is easily encoded as logical constraints. [Nishino et al.]
SLIDE 21

How to Compute Semantic Loss?

  • In general: #P-hard.
SLIDE 22

Negation Normal Form Circuits

[Darwiche 2002]

Δ = (sun ∧ rain ⇒ rainbow)

SLIDE 23

Logical Circuits: Bottom-Up Evaluation

[Figure: the circuit for Δ evaluated bottom-up on an input assignment; each gate computes its value from its children, e.g. 0 = 1 AND 0.]
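A minimal sketch of this bottom-up pass; the tuple encoding of NNF nodes is ours:

```python
def evaluate(node, x):
    """Evaluate an NNF circuit bottom-up on assignment x (a dict)."""
    kind = node[0]
    if kind == "var":
        return x[node[1]]
    if kind == "not":
        return not x[node[1]]
    values = [evaluate(child, x) for child in node[1]]
    return all(values) if kind == "and" else any(values)

# Δ = (sun ∧ rain ⇒ rainbow), i.e. ¬sun ∨ ¬rain ∨ rainbow
delta = ("or", [("not", "sun"), ("not", "rain"), ("var", "rainbow")])
print(evaluate(delta, {"sun": True, "rain": True, "rainbow": False}))  # False
```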

SLIDE 24

Decomposable Circuits

Decomposable

[Darwiche 2002]

SLIDE 25

Tractable for Logical Inference

  • Is there a solution? (SAT)
    – SAT(β ∨ γ) iff SAT(β) or SAT(γ)   (always)
    – SAT(β ∧ γ) iff SAT(β) and SAT(γ)   (decomposable)
  • How many solutions are there? (#SAT)
  • Complexity is linear in circuit size.
SLIDE 26

Deterministic Circuits

Deterministic

[Darwiche 2002]

SLIDE 27

How many solutions are there? (#SAT)

SLIDE 28

How many solutions are there? (#SAT)

Arithmetic Circuit
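The arithmetic-circuit view makes #SAT a single bottom-up pass: literal leaves become 1, decomposable AND gates multiply, deterministic OR gates add. A sketch under those assumptions (plus smoothness, so OR branches range over the same variables); same node encoding as above:

```python
import math

def count_models(node):
    kind = node[0]
    if kind in ("var", "not"):
        return 1                      # a literal has one model over its variable
    counts = [count_models(c) for c in node[1]]
    if kind == "and":                 # decomposability: disjoint variables
        return math.prod(counts)
    return sum(counts)                # determinism: mutually exclusive branches
```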

SLIDE 29

Tractable for Logical Inference

  • Is there a solution? (SAT)
  • How many solutions are there? (#SAT)
  • Stricter languages (e.g., BDD, SDD) also support:
    – Equivalence checking
    – Conjoining/disjoining/negating circuits
  • Complexity is linear in circuit size.
  • Compilation into a circuit language works either
    – top-down, via an exhaustive SAT solver, or
    – bottom-up, by conjoining/disjoining/negating circuits.
SLIDE 30

How to Compute Semantic Loss?

  • In general: #P-hard.
  • With a logical circuit for α: linear!
  • Example: the exactly-one constraint:

    L(α, p) = −log( Σⱼ pⱼ · Πₖ≠ⱼ (1 − pₖ) )

  • Why? Decomposability and determinism! (See the sketch below.)
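Concretely, the same bottom-up pass computes a weighted model count: literal leaves evaluate to pᵢ or 1 − pᵢ, AND multiplies, OR adds. A sketch assuming a smooth, decomposable, deterministic circuit, with the node encoding used earlier:

```python
import math

def wmc(node, p):
    """Weighted model count of the circuit under leaf weights p."""
    kind = node[0]
    if kind == "var":
        return p[node[1]]
    if kind == "not":
        return 1.0 - p[node[1]]
    values = [wmc(child, p) for child in node[1]]
    return math.prod(values) if kind == "and" else sum(values)

def semantic_loss_from_circuit(alpha, p):
    return -math.log(wmc(alpha, p))   # one pass: linear in circuit size
```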

SLIDE 31

Predict Shortest Paths

Add semantic loss for the path constraint. The evaluation asks three questions of increasing difficulty:

  • Is the output a path?
  • Are the individual edge predictions correct?
  • Is the prediction the shortest path? This is the real task!

(Same conclusion for predicting sushi preferences; see the paper.)
SLIDE 32

Outline

  • Adding knowledge to deep learning
  • Probabilistic circuits
  • Logistic circuits for image classification
SLIDE 33
Logical Circuits

[Figure: a logical circuit over L, K, P, A whose models are exactly the valid course combinations.]

Can we represent a distribution over the solutions to the constraint?
SLIDE 34

Probabilistic Circuits

Syntax: assign a normalized probability to each input wire of every OR gate.

[Figure: the logical circuit annotated with OR-gate parameters, e.g. 0.6/0.4, 0.8/0.2, 0.25/0.75, 0.9/0.1, and 0.1/0.6/0.3 at the root.]
SLIDE 35

Bottom-Up Evaluation of PSDDs

Input: B, C, D, E all true (1 1 1 1)

Multiply the parameters bottom-up: each AND gate multiplies its children, and each OR gate takes the parameter-weighted sum of its children, e.g. 0.1 = 0.1·1 + 0.9·0 and 0.24 = 0.8·0.3.

Result: Pr(B, C, D, E) = 0.096
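A sketch of this evaluation; our encoding gives each OR node a list of (parameter, child) pairs whose parameters sum to one:

```python
def psdd_value(node, x):
    """Bottom-up PSDD evaluation on a complete assignment x (a dict)."""
    kind = node[0]
    if kind == "var":
        return 1.0 if x[node[1]] else 0.0
    if kind == "not":
        return 0.0 if x[node[1]] else 1.0
    if kind == "and":
        value = 1.0
        for child in node[1]:
            value *= psdd_value(child, x)
        return value
    # OR gate: a parameter-weighted mixture, e.g. 0.1 = 0.1*1 + 0.9*0
    return sum(theta * psdd_value(child, x) for theta, child in node[1])
```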

SLIDE 36

Alternative View of PSDDs

Input: L, K, P, A are true.

Multiply the parameters on the wires the input activates:

    Pr(L, K, P, A) = 0.3 × 1 × 0.8 × 0.4 × 0.25 = 0.024
[Figure: the parameterized circuit once more, with the OR-gate parameters 0.6/0.4, 0.8/0.2, 0.25/0.75, 0.9/0.1, and 0.1/0.6/0.3.]

Can read probabilistic independences off the circuit structure!

Each node represents a normalized distribution!

Can interpret every parameter as a conditional probability! (XAI)

SLIDE 38

Tractable for Probabilistic Inference

  • MAP inference: find the most likely assignment to x given y (otherwise NP-hard)
  • Computing conditional probabilities Pr(x | y) (otherwise #P-hard)
  • Sampling from Pr(x | y)
  • Algorithms are linear in circuit size (pass up, pass down; similar to backprop)
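For instance, conditional probabilities reduce to two bottom-up passes: a literal whose variable is not in the evidence evaluates to 1 (it is summed out), and Pr(x | y) = Pr(x, y) / Pr(y). A sketch assuming a smooth circuit, with the same encoding as above:

```python
def marginal(node, evidence):
    """Pr(evidence): unset variables are summed out (assumes smoothness)."""
    kind = node[0]
    if kind in ("var", "not"):
        v = evidence.get(node[1])
        if v is None:
            return 1.0                          # marginalize this variable out
        return float(v) if kind == "var" else float(not v)
    if kind == "and":
        result = 1.0
        for child in node[1]:
            result *= marginal(child, evidence)
        return result
    return sum(th * marginal(c, evidence) for th, c in node[1])

def conditional(node, x, y):
    return marginal(node, {**y, **x}) / marginal(node, y)   # Pr(x | y)
```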

SLIDE 39

Parameter Learning Algorithms

  • Closed-form maximum likelihood from complete data
  • One pass over the data to estimate Pr(x | y)
  • Not a lot to say: very easy!
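A sketch of why this is easy: by determinism, each complete record takes exactly one branch at every OR gate it reaches, so each parameter is just a count ratio. The `branch_of` helper below is hypothetical:

```python
from collections import Counter

def estimate_or_gate(records, branch_of):
    """Max-likelihood parameters of one OR gate from complete data.
    branch_of(record) returns the unique child branch the record
    satisfies (determinism guarantees uniqueness)."""
    counts = Counter(branch_of(r) for r in records)
    total = sum(counts.values())
    return {child: n / total for child, n in counts.items()}
```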

SLIDE 40

PSDDs …are Sum-Product Networks …are Arithmetic Circuits

[Figure: a PSDD OR gate with parameters θ₁ … θₙ and prime/sub children (p₁, s₁) … (pₙ, sₙ), beside the equivalent arithmetic circuit: the sum Σᵢ θᵢ · pᵢ · sᵢ.]
SLIDE 41

Learn Mixtures of PSDD Structures

Q: "Help! I need to learn a discrete probability distribution…"
A: Learn a mixture of PSDDs! State of the art on 6 datasets!

Strongly outperforms:

  • Bayesian network learners
  • Markov network learners

Competitive with:

  • SPN learners
  • Cutset network learners
SLIDE 42

Outline

  • Adding knowledge to deep learning
  • Probabilistic circuits
  • Logistic circuits for image classification
SLIDE 43

What if I only want to classify Z?

A probabilistic circuit models the joint Pr(Z, B, C, D, E). What if we only want to learn a classifier Pr(Z | B, C, D, E)?
SLIDE 44

Logistic Circuits: Evaluation

Input: B, C, D, E all true (1 1 1 1)

Aggregate the parameters bottom-up, then apply the logistic function to the final output:

    Pr(Z = 1 | B, C, D, E) = 1 / (1 + exp(−1.9)) = 0.869
SLIDE 45

Alternative View on Logistic Circuits

Represents Pr(Z | B, C, D, E):

  • Take all 'hot' wires
  • Sum their weights
  • Push the sum through the logistic function
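That recipe is nearly a one-liner; a sketch, with wire weights made up by us so the output matches the 0.869 on the evaluation slide:

```python
import math

def predict(hot_wire_weights):
    """Sum the weights of the 'hot' wires, then apply the logistic function."""
    g = sum(hot_wire_weights)
    return 1.0 / (1.0 + math.exp(-g))

print(predict([0.5, 1.4]))  # 1/(1+exp(-1.9)) ≈ 0.869
```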
SLIDE 46

Special Case: Logistic Regression

With a trivial circuit structure, this is exactly logistic regression:

    Pr(Z = 1 | B, C, D, E) = 1 / (1 + exp(−B·θ_B − ¬B·θ_¬B − C·θ_C − ⋯))

What about logistic circuits in more general forms?
SLIDE 47

Parameter Learning

Reduce to logistic regression:

  • Associate a feature with each wire: "global circuit flow" features
  • Learning the parameters θ is then convex optimization!
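Hence a sketch of parameter learning: compute the 0/1 circuit-flow feature vector per example and hand it to any off-the-shelf logistic regression. The `flows` helper is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_parameters(examples, labels, flows):
    """flows(x) -> 0/1 vector with one entry per wire, 1 iff the wire is hot."""
    X = np.array([flows(x) for x in examples])   # n_examples x n_wires
    model = LogisticRegression()                 # convex: a global optimum
    model.fit(X, labels)
    return model.coef_.ravel()                   # one learned weight per wire
```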

SLIDE 48

Structure Learning Primitive

SLIDE 49

Logistic Circuit Structure Learning

Loop: generate candidate operations → calculate gradient variance → execute the best operation.
SLIDE 50

Comparable Accuracy with Neural Nets

SLIDE 51

Significantly Smaller in Size

SLIDE 52

Better Data Efficiency

SLIDE 53

Logistic vs. Probabilistic Circuits

Probabilities become log-odds: a probabilistic circuit models the joint Pr(Z, B, C, D, E), while a logistic circuit models the conditional Pr(Z | B, C, D, E).
SLIDE 54

Interpretable?

SLIDE 55

Logistic Circuits: Conclusions

  • Synthesis of symbolic AI and statistical learning
  • Discriminative counterparts of probabilistic circuits
  • Convex parameter learning
  • Simple heuristic for structure learning
  • Good performance
  • Easy to interpret
SLIDE 56

Conclusions

Statistical ML ("Probability") · Symbolic AI ("Logic") · Connectionism ("Deep")

Circuits combine all three.