SLIDE 1

Circuit Languages as a Synthesis of Learning and Reasoning

Guy Van den Broeck

Simons Symposium on New Directions in Theoretical Machine Learning May 10, 2019

SLIDE 2

How are ideas about automated reasoning from GOFAI relevant to modern statistical machine learning?

SLIDE 3

Outline: Reasoning ∩ Learning

  • 1. Deep Learning with Symbolic Knowledge
  • 2. Efficient Reasoning During Learning
  • 3. Probabilistic and Logistic Circuits
SLIDE 4

Deep Learning with Symbolic Knowledge

[Venn diagram: Reasoning (R) ∩ Learning (L)]

SLIDE 5

Motivation: Vision

[Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos.]

SLIDE 6

Motivation: Robotics

[Wong, L. L., Kaelbling, L. P., & Lozano-Perez, T., Collision-free state estimation. ICRA 2012]

  

SLIDE 7

Motivation: Language

  • Non-local dependencies:

“At least one verb in each sentence”

  • Sentence compression

“If a modifier is kept, its subject is also kept”

  • NELL ontology and rules

… and much more!

[Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge], [Ganchev, K., Gillenwater, J., & Taskar, B. (2010). Posterior regularization for structured latent variable models] … and many many more!

SLIDE 8

Motivation: Deep Learning

[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al.. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]

SLIDE 9

Motivation: Deep Learning

[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al.. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]

… but …

SLIDE 10

Learning with Symbolic Knowledge

Constraints (Background Knowledge) (Physics) + Data

  • 1. Must take at least one of Probability (P) or Logic (L).
  • 2. Probability (P) is a prerequisite for AI (A).
  • 3. The prerequisite for KR (K) is either AI (A) or Logic (L).
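
For illustration (not part of the slides), these three constraints can be written as propositional formulas and checked by brute-force enumeration; the variable names P, L, A, K follow the slide:

    from itertools import product

    # Course constraints from the slide, over Boolean variables
    # P (Probability), L (Logic), A (AI), K (KR):
    #   1. P or L        -- take at least one of Probability, Logic
    #   2. A -> P        -- Probability is a prerequisite for AI
    #   3. K -> (A or L) -- KR requires AI or Logic
    def satisfies(P, L, A, K):
        return (P or L) and ((not A) or P) and ((not K) or A or L)

    # Enumerate all 2^4 assignments and keep the ones allowed by the knowledge.
    valid = [v for v in product([False, True], repeat=4) if satisfies(*v)]
    print(len(valid), "of 16 assignments satisfy the constraints")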

SLIDE 11

Learning with Symbolic Knowledge

Constraints (Background Knowledge) (Physics) + Data → Learn → ML Model

Today's machine learning tools don't take knowledge as input!

SLIDE 12

Deep Learning with Symbolic Knowledge

Data + Constraints → Learn → Deep Neural Network

Input → Neural Network → Output, with a Logical Constraint on the Output

Output is probability vector p, not Boolean logic!

SLIDE 13

Semantic Loss

Q: How close is output p to satisfying constraint α? Answer: Semantic loss function L(α,p)

  • Axioms, for example:

– If α fixes the labels, then L(α,p) is cross-entropy
– If α implies β, then L(α,p) ≥ L(β,p) (α is more strict)

  • Implied Properties:

– If α is equivalent to β, then L(α,p) = L(β,p)
– If p is Boolean and satisfies α, then L(α,p) = 0

SEMANTIC Loss!

SLIDE 14

Semantic Loss: Definition

Theorem: Axioms imply unique semantic loss:

L(α, p) ∝ − log Σ_{x ⊨ α} Π_{i : x ⊨ X_i} p_i · Π_{i : x ⊨ ¬X_i} (1 − p_i)

Each product is the probability of getting state x after flipping coins with probabilities p; the sum over x ⊨ α is the probability of satisfying α after flipping coins with probabilities p.
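
A minimal sketch of this definition by brute-force enumeration (not from the slides); constraint is any Boolean predicate over a complete assignment, and the exponential enumeration is exactly what the circuits later in the talk avoid:

    import math
    from itertools import product

    def semantic_loss(constraint, p):
        """Brute-force semantic loss: -log Pr[constraint holds] when each
        variable x_i is an independent coin flip with Pr[x_i = 1] = p[i]."""
        prob = 0.0
        for x in product([0, 1], repeat=len(p)):
            if constraint(x):
                weight = 1.0
                for xi, pi in zip(x, p):
                    weight *= pi if xi else (1.0 - pi)
                prob += weight
        return -math.log(prob)

    # Example: exactly-one constraint over 3 classes.
    exactly_one = lambda x: sum(x) == 1
    print(semantic_loss(exactly_one, [0.2, 0.7, 0.1]))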

SLIDE 15

Simple Example: Exactly-One

  • Data must have some label

We agree this must be one of the 10 digits:

  • Exactly-one constraint

→ For 3 classes:

  • Semantic loss:

x₁ ∨ x₂ ∨ x₃
¬x₁ ∨ ¬x₂
¬x₂ ∨ ¬x₃
¬x₁ ∨ ¬x₃

In the semantic loss, each term is the probability that only xᵢ is true after flipping coins; the sum is the probability that exactly one x is true after flipping coins.
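
Written out, this gives the standard closed form (it follows directly from the definition on the previous slide): L(exactly-one, p) = −log Σᵢ pᵢ Πⱼ≠ᵢ (1 − pⱼ), the negative log-probability that exactly one of the independent coin flips comes up true.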

SLIDE 16

Semi-Supervised Learning

  • Intuition: Unlabeled data must have some label
  • Cf. entropy minimization, manifold learning
  • Minimize exactly-one semantic loss on unlabeled data

Train with: existing loss + w · semantic loss
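
A minimal sketch of the combined objective (not from the slides), reusing semantic_loss and exactly_one from the earlier sketch; the weight w is a hypothetical hyperparameter:

    def total_loss(existing_loss, predicted_probs, constraint, w=0.05):
        """Combined objective: the existing (supervised) loss plus w times the
        semantic loss of the constraint on the predicted probabilities."""
        return existing_loss + w * semantic_loss(constraint, predicted_probs)

    # Semi-supervised case: unlabeled examples contribute no supervised loss,
    # only the exactly-one semantic loss on the predicted class distribution.
    unlabeled_term = total_loss(0.0, [0.2, 0.7, 0.1], exactly_one)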

SLIDE 17

Experimental Evaluation

Competitive with the state of the art in semi-supervised deep learning. Outperforms the SoA!

Same conclusion on CIFAR10

SLIDE 18

Efficient Reasoning During Learning

[Venn diagram: Reasoning (R) ∩ Learning (L)]

SLIDE 19

But what about real constraints?

  • cf. Nature paper
  • Path constraint
  • Example: 4x4 grids

2²⁴ = 16,777,216 assignments = 184 paths + 16,777,032 non-paths

  • Easily encoded as logical constraints 

[Nishino et al., Choi et al.]


SLIDE 20

How to Compute Semantic Loss?

  • In general: #P-hard 
SLIDE 21

Reasoning Tool: Logical Circuits

Representation of logical sentences: (C ∧ ¬D) ∨ (¬C ∧ D), i.e., C XOR D

SLIDE 22

Reasoning Tool: Logical Circuits

Representation of logical sentences:

Input: [circuit diagram evaluated bottom-up on an example input assignment]

SLIDE 23

Tractable for Logical Inference

  • Is there a solution? (SAT)

– SAT(β ∨ γ) iff SAT(β) or SAT(γ) (always)
– SAT(β ∧ γ) iff ???

SLIDE 24

Decomposable Circuits

Decomposable

(AND node: one child over variable A, the other over B, C, D; the children mention disjoint sets of variables)

SLIDE 25

Tractable for Logical Inference

  • Is there a solution? (SAT)

– SAT(β ∨ γ) iff SAT(β) or SAT(γ) (always)
– SAT(β ∧ γ) iff SAT(β) and SAT(γ) (decomposable)

  • How many solutions are there? (#SAT)
  • Complexity linear in circuit size 
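
A minimal sketch (not from the slides) of why these properties give linear-time model counting: counts multiply at decomposable AND nodes and add at deterministic OR nodes. The nested-tuple circuit encoding is made up for illustration, and the circuit is assumed to also be smooth:

    def count_models(node):
        """Model counting on a smooth, decomposable, deterministic logical circuit.
        node is ('lit', name) or ('and', [children]) or ('or', [children])."""
        kind = node[0]
        if kind == 'lit':
            return 1                 # a literal has exactly one model over its own variable
        if kind == 'and':            # decomposable: children range over disjoint variables
            result = 1
            for child in node[1]:
                result *= count_models(child)
            return result
        if kind == 'or':             # deterministic: children never share a model
            return sum(count_models(child) for child in node[1])
        raise ValueError(kind)

    # (X or not X) over one variable: an OR of two literals, 2 models.
    print(count_models(('or', [('lit', 'X'), ('lit', 'not X')])))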

SLIDE 26

Deterministic Circuits

Deterministic

C XOR D

SLIDE 27

Deterministic Circuits

Deterministic

C XOR D   /   C ⇔ D (the disjuncts are mutually exclusive)

SLIDE 28

How many solutions are there? (#SAT)

[Circuit diagram: counts propagate bottom-up: 1 at each literal, sums (+) at OR nodes, products (×) at AND nodes, giving 16 models at the root.]

SLIDE 29

Tractable for Logical Inference

  • Is there a solution? (SAT)
  • How many solutions are there? (#SAT)
  • Conjoin, disjoin, equivalence checking, etc.
  • Complexity linear in circuit size 
  • Compilation into circuit by

– ↓ exhaustive SAT solver
– ↑ conjoin/disjoin/negate


[Darwiche and Marquis, JAIR 2002]

SLIDE 30

How to Compute Semantic Loss?

  • In general: #P-hard 
  • With a logical circuit for α: Linear 
  • Example: exactly-one constraint:
  • Why? Decomposability and determinism!

L(α, p) = L(circuit for α, p) = − log( weighted model count of the circuit, using p_i for positive literals and 1 − p_i for negated literals )
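
A minimal sketch (not from the slides) of that computation: the same bottom-up pass as model counting, but with weights p_i and 1 − p_i at the literals, which is sound because of decomposability (products) and determinism (sums). The circuit encoding is again a made-up nested-tuple format:

    import math

    def semantic_loss_circuit(node, p):
        """Semantic loss via a smooth, decomposable, deterministic circuit for α.
        node is ('lit', i, positive) or ('and', [children]) or ('or', [children]);
        p[i] is the predicted probability of variable i."""
        def wmc(n):
            kind = n[0]
            if kind == 'lit':
                _, i, positive = n
                return p[i] if positive else 1.0 - p[i]
            children = n[1]
            if kind == 'and':                       # decomposability: multiply
                out = 1.0
                for c in children:
                    out *= wmc(c)
                return out
            return sum(wmc(c) for c in children)    # determinism: add
        return -math.log(wmc(node))

    # Exactly-one over two variables: (x0 and not x1) or (not x0 and x1).
    circuit = ('or', [('and', [('lit', 0, True),  ('lit', 1, False)]),
                      ('and', [('lit', 0, False), ('lit', 1, True)])])
    print(semantic_loss_circuit(circuit, [0.9, 0.2]))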

SLIDE 31

Predict Shortest Paths

Add semantic loss for path constraint

Evaluation questions: Is the output a path? Are the individual edge predictions correct? Is the prediction the shortest path? The last one is the real task! (Same conclusion for predicting sushi preferences, see paper.)

SLIDE 32

Conclusions 1

  • Knowledge is (hidden) everywhere in ML
  • Semantic loss makes logic differentiable
  • Performs well semi-supervised
  • Requires hard reasoning in general

– Reasoning can be encapsulated in a circuit – No overhead during learning

  • Performs well on structured prediction
  • A little bit of reasoning goes a long way!
SLIDE 33

Probabilistic and Logistic Circuits

[Venn diagram: Reasoning (R) ∩ Learning (L)]

SLIDE 34

A False Dilemma?

Classical AI Methods: clear modeling assumptions, well understood
(e.g., a decision tree over Hungry? $25? Restaurant? Sleep?)

Neural Networks: "black box", empirical performance

SLIDE 35

Can we turn logic circuits into a statistical model?

Inspiration: Probabilistic Circuits

SLIDE 36

Probabilistic Circuits

Input: an assignment to B, C, D, E.

[Circuit diagram: probabilities propagate bottom-up, e.g. (.1 × 1) + (.9 × 0) at a sum node and .8 × .3 at a product node, giving Pr(B, C, D, E) = 0.096 at the root.]
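
A minimal sketch (not from the slides) of that bottom-up evaluation: leaf nodes return the probability that their literal agrees with the input, product nodes multiply, and sum nodes mix their children with the parameters on their wires. The encoding and parameter values are invented for illustration:

    def pc_probability(node, x):
        """Evaluate a probabilistic circuit on a complete assignment x (dict var -> 0/1).
        node is ('lit', var, positive), ('prod', [children]),
        or ('sum', [(weight, child), ...]) with weights summing to 1."""
        kind = node[0]
        if kind == 'lit':
            _, var, positive = node
            return 1.0 if x[var] == int(positive) else 0.0
        if kind == 'prod':                                   # decomposable product node
            out = 1.0
            for child in node[1]:
                out *= pc_probability(child, x)
            return out
        return sum(w * pc_probability(child, x) for w, child in node[1])   # sum node (mixture)

    # Tiny example over two variables B and C.
    pc = ('sum', [(0.1, ('prod', [('lit', 'B', True),  ('lit', 'C', True)])),
                  (0.9, ('prod', [('lit', 'B', False), ('lit', 'C', True)]))])
    print(pc_probability(pc, {'B': 1, 'C': 1}))   # 0.1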

SLIDE 37
[Circuit diagram: a probabilistic circuit over the course variables L, K, P, A, with a parameter on each wire out of a sum node (e.g., 0.6/0.4, 0.8/0.2, 0.25/0.75, 0.9/0.1).]

Can read probabilistic independences off the circuit structure

Each node represents a normalized distribution!

SLIDE 38
[Circuit diagram: the same probabilistic circuit; wires correspond to events such as "student takes course L" and "student takes course P", and a wire's parameter is, e.g., the probability of course P given L.]

Parameters are Interpretable

SLIDE 39

Properties, Properties, Properties!

  • Read conditional independencies from structure
  • Interpretable parameters (XAI)

(conditional probabilities of logical sentences)

  • Closed-form parameter learning
  • Efficient reasoning

– MAP inference: most-likely assignment to x given y

(otherwise NP-hard)

– Computing conditional probabilities Pr(x|y)

(otherwise #P-hard)

– Algorithms linear in circuit size 
– x and y could even be complex logical circuits

SLIDE 40

Discrete Density Estimation

Q: "Help! I need to learn a discrete probability distribution…"
A: Learn probabilistic circuits!

LearnPSDD is state of the art on 6 datasets! It strongly outperforms

  • Bayesian network learners
  • Markov network learners

and is competitive with SPN learners.

SLIDE 41

Learning Preference Distributions

Special-purpose distribution: Mixture-of-Mallows

– # of components from 1 to 20
– EM with 10 random seeds
– Implementation of Lu & Boutilier

vs. PSDD

SLIDE 42

Compilation for Prob. Inference

SLIDE 43

Collapsed Compilation [NeurIPS 2018]

To sample a circuit:

  • 1. Compile bottom up until you reach the size limit
  • 2. Pick a variable you want to sample
  • 3. Sample it according to its marginal distribution in the current circuit

  • 4. Condition on the sampled value
  • 5. (Repeat)

Asymptotically unbiased importance sampler 
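
A schematic sketch of those five steps (not the authors' implementation); the circuit object and the helpers compile_bottom_up, pick_variable, marginal, and condition are hypothetical placeholders for the underlying circuit operations:

    import random

    def collapsed_compilation_sample(factors, size_limit,
                                     compile_bottom_up, pick_variable, marginal, condition):
        """Schematic version of the five steps above; returns the final circuit and
        the proposal probability of the values sampled along the way."""
        circuit, proposal_prob = None, 1.0
        while factors:
            # 1. Compile bottom up until the circuit reaches the size limit.
            circuit, factors = compile_bottom_up(circuit, factors, size_limit)
            if not factors:
                break
            # 2. Pick a variable you want to sample.
            var = pick_variable(circuit)
            # 3. Sample it according to its marginal distribution in the current circuit.
            p_true = marginal(circuit, var)
            value = random.random() < p_true
            proposal_prob *= p_true if value else (1.0 - p_true)
            # 4. Condition on the sampled value, then 5. repeat.
            circuit = condition(circuit, var, value)
        # Dividing by proposal_prob is what makes the resulting estimator an
        # (asymptotically unbiased) importance sampler.
        return circuit, proposal_prob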

SLIDE 44

Circuits + importance weights approximate any query

SLIDE 45

Experiments

Competitive with state-of-the-art approximate inference in graphical models. Outperforms it on several benchmarks!

SLIDE 46

But what if I only want to classify Y?

Pr(Y, B, C, D, E)  vs.  Pr(Y | B, C, D, E)

SLIDE 47

Logistic Circuits

Pr(Y = 1 | B, C, D, E) = 1 / (1 + exp(−1.9)) = 0.869

Input: an assignment to B, C, D, E; the circuit's output weight is pushed through the logistic function.
SLIDE 48

Alternative Semantics

Represents Pr(Y | B, C, D, E)

  • Take all 'hot' wires
  • Sum their weights
  • Push through logistic function
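
A minimal sketch (not from the slides) of exactly those three steps, with an invented wire/weight representation:

    import math

    def logistic_circuit_predict(hot_wires, weights):
        """Pr(Y = 1 | x): sum the parameters of the wires activated ('hot') for
        input x, then push the total through the logistic (sigmoid) function."""
        total = sum(weights[w] for w in hot_wires)
        return 1.0 / (1.0 + math.exp(-total))

    # Example: two hot wires whose weights sum to 1.9 give Pr(Y=1|x) of about 0.87,
    # in line with the slide's worked example.
    print(logistic_circuit_predict(["w3", "w7"], {"w3": 1.2, "w7": 0.7}))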
SLIDE 49

Special Case: Logistic Regression

Is this a coincidence? What about more general circuits?

Pr(Y = 1 | B, C, D, E) = 1 / (1 + exp(−B·θ_B − ¬B·θ_¬B − C·θ_C − ⋯))

Logistic Regression

SLIDE 50

Parameter Learning

Reduce to logistic regression:

Features associated with each wire: "global circuit flow" features

Learning parameters θ is convex optimization!
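
A minimal sketch (not from the slides) of that reduction: one feature per wire, whose value is the circuit flow through that wire on the example, fed to an ordinary logistic-regression solver. circuit_flow_features is a hypothetical helper, and scikit-learn stands in for any convex solver:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def learn_logistic_circuit_parameters(circuit, examples, labels, circuit_flow_features):
        """Parameter learning reduced to logistic regression: one feature per wire,
        equal to the circuit flow through that wire on the example (convex problem)."""
        X = np.array([circuit_flow_features(circuit, x) for x in examples])
        y = np.array(labels)
        model = LogisticRegression(max_iter=1000).fit(X, y)
        return model.coef_.ravel()   # one learned weight per wire of the circuit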

SLIDE 51

Logistic Circuit Structure Learning

Generate candidate operations → Calculate gradient variance → Execute the best operation (repeat)
SLIDE 52

Comparable Accuracy with Neural Nets

SLIDE 53

Significantly Smaller in Size

SLIDE 54

Better Data Efficiency

SLIDE 55

Logistic vs. Probabilistic Circuits

Probabilities become log-odds: Pr(Y | B, C, D, E) instead of Pr(Y, B, C, D, E)

SLIDE 56

Interpretable?

SLIDE 57

2+2 = Reasoning About Classifiers

2 = State-of-the-art (discrete) densities
2 = Non-compromising classifiers
2+2 = Tools for reasoning about how a classifier acts on a distribution

  • Adversarial
  • Missing data
  • Active sensing
  • Explainability
  • Fairness
  • Robustness
  • Unknown unknowns
  • Selection bias
SLIDE 58

What to expect of classifiers? [IJCAI19]

  • Given a predictor Y=F(X), a distribution P(X)
  • What is expected prediction of F in P(X|e)?
  • Computationally hard

– Even with trivial F (#P-hard)
– Even with trivial P (#P-hard)
– Even with trivial F and P (NP-hard)

  • But: we can do this efficiently for a regression circuit F and a probabilistic circuit P!
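
A brute-force sketch (not from the slides) of the quantity in question, the expectation of F(X) under P(X | e); the circuit-based algorithm computes the same thing without enumerating assignments:

    from itertools import product

    def expected_prediction(F, P, evidence, num_vars):
        """Brute-force E[F(X) | e]: average the predictor F over all complete
        assignments consistent with the evidence, weighted by P and renormalized."""
        total, mass = 0.0, 0.0
        for x in product([0, 1], repeat=num_vars):
            if all(x[i] == v for i, v in evidence.items()):   # keep only x consistent with e
                total += P(x) * F(x)
                mass += P(x)
        return total / mass

    # Tiny example: F counts ones, P is uniform, evidence fixes x0 = 1.
    print(expected_prediction(lambda x: sum(x), lambda x: 1 / 8, {0: 1}, 3))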

SLIDE 59

XAI User Study: 5 or 3?

[Figure: handwritten digits with sufficient explanations, for correctly classified vs. misclassified examples]

SLIDE 60

Compare to Data Distribution-Unaware explanations

[Figure: distribution-unaware explanations for the same correctly classified vs. misclassified examples]

SLIDE 61

Conclusions 2

Statistical ML ("Probability") · Symbolic AI ("Logic") · Connectionism ("Deep")

Logistic Circuits combine all three.

SLIDE 62

Final Conclusions

  • Knowledge is everywhere in learning
  • Some concepts not easily learned from data
  • Make knowledge first-class citizen in ML
  • Logical circuits turned statistical models
  • Strong properties produce strong learners
  • There is no dilemma between understanding and accuracy?
  • A wealth of high-level reasoning approaches is still absent from the ML discussion

SLIDE 63

Acknowledgements

Thanks to my students and collaborators! Thanks for your attention! Questions?