SLIDE 1

Towards a New Synthesis of Reasoning and Learning

Guy Van den Broeck

Northeastern University April 22, 2019

SLIDE 2

Outline: Reasoning ∩ Learning

  • 1. Deep Learning with Symbolic Knowledge
  • 2. Efficient Reasoning During Learning
  • 3. Probabilistic and Logistic Circuits
  • 4. High-Level Probabilistic Reasoning
SLIDE 3

Deep Learning with Symbolic Knowledge


SLIDE 4

Motivation: Vision

[Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos.]

SLIDE 5

Motivation: Robotics

[Wong, L. L., Kaelbling, L. P., & Lozano-Perez, T., Collision-free state estimation. ICRA 2012]

  

SLIDE 6

Motivation: Language

  • Non-local dependencies:

“At least one verb in each sentence”

  • Sentence compression

“If a modifier is kept, its subject is also kept”

… and many more!

[Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge], [Ganchev, K., Gillenwater, J., & Taskar, B. (2010). Posterior regularization for structured latent variable models]

SLIDE 7

Motivation: Deep Learning

[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]

SLIDE 8

Motivation: Deep Learning

[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]

… but …

SLIDE 9

Learning with Symbolic Knowledge

Constraints

(Background Knowledge) (Physics)

+

Data

  • 1. Must take at least one of Probability (P) or Logic (L).
  • 2. Probability (P) is a prerequisite for AI (A).
  • 3. The prerequisite for KR (K) is either AI (A) or Logic (L).

SLIDE 10

Learning with Symbolic Knowledge

Constraints

(Background Knowledge) (Physics)

ML Model

+

Today's machine learning tools don't take knowledge as input!

SLIDE 11

Deep Learning with Symbolic Knowledge

Data + Constraints → Learn → Deep Neural Network

Input → Neural Network → Output, with a Logical Constraint on the output

Output is a probability vector p, not Boolean logic!

SLIDE 12

Semantic Loss

Q: How close is output p to satisfying constraint α? Answer: Semantic loss function L(α,p)

  • Axioms, for example:

– If p is Boolean then L(p,p) = 0
– If α implies β then L(α,p) ≥ L(β,p)   (α is more strict)

  • Implied Properties:

– If α is equivalent to β then L(α,p) = L(β,p)
– If p is Boolean and satisfies α then L(α,p) = 0

SEMANTIC Loss!

SLIDE 13

Semantic Loss: Definition

Theorem: Axioms imply unique semantic loss:

L(α, p) ∝ −log Σ_{x ⊨ α} Π_{i: x ⊨ Xᵢ} pᵢ · Π_{i: x ⊨ ¬Xᵢ} (1 − pᵢ)

(The inner product is the probability of getting state x after flipping coins with probabilities p; the outer sum is the probability of satisfying α after flipping coins with probabilities p.)
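To make the definition concrete, here is a minimal sketch (not the paper's implementation; the constraint is given as a plain Python predicate and the loss is computed by brute-force enumeration, so it only scales to a handful of variables):

```python
import itertools
import math

def semantic_loss(constraint, p):
    """Semantic loss L(constraint, p): minus the log-probability that a random Boolean
    state, sampled by flipping coin i with probability p[i], satisfies the constraint.
    Enumerates all 2^n states, so only suitable for small n."""
    prob_sat = 0.0
    for state in itertools.product([False, True], repeat=len(p)):
        if constraint(state):
            prob_state = 1.0
            for pi, xi in zip(p, state):
                prob_state *= pi if xi else (1.0 - pi)
            prob_sat += prob_state
    return -math.log(prob_sat)

# Example: exactly-one constraint over 3 variables
exactly_one = lambda state: sum(state) == 1
print(semantic_loss(exactly_one, [0.8, 0.1, 0.1]))  # low loss: mass concentrated on one label
print(semantic_loss(exactly_one, [0.5, 0.5, 0.5]))  # higher loss
```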

SLIDE 14

Simple Example: Exactly-One

  • Data must have some label

We agree this must be one of the 10 digits:

  • Exactly-one constraint, for 3 classes:

(x1 ∨ x2 ∨ x3) ∧ (¬x1 ∨ ¬x2) ∧ (¬x2 ∨ ¬x3) ∧ (¬x1 ∨ ¬x3)

  • Semantic loss: the probability that only xᵢ = 1 after flipping coins, summed over i, i.e., the probability of exactly one true x after flipping coins.

SLIDE 15

Semi-Supervised Learning

  • Intuition: Unlabeled data must have some label
  • Cf. entropy minimization, manifold learning
  • Minimize exactly-one semantic loss on unlabeled data

Train with: existing loss + w · semantic loss
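As a rough sketch of how that combined objective could look in practice (hedged: `model`, the weight `w`, and the use of per-class sigmoid outputs are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def exactly_one_semantic_loss(probs):
    """Semantic loss for the exactly-one constraint: -log of the probability that
    exactly one class variable comes up true when each is flipped independently."""
    comp = (1.0 - probs).clamp_min(1e-12)          # probs: (batch, num_classes)
    prod_comp = comp.prod(dim=1, keepdim=True)     # prod_k (1 - p_k)
    per_class = probs * prod_comp / comp           # p_j * prod_{k != j} (1 - p_k)
    sat_prob = per_class.sum(dim=1)
    return -torch.log(sat_prob.clamp_min(1e-12)).mean()

def training_step(model, optimizer, labeled_x, labels, unlabeled_x, w=0.05):
    optimizer.zero_grad()
    supervised = F.cross_entropy(model(labeled_x), labels)
    probs_u = torch.sigmoid(model(unlabeled_x))    # independent "coins", one per class
    loss = supervised + w * exactly_one_semantic_loss(probs_u)
    loss.backward()
    optimizer.step()
    return loss.item()
```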

SLIDE 16

Experimental Evaluation

Competitive with the state of the art in semi-supervised deep learning, and outperforms it in some settings.

Same conclusion on CIFAR10

SLIDE 17

Efficient Reasoning During Learning


SLIDE 18

But what about real constraints?

  • cf. Nature paper
  • Path constraint
  • Example: 4x4 grids

2²⁴ = 16,777,216 assignments: 184 paths + 16,777,032 non-paths

  • Easily encoded as logical constraints 

[Nishino et al., Choi et al.]


SLIDE 19

How to Compute Semantic Loss?

  • In general: #P-hard 
SLIDE 20

Reasoning Tool: Logical Circuits

Representation of logical sentences: (C ∧ ¬D) ∨ (¬C ∧ D), i.e., C XOR D

slide-21
SLIDE 21

Reasoning Tool: Logical Circuits

Representation of logical sentences: given an input assignment, the circuit is evaluated bottom-up, each gate outputting 1 when its sub-sentence is satisfied.

SLIDE 22

Tractable for Logical Inference

  • Is there a solution? (SAT)

– SAT(𝛽 ∨ 𝛾) iff SAT(𝛽) or SAT(𝛾)   (always)
– SAT(𝛽 ∧ 𝛾) iff ???

SLIDE 23

Decomposable Circuits

Decomposable

An AND gate is decomposable when its children mention disjoint sets of variables, e.g., {A} and {B, C, D}.

SLIDE 24

Tractable for Logical Inference

  • Is there a solution? (SAT)

– SAT(𝛽 ∨ 𝛾) iff SAT(𝛽) or SAT(𝛾)   (always)
– SAT(𝛽 ∧ 𝛾) iff SAT(𝛽) and SAT(𝛾)   (decomposable)

  • How many solutions are there? (#SAT)
  • Complexity linear in circuit size 
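A minimal sketch of that linear-time counting pass (the tiny example circuit is hypothetical, and the code additionally assumes the circuit is smooth, i.e., OR children mention the same variables):

```python
# Node encoding: ("lit", name, polarity), ("and", children), ("or", children)

def model_count(node):
    """Count satisfying assignments of a logical circuit, assuming it is
    decomposable (AND children mention disjoint variables), deterministic
    (OR children are mutually exclusive), and smooth (OR children mention
    the same variables). Linear in circuit size."""
    kind = node[0]
    if kind == "lit":                      # a literal fixes one variable
        return 1
    counts = [model_count(child) for child in node[-1]]
    if kind == "and":                      # disjoint variables: counts multiply
        result = 1
        for c in counts:
            result *= c
        return result
    if kind == "or":                       # mutually exclusive: counts add
        return sum(counts)

# Hypothetical circuit for (A AND B) OR (NOT A AND B): decomposable, deterministic, smooth
circuit = ("or", [("and", [("lit", "A", True),  ("lit", "B", True)]),
                  ("and", [("lit", "A", False), ("lit", "B", True)])])
print(model_count(circuit))  # 2 models over {A, B}: {A=1,B=1} and {A=0,B=1}
```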

SLIDE 25

Deterministic Circuits

Deterministic

C XOR D

SLIDE 26

Deterministic Circuits

Deterministic

An OR gate is deterministic when its children are mutually exclusive, e.g., one branch captures C XOR D and the other C ⇔ D.

SLIDE 27

How many solutions are there? (#SAT)

(Counts are computed bottom-up over the circuit: leaves count 1, OR gates add (+), AND gates multiply (×); intermediate counts such as 2, 4, and 8 accumulate until the root reports 16 models.)

SLIDE 28

Tractable for Logical Inference

  • Is there a solution? (SAT)
  • How many solutions are there? (#SAT)
  • Complexity linear in circuit size 
  • Compilation into circuit by

– ↓ exhaustive SAT solver
– ↑ conjoin / disjoin / negate

[Darwiche and Marquis, JAIR 2002]

SLIDE 29

How to Compute Semantic Loss?

  • In general: #P-hard 
  • With a logical circuit for α: Linear 
  • Example: exactly-one constraint:
  • Why? Decomposability and determinism!

L(α, p) = L(circuit for α, p) = −log( Σᵢ pᵢ · Πⱼ≠ᵢ (1 − pⱼ) )   (for the exactly-one constraint)
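Concretely, the same bottom-up pass with probabilities in place of counts gives the weighted model count, and the semantic loss is its negative logarithm. A hedged sketch reusing the node encoding from the earlier counting example (illustrative, not the authors' implementation):

```python
import math

def weighted_model_count(node, p):
    """One bottom-up pass: probability that coin flips with probabilities p
    satisfy the circuit. Assumes decomposability, determinism, and smoothness."""
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        return p[var] if positive else 1.0 - p[var]
    values = [weighted_model_count(child, p) for child in node[-1]]
    if kind == "and":
        result = 1.0
        for v in values:
            result *= v
        return result
    if kind == "or":
        return sum(values)

def semantic_loss(circuit, p):
    return -math.log(weighted_model_count(circuit, p))

# Exactly-one constraint over {x1, x2, x3}, written as a decomposable/deterministic circuit:
# (x1 AND NOT x2 AND NOT x3) OR (NOT x1 AND x2 AND NOT x3) OR (NOT x1 AND NOT x2 AND x3)
exactly_one = ("or", [
    ("and", [("lit", "x1", True),  ("and", [("lit", "x2", False), ("lit", "x3", False)])]),
    ("and", [("lit", "x1", False), ("and", [("lit", "x2", True),  ("lit", "x3", False)])]),
    ("and", [("lit", "x1", False), ("and", [("lit", "x2", False), ("lit", "x3", True)])]),
])
print(semantic_loss(exactly_one, {"x1": 0.8, "x2": 0.1, "x3": 0.1}))  # ≈ 0.38
```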

SLIDE 30

Predict Shortest Paths

Add semantic loss for path constraint

Metrics: Is the output a path? Are individual edge predictions correct? Is the prediction the shortest path? (the real task!)

(Same conclusion for predicting sushi preferences; see paper.)

SLIDE 31

Conclusions 1

  • Knowledge is (hidden) everywhere in ML
  • Semantic loss makes logic differentiable
  • Performs well semi-supervised
  • Requires hard reasoning in general

– Reasoning can be encapsulated in a circuit
– No overhead during learning

  • Performs well on structured prediction
  • A little bit of reasoning goes a long way!
SLIDE 32

Probabilistic and Logistic Circuits


SLIDE 33

A False Dilemma?

Classical AI Methods (e.g., a decision diagram: Hungry? $25? Restaurant? Sleep?): clear modeling assumptions, well understood.

Neural Networks: “black box”, empirical performance.

SLIDE 34

Can we turn logic circuits into a statistical model?

Inspiration: Probabilistic Circuits

SLIDE 35

Probabilistic Circuits

Input: an assignment to A, B, C, D; probabilities on edges; bottom-up evaluation.

Leaves evaluate to 0 or 1; OR gates take probability-weighted sums of their children, e.g., (.1 × 1) + (.9 × 0); AND gates take products, e.g., .8 × .3 = .24; the root returns the probability of the input, here Pr(A, B, C, D) = .096.
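A minimal sketch of that bottom-up pass (illustrative node encoding and parameters, not the slide's exact circuit): sum nodes mix their children with edge probabilities, product nodes multiply.

```python
def evaluate(node, assignment):
    """Bottom-up evaluation of a probabilistic circuit on a complete assignment.
    Leaves are literals; sum (OR) nodes carry a probability per child edge;
    product (AND) nodes multiply their children."""
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        return 1.0 if assignment[var] == positive else 0.0
    if kind == "sum":
        return sum(w * evaluate(child, assignment) for w, child in node[1])
    if kind == "product":
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, assignment)
        return result

# Hypothetical two-variable circuit: A is Bernoulli(0.1); given A, B is Bernoulli(0.8 or 0.3)
circuit = ("sum", [
    (0.1, ("product", [("lit", "A", True),  ("sum", [(0.8, ("lit", "B", True)), (0.2, ("lit", "B", False))])])),
    (0.9, ("product", [("lit", "A", False), ("sum", [(0.3, ("lit", "B", True)), (0.7, ("lit", "B", False))])])),
])
print(evaluate(circuit, {"A": True, "B": True}))   # 0.1 * 0.8 = 0.08
```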

SLIDE 36

Properties, Properties, Properties!

  • Read conditional independencies from structure
  • Interpretable parameters (XAI)

(conditional probabilities of logical sentences)

  • Closed-form parameter learning
  • Efficient reasoning

– MAP inference: most-likely assignment to x given y (otherwise NP-hard)
– Computing conditional probabilities Pr(x|y) (otherwise #P-hard)
– Algorithms linear in circuit size
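For instance, MAP inference can be sketched as a max-product pass over the same kind of circuit (reusing the hypothetical encoding and the `circuit` variable from the evaluation sketch above; exactness relies on the sum nodes being deterministic):

```python
def map_value(node, evidence):
    """Max-product pass: probability of the most likely complete state consistent
    with the evidence. Exact when sum nodes are deterministic (selective)."""
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        if var in evidence and evidence[var] != positive:
            return 0.0                    # literal contradicts the evidence y
        return 1.0
    if kind == "sum":
        return max(w * map_value(child, evidence) for w, child in node[1])
    if kind == "product":
        result = 1.0
        for child in node[1]:
            result *= map_value(child, evidence)
        return result

# Most likely completion given evidence B = True, using the hypothetical circuit above
print(map_value(circuit, {"B": True}))    # max(0.1*0.8, 0.9*0.3) = 0.27
```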

SLIDE 37

Side Note: Discrete Density Estimation

Q: “Help! I need to learn a discrete probability distribution…”
A: Learn probabilistic circuits!

LearnPSDD is state of the art on 6 datasets, and strongly outperforms:

  • Bayesian network learners
  • Markov network learners

Competitive with SPN learners (State of the art for approximate inference in discrete factor graphs)

SLIDE 38

But what if I only want to classify Y?

Pr(Y, A, B, C, D)  →  Pr(Y | A, B, C, D)

SLIDE 39

Logistic Circuits

Pr(Y = 1 | A, B, C, D) = 1 / (1 + exp(−1.9)) ≈ 0.87

Input: an assignment to A, B, C, D → bottom-up evaluation → weights on edges → logistic function on the output weight.
SLIDE 40

Alternative Semantics

Represents Pr(Y | A, B, C, D)

  • Take all 'hot' wires
  • Sum their weights
  • Push through the logistic function
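A hedged sketch of that read-out (illustrative node encoding and weights, not the paper's code): find which wires are 'hot' for the input, sum their weights, and squash with the logistic function.

```python
import math

def hot_weight(node, assignment):
    """Returns (is_hot, weight_sum): whether the node's sub-circuit is satisfied ('hot')
    by the assignment, and the sum of weights on hot wires below it."""
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        return assignment[var] == positive, 0.0
    if kind == "and":
        results = [hot_weight(child, assignment) for child in node[1]]
        return all(h for h, _ in results), sum(w for _, w in results)
    if kind == "or":
        # Each child wire carries a weight; a deterministic OR has at most one hot child.
        for weight, child in node[1]:
            hot, w_below = hot_weight(child, assignment)
            if hot:
                return True, weight + w_below
        return False, 0.0

def predict(circuit, assignment, bias=0.0):
    _, total = hot_weight(circuit, assignment)
    return 1.0 / (1.0 + math.exp(-(bias + total)))   # logistic function

# Hypothetical tiny circuit over A and B
circuit = ("or", [( 1.2, ("and", [("lit", "A", True),  ("lit", "B", True)])),
                  (-0.7, ("and", [("lit", "A", False), ("lit", "B", True)])),
                  (-2.0, ("and", [("lit", "A", True),  ("lit", "B", False)]))])
print(predict(circuit, {"A": True, "B": True}))   # sigmoid(1.2) ≈ 0.77
```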
SLIDE 41

Special Case: Logistic Regression

Is this a coincidence? What about more general circuits?

Pr(Y = 1 | A, B, C, D) = 1 / (1 + exp(−A·θ_A − ¬A·θ_¬A − B·θ_B − ⋯ ))

Logistic Regression

SLIDE 42

Parameter Learning

Reduce to logistic regression:

Features associated with each wire: "global circuit flow" features

Learning parameters θ is convex optimization!
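To see why it is convex: if, for every example, we can extract a binary feature per wire indicating whether that wire is hot, the model is an ordinary logistic regression over those features. A hedged sketch under that assumption (the `wire_features` extractor and the toy dataset are illustrative; the real 'global circuit flow' features come from the circuit structure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def wire_features(circuit_wires, assignment):
    """One binary feature per wire: 1 if that wire is 'hot' for this example.
    Here each wire is represented as a function from assignment to bool (illustrative)."""
    return np.array([1.0 if wire(assignment) else 0.0 for wire in circuit_wires])

# Hypothetical wires over variables A, B (in a real circuit these come from its structure)
wires = [lambda x: x["A"] and x["B"],
         lambda x: (not x["A"]) and x["B"],
         lambda x: not x["B"]]

data = [({"A": 1, "B": 1}, 1), ({"A": 0, "B": 1}, 1),
        ({"A": 1, "B": 0}, 0), ({"A": 0, "B": 0}, 0)]

X = np.stack([wire_features(wires, x) for x, _ in data])
y = np.array([label for _, label in data])

# Learning the wire weights is a convex logistic-regression problem.
model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)
```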

SLIDE 43

Logistic Circuit Structure Learning

Iterate: generate candidate operations → calculate gradient variance → execute the best operation.
SLIDE 44

Comparable Accuracy with Neural Nets

SLIDE 45

Significantly Smaller in Size

SLIDE 46

Better Data Efficiency

SLIDE 47

Logistic vs. Probabilistic Circuits

Probabilities become log-odds:

Pr(Y, A, B, C, D)  →  Pr(Y | A, B, C, D)

SLIDE 48

Interpretable?

SLIDE 49

Conclusions 2

Statistical ML (“Probability”) + Symbolic AI (“Logic”) + Connectionism (“Deep”)

Logistic Circuits

SLIDE 50

High-Level Probabilistic Inference


SLIDE 51

...

Simple Reasoning Problem

?

Probability that Card1 is Hearts? 1/4

[Van den Broeck; AAAI-KRR'15]

SLIDE 52

Let us automate this:

  • 1. Probabilistic graphical model (e.g., factor graph)
  • 2. Probabilistic inference algorithm

(e.g., variable elimination or junction tree)

Automated Reasoning

[Van den Broeck; AAAI-KRR'15]

SLIDE 53

Let us automate this:

  • 1. Probabilistic graphical model (e.g., factor graph)

is fully connected!

  • 2. Probabilistic inference algorithm

(e.g., variable elimination or junction tree) builds a table with 52⁵² rows

Automated Reasoning

(artist's impression)

[Van den Broeck; AAAI-KRR'15]

SLIDE 54

...

What's Going On Here?

?

Probability that Card52 is Spades given that Card1 is QH? 13/51

[Van den Broeck; AAAI-KRR'15]

SLIDE 55

What's Going On Here?

? ...

Probability that Card52 is Spades given that Card2 is QH? 13/51

[Van den Broeck; AAAI-KRR'15]

SLIDE 56

What's Going On Here?

? ...

Probability that Card52 is Spades given that Card3 is QH? 13/51

[Van den Broeck; AAAI-KRR'15]
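These 13/51 answers are exchangeability at work; a small brute-force check (illustrative, not part of the talk) confirms the conditional probability by conditioning on where the queen of hearts lands:

```python
import random
from fractions import Fraction

# Exact reasoning: conditioning on "some position holds the queen of hearts" leaves the
# other 51 cards exchangeable over the remaining positions, 13 of which are spades.
print(Fraction(13, 51))                 # probability that position 52 is a spade

# A quick Monte Carlo sanity check: condition on Card1 = QH by rejection sampling.
deck = [(rank, suit) for suit in "SHDC" for rank in range(13)]
QH = (11, "H")                          # hypothetical encoding of the queen of hearts
hits = total = 0
rng = random.Random(0)
for _ in range(200_000):
    d = deck[:]
    rng.shuffle(d)
    if d[0] != QH:
        continue                        # keep only worlds where Card1 = QH
    total += 1
    hits += d[51][1] == "S"
print(hits / total)                     # ≈ 13/51 ≈ 0.2549
```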

SLIDE 57

...

Tractable Reasoning

What's going on here? Which property makes reasoning tractable?

⇒ Lifted Inference

  • High-level (first-order) reasoning
  • Symmetry
  • Exchangeability

[Niepert and Van den Broeck, AAAI'14], [Van den Broeck, AAAI-KRR'15]

SLIDE 58

Model distribution at first-order level:

Δ =   ∀p, ∃c, Card(p,c)
      ∀c, ∃p, Card(p,c)
      ∀p, ∀c, ∀c', Card(p,c) ∧ Card(p,c') ⇒ c = c'

[Van den Broeck 2015]

...

Can we now be efficient in the size of our domain?

SLIDE 59

Two logical variables x and y; unary properties: Smokes(x), Gender(x), Young(x), Tall(x) (and the same for y); binary relations: Friends(x,y), Colleagues(x,y), Family(x,y), Classmates(x,y).

FO2 is liftable!

“Smokers are more likely to be friends with other smokers.” “Colleagues of the same age are more likely to be friends.” “People are either family or friends, but never both.” “If X is family of Y, then Y is also family of X.” “Universities in the Bay Area are more likely to be rivals.”

SLIDE 60

Tractable Classes

(Diagram of tractable classes: FO2 CNF, safe monotone CNF, and safe type-1 CNF are liftable; FO3 and some conjunctive queries (CQs) are #P1-hard, e.g. Δ = ∀x,y,z, Friends(x,y) ∧ Friends(y,z) ⇒ Friends(x,z).)

[VdB; NIPS’11], [VdB et al.; KR’14], [Gribkoff, VdB, Suciu; UAI’15], [Beame, VdB, Gribkoff, Suciu; PODS’15], etc.

SLIDE 61

Probabilistic Programming

Programming Languages Artificial Intelligence

Probabilistic Predicate Abstraction Knowledge Compilation

Similar picture for probabilistic databases, probabilistic SMT, probabilistic datalog, probabilistic logic programming, …

SLIDE 62

Conclusions 3

  • Challenge is even greater at the first-order level
  • Existing reasoning algorithms cannot cut it!
  • Integration of first-order logic and probability is a long-standing goal of AI
  • First-order probabilistic reasoning is a frontier and an integration of AI, KR, ML, DBs, theory, PL, etc.

SLIDE 63

Final Conclusions

  • Knowledge is everywhere in learning
  • Some concepts not easily learned from data
  • Make knowledge first-class citizen in ML
  • Logical circuits turned statistical models
  • Strong properties produce strong learners
  • There is no dilemma between understanding and accuracy?
  • A wealth of high-level reasoning approaches is still absent from the ML discussion

SLIDE 64

Acknowledgements

Thanks to my students and collaborators! Thanks for your attention! Questions?