SLIDE 1

Probabilistic Circuits: A New Synthesis of Logic and Machine Learning

Guy Van den Broeck

UCSD May 14, 2018

SLIDE 2

Overview

Statistical ML (“Probability”) · Symbolic AI (“Logic”) · Connectionism (“Deep”)

Probabilistic Circuits

SLIDE 3

References

  • Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang and Guy Van den Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
  • YooJung Choi and Guy Van den Broeck. On Robust Trimming of Bayesian Network Classifiers. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018.
  • Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang and Guy Van den Broeck. A Semantic Loss Function for Deep Learning Under Weak Supervision. In NIPS 2017 Workshop on Learning with Limited Labeled Data: Weak Supervision and Beyond, 2017.
  • Yitao Liang and Guy Van den Broeck. Towards Compact Interpretable Models: Shrinking of Learned Probabilistic Sentential Decision Diagrams. In IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 2017.
  • YooJung Choi, Adnan Darwiche and Guy Van den Broeck. Optimal Feature Selection for Decision Robustness in Bayesian Networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.
  • Yitao Liang, Jessa Bekker and Guy Van den Broeck. Learning the Structure of Probabilistic Sentential Decision Diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
  • Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche and Guy Van den Broeck. Tractable Learning for Complex Probability Queries. In Advances in Neural Information Processing Systems 28 (NIPS), 2015.
  • Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Probability Distributions over Structured Spaces. In Proceedings of the AAAI Spring Symposium on KRR, 2015.
  • Arthur Choi, Guy Van den Broeck and Adnan Darwiche. Tractable Learning for Structured Probability Spaces: A Case Study in Learning Preference Distributions. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
  • Doga Kisa, Guy Van den Broeck, Arthur Choi and Adnan Darwiche. Probabilistic Sentential Decision Diagrams: Learning with Massive Logical Constraints. In ICML Workshop on Learning Tractable Probabilistic Models (LTPM), 2014.
  • Doga Kisa, Guy Van den Broeck, Arthur Choi and Adnan Darwiche. Probabilistic Sentential Decision Diagrams. In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), 2014.
  • (… and ongoing work by Tal Friedman, YooJung Choi, and Yitao Liang)

SLIDE 4

Structured Spaces

SLIDE 5–6

Running Example

Courses:

  • Logic (L)
  • Knowledge Representation (K)
  • Probability (P)
  • Artificial Intelligence (A)

Data

Constraints:

  • Must take at least one of Probability or Logic.
  • Probability is a prerequisite for AI.
  • The prerequisite for KR is either AI or Logic.

SLIDE 7–8

Structured Space

[Table: all 16 truth assignments over L, K, P, A (the unstructured space) next to the 9 assignments that survive the constraints (the structured space)]

7 out of 16 instantiations are impossible (a brute-force check follows):

  • Must take at least one of Probability (P) or Logic (L).
  • Probability is a prerequisite for AI (A).
  • The prerequisite for KR (K) is either AI or Logic.
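A minimal brute-force check of that count, with the three constraints written as Boolean tests (the encoding is ours, but it follows the slide directly):

```python
from itertools import product

# The three constraints of the running example as Boolean tests:
#   take P or L;  A implies P;  K implies (A or L)
def valid(l, k, p, a):
    return (p or l) and ((not a) or p) and ((not k) or a or l)

worlds = list(product([False, True], repeat=4))   # all 16 instantiations of L, K, P, A
impossible = [w for w in worlds if not valid(*w)]
print(len(impossible), "of", len(worlds))         # 7 of 16
```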

SLIDE 9–10

Example: Video

[Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos.]

SLIDE 11–12

Example: Robotics

[Wong, L. L., Kaelbling, L. P., & Lozano-Perez, T., Collision-free state estimation. ICRA 2012]

SLIDE 13–16

Example: Language

  • Non-local dependencies: at least one verb in each sentence
  • Sentence compression: if a modifier is kept, its subject is also kept
  • Information extraction
  • Semantic role labeling
  • … and many more!

[Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge], …, [Chang, M. W., Ratinov, L., & Roth, D. (2012). Structured learning with constrained conditional models.], [https://en.wikipedia.org/wiki/Constrained_conditional_model]

SLIDE 17–20

Example: Deep Learning

[Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476.]

SLIDE 21–24

Learning in Structured Spaces

Data + Constraints (background knowledge, physics) → Learn → ML Model (a distribution, a neural network)

Statistical ML tools don’t take constraints as input! ☹

SLIDE 25

Specification Language: Logic

SLIDE 26

Structured Probability Space

[Table: the unstructured space of 16 assignments over L, K, P, A next to the structured space of 9 valid assignments]

7 out of 16 instantiations are impossible:

  • Must take at least one of Probability or Logic.
  • Probability is a prerequisite for AI.
  • The prerequisite for KR is either AI or Logic.

SLIDE 27

Boolean Constraints

[Table: the same unstructured vs. structured assignment tables]

7 out of 16 instantiations are impossible

SLIDE 28–30

Combinatorial Objects: Rankings

10 items: 3,628,800 rankings
20 items: 2,432,902,008,176,640,000 rankings

Two example rankings:

  1 fatty tuna, 2 sea urchin, 3 salmon roe, 4 shrimp, 5 tuna, 6 squid, 7 tuna roll, 8 sea eel, 9 egg, 10 cucumber roll
  1 shrimp, 2 sea urchin, 3 salmon roe, 4 fatty tuna, 5 tuna, 6 squid, 7 tuna roll, 8 sea eel, 9 egg, 10 cucumber roll

Aij: item i at position j (n items require n² Boolean variables)

Naively, though: an item may be assigned to more than one position, and a position may contain more than one item.

SLIDE 31–34

Encoding Rankings in Logic

Aij: item i at position j

          pos 1  pos 2  pos 3  pos 4
  item 1  A11    A12    A13    A14
  item 2  A21    A22    A23    A24
  item 3  A31    A32    A33    A34
  item 4  A41    A42    A43    A44

Constraint: each item i is assigned to a unique position (n constraints).
Constraint: each position j is assigned a unique item (n constraints).
(A CNF sketch of this encoding follows.)
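A sketch of these constraints generated as CNF clauses (DIMACS-style integer literals; the variable numbering is our own illustration):

```python
from itertools import combinations

def ranking_cnf(n):
    # Variable A_ij ("item i at position j") encoded as integer i*n + j + 1
    var = lambda i, j: i * n + j + 1
    clauses = []
    for i in range(n):
        clauses.append([var(i, j) for j in range(n)])    # item i somewhere
        for j, k in combinations(range(n), 2):           # ...and only once
            clauses.append([-var(i, j), -var(i, k)])
    for j in range(n):
        clauses.append([var(i, j) for i in range(n)])    # position j filled
        for i, k in combinations(range(n), 2):           # ...by one item only
            clauses.append([-var(i, j), -var(k, j)])
    return clauses

print(len(ranking_cnf(4)))   # 56 clauses for n = 4
```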

SLIDE 35–39

Structured Space for Paths

  • cf. Nature paper

Good variable assignments (represent a route): 184
Bad variable assignments (do not represent a route): 16,777,032
Unstructured probability space: 184 + 16,777,032 = 2²⁴

Space easily encoded in logical constraints  [Nishino et al.]
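The counts check out if the map is taken to be the 4×4 grid graph with corner-to-corner routes (an assumption on our part; the slide's figure is not recoverable):

```python
# Count simple corner-to-corner paths on the 4x4 grid graph (16 nodes,
# 24 edges, hence 2^24 assignments of the edge variables).
def neighbors(v):
    r, c = v
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < 4 and 0 <= c + dc < 4:
            yield (r + dr, c + dc)

def count_paths(v, goal, visited):
    if v == goal:
        return 1
    visited.add(v)
    total = sum(count_paths(w, goal, visited)
                for w in neighbors(v) if w not in visited)
    visited.remove(v)
    return total

print(count_paths((0, 0), (3, 3), set()))   # 184 good assignments
print(2 ** 24 - 184)                        # 16777032 bad assignments
```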

SLIDE 40

Logical Circuits

SLIDE 41

Logical Circuits

[Circuit diagram: a logical circuit of AND/OR gates over the literals of L, K, P, A, encoding the course constraints]

SLIDE 42–44

Property: Decomposability

[Circuit diagram: the same logical circuit; the variables are split between each AND gate's children]

Property: AND gates have disjoint input circuits

SLIDE 45–47

Property: Determinism

[Circuit diagram: the circuit evaluated on the input below; at most one child of each OR gate is true]

Input: L, K, P, A are true and ¬L, ¬K, ¬P, ¬A are false

Property: OR gates have at most one true input wire

SLIDE 48–50

Tractable for Logical Inference

  • Is the structured space empty? (SAT)
  • Count the size of the structured space (#SAT; see the counting sketch below)
  • Check equivalence of spaces
  • Algorithms linear in circuit size  (pass up, pass down, similar to backprop)
  • Compilation by exhaustive SAT solvers
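For example, model counting is a single upward pass once the circuit is smooth, decomposable, and deterministic; a toy sketch over a node format of our own:

```python
# Count models of a smooth, decomposable, deterministic circuit:
# AND gates multiply (disjoint variables), OR gates add (disjoint models).
def model_count(node):
    kind = node[0]
    if kind == 'lit':                    # ('lit', +var) or ('lit', -var)
        return 1
    if kind == 'and':
        result = 1
        for child in node[1]:
            result *= model_count(child)
        return result
    return sum(model_count(child) for child in node[1])   # 'or'

# XOR of two variables: (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2) has exactly 2 models
xor = ('or', [('and', [('lit', 1), ('lit', -2)]),
              ('and', [('lit', -1), ('lit', 2)])])
print(model_count(xor))   # 2
```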
SLIDE 51

Semantic Loss for Deep Learning

SLIDE 52–54

Deep Structured Output Prediction

Data + Constraints (background knowledge, physics) → Learn → Deep Neural Network

Pipeline: Input → Neural Network → Output, with the output subject to a logical constraint.

SLIDE 55–64

Semantic Loss

  • Output is a probability vector p, not logic!
    How close is the output to satisfying the constraint?
  • Answer: semantic loss function L(α,p)
  • Axioms, for example:
    – If p is Boolean then L(p,p) = 0
    – If α implies β then L(α,p) ≥ L(β,p)
  • Properties:
    – If α is equivalent to β then L(α,p) = L(β,p)
    – If p is Boolean and satisfies α then L(α,p) = 0

SEMANTIC Loss!

SLIDE 65

Semantic Loss: Definition

Theorem: The axioms imply a unique semantic loss (up to a multiplicative constant). The inner product below is the probability of getting state x after flipping coins with probabilities p; the sum over x ⊨ α is the probability of satisfying α.
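Written out (this is the definition from the ICML 2018 paper in the references, up to a multiplicative constant):

```latex
L^{s}(\alpha, p) \;\propto\; -\log \sum_{\mathbf{x} \models \alpha}
    \prod_{i:\; \mathbf{x} \models X_i} p_i
    \prod_{i:\; \mathbf{x} \models \lnot X_i} (1 - p_i)
```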

SLIDE 66–72

How to Compute Semantic Loss?

  • In general: #P-hard 
  • With a logical circuit for α: linear!
  • Example: the exactly-one constraint
    (y₂ ∨ y₃ ∨ y₄) ∧ (¬y₂ ∨ ¬y₃) ∧ (¬y₃ ∨ ¬y₄) ∧ (¬y₂ ∨ ¬y₄)
    L(α,p) = L(circuit for α, p) = −log(probability mass the circuit assigns under p)
  • Why does the circuit make it easy? Decomposability and determinism!
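A brute-force rendering of that definition (exponential in the number of variables; the circuit replaces the explicit sum over states):

```python
import math
from itertools import product

def semantic_loss(constraint, p):
    """Brute-force semantic loss: -log of the probability that independently
    sampling bit i with probability p[i] yields a state satisfying the
    constraint. Exponential in len(p); the circuit makes this linear."""
    mass = 0.0
    for x in product([0, 1], repeat=len(p)):
        if constraint(x):
            prob = 1.0
            for xi, pi in zip(x, p):
                prob *= pi if xi else 1 - pi
            mass += prob
    return -math.log(mass)

exactly_one = lambda x: sum(x) == 1
print(semantic_loss(exactly_one, [0.1, 0.7, 0.2]))   # ~0.541
```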

SLIDE 73–76

Supervised Learning

  • Predict shortest paths
  • Add semantic loss to the objective, asking:
    – Is the output a path?
    – Does the output have the true edges?
    – Is the output the true path?

SLIDE 77–80

Supervised Learning

  • Predict sushi preferences (the example ranking from Slide 28)
  • Add semantic loss to the objective, asking:
    – Is the output a ranking?
    – Does the output correctly rank individual sushis?
    – Is the output the true ranking?

SLIDE 81–85

Semi-Supervised Learning

  • Unlabeled data must have some label
  • Low semantic loss with the exactly-one constraint (a training sketch follows)
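A sketch of how this can look in PyTorch, assuming one sigmoid output per class; the exactly-one semantic loss then has the closed form −log Σᵢ pᵢ Πⱼ≠ᵢ (1 − pⱼ), and the weight w below is an illustrative knob, not a value from the talk:

```python
import torch
import torch.nn.functional as F

def exactly_one_semantic_loss(logits):
    """Closed form for the exactly-one constraint:
    -log sum_i p_i * prod_{j != i} (1 - p_j), with p = sigmoid(logits)."""
    p = torch.sigmoid(logits)                  # per-class probabilities, shape (batch, k)
    log_keep = torch.log(p) - torch.log1p(-p)  # log p_i - log(1 - p_i)
    log_none = torch.log1p(-p).sum(dim=1)      # log prod_j (1 - p_j)
    log_sat = torch.logsumexp(log_none.unsqueeze(1) + log_keep, dim=1)
    return -log_sat.mean()

def total_loss(logits_labeled, labels, logits_unlabeled, w=0.05):
    # supervised cross-entropy plus weighted semantic loss on unlabeled data
    return (F.cross_entropy(logits_labeled, labels)
            + w * exactly_one_semantic_loss(logits_unlabeled))
```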
SLIDE 86–88

Experiments: MNIST, FASHION, CIFAR10

SLIDE 89–91

Semantic Loss Conclusions

  • Cares about meaning, not syntax
  • Elegant axiomatic approach
  • If you have complex output constraints: use logical circuits to enforce them
  • If you have unlabeled data (no constraints): get a lot of signal by minimizing the semantic loss of exactly-one

SLIDE 92

Probabilistic Circuits

SLIDE 93

Logical Circuits

[Circuit diagram: the logical circuit for the course constraints, revisited]

SLIDE 94

PSDD: Probabilistic SDD

[Circuit diagram: the logical circuit with a local distribution on each OR gate's wires; the parameters are 0.1 / 0.6 / 0.3 at the root and 0.6/0.4, 0.8/0.2, 0.25/0.75, 0.9/0.1 at inner gates]

SLIDE 95–96

PSDD: Probabilistic SDD

[Circuit diagram: the PSDD evaluated bottom-up on the input below]

Input: L, K, P, A are true

Pr(L,K,P,A) = 0.3 × 1 × 0.8 × 0.4 × 0.25 = 0.024
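The same kind of computation as code: a toy bottom-up evaluator; the node format and the two-variable fragment are our own illustration, not the PSDD file format:

```python
# Toy bottom-up evaluation: a literal reports whether it agrees with the
# input; an AND multiplies its children; a parameterized OR returns the
# weighted sum of its branches.
def evaluate(node, world):
    kind = node[0]
    if kind == 'lit':                    # ('lit', 'P', True)
        _, var, phase = node
        return 1.0 if world[var] == phase else 0.0
    if kind == 'and':
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, world)
        return result
    return sum(theta * evaluate(child, world) for theta, child in node[1])

# Two-variable fragment of the example: Pr(P) = 0.8; given P, Pr(A) = 0.25.
frag = ('or', [(0.8, ('and', [('lit', 'P', True),
                              ('or', [(0.25, ('lit', 'A', True)),
                                      (0.75, ('lit', 'A', False))])])),
               (0.2, ('and', [('lit', 'P', False), ('lit', 'A', False)]))])
print(evaluate(frag, {'P': True, 'A': True}))   # 0.8 * 0.25 = 0.2
```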

SLIDE 97–99

[Circuit diagram: the PSDD with its parameters]

PSDD nodes induce a normalized distribution!

Can read probabilistic independences off the circuit structure

SLIDE 100–101

Tractable for Probabilistic Inference

  • MAP inference: find the most likely assignment (otherwise NP-complete)
  • Computing conditional probabilities Pr(x|y) (otherwise PP-complete)
  • Sampling from Pr(x|y)

Algorithms linear in circuit size  (pass up, pass down, similar to backprop)
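Conditionals use the same upward pass: a variable left out of the evidence makes both of its literals report 1.0, which marginalizes it (correct when the circuit is smooth). A self-contained variant of the toy evaluator above:

```python
# Evidence may omit variables; their literals then report 1.0 (marginalization).
def evaluate(node, evidence):
    kind = node[0]
    if kind == 'lit':
        _, var, phase = node
        if var not in evidence:
            return 1.0                    # marginalize this variable
        return 1.0 if evidence[var] == phase else 0.0
    if kind == 'and':
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, evidence)
        return result
    return sum(theta * evaluate(child, evidence) for theta, child in node[1])

# Fragment: Pr(P) = 0.8; given P, Pr(A) = 0.25; without P, A is false.
frag = ('or', [(0.8, ('and', [('lit', 'P', True),
                              ('or', [(0.25, ('lit', 'A', True)),
                                      (0.75, ('lit', 'A', False))])])),
               (0.2, ('and', [('lit', 'P', False), ('lit', 'A', False)]))])

pr_ap = evaluate(frag, {'A': True, 'P': True})   # Pr(A, P) = 0.2
pr_p = evaluate(frag, {'P': True})               # Pr(P)    = 0.8
print(pr_ap / pr_p)                              # Pr(A | P) = 0.25
```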

SLIDE 102

Learning Probabilistic Circuits

SLIDE 103–106

Parameters are Interpretable

[Circuit diagram: the example PSDD with individual parameters highlighted in turn, corresponding to: the student takes course L; the student takes course P; the probability of P given L]

Explainable AI DARPA Program

SLIDE 107–111

Learning Algorithms

  • Parameter learning:
    – Closed-form maximum likelihood from complete data
    – One pass over the data to estimate Pr(x|y)
    – Not a lot to say: very easy! (A counting sketch follows.)
  • Structure learning:
    ○ Compile a logical constraint for the structured space (use SAT solver technology)
    ○ Learn structure from data by search/optimization
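A minimal sketch of the closed form, relying on the fact that a PSDD is deterministic, so each complete example is routed to exactly one wire of each OR gate; the routing function below is illustrative:

```python
from collections import Counter

# Maximum likelihood for one PSDD OR gate: its parameters are just the
# empirical fractions of complete examples routed to each branch.
def learn_or_gate(examples, branch_of):
    counts = Counter(branch_of(x) for x in examples)
    return {branch: c / len(examples) for branch, c in counts.items()}

# Toy data over (L, K, P, A); the root of the running example distinguishes
# three cases on (L, P), mirroring its three wires.
data = [(1, 1, 1, 1), (1, 0, 1, 0), (0, 1, 1, 1), (1, 1, 0, 0)]
root_branch = lambda x: ('L and P' if x[0] and x[2] else
                         'P only' if x[2] else 'L only')
print(learn_or_gate(data, root_branch))
# {'L and P': 0.5, 'P only': 0.25, 'L only': 0.25}
```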

SLIDE 112–113

Learning Preference Distributions

Special-purpose baseline: a mixture of Mallows models
  – number of components from 1 to 20
  – EM with 10 random seeds
  – implementation of Lu & Boutilier

versus a PSDD, the naive approach here: the circuit is compiled from the constraints and does not depend on the data!

SLIDE 114

Learning from Incomplete Data

  • MovieLens dataset:
    – 3,900 movies, 6,040 users, 1M ratings
    – take ratings from the 64 most-rated movies
    – ratings 1–5 converted to pairwise preferences
  • PSDD for partial rankings:
    – 4 tiers
    – 18,711 parameters

Movies by expected tier: 1 The Godfather, 2 The Usual Suspects, 3 Casablanca, 4 The Shawshank Redemption, 5 Schindler’s List, 6 One Flew Over the Cuckoo’s Nest, 7 The Godfather: Part II, 8 Monty Python and the Holy Grail, 9 Raiders of the Lost Ark, 10 Star Wars IV: A New Hope

SLIDE 115–118

Probabilistic-Logical Queries

Given top-5: 1 Star Wars V: The Empire Strikes Back, 2 Star Wars IV: A New Hope, 3 The Godfather, 4 The Shawshank Redemption, 5 The Usual Suspects

Add logical constraints:
  • no other Star Wars movie in the top-5
  • at least one comedy in the top-5

Resulting top-5: 1 Star Wars V: The Empire Strikes Back, 2 American Beauty, 3 The Godfather, 4 The Usual Suspects, 5 The Shawshank Redemption

Diversified recommendations via logical constraints

SLIDE 119

Learning Probabilistic Circuit Structure

SLIDE 120–121

Tractable Learning

Bayesian networks and Markov networks do not support linear-time exact inference.

SLIDE 122

Tractable Learning

SPNs and cutset networks (historically: polytrees, Chow-Liu trees, etc.). Both are Arithmetic Circuits (ACs).

[Darwiche, JACM 2003]

SLIDE 123

PSDDs are Arithmetic Circuits

A PSDD decision node with parameters θ₁, …, θₙ and prime–sub pairs (p₁,s₁), …, (pₙ,sₙ) corresponds to the AC node θ₁·p₁·s₁ + θ₂·p₂·s₂ + … + θₙ·pₙ·sₙ.

SLIDE 124–128

Tractable Learning

[Diagram: a trade-off between strong properties and representational freedom; DNNs, SPNs, and cutset networks are placed along it]

SLIDE 129

Variable Trees (vtrees)

[Diagram: correspondence between a PSDD and its vtree]

SLIDE 130–131

Learning Variable Trees

  • How much do variables depend on each other?
  • Learn the vtree by hierarchical clustering (see the sketch below)
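A greedy sketch of the idea, assuming empirical pairwise mutual information as the dependence measure and single-linkage merging; the published LearnVtree procedure may differ in its linkage and optimization details:

```python
import numpy as np

def leaves(cluster):
    # variables under a nested-tuple cluster
    return [cluster] if isinstance(cluster, int) else leaves(cluster[0]) + leaves(cluster[1])

def mutual_info(x, y):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = np.mean((x == a) & (y == b)) + 1e-9
            mi += pxy * np.log(pxy / ((np.mean(x == a) + 1e-9) * (np.mean(y == b) + 1e-9)))
    return mi

def learn_vtree(data):
    """Greedily merge the two clusters whose variables are most dependent;
    the merge tree (a nested tuple of column indices) is the vtree."""
    d = data.shape[1]
    mi = np.array([[mutual_info(data[:, i], data[:, j]) for j in range(d)]
                   for i in range(d)])
    clusters = list(range(d))
    while len(clusters) > 1:
        s, t = max(((s, t) for s in range(len(clusters))
                    for t in range(s + 1, len(clusters))),
                   key=lambda st: max(mi[i, j]
                                      for i in leaves(clusters[st[0]])
                                      for j in leaves(clusters[st[1]])))
        merged = (clusters[s], clusters[t])
        clusters = [c for k, c in enumerate(clusters) if k not in (s, t)] + [merged]
    return clusters[0]

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 1000)
b = a.copy(); b[:50] ^= 1                        # b is strongly dependent on a
c = rng.integers(0, 2, 1000)                     # c is independent of both
print(learn_vtree(np.column_stack([a, b, c])))   # (2, (0, 1)): a and b pair up first
```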
SLIDE 132–134

Learning Primitives

[Diagrams: the local structure-learning primitives]

Primitives maintain the PSDD properties and the structured space!

SLIDE 135–136

LearnPSDD

  1. Vtree learning
  2. Construct the most naïve PSDD
  3. LearnPSDD: search for better structure by repeatedly generating candidate operations, simulating the operations, and executing the best one
SLIDE 137–139

Experiments on 20 datasets

  • Compared with O-SPN: smaller size on 14, better log-likelihood on 11, win on both on 6
  • Compared with L-SPN: smaller size on 14, better log-likelihood on 6, win on both on 2

Comparable in performance and smaller in size.

SLIDE 140–142

Ensembles of PSDDs

EM / Bagging

SLIDE 143–144

State-of-the-Art Performance

State of the art on 6 datasets

SLIDE 145–148

What happens if you have a structured space?

Multi-valued data = exactly-one constraint:
(y₂ ∨ y₃ ∨ y₄) ∧ (¬y₂ ∨ ¬y₃) ∧ (¬y₃ ∨ ¬y₄) ∧ (¬y₂ ∨ ¬y₄)

Never omit domain constraints!

SLIDE 149

Circuit-Based Probabilistic Reasoning

SLIDE 150–151

Compilation for Inference

SLIDE 152

Ongoing Work

  • Probabilistic program inference, by compilation
  • Approximate inference, by collapsed compilation
  • Robust feature selection, by compilation [IJCAI18]
  • A powerful reasoning toolbox!
SLIDE 153

Conclusions

  • Logic is everywhere in machine learning 
  • Probabilistic circuits build on logical circuits:
    1. Tractability
    2. Semantics
    3. Natural encoding of structured spaces
  • Learning is effective:
    1. Enforcing neural network output constraints: state of the art in semi-supervised learning and complex output prediction
    2. Density estimation from constraints encoding a structured space: state of the art in learning preference distributions
    3. Density estimation from standard unstructured datasets: state of the art on standard tractable learning benchmarks

SLIDE 154

Conclusions

Statistical ML (“Probability”) + Symbolic AI (“Logic”) + Connectionism (“Deep”) → PSDD

SLIDE 155

References (the same list as Slide 3)

SLIDE 156

Questions?

[Image: a PSDD with 15,000 nodes]

LearnPSDD code: https://github.com/UCLA-StarAI/LearnPSDD
Other code online soon.