SLIDE 1

Sum-Product Networks

CS486/686, University of Waterloo, Lecture 23: July 19, 2017

SLIDE 2

Outline

  • SPNs in more depth

    – Relationship to Bayesian networks
    – Parameter estimation
    – Online and distributed estimation
    – Dynamic SPNs for sequence data

SLIDE 3

SPN → Bayes Net

  • 1. Normalize the SPN
  • 2. Create the bipartite BN structure
  • 3. Construct the conditional distributions

SLIDE 4

Normal SPN

An SPN is said to be normal when:

  • 1. It is complete and decomposable
  • 2. All weights are non-negative and the weights of the edges emanating from each sum node sum to 1
  • 3. Every terminal node in the SPN is a univariate distribution and the size of the scope of each sum node is at least 2
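
To make these conditions concrete, here is a minimal Python sketch (not from the lecture; the Leaf/Sum/Product classes and the is_normal check are illustrative assumptions) that tests all three:

```python
class Leaf:
    """Univariate distribution over a single variable (condition 3)."""
    def __init__(self, var):
        self.scope = {var}
        self.children = []

class Sum:
    """Weighted mixture; children must share the same scope (complete)."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights
        self.scope = set().union(*(c.scope for c in children))

class Product:
    """Factorization; children must have disjoint scopes (decomposable)."""
    def __init__(self, children):
        self.children = children
        self.scope = set().union(*(c.scope for c in children))

def is_normal(node, tol=1e-9):
    """Recursively check the three normality conditions."""
    if isinstance(node, Leaf):
        return len(node.scope) == 1
    if isinstance(node, Sum):
        ok = all(c.scope == node.scope for c in node.children)   # complete
        ok = ok and all(w >= 0 for w in node.weights)            # non-negative
        ok = ok and abs(sum(node.weights) - 1.0) < tol           # weights sum to 1
        ok = ok and len(node.scope) >= 2                         # scope size >= 2
    else:
        ok = all(a.scope.isdisjoint(b.scope)                     # decomposable
                 for i, a in enumerate(node.children)
                 for b in node.children[i + 1:])
    return ok and all(is_normal(c, tol) for c in node.children)
```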

SLIDE 5

Construct Bipartite Bayes Net

  • 1. Create an observable node for each observable variable
  • 2. Create a hidden node for each sum node
  • 3. For each variable in the scope of a sum node, add a directed edge from the hidden node associated with the sum node to the observable node associated with the variable (a sketch follows below)
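
A minimal sketch of the three steps, reusing the illustrative classes above (node names such as H_s and X_v are my own labels, not the lecture's notation):

```python
def bipartite_bn(spn_root):
    """Return the edge set of the bipartite BN: one hidden node per sum
    node, one observable node per variable (implicit in the X_v labels),
    and an edge H_s -> X_v for every variable v in the scope of sum node s."""
    edges, visited = set(), set()

    def walk(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        if isinstance(node, Sum):                      # steps 2 and 3
            for v in node.scope:
                edges.add((f"H_{id(node)}", f"X_{v}"))
        for child in node.children:
            walk(child)

    walk(spn_root)
    return edges
```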

SLIDE 6

Construct Conditional Distributions

  • 1. Hidden node: its distribution is given by the weights of the corresponding sum node
  • 2. Observable node: construct the conditional distribution in the form of an algebraic decision diagram (ADD):
    a. Extract the sub-SPN of all nodes that contain the variable in their scope
    b. Remove the product nodes
    c. Replace each sum node by its corresponding hidden variable

SLIDE 7

Some Observations

  • Deep SPNs can be converted into shallow BNs.
  • The depth of an SPN is proportional to the height of the highest algebraic decision diagram in the corresponding BN.

SLIDE 8

Conversion Facts

Thm 1: Any complete and decomposable SPN S over variables X1, ..., Xn can be converted into a BN B with ADD representation in time O(n |S|). Furthermore, S and B represent the same distribution, and |B| = O(n |S|).

Thm 2: Given any BN B with ADD representation generated from a complete and decomposable SPN S over variables X1, ..., Xn, the original SPN can be recovered by applying the variable elimination algorithm to B in time O(n |S|).

SLIDE 9

Relationships

Probability distributions:

  • Compact: space is polynomial in # of variables
  • Tractable: inference time is polynomial in # of variables

SPN = BN (equally expressive)
Compact BN ⊃ Compact SPN = Tractable SPN = Tractable BN

SLIDE 10

Parameter Estimation

  • Maximum Likelihood Estimation
  • Online Bayesian Moment Matching

SLIDE 11

Maximum Log-Likelihood

  • Objective: maximize the log-likelihood of the data with respect to the SPN weights
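
The slide's equations did not survive extraction; a standard form of the SPN maximum-likelihood objective (an assumption based on the usual formulation, with weights w and data x_1, ..., x_N) is:

```latex
\max_{w}\; \sum_{n=1}^{N} \log \Pr(x_n \mid w)
\qquad \text{where} \qquad
\Pr(x \mid w) = \frac{S_w(x)}{S_w(\mathbf{1})}
```

Here S_w(x) is the value of the root on input x, and S_w(1), the root value with all indicators set to 1, is the normalization constant (equal to 1 in a normal SPN).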

SLIDE 12

Non-Convex Optimization

s.t. the weights of each sum node are non-negative and sum to 1

  • Approximations:
    – Projected gradient descent (PGD; a sketch follows below)
    – Exponential gradient (EG)
    – Sequential monomial approximation (SMA)
    – Convex-concave procedure (CCCP = EM)
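
As one concrete instance, a minimal sketch (not the lecture's code) of a projected gradient step on a single sum node's weight vector; `grad_loglik` is a hypothetical callback returning the gradient of the log-likelihood:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}
    (the sorting-based algorithm of Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def pgd_step(w, grad_loglik, lr=0.1):
    """One projected gradient *ascent* step on the log-likelihood,
    followed by projection back onto the probability simplex."""
    return project_to_simplex(w + lr * grad_loglik(w))
```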

SLIDE 13

Summary

Algo        Update           Approximation
PGD         additive         linear
EG          multiplicative   linear
SMA         multiplicative   monomial
CCCP (EM)   multiplicative   concave lower bound

SLIDE 14

Results

SLIDE 15

Scalability

  • Online: process data sequentially, once only
  • Distributed: process subsets of data on different computers
  • Mini-batches: online PGD, online EG, online SMA, online EM
  • Problems: loss of information due to mini-batches, local optima, overfitting
  • Can we do better?

SLIDE 16

Thomas Bayes

SLIDE 17

Bayesian Learning

  • Bayes’ theorem (1764)
  • Broderick et al. (2013): facilitates

    – Online learning (streaming data)
    – Distributed computation

(Figure: data partitioned across core #1, core #2, core #3.)
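
In symbols (standard identities, not recovered from the slide), the posterior supports both modes:

```latex
% Online: the posterior after n-1 points is the prior for point n
P(\theta \mid d_{1:n}) \;\propto\; P(\theta \mid d_{1:n-1})\, P(d_n \mid \theta)

% Distributed: partial posteriors from K data shards recombine by multiplication
P(\theta \mid d_{1:n}) \;\propto\;
\frac{\prod_{k=1}^{K} P(\theta \mid d_{(k)})}{P(\theta)^{K-1}}
```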

SLIDE 18

Exact Bayesian Learning

  • Assume a normal SPN where the weights of each sum node form a discrete distribution.
  • Prior: a product of Dirichlets, one per sum node
  • Likelihood: the SPN's value at the observed data point
  • Posterior: a mixture of products of Dirichlets
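
Reconstructed in the notation of the Bayesian moment-matching formulation these slides follow (the slide's own equations did not extract), with w_s the weight vector of sum node s:

```latex
% Prior: one Dirichlet per sum node s
P(w) = \prod_{s} \mathrm{Dir}(w_s \mid \alpha_s)

% Likelihood of one observation: the root value of the normal SPN
P(x \mid w) = S_w(x)

% Posterior: prior times likelihood, which expands into a mixture of
% products of Dirichlets (one component per induced subnetwork)
P(w \mid x) \;\propto\; \Big(\prod_{s} \mathrm{Dir}(w_s \mid \alpha_s)\Big)\, S_w(x)
```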

SLIDE 19

Karl Pearson

SLIDE 20

Method of Moments (1894)

  • Estimate model parameters by matching a subset of moments (e.g., mean and variance)
  • Performance guarantees
    – Breakthrough: first provably consistent estimation algorithm for several mixture models

  • HMMs: Hsu, Kakade, Zhang (2008)
  • MoGs: Moitra, Valiant (2010), Belkin, Sinha (2010)
  • LDA: Anandkumar, Foster, Hsu, Kakade, Liu (2012)

SLIDE 21

Bayesian Moment Matching for Sum-Product Networks

  • Bayesian Learning + Method of Moments
  • Online, distributed and tractable algorithm for SPNs
  • Approximate the mixture of products of Dirichlets by a single product of Dirichlets that matches first- and second-order moments

SLIDE 22

Moments

  • Moment definition:
  • Dirichlet:
    – Moments:
    – Hyperparameters:
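
The formulas themselves did not survive extraction; the standard quantities, which match the moment-matching recipe on the following slides, are:

```latex
% j-th order moment of w_i under a distribution P
M_j(w_i) = \int w_i^{\,j}\, P(w)\, dw

% Dirichlet(w ; \alpha) moments, with \alpha_0 = \sum_k \alpha_k
M_1(w_i) = \frac{\alpha_i}{\alpha_0}
\qquad
M_2(w_i) = \frac{\alpha_i(\alpha_i + 1)}{\alpha_0(\alpha_0 + 1)}

% Hyperparameters recovered from matched moments
\alpha_0 = \frac{M_1(w_i) - M_2(w_i)}{M_2(w_i) - M_1(w_i)^2}
\qquad
\alpha_i = M_1(w_i)\,\alpha_0
```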
SLIDE 23

Moment Matching
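
The slide's worked example did not survive extraction; the following sketch (illustrative, using the formulas above) projects posterior moments onto a single Dirichlet:

```python
import numpy as np

def match_dirichlet(M1, M2):
    """Return Dirichlet hyperparameters alpha whose first moments equal M1,
    with the concentration alpha_0 chosen so the first coordinate's second
    moment matches M2[0].
    M1, M2: posterior moments E[w_i] and E[w_i^2]."""
    M1, M2 = np.asarray(M1, float), np.asarray(M2, float)
    alpha0 = (M1[0] - M2[0]) / (M2[0] - M1[0] ** 2)   # concentration
    return M1 * alpha0                                # alpha_i = M1_i * alpha_0

# Example: posterior moments of a two-valued weight vector
alpha = match_dirichlet([0.6, 0.4], [0.42, 0.22])     # -> [1.8, 1.2]
```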

SLIDE 24

Recursive moment computation

  • Compute the moments of the posterior after observing a data point by a single recursive pass over the SPN:
    – If the current node is a leaf, return the leaf value
    – Else if the current node is a product node, return the product of its children's values
    – Else if the current node is the sum node whose weight is being updated, return the weighted sum of its children's values with the moment of the target weight folded in
    – Else (any other sum node), return the weighted sum of its children's values

SLIDE 25

Results (benchmarks)

SLIDE 26

Results (Large Datasets)

(Tables report log-likelihood and training time in minutes.)

SLIDE 27

Sequence Data

  • How can we train an SPN with data sequences of varying length?
  • Examples
    – Sentence modeling: sequence of words
    – Activity recognition: sequence of measurements
    – Weather prediction: time-series data
  • Challenge: need a structure that adapts to the length of the sequence while keeping the # of parameters fixed

SLIDE 28

Dynamic SPN

  • Idea: stack template networks with identical structure and parameters (see the sketch below)
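
A toy sketch of the stacking idea, reusing the illustrative Leaf/Sum/Product classes from the earlier sketch (a real DSPN template also has interface nodes, omitted here); the key point is that every slice shares the same weight vector:

```python
shared_w = [0.5, 0.5]          # template parameters, shared by every slice

def template_step(prev_root, t):
    """One slice: a mixture of two univariate leaves over slice t's
    variable, multiplied into the network built so far."""
    obs = Sum([Leaf(f"X{t}"), Leaf(f"X{t}")], shared_w)  # complete: same scope
    return Product([prev_root, obs])                     # decomposable: new var

def build_dspn(T):
    """Unroll the template T times; the parameter count stays fixed."""
    root = Sum([Leaf("X0"), Leaf("X0")], shared_w)       # bottom network
    for t in range(1, T):
        root = template_step(root, t)
    return root                                          # (top network omitted)
```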

SLIDE 29

Definitions

  • Dynamic Sum-Product Network: a bottom network, a stack of template networks, and a top network
  • Bottom network: directed acyclic graph with indicator leaves and roots that interface with the network above
  • Top network: rooted directed acyclic graph with leaves that interface with the network below
  • Template network: directed acyclic graph with roots that interface with the network above, indicator leaves, and additional leaves that interface with the network below

SLIDE 30

Invariance

Let f be a bijective mapping that associates the input interface nodes of a template network with their corresponding output interface nodes.

Invariance: a template network over variables X is invariant when the scope of each input interface node excludes X and, for all pairs of interface nodes i and j, the following properties hold:

  • The scopes of i and j are either identical or disjoint, and scope(i) = scope(j) if and only if scope(f(i)) = scope(f(j))
  • All interior and output sum nodes are complete
  • All interior and output product nodes are decomposable

SLIDE 31

Completeness and Decomposability

Theorem 1: If

  • a. the bottom network is complete and decomposable,
  • b. the scopes of all pairs of output interface nodes of the bottom network are either identical or disjoint,
  • c. the scopes of the output interface nodes of the bottom network can be used to assign scopes to the input interface nodes of the template and top networks in such a way that the template network is invariant and the top network is complete and decomposable,

then the DSPN is complete and decomposable.

SLIDE 32

Structure Learning

Anytime search-and-score framework (see the sketch below)
  • Input: data, variables
  • Output: a DSPN structure
  • Repeat: generate neighbour structures and keep the best-scoring one
  • Until a stopping criterion is met
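
A minimal sketch of the anytime loop under these assumptions (`neighbours` and `score` are hypothetical callbacks, e.g. slide 34's neighbour move and held-out log-likelihood):

```python
def search_and_score(data, init, neighbours, score, max_iters=100):
    """Anytime hill climbing: the best structure found so far is always
    available, so the search can be stopped after any iteration."""
    best, best_score = init, score(init, data)
    for _ in range(max_iters):                  # stopping criterion
        improved = False
        for cand in neighbours(best):           # candidate structures
            s = score(cand, data)
            if s > best_score:                  # greedy improvement
                best, best_score, improved = cand, s, True
        if not improved:                        # local optimum: stop early
            break
    return best
```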

SLIDE 33

Initial Structure

  • Factorized model of univariate distributions

SLIDE 34

Neighbour generation

  • Replace the sub-SPN rooted at a product node by a product of Naïve Bayes models

SLIDE 35

Results

SLIDE 36

Results

SLIDE 37

Conclusion

  • Sum-Product Networks

    – Deep architecture with clear semantics
    – Tractable probabilistic graphical model

  • Future work

– Decision SPNs: M. Melibari and P. Doshi

  • Open problem:

– Thorough comparison of SPNs to other deep networks
