SCORE EQUIVALENCE & POLYHEDRAL APPROACHES TO LEARNING BAYESIAN NETWORKS


SLIDE 1

SCORE EQUIVALENCE & POLYHEDRAL APPROACHES TO LEARNING BAYESIAN NETWORKS

David Haws*, James Cussens, Milan Studeny

* IBM Watson, dchaws@gmail.com

University of York, Deramore Lane, York, YO10 5GE, UK

The Institute of Information Theory and Automation of the CAS, Pod Vodárenskou věží 4, Prague, 182 08, Czech Republic

SLIDE 2

Definition

  • A Bayesian network is a graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG).
  • Nodes → random variables; edges → conditional dependencies.
  • A node (RV) is conditionally independent of its non-descendants, given the state of all its parents; equivalently, a node (RV) is conditionally independent of all other nodes, given its Markov blanket.
  • Variables X, Y are conditionally independent (CI) given Z if

Pr(X, Y | Z) = Pr(X | Z) Pr(Y | Z).
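As a quick sanity check of this factorization, here is a minimal Python illustration (my addition, not from the slides): a toy joint distribution over three binary variables is built to factorize through Z, and the CI identity is verified numerically.

```python
import itertools

# Toy joint distribution over binary X, Y, Z, constructed so that
# X and Y are conditionally independent given Z:
#   Pr(X, Y, Z) = Pr(Z) * Pr(X | Z) * Pr(Y | Z)
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # p_y_given_z[z][y]

def joint(x, y, z):
    return p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]

# Check Pr(X, Y | Z) = Pr(X | Z) * Pr(Y | Z) for every assignment.
for x, y, z in itertools.product([0, 1], repeat=3):
    pr_xy_given_z = joint(x, y, z) / p_z[z]
    assert abs(pr_xy_given_z - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
print("X and Y are conditionally independent given Z")
```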

SLIDE 3

Examples

SLIDE 4

Learning Bayesian Network

Given a set of variables and observations of them, learn the "best" Bayesian network.

How do we find the right DAG? Scoring criteria!

Finding the optimal DAG is NP-hard.

SLIDE 5

Scoring Criteria

  • A scoring function Q(G, D) evaluates how well a DAG explains the data. Roughly: the likelihood of the graph given the data, plus a penalty for complex graphs.
  • We will only consider score equivalent and decomposable scoring functions:
  • Bayesian Information Criterion (BIC),
  • Bayesian Dirichlet Equivalent (BDE).
  • Score equivalent: the scores of two Markov equivalent graphs are the same.

WARNING: Two different DAGs may represent the same probability model! If so, they are called Markov equivalent.

SLIDE 6

Score Decomposable

A scoring function is score decomposable if there exists a set of functions (local scores)

q_{i|B} : DATA(\{i\} \cup B, d) \to \mathbb{R}

such that DAGs are scored by summing the local score of each node and its parents:

Q(G, D) = \sum_{i \in N} q_{i|pa_G(i)}\left(D_{\{i\} \cup pa_G(i)}\right)

Here q_{i|pa_G(i)} is the local score, pa_G(i) denotes the parents of node i in graph G, and the sum runs over the random variables / nodes.
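To make the decomposition concrete, here is a minimal Python sketch (my addition, not the authors' code) of a BIC-style decomposable score. The data layout (rows of tuples, with `arity[i]` the number of states of variable i) is an assumption for illustration; BDE would differ only in the local formula.

```python
import math
from collections import Counter

def local_bic(i, parents, data, arity):
    """Local score q_{i|B}: maximized log-likelihood of column i given
    its parent columns, minus a BIC complexity penalty."""
    m = len(data)
    counts = Counter()          # (parent_config, x_i) -> count
    parent_counts = Counter()   # parent_config -> count
    for row in data:
        cfg = tuple(row[j] for j in parents)
        counts[(cfg, row[i])] += 1
        parent_counts[cfg] += 1
    loglik = sum(n * math.log(n / parent_counts[cfg])
                 for (cfg, _), n in counts.items())
    n_params = (arity[i] - 1) * math.prod(arity[j] for j in parents)
    return loglik - 0.5 * math.log(m) * n_params

def score(parent_map, data, arity):
    """Q(G, D) = sum over nodes i of q_{i|pa_G(i)}: the whole DAG score
    is just the sum of local scores, as on the slide."""
    return sum(local_bic(i, sorted(pa), data, arity)
               for i, pa in parent_map.items())
```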

SLIDE 7

Family Variable Representation

Given a DAG G over variables N, record each node's parent set: the family-variable vector η(G) has one 0/1 component for every node i ∈ N and candidate parent set B ⊆ N \ {i}, with

η_{i|B}(G) = 1 iff pa_G(i) = B.

Two graphs can be Markov equivalent yet have different DAG representations!
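A minimal Python sketch of this encoding (illustrative only; names like `parent_map` are my own). The final assertion shows the point of the slide: two Markov equivalent DAGs get different family-variable vectors.

```python
from itertools import combinations

def family_variable_vector(parent_map, nodes):
    """Encode a DAG (node -> frozenset of parents) as its 0/1
    family-variable vector: eta[(i, B)] = 1 iff pa_G(i) == B."""
    eta = {}
    for i in nodes:
        others = [j for j in nodes if j != i]
        for r in range(len(others) + 1):
            for B in combinations(others, r):
                eta[(i, frozenset(B))] = int(frozenset(B) == parent_map[i])
    return eta

# Two Markov equivalent DAGs over N = {0, 1}:  0 -> 1  and  1 -> 0.
g1 = {0: frozenset(), 1: frozenset({0})}
g2 = {0: frozenset({1}), 1: frozenset()}
assert family_variable_vector(g1, [0, 1]) != family_variable_vector(g2, [0, 1])
```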

SLIDE 8

Family Variable Polytope (FVP)

  • Vertex ßà DAG
  • Dimension = N(2(N-1) -1)
  • Facet description for N=1,2,3,4
  • No facet description N > 4
  • Some facets known N > 4
  • Simple IP relaxation
  • IP solution gives DAG


SLIDE 9

Characteristic Imset Representation

Goal: a unique vector representation of Markov equivalence classes of DAGs (Studeny). Notation: suppose N random variables; we index the components of the characteristic imset c by subsets S ⊆ N with |S| ≥ 2.
SLIDE 10

Characteristic Imset Representation

Given a DAG G over variables N, one has, for any S ⊆ N with |S| ≥ 2,

c_G(S) = 1 iff there exists i ∈ S with S \ {i} ⊆ pa_G(i),

where pa_G(i) denotes the parents of node i in G. Moreover, c_G(S) ∈ {0, 1}.

Theorem (Studeny, Hemmecke, Lindner 2011): characteristic imsets are in one-to-one correspondence with Markov equivalence classes of DAGs.
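A small Python sketch of this definition (my addition). In contrast to the family-variable encoding of SLIDE 7, the two Markov equivalent DAGs now map to the same vector, matching the theorem.

```python
from itertools import combinations

def characteristic_imset(parent_map, nodes):
    """c_G(S) = 1 iff some i in S has S \\ {i} contained in pa_G(i),
    for every S subset of N with |S| >= 2."""
    c = {}
    for r in range(2, len(nodes) + 1):
        for S in combinations(nodes, r):
            c[frozenset(S)] = int(any(
                (set(S) - {i}) <= parent_map[i] for i in S))
    return c

# The Markov equivalent DAGs 0 -> 1 and 1 -> 0 share one imset.
g1 = {0: frozenset(), 1: frozenset({0})}
g2 = {0: frozenset({1}), 1: frozenset()}
assert characteristic_imset(g1, [0, 1]) == characteristic_imset(g2, [0, 1])
```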

SLIDE 11

Characteristic Imset Polytope (CIP)

  • Vertex ßà Markov Eq. Class
  • Dimension = 2N – N – 1
  • Facet description for N=1,2,3,4
  • No facet description N > 4
  • Some facets known N > 4
  • Complex IP relaxation
  • IP solution gives eq. class


SLIDE 12

Geometric Approach to Learning BN

Every reasonable scoring function (BIC, BDE, …) is an affine function of the family-variable vector or the characteristic imset, so integer and linear programming techniques can be applied!

(Linear relaxations combined with row generation and branch-and-cut.)

Practical ILP methods & software exist based on both the FVP and the CIP.
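As an illustrative sketch of the FVP-side ILP (assuming the third-party PuLP library is available; the local scores here are random stand-ins, not real BIC/BDE values from data), one can state the affine objective, the convexity constraints, and the cluster inequalities of SLIDE 15 directly. For this toy size all cluster inequalities are enumerated up front; real solvers add them as cutting planes.

```python
import random
from itertools import chain, combinations
from pulp import LpMaximize, LpProblem, LpVariable, lpSum

nodes = [0, 1, 2]

def subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Hypothetical local scores q[(i, B)] (in practice: BIC/BDE from data).
random.seed(0)
q = {(i, frozenset(B)): random.gauss(0, 1)
     for i in nodes for B in subsets([j for j in nodes if j != i])}

prob = LpProblem("learn_bn", LpMaximize)
eta = {(i, B): LpVariable("eta_%d_%s" % (i, "".join(map(str, sorted(B)))),
                          cat="Binary")
       for (i, B) in q}
prob += lpSum(q[k] * eta[k] for k in q)       # affine objective <q, eta>

for i in nodes:                                # convexity: one parent set each
    prob += lpSum(v for (j, B), v in eta.items() if j == i) == 1

for r in range(2, len(nodes) + 1):             # cluster inequalities (k = 1)
    for C in combinations(nodes, r):
        prob += lpSum(v for (i, B), v in eta.items()
                      if i in C and B & set(C)) <= len(C) - 1

prob.solve()
print({i: sorted(B) for (i, B), v in eta.items() if v.value() == 1})
```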

SLIDE 13

FVP: Some Known Faces

  • Order faces (Cussens et al.): DAGs consistent with a total order.
  • Sink faces (Cussens et al.): DAGs with sink j.
  • Non-negativity inequalities on family variables (H., Cussens, Studeny).
  • Modified convexity constraints (H., Cussens, Studeny): node i has at most one parent set.
  • Generalized cluster inequalities (H., Cussens, Studeny), (Cussens et al.): coming up…
  • Connected matroids on C ⊆ N, |C| ≥ 2 (Studeny).
  • Extreme supermodular set functions (H., Cussens, Studeny): coming up…

SLIDE 14

Super-modular Set Functions

  • The set of supermodular vectors is a polyhedral cone.
  • The cone is pointed ⇒ it has finitely many extreme rays.
  • Extreme rays correspond to faces of the FVP.
  • Remark: the rank functions of connected matroids are extreme rays of the cone of non-decreasing submodular functions (the mirror images of supermodular functions); see the definition below.
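For reference, the standard definition of supermodularity (not spelled out on the slide):

```latex
% Supermodularity: a set function m : 2^N -> R is supermodular when
m(A \cup B) + m(A \cap B) \;\ge\; m(A) + m(B)
\qquad \text{for all } A, B \subseteq N .
% Submodular functions satisfy the reversed inequality, so m is
% supermodular iff -m is submodular (the "mirror image" above).
```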

SLIDE 15

Cluster Inequalities

  • For every cluster C ⊆ N with |C| ≥ 2, the cluster inequality is

\sum_{i \in C} \; \sum_{B : B \cap C \neq \emptyset} \eta_{i|B} \;\le\; |C| - 1.

  • Why? Not all nodes in the cluster C can have a parent which is also in the cluster C.

SLIDE 16

Generalized Cluster Inequalities

  • For every cluster C ⊆ N with |C| ≥ 2 and every 1 ≤ k ≤ |C| − 1, the generalized cluster inequality is

\sum_{i \in C} \; \sum_{B : |B \cap C| \ge k} \eta_{i|B} \;\le\; |C| - k.

  • Why? For any DAG G the induced subgraph G_C is acyclic, and the first k nodes of C in a total order consistent with G_C have at most k − 1 parents in C; hence at most |C| − k nodes of C can have k or more parents inside C. (k = 1 recovers the plain cluster inequality; see the check below.)
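A small Python check of this reasoning (under my reading of the inequality above; the example DAG is my own): the family-variable vector of a DAG satisfies every generalized cluster inequality.

```python
from itertools import combinations

def satisfies_generalized_cluster(parent_map, C, k):
    """Check sum_{i in C} sum_{B : |B n C| >= k} eta_{i|B} <= |C| - k,
    i.e. at most |C| - k nodes of C have >= k parents inside C."""
    heavy = sum(1 for i in C if len(parent_map[i] & set(C)) >= k)
    return heavy <= len(C) - k

# Example DAG on {0,1,2,3}: 0 -> 1, {0,1} -> 2, {0,1,2} -> 3.
g = {0: frozenset(), 1: frozenset({0}),
     2: frozenset({0, 1}), 3: frozenset({0, 1, 2})}
for r in range(2, 5):
    for C in combinations(range(4), r):
        for k in range(1, len(C)):   # k = 1 is the plain cluster inequality
            assert satisfies_generalized_cluster(g, C, k)
```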

SLIDE 17

Connecting FVP and CIP

  • An objective o is score equivalent (SE) if ⟨o, η(G)⟩ = ⟨o, η(H)⟩ whenever G and H are Markov equivalent.
  • There is a linear map between the family-variable and characteristic-imset representations (Studeny, Haws); see the sketch below.
  • A face F is score equivalent if there exists an SE objective defining F.
  • A face F is weakly equivalent (WE) if the set of DAGs whose family-variable vectors lie in F is closed under Markov equivalence.
  • BIC & BDE always give SE objectives!
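In family-variable coordinates the linear map of (Studeny, Haws) sends η(G) to c_G via c(S) = Σ_{i ∈ S} Σ_{B ⊇ S\{i}} η_{i|B}; a small Python sketch (helper names are my own):

```python
from itertools import chain, combinations

def subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def imset_from_eta(eta, nodes):
    """Linear map family variables -> characteristic imset:
    c(S) = sum_{i in S} sum_{B : B >= S\\{i}} eta[(i, B)]."""
    c = {}
    for r in range(2, len(nodes) + 1):
        for S in combinations(nodes, r):
            c[frozenset(S)] = sum(v for (i, B), v in eta.items()
                                  if i in S and (set(S) - {i}) <= B)
    return c

# eta for the DAG 0 -> 1 over N = {0, 1}; its image is the imset c({0,1}) = 1.
nodes = [0, 1]
pa = {0: frozenset(), 1: frozenset({0})}
eta = {(i, frozenset(B)): int(frozenset(B) == pa[i])
       for i in nodes for B in subsets([j for j in nodes if j != i])}
assert imset_from_eta(eta, nodes) == {frozenset({0, 1}): 1}
```

Because the map is affine, an SE objective on the CIP pulls back to an SE objective on the FVP, which is what lets the two polytopes be connected at all.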

SLIDE 18

Score Equivalence, FVP , & CIP

  • Theorem [H, Cussens, Studeny]: The following conditions are equivalent for a facet of the FVP, where S denotes the set of DAGs whose family-variable vectors lie on the facet:

a) S is closed under Markov equivalence;

b) S contains the whole equivalence class of full graphs;

c) S is SE.

  • Corollary [H, Cussens, Studeny]

There is a 1-1 correspondence between SE faces of FVP and faces of CIP which preserves inclusion.

  • Corollary [H, Cussens, Studeny]

SE facets of the FVP correspond to those facets of the CIP that contain the 1-imset. None of those facets of CIP include the 0-imset. Moreover, these facets correspond to extreme supermodular functions.

SLIDE 19

Sufficiency of Score Equivalent Faces

  • Learning BN structure = optimizing an SE objective over the FVP.
  • Are the SE faces of the FVP sufficient? Yes!
  • Theorem [H, Cussens, Studeny]: Let o be an SE objective. Then the LP problem of maximizing o over the FVP has the same optimal value as the LP problem of maximizing the same function over the polyhedron defined by:
  • the SE faces of the FVP corresponding to facets of the CIP not containing the 0-imset, and
  • the non-negativity and modified convexity constraints.
  • This is not true for SE facets alone: one must use all SE faces.
SLIDE 20

Open Conjecture…something to think on

Theorem (H, Cussens, Studeny): Every weakly equivalent facet of the family-variable polytope is a score equivalent facet.

Conjecture (Haws, Cussens, Studeny): Every weakly equivalent face of the family-variable polytope is a score equivalent face.

We believe the conjecture is false, but any counterexample must have N ≥ 4; extensive searches in N = 4, 5 have not found one.

SLIDE 21

THANK YOU!

Polyhedral Approaches to Learning Bayesian Networks, David Haws, Milan Studeny, James Cussens, to appear in a book from the "Special Session on Algebraic and Geometric Methods in Applied Discrete Mathematics", 2015.

David Haws*, Milan Studeny, James Cussens
* IBM Watson, dchaws@gmail.com