

SLIDE 1

Informative Priors for Graphical Model Structure

James Cussens, University of York jc@cs.york.ac.uk (joint work with Nicos Angelopoulos) Supported by the UK EPSRC MATHFIT programme

SLIDE 2

Use of structural priors

“The use of structural priors when learning BNs has received only little attention in the learning community.” (Langseth & Nielsen, 2003)

“The standard priors over network structures are often used not because they are particularly well-motivated, but rather because they are simple and easy to work with. In fact, the ubiquitous uniform prior over structures is far from uniform over [Markov equivalence classes].” (Friedman & Koller, 2003)

Bristol 17/10/03 1

SLIDE 3

Exploiting experts

“. . . in the context of knowledge-based systems, or indeed in any context where the primary aim of the modeling effort is to predict the future, [uniform] prior distributions are often inappropriate; one of the primary advantages of the Bayesian approach is that it provides a practical framework for harnessing all available resources including prior expert knowledge.” (Madigan et al, 1995)


SLIDE 4

The problem with experts

“Notwithstanding the preceding remarks, eliciting an informative prior distribution on model space from a domain expert is challenging.” (Madigan et al, 1995)


SLIDE 5

Hard constraints

  • Imposing a total ordering on variables or blocks
  • Limiting the number of parents
  • Banning/requiring specific edges.
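These three kinds of hard constraint can be checked mechanically against any candidate structure. A minimal sketch, assuming a DAG represented as a set of (parent, child) pairs; the function name and argument layout are our own, not from the talk:

```python
def satisfies_constraints(edges, order, max_parents, banned, required):
    """Return True iff the DAG `edges` respects all three hard constraints."""
    # 1. Total ordering: every arc must point from earlier to later in `order`.
    rank = {v: i for i, v in enumerate(order)}
    if any(rank[p] >= rank[c] for p, c in edges):
        return False
    # 2. Parent limit: no node may have more than `max_parents` parents.
    n_parents = {}
    for p, c in edges:
        n_parents[c] = n_parents.get(c, 0) + 1
    if any(n > max_parents for n in n_parents.values()):
        return False
    # 3. Banned and required specific edges.
    return not (edges & banned) and required <= edges

# Example: A -> B -> C under the ordering A, B, C, with A -> B required.
edges = {("A", "B"), ("B", "C")}
print(satisfies_constraints(edges, ["A", "B", "C"], 1, set(), {("A", "B")}))  # True
```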


SLIDE 6

Assuming link independence

pr(M) ∝ ∏_{e∈E_P} pr(e) · ∏_{e∈E_A} (1 − pr(e))

where E_P is the set of arcs present in M and E_A the set of arcs absent from it.

(Buntine, 1991; Cooper & Herskovits, 1992; Madigan and Raftery, 1994)
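As a sketch of the formula above, with a hypothetical expert assessment per arc (the arc probabilities and names below are illustrative, not from the talk):

```python
def link_independence_prior(present, edge_probs):
    """Unnormalised pr(M): product of pr(e) over arcs in M and
    (1 - pr(e)) over arcs absent from M."""
    prior = 1.0
    for e, p in edge_probs.items():
        prior *= p if e in present else (1.0 - p)
    return prior

# Expert believes A->B is likely, B->C is a coin flip, A->C is unlikely.
edge_probs = {("A", "B"): 0.9, ("B", "C"): 0.5, ("A", "C"): 0.1}
m = {("A", "B"), ("B", "C")}
print(link_independence_prior(m, edge_probs))  # 0.9 * 0.5 * 0.9 = 0.405
```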


SLIDE 7

Edit distance from prior network

Let M differ from the expert’s prior network by δ arcs; then pr(M) = c·κ^δ. (≈ Madigan and Raftery, 1994; Heckerman et al, 1995)
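A minimal sketch of this prior, counting δ as the size of the symmetric difference between the two arc sets and dropping the normalising constant c (names are ours):

```python
def edit_distance_prior(m_edges, prior_edges, kappa):
    """Unnormalised pr(M) = kappa**delta, where delta counts the arcs by
    which M differs from the expert's prior network (0 < kappa <= 1)."""
    delta = len(m_edges ^ prior_edges)  # arcs present in exactly one graph
    return kappa ** delta

prior_net = {("A", "B"), ("B", "C")}
m = {("A", "B"), ("A", "C")}            # one arc removed, one added: delta = 2
print(edit_distance_prior(m, prior_net, 0.5))  # 0.5**2 = 0.25
```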


SLIDE 8

Priors over CART trees

  • A Bayesian CART algorithm. Denison et al, Biometrika 1998
  • Bayesian CART model search. Chipman et al, JASA 1998

“Instead of specifying a closed-form expression for the tree prior, p(T), we specify p(T) implicitly by a tree-generating stochastic process. Each realization of such a process can simply be considered a random draw from this prior. Furthermore, many specifications allow for straightforward evaluation of p(T) for any T and can be effectively coupled with efficient Metropolis-Hastings search algorithms . . . ” (Denison et al)


SLIDE 9

A graphical-model-generating stochastic process

[Figure: the probability tree generated by the stochastic process. Internal choice points (with branch probabilities p1, p2, p3) branch over the variables A, B and C; the numbered leaves are candidate models, with X marking failed derivations.]


SLIDE 10

Stochastic logic programs implement model-generating stochastic processes

  1. Write a logic program which defines a set of models:

     BN is a Bayesian network if . . .

     ∀BN : bn(BN) ← digraph(BN) ∧ acyclic(BN)

     bn(BN) :- digraph(BN), acyclic(BN).

  2. Add parameters to define a distribution over models, giving a stochastic logic program (SLP).
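The idea of a model-generating stochastic process can be sketched outside Prolog too. The following is our own illustrative construction, not the talk's SLP machinery: each probabilistic choice is made independently, a sampled structure is a random draw from the implicit prior, and the draw's probability is the product of the branch probabilities taken:

```python
import random

def sample_digraph(nodes, p_edge, rng):
    """Sample a DAG over `nodes` (taken in a fixed order, so acyclicity is
    guaranteed) by one probabilistic choice per candidate arc; return the
    arc set and the probability of this particular draw under the process."""
    edges, prob = set(), 1.0
    for i, parent in enumerate(nodes):
        for child in nodes[i + 1:]:
            if rng.random() < p_edge:
                edges.add((parent, child))
                prob *= p_edge
            else:
                prob *= 1.0 - p_edge
    return edges, prob

rng = random.Random(0)
edges, prob = sample_digraph(["A", "B", "C"], 0.5, rng)
print(prob)  # 0.5**3 = 0.125: three candidate arcs, each a fair coin flip
```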


SLIDE 11

SLPs for MCMC

  • The tree gives a natural neighbourhood structure to the model space . . .
  • . . . which we exploit to construct a proposal distribution based on the prior.


SLIDE 12

The proposal mechanism

  1. Backtrack one step to the most recent choice point in the probability tree.
  2. Then probabilistically backtrack as follows: if at the top of the tree, stop; otherwise backtrack one more step to the next choice point with probability pb.
  3. Once we have stopped backtracking, choose a new leaf/model M∗ from the choice point by selecting branches according to the probabilities attached to them. However, in the first step down the tree we may not choose the branch that leads back to the current leaf/model Mi.
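The backtracking stage of this proposal (steps 1 and 2) can be sketched on its own; regrowing a new leaf with the first-branch exclusion of step 3 is omitted. All names here are ours, and the depth representation is an assumption:

```python
import random

def backtrack_depth(current_depth, p_b, rng):
    """Return the depth of the choice point from which the new model is
    regrown: always retract one choice point, then keep retracting with
    probability p_b until stopping or hitting the root (depth 0)."""
    depth = current_depth - 1          # step 1: forced single retraction
    while depth > 0 and rng.random() < p_b:
        depth -= 1                     # step 2: geometric further retraction
    return depth

rng = random.Random(1)
depths = [backtrack_depth(5, 0.5, rng) for _ in range(10_000)]
print(min(depths), max(depths))  # depths always lie in 0..4
```

Small p_b keeps proposals local (near the current leaf); p_b close to 1 makes a retreat to the root, and hence a near-independent redraw from the prior, likely.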


SLIDE 13

Bouncing around the tree

[Figure: a proposal move “bouncing around the tree”, from the current model Mi back through choice points (one node marked “not a choice point”) and down to M∗ or a fail leaf; here ni = n∗ = 2, with branch probabilities pi and p∗.]


SLIDE 14

The acceptance probability

If M∗ is a failure then α(Mi, M∗) = 0, else:

α(Mi, M∗) = min( pb^(n∗−ni) · (1 − pi)/(1 − p∗) · P(D|M∗)/P(D|Mi) , 1 )
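A direct transcription of the formula, under our reading of the slide's symbols (ni and n∗ count backtracking steps; pi and p∗ are branch probabilities for the current and proposed leaf); the function and its arguments are our own naming:

```python
def acceptance(p_b, n_i, n_star, p_i, p_star, lik_i, lik_star, failed=False):
    """Metropolis-Hastings acceptance probability for the tree proposal.
    Returns 0 when the proposed derivation M* is a failure leaf."""
    if failed:
        return 0.0
    ratio = (p_b ** (n_star - n_i)) * (1 - p_i) / (1 - p_star) * lik_star / lik_i
    return min(ratio, 1.0)

# Symmetric move: equal depths, equal branch probabilities, equal likelihoods.
print(acceptance(0.5, 2, 2, 0.25, 0.25, 1.0, 1.0))  # 1.0
# Deeper proposal with twice the likelihood: min(0.5 * 2.0, 1) = 1.0.
print(acceptance(0.5, 2, 3, 0.25, 0.25, 1.0, 2.0))  # 1.0
```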


SLIDE 15

Better mixing with a cyclic transition kernel

  • We cycle through the values pb = 1 − 2^(−n), for n = 1, . . . , 28, so that on every 28th iteration there is a high probability of backtracking all the way to the top of the tree.
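The schedule itself is one line; a quick check of its endpoints:

```python
# Cycled backtracking probabilities p_b = 1 - 2**(-n), n = 1..28: the first
# value is 0.5 (local moves), the last is within 2**-28 of 1, so a full
# retreat to the root of the tree becomes highly likely on that iteration.
schedule = [1 - 2 ** (-n) for n in range(1, 29)]
print(schedule[0], schedule[-1])
```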


SLIDE 16

It works . . . eventually!

M      p̂4     p̂5     p̂6     p
BN22   0.668  0.690  0.704  0.702
BN20   0.176  0.150  0.145  0.146
BN19   0.144  0.152  0.143  0.145
BN4    0.007  0.005  0.005  0.005
BN5    0.002  0.001  0.002  0.002
BN1    0.001  0.001  0.001  0.001
BN14
BN10   0.001
BN11

Estimated (p̂i) and actual (p) posterior probabilities for the nine most probable 3-node BNs in BNTREE. p̂i is the estimated probability after 10^i iterations.


SLIDE 17

Real evaluation

  • Generate 2295 datapoints from the Asia BN
  • 783,702,329,343 BNs in model space
  • Run MCMC for 500,000 iterations (no burn-in)
  • Runtimes: 24–55 minutes
  • 2 runs for each ‘setting’: compare observed probabilities


SLIDE 18

Real evaluation - OK results

[Scatter plot, setting “3pun”: observed model probabilities from the two runs plotted against each other, both axes from 0.1 to 1.]


SLIDE 19

Real evaluation - hmmm

[Scatter plot, setting “8pun”: observed model probabilities from the two runs plotted against each other, axes up to 1.]


SLIDE 20

Markov equivalence classes

bn(RVs,BN) :-
    skeleton(RVs,Skel),
    essential_graph(Skel,Imms,EG),   % could stop here
    bn(EG,Imms,BN),
    top_sort(BN,_).                  % check for cycles

Way too many failures!


SLIDE 21

Logic program transformation

Original program:

member(X,[X|_]).
member(X,[_|T]) :- member(X,T).

Specialised for the query ?- member(X,List), List=[_,_,_]:

member(X,[X,_,_]).
member(X,[_,X,_]).
member(X,[_,_,X]).


SLIDE 22

SLP transformation for more efficient sampling

Original SLP:

1/2 : member(X,[X|_]).
1/2 : member(X,[_|T]) :- member(X,T).

Specialised for the query ?- member(X,List), List=[_,_,_]:

4/7 : member(X,[X,_,_]).
2/7 : member(X,[_,X,_]).
1/7 : member(X,[_,_,X]).
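The transformed labels 4/7, 2/7, 1/7 come from renormalising over the successful derivations of the original SLP: with the 1/2 : 1/2 labels, a derivation for a 3-element list reaches the first, second and third positions with probabilities 1/2, 1/4 and 1/8, and fails (recursing past the end of the list) with probability 1/8, so the transformed program samples without wasting work on failures. A quick check of this arithmetic with exact rationals:

```python
from fractions import Fraction

half = Fraction(1, 2)
raw = [half ** (k + 1) for k in range(3)]   # 1/2, 1/4, 1/8: reach position k+1
total = sum(raw)                            # 7/8: probability of any success
normalised = [p / total for p in raw]       # condition on success
print(normalised)  # [Fraction(4, 7), Fraction(2, 7), Fraction(1, 7)]
```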


SLIDE 23

What about R?

  • R calls C, which calls Prolog
  • Where does the prior live? As an R object?
  • The data should eventually be an R dataframe
  • Begin with R as a ‘wrapper’.
