SLIDE 1

Efficient Weight Learning for Markov Logic Networks

Speaker: Manuel Noll

Advisor: Maximilian Dylla

SLIDE 2

What is this talk about?

Well, of course, about weight learning. A weight represents the relative strength or importance of a rule (or constraint).

$\arg\max_w P_w(X=x) = \frac{1}{Z}\exp\left(\sum_i w_i\, n_i(x)\right)$

We want to find the best-fitting weights w for our world x. This is a convex optimization problem. Stated differently, we have some observations x and want to maximize our model, viewed as a likelihood function of the parameter w. A gradient-based method solves this problem.

SLIDE 3

[Diagram: Bird = {Nightingale, Sparrow, Penguin}; Able to fly = {Nightingale, Sparrow}; Eats fish = {Penguin, Bear}]

Getting a better understanding of

$P_w(X=x) = \frac{1}{Z}\exp\left(\sum_i w_i\, n_i(x)\right)$

$n_i(x)$: the number of times the i-th formula is satisfied by the state of the world x.

Example

Number 1, weight 1: ∀x: ¬AbleToFly(x) ⇒ EatsFish(x)

Number 2, weight 5: ∀x: Bird(x) ⇒ AbleToFly(x)

Vector translation: x = (1,1,1,0, 1,1,0,0, 0,0,1,1). Constant symbols: C = (Nightingale, Sparrow, Penguin, Bear). We make a closed-world assumption.

SLIDE 4

[Diagram: Bird = {Nightingale, Sparrow, Penguin}; Able to fly = {Nightingale, Sparrow}; Eats fish = {Penguin, Bear}]

Getting a better understanding of

$P_w(X=x) = \frac{1}{Z}\exp\left(\sum_i w_i\, n_i(x)\right)$

$n_i(x)$: the number of times the i-th formula is satisfied by the state of the world x.

Example

Number 1, weight 1: ∀x: ¬AbleToFly(x) ⇒ EatsFish(x), satisfied by Nightingale, Sparrow

Number 2, weight 5: ∀x: Bird(x) ⇒ AbleToFly(x), satisfied by Bear, Penguin
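As a sanity check, here is a minimal Python sketch of this counting for the toy example above (the helper names, the set representation and the printout are illustrative assumptions, not part of the slides):

```python
import math

# Toy domain from the slides, represented under the closed-world assumption:
# a predicate is true only for the constants listed here.
constants = ["Nightingale", "Sparrow", "Penguin", "Bear"]
bird        = {"Nightingale", "Sparrow", "Penguin"}
able_to_fly = {"Nightingale", "Sparrow"}
eats_fish   = {"Penguin", "Bear"}

# n_i(x): number of groundings of formula i that are satisfied in world x.
def n1():  # forall c: not AbleToFly(c) => EatsFish(c)
    return sum((c in able_to_fly) or (c in eats_fish) for c in constants)

def n2():  # forall c: Bird(c) => AbleToFly(c)
    return sum((c not in bird) or (c in able_to_fly) for c in constants)

weights = [1.0, 5.0]               # weights from the example table
counts  = [n1(), n2()]
score   = math.exp(sum(w * n for w, n in zip(weights, counts)))
print(counts, score)               # P_w(X = x) is score / Z
```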

SLIDE 5

Pw X =x= 1 Z exp∑i winix

This can be seen as the probability that the given world is an interpretation of our model. When x is fixed, the model depends on w and is thus a likelihood function of w. Given a prior distribution p(w), we can compute the posterior distribution of w:

Pw∣X =x= Pw X =x pw

∑x ' Pw X =x '

Maximizing this posterior is the so-called maximum a posteriori (MAP) estimation of w.

SLIDE 6

Maximization of the Log-Likelihood function

$\arg\max_w P_w(X=x) = \frac{1}{Z}\exp\left(\sum_i w_i\, n_i(x)\right), \qquad Z = \sum_{x'} \exp\left(\sum_i w_i\, n_i(x')\right)$

We want to fit our model to our world x; therefore we have to optimize the weights.

$\arg\max_w \log P_w(X=x) = \arg\max_w \left[\sum_i w_i\, n_i(x) - \log \sum_{x'} \exp\left(\sum_i w_i\, n_i(x')\right)\right]$

The logarithm is easier to maximize, and its gradient is

$\Rightarrow\; \frac{\partial}{\partial w_i}\log P_w(X=x) = n_i(x) - \sum_{x'} P_w(X=x')\, n_i(x') = n_i(x) - E_w[n_i(x)]$
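To make the gradient concrete, here is a brute-force sketch that enumerates every possible world, which is exactly why the expectation is intractable for realistic domains (`clause_counts` is a hypothetical helper returning the vector of counts $n_i$):

```python
import itertools
import math

def log_likelihood_gradient(weights, clause_counts, num_ground_atoms, observed_world):
    """Gradient n_i(x) - E_w[n_i(x)], computed by enumerating all 2^num_ground_atoms worlds.

    clause_counts(world) must return the vector (n_1(world), ..., n_k(world)).
    """
    worlds = list(itertools.product([0, 1], repeat=num_ground_atoms))
    # Unnormalized weight of each world: exp(sum_i w_i n_i(x'))
    scores = [math.exp(sum(w * n for w, n in zip(weights, clause_counts(x)))) for x in worlds]
    Z = sum(scores)
    # E_w[n_i(x)] = sum_x' P_w(X=x') n_i(x')
    expected = [sum(s / Z * clause_counts(x)[i] for s, x in zip(scores, worlds))
                for i in range(len(weights))]
    observed = clause_counts(observed_world)
    return [n_obs - n_exp for n_obs, n_exp in zip(observed, expected)]
```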

SLIDE 7

A conditional likelihood approach

In many applications, we know a priori which predicates will be evidence and which ones will be queried. The goal is to correctly predict the latter given the former. Partition the ground atoms in the domain into a set of evidence atoms X and a set of query atoms Y.

PwY =y∣X =x= 1 Z x exp∑i∈F Y winix , y

The conditional likelihood of Y given X is

F Y :the set of all MLN clauses with at least one grounding involving a query atom

SLIDE 8

A convex Problem

argmaxw PwY = y∣X =x= 1 Z x exp∑i∈F Y wi nix , y

⇒ ∂ ∂ wi log PwY =y∣X =x=ni x , y−∑y ' PwY =y'∣X =xnix , y'  =ni x , y−Ew[nix , y] The log-likelihood function is a convex function.

$w \in [0, \infty)$

Since this is a convex set, we have a convex optimization problem. For a convex optimization problem it holds that if a local optimum exists, then it is a global optimum.

SLIDE 9

Gradient descent / ascent

Let f be a differentiable function; then the iteration

$x_n = x_{n-1} + \eta\, \nabla f(x_{n-1})$

(with learning rate $\eta$) is the so-called gradient ascent method. In our case

$w_{i,t} = w_{i,t-1} + \eta\, \frac{\partial}{\partial w_i}\log P_w(Y=y \mid X=x)\Big|_{w_{t-1}} = w_{i,t-1} + \eta\, \big(n_i(x, y) - E_w[n_i(x, y)]\big)$

Ew[nix , y]

But how can we compute which is still intractable?
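Ignoring the intractability for a moment, the update above amounts to a simple loop like the following sketch (`expected_counts` is a hypothetical stand-in for whatever approximation of $E_w[n_i(x, y)]$ is used):

```python
def gradient_ascent(observed_counts, expected_counts, num_weights, eta=0.01, steps=100):
    """observed_counts: vector n_i(x, y); expected_counts(w): approximation of E_w[n_i(x, y)]."""
    w = [0.0] * num_weights
    for _ in range(steps):
        expected = expected_counts(w)
        # w_i <- w_i + eta * (n_i(x, y) - E_w[n_i(x, y)])
        w = [wi + eta * (n - e) for wi, n, e in zip(w, observed_counts, expected)]
    return w
```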

SLIDE 10

MAP with MaxWalkSat

$E_w[n_i(x, y)]$ can be approximated by the counts $n_i(x, y^*_w(x))$ in the MAP state $y^*_w(x)$. This will be a good approximation if most of the probability mass of $P_w(Y=y \mid X=x)$ is concentrated around $y^*_w(x)$.

The form

$\frac{1}{Z_x}\exp\left(\sum_{i \in F_Y} w_i\, n_i(x, y)\right)$

suggests the solution: $y^*_w(x)$ is the state that maximizes the sum of the weights of the satisfied ground clauses.

This weighted satisfiability problem is solved by MaxWalkSat.
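A rough sketch of the MaxWalkSat idea, stochastic local search over truth assignments: the clause representation, the default parameters and the move selection below are illustrative assumptions, not the exact algorithm from the slides.

```python
import random

def max_walk_sat(clauses, num_atoms, max_flips=10000, p_random=0.5):
    """clauses: list of (weight, literals); a literal is (atom_index, is_positive).

    Stochastic local search that tries to maximize the total weight of satisfied clauses."""
    def satisfied(clause, state):
        _, literals = clause
        return any(state[atom] == positive for atom, positive in literals)

    def total_weight(state):
        return sum(weight for weight, lits in clauses if satisfied((weight, lits), state))

    state = [random.random() < 0.5 for _ in range(num_atoms)]
    best, best_weight = list(state), total_weight(state)
    for _ in range(max_flips):
        unsatisfied = [c for c in clauses if not satisfied(c, state)]
        if not unsatisfied:
            break
        _, literals = random.choice(unsatisfied)
        if random.random() < p_random:
            atom = random.choice(literals)[0]          # random-walk move
        else:                                          # greedy move: best flip within the clause
            def weight_if_flipped(a):
                state[a] = not state[a]
                w = total_weight(state)
                state[a] = not state[a]
                return w
            atom = max((a for a, _ in literals), key=weight_if_flipped)
        state[atom] = not state[atom]
        if total_weight(state) > best_weight:
            best, best_weight = list(state), total_weight(state)
    return best
```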

SLIDE 11

Voted Perceptron Algorithm

1. Initialize $w_i \sim N(0, 1)$
2. For t = 0 to T:
   - Approximate $y^*_w(x)$ by MaxWalkSat
   - Count $n_i(x, y^*_w(x))$
   - Update $w_{i,t} = w_{i,t-1} + \eta\,\big(n_i(x, y) - E^*_w[n_i(x, y)]\big)$, where $E^*_w[n_i(x, y)] = n_i(x, y^*_w(x))$
3. Return $w = \frac{1}{T}\sum_{t=1}^{T} w_t$
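A sketch of this loop in Python, assuming `map_state` wraps a MaxWalkSat call and `counts` returns the clause-count vector $n_i$ (both hypothetical helpers, as is the learning rate):

```python
import random

def voted_perceptron(x, y, num_weights, map_state, counts, eta=1.0, T=100):
    """map_state(w, x): MAP query assignment y*_w(x) under weights w (e.g. via MaxWalkSat).
    counts(x, y): vector of clause counts n_i(x, y)."""
    w = [random.gauss(0.0, 1.0) for _ in range(num_weights)]   # w_i ~ N(0, 1)
    history = []
    observed = counts(x, y)
    for _ in range(T):
        y_star = map_state(w, x)                 # approximate E_w[n_i] by the MAP-state counts
        approx = counts(x, y_star)
        w = [wi + eta * (n - m) for wi, n, m in zip(w, observed, approx)]
        history.append(list(w))
    # Return the average of the weight vectors over all iterations.
    return [sum(ws) / len(history) for ws in zip(*history)]
```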

SLIDE 12

Markov Chain Monte Carlo (MCMC)

Ew[nix , y]

can be estimated via the Markov Chain Monte Carlo method The Markov Chain has to generate values whose asymptotic distribution follows the distribution of

y1,..., yt Pw X =x∣Y = y

Taking t samples from the MC and averaging shall gives us a good approxmiation for Ew[nix , y]

SLIDE 13

Markov Chain Monte Carlo (MCMC)

Ew[nix , y]

Estimate by a MCMC as follows Let For i=1 to T For each y in Y Compute Sample from Set y = in Samples are taken from

W

0=X ∪Y

W

i=W i−1

P y∣Y ∖{y}

y

t

P y∣Y ∖{y}

y

t

W

i

W

0,...,W T
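A Gibbs-sampling sketch of this procedure; the single-atom conditional `p_true` and the state representation are hypothetical helpers, not defined on the slides:

```python
import random

def gibbs_estimate_counts(x, query_atoms, p_true, counts, num_samples=1000):
    """Estimate E_w[n_i(x, y)] by Gibbs sampling over the query atoms.

    p_true(atom, state): conditional probability of the atom being true given all others.
    counts(state): clause-count vector n_i for the full state (evidence + query atoms).
    """
    state = dict(x)                                                  # evidence atoms ...
    state.update({a: random.random() < 0.5 for a in query_atoms})    # ... plus a random query part
    totals = None
    for _ in range(num_samples):
        for atom in query_atoms:                 # resample each query atom in turn
            state[atom] = random.random() < p_true(atom, state)
        sample_counts = counts(state)
        totals = sample_counts if totals is None else [t + c for t, c in zip(totals, sample_counts)]
    return [t / num_samples for t in totals]     # averaged counts approximate E_w[n_i(x, y)]
```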

SLIDE 14

Contrastive Divergence

1. Initialize $w_i \sim N(0, 1)$
2. For t = 0 to T:
   - Approximate $E_w[n_i(x, y)]$ by MCMC
   - Update $w_{i,t} = w_{i,t-1} + \eta\,\big(n_i(x, y) - E^*_w[n_i(x, y)]\big)$
3. Return $w = \frac{1}{T}\sum_{t=1}^{T} w_t$

SLIDE 15

Per Weight Learning Rates

Gradient descent is highly vulnerable to ill-conditioning, that is, the ratio of the largest and smallest absolute eigenvalues of the Hessian deviating strongly from one. In MLNs the Hessian is

$H = \begin{pmatrix} E_w[n_1]\,E_w[n_1] - E_w[n_1 n_1] & \cdots & E_w[n_1]\,E_w[n_k] - E_w[n_1 n_k] \\ \vdots & \ddots & \vdots \\ E_w[n_k]\,E_w[n_1] - E_w[n_k n_1] & \cdots & E_w[n_k]\,E_w[n_k] - E_w[n_k n_k] \end{pmatrix}$

the negative covariance matrix of the clause counts.

Remedy: introduce a different learning rate for each weight, $\eta_i = \eta / n_i$.
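As a one-line change to the gradient step, the per-weight rate could look like the following sketch, under the assumption that the rate is the global rate η divided by the clause count $n_i$:

```python
def per_weight_update(w, observed, expected, eta=1.0):
    """w_i <- w_i + (eta / n_i) * (n_i(x, y) - E_w[n_i(x, y)]), skipping clauses with n_i = 0."""
    return [wi + (eta / n) * (n - e) if n > 0 else wi
            for wi, n, e in zip(w, observed, expected)]
```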

SLIDE 16

Diagonal Newton

A descent of the form

$x_t = x_{t-1} - H^{-1}\,\nabla f(x_{t-1})$

for a differentiable function f with Hessian H is the so-called Newton method. For many weights, calculating the inverse of H becomes infeasible. Use the inverse of the diagonalized Hessian instead:

$\begin{pmatrix} \frac{1}{E_w[n_1]\,E_w[n_1] - E_w[n_1 n_1]} & & \\ & \ddots & \\ & & \frac{1}{E_w[n_k]\,E_w[n_k] - E_w[n_k n_k]} \end{pmatrix}$

This is the so-called diagonal Newton method.

SLIDE 17

Diagonal Newton

$w_{i,t} = w_{i,t-1} - \alpha\,\frac{n_i(x, y) - E_w[n_i(x, y)]}{E_w[n_i(x, y)]\,E_w[n_i(x, y)] - E_w[n_i(x, y)\, n_i(x, y)]}$

Step size $\alpha$ for a given search direction d:

$\alpha = \frac{d^T\, \nabla P_w(Y=y \mid X=x)}{-\,d^T H\, d + \lambda\, d^T d}$

$\lambda$ limits the step size to a region in which the quadratic approximation is good.
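A sketch of the diagonal Newton step; `expected_counts` and `expected_squared_counts` stand in for quantities that would be estimated by MCMC, and the step size α is taken as given:

```python
def diagonal_newton_step(w, observed, expected_counts, expected_squared_counts, alpha=1.0):
    """w_i <- w_i + alpha * (n_i - E_w[n_i]) / (E_w[n_i^2] - E_w[n_i]^2)."""
    new_w = []
    for wi, n, e, e2 in zip(w, observed, expected_counts, expected_squared_counts):
        variance = e2 - e * e                 # = -H_ii, the negated diagonal Hessian entry
        if variance > 1e-12:                  # guard against a degenerate clause
            wi = wi + alpha * (n - e) / variance
        new_w.append(wi)
    return new_w
```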

SLIDE 18

Diagonal Newton: Choosing λ

Approximated actual change:

$\Delta_{\text{actual}} = d_{t-1}^T\, \nabla P_t$

Predicted change:

$\Delta_{\text{pred}} = d_{t-1}^T\, \big(\nabla P_{t-1} + H_{t-1}\, d_{t-1}\big) / 2$

If $\Delta_{\text{actual}} / \Delta_{\text{pred}} > 0.75$, then $\lambda_{t+1} = \lambda_t / 2$.

If $\Delta_{\text{actual}} / \Delta_{\text{pred}} < 0.25$, then $\lambda_{t+1} = 4\,\lambda_t$.

If $\Delta_{\text{actual}}$ is negative, the step is rejected and redone after adjusting $\lambda$.
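A sketch of this λ schedule, a Levenberg-Marquardt style heuristic; the thresholds 0.75 and 0.25 follow the slide, while the adjustment factor after a rejected step is an assumption:

```python
def adjust_lambda(lmbda, delta_actual, delta_pred):
    """Return (new_lambda, accept_step) based on the ratio of actual to predicted change."""
    if delta_actual < 0:
        return lmbda * 4, False              # step hurt the objective: reject and dampen more
    ratio = delta_actual / delta_pred if delta_pred else 1.0
    if ratio > 0.75:
        return lmbda / 2, True               # quadratic model is good: allow bigger steps
    if ratio < 0.25:
        return lmbda * 4, True               # model is poor: shrink the trusted region
    return lmbda, True
```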

SLIDE 19

Scaled Conjugate Gradient

Instead of taking a constant step size, do a so-called line search

argmina ha= f xka d ,a∈ℝ

+

can be „loosley“ minimized On ill – conditioned problems inefficient, since line searchs along successive directions tend to partly undo the effect - gradient zigzags Impose at each step that gradient along previous directions remain zero Use Hessian instead of line search

SLIDE 20

Experiments

Datasets:

Cora: 1295 citations of 132 different papers. Task: determine which citations refer to the same paper, given the words in their author, title and venue fields.

WebKB: 4165 labeled web pages and 10,935 web links, each page marked with a subset of the categories "student, faculty, professor, department, research project, course". Task: predict the categories from the web pages' words and links.

SLIDE 21

MLN Rules for WebKB

Hasword , page⇒Class page ,class ¬Hasword , page⇒Class page ,class Class page1,class1∧LinksTo page1, page2⇒Class page2 ,class1

A separate weight was learned for each of these rules for each (word, class) and (class, class) pair. The model contained 10,981 weights, and the number of ground clauses exceeded 300,000.

SLIDE 22

Experiments: Methodology

Cora: five-way cross-validation. WebKB: four-way cross-validation.

Each algorithm was trained for 4 h with different learning rates on each training set, with a Gaussian prior, based on preliminary experiments.

After tuning, the algorithms were rerun for 10 h on the whole data set (including the held-out validation set).

SLIDE 23

Results

SLIDE 24

That's it folks!