SLIDE 1
Efficient Weight Learning for Markov Logic Networks
Speaker: Manuel Noll
Advisor: Maximilian Dylla
What is this talk about? Well, of course about weight learning. A weight represents the relative strength or importance of its rule.
SLIDE 2
SLIDE 3
Bird: Nightingale, Sparrow, Penguin
Able to fly: Nightingale, Sparrow
Eats fish: Penguin, Bear
Getting a better understanding of

$P_w(X=x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$

$n_i(x)$: the number of times the $i$-th formula is satisfied by the state of the world $x$
Example
Number  Formula                              Weight
1       ∀x: Bird(x) ⇒ AbleToFly(x)           1
2       ∀x: ¬AbleToFly(x) ⇒ EatsFish(x)      5
Vector translation: x = (1,1,1,0, 1,1,0,0, 0,0,1,1). Constant symbols: C = (Nightingale, Sparrow, Penguin, Bear). We make a closed world assumption.
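As a minimal sketch (not part of the slides; the names and representation are my own), the counts $n_i(x)$ for this example can be computed directly from the truth assignment:

```python
constants = ["Nightingale", "Sparrow", "Penguin", "Bear"]

# Closed world assumption: every atom not listed as true is false.
bird        = {"Nightingale", "Sparrow", "Penguin"}
able_to_fly = {"Nightingale", "Sparrow"}
eats_fish   = {"Penguin", "Bear"}

# Formula 1: forall x: Bird(x) => AbleToFly(x)   (A => B  ==  not A or B)
n1 = sum(1 for c in constants if c not in bird or c in able_to_fly)

# Formula 2: forall x: not AbleToFly(x) => EatsFish(x)
n2 = sum(1 for c in constants if c in able_to_fly or c in eats_fish)

# Note: vacuously satisfied groundings (e.g. Bear for formula 1) also count.
print(n1, n2)
```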
SLIDE 4
The same example, now listing which constants satisfy each formula:

Number  Formula                              Weight  Satisfied by
1       ∀x: Bird(x) ⇒ AbleToFly(x)           1       Nightingale, Sparrow
2       ∀x: ¬AbleToFly(x) ⇒ EatsFish(x)      5       Bear, Penguin
SLIDE 5
$P_w(X=x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$
This can be seen as the probability that the given world is an interpretation of our model. When $x$ is fixed, the model depends on $w$ and is thus a likelihood function of $w$. Given a prior distribution $p(w)$, we can compute the posterior distribution of $w$:
$P(w \mid X=x) = \frac{P_w(X=x)\, p(w)}{\sum_{w'} P_{w'}(X=x)\, p(w')}$

Maximizing this posterior yields the so-called maximum a posteriori (MAP) estimate of $w$.
SLIDE 6
Maximization of the Log-Likelihood Function
$\arg\max_w P_w(X=x) = \arg\max_w \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right), \quad Z = \sum_{x'} \exp\left(\sum_i w_i n_i(x')\right)$
We want to fit our model to our world $x$; therefore we have to optimize the weights.
It is easier to maximize the logarithm:

$\arg\max_w \log P_w(X=x) = \arg\max_w \left[ \sum_i w_i n_i(x) - \log \sum_{x'} \exp\left(\sum_i w_i n_i(x')\right) \right]$

Differentiating, the derivative of the log-partition function turns out to be an expectation:

$\Rightarrow \frac{\partial}{\partial w_i} \log P_w(X=x) = n_i(x) - \sum_{x'} P_w(X=x')\, n_i(x') = n_i(x) - E_w[n_i(x)]$
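For a domain small enough to enumerate all worlds, this gradient can be computed exactly. A minimal sketch (my own illustration, not from the slides):

```python
import math

def log_likelihood_grad(w, n, x_obs, worlds):
    """Exact gradient of log P_w(X = x_obs): n_i(x_obs) - E_w[n_i(x)].
    w: weight vector; n(x): count vector of world x; worlds: list of all worlds."""
    scores = [math.exp(sum(wi * ni for wi, ni in zip(w, n(x)))) for x in worlds]
    Z = sum(scores)  # the partition function
    expected = [sum(s * n(x)[i] for s, x in zip(scores, worlds)) / Z
                for i in range(len(w))]
    return [ni - ei for ni, ei in zip(n(x_obs), expected)]
```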
SLIDE 7
A conditional likelihood approach
In many applications, we know a priori which predicates will be evidence and which ones will be queried. The goal is to correctly predict the latter given the former. Partition the ground atoms in the domain into a set of evidence atoms $X$ and a set of query atoms $Y$.
The conditional likelihood of $Y$ given $X$ is

$P_w(Y=y \mid X=x) = \frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$

where $F_Y$ is the set of all MLN clauses with at least one grounding involving a query atom.
SLIDE 8
A Convex Problem
$\arg\max_w P_w(Y=y \mid X=x) = \arg\max_w \frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$

$\Rightarrow \frac{\partial}{\partial w_i} \log P_w(Y=y \mid X=x) = n_i(x, y) - \sum_{y'} P_w(Y=y' \mid X=x)\, n_i(x, y') = n_i(x, y) - E_w[n_i(x, y)]$

The log-likelihood is a concave function of $w$ (equivalently, the negative log-likelihood is convex). Since the weight space $w \in [0, \infty)$ is a convex set, we have a convex optimization problem, and for a convex optimization problem it holds that if a local optimum exists, then it is a global optimum.
SLIDE 9
Gradient descent / ascent
Let $f$ be a differentiable function; then the iteration

$x_n = x_{n-1} + \eta\, \nabla f(x_{n-1})$

is the so-called gradient ascent method with step size $\eta$. In our case

$w_{i,t} = w_{i,t-1} + \eta \left. \frac{\partial}{\partial w_i} \log P_w(Y=y \mid X=x) \right|_{w_{t-1}} = w_{i,t-1} + \eta\, \big( n_i(x, y) - E_w[n_i(x, y)] \big)$

But how can we compute $E_w[n_i(x, y)]$, which is still intractable?
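As a sketch (my own, with the step size $\eta$ made explicit), the resulting training loop is just:

```python
def gradient_ascent(w, grad, eta=0.01, steps=1000):
    """Generic gradient ascent: w_t = w_{t-1} + eta * grad(w_{t-1})."""
    for _ in range(steps):
        w = [wi + eta * gi for wi, gi in zip(w, grad(w))]
    return w
```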
SLIDE 10
MAP with MaxWalkSat
$E_w[n_i(x, y)]$ can be approximated by the counts $n_i(x, y_w^*(x))$ in the MAP state $y_w^*(x)$. This will be a good approximation if most of the probability mass of $P_w(Y=y \mid X=x)$ is concentrated around $y_w^*(x)$.

The form $\frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$ suggests the solution: $y_w^*(x)$ is the state that maximizes the sum of the weights of the satisfied ground clauses. This weighted satisfiability problem is solved by MaxWalkSat.
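A minimal MaxWalkSat sketch, assuming a simplified clause representation of my own (the slides give no code):

```python
import random

def max_walk_sat(atoms, clauses, p=0.5, max_flips=10000):
    """clauses: list of (weight, atoms_in_clause, is_satisfied) triples,
    where is_satisfied(state) -> bool. Returns a truth assignment that
    approximately maximizes the total weight of satisfied clauses."""
    state = {a: random.random() < 0.5 for a in atoms}

    def total_weight(s):
        return sum(w for w, _, sat in clauses if sat(s))

    best, best_w = dict(state), total_weight(state)
    for _ in range(max_flips):
        unsat = [c for c in clauses if not c[2](state)]
        if not unsat:
            break
        _, clause_atoms, _ = random.choice(unsat)  # pick an unsatisfied clause
        if random.random() < p:                    # random-walk step
            flip = random.choice(clause_atoms)
        else:                                      # greedy step: best single flip
            flip = max(clause_atoms,
                       key=lambda a: total_weight({**state, a: not state[a]}))
        state[flip] = not state[flip]
        if total_weight(state) > best_w:
            best, best_w = dict(state), total_weight(state)
    return best
```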
SLIDE 11
Voted Perceptron Algorithm
Initialize $w_i \sim N(0, 1)$
For $t = 1$ to $T$:
    Approximate $y_w^*(x)$ by MaxWalkSat
    Count $n_i(x, y_w^*)$
    Update $w_{i,t} = w_{i,t-1} + \eta\, \big( n_i(x, y) - E_w^*[n_i(x, y)] \big)$, where $E_w^*[n_i(x, y)] = n_i(x, y_w^*(x))$
Return $\hat{w} = \frac{1}{T} \sum_{t=1}^{T} w_t$
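A sketch of this loop (my own; `approx_counts` is a hypothetical helper wrapping the MaxWalkSat MAP counts):

```python
import random

def voted_perceptron(n_obs, approx_counts, k, T=100, eta=0.001):
    """n_obs: observed counts n_i(x, y); approx_counts(w): counts
    n_i(x, y*_w(x)) in the MAP state, used in place of E_w[n_i(x, y)];
    k: number of weights. Returns the average of all weight vectors."""
    w = [random.gauss(0, 1) for _ in range(k)]   # w_i ~ N(0, 1)
    history = []
    for _ in range(T):
        approx = approx_counts(w)                # MAP counts via MaxWalkSat
        w = [wi + eta * (n - a) for wi, n, a in zip(w, n_obs, approx)]
        history.append(w)
    # "Voting": averaging the iterates is more robust than returning w_T.
    return [sum(ws[i] for ws in history) / T for i in range(k)]
```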
SLIDE 12
Markov Chain Monte Carlo (MCMC)
$E_w[n_i(x, y)]$ can be estimated via the Markov chain Monte Carlo method. The Markov chain has to generate samples $y_1, \ldots, y_t$ whose asymptotic distribution follows $P_w(Y=y \mid X=x)$. Taking $t$ samples from the chain and averaging their counts gives us a good approximation of $E_w[n_i(x, y)]$.
SLIDE 13
Markov Chain Monte Carlo (MCMC)
Estimate $E_w[n_i(x, y)]$ by MCMC (Gibbs sampling) as follows:

Let $W_0 = X \cup Y$
For $i = 1$ to $T$:
    $W_i = W_{i-1}$
    For each query atom $y$ in $Y$:
        Compute $P(y \mid Y \setminus \{y\})$
        Sample $y_t$ from $P(y \mid Y \setminus \{y\})$
        Set $y = y_t$ in $W_i$
Samples are taken from $W_0, \ldots, W_T$
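A Gibbs-sampling sketch under my own simplified interface (`cond_prob` and `count_fn` are hypothetical helpers, not from the slides):

```python
import random

def gibbs_counts(query_atoms, evidence, cond_prob, count_fn, T=1000):
    """cond_prob(atom, state) -> P(atom is true | all other atoms in state);
    count_fn(state) -> count vector n_i for the full state.
    Returns the counts averaged over the sampled states W_1..W_T."""
    state = dict(evidence)                        # W_0 = X ∪ Y
    state.update({a: random.random() < 0.5 for a in query_atoms})
    totals = None
    for _ in range(T):
        for a in query_atoms:                     # resample each query atom
            state[a] = random.random() < cond_prob(a, state)
        counts = count_fn(state)
        totals = counts if totals is None else [t + c
                                                for t, c in zip(totals, counts)]
    return [t / T for t in totals]
```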
SLIDE 14
Contrastive Divergence
Initialize $w_i \sim N(0, 1)$
For $t = 1$ to $T$:
    Approximate $E_w[n_i(x, y)]$ by MCMC
    Update $w_{i,t} = w_{i,t-1} + \eta\, \big( n_i(x, y) - E_w^*[n_i(x, y)] \big)$
Return $\hat{w} = \frac{1}{T} \sum_{t=1}^{T} w_t$
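The loop is the same as the voted perceptron's; only the count estimator changes. A sketch assuming the helpers above (`mcmc_counts` is hypothetical, e.g. `gibbs_counts` run for a few sweeps):

```python
import random

def contrastive_divergence(n_obs, mcmc_counts, k, T=100, eta=0.001):
    """Same loop as the voted perceptron, except that E_w[n_i(x, y)] is
    estimated by a short MCMC run instead of the MaxWalkSat MAP counts."""
    w = [random.gauss(0, 1) for _ in range(k)]   # w_i ~ N(0, 1)
    history = []
    for _ in range(T):
        est = mcmc_counts(w)                     # short-run MCMC estimate
        w = [wi + eta * (n - e) for wi, n, e in zip(w, n_obs, est)]
        history.append(w)
    return [sum(ws[i] for ws in history) / T for i in range(k)]
```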
SLIDE 15
Per-Weight Learning Rates
Gradient descent is highly vulnerable to ill-conditioning, i.e., the ratio of the largest to the smallest absolute eigenvalue of the Hessian deviating strongly from one. In MLNs the Hessian is
$H = \begin{pmatrix} E_w[n_1]\,E_w[n_1] - E_w[n_1 n_1] & \cdots & E_w[n_1]\,E_w[n_k] - E_w[n_1 n_k] \\ \vdots & \ddots & \vdots \\ E_w[n_k]\,E_w[n_1] - E_w[n_k n_1] & \cdots & E_w[n_k]\,E_w[n_k] - E_w[n_k n_k] \end{pmatrix}$

i.e. the negative covariance matrix of the clause counts.
This suggests introducing a different learning rate per weight: $\eta_i = \eta / n_i$, where $n_i$ is the number of true groundings of the $i$-th formula.
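A one-line sketch of this rescaling (my reading of the slide's $\eta_i = \eta / n_i$):

```python
def per_weight_rates(eta, true_grounding_counts):
    """Per-weight learning rates eta_i = eta / n_i, where n_i is the number
    of true groundings of clause i (max(., 1) guards against division by zero)."""
    return [eta / max(n, 1) for n in true_grounding_counts]
```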
SLIDE 16
Diagonal Newton
A gradient descent of the form

$x_t = x_{t-1} - H^{-1} \nabla f(x_{t-1})$

for a differentiable function $f$ with Hessian $H$ is the so-called Newton method. For many weights, calculating the inverse of $H$ becomes infeasible. Instead, use the inverse of the diagonalized Hessian:

$\begin{pmatrix} \frac{1}{E_w[n_1]\,E_w[n_1] - E_w[n_1 n_1]} & & \\ & \ddots & \\ & & \frac{1}{E_w[n_k]\,E_w[n_k] - E_w[n_k n_k]} \end{pmatrix}$

This is the so-called diagonal Newton method.
SLIDE 17
Diagonal Newton
$w_{i,t} = w_{i,t-1} - \frac{n_i(x, y) - E_w[n_i(x, y)]}{E_w[n_i(x, y)]\,E_w[n_i(x, y)] - E_w[n_i(x, y)\, n_i(x, y)]}$

The step size for a given search direction $d$ is

$\alpha = \frac{d^T\, \nabla \log P_w(Y=y \mid X=x)}{d^T H\, d + \lambda\, d^T d}$

where $\lambda$ limits the step size to a region in which the approximation is good.
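A sketch of one update (my own; the sign flip works because the slide's denominator is the negated variance of the clause counts):

```python
def diagonal_newton_step(w, n_obs, e_n, e_nn):
    """One diagonal Newton update. e_n[i] = E_w[n_i], e_nn[i] = E_w[n_i * n_i].
    The denominator E[n_i]E[n_i] - E[n_i n_i] is -Var(n_i) <= 0, so the double
    negative yields an ascent step of (gradient component / variance)."""
    eps = 1e-8  # guard against zero variance
    return [wi - (n - en) / min(en * en - enn, -eps)
            for wi, n, en, enn in zip(w, n_obs, e_n, e_nn)]
```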
SLIDE 18
Diagonal Newton: Choosing λ
Approximated actual change: $\Delta_{\text{actual}} = d_{t-1}^T\, \nabla P_t$
Predicted change: $\Delta_{\text{pred}} = d_{t-1}^T\, \nabla P_{t-1} + \frac{1}{2}\, d_{t-1}^T\, H_{t-1}\, d_{t-1}$

If $\Delta_{\text{actual}} / \Delta_{\text{pred}} > 0.75$, then $\lambda_{t+1} = \lambda_t / 2$.
If $\Delta_{\text{actual}} / \Delta_{\text{pred}} < 0.25$, then $\lambda_{t+1} = 4 \lambda_t$.
If $\Delta_{\text{actual}}$ is negative, the step is rejected and redone after adjusting $\lambda$.
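A small sketch of this rule (how much to adjust $\lambda$ on a rejected step is not specified on the slide; growing it is my assumption):

```python
def adjust_lambda(lam, actual, pred):
    """Damping-style lambda update. Returns (new_lambda, step_accepted)."""
    if actual < 0:            # step decreased the objective: reject it
        return 4 * lam, False
    ratio = actual / pred
    if ratio > 0.75:          # quadratic model is good: loosen the damping
        return lam / 2, True
    if ratio < 0.25:          # model is poor: tighten the damping
        return 4 * lam, True
    return lam, True
```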
SLIDE 19
Scaled Conjugate Gradient
Instead of taking a constant step size, do a so-called line search:

$\arg\min_{a} h(a) = f(x_k + a\, d), \quad a \in \mathbb{R}^+$

which can be minimized "loosely". On ill-conditioned problems this is inefficient, since line searches along successive gradient directions tend to partly undo each other's effect: the gradient zigzags. Conjugate gradient therefore imposes at each step that the gradient along all previous search directions remains zero. Scaled conjugate gradient uses the Hessian instead of a line search.
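For illustration only, here is one conjugate-direction update (the Polak-Ribière rule, a standard choice I am substituting; it is the kind of conjugacy SCG builds on, not the full SCG algorithm):

```python
def polak_ribiere_direction(g_new, g_old, d_old):
    """Conjugate-gradient direction update: d_new = g_new + beta * d_old,
    with beta chosen so successive directions do not undo each other."""
    beta = max(0.0, sum(gn * (gn - go) for gn, go in zip(g_new, g_old))
                    / max(sum(go * go for go in g_old), 1e-12))
    return [gn + beta * d for gn, d in zip(g_new, d_old)]
```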
SLIDE 20
Experiments
Datasets:
- Cora: 1,295 citations of 132 different papers. Task: determine which citations refer to the same paper, given the words in their author, title, and venue fields.
- WebKB: 4,165 labeled web pages and 10,935 web links, each marked with a subset of the categories "student", "faculty", "professor", "department", "research project", "course". Task: predict the categories from the web pages' words and links.
SLIDE 21
MLN Rules for WebKB
Has(word, page) ⇒ Class(page, class)
¬Has(word, page) ⇒ Class(page, class)
Class(page1, class1) ∧ LinksTo(page1, page2) ⇒ Class(page2, class1)

A separate weight was learned for each of these rules for each (word, class) and (class, class) pair. The model contained 10,981 weights; the number of ground clauses exceeded 300,000.
SLIDE 22
Experiments: Methodology
- Cora: five-way cross-validation
- WebKB: four-way cross-validation
- Each algorithm was trained for 4 hours with different learning rates on each training set, with a Gaussian prior, based on preliminary experiments.
- After tuning, the algorithms were rerun for 10 hours on the whole data set (including the held-out validation set).
SLIDE 23
Results
SLIDE 24