SLIDE 1
Efficient Weight Learning for Markov Logic Networks
Speaker: Manuel Noll
Advisor: Maximilian Dylla
What is this talk about? Well, of course about weight learning. A weight represents the relative strength or importance of its rule.
SLIDE 2
SLIDE 3
Bird: Nightingale, Sparrow, Penguin
Able to fly: Nightingale, Sparrow
Eats fish: Penguin, Bear
Getting a better understanding of

$P_w(X=x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$

$n_i(x)$: the number of times the $i$-th formula is satisfied by the state of the world $x$
Example
Number  Formula                              Weight
1       ∀x: Bird(x) ⇒ AbleToFly(x)           1
2       ∀x: ¬AbleToFly(x) ⇒ EatsFish(x)      5
Vector translation: x = (1,1,1,0, 1,1,0,0, 0,0,1,1). Constant symbols: C = (Nightingale, Sparrow, Penguin, Bear). We make a closed world assumption.
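As a minimal sketch (not part of the slides; the names and representation are my own), the counts $n_i(x)$ for this example can be computed directly from the truth assignment:

```python
constants = ["Nightingale", "Sparrow", "Penguin", "Bear"]

# Closed world assumption: every atom not listed as true is false.
bird        = {"Nightingale", "Sparrow", "Penguin"}
able_to_fly = {"Nightingale", "Sparrow"}
eats_fish   = {"Penguin", "Bear"}

# Formula 1: forall x: Bird(x) => AbleToFly(x)   (A => B  ==  not A or B)
n1 = sum(1 for c in constants if c not in bird or c in able_to_fly)

# Formula 2: forall x: not AbleToFly(x) => EatsFish(x)
n2 = sum(1 for c in constants if c in able_to_fly or c in eats_fish)

# Note: vacuously satisfied groundings (e.g. Bear for formula 1) also count.
print(n1, n2)
```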
SLIDE 4
The same example, now listing which constants satisfy each formula:

Number  Formula                              Weight  Satisfied by
1       ∀x: Bird(x) ⇒ AbleToFly(x)           1       Nightingale, Sparrow
2       ∀x: ¬AbleToFly(x) ⇒ EatsFish(x)      5       Bear, Penguin
SLIDE 5
$P_w(X=x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$
This can be seen as the probability that the given world is an interpretation of our model. When $x$ is fixed, the model depends on $w$ and is thus a likelihood function of $w$. Given a prior distribution $p(w)$, we can compute the posterior distribution of $w$:
$P(w \mid X=x) = \frac{P_w(X=x)\, p(w)}{\sum_{w'} P_{w'}(X=x)\, p(w')}$

Maximizing this posterior yields the so-called maximum a posteriori (MAP) estimate of $w$.
SLIDE 6
Maximization of the Log-Likelihood Function
$\arg\max_w P_w(X=x) = \arg\max_w \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right), \quad Z = \sum_{x'} \exp\left(\sum_i w_i n_i(x')\right)$
We want to fit our model to our world $x$; therefore we have to optimize the weights.
It is easier to maximize the logarithm:

$\arg\max_w \log P_w(X=x) = \arg\max_w \left[ \sum_i w_i n_i(x) - \log \sum_{x'} \exp\left(\sum_i w_i n_i(x')\right) \right]$

Differentiating, the derivative of the log-partition function turns out to be an expectation:

$\Rightarrow \frac{\partial}{\partial w_i} \log P_w(X=x) = n_i(x) - \sum_{x'} P_w(X=x')\, n_i(x') = n_i(x) - E_w[n_i(x)]$
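For a domain small enough to enumerate all worlds, this gradient can be computed exactly. A minimal sketch (my own illustration, not from the slides):

```python
import math

def log_likelihood_grad(w, n, x_obs, worlds):
    """Exact gradient of log P_w(X = x_obs): n_i(x_obs) - E_w[n_i(x)].
    w: weight vector; n(x): count vector of world x; worlds: list of all worlds."""
    scores = [math.exp(sum(wi * ni for wi, ni in zip(w, n(x)))) for x in worlds]
    Z = sum(scores)  # the partition function
    expected = [sum(s * n(x)[i] for s, x in zip(scores, worlds)) / Z
                for i in range(len(w))]
    return [ni - ei for ni, ei in zip(n(x_obs), expected)]
```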
SLIDE 7
A conditional likelihood approach
In many applications, we know a priori which predicates will be evidence and which ones will be queried. The goal is to correctly predict the latter given the former. Partition the ground atoms in the domain into a set of evidence atoms $X$ and a set of query atoms $Y$.
The conditional likelihood of $Y$ given $X$ is

$P_w(Y=y \mid X=x) = \frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$

where $F_Y$ is the set of all MLN clauses with at least one grounding involving a query atom.
SLIDE 8
A Convex Problem
$\arg\max_w P_w(Y=y \mid X=x) = \arg\max_w \frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$

$\Rightarrow \frac{\partial}{\partial w_i} \log P_w(Y=y \mid X=x) = n_i(x, y) - \sum_{y'} P_w(Y=y' \mid X=x)\, n_i(x, y') = n_i(x, y) - E_w[n_i(x, y)]$

The log-likelihood is a concave function of $w$ (equivalently, the negative log-likelihood is convex). Since the weight space $w \in [0, \infty)$ is a convex set, we have a convex optimization problem, and for a convex optimization problem it holds that if a local optimum exists, then it is a global optimum.
SLIDE 9
Gradient descent / ascent
Let $f$ be a differentiable function; then the iteration

$x_n = x_{n-1} + \eta\, \nabla f(x_{n-1})$

is the so-called gradient ascent method with step size $\eta$. In our case

$w_{i,t} = w_{i,t-1} + \eta \left. \frac{\partial}{\partial w_i} \log P_w(Y=y \mid X=x) \right|_{w_{t-1}} = w_{i,t-1} + \eta\, \big( n_i(x, y) - E_w[n_i(x, y)] \big)$

But how can we compute $E_w[n_i(x, y)]$, which is still intractable?
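As a sketch (my own, with the step size $\eta$ made explicit), the resulting training loop is just:

```python
def gradient_ascent(w, grad, eta=0.01, steps=1000):
    """Generic gradient ascent: w_t = w_{t-1} + eta * grad(w_{t-1})."""
    for _ in range(steps):
        w = [wi + eta * gi for wi, gi in zip(w, grad(w))]
    return w
```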
SLIDE 10
MAP with MaxWalkSat
$E_w[n_i(x, y)]$ can be approximated by the counts $n_i(x, y_w^*(x))$ in the MAP state $y_w^*(x)$. This will be a good approximation if most of the probability mass of $P_w(Y=y \mid X=x)$ is concentrated around $y_w^*(x)$.

The form $\frac{1}{Z_x} \exp\left(\sum_{i \in F_Y} w_i n_i(x, y)\right)$ suggests the solution: $y_w^*(x)$ is the state that maximizes the sum of the weights of the satisfied ground clauses. This weighted satisfiability problem is solved by MaxWalkSat.
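A minimal MaxWalkSat sketch, assuming a simplified clause representation of my own (the slides give no code):

```python
import random

def max_walk_sat(atoms, clauses, p=0.5, max_flips=10000):
    """clauses: list of (weight, atoms_in_clause, is_satisfied) triples,
    where is_satisfied(state) -> bool. Returns a truth assignment that
    approximately maximizes the total weight of satisfied clauses."""
    state = {a: random.random() < 0.5 for a in atoms}

    def total_weight(s):
        return sum(w for w, _, sat in clauses if sat(s))

    best, best_w = dict(state), total_weight(state)
    for _ in range(max_flips):
        unsat = [c for c in clauses if not c[2](state)]
        if not unsat:
            break
        _, clause_atoms, _ = random.choice(unsat)  # pick an unsatisfied clause
        if random.random() < p:                    # random-walk step
            flip = random.choice(clause_atoms)
        else:                                      # greedy step: best single flip
            flip = max(clause_atoms,
                       key=lambda a: total_weight({**state, a: not state[a]}))
        state[flip] = not state[flip]
        if total_weight(state) > best_w:
            best, best_w = dict(state), total_weight(state)
    return best
```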
SLIDE 11
Voted Perceptron Algorithm
Initialize $w_i \sim N(0, 1)$
For $t = 1$ to $T$:
    Approximate $y_w^*(x)$ by MaxWalkSat
    Count $n_i(x, y_w^*)$
    Update $w_{i,t} = w_{i,t-1} + \eta\, \big( n_i(x, y) - E_w^*[n_i(x, y)] \big)$, where $E_w^*[n_i(x, y)] = n_i(x, y_w^*(x))$
Return $\hat{w} = \frac{1}{T} \sum_{t=1}^{T} w_t$
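A sketch of this loop (my own; `approx_counts` is a hypothetical helper wrapping the MaxWalkSat MAP counts):

```python
import random

def voted_perceptron(n_obs, approx_counts, k, T=100, eta=0.001):
    """n_obs: observed counts n_i(x, y); approx_counts(w): counts
    n_i(x, y*_w(x)) in the MAP state, used in place of E_w[n_i(x, y)];
    k: number of weights. Returns the average of all weight vectors."""
    w = [random.gauss(0, 1) for _ in range(k)]   # w_i ~ N(0, 1)
    history = []
    for _ in range(T):
        approx = approx_counts(w)                # MAP counts via MaxWalkSat
        w = [wi + eta * (n - a) for wi, n, a in zip(w, n_obs, approx)]
        history.append(w)
    # "Voting": averaging the iterates is more robust than returning w_T.
    return [sum(ws[i] for ws in history) / T for i in range(k)]
```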
SLIDE 12
Markov Chain Monte Carlo (MCMC)
$E_w[n_i(x, y)]$ can be estimated via the Markov chain Monte Carlo method. The Markov chain has to generate samples $y_1, \ldots, y_t$ whose asymptotic distribution follows $P_w(Y=y \mid X=x)$. Taking $t$ samples from the chain and averaging their counts gives us a good approximation of $E_w[n_i(x, y)]$.
SLIDE 13
Markov Chain Monte Carlo (MCMC)
Estimate $E_w[n_i(x, y)]$ by MCMC (Gibbs sampling) as follows:

Let $W_0 = X \cup Y$
For $i = 1$ to $T$:
    $W_i = W_{i-1}$
    For each query atom $y$ in $Y$:
        Compute $P(y \mid Y \setminus \{y\})$
        Sample $y_t$ from $P(y \mid Y \setminus \{y\})$
        Set $y = y_t$ in $W_i$
Samples are taken from $W_0, \ldots, W_T$
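A Gibbs-sampling sketch under my own simplified interface (`cond_prob` and `count_fn` are hypothetical helpers, not from the slides):

```python
import random

def gibbs_counts(query_atoms, evidence, cond_prob, count_fn, T=1000):
    """cond_prob(atom, state) -> P(atom is true | all other atoms in state);
    count_fn(state) -> count vector n_i for the full state.
    Returns the counts averaged over the sampled states W_1..W_T."""
    state = dict(evidence)                        # W_0 = X ∪ Y
    state.update({a: random.random() < 0.5 for a in query_atoms})
    totals = None
    for _ in range(T):
        for a in query_atoms:                     # resample each query atom
            state[a] = random.random() < cond_prob(a, state)
        counts = count_fn(state)
        totals = counts if totals is None else [t + c
                                                for t, c in zip(totals, counts)]
    return [t / T for t in totals]
```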
SLIDE 14
Contrastive Divergence
Initialize $w_i \sim N(0, 1)$
For $t = 1$ to $T$:
    Approximate $E_w[n_i(x, y)]$ by MCMC
    Update $w_{i,t} = w_{i,t-1} + \eta\, \big( n_i(x, y) - E_w^*[n_i(x, y)] \big)$
Return $\hat{w} = \frac{1}{T} \sum_{t=1}^{T} w_t$
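The loop is the same as the voted perceptron's; only the count estimator changes. A sketch assuming the helpers above (`mcmc_counts` is hypothetical, e.g. `gibbs_counts` run for a few sweeps):

```python
import random

def contrastive_divergence(n_obs, mcmc_counts, k, T=100, eta=0.001):
    """Same loop as the voted perceptron, except that E_w[n_i(x, y)] is
    estimated by a short MCMC run instead of the MaxWalkSat MAP counts."""
    w = [random.gauss(0, 1) for _ in range(k)]   # w_i ~ N(0, 1)
    history = []
    for _ in range(T):
        est = mcmc_counts(w)                     # short-run MCMC estimate
        w = [wi + eta * (n - e) for wi, n, e in zip(w, n_obs, est)]
        history.append(w)
    return [sum(ws[i] for ws in history) / T for i in range(k)]
```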
SLIDE 15
Per-Weight Learning Rates
Gradient descent is highly vulnerable to ill-conditioning, i.e., the ratio of the largest to the smallest absolute eigenvalue of the Hessian deviating strongly from one. In MLNs the Hessian is
$H = \begin{pmatrix} E_w[n_1]\,E_w[n_1] - E_w[n_1 n_1] & \cdots & E_w[n_1]\,E_w[n_k] - E_w[n_1 n_k] \\ \vdots & \ddots & \vdots \\ E_w[n_k]\,E_w[n_1] - E_w[n_k n_1] & \cdots & E_w[n_k]\,E_w[n_k] - E_w[n_k n_k] \end{pmatrix}$

i.e. the negative covariance matrix of the clause counts.
This suggests introducing a different learning rate per weight: $\eta_i = \eta / n_i$, where $n_i$ is the number of true groundings of the $i$-th formula.
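A one-line sketch of this rescaling (my reading of the slide's $\eta_i = \eta / n_i$):

```python
def per_weight_rates(eta, true_grounding_counts):
    """Per-weight learning rates eta_i = eta / n_i, where n_i is the number
    of true groundings of clause i (max(., 1) guards against division by zero)."""
    return [eta / max(n, 1) for n in true_grounding_counts]
```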
SLIDE 16
Diagonal Newton
A gradient descent of the form

$x_t = x_{t-1} - H^{-1} \nabla f(x_{t-1})$

for a differentiable function $f$ with Hessian $H$ is the so-called Newton method. For many weights, calculating the inverse of $H$ becomes infeasible. Instead, use the inverse of the diagonalized Hessian:

$\begin{pmatrix} \frac{1}{E_w[n_1]\,E_w[n_1] - E_w[n_1 n_1]} & & \\ & \ddots & \\ & & \frac{1}{E_w[n_k]\,E_w[n_k] - E_w[n_k n_k]} \end{pmatrix}$

This is the so-called diagonal Newton method.
SLIDE 17
Diagonal Newton
$w_{i,t} = w_{i,t-1} - \frac{n_i(x, y) - E_w[n_i(x, y)]}{E_w[n_i(x, y)]\,E_w[n_i(x, y)] - E_w[n_i(x, y)\, n_i(x, y)]}$

The step size for a given search direction $d$ is

$\alpha = \frac{d^T\, \nabla \log P_w(Y=y \mid X=x)}{d^T H\, d + \lambda\, d^T d}$

where $\lambda$ limits the step size to a region in which the approximation is good.
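A sketch of one update (my own; the sign flip works because the slide's denominator is the negated variance of the clause counts):

```python
def diagonal_newton_step(w, n_obs, e_n, e_nn):
    """One diagonal Newton update. e_n[i] = E_w[n_i], e_nn[i] = E_w[n_i * n_i].
    The denominator E[n_i]E[n_i] - E[n_i n_i] is -Var(n_i) <= 0, so the double
    negative yields an ascent step of (gradient component / variance)."""
    eps = 1e-8  # guard against zero variance
    return [wi - (n - en) / min(en * en - enn, -eps)
            for wi, n, en, enn in zip(w, n_obs, e_n, e_nn)]
```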
SLIDE 18
Diagonal Newton: Choosing λ
Approximated actual change: $\Delta_{\text{actual}} = d_{t-1}^T\, \nabla P_t$
Predicted change: $\Delta_{\text{pred}} = d_{t-1}^T\, \nabla P_{t-1} + \frac{1}{2}\, d_{t-1}^T\, H_{t-1}\, d_{t-1}$

If $\Delta_{\text{actual}} / \Delta_{\text{pred}} > 0.75$, then $\lambda_{t+1} = \lambda_t / 2$.
If $\Delta_{\text{actual}} / \Delta_{\text{pred}} < 0.25$, then $\lambda_{t+1} = 4 \lambda_t$.
If $\Delta_{\text{actual}}$ is negative, the step is rejected and redone after adjusting $\lambda$.
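A small sketch of this rule (how much to adjust $\lambda$ on a rejected step is not specified on the slide; growing it is my assumption):

```python
def adjust_lambda(lam, actual, pred):
    """Damping-style lambda update. Returns (new_lambda, step_accepted)."""
    if actual < 0:            # step decreased the objective: reject it
        return 4 * lam, False
    ratio = actual / pred
    if ratio > 0.75:          # quadratic model is good: loosen the damping
        return lam / 2, True
    if ratio < 0.25:          # model is poor: tighten the damping
        return 4 * lam, True
    return lam, True
```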
SLIDE 19
Scaled Conjugate Gradient
Instead of taking a constant step size, do a so-called line search:

$\arg\min_{a} h(a) = f(x_k + a\, d), \quad a \in \mathbb{R}^+$

which can be minimized "loosely". On ill-conditioned problems this is inefficient, since line searches along successive gradient directions tend to partly undo each other's effect: the gradient zigzags. Conjugate gradient therefore imposes at each step that the gradient along all previous search directions remains zero. Scaled conjugate gradient uses the Hessian instead of a line search.
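For illustration only, here is one conjugate-direction update (the Polak-Ribière rule, a standard choice I am substituting; it is the kind of conjugacy SCG builds on, not the full SCG algorithm):

```python
def polak_ribiere_direction(g_new, g_old, d_old):
    """Conjugate-gradient direction update: d_new = g_new + beta * d_old,
    with beta chosen so successive directions do not undo each other."""
    beta = max(0.0, sum(gn * (gn - go) for gn, go in zip(g_new, g_old))
                    / max(sum(go * go for go in g_old), 1e-12))
    return [gn + beta * d for gn, d in zip(g_new, d_old)]
```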
SLIDE 20
Experiments
Datasets:
- Cora: 1,295 citations of 132 different papers. Task: determine which citations refer to the same paper, given the words in their author, title, and venue fields.
- WebKB: 4,165 labeled web pages and 10,935 web links, each marked with a subset of the categories "student", "faculty", "professor", "department", "research project", "course". Task: predict the categories from the web pages' words and links.
SLIDE 21
MLN Rules for WebKB
Has(word, page) ⇒ Class(page, class)
¬Has(word, page) ⇒ Class(page, class)
Class(page1, class1) ∧ LinksTo(page1, page2) ⇒ Class(page2, class1)

A separate weight was learned for each of these rules for each (word, class) and (class, class) pair. The model contained 10,981 weights; the number of ground clauses exceeded 300,000.
SLIDE 22
Experiments: Methodology
- Cora: five-way cross-validation
- WebKB: four-way cross-validation
- Each algorithm was trained for 4 hours with different learning rates on each training set, with a Gaussian prior, based on preliminary experiments.
- After tuning, the algorithms were rerun for 10 hours on the whole data set (including the held-out validation set).
SLIDE 23
Results
SLIDE 24