Learning Task-specific Bilexical Embeddings
Pranava Madhyastha (1), Xavier Carreras (1,2), Ariadna Quattoni (1,2)
(1) Universitat Politècnica de Catalunya
(2) Xerox Research Centre Europe

Bilexical Relations
◮ Increasing interest in bilexical relations (relations between pairs of words)
◮ Dependency Parsing: lexical items (words) connected by binary relations
[Dependency tree of "Small birds sing loud songs", with arcs ROOT, NMOD, SUBJ, OBJ, NMOD]
◮ Bilexical Predictions can be modelled as Pr(modifier|head)
In Focus: Unseen words
Adjective-noun relation, where an adjective modifies a noun:
"Vinyl can be applied to electronic devices and cases" (which NMOD arcs hold?)
◮ If one or more of these nouns or adjectives has not been observed in the supervised data, estimating Pr(adjective | noun) is hard
◮ Word frequencies follow a Zipf distribution, so unseen words are unavoidable
◮ Generalisation is a challenge
Distributional Word Space Models
◮ Distributional Hypothesis: linguistic items with similar distributions have similar meanings
[Concordance of the word "moon" in context, drawn from a large corpus]
◮ For every word w we can compute an n-dimensional vector-space representation φ(w) ∈ R^n from a large corpus
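As a minimal sketch of this step (not the exact recipe used in the talk): count, for every word, the context words that occur within a fixed window. The window size, vocabulary and all names below are illustrative assumptions.

    import numpy as np
    from collections import defaultdict

    def distributional_vectors(sentences, context_vocab, window=2):
        """Count-based context vectors phi(w): co-occurrence counts within a window."""
        index = {c: i for i, c in enumerate(context_vocab)}
        phi = defaultdict(lambda: np.zeros(len(context_vocab)))
        for sent in sentences:
            for i, w in enumerate(sent):
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i and sent[j] in index:
                        phi[w][index[sent[j]]] += 1.0
        return phi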
Contributions
Formulation of statistical models to improve bilexical prediction tasks
◮ Supervised framework to learn bilexical models over distributional representations
  ⇒ based on learning bilinear forms
◮ Compressing representations by imposing low-rank constraints on bilinear forms
◮ Lexical embeddings tailored for a specific bilexical task
Overview
◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments
Unsupervised Bilexical Models
◮ We can define a simple bilexical model as:
Pr(m | h) = exp{⟨φ(m), φ(h)⟩} / Σ_{m′} exp{⟨φ(m′), φ(h)⟩}
where ⟨φ(x), φ(y)⟩ denotes the inner product.
◮ Problem: designing appropriate contexts for the required relations
◮ Solution: leverage a supervised training corpus
Supervised bilexical model
◮ We define the bilexical model in a bilinear setting as:
φ(m)⊤ W φ(h)
where:
◮ φ(m) and φ(h) are n-dimensional representations of m and h
◮ W ∈ R^{n×n} is a matrix of parameters
Interpreting the Bilinear Models
◮ If we write the bilinear model as:
φ(m)⊤ W φ(h) = Σ_{i=1}^{n} Σ_{j=1}^{n} f_{i,j}(m, h) W_{i,j}
◮ where f_{i,j}(m, h) = φ(m)[i] φ(h)[j]
⇒ Bilinear models are linear models, with an extended feature space!
⇒ We can re-use all the algorithms designed for linear models.
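This equivalence is easy to verify numerically; a small sketch with illustrative dimensions (all names hypothetical):

    import numpy as np

    n = 5
    rng = np.random.default_rng(0)
    phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
    W = rng.normal(size=(n, n))

    bilinear = phi_m @ W @ phi_h                         # phi(m)^T W phi(h)
    linear = np.outer(phi_m, phi_h).ravel() @ W.ravel()  # <f(m,h), vec(W)>
    assert np.allclose(bilinear, linear)                 # identical scores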
Using Bilexical Models
◮ We define the bilexical operator as:
Pr(m | h) = exp{φ(m)⊤ W φ(h)} / Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)}
⇒ A standard conditional log-linear model
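A minimal numpy sketch of this operator, assuming phi maps each word to its distributional vector (all names hypothetical):

    import numpy as np

    def bilexical_pr(modifiers, head, phi, W):
        """Pr(m | h) proportional to exp(phi(m)^T W phi(h)) over candidates M."""
        scores = np.array([phi[m] @ W @ phi[head] for m in modifiers])
        scores -= scores.max()            # for numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()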
Overview
◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments
Rank Constraints
φ(m)⊤ W φ(h) = [m1 m2 ⋯ mn] (wij)_{n×n} [h1 h2 ⋯ hn]⊤
i.e. a row vector, times the full n×n parameter matrix W, times a column vector.
Rank Constraints
◮ Factorizing W with the SVD: W = U Σ V⊤
◮ U ∈ R^{n×k}, Σ = diag(σ1, …, σk), V⊤ ∈ R^{k×n}
φ(m)⊤ W φ(h) = φ(m)⊤ U Σ V⊤ φ(h)
◮ Please note: W has rank k
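A sketch of how the two embedding maps can be read off a learned W by truncated SVD (k and all names illustrative):

    import numpy as np

    def low_rank_embeddings(W, k):
        """Truncated SVD W ~ U_k diag(s_k) V_k^T, with singular values folded in."""
        U, s, Vt = np.linalg.svd(W)
        A = U[:, :k] * np.sqrt(s[:k])     # modifier projection: phi(m)^T A
        B = Vt[:k, :].T * np.sqrt(s[:k])  # head projection:     B^T phi(h)
        return A, B                       # phi(m)^T W phi(h) ~ (phi(m)^T A) @ (B^T phi(h))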
Low Rank Embedding
◮ Regrouping, we get:
φ(m)⊤ W φ(h) = (φ(m)⊤ U) Σ (V⊤ φ(h))
◮ We can see φ(m)⊤U as a projection of m, and V⊤φ(h) as a projection of h
◮ ⇒ Rank(W) defines the dimensionality of the induced space, hence the embedding
Computational Properties
◮ In many tasks, given a head, we must rank a huge number of modifiers
◮ Strategy (see the sketch below):
  ◮ Project each lexical item in the vocabulary into its low-dimensional embedding of size k
  ◮ Compute the bilexical score as a k-dimensional inner product
◮ Substantial computational gain as long as we obtain low-rank models
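A minimal sketch of this strategy, assuming rank-k factors A, B of W as in the previous sketch, and Phi_M, an m×n matrix stacking all modifier vectors (all names and sizes illustrative):

    import numpy as np

    n, m, k = 1000, 10000, 30
    rng = np.random.default_rng(0)
    Phi_M = rng.normal(size=(m, n))  # stacked modifier vectors (placeholder data)
    A = rng.normal(size=(n, k))      # modifier-side factor of W (e.g. from its SVD)
    B = rng.normal(size=(n, k))      # head-side factor of W

    M_low = Phi_M @ A                # precompute once: m x k modifier embeddings

    def score_all_modifiers(phi_h):
        h_low = B.T @ phi_h          # k-dimensional head embedding: k*n operations
        return M_low @ h_low         # scores for all m modifiers:   k*m operations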
Summary
◮ Induce high-dimensional representations from a huge corpus
◮ Learn embeddings suited for a given task
◮ Our bilexical formulation is, in principle, a linear model, but with an extended feature space
◮ Low-rank bilexical embeddings are computationally efficient
Overview
◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments
Formulation
◮ Given:
  ◮ a set of training tuples D = {(m1, h1), . . . , (ml, hl)}, where the m are modifiers and the h are heads
  ◮ distributional representations φ(m) and φ(h), computed over some corpus
◮ We model it as a conditional log-linear distribution:
Pr(m | h) = exp{φ(m)⊤ W φ(h)} / Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)}
Learning and Regularization
◮ Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:
log Pr(D) = Σ_{(m,h)∈D} [ φ(m)⊤ W φ(h) − log Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)} ]
◮ Adding a regularization penalty, our algorithm essentially maximizes:
Σ_{(m,h)∈D} log Pr(m | h) − λ‖W‖_p
◮ Regularization using the proximal gradient method (FOBOS):
  ◮ ℓ1 regularization, ‖W‖₁ ⇒ sparse feature space
  ◮ ℓ2 regularization, ‖W‖₂ ⇒ dense parameters
  ◮ ℓ∗ (nuclear norm) regularization, ‖W‖∗ ⇒ low-rank embedding
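For concreteness, a minimal sketch of g(W), the gradient of the negative log-likelihood for a single (m, h) pair, as used in the algorithm below (names hypothetical):

    import numpy as np

    def nll_gradient(m, h, modifiers, phi, W):
        """Gradient of -log Pr(m | h) w.r.t. W: expected minus observed outer products."""
        scores = np.array([phi[mp] @ W @ phi[h] for mp in modifiers])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        expected = sum(p * np.outer(phi[mp], phi[h]) for p, mp in zip(probs, modifiers))
        return expected - np.outer(phi[m], phi[h])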
Algorithm: Proximal Algorithm for Bilexical Operators
while iteration < MaxIteration do
    W_{t+0.5} = W_t − η_t g(W_t)    // g: gradient of the negative log-likelihood
    /* regularization step: W_{t+1} = argmin_W ‖W_{t+0.5} − W‖²₂ + η_t λ r(W), */
    /* solved in closed form with the proximal operator of r:                  */
    if ℓ1 regularizer then
        W_{t+1}(i, j) = sign(W_{t+0.5}(i, j)) · max(|W_{t+0.5}(i, j)| − η_t λ, 0)    // basic thresholding operation
    else if ℓ2 regularizer then
        W_{t+1} = W_{t+0.5} / (1 + η_t λ)    // basic scaling operation
    else if nuclear norm regularizer then
        U Σ V⊤ = SVD(W_{t+0.5})
        σ̄_i = max(σ_i − η_t λ, 0)    // σ_i: the i-th singular value in Σ
        W_{t+1} = U Σ̄ V⊤    // thresholding of singular values
end
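The three proximal steps above, rendered as a small numpy sketch; t stands for η_t·λ:

    import numpy as np

    def prox_l1(W, t):
        """Soft thresholding: entry-wise shrinkage, yields a sparse W."""
        return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

    def prox_l2(W, t):
        """Uniform scaling: shrinks all entries, yields dense parameters."""
        return W / (1.0 + t)

    def prox_nuclear(W, t):
        """Soft thresholding of singular values, yields a low-rank W."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return (U * np.maximum(s - t, 0.0)) @ Vt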
Overview
◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments
Experiments
◮ Tasks:
  ◮ Noun-adjective relations: Pr(adjective | noun) and Pr(noun | adjective)
  ◮ Verb-object relations: Pr(object | verb) and Pr(verb | object)
◮ Data:
  ◮ Supervised corpus: gold-standard dependencies of the Penn Treebank
  ◮ We partition the heads of head-modifier relations into three parts: 60% of heads for training, 10% for validation and 30% for test
  ◮ No heads from the test set appear in the training set
  ◮ Corpus for distributional representations: the BLLIP corpus
◮ Training: for each head word, using the supervised data, we compile a list of compatible and incompatible modifiers
Results
Nouns      Predicted Adjectives
president  executive, senior, chief, frank, former, international, marketing, assistant, annual, financial
wife       former, executive, new, financial, own, senior, old, other, deputy, major
shares     annual, due, net, convertible, average, new, high-yield, initial, tax-exempt, subordinated
mortgages  annualized, annual, three-month, one-year, average, six-month, conventional, short-term, higher, lower
month      last, next, fiscal, first, past, latest, early, previous, new, current
problem    new, good, major, tough, bad, big, first, financial, long, federal
holiday    new, major, special, fourth-quarter, joint, quarterly, third-quarter, small, strong, own

Table: 10 most likely adjectives for some nouns
Results
[Plot: pairwise accuracy vs. number of operations for Objects given Verb; curves for the unsupervised, NN (nuclear norm), L1 and L2 models]
◮ Pairwise accuracy: measures how often compatible modifiers are ranked above incompatible ones
◮ Capacity of the model: given the head, the number of double operations required to compute scores for all modifiers
◮ In general, if the representation has dimension n and there are m modifiers, then:
  ◮ ℓ1 & ℓ2 ⇒ if W has d non-zero weights ⇒ dm operations
  ◮ ℓ∗ ⇒ if W has rank k ⇒ kn + km operations
◮ (For illustration, with n = 1,000 and m = 10,000: a fully dense W has d = 10^6 non-zero weights, so dm = 10^10 operations, while a rank k = 30 model needs only kn + km = 3.3 × 10^5.)
[Four plots: pairwise accuracy vs. number of double operations for Adjectives given Noun, Nouns given Adjective, Objects given Verb and Verbs given Object; each compares the unsupervised, NN, L1 and L2 models]
Figure: Pairwise accuracy vs. number of double operations to compute the distribution over m
Prepositional Phrase attachment
[Diagram: PP attachment for a verb (v), its object (o) and a modifier (m); the preposition attaches either to the object (NOMINAL) or to the verb (VERBAL)]
◮ Given: for every preposition p, a set of training tuples Dp = {(v, o, p, m, y)1 . . . (v, o, p, m, y)l}
◮ Distributional representations: φ(v), φ(o), φ(m)
Pr(y=V | v, o, p, m) = exp{φ(v)⊤ W^V_p φ(m)} / Z   and   Pr(y=O | v, o, p, m) = exp{φ(o)⊤ W^O_p φ(m)} / Z
where Z = exp{φ(v)⊤ W^V_p φ(m)} + exp{φ(o)⊤ W^O_p φ(m)}
◮ Does the bilinear model complement the linear model?
◮ For a constant λ ∈ [0, 1]:
Pr(y | x) = λ Pr_L(y | x) + (1 − λ) Pr_B(y | x)
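A one-line sketch of this interpolation, with pr_linear and pr_bilinear as hypothetical stand-ins for the two trained models:

    def interpolated_pr(y, x, lam, pr_linear, pr_bilinear):
        """Mixture of the linear and bilinear models, lam in [0, 1]."""
        return lam * pr_linear(y, x) + (1.0 - lam) * pr_bilinear(y, x)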
Results
[Bar chart: attachment accuracy (55–80%) for the prepositions "for", "from" and "with"; bars for bilinear L1, bilinear L2, bilinear NN, linear, interpolated L1, interpolated L2 and interpolated NN]
Figure: Attachment accuracies of linear, bilinear and interpolated models for three prepositions
Conclusion
◮ We have presented a semi-supervised bilexical model that has the potential to generalize over unseen words
◮ We have proposed a method to learn low-rank embeddings for scoring bilexical relations efficiently
◮ We want to apply this idea to other bilexical tasks in NLP
◮ We want to explore how to combine other feature representations with low-rank bilexical operators
Thank You