

SLIDE 1

Learning Task-specific Bilexical Embeddings

Pranava Madhyastha(1), Xavier Carreras(1,2), Ariadna Quattoni(1,2)

(1) Universitat Politècnica de Catalunya

(2) Xerox Research Centre Europe

SLIDE 2

Bilexical Relations

◮ Increasing interest in bilexical relations (relations between pairs of words)

◮ Dependency Parsing: lexical items (words) connected by binary relations

Small birds sing loud songs

[Figure: dependency tree over the sentence, with arcs labelled ROOT, SUBJ, OBJ and NMOD]

◮ Bilexical Predictions can be modelled as Pr(modifier|head)

SLIDE 3

In Focus: Unseen words

Adjective–Noun relation, where an adjective modifies a noun:

Vinyl can be applied to electronic devices and cases

[two candidate NMOD arcs]

◮ If one or more of the above nouns or adjectives has not been observed in the supervision, estimating Pr(adjective|noun) is problematic

◮ Word frequencies follow a Zipf distribution, so most words are rarely (or never) observed

◮ Generalisation is a challenge

SLIDE 4

Distributional Word Space Models

◮ Distributional Hypothesis: linguistic items with similar distributions have similar meanings

[Figure: concordance lines for the word “moon” drawn from a large corpus, e.g. “… through the night with the moon shining so brightly …”, “… a sliver of moon hanging among the stars …”]

◮ For every word w we can compute an n-dimensional vector space representation φ(w) ∈ R^n from a large corpus
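The construction of φ(w) from a corpus can be sketched as plain co-occurrence counting within a context window. This is a minimal illustration, not the representation pipeline used in the paper; the toy sentences and window size are assumptions.

```python
# Sketch: build distributional vectors phi(w) from co-occurrence counts.
# One dimension per context word; corpus and window size are toy choices.
from collections import Counter

import numpy as np

def build_vectors(sentences, window=2):
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = {w: Counter() for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            # count every word within `window` positions of w
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    counts[w][s[j]] += 1
    phi = np.zeros((len(vocab), len(vocab)))  # phi[w] in R^n
    for w, ctx in counts.items():
        for c, n in ctx.items():
            phi[idx[w], idx[c]] = n
    return vocab, phi

sentences = [["the", "moon", "shining"], ["the", "moon", "rises"]]
vocab, phi = build_vectors(sentences)
```

Real systems would use a much larger corpus and typically reweight the raw counts (e.g. with PMI), but the resulting object is the same: a vector per word, indexed by context.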

SLIDE 5

Contributions

Formulation of statistical models to improve bilexical prediction tasks

◮ Supervised framework to learn bilexical models over distributional representations

⇒ based on learning bilinear forms

◮ Compressing representations by imposing low-rank constraints on bilinear forms

◮ Lexical embeddings tailored for a specific bilexical task.

SLIDE 6

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 7

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 8

Unsupervised Bilexical Models

◮ We can define a simple bilexical model as:

Pr(m | h) = exp{⟨φ(m), φ(h)⟩} / Σ_{m′} exp{⟨φ(m′), φ(h)⟩}

where ⟨φ(x), φ(y)⟩ denotes the inner product.

◮ Problem: designing appropriate contexts for the required relations

◮ Solution: leverage a supervised training corpus
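The unsupervised model above is just a softmax over inner products; a minimal sketch, with toy vectors standing in for real distributional representations:

```python
# Sketch of the unsupervised bilexical model: Pr(m|h) as a softmax over
# inner products <phi(m), phi(h)>. The vectors below are toy assumptions.
import numpy as np

def pr_m_given_h(phi_h, phi_M):
    """phi_h: (n,) head vector; phi_M: (|M|, n) candidate modifier vectors."""
    scores = phi_M @ phi_h      # inner products <phi(m'), phi(h)> for all m'
    scores -= scores.max()      # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()          # normalize over all candidate modifiers

phi_h = np.array([1.0, 0.0])
phi_M = np.array([[1.0, 0.0], [0.0, 1.0]])
p = pr_m_given_h(phi_h, phi_M)  # the first modifier, aligned with h, scores higher
```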

SLIDE 9

Supervised bilexical model

◮ We define the bilexical model in a bilinear setting as:

φ(m)⊤ W φ(h)

where φ(m) and φ(h) are n-dimensional representations of m and h, and W ∈ R^{n×n} is a matrix of parameters

SLIDE 10

Interpreting the Bilinear Models

◮ If we write the bilinear model as:

Σ_{i=1}^{n} Σ_{j=1}^{n} f_{i,j}(m, h) W_{i,j},  where f_{i,j}(m, h) = φ(m)[i] φ(h)[j]

⇒ Bilinear models are linear models, with an extended feature space!

⇒ We can re-use all the algorithms designed for linear models.
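The identity on this slide can be checked numerically: the bilinear score equals a linear model over the outer-product feature map. A short sketch with random toy values:

```python
# Check: phi(m)^T W phi(h) == sum_ij f_ij(m,h) * W_ij, where
# f_ij(m,h) = phi(m)[i] * phi(h)[j] (the outer-product feature map).
import numpy as np

rng = np.random.default_rng(0)
n = 4
phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
W = rng.normal(size=(n, n))

bilinear = phi_m @ W @ phi_h             # the bilinear form
features = np.outer(phi_m, phi_h)        # f_ij(m, h), an n*n feature vector
linear = (features * W).sum()            # a plain linear model over f

assert np.isclose(bilinear, linear)      # the two scores agree
```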

SLIDE 11

Using Bilexical Models

◮ We define the bilexical operator as:

Pr(m|h) = exp{φ(m)⊤ W φ(h)} / Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)}

⇒ a standard conditional log-linear model

SLIDE 12

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 13

Rank Constraints

φ(m)⊤ W φ(h)

where φ(m)⊤ is a 1×n row vector (m1, …, mn), W is an n×n parameter matrix with entries w11, …, wnn, and φ(h) is an n×1 column vector (h1, …, hn)⊤

SLIDE 14

Rank Constraints

◮ Factorizing W with the singular value decomposition:

SVD(W) = U Σ V⊤

φ(m)⊤ W φ(h) = φ(m)⊤ U Σ V⊤ φ(h)

where U ∈ R^{n×k}, Σ = diag(σ1, …, σk), and V⊤ ∈ R^{k×n}

◮ Please note: W has rank k

SLIDE 15

Low Rank Embedding

◮ Regrouping, we get:

φ(m)⊤ W φ(h) = (φ(m)⊤ U) Σ (V⊤ φ(h))

◮ We can see φ(m)⊤U as a projection of m, and V⊤φ(h) as a projection of h

◮ ⇒ Rank(W) defines the dimensionality of the induced space, hence the embedding
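The regrouping above can be verified numerically: projecting both words through the SVD factors and scoring in k dimensions gives the same value as the full bilinear form. A sketch with a toy rank-k parameter matrix (all values illustrative):

```python
# Sketch of the low-rank embedding: score phi(m)^T W phi(h) via the
# k-dimensional projections induced by the SVD of W. W is a toy matrix
# built to have rank k, so the truncated SVD is exact.
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
W = rng.normal(size=(n, k)) @ rng.normal(size=(k, n))  # rank-k by construction

U, S, Vt = np.linalg.svd(W)
Uk, Sk, Vk = U[:, :k], np.diag(S[:k]), Vt[:k, :]       # keep top-k factors

phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
em = phi_m @ Uk @ Sk        # k-dim embedding of the modifier
eh = Vk @ phi_h             # k-dim embedding of the head
assert np.isclose(phi_m @ W @ phi_h, em @ eh)          # same score, k dims
```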

SLIDE 16

Computational Properties

◮ In many tasks, given a head, we must rank a huge number of modifiers

◮ Strategy:

◮ Project each lexical item in the vocabulary into its low-dimensional embedding of size k

◮ Compute the bilexical score as a k-dimensional inner product

◮ Substantial computational gain as long as we obtain low-rank models
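The strategy above can be sketched end to end: project the whole vocabulary once offline, then score every modifier for a head with a single k-dimensional matrix product. Sizes and matrices below are toy assumptions.

```python
# Sketch of the scoring strategy: precompute k-dim embeddings for the
# vocabulary, then score all m modifiers with kn + km operations instead
# of working with the full n x n matrix W.
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 50, 5, 1000             # representation size, rank, #modifiers
U = rng.normal(size=(n, k))
V = rng.normal(size=(n, k))
W = U @ V.T                       # a rank-k parameter matrix (toy)

Phi_M = rng.normal(size=(m, n))   # distributional vectors of all modifiers
phi_h = rng.normal(size=n)        # distributional vector of the head

E_M = Phi_M @ U                   # offline: project each modifier once
e_h = V.T @ phi_h                 # online: project the head (kn ops)
scores_fast = E_M @ e_h           # then m inner products in k dims (km ops)

scores_full = Phi_M @ W @ phi_h   # reference: scoring with the full W
assert np.allclose(scores_fast, scores_full)
```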

SLIDE 17

Summary

◮ Induce high-dimensional representations from a huge corpus

◮ Learn embeddings suited for a given task

◮ Our bilexical formulation is, in principle, a linear model, but with an extended feature space

◮ Low-rank bilexical embeddings are computationally efficient

SLIDE 18

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 19

Formulation

◮ Given:

◮ A set of training tuples D = (m1, h1) … (ml, hl), where the m are modifiers and the h are heads
◮ Distributional representations φ(m) and φ(h), computed over some corpus

◮ We set it as a conditional log-linear distribution:

Pr(m|h) = exp{φ(m)⊤ W φ(h)} / Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)}

SLIDE 20

Learning and Regularization

◮ Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:

log Pr(D) = Σ_{(m,h)∈D} [ φ(m)⊤ W φ(h) − log Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)} ]

◮ Adding a regularization penalty, our algorithm essentially maximizes:

Σ_{(m,h)∈D} log Pr(m | h) − λ ‖W‖_p

◮ Regularization using the proximal gradient method (FOBOS):

◮ ℓ1 regularization, ‖W‖_1 ⇒ sparse feature space
◮ ℓ2 regularization, ‖W‖_2 ⇒ dense parameters
◮ ℓ∗ regularization, ‖W‖_∗ ⇒ low-rank embedding

SLIDE 21

Algorithm: Proximal Algorithm for Bilexical Operators

1  while iteration < MaxIteration do
2    W_{t+0.5} = W_t − η_t g(W_t);                  // gradient of neg log-likelihood
     /* adding the regularization penalty:
        W_{t+1} = argmin_W ||W_{t+0.5} − W||²_2 + η_t λ r(W),
        solved with the proximal operator */
3    if ℓ1 regularizer then
4      W_{t+1}(i,j) = sign(W_{t+0.5}(i,j)) · max(|W_{t+0.5}(i,j)| − η_t λ, 0);
                                                    // basic thresholding operation
5    else if ℓ2 regularizer then
6      W_{t+1} = W_{t+0.5} / (1 + η_t λ);           // basic scaling operation
7    else if nuclear norm regularizer then
8      W_{t+0.5} = U Σ V⊤;                          // SVD
9      σ̄_i = max(σ_i − η_t λ, 0);                  // σ_i = the i-th element of Σ
10     W_{t+1} = U Σ̄ V⊤;
11 end
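The three proximal updates in the algorithm above can be written as small, runnable functions; a minimal sketch, where the step size η and penalty λ are illustrative values, not the ones tuned in the paper:

```python
# Sketch of the three FOBOS proximal updates, applied to the gradient
# step W_half = W_t - eta * g(W_t). eta and lam are toy values.
import numpy as np

def prox_l1(W_half, eta, lam):
    """Soft thresholding: drives entries to zero (sparse W)."""
    return np.sign(W_half) * np.maximum(np.abs(W_half) - eta * lam, 0.0)

def prox_l2(W_half, eta, lam):
    """Uniform scaling: shrinks all entries (dense W)."""
    return W_half / (1.0 + eta * lam)

def prox_nuclear(W_half, eta, lam):
    """Singular-value thresholding: drives singular values to zero (low-rank W)."""
    U, s, Vt = np.linalg.svd(W_half, full_matrices=False)
    s_bar = np.maximum(s - eta * lam, 0.0)   # threshold each singular value
    return U @ np.diag(s_bar) @ Vt

W_half = np.array([[2.0, -0.5], [0.2, -3.0]])
W_l1 = prox_l1(W_half, eta=1.0, lam=1.0)     # small entries become exactly 0
```

Note that the ℓ1 update must threshold the absolute value of each entry; applying `max(w − ηλ, 0)` without the absolute value would zero out all negative weights.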

SLIDE 22

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 23

Experiments

◮ Tasks:

◮ Noun–Adjective relations: Pr(adjective|noun) and Pr(noun|adjective)
◮ Verb–Object relations: Pr(object|verb) and Pr(verb|object)

◮ Data:

◮ Supervised corpus: gold-standard dependencies of the Penn Treebank
◮ We partition the heads of head-modifier relations into three parts: 60% of heads for training, 10% for validation and 30% for test
◮ No heads from the test set were seen in the training set
◮ Corpora for distributional representations: the BLLIP corpus

◮ Training: for each head word, using the supervised data, we compile a list of compatible and incompatible modifiers

SLIDE 24

Results

Noun      | Predicted Adjectives
president | executive, senior, chief, frank, former, international, marketing, assistant, annual, financial
wife      | former, executive, new, financial, own, senior, old, other, deputy, major
shares    | annual, due, net, convertible, average, new, high-yield, initial, tax-exempt, subordinated
mortgages | annualized, annual, three-month, one-year, average, six-month, conventional, short-term, higher, lower
month     | last, next, fiscal, first, past, latest, early, previous, new, current
problem   | new, good, major, tough, bad, big, first, financial, long, federal
holiday   | new, major, special, fourth-quarter, joint, quarterly, third-quarter, small, strong, own

Table: 10 most likely adjectives for some nouns

SLIDE 25

Results

[Figure: pairwise accuracy vs. number of operations, Objects given Verb; curves for the unsupervised, NN, L1 and L2 models]

◮ Pairwise accuracy: a measure over compatible/incompatible modifiers

◮ Capacity of the model: given the head, the number of double operations required to compute scores for all modifiers

◮ In general, if the representation size is n and there are m modifiers, then:

◮ ℓ1 & ℓ2 ⇒ if W has d non-zero weights ⇒ dm operations
◮ ℓ∗ ⇒ if the rank of W is k ⇒ kn + km operations

SLIDE 26

[Figure: four panels (Adjectives given Noun, Nouns given Adjective, Objects given Verb, Verbs given Object), each plotting pairwise accuracy against number of operations for the unsupervised, NN, L1 and L2 models]

Figure: Pairwise accuracy vs number of double operations to compute the distribution over m

SLIDE 27

Prepositional Phrase attachment

Verb (v), Object (o), Modifier (m)

[Figure: PP-attachment ambiguity, with an NMOD arc attaching the modifier either to the object (NOMINAL(prep)) or to the verb (VERBAL(prep))]

◮ Given: for every preposition p, a set of training tuples Dp = {(v, o, p, m, y)1 … (v, o, p, m, y)l}

◮ Distributional representations: φ(v), φ(o), φ(m)

Pr(y=V | v, o, p, m) = exp{φ(v)⊤ W_p^V φ(m)} / Z
Pr(y=O | v, o, p, m) = exp{φ(o)⊤ W_p^O φ(m)} / Z

◮ Does the bilinear model complement the linear model?

◮ For a constant λ ∈ [0, 1]:

Pr(y|x) = λ Pr_L(y|x) + (1 − λ) Pr_B(y|x)
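The interpolation above is a simple convex mixture of the two models' distributions; a minimal sketch with toy probability values standing in for the linear and bilinear models:

```python
# Sketch of the interpolated model: mix the linear and bilinear attachment
# distributions with a constant lambda. The distributions are toy values.
import numpy as np

def interpolate(p_linear, p_bilinear, lam):
    """Convex combination of two distributions over the same outcomes."""
    return lam * p_linear + (1.0 - lam) * p_bilinear

p_L = np.array([0.7, 0.3])          # Pr_L(y|x) over y in {V, O}
p_B = np.array([0.4, 0.6])          # Pr_B(y|x)
p = interpolate(p_L, p_B, lam=0.5)  # still a valid distribution: sums to 1
```

Since both inputs sum to 1 and λ ∈ [0, 1], the mixture is itself a probability distribution, so no renormalization is needed.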

SLIDE 28

Results

[Figure: attachment accuracy for the prepositions “for”, “from” and “with”; bars for the bilinear L1, bilinear L2, bilinear NN, linear, interpolated L1, interpolated L2 and interpolated NN models]

Figure: Attachment accuracies of linear, bilinear and interpolated models for three prepositions

SLIDE 29

Conclusion

◮ We have presented a semi-supervised bilexical model that has the potential to generalize over unseen words

◮ We have proposed a method to learn low-rank embeddings for scoring bilexical relations efficiently

◮ We want to apply this idea to other bilexical tasks in NLP

◮ We want to explore how to combine other feature representations with low-rank bilexical operators.

SLIDE 30

Thank You
