

1. Learning Task-specific Bilexical Embeddings. Pranava Madhyastha (1), Xavier Carreras (1,2), Ariadna Quattoni (1,2). (1) Universitat Politècnica de Catalunya, (2) Xerox Research Centre Europe

2. Bilexical Relations
◮ Increasing interest in bilexical relations (relations between pairs of words)
◮ Dependency parsing: lexical items (words) connected by binary relations, e.g. for "Small birds sing loud songs": ROOT → sing, SUBJ(sing → birds), NMOD(birds → Small), OBJ(sing → songs), NMOD(songs → loud)
◮ Bilexical predictions can be modelled as Pr(modifier | head)

3. In Focus: Unseen Words
Adjective-noun relation, where an adjective modifies a noun:
"Vinyl can be applied to electronic devices and cases" (NMOD? NMOD?: does "electronic" modify "devices", "cases", or both?)
◮ If one or more of the above nouns or adjectives has not been observed in the supervision, we must still estimate Pr(adjective | noun)
◮ Word frequencies follow a Zipf distribution, so unseen words are unavoidable
◮ Generalisation is a challenge

4. Distributional Word Space Models
◮ Distributional Hypothesis: linguistic items with similar distributions have similar meanings
[Concordance example: contexts of the word "moon" in a large corpus, e.g. "the curtains open and the moon shining in", "surely under a crescent moon", "the moon has risen full and cold", "man's first step on the moon", ...]
◮ For every word w we can compute an n-dimensional vector-space representation $\phi(w) \in \mathbb{R}^n$ from a large corpus
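As a minimal sketch of this idea (not the authors' actual pipeline), distributional vectors can be obtained from raw co-occurrence counts in a fixed window; the function name, window size, and toy corpus below are assumptions:

```python
from collections import Counter, defaultdict

def distributional_vectors(tokens, window=2):
    """For each word, count the words occurring within +/- `window`
    positions; each count vector plays the role of phi(w)."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

vecs = distributional_vectors("the moon has risen full and the moon is cold".split())
print(vecs["moon"])  # contexts aggregated across both occurrences of "moon"
```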

5. Contributions
Formulation of statistical models to improve bilexical prediction tasks:
◮ A supervised framework to learn bilexical models over distributional representations, based on learning bilinear forms
◮ Compressing representations by imposing low-rank constraints on the bilinear forms
◮ Lexical embeddings tailored for a specific bilexical task

6. Overview
◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

7. Overview (next: Bilexical Models)

8. Unsupervised Bilexical Models
◮ We can define a simple bilexical model as:
$$\Pr(m \mid h) = \frac{\exp\{\langle \phi(m), \phi(h) \rangle\}}{\sum_{m'} \exp\{\langle \phi(m'), \phi(h) \rangle\}}$$
where $\langle \phi(x), \phi(y) \rangle$ denotes the inner product.
◮ Problem: designing appropriate contexts for the required relations
◮ Solution: leverage a supervised training corpus
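A minimal NumPy sketch of this distribution, assuming the candidate modifier vectors are stacked as rows of a matrix (the names are illustrative):

```python
import numpy as np

def unsupervised_distribution(phi_h, Phi_M):
    """Pr(. | h): softmax over the inner products <phi(m'), phi(h)>,
    where row i of Phi_M is the vector of candidate modifier m'_i."""
    scores = Phi_M @ phi_h
    scores = scores - scores.max()   # stabilise the exponentials
    expd = np.exp(scores)
    return expd / expd.sum()

# toy example: 4 candidate modifiers in a 3-dimensional space
Phi_M, phi_h = np.random.randn(4, 3), np.random.randn(3)
print(unsupervised_distribution(phi_h, Phi_M))  # non-negative, sums to 1
```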

9. Supervised Bilexical Model
◮ We define the bilexical model in a bilinear setting as:
$$\phi(m)^\top W \phi(h)$$
where $\phi(m)$ and $\phi(h)$ are n-dimensional representations of m and h, and $W \in \mathbb{R}^{n \times n}$ is a matrix of parameters
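In code the bilinear score is a one-liner; a sketch with illustrative names:

```python
import numpy as np

def bilinear_score(phi_m, W, phi_h):
    """phi(m)^T W phi(h): score of a (modifier, head) pair."""
    return phi_m @ W @ phi_h

n = 5
W = np.random.randn(n, n)            # the parameters to be learned
phi_m, phi_h = np.random.randn(n), np.random.randn(n)
print(bilinear_score(phi_m, W, phi_h))
```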

10. Interpreting the Bilinear Models
◮ If we write the bilinear model as:
$$\sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j}(m, h)\, W_{i,j} \qquad \text{with} \qquad f_{i,j}(m, h) = \phi(m)[i]\, \phi(h)[j]$$
◮ ⇒ Bilinear models are linear models, with an extended feature space!
◮ ⇒ We can re-use all the algorithms designed for linear models.
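This equivalence is easy to verify numerically; a small sketch, where the extended feature map is the outer product of the two representations:

```python
import numpy as np

n = 4
W = np.random.randn(n, n)
phi_m, phi_h = np.random.randn(n), np.random.randn(n)

bilinear = phi_m @ W @ phi_h
# the same score, as a linear model over outer-product features:
features = np.outer(phi_m, phi_h)   # f_{i,j}(m, h) = phi(m)[i] * phi(h)[j]
linear = np.sum(features * W)       # <vec(features), vec(W)>

assert np.isclose(bilinear, linear)
```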

11. Using Bilexical Models
◮ We define the bilexical operator as:
$$\Pr(m \mid h) = \frac{\exp\{\phi(m)^\top W \phi(h)\}}{\sum_{m' \in \mathcal{M}} \exp\{\phi(m')^\top W \phi(h)\}}$$
⇒ a standard conditional log-linear model
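This is the earlier softmax sketch with W inserted between the two representations; vectorized over all candidate modifiers (illustrative names):

```python
import numpy as np

def conditional_distribution(Phi_M, W, phi_h):
    """Pr(. | h) over all candidate modifiers (rows of Phi_M) at once."""
    scores = Phi_M @ W @ phi_h       # one bilinear score per candidate
    scores = scores - scores.max()   # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()
```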

12. Overview (next: Low Rank Constraints)

13. Rank Constraints
$$\phi(m)^\top W \phi(h) = \underbrace{\begin{pmatrix} m_1 & m_2 & \cdots & m_n \end{pmatrix}}_{\phi(m)^\top} \underbrace{\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{pmatrix}}_{W} \underbrace{\begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{pmatrix}}_{\phi(h)}$$

14. Rank Constraints
◮ Factorizing W:
$$\phi(m)^\top W \phi(h) = \underbrace{\begin{pmatrix} m_1 & \cdots & m_n \end{pmatrix}}_{\phi(m)^\top} \underbrace{\begin{pmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{n1} & \cdots & u_{nk} \end{pmatrix}}_{U} \underbrace{\begin{pmatrix} \sigma_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_k \end{pmatrix}}_{\Sigma} \underbrace{\begin{pmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & & \vdots \\ v_{k1} & \cdots & v_{kn} \end{pmatrix}}_{V^\top} \underbrace{\begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{pmatrix}}_{\phi(h)}$$
with $\mathrm{SVD}(W) = U \Sigma V^\top$
◮ Please note: W has rank k
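A sketch of the rank-k truncation using NumPy's SVD (by Eckart-Young, this is the best rank-k approximation of W in Frobenius norm); the function name is an assumption:

```python
import numpy as np

def truncate_rank(W, k):
    """Keep the top-k singular triplets of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

n, k = 6, 2
W = np.random.randn(n, n)
U, S, Vt = truncate_rank(W, k)
print(np.linalg.matrix_rank(U @ S @ Vt))   # 2
```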

15. Low Rank Embedding
◮ Regrouping, we get:
$$\phi(m)^\top W \phi(h) = \big(\phi(m)^\top U\big)\, \Sigma\, \big(V^\top \phi(h)\big)$$
◮ We can see $\phi(m)^\top U$ as a projection of m, and $V^\top \phi(h)$ as a projection of h
◮ ⇒ rank(W) defines the dimensionality of the induced space, hence the embedding

16. Computational Properties
◮ In many tasks, given a head, we must rank a huge number of modifiers
◮ Strategy (see the sketch below):
◮ Project each lexical item in the vocabulary into its low-dimensional embedding of size k
◮ Compute the bilexical score as a k-dimensional inner product
◮ Substantial computational gain, as long as we obtain low-rank models
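A sketch of this strategy, assuming U, Σ, Vᵀ come from the rank-k factorization above (Σ can be folded into either projection):

```python
import numpy as np

def rank_modifiers(Phi_M, U, S, Vt, phi_h):
    """Rank all candidate modifiers against one head in the k-dim space.
    Phi_M: |M| x n modifier vectors; U: n x k; S: k x k; Vt: k x n."""
    M_emb = Phi_M @ U                 # precomputable once for the vocabulary
    h_emb = S @ (Vt @ phi_h)          # computed once per head
    scores = M_emb @ h_emb            # |M| inner products of size k, not n
    return np.argsort(-scores)        # modifier indices, best first
```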

17. Summary
◮ Induce high-dimensional representations from a huge corpus
◮ Learn embeddings suited to a given task
◮ Our bilexical formulation is, in principle, a linear model, but with an extended feature space
◮ Low-rank bilexical embeddings are computationally efficient

18. Overview (next: Learning)

19. Formulation
◮ Given:
◮ a set of training tuples $D = (m_1, h_1), \ldots, (m_l, h_l)$, where the m are modifiers and the h are heads
◮ distributional representations $\phi(m)$ and $\phi(h)$ computed over some corpus
◮ We set it as a conditional log-linear distribution:
$$\Pr(m \mid h) = \frac{\exp\{\phi(m)^\top W \phi(h)\}}{\sum_{m' \in \mathcal{M}} \exp\{\phi(m')^\top W \phi(h)\}}$$

20. Learning and Regularization
◮ Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:
$$\log \Pr(D) = \sum_{(m,h) \in D} \Big( \phi(m)^\top W \phi(h) - \log \sum_{m' \in \mathcal{M}} \exp\{\phi(m')^\top W \phi(h)\} \Big)$$
◮ Adding a regularization penalty, our algorithm essentially minimizes:
$$- \sum_{(m,h) \in D} \log \Pr(m \mid h) + \lambda \|W\|_p$$
◮ Regularization using the proximal gradient method (FOBOS), sketched after the algorithm below:
◮ $\ell_1$ regularization, $\|W\|_1$ ⇒ sparse feature space
◮ $\ell_2$ regularization, $\|W\|_2$ ⇒ dense parameters
◮ $\ell_*$ (nuclear norm) regularization, $\|W\|_*$ ⇒ low-rank embedding

21. Algorithm: Proximal Algorithm for Bilexical Operators

while iteration < MaxIteration do
    W_{t+0.5} = W_t − η_t g(W_t)    // g(W_t): gradient of the negative log-likelihood
    /* regularization step: W_{t+1} = argmin_W ||W_{t+0.5} − W||²₂ + η_t λ r(W),
       solved in closed form by the proximal operator of r */
    if ℓ1 regularizer then
        W_{t+1}(i,j) = sign(W_{t+0.5}(i,j)) · max(|W_{t+0.5}(i,j)| − η_t λ, 0)   // soft thresholding
    else if ℓ2 regularizer then
        W_{t+1} = W_{t+0.5} / (1 + η_t λ)                                        // scaling
    else if nuclear-norm regularizer then
        U Σ V⊤ ← SVD(W_{t+0.5})
        σ̄_i = max(σ_i − η_t λ, 0)      // shrink the i-th singular value of Σ
        W_{t+1} = U Σ̄ V⊤               // Σ̄ is Σ with the shrunken values
end
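A runnable sketch of the three proximal operators (standard soft thresholding, scaling, and singular-value thresholding), plus a plain fixed-step driver; the function names are assumptions:

```python
import numpy as np

def prox_l1(W, tau):
    """Soft thresholding: proximal operator of tau * ||W||_1."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def prox_l2(W, tau):
    """Scaling: proximal operator of the squared-norm penalty."""
    return W / (1.0 + tau)

def prox_nuclear(W, tau):
    """Singular-value shrinkage: prox of tau * ||W||_*; induces low rank."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def proximal_step(W, grad, eta, lam, prox):
    """One iteration: gradient step on the neg. log-likelihood, then prox."""
    return prox(W - eta * grad, eta * lam)
```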

22. Overview (next: Experiments)
