

SLIDE 1

Learning Task-specific Bilexical Embeddings

Pranava Madhyastha(1), Xavier Carreras(1,2), Ariadna Quattoni(1,2)

(1) Universitat Politècnica de Catalunya

(2) Xerox Research Centre Europe

SLIDE 2

Bilexical Relations

◮ Increasing interest in bilexical relations (relations between pairs of words)

◮ Dependency Parsing: lexical items (words) connected by binary relations

Small birds sing loud songs

[Figure: dependency tree over the sentence, with arcs labelled ROOT, SUBJ, OBJ and NMOD]

◮ Bilexical Predictions can be modelled as Pr(modifier|head)

SLIDE 3

In Focus: Unseen words

Adjective–Noun relation, where an adjective modifies a noun:

Vinyl can be applied to electronic devices and cases

[two candidate NMOD arcs]

◮ If one or more of the above nouns or adjectives has not been observed in the supervision, estimating Pr(adjective|noun) is problematic

◮ Word frequencies follow a Zipf distribution, so most words are rarely (or never) observed

◮ Generalisation is a challenge

SLIDE 4

Distributional Word Space Models

◮ Distributional Hypothesis: linguistic items with similar distributions have similar meanings

[Figure: concordance lines for the word “moon” drawn from a large corpus, e.g. “… through the night with the moon shining so brightly …”, “… a sliver of moon hanging among the stars …”]

◮ For every word w we can compute an n-dimensional vector space representation φ(w) ∈ R^n from a large corpus
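The construction of φ(w) from a corpus can be sketched as plain co-occurrence counting within a context window. This is a minimal illustration, not the representation pipeline used in the paper; the toy sentences and window size are assumptions.

```python
# Sketch: build distributional vectors phi(w) from co-occurrence counts.
# One dimension per context word; corpus and window size are toy choices.
from collections import Counter

import numpy as np

def build_vectors(sentences, window=2):
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = {w: Counter() for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            # count every word within `window` positions of w
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    counts[w][s[j]] += 1
    phi = np.zeros((len(vocab), len(vocab)))  # phi[w] in R^n
    for w, ctx in counts.items():
        for c, n in ctx.items():
            phi[idx[w], idx[c]] = n
    return vocab, phi

sentences = [["the", "moon", "shining"], ["the", "moon", "rises"]]
vocab, phi = build_vectors(sentences)
```

Real systems would use a much larger corpus and typically reweight the raw counts (e.g. with PMI), but the resulting object is the same: a vector per word, indexed by context.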

SLIDE 5

Contributions

Formulation of statistical models to improve bilexical prediction tasks

◮ Supervised framework to learn bilexical models over distributional representations

⇒ based on learning bilinear forms

◮ Compressing representations by imposing low-rank constraints on bilinear forms

◮ Lexical embeddings tailored for a specific bilexical task.

SLIDE 6

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 7

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 8

Unsupervised Bilexical Models

◮ We can define a simple bilexical model as:

Pr(m | h) = exp{⟨φ(m), φ(h)⟩} / Σ_{m′} exp{⟨φ(m′), φ(h)⟩}

where ⟨φ(x), φ(y)⟩ denotes the inner product.

◮ Problem: designing appropriate contexts for the required relations

◮ Solution: leverage a supervised training corpus
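The unsupervised model above is just a softmax over inner products; a minimal sketch, with toy vectors standing in for real distributional representations:

```python
# Sketch of the unsupervised bilexical model: Pr(m|h) as a softmax over
# inner products <phi(m), phi(h)>. The vectors below are toy assumptions.
import numpy as np

def pr_m_given_h(phi_h, phi_M):
    """phi_h: (n,) head vector; phi_M: (|M|, n) candidate modifier vectors."""
    scores = phi_M @ phi_h      # inner products <phi(m'), phi(h)> for all m'
    scores -= scores.max()      # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()          # normalize over all candidate modifiers

phi_h = np.array([1.0, 0.0])
phi_M = np.array([[1.0, 0.0], [0.0, 1.0]])
p = pr_m_given_h(phi_h, phi_M)  # the first modifier, aligned with h, scores higher
```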

SLIDE 9

Supervised bilexical model

◮ We define the bilexical model in a bilinear setting as:

φ(m)⊤ W φ(h)

where φ(m) and φ(h) are n-dimensional representations of m and h, and W ∈ R^{n×n} is a matrix of parameters

SLIDE 10

Interpreting the Bilinear Models

◮ If we write the bilinear model as:

Σ_{i=1}^{n} Σ_{j=1}^{n} f_{i,j}(m, h) W_{i,j},  where f_{i,j}(m, h) = φ(m)[i] φ(h)[j]

⇒ Bilinear models are linear models, with an extended feature space!

⇒ We can re-use all the algorithms designed for linear models.
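The identity on this slide can be checked numerically: the bilinear score equals a linear model over the outer-product feature map. A short sketch with random toy values:

```python
# Check: phi(m)^T W phi(h) == sum_ij f_ij(m,h) * W_ij, where
# f_ij(m,h) = phi(m)[i] * phi(h)[j] (the outer-product feature map).
import numpy as np

rng = np.random.default_rng(0)
n = 4
phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
W = rng.normal(size=(n, n))

bilinear = phi_m @ W @ phi_h             # the bilinear form
features = np.outer(phi_m, phi_h)        # f_ij(m, h), an n*n feature vector
linear = (features * W).sum()            # a plain linear model over f

assert np.isclose(bilinear, linear)      # the two scores agree
```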

SLIDE 11

Using Bilexical Models

◮ We define the bilexical operator as:

Pr(m|h) = exp{φ(m)⊤ W φ(h)} / Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)}

⇒ a standard conditional log-linear model

SLIDE 12

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 13

Rank Constraints

φ(m)⊤ W φ(h)

where φ(m)⊤ is a 1×n row vector (m1, …, mn), W is an n×n parameter matrix with entries w11, …, wnn, and φ(h) is an n×1 column vector (h1, …, hn)⊤

SLIDE 14

Rank Constraints

◮ Factorizing W with the singular value decomposition:

SVD(W) = U Σ V⊤

φ(m)⊤ W φ(h) = φ(m)⊤ U Σ V⊤ φ(h)

where U ∈ R^{n×k}, Σ = diag(σ1, …, σk), and V⊤ ∈ R^{k×n}

◮ Please note: W has rank k

SLIDE 15

Low Rank Embedding

◮ Regrouping, we get:

φ(m)⊤ W φ(h) = (φ(m)⊤ U) Σ (V⊤ φ(h))

◮ We can see φ(m)⊤U as a projection of m, and V⊤φ(h) as a projection of h

◮ ⇒ Rank(W) defines the dimensionality of the induced space, hence the embedding
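The regrouping above can be verified numerically: projecting both words through the SVD factors and scoring in k dimensions gives the same value as the full bilinear form. A sketch with a toy rank-k parameter matrix (all values illustrative):

```python
# Sketch of the low-rank embedding: score phi(m)^T W phi(h) via the
# k-dimensional projections induced by the SVD of W. W is a toy matrix
# built to have rank k, so the truncated SVD is exact.
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
W = rng.normal(size=(n, k)) @ rng.normal(size=(k, n))  # rank-k by construction

U, S, Vt = np.linalg.svd(W)
Uk, Sk, Vk = U[:, :k], np.diag(S[:k]), Vt[:k, :]       # keep top-k factors

phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
em = phi_m @ Uk @ Sk        # k-dim embedding of the modifier
eh = Vk @ phi_h             # k-dim embedding of the head
assert np.isclose(phi_m @ W @ phi_h, em @ eh)          # same score, k dims
```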

SLIDE 16

Computational Properties

◮ In many tasks, given a head, we must rank a huge number of modifiers

◮ Strategy:

◮ Project each lexical item in the vocabulary into its low-dimensional embedding of size k

◮ Compute the bilexical score as a k-dimensional inner product

◮ Substantial computational gain as long as we obtain low-rank models
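The strategy above can be sketched end to end: project the whole vocabulary once offline, then score every modifier for a head with a single k-dimensional matrix product. Sizes and matrices below are toy assumptions.

```python
# Sketch of the scoring strategy: precompute k-dim embeddings for the
# vocabulary, then score all m modifiers with kn + km operations instead
# of working with the full n x n matrix W.
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 50, 5, 1000             # representation size, rank, #modifiers
U = rng.normal(size=(n, k))
V = rng.normal(size=(n, k))
W = U @ V.T                       # a rank-k parameter matrix (toy)

Phi_M = rng.normal(size=(m, n))   # distributional vectors of all modifiers
phi_h = rng.normal(size=n)        # distributional vector of the head

E_M = Phi_M @ U                   # offline: project each modifier once
e_h = V.T @ phi_h                 # online: project the head (kn ops)
scores_fast = E_M @ e_h           # then m inner products in k dims (km ops)

scores_full = Phi_M @ W @ phi_h   # reference: scoring with the full W
assert np.allclose(scores_fast, scores_full)
```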

SLIDE 17

Summary

◮ Induce high-dimensional representations from a huge corpus

◮ Learn embeddings suited for a given task

◮ Our bilexical formulation is, in principle, a linear model, but with an extended feature space

◮ Low-rank bilexical embeddings are computationally efficient

SLIDE 18

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 19

Formulation

◮ Given:

◮ A set of training tuples D = (m1, h1) … (ml, hl), where the m are modifiers and the h are heads
◮ Distributional representations φ(m) and φ(h), computed over some corpus

◮ We set it as a conditional log-linear distribution:

Pr(m|h) = exp{φ(m)⊤ W φ(h)} / Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)}

SLIDE 20

Learning and Regularization

◮ Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:

log Pr(D) = Σ_{(m,h)∈D} [ φ(m)⊤ W φ(h) − log Σ_{m′∈M} exp{φ(m′)⊤ W φ(h)} ]

◮ Adding a regularization penalty, our algorithm essentially maximizes:

Σ_{(m,h)∈D} log Pr(m | h) − λ ‖W‖_p

◮ Regularization using the proximal gradient method (FOBOS):

◮ ℓ1 regularization, ‖W‖_1 ⇒ sparse feature space
◮ ℓ2 regularization, ‖W‖_2 ⇒ dense parameters
◮ ℓ∗ regularization, ‖W‖_∗ ⇒ low-rank embedding

SLIDE 21

Algorithm: Proximal Algorithm for Bilexical Operators

1  while iteration < MaxIteration do
2    W_{t+0.5} = W_t − η_t g(W_t);                  // gradient of neg log-likelihood
     /* adding the regularization penalty:
        W_{t+1} = argmin_W ||W_{t+0.5} − W||²_2 + η_t λ r(W),
        solved with the proximal operator */
3    if ℓ1 regularizer then
4      W_{t+1}(i,j) = sign(W_{t+0.5}(i,j)) · max(|W_{t+0.5}(i,j)| − η_t λ, 0);
                                                    // basic thresholding operation
5    else if ℓ2 regularizer then
6      W_{t+1} = W_{t+0.5} / (1 + η_t λ);           // basic scaling operation
7    else if nuclear norm regularizer then
8      W_{t+0.5} = U Σ V⊤;                          // SVD
9      σ̄_i = max(σ_i − η_t λ, 0);                  // σ_i = the i-th element of Σ
10     W_{t+1} = U Σ̄ V⊤;
11 end
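The three proximal updates in the algorithm above can be written as small, runnable functions; a minimal sketch, where the step size η and penalty λ are illustrative values, not the ones tuned in the paper:

```python
# Sketch of the three FOBOS proximal updates, applied to the gradient
# step W_half = W_t - eta * g(W_t). eta and lam are toy values.
import numpy as np

def prox_l1(W_half, eta, lam):
    """Soft thresholding: drives entries to zero (sparse W)."""
    return np.sign(W_half) * np.maximum(np.abs(W_half) - eta * lam, 0.0)

def prox_l2(W_half, eta, lam):
    """Uniform scaling: shrinks all entries (dense W)."""
    return W_half / (1.0 + eta * lam)

def prox_nuclear(W_half, eta, lam):
    """Singular-value thresholding: drives singular values to zero (low-rank W)."""
    U, s, Vt = np.linalg.svd(W_half, full_matrices=False)
    s_bar = np.maximum(s - eta * lam, 0.0)   # threshold each singular value
    return U @ np.diag(s_bar) @ Vt

W_half = np.array([[2.0, -0.5], [0.2, -3.0]])
W_l1 = prox_l1(W_half, eta=1.0, lam=1.0)     # small entries become exactly 0
```

Note that the ℓ1 update must threshold the absolute value of each entry; applying `max(w − ηλ, 0)` without the absolute value would zero out all negative weights.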

SLIDE 22

Overview

◮ Bilexical Models
◮ Low Rank Constraints
◮ Learning
◮ Experiments

SLIDE 23

Experiments

◮ Tasks:

◮ Noun–Adjective relations: Pr(adjective|noun) and Pr(noun|adjective)
◮ Verb–Object relations: Pr(object|verb) and Pr(verb|object)

◮ Data:

◮ Supervised corpus: gold-standard dependencies of the Penn Treebank
◮ We partition the heads of head-modifier relations into three parts: 60% of heads for training, 10% for validation and 30% for test
◮ No heads from the test set were seen in the training set
◮ Corpora for distributional representations: the BLLIP corpus

◮ Training: for each head word, using the supervised data, we compile a list of compatible and incompatible modifiers

SLIDE 24

Results

Noun      | Predicted Adjectives
president | executive, senior, chief, frank, former, international, marketing, assistant, annual, financial
wife      | former, executive, new, financial, own, senior, old, other, deputy, major
shares    | annual, due, net, convertible, average, new, high-yield, initial, tax-exempt, subordinated
mortgages | annualized, annual, three-month, one-year, average, six-month, conventional, short-term, higher, lower
month     | last, next, fiscal, first, past, latest, early, previous, new, current
problem   | new, good, major, tough, bad, big, first, financial, long, federal
holiday   | new, major, special, fourth-quarter, joint, quarterly, third-quarter, small, strong, own

Table: 10 most likely adjectives for some nouns

SLIDE 25

Results

[Figure: pairwise accuracy vs. number of operations, Objects given Verb; curves for the unsupervised, NN, L1 and L2 models]

◮ Pairwise accuracy: a measure over compatible/incompatible modifiers

◮ Capacity of the model: given the head, the number of double operations required to compute scores for all modifiers

◮ In general, if the representation size is n and there are m modifiers, then:

◮ ℓ1 & ℓ2 ⇒ if W has d non-zero weights ⇒ dm operations
◮ ℓ∗ ⇒ if the rank of W is k ⇒ kn + km operations

SLIDE 26

[Figure: four panels (Adjectives given Noun, Nouns given Adjective, Objects given Verb, Verbs given Object), each plotting pairwise accuracy against number of operations for the unsupervised, NN, L1 and L2 models]

Figure: Pairwise accuracy vs number of double operations to compute the distribution over m

SLIDE 27

Prepositional Phrase attachment

Verb (v), Object (o), Modifier (m)

[Figure: PP-attachment ambiguity, with an NMOD arc attaching the modifier either to the object (NOMINAL(prep)) or to the verb (VERBAL(prep))]

◮ Given: for every preposition p, a set of training tuples Dp = {(v, o, p, m, y)1 … (v, o, p, m, y)l}

◮ Distributional representations: φ(v), φ(o), φ(m)

Pr(y=V | v, o, p, m) = exp{φ(v)⊤ W_p^V φ(m)} / Z
Pr(y=O | v, o, p, m) = exp{φ(o)⊤ W_p^O φ(m)} / Z

◮ Does the bilinear model complement the linear model?

◮ For a constant λ ∈ [0, 1]:

Pr(y|x) = λ Pr_L(y|x) + (1 − λ) Pr_B(y|x)
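The interpolation above is a simple convex mixture of the two models' distributions; a minimal sketch with toy probability values standing in for the linear and bilinear models:

```python
# Sketch of the interpolated model: mix the linear and bilinear attachment
# distributions with a constant lambda. The distributions are toy values.
import numpy as np

def interpolate(p_linear, p_bilinear, lam):
    """Convex combination of two distributions over the same outcomes."""
    return lam * p_linear + (1.0 - lam) * p_bilinear

p_L = np.array([0.7, 0.3])          # Pr_L(y|x) over y in {V, O}
p_B = np.array([0.4, 0.6])          # Pr_B(y|x)
p = interpolate(p_L, p_B, lam=0.5)  # still a valid distribution: sums to 1
```

Since both inputs sum to 1 and λ ∈ [0, 1], the mixture is itself a probability distribution, so no renormalization is needed.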

SLIDE 28

Results

[Figure: attachment accuracy for the prepositions “for”, “from” and “with”; bars for the bilinear L1, bilinear L2, bilinear NN, linear, interpolated L1, interpolated L2 and interpolated NN models]

Figure: Attachment accuracies of linear, bilinear and interpolated models for three prepositions

SLIDE 29

Conclusion

◮ We have presented a semi-supervised bilexical model that has the potential to generalize over unseen words

◮ We have proposed a method to learn low-rank embeddings for scoring bilexical relations efficiently

◮ We want to apply this idea to other bilexical tasks in NLP

◮ We want to explore how to combine other feature representations with low-rank bilexical operators.

SLIDE 30

Thank You
