
A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors - PowerPoint PPT Presentation



  1. A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors Mikhail Khodak* ,1 , Nikunj Saunshi *,1 , Yingyu Liang 2 , Tengyu Ma 3 , Brandon Stewart 1 , Sanjeev Arora 1 1: Princeton University, 2: University of Wisconsin-Madison, 3: FAIR/Stanford University

  2. ACL 2018 Motivations Distributed representations for words / text have had lots of successes in NLP (language models, machine translation, text classification)

  3. ACL 2018 Motivations Distributed representations for words / text have had lots of successes in NLP (language models, machine translation, text classification) Motivations for our work: • Can we induce embeddings for all kinds of features, especially those with very few occurrences (e.g. ngrams, rare words)?

  4. ACL 2018 Motivations Distributed representations for words / text have had lots of successes in NLP (language models, machine translation, text classification) Motivations for our work: • Can we induce embeddings for all kinds of features, especially those with very few occurrences (e.g. ngrams, rare words)? • Can we develop simple methods for unsupervised text embedding that compete well with state-of-the-art LSTM methods?

  5. ACL 2018 Motivations Distributed representations for words / text have had lots of successes in NLP (language models, machine translation, text classification) Motivations for our work: • Can we induce embeddings for all kinds of features, especially those with very few occurrences (e.g. ngrams, rare words)? • Can we develop simple methods for unsupervised text embedding that compete well with state-of-the-art LSTM methods? We make progress on both problems: - Simple and efficient method for embedding features (ngrams, rare words, synsets) - Simple text embeddings using ngram embeddings which perform well on classification tasks

  6. ACL 2018 Word embeddings • Core idea: Cooccurring words are trained to have high inner product • E.g. LSA, word2vec, GloVe and variants

  7. ACL 2018 Word embeddings • Core idea: Cooccurring words are trained to have high inner product • E.g. LSA, word2vec, GloVe and variants • Require a few passes over a very large text corpus and non-convex optimization [Diagram: corpus → optimizing objective → word embeddings v_w ∈ ℝ^d]

  8. ACL 2018 Word embeddings • Core idea: Cooccurring words are trained to have high inner product • E.g. LSA, word2vec, GloVe and variants • Require a few passes over a very large text corpus and non-convex optimization [Diagram: corpus → optimizing objective → word embeddings v_w ∈ ℝ^d] • Used for solving analogies, language models, machine translation, text classification …

  9. ACL 2018 Feature embeddings • Capturing meaning of other natural language features • E.g. ngrams, phrases, sentences, annotated words, synsets

  10. ACL 2018 Feature embeddings • Capturing meaning of other natural language features • E.g. ngrams, phrases, sentences, annotated words, synsets • Interesting setting: features with zero or few occurrences

  11. ACL 2018 Feature embeddings • Capturing meaning of other natural language features • E.g. ngrams, phrases, sentences, annotated words, synsets • Interesting setting: features with zero or few occurrences • One approach (extension of word embeddings): Learn embeddings for all features in a text corpus [Diagram: corpus → optimizing objective → feature embeddings v_f ∈ ℝ^d]

  12. ACL 2018 Feature embeddings Issues • Usually need to learn embeddings for all features together • Need to learn many parameters • Computation cost paid is prix fixe rather than à la carte • Bad quality for rare features

  13. ACL 2018 Feature embeddings Firth revisited: Feature derives meaning from words around it

  14. ACL 2018 Feature embeddings Firth revisited: Feature derives meaning from words around it Given a feature f and one (or a few) context(s) of words around it, can we find a reliable embedding for f efficiently?

  15. ACL 2018 Feature embeddings Firth revisited: Feature derives meaning from words around it Given a feature f and one (or a few) context(s) of words around it, can we find a reliable embedding for f efficiently? “Scientists attending ACL work on cutting edge research in NLP” • “Petrichor: the earthy scent produced when rain falls on dry soil” • “Roger Federer won the first set_NN of the match”

  16. ACL 2018 Problem setup Given: Text corpus and high quality word embeddings v_w ∈ ℝ^d trained on it [Diagram: Input: a feature f in context(s) w_1 … f … w_n → Algorithm → Output: good quality embedding v_f ∈ ℝ^d for the feature]

  17. ACL 2018 Linear approach • Given a feature f and words in a context c around it: v_f^avg = (1/|c|) Σ_{w ∈ c} v_w

  18. ACL 2018 Linear approach • Given a feature f and words in a context c around it: v_f^avg = (1/|c|) Σ_{w ∈ c} v_w • Issues • Stop words (“is”, “the”) are frequent but are less informative • Word vectors tend to share common components which will be amplified
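
As a concrete illustration of this averaging baseline, here is a minimal NumPy sketch; `emb` (a word-to-vector lookup) and `context` (the list of words around the feature) are assumed names, not part of the original slides.

```python
import numpy as np

def average_embedding(context, emb):
    """v_f^avg: plain average of the context word vectors."""
    vectors = [emb[w] for w in context if w in emb]
    return np.mean(vectors, axis=0)

# Toy usage with random vectors standing in for real pretrained embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=300) for w in "rain falls on dry soil".split()}
v_avg = average_embedding("the earthy scent when rain falls on dry soil".split(), emb)
```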

  19. ACL 2018 Potential fixes • Ignore stop words

  20. ACL 2018 Potential fixes • Ignore stop words • SIF weights 1 : Down-weight frequent words (similar to tf-idf): v_f^SIF = (1/|c|) Σ_{w ∈ c} a/(a + p_w) · v_w , where p_w is the frequency of w in the corpus 1: Arora et al. ’17
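
A hedged sketch of the SIF-weighted variant; `freq` (word → corpus frequency p_w) and the smoothing constant `a` are assumed inputs, with a ≈ 1e-3 in the ballpark suggested by Arora et al. ’17.

```python
import numpy as np

def sif_embedding(context, emb, freq, a=1e-3):
    """SIF-weighted average: each context word is scaled by a / (a + p_w)."""
    vectors = [(a / (a + freq[w])) * emb[w] for w in context if w in emb]
    return np.mean(vectors, axis=0)
```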

  21. ACL 2018 Potential fixes • Ignore stop words • SIF weights 1 : Down-weight frequent words (similar to tf-idf): v_f^SIF = (1/|c|) Σ_{w ∈ c} a/(a + p_w) · v_w , where p_w is the frequency of w in the corpus • All-but-the-top 2 : Remove the component along the top direction from word vectors: u = top_direction({v_w}), ṽ_w = remove_component(v_w, u), v_f = (1/|c|) Σ_{w ∈ c} ṽ_w 1: Arora et al. ’17, 2: Mu et al. ‘18
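
And a sketch of the all-but-the-top correction: remove the projection of every word vector onto the top principal direction(s) before averaging. The function name and the choice to center before the SVD are illustrative assumptions, not the slide’s exact recipe.

```python
import numpy as np

def remove_top_directions(E, k=1):
    """E: (vocab_size, d) matrix of word vectors; returns vectors with the
    component along the top-k principal directions removed."""
    E_centered = E - E.mean(axis=0)              # center to find principal directions
    _, _, Vt = np.linalg.svd(E_centered, full_matrices=False)
    U = Vt[:k]                                   # top-k directions, shape (k, d)
    return E - E @ U.T @ U                       # subtract the projection onto them
```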

  22. ACL 2018 Our more general approach • Down-weighting and removing directions can be achieved by matrix multiplication: v_f = A · v_f^avg = A · (1/|c|) Σ_{w ∈ c} v_w (induced embedding = induction matrix A × average of context word vectors)

  23. ACL 2018 Our more general approach • Down-weighting and removing directions can be achieved by matrix multiplication: v_f = A · v_f^avg = A · (1/|c|) Σ_{w ∈ c} v_w (induced embedding = induction matrix A × average of context word vectors) • Learn A by using words as features: A* = argmin_A Σ_w ‖v_w − A v_w^avg‖² • Learning A is a linear regression and is unsupervised
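
In code, this regression has a closed form. A minimal sketch, assuming `V` holds the original word vectors v_w as rows and `C` holds the matching context averages v_w^avg (both n_words × d); the small ridge term is an added assumption, included only for numerical stability.

```python
import numpy as np

def learn_induction_matrix(V, C, reg=1e-6):
    """A* = argmin_A sum_w ||v_w - A v_w^avg||^2, solved via the normal equations."""
    d = C.shape[1]
    # A (C^T C) = V^T C  =>  A = V^T C (C^T C + reg*I)^{-1}
    return (V.T @ C) @ np.linalg.inv(C.T @ C + reg * np.eye(d))
```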

  24. ACL 2018 Theoretical justification • [Arora et al. TACL ’18] prove that under a generative model for text, there exists a matrix A which satisfies v_w ≈ A v_w^avg

  25. ACL 2018 Theoretical justification • [Arora et al. TACL ’18] prove that under a generative model for text, there exists a matrix A which satisfies v_w ≈ A v_w^avg • Empirically we find that the best A* recovers the original word vectors: cosine(v_w, A* v_w^avg) ≥ 0.9
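
A quick way to reproduce this sanity check on your own vectors (same assumed `V`, `C`, and learned `A` as in the sketch above):

```python
import numpy as np

def mean_cosine(V, C, A):
    """Average cosine similarity between v_w and the induced A v_w^avg."""
    induced = C @ A.T
    cos = np.sum(V * induced, axis=1) / (
        np.linalg.norm(V, axis=1) * np.linalg.norm(induced, axis=1))
    return cos.mean()
```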

  26. ACL 2018 A la carte embeddings 1. Learn induction matrix by linear regression: A* = argmin_A Σ_w ‖v_w − A v_w^avg‖²

  27. ACL 2018 A la carte embeddings 1. Learn induction matrix by linear regression: A* = argmin_A Σ_w ‖v_w − A v_w^avg‖² 2. A la carte embedding for a feature f in context(s) c: v_f^alc = A* v_f^avg = A* · (1/|c|) Σ_{w ∈ c} v_w

  28. ACL 2018 A la carte embeddings 1. Learn induction matrix by linear regression (only once!): A* = argmin_A Σ_w ‖v_w − A v_w^avg‖² 2. A la carte embedding for a feature f in context(s) c: v_f^alc = A* v_f^avg = A* · (1/|c|) Σ_{w ∈ c} v_w
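
Putting the two steps together, a hedged end-to-end sketch; `feature_contexts` (an iterable of context word lists for the feature) and the other names carry over from the earlier snippets and are assumptions, not the authors’ reference code.

```python
import numpy as np

def a_la_carte_embedding(feature_contexts, emb, A):
    """Step 2: average all context words of the feature, then apply A*."""
    vectors = [emb[w] for ctx in feature_contexts for w in ctx if w in emb]
    return A @ np.mean(vectors, axis=0)

# Usage sketch: learn A once (step 1), then induce any number of new features.
# A = learn_induction_matrix(V, C)
# v_petrichor = a_la_carte_embedding(contexts_of("petrichor"), emb, A)
```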

  29. ACL 2018 Advantages • à la carte: Compute an embedding only for the given feature • Simple optimization: Linear regression • Computational efficiency: One pass over corpus and contexts • Sample efficiency: Learn only d² parameters for A* (rather than nd) • Versatility: Works for any feature which has at least 1 context

  30. Effect of induction matrix • We plot the extent to which A* down-weights words against frequency of words, compared to all-but-the-top

  31. Effect of induction matrix • We plot the extent to which A* down-weights words against frequency of words, compared to all-but-the-top [Plot: “Change in Embedding Norm under Transform”; y-axis ‖A* v_w‖ / ‖v_w‖, x-axis log(count_w)] • A* mainly down-weights words with very high and very low frequency • All-but-the-top mainly down-weights frequent words
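
A sketch of how such a plot could be reproduced, assuming `V` (word vectors), per-word `counts` in the same row order, and a learned `A`; the plotting details are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_norm_ratio(V, counts, A):
    """Scatter ||A v_w|| / ||v_w|| against log word count."""
    ratio = np.linalg.norm(V @ A.T, axis=1) / np.linalg.norm(V, axis=1)
    plt.scatter(np.log(counts), ratio, s=2)
    plt.xlabel("log(count_w)")
    plt.ylabel("||A* v_w|| / ||v_w||")
    plt.title("Change in embedding norm under transform")
    plt.show()
```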

  32. ACL 2018 Effect of number of contexts Contextual Rare Words (CRW) dataset 1 providing contexts for rare words • Task: Predict human-rated similarity scores for pairs of words • Evaluation: Spearman’s rank coefficient between inner product and score 1: Subset of RW dataset [Luong et al. ’13]

  33. ACL 2018 Effect of number of contexts Contextual Rare Words (CRW) dataset 1 providing contexts for rare words • Task: Predict human-rated similarity scores for pairs of words • Evaluation: Spearman’s rank coefficient between inner product and score Compare to the following methods: • Average of words in context • Average of non-stop words • SIF weighted average • all-but-the-top [Plot legend: Average; Average + all-but-the-top; Average, no stop words; SIF; SIF + all-but-the-top; à la carte] 1: Subset of RW dataset [Luong et al. ’13]
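
A sketch of the evaluation protocol described above, assuming `pairs` is a list of (word1, word2, human_score) triples and `emb` maps each word to its induced vector.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, emb):
    """Spearman rank correlation between embedding inner products and human scores."""
    preds = [np.dot(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    gold = [score for _, _, score in pairs]
    return spearmanr(preds, gold).correlation
```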
