Evaluation methods for unsupervised word embeddings (EMNLP 2015)


  1. Evaluation methods for unsupervised word embeddings (EMNLP 2015). Tobias Schnabel, Igor Labutov, David Mimno and Thorsten Joachims, Cornell University. September 19th, 2015

  2. Motivation
     - How similar (on a scale from 0 to 10) are the following two words? (a) tiger (b) fauna
     - Answer: 5.62 (according to WordSim-353)
     - Problems:
       o Large variance (τ = 2.9)
       o Aggregation of different pairs
     - Question: How can we improve this?

  3. Procedure design for intrinsic evaluation
     - Which option is most similar to the query word?
       Query: skillfully
       (a) swiftly (b) expertly (c) cleverly (d) pointedly
       (e) I don't know the meaning of one (or several) of the words
     - Answer: 8/8 votes for (b)

  4. Procedure design for intrinsic evaluation
     - Comparative evaluation (new): each query word from the query inventory is given to Embedding 1, Embedding 2 and Embedding 3; their proposed options are pooled into one question for human judgements (a sketch follows below)
     - Advantages:
       o Directly reflects human preferences
       o Relative instead of absolute judgements
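
The assembly step can be sketched in a few lines of Python: for a query word, each embedding nominates a nearest neighbor, and the pooled nominations form one multiple-choice question for annotators. A minimal sketch with numpy and toy random vectors; all names and the data layout are assumptions, not the authors' released code.

    import numpy as np

    def kth_neighbor(emb, vocab, query, k=1):
        """Return the k-th nearest neighbor of `query` by cosine similarity."""
        q = emb[vocab[query]]
        sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-12)
        sims[vocab[query]] = -np.inf  # never propose the query itself
        inv = {i: w for w, i in vocab.items()}
        return inv[int(np.argsort(-sims)[k - 1])]

    def comparative_query(embeddings, vocab, query, k=1):
        """One option per embedding; identical options collapse together."""
        options = sorted({kth_neighbor(e, vocab, query, k) for e in embeddings})
        return {"query": query, "options": options}

    # Toy demonstration with random 10-dimensional vectors.
    rng = np.random.default_rng(0)
    vocab = {w: i for i, w in enumerate(
        ["skillfully", "swiftly", "expertly", "cleverly", "pointedly"])}
    embeddings = [rng.normal(size=(len(vocab), 10)) for _ in range(3)]
    print(comparative_query(embeddings, vocab, "skillfully"))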

  5. Looking back
     - How can we improve absolute evaluation? Comparative evaluation.
     - ... but how should we pick the pairs, e.g. (a) tiger, (b) fauna?

  6. Inventory design
     - Often: heuristically chosen
     - Goal: linguistic insight
     - Aim at diversity and balancedness (a sampling sketch follows below):
       o Balance rare and frequent words (e.g., play vs. devour)
       o Balance POS classes (e.g., skillfully vs. piano)
       o Balance abstractness/concreteness (e.g., eagerness vs. table)
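
One way to operationalize diversity and balancedness is stratified sampling: bucket candidate words by the properties of interest and draw the same number from every bucket. A minimal sketch, assuming word-to-frequency and word-to-POS lookups are available; the threshold and bucket scheme are illustrative assumptions, not the paper's procedure.

    import random
    from collections import defaultdict

    def stratified_inventory(words, freq, pos, per_bucket=5, seed=0):
        """Bucket words by (frequency band, POS class), then sample evenly."""
        buckets = defaultdict(list)
        for w in words:
            band = "frequent" if freq[w] >= 1000 else "rare"  # assumed threshold
            buckets[(band, pos[w])].append(w)
        rng = random.Random(seed)
        inventory = []
        for _, bucket in sorted(buckets.items()):
            inventory.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
        return inventory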

  7. Results
     - Embeddings:
       o Prediction-based: CBOW and Collobert & Weston (C&W)
       o Reconstruction-based: CCA, Hellinger PCA, Random Projections, GloVe
       o Trained on Wikipedia (2008), with vocabularies made identical
     - Details:
       o Options came from positions k = 1, 5, 50 in each embedding's nearest-neighbor ranking
       o 100 query words x 3 ranks = 300 subtasks
       o Amazon Mechanical Turk workers answered 50 such questions each
     - Win score: fraction of votes for each embedding, averaged over subtasks (see the sketch below)
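
Computing the win score from raw votes is simple: on each subtask an embedding earns the fraction of votes cast for its option, and these fractions are averaged over all subtasks. A sketch under an assumed data layout (one vote-count dict per subtask); this is not the released evaluation code.

    from collections import defaultdict

    def win_scores(subtasks):
        """`subtasks`: list of dicts, embedding name -> votes its option got.
        Every embedding is assumed to appear in every subtask."""
        totals = defaultdict(float)
        for votes in subtasks:
            n = sum(votes.values())
            for name, v in votes.items():
                totals[name] += v / n
        return {name: t / len(subtasks) for name, t in totals.items()}

    # Two example subtasks with 8 votes each.
    print(win_scores([
        {"CBOW": 5, "GloVe": 2, "C&W": 1},
        {"CBOW": 3, "GloVe": 4, "C&W": 1},
    ]))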

  8. Results: by frequency
     ⇒ Performance varies with word frequency

  9. Results: by rank
     ⇒ Different falloff behavior as the neighbor rank k increases

  10. Results: absolute performance
      - Results on absolute intrinsic evaluation are similar to the comparative ones
      - However: absolute metrics are less principled and insightful

  11. Looking back
      - How can we improve absolute evaluation? Comparative evaluation.
      - How should we pick the query inventory? Strive for diversity and balancedness.
      - ... but beyond pairs such as (a) tiger, (b) fauna: are there more global properties?

  12. Properties of word embeddings
      - Common: pair-based evaluation, e.g.
        o Similarity/relatedness (word A vs. word B)
        o Analogy (A : B :: C : D)
      - Idea: set-based evaluation
        o All interactions within a set are considered
        o Goal: measure coherence

  13. Properties of word embeddings
      - Which word belongs least in the following group?
        (a) finally (b) eventually (c) put (d) immediately
      - Answer: put (8/8 votes)

  14. Properties of word embeddings
      - Construction: (a) finally (b) eventually (c) put (d) immediately
      - For each embedding, create sets of 4 with one intruder: a query word's nearest neighbors form the coherent group, plus one intruder (a construction sketch follows below)
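
A possible construction, sketched in the same toy numpy setup as before: the coherent group is the query's top nearest neighbors, and the intruder is drawn from far down the similarity ranking. The exact sampling scheme here is a simplifying assumption, not the paper's.

    import numpy as np

    def intruder_question(emb, vocab, query, group_size=3, seed=0):
        """Group = nearest neighbors of `query` (coherent); intruder = a word
        sampled from the bottom half of the cosine-similarity ranking."""
        rng = np.random.default_rng(seed)
        inv = {i: w for w, i in vocab.items()}
        q = emb[vocab[query]]
        sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-12)
        sims[vocab[query]] = -np.inf
        order = np.argsort(-sims)
        group = [inv[int(i)] for i in order[:group_size]]
        intruder = inv[int(rng.choice(order[len(order) // 2:]))]
        items = group + [intruder]
        rng.shuffle(items)  # hide the intruder's position
        return {"items": items, "intruder": intruder}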

  15. Results
      - Pair-based performance and outlier precision rank the embeddings differently
      ⇒ Set-based evaluation ≠ item-based evaluation (a scoring sketch follows below)
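
Scoring the intrusion task automatically is straightforward under one common convention: the embedding's guess is the item with the lowest average cosine similarity to the rest of the set, and outlier precision is the fraction of questions where that guess matches the human-chosen intruder. A sketch; the scoring convention is an assumption.

    import numpy as np

    def predicted_intruder(emb, vocab, items):
        """Guess the item least similar, on average, to the others."""
        vecs = np.stack([emb[vocab[w]] for w in items])
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
        sims = vecs @ vecs.T
        np.fill_diagonal(sims, 0.0)
        return items[int(np.argmin(sims.sum(axis=1)))]

    def outlier_precision(emb, vocab, questions):
        """Fraction of questions where the guess is the true intruder."""
        hits = sum(predicted_intruder(emb, vocab, q["items"]) == q["intruder"]
                   for q in questions)
        return hits / len(questions)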

  16. Looking back
      - How can we improve absolute evaluation? Comparative evaluation.
      - How should we pick the query inventory? Strive for diversity and balancedness.
      - Are there other interesting properties? Coherence.
      - ... but what about downstream performance?

  17. The big picture
      Text data → Word embeddings → Meaning

  18. The big picture
      Text data → Word embeddings → Linguistic insight / Build better NLP systems

  19. The big picture
      - Intrinsic evaluation: similarity, clustering, analogy
      - Extrinsic evaluation: NER, chunking, POS tagging

  21. Extrinsic vs. intrinsic performance
      - Hypothesis: better intrinsic quality also gives better downstream performance
      - Experiment: use each word embedding as extra features in a supervised task (a feature sketch follows below)
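
The experimental recipe is feature concatenation: each token's embedding vector is appended to its standard hand-crafted features before training the supervised model. A minimal, purely illustrative sketch with scikit-learn; the actual chunking and NER systems in the paper are more elaborate, and every name here is a stand-in.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def featurize(tokens, emb, vocab, dim):
        """Hand-crafted features plus the word's embedding (zeros if OOV)."""
        rows = []
        for tok in tokens:
            hand = [float(tok[0].isupper()), float(tok.isdigit())]
            vec = emb[vocab[tok]] if tok in vocab else np.zeros(dim)
            rows.append(np.concatenate([hand, vec]))
        return np.stack(rows)

    # Usage (hypothetical data):
    #   X = featurize(train_tokens, emb, vocab, emb.shape[1])
    #   clf = LogisticRegression(max_iter=1000).fit(X, train_labels)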

  22. Results: chunking
      [figure: F1 chunking results, ranging from about 93.75 to 94.15, for Rand. Proj., H-PCA, C&W, TSCCA, GloVe and CBOW; the intrinsic and extrinsic rankings differ]
      ⇒ Intrinsic performance ≠ extrinsic performance

  23. Looking back
      - How can we improve absolute evaluation? Comparative evaluation.
      - How should we pick the query inventory? Strive for diversity and balancedness.
      - Are there other interesting properties? Coherence.
      - Does better intrinsic performance lead to better extrinsic results? No!

  24. Discussion
      - Why do we see such different behavior?
        o Hypothesis: unwanted information is encoded as well
      - Embeddings can accurately predict word frequency (a probing sketch follows below)
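
This kind of claim can be checked with a simple probe: train a linear classifier to separate frequent from rare words using only the embedding vectors; accuracy well above 50% means frequency information is linearly recoverable. A sketch with scikit-learn; the threshold and probe design are assumptions, not the paper's exact setup.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def frequency_probe(emb, vocab, freq, threshold=1000):
        """Cross-validated accuracy of a frequent-vs-rare probe; ~0.5 would
        mean word frequency is not linearly encoded in the vectors."""
        words = list(vocab)
        X = np.stack([emb[vocab[w]] for w in words])
        y = np.array([freq[w] >= threshold for w in words], dtype=int)
        return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()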

  25. Discussion
      - Also: experiments show a strong correlation between word frequency and cosine similarity
      - Further problems with cosine similarity:
        o Used in almost all intrinsic evaluation tasks; conflates different aspects of similarity
        o Not used during training: disconnect between evaluation and training
      - Better: learn a custom metric for each task (e.g., semantic relatedness, syntactic similarity, etc.); one possible sketch follows below
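
One simple instantiation of a learned, task-specific metric: fit per-dimension weights so that a reweighted dot product matches human similarity scores for that task. This is an illustration of the general idea, not a method proposed in the paper, and only one of many possible designs.

    import numpy as np

    def fit_diagonal_metric(pairs, scores, emb, vocab):
        """Least-squares weights w such that sum_k w[k]*x[k]*y[k] approximates
        the human score for each word pair (x, y)."""
        P = np.stack([emb[vocab[a]] * emb[vocab[b]] for a, b in pairs])
        w, *_ = np.linalg.lstsq(P, np.asarray(scores, dtype=float), rcond=None)
        return w

    def weighted_similarity(w, emb, vocab, a, b):
        """Task-specific similarity under the learned weights."""
        return float(np.sum(w * emb[vocab[a]] * emb[vocab[b]]))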

  26. Conclusions
      - Practical recommendations:
        o Specify what the goal of an embedding method is
        o Advantage: datasets can then be used to inform training
      - Future work:
        o Improving similarity metrics
        o Use data from the comparative experiments for offline evaluation
      - All data and code available at: http://www.cs.cornell.edu/~schnabts/eval/
