SLIDE 1

Evaluation methods for unsupervised word embeddings

EMNLP 2015

Tobias Schnabel, Igor Labutov, David Mimno and Thorsten Joachims
Cornell University

September 19th, 2015

SLIDE 2


Motivation

  • How similar (on a scale from 0-10) are the following two words?

(a) tiger (b) fauna

  • Answer: 5.62 (according to WordSim-353)
  • Problems:
  • Large variance (σ = 2.9)
  • Aggregation of different pairs
  • Question: How can we improve this?
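To make the setup concrete, here is a minimal sketch of the absolute evaluation being critiqued: score word pairs by cosine similarity and report the Spearman correlation with human ratings. The vectors are random stand-ins for a trained embedding, and two of the three ratings are merely illustrative; in practice you would load real vectors and the full WordSim-353 set.

```python
# Minimal sketch of absolute intrinsic evaluation (WordSim-353 style).
# The vectors below are random stand-ins for a trained embedding.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["tiger", "fauna", "cat", "jaguar"]}
wordsim_pairs = [("tiger", "fauna", 5.62),   # rating from the slide
                 ("tiger", "cat", 7.35),     # illustrative ratings
                 ("tiger", "jaguar", 8.00)]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

human, model = [], []
for w1, w2, rating in wordsim_pairs:
    if w1 in vectors and w2 in vectors:
        human.append(rating)
        model.append(cosine(vectors[w1], vectors[w2]))

# Spearman's rho is the usual summary statistic; note that it hides
# both the per-pair rating variance and the aggregation issue above.
rho, _ = spearmanr(human, model)
print(f"Spearman rho = {rho:.3f}")
```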
SLIDE 3


Procedure design for intrinsic evaluation

  • Which option is most similar to the query word?

Query: skillfully (a) swiftly (b) expertly (c) cleverly (d) pointedly (e) I don't know the meaning of one (or several) of the words

  • Answer: 8/8 votes for (b)
SLIDE 4


Procedure design for intrinsic evaluation

[Diagram: comparative evaluation (new): a query inventory is answered by Embeddings 1, 2 and 3, and human judgements compare their outputs]

Advantages:

  • Directly reflects human preferences
  • Relative instead of absolute judgements
SLIDE 5


Looking back

  • How can we improve absolute evaluation?
  • Comparative evaluation

… but

(a) tiger (b) fauna

How should we pick these?

SLIDE 6


Inventory design

  • Often: Heuristically chosen
  • Goal: Linguistic insight
  • Aim at diversity and balancedness (see the sketch after this list):
  • Balance rare and frequent words (e.g., play vs. devour)
  • Balance POS classes (e.g., skillfully vs. piano)
  • Balance abstractness/concreteness (e.g., eagerness vs. table)
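A minimal sketch of one way to build such a balanced inventory, stratifying by corpus frequency. The frequency counts, bucket boundaries, and sample sizes are illustrative assumptions, not the paper's exact recipe; POS classes and concreteness could be balanced the same way with extra strata.

```python
# Minimal sketch: sample a query inventory balanced across frequency
# strata. Frequencies and bucket boundaries are made-up placeholders.
import random
from collections import defaultdict

word_freq = {"play": 120_000, "table": 90_000, "piano": 15_000,
             "devour": 800, "skillfully": 900, "eagerness": 700}

def bucket(freq):
    # Coarse rare / medium / frequent strata.
    if freq < 1_000:
        return "rare"
    if freq < 50_000:
        return "medium"
    return "frequent"

strata = defaultdict(list)
for word, freq in word_freq.items():
    strata[bucket(freq)].append(word)

random.seed(0)
per_bucket = 2  # equal samples per stratum balance rare vs. frequent
inventory = [w for words in strata.values()
             for w in random.sample(words, min(per_bucket, len(words)))]
print(inventory)
```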

SLIDE 7


Results

  • Embeddings:
  • Prediction-based: CBOW and Collobert & Weston (C&W)
  • Reconstruction-based: CCA, Hellinger PCA, Random Projections, GloVe
  • Trained on Wikipedia (2008); vocabularies were made identical
  • Details:
  • Options were each embedding's nearest neighbors at ranks k = 1, 5, 50
  • 100 query words × 3 ranks = 300 subtasks
  • Amazon Mechanical Turk workers answered 50 such questions
  • Win score: fraction of votes for each embedding, averaged over subtasks (see the sketch below)
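A minimal sketch of the win-score aggregation just described, assuming vote counts per subtask have been collected; the numbers here are invented for illustration.

```python
# Minimal sketch of the win score: per subtask, each embedding gets
# its fraction of the votes; scores are then averaged over subtasks.
from collections import defaultdict

votes = [  # invented vote counts for three subtasks
    {"CBOW": 6, "GloVe": 2, "CW": 0},
    {"CBOW": 3, "GloVe": 4, "CW": 1},
    {"CBOW": 5, "GloVe": 1, "CW": 2},
]

win_score = defaultdict(float)
for subtask in votes:
    total = sum(subtask.values())
    for embedding, n in subtask.items():
        win_score[embedding] += (n / total) / len(votes)

print(dict(win_score))
```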
SLIDE 8


Results – by frequency

⇒ Performance varies with word frequency

SLIDE 9


Results – by rank

⇒ Embeddings show different falloff behavior as neighbor rank increases

SLIDE 10


Results – absolute performance

⇒ Similar results for absolute metrics. However: absolute metrics are less principled and insightful.

[Figure: results on absolute intrinsic evaluation]

SLIDE 11


Looking back

  • How can we improve absolute evaluation?
  • Comparative evaluation
  • How should we pick the query inventory?
  • Strive for diversity and balancedness

… but

(a) tiger (b) fauna

Are there more global properties?

SLIDE 12


Properties of word embeddings

  • Common: Pair-based evaluation, e.g.,
  • Similarity/relatedness
  • Analogy
  • Idea: Set-based evaluation
  • All interactions considered
  • Goal: measure coherence

[Diagram: pair-based comparisons between words A, B, C, D vs. set-based evaluation over the whole group]

SLIDE 13


Properties of word embeddings

  • What word belongs the least to the following group?

(a) finally (b) eventually (c) put (d) immediately

Answer: put (8/8 votes)

SLIDE 14


Properties of word embeddings

(a) finally (b) eventually (c) put (d) immediately

  • Construction:
  • For each embedding, create sets of 4 with one intruder (a construction sketch follows below)

[Diagram: a query word's nearest neighbors form the coherent group; an intruder is added from outside it]
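A minimal sketch of one plausible construction, assuming the coherent group is the query word's nearest neighbors and the intruder is drawn from much further down the neighbor list. The ranks, vocabulary, and random vectors are illustrative assumptions, not necessarily the paper's exact settings.

```python
# Minimal sketch of building intrusion tasks from an embedding's
# nearest-neighbor lists. Vocabulary and vectors are random toys.
import numpy as np

rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(1000)]
vectors = {w: rng.normal(size=50) for w in vocab}
matrix = np.stack([vectors[w] for w in vocab])
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

def intrusion_task(query, n_neighbors=3, intruder_rank=500):
    # Cosine similarity of every word to the query.
    sims = matrix @ (vectors[query] / np.linalg.norm(vectors[query]))
    order = np.argsort(-sims)
    # Skip rank 0, which is the query word itself.
    coherent = [vocab[i] for i in order[1:1 + n_neighbors]]
    intruder = vocab[order[intruder_rank]]  # a distant, unrelated word
    return coherent, intruder

coherent, intruder = intrusion_task("w0")
print("group:", coherent, "intruder:", intruder)
```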

SLIDE 15


Results

⇒ Set-based evaluation ≠ item-based evaluation

[Figure: pair-based performance vs. outlier (intruder) detection precision]
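For concreteness, a minimal sketch of how an outlier-precision number like the one on the figure's axis could be computed from collected answers: the fraction of tasks where the word chosen as "belonging least" matches the designated intruder. The task records are invented.

```python
# Minimal sketch: outlier precision over intrusion tasks, i.e. how
# often the chosen word equals the designated intruder. Toy data.
tasks = [
    {"group": ["finally", "eventually", "immediately"],
     "intruder": "put", "chosen": "put"},
    {"group": ["tiger", "lion", "leopard"],
     "intruder": "fauna", "chosen": "leopard"},
]

correct = sum(t["chosen"] == t["intruder"] for t in tasks)
precision = correct / len(tasks)
print(f"outlier precision = {precision:.2f}")
```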

SLIDE 16


Looking back

  • How can we improve absolute evaluation?
  • Comparative evaluation
  • How should we pick the query inventory?
  • Strive for diversity and balancedness
  • Are there other interesting properties?
  • Coherence

… but

What about downstream performance?

SLIDE 17


The big picture

[Diagram: text data → word embeddings → meaning]

SLIDE 18


The big picture

[Diagram: text data → word embeddings → linguistic insight and better NLP systems]

SLIDE 19


The big picture

[Diagram: text data → word embeddings → intrinsic evaluation (similarity, clustering, analogy) and extrinsic evaluation (NER, chunking, POS tagging)]

SLIDE 21


Extrinsic vs. intrinsic performance

  • Hypothesis: Better intrinsic quality also gives better downstream performance
  • Experiment: Use each word embedding as extra features in a supervised task (a minimal sketch follows below)
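A minimal sketch of that experimental setup, using scikit-learn logistic regression as a stand-in for the actual downstream model; the base features, labels, and embedding vectors here are random placeholders.

```python
# Minimal sketch: append each token's embedding to its baseline
# feature vector for a supervised tagging task. Toy data throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tokens, dim = 200, 50

base_features = rng.integers(0, 2, size=(n_tokens, 10))  # e.g., word-shape indicators
embeddings = rng.normal(size=(n_tokens, dim))            # looked-up word vectors
labels = rng.integers(0, 2, size=n_tokens)               # e.g., chunk tags

# Concatenate embeddings as extra features; comparing against a model
# trained on base_features alone isolates the embedding's effect.
X = np.hstack([base_features, embeddings])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```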

SLIDE 22


Results – Chunking

⇒ Intrinsic performance ≠ extrinsic performance

[Figure: chunking F1 (range 93.75 to 94.15) for Rand. Proj., H-PCA, C&W, TSCCA, GloVe and CBOW, plotted against intrinsic performance]

SLIDE 23


Looking back

  • How can we improve absolute evaluation?
  • Comparative evaluation
  • How should we pick the query inventory?
  • Strive for diversity and balancedness
  • Are there other interesting properties?
  • Coherence
  • Does better intrinsic performance lead to better extrinsic results?
  • No!
SLIDE 24


Discussion

  • Why do we see such different behavior?
  • Hypothesis: Unwanted information is encoded as well
  • Embeddings can accurately predict word frequency (see the probe sketch below)
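A minimal sketch of such a frequency probe: train a linear classifier to separate frequent from rare words using only their embedding vectors. The toy data plants a frequency direction so the probe has something to find; with real embeddings you would label words by actual corpus counts.

```python
# Minimal sketch: probe whether word frequency is linearly
# recoverable from embedding vectors alone. Toy data throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, dim = 2000, 50
X = rng.normal(size=(n_words, dim))
y = rng.integers(0, 2, size=n_words)   # 1 = frequent, 0 = rare
X[:, 0] += 2.0 * y                     # plant a frequency direction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# High held-out accuracy means frequency is encoded in the vectors,
# whether we want that or not.
print("probe accuracy:", probe.score(X_te, y_te))
```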
SLIDE 25


Discussion

  • Also: Experiments show a strong correlation between word frequency and similarity
  • Further problems with cosine similarity:
  • Used in almost all intrinsic evaluation tasks – conflates different aspects
  • Not used during training: disconnect between evaluation and training
  • Better:
  • Learn a custom metric for each task (e.g., semantic relatedness, syntactic similarity, etc.) – see the sketch below
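A minimal sketch of the idea, assuming pairwise task-specific ratings are available: learn a diagonal reweighting of the embedding dimensions so a weighted dot product fits the ratings, rather than using one fixed cosine for every task. The data here is synthetic.

```python
# Minimal sketch: learn a per-task diagonal metric w so that the
# weighted dot product sum_j w_j * u_j * v_j fits task ratings.
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 50, 500
u = rng.normal(size=(n_pairs, dim))          # vectors for word 1
v = rng.normal(size=(n_pairs, dim))          # vectors for word 2
true_w = rng.uniform(0, 2, size=dim)         # hidden task weights
ratings = np.sum(u * v * true_w, axis=1)     # synthetic "human" scores

w = np.ones(dim)  # start from the plain (unweighted) dot product
lr = 0.1
for _ in range(200):
    pred = np.sum(u * v * w, axis=1)
    grad = 2 * np.mean((pred - ratings)[:, None] * (u * v), axis=0)
    w -= lr * grad  # gradient step on mean squared error

print("fit MSE:", np.mean((np.sum(u * v * w, axis=1) - ratings) ** 2))
```

Separate weight vectors for, say, relatedness vs. syntactic similarity would give each task its own metric while reusing the same underlying embedding.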

SLIDE 26


Conclusions

  • Practical recommendations:
  • Specify what the goal of an embedding method is
  • Advantage: Now able to use datasets to inform training
  • Future work:
  • Improving similarity metrics
  • Use data from comparative experiments to do offline evaluation
  • All data and code available at:
  • http://www.cs.cornell.edu/~schnabts/eval/