SLIDE 1

INF5820: Language technological applications Lecture 6 Evaluating Word Embeddings and Using them in Deep Neural Networks

Andrey Kutuzov, Lilja Øvrelid, Stephan Oepen, & Erik Velldal

University of Oslo

25 September 2018

SLIDE 2

Contents

1. Technicalities
2. Visualizing Word Embeddings
3. Evaluating Word Embeddings
4. Word Embeddings in Neural Networks
5. Representing Documents: Composing from word vectors; Training document vectors
6. Group session on September 27

SLIDE 3

Technicalities

Obligatory assignment

◮ Obligatory assignment 2 is (finally) out.
◮ ‘Distributional Word Embedding Models’
◮ Will work on related tasks at the 27/09 group session.
◮ The assignment is due October 5.

SLIDE 5

Visualizing Word Embeddings

◮ The most common way of visualizing high-dimensional vectors:
  ◮ project them into 3D or 2D space, minimizing the difference between the original and the projected vectors.
◮ Several algorithms:
  ◮ Principal Component Analysis (PCA) [Tipping and Bishop, 1999]
  ◮ t-distributed Stochastic Neighbor Embedding (t-SNE) [Van der Maaten and Hinton, 2008]

SLIDE 6

Visualizing Word Embeddings

Figure: t-SNE used to visualize semantic shifts over time [Hamilton et al., 2016]

Good to know
◮ Both PCA and t-SNE are implemented in sklearn, TensorFlow, etc.
◮ Nice online visualization tool: http://projector.tensorflow.org/
◮ Remember that t-SNE is probabilistic:
  ◮ it produces a different picture each run.
◮ Important reading about using t-SNE properly: https://distill.pub/2016/misread-tsne/ [Wattenberg et al., 2016]
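To make this concrete, here is a minimal sketch of projecting a handful of word vectors to 2D with sklearn's t-SNE; the model file and word list are hypothetical, and the vectors are assumed to be in word2vec binary format.

```python
# Minimal sketch: project a few word vectors to 2D with t-SNE and plot them.
# The model path and word list are hypothetical.
import gensim
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

model = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)
words = ["king", "queen", "man", "woman", "paris", "oslo", "france", "norway"]
vectors = model[words]  # matrix of shape (len(words), vector_size)

# perplexity must be smaller than the number of points; t-SNE is stochastic,
# so every run gives a slightly different picture.
projected = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

for (x, y), word in zip(projected, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```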

SLIDE 8

Evaluating Word Embeddings

Intrinsic evaluation

◮ How do we evaluate trained word embeddings (besides downstream tasks)?
◮ Subject to many discussions! The topic of special workshops at major NLP conferences (ACL and EMNLP): https://repeval2017.github.io/
◮ Synonym detection (what is most similar?)
  ◮ TOEFL dataset (1997)
◮ Concept categorization (what groups with what?)
  ◮ ESSLI 2008 dataset
  ◮ Battig dataset (2010)
◮ Semantic similarity/relatedness (what is the association degree?)
  ◮ RG dataset [Rubenstein and Goodenough, 1965]
  ◮ WordSim-353 (WS353) dataset [Finkelstein et al., 2001]
  ◮ MEN dataset [Bruni et al., 2014]
  ◮ SimLex999 dataset [Hill et al., 2015]

SLIDE 9

Evaluating Word Embeddings

Semantic similarity datasets

◮ Judgments about the semantic similarity of word pairs are collected from human informants;
◮ the correlation of those judgments with the predictions of a word embedding model is then computed.

Spearman rank correlation: 0.9, p = 0.037
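As an illustration, here is a minimal sketch of this evaluation loop, assuming a word2vec-format model and a tab-separated word-pair file with human scores (both file names are hypothetical); Gensim's evaluate_word_pairs() helper wraps essentially the same computation.

```python
# Sketch of intrinsic evaluation: correlate human similarity judgments
# with model-predicted cosine similarities. File names are hypothetical.
import csv
from scipy.stats import spearmanr
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)

human_scores, model_scores = [], []
with open("word_pairs.tsv") as f:  # rows: word1 <tab> word2 <tab> human score
    for word1, word2, score in csv.reader(f, delimiter="\t"):
        if word1 in model and word2 in model:
            human_scores.append(float(score))
            model_scores.append(model.similarity(word1, word2))

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rank correlation: {rho:.3f}, p = {p_value:.3f}")
```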

SLIDE 10

Evaluating Word Embeddings

There are strong relations/directions between word embeddings within a model: king − man + woman = queen
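A minimal sketch of this vector arithmetic with Gensim (the model file is hypothetical):

```python
# king - man + woman, then the nearest neighbours of the resulting vector.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # with a well-trained model, 'queen' is typically the top hit
```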

SLIDE 11

Evaluating Word Embeddings

Figure: countries and their capitals.

This relation can be used to evaluate models as well.

SLIDE 12

Evaluating Word Embeddings

◮ Analogical inference on relations (A is to B as C is to ?)
  ◮ Google Analogies dataset [Le and Mikolov, 2014];
  ◮ Bigger Analogy Test Set (BATS) [Gladkova et al., 2016];
  ◮ many domain-specific test sets inspired by Google Analogies.
◮ Correlation with manually crafted linguistic features:
  ◮ QVEC uses word affiliations with WordNet synsets [Tsvetkov et al., 2015];
  ◮ Linguistic Diagnostics Toolkit (ldtoolkit) offers a multi-factor evaluation strategy based on several linguistic properties of the model under analysis [Rogers et al., 2018].
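For example, Gensim can score a model on an analogy test set directly; a hedged sketch, assuming a word2vec-format model and a local copy of the Google Analogies file (questions-words.txt):

```python
# Scoring a model on an analogy test set with Gensim's built-in helper.
# The model file and the path to the analogies file are hypothetical.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)

# questions-words.txt uses the Google Analogies format: "A B C D" per line, grouped in sections.
score, sections = model.evaluate_word_analogies("questions-words.txt")
print(f"Overall analogy accuracy: {score:.3f}")
```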

SLIDE 13

Evaluating Word Embeddings

All evaluation approaches are problematic

◮ What level of correlation will allow us to consider a model ‘bad’?
◮ The model below achieves a Spearman rank correlation with SimLex999 of only 0.4, but it is very good in various downstream tasks!

Figure: dependency between human judgments and model predictions.

At least, we can compare different models with each other.

SLIDE 14

Evaluating Word Embeddings

Figure: example of word embedding performance on a semantic relatedness task, depending on window and vector sizes.

SLIDE 16

Word Embeddings in Neural Networks

Word embeddings are widely replacing discrete word tokens as an input to more complex neural network models:

◮ feedforward networks,
◮ convolutional networks,
◮ recurrent networks,
◮ LSTMs...

SLIDE 17

Word Embeddings in Neural Networks

Main libraries and toolkits to train word embeddings

1. Dissect toolkit [Dinu et al., 2013] (http://clic.cimec.unitn.it/composes/toolkit/);
2. word2vec original C code [Le and Mikolov, 2014] (https://code.google.com/archive/p/word2vec/);
3. Gensim library for Python, including word2vec and fastText implementations and wrappers for other algorithms (https://github.com/RaRe-Technologies/gensim);
4. word2vec implementations in Google’s TensorFlow (https://www.tensorflow.org/tutorials/word2vec);
5. GloVe reference implementation [Pennington et al., 2014] (http://nlp.stanford.edu/projects/glove/).

SLIDE 18

Word Embeddings in Neural Networks

Hyperparameter influence

Word embedding quality hugely depends on the training settings (hyperparameters); a training sketch follows below the list:

1. CBOW or Skip-Gram algorithm. Skip-Gram is generally better (but slower). CBOW seems to be better on small corpora (less than 100 million tokens).
2. Vector size: how many semantic features (dimensions, vector entries) we use to describe a word. More is not always better (300 is a good choice).
3. Window size: context width and influence of distance. Broad windows (more than 5 words) produce topical (associative) models, narrow windows produce functional (semantic proper) models.
4. Vocabulary size (in Gensim, can be stated explicitly or set through the min_count threshold):
   ◮ useful to get rid of the long noisy lexical tail.
5. Number of iterations (epochs) over the training data, etc.
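A hedged sketch of setting these hyperparameters when training with Gensim (parameter names follow Gensim 4; older versions use size/iter; the corpus file is hypothetical):

```python
# Training a word2vec model with Gensim; corpus.txt is a hypothetical file
# with one tokenized sentence per line.
import gensim

corpus = gensim.models.word2vec.LineSentence("corpus.txt")
model = gensim.models.Word2Vec(
    corpus,
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
    vector_size=300,  # number of semantic features per word
    window=5,         # context width
    min_count=10,     # drop the long noisy lexical tail
    epochs=5,         # iterations over the training data
)
model.wv.save_word2vec_format("model.bin", binary=True)
```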

SLIDE 19

Word Embeddings in Neural Networks

Models can come in several formats:

1. Simple text format: words and sequences of values representing their vectors, one word per line; the first line gives the number of words in the model and the vector size.
2. The same in binary form.
3. Gensim binary format: uses NumPy matrices; stores a lot of additional information (training weights, hyperparameters, word frequency, etc).

Gensim works with all of these formats, for example as in the snippet below.
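A minimal sketch of reading and writing these formats with Gensim (all file paths are hypothetical):

```python
# Reading and writing the three model formats listed above.
from gensim.models import KeyedVectors, Word2Vec

# 1. Plain text word2vec format
vectors = KeyedVectors.load_word2vec_format("model.txt", binary=False)
# 2. Binary word2vec format
vectors = KeyedVectors.load_word2vec_format("model.bin", binary=True)
# 3. Native Gensim format (NumPy matrices plus training state and metadata)
full_model = Word2Vec.load("model.model")

# Saving back out:
vectors.save_word2vec_format("exported.txt", binary=False)
full_model.save("model.model")
```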

SLIDE 20

Word Embeddings in Neural Networks

Feeding embeddings to Keras

◮ Embeddings are already numbers, so one can simply feed them as input vectors.
◮ Another way in Keras is to use an Embedding() layer:
  ◮ a matrix of row vectors;
  ◮ transforms integers (word identifiers) into the corresponding vectors;
  ◮ ...or sequences of integers into sequences of vectors.
◮ Importantly, the weights in an Embedding() layer can be updated as part of the training process.
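A hedged sketch of initializing an Embedding() layer from pre-trained Gensim vectors (assumes Gensim 4 naming and tf.keras; the model file is hypothetical):

```python
# An Embedding() layer whose weight matrix is filled with pre-trained vectors.
import numpy as np
import gensim
from tensorflow import keras

vectors = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)
vocab_size, dim = len(vectors.index_to_key), vectors.vector_size

# Row i of the weight matrix is the vector of the word whose integer id is i.
weights = np.array([vectors[word] for word in vectors.index_to_key])

embedding = keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=dim,
    embeddings_initializer=keras.initializers.Constant(weights),
    trainable=True,  # set to False to keep the pre-trained vectors frozen
)

# The layer maps sequences of word ids to sequences of vectors:
ids = np.array([[0, 2, 5], [1, 3, 4]])  # two toy 'sentences' of word ids
print(embedding(ids).shape)             # (2, 3, dim)
```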

SLIDE 22

Representing Documents

◮ Distributional approaches allow us to extract semantics from unlabeled data at the word level.
◮ But we also need to represent variable-length documents!
  ◮ for classification,
  ◮ for clustering,
  ◮ for information retrieval (including web search).

SLIDE 23

Representing Documents

◮ Can we detect semantically similar texts in the same way as we detect similar words?
◮ Yes we can!
◮ Nothing prevents us from representing sentences, paragraphs or whole documents (below we use the term ‘document’ for all of these) as dense vectors.
◮ Once documents are represented as vectors, classification, clustering or other data processing tasks become trivial.

Note: this lecture does not cover sequence-to-sequence sentence modeling approaches based on recurrent neural networks (RNNs), like the Skip-Thought algorithm [Kiros et al., 2015].

We are concerned with comparatively simple algorithms conceptually similar to prediction-based distributional models for words.

SLIDE 24

Representing Documents

Bag-of-words with TF-IDF

A very strong baseline approach for document representation, hard to beat with modern methods:

1. Extract the vocabulary V of all words (terms) in the training collection consisting of n documents;
2. for each term, calculate its document frequency df: in how many documents it occurs;
3. represent each document as a sparse vector of frequencies tf for all terms from V contained in it;
4. for each value, calculate the weighted frequency wf using term frequency / inverted document frequency (TF-IDF):

   wf = (1 + log10 tf) × log10(n / df)

5. use these weighted document vectors in your downstream tasks.
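A minimal sketch of this weighting scheme over a toy collection (the documents are hypothetical; real pipelines would typically use e.g. sklearn's TfidfVectorizer, whose default formula differs slightly):

```python
# TF-IDF weighting over a toy document collection, following the formula above.
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
n = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    tf = Counter(doc)
    return {term: (1 + math.log10(tf[term])) * math.log10(n / df[term])
            for term in tf}

print(tfidf_vector(docs[0]))  # terms occurring in every document get weight 0
```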

What if we want semantically-aware representations?

SLIDE 25

Composing from word vectors

◮ Document meaning is composed of individual word meanings.
◮ Can we combine continuous word vectors into continuous document vectors?
◮ The way of doing so is called a composition function.

Semantic fingerprints

◮ One of the simplest composition functions: the average vector s over the vectors of all words w_0...w_n in the document:

   s = (1/n) × Σ_{i=0}^{n} w_i    (1)

◮ We don’t care about syntax and word order.
◮ If we already have a good word embedding model, this bottom-up approach is strikingly efficient and usually beats bag-of-words (a minimal sketch follows below).
◮ ‘Semantic fingerprint’ of the document is just a fancy term for this simple concept.
◮ It is very important to remove stop words beforehand!
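A minimal sketch of such a semantic fingerprint, assuming a pre-trained word2vec-format model (the file name and stop-word list are hypothetical):

```python
# 'Semantic fingerprint': the mean of the word vectors in a document.
import numpy as np
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)
stop_words = {"the", "a", "of", "and", "on"}

def semantic_fingerprint(tokens):
    """Average the vectors of all in-vocabulary, non-stop-word tokens."""
    vectors = [model[t] for t in tokens if t in model and t not in stop_words]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc1 = "the cat sat on the mat".split()
doc2 = "a dog lay on a rug".split()
v1, v2 = semantic_fingerprint(doc1), semantic_fingerprint(doc2)
# Cosine similarity between the two document fingerprints:
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```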

SLIDE 26

Composing from word vectors

s = (1/n) × Σ_{i=0}^{n} w_i    (2)

◮ You don’t even have to average: summing the vectors is enough, since cosine is about angles, not magnitudes.
◮ However, averaging makes a difference with other distance metrics (Euclidean distance, etc).
◮ It also helps to keep things tidy and normalized (representations do not depend on document length).

SLIDE 27

Composing from word vectors

Advantages of semantic fingerprints

◮ ‘Semantic fingerprints’ work fast and reuse pre-trained word embeddings.
◮ The generalized document representations do not depend on particular words.
◮ They take advantage of the ‘semantic features’ learned during the training of the embedding model.
◮ Topically connected words collectively increase or decrease the expression of the corresponding semantic components.
◮ Thus, topical words automatically become more important than noise words. See more in [Kutuzov et al., 2016].

SLIDE 28

Composing from word vectors

Composition functions

◮ One can experiment with different combinations of word vectors, not only averaging/summation:
  ◮ concatenation,
  ◮ multiplication,
  ◮ weighted sum,
  ◮ etc.
◮ One can introduce word-order knowledge by using n-grams instead of words.
◮ See [Mitchell and Lapata, 2010] for an extensive description and evaluation of various compositional models.

SLIDE 29

Composing from word vectors

But...

For some problems, such compositional approaches are not enough, and we need to generate real document embeddings. But how?

SLIDE 30

Training document vectors

Paragraph Vector

◮ [Le and Mikolov, 2014] proposed Paragraph Vector;
◮ primarily designed for learning sentence vectors;
◮ the algorithm takes as input sentences/documents tagged with (possibly unique) identifiers;
◮ it learns distributed representations for the sentences, such that similar sentences have similar vectors;
◮ so each sentence is represented with an identifier and a vector, like a word;
◮ these vectors serve as a sort of document memory or document topic.

See [Hill et al., 2016] and [Lau and Baldwin, 2016] for evaluation.

SLIDE 31

Training document vectors

Paragraph Vector (aka doc2vec)

◮ Implemented in Gensim under the name doc2vec;
◮ two methods: Distributed Memory (DM) and Distributed Bag-of-Words (DBOW).
◮ PV-DM:
  ◮ learn word embeddings in the usual way (shared by all documents);
  ◮ randomly initialize document vectors;
  ◮ use document vectors together with word vectors to predict the neighboring words within a pre-defined window;
  ◮ minimize the error;
  ◮ the trained model can infer a vector for any new document (the model itself remains intact).
◮ PV-DBOW:
  ◮ don’t use a sliding window at all;
  ◮ just predict all words in the current document using its vector.

There are contradicting reports on which method is better. A minimal training sketch follows below.
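A hedged sketch of training doc2vec with Gensim and inferring a vector for an unseen document (the toy corpus is hypothetical; parameter names follow Gensim 4):

```python
# Training doc2vec and inferring a vector for a new document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = ["the cat sat on the mat", "a dog barked at the postman",
            "the kitten purred on the rug"]
# Each document gets a (possibly unique) tag, as described above.
corpus = [TaggedDocument(words=doc.split(), tags=[str(i)])
          for i, doc in enumerate(raw_docs)]

model = Doc2Vec(corpus,
                dm=1,             # dm=1: PV-DM, dm=0: PV-DBOW
                vector_size=100, window=5, min_count=1, epochs=40)

# Inferring a vector for an unseen document leaves the trained model intact.
new_vector = model.infer_vector("a cat purred on the mat".split())
print(model.dv.most_similar([new_vector], topn=2))
```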

SLIDE 32

Training document vectors

Paragraph Vector - Distributed memory (PV-DM)

SLIDE 33

Training document vectors

Paragraph Vector - Distributed Bag-of-words (PV-DBOW)

SLIDE 34

Training document vectors

Paragraph Vector (aka doc2vec)

◮ You train the model, then infer embeddings for the documents you are interested in.
◮ The resulting embeddings are shown to perform very well on sentiment analysis and other document classification tasks, as well as in IR tasks.
◮ Very memory-hungry: each sentence gets its own vector (many millions of sentences in real-life corpora).
◮ It is possible to reduce the memory footprint by training a limited number of vectors: group sentences into classes.

SLIDE 35

Representing Documents

There can be many other ways to represent documents with dense continuous vectors! For example:

◮ interpret sentences or paragraphs as ‘words’ and train a straightforward word embedding model;
◮ etc.

The choice of approach depends very much on your downstream task.

SLIDE 37

Group session on September 27

◮ Working with pre-trained word embeddings from the LTG vectors repository;
◮ tinkering with word embeddings and adapting them to the task at hand;
◮ coupling Gensim and Keras;
◮ working on the obligatory assignment 2.

See you next Thursday!

SLIDE 38

References I

Bruni, E., Tran, N.-K., and Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR), 49:1–47.

Dinu, G., Pham, N. T., and Baroni, M. (2013). Dissect – distributional semantics composition toolkit. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 31–36. Association for Computational Linguistics.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2001). Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414. ACM.

SLIDE 39

References II

Gladkova, A., Drozd, A., and Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proceedings of the NAACL Student Research Workshop, pages 8–15. Association for Computational Linguistics.

Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501. Association for Computational Linguistics.

SLIDE 40

References III

Hill, F., Cho, K., and Korhonen, A. (2016). Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377. Association for Computational Linguistics.

Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

SLIDE 41

References IV

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Skip-Thought vectors. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 3294–3302. Curran Associates, Inc.

Kutuzov, A., Kopotev, M., Sviridenko, T., and Ivanova, L. (2016). Clustering comparable corpora of Russian and Ukrainian academic texts: Word embeddings and semantic fingerprints. In Ninth Workshop on Building and Using Comparable Corpora (LREC 2016), page 3. European Language Resources Association.

SLIDE 42

References V

Lau, J. H. and Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 78–86. Association for Computational Linguistics.

Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

SLIDE 43

References VI

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Rogers, A., Hosur Ananthakrishna, S., and Rumshisky, A. (2018). What’s in your embedding, and how it predicts task performance. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2690–2703. Association for Computational Linguistics.

Rubenstein, H. and Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

SLIDE 44

References VII

Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622.

Tsvetkov, Y., Faruqui, M., Ling, W., Lample, G., and Dyer, C. (2015). Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2049–2054. Association for Computational Linguistics.

Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

SLIDE 45

References VIII

Wattenberg, M., Viégas, F., and Johnson, I. (2016). How to use t-SNE effectively. Distill, 1(10):e2.
