SLIDE 1

Word Embeddings through Hellinger PCA

Rémi Lebret and Ronan Collobert

Idiap Research Institute / EPFL

EACL, 29 April 2014

SLIDE 2

Word Embeddings

◮ Continuous vector-space models.

→ Represent word meanings with vectors capturing semantic + syntactic information.

◮ Similarity measures by computing distances between vectors.

◮ Useful applications:

◮ Information retrieval
◮ Document classification
◮ Question answering

◮ Successful methods: Neural Language Models
[Bengio et al., 2003, Collobert and Weston, 2008, Mikolov et al., 2013].

SLIDE 3

Neural Language Model Understanding

SLIDE 4

Neural Language Model

↑ Trained by backpropagation


SLIDE 11

Use of Context

“You shall know a word by the company it keeps” [Firth, 1957]

SLIDE 12

Use of Context

Next word probability distribution: P(Wt | Wt−1)


SLIDE 16

Neural Language Model

Critical Limitations:

◮ Large corpus needed → for rare words
◮ Difficult to train → finding the right parameters
◮ Time-consuming → weeks of training

Alternative:

◮ Estimate P(Wt | Wt−1) by simply counting words.
◮ Dimensionality reduction → PCA with an appropriate metric.

SLIDE 17

Hellinger PCA of the Word Co-occurrence Matrix

A simpler and faster method for word embeddings

SLIDE 18

A Spectral Method

Word co-occurrence statistics:

Counting the number of times Wt ∈ D occurs after a sequence Wt−1:t−T:

P(Wt | Wt−1:t−T) = P(Wt, Wt−1:t−T) / P(Wt−1:t−T) = n(Wt, Wt−1:t−T) / Σ_W n(W, Wt−1:t−T)

◮ Sequence sizes from 1 to T words.
◮ Next word probability distribution P for each sequence.

→ Multinomial distribution over |D| classes (words).

◮ Co-occurrence matrix of size N × |D|.

→ For word embeddings, T = 1.
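For T = 1, the counting step above reduces to row-normalized bigram counts. A minimal plain-Python sketch (the function name and toy corpus are illustrative, not from the talk):

```python
from collections import Counter

def next_word_probs(tokens, vocab, context_dict):
    """P(Wt | Wt-1): for each row word in vocab, the normalized counts of
    which dictionary words follow it in the token stream."""
    counts = {w: Counter() for w in vocab}
    for prev, cur in zip(tokens, tokens[1:]):
        if prev in counts and cur in context_dict:
            counts[prev][cur] += 1
    probs = {}
    for w, row in counts.items():
        total = sum(row.values())
        probs[w] = {v: n / total for v, n in row.items()} if total else {}
    return probs
```

Stacking the rows for all N vocabulary words gives the N × |D| co-occurrence matrix from the slide, one multinomial distribution per row.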

SLIDE 19

A Spectral Method

Example of word co-occurrence probability matrix (rows are P(Wt | Wt−1)):

Wt−1 \ Wt   breeds  computing  cover  food  is    meat  named  …
cat         0.04    0.00       0.00   0.13  0.53  0.02  0.18   0.10
dog         0.11    0.00       0.00   0.12  0.39  0.06  0.15   0.17
cloud       0.00    0.29       0.19   0.00  0.12  0.00  0.00   0.40

SLIDE 20

A Spectral Method

Hellinger distance:

H(P, Q) = (1/√2) √( Σᵢ₌₁ᵏ (√pᵢ − √qᵢ)² ),   (1)

with P = (p₁, …, pₖ), Q = (q₁, …, qₖ) discrete probability distributions.

◮ Related to the Euclidean norm:

H(P, Q) = (1/√2) ‖√P − √Q‖₂.   (2)

◮ Normalized distributions: ‖√P‖ = 1.
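Equations (1) and (2) are two ways of writing the same quantity; a minimal sketch in plain Python (helper names are mine) that makes the identity explicit:

```python
import math

def hellinger(p, q):
    """Hellinger distance, eq. (1): sqrt of summed squared differences
    of square roots, scaled by 1/sqrt(2)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

def euclidean_of_sqrts(p, q):
    """Equivalent form, eq. (2): (1/sqrt(2)) * ||sqrt(P) - sqrt(Q)||_2."""
    diff = [math.sqrt(a) - math.sqrt(b) for a, b in zip(p, q)]
    return math.sqrt(sum(d * d for d in diff)) / math.sqrt(2)
```

On the cat/dog/cloud rows from the previous slide, the cat and dog distributions come out much closer to each other than either is to cloud, which is exactly the behavior the embedding should preserve.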

SLIDE 21

A Spectral Method

Dimensionality reduction in practice:

PCA on the square roots of the probability distributions:

Wt−1 \ Wt   breeds  computing  cover  food   is     meat   named  …
cat         √0.04   √0.00      √0.00  √0.13  √0.53  √0.02  √0.18  √0.10
dog         √0.11   √0.00      √0.00  √0.12  √0.39  √0.06  √0.15  √0.17
cloud       √0.00   √0.29      √0.19  √0.00  √0.12  √0.00  √0.00  √0.40
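Putting the two steps together: take element-wise square roots of the rows, then run PCA. As a self-contained sketch I use plain-Python power iteration for the leading principal directions, a stand-in for the truncated SVD a real implementation would use; all names here are illustrative:

```python
import math

def hellinger_pca(rows, dim=2, iters=200):
    """PCA of the square-rooted co-occurrence rows (toy power-iteration version)."""
    X = [[math.sqrt(p) for p in row] for row in rows]       # element-wise sqrt
    n, d = len(X), len(X[0])
    means = [sum(X[i][j] for i in range(n)) / n for j in range(d)]
    Xc = [[X[i][j] - means[j] for j in range(d)] for i in range(n)]  # center columns
    comps = []
    for _ in range(dim):
        v = [1.0] * d
        for _ in range(iters):
            s = [sum(Xc[i][j] * v[j] for j in range(d)) for i in range(n)]   # Xc v
            w = [sum(Xc[i][j] * s[i] for i in range(n)) for j in range(d)]   # Xc^T Xc v
            for c in comps:  # deflate against components already found
                dot = sum(w[j] * c[j] for j in range(d))
                w = [w[j] - dot * c[j] for j in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        comps.append(v)
    # word embeddings = projections of each (centered) row onto the directions
    return [[sum(Xc[i][j] * c[j] for j in range(d)) for c in comps] for i in range(n)]
```

Because PCA of the square-rooted rows is PCA under the Hellinger metric, Euclidean distances between the resulting low-dimensional embeddings approximate Hellinger distances between the original distributions.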

SLIDE 22

Word Embeddings Evaluation

SLIDE 23

Word Embeddings Evaluation

Supervised NLP tasks:

◮ Syntactic: Named Entity Recognition
◮ Semantic: Movie Review

SLIDE 24

Sentence-level Architecture

SLIDE 25

Example of Movie Review

SLIDE 26

Document-level Architecture


SLIDE 29

Word Embeddings Fine-Tuning

◮ Embeddings are generic.
⇒ Task-specific tuned embeddings.

SLIDE 30

Experimental Setup

SLIDE 31

Experimental Setup

Building word embeddings over large corpora:

◮ English corpus = Wikipedia + Reuters + Wall Street Journal
→ 1.652 billion words.

◮ Vocabulary = words that appear at least 100 times
→ 178,080 words.

◮ Context vocabulary = 10,000 most frequent words
→ Co-occurrence matrix of size 178,080 × 10,000.

◮ 50-dimensional vectors after PCA.
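The two vocabulary choices above are just frequency filters over the corpus; a small sketch (thresholds taken from the slide, function and variable names are mine):

```python
from collections import Counter

def build_vocabularies(tokens, min_count=100, n_context=10000):
    """Rows of the matrix: every word seen at least min_count times.
    Columns: the n_context most frequent words."""
    freq = Counter(tokens)
    vocab = sorted(w for w, n in freq.items() if n >= min_count)
    context = [w for w, _ in freq.most_common(n_context)]
    return vocab, context
```

With the paper's corpus this yields the 178,080-word vocabulary and the 10,000-word context dictionary, hence a 178,080 × 10,000 co-occurrence matrix.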

SLIDE 32

Experimental Setup

Comparison with existing available word embeddings:

◮ LR-MVL: 300,000 words, 50 dimensions, trained on the RCV1 corpus.
→ Another spectral method.

◮ CW: 130,000 words, 50 dimensions, trained on Wikipedia.
→ Neural network language model.

◮ Turian: 268,810 words, 50 dimensions, trained on the RCV1 corpus.
→ Same model as CW.

◮ HLBL: 246,122 words, 50 dimensions, trained on the RCV1 corpus.
→ Probabilistic and linear neural model.

SLIDE 33

Experimental Setup

Supervised evaluation task: Named Entity Recognition (NER)

Reuters corpus:
◮ Training set → 203,621 words
◮ Test set → 46,435 words
◮ Number of tags = 9

Features:
◮ Word embeddings
◮ Capital-letter feature

SLIDE 34

Experimental Setup

Supervised evaluation task: Movie Review

IMDB Review Dataset:
◮ Training set → 25,000 reviews
◮ Test set → 25,000 reviews
◮ Equal numbers of positive and negative reviews

Features:
◮ Word embeddings

SLIDE 35

Results

SLIDE 36

Named Entity Recognition (results in F1 score)

Other models:
  Brown, 1000 clusters      88.5
  Ando & Zhang (2005)       89.3
  Suzuki & Isozaki (2008)   89.9
  Lin & Wu (2009)           90.9

Our model*:     Not tuned   Tuned
  LR-MVL          86.8       87.4
  CW              88.1       88.7
  Turian          86.3       87.3
  HLBL            83.9       85.9
  H-PCA           87.9       89.2
  E-PCA           84.3       87.1

Mainly syntactic → slight increase with fine-tuning.

*Only word embeddings + capital-letter feature. No gazetteers. No previous predictions.

SLIDE 37

IMDB Movie Review (results in classification accuracy)

Other models:
  LDA                               67.4
  LSA                               84.0
  Maas et al. (2011)                88.9
  Wang & Manning (2012), unigram    88.3
  Wang & Manning (2012), bigram     91.2
  Brychcin & Habernal (2013)        92.2

Our model*:     Not tuned   Tuned
  LR-MVL          84.4       89.8
  CW              87.6       89.9
  Turian          84.4       89.7
  HLBL            85.3       89.6
  H-PCA           84.1       89.9
  E-PCA           73.3       89.6

Clearly semantic → fine-tuning does help.

*Only word embeddings as features. No global context.

SLIDE 38

Computational Cost

  Model     Cores      Completion time
  LR-MVL    70 CPUs    3 days
  CW        1 CPU      2 months
  Turian    1 CPU      a few weeks
  HLBL      GPGPU      7 days
  H-PCA     1 CPU      3 hours
  H-PCA     100 CPUs   3 minutes

SLIDE 39

Fine-Tuning: 10 nearest neighbors with and without fine-tuning

  BORING               BAD                     AWESOME
  before / after       before / after          before / after
  SAD / CRAP           HORRIBLE / TERRIBLE     SPOOKY / TERRIFIC
  SILLY / LAME         TERRIBLE / STUPID       AWFUL / TIMELESS
  SUBLIME / MESS       DREADFUL / BORING       SILLY / FANTASTIC
  FANCY / STUPID       UNFORTUNATE / DULL      SUMMERTIME / LOVELY
  SOBER / DULL         AMAZING / CRAP          NASTY / FLAWLESS
  TRASH / HORRIBLE     AWFUL / WRONG           MACABRE / MARVELOUS
  LOUD / RUBBISH       MARVELOUS / TRASH       CRAZY / EERIE
  RIDICULOUS / SHAME   WONDERFUL / SHAME       ROTTEN / LIVELY
  RUDE / AWFUL         GOOD / KINDA            OUTRAGEOUS / FANTASY
  MAGIC / ANNOYING     FANTASTIC / JOKE        SCARY / SURREAL

SLIDE 40

Valuable feature

SLIDE 42

Conclusion

◮ Appealing word embeddings from a Hellinger PCA of the word co-occurrence matrix.
→ Simply counting words over a large corpus.

◮ PCA of an N × 10,000 matrix → fast and light on memory.
→ A practical alternative to neural language models.

◮ H-PCA’s embeddings available online:
→ 50, 100 and 200 dimensions
→ Demo → http://www.lebret.ch/words

Thank you!

SLIDE 43

References I

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.

Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, ICML.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–55. 1952–59:1–32.

SLIDE 44

References II

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

SLIDE 45

NNLM Architecture

Figure: Neural Language Model ([Bengio et al., 2003])

SLIDE 46

Word-tagging Architecture

Figure: Sentence Approach ([Collobert et al., 2011])