
Word Embeddings through Hellinger PCA Rémi Lebret and Ronan Collobert Idiap Research Institute / EPFL EACL, 29 April 2014

Word Embeddings
◮ Continuous vector-space models.
  → Represent word meanings with vectors capturing semantic + syntactic information.
◮ Similarity measures by computing distances between vectors.
◮ Useful applications:
  ◮ Information retrieval
  ◮ Document classification
  ◮ Question answering
◮ Successful methods: Neural Language Models [Bengio et al., 2003, Collobert and Weston, 2008, Mikolov et al., 2013].

Neural Language Model Understanding

Neural Language Model
↑ Trained by backpropagation


Use of Context
“You shall know a word by the company it keeps” [Firth, 1957]

Use of Context
Next word probability distribution: $P(W_t \mid W_{t-1})$


Neural Language Model
Critical Limitations:
◮ Large corpus needed → for rare words
◮ Difficult to train → finding the right parameters
◮ Time-consuming → weeks of training
Alternative
◮ Estimate $P(W_t \mid W_{t-1})$ by simply counting words.
◮ Dimensionality reduction → PCA with an appropriate metric.

Hellinger PCA of the Word Co-occurrence Matrix
A simpler and faster method for word embeddings

A Spectral Method
Word co-occurrence statistics:
Count the number of times $W_t \in \mathcal{D}$ occurs after a sequence $W_{t-1:t-T}$:
$$P(W_t \mid W_{t-1:t-T}) = \frac{P(W_t, W_{t-1:t-T})}{P(W_{t-1:t-T})} = \frac{n(W_t, W_{t-1:t-T})}{\sum_{W} n(W, W_{t-1:t-T})}$$
◮ Sequence size from 1 to $T$ words.
◮ Next word probability distribution $P$ for each sequence.
  → Multinomial distribution of $|\mathcal{D}|$ classes (words).
◮ Co-occurrence matrix of size $N \times |\mathcal{D}|$.
  → For word embeddings, $T = 1$.

A Spectral Method
Example of word co-occurrence probability matrix (rows: $W_{t-1}$, columns: $W_t$):
W_{t-1} \ W_t  breeds  computing  cover  food  is    meat  named  of
cat            0.04    0.00       0.00   0.13  0.53  0.02  0.18   0.10
dog            0.11    0.00       0.00   0.12  0.39  0.06  0.15   0.17
cloud          0.00    0.29       0.19   0.00  0.12  0.00  0.00   0.40
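The counting step for T = 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cooccurrence_probabilities`, the toy corpus, and the tiny context vocabulary are all invented here and do not come from the slides.

```python
from collections import Counter, defaultdict

def cooccurrence_probabilities(tokens, context_words):
    """Estimate P(W_t | W_{t-1}) by counting bigrams (the T = 1 case)."""
    context = set(context_words)
    counts = defaultdict(Counter)
    # n(W_t, W_{t-1}): how often each context word follows each word.
    for prev, nxt in zip(tokens, tokens[1:]):
        if nxt in context:
            counts[prev][nxt] += 1
    # Normalize each row so that it becomes a probability distribution.
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {word: n / total for word, n in row.items()}
    return probs

# Toy corpus, invented for illustration only.
tokens = "the cat is named tom the dog is named rex the cat is hungry".split()
probs = cooccurrence_probabilities(tokens, context_words=["is", "named", "hungry"])
```

Each entry `probs[w]` is one row of the co-occurrence probability matrix, stored sparsely as a dictionary.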

A Spectral Method
◮ Hellinger distance:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}, \tag{1}$$
with $P = (p_1, \ldots, p_k)$, $Q = (q_1, \ldots, q_k)$ discrete probability distributions.
◮ Related to Euclidean norm:
$$H(P, Q) = \frac{1}{\sqrt{2}} \left\| \sqrt{P} - \sqrt{Q} \right\|_2. \tag{2}$$
◮ Normalized distributions: $\|\sqrt{P}\| = 1$.
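As a quick check that equations (1) and (2) agree, here is a minimal Python sketch. The function names are hypothetical; the three probability rows are the cat, dog, and cloud rows of the example co-occurrence matrix.

```python
import math

def hellinger(p, q):
    """Equation (1): Hellinger distance between two discrete distributions."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)

def hellinger_via_norm(p, q):
    """Equation (2): Euclidean norm of the difference of square-rooted vectors."""
    diff = [math.sqrt(a) - math.sqrt(b) for a, b in zip(p, q)]
    return math.sqrt(sum(d * d for d in diff)) / math.sqrt(2)

# Rows of the example co-occurrence probability matrix.
cat = [0.04, 0.00, 0.00, 0.13, 0.53, 0.02, 0.18, 0.10]
dog = [0.11, 0.00, 0.00, 0.12, 0.39, 0.06, 0.15, 0.17]
cloud = [0.00, 0.29, 0.19, 0.00, 0.12, 0.00, 0.00, 0.40]

d_cat_dog = hellinger(cat, dog)
d_cat_cloud = hellinger(cat, cloud)
```

With these rows, `d_cat_dog` comes out smaller than `d_cat_cloud`, which matches the intuition that cat and dog appear in similar contexts.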

A Spectral Method
Dimensionality reduction in practice: PCA with square roots of probability distributions (rows: $W_{t-1}$, columns: $W_t$):
W_{t-1} \ W_t  breeds  computing  cover  food   is     meat   named  of
cat            √0.04   √0.00      √0.00  √0.13  √0.53  √0.02  √0.18  √0.10
dog            √0.11   √0.00      √0.00  √0.12  √0.39  √0.06  √0.15  √0.17
cloud          √0.00   √0.29      √0.19  √0.00  √0.12  √0.00  √0.00  √0.40
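The reduction step can be illustrated with a small pure-Python sketch: take element-wise square roots, then run an ordinary PCA. Several things here are assumptions rather than details from the slides: the function name `hellinger_pca`, the column centering (standard for PCA), and the use of power iteration with deflation in place of a full eigendecomposition or SVD, which a real implementation on a large matrix would use.

```python
import math

def hellinger_pca(prob_rows, k, iters=200):
    """Project square-rooted probability rows onto their top-k principal axes."""
    # Hellinger step: element-wise square root of every probability.
    X = [[math.sqrt(p) for p in row] for row in prob_rows]
    n, d = len(X), len(X[0])
    # Standard PCA centering of each column (an assumption of this sketch).
    means = [sum(row[j] for row in X) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in X]
    # d x d covariance matrix of the centered square-root rows.
    cov = [[sum(row[a] * row[b] for row in X) / n for b in range(d)]
           for a in range(d)]
    components = []
    for c in range(k):
        v = [1.0 / (j + 1 + c) for j in range(d)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            # Deflation: remove the directions that were already extracted.
            for u in components:
                dot = sum(wi * ui for wi, ui in zip(w, u))
                w = [wi - dot * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            if norm == 0.0:
                raise ValueError("k exceeds the rank of the centered data")
            v = [wi / norm for wi in w]
        components.append(v)
    # Coordinates of every row along the k principal axes.
    return [[sum(row[j] * u[j] for j in range(d)) for u in components]
            for row in X]

# Rows of the example co-occurrence probability matrix.
cat = [0.04, 0.00, 0.00, 0.13, 0.53, 0.02, 0.18, 0.10]
dog = [0.11, 0.00, 0.00, 0.12, 0.39, 0.06, 0.15, 0.17]
cloud = [0.00, 0.29, 0.19, 0.00, 0.12, 0.00, 0.00, 0.40]
embeddings = hellinger_pca([cat, dog, cloud], k=2)
```

Because three centered points span at most two dimensions, the two-dimensional embeddings in this toy case keep the Euclidean distances between the square-rooted rows, so cat stays closer to dog than to cloud.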

Word Embeddings Evaluation

Word Embeddings Evaluation
Supervised NLP tasks:
◮ Syntactic: Named Entity Recognition
◮ Semantic: Movie Review

Sentence-level Architecture

Example of Movie Review

Document-level Architecture


Word Embeddings Fine-Tuning
◮ Embeddings are generic.
⇒ Task-specific tuned embeddings.

Experimental Setup

Experimental Setup
Building Word Embeddings over Large Corpora:
◮ English corpus = Wikipedia + Reuters + Wall Street Journal
  → 1.652 billion words.
◮ Vocabulary = words that appear at least 100 times
  → 178,080 words
◮ Context vocabulary = 10,000 most frequent words
  → Co-occurrence matrix of size 178,080 × 10,000
◮ 50-dimensional vector after PCA
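The two vocabulary rules above (a minimum frequency for words to embed, and a fixed number of most frequent words as context) can be sketched as follows. The function name `build_vocabularies` and the toy token list are hypothetical; on the real corpus the slide's settings would correspond to `min_count=100` and `context_size=10000`.

```python
from collections import Counter

def build_vocabularies(tokens, min_count, context_size):
    """Select the words to embed and the context words from raw token counts."""
    freq = Counter(tokens)
    # Vocabulary: every word seen at least `min_count` times.
    vocabulary = [word for word, n in freq.items() if n >= min_count]
    # Context vocabulary: the `context_size` most frequent words.
    context = [word for word, _ in freq.most_common(context_size)]
    return vocabulary, context

# Toy token list, invented for illustration only.
toy_tokens = "a a a b b c".split()
vocabulary, context = build_vocabularies(toy_tokens, min_count=2, context_size=1)
```

The co-occurrence matrix then has one row per vocabulary word and one column per context word.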

Experimental Setup
Comparison with Existing Available Word Embeddings:
◮ LR-MVL: 300,000 words with 50 dimensions trained on RCV1 corpus.
  → Another spectral method
◮ CW: 130,000 words with 50 dimensions trained over Wikipedia.
  → Neural network language model
◮ Turian: 268,810 words with 50 dimensions trained over RCV1 corpus.
  → Same model as CW
◮ HLBL: 246,122 words with 50 dimensions trained over RCV1 corpus.
  → Probabilistic and linear neural model

Experimental Setup
Supervised Evaluation Tasks: Named Entity Recognition (NER)
Reuters corpus:
◮ Training set → 203,621 words
◮ Test set → 46,435 words
◮ Number of tags = 9
Features:
◮ Word embeddings
◮ Capital letter feature

Experimental Setup
Supervised Evaluation Tasks: Movie Review
IMDB Review Dataset:
◮ Training set → 25,000 reviews
◮ Test set → 25,000 reviews
◮ Equal numbers of positive and negative reviews
Features:
◮ Word embeddings

Results

Named Entity Recognition
Results in F1 score
Other models
Brown 1000 clusters      88.5
Ando & Zhang (2005)      89.3
Suzuki & Isozaki (2008)  89.9
Lin & Wu (2009)          90.9
Our model *  Not tuned  Tuned
LR-MVL       86.8       87.4
CW           88.1       88.7
Turian       86.3       87.3
HLBL         83.9       85.9
H-PCA        87.9       89.2
E-PCA        84.3       87.1
Mainly syntactic → Slight increase with fine-tuning
* Only word embeddings + capital letter as features. No gazetteers. No previous predictions.

IMDB Movie Review
Results in classification accuracy
Other models
LDA                                  67.4
LSA                                  84.0
Maas et al. (2011)                   88.9
Wang & Manning (2012) with unigram   88.3
Wang & Manning (2012) with bigram    91.2
Brychcin & Habernal (2013)           92.2
Our model *  Not tuned  Tuned
LR-MVL       84.4       89.8
CW           87.6       89.9
Turian       84.4       89.7
HLBL         85.3       89.6
H-PCA        84.1       89.9
E-PCA        73.3       89.6
Clearly semantic → Fine-tuning does help
* Only word embeddings as features. No global context.

Computational Cost
         Cores    Completion time
LR-MVL   70 CPU   3 days
CW       1 CPU    2 months
Turian   1 CPU    few weeks
HLBL     GPGPU    7 days
H-PCA    1 CPU    3 hours
H-PCA    100 CPU  3 minutes

Fine-Tuning
10 nearest neighbors with and without fine-tuning
BORING                 BAD                    AWESOME
BEFORE      AFTER      BEFORE       AFTER     BEFORE      AFTER
SAD         CRAP       HORRIBLE     TERRIBLE  SPOOKY      TERRIFIC
SILLY       LAME       TERRIBLE     STUPID    AWFUL       TIMELESS
SUBLIME     MESS       DREADFUL     BORING    SILLY       FANTASTIC
FANCY       STUPID     UNFORTUNATE  DULL      SUMMERTIME  LOVELY
SOBER       DULL       AMAZING      CRAP      NASTY       FLAWLESS
TRASH       HORRIBLE   AWFUL        WRONG     MACABRE     MARVELOUS
LOUD        RUBBISH    MARVELOUS    TRASH     CRAZY       EERIE
RIDICULOUS  SHAME      WONDERFUL    SHAME     ROTTEN      LIVELY
RUDE        AWFUL      GOOD         KINDA     OUTRAGEOUS  FANTASY
MAGIC       ANNOYING   FANTASTIC    JOKE      SCARY       SURREAL
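A nearest-neighbor list like the one above can be produced with a short lookup over the embedding table. This is a sketch under assumptions: the slide does not name the metric, so Euclidean distance is assumed here, and the function name and the two-dimensional toy vectors are invented for illustration.

```python
import math

def nearest_neighbors(embeddings, query, k=10):
    """Return the k words whose vectors are closest to the query word's vector."""
    q = embeddings[query]

    def distance(v):
        # Euclidean distance (assumed; the slide does not state the metric).
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, v)))

    ranked = sorted((distance(v), word)
                    for word, v in embeddings.items() if word != query)
    return [word for _, word in ranked[:k]]

# Toy 2-dimensional embeddings, invented for illustration only.
toy = {
    "bad": [0.0, 0.0],
    "awful": [0.1, 0.0],
    "dull": [0.2, 0.1],
    "great": [1.0, 1.0],
}
neighbors = nearest_neighbors(toy, "bad", k=2)
```

Running the same lookup on the embeddings before and after fine-tuning gives the BEFORE and AFTER columns for a query word.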

Valuable feature


Conclusion
◮ Appealing word embeddings from Hellinger PCA of the word co-occurrence matrix.
  → Simply counting words over a large corpus.
◮ PCA of an N × 10,000 matrix → fast and not memory consuming.
  → Practical alternative to neural language models.
◮ H-PCA’s embeddings available online:
  → 50, 100 and 200 dimensions
  → Demo
  → http://www.lebret.ch/words
Thank you!

References I
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, ICML.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
Firth, J. R. (1957). A synopsis of linguistic theory 1930-55. 1952-59:1–32.

References II
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

NNLM Architecture
Figure: Neural language model ([Bengio et al., 2003])

Word-tagging Architecture
Figure: Sentence Approach ([Collobert et al., 2011])
