
SLIDE 1

Encoding Prior Knowledge with Eigenword Embeddings

Dominique Osborne¹, Shashi Narayan² & Shay Cohen²

¹Department of Mathematics and Statistics, University of Strathclyde
²School of Informatics, University of Edinburgh

EACL 2017


SLIDE 2

Word embeddings ...

cat  (0.1, 0.2, 0, 0.2, 0.03, ...)
dog  (0.2, 0.02, 0.1, 0.1, 0.02, ...)
car  (0.001, 0, 0, 0.1, 0.3, ...)

SLIDE 3

Learning dense representations

Matrix factorization (of a word × context co-occurrence matrix):

◮ LSA (word-document) (Deerwester et al., 1990)
◮ GloVe (word-neighbour words) (Pennington et al., 2014)
◮ CCA-based eigenwords (word-neighbour words) (Dhillon et al., 2015)

Neural networks:

◮ NLM (word-neighbour words) (Bengio et al., 2003)
◮ Word2Vec (Mikolov et al., 2013)

All build on the distributional hypothesis (Harris, 1954).

SLIDE 4

Adding knowledge to word embeddings

◮ Refining vector space representations using semantic lexicons such as WordNet, FrameNet, and the Paraphrase Database, to encourage linked words to have similar vector representations.

◮ Often operates as a post-processing step, e.g., Retrofitting (Faruqui et al., 2015) and AutoExtend (Rothe and Schütze, 2015).

SLIDE 5

In this talk ...

Encode semantic knowledge into CCA-based eigenword embeddings

◮ Spectral learning algorithms are attractive for their speed, scalability, globally optimal solutions, and performance in various NLP applications.

SLIDE 6

In this talk ...

Encode semantic knowledge into CCA-based eigenword embeddings

◮ Spectral learning algorithms are attractive for their speed, scalability, globally optimal solutions, and performance in various NLP applications.

◮ We introduce prior knowledge in the CCA derivation itself.

◮ This preserves the properties of spectral learning algorithms for learning word embeddings.

◮ Applicable for incorporating prior knowledge into any CCA.

SLIDE 7

CCA-based Eigenword embeddings (Dhillon et al., 2015)

Training set: {(w_1^(i), ..., w_k^(i), w^(i), w_{k+1}^(i), ..., w_{2k}^(i)) | i ∈ [n]}

◮ Pivot word: w^(i)
◮ Left context: {w_1^(i), ..., w_k^(i)}
◮ Right context: {w_{k+1}^(i), ..., w_{2k}^(i)}

CCA finds projections of the contexts and of the pivot words that are maximally correlated (following the distributional hypothesis of Harris, 1954).

SLIDE 8

Defining two views for CCA

Training set: {(w_1^(i), ..., w_k^(i), w^(i), w_{k+1}^(i), ..., w_{2k}^(i)) | i ∈ [n]}

Word matrix W ∈ R^{n×|H|}: row i is the one-hot encoding of the pivot word over the vocabulary H, i.e. W_ij = 1 iff w^(i) = h_j, else 0.

Context matrix C ∈ R^{n×2k|H|}: row i concatenates the 2k one-hot encodings of the context words w_1^(i), ..., w_{2k}^(i).
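To make the construction concrete, here is a minimal sketch (not the authors' code) of building the two views from a toy corpus; the corpus, k, and all variable names are illustrative:

```python
# Build the one-hot word view W and context view C from a toy corpus.
from scipy.sparse import lil_matrix

corpus = "the cat sat on the mat".split()
k = 1                                    # context window size (toy choice)
vocab = sorted(set(corpus))              # the vocabulary H
idx = {h: j for j, h in enumerate(vocab)}
n = len(corpus) - 2 * k                  # examples with a full window
H = len(vocab)

W = lil_matrix((n, H))                   # word view:    n x |H|
C = lil_matrix((n, 2 * k * H))           # context view: n x 2k|H|
for row, i in enumerate(range(k, len(corpus) - k)):
    W[row, idx[corpus[i]]] = 1           # one-hot pivot word w^(i)
    window = corpus[i - k:i] + corpus[i + 1:i + k + 1]
    for slot, w in enumerate(window):    # one one-hot block per context slot
        C[row, slot * H + idx[w]] = 1
```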

SLIDE 9

Dimensionality reduction with SVD

With D_1 = diag(W⊤W) and D_2 = diag(C⊤C), take the rank-m SVD of the rescaled cross-covariance:

M = D_1^{-1/2} W⊤ C D_2^{-1/2} ≈ U Σ V⊤

(Writing X = W D_1^{-1/2} and Y = C D_2^{-1/2}, this is M = X⊤Y.)

Eigenword embedding: E = D_1^{-1/2} U ∈ R^{|H|×m}
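A numpy sketch of this step, reusing the toy W and C above and reading D_1 = diag(W⊤W), D_2 = diag(C⊤C); a dense SVD suffices for the toy case, whereas a real corpus would need a truncated sparse SVD:

```python
import numpy as np

Wd, Cd = W.toarray(), C.toarray()           # toy-sized; real data stays sparse
d1 = (Wd ** 2).sum(axis=0)                  # diag(W^T W): column counts for 0/1 W
d2 = (Cd ** 2).sum(axis=0)                  # diag(C^T C)
s1 = 1.0 / np.sqrt(np.maximum(d1, 1))       # entries of D1^{-1/2} (guard zeros)
s2 = 1.0 / np.sqrt(np.maximum(d2, 1))

M = s1[:, None] * (Wd.T @ Cd) * s2[None, :] # D1^{-1/2} W^T C D2^{-1/2}
U, Sig, Vt = np.linalg.svd(M, full_matrices=False)
m = 2                                       # embedding dimension (toy choice)
E = s1[:, None] * U[:, :m]                  # E = D1^{-1/2} U, shape |H| x m
```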

SLIDE 10

Adding prior knowledge to Eigenword embeddings

Introduce prior knowledge in the CCA derivation itself, so as to preserve the properties of spectral learning algorithms.

Prior knowledge ⇐ WordNet, FrameNet, and the Paraphrase Database

SLIDE 11

Adding prior knowledge to Eigenword embeddings

A weight matrix L ∈ R^{n×n}, encoding the prior knowledge, is inserted between the two views:

M = X⊤LY = D_1^{-1/2} W⊤ L C D_2^{-1/2} ≈ U Σ V⊤

Improve the optimization of correlation between the two views by weighting them using the external source of prior knowledge.
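A sketch of the same pipeline with L inserted between the views; here L is a placeholder complete-graph Laplacian (defined on Slide 15), and the diagonal rescalings s1, s2 from the earlier sketch are reused for simplicity, which may differ from the paper's exact normalization:

```python
# Weighted cross-covariance: D1^{-1/2} W^T L C D2^{-1/2}.
L_prior = n * np.eye(n) - np.ones((n, n))   # placeholder Laplacian (Slide 15)
Mp = s1[:, None] * (Wd.T @ L_prior @ Cd) * s2[None, :]
Up, Sp, Vtp = np.linalg.svd(Mp, full_matrices=False)
E_prior = s1[:, None] * Up[:, :m]           # prior-informed embeddings
```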

SLIDE 12

Two views for CCA

Training set: {(w_1^(i), ..., w_k^(i), w^(i), w_{k+1}^(i), ..., w_{2k}^(i)) | i ∈ [n]}

Word matrix W ∈ R^{n×|H|} and context matrix C ∈ R^{n×2k|H|}, with one-hot rows as defined on Slide 8.

SLIDE 13

Prior knowledge as the weight matrix

Training set: {(w_1^(i), ..., w_k^(i), w^(i), w_{k+1}^(i), ..., w_{2k}^(i)) | i ∈ [n]}

Weight matrix over examples: L ∈ R^{n×n}

L captures adjacency information from semantic lexicons such as WordNet, FrameNet, and the Paraphrase Database.
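One hypothetical way such an L could be assembled (illustrative, not the paper's exact construction): link two training examples whenever the lexicon relates their pivot words, then take the graph Laplacian, degree matrix minus adjacency:

```python
import numpy as np

def laplacian_from_lexicon(pivot_words, linked):
    """pivot_words: pivot word of each of the n examples;
    linked(a, b): True if the lexicon relates words a and b."""
    n = len(pivot_words)
    A = np.zeros((n, n))                   # adjacency over examples
    for i in range(n):
        for j in range(i + 1, n):
            if linked(pivot_words[i], pivot_words[j]):
                A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A      # symmetric, PSD, rows sum to 0
```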

SLIDE 14

Adding prior knowledge to Eigenword embeddings

M = X⊤LY = D_1^{-1/2} W⊤ L C D_2^{-1/2} ≈ U Σ V⊤

Do we still find projections of the contexts and of the pivot words that are maximally correlated?

SLIDE 15

Generalisation of CCA

Yes, if L is a Laplacian matrix!

Laplacian matrix L ∈ R^{n×n}: a symmetric positive semi-definite matrix whose rows (and columns) each sum to 0. For example, the Laplacian of the complete graph:

L_ij = n − 1 if i = j, and L_ij = −1 if i ≠ j.

Lemma: for this L, X⊤LY equals X⊤Y up to multiplication by a positive constant. It optimizes the same objective function!
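A quick numeric check of the lemma for the complete-graph Laplacian L = nI − 11⊤, assuming mean-centered views (the centering is my assumption here; CCA views are conventionally centered):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 4)); X -= X.mean(axis=0)  # centered view 1
Y = rng.normal(size=(n, 3)); Y -= Y.mean(axis=0)  # centered view 2
L = n * np.eye(n) - np.ones((n, n))               # complete-graph Laplacian

# X^T L Y = n * X^T Y: the same matrix up to the positive constant n.
assert np.allclose(X.T @ L @ Y, n * (X.T @ Y))
```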

SLIDE 16

Generalisation of CCA

max Σ_{k=1}^m (Xu_k)⊤ L (Yv_k)  =  max Σ_{i,j} −L_ij (d_ij^m)²  =  max ( Σ_{i,j} (d_ij^m)² − n Σ_{i=1}^n (d_ii^m)² )

where d_ij^m is the distance between the m-dimensional projections of the i-th word view and the j-th context view, and the last equality uses the complete-graph Laplacian above.

CCA follows the distributional hypothesis, with additional constraints from prior knowledge.
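A numeric check of the first rewriting: for any L whose rows and columns sum to zero, Σ_k (Xu_k)⊤ L (Yv_k) = ½ Σ_{i,j} −L_ij (d_ij^m)² over the projected rows; the slide's form drops the constant ½, which does not affect the argmax:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 5
Xp = rng.normal(size=(n, m))          # rows x_i: projections (Xu_1, ..., Xu_m)_i
Yp = rng.normal(size=(n, m))          # rows y_j: projected context views
L = n * np.eye(n) - np.ones((n, n))   # complete-graph Laplacian

lhs = np.trace(Xp.T @ L @ Yp)         # sum_k (Xu_k)^T L (Yv_k)
sq = ((Xp[:, None, :] - Yp[None, :, :]) ** 2).sum(axis=-1)  # (d_ij^m)^2
rhs = 0.5 * (-L * sq).sum()           # (1/2) sum_ij -L_ij (d_ij^m)^2
assert np.allclose(lhs, rhs)
```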

SLIDE 17

Experiments

◮ Evaluation benchmarks:

  ◮ Word similarity: 11 widely used benchmarks, e.g., the WS-353-ALL dataset (Finkelstein et al., 2002) and the SimLex-999 dataset (Hill et al., 2015)

  ◮ Geographic analogies: "Greece (a) is to Athens (b) as Iraq (c) is to (d)" (Mikolov et al., 2013), answered with d = c − (a − b); see the sketch after this list

  ◮ NP bracketing: "annual (price growth)" vs "(annual price) growth" (Lazaridou et al., 2013)
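A small sketch of how the analogy test is typically scored: form d = c − (a − b) and return the nearest vocabulary word under cosine similarity, excluding the three query words (E, idx, and vocab are assumed from the earlier sketches; the example query is illustrative):

```python
import numpy as np

def analogy(E, idx, vocab, a, b, c):
    target = E[idx[c]] - (E[idx[a]] - E[idx[b]])             # d = c - (a - b)
    sims = (E @ target) / (np.linalg.norm(E, axis=1)
                           * np.linalg.norm(target) + 1e-12) # cosine similarity
    for j in np.argsort(-sims):                              # best match first
        if vocab[j] not in (a, b, c):                        # skip query words
            return vocab[j]

# analogy(E, idx, vocab, "greece", "athens", "iraq")  # ideally "baghdad"
```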

SLIDE 18

Experiments

◮ Prior knowledge resources: WordNet, the Paraphrase Database (PPDB), and FrameNet.

◮ Baselines:

  ◮ Off-the-shelf word embeddings: GloVe (Pennington et al., 2014), Skip-Gram (Mikolov et al., 2013), Global Context (Huang et al., 2012), Multilingual (Faruqui and Dyer, 2014), and eigenword embeddings (Dhillon et al., 2015)

  ◮ Retrofitting (Faruqui et al., 2015)

All embeddings were trained on the first 5 billion words from Wikipedia.

SLIDE 19

Results

NPK: no prior knowledge; WN: WordNet; PD: the Paraphrase Database; FN: FrameNet. For the GloVe through Eigen (CCA) rows, the WN/PD/FN columns use retrofitting; "–" marks columns that do not apply.

                 Word similarity average  | Geographic analogies    | NP bracketing
                 NPK   WN    PD    FN     | NPK   WN    PD    FN    | NPK   WN    PD    FN
GloVe            59.7  63.1  64.6  57.5   | 94.8  75.3  80.4  94.8  | 78.1  79.5  79.4  78.7
Skip-Gram        64.1  65.5  68.6  62.3   | 87.3  72.3  70.5  87.7  | 79.9  80.4  81.5  80.5
Global Context   44.4  50.0  50.4  47.3   |  7.3   4.5  18.2   7.3  | 79.4  79.1  80.5  80.2
Multilingual     62.3  66.9  68.2  62.8   | 70.7  46.2  53.7  72.7  | 81.9  81.8  82.7  82.0
Eigen (CCA)      59.5  62.2  63.6  61.4   | 89.9  79.2  73.5  89.9  | 81.3  81.7  81.2  80.7
CCAPrior          –    60.7  60.6  60.0   |  –    89.1  93.2  92.9  |  –    81.8  82.4  81.0
CCAPrior+RF       –    63.4  64.9  61.6   |  –    78.0  71.9  92.5  |  –    81.9  81.7  81.2

SLIDE 20

Results

(Same table as on Slide 19.)

Adding prior knowledge to eigenword embeddings does improve the quality of word vectors.

SLIDE 21

Results

(Same table as on Slide 19.)

Retrofitting further improves eigenword embeddings (the CCAPrior+RF row).

SLIDE 22

Results

(Same table as on Slide 19.)

Encoding prior knowledge via CCA gives more stable results than retrofitting.

SLIDE 23

Conclusion

◮ We described a method for incorporating prior knowledge into CCA-based eigenword embeddings.

◮ Adding prior knowledge to eigenword embeddings improves the quality of word vectors.

◮ We proposed a general framework for incorporating prior knowledge into any CCA.