SLIDE 1 Sparse Word Embeddings Using ℓ1 Regularized Online Learning
Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng July 14, 2016
- fey.sunfei@gmail.com, {guojiafeng, lanyanyan,junxu, cxq}@ict.ac.cn
CAS Key Lab of Network Data Science and Technology Institute of Computing Technology, Chinese Academy of Sciences
SLIDE 2 Distributed Word Representation
Applications:
- POS Tagging [Collobert et al., 2011]
- Word-Sense Disambiguation [Collobert et al., 2011]
- Parsing [Socher et al., 2011]
- Language Modeling [Bengio et al., 2003]
- Machine Translation [Kalchbrenner and Blunsom, 2013]
- Sentiment Analysis [Maas et al., 2011]
Distributed word representations have become extremely popular in the NLP community.
SLIDE 3 Models
[Figure: architectures of representative embedding models: the neural probabilistic language model (NPLM), with a shared look-up table C, tanh hidden layer, and softmax output giving P(wt = i | context); the log-bilinear model (LBL); the C&W window model (lookup table, linear, HardTanh, linear layers); Huang's model, which adds a global semantic vector built as a weighted average over the document; and Word2Vec's CBOW (the summed context projections predict w(t)) and Skip-gram (w(t) predicts its context words).]
State-Of-The-Art: CBOW and SG.
SLIDES 4-8 Dense Representation and Interpretability
Example1:
man      [0.326172, . . . , 0.00524902, . . . , 0.0209961]
woman    [0.243164, . . . , −0.205078, . . . , −0.0294189]
dog      [0.0512695, . . . , −0.306641, . . . , 0.222656]
computer [0.107422, . . . , −0.0375977, . . . , −0.0620117]
- Which dimension represents the gender of man and woman?
- What sort of value indicates male or female?
- Gender dimension(s) would be active in all the word vectors, including irrelevant words like computer.
- Difficult to interpret and uneconomical to store.
1Vectors from GoogleNews-vectors-negative300.bin.
SLIDES 9-10 Sparse Word Representation
Non-Negative Sparse Embedding (NNSE) [Murphy et al., 2012]

arg min_{A ∈ R^{m×k}, D ∈ R^{k×n}} Σ_{i=1}^{m} ||X_i − A_i × D||² + λ ||A_i||_1
where A_{i,j} ≥ 0, ∀ 1 ≤ i ≤ m, ∀ 1 ≤ j ≤ k, and D_i D_i^⊤ ≤ 1, ∀ 1 ≤ i ≤ k

Sparse Coding (Word2Vec) [Faruqui et al., 2015]

[Figure: the initial dense V×L vectors X are transformed by sparse coding (X ≈ D A) or non-negative sparse coding into sparse overcomplete V×K vectors A, optionally followed by a projection into sparse, binary overcomplete vectors B.]

- NNSE introduces sparse and non-negative constraints into matrix factorization.
- Sparse coding converts dense vectors in a post-processing step.
- They are difficult to train on large-scale data due to heavy memory usage!
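The NNSE-style objective above can be sketched with alternating projected-gradient steps. This is an illustrative numpy toy: the step size, initialization, and the ISTA-style shrinkage are my choices, not the authors' solver (they used SPAMS).

```python
import numpy as np

def nnse_step(X, A, D, lam=0.1, lr=0.01):
    """One alternating projected-gradient step on
    ||X - A D||^2 + lam * ||A||_1  s.t.  A >= 0, row norms of D <= 1."""
    R = X - A @ D                       # residual
    A = A + lr * (R @ D.T)              # gradient step on A
    A = np.maximum(A - lr * lam, 0.0)   # l1 shrink + project onto A >= 0
    R = X - A @ D                       # refresh residual with the new A
    D = D + lr * (A.T @ R)              # gradient step on D
    norms = np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)
    D = D / norms                       # project rows of D onto the unit ball
    return A, D

rng = np.random.default_rng(0)
X = rng.random((20, 10))                # toy "word x context feature" matrix
A = rng.random((20, 5))                 # sparse codes, one row per word
D = rng.random((5, 10))                 # dictionary
err_start = np.linalg.norm(X - A @ D)
for _ in range(200):
    A, D = nnse_step(X, A, D)
err_end = np.linalg.norm(X - A @ D)
```

The two constraints are enforced by cheap projections after each gradient step, which is exactly what makes the method memory-hungry at scale: A and D are full dense matrices over the whole vocabulary.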
SLIDES 11-14 Our Motivation
Goals: fast, good performance, large scale, interpretable.
Word2Vec + sparsity → Sparse CBOW: directly apply the sparse constraint to CBOW.
SLIDE 15 CBOW
[Figure: CBOW architecture: context words c_{i−2}, c_{i−1}, c_{i+1}, c_{i+2} predict the target word w_i.]

L_cbow = Σ_{i=1}^{N} log p(w_i | h_i)

p(w_i | h_i) = exp(w_i · h_i) / Σ_{w ∈ W} exp(w · h_i)

h_i = (1/2l) Σ_{j=i−l, j≠i}^{i+l} c_j

With negative sampling:

L^ns_cbow = Σ_{i=1}^{N} [ log σ(w_i · h_i) + k · E_{w̃ ∼ P_W̃} log σ(−w̃ · h_i) ]
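The negative-sampling objective can be sketched for a single target position. The Monte Carlo sum over k sampled noise words stands in for the expectation, and all vectors here are random toy data:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_ns_loss(target_vec, context_vecs, noise_vecs):
    """Negative-sampling CBOW loss for one target position:
    -( log sigma(w_i . h_i) + sum_k log sigma(-w_tilde_k . h_i) ),
    where h_i is the average of the 2l context vectors and the sum over
    k sampled noise words is a Monte Carlo stand-in for the expectation."""
    h = context_vecs.mean(axis=0)                   # h_i = (1/2l) * sum of c_j
    pos = np.log(sigmoid(target_vec @ h))           # target word term
    neg = np.log(sigmoid(-(noise_vecs @ h))).sum()  # noise word terms
    return -(pos + neg)

rng = np.random.default_rng(0)
d = 8
loss = cbow_ns_loss(rng.normal(size=d),
                    rng.normal(size=(4, d)),   # 2l = 4 context words
                    rng.normal(size=(5, d)))   # k = 5 noise samples
```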
SLIDES 16-19 Sparse CBOW

L^ns_{s-cbow} = L^ns_cbow − λ Σ_{w ∈ W} ||w||_1

- Online learning with plain SGD failed to produce truly sparse vectors.
- Instead: Regularized Dual Averaging (RDA) [Xiao, 2009], which truncates using the online average subgradients.
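Why SGD "failed" while RDA works can be seen in a one-dimensional toy: a subgradient step on the ℓ1 penalty keeps oscillating around zero without ever landing on it, while the RDA closed form truncates to an exact zero. The step size and λ below are arbitrary illustrative values:

```python
import numpy as np

# Toy 1-D comparison: with a zero data gradient the weight should be driven
# to exactly zero by the l1 penalty. The SGD subgradient step keeps
# oscillating around zero; the RDA closed form truncates to an exact zero.
lr, lam = 0.1, 0.3
w_sgd = 1.0
gbar, t = 0.0, 0
for _ in range(100):
    g = 0.0                                     # data gradient (toy: always zero)
    w_sgd -= lr * (g + lam * np.sign(w_sgd))    # SGD subgradient step
    t += 1
    gbar = (t - 1) / t * gbar + g / t           # RDA's running average gradient
# RDA update: exact zero whenever |average gradient| <= lam
w_rda = 0.0 if abs(gbar) <= lam else -lr * t * (gbar - lam * np.sign(gbar))
```

After 100 steps the SGD weight is still a small nonzero number, while the RDA weight is exactly 0.0, which is what gives the stored model its sparsity.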
SLIDE 20 RDA algorithm for Sparse CBOW
1: procedure SparseCBOW(C)
2:   Initialize: w for all w ∈ W; c for all c ∈ C; ḡ⁰_w = 0 for all w ∈ W
3:   for i = 1, 2, 3, . . . do
4:     t ← update time of word w_i
5:     h_i = (1/2l) Σ_{j=i−l, j≠i}^{i+l} c_j
6:     g^t_{w_i} = (σ(w^t_i · h_i) − 1) h_i
7:     ḡ^t_{w_i} = ((t−1)/t) ḡ^{t−1}_{w_i} + (1/t) g^t_{w_i}   ⊲ keeping track of the online average subgradients
8:     update w_i element-wise according to
9:       w^{t+1}_{ij} = 0,                                              if |ḡ^t_{w_ij}| ≤ λ/#(w_i)
         w^{t+1}_{ij} = −ηt (ḡ^t_{w_ij} − (λ/#(w_i)) sgn(ḡ^t_{w_ij})), otherwise
         where j = 1, 2, . . . , d   ⊲ truncating
10:    for k = −l, . . . , −1, 1, . . . , l do
11:      update c_{i+k} according to
12:      c_{i+k} := c_{i+k} + (α/2l) (1 − σ(w^t_i · h_i)) w^t_i
13:    end for
14:  end for
15: end procedure
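Lines 6-9 of the algorithm can be sketched as a per-word update. Here `lam_w` stands for the frequency-scaled threshold λ/#(w_i); the η·t scaling follows the slide, and the shapes and values are illustrative:

```python
import numpy as np

def rda_update(gbar_prev, g, t, lam_w, eta):
    """One RDA step for a word vector, element-wise: average the
    subgradients, then zero out every coordinate whose average falls
    below the threshold lam_w, and shrink the rest."""
    gbar = (t - 1) / t * gbar_prev + g / t          # running average subgradient
    w_new = np.where(np.abs(gbar) <= lam_w,
                     0.0,
                     -eta * t * (gbar - lam_w * np.sign(gbar)))
    return w_new, gbar

# First update (t = 1) for a 2-d toy word vector: the tiny-gradient
# coordinate is truncated to exactly zero, the other is shrunk.
w_new, gbar = rda_update(np.zeros(2), np.array([0.01, 1.0]),
                         t=1, lam_w=0.1, eta=0.05)
```

Because the truncation compares the running average subgradient (not the weight itself) against the threshold, rarely useful coordinates land on exact zeros, which is what plain SGD with an ℓ1 penalty fails to achieve.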
SLIDE 21
Evaluation
SLIDE 22 Baselines & Tasks
Baselines
- Dense representation models
- GloVe [Pennington et al., 2014]
- CBOW and SG [Mikolov et al., 2013]
- Sparse representation models
- Sparse Coding (SC) [Faruqui et al., 2015]
- Positive Pointwise Mutual Information
(PPMI) [Bullinaria and Levy, 2007]
- NNSE [Murphy et al., 2012]
Tasks
- Interpretability
- Word Intrusion
- Expressive Power
- Word Analogy
- Word Similarity
SLIDE 23 Experimental Settings
Corpus: Wikipedia 2010 (1B words)

Parameter settings:
window = 10, negative = 10, iterations = 20, λ = grid search, learning rate = 0.05, noise distribution ∝ #(w)^0.75

Baseline settings:
- GloVe, CBOW, SG: same settings as the released tools
- SC: embeddings of CBOW as input
- NNSE: PPMI matrix, 40,000 words, SPAMS1
1http://spams-devel.gforge.inria.fr
SLIDE 24
Interpretability
SLIDE 25 Word Intrusion [Chang et al., 2009]
[Figure: the vocabulary is sorted in descending order on dimension i (e.g., poisson 0.47, markov 0.27, bayesian 0.23, . . .); the top-5 words (poisson, parametric, markov, bayesian, stochastic) plus the intruder (jodel) form the candidate set from which the intruder must be picked out.]
1. Sort dimension i in descending order.
2. Build a set: {top 5 words on dimension i} ∪ {1 intruder word from the bottom 50% of dimension i that is in the top 10% of another dimension j, j ≠ i}.
3. Ask annotators to pick out the intruder word.

The more interpretable a dimension is, the easier the intruder word is to pick out.
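Steps 1-2 can be sketched directly. The embedding matrix, dimension indices, and vocabulary size below are toy values:

```python
import numpy as np

def intrusion_set(E, i, j, k=5, rng=None):
    """Build one word-intrusion instance for dimension i of embedding
    matrix E (one row per word): the top-k words on dimension i, plus an
    intruder from the bottom 50% of dimension i that is also in the
    top 10% of another dimension j."""
    rng = rng or np.random.default_rng()
    order_i = np.argsort(-E[:, i])                 # words sorted descending on dim i
    top_k = order_i[:k]
    n = E.shape[0]
    bottom_half = set(order_i[n // 2:])            # bottom 50% on dim i
    top_j = set(np.argsort(-E[:, j])[:max(1, n // 10)])  # top 10% on dim j
    candidates = list((bottom_half & top_j) - set(top_k))
    intruder = rng.choice(candidates) if candidates else None
    return top_k, intruder

# Toy embedding: dim 0 increases with word id, dim 1 decreases, so low-id
# words are bottom on dim 0 but top on dim 1 -- valid intruders.
E = np.stack([np.arange(20.0), np.arange(20.0)[::-1]], axis=1)
top_k, intruder = intrusion_set(E, i=0, j=1, rng=np.random.default_rng(0))
```

Requiring the intruder to be a top word on some other dimension ensures it is a "real" word with strong features elsewhere, not just an all-around rare word.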
SLIDE 26 Evaluation Metric
Traditional metric: human assessment, which is subjective and costly.

Definition:
IntraDist_i = Σ_{w_j ∈ top_k(i)} Σ_{w_k ∈ top_k(i), w_k ≠ w_j} dist(w_j, w_k) / (k(k − 1))
InterDist_i = Σ_{w_j ∈ top_k(i)} dist(w_j, w_{b_i}) / k
DistRatio = (1/d) Σ_{i=1}^{d} InterDist_i / IntraDist_i

IntraDist_i: average distance between the top k words of dimension i.
InterDist_i: average distance between the top k words and the intruder word w_{b_i}.
Intuition: the intruder word should be dissimilar to the top words, while the top words should be similar to each other.

In this work, we use Euclidean distance as dist(·, ·).
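The three quantities can be computed directly. This sketch takes the top-k indices and the intruder index as given and uses Euclidean distance as on the slide:

```python
import numpy as np

def dist_ratio_dim(E, top_idx, intruder_idx):
    """InterDist_i / IntraDist_i for one dimension, Euclidean distance."""
    top = E[top_idx]
    k = len(top_idx)
    # IntraDist: average pairwise distance among the top-k words
    intra = sum(np.linalg.norm(top[a] - top[b])
                for a in range(k) for b in range(k) if a != b) / (k * (k - 1))
    # InterDist: average distance from each top word to the intruder
    inter = np.linalg.norm(top - E[intruder_idx], axis=1).mean()
    return inter / intra

# Five mutually close "top" words and one far-away intruder: the ratio
# comes out well above 1, i.e. the dimension looks interpretable.
E = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.05, 0.05], [10.0, 10.0]])
ratio = dist_ratio_dim(E, top_idx=[0, 1, 2, 3, 4], intruder_idx=5)
```

DistRatio for the whole embedding is just the mean of this per-dimension ratio over all d dimensions.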
SLIDE 27 Word Intrusion Results
Table 1: 300 dimension, running 10 times.
Model        Sparsity  DistRatio
GloVe        0%        1.07
CBOW         0%        1.09
SG           0%        1.12
NNSE (PPMI)  89.15%    1.55
SC (CBOW)    88.34%    1.24
Sparse CBOW  90.06%    1.39
- Sparse representations beat dense ones on interpretability.
- Sparse CBOW vs. SC (CBOW): the separate sparse coding step loses information.
- A non-negative constraint might also be a good choice.
SLIDE 28 Case Study
Model        Top 5 words per dimension                            Meaning
CBOW         beat, finish, wedding, prize, read                   (no clear meaning)
CBOW         rainfall, footballer, breakfast, weekdays, angeles   (no clear meaning)
CBOW         landfall, interview, asked, apology, dinner          (no clear meaning)
CBOW         becomes, died, feels, resigned, strained             (no clear meaning)
CBOW         best, safest, iucn, capita, tallest                  (no clear meaning)
Sparse CBOW  poisson, parametric, markov, bayesian, stochastic    statistical learning
Sparse CBOW  ntfs, gzip, myfile, filenames, subdirectories        file system
Sparse CBOW  hugely, enormously, immensely, wildly, tremendously  adverb of degree
Sparse CBOW  earthquake, quake, uprooted, levees, spectacularly   disasters
Sparse CBOW  bosons, accretion, higgs, neutrinos, quarks          particles

The dimensions of Sparse CBOW reveal clear and consistent semantic meanings.
SLIDE 29
Expressive Power
SLIDE 30 Expressive Power
Word Analogy
- Testset: Google [Mikolov et al., 2013]
- Examples: Beijing : China ∼ Paris : ?; big : bigger ∼ deep : ?
- Solution: 3CosMul [Levy and Goldberg, 2014]
- Metric: percentage correct

Word Similarity
- Testsets: Rare Word (RW) [Luong et al., 2013]; WordSim-353 (WS-353) [Finkelstein et al., 2002]; SimLex-999 (SL-999) [Hill et al., 2015]
- Example: (tiger, cat, 7.35)
- Solution: cosine similarity
- Metric: Spearman rank correlation
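A 3CosMul sketch, with the common (1 + cos)/2 shift to keep similarities non-negative (an implementation detail not stated on the slide). The toy vocabulary encodes a man : woman ∼ king : queen analogy:

```python
import numpy as np

def three_cos_mul(E, a, b, c, eps=1e-3):
    """3CosMul analogy solver for a : b ~ c : ? --
    argmax_d  s(d, b) * s(d, c) / (s(d, a) + eps),
    where s is cosine similarity shifted to [0, 1] via (1 + cos) / 2."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    s = lambda idx: (1.0 + En @ En[idx]) / 2.0   # shifted cosine to every word
    score = s(b) * s(c) / (s(a) + eps)
    score[[a, b, c]] = -np.inf                   # exclude the query words
    return int(np.argmax(score))

# Toy vocabulary on (base, gender, royal) axes.
E = np.array([[1.0, 0.0, 0.0],    # 0: man
              [1.0, 1.0, 0.0],    # 1: woman
              [1.0, 0.0, 1.0],    # 2: king
              [1.0, 1.0, 1.0],    # 3: queen
              [0.5, 0.2, 0.1]])   # 4: unrelated filler
answer = three_cos_mul(E, a=0, b=1, c=2)   # man : woman ~ king : ?
```

The multiplicative combination lets strong agreement on one similarity compensate less for weak agreement on another than the additive 3CosAdd form does.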
SLIDE 31 Results
Table 2: Precision (%) for analogy and spearman correlation for similarity.
Model         Dim     Sparsity  Sem    Syn    Total  WS-353  SL-999  RW
GloVe         300     0%        79.31  61.48  69.57  59.18   32.35   34.13
CBOW          300     0%        79.38  68.80  73.60  67.21   38.82   45.19
SG            300     0%        77.79  67.32  72.09  70.74   36.07   45.55
PPMI (W-C)    40000   86.55%    74.02  38.99  53.02  62.35   24.10   30.45
PPMI (W-C)    388723  99.61%    58.55  31.19  43.60  58.99   23.01   27.98
NNSE (PPMI)2  300     89.15%    29.89  27.68  28.56  68.61   27.60   41.82
SC (CBOW)     300     88.34%    28.99  28.43  28.68  59.85   30.44   38.75
SC (CBOW)     3000    95.85%    74.71  61.24  67.35  68.22   39.12   44.75
Sparse CBOW   300     90.06%    73.24  67.48  70.10  68.29   44.47   42.30
- Sparse CBOW is a competitive model that uses much less memory.
- It achieves similar analogy performance when its sparsity level is reduced below 85%.
2The input matrix of NNSE is the 40,000-dimensional representation of PPMI from the fourth row.
SLIDE 32 Summary
- A sparse word representation model.
- A new evaluation metric for word intrusion task.
- Improvement in word vector interpretability.
- Similar performance with less memory usage.
SLIDE 33 Thanks! Q & A
SLIDE 34
References I
Bullinaria, J. A. and Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., and Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In NIPS, pages 288–296. Curran Associates, Inc.
Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015). Sparse overcomplete word vector representations. In Proceedings of ACL, pages 1491–1500.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Trans. Inf. Syst., 20(1):116–131.
SLIDE 35
References II
Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, pages 665–695.
Levy, O. and Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL, pages 171–180.
Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL, pages 104–113.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of ICLR.
Murphy, B., Talukdar, P., and Mitchell, T. (2012). Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING, pages 1933–1950.
SLIDE 36 References III
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.
Xiao, L. (2009). Dual averaging method for regularized stochastic learning and online optimization. In Proceedings of NIPS, pages 2116–2124.