 
Sparse Word Embeddings Using ℓ1 Regularized Online Learning
Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng
July 14, 2016
ofey.sunfei@gmail.com, {guojiafeng, lanyanyan, junxu, cxq}@ict.ac.cn
CAS Key Lab of Network Data Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
Distributed Word Representation
• Sentiment Analysis [Maas et al., 2011]
• Machine Translation [Kalchbrenner and Blunsom, 2013]
• Language Modeling [Bengio et al., 2003]
• POS Tagging [Collobert et al., 2011]
• Word-Sense Disambiguation [Collobert et al., 2011]
• Parsing [Socher et al., 2011]
Distributed word representation is a hot topic in the NLP community.
Models
[Figure: architectures of NPLM, LBL, C&W, Huang, and Word2Vec (CBOW and Skip-gram)]
State of the art: CBOW and SG.
Dense Representation and Interpretability
Example (vectors from GoogleNews-vectors-negative300.bin):
man      [0.326172, ..., 0.00524902, ..., 0.0209961]
woman    [0.243164, ..., −0.205078, ..., −0.0294189]
dog      [0.0512695, ..., −0.306641, ..., 0.222656]
computer [0.107422, ..., −0.0375977, ..., −0.0620117]
• Which dimension represents the gender of man and woman?
• What sort of value indicates male or female?
• Gender dimension(s) would be active in all the word vectors, including irrelevant words like computer.
• Dense vectors are difficult to interpret and uneconomical to store.
Sparse Word Representation
Non-Negative Sparse Embedding (NNSE) [Murphy et al., 2012]
$$\arg\min_{A \in \mathbb{R}^{m \times k},\, D \in \mathbb{R}^{k \times n}} \sum_{i=1}^{m} \|X_i - A_i \times D\|^2 + \lambda \|A_i\|_1$$
where $A_{i,j} \geq 0,\ \forall\, 1 \leq i \leq m,\ \forall\, 1 \leq j \leq k$, and $D_i D_i^{\top} \leq 1,\ \forall\, 1 \leq i \leq k$.
Introduces sparse and non-negative constraints into matrix factorization (MF).
Sparse Coding (Word2Vec) [Faruqui et al., 2015]
[Figure: initial dense vectors X are turned into sparse overcomplete vectors A by sparse coding, or into sparse, binary overcomplete vectors B by non-negative sparse coding followed by projection]
Converts dense vectors using sparse coding as a post-processing step.
Both are difficult to train on large-scale data because of heavy memory usage!
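To make the NNSE objective above concrete, here is a minimal pure-Python sketch that evaluates it and checks the two constraints. The function names `nnse_objective` and `is_feasible` are hypothetical, matrices are assumed to be plain lists of rows, and this only scores a given (A, D) pair; it is not the solver used in the cited work.

```python
def nnse_objective(X, A, D, lam):
    """Evaluate sum_i ||X_i - A_i D||^2 + lam * ||A_i||_1.

    X is m x n, A is m x k, D is k x n, all as lists of rows.
    """
    total = 0.0
    for x_row, a_row in zip(X, A):
        # Reconstruction of row i: A_i (1 x k) times D (k x n).
        recon = [sum(a * d_row[col] for a, d_row in zip(a_row, D))
                 for col in range(len(x_row))]
        total += sum((x - r) ** 2 for x, r in zip(x_row, recon))
        total += lam * sum(abs(a) for a in a_row)
    return total


def is_feasible(A, D, tol=1e-12):
    """Check A_ij >= 0 and D_i D_i^T <= 1 for every dictionary row D_i."""
    nonneg = all(a >= 0 for row in A for a in row)
    unit_ball = all(sum(d * d for d in row) <= 1 + tol for row in D)
    return nonneg and unit_ball
```

Note that A and D are both held in memory for the whole vocabulary, which matches the slide's remark about heavy memory usage on large-scale data.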
Our Motivation
• Large Scale
• Interpretable
• Good Performance
• Fast
Sparse Word2Vec
Sparse CBOW: directly apply the sparse constraint to CBOW
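CBOW
[Figure: CBOW architecture; the context words c_{i−2}, c_{i−1}, c_{i+1}, c_{i+2} are averaged to predict w_i]
$$L_{cbow} = \sum_{i=1}^{N} \log p(w_i \mid h_i)$$
$$p(w_i \mid h_i) = \frac{\exp(\vec{w}_i \cdot \vec{h}_i)}{\sum_{w \in W} \exp(\vec{w} \cdot \vec{h}_i)}$$
$$\vec{h}_i = \frac{1}{2l} \sum_{\substack{j = i-l \\ j \neq i}}^{i+l} \vec{c}_j$$
CBOW with negative sampling:
$$L^{ns}_{cbow} = \sum_{i=1}^{N} \Big( \log \sigma(\vec{w}_i \cdot \vec{h}_i) + k \cdot \mathbb{E}_{\tilde{w} \sim P_{\tilde{W}}} \log \sigma(-\vec{\tilde{w}} \cdot \vec{h}_i) \Big)$$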
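The negative-sampling objective above can be sketched for a single position i as follows. This is an illustrative pure-Python version under stated assumptions: the helper names are hypothetical, vectors are plain lists of floats, and the expectation term k · E is approximated by summing over k already-sampled negative word vectors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def context_vector(context_vecs):
    """h_i: the average of the 2l context vectors around position i."""
    n = len(context_vecs)
    return [sum(col) / n for col in zip(*context_vecs)]


def ns_term(target_vec, negative_vecs, h):
    """log sigma(w_i . h_i) plus the sum of log sigma(-w~ . h_i) over the sampled negatives."""
    value = math.log(sigmoid(dot(target_vec, h)))
    for neg in negative_vecs:
        value += math.log(sigmoid(-dot(neg, h)))
    return value
```

Summing `ns_term` over all positions i = 1..N gives a sampled estimate of the objective.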
Sparse CBOW
$$L^{ns}_{s\text{-}cbow} = L^{ns}_{cbow} - \lambda \sum_{w \in W} \|\vec{w}\|_1$$
• Online, SGD: failed
• Regularized Dual Averaging (RDA) [Xiao, 2009]: truncating using online average subgradients
RDA algorithm for Sparse CBOW
 1: procedure SparseCBOW(C)
 2:   Initialize: $\vec{w},\ \forall w \in W$; $\vec{c},\ \forall c \in C$; $\bar{g}^{0}_{\vec{w}} = \vec{0},\ \forall w \in W$
 3:   for i = 1, 2, 3, ... do
 4:     t ← update time of word $w_i$
 5:     $\vec{h}_i = \frac{1}{2l} \sum_{j = i-l,\ j \neq i}^{i+l} \vec{c}_j$
 6:     $g^{t}_{\vec{w}_i} = \big( 1_{h_i}(w_i) - \sigma(\vec{w}^{\,t}_i \cdot \vec{h}_i) \big)\, \vec{h}_i$
 7:     $\bar{g}^{t}_{\vec{w}_i} = \frac{t-1}{t}\, \bar{g}^{t-1}_{\vec{w}_i} + \frac{1}{t}\, g^{t}_{\vec{w}_i}$    ▷ Keeping track of the online average subgradients
 8:     Update $\vec{w}_i$ element-wise according to
 9:     $\vec{w}^{\,t+1}_{ij} = 0$ if $|\bar{g}^{t}_{\vec{w}_{ij}}| \leq \frac{\lambda}{\#(w_i)}$; otherwise $\vec{w}^{\,t+1}_{ij} = \eta t \big( \bar{g}^{t}_{\vec{w}_{ij}} - \frac{\lambda}{\#(w_i)}\, \mathrm{sgn}(\bar{g}^{t}_{\vec{w}_{ij}}) \big)$, where j = 1, 2, ..., d    ▷ Truncating
10:     for k = −l, ..., −1, 1, ..., l do
11:       update $\vec{c}_{i+k}$ according to
12:       $\vec{c}_{i+k} := \vec{c}_{i+k} + \frac{\alpha}{2l} \big( 1_{h_i}(w_i) - \sigma(\vec{w}^{\,t}_i \cdot \vec{h}_i) \big)\, \vec{w}^{\,t}_i$
13:     end for
14:   end for
15: end procedure
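Steps 6 to 9 of the procedure above (word gradient, running average, truncation) can be sketched in pure Python as below. The function names are hypothetical, and `label` is an assumed reading of the indicator 1_{h_i}(w_i): 1 for the observed word and 0 for a negative sample. The sketch covers a single word vector only and leaves out the context-vector update.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_gradient(w, h, label):
    """Step 6: g = (label - sigma(w . h)) * h."""
    coeff = label - sigmoid(sum(a * b for a, b in zip(w, h)))
    return [coeff * x for x in h]


def rda_step(g_bar, grad, t, lam, count, eta):
    """Steps 7 to 9 for one word vector; returns (new_g_bar, new_w).

    t is the update time of the word (starting at 1) and count is #(w).
    """
    # Step 7: running average of the subgradients seen so far.
    new_g_bar = [((t - 1) * gb + g) / t for gb, g in zip(g_bar, grad)]
    threshold = lam / count
    new_w = []
    for gb in new_g_bar:
        if abs(gb) <= threshold:
            # Step 9, first case: the coordinate is truncated to exactly zero.
            new_w.append(0.0)
        else:
            # Step 9, second case: shrink the average toward zero, then scale by eta * t.
            new_w.append(eta * t * (gb - threshold * math.copysign(1.0, gb)))
    return new_g_bar, new_w
```

Because the truncation compares the averaged subgradient with a fixed threshold, coordinates whose average stays small are set to exactly zero, which is where the sparsity comes from.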
Evaluation
Baselines & Tasks
Baseline
• Dense representation models
  • GloVe [Pennington et al., 2014]
  • CBOW and SG [Mikolov et al., 2013]
• Sparse representation models
  • Sparse Coding (SC) [Faruqui et al., 2015]
  • Positive Pointwise Mutual Information (PPMI) [Bullinaria and Levy, 2007]
  • NNSE [Murphy et al., 2012]
Tasks
• Interpretability
  • Word Intrusion
• Expressive Power
  • Word Analogy
  • Word Similarity
Experimental Settings
Corpus: Wikipedia 2010 (1B words)
Parameter settings:
  window              10
  negative            10
  iteration           20
  λ                   grid search
  learning rate       0.05
  noise distribution  ∝ #(w)^0.75
Baselines:
  Model               Setting
  GloVe, CBOW, SG     same settings, with the released tools
  SC                  embeddings of CBOW as input
  NNSE                PPMI matrix, 40,000 words, SPAMS (http://spams-devel.gforge.inria.fr)
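The noise distribution ∝ #(w)^0.75 listed above can be illustrated with a short sketch. The function names and the toy counts are hypothetical; the code only shows how raw word counts are turned into sampling probabilities and how negatives would be drawn from them.

```python
import random


def noise_distribution(counts, power=0.75):
    """Normalize #(w)^power into a probability distribution over words."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


def sample_negatives(dist, k, rng):
    """Draw k negative words (with replacement) from the noise distribution."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=k)


# Toy counts (illustrative only): raising to the 0.75 power flattens the distribution.
dist = noise_distribution({"the": 16, "cat": 1})
negatives = sample_negatives(dist, k=10, rng=random.Random(0))
```

With these toy counts, "the" is 16 times more frequent than "cat" but only 8 times more likely to be drawn, so rare words are sampled more often than their raw frequency would suggest.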
Interpretability
Word Intrusion [Chang et al., 2009]
[Figure: the vocabulary is sorted in descending order along dimension i; the top 5 words (poisson, parametric, markov, bayesian, stochastic) are shown together with the intruder jodel]
1. Sort dimension i in descending order.
2. Build a set: {top 5 words, 1 intruder word (in the bottom 50% of dimension i and the top 10% of dimension j, j ≠ i)}.
3. Pick out the intruder word.
The more interpretable the dimension, the easier it is to pick out the intruder.
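The three steps above can be sketched as follows. This is a minimal illustration, assuming embeddings are stored as a dict from word to a list of floats; the function name `intrusion_set` and the choice to return every eligible intruder (rather than one) are assumptions made for the example.

```python
def intrusion_set(embeddings, dim_i, dim_j, top_n=5):
    """Return (top_words, intruder_candidates) for dimension dim_i.

    Intruder candidates rank in the bottom 50% of dim_i and the top 10% of dim_j.
    """
    words = list(embeddings)
    # Step 1: sort the vocabulary in descending order along each dimension.
    by_i = sorted(words, key=lambda w: embeddings[w][dim_i], reverse=True)
    by_j = sorted(words, key=lambda w: embeddings[w][dim_j], reverse=True)
    # Step 2: top words of dimension i plus eligible intruders.
    top_words = by_i[:top_n]
    bottom_half_i = set(by_i[len(by_i) // 2:])
    top_tenth_j = by_j[:max(1, len(by_j) // 10)]
    candidates = [w for w in top_tenth_j
                  if w in bottom_half_i and w not in top_words]
    return top_words, candidates
```

One candidate would then be mixed with the top words, and step 3 (picking out the intruder) is left to the evaluators.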