  1. Sparse Word Embeddings Using ℓ1 Regularized Online Learning. Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. July 14, 2016. ofey.sunfei@gmail.com, {guojiafeng, lanyanyan, junxu, cxq}@ict.ac.cn. CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences.

  2. Distributed Word Representation. [Figure: applications built on distributed word representations, including Sentiment Analysis [Maas et al., 2011], Machine Translation [Kalchbrenner and Blunsom, 2013], Language Modeling [Bengio et al., 2003], POS Tagging [Collobert et al., 2011], Word-Sense Disambiguation [Collobert et al., 2011], and Parsing [Socher et al., 2011].] Distributed word representations are a hot topic in the NLP community.

  3. Models. [Figure: architectures of word embedding models, including NPLM, LBL, the C&W window model, Huang's model with a global (document-level) semantic vector, and Word2Vec's CBOW and Skip-gram.] State of the art: CBOW and SG.

  4-8. Dense Representation and Interpretability. Example¹:
  man      [0.326172, ..., 0.00524902, ..., 0.0209961]
  woman    [0.243164, ..., -0.205078, ..., -0.0294189]
  dog      [0.0512695, ..., -0.306641, ..., 0.222656]
  computer [0.107422, ..., -0.0375977, ..., -0.0620117]
  • Which dimension represents the gender of man and woman?
  • What sort of value indicates male or female?
  • The gender dimension(s) would be active in all word vectors, including irrelevant words like computer.
  • Dense vectors are difficult to interpret and uneconomical to store.
  ¹ Vectors from GoogleNews-vectors-negative300.bin.

  9-10. Sparse Word Representation. Two existing lines of work:
  • Non-Negative Sparse Embedding (NNSE) [Murphy et al., 2012]: introduces sparsity and non-negativity constraints into matrix factorization.
  • Sparse Coding over Word2Vec [Faruqui et al., 2015]: converts dense vectors into sparse (and optionally binary) overcomplete vectors as a post-processing step:
  $\arg\min_{D \in \mathbb{R}^{k \times n},\, A \in \mathbb{R}^{m \times k}} \sum_{i=1}^{m} \|X_i - A_i D\|^2 + \lambda \|A_i\|_1$
  where $A_{i,j} \ge 0,\ \forall\, 1 \le i \le m,\ \forall\, 1 \le j \le k$ (non-negative variant) and $D_i D_i^{\top} \le 1,\ \forall\, 1 \le i \le k$; $X$ holds the initial dense vectors, $A$ the sparse overcomplete vectors, and $D$ the dictionary.
  Both are difficult to train on large-scale data because of their heavy memory usage!
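  For intuition, here is a minimal numpy sketch (not the authors' code) of the post-processing idea behind the Faruqui et al. objective above: alternating proximal gradient steps, with a soft-threshold for the ℓ1 term, a non-negativity clamp on A, and a row-norm constraint on D. The function name, step size, and iteration count are illustrative assumptions.

      import numpy as np

      def nonneg_sparse_code(X, k, lam=0.5, lr=1e-2, n_iter=200, seed=0):
          """Sketch: factor dense vectors X (m x n) into sparse non-negative
          codes A (m x k) and a dictionary D (k x n)."""
          rng = np.random.default_rng(seed)
          m, n = X.shape
          A = np.abs(rng.normal(scale=0.1, size=(m, k)))
          D = rng.normal(scale=0.1, size=(k, n))
          for _ in range(n_iter):
              # gradient step on the codes, then ell_1 prox (soft-threshold) and clamp >= 0
              A -= lr * ((A @ D - X) @ D.T)
              A = np.maximum(A - lr * lam, 0.0)
              # gradient step on the dictionary, then rescale rows so that D_i D_i^T <= 1
              D -= lr * (A.T @ (A @ D - X))
              D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)
          return A, D

  The point the slide makes is that running such a batch factorization over the full vocabulary-by-dimension matrix keeps everything in memory at once, which is what limits these methods at scale.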

  11-14. Our Motivation. We want word embeddings that are large scale, interpretable, well performing, and fast to train. Starting from Word2Vec, we build a sparse variant: Sparse CBOW, which directly applies the sparse constraint to CBOW.

  15. CBOW. [Figure: CBOW architecture; the context words c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2} are averaged to predict the center word w_i.]
  $L_{cbow} = \sum_{i=1}^{N} \log p(w_i \mid h_i)$
  $p(w_i \mid h_i) = \dfrac{\exp(\vec{w}_i \cdot \vec{h}_i)}{\sum_{w \in W} \exp(\vec{w} \cdot \vec{h}_i)}$
  $\vec{h}_i = \dfrac{1}{2l} \sum_{j=i-l,\, j \ne i}^{i+l} \vec{c}_j$
  With negative sampling:
  $L^{ns}_{cbow} = \sum_{i=1}^{N} \Big[ \log \sigma(\vec{w}_i \cdot \vec{h}_i) + k \cdot \mathbb{E}_{\tilde{w} \sim P_W} \log \sigma(-\vec{\tilde{w}} \cdot \vec{h}_i) \Big]$
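  For concreteness, a minimal numpy sketch of one CBOW negative-sampling update implied by these formulas (gradient ascent on the objective for a single position i). The array names W_in / W_out, the learning rate, and the explicit loop over the target and noise words are illustrative assumptions, not the paper's implementation.

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def cbow_ns_step(W_in, W_out, ctx_ids, target_id, neg_ids, lr=0.05):
          """One CBOW step: context ("input") vectors W_in, word ("output") vectors W_out."""
          h = W_in[ctx_ids].mean(axis=0)               # averaged context vector h_i
          grad_h = np.zeros_like(h)
          for wid, label in [(target_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
              err = label - sigmoid(W_out[wid] @ h)    # 1 - sigma for the true word, -sigma for noise
              grad_h += err * W_out[wid]
              W_out[wid] += lr * err * h               # update the word vector
          W_in[ctx_ids] += lr * grad_h / len(ctx_ids)  # each context vector gets its 1/2l share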

  16-19. Sparse CBOW. Add an ℓ1 regularizer over the word vectors to the negative-sampling objective:
  $L^{ns}_{s\text{-}cbow} = L^{ns}_{cbow} - \lambda \sum_{w \in W} \|\vec{w}\|_1$
  Optimizing this online with plain SGD failed: the subgradient steps shrink the weights but do not drive them exactly to zero. Instead we use Regularized Dual Averaging (RDA) [Xiao, 2009], which truncates weights using the online average of the subgradients.
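  A tiny sketch of why the naive route struggles, assuming a per-word subgradient ascent step (plain numpy, hypothetical function name): the ℓ1 subgradient only nudges each weight toward zero, so the iterates almost never land exactly on zero and the vectors stay dense.

      import numpy as np

      def naive_l1_sgd_step(w, grad_loglik, lam=1e-4, lr=0.05):
          """Gradient ascent on log-likelihood minus lambda * ||w||_1.
          The sign term shrinks weights but leaves them at small nonzero
          values, hence no exact sparsity (the slide's "failed")."""
          return w + lr * (grad_loglik - lam * np.sign(w))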

  20. RDA algorithm for Sparse CBOW
  1: procedure SparseCBOW(C)
  2:   Initialize: $\vec{w}, \forall w \in W$; $\vec{c}, \forall c \in C$; $\bar{g}^{0}_{w} = \vec{0}, \forall w \in W$
  3:   for i = 1, 2, 3, ... do
  4:     t ← update time of word $w_i$
  5:     $\vec{h}_i = \frac{1}{2l} \sum_{j=i-l,\, j \ne i}^{i+l} \vec{c}_j$
  6:     $g^{t}_{w_i} = \big( 1_{h_i}(w_i) - \sigma(\vec{w}^{t}_{i} \cdot \vec{h}_i) \big) \vec{h}_i$
  7:     $\bar{g}^{t}_{w_i} = \frac{t-1}{t} \bar{g}^{t-1}_{w_i} + \frac{1}{t} g^{t}_{w_i}$    ⊲ keep track of the online average subgradients
  8:     update $\vec{w}_i$ element-wise, for j = 1, 2, ..., d, according to
  9:       $w^{t+1}_{ij} = 0$ if $|\bar{g}^{t}_{w_i j}| \le \frac{\lambda}{\#(w_i)}$, and $w^{t+1}_{ij} = \eta \sqrt{t} \big( \bar{g}^{t}_{w_i j} - \frac{\lambda}{\#(w_i)} \mathrm{sgn}(\bar{g}^{t}_{w_i j}) \big)$ otherwise    ⊲ truncating
  10:    for k = -l, ..., -1, 1, ..., l do
  11:      update $\vec{c}_{i+k}$ according to
  12:      $\vec{c}_{i+k} := \vec{c}_{i+k} + \alpha \frac{1}{2l} \big( 1_{h_i}(w_i) - \sigma(\vec{w}^{t}_{i} \cdot \vec{h}_i) \big) \vec{w}^{t}_{i}$
  13:    end for
  14:  end for
  15: end procedure
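  As a concrete illustration, here is a small numpy sketch of the RDA bookkeeping and truncation step above. It is not the released implementation; the function and variable names, and reading the step size as η√t, are assumptions based on the reconstructed pseudocode.

      import numpy as np

      def rda_update_word(avg_grad, grad, t, word_count, lam=1e-4, eta=0.05):
          """One RDA step for a word vector: average the subgradients online,
          then rebuild the vector, truncating small-average dimensions to exactly zero."""
          avg_grad = (t - 1) / t * avg_grad + grad / t        # line 7: online average subgradient
          thresh = lam / word_count                           # per-word threshold lambda / #(w)
          new_w = np.zeros_like(avg_grad)
          keep = np.abs(avg_grad) > thresh                    # dimensions that survive truncation
          new_w[keep] = eta * np.sqrt(t) * (avg_grad[keep] - thresh * np.sign(avg_grad[keep]))
          return new_w, avg_grad

  Note that RDA rebuilds the word vector from the averaged subgradient rather than nudging the previous value, which is what lets whole dimensions land exactly on zero.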

  21. Evaluation

  22. Baselines & Tasks
  Baselines:
  • Dense representation models: GloVe [Pennington et al., 2014]; CBOW and SG [Mikolov et al., 2013]
  • Sparse representation models: Sparse Coding (SC) [Faruqui et al., 2015]; Positive Pointwise Mutual Information (PPMI) [Bullinaria and Levy, 2007]; NNSE [Murphy et al., 2012]
  Tasks:
  • Interpretability: Word Intrusion
  • Expressive Power: Word Analogy, Word Similarity

  23. Experimental Settings
  Corpus: Wikipedia 2010 (1B words)
  Parameter settings: window = 10, negative = 10, iterations = 20, λ = grid search, learning rate = 0.05, noise distribution ∝ #(w)^0.75
  Baseline settings:
  • GloVe, CBOW, SG: same settings as the released tools
  • SC: embeddings of CBOW as input
  • NNSE: PPMI matrix, 40,000 words, SPAMS¹
  ¹ http://spams-devel.gforge.inria.fr

  24. Interpretability

  25. Word Intrusion [Chang et al., 2009]
  [Figure: an example dimension whose top-5 words are poisson, markov, bayesian, stochastic, and parametric; the intruder word jodel is picked out of the candidate set.]
  1. Sort dimension i in descending order.
  2. Build a set: {the top 5 words on dimension i, plus 1 intruder word drawn from the bottom 50% of dimension i and the top 10% of another dimension j, j ≠ i}.
  3. Ask annotators to pick out the intruder word.
  The more interpretable the dimension, the easier the intruder is to pick out.
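  A small sketch of how one such intrusion instance could be assembled from an embedding matrix (hypothetical helper; it assumes rows are words and columns are dimensions, and follows the top-5 / bottom-50% / top-10% rule on the slide):

      import numpy as np

      def intrusion_instance(E, vocab, dim, other_dim, rng=None):
          """Build one word-intrusion question for dimension `dim` of embedding matrix E."""
          rng = rng or np.random.default_rng(0)
          order = np.argsort(-E[:, dim])                      # words sorted by `dim`, descending
          top5 = [vocab[i] for i in order[:5]]
          n = len(vocab)
          bottom_half = set(order[n // 2:])                   # bottom 50% on `dim`
          top_other = set(np.argsort(-E[:, other_dim])[:n // 10])  # top 10% on another dimension
          intruder = vocab[rng.choice(list(bottom_half & top_other))]
          question = top5 + [intruder]
          rng.shuffle(question)                               # annotators see a shuffled list
          return question, intruder

  A dimension counts as interpretable if annotators reliably spot the intruder among the six shuffled words.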
