  1. Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification
     Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney

  2. Agenda
     • Background
     • Methodology
     • Experiments
     • Conclusion

  3. Background: Contextual Word Classification
     • language modeling (LM) and machine translation (MT)
     • dominated by neural networks (NN)
     • $p(w_1^T) = \prod_{t=1}^{T} p(w_t \mid w_0^{t-1}) = \prod_{t=1}^{T} p(w_t \mid h_t)$
     • despite many choices to learn the context vector $h$:
       – feed-forward NN
       – recurrent NN
       – convolutional NN
       – self-attention NN
       – ...
     • output is often modeled with standard softmax and trained with cross entropy:
       $\mathrm{p}(w_v \mid h) = \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}, \qquad L = -\log \mathrm{p}(w_v \mid h)$
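To make the output layer concrete, here is a minimal NumPy sketch of the standard softmax with cross-entropy loss; dimensions and names are illustrative, not from the slides:

```python
import numpy as np

def softmax_cross_entropy(W, h, v):
    """Standard softmax output layer with cross-entropy loss.

    W: (d, V) output word embedding matrix, one column W_v per word
    h: (d,)  context vector
    v: int   index of the target word w_v
    """
    logits = W.T @ h                               # W_v^T h for every word v'
    logits -= logits.max()                         # stabilize exp()
    p = np.exp(logits) / np.exp(logits).sum()      # p(w_v' | h) for all v'
    return -np.log(p[v])                           # L = -log p(w_v | h)

rng = np.random.default_rng(0)
d, V = 8, 20
loss = softmax_cross_entropy(rng.normal(size=(d, V)), rng.normal(size=d), v=3)
```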

  4. Background: Softmax Bottleneck [Yang et al., 2018]
     • from previous:
       $L = -\log \mathrm{p}(w_v \mid h) = -\log\left(\frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}\right)$
     • exponential-and-logarithm calculation:
       $\log \mathrm{p}(w_v \mid h) + \log \sum_{v'=1}^{V} \exp(W_{v'}^T h) = W_v^T h$
     • approximate true log posteriors with inner products:
       $\log \tilde{\mathrm{p}}(w_v \mid h) + C_h \approx W_v^T h$
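The exponential-and-logarithm identity above is easy to confirm numerically; a small sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, v = 8, 20, 3
W, h = rng.normal(size=(d, V)), rng.normal(size=d)

logits = W.T @ h
lse = np.log(np.exp(logits).sum())          # log sum_v' exp(W_v'^T h)
log_p = logits - lse                        # log softmax
assert np.isclose(log_p[v] + lse, logits[v])   # log p(w_v|h) + lse = W_v^T h
```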

  5. Background: Softmax Bottleneck Cont.
     • from previous: $\log \tilde{\mathrm{p}}(w_v \mid h) + C_h \approx W_v^T h$
     • in matrix form:
       $\underbrace{\log \begin{pmatrix} \tilde{\mathrm{p}}(w_1 \mid h_1) & \cdots & \tilde{\mathrm{p}}(w_1 \mid h_N) \\ \vdots & \ddots & \vdots \\ \tilde{\mathrm{p}}(w_V \mid h_1) & \cdots & \tilde{\mathrm{p}}(w_V \mid h_N) \end{pmatrix}}_{V \times N,\ \mathrm{rank} \sim V} + \underbrace{\begin{pmatrix} C_{h_1} & \cdots & C_{h_N} \end{pmatrix}}_{V \times N} \approx \underbrace{\begin{pmatrix} W_1^T \\ \vdots \\ W_V^T \end{pmatrix}}_{V \times d} \underbrace{\begin{pmatrix} h_1 & \cdots & h_N \end{pmatrix}}_{d \times N,\ \mathrm{rank} \sim d}$
     • factorization of the true log posterior matrix: $\log \tilde{P} + C \approx W^T H$
     • Softmax Bottleneck:
       – $\log \tilde{P}$ is high-rank for natural language: $\mathrm{rank}(\log \tilde{P}) \sim V$
       – $C$ decreases the rank of the left-hand side by at most 1
       – rank of $W^T H$ is bounded by the hidden dimension: $\mathrm{rank}(W^T H) \sim d$
       – typically $V \sim 100{,}000$ and $d \sim 1000$, so $\approx$ becomes $\not\approx$
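The rank argument can be checked empirically: any log-posterior matrix produced by a softmax over $W^T H$ has numerical rank at most $d + 1$, while a generic posterior matrix is full rank. A sketch, with toy sizes standing in for $V$ and $d$:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, N = 50, 5, 50

# Log posteriors induced by a softmax: rank bounded by d (+1 for the C term).
W, H = rng.normal(size=(d, V)), rng.normal(size=(d, N))
logits = W.T @ H                                   # V x N, rank <= d
log_p = logits - np.log(np.exp(logits).sum(0))     # column-wise log softmax
print(np.linalg.matrix_rank(log_p))                # prints a number <= d + 1

# A generic "true" log-posterior matrix is full rank.
log_p_true = np.log(rng.dirichlet(np.ones(V), size=N).T)
print(np.linalg.matrix_rank(log_p_true))           # generically min(V, N) = 50
```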

  6. Background: Breaking Softmax Bottleneck
     • mixture-of-softmax (MoS) [Yang et al., 2018]:
       $\mathrm{p}_{\mathrm{mos}}(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(W_v^T f_k(h))}{\sum_{v'=1}^{V} \exp(W_{v'}^T f_k(h))}$
     • sigsoftmax [Kanai et al., 2018]:
       $\mathrm{p}_{\mathrm{sigsoftmax}}(w_v \mid h) = \frac{\exp(W_v^T h)\,\sigma(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)\,\sigma(W_{v'}^T h)}$
     • weight norm regularization [Herold et al., 2018]:
       $L_{\mathrm{wnr}} = \sum_{\mathrm{data}} -\log \mathrm{p}_{\mathrm{mos}}(w_v \mid h) + \rho \sqrt{\frac{\sum_{v=1}^{V} (\|W_v\|_2 - \nu)^2}{V}}$
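A minimal sketch of the mixture-of-softmax posterior, assuming the transformed contexts $f_k(h)$ and mixture weights $\pi_k$ have already been computed upstream:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mos(W, f, pi):
    """Mixture of softmaxes: p(w_v|h) = sum_k pi_k * softmax_v(W_v^T f_k(h)).

    W:  (d, V) shared word embeddings
    f:  (K, d) the K transformed context vectors f_k(h)
    pi: (K,)   mixture weights, summing to 1
    """
    return pi @ softmax(f @ W, axis=-1)   # (K, V) component posteriors, combined
```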

  7. Background: Breaking Softmax Bottleneck Cont.
     • let $z = W_v^T h$
     • theoretically, to break the softmax bottleneck with an activation $\mathrm{g}(z)$ [Kanai et al., 2018], the activation should be:
       – nonlinear in $\log(\mathrm{g}(z))$
       – numerically stable
       – non-negative
       – monotonically increasing
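As a concrete check, sigsoftmax's activation $\mathrm{g}(z) = \exp(z)\,\sigma(z)$ from the previous slide satisfies these properties; a quick numerical verification on a grid:

```python
import numpy as np

z = np.linspace(-5, 5, 1001)
g = np.exp(z) / (1 + np.exp(-z))     # sigsoftmax activation: exp(z) * sigmoid(z)

assert (g >= 0).all()                # non-negative
assert (np.diff(g) > 0).all()        # monotonically increasing
log_g = np.log(g)                    # log g(z) = z + log sigmoid(z)
assert not np.allclose(np.diff(log_g, 2), 0)   # nonzero curvature => nonlinear
```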

  8. Background: Geometric Explanation of Softmax Bottleneck
     • an intuitive example:
       – $\tilde{\mathrm{p}}(\mathrm{dog} \mid \text{a common home pet is ...}) \approx \tilde{\mathrm{p}}(\mathrm{cat} \mid \text{a common home pet is ...}) \approx 50\%$
       – learned word vectors are close: $W_{\mathrm{dog}} \approx W_{\mathrm{cat}}$
       – posteriors over dog and cat are thus close for every context: $\mathrm{p}(\mathrm{dog} \mid ...) \approx \mathrm{p}(\mathrm{cat} \mid ...)$
       – there exist contexts that would fail the model, e.g. contexts under which dog and cat should receive very different probabilities
       – overall an expressiveness problem

  9. Background: Kernel Trick
     • widely used in support-vector machines (SVM) and logistic regression
     • improves expressiveness by implicitly transforming data into high-dimensional feature spaces [Eric, 2019]

  10. Background: Kernel Trick Cont.
      • $K(x, y) = \langle \phi(x), \phi(y) \rangle$
      • e.g. the squared kernel:
        $K_{\mathrm{sq}}\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\right) = (x_1 y_1 + x_2 y_2)^2 = \phi^T\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \phi\!\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad \phi\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$
      • for a kernel function (kernel) to be valid:
        – positive semidefinite (PSD) Gram matrix
        – corresponds to a scalar product in some feature space
      • empirically, non-PSD kernels also work well [Lin and Lin, 2003, Boughorbel et al., 2005] → we do not enforce PSD
      • where there is an inner product, there could be an application of the kernel trick:
        – import vector machine [Zhu and Hastie, 2002]
        – multilayer kernel machine [Cho and Saul, 2009]
        – gated softmax [Memisevic et al., 2010]
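The $K_{\mathrm{sq}}$ identity above can be verified in a few lines (the test vectors are chosen arbitrarily):

```python
import numpy as np

def k_sq(x, y):
    return float(x @ y) ** 2               # K_sq(x, y) = (x1*y1 + x2*y2)^2

def phi(x):                                # explicit feature map for K_sq
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(k_sq(x, y), phi(x) @ phi(y))   # both equal (3 - 2)^2 = 1
```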

  11. Background: Non-Euclidean Word Embedding
      • Gaussian embedding [Vilnis and McCallum, 2015, Athiwaratkun and Wilson, 2017]
      (figure from [Athiwaratkun and Wilson, 2017])

  12. Background: Non-Euclidean Word Embedding Cont.
      • hyperbolic (Poincaré) embedding [Nickel and Kiela, 2017, Dhingra et al., 2018]
      (figure from [Nickel and Kiela, 2017])

  13. Background: Quick Recap
      • contextual word classification
      • softmax bottleneck
      • breaking softmax bottleneck
      • geometric explanation of softmax bottleneck
      • kernel trick
      • non-Euclidean word embedding
      → kernels in softmax

  14. Agenda
      • Background
      • Methodology
      • Experiments
      • Conclusion

  15. Methodology: Generalized Softmax
      • model posterior:
        $\mathrm{p}(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}$
      • mixture weight:
        $\pi_k = \frac{\exp(M_k^T h)}{\sum_{k'=1}^{K} \exp(M_{k'}^T h)}$
      • nonlinearly transformed context:
        $f_k(h) = \tanh(Q_k^T h)$
      • with trainable parameters: $W \in \mathbb{R}^{d \times V}$, $M \in \mathbb{R}^{d \times K}$ and $Q_k \in \mathbb{R}^{d \times d}$
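A runnable sketch of this generalized softmax, assuming the kernels $S_k$ are passed in as plain functions; the example components and $\gamma = 1$ below are placeholders, not the paper's settings:

```python
import numpy as np

def generalized_softmax(W, h, M, Q, kernels):
    """Generalized softmax: p(w_v|h) = sum_k pi_k * softmax_v(S_k(W_v, f_k(h))).

    W: (d, V) shared word vectors;  M: (d, K) mixture projections
    Q: (K, d, d) per-component context transforms
    kernels: list of K functions mapping (W, f) -> (V,) scores S_k
    """
    m = M.T @ h
    pi = np.exp(m - m.max()); pi /= pi.sum()      # pi_k = softmax(M_k^T h)
    p = np.zeros(W.shape[1])
    for k, S_k in enumerate(kernels):
        f_k = np.tanh(Q[k].T @ h)                 # f_k(h) = tanh(Q_k^T h)
        s = S_k(W, f_k)                           # S_k(W_v, f_k(h)) for all v
        e = np.exp(s - s.max())
        p += pi[k] * e / e.sum()                  # weighted component softmax
    return p

# Two example components: a linear kernel and an RBF kernel (gamma = 1).
lin = lambda W, f: W.T @ f
rbf = lambda W, f: np.exp(-((W - f[:, None]) ** 2).sum(axis=0))

rng = np.random.default_rng(0)
d, V, K = 8, 20, 2
p = generalized_softmax(rng.normal(size=(d, V)), rng.normal(size=d),
                        rng.normal(size=(d, K)), rng.normal(size=(K, d, d)),
                        [lin, rbf])
assert np.isclose(p.sum(), 1.0)                   # a valid distribution over V
```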

  16. Methodology: Generalized Softmax Cont.
      • from previous:
        $\mathrm{p}(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}$
      • note:
        – replace the inner product with kernels: $W_v^T h \to S_k(W_v, f_k(h))$
        – replace the single softmax with a mixture: $1 \to \sum_{k=1}^{K} \pi_k$
        – replace the context vector with transformed ones: $h \to f_k(h)$
        – shared word vectors due to memory restriction
      • motivations:
        – different kernels give different feature spaces
        – based on the context, the model chooses which feature space is suitable
        – for each feature space, the context vector could be different
        – ideally, for each feature space, the word vector could also be different

  17. Methodology: Individual Kernels
      $S_{\mathrm{lin}}(W_v, h) = W_v^T h$
      $S_{\mathrm{log}}(W_v, h) = -\log(\|W_v - h\|_p + 1)$
      $S_{\mathrm{pow}}(W_v, h) = -\|W_v - h\|_p$
      $S_{\mathrm{pol}}(W_v, h) = (\alpha W_v^T h + c)^p$
      $S_{\mathrm{rbf}}(W_v, h) = \exp(-\gamma \|W_v - h\|^2)$
      $S_{\mathrm{wav}}(W_v, h) = \cos\!\left(\frac{\|W_v - h\|_2}{a}\right) \exp\!\left(-\frac{\|W_v - h\|_2}{b}\right)$
      $S_{\mathrm{ssg}}(W_v, h) = \log \int \mathcal{N}(x; \mu_{W_v}, \Sigma_{W_v})\, \mathcal{N}(x; \mu_h, \Sigma_h)\, dx$
      $S_{\mathrm{mog}}(W_v, h) = \log \sum_{i,j} \int \mathcal{N}(x; \mu_{i,W_v}, \Sigma_{i,W_v})\, \mathcal{N}(x; \mu_{j,h}, \Sigma_{j,h})\, dx$
      $S_{\mathrm{hpb}}(W_v, h) = -\operatorname{acosh}\!\left(1 + \frac{2\|W_v - h\|^2}{(1 - \|W_v\|^2)(1 - \|h\|^2)}\right)$
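Sketches of several of these kernels in NumPy, each scoring all $V$ word vectors against one (transformed) context vector; the hyperparameter defaults are placeholders, not values from the paper:

```python
import numpy as np

# Each function scores all V word vectors (columns of W, shape (d, V))
# against a single context/feature vector f of shape (d,).

def s_lin(W, f):
    return W.T @ f                                          # W_v^T h

def s_log(W, f, p=2):
    return -np.log(np.linalg.norm(W - f[:, None], ord=p, axis=0) + 1)

def s_pow(W, f, p=2):
    return -np.linalg.norm(W - f[:, None], ord=p, axis=0)   # -||W_v - h||_p

def s_pol(W, f, alpha=1.0, c=1.0, p=2):
    return (alpha * (W.T @ f) + c) ** p                     # polynomial

def s_rbf(W, f, gamma=1.0):
    return np.exp(-gamma * ((W - f[:, None]) ** 2).sum(axis=0))

def s_wav(W, f, a=1.0, b=1.0):
    dist = np.linalg.norm(W - f[:, None], axis=0)           # ||W_v - h||_2
    return np.cos(dist / a) * np.exp(-dist / b)             # wavelet-style

def s_hpb(W, f, eps=1e-5):
    # Poincare distance; assumes all vectors lie inside the unit ball.
    sq = ((W - f[:, None]) ** 2).sum(axis=0)
    denom = (1 - (W ** 2).sum(axis=0)) * (1 - f @ f)
    return -np.arccosh(1 + 2 * sq / np.maximum(denom, eps))
```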
