SLIDE 1

Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification

Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney

gao@i6.informatik.rwth-aachen.de — 2019.11.03 @ IWSLT2019

SLIDE 2

Agenda

  • Background
  • Methodology
  • Experiments
  • Conclusion

SLIDE 3

Background: Contextual Word Classification

  • language modeling (LM) and machine translation (MT)
  • dominated by neural networks (NN)
  • the sequence probability factorizes into per-position word posteriors, each conditioned on a context vector $h_t$:

$$p(w_1^T) = \prod_{t=1}^{T} p(w_t \mid w_1^{t-1}, \ldots) = \prod_{t=1}^{T} p(w_t \mid h_t)$$

  • despite many choices to learn the context vector $h$:
    – feed-forward NN
    – recurrent NN
    – convolutional NN
    – self-attention NN
    – ...
  • the output is often modeled with a standard softmax and trained with cross entropy:

$$p(w_v \mid h) = \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}, \qquad \mathcal{L} = -\log p(w_v \mid h)$$
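A minimal numpy sketch of this output layer, with toy shapes and names of our own choosing (not the authors' code):

```python
import numpy as np

# Standard softmax output layer with cross-entropy loss.
# W holds one output embedding per vocabulary word; h is the context
# vector produced by some encoder.
d, V = 8, 100                       # hidden size, vocabulary size (toy values)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))         # W in R^{d x V}
h = rng.normal(size=d)              # h in R^d

logits = W.T @ h                    # W_v^T h for every v
logits -= logits.max()              # stabilize exp() numerically
p = np.exp(logits) / np.exp(logits).sum()   # p(w_v | h)

v = 42                              # index of the reference word
loss = -np.log(p[v])                # L = -log p(w_v | h)
print(loss)
```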

SLIDE 4

Background: Softmax Bottleneck [Yang et al., 2018]

  • from previous:

$$\mathcal{L} = -\log p(w_v \mid h) = -\log \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}$$

  • expanding the exponential-and-logarithm calculation:

$$\log p(w_v \mid h) + \log \sum_{v'=1}^{V} \exp(W_{v'}^T h) = W_v^T h$$

  • i.e. the true log posteriors are approximated with inner products, up to a context-dependent constant $C_h$:

$$\log \tilde{p}(w_v \mid h) + C_h \approx W_v^T h$$

SLIDE 5

Background: Softmax Bottleneck Cont.

  • from previous: $\log \tilde{p}(w_v \mid h) + C_h \approx W_v^T h$
  • in matrix form:

$$\log \underbrace{\begin{bmatrix} \tilde{p}(w_1 \mid h_1) & \cdots & \tilde{p}(w_1 \mid h_N) \\ \vdots & \ddots & \vdots \\ \tilde{p}(w_V \mid h_1) & \cdots & \tilde{p}(w_V \mid h_N) \end{bmatrix}}_{V \times N,\ \text{rank} \sim V} + \begin{bmatrix} C_{h_1} & \cdots & C_{h_N} \end{bmatrix} \approx \underbrace{\begin{bmatrix} W_1^T \\ \vdots \\ W_V^T \end{bmatrix}}_{V \times d} \underbrace{\begin{bmatrix} h_1 & \cdots & h_N \end{bmatrix}}_{d \times N,\ \text{rank} \sim d}$$

  • factorization of the true log posterior matrix: $\log \tilde{P} + C \approx W^T H$
  • Softmax Bottleneck:
    – $\log \tilde{P}$ is high-rank for natural language: $\text{rank}(\log \tilde{P}) \sim V$
    – $C$ decreases the rank of the left-hand side by at most 1
    – the rank of $W^T H$ is bounded by the hidden dimension: $\text{rank}(W^T H) \sim d$
    – typically $V \sim 100{,}000$ and $d \sim 1000$, so the low-rank product cannot represent the high-rank log posterior matrix
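The rank argument is easy to verify numerically; a sketch with toy sizes standing in for $V \sim 100{,}000$ and $d \sim 1000$:

```python
import numpy as np

# rank(W^T H) can never exceed d, no matter how large V and N are,
# while a generic V x N matrix (stand-in for the true log posteriors)
# has rank min(V, N).
rng = np.random.default_rng(0)
V, N, d = 50, 40, 5
W = rng.normal(size=(d, V))
H = rng.normal(size=(d, N))

A = W.T @ H                         # V x N matrix of inner products
print(np.linalg.matrix_rank(A))     # -> 5 (= d)

P = rng.normal(size=(V, N))         # generic high-rank target
print(np.linalg.matrix_rank(P))     # -> 40 (= min(V, N))
```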

SLIDE 6

Background: Breaking the Softmax Bottleneck

  • mixture-of-softmaxes (MoS) [Yang et al., 2018]:

$$p_{\text{mos}}(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(W_v^T f_k(h))}{\sum_{v'=1}^{V} \exp(W_{v'}^T f_k(h))}$$

  • sigsoftmax [Kanai et al., 2018]:

$$p_{\text{sigsoftmax}}(w_v \mid h) = \frac{\exp(W_v^T h)\,\sigma(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)\,\sigma(W_{v'}^T h)}$$

  • weight norm regularization [Herold et al., 2018]:

$$\mathcal{L}_{\text{wnr}} = -\sum_{\text{data}} \log p_{\text{mos}}(w_v \mid h) + \rho \sum_{v=1}^{V} \left(\|W_v\|_2 - \nu\right)^2$$
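A hedged numpy sketch of the mixture-of-softmaxes; shapes and parameter names are ours, and the mixture weights (context-dependent in the paper) are drawn at random for brevity:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Mixture-of-softmaxes: K softmaxes over shared word embeddings W,
# each fed its own nonlinearly transformed context f_k(h).
rng = np.random.default_rng(0)
d, V, K = 8, 100, 3
W = rng.normal(size=(d, V))
Q = rng.normal(size=(K, d, d))      # per-component context transforms
h = rng.normal(size=d)
pi = softmax(rng.normal(size=K))    # mixture weights, sum to 1

p_mos = sum(pi[k] * softmax(W.T @ np.tanh(Q[k].T @ h)) for k in range(K))
print(p_mos.sum())                  # -> 1.0: a convex combination of softmaxes
```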

SLIDE 7

Background: Breaking the Softmax Bottleneck Cont.

  • let $z = W_v^T h$
  • theoretically, to break the softmax bottleneck with an activation $g(z)$ [Kanai et al., 2018]:
    – $\log(g(z))$ is nonlinear
    – numerically stable
    – non-negative
    – monotonically increasing
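These criteria can be spot-checked numerically; a small sketch for the sigsoftmax activation $g(z) = \exp(z)\,\sigma(z)$ of Kanai et al. (2018):

```python
import numpy as np

# Check the criteria above for g(z) = exp(z) * sigmoid(z).
z = np.linspace(-5.0, 5.0, 1001)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
g = np.exp(z) * sigmoid(z)

print((g >= 0).all())                         # non-negative -> True
print((np.diff(g) > 0).all())                 # monotonically increasing -> True
# log g(z) = z + log sigmoid(z); its second difference is nonzero,
# so log g is nonlinear (for the plain softmax it would be exactly 0):
print(np.allclose(np.diff(np.log(g), 2), 0))  # -> False
```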

SLIDE 8

Background: Geometric Explanation of the Softmax Bottleneck

  • an intuitive example:
    – $\tilde{p}(\text{dog} \mid \text{a common home pet is ...}) \approx \tilde{p}(\text{cat} \mid \text{a common home pet is ...}) \approx 50\%$
    – learned word vectors are close: $W_{\text{dog}} \approx W_{\text{cat}}$
    – posteriors over dog and cat are thus close for every context: $p(\text{dog} \mid \ldots) \approx p(\text{cat} \mid \ldots)$
    – hence there exist contexts (e.g. one that clearly determines dog) on which the model must fail
    – overall an expressiveness problem

SLIDE 9

Background: Kernel Trick

  • widely used in support vector machines (SVM) and logistic regression
  • improves expressiveness by implicitly transforming the data into a high-dimensional feature space

[Figure: illustration of the kernel trick, from Eric, 2019]

SLIDE 10

Background: Kernel Trick Cont.

  • $K(x, y) = \langle \phi(x), \phi(y) \rangle$
  • example, the squared kernel in two dimensions:

$$K_{\text{sq}}\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) = (x_1 y_1 + x_2 y_2)^2 = \begin{bmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{bmatrix}^T \begin{bmatrix} y_1^2 \\ \sqrt{2}\,y_1 y_2 \\ y_2^2 \end{bmatrix} = \phi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)^T \phi\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right)$$
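A quick numerical check of this identity (our toy values):

```python
import numpy as np

# The squared kernel equals an inner product in the 3-dimensional
# feature space phi, without ever computing phi explicitly.
def K_sq(x, y):
    return (x @ y) ** 2

def phi(v):
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.5, -0.7])
y = np.array([0.3, 2.0])
print(K_sq(x, y), phi(x) @ phi(y))   # identical values
```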

  • for a kernel function (kernel) to be valid:
    – positive semidefinite (PSD) Gram matrix
    – corresponds to a scalar product in some feature space
  • empirically, non-PSD kernels also work well [Lin and Lin, 2003, Boughorbel et al., 2005]
    → we do not enforce PSD
  • wherever there is an inner product, there could be an application of the kernel trick:
    – import vector machine [Zhu and Hastie, 2002]
    – multilayer kernel machine [Cho and Saul, 2009]
    – gated softmax [Memisevic et al., 2010]

SLIDE 11

Background: Non-Euclidean Word Embedding

  • Gaussian embedding [Vilnis and McCallum, 2015, Athiwaratkun and Wilson, 2017]

[Figure: Gaussian word embeddings, from Athiwaratkun and Wilson, 2017]

SLIDE 12

Background: Non-Euclidean Word Embedding Cont.

  • hyperbolic (Poincaré) embedding [Nickel and Kiela, 2017, Dhingra et al., 2018]

[Figure: Poincaré embeddings of a hierarchy, from Nickel and Kiela, 2017]

SLIDE 13

Background: Quick Recap

  • contextual word classification
  • softmax bottleneck
  • breaking softmax bottleneck
  • geometric explanation of softmax bottleneck
  • kernel trick
  • non-Euclidean word embedding

→ kernels in softmax

SLIDE 14

Agenda

  • Background
  • Methodology
  • Experiments
  • Conclusion

SLIDE 15

Methodology: Generalized Softmax

  • model posterior:

$$p(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}$$

  • mixture weight:

$$\pi_k = \frac{\exp(M_k^T h)}{\sum_{k'=1}^{K} \exp(M_{k'}^T h)}$$

  • nonlinearly transformed context:

$$f_k(h) = \tanh(Q_k^T h)$$

  • with trainable parameters $W \in \mathbb{R}^{d \times V}$, $M \in \mathbb{R}^{d \times K}$ and $Q_k \in \mathbb{R}^{d \times d}$
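A hedged numpy sketch of the generalized softmax; the RBF score stands in for an arbitrary $S_k$, and all shapes and names are illustrative rather than the released implementation:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d, V, K = 8, 100, 3
W = rng.normal(size=(d, V))         # shared word vectors, W in R^{d x V}
M = rng.normal(size=(d, K))         # mixture-weight projection M
Q = rng.normal(size=(K, d, d))      # per-component context transforms Q_k
h = rng.normal(size=d)

def S_rbf(W, f, gamma=0.1):
    # S(W_v, f) = exp(-gamma * ||W_v - f||^2), computed for all v at once
    return np.exp(-gamma * ((W - f[:, None]) ** 2).sum(axis=0))

pi = softmax(M.T @ h)               # pi_k = softmax over M_k^T h
p = np.zeros(V)
for k in range(K):
    f_k = np.tanh(Q[k].T @ h)       # f_k(h) = tanh(Q_k^T h)
    p += pi[k] * softmax(S_rbf(W, f_k))   # exp(S_k) / sum_v' exp(S_k)
print(p.sum())                      # -> 1.0
```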

SLIDE 16

Methodology: Generalized Softmax Cont.

  • from previous:

$$p(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}$$

  • note:
    – replace the inner product with kernels: $W_v^T h \to S_k(W_v, f_k(h))$
    – replace the single softmax with a mixture: $1 \to \sum_{k=1}^{K} \pi_k$
    – replace the context vector with transformed ones: $h \to f_k(h)$
    – word vectors are shared across components due to memory restrictions
  • motivations:
    – different kernels give different feature spaces
    – based on the context, the model chooses which feature space is suitable
    – for each feature space, the context vector can be different
    – ideally, for each feature space, the word vector could also be different

SLIDE 17

Methodology: Individual Kernels

$$\begin{aligned}
S_{\text{lin}}(W_v, h) &= W_v^T h \\
S_{\text{log}}(W_v, h) &= -\log(\|W_v - h\|^p + 1) \\
S_{\text{pow}}(W_v, h) &= -\|W_v - h\|^p \\
S_{\text{pol}}(W_v, h) &= (\alpha W_v^T h + c)^p \\
S_{\text{rbf}}(W_v, h) &= \exp(-\gamma \|W_v - h\|^2) \\
S_{\text{wav}}(W_v, h) &= \cos\!\left(\frac{\|W_v - h\|_2}{a}\right) \exp\!\left(-\frac{\|W_v - h\|_2}{b}\right) \\
S_{\text{ssg}}(W_v, h) &= \log \int \mathcal{N}(x; \mu_{W_v}, \Sigma_{W_v})\, \mathcal{N}(x; \mu_h, \Sigma_h)\, dx \\
S_{\text{mog}}(W_v, h) &= \sum_{i,j} \log \int \mathcal{N}(x; \mu_{i,W_v}, \Sigma_{i,W_v})\, \mathcal{N}(x; \mu_{j,h}, \Sigma_{j,h})\, dx \\
S_{\text{hpb}}(W_v, h) &= -\operatorname{arcosh}\!\left(1 + \frac{2\|W_v - h\|^2}{(1 - \|W_v\|^2)(1 - \|h\|^2)}\right)
\end{aligned}$$
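Direct numpy transcriptions of a few of these scores for a single pair $(W_v, h)$; the hyperparameter values are placeholders:

```python
import numpy as np

# A handful of the kernel scores above, written out literally.
def S_lin(w, h):                    return w @ h
def S_log(w, h, p=2):               return -np.log(np.linalg.norm(w - h) ** p + 1)
def S_pow(w, h, p=2):               return -np.linalg.norm(w - h) ** p
def S_pol(w, h, a=1.0, c=0.0, p=2): return (a * (w @ h) + c) ** p
def S_rbf(w, h, gamma=0.1):         return np.exp(-gamma * np.linalg.norm(w - h) ** 2)

w, h = np.ones(4), np.zeros(4)
print([float(f(w, h)) for f in (S_lin, S_log, S_pow, S_pol, S_rbf)])
```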

SLIDE 18

Methodology: Individual Kernels Cont.

  • kernel sources:
    – baseline: lin
    – classic (e.g. from SVM): log, pow, pol, rbf, wav
    – from non-Euclidean word embedding: ssg, mog, hpb
  • the dimension-reduction step over d, common to the kernels above, is not immediately executable; therefore (see the sketch after this list):
    – simplify the wavelet kernel
    – use spherical covariance matrices
    – rewrite the power of the vector difference norm:

$$\|W_v - h\|^p = \left(\|W_v\|^2 + \|h\|^2 - 2\,W_v^T h\right)^{\frac{p}{2}}$$

→ $\|W_v - h\|^p$ becomes a vector-norm-regularized version of the inner product
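A sketch of that rewrite: the squared distances to all $V$ word vectors come from the norms and a single matrix product, without materializing a $V \times d$ difference tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 100
W = rng.normal(size=(d, V))
h = rng.normal(size=d)

# ||W_v - h||^2 = ||W_v||^2 + ||h||^2 - 2 W_v^T h, for all v at once;
# raise to p/2 afterwards to obtain ||W_v - h||^p.
sq_dist = (W ** 2).sum(axis=0) + h @ h - 2.0 * (W.T @ h)    # shape (V,)

naive = ((W - h[:, None]) ** 2).sum(axis=0)                 # naive expansion
print(np.allclose(sq_dist, naive))                          # -> True
```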

SLIDE 19

Agenda

  • Background
  • Methodology
  • Experiments
  • Conclusion

SLIDE 20

Experiments: Experimental Setup

  • LM on Switchboard:
    – 30K vocabulary
    – 25M training tokens
    – LSTM recurrent NN [Sundermeyer et al., 2012]
  • MT on IWSLT 2014 German→English:
    – 10K joint byte pair encoding merge operations
    – 160K parallel training sentences
    – Transformer NN [Vaswani et al., 2017]
  • overall:
    – grid search over kernel hyperparameters
    – vary K and S_k
    – fairseq toolkit [Ott et al., 2019]

SLIDE 21

Experiments: Performance of Individual Kernels

    Method                                         SWB (PPL)   IWSLT (BLEU [%])
    ------------------------------------------------------------------------
    Ref. [Irie et al., 2018] / [Wu et al., 2019]      47.6          35.2
    lin                                               46.8          34.3
    log                                              103.0           0.4
    pow                                               46.8          32.8
    pol                                               47.3          31.7
    rbf                                              284.9           0.0
    ssg                                               49.9          34.6
    mog                                               46.7          34.2
    hpb                                              122.6           0.3
    wav                                              289.7           0.0

SLIDE 22

Experiments: Gradient Properties

[Figure: kernel score $S(W_v, h)$ plotted against $\|W_v - h\|^p$ for the rbf, wav, log and pow kernels; rbf and wav saturate quickly, log flattens slowly, pow keeps a steep slope]

  • expected performance: rbf ≈ wav < log < pow
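The intuition can be made concrete by looking at the slope of $S$ with respect to the distance; a small sketch (our reconstruction of the argument):

```python
import numpy as np

# Slope of S as a function of r = ||W_v - h||: rbf flattens out quickly,
# so badly placed word vectors receive almost no gradient; pow does not.
r = np.linspace(0.5, 6.0, 200)
S = {
    "rbf": np.exp(-r ** 2),
    "log": -np.log(r ** 2 + 1),
    "pow": -r ** 2,
}
for name, s in S.items():
    grad = np.gradient(s, r)
    print(name, f"{abs(grad[-1]):.4f}")   # |dS/dr| at r = 6
# rbf -> ~0 (vanishing), log -> ~0.32 (moderate), pow -> ~12 (non-vanishing)
```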

SLIDE 23

Experiments: Performance of Individual Kernels Revisited

    Method                                         SWB (PPL)   IWSLT (BLEU [%])
    ------------------------------------------------------------------------
    Ref. [Irie et al., 2018] / [Wu et al., 2019]      47.6          35.2
    lin                                               46.8          34.3
    log                                              103.0           0.4
    pow                                               46.8          32.8
    pol                                               47.3          31.7
    rbf                                              284.9           0.0
    ssg                                               49.9          34.6
    mog                                               46.7          34.2
    hpb                                              122.6           0.3
    wav                                              289.7           0.0

SLIDE 24

Experiments: Mixture of Kernels

  • encourage equal contribution of the mixture components, similar to [Takase et al., 2018] (sketched in code after the table):

$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{ce}} + \frac{\rho}{N} \sum^{N} \mathrm{Var}(\pi_k)$$

  • compromise between regularization and performance → ρ = 0.1

    ρ       Variance   PPL
    ----------------------
    0.001   4.74       46.8
    0.01    4.98       46.6
    0.1     3.67       47.2
    1       3.81       47.4
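A minimal sketch of the regularizer, assuming the penalty is the variance of the K mixture weights summed over the N positions and scaled by ρ/N:

```python
import numpy as np

# pi would come from the model; random Dirichlet draws stand in here.
rng = np.random.default_rng(0)
N, K = 32, 4
pi = rng.dirichlet(np.ones(K), size=N)   # N x K, rows sum to 1
rho = 0.1

L_ce = 3.5                               # stand-in cross-entropy value
L_reg = L_ce + rho / N * pi.var(axis=1).sum()
print(L_reg)
```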

SLIDE 25

Experiments: Mixture of Kernels Cont.

    Name     Mixture settings          PPL
    --------------------------------------
    mos*     9×lin                     47.8
    mixbig   1 of each kernel          47.1
    mix1     lin, log, rbf, hpb, wav   46.6
    mix2     3×lin, log                46.5
    mix3     lin, log, pow, pol        47.3
    mix4     lin, log, rbf, hpb        47.1
    mix5     2×lin, 2×rbf              46.7

* tricks like activation regularization and averaged stochastic gradient descent are not applied

  • no major performance differences; consistent observations:
    – at least one lin mixture component
    – consistently, a total mixture weight of more than 50% in the lin component
    – projection matrices are tied across mixture components

SLIDE 26

Experiments: Disambiguation Abilities

  • automatic extraction of word clusters (a sketch follows the list below):
    – recall the dog and cat example
    – extract the projection matrix
    – calculate all pairwise distances between word vectors
    – for each word, sum the distances to its five closest neighbors
    – sort the results and pick the words with the smallest values
    – make sure each word in the clusters appears at least 500 times
  • three extracted clusters:
    – {quickly, slowly, soon, quick, easily}
    – {democrat, republicans, politicians, republican, democrats}
    – {hamster, hamsters, parakeets, rabbits, turtles}
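A numpy sketch of this extraction recipe (our reimplementation; the frequency filter is left as a comment):

```python
import numpy as np

# Score each word by the summed distance to its five nearest neighbors
# in the output embedding space; the tightest neighborhoods form clusters.
rng = np.random.default_rng(0)
V, d = 1000, 16
W = rng.normal(size=(V, d))         # rows = word vectors (from the trained model)

sq_norms = (W ** 2).sum(axis=1)
dist2 = sq_norms[:, None] + sq_norms[None, :] - 2 * W @ W.T   # pairwise squared distances
np.fill_diagonal(dist2, np.inf)     # ignore self-distance

score = np.sort(dist2, axis=1)[:, :5].sum(axis=1)   # per-word tightness
candidates = np.argsort(score)[:20] # word ids with the smallest values
print(candidates)
# additionally, keep only words that appear at least 500 times in the data
```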

SLIDE 27

Experiments: Disambiguation Abilities Cont.

    Model          Prediction
    Ground Truth   ... books can end up being outdated very quickly
    lin            ... books can end up being outdated very soon
    mixbig         ... books can end up being outdated very quickly

    Ground Truth   ... if you vote for a republican or vote for a democrat
    lin            ... if you vote for a republican or vote for a republican
    mixbig         ... if you vote for a republican or vote for a democrat

    Ground Truth   ... well we had some turtles and a hamster
    lin            ... well we had some turtles and a turtle
    mixbig         ... well we had some turtles and a hamster

  • manual inspection of appearances of words in the clusters
  • examples where mixbig is right and lin is wrong
  • need a more systematic way to judge if a model is better at word disambiguation

SLIDE 28

Agenda

  • Background
  • Methodology
  • Experiments
  • Conclusion

SLIDE 29

Conclusion: Conclusion and Outlook

  • softmax bottleneck: theory and intuition
  • generalized softmax: introducing kernels and mixtures of kernels
  • individual kernels: 9 different kernels; lin, pol, pow, ssg and mog perform the best
  • gradient properties: a justification of why some kernels perform better than others
  • mixture of kernels: consistently large mixture weight on lin when sharing word vectors
  • disambiguation abilities: interesting observations, but a more systematic evaluation is still lacking
  • outlook: untie the word embeddings

SLIDE 30

Thank you for your attention!

Any questions?

SLIDE 31

References

Athiwaratkun, B. and Wilson, A. G. (2017). Multimodal word distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656, Vancouver, Canada.

Boughorbel, S., Tarel, J.-P., and Boujemaa, N. (2005). Conditionally positive definite kernels for SVM based image recognition. In 2005 IEEE International Conference on Multimedia and Expo, pages 113–116. IEEE.

Cho, Y. and Saul, L. K. (2009). Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350.

Dhingra, B., Shallue, C., Norouzi, M., Dai, A., and Dahl, G. (2018). Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69.

Eric, K. (2019). Illustration of the kernel trick. http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html. Accessed: 2019.10.31.

Herold, C., Gao, Y., and Ney, H. (2018). Improving neural language models with weight norm initialization and regularization. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 93–100.

Irie, K., Lei, Z., Deng, L., Schlüter, R., and Ney, H. (2018). Investigation on estimation of sentence probability by combining forward, backward and bi-directional LSTM-RNNs. In Proc. Interspeech 2018, pages 392–395.

SLIDE 32

References (cont.)

Kanai, S., Fujiwara, Y., Yamanaka, Y., and Adachi, S. (2018). Sigsoftmax: Reanalysis of the softmax bottleneck. In 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada.

Lin, H.-T. and Lin, C.-J. (2003). A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report.

Memisevic, R., Zach, C., Pollefeys, M., and Hinton, G. E. (2010). Gated softmax classification. In Advances in Neural Information Processing Systems, pages 1603–1611.

Nickel, M. and Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, MN, USA.

Sundermeyer, M., Schlüter, R., and Ney, H. (2012). LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, pages 194–197, Portland, OR, USA.

Takase, S., Suzuki, J., and Nagata, M. (2018). Direct output connection for a high-rank language model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609.

SLIDE 33

References (cont.)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Vilnis, L. and McCallum, A. (2015). Word representations via Gaussian embedding. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.

Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. (2019). Pay less attention with lightweight and dynamic convolutions. In Seventh International Conference on Learning Representations (ICLR), New Orleans, LA, USA.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. (2018). Breaking the softmax bottleneck: A high-rank RNN language model. In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada.

Zhu, J. and Hastie, T. (2002). Kernel logistic regression and the import vector machine. In Advances in Neural Information Processing Systems, pages 1081–1088.
