Slide 1: Softmax Alternatives in Neural MT

Graham Neubig, 5/24/2017

Slide 2: Neural MT Models

[Figure: an encoder-decoder reads the source F = "kouen wo okonai masu" and generates "give a talk </s>" one word at a time by argmax, with probabilities P(e1|F), P(e2|F, e1), P(e3|F, e1^2), P(e4|F, e1^3)]

Slide 3: How we Calculate Probabilities

The next-word probability is computed from the hidden context hi, a weight matrix W, and a bias b:

p(ei|hi) = softmax(W hi + b)

The columns of W = [w*,1, w*,2, w*,3, ...] are the output word embeddings, and b = [b1, b2, b3, ...] holds one bias per vocabulary word. In other words, the score of word k given context ci is:

s(ei = k|ci) = w*,k · ci + bk

i.e., the closeness of the output embedding to the context, plus a bias. We choose the word with the highest score.
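As a concrete illustration, here is a minimal numpy sketch of this computation (all sizes and names are toy values, not from the slides):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Toy sizes: vocabulary of 5 words, hidden size 4.
V, H = 5, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(V, H))   # rows are the output word embeddings w*,k
b = rng.normal(size=V)        # one bias bk per vocabulary word
h = rng.normal(size=H)        # hidden context vector

scores = W @ h + b            # s(k|h) = w*,k . h + bk
p = softmax(scores)           # next-word distribution p(ei|hi)
print(p, p.argmax())          # probabilities and the highest-scoring word id
```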

Slide 4: A Visual Example

[Figure: p = softmax(W h + b), drawn as a matrix-vector product plus a bias vector]

Slide 5: Problems w/ Softmax

  • Computationally inefficient at training time
  • Computationally inefficient at test time
  • Many parameters
  • Sub-optimal accuracy
Slide 6: Calculation/Parameter-Efficient Softmax Variants

Slide 7: Negative Sampling / Noise Contrastive Estimation

  • Calculate the denominator over a subset of the vocabulary (see the sketch below)

[Figure: the full softmax scores W c + b over the entire vocabulary; the sampled version scores W' c + b' over only the true word plus negative samples drawn according to a noise distribution q]
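A minimal numpy sketch of the subset-denominator idea, in a simple sampled-softmax style (function and variable names are hypothetical, and the NCE-specific correction term involving q is omitted for brevity):

```python
import numpy as np

def sampled_softmax_loss(W, b, c, target, num_neg, q, rng):
    """Approximate the softmax denominator using only the target word plus
    `num_neg` negative samples drawn from the noise distribution q."""
    V = W.shape[0]
    neg = rng.choice(V, size=num_neg, replace=False, p=q)
    neg = neg[neg != target]            # drop accidental hits of the target
    rows = np.concatenate(([target], neg))
    scores = W[rows] @ c + b[rows]      # scores over target + negatives only
    scores -= scores.max()              # numerical stability
    log_z = np.log(np.exp(scores).sum())
    return -(scores[0] - log_z)         # negative log-prob of the target

rng = np.random.default_rng(0)
V, H = 1000, 32
W, b = rng.normal(size=(V, H)), np.zeros(V)
q = np.full(V, 1.0 / V)                 # uniform noise distribution
loss = sampled_softmax_loss(W, b, rng.normal(size=H), target=42,
                            num_neg=50, q=q, rng=rng)
print(loss)
```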

Slide 8: Lots of Alternatives!

  • Noise contrastive estimation: train a model to discriminate between true and false examples

  • Negative sampling: e.g. word2vec
  • BlackOut

Ref: Chris Dyer, 2014. Notes on Noise Contrastive Estimation and Negative Sampling

Used in MT: Eriguchi et al. 2016: Tree-to-sequence attentional neural machine translation

Slide 9: GPUifying Noise Contrastive Estimation

  • Creating the negative samples and arranging memory is expensive on the GPU
  • Simple solution: sample the negative samples once for each mini-batch and share them across it (see the sketch below)

Zoph et al. 2016. Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies
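A rough numpy sketch of the shared-sample trick (names are hypothetical; a real implementation would feed these scores into the NCE loss, and the payoff is that the whole batch becomes one dense matrix multiply on the GPU):

```python
import numpy as np

def batch_shared_scores(W, b, C, targets, num_neg, rng):
    """Score a whole mini-batch against ONE shared set of negative samples,
    so scoring is a single dense (batch x samples) matrix multiply."""
    V = W.shape[0]
    neg = rng.choice(V, size=num_neg, replace=False)   # sampled once per batch
    rows = np.concatenate((targets, neg))              # batch targets + shared negatives
    S = C @ W[rows].T + b[rows]                        # (batch, batch + num_neg) scores
    return S, rows

rng = np.random.default_rng(0)
V, H, B = 1000, 32, 8
W, b = rng.normal(size=(V, H)), np.zeros(V)
C = rng.normal(size=(B, H))                            # batch of context vectors
S, rows = batch_shared_scores(W, b, C, rng.integers(V, size=B), 50, rng)
print(S.shape)                                         # (8, 58)
```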

Slide 10: Summary of Negative Sampling Approaches

  • Train time efficiency: Much faster!
  • Test time efficiency: Same
  • Number of parameters: Same
  • Test time accuracy: A little worse?
  • Code complexity: Moderate
Slide 11: Vocabulary Selection

  • Select the vocabulary on a per-sentence basis: restrict the softmax to a small candidate set chosen from the source sentence (see the sketch below)

L'Hostis et al. 2016. Vocabulary Selection Strategies for NMT; Mi et al. 2016. Vocabulary Manipulation for NMT
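A minimal sketch of per-sentence vocabulary selection; the toy lexicon, the common-word list, and all names here are illustrative assumptions, not the specific selection strategies from the papers:

```python
import numpy as np

def select_vocab(src_ids, lexicon, topk_common):
    """Per-sentence candidate set: lexicon translations of each source word,
    plus a list of the most frequent target words."""
    cands = set(topk_common)
    for f in src_ids:
        cands.update(lexicon.get(f, ()))
    return np.array(sorted(cands))

def restricted_softmax(W, b, h, cand):
    scores = W[cand] @ h + b[cand]      # normalize over candidates only
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(0)
V, H = 1000, 32
W, b = rng.normal(size=(V, H)), np.zeros(V)
lexicon = {3: [17, 240], 8: [99]}       # toy source-word -> target-word lexicon
cand = select_vocab([3, 8], lexicon, topk_common=range(10))
p = restricted_softmax(W, b, rng.normal(size=H), cand)
print(cand[p.argmax()])                 # best word id in the full vocabulary
```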

Slide 12: Summary of Vocabulary Selection

  • Train time efficiency: A little faster
  • Test time efficiency: Much faster!
  • Number of parameters: Same
  • Test time accuracy: Better or a little worse
  • Code complexity: Moderate
Slide 13: Class-based Softmax

  • Predict P(class|hidden), then P(word|class,hidden)
  • Because P(word|class,hidden) is 0 for every word outside the chosen class, we only normalize within one class, making computation efficient (see the sketch below)

Goodman 2001. Classes for Fast Maximum Entropy Training

[Figure: two softmaxes, softmax(Wc h + bc) over classes and softmax(Ww h + bw) over the words within the class]
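A small numpy sketch of the factorization (the toy class assignment and all names are made up for illustration):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def class_factored_logprob(Wc, bc, Ww, bw, word2class, class_words, h, w):
    """log P(w|h) = log P(class(w)|h) + log P(w|class(w),h); the word softmax
    is normalized only over the words in w's class."""
    c = word2class[w]
    p_class = softmax(Wc @ h + bc)                 # distribution over classes
    members = class_words[c]                       # word ids in class c
    p_word = softmax(Ww[members] @ h + bw[members])
    k = members.index(w)
    return np.log(p_class[c]) + np.log(p_word[k])

rng = np.random.default_rng(0)
V, C, H = 12, 3, 8
word2class = [w % C for w in range(V)]             # toy class assignment
class_words = [[w for w in range(V) if w % C == c] for c in range(C)]
Wc, bc = rng.normal(size=(C, H)), np.zeros(C)
Ww, bw = rng.normal(size=(V, H)), np.zeros(V)
print(class_factored_logprob(Wc, bc, Ww, bw, word2class, class_words,
                             rng.normal(size=H), w=7))
```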

Slide 14: Hierarchical Softmax

  • Tree-structured prediction of the word ID
  • Usually modeled as a sequence of binary decisions

Morin and Bengio 2005: Hierarchical Probabilistic NNLM

[Figure: following the binary decisions 1, 1, 1, ... down the tree reaches word 14]
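A minimal sketch of the per-node binary decisions, assuming a complete binary tree over a 16-word toy vocabulary; the heap-order node layout and all names are my own illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hier_softmax_logprob(node_W, node_b, bits, h):
    """log P(word|h) as a product of binary decisions down a complete
    binary tree stored in heap order (children of node n are 2n+1, 2n+2)."""
    lp, n = 0.0, 0                      # start at the root
    for bit in bits:
        p_right = sigmoid(node_W[n] @ h + node_b[n])
        lp += np.log(p_right if bit == 1 else 1.0 - p_right)
        n = 2 * n + 1 + bit             # descend to the chosen child
    return lp

rng = np.random.default_rng(0)
H, num_nodes = 8, 15                    # 15 internal nodes cover 16 words
node_W, node_b = rng.normal(size=(num_nodes, H)), np.zeros(num_nodes)
# Word 14 = binary 1110: decisions 1, 1, 1, 0 visit nodes 0, 2, 6, 14.
print(hier_softmax_logprob(node_W, node_b, bits=[1, 1, 1, 0],
                           h=rng.normal(size=H)))
```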

Slide 15: Summary of Class-based Softmaxes

  • Train time efficiency: Faster on CPU, a pain to implement efficiently on GPU
  • Test time efficiency: Worse
  • Number of parameters: More
  • Test time accuracy: Slightly worse to slightly better
  • Code complexity: High
Slide 16: Binary Code Prediction

  • Just directly predict the binary code of the word ID
  • Like hierarchical softmax, but with shared weights at every layer → fewer parameters, easy to GPU (see the sketch below)

Oda et al. 2017: NMT Via Binary Code Prediction

[Figure: σ(W h + b) yields a vector of bit probabilities; rounding the bits 1 1 1 ... gives word 14]
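A minimal sketch of the prediction step, assuming the first bit is the most significant (toy sizes and names; training the bit layer is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_word_by_bits(W, b, h):
    """Predict each bit of the word ID with one shared sigmoid layer,
    then assemble the rounded bits into an integer word ID."""
    bit_probs = sigmoid(W @ h + b)          # one probability per bit
    bits = (bit_probs > 0.5).astype(int)    # round each bit independently
    word_id = int("".join(map(str, bits)), 2)
    return word_id, bit_probs

rng = np.random.default_rng(0)
num_bits, H = 4, 8                          # 4 bits cover a 16-word toy vocab
W, b = rng.normal(size=(num_bits, H)), np.zeros(num_bits)
word_id, probs = predict_word_by_bits(W, b, rng.normal(size=H))
print(word_id, probs)                       # e.g. bits 1 1 1 0 -> word 14
```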

Slide 17: Two Improvements

  • Hybrid model: use a standard softmax for frequent words and binary codes for the rest
  • Error-correcting codes: add redundant bits so that individual bit errors can be corrected

Slide 18: Summary of Binary Code Prediction

  • Train time efficiency: Faster
  • Test time efficiency: Faster (12x on CPU!)
  • Number of parameters: Fewer
  • Test time accuracy: Slightly worse
  • Code complexity: Moderate
Slide 19: Parameter Sharing

Slide 20: Parameter Sharing

  • We have two |V| x |h| matrices in the decoder:
  • Input word embeddings, which we look up and feed into the RNN
  • Output word embeddings, which form the weight matrix W in the softmax
  • Simple idea: tie their weights together (see the sketch below)

Press et al. 2016: Using the Output Embedding to Improve Language Models; Inan et al. 2016: Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
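A minimal sketch of weight tying: one matrix serves as both the input embedding table and the softmax weights, halving the number of vocabulary-sized parameters (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 64
E = rng.normal(size=(V, H)) * 0.1   # ONE |V| x |h| matrix, used twice

def embed(word_id):
    return E[word_id]               # input side: embedding lookup

def output_logits(h, b):
    return E @ h + b                # output side: the same E serves as W

b = np.zeros(V)
h = rng.normal(size=H)              # decoder hidden state (toy stand-in)
logits = output_logits(h, b)
p = np.exp(logits - logits.max())
p /= p.sum()
print(embed(7).shape, p.shape)      # shared parameters: (64,) and (1000,)
```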

Slide 21: Summary of Parameter Sharing

  • Train time efficiency: Same
  • Test time efficiency: Same
  • Number of parameters: Fewer
  • Test time accuracy: Better
  • Code complexity: Low
Slide 22: Incorporating External Information

Slide 23: Problems w/ Lexical Choice in Neural MT

Arthur et al. 2016: Incorporating Discrete Translation Lexicons in NMT

Slide 24: When Does Translation Succeed? (in Output Embedding Space)

[Figure: translating "I come from Tunisia"; the context vector h1 lands close to w*,tunisia and far from w*,norway, w*,sweden, w*,nigeria, w*,eat, w*,consume]

Slide 25: When Does Translation Fail? Embeddings Version

[Figure: same sentence, but h1 lands among the similar country embeddings w*,norway, w*,sweden, w*,nigeria rather than nearest to w*,tunisia]

Slide 26: When Does Translation Fail? Bias Version

[Figure: h1 is closest to w*,tunisia, but the biases (btunisia = -0.5, bchina = 4.5) push the score of w*,china above it]

Slide 27: What about Traditional Symbolic Models?

Example: "his father likes Tunisia" ↔ "kare no chichi wa chunijia ga suki da", with a 1-to-1 alignment giving lexical probabilities:

P(kare|his) = 0.5, P(no|his) = 0.5, P(chichi|father) = 1.0, P(chunijia|Tunisia) = 1.0, P(suki|likes) = 0.5, P(da|likes) = 0.5

Slide 28: Even if We Make a Mistake...

Same sentence pair, but with two alignments crossed by mistake:

P(kare|his) = 0.5, P(no|his) = 0.5, P(chichi|Tunisia) = 1.0 ☓, P(chunijia|father) = 1.0 ☓, P(suki|likes) = 0.5, P(da|likes) = 0.5

Different mistakes than neural MT; soft alignment is possible.

Slide 29: Calculating Lexicon Probabilities

Source: "I come from Tunisia", with attention weights (0.05, 0.01, 0.02, 0.93).

Word-by-word lexicon probabilities (rows: source words; columns: target candidates):

           watashi  ore   …  kuru  kara  …  chunijia  oranda
I           0.6     0.2   …  0.01  0.02  …  0.0       0.0
come        0.03    0.01  …  0.3   0.1   …  0.0       0.0
from        0.01    0.02  …  0.01  0.5   …  0.0       0.0
Tunisia     0.0     0.0   …  0.0   0.01  …  0.96      0.0

Weighting each row by its attention and summing gives the conditional lexicon probability:

lex = (0.03, 0.01, …, 0.00, 0.02, …, 0.89, 0.00)
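The same computation in numpy, using the numbers from the slide (the ellipsis columns are dropped; the attention-weighted mix reproduces the slide's conditional row):

```python
import numpy as np

# Attention over the source words "I come from Tunisia".
attention = np.array([0.05, 0.01, 0.02, 0.93])

# Word-by-word lexicon probabilities: rows = source words, columns =
# target candidates (watashi, ore, kuru, kara, chunijia, oranda).
P_lex = np.array([
    [0.60, 0.20, 0.01, 0.02, 0.00, 0.00],   # I
    [0.03, 0.01, 0.30, 0.10, 0.00, 0.00],   # come
    [0.01, 0.02, 0.01, 0.50, 0.00, 0.00],   # from
    [0.00, 0.00, 0.00, 0.01, 0.96, 0.00],   # Tunisia
])

# Conditional lexicon probability: attention-weighted mix of the rows.
lex = attention @ P_lex
print(lex.round(2))   # -> 0.03, 0.01, 0.00, 0.02, 0.89, 0.00 as on the slide
```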

Slide 30: Incorporating w/ Neural MT

  • Softmax bias: p(ei|hi) = softmax(W hi + b + log(lexi + ε))
  • Linear interpolation: p(ei|hi) = γ softmax(W hi + b) + (1-γ) lexi

The ε prevents -∞ scores when a lexicon probability is 0. (Both variants are sketched below.)
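A small numpy sketch of both variants; the ε and γ values are arbitrary here, and γ is fixed for simplicity:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def bias_method(W, b, h, lex, eps=1e-6):
    # Add the log lexicon probability as an extra bias inside the softmax;
    # eps keeps log(0) from producing -inf scores.
    return softmax(W @ h + b + np.log(lex + eps))

def interpolation_method(W, b, h, lex, gamma=0.5):
    # Mix the NMT distribution and the lexicon distribution directly.
    return gamma * softmax(W @ h + b) + (1.0 - gamma) * lex

rng = np.random.default_rng(0)
V, H = 6, 8
W, b, h = rng.normal(size=(V, H)), np.zeros(V), rng.normal(size=H)
lex = np.array([0.03, 0.01, 0.00, 0.02, 0.89, 0.00])   # from the previous slide
lex = lex / lex.sum()
print(bias_method(W, b, h, lex).round(3))
print(interpolation_method(W, b, h, lex).round(3))
```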

Slide 31: Summary of External Lexicons

  • Train time efficiency: Worse
  • Test time efficiency: Worse
  • Number of parameters: Same
  • Test time accuracy: Better to Much Better
  • Code complexity: High
Slide 32: Other Varieties of Biases

  • Copying source words as-is
  • Remembering and copying target words

These were called cache models; now they are called pointer (sentinel) models :)

Gu et al. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning; Gulcehre et al. 2016. Pointing the Unknown Words; Merity et al. 2016. Pointer Sentinel Mixture Models

Slide 33: Use of External Phrase Tables

Tang et al. 2016. NMT with External Phrase Memory

Slide 34: Conclusion

Slide 35: Conclusion

  • Lots of softmax alternatives for neural MT → Consider them in your systems!
  • But there is no method that is fast at training time, fast at test time, accurate, small, and simple → Consider making one yourself!