

  1. Neural Networks Language Models. Philipp Koehn, 1 October 2020.

  2. N-Gram Backoff Language Model
  • Previously, we approximated p(W) = p(w_1, w_2, ..., w_n)
  • ... by applying the chain rule: p(W) = ∏_i p(w_i | w_1, ..., w_{i-1})
  • ... and limiting the history (Markov order): p(w_i | w_1, ..., w_{i-1}) ≃ p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})
  • Each p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) may not have enough statistics to estimate
    → we back off to p(w_i | w_{i-3}, w_{i-2}, w_{i-1}), p(w_i | w_{i-2}, w_{i-1}), etc., all the way to p(w_i)
    – exact details of backing off get complicated: "interpolated Kneser-Ney" (a simplified code sketch follows below)
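The backoff idea can be sketched in a few lines of Python. Note this uses a simple "stupid backoff" scheme with a fixed penalty, not the interpolated Kneser-Ney smoothing mentioned on the slide, and the toy corpus is invented for illustration.

```python
from collections import defaultdict

def train_counts(corpus, order=5):
    """Collect n-gram and context counts for all orders 1..order."""
    ngram_counts, context_counts = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] * (order - 1) + sentence + ["</s>"]
        for i in range(order - 1, len(tokens)):
            for n in range(1, order + 1):
                ngram = tuple(tokens[i - n + 1 : i + 1])
                ngram_counts[ngram] += 1
                context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

def backoff_score(word, history, ngram_counts, context_counts, alpha=0.4):
    """Score p(word | history), backing off to shorter histories with penalty alpha."""
    penalty, history = 1.0, tuple(history)
    while True:
        ngram = history + (word,)
        if ngram_counts.get(ngram, 0) > 0:
            return penalty * ngram_counts[ngram] / context_counts[history]
        if not history:
            return penalty * 1e-7      # floor for words never seen in training
        history = history[1:]          # drop the oldest word and back off
        penalty *= alpha

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
ngrams, contexts = train_counts(corpus, order=3)
print(backoff_score("sat", ("the", "dog"), ngrams, contexts))
```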

  3. Refinements
  • A whole family of back-off schemes
  • Skip-n-gram models that may back off to p(w_i | w_{i-2})
  • Class-based models p(C(w_i) | C(w_{i-4}), C(w_{i-3}), C(w_{i-2}), C(w_{i-1}))
  ⇒ We are wrestling here with
    – using as much relevant evidence as possible
    – pooling evidence between words

  4. First Sketch
  (figure: the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} feed a feed-forward hidden layer h, followed by a softmax over the output word w_i)

  5. Representing Words
  • Words are represented with a one-hot vector, e.g.,
    – dog = (0,0,0,0,1,0,0,0,0,...)
    – cat = (0,0,0,0,0,0,0,1,0,...)
    – eat = (0,1,0,0,0,0,0,0,0,...)
  • That's a large vector!
  • Remedies (see the sketch below)
    – limit to, say, 20,000 most frequent words, rest are OTHER
    – place words in √n classes, so each word is represented by
      ∗ 1 class label
      ∗ 1 word-in-class label
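A minimal sketch of the one-hot and two-hot representations described above; the toy vocabulary and class count are invented for illustration.

```python
import math
import numpy as np

vocab = ["the", "cat", "dog", "eat", "sat", "OTHER"]      # toy vocabulary; real ones have ~20,000 words
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector: all zeros except a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id.get(word, word_to_id["OTHER"])] = 1.0    # unknown words map to OTHER
    return v

# Two-hot representation: roughly sqrt(n) classes, each word is (class label, word-in-class label).
num_classes = max(1, math.isqrt(len(vocab)))
class_size = math.ceil(len(vocab) / num_classes)

def two_hot(word):
    idx = word_to_id.get(word, word_to_id["OTHER"])
    return divmod(idx, class_size)                        # (class label, position within class)

print(one_hot("dog"))   # [0. 0. 1. 0. 0. 0.]
print(two_hot("dog"))   # (0, 2): class 0, third word in that class
```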

  6. Word Classes for Two-Hot Representations
  • WordNet classes
  • Brown clusters
  • Frequency binning (a sketch follows below)
    – sort words by frequency
    – place them in order into classes
    – each class has the same token count
    → very frequent words have their own class
    → rare words share a class with many other words
  • Anything goes: assign words randomly to classes
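A small sketch of the frequency-binning idea, with an invented word list and class count: sort words by frequency, then fill classes so each covers roughly the same share of the running token count.

```python
from collections import Counter

def frequency_bins(corpus_tokens, num_classes):
    """Assign words to classes so each class covers roughly the same
    number of running tokens (not the same number of word types)."""
    counts = Counter(corpus_tokens)
    budget = sum(counts.values()) / num_classes      # token mass per class
    classes, current, mass = {}, 0, 0
    for word, count in counts.most_common():         # most frequent words first
        classes[word] = current
        mass += count
        if mass >= budget and current < num_classes - 1:
            current += 1                              # start filling the next class
            mass = 0
    return classes

tokens = ["the"] * 50 + ["cat"] * 20 + ["dog"] * 20 + ["sat"] * 5 + ["mat"] * 5
print(frequency_bins(tokens, num_classes=3))
# "the" ends up alone in class 0; rarer words share the later classes
```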

  7. word embeddings

  8. Add a Hidden Layer
  (figure: the history words w_{i-4}, ..., w_{i-1} are each embedded via a shared matrix E, the embeddings feed a feed-forward hidden layer h, and a softmax predicts the output word w_i)
  • Map each word first into a lower-dimensional real-valued space
  • Shared weight matrix E (see the sketch below)
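A tiny illustration (numpy, with arbitrary sizes) of the shared weight matrix E: the same matrix embeds every history position before the vectors are concatenated and passed to the hidden layer.

```python
import numpy as np

vocab_size, embed_dim = 20000, 128
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # one shared embedding matrix E

history = [17, 421, 9, 1034]           # word ids for w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}
embedded = [E[w] for w in history]     # the same E is used for every position
x = np.concatenate(embedded)           # concatenated input to the hidden layer
print(x.shape)                         # (512,) = 4 * embed_dim
```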

  9. Details (Bengio et al., 2003)
  • Add direct connections from embedding layer to output layer
  • Activation functions
    – input → embedding: none
    – embedding → hidden: tanh
    – hidden → output: softmax
  • Training (a sketch follows below)
    – loop through the entire corpus
    – update weights using the error between the predicted probabilities and the 1-hot vector of the output word
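A compact sketch of this architecture in PyTorch, with made-up hyperparameters and toy data: shared embeddings, a tanh hidden layer, direct embedding-to-output connections, and a softmax output trained with cross-entropy against the next word (PyTorch's CrossEntropyLoss applies the softmax internally).

```python
import torch
import torch.nn as nn

class BengioLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, context=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # shared matrix E
        self.hidden = nn.Linear(context * embed_dim, hidden_dim)  # embedding -> hidden (tanh)
        self.out = nn.Linear(hidden_dim, vocab_size)              # hidden -> output
        self.direct = nn.Linear(context * embed_dim, vocab_size)  # direct embedding -> output

    def forward(self, history):                  # history: (batch, context) word ids
        e = self.embed(history).flatten(1)       # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h) + self.direct(e)      # logits; softmax is applied inside the loss

vocab_size = 1000
model = BengioLM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                  # cross-entropy against the 1-hot target

corpus = torch.randint(0, vocab_size, (500,))    # stand-in for real word ids
for i in range(4, len(corpus)):                  # predict w_i from (w_{i-4}, ..., w_{i-1})
    history = corpus[i - 4:i].unsqueeze(0)       # shape (1, 4)
    target = corpus[i].unsqueeze(0)              # shape (1,)
    loss = loss_fn(model(history), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```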

  10. Word Embeddings
  (figure: the word embedding C maps a word into a continuous space)
  • By-product: embedding of words into a continuous space
  • Similar contexts → similar embedding
  • Recall: distributional semantics

  11. Word Embeddings
  (figure only)

  12. Word Embeddings
  (figure only)

  13. Are Word Embeddings Magic?
  • Morphosyntactic regularities (Mikolov et al., 2013)
    – adjectives: base form vs. comparative, e.g., good, better
    – nouns: singular vs. plural, e.g., year, years
    – verbs: present tense vs. past tense, e.g., see, saw
  • Semantic regularities
    – clothing is to shirt as dish is to bowl
    – evaluated on human judgment data of semantic similarities
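A sketch of how such regularities are probed with vector offsets and cosine similarity. The embedding file `vectors.txt` and its format are placeholders for some set of pretrained embeddings, not a specific resource.

```python
import numpy as np

def load_vectors(path):
    """Hypothetical loader: one word per line, the word followed by its vector components."""
    vecs = {}
    with open(path) as f:
        for line in f:
            word, *nums = line.split()
            vecs[word] = np.array(nums, dtype=float)
    return vecs

def analogy(vecs, a, b, c, topn=1):
    """Answer 'a is to b as c is to ?' via the offset b - a + c, ranked by cosine similarity."""
    query = vecs[b] - vecs[a] + vecs[c]
    query /= np.linalg.norm(query)
    scores = {w: float(v @ query / np.linalg.norm(v))
              for w, v in vecs.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

vecs = load_vectors("vectors.txt")            # hypothetical pretrained embeddings
print(analogy(vecs, "good", "better", "bad")) # hoped-for answer: ["worse"]
print(analogy(vecs, "year", "years", "law"))  # hoped-for answer: ["laws"]
```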

  14. recurrent neural networks

  15. Recurrent Neural Networks
  (figure: w_1 is embedded and, together with an extra initial layer, feeds a tanh hidden layer; a softmax predicts the output word)
  • Start: predict second word from first
  • Mystery layer with nodes all with value 1

  16. Recurrent Neural Networks
  (figure: the network unrolled over two words w_1, w_2; the hidden layer state is copied forward from the first step to the second)

  17. Recurrent Neural Networks
  (figure: the network unrolled over three words w_1, w_2, w_3, with the hidden state copied forward at each step)
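A minimal numpy sketch (arbitrary dimensions, random weights) of the recurrence drawn in these figures: at each step the hidden state is computed from the current word's embedding and the copied-forward previous hidden state, and a softmax over the vocabulary predicts the next word.

```python
import numpy as np

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
rng = np.random.default_rng(0)
E   = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # embedding matrix
W_e = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))   # embedding -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (the "copy" arrow)
W_o = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sentence = [17, 42, 7, 99]                        # word ids w_1, w_2, w_3, w_4
h = np.zeros(hidden_dim)                          # initial hidden state
for t in range(len(sentence) - 1):
    h = np.tanh(W_e @ E[sentence[t]] + W_h @ h)   # recurrence
    p_next = softmax(W_o @ h)                     # distribution over the next word
    print(f"p(w_{t+2} = {sentence[t+1]}) =", p_next[sentence[t+1]])
```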

  18. Training
  (figure: the first training example; the embedded word and a zero initial state feed the RNN hidden layer h_t, and the softmax output y_t is compared to the correct next word to compute the cost)
  • Process first training example
  • Update weights with back-propagation

  19. Training
  (figure: the second training example, with the RNN hidden state carried over from the first step)
  • Process second training example
  • Update weights with back-propagation
  • And so on...
  • But: no feedback to previous history

  20. Back-Propagation Through Time
  (figure: the network unfolded over three time steps w_1, w_2, w_3, with a softmax output and a cost at each step)
  • After processing a few training examples, update through the unfolded recurrent neural network

  21. Back-Propagation Through Time
  • Carry out back-propagation through time (BPTT) after each training example
    – 5 time steps seems to be sufficient
    – network learns to store information for more than 5 time steps
  • Or: update in mini-batches (see the sketch below)
    – process 10-20 training examples
    – update backwards through all examples
    – removes need for multiple steps for each training example
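A sketch of truncated back-propagation through time in PyTorch (window length and model sizes are arbitrary): the RNN is unrolled over a short window, gradients flow back through that window only, and the hidden state is detached before the next window so no gradient reaches earlier history.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, bptt_len = 1000, 32, 64, 5

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)     # plain tanh RNN
out = nn.Linear(hidden_dim, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

corpus = torch.randint(0, vocab_size, (1, 501))           # toy word-id stream (batch of 1)
hidden = torch.zeros(1, 1, hidden_dim)                    # initial hidden state

for start in range(0, corpus.size(1) - 1 - bptt_len, bptt_len):
    inputs = corpus[:, start:start + bptt_len]            # w_t ... w_{t+4}
    targets = corpus[:, start + 1:start + bptt_len + 1]   # the words to predict
    hidden = hidden.detach()                              # truncate: no gradient into earlier windows
    output, hidden = rnn(embed(inputs), hidden)
    loss = loss_fn(out(output).reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```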

  22. long short term memory

  23. Vanishing Gradients
  • Error is propagated to previous steps
  • Updates consider
    – prediction at that time step
    – impact on future time steps
  • Vanishing gradient: propagated error disappears

  24. Recent vs. Early History
  • Hidden layer plays double duty
    – memory of the network
    – continuous space representation used to predict output words
  • Sometimes only recent context important
    After much economic progress over the years, the country → has
  • Sometimes much earlier context important
    The country which has made much economic progress over the years still → has

  25. Long Short-Term Memory (LSTM)
  • Design quite elaborate, although not very complicated to use
  • Basic building block: LSTM cell
    – similar to a node in a hidden layer
    – but: has an explicit memory state
  • Output and memory state change depends on gates
    – input gate: how much new input changes the memory state
    – forget gate: how much of the prior memory state is retained
    – output gate: how strongly the memory state is passed on to the next layer
  • Gates can be not just open (1) or closed (0), but also slightly ajar (e.g., 0.2)

  26. LSTM Cell
  (figure: an LSTM cell with input, forget, and output gates; the memory state m from time t-1 is combined with input X from the preceding layer and passed, gated, to the output h for the next layer and to the cell at time t)

  27. LSTM Cell (Math)
  • Memory and output values at time step t:
    memory_t = gate_input × input_t + gate_forget × memory_{t-1}
    output_t = gate_output × memory_t
  • Hidden node value h_t passed on to the next layer applies an activation function f:
    h_t = f(output_t)
  • Input is computed as in a regular recurrent neural network node
    – given node values for the prior layer x^t = (x^t_1, ..., x^t_X)
    – given values for the hidden layer from the previous time step h^{t-1} = (h^{t-1}_1, ..., h^{t-1}_H)
    – the input value is a combination of matrix multiplications with weights w^x and w^h and an activation function g:
      input_t = g( Σ_{i=1}^X w^x_i x^t_i + Σ_{i=1}^H w^h_i h^{t-1}_i )

  28. Values for Gates
  • Gates are very important
  • How do we compute their value? → with a neural network layer!
  • For each gate a ∈ {input, forget, output}
    – a weight matrix W^{xa} to consider node values in the previous layer x^t
    – a weight matrix W^{ha} to consider the hidden layer h^{t-1} at the previous time step
    – a weight matrix W^{ma} to consider the memory^{t-1} at the previous time step
    – an activation function h:
      gate_a = h( Σ_{i=1}^X w^{xa}_i x^t_i + Σ_{i=1}^H w^{ha}_i h^{t-1}_i + Σ_{i=1}^H w^{ma}_i memory^{t-1}_i )
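A numpy sketch of a single LSTM cell following the equations on the last two slides. Dimensions and weight initialization are arbitrary, and the activation choices (sigmoid for the gate activation h, tanh for f and g) are assumptions, since the slides leave them open.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, H = 8, 16                          # size of the prior layer and of the hidden/memory layer
rng = np.random.default_rng(0)
def mat(rows, cols):                  # small random weight matrix
    return rng.normal(scale=0.1, size=(rows, cols))

W_x, W_h = mat(H, X), mat(H, H)       # weights for the cell input (as in a plain RNN node)
gates = ("input", "forget", "output")
W_xa = {a: mat(H, X) for a in gates}  # per-gate weights for the prior layer
W_ha = {a: mat(H, H) for a in gates}  # ... for the previous hidden state
W_ma = {a: mat(H, H) for a in gates}  # ... for the previous memory state

def lstm_cell(x_t, h_prev, memory_prev):
    """One LSTM step: compute the three gates, then update memory and output."""
    gate = {a: sigmoid(W_xa[a] @ x_t + W_ha[a] @ h_prev + W_ma[a] @ memory_prev)
            for a in gates}
    cell_input = np.tanh(W_x @ x_t + W_h @ h_prev)                  # input_t (activation g = tanh)
    memory = gate["input"] * cell_input + gate["forget"] * memory_prev
    output = gate["output"] * memory
    h_t = np.tanh(output)                                           # activation f = tanh
    return h_t, memory

h, memory = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, X)):   # five arbitrary input vectors
    h, memory = lstm_cell(x_t, h, memory)
print(h.shape, memory.shape)          # (16,) (16,)
```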
