SLIDE 1

Neural Networks Language Models

Philipp Koehn 1 October 2020

SLIDE 2

N-Gram Backoff Language Model

  • Previously, we approximated

p(W) = p(w1, w2, ..., wn)

  • ... by applying the chain rule

p(W) = ∏i p(wi|w1, ..., wi−1)

  • ... and limiting the history (Markov order)

p(wi|w1, ..., wi−1) ≃ p(wi|wi−4, wi−3, wi−2, wi−1)

  • We may not have enough statistics to reliably estimate each p(wi|wi−4, wi−3, wi−2, wi−1)

→ we back off to p(wi|wi−3, wi−2, wi−1), p(wi|wi−2, wi−1), etc., all the way to p(wi)
– exact details of backing off get complicated: "interpolated Kneser-Ney"
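To make the backoff idea concrete, here is a minimal Python sketch. It implements a simplified "stupid backoff"-style scheme with a single penalty weight rather than interpolated Kneser-Ney (whose discounting details the slide alludes to); the counts table and the toy numbers are made up for illustration.

from collections import Counter

def backoff_prob(word, history, counts, alpha=0.4):
    # P(word | history) with a fixed backoff penalty alpha ("stupid backoff" style),
    # a simplified stand-in for the interpolated Kneser-Ney scheme mentioned above
    if not history:                                        # base case: unigram
        total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
        return counts.get((word,), 0) / total if total else 0.0
    hist_count = counts.get(tuple(history), 0)
    ngram_count = counts.get(tuple(history) + (word,), 0)
    if ngram_count > 0:
        return ngram_count / hist_count                    # enough evidence: use full history
    return alpha * backoff_prob(word, history[1:], counts, alpha)   # back off to shorter history

# toy counts (made up)
counts = Counter({("the",): 3, ("cat",): 2, ("sat",): 1, ("the", "cat"): 2, ("cat", "sat"): 1})
print(backoff_prob("cat", ["the"], counts))   # seen bigram: 2/3
print(backoff_prob("sat", ["the"], counts))   # unseen bigram: backs off to alpha * p(sat)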

SLIDE 3

Refinements

  • A whole family of back-off schemes
  • Skip-n gram models that may back off to p(wi|wi−2)
  • Class-based models p(C(wi)|C(wi−4), C(wi−3), C(wi−2), C(wi−1))

⇒ We are wrestling here with
– using as much relevant evidence as possible
– pooling evidence between words

SLIDE 4

First Sketch

[Figure: history words wi−4, wi−3, wi−2, wi−1 → feed-forward (FF) hidden layer h → softmax → output word wi]

SLIDE 5

Representing Words

  • Words are represented with a one-hot vector, e.g.,

– dog = (0,0,0,0,1,0,0,0,0,....)
– cat = (0,0,0,0,0,0,0,1,0,....)
– eat = (0,1,0,0,0,0,0,0,0,....)

  • That’s a large vector!
  • Remedies

– limit to, say, 20,000 most frequent words, rest are OTHER
– place words in √n classes, so each word is represented by
∗ 1 class label
∗ 1 word-in-class label
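A minimal Python sketch of the two representations, using a made-up toy vocabulary and a made-up class assignment: a plain one-hot vector over the vocabulary, and the "two-hot" alternative that pairs a one-hot class vector with a one-hot word-within-class vector.

import numpy as np

vocab = ["the", "dog", "cat", "eat", "sat"]              # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# two-hot: a word is represented by (class label, position within its class)
word2class = {"the": 0, "dog": 1, "cat": 1, "eat": 2, "sat": 2}      # illustrative classes
class_members = {0: ["the"], 1: ["dog", "cat"], 2: ["eat", "sat"]}

def two_hot(word, num_classes=3, max_class_size=2):
    c = word2class[word]
    class_vec = np.zeros(num_classes)
    class_vec[c] = 1.0
    within_vec = np.zeros(max_class_size)
    within_vec[class_members[c].index(word)] = 1.0
    return class_vec, within_vec

print(one_hot("cat"))    # [0. 0. 1. 0. 0.]
print(two_hot("cat"))    # class 1, second word within that class

With a vocabulary of n words, the two vectors together have only about 2√n entries instead of n.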

SLIDE 6

Word Classes for Two-Hot Representations

  • WordNet classes
  • Brown clusters
  • Frequency binning

– sort words by frequency
– place them in order into classes
– each class has same token count (see the sketch after this list)
→ very frequent words have their own class
→ rare words share class with many other words

  • Anything goes: assign words randomly to classes
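A minimal Python sketch of the frequency-binning scheme from the list above, with made-up word frequencies: sort words by frequency and cut them into classes of roughly equal total token count, so frequent words end up (nearly) alone in a class and rare words share one.

from collections import Counter

# toy corpus frequencies (made up)
freq = Counter({"the": 100, "of": 80, "dog": 10, "cat": 9, "eat": 5, "sat": 4, "zyzzyva": 1})

def frequency_bins(freq, num_classes=3):
    # assign words to classes with roughly equal total token count per class
    words = [w for w, _ in freq.most_common()]        # sorted by frequency, descending
    target = sum(freq.values()) / num_classes         # tokens per class
    classes, current, current_count = [], [], 0
    for w in words:
        current.append(w)
        current_count += freq[w]
        if current_count >= target and len(classes) < num_classes - 1:
            classes.append(current)
            current, current_count = [], 0
    classes.append(current)
    return classes

for i, c in enumerate(frequency_bins(freq)):
    print(i, c)
# the two most frequent words each get their own class; all rare words share the last one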

SLIDE 7

word embeddings

SLIDE 8

Add a Hidden Layer

[Figure: history words wi−4, wi−3, wi−2, wi−1 → embedding layer (shared Embed, embedding E w) → feed-forward hidden layer h → softmax → output word wi]

  • Map each word first into a lower-dimensional real-valued space
  • Shared weight matrix E
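A minimal numpy sketch (dimensions made up) of the shared embedding matrix E: multiplying a one-hot vector by E just selects one row, so in practice the embedding layer is implemented as row indexing, and the same E is reused for every history position.

import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))    # shared embedding matrix E

def embed(word_id):
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0
    return one_hot @ E                          # equivalent to E[word_id]

word_id = 2
assert np.allclose(embed(word_id), E[word_id])
# the same matrix E is applied to all four history words
history_ids = [0, 3, 1, 2]                      # w_{i-4} ... w_{i-1}
history_embeddings = np.concatenate([E[i] for i in history_ids])
print(history_embeddings.shape)                 # (12,) = 4 words × 3 dimensions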

SLIDE 9

Details (Bengio et al., 2003)

  • Add direct connections from embedding layer to output layer
  • Activation functions

– input→embedding: none
– embedding→hidden: tanh
– hidden→output: softmax

  • Training

– loop through the entire corpus
– update weights based on the difference between the predicted probabilities and the 1-hot vector for the output word
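A minimal numpy sketch of the forward pass of this model, with made-up sizes: no activation from input to embedding, tanh into the hidden layer, softmax at the output, plus the direct embedding-to-output connections mentioned above. Training (the backward pass) is omitted.

import numpy as np

rng = np.random.default_rng(0)
V, d, H, n = 1000, 30, 50, 4           # vocab size, embedding dim, hidden size, history length

E  = rng.normal(0, 0.1, (V, d))        # shared embedding matrix
W1 = rng.normal(0, 0.1, (n * d, H))    # embedding -> hidden
W2 = rng.normal(0, 0.1, (H, V))        # hidden -> output
Wd = rng.normal(0, 0.1, (n * d, V))    # direct embedding -> output connections

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(history_ids):
    x = np.concatenate([E[i] for i in history_ids])      # no activation input->embedding
    h = np.tanh(x @ W1)                                   # tanh embedding->hidden
    return softmax(h @ W2 + x @ Wd)                       # softmax at the output

p = predict([3, 17, 42, 8])
print(p.shape, p.sum())                                   # (1000,) 1.0 (up to rounding)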

SLIDE 10

Word Embeddings

[Figure: word embedding (matrix C)]

  • By-product: embedding of word into continuous space
  • Similar contexts → similar embedding
  • Recall: distributional semantics

SLIDE 11

Word Embeddings

SLIDE 12

Word Embeddings

SLIDE 13

Are Word Embeddings Magic?

  • Morphosyntactic regularities (Mikolov et al., 2013)

– adjectives base form vs. comparative, e.g., good, better
– nouns singular vs. plural, e.g., year, years
– verbs present tense vs. past tense, e.g., see, saw

  • Semantic regularities

– clothing is to shirt as dish is to bowl
– evaluated on human judgment data of semantic similarities
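A minimal Python sketch of the vector-offset analogy test behind these regularities. The embeddings here are random placeholders, so the printed answers are meaningless; with trained embeddings, the nearest neighbour of emb(better) − emb(good) + emb(year) tends to be "years", and similarly for the semantic analogies.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["good", "better", "year", "years", "see", "saw", "shirt", "clothing", "dish", "bowl"]
emb = {w: rng.normal(size=50) for w in vocab}       # placeholder vectors, not trained

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    # a is to b as c is to ? -> nearest word to emb[b] - emb[a] + emb[c]
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("good", "better", "year"))       # with trained embeddings: typically "years"
print(analogy("clothing", "shirt", "dish"))    # with trained embeddings: typically "bowl"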

SLIDE 14

recurrent neural networks

SLIDE 15

Recurrent Neural Networks

[Figure: word w1 → Embed → hidden layer (tanh) with a History input → softmax → output word]

  • Start: predict second word from first
  • Mystery layer with nodes all with value 1

SLIDE 16

Recurrent Neural Networks

[Figure: the hidden layer after w1 is copied and fed, together with the embedding of w2, into the hidden layer at the next time step]

SLIDE 17

Recurrent Neural Networks

[Figure: network unrolled over w1, w2, w3 — at each step the hidden layer is copied into the next step's history]
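A minimal numpy sketch of the recurrence shown in these figures (sizes made up): at each step the copied-over hidden state and the current word's embedding are combined through tanh, and a softmax over the vocabulary predicts the next word. The initial history is a vector of ones, as on the first slide of this sequence.

import numpy as np

rng = np.random.default_rng(0)
V, d, H = 1000, 30, 50
E  = rng.normal(0, 0.1, (V, d))        # embedding matrix
Wx = rng.normal(0, 0.1, (d, H))        # embedding -> hidden
Wh = rng.normal(0, 0.1, (H, H))        # previous hidden ("copy") -> hidden
Wo = rng.normal(0, 0.1, (H, V))        # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.ones(H)                          # "mystery layer with nodes all with value 1"
for w in [3, 17, 42]:                   # w1, w2, w3 (toy word ids)
    h = np.tanh(E[w] @ Wx + h @ Wh)     # new hidden state from embedding + copied history
    p_next = softmax(h @ Wo)            # distribution over the next word
    print(p_next.argmax())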

SLIDE 18

Training

[Figure: first training example — w1 → embedding E wt → RNN hidden state ht → softmax output yt → cost against the correct next word]

  • Process first training example
  • Update weights with back-propagation

SLIDE 19

Training

[Figure: second training example — w2 → embedding → RNN (with the copied history) → softmax output → cost]

  • Process second training example
  • Update weights with back-propagation
  • And so on...
  • But: no feedback to previous history

SLIDE 20

Back-Propagation Through Time

[Figure: network unrolled over w1, w2, w3, with a cost computed at each softmax output]

  • After processing a few training examples, update through the unfolded recurrent neural network

SLIDE 21

Back-Propagation Through Time

  • Carry out back-propagation through time (BPTT) after each training example

– 5 time steps seems to be sufficient
– network learns to store information for more than 5 time steps

  • Or: update in mini-batches

– process 10-20 training examples
– update backwards through all examples
– removes need for multiple steps for each training example
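A minimal PyTorch sketch of truncated back-propagation through time on a made-up random corpus: the hidden state is detached every k steps, so gradients flow through at most k unrolled steps; the mini-batch variant described above would process several such chunks before each update. Module choices and sizes are illustrative.

import torch
import torch.nn as nn

V, d, H, k = 1000, 32, 64, 5                     # vocab, embedding, hidden size, truncation length
emb = nn.Embedding(V, d)
rnn = nn.RNN(d, H, batch_first=True)
out = nn.Linear(H, V)
params = list(emb.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, V, (1, 101))           # toy corpus: one long random sequence
h = torch.zeros(1, 1, H)                         # initial hidden state

for t in range(0, 100, k):
    x = tokens[:, t:t + k]                       # inputs  w_t ... w_{t+k-1}
    y = tokens[:, t + 1:t + k + 1]               # targets w_{t+1} ... w_{t+k}
    h = h.detach()                               # cut the graph: back-propagate at most k steps
    output, h = rnn(emb(x), h)
    loss = loss_fn(out(output).reshape(-1, V), y.reshape(-1))
    opt.zero_grad()
    loss.backward()                              # BPTT through the k unrolled steps
    opt.step()
    print(t, loss.item())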

SLIDE 22

long short term memory

SLIDE 23

Vanishing Gradients

  • Error is propagated to previous steps
  • Updates consider

– prediction at that time step
– impact on future time steps

  • Vanishing gradient: propagated error disappears
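A minimal numeric illustration of the effect (the recurrent weight and activations are made up): the error signal reaching earlier time steps is multiplied at every step by the local derivative of tanh and the recurrent weight, so it shrinks roughly geometrically with the number of steps.

import numpy as np

w = 0.5                                  # recurrent weight (illustrative)
grad = 1.0                               # error signal at the final time step
for step in range(1, 21):
    a = 0.8                              # pretend pre-activation at this step
    grad *= w * (1 - np.tanh(a) ** 2)    # chain rule through tanh and the recurrent weight
    if step in (1, 5, 10, 20):
        print(step, grad)
# the propagated error becomes vanishingly small after a few dozen steps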

SLIDE 24

Recent vs. Early History

  • Hidden layer plays double duty

– memory of the network
– continuous space representation used to predict output words

  • Sometimes only recent context important

After much economic progress over the years, the country → has

  • Sometimes much earlier context important

The country which has made much economic progress over the years still → has

SLIDE 25

Long Short Term Memory (LSTM)

  • Design quite elaborate, although not very complicated to use
  • Basic building block: LSTM cell

– similar to a node in a hidden layer
– but: has an explicit memory state

  • Output and memory state change depends on gates

– input gate: how much new input changes memory state
– forget gate: how much of prior memory state is retained
– output gate: how strongly memory state is passed on to next layer

  • Gates can be not just open (1) or closed (0), but also slightly ajar (e.g., 0.2)

SLIDE 26

LSTM Cell

[Figure: LSTM cell — input X from the preceding layer is scaled by the input gate; the memory m from LSTM layer time t−1 is scaled by the forget gate and added (⊕) to give the new memory m, which is scaled by the output gate to produce h, passed on as Y to the next layer]

SLIDE 27

LSTM Cell (Math)

  • Memory and output values at time step t

memory_t = gate_input × input_t + gate_forget × memory_{t−1}
output_t = gate_output × memory_t

  • Hidden node value h_t passed on to the next layer applies activation function f

h_t = f(output_t)

  • Input computed as in a recurrent neural network node

– given node values for the prior layer x_t = (x_t^1, ..., x_t^X)
– given values for the hidden layer from the previous time step h_{t−1} = (h_{t−1}^1, ..., h_{t−1}^H)
– input value is a combination of matrix multiplication with weights w^x and w^h and activation function g

input_t = g( Σ_{i=1..X} w^x_i x_t^i + Σ_{i=1..H} w^h_i h_{t−1}^i )
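A minimal numpy sketch of exactly these update equations for a vector of cells, with the gate values simply given as numbers (how the gates themselves are computed is the topic of the "Values for Gates" slide below); both f and g are taken to be tanh here, which is an assumption.

import numpy as np

def lstm_cell_step(x, h_prev, memory_prev, w_x, w_h,
                   gate_input, gate_forget, gate_output):
    # one LSTM update following the slide's equations, with scalar gate values given
    input_t = np.tanh(w_x @ x + w_h @ h_prev)                  # g: combine x_t and h_{t-1}
    memory_t = gate_input * input_t + gate_forget * memory_prev
    output_t = gate_output * memory_t
    h_t = np.tanh(output_t)                                    # f: activation on the way out
    return h_t, memory_t

rng = np.random.default_rng(0)
x, h_prev = rng.normal(size=4), rng.normal(size=3)
w_x, w_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
h, m = lstm_cell_step(x, h_prev, memory_prev=np.zeros(3), w_x=w_x, w_h=w_h,
                      gate_input=0.9, gate_forget=0.5, gate_output=1.0)
print(h, m)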

SLIDE 28

Values for Gates

  • Gates are very important
  • How do we compute their value?

→ with a neural network layer!

  • For each gate a ∈ (input, forget, output)

– weight matrix W^xa to consider node values in the previous layer x_t
– weight matrix W^ha to consider the hidden layer h_{t−1} at the previous time step
– weight matrix W^ma to consider the memory memory_{t−1} at the previous time step
– activation function h

gate_a = h( Σ_{i=1..X} w^xa_i x_t^i + Σ_{i=1..H} w^ha_i h_{t−1}^i + Σ_{i=1..H} w^ma_i memory_{t−1}^i )
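A minimal numpy sketch combining this slide with the previous one: each gate has its own weight matrices over x_t, h_{t−1}, and memory_{t−1}, and the gate activation h is taken to be the sigmoid so the gate values land between 0 and 1 (an assumption, consistent with standard LSTMs); sizes are made up.

import numpy as np

rng = np.random.default_rng(1)
X, H = 4, 3                                   # input size, number of LSTM cells

def mats():
    return rng.normal(0, 0.5, (H, X)), rng.normal(0, 0.5, (H, H)), rng.normal(0, 0.5, (H, H))

W = {a: mats() for a in ("input", "forget", "output")}   # (W^xa, W^ha, W^ma) per gate
w_x, w_h = rng.normal(0, 0.5, (H, X)), rng.normal(0, 0.5, (H, H))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gate(a, x, h_prev, memory_prev):
    W_xa, W_ha, W_ma = W[a]
    return sigmoid(W_xa @ x + W_ha @ h_prev + W_ma @ memory_prev)

def lstm_step(x, h_prev, memory_prev):
    g_i, g_f, g_o = (gate(a, x, h_prev, memory_prev) for a in ("input", "forget", "output"))
    input_t = np.tanh(w_x @ x + w_h @ h_prev)
    memory_t = g_i * input_t + g_f * memory_prev
    h_t = np.tanh(g_o * memory_t)
    return h_t, memory_t

h, m = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, X)):             # run a few time steps
    h, m = lstm_step(x, h, m)
print(h, m)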

SLIDE 29

Training

  • LSTMs are trained the same way as recurrent neural networks
  • Back-propagation through time
  • This looks all very complex, but:

– all the operations are still based on
∗ matrix multiplications
∗ differentiable activation functions
→ we can compute gradients for objective function with respect to all parameters
→ we can compute update functions

SLIDE 30

What is the Point?

(from Tran, Bisazza, Monz, 2016)

  • Each node has memory memory_i independent of its current output h_i
  • Memory may be carried through unchanged (gate_input^i = 0, gate_memory^i = 1)

⇒ can remember important features over long time span (capture long distance dependencies)

SLIDE 31

Visualizing Individual Cells

Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"

SLIDE 32

Visualizing Individual Cells

SLIDE 33

Gated Recurrent Unit (GRU)

[Figure: GRU cell — input x from the preceding layer and hidden state h from GRU layer time t−1 are combined under the control of an update gate and a reset gate; the new h is passed on as Y to the next layer]

SLIDE 34

Gated Recurrent Unit (Math)

  • Two Gates

update_t = g(W_update · input_t + U_update · state_{t−1} + bias_update)
reset_t = g(W_reset · input_t + U_reset · state_{t−1} + bias_reset)

  • Combination of input and previous state

(similar to a traditional recurrent neural network)

combination_t = f(W · input_t + U · (reset_t ◦ state_{t−1}))

  • Interpolation with previous state

state_t = (1 − update_t) ◦ state_{t−1} + update_t ◦ combination_t
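A minimal numpy sketch of these GRU equations (sizes made up; g is taken to be the sigmoid and f to be tanh, which is an assumption consistent with standard GRUs).

import numpy as np

rng = np.random.default_rng(2)
X, H = 4, 3
W_u, U_u, b_u = rng.normal(size=(H, X)), rng.normal(size=(H, H)), np.zeros(H)   # update gate
W_r, U_r, b_r = rng.normal(size=(H, X)), rng.normal(size=(H, H)), np.zeros(H)   # reset gate
W,   U        = rng.normal(size=(H, X)), rng.normal(size=(H, H))                # combination

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x, state_prev):
    update = sigmoid(W_u @ x + U_u @ state_prev + b_u)
    reset  = sigmoid(W_r @ x + U_r @ state_prev + b_r)
    combination = np.tanh(W @ x + U @ (reset * state_prev))
    return (1 - update) * state_prev + update * combination   # interpolate old and new state

state = np.zeros(H)
for x in rng.normal(size=(5, X)):        # run a few time steps
    state = gru_step(x, state)
print(state)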

SLIDE 35

deeper models

SLIDE 36

Deep Learning?

[Figure: "Shallow" — input word xt → embedding E → single RNN hidden layer ht → softmax output yt, unrolled over three time steps]

  • Not much deep learning so far
  • Between prediction from input to output: only 1 hidden layer
  • How about more hidden layers?

SLIDE 37

Deep Models

[Figure: two deep variants, "Deep Stacked" and "Deep Transitional" — input word xi → embedding E → hidden layers ht,1, ht,2, ht,3 → softmax output yt, unrolled over three time steps]
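A minimal numpy sketch of the deep stacked variant (sizes made up): each hidden layer keeps its own recurrent state, reads the layer below it as input, and the next word is predicted from the top layer.

import numpy as np

rng = np.random.default_rng(3)
V, d, H, L = 1000, 30, 50, 3                      # vocab, embedding, hidden size, depth
E  = rng.normal(0, 0.1, (V, d))
Wx = [rng.normal(0, 0.1, (d if l == 0 else H, H)) for l in range(L)]   # input from the layer below
Wh = [rng.normal(0, 0.1, (H, H)) for l in range(L)]                    # recurrence within each layer
Wo = rng.normal(0, 0.1, (H, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = [np.zeros(H) for _ in range(L)]               # one hidden state per layer
for w in [3, 17, 42]:                             # toy word ids
    below = E[w]
    for l in range(L):                            # stack: layer l reads layer l-1 and its own past
        h[l] = np.tanh(below @ Wx[l] + h[l] @ Wh[l])
        below = h[l]
    p_next = softmax(h[-1] @ Wo)                  # predict the next word from the top layer
print(p_next.argmax())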

SLIDE 38

questions?
