
Neural Networks Language Models

Philipp Koehn 16 April 2015


N-Gram Backoff Language Model

  • Previously, we approximated

p(W) = p(w1, w2, ..., wn)

  • ... by applying the chain rule

p(W) = ∏i p(wi|w1, ..., wi−1)

  • ... and limiting the history (Markov order)

p(wi|w1, ..., wi−1) ≃ p(wi|wi−4, wi−3, wi−2, wi−1)

  • Each p(wi|wi−4, wi−3, wi−2, wi−1) may not have enough statistics to estimate

→ we back off to p(wi|wi−3, wi−2, wi−1), p(wi|wi−2, wi−1), etc., all the way to p(wi)
– exact details of backing off get complicated, e.g., "interpolated Kneser-Ney"
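
To make the backoff idea concrete, here is a minimal Python sketch of a simplified ("stupid") backoff estimate with a fixed penalty factor; it is an illustration only, not interpolated Kneser-Ney, and the toy corpus and the 0.4 factor are assumptions:

```python
from collections import Counter

def backoff_prob(ngram, counts, alpha=0.4):
    """Simplified ("stupid") backoff: if the full n-gram is unseen, recurse on a
    shorter history and multiply by a fixed penalty alpha.
    This is only an illustration, not interpolated Kneser-Ney."""
    if len(ngram) == 1:
        total = sum(counts[1].values())
        return counts[1][ngram] / total if total else 0.0
    order, history = len(ngram), ngram[:-1]
    if counts[order][ngram] > 0 and counts[order - 1][history] > 0:
        return counts[order][ngram] / counts[order - 1][history]
    return alpha * backoff_prob(ngram[1:], counts, alpha)

# toy corpus: collect counts for all n-grams up to order 3
corpus = "the cat eats fish the dog eats meat the cat sleeps".split()
counts = {n: Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
          for n in (1, 2, 3)}
print(backoff_prob(("the", "cat", "eats"), counts))    # seen trigram: relative frequency
print(backoff_prob(("the", "dog", "sleeps"), counts))  # unseen: backs off twice to p(sleeps)
```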


Refinements

  • A whole family of back-off schemes
  • Skip n-gram models that may back off to p(wi|wi−2)
  • Class-based models p(C(wi)|C(wi−4), C(wi−3), C(wi−2), C(wi−1))

⇒ We are wrestling here with
– using as much relevant evidence as possible
– pooling evidence between words


First Sketch

[Diagram: Words 1–4 feed into a hidden layer that predicts Word 5]


Representing Words

  • Words are represented with a one-hot vector, e.g.,

– dog = (0,0,0,0,1,0,0,0,0,....)
– cat = (0,0,0,0,0,0,0,1,0,....)
– eat = (0,1,0,0,0,0,0,0,0,....)

  • That’s a large vector!
  • Remedies

– limit to, say, 20,000 most frequent words, rest are OTHER
– place words in √n classes, so each word is represented by
  ∗ 1 class label
  ∗ 1 word-in-class label
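
A minimal sketch of the one-hot representation with a limited vocabulary (the vocabulary size and the OTHER fallback index are illustrative assumptions):

```python
from collections import Counter
import numpy as np

def build_vocab(tokens, max_size=20000):
    """Keep the most frequent words; everything else maps to OTHER (index 0)."""
    most_common = [w for w, _ in Counter(tokens).most_common(max_size - 1)]
    return {"OTHER": 0, **{w: i + 1 for i, w in enumerate(most_common)}}

def one_hot(word, vocab):
    """One-hot vector: all zeros except a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.get(word, vocab["OTHER"])] = 1.0
    return v

vocab = build_vocab("the cat eats fish the dog eats meat".split(), max_size=6)
print(one_hot("cat", vocab))      # 1 at the index assigned to "cat"
print(one_hot("unicorn", vocab))  # unseen word falls back to OTHER
```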


Word Classes for Two-Hot Representations

  • WordNet classes
  • Brown clusters
  • Frequency binning

– sort words by frequency
– place them in order into classes
– each class has the same token count
→ very frequent words have their own class
→ rare words share a class with many other words
(a small sketch of this binning follows below)

  • Anything goes: assign words randomly to classes
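
A minimal sketch of the frequency-binning scheme from the list above (toy corpus; the number of classes is taken as √n of the vocabulary size):

```python
import math
from collections import Counter

def frequency_binning(tokens):
    """Assign words to ~sqrt(n) classes so each class covers about the same
    number of tokens: frequent words get (nearly) private classes, rare words share."""
    counts = Counter(tokens)
    num_classes = max(1, int(math.sqrt(len(counts))))
    tokens_per_class = len(tokens) / num_classes
    word_class, current_class, mass = {}, 0, 0
    for word, count in counts.most_common():          # most frequent first
        word_class[word] = current_class
        mass += count
        if mass >= tokens_per_class and current_class < num_classes - 1:
            current_class, mass = current_class + 1, 0
    return word_class

classes = frequency_binning("the the the the the cat dog eats fish meat".split())
print(classes)   # "the" fills class 0 by itself; the rare words share class 1
```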


Second Sketch

[Diagram: Words 1–4 (as vectors) feed into a hidden layer that predicts Word 5]


word embeddings


Add a Hidden Layer

[Diagram: each of Words 1–4 is mapped by a shared weight matrix C into an embedding layer, which feeds the hidden layer that predicts Word 5]

  • Map each word first into a lower-dimensional real-valued space
  • Shared weight matrix C
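
A minimal sketch of the shared mapping: multiplying a one-hot vector by C amounts to selecting a row of C, and the same matrix is reused for every context position (vocabulary size and embedding dimension below are arbitrary):

```python
import numpy as np

vocab_size, embed_dim = 20000, 30                   # illustrative sizes
C = np.random.randn(vocab_size, embed_dim) * 0.1    # shared weight matrix C

def embed_context(word_ids):
    """Map each context word to a low-dimensional real-valued vector.
    Multiplying a one-hot vector by C is just selecting the corresponding row of C."""
    return np.concatenate([C[i] for i in word_ids])

context = [17, 42, 256, 1024]       # indices of the four context words
x = embed_context(context)          # input to the hidden layer: 4 x 30 = 120 values
print(x.shape)                      # (120,)
```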


Details (Bengio et al., 2003)

  • Add direct connections from embedding layer to output layer
  • Activation functions

– input→embedding: none
– embedding→hidden: tanh
– hidden→output: softmax

  • Training

– loop through the entire corpus
– update weights from the error between the predicted probabilities and the 1-hot vector of the output word
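
A minimal sketch of the forward pass with these activation functions (layer sizes and initialization are arbitrary, and the direct embedding→output connections mentioned above are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n_context = 20000, 30, 100, 4             # vocab, embedding, hidden sizes
C = rng.normal(0, 0.1, (V, d))                     # embedding matrix
W_h = rng.normal(0, 0.1, (h, n_context * d))       # embedding -> hidden weights
W_o = rng.normal(0, 0.1, (V, h))                   # hidden -> output weights

def forward(word_ids):
    x = np.concatenate([C[i] for i in word_ids])   # input -> embedding: no activation
    hidden = np.tanh(W_h @ x)                      # embedding -> hidden: tanh
    scores = W_o @ hidden                          # hidden -> output
    p = np.exp(scores - scores.max())              # softmax over the vocabulary
    return p / p.sum()

p = forward([17, 42, 256, 1024])
print(p.shape, p.sum())                            # (20000,) 1.0
# Training (sketch): the gradient of the cross-entropy loss w.r.t. the scores is
# (p - one_hot(correct word)), which is then back-propagated through the layers.
```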


Word Embeddings

[Diagram: a one-hot word vector is mapped by the weight matrix C to its word embedding]

  • By-product: embedding of word into continuous space
  • Similar contexts → similar embedding
  • Recall: distributional semantics


Word Embeddings

[Two further slides of figures visualizing word embeddings]

Are Word Embeddings Magic?

  • Morphosyntactic regularities (Mikolov et al., 2013)

– adjectives: base form vs. comparative, e.g., good, better
– nouns: singular vs. plural, e.g., year, years
– verbs: present tense vs. past tense, e.g., see, saw

  • Semantic regularities

– clothing is to shirt as dish is to bowl
– evaluated on human judgment data of semantic similarities
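
Such regularities are commonly probed with vector offsets; a hedged sketch, assuming a trained embedding matrix and word↔index mappings already exist:

```python
import numpy as np

def analogy(a, b, c, embeddings, vocab, inv_vocab, top_k=1):
    """Find word(s) d such that a : b ~ c : d using the offset b - a + c and
    cosine similarity (in the style of Mikolov et al., 2013)."""
    query = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ (query / np.linalg.norm(query))
    for w in (a, b, c):                  # exclude the query words themselves
        sims[vocab[w]] = -np.inf
    return [inv_vocab[i] for i in np.argsort(-sims)[:top_k]]

# usage, assuming `embeddings` (|V| x d), `vocab` (word -> row index) and
# `inv_vocab` (row index -> word) come from a trained model:
# print(analogy("good", "better", "year", embeddings, vocab, inv_vocab))  # hope: "years"
```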


integration into machine translation systems


Reranking

  • First decode without neural network language model (NNLM)
  • Generate

– n-best list
– lattice

  • Score candidates with NNLM
  • Rerank (requires tuning a weight for the NNLM)
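
A minimal sketch of n-best reranking (the nnlm_logprob scorer and the feature weights are hypothetical placeholders; in practice the NNLM weight is tuned like any other model weight):

```python
def rerank(nbest, nnlm_logprob, weights):
    """nbest: list of (translation, feature_dict) pairs from the base decoder.
    Add the NNLM log-probability as one more feature, then sort by weighted score."""
    def score(entry):
        translation, features = entry
        features = dict(features, nnlm=nnlm_logprob(translation))   # add NNLM feature
        return sum(weights[name] * value for name, value in features.items())
    return sorted(nbest, key=score, reverse=True)

# usage with a hypothetical scorer and tuned weights:
# reranked = rerank(nbest_list, my_nnlm.logprob, {"tm": 1.0, "lm": 0.5, "nnlm": 0.3})
# best_translation = reranked[0][0]
```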


Computations During Inference

[Diagram: feed-forward NNLM as before; the mapping of the context words through the embedding matrices C is precomputed]


Computations During Inference

[Diagram: as above; in addition, the hidden layer values for a given context can be cached]


Computations During Inference

[Diagram: as above, with layer sizes annotated: embedding layer 4×30 nodes, hidden layer 100 nodes, output layer 1,000,000 nodes; weight matrices 4×30×100 (embedding→hidden) and 100×1,000,000 (hidden→output). The embedding computations are precomputed, the hidden layer values can be cached, and at the output layer only the score for the predicted word is computed, using a 100×1 slice of the output weights.]
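
A minimal sketch of these savings, using smaller illustrative sizes than the 1,000,000-word output layer above: per-position embedding→hidden contributions are precomputed, hidden layer values are cached per history, and only one row of the output weights is used per scored word:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10000, 30, 100          # vocabulary shrunk from the 1,000,000 on the slide
C = rng.normal(0, 0.1, (V, d))
W_h = rng.normal(0, 0.1, (h, 4 * d))
W_o = rng.normal(0, 0.1, (V, h))

def position_contribution(word_id, pos):
    """Contribution of one context word (at one position) to the hidden layer sum.
    Depends only on (word, position), so it can be precomputed once for all words."""
    return W_h[:, pos * d:(pos + 1) * d] @ C[word_id]

hidden_cache = {}                 # hidden layer values can be cached per history

def score(history, predicted_word):
    if history not in hidden_cache:
        s = sum(position_contribution(w, pos) for pos, w in enumerate(history))
        hidden_cache[history] = np.tanh(s)
    hidden = hidden_cache[history]
    return W_o[predicted_word] @ hidden   # only the one needed output score is computed

print(score((17, 42, 256, 1024), 99))
```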


Only Compute Score for Predicted Word?

  • Proper probabilities require normalization

– compute scores for all possible words
– add them up
– normalize (softmax)

  • How can we get away with it?

– we do not care: a score is a score (Auli and Gao, 2014)
– training regime that normalizes (Vaswani et al., 2013)
– integrate normalization into objective function (Devlin et al., 2014)

  • Class-based word representations may help

– first predict class, normalize
– then predict word, normalize
→ compute 2√n instead of n output node values
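
A minimal sketch of the class-factored output (toy class assignment and sizes): normalize over the classes first, then over the words inside the predicted word's class, so roughly 2√n instead of n scores are computed:

```python
import numpy as np

def class_factored_prob(hidden, word, word_class, class_words, W_class, W_word):
    """p(word | history) = p(class(word) | h) * p(word | class(word), h).
    Only the class scores and the scores of the words in one class are normalized."""
    c = word_class[word]
    class_scores = W_class @ hidden                       # ~sqrt(n) values
    class_p = np.exp(class_scores - class_scores.max())
    class_p /= class_p.sum()

    members = class_words[c]                              # words sharing this class
    word_scores = W_word[members] @ hidden                # ~sqrt(n) values
    word_p = np.exp(word_scores - word_scores.max())
    word_p /= word_p.sum()
    return class_p[c] * word_p[members.index(word)]

rng = np.random.default_rng(0)
h, n_classes, vocab = 8, 3, 9                             # toy sizes
W_class = rng.normal(size=(n_classes, h))
W_word = rng.normal(size=(vocab, h))
word_class = {w: w // 3 for w in range(vocab)}            # toy: 3 words per class
class_words = {c: [w for w in range(vocab) if w // 3 == c] for c in range(n_classes)}
print(class_factored_prob(rng.normal(size=h), 4, word_class, class_words, W_class, W_word))
```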


recurrent neural networks


Recurrent Neural Networks

[Diagram: Word 1 is mapped through embedding E and hidden layer H to predict Word 2; the recurrent input to H is a layer with all nodes set to 1]

  • Start: predict second word from first
  • Mystery layer with nodes all with value 1


Recurrent Neural Networks

[Diagram: a second step predicts Word 3 from Word 2; the hidden layer values of the first step are copied over as the recurrent input of the second step]


Recurrent Neural Networks

[Diagram: a third step predicts Word 4 from Word 3; again, the hidden layer values of the previous step are copied over as recurrent input]


Training

[Diagram: first training example, predicting Word 2 from Word 1 (recurrent input all 1s)]

  • Process first training example
  • Update weights with back-propagation


Training

[Diagram: second training example, predicting Word 3 from Word 2, using the previous hidden layer values as recurrent input]

  • Process second training example
  • Update weights with back-propagation
  • And so on...
  • But: no feedback to previous history


Back-Propagation Through Time

[Diagram: recurrent network unfolded over three time steps (Word 1→Word 2, Word 2→Word 3, Word 3→Word 4)]

  • After processing a few training examples, update through the unfolded recurrent neural network


Back-Propagation Through Time

  • Carry out back-propagation through time (BPTT) after each training example

– 5 time steps seems to be sufficient
– network learns to store information for more than 5 time steps

  • Or: update in mini-batches

– process 10-20 training examples
– update backwards through all examples
– removes need for multiple steps for each training example
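
A rough sketch of truncated BPTT for a plain recurrent language model (layer sizes, the learning rate, and the 5-step truncation are illustrative; the embedding matrix is kept fixed to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, k, lr = 50, 16, 32, 5, 0.05        # vocab, embedding, hidden, truncation, learning rate
E = rng.normal(0, 0.1, (V, d))              # word embeddings (kept fixed here for brevity)
W = rng.normal(0, 0.1, (h, d))              # input -> hidden weights
U = rng.normal(0, 0.1, (h, h))              # recurrent hidden -> hidden weights
O = rng.normal(0, 0.1, (V, h))              # hidden -> output weights

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

def bptt_update(words, h_prev):
    """Unfold the network over at most k steps, back-propagate through the
    unfolded copies (cross-entropy loss, softmax output), apply one update."""
    xs, hs, dscores = [], [h_prev], []
    for w_in, w_out in zip(words[:-1], words[1:]):      # predict the next word
        x = E[w_in]
        hid = sigmoid(W @ x + U @ hs[-1])
        scores = O @ hid
        p = np.exp(scores - scores.max()); p /= p.sum()
        p[w_out] -= 1.0                                 # d loss / d scores
        xs.append(x); hs.append(hid); dscores.append(p)
    dW, dU, dO, dh_next = np.zeros_like(W), np.zeros_like(U), np.zeros_like(O), np.zeros(h)
    for t in reversed(range(len(dscores))):             # back-propagation through time
        dO += np.outer(dscores[t], hs[t + 1])
        dh = O.T @ dscores[t] + dh_next
        ds = dh * hs[t + 1] * (1 - hs[t + 1])           # through the sigmoid
        dW += np.outer(ds, xs[t])
        dU += np.outer(ds, hs[t])
        dh_next = U.T @ ds
    for M, dM in ((W, dW), (U, dU), (O, dO)):
        M -= lr * dM
    return hs[-1]                                       # carry the state to the next chunk

state = np.zeros(h)
sentence = rng.integers(0, V, size=12)                  # toy word-id sequence
for start in range(0, len(sentence) - 1, k):
    state = bptt_update(sentence[start:start + k + 1], state)
```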


Integration into Decoder

  • Recurrent neural networks depend on entire history

⇒ very bad for dynamic programming


long short term memory


Vanishing and Exploding Gradients

  • Error is propagated to previous steps
  • Updates consider

– prediction at that time step
– impact on future time steps

  • Exploding gradient: propagated error dominates weight update
  • Vanishing gradient: propagated error disappears

⇒ We want the proper balance


Long Short Term Memory (LSTM)

  • Redesign of the neural network node to keep balance
  • Rather complex
  • ... but reportedly simple to train


Node in a Recurrent Neural Network

  • Given

– input word embedding x
– previous hidden layer values h(t−1)
– weight matrices W and U

  • Sum s_i = Σ_j w_ij x_j + Σ_j u_ij h_j^(t−1)
  • Activation y_i = sigmoid(s_i)
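
In vectorized form, these per-node sums become a matrix-vector expression; a minimal sketch with arbitrary dimensions:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def rnn_node(x, h_prev, W, U):
    """s_i = sum_j w_ij x_j + sum_j u_ij h_j^(t-1);  y_i = sigmoid(s_i)."""
    return sigmoid(W @ x + U @ h_prev)

rng = np.random.default_rng(0)
d, n = 30, 100                                      # embedding and hidden sizes (arbitrary)
W, U = rng.normal(0, 0.1, (n, d)), rng.normal(0, 0.1, (n, n))
y = rnn_node(rng.normal(size=d), np.zeros(n), W, U)
print(y.shape)                                      # (100,)
```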


Node ("Cell") in LSTM

  • Now three gates: input, output, forget

each with their own weight matrices: WI, UI, WO, UO, WF, UF

  • Input and forget gates lead to activations as before

y_i^I = sigmoid( Σ_j w_ij^I x_j + Σ_j u_ij^I h_j^(t−1) )
y_i^F = sigmoid( Σ_j w_ij^F x_j + Σ_j u_ij^F h_j^(t−1) )

  • Compute a candidate value for the "state" of the node (weight matrices WC, UC)

C̃_i^(t) = tanh( Σ_j w_ij^C x_j + Σ_j u_ij^C h_j^(t−1) )

  • Input and forget activations balance candidate state and previous state

C_i^(t) = y_i^I C̃_i^(t) + y_i^F C_i^(t−1)

  • Output gate also considers the state (additional weight matrix V)

y_i^O = sigmoid( Σ_j w_ij^O x_j + Σ_j u_ij^O h_j^(t−1) + Σ_j v_ij C_j^(t) )

  • Output

h_i^(t) = y_i^O tanh(C_i^(t))
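
A minimal sketch of this cell, following the equations above (dimensions and initialization are arbitrary; the output gate uses the extra matrix V over the cell state, as on the slide):

```python
import numpy as np

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

def lstm_cell(x, h_prev, c_prev, WI, UI, WF, UF, WC, UC, WO, UO, V):
    """One LSTM step following the slide's equations:
    input/forget/output gates, candidate state, state update, output."""
    y_in = sigmoid(WI @ x + UI @ h_prev)                 # input gate
    y_forget = sigmoid(WF @ x + UF @ h_prev)             # forget gate
    c_tilde = np.tanh(WC @ x + UC @ h_prev)              # candidate state
    c = y_in * c_tilde + y_forget * c_prev               # balance new vs. previous state
    y_out = sigmoid(WO @ x + UO @ h_prev + V @ c)        # output gate (sees the state)
    h = y_out * np.tanh(c)                               # output
    return h, c

rng = np.random.default_rng(0)
d, n = 30, 100                                           # embedding and cell sizes (arbitrary)
mats = [rng.normal(0, 0.1, (n, d)) if name.startswith("W") else rng.normal(0, 0.1, (n, n))
        for name in ("WI", "UI", "WF", "UF", "WC", "UC", "WO", "UO", "V")]
h, c = lstm_cell(rng.normal(size=d), np.zeros(n), np.zeros(n), *mats)
print(h.shape, c.shape)                                  # (100,) (100,)
```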