slide-1
SLIDE 1

Machine Learning for NLP

Sequential NN models

Aurélie Herbelot 2019

Centre for Mind/Brain Sciences University of Trento 1

slide-2
SLIDE 2

The unreasonable effectiveness...

2

slide-3
SLIDE 3

Karpathy (2015)

  • In 2015, Andrej Karpathy wrote a blog entry which became famous: The unreasonable effectiveness of Recurrent Neural Networks¹.

  • It shows how a simple model can be unbelievably effective.

¹ https://karpathy.github.io/2015/05/21/rnn-effectiveness/

3

slide-4
SLIDE 4

Recurrence

  • Feedforward NNs which take a vector as input and produce a vector as output are limited.

  • By putting recurrence into our model, we can now process sequences of vectors at each layer of the network.

4

slide-5
SLIDE 5

Architectures

What might these architectures be used for?

https://karpathy.github.io/2015/05/21/rnn-effectiveness/

5

slide-6
SLIDE 6

Is this a recurrent architecture?

https://github.com/avisingh599/visual-qa

6

slide-7
SLIDE 7

Is this a recurrent architecture?

https://github.com/avisingh599/visual-qa

7

slide-8
SLIDE 8

Is this a recurrent architecture?

Venugopalan et al (2016)

8

slide-9
SLIDE 9

Reminder: language modeling

A language model (LM) is a model that computes the probability of a sequence of words, given some previously observed data.

LMs are used widely, for instance in predictive text on your smartphone: Today, I am in (bed|heaven|Rovereto|Ulaanbaatar).

9

slide-10
SLIDE 10

The Markov assumption

  • Let’s assume the following sentence:

I am in Rovereto.

  • We are going to use the chain rule for calculating its

probability: P(An, . . . , A1) = P(An|An−1, . . . , A1) · P(An−1, . . . , A1)

  • For our example:

P(I, am, in, Rovereto) = P(Rovereto | in, am, I) · P(in | am, I) · P(am | I) · P(I)

10

slide-11
SLIDE 11

The Markov assumption

  • The problem is, we cannot easily estimate the probability of a word in a long sequence.
  • There are too many possible sequences that are not observable in our data or have very low frequency:

P(Rovereto | in, am, I, today, but, yesterday, there...)

  • So we make a simplifying Markov assumption:

P(Rovereto | in, am, I) ≈ P(Rovereto | in) (bigram)

or

P(Rovereto | in, am, I) ≈ P(Rovereto | in, am) (trigram)

11

slide-12
SLIDE 12

The Markov assumption

  • Coming back to our example:

P(I, am, in, Rovereto) = P(Rovereto | in, am, I) · P(in | am, I) · P(am | I) · P(I)

  • A bigram model simplifies this to:

P(I, am, in, Rovereto) = P(Rovereto | in) · P(in | am) · P(am | I) · P(I)

  • That is, we are not taking into account long-distance

dependencies in language.

  • Trade-off between accuracy of the model and trainability.
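
As a concrete illustration, here is a minimal Python sketch of estimating such a bigram model from counts (the toy corpus and helper names are my own, not from the slides):

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a large collection of text.
corpus = [["i", "am", "in", "rovereto"],
          ["i", "am", "in", "bed"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

def bigram_prob(w2, w1):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

def sentence_prob(sent):
    """P(sentence) under the bigram simplification (ignoring P(first word))."""
    p = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        p *= bigram_prob(w2, w1)
    return p

print(sentence_prob(["i", "am", "in", "rovereto"]))  # 0.5 on this toy corpus
```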

12

slide-13
SLIDE 13

LMs as generative models

  • In your smartphone, the LM does not just calculate a sentence probability, it suggests the next word as you write.

  • Given the sequence I am in, for each word w in the

vocabulary, the LM can calculate: P(w | in, am, I)

  • The word with highest probability is returned.
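
In code, that suggestion step is just an argmax over the vocabulary. A tiny sketch, which would reuse a conditional-probability helper such as the hypothetical bigram_prob above:

```python
def predict_next(context_word, vocabulary, cond_prob):
    """Return the vocabulary word w maximising P(w | context_word)."""
    return max(vocabulary, key=lambda w: cond_prob(w, context_word))

# e.g. predict_next("in", ["rovereto", "bed", "heaven"], bigram_prob)
```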

13

slide-14
SLIDE 14

Language modeling with RNNs

  • The sequence given to the RNN is equivalent to the n-gram of a language model.

  • Given a word or character, it has to predict the next one.

https://karpathy.github.io/2015/05/21/rnn-effectiveness/

14

slide-15
SLIDE 15

Example: rewriting Harry Potter

http://www.botnik.org/content/harry-potter.html

15

slide-16
SLIDE 16

Example: writing code

https://karpathy.github.io/2015/05/21/rnn-effectiveness/

16

slide-17
SLIDE 17

Sequences for non-sequential input

Check animation at https://karpathy.github.io/2015/05/21/rnn-effectiveness/

17

slide-18
SLIDE 18

Types of recurrent NNs

  • RNNs (Recurrent Neural Networks): the original version.

Simple architecture but does not have much memory.

  • LSTMs (Long Short-Term Memory Networks): an RNN

able to remember and forget selectively.

  • GRUs (Gated Recurrent Units): a variation on LSTMs.

18

slide-19
SLIDE 19

Recurrent Neural Networks

19

slide-20
SLIDE 20

Recurrent Neural Networks (RNNs)

  • Traditional neural networks do not have persistence: when

presented with a new input, they forget the previous one.

  • RNNs solve this problem by ‘having loops’: an RNN can be seen as several copies of the same network, each passing a message to the next instance.

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

20

slide-21
SLIDE 21

Recurrent Neural Networks (RNNs)

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

21

slide-22
SLIDE 22

The step functions

  • A simple RNN consists of a single step function which:
  • updates the hidden layer of the unit;
  • computes the output.
  • The hidden layer at time t is updated as:

ht = ah(Whh · ht−1 + Wxh · xt)

  • Output is then given by:

y = ao(Why · ht)
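
A minimal NumPy sketch of this step function (tanh and softmax as the activations ah and ao, and writing the products as xt·Wxh as on the later forward-pass slide, are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One RNN step: update the hidden layer, then compute the output."""
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh)   # h_t = a_h(Whh·h_{t-1} + Wxh·x_t)
    y_t = softmax(h_t @ W_hy)                   # y   = a_o(Why·h_t)
    return h_t, y_t

# Example: 4-dimensional one-hot input, hidden layer of size 3.
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(3, 3)), rng.normal(size=(3, 4))
h_t, y_t = rnn_step(np.array([0., 1., 0., 0.]), np.zeros(3), W_xh, W_hh, W_hy)
```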

22

slide-23
SLIDE 23

The state space

  • A recurrent network is a dynamical system described by

the two equations in the step function (see previous slide).

  • The state of the system is the summary of its past behaviour, i.e. the set of hidden unit activations ht.

  • In addition to the input and output spaces, we have a state

space which has the dimensionality of the hidden layer.

23

slide-24
SLIDE 24

Backpropagation through time (BPTT)

  • Imagine doing backprop over an unfolded RNN.
  • Let us have a network training sequence from time t0 to

time tk.

  • The cost function E(t0, tk) is the sum of error E(t) over time:

E(t0, tk) = Σ_{t=t0}^{tk} E(t)

  • Similarly, the gradient descent has contributions from all time steps:

θj := θj − α · δE(t0, tk)/δθj

24

slide-25
SLIDE 25

Backpropagation through time (BPTT)

  • Imagine doing backprop over an unfolded RNN.
  • Let us have a network training sequence from time t0 to

time tk.

  • The cost function E(t0, tk) is the sum of error E(t) over time:

E(t0, tk) = Σ_{t=t0}^{tk} E(t)

  • Similarly, the gradient descent has contributions from all time steps:

θj := θj − α Σ_{t=t0}^{tk} δE(t)/δθj

24

slide-26
SLIDE 26

An RNN, step by step

  • Let us see what happens in an RNN with a simple example of forward and backpropagation.
  • Let's assume a character-based language modeling task: the model has to predict the next character given a sequence.

  • We will set the vocabulary to four letters: e, h, l, o.
  • We will express each element in the input sequence as a

4-dimensional one-hot vector:

  • 1 0 0 0 = e
  • 0 1 0 0 = h
  • 0 0 1 0 = l
  • 0 0 0 1 = o

25

slide-27
SLIDE 27

An RNN, step by step

  • We will have sequences of

length 4, e.g. ‘lloo’ or ‘oleh’.

  • We will have an RNN with

a hidden layer of dimension 3.

https://karpathy.github.io/2015/05/21/rnn-effectiveness/

26

slide-28
SLIDE 28

An RNN, step by step

  • Let’s imagine we give the following training example to the
  • RNN. We input hell and we want to get the sequence ello.
  • Let’s have:
  • x = [[0100], [1000], [0010], [0010]]
  • y = [[1000], [0010], [0010], [0001]]
  • Each vector in x and y corresponds to a time step, so

xt2 = [1000].

  • ˆ

yt will be prediction by the model at time t.

  • ˆ

y will be entire sequence predicted by the model.

27

slide-29
SLIDE 29

An RNN, step by step

We do a forward pass over the input sequence. It will mean calculating each state of the hidden layer and the resulting output.

ht1 = ah(xt1 Wxh + ht0 Whh)
ht2 = ah(xt2 Wxh + ht1 Whh)
ht3 = ah(xt3 Wxh + ht2 Whh)
ht4 = ah(xt4 Wxh + ht3 Whh)

ŷt1 = ao(ht1 Why)
ŷt2 = ao(ht2 Why)
ŷt3 = ao(ht3 Why)
ŷt4 = ao(ht4 Why)
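
In NumPy, this unrolled forward pass might look like the following sketch (random weights stand in for trained ones; tanh and softmax as the activations are an assumption):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(3, 3))   # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(3, 4))   # hidden -> output

# The input 'hell' as one-hot vectors over the vocabulary e, h, l, o.
x = [np.array(v, dtype=float) for v in
     ([0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0])]

h = np.zeros(3)                              # h_t0
for x_t in x:
    h = np.tanh(x_t @ W_xh + h @ W_hh)       # h_t = a_h(x_t·Wxh + h_{t-1}·Whh)
    y_hat = softmax(h @ W_hy)                # ŷ_t = a_o(h_t·Why), a distribution over e, h, l, o
    print(y_hat.round(3))
```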

28

slide-30
SLIDE 30

An RNN, step by step

  • Let’s now assume that the network did not do very well and

predicted lole instead of ello, so the sequence ˆ y = [[0010], [0001], [0010], [1000]]

  • We now want our error:

θj := θj − α

tk

  • t=t0

δ δθj E(t)

  • This requires calculating the derivative of the error at each

time step, for each parameter θj in the RNN: δ δθj E(t)

29

slide-31
SLIDE 31

An RNN, step by step

  • Our error E(t) at each time step is some function of ŷt − yt, over all our training instances, as normal. For instance, MSE:

E(t) = 1/(2N) Σ_{i=1}^{N} (ŷt^i − y^i)²

  • The entire error is the sum of those errors (see slide 24):

E = Σ_{t=t0}^{tk} E(t)

NB: t0 is the input, there is no error on it!

30

slide-32
SLIDE 32

An RNN, step by step

  • Now we backpropagate through time.
  • Note that backpropagation happens also across timesteps.

31
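
A rough PyTorch sketch of this whole training step on the hell → ello example (the module and variable names are my own; cross-entropy is used here instead of the MSE mentioned earlier, as is usual when predicting characters). The per-step errors are summed into one cost, and a single backward() call backpropagates through time across all four steps:

```python
import torch
import torch.nn as nn

# Vocabulary e, h, l, o -> indices 0..3; hidden layer of size 3 as in the slides.
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
out_layer = nn.Linear(3, 4)
loss_fn = nn.CrossEntropyLoss(reduction="sum")   # E = Σ_t E(t)

x = torch.eye(4)[[1, 0, 2, 2]].unsqueeze(0)      # one-hot 'h','e','l','l', shape (1, 4, 4)
targets = torch.tensor([[0, 2, 2, 3]])           # 'e','l','l','o'

hidden_states, _ = rnn(x)                        # forward pass over the whole sequence
logits = out_layer(hidden_states)                # one prediction per time step
loss = loss_fn(logits.reshape(-1, 4), targets.reshape(-1))
loss.backward()                                  # backpropagation through time
```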

slide-33
SLIDE 33

An RNN, step by step

  • How many parameters do we have in the network?
  • 4 × 3 for Wxh
  • 3 × 3 for Whh
  • 3 × 4 for Why
  • That is 33 parameters, plus associated biases (not shown).
  • A real network will have many more. So RNNs are

expensive to train when backpropagating through the whole sequence.

32

slide-34
SLIDE 34

RNNs and memory

  • RNNs are known not to have much memory: they cannot

process long-distance dependencies.

  • Consider the following sentences:

1) Harry had not revised for the exams, having spent time fighting dementors, [insert long list of monsters], so he got a bad mark.

2) Hermione revised course material the whole time while fighting dementors, [insert long list of monsters], so she got a good mark.

  • When modeling this text, the RNN must remember the

gender of the proper noun to correctly predict the pronoun.

33

slide-35
SLIDE 35

RNNs and vanishing/exploding gradients

  • Reminder: at the points where an activation function is

very steep and/or very flat, its gradient will be very large (exploding) or very small (vanishing).

  • For instance, the sigmoid function has a vanishing gradient for low and high values of x.

34

slide-36
SLIDE 36

Vanishing gradient in deep networks

  • Let us imagine a deep network (with many layers).
  • h1 = a(Wxh1 · x)
  • h2 = a(Wh1h2 · h1)
  • h3 = a(Wh2h3 · h2)
  • ...
  • ŷ = a(Whny · hn)

  • For simplicity, let's say that the activation a is a linear function such that a(W · h) = W · h. Then:

h3 = Wh2h3 · h2 = Wh2h3 · Wh1h2 · h1 = Wh2h3 · Wh1h2 · Wxh1 · x

35

slide-37
SLIDE 37

Vanishing gradient in deep networks

  • So any activation at layer k is the product of all W matrices

and the input x. For k = 3: h3 = Wh2h3 · Wh1h2 · Wxh1 · x

  • An unrolled RNN is like a deep network where all Whh

matrices are the same:

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

36

slide-38
SLIDE 38

Vanishing gradient in RNNs

  • Let us now assume that we have Whh = diag(0.9, 0.9).

  • Then hk = Whh^(k−1) · Wxh1 · x.

  • The higher k is (the longer the sequence), the smaller the entries of Whh^(k−1) will become. Activations / gradients will decrease exponentially.

37

slide-39
SLIDE 39

Exploding gradient in RNNs

  • Similarly, let us now assume that we have Whh = diag(1.1, 1.1).

  • Then hk = Whh^(k−1) · Wxh1 · x.

  • The higher k is (the longer the sequence), the larger the entries of Whh^(k−1) will become. Activations / gradients will increase exponentially.
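
A quick numerical illustration of this effect, using the diagonal matrices from the two slides above (a toy sketch, not part of the original deck):

```python
import numpy as np

x = np.ones(2)
for factor in (0.9, 1.1):
    W_hh = np.diag([factor, factor])
    for k in (10, 50, 100):
        # h_k ∝ W_hh^(k-1) · x: repeated multiplication by W_hh vanishes (0.9)
        # or explodes (1.1) as the sequence gets longer.
        h_k = np.linalg.matrix_power(W_hh, k - 1) @ x
        print(f"factor={factor}, k={k}, |h_k|={np.linalg.norm(h_k):.3g}")
```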

38

slide-40
SLIDE 40

Vanishing / Exploding gradient in RNNs

  • So with problems of vanishing gradients, the higher k is,

the smaller the gradient.

  • When we backpropagate our error, we have:

θj := θj − α δE(t)/δθj

  • The smaller the gradient (the δE(t)/δθj term above), the less we are changing our weights.

  • So the longer the sequence is, the less we are able to train

the network. RNNs don’t have memory.

39

slide-41
SLIDE 41

Gradient clipping

  • For exploding gradient problems, there is a simple hack to

fix the issue.

  • You will notice exploding gradients in your code because

you will get NaN errors.

  • Check the value of the gradient periodically. If it gets above

a threshold t, ‘clip’ it by returning it to the threshold.
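
A minimal sketch of such clipping (here rescaling by the gradient norm; the threshold value is an arbitrary choice):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient norm exceeds the threshold, rescale it back to the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])   # exploding gradient, norm 50
print(clip_gradient(g))       # rescaled to norm 5: [ 3. -4.]
```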

40

slide-42
SLIDE 42

BPTT types

  • BPTT(∞) backpropagates taking into account the whole

sequence.

  • BPTT(h) backpropagates for h time steps:
  • In effect, because of the vanishing gradient issue, the

contributions from older time steps are anyway very small.

  • Also, each state of the network has to be saved to do

gradient descent, so with long sequences, we can run into memory issues.

41

slide-43
SLIDE 43

Long Short-Term Memory Networks

42

slide-44
SLIDE 44

Long short term memory networks (LSTMs)

  • Solution to the lack of memory: a gating system.

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

43

slide-45
SLIDE 45

Long short term memory networks (LSTMs)

  • An LSTM has a cell state,

comprising a number of neurons.

  • The gates are layers of sigmoid functions which let certain pieces of information go through and block others (by scaling particular components by a value between 0 and 1).

44

slide-46
SLIDE 46

LSTMs: the forget gate

  • The forget gate controls whether to forget a particular component value or not. ft acts as a filter.

  • Dependent on both the previous hidden state and the new input: given new info, we may want to forget some components in the old one. Operationalised as pointwise multiplication (see diagram).

45

slide-47
SLIDE 47

LSTMs: the input gate

  • The input gate layer decides which components in the input we

should read (taking into account the previous hidden state). it acts as a filter.

  • We pass the input through a tanh activation function to get a

candidate cell state (like in a standard RNN).

  • We multiply it by the output of the input gate’s sigmoid. The

result is added to the cell’s state.

46

slide-48
SLIDE 48

LSTMs: the output gate

  • We now decide which components to output. Decide

through another sigmoid. ot is a third filter.

  • Put cell state through tanh, and multiply by output of the

sigmoid.
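
Putting the three gates together, a rough NumPy sketch of one LSTM step (my own weight names, a concatenated [h, x] input, and omitted biases are simplifying assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step. W maps each gate name to a matrix applied to [h_prev; x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z)              # forget gate: what to keep from the old cell state
    i_t = sigmoid(W["i"] @ z)              # input gate: which new components to write
    c_tilde = np.tanh(W["c"] @ z)          # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # pointwise multiplication acts as a filter
    o_t = sigmoid(W["o"] @ z)              # output gate: which components to expose
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny example: input of size 4, hidden/cell state of size 3.
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(3, 7)) for k in "fico"}
h, c = lstm_step(np.array([0., 1., 0., 0.]), np.zeros(3), np.zeros(3), W)
```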

47

slide-49
SLIDE 49

Attention

  • One benefit of memory mechanisms (including forgetting): attention.

  • Both in vision and language, such mechanisms allow us to

‘focus’ on particular aspects of the input.

Yang et al (2016)

48

slide-50
SLIDE 50

Gated Recurrent Units

49

slide-51
SLIDE 51

Gated Recurrent Units (GRUs)

  • A GRU, like an LSTM, tries to track long-term

dependencies without falling into the vanishing/exploding gradient problem.

  • It does not have input, forget and output gates.
  • Instead, it has a reset and an update gate.

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

50

slide-52
SLIDE 52

Gated Recurrent Units (GRUs)

  • Consider on the left the LSTM diagram. Concentrate on the

gates.

  • We have a cell state C passed through f, a candidate cell state C̃ passed through i, and a cell state passed through o.

Chung et al (2014), https://arxiv.org/abs/1412.3555

51

slide-53
SLIDE 53

Gated Recurrent Units (GRUs)

  • The reset gate r is between the activation and the candidate activation. It allows the network to forget the previous state.

Chung et al (2014), https://arxiv.org/abs/1412.3555

52

slide-54
SLIDE 54

Gated Recurrent Units (GRUs)

  • The update gate z regulates how much of the candidate

activation to use when updating the cell state.

  • It then outputs its full cell state (no output gate!)

Chung et al (2014), https://arxiv.org/abs/1412.3555
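
For comparison with the LSTM sketch above, a rough NumPy sketch of one GRU step in the formulation of Chung et al (2014) (again with my own weight names and biases omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W):
    """One GRU step. W maps gate names to matrices applied to [h_prev; x_t]."""
    z_in = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W["r"] @ z_in)           # reset gate: forget parts of the previous state
    z_t = sigmoid(W["z"] @ z_in)           # update gate: how much of the candidate to use
    h_tilde = np.tanh(W["h"] @ np.concatenate([r_t * h_prev, x_t]))   # candidate activation
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde   # no separate output gate: h_t is the output
    return h_t

# Tiny example: input of size 4, hidden state of size 3.
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(3, 7)) for k in "rzh"}
h = gru_step(np.array([0., 1., 0., 0.]), np.zeros(3), W)
```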

53

slide-55
SLIDE 55

Should we use GRUs or LSTMs?

  • There is no clear answer. GRUs are simpler and therefore

computationally more efficient.

  • But results are highly task-dependent, and performance

differences are often non-significant.

54

slide-56
SLIDE 56

Should we use GRUs or LSTMs?

Some results on modelling polyphonic music and speech signals.

Chung et al (2014), https://arxiv.org/abs/1412.3555

55

slide-57
SLIDE 57

Go listen to some folk music!

The march of deep learning

https://github.com/IraKorshunova/folk-rnn/blob/master/soundexamples/compositions/TheMarchOfDeepLearning.mp3

56