Recurrent Neural Models: Language Models, and Sequence Prediction and Generation
CMSC 473/673 Frank Ferraro
WARNING: Neural methods are NOT the only way to do sequence prediction:
- Structured Perceptron (478/678)
- Hidden Markov Models (473/673, 678, 691 GSML)
CRFs are Very Popular for {POS, NER, other sequence tasks}
f(z_{i-1}, z_i, i) = (z_{i-1} == Noun  &  z_i == Verb  &  (x_{i-2} is in a list of adjectives or determiners))
f_{path p}(z_{i-1}, z_i, i) = (z_{i-1} == Per  &  z_i == Per  &  (a syntactic path p involving x_i exists))
[Figure: a linear-chain CRF, with label variables z_1, z_2, z_3, z_4, … above the words w_1, w_2, w_3, w_4, …]

p(z_1, …, z_n | x_1, …, x_n) ∝ ∏_i exp(θ · f(z_{i-1}, z_i, x_1, …, x_n))

Can't easily do these with an HMM → conditional models can allow richer features
CRFs can be used in neural networks too:
hon/tf/contrib/crf/CrfForwardRnnCell
Outline
- Types of networks
- Basic cell definition
- Example in PyTorch
A Note on Graphical Notation

[Figure: x → h → y]
- x: input. Could be BOW, a sequence of items, structured input, etc.
- h: hidden state/representation
- y: output. A label, a sequence of labels, generated text, etc.
- The edges between nodes are factors (another neural cell, or factor)
- The red arrows indicate parameters to learn
Five Broad Categories of Neural Networks

1. Single Input, Single Output
2. Single Input, Multiple Outputs
3. Multiple Inputs, Single Output
4. Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
5. Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)

"Single": a fixed number of items; "Multiple": a variable number of items
Network Types: Single Input, Single Output
[Figure: x → h → y]

1. Feed forward
- Linearizable feature input
- Bag-of-items classification/regression
- Basic non-linear model
Weβve already seen some instances
Terminology
- Log-Linear Models (common NLP term)
- (Multinomial) logistic regression / Softmax regression (as statistical regression)
- Maximum Entropy models, MaxEnt (based in information theory)
- Generalized Linear Models (a form of)
- Discriminative Naïve Bayes (viewed as)
- Very shallow (sigmoidal) neural nets (to be cool today :) )
Recall from maxent slides
Recall: N-gram to Maxent to Neural Language Models

Maxent LM: predict the next word w_i given some context (w_{i-3}, w_{i-2}, w_{i-1}), and compute beliefs about what is likely…

p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) = softmax(θ · f(x_{i-3}, x_{i-2}, x_{i-1}, x_i))

Here the input x maps directly to the output y: there is no learned representation h.
Neural LM: create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1} (each e_w is a matrix-vector product with an embedding matrix C), combine these representations with a function f, and then predict the next word…

p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ softmax(θ_{x_i} · f(x_{i-3}, x_{i-2}, x_{i-1}))

The combined representation is a learned hidden state h sitting between the input x and the output y.
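A minimal PyTorch sketch of this feed-forward neural LM; every name and dimension below is illustrative rather than taken from the slides.

```python
import torch
import torch.nn as nn

class TrigramContextNLM(nn.Module):
    """Feed-forward neural LM: p(x_i | x_{i-3}, x_{i-2}, x_{i-1})."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.C = nn.Embedding(vocab_size, embed_dim)          # e_w = C x_w
        self.combine = nn.Linear(3 * embed_dim, hidden_dim)   # the "f" combining e_{i-3..i-1}
        self.out = nn.Linear(hidden_dim, vocab_size)          # scores theta_{x_i} . h

    def forward(self, context):             # context: (batch, 3) word ids
        e = self.C(context)                 # (batch, 3, embed_dim)
        h = torch.tanh(self.combine(e.flatten(1)))
        return torch.log_softmax(self.out(h), dim=-1)

# usage: log-probabilities over the next word for a batch of 3-word contexts
model = TrigramContextNLM(vocab_size=10000)
log_probs = model(torch.randint(0, 10000, (2, 3)))    # shape: (2, 10000)
```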
Common Types of Single Input, Single Output

General formulation:
Input: x
Compute:
  h_0 = x
  for layer l = 1 to L:
    h_l = f_l(W_l h_{l-1} + b_l)
  return argmax_z softmax(β · h_L)

h_l: hidden state at layer l; f_l: (non-linear) activation function at layer l; W_l h_{l-1} + b_l: a linear layer

In PyTorch (torch.nn):
Activation functions: https://pytorch.org/docs/stable/nn.html?highlight=activation#non-linear-activations-weighted-sum-nonlinearity
Linear layer: https://pytorch.org/docs/stable/nn.html#linear-layers
torch.nn.Linear(in_features=<dim of h_{l-1}>, out_features=<dim of h_l>, bias=<Boolean: include bias b_l>)
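A minimal sketch of the general formulation above with L = 2 hidden layers, using the torch.nn pieces just listed; the dimensions and the choice of ReLU are illustrative.

```python
import torch
import torch.nn as nn

# h_l = f_l(W_l h_{l-1} + b_l), then softmax(beta . h_L)
model = nn.Sequential(
    nn.Linear(in_features=300, out_features=100, bias=True),   # W_1, b_1
    nn.ReLU(),                                                  # f_1
    nn.Linear(in_features=100, out_features=50, bias=True),    # W_2, b_2
    nn.ReLU(),                                                  # f_2
    nn.Linear(in_features=50, out_features=5),                 # beta (5 output classes)
)

x = torch.randn(1, 300)                                    # one bag-of-features input
y_hat = torch.softmax(model(x), dim=-1).argmax(dim=-1)     # argmax_z softmax(beta . h_L)
```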
Network Types: Single Input, Multiple Outputs
[Figure: x → h_0 → h_1 → h_2 → …, with an output y_0, y_1, y_2, … emitted at each step]

Recurrent: one input, sequence output
- Label-based generation
- Automated caption generation
Label-Based Generation
Given a label z, generate an entire text x = (x_1, …, x_n):

argmax_x p(x | z) = argmax_{x_1, …, x_n} p(x_1, …, x_n | z)

Performing this argmax is difficult, and often requires an approximate search technique called beam search.
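A sketch of beam search over a generic next-token scorer; the `step` function, the hypothesis format, and the beam size below are illustrative assumptions, not something from the slides. Any conditional model p(x_1, …, x_n | z) that can score next tokens would plug in as `step`.

```python
import heapq

def beam_search(step, start_token, end_token, beam_size=4, max_len=30):
    """step(prefix) -> list of (next_token, log_prob) candidates.
    Returns the highest-scoring complete sequence found."""
    beams = [(0.0, [start_token])]                 # (total log-prob, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for tok, logp in step(prefix):
                hyp = (score + logp, prefix + [tok])
                (finished if tok == end_token else candidates).append(hyp)
        if not candidates:
            break
        # keep only the beam_size best partial sequences
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(finished + beams, key=lambda c: c[0])[1]
```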
Example: Sentiment-based Tweet Generation
Given a sentiment label z (e.g., HAPPY, SAD, ANGRY, etc.), generate a tweet x = (x_1, …, x_n) that would express that sentiment:

argmax_x p(x | z) = argmax_{x_1, …, x_n} p(x_1, …, x_n | z)
Q: Why might you want to do this? Q: What ethical aspects should you consider? Q: What is the potential harm?
Example: Image Caption Generation
Show and Tell: A Neural Image Caption Generator, CVPR 15
Slide credit: Arun Mallya
Network Types: Multiple Inputs, Single Output

[Figure: inputs x_0, x_1, x_2, … feed hidden states h_0 → h_1 → h_2 → …, with a single output y at the end]

Recurrent: sequence input, one output
- Document classification
- Action recognition in video (high-level)
Think of this as generalizing how we used maxent models to build discriminatively trained classifiers:

p(z | y) = maxent(y, z)  →  p(z | y) = recurrent_classifier(y, z)
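A minimal sketch of such a recurrent classifier in PyTorch; the GRU encoder, the names, and the sizes are illustrative choices, not the slides' own definition.

```python
import torch
import torch.nn as nn

class RecurrentClassifier(nn.Module):
    """Multiple inputs, single output: read the whole sequence, then classify."""
    def __init__(self, vocab_size, num_labels, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, num_labels)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        _, h_last = self.rnn(self.embed(tokens))    # h_last: (1, batch, hidden_dim)
        return torch.log_softmax(self.classify(h_last.squeeze(0)), dim=-1)

model = RecurrentClassifier(vocab_size=5000, num_labels=3)
print(model(torch.randint(0, 5000, (2, 7))).shape)  # torch.Size([2, 3])
```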
Example: RTE (many options)

s: Michael Jordan, coach Phil Jackson and the star cast, including Scottie Pippen, took the Chicago Bulls to six National Basketball Association championships.
z: The Bulls basketball team is based in Chicago.
Label: ENTAILED

[Figure: one RNN reads s_0 … s_N into hidden states h_{s,0} … h_{s,N}; another reads z_0 … z_M into h_{z,0} … h_{z,M}; the two final states feed a classifier producing y]
Reminder! GLUE https://gluebenchmark.com/ https://super.gluebenchmark.com/
Many (but not all) of these tasks fall into the Multiple Inputs, Single Output regime
Network Types: Multiple Inputs, Multiple Outputs (βsequence predictionβ: no time delay)
[Figure: inputs x_0, x_1, x_2 feed hidden states h_0 → h_1 → h_2, each emitting an output y_0, y_1, y_2]

Recurrent: sequence input, sequence output
- Part of speech tagging
- Named entity recognition
Example 1: Part of Speech Tagging

Task: Predict a part-of-speech tag for each word in a provided sentence.

British  Left  Waffles  on    Falkland  Islands
Noun     Verb  Noun     Prep  Noun      Noun

[Figure: inputs x_0 … x_5 (the words) feed hidden states h_0 … h_5, each producing a tag y_0 … y_5]
Example 2: Named Entity Recognition

Task: Predict a named entity tag for each word in a provided sentence.

British  Left  Waffles  on     Falkland  Islands
ORG      ORG   Other    Other  LOC       LOC

[Figure: inputs x_0 … x_5 (the words) feed hidden states h_0 … h_5, each producing a tag y_0 … y_5]
What are Named Entities?
Named entity recognition (NER): identify proper names in texts, and classify them into a set of predefined categories of interest:
- Person names
- Organizations (companies, government organisations, committees, etc.)
- Locations (cities, countries, rivers, etc.)
- Date and time expressions
- Measures (percent, money, weight, etc.), email addresses, Web addresses, street addresses, etc.
- Domain-specific: names of drugs, medical conditions, names …
Cunningham and Bontcheva (2003, RANLP Tutorial)
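A minimal sketch of this many-inputs, many-outputs setup in PyTorch, usable for POS tagging or NER; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """Predict one tag (POS, NER, ...) per input word."""
    def __init__(self, vocab_size, num_tags, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.tag = nn.Linear(hidden_dim, num_tags)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))       # h: (batch, seq_len, hidden_dim)
        return torch.log_softmax(self.tag(h), dim=-1)   # one tag distribution per word

tagger = RNNTagger(vocab_size=5000, num_tags=17)         # e.g., 17 universal POS tags
print(tagger(torch.randint(0, 5000, (1, 6))).shape)      # torch.Size([1, 6, 17])
```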
Network Types: Multiple Inputs, Multiple Outputs (βsequence-to-sequenceβ: time delay)
[Figure: an encoder reads inputs x_0, x_1, x_2 into hidden states h_0 → h_1 → h_2; only after the whole input is read does the decoder emit outputs y_0, y_1, y_2, y_3]

Recurrent: sequence input, sequence output (with a time delay)
- Machine translation
- Sequential description
- Summarization
Example: Translation

Translate French (observed) into English:
Le chat est sur la chaise. → The cat is on the chair.

A variable # of input words maps to a variable # of output words.

[Figure: an encoder RNN reads the French words x_0 … x_2 into hidden states; a decoder RNN then generates the English words y_0 … y_3 …]
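A compact sketch of the encoder-decoder idea (no attention, greedy decoding); everything below, from the module names to the sizes, is an illustrative assumption rather than the slides' model.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, bos_id, max_len=20):
        # encode: read the whole source sentence into a hidden state
        _, h = self.encoder(self.src_embed(src))
        # decode: only now emit target words, one at a time (greedy, for illustration)
        token = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            dec_out, h = self.decoder(self.tgt_embed(token), h)
            token = self.out(dec_out).argmax(dim=-1)   # next predicted word id
            outputs.append(token)
        return torch.cat(outputs, dim=1)               # (batch, max_len) word ids

model = Seq2Seq(src_vocab=8000, tgt_vocab=6000)
ids = model(torch.randint(0, 8000, (1, 6)), bos_id=1)  # (1, 20) predicted target ids
```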
RNN Output: Visual Storytelling
[Figure: five photos, each encoded by a CNN; a GRU encoder runs over the image encodings, and GRU decoders generate the story (encode → decode)]

Generated story: "The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water." (Huang et al., 2016)

Human reference: "The family has gathered around the dinner table to share a meal. Afterwards they took the family dog to the beach to get some exercise. The waves were cool and refreshing! The dog had so much fun in the …"
Outline
- Types of networks
- Basic cell definition
- Example in PyTorch
A More Typical View of Recurrent Neural Language Modeling

[Figure: words w_{i-3}, w_{i-2}, w_{i-1}, w_i feed a chain of hidden states h_{i-3} → h_{i-2} → h_{i-1} → h_i; from each hidden state we predict the next word: w_{i-2}, w_{i-1}, w_i, w_{i+1}]

Predict the next word from these hidden states. One input word, its hidden state, and its prediction together form a "cell."
A Simple Recurrent Neural Network Cell

[Figure: one cell of the unrolled network, with input word w_i, previous hidden state h_{i-1}, new hidden state h_i, and the predicted next word w_{i+1}; the arrows are labeled W (encoding: input to hidden), S (hidden to hidden), and U (decoding: hidden to output)]

h_i = σ(S h_{i-1} + W x_i)
x̂_{i+1} = softmax(U h_i)
where σ(y) = 1 / (1 + exp(−y))

We must learn the matrices U, S, W.
- suggested solution: gradient descent on prediction ability
- problem: they're tied across inputs/timesteps
- good news for you: many toolkits do this automatically
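A from-scratch sketch of this cell in PyTorch. The mapping of W to encoding, S to the hidden-to-hidden connection, and U to decoding follows the figure labels above; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """h_i = sigmoid(S h_{i-1} + W x_i);  xhat_{i+1} = softmax(U h_i)."""
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.W = nn.Linear(vocab_size, hidden_dim, bias=False)   # encoding
        self.S = nn.Linear(hidden_dim, hidden_dim, bias=False)   # hidden -> hidden
        self.U = nn.Linear(hidden_dim, vocab_size, bias=False)   # decoding

    def forward(self, x_i, h_prev):
        h_i = torch.sigmoid(self.S(h_prev) + self.W(x_i))
        next_word_probs = torch.softmax(self.U(h_i), dim=-1)
        return next_word_probs, h_i

# the same cell (the same W, S, U) is applied at every timestep:
cell = SimpleRNNCell(vocab_size=1000, hidden_dim=32)
h = torch.zeros(1, 32)
for x in torch.eye(1000)[[5, 17, 203]]:          # three one-hot word vectors
    probs, h = cell(x.unsqueeze(0), h)
```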
Why Is Training RNNs Hard?

- Conceptually, it can get strange, but really, getting the gradient just requires many applications of the chain rule for derivatives.
- Vanishing (and exploding) gradients: we multiply by the same matrices at each timestep, so the gradients multiply many matrices together.
- One solution (for exploding gradients): clip the gradients to a maximum value.
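A sketch of the clipping fix inside a tiny training step, using PyTorch's built-in utility; the model, the data, and the max norm of 1.0 are placeholders.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 50, 8)                 # a long sequence: many repeated matrix multiplications
out, _ = model(x)
loss = out.pow(2).mean()                  # stand-in loss, just for illustration
loss.backward()

# cap the total gradient norm before the update (the clipping fix from the slide)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```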
Outline
- Types of networks
- Basic cell definition
- Example in PyTorch
Natural Language Processing
from torch import *
from keras import *
Pick Your Toolkit
PyTorch Deeplearning4j TensorFlow DyNet Caffe Keras MxNet Gluon CNTK β¦
Comparisons: https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software https://deeplearning4j.org/compare-dl4j-tensorflow-pytorch https://github.com/zer0n/deepframeworks (older---2015)
Defining A Simple RNN in Python
(Modified Very Slightly)
http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

[Figure: the unrolled RNN, with inputs w_{i-2}, w_{i-1}, w_i; hidden states h_{i-2} → h_{i-1} → h_i; predictions w_{i-1}, w_i, w_{i+1}. The slide's code screenshot highlights an "encode" step (input + previous hidden state → new hidden state) and a "decode" step (hidden state → output distribution).]
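The code screenshots did not survive extraction; below is a sketch in the style of the linked tutorial (the i2h/i2o naming follows that tutorial, but treat this as an approximation of the slide's code, not a verbatim copy).

```python
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)   # encode
        self.i2o = nn.Linear(input_size + hidden_size, output_size)   # decode
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)                    # encode: new hidden state
        output = self.softmax(self.i2o(combined))      # decode: log-probabilities over labels
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)
```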
Training A Simple RNN in Python
(Modified Very Slightly)
http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

The slide's code screenshot highlights the steps of one training update:
- Negative log-likelihood (the loss)
- get predictions
- eval predictions
- compute gradient
- perform SGD
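Again, the screenshot is missing; a sketch of the corresponding training step, continuing from the RNN module above (the sizes and the example call are placeholders).

```python
import torch
import torch.nn as nn

# assumes the RNN class from the previous sketch, one-hot character tensors of
# shape (seq_len, 1, n_letters), and a gold label tensor of shape (1,)
n_letters, n_hidden, n_categories = 57, 128, 18      # placeholder sizes
rnn = RNN(n_letters, n_hidden, n_categories)

criterion = nn.NLLLoss()                  # negative log-likelihood
learning_rate = 0.005

def train(category_tensor, line_tensor):
    hidden = rnn.init_hidden()
    rnn.zero_grad()
    # get predictions: run the cell over every step of the sequence
    for i in range(line_tensor.size(0)):
        output, hidden = rnn(line_tensor[i], hidden)
    # eval predictions: loss of the final output against the gold category
    loss = criterion(output, category_tensor)
    # compute gradient
    loss.backward()
    # perform SGD: move each parameter against its gradient
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)
    return output, loss.item()

output, loss = train(torch.tensor([3]), torch.zeros(10, 1, n_letters))
```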
Another Solution: LSTMs/GRUs
- LSTM: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
- GRU: Gated Recurrent Unit (Cho et al., 2014)
- Basic idea: learn to forget

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: LSTM cell diagram, with a "forget" line and a "representation" line running through the cell]
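In practice you rarely implement the gates by hand; a sketch of using PyTorch's built-in gated cells instead of the simple cell above (sizes are illustrative).

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(2, 10, 64)                # (batch, seq_len, embedding dim)
outputs, (h_n, c_n) = lstm(x)             # c_n is the gated memory the cell learns to keep/forget
print(outputs.shape, h_n.shape)           # torch.Size([2, 10, 128]) torch.Size([1, 2, 128])

gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)   # GRU: a lighter gated variant
```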
Outline
- Types of networks
- Basic cell definition
- Example in PyTorch