Natural Language Processing with Deep Learning
LSTM, GRU, and applications in summarization and contextualized word embeddings
Navid Rekab-Saz (navid.rekabsaz@jku.at)
Institute of Computational Perception
Agenda
- Vanishing/Exploding gradient
- RNNs with Gates: LSTM, GRU
- Contextualized word embeddings with RNNs
- Extractive summarization with RNNs
Some slides are adapted from http://web.stanford.edu/class/cs224n/
Element-wise Multiplication
§ Vectors: $a \odot b$
- dimensions: $1{\times}d \odot 1{\times}d = 1{\times}d$
§ Matrices: $B \odot C$
- dimensions: $l{\times}m \odot l{\times}m = l{\times}m$
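As a quick illustration, here is a minimal NumPy sketch of element-wise multiplication (the values are illustrative, not the slide's original example):

```python
import numpy as np

# Vectors: 1×d ⊙ 1×d = 1×d
a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, -2.0, 0.5])
print(a * b)          # [ 3.  -4.   1.5]

# Matrices: l×m ⊙ l×m = l×m (element-wise, unlike the matrix product @)
B = np.array([[2.0, 3.0], [1.0, -1.0]])
C = np.array([[-1.0, 2.0], [0.5, -1.0]])
print(B * C)          # [[-2.   6. ] [ 0.5  1. ]]
```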
Agenda
- Vanishing/Exploding gradient
- RNNs with Gates: LSTM, GRU
- Contextualized word embeddings with RNNs
- Extractive summarization with RNNs
Recap: Backpropagation Through Time (BPTT)
§ Unrolling the RNN (simplified): writing $a^{(t)}$ for the pre-activation, with $h^{(t)} = \sigma(a^{(t)})$, what is $\frac{\partial h^{(4)}}{\partial W_h}$?

[Figure: the RNN unrolled over the inputs $x^{(1)}, \dots, x^{(4)}$; the shared weight matrix $W_h$ appears at every time step]

The gradient receives one contribution per time step:

$\left.\frac{\partial h^{(4)}}{\partial W_h}\right|_{(4)} = \frac{\partial h^{(4)}}{\partial a^{(4)}} \frac{\partial a^{(4)}}{\partial W_h}$

$\left.\frac{\partial h^{(4)}}{\partial W_h}\right|_{(3)} = \frac{\partial h^{(4)}}{\partial a^{(4)}} \frac{\partial a^{(4)}}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial W_h}$

$\left.\frac{\partial h^{(4)}}{\partial W_h}\right|_{(2)} = \frac{\partial h^{(4)}}{\partial a^{(4)}} \frac{\partial a^{(4)}}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial W_h}$

$\left.\frac{\partial h^{(4)}}{\partial W_h}\right|_{(1)} = \frac{\partial h^{(4)}}{\partial a^{(4)}} \frac{\partial a^{(4)}}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial W_h}$

In general, the contribution of time step $i$ is:

$\left.\frac{\partial h^{(t)}}{\partial W_h}\right|_{(i)} = \frac{\partial h^{(t)}}{\partial a^{(t)}} \frac{\partial a^{(t)}}{\partial a^{(t-1)}} \cdots \frac{\partial a^{(i+1)}}{\partial a^{(i)}} \frac{\partial a^{(i)}}{\partial W_h}$
Vanishing/Exploding gradient
§ In practice, the gradient with respect to each time step becomes smaller and smaller as it goes back in time → vanishing gradient
§ Less often, the same may also happen the other way around: the gradient with respect to faraway time steps becomes larger and larger → exploding gradient

Vanishing/Exploding gradient – why?

$\left.\frac{\partial h^{(t)}}{\partial W_h}\right|_{(1)} = \frac{\partial h^{(t)}}{\partial a^{(t)}} \frac{\partial a^{(t)}}{\partial a^{(t-1)}} \cdots \frac{\partial a^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial W_h}$

If these intermediate gradients are small, their product gets smaller and smaller. The further we go back in time, the more of these factors the final gradient contains!
Vanishing/Exploding gradient – why?
§ What is $\frac{\partial a^{(t)}}{\partial a^{(t-1)}}$?
§ Recall the definition of the RNN: $h^{(t)} = \sigma(h^{(t-1)} W_h + x^{(t)} W_x + b)$
§ Let us replace the sigmoid $\sigma$ with a simple linear activation ($\sigma(y) = y$), so that $h^{(t)} = a^{(t)}$:
$h^{(t)} = h^{(t-1)} W_h + x^{(t)} W_x + b$
§ In this case:
$\frac{\partial a^{(t)}}{\partial a^{(t-1)}} = W_h$
Vanishing/Exploding gradient – why?
§ Recall the BPTT formula (for the simplified case):
$\left.\frac{\partial h^{(t)}}{\partial W_h}\right|_{(i)} = \frac{\partial h^{(t)}}{\partial a^{(t)}} \frac{\partial a^{(t)}}{\partial a^{(t-1)}} \cdots \frac{\partial a^{(i+1)}}{\partial a^{(i)}} \frac{\partial a^{(i)}}{\partial W_h}$
§ Since every factor $\frac{\partial a^{(j)}}{\partial a^{(j-1)}}$ equals $W_h$, the BPTT formula can be rewritten as:
$\left.\frac{\partial h^{(t)}}{\partial W_h}\right|_{(i)} = \frac{\partial h^{(t)}}{\partial a^{(t)}} (W_h)^{t-i} \frac{\partial a^{(i)}}{\partial W_h}$

If the weights in $W_h$ are small (i.e., the eigenvalues of $W_h$ are smaller than 1), the term $(W_h)^{t-i}$ gets exponentially smaller.
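This exponential behavior is easy to see numerically. A small NumPy sketch (illustrative, not from the slides): powers of a recurrent weight matrix shrink when its weights are small and blow up when they are large.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

for scale in (0.5, 1.5):
    # Random recurrent matrix; dividing by sqrt(d) puts the spectral radius
    # of a Gaussian matrix roughly at `scale` (a common heuristic).
    W_h = scale * rng.standard_normal((d, d)) / np.sqrt(d)
    prod = np.eye(d)
    print(f"scale = {scale}")
    for k in range(1, 21):
        prod = prod @ W_h          # (W_h)^k, the factor in the BPTT term
        if k in (1, 5, 10, 20):
            print(f"  ||(W_h)^{k}|| = {np.linalg.norm(prod):.3e}")
```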
Why is vanishing/exploding gradient a problem?
§ Vanishing gradient
- The gradient signal from faraway time steps "fades away" and becomes insignificant in comparison with the gradient signal from close-by ones
- Long-term dependencies are not captured, since model weights are updated only with respect to near effects
- One approach to address it: RNNs with gates → LSTM, GRU
§ Exploding gradient
- Gradients become too big → SGD update steps become too large
- This causes (large loss values and) large updates of the parameters, and eventually unstable training
- Main approach to address it: gradient clipping
Gradient clipping
§ Gradient clipping: if the norm of the gradient is greater than some threshold, scale the gradient down
§ Intuition: take a step in the same direction, but a smaller one
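A minimal sketch of norm-based clipping (illustrative; in PyTorch the same idea is provided by torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_gradient(grad: np.ndarray, threshold: float) -> np.ndarray:
    """If ||grad|| exceeds `threshold`, rescale it so its norm equals `threshold`.

    The direction is preserved; only the step size shrinks.
    """
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])        # ||g|| = 5
print(clip_gradient(g, 1.0))    # [0.6 0.8] -> norm 1, same direction
```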
Problem with vanilla RNN – summary
§ It is too difficult for the hidden state of a vanilla RNN to learn and preserve information over several time steps
- In particular since new content is constantly added to the hidden state in every step:
$h^{(t)} = \sigma(h^{(t-1)} W_h + x^{(t)} W_x + b)$
In every step, the input vector "adds" new content to the hidden state.
Agenda
- Vanishing/Exploding gradient
- RNNs with Gates: LSTM, GRU
- Contextualized word embeddings with RNNs
- Extractive summarization with RNNs
Gate vector
§ Gate vector:
- A vector with values between 0 and 1
- A gate vector acts as a "gate-keeper": it controls the content flow of another vector
§ Gate vectors are typically defined using a sigmoid, $g = \sigma(\cdots)$, and are applied to a vector $v$ with element-wise multiplication to control its contents: $g \odot v$
§ For each element (feature) $i$ of the vectors:
- If $g_i$ is 1 → $v_i$ remains the same; everything passes; open gate!
- If $g_i$ is 0 → $v_i$ becomes 0; nothing passes; closed gate!
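A tiny NumPy illustration of gating (the values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v = np.array([2.0, -1.0, 4.0])               # content vector
g = sigmoid(np.array([10.0, 0.0, -10.0]))    # gate ≈ [1, 0.5, 0]
print(g * v)                                 # ≈ [ 2.  -0.5  0. ]: open, half-open, closed
```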
Long Short-Term Memory (LSTM)
§ Proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradient problem
§ LSTM exploits a new vector, the cell state $c^{(t)}$, to carry the memory of previous states
- The cell state stores long-term information
- As in the vanilla RNN, the hidden state $h^{(t)}$ is used as the output vector
§ LSTM controls the process of reading, writing, and erasing information in/from the memory states
- These controls are done using gate vectors
- Gates are dynamic: they are defined based on the input vector and the hidden state

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation (1997)
LSTM – unrolled
[Figure: the LSTM unrolled over time; at each step $t$, the cell takes $x^{(t)}$, $h^{(t-1)}$, and $c^{(t-1)}$, and outputs $h^{(t)}$ and $c^{(t)}$]
LSTM definition – gates
§ Gates are functions of the input vector $x^{(t)}$ and the previous hidden state $h^{(t-1)}$. The $W$ and $b$ terms are the model parameters (weights).

input gate (controls what parts of the new cell content are written to the cell):
$i^{(t)} = \sigma(h^{(t-1)} W_{hi} + x^{(t)} W_{xi} + b_i)$

forget gate (controls what is kept vs. forgotten from the previous cell state):
$f^{(t)} = \sigma(h^{(t-1)} W_{hf} + x^{(t)} W_{xf} + b_f)$

output gate (controls what parts of the cell are output to the hidden state):
$o^{(t)} = \sigma(h^{(t-1)} W_{ho} + x^{(t)} W_{xo} + b_o)$
LSTM definition – states

new cell content (the new content to be used for the cell and the hidden/output state):
$\tilde{c}^{(t)} = \tanh(h^{(t-1)} W_{hc} + x^{(t)} W_{xc} + b_c)$

cell state (erases/"forgets" some content from the last cell state and writes/"inputs" some of the new cell content):
$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}$

hidden state (reads/"outputs" some content from the current cell state):
$h^{(t)} = o^{(t)} \odot \tanh(c^{(t)})$

The $W$ and $b$ terms are the model parameters (weights).
LSTM definition – all together
$i^{(t)} = \sigma(h^{(t-1)} W_{hi} + x^{(t)} W_{xi} + b_i)$   (input gate)
$f^{(t)} = \sigma(h^{(t-1)} W_{hf} + x^{(t)} W_{xf} + b_f)$   (forget gate)
$o^{(t)} = \sigma(h^{(t-1)} W_{ho} + x^{(t)} W_{xo} + b_o)$   (output gate)
$\tilde{c}^{(t)} = \tanh(h^{(t-1)} W_{hc} + x^{(t)} W_{xc} + b_c)$   (new cell content)
$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}$   (cell state)
$h^{(t)} = o^{(t)} \odot \tanh(c^{(t)})$   (hidden state)

The $W$ and $b$ terms are the model parameters (weights).
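To make the equations concrete, here is a minimal NumPy sketch of one LSTM step (the names, shapes, and row-vector/right-multiplication convention follow the slides; real implementations such as torch.nn.LSTM fuse the four weight matrices for speed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step, following the equations above."""
    i = sigmoid(h_prev @ p["W_hi"] + x_t @ p["W_xi"] + p["b_i"])        # input gate
    f = sigmoid(h_prev @ p["W_hf"] + x_t @ p["W_xf"] + p["b_f"])        # forget gate
    o = sigmoid(h_prev @ p["W_ho"] + x_t @ p["W_xo"] + p["b_o"])        # output gate
    c_tilde = np.tanh(h_prev @ p["W_hc"] + x_t @ p["W_xc"] + p["b_c"])  # new cell content
    c = f * c_prev + i * c_tilde    # cell state: forget some old, write some new
    h = o * np.tanh(c)              # hidden state: read from the cell
    return h, c

rng = np.random.default_rng(0)
d, hdim = 4, 3
p = {f"W_h{g}": rng.standard_normal((hdim, hdim)) * 0.1 for g in "ifoc"}
p |= {f"W_x{g}": rng.standard_normal((d, hdim)) * 0.1 for g in "ifoc"}
p |= {f"b_{g}": np.zeros(hdim) for g in "ifoc"}

h, c = np.zeros(hdim), np.zeros(hdim)
for x_t in rng.standard_normal((5, d)):   # run over a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, p)
print(h)
```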
LSTM definition – visually!
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Gated Recurrent Unit (GRU)

update gate (controls what parts of the hidden state are updated vs. preserved):
$u^{(t)} = \sigma(h^{(t-1)} W_{hu} + x^{(t)} W_{xu} + b_u)$

reset gate (controls what parts of the previous hidden state are used to compute the new content):
$r^{(t)} = \sigma(h^{(t-1)} W_{hr} + x^{(t)} W_{xr} + b_r)$

new hidden state content (the reset gate selects the useful parts of the previous hidden state; these and the current input are used to compute the new hidden content):
$\tilde{h}^{(t)} = \tanh((r^{(t)} \odot h^{(t-1)}) W_{hh} + x^{(t)} W_{xh} + b_h)$

hidden state (the update gate simultaneously controls what is kept from the previous hidden state and what is updated with the new hidden state content):
$h^{(t)} = (1 - u^{(t)}) \odot h^{(t-1)} + u^{(t)} \odot \tilde{h}^{(t)}$

The $W$ and $b$ terms are the model parameters (weights).

Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proc. of EMNLP.
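And the corresponding one-step sketch for the GRU (same conventions and caveats as the LSTM sketch above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step, following the equations above."""
    u = sigmoid(h_prev @ p["W_hu"] + x_t @ p["W_xu"] + p["b_u"])   # update gate
    r = sigmoid(h_prev @ p["W_hr"] + x_t @ p["W_xr"] + p["b_r"])   # reset gate
    h_tilde = np.tanh((r * h_prev) @ p["W_hh"] + x_t @ p["W_xh"] + p["b_h"])
    return (1.0 - u) * h_prev + u * h_tilde  # interpolate old state vs. new content

rng = np.random.default_rng(0)
d, hdim = 4, 3
p = {f"W_h{g}": rng.standard_normal((hdim, hdim)) * 0.1 for g in "urh"}
p |= {f"W_x{g}": rng.standard_normal((d, hdim)) * 0.1 for g in "urh"}
p |= {f"b_{g}": np.zeros(hdim) for g in "urh"}
print(gru_step(rng.standard_normal(d), np.zeros(hdim), p))
```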
RNNs with gates – counting parameters
§ Parameters in the LSTM (bias terms discarded)
- $W_{hi}, W_{hf}, W_{ho}, W_{hc}$: 4 matrices of size $h{\times}h$
- $W_{xi}, W_{xf}, W_{xo}, W_{xc}$: 4 matrices of size $d{\times}h$
§ Parameters in the GRU (bias terms discarded)
- $W_{hu}, W_{hr}, W_{hh}$: 3 matrices of size $h{\times}h$
- $W_{xu}, W_{xr}, W_{xh}$: 3 matrices of size $d{\times}h$
§ If also considering the encoder and decoder embeddings (e.g., in a language modeling network)
- $E$: size $|V|{\times}d$
- $U$: size $h{\times}|V|$

$d$ and $h$ are the number of dimensions of the input embeddings and the hidden vectors, respectively; $|V|$ is the vocabulary size.
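A quick sanity check of these counts in plain Python (the sizes are illustrative):

```python
d, h, V = 300, 512, 10_000          # embedding dim, hidden dim, vocabulary size

lstm_params = 4 * (h * h + d * h)   # W_h* and W_x* for the i, f, o, c blocks
gru_params = 3 * (h * h + d * h)    # W_h* and W_x* for the u, r, h blocks
embedding_params = V * d + h * V    # input embeddings E and output projection U

print(f"LSTM: {lstm_params:,}  GRU: {gru_params:,}  embeddings: {embedding_params:,}")
```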
RNNs with gates – summary
§ LSTM (and GRU), with their dynamic gate mechanisms, make it easier to preserve necessary information over many time steps
§ LSTM does not guarantee that there is no vanishing/exploding gradient, but its large success in practice has shown that it can learn long-distance dependencies
§ LSTM vs. GRU: LSTM is usually the default choice, especially when enough training data is available and capturing longer distances is important; GRU is faster and better suited to settings with low computational resources
Agenda
- Vanishing/Exploding gradient
- RNNs with Gates: LSTM, GRU
- Contextualized word embeddings with RNNs
- Extractive summarization with RNNs
Contextualized Word Embeddings
§ The meaning of words can be understood better when we consider their contexts
- Example of sense disambiguation when context is available:
- What is "apple"? A fruit, or the name of a company?
- "eating an apple" vs. "share of the apple company"
§ In contextualized word embedding, the representation of a word is defined based on the context it appears in
§ The input is a sequence of word embeddings, and the output is a sequence of contextualized word embeddings
Contextualized Word Embeddings
[Figure: an RNN processes the word embeddings $x^{(1)}, \dots, x^{(T)}$ of "The quick brown fox jumps over the lazy dog" and outputs contextualized embeddings $y^{(1)}, \dots, y^{(T)}$]
The output at "brown" is its contextualized word embedding; however, it has only had access to the previous words (not the future ones).
Bidirectional RNNs
§ A bidirectional RNN consists of two RNNs: one reads from the beginning to the end of the sequence (forward), and the other reads from the end to the beginning (backward)

$\overrightarrow{h}^{(t)} = \overrightarrow{\text{RNN}}(\overrightarrow{h}^{(t-1)}, x^{(t)})$
$\overleftarrow{h}^{(t)} = \overleftarrow{\text{RNN}}(\overleftarrow{h}^{(t+1)}, x^{(t)})$

§ The output at each time step is the concatenation of the outputs of both RNNs at that time step:
$h^{(t)} = [\overrightarrow{h}^{(t)}; \overleftarrow{h}^{(t)}]$

§ To remember: using a bidirectional RNN is only possible when the entire sequence is available
[Figure: a bidirectional RNN over $x^{(1)}, \dots, x^{(5)}$; the forward RNN reads left to right, the backward RNN reads right to left, and the per-step outputs $[\overrightarrow{h}^{(t)}; \overleftarrow{h}^{(t)}]$ concatenate the two]
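A minimal NumPy sketch of this idea, using a vanilla RNN for each direction (weights and sizes are illustrative):

```python
import numpy as np

def run_rnn(xs, W_h, W_x, b):
    """Vanilla tanh RNN; returns the hidden state at every time step."""
    h, states = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(h @ W_h + x @ W_x + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
d, hdim = 4, 3
xs = rng.standard_normal((5, d))

def make_params():
    # Each direction gets its own weights
    return (rng.standard_normal((hdim, hdim)) * 0.1,
            rng.standard_normal((d, hdim)) * 0.1,
            np.zeros(hdim))

forward = run_rnn(xs, *make_params())
backward = run_rnn(xs[::-1], *make_params())[::-1]   # re-aligned to forward order

# Output per time step: concatenation of both directions
outputs = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(outputs[2].shape)    # (6,) = 2 * hdim
```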
ELMo (Embeddings from Language Models)
§ ELMo is a multi-layer bidirectional LSTM ($L$ layers), trained on a language modeling objective

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. In Proc. of NAACL-HLT.
ELMo – contextualized word embedding
§ Given the input embeddings $e_1, e_2, \dots, e_N$, each of the $L$ layers of ELMo outputs a set of contextualized word embeddings $h_1^{(l)}, h_2^{(l)}, \dots, h_N^{(l)}$
§ Contextualized embeddings in higher layers capture semantic aspects (e.g., word senses), while embeddings in lower layers model aspects of syntax (e.g., useful for part-of-speech tagging)
ELMo in supervised tasks
§ In supervised tasks, ELMo makes use of the embeddings of all layers (not only the last layer)!
§ The final word embedding is a weighted sum of the intermediary hidden states. For the word at position $k$:

$\text{ELMo}_k = \gamma \sum_{l=0}^{L} s_l \, h_k^{(l)}$

$s_l$ defines the weight (importance) of layer $l$
§ The $s_l$ values and $\gamma$ are also model parameters, and are trained end-to-end with the whole model
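A small NumPy sketch of this weighted combination (the layer states are random stand-ins; in ELMo the layer weights are softmax-normalized):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

L, dim = 3, 6      # number of layers (incl. the input layer 0) and embedding size
layer_states = np.random.default_rng(0).standard_normal((L, dim))  # h_k^(l) stand-ins

s = softmax(np.zeros(L))   # learnable layer weights s_l (uniform at initialization)
gamma = 1.0                # learnable global scale
elmo_k = gamma * (s[:, None] * layer_states).sum(axis=0)
print(elmo_k.shape)        # (dim,)
```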
ELMo – key results
[Figure: benchmark results for adding ELMo embeddings; see https://allennlp.org/elmo]
Agenda
- Vanishing/Exploding gradient
- RNNs with Gates: LSTM, GRU
- Contextualized word embeddings with RNNs
- Extractive summarization with RNNs
Text Summarization
§ The task of summarizing the key information content of a text (document) $X = \{x_1, x_2, \dots, x_N\}$ in a summary $Y = \{y_1, y_2, \dots, y_M\}$
- The summary is concise and (much) shorter than the document
§ Some datasets:
- Gigaword: first one or two sentences of a news article
- CNN/DailyMail: news articles
- WikiHow: full how-to articles
Summarization
§ Extractive summarization
- Selecting sections (typically sentences) of the document
- A model decides whether a section of the document should be selected for the summary
§ Abstractive summarization
- Writing (generating) new summary text for the document
- A language generation task conditioned on the document
Extractive Summarization
[Figure: example of an extractive summary, built from sentences selected out of the document]
Abstractive Summarization

Document:
SAN FRANCISCO, California (Reuters) -- Sony has cut the price of the PlayStation 3 by $100, or 17 percent, in the United States, a move that should boost the video game console's lackluster sales. Starting Monday, the current PS3 60 gigabyte model will cost $499 -- a $100 price drop. The PlayStation 3, which includes a 60-gigabyte hard drive and a Blu-ray high-definition DVD player, will now cost $500, or $20 more than the most expensive version of Microsoft's Xbox 360. The PS3 still costs twice that of Nintendo's Wii console, whose $250 price and motion-sensing controller have made it a best-seller despite its lack of cutting-edge graphics and hard disk. "Our initial expectation is that sales should double at a minimum," Jack Tretton, chief executive of Sony Computer Entertainment America, said in an interview. "We've gotten our production issues behind us on the PlayStation 3, reaching a position to pass on the savings to consumers, and our attitude is the sooner the better."
…

Summary:
- Sony drops price of current 60GB PlayStation 3 console by $100 in U.S.
- PS3 still costs twice that of Nintendo's best-selling Wii console, which is $250
- Some expect Microsoft to respond with its first price cuts on the Xbox 360
- Sony to revise PS3 console with bigger 80GB hard drive
Summarization – Evaluation
§ ROUGE-N: overlap of n-grams between the output and the reference summary
- ROUGE-1: the overlap of unigrams
- ROUGE-2: the overlap of bigrams
- …

$\text{ROUGE-N} = \dfrac{|\text{n-grams}(\hat{Y}) \cap \text{n-grams}(Y)|}{|\text{n-grams}(Y)|}$

$Y$ and $\hat{Y}$ are the reference and the output summary, respectively; $\text{n-grams}$ returns the set of all possible n-grams of the given text.

ROUGE: A Package for Automatic Evaluation of Summaries, Lin, 2004. http://www.aclweb.org/anthology/W04-1013
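A straightforward sketch of set-based ROUGE-N recall (simplified; the official ROUGE script additionally handles stemming, multiple references, etc.):

```python
def ngrams(tokens, n):
    """The set of all n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n(output, reference, n):
    """Recall-based ROUGE-N: n-gram overlap divided by the reference n-grams."""
    ref = ngrams(reference.split(), n)
    out = ngrams(output.split(), n)
    return len(out & ref) / len(ref)

print(rouge_n("police kill the gunman", "police killed the gunman", 1))  # 0.75
print(rouge_n("police kill the gunman", "police killed the gunman", 2))  # 0.333...
```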
Summarization – Evaluation
§ ROUGE-L is based on the length of the longest common subsequence (LCS) between the output and the reference summary:

$\text{ROUGE-L} = \dfrac{|\text{LCS}(\hat{Y}, Y)|}{|Y|}$

LCS is the longest common subsequence of the two given texts.
§ ROUGE-L
- does not require consecutive matches, but in-sequence matches
- reflects sentence structure
- does not need a predefined n-gram length
Summarization – Evaluation
§ ROUGE (in the discussed definitions) is a recall-based measure
- ROUGE can also be defined as precision-based, as well as an F-measure

Example
§ $Y$: "police killed the gunman"
§ $\hat{Y}_1$: "police kill the gunman"
§ $\hat{Y}_2$: "the gunman kill police"
§ LCS($\hat{Y}_1$, $Y$) = "police the gunman" → ROUGE-L($\hat{Y}_1$, $Y$) = 0.75
§ LCS($\hat{Y}_2$, $Y$) = "the gunman" → ROUGE-L($\hat{Y}_2$, $Y$) = 0.5
§ However, the ROUGE-2 scores of $\hat{Y}_1$ and $\hat{Y}_2$ are the same
- In both $\hat{Y}_1$ and $\hat{Y}_2$, "the gunman" is the only bigram in common with the reference
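A dynamic-programming sketch of LCS and ROUGE-L, reproducing the example above:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(output, reference):
    out, ref = output.split(), reference.split()
    return lcs_len(out, ref) / len(ref)

print(rouge_l("police kill the gunman", "police killed the gunman"))  # 0.75
print(rouge_l("the gunman kill police", "police killed the gunman"))  # 0.5
```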
Extractive Summarization – paper walkthrough

Problem definition
§ A document $D$ contains $N$ sentences: $D = \{s_1, s_2, \dots, s_N\}$
§ Each sentence $s_i$ contains a set of words
§ Extractive summarization objective: select $k$ sentences of the document such that they provide the "best" summary (concise and comprehensive)
§ Evaluation is done by comparing the output summary with the reference summary

Zhou, Q., Yang, N., Wei, F., Huang, S., Zhou, M., & Zhao, T. (2018). Neural Document Summarization by Jointly Learning to Score and Select Sentences. In Proc. of ACL.
Core Ideas – NeuSum model
§ At each time step, the model wants to decide which sentence to include in the summary (sentence selection)
§ To do sentence selection, at each time step the model assigns scores to the sentences that are not yet included in the summary (sentence scoring), and selects the one with the highest score
§ Sentence scoring is based on the representation of each sentence, but also on the content of the previously selected sentences
- Why on previously selected sentences? Intuitively, if some contents are already included in the summary, the model should avoid selecting sentences with similar contents
Sentence encoding
[Figure: a word-level bidirectional GRU produces contextual word embeddings; its final hidden states form a sentence embedding; a sentence-level bidirectional GRU then produces contextualized sentence embeddings]
Sentence scoring and selection
§ Sentence scoring learns a function $\delta$ that assigns a score to each sentence $s_i$ at time step $t$. It uses:
1. the sentence embedding $\tilde{s}_i$
2. information about the previously selected sentences, embedded in the vector $h_t$ → the current state of the summary

$\delta(s_i) = \text{function}(h_t, \tilde{s}_i)$
$\delta(s_i) = v \tanh(h_t W_q + \tilde{s}_i W_s + b)$

The $v$, $W_q$, $W_s$, and $b$ terms are the model parameters (weights).
§ $\delta(s_i)$ is calculated for the sentences that are not yet included in the summary
§ The sentence with the highest $\delta$ is added to the summary
Sentence scoring
§ How is the current state of the summary $h_t$ calculated?
- Using another GRU
§ This GRU outputs the new state of the summary from the previous state of the summary and the last selected sentence:

$h_t = \text{GRU}(h_{t-1}, \tilde{s}_{t-1})$

You can find more such details by looking into the paper:
§ $h_0$ is created based on the document embedding
§ The model is optimized using the Kullback-Leibler divergence between the distribution of output scores and the distribution of ROUGE-2 F1 gains of the sentences
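A hypothetical sketch of the scoring-and-selection loop (the weight names and the cheap state update are stand-ins; in NeuSum the state $h_t$ is updated by a learned GRU and all weights are trained):

```python
import numpy as np

def score_sentences(h_t, sent_embs, selected, W_q, W_s, v, b):
    """Score every sentence not yet in the summary, given the summary state h_t."""
    return {i: v @ np.tanh(h_t @ W_q + s @ W_s + b)
            for i, s in enumerate(sent_embs) if i not in selected}

rng = np.random.default_rng(0)
d = 8
sent_embs = rng.standard_normal((5, d))                # 5 sentence embeddings
W_q, W_s = rng.standard_normal((d, d)), rng.standard_normal((d, d))
v, b = rng.standard_normal(d), np.zeros(d)

h_t, selected = np.zeros(d), set()
for step in range(3):                                  # greedily pick 3 sentences
    scores = score_sentences(h_t, sent_embs, selected, W_q, W_s, v, b)
    best = max(scores, key=scores.get)
    selected.add(best)
    h_t = np.tanh(h_t + sent_embs[best])               # stand-in for the GRU update
    print(f"step {step}: selected sentence {best}")
```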
[Figure: an unrolled GRU that outputs the next state of the summary; $h_t$ can be seen as an aggregation of the contents of the summary so far, and feeds the scorer $\delta(s_i) = \text{function}(h_t, \tilde{s}_i)$]
Recap
§ Vanishing/Exploding gradient
§ RNNs with gates
- address the problem of vanishing/exploding gradients by controlling what is stored in and forgotten from the memory states
Recap: ELMo
A contextualized word embedding model using a multi-layer bidirectional LSTM