Natural Language Processing with Deep Learning: LSTM, GRU, and applications in summarization and contextualized word embeddings



SLIDE 1

Natural Language Processing with Deep Learning: LSTM, GRU, and applications in summarization and contextualized word embeddings

Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception

SLIDE 2

Agenda

  • Vanishing/Exploding gradient
  • RNNs with Gates: LSTM, GRU
  • Contextualized word embeddings with RNNs
  • Extractive summarization with RNNs

Some slides are adapted from http://web.stanford.edu/class/cs224n/

SLIDE 3

Element-wise Multiplication

  β€’ Vectors: 𝒂 βŠ™ 𝒃 = 𝒄
  β€’ dimensions: (1Γ—d) βŠ™ (1Γ—d) = 1Γ—d
  β€’ example: [1  2] βŠ™ [3  βˆ’2] = [3  βˆ’6]
  β€’ Matrices: 𝑨 βŠ™ 𝑩 = π‘ͺ
  β€’ dimensions: (lΓ—m) βŠ™ (lΓ—m) = lΓ—m
  β€’ example: [[2, 1], [1, βˆ’1]] βŠ™ [[βˆ’1, 2], [0.5, βˆ’1]] = [[βˆ’2, 2], [0.5, 1]] (see the code sketch below)
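A minimal NumPy sketch of the element-wise (Hadamard) product; the example values are illustrative, not taken verbatim from the slide:

```python
import numpy as np

# Element-wise product of two vectors of the same shape (1 x d)
b = np.array([1.0, 2.0])
c = np.array([3.0, -2.0])
print(b * c)                      # [ 3. -6.]

# Element-wise product of two matrices of the same shape (l x m)
B = np.array([[2.0, 1.0],
              [1.0, -1.0]])
C = np.array([[-1.0, 2.0],
              [0.5, -1.0]])
print(B * C)                      # [[-2.   2. ]
                                  #  [ 0.5  1. ]]
```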

SLIDE 4

Agenda

  • Vanishing/Exploding gradient
  • RNNs with Gates: LSTM, GRU
  • Contextualized word embeddings with RNNs
  • Extractive summarization with RNNs
SLIDE 5

Recap: Backpropagation Through Time (BPTT)

  β€’ Unrolling the RNN (simplified): the same weight matrix 𝑾_h is applied at every time step, producing the hidden states 𝒉^(0), …, 𝒉^(tβˆ’2), 𝒉^(tβˆ’1), 𝒉^(t) and the loss β„’^(t)
  β€’ Question: βˆ‚β„’^(t) / βˆ‚π‘Ύ_h = ?

SLIDE 6

Recap: Backpropagation Through Time (BPTT)

  β€’ Concrete example with four steps: starting from 𝒉^(0), the RNN produces 𝒉^(1), 𝒉^(2), 𝒉^(3), 𝒉^(4), and the loss β„’^(4) is computed at the last step
  β€’ Question: βˆ‚β„’^(4) / βˆ‚π‘Ύ_h = ?

SLIDE 7

Recap: Backpropagation Through Time (BPTT)

The gradient of β„’^(4) with respect to 𝑾_h receives one contribution from every time step at which 𝑾_h is applied:

  βˆ‚β„’^(4)/βˆ‚π‘Ύ_h |_(4) = βˆ‚β„’^(4)/βˆ‚π’‰^(4) Β· βˆ‚π’‰^(4)/βˆ‚π‘Ύ_h
  βˆ‚β„’^(4)/βˆ‚π‘Ύ_h |_(3) = βˆ‚β„’^(4)/βˆ‚π’‰^(4) Β· βˆ‚π’‰^(4)/βˆ‚π’‰^(3) Β· βˆ‚π’‰^(3)/βˆ‚π‘Ύ_h
  βˆ‚β„’^(4)/βˆ‚π‘Ύ_h |_(2) = βˆ‚β„’^(4)/βˆ‚π’‰^(4) Β· βˆ‚π’‰^(4)/βˆ‚π’‰^(3) Β· βˆ‚π’‰^(3)/βˆ‚π’‰^(2) Β· βˆ‚π’‰^(2)/βˆ‚π‘Ύ_h
  βˆ‚β„’^(4)/βˆ‚π‘Ύ_h |_(1) = βˆ‚β„’^(4)/βˆ‚π’‰^(4) Β· βˆ‚π’‰^(4)/βˆ‚π’‰^(3) Β· βˆ‚π’‰^(3)/βˆ‚π’‰^(2) Β· βˆ‚π’‰^(2)/βˆ‚π’‰^(1) Β· βˆ‚π’‰^(1)/βˆ‚π‘Ύ_h

The full gradient is the sum of these per-step contributions:

  βˆ‚β„’^(t)/βˆ‚π‘Ύ_h = Ξ£_{j=1}^{t} βˆ‚β„’^(t)/βˆ‚π‘Ύ_h |_(j)

where the contribution of time step j is

  βˆ‚β„’^(t)/βˆ‚π‘Ύ_h |_(j) = βˆ‚β„’^(t)/βˆ‚π’‰^(t) Β· βˆ‚π’‰^(t)/βˆ‚π’‰^(tβˆ’1) Β· … Β· βˆ‚π’‰^(j+1)/βˆ‚π’‰^(j) Β· βˆ‚π’‰^(j)/βˆ‚π‘Ύ_h

SLIDE 8

Vanishing/Exploding gradient

  β€’ In practice, the gradient contribution of a time step becomes smaller and smaller the further back in time it lies β†’ vanishing gradient
  β€’ Less often, the opposite happens: the gradient contribution of faraway time steps becomes larger and larger β†’ exploding gradient

SLIDE 9

Vanishing/Exploding gradient – why?

Consider the contribution of the first time step to the gradient:

  βˆ‚β„’^(4)/βˆ‚π‘Ύ_h |_(1) = βˆ‚β„’^(4)/βˆ‚π’‰^(4) Β· βˆ‚π’‰^(4)/βˆ‚π’‰^(3) Β· βˆ‚π’‰^(3)/βˆ‚π’‰^(2) Β· βˆ‚π’‰^(2)/βˆ‚π’‰^(1) Β· βˆ‚π’‰^(1)/βˆ‚π‘Ύ_h

If these factors are small, their product gets smaller and smaller. The further back we go, the more of these factors the contribution contains!

SLIDE 10

Vanishing/Exploding gradient – why?

  β€’ What is βˆ‚π’‰^(t) / βˆ‚π’‰^(tβˆ’1)?
  β€’ Recall the definition of the RNN:  𝒉^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_h + 𝒙^(t) 𝑾_x + 𝒃)
  β€’ Let's replace the sigmoid Οƒ with the identity (linear) activation function:  𝒉^(t) = 𝒉^(tβˆ’1) 𝑾_h + 𝒙^(t) 𝑾_x + 𝒃
  β€’ In this case:  βˆ‚π’‰^(t) / βˆ‚π’‰^(tβˆ’1) = 𝑾_h

SLIDE 11

Vanishing/Exploding gradient – why?

  β€’ Recall the BPTT formula (for the simplified linear case):

  βˆ‚β„’^(t)/βˆ‚π‘Ύ_h |_(j) = βˆ‚β„’^(t)/βˆ‚π’‰^(t) Β· βˆ‚π’‰^(t)/βˆ‚π’‰^(tβˆ’1) Β· … Β· βˆ‚π’‰^(j+1)/βˆ‚π’‰^(j) Β· βˆ‚π’‰^(j)/βˆ‚π‘Ύ_h

  β€’ Setting m = t βˆ’ j, the formula can be rewritten as:

  βˆ‚β„’^(t)/βˆ‚π‘Ύ_h |_(j) = βˆ‚β„’^(t)/βˆ‚π’‰^(t) Β· (𝑾_h)^m Β· βˆ‚π’‰^(j)/βˆ‚π‘Ύ_h

If the weights in 𝑾_h are small (i.e. the eigenvalues of 𝑾_h are smaller than 1), the term (𝑾_h)^m gets exponentially smaller as m grows.
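A small NumPy illustration of this effect (not from the slides): repeatedly multiplying by the same matrix shrinks the product exponentially when its eigenvalues are below 1, and blows it up when they are above 1.

```python
import numpy as np

def bptt_factor_norm(w_h: np.ndarray, m: int) -> float:
    """Norm of (W_h)^m -- the factor in the BPTT contribution of a step m time steps back."""
    return np.linalg.norm(np.linalg.matrix_power(w_h, m))

w_small = np.diag([0.3, 0.5, 0.7, 0.9])   # all eigenvalues < 1  -> vanishing
w_large = np.diag([1.1, 1.3, 1.5, 1.7])   # all eigenvalues > 1  -> exploding

for m in (1, 5, 10, 20):
    print(m, f"{bptt_factor_norm(w_small, m):.2e}", f"{bptt_factor_norm(w_large, m):.2e}")
```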

SLIDE 12

Why is vanishing/exploding gradient a problem?

  β€’ Vanishing gradient
  β€’ The gradient signal from faraway time steps β€œfades away” and becomes insignificant in comparison with the gradient signal from close-by steps
  β€’ Long-term dependencies are not captured, since model weights are updated only with respect to near effects
  β†’ one approach to address it: RNNs with gates – LSTM, GRU

  β€’ Exploding gradient
  β€’ Gradients become too big β†’ SGD update steps become too large
  β€’ This causes (large loss values and) large updates to the parameters, and eventually unstable training
  β†’ main approach to address it: gradient clipping

SLIDE 13

Gradient clipping

  β€’ Gradient clipping: if the norm of the gradient is greater than some threshold, scale the gradient down
  β€’ Intuition: take a step in the same direction, but a smaller step (see the sketch below)
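A minimal sketch of norm-based gradient clipping in NumPy (illustrative; in PyTorch the same idea is provided by torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_gradient(grad: np.ndarray, max_norm: float) -> np.ndarray:
    """Scale the gradient down if its L2 norm exceeds max_norm; the direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])            # norm = 50
print(clip_gradient(g, max_norm=5.0))  # [ 3. -4.]  (same direction, norm = 5)
```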

SLIDE 14

Problem with vanilla RNN – summary

  β€’ It is difficult for the hidden state of a vanilla RNN to learn and preserve information over several time steps
  β€’ In particular, because new content is constantly added to the hidden state at every step:

  𝒉^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_h + 𝒙^(t) 𝑾_x + 𝒃)

In every step, the input vector β€œadds” new content to the hidden state.

SLIDE 15

Agenda

  • Vanishing & Exploding gradient
  • RNNs with Gates: LSTM, GRU
  • Contextualized word embeddings with LSTM
  • Extractive summarization with LSTM
SLIDE 16

Gate vector

  β€’ A gate vector is a vector with values between 0 and 1
  β€’ It acts as a β€œgate-keeper”: it controls the content flow of another vector
  β€’ Gate vectors are typically defined using a sigmoid:  π’ˆ = Οƒ(some vector)
    … and are applied to a vector 𝒗 with element-wise multiplication to control its contents:  π’ˆ βŠ™ 𝒗 (see the sketch below)
  β€’ For each element (feature) j of the vectors:
  β€’ If g_j is 1 β†’ v_j remains the same; everything passes; open gate!
  β€’ If g_j is 0 β†’ v_j becomes 0; nothing passes; closed gate!
SLIDE 17

Long Short-Term Memory (LSTM)

  β€’ Proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradient problem
  β€’ LSTM introduces a new vector, the cell state 𝒄^(t), to carry the memory of previous states
  β€’ The cell state stores long-term information
  β€’ As in the vanilla RNN, the hidden state 𝒉^(t) is used as the output vector
  β€’ LSTM controls the process of reading, writing, and erasing information in/from the memory states
  β€’ These controls are done using gate vectors
  β€’ Gates are dynamic: they are defined based on the input vector and the hidden state

Hochreiter, Sepp, and JΓΌrgen Schmidhuber. "Long short-term memory" Neural computation (1997)

SLIDE 18

LSTM – unrolled

[Figure: LSTM unrolled over time. At each step t, the LSTM cell receives the input 𝒙^(t), the previous hidden state 𝒉^(tβˆ’1), and the previous cell state 𝒄^(tβˆ’1), and produces the new hidden state 𝒉^(t) and cell state 𝒄^(t), which are passed on to the next step.]

SLIDE 19

LSTM definition – gates

  β€’ Gates are functions of the input vector 𝒙^(t) and the previous hidden state 𝒉^(tβˆ’1):

  π’Š^(t) = function(𝒉^(tβˆ’1), 𝒙^(t))   π’Š^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_hi + 𝒙^(t) 𝑾_xi + 𝒃_i)
  𝒇^(t) = function(𝒉^(tβˆ’1), 𝒙^(t))   𝒇^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_hf + 𝒙^(t) 𝑾_xf + 𝒃_f)
  𝒐^(t) = function(𝒉^(tβˆ’1), 𝒙^(t))   𝒐^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_ho + 𝒙^(t) 𝑾_xo + 𝒃_o)

Model parameters (weights) are shown in red (on the original slide).

  β€’ input gate π’Š^(t): controls what parts of the new cell content are written to the cell
  β€’ forget gate 𝒇^(t): controls what is kept vs. forgotten from the previous cell state
  β€’ output gate 𝒐^(t): controls what parts of the cell are output to the hidden state
SLIDE 20

LSTM definition – states

  𝒄̃^(t) = function(𝒉^(tβˆ’1), 𝒙^(t))   𝒄̃^(t) = tanh(𝒉^(tβˆ’1) 𝑾_hc + 𝒙^(t) 𝑾_xc + 𝒃_c)
  𝒄^(t) = 𝒇^(t) βŠ™ 𝒄^(tβˆ’1) + π’Š^(t) βŠ™ 𝒄̃^(t)
  𝒉^(t) = 𝒐^(t) βŠ™ tanh(𝒄^(t))

Model parameters (weights) are shown in red (on the original slide).

  β€’ new cell content 𝒄̃^(t): the new content to be used for the cell and hidden (output) state
  β€’ cell state 𝒄^(t): erases (β€œforgets”) some content from the last cell state, and writes (β€œinputs”) some new cell content
  β€’ hidden state 𝒉^(t): reads (β€œoutputs”) some content from the current cell state

SLIDE 21

LSTM definition – all together

  π’Š^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_hi + 𝒙^(t) 𝑾_xi + 𝒃_i)
  𝒇^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_hf + 𝒙^(t) 𝑾_xf + 𝒃_f)
  𝒐^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_ho + 𝒙^(t) 𝑾_xo + 𝒃_o)
  𝒄̃^(t) = tanh(𝒉^(tβˆ’1) 𝑾_hc + 𝒙^(t) 𝑾_xc + 𝒃_c)
  𝒄^(t) = 𝒇^(t) βŠ™ 𝒄^(tβˆ’1) + π’Š^(t) βŠ™ 𝒄̃^(t)
  𝒉^(t) = 𝒐^(t) βŠ™ tanh(𝒄^(t))

Model parameters (weights) are shown in red (on the original slide).

  β€’ input gate π’Š^(t): controls what parts of the new cell content are written to the cell
  β€’ forget gate 𝒇^(t): controls what is kept vs. forgotten from the previous cell state
  β€’ output gate 𝒐^(t): controls what parts of the cell are output to the hidden state
  β€’ new cell content 𝒄̃^(t): the new content to be used for the cell and hidden (output) state
  β€’ cell state 𝒄^(t): erases (β€œforgets”) some content from the last cell state, and writes (β€œinputs”) some new cell content
  β€’ hidden state 𝒉^(t): reads (β€œoutputs”) some content from the current cell state

A NumPy sketch of a single LSTM step follows below.
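A minimal NumPy sketch of one LSTM step following the equations above (the weight shapes and random initialization are illustrative assumptions, not the slide's values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_h, W_x, b):
    """One LSTM step. W_h: (h, 4h), W_x: (e, 4h), b: (4h,); blocks ordered [i, f, o, c~]."""
    z = h_prev @ W_h + x_t @ W_x + b
    h_dim = h_prev.shape[-1]
    i = sigmoid(z[..., 0*h_dim:1*h_dim])   # input gate
    f = sigmoid(z[..., 1*h_dim:2*h_dim])   # forget gate
    o = sigmoid(z[..., 2*h_dim:3*h_dim])   # output gate
    c_tilde = np.tanh(z[..., 3*h_dim:])    # new cell content
    c_t = f * c_prev + i * c_tilde         # erase some old content, write some new content
    h_t = o * np.tanh(c_t)                 # read some content from the cell state
    return h_t, c_t

e, h = 5, 4
rng = np.random.default_rng(0)
h_t, c_t = lstm_step(rng.normal(size=e), np.zeros(h), np.zeros(h),
                     rng.normal(size=(h, 4*h)), rng.normal(size=(e, 4*h)), np.zeros(4*h))
print(h_t.shape, c_t.shape)  # (4,) (4,)
```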

SLIDE 22

LSTM definition – visually!

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 23

Gated Recurrent Unit (GRU)

  𝒖^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_hu + 𝒙^(t) 𝑾_xu + 𝒃_u)
  𝒓^(t) = Οƒ(𝒉^(tβˆ’1) 𝑾_hr + 𝒙^(t) 𝑾_xr + 𝒃_r)
  𝒉̃^(t) = tanh((𝒓^(t) βŠ™ 𝒉^(tβˆ’1)) 𝑾_hh + 𝒙^(t) 𝑾_xh + 𝒃_h)
  𝒉^(t) = (1 βˆ’ 𝒖^(t)) βŠ™ 𝒉^(tβˆ’1) + 𝒖^(t) βŠ™ 𝒉̃^(t)

Model parameters (weights) are shown in red (on the original slide).

  β€’ update gate 𝒖^(t): controls what parts of the hidden state are updated vs. preserved
  β€’ reset gate 𝒓^(t): controls what parts of the previous hidden state are used to compute the new content
  β€’ new hidden state content 𝒉̃^(t): (1) the reset gate selects useful parts of the previous hidden state; (2) this and the current input are used to compute the new hidden content
  β€’ hidden state 𝒉^(t): the update gate simultaneously controls what is kept from the previous hidden state and what is updated to the new hidden state content

A NumPy sketch of one GRU step follows below.

Cho, K., van MerriΓ«nboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proc. of EMNLP.
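A minimal NumPy sketch of one GRU step following these equations (shapes and initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_h, W_x, b):
    """One GRU step. W_h: (h, 3h), W_x: (e, 3h), b: (3h,); blocks ordered [u, r, h~]."""
    h_dim = h_prev.shape[-1]
    zx = x_t @ W_x + b
    zh = h_prev @ W_h
    u = sigmoid(zh[..., :h_dim] + zx[..., :h_dim])                  # update gate
    r = sigmoid(zh[..., h_dim:2*h_dim] + zx[..., h_dim:2*h_dim])    # reset gate
    h_tilde = np.tanh((r * h_prev) @ W_h[:, 2*h_dim:] + zx[..., 2*h_dim:])  # new content
    return (1.0 - u) * h_prev + u * h_tilde                         # keep vs. update

e, h = 5, 4
rng = np.random.default_rng(0)
h_t = gru_step(rng.normal(size=e), np.zeros(h),
               rng.normal(size=(h, 3*h)), rng.normal(size=(e, 3*h)), np.zeros(3*h))
print(h_t.shape)  # (4,)
```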

SLIDE 24

RNNs with gates – counting parameters

  β€’ Parameters in LSTM (bias terms discarded):
  β€’ 𝑾_hi, 𝑾_hf, 𝑾_ho, 𝑾_hc β†’ hΓ—h βˆ— 4
  β€’ 𝑾_xi, 𝑾_xf, 𝑾_xo, 𝑾_xc β†’ eΓ—h βˆ— 4
  β€’ Parameters in GRU (bias terms discarded):
  β€’ 𝑾_hu, 𝑾_hr, 𝑾_hh β†’ hΓ—h βˆ— 3
  β€’ 𝑾_xu, 𝑾_xr, 𝑾_xh β†’ eΓ—h βˆ— 3
  β€’ If also considering the encoder and decoder embeddings (e.g. in a language modeling network):
  β€’ encoder embedding matrix 𝑬 β†’ |𝕍| Γ— e
  β€’ output (decoder) projection 𝑼 β†’ h Γ— |𝕍|

e and h are the number of dimensions of the input embedding and hidden vectors, respectively; |𝕍| is the vocabulary size. A small script for checking these counts follows below.
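A quick check of these counts (illustrative sizes; bias terms excluded, matching the slide). The PyTorch comparison assumes nn.LSTM/nn.GRU with bias=False:

```python
import torch.nn as nn

e, h = 300, 512   # example embedding and hidden sizes (assumed values)

lstm_params = 4 * (h * h + e * h)   # W_hi, W_hf, W_ho, W_hc and W_xi, W_xf, W_xo, W_xc
gru_params = 3 * (h * h + e * h)    # W_hu, W_hr, W_hh and W_xu, W_xr, W_xh
print(lstm_params, gru_params)      # 1662976 1247232

# Cross-check against PyTorch's built-in cells (no biases)
lstm = nn.LSTM(input_size=e, hidden_size=h, bias=False)
gru = nn.GRU(input_size=e, hidden_size=h, bias=False)
print(sum(p.numel() for p in lstm.parameters()),
      sum(p.numel() for p in gru.parameters()))   # should match the numbers above
```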

SLIDE 25

RNNs with gates – summary

  β€’ LSTM (and GRU) with dynamic gate mechanisms make it easier to preserve the necessary information over many time steps
  β€’ LSTM does not guarantee that there is no vanishing/exploding gradient, but its great success in practice has shown that it can learn long-distance dependencies
  β€’ LSTM vs. GRU: LSTM is usually the default choice, especially when enough training data is available and capturing longer distances is important. GRU is faster and better suited for settings with low computation resources.

SLIDE 26

Agenda

  • Vanishing/Exploding gradient
  • RNNs with Gates: LSTM, GRU
  • Contextualized word embeddings with RNNs
  • Extractive summarization with RNNs
SLIDE 27

Contextualized Word Embeddings

  β€’ The meaning of a word can be better understood when we consider its context
  β€’ Example of sense disambiguation when context is available:
  β€’ what is apple? a fruit or the name of a company?
  β€’ β€œeating an apple” vs. β€œshare of the apple company”
  β€’ In contextualized word embedding, the representation of a word is defined based on the context it appears in
  β€’ The input is a sequence of word embeddings and the output is a sequence of contextualized word embeddings
SLIDE 28

Contextualized Word Embeddings

[Figure: an RNN runs over the word embeddings 𝒆^(1), 𝒆^(2), 𝒆^(3), … of the sentence β€œThe quick brown fox jumps over the lazy dog”, producing hidden states 𝒉^(1), 𝒉^(2), 𝒉^(3), … that serve as contextualized word embeddings.]

The hidden state at β€œbrown” is its contextualized word embedding. However, it has only had access to the previous words (not the future ones).

SLIDE 29

Bidirectional RNNs

  β€’ A bidirectional RNN consists of two RNNs: one reads from the beginning to the end of the sequence (forward), the other reads from the end to the beginning (backward)

  𝒉_fw^(t) = RNN_fw(𝒉_fw^(tβˆ’1), 𝒙^(t))
  𝒉_bw^(t) = RNN_bw(𝒉_bw^(t+1), 𝒙^(t))

  β€’ The output at each time step is the concatenation of the outputs of both RNNs at that time step:

  𝒉^(t) = [𝒉_fw^(t) ; 𝒉_bw^(t)]

  β€’ To remember: using a bidirectional RNN is only possible when the entire sequence is available

A small sketch of a bidirectional RNN follows below.
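A minimal NumPy sketch of a bidirectional (vanilla) RNN; weights and sizes are illustrative assumptions:

```python
import numpy as np

def rnn(xs, W_h, W_x, b):
    """Run a simple tanh RNN over a sequence xs of shape (T, e); returns hidden states (T, h)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(h @ W_h + x_t @ W_x + b)
        states.append(h)
    return np.stack(states)

T, e, h = 6, 5, 4
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, e))
params_fw = (rng.normal(size=(h, h)), rng.normal(size=(e, h)), np.zeros(h))
params_bw = (rng.normal(size=(h, h)), rng.normal(size=(e, h)), np.zeros(h))

h_fw = rnn(xs, *params_fw)              # forward pass: left to right
h_bw = rnn(xs[::-1], *params_bw)[::-1]  # backward pass: right to left, re-aligned to positions
h_bi = np.concatenate([h_fw, h_bw], axis=-1)  # [h_fw^(t); h_bw^(t)] per time step
print(h_bi.shape)  # (6, 8)
```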

SLIDE 30

[Figure: a bidirectional RNN over the input embeddings 𝒆^(1) … 𝒆^(4). The forward RNN produces 𝒉_fw^(1) … 𝒉_fw^(4), the backward RNN produces 𝒉_bw^(1) … 𝒉_bw^(4), and the output at each position t is the concatenation [𝒉_fw^(t) ; 𝒉_bw^(t)].]

SLIDE 31

ELMo (Embeddings from Language Models)

ELMo is a multi-layer bidirectional LSTM (with L layers), trained on a language modeling objective.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. Deep Contextualized Word Representations. In Proc. of NAACL-HLT 2018.

SLIDE 32

ELMo – contextualized word embedding

  β€’ Given the input embeddings 𝒆_1, 𝒆_2, …, 𝒆_N, each layer k of ELMo outputs a set of contextualized word embeddings 𝒉_1^(k), 𝒉_2^(k), …, 𝒉_N^(k)
  β€’ Contextualized embeddings in higher layers capture semantic aspects (e.g., word senses), while embeddings in lower layers model aspects of syntax (e.g., part-of-speech tagging)

SLIDE 33

ELMo in supervised tasks

  β€’ In supervised tasks, ELMo makes use of the embeddings from all layers (not only the last layer)!
  β€’ The final word embedding is a weighted sum of the intermediate hidden states. The embedding of the word at position j is:

  Ξ³ Β· Ξ£_{k=1}^{L} ΞΈ_k 𝒉_j^(k)

  β€’ ΞΈ_k defines the weight (importance) of layer k
  β€’ The ΞΈ_k values and Ξ³ are also model parameters, and are trained end-to-end with the whole model

Model parameters are shown in red (on the original slide). A sketch of this weighted sum follows below.
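A minimal NumPy sketch of the layer-weighted sum; the softmax normalization of the layer weights follows the ELMo paper, and the shapes are illustrative assumptions:

```python
import numpy as np

def elmo_embedding(layer_states, theta, gamma):
    """layer_states: (L, N, h) hidden states of all L layers for N tokens.
    theta: (L,) raw layer weights; gamma: scalar. Returns (N, h) task-specific embeddings."""
    weights = np.exp(theta) / np.exp(theta).sum()        # softmax over layers
    return gamma * np.einsum("l,lnh->nh", weights, layer_states)

L, N, h = 3, 9, 8
rng = np.random.default_rng(0)
states = rng.normal(size=(L, N, h))       # e.g. outputs of a multi-layer biLSTM
emb = elmo_embedding(states, theta=np.zeros(L), gamma=1.0)
print(emb.shape)  # (9, 8)
```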

SLIDE 34

ELMo – key results

https://allennlp.org/elmo

SLIDE 35

Agenda

  • Vanishing/Exploding gradient
  • RNNs with Gates: LSTM, GRU
  • Contextualized word embeddings with RNNs
  • Extractive summarization with RNNs
SLIDE 36

Text Summarization

  β€’ The task of summarizing the key information content of a text (document) π‘Œ = {𝑦_1, 𝑦_2, …, 𝑦_N} in a summary 𝑍 = {𝑧_1, 𝑧_2, …, 𝑧_M}
  β€’ The summary is concise and (much) shorter than the document
  β€’ Some datasets:
  β€’ Gigaword: first one or two sentences of a news article
  β€’ CNN/DailyMail: news articles
  β€’ WikiHow: full how-to articles
SLIDE 37

Summarization

  β€’ Extractive summarization
  β€’ Selecting sections (typically sentences) of the document
  β€’ A model decides whether a section of the document should be selected for the summary
  β€’ Abstractive summarization
  β€’ Writing (generating) new summary text for the document
  β€’ A language generation task conditioned on the document

SLIDE 38

Extractive Summarization

SLIDE 39

Abstractive Summarization

Document:

SAN FRANCISCO, California (Reuters) -- Sony has cut the price of the PlayStation 3 by $100, or 17 percent, in the United States, a move that should boost the video game console's lackluster sales. Starting Monday, the current PS3 60 gigabyte model will cost $499 -- a $100 price drop. The PlayStation 3, which includes a 60-gigabyte hard drive and a Blu-ray high-definition DVD player, will now cost $500, or $20 more than the most expensive version of Microsoft's Xbox 360. The PS3 still costs twice that of Nintendo's Wii console, whose $250 price and motion-sensing controller have made it a best-seller despite its lack of cutting-edge graphics and hard disk. "Our initial expectation is that sales should double at a minimum," Jack Tretton, chief executive of Sony Computer Entertainment America, said in an interview. "We've gotten our production issues behind us on the PlayStation 3, reaching a position to pass on the savings to consumers, and our attitude is the sooner the better."
…

Summary:

  β€’ Sony drops price of current 60GB PlayStation 3 console by $100 in U.S.
  β€’ PS3 still costs twice that of Nintendo's best-selling Wii console, which is $250
  β€’ Some expect Microsoft to respond with its first price cuts on the Xbox 360
  β€’ Sony to revise PS3 console with bigger 80GB hard drive

SLIDE 40

Summarization – Evaluation

  β€’ ROUGE-N: overlap of n-grams between the output and the reference summary
  β€’ ROUGE-1: the overlap of unigrams
  β€’ ROUGE-2: the overlap of bigrams
  β€’ …

  ROUGE-N = |n-grams(ZΜ‚) ∩ n-grams(Z)| / |n-grams(Z)|

Z and ZΜ‚ are the reference and output summary, respectively; n-grams(Β·) returns the set of all n-grams of the given text.

ROUGE: A Package for Automatic Evaluation of Summaries, Lin, 2004 http://www.aclweb.org/anthology/W04-1013
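A minimal sketch of recall-oriented ROUGE-N as defined above, computed over sets of n-grams (a simplification; the official ROUGE toolkit also handles stemming, multiple references, etc.):

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n(output, reference, n):
    """|n-grams(output) ∩ n-grams(reference)| / |n-grams(reference)|."""
    ref = ngrams(reference.split(), n)
    out = ngrams(output.split(), n)
    return len(out & ref) / len(ref)

print(rouge_n("police kill the gunman", "police killed the gunman", n=1))  # 0.75
print(rouge_n("police kill the gunman", "police killed the gunman", n=2))  # ~0.33
```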

SLIDE 41

Summarization – Evaluation

  β€’ ROUGE-L is based on the length of the longest common subsequence (LCS) between the output and the reference summary:

  ROUGE-L = |LCS(ZΜ‚, Z)| / |Z|

LCS is the longest common subsequence of the two given texts.

  β€’ ROUGE-L
  β€’ does not require consecutive matches, only in-sequence matches
  β€’ reflects sentence structure
  β€’ does not need a predefined n-gram length
SLIDE 42

Summarization – Evaluation

  β€’ ROUGE (in the definitions discussed here) is a recall-based measure
  β€’ ROUGE can also be defined as a precision-based measure, as well as an F-measure

Example
  β€’ Z: β€œpolice killed the gunman”
  β€’ ZΜ‚1: β€œpolice kill the gunman”
  β€’ ZΜ‚2: β€œthe gunman kill police”
  β€’ LCS(ZΜ‚1, Z) = β€œpolice the gunman” β†’ ROUGE-L(ZΜ‚1, Z) = 0.75
  β€’ LCS(ZΜ‚2, Z) = β€œthe gunman” β†’ ROUGE-L(ZΜ‚2, Z) = 0.5
  β€’ However, the ROUGE-2 scores of ZΜ‚1 and ZΜ‚2 are the same
  β€’ In both ZΜ‚1 and ZΜ‚2, β€œthe gunman” is the only bigram in common with the reference

A small LCS-based implementation follows below.
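A minimal sketch of ROUGE-L via dynamic-programming LCS; it reproduces the example scores above:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(output, reference):
    out, ref = output.split(), reference.split()
    return lcs_length(out, ref) / len(ref)

reference = "police killed the gunman"
print(rouge_l("police kill the gunman", reference))  # 0.75
print(rouge_l("the gunman kill police", reference))  # 0.5
```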

SLIDE 43

Extractive Summarization – paper walkthrough

Problem definition

  β€’ A document 𝐷 contains 𝐿 sentences: 𝐷 = {𝑠_1, 𝑠_2, …, 𝑠_L}
  β€’ Each sentence 𝑠_i contains a set of words and is annotated with a label 𝑦_i
  β€’ Extractive summarization objective: select π‘š sentences of the document such that they provide the β€œbest” summary (concise and comprehensive)
  β€’ Evaluation is done by comparing the output summary with the reference summary

Zhou, Q., Yang, N., Wei, F., Huang, S., Zhou, M., & Zhao, T. Neural Document Summarization by Jointly Learning to Score and Select Sentences. In Proc. of ACL 2018.

SLIDE 44

Core Ideas – NeuSum model

  β€’ At each time step, the model decides which sentence to include in the summary (sentence selection)
  β€’ To do sentence selection, at each time step the model assigns scores to the sentences that are not yet included in the summary (sentence scoring), and selects the one with the highest score
  β€’ Sentence scoring is based on the representation of each sentence, but also on the content of the previously selected sentences
  β€’ Why on previously selected sentences? Intuitively, if some content is already included in the summary, the model should avoid selecting sentences with similar content

SLIDE 45

Sentence encoding

[Figure: a word-level bidirectional GRU over the word embeddings of each sentence produces contextual word embeddings; its final hidden states form the sentence embedding. A second, sentence-level bidirectional GRU over the sentence embeddings produces contextualized sentence embeddings.]

SLIDE 46

Sentence scoring and selection

  β€’ Sentence scoring learns a function Ξ΄ that assigns a score to each sentence 𝑠_i at time step t. It uses:
  1. the sentence embedding 𝒔_i
  2. information about the previously selected sentences, embedded in the vector 𝒉_t β†’ the current state of the summary

  Ξ΄(𝑠_i) = function(𝒉_t, 𝒔_i)
  Ξ΄(𝑠_i) = π’˜ tanh(𝒉_t 𝑾_h + 𝒔_i 𝑾_s + 𝒃)

  β€’ Ξ΄(𝑠_i) is calculated only for the sentences that are not yet included in the summary
  β€’ The sentence with the highest Ξ΄ is added to the summary

A small sketch of this scoring step follows below.

SLIDE 47

Sentence scoring

  β€’ How is the current state of the summary 𝒉_t calculated?
  β€’ Using another GRU
  β€’ The GRU outputs the new state of the summary from the previous state of the summary and the last selected sentence:

  𝒉_t = GRU(𝒉_{tβˆ’1}, 𝒔_{tβˆ’1})

You can find more details in the paper:
  β€’ 𝒉_0 is created based on the document embedding
  β€’ The model is optimized using the Kullback-Leibler divergence between the distribution of output scores and the distribution of ROUGE-2 F1 gains of the sentences

A sketch of the resulting selection loop follows below.
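A compact, self-contained sketch of the overall selection loop (greedy selection with an updated summary state). The state update here is a plain tanh recurrence standing in for NeuSum's GRU, and all sizes and weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
sentences = rng.normal(size=(5, d))        # contextualized sentence embeddings s_1..s_5
W_upd = rng.normal(size=(d, d)) * 0.1      # stand-in for the GRU that updates the summary state
W_h, W_s, w = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)

h = sentences.mean(axis=0)                 # h_0 from the document embedding (here: mean pooling)
selected, remaining = [], set(range(len(sentences)))

for _ in range(2):                         # pick m = 2 sentences for the summary
    # score every not-yet-selected sentence given the current summary state h
    scores = {i: float(w @ np.tanh(h @ W_h + sentences[i] @ W_s)) for i in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    # update the summary state with the last selected sentence (a GRU in NeuSum; tanh stand-in here)
    h = np.tanh(h @ W_upd + sentences[best])

print(selected)                            # indices of the selected sentences, in selection order
```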

SLIDE 48

[Figure: an unrolled GRU that outputs the next state of the summary. At step 2, Ξ΄(𝑠_i) = function(𝒉_1, 𝒔_i). 𝒉_t can be seen as an aggregation of the contents of the summary so far.]
SLIDE 49

Recap

  β€’ Vanishing/Exploding gradient
  β€’ RNNs with gates address the problem of vanishing/exploding gradients by controlling what is stored in and what is forgotten from the memory states

[Figure: an unrolled RNN with hidden states 𝒉^(tβˆ’4) … 𝒉^(t), shared weights 𝑾_h, and loss β„’^(t).]

SLIDE 50

Recap: ELMo

A contextualized word embedding model using a multi-layer bidirectional LSTM.

[Figure: an LSTM cell taking 𝒙^(t), 𝒉^(tβˆ’1), and 𝒄^(tβˆ’1) and producing 𝒉^(t).]

SLIDE 51

Recap: Extractive summarization

NeuSum model