SLIDE 1

Natural Language Processing with Deep Learning
Language Modeling with Recurrent Neural Networks

Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception

SLIDE 2

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time

The slides are adapted from http://web.stanford.edu/class/cs224n/

SLIDE 3

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time
SLIDE 4

Language Modeling

§ Language Modeling is the task of predicting a word (or a subword or character) given a context:

P(w | context)

§ A Language Model can answer questions like P(w | the students opened their)

SLIDE 5

Language Modeling

§ Formally, given a sequence of words y^(1), y^(2), …, y^(t), a language model calculates the probability distribution of the next word y^(t+1) over all words in the vocabulary:

P(y^(t+1) | y^(t), y^(t−1), …, y^(1))

where y can be any word in the vocabulary 𝕎 = {w_1, w_2, …, w_|𝕎|}

SLIDE 6

Language Modeling

§ You can also think of a Language Model as a system that assigns a probability to a piece of text

  • How probable is it that someone generates this sentence?!

β€œcolorless green ideas sleep furiously”

§ According to a Language Model, the probability of a given text is computed by:

P(y^(1), …, y^(T)) = P(y^(1)) × P(y^(2) | y^(1)) × ⋯ × P(y^(T) | y^(T−1), …, y^(1))

P(y^(1), …, y^(T)) = ∏_{t=1}^{T} P(y^(t) | y^(t−1), …, y^(1))
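To make the chain rule concrete, here is a minimal sketch that scores a sentence by multiplying conditional word probabilities; the cond_prob function and its toy values are hypothetical stand-ins for a trained model:

```python
import math

def cond_prob(word, context):
    # Hypothetical conditional probabilities P(word | context),
    # standing in for a trained language model
    toy_model = {
        (): {"the": 0.05},
        ("the",): {"students": 0.02},
        ("the", "students"): {"opened": 0.10},
        ("the", "students", "opened"): {"their": 0.35},
    }
    return toy_model.get(tuple(context), {}).get(word, 1e-6)

def sentence_log_prob(words):
    # Chain rule: log P(y(1)..y(T)) = sum_t log P(y(t) | y(1)..y(t-1))
    return sum(math.log(cond_prob(w, words[:i])) for i, w in enumerate(words))

print(sentence_log_prob(["the", "students", "opened", "their"]))
```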

SLIDE 7

Why Language Modeling?

§ Language Modeling is a benchmark task that helps us measure our progress on understanding language

§ Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:

  • Predictive typing
  • Spelling/grammar correction
  • Speech recognition
  • Handwriting recognition
  • Machine translation
  • Summarization
  • Dialogue / chatbots
  • etc.
SLIDE 8

Language Modeling

[link]

SLIDE 9

Language Modeling

SLIDE 10

n-gram Language Model

§ Recall: an n-gram is a chunk of n consecutive words.

the students opened their ______

§ unigrams: "the", "students", "opened", "their"
§ bigrams: "the students", "students opened", "opened their"
§ trigrams: "the students opened", "students opened their"
§ 4-grams: "the students opened their"

§ An n-gram Language Model collects frequency statistics of different n-grams in a corpus, and uses these statistics to calculate probabilities

SLIDE 11

n-gram Language Model

§ Markov assumption: the decision at time t depends only on the current state

§ In an n-gram Language Model: predicting y^(t+1) depends only on the preceding n−1 words

§ Without the Markov assumption: P(y^(t+1) | y^(t), y^(t−1), …, y^(1))

§ n-gram Language Model: P(y^(t+1) | y^(t), y^(t−1), …, y^(t−n+2))

(y^(t), …, y^(t−n+2) are the preceding n−1 words)

SLIDE 12

n-gram Language Model

§ Based on the definition of conditional probability:

P(y^(t+1) | y^(t), …, y^(t−n+2)) = P(y^(t+1), y^(t), …, y^(t−n+2)) / P(y^(t), …, y^(t−n+2))

§ The n-gram probability is calculated by counting n-grams and (n−1)-grams in a large corpus of text:

P(y^(t+1) | y^(t), …, y^(t−n+2)) ≈ count(y^(t+1), y^(t), …, y^(t−n+2)) / count(y^(t), …, y^(t−n+2))
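A minimal counting sketch of this estimate (the toy corpus and whitespace tokenization are assumptions for illustration):

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Count every n-gram of length n in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, context, word):
    # P(word | context) ~= count(context, word) / count(context)
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    return numerator / denominator if denominator > 0 else 0.0

corpus = "the students opened their books . the students opened their exams .".split()
print(ngram_prob(corpus, ["students", "opened", "their"], "books"))  # 0.5
```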

SLIDE 13

n-gram Language Model

§ Example: learning a 4-gram Language Model

as the exam clerk started the clock, the students opened their ______

P(w | students opened their) = count(students opened their w) / count(students opened their)

(a 4-gram model conditions only on the last three words, "students opened their")

§ For example, suppose that in the corpus:

  • "students opened their" occurred 1000 times
  • "students opened their books" occurred 400 times
    → P(books | students opened their) = 0.4
  • "students opened their exams" occurred 100 times
    → P(exams | students opened their) = 0.1

SLIDE 14

n-gram Language Model – problems

§ Sparsity

  • If the numerator "students opened their w" never occurred in the corpus
    → Smoothing: add a small hyper-parameter ε to the count of every word
  • If the denominator "students opened their" never occurred in the corpus
    → Backoff: condition on "students opened" instead
  • Increasing n makes the sparsity problem worse!

§ Storage

  • The model needs to store all n-grams (from unigrams to n-grams) observed in the corpus
  • Increasing n radically worsens the storage problem!
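A hedged sketch of these two fixes (the ε value and the one-word backoff are illustrative choices; ngram_counts is the helper from the previous sketch):

```python
def smoothed_prob(tokens, context, word, vocab, eps=0.01):
    # Add-epsilon smoothing: every word in the vocabulary gets a pseudo-count
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)] + eps
    denominator = ngram_counts(tokens, n - 1)[tuple(context)] + eps * len(vocab)
    return numerator / denominator

def backoff_prob(tokens, context, word, vocab, eps=0.01):
    # Back off to a shorter context while the full context was never observed
    while context and ngram_counts(tokens, len(context))[tuple(context)] == 0:
        context = context[1:]  # drop the earliest context word
    return smoothed_prob(tokens, context, word, vocab, eps)
```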
SLIDE 15

n-gram Language Models – generating text

§ A trigram Language Model trained on the Reuters corpus (1.7M words)

SLIDE 16

n-gram Language Models – generating text

§ Generating text by sampling from the probability distributions

SLIDE 17

n-gram Language Models – generating text

§ Generating text by sampling from the probability distributions

SLIDE 18

n-gram Language Models – generating text

§ Generating text by sampling from the probability distributions

SLIDE 19

n-gram Language Models – generating text

§ Generating text by sampling from the probability distributions
§ Very good syntax … but incoherent!
§ Increasing n makes the text more coherent, but also intensifies the sparsity and storage issues discussed above
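A minimal sketch of this sampling loop under a trigram model (ngram_counts as above; the toy corpus and seed words are assumptions):

```python
import random
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sample_next(tokens, context):
    # Sample a word from the count-based distribution P(w | context)
    candidates = {ng[-1]: c
                  for ng, c in ngram_counts(tokens, len(context) + 1).items()
                  if ng[:-1] == tuple(context)}
    words, counts = zip(*candidates.items())
    return random.choices(words, weights=counts)[0]

corpus = "the students opened their books . the students opened their minds .".split()
text = ["the", "students"]
for _ in range(4):               # trigram model: condition on the last 2 words
    text.append(sample_next(corpus, text[-2:]))
print(" ".join(text))
```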

SLIDE 20

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time
SLIDE 21

Recurrent Neural Network

§ A Recurrent Neural Network (RNN) encodes/embeds a sequential input of any size, such as …

  • a sequence of word/subword/character vectors
  • a time series

… into compositional embeddings

§ An RNN captures dependencies through the sequence by applying the same parameters repeatedly

§ An RNN outputs a final embedding, but also intermediary embeddings at each time step

SLIDE 22

Recurrent Neural Networks

(diagram: an RNN cell takes the previous hidden state h^(t−1) and the input x^(t), and outputs the new hidden state h^(t))

SLIDE 23

Recurrent Neural Networks

§ The output h^(t) is a function of the input x^(t) and the output of the previous time step, h^(t−1):

h^(t) = RNN(h^(t−1), x^(t))

§ h^(t) is called the hidden state

§ Through the hidden state h^(t−1), the model has access to a sort of memory of all previous entities

SLIDE 24

RNN – Unrolling

(diagram: the RNN cell unrolled over "The quick brown fox jumps over the lazy dog …"; starting from h^(0), each step t consumes the input vector x^(t) of word y^(t) and the previous hidden state, and produces h^(t))

SLIDE 25

RNN – Compositional embedding

(diagram: an RNN unrolled over "cat sunbathes on river bank", producing hidden states h^(1) … h^(5) that are combined into a sentence embedding)

SLIDE 26

RNN – Compositional embedding

§ Sentence embedding: use the last hidden state

(diagram: the same unrolled RNN over "cat sunbathes on river bank"; the last hidden state h^(5) is taken as the sentence embedding)

SLIDE 27

RNN – Compositional embedding

§ Sentence embedding: calculate the element-wise max, or the mean, of the hidden states

(diagram: the same unrolled RNN; the hidden states h^(1) … h^(5) are pooled into a sentence embedding)

SLIDE 28

Standard (Elman) RNN

§ General form of an RNN function: h^(t) = RNN(h^(t−1), x^(t))

§ Standard RNN:

  • a linear projection of the previous hidden state h^(t−1)
  • a linear projection of the input x^(t)
  • sum the projections and apply a non-linearity

h^(t) = τ(h^(t−1) W_h + x^(t) W_x + c)

SLIDE 29

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time
SLIDE 30

RNN Language Model

(diagram: the words "the students opened their" are embedded via F and fed through the unrolled RNN; the final hidden state h^(4) is decoded via V into ẑ^(4), which gives P(y^(5) | the students opened their))

SLIDE 31

RNN Language Model

§ Encoder

  • word at time step t → y^(t)
  • one-hot vector of y^(t) → 𝐲^(t) ∈ ℝ^|𝕎|
  • word embedding → x^(t) = 𝐲^(t) F

§ RNN: h^(t) = RNN(h^(t−1), x^(t))

§ Decoder

  • predicted probability distribution: ẑ^(t) = softmax(h^(t) V + b) ∈ ℝ^|𝕎|
  • probability of any word w at step t: P(w | y^(t), …, y^(1)) = ẑ^(t)_w
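A hedged NumPy sketch of one encoder/RNN/decoder step; the sizes, random initialization, and softmax helper are assumptions for illustration:

```python
import numpy as np

vocab_size, e, h = 1000, 16, 32            # |W|, embedding dim, hidden dim
F = np.random.randn(vocab_size, e) * 0.1   # encoder embedding matrix
V = np.random.randn(h, vocab_size) * 0.1   # decoder matrix
W_x = np.random.randn(e, h) * 0.1
W_h = np.random.randn(h, h) * 0.1
c, b = np.zeros(h), np.zeros(vocab_size)   # RNN bias and decoder bias

def softmax(a):
    ex = np.exp(a - a.max())
    return ex / ex.sum()

def lm_step(h_prev, word_id):
    x_t = F[word_id]                             # lookup = one-hot vector @ F
    h_t = np.tanh(h_prev @ W_h + x_t @ W_x + c)  # RNN step
    z_t = softmax(h_t @ V + b)                   # distribution over next word
    return h_t, z_t
```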

SLIDE 32

Training an RNN Language Model

§ Start with a large text corpus: y^(1), …, y^(T)

§ For every step t, predict the output distribution ẑ^(t)

§ Calculate the loss: the Negative Log Likelihood (NLL) of the predicted probability of the true next word y^(t+1):

L^(t) = −log ẑ^(t)_{y^(t+1)} = −log P(y^(t+1) | y^(t), …, y^(1))

§ The overall loss is the average of the loss values over the entire training set:

L = (1/T) Σ_{t=1}^{T} L^(t)
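Continuing the sketch above, the per-step loss is the negative log of the probability the model assigns to the true next word (the word ids here are hypothetical):

```python
def nll(z_t, next_word_id):
    # L(t) = -log z(t)[y(t+1)]
    return -np.log(z_t[next_word_id])

h0 = np.zeros(h)
h1, z1 = lm_step(h0, word_id=42)    # feed a word with hypothetical id 42
loss_1 = nll(z1, next_word_id=7)    # score against a hypothetical true next word
```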

SLIDE 33

Training

(diagram: training on "the students open their exams …"; step 1 embeds "the", produces ẑ^(1), and L^(1) is the NLL of the true next word "students")

SLIDE 34

Training

(diagram: step 2; after "the students", ẑ^(2) is produced and L^(2) is the NLL of the true next word "open")

SLIDE 35

Training

(diagram: step 3; after "the students open", ẑ^(3) is produced and L^(3) is the NLL of the true next word "their")

SLIDE 36

Training

(diagram: step 4; after "the students open their", ẑ^(4) is produced and L^(4) is the NLL of the true next word "exams")

SLIDE 37

Training RNN Language Model – Mini-batches

§ In practice, the overall loss is calculated not over the whole corpus but over (mini-)batches of length M:

L = (1/M) Σ_{t=1}^{M} L^(t)

§ After calculating L for one batch, the gradients are computed and the weights are updated (e.g. using SGD)

SLIDE 38

Training RNN Language Model – Data preparation

§ In practice, every forward pass contains l sequences (batch rows); these are all trained in parallel

§ With batch size l, the tensor of each forward pass has the shape (l, M)

§ To prepare the data, the corpus is split into sub-corpora, where each sub-corpus provides the text for one row of the forward-pass tensors

§ For example, for batch size l = 2, the corpus is split in the middle, and the first forward-pass tensor is:

    y^(1)   …  y^(M)
    y^(υ+1) …  y^(υ+M)

where υ = T/2 is the overall length of the first sub-corpus

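A sketch of this splitting in NumPy (function and variable names are mine, not from the slides):

```python
import numpy as np

def batchify(token_ids, batch_size, seq_len):
    # Split the corpus into batch_size parallel sub-corpora (rows),
    # then cut each forward pass as a (batch_size, seq_len) window
    ids = np.asarray(token_ids)
    upsilon = len(ids) // batch_size                  # sub-corpus length
    streams = ids[: upsilon * batch_size].reshape(batch_size, upsilon)
    return [streams[:, i:i + seq_len] for i in range(0, upsilon - 1, seq_len)]

batches = batchify(range(20), batch_size=2, seq_len=4)
print(batches[0])   # first window: row 1 is y(1)..y(4), row 2 is y(11)..y(14)
```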
SLIDE 39

Training RNN Language Model – π’Š(-)

Β§ At the beginning, the initial hidden state π’Š(-) is set to vectors of zeros (no memory) Β§ To carry the memory of previous forward passes, at each forward pass, the initial hidden states are initialized with the values of the last hidden states of the previous forward pass Example Β§ First forward pass: π’Š(-) is set to zero values 𝑦 ! … 𝑦 * 𝑦 +,! … 𝑦 +,* Β§ Second forward pass: π’Š(-) is set to the π’Š(*) values in the first pass 𝑦 *,! … 𝑦 #* 𝑦 *,+,! … 𝑦 +,#*
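A sketch of carrying the state across forward passes, reusing the hypothetical parameters (F, W_h, W_x, c) and the batchify windows from the sketches above (in an autograd framework the carried state would also be detached from the old graph):

```python
batch_size = 2
h_state = np.zeros((batch_size, 32))   # h(0) = 0 at the very beginning
for window in batches:                 # windows from batchify(...)
    for t in range(window.shape[1]):
        x_t = F[window[:, t]]          # embed one column of word ids
        h_state = np.tanh(h_state @ W_h + x_t @ W_x + c)
    # h_state now holds the last hidden states and becomes h(0) of the next pass
```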

SLIDE 40

Training RNN Language Model – Weight Tying

§ Parameters in a vanilla RNN (bias terms discarded):

  • F → |𝕎| × h
  • V → h × |𝕎|
  • W_x → e × h
  • W_h → h × h

where e and h are the number of dimensions of the input embedding and hidden vectors, respectively.

§ The encoder and decoder embedding matrices F and V contain the most parameters

SLIDE 41

Training RNN Language Model – Weight Tying

§ Parameters in a vanilla RNN (bias terms discarded):

  • F → |𝕎| × h
  • V → h × |𝕎|
  • W_x → e × h
  • W_h → h × h

where e and h are the number of dimensions of the input embedding and hidden vectors, respectively.

§ Weight tying: set the decoder parameters to be the same as the encoder parameters (saving |𝕎| × h decoding parameters)

  • In this case, e must be equal to h
  • If e ≠ h, usually a linear projection maps the output vector from h to e dimensions

SLIDE 42

Generating text

(diagram: the model is fed the word "a", produces the distribution ẑ^(1), and the next word "cat" is sampled from it)

SLIDE 43

Generating text

(diagram: the sampled word "cat" is fed back as the next input; ẑ^(2) is produced and "sunbathes" is sampled)

SLIDE 44

Generating text

(diagram: "a cat sunbathes" has been generated so far; ẑ^(3) is produced, "on" is sampled, and the process continues …)
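Putting the pieces together, a hedged sketch of this autoregressive loop, reusing the hypothetical lm_step from the RNN Language Model sketch above:

```python
def generate(first_word_id, n_words):
    # Feed each sampled word back in as the next input
    h_t = np.zeros(h)                    # h(0) = 0
    word_id, output = first_word_id, [first_word_id]
    for _ in range(n_words):
        h_t, z_t = lm_step(h_t, word_id)             # forward one step
        word_id = np.random.choice(len(z_t), p=z_t)  # sample from z(t)
        output.append(word_id)
    return output

print(generate(first_word_id=0, n_words=5))   # a list of sampled word ids
```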

SLIDE 45

Generating text with RNN Language Model

§ Trained on Obama speeches:

Jobs The United States will step up to the cost of a new challenges of the American people that will share the fact that we created the problem. They were attacked and so that they have to say that all the task of the final days of war that I will not be able to get this done. The promise of the men and women who were still going to take out the fact that the American people have fought to make sure that they have to be able to protect our part. It was a chance to stand together to completely look for the commitment to borrow from the American people.

https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0

SLIDE 46

Generating text with RNN Language Model

§ Trained on Trump speeches:

make the country rich. it was terrible. but saudi arabia, they make a billion dollars a day. i was the king. i was the king. i was the smartest person in yemen, we have to get to business. i have to say, but he was an early starter. and we have to get to business. i have to say, donald, i can't believe it. it's so important. but this is what they're saying, dad, you're going to be really pro, growth, blah, blah. it's disgusting what's disgusting., and it was a 19 set washer and to go to japan. did you hear that character. we are going to have to think about it. but you know, i've been nice to me.

https://github.com/ppramesi/RoboTrumpDNN

SLIDE 47

Agenda

  • Language Modeling with n-grams
  • Recurrent Neural Networks
  • Language Modeling with RNN
  • Backpropagation Through Time
SLIDE 48

Backpropagation for RNNs

§ Unrolling the computation graph of the RNN

§ Simplified: the interactions with V and the input parameters (F and W_x) are removed

§ What is ∂L^(t)/∂W_h ?

(diagram: the chain h^(0) → … → h^(t−2) → h^(t−1) → h^(t) → L^(t), with W_h applied at every step)

SLIDE 49

Backpropagation for RNNs π’Š(-) 𝑿2 π’Š(") 𝑿2 π’Š($) 𝑿2 π’Š(+) β„’(+) πœ–β„’(+) πœ–π‘Ώ2 =?

SLIDE 50

Backpropagation for RNNs

§ Gradient with respect to W_h at time step 3:

[∂L^(3)/∂W_h]^(3) = ∂L^(3)/∂h^(3) · ∂h^(3)/∂W_h

SLIDE 51

Backpropagation for RNNs

§ Gradient with respect to W_h at time step 2:

[∂L^(3)/∂W_h]^(2) = ∂L^(3)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂W_h

SLIDE 52

Backpropagation for RNNs

§ Gradient with respect to W_h at time step 1:

[∂L^(3)/∂W_h]^(1) = ∂L^(3)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1) · ∂h^(1)/∂W_h

SLIDE 53

Backpropagation Through Time

§ The final gradient is the sum of the gradients with respect to a model parameter (such as W_h) from the current time step back to the beginning of the corpus (or batch):

∂L^(t)/∂W_h = Σ_{k=1}^{t} [∂L^(t)/∂W_h]^(k)

§ In this simplified case, this can be written as:

∂L^(t)/∂W_h = Σ_{k=1}^{t} ∂L^(t)/∂h^(t) · ∂h^(t)/∂h^(t−1) · ⋯ · ∂h^(k)/∂W_h

SLIDE 54

Summary

§ A Language Model calculates …

  • the probability of the appearance of a word/subword/character given its context
  • the probability of a stream of text

§ Recurrent Neural Network

  • predicts the next entity (word/subword/character) based on a memory of previous steps and the given input
  • Backpropagation Through Time at step t accumulates t gradient contributions for the model parameters, from the current point back to the beginning of the sequence

SLIDE 55

Summary – RNN for Language Modeling

§ Pros:

  • can process input of any length
  • can (in theory) use information from many steps back
  • model size doesn't increase for longer input

§ Cons:

  • recurrent computation is slow → (in its vanilla form) it does not fully exploit the parallel computation capability of GPUs
  • in practice, it is difficult to access information from many steps back