

SLIDE 1

NPFL114, Lecture 9

Recurrent Neural Networks III

Milan Straka

April 29, 2019

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

SLIDE 2

Recurrent Neural Networks

[Diagram: a single RNN cell mapping an input and the previous state to an output and a new state; and the same cell unrolled over a sequence (input 1 → output 1, …, input 4 → output 4), with the state passed between consecutive steps.]

SLIDE 3

Basic RNN Cell

[Diagram: the cell takes an input and the previous state, and produces an output equal to the new state.]

Given an input $x^{(t)}$ and previous state $s^{(t-1)}$, the new state is computed as
$$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta).$$
One of the simplest possibilities is
$$s^{(t)} = \tanh(U s^{(t-1)} + V x^{(t)} + b).$$
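This recurrence is easy to sketch in NumPy (a minimal illustration; the dimensions and the random parameters are made up):

```python
import numpy as np

def rnn_step(s_prev, x, U, V, b):
    # s_t = tanh(U s_{t-1} + V x_t + b)
    return np.tanh(U @ s_prev + V @ x + b)

# Illustrative sizes: state dimension 4, input dimension 3.
rng = np.random.default_rng(42)
U, V, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
s = np.zeros(4)
for x in rng.normal(size=(10, 3)):  # unroll over a sequence of 10 inputs
    s = rnn_step(s, x, U, V, b)     # the output of each step equals the new state
```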

SLIDE 4

Basic RNN Cell

Basic RNN cells suffer a lot from vanishing/exploding gradients (the challenge of long-term dependencies).

If we simplify the recurrence of states to
$$s^{(t)} = U s^{(t-1)},$$
we get
$$s^{(t)} = U^t s^{(0)}.$$
If $U$ has the eigenvalue decomposition $U = Q \Lambda Q^{-1}$, we get
$$s^{(t)} = Q \Lambda^t Q^{-1} s^{(0)}.$$
The main problem is that the same function is iteratively applied many times.

Several more complex RNN cell variants have been proposed, which alleviate this issue to some degree, namely LSTM and GRU.
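The exponential behavior of $\Lambda^t$ can be demonstrated numerically; a toy sketch with made-up matrices, where $U$ is a random orthogonal matrix scaled so that all its singular values are 0.9 or 1.1, so the state norm decays or blows up as $0.9^{100}$ or $1.1^{100}$:

```python
import numpy as np

rng = np.random.default_rng(0)
s0 = rng.normal(size=8)
for scale in (0.9, 1.1):
    Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
    U = scale * Q                                 # all singular values equal `scale`
    s = s0
    for _ in range(100):                          # apply the same map 100 times
        s = U @ s
    print(scale, np.linalg.norm(s) / np.linalg.norm(s0))  # ~0.9**100 vs ~1.1**100
```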

SLIDE 5

Long Short-Term Memory

[Diagram: the LSTM cell with input, forget, and output gates controlling the memory cell $c_t$.]

Later, Gers, Schmidhuber & Cummins (1999) added the possibility to forget information from the memory cell:
$$i_t \leftarrow \sigma(W^i x_t + V^i h_{t-1} + b^i)$$
$$f_t \leftarrow \sigma(W^f x_t + V^f h_{t-1} + b^f)$$
$$o_t \leftarrow \sigma(W^o x_t + V^o h_{t-1} + b^o)$$
$$c_t \leftarrow f_t \cdot c_{t-1} + i_t \cdot \tanh(W^y x_t + V^y h_{t-1} + b^y)$$
$$h_t \leftarrow o_t \cdot \tanh(c_t)$$
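A direct NumPy transcription of these equations (a sketch, with the parameters stored in dicts keyed by gate name):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W, V, b):
    """One LSTM step; W, V, b are dicts with keys "i", "f", "o", "y"."""
    i = sigmoid(W["i"] @ x + V["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x + V["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] @ x + V["o"] @ h_prev + b["o"])  # output gate
    c = f * c_prev + i * np.tanh(W["y"] @ x + V["y"] @ h_prev + b["y"])
    h = o * np.tanh(c)
    return h, c
```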

SLIDE 6

Long Short-Term Memory

http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png

SLIDE 7

Gated Recurrent Unit

[Diagram: the GRU cell with reset and update gates.]

$$r_t \leftarrow \sigma(W^r x_t + V^r h_{t-1} + b^r)$$
$$u_t \leftarrow \sigma(W^u x_t + V^u h_{t-1} + b^u)$$
$$\hat h_t \leftarrow \tanh(W^h x_t + V^h (r_t \cdot h_{t-1}) + b^h)$$
$$h_t \leftarrow u_t \cdot h_{t-1} + (1 - u_t) \cdot \hat h_t$$
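The corresponding NumPy sketch, in the same style as the LSTM step above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(h_prev, x, W, V, b):
    """One GRU step; W, V, b are dicts with keys "r", "u", "h"."""
    r = sigmoid(W["r"] @ x + V["r"] @ h_prev + b["r"])        # reset gate
    u = sigmoid(W["u"] @ x + V["u"] @ h_prev + b["u"])        # update gate
    h_hat = np.tanh(W["h"] @ x + V["h"] @ (r * h_prev) + b["h"])
    return u * h_prev + (1 - u) * h_hat                       # new state h_t
```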

SLIDE 8

Gated Recurrent Unit

http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png

SLIDE 9

Word Embeddings

One-hot encoding considers all words to be independent of each other. However, words are not independent – some are more similar than others. Ideally, we would like some kind of similarity in the space of the word representations.

Distributed Representation

The idea behind distributed representation is that objects can be represented using a set of common underlying factors. We therefore represent words as fixed-size embeddings into an $\mathbb{R}^d$ space, with the vector elements playing the role of the common underlying factors.

SLIDE 10

Word Embeddings

The word embedding layer is in fact just a fully connected layer on top of one-hot encoding. However, it is important that this layer is shared across the whole network.

[Diagram: several words in one-hot encoding, each multiplied by the same V×D embedding matrix to produce a D-dimensional embedding; the matrix is shared across all positions.]
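That the embedding lookup and the one-hot matmul coincide is a one-liner to check (the sizes are illustrative):

```python
import numpy as np

V, D = 10, 4                                       # vocabulary size, embedding size
E = np.random.default_rng(1).normal(size=(V, D))   # the shared embedding matrix

word_id = 7
one_hot = np.eye(V)[word_id]
assert np.allclose(one_hot @ E, E[word_id])        # matmul with one-hot == row lookup
```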

SLIDE 11

Word Embeddings for Unknown Words

Recurrent Character-level WEs

Figure 1 of paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.

SLIDE 12

Word Embeddings for Unknown Words

Convolutional Character-level WEs

Figure 1 of paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.

SLIDE 13

Basic RNN Applications

Sequence Element Classification

Use outputs for individual elements.

[Diagram: an unrolled RNN producing an output for every input element (input 1 → output 1, …, input 4 → output 4), with the state passed between consecutive steps.]

Sequence Representation

Use state after processing the whole sequence (alternatively, take output of the last element).

SLIDE 14

Structured Prediction

Consider generating a sequence $y_1, \ldots, y_N \in Y^N$ given input $x_1, \ldots, x_N$. Predicting each sequence element independently models the distribution $P(y_i \mid X)$. However, there may be dependencies among the $y_i$ themselves, which is difficult to capture by independent element classification.

SLIDE 15

Linear-Chain Conditional Random Fields (CRF)

Linear-chain Conditional Random Fields, usually abbreviated only to CRF, act as an output layer. They can be considered an extension of softmax – instead of a sequence of independent softmaxes, a CRF is a sentence-level softmax, with additional weights for neighboring sequence elements.

$$s(X, y; \theta, A) = \sum_{i=1}^N \big(A_{y_{i-1}, y_i} + f_\theta(y_i \mid X)\big)$$
$$p(y \mid X) = \operatorname{softmax}_{z \in Y^N}\big(s(X, z)\big)_y$$
$$\log p(y \mid X) = s(X, y) - \operatorname{logadd}_{z \in Y^N}\big(s(X, z)\big)$$

SLIDE 16

Linear-Chain Conditional Random Fields (CRF)

Computation

We can compute $p(y \mid X)$ efficiently using dynamic programming. If we denote by $\alpha_t(k)$ the log probability of all sentences with $t$ elements, the last being $k$, the core idea is the following:
$$\alpha_t(k) = f_\theta(y_t = k \mid X) + \operatorname{logadd}_{j \in Y}\big(\alpha_{t-1}(j) + A_{j,k}\big).$$
For efficient implementation, we use the fact that
$$\ln(a + b) = \ln a + \ln(1 + e^{\ln b - \ln a}).$$
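A compact NumPy sketch of this forward pass, using `scipy.special.logsumexp` as the numerically stable logadd (the shapes are assumptions: per-position scores `f` of shape [N, K] and transition weights `A` of shape [K, K]):

```python
import numpy as np
from scipy.special import logsumexp  # a numerically stable logadd

def crf_log_partition(f, A):
    """logadd over all z in Y^N of s(X, z); f[t, k] = f_theta(y_t = k | X)."""
    alpha = f[0]                                          # alpha_1(k)
    for t in range(1, len(f)):
        # alpha_t(k) = f_theta(y_t = k | X) + logadd_j(alpha_{t-1}(j) + A[j, k])
        alpha = f[t] + logsumexp(alpha[:, None] + A, axis=0)
    return logsumexp(alpha)

# log p(y | X) is then s(X, y) minus crf_log_partition(f, A).
```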

SLIDE 17

Conditional Random Fields (CRF)

Decoding

We can perform optimal decoding by using the same algorithm, only replacing $\operatorname{logadd}$ with $\max$ and tracking where the maximum was attained (a sketch below).
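A sketch of this Viterbi-style decoding, under the same assumed shapes as the forward-pass sketch on the previous slide:

```python
import numpy as np

def crf_decode(f, A):
    """Replace logadd with max and track where each maximum was attained."""
    N, K = f.shape
    alpha, back = f[0], np.zeros((N, K), dtype=int)
    for t in range(1, N):
        scores = alpha[:, None] + A        # scores[j, k]: coming from j into k
        back[t] = scores.argmax(axis=0)    # remember the best predecessor
        alpha = f[t] + scores.max(axis=0)
    y = [int(alpha.argmax())]
    for t in range(N - 1, 0, -1):          # follow the back-pointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]                         # the optimal label sequence
```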

Applications

CRF output layers are useful for span labeling tasks, like named entity recognition and dialog slot filling.

SLIDE 18

Connectionist Temporal Classification

Let us again consider generating a sequence $y_1, \ldots, y_M$ given input $x_1, \ldots, x_N$, but this time $M \le N$, and there is no explicit alignment of $x$ and $y$ in the gold data.

Figure 7.1 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

SLIDE 19

Connectionist Temporal Classification

We enlarge the set of output labels by a – (blank), and perform a classification for every input element to produce an extended labeling. We then post-process it by the following rules (denoted $\mathcal B$):
1. We collapse neighboring identical symbols into one.
2. We remove the – symbols.

Because the explicit alignment of inputs and labels is not known, we consider all possible alignments. Denoting the probability of label $l$ at time $t$ as $p_l^t$, we define
$$\alpha_t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal B(\pi_{1:t}) = y_{1:s}} \ \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}.$$

SLIDE 20

CRF and CTC Comparison

In CRF, we normalize over whole sentences, and therefore need to compute unnormalized probabilities of all the (exponentially many) sentences. Decoding can be performed optimally. In CTC, we normalize per each label. However, because we do not have an explicit alignment, we compute the probability of a labeling by summing the probabilities of the (generally exponentially many) extended labelings.

SLIDE 21

Connectionist Temporal Classification

Computation

When aligning an extended labeling to a regular one, we need to consider whether the extended labeling ends with a blank or not. We therefore define
$$\alpha^-_t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal B(\pi_{1:t}) = y_{1:s},\ \pi_t = -} \ \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$
$$\alpha^*_t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal B(\pi_{1:t}) = y_{1:s},\ \pi_t \ne -} \ \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$
and compute $\alpha_t(s)$ as $\alpha^-_t(s) + \alpha^*_t(s)$.

SLIDE 22

Connectionist Temporal Classification

Figure 7.3 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

Computation

We initialize the $\alpha$s as follows:
$$\alpha^-_1(0) \leftarrow p_-^1$$
$$\alpha^*_1(1) \leftarrow p_{y_1}^1$$
We then proceed recurrently according to:
$$\alpha^-_t(s) \leftarrow p_-^t \big(\alpha^-_{t-1}(s) + \alpha^*_{t-1}(s)\big)$$
$$\alpha^*_t(s) \leftarrow \begin{cases} p_{y_s}^t \big(\alpha^*_{t-1}(s) + \alpha^*_{t-1}(s-1) + \alpha^-_{t-1}(s-1)\big) & \text{if } y_s \ne y_{s-1},\\ p_{y_s}^t \big(\alpha^*_{t-1}(s) + \alpha^-_{t-1}(s-1)\big) & \text{if } y_s = y_{s-1}. \end{cases}$$
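The whole computation fits in a short sketch. For readability it works in plain probability space (real implementations work with logs, e.g., via the identity from the CRF slide); `p` holds the per-frame label distributions with the blank in column 0, so the target labels in `y` are 1-based:

```python
import numpy as np

def ctc_probability(p, y):
    """Alpha-recursion sketch; p: [T, labels+1] with blank in column 0,
    y: the target labeling as a list of 1-based label indices."""
    T, S = len(p), len(y)
    a_blank = np.zeros((T, S + 1))   # alpha^-_t(s): alignments ending with blank
    a_label = np.zeros((T, S + 1))   # alpha^*_t(s): alignments ending with y_s
    a_blank[0, 0] = p[0, 0]          # alpha^-_1(0) <- p^1_-
    a_label[0, 1] = p[0, y[0]]       # alpha^*_1(1) <- p^1_{y_1}
    for t in range(1, T):
        for s in range(S + 1):
            a_blank[t, s] = p[t, 0] * (a_blank[t-1, s] + a_label[t-1, s])
            if s == 0:
                continue
            total = a_label[t-1, s] + a_blank[t-1, s-1]
            if s >= 2 and y[s-1] != y[s-2]:   # repeated labels must be blank-separated
                total += a_label[t-1, s-1]
            a_label[t, s] = p[t, y[s-1]] * total
    return a_blank[T-1, S] + a_label[T-1, S]  # alpha_T(S)
```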

SLIDE 23

CTC Decoding

Unlike CRF, we cannot perform the decoding optimally. The key observation is that while an optimal extended labeling can be extended into an optimal labeling of a larger length, the same does not apply to regular (non-extended) labelings. The problem is that a regular labeling corresponds to many extended labelings, each of which is modified in a different way during an extension of the regular labeling.

Figure 7.5 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

SLIDE 24

CTC Decoding

Beam Search

To perform beam search, we keep the $k$ best regular labelings for each prefix of the extended labelings. For each regular labeling we keep both $\alpha^-$ and $\alpha^*$, and by best we mean regular labelings with maximum $\alpha^- + \alpha^*$.

To compute the best regular labelings for a longer prefix of extended labelings, for each regular labeling in the beam we consider the following cases:
  • adding a blank symbol, i.e., updating both $\alpha^-$ and $\alpha^*$;
  • adding any non-blank symbol, i.e., updating $\alpha^*$.

Finally, we merge the resulting candidates according to their regular labeling and keep only the $k$ best. A sketch follows.
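A sketch of such a prefix beam search, again in plain probability space (blank in column 0; for every regular labeling the stored pair tracks the total probability of extended labelings ending with and without a blank, i.e., the $\alpha^-$ and $\alpha^*$ above):

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(p, k):
    """Prefix beam search sketch; p: [T, labels+1] per-frame probabilities."""
    beam = {(): (1.0, 0.0)}                  # prefix -> (alpha^-, alpha^*)
    for t in range(len(p)):
        new = defaultdict(lambda: [0.0, 0.0])
        for prefix, (a_blank, a_label) in beam.items():
            # Adding a blank: the regular labeling stays the same.
            new[prefix][0] += p[t, 0] * (a_blank + a_label)
            # Adding a non-blank symbol c.
            for c in range(1, p.shape[1]):
                if prefix and prefix[-1] == c:
                    new[prefix][1] += p[t, c] * a_label          # repeat collapses
                    new[prefix + (c,)][1] += p[t, c] * a_blank   # blank separated it
                else:
                    new[prefix + (c,)][1] += p[t, c] * (a_blank + a_label)
        # Candidates with the same regular labeling were merged by the dict;
        # keep only the k best by alpha^- + alpha^*.
        beam = dict(sorted(new.items(), key=lambda kv: -sum(kv[1]))[:k])
    return max(beam.items(), key=lambda kv: sum(kv[1]))[0]
```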

SLIDE 25

Unsupervised Word Embeddings

The embeddings can be trained for each task separately. However, methods of precomputing word embeddings have been proposed, based on the distributional hypothesis: Words that are used in the same contexts tend to have similar meanings. The distributional hypothesis is usually attributed to Firth (1957).

SLIDE 26

Word2Vec

[Diagram: the CBOW (Continuous Bag Of Words) architecture, where the context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ are summed in the projection layer to predict $w_t$; and the Skip-gram architecture, where $w_t$ predicts each of its context words.]

Mikolov et al. (2013) proposed two very simple architectures for precomputing word embeddings, together with a C multi-threaded implementation word2vec.

SLIDE 27

Word2Vec

Table 8 of paper "Efficient Estimation of Word Representations in Vector Space", https://arxiv.org/abs/1301.3781.

SLIDE 28

Word2Vec – SkipGram Model

[Diagram: the CBOW and Skip-gram architectures, as on the previous Word2Vec slide.]

Considering input word $w_i$ and output $w$, the Skip-gram model defines
$$p(w \mid w_i) \stackrel{\mathrm{def}}{=} \frac{e^{W_w^\top V_{w_i}}}{\sum_{w'} e^{W_{w'}^\top V_{w_i}}}.$$

SLIDE 29

Word2Vec – Hierarchical Softmax

Instead of a large softmax, we construct a binary tree over the words, with a sigmoid classifier for each node. If word $w$ corresponds to a path $n_1, n_2, \ldots, n_L$, we define
$$p_{\mathrm{HS}}(w \mid w_i) \stackrel{\mathrm{def}}{=} \prod_{j=1}^{L-1} \sigma\big([\text{+1 if } n_{j+1} \text{ is right child else -1}] \cdot W_{n_j}^\top V_{w_i}\big).$$
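A sketch of evaluating this product along one root-to-word path (the path encoding, the node vectors `W_nodes`, and the input embedding `v_wi` are assumed inputs):

```python
import numpy as np

def hs_log_prob(path, signs, W_nodes, v_wi):
    """path: inner nodes n_1..n_{L-1}; signs[j]: +1 if n_{j+1} is a right child,
    -1 otherwise; W_nodes: node vectors; v_wi: embedding of the input word."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    return sum(np.log(sigmoid(sign * (W_nodes[n] @ v_wi)))
               for n, sign in zip(path, signs))
```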

SLIDE 30

Word2Vec – Negative Sampling

Instead of a large softmax, we could train individual sigmoids for all words. We could also sample only the negative examples instead of training all of them. This gives rise to the following negative sampling objective:
$$l_{\mathrm{NEG}}(w_o, w_i) \stackrel{\mathrm{def}}{=} \log \sigma(W_{w_o}^\top V_{w_i}) + \sum_{j=1}^{k} \mathbb E_{w_j \sim P(w)} \log\big(1 - \sigma(W_{w_j}^\top V_{w_i})\big).$$
For $P(w)$, both the uniform and the unigram distribution $U(w)$ work, but $U(w)^{3/4}$ outperforms them significantly (this fact has been reported in several papers by different authors).
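A sketch of the objective for a single training pair (to be maximized; the expectation is approximated by sampling, and the $U(w)^{3/4}$ noise distribution is built from an assumed empirical `unigram` array):

```python
import numpy as np

def neg_sampling_objective(W, V, w_o, w_i, unigram, k, rng):
    """W, V: output/input embedding matrices [vocab, dim]; w_o, w_i: word ids."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    noise = unigram ** 0.75                 # U(w)^{3/4}, renormalized below
    noise /= noise.sum()
    negatives = rng.choice(len(noise), size=k, p=noise)
    objective = np.log(sigmoid(W[w_o] @ V[w_i]))
    objective += sum(np.log(1 - sigmoid(W[w_j] @ V[w_i])) for w_j in negatives)
    return objective
```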

SLIDE 31

Recurrent Character-level WEs

Figure 1 of paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.

SLIDE 32

Convolutional Character-level WEs

Table 6 of paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.

SLIDE 33

Character N-grams

Another simple idea appeared in three nearly simultaneous publications as Charagram, Subword Information or SubGram. A word embedding is the sum of the word embedding plus embeddings of its character n-grams. Such embeddings can be pretrained using the same algorithms as word2vec.

The implementation can be
  • dictionary based: only some number of frequent character n-grams is kept;
  • hash-based: character n-grams are hashed into $K$ buckets (usually $K \sim 10^6$ is used), as in the sketch below.
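A hash-based variant can be sketched as follows (the boundary markers and the n-gram range 3–6 follow the Subword Information paper; Python's built-in `hash` stands in for the fixed hash function a real implementation would use):

```python
def char_ngram_buckets(word, n_min=3, n_max=6, buckets=10**6):
    """Return bucket ids of all character n-grams of `word`."""
    w = "<" + word + ">"                       # mark the word boundaries
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return [hash(g) % buckets for g in grams]

# The word representation is then its own embedding plus the sum of the
# bucket embeddings indexed by char_ngram_buckets(word).
```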

SLIDE 34

Charagram WEs

Table 7 of paper "Enriching Word Vectors with Subword Information", https://arxiv.org/abs/1607.04606.

SLIDE 35

Charagram WEs

Figure 2 of paper "Enriching Word Vectors with Subword Information", https://arxiv.org/abs/1607.04606.

SLIDE 36

Sequence-to-Sequence Architecture


SLIDE 37

Sequence-to-Sequence Architecture

Figure 1 of paper "Sequence to Sequence Learning with Neural Networks", https://arxiv.org/abs/1409.0473.

SLIDE 38

Sequence-to-Sequence Architecture

Figure 1 of paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", https://arxiv.org/abs/1406.1078.

SLIDE 39

Sequence-to-Sequence Architecture

[Diagram: the decoder unrolled over time, starting from BOS and ending with EOS – during training the gold outputs x(t) are fed as the next inputs, while during inference the decoder's own predictions x̂(t) are fed back.]

Training

The so-called teacher forcing is used during training – the gold outputs are used as decoder inputs.

Inference

During inference, the network processes its own predictions. Usually, the generated logits are processed by an $\arg\max$, the chosen word embedded and used as the next input, as in the sketch below.
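A sketch of the inference loop; `decoder_step(state, input_embedding) -> (new_state, logits)` is an assumed interface for one step of a trained decoder:

```python
import numpy as np

def greedy_decode(decoder_step, embeddings, state, bos_id, eos_id, max_len=100):
    output, word_id = [], bos_id
    for _ in range(max_len):
        state, logits = decoder_step(state, embeddings[word_id])
        word_id = int(np.argmax(logits))   # arg max over the target vocabulary
        if word_id == eos_id:              # stop once EOS is generated
            break
        output.append(word_id)             # feed the prediction back as next input
    return output
```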

SLIDE 40

Tying Word Embeddings

[Diagram: the target word id is embedded by a V×D matrix, processed by the RNN, and the D-dimensional output is multiplied by a D×V output-layer matrix to produce the target word logits; with tied embeddings, the two matrices are shared.]
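A minimal sketch of the tying: a single V×D matrix serves both as the input embedding lookup and, transposed, as the D→V output projection (the sizes are illustrative):

```python
import numpy as np

V, D = 10000, 512
E = np.random.default_rng(0).normal(size=(V, D))  # one shared V x D matrix

word_id = 42
embedding = E[word_id]          # used as the target word embedding (row lookup)
h = np.zeros(D)                 # a decoder RNN output
logits = E @ h                  # the same matrix produces the D -> V logits
```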

SLIDE 41

Attention

Figure 1 of paper "Neural Machine Translation by Jointly Learning to Align and Translate", https://arxiv.org/abs/1409.0473.

As another input during decoding, we add a context vector $c_i$:
$$s_i = f(s_{i-1}, y_{i-1}, c_i).$$
We compute the context vector as a weighted combination of source sentence encoded outputs:
$$c_i = \sum_j \alpha_{ij} h_j$$
The weights $\alpha_{ij}$ are a softmax of $e_{ij}$ over $j$,
$$\alpha_i = \operatorname{softmax}(e_i),$$
with $e_{ij}$ being
$$e_{ij} = v^\top \tanh(V h_j + W s_{i-1} + b).$$
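These equations translate directly into a sketch for a single decoder step (`H` stacks the encoder outputs $h_j$ as rows; the parameter names and shapes are assumptions):

```python
import numpy as np

def attention(H, s_prev, V_p, W_p, v, b):
    """Return the context vector c_i and the weights alpha_i."""
    e = np.tanh(H @ V_p.T + s_prev @ W_p.T + b) @ v  # e_ij for all positions j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                             # softmax over source positions
    return alpha @ H, alpha                          # c_i = sum_j alpha_ij h_j
```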

SLIDE 42

Attention

Figure 3 of paper "Neural Machine Translation by Jointly Learning to Align and Translate", https://arxiv.org/abs/1409.0473.

SLIDE 43

Subword Units

Translate subword units instead of words. The subword units can be generated in several ways; the most commonly used are:

BPE: Using the byte pair encoding algorithm. Start with characters plus a special end-of-word symbol ⋅. Then, keep merging the most frequently occurring symbol pair $A, B$ into a new symbol $AB$, with the symbol pair never crossing a word boundary. Considering a dictionary with words low, lowest, newer, wider, the example merges are:
r ⋅ → r⋅,  l o → lo,  lo w → low,  e r⋅ → er⋅

Wordpieces: Joining neighboring symbols to maximize unigram language model likelihood.

Usually a relatively small number of subword units is used (32k–64k), often generated on the union of the two vocabularies (the so-called joint BPE or shared wordpieces).
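The merge loop itself is only a few lines; this sketch follows the reference BPE implementation of Sennrich et al., representing each dictionary word as space-separated symbols ending with the end-of-word symbol (the frequencies are made up, and ties such as "e r" versus "r ⋅" are broken arbitrarily, so the merge order can differ from the example above):

```python
import re
from collections import Counter

def learn_bpe(vocab, num_merges):
    """vocab: {'l o w ⋅': freq, ...}; returns the list of learned merges."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq                  # count neighboring symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # the most frequent pair
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges

vocab = {"l o w ⋅": 5, "l o w e s t ⋅": 2, "n e w e r ⋅": 6, "w i d e r ⋅": 3}
print(learn_bpe(vocab, 4))  # [('e', 'r'), ('er', '⋅'), ('l', 'o'), ('lo', 'w')]
```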

SLIDE 44

Google NMT

Figure 1 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.

SLIDE 45

Google NMT

Figure 5 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.

SLIDE 46

Google NMT

Figure 6 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.

SLIDE 47

Beyond one Language Pair

Figure 5 of "Show and Tell: Lessons learned from the 2015 MSCOCO...", https://arxiv.org/abs/1609.06647.

SLIDE 48

Beyond one Language Pair

Figure 6 of "Multimodal Compact Bilinear Pooling for VQA and Visual Grounding", https://arxiv.org/abs/1606.01847.

SLIDE 49

Multilingual Translation

There have been many attempts at multilingual translation:
  • individual encoders and decoders with shared attention;
  • shared encoders and decoders.
