SLIDE 1

Neural Networks Language Models

Philipp Koehn 1 October 2020

SLIDE 2

N-Gram Backoff Language Model

  • Previously, we approximated

p(W) = p(w1, w2, ..., wn)

  • ... by applying the chain rule

p(W) = ∏i p(wi|w1, ..., wi−1)

  • ... and limiting the history (Markov order)

p(wi|w1, ..., wi−1) ≃ p(wi|wi−4, wi−3, wi−2, wi−1)

  • We may not have enough statistics to reliably estimate each p(wi|wi−4, wi−3, wi−2, wi−1)

→ we back off to p(wi|wi−3, wi−2, wi−1), p(wi|wi−2, wi−1), etc., all the way to p(wi)
– exact details of backing off get complicated: "interpolated Kneser-Ney"
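To make the backoff idea concrete, here is a minimal Python sketch. It implements a simplified "stupid backoff"-style scheme with a single penalty weight rather than interpolated Kneser-Ney (whose discounting details the slide alludes to); the counts table and the toy numbers are made up for illustration.

from collections import Counter

def backoff_prob(word, history, counts, alpha=0.4):
    # P(word | history) with a fixed backoff penalty alpha ("stupid backoff" style),
    # a simplified stand-in for the interpolated Kneser-Ney scheme mentioned above
    if not history:                                        # base case: unigram
        total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
        return counts.get((word,), 0) / total if total else 0.0
    hist_count = counts.get(tuple(history), 0)
    ngram_count = counts.get(tuple(history) + (word,), 0)
    if ngram_count > 0:
        return ngram_count / hist_count                    # enough evidence: use full history
    return alpha * backoff_prob(word, history[1:], counts, alpha)   # back off to shorter history

# toy counts (made up)
counts = Counter({("the",): 3, ("cat",): 2, ("sat",): 1, ("the", "cat"): 2, ("cat", "sat"): 1})
print(backoff_prob("cat", ["the"], counts))   # seen bigram: 2/3
print(backoff_prob("sat", ["the"], counts))   # unseen bigram: backs off to alpha * p(sat)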

SLIDE 3

Refinements

  • A whole family of back-off schemes
  • Skip-n gram models that may back off to p(wi|wi−2)
  • Class-based models p(C(wi)|C(wi−4), C(wi−3), C(wi−2), C(wi−1))

⇒ We are wrestling here with
– using as much relevant evidence as possible
– pooling evidence between words

SLIDE 4

First Sketch

[Figure: history words wi−4, wi−3, wi−2, wi−1 → feed-forward (FF) hidden layer h → softmax → output word wi]

SLIDE 5

Representing Words

  • Words are represented with a one-hot vector, e.g.,

– dog = (0,0,0,0,1,0,0,0,0,....)
– cat = (0,0,0,0,0,0,0,1,0,....)
– eat = (0,1,0,0,0,0,0,0,0,....)

  • That’s a large vector!
  • Remedies

– limit to, say, 20,000 most frequent words, rest are OTHER
– place words in √n classes, so each word is represented by
∗ 1 class label
∗ 1 word-in-class label
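A minimal Python sketch of the two representations, using a made-up toy vocabulary and a made-up class assignment: a plain one-hot vector over the vocabulary, and the "two-hot" alternative that pairs a one-hot class vector with a one-hot word-within-class vector.

import numpy as np

vocab = ["the", "dog", "cat", "eat", "sat"]              # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# two-hot: a word is represented by (class label, position within its class)
word2class = {"the": 0, "dog": 1, "cat": 1, "eat": 2, "sat": 2}      # illustrative classes
class_members = {0: ["the"], 1: ["dog", "cat"], 2: ["eat", "sat"]}

def two_hot(word, num_classes=3, max_class_size=2):
    c = word2class[word]
    class_vec = np.zeros(num_classes)
    class_vec[c] = 1.0
    within_vec = np.zeros(max_class_size)
    within_vec[class_members[c].index(word)] = 1.0
    return class_vec, within_vec

print(one_hot("cat"))    # [0. 0. 1. 0. 0.]
print(two_hot("cat"))    # class 1, second word within that class

With a vocabulary of n words, the two vectors together have only about 2√n entries instead of n.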

SLIDE 6

Word Classes for Two-Hot Representations

  • WordNet classes
  • Brown clusters
  • Frequency binning

– sort words by frequency
– place them in order into classes
– each class has same token count (see the sketch after this list)
→ very frequent words have their own class
→ rare words share class with many other words

  • Anything goes: assign words randomly to classes
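A minimal Python sketch of the frequency-binning scheme from the list above, with made-up word frequencies: sort words by frequency and cut them into classes of roughly equal total token count, so frequent words end up (nearly) alone in a class and rare words share one.

from collections import Counter

# toy corpus frequencies (made up)
freq = Counter({"the": 100, "of": 80, "dog": 10, "cat": 9, "eat": 5, "sat": 4, "zyzzyva": 1})

def frequency_bins(freq, num_classes=3):
    # assign words to classes with roughly equal total token count per class
    words = [w for w, _ in freq.most_common()]        # sorted by frequency, descending
    target = sum(freq.values()) / num_classes         # tokens per class
    classes, current, current_count = [], [], 0
    for w in words:
        current.append(w)
        current_count += freq[w]
        if current_count >= target and len(classes) < num_classes - 1:
            classes.append(current)
            current, current_count = [], 0
    classes.append(current)
    return classes

for i, c in enumerate(frequency_bins(freq)):
    print(i, c)
# the two most frequent words each get their own class; all rare words share the last one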

SLIDE 7

word embeddings

SLIDE 8

Add a Hidden Layer

[Figure: history words wi−4, wi−3, wi−2, wi−1 → embedding layer (shared Embed, embedding E w) → feed-forward hidden layer h → softmax → output word wi]

  • Map each word first into a lower-dimensional real-valued space
  • Shared weight matrix E
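A minimal numpy sketch (dimensions made up) of the shared embedding matrix E: multiplying a one-hot vector by E just selects one row, so in practice the embedding layer is implemented as row indexing, and the same E is reused for every history position.

import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))    # shared embedding matrix E

def embed(word_id):
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0
    return one_hot @ E                          # equivalent to E[word_id]

word_id = 2
assert np.allclose(embed(word_id), E[word_id])
# the same matrix E is applied to all four history words
history_ids = [0, 3, 1, 2]                      # w_{i-4} ... w_{i-1}
history_embeddings = np.concatenate([E[i] for i in history_ids])
print(history_embeddings.shape)                 # (12,) = 4 words × 3 dimensions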

SLIDE 9

Details (Bengio et al., 2003)

  • Add direct connections from embedding layer to output layer
  • Activation functions

– input→embedding: none
– embedding→hidden: tanh
– hidden→output: softmax

  • Training

– loop through the entire corpus
– update weights based on the difference between the predicted probabilities and the 1-hot vector for the output word
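A minimal numpy sketch of the forward pass of this model, with made-up sizes: no activation from input to embedding, tanh into the hidden layer, softmax at the output, plus the direct embedding-to-output connections mentioned above. Training (the backward pass) is omitted.

import numpy as np

rng = np.random.default_rng(0)
V, d, H, n = 1000, 30, 50, 4           # vocab size, embedding dim, hidden size, history length

E  = rng.normal(0, 0.1, (V, d))        # shared embedding matrix
W1 = rng.normal(0, 0.1, (n * d, H))    # embedding -> hidden
W2 = rng.normal(0, 0.1, (H, V))        # hidden -> output
Wd = rng.normal(0, 0.1, (n * d, V))    # direct embedding -> output connections

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(history_ids):
    x = np.concatenate([E[i] for i in history_ids])      # no activation input->embedding
    h = np.tanh(x @ W1)                                   # tanh embedding->hidden
    return softmax(h @ W2 + x @ Wd)                       # softmax at the output

p = predict([3, 17, 42, 8])
print(p.shape, p.sum())                                   # (1000,) 1.0 (up to rounding)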

SLIDE 10

Word Embeddings

[Figure: word embedding (matrix C)]

  • By-product: embedding of word into continuous space
  • Similar contexts → similar embedding
  • Recall: distributional semantics

SLIDE 11

Word Embeddings

SLIDE 12

Word Embeddings

SLIDE 13

Are Word Embeddings Magic?

  • Morphosyntactic regularities (Mikolov et al., 2013)

– adjectives base form vs. comparative, e.g., good, better
– nouns singular vs. plural, e.g., year, years
– verbs present tense vs. past tense, e.g., see, saw

  • Semantic regularities

– clothing is to shirt as dish is to bowl
– evaluated on human judgment data of semantic similarities
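A minimal Python sketch of the vector-offset analogy test behind these regularities. The embeddings here are random placeholders, so the printed answers are meaningless; with trained embeddings, the nearest neighbour of emb(better) − emb(good) + emb(year) tends to be "years", and similarly for the semantic analogies.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["good", "better", "year", "years", "see", "saw", "shirt", "clothing", "dish", "bowl"]
emb = {w: rng.normal(size=50) for w in vocab}       # placeholder vectors, not trained

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    # a is to b as c is to ? -> nearest word to emb[b] - emb[a] + emb[c]
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("good", "better", "year"))       # with trained embeddings: typically "years"
print(analogy("clothing", "shirt", "dish"))    # with trained embeddings: typically "bowl"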

SLIDE 14

recurrent neural networks

SLIDE 15

Recurrent Neural Networks

[Figure: word w1 → Embed → hidden layer (tanh) with a History input → softmax → output word]

  • Start: predict second word from first
  • Mystery layer with nodes all with value 1

SLIDE 16

Recurrent Neural Networks

[Figure: the hidden layer after w1 is copied and fed, together with the embedding of w2, into the hidden layer at the next time step]

SLIDE 17

Recurrent Neural Networks

[Figure: network unrolled over w1, w2, w3 — at each step the hidden layer is copied into the next step's history]
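A minimal numpy sketch of the recurrence shown in these figures (sizes made up): at each step the copied-over hidden state and the current word's embedding are combined through tanh, and a softmax over the vocabulary predicts the next word. The initial history is a vector of ones, as on the first slide of this sequence.

import numpy as np

rng = np.random.default_rng(0)
V, d, H = 1000, 30, 50
E  = rng.normal(0, 0.1, (V, d))        # embedding matrix
Wx = rng.normal(0, 0.1, (d, H))        # embedding -> hidden
Wh = rng.normal(0, 0.1, (H, H))        # previous hidden ("copy") -> hidden
Wo = rng.normal(0, 0.1, (H, V))        # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.ones(H)                          # "mystery layer with nodes all with value 1"
for w in [3, 17, 42]:                   # w1, w2, w3 (toy word ids)
    h = np.tanh(E[w] @ Wx + h @ Wh)     # new hidden state from embedding + copied history
    p_next = softmax(h @ Wo)            # distribution over the next word
    print(p_next.argmax())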

SLIDE 18

Training

[Figure: first training example — w1 → embedding E wt → RNN hidden state ht → softmax output yt → cost against the correct next word]

  • Process first training example
  • Update weights with back-propagation

SLIDE 19

Training

[Figure: second training example — w2 → embedding → RNN (with the copied history) → softmax output → cost]

  • Process second training example
  • Update weights with back-propagation
  • And so on...
  • But: no feedback to previous history

SLIDE 20

Back-Propagation Through Time

[Figure: network unrolled over w1, w2, w3, with a cost computed at each softmax output]

  • After processing a few training examples, update through the unfolded recurrent neural network

SLIDE 21

Back-Propagation Through Time

  • Carry out back-propagation through time (BPTT) after each training example

– 5 time steps seems to be sufficient
– network learns to store information for more than 5 time steps

  • Or: update in mini-batches

– process 10-20 training examples
– update backwards through all examples
– removes need for multiple steps for each training example
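A minimal PyTorch sketch of truncated back-propagation through time on a made-up random corpus: the hidden state is detached every k steps, so gradients flow through at most k unrolled steps; the mini-batch variant described above would process several such chunks before each update. Module choices and sizes are illustrative.

import torch
import torch.nn as nn

V, d, H, k = 1000, 32, 64, 5                     # vocab, embedding, hidden size, truncation length
emb = nn.Embedding(V, d)
rnn = nn.RNN(d, H, batch_first=True)
out = nn.Linear(H, V)
params = list(emb.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, V, (1, 101))           # toy corpus: one long random sequence
h = torch.zeros(1, 1, H)                         # initial hidden state

for t in range(0, 100, k):
    x = tokens[:, t:t + k]                       # inputs  w_t ... w_{t+k-1}
    y = tokens[:, t + 1:t + k + 1]               # targets w_{t+1} ... w_{t+k}
    h = h.detach()                               # cut the graph: back-propagate at most k steps
    output, h = rnn(emb(x), h)
    loss = loss_fn(out(output).reshape(-1, V), y.reshape(-1))
    opt.zero_grad()
    loss.backward()                              # BPTT through the k unrolled steps
    opt.step()
    print(t, loss.item())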

SLIDE 22

long short term memory

SLIDE 23

Vanishing Gradients

  • Error is propagated to previous steps
  • Updates consider

– prediction at that time step
– impact on future time steps

  • Vanishing gradient: propagated error disappears
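A minimal numeric illustration of the effect (the recurrent weight and activations are made up): the error signal reaching earlier time steps is multiplied at every step by the local derivative of tanh and the recurrent weight, so it shrinks roughly geometrically with the number of steps.

import numpy as np

w = 0.5                                  # recurrent weight (illustrative)
grad = 1.0                               # error signal at the final time step
for step in range(1, 21):
    a = 0.8                              # pretend pre-activation at this step
    grad *= w * (1 - np.tanh(a) ** 2)    # chain rule through tanh and the recurrent weight
    if step in (1, 5, 10, 20):
        print(step, grad)
# the propagated error becomes vanishingly small after a few dozen steps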

SLIDE 24

Recent vs. Early History

  • Hidden layer plays double duty

– memory of the network
– continuous space representation used to predict output words

  • Sometimes only recent context important

After much economic progress over the years, the country → has

  • Sometimes much earlier context important

The country which has made much economic progress over the years still → has

SLIDE 25

Long Short Term Memory (LSTM)

  • Design quite elaborate, although not very complicated to use
  • Basic building block: LSTM cell

– similar to a node in a hidden layer
– but: has an explicit memory state

  • Output and memory state change depends on gates

– input gate: how much new input changes memory state
– forget gate: how much of prior memory state is retained
– output gate: how strongly memory state is passed on to next layer

  • Gates can be not just open (1) or closed (0), but also slightly ajar (e.g., 0.2)

SLIDE 26

LSTM Cell

[Figure: LSTM cell — input X from the preceding layer is scaled by the input gate; the memory m from LSTM layer time t−1 is scaled by the forget gate and added (⊕) to give the new memory m, which is scaled by the output gate to produce h, passed on as Y to the next layer]

SLIDE 27

LSTM Cell (Math)

  • Memory and output values at time step t

memory_t = gate_input × input_t + gate_forget × memory_{t−1}
output_t = gate_output × memory_t

  • Hidden node value h_t passed on to the next layer applies activation function f

h_t = f(output_t)

  • Input computed as in a recurrent neural network node

– given node values for the prior layer x_t = (x_t^1, ..., x_t^X)
– given values for the hidden layer from the previous time step h_{t−1} = (h_{t−1}^1, ..., h_{t−1}^H)
– input value is a combination of matrix multiplication with weights w^x and w^h and activation function g

input_t = g( Σ_{i=1..X} w^x_i x_t^i + Σ_{i=1..H} w^h_i h_{t−1}^i )
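A minimal numpy sketch of exactly these update equations for a vector of cells, with the gate values simply given as numbers (how the gates themselves are computed is the topic of the "Values for Gates" slide below); both f and g are taken to be tanh here, which is an assumption.

import numpy as np

def lstm_cell_step(x, h_prev, memory_prev, w_x, w_h,
                   gate_input, gate_forget, gate_output):
    # one LSTM update following the slide's equations, with scalar gate values given
    input_t = np.tanh(w_x @ x + w_h @ h_prev)                  # g: combine x_t and h_{t-1}
    memory_t = gate_input * input_t + gate_forget * memory_prev
    output_t = gate_output * memory_t
    h_t = np.tanh(output_t)                                    # f: activation on the way out
    return h_t, memory_t

rng = np.random.default_rng(0)
x, h_prev = rng.normal(size=4), rng.normal(size=3)
w_x, w_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
h, m = lstm_cell_step(x, h_prev, memory_prev=np.zeros(3), w_x=w_x, w_h=w_h,
                      gate_input=0.9, gate_forget=0.5, gate_output=1.0)
print(h, m)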

SLIDE 28

Values for Gates

  • Gates are very important
  • How do we compute their value?

→ with a neural network layer!

  • For each gate a ∈ (input, forget, output)

– weight matrix W^xa to consider node values in the previous layer x_t
– weight matrix W^ha to consider the hidden layer h_{t−1} at the previous time step
– weight matrix W^ma to consider the memory memory_{t−1} at the previous time step
– activation function h

gate_a = h( Σ_{i=1..X} w^xa_i x_t^i + Σ_{i=1..H} w^ha_i h_{t−1}^i + Σ_{i=1..H} w^ma_i memory_{t−1}^i )
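A minimal numpy sketch combining this slide with the previous one: each gate has its own weight matrices over x_t, h_{t−1}, and memory_{t−1}, and the gate activation h is taken to be the sigmoid so the gate values land between 0 and 1 (an assumption, consistent with standard LSTMs); sizes are made up.

import numpy as np

rng = np.random.default_rng(1)
X, H = 4, 3                                   # input size, number of LSTM cells

def mats():
    return rng.normal(0, 0.5, (H, X)), rng.normal(0, 0.5, (H, H)), rng.normal(0, 0.5, (H, H))

W = {a: mats() for a in ("input", "forget", "output")}   # (W^xa, W^ha, W^ma) per gate
w_x, w_h = rng.normal(0, 0.5, (H, X)), rng.normal(0, 0.5, (H, H))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gate(a, x, h_prev, memory_prev):
    W_xa, W_ha, W_ma = W[a]
    return sigmoid(W_xa @ x + W_ha @ h_prev + W_ma @ memory_prev)

def lstm_step(x, h_prev, memory_prev):
    g_i, g_f, g_o = (gate(a, x, h_prev, memory_prev) for a in ("input", "forget", "output"))
    input_t = np.tanh(w_x @ x + w_h @ h_prev)
    memory_t = g_i * input_t + g_f * memory_prev
    h_t = np.tanh(g_o * memory_t)
    return h_t, memory_t

h, m = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, X)):             # run a few time steps
    h, m = lstm_step(x, h, m)
print(h, m)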

SLIDE 29

Training

  • LSTMs are trained the same way as recurrent neural networks
  • Back-propagation through time
  • This looks all very complex, but:

– all the operations are still based on
∗ matrix multiplications
∗ differentiable activation functions
→ we can compute gradients for objective function with respect to all parameters
→ we can compute update functions

SLIDE 30

What is the Point?

(from Tran, Bisazza, Monz, 2016)

  • Each node has memory memory_i independent of its current output h_i
  • Memory may be carried through unchanged (gate_input^i = 0, gate_memory^i = 1)

⇒ can remember important features over long time span (capture long distance dependencies)

SLIDE 31

Visualizing Individual Cells

Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"

SLIDE 32

Visualizing Individual Cells

SLIDE 33

Gated Recurrent Unit (GRU)

[Figure: GRU cell — input x from the preceding layer and hidden state h from GRU layer time t−1 are combined under the control of an update gate and a reset gate; the new h is passed on as Y to the next layer]

SLIDE 34

Gated Recurrent Unit (Math)

  • Two Gates

update_t = g(W_update · input_t + U_update · state_{t−1} + bias_update)
reset_t = g(W_reset · input_t + U_reset · state_{t−1} + bias_reset)

  • Combination of input and previous state

(similar to a traditional recurrent neural network)

combination_t = f(W · input_t + U · (reset_t ◦ state_{t−1}))

  • Interpolation with previous state

state_t = (1 − update_t) ◦ state_{t−1} + update_t ◦ combination_t
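A minimal numpy sketch of these GRU equations (sizes made up; g is taken to be the sigmoid and f to be tanh, which is an assumption consistent with standard GRUs).

import numpy as np

rng = np.random.default_rng(2)
X, H = 4, 3
W_u, U_u, b_u = rng.normal(size=(H, X)), rng.normal(size=(H, H)), np.zeros(H)   # update gate
W_r, U_r, b_r = rng.normal(size=(H, X)), rng.normal(size=(H, H)), np.zeros(H)   # reset gate
W,   U        = rng.normal(size=(H, X)), rng.normal(size=(H, H))                # combination

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x, state_prev):
    update = sigmoid(W_u @ x + U_u @ state_prev + b_u)
    reset  = sigmoid(W_r @ x + U_r @ state_prev + b_r)
    combination = np.tanh(W @ x + U @ (reset * state_prev))
    return (1 - update) * state_prev + update * combination   # interpolate old and new state

state = np.zeros(H)
for x in rng.normal(size=(5, X)):        # run a few time steps
    state = gru_step(x, state)
print(state)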

SLIDE 35

deeper models

SLIDE 36

Deep Learning?

[Figure: "Shallow" — input word xt → embedding E → single RNN hidden layer ht → softmax output yt, unrolled over three time steps]

  • Not much deep learning so far
  • Between prediction from input to output: only 1 hidden layer
  • How about more hidden layers?

SLIDE 37

Deep Models

[Figure: two deep variants, "Deep Stacked" and "Deep Transitional" — input word xi → embedding E → hidden layers ht,1, ht,2, ht,3 → softmax output yt, unrolled over three time steps]
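A minimal numpy sketch of the deep stacked variant (sizes made up): each hidden layer keeps its own recurrent state, reads the layer below it as input, and the next word is predicted from the top layer.

import numpy as np

rng = np.random.default_rng(3)
V, d, H, L = 1000, 30, 50, 3                      # vocab, embedding, hidden size, depth
E  = rng.normal(0, 0.1, (V, d))
Wx = [rng.normal(0, 0.1, (d if l == 0 else H, H)) for l in range(L)]   # input from the layer below
Wh = [rng.normal(0, 0.1, (H, H)) for l in range(L)]                    # recurrence within each layer
Wo = rng.normal(0, 0.1, (H, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = [np.zeros(H) for _ in range(L)]               # one hidden state per layer
for w in [3, 17, 42]:                             # toy word ids
    below = E[w]
    for l in range(L):                            # stack: layer l reads layer l-1 and its own past
        h[l] = np.tanh(below @ Wx[l] + h[l] @ Wh[l])
        below = h[l]
    p_next = softmax(h[-1] @ Wo)                  # predict the next word from the top layer
print(p_next.argmax())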

SLIDE 38

questions?
