
Neural Networks Language Models

Philipp Koehn 16 April 2015


N-Gram Backoff Language Model

  • Previously, we approximated

p(W) = p(w1, w2, ..., wn)

  • ... by applying the chain rule

p(W) = ∏i p(wi|w1, ..., wi−1)

  • ... and limiting the history (Markov order)

p(wi|w1, ..., wi−1) ≃ p(wi|wi−4, wi−3, wi−2, wi−1)

  • Each p(wi|wi−4, wi−3, wi−2, wi−1) may not have enough statistics to estimate

→ we back off to p(wi|wi−3, wi−2, wi−1), p(wi|wi−2, wi−1), etc., all the way to p(wi)
– exact details of backing off get complicated, e.g., "interpolated Kneser-Ney"
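
To make the backoff idea concrete, here is a minimal Python sketch of a simplified ("stupid") backoff estimate with a fixed penalty factor; it is an illustration only, not interpolated Kneser-Ney, and the toy corpus and the 0.4 factor are assumptions:

```python
from collections import Counter

def backoff_prob(ngram, counts, alpha=0.4):
    """Simplified ("stupid") backoff: if the full n-gram is unseen, recurse on a
    shorter history and multiply by a fixed penalty alpha.
    This is only an illustration, not interpolated Kneser-Ney."""
    if len(ngram) == 1:
        total = sum(counts[1].values())
        return counts[1][ngram] / total if total else 0.0
    order, history = len(ngram), ngram[:-1]
    if counts[order][ngram] > 0 and counts[order - 1][history] > 0:
        return counts[order][ngram] / counts[order - 1][history]
    return alpha * backoff_prob(ngram[1:], counts, alpha)

# toy corpus: collect counts for all n-grams up to order 3
corpus = "the cat eats fish the dog eats meat the cat sleeps".split()
counts = {n: Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
          for n in (1, 2, 3)}
print(backoff_prob(("the", "cat", "eats"), counts))    # seen trigram: relative frequency
print(backoff_prob(("the", "dog", "sleeps"), counts))  # unseen: backs off twice to p(sleeps)
```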


Refinements

  • A whole family of back-off schemes
  • Skip n-gram models that may back off to p(wi|wi−2)
  • Class-based models p(C(wi)|C(wi−4), C(wi−3), C(wi−2), C(wi−1))

⇒ We are wrestling here with
– using as much relevant evidence as possible
– pooling evidence between words


First Sketch

[Diagram: Words 1–4 feed into a hidden layer that predicts Word 5]


Representing Words

  • Words are represented with a one-hot vector, e.g.,

– dog = (0,0,0,0,1,0,0,0,0,....)
– cat = (0,0,0,0,0,0,0,1,0,....)
– eat = (0,1,0,0,0,0,0,0,0,....)

  • That’s a large vector!
  • Remedies

– limit to, say, 20,000 most frequent words, rest are OTHER
– place words in √n classes, so each word is represented by
  ∗ 1 class label
  ∗ 1 word-in-class label
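
A minimal sketch of the one-hot representation with a limited vocabulary (the vocabulary size and the OTHER fallback index are illustrative assumptions):

```python
from collections import Counter
import numpy as np

def build_vocab(tokens, max_size=20000):
    """Keep the most frequent words; everything else maps to OTHER (index 0)."""
    most_common = [w for w, _ in Counter(tokens).most_common(max_size - 1)]
    return {"OTHER": 0, **{w: i + 1 for i, w in enumerate(most_common)}}

def one_hot(word, vocab):
    """One-hot vector: all zeros except a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.get(word, vocab["OTHER"])] = 1.0
    return v

vocab = build_vocab("the cat eats fish the dog eats meat".split(), max_size=6)
print(one_hot("cat", vocab))      # 1 at the index assigned to "cat"
print(one_hot("unicorn", vocab))  # unseen word falls back to OTHER
```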


Word Classes for Two-Hot Representations

  • WordNet classes
  • Brown clusters
  • Frequency binning

– sort words by frequency
– place them in order into classes
– each class has the same token count
→ very frequent words have their own class
→ rare words share a class with many other words
(a small sketch of this binning follows below)

  • Anything goes: assign words randomly to classes
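
A minimal sketch of the frequency-binning scheme from the list above (toy corpus; the number of classes is taken as √n of the vocabulary size):

```python
import math
from collections import Counter

def frequency_binning(tokens):
    """Assign words to ~sqrt(n) classes so each class covers about the same
    number of tokens: frequent words get (nearly) private classes, rare words share."""
    counts = Counter(tokens)
    num_classes = max(1, int(math.sqrt(len(counts))))
    tokens_per_class = len(tokens) / num_classes
    word_class, current_class, mass = {}, 0, 0
    for word, count in counts.most_common():          # most frequent first
        word_class[word] = current_class
        mass += count
        if mass >= tokens_per_class and current_class < num_classes - 1:
            current_class, mass = current_class + 1, 0
    return word_class

classes = frequency_binning("the the the the the cat dog eats fish meat".split())
print(classes)   # "the" fills class 0 by itself; the rare words share class 1
```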


Second Sketch

[Diagram: Words 1–4 (as vectors) feed into a hidden layer that predicts Word 5]


word embeddings


Add a Hidden Layer

[Diagram: each of Words 1–4 is mapped by a shared weight matrix C into an embedding layer, which feeds the hidden layer that predicts Word 5]

  • Map each word first into a lower-dimensional real-valued space
  • Shared weight matrix C
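
A minimal sketch of the shared mapping: multiplying a one-hot vector by C amounts to selecting a row of C, and the same matrix is reused for every context position (vocabulary size and embedding dimension below are arbitrary):

```python
import numpy as np

vocab_size, embed_dim = 20000, 30                   # illustrative sizes
C = np.random.randn(vocab_size, embed_dim) * 0.1    # shared weight matrix C

def embed_context(word_ids):
    """Map each context word to a low-dimensional real-valued vector.
    Multiplying a one-hot vector by C is just selecting the corresponding row of C."""
    return np.concatenate([C[i] for i in word_ids])

context = [17, 42, 256, 1024]       # indices of the four context words
x = embed_context(context)          # input to the hidden layer: 4 x 30 = 120 values
print(x.shape)                      # (120,)
```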


Details (Bengio et al., 2003)

  • Add direct connections from embedding layer to output layer
  • Activation functions

– input→embedding: none
– embedding→hidden: tanh
– hidden→output: softmax

  • Training

– loop through the entire corpus
– update weights from the error between the predicted probabilities and the 1-hot vector of the output word
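
A minimal sketch of the forward pass with these activation functions (layer sizes and initialization are arbitrary, and the direct embedding→output connections mentioned above are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n_context = 20000, 30, 100, 4             # vocab, embedding, hidden sizes
C = rng.normal(0, 0.1, (V, d))                     # embedding matrix
W_h = rng.normal(0, 0.1, (h, n_context * d))       # embedding -> hidden weights
W_o = rng.normal(0, 0.1, (V, h))                   # hidden -> output weights

def forward(word_ids):
    x = np.concatenate([C[i] for i in word_ids])   # input -> embedding: no activation
    hidden = np.tanh(W_h @ x)                      # embedding -> hidden: tanh
    scores = W_o @ hidden                          # hidden -> output
    p = np.exp(scores - scores.max())              # softmax over the vocabulary
    return p / p.sum()

p = forward([17, 42, 256, 1024])
print(p.shape, p.sum())                            # (20000,) 1.0
# Training (sketch): the gradient of the cross-entropy loss w.r.t. the scores is
# (p - one_hot(correct word)), which is then back-propagated through the layers.
```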


Word Embeddings

[Diagram: a one-hot word vector is mapped by the weight matrix C to its word embedding]

  • By-product: embedding of word into continuous space
  • Similar contexts → similar embedding
  • Recall: distributional semantics


Word Embeddings

[Two further slides of figures visualizing word embeddings]

Are Word Embeddings Magic?

  • Morphosyntactic regularities (Mikolov et al., 2013)

– adjectives: base form vs. comparative, e.g., good, better
– nouns: singular vs. plural, e.g., year, years
– verbs: present tense vs. past tense, e.g., see, saw

  • Semantic regularities

– clothing is to shirt as dish is to bowl
– evaluated on human judgment data of semantic similarities
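
Such regularities are commonly probed with vector offsets; a hedged sketch, assuming a trained embedding matrix and word↔index mappings already exist:

```python
import numpy as np

def analogy(a, b, c, embeddings, vocab, inv_vocab, top_k=1):
    """Find word(s) d such that a : b ~ c : d using the offset b - a + c and
    cosine similarity (in the style of Mikolov et al., 2013)."""
    query = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ (query / np.linalg.norm(query))
    for w in (a, b, c):                  # exclude the query words themselves
        sims[vocab[w]] = -np.inf
    return [inv_vocab[i] for i in np.argsort(-sims)[:top_k]]

# usage, assuming `embeddings` (|V| x d), `vocab` (word -> row index) and
# `inv_vocab` (row index -> word) come from a trained model:
# print(analogy("good", "better", "year", embeddings, vocab, inv_vocab))  # hope: "years"
```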


integration into machine translation systems


Reranking

  • First decode without neural network language model (NNLM)
  • Generate

– n-best list
– lattice

  • Score candidates with NNLM
  • Rerank (requires tuning a weight for the NNLM)
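
A minimal sketch of n-best reranking (the nnlm_logprob scorer and the feature weights are hypothetical placeholders; in practice the NNLM weight is tuned like any other model weight):

```python
def rerank(nbest, nnlm_logprob, weights):
    """nbest: list of (translation, feature_dict) pairs from the base decoder.
    Add the NNLM log-probability as one more feature, then sort by weighted score."""
    def score(entry):
        translation, features = entry
        features = dict(features, nnlm=nnlm_logprob(translation))   # add NNLM feature
        return sum(weights[name] * value for name, value in features.items())
    return sorted(nbest, key=score, reverse=True)

# usage with a hypothetical scorer and tuned weights:
# reranked = rerank(nbest_list, my_nnlm.logprob, {"tm": 1.0, "lm": 0.5, "nnlm": 0.3})
# best_translation = reranked[0][0]
```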


Computations During Inference

[Diagram: feed-forward NNLM as before; the mapping of the context words through the embedding matrices C is precomputed]


Computations During Inference

[Diagram: as above; in addition, the hidden layer values for a given context can be cached]


Computations During Inference

[Diagram: as above, with layer sizes annotated: embedding layer 4×30 nodes, hidden layer 100 nodes, output layer 1,000,000 nodes; weight matrices 4×30×100 (embedding→hidden) and 100×1,000,000 (hidden→output). The embedding computations are precomputed, the hidden layer values can be cached, and at the output layer only the score for the predicted word is computed, using a 100×1 slice of the output weights.]
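
A minimal sketch of these savings, using smaller illustrative sizes than the 1,000,000-word output layer above: per-position embedding→hidden contributions are precomputed, hidden layer values are cached per history, and only one row of the output weights is used per scored word:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10000, 30, 100          # vocabulary shrunk from the 1,000,000 on the slide
C = rng.normal(0, 0.1, (V, d))
W_h = rng.normal(0, 0.1, (h, 4 * d))
W_o = rng.normal(0, 0.1, (V, h))

def position_contribution(word_id, pos):
    """Contribution of one context word (at one position) to the hidden layer sum.
    Depends only on (word, position), so it can be precomputed once for all words."""
    return W_h[:, pos * d:(pos + 1) * d] @ C[word_id]

hidden_cache = {}                 # hidden layer values can be cached per history

def score(history, predicted_word):
    if history not in hidden_cache:
        s = sum(position_contribution(w, pos) for pos, w in enumerate(history))
        hidden_cache[history] = np.tanh(s)
    hidden = hidden_cache[history]
    return W_o[predicted_word] @ hidden   # only the one needed output score is computed

print(score((17, 42, 256, 1024), 99))
```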


Only Compute Score for Predicted Word?

  • Proper probabilities require normalization

– compute scores for all possible words
– add them up
– normalize (softmax)

  • How can we get away with it?

– we do not care: a score is a score (Auli and Gao, 2014)
– training regime that normalizes (Vaswani et al., 2013)
– integrate normalization into objective function (Devlin et al., 2014)

  • Class-based word representations may help

– first predict class, normalize
– then predict word, normalize
→ compute 2√n instead of n output node values
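
A minimal sketch of the class-factored output (toy class assignment and sizes): normalize over the classes first, then over the words inside the predicted word's class, so roughly 2√n instead of n scores are computed:

```python
import numpy as np

def class_factored_prob(hidden, word, word_class, class_words, W_class, W_word):
    """p(word | history) = p(class(word) | h) * p(word | class(word), h).
    Only the class scores and the scores of the words in one class are normalized."""
    c = word_class[word]
    class_scores = W_class @ hidden                       # ~sqrt(n) values
    class_p = np.exp(class_scores - class_scores.max())
    class_p /= class_p.sum()

    members = class_words[c]                              # words sharing this class
    word_scores = W_word[members] @ hidden                # ~sqrt(n) values
    word_p = np.exp(word_scores - word_scores.max())
    word_p /= word_p.sum()
    return class_p[c] * word_p[members.index(word)]

rng = np.random.default_rng(0)
h, n_classes, vocab = 8, 3, 9                             # toy sizes
W_class = rng.normal(size=(n_classes, h))
W_word = rng.normal(size=(vocab, h))
word_class = {w: w // 3 for w in range(vocab)}            # toy: 3 words per class
class_words = {c: [w for w in range(vocab) if w // 3 == c] for c in range(n_classes)}
print(class_factored_prob(rng.normal(size=h), 4, word_class, class_words, W_class, W_word))
```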


recurrent neural networks


Recurrent Neural Networks

[Diagram: Word 1 is mapped through embedding E and hidden layer H to predict Word 2; the recurrent input to H is a layer with all nodes set to 1]

  • Start: predict second word from first
  • Mystery layer with nodes all with value 1


Recurrent Neural Networks

[Diagram: a second step predicts Word 3 from Word 2; the hidden layer values of the first step are copied over as the recurrent input of the second step]


Recurrent Neural Networks

[Diagram: a third step predicts Word 4 from Word 3; again, the hidden layer values of the previous step are copied over as recurrent input]


Training

[Diagram: first training example, predicting Word 2 from Word 1 (recurrent input all 1s)]

  • Process first training example
  • Update weights with back-propagation


Training

[Diagram: second training example, predicting Word 3 from Word 2, using the previous hidden layer values as recurrent input]

  • Process second training example
  • Update weights with back-propagation
  • And so on...
  • But: no feedback to previous history


Back-Propagation Through Time

[Diagram: recurrent network unfolded over three time steps (Word 1→Word 2, Word 2→Word 3, Word 3→Word 4)]

  • After processing a few training examples, update through the unfolded recurrent neural network


Back-Propagation Through Time

  • Carry out back-propagation through time (BPTT) after each training example

– 5 time steps seems to be sufficient
– network learns to store information for more than 5 time steps

  • Or: update in mini-batches

– process 10-20 training examples
– update backwards through all examples
– removes need for multiple steps for each training example
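
A rough sketch of truncated BPTT for a plain recurrent language model (layer sizes, the learning rate, and the 5-step truncation are illustrative; the embedding matrix is kept fixed to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, k, lr = 50, 16, 32, 5, 0.05        # vocab, embedding, hidden, truncation, learning rate
E = rng.normal(0, 0.1, (V, d))              # word embeddings (kept fixed here for brevity)
W = rng.normal(0, 0.1, (h, d))              # input -> hidden weights
U = rng.normal(0, 0.1, (h, h))              # recurrent hidden -> hidden weights
O = rng.normal(0, 0.1, (V, h))              # hidden -> output weights

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

def bptt_update(words, h_prev):
    """Unfold the network over at most k steps, back-propagate through the
    unfolded copies (cross-entropy loss, softmax output), apply one update."""
    xs, hs, dscores = [], [h_prev], []
    for w_in, w_out in zip(words[:-1], words[1:]):      # predict the next word
        x = E[w_in]
        hid = sigmoid(W @ x + U @ hs[-1])
        scores = O @ hid
        p = np.exp(scores - scores.max()); p /= p.sum()
        p[w_out] -= 1.0                                 # d loss / d scores
        xs.append(x); hs.append(hid); dscores.append(p)
    dW, dU, dO, dh_next = np.zeros_like(W), np.zeros_like(U), np.zeros_like(O), np.zeros(h)
    for t in reversed(range(len(dscores))):             # back-propagation through time
        dO += np.outer(dscores[t], hs[t + 1])
        dh = O.T @ dscores[t] + dh_next
        ds = dh * hs[t + 1] * (1 - hs[t + 1])           # through the sigmoid
        dW += np.outer(ds, xs[t])
        dU += np.outer(ds, hs[t])
        dh_next = U.T @ ds
    for M, dM in ((W, dW), (U, dU), (O, dO)):
        M -= lr * dM
    return hs[-1]                                       # carry the state to the next chunk

state = np.zeros(h)
sentence = rng.integers(0, V, size=12)                  # toy word-id sequence
for start in range(0, len(sentence) - 1, k):
    state = bptt_update(sentence[start:start + k + 1], state)
```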


Integration into Decoder

  • Recurrent neural networks depend on entire history

⇒ very bad for dynamic programming


long short term memory


Vanishing and Exploding Gradients

  • Error is propagated to previous steps
  • Updates consider

– prediction at that time step
– impact on future time steps

  • Exploding gradient: propagated error dominates weight update
  • Vanishing gradient: propagated error disappears

⇒ We want the proper balance


Long Short Term Memory (LSTM)

  • Redesign of the neural network node to keep balance
  • Rather complex
  • ... but reportedly simple to train


Node in a Recurrent Neural Network

  • Given

– input word embedding x
– previous hidden layer values h(t−1)
– weight matrices W and U

  • Sum s_i = Σ_j w_ij x_j + Σ_j u_ij h_j^(t−1)
  • Activation y_i = sigmoid(s_i)
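
In vectorized form, these per-node sums become a matrix-vector expression; a minimal sketch with arbitrary dimensions:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def rnn_node(x, h_prev, W, U):
    """s_i = sum_j w_ij x_j + sum_j u_ij h_j^(t-1);  y_i = sigmoid(s_i)."""
    return sigmoid(W @ x + U @ h_prev)

rng = np.random.default_rng(0)
d, n = 30, 100                                      # embedding and hidden sizes (arbitrary)
W, U = rng.normal(0, 0.1, (n, d)), rng.normal(0, 0.1, (n, n))
y = rnn_node(rng.normal(size=d), np.zeros(n), W, U)
print(y.shape)                                      # (100,)
```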


Node ("Cell") in LSTM

  • Now three gates: input, output, forget

each with their own weight matrices: WI, UI, WO, UO, WF, UF

  • Input and forget gates lead to activations as before

y_i^I = sigmoid( Σ_j w_ij^I x_j + Σ_j u_ij^I h_j^(t−1) )
y_i^F = sigmoid( Σ_j w_ij^F x_j + Σ_j u_ij^F h_j^(t−1) )

  • Compute a candidate value for the "state" of the node (weight matrices WC, UC)

C̃_i^(t) = tanh( Σ_j w_ij^C x_j + Σ_j u_ij^C h_j^(t−1) )

  • Input and forget activations balance candidate state and previous state

C_i^(t) = y_i^I C̃_i^(t) + y_i^F C_i^(t−1)

  • Output gate also considers the state (additional weight matrix V)

y_i^O = sigmoid( Σ_j w_ij^O x_j + Σ_j u_ij^O h_j^(t−1) + Σ_j v_ij C_j^(t) )

  • Output

h_i^(t) = y_i^O tanh(C_i^(t))
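
A minimal sketch of this cell, following the equations above (dimensions and initialization are arbitrary; the output gate uses the extra matrix V over the cell state, as on the slide):

```python
import numpy as np

def sigmoid(s): return 1.0 / (1.0 + np.exp(-s))

def lstm_cell(x, h_prev, c_prev, WI, UI, WF, UF, WC, UC, WO, UO, V):
    """One LSTM step following the slide's equations:
    input/forget/output gates, candidate state, state update, output."""
    y_in = sigmoid(WI @ x + UI @ h_prev)                 # input gate
    y_forget = sigmoid(WF @ x + UF @ h_prev)             # forget gate
    c_tilde = np.tanh(WC @ x + UC @ h_prev)              # candidate state
    c = y_in * c_tilde + y_forget * c_prev               # balance new vs. previous state
    y_out = sigmoid(WO @ x + UO @ h_prev + V @ c)        # output gate (sees the state)
    h = y_out * np.tanh(c)                               # output
    return h, c

rng = np.random.default_rng(0)
d, n = 30, 100                                           # embedding and cell sizes (arbitrary)
mats = [rng.normal(0, 0.1, (n, d)) if name.startswith("W") else rng.normal(0, 0.1, (n, n))
        for name in ("WI", "UI", "WF", "UF", "WC", "UC", "WO", "UO", "V")]
h, c = lstm_cell(rng.normal(size=d), np.zeros(n), np.zeros(n), *mats)
print(h.shape, c.shape)                                  # (100,) (100,)
```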