DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING
Lecture 2: Recurrent Neural Networks (RNNs)


  1. DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING Lecture 2: Recurrent Neural Networks (RNNs) Caio Corro

  2. LECTURE 1 RECALL Language modeling with a multi-layer perceptron. 2nd order Markov chain: p(y_1, ..., y_n) = p(y_1) p(y_2 | y_1) ∏_{i=3}^{n} p(y_i | y_{i-1}, y_{i-2}). Concatenate the embeddings of the two previous words: x = [embedding of y_{i-2}; embedding of y_{i-1}]. Hidden representation: z = σ(U^(1) x + b^(1)). Output projection: w = U^(2) z + b^(2). Probability distribution: p(y_i | y_{i-1}, y_{i-2}) = exp(w_{y_i}) / ∑_{y'} exp(w_{y'}).

  3. LECTURE 1 RECALL Language modeling with a multi-layer perceptron. 2nd order Markov chain: p(y_1, ..., y_n) = p(y_1) p(y_2 | y_1) ∏_{i=3}^{n} p(y_i | y_{i-1}, y_{i-2}); x = [embedding of y_{i-2}; embedding of y_{i-1}]; z = σ(U^(1) x + b^(1)); w = U^(2) z + b^(2); p(y_i | y_{i-1}, y_{i-2}) = exp(w_{y_i}) / ∑_{y'} exp(w_{y'}). Sentence classification with a Convolutional Neural Network: 1. Convolution: sliding window of fixed size over the input sentence 2. Mean/max pooling over the convolution outputs 3. Multi-layer perceptron

  4. LECTURE 1 RECALL Language modeling with a multi-layer perceptron. 2nd order Markov chain: p(y_1, ..., y_n) = p(y_1) p(y_2 | y_1) ∏_{i=3}^{n} p(y_i | y_{i-1}, y_{i-2}); x = [embedding of y_{i-2}; embedding of y_{i-1}]; z = σ(U^(1) x + b^(1)); w = U^(2) z + b^(2); p(y_i | y_{i-1}, y_{i-2}) = exp(w_{y_i}) / ∑_{y'} exp(w_{y'}). Sentence classification with a Convolutional Neural Network: 1. Convolution: sliding window of fixed size over the input sentence 2. Mean/max pooling over the convolution outputs 3. Multi-layer perceptron. Main issue ➤ These 2 networks only use local word-order information ➤ No long range dependencies
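To make the recalled trigram model concrete, here is a minimal sketch of its forward pass, assuming PyTorch; the class name, sizes and vocabulary are illustrative assumptions, not taken from the slides. It concatenates the embeddings of the two previous words, applies a hidden layer, then a softmax over the vocabulary.

```python
import torch
import torch.nn as nn

class TrigramMLPLanguageModel(nn.Module):
    """p(y_i | y_{i-1}, y_{i-2}) from the concatenated embeddings of the two previous words."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(2 * emb_dim, hidden_dim)   # U^(1), b^(1)
        self.output = nn.Linear(hidden_dim, vocab_size)    # U^(2), b^(2)

    def forward(self, prev2, prev1):
        # x = [embedding of y_{i-2} ; embedding of y_{i-1}]
        x = torch.cat([self.emb(prev2), self.emb(prev1)], dim=-1)
        z = torch.sigmoid(self.hidden(x))                  # z = sigma(U^(1) x + b^(1))
        w = self.output(z)                                 # w = U^(2) z + b^(2)
        return torch.log_softmax(w, dim=-1)                # log p(y_i | y_{i-1}, y_{i-2})

# Usage with a toy vocabulary of 1000 word types and made-up word ids
model = TrigramMLPLanguageModel(vocab_size=1000)
log_probs = model(torch.tensor([42]), torch.tensor([7]))   # shape: (1, 1000)
```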

  5. LONG RANGE DEPENDENCIES Today: Recurrent neural networks ➤ Inputs are fed sequentially ➤ State representation updated at each input (figure: the sentence "The dog is eating" fed word by word)

  6. LONG RANGE DEPENDENCIES Today: Recurrent neural networks ➤ Inputs are fed sequentially ➤ State representation updated at each input (figure: "The dog is eating" fed word by word). Next week! Attention network ➤ Inputs contain position information ➤ At each position, look at any input in the sentence (figure: "The dog is eating" with position indices 1-4)

  7. RECURRENT NEURAL NETWORK Recurrent neural network cell (figure): input x^(n), output h^(n), incoming recurrent connection r^(n-1), outgoing recurrent connection r^(n)

  8. RECURRENT NEURAL NETWORK Recurrent neural network cell (figure): input x^(n), output h^(n), incoming recurrent connection r^(n-1), outgoing recurrent connection r^(n). Dynamic neural network: all cells share the same parameters (figure: the cell unrolled over "The dog is eating", producing h^(1), h^(2), h^(3), h^(4))
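A minimal sketch of this unrolling, assuming PyTorch and a toy embedding table (all sizes and word ids below are illustrative): the same cell, and hence the same parameters, is applied at every position, and the hidden state carries information along the sentence.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 32)                      # toy vocabulary of 1000 word types
cell = nn.RNNCell(input_size=32, hidden_size=64)  # one cell, shared at every position

sentence = torch.tensor([4, 17, 251, 3])          # made-up ids for "The dog is eating"
h = torch.zeros(1, 64)                            # initial state
states = []
for word_id in sentence:                          # inputs are fed sequentially
    x = emb(word_id.view(1))                      # x^(n)
    h = cell(x, h)                                # h^(n) from x^(n) and h^(n-1), same parameters each step
    states.append(h)                              # h^(1), h^(2), h^(3), h^(4)
```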

  9. LANGUAGE MODEL Why do we usually make independence assumptions? ➤ Fewer parameters to learn ➤ Less sparsity. Non-neural language model ➤ 1st order Markov chain: p(y_1, ..., y_n) = p(y_1) ∏_{i=2}^{n} p(y_i | y_{i-1}), with |V| × |V| parameters ➤ 2nd order Markov chain: p(y_1, ..., y_n) = p(y_1) p(y_2 | y_1) ∏_{i=3}^{n} p(y_i | y_{i-1}, y_{i-2}), with |V| × |V| × |V| parameters. Multi-layer perceptron language model ➤ No sparsity issue thanks to word embeddings ➤ Independence assumption, so no long range dependencies
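As a worked example of these parameter counts (the vocabulary size below is an assumed, typical order of magnitude, not a number from the slides):

```latex
% With |V| = 10^4 word types:
% 1st order: |V| x |V|       = 10^8  conditional probabilities to estimate
% 2nd order: |V| x |V| x |V| = 10^12 conditional probabilities to estimate
\[
|V| = 10^{4} \;\Rightarrow\; |V|^{2} = 10^{8}, \qquad |V|^{3} = 10^{12}.
\]
```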

  10. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS p(y_1, ..., y_n) = p(y_1, ..., y_{n-1}) p(y_n | y_1, ..., y_{n-1}). No independence assumption!

  11. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS p(y_1, ..., y_n) = p(y_1, ..., y_{n-1}) p(y_n | y_1, ..., y_{n-1}). No independence assumption! (figure: feed <BOS> into the RNN; its first output gives p(y_1))

  12. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS p(y_1, ..., y_n) = p(y_1, ..., y_{n-1}) p(y_n | y_1, ..., y_{n-1}). No independence assumption! (figure: feed <BOS> then "The"; the outputs give p(y_1) and p(y_2 | y_1))

  13. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS p(y_1, ..., y_n) = p(y_1, ..., y_{n-1}) p(y_n | y_1, ..., y_{n-1}). No independence assumption! (figure: feed <BOS>, "The", "dog"; the outputs give p(y_1), p(y_2 | y_1) and p(y_3 | y_1, y_2))

  14. LANGUAGE MODEL WITH RECURRENT NEURAL NETWORKS p(y_1, ..., y_n) = p(y_1, ..., y_{n-1}) p(y_n | y_1, ..., y_{n-1}). No independence assumption! (figure: feed <BOS>, "The", "dog", "is"; the outputs give p(y_1), p(y_2 | y_1), p(y_3 | y_1, y_2) and p(y_4 | y_1, y_2, y_3))
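A minimal sketch of this factorization with an RNN language model, assuming PyTorch (word ids, sizes and the <BOS> index are illustrative assumptions): at each step the network reads the previous word and outputs a distribution over the next one; summing the log-probabilities gives log p(y_1, ..., y_n).

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 32, 64
emb = nn.Embedding(vocab_size, emb_dim)
cell = nn.RNNCell(emb_dim, hidden_dim)
proj = nn.Linear(hidden_dim, vocab_size)

BOS = 0                                            # assumed id of the <BOS> token
sentence = [4, 17, 251, 3]                         # made-up ids for "The dog is eating"

h = torch.zeros(1, hidden_dim)
log_p = 0.0
prev = BOS
for y in sentence:
    h = cell(emb(torch.tensor([prev])), h)         # the state now summarizes y_1 ... y_{i-1}
    dist = torch.log_softmax(proj(h), dim=-1)      # p(y_i | y_1, ..., y_{i-1}): no independence assumption
    log_p = log_p + dist[0, y]
    prev = y
# log_p is log p(y_1, ..., y_n)
```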

  15. SENTENCE CLASSIFICATION Neural architecture 1. A recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. A multi-layer perceptron takes this representation as input and outputs class weights

  16. SENTENCE CLASSIFICATION Neural architecture 1. A recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. A multi-layer perceptron takes this representation as input and outputs class weights (figure: step 1, the RNN reads "The dog is eating" and produces the context sensitive representation z^(1))

  17. SENTENCE CLASSIFICATION Neural architecture 1. A recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. A multi-layer perceptron takes this representation as input and outputs class weights (figure: step 1, the RNN produces z^(1) from "The dog is eating"; step 2, MLP hidden layer z^(2) = σ(U^(1) z^(1) + b^(1)) and output weights w = U^(2) z^(2) + b^(2))
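A minimal sketch of this two-part architecture, assuming PyTorch (class count, sizes and word ids are illustrative): the last RNN state is used as the context sensitive representation z^(1), then a one-hidden-layer MLP produces the class weights.

```python
import torch
import torch.nn as nn

class RNNSentenceClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64, mlp_dim=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.RNNCell(emb_dim, hidden_dim)
        self.mlp_hidden = nn.Linear(hidden_dim, mlp_dim)    # U^(1), b^(1)
        self.mlp_out = nn.Linear(mlp_dim, n_classes)        # U^(2), b^(2)

    def forward(self, word_ids):
        h = torch.zeros(1, self.cell.hidden_size)
        for w in word_ids:                                  # 1. RNN over the whole sentence
            h = self.cell(self.emb(w.view(1)), h)
        z1 = h                                              # context sensitive representation z^(1)
        z2 = torch.sigmoid(self.mlp_hidden(z1))             # z^(2) = sigma(U^(1) z^(1) + b^(1))
        return self.mlp_out(z2)                             # w = U^(2) z^(2) + b^(2): class weights

weights = RNNSentenceClassifier()(torch.tensor([4, 17, 251, 3]))   # made-up word ids
```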

  18. MACHINE TRANSLATION Neural architecture: Encoder-Decoder 1. Encoder: a recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model)

  19. MACHINE TRANSLATION Neural architecture: Encoder-Decoder 1. Encoder: a recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model) (figure: step 1, the encoder reads "The dog is running" and produces the representation z)

  20. MACHINE TRANSLATION Neural architecture: Encoder-Decoder 1. Encoder: a recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model) (figure: step 2, the decoder starts from z and the begin-of-sentence token <BOS>, and generates "le")

  21. MACHINE TRANSLATION Neural architecture: Encoder-Decoder 1. Encoder: a recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model) (figure: the decoder, fed <BOS> then "le", generates "le chien")

  22. MACHINE TRANSLATION Neural architecture: Encoder-Decoder 1. Encoder: a recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model) (figure: the decoder, fed <BOS>, "le", "chien", generates "le chien court")

  23. MACHINE TRANSLATION Neural architecture: Encoder-Decoder 1. Encoder: a recurrent neural network (RNN) computes a context sensitive representation of the sentence 2. Decoder: a different recurrent neural network (RNN) computes the translation, word after word (a conditional language model). Stop the translation when the end of sentence token is generated (figure: the decoder, fed <BOS>, "le", "chien", "court", generates "le chien court <EOS>")
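A minimal sketch of greedy decoding with such an encoder-decoder, assuming PyTorch (vocabulary sizes, dimensions and the special-token ids are illustrative assumptions): the encoder's final state initializes the decoder, which then generates word after word until it emits <EOS>.

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, dim = 1000, 1200, 64
src_emb, tgt_emb = nn.Embedding(src_vocab, dim), nn.Embedding(tgt_vocab, dim)
encoder = nn.RNNCell(dim, dim)                      # encoder RNN
decoder = nn.RNNCell(dim, dim)                      # a different RNN for the decoder
proj = nn.Linear(dim, tgt_vocab)
BOS, EOS = 0, 1                                     # assumed special-token ids

def translate(src_ids, max_len=20):
    # 1. Encoder: context sensitive representation z of the source sentence
    z = torch.zeros(1, dim)
    for w in src_ids:
        z = encoder(src_emb(w.view(1)), z)
    # 2. Decoder: conditional language model, one word at a time
    out, prev, h = [], BOS, z
    for _ in range(max_len):
        h = decoder(tgt_emb(torch.tensor([prev])), h)
        prev = torch.argmax(proj(h), dim=-1).item() # greedy choice of the next word
        if prev == EOS:                             # stop when <EOS> is generated
            break
        out.append(prev)
    return out

translation = translate(torch.tensor([4, 17, 9, 33]))   # made-up ids for "The dog is running"
```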

  24. SIMPLE RECURRENT NEURAL NETWORK

  25. MULTI-LAYER PERCEPTRON RECURRENT NETWORK Multi-layer perceptron cell ➤ Input: the current word and the previous output ➤ Output: the hidden representation. The recurrent connection is just the output at each position (figure: h^(1), h^(2), h^(3), h^(4) over "The dog is eating"): h^(n) = tanh(U [h^(n-1); x^(n)] + b)
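A minimal sketch of this cell written out explicitly, assuming PyTorch (sizes are illustrative): the previous output and the current word embedding are concatenated, then passed through a single affine layer followed by tanh.

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """h^(n) = tanh(U [h^(n-1); x^(n)] + b)"""
    def __init__(self, input_dim=32, hidden_dim=64):
        super().__init__()
        self.affine = nn.Linear(hidden_dim + input_dim, hidden_dim)  # U and b

    def forward(self, h_prev, x):
        # concatenate the previous output and the current input, then apply tanh(U [.] + b)
        return torch.tanh(self.affine(torch.cat([h_prev, x], dim=-1)))

cell = SimpleRNNCell()
h = cell(torch.zeros(1, 64), torch.randn(1, 32))    # one step: h^(1) from h^(0) and x^(1)
```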

  26. GRADIENT BASED LEARNING PROBLEM Does it work? ➤ In theory: yes ➤ In practice: no, gradient based learning of RNNs fails to learn long range dependencies! (figure: "The dog, I was told by my friend, is ..."; it is difficult to propagate the influence of h^(1), h^(2) all the way to h^(11))

  27. GRADIENT BASED LEARNING PROBLEM Does it work? ➤ In theory: yes ➤ In practice: no, gradient based learning of RNNs fails to learn long range dependencies! (figure: "The dog, I was told by my friend, is ..."; it is difficult to propagate the influence of h^(1), h^(2) all the way to h^(11)) Deep learning is not a « single tool fits all problems » solution ➤ You need to understand your data and prediction task ➤ You need to understand why a given neural architecture may fail for a given task ➤ You need to be able to design tailored neural architectures for a given task

  28. LONG SHORT-TERM MEMORY NETWORKS

  29. LONG SHORT-TERM MEMORY NETWORKS (LSTM) Intuition: a memory vector c ➤ The memory vector is passed along the sequence ➤ At each time step, the network selects which cells of the memory to modify. The network can learn to keep track of long distance relationships. LSTM cell ➤ The recurrent connection passes the memory vector to the next cell (figure: the cell takes the input x and the incoming (h, c), and outputs h and the updated (h, c))
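A minimal sketch of passing the memory vector along a sequence, assuming PyTorch's built-in LSTM cell (sizes and word ids are illustrative): the recurrent connection carries both the output h and the memory c.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 32)
lstm = nn.LSTMCell(input_size=32, hidden_size=64)

sentence = torch.tensor([4, 17, 251, 3])            # made-up word ids
h, c = torch.zeros(1, 64), torch.zeros(1, 64)       # output state and memory vector
for w in sentence:
    h, c = lstm(emb(w.view(1)), (h, c))             # the recurrent connection passes (h, c) to the next cell
```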

  30. ERASING/WRITING VALUES IN A VECTOR Erasing values in the memory (figure): the memory [3.02, -4.11, 21.00, 4.44, -6.9] becomes [0, 0, 21.00, 4.44, -6.9] ⇒ « forget » the first two cells
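A minimal sketch of this erasing step, assuming PyTorch and the memory values shown on the slide: multiplying the memory elementwise by a forget mask with zeros in the first two cells reproduces the example.

```python
import torch

c = torch.tensor([3.02, -4.11, 21.00, 4.44, -6.9])  # memory vector from the slide
f = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0])         # forget mask: 0 erases, 1 keeps (binary here for
                                                     # illustration; in an LSTM it is a sigmoid output in (0, 1))
c_new = f * c                                        # elementwise product
# c_new -> [0, 0, 21.00, 4.44, -6.9]: the first two cells are forgotten
```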
