deep learning for natural language processing


  1. deep learning for natural language processing. Sergey I. Nikolenko (1,2), AINL FRUCT 2016, St. Petersburg, November 10, 2016. (1) Laboratory for Internet Studies, NRU Higher School of Economics, St. Petersburg; (2) Steklov Institute of Mathematics at St. Petersburg. Random facts: November 10 is the UNESCO World Science Day for Peace and Development; on November 10, 1871, Henry Morton Stanley correctly presumed he had finally found Dr. Livingstone.

  2. plan • The deep learning revolution has not left natural language processing alone. • DL in NLP started with standard architectures (RNNs, CNNs) but has since branched out into new directions. • Our plan for today: (1) intro to distributed word representations; (2) a primer on sentence embeddings and character-level models; (3) a (very, very) brief overview of the most promising directions in modern NLP based on deep learning. • We will concentrate on directions that have given rise to new models and architectures.

  3.–8. basic nn architectures • Basic neural network architectures that have been adapted for deep learning over the last decade: • feedforward NNs are the basic building block; • autoencoders map a (possibly distorted) input to itself, usually for feature engineering; • convolutional NNs apply NNs with shared weights to certain windows in the previous layer (or input), collecting first local and then more and more global features; • recurrent NNs have a hidden state and propagate it further, used for sequence learning; • in particular, LSTM (long short-term memory) and GRU (gated recurrent unit) units are important RNN architectures often used for NLP, good for longer dependencies; • main idea: in an LSTM the cell state is updated as $c_t = f_t \odot c_{t-1} + \ldots$, so unless the LSTM actually wants to forget something, $\partial c_t / \partial c_{t-1} = 1 + \ldots$, and the gradients do not vanish (see the sketch below). • Deep learning refers to stacking several layers; any network mentioned above can be deep or shallow, usually in several different ways.
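
To make the gradient argument concrete, here is a minimal NumPy sketch of one LSTM step (not from the slides; the parameter names and shapes are illustrative). The additive cell-state update means the derivative of c_t with respect to c_{t-1} is just the forget gate f_t, so while the gate stays near 1 the gradient passes through largely undiminished.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM step; illustrative parameter names, not from the slides."""
    W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o = params
    z = np.concatenate([h_prev, x_t])      # stacked input for all gates
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_hat = np.tanh(W_c @ z + b_c)         # candidate cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    c_t = f_t * c_prev + i_t * c_hat       # additive update: dc_t/dc_{t-1} = f_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```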

  9. word embeddings, sentence embeddings, and character-level models

  10. word embeddings • Distributional hypothesis in linguistics: words with similar meaning will occur in similar contexts. • Distributed word representations map words to a Euclidean space (usually of dimension several hundred): • started in earnest in (Bengio et al. 2003; 2006), although there were earlier ideas; • word2vec (Mikolov et al. 2013): train weights that serve best for simple prediction tasks between a word and its context: continuous bag-of-words (CBOW) and skip-gram; • GloVe (Pennington et al. 2014): train word weights to decompose the (log) co-occurrence matrix (see the sketch below).
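
As an illustration of the GloVe idea (a toy sketch, not the reference implementation): its weighted least-squares objective over a co-occurrence matrix X can be written in a few lines of NumPy. The weighting function follows Pennington et al. (2014); the variable names and shapes are assumptions of this sketch.

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100, alpha=0.75):
    """Toy GloVe objective: sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2.

    X: (V, V) co-occurrence counts; W, W_tilde: (V, d) word/context vectors;
    b, b_tilde: (V,) biases.  Illustrative only.
    """
    loss = 0.0
    rows, cols = np.nonzero(X)                        # only observed co-occurrences
    for i, j in zip(rows, cols):
        weight = min(1.0, (X[i, j] / x_max) ** alpha)  # clipped weighting f(X_ij)
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += weight * diff ** 2
    return loss
```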

  11. word embeddings • Difference between skip-gram and CBOW architectures: • the CBOW model predicts a word from its local context; • the skip-gram model predicts context words from the current word.

  12. word embeddings • The CBOW word2vec model operates as follows: • inputs are one-hot word representations of dimension $V$; • the hidden layer is the matrix of vector embeddings $W$; • the hidden layer's output is the average of the input vectors; • as output we get an estimate $u_j$ for each word, and the posterior is a simple softmax: $p(i \mid c_1, \ldots, c_n) = \frac{\exp(u_i)}{\sum_{j'=1}^{V} \exp(u_{j'})}$. • In skip-gram, it's the opposite: • we predict each context word from the central word; • so now there are several multinomial distributions, one softmax for each context word: $p(c_k \mid i) = \frac{\exp(u_{c_k})}{\sum_{j'=1}^{V} \exp(u_{j'})}$.
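
A minimal NumPy sketch of this CBOW forward pass (dimensions, initialization, and names are illustrative assumptions): average the context-word embeddings, score every vocabulary word, and take a softmax.

```python
import numpy as np

V, d = 10000, 300                      # vocabulary size and embedding dimension
W_in = np.random.randn(V, d) * 0.01    # embedding matrix (the "hidden layer")
W_out = np.random.randn(d, V) * 0.01   # output weights, one column per word

def cbow_posterior(context_ids):
    """p(i | c_1, ..., c_n): softmax over scores of the averaged context vector."""
    h = W_in[context_ids].mean(axis=0)   # average of the context word embeddings
    u = h @ W_out                        # score u_j for every word j
    u -= u.max()                         # numerical stability
    return np.exp(u) / np.exp(u).sum()   # softmax posterior over the vocabulary

probs = cbow_posterior([3, 17, 42, 8])   # toy context of four word ids
```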

  13. word embeddings • How do we train a model like that? • E.g., in skip-gram we choose $\theta$ to maximize $$L(\theta) = \prod_{i} \prod_{c \in C(i)} p(c \mid i; \theta) = \prod_{(i,c) \in D} p(c \mid i; \theta),$$ and we parameterize $$p(c \mid i; \theta) = \frac{\exp(\tilde{w}_c^\top w_i)}{\sum_{c'} \exp(\tilde{w}_{c'}^\top w_i)}.$$

  14. word embeddings • This leads to the total likelihood $$\arg\max_\theta \prod_{(i,c) \in D} p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \log p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \left( \tilde{w}_c^\top w_i - \log \sum_{c'} \exp(\tilde{w}_{c'}^\top w_i) \right),$$ which we maximize with negative sampling. • Question: why do we need separate $w$ and $\tilde{w}$ vectors? • Live demo: nearest neighbors, simple geometric relations (see the gensim sketch below).
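
The live demo can be reproduced with the gensim library, which trains word2vec with negative sampling by default. A hedged sketch, assuming gensim >= 4.0 (where the argument is vector_size) and a tokenized corpus `sentences`; the query words are only examples and must occur in the training vocabulary.

```python
from gensim.models import Word2Vec

# sentences: an iterable of tokenized sentences, e.g. [["deep", "learning", ...], ...]
model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimension
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples
    min_count=5,
)

# nearest neighbors in the embedding space (query word assumed to be in the vocabulary)
print(model.wv.most_similar("petersburg", topn=5))

# simple geometric relations: king - man + woman ~ queen
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```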

  15. how to use word vectors • Next we can use recurrent architectures on top of word vectors. • E.g., LSTMs for sentiment analysis (see the sketch below): • train a network of LSTMs for language modeling, then use either the last output or the averaged hidden states for sentiment. • We will see a lot of other architectures later.
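
A minimal Keras sketch of the kind of classifier meant here (an illustrative architecture, not the one from the slides): an embedding layer on top of word ids, an LSTM whose final hidden state summarizes the sentence, and a sigmoid output for binary sentiment.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim = 20000, 300       # illustrative sizes

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, embed_dim),  # word vectors, learned jointly
    layers.LSTM(128),                         # final hidden state summarizes the sentence
    layers.Dense(1, activation="sigmoid"),    # binary sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, ...)  # x_train: padded sequences of word ids
```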

  16. up and down from word embeddings • Word embeddings are the first step of most DL models in NLP. • But we can go both up and down from word embeddings. • First, a sentence is not necessarily the sum of its words. • Second, a word is not quite as atomic as the word2vec model would like to think.

  17. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • The simplest idea is to use the sum and/or mean of word embeddings to represent a sentence/paragraph (see the sketch below): • a baseline in (Le and Mikolov 2014); • a reasonable method for short phrases in (Mikolov et al. 2013); • shown to be effective for document summarization in (Kageback et al. 2014).
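
The baseline from the first bullet is literally a mean over word vectors; a tiny NumPy sketch, assuming a dict `word_vectors` mapping tokens to vectors (names are illustrative):

```python
import numpy as np

def sentence_embedding(tokens, word_vectors, dim=300):
    """Mean of the word vectors of known tokens; zero vector if none are known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```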

  18. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and Mikolov 2014): • a sentence/paragraph vector is an additional input vector trained for each paragraph; • it acts as a “memory” that provides longer context.

  19. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • Distributed Bag of Words Model of Paragraph Vectors (PV-DBOW) (Le and Mikolov 2014): • the model is forced to predict words randomly sampled from a specific paragraph; • the paragraph vector is trained to help predict words from the same paragraph in a small window (a gensim sketch of both models follows below).
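
Both paragraph-vector models are available in gensim's Doc2Vec. A hedged sketch, assuming gensim >= 4.0 and a list `texts` of tokenized paragraphs; dm=1 selects PV-DM and dm=0 selects PV-DBOW.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# texts: a list of tokenized paragraphs, e.g. [["deep", "learning", ...], ...]
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(texts)]

pv_dm = Doc2Vec(corpus, vector_size=100, window=5, dm=1, epochs=20)    # PV-DM
pv_dbow = Doc2Vec(corpus, vector_size=100, window=5, dm=0, epochs=20)  # PV-DBOW

vec = pv_dm.infer_vector(["a", "new", "unseen", "paragraph"])  # embed new text
similar = pv_dm.dv.most_similar([vec], topn=3)                 # nearest training paragraphs
```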

  20. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • A number of convolutional architectures (Ma et al., 2015; Kalchbrenner et al., 2014). • (Kiros et al. 2015): skip-thought vectors capture the meaning of a sentence by applying the skip-gram idea at the sentence level, training an encoder to predict the surrounding sentences. • (Djuric et al. 2015): model large text streams with hierarchical neural language models with a document level and a token level.

  21. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • Recursive neural networks (Socher et al., 2012): • a neural network composes a chunk of text with another part in a tree; • works its way up from word vectors to the root of a parse tree.

  22. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • Recursive neural networks (Socher et al., 2012): • by training this in a supervised way, one can get a very effective approach to sentiment analysis (Socher et al. 2013); see the sketch below.
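
A minimal NumPy sketch of the recursive composition (illustrative: a single shared composition matrix and a toy softmax sentiment classifier applied at the root, whereas Socher et al. (2013) supervise every node of the tree; all names and dimensions are assumptions of this sketch).

```python
import numpy as np

d, n_classes = 50, 5
W = np.random.randn(d, 2 * d) * 0.01        # shared composition weights
b = np.zeros(d)
W_s = np.random.randn(n_classes, d) * 0.01  # sentiment softmax weights

def compose(left, right):
    """Parent vector from two child vectors (same weights throughout the tree)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def tree_vector(node, word_vectors):
    """node is either a token (leaf) or a (left, right) pair of subtrees."""
    if isinstance(node, str):
        return word_vectors[node]
    left, right = node
    return compose(tree_vector(left, word_vectors),
                   tree_vector(right, word_vectors))

def sentiment(node, word_vectors):
    scores = W_s @ tree_vector(node, word_vectors)
    e = np.exp(scores - scores.max())
    return e / e.sum()

# toy parse tree for "not very good": ("not", ("very", "good"))
# sentiment(("not", ("very", "good")), word_vectors)
```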

  23. sentence embeddings • How do we combine word vectors into “text chunk” vectors? • A similar effect can be achieved with CNNs. • The Unfolding Recursive Auto-Encoder model (URAE) (Socher et al., 2011) collapses all word embeddings into a single vector following the parse tree and then reconstructs the original sentence from it; applied to paraphrasing and paraphrase detection.

  24. deep recursive networks • Deep recursive networks for sentiment analysis (Irsoy, Cardie, 2014). • First idea: decouple leaves and internal nodes. • In recursive networks, we apply the same weights throughout the tree: $x_v = f(W_L x_{l(v)} + W_R x_{r(v)} + b)$. • Now, we use different matrices for leaves (input words) and hidden nodes (see the sketch below): • we can now have fewer hidden units than the word vector dimension; • we can use ReLU: sparse inputs and dense hidden units do not cause a discrepancy.
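
The decoupling can be sketched as a small modification of the composition step above (illustrative NumPy with made-up dimensions): leaf (word) vectors go through their own matrix W_leaf, internal nodes use W_L and W_R, the hidden dimension h can be smaller than the word-vector dimension d, and ReLU is used throughout.

```python
import numpy as np

d, h = 300, 100                          # word vectors wider than hidden units
W_leaf = np.random.randn(h, d) * 0.01    # applied to leaf (word) vectors only
W_L = np.random.randn(h, h) * 0.01       # left-child weights for internal nodes
W_R = np.random.randn(h, h) * 0.01       # right-child weights for internal nodes
b = np.zeros(h)
relu = lambda z: np.maximum(0, z)

def node_vector(node, word_vectors):
    """node is a token (leaf) or a (left, right) pair; returns an h-dim vector."""
    if isinstance(node, str):
        return relu(W_leaf @ word_vectors[node])   # leaves: project words to h dims
    left, right = node
    x_l = node_vector(left, word_vectors)
    x_r = node_vector(right, word_vectors)
    return relu(W_L @ x_l + W_R @ x_r + b)         # internal: x_v = f(W_L x_l + W_R x_r + b)
```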
