 
deep learning for natural language processing
Sergey I. Nikolenko 1,2
AINL FRUCT 2016, St. Petersburg, November 10, 2016
1 Laboratory for Internet Studies, NRU Higher School of Economics, St. Petersburg
2 Steklov Institute of Mathematics at St. Petersburg
Random facts: November 10 is the UNESCO World Science Day for Peace and Development; on November 10, 1871, Henry Morton Stanley correctly presumed he had finally found Dr. Livingstone.
plan
• The deep learning revolution has not left natural language processing alone.
• DL in NLP started with standard architectures (RNN, CNN) but has since branched out into new directions.
• Our plan for today:
  (1) intro to distributed word representations;
  (2) a primer on sentence embeddings and character-level models;
  (3) a ((very-)very) brief overview of the most promising directions in modern NLP based on deep learning.
• We will concentrate on directions that have given rise to new models and architectures.
basic nn architectures
• Basic neural network architectures that have been adapted for deep learning over the last decade:
  • feedforward NNs are the basic building block;
  • autoencoders map a (possibly distorted) input to itself, usually for feature engineering;
  • convolutional NNs apply NNs with shared weights to certain windows in the previous layer (or input), collecting first local and then more and more global features;
  • recurrent NNs have a hidden state and propagate it further, used for sequence learning;
  • in particular, LSTM (long short-term memory) and GRU (gated recurrent unit) units are an important RNN architecture often used for NLP, good for longer dependencies;
  • main idea: in an LSTM, $c_t = f_t \odot c_{t-1} + \ldots$, so unless the LSTM actually wants to forget something, $\frac{\partial c_t}{\partial c_{t-1}} = 1 + \ldots$, and the gradients do not vanish.
• Deep learning refers to networks with several layers; any network mentioned above can be deep or shallow, usually in several different ways.
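The vanishing-gradient remark about the LSTM cell state can be checked numerically. The following is a minimal sketch, assuming we look only at the direct cell-state path, where the derivative of $c_T$ with respect to $c_0$ is the elementwise product of the forget gates. The function name and the toy gate values are illustrative and not taken from the talk.

```python
import numpy as np

def cell_path_gradient(forget_gates):
    """Gradient d c_T / d c_0 along the direct cell-state path.

    Since c_t = f_t * c_{t-1} + ..., this path contributes the elementwise
    product of the forget gates f_1, ..., f_T (all other paths are ignored).
    forget_gates: array of shape (T, d) with values in (0, 1).
    """
    return np.prod(np.asarray(forget_gates, dtype=float), axis=0)

T, d = 100, 3
open_gates = np.full((T, d), 0.999)  # the LSTM "remembers": f_t close to 1
half_gates = np.full((T, d), 0.5)    # the LSTM forgets half the state each step

grad_open = cell_path_gradient(open_gates)  # stays of order 1 after 100 steps
grad_half = cell_path_gradient(half_gates)  # shrinks geometrically
```

With gates close to 1 the product stays near 1 over a hundred steps, while gates of 0.5 shrink it geometrically, which is the sense in which the cell state keeps gradients alive unless the network chooses to forget.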
word embeddings, sentence embeddings, and character-level models
word embeddings
• Distributional hypothesis in linguistics: words with similar meaning will occur in similar contexts.
• Distributed word representations map words to a Euclidean space (usually of dimension several hundred):
  • started in earnest in (Bengio et al. 2003; 2006), although there were earlier ideas;
  • word2vec (Mikolov et al. 2013): train weights that serve best for simple prediction tasks between a word and its context: continuous bag-of-words (CBOW) and skip-gram;
  • GloVe (Pennington et al. 2014): train word weights to decompose the (log) cooccurrence matrix.
• Difference between skip-gram and CBOW architectures:
  • CBOW model predicts a word from its local context;
  • skip-gram model predicts context words from the current word.
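The difference between the two prediction tasks is easiest to see in the training examples they produce from the same text. This is a small sketch assuming a symmetric window of fixed size; the helper names and the toy sentence are hypothetical.

```python
def context_window(tokens, pos, window):
    """Return the words within `window` positions of tokens[pos], excluding it."""
    left = tokens[max(0, pos - window):pos]
    right = tokens[pos + 1:pos + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: one (context words -> central word) example per position."""
    return [(context_window(tokens, i, window), w) for i, w in enumerate(tokens)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: one (central word -> context word) pair per context word."""
    return [(w, c)
            for i, w in enumerate(tokens)
            for c in context_window(tokens, i, window)]

sentence = "the cat sat on the mat".split()
cbow = cbow_examples(sentence, window=1)
skip = skipgram_examples(sentence, window=1)
```

CBOW yields one example per position with the whole context as input, whereas skip-gram yields one pair per (central word, context word) combination, so it produces more, smaller examples from the same sentence.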
• The CBOW word2vec model operates as follows:
  • inputs are one-hot word representations of dimension $V$;
  • the hidden layer is the matrix of vector embeddings $W$;
  • the hidden layer’s output is the average of input vectors;
  • as output we get an estimate $u_j$ for each word $j$, and the posterior of the central word $i$ is a simple softmax:
    $$\hat{p}(i \mid c_1, \ldots, c_n) = \frac{\exp(u_i)}{\sum_{j'=1}^{V} \exp(u_{j'})}.$$
• In skip-gram, it’s the opposite:
  • we predict each context word from the central word;
  • so now there are several multinomial distributions, one softmax for each context word:
    $$\hat{p}(c_k \mid i) = \frac{\exp(u_{k c_k})}{\sum_{j'=1}^{V} \exp(u_{k j'})}.$$
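The CBOW forward pass described here fits in a few lines of NumPy. This is a sketch under stated assumptions: `W_in` plays the role of the embedding matrix, `W_out` is an assumed second weight matrix that turns the averaged hidden vector into one score per word, and the sizes and random weights are toy values.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a score vector."""
    e = np.exp(u - np.max(u))
    return e / e.sum()

def cbow_forward(context_ids, W_in, W_out):
    """Posterior over the vocabulary for the central word.

    context_ids: indices of the context words c_1, ..., c_n
    W_in:  (V, d) embedding matrix; row j is the vector of word j
    W_out: (d, V) output weights (assumed here) producing the scores u_j
    """
    # Selecting rows of W_in is the same as multiplying one-hot inputs by W_in.
    h = W_in[context_ids].mean(axis=0)  # average of the context embeddings
    u = h @ W_out                       # one score u_j per vocabulary word
    return softmax(u)

rng = np.random.default_rng(0)
V, d = 10, 4                            # toy vocabulary size and dimension
W_in = rng.normal(size=(V, d))
W_out = rng.normal(size=(d, V))
p = cbow_forward([1, 3, 4, 6], W_in, W_out)
```

Indexing rows of `W_in` avoids materializing the one-hot vectors, which is how the lookup is usually done in practice.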
word embeddings
• How do we train a model like that?
• E.g., in skip-gram we choose $\theta$ to maximize
  $$L(\theta) = \prod_{i \in D} \left( \prod_{c \in C(i)} p(c \mid i; \theta) \right) = \prod_{(i,c) \in D} p(c \mid i; \theta),$$
  and we parameterize
  $$p(c \mid i; \theta) = \frac{\exp(\tilde{w}_c^\top w_i)}{\sum_{c'} \exp(\tilde{w}_{c'}^\top w_i)}.$$
• This leads to the total likelihood
  $$\arg\max_\theta \prod_{(i,c) \in D} p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \log p(c \mid i; \theta) = \arg\max_\theta \sum_{(i,c) \in D} \left( \tilde{w}_c^\top w_i - \log \sum_{c'} \exp(\tilde{w}_{c'}^\top w_i) \right),$$
  which we maximize with negative sampling.
• Question: why do we need separate $\tilde{w}$ and $w$ vectors?
• Live demo: nearest neighbors, simple geometric relations.
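The log-sum over all context words is expensive to compute for a large vocabulary. Negative sampling, as introduced for word2vec by Mikolov et al. (2013), replaces it with a binary discrimination between the observed context word and a few sampled noise words. The sketch below shows the per-pair objective only; the function names are hypothetical, and the noise distribution and the gradient updates are left out.

```python
import numpy as np

def log_sigmoid(x):
    """Stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def sgns_objective(w_i, w_ctx, w_negs):
    """Negative-sampling objective for one (word, context) pair.

    w_i:    (d,) vector of the central word
    w_ctx:  (d,) context ("tilde") vector of the observed context word
    w_negs: (k, d) context vectors of k sampled negative words
    Returns log s(w_ctx . w_i) + sum_n log s(-w_neg_n . w_i), to be maximized.
    """
    positive = log_sigmoid(w_ctx @ w_i)
    negative = log_sigmoid(-(w_negs @ w_i)).sum()
    return positive + negative

w_i = np.array([1.0, 0.0])
w_ctx = np.array([2.0, 0.0])                    # aligned with w_i
w_negs = np.array([[-2.0, 0.0], [-1.0, 0.0]])   # pointing away from w_i
good = sgns_objective(w_i, w_ctx, w_negs)
bad = sgns_objective(w_i, -w_ctx, -w_negs)      # same vectors with flipped signs
```

The objective is higher when the observed context vector aligns with the word vector and the sampled negatives point away from it, as in `good` versus `bad`.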
how to use word vectors
• Next we can use recurrent architectures on top of word vectors.
• E.g., LSTMs for sentiment analysis:
  • train a network of LSTMs for language modeling, then use either the last output or averaged hidden states for sentiment.
• We will see a lot of other architectures later.
up and down from word embeddings
• Word embeddings are the first step of most DL models in NLP.
• But we can go both up and down from word embeddings.
• First, a sentence is not necessarily the sum of its words.
• Second, a word is not quite as atomic as the word2vec model would like to think.
sentence embeddings
• How do we combine word vectors into “text chunk” vectors?
• The simplest idea is to use the sum and/or mean of word embeddings to represent a sentence/paragraph:
  • a baseline in (Le and Mikolov 2014);
  • a reasonable method for short phrases in (Mikolov et al. 2013);
  • shown to be effective for document summarization in (Kågebäck et al. 2014).
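The mean-of-embeddings baseline takes only a few lines. This sketch assumes embeddings are available as a plain dictionary from word to vector and that unknown words are simply skipped; the toy vectors below are made up for illustration.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Mean of the vectors of known words; zero vector if no word is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Hypothetical 2-dimensional embeddings, for illustration only.
emb = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}
v = sentence_vector("a good movie".split(), emb, dim=2)  # "a" is unknown and skipped
```

Because the mean ignores word order, "good, not bad" and "bad, not good" get the same vector, which is one reason the more structured models on this slide exist.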
• Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and Mikolov 2014):
  • a sentence/paragraph vector is an additional vector for each paragraph;
  • acts as a “memory” to provide longer context.
• Distributed Bag of Words Model of Paragraph Vectors (PV-DBOW) (Le and Mikolov 2014):
  • the model is forced to predict words randomly sampled from a specific paragraph;
  • the paragraph vector is trained to help predict words from the same paragraph in a small window.
• A number of convolutional architectures (Ma et al. 2015; Kalchbrenner et al. 2014).
• (Kiros et al. 2015): skip-thought vectors capture the meaning of a sentence by training from skip-grams constructed on sentences.
• (Djuric et al. 2015): model large text streams with hierarchical neural language models with a document level and a token level.
• Recursive neural networks (Socher et al. 2012):
  • a neural network composes a chunk of text with another part in a tree;
  • works its way up from word vectors to the root of a parse tree;
  • by training this in a supervised way, one can get a very effective approach to sentiment analysis (Socher et al. 2013).
• A similar effect can be achieved with CNNs.
• Unfolding Recursive Auto-Encoder model (URAE) (Socher et al. 2011) collapses all word embeddings into a single vector following the parse tree and then reconstructs the original sentence; applied to paraphrasing and paraphrase detection.
deep recursive networks
• Deep recursive networks for sentiment analysis (Irsoy and Cardie 2014).
• First idea: decouple leaves and internal nodes.
• In recursive networks, we apply the same weights throughout the tree: $x_v = f(W_L x_{l(v)} + W_R x_{r(v)} + b)$.
• Now, we use different matrices for leaves (input words) and hidden nodes:
  • we can now have fewer hidden units than the word vector dimension;
  • we can use ReLU: sparse inputs and dense hidden units do not cause a discrepancy.
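The decoupling of leaves and internal nodes can be sketched as a recursive composition over a binary parse tree. The sketch makes several assumptions not stated on the slide: trees are nested `(left, right)` tuples with words as leaves, and the parameter names (`Wxh_L`, `Wxh_R`, `Whh_L`, `Whh_R`, `b`) and random weights are hypothetical. It shows a single recursive layer only, not the stacked deep variant.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def is_leaf(node):
    return isinstance(node, str)

def compose(node, emb, P):
    """Vector for a binary parse tree given as nested (left, right) tuples.

    A leaf returns its word vector (dimension d_word); an internal node
    returns a hidden vector (dimension d_hid), so a tree that is a single
    word is returned unchanged.
    """
    if is_leaf(node):
        return emb[node]
    left, right = node
    # Leaf children use word-to-hidden weights, internal children use
    # hidden-to-hidden weights, so d_hid may differ from d_word.
    W_left = P["Wxh_L"] if is_leaf(left) else P["Whh_L"]
    W_right = P["Wxh_R"] if is_leaf(right) else P["Whh_R"]
    z = W_left @ compose(left, emb, P) + W_right @ compose(right, emb, P) + P["b"]
    return relu(z)

rng = np.random.default_rng(0)
d_word, d_hid = 5, 3  # fewer hidden units than word-vector dimensions
emb = {w: rng.normal(size=d_word) for w in ["very", "good", "movie"]}
P = {
    "Wxh_L": rng.normal(size=(d_hid, d_word)),
    "Wxh_R": rng.normal(size=(d_hid, d_word)),
    "Whh_L": rng.normal(size=(d_hid, d_hid)),
    "Whh_R": rng.normal(size=(d_hid, d_hid)),
    "b": np.zeros(d_hid),
}
root = compose((("very", "good"), "movie"), emb, P)
```

With a single shared pair of matrices, the hidden size would have to equal the word-vector size; choosing the matrix by child type is what lets the two dimensions differ here.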