  1. IN4080 – 2020 FALL. NATURAL LANGUAGE PROCESSING. Jan Tore Lønning

  2. Neural networks, Language models, word2vec. Lecture 6, 21 Sept

  3. Today
     - Neural networks
     - Language models
     - Word embeddings
     - Word2vec

  4. Artificial neural networks
     - Inspired by the brain: neurons, synapses
     - Does not pretend to be a model of the brain
     - The simplest model is the feed-forward network, also called the multi-layer perceptron

  5. Linear regression as a network
     - Each feature $x_i$ of the input is an input node
     - An additional bias node $x_0 = 1$ for the intercept
     - A weight at each edge
     - Multiply the input values with the respective weights: $w_i x_i$
     - Sum them: $\hat{y} = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
     [Figure: input nodes x1-x3 and a bias node 1, weights w0-w3 on the edges, a summation node Σ producing the prediction ŷ, and the target value y.]
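
A minimal numpy sketch of this forward computation; the weights and input values below are made up for illustration and are not taken from the slides:

    import numpy as np

    # Hypothetical weights and a single input vector; w[0] is the intercept
    # weight, matched by a constant bias input x[0] = 1.
    w = np.array([0.5, 2.0, -1.0, 0.3])   # w0 (bias), w1, w2, w3
    x = np.array([1.0, 1.2, 0.7, 3.0])    # x0 = 1 (bias node), x1, x2, x3

    y_hat = np.dot(w, x)                  # y_hat = sum_i w_i * x_i = w . x
    print(y_hat)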

  6. Gradient descent (for linear regression)
     - We start with an initial set of weights
     - Consider training examples
     - Adjust the weights to reduce the loss
     - How? Gradient descent
     - Gradient means partial derivatives
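
To make the idea concrete before the linear-regression case, here is a tiny sketch of gradient descent on a hypothetical one-variable loss L(w) = (w - 3)^2; the loss, starting point and learning rate are arbitrary choices for illustration:

    # Minimal gradient descent on a toy one-variable loss L(w) = (w - 3)**2,
    # whose derivative is 2*(w - 3).
    w = 0.0                 # initial weight
    eta = 0.1               # learning rate
    for step in range(50):
        grad = 2 * (w - 3)  # derivative of the loss at the current w
        w = w - eta * grad  # step against the gradient
    print(w)                # approaches the minimum at w = 3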

  7. Linear regression: higher dimensions
     - Linear regression of more than two variables works similarly
     - We try to fit the best (hyper-)plane: $\hat{y} = f(x_0, x_1, \dots, x_m) = \sum_{i=0}^{m} w_i x_i = \vec{w} \cdot \vec{x}$
     - We can use the same mean squared error: $\frac{1}{n}\sum_{j=1}^{n}(y_j - \hat{y}_j)^2$

  8. Partial derivatives
     - A function of more than one variable, e.g. $f(x, y)$
     - The partial derivative, e.g. $\frac{\partial f}{\partial x}$, is the derivative one gets by keeping the other variables constant
     - E.g. if $f(x, y) = ax + by + c$, then $\frac{\partial f}{\partial x} = a$ and $\frac{\partial f}{\partial y} = b$
     https://www.wikihow.com/Image:OyXsh.png
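
The example on this slide can be checked symbolically with sympy; this is an illustration, not part of the original slides:

    import sympy as sp

    # Symbolic check of the example: f(x, y) = a*x + b*y + c.
    x, y, a, b, c = sp.symbols('x y a b c')
    f = a*x + b*y + c
    print(sp.diff(f, x))   # a  (partial derivative w.r.t. x)
    print(sp.diff(f, y))   # b  (partial derivative w.r.t. y)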

  9. Gradient descent
     - We move in the opposite direction of where the gradient is pointing
     - Intuitively:
       - Take small steps in all directions parallel to the (feature) axes
       - The length of each step is proportional to the steepness in that direction

  10. Properties of the derivatives
     1. If $f(x) = ax + b$ then $f'(x) = a$
        - we also write $\frac{df}{dx} = a$
        - and if $y = f(x)$, we can write $\frac{dy}{dx} = a$
     2. If $f(x) = x^n$ for an integer $n \neq 0$ then $f'(x) = n x^{n-1}$
     3. If $f(x) = g(y)$ and $y = h(x)$ then $f'(x) = g'(y)\,h'(x)$
        - if $z = f(x) = g(y)$, this can be written $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
        - In particular, if $f(x) = (ax + b)^2$ then $f'(x) = 2(ax + b)\,a$
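
A quick symbolic check of the last example with sympy, confirming that the chain-rule result on the slide and the direct derivative expand to the same expression (illustration only):

    import sympy as sp

    # Chain rule check for f(x) = (a*x + b)**2.
    x, a, b = sp.symbols('x a b')
    f = (a*x + b)**2
    print(sp.expand(sp.diff(f, x)))   # 2*a**2*x + 2*a*b
    print(sp.expand(2*(a*x + b)*a))   # same expression, as on the slide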

  11. Gradient descent (for linear regression)
     - Loss: mean squared error: $L(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n}\sum_{j=1}^{n}(\hat{y}_j - y_j)^2$, where $\hat{y}_j = \sum_{i=0}^{m} w_i x_{j,i} = \mathbf{w} \cdot \mathbf{x}_j$
     - We will update the $w_i$-s
     - Consider the partial derivatives w.r.t. the $w_i$-s: $\frac{\partial}{\partial w_i} L(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n}\sum_{j=1}^{n} 2(\hat{y}_j - y_j)\,x_{j,i}$
     - Update $w_i$: $w_i \leftarrow w_i - \eta \frac{\partial}{\partial w_i} L(\hat{\mathbf{y}}, \mathbf{y})$
     - $n$ is the number of observations, $0 \leq j \leq n$, and $m$ is the number of features for each observation, $0 \leq i \leq m$
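
A sketch of one gradient-descent step under these formulas, using a made-up toy data set X, y and learning rate (the leading column of ones in X plays the role of the bias node):

    import numpy as np

    # One gradient-descent step for linear regression with mean squared error.
    X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])   # n = 3 observations
    y = np.array([1.0, 2.0, 4.0])                        # targets
    w = np.zeros(2)                                      # weights, incl. bias
    eta = 0.1                                            # learning rate

    y_hat = X @ w                                        # predictions
    grad = (2.0 / len(y)) * X.T @ (y_hat - y)            # dL/dw_i = (1/n) sum 2(y_hat - y) x_i
    w = w - eta * grad                                   # w_i <- w_i - eta * dL/dw_i
    print(w)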

  12. Inspecting the update
     - $w_i \leftarrow w_i - \eta \frac{1}{n}\sum_{j=1}^{n} 2(\hat{y}_j - y_j)\,x_{j,i}$
     - $2(\hat{y}_j - y_j)$ is the error term (delta term) of this prediction, from the loss function
     - $x_{j,i}$ is the contribution to the error from this weight
     - $\eta$ is the learning rate
     [Figure: the linear-regression network with input nodes x1-x3, bias node, weights w0-w3, summation node Σ, prediction ŷ and target value y.]

  13. Logistic regression as a network
     - $z = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
     - $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
     - Loss: $L_{CE} = -\sum_{j=1}^{n} \log\big(\hat{y}_j^{\,y_j}(1-\hat{y}_j)^{1-y_j}\big)$
     - To simplify, consider only one observation, $y$:
       - $\frac{\partial}{\partial w_i} L_{CE} = \frac{\partial L_{CE}}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w_i}$
       - $\frac{\partial L_{CE}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$
       - $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$
       - $\frac{\partial z}{\partial w_i} = x_i$
       - $\frac{\partial}{\partial w_i} L_{CE} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}\,\hat{y}(1 - \hat{y})\,x_i = (\hat{y} - y)\,x_i$
     [Figure: the logistic-regression network with input nodes x1-x3, bias node, weights w0-w3, a summation node Σ followed by the sigmoid, prediction ŷ and target value y.]
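
A sketch of the resulting update for a single observation, with hypothetical weights, inputs and label; the gradient is simply (ŷ - y)·x_i as derived above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 0.2, -1.3, 0.8])   # x0 = 1 is the bias input
    w = np.array([0.1, 0.4, 0.4, -0.2])   # current weights
    y = 1.0                               # gold label

    y_hat = sigmoid(np.dot(w, x))         # predicted probability
    grad = (y_hat - y) * x                # gradient of L_CE w.r.t. each w_i
    w = w - 0.5 * grad                    # one update step (learning rate 0.5)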

  14. Logistic regression as a network
     - $\frac{\partial}{\partial w_i} L_{CE} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}\,\hat{y}(1 - \hat{y})\,x_i = (\hat{y} - y)\,x_i$
     - The first factor comes from the loss, the second from the activation function
     - $(\hat{y} - y)$ is the delta term at the end of this weight; $x_i$ is the contribution to the error from this weight
     [Figure: the same network, with the factors of the gradient attached to the loss, the activation function and the incoming edge.]

  15. Feed-forward network
     - An input layer
     - An output layer: the predictions
     - One or more hidden layers
     - Connections from one layer to the next (from left to right)

  16. The hidden nodes
     - Each hidden node is like a small logistic regression:
       - First, a sum of the weighted inputs: $z = \sum_{i=0}^{m} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
       - Then the result is run through an activation function, e.g. $\sigma$: $y = \sigma(z) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$
     - It is the non-linearity of the activation function which makes it possible for the MLP to predict non-linear decision boundaries
     [Figure: one hidden node with inputs x1-x3, a bias node, weights w0-w3, a summation node Σ and the activation output y.]
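
A minimal sketch of what one hidden node computes, with made-up weights and inputs:

    import numpy as np

    # One hidden node: a weighted sum of its inputs followed by a non-linear
    # activation, here the logistic sigmoid.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 0.5, -0.3, 2.0])   # x0 = 1 is the bias input
    w = np.array([0.2, -0.7, 0.1, 0.4])   # weights into this hidden node

    z = np.dot(w, x)                      # z = w . x
    y = sigmoid(z)                        # activation: y = sigma(z)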

  17. The output layer
     Alternatives:
     - Regression: one node, no activation function
     - Binary classifier: one node, logistic activation function
     - Multinomial classifier: several nodes, softmax
     - + more alternatives
     - The choice of loss function depends on the task
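
For the multinomial case, a small sketch of a softmax over made-up class scores; subtracting the maximum first is a standard numerical-stability trick not mentioned on the slide:

    import numpy as np

    # Softmax output layer: turns a vector of scores (one per class) into a
    # probability distribution.
    def softmax(z):
        z = z - np.max(z)          # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    scores = np.array([2.0, 0.5, -1.0])
    print(softmax(scores))         # three probabilities summing to 1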

  18. Learning in multi-layer networks
     - Consider two consecutive layers:
       - Layer M, with $1 \leq i \leq m$ nodes, and a bias node M0
       - Layer N, with $1 \leq j \leq n$ nodes
       - Let $w_{i,j}$ be the weight at the edge going from $M_i$ to $N_j$
     - Consider processing one observation:
       - Let $x_i$ be the value going out of node $M_i$
       - If M is a hidden layer: $x_i = \sigma(z_i)$, where $z_i = \sum(\dots)$
     [Figure: nodes M0-M3 in layer M connected to nodes N1-N4 in layer N.]

  19. Learning in multi-layer networks
     - If N is the output layer, calculate the error terms $\delta^{N}_{j}$ as before from the loss and the activation function at each node $N_j$
     - If M is a hidden layer, calculate the error term at each of its nodes by combining:
       - a weighted sum of the error terms at layer N
       - the derivative of the activation function
     - $\delta^{M}_{i} = \left(\sum_{j=1}^{n} w_{i,j}\,\delta^{N}_{j}\right) \frac{dx_i}{dz_i}$, where $x_i = \sigma(z_i)$ and $z_i = \sum(\dots)$
     [Figure: the same two layers, highlighting the weights w1,1-w1,4 from node M1 into layer N.]
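
A vectorized sketch of this backpropagation step for a logistic activation, where σ'(z) = σ(z)(1 - σ(z)) = x(1 - x); the layer sizes and values are arbitrary placeholders:

    import numpy as np

    # delta_M[i] = (sum_j W[i, j] * delta_N[j]) * sigma'(z_M[i]).
    W = np.random.randn(3, 4)          # weights from the 3 M-nodes to the 4 N-nodes
    delta_N = np.random.randn(4)       # error terms already computed at layer N
    x_M = np.random.rand(3)            # activations x_i = sigma(z_i) at layer M

    delta_M = (W @ delta_N) * x_M * (1.0 - x_M)   # error terms at layer M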

  20. Learning in multi-layer networks
     - By repeating the process, we get error terms at all nodes in all the hidden layers
     - The update of the weights between the layers can be done as before:
       - $w_{i,j} \leftarrow w_{i,j} - x_i\,\delta^{N}_{j}$, where $x_i$ is the value going out of node $M_i$
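
A sketch of the corresponding weight update as an outer product; a learning rate eta is included here, as in the earlier update rules, even though it is not written out on this slide:

    import numpy as np

    # w[i, j] <- w[i, j] - eta * x_M[i] * delta_N[j]
    eta = 0.1
    x_M = np.random.rand(3)            # values going out of the M-nodes
    delta_N = np.random.randn(4)       # error terms at the N-nodes
    W = np.random.randn(3, 4)          # current weights from M to N

    W = W - eta * np.outer(x_M, delta_N)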

  21. Alternative activation functions
     - There are alternative activation functions:
       - $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
       - $\mathrm{ReLU}(x) = \max(x, 0)$
     - ReLU is the preferred method in hidden layers in deep networks
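
Both activations in a couple of lines of numpy (numpy already ships tanh; ReLU is just an elementwise max with zero):

    import numpy as np

    def tanh(z):
        return np.tanh(z)              # (e^z - e^-z) / (e^z + e^-z)

    def relu(z):
        return np.maximum(z, 0.0)      # ReLU(z) = max(z, 0)

    z = np.array([-2.0, 0.0, 3.0])
    print(tanh(z), relu(z))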

  22. Today
     - Neural networks
     - Language models
     - Word embeddings
     - Word2vec

  23. Language model

  24. Probabilistic Language Models
     - Goal: ascribe probabilities to word sequences
     - Motivation:
       - Translation: P(she is a tall woman) > P(she is a high woman); P(she has a high position) > P(she has a tall position)
       - Spelling correction: P(She met the prefect.) > P(She met the perfect.)
       - Speech recognition: P(I saw a van) > P(eyes awe of an)

  25. Probabilistic Language Models
     - Goal: ascribe probabilities to word sequences: $P(w_1, w_2, w_3, \dots, w_n)$
     - Related: the probability of the next word: $P(w_n \mid w_1, w_2, w_3, \dots, w_{n-1})$
     - A model which does either is called a Language Model (LM)
     - Comment: the term is somewhat misleading (it probably originates from speech recognition)

  26. Chain rule
     - The two definitions are related by the chain rule for probability:
       $P(w_1, w_2, w_3, \dots, w_n) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \dots \times P(w_n \mid w_1, w_2, \dots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \dots, w_{i-1}) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})$
     - P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
     - But this does not work for long sequences (which we may not even have seen before)
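
A toy illustration of how the chain-rule factors multiply; the conditional probabilities below are invented for the example and are not estimates from any corpus:

    # P(w_1 ... w_n) as a product of conditional probabilities.
    p = {
        "its": 0.02,
        "water|its": 0.1,
        "is|its water": 0.3,
        "so|its water is": 0.05,
        "transparent|its water is so": 0.01,
    }

    prob = 1.0
    for factor in p.values():
        prob *= factor        # multiply the conditionals in order
    print(prob)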
