Deep learning for natural language processing: A short primer on deep learning


  1. Deep learning for natural language processing: A short primer on deep learning
  Benoit Favre <benoit.favre@univ-mrs.fr>
  Aix-Marseille Université, LIF/CNRS
  20 Feb 2017

  2. Deep learning for Natural Language Processing
  Day 1
  ▶ Class: intro to natural language processing
  ▶ Class: quick primer on deep learning
  ▶ Tutorial: neural networks with Keras
  Day 2
  ▶ Class: word embeddings
  ▶ Tutorial: word embeddings
  Day 3
  ▶ Class: convolutional neural networks, recurrent neural networks
  ▶ Tutorial: sentiment analysis
  Day 4
  ▶ Class: advanced neural network architectures
  ▶ Tutorial: language modeling
  Day 5
  ▶ Tutorial: image and text representations
  ▶ Test

  3. Mathematical notations
  Just to make sure we share the same vocabulary:
  ▶ x can be a scalar, vector, matrix or tensor (n-dimensional array)
  ▶ An "axis" of x is one of the dimensions of x
  ▶ The "shape" of x is the size of the axes of x
  ▶ x_{i,j,k} is the element of index i, j, k in the first 3 dimensions
  ▶ f(x) is a function of x; it returns a same-shape mathematical object
  ▶ xy = x · y = dot(x, y) is the matrix-to-matrix multiplication: if r = xy, then r_{i,j} = ∑_k x_{i,k} × y_{k,j}
  ▶ x ⊙ y is the elementwise multiplication
  ▶ tanh(x) applies the tanh function to all elements of x and returns the result
  ▶ σ is the sigmoid function, |x| is the absolute value, max(x) is the largest element...
  ▶ ∑x is the sum of the elements of x, ∏x is their product
  ▶ ∂f/∂θ is the partial derivative of f with respect to parameter θ
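A minimal NumPy sketch of these notations, since the tutorials use Python; the array names and shapes below are purely illustrative:

```python
import numpy as np

x = np.random.rand(2, 3, 4)   # a tensor with 3 axes; its shape is (2, 3, 4)
print(x.shape)                # (2, 3, 4)
print(x[0, 1, 2])             # element x_{i,j,k} with i=0, j=1, k=2

a = np.random.rand(3, 4)
b = np.random.rand(4, 5)
r = np.dot(a, b)              # matrix multiplication: r_{i,j} = sum_k a_{i,k} * b_{k,j}

c = np.random.rand(3, 4)
e = a * c                     # elementwise (Hadamard) product, same shape as a

t = np.tanh(a)                # tanh applied to every element
s = a.sum()                   # sum of all elements
p = a.prod()                  # product of all elements
```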

  4. What is machine learning?
  Objective
  ▶ Train a computer to simulate what humans do
  ▶ Give examples to a computer and teach it to do the same
  How machine learning is actually done
  ▶ Adjust the parameters of a function so that it generates output that looks like some data
  ▶ Minimize a loss function between the output of the function and some true data
  ▶ Actual minimization target: perform well on new data (empirical risk)

  5. A formalization
  Formalism
  ▶ x ∈ R^k is an observation, a vector of real numbers
  ▶ y ∈ R^m is a class label among m possible labels
  ▶ (X, Y) = {(x^{(i)}, y^{(i)})}_{i ∈ [1..n]} is the training data
  ▶ f_θ(·) is a function parametrized by θ
  ▶ L(·, ·) is a loss function
  Inference
  ▶ Predict a label by passing the observation through a neural network: y = f_θ(x)
  Training
  ▶ Find the parameter vector that minimizes the loss of predictions versus truth on a training corpus:
    θ* = argmin_θ ∑_{(x,y) ∈ T} L(f_θ(x), y)
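A minimal sketch of this formalism in NumPy; the model f_theta, the loss, and the toy data below are illustrative placeholders, not part of the slides:

```python
import numpy as np

def f_theta(theta, x):
    """A toy parametrized model: one linear layer followed by a sigmoid."""
    W, b = theta
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def loss(y_pred, y_true):
    """Squared error between prediction and truth (one of many possible losses)."""
    return float(np.sum((y_pred - y_true) ** 2))

# Toy training corpus T: n observations of dimension k, one-hot labels of dimension m
k, m, n = 4, 3, 10
X = np.random.rand(n, k)
Y = np.eye(m)[np.random.randint(m, size=n)]

theta = (np.random.randn(m, k), np.zeros(m))

# Empirical loss on the training corpus: sum over (x, y) in T of L(f_theta(x), y)
empirical_loss = sum(loss(f_theta(theta, x), y) for x, y in zip(X, Y))
print(empirical_loss)
```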

  6. Neural networks
  A biological neuron
  ▶ Inputs: dendrites
  ▶ Output: axon
  ▶ Processing unit: nucleus
  Source: http://www.marekrei.com/blog/wp-content/uploads/2014/01/neuron.png
  One formal neuron
  ▶ output = activation(weighted_sum(inputs) + bias)
  A layer of neurons
  ▶ f is an activation function
  ▶ Process multiple neurons in parallel
  ▶ Implement as a matrix-vector multiplication: y = f(Wx + b)
  A multilayer perceptron
  ▶ y = f_3(W_3 f_2(W_2 f_1(W_1 x + b_1) + b_2) + b_3)
  ▶ y = NN_θ(x), where θ = (W_1, b_1, W_2, b_2, W_3, b_3)
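A minimal NumPy sketch of a formal neuron layer and a three-layer perceptron; the layer sizes and activation choices are assumptions made for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer(W, b, x, f):
    """One layer of neurons: activation(weighted sum of inputs + bias)."""
    return f(W @ x + b)

# A 3-layer perceptron: input of size 4, hidden layers of size 8 and 6, output of size 3
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((6, 8)), np.zeros(6)
W3, b3 = rng.standard_normal((3, 6)), np.zeros(3)

def NN(x):
    h1 = layer(W1, b1, x, np.tanh)      # f_1 = tanh
    h2 = layer(W2, b2, h1, np.tanh)     # f_2 = tanh
    return layer(W3, b3, h2, sigmoid)   # f_3 = sigmoid

x = rng.standard_normal(4)
print(NN(x))   # a vector of 3 outputs in (0, 1)
```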

  7. Encoding inputs and outputs
  Input x
  ▶ A vector of real values
  Output y
  ▶ Binary problem: 1 value, which can be 0 or 1 (or -1 and 1 depending on the activation function)
  ▶ Regression problem: 1 real value
  ▶ Multiclass problem
  ⋆ One-hot encoding
  ⋆ Example: class 3 among 6 → (0, 0, 1, 0, 0, 0)
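A quick NumPy sketch of one-hot encoding; the class indices below are made up:

```python
import numpy as np

def one_hot(label, num_classes):
    """Encode an integer class label (0-indexed) as a one-hot vector."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

# Class 3 among 6 (0-indexed as 2) -> (0, 0, 1, 0, 0, 0)
print(one_hot(2, 6))

# Encode a whole batch of labels at once
labels = np.array([2, 0, 5])
print(np.eye(6)[labels])
```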

  8. Non-linearity
  Activation function
  ▶ If f is the identity, a composition of linear maps is still linear
  ▶ We need a non-linearity (tanh, σ, ...)
  ▶ For instance, a 1-hidden-layer MLP:
    NN_θ(x) = σ(W_2 z(x) + b_2)
    z(x) = σ(W_1 x + b_1)
  Non-linearity
  ▶ A neural network can approximate any continuous function¹ [Cybenko'89, Hornik'91, ...]
  Deep neural networks
  ▶ A composition of many non-linear functions
  ▶ Faster to compute and better expressive power than a very large shallow network
  ▶ Used to be hard to train
  ¹ http://neuralnetworksanddeeplearning.com/chap4.html
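A small NumPy check of the first point: stacking two purely linear layers collapses into a single linear layer, so nothing is gained without a non-linearity. The shapes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)

x = rng.standard_normal(4)

# Two stacked linear layers...
y_stacked = W2 @ (W1 @ x + b1) + b2

# ...are exactly one linear layer with W = W2 W1 and b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b
print(np.allclose(y_stacked, y_single))   # True: no extra expressive power

# Inserting a sigmoid between the layers breaks this equivalence
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
y_mlp = W2 @ sigmoid(W1 @ x + b1) + b2    # the 1-hidden-layer MLP of the slide (up to the final sigma)
```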

  9. Loss
  Loss suffered by wrongfully predicting the class of an example:
    L(X, Y) = (1/n) ∑_{i=1}^n l(y^{(i)}, NN_θ(x^{(i)}))
  Well-known losses (y_t is the true label, y_p is the predicted label)
  ▶ l_mae(y_t, y_p) = |y_t − y_p|  (absolute loss)
  ▶ l_mse(y_t, y_p) = (y_t − y_p)²  (mean squared error)
  ▶ l_ce(y_t, y_p) = −[y_t ln y_p + (1 − y_t) ln(1 − y_p)]  (cross entropy)
  ▶ l_hinge(y_t, y_p) = max(0, 1 − y_t y_p)  (hinge loss)
  The most common loss for classification
  ▶ Cross entropy
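These losses are one-liners in NumPy; a sketch with made-up example values:

```python
import numpy as np

def l_mae(y_t, y_p):
    return np.abs(y_t - y_p)                    # absolute loss

def l_mse(y_t, y_p):
    return (y_t - y_p) ** 2                     # squared error

def l_ce(y_t, y_p, eps=1e-12):
    y_p = np.clip(y_p, eps, 1 - eps)            # avoid log(0)
    return -(y_t * np.log(y_p) + (1 - y_t) * np.log(1 - y_p))   # binary cross entropy

def l_hinge(y_t, y_p):
    return np.maximum(0.0, 1.0 - y_t * y_p)     # labels in {-1, +1}

print(l_ce(1.0, 0.9), l_ce(1.0, 0.1))   # confident correct vs. confident wrong prediction
print(l_hinge(-1.0, 0.3))               # margin violation
```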

  10. Training as loss minimization
  As a loss minimization problem:
    θ* = argmin_θ L(X, Y)
  For a 1-hidden-layer MLP with cross-entropy loss:
    θ* = argmin_θ (1/n) ∑_{i=1}^n −[y_t ln y_p + (1 − y_t) ln(1 − y_p)]
  where the prediction comes from a multilayer perceptron with one hidden layer:
    y_p = NN_θ(x) = σ(W_2 z(x) + b_2)
    z(x) = σ(W_1 x + b_1)
  → We need to minimize a non-linear, non-convex function

  11. Function minimization
  ▶ Non-convex → local minima
  ▶ Gradient descent
  Source: https://qph.ec.quoracdn.net/main-qimg-1ec77cdbb354c3b9d439fbe436dc5d4f
  Source: https://www.inverseproblem.co.nz/OPTI/Images/plot_ex2nlpb.png

  12. Gradient descent
  ▶ Start with a random θ
  ▶ Compute the gradient of the loss with respect to θ:
    ∇L(X, Y) = (∂L(X, Y)/∂θ_1, ..., ∂L(X, Y)/∂θ_n)
  ▶ Make a step in the direction opposite to the gradient:
    θ^{(t+1)} = θ^{(t)} − λ ∇L(X, Y)
  ▶ λ is a small value called the learning rate
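A minimal sketch of these update steps on a toy loss; the quadratic function and learning rate below are chosen purely for illustration:

```python
import numpy as np

# Toy loss: L(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2, minimized at (3, -1)
def L(theta):
    return (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2

def grad_L(theta):
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

theta = np.random.randn(2)     # start with a random theta
lam = 0.1                      # learning rate lambda

for t in range(100):
    theta = theta - lam * grad_L(theta)   # step against the gradient

print(theta, L(theta))         # close to (3, -1), loss close to 0
```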

  13. Chain rule
  Differentiation of function composition
  ▶ Remember calculus class: g ∘ f(x) = g(f(x))
    ∂(g ∘ f)/∂x = (∂g/∂f) (∂f/∂x)
  ▶ So if you have a composition of functions, you can compute its derivative with respect to a parameter by multiplying a series of factors:
    ∂(f_1 ∘ ... ∘ f_n)/∂θ = (∂f_1/∂f_2) ... (∂f_{n−1}/∂f_n) (∂f_n/∂θ)
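A quick numerical check of the chain rule; the functions g and f below are arbitrary examples:

```python
import numpy as np

f = np.sin                     # inner function
g = lambda u: u ** 2           # outer function: g(f(x)) = sin(x)^2

def analytic_derivative(x):
    # chain rule: d(g o f)/dx = (dg/df) * (df/dx) = 2*sin(x) * cos(x)
    return 2 * np.sin(x) * np.cos(x)

def numerical_derivative(x, h=1e-6):
    return (g(f(x + h)) - g(f(x - h))) / (2 * h)

x = 0.7
print(analytic_derivative(x), numerical_derivative(x))   # nearly identical
```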

  14. Example for an MLP
  Multilayer perceptron with one hidden layer (z_2):
    L(X, Y) = (1/n) ∑_{i=1}^n l_ce(y^{(i)}, NN_θ(x^{(i)}))
    NN_θ(x) = z_1(x) = σ(W_2 z_2(x) + b_2)
    z_2(x) = σ(W_1 x + b_1)
    θ = (W_1, b_1, W_2, b_2)
  So we need to compute:
    ∂L/∂W_2 = (∂L/∂l_ce)(∂l_ce/∂z_1)(∂z_1/∂W_2)
    ∂L/∂b_2 = (∂L/∂l_ce)(∂l_ce/∂z_1)(∂z_1/∂b_2)
    ∂L/∂W_1 = (∂L/∂l_ce)(∂l_ce/∂z_1)(∂z_1/∂z_2)(∂z_2/∂W_1)
    ∂L/∂b_1 = (∂L/∂l_ce)(∂l_ce/∂z_1)(∂z_1/∂z_2)(∂z_2/∂b_1)
  A lot of this computation is redundant: the leading factors are shared across parameters.

  15. Back-propagation
  A lot of computations are shared
  ▶ No need to recompute them
  ▶ Similar to dynamic programming
  Information propagates back through the network
  ▶ Hence the name "back-propagation"
  Training a neural network
  1. θ^{(0)} = random
  2. While not converged:
    1. Forward: compute L_{θ^{(t)}}(X, Y)
      ⋆ Predict y_p
      ⋆ Compute the loss
    2. Backward: compute ∇L_{θ^{(t)}}(X, Y)
      ⋆ Compute the partial derivatives
    3. Update: θ^{(t+1)} = θ^{(t)} − λ ∇L_{θ^{(t)}}(X, Y)
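A minimal NumPy sketch of this training loop for the 1-hidden-layer MLP of the previous slide, with the gradients written out by hand; the toy dataset, sizes, and learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Toy binary task: 2 inputs, label 1 if their sum is positive
X = rng.standard_normal((200, 2))
Y = (X.sum(axis=1) > 0).astype(float)

# 1-hidden-layer MLP: theta = (W1, b1, W2, b2), hidden size 5, scalar output
h = 5
W1, b1 = rng.standard_normal((h, 2)) * 0.5, np.zeros(h)
W2, b2 = rng.standard_normal(h) * 0.5, 0.0
lam = 0.5                                   # learning rate

for epoch in range(200):
    gW1, gb1, gW2, gb2, total_loss = 0.0, 0.0, 0.0, 0.0, 0.0
    for x, y in zip(X, Y):
        # forward: predict y_p and compute the cross-entropy loss
        z2 = sigmoid(W1 @ x + b1)
        y_p = sigmoid(W2 @ z2 + b2)
        total_loss += -(y * np.log(y_p) + (1 - y) * np.log(1 - y_p))
        # backward: chain rule, reusing the shared factor delta = y_p - y
        delta = y_p - y                         # dL/d(pre-activation of the output)
        gW2 += delta * z2
        gb2 += delta
        d_hidden = delta * W2 * z2 * (1 - z2)   # propagate back through sigma and W2
        gW1 += np.outer(d_hidden, x)
        gb1 += d_hidden
    n = len(X)
    # update: step against the averaged gradient
    W1 -= lam * gW1 / n; b1 -= lam * gb1 / n
    W2 -= lam * gW2 / n; b2 -= lam * gb2 / n

print("final average loss:", total_loss / n)
```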

  16. Computational graphs
  Represent the operations in L(X, Y) as a graph
  ▶ Every operation, not just high-level functions
  Source: http://colah.github.io
  More details: http://outlace.com/Computational-Graph/

  17. Building blocks for neural networks
  You can build a neural network like Lego
  ▶ Each block has inputs, parameters and outputs
  ▶ Examples (with δ_y the gradient flowing back from the output y)
  ⋆ Logarithm: forward: y = ln(x); backward: ∂y/∂x = 1/x
  ⋆ Linear: forward: y = f_{W,b}(x) = W·x + b; backward: ∂L/∂W = δ_y·xᵀ, ∂L/∂x = Wᵀ·δ_y, ∂L/∂b = δ_y
  ⋆ Sum, product: ...
  This provides auto-differentiation
  ▶ A key component of modern deep learning toolkits
  Diagram: a block f(x_1, x_2) = y takes inputs x_1 and x_2, outputs y, and passes back the gradients ∂f/∂x_1(y) and ∂f/∂x_2(y)
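A toy sketch of such building blocks with a forward and a backward method; this is a simplified illustration of the idea, not any specific toolkit's API:

```python
import numpy as np

class Linear:
    """y = W x + b, as one block with a forward and a backward pass."""
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                               # keep the input for the backward pass
        return self.W @ x + self.b

    def backward(self, grad_y):
        # grad_y is dL/dy flowing back from the next block
        self.grad_W = np.outer(grad_y, self.x)   # dL/dW
        self.grad_b = grad_y                     # dL/db
        return self.W.T @ grad_y                 # dL/dx, passed to the previous block

class Sigmoid:
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y

    def backward(self, grad_y):
        return grad_y * self.y * (1.0 - self.y)  # dL/dx = dL/dy * sigma'(x)

# Chaining blocks: forward left-to-right, backward right-to-left
blocks = [Linear(4, 3), Sigmoid()]
h = np.random.randn(4)
for blk in blocks:
    h = blk.forward(h)
grad = h - np.array([1.0, 0.0, 0.0])             # some made-up output gradient
for blk in reversed(blocks):
    grad = blk.backward(grad)
```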

  18. Stochastic optimization
  Stochastic gradient descent (SGD)
  ▶ Look at one example at a time
  ▶ Update the parameters every time
  ▶ Learning rate λ
  Many optimization techniques have been proposed
  ▶ Sometimes we should make larger steps: adaptive λ
  ⋆ λ ← λ/2 when the loss stops decreasing on a validation set
  ▶ Add inertia (momentum) to skip through local minima
  ▶ Adagrad, Adadelta, Adam, NAdam, RMSprop...
  ▶ The catch is that fancier algorithms use more memory
  ⋆ But they can converge faster
  Regularization
  ▶ Prevent the model from fitting the training data too closely
  ▶ Penalize the loss by the magnitude of the parameter vector (loss + ||θ||)
  ▶ Dropout: randomly disable neurons during training
  ▶ Mini-batches
  ⋆ Average SGD updates over a set of examples
  ⋆ Much faster because the computations are parallel
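Since the tutorials use Keras, here is a hedged sketch of how these pieces (an adaptive optimizer, dropout, mini-batches) typically come together there; the data, layer sizes, and hyperparameters are placeholders, and depending on the Keras version the imports may come from tensorflow.keras instead:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Made-up data: 1000 examples, 20 features, binary labels
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype("float32")

model = Sequential([
    Dense(64, activation="relu", input_dim=20),
    Dropout(0.5),                      # randomly disable neurons during training
    Dense(1, activation="sigmoid"),
])

# Adam is one of the adaptive optimizers listed above; cross entropy is the classification loss
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Mini-batches of 32 examples, with a validation split to monitor the loss
model.fit(X, y, batch_size=32, epochs=10, validation_split=0.1)
```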
