  1. Traitement automatique des langues : Fondements et applications Cours 10 : Neural networks (1) Tim Van de Cruys & Philippe Muller 2016—2017

  2. Introduction Machine learning for NLP • Standard approach: linear model trained over high-dimensional but very sparse feature vectors • Recently: non-linear neural networks over dense input vectors

  3. Neural Network Architectures Feed-forward neural networks • Best known, standard neural network approach • Fully connected layers • Can be used as drop-in replacement for typical NLP classifiers

  4. Feature representation Dense vs. one-hot • One-hot: each feature is its own dimension • Dimensionality of the vector equals the number of features • Each feature is completely independent of the others • Dense: each feature is a d-dimensional vector • Dimensionality is d • Similar features have similar vectors
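
A minimal sketch of the two representations; the toy vocabulary, the word indices and the embedding size d are invented for the illustration:

```python
import numpy as np

vocab = ["they", "can", "fish", "jump", "river"]      # toy vocabulary (assumption)
word2id = {w: i for i, w in enumerate(vocab)}

# One-hot: dimensionality = number of features (here: vocabulary size),
# and every feature is independent of the others.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# Dense: each word is a d-dimensional row of an embedding matrix.
d = 4                                                 # embedding size (assumption)
E = np.random.randn(len(vocab), d) * 0.1              # V x d embedding matrix

print(one_hot("jump"))          # [0. 0. 0. 1. 0.]
print(E[word2id["jump"]])       # a 4-dimensional dense vector
```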

  5. Feature representation Feature combinations • Traditional NLP: specify interactions of features • E.g. features like 'word is jump, tag is V and previous word is they' • Non-linear network: only specify core features • The non-linearity of the network takes care of finding indicative feature combinations

  6. Feature representation Why dense? • Discrete approach often works surprisingly well for NLP tasks • n-gram language models • POS-tagging, parsing • sentiment analysis • Still, a very poor representation of word meaning • No notion of similarity • Limited inference

  7.–8. Feature representation Why dense? (cont.) • One-hot vectors of two different words, e.g. [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ] and [ 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ], share no non-zero dimension: the representation encodes no similarity between them

  9.–11. Feature representation Why dense? (figures only)

  12. Feed-forward Architecture Multi-layer perceptron with 2 hidden layers: NN_MLP2(x) = y, with h_1 = g(x W_1 + b_1), h_2 = g(h_1 W_2 + b_2), y = h_2 W_3 • x : vector of size d_in = 3 • y : vector of size d_out = 2 • h_1, h_2 : vectors of size d_hidden = 4

  13. Feed-forward Architecture Multi-layer perceptron with 2 hidden layers (cont.) • W_1, W_2, W_3 : matrices of size [3 × 4], [4 × 4], [4 × 2] • b_1, b_2 : 'bias' vectors of size d_hidden = 4 • g(·) : non-linear activation function (applied elementwise)

  14. Feed-forward Architecture Multi-layer perceptron with 2 hidden layers (cont.) • W_1, W_2, W_3, b_1, b_2 = parameters of the network (θ) • Use of multiple hidden layers: deep learning
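
A minimal numpy sketch of the two-hidden-layer MLP above, with the sizes from slide 12 (d_in = 3, d_hidden = 4, d_out = 2); the choice of tanh for g and the random parameter values are assumptions made only to have something runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 3, 4, 2

# Parameters theta = (W1, b1, W2, b2, W3)
W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b2 = np.zeros(d_hidden)
W3 = rng.normal(scale=0.1, size=(d_hidden, d_out))

g = np.tanh                          # elementwise non-linearity (one possible choice)

def nn_mlp2(x):
    h1 = g(x @ W1 + b1)              # first hidden layer
    h2 = g(h1 @ W2 + b2)             # second hidden layer
    return h2 @ W3                   # output layer, no non-linearity

print(nn_mlp2(np.array([1.0, -0.5, 2.0])))   # vector of size d_out = 2
```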

  15. Feed-forward Non-linear activation functions • Sigmoid (logistic) function: σ(x) = 1 / (1 + e^{−x})

  16. Feed-forward Non-linear activation functions • Hyperbolic tangent (tanh) function: tanh(x) = (e^{2x} − 1) / (e^{2x} + 1)

  17. Feed-forward Non-linear activation functions • Rectified linear unit (ReLU): ReLU(x) = max(0, x)

  18. Feed-forward Output transformation function • Softmax function: for x = x_1 … x_k, softmax(x_i) = e^{x_i} / Σ_{j=1}^{k} e^{x_j}
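
The four functions of slides 15–18 written out directly (a small numpy sketch; subtracting the maximum inside the softmax is a standard numerical-stability trick, not something stated on the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # equivalent to np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract max for numerical stability
    return e / e.sum()               # e^{x_i} / sum_j e^{x_j}

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), tanh(x), relu(x), softmax(x), sep="\n")
```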

  19. Feed-forward Input vector • Embedding lookup from the embedding matrix • Concatenate or sum the embeddings
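
A sketch of building the input vector for a window of three words; the vocabulary size, the embedding size and the word indices are invented:

```python
import numpy as np

V, d = 1000, 50                        # vocabulary and embedding size (assumptions)
E = np.random.randn(V, d) * 0.1        # embedding matrix, one row per word

window = [12, 7, 431]                  # indices of the words in the input window

x_concat = np.concatenate([E[i] for i in window])   # concatenation: size 3 * d
x_sum = sum(E[i] for i in window)                   # sum: size d

print(x_concat.shape, x_sum.shape)     # (150,) (50,)
```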

  20. Feed-forward Loss functions • L(ŷ, y) : the loss of predicting ŷ when the true output is y • Set the parameters θ so as to minimize the loss across the training examples • Compute the gradient of the loss with respect to the parameters to find a minimum, and take steps in the right direction

  21. Feed-forward Loss functions • Hinge loss (binary and multi-class) • classify the correct class over the incorrect class(es) with a margin of at least 1 • Categorical cross-entropy loss (negative log-likelihood) • measure the difference between the true class distribution y and the predicted class distribution ŷ • use with a softmax output • Ranking loss • in an unsupervised setting: rank attested examples over unattested, corrupted ones with a margin of at least 1
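
A sketch of two of these losses for a single example (categorical cross-entropy over a softmax output, and a multi-class hinge loss with margin 1); the toy score and probability vectors are invented:

```python
import numpy as np

def cross_entropy(y_hat, gold):
    """Negative log-likelihood of the gold class under the predicted
    distribution y_hat (assumed to come from a softmax)."""
    return -np.log(y_hat[gold])

def multiclass_hinge(scores, gold):
    """Require the gold class to score at least 1 higher than the
    best-scoring incorrect class."""
    wrong = np.delete(scores, gold)
    return max(0.0, 1.0 - (scores[gold] - wrong.max()))

y_hat = np.array([0.1, 0.7, 0.2])        # predicted class distribution
scores = np.array([1.2, 2.0, 0.3])       # raw (pre-softmax) class scores
print(cross_entropy(y_hat, gold=1))      # about 0.36
print(multiclass_hinge(scores, gold=1))  # 0.2: the margin over class 0 is only 0.8
```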

  22. Training Stochastic gradient descent • Goal: minimize the total loss Σ_{i=1}^{n} L(f(x_i; θ), y_i) • Estimating the gradient over the entire training set before taking a step is computationally heavy • Compute the gradient for a small batch of samples from the training set → estimate of the gradient: stochastic • Learning rate λ : size of the step in the right direction • Improvements: momentum, adaptive learning rate

  23. Training Stochastic gradient descent • Size of the mini-batch: balance between a better gradient estimate and faster convergence • Gradients over the different parameters (weight matrices, bias terms, embeddings, ...) are efficiently calculated using the backpropagation algorithm • No need to carry out the derivations yourself: automatic tools compute gradients using computational graphs
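
A schematic mini-batch SGD loop with per-epoch shuffling. The logistic-regression model and its hand-derived gradient are only a stand-in so the loop is runnable; in practice the gradients would come from a backpropagation/autodiff tool, as the slide says:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (invented for the sketch)
X = rng.normal(size=(1000, 20))
y = (X @ rng.normal(size=20) > 0).astype(float)

w = np.zeros(20)            # parameters theta
lr = 0.1                    # learning rate lambda
batch_size = 32             # mini-batch size

for epoch in range(5):
    perm = rng.permutation(len(X))               # shuffle the training set each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]     # one mini-batch
        xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))      # model predictions
        grad = xb.T @ (p - yb) / len(idx)        # stochastic estimate of the gradient
        w -= lr * grad                           # step in the right direction

print("training accuracy:", ((X @ w > 0) == (y == 1)).mean())
```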

  24. Training Initialization • Parameters of the network are initialized randomly • The magnitude of the random samples has an effect on training success • effective initialization schemes exist
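
The slide only states that effective initialization schemes exist; as one common example, a Xavier/Glorot-style uniform initialization scales the random range by the layer dimensions (the exact formula below is an illustration, not something taken from the slides):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Scale the uniform range by the layer sizes so activations neither
    # explode nor vanish at the start of training.
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W1 = xavier_uniform(3, 4)     # e.g. the first layer of the MLP from slide 12
print(W1)
```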

  25. Training Misc • Shuffling: shuffle the training set at each epoch • Learning rate: balance between proper convergence and fast convergence • Mini-batch size: balance between speed and a proper gradient estimate; efficient on a GPU

  26. Training Regularization • Neural networks have many parameters: risk of overfitting • Solution: regularization • L2 : extend the loss function with a squared penalty on the parameters, i.e. (λ/2) ‖θ‖² • Dropout: randomly dropping (setting to zero) half of the neurons in the network for each training sample
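
A sketch of the two regularizers; the rescaling by 1/(1 − rate) ('inverted dropout') is a common implementation detail and an assumption here, not something stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(params, lam):
    # (lambda / 2) * ||theta||^2, summed over all parameter arrays
    return 0.5 * lam * sum(np.sum(p ** 2) for p in params)

def dropout(h, rate=0.5, train=True):
    if not train:
        return h                            # no dropout at test time
    mask = rng.random(h.shape) >= rate      # keep each neuron with probability 1 - rate
    return h * mask / (1.0 - rate)          # rescale so the expected value is unchanged

h = rng.normal(size=8)
print(dropout(h))                           # roughly half of the entries set to zero
```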

  27. Word embeddings • Each word i is represented by a small, dense vector v_i ∈ R^d • d is typically in the range 50–1000 • Matrix of size V (vocabulary size) × d (embedding size) • Words are 'embedded' in a real-valued, low-dimensional space • Similar words have similar embeddings

  28. Word embeddings • Example embedding matrix:

              d_1     d_2     d_3   ...
     apple  –2.34   –1.01    0.33
     pear   –2.28   –1.20    0.11
     car    –0.20    1.02    2.44
     ...
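
The small table above, entered literally, with cosine similarity used as the notion of 'similar vectors' (the cosine measure is an assumption; the slide only says that similar words have similar embeddings):

```python
import numpy as np

E = {
    "apple": np.array([-2.34, -1.01, 0.33]),
    "pear":  np.array([-2.28, -1.20, 0.11]),
    "car":   np.array([-0.20,  1.02, 2.44]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(E["apple"], E["pear"]))   # close to 1: similar words, similar vectors
print(cosine(E["apple"], E["car"]))    # close to 0: dissimilar words
```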

  30. Neural word embeddings • Word embeddings have been around for quite some time • The term ‘embedding’ was coined within the neural network community, along with new methods to learn them • Idea: Let’s allocate a number of parameters for each word and allow the neural network to automatically learn what the useful values should be • Prediction-based: learn to predict the next word

  31. Embeddings through language modeling • Predict the next word in a sequence, based on the previous word • One non-linear hidden layer, one softmax layer for classification • Choose parameters that optimize probability of correct word
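
A sketch of such a network for one training pair (previous word, next word): an embedding lookup, one non-linear hidden layer and a softmax over the vocabulary, with the negative log-probability of the correct word as the quantity to minimize. All sizes and indices are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, d_h = 1000, 50, 100                    # vocabulary, embedding, hidden sizes (assumptions)

E  = rng.normal(scale=0.1, size=(V, d))      # word embeddings (learned with the network)
W1 = rng.normal(scale=0.1, size=(d, d_h))
b1 = np.zeros(d_h)
W2 = rng.normal(scale=0.1, size=(d_h, V))    # softmax layer over the whole vocabulary

def next_word_loss(prev_word, next_word):
    h = np.tanh(E[prev_word] @ W1 + b1)      # non-linear hidden layer
    scores = h @ W2
    p = np.exp(scores - scores.max())
    p /= p.sum()                             # softmax over all V words
    return -np.log(p[next_word])             # maximize probability of the correct word

print(next_word_loss(prev_word=12, next_word=431))
```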

  33. Embeddings through error detection • Take a correct sentence and create a corrupted counterpart • Train the network to assign a higher score to the correct version of each sentence
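
A sketch of the corresponding margin objective: the corrupted version of a window (here, its middle word replaced by a random word) should score at least 1 lower than the original. The scorer below is just a stand-in linear function of summed embeddings so that the snippet runs; it is not the network from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50
E = rng.normal(scale=0.1, size=(V, d))       # word embeddings
w = rng.normal(scale=0.1, size=d)            # stand-in scoring weights

def score(window):
    return w @ sum(E[i] for i in window)     # score of a word window

def ranking_loss(window):
    corrupted = list(window)
    corrupted[len(window) // 2] = rng.integers(0, V)   # corrupt the middle word
    return max(0.0, 1.0 - (score(window) - score(corrupted)))

print(ranking_loss([12, 7, 431, 2, 99]))
```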

  35. Word2vec • Neural network approaches work well, but the large number of parameters makes them computationally heavy • Popular, light-weight approach with fewer parameters: word2vec • No hidden layer, only a softmax classifier • Two different models • Continuous bag of words (CBOW): predict the current word based on the surrounding context words • Skip-gram: predict the surrounding context words based on the current word

  36. CBOW • Current word w_t is predicted from the context words • The prediction is made from the sum of the context embeddings

  37. Skip-gram • Each context word is predicted from current word • Parameters for each softmax classifier are shared

  38. Negative sampling • Computation of the full softmax classifier is still rather expensive • Only compute the score for the correct context and for a number of wrong, sampled contexts • Maximize the score of the correct context and minimize the scores of the wrong ones
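
A sketch of the skip-gram score with negative sampling: no hidden layer, only a dot product between a word vector and a context vector pushed through a sigmoid. The number of negative samples and the uniform sampling of wrong contexts are simplifying assumptions (the real word2vec samples them from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 100, 5                        # vocabulary, dimension, negatives per example

W_word = rng.normal(scale=0.1, size=(V, d))   # embeddings of current words
W_ctx  = rng.normal(scale=0.1, size=(V, d))   # embeddings of context words

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(word, context):
    negatives = rng.integers(0, V, size=k)    # "wrong" contexts, sampled at random
    pos = np.log(sigmoid(W_word[word] @ W_ctx[context]))             # score of the correct context
    neg = np.log(sigmoid(-W_word[word] @ W_ctx[negatives].T)).sum()  # scores of the wrong ones
    return -(pos + neg)                       # minimize: pushes correct up, wrong down

print(neg_sampling_loss(word=12, context=431))
```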
