From Log-Linear to Neural Language Models




  1. CS 533: Natural Language Processing
     From Log-Linear to Neural Language Models
     Karl Stratos, Rutgers University

  2. Agenda
     1. Loose ends (STOP symbol, Zipf's law)
     2. Log-linear language models
        ◮ Gradient descent
     3. Neural language models
        ◮ Feedforward
        ◮ Recurrent

  3. Zipf's Law
     Let w_1 … w_{|V|} ∈ V be sorted in decreasing probability, with p(w_i) = 2 p(w_{i+1}).
     First four words: 93% of the unigram probability mass?
     (Top unigrams: , the to . in of a and)
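A quick way to check the 93% figure is to plug the assumption p(w_i) = 2 p(w_{i+1}) into a toy distribution; a minimal sketch (the vocabulary size is an arbitrary illustrative choice):

```python
# If each word is twice as probable as the next, the unnormalized weights are
# 2^{-i}; the top four words then carry (1 + 1/2 + 1/4 + 1/8) / Z of the mass.
V = 50_000                                   # illustrative vocabulary size
weights = [2.0 ** -i for i in range(V)]
Z = sum(weights)                             # very close to 2
top4 = sum(weights[:4]) / Z
print(f"top-4 unigram mass: {top4:.4f}")     # ~0.9375, i.e. roughly 93%
```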

  4. Zipf's Law: Empirical
     [Figure: unigram frequencies in a corpus plotted by rank; the counts drop off sharply, with "the", ",", ".", "to", "of", "in", "and", "a" at the top.]

  5. Log-Linear Language Model
     ◮ Random variables: context x (e.g., previous n words), next word y
     ◮ Assumes a feature function φ(x, y) ∈ {0, 1}^d
     ◮ Model parameter: weight vector w ∈ R^d
     ◮ Model: for any (x, y),
         q_{φ,w}(y | x) = exp(w^⊤ φ(x, y)) / Σ_{y′ ∈ V} exp(w^⊤ φ(x, y′))
     ◮ Model estimation: minimize cross entropy (≡ MLE)
         w* = argmin_{w ∈ R^d}  E_{(x,y) ∼ p_XY}[ − ln q_{φ,w}(y | x) ]
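To make the definition concrete, here is a minimal numpy sketch of q_{φ,w}(y | x); it assumes a hypothetical feature function phi(x, y) returning the binary feature vector, and simply takes a softmax over the scores w^⊤ φ(x, y).

```python
import numpy as np

def log_linear_probs(w, phi, x, vocab):
    """q_{phi,w}(y | x): softmax over scores w . phi(x, y) for y in vocab.

    w     : weight vector of shape (d,)
    phi   : hypothetical feature function returning a (d,) binary vector
    vocab : list of candidate next words
    """
    scores = np.array([w @ phi(x, y) for y in vocab])
    scores -= scores.max()                  # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # one probability per y in vocab
```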

  6. Example: Feature Extraction
     Corpus:
     ◮ the dog chased the cat
     ◮ the cat chased the mouse
     ◮ the mouse chased the dog
     Feature templates:
     ◮ (x[−1], y)
     ◮ (x[−2], y)
     ◮ (x[−2], x[−1], y)
     ◮ (x[−1][:−2], y)
     How many features do we extract from the corpus (what is d)?
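One way to answer the counting question is to enumerate the templates at every position with two words of left context. The sketch below makes two assumptions: sentences are not padded with boundary symbols, and the last template is read as the two-character suffix of the previous word (matching the "ed" feature on the next slide).

```python
corpus = ["the dog chased the cat",
          "the cat chased the mouse",
          "the mouse chased the dog"]

features = set()
for sentence in corpus:
    words = sentence.split()
    for i in range(2, len(words)):                   # positions with two words of context
        x2, x1, y = words[i - 2], words[i - 1], words[i]
        features.add(("x[-1]", x1, y))
        features.add(("x[-2]", x2, y))
        features.add(("x[-2],x[-1]", x2, x1, y))
        features.add(("suffix(x[-1])", x1[-2:], y))  # assumed reading of the last template

print(len(features))       # d for this corpus under these assumptions
```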

  7. Example: Score of (x, y)
     For any (x, y), its "score" given by parameter w ∈ R^d is
         w^⊤ φ(x, y) = Σ_{i : φ_i(x,y) = 1} w_i
     Example: x = mouse chased
         w^⊤ φ(mouse chased, the) = w_{(-1)chased, the} + w_{(-2)mouse, the} + w_{(-2)mouse(-1)chased, the} + w_{(-1:-2)ed, the}
         w^⊤ φ(mouse chased, chased) = w_{(-1)chased, chased} + w_{(-2)mouse, chased} + w_{(-2)mouse(-1)chased, chased} + w_{(-1:-2)ed, chased}

  8. Empirical Objective
         E_{(x,y) ∼ p_XY}[ − ln q_{φ,w}(y | x) ]
           ≈ (1/N) Σ_{l=1}^{N} − ln q_{φ,w}(y^{(l)} | x^{(l)})
           = (1/N) Σ_{l=1}^{N} [ ln Σ_{y ∈ V} exp(w^⊤ φ(x^{(l)}, y)) − w^⊤ φ(x^{(l)}, y^{(l)}) ]  =:  J(w)
     When is J(w) minimized?
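A direct numpy translation of J(w), as a sketch: Phi[l] is assumed to be a |V| × d matrix whose rows are φ(x^{(l)}, y) for each y ∈ V, and y_gold[l] is the index of the observed next word y^{(l)}.

```python
import numpy as np

def empirical_objective(w, Phi, y_gold):
    """J(w) = (1/N) sum_l [ ln sum_y exp(w . phi(x^(l), y)) - w . phi(x^(l), y^(l)) ]."""
    total = 0.0
    for Phi_l, y_l in zip(Phi, y_gold):
        scores = Phi_l @ w                       # one score per candidate y
        log_Z = np.logaddexp.reduce(scores)      # stable ln sum_y exp(score_y)
        total += log_Z - scores[y_l]
    return total / len(y_gold)
```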

  9. Regularization
     Ways to make sure w doesn't overfit the training data:
     1. Early stopping: stop training when validation performance stops improving.
     2. Explicit regularization term:
            min_{w ∈ R^d} J(w) + λ ||w||_2^2      or      min_{w ∈ R^d} J(w) + λ ||w||_1
        where ||w||_2^2 = Σ_{i=1}^{d} w_i^2 and ||w||_1 = Σ_{i=1}^{d} |w_i|.
     3. Other techniques (e.g., dropout).
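The penalty terms are one-liners; a sketch of the two regularizers from the slide, which would be added to the J(w) computed in the previous sketch before minimizing (lam plays the role of λ):

```python
import numpy as np

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)        # lambda * ||w||_2^2

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))     # lambda * ||w||_1
```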

  10. Gradient Descent
     Minimize f(x) = x^3 + 2x^2 − x − 1 over x.
     [Plot of f(x), courtesy of FooPlot.]
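A minimal sketch of running gradient descent on this example; note that f has no global minimum (f → −∞ as x → −∞), so from a starting point near 0 the iterates settle at the local minimum x = (−2 + √7)/3 ≈ 0.215.

```python
def f(x):
    return x ** 3 + 2 * x ** 2 - x - 1

def f_prime(x):
    return 3 * x ** 2 + 4 * x - 1

x, eta = 0.0, 0.05            # initial point and step size (arbitrary choices)
for _ in range(200):
    x -= eta * f_prime(x)     # follow the negative gradient
print(x, f(x))                # x ~ 0.215, the local minimum of f
```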

  11. Local Search
     Input: training objective J(θ) ∈ R, number of iterations T
     Output: parameter θ̂ ∈ R^d such that J(θ̂) is small
     1. Initialize θ_0 (e.g., randomly).
     2. For t = 0 … T − 1:
        2.1 Obtain Δ_t ∈ R^d such that J(θ_t + Δ_t) ≤ J(θ_t).
        2.2 Choose some "step size" η_t ∈ R.
        2.3 Set θ_{t+1} = θ_t + η_t Δ_t.
     3. Return θ_T.
     What is a good Δ_t?

  12. Gradient of the Objective at the Current Parameter
     At θ_t ∈ R^d, the rate of increase (of the value of J) along a direction u ∈ R^d (i.e., ||u||_2 = 1) is given by the directional derivative
         ∇_u J(θ_t) := lim_{ε → 0} [ J(θ_t + εu) − J(θ_t) ] / ε
     The gradient of J at θ_t is defined to be a vector ∇J(θ_t) such that
         ∇_u J(θ_t) = ∇J(θ_t) · u      for all u ∈ R^d
     Therefore, the direction of the greatest rate of decrease is given by −∇J(θ_t) / ||∇J(θ_t)||_2.

  13. Gradient Descent
     Input: training objective J(θ) ∈ R, number of iterations T
     Output: parameter θ̂ ∈ R^d such that J(θ̂) is small
     1. Initialize θ_0 (e.g., randomly).
     2. For t = 0 … T − 1:
            θ_{t+1} = θ_t − η_t ∇J(θ_t)
     3. Return θ_T.
     When J(θ) is additionally convex (as in linear regression), gradient descent converges to an optimal solution (for appropriate step sizes).

  14. Stochastic Gradient Descent for Log-Linear Model
     Input: training objective
         J(w) = (1/N) Σ_{l=1}^{N} J^{(l)}(w)
         J^{(l)}(w) = ln Σ_{y ∈ V} exp(w^⊤ φ(x^{(l)}, y)) − w^⊤ φ(x^{(l)}, y^{(l)})
     and number of iterations T ("epochs").
     1. Initialize w_0 (e.g., randomly).
     2. For t = 0 … T − 1:
        2.1 For l ∈ shuffle({1 … N}):
                w_{t+1} = w_t − η_t ∇_w J^{(l)}(w_t)
     3. Return w_T.
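A numpy sketch of this loop, under the same assumed representation as before (Phi[l] is a |V| × d matrix of feature vectors φ(x^{(l)}, y) and y_gold[l] is the index of y^{(l)}); the per-example gradient is the one derived on the next slide, expected features minus observed features.

```python
import numpy as np

def sgd_log_linear(Phi, y_gold, d, T=5, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(T):                                 # epochs
        for l in rng.permutation(len(y_gold)):         # l in shuffle({1..N})
            scores = Phi[l] @ w
            q = np.exp(scores - scores.max())
            q /= q.sum()                               # q_{phi,w}(y | x^(l))
            grad = q @ Phi[l] - Phi[l][y_gold[l]]      # gradient of J^(l) at current w
            w -= eta * grad                            # gradient step
    return w
```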

  15. Gradient Derivation
     (Done on the board.)
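For reference, differentiating J^{(l)}(w) gives the standard log-linear gradient, expected features under the model minus observed features:

     ∇_w J^{(l)}(w) = Σ_{y ∈ V} q_{φ,w}(y | x^{(l)}) φ(x^{(l)}, y) − φ(x^{(l)}, y^{(l)})

and the full gradient is ∇_w J(w) = (1/N) Σ_{l=1}^{N} ∇_w J^{(l)}(w).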

  16. Summary of Gradient Descent
     ◮ Gradient descent is a local search algorithm that can be used to optimize any differentiable objective function.
     ◮ Stochastic gradient descent is the cornerstone of modern large-scale machine learning.

  17. Word Vectors
     ◮ Instead of manually designing features φ, can we learn the features themselves?
     ◮ Model parameter: now includes E ∈ R^{|V| × d}
     ◮ E_w ∈ R^d: a continuous, dense representation of word w ∈ V
     ◮ If we define q(y | x) as a differentiable function of E, we learn E itself.

  18. Simple Model?
     ◮ Parameters: E ∈ R^{|V| × d}, W ∈ R^{|V| × 2d}
     ◮ Model:
         q_{E,W}(y | x) = softmax_y( W [E_{x[−1]}; E_{x[−2]}] )
     ◮ Model estimation: minimize cross entropy (≡ MLE)
         E*, W* = argmin_{E ∈ R^{|V|×d}, W ∈ R^{|V|×2d}}  E_{(x,y) ∼ p_XY}[ − ln q_{E,W}(y | x) ]
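A minimal PyTorch sketch of this model; the class name and the choice of PyTorch are illustrative, not the course's reference implementation.

```python
import torch
import torch.nn as nn

class SimpleBigramEmbeddingModel(nn.Module):
    """q_{E,W}(y|x) = softmax_y( W [E_{x[-1]}; E_{x[-2]}] )."""
    def __init__(self, vocab_size, d):
        super().__init__()
        self.E = nn.Embedding(vocab_size, d)                # E in R^{|V| x d}
        self.W = nn.Linear(2 * d, vocab_size, bias=False)   # W in R^{|V| x 2d}

    def forward(self, prev1, prev2):                        # indices of x[-1] and x[-2]
        h = torch.cat([self.E(prev1), self.E(prev2)], dim=-1)
        return self.W(h)                                    # logits; softmax_y applied in the loss

# Training with nn.CrossEntropyLoss on these logits minimizes the cross entropy (≡ MLE).
```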

  19. Neural Network
     Just a composition of linear/nonlinear functions:
         f(x) = W^{(L)} tanh( W^{(L−1)} ⋯ tanh( W^{(1)} x ) ⋯ )
     More a paradigm than a specific model:
     1. Transform your input x → f(x).
     2. Define a loss between f(x) and the target label y.
     3. Train the parameters by minimizing the loss.

  20. You've Already Seen Some Neural Networks…
     A log-linear model is a neural network with zero hidden layers and a softmax output layer:
         p(y | x) := exp([Wx]_y) / Σ_{y′} exp([Wx]_{y′}) = softmax_y(Wx)
     Get W by minimizing L(W) = − Σ_i log p(y_i | x_i).
     Linear regression is a neural network with zero hidden layers and the identity output layer:
         f(x) := Wx
     Get W by minimizing L(W) = Σ_i (y_i − f(x_i))^2.

  21. Feedforward Network
     Think: log-linear with an extra transformation.
     With 1 hidden layer:
         h^{(1)} = tanh(W^{(1)} x)
         p(y | x) = softmax_y(h^{(1)})
     With 2 hidden layers:
         h^{(1)} = tanh(W^{(1)} x)
         h^{(2)} = tanh(W^{(2)} h^{(1)})
         p(y | x) = softmax_y(h^{(2)})
     Again, get the parameters W^{(l)} by minimizing − Σ_i log p(y_i | x_i).
     ◮ Q. What's the catch?

  22. Training = Loss Minimization
     We can decrease any differentiable loss by following the gradient:
     1. Differentiate the loss with respect to the model parameters (backprop).
     2. Take a gradient step.

  23. Backpropagation
     ◮ Let J(θ) be any loss function differentiable with respect to θ ∈ R^d.
     ◮ The gradient of J with respect to θ at some point θ′ ∈ R^d,
           ∇_θ J(θ′) ∈ R^d,
       can be calculated automatically by backpropagation.
     ◮ Note/code: http://karlstratos.com/notes/backprop.pdf

  24. Bengio et al. (2003)
     ◮ Parameters: E ∈ R^{|V| × d}, W ∈ R^{d′ × nd}, V ∈ R^{|V| × d′}
     ◮ Model:
         q_{E,W,V}(y | x) = softmax_y( V tanh( W [E_{x[−1]}; …; E_{x[−n]}] ) )
     ◮ Model estimation: minimize cross entropy (≡ MLE)
         E*, W*, V* = argmin_{E ∈ R^{|V|×d}, W ∈ R^{d′×nd}, V ∈ R^{|V|×d′}}  E_{(x,y) ∼ p_XY}[ − ln q_{E,W,V}(y | x) ]
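A PyTorch sketch of the model exactly as written on the slide (the original paper additionally allows a direct connection from the embeddings to the output layer, omitted here; names and sizes are illustrative choices):

```python
import torch
import torch.nn as nn

class BengioLM(nn.Module):
    """q_{E,W,V}(y|x) = softmax_y( V tanh( W [E_{x[-1]}; ...; E_{x[-n]}] ) )."""
    def __init__(self, vocab_size, d, d_hidden, n):
        super().__init__()
        self.E = nn.Embedding(vocab_size, d)                     # E in R^{|V| x d}
        self.W = nn.Linear(n * d, d_hidden, bias=False)          # W in R^{d' x nd}
        self.V = nn.Linear(d_hidden, vocab_size, bias=False)     # V in R^{|V| x d'}

    def forward(self, context):               # context: (batch, n) indices of the previous n words
        h = self.E(context).flatten(1)        # concatenate the n word embeddings
        return self.V(torch.tanh(self.W(h)))  # logits; softmax_y applied in the loss
```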

  25. Bengio et al. (2003): Continued

  26. Collobert and Weston (2008)
     Nearest neighbors of trained word embeddings E ∈ R^{|V| × d}:
     https://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf
