  1. Recurrent Networks, and LSTMs, for NLP Michael Collins, Columbia University

  2. Representing Sequences
◮ Often we want to map some sequence x[1:n] = x_1 ... x_n to a label y or a distribution p(y | x[1:n])
◮ Examples:
  ◮ Language modeling: x[1:n] is the first n words in a document, y is the (n+1)'th word
  ◮ Sentiment analysis: x[1:n] is a sentence (or document), y is a label indicating whether the sentence is positive/neutral/negative about a particular topic (e.g., a particular restaurant)
  ◮ Machine translation: x[1:n] is a source-language sentence, y is a target-language sentence (or the first word in the target-language sentence)

  3. Representing Sequences (continued)
◮ Slightly more generally: map a sequence x[1:n] and a position i ∈ {1 ... n} to a label y or a distribution p(y | x[1:n], i)
◮ Examples:
  ◮ Tagging: x[1:n] is a sentence, i is a position in the sentence, y is the tag for position i
  ◮ Dependency parsing: x[1:n] is a sentence, i is a position in the sentence, y ∈ {1 ... n}, y ≠ i is the head for word x_i in the dependency parse

  4. A Simple Recurrent Network
Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A label y ∈ {1 ... K}. An integer m defining the size of the hidden dimension.
Parameters: W_hh ∈ R^{m×m}, W_hx ∈ R^{m×d}, b_h ∈ R^m, h^(0) ∈ R^m, V ∈ R^{K×m}, γ ∈ R^K. Transfer function g: R^m → R^m.
Definitions:
  θ = {W_hh, W_hx, b_h, h^(0)}
  R(x^(t), h^(t-1); θ) = g(W_hx x^(t) + W_hh h^(t-1) + b_h)
Computational Graph:
◮ For t = 1 ... n
  ◮ h^(t) = R(x^(t), h^(t-1); θ)
◮ l = V h^(n) + γ,   q = LS(l),   o = -q_y
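A minimal numpy sketch of this forward computation follows; the function name simple_rnn_forward, the choice g = tanh, the zero/random initial values, and the toy dimensions are illustrative assumptions rather than anything fixed by the slides.

```python
import numpy as np

def log_softmax(l):
    # LS(l) = l - logsumexp(l), computed in a numerically stable way
    m = np.max(l)
    return l - (m + np.log(np.sum(np.exp(l - m))))

def simple_rnn_forward(xs, y, W_hh, W_hx, b_h, h0, V, gamma, g=np.tanh):
    """Forward pass of the simple recurrent network.

    xs is a list of n input vectors x^(t) in R^d; y is the gold label
    (0-indexed here). Returns the loss o = -q_y and the final state h^(n).
    """
    h = h0
    for x in xs:
        # h^(t) = R(x^(t), h^(t-1); theta) = g(W_hx x^(t) + W_hh h^(t-1) + b_h)
        h = g(W_hx @ x + W_hh @ h + b_h)
    l = V @ h + gamma        # logits l = V h^(n) + gamma
    q = log_softmax(l)       # q = LS(l)
    return -q[y], h          # o = -q_y

# Toy usage with random parameters (dimensions chosen arbitrarily)
d, m, K, n = 4, 8, 3, 5
rng = np.random.default_rng(0)
params = dict(W_hh=0.1 * rng.normal(size=(m, m)), W_hx=0.1 * rng.normal(size=(m, d)),
              b_h=np.zeros(m), h0=np.zeros(m),
              V=0.1 * rng.normal(size=(K, m)), gamma=np.zeros(K))
loss, h_n = simple_rnn_forward([rng.normal(size=d) for _ in range(n)], y=1, **params)
print(loss)
```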

  5. The Computational Graph

  6. A Problem in Training: Exploding and Vanishing Gradients
◮ Calculation of gradients involves multiplication of long chains of Jacobians
◮ This leads to exploding and vanishing gradients
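A small numeric illustration of the problem (not from the slides): for the simple RNN, dh^(t)/dh^(t-1) = diag(g'(a^(t))) W_hh, so backpropagating through n positions multiplies n such Jacobians. The sketch below, which substitutes random values for the g' terms, shows how the norm of the product typically either decays towards zero or blows up depending on the scale of W_hh.

```python
import numpy as np

# Product of Jacobians J_t = diag(g'(a^(t))) W_hh over a long sequence.
# Depending on the scale (roughly, the spectral radius) of W_hh, the norm of
# the product tends to shrink towards zero or grow without bound.
rng = np.random.default_rng(0)
m, n = 50, 60

for scale in [0.5, 2.0]:
    W_hh = scale * rng.normal(size=(m, m)) / np.sqrt(m)
    prod = np.eye(m)
    for t in range(n):
        gprime = rng.uniform(0.5, 1.0, size=m)   # stand-in for g'(a^(t)) entries
        prod = (gprime[:, None] * W_hh) @ prod   # multiply by J_t = diag(g') W_hh
    print(f"scale={scale}: norm of product of {n} Jacobians = {np.linalg.norm(prod):.2e}")
```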

  7. LSTMs (Long Short-Term Memory units)
◮ Old definition of the recurrent update:
  θ = {W_hh, W_hx, b_h, h^(0)}
  R(x^(t), h^(t-1); θ) = g(W_hx x^(t) + W_hh h^(t-1) + b_h)
◮ LSTMs give an alternative definition of R(x^(t), h^(t-1); θ).

  8. Definition of Sigmoid Function, Element-Wise Product
◮ Given any integer d ≥ 1, σ^d: R^d → R^d is the function that maps a vector v to a vector σ^d(v) such that for i = 1 ... d,
  σ^d_i(v) = e^{v_i} / (1 + e^{v_i})
◮ Given vectors a ∈ R^d and b ∈ R^d, c = a ⊙ b has components c_i = a_i × b_i for i = 1 ... d
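Both operations map directly onto numpy; the short check below is just an illustration of the two definitions.

```python
import numpy as np

def sigmoid(v):
    # sigma^d(v): element-wise logistic function, sigma^d_i(v) = e^{v_i} / (1 + e^{v_i})
    return 1.0 / (1.0 + np.exp(-v))

a = np.array([1.0, -2.0, 0.5])
b = np.array([0.5,  4.0, 2.0])
print(sigmoid(a))   # element-wise sigmoid
print(a * b)        # a ⊙ b: element-wise (Hadamard) product, c_i = a_i * b_i
```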

  9. LSTM Equations (from Ilya Sutskever, PhD thesis)
Maintain s_t, s̃_t, h_t as hidden state at position t. s_t is memory, intuitively allows long-term memory. The function
  s_t, s̃_t, h_t = LSTM(x_t, s_{t-1}, s̃_{t-1}, h_{t-1}; θ)
is defined as:
  u_t = CONCAT(h_{t-1}, x_t, s̃_{t-1})
  h_t = g(W_h u_t + b_h)                (hidden state)
  i_t = g(W_i u_t + b_i)                ("input")
  ι_t = σ(W_ι u_t + b_ι)                ("input gate")
  o_t = σ(W_o u_t + b_o)                ("output gate")
  f_t = σ(W_f u_t + b_f)                ("forget gate")
  s_t = s_{t-1} ⊙ f_t + i_t ⊙ ι_t       (forget and input gates control update of memory)
  s̃_t = s_t ⊙ o_t                       (output gate controls information that can leave the unit)
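A direct numpy transcription of these equations is sketched below, using the sigmoid helper from the previous snippet; storing the parameters in a dict with keys such as W_h, W_iota and b_f is an implementation choice, not something specified in the slides.

```python
import numpy as np

def lstm_cell(x_t, s_prev, s_tilde_prev, h_prev, theta, g=np.tanh):
    """One LSTM step: returns (s_t, s_tilde_t, h_t).

    theta is a dict of weight matrices W_h, W_i, W_iota, W_o, W_f (each of
    shape m x (m + d + m)) and bias vectors b_h, b_i, b_iota, b_o, b_f.
    Assumes the sigmoid helper from the previous snippet.
    """
    u = np.concatenate([h_prev, x_t, s_tilde_prev])           # u_t = CONCAT(h_{t-1}, x_t, s~_{t-1})
    h_t    = g(theta["W_h"] @ u + theta["b_h"])               # hidden state
    i_t    = g(theta["W_i"] @ u + theta["b_i"])               # "input"
    iota_t = sigmoid(theta["W_iota"] @ u + theta["b_iota"])   # "input gate"
    o_t    = sigmoid(theta["W_o"] @ u + theta["b_o"])         # "output gate"
    f_t    = sigmoid(theta["W_f"] @ u + theta["b_f"])         # "forget gate"
    s_t = s_prev * f_t + i_t * iota_t                         # memory update
    s_tilde_t = s_t * o_t                                     # gated memory output
    return s_t, s_tilde_t, h_t
```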

  10. An LSTM-based Recurrent Network
Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A label y ∈ {1 ... K}.
Computational Graph:
◮ h^(0), s^(0), s̃^(0) are set to some initial values.
◮ For t = 1 ... n
  ◮ s^(t), s̃^(t), h^(t) = LSTM(x^(t), s^(t-1), s̃^(t-1), h^(t-1); θ)
◮ l = V_lh h^(n) + V_ls s̃^(n) + γ,   q = LS(l),   o = -q_y
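One way this graph might be written in code, reusing the lstm_cell and log_softmax helpers sketched earlier; zero vectors for h^(0), s^(0), s̃^(0) are just one possible choice of "some initial values".

```python
import numpy as np

def lstm_classifier_loss(xs, y, theta, V_lh, V_ls, gamma, m):
    """Loss o = -q_y for the LSTM-based network over a sequence xs.

    Assumes the lstm_cell and log_softmax helpers defined in the earlier
    sketches; m is the hidden dimension.
    """
    h, s, s_tilde = np.zeros(m), np.zeros(m), np.zeros(m)   # h^(0), s^(0), s~^(0)
    for x in xs:
        s, s_tilde, h = lstm_cell(x, s, s_tilde, h, theta)
    l = V_lh @ h + V_ls @ s_tilde + gamma   # l = V_lh h^(n) + V_ls s~^(n) + gamma
    q = log_softmax(l)                      # q = LS(l)
    return -q[y]                            # o = -q_y
```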

  11. The Computational Graph

  12. An LSTM-based Recurrent Network for Tagging
Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A sequence y_1 ... y_n of tags.
Computational Graph:
◮ h^(0), s^(0), s̃^(0) are set to some initial values.
◮ For t = 1 ... n
  ◮ s^(t), s̃^(t), h^(t) = LSTM(x^(t), s^(t-1), s̃^(t-1), h^(t-1); θ)
◮ For t = 1 ... n
  ◮ l_t = V × CONCAT(h^(t), s̃^(t)) + γ,   q_t = LS(l_t),   o_t = -(q_t)_{y_t}
◮ o = Σ_{t=1}^n o_t
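A sketch of the tagging version, again assuming the lstm_cell and log_softmax helpers from the earlier snippets and zero initial states; V here maps CONCAT(h^(t), s̃^(t)) in R^{2m} to the K tag logits.

```python
import numpy as np

def lstm_tagger_loss(xs, ys, theta, V, gamma, m):
    """Summed per-position loss o = sum_t -(q_t)_{y_t} for the tagging network.

    Assumes the lstm_cell and log_softmax helpers defined earlier.
    """
    h, s, s_tilde = np.zeros(m), np.zeros(m), np.zeros(m)
    states = []
    for x in xs:                                   # first loop: run the LSTM
        s, s_tilde, h = lstm_cell(x, s, s_tilde, h, theta)
        states.append((h, s_tilde))
    o = 0.0
    for (h_t, s_tilde_t), y_t in zip(states, ys):  # second loop: per-position losses
        l_t = V @ np.concatenate([h_t, s_tilde_t]) + gamma
        o += -log_softmax(l_t)[y_t]
    return o
```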

  13. The Computational Graph

  14. A bi-directional LSTM (bi-LSTM) for tagging
Inputs: A sequence x_1 ... x_n where each x_j ∈ R^d. A sequence y_1 ... y_n of tags.
Definitions: θ^F and θ^B are parameters of a forward and a backward LSTM.
Computational Graph:
◮ h^(0), s^(0), s̃^(0), η^(n+1), α^(n+1), α̃^(n+1) are set to some initial values.
◮ For t = 1 ... n
  ◮ s^(t), s̃^(t), h^(t) = LSTM(x^(t), s^(t-1), s̃^(t-1), h^(t-1); θ^F)
◮ For t = n ... 1
  ◮ α^(t), α̃^(t), η^(t) = LSTM(x^(t), α^(t+1), α̃^(t+1), η^(t+1); θ^B)
◮ For t = 1 ... n
  ◮ l_t = V × CONCAT(h^(t), s̃^(t), η^(t), α̃^(t)) + γ,   q_t = LS(l_t),   o_t = -(q_t)_{y_t}
◮ o = Σ_{t=1}^n o_t
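A possible rendering of the bi-LSTM tagger, under the same assumptions as the previous sketches (lstm_cell and log_softmax in scope, zero initial states); V now maps the concatenation of forward and backward states in R^{4m} to the tag logits.

```python
import numpy as np

def bilstm_tagger_loss(xs, ys, theta_F, theta_B, V, gamma, m):
    """Bi-LSTM tagging loss o = sum_t -(q_t)_{y_t}.

    Assumes the lstm_cell and log_softmax helpers defined earlier.
    """
    n = len(xs)
    # Forward LSTM over positions 1 ... n with parameters theta_F
    h, s, s_tilde = np.zeros(m), np.zeros(m), np.zeros(m)
    fwd = []
    for x in xs:
        s, s_tilde, h = lstm_cell(x, s, s_tilde, h, theta_F)
        fwd.append((h, s_tilde))
    # Backward LSTM over positions n ... 1 with parameters theta_B
    eta, alpha, alpha_tilde = np.zeros(m), np.zeros(m), np.zeros(m)
    bwd = [None] * n
    for t in reversed(range(n)):
        alpha, alpha_tilde, eta = lstm_cell(xs[t], alpha, alpha_tilde, eta, theta_B)
        bwd[t] = (eta, alpha_tilde)
    # Per-position losses from the concatenated forward/backward states
    o = 0.0
    for t in range(n):
        h_t, s_tilde_t = fwd[t]
        eta_t, alpha_tilde_t = bwd[t]
        l_t = V @ np.concatenate([h_t, s_tilde_t, eta_t, alpha_tilde_t]) + gamma
        o += -log_softmax(l_t)[ys[t]]
    return o
```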

  15. The Computational Graph

  16. Results on Language Modeling
◮ Results from One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling, Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants.

  17. Results on Dependency Parsing
◮ Deep Biaffine Attention for Neural Dependency Parsing, Dozat and Manning.
◮ Uses a bidirectional LSTM to represent each word
◮ Uses LSTM representations to predict the head for each word in the sentence
◮ Unlabeled dependency accuracy: 95.75%

  18. Conclusions
◮ Recurrent units map input sequences x_1 ... x_n to representations h_1 ... h_n. The vector h_n can be used to predict a label for the entire sentence. Each vector h_i for i = 1 ... n can be used to make a prediction for position i
◮ LSTMs are recurrent units that make use of more involved recurrent updates. They maintain a "memory" state. Empirically they perform extremely well
◮ Bi-directional LSTMs allow representation of both the information before and after a position i in the sentence
◮ Many applications: language modeling, tagging, parsing, speech recognition, and (as we will see soon) machine translation
