Word Embeddings in Feedforward Networks; Tagging and Dependency Parsing using Feedforward Networks



  1. Word Embeddings in Feedforward Networks; Tagging and Dependency Parsing using Feedforward Networks
  Michael Collins, Columbia University

  2. Overview
  ◮ Introduction
  ◮ Multi-layer feedforward networks
  ◮ Representing words as vectors (“word embeddings”)
  ◮ The dependency parsing problem
  ◮ Dependency parsing using a shift-reduce neural-network model

  3. Multi-Layer Feedforward Networks
  ◮ An integer $d$ specifying the input dimension. A set $\mathcal{Y}$ of output labels with $|\mathcal{Y}| = K$.
  ◮ An integer $J$ specifying the number of hidden layers in the network.
  ◮ An integer $m_j$ for $j \in \{1 \ldots J\}$ specifying the number of hidden units in the $j$'th layer.
  ◮ A matrix $W^1 \in \mathbb{R}^{m_1 \times d}$ and a vector $b^1 \in \mathbb{R}^{m_1}$ associated with the first layer.
  ◮ For each $j \in \{2 \ldots J\}$, a matrix $W^j \in \mathbb{R}^{m_j \times m_{j-1}}$ and a vector $b^j \in \mathbb{R}^{m_j}$ associated with the $j$'th layer.
  ◮ For each $j \in \{1 \ldots J\}$, a transfer function $g^j : \mathbb{R}^{m_j} \rightarrow \mathbb{R}^{m_j}$ associated with the $j$'th layer.
  ◮ A matrix $V \in \mathbb{R}^{K \times m_J}$ and a vector $\gamma \in \mathbb{R}^K$ specifying the parameters in the output layer.

  4. Multi-Layer Feedforward Networks (continued)
  ◮ Calculate output of first layer: $z^1 \in \mathbb{R}^{m_1} = W^1 x^i + b^1$, $h^1 \in \mathbb{R}^{m_1} = g^1(z^1)$
  ◮ Calculate outputs of layers $2 \ldots J$: for $j = 2 \ldots J$, $z^j \in \mathbb{R}^{m_j} = W^j h^{j-1} + b^j$, $h^j \in \mathbb{R}^{m_j} = g^j(z^j)$
  ◮ Calculate output value: $l \in \mathbb{R}^K = V h^J + \gamma$, $q \in \mathbb{R}^K = \mathrm{LS}(l)$, $o \in \mathbb{R} = -\log q_{y^i}$
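
  To make these equations concrete, here is a minimal numpy sketch of the forward computation (not from the slides; the layer sizes and the tanh transfer function are arbitrary illustrative choices):

```python
import numpy as np

def log_softmax(l):
    # numerically stable log of the softmax of l
    m = np.max(l)
    return l - m - np.log(np.sum(np.exp(l - m)))

def feedforward_loss(x, Ws, bs, gs, V, gamma, y):
    # Ws, bs, gs: per-layer weight matrices W^j, bias vectors b^j,
    # and transfer functions g^j, for j = 1 ... J
    h = x
    for W, b, g in zip(Ws, bs, gs):
        z = W @ h + b              # z^j = W^j h^{j-1} + b^j
        h = g(z)                   # h^j = g^j(z^j)
    l = V @ h + gamma              # output layer: l = V h^J + gamma
    q = log_softmax(l)             # log probabilities over the K labels
    return -q[y]                   # o = -log softmax(l)[y]

# toy sizes: d = 4 inputs, one hidden layer of m_1 = 5 units, K = 3 labels
rng = np.random.default_rng(0)
Ws, bs, gs = [rng.normal(size=(5, 4))], [np.zeros(5)], [np.tanh]
V, gamma = rng.normal(size=(3, 5)), np.zeros(3)
print(feedforward_loss(rng.normal(size=4), Ws, bs, gs, V, gamma, y=0))
```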

  5. Overview
  ◮ Introduction
  ◮ Multi-layer feedforward networks
  ◮ Representing words as vectors (“word embeddings”)
  ◮ The dependency parsing problem
  ◮ Dependency parsing using a shift-reduce neural-network model

  6. An Example: Part-of-Speech Tagging
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
  • There are many possible tags in the position ??: {NN, NNS, Vt, Vi, IN, DT, . . . }
  • The task: model the distribution $p(t_j \mid t_1, \ldots, t_{j-1}, w_1 \ldots w_n)$ where $t_j$ is the $j$'th tag in the sequence and $w_j$ is the $j$'th word
  • The input to the neural network will be $\langle t_1 \ldots t_{j-1}, w_1 \ldots w_n, j \rangle$

  7. One-Hot Encodings of Words, Tags etc.
  ◮ A dictionary $D$ with size $s(D)$ maps each word $w$ in the vocabulary to an integer $\mathrm{Index}(w, D)$ in the range $1 \ldots s(D)$.
  $\mathrm{Index}(\textrm{the}, D) = 1$, $\mathrm{Index}(\textrm{dog}, D) = 2$, $\mathrm{Index}(\textrm{cat}, D) = 3$, $\mathrm{Index}(\textrm{saw}, D) = 4$, . . .
  ◮ For any word $w$ and dictionary $D$, $\mathrm{Onehot}(w, D)$ maps $w$ to a “one-hot vector” $u = \mathrm{Onehot}(w, D) \in \mathbb{R}^{s(D)}$. We have $u_j = 1$ for $j = \mathrm{Index}(w, D)$ and $u_j = 0$ otherwise.

  8. One-Hot Encodings of Words, Tags etc. (continued)
  ◮ A dictionary $D$ with size $s(D)$ maps each word $w$ in the vocabulary to an integer in the range $1 \ldots s(D)$.
  $\mathrm{Index}(\textrm{the}, D) = 1$, $\mathrm{Index}(\textrm{dog}, D) = 2$, $\mathrm{Index}(\textrm{cat}, D) = 3$, . . .
  $\mathrm{Onehot}(\textrm{the}, D) = [1, 0, 0, \ldots]$
  $\mathrm{Onehot}(\textrm{dog}, D) = [0, 1, 0, \ldots]$
  $\mathrm{Onehot}(\textrm{cat}, D) = [0, 0, 1, \ldots]$
  . . .
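
  A short numpy sketch of Index and Onehot (the toy dictionary is illustrative, not from the slides):

```python
import numpy as np

# a toy dictionary D mapping each word to an index in 1 ... s(D)
D = {"the": 1, "dog": 2, "cat": 3, "saw": 4}

def onehot(w, D):
    # Onehot(w, D): a vector u in R^{s(D)} with u_j = 1 at j = Index(w, D)
    u = np.zeros(len(D))
    u[D[w] - 1] = 1.0          # shift the 1-based index to 0-based numpy indexing
    return u

print(onehot("dog", D))        # [0. 1. 0. 0.]
```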

  9. The Concatenation Operation
  ◮ Given column vectors $v^i \in \mathbb{R}^{d_i}$ for $i = 1 \ldots n$, $z \in \mathbb{R}^d = \mathrm{Concat}(v^1, v^2, \ldots, v^n)$ where $d = \sum_{i=1}^n d_i$
  ◮ $z$ is a vector formed by concatenating the vectors $v^1 \ldots v^n$
  ◮ $z$ is a column vector of dimension $\sum_i d_i$

  10. The Concatenation Operation (continued)
  ◮ Given vectors $v^i \in \mathbb{R}^{d_i}$ for $i = 1 \ldots n$, $z \in \mathbb{R}^d = \mathrm{Concat}(v^1, v^2, \ldots, v^n)$ where $d = \sum_{i=1}^n d_i$
  ◮ The Jacobians $\frac{\partial z}{\partial v^i} \in \mathbb{R}^{d \times d_i}$ have entries
  $\left[ \frac{\partial z}{\partial v^i} \right]_{j,k} = 1$ if $j = k + \sum_{i' < i} d_{i'}$, and $\left[ \frac{\partial z}{\partial v^i} \right]_{j,k} = 0$ otherwise
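
  A small sketch that builds this Jacobian explicitly (illustrative, numpy only): the Jacobian with respect to $v^i$ is an identity block placed at row offset $\sum_{i' < i} d_{i'}$.

```python
import numpy as np

def concat_jacobian(dims, i):
    # Jacobian of z = Concat(v^1 ... v^n) with respect to v^i (0-based i):
    # a d x d_i matrix with an identity block at row offset sum_{i' < i} d_{i'}
    d, offset = sum(dims), sum(dims[:i])
    J = np.zeros((d, dims[i]))
    J[offset:offset + dims[i], :] = np.eye(dims[i])
    return J

print(concat_jacobian([2, 3, 2], i=1))   # rows 2-4 contain the 3 x 3 identity
```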

  11. A Single-Layer Computational Network for Tagging
  Inputs: A training example $x^i = \langle t_1 \ldots t_{j-1}, w_1 \ldots w_n, j \rangle$, $y^i \in \mathcal{Y}$. A word dictionary $D$ with size $s(D)$, a tag dictionary $T$ with size $s(T)$. Parameters of a single-layer feedforward network.
  Computational Graph:
  $t'_{-2} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-2}, T)$
  $t'_{-1} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-1}, T)$
  $w'_{-1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j-1}, D)$
  $w'_{0} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j}, D)$
  $w'_{+1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j+1}, D)$
  $u \in \mathbb{R}^{2s(T) + 3s(D)} = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$
  $z = Wu + b$, $h = g(z)$, $l = Vh + \gamma$, $q = \mathrm{LS}(l)$, $o = -\log q_{y^i}$
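
  A sketch of how the concatenated input $u$ might be assembled (the helper names and the 0-based position handling are illustrative assumptions, not from the slides):

```python
import numpy as np

def onehot_index(index, size):
    # one-hot vector with a 1 at the given 1-based index
    u = np.zeros(size)
    u[index - 1] = 1.0
    return u

def build_input(tags, words, j, T, D):
    # u = Concat(t'_{-2}, t'_{-1}, w'_{-1}, w'_0, w'_{+1}) in R^{2 s(T) + 3 s(D)}
    # tags/words are lists; T and D map tags/words to 1-based indices;
    # j is the (0-based) position being tagged
    return np.concatenate([
        onehot_index(T[tags[j - 2]], len(T)),   # t'_{-2}
        onehot_index(T[tags[j - 1]], len(T)),   # t'_{-1}
        onehot_index(D[words[j - 1]], len(D)),  # w'_{-1}
        onehot_index(D[words[j]], len(D)),      # w'_0
        onehot_index(D[words[j + 1]], len(D)),  # w'_{+1}
    ])
```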

  12. The Number of Parameters
  $t'_{-2} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-2}, T)$
  . . .
  $w'_{+1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j+1}, D)$
  $u = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$
  $z \in \mathbb{R}^m = Wu + b$
  . . .
  ◮ An example: $s(T) = 50$ (50 tags), $s(D) = 10{,}000$ (10,000 words), $m = 1000$ (1000 neurons in the single layer)
  ◮ Then $W \in \mathbb{R}^{m \times (2s(T) + 3s(D))}$ with $m = 1000$ and $2s(T) + 3s(D) = 30{,}100$, so there are $m \times (2s(T) + 3s(D)) = 30{,}100{,}000$ parameters in the matrix $W$
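
  The arithmetic, checked directly:

```python
s_T, s_D, m = 50, 10_000, 1000
print(2 * s_T + 3 * s_D)        # 30,100 input dimensions
print(m * (2 * s_T + 3 * s_D))  # 30,100,000 parameters in W
```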

  13. An Example
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
  $t'_{-2} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-2}, T)$
  $t'_{-1} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-1}, T)$
  $w'_{-1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j-1}, D)$
  $w'_{0} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j}, D)$
  $w'_{+1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j+1}, D)$
  $u = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$
  . . .

  14. Embedding Matrices
  ◮ Given a word $w$ and a word dictionary $D$, we can map $w$ to a one-hot representation $w' \in \mathbb{R}^{s(D) \times 1} = \mathrm{Onehot}(w, D)$
  ◮ Now assume we have an embedding dictionary $E \in \mathbb{R}^{e \times s(D)}$ where $e$ is some integer. Typical values of $e$ are $e = 100$ or $e = 200$
  ◮ We can now map the one-hot representation $w'$ to $\underbrace{w''}_{e \times 1} = \underbrace{E}_{e \times s(D)} \underbrace{w'}_{s(D) \times 1} = E \times \mathrm{Onehot}(w, D)$
  ◮ Equivalently, a word $w$ is mapped to a vector $E(:, j) \in \mathbb{R}^e$ where $j = \mathrm{Index}(w, D)$ is the integer that word $w$ is mapped to, and $E(:, j)$ is the $j$'th column of the matrix.
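
  A quick numerical check of this equivalence (toy sizes; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
e, sD = 4, 6                       # toy sizes; typical values are e = 100 or 200
E = rng.normal(size=(e, sD))       # embedding matrix E in R^{e x s(D)}

j = 3                              # Index(w, D) for some word w (1-based)
w_prime = np.zeros(sD)
w_prime[j - 1] = 1.0               # w' = Onehot(w, D)

# E x Onehot(w, D) simply selects the j'th column E(:, j)
assert np.allclose(E @ w_prime, E[:, j - 1])
```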

  15. Embedding Matrices vs. One-hot Vectors
  ◮ One-hot representation: $w' \in \mathbb{R}^{s(D) \times 1} = \mathrm{Onehot}(w, D)$. This representation is high-dimensional and sparse.
  ◮ Embedding representation: $\underbrace{w''}_{e \times 1} = \underbrace{E}_{e \times s(D)} \underbrace{w'}_{s(D) \times 1} = E \times \mathrm{Onehot}(w, D)$. This representation is low-dimensional and dense.
  ◮ The embedding matrices can be learned using stochastic gradient descent and backpropagation (each entry of $E$ is a new parameter in the model)
  ◮ Critically, embeddings allow information to be shared between words: e.g., words with similar meaning or syntax get mapped to “similar” embeddings

  16. A Single-Layer Computational Network for Tagging
  Inputs: A training example $x^i = \langle t_1 \ldots t_{j-1}, w_1 \ldots w_n, j \rangle$, $y^i \in \mathcal{Y}$. A word dictionary $D$ with size $s(D)$, a tag dictionary $T$ with size $s(T)$. A word embedding matrix $E \in \mathbb{R}^{e \times s(D)}$. A tag embedding matrix $A \in \mathbb{R}^{a \times s(T)}$. Parameters of a single-layer feedforward network.
  Computational Graph:
  $t'_{-2} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-2}, T)$
  $t'_{-1} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-1}, T)$
  $w'_{-1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j-1}, D)$
  $w'_{0} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j}, D)$
  $w'_{+1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j+1}, D)$
  $u \in \mathbb{R}^{2a + 3e} = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$
  $z = Wu + b$, $h = g(z)$, $l = Vh + \gamma$, $q = \mathrm{LS}(l)$, $o = -\log q_{y^i}$
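
  A sketch of the embedded input construction, mirroring the earlier one-hot version (the helper names are illustrative assumptions, not from the slides):

```python
import numpy as np

def embed(M, index):
    # M @ Onehot(., .) is just column selection: the index'th column (1-based)
    return M[:, index - 1]

def build_embedded_input(tags, words, j, T, D, A, E):
    # u = Concat(t'_{-2}, t'_{-1}, w'_{-1}, w'_0, w'_{+1}) in R^{2a + 3e}
    return np.concatenate([
        embed(A, T[tags[j - 2]]),    # t'_{-2} in R^a
        embed(A, T[tags[j - 1]]),    # t'_{-1} in R^a
        embed(E, D[words[j - 1]]),   # w'_{-1} in R^e
        embed(E, D[words[j]]),       # w'_0   in R^e
        embed(E, D[words[j + 1]]),   # w'_{+1} in R^e
    ])
```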

  17. An Example
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
  $t'_{-2} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-2}, T)$
  $t'_{-1} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-1}, T)$
  $w'_{-1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j-1}, D)$
  $w'_{0} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j}, D)$
  $w'_{+1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j+1}, D)$
  $u \in \mathbb{R}^{2a + 3e} = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$

  18. Calculating Jacobians
  $w'_{0} \in \mathbb{R}^e = E \times \mathrm{Onehot}(w, D)$
  Equivalently: $(w'_{0})_j = \sum_k E_{j,k} \times \mathrm{Onehot}_k(w, D)$
  ◮ We need to calculate the Jacobian $\frac{\partial w'_{0}}{\partial E}$, which has entries
  $\left[ \frac{\partial w'_{0}}{\partial E} \right]_{j,(j',k)} = 1$ if $j = j'$ and $\mathrm{Onehot}_k(w, D) = 1$, and $0$ otherwise
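
  A brute-force numerical check of this sparsity pattern (illustrative): bumping an entry $E_{j',k}$ changes $(w'_0)_{j'}$ by exactly 1 when $k = \mathrm{Index}(w, D)$, and changes nothing otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
e, sD, idx = 3, 5, 2                     # toy sizes; Index(w, D) = 2
E = rng.normal(size=(e, sD))
w_prime = np.zeros(sD)
w_prime[idx - 1] = 1.0                   # Onehot(w, D)
base = E @ w_prime                       # w'_0 = E x Onehot(w, D)

for jp in range(e):
    for k in range(sD):
        Ep = E.copy()
        Ep[jp, k] += 1.0                 # perturb a single entry of E
        delta = Ep @ w_prime - base      # resulting change in w'_0
        expected = np.zeros(e)
        if k == idx - 1:                 # only column Index(w, D) has any effect
            expected[jp] = 1.0
        assert np.allclose(delta, expected)
print("Jacobian sparsity pattern confirmed")
```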

  19. An Additional Perspective
  One-hot network (its variables written with bars):
  $\bar{t}'_{-2} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-2}, T)$
  . . .
  $\bar{w}'_{+1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j+1}, D)$
  $\bar{u} = \mathrm{Concat}(\bar{t}'_{-2} \ldots \bar{w}'_{+1})$
  $\bar{z} \in \mathbb{R}^m = \bar{W}\bar{u} + b$
  Embedding network:
  $t'_{-2} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-2}, T)$
  . . .
  $w'_{+1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j+1}, D)$
  $u = \mathrm{Concat}(t'_{-2} \ldots w'_{+1})$
  $z \in \mathbb{R}^m = Wu + b$
  ◮ If we set $\underbrace{\bar{W}}_{m \times (2s(T)+3s(D))} = \underbrace{W}_{m \times (2a+3e)} \times \underbrace{\mathrm{Diag}(A, A, E, E, E)}_{(2a+3e) \times (2s(T)+3s(D))}$ then $Wu + b = \bar{W}\bar{u} + b$, hence $z = \bar{z}$

  20. An Additional Perspective (continued)
  ◮ If we set $\underbrace{\bar{W}}_{m \times (2s(T)+3s(D))} = \underbrace{W}_{m \times (2a+3e)} \times \underbrace{\mathrm{Diag}(A, A, E, E, E)}_{(2a+3e) \times (2s(T)+3s(D))}$ then $Wu + b = \bar{W}\bar{u} + b$, hence $z = \bar{z}$
  ◮ An example: $s(T) = 50$ (50 tags), $s(D) = 10{,}000$ (10,000 words), $a = e = 100$ (recall $a$, $e$ are the sizes of the embeddings for tags and words respectively), $m = 1000$ (1000 neurons)
  ◮ Then we have parameters $\underbrace{\bar{W}}_{1000 \times 30{,}100}$ vs. $\underbrace{W}_{1000 \times 500}$, $\underbrace{A}_{100 \times 50}$, $\underbrace{E}_{100 \times 10{,}000}$
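
  A numerical check of this equivalence (a sketch with toy sizes; scipy's block_diag builds the block-diagonal matrix Diag(A, A, E, E, E)):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
sT, sD, a, e, m = 5, 8, 2, 3, 4           # toy sizes, far smaller than the slides'

A = rng.normal(size=(a, sT))              # tag embedding matrix
E = rng.normal(size=(e, sD))              # word embedding matrix
W = rng.normal(size=(m, 2 * a + 3 * e))   # embedding-network weight matrix

Diag = block_diag(A, A, E, E, E)          # (2a+3e) x (2s(T)+3s(D))
W_bar = W @ Diag                          # equivalent one-hot-network matrix

def onehot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# u_bar: concatenation of two random tag one-hots and three word one-hots
u_bar = np.concatenate([onehot(rng.integers(sT), sT) for _ in range(2)]
                       + [onehot(rng.integers(sD), sD) for _ in range(3)])
u = Diag @ u_bar                          # the corresponding embedded input

assert np.allclose(W_bar @ u_bar, W @ u)  # hence z = z_bar
print("z = z_bar confirmed")
```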
