Deep Residual Output Layers for Neural Language Generation


  1. Deep Residual Output Layers for Neural Language Generation. Nikolaos Pappas, James Henderson. June 13, 2019.

  2. Neural language generation. "Cat sat on ___?" Probability distribution at time t given context vector h_t ∈ R^d, weights W ∈ R^{d×|V|}, and bias b ∈ R^{|V|}:
     p(y_t | y_1^{t−1}) ∝ exp(W^T h_t + b)
     [Figure: an input encoder reads "<s> Cat sat on" and produces hidden states h_1, h_2, h_3, h_4.]
     • Output layer parameterisation depends on the vocabulary size |V| → sample inefficient.
     • Output layer power depends on the hidden dimension or rank d, the "softmax bottleneck" → high overhead and prone to overfitting.
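For reference, a minimal NumPy sketch of this standard softmax output layer follows. The sizes, random initialisation, and variable names are illustrative assumptions, chosen only to show that W grows linearly with the vocabulary size |V|.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 256                      # vocabulary size |V|, hidden dimension d
W = rng.normal(scale=0.1, size=(d, V))  # output weights: d * |V| parameters
b = np.zeros(V)                         # output bias
h_t = rng.normal(size=d)                # context vector at time t

logits = W.T @ h_t + b                  # unnormalised score for every word
p = np.exp(logits - logits.max())       # softmax with numerical stabilisation
p /= p.sum()                            # p(y_t | y_1^{t-1}), sums to 1

print(W.size)                           # 2560000 parameters just for the output layer
```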

  3. Previous work. For the same setup, p(y_t | y_1^{t−1}) ∝ exp(W^T h_t + b) with context vector h_t ∈ R^d:
     • Output layer parameterisation no longer depends on the vocabulary size |V| (1) → more sample efficient.
     • Output layer power still depends on the hidden dimension or rank d, the "softmax bottleneck" (2) → high overhead and prone to overfitting.
     Output similarity structure learning methods help with (1) but not yet with (2).

  4. Previous work: output structure learning. Factorisation of the probability distribution given the word embedding matrix E ∈ R^{|V|×d}:
     p(y_t | y_1^{t−1}) ∝ g_out(E, V) g_in(E, y_1^{t−1}) + b
     [Figure: the output text w_1, w_2, ..., w_|V| is encoded by g_out, the input text y_1, y_2, ..., y_{t−1} is encoded by g_in into h_t, and their product plus bias b scores the next word y_t.]
     • Shallow label encoder networks such as weight tying [PW17], bilinear mapping [G18], and dual nonlinear mapping [P18].
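As a concrete illustration of this factorisation, here is a minimal NumPy sketch of the weight-tying and bilinear-mapping variants, where the output projection reuses the input embedding matrix E instead of a separate W ∈ R^{d×|V|}. The shapes, initialisation, and helper names are assumptions for the example, not code from the cited works.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 10_000, 256                         # vocabulary size, embedding/hidden dim
E = rng.normal(scale=0.1, size=(V, d))     # word embeddings, shared with the input side
h_t = rng.normal(size=d)                   # context vector from the input encoder
b = np.zeros(V)

# Weight tying [PW17]: g_out is the identity, so logits = E h_t + b.
p_tied = softmax(E @ h_t + b)

# Bilinear mapping [G18]: a learned d x d matrix sits between E and h_t,
# so the extra parameters do not grow with |V|.
B = rng.normal(scale=0.1, size=(d, d))
p_bilinear = softmax(E @ (B @ h_t) + b)

print(p_tied.shape, p_bilinear.shape)      # (10000,) (10000,)
```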

  5. Our contributions. Same output structure learning factorisation of the probability distribution given the word embedding E ∈ R^{|V|×d}:
     p(y_t | y_1^{t−1}) ∝ g_out(E, V) g_in(E, y_1^{t−1}) + b
     [Figure: as before, but the output text is now encoded into E^(k) by a deep label encoder.]
     • Generalize previous output similarity structure learning methods → more sample efficient.
     • Propose a deep output label encoder network with dropout between layers → avoids overfitting.
     • Increase output layer power with representation depth instead of rank d → low overhead.

  6. Label Encoder Network.
     [Figure: a stack of k layers f_out^(1), ..., f_out^(k) maps E = E^(0) to E^(1), ..., E^(k).]
     • Shares parameters across output labels with k nonlinear projections: E^(k) = f_out^(k)(E^(k−1)).
     • Preserves information across layers with residual connections: E^(k) = f_out^(k)(E^(k−1)) + E^(k−1) + E.
     • Avoids overfitting with standard or variational dropout for each layer i = 1, ..., k: f′_out^(i)(E^(i−1)) = δ ⊙ f_out^(i)(E^(i−1)).
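To make the three bullets concrete, below is a minimal NumPy sketch of such a deep residual label encoder. The tanh projection, dropout rate, random weights, and sizes are assumptions made for the sketch; the authors' actual implementation is in the repository linked on the last slide.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 256, 2                   # vocabulary size, embedding dim, encoder depth k
E = rng.normal(scale=0.1, size=(V, d))     # word embeddings, E^(0) = E
W = [rng.normal(scale=0.1, size=(d, d)) for _ in range(k)]  # per-layer projection weights

def f_out(i, X):
    """Nonlinear projection of layer i (tanh is an assumption for this sketch)."""
    return np.tanh(X @ W[i])

def label_encoder(E, k, p_drop=0.2, train=True):
    """E^(i) = dropout(f_out^(i)(E^(i-1))) + E^(i-1) + E, for i = 1..k."""
    E_prev = E
    for i in range(k):
        F = f_out(i, E_prev)
        if train:                           # standard dropout mask delta, rescaled
            delta = rng.binomial(1, 1 - p_drop, size=F.shape) / (1 - p_drop)
            F = delta * F
        E_prev = F + E_prev + E             # residual connections to the previous layer and to E
    return E_prev                           # E^(k), used in place of the output weight matrix

E_k = label_encoder(E, k)
h_t = rng.normal(size=d)                    # context vector from the input network
logits = E_k @ h_t                          # scores over the vocabulary (bias omitted)
print(E_k.shape, logits.shape)              # (10000, 256) (10000,)
```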

  7. Results.
     • Improve competitive architectures without increasing their dimension or rank.
     • Better transfer across low-resource output labels.

     Language modeling (WikiText-2)          ppl     sec/ep
     AWD-LSTM [M18]                          65.8     89 (1.0×)
     AWD-LSTM-DRILL                          61.9    106 (1.2×)
     AWD-LSTM-MoS [Y18]                      61.4    862 (9.7×)

     Machine translation (En→De, 32K BPE)    BLEU    min/ep
     Transformer [V17]                       27.3    111 (1.0×)
     Transformer-DRILL                       28.1    189 (1.7×)
     Transformer (big) [V17]                 28.4    779 (7.0×)

  8. Talk to us at Poster #104 in Pacific Ballroom. Thank you! http://github.com/idiap/drill
