Deep Residual Output Layers for Neural Language Generation
Nikolaos Pappas, James Henderson
June 13, 2019
Neural language generation

[Figure: recurrent language model — hidden states h1, h2, h3, h4 over the input "<s> Cat sat on", predicting the next word "?"]

Probability distribution at time t given context vector h_t ∈ R^d, weights W ∈ R^{d×|V|} and bias b ∈ R^{|V|}:

    p(y_t | y_1^{t−1}) ∝ exp(W^⊤ h_t + b)

- Output layer parameterisation depends on the vocabulary size |V|
  → Sample inefficient
- Output layer power depends on the hidden dimension or rank d: the "softmax bottleneck"
  → High overhead and prone to overfitting
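For concreteness, a minimal PyTorch sketch of this standard softmax output layer (the class and variable names such as `hidden_dim` and `vocab_size` are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

class SoftmaxOutputLayer(nn.Module):
    """Standard output layer: logits = W^T h_t + b, normalised over the vocabulary."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # W holds d x |V| parameters, so the layer grows linearly with the vocabulary,
        # and the logit matrix it produces has rank at most d (the "softmax bottleneck").
        self.W = nn.Parameter(torch.empty(hidden_dim, vocab_size))
        self.b = nn.Parameter(torch.zeros(vocab_size))
        nn.init.normal_(self.W, std=0.02)

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, hidden_dim) -> (batch, vocab_size) log-probabilities
        return torch.log_softmax(h_t @ self.W + self.b, dim=-1)

# Example (illustrative sizes): SoftmaxOutputLayer(400, 33278)(torch.randn(8, 400))
```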
Previous work

[Figure: output structure learning architecture — an input encoder g_in maps the context y1, y2, …, y_{t−1} to h_t, a label encoder g_out maps the embeddings E of the output words w1, w2, …, w_{|V|} to label representations, and the two are combined with bias b to predict y_t]

Output structure learning factorization of the probability distribution, given the word embedding matrix E ∈ R^{|V|×d}:

    p(y_t | y_1^{t−1}) ∝ g_out(E, V) · g_in(E, y_1^{t−1}) + b

- Shallow label encoder networks such as weight tying [PW17], bilinear mapping [G18], and dual nonlinear mapping [P18]
- Output layer parameterisation no longer depends on the vocabulary size |V|
  (1) → More sample efficient
- Output layer power still depends on the hidden dimension or rank d: the "softmax bottleneck"
  (2) → High overhead and prone to overfitting

Output similarity structure learning methods help with (1) but not yet with (2).
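As an illustration of such a shallow label encoder, here is a minimal PyTorch sketch of a bilinear mapping over tied input embeddings, in the spirit of [G18]; weight tying [PW17] corresponds to dropping the mapping U. The names are illustrative and this is not the authors' implementation:

```python
import torch
import torch.nn as nn

class BilinearOutputLayer(nn.Module):
    """Shallow label encoder: scores = (U h_t) E^T + b, where E are the tied input
    embeddings, so the output layer adds no parameters that scale with |V|."""
    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        d = embedding.embedding_dim
        self.E = embedding.weight             # (|V|, d), shared with the input embeddings
        self.U = nn.Linear(d, d, bias=False)  # the only new weights: d x d
        self.b = nn.Parameter(torch.zeros(embedding.num_embeddings))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, d) -> (batch, |V|); the logit matrix is still at most rank d,
        # so this helps with (1) above but not with (2).
        return torch.log_softmax(self.U(h_t) @ self.E.t() + self.b, dim=-1)
```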
Our contributions

[Figure: same architecture, but the label encoder now produces deep label representations E^(k), which are combined with h_t and bias b to predict y_t]

Output structure learning factorization of the probability distribution, given the word embedding matrix E ∈ R^{|V|×d}:

    p(y_t | y_1^{t−1}) ∝ g_out(E, V) · g_in(E, y_1^{t−1}) + b

- Generalize previous output similarity structure learning methods
  → More sample efficient
- Propose a deep output label encoder network with dropout between layers
  → Avoids overfitting
- Increase output layer power with representation depth instead of rank d
  → Low overhead
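Reading the figure together with the factorization, one concrete way to write the proposed output layer (an interpretation, with E^(k) denoting the deep label encoder output defined on the next slide) is:

    p(y_t | y_1^{t−1}) ∝ exp(E^(k) h_t + b),   where E^(k) = g_out(E, V)

so expressiveness is added through the depth k of the label encoder rather than through a larger rank d.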
Label Encoder Network

[Figure: deep residual label encoder — the embedding matrix E = E^(0) is passed through layers f_out^(1), …, f_out^(k−1), f_out^(k) to produce E^(1), …, E^(k), with residual connections between consecutive layers and from E]

- Shares parameters across output labels with k nonlinear projections

      E^(k) = f_out^(k)(E^(k−1))

- Preserves information across layers with residual connections

      E^(k) = f_out^(k)(E^(k−1)) + E^(k−1) + E

- Avoids overfitting with standard or variational dropout for each layer i = 1, …, k

      f′_out^(i)(E^(i−1)) = δ ⊙ f_out^(i)(E^(i−1))

  where δ is a dropout mask applied elementwise to f_out^(i)(E^(i−1)).
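A minimal PyTorch sketch of this deep residual label encoder and the resulting output layer (the choice of ReLU, standard dropout, and all names are assumptions for illustration; the authors' released code may differ):

```python
import torch
import torch.nn as nn

class DeepResidualLabelEncoder(nn.Module):
    """Deep residual output layer: encodes the (tied) word embeddings E = E^(0) with
    k nonlinear projections plus residual connections, then scores words against h_t."""
    def __init__(self, embedding: nn.Embedding, num_layers: int, dropout: float = 0.1):
        super().__init__()
        d = embedding.embedding_dim
        self.E = embedding.weight          # (|V|, d), shared with the input embeddings
        self.layers = nn.ModuleList([nn.Linear(d, d) for _ in range(num_layers)])
        self.drop = nn.Dropout(dropout)    # standard dropout between layers
        self.b = nn.Parameter(torch.zeros(embedding.num_embeddings))

    def encode_labels(self) -> torch.Tensor:
        E0 = self.E
        E_prev = E0
        for f in self.layers:
            # E^(i) = dropout(f_out^(i)(E^(i-1))) + E^(i-1) + E
            E_prev = self.drop(torch.relu(f(E_prev))) + E_prev + E0
        return E_prev                      # E^(k), shape (|V|, d)

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, d) -> (batch, |V|) log-probabilities: p ∝ exp(E^(k) h_t + b)
        return torch.log_softmax(h_t @ self.encode_labels().t() + self.b, dim=-1)
```

Only the k d×d projections are new parameters, so the cost of the added depth is independent of the vocabulary size |V|.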
Results

- Improve competitive architectures without increasing their hidden dimension or rank

  Language modeling (WikiText-2)          ppl    sec/ep
  AWD-LSTM [M18]                          65.8    89 (1.0×)
  AWD-LSTM-DRILL                          61.9   106 (1.2×)
  AWD-LSTM-MoS [Y18]                      61.4   862 (9.7×)

  Machine translation (En→De, 32K BPE)    BLEU   min/ep
  Transformer [V17]                       27.3   111 (1.0×)
  Transformer-DRILL                       28.1   189 (1.7×)
  Transformer (big) [V17]                 28.4   779 (7.0×)

- Better transfer across low-resource output labels