SLIDE 1 LANGUAGE MODELING WITH GATED CONVOLUTIONAL NETWORKS
YANN N. DAUPHIN, ANGELA FAN, MICHAEL AULI AND DAVID GRANGIER
FACEBOOK AI RESEARCH
CS 546 Paper Presentation Jinfeng Xiao 2/22/2018
SLIDE 2 Intro: Language Models
■ Full model: $P(w_1, \dots, w_N) = P(w_1) \prod_{i=2}^{N} P(w_i \mid w_1, \dots, w_{i-1})$
■ n-gram model: $P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$
■ Hard to represent long-range dependencies, due to data sparsity (illustrated in the sketch below)
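A minimal Python sketch of the n-gram idea, with an invented toy corpus (none of this is from the slides): the chain rule is approximated by conditioning on only the previous n−1 = 1 words, and the zero probability for an unseen bigram shows the data-sparsity problem.

```python
# Toy bigram (n = 2) model from raw counts; corpus is invented.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
    return bigrams[(prev, word)] / unigrams[prev]

# Chain rule under the n-gram approximation (P(w_1) term omitted):
# P(w_1, ..., w_N) ~ prod_i P(w_i | w_{i-1})
sentence = "the cat sat on the rug".split()
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= bigram_prob(prev, word)
print(p)                          # 0.0625
print(bigram_prob("dog", "ran")) # 0.0: unseen bigram -> sparsity
```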
SLIDE 3 Intro: LSTM
[Figure: LSTM cell diagram with a “Gate” callout (source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)]
SLIDE 4 Intro: LSTM
■ State-of-the-art neural network approach for language modeling
■ + Can theoretically model arbitrarily long dependencies
■ − Not parallelizable over the sequence: O(N) sequential operations, one per token (see the sketch below)
http://colah.github.io/posts/2015-08-Understanding-LSTMs
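A minimal PyTorch sketch (sizes are invented) of why the slide says O(N): the LSTM recurrence h_t = f(x_t, h_{t-1}) forces an explicit loop over time, so the N steps cannot run in parallel over the sequence.

```python
# Each timestep consumes the previous hidden state, so the loop
# below is inherently sequential: O(N) dependent operations.
import torch

cell = torch.nn.LSTMCell(input_size=16, hidden_size=32)
x = torch.randn(100, 1, 16)   # N = 100 timesteps, batch size 1
h = torch.zeros(1, 32)
c = torch.zeros(1, 32)
for t in range(x.size(0)):    # cannot be parallelized over t
    h, c = cell(x[t], (h, c)) # step t depends on step t-1
```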
SLIDE 5
Intro: CNN
■ Predict the current word y from the previous words x (i.e., the context)
■ Model long-range dependencies with O(N/k) sequential operations, where k is the convolution kernel width
SLIDE 6
This Paper: GCNN
■ Gated Convolutional Neural Networks
■ Each CNN layer is followed by a gating layer
■ Allows parallelization over sequential tokens
■ Reduces the latency to score a sentence by an order of magnitude
■ Competitive performance on the WikiText-103 and Google Billion Word benchmarks
SLIDE 7
Architecture
■ Word Embedding + CNN + Gating (a sketch of the full pipeline follows below)
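A minimal PyTorch sketch of the three stages, not the authors' code (all sizes are hypothetical): embed the tokens, apply a left-padded ("causal") 1D convolution so position i never sees tokens after i, then gate the output.

```python
# Word Embedding + CNN + Gating with invented sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, emb, hidden, k = 1000, 64, 128, 4       # hypothetical sizes

embed = nn.Embedding(vocab, emb)
conv_lin = nn.Conv1d(emb, hidden, kernel_size=k)   # X*W + b
conv_gate = nn.Conv1d(emb, hidden, kernel_size=k)  # X*V + c

tokens = torch.randint(0, vocab, (1, 20))      # (batch, seq_len)
x = embed(tokens).transpose(1, 2)              # (batch, emb, seq_len)
x = F.pad(x, (k - 1, 0))                       # pad left only: causal
h = conv_lin(x) * torch.sigmoid(conv_gate(x))  # gating (GLU form)
print(h.shape)                                 # torch.Size([1, 128, 20])
```

Because every output position is computed by the same convolution, all positions are produced in one call, in contrast with the LSTM loop earlier.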
SLIDE 8
Architecture
■ Word Embedding + CNN + Gating (focus: Word Embedding)
SLIDE 9
Architecture
■ Word Embedding + CNN + Gating (focus: CNN)
*: Convolution operation
SLIDE 10
Architecture
■ Word Embedding + CNN + Gating (focus: CNN; the convolution weights are learned parameters)
SLIDE 11 Example: Convolution
■ A weighted “average” over a small patch around each element (see the toy example below)
http://colah.github.io/posts/2015-08-Understanding-LSTMs
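A toy NumPy example (weights invented) of that idea: a width-3 kernel slides over the sequence and emits a weighted average of each patch.

```python
# 1D convolution as a sliding weighted average.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.25, 0.5, 0.25])        # weights over a width-3 patch

out = np.convolve(x, w, mode="valid")  # one output per full patch
print(out)                             # [2. 3. 4.]
```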
SLIDE 12
Architecture
■ Word Embedding + CNN + Gating (focus: Gating)
SLIDE 13 Two Gating Mechanisms
■ Gated linear units (GLU)
■ Gated tanh units (GTU)
ℎ" # = # ∗ & + ( ⊗σ # ∗ + + ,
ℎ- # = tanh # ∗ & + ( ⊗σ # ∗ + + ,
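The two units written out directly in PyTorch; here `a` and `b` stand for the two convolution outputs $\mathbf{X} \ast \mathbf{W} + \mathbf{b}$ and $\mathbf{X} \ast \mathbf{V} + \mathbf{c}$, and the tensors are random placeholders.

```python
# GLU vs. GTU, given the two convolution outputs a and b.
import torch

def glu(a, b):
    # h(X) = (X*W + b) ⊗ σ(X*V + c): linear path, sigmoid gate
    return a * torch.sigmoid(b)

def gtu(a, b):
    # h(X) = tanh(X*W + b) ⊗ σ(X*V + c): LSTM-like tanh path
    return torch.tanh(a) * torch.sigmoid(b)

a, b = torch.randn(2, 8), torch.randn(2, 8)
print(glu(a, b).shape, gtu(a, b).shape)
```

The only difference is the tanh on the linear path; the paper argues that GLU's linear path gives gradients a route that does not vanish through a squashing nonlinearity.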
SLIDE 14 Evaluation Metric: Perplexity
■ The perplexity of a model $p$ on a held-out sequence $w_1, \dots, w_N$ is
$\mathrm{PPL} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i \mid w_1, \dots, w_{i-1})}$
■ It measures how well the model matches the held-out test set (see the sketch below)
■ The smaller, the better
https://en.wikipedia.org/wiki/Perplexity
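A minimal sketch of the formula above; the per-token conditional probabilities are invented stand-ins for a model's outputs.

```python
# Perplexity from per-token probabilities p(w_i | w_1, ..., w_{i-1}).
import math

token_probs = [0.2, 0.1, 0.25, 0.05]  # hypothetical model outputs
N = len(token_probs)

ppl = 2 ** (-sum(math.log2(p) for p in token_probs) / N)
print(ppl)  # ~7.95; a uniform model over V words would give PPL = V
```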
SLIDE 15 Benchmark: Google Billion Word
Average Sequence Length = 20 Words
ReLU: $h(\mathbf{X}) = \mathbf{X} \otimes (\mathbf{X} > 0)$
SLIDE 16
GCNN Is Faster
On Google Billion Word
SLIDE 17
Benchmark: WikiText-103
Average Sequence Length = 4,000 Words
SLIDE 18 Short Context Size Suffices
[Figure: test perplexity vs. context size; panels: Google Billion Word and WikiText-103]
SLIDE 19
Summary
■ GCNN: CNN + Gating
■ Perplexity is comparable with the state-of-the-art LSTM
■ GCNN converges faster and allows parallelization over sequential tokens
■ The simpler linear gating (GLU) works better than LSTM-like tanh gating (GTU)