SLIDE 1

LANGUAGE MODELING WITH GATED CONVOLUTIONAL NETWORKS

YANN N. DAUPHIN, ANGELA FAN, MICHAEL AULI AND DAVID GRANGIER FACEBOOK AI RESEARCH

CS 546 Paper Presentation Jinfeng Xiao 2/22/2018

SLIDE 2

Intro: Language Models

■ Full model: $P(w_1, \dots, w_N) = P(w_1) \prod_{i=2}^{N} P(w_i \mid w_1, \dots, w_{i-1})$
■ n-gram model: $P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$
■ Hard to represent long-range dependencies, due to data sparsity
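To make the n-gram truncation and the sparsity problem concrete, here is a minimal bigram-model sketch (the toy corpus and helper name are hypothetical, not from the slides):

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a large training set.
corpus = "the cat sat on the mat the cat ate".split()

# Maximum-likelihood bigram estimate: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(prev_word, word):
    """P(word | prev_word) estimated by counting."""
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```

Any n-gram never seen in training gets probability zero, which is the data-sparsity problem the slide refers to; it only gets worse as n grows.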

SLIDE 3

Intro: LSTM

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

(Figure annotation: "Gate")

SLIDE 4

Intro: LSTM

■ State-of-the-art neural network approach for language modeling
■ + Can theoretically model arbitrarily long dependencies
■ -- Not parallelizable over time: O(N) sequential operations for a length-N sequence

http://colah.github.io/posts/2015-08-Understanding-LSTMs

SLIDE 5

Intro: CNN

■ Predict the current word y from the previous words x (i.e. the context)
■ Model long-term dependencies with O(N/k) operations, where k is the convolution kernel width (see the sketch below)
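A back-of-the-envelope sketch of that O(N/k) claim, assuming plain stacked convolutions without dilation (the slide omits the derivation): each width-k layer grows the receptive field by k - 1, so covering N context words takes on the order of N/k layers, and every layer processes all positions in parallel:

```python
import math

def layers_to_cover(context_n, kernel_k):
    """Number of stacked width-k convolution layers whose receptive
    field spans context_n tokens: 1 + L * (k - 1) >= context_n."""
    return math.ceil((context_n - 1) / (kernel_k - 1))

# Covering a 20-token context with kernel width 5 takes 5 parallel layers,
# versus 20 strictly sequential steps for a recurrent model.
print(layers_to_cover(20, 5))  # 5
```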

SLIDE 6

This Paper: GCNN

■ Gated Convolutional Neural Networks
■ Each CNN layer is followed by a gating layer
■ Allows parallelization over sequential tokens
■ Reduces the latency to score a sentence by an order of magnitude
■ Competitive performance on the WikiText-103 and Google Billion Words benchmarks

SLIDE 7

Architecture

■ Word Embedding +
■ CNN +
■ Gating

SLIDE 8

Architecture

■ Word Embedding +
■ CNN +
■ Gating

SLIDE 9

Architecture

■ Word Embedding +
■ CNN +
■ Gating

*: Convolution operation

SLIDE 10

Architecture

■ Word Embedding +
■ CNN +
■ Gating

(Figure annotation: learned parameters)

SLIDE 11

Example: Convolution

■ “Average” over a small patch around an element

http://colah.github.io/posts/2015-08-Understanding-LSTMs
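A minimal numpy sketch of that "average over a patch" view of 1-D convolution (the uniform averaging kernel is purely illustrative; the network learns its own weights):

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])  # input sequence
w = np.ones(3) / 3                        # width-3 "averaging" kernel

# Slide the kernel along the sequence: each output element is a
# weighted sum (here: the average) of a small patch of the input.
out = np.convolve(x, w, mode="valid")
print(out)  # [3. 5. 7.], each entry averaging 3 neighbours
```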

SLIDE 12

Architecture

■ Word Embedding +
■ CNN +
■ Gating

SLIDE 13

Two Gating Mechanisms

■ Gated linear units (GLU): $h_l(X) = (X * W + b) \otimes \sigma(X * V + c)$
■ Gated tanh units (GTU): $h_l(X) = \tanh(X * W + b) \otimes \sigma(X * V + c)$
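A minimal numpy sketch of the two units, with the convolution outputs collapsed into precomputed pre-activations a = X*W + b and g = X*V + c (names and values are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(a, g):
    """Gated linear unit: h = (X*W + b) ⊗ σ(X*V + c)."""
    return a * sigmoid(g)

def gtu(a, g):
    """Gated tanh unit: h = tanh(X*W + b) ⊗ σ(X*V + c)."""
    return np.tanh(a) * sigmoid(g)

a = np.array([0.5, -1.2, 2.0])  # X*W + b (illustrative)
g = np.array([1.0, 0.0, -2.0])  # X*V + c (illustrative)
print(glu(a, g))  # gate in (0, 1) scales the linear path
print(gtu(a, g))  # both factors saturate, unlike in the GLU
```

The GLU keeps a linear path through a, so gradients flow through the gate without the double saturation of the GTU, which the paper credits for its better results.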

SLIDE 14

Evaluation Metric: Perplexity

■ The perplexity of a model $p$ on held-out text $w_1, \dots, w_N$ is $2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i \mid w_1, \dots, w_{i-1})}$
■ It measures how well the model matches the held-out test set.
■ The smaller, the better.

https://en.wikipedia.org/wiki/Perplexity
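A minimal sketch of that formula in code (the per-token probabilities are made up for illustration):

```python
import math

# Model probabilities p(w_i | w_1, ..., w_{i-1}) for each held-out token.
token_probs = [0.2, 0.1, 0.25, 0.05]  # illustrative values

# perplexity = 2 ** (-(1/N) * sum_i log2 p(w_i | context))
n = len(token_probs)
avg_log2 = sum(math.log2(p) for p in token_probs) / n
perplexity = 2 ** (-avg_log2)
print(perplexity)  # ≈ 7.95; a lower value means a better fit
```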

SLIDE 15

Benchmark: Google Billion Word

Average Sequence Length = 20 Words

ReLU: $\mathrm{ReLU}(x) = x \otimes (x > 0)$, i.e. a gated unit whose gate is the hard, unlearned indicator $(x > 0)$
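In code, the slide's hard-gate reading of ReLU (numpy, illustrative values):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])

# ReLU as the slide writes it: the input times a hard 0/1 gate.
relu = x * (x > 0)

# Same result as the usual formulation (signed zeros aside).
assert np.allclose(relu, np.maximum(x, 0))
```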

SLIDE 16

GCNN Is Faster

On Google Billion Words

SLIDE 17

Benchmark: WikiText-103

Average Sequence Length = 4,000 Words

SLIDE 18

Short Context Size Suffices

Google Billion Word

  • Avg. Text Length = 20

WikiText-103

  • Avg. Text Length = 4,000
SLIDE 19

Summary

■ GCNN: CNN + gating
■ Perplexity is comparable with that of state-of-the-art LSTMs
■ GCNN converges faster and allows parallelization over sequential tokens
■ The simpler linear gating (GLU) works better than LSTM-like tanh gating (GTU)