SLIDE 1 LANGUAGE MODELING WITH GATED CONVOLUTIONAL NETWORKS
YANN N. DAUPHIN, ANGELA FAN, MICHAEL AULI AND DAVID GRANGIER
FACEBOOK AI RESEARCH
CS 546 Paper Presentation Jinfeng Xiao 2/22/2018
SLIDE 2 Intro: Language Models
■ Full model: $P(w_1, \dots, w_N) = P(w_1) \prod_{i=2}^{N} P(w_i \mid w_1, \dots, w_{i-1})$
■ n-gram model: $P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$
■ Hard to represent long-range dependencies, due to data sparsity (illustrated in the sketch below)
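A minimal Python sketch of the n-gram idea, with an invented toy corpus (none of this is from the slides): the chain rule is approximated by conditioning on only the previous n−1 = 1 words, and the zero probability for an unseen bigram shows the data-sparsity problem.

```python
# Toy bigram (n = 2) model from raw counts; corpus is invented.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
    return bigrams[(prev, word)] / unigrams[prev]

# Chain rule under the n-gram approximation (P(w_1) term omitted):
# P(w_1, ..., w_N) ~ prod_i P(w_i | w_{i-1})
sentence = "the cat sat on the rug".split()
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= bigram_prob(prev, word)
print(p)                          # 0.0625
print(bigram_prob("dog", "ran")) # 0.0: unseen bigram -> sparsity
```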
SLIDE 3 Intro: LSTM
[Figure: LSTM cell diagram with a “Gate” callout (source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)]
SLIDE 4 Intro: LSTM
■ State-of-the-art neural network approach for language modeling
■ + Can theoretically model arbitrarily long dependencies
■ − Not parallelizable over the sequence: O(N) sequential operations, one per token (see the sketch below)
http://colah.github.io/posts/2015-08-Understanding-LSTMs
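A minimal PyTorch sketch (sizes are invented) of why the slide says O(N): the LSTM recurrence h_t = f(x_t, h_{t-1}) forces an explicit loop over time, so the N steps cannot run in parallel over the sequence.

```python
# Each timestep consumes the previous hidden state, so the loop
# below is inherently sequential: O(N) dependent operations.
import torch

cell = torch.nn.LSTMCell(input_size=16, hidden_size=32)
x = torch.randn(100, 1, 16)   # N = 100 timesteps, batch size 1
h = torch.zeros(1, 32)
c = torch.zeros(1, 32)
for t in range(x.size(0)):    # cannot be parallelized over t
    h, c = cell(x[t], (h, c)) # step t depends on step t-1
```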
SLIDE 5
Intro: CNN
■ Predict the current word y from the previous words x (i.e., the context)
■ Model long-range dependencies with O(N/k) sequential operations, where k is the convolution kernel width
SLIDE 6
This Paper: GCNN
■ Gated Convolutional Neural Networks
■ Each CNN layer is followed by a gating layer
■ Allows parallelization over sequential tokens
■ Reduces the latency to score a sentence by an order of magnitude
■ Competitive performance on the WikiText-103 and Google Billion Word benchmarks
SLIDE 7
Architecture
■ Word Embedding + CNN + Gating (a sketch of the full pipeline follows below)
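A minimal PyTorch sketch of the three stages, not the authors' code (all sizes are hypothetical): embed the tokens, apply a left-padded ("causal") 1D convolution so position i never sees tokens after i, then gate the output.

```python
# Word Embedding + CNN + Gating with invented sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, emb, hidden, k = 1000, 64, 128, 4       # hypothetical sizes

embed = nn.Embedding(vocab, emb)
conv_lin = nn.Conv1d(emb, hidden, kernel_size=k)   # X*W + b
conv_gate = nn.Conv1d(emb, hidden, kernel_size=k)  # X*V + c

tokens = torch.randint(0, vocab, (1, 20))      # (batch, seq_len)
x = embed(tokens).transpose(1, 2)              # (batch, emb, seq_len)
x = F.pad(x, (k - 1, 0))                       # pad left only: causal
h = conv_lin(x) * torch.sigmoid(conv_gate(x))  # gating (GLU form)
print(h.shape)                                 # torch.Size([1, 128, 20])
```

Because every output position is computed by the same convolution, all positions are produced in one call, in contrast with the LSTM loop earlier.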
SLIDE 8
Architecture
■ Word Embedding + CNN + Gating (focus: Word Embedding)
SLIDE 9
Architecture
■ Word Embedding + CNN + Gating (focus: CNN)
*: Convolution operation
SLIDE 10
Architecture
■ Word Embedding + CNN + Gating (focus: CNN; the convolution weights are learned parameters)
SLIDE 11 Example: Convolution
■ A weighted “average” over a small patch around each element (see the toy example below)
http://colah.github.io/posts/2015-08-Understanding-LSTMs
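A toy NumPy example (weights invented) of that idea: a width-3 kernel slides over the sequence and emits a weighted average of each patch.

```python
# 1D convolution as a sliding weighted average.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.25, 0.5, 0.25])        # weights over a width-3 patch

out = np.convolve(x, w, mode="valid")  # one output per full patch
print(out)                             # [2. 3. 4.]
```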
SLIDE 12
Architecture
■ Word Embedding + CNN + Gating (focus: Gating)
SLIDE 13 Two Gating Mechanisms
■ Gated linear units (GLU)
■ Gated tanh units (GTU)
ℎ" # = # ∗ & + ( ⊗σ # ∗ + + ,
ℎ- # = tanh # ∗ & + ( ⊗σ # ∗ + + ,
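The two units written out directly in PyTorch; here `a` and `b` stand for the two convolution outputs $\mathbf{X} \ast \mathbf{W} + \mathbf{b}$ and $\mathbf{X} \ast \mathbf{V} + \mathbf{c}$, and the tensors are random placeholders.

```python
# GLU vs. GTU, given the two convolution outputs a and b.
import torch

def glu(a, b):
    # h(X) = (X*W + b) ⊗ σ(X*V + c): linear path, sigmoid gate
    return a * torch.sigmoid(b)

def gtu(a, b):
    # h(X) = tanh(X*W + b) ⊗ σ(X*V + c): LSTM-like tanh path
    return torch.tanh(a) * torch.sigmoid(b)

a, b = torch.randn(2, 8), torch.randn(2, 8)
print(glu(a, b).shape, gtu(a, b).shape)
```

The only difference is the tanh on the linear path; the paper argues that GLU's linear path gives gradients a route that does not vanish through a squashing nonlinearity.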
SLIDE 14 Evaluation Metric: Perplexity
■ The perplexity of a model $p$ on a held-out sequence $w_1, \dots, w_N$ is
$\mathrm{PPL} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i \mid w_1, \dots, w_{i-1})}$
■ It measures how well the model matches the held-out test set (see the sketch below)
■ The smaller, the better
https://en.wikipedia.org/wiki/Perplexity
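A minimal sketch of the formula above; the per-token conditional probabilities are invented stand-ins for a model's outputs.

```python
# Perplexity from per-token probabilities p(w_i | w_1, ..., w_{i-1}).
import math

token_probs = [0.2, 0.1, 0.25, 0.05]  # hypothetical model outputs
N = len(token_probs)

ppl = 2 ** (-sum(math.log2(p) for p in token_probs) / N)
print(ppl)  # ~7.95; a uniform model over V words would give PPL = V
```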
SLIDE 15 Benchmark: Google Billion Word
Average Sequence Length = 20 Words
ReLU: $h(\mathbf{X}) = \mathbf{X} \otimes (\mathbf{X} > 0)$
SLIDE 16
GCNN Is Faster
On Google Billion Word
SLIDE 17
Benchmark: WikiText-103
Average Sequence Length = 4,000 Words
SLIDE 18 Short Context Size Suffices
[Figure: test perplexity vs. context size; panels: Google Billion Word and WikiText-103]
SLIDE 19
Summary
■ GCNN: CNN + Gating
■ Perplexity is comparable with the state-of-the-art LSTM
■ GCNN converges faster and allows parallelization over sequential tokens
■ The simpler linear gating (GLU) works better than LSTM-like tanh gating (GTU)