

SLIDE 1

Exploring the Limits of Language Modeling

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu
Presented by Arvid Frydenlund, November 11, 2016

SLIDE 2

Word-level Neural Language Modelling

[Figure: unrolled RNN language model with hidden states h1–h6; input words "The cat sat on the mat", predicted outputs "cat sat on the mat EOS"]

h_t: partial-sentence embedding

p(w) = exp(z_w) / Σ_{w'∈V} exp(z_{w'}), where z_w = h_t^T e^o_w

e^o_w: output word embedding
e^i_w: input word embedding
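To make the output layer concrete, here is a minimal NumPy sketch of this softmax: each word's logit is the dot product of h_t with that word's output embedding. The sizes and variable names are illustrative only; the actual models use an 800k-word vocabulary and far larger hidden states.

```python
import numpy as np

# Illustrative sizes only; the paper's models are much larger.
HIDDEN, VOCAB = 8, 5
rng = np.random.default_rng(0)

output_emb = rng.normal(size=(VOCAB, HIDDEN))  # e^o_w, one output embedding per word
h_t = rng.normal(size=HIDDEN)                  # partial-sentence embedding from the LSTM

z = output_emb @ h_t                           # z_w = h_t^T e^o_w for every word in V
z -= z.max()                                   # subtract the max for numerical stability
p = np.exp(z) / np.exp(z).sum()                # p(w) = exp(z_w) / sum_{w' in V} exp(z_{w'})
print(p, p.sum())                              # a proper distribution: sums to 1
```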

SLIDE 3

Overview

They present four different models:

  • 1. Word-level language model
  • 2. Character-level input, word-level output, without an input look-up table
  • 3. Character-level input, word-level output, without any look-up table
  • 4. Word-level input, character-level output, with an encoder-decoder

SLIDE 4

Models

SLIDE 5

Achievements:

◮ State-of-the-art language modelling on the Billion Word Benchmark (800k vocabulary)
◮ Reduced perplexity from 51.3 to 30.0, and then to 23.7 with an ensemble
◮ While significantly reducing model parameters (20 billion to 1.04 billion)
◮ Novel replacement of the output look-up table
◮ Novel encoder-decoder model for character-level output language modelling

SLIDE 6

Issues:

◮ Full softmax over the 800k vocabulary at test time
◮ Training time (32 GPUs for 3 weeks)
◮ Output look-up table replacement performs worse than a full look-up table and still requires one anyway
◮ Character-level output model doesn't work well
◮ Issue of comparing character-level output to word-level output

SLIDE 7

Modelling input words

◮ Don't model words independently
◮ 'cat' and 'cats' should share semantic information
◮ '-ing' should share syntactic information
◮ Replace the whole-word look-up table with a compositional function
◮ c, a, t, s → e^i_w
◮ Can be seen as approximating the look-up table with an embedded neural network
◮ Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation, Ling et al., 2015
  ◮ used a bidirectional LSTM
◮ Character-Aware Neural Language Models, Kim et al., 2015
  ◮ used a CNN & highway feedforward NN (a rough sketch follows this list)
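A rough sketch of the compositional input embedding, in the spirit of the Kim et al. char-CNN with a highway layer (not a faithful reimplementation; all sizes, parameters, and character ids below are made up): character embeddings are convolved, max-pooled over time, and passed through one highway layer to produce e^i_w.

```python
import numpy as np

# Rough sketch only: tiny, made-up sizes (the real models are much larger).
CHAR_VOCAB, CHAR_DIM, WORD_DIM, KERNEL = 50, 16, 128, 3
rng = np.random.default_rng(0)

char_emb = rng.normal(size=(CHAR_VOCAB, CHAR_DIM))            # character look-up table
conv_w = rng.normal(size=(WORD_DIM, KERNEL * CHAR_DIM))       # width-3 convolution filters
W_t = rng.normal(size=(WORD_DIM, WORD_DIM)); b_t = np.zeros(WORD_DIM)  # highway gate
W_h = rng.normal(size=(WORD_DIM, WORD_DIM)); b_h = np.zeros(WORD_DIM)  # highway transform

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compose_word(char_ids):
    """c, a, t, s -> e^i_w: compose a word embedding from its characters."""
    chars = char_emb[char_ids]                                 # (word_len, CHAR_DIM)
    # Convolution over character windows, then max-over-time pooling.
    windows = [chars[i:i + KERNEL].reshape(-1)
               for i in range(len(char_ids) - KERNEL + 1)]
    feats = np.stack([conv_w @ w for w in windows])            # (num_windows, WORD_DIM)
    pooled = feats.max(axis=0)
    # One highway layer: gated mix of a ReLU transform and the identity path.
    t = sigmoid(W_t @ pooled + b_t)
    return t * np.maximum(0.0, W_h @ pooled + b_h) + (1.0 - t) * pooled

e_w = compose_word(np.array([12, 10, 29, 28]))                 # made-up ids for c, a, t, s
print(e_w.shape)                                               # (128,)
```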

SLIDE 8

Char-CNN (Kim et al.)

SLIDE 9

Results of replacing input look-up table

Model (RNN state size, e^o_w size)       Test Perplexity   Params (B)
Previous SOTA                            51.3              20
LSTM (512, 512)                          54.1              0.82
LSTM (1024, 512)                         48.2              0.82
LSTM (2048, 512)                         43.7              0.83
LSTM (8192, 2048), No dropout            37.9              3.3
LSTM (8192, 2048), Dropout               32.2              3.3
2-layer LSTM (8192, 1024), Big LSTM      30.6              1.8
Big LSTM with CNN Inputs                 30.0              1.04

SLIDE 10

Modelling output words

◮ c, a, t, s → e^o_w
◮ p(w) = exp(z_w) / Σ_{w'∈V} exp(z_{w'}), where z_w = h_t^T e^o_w
◮ Issue: orthographic confusion
◮ Solution: char CNN + whole-word embeddings of 128 dimensions ('correction factor'); a rough sketch follows this list
◮ Bottleneck layer
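A hedged sketch of one way the correction factor could enter the CNN softmax, per my reading of the slide (not necessarily the paper's exact parameterization): the char-CNN composes most of each output embedding, and a small per-word vector is projected up through a bottleneck and added before the dot product with h_t. All names and sizes below are illustrative.

```python
import numpy as np

# Illustrative sizes only; the slide mentions 128-dimensional correction embeddings.
HIDDEN, CORR_DIM, VOCAB = 32, 8, 6
rng = np.random.default_rng(1)

cnn_emb = rng.normal(size=(VOCAB, HIDDEN))     # e^o_w composed from characters by a char-CNN (stubbed here)
corr_emb = rng.normal(size=(VOCAB, CORR_DIM))  # small per-word 'correction factor' table
proj = rng.normal(size=(CORR_DIM, HIDDEN))     # bottleneck projection up to the hidden size
h_t = rng.normal(size=HIDDEN)                  # context vector from the word-level LSTM

# Assumed combination: CNN-composed output embedding plus the projected correction vector.
e_out = cnn_emb + corr_emb @ proj              # (VOCAB, HIDDEN)
z = e_out @ h_t                                # logits z_w = h_t^T e^o_w
p = np.exp(z - z.max())
p /= p.sum()                                   # softmax over the (tiny, illustrative) vocabulary
print(p)
```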

SLIDE 11

Results of replacing output look-up table

Model (RNN state size, e^o_w size)       Test Perplexity   Params (B)
Previous SOTA                            51.3              20
LSTM (512, 512)                          54.1              0.82
LSTM (1024, 512)                         48.2              0.82
LSTM (2048, 512)                         43.7              0.83
LSTM (8192, 2048), No dropout            37.9              3.3
LSTM (8192, 2048), Dropout               32.2              3.3
2-layer LSTM (8192, 1024), Big LSTM      30.6              1.8
Big LSTM with CNN Inputs                 30.0              1.04
Above with CNN outputs                   39.8              0.29
Above with correction factor             35.8              0.39

SLIDE 12

Full character-level language modelling

SLIDE 13

Character-level output language modelling

◮ Replace the softmax and output word embeddings with an RNN
◮ The RNN conditions on h_t and predicts characters one by one
◮ Training: the word-level model is frozen and the decoder is attached
◮ Issue: perplexity, 2^{H(P_m)} (see the note below)
◮ Solution: brute-force renormalization
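To unpack the last two bullets (my reading of the slide, hedged): word-level perplexity is 2 raised to the per-word cross-entropy, and because the character decoder spreads probability over arbitrary character strings rather than only the 800k-word vocabulary, its word probabilities can be renormalized over the vocabulary before perplexity is computed.

```latex
% Sketch of the perplexity definition and the brute-force renormalization,
% as I read the slide; not necessarily the paper's exact procedure.
\[
  \mathrm{PPL} = 2^{H(P_m)}, \qquad
  H(P_m) = -\frac{1}{N}\sum_{t=1}^{N} \log_2 P_m(w_t \mid w_{<t})
\]
\[
  \tilde{P}(w \mid h_t) \;=\;
  \frac{P_{\mathrm{char}}(w \mid h_t)}{\sum_{w' \in V} P_{\mathrm{char}}(w' \mid h_t)}
  \qquad \text{(renormalize over the full vocabulary } V\text{)}
\]
```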

SLIDE 14

Results for character-level output language modelling

Model (RNN state size, e^o_w size)       Test Perplexity   Params (B)
Previous SOTA                            51.3              20
LSTM (512, 512)                          54.1              0.82
LSTM (1024, 512)                         48.2              0.82
LSTM (2048, 512)                         43.7              0.83
LSTM (8192, 2048), No dropout            37.9              3.3
LSTM (8192, 2048), Dropout               32.2              3.3
2-layer LSTM (8192, 1024), Big LSTM      30.6              1.8
Big LSTM with CNN Inputs                 30.0              1.04
Above with CNN outputs                   39.8              0.29
Above with correction factor             35.8              0.39
Big LSTM, characters out                 49.0              0.23
Above with renormalization               47.9              0.23

SLIDE 15

Questions?

◮ Exploring the Limits of Language Modeling, Jozefowicz et al., 2016
◮ Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation, Ling et al., 2015
◮ Character-Aware Neural Language Models, Kim et al., 2015