The Neural Noisy Channel: Generative Models for Sequence to - PowerPoint PPT Presentation

The Neural Noisy Channel: Generative Models for Sequence to Sequence Modeling Chris Dyer

The Neural Noisy Channel: Generative Models for EVERYTHING Chris Dyer

What is a discriminative problem? Text Classification

What is a discriminative problem? Text Summary

What is a discriminative problem? Text Translation

What is a discriminative problem? Text Output • Discriminative problems (in contrast to, e.g., density estimation, clustering, or dimensionality reduction problems) seek to select the correct output for a given text input • Neural network models are very good discriminative models, but they need a lot of training data to achieve good performance

Discriminative Models • Discriminative training objectives are similar to the following: $\mathcal{L}(x, y, W) = \log p(y \mid x; W)$ • That is, they directly model the posterior distribution over outputs given inputs. • In many domains, we have lots of paired samples to train our models on, so this estimation problem is feasible. • We have also developed very powerful function classes for modeling complex relationships between inputs and outputs.
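The discriminative objective above can be illustrated with a minimal sketch. It assumes a linear softmax classifier, which the slides do not specify; the function names, the weight layout `W` (one weight vector per class), and the toy numbers are all hypothetical.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def discriminative_log_likelihood(W, x, y):
    """Return log p(y | x; W) for a linear softmax classifier.

    W is a list of per-class weight vectors (an assumed parameterization),
    x is a feature vector, and y is the index of the correct class.
    """
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]
    return log_softmax(scores)[y]

# Toy example with two classes and two features (made-up values).
W = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 0.0]
ll = discriminative_log_likelihood(W, x, 0)
```

Training would sum this quantity over all paired samples and maximize it with respect to W; note that only p(y | x) is ever modeled, never p(x).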

Text Classification [Figure: model of p(y | x) with inputs x_1, ..., x_5 and output y] $\mathcal{L}(W) = \sum_i \log p(y_i \mid x_i; W)$

Generative Models • Generative models are a kind of density estimation problem: $\mathcal{L}(x, y, W) = \log p(x, y \mid W)$ • They can, however, be used to compute the same conditional probabilities as discriminative models: $p(y \mid x) = \frac{p(x, y)}{p(x)}$, where $p(x) = \sum_{y'} p(x, y')$ • The renormalization by p(x) is cause for concern, but making the Bayes optimal prediction under a 0-1 “cost” means we can ignore the renormalization, because p(x) does not depend on y: $\hat{y} = \arg\max_y p(y \mid x) = \arg\max_y \frac{p(x, y)}{\sum_{y'} p(x, y')} = \arg\max_y p(x, y)$
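A small sketch of why the renormalization can be skipped at prediction time. The joint table below is made up for illustration, and the helper names are hypothetical; the point is only that the posterior argmax and the joint argmax agree for every input.

```python
# Hypothetical joint distribution p(x, y) over two inputs and two labels.
joint = {
    ("x1", "pos"): 0.30, ("x1", "neg"): 0.10,
    ("x2", "pos"): 0.15, ("x2", "neg"): 0.45,
}

def marginal(joint, x):
    """p(x) = sum over y' of p(x, y')."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def posterior(joint, x):
    """p(y | x) = p(x, y) / p(x) for every y."""
    p_x = marginal(joint, x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

def predict_from_posterior(joint, x):
    """argmax_y p(y | x), computed with the renormalization."""
    post = posterior(joint, x)
    return max(post, key=post.get)

def predict_from_joint(joint, x):
    """argmax_y p(x, y), skipping the division by p(x)."""
    scores = {y: p for (xi, y), p in joint.items() if xi == x}
    return max(scores, key=scores.get)
```

Dividing every score for a fixed x by the same positive constant p(x) cannot change which y is largest, so both prediction functions return the same label.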

Bayes’ Rule • A traditionally useful way of formulating a generative model involves the application of Bayes’ rule: $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$, where $p(x) = \sum_{y'} p(x \mid y')\, p(y')$ • This formulation posits the existence of two independent models, a prior probability over outputs p(y) and a likelihood p(x | y), which says how likely an input x is to be observed with output y. • Why might we favor this model? • Humans learn new tasks quickly from small amounts of data • But they often have a great deal of prior knowledge about the output space. • Outputs are chosen that justify the input, whereas in discriminative models, outputs are chosen that make the discriminative model happy.
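The prior-times-likelihood decomposition can be sketched with a very simple likelihood model. The code below uses a unigram (naive Bayes) likelihood with add-one smoothing purely as a stand-in; it is not the generative LSTM discussed later in the talk, and the training documents and function names are invented for illustration.

```python
import math
from collections import Counter

def train(docs):
    """Estimate log p(y) and per-class word counts from (tokens, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    log_prior = {y: math.log(c / len(docs)) for y, c in label_counts.items()}
    return log_prior, word_counts, vocab

def log_likelihood(tokens, label, word_counts, vocab):
    """log p(x | y) under a unigram model with add-one smoothing."""
    counts = word_counts[label]
    total = sum(counts.values()) + len(vocab)
    return sum(math.log((counts[t] + 1) / total) for t in tokens)

def classify(tokens, model):
    """Return argmax_y [log p(y) + log p(x | y)] and all class scores."""
    log_prior, word_counts, vocab = model
    scores = {
        y: log_prior[y] + log_likelihood(tokens, y, word_counts, vocab)
        for y in log_prior
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy corpus.
docs = [
    ("goal match team".split(), "sports"),
    ("match score team".split(), "sports"),
    ("market stocks bank".split(), "business"),
    ("bank rates market".split(), "business"),
]
model = train(docs)
label, scores = classify("team score".split(), model)
```

The prior and the likelihood are estimated separately here, which is what allows knowledge about the output space to be learned from data that is not paired with inputs.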

But didn’t we use generative models and give them up for some reason?

Generative Neural Models • Generative models frequently require modeling complex distributions, e.g., sentences, speech, images • Traditionally: complex distributions -> lots of (conditional) independence assumptions (think naive Bayes, or n-grams, or HMMs) • Neural networks are powerful density estimators that figure out their own independence assumptions • The motivating hypothesis in this work: • The previous empirical limits of generative models were due to bad independence assumptions, not the generative modeling paradigm per se.

Reasons for Optimism • Ng and Jordan (2001) show that linear models that are trained to generate have lower sample complexity (although higher asymptotic errors) than models that are trained to discriminate (naive Bayes vs. logistic regression) • What about nonlinear models such as neural networks? • Formal characterization of the generalization behaviors of complex neural networks is difficult, with findings from convex problems failing to account for empirical facts about their generalization (Zhang et al., 2017) • Let’s investigate empirical properties of generative vs. discriminative recurrent networks commonly used in NLP applications

Warm up: Text Classification

Warm up: Text Classification [Figure: input text x, label y in {real news, fake news}]

Discriminative Model [Figure: model of p(y | x) with inputs x_1, ..., x_5 and output y] $\mathcal{L}(W) = \sum_i \log p(y_i \mid x_i; W)$

Full Dataset Results
Model                                 AGNews  DBPedia  Yahoo  Yelp Binary
Naive Bayes                           90.0    96.0     68.7   86.0
Kneser-Ney Bayes                      89.3    95.4     69.3   81.8
Discriminative LSTM                   92.1    98.7     73.7   92.6
Generative LSTM                       90.7    94.8     70.5   90.0
Bag of Words (Zhang et al., 2015)     88.8    96.6     68.9   92.2
char-CRNN (Xiao and Cho, 2016)        91.4    98.6     71.7   94.5
very deep CNN (Conneau et al., 2016)  91.3    98.7     73.4   95.7

Sample Complexity and Asymptotic Errors [Figure: four panels (Yahoo, DBPedia, Yelp Binary, Sogou) plotting % accuracy against log(#training + 1) for naive Bayes, KN Bayes, disc LSTM, and gen LSTM]

Zero-shot Learning • With discriminative training, we can use these class embeddings as the softmax weights • This technique is not successful since the model (understandably) does not want to predict the new class since it is trained discriminatively • In the generative case, the model predicts instances of the new class with very high precision but very low recall • When we do self-training on these newly predicted examples, we are able to obtain good results in the zero-shot setting (about 60% of the time, depending on the hidden class)

Zero-shot Learning
Class                    Precision  Recall  Accuracy
company                  98.9       46.6    93.3
educational institution  99.2       49.5    92.8
artist                   88.3       4.3     90.3
athlete                  96.5       90.1    94.6
office holder            0          0       89.1
means of transportation  96.5       74.3    94.2
building                 99.9       37.7    92.1
natural place            98.9       88.2    95.4
village                  99.9       68.1    93.8
animal                   99.7       68.1    93.8
plant                    99.2       76.9    94.3
album                    0.03       0.001   88.8
film                     99.4       73.3    94.5
written work             93.8       26.5    91.3
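To read the three columns above, it helps to see how they are computed from confusion counts. The sketch below uses hypothetical counts (not taken from the experiments) and a hypothetical helper name; it shows how overall accuracy can stay high even when recall on the hidden class is zero, and what a high-precision, low-recall predictor looks like.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute precision, recall, and accuracy from confusion counts.

    Precision and recall are defined as 0.0 when their denominator is zero
    (a convention chosen for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical case 1: the hidden class is never predicted.
never_predicted = precision_recall_accuracy(tp=0, fp=0, fn=100, tn=900)

# Hypothetical case 2: the hidden class is predicted rarely but almost
# always correctly (high precision, low recall).
cautious = precision_recall_accuracy(tp=40, fp=1, fn=60, tn=899)
```

In the first case accuracy is still 0.9 because the remaining classes dominate the test set, which is why precision and recall on the hidden class are the more informative columns.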

Adversarial Examples • Generative models also provide an estimate of p( x ), that is, the marginal likelihood of the input. • The likelihood of the input is a good estimate of “what the model knows”. Adversarial examples that fall out of this are a good indication that the model should stop what it’s doing and get help.
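One way to act on the estimate of p(x) is sketched below. It assumes the model exposes one joint log score log p(x, y) per class, marginalizes over classes with a log-sum-exp, and flags inputs whose log-likelihood falls under a threshold. The function names, the threshold, and the toy scores are assumptions for illustration; in practice the threshold would have to be tuned, and sequence length would likely need to be accounted for.

```python
import math

def log_marginal(log_joint_scores):
    """log p(x) = log sum_y exp(log p(x, y)), computed stably."""
    m = max(log_joint_scores)
    return m + math.log(sum(math.exp(s - m) for s in log_joint_scores))

def needs_help(log_joint_scores, threshold):
    """Flag an input whose marginal log-likelihood is below the threshold."""
    return log_marginal(log_joint_scores) < threshold

# Hypothetical joint scores for a familiar input and an unusual one.
familiar = [math.log(0.3), math.log(0.1)]
unusual = [math.log(1e-9), math.log(1e-10)]
threshold = math.log(1e-6)  # assumed cut-off, would need tuning
```

A discriminative model has no analogue of this check, since it only produces a normalized distribution over outputs and never scores the input itself.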

Discussion • Generative models of text approach their asymptotic errors more rapidly (so they are better in the small-data regime), are better able to handle new classes, can perform zero-shot learning by acquiring knowledge about the new class from an auxiliary task, and provide a good estimate of p(x) • Discriminative models of text have lower asymptotic errors and faster training and inference

Main Course: Sequence to Sequence Modeling

Seq2Seq Modeling • Many problems in text processing can be formulated as sequence to sequence problems • Translation: input is a source language sentence, output is a target language sentence • Summarization: input is a document, output is a short summary • Parsing: input is a sentence, output is a (linearized) parse tree • Code generation: input is a text description of an algorithm, output is a program • Text to speech: input is an encoding of the linguistic features associated with how a text should be pronounced, output is a waveform. • Speech recognition: input is an encoding of a waveform (or spectrum), output is text.

Seq2Seq Modeling • State of the art performance in most applications, provided enough data exists • But there are some serious problems • You can’t use “unpaired” samples of x and y to train the model • “Explaining away effects”: models like this learn to ignore “inconvenient” inputs (i.e., x), in favor of high probability continuations of an output prefix ($y_{<i}$)

Generative: Seq2Seq Models “Source model” “Channel model”

Generative: Seq2Seq Models “Source model” “Channel model” The world is colorful because of the Internet...

Generative: Seq2Seq Models “Source model” “Channel model” The world is colorful because of the Internet... 世界因互联网而更多彩...

Generative: Seq2Seq Models “Source model” “Channel model” The world is colorful because of the Internet... 世界因互联网而更多彩... Source model can be estimated from unpaired y’s

Generative: Seq2Seq Models The world is colorful because of the Internet... 世界因互联网而更多彩...

Generative: Seq2Seq Models The world is colorful because of the Internet... 世界因互联网而更多彩... Is proposed output well-formed?

Generative: Seq2Seq Models The world is colorful because of the Internet... 世界因互联网而更多彩... Is proposed output well-formed? Does proposed output explain the observed input?
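The two questions on this slide correspond to the two terms of a noisy channel score: the source model log p(y) asks whether the proposed output is well-formed, and the channel model log p(x | y) asks whether it explains the observed input. A minimal reranking sketch follows; the candidate list and all probabilities are invented, the function names are hypothetical, and real systems would obtain both scores from trained models and search over many more candidates.

```python
import math

def noisy_channel_score(log_p_y, log_p_x_given_y):
    """log p(y) + log p(x | y), which is log p(x, y)."""
    return log_p_y + log_p_x_given_y

def rerank(candidates):
    """Pick the candidate output y with the highest noisy channel score.

    candidates maps each output string to a pair
    (log p(y), log p(x | y)) for one fixed input x.
    """
    return max(candidates, key=lambda y: noisy_channel_score(*candidates[y]))

# Hypothetical scores for candidate translations of one input sentence.
candidates = {
    # fluent and explains the input
    "The world is colorful because of the Internet": (math.log(0.02), math.log(0.5)),
    # explains the input but is not well-formed
    "World colorful Internet because": (math.log(0.0001), math.log(0.6)),
    # fluent but does not explain the input
    "The weather is nice today": (math.log(0.05), math.log(1e-6)),
}
best = rerank(candidates)
```

A candidate has to score well on both terms to win, so a fluent output that ignores the input is penalized by the channel model rather than rewarded.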
