SLIDE 1

CS11-747 Neural Networks for NLP

Recurrent Neural Networks

Graham Neubig

Site https://phontron.com/class/nn4nlp2020/

SLIDE 2

NLP and Sequential Data

  • NLP is full of sequential data
    • Words in sentences
    • Characters in words
    • Sentences in discourse
SLIDE 3

Long-distance Dependencies in Language

  • Agreement in number, gender, etc.
  • Selectional preference

He does not have very much confidence in himself.
She does not have very much confidence in herself.
The reign has lasted as long as the life of the queen.
The rain has lasted as long as the life of the clouds.

SLIDE 4

Can be Complicated!

  • What is the referent of “it”?

The trophy would not fit in the brown suitcase because it was too big.
The trophy would not fit in the brown suitcase because it was too small.

(from Winograd Schema Challenge: http://commonsensereasoning.org/winograd.html; answers: trophy, suitcase)

SLIDE 5

Recurrent Neural Networks

(Elman 1990)

Figure: a feed-forward NN (lookup → transform → predict, mapping a context to a label) alongside a recurrent NN with the same lookup → transform → predict steps, where the transform step also feeds its output back into itself at the next time step.

  • Tools to “remember” information
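
As a minimal sketch of that "memory" (illustrative Python/PyTorch, not code from the slides; the weight names are made up), an Elman-style step mixes the current input with the previous hidden state, so information can persist across time steps:

```python
import torch

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One Elman RNN step: mix the current input with the previous
    hidden state, then squash through a non-linearity."""
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: 4-dim word vectors, 8-dim hidden state.
W_xh = torch.randn(4, 8) * 0.1
W_hh = torch.randn(8, 8) * 0.1
b_h = torch.zeros(8)

h = torch.zeros(8)             # the initial "memory"
for x in torch.randn(5, 4):    # a sequence of 5 toy word vectors
    h = rnn_step(x, h, W_xh, W_hh, b_h)
# h now summarizes everything the network has read so far
```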
SLIDE 6

Unrolling in Time

  • What does processing a sequence look like?

Figure: the sentence "I hate this movie" unrolled into four RNN steps, one per word, each followed by a "predict label" step.

SLIDE 7

Training RNNs

Figure: the unrolled RNN over "I hate this movie"; each step's prediction is compared to its label to give a per-step loss, and the four losses are summed into a total loss.

SLIDE 8

RNN Training

  • The unrolled graph is a well-formed (DAG) computation graph, so we can run backprop
  • Parameters are tied across time; derivatives are aggregated across all time steps
  • This is historically called “backpropagation through time” (BPTT)

SLIDE 9

Parameter Tying

Figure: the same unrolled training graph over "I hate this movie", emphasizing that all four RNN steps use one shared set of parameters.

Parameters are shared! Derivatives are accumulated.
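
A hypothetical PyTorch sketch of this setup (not the course's example code): one `RNNCell` is reused at every time step, the per-step losses are summed, and a single backward call accumulates derivatives from all steps into the shared parameters:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 16)    # toy vocabulary of 100 words
cell = nn.RNNCell(16, 32)      # ONE set of parameters, reused at every step
out = nn.Linear(32, 2)         # per-step predictor

words = torch.tensor([7, 12, 3, 51])    # "I hate this movie" as toy ids
labels = torch.tensor([1, 0, 1, 0])     # one label per time step

h = torch.zeros(1, 32)
total_loss = 0.0
for w, y in zip(words, labels):
    h = cell(emb(w).unsqueeze(0), h)    # the same cell at every time step
    total_loss = total_loss + nn.functional.cross_entropy(out(h), y.unsqueeze(0))

total_loss.backward()   # BPTT: gradients from all steps accumulate in emb/cell/out
```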

SLIDE 10

Applications of RNNs

SLIDE 11

What Can RNNs Do?

  • Represent a sentence
    • Read whole sentence, make a prediction
  • Represent a context within a sentence
    • Read context up until that point
SLIDE 12

Representing Sentences

Figure: the RNN reads "I hate this movie" word by word, and a single prediction is made from the final state.

  • Sentence classification
  • Conditioned generation
  • Retrieval
SLIDE 13

Representing Contexts

Figure: the RNN reads "I hate this movie", and a label is predicted at every word from the hidden state at that point.

  • Tagging
  • Language Modeling
  • Calculating Representations for Parsing, etc.
SLIDE 14

e.g. Language Modeling

Figure: an RNN reads "<s> I hate this movie" and at each step predicts the next word: "I", "hate", "this", "movie", "</s>".

  • Language modeling is like a tagging task, where each tag is the next word!
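
A minimal illustration of that framing (hypothetical PyTorch, not the course's lm code): shift the sentence by one position and train each step's prediction against the next word:

```python
import torch
import torch.nn as nn

V = 1000                                # toy vocabulary size
emb = nn.Embedding(V, 32)
rnn = nn.RNN(32, 64, batch_first=True)
pred = nn.Linear(64, V)

sent = torch.randint(0, V, (1, 6))      # ids for "<s> I hate this movie </s>"
inputs = sent[:, :-1]                   # <s> I hate this movie
targets = sent[:, 1:]                   # I hate this movie </s>

states, _ = rnn(emb(inputs))            # (1, 5, 64): one hidden state per position
logits = pred(states)                   # (1, 5, V): the "tag" is the next word
loss = nn.functional.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
```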

SLIDE 15

Bi-RNNs

  • A simple extension: run the RNN in both directions

Figure: forward and backward RNNs both read "I hate this movie"; at each word the two hidden states are concatenated and fed to a softmax that predicts the POS tags PRN, VB, DET, NN.
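
A sketch of that concatenation step (illustrative PyTorch; `bidirectional=True` returns the forward and backward hidden states already concatenated at each position):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 32)
birnn = nn.RNN(32, 64, batch_first=True, bidirectional=True)
tagger = nn.Linear(2 * 64, 4)            # 4 toy POS tags, e.g. PRN/VB/DET/NN

words = torch.tensor([[5, 17, 8, 99]])   # "I hate this movie" as toy ids
states, _ = birnn(emb(words))            # (1, 4, 128): [forward ; backward] per word
tag_scores = tagger(states)              # unnormalized tag scores at each word
tags = tag_scores.argmax(-1)             # pick the highest-scoring tag per word
```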

SLIDE 16

Code Examples

sentiment-rnn.py
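
sentiment-rnn.py is part of the class code and is not reproduced here; a rough PyTorch sketch of the idea it demonstrates (read the whole sentence, predict one label from the final state) might look like this:

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_classes)

    def forward(self, word_ids):                  # word_ids: (batch, seq_len)
        _, h_last = self.rnn(self.emb(word_ids))  # final state: (1, batch, hid_dim)
        return self.out(h_last.squeeze(0))        # one score vector per sentence

model = SentimentRNN(vocab_size=10000)
scores = model(torch.randint(0, 10000, (2, 7)))   # 2 toy sentences, 7 words each
```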

SLIDE 17

Vanishing Gradients

SLIDE 18

Vanishing Gradient

  • Gradients decrease as they get pushed back
  • Why? “Squashed” by non-linearities or small weights in matrices.
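
A small, illustrative demonstration (not from the slides): push a gradient back through many tanh RNN steps and compare the gradient norm at the nearest and the earliest inputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(8, 8, nonlinearity="tanh")

xs = [torch.randn(1, 8, requires_grad=True) for _ in range(50)]
h = torch.zeros(1, 8)
for x in xs:
    h = cell(x, h)          # every step squashes through tanh and the recurrent matrix

h.sum().backward()
print(xs[-1].grad.norm())   # gradient at the last (nearest) input
print(xs[0].grad.norm())    # gradient at the first input: typically far smaller
```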

SLIDE 19

A Solution: Long Short-term Memory

(Hochreiter and Schmidhuber 1997)

  • Basic idea: make additive connections between time steps

  • Addition does not modify the gradient, no vanishing
  • Gates to control the information flow
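
A sketch of the standard LSTM update in code (the usual modern formulation; variable names are mine and the biases are packed into b), showing the gates and the additive cell update:

```python
import torch

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b pack the input/forget/output/candidate parameters."""
    gates = x @ W + h_prev @ U + b
    i, f, o, g = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates in (0, 1)
    g = torch.tanh(g)                                               # candidate update
    c = f * c_prev + i * g   # ADDITIVE connection: gradient flows through c_prev unsquashed
    h = o * torch.tanh(c)    # hidden state exposed to the rest of the network
    return h, c

# Toy usage: 8-dim inputs, 16-dim hidden/cell state.
d = 16
W, U, b = torch.randn(8, 4 * d) * 0.1, torch.randn(d, 4 * d) * 0.1, torch.zeros(4 * d)
h = c = torch.zeros(d)
for x in torch.randn(10, 8):
    h, c = lstm_step(x, h, c, W, U, b)
```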
SLIDE 20

LSTM Structure

SLIDE 21

Code Examples

sentiment-lstm.py
lm-lstm.py

SLIDE 22

What can LSTMs Learn? (1)

(Karpathy et al. 2015)

  • Additive connections make single nodes surprisingly interpretable
SLIDE 23

What can LSTMs Learn? (2)

(Shi et al. 2016, Radford et al. 2017)

Figures: one LSTM cell counts the length of the sentence so far; another tracks sentiment.

SLIDE 24

Efficiency Tricks

SLIDE 25

Handling Mini-batching

  • Mini-batching makes things much faster!
  • But mini-batching in RNNs is harder than in feed-forward networks

  • Each word depends on the previous word
  • Sequences are of various length
SLIDE 26

Mini-batching Method

Padding:
  this is an example </s>
  this is another </s> </s>

Loss calculation mask:
  1 1 1 1 1
  1 1 1 1 0

  • Take Sum

(Or use DyNet automatic mini-batching, much easier but a bit slower)
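
A sketch of the masked loss (illustrative PyTorch, not the course's DyNet code; assumes per-word losses for the padded batch have already been computed):

```python
import torch

# Per-word losses for a padded batch of 2 sentences x 5 positions:
# "this is an example </s>" and "this is another </s>" padded with an extra </s>.
word_losses = torch.rand(2, 5)

mask = torch.tensor([[1., 1., 1., 1., 1.],
                     [1., 1., 1., 1., 0.]])   # 0 where the position is only padding

total_loss = (word_losses * mask).sum()       # padded positions contribute nothing
```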

SLIDE 27

Bucketing/Sorting

  • If we use sentences of different lengths, too much padding and sorting can result in decreased performance
  • To remedy this: sort sentences so similarly-lengthed sentences are in the same batch
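
A sketch of that sorting step (plain Python; `sentences` is a hypothetical list of tokenized sentences):

```python
def make_batches(sentences, batch_size):
    """Group similarly-lengthed sentences so each batch needs little padding."""
    by_length = sorted(sentences, key=len)
    return [by_length[i:i + batch_size]
            for i in range(0, len(by_length), batch_size)]

# The batches themselves can then be shuffled so training order is still randomized.
```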

SLIDE 28

Code Example

lm-minibatch.py

SLIDE 29

Optimized Implementations of LSTMs

(Appleyard 2015)

  • In a simple implementation, we still need one GPU call for each time step
  • For some RNN variants (e.g. LSTM), efficient full-sequence computation is supported by CuDNN
  • Basic process: combine the inputs into a tensor and make a single GPU call
  • Downside: significant loss of flexibility
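
In PyTorch terms (an illustrative comparison, not the slide's code), this is the difference between looping over an LSTM cell one step at a time and handing the whole sequence to `nn.LSTM`, which can dispatch to CuDNN's fused kernel on GPU:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 50, 32)     # batch of 8 sequences, 50 time steps, 32-dim inputs

# Simple implementation: one call per time step.
cell = nn.LSTMCell(32, 64)
h, c = torch.zeros(8, 64), torch.zeros(8, 64)
for t in range(x.size(1)):
    h, c = cell(x[:, t], (h, c))

# Full-sequence call (CuDNN-backed on GPU): faster, but less flexible per step.
lstm = nn.LSTM(32, 64, batch_first=True)
outputs, (h_n, c_n) = lstm(x)
```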
SLIDE 30

RNN Variants

SLIDE 31

Gated Recurrent Units

(Cho et al. 2014)

  • A simpler version that preserves the additive connections

Figure: the GRU update equations, with an update gate that additively interpolates between the previous hidden state and a new candidate state, and a reset gate r that controls the non-linear candidate computation.

  • Note: GRUs cannot do things like simply count
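
For reference, a sketch of the GRU equations (Cho et al. 2014) in code; variable names are mine and biases are omitted for brevity:

```python
import torch

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = torch.sigmoid(x @ Wz + h_prev @ Uz)           # update gate
    r = torch.sigmoid(x @ Wr + h_prev @ Ur)           # reset gate
    h_tilde = torch.tanh(x @ Wh + (r * h_prev) @ Uh)  # non-linear candidate state
    return (1 - z) * h_prev + z * h_tilde             # additive interpolation with h_prev
```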
SLIDE 32

Extensive Architecture Search for LSTMs

(Greff et al. 2015)

  • Many different types of architectures tested for LSTMs
  • Conclusion: the basic LSTM is quite good; other variants (e.g. coupled input/forget gates) are reasonable

SLIDE 33

Handling Long Sequences

SLIDE 34

Handling Long Sequences

  • Sometimes we would like to capture long-term dependencies over long sequences
  • e.g. words in full documents
  • However, this may not fit in (GPU) memory
SLIDE 35

Truncated BPTT

  • Backprop over shorter segments, initializing with the state from the previous segment

Figure: a first pass runs the RNN over "I hate this movie"; a second pass runs over "It is so bad", initialized with the final state from the first pass (only the state is carried over, with no backprop across the segment boundary).
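
A sketch of that state handoff between segments (illustrative PyTorch): `detach()` keeps the state's value but cuts the backprop link to the previous segment:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(32, 64, batch_first=True)
segments = [torch.randn(1, 4, 32),   # e.g. "I hate this movie"
            torch.randn(1, 4, 32)]   # e.g. "It is so bad"

h = None
for segment in segments:
    out, h = rnn(segment, h)
    loss = out.pow(2).mean()   # stand-in for the real per-segment loss
    loss.backward()            # backprop stays within this segment
    h = h.detach()             # carry the state forward, but do not backprop through it
```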

SLIDE 36

Questions?