CSEP 517: Natural Language Processing - Recurrent Neural Networks (PowerPoint Presentation)



SLIDE 1

CSEP 517: Natural Language Processing Recurrent Neural Networks Autumn 2018

Luke Zettlemoyer, University of Washington [most slides from Yejin Choi]

SLIDE 2

RECURRENT NEURAL NETWORKS

SLIDE 3

!" !# !$ !% ℎ" ℎ# ℎ$ ℎ%

Recurrent Neural Networks (RNNs)

  • Each input “word” is a vector
  • Each RNN unit computes a new hidden state using the previous state and a new input
  • Each RNN unit (optionally) makes an output using the current hidden state
  • Hidden states are continuous vectors
    – Can represent very rich information, a function of the entire history
  • Parameters are shared (tied) across all RNN units (unlike feedforward NNs)

  ht = f(xt, ht−1),   ht ∈ R^D,   yt = softmax(V ht)

SLIDE 4

Softmax

  • Turn a vector of real numbers x into a probability distribution (see the sketch below)
  • We have seen this trick before!
    – log-linear models…
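As a quick reference (not spelled out on the slide, but standard): softmax(x)i = exp(xi) / Σj exp(xj). A minimal numpy sketch, with the max subtracted for numerical stability:

```python
import numpy as np

def softmax(x):
    """Turn a vector of real numbers into a probability distribution.
    Subtracting max(x) avoids overflow and does not change the result."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Example: softmax([1.0, 2.0, 3.0]) ≈ [0.09, 0.245, 0.665]; the entries sum to 1.
```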


SLIDE 5

Recurrent Neural Networks (RNNs)

  • Generic RNNs:  ht = f(xt, ht−1),  yt = softmax(V ht)
  • Vanilla RNN:  ht = tanh(Uxt + Wht−1 + b),  yt = softmax(V ht)   (sketched in code below)

  [Figure: RNN unrolled over time, with inputs x1 … x4 and hidden states h1 … h4]

SLIDE 6

Sigmoid

  • Often used for gates
  • Pro: neuron-like, differentiable
  • Con: gradients saturate to zero almost everywhere except when x is near zero => vanishing gradients
  • Batch normalization helps


  σ(x) = 1 / (1 + e^−x)
  σ′(x) = σ(x)(1 − σ(x))

SLIDE 7

Tanh

  • Often used for hidden states & cells in RNNs, LSTMs
  • Pro: differentiable, often converges faster than sigmoid
  • Con: gradients easily saturate to zero => vanishing gradients


  tanh(x) = (e^x − e^−x) / (e^x + e^−x) = 2σ(2x) − 1
  tanh′(x) = 1 − tanh²(x)
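A small numeric check of the saturation behavior described on the last two slides, just evaluating σ′(x) = σ(x)(1 − σ(x)) and tanh′(x) = 1 − tanh²(x) at a few points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
d_sigmoid = sigmoid(x) * (1 - sigmoid(x))   # peaks at 0.25 at x = 0, ≈ 0 for |x| large
d_tanh = 1 - np.tanh(x) ** 2                # peaks at 1.0 at x = 0, ≈ 0 for |x| large
print(d_sigmoid)  # gradients vanish away from x = 0
print(d_tanh)
```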

SLIDE 8

Many uses of RNNs

  • 1. Classification (seq to one)
  • Input: a sequence
  • Output: one label (classification)
  • Example: sentiment classification

  ht = f(xt, ht−1)
  y = softmax(V hn)

  [Figure: RNN unrolled over time; only the final hidden state hn feeds the classifier]
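A sketch of the seq-to-one pattern, under the same assumptions as the earlier RNN sketch: run the recurrence over the whole input and classify from the final hidden state only. `V_cls` is a hypothetical label-projection matrix, not something named on the slides.

```python
import numpy as np

def classify_sequence(xs, U, W, b, h0, V_cls):
    """Sequence classification: encode the input, then y = softmax(V_cls h_n)."""
    h = h0
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)   # same vanilla recurrence as before
    return softmax(V_cls @ h)            # e.g. a distribution over sentiment labels
```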
SLIDE 9
Many uses of RNNs

  • 2. One to seq
  • Input: one item
  • Output: a sequence
  • Example: image captioning

  ht = f(xt, ht−1),  yt = softmax(V ht)

  [Figure: a single input (an image) unrolled into a sequence of hidden states generating “Cat sitting on top of …”]

SLIDE 10
Many uses of RNNs

  • 3. Sequence tagging
  • Input: a sequence
  • Output: a sequence (of the same length)
  • Example: POS tagging, Named Entity Recognition
  • How about Language Models?
    – Yes! RNNs can be used as LMs!
    – RNNs make the Markov assumption: T/F?

  ht = f(xt, ht−1),  yt = softmax(V ht)

  [Figure: RNN unrolled over time, with one output per input position]

SLIDE 11
Many uses of RNNs

  • 4. Language models
  • Input: a sequence of words
  • Output: the next word
    – (or a sequence of next words, if repeated)
  • During training, xt and yt−1 are the same word: the target at step t−1 is fed in as the input at step t.
  • During testing, xt is sampled from the softmax at step t−1 (see the sketch below).
  • Do RNN LMs make the Markov assumption?
    – i.e., does the next word depend only on the previous N words?

  ht = f(xt, ht−1),  yt = softmax(V ht)

  [Figure: RNN unrolled over time, each step predicting the next word]
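A sketch of RNN LM generation as described above: the word sampled from the step-t softmax is fed back in as the next input. The embedding matrix `E` and the begin-of-sentence id are illustrative assumptions; note that ht carries the entire history, so no fixed-N Markov assumption is made.

```python
import numpy as np

def generate(E, U, W, V, b, h0, bos_id, n_words, rng=np.random.default_rng(0)):
    """Sample n_words from a vanilla RNN language model.
    E[i] is the embedding of word i; bos_id is a begin-of-sentence token id."""
    h, word, out = h0, bos_id, []
    for _ in range(n_words):
        h = np.tanh(U @ E[word] + W @ h + b)   # h_t depends on the whole history via h_{t-1}
        p = softmax(V @ h)                     # distribution over the next word
        word = int(rng.choice(len(p), p=p))    # sample, then feed back in as the next input
        out.append(word)
    return out
```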

SLIDE 12
Many uses of RNNs

  • 5. seq2seq (aka “encoder-decoder”)
  • Input: a sequence
  • Output: a sequence (of different length)
  • Examples?

  ht = f(xt, ht−1),  yt = softmax(V ht)

  [Figure: encoder RNN reading the input sequence, decoder RNN generating the output sequence]
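A sketch of the encoder-decoder pattern, assuming two separate vanilla-RNN parameter sets (dicts `enc` and `dec`, a hypothetical layout) and an output embedding matrix `E_out`: the encoder's final hidden state initializes the decoder, which generates until an end-of-sequence token.

```python
import numpy as np

def seq2seq_generate(xs, enc, dec, E_out, bos_id, eos_id, max_len=50):
    """Encode the input sequence xs, then decode an output sequence of a different length."""
    # Encoder: compress the input sequence into a single vector.
    h = enc["h0"]
    for x in xs:
        h = np.tanh(enc["U"] @ x + enc["W"] @ h + enc["b"])
    # Decoder: start from the encoder's final state, feed each prediction back in.
    word, out = bos_id, []
    for _ in range(max_len):
        h = np.tanh(dec["U"] @ E_out[word] + dec["W"] @ h + dec["b"])
        word = int(np.argmax(softmax(dec["V"] @ h)))   # greedy decoding, for simplicity
        if word == eos_id:
            break
        out.append(word)
    return out
```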

SLIDE 13

Many uses of RNNs

  • 5. seq2seq (aka “encoder-decoder”)
  • Parsing!
    – “Grammar as a Foreign Language” (Vinyals et al., 2015)
    – Example input: “John has a dog”

  [Figure: encoder-decoder RNN mapping the sentence “John has a dog” to its linearized parse tree]
SLIDE 14

Recurrent Neural Networks (RNNs)

  • Generic RNNs:  ht = f(xt, ht−1),  yt = softmax(V ht)
  • Vanilla RNN:  ht = tanh(Uxt + Wht−1 + b),  yt = softmax(V ht)

  [Figure: RNN unrolled over time, with inputs x1 … x4 and hidden states h1 … h4]

SLIDE 15

The vanishing gradient problem for RNNs

  • The shading of the nodes in the unfolded network indicates their sensitivity to the inputs at time one (the darker the shade, the greater the sensitivity).
  • The sensitivity decays over time as new inputs overwrite the activations of the hidden layer, and the network ‘forgets’ the first inputs.

  Example from Graves 2012

SLIDE 16

Recurrent Neural Networks (RNNs)

  • Generic RNNs:  ht = f(xt, ht−1)
  • Vanilla RNNs:  ht = tanh(Uxt + Wht−1 + b)
  • LSTMs (Long Short-Term Memory networks):

    ft = σ(U(f)xt + W(f)ht−1 + b(f))
    it = σ(U(i)xt + W(i)ht−1 + b(i))
    ot = σ(U(o)xt + W(o)ht−1 + b(o))
    c̃t = tanh(U(c)xt + W(c)ht−1 + b(c))
    ct = ft ⊙ ct−1 + it ⊙ c̃t
    ht = ot ⊙ tanh(ct)

  There are many known variations to this set of equations! (A code sketch follows below.)

  [Figure: LSTM unrolled over time, with inputs x1 … x4, cell states c1 … c4 (ct: cell state), and hidden states h1 … h4 (ht: hidden state)]

SLIDE 17

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

  [Figure: LSTM cell diagram, with xt−1, ht−1 and xt, ht]

Figure by Christopher Olah (colah.github.io)

SLIDE 18

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

  sigmoid: [0,1]

  Forget gate (forget the past or not):  ft = σ(U(f)xt + W(f)ht−1 + b(f))

Figure by Christopher Olah (colah.github.io)

SLIDE 19

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

  sigmoid: [0,1], tanh: [-1,1]

  Forget gate (forget the past or not):  ft = σ(U(f)xt + W(f)ht−1 + b(f))
  Input gate (use the input or not):  it = σ(U(i)xt + W(i)ht−1 + b(i))
  New cell content (temp):  c̃t = tanh(U(c)xt + W(c)ht−1 + b(c))

Figure by Christopher Olah (colah.github.io)

SLIDE 20

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

  sigmoid: [0,1], tanh: [-1,1]

  Forget gate (forget the past or not):  ft = σ(U(f)xt + W(f)ht−1 + b(f))
  Input gate (use the input or not):  it = σ(U(i)xt + W(i)ht−1 + b(i))
  New cell content (temp):  c̃t = tanh(U(c)xt + W(c)ht−1 + b(c))
  New cell content (mix the old cell with the new temp cell):  ct = ft ⊙ ct−1 + it ⊙ c̃t

Figure by Christopher Olah (colah.github.io)

SLIDE 21

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

  Forget gate (forget the past or not):  ft = σ(U(f)xt + W(f)ht−1 + b(f))
  Input gate (use the input or not):  it = σ(U(i)xt + W(i)ht−1 + b(i))
  Output gate (output from the new cell or not):  ot = σ(U(o)xt + W(o)ht−1 + b(o))
  New cell content (temp):  c̃t = tanh(U(c)xt + W(c)ht−1 + b(c))
  New cell content (mix the old cell with the new temp cell):  ct = ft ⊙ ct−1 + it ⊙ c̃t
  Hidden state:  ht = ot ⊙ tanh(ct)

Figure by Christopher Olah (colah.github.io)

SLIDE 22

LSTMS (LONG SHORT-TERM MEMORY NETWORKS)

  Forget gate (forget the past or not):  ft = σ(U(f)xt + W(f)ht−1 + b(f))
  Input gate (use the input or not):  it = σ(U(i)xt + W(i)ht−1 + b(i))
  Output gate (output from the new cell or not):  ot = σ(U(o)xt + W(o)ht−1 + b(o))
  New cell content (temp):  c̃t = tanh(U(c)xt + W(c)ht−1 + b(c))
  New cell content (mix the old cell with the new temp cell):  ct = ft ⊙ ct−1 + it ⊙ c̃t
  Hidden state:  ht = ot ⊙ tanh(ct)

  [Figure: LSTM cell diagram, with xt−1, ht−1 and xt, ht]

SLIDE 23

Preservation of gradient information by LSTM

  • For simplicity, all gates are either entirely open (‘O’) or closed (‘—’).
  • The memory cell ‘remembers’ the first input as long as the forget gate is open and the input gate is closed.
  • The sensitivity of the output layer can be switched on and off by the output gate without affecting the cell.

  [Figure: forget, input, and output gate activations over time; example from Graves 2012]

SLIDE 24

Gates

  • Gates contextually control information flow
  • Open/close with sigmoid
  • In LSTMs, they are used to (contextually) maintain longer-term history


SLIDE 25

RNN Learning: Backprop Through Time (BPTT)

  • Similar to backprop with non-recurrent NNs
  • But unlike feedforward (non-recurrent) NNs, each unit in the computation graph repeats the exact same parameters…
  • Backprop gradients of the parameters of each unit as if they were different parameters
  • When updating the parameters using the gradients, use the average of the gradients throughout the entire chain of units (see the sketch below).

  [Figure: RNN unrolled over time, with inputs x1 … x4 and hidden states h1 … h4]
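The point of the last bullet is that every unrolled copy shares one parameter, so the per-copy gradients are combined into a single update. A schematic sketch (averaging, as the slide says; summing the per-step gradients is also common in practice):

```python
import numpy as np

def bptt_update(W, per_step_grads, lr=0.1):
    """Combine the gradients computed for each unrolled copy of the shared
    parameter W, then apply one update using their average."""
    g = sum(per_step_grads) / len(per_step_grads)   # average over the chain of units
    return W - lr * g
```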

SLIDE 26

Vanishing / exploding Gradients

  • Deep networks are hard to train
  • Gradients go through multiple layers
  • The multiplicative effect tends to lead to exploding or vanishing gradients
  • Practical solutions w.r.t.
    – network architecture
    – numerical operations


SLIDE 27

Vanishing / exploding Gradients

  • Practical solutions w.r.t. numerical operations
    – Gradient Clipping: bound gradients by a max value (see the sketch below)
    – Gradient Normalization: renormalize gradients when they are above a fixed norm
    – Careful initialization, smaller learning rates
    – Avoid saturating nonlinearities (like tanh, sigmoid)
      • ReLU or hard-tanh instead
    – Batch Normalization: add intermediate input normalization layers
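Minimal sketches of the first two numerical fixes listed above; the thresholds are arbitrary illustrative values.

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    """Gradient clipping: bound every gradient entry to [-max_value, max_value]."""
    return [np.clip(g, -max_value, max_value) for g in grads]

def normalize_gradients(grads, max_norm=5.0):
    """Gradient normalization: rescale all gradients when their global norm exceeds max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```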


SLIDE 28

Sneak peek: Bi-directional RNNs


  • Can incorporate context from both directions
  • Generally improves over uni-directional RNNs
SLIDE 29

RNNs make great LMs!


https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/