

SLIDE 1

NPFL114, Lecture 9

Recurrent Neural Networks III

Milan Straka

April 29, 2019

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

SLIDE 2

Recurrent Neural Networks

[Diagram: a single RNN cell mapping an input and the previous state to an output and a new state; and the same cell unrolled over a sequence (input 1 → output 1, …, input 4 → output 4), with the state passed between consecutive steps.]

SLIDE 3

Basic RNN Cell

[Diagram: the cell takes an input and the previous state, and produces an output equal to the new state.]

Given an input $x^{(t)}$ and previous state $s^{(t-1)}$, the new state is computed as
$$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta).$$
One of the simplest possibilities is
$$s^{(t)} = \tanh(U s^{(t-1)} + V x^{(t)} + b).$$
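This recurrence is easy to sketch in NumPy (a minimal illustration; the dimensions and the random parameters are made up):

```python
import numpy as np

def rnn_step(s_prev, x, U, V, b):
    # s_t = tanh(U s_{t-1} + V x_t + b)
    return np.tanh(U @ s_prev + V @ x + b)

# Illustrative sizes: state dimension 4, input dimension 3.
rng = np.random.default_rng(42)
U, V, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
s = np.zeros(4)
for x in rng.normal(size=(10, 3)):  # unroll over a sequence of 10 inputs
    s = rnn_step(s, x, U, V, b)     # the output of each step equals the new state
```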

SLIDE 4

Basic RNN Cell

Basic RNN cells suffer a lot from vanishing/exploding gradients (the challenge of long-term dependencies).

If we simplify the recurrence of states to
$$s^{(t)} = U s^{(t-1)},$$
we get
$$s^{(t)} = U^t s^{(0)}.$$
If $U$ has the eigenvalue decomposition $U = Q \Lambda Q^{-1}$, we get
$$s^{(t)} = Q \Lambda^t Q^{-1} s^{(0)}.$$
The main problem is that the same function is iteratively applied many times.

Several more complex RNN cell variants have been proposed, which alleviate this issue to some degree, namely LSTM and GRU.
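The exponential behavior of $\Lambda^t$ can be demonstrated numerically; a toy sketch with made-up matrices, where $U$ is a random orthogonal matrix scaled so that all its singular values are 0.9 or 1.1, so the state norm decays or blows up as $0.9^{100}$ or $1.1^{100}$:

```python
import numpy as np

rng = np.random.default_rng(0)
s0 = rng.normal(size=8)
for scale in (0.9, 1.1):
    Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
    U = scale * Q                                 # all singular values equal `scale`
    s = s0
    for _ in range(100):                          # apply the same map 100 times
        s = U @ s
    print(scale, np.linalg.norm(s) / np.linalg.norm(s0))  # ~0.9**100 vs ~1.1**100
```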

SLIDE 5

Long Short-Term Memory

[Diagram: the LSTM cell with input, forget, and output gates controlling the memory cell $c_t$.]

Later, Gers, Schmidhuber & Cummins (1999) added the possibility to forget information from the memory cell:
$$i_t \leftarrow \sigma(W^i x_t + V^i h_{t-1} + b^i)$$
$$f_t \leftarrow \sigma(W^f x_t + V^f h_{t-1} + b^f)$$
$$o_t \leftarrow \sigma(W^o x_t + V^o h_{t-1} + b^o)$$
$$c_t \leftarrow f_t \cdot c_{t-1} + i_t \cdot \tanh(W^y x_t + V^y h_{t-1} + b^y)$$
$$h_t \leftarrow o_t \cdot \tanh(c_t)$$
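A direct NumPy transcription of these equations (a sketch, with the parameters stored in dicts keyed by gate name):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W, V, b):
    """One LSTM step; W, V, b are dicts with keys "i", "f", "o", "y"."""
    i = sigmoid(W["i"] @ x + V["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x + V["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] @ x + V["o"] @ h_prev + b["o"])  # output gate
    c = f * c_prev + i * np.tanh(W["y"] @ x + V["y"] @ h_prev + b["y"])
    h = o * np.tanh(c)
    return h, c
```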

SLIDE 6

Long Short-Term Memory

http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png

SLIDE 7

Gated Recurrent Unit

[Diagram: the GRU cell with reset and update gates.]

$$r_t \leftarrow \sigma(W^r x_t + V^r h_{t-1} + b^r)$$
$$u_t \leftarrow \sigma(W^u x_t + V^u h_{t-1} + b^u)$$
$$\hat h_t \leftarrow \tanh(W^h x_t + V^h (r_t \cdot h_{t-1}) + b^h)$$
$$h_t \leftarrow u_t \cdot h_{t-1} + (1 - u_t) \cdot \hat h_t$$
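The corresponding NumPy sketch, in the same style as the LSTM step above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(h_prev, x, W, V, b):
    """One GRU step; W, V, b are dicts with keys "r", "u", "h"."""
    r = sigmoid(W["r"] @ x + V["r"] @ h_prev + b["r"])        # reset gate
    u = sigmoid(W["u"] @ x + V["u"] @ h_prev + b["u"])        # update gate
    h_hat = np.tanh(W["h"] @ x + V["h"] @ (r * h_prev) + b["h"])
    return u * h_prev + (1 - u) * h_hat                       # new state h_t
```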

SLIDE 8

Gated Recurrent Unit

http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png

SLIDE 9

Word Embeddings

One-hot encoding considers all words to be independent of each other. However, words are not independent – some are more similar than others. Ideally, we would like some kind of similarity in the space of the word representations.

Distributed Representation

The idea behind distributed representation is that objects can be represented using a set of common underlying factors. We therefore represent words as fixed-size embeddings into an $\mathbb{R}^d$ space, with the vector elements playing the role of the common underlying factors.

SLIDE 10

Word Embeddings

The word embedding layer is in fact just a fully connected layer on top of one-hot encoding. However, it is important that this layer is shared across the whole network.

[Diagram: several words in one-hot encoding, each multiplied by the same V×D embedding matrix to produce a D-dimensional embedding; the matrix is shared across all positions.]
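That the embedding lookup and the one-hot matmul coincide is a one-liner to check (the sizes are illustrative):

```python
import numpy as np

V, D = 10, 4                                       # vocabulary size, embedding size
E = np.random.default_rng(1).normal(size=(V, D))   # the shared embedding matrix

word_id = 7
one_hot = np.eye(V)[word_id]
assert np.allclose(one_hot @ E, E[word_id])        # matmul with one-hot == row lookup
```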

SLIDE 11

Word Embeddings for Unknown Words

Recurrent Character-level WEs

Figure 1 of paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.

SLIDE 12

Word Embeddings for Unknown Words

Convolutional Character-level WEs

Figure 1 of paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.

SLIDE 13

Basic RNN Applications

Sequence Element Classification

Use outputs for individual elements.

[Diagram: an unrolled RNN producing an output for every input element (input 1 → output 1, …, input 4 → output 4), with the state passed between consecutive steps.]

Sequence Representation

Use state after processing the whole sequence (alternatively, take output of the last element).

SLIDE 14

Structured Prediction

Consider generating a sequence $y_1, \ldots, y_N \in Y^N$ given input $x_1, \ldots, x_N$. Predicting each sequence element independently models the distribution $P(y_i \mid X)$. However, there may be dependencies among the $y_i$ themselves, which is difficult to capture by independent element classification.

SLIDE 15

Linear-Chain Conditional Random Fields (CRF)

Linear-chain Conditional Random Fields, usually abbreviated only to CRF, act as an output layer. They can be considered an extension of softmax – instead of a sequence of independent softmaxes, a CRF is a sentence-level softmax, with additional weights for neighboring sequence elements.

$$s(X, y; \theta, A) = \sum_{i=1}^N \big(A_{y_{i-1}, y_i} + f_\theta(y_i \mid X)\big)$$
$$p(y \mid X) = \operatorname{softmax}_{z \in Y^N}\big(s(X, z)\big)_y$$
$$\log p(y \mid X) = s(X, y) - \operatorname{logadd}_{z \in Y^N}\big(s(X, z)\big)$$

SLIDE 16

Linear-Chain Conditional Random Fields (CRF)

Computation

We can compute $p(y \mid X)$ efficiently using dynamic programming. If we denote by $\alpha_t(k)$ the log probability of all sentences with $t$ elements, the last being $k$, the core idea is the following:
$$\alpha_t(k) = f_\theta(y_t = k \mid X) + \operatorname{logadd}_{j \in Y}\big(\alpha_{t-1}(j) + A_{j,k}\big).$$
For efficient implementation, we use the fact that
$$\ln(a + b) = \ln a + \ln(1 + e^{\ln b - \ln a}).$$
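A compact NumPy sketch of this forward pass, using `scipy.special.logsumexp` as the numerically stable logadd (the shapes are assumptions: per-position scores `f` of shape [N, K] and transition weights `A` of shape [K, K]):

```python
import numpy as np
from scipy.special import logsumexp  # a numerically stable logadd

def crf_log_partition(f, A):
    """logadd over all z in Y^N of s(X, z); f[t, k] = f_theta(y_t = k | X)."""
    alpha = f[0]                                          # alpha_1(k)
    for t in range(1, len(f)):
        # alpha_t(k) = f_theta(y_t = k | X) + logadd_j(alpha_{t-1}(j) + A[j, k])
        alpha = f[t] + logsumexp(alpha[:, None] + A, axis=0)
    return logsumexp(alpha)

# log p(y | X) is then s(X, y) minus crf_log_partition(f, A).
```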

SLIDE 17

Conditional Random Fields (CRF)

Decoding

We can perform optimal decoding by using the same algorithm, only replacing $\operatorname{logadd}$ with $\max$ and tracking where the maximum was attained (a sketch below).
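A sketch of this Viterbi-style decoding, under the same assumed shapes as the forward-pass sketch on the previous slide:

```python
import numpy as np

def crf_decode(f, A):
    """Replace logadd with max and track where each maximum was attained."""
    N, K = f.shape
    alpha, back = f[0], np.zeros((N, K), dtype=int)
    for t in range(1, N):
        scores = alpha[:, None] + A        # scores[j, k]: coming from j into k
        back[t] = scores.argmax(axis=0)    # remember the best predecessor
        alpha = f[t] + scores.max(axis=0)
    y = [int(alpha.argmax())]
    for t in range(N - 1, 0, -1):          # follow the back-pointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]                         # the optimal label sequence
```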

Applications

CRF output layers are useful for span labeling tasks, like named entity recognition and dialog slot filling.

SLIDE 18

Connectionist Temporal Classification

Let us again consider generating a sequence $y_1, \ldots, y_M$ given input $x_1, \ldots, x_N$, but this time $M \le N$, and there is no explicit alignment of $x$ and $y$ in the gold data.

Figure 7.1 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

SLIDE 19

Connectionist Temporal Classification

We enlarge the set of output labels by a – (blank), and perform a classification for every input element to produce an extended labeling. We then post-process it by the following rules (denoted $\mathcal B$):
1. We collapse neighboring identical symbols into one.
2. We remove the – symbols.

Because the explicit alignment of inputs and labels is not known, we consider all possible alignments. Denoting the probability of label $l$ at time $t$ as $p_l^t$, we define
$$\alpha_t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal B(\pi_{1:t}) = y_{1:s}} \ \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}.$$

SLIDE 20

CRF and CTC Comparison

In CRF, we normalize over whole sentences, and therefore need to compute unnormalized probabilities of all the (exponentially many) sentences. Decoding can be performed optimally. In CTC, we normalize per each label. However, because we do not have an explicit alignment, we compute the probability of a labeling by summing the probabilities of the (generally exponentially many) extended labelings.

SLIDE 21

Connectionist Temporal Classification

Computation

When aligning an extended labeling to a regular one, we need to consider whether the extended labeling ends with a blank or not. We therefore define
$$\alpha^-_t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal B(\pi_{1:t}) = y_{1:s},\ \pi_t = -} \ \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$
$$\alpha^*_t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\, \mathcal B(\pi_{1:t}) = y_{1:s},\ \pi_t \ne -} \ \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$
and compute $\alpha_t(s)$ as $\alpha^-_t(s) + \alpha^*_t(s)$.

SLIDE 22

Connectionist Temporal Classification

Figure 7.3 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

Computation

We initialize the $\alpha$s as follows:
$$\alpha^-_1(0) \leftarrow p_-^1$$
$$\alpha^*_1(1) \leftarrow p_{y_1}^1$$
We then proceed recurrently according to:
$$\alpha^-_t(s) \leftarrow p_-^t \big(\alpha^-_{t-1}(s) + \alpha^*_{t-1}(s)\big)$$
$$\alpha^*_t(s) \leftarrow \begin{cases} p_{y_s}^t \big(\alpha^*_{t-1}(s) + \alpha^*_{t-1}(s-1) + \alpha^-_{t-1}(s-1)\big) & \text{if } y_s \ne y_{s-1},\\ p_{y_s}^t \big(\alpha^*_{t-1}(s) + \alpha^-_{t-1}(s-1)\big) & \text{if } y_s = y_{s-1}. \end{cases}$$
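The whole computation fits in a short sketch. For readability it works in plain probability space (real implementations work with logs, e.g., via the identity from the CRF slide); `p` holds the per-frame label distributions with the blank in column 0, so the target labels in `y` are 1-based:

```python
import numpy as np

def ctc_probability(p, y):
    """Alpha-recursion sketch; p: [T, labels+1] with blank in column 0,
    y: the target labeling as a list of 1-based label indices."""
    T, S = len(p), len(y)
    a_blank = np.zeros((T, S + 1))   # alpha^-_t(s): alignments ending with blank
    a_label = np.zeros((T, S + 1))   # alpha^*_t(s): alignments ending with y_s
    a_blank[0, 0] = p[0, 0]          # alpha^-_1(0) <- p^1_-
    a_label[0, 1] = p[0, y[0]]       # alpha^*_1(1) <- p^1_{y_1}
    for t in range(1, T):
        for s in range(S + 1):
            a_blank[t, s] = p[t, 0] * (a_blank[t-1, s] + a_label[t-1, s])
            if s == 0:
                continue
            total = a_label[t-1, s] + a_blank[t-1, s-1]
            if s >= 2 and y[s-1] != y[s-2]:   # repeated labels must be blank-separated
                total += a_label[t-1, s-1]
            a_label[t, s] = p[t, y[s-1]] * total
    return a_blank[T-1, S] + a_label[T-1, S]  # alpha_T(S)
```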

SLIDE 23

CTC Decoding

Unlike CRF, we cannot perform the decoding optimally. The key observation is that while an optimal extended labeling can be extended into an optimal labeling of a larger length, the same does not apply to regular (non-extended) labelings. The problem is that a regular labeling corresponds to many extended labelings, each of which is modified in a different way during an extension of the regular labeling.

Figure 7.5 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

SLIDE 24

CTC Decoding

Beam Search

To perform beam search, we keep the $k$ best regular labelings for each prefix of the extended labelings. For each regular labeling we keep both $\alpha^-$ and $\alpha^*$, and by best we mean regular labelings with maximum $\alpha^- + \alpha^*$.

To compute the best regular labelings for a longer prefix of extended labelings, for each regular labeling in the beam we consider the following cases:
  • adding a blank symbol, i.e., updating both $\alpha^-$ and $\alpha^*$;
  • adding any non-blank symbol, i.e., updating $\alpha^*$.

Finally, we merge the resulting candidates according to their regular labeling and keep only the $k$ best. A sketch follows.
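A sketch of such a prefix beam search, again in plain probability space (blank in column 0; for every regular labeling the stored pair tracks the total probability of extended labelings ending with and without a blank, i.e., the $\alpha^-$ and $\alpha^*$ above):

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(p, k):
    """Prefix beam search sketch; p: [T, labels+1] per-frame probabilities."""
    beam = {(): (1.0, 0.0)}                  # prefix -> (alpha^-, alpha^*)
    for t in range(len(p)):
        new = defaultdict(lambda: [0.0, 0.0])
        for prefix, (a_blank, a_label) in beam.items():
            # Adding a blank: the regular labeling stays the same.
            new[prefix][0] += p[t, 0] * (a_blank + a_label)
            # Adding a non-blank symbol c.
            for c in range(1, p.shape[1]):
                if prefix and prefix[-1] == c:
                    new[prefix][1] += p[t, c] * a_label          # repeat collapses
                    new[prefix + (c,)][1] += p[t, c] * a_blank   # blank separated it
                else:
                    new[prefix + (c,)][1] += p[t, c] * (a_blank + a_label)
        # Candidates with the same regular labeling were merged by the dict;
        # keep only the k best by alpha^- + alpha^*.
        beam = dict(sorted(new.items(), key=lambda kv: -sum(kv[1]))[:k])
    return max(beam.items(), key=lambda kv: sum(kv[1]))[0]
```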

SLIDE 25

Unsupervised Word Embeddings

The embeddings can be trained for each task separately. However, methods of precomputing word embeddings have been proposed, based on the distributional hypothesis: Words that are used in the same contexts tend to have similar meanings. The distributional hypothesis is usually attributed to Firth (1957).

SLIDE 26

Word2Vec

[Diagram: the CBOW (Continuous Bag Of Words) architecture, where the context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ are summed in the projection layer to predict $w_t$; and the Skip-gram architecture, where $w_t$ predicts each of its context words.]

Mikolov et al. (2013) proposed two very simple architectures for precomputing word embeddings, together with a C multi-threaded implementation word2vec.

SLIDE 27

Word2Vec

Table 8 of paper "Efficient Estimation of Word Representations in Vector Space", https://arxiv.org/abs/1301.3781.

SLIDE 28

Word2Vec – SkipGram Model

[Diagram: the CBOW and Skip-gram architectures, as on the previous Word2Vec slide.]

Considering input word $w_i$ and output $w$, the Skip-gram model defines
$$p(w \mid w_i) \stackrel{\mathrm{def}}{=} \frac{e^{W_w^\top V_{w_i}}}{\sum_{w'} e^{W_{w'}^\top V_{w_i}}}.$$

SLIDE 29

Word2Vec – Hierarchical Softmax

Instead of a large softmax, we construct a binary tree over the words, with a sigmoid classifier for each node. If word $w$ corresponds to a path $n_1, n_2, \ldots, n_L$, we define
$$p_{\mathrm{HS}}(w \mid w_i) \stackrel{\mathrm{def}}{=} \prod_{j=1}^{L-1} \sigma\big([\text{+1 if } n_{j+1} \text{ is right child else -1}] \cdot W_{n_j}^\top V_{w_i}\big).$$
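A sketch of evaluating this product along one root-to-word path (the path encoding, the node vectors `W_nodes`, and the input embedding `v_wi` are assumed inputs):

```python
import numpy as np

def hs_log_prob(path, signs, W_nodes, v_wi):
    """path: inner nodes n_1..n_{L-1}; signs[j]: +1 if n_{j+1} is a right child,
    -1 otherwise; W_nodes: node vectors; v_wi: embedding of the input word."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    return sum(np.log(sigmoid(sign * (W_nodes[n] @ v_wi)))
               for n, sign in zip(path, signs))
```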

SLIDE 30

Word2Vec – Negative Sampling

Instead of a large softmax, we could train individual sigmoids for all words. We could also sample only the negative examples instead of training all of them. This gives rise to the following negative sampling objective:
$$l_{\mathrm{NEG}}(w_o, w_i) \stackrel{\mathrm{def}}{=} \log \sigma(W_{w_o}^\top V_{w_i}) + \sum_{j=1}^{k} \mathbb E_{w_j \sim P(w)} \log\big(1 - \sigma(W_{w_j}^\top V_{w_i})\big).$$
For $P(w)$, both the uniform and the unigram distribution $U(w)$ work, but $U(w)^{3/4}$ outperforms them significantly (this fact has been reported in several papers by different authors).
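A sketch of the objective for a single training pair (to be maximized; the expectation is approximated by sampling, and the $U(w)^{3/4}$ noise distribution is built from an assumed empirical `unigram` array):

```python
import numpy as np

def neg_sampling_objective(W, V, w_o, w_i, unigram, k, rng):
    """W, V: output/input embedding matrices [vocab, dim]; w_o, w_i: word ids."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    noise = unigram ** 0.75                 # U(w)^{3/4}, renormalized below
    noise /= noise.sum()
    negatives = rng.choice(len(noise), size=k, p=noise)
    objective = np.log(sigmoid(W[w_o] @ V[w_i]))
    objective += sum(np.log(1 - sigmoid(W[w_j] @ V[w_i])) for w_j in negatives)
    return objective
```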

SLIDE 31

Recurrent Character-level WEs

Figure 1 of paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.

SLIDE 32

Convolutional Character-level WEs

Table 6 of paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.

SLIDE 33

Character N-grams

Another simple idea appeared in three nearly simultaneous publications as Charagram, Subword Information or SubGram. A word embedding is the sum of the word embedding plus embeddings of its character n-grams. Such embeddings can be pretrained using the same algorithms as word2vec.

The implementation can be
  • dictionary based: only some number of frequent character n-grams is kept;
  • hash-based: character n-grams are hashed into $K$ buckets (usually $K \sim 10^6$ is used), as in the sketch below.
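A hash-based variant can be sketched as follows (the boundary markers and the n-gram range 3–6 follow the Subword Information paper; Python's built-in `hash` stands in for the fixed hash function a real implementation would use):

```python
def char_ngram_buckets(word, n_min=3, n_max=6, buckets=10**6):
    """Return bucket ids of all character n-grams of `word`."""
    w = "<" + word + ">"                       # mark the word boundaries
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return [hash(g) % buckets for g in grams]

# The word representation is then its own embedding plus the sum of the
# bucket embeddings indexed by char_ngram_buckets(word).
```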

SLIDE 34

Charagram WEs

Table 7 of paper "Enriching Word Vectors with Subword Information", https://arxiv.org/abs/1607.04606.

SLIDE 35

Charagram WEs

Figure 2 of paper "Enriching Word Vectors with Subword Information", https://arxiv.org/abs/1607.04606.

SLIDE 36

Sequence-to-Sequence Architecture


SLIDE 37

Sequence-to-Sequence Architecture

Figure 1 of paper "Sequence to Sequence Learning with Neural Networks", https://arxiv.org/abs/1409.0473.

SLIDE 38

Sequence-to-Sequence Architecture

Figure 1 of paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", https://arxiv.org/abs/1406.1078.

SLIDE 39

Sequence-to-Sequence Architecture

[Diagram: the decoder unrolled over time, starting from BOS and ending with EOS – during training the gold outputs x(t) are fed as the next inputs, while during inference the decoder's own predictions x̂(t) are fed back.]

Training

The so-called teacher forcing is used during training – the gold outputs are used as decoder inputs.

Inference

During inference, the network processes its own predictions. Usually, the generated logits are processed by an $\arg\max$, the chosen word embedded and used as the next input, as in the sketch below.
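A sketch of the inference loop; `decoder_step(state, input_embedding) -> (new_state, logits)` is an assumed interface for one step of a trained decoder:

```python
import numpy as np

def greedy_decode(decoder_step, embeddings, state, bos_id, eos_id, max_len=100):
    output, word_id = [], bos_id
    for _ in range(max_len):
        state, logits = decoder_step(state, embeddings[word_id])
        word_id = int(np.argmax(logits))   # arg max over the target vocabulary
        if word_id == eos_id:              # stop once EOS is generated
            break
        output.append(word_id)             # feed the prediction back as next input
    return output
```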

SLIDE 40

Tying Word Embeddings

[Diagram: the target word id is embedded by a V×D matrix, processed by the RNN, and the D-dimensional output is multiplied by a D×V output-layer matrix to produce the target word logits; with tied embeddings, the two matrices are shared.]
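A minimal sketch of the tying: a single V×D matrix serves both as the input embedding lookup and, transposed, as the D→V output projection (the sizes are illustrative):

```python
import numpy as np

V, D = 10000, 512
E = np.random.default_rng(0).normal(size=(V, D))  # one shared V x D matrix

word_id = 42
embedding = E[word_id]          # used as the target word embedding (row lookup)
h = np.zeros(D)                 # a decoder RNN output
logits = E @ h                  # the same matrix produces the D -> V logits
```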

SLIDE 41

Attention

Figure 1 of paper "Neural Machine Translation by Jointly Learning to Align and Translate", https://arxiv.org/abs/1409.0473.

As another input during decoding, we add a context vector $c_i$:
$$s_i = f(s_{i-1}, y_{i-1}, c_i).$$
We compute the context vector as a weighted combination of source sentence encoded outputs:
$$c_i = \sum_j \alpha_{ij} h_j$$
The weights $\alpha_{ij}$ are a softmax of $e_{ij}$ over $j$,
$$\alpha_i = \operatorname{softmax}(e_i),$$
with $e_{ij}$ being
$$e_{ij} = v^\top \tanh(V h_j + W s_{i-1} + b).$$
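These equations translate directly into a sketch for a single decoder step (`H` stacks the encoder outputs $h_j$ as rows; the parameter names and shapes are assumptions):

```python
import numpy as np

def attention(H, s_prev, V_p, W_p, v, b):
    """Return the context vector c_i and the weights alpha_i."""
    e = np.tanh(H @ V_p.T + s_prev @ W_p.T + b) @ v  # e_ij for all positions j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                             # softmax over source positions
    return alpha @ H, alpha                          # c_i = sum_j alpha_ij h_j
```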

SLIDE 42

Attention

Figure 3 of paper "Neural Machine Translation by Jointly Learning to Align and Translate", https://arxiv.org/abs/1409.0473.

SLIDE 43

Subword Units

Translate subword units instead of words. The subword units can be generated in several ways; the most commonly used are:

BPE: Using the byte pair encoding algorithm. Start with characters plus a special end-of-word symbol ⋅. Then, keep merging the most frequently occurring symbol pair $A, B$ into a new symbol $AB$, with the symbol pair never crossing a word boundary. Considering a dictionary with words low, lowest, newer, wider, the example merges are:
r ⋅ → r⋅,  l o → lo,  lo w → low,  e r⋅ → er⋅

Wordpieces: Joining neighboring symbols to maximize unigram language model likelihood.

Usually a relatively small number of subword units is used (32k–64k), often generated on the union of the two vocabularies (the so-called joint BPE or shared wordpieces).
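The merge loop itself is only a few lines; this sketch follows the reference BPE implementation of Sennrich et al., representing each dictionary word as space-separated symbols ending with the end-of-word symbol (the frequencies are made up, and ties such as "e r" versus "r ⋅" are broken arbitrarily, so the merge order can differ from the example above):

```python
import re
from collections import Counter

def learn_bpe(vocab, num_merges):
    """vocab: {'l o w ⋅': freq, ...}; returns the list of learned merges."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq                  # count neighboring symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # the most frequent pair
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges

vocab = {"l o w ⋅": 5, "l o w e s t ⋅": 2, "n e w e r ⋅": 6, "w i d e r ⋅": 3}
print(learn_bpe(vocab, 4))  # [('e', 'r'), ('er', '⋅'), ('l', 'o'), ('lo', 'w')]
```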

SLIDE 44

Google NMT

Figure 1 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.

SLIDE 45

Google NMT

Figure 5 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.

SLIDE 46

Google NMT

Figure 6 of paper "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", https://arxiv.org/abs/1609.08144.

SLIDE 47

Beyond one Language Pair

Figure 5 of "Show and Tell: Lessons learned from the 2015 MSCOCO...", https://arxiv.org/abs/1609.06647.

SLIDE 48

Beyond one Language Pair

Figure 6 of "Multimodal Compact Bilinear Pooling for VQA and Visual Grounding", https://arxiv.org/abs/1606.01847.

SLIDE 49

Multilingual Translation

There have been many attempts at multilingual translation:
  • individual encoders and decoders with shared attention;
  • shared encoders and decoders.
