Recurrent Neural Networks + LSTMs + Attention


  1. Recurrent Neural Networks + LSTMs + Attention Surag Nair (based on slides by Xavier Giró-i-Nieto, Santi Pascual and M. Malinowski)

  2. Multilayer Perceptron The output depends ONLY on the current input. Alex Graves, “Supervised Sequence Labelling with Recurrent Neural Networks” 2

  3. Recurrent Neural Network (RNN) The hidden layers and the output depend on previous states of the hidden layers. Alex Graves, “Supervised Sequence Labelling with Recurrent Neural Networks” 3

  4. Recurrent Neural Networks (RNN) Each node represents a layer of neurons at a single timestep. t t-1 t+1 Alex Graves, “Supervised Sequence Labelling with Recurrent Neural Networks” 4

  5. Recurrent Neural Networks (RNN) The input is a SEQUENCE x(t) of any length. t t-1 t+1 Alex Graves, “Supervised Sequence Labelling with Recurrent Neural Networks” 5

  6. Recurrent Neural Networks (RNN) Must learn temporally shared weights w2, in addition to w1 & w3. t t-1 t+1 Alex Graves, “Supervised Sequence Labelling with Recurrent Neural Networks” 6

  7. Bidirectional RNN (BRNN) Must learn weights w2, w3, w4 & w5, in addition to w1 & w6. Alex Graves, “Supervised Sequence Labelling with Recurrent Neural Networks” 7

  8. Formulation: Single recurrence (one time-step recurrence) Slide: Santi Pascual 8

  9. Formulation: Multiple recurrences Unrolling the one time-step recurrence over T time steps Slide: Santi Pascual 9
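The recurrence in slides 8–9 can be written as h(t) = tanh(w1·x(t) + w2·h(t-1)), with the same recurrent weights reused at every time step. A minimal NumPy sketch of that unrolled forward pass (the names W_in, U and W_out standing in for the slides' w1, w2 and w3 are my own):

```python
import numpy as np

def rnn_forward(x_seq, W_in, U, W_out, b_h, b_y):
    """Unrolled vanilla RNN: h(t) = tanh(W_in x(t) + U h(t-1) + b_h).

    W_in, U and W_out play the roles of the slides' w1, w2 and w3;
    the recurrent matrix U (w2) is the one shared across all time steps.
    """
    h = np.zeros(U.shape[0])                 # initial hidden state h(0)
    outputs = []
    for x_t in x_seq:                        # same U reused at every step
        h = np.tanh(W_in @ x_t + U @ h + b_h)
        outputs.append(W_out @ h + b_y)
    return np.stack(outputs), h

# Toy usage with random weights: 5 time steps, 3-dim inputs, 4-dim state
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
W_in, U = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
W_out, b_h, b_y = rng.normal(size=(2, 4)), np.zeros(4), np.zeros(2)
y_seq, h_last = rnn_forward(x_seq, W_in, U, W_out, b_h, b_y)
```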

  10. RNN problems Long-term memory vanishes because of the T nested multiplications by U. ... Slide: Santi Pascual 10

  11. RNN problems During training, gradients may explode or vanish because of temporal depth. Example: backpropagation through time with 3 steps. Slide: Santi Pascual 11

  12. Vanishing/Exploding Gradient Problem Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. "On the difficulty of training recurrent neural networks." ICML (3) 28 (2013): 1310-1318.
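A small numerical illustration of why the T nested multiplications by U make gradients explode or vanish: backpropagating T steps multiplies the gradient by roughly the T-th power of U, so its norm grows or shrinks geometrically with the spectral radius of U. This toy check ignores the tanh derivative, which can only shrink the product further:

```python
import numpy as np

def backprop_factor_norm(U, T):
    """Norm of U multiplied with itself T times: roughly the factor applied
    to a gradient that is backpropagated T steps through the recurrence."""
    J = np.eye(U.shape[0])
    for _ in range(T):
        J = J @ U
    return np.linalg.norm(J)

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))
base /= np.max(np.abs(np.linalg.eigvals(base)))   # rescale to spectral radius 1
for scale in (0.9, 1.1):                          # slightly contractive vs expansive
    print(scale, [round(backprop_factor_norm(scale * base, T), 4) for T in (5, 20, 50)])
```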

  13. Long Short-Term Memory (LSTM) Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9, no. 8 (1997): 1735-1780. 13

  14. Long Short-Term Memory (LSTM) Based on a standard RNN whose neurons use a tanh activation ... Figure: Christopher Olah, “Understanding LSTM Networks” (2015) 14

  15. Long Short-Term Memory (LSTM) C_t is the cell state, which flows through the entire chain... Figure: Christopher Olah, “Understanding LSTM Networks” (2015) 15

  16. Long Short-Term Memory (LSTM) ...and is updated with a sum instead of a product. This avoids memory vanishing and mitigates exploding/vanishing gradients during backpropagation. Figure: Christopher Olah, “Understanding LSTM Networks” (2015) 16

  17. Long Short-Term Memory (LSTM) Forget Gate: Concatenate h(t-1) and x(t) Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes 17

  18. Long Short-Term Memory (LSTM) Input Gate Layer New contribution to cell state Classic neuron Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes 18

  19. Long Short-Term Memory (LSTM) Update Cell State (memory): Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes 19

  20. Long Short-Term Memory (LSTM) Output Gate Layer Output to next layer Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes 20
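Putting slides 17–20 together, one LSTM step applies the forget gate, the input gate, the additive cell-state update, and the output gate in sequence. A minimal NumPy sketch of that step (the weight and bias names are my own; each gate's weight matrix acts on the concatenation of h(t-1) and x(t)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step in the order of slides 17-20."""
    z = np.concatenate([h_prev, x_t])      # concatenate h(t-1) and x(t)
    f_t = sigmoid(W_f @ z + b_f)           # forget gate (slide 17)
    i_t = sigmoid(W_i @ z + b_i)           # input gate layer (slide 18)
    c_tilde = np.tanh(W_c @ z + b_c)       # new contribution to the cell state
    c_t = f_t * c_prev + i_t * c_tilde     # additive cell-state update (slide 19)
    o_t = sigmoid(W_o @ z + b_o)           # output gate layer (slide 20)
    h_t = o_t * np.tanh(c_t)               # output to the next layer / time step
    return h_t, c_t
```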

  21. Gated Recurrent Unit (GRU) Similar performance as LSTM with less computation. Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). 21
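For comparison, a GRU step in the same style: two gates (update and reset) and no separate cell state, which is where the saving in computation comes from. A minimal sketch with weight names of my own choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: update gate z, reset gate r, no separate cell state."""
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ zx + b_z)                    # update gate
    r_t = sigmoid(W_r @ zx + b_r)                    # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde       # interpolate old and new state
    return h_t
```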

  22. Attention - Motivation
  • Long-term memories: attending to memories
  ‣ Dealing with the vanishing gradient problem
  • Exceeding the limitations of a global representation
  ‣ Attending/focusing on smaller parts of the data: patches in images, words or phrases in sentences
  • Decoupling the representation from the problem
  ‣ Different problems require different sizes of representation: an LSTM over longer sentences requires larger vectors
  • Overcoming computational limits for visual data
  ‣ Focusing only on parts of the images
  ‣ Scalability independent of the size of the images
  • Adds some interpretability to the models (error inspection)

  23. Extension of LSTM via context vector

  24. Soft Attention Example: http://distill.pub/2016/augmented-rnns/
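A generic soft-attention sketch in the spirit of the distill.pub example: score each memory against a query, turn the scores into weights with a softmax, and return the weighted sum as the context vector that extends the LSTM state (slide 23). The bilinear scoring matrix W_score is just one common choice, not something fixed by the slides:

```python
import numpy as np

def soft_attention(query, memories, W_score):
    """memories: (N, d) array of items to attend over; query: (q,) vector."""
    scores = memories @ (W_score @ query)     # one relevance score per memory
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    context = weights @ memories              # convex combination of memories
    return context, weights
```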

  25. Teaching Machines to Read and Comprehend Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Lei Yu, and Phil Blunsom pblunsom@google.com

  26. Features and NLP Twenty years ago log-linear models allowed greater freedom to model correlations than simple multinomial parametrisations, but imposed the need for feature engineering.

  27. Features and NLP Distributed/neural models allow us to learn shallow features for our classifiers, capturing simple correlations between inputs.

  28. Deep Learning and NLP [Figure: a convolutional sentence model over the projected sentence matrix (s=7) for “game's the same, just got more fierce” — wide convolutions (m=2, m=3), dynamic k-max pooling (k = f(s) = 5), folding, k-max pooling (k=3) and a fully connected layer] Deep learning should allow us to learn hierarchical generalisations.

  29. Deep Learning and NLP: Question Answer Selection [Figure: the question “When did James Dean die?” and the answer sentence “In 1955, actor James Dean was killed in a two-car collision near Cholame, Calif.” each pass through a generalisation step] Beyond classification, deep models for embedding sentences have seen increasing success.

  30. Deep Learning and NLP: Question Answer Selection [Figure: a recurrent network g embeds the question “When did James Dean die?” together with the answer sentence “In 1955, actor James Dean was killed in a two-car collision near Cholame, Calif.”] Recurrent neural networks provide a very practical tool for sentence embedding.

  31. Deep Learning for NLP: Machine Translation [Figure: the Chinese source sentence “我 一 杯 白 葡萄酒 。” is encoded (generalisation) and decoded (generation) into “i 'd like a glass of white wine , please .”] We can even view translation as encoding and decoding sentences.

  32. Deep Learning for NLP: Machine Translation [Figure: the training pair “Les chiens aiment les os ||| Dogs love bones”; the source sequence is read in and the target sequence “Dogs love bones </s>” is generated] Recurrent neural networks again perform surprisingly well.
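The encoder–decoder view of translation from slides 31–32, reduced to a toy sketch: one RNN reads the source sequence into a fixed-size vector, and a second RNN decodes the target sequence from it. Everything here (vocabulary size, embeddings, greedy decoding, the start token id 0) is a made-up, untrained illustration of the data flow, not the model from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 8                                      # toy vocabulary size and hidden size
E = rng.normal(size=(V, d))                      # toy token embeddings
U_enc, U_dec, W_out = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def encode(src_ids):
    h = np.zeros(d)
    for i in src_ids:                            # read the source sequence
        h = np.tanh(E[i] + U_enc @ h)
    return h                                     # fixed-size sentence encoding

def decode(h, max_len=4):
    out, i = [], 0                               # start from a hypothetical start token id 0
    for _ in range(max_len):                     # generate greedily, one token at a time
        h = np.tanh(E[i] + U_dec @ h)
        i = int(np.argmax(E @ (W_out @ h)))      # next-token scores over the vocabulary
        out.append(i)
    return out

print(decode(encode([1, 2, 3])))                 # meaningless output: weights are untrained
```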

  33. Supervised Reading Comprehension To achieve our aim of training supervised machine learning models for machine reading and comprehension, we must first find data.

  34. Supervised Reading Comprehension The CNN and DailyMail websites provide paraphrase summary sentences for each full news story.

  35. Supervised Reading Comprehension CNN article:
  Document: The BBC producer allegedly struck by Jeremy Clarkson will not press charges against the “Top Gear” host, his lawyer said Friday. Clarkson, who hosted one of the most-watched television shows in the world, was dropped by the BBC Wednesday after an internal investigation by the British broadcaster found he had subjected producer Oisin Tymon “to an unprovoked physical and verbal attack.” . . .
  Query: Producer X will not press charges against Jeremy Clarkson, his lawyer says.
  Answer: Oisin Tymon
  We formulate Cloze-style queries from the story paraphrases.
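The Cloze construction itself is simple: take a summary/paraphrase sentence, delete an entity mention, and ask the model to recover it from the full document. A minimal sketch (the entity spans are assumed to be given; the helper name is my own):

```python
def make_cloze(summary, entities):
    """Turn a summary sentence into (query, answer) Cloze pairs by replacing
    each entity mention in turn with a placeholder X."""
    examples = []
    for ent in entities:
        if ent in summary:
            examples.append((summary.replace(ent, "X"), ent))
    return examples

summary = ("Producer Oisin Tymon will not press charges against Jeremy Clarkson, "
           "his lawyer says.")
print(make_cloze(summary, ["Oisin Tymon", "Jeremy Clarkson"]))
```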

  36. Supervised Reading Comprehension From the Daily Mail:
  • The hi-tech bra that helps you beat breast X;
  • Could Saccharin help beat X?;
  • Can fish oils help fight prostate X?
  An n-gram language model would correctly predict (X = cancer), regardless of the document, simply because this is a frequently cured entity in the Daily Mail corpus.

  37. Supervised Reading Comprehension MNIST example generation: We generate quasi-synthetic examples from the original document-query pairs, obtaining exponentially more training examples by anonymising and permuting the mentioned entities.
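A minimal sketch of that anonymise-and-permute step (the entity spans are given by hand here; in the real pipeline they come from coreference and named-entity processing, and the marker permutation is refreshed for every example so that the markers carry no world knowledge):

```python
import random
import re

def anonymise(text, entities):
    """Replace each entity span with an abstract marker (ent0, ent1, ...),
    using a fresh random permutation of markers for every example, and
    lowercase the result as in the anonymised examples."""
    markers = [f"ent{i}" for i in range(len(entities))]
    random.shuffle(markers)                        # permute the marker assignment
    for ent, marker in zip(entities, markers):
        text = re.sub(re.escape(ent), marker, text)
    return text.lower()

doc = ('The BBC producer allegedly struck by Jeremy Clarkson will not press '
       'charges against the "Top Gear" host.')
print(anonymise(doc, ["BBC", "Jeremy Clarkson", "Top Gear"]))
```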

  38. Supervised Reading Comprehension
  Original Version
  Context: The BBC producer allegedly struck by Jeremy Clarkson will not press charges against the “Top Gear” host, his lawyer said Friday. Clarkson, who hosted one of the most-watched television shows in the world, was dropped by the BBC Wednesday after an internal investigation by the British broadcaster found he had subjected producer Oisin Tymon “to an unprovoked physical and verbal attack.” . . .
  Query: Producer X will not press charges against Jeremy Clarkson, his lawyer says.
  Answer: Oisin Tymon
  Anonymised Version
  Context: the ent381 producer allegedly struck by ent212 will not press charges against the “ ent153 ” host , his lawyer said friday . ent212 , who hosted one of the most - watched television shows in the world , was dropped by the ent381 wednesday after an internal investigation by the ent180 broadcaster found he had subjected producer ent193 “ to an unprovoked physical and verbal attack . ” . . .
  Query: producer X will not press charges against ent212 , his lawyer says .
  Answer: ent193
  Original and anonymised version of a data point from the Daily Mail validation set. The anonymised entity markers are constantly permuted during training and testing.
  + Barun, Shantanu, Arindam, Ankit, Daraksha, Dinesh - Akshay/Barun: errors introduced by the co-reference system?
