Recurrent Neural Models: Language Models, and Sequence Prediction and Generation
CMSC 473/673 Frank Ferraro
WARNING: Neural methods are NOT the only way to do sequence prediction:
- Structured Perceptron (478/678)
- Hidden Markov Models (473/673, 678, 691 GSML)
CRFs are Very Popular for {POS, NER, other sequence tasks}
f(z_{i-1}, z_i, i) = (z_{i-1} == Noun  &  z_i == Verb  &  (x_{i-2} is in a list of adjectives or determiners))
f_{path p}(z_{i-1}, z_i, i) = (z_{i-1} == Per  &  z_i == Per  &  (a syntactic path p involving x_i exists))
[Figure: a linear-chain CRF, with label variables z_1, z_2, z_3, z_4, … above the words w_1, w_2, w_3, w_4, …]

p(z_1, …, z_n | x_1, …, x_n) ∝ ∏_i exp(θ · f(z_{i-1}, z_i, x_1, …, x_n))

Can't easily do these with an HMM → conditional models can allow richer features
CRFs can be used in neural networks too:
hon/tf/contrib/crf/CrfForwardRnnCell
Outline
- Types of networks
- Basic cell definition
- Example in PyTorch
A Note on Graphical Notation

[Figure: x → h → y]
- x: input. Could be BOW, a sequence of items, structured input, etc.
- h: hidden state/representation
- y: output. A label, a sequence of labels, generated text, etc.
- The edges between nodes are factors (another neural cell, or factor)
- The red arrows indicate parameters to learn
Five Broad Categories of Neural Networks

1. Single Input, Single Output
2. Single Input, Multiple Outputs
3. Multiple Inputs, Single Output
4. Multiple Inputs, Multiple Outputs ("sequence prediction": no time delay)
5. Multiple Inputs, Multiple Outputs ("sequence-to-sequence": with time delay)

"Single": a fixed number of items; "Multiple": a variable number of items
Network Types: Single Input, Single Output
[Figure: x → h → y]

1. Feed forward
- Linearizable feature input
- Bag-of-items classification/regression
- Basic non-linear model
Weβve already seen some instances
Terminology
- Log-Linear Models (common NLP term)
- (Multinomial) logistic regression / Softmax regression (as statistical regression)
- Maximum Entropy models, MaxEnt (based in information theory)
- Generalized Linear Models (a form of)
- Discriminative Naïve Bayes (viewed as)
- Very shallow (sigmoidal) neural nets (to be cool today :) )
Recall from maxent slides
Recall: N-gram to Maxent to Neural Language Models

Maxent LM: predict the next word w_i given some context (w_{i-3}, w_{i-2}, w_{i-1}), and compute beliefs about what is likely…

p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) = softmax(θ · f(x_{i-3}, x_{i-2}, x_{i-1}, x_i))

Here the input x maps directly to the output y: there is no learned representation h.
Neural LM: create/use "distributed representations" e_{i-3}, e_{i-2}, e_{i-1} (each e_w is a matrix-vector product with an embedding matrix C), combine these representations with a function f, and then predict the next word…

p(x_i | x_{i-3}, x_{i-2}, x_{i-1}) ∝ softmax(θ_{x_i} · f(x_{i-3}, x_{i-2}, x_{i-1}))

The combined representation is a learned hidden state h sitting between the input x and the output y.
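A minimal PyTorch sketch of this feed-forward neural LM; every name and dimension below is illustrative rather than taken from the slides.

```python
import torch
import torch.nn as nn

class TrigramContextNLM(nn.Module):
    """Feed-forward neural LM: p(x_i | x_{i-3}, x_{i-2}, x_{i-1})."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.C = nn.Embedding(vocab_size, embed_dim)          # e_w = C x_w
        self.combine = nn.Linear(3 * embed_dim, hidden_dim)   # the "f" combining e_{i-3..i-1}
        self.out = nn.Linear(hidden_dim, vocab_size)          # scores theta_{x_i} . h

    def forward(self, context):             # context: (batch, 3) word ids
        e = self.C(context)                 # (batch, 3, embed_dim)
        h = torch.tanh(self.combine(e.flatten(1)))
        return torch.log_softmax(self.out(h), dim=-1)

# usage: log-probabilities over the next word for a batch of 3-word contexts
model = TrigramContextNLM(vocab_size=10000)
log_probs = model(torch.randint(0, 10000, (2, 3)))    # shape: (2, 10000)
```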
Common Types of Single Input, Single Output

General formulation:
Input: x
Compute:
  h_0 = x
  for layer l = 1 to L:
    h_l = f_l(W_l h_{l-1} + b_l)
  return argmax_z softmax(β · h_L)

h_l: hidden state at layer l; f_l: (non-linear) activation function at layer l; W_l h_{l-1} + b_l: a linear layer

In PyTorch (torch.nn):
Activation functions: https://pytorch.org/docs/stable/nn.html?highlight=activation#non-linear-activations-weighted-sum-nonlinearity
Linear layer: https://pytorch.org/docs/stable/nn.html#linear-layers
torch.nn.Linear(in_features=<dim of h_{l-1}>, out_features=<dim of h_l>, bias=<Boolean: include bias b_l>)
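A minimal sketch of the general formulation above with L = 2 hidden layers, using the torch.nn pieces just listed; the dimensions and the choice of ReLU are illustrative.

```python
import torch
import torch.nn as nn

# h_l = f_l(W_l h_{l-1} + b_l), then softmax(beta . h_L)
model = nn.Sequential(
    nn.Linear(in_features=300, out_features=100, bias=True),   # W_1, b_1
    nn.ReLU(),                                                  # f_1
    nn.Linear(in_features=100, out_features=50, bias=True),    # W_2, b_2
    nn.ReLU(),                                                  # f_2
    nn.Linear(in_features=50, out_features=5),                 # beta (5 output classes)
)

x = torch.randn(1, 300)                                    # one bag-of-features input
y_hat = torch.softmax(model(x), dim=-1).argmax(dim=-1)     # argmax_z softmax(beta . h_L)
```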
Network Types: Single Input, Multiple Outputs
[Figure: x → h_0 → h_1 → h_2 → …, with an output y_0, y_1, y_2, … emitted at each step]

Recurrent: one input, sequence output
- Label-based generation
- Automated caption generation
Label-Based Generation
Given a label z, generate an entire text x = (x_1, …, x_n):

argmax_x p(x | z) = argmax_{x_1, …, x_n} p(x_1, …, x_n | z)

Performing this argmax is difficult, and often requires an approximate search technique called beam search.
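A sketch of beam search over a generic next-token scorer; the `step` function, the hypothesis format, and the beam size below are illustrative assumptions, not something from the slides. Any conditional model p(x_1, …, x_n | z) that can score next tokens would plug in as `step`.

```python
import heapq

def beam_search(step, start_token, end_token, beam_size=4, max_len=30):
    """step(prefix) -> list of (next_token, log_prob) candidates.
    Returns the highest-scoring complete sequence found."""
    beams = [(0.0, [start_token])]                 # (total log-prob, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for tok, logp in step(prefix):
                hyp = (score + logp, prefix + [tok])
                (finished if tok == end_token else candidates).append(hyp)
        if not candidates:
            break
        # keep only the beam_size best partial sequences
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(finished + beams, key=lambda c: c[0])[1]
```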
Example: Sentiment-based Tweet Generation
Given a sentiment label z (e.g., HAPPY, SAD, ANGRY, etc.), generate a tweet x = (x_1, …, x_n) that would express that sentiment:

argmax_x p(x | z) = argmax_{x_1, …, x_n} p(x_1, …, x_n | z)
Q: Why might you want to do this? Q: What ethical aspects should you consider? Q: What is the potential harm?
Example: Image Caption Generation
Show and Tell: A Neural Image Caption Generator, CVPR 15
Slide credit: Arun Mallya
Network Types: Multiple Inputs, Single Output

[Figure: inputs x_0, x_1, x_2, … feed hidden states h_0 → h_1 → h_2 → …, with a single output y at the end]

Recurrent: sequence input, one output
- Document classification
- Action recognition in video (high-level)
Think of this as generalizing how we used maxent models to build discriminatively trained classifiers:

p(z | y) = maxent(y, z)  →  p(z | y) = recurrent_classifier(y, z)
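A minimal sketch of such a recurrent classifier in PyTorch; the GRU encoder, the names, and the sizes are illustrative choices, not the slides' own definition.

```python
import torch
import torch.nn as nn

class RecurrentClassifier(nn.Module):
    """Multiple inputs, single output: read the whole sequence, then classify."""
    def __init__(self, vocab_size, num_labels, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, num_labels)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        _, h_last = self.rnn(self.embed(tokens))    # h_last: (1, batch, hidden_dim)
        return torch.log_softmax(self.classify(h_last.squeeze(0)), dim=-1)

model = RecurrentClassifier(vocab_size=5000, num_labels=3)
print(model(torch.randint(0, 5000, (2, 7))).shape)  # torch.Size([2, 3])
```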
Example: RTE (many options)

s: Michael Jordan, coach Phil Jackson and the star cast, including Scottie Pippen, took the Chicago Bulls to six National Basketball Association championships.
z: The Bulls basketball team is based in Chicago.
Label: ENTAILED

[Figure: one RNN reads s_0 … s_N into hidden states h_{s,0} … h_{s,N}; another reads z_0 … z_M into h_{z,0} … h_{z,M}; the two final states feed a classifier producing y]
Reminder! GLUE https://gluebenchmark.com/ https://super.gluebenchmark.com/
Many (but not all) of these tasks fall into the Multiple Inputs, Single Output regime
Network Types: Multiple Inputs, Multiple Outputs (βsequence predictionβ: no time delay)
[Figure: inputs x_0, x_1, x_2 feed hidden states h_0 → h_1 → h_2, each emitting an output y_0, y_1, y_2]

Recurrent: sequence input, sequence output
- Part of speech tagging
- Named entity recognition
Example 1: Part of Speech Tagging

Task: Predict a part-of-speech tag for each word in a provided sentence.

British  Left  Waffles  on    Falkland  Islands
Noun     Verb  Noun     Prep  Noun      Noun

[Figure: inputs x_0 … x_5 (the words) feed hidden states h_0 … h_5, each producing a tag y_0 … y_5]
Example 2: Named Entity Recognition

Task: Predict a named entity tag for each word in a provided sentence.

British  Left  Waffles  on     Falkland  Islands
ORG      ORG   Other    Other  LOC       LOC

[Figure: inputs x_0 … x_5 (the words) feed hidden states h_0 … h_5, each producing a tag y_0 … y_5]
What are Named Entities?
Named entity recognition (NER): identify proper names in texts, and classify them into a set of predefined categories of interest:
- Person names
- Organizations (companies, government organisations, committees, etc.)
- Locations (cities, countries, rivers, etc.)
- Date and time expressions
- Measures (percent, money, weight, etc.), email addresses, Web addresses, street addresses, etc.
- Domain-specific: names of drugs, medical conditions, names …
Cunningham and Bontcheva (2003, RANLP Tutorial)
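A minimal sketch of this many-inputs, many-outputs setup in PyTorch, usable for POS tagging or NER; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """Predict one tag (POS, NER, ...) per input word."""
    def __init__(self, vocab_size, num_tags, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.tag = nn.Linear(hidden_dim, num_tags)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))       # h: (batch, seq_len, hidden_dim)
        return torch.log_softmax(self.tag(h), dim=-1)   # one tag distribution per word

tagger = RNNTagger(vocab_size=5000, num_tags=17)         # e.g., 17 universal POS tags
print(tagger(torch.randint(0, 5000, (1, 6))).shape)      # torch.Size([1, 6, 17])
```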
Network Types: Multiple Inputs, Multiple Outputs (βsequence-to-sequenceβ: time delay)
[Figure: an encoder reads inputs x_0, x_1, x_2 into hidden states h_0 → h_1 → h_2; only after the whole input is read does the decoder emit outputs y_0, y_1, y_2, y_3]

Recurrent: sequence input, sequence output (with a time delay)
- Machine translation
- Sequential description
- Summarization
Example: Translation

Translate French (observed) into English:
Le chat est sur la chaise. → The cat is on the chair.

A variable # of input words maps to a variable # of output words.

[Figure: an encoder RNN reads the French words x_0 … x_2 into hidden states; a decoder RNN then generates the English words y_0 … y_3 …]
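A compact sketch of the encoder-decoder idea (no attention, greedy decoding); everything below, from the module names to the sizes, is an illustrative assumption rather than the slides' model.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, bos_id, max_len=20):
        # encode: read the whole source sentence into a hidden state
        _, h = self.encoder(self.src_embed(src))
        # decode: only now emit target words, one at a time (greedy, for illustration)
        token = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            dec_out, h = self.decoder(self.tgt_embed(token), h)
            token = self.out(dec_out).argmax(dim=-1)   # next predicted word id
            outputs.append(token)
        return torch.cat(outputs, dim=1)               # (batch, max_len) word ids

model = Seq2Seq(src_vocab=8000, tgt_vocab=6000)
ids = model(torch.randint(0, 8000, (1, 6)), bos_id=1)  # (1, 20) predicted target ids
```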
RNN Output: Visual Storytelling
[Figure: five photos, each encoded by a CNN; a GRU encoder runs over the image encodings, and GRU decoders generate the story (encode → decode)]

Generated story: "The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water." (Huang et al., 2016)

Human reference: "The family has gathered around the dinner table to share a meal. Afterwards they took the family dog to the beach to get some exercise. The waves were cool and refreshing! The dog had so much fun in the …"
Outline
- Types of networks
- Basic cell definition
- Example in PyTorch
A More Typical View of Recurrent Neural Language Modeling

[Figure: words w_{i-3}, w_{i-2}, w_{i-1}, w_i feed a chain of hidden states h_{i-3} → h_{i-2} → h_{i-1} → h_i; from each hidden state we predict the next word: w_{i-2}, w_{i-1}, w_i, w_{i+1}]

Predict the next word from these hidden states. One input word, its hidden state, and its prediction together form a "cell."
A Simple Recurrent Neural Network Cell

[Figure: one cell of the unrolled network, with input word w_i, previous hidden state h_{i-1}, new hidden state h_i, and the predicted next word w_{i+1}; the arrows are labeled W (encoding: input to hidden), S (hidden to hidden), and U (decoding: hidden to output)]

h_i = σ(S h_{i-1} + W x_i)
x̂_{i+1} = softmax(U h_i)
where σ(y) = 1 / (1 + exp(−y))

We must learn the matrices U, S, W.
- suggested solution: gradient descent on prediction ability
- problem: they're tied across inputs/timesteps
- good news for you: many toolkits do this automatically
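A from-scratch sketch of this cell in PyTorch. The mapping of W to encoding, S to the hidden-to-hidden connection, and U to decoding follows the figure labels above; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """h_i = sigmoid(S h_{i-1} + W x_i);  xhat_{i+1} = softmax(U h_i)."""
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.W = nn.Linear(vocab_size, hidden_dim, bias=False)   # encoding
        self.S = nn.Linear(hidden_dim, hidden_dim, bias=False)   # hidden -> hidden
        self.U = nn.Linear(hidden_dim, vocab_size, bias=False)   # decoding

    def forward(self, x_i, h_prev):
        h_i = torch.sigmoid(self.S(h_prev) + self.W(x_i))
        next_word_probs = torch.softmax(self.U(h_i), dim=-1)
        return next_word_probs, h_i

# the same cell (the same W, S, U) is applied at every timestep:
cell = SimpleRNNCell(vocab_size=1000, hidden_dim=32)
h = torch.zeros(1, 32)
for x in torch.eye(1000)[[5, 17, 203]]:          # three one-hot word vectors
    probs, h = cell(x.unsqueeze(0), h)
```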
Why Is Training RNNs Hard?

- Conceptually, it can get strange, but really, getting the gradient just requires many applications of the chain rule for derivatives.
- Vanishing (and exploding) gradients: we multiply by the same matrices at each timestep, so the gradients multiply many matrices together.
- One solution (for exploding gradients): clip the gradients to a maximum value.
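A sketch of the clipping fix inside a tiny training step, using PyTorch's built-in utility; the model, the data, and the max norm of 1.0 are placeholders.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 50, 8)                 # a long sequence: many repeated matrix multiplications
out, _ = model(x)
loss = out.pow(2).mean()                  # stand-in loss, just for illustration
loss.backward()

# cap the total gradient norm before the update (the clipping fix from the slide)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```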
Outline
- Types of networks
- Basic cell definition
- Example in PyTorch
Natural Language Processing
from torch import *
from keras import *
Pick Your Toolkit
PyTorch Deeplearning4j TensorFlow DyNet Caffe Keras MxNet Gluon CNTK β¦
Comparisons: https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software https://deeplearning4j.org/compare-dl4j-tensorflow-pytorch https://github.com/zer0n/deepframeworks (older---2015)
Defining A Simple RNN in Python
(Modified Very Slightly)
http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

[Figure: the unrolled RNN, with inputs w_{i-2}, w_{i-1}, w_i; hidden states h_{i-2} → h_{i-1} → h_i; predictions w_{i-1}, w_i, w_{i+1}. The slide's code screenshot highlights an "encode" step (input + previous hidden state → new hidden state) and a "decode" step (hidden state → output distribution).]
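The code screenshots did not survive extraction; below is a sketch in the style of the linked tutorial (the i2h/i2o naming follows that tutorial, but treat this as an approximation of the slide's code, not a verbatim copy).

```python
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)   # encode
        self.i2o = nn.Linear(input_size + hidden_size, output_size)   # decode
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)                    # encode: new hidden state
        output = self.softmax(self.i2o(combined))      # decode: log-probabilities over labels
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)
```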
Training A Simple RNN in Python
(Modified Very Slightly)
http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

The slide's code screenshot highlights the steps of one training update:
- Negative log-likelihood (the loss)
- get predictions
- eval predictions
- compute gradient
- perform SGD
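Again, the screenshot is missing; a sketch of the corresponding training step, continuing from the RNN module above (the sizes and the example call are placeholders).

```python
import torch
import torch.nn as nn

# assumes the RNN class from the previous sketch, one-hot character tensors of
# shape (seq_len, 1, n_letters), and a gold label tensor of shape (1,)
n_letters, n_hidden, n_categories = 57, 128, 18      # placeholder sizes
rnn = RNN(n_letters, n_hidden, n_categories)

criterion = nn.NLLLoss()                  # negative log-likelihood
learning_rate = 0.005

def train(category_tensor, line_tensor):
    hidden = rnn.init_hidden()
    rnn.zero_grad()
    # get predictions: run the cell over every step of the sequence
    for i in range(line_tensor.size(0)):
        output, hidden = rnn(line_tensor[i], hidden)
    # eval predictions: loss of the final output against the gold category
    loss = criterion(output, category_tensor)
    # compute gradient
    loss.backward()
    # perform SGD: move each parameter against its gradient
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)
    return output, loss.item()

output, loss = train(torch.tensor([3]), torch.zeros(10, 1, n_letters))
```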
Another Solution: LSTMs/GRUs
- LSTM: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
- GRU: Gated Recurrent Unit (Cho et al., 2014)
- Basic idea: learn to forget

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: LSTM cell diagram, with a "forget" line and a "representation" line running through the cell]
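In practice you rarely implement the gates by hand; a sketch of using PyTorch's built-in gated cells instead of the simple cell above (sizes are illustrative).

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(2, 10, 64)                # (batch, seq_len, embedding dim)
outputs, (h_n, c_n) = lstm(x)             # c_n is the gated memory the cell learns to keep/forget
print(outputs.shape, h_n.shape)           # torch.Size([2, 10, 128]) torch.Size([1, 2, 128])

gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)   # GRU: a lighter gated variant
```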
Outline
- Types of networks
- Basic cell definition
- Example in PyTorch