SLIDE 1

Recurrent Neural Models: Language Models, and Sequence Prediction and Generation

CMSC 473/673 Frank Ferraro

SLIDE 2

WARNING: Neural methods are NOT the only way to do sequence prediction:

  • Structured Perceptron (478/678)
  • Hidden Markov Models (473/673, 678, 691 GSML)
  • Conditional Random Fields

(473/673, 678, 691 GSML)

  • (and others)
slide-3
SLIDE 3

CRFs are Very Popular for {POS, NER, other sequence tasks}

  • POS

f(π‘¨π‘—βˆ’1, 𝑨𝑗, 𝒙) = (π‘¨π‘—βˆ’1 == Noun & 𝑨𝑗 == Verb & (π‘₯π‘—βˆ’2 in list of adjectives or determiners))

  • NER

fpath p(π‘¨π‘—βˆ’1, 𝑨𝑗, 𝒙) = (π‘¨π‘—βˆ’1 == Per & 𝑨𝑗 == Per & (syntactic path p involving π‘₯𝑗 exists ))

[Diagram: a linear-chain CRF; words w1 … w4 feed latent labels z1 … z4.]

p(y_1, …, y_N | x_1, …, x_N) ∝ ∏_j exp(ΞΈα΅€ f(y_{jβˆ’1}, y_j, x_1, …, x_N))

Can’t easily do these with an HMM βž” conditional models can allow richer features

CRFs can be used in neural networks too:

  • https://www.tensorflow.org/versions/r1.15/api_docs/pyt

hon/tf/contrib/crf/CrfForwardRnnCell

  • https://pytorch-crf.readthedocs.io/en/stable/
SLIDE 4

Outline

β€’ Types of networks
β€’ Basic cell definition
β€’ Example in PyTorch

SLIDE 5

A Note on Graphical Notation

[Diagram: x → h → y]
Input x: could be a BOW, a sequence of items, structured input, etc.
Output y: a label, a sequence of labels, generated text, etc.

SLIDE 6

A Note on Graphical Notation

[Diagram: x → h → y]
Input x: could be a BOW, a sequence of items, structured input, etc.
Output y: a label, a sequence of labels, generated text, etc.
h: hidden state/representation.

SLIDE 7

A Note on Graphical Notation

[Diagram: x → h → y]
Input x: could be a BOW, a sequence of items, structured input, etc.
Output y: a label, a sequence of labels, generated text, etc.
h: hidden state/representation.

  • h computed by a neural cell, or

factor

  • This is called the encoder
slide-8
SLIDE 8

A Note on Graphical Notation

[Diagram: x → h → y]
Input x: could be a BOW, a sequence of items, structured input, etc.
Output y: a label, a sequence of labels, generated text, etc.
h: hidden state/representation.

  • h computed by a neural cell, or

factor

  • This is called the encoder
  • y is predicted/generated from h

another neural cell, or factor

  • This is called the decoder
slide-9
SLIDE 9

A Note on Graphical Notation

[Diagram: x → h → y]
Input x: could be a BOW, a sequence of items, structured input, etc.
Output y: a label, a sequence of labels, generated text, etc.
h: hidden state/representation.

  • h computed by a neural cell, or

factor

  • This is called the encoder
  • y is predicted/generated from h

another neural cell, or factor

  • This is called the decoder

The red arrows indicate parameters to learn

SLIDE 10

Five Broad Categories of Neural Networks

1. Single Input, Single Output
2. Single Input, Multiple Outputs
3. Multiple Inputs, Single Output
4. Multiple Inputs, Multiple Outputs (β€œsequence prediction”: no time delay)
5. Multiple Inputs, Multiple Outputs (β€œsequence-to-sequence”: with time delay)

SLIDE 11

Five Broad Categories of Neural Networks

1. Single Input, Single Output
2. Single Input, Multiple Outputs
3. Multiple Inputs, Single Output
4. Multiple Inputs, Multiple Outputs (β€œsequence prediction”: no time delay)
5. Multiple Inputs, Multiple Outputs (β€œsequence-to-sequence”: with time delay)

β€œSingle”: fixed number of items
β€œMultiple”: variable number of items
SLIDE 12

Network Types: Single Input, Single Output

[Diagram: x → h → y]

1. Feed forward
  β€’ Linearizable feature input
  β€’ Bag-of-items classification/regression
  β€’ Basic non-linear model

We’ve already seen some instances of this.
SLIDE 13

Terminology

Different names for (essentially) the same family of models:
  β€’ Log-linear models (common NLP term)
  β€’ (Multinomial) logistic regression / softmax regression (as statistical regression)
  β€’ Maximum Entropy models, MaxEnt (based in information theory)
  β€’ a form of Generalized Linear Models
  β€’ viewed as discriminative NaΓ―ve Bayes
  β€’ very shallow (sigmoidal) neural nets (to be cool today :))

Recall from the maxent slides.
SLIDE 14

Recall: N-gram to Maxent to Neural Language Models

Predict the next word w_i given some context (w_{iβˆ’3}, w_{iβˆ’2}, w_{iβˆ’1}); compute beliefs about what is likely…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) = softmax(πœ„ β‹… 𝑔(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1, π‘₯𝑗))

slide-15
SLIDE 15

Recall: N-gram to Maxent to Neural Language Models

Predict the next word w_i given some context (w_{iβˆ’3}, w_{iβˆ’2}, w_{iβˆ’1}); compute beliefs about what is likely…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) = softmax(πœ„ β‹… 𝑔(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1, π‘₯𝑗))

[Diagram: x maps directly to y; there is no learned representation h.]
SLIDE 16

Recall: N-gram to Maxent to Neural Language Models

Predict the next word w_i given some context (w_{iβˆ’3}, w_{iβˆ’2}, w_{iβˆ’1}); compute beliefs about what is likely…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) ∝ softmax(πœ„π‘₯𝑗 β‹… π’ˆ(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1))

[Diagram: create/use β€œdistributed representations” e_{iβˆ’3}, e_{iβˆ’2}, e_{iβˆ’1} for the context words; combine these representations via a matrix-vector product C = f(…), using word embeddings e_w and output parameters ΞΈ_{w_i}.]
SLIDE 17

Recall: N-gram to Maxent to Neural Language Models

Predict the next word w_i given some context (w_{iβˆ’3}, w_{iβˆ’2}, w_{iβˆ’1}); compute beliefs about what is likely…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) ∝ softmax(πœ„π‘₯𝑗 β‹… π’ˆ(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1))

[Diagram: create/use β€œdistributed representations” e_{iβˆ’3}, e_{iβˆ’2}, e_{iβˆ’1} for the context words; combine these representations via a matrix-vector product C = f(…), using word embeddings e_w and output parameters ΞΈ_{w_i}. Now x → h → y: a learned representation h.]
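A minimal PyTorch sketch of this feed-forward n-gram LM; the class name, sizes, and tanh nonlinearity are illustrative choices, not from the slides:

    import torch
    import torch.nn as nn

    class TrigramNeuralLM(nn.Module):
        def __init__(self, vocab_size, embed_dim, hidden_dim):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)      # e_{i-3}, e_{i-2}, e_{i-1}
            self.combine = nn.Linear(3 * embed_dim, hidden_dim)   # combine representations
            self.out = nn.Linear(hidden_dim, vocab_size)          # theta_{x_j} for every word

        def forward(self, context):                    # context: (batch, 3) word ids
            e = self.embed(context)                    # (batch, 3, embed_dim)
            h = torch.tanh(self.combine(e.flatten(1))) # g(x_{j-3}, x_{j-2}, x_{j-1})
            return torch.log_softmax(self.out(h), dim=-1)  # log p(x_j | context)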
SLIDE 18

Common Types of Single Input, Single Output

  • Feed forward networks
  • Multilayer perceptrons (MLPs)

General Formulation:
  Input: x
  Compute:
    h_0 = x
    for layer l = 1 to L:
      h_l = f_l(W_l h_{lβˆ’1} + b_l)
    return argmax_y softmax(ΞΈ h_L)

Here h_l is the hidden state at layer l, f_l is the (non-linear) activation function at layer l, and W_l h_{lβˆ’1} + b_l is a linear layer.
SLIDE 19

Common Types of Single Input, Single Output

  • Feed forward networks
  • Multilayer perceptrons (MLPs)

General Formulation:
  Input: x
  Compute:
    h_0 = x
    for layer l = 1 to L:
      h_l = f_l(W_l h_{lβˆ’1} + b_l)
    return argmax_y softmax(ΞΈ h_L)

Here h_l is the hidden state at layer l, f_l is the (non-linear) activation function at layer l, and W_l h_{lβˆ’1} + b_l is a linear layer.
SLIDE 20

Common Types of Single Input, Single Output

  • Feed forward networks
  • Multilayer perceptrons (MLPs)

General Formulation:
  Input: x
  Compute:
    h_0 = x
    for layer l = 1 to L:
      h_l = f_l(W_l h_{lβˆ’1} + b_l)
    return argmax_y softmax(ΞΈ h_L)

In PyTorch (torch.nn):
  β€’ Activation functions: https://pytorch.org/docs/stable/nn.html?highlight=activation#non-linear-activations-weighted-sum-nonlinearity
  β€’ Linear layers: https://pytorch.org/docs/stable/nn.html#linear-layers

torch.nn.Linear(in_features=<dim of h_{lβˆ’1}>,
                out_features=<dim of h_l>,
                bias=<Boolean: include bias b_l>)
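Putting those pieces together, a hedged sketch of the general formulation as a PyTorch module; the layer count, sizes, and tanh activation are illustrative:

    import torch
    import torch.nn as nn

    class MLP(nn.Module):
        def __init__(self, in_dim, hidden_dim, num_classes, num_layers=2):
            super().__init__()
            dims = [in_dim] + [hidden_dim] * num_layers
            # one Linear per layer l: W_l h_{l-1} + b_l
            self.layers = nn.ModuleList(
                [nn.Linear(dims[l], dims[l + 1]) for l in range(num_layers)])
            self.out = nn.Linear(hidden_dim, num_classes)  # theta

        def forward(self, x):
            h = x                                 # h_0 = x
            for layer in self.layers:
                h = torch.tanh(layer(h))          # h_l = f_l(W_l h_{l-1} + b_l)
            logits = self.out(h)                  # theta h_L
            return logits.argmax(dim=-1)          # argmax_y softmax(theta h_L)
                                                  # (for training, return logits instead)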
SLIDE 21

Network Types: Single Input, Multiple Outputs

x h0 y0

Recursive: One input, Sequence output Label-based generation Automated caption generation

h1 y1 h2 y2

slide-22
SLIDE 22

Label-Based Generation

Given a label y, generate an entire text x:

argmax_x p(x | y) = argmax_{x_1, …, x_N} p(x_1, …, x_N | y)

Performing this argmax is difficult, and often requires an approximate search technique called beam search (a sketch follows below).

SLIDE 23

Example: Sentiment-based Tweet Generation

Given a sentiment label y (e.g., HAPPY, SAD, ANGRY, etc.), generate a tweet that would express that sentiment:

argmax_{x_1, …, x_N} p(x_1, …, x_N | y)

Q: Why might you want to do this? Q: What ethical aspects should you consider? Q: What is the potential harm?

SLIDE 24

Example: Image Caption Generation

Show and Tell: A Neural Image Caption Generator, CVPR 15

Slide credit: Arun Mallya

SLIDE 25

Network Types: Multiple Inputs, Single Output

[Diagram: inputs x_0, x_1, x_2 feed the chain h_0 → h_1 → h_2; a single output y comes off the final state.]

Recursive: sequence input, one output
  β€’ Document classification
  β€’ Action recognition in video (high-level)
SLIDE 26

Network Types: Multiple Inputs, Single Output

[Diagram as above: a sequence of inputs, one output y.]

Recursive: sequence input, one output
  β€’ Document classification
  β€’ Action recognition in video (high-level)

Think of this as generalizing maxent models to build discriminatively trained classifiers (a sketch follows below):

p(y | x) = maxent(x, y) βž” p(y | x) = recurrent_classifier(x, y)
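A minimal sketch of such a recurrent classifier in PyTorch; the module and its sizes are illustrative. The final hidden state summarizes the whole sequence and a linear layer predicts the label from it:

    import torch
    import torch.nn as nn

    class RecurrentClassifier(nn.Module):
        def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, num_classes)

        def forward(self, tokens):                    # tokens: (batch, seq_len) word ids
            _, h_last = self.rnn(self.embed(tokens))  # h_last: (1, batch, hidden_dim)
            return self.out(h_last.squeeze(0))        # one label score vector per sequence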

SLIDE 27

Example: RTE (many options)

s: Michael Jordan, coach Phil Jackson and the star cast, including Scottie Pippen, took the Chicago Bulls to six National Basketball Association championships.
z: The Bulls basketball team is based in Chicago.

[Diagram: p(ENTAILED | s, z); two recurrent encoders read s_0 … s_N and z_0 … z_M into hidden states h_{s,0} … h_{s,N} and h_{z,0} … h_{z,M}, whose final states jointly predict y.]
SLIDE 28

Example: RTE (many options)

s: Michael Jordan, coach Phil Jackson and the star cast, including Scottie Pippen, took the Chicago Bulls to six National Basketball Association championships.
z: The Bulls basketball team is based in Chicago.

[Diagram: p(ENTAILED | s, z); two recurrent encoders read s_0 … s_N and z_0 … z_M into hidden states h_{s,0} … h_{s,N} and h_{z,0} … h_{z,M}, whose final states jointly predict y.]
SLIDE 29

Reminder! GLUE:
  β€’ https://gluebenchmark.com/
  β€’ https://super.gluebenchmark.com/

Many (but not all) of these tasks fall into the Multiple Inputs, Single Output regime

SLIDE 30

Network Types: Multiple Inputs, Multiple Outputs (β€œsequence prediction”: no time delay)

[Diagram: inputs x_0, x_1, x_2 feed h_0, h_1, h_2; each state immediately emits an output y_0, y_1, y_2.]

Recursive: sequence input, sequence output
  β€’ Part of speech tagging
  β€’ Named entity recognition
SLIDE 31

Example 1: Part of Speech Tagging

British Left Waffles on Falkland Islands

Noun Verb Noun Prep Noun Noun

[Diagram: each word x_0 … x_5 feeds a hidden state h_0 … h_5, which predicts the tag y_0 … y_5.]

Task: Predict a part-of-speech tag for each word in a provided sentence
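A minimal sketch of such a tagger in PyTorch; names and sizes are illustrative. Run an RNN over the sentence and predict one tag from each hidden state:

    import torch
    import torch.nn as nn

    class RNNTagger(nn.Module):
        def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, num_tags)

        def forward(self, tokens):                    # tokens: (batch, seq_len)
            states, _ = self.rnn(self.embed(tokens))  # one h_t per word
            return self.out(states)                   # (batch, seq_len, num_tags)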

SLIDE 32

Example 1: Part of Speech Tagging

British Left Waffles on Falkland Islands

Noun Verb Noun Prep Noun Noun

[Diagram as above: one hidden state and one predicted tag per word.]
SLIDE 33

Example 2: Named Entity Recognition

British Left Waffles on Falkland Islands

ORG ORG Other Other LOC LOC

[Diagram: each word x_0 … x_5 feeds a hidden state h_0 … h_5, which predicts the tag y_0 … y_5.]

Task: Predict a named entity tag for each word in a provided sentence

SLIDE 34

What are Named Entities?

Named entity recognition (NER): identify proper names in texts, and classify them into a set of predefined categories of interest:
  β€’ Person names
  β€’ Organizations (companies, government organisations, committees, etc.)
  β€’ Locations (cities, countries, rivers, etc.)
  β€’ Date and time expressions
  β€’ Measures (percent, money, weight, etc.), email addresses, Web addresses, street addresses, etc.
  β€’ Domain-specific: names of drugs, medical conditions, names of ships, bibliographic references, etc.

Cunningham and Bontcheva (2003, RANLP Tutorial)

SLIDE 35

Example 2: Named Entity Recognition

British Left Waffles on Falkland Islands

ORG ORG Other Other LOC LOC

[Diagram: each word x_0 … x_5 feeds a hidden state h_0 … h_5, which predicts the tag y_0 … y_5.]

Task: Predict a named entity tag for each word in a provided sentence

SLIDE 36

Example: Named Entity Recognition

British Left Waffles on Falkland Islands

ORG ORG Other Other LOC LOC

[Diagram as above: one hidden state and one predicted tag per word.]
SLIDE 37

Network Types: Multiple Inputs, Multiple Outputs (β€œsequence-to-sequence”: time delay)

[Diagram: an encoder reads x_0, x_1, x_2 into h_0, h_1, h_2; only after the whole input is read does a decoder emit y_0, y_1, y_2, y_3.]

Recursive: sequence input, sequence output (time delay)
  β€’ Machine translation
  β€’ Sequential description
  β€’ Summarization
SLIDE 38

Example: Translation

Translate French (observed) into English:

The cat is on the chair. Le chat est sur la chaise.

variable # of input words; variable # of output words
SLIDE 39

Example: Translation

Translate French (observed) into English:

The cat is on the chair. Le chat est sur la chaise.

[Diagram: encoder states h_0 … h_2 read the French words x_0 … x_2; decoder states then emit the English words y_0 … y_3.]
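A bare-bones encoder-decoder sketch in PyTorch, with greedy decoding and illustrative sizes; real MT systems add attention and use beam search rather than greedy argmax:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.src_embed = nn.Embedding(src_vocab, embed_dim)
            self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
            self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, tgt_vocab)

        def forward(self, src, bos_id, max_len=20):
            _, h = self.encoder(self.src_embed(src))   # summarize the whole source
            token = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
            outputs = []
            for _ in range(max_len):                   # emit target words one at a time
                dec, h = self.decoder(self.tgt_embed(token), h)
                token = self.out(dec).argmax(-1)       # greedy: pick the best next word
                outputs.append(token)
            return torch.cat(outputs, dim=1)           # (batch, max_len) word ids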
SLIDE 40

RNN Output: Visual Storytelling

[Diagram: five photos each pass through a CNN; encoder GRUs read the photo sequence, and decoder GRUs generate the story one word at a time, e.g. β€œThe family got together for a cookout. They had a lot of delicious food.”]

Generated story: The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water. (Huang et al., 2016)

Human reference: The family has gathered around the dinner table to share a meal together. They all pitched in to help cook the seafood to perfection. Afterwards they took the family dog to the beach to get some exercise. The waves were cool and refreshing! The dog had so much fun in the water. One family member decided to get a better view of the waves!
SLIDE 41

Outline

β€’ Types of networks
β€’ Basic cell definition
β€’ Example in PyTorch
SLIDE 42

A More Typical View of Recurrent Neural Language Modeling

[Diagram: an unrolled recurrent LM; words w_{iβˆ’3} … w_i feed hidden states h_{iβˆ’3} … h_i, which predict the next words w_{iβˆ’2} … w_{i+1}.]
SLIDE 43

A More Typical View of Recurrent Neural Language Modeling

[Diagram: an unrolled recurrent LM; words w_{iβˆ’3} … w_i feed hidden states h_{iβˆ’3} … h_i, which predict the next words w_{iβˆ’2} … w_{i+1}.]

Observe these words one at a time.
SLIDE 44

A More Typical View of Recurrent Neural Language Modeling

[Diagram: an unrolled recurrent LM; words w_{iβˆ’3} … w_i feed hidden states h_{iβˆ’3} … h_i, which predict the next words w_{iβˆ’2} … w_{i+1}.]

Observe these words one at a time. Predict the next word.
SLIDE 45

A More Typical View of Recurrent Neural Language Modeling

[Diagram: an unrolled recurrent LM; words w_{iβˆ’3} … w_i feed hidden states h_{iβˆ’3} … h_i, which predict the next words w_{iβˆ’2} … w_{i+1}.]

Observe these words one at a time. Predict the next word from these hidden states.
SLIDE 46

A More Typical View of Recurrent Neural Language Modeling

[Diagram as above.]

Observe these words one at a time; predict the next word from these hidden states. One input/hidden-state/prediction unit is a β€œcell”.
SLIDE 47

A Recurrent Neural Network Cell

[Diagram: one step of the unrolled network; w_{iβˆ’1}, w_i enter h_{iβˆ’1}, h_i, which predict w_i, w_{i+1}.]
SLIDE 48

A Recurrent Neural Network Cell

[Diagram, continued: a matrix W carries h_{iβˆ’1} to h_i at every step.]
SLIDE 49

A Recurrent Neural Network Cell

[Diagram, continued: encoding, a matrix U maps each input word into its hidden state.]
SLIDE 50

A Recurrent Neural Network Cell

[Diagram, continued: decoding, a matrix S maps each hidden state to its prediction.]
SLIDE 51

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)
SLIDE 52

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)

Οƒ(a) = 1 / (1 + exp(βˆ’a))
SLIDE 53

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)

Οƒ(a) = 1 / (1 + exp(βˆ’a))
SLIDE 54

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)

Οƒ(a) = 1 / (1 + exp(βˆ’a))
SLIDE 55

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)

Οƒ(a) = 1 / (1 + exp(βˆ’a))

xΜ‚_{j+1} = softmax(S h_j)
SLIDE 56

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)
xΜ‚_{j+1} = softmax(S h_j)

We must learn the matrices U, S, W.
SLIDE 57

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)
xΜ‚_{j+1} = softmax(S h_j)

We must learn the matrices U, S, W.
Suggested solution: gradient descent on prediction ability.
SLIDE 58

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)
xΜ‚_{j+1} = softmax(S h_j)

We must learn the matrices U, S, W.
Suggested solution: gradient descent on prediction ability.
Problem: they’re tied across inputs/timesteps.
SLIDE 59

A Simple Recurrent Neural Network Cell

[Diagram: encoding via U, recurrence via W, decoding via S.]

h_j = Οƒ(W h_{jβˆ’1} + U x_j)
xΜ‚_{j+1} = softmax(S h_j)

We must learn the matrices U, S, W.
Suggested solution: gradient descent on prediction ability.
Problem: they’re tied across inputs/timesteps.
Good news for you: many toolkits do this automatically.
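To make the cell concrete, a from-scratch sketch in plain NumPy; the dimensions are illustrative, and (as the slide says) real toolkits compute the gradients for you:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    hidden, vocab = 16, 100
    rng = np.random.default_rng(0)
    W = rng.normal(size=(hidden, hidden))   # recurrence: h_{j-1} -> h_j
    U = rng.normal(size=(hidden, vocab))    # encoding: x_j -> h_j
    S = rng.normal(size=(vocab, hidden))    # decoding: h_j -> next-word scores

    def rnn_cell(h_prev, x_onehot):
        h = sigmoid(W @ h_prev + U @ x_onehot)   # h_j = sigma(W h_{j-1} + U x_j)
        x_hat = softmax(S @ h)                   # x-hat_{j+1} = softmax(S h_j)
        return h, x_hat

The same three matrices are used at every timestep, which is exactly the tying the previous slides mention.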
SLIDE 60

Why Is Training RNNs Hard?

Conceptually, it can get strange. But really, getting the gradient just requires many applications of the chain rule for derivatives.
SLIDE 61

Why Is Training RNNs Hard?

Conceptually, it can get strange. But really, getting the gradient just requires many applications of the chain rule for derivatives.

Vanishing gradients: we multiply by the same matrices at each timestep βž” the gradients multiply many matrices together.
SLIDE 62

Why Is Training RNNs Hard?

Conceptually, it can get strange. But really, getting the gradient just requires many applications of the chain rule for derivatives.

Vanishing gradients: we multiply by the same matrices at each timestep βž” the gradients multiply many matrices together.

One solution: clip the gradients to a max value (this helps most with the flip side of the problem, exploding gradients); a PyTorch sketch follows below.
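In PyTorch, clipping is one call between backward() and the optimizer step. A minimal, self-contained sketch; the tiny RNN and stand-in loss are just for illustration:

    import torch
    import torch.nn as nn

    model = nn.RNN(8, 16)                   # any parameterized model would do
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(5, 1, 8)                # (seq_len, batch, input_size)
    output, _ = model(x)
    loss = output.pow(2).mean()             # stand-in loss, for illustration only

    optimizer.zero_grad()
    loss.backward()                         # backpropagation through time
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale large gradients
    optimizer.step()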

SLIDE 63

Outline

β€’ Types of networks
β€’ Basic cell definition
β€’ Example in PyTorch
SLIDE 64

Natural Language Processing

    from torch import *
    from keras import *
SLIDE 65

Pick Your Toolkit

PyTorch, Deeplearning4j, TensorFlow, DyNet, Caffe, Keras, MxNet, Gluon, CNTK, …

Comparisons:
  β€’ https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software
  β€’ https://deeplearning4j.org/compare-dl4j-tensorflow-pytorch
  β€’ https://github.com/zer0n/deepframeworks (older, from 2015)
SLIDE 66

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

[Diagram: unrolled RNN; w_{iβˆ’2}, w_{iβˆ’1}, w_i feed h_{iβˆ’2}, h_{iβˆ’1}, h_i, which predict w_{iβˆ’1}, w_i, w_{i+1}.]
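The code shown on this slide comes from the linked tutorial; reproduced roughly (and lightly edited) here for reference:

    import torch
    import torch.nn as nn

    class RNN(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super(RNN, self).__init__()
            self.hidden_size = hidden_size
            # encode: combine the current input with the previous hidden state
            self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
            # decode: map the combined input to scores over the outputs
            self.i2o = nn.Linear(input_size + hidden_size, output_size)
            self.softmax = nn.LogSoftmax(dim=1)

        def forward(self, input, hidden):
            combined = torch.cat((input, hidden), 1)
            hidden = self.i2h(combined)                 # new hidden state h_i
            output = self.softmax(self.i2o(combined))   # log-probabilities over outputs
            return output, hidden

        def initHidden(self):
            return torch.zeros(1, self.hidden_size)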
SLIDE 67

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

[Diagram and code as above.]
SLIDE 68

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

[Diagram and code as above.]
SLIDE 69

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

SLIDE 70

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

encode: [the highlighted lines (the i2h step) compute the new hidden state h_i from the input and h_{iβˆ’1}; diagram as above.]
SLIDE 71

Defining A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

decode: [the highlighted lines (the i2o + softmax step) produce the prediction from the hidden state; diagram as above.]
SLIDE 72

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
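The training code in these screenshots follows the tutorial; roughly as below, assuming the RNN class from the previous slides as `rnn` and the tutorial's data tensors:

    import torch.nn as nn

    criterion = nn.NLLLoss()          # negative log-likelihood
    learning_rate = 0.005

    def train(category_tensor, line_tensor):
        hidden = rnn.initHidden()
        rnn.zero_grad()
        # get predictions: feed the sequence through one item at a time
        for i in range(line_tensor.size()[0]):
            output, hidden = rnn(line_tensor[i], hidden)
        # eval predictions: loss of the final prediction vs. the true class
        loss = criterion(output, category_tensor)
        # compute gradient: backpropagation through time
        loss.backward()
        # perform SGD: nudge each parameter against its gradient
        for p in rnn.parameters():
            p.data.add_(p.grad.data, alpha=-learning_rate)
        return output, loss.item()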

SLIDE 73

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood
SLIDE 74

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood; get predictions

[Diagram as above.]
SLIDE 75

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood; get predictions; eval predictions
SLIDE 76

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood; get predictions; eval predictions; compute gradient
SLIDE 77

Training A Simple RNN in Python

(Modified Very Slightly)

http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

Negative log-likelihood; get predictions; eval predictions; compute gradient; perform SGD
SLIDE 78

Another Solution: LSTMs/GRUs

LSTM: Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
GRU: Gated Recurrent Unit (Cho et al., 2014)
Basic idea: learn to forget.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Diagram (from the linked post): the cell’s representation line and its forget line.]
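Both cells come ready-made in PyTorch; a drop-in sketch with illustrative sizes:

    import torch
    import torch.nn as nn

    gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
    lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

    x = torch.randn(8, 20, 32)          # (batch, seq_len, features)
    out_g, h_g = gru(x)                 # GRU: hidden state only
    out_l, (h_l, c_l) = lstm(x)         # LSTM: hidden state + cell ("memory") state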
SLIDE 79

Outline

β€’ Types of networks
β€’ Basic cell definition
β€’ Example in PyTorch