SLIDE 1

Recurrent Neural Network

Rachel Hu and Zhi Zhang

Amazon AI

SLIDE 2

Outline

  • Dependent Random Variables
  • Text Preprocessing
  • Language Modeling
  • Recurrent Neural Networks (RNN)
  • LSTM
  • Bidirectional RNN
  • Deep RNN
SLIDE 3

Dependent Random Variables

SLIDE 4

Time matters (Koren, 2009): Netflix changed the labels of its rating system

Yehuda Koren, 2009

SLIDE 5

Time matters (Koren, 2009)

Selection Bias

Yehuda Koren, 2009

SLIDE 6

Kahneman & Krueger, 2006

SLIDE 7

Kahneman & Krueger, 2006

SLIDE 8

TL;DR - Data usually isn’t IID

Examples of temporal effects: Prime Day, Christmas, back to school, Q2 earnings, rate cuts, hair tweets, rating agencies, inventory, Black Friday.

SLIDE 9

Data

  • So far …
  • Collect observation pairs (xi, yi) ∼ p(x, y) for training
  • Estimate y | x ∼ p(y | x) for unseen x′ ∼ p(x)
  • Examples
  • Image classification & object recognition
  • Disease prediction
  • Housing price prediction
  • The order of the data does not matter

SLIDE 10

Text Processing

SLIDE 11

Text Preprocessing

  • Sequence data has long-range dependencies (very costly)
  • Truncate it into shorter fragments
  • Transform examples into mini-batches of ndarrays

Images: (batch size, width, height, channel) → Text: (batch size, sentence length)
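
A minimal sketch of this truncate-and-batch step, assuming `corpus` is a plain Python list of token ids (names and batching scheme are illustrative, not the course notebook's implementation):

```python
import numpy as np

def make_minibatches(corpus, batch_size, num_steps):
    """Truncate a long token-id sequence into fragments of length num_steps
    and yield minibatches of shape (batch_size, num_steps)."""
    num_fragments = len(corpus) // num_steps
    fragments = np.array(corpus[:num_fragments * num_steps]).reshape(-1, num_steps)
    np.random.shuffle(fragments)  # shuffle whole fragments, not individual tokens
    for i in range(0, len(fragments) - batch_size + 1, batch_size):
        yield fragments[i:i + batch_size]
```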

SLIDE 12

Tokenization

  • Basic idea - map text into a sequence of tokens
  • “Deep learning is fun.” -> [“Deep”, “learning”, “is”, “fun”, “.”]
  • Character Encoding (each character as a token)
  • Small vocabulary
  • Doesn’t work so well (needs to learn spelling)
  • Word Encoding (each word as a token)
  • Accurate spelling
  • Doesn’t work so well (huge vocabulary = costly multinomial)
  • Byte Pair Encoding (Goldilocks zone)
  • Frequent subsequences (like syllables)
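
A minimal tokenization sketch at word and character granularity (the regex is an illustrative choice; byte pair encoding is not shown here):

```python
import re

def tokenize(text, level="word"):
    """Map raw text to a sequence of tokens."""
    if level == "word":
        # keep words and punctuation marks as separate tokens
        return re.findall(r"\w+|[^\w\s]", text)
    return list(text)  # character-level: every character is a token

print(tokenize("Deep learning is fun."))
# ['Deep', 'learning', 'is', 'fun', '.']
```
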
SLIDE 13

Vocabulary

  • Find unique tokens, map each one into a numerical index
  • “Deep” : 1, “learning” : 2, “is” : 3, “fun” : 4, “.” : 5
  • The frequency of words often obeys a power-law distribution
  • Map tail tokens (e.g., those appearing < 5 times) into a special “unknown” token
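
A sketch of building such a vocabulary, assuming `tokens` is a flat list of token strings; tokens rarer than `min_freq` fall back to a special "<unk>" index:

```python
from collections import Counter

def build_vocab(tokens, min_freq=5):
    """Map unique tokens to indices; rare tokens share the '<unk>' index."""
    counts = Counter(tokens)
    vocab = {"<unk>": 0}
    for token, freq in counts.most_common():  # frequent tokens get small indices
        if freq < min_freq:
            break
        vocab[token] = len(vocab)
    return vocab

# lookup: vocab.get(token, vocab["<unk>"])
```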

SLIDE 14

Minibatch Generation

SLIDE 15

Text Preprocessing Notebook

SLIDE 16

Language Models

SLIDE 17

Language Models

  • Tokens are not real values (the domain is countably finite)
  • Factorize the joint probability by the chain rule, e.g.,
  • Estimate the conditional probabilities from counts

p(w1, w2, …, wT) = p(w1) ∏_{t=2}^{T} p(wt | w1, …, wt−1)

p(deep, learning, is, fun, .) = p(deep) p(learning | deep) p(is | deep, learning) p(fun | deep, learning, is) p(. | deep, learning, is, fun)

p̂(learning | deep) = n(deep, learning) / n(deep)

Need smoothing: counts for rare or unseen word combinations are zero or unreliable.
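
A toy count-based estimate of p̂(learning | deep) on made-up data, which also shows why smoothing is needed (any pair never seen in the corpus gets probability zero):

```python
from collections import Counter

tokens = ["deep", "learning", "is", "fun", ".", "deep", "learning", "rocks", "."]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens[:-1], tokens[1:]))

# maximum-likelihood estimate: n(deep, learning) / n(deep)
print(bigrams[("deep", "learning")] / unigrams["deep"])   # 1.0 on this toy corpus
print(bigrams[("deep", "thought")] / unigrams["deep"])    # 0.0 -> needs smoothing
```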

SLIDE 18

Language Modeling

  • Goal: predict the probability of a sentence, e.g. p(Deep, learning, is, fun, .)
  • NLP fundamental tasks
  • Typing - predict the next word
  • Machine translation - “dog bites man” vs “man bites dog”
  • Speech recognition - “to recognize speech” vs “to wreck a nice beach”

SLIDE 19

Language Modeling

  • NLP fundamental tasks
  • Named-entity recognition
  • Part-of-speech tagging
  • Machine translation
  • Question answering
  • Automatic Summarization
SLIDE 20

Recurrent Neural Networks

SLIDE 21

RNN with Hidden States

  • 2-layer MLP
    Ht = ϕ(WhxXt−1 + bh)
    Ot = WhoHt + bo
  • RNN with hidden state
    Ht = ϕ(WhhHt−1 + WhxXt−1 + bh)
    Ot = WhoHt + bo
  • Hidden state update
  • Observation update
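
A minimal numpy sketch of one such update (row-vector convention X @ W and illustrative parameter names; the course notebook may organize this differently):

```python
import numpy as np

def rnn_step(X, H, W_hx, W_hh, b_h, W_ho, b_o):
    """One recurrent step: hidden state update, then observation update."""
    H = np.tanh(X @ W_hx + H @ W_hh + b_h)  # hidden state update
    O = H @ W_ho + b_o                      # observation update
    return H, O
```
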
SLIDE 22

Next word prediction

SLIDE 23

Input Encoding

  • Need to map input numerical indices to vectors
  • Pick granularity (words, characters, subwords)
  • Map to indicator vectors
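
A small sketch of mapping token indices to indicator (one-hot) vectors with numpy:

```python
import numpy as np

def one_hot(indices, vocab_size):
    """Each index becomes an indicator vector of length vocab_size."""
    out = np.zeros((len(indices), vocab_size))
    out[np.arange(len(indices)), indices] = 1.0
    return out

print(one_hot([0, 2], vocab_size=4))
```
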
SLIDE 24

RNN with hidden state mechanics

  • Input: vector sequence x1, …, xT ∈ ℝd
  • Hidden states: h1, …, hT ∈ ℝh, where ht = f(ht−1, xt)
  • Output: vector sequence o1, …, oT ∈ ℝp, where ot = g(ht)
  • p is the vocabulary size
  • ot,j is the confidence score that the t-th time step in the sequence equals the j-th token in the vocabulary
  • Loss: measures the classification error over the T tokens
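
One common way to measure that classification error is the average cross-entropy of the per-step scores; a numpy sketch under assumed shapes (T steps, vocabulary size p):

```python
import numpy as np

def sequence_loss(logits, targets):
    """Average cross-entropy over T time steps.
    logits: (T, p) scores o_t over the vocabulary; targets: (T,) token indices."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)  # softmax over the vocabulary
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))
```
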
SLIDE 25

Gradient Clipping

  • Long chain of dependencies for backprop
  • Need to keep many intermediate values in memory
  • Butterfly-effect-style dependencies
  • Gradients can vanish or explode
  • Clipping prevents divergence:

g ← min(1, θ / ∥g∥) · g rescales the gradient to norm at most θ
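
A sketch of clipping by global norm for a list of numpy gradient arrays (illustrative; deep learning frameworks ship their own clipping utilities):

```python
import numpy as np

def clip_gradients(grads, theta):
    """Rescale all gradients jointly so their global norm is at most theta."""
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm > theta:
        grads = [g * (theta / norm) for g in grads]
    return grads
```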

SLIDE 26

RNN Notebook

SLIDE 27

Paying attention to a sequence

  • Not all observations are equally relevant
SLIDE 28

Paying attention to a sequence

  • Not all observations are equally relevant
  • Need mechanism to pay attention (update gate)

e.g., an early observation is highly significant for predicting all future observations. We would like to have some mechanism for storing/updating vital early information in a memory cell.

SLIDE 29

Paying attention to a sequence

  • Not all observations are equally relevant
  • Need mechanism to forget (reset gate)

e.g., there is a logical break between parts of a sequence. For instance, there might be a transition between chapters in a book, or between a bear and a bull market for securities.

SLIDE 30

From RNN to GRU

GRU:
Rt = σ(XtWxr + Ht−1Whr + br)
Zt = σ(XtWxz + Ht−1Whz + bz)
H̃t = tanh(XtWxh + (Rt ⊙ Ht−1)Whh + bh)
Ht = Zt ⊙ Ht−1 + (1 − Zt) ⊙ H̃t

RNN:
Ht = ϕ(WhhHt−1 + WhxXt−1 + bh)
Ot = WhoHt + bo
SLIDE 31

GRU - Gates

Rt = σ(XtWxr + Ht−1Whr + br), Zt = σ(XtWxz + Ht−1Whz + bz)

SLIDE 32

GRU - Candidate Hidden State

H̃t = tanh(XtWxh + (Rt ⊙ Ht−1)Whh + bh)

SLIDE 33

Hidden State

Ht = Zt ⊙ Ht−1 + (1 − Zt) ⊙ H̃t

SLIDE 34

Summary

Rt = σ(XtWxr + Ht−1Whr + br)
Zt = σ(XtWxz + Ht−1Whz + bz)
H̃t = tanh(XtWxh + (Rt ⊙ Ht−1)Whh + bh)
Ht = Zt ⊙ Ht−1 + (1 − Zt) ⊙ H̃t
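
A numpy sketch of one GRU step following these equations (weights passed in a dict with assumed key names such as "Wxr"; shapes follow the row-vector convention):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X, H, p):
    """One GRU update: reset gate, update gate, candidate, new hidden state."""
    R = sigmoid(X @ p["Wxr"] + H @ p["Whr"] + p["br"])              # reset gate
    Z = sigmoid(X @ p["Wxz"] + H @ p["Whz"] + p["bz"])              # update gate
    H_tilde = np.tanh(X @ p["Wxh"] + (R * H) @ p["Whh"] + p["bh"])  # candidate state
    return Z * H + (1 - Z) * H_tilde                                # new hidden state
```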

SLIDE 35

Long Short Term Memory

SLIDE 36

GRU and LSTM

GRU:
Rt = σ(XtWxr + Ht−1Whr + br)
Zt = σ(XtWxz + Ht−1Whz + bz)
H̃t = tanh(XtWxh + (Rt ⊙ Ht−1)Whh + bh)
Ht = Zt ⊙ Ht−1 + (1 − Zt) ⊙ H̃t

LSTM:
It = σ(XtWxi + Ht−1Whi + bi)
Ft = σ(XtWxf + Ht−1Whf + bf)
Ot = σ(XtWxo + Ht−1Who + bo)
C̃t = tanh(XtWxc + Ht−1Whc + bc)
Ct = Ft ⊙ Ct−1 + It ⊙ C̃t
Ht = Ot ⊙ tanh(Ct)

SLIDE 37

Long Short Term Memory

  • Forget gate: reset the memory cell values
  • Input gate: decide whether we should ignore the input data
  • Output gate: decide whether the hidden state is used for the output generated by the LSTM
  • Hidden state and memory cell
SLIDE 38

Gates

It = σ(XtWxi + Ht−1Whi + bi)
Ft = σ(XtWxf + Ht−1Whf + bf)
Ot = σ(XtWxo + Ht−1Who + bo)

SLIDE 39

Candidate Memory Cell

C̃t = tanh(XtWxc + Ht−1Whc + bc)

SLIDE 40

Memory Cell

Ct = Ft ⊙ Ct−1 + It ⊙ C̃t

SLIDE 41

Hidden State / Output

Ht = Ot ⊙ tanh(Ct)

SLIDE 42

Hidden State / Output

It = σ(XtWxi + Ht−1Whi + bi)
Ft = σ(XtWxf + Ht−1Whf + bf)
Ot = σ(XtWxo + Ht−1Who + bo)
C̃t = tanh(XtWxc + Ht−1Whc + bc)
Ct = Ft ⊙ Ct−1 + It ⊙ C̃t
Ht = Ot ⊙ tanh(Ct)
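
A numpy sketch of one LSTM step following these equations (weights in a dict with assumed key names; the hidden state H and memory cell C are carried along):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X, H, C, p):
    """One LSTM update: gates, candidate memory, memory cell, hidden state."""
    I = sigmoid(X @ p["Wxi"] + H @ p["Whi"] + p["bi"])        # input gate
    F = sigmoid(X @ p["Wxf"] + H @ p["Whf"] + p["bf"])        # forget gate
    O = sigmoid(X @ p["Wxo"] + H @ p["Who"] + p["bo"])        # output gate
    C_tilde = np.tanh(X @ p["Wxc"] + H @ p["Whc"] + p["bc"])  # candidate memory
    C = F * C + I * C_tilde                                   # memory cell
    H = O * np.tanh(C)                                        # hidden state / output
    return H, C
```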

SLIDE 43

LSTM Notebook

SLIDE 44

Bidirectional RNNs

SLIDE 45

I am _____
I am _____ very hungry,
I am _____ very hungry, I could eat half a pig.

SLIDE 46

I am hungry.
I am not very hungry,
I am very very hungry, I could eat half a pig.

SLIDE 47

The Future Matters

I am happy.
I am not very hungry,
I am very very hungry, I could eat half a pig.

  • Very different words to fill in, depending on the past and future context of a word
  • RNNs so far only look at the past
  • In interpolation (fill in) we can use the future, too

SLIDE 48

Bidirectional RNN

  • One RNN runs forward
  • Another RNN runs backward
  • Combine both hidden states for output generation
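
A sketch of the idea, assuming `fwd_step` and `bwd_step` are single-step recurrent functions (hypothetical callables, e.g. closures over their own weights) that map (X, H) to a new H:

```python
import numpy as np

def bidirectional_rnn(X_seq, H0_fwd, H0_bwd, fwd_step, bwd_step):
    """Run one RNN forward and another backward over the sequence, then
    concatenate the two hidden states at each time step for output generation."""
    fwd, H = [], H0_fwd
    for X in X_seq:                      # left-to-right pass
        H = fwd_step(X, H)
        fwd.append(H)
    bwd, H = [], H0_bwd
    for X in reversed(X_seq):            # right-to-left pass
        H = bwd_step(X, H)
        bwd.append(H)
    bwd.reverse()                        # re-align with the forward direction
    return [np.concatenate([f, b], axis=-1) for f, b in zip(fwd, bwd)]
```
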
SLIDE 49

Using RNNs

(image courtesy of karpathy.github.io)

  • Poetry generation
  • Sentiment analysis
  • Document classification
  • Question answering
  • Machine translation
  • Named entity tagging

SLIDE 50

Recall - RNNs Architecture

How to make more nonlinear?

Ht = ϕ(WhhHt−1 + WhxXt−1 + bh)
Ot = WhoHt + bo

  • Hidden state update
  • Observation update
SLIDE 51

We go deeper

SLIDE 52

We go deeper

  • Shallow RNN
  • Input
  • Hidden layer: Ht = f(Ht−1, Xt)
  • Output: Ot = g(Ht)
  • Deep RNN
  • Input
  • Hidden layer 1: H^1_t = f_1(H^1_{t−1}, Xt)
  • Hidden layer j: H^j_t = f_j(H^j_{t−1}, H^{j−1}_t)
  • Output: Ot = g(H^L_t)
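
A sketch of one time step of a stacked (deep) RNN, assuming `layer_steps` is a list of per-layer step functions (hypothetical callables): layer j consumes the hidden state of layer j−1.

```python
def deep_rnn_step(X, H_prev, layer_steps):
    """One time step of an L-layer RNN; returns the new hidden state of each layer."""
    H_new, inp = [], X
    for H_j, step_j in zip(H_prev, layer_steps):   # bottom layer to top layer
        H_j = step_j(inp, H_j)                     # H_t^j = f_j(H_{t-1}^j, H_t^{j-1})
        H_new.append(H_j)
        inp = H_j                                  # feeds the next layer up
    return H_new                                   # output: O_t = g(H_new[-1])
```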

SLIDE 53

Summary

  • Dependent Random Variables
  • Text Preprocessing
  • Language Modeling
  • Recurrent Neural Networks (RNN)
  • LSTM
  • Bidirectional RNN
  • Deep RNN
SLIDE 54

More Contents in D2L.ai

  • Basic: NDArray, Autograd, Gluon
  • Math: Linear algebra, Probability, Calculus & Statistics, Gradients
  • Basic models: Linear regression, Softmax regression, Multilayer perceptron, Image classification
  • Machine learning: Loss function, Regularization, Model selection, Environment
  • CNN: Convolution, LeNet, AlexNet, VGG, Inception, ResNet
  • CV: Data augmentation, Fine-tuning, Object detection, Segmentation
  • RNNs: Recurrent networks (RNN, GRU, LSTM) for language modeling, Word embedding, Seq2seq for machine translation
  • Performance: Numerical stability, Multi-GPU training
  • Attention: Seq2seq with attention, Transformer, BERT
  • Optimization: Convex optimization, Momentum, RMSProp, Adam
  • GAN: Generative Adversarial Networks, DCGAN

(The original slide distinguishes the topics covered in this lecture from those that were not.)
SLIDE 55

Resources

  • Textbook: numpy.d2l.ai
  • Toolkit for computer vision: gluon-cv.mxnet.io
  • Toolkit for natural language processing: gluon-nlp.mxnet.io
  • Toolkit for time series: gluon-ts.mxnet.io
  • Toolkit for graph neural networks: dgl.ai
  • Discussion forum: https://discuss.mxnet.io/