SLIDE 1

Recurrent Neural Network

Rachel Hu and Zhi Zhang

Amazon AI

SLIDE 2

Outline

  • Dependent Random Variables
  • Text Preprocessing
  • Language Modeling
  • Recurrent Neural Networks (RNN)
  • LSTM
  • Bidirectional RNN
  • Deep RNN
SLIDE 3

Dependent Random Variables

SLIDE 4

Time matters (Koren, 2009): Netflix changed the labels of its rating system

Yehuda Koren, 2009

SLIDE 5

Time matters (Koren, 2009)

Selection Bias

Yehuda Koren, 2009

SLIDE 6

Kahneman & Krueger, 2006

SLIDE 7

Kahneman & Krueger, 2006

SLIDE 8

TL;DR - Data usually isn’t IID

Examples of temporal effects: Prime Day, Christmas, back to school, Q2 earnings, rate cuts, hair tweets, rating agencies, inventory, Black Friday.

SLIDE 9

Data

  • So far …
  • Collect observation pairs (xi, yi) ∼ p(x, y) for training
  • Estimate y | x ∼ p(y | x) for unseen x′ ∼ p(x)
  • Examples
  • Image classification & object recognition
  • Disease prediction
  • Housing price prediction
  • The order of the data does not matter

SLIDE 10

Text Processing

SLIDE 11

Text Preprocessing

  • Sequence data has long-range dependencies (very costly)
  • Truncate it into shorter fragments
  • Transform examples into mini-batches of ndarrays

Images: (batch size, width, height, channel) → Text: (batch size, sentence length)
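
A minimal sketch of this truncate-and-batch step, assuming `corpus` is a plain Python list of token ids (names and batching scheme are illustrative, not the course notebook's implementation):

```python
import numpy as np

def make_minibatches(corpus, batch_size, num_steps):
    """Truncate a long token-id sequence into fragments of length num_steps
    and yield minibatches of shape (batch_size, num_steps)."""
    num_fragments = len(corpus) // num_steps
    fragments = np.array(corpus[:num_fragments * num_steps]).reshape(-1, num_steps)
    np.random.shuffle(fragments)  # shuffle whole fragments, not individual tokens
    for i in range(0, len(fragments) - batch_size + 1, batch_size):
        yield fragments[i:i + batch_size]
```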

SLIDE 12

Tokenization

  • Basic idea - map text into a sequence of tokens
  • “Deep learning is fun.” -> [“Deep”, “learning”, “is”, “fun”, “.”]
  • Character Encoding (each character as a token)
  • Small vocabulary
  • Doesn’t work so well (needs to learn spelling)
  • Word Encoding (each word as a token)
  • Accurate spelling
  • Doesn’t work so well (huge vocabulary = costly multinomial)
  • Byte Pair Encoding (Goldilocks zone)
  • Frequent subsequences (like syllables)
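
A minimal tokenization sketch at word and character granularity (the regex is an illustrative choice; byte pair encoding is not shown here):

```python
import re

def tokenize(text, level="word"):
    """Map raw text to a sequence of tokens."""
    if level == "word":
        # keep words and punctuation marks as separate tokens
        return re.findall(r"\w+|[^\w\s]", text)
    return list(text)  # character-level: every character is a token

print(tokenize("Deep learning is fun."))
# ['Deep', 'learning', 'is', 'fun', '.']
```
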
SLIDE 13

Vocabulary

  • Find unique tokens, map each one into a numerical index
  • “Deep” : 1, “learning” : 2, “is” : 3, “fun” : 4, “.” : 5
  • The frequency of words often obeys a power-law distribution
  • Map tail tokens (e.g., those appearing < 5 times) into a special “unknown” token
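
A sketch of building such a vocabulary, assuming `tokens` is a flat list of token strings; tokens rarer than `min_freq` fall back to a special "<unk>" index:

```python
from collections import Counter

def build_vocab(tokens, min_freq=5):
    """Map unique tokens to indices; rare tokens share the '<unk>' index."""
    counts = Counter(tokens)
    vocab = {"<unk>": 0}
    for token, freq in counts.most_common():  # frequent tokens get small indices
        if freq < min_freq:
            break
        vocab[token] = len(vocab)
    return vocab

# lookup: vocab.get(token, vocab["<unk>"])
```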

SLIDE 14

Minibatch Generation

SLIDE 15

Text Preprocessing Notebook

SLIDE 16

Language Models

SLIDE 17

Language Models

  • Tokens are not real values (the domain is countably finite)
  • Factorize the joint probability by the chain rule, e.g.,
  • Estimate the conditional probabilities from counts

p(w1, w2, …, wT) = p(w1) ∏_{t=2}^{T} p(wt | w1, …, wt−1)

p(deep, learning, is, fun, .) = p(deep) p(learning | deep) p(is | deep, learning) p(fun | deep, learning, is) p(. | deep, learning, is, fun)

p̂(learning | deep) = n(deep, learning) / n(deep)

Need smoothing: counts for rare or unseen word combinations are zero or unreliable.
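
A toy count-based estimate of p̂(learning | deep) on made-up data, which also shows why smoothing is needed (any pair never seen in the corpus gets probability zero):

```python
from collections import Counter

tokens = ["deep", "learning", "is", "fun", ".", "deep", "learning", "rocks", "."]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens[:-1], tokens[1:]))

# maximum-likelihood estimate: n(deep, learning) / n(deep)
print(bigrams[("deep", "learning")] / unigrams["deep"])   # 1.0 on this toy corpus
print(bigrams[("deep", "thought")] / unigrams["deep"])    # 0.0 -> needs smoothing
```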

SLIDE 18

Language Modeling

  • Goal: predict the probability of a sentence, e.g. p(Deep, learning, is, fun, .)
  • NLP fundamental tasks
  • Typing - predict the next word
  • Machine translation - “dog bites man” vs “man bites dog”
  • Speech recognition - “to recognize speech” vs “to wreck a nice beach”

SLIDE 19

Language Modeling

  • NLP fundamental tasks
  • Named-entity recognition
  • Part-of-speech tagging
  • Machine translation
  • Question answering
  • Automatic Summarization
SLIDE 20

Recurrent Neural Networks

SLIDE 21

RNN with Hidden States

  • 2-layer MLP
    Ht = ϕ(WhxXt−1 + bh)
    Ot = WhoHt + bo
  • RNN with hidden state
    Ht = ϕ(WhhHt−1 + WhxXt−1 + bh)
    Ot = WhoHt + bo
  • Hidden state update
  • Observation update
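
A minimal numpy sketch of one such update (row-vector convention X @ W and illustrative parameter names; the course notebook may organize this differently):

```python
import numpy as np

def rnn_step(X, H, W_hx, W_hh, b_h, W_ho, b_o):
    """One recurrent step: hidden state update, then observation update."""
    H = np.tanh(X @ W_hx + H @ W_hh + b_h)  # hidden state update
    O = H @ W_ho + b_o                      # observation update
    return H, O
```
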
SLIDE 22

Next word prediction

SLIDE 23

Input Encoding

  • Need to map input numerical indices to vectors
  • Pick granularity (words, characters, subwords)
  • Map to indicator vectors
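
A small sketch of mapping token indices to indicator (one-hot) vectors with numpy:

```python
import numpy as np

def one_hot(indices, vocab_size):
    """Each index becomes an indicator vector of length vocab_size."""
    out = np.zeros((len(indices), vocab_size))
    out[np.arange(len(indices)), indices] = 1.0
    return out

print(one_hot([0, 2], vocab_size=4))
```
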
SLIDE 24

RNN with hidden state mechanics

  • Input: vector sequence x1, …, xT ∈ ℝd
  • Hidden states: h1, …, hT ∈ ℝh, where ht = f(ht−1, xt)
  • Output: vector sequence o1, …, oT ∈ ℝp, where ot = g(ht)
  • p is the vocabulary size
  • ot,j is the confidence score that the t-th time step in the sequence equals the j-th token in the vocabulary
  • Loss: measures the classification error over the T tokens
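
One common way to measure that classification error is the average cross-entropy of the per-step scores; a numpy sketch under assumed shapes (T steps, vocabulary size p):

```python
import numpy as np

def sequence_loss(logits, targets):
    """Average cross-entropy over T time steps.
    logits: (T, p) scores o_t over the vocabulary; targets: (T,) token indices."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)  # softmax over the vocabulary
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))
```
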
SLIDE 25

Gradient Clipping

  • Long chain of dependencies for backprop
  • Need to keep many intermediate values in memory
  • Butterfly-effect-style dependencies
  • Gradients can vanish or explode
  • Clipping prevents divergence:

g ← min(1, θ / ∥g∥) · g rescales the gradient to norm at most θ
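
A sketch of clipping by global norm for a list of numpy gradient arrays (illustrative; deep learning frameworks ship their own clipping utilities):

```python
import numpy as np

def clip_gradients(grads, theta):
    """Rescale all gradients jointly so their global norm is at most theta."""
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm > theta:
        grads = [g * (theta / norm) for g in grads]
    return grads
```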

SLIDE 26

RNN Notebook

SLIDE 27

Paying attention to a sequence

  • Not all observations are equally relevant
SLIDE 28

Paying attention to a sequence

  • Not all observations are equally relevant
  • Need mechanism to pay attention (update gate)

e.g., an early observation is highly significant for predicting all future observations. We would like to have some mechanism for storing/updating vital early information in a memory cell.

SLIDE 29

Paying attention to a sequence

  • Not all observations are equally relevant
  • Need mechanism to forget (reset gate)

e.g., there is a logical break between parts of a sequence. For instance, there might be a transition between chapters in a book, or between a bear and a bull market for securities.

SLIDE 30

From RNN to GRU

GRU:
Rt = σ(XtWxr + Ht−1Whr + br)
Zt = σ(XtWxz + Ht−1Whz + bz)
H̃t = tanh(XtWxh + (Rt ⊙ Ht−1)Whh + bh)
Ht = Zt ⊙ Ht−1 + (1 − Zt) ⊙ H̃t

RNN:
Ht = ϕ(WhhHt−1 + WhxXt−1 + bh)
Ot = WhoHt + bo
SLIDE 31

GRU - Gates

Rt = σ(XtWxr + Ht−1Whr + br), Zt = σ(XtWxz + Ht−1Whz + bz)

SLIDE 32

GRU - Candidate Hidden State

H̃t = tanh(XtWxh + (Rt ⊙ Ht−1)Whh + bh)

SLIDE 33

Hidden State

Ht = Zt ⊙ Ht−1 + (1 − Zt) ⊙ H̃t

SLIDE 34

Summary

Rt = σ(XtWxr + Ht−1Whr + br)
Zt = σ(XtWxz + Ht−1Whz + bz)
H̃t = tanh(XtWxh + (Rt ⊙ Ht−1)Whh + bh)
Ht = Zt ⊙ Ht−1 + (1 − Zt) ⊙ H̃t
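
A numpy sketch of one GRU step following these equations (weights passed in a dict with assumed key names such as "Wxr"; shapes follow the row-vector convention):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X, H, p):
    """One GRU update: reset gate, update gate, candidate, new hidden state."""
    R = sigmoid(X @ p["Wxr"] + H @ p["Whr"] + p["br"])              # reset gate
    Z = sigmoid(X @ p["Wxz"] + H @ p["Whz"] + p["bz"])              # update gate
    H_tilde = np.tanh(X @ p["Wxh"] + (R * H) @ p["Whh"] + p["bh"])  # candidate state
    return Z * H + (1 - Z) * H_tilde                                # new hidden state
```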

SLIDE 35

Long Short Term Memory

SLIDE 36

GRU and LSTM

GRU:
Rt = σ(XtWxr + Ht−1Whr + br)
Zt = σ(XtWxz + Ht−1Whz + bz)
H̃t = tanh(XtWxh + (Rt ⊙ Ht−1)Whh + bh)
Ht = Zt ⊙ Ht−1 + (1 − Zt) ⊙ H̃t

LSTM:
It = σ(XtWxi + Ht−1Whi + bi)
Ft = σ(XtWxf + Ht−1Whf + bf)
Ot = σ(XtWxo + Ht−1Who + bo)
C̃t = tanh(XtWxc + Ht−1Whc + bc)
Ct = Ft ⊙ Ct−1 + It ⊙ C̃t
Ht = Ot ⊙ tanh(Ct)

SLIDE 37

Long Short Term Memory

  • Forget gate: reset the memory cell values
  • Input gate: decide whether we should ignore the input data
  • Output gate: decide whether the hidden state is used for the output generated by the LSTM
  • Hidden state and memory cell
SLIDE 38

Gates

It = σ(XtWxi + Ht−1Whi + bi)
Ft = σ(XtWxf + Ht−1Whf + bf)
Ot = σ(XtWxo + Ht−1Who + bo)

SLIDE 39

Candidate Memory Cell

C̃t = tanh(XtWxc + Ht−1Whc + bc)

SLIDE 40

Memory Cell

Ct = Ft ⊙ Ct−1 + It ⊙ C̃t

SLIDE 41

Hidden State / Output

Ht = Ot ⊙ tanh(Ct)

SLIDE 42

Hidden State / Output

It = σ(XtWxi + Ht−1Whi + bi)
Ft = σ(XtWxf + Ht−1Whf + bf)
Ot = σ(XtWxo + Ht−1Who + bo)
C̃t = tanh(XtWxc + Ht−1Whc + bc)
Ct = Ft ⊙ Ct−1 + It ⊙ C̃t
Ht = Ot ⊙ tanh(Ct)
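
A numpy sketch of one LSTM step following these equations (weights in a dict with assumed key names; the hidden state H and memory cell C are carried along):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X, H, C, p):
    """One LSTM update: gates, candidate memory, memory cell, hidden state."""
    I = sigmoid(X @ p["Wxi"] + H @ p["Whi"] + p["bi"])        # input gate
    F = sigmoid(X @ p["Wxf"] + H @ p["Whf"] + p["bf"])        # forget gate
    O = sigmoid(X @ p["Wxo"] + H @ p["Who"] + p["bo"])        # output gate
    C_tilde = np.tanh(X @ p["Wxc"] + H @ p["Whc"] + p["bc"])  # candidate memory
    C = F * C + I * C_tilde                                   # memory cell
    H = O * np.tanh(C)                                        # hidden state / output
    return H, C
```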

SLIDE 43

LSTM Notebook

SLIDE 44

Bidirectional RNNs

SLIDE 45

I am _____
I am _____ very hungry,
I am _____ very hungry, I could eat half a pig.

SLIDE 46

I am hungry.
I am not very hungry,
I am very very hungry, I could eat half a pig.

SLIDE 47

The Future Matters

I am happy.
I am not very hungry,
I am very very hungry, I could eat half a pig.

  • Very different words to fill in, depending on the past and future context of a word
  • RNNs so far only look at the past
  • In interpolation (fill in) we can use the future, too

SLIDE 48

Bidirectional RNN

  • One RNN runs forward
  • Another RNN runs backward
  • Combine both hidden states for output generation
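
A sketch of the idea, assuming `fwd_step` and `bwd_step` are single-step recurrent functions (hypothetical callables, e.g. closures over their own weights) that map (X, H) to a new H:

```python
import numpy as np

def bidirectional_rnn(X_seq, H0_fwd, H0_bwd, fwd_step, bwd_step):
    """Run one RNN forward and another backward over the sequence, then
    concatenate the two hidden states at each time step for output generation."""
    fwd, H = [], H0_fwd
    for X in X_seq:                      # left-to-right pass
        H = fwd_step(X, H)
        fwd.append(H)
    bwd, H = [], H0_bwd
    for X in reversed(X_seq):            # right-to-left pass
        H = bwd_step(X, H)
        bwd.append(H)
    bwd.reverse()                        # re-align with the forward direction
    return [np.concatenate([f, b], axis=-1) for f, b in zip(fwd, bwd)]
```
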
SLIDE 49

Using RNNs

(image courtesy of karpathy.github.io)

  • Poetry generation
  • Sentiment analysis
  • Document classification
  • Question answering
  • Machine translation
  • Named entity tagging

SLIDE 50

Recall - RNNs Architecture

How to make more nonlinear?

Ht = ϕ(WhhHt−1 + WhxXt−1 + bh)
Ot = WhoHt + bo

  • Hidden state update
  • Observation update
SLIDE 51

We go deeper

SLIDE 52

We go deeper

  • Shallow RNN
  • Input
  • Hidden layer: Ht = f(Ht−1, Xt)
  • Output: Ot = g(Ht)
  • Deep RNN
  • Input
  • Hidden layer 1: H^1_t = f_1(H^1_{t−1}, Xt)
  • Hidden layer j: H^j_t = f_j(H^j_{t−1}, H^{j−1}_t)
  • Output: Ot = g(H^L_t)
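
A sketch of one time step of a stacked (deep) RNN, assuming `layer_steps` is a list of per-layer step functions (hypothetical callables): layer j consumes the hidden state of layer j−1.

```python
def deep_rnn_step(X, H_prev, layer_steps):
    """One time step of an L-layer RNN; returns the new hidden state of each layer."""
    H_new, inp = [], X
    for H_j, step_j in zip(H_prev, layer_steps):   # bottom layer to top layer
        H_j = step_j(inp, H_j)                     # H_t^j = f_j(H_{t-1}^j, H_t^{j-1})
        H_new.append(H_j)
        inp = H_j                                  # feeds the next layer up
    return H_new                                   # output: O_t = g(H_new[-1])
```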

SLIDE 53

Summary

  • Dependent Random Variables
  • Text Preprocessing
  • Language Modeling
  • Recurrent Neural Networks (RNN)
  • LSTM
  • Bidirectional RNN
  • Deep RNN
SLIDE 54

More Contents in D2L.ai

  • Basic: NDArray, Autograd, Gluon
  • Math: Linear algebra, Probability, Calculus & Statistics, Gradients
  • Basic models: Linear regression, Softmax regression, Multilayer perceptron, Image classification
  • Machine learning: Loss function, Regularization, Model selection, Environment
  • CNN: Convolution, LeNet, AlexNet, VGG, Inception, ResNet
  • CV: Data augmentation, Fine-tuning, Object detection, Segmentation
  • RNNs: Recurrent networks (RNN, GRU, LSTM) for language modeling, Word embedding, Seq2seq for machine translation
  • Performance: Numerical stability, Multi-GPU training
  • Attention: Seq2seq with attention, Transformer, BERT
  • Optimization: Convex optimization, Momentum, RMSProp, Adam
  • GAN: Generative Adversarial Networks, DCGAN

(The original slide distinguishes the topics covered in this lecture from those that were not.)
SLIDE 55

Resources

  • Textbook: numpy.d2l.ai
  • Toolkit for computer vision: gluon-cv.mxnet.io
  • Toolkit for natural language processing: gluon-nlp.mxnet.io
  • Toolkit for time series: gluon-ts.mxnet.io
  • Toolkit for graph neural networks: dgl.ai
  • Discussion forum: https://discuss.mxnet.io/