

SLIDE 1

Introduction to RNNs

Arun Mallya

Best viewed with Computer Modern fonts installed

SLIDE 2

Outline

  • Why Recurrent Neural Networks (RNNs)?
  • The Vanilla RNN unit
  • The RNN forward pass
  • Backpropagation refresher
  • The RNN backward pass
  • Issues with the Vanilla RNN
  • The Long Short-Term Memory (LSTM) unit
  • The LSTM Forward & Backward pass
  • LSTM variants and tips
    – Peephole LSTM
    – GRU

SLIDE 3

Motivation

  • Not all problems can be converted into one with fixed-length inputs and outputs
  • Problems such as Speech Recognition or Time-series Prediction require a system to store and use context information
    – Simple case: Output YES if the number of 1s is even, else NO
      1000010101 – YES, 100011 – NO, …
  • Hard/Impossible to choose a fixed context window
    – There can always be a new sample longer than anything seen

SLIDE 4

Recurrent Neural Networks (RNNs)

  • Recurrent Neural Networks take the previous output or hidden states as inputs. The composite input at time t has some historical information about the happenings at times T < t
  • RNNs are useful as their intermediate values (state) can store information about past inputs for a time that is not fixed a priori

SLIDE 5

Sample Feed-forward Network

[Diagram: a feed-forward network at t = 1; input x1 goes to hidden state h1, which produces output y1]

SLIDE 6

Sample RNN

[Diagram: an RNN over t = 1, 2, 3; inputs x1, x2, x3 produce hidden states h1, h2, h3 and outputs y1, y2, y3, with each hidden state feeding the next]

SLIDE 7

Sample RNN

[Diagram: the same RNN, now with an initial hidden state h0 feeding into h1]

SLIDE 8

The Vanilla RNN Cell

[Diagram: the vanilla RNN cell; xt and ht-1 enter the cell through weights W, producing ht]

h_t = tanh( W [x_t ; h_{t−1}] )
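A minimal NumPy sketch of this cell (the dimensions and initialization are illustrative assumptions; the slide's equation has no bias term, so none is added here):

```python
import numpy as np

def rnn_cell_step(x_t, h_prev, W):
    """One vanilla RNN step: h_t = tanh(W [x_t; h_{t-1}])."""
    return np.tanh(W @ np.concatenate([x_t, h_prev]))

# Illustrative sizes: input dim 3, hidden dim 4, so W is (4, 3 + 4)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 7))
h0 = np.zeros(4)
h1 = rnn_cell_step(rng.normal(size=3), h0, W)
```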

SLIDE 9

The Vanilla RNN Forward

[Diagram: the RNN over three steps; (x1, h0) → h1 → y1 → C1, (x2, h1) → h2 → y2 → C2, (x3, h2) → h3 → y3 → C3]

h_t = tanh( W [x_t ; h_{t−1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

SLIDE 10

The Vanilla RNN Forward

[Diagram: the same unrolled RNN; shading indicates shared weights W across all time steps]

h_t = tanh( W [x_t ; h_{t−1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

SLIDE 11

Recurrent Neural Networks (RNNs)

  • Note that the weights are shared over time
  • Essentially, copies of the RNN cell are made over time (unrolling/unfolding), with different inputs at different time steps (see the sketch below)
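A hedged sketch of this unrolling in code, under the same assumptions as the cell sketch above: the same W is applied at every step; only the input and the carried hidden state change.

```python
import numpy as np

def rnn_forward(xs, h0, W):
    """Unroll the vanilla RNN over a sequence, sharing W across time steps."""
    h, hs = h0, []
    for x_t in xs:                                     # xs: inputs x_1 .. x_T
        h = np.tanh(W @ np.concatenate([x_t, h]))      # h_t = tanh(W [x_t; h_{t-1}])
        hs.append(h)
    return hs                                          # hidden states h_1 .. h_T
```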

SLIDE 12

Sentiment Classification

  • Classify a restaurant review from Yelp! OR movie review from IMDB OR … as positive or negative
  • Inputs: Multiple words, one or more sentences
  • Outputs: Positive / Negative classification
  • “The food was really good”
  • “The chicken crossed the road because it was uncooked”
SLIDE 13

Sentiment Classification

[Diagram: the first word, “The”, is fed into the RNN, producing h1]

SLIDE 14

Sentiment Classification

[Diagram: “The” → RNN → h1, then “food” → RNN → h2]

SLIDE 15

Sentiment Classification

[Diagram: the whole review is fed in word by word; “The” → h1, “food” → h2, …, “good” → hn]

SLIDE 16

Sentiment Classification

[Diagram: the final hidden state hn is fed into a Linear Classifier]

SLIDE 17

Sentiment Classification

[Diagram: only the final state hn is passed to the Linear Classifier; the intermediate states h1, h2, … are ignored]

SLIDE 18

Sentiment Classification

[Diagram: alternatively, all hidden states are combined: h = Sum(h1, h2, …, hn)]

http://deeplearning.net/tutorial/lstm.html

SLIDE 19

Sentiment Classification

[Diagram: h = Sum(h1, h2, …, hn) is fed into the Linear Classifier]

http://deeplearning.net/tutorial/lstm.html
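A minimal sketch of both read-outs (last hidden state vs. Sum of all states), reusing rnn_forward from the earlier sketch; the classifier weights W_clf and the two-class set-up are illustrative assumptions:

```python
import numpy as np

def classify_review(xs, h0, W, W_clf, use_sum=True):
    """Score a review as positive/negative from the RNN's hidden states."""
    hs = rnn_forward(xs, h0, W)                        # h_1 .. h_n for the word sequence
    feat = np.sum(hs, axis=0) if use_sum else hs[-1]   # h = Sum(h_1 .. h_n), or just h_n
    logits = W_clf @ feat                              # linear classifier over 2 classes
    return "positive" if logits.argmax() == 1 else "negative"
```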

SLIDE 20

Image Captioning

  • Given an image, produce a sentence describing its contents
  • Inputs: Image feature (from a CNN)
  • Outputs: Multiple words (let’s consider one sentence)

Example output: “The dog is hiding”

SLIDE 21

Image Captioning

[Diagram: the image is passed through a CNN; the CNN feature is fed to the RNN]

SLIDE 22

Image Captioning

[Diagram: the CNN feature is fed to the RNN; the hidden state goes through a Linear Classifier, which outputs the first word, “The”]

SLIDE 23

Image Captioning

[Diagram: the RNN is unrolled further; at each step the hidden state goes through a Linear Classifier, producing the next word (“The”, “dog”, …)]
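A hedged sketch of this decoding loop, following the note on a later slide that the previous word can be fed back as input: greedy word-by-word generation, where the embedding table, vocabulary indices, and end token are illustrative assumptions, and the CNN feature is assumed to already have the word-embedding size so the same W applies at every step.

```python
import numpy as np

def caption_image(cnn_feat, h0, W, W_clf, embed, end_id=0, max_len=20):
    """Greedy captioning: start from the CNN feature, then feed back predicted words."""
    h, inp, words = h0, cnn_feat, []
    for _ in range(max_len):
        h = np.tanh(W @ np.concatenate([inp, h]))   # vanilla RNN step
        word_id = int((W_clf @ h).argmax())         # linear classifier over the vocabulary
        if word_id == end_id:                       # stop at the (assumed) end token
            break
        words.append(word_id)
        inp = embed[word_id]                        # previous word becomes the next input
    return words
```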

SLIDE 24

RNN Outputs: Image Captions

Show and Tell: A Neural Image Caption Generator, CVPR 15

SLIDE 25

RNN Outputs: Language Modeling

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine. KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

SLIDE 26

Input – Output Scenarios

  • Single – Single: Feed-forward Network
  • Single – Multiple: Image Captioning
  • Multiple – Single: Sentiment Classification
  • Multiple – Multiple: Translation, Image Captioning

SLIDE 27

Input – Output Scenarios

Note: We might deliberately choose to frame our problem as a particular input-output scenario for ease of training or better performance. For example, for image captioning, provide the previous word as input at each time step (Single-Multiple to Multiple-Multiple).

SLIDE 28

The Vanilla RNN Forward

[Diagram: the RNN over three steps; (x1, h0) → h1 → y1 → C1, (x2, h1) → h2 → y2 → C2, (x3, h2) → h3 → y3 → C3]

h_t = tanh( W [x_t ; h_{t−1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

“Unfold” network through time by making copies at each time-step

SLIDE 29

BackPropagation Refresher

[Diagram: x → f(x; W) → y → Loss → C]

y = f(x; W)
C = Loss(y, y_GT)

SGD Update:  W ← W − η ∂C/∂W
∂C/∂W = (∂C/∂y)(∂y/∂W)
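As a concrete worked instance of this update, here is a hedged sketch with f taken to be a linear layer and the loss taken to be squared error (both assumptions; the slide keeps f and Loss abstract):

```python
import numpy as np

def sgd_step(W, x, y_gt, eta=0.1):
    """One SGD update W <- W - eta * dC/dW for y = W x, C = 0.5 ||y - y_gt||^2."""
    y = W @ x                        # forward: y = f(x; W)
    dC_dy = y - y_gt                 # dC/dy for the squared-error loss
    dC_dW = np.outer(dC_dy, x)       # chain rule: (dC/dy)(dy/dW)
    return W - eta * dC_dW           # SGD update
```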

SLIDE 30

Multiple Layers

[Diagram: x → f1(x; W1) → y1 → f2(y1; W2) → y2 → Loss → C]

y1 = f1(x; W1)
y2 = f2(y1; W2)
C = Loss(y2, y_GT)

SGD Update:
W2 ← W2 − η ∂C/∂W2
W1 ← W1 − η ∂C/∂W1

SLIDE 31

Chain Rule for Gradient Computation

[Diagram: x → f1(x; W1) → y1 → f2(y1; W2) → y2 → Loss → C]

y1 = f1(x; W1)
y2 = f2(y1; W2)
C = Loss(y2, y_GT)

Find ∂C/∂W1 and ∂C/∂W2.

Application of the Chain Rule:
∂C/∂W2 = (∂C/∂y2)(∂y2/∂W2)
∂C/∂W1 = (∂C/∂y1)(∂y1/∂W1) = (∂C/∂y2)(∂y2/∂y1)(∂y1/∂W1)
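A hedged worked example of the two products above, with both layers taken to be linear and the loss squared error (assumed forms; the slide leaves f1, f2, and Loss abstract):

```python
import numpy as np

def two_layer_grads(x, y_gt, W1, W2):
    """Gradients dC/dW1 and dC/dW2 via the chain rule."""
    y1 = W1 @ x                      # y1 = f1(x; W1)
    y2 = W2 @ y1                     # y2 = f2(y1; W2)
    dC_dy2 = y2 - y_gt               # from C = 0.5 ||y2 - y_gt||^2
    dC_dW2 = np.outer(dC_dy2, y1)    # (dC/dy2)(dy2/dW2)
    dC_dy1 = W2.T @ dC_dy2           # (dC/dy2)(dy2/dy1)
    dC_dW1 = np.outer(dC_dy1, x)     # (dC/dy1)(dy1/dW1)
    return dC_dW1, dC_dW2
```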

SLIDE 32

Chain Rule for Gradient Computation

[Diagram: a single layer x → f(x; W) → y]

We are interested in computing: ∂C/∂W and ∂C/∂x

Intrinsic to the layer are:
∂y/∂W – How does the output change due to the params
∂y/∂x – How does the output change due to the inputs

Given: ∂C/∂y

∂C/∂W = (∂C/∂y)(∂y/∂W)
∂C/∂x = (∂C/∂y)(∂y/∂x)

SLIDE 33

Chain Rule for Gradient Computation

[Diagram: the layer f(x; W) receives ∂C/∂y from above and passes ∂C/∂x to the layer below]

We are interested in computing: ∂C/∂W and ∂C/∂x

Intrinsic to the layer are:
∂y/∂W – How does the output change due to the params
∂y/∂x – How does the output change due to the inputs

Given: ∂C/∂y

∂C/∂W = (∂C/∂y)(∂y/∂W)
∂C/∂x = (∂C/∂y)(∂y/∂x)

Equations for common layers: http://arunmallya.github.io/writeups/nn/backprop.html

SLIDE 34

Extension to Computational Graphs

[Diagram: left, a simple chain x → f(x; W) → y; right, the output y of f(x; W) branches into two layers, f1(y; W1) → y1 and f2(y; W2) → y2]

SLIDE 35

Extension to Computational Graphs

[Diagram: in the branched graph, f1 sends back ∂C1/∂y and f2 sends back ∂C2/∂y; the layer f(x; W) receives their sum (Σ) and passes ∂C/∂x downstream]

SLIDE 36

Extension to Computational Graphs

[Diagram: the same branched graph; the gradients ∂C1/∂y and ∂C2/∂y arriving from the two branches are summed (Σ) before being propagated through f(x; W)]

Gradient Accumulation
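A hedged sketch of this accumulation, again with linear layers and squared-error losses as assumed forms: the gradients arriving from the two branches are added before back-propagating through the shared layer.

```python
import numpy as np

def branched_grads(x, W, W1, W2, gt1, gt2):
    """Gradient of C = C1 + C2 w.r.t. the shared layer's weights W."""
    y = W @ x                         # shared layer f(x; W)
    y1, y2 = W1 @ y, W2 @ y           # branches f1(y; W1) and f2(y; W2)
    dC1_dy = W1.T @ (y1 - gt1)        # gradient from branch 1 w.r.t. y
    dC2_dy = W2.T @ (y2 - gt2)        # gradient from branch 2 w.r.t. y
    dC_dy = dC1_dy + dC2_dy           # gradient accumulation (the Σ node)
    return np.outer(dC_dy, x)         # dC/dW through the shared layer
```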

SLIDE 37

BackPropagation Through Time (BPTT)

  • One of the methods used to train RNNs
  • The unfolded network (used during the forward pass) is treated as one big feed-forward network
  • This unfolded network accepts the whole time series as input
  • The weight updates are computed for each copy in the unfolded network, then summed (or averaged), and then applied to the RNN weights (see the sketch below)
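A hedged sketch of that summation for the vanilla RNN above; for brevity it assumes a squared-error loss applied directly to each hidden state h_t and no bias term, neither of which is prescribed by the slides.

```python
import numpy as np

def bptt_grads(xs, gts, h0, W):
    """BPTT: sum each unrolled copy's gradient into one gradient for the shared W."""
    hs, zs, h = [], [], h0
    for x_t in xs:                                # forward pass, caching what backward needs
        z = np.concatenate([x_t, h])
        h = np.tanh(W @ z)
        zs.append(z)
        hs.append(h)
    dW, dh_next, n_h = np.zeros_like(W), np.zeros_like(h0), h0.shape[0]
    for t in reversed(range(len(xs))):            # backward pass through the copies
        dh = (hs[t] - gts[t]) + dh_next           # local loss gradient + gradient from t+1
        dz = (1.0 - hs[t] ** 2) * dh              # back through tanh
        dW += np.outer(dz, zs[t])                 # accumulate the shared-weight gradient
        dh_next = (W.T @ dz)[-n_h:]               # part of dz that flows to h_{t-1}
    return dW
```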

SLIDE 38

The Unfolded Vanilla RNN

[Diagram: the unrolled RNN; (x1, h0) → h1 → y1 → C1, (x2, h1) → h2 → y2 → C2, (x3, h2) → h3 → y3 → C3]

  • Treat the unfolded network as one big feed-forward network!
  • This big network takes in the entire sequence as an input
  • Compute gradients through the usual backpropagation
  • Update shared weights
SLIDE 39

The Unfolded Vanilla RNN Forward

[Diagram: the forward pass through the unrolled network, from x1, x2, x3 and h0 to the losses C1, C2, C3]

SLIDE 40

The Unfolded Vanilla RNN Backward

[Diagram: the backward pass through the unrolled network; gradients flow from the losses C1, C2, C3 back through h3, h2, h1 to the shared weights]

SLIDE 41

The Vanilla RNN Backward

[Diagram: the unrolled RNN with gradients flowing from C_t back toward h_1]

h_t = tanh( W [x_t ; h_{t−1}] )
y_t = F(h_t)
C_t = Loss(y_t, GT_t)

∂C_t/∂h_1 = (∂C_t/∂y_t)(∂y_t/∂h_1)
          = (∂C_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t−1}) ⋯ (∂h_2/∂h_1)

SLIDE 42

Issues with the Vanilla RNN

  • In the same way a product of k real numbers can shrink to zero or explode to infinity, so can a product of matrices
  • It is sufficient for λ1 < 1/γ, where λ1 is the largest singular value of W, for the vanishing gradients problem to occur, and it is necessary for exploding gradients that λ1 > 1/γ, where γ = 1 for the tanh non-linearity and γ = 1/4 for the sigmoid non-linearity¹
  • Exploding gradients are often controlled with gradient element-wise or norm clipping

¹ On the difficulty of training recurrent neural networks, Pascanu et al., 2013
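A minimal sketch of gradient norm clipping (the threshold is an illustrative choice):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale gradient arrays if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```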

SLIDE 43

The Identity Relationship

  • Recall:

    h_t = tanh( W [x_t ; h_{t−1}] ),  y_t = F(h_t),  C_t = Loss(y_t, GT_t)

    ∂C_t/∂h_1 = (∂C_t/∂y_t)(∂y_t/∂h_1)
              = (∂C_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t−1}) ⋯ (∂h_2/∂h_1)

  • Suppose that instead of a matrix multiplication, we had an identity relationship between the hidden states:

    h_t = h_{t−1} + F(x_t)   ⇒   ∂h_t/∂h_{t−1} = 1

  • The gradient does not decay as the error is propagated all the way back, aka “Constant Error Flow”

SLIDE 44

The Identity Relationship

  • Recall:

    ∂C_t/∂h_1 = (∂C_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t−1}) ⋯ (∂h_2/∂h_1)

  • Suppose that instead of a matrix multiplication, we had an identity relationship between the hidden states:

    h_t = h_{t−1} + F(x_t)   ⇒   ∂h_t/∂h_{t−1} = 1

  • The gradient does not decay as the error is propagated all the way back, aka “Constant Error Flow”

Remember ResNets?

SLIDE 45

Disclaimer

  • The explanations in the previous few slides are handwavy
  • For rigorous proofs and derivations, please refer to:
    – On the difficulty of training recurrent neural networks, Pascanu et al., 2013
    – Long Short-Term Memory, Hochreiter et al., 1997
    – and other sources

SLIDE 46

Long Short-Term Memory (LSTM)¹

  • The LSTM uses this idea of “Constant Error Flow” for RNNs to create a “Constant Error Carousel” (CEC), which ensures that gradients don’t decay
  • The key component is a memory cell that acts like an accumulator (contains the identity relationship) over time
  • Instead of computing the new state as a matrix product with the old state, it rather computes the difference between them. Expressivity is the same, but gradients are better behaved

¹ Long Short-Term Memory, Hochreiter et al., 1997
SLIDE 47

The LSTM Idea

[Diagram: the LSTM idea; xt and ht-1 enter through W, the result is added into the cell state ct, and ht is read out from the cell]

c_t = c_{t−1} + tanh( W [x_t ; h_{t−1}] )
h_t = tanh( c_t )

* Dashed line indicates time-lag

SLIDE 48

The Original LSTM Cell

[Diagram: the cell from the previous slide, now with an Input Gate it and an Output Gate ot, each computed from xt and ht-1 with weights Wi and Wo]

c_t = c_{t−1} + i_t ⊗ tanh( W [x_t ; h_{t−1}] )
h_t = o_t ⊗ tanh( c_t )

i_t = σ( W_i [x_t ; h_{t−1}] + b_i )
Similarly for o_t

SLIDE 49

The Popular LSTM Cell

[Diagram: the cell from the previous slide with an added Forget Gate ft on the recurrent cell-state connection; the gates are computed from xt and ht-1 with weights Wi, Wo, Wf]

c_t = f_t ⊗ c_{t−1} + i_t ⊗ tanh( W [x_t ; h_{t−1}] )

f_t = σ( W_f [x_t ; h_{t−1}] + b_f )
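A minimal NumPy sketch of one step of this popular LSTM cell; the h_t read-out through the output gate follows the previous slide, and the separate weight matrices, biases, and the sigmoid helper are written out explicitly as illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, Wi, Wf, Wo, bi, bf, bo):
    """One LSTM step with input, forget, and output gates."""
    z = np.concatenate([x_t, h_prev])          # [x_t; h_{t-1}]
    i_t = sigmoid(Wi @ z + bi)                 # input gate
    f_t = sigmoid(Wf @ z + bf)                 # forget gate
    o_t = sigmoid(Wo @ z + bo)                 # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W @ z)  # cell state: gated accumulator
    h_t = o_t * np.tanh(c_t)                   # hidden state read-out
    return h_t, c_t
```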

SLIDE 50

LSTM – Forward/Backward

Go To: Illustrated LSTM Forward and Backward Pass

SLIDE 51

Summary

  • RNNs allow for processing of variable-length inputs and outputs by maintaining state information across time steps
  • Various Input-Output scenarios are possible (Single/Multiple)
  • Vanilla RNNs are improved upon by LSTMs, which address the vanishing gradient problem through the CEC
  • Exploding gradients are handled by gradient clipping
  • More complex architectures are listed in the course materials for you to read, understand, and present

SLIDE 52

Other Useful Resources / References

  • http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
  • http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf
  • R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, ICML 2013
  • S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997, 9(8), pp. 1735–1780
  • F.A. Gers and J. Schmidhuber, Recurrent nets that time and count, IJCNN 2000
  • K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems, 2016
  • K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, ACL 2014
  • R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent network architectures, JMLR 2015