SLIDE 1
Deep Dive on RNNs
Charles Martin
SLIDE 2
What is an Artificial Neurone?
Source: Wikimedia Commons
SLIDE 3
Feed-Forward Network
For each unit: y = tanh(Wx + b)
SLIDE 4
Recurrent Network
For each unit: yt = tanh(Uxt + Vht−1 + b)
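A minimal NumPy sketch of both update rules; the sizes and random weight values are illustrative assumptions, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden = 4, 8                    # illustrative sizes

    # Feed-forward unit: y = tanh(Wx + b)
    W = rng.normal(size=(n_hidden, n_in))
    b = np.zeros(n_hidden)
    x = rng.normal(size=n_in)
    y = np.tanh(W @ x + b)

    # Recurrent unit: yt = tanh(Uxt + Vht−1 + b)
    U = rng.normal(size=(n_hidden, n_in))
    V = rng.normal(size=(n_hidden, n_hidden))
    h_prev = np.zeros(n_hidden)              # previous hidden state h_{t-1}
    x_t = rng.normal(size=n_in)
    y_t = np.tanh(U @ x_t + V @ h_prev + b)  # this output is also the new hidden state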
SLIDE 5
Sequence Learning Tasks
SLIDE 6
Recurrent Network
simplifying...
SLIDE 7
Recurrent Network
simplifying and rotating...
SLIDE 8
“State” in Recurrent Networks
◮ Recurrent Networks are all about storing a “state” in between computations.
◮ A “lossy summary of... past sequences”
◮ h is the “hidden state” of our RNN.
◮ What influences h?
SLIDE 9
Defining the RNN State
We can define a simplified RNN represented by this diagram as follows:
ht = tanh(Uxt + Vht−1 + b)
ŷt = softmax(c + Wht)
SLIDE 10
Unfolding an RNN in Time
Figure 1: Unfolding an RNN in Time
◮ By unfolding the RNN we can compute ŷ for a given length of sequence.
◮ Note that the weight matrices U, V, W are the same for each timestep; this is the big advantage of RNNs!
SLIDE 11
Forward Propagation
We can now use the following equations to compute ŷ3, by computing h for the previous steps:
ht = tanh(Uxt + Vht−1 + b)
ŷt = softmax(c + Wht)
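A sketch of this unrolled forward pass over three timesteps; the layer sizes and random weights are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_hidden, n_out = 4, 8, 5

    U = rng.normal(size=(n_hidden, n_in))
    V = rng.normal(size=(n_hidden, n_hidden))
    W = rng.normal(size=(n_out, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_out)

    def softmax(z):
        e = np.exp(z - z.max())              # subtract max for numerical stability
        return e / e.sum()

    xs = [rng.normal(size=n_in) for _ in range(3)]   # x1, x2, x3
    h = np.zeros(n_hidden)                           # h0
    for x_t in xs:
        h = np.tanh(U @ x_t + V @ h + b)             # ht = tanh(Uxt + Vht−1 + b)
    y_hat_3 = softmax(c + W @ h)                     # ŷ3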
SLIDE 12
Y-hat is Softmax’d
ŷ is a probability distribution! A finite number of values that sum to 1:
σ(z)j = e^(zj) / Σ_{k=1}^{K} e^(zk), for j = 1, ..., K
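A quick numerical check of this property (the scores z are arbitrary example values):

    import numpy as np

    z = np.array([2.0, 1.0, 0.1])         # arbitrary scores over K = 3 classes
    p = np.exp(z) / np.exp(z).sum()       # σ(z)j = e^(zj) / Σk e^(zk)
    print(p, p.sum())                     # a probability distribution that sums to 1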
SLIDE 13
Calculating Loss: Categorical Cross Entropy
We use the categorical cross-entropy function for loss:
ht = tanh(Uxt + Vht−1 + b)
ŷt = softmax(c + Wht)
Lt = −yt · log(ŷt)
Loss = Σt Lt
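A sketch of computing this loss for a few timesteps, assuming one-hot targets yt and predicted distributions ŷt; the toy values below are assumptions:

    import numpy as np

    def cross_entropy(y_true, y_pred, eps=1e-12):
        # Lt = −yt · log(ŷt), with y_true one-hot
        return -np.sum(y_true * np.log(y_pred + eps))

    # toy example: 3 timesteps, 5 output classes
    y_true = [np.eye(5)[i] for i in (2, 0, 4)]        # one-hot targets y1..y3
    y_pred = [np.full(5, 0.2)] * 3                    # uniform predicted distributions
    loss = sum(cross_entropy(t, p) for t, p in zip(y_true, y_pred))   # Loss = Σt Lt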
SLIDE 14
Backpropagation Through Time (BPTT)
Propagates error correction backwards through the network graph, adjusting all parameters (U, V, W) to minimise loss.
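In practice a framework such as Keras performs BPTT automatically when a recurrent model is fit; a minimal sketch, where the layer sizes, data shapes and optimizer are illustrative assumptions:

    import numpy as np
    import tensorflow as tf

    # toy data: 100 sequences of length 30 with 4 features, 5 output classes
    X = np.random.rand(100, 30, 4).astype("float32")
    y = tf.keras.utils.to_categorical(np.random.randint(5, size=100), num_classes=5)

    model = tf.keras.Sequential([
        tf.keras.layers.SimpleRNN(8, input_shape=(30, 4)),  # ht = tanh(Uxt + Vht−1 + b)
        tf.keras.layers.Dense(5, activation="softmax"),     # ŷ = softmax(c + Wh)
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    model.fit(X, y, epochs=1)   # gradients are propagated back through the 30 timesteps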
SLIDE 15
Example: Character-level text model
◮ Training data: a collection of text.
◮ Input (X): snippets of 30 characters from the collection.
◮ Target output (y): 1 character, the next one after the 30 in each X (see the sketch below).
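A sketch of building such a dataset from a text file; the filename and variable names are assumptions:

    text = open("corpus.txt").read()              # any text collection
    seq_len = 30

    X_snippets, y_chars = [], []
    for i in range(len(text) - seq_len):
        X_snippets.append(text[i:i + seq_len])    # 30-character input snippet
        y_chars.append(text[i + seq_len])         # the single next character

    # e.g. X_snippets[0] holds the first 30 characters and y_chars[0] is character 31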
SLIDE 16
Training the Character-level Model
◮ Target: A probability distribution with P(n) = 1
◮ Output: A probability distribution over all next letters.
◮ E.g.: “My cat is named Simon” would lead to X: “My cat is named Simo” and y: “n”
SLIDE 17
Using the trained model to generate text
◮ S: Sampling function, sample a letter using the output probability distribution.
◮ The generated letter is reinserted as the next input.
◮ We don’t want to always draw the most likely character; this would give frequent repetition and “copying” from the training text. We need a sampling strategy (one option is sketched below).
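One common strategy is temperature sampling: re-weight the predicted distribution before drawing, so the model neither always picks the most likely character nor samples too wildly. A minimal sketch; the temperature value is a tunable assumption:

    import numpy as np

    def sample_index(probs, temperature=1.0):
        # lower temperature -> more conservative, higher -> more surprising
        logp = np.log(np.asarray(probs) + 1e-12) / temperature
        p = np.exp(logp)
        p /= p.sum()
        return np.random.choice(len(p), p=p)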
SLIDE 18
Char-RNN
◮ RNN as a sequence generator.
◮ Input is the current symbol, output is the next predicted symbol.
◮ Connect output to input and continue (see the generation loop below)!
◮ CharRNN simply applies this to a subset of ASCII characters.
◮ Train and generate on any text corpus: Fun!
See: Karpathy, A. (2015). The Unreasonable Effectiveness of Recurrent Neural Networks.
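A sketch of the generate-and-feed-back loop. It assumes a trained model, an index_to_char mapping, a hypothetical encode() helper that turns the last 30 characters into the model’s input shape, and the sample_index() function from the sampling slide:

    seed = "My cat is named Simon and "
    generated = seed
    for _ in range(200):                         # generate 200 new characters
        x = encode(generated[-30:])              # hypothetical helper: one-hot encode the window
        probs = model.predict(x)[0]              # distribution over the next character
        next_char = index_to_char[sample_index(probs, temperature=0.8)]
        generated += next_char                   # reinsert the generated letter as the next input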
SLIDE 19
Char-RNN Examples
Shakespeare (Karpathy, 2015):
Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states. DUKE VINCENTIO: Well, your wit is in the care of side and that.
LaTeX Algebraic Geometry: N.B. “Proof. Omitted.” Lol.
SLIDE 20
RNN Architectures and LSTM
SLIDE 21
Bidirectional RNNs
◮ Useful for tasks where the whole sequence is available.
◮ Each output unit (ŷ) depends on both past and future, but is most sensitive to closer times.
◮ Popular in speech recognition, translation, etc.
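In Keras, a bidirectional layer is a wrapper around any recurrent layer; a minimal sketch with assumed sizes:

    import tensorflow as tf

    model = tf.keras.Sequential([
        # reads the sequence forwards and backwards and concatenates both hidden states
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64), input_shape=(30, 4)),
        tf.keras.layers.Dense(5, activation="softmax"),
    ])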
SLIDE 22
Encoder-Decoder (seq-to-seq)
◮ Learns to generate an output sequence (y) from an input sequence (x).
◮ The final hidden state of the encoder is used to compute a context variable C.
◮ For example, translation (see the sketch below).
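A sketch of the encoder-decoder idea with Keras LSTMs: the encoder’s final hidden state becomes the context that initialises the decoder. Vocabulary sizes and dimensions are assumptions:

    import tensorflow as tf
    from tensorflow.keras import layers

    enc_in = tf.keras.Input(shape=(None, 50))                   # input sequence (x)
    _, state_h, state_c = layers.LSTM(64, return_state=True)(enc_in)
    context = [state_h, state_c]                                # context C from the final encoder state

    dec_in = tf.keras.Input(shape=(None, 60))                   # output sequence (y), shifted by one step
    dec_seq = layers.LSTM(64, return_sequences=True)(dec_in, initial_state=context)
    y_hat = layers.Dense(60, activation="softmax")(dec_seq)

    model = tf.keras.Model([enc_in, dec_in], y_hat)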
SLIDE 23
Deep RNNs
◮ Does adding deeper layers to an RNN make it work better?
◮ Several options for architecture.
◮ Simply stacking RNN layers is very popular; shown to work better by Graves et al. (2013).
◮ Intuitively: layers might learn some hierarchical knowledge automatically.
◮ Typical setup: up to three recurrent layers (see the sketch below).
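Stacking is a small change in Keras: every recurrent layer except the last returns its full sequence of hidden states so the layer above has a sequence to consume. A sketch with the typical three recurrent layers (sizes assumed):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(30, 4)),
        tf.keras.layers.LSTM(64, return_sequences=True),   # middle layer also emits a sequence
        tf.keras.layers.LSTM(64),                          # top layer returns only its final state
        tf.keras.layers.Dense(5, activation="softmax"),
    ])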
SLIDE 24
Long-Term Dependencies
◮ Learning long dependencies is a mathematical challenge.
◮ Basically: gradients propagated through the same weights tend to vanish (mostly) or explode (rarely).
◮ E.g., consider a simplified RNN with no nonlinear activation function or input.
◮ Each time step multiplies h(0) by W.
◮ This corresponds to raising the eigenvalues in Λ to the power t.
◮ Eventually, components of h(0) not aligned with the largest eigenvector will be discarded.
ht = Wht−1
ht = (W^t)h0
Supposing W admits an eigendecomposition with orthogonal matrix Q:
W = QΛQ⊤
ht = QΛ^tQ⊤h0
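A small NumPy demonstration of this effect, using a diagonal W so the eigenvalues are explicit; the values 0.9 and 1.1 are chosen purely for illustration:

    import numpy as np

    W = np.diag([0.9, 1.1])       # one eigenvalue below 1, one above
    h = np.array([1.0, 1.0])      # h0

    for t in range(100):
        h = W @ h                 # ht = W ht−1, i.e. ht = (W^t) h0
    print(h)                      # ≈ [2.7e-05, 1.4e+04]: one component vanishes, the other explodes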
SLIDE 25
Vanishing and Exploding Gradients
◮ “in order to store memories in a way
that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish”
◮ “whenever the model is able to
represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.”
◮ Note that this problem is only
relevant for recurrent networks since the weights W affecting the hidden state are the same at each time step.
◮ Goodfellow and Bengio (2016): “the problem of learning long-term dependencies remains one of the main challenges in deep learning”
◮ WildML (2015). Backpropagation
Through Time and Vanishing Gradients
◮ ML for artists
SLIDE 26
Gated RNNs
◮ Possible solution!
◮ Provide a gate that can change the hidden state a little bit at each step.
◮ The gates are controlled by
learnable weights as well!
◮ Hidden state weights that may
change at each time step.
◮ Create paths through time with
derivatives that do not vanish/explode.
◮ Gates choose information to
accumulate or forget at each time step.
◮ Most effective sequence models
used in practice!
SLIDE 27
Long Short-Term Memory
◮ Self-loop containing an internal state (c).
◮ Three extra gating units:
◮ Forget gate: controls how much memory is preserved.
◮ Input gate: controls how much of the current input is stored.
◮ Output gate: controls how much of the state is shown to the output.
◮ Each gate has its own weights and biases, so this uses lots more parameters.
◮ Some variants on this design, e.g., use c as an additional input to the three gate units.
SLIDE 28
Long Short-Term Memory
◮ Forget gate: f
◮ Internal state: s
◮ Input gate: g
◮ Output gate: q
◮ Output: h
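A NumPy sketch of a single LSTM step using this naming (gates f, g, q, internal state s, output h); the weight shapes are illustrative assumptions, and variants that feed the internal state into the gates are omitted:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, s_prev, params):
        # each gate has its own weights and biases, hence the extra parameters
        Uf, Vf, bf, Ug, Vg, bg, Uq, Vq, bq, Us, Vs, bs = params
        f = sigmoid(Uf @ x_t + Vf @ h_prev + bf)        # forget gate: how much memory is preserved
        g = sigmoid(Ug @ x_t + Vg @ h_prev + bg)        # input gate: how much of the input is stored
        q = sigmoid(Uq @ x_t + Vq @ h_prev + bq)        # output gate: how much state is shown
        s_cand = np.tanh(Us @ x_t + Vs @ h_prev + bs)   # candidate update to the internal state
        s = f * s_prev + g * s_cand                     # internal state (the self-loop)
        h = q * np.tanh(s)                              # output
        return h, s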
SLIDE 29
Other Gating Units
Source: (Olah, C. 2015.)
◮ Are three gates necessary?
◮ Other gating units are simpler, e.g., the Gated Recurrent Unit (GRU); see the sketch below.
◮ For the moment, LSTMs are winning
in practical use.
◮ Maybe someone wants to explore
alternatives in a project?
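In Keras, trying a GRU instead of an LSTM is a drop-in change, which makes this kind of comparison easy to run:

    import tensorflow as tf

    lstm_layer = tf.keras.layers.LSTM(64)   # three gates plus a separate internal cell state
    gru_layer = tf.keras.layers.GRU(64)     # fewer gates, no separate cell state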
SLIDE 30
Visualising LSTM activations
Sometimes, the LSTM cell state corresponds with features of the sequential data: Source: (Karpathy, 2015)
SLIDE 31
CharRNN Applications: FolkRNN
Some kinds of music can be represented in a text-like manner. Source: Sturm et al. 2015. Folk Music Style Modelling by Recurrent Neural Networks with Long Short Term Memory Units
SLIDE 32
Other CharRNN Applications
Teaching Recurrent Neural Networks about Monet
SLIDE 33
Google Magenta Performance RNN
◮ State-of-the-art in music-generating RNNs.
◮ Encodes MIDI musical sequences as categorical data.
◮ Now supports polyphony (multiple notes), dynamics (volume), and expressive timing.
SLIDE 34
Neural iPad Band, another CharRNN
◮ iPad music transcribed as a sequence of numbers for each performer.
◮ Trick: encode multiple ints as one (preserving ordering); see the sketch below.
◮ Video
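One way to implement such a trick, assuming each performer’s value is an integer in a known range 0..N−1, is a mixed-radix (positional) encoding; this is an illustrative reconstruction, not necessarily the exact scheme used:

    N = 128   # assumed range of each performer's value

    def encode_ints(values):
        # pack several small ints into one, preserving their order
        code = 0
        for v in values:
            code = code * N + v
        return code

    def decode_ints(code, n_values):
        values = []
        for _ in range(n_values):
            code, v = divmod(code, N)
            values.append(v)
        return values[::-1]

    assert decode_ints(encode_ints([3, 60, 127]), 3) == [3, 60, 127]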
SLIDE 35
Books and Learning References
◮ Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
◮ François Chollet. 2018. Deep Learning with Python. Manning.
◮ Chris Olah. 2015. Understanding LSTMs
◮ RNNs in Tensorflow
◮ Maybe RNN/LSTM is dead? CNNs can work similarly to BLSTMs
◮ Karpathy. 2015. The Unreasonable Effectiveness of RNNs
SLIDE 36