6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology — Deep Learning in the Life Sciences
Lecture 4: Recurrent Neural Networks + Generalization
Prof. Manolis Kellis
Slides credit: Geoffrey Hinton, Ian Goodfellow, David Gifford, 6.S191 (Ava Soleimany, Alex Amini)


  1. 6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 4: Recurrent Neural Networks + Generalization Prof. Manolis Kellis Slides credit: Geoffrey Hinton, Ian Goodfellow, http://mit6874.github.io David Gifford, 6.S191 (Ava Soleimany, Alex Amini)

  2. Recurrent Neural Networks (RNNs) + Generalization
  1. How do you read/listen/understand/write? Can machines do that?
     – Context matters: characters, words, letters, sounds, completion, multi-modal
     – Predicting the next word/image: from unsupervised learning to supervised learning
  2. Encoding temporal context: Hidden Markov Models (HMMs), RNNs
     – Primitives: hidden state, memory of previous experiences, limitations of HMMs
     – RNN architectures, unrolling, back-propagation through time (BPTT), parameter reuse
  3. Vanishing gradients, Long Short-Term Memory (LSTM), initialization
     – Key idea: gated input/output/memory nodes; the model chooses to forget/remember
     – Example: online character recognition with an LSTM recurrent neural network
  4. Improving generalization
     – More training data
     – Tuning model capacity: architecture (# layers, # units), early stopping (validation set), weight decay (L1/L2 regularization), noise as a regularizer
     – Bayesian prior on parameter distribution
     – Why weight decay → Bayesian prior; variance of residual errors

  3. 1a. What do you hear and why?

  4. Context matters: top-down processing
     – Phonemic restoration
     – Hearing lips and seeing voices (McGurk & MacDonald, Nature 1976): https://youtu.be/PWGeUztTkRA?t=35
     – Split class into 4 groups: (1) close your eyes, (2) look left, (3) middle, (4) right
     – Audiovisual delay: adults show maximum disruption at 200 ms; children at 500 ms
     – Delayed typing: Google Docs, Zoom video screen sharing, a slow computer
     – https://www.sciencedaily.com/releases/2018/11/181129142352.htm

  5. Recurrent Neural Networks (RNNs) + Generalization
  1. How do you read/listen/understand/write? Can machines do that?
     – Context matters: characters, words, letters, sounds, completion, multi-modal
     – Predicting the next word/image: from unsupervised learning to supervised learning
  2. Encoding temporal context: Hidden Markov Models (HMMs), RNNs
     – Primitives: hidden state, memory of previous experiences, limitations of HMMs
     – RNN architectures, unrolling, back-propagation through time (BPTT), parameter reuse
  3. Vanishing gradients, Long Short-Term Memory (LSTM), initialization
     – Key idea: gated input/output/memory nodes; the model chooses to forget/remember
     – Example: online character recognition with an LSTM recurrent neural network
  4. Improving generalization
     – More training data
     – Tuning model capacity: architecture (# layers, # units), early stopping (validation set), weight decay (L1/L2 regularization), noise as a regularizer
     – Bayesian prior on parameter distribution
     – Why weight decay → Bayesian prior; variance of residual errors

  6. 2a. Encoding time

  7. Getting targets when modeling sequences
  • When applying machine learning to sequences, we often want to turn an input sequence into an output sequence that lives in a different domain.
     – E.g., turn a sequence of sound pressures into a sequence of word identities.
  • When there is no separate target sequence, we can get a teaching signal by trying to predict the next term in the input sequence.
     – The target output sequence is the input sequence with an advance of 1 step.
     – This seems much more natural than trying to predict one pixel in an image from the other pixels, or one patch of an image from the rest of the image.
     – For temporal sequences there is a natural order for the predictions.
  • Predicting the next term in a sequence blurs the distinction between supervised and unsupervised learning.
     – It uses methods designed for supervised learning, but it doesn't require a separate teaching signal.
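The "advance of 1 step" trick can be sketched in a few lines of NumPy (the function name is illustrative):

```python
import numpy as np

# Self-supervised targets for sequence modeling: the target sequence is
# the input sequence advanced by one step, so no separate labels are needed.
def next_step_targets(sequence):
    """Return (inputs, targets) for next-term prediction."""
    seq = np.asarray(sequence)
    return seq[:-1], seq[1:]

x, y = next_step_targets([3, 1, 4, 1, 5, 9])
# x = [3, 1, 4, 1, 5], y = [1, 4, 1, 5, 9]: at each step, predict the next term
```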

  8. Memoryless models for sequences
  • Autoregressive models: predict the next term in a sequence from a fixed number of previous terms, using "delay taps": input(t-2), input(t-1) → input(t).
  • Feed-forward neural nets: these generalize autoregressive models by using one or more hidden layers of non-linear hidden units.
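A minimal sketch of the autoregressive idea with two delay taps, fit by least squares on a toy sequence (all names and the fitting procedure are illustrative, not from the slides):

```python
import numpy as np

# Memoryless autoregressive model: predict input(t) as a learned linear
# combination of input(t-1) and input(t-2) (two "delay taps").
def fit_ar2(sequence):
    s = np.asarray(sequence, dtype=float)
    X = np.column_stack([s[1:-1], s[:-2]])   # rows: [input(t-1), input(t-2)]
    y = s[2:]                                # targets: input(t)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# A sequence obeying s(t) = 2*s(t-1) - s(t-2) (a straight line) is
# captured exactly by two delay taps.
w = fit_ar2([1, 2, 3, 4, 5, 6])
# w ≈ [2, -1]
```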

  9. Beyond memoryless models
  • If we give our generative model some hidden state, and if we give this hidden state its own internal dynamics, we get a much more interesting kind of model.
     – It can store information in its hidden state for a long time.
     – If the dynamics is noisy and the way it generates outputs from its hidden state is noisy, we can never know its exact hidden state.
     – The best we can do is to infer a probability distribution over the space of hidden-state vectors.
  • This inference is only tractable for two types of hidden-state model.

  10. Linear Dynamical Systems (engineers love them!)
  [Diagram: a chain of hidden states over time, each receiving a driving input and producing an output]
  • These are generative models. They have a real-valued hidden state that cannot be observed directly.
     – The hidden state has linear dynamics with Gaussian noise and produces the observations using a linear model with Gaussian noise.
     – There may also be driving inputs.
  • To predict the next output (so that we can shoot down the missile) we need to infer the hidden state.
     – A linearly transformed Gaussian is a Gaussian. So the distribution over the hidden state given the data so far is Gaussian. It can be computed using "Kalman filtering".
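Kalman filtering can be sketched for a one-dimensional hidden state; all parameter values below are illustrative, not from the slides:

```python
import numpy as np

# Minimal scalar Kalman filter: infer the Gaussian posterior over a hidden
# state with linear dynamics x_t = a*x_{t-1} + process noise, observed as
# y_t = x_t + observation noise.
def kalman_1d(observations, a=1.0, q=0.1, r=1.0, mu0=0.0, var0=1.0):
    mu, var = mu0, var0
    estimates = []
    for y in observations:
        # Predict: a linearly transformed Gaussian is still a Gaussian.
        mu_pred, var_pred = a * mu, a * a * var + q
        # Update: condition the Gaussian prediction on the new observation.
        k = var_pred / (var_pred + r)          # Kalman gain
        mu = mu_pred + k * (y - mu_pred)
        var = (1.0 - k) * var_pred
        estimates.append(mu)
    return np.array(estimates)

# Noisy observations of a state near 5: the posterior mean drifts toward it.
est = kalman_1d([4.8, 5.2, 5.1, 4.9, 5.0])
```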

  11. Hidden Markov Models (computer scientists love them!)
  [Diagram: a chain of discrete hidden states over time, each producing an output]
  • Hidden Markov Models have a discrete one-of-N hidden state. Transitions between states are stochastic and controlled by a transition matrix. The outputs produced by a state are stochastic.
     – We cannot be sure which state produced a given output. So the state is "hidden".
     – It is easy to represent a probability distribution across N states with N numbers.
  • To predict the next output we need to infer the probability distribution over hidden states.
     – HMMs have efficient algorithms for inference and learning.
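The standard forward (filtering) algorithm is one of those efficient inference procedures; the 2-state parameters below are purely illustrative:

```python
import numpy as np

# Forward algorithm for an HMM: infer the distribution over hidden states
# given the observations so far. T[i, j] = P(state j at t | state i at t-1);
# E[j, o] = P(observation o | state j); pi = initial state distribution.
def forward_filter(obs, pi, T, E):
    alpha = pi * E[:, obs[0]]
    alpha /= alpha.sum()                 # normalize to a distribution
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]    # propagate, then condition on obs
        alpha /= alpha.sum()
    return alpha                         # P(hidden state at t | obs so far)

# Toy 2-state example:
pi = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1], [0.1, 0.9]])
E = np.array([[0.8, 0.2], [0.2, 0.8]])
posterior = forward_filter([0, 0, 0], pi, T, E)
# State 0 becomes much more likely after three observations of symbol 0
```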

  12. A fundamental limitation of HMMs
  • Consider what happens when a hidden Markov model generates data.
     – At each time step it must select one of its hidden states. So with N hidden states it can only remember log(N) bits about what it generated so far.
  • Consider the information that the first half of an utterance contains about the second half:
     – The syntax needs to fit (e.g. number and tense agreement).
     – The semantics needs to fit. The intonation needs to fit.
     – The accent, rate, volume, and vocal tract characteristics must all fit.
  • All these aspects combined could be 100 bits of information that the first half of an utterance needs to convey to the second half. 2^100 is big!
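The capacity argument is just arithmetic:

```python
import math

# With N hidden states, an HMM's single state variable carries at most
# log2(N) bits about the sequence generated so far. To convey ~100 bits
# from the first half of an utterance to the second, it would need 2^100 states.
bits_in_1024_states = math.log2(1024)   # a 1024-state HMM: at most 10 bits
states_for_100_bits = 2 ** 100          # astronomically many states
print(bits_in_1024_states, states_for_100_bits)
```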

  13. 2b. Recurrent Neural Networks (RNNs)

  14. Recurrent neural networks
  [Diagram: input → hidden → output at each time step, with hidden-to-hidden connections across time]
  • RNNs are very powerful, because they combine two properties:
     – Distributed hidden state that allows them to store a lot of information about the past efficiently.
     – Non-linear dynamics that allows them to update their hidden state in complicated ways.
  • With enough neurons and time, RNNs can compute anything that can be computed by your computer.

  15. Do generative models need to be stochastic?
  • Linear dynamical systems and hidden Markov models are stochastic models.
     – But the posterior probability distribution over their hidden states given the observed data so far is a deterministic function of the data.
  • Recurrent neural networks are deterministic.
     – So think of the hidden state of an RNN as the equivalent of the deterministic probability distribution over hidden states in a linear dynamical system or hidden Markov model.

  16. Recurrent neural networks
  • What kinds of behaviour can RNNs exhibit?
     – They can oscillate. Good for motor control?
     – They can settle to point attractors. Good for retrieving memories?
     – They can behave chaotically. Bad for information processing?
     – RNNs could potentially learn to implement lots of small programs that each capture a nugget of knowledge and run in parallel, interacting to produce very complicated effects.
  • But the computational power of RNNs makes them very hard to train.
     – For many years we could not exploit the computational power of RNNs despite some heroic efforts (e.g. Tony Robinson's speech recognizer).

  17. The equivalence between feedforward nets and recurrent nets
  [Diagram: a recurrent net with weights w1, w2, w3, w4 unrolled into a layered net over time steps 0–3, with the same four weights repeated in every layer]
  • Assume that there is a time delay of 1 in using each connection.
  • The recurrent net is just a layered net that keeps reusing the same weights.
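The unrolled view can be sketched in NumPy: the loop below applies the same two weight matrices at every time step, exactly like a layered net that reuses its weights (matrix names and sizes are illustrative):

```python
import numpy as np

# An RNN unrolled through time is a layered feed-forward net that reuses
# the same weights at every step: W_xh (input → hidden), W_hh (hidden → hidden).
def rnn_forward(inputs, W_xh, W_hh):
    h = np.zeros(W_hh.shape[0])
    for x in inputs:                       # each time step is one "layer"
        h = np.tanh(W_xh @ x + W_hh @ h)   # same weights reused every step
    return h

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3)) * 0.1
W_hh = rng.normal(size=(4, 4)) * 0.1
h_final = rnn_forward(rng.normal(size=(3, 3)), W_xh, W_hh)
# h_final summarizes the whole 3-step sequence in a 4-dim hidden state
```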

  18. 2c. Alternative architectures for RNNs

  19. Different RNN remembering architectures (o: output, y: target, L: loss)
  • Recurrent network with no outputs — memory: h(t-1) → h(t).
  • Single output after the entire sequence — memory: h(t-1) → h(t).
  • Teacher forcing: train from y and x in parallel — memory: o(t-1) → h(t); without the targets (at test time), it can only run sequentially.
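The teacher-forcing distinction can be sketched as follows (a toy model, not the slides' exact setup; all names are illustrative):

```python
import numpy as np

# Sketch of teacher forcing vs. free-running generation for an RNN whose
# update sees the previous output o(t-1).
def step(x_t, prev_out, W_x, W_o):
    return np.tanh(W_x @ x_t + W_o @ prev_out)

def teacher_forced(xs, ys, y0, W_x, W_o):
    # Training: condition on ground-truth previous outputs y(t-1),
    # so every time step can be computed independently (in parallel).
    prev = [y0] + list(ys[:-1])
    return [step(x, p, W_x, W_o) for x, p in zip(xs, prev)]

def free_running(xs, y0, W_x, W_o):
    # Test time: no targets, so each step feeds back the model's own
    # previous output and must run sequentially.
    out, outs = y0, []
    for x_t in xs:
        out = step(x_t, out, W_x, W_o)
        outs.append(out)
    return outs

rng = np.random.default_rng(1)
W_x, W_o = rng.normal(size=(2, 2)) * 0.5, rng.normal(size=(2, 2)) * 0.5
xs, ys = list(rng.normal(size=(4, 2))), list(rng.normal(size=(4, 2)))
tf = teacher_forced(xs, ys, np.zeros(2), W_x, W_o)
fr = free_running(xs, np.zeros(2), W_x, W_o)
# Both start from y0, so their first steps agree; later steps diverge.
```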

  20. 2d. Back-propagation through time (BPTT)
