Recurrent Neural Networks: Stability analysis and LSTMs M. - PowerPoint PPT Presentation

Recurrent Neural Networks: Stability analysis and LSTMs M. Soleymani Sharif University of Technology Spring 2020 Most slides have been adopted from Bhiksha Raj, 11-785, CMU 2019 and some from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017.

St Story so far Y(t+6) Stock vector X(t) X(t+1) X(t+2) X(t+3) X(t+4) X(t+5) X(t+6) X(t+7) • Iterated structures are good for analyzing time series data with short- time dependence on the past – These are “ Time delay Neural Nets” (TDNNs), AKA convnets 2

St Story so far Y(t) h -1 X(t) t=0 Time • Recurrent structures are good for analyzing time series data with long-term dependence on the past – These are recurrent neural networks 3

Re Recurrent structures can do what static structures cannot 1 0 1 0 1 0 1 1 1 1 0 MLP 1 0 0 0 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 0 • The addition problem: Add two N-bit numbers to produce a N+1-bit number – Input is binary – Will require large number of training instances • Output must be specified for every pair of inputs • Weights that generalize will make errors – Network trained for N-bit numbers will not work for N+1 bit numbers 4

MLP MLPs vs RNNs 1 Previous Carry RNN unit carry out 1 0 • The addition problem: Add two N-bit numbers to produce a N+1- bit number • RNN solution: Very simple, can add two numbers of any size 5

MLP: Th The parity problem 1 MLP 1 0 0 0 1 1 0 0 1 0 • Is the number of “ones” even or odd • Network must be complex to capture all patterns – XOR network, quite complex – Fixed input size 6

RNN: Th The parity problem Previous 1 1 output RNN unit 0 • Trivial solution • Generalizes to input of any size 7

Story so far St Y desired (t) Loss Y(t) h -1 X(t) t=0 Time • Recurrent structures can be trained by minimizing the loss between the sequence of outputs and the sequence of desired outputs – Through gradient descent and backpropagation 8

Back Propagation Th Through Ti Time 𝑀𝑝𝑡𝑡 𝑍 )*+,-) (1. . 𝑈) 𝑍(1) 𝑍(2) 𝑍(𝑈 − 2) 𝑍(𝑈 − 1) 𝑍(𝑈) h 0 𝑌(1) 𝑌(2) 𝑌(𝑈 − 2) 𝑌(𝑈 − 1) 𝑌(𝑈) • The loss computed is between the sequence of outputs by the network and the desired sequence of outputs • This is not just the sum of the divergences at individual times Unless we explicitly define it that way § 9

� Time-sy Ti synchronous recurrence Y target (t) Loss Y(t) Y(t) h 0 X(t) t=1 Time Usual assumption: Sequence divergence is the sum of the • divergence at individual instants 𝑀𝑝𝑡𝑡 𝑍 )*+,-) 1 … 𝑈 , 𝑍(1 … 𝑈) = 5 𝑀𝑝𝑡𝑡 𝑍 )*+,-) 𝑢 , 𝑍(𝑢) ) 𝛼 9()) 𝑀𝑝𝑡𝑡 𝑍 )*+,-) 1 … 𝑈 , 𝑍(1 … 𝑈) = 𝛼 9()) 𝑀𝑝𝑡𝑡 𝑍 )*+,-) 𝑢 , 𝑍(𝑢) 10

Long-term behavior of RNNs • In linear systems, long-term behavior depends entirely on the eigenvalues of the hidden-layer weights matrix – If the largest Eigen value is greater than 1, the system will “blow up” – If it is lesser than 1, the response will “vanish” very quickly 11

“B “BIB IBO” ” Stabi bility ty • “Bounded Input Bounded Output” stability – This is a highly desirable characteristic 12

“BIB “B IBO” ” Stabi bility ty Y(t+5) • Returning to an old model.. 𝑍 𝑢 = 𝑔(𝑌 𝑢 − 𝑗 , 𝑗 = 1. . 𝐿) • When will the output “blow up”? X(t+1) X(t+2) X(t+3) X(t+4) X(t+5) X(t+6) X(t+7) • Time-delay structures have bounded output if – The function 𝑔() has bounded output for bounded input • Which is true of almost every activation function – 𝑌(𝑢) is bounded 13

Is this Is this BIB IBO? Y(t) h -1 X(t) t=0 Time • Will RNN necessarily be BIBO? 14

Is Is this this BIB IBO? Y(t) h -1 X(t) t=0 Time • Will this necessarily be BIBO? – Guaranteed if output and hidden activations are bounded • But will it saturate (and where) – What if the activations are linear? 15

Analyzing Ana ng recur urrenc nce Y(t) h -1 X(t) t=0 Time • Sufficient to analyze the behavior of the hidden layer ℎ ? since it carries the relevant information – Will assume only a single hidden layer for simplicity 16

Ana Analyzing ng Recur ursi sion 17

St Streetlight effect Y(t) h -1 X(t) t=0 Time • Easier to analyze linear systems – Will attempt to extrapolate to non-linear systems subsequently • All activations are identity functions – 𝑨 ? = 𝑋 B ℎ ?CD + 𝑋 F 𝑦 ? , ℎ ? = 𝑨 ? 18

Li Linear r systems ms • ℎ ? = 𝑋 B ℎ ?CD + 𝑋 F 𝑦 ? – ℎ ?CD = 𝑋 B ℎ ?CH + 𝑋 F 𝑦 ?CD H ℎ ?CH + 𝑋 • ℎ ? = 𝑋 B 𝑋 F 𝑦 ?CD + 𝑋 F 𝑦 ? B 19

Li Linear r systems ms • ℎ ? = 𝑋 B ℎ ?CD + 𝑋 F 𝑦 ? – ℎ ?CD = 𝑋 B ℎ ?CH + 𝑋 F 𝑦 ?CD H ℎ ?CH + 𝑋 • ℎ ? = 𝑋 B 𝑋 F 𝑦 ?CD + 𝑋 F 𝑦 ? B ?ID ℎ CD + 𝑋 ? 𝑋 ?CD 𝑋 • ℎ ? = 𝑋 F 𝑦 J + 𝑋 F 𝑦 D + ⋯ B B B 20

Li Linear r systems ms • ℎ ? = 𝑋 B ℎ ?CD + 𝑋 F 𝑦 ? – ℎ ?CD = 𝑋 B ℎ ?CH + 𝑋 F 𝑦 ?CD H ℎ ?CH + 𝑋 • ℎ ? = 𝑋 B 𝑋 F 𝑦 ?CD + 𝑋 F 𝑦 ? B ?ID ℎ CD + 𝑋 ? 𝑋 ?CD 𝑋 • ℎ ? = 𝑋 F 𝑦 J + 𝑋 F 𝑦 D + ⋯ B B B • ℎ ? = 𝐼 ? ℎ CD + 𝐼 ? 𝑦 J + 𝐼 ? 𝑦 D + 𝐼 ? 𝑦 H + ⋯ = ℎ CD 𝐼 ? (1 CD ) + 𝑦 J 𝐼 ? (1 J ) + 𝑦 D 𝐼 ? (1 D ) + 𝑦 H 𝐼 ? (1 H ) + ⋯ • Where 𝐼 ? (1 ) ) is the hidden response at time k when the input is [0 0 0 … 1 0 . . 0] (where the 1 occurs in the t-th position) 21

St Streetlight effect Y(t) h -1 X(t) t=0 Time • Sufficient to analyze the response to a single input at 𝑢 = 0 – Principle of superposition in linear systems: ℎ ? = ℎ CD 𝐼 ? (1 CD ) + 𝑦 J 𝐼 ? (1 J ) + 𝑦 D 𝐼 ? (1 D ) + 𝑦 H 𝐼 ? (1 H ) + ⋯ 22

Li Linear r recursions • Consider simple, scalar, linear recursion (note change of notation) – ℎ 𝑢 = 𝑥 B ℎ 𝑢 − 1 + 𝑥 F 𝑦(𝑢) ) 𝑥 F 𝑦 1 – ℎ D 𝑢 = 𝑥 B • Response to a single input at 1 ℎ D 𝑙 23

Li Linear r recursions: : Vector r version • Vector linear recursion (note change of notation) – ℎ 𝑢 = 𝑋 B ℎ 𝑢 − 1 + 𝑋 F 𝑦(𝑢) ) 𝑋 – ℎ D 𝑢 = 𝑋 F 𝑦 1 B • Length of response vector to a single input at 1 is |ℎ {D} 𝑢 | B = 𝑉Λ𝑉 CD • We can write 𝑋 – 𝑋 B 𝑣 X = 𝜇 X 𝑣 X – For any vector 𝑤 we can write • 𝑤 = 𝑏 D 𝑣 D + 𝑏 H 𝑣 H + ⋯ + 𝑏 \ 𝑣 \ • 𝑋 B 𝑤 = 𝑏 D 𝜇 D 𝑣 D + 𝑏 H 𝜇 H 𝑣 H + ⋯ + 𝑏 \ 𝜇 \ 𝑣 \ ) 𝑣 H + ⋯ + 𝑏 \ 𝜇 \ ) 𝑣 \ ) 𝑤 = 𝑏 D 𝜇 D ) 𝑣 D + 𝑏 H 𝜇 H • 𝑋 B ) 𝑣 b where 𝑛 = argmax ) 𝑤 = 𝑏 b 𝜇 b – lim )→a 𝑋 𝜇 h B h 24

Li Linear r recursions: : Vector r version • Vector linear recursion (note change of notation) – ℎ 𝑢 = 𝑋 B ℎ 𝑢 − 1 + 𝑋 F 𝑦(𝑢) ) 𝑋 – ℎ D 𝑢 = 𝑋 F 𝑦 1 B • Length of response vector to a single input at 1 is |ℎ {D} 𝑢 | B = 𝑉Λ𝑉 CD • We can write 𝑋 – 𝑋 B 𝑣 X = 𝜇 X 𝑣 X For any input, for large 𝑢 the length of the hidden vector will expand or contract – For any vector 𝑤 we can write according to the 𝑢 th power of the largest eigen value of the hidden-layer weight matrix • 𝑤 = 𝑏 D 𝑣 D + 𝑏 H 𝑣 H + ⋯ + 𝑏 \ 𝑣 \ • 𝑋 B 𝑤 = 𝑏 D 𝜇 D 𝑣 D + 𝑏 H 𝜇 H 𝑣 H + ⋯ + 𝑏 \ 𝜇 \ 𝑣 \ ) 𝑣 H + ⋯ + 𝑏 \ 𝜇 \ ) 𝑣 \ ) 𝑤 = 𝑏 D 𝜇 D ) 𝑣 D + 𝑏 H 𝜇 H • 𝑋 B ) 𝑣 b where 𝑛 = argmax ) 𝑤 = 𝑏 b 𝜇 b – lim )→a 𝑋 𝜇 h B h 25

Li Linear r recursions: : Vector r version • Vector linear recursion (note change of notation) – ℎ 𝑢 = 𝑋 B ℎ 𝑢 − 1 + 𝑋 F 𝑦(𝑢) ) 𝑋 – ℎ D 𝑢 = 𝑋 F 𝑦 1 B • Length of response vector to a single input at 1 is |ℎ {D} 𝑢 | For any input, for large 𝑢 , the length of the hidden vector will expand or contract according to the 𝑢 th power of the largest eigen value of the hidden-layer weight matrix B = 𝑉Λ𝑉 CD • We can write 𝑋 Unless it has no component along the eigen vector corresponding to the largest eigen value. In that case it will grow according to the second largest Eigen value.. – 𝑋 B 𝑣 X = 𝜇 X 𝑣 X And so on.. – For any vector 𝑤 we can write • 𝑤 = 𝑏 D 𝑣 D + 𝑏 H 𝑣 H + ⋯ + 𝑏 \ 𝑣 \ • 𝑋 B 𝑤 = 𝑏 D 𝜇 D 𝑣 D + 𝑏 H 𝜇 H 𝑣 H + ⋯ + 𝑏 \ 𝜇 \ 𝑣 \ ) 𝑣 H + ⋯ + 𝑏 \ 𝜇 \ ) 𝑣 \ ) 𝑤 = 𝑏 D 𝜇 D ) 𝑣 D + 𝑏 H 𝜇 H • 𝑋 B ) 𝑣 b where 𝑛 = argmax ) 𝑤 = 𝑏 b 𝜇 b – lim )→a 𝑋 𝜇 h B h 26

Recurrent Neural Networks: Stability analysis and LSTMs M. - PowerPoint PPT Presentation

Recurrent Neural Networks: Stability analysis and LSTMs M. Soleymani Sharif University of Technology Spring 2020 Most slides have been adopted from Bhiksha Raj, 11-785, CMU 2019 and some from Fei Fei Li and colleagues lectures, cs231n,

CHAPTER II I CHAPTER I Recurrent Neural Networks Recurrent Neural Networks CHAPTER II : I :

Recurrent Networks: Stability analysis and LSTMs 1 Which open source project? 2 Related math.

Sequential Data with Neural Networks Recurrent Neural Networks Sequential input / output Greg

Recurrent Neural Networks: Stability analysis and LSTMs M. Soleymani Sharif University of

CS6501: Deep Learning for Visual Recognition Recurrent Neural Networks (RNNs) Todays Class

CS6501: Deep Learning for Visual Recognition Recurrent Neural Networks (RNNs) Todays Class

Recurrent Neural Networks CS60010: Deep Learning Abir Das IIT Kharagpur Mar 11, 2020

The Power of Linear Recurrent Neural Networks Neural Networks Was knnen lineare rekurrente

Recurrent Neural Network Xiaogang Wang xgwang@ee.cuhk.edu.hk February 26, 2019 cuhk Xiaogang

Recurrent Networks: Stability analysis and LSTMs 1 Y(t+6) Story so far Stock vector X(t)

LSTMs Overview Subhashini Venugopalan Neural Networks z t Output B Hidden Hidden Input WHY

CHAPTER VII VII CHAPTER Learning in Recurrent Networks Learning in Recurrent Networks CHAPTER

Recurrent Neural Networks Greg Mori - CMPT 419/726 Goodfellow, Bengio, and Courville: Deep

Understanding LSTM Networks Recurrent Neural Networks An unrolled recurrent neural network The

Recurrent Neural Networks (RNN) Artificial Intelligence @ Allegheny College Janyl Jumadinova

Natural Language Understanding Lecture 12: Recurrent Neural Networks and LSTMs Adam Lopez

Have your say on the future of the electricity network! Electricity Distribution Code Community

Professional Ethics September 19th, 2018 Do computer professional need to worry about ethics

14.471: Public Economics Tax Enforcement Emmanuel Saez MIT: Fall 2009 1 Tax Enforcement

Key Steps to Successful Partnerships with Health Care Providers Robert Schreiber , Healthy

Du Cercle The APTs That Werent La cuadrature du cercle La cuadratura del circulo Die

Q4 2008 CONFERENCE CALL Caution Regarding Forward-Looking Statements C O R P O R A T E P A R T I

Towards a General Theory of Information Transfer Rudolf Ahlswede More than restoring strings of

Security Interest Group Who Who Am I? Am I? Jason Donenfeld, also known as zx2c4 .

Sambuz

Useful Links

Newsletter

Mail Us