Recurrent Neural Networks: Stability Analysis and LSTMs


  1. Recurrent Neural Networks: Stability Analysis and LSTMs. M. Soleymani, Sharif University of Technology, Spring 2020. Most slides have been adapted from Bhiksha Raj, 11-785, CMU, 2019, and some from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.

  2. Story so far. [Figure: a time-delay network reading a window of stock vectors X(t), X(t+1), …, X(t+7) to produce Y(t+6).] • Iterated structures are good for analyzing time-series data with short-time dependence on the past – These are “Time-Delay Neural Nets” (TDNNs), a.k.a. convnets

  3. Story so far. [Figure: a recurrent network unrolled over time from t = 0, with inputs X(t), outputs Y(t), and initial hidden state h(−1).] • Recurrent structures are good for analyzing time-series data with long-term dependence on the past – These are recurrent neural networks

  4. Recurrent structures can do what static structures cannot. [Figure: an MLP fed the bits of two binary numbers and producing the bits of their sum.] • The addition problem: add two N-bit numbers to produce an (N+1)-bit number – Input is binary – Would require a large number of training instances • The output must be specified for every pair of inputs • Weights that generalize will make errors – A network trained for N-bit numbers will not work for (N+1)-bit numbers

  5. MLPs vs RNNs. [Figure: a single RNN unit that reads the current pair of input bits plus the previous carry, and outputs the sum bit and the carry.] • The addition problem: add two N-bit numbers to produce an (N+1)-bit number • RNN solution: very simple, and it can add two numbers of any size
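A minimal sketch (not from the slides; names and structure are illustrative) of the logic this single-unit RNN learns: one bit of recurrent state, the carry, is passed from step to step while the two numbers are read one bit-pair at a time, least-significant bit first.

    # Hand-coded simulation of the carry recurrence an RNN adder learns.
    def rnn_add(a_bits, b_bits):
        """a_bits, b_bits: lists of 0/1, least-significant bit first."""
        carry = 0                      # the single recurrent state bit
        out = []
        for a, b in zip(a_bits, b_bits):
            s = a + b + carry          # full-adder sum at this time step
            out.append(s % 2)          # output bit at this time step
            carry = s // 2             # state passed to the next time step
        out.append(carry)              # N-bit inputs -> (N+1)-bit output
        return out

    # 6 (110) + 3 (011) = 9 (1001), fed LSB first
    print(rnn_add([0, 1, 1], [1, 1, 0]))   # [1, 0, 0, 1]

Because the same rule is applied at every time step, the recurrence works for inputs of any length, which is exactly what the fixed-size MLP cannot do.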

  6. MLP: The parity problem. [Figure: an MLP reading a fixed-length bit string.] • Is the number of “ones” even or odd? • The network must be complex to capture all patterns – An XOR network, quite complex – Fixed input size

  7. RNN: The parity problem. [Figure: a single RNN unit combining the current bit with its previous output.] • Trivial solution • Generalizes to input of any size
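A minimal sketch (illustrative, not from the slides) of the trivial recurrent solution: the state is the running parity, updated by XOR-ing in each incoming bit.

    # Hand-coded simulation of the recurrent parity computation.
    def rnn_parity(bits):
        state = 0                # previous output, initialised to "even"
        for b in bits:
            state = state ^ b    # XOR: the parity flips whenever a 1 arrives
        return state             # 1 if the number of ones is odd, else 0

    print(rnn_parity([1, 0, 0, 0, 1, 1, 0, 0, 1, 0]))  # 0 (four ones -> even)

The same single unit handles bit strings of any length, whereas the MLP of the previous slide is tied to a fixed input size.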

  8. Story so far. [Figure: unrolled RNN with a loss comparing the outputs Y(t) to the desired outputs Y_desired(t).] • Recurrent structures can be trained by minimizing the loss between the sequence of outputs and the sequence of desired outputs – Through gradient descent and backpropagation

  9. Back Propagation Through Time. [Figure: unrolled RNN with initial state h(0), inputs X(1)…X(T), outputs Y(1)…Y(T), and a sequence loss Loss(Y_target(1…T), Y(1…T)).] • The loss computed is between the sequence of outputs produced by the network and the desired sequence of outputs • This is not just the sum of the divergences at individual times – Unless we explicitly define it that way

  10. Time-synchronous recurrence. [Figure: unrolled RNN with a per-instant loss between Y(t) and Y_target(t).]
      • Usual assumption: the sequence divergence is the sum of the divergences at the individual instants
        – Loss(Y_target(1…T), Y(1…T)) = Σ_t Loss(Y_target(t), Y(t))
        – ∇_{Y(t)} Loss(Y_target(1…T), Y(1…T)) = ∇_{Y(t)} Loss(Y_target(t), Y(t))
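A small sketch of this assumption (variable names are hypothetical): the sequence loss is just the sum of per-instant divergences, so the gradient of the sequence loss with respect to Y(t) is the gradient of the divergence at time t alone.

    import numpy as np

    # Sum-of-instants sequence loss with a squared-error divergence per step.
    def sequence_loss(Y, Y_target):
        return sum(np.sum((y - yt) ** 2) for y, yt in zip(Y, Y_target))

    T, d = 5, 3
    Y        = [np.random.randn(d) for _ in range(T)]
    Y_target = [np.random.randn(d) for _ in range(T)]
    print(sequence_loss(Y, Y_target))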

  11. Long-term behavior of RNNs • In linear systems, long-term behavior depends entirely on the eigenvalues of the hidden-layer weight matrix – If the largest eigenvalue is greater than 1, the system will “blow up” – If it is less than 1, the response will “vanish” very quickly

  12. “BIBO” Stability • “Bounded Input, Bounded Output” stability – This is a highly desirable characteristic

  13. “BIBO” Stability. [Figure: a time-delay network reading X(t+1)…X(t+7) and producing Y(t+5).] • Returning to an old model: Y(t) = f(X(t−k), k = 1…K) • When will the output “blow up”? • Time-delay structures have bounded output if – The function f() has bounded output for bounded input • Which is true of almost every activation function – X(t) is bounded

  14. Is this BIBO? [Figure: unrolled RNN, as before.] • Will an RNN necessarily be BIBO?

  15. Is this BIBO? [Figure: unrolled RNN, as before.] • Will this necessarily be BIBO? – Guaranteed if the output and hidden activations are bounded • But will it saturate (and where)? – What if the activations are linear?
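A quick sketch of the point about bounded activations (weights chosen for illustration): with a bounded input, a tanh hidden activation keeps the state bounded because it saturates, while the same recursion with a linear (identity) activation blows up once the recurrent weight exceeds 1 in magnitude.

    import numpy as np

    np.random.seed(0)
    w_h, w_x = 1.5, 1.0                    # recurrent weight deliberately > 1
    x = np.random.uniform(-1, 1, 100)      # bounded input sequence

    h_tanh, h_lin = 0.0, 0.0
    for t in range(100):
        h_tanh = np.tanh(w_h * h_tanh + w_x * x[t])   # stays in (-1, 1)
        h_lin  =          w_h * h_lin  + w_x * x[t]   # grows roughly like 1.5**t

    print(h_tanh, h_lin)   # a bounded value vs. an astronomically large one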

  16. Analyzing recurrence. [Figure: unrolled RNN, as before.] • It is sufficient to analyze the behavior of the hidden state h(t), since it carries the relevant information – We will assume only a single hidden layer, for simplicity

  17. Analyzing Recursion

  18. Streetlight effect. [Figure: unrolled RNN, as before.] • It is easier to analyze linear systems – We will attempt to extrapolate to non-linear systems subsequently • All activations are identity functions – z(t) = W_h h(t−1) + W_x x(t), h(t) = z(t)

  19. Linear systems
      • h(t) = W_h h(t−1) + W_x x(t)
        – h(t−1) = W_h h(t−2) + W_x x(t−1)
      • h(t) = W_h^2 h(t−2) + W_h W_x x(t−1) + W_x x(t)

  20. Linear systems
      • h(t) = W_h h(t−1) + W_x x(t)
        – h(t−1) = W_h h(t−2) + W_x x(t−1)
      • h(t) = W_h^2 h(t−2) + W_h W_x x(t−1) + W_x x(t)
      • h(t) = W_h^{t+1} h(−1) + W_h^t W_x x(0) + W_h^{t−1} W_x x(1) + ⋯

  21. Linear systems
      • h(t) = W_h h(t−1) + W_x x(t)
        – h(t−1) = W_h h(t−2) + W_x x(t−1)
      • h(t) = W_h^2 h(t−2) + W_h W_x x(t−1) + W_x x(t)
      • h(t) = W_h^{t+1} h(−1) + W_h^t W_x x(0) + W_h^{t−1} W_x x(1) + ⋯
      • h(t) = h(−1) H_t(1_{−1}) + x(0) H_t(1_0) + x(1) H_t(1_1) + x(2) H_t(1_2) + ⋯
        – where H_t(1_k) is the hidden response at time t when the input is [0 0 0 … 1 0 … 0] (with the 1 in the k-th position)
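A small numeric check of this expansion (random toy matrices, names illustrative): iterating h(t) = W_h h(t−1) + W_x x(t) gives exactly the closed-form superposition of powers of W_h applied to the initial state and to each individual input.

    import numpy as np

    np.random.seed(0)
    n, m, T = 4, 3, 6
    W_h = np.random.randn(n, n) * 0.5
    W_x = np.random.randn(n, m)
    h_init = np.random.randn(n)                  # h(-1)
    xs = [np.random.randn(m) for _ in range(T)]  # x(0) ... x(T-1)

    # iterate the recursion
    h = h_init
    for x in xs:
        h = W_h @ h + W_x @ x

    # closed-form expansion: W_h^T h(-1) + sum_k W_h^(T-1-k) W_x x(k)
    h_closed = np.linalg.matrix_power(W_h, T) @ h_init
    for k, x in enumerate(xs):
        h_closed += np.linalg.matrix_power(W_h, T - 1 - k) @ W_x @ x

    print(np.allclose(h, h_closed))   # True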

  22. Streetlight effect. [Figure: unrolled RNN, as before.] • It is sufficient to analyze the response to a single input at t = 0 – Principle of superposition in linear systems: h(t) = h(−1) H_t(1_{−1}) + x(0) H_t(1_0) + x(1) H_t(1_1) + x(2) H_t(1_2) + ⋯

  23. Linear recursions
      • Consider a simple, scalar, linear recursion (note the change of notation)
        – h(t) = w_h h(t−1) + w_x x(t)
        – Response to a single input at t = 1: h_1(t) = w_h^{t−1} w_x x(1)
      [Plot: h_1(t) as a function of time.]
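A one-line check of this scalar response (illustrative values): h_1(t) = w_h^{t−1} w_x dies out when |w_h| < 1, stays flat at |w_h| = 1, and blows up when |w_h| > 1.

    # Response at t = 50 to a unit input at t = 1, for three recurrent weights.
    w_x = 1.0
    for w_h in (0.9, 1.0, 1.1):
        response = [w_h ** (t - 1) * w_x for t in range(1, 51)]
        print(w_h, response[-1])   # about 0.006, 1.0, and 107 respectively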

  24. Linear recursions: Vector version
      • Vector linear recursion (note the change of notation)
        – h(t) = W_h h(t−1) + W_x x(t)
        – Response to a single input at t = 1: h_1(t) = W_h^{t−1} W_x x(1)
      • The length of the response vector to a single input at 1 is |h_1(t)|
      • We can write W_h = U Λ U^{−1}
        – W_h u_i = λ_i u_i
        – For any vector h we can write
          • h = a_1 u_1 + a_2 u_2 + ⋯ + a_n u_n
          • W_h h = a_1 λ_1 u_1 + a_2 λ_2 u_2 + ⋯ + a_n λ_n u_n
          • W_h^k h = a_1 λ_1^k u_1 + a_2 λ_2^k u_2 + ⋯ + a_n λ_n^k u_n
        – lim_{k→∞} W_h^k h = a_m λ_m^k u_m, where m = argmax_j |λ_j|

  25. Linear recursions: Vector version
      • Same recursion and eigen-decomposition as the previous slide: h(t) = W_h h(t−1) + W_x x(t), W_h = U Λ U^{−1}, W_h^k h = a_1 λ_1^k u_1 + ⋯ + a_n λ_n^k u_n
      • For any input, for large t, the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix

  26. Linear recursions: Vector version
      • Same recursion and eigen-decomposition as the previous slides
      • For any input, for large t, the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix
        – Unless the input has no component along the eigenvector corresponding to the largest eigenvalue; in that case it will grow according to the second-largest eigenvalue
        – And so on
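A numeric sketch of the eigenvalue argument (random symmetric W_h so the eigenvectors are orthogonal; all names illustrative): the norm of W_h^t applied to a vector grows or shrinks like |λ_max|^t, and if the vector has no component along the leading eigenvector, the second-largest eigenvalue takes over.

    import numpy as np

    np.random.seed(0)
    A = np.random.randn(5, 5)
    W_h = (A + A.T) / 2                          # symmetric -> orthogonal eigenbasis
    lams, U = np.linalg.eigh(W_h)
    order = np.argsort(np.abs(lams))[::-1]
    lam1, u1 = lams[order[0]], U[:, order[0]]    # largest-magnitude eigenpair
    lam2 = lams[order[1]]                        # second largest

    w = np.random.randn(5)
    w_perp = w - (w @ u1) * u1                   # remove the component along u1

    t = 20
    r_full = np.linalg.norm(np.linalg.matrix_power(W_h, t) @ w)
    r_perp = np.linalg.norm(np.linalg.matrix_power(W_h, t) @ w_perp)
    print(r_full / abs(lam1) ** t, r_perp / abs(lam2) ** t)   # both ratios are O(1)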
