
On the Practical Computational Power of Finite Precision RNNs for Language Recognition



  1. On the Practical Computational Power of Finite Precision RNNs for Language Recognition. Gail Weiss, Yoav Goldberg, Eran Yahav. GRU < LSTM (!?) Supported by the European Union’s Seventh Framework Programme (FP7) under grant agreement no. 615688 (PRIME)

  2. Current State • RNNs are everywhere • We don’t know too much about the differences between them: • Gated RNNs are shown to train better; beyond that: • “RNNs are Turing Complete”?

  3. Turing Complete?

  4. Turing Complete? 1993 Proof: 1. Requires Infinite Precision: uses stack(s), maintained in certain dimension(s); zeros are pushed using division (using g = g/4 + 1/4); in 32 bits, this reaches the limit after 15 pushes. 2. Requires Infinite Time: allows processing steps beyond reading input (not the standard use case!). Unreasonable assumptions!

  5. Turing Complete? 1993 Proof: 1. Requires Infinite Precision: uses stack(s), maintained in certain dimension(s); zeros are pushed using division (using g = g/4 + 1/4); in 32 bits, this reaches the limit after 15 pushes. 2. Requires Infinite Time: allows processing steps beyond reading input (not the standard use case!). Unreasonable assumptions! TURING TARPIT!
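The precision limit on the stack encoding can be checked numerically. The sketch below is an illustration only, not the construction from the 1993 proof: it repeats the zero-push g = g/4 + 1/4 from g = 0 and counts how many pushes still change the stored value. The helper names are hypothetical, and the exact count depends on the number format and on what counts as "the limit", so it need not equal the 15 quoted on the slide.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def pushes_until_stuck(round_fn, max_pushes: int = 200) -> int:
    """Count zero-pushes g <- g/4 + 1/4 that still change the stored value."""
    g = 0.0
    for n in range(max_pushes):
        new_g = round_fn(g / 4 + 1 / 4)
        if new_g == g:
            return n  # this push was lost: the encoded stack is saturated
        g = new_g
    return max_pushes


pushes32 = pushes_until_stuck(to_float32)  # 32-bit floating point
pushes64 = pushes_until_stuck(float)       # 64-bit floating point
print(pushes32, pushes64)
```

Both counts are small and finite, and doubling the width only roughly doubles the usable stack depth.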

  6. What happens on real hardware and real use-cases?

  7. Real Use • Gated architectures have the best performance • LSTM and GRU are most popular • Of these, the choice between them is unclear

  8. Main Result: We accept that all RNN types can simulate DFAs. We show that LSTMs and IRNNs can also count, and that the GRU and SRNN cannot.

  9. Power of Counting. Practical: in NMT, LSTM is better at capturing target length.

  10. Power of Counting. Practical: in NMT, LSTM is better at capturing target length. Theoretical: Finite State Machines vs Counter Machines.

  11. K-Counter Machines (SKCMs) (Fischer, Meyer, Rosenberg, 1968) • Similar to finite automata, but also maintain k counters • A counter has 4 operations: inc/dec by one, do nothing, reset • Counters are observed by comparison to zero
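The counter-machine description above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the formal definition from the cited work: the function names, the state labels "A", "B", "dead", and the choice of $a^n b^n$ as the example language are all hypothetical. The controller sees only the input symbol and whether each counter is zero, and answers with one of the four operations per counter.

```python
from typing import Callable, Tuple

# The four counter operations named on the slide.
INC, DEC, NOP, RESET = "inc", "dec", "nop", "reset"


def apply_op(value: int, op: str) -> int:
    """Apply one of the four counter operations."""
    if op == INC:
        return value + 1
    if op == DEC:
        return value - 1
    if op == RESET:
        return 0
    return value  # NOP


def run_machine(word: str, step: Callable, start: str,
                accept: Callable, k: int) -> bool:
    """Run a k-counter machine; counters are observed only via == 0."""
    state, counters = start, [0] * k
    for symbol in word:
        zero = tuple(c == 0 for c in counters)
        state, ops = step(state, symbol, zero)
        counters = [apply_op(c, op) for c, op in zip(counters, ops)]
    return accept(state, tuple(c == 0 for c in counters))


def anbn_step(state: str, symbol: str, zero: Tuple[bool, ...]):
    """One-counter controller for a^n b^n (illustrative)."""
    if state == "A" and symbol == "a":
        return "A", (INC,)
    if state in ("A", "B") and symbol == "b" and not zero[0]:
        return "B", (DEC,)
    return "dead", (NOP,)


def anbn_accept(state: str, zero: Tuple[bool, ...]) -> bool:
    return state != "dead" and zero[0]


def accepts_anbn(word: str) -> bool:
    return run_machine(word, anbn_step, "A", anbn_accept, k=1)


print(accepts_anbn("aaabbb"), accepts_anbn("aabbb"))
```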

  12. Counting Machines and Chomsky Hierarchy: Regular Languages (RL), Context Free Languages (CFL), Context Sensitive Languages (CSL), Recursively Enumerable Languages (RE)

  13. Chomsky Hierarchy and SKCMs. Example languages: $a^n b^n$, Palindromes. Classes: Regular Languages (RL), Context Free Languages (CFL), Context Sensitive Languages (CSL), Recursively Enumerable Languages (RE)

  14. Chomsky Hierarchy and SKCMs. Example languages: $a^n b^n$, $a^n b^n c^n$, Palindromes. Classes: Regular Languages (RL), Context Free Languages (CFL), Context Sensitive Languages (CSL), Recursively Enumerable Languages (RE)

  17. Chomsky Hierarchy and SKCMs. SKCMs cross the Chomsky Hierarchy! Example languages: $a^n b^n$, $a^n b^n c^n$, Palindromes. Classes: Regular Languages (RL), Context Free Languages (CFL), Context Sensitive Languages (CSL), Recursively Enumerable Languages (RE)
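One way to see the crossing: $a^n b^n c^n$ is not context free, yet two counters are enough to recognize it. The sketch below is an illustration with hypothetical names, not code from the talk. It touches the counters only through increment, decrement, and comparison to zero.

```python
def accepts_anbncn(word: str) -> bool:
    """Two-counter recognizer for a^n b^n c^n (illustrative sketch)."""
    c1 = c2 = 0   # c1: a's not yet matched by b's; c2: b's not yet matched by c's
    phase = "a"   # finite-state part: which block of the word we are in
    for symbol in word:
        if phase == "a" and symbol == "a":
            c1 += 1
        elif phase in ("a", "b") and symbol == "b" and c1 != 0:
            phase = "b"
            c1 -= 1
            c2 += 1
        elif phase in ("b", "c") and symbol == "c" and c1 == 0 and c2 != 0:
            phase = "c"
            c2 -= 1
        else:
            return False  # wrong order, or a counter would go below zero
    return c1 == 0 and c2 == 0


print(accepts_anbncn("aabbcc"), accepts_anbncn("aabbc"))
```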

  18. Summary so Far • Counters give additional formal power • We claimed that LSTM can count and GRU cannot • Let’s see why

  20. Popular Architectures
     GRU: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$; $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$; $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
     LSTM: $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$; $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$; $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$; $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  21. Popular Architectures
     GRU: gates $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$, $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$; candidate vector $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$; update function $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
     LSTM: gates $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$, $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$, $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$; candidate vector $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$; update functions $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$, $h_t = o_t \circ g(c_t)$
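The gate, candidate, and update equations can be written out as one step per architecture. The sketch below is a scalar version (hidden size 1) so it needs no libraries. The parameter dictionary layout, the helper `make_params`, the constant weight 0.5, and the choice g = tanh are assumptions made for illustration; the slides leave g unspecified.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h_prev: float, p: dict) -> float:
    # gates
    z = sigmoid(p["W_z"] * x + p["U_z"] * h_prev + p["b_z"])
    r = sigmoid(p["W_r"] * x + p["U_r"] * h_prev + p["b_r"])
    # candidate vector
    h_cand = math.tanh(p["W_h"] * x + p["U_h"] * (r * h_prev) + p["b_h"])
    # update function: interpolation between old state and candidate
    return z * h_prev + (1.0 - z) * h_cand


def lstm_step(x: float, h_prev: float, c_prev: float, p: dict, g=math.tanh):
    # gates
    f = sigmoid(p["W_f"] * x + p["U_f"] * h_prev + p["b_f"])
    i = sigmoid(p["W_i"] * x + p["U_i"] * h_prev + p["b_i"])
    o = sigmoid(p["W_o"] * x + p["U_o"] * h_prev + p["b_o"])
    # candidate vector
    c_cand = math.tanh(p["W_c"] * x + p["U_c"] * h_prev + p["b_c"])
    # update functions: gated addition into the cell, then gated readout
    c = f * c_prev + i * c_cand
    h = o * g(c)
    return h, c


def make_params(names: str, value: float = 0.5) -> dict:
    """Hypothetical helper: the same constant for every W, U and b."""
    return {f"{m}_{n}": value for m in ("W", "U", "b") for n in names}


gru_params, lstm_params = make_params("zrh"), make_params("fioc")
h_gru, h_lstm, c_lstm = 0.0, 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h_gru = gru_step(x, h_gru, gru_params)
    h_lstm, c_lstm = lstm_step(x, h_lstm, c_lstm, lstm_params)
print(h_gru, h_lstm, c_lstm)
```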

  22. Popular Architectures
     GRU: gates $z_t \in (0,1)$, $r_t \in (0,1)$; candidate vector $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$; update function $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
     LSTM: gates $f_t \in (0,1)$, $i_t \in (0,1)$, $o_t \in (0,1)$; candidate vector $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$; update functions $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$, $h_t = o_t \circ g(c_t)$

  23. Popular Architectures
     GRU: gates $z_t \in (0,1)$, $r_t \in (0,1)$; candidate vector $\tilde{h}_t \in (-1,1)$ (tanh); update function $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
     LSTM: gates $f_t \in (0,1)$, $i_t \in (0,1)$, $o_t \in (0,1)$; candidate vector $\tilde{c}_t \in (-1,1)$ (tanh); update functions $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$, $h_t = o_t \circ g(c_t)$

  24. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
     LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$ (tanh); $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  25. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation)
     LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$ (tanh); $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  26. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation. Bounded!)
     LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$ (tanh); $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  27. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation. Bounded!)
     LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$ (tanh); $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$ (highlighted); $h_t = o_t \circ g(c_t)$

  28. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation. Bounded!)
     LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$ (tanh); $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$ (Addition); $h_t = o_t \circ g(c_t)$
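The interpolation versus addition contrast can be checked with a small experiment. The sketch below is illustrative only: the gate and candidate values are drawn at random or fixed by hand rather than computed by a trained network, and the function names are hypothetical. The GRU update is a convex combination of values inside the tanh range, so the state never leaves that range; the LSTM cell update is a sum, so it can keep growing.

```python
import random


def gru_update(h_prev: float, z: float, h_cand: float) -> float:
    """Interpolation: the result lies between h_prev and h_cand."""
    return z * h_prev + (1.0 - z) * h_cand


def lstm_update(c_prev: float, f: float, i: float, c_cand: float) -> float:
    """Addition: the gated candidate is added to the gated old cell value."""
    return f * c_prev + i * c_cand


random.seed(0)
h, c = 0.0, 0.0
max_abs_h = 0.0
for _ in range(1000):
    # GRU: arbitrary gate in [0, 1] and arbitrary candidate in the tanh range
    h = gru_update(h, random.uniform(0.0, 1.0), random.uniform(-1.0, 1.0))
    max_abs_h = max(max_abs_h, abs(h))
    # LSTM: gates held open (assumed saturated), constant positive candidate
    c = lstm_update(c, f=1.0, i=1.0, c_cand=0.5)

print(max_abs_h, c)
```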

  29. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation. Bounded!)
     LSTM: $f_t \approx 1$; $i_t \approx 1$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$ (tanh); from $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$: $c_t \approx c_{t-1} + \tilde{c}_t$ (Addition); $h_t = o_t \circ g(c_t)$

  30. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation. Bounded!)
     LSTM: $f_t \approx 1$; $i_t \approx 1$; $o_t \in (0,1)$; $\tilde{c}_t \approx 1$; from $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$: $c_t \approx c_{t-1} + 1$ (Increase by 1); $h_t = o_t \circ g(c_t)$

  31. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation. Bounded!)
     LSTM: $f_t \approx 1$; $i_t \approx 1$; $o_t \in (0,1)$; $\tilde{c}_t \approx -1$; from $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$: $c_t \approx c_{t-1} - 1$ (Decrease by 1); $h_t = o_t \circ g(c_t)$

  32. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation. Bounded!)
     LSTM: $f_t \approx 1$; $i_t \approx 0$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$ (tanh); from $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$: $c_t \approx c_{t-1}$ (Do Nothing); $h_t = o_t \circ g(c_t)$

  33. Popular Architectures
     GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$ (tanh); $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation. Bounded!)
     LSTM: $f_t \approx 0$; $i_t \approx 0$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$ (tanh); from $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$: $c_t \approx 0$ (Reset); $h_t = o_t \circ g(c_t)$
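The four gate settings on the last slides line up with the four counter operations. The sketch below is a hand-wired illustration, not a trained network: it drives a scalar LSTM cell update with large pre-activations so the sigmoid and tanh saturate. The value `SAT = 30.0` and the table `OPS` are assumptions chosen for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


SAT = 30.0  # assumed pre-activation magnitude, large enough to saturate


def lstm_cell_update(c_prev: float, f_pre: float, i_pre: float,
                     cand_pre: float) -> float:
    """c_t = f_t * c_{t-1} + i_t * c~_t, computed from pre-activations."""
    f, i = sigmoid(f_pre), sigmoid(i_pre)
    c_cand = math.tanh(cand_pre)
    return f * c_prev + i * c_cand


# (forget, input, candidate) pre-activations for each counter operation
OPS = {
    "inc":   (SAT, SAT, SAT),     # f ~ 1, i ~ 1, c~ ~ +1
    "dec":   (SAT, SAT, -SAT),    # f ~ 1, i ~ 1, c~ ~ -1
    "nop":   (SAT, -SAT, 0.0),    # f ~ 1, i ~ 0
    "reset": (-SAT, -SAT, 0.0),   # f ~ 0, i ~ 0
}


def run_ops(ops, c: float = 0.0) -> float:
    for op in ops:
        c = lstm_cell_update(c, *OPS[op])
    return c


print(run_ops(["inc"] * 5 + ["dec"] * 2 + ["nop"]))
```

The results are approximate rather than exact integers because the gates are only close to 0 or 1, matching the ≈ signs on the slides.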
