SLIDE 1

Supported by European Union’s Seventh Framework Programme (FP7) under grant agreement no. 615688 (PRIME)

On the Practical Computational Power of Finite Precision RNNs for Language Recognition

Gail Weiss, Yoav Goldberg, Eran Yahav

GRU < LSTM (!?)

SLIDE 2

Current State

  • RNNs are everywhere
  • We don't know much about the differences between them:
  • Gated RNNs are shown to train better; beyond that:
  • "RNNs are Turing Complete"?

SLIDE 3

Turing Complete?


SLIDE 5

Turing Complete?

Unreasonable assumptions! A Turing tarpit!

1993 Proof:

  • 1. Requires infinite precision:
    Uses stacks, maintained in certain dimensions.
    Zeros are pushed using division (g = g/4 + 1/4).
    In 32-bit floats, this reaches the precision limit after 15 pushes.

  • 2. Requires infinite time:
    Allows processing steps after the input has been read (not the standard use case!).
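To make the precision point concrete, here is a minimal sketch (ours, not part of the original proof) of the quoted push rule in 32-bit floats:

```python
import numpy as np

# Minimal sketch of the precision argument: repeatedly "push a zero"
# with the rule g = g/4 + 1/4 and watch float32 saturate near the
# fixed point 1/3, after which further pushes change nothing.
g = np.float32(0.0)
for push in range(1, 40):
    new_g = np.float32(g / 4 + 1 / 4)
    if new_g == g:  # the new stack state is no longer distinguishable
        print(f"float32 saturates after {push - 1} pushes (g = {g})")
        break
    g = new_g
```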

SLIDE 6

What happens on real hardware and real use-cases?

SLIDE 7

Real Use

  • Gated architectures have the best performance
  • LSTM and GRU are the most popular
  • The choice between the two is unclear

SLIDE 8

Main Result

  • We accept that all RNN types can simulate DFAs
  • We show that LSTMs and IRNNs can also count
  • ...and that the GRU and SRNN cannot


SLIDE 10

Power of Counting

  • Practical: in NMT, the LSTM is better at capturing target length
  • Theoretical: finite state machines vs. counter machines

SLIDE 11

K-Counter Machines (SKCMs)

Fischer, Meyer, Rosenberg (1968)

  • Similar to finite automata, but also maintain k counters
  • A counter has 4 operations: increment by one, decrement by one, do nothing, reset
  • Counters are observed by comparison to zero
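As an illustration (our own minimal sketch, not the formal SKCM definition), a single counter plus a two-state automaton suffices for aⁿbⁿ:

```python
# A minimal one-counter machine for a^n b^n (a hypothetical sketch):
# increment on 'a', decrement on 'b', and observe the counter only
# by comparing it to zero.
def accepts_anbn(word: str) -> bool:
    counter = 0
    seen_b = False            # finite-state part: phase "a*" vs "b*"
    for ch in word:
        if ch == "a":
            if seen_b:        # an 'a' after a 'b': rejected by the FSA part
                return False
            counter += 1      # counter op: increment by one
        elif ch == "b":
            seen_b = True
            if counter == 0:  # zero-test: more b's than a's
                return False
            counter -= 1      # counter op: decrement by one
        else:
            return False
    return counter == 0       # accept iff the counter is back at zero

assert accepts_anbn("aaabbb") and not accepts_anbn("aabbb")
```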

SLIDE 12

Counter Machines and the Chomsky Hierarchy

Regular Languages (RL) ⊂ Context-Free Languages (CFL) ⊂ Context-Sensitive Languages (CSL) ⊂ Recursively Enumerable Languages (RE)


SLIDE 17

Chomsky Hierarchy and SKCMs

(Diagram: RL ⊂ CFL ⊂ CSL ⊂ RE, with aⁿbⁿ and palindromes placed among the CFLs and aⁿbⁿcⁿ among the CSLs)

SKCMs recognise aⁿbⁿ and aⁿbⁿcⁿ but not palindromes: SKCMs cross the Chomsky Hierarchy!

SLIDE 18

Summary so Far

  • Counters give additional formal power
  • We claimed that LSTM can count and GRU cannot
  • Let's see why


SLIDE 20

Popular Architectures

GRU:

  Gates: zt = σ(Wzxt + Uzht−1 + bz), rt = σ(Wrxt + Urht−1 + br)
  Candidate vector: h̃t = tanh(Whxt + Uh(rt ∘ ht−1) + bh)
  Update function: ht = zt ∘ ht−1 + (1 − zt) ∘ h̃t

LSTM:

  Gates: ft = σ(Wfxt + Ufht−1 + bf), it = σ(Wixt + Uiht−1 + bi), ot = σ(Woxt + Uoht−1 + bo)
  Candidate vector: c̃t = tanh(Wcxt + Ucht−1 + bc)
  Update functions: ct = ft ∘ ct−1 + it ∘ c̃t, ht = ot ∘ g(ct)

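For reference, a direct NumPy transcription of these update equations (a sketch; the parameter dictionary p and its key names are our own convention, not an existing API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step; p maps names like 'Wz' to weight/bias arrays."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    return z * h_prev + (1 - z) * h_tilde                   # interpolation

def lstm_step(x, h_prev, c_prev, p, g=np.tanh):
    """One LSTM step; returns (h_t, c_t)."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde                            # addition
    return o * g(c), c
```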

SLIDE 22

Popular Architectures

GRU:

  zt ∈ (0,1), rt ∈ (0,1), h̃t ∈ (−1,1)
  ht = zt ∘ ht−1 + (1 − zt) ∘ h̃t (an interpolation of bounded values: ht stays bounded!)

LSTM:

  ft ∈ (0,1), it ∈ (0,1), ot ∈ (0,1), c̃t ∈ (−1,1)
  ct = ft ∘ ct−1 + it ∘ c̃t (an addition: ct is not bounded)
  ht = ot ∘ g(ct)
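Spelling out why interpolation implies boundedness (a one-line induction from h₀ = 0, added here for completeness):

```latex
% Every coordinate of h_t is a convex combination of the corresponding
% coordinate of h_{t-1} (in (-1,1) by induction) and of the tanh
% candidate (in (-1,1) by construction), so h_t never leaves (-1,1):
\[
  |h_t| = \bigl| z_t\, h_{t-1} + (1 - z_t)\,\tilde h_t \bigr|
        \le z_t\,|h_{t-1}| + (1 - z_t)\,|\tilde h_t|
        < z_t \cdot 1 + (1 - z_t) \cdot 1 = 1 .
\]
```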

SLIDE 29

Popular Architectures

LSTM, addition: with ft ≈ 1 and it ≈ 1,

  ct ≈ ct−1 + c̃t, c̃t ∈ (−1,1)

SLIDE 30

Popular Architectures

LSTM, increase by 1: with ft ≈ 1, it ≈ 1 and c̃t ≈ 1,

  ct ≈ ct−1 + 1

SLIDE 31

Popular Architectures

LSTM, decrease by 1: with ft ≈ 1, it ≈ 1 and c̃t ≈ −1,

  ct ≈ ct−1 − 1

SLIDE 32

Popular Architectures

LSTM, do nothing: with ft ≈ 1 and it ≈ 0,

  ct ≈ ct−1

SLIDE 33

Popular Architectures

LSTM, reset: with ft ≈ 0 and it ≈ 0,

  ct ≈ 0

With increment, decrement, do nothing, and reset, the LSTM cell implements a counter: the LSTM can count! The GRU state, an interpolation of bounded values, remains bounded. (A hand-set counting example follows below.)

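A hand-set, single-unit example of this construction (a sketch with our own illustrative weights; the recurrent weights and the output gate are zeroed out or omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One LSTM memory cell counting (#a - #b).  Inputs are one-hot:
# a = [1,0], b = [0,1].  Large logits saturate the gates to ~0/1
# and the candidate to ~+1/-1, realising the operations above.
BIG = 20.0
Wf, bf = np.array([[0.0, 0.0]]), np.array([BIG])   # f ~ 1: never forget
Wi, bi = np.array([[0.0, 0.0]]), np.array([BIG])   # i ~ 1: always write
Wc, bc = np.array([[BIG, -BIG]]), np.array([0.0])  # candidate ~ +1 on a, -1 on b

c = np.zeros(1)
for ch in "aaaabb":
    x = np.array([1.0, 0.0]) if ch == "a" else np.array([0.0, 1.0])
    f = sigmoid(Wf @ x + bf)
    i = sigmoid(Wi @ x + bi)
    c_tilde = np.tanh(Wc @ x + bc)
    c = f * c + i * c_tilde     # c_t ~ c_{t-1} + 1 on 'a', - 1 on 'b'
print(c)  # ~[2.0]: four a's minus two b's
```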

SLIDE 35

Other Architectures

SRNN: ht = σh(Whxt + Uhht−1 + bh) ∈ (0,1) (bounded!)

IRNN: ht = max(0, Whxt + Uhht−1 + bh)

An IRNN (ReLU) unit can keep or reset its value and add +0 or +1 per step; decrementing is simulated by subtracting a second, increasing counter maintained in parallel. The IRNN can count! (See the sketch below.)

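A minimal sketch of the parallel-counter trick (our own illustrative weights): one ReLU unit counts the a's, a second counts the b's, and their difference plays the role of a decrementable counter:

```python
import numpy as np

# IRNN update h_t = max(0, W x_t + U h_{t-1} + b) with hand-set weights.
# Inputs are one-hot: a = [1,0], b = [0,1].
Wh = np.array([[1.0, 0.0],    # unit 0: +1 for every 'a'
               [0.0, 1.0]])   # unit 1: +1 for every 'b'
Uh = np.eye(2)                # keep the previous value
bh = np.zeros(2)

h = np.zeros(2)
for ch in "aaaabb":
    x = np.array([1.0, 0.0]) if ch == "a" else np.array([0.0, 1.0])
    h = np.maximum(0.0, Wh @ x + Uh @ h + bh)
print(h, "count =", h[0] - h[1])  # [4. 2.] count = 2.0
```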

SLIDE 39

So:

  • LSTM can count!
  • GRU cannot
  • Counting gives greater computational power


SLIDE 41

Empirically

Trained LSTM and GRU on aⁿbⁿ (on positive examples up to length 100).

Activations on a¹⁰⁰⁰b¹⁰⁰⁰: (activation plots for LSTM and GRU)

GRU:

  • Took much longer to train
  • Did not generalise even within the training domain
  • Begins failing at n = 39 (vs. 257 for the LSTM)
  • Did not learn any discernible counting mechanism



SLIDE 45

Empirically

Trained LSTM and GRU on aⁿbⁿcⁿ (on positive examples up to length 50).

Activations on a¹⁰⁰b¹⁰⁰c¹⁰⁰: (activation plots for LSTM and GRU)

GRU:

  • Took much longer to train
  • Did not generalise well
  • Begins failing at n = 9 (vs. 101 for the LSTM)
  • Did not learn any discernible counting mechanism


SLIDE 47

Conclusion

(Diagram: GRU and SRNN vs. LSTM and IRNN, compared on trainability and practical expressivity)

SLIDE 48

Take Home Message

Don't fall into the Turing tarpit!

Architectural choices matter, and result in actual differences in expressive power.

SLIDE 49

Thank You

GitHub repository:

https://github.com/tech-srl/counting_dimensions

Google Colab (link through GitHub as well):

https://tinyurl.com/ybjkumrz