SLIDE 1

Supported by European Union’s Seventh Framework Programme (FP7) under grant agreement no. 615688 (PRIME)

On the Practical Computational Power of Finite Precision RNNs for Language Recognition

Gail Weiss, Yoav Goldberg, Eran Yahav

GRU < LSTM (!?)

SLIDE 2

Current State

  • RNNs are everywhere
  • We don't know much about the differences between them:
  • Gated RNNs are shown to train better; beyond that:
  • "RNNs are Turing Complete"?

SLIDE 3

Turing Complete?


SLIDE 5

Turing Complete?

Unreasonable assumptions! A Turing tarpit!

1993 Proof:

  • 1. Requires infinite precision:
    Uses stacks, maintained in certain dimensions.
    Zeros are pushed using division (g = g/4 + 1/4).
    In 32-bit floats, this reaches the precision limit after 15 pushes.

  • 2. Requires infinite time:
    Allows processing steps after the input has been read (not the standard use case!).
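To make the precision point concrete, here is a minimal sketch (ours, not part of the original proof) of the quoted push rule in 32-bit floats:

```python
import numpy as np

# Minimal sketch of the precision argument: repeatedly "push a zero"
# with the rule g = g/4 + 1/4 and watch float32 saturate near the
# fixed point 1/3, after which further pushes change nothing.
g = np.float32(0.0)
for push in range(1, 40):
    new_g = np.float32(g / 4 + 1 / 4)
    if new_g == g:  # the new stack state is no longer distinguishable
        print(f"float32 saturates after {push - 1} pushes (g = {g})")
        break
    g = new_g
```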

SLIDE 6

What happens on real hardware and real use-cases?

SLIDE 7

Real Use

  • Gated architectures have the best performance
  • LSTM and GRU are the most popular
  • The choice between the two is unclear

SLIDE 8

Main Result

  • We accept that all RNN types can simulate DFAs
  • We show that LSTMs and IRNNs can also count
  • ...and that the GRU and SRNN cannot


SLIDE 10

Power of Counting

  • Practical: in NMT, the LSTM is better at capturing target length
  • Theoretical: finite state machines vs. counter machines

SLIDE 11

K-Counter Machines (SKCMs)

Fischer, Meyer, Rosenberg (1968)

  • Similar to finite automata, but also maintain k counters
  • A counter has 4 operations: increment by one, decrement by one, do nothing, reset
  • Counters are observed by comparison to zero
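As an illustration (our own minimal sketch, not the formal SKCM definition), a single counter plus a two-state automaton suffices for aⁿbⁿ:

```python
# A minimal one-counter machine for a^n b^n (a hypothetical sketch):
# increment on 'a', decrement on 'b', and observe the counter only
# by comparing it to zero.
def accepts_anbn(word: str) -> bool:
    counter = 0
    seen_b = False            # finite-state part: phase "a*" vs "b*"
    for ch in word:
        if ch == "a":
            if seen_b:        # an 'a' after a 'b': rejected by the FSA part
                return False
            counter += 1      # counter op: increment by one
        elif ch == "b":
            seen_b = True
            if counter == 0:  # zero-test: more b's than a's
                return False
            counter -= 1      # counter op: decrement by one
        else:
            return False
    return counter == 0       # accept iff the counter is back at zero

assert accepts_anbn("aaabbb") and not accepts_anbn("aabbb")
```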

SLIDE 12

Counter Machines and the Chomsky Hierarchy

Regular Languages (RL) ⊂ Context-Free Languages (CFL) ⊂ Context-Sensitive Languages (CSL) ⊂ Recursively Enumerable Languages (RE)


SLIDE 17

Chomsky Hierarchy and SKCMs

(Diagram: RL ⊂ CFL ⊂ CSL ⊂ RE, with aⁿbⁿ and palindromes placed among the CFLs and aⁿbⁿcⁿ among the CSLs)

SKCMs recognise aⁿbⁿ and aⁿbⁿcⁿ but not palindromes: SKCMs cross the Chomsky Hierarchy!

SLIDE 18

Summary so Far

  • Counters give additional formal power
  • We claimed that LSTM can count and GRU cannot
  • Let's see why


SLIDE 20

Popular Architectures

GRU:

  Gates: zt = σ(Wzxt + Uzht−1 + bz), rt = σ(Wrxt + Urht−1 + br)
  Candidate vector: h̃t = tanh(Whxt + Uh(rt ∘ ht−1) + bh)
  Update function: ht = zt ∘ ht−1 + (1 − zt) ∘ h̃t

LSTM:

  Gates: ft = σ(Wfxt + Ufht−1 + bf), it = σ(Wixt + Uiht−1 + bi), ot = σ(Woxt + Uoht−1 + bo)
  Candidate vector: c̃t = tanh(Wcxt + Ucht−1 + bc)
  Update functions: ct = ft ∘ ct−1 + it ∘ c̃t, ht = ot ∘ g(ct)

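For reference, a direct NumPy transcription of these update equations (a sketch; the parameter dictionary p and its key names are our own convention, not an existing API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step; p maps names like 'Wz' to weight/bias arrays."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    return z * h_prev + (1 - z) * h_tilde                   # interpolation

def lstm_step(x, h_prev, c_prev, p, g=np.tanh):
    """One LSTM step; returns (h_t, c_t)."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde                            # addition
    return o * g(c), c
```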

SLIDE 22

Popular Architectures

GRU:

  zt ∈ (0,1), rt ∈ (0,1), h̃t ∈ (−1,1)
  ht = zt ∘ ht−1 + (1 − zt) ∘ h̃t (an interpolation of bounded values: ht stays bounded!)

LSTM:

  ft ∈ (0,1), it ∈ (0,1), ot ∈ (0,1), c̃t ∈ (−1,1)
  ct = ft ∘ ct−1 + it ∘ c̃t (an addition: ct is not bounded)
  ht = ot ∘ g(ct)
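Spelling out why interpolation implies boundedness (a one-line induction from h₀ = 0, added here for completeness):

```latex
% Every coordinate of h_t is a convex combination of the corresponding
% coordinate of h_{t-1} (in (-1,1) by induction) and of the tanh
% candidate (in (-1,1) by construction), so h_t never leaves (-1,1):
\[
  |h_t| = \bigl| z_t\, h_{t-1} + (1 - z_t)\,\tilde h_t \bigr|
        \le z_t\,|h_{t-1}| + (1 - z_t)\,|\tilde h_t|
        < z_t \cdot 1 + (1 - z_t) \cdot 1 = 1 .
\]
```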

SLIDE 29

Popular Architectures

LSTM, addition: with ft ≈ 1 and it ≈ 1,

  ct ≈ ct−1 + c̃t, c̃t ∈ (−1,1)

SLIDE 30

Popular Architectures

LSTM, increase by 1: with ft ≈ 1, it ≈ 1 and c̃t ≈ 1,

  ct ≈ ct−1 + 1

SLIDE 31

Popular Architectures

LSTM, decrease by 1: with ft ≈ 1, it ≈ 1 and c̃t ≈ −1,

  ct ≈ ct−1 − 1

SLIDE 32

Popular Architectures

LSTM, do nothing: with ft ≈ 1 and it ≈ 0,

  ct ≈ ct−1

SLIDE 33

Popular Architectures

LSTM, reset: with ft ≈ 0 and it ≈ 0,

  ct ≈ 0

With increment, decrement, do nothing, and reset, the LSTM cell implements a counter: the LSTM can count! The GRU state, an interpolation of bounded values, remains bounded. (A hand-set counting example follows below.)

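A hand-set, single-unit example of this construction (a sketch with our own illustrative weights; the recurrent weights and the output gate are zeroed out or omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One LSTM memory cell counting (#a - #b).  Inputs are one-hot:
# a = [1,0], b = [0,1].  Large logits saturate the gates to ~0/1
# and the candidate to ~+1/-1, realising the operations above.
BIG = 20.0
Wf, bf = np.array([[0.0, 0.0]]), np.array([BIG])   # f ~ 1: never forget
Wi, bi = np.array([[0.0, 0.0]]), np.array([BIG])   # i ~ 1: always write
Wc, bc = np.array([[BIG, -BIG]]), np.array([0.0])  # candidate ~ +1 on a, -1 on b

c = np.zeros(1)
for ch in "aaaabb":
    x = np.array([1.0, 0.0]) if ch == "a" else np.array([0.0, 1.0])
    f = sigmoid(Wf @ x + bf)
    i = sigmoid(Wi @ x + bi)
    c_tilde = np.tanh(Wc @ x + bc)
    c = f * c + i * c_tilde     # c_t ~ c_{t-1} + 1 on 'a', - 1 on 'b'
print(c)  # ~[2.0]: four a's minus two b's
```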

SLIDE 35

Other Architectures

SRNN: ht = σh(Whxt + Uhht−1 + bh) ∈ (0,1) (bounded!)

IRNN: ht = max(0, Whxt + Uhht−1 + bh)

An IRNN (ReLU) unit can keep or reset its value and add +0 or +1 per step; decrementing is simulated by subtracting a second, increasing counter maintained in parallel. The IRNN can count! (See the sketch below.)

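A minimal sketch of the parallel-counter trick (our own illustrative weights): one ReLU unit counts the a's, a second counts the b's, and their difference plays the role of a decrementable counter:

```python
import numpy as np

# IRNN update h_t = max(0, W x_t + U h_{t-1} + b) with hand-set weights.
# Inputs are one-hot: a = [1,0], b = [0,1].
Wh = np.array([[1.0, 0.0],    # unit 0: +1 for every 'a'
               [0.0, 1.0]])   # unit 1: +1 for every 'b'
Uh = np.eye(2)                # keep the previous value
bh = np.zeros(2)

h = np.zeros(2)
for ch in "aaaabb":
    x = np.array([1.0, 0.0]) if ch == "a" else np.array([0.0, 1.0])
    h = np.maximum(0.0, Wh @ x + Uh @ h + bh)
print(h, "count =", h[0] - h[1])  # [4. 2.] count = 2.0
```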

SLIDE 39

So:

  • LSTM can count!
  • GRU cannot
  • Counting gives greater computational power


SLIDE 41

Empirically

Trained LSTM and GRU on aⁿbⁿ (on positive examples up to length 100).

Activations on a¹⁰⁰⁰b¹⁰⁰⁰: (activation plots for LSTM and GRU)

GRU:

  • Took much longer to train
  • Did not generalise even within the training domain
  • Begins failing at n = 39 (vs. 257 for the LSTM)
  • Did not learn any discernible counting mechanism



SLIDE 45

Empirically

Trained LSTM and GRU on aⁿbⁿcⁿ (on positive examples up to length 50).

Activations on a¹⁰⁰b¹⁰⁰c¹⁰⁰: (activation plots for LSTM and GRU)

GRU:

  • Took much longer to train
  • Did not generalise well
  • Begins failing at n = 9 (vs. 101 for the LSTM)
  • Did not learn any discernible counting mechanism


SLIDE 47

Conclusion

(Diagram: GRU and SRNN vs. LSTM and IRNN, compared on trainability and practical expressivity)

SLIDE 48

Take Home Message

Don't fall into the Turing tarpit!

Architectural choices matter, and result in actual differences in expressive power.

SLIDE 49

Thank You

GitHub repository:

https://github.com/tech-srl/counting_dimensions

Google Colab (link through GitHub as well):

https://tinyurl.com/ybjkumrz