SLIDE 1 (1/32)

Differential Categories, Recurrent Neural Networks, and Machine Learning

Shin-ya Katsumata and David Sprunger*
National Institute of Informatics, Tokyo
SYCO 4, Chapman University, May 23, 2019

SLIDE 2 (2/32)

Outline

1. Feedforward neural networks
2. Recurrent neural networks
3. Cartesian differential categories
4. Stateful computations / functions
5. Lifting Cartesian differential structure to stateful functions

SLIDES 3-5 (3/32)

Overview of neural networks

A neural network is a function with two types of arguments: data inputs and parameters. Data come from the environment; parameters are controlled by us. As a string diagram, φ has a parameter port θ ∈ R^k and a data input port x ∈ R^n, and produces an output y ∈ R^m.

Training a neural network means finding θ* : 1 → R^k so that φ with its parameter port plugged by θ* has a desired property. Usually, this means minimizing inaccuracy, as measured by composing φ(θ*, x̂_i) with E against ŷ_i, where (x̂_i, ŷ_i) : 1 → R^{n+m} are given input-output pairs and E : R^m × R^m → R is a given error function.
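Below is a minimal sketch of this setup in code, assuming JAX; the single tanh layer, the parameter shapes, and the squared-error E are illustrative choices, not taken from the talk.

```python
import jax.numpy as jnp

def phi(theta, x):
    # phi : R^k x R^n -> R^m, with theta packing a weight matrix and a bias
    W, b = theta
    return jnp.tanh(W @ x + b)  # output y in R^m

def E(y, y_hat):
    # error function E : R^m x R^m -> R (here: squared error)
    return jnp.sum((y - y_hat) ** 2)
```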

SLIDES 6-8 (4/32)

Overview of neural networks

Gradient-based training algorithms utilize the insight that the gradient of the composite function R^k → R (feed a training input x̂_i through φ, then compare the output with ŷ_i using E) tells us how to modify θ in order to decrease the error quickest.

Backpropagation is an algorithm that finds gradients (or derivatives) of functions f : R^n → R^m, and is often used due to its performance when n ≫ m.

Backprop generates a hint about which direction to change θ, but the trainer determines how this hint is used.
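A sketch of that hint and one way a trainer can use it (plain gradient descent), reusing the hypothetical phi and E from the sketch above; the learning rate is an arbitrary choice.

```python
import jax

def loss(theta, x_hat, y_hat):
    # the composite R^k -> R whose gradient backprop computes
    return E(phi(theta, x_hat), y_hat)

grad_theta = jax.grad(loss)  # reverse-mode AD: many parameters, one scalar out

def sgd_step(theta, x_hat, y_hat, lr=1e-2):
    W, b = theta
    dW, db = grad_theta(theta, x_hat, y_hat)
    # the trainer's policy: move a small step against the gradient
    return (W - lr * dW, b - lr * db)
```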

SLIDE 9 (5/32)

Outline

1. Feedforward neural networks
2. Recurrent neural networks
3. Cartesian differential categories
4. Stateful computations / functions
5. Lifting Cartesian differential structure to stateful functions

SLIDES 10-11 (6/32)

Recurrent neural networks

Recurrent neural networks (RNNs) are able to process variable-length inputs using state, which is stored in registers. In the string diagram for an RNN Ψ, a step map ψ has its state wire fed back through a register initialized at i, with data input x and output y.

A common semantics of RNNs uses the unrollings of the network: U_0Ψ applies one copy of ψ to (i, x_0) to produce y_0; U_1Ψ threads the state through two copies of ψ, mapping (x_0, x_1) to (y_0, y_1); U_2Ψ uses three copies; and so on (diagrams in the original).

SLIDES 12-15 (7/32)

Backpropagation through time

Infinite-dimensional derivatives for sequence-to-sequence functions must be approximated to be computationally useful.

Backpropagation through time (BPTT): whenever the derivative of Ψ is needed at an input of length k + 1, the derivative of U_kΨ is used instead.

This is a good way to generate hints, but it opens some questions:

1. U_k(Ψ ∘ Φ) = U_kΨ ∘ U_kΦ. Did we lose the chain rule? What properties of derivatives hold for BPTT?
2. U_kΨ and U_{k+1}Ψ have a lot in common, so their derivatives should as well. Is there a more compact representation for the derivative of Ψ than a sequence of functions? (A code sketch of the BPTT recipe follows this list.)
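The BPTT recipe itself can be sketched directly, reusing the hypothetical unroll from above: differentiate the finite unrolling as an ordinary feedforward function. The squared-error loss is an illustrative assumption.

```python
import jax
import jax.numpy as jnp

def bptt_grad(params, i, xs, ys_target):
    # derivative of Psi at an input of length k+1, taken as the
    # derivative of U_k(Psi)
    def loss(p):
        ys = unroll(p, i, xs)
        return jnp.sum((ys - ys_target) ** 2)
    return jax.grad(loss)(params)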

SLIDES 16-18 (8/32)

Understanding BPTT with category theory

The project: start in a category with some notion of derivative, add a mechanism for state, and extend the original notion of differentiation to the stateful setting. Two main parts:

1. Adding state to computations
   - Relatively common, back to Katis, Sabadini, & Walters '97
   - Digital circuits: Ghica & Jung '16
   - Signal flow graphs: Bonchi, Sobociński, & Zanasi '14
2. Differentiation for stateful computations
   - Not so common
   - Cartesian differential categories: Blute, Cockett, Seely '09
   - (Backprop as Functor: Fong, Spivak, Tuyéras '17)
   - (Simple Essence of Automatic Differentiation: Elliott '18)

SLIDE 19 (9/32)

Outline

1. Feedforward neural networks
2. Recurrent neural networks
3. Cartesian differential categories
4. Stateful computations / functions
5. Lifting Cartesian differential structure to stateful functions

SLIDE 20 (10/32)

Cartesian differential categories [Blute, Cockett, Seely '09]

A Cartesian differential category has a differential operation on morphisms sending f : X → Y to Df : X × X → Y, satisfying seven axioms:

CD1. Ds = s for s ∈ {id_X, σ_X,Y, !_X, ∆_X, 0_X, +_X}.
CD2-CD5. (string diagrams in the original) These express that Df is additive in its tangent argument, that D respects pairing of maps, and that composites obey the chain rule, with D(g ∘ f) built from Dg, Df, and f.

SLIDES 21-22 (11/32)

Cartesian differential category axioms, continued

CD6-CD7. (string diagrams in the original) Df is linear in its tangent argument, and the two ways of forming the mixed second derivative DDf agree (partial derivatives commute).

Example

Objects of the category Euc∞ are R^n for n ∈ N; maps are smooth maps between them. Euc∞ is a Cartesian differential category with the (curried) Jacobian sending f : R^n → R^m to Df : (∆x, x) ↦ Jf|_x · ∆x.
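This example is easy to realize in code; here is a sketch assuming JAX, whose jvp computes exactly a Jacobian-vector product. The concrete map f is a made-up illustration.

```python
import jax
import jax.numpy as jnp

def D(f):
    # Df : (dx, x) |-> Jf|_x . dx, following the talk's argument order
    return lambda dx, x: jax.jvp(f, (x,), (dx,))[1]

# a smooth map f : R^2 -> R^2
f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[0])])
x, dx = jnp.array([2.0, 3.0]), jnp.array([1.0, 0.0])
print(D(f)(dx, x))  # first column of Jf at x: [3.0, cos(2.0)]
```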

SLIDES 23-25 (12/32)

Differentiating the unrollings of a simple RNN

(String-diagram calculations in the original.) Applying D to U_0Ψ yields a single copy of Dψ fed by the initial state i. Applying D to U_1Ψ and then to U_2Ψ yields unrollings in which each stage runs Dψ alongside copies of ψ that recompute the states feeding it.

SLIDES 26-27 (13/32)

Differentiating the unrollings of a simple RNN, continued

The pattern in D(U_2Ψ) suggests a hypothesis: the derivative of the stateful function Ψ should itself be presentable as a stateful function D*Ψ, whose step runs D*ψ alongside a state-recomputing copy of ψ, with the register initialized at i (diagram in the original).
slide-28
SLIDE 28

14/32

Outline

1

Feedforward neural networks

2

Recurrent neural networks

3

Cartesian differential categories

4

Stateful computations / functions

5

Lifting Cartesian differential structure to stateful functions

SLIDES 29-30 (15/32)

Stateful computations

Let (C, ×, 1) be a strict Cartesian category, whose morphisms we think of as stateless functions. A stateful sequence computation is an initial state i : 1 → S_0 together with a sequence of maps Ψ_n : S_n × X_n → S_{n+1} × Y_n, drawn as a chain of boxes passing state from one stage to the next (diagram in the original).

(This is a sequence of 2-cells in a double category based on C, with a restriction on the first 2-cell.)
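One way to render this definition concretely, with Python functions standing in for morphisms of C; this encoding is our illustrative assumption, not the talk's.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class StatefulComputation:
    init: Any                # i : 1 -> S_0
    steps: List[Callable]    # Psi_n : S_n x X_n -> S_{n+1} x Y_n
                             # (a finite prefix of the infinite sequence)

    def run(self, xs) -> List[Any]:
        s, ys = self.init, []
        for step, x in zip(self.steps, xs):
            s, y = step(s, x)   # thread the state, emit one output
            ys.append(y)
        return ys
```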

SLIDE 31 (16/32)

Stateful functions

Two computation sequences might have different state spaces and still compute the same function. For example, the two computation sequences drawn in the original compute the same function, one using the X_n themselves as state spaces and the other carrying an extra component S.

SLIDE 32 (17/32)

Stateful functions

The nth truncation of a computation sequence is the vertical composite of its first n + 1 steps, a single morphism X_0 × ··· × X_n → Y_0 × ··· × Y_n in C (diagram in the original).

Definition

Two computation sequences are extensionally equivalent when they have the same nth truncation for all n ∈ N. A stateful (sequence) function is an extensional equivalence class of computation sequences.
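A sketch of the nth truncation for the hypothetical encoding above; extensional equivalence is then agreement of these induced morphisms for every n.

```python
from typing import Callable

def truncation(comp: StatefulComputation, n: int) -> Callable:
    # the induced stateless morphism X_0 x ... x X_n -> Y_0 x ... x Y_n
    def morphism(xs):
        return tuple(comp.run(list(xs[: n + 1])))
    return morphism
```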

SLIDE 33 (18/32)

Stateful functions

Definition

If C is a strict Cartesian category, then its stateful sequence extension is a category St(C) where
- objects are infinite sequences of objects in C, and
- morphisms are stateful functions Ψ : X → Y.

SLIDE 34 (19/32)

Example computation sequences

Here is the delay register initialized at i as a computation sequence: every state space is X, the first step outputs i while storing x_0, and each later step outputs the stored value while storing its input (diagram in the original).

SLIDES 35-36 (20/32)

Example computation sequences

Here is the stateless map × : X × X → X lifted to a computation sequence with trivial state spaces (every S_n = 1), and here is × with one input routed through a delay register as a computation sequence (diagrams in the original).

SLIDE 37 (21/32)

Delayed trace

The loop-with-delay-gate is a trace-like operation: given ψ : S × X → S × Y and an initial value i : 1 → S, it produces dtr^S_i(ψ) : X → Y by feeding the state output back to the state input through a one-step delay register initialized at i.

It satisfies most of the trace axioms but misses two: yanking and dinaturality (drawn for the regular trace in the original).
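A sketch of the delayed trace on stream functions, assuming the same Python encoding as before.

```python
from typing import Any, Callable

def dtr(psi: Callable, i: Any) -> Callable:
    # dtr^S_i(psi) : X-stream -> Y-stream, closing the state loop of
    # psi : S x X -> S x Y through a register initially holding i
    def traced(xs):
        s, ys = i, []
        for x in xs:
            s, y = psi(s, x)   # the new state reaches psi one step later
            ys.append(y)
        return ys
    return traced
```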

SLIDE 38 (22/32)

Dinaturality → retiming

(String diagrams in the original.) Dinaturality does not hold on the nose; instead, sliding a morphism g around the feedback loop re-times it: moving g past the delay register shifts its action by one time step and changes the initial value from i to g(i).

SLIDE 39 (23/32)

Yanking → delay

(String diagrams in the original.) Yanking also fails: applying the delayed trace to the symmetry map yields not the identity but the one-step delay initialized at i.

SLIDE 40 (24/32)

Outline

1. Feedforward neural networks
2. Recurrent neural networks
3. Cartesian differential categories
4. Stateful computations / functions
5. Lifting Cartesian differential structure to stateful functions

SLIDE 41 (25/32)

Differentiation for stateful functions

Let C be Cartesian differential with differential operator D. The following is a Cartesian differential operator on St(C): a stateful function with steps ψ : S × X → S′ × Y and initial state i is sent by D* to the stateful function whose steps run Dψ alongside a copy of ψ, with the register carrying the state together with its tangent (diagram in the original).
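A sketch of the lifted step in JAX, under the assumption that morphisms are JAX-differentiable functions: the lifted state is the pair (s, ds), the lifted input the pair (x, dx), and a single jvp call runs Dψ alongside ψ.

```python
import jax

def D_star_step(psi):
    # psi : S x X -> S' x Y; the lifted step carries tangents alongside
    def step(state, inp):
        (s, ds), (x, dx) = state, inp
        (s1, y), (ds1, dy) = jax.jvp(psi, (s, x), (ds, dx))
        return (s1, ds1), (y, dy)
    return step
```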

SLIDES 42-44 (26/32)

D* is a Cartesian differential operator

Proof idea for CD2 (string-diagram calculation in the original): plugging a zero tangent into the lifted derivative D*Φ, the zero propagates through each Dφ stage, so the tangent outputs vanish while the φ stages compute as before.

SLIDES 45-50 (27/32)

D* is a Cartesian differential operator

Proof idea for CD4, the axiom built from f, Df, and Dg, i.e. the chain rule (string-diagram calculation shown in incremental steps in the original): differentiating the composite of two stateful functions with steps φ and ψ and initial states i and j produces Dφ and Dψ stages wired together with copies of φ and ψ that recompute the intermediate values; chasing the marked wires (⋆) stage by stage shows the two sides agree.

SLIDES 51-52 (28/32)

Differentiating RNNs

When S = S′, the operator D* sends an RNN Ψ = (i, [ψ]) to another RNN: its step runs Dψ alongside a state-recomputing copy of ψ, with the register carrying the state and its tangent, confirming the earlier hypothesis (diagrams in the original).

Theorem (D* matches BPTT)

The unrolling of D*(i, [ψ]) is the component-wise application of D to the unrolling of (i, [ψ]) (after a zipping morphism).
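An illustrative numerical check of this theorem, reusing the hypothetical D_star_step from the earlier sketch; the toy step map and inputs are made up. Unrolling the lifted step gives the same tangents as differentiating the unrolling directly.

```python
import jax
import jax.numpy as jnp

toy_psi = lambda s, x: (jnp.tanh(s + x), s * x)  # toy step S x X -> S x Y

def unroll_toy(s0, xs):
    # U_k(Psi) as one ordinary function
    s, ys = s0, []
    for x in xs:
        s, y = toy_psi(s, x)
        ys.append(y)
    return jnp.stack(ys)

s0, xs = jnp.array(0.1), jnp.array([0.3, -0.2, 0.5])
ds0, dxs = jnp.array(0.0), jnp.array([1.0, 0.0, 0.0])

# component-wise D of the unrolling...
_, dys_unrolled = jax.jvp(unroll_toy, (s0, xs), (ds0, dxs))

# ...agrees with running the D*-lifted step
step, state, dys = D_star_step(toy_psi), (s0, ds0), []
for x, dx in zip(xs, dxs):
    state, (_, dy) = step(state, (x, dx))
    dys.append(dy)
print(jnp.allclose(dys_unrolled, jnp.stack(dys)))  # True
```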

SLIDES 53-54 (29/32)

Future directions

Several obstacles prevent us from applying these ideas in practice right away:

1. Use non-smooth, partly differentiable, or partial functions?
2. Can we get a transpose?
3. Better ways to represent non-mutable state?

There are some questions related to the theory that we would like to understand better:

1. Categorical properties of St(−)?
2. Bisimulations and extensional equality?
3. D* and infinite-dimensional derivatives?
4. Basic results for delayed trace categories?
5. Other data shapes (trees, distributions, ...)?

SLIDES 55-57 (30/32)

Summary

1. St(−) preserves Cartesian differential category structure.
2. This notion of differentiation is connected to BPTT.
3. Cartesian differential categories are a useful tool for organizing unusual derivatives.
4. Machine learning needs compositional thinkers.

SLIDE 58 (31/32)

Thanks!

SLIDE 59 (32/32)

References

R.F. Blute, J.R.B. Cockett, and R.A.G. Seely. Cartesian differential categories. Theory and Applications of Categories, 22(23):622–672, 2009.

F. Bonchi, P. Sobociński, and F. Zanasi. A categorical semantics of signal flow graphs. In CONCUR 2014, 2014.

C. Elliott. The simple essence of automatic differentiation. PACMPL, 2(ICFP):70:1–70:29, 2018.

B. Fong, D. Spivak, and R. Tuyéras. Backprop as functor: A compositional perspective on supervised learning. arxiv.org/abs/1711.10455, 2017.

D.R. Ghica and A. Jung. Categorical semantics of digital circuits. FMCAD '16, Austin, TX, 2016.

P. Katis, N. Sabadini, and R.F.C. Walters. Bicategories of processes. Journal of Pure and Applied Algebra, 115(2):141–178, Feb 1997.

David Sprunger and Shin-ya Katsumata. Differentiable causal computations via delayed trace. CoRR, abs/1903.01093, 2019.