SLIDE 1

AMMI – Introduction to Deep Learning 4.1. DAG networks

François Fleuret
https://fleuret.org/ammi-2018/
Wed Aug 29 16:57:27 CAT 2018

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

SLIDE 2

Everything we have seen for an MLP

[Diagram: x → × w(1) → + b(1) → σ → × w(2) → + b(2) → σ → f(x)]

can be generalized to an arbitrary “Directed Acyclic Graph” (DAG) of operators

[Diagram: a DAG in which x feeds φ(1); x and the output of φ(1) feed φ(2); the outputs of φ(1) and φ(2) feed φ(3), which produces f(x); w(1) parametrizes φ(1) and φ(3), w(2) parametrizes φ(2)]


SLIDE 3

Remember that we use tensorial notation. If $(a_1, \ldots, a_Q) = \phi(b_1, \ldots, b_R)$, we have

$$\frac{\partial a}{\partial b} = J_\phi = \begin{pmatrix}
\frac{\partial a_1}{\partial b_1} & \cdots & \frac{\partial a_1}{\partial b_R} \\
\vdots & \ddots & \vdots \\
\frac{\partial a_Q}{\partial b_1} & \cdots & \frac{\partial a_Q}{\partial b_R}
\end{pmatrix}.$$

This notation does not specify at which point this is computed. It will always be for the forward-pass activations.

Also, if $(a_1, \ldots, a_Q) = \phi(b_1, \ldots, b_R, c_1, \ldots, c_S)$, we use

$$\frac{\partial a}{\partial c} = J_{\phi|c} = \begin{pmatrix}
\frac{\partial a_1}{\partial c_1} & \cdots & \frac{\partial a_1}{\partial c_S} \\
\vdots & \ddots & \vdots \\
\frac{\partial a_Q}{\partial c_1} & \cdots & \frac{\partial a_Q}{\partial c_S}
\end{pmatrix}.$$


SLIDE 4

Forward pass

[Diagram: x(0) = x → φ(1) → x(1); (x(0), x(1)) → φ(2) → x(2); (x(1), x(2)) → φ(3) → x(3) = f(x); w(1) parametrizes φ(1) and φ(3), w(2) parametrizes φ(2)]

$$x^{(0)} = x$$
$$x^{(1)} = \phi^{(1)}\left(x^{(0)}; w^{(1)}\right)$$
$$x^{(2)} = \phi^{(2)}\left(x^{(0)}, x^{(1)}; w^{(2)}\right)$$
$$f(x) = x^{(3)} = \phi^{(3)}\left(x^{(1)}, x^{(2)}; w^{(1)}\right)$$
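As a minimal sketch (added here, not in the slides), this forward pass can be written in plain NumPy, using the concrete operators that the TensorFlow example later in the deck assigns to φ(1), φ(2) and φ(3):

import numpy as np

rng = np.random.default_rng(0)
w1 = rng.standard_normal((5, 5))
w2 = rng.standard_normal((5, 5))
x = rng.standard_normal((5, 1))

x0 = x                  # x(0) = x
x1 = w1 @ x0            # x(1) = phi(1)(x(0); w(1)) = w(1) x(0)
x2 = x0 + w2 @ x1       # x(2) = phi(2)(x(0), x(1); w(2)) = x(0) + w(2) x(1)
x3 = w1 @ (x1 + x2)     # f(x) = x(3) = phi(3)(x(1), x(2); w(1)) = w(1)(x(1) + x(2))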


SLIDE 5

Backward pass, derivatives w.r.t. activations

[Same DAG diagram as above]

$$\frac{\partial \ell}{\partial x^{(2)}} = \left(\frac{\partial x^{(3)}}{\partial x^{(2)}}\right)^{\top} \frac{\partial \ell}{\partial x^{(3)}} = J_{\phi^{(3)}|x^{(2)}}^{\top} \frac{\partial \ell}{\partial x^{(3)}}$$

$$\frac{\partial \ell}{\partial x^{(1)}} = \left(\frac{\partial x^{(2)}}{\partial x^{(1)}}\right)^{\top} \frac{\partial \ell}{\partial x^{(2)}} + \left(\frac{\partial x^{(3)}}{\partial x^{(1)}}\right)^{\top} \frac{\partial \ell}{\partial x^{(3)}} = J_{\phi^{(2)}|x^{(1)}}^{\top} \frac{\partial \ell}{\partial x^{(2)}} + J_{\phi^{(3)}|x^{(1)}}^{\top} \frac{\partial \ell}{\partial x^{(3)}}$$

$$\frac{\partial \ell}{\partial x^{(0)}} = \left(\frac{\partial x^{(1)}}{\partial x^{(0)}}\right)^{\top} \frac{\partial \ell}{\partial x^{(1)}} + \left(\frac{\partial x^{(2)}}{\partial x^{(0)}}\right)^{\top} \frac{\partial \ell}{\partial x^{(2)}} = J_{\phi^{(1)}|x^{(0)}}^{\top} \frac{\partial \ell}{\partial x^{(1)}} + J_{\phi^{(2)}|x^{(0)}}^{\top} \frac{\partial \ell}{\partial x^{(2)}}$$
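Continuing the NumPy sketch from the forward pass (added, not from the slides), with the loss ℓ = ‖x(3)‖ used in the TensorFlow example below:

# Backward pass w.r.t. activations, for l = ||x(3)||.
g_x3 = x3 / np.linalg.norm(x3)      # dl/dx(3)
g_x2 = w1.T @ g_x3                  # through phi(3): J_{phi(3)|x(2)} = w(1)
g_x1 = w2.T @ g_x2 + w1.T @ g_x3    # through phi(2) and phi(3)
g_x0 = w1.T @ g_x1 + g_x2           # through phi(1) and phi(2) (J_{phi(2)|x(0)} = I)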

SLIDE 6

Backward pass, derivatives w.r.t. parameters

[Same DAG diagram as above]

$$\frac{\partial \ell}{\partial w^{(1)}} = \left(\frac{\partial x^{(1)}}{\partial w^{(1)}}\right)^{\top} \frac{\partial \ell}{\partial x^{(1)}} + \left(\frac{\partial x^{(3)}}{\partial w^{(1)}}\right)^{\top} \frac{\partial \ell}{\partial x^{(3)}} = J_{\phi^{(1)}|w^{(1)}}^{\top} \frac{\partial \ell}{\partial x^{(1)}} + J_{\phi^{(3)}|w^{(1)}}^{\top} \frac{\partial \ell}{\partial x^{(3)}}$$

$$\frac{\partial \ell}{\partial w^{(2)}} = \left(\frac{\partial x^{(2)}}{\partial w^{(2)}}\right)^{\top} \frac{\partial \ell}{\partial x^{(2)}} = J_{\phi^{(2)}|w^{(2)}}^{\top} \frac{\partial \ell}{\partial x^{(2)}}$$
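Continuing the same NumPy sketch (added, not from the slides), for the concrete operators these reduce to outer products of the activation gradients with the inputs of each use:

# Parameter gradients; w(1) accumulates contributions from both of its
# uses, in phi(1) and phi(3).
g_w1 = g_x1 @ x0.T + g_x3 @ (x1 + x2).T   # dl/dw(1)
g_w2 = g_x2 @ x1.T                        # dl/dw(2)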

SLIDE 7

So if we have a library of "tensor operators", and implementations of

$$(x_1, \ldots, x_d, w) \to \phi(x_1, \ldots, x_d; w)$$
$$\forall c, \; (x_1, \ldots, x_d, w) \to J_{\phi|x_c}(x_1, \ldots, x_d; w)$$
$$(x_1, \ldots, x_d, w) \to J_{\phi|w}(x_1, \ldots, x_d; w),$$

we can build an arbitrary directed acyclic graph with these operators at the nodes, compute the response of the resulting mapping, and compute its gradient with back-prop.
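A toy sketch of such a triple of implementations (added here as an illustration, not from the slides), for a hypothetical elementwise operator φ(x; w) = w ⊙ x, whose Jacobians are both diagonal:

import numpy as np

def phi(x, w):        # forward map
    return w * x

def J_phi_x(x, w):    # Jacobian w.r.t. the input x
    return np.diag(w)

def J_phi_w(x, w):    # Jacobian w.r.t. the parameter w
    return np.diag(x)

# Back-prop through this node multiplies the incoming gradient by the
# transposed Jacobians, evaluated at the forward-pass activations.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
g_out = np.ones(3)                 # dl/dphi coming from downstream
g_x = J_phi_x(x, w).T @ g_out      # dl/dx
g_w = J_phi_w(x, w).T @ g_out      # dl/dw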


SLIDE 8

Writing from scratch a large neural network is complex and error-prone. Multiple frameworks provide libraries of tensor operators and mechanisms to combine them into DAGs and automatically differentiate them.

Framework    Language(s)              License         Main backer
PyTorch      Python                   BSD             Facebook
Caffe2       C++, Python              Apache          Facebook
TensorFlow   Python, C++              Apache          Google
MXNet        Python, C++, R, Scala    Apache          Amazon
CNTK         Python, C++              MIT             Microsoft
Torch        Lua                      BSD             Facebook
Theano       Python                   BSD             U. of Montreal
Caffe        C++                      BSD 2 clauses   U. of CA, Berkeley

One approach is to define the nodes and edges of such a DAG statically (Torch, TensorFlow, Caffe, Theano, etc.)


SLIDE 9

In TensorFlow, to run a forward/backward pass on

[Same DAG diagram as above]

$$\phi^{(1)}\left(x^{(0)}; w^{(1)}\right) = w^{(1)} x^{(0)}$$
$$\phi^{(2)}\left(x^{(0)}, x^{(1)}; w^{(2)}\right) = x^{(0)} + w^{(2)} x^{(1)}$$
$$\phi^{(3)}\left(x^{(1)}, x^{(2)}; w^{(1)}\right) = w^{(1)} \left(x^{(1)} + x^{(2)}\right)$$

w1 = tf.Variable(tf.random_normal([5, 5]))
w2 = tf.Variable(tf.random_normal([5, 5]))
x = tf.Variable(tf.random_normal([5, 1]))

x0 = x
x1 = tf.matmul(w1, x0)         # phi(1)
x2 = x0 + tf.matmul(w2, x1)    # phi(2)
x3 = tf.matmul(w1, x1 + x2)    # phi(3)
q = tf.norm(x3)                # loss

gw1, gw2 = tf.gradients(q, [w1, w2])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _gw1, _gw2 = sess.run([gw1, gw2])
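For comparison (an added sketch, not in the slides), the same forward/backward pass in PyTorch, which builds the graph dynamically as the operations execute:

import torch

w1 = torch.randn(5, 5, requires_grad=True)
w2 = torch.randn(5, 5, requires_grad=True)
x = torch.randn(5, 1)

x0 = x
x1 = w1 @ x0             # phi(1)
x2 = x0 + w2 @ x1        # phi(2)
x3 = w1 @ (x1 + x2)      # phi(3)
q = torch.norm(x3)

gw1, gw2 = torch.autograd.grad(q, [w1, w2])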


SLIDE 10

Weight sharing


SLIDE 11

In our generalized DAG formulation, we have in particular implicitly allowed the same parameters to modulate different parts of the processing. For instance w(1) in our example parametrizes both φ(1) and φ(3).

[Same DAG diagram as above]

This is called weight sharing.
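A quick way to see this numerically (an added check, not from the slides, reusing the PyTorch sketch above): detaching one of the two uses of w(1) isolates the contribution of the other, and the two contributions sum to the full gradient.

import torch

w1 = torch.randn(5, 5, requires_grad=True)
w2 = torch.randn(5, 5)
x0 = torch.randn(5, 1)

def loss(w_in_phi1, w_in_phi3):
    x1 = w_in_phi1 @ x0            # phi(1) uses w(1)
    x2 = x0 + w2 @ x1              # phi(2)
    x3 = w_in_phi3 @ (x1 + x2)     # phi(3) uses w(1) again
    return torch.norm(x3)

g_full, = torch.autograd.grad(loss(w1, w1), [w1])
g_phi1, = torch.autograd.grad(loss(w1, w1.detach()), [w1])
g_phi3, = torch.autograd.grad(loss(w1.detach(), w1), [w1])
assert torch.allclose(g_full, g_phi1 + g_phi3, atol=1e-5)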


SLIDE 12

In particular, weight sharing makes it possible to build siamese networks, where a full sub-network is replicated several times.

[Diagram: two replicated sub-networks ψu and ψv, each computing × w(1) → + b(1) → σ → × w(2) → + b(2) → σ (with activations u(1), u(2) and v(1), v(2)); both copies share the parameters w(1), b(1), w(2), b(2), and a module φ combines their outputs into x(1)]

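A minimal PyTorch sketch of such a siamese architecture (added, not from the slides; the combining module φ here is a hypothetical squared distance):

import torch
from torch import nn

# One sub-network, replicated by simply calling it twice: both calls
# use the same w(1), b(1), w(2), b(2).
shared = nn.Sequential(
    nn.Linear(10, 5), nn.Sigmoid(),
    nn.Linear(5, 5), nn.Sigmoid(),
)

def siamese(xa, xb):
    u2, v2 = shared(xa), shared(xb)
    return ((u2 - v2) ** 2).sum(1)     # phi: squared distance (assumed)

xa, xb = torch.randn(4, 10), torch.randn(4, 10)
siamese(xa, xb).sum().backward()       # gradients accumulate in 'shared'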


SLIDE 13

The end