
SLIDE 1

CS 6956: Deep Learning for NLP

Neural Networks and Computation Graphs

Based on slides and material from Geoffrey Hinton, Richard Socher, Yoav Goldberg, and others. The computation graph slides are based on the tutorial “Practical Neural Networks for NLP” by Chris Dyer, Yoav Goldberg, and Graham Neubig at EMNLP 2016.

SLIDE 2

This lecture

  • What is a neural network?
  • Computation Graphs
  • Algorithms over computation graphs
    – The forward pass
    – The backward pass

SLIDE 3

Where are we?

  • What is a neural network?
    – A quick refresher
  • Computation Graphs
  • Algorithms over computation graphs
    – The forward pass
    – The backward pass

SLIDE 4

We have seen linear threshold units

features → dot product → threshold → prediction: sgn(wᵀx + b) = sgn(Σᵢ wᵢxᵢ + b)

Learning: various algorithms (perceptron, SVM, logistic regression, …); in general, minimize a loss.

But where do these input features come from? What if the features were outputs of another classifier?

SLIDE 5

Features from classifiers

SLIDE 6

Features from classifiers

SLIDE 7

Features from classifiers

Each of these connections has its own weight as well.

SLIDE 8

Features from classifiers

SLIDE 9

Features from classifiers

This is a two-layer feed-forward neural network.

SLIDE 10

Features from classifiers

[Figure labels: the output layer, the hidden layer, the input layer]

This is a two-layer feed-forward neural network. Think of the hidden layer as learning a good representation of the inputs.

SLIDE 11

Features from classifiers

The dot product followed by the threshold constitutes a neuron. There are five neurons in this picture (four in the hidden layer and one output). This is a two-layer feed-forward neural network.

SLIDE 12

But where do the inputs come from?

What if the inputs were the outputs of a classifier? (The input layer.) We can make a three-layer network… and so on.

SLIDE 13

Let us try to formalize this

SLIDE 14

Artificial neurons

Functions that very loosely mimic a biological neuron.

A neuron accepts a collection of inputs (a vector x) and produces an output by:

  – Applying a dot product with weights w and adding a bias b
  – Applying a (possibly non-linear) transformation called an activation

output = activation(wᵀx + b)

The figure shows a dot product followed by a threshold activation; other activations are possible.

SLIDE 15

Activation functions

Name of the neuron              Activation function: activation(z)
Linear unit                     z
Threshold/sign unit             sgn(z)
Sigmoid unit                    1 / (1 + exp(−z))
Rectified linear unit (ReLU)    max(0, z)
Tanh unit                       tanh(z)

output = activation(wᵀx + b)

Many more activation functions exist (sinusoid, sinc, Gaussian, polynomial, …). Activation functions are also called transfer functions.
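As a quick illustration, here is a minimal NumPy sketch of the activations in the table and of a single neuron computing activation(wᵀx + b); the function names and the choice of NumPy are ours, not the slides’.

```python
import numpy as np

# Activation functions from the table above.
def linear(z):  return z
def sign(z):    return np.sign(z)                 # threshold/sign unit
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)         # rectified linear unit

# A single neuron: output = activation(w^T x + b).
def neuron(w, x, b, activation=sigmoid):
    return activation(np.dot(w, x) + b)
```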

SLIDE 16

A neural network

A function that converts inputs to outputs, defined by a directed acyclic graph:

  – Nodes, organized in layers, correspond to neurons
  – Edges carry the output of one neuron to another, and are associated with weights

  • To define a neural network, we need to specify:
    – The structure of the graph
      • How many nodes, the connectivity
    – The activation function on each node
    – The edge weights

SLIDE 17

A neural network

A function that converts inputs to outputs, defined by a directed acyclic graph:

  – Nodes, organized in layers, correspond to neurons
  – Edges carry the output of one neuron to another, and are associated with weights

  • To define a neural network, we need to specify:
    – The structure of the graph (how many nodes, the connectivity): called the architecture of the network, typically predefined as part of the design of the classifier
    – The activation function on each node: also typically predefined
    – The edge weights: learned from data

[Figure: a network with input, hidden, and output layers; edges labeled with weights wᵢⱼ]


SLIDE 20

A brief history of neural networks

  • 1943: McCulloch and Pitts showed how linear threshold units can compute logical functions
  • 1949: Hebb suggested a learning rule that has some physiological plausibility
  • 1950s: Rosenblatt, the Perceptron algorithm for a single threshold neuron
  • 1969: Minsky and Papert studied the neuron from a geometrical perspective
  • 1980s: Convolutional neural networks (Fukushima, LeCun), the backpropagation algorithm (various)
  • 2003-today: More compute, more data, deeper networks

See also: http://people.idsia.ch/~juergen/deep-learning-overview.html

SLIDE 21

Neural networks are universal function approximators

  • Any continuous function can be approximated to arbitrary accuracy using one hidden layer of sigmoid units [Cybenko 1989]
  • The approximation error is insensitive to the choice of activation functions [DasGupta et al 1993]
  • Two-layer threshold networks can express any Boolean function
    – Exercise: Prove this
  • VC dimension of a threshold network with edge set E: O(|E| log |E|)
  • VC dimension of a sigmoid network with node set V and edge set E:
    – Upper bound: O(|V|²|E|²)
    – Lower bound: Ω(|E|²)

Exercise: Show that if we have only linear units, then multiple layers do not change the expressiveness.

SLIDE 22

This lecture

  • What is a neural network?
  • Computation Graphs
  • Algorithms over computation graphs
    – The forward pass
    – The backward pass

This section draws heavily upon the tutorial “Practical Neural Networks for NLP” by Chris Dyer, Yoav Goldberg, and Graham Neubig at EMNLP 2016.

SLIDE 23

Computation graphs

  • A language for constructing deep neural networks
    – A way to think about differentiable computation
  • Key ideas:
    – We can represent functions as graphs
    – We can dynamically generate these graphs if necessary
    – We can define algorithms over these graphs that map to learning and prediction
      • Prediction via the forward pass
      • Learning via gradients computed using the backward pass

SLIDE 24

What we will see

  1. What is the semantics of a computation graph?
     – That is, what the nodes and edges mean
  2. How to construct them
  3. How to perform computations with them

SLIDE 25

Nodes represent values

Expression: x    Graph: a single node labeled x

The value is implicitly or explicitly typed. It could represent a:

  • Scalar (i.e., a number)
  • A vector
  • A matrix
  • Or, more generally, a tensor
SLIDE 26

Edges represent function arguments

Left graph: x feeds a node computing ‖x‖, with f(u) = ‖u‖.
Right graph: x and y feed a node computing xᵀy, with f(u, v) = uᵀv.

A node with an incoming edge is a function of the parent node.


SLIDE 29

Edges represent function arguments

Left graph: x feeds a node computing ‖x‖, with f(u) = ‖u‖.
Right graph: x and y feed a node computing xᵀy, with f(u, v) = uᵀv.

Each node knows how to compute two things:

  1. Its own value using its inputs
     • In these examples, the nodes on top compute ‖x‖ and xᵀy
  2. The value of its partial derivative with respect to each input
     • Left graph: the node on top knows how to compute ∂f/∂u
     • Right graph: the node on top knows how to compute ∂f/∂u and ∂f/∂v

Notation: We will write down what that function is next to the node. When we write this, we will use formal arguments (here, the u and the v). Think of these as similar to the argument names we use when we declare functions while programming.


SLIDE 31

Graphs represent functions

The functions expressed could be:

  • Nullary, i.e., with no arguments: if a node has no incoming edges
  • Unary: if a node has one incoming edge
  • Binary: if a node has two incoming edges
  • n-ary: if a node has n incoming edges

(In the examples above, x is nullary, f(u) = ‖u‖ is unary, and f(u, v) = uᵀv is binary.)
SLIDE 32

Let’s see some functions as graphs

Expression: xᵀAx

Graph (under construction): nodes x and A; a transpose node f(u) = uᵀ and a matrix-product node f(U, V) = UV.

SLIDE 33

Let’s see some functions as graphs

Expression: xᵀAx

Graph: nodes x and A; f(u) = uᵀ, f(M, v) = Mv, and f(U, V) = UV combine them to compute xᵀAx.

SLIDE 34

Let’s see some functions as graphs

Expression: xᵀAx

Alternative graph: a single node f(u, M) = uᵀMu applied to x and A.

We could have written the same function with a different graph. Computation graphs are not necessarily unique for a function.

SLIDE 35

Let’s see some functions as graphs

Expression: xᵀAx, with the single node f(u, M) = uᵀMu.

Remember: the nodes also know how to compute derivatives with respect to each parent.

SLIDE 36

Let’s see some functions as graphs

Expression: xᵀAx, with the single node f(u, M) = uᵀMu.

Derivative with respect to the parent u:

∂f/∂u = (Mᵀ + M)u

SLIDE 37

Let’s see some functions as graphs

Expression: xᵀAx, with the single node f(u, M) = uᵀMu.

Derivative with respect to the parent M:

∂f/∂M = uuᵀ

SLIDE 38

Let’s see some functions as graphs

Expression: xᵀAx, with the single node f(u, M) = uᵀMu.

∂f/∂u = (Mᵀ + M)u    ∂f/∂M = uuᵀ

Evaluated at the actual inputs: ∂f/∂x = (Aᵀ + A)x and ∂f/∂A = xxᵀ.

Together, we can compute derivatives of any function with respect to all its inputs, for any value of the input.
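As a sanity check (our own addition, using PyTorch autograd rather than anything from the slides), we can verify the two derivative formulas numerically:

```python
import torch

# Check df/dx = (A^T + A) x and df/dA = x x^T for f(x, A) = x^T A x.
x = torch.randn(4, requires_grad=True)
A = torch.randn(4, 4, requires_grad=True)

f = x @ A @ x      # scalar x^T A x
f.backward()

print(torch.allclose(x.grad, (A.T + A).detach() @ x.detach()))       # True
print(torch.allclose(A.grad, torch.outer(x.detach(), x.detach())))   # True
```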

SLIDE 39

Let’s see some functions as graphs

Expression: xᵀAx + bᵀx + c

Graph: nodes x, A, b, c; functions f(U, V) = UV, f(u) = uᵀ, f(M, v) = Mv, f(u, v) = uᵀv, and a sum node f(x₁, x₂, x₃) = Σᵢ xᵢ.
SLIDE 40

Let’s see some functions as graphs

Expression: z = xᵀAx + bᵀx + c

Graph: as before, with the sum node f(x₁, x₂, x₃) = Σᵢ xᵢ producing the output node z.
SLIDE 41

Let’s see some functions as graphs

Expression: z = xᵀAx + bᵀx + c

Graph: as before, with the sum node f(x₁, x₂, x₃) = Σᵢ xᵢ producing the output node z.

We can name variables by labeling nodes.

SLIDE 42

Why are computation graphs interesting?

1. For starters, we can write neural networks as computation graphs.
2. We can write loss functions as computation graphs.
   Or the loss function inside the innermost loop of stochastic gradient descent.
3. They are plug-and-play: we can construct a graph and use it in a program that someone else wrote.
   For example: we can write down a neural network and plug it into a loss function and a minimization routine from a library.
4. They allow efficient gradient computation.
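To make the plug-and-play point concrete, here is a small sketch in PyTorch; the sizes and the use of Sequential, MSELoss, and SGD are illustrative choices of ours, not part of the lecture:

```python
import torch

# A hand-built network plugged into a library loss and optimizer.
model = torch.nn.Sequential(torch.nn.Linear(4, 5), torch.nn.Tanh(),
                            torch.nn.Linear(5, 3))
loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, target = torch.randn(4), torch.randn(3)
loss = loss_fn(model(x), target)   # our graph feeds a library loss
opt.zero_grad()
loss.backward()                    # gradients via the backward pass
opt.step()                         # one step of SGD
```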

SLIDE 43

An example two-layer neural network

h = tanh(Wx + b)
y = Vh + a

SLIDE 44

An example two-layer neural network

h = tanh(Wx + b)
y = Vh + a

Graph so far: W and x feed g(M, v) = Mv; its output and b feed g(u, v) = u + v; that output feeds g(u) = tanh(u), giving h.

SLIDE 45

An example two-layer neural network

h = tanh(Wx + b)
y = Vh + a

Full graph: W and x feed g(M, v) = Mv; its output and b feed g(u, v) = u + v; that output feeds g(u) = tanh(u), giving h; then h and V feed g(M, v) = Mv; its output and a feed g(u, v) = u + v, giving y.
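A minimal PyTorch sketch of this network, with illustrative (assumed) sizes; each line corresponds to one group of nodes in the graph above:

```python
import torch

x = torch.randn(4)                          # input node
W = torch.randn(5, 4, requires_grad=True)   # parameters
b = torch.randn(5, requires_grad=True)
V = torch.randn(3, 5, requires_grad=True)
a = torch.randn(3, requires_grad=True)

h = torch.tanh(W @ x + b)   # g(M, v) = Mv, then g(u, v) = u + v, then tanh
y = V @ h + a               # g(M, v) = Mv, then g(u, v) = u + v
```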

SLIDE 46

Exercises

Write the following functions as computation graphs:

  • g(x) = x³ − log(x)
  • g(x) = 1 / (1 + exp(−x))
  • g(w, x, y) = max(0, 1 − y·wᵀx)
  • min over w of ½·wᵀw + C·Σᵢ max(0, 1 − yᵢ·wᵀxᵢ)

SLIDE 47

Where are we?

  • What is a neural network?
  • Computation Graphs
  • Algorithms over computation graphs
    – The forward pass
    – The backward pass

SLIDE 48

Three computational questions

1. Forward propagation
   – Given inputs to the graph, compute the value of the function expressed by the graph
   – Something to think about: Given a node, can we say which nodes are inputs? Which nodes are outputs?

2. Backpropagation
   – After computing the function value for an input, compute the gradient of the function at that input
   – Or equivalently: How does the output change if I make a small change to the input?

3. Constructing graphs
   – We need an easy-to-use framework to construct graphs
   – The size of the graph may be input dependent
     • A templating language that creates graphs on the fly
   – TensorFlow and PyTorch are the most popular frameworks today

SLIDE 49

Forward propagation


SLIDE 51

Forward pass: An example

Graph: input nodes x and y, and nodes computing u + v, u², log(u), u·v, and Σᵢ uᵢ.

Conventions:

  1. Any expression next to a node is the function it computes
  2. All the variables in the expression are inputs to the node, from left to right
SLIDE 52

Forward pass

What function does this compute?

SLIDE 53

Forward pass

What function does this compute? Suppose we shade nodes whose values we know (i.e., we have computed them).

SLIDE 54

Forward pass

Known so far: x.

SLIDE 55

Forward pass

Known so far: x, y.

SLIDE 56

Forward pass

Known so far: x, y.

We can only compute the value of a node if we know the values of all its inputs.

SLIDE 57

Forward pass

Known so far: x, y, x + y.

SLIDE 58

Forward pass

Known so far: x, y, x + y, y².

SLIDE 59

Forward pass

Known so far: x, y, x + y, y², x(x + y).

SLIDE 60

Forward pass

Known so far: x, y, x + y, y², x(x + y), log(x + y).

SLIDE 61

Forward pass

Known so far: x, y, x + y, y², x(x + y), log(x + y), and finally x(x + y) + log(x + y) + y².

SLIDE 62

Forward pass

This gives us the function: x(x + y) + log(x + y) + y².
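The same forward pass, written as a straight-line Python sketch with one assignment per node, in topological order (variable names are our own):

```python
import math

def forward(x, y):
    s = x + y           # node computing u + v
    p = x * s           # node computing u * v
    l = math.log(s)     # node computing log(u)
    q = y ** 2          # node computing u^2
    return p + l + q    # sum node: x(x + y) + log(x + y) + y^2
```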

SLIDE 63

A second example

Graph for xᵀAx + bᵀx + c: nodes x, A, b, c, combined through f(U, V) = UV, f(u) = uᵀ, f(M, v) = Mv, f(u, v) = uᵀv, and the sum node f(x₁, x₂, x₃) = Σᵢ xᵢ, producing the output node z.
SLIDE 64

A second example

To compute the function, we need the values of the leaves of this DAG.


SLIDE 66

A second example

Let’s also highlight which nodes can be computed using what we know so far.

SLIDE 67

A second example

Computed so far: xᵀ.

SLIDE 68

A second example

Computed so far: xᵀ, bᵀx.

SLIDE 69

A second example

Computed so far: xᵀ, bᵀx, xᵀA.

SLIDE 70

A second example

Computed so far: xᵀ, bᵀx, xᵀA, xᵀAx.

SLIDE 71

A second example

Computed so far: xᵀ, bᵀx, xᵀA, xᵀAx, and finally xᵀAx + bᵀx + c.

SLIDE 72

Forward propagation

Given a computation graph G and values of its input nodes:
  For each node in the graph, in topological order:
    Compute the value of that node

Why topological order? It ensures that children are computed before parents.
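A generic sketch of this algorithm over an explicit graph data structure; the Node class and its fields are our own illustrative choices, not an API from the lecture:

```python
class Node:
    def __init__(self, fn=None, inputs=()):
        self.fn = fn          # function this node computes (None for input nodes)
        self.inputs = inputs  # nodes supplying the arguments
        self.value = None

def topological_order(root):
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.inputs:
                visit(parent)
            order.append(node)
    visit(root)
    return order

def forward(root):
    # Arguments are computed before the nodes that consume them.
    for node in topological_order(root):
        if node.fn is not None:
            node.value = node.fn(*[p.value for p in node.inputs])
    return root.value
```

For example, the graph from the previous slides:

```python
import math
x, y = Node(), Node()
x.value, y.value = 2.0, 3.0
s = Node(lambda u, v: u + v, (x, y))
f = Node(lambda a, b, c: a + b + c,
         (Node(lambda u, v: u * v, (x, s)),
          Node(math.log, (s,)),
          Node(lambda u: u * u, (y,))))
print(forward(f))   # 2*5 + log(5) + 9 ≈ 20.61
```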


SLIDE 74

Backpropagation with computation graphs


SLIDE 76

Calculus refresher: The chain rule

Suppose we have two functions g and h, and we wish to compute the gradient of z = g(h(x)). We know that

  dz/dx = g′(h(x)) · h′(x)

Or equivalently: if y = h(x) and z = g(y), then

  dz/dx = (dz/dy) · (dy/dx)
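A quick numerical check of the chain rule (our own example, with h(x) = x² and g(y) = 1/y, anticipating the concrete example a few slides below):

```python
def h(x): return x * x
def g(y): return 1.0 / y

x, eps = 2.0, 1e-6
analytic = (-1.0 / h(x) ** 2) * (2 * x)      # g'(h(x)) * h'(x) = -2/x^3
numeric = (g(h(x + eps)) - g(h(x))) / eps    # finite-difference estimate
print(analytic, numeric)                     # both approximately -0.25
```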

SLIDE 77

Or equivalently: In terms of computation graphs

Graph: x →[f]→ y →[g]→ z

The forward pass gives us y and z.

SLIDE 78

Or equivalently: In terms of computation graphs

Graph: x →[f]→ y →[g]→ z

The forward pass gives us y and z. Remember that each node knows not only how to compute its value given inputs, but also how to compute gradients.

SLIDE 79

Or equivalently: In terms of computation graphs

Graph: x →[f]→ y →[g]→ z

Start from the root of the graph and work backwards: first compute dz/dy.

SLIDE 80

Or equivalently: In terms of computation graphs

Graph: x →[f]→ y →[g]→ z

Start from the root of the graph and work backwards: first dz/dy, then (dz/dy) · (dy/dx).

When traversing an edge backwards to a new node: the gradient of the root with respect to that node is the product of the gradient at the parent and the derivative along that edge.

SLIDE 81

A concrete example

Graph: x →[h(u) = u²]→ y →[g(v) = 1/v]→ z, so z = 1/x².

SLIDE 82

A concrete example

Let’s also explicitly write down the derivatives: dg/dv = −1/v² and dh/du = 2u.

SLIDE 83

A concrete example

dz/dz = 1

Now we can proceed backwards from the output. At each step, we compute the gradient of the function represented by the graph with respect to the node that we are at.

SLIDE 84

A concrete example

dz/dy = (dz/dz) · (dg/dv at v = y) = 1 · (−1/y²) = −1/y²

(The product of the gradient so far and the derivative computed at this step.)

SLIDE 85

A concrete example

dz/dx = (dz/dy) · (dh/du at u = x) = (−1/y²) · 2x = −2x/y²

SLIDE 86

A concrete example

dz/dx = (dz/dy) · (dh/du at u = x) = (−1/y²) · 2x = −2x/y²

Since y = x², we can simplify this to get dz/dx = −2/x³.
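The same computation with PyTorch autograd (our own check, not the slides’): at x = 2, −2/x³ = −0.25.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2       # h(u) = u^2
z = 1.0 / y      # g(v) = 1/v
z.backward()     # backward pass from the root z
print(x.grad)    # tensor(-0.2500) == -2/x^3
```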

SLIDE 87

A concrete example, with multiple outgoing edges

Graph: x →[h(u) = u²]→ y, and both y and x feed the node g(v, w) = w/v, so z = x/y = 1/x. The node x now has multiple outgoing edges.

SLIDE 88

A concrete example, with multiple outgoing edges

Let’s also explicitly write down the derivatives: ∂g/∂v = −w/v², ∂g/∂w = 1/v, and dh/du = 2u. Note that g has two derivatives because it has two inputs.

SLIDE 89

A concrete example, with multiple outgoing edges

dz/dz = 1

SLIDE 90

A concrete example, with multiple outgoing edges

dz/dz = 1

At this point, we can compute the gradient of z with respect to y by following the edge from z to y. But we cannot yet follow the edge down to x, because not all of x’s descendants have been marked as done.

SLIDE 91

A concrete example, with multiple outgoing edges

dz/dy = (dz/dz) · (∂g/∂v at v = y, w = x) = 1 · (−x/y²) = −x/y²

(The product of the gradient so far and the derivative computed at this step.)

SLIDE 92

A concrete example, with multiple outgoing edges

Now we can get to x. There are multiple backward paths into x. The general rule: add the gradients along all the paths.

SLIDE 93

A concrete example, with multiple outgoing edges

dz/dx = (dz/dy) · (dh/du at u = x) + (dz/dz) · (∂g/∂w at w = x)

There are multiple backward paths into x. The general rule: add the gradients along all the paths.


SLIDE 96

A concrete example, with multiple outgoing edges

dz/dx = (dz/dy) · (dh/du at u = x) + (dz/dz) · (∂g/∂w at w = x)
      = (−x/y²) · 2x + 1 · (1/y)
      = −2x²/y² + 1/y
      = −1/x²   (using y = x²)
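Again a small autograd check (ours, not the slides’): the two backward paths into x are summed automatically, giving −1/x² = −0.25 at x = 2.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2       # h(u) = u^2; x's first outgoing edge
z = x / y        # g(v, w) = w/v; x's second outgoing edge
z.backward()
print(x.grad)    # tensor(-0.2500) == -1/x^2
```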

SLIDE 97

A neural network example

This is the same two-layer network we saw before, but this time we have added a new loss term at the end. Suppose our goal is to compute the derivative of the loss with respect to W, b, V, and a.

h = tanh(Wx + b)
y = Vh + a
L = ½ ‖y − y*‖²

SLIDE 98

A neural network

h = tanh(Wx + b)
y = Vh + a
L = ½ ‖y − y*‖²

Graph: W and x feed g(M, v) = Mv; its output and b feed g(u, v) = u + v; that output feeds g(u) = tanh(u), giving h; then h and V feed g(M, v) = Mv; its output and a feed g(u, v) = u + v, giving y; finally, y and y* feed g(u, v) = ½ ‖u − v‖², giving the loss L.

SLIDE 99

A neural network

To simplify notation, let us name all the intermediate nodes:

  A₁ = Wx,  A₂ = A₁ + b,  A₃ = tanh(A₂) = h,  A₄ = VA₃,  y = A₄ + a,  L = ½ ‖y − y*‖²

SLIDE 100

A neural network

Let us highlight nodes that are done.

dL/dL = 1

SLIDE 101

A neural network

Whenever we have the derivative of the loss with respect to a node, some new derivatives can be computed. Let us also mark them.

dL/dL = 1

SLIDE 102

A neural network

dL/dy = (dL/dL) · (∂L/∂y) = 1 · (y − y*)

SLIDE 103

A neural network

dL/dy = (y − y*)
dL/da = (dL/dy) · (∂y/∂a) = (y − y*) · 1

SLIDE 104

A neural network

dL/dy = (y − y*)
dL/dA₄ = (dL/dy) · (∂y/∂A₄) = (y − y*) · 1

SLIDE 105

A neural network

dL/dA₄ = (y − y*)
dL/dA₃ = (dL/dA₄) · (∂A₄/∂A₃) = (y − y*) · V

SLIDE 106

A neural network

Because h = A₃:

dL/dh = dL/dA₃

SLIDE 107

A neural network

dL/dh = dL/dA₃ = (y − y*) · V


SLIDE 113

A neural network

Continuing backwards in the same way gives the derivatives with respect to V, b, W, and the remaining intermediate nodes. We can stop when we have all these derivatives, because x and y* are constants.
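The whole worked example fits in a few lines of PyTorch (shapes are illustrative assumptions of ours); backward() performs exactly the reverse-order sweep sketched above:

```python
import torch

x, y_star = torch.randn(4), torch.randn(3)
W = torch.randn(5, 4, requires_grad=True)
b = torch.randn(5, requires_grad=True)
V = torch.randn(3, 5, requires_grad=True)
a = torch.randn(3, requires_grad=True)

h = torch.tanh(W @ x + b)
y = V @ h + a
L = 0.5 * torch.sum((y - y_star) ** 2)

L.backward()   # fills W.grad, b.grad, V.grad, a.grad
print(torch.allclose(a.grad, (y - y_star).detach()))   # dL/da = y - y*: True
```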

SLIDE 114

Backpropagation, in general

After we have done the forward propagation:

Loop over the nodes in reverse topological order, starting with the final goal node:
  – Compute the derivatives of the final goal node’s value with respect to each edge’s tail node
    • If there are multiple outgoing edges from a node, sum up the derivatives for those edges
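A self-contained sketch of this algorithm, extending the Node sketch from the forward-propagation slide with per-input local-derivative functions (an assumption of this sketch, mirroring “each node knows its partial derivatives”):

```python
class Node:
    def __init__(self, fn=None, derivs=(), inputs=()):
        self.fn, self.derivs, self.inputs = fn, derivs, inputs
        self.value = None

def topological_order(root):
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.inputs:
                visit(parent)
            order.append(node)
    visit(root)
    return order

def backward(root):
    order = topological_order(root)   # forward pass assumed already done
    grad = {id(n): 0.0 for n in order}
    grad[id(root)] = 1.0              # d(root)/d(root) = 1
    for node in reversed(order):      # reverse topological order
        vals = [p.value for p in node.inputs]
        for parent, d in zip(node.inputs, node.derivs):
            grad[id(parent)] += grad[id(node)] * d(*vals)   # sum over paths
    return grad

# The multiple-outgoing-edges example: z = x / x**2 = 1/x at x = 2.
x = Node(); x.value = 2.0
y = Node(lambda u: u * u, (lambda u: 2 * u,), (x,)); y.value = 4.0
z = Node(lambda v, w: w / v,
         (lambda v, w: -w / v ** 2, lambda v, w: 1 / v), (y, x)); z.value = 0.5
print(backward(z)[id(x)])   # -0.25 == -1/x^2
```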

SLIDE 115

Constructing computation graphs


SLIDE 117

Two methods for constructing graphs

We may require different-sized computation graphs for different inputs:

  – E.g., different sentences have different lengths. We may have a neural network whose size depends on the sentence length.
  – How could we statically declare a computation graph of a fixed size?
    • One option: Assume a size that is big enough and, for smaller examples, pad with dummy values
    • Another option: Dynamically create a computation graph on the fly when we need to (see the sketch below)
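A sketch of the dynamic option in PyTorch; the recurrence and sizes are illustrative assumptions. The graph is built on the fly, so its size follows the sentence length:

```python
import torch

W = torch.randn(8, 8, requires_grad=True)
U = torch.randn(8, 8, requires_grad=True)

def encode(sequence):                # sequence: a list of 8-dim tensors
    h = torch.zeros(8)
    for x in sequence:               # one block of graph nodes per token
        h = torch.tanh(W @ h + U @ x)
    return h

h = encode([torch.randn(8) for _ in range(5)])   # a 5-token "sentence"
h.sum().backward()                   # backprop through the unrolled graph
```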

SLIDE 118

Two methods for constructing graphs

  • Static declaration
    – Phase 1: Define an architecture
      • Maybe using standard control flow operations like loops, conditionals, etc., to simplify repeated code
    – Phase 2: Run a bunch of data through the graph to train and make predictions
  • Dynamic declaration
    – The graph is constructed implicitly (perhaps via operator overloading) at the same time as the forward propagation

SLIDE 119

Static declaration

  • Pros
    – Offline optimization/scheduling of graphs is powerful
    – Limits on operations mean better hardware support
  • Cons
    – Structured data (even simple stuff like sequences), even variable-sized data, is ugly
    – You effectively learn a new programming language (“the graph language”) and you write programs in that language to process data
  • Examples: Torch, Theano, TensorFlow

SLIDE 120

Dynamic declaration

  • Pros
    – The library is less invasive; no need to learn a new syntax
    – Forward computation is written in your favorite programming language with all its features, using your favorite algorithms
    – Interleave construction and evaluation of the graph
  • Cons
    – We can’t do offline graph optimization because there is little time
    – If the graph is static, the effort can be wasted
  • Examples: Chainer, DyNet, PyTorch, most automatic differentiation libraries

SLIDE 121

Summary: Computation graphs

An abstraction that allows us to write any differentiable (or sub-differentiable) function as a directed acyclic graph:

  – Building blocks for modern neural networks
  – This will allow us to think about differentiable programs

Two algorithms:

  – Forward propagation: process nodes in topological order to compute the function value
  – Backpropagation: process nodes in reverse topological order to compute derivatives

Two methods for constructing graphs: static vs. dynamic declaration