
slide-1
SLIDE 1

Natural Language Processing with Deep Learning CS224N/Ling284

Christopher Manning Lecture 4: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)

slide-2
SLIDE 2
  • 1. Introduction

Assignment 2 is all about making sure you really understand the math of neural networks … then we’ll let the software do it! We’ll go through it quickly today, but also look at the readings! This will be a tough week for some! → Make sure to get help if you need it: visit office hours Friday/Tuesday. Note: Monday is MLK Day – no office hours, sorry! But we will be on Piazza. Read the tutorial materials given in the syllabus.

2

slide-3
SLIDE 3

NER: Binary classification for center word being location

  • We do supervised training and want high score if it’s a location

J_t(θ) = σ(s) = 1 / (1 + e^(−s))

3

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
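To make the setup concrete, here is a minimal NumPy sketch of this window-classification forward pass; the dimensions, the ReLU choice of f, and all variable names are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
d, H = 4, 8                        # word-vector size and hidden size (assumed)
x = rng.standard_normal(5 * d)     # concatenated window: museums, in, Paris, are, amazing
W = rng.standard_normal((H, 5 * d))
b = rng.standard_normal(H)
u = rng.standard_normal(H)

z = W @ x + b                      # pre-activation
h = np.maximum(0.0, z)             # hidden layer; f = ReLU here, any elementwise f works
s = u @ h                          # score
J_t = sigmoid(s)                   # predicted probability that the center word is a location
```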

slide-4
SLIDE 4

Remember: Stochastic Gradient Descent

Update equation: θ_new = θ_old − α ∇_θ J(θ)

How can we compute ∇_θ J(θ)?

  • 1. By hand
  • 2. Algorithmically: the backpropagation algorithm

α = step size or learning rate
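As a tiny sketch of one such update (parameters and gradient as NumPy arrays; the function name is illustrative):

```python
def sgd_step(theta, grad, alpha=0.01):
    """One stochastic gradient descent update: theta_new = theta_old - alpha * grad."""
    return theta - alpha * grad
```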

4

slide-5
SLIDE 5

Lecture Plan

Lecture 4: Gradients by hand and algorithmically

  • 1. Introduction (5 mins)
  • 2. Matrix calculus (40 mins)
  • 3. Backpropagation (35 mins)

5

slide-6
SLIDE 6

Computing Gradients by Hand

  • Matrix calculus: Fully vectorized gradients
  • “multivariable calculus is just like single-variable calculus if

you use matrices”

  • Much faster and more useful than non-vectorized gradients
  • But doing a non-vectorized gradient can be good for

intuition; watch last week’s lecture for an example

  • Lecture notes and matrix calculus notes cover this

material in more detail

  • You might also review Math 51, which has a new online

textbook: http://web.stanford.edu/class/math51/textbook.html

6

slide-7
SLIDE 7

Gradients

  • Given a function with 1 output and 1 input

f(x) = x³

  • Its gradient (slope) is its derivative

df/dx = 3x²

“How much will the output change if we change the input a bit?”

7

slide-8
SLIDE 8

Gradients

  • Given a function with 1 output and n inputs
  • Its gradient is a vector of partial derivatives with

respect to each input
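Written out (standard definition), for f with inputs x_1, …, x_n:

$$ \nabla_{\mathbf{x}} f = \left[\ \frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \dots,\ \frac{\partial f}{\partial x_n}\ \right] $$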

8

slide-9
SLIDE 9

Jacobian Matrix: Generalization of the Gradient

  • Given a function with m outputs and n inputs
  • It’s Jacobian is an m x n matrix of partial derivatives

9

slide-10
SLIDE 10

Chain Rule

  • For one-variable functions: multiply derivatives
  • For multiple variables at once: multiply Jacobians
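In symbols, for a composition h = f(z) with z = g(x):

$$ \frac{\partial \mathbf{h}}{\partial \mathbf{x}} \;=\; \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\,\frac{\partial \mathbf{z}}{\partial \mathbf{x}} $$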

10

slide-11
SLIDE 11

Example Jacobian: Elementwise activation Function

11

slide-12
SLIDE 12

Example Jacobian: Elementwise activation Function

Function has n outputs and n inputs → n by n Jacobian
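The result this sequence of slides works out (restated here; cf. the lecture notes): for h = f(z) applied elementwise,

$$ \left(\frac{\partial \mathbf{h}}{\partial \mathbf{z}}\right)_{ij} = \begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \qquad\Longrightarrow\qquad \frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \operatorname{diag}\!\big(f'(\mathbf{z})\big) $$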

12

slide-13
SLIDE 13

Example Jacobian: Elementwise activation Function

13

slide-14
SLIDE 14

Example Jacobian: Elementwise activation Function

14

slide-15
SLIDE 15

Example Jacobian: Elementwise activation Function

15

slide-16
SLIDE 16

Other Jacobians

  • Compute these at home for practice!
  • Check your answers with the lecture notes

16

slide-17
SLIDE 17

Other Jacobians

  • Compute these at home for practice!
  • Check your answers with the lecture notes

17

slide-18
SLIDE 18

Other Jacobians

  • Compute these at home for practice!
  • Check your answers with the lecture notes

18

Fine print: This is the correct Jacobian. Later we discuss the “shape convention”; using it the answer would be h.

slide-19
SLIDE 19

Other Jacobians

  • Compute these at home for practice!
  • Check your answers with the lecture notes
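For reference, the Jacobians of this kind used later in the lecture are, as given in the lecture notes:

$$ \frac{\partial}{\partial \mathbf{x}}(\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{W}, \qquad \frac{\partial}{\partial \mathbf{b}}(\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{I}, \qquad \frac{\partial}{\partial \mathbf{u}}(\mathbf{u}^{\mathsf T}\mathbf{h}) = \mathbf{h}^{\mathsf T} $$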

19

slide-20
SLIDE 20

Back to our Neural Net!

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]

20

slide-21
SLIDE 21

Back to our Neural Net!

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]

  • Let’s find ∂s/∂b
  • Really, we care about the gradient of the loss, but we

will compute the gradient of the score for simplicity
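The next several slides carry this out step by step; restated compactly (consistent with the lecture notes), the model and the chain-rule computation are:

$$ s = \mathbf{u}^{\mathsf T}\mathbf{h}, \qquad \mathbf{h} = f(\mathbf{z}), \qquad \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b} $$

$$ \frac{\partial s}{\partial \mathbf{b}} = \frac{\partial s}{\partial \mathbf{h}}\,\frac{\partial \mathbf{h}}{\partial \mathbf{z}}\,\frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{u}^{\mathsf T}\,\operatorname{diag}\!\big(f'(\mathbf{z})\big)\,\mathbf{I} = \mathbf{u}^{\mathsf T} \circ f'(\mathbf{z}) $$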

21

slide-22
SLIDE 22
  • 1. Break up equations into simple pieces

22

slide-23
SLIDE 23
  • 2. Apply the chain rule

23

slide-24
SLIDE 24
  • 2. Apply the chain rule

24

slide-25
SLIDE 25
  • 2. Apply the chain rule

25

slide-26
SLIDE 26
  • 2. Apply the chain rule

26

slide-27
SLIDE 27
  • 3. Write out the Jacobians

Useful Jacobians from previous slide

27

slide-28
SLIDE 28
  • 3. Write out the Jacobians

Useful Jacobians from previous slide

28


slide-29
SLIDE 29
  • 3. Write out the Jacobians

Useful Jacobians from previous slide

29


slide-30
SLIDE 30
  • 3. Write out the Jacobians

Useful Jacobians from previous slide

30


slide-31
SLIDE 31
  • 3. Write out the Jacobians

Useful Jacobians from previous slide

31


slide-32
SLIDE 32

Re-using Computation

  • Suppose we now want to compute
  • Using the chain rule again:

32

slide-33
SLIDE 33

Re-using Computation

  • Suppose we now want to compute
  • Using the chain rule again:

The same! Let’s avoid duplicated computation…

33

slide-34
SLIDE 34

Re-using Computation

  • Suppose we now want to compute
  • Using the chain rule again:

34

𝜀 is the local error signal
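Concretely, with the pieces derived above, this reused quantity is

$$ \boldsymbol{\varepsilon} = \frac{\partial s}{\partial \mathbf{h}}\,\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathbf{u}^{\mathsf T} \circ f'(\mathbf{z}) $$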


slide-35
SLIDE 35

Derivative with respect to Matrix: Output shape

  • What does ∂s/∂W look like?
  • 1 output, nm inputs: 1 by nm Jacobian?
  • Inconvenient to do

35

slide-36
SLIDE 36

Derivative with respect to Matrix: Output shape

  • What does ∂s/∂W look like?
  • 1 output, nm inputs: 1 by nm Jacobian?
  • Inconvenient to do
  • Instead we use shape convention: the shape of

the gradient is the shape of the parameters

  • So ∂s/∂W is n by m:

36

slide-37
SLIDE 37

Derivative with respect to Matrix

  • Remember
  • is going to be in our answer
  • The other term should be because
  • Answer is:

37

𝜀 is the local error signal at 𝒛; 𝒙 is the local input signal
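Putting this together (consistent with the lecture notes): since z = Wx + b, the other factor is the input x, and under the shape convention the answer is the outer product

$$ \frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\varepsilon}^{\mathsf T}\,\mathbf{x}^{\mathsf T} \;\in\; \mathbb{R}^{n \times m} $$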

slide-38
SLIDE 38

Why the Transposes?

  • Hacky answer: this makes the dimensions work out!
  • Useful trick for checking your work!
  • Full explanation in the lecture notes; intuition next
  • Each input goes to each output – you get outer product

38

slide-39
SLIDE 39

Why the Transposes?

39

slide-40
SLIDE 40

Deriving local input gradient in backprop

  • For this function:
  • Let’s consider the derivative of a single weight W_ij
  • W_ij only contributes to z_i
  • For example: W_23 is only used to compute z_2, not z_1

40

[Network diagram: inputs x1, x2, x3 and a bias +1 feed hidden units h1 = f(z1) and h2 = f(z2), which feed the score s; the labelled parameters are u2, W23, and b2]

∂s/∂W = 𝜀 ∂z/∂W = 𝜀 ∂/∂W (Wx + b)

∂z_i/∂W_ij = ∂/∂W_ij (W_i· x + b_i) = ∂/∂W_ij Σ_k W_ik x_k = x_j

slide-41
SLIDE 41

What shape should derivatives be?

  • ∂s/∂b is a row vector
  • But convention says our gradient should be a column vector

because 𝒃 is a column vector…

  • Disagreement between Jacobian form (which makes

the chain rule easy) and the shape convention (which makes implementing SGD easy)

  • We expect answers to follow the shape convention
  • But Jacobian form is useful for computing the answers

41

slide-42
SLIDE 42

What shape should derivatives be?

Two options:

  • 1. Use Jacobian form as much as possible, reshape to

follow the convention at the end:

  • What we just did. But at the end transpose to make the

derivative a column vector, resulting in

  • 2. Always follow the convention
  • Look at dimensions to figure out when to transpose and/or

reorder terms

42

slide-43
SLIDE 43
Deriving gradients: Tips

  • Tip 1: Carefully define your variables and keep track of their

dimensionality!

  • Tip 2: Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

∂y/∂x = (∂y/∂u) (∂u/∂x)

Keep straight what variables feed into what computations

  • Tip 3: For the top softmax part of a model: First consider the

derivative wrt f_c when c = y (the correct class), then consider the derivative wrt f_c when c ≠ y (all the incorrect classes)

  • Tip 4: Work out element-wise partial derivatives if you’re getting

confused by matrix calculus!

  • Tip 5: Use Shape Convention. Note: The error message 𝜺 that

arrives at a hidden layer has the same dimensionality as that hidden layer


43

slide-44
SLIDE 44
  • 3. Backpropagation

We’ve almost shown you backpropagation. It’s taking derivatives and using the (generalized, multivariate, or matrix) chain rule. Other trick: we re-use derivatives computed for higher layers in computing derivatives for lower layers to minimize computation.

44

slide-45
SLIDE 45

Computation Graphs and Backpropagation

  • We represent our neural net

equations as a graph

  • Source nodes: inputs
  • Interior nodes: operations

45

slide-46
SLIDE 46

Computation Graphs and Backpropagation

  • We represent our neural net

equations as a graph

  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation

46

slide-47
SLIDE 47

Computation Graphs and Backpropagation

  • Representing our neural net

equations as a graph

  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation

“Forward Propagation”

47

slide-48
SLIDE 48

Backpropagation

  • Go backwards along edges
  • Pass along gradients

48

slide-49
SLIDE 49

Backpropagation: Single Node

  • Node receives an “upstream gradient”
  • Goal is to pass on the correct

“downstream gradient”

Upstream gradient

49

Downstream gradient

slide-50
SLIDE 50

Backpropagation: Single Node

Downstream gradient Upstream gradient

  • Each node has a local gradient
  • The gradient of its output with

respect to its input

Local gradient

50

slide-51
SLIDE 51

Backpropagation: Single Node

Downstream gradient Upstream gradient

  • Each node has a local gradient
  • The gradient of its output with

respect to its input

Local gradient

51

Chain rule!

slide-52
SLIDE 52

Backpropagation: Single Node

Downstream gradient Upstream gradient

  • Each node has a local gradient
  • The gradient of its output with

respect to its input

Local gradient

  • [downstream gradient] = [upstream gradient] x [local gradient]

52

slide-53
SLIDE 53

Backpropagation: Single Node

  • What about nodes with multiple inputs?

53

slide-54
SLIDE 54

Backpropagation: Single Node

Downstream gradients Upstream gradient Local gradients


  • Multiple inputs → multiple local gradients
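For instance, for a two-input multiply node z = a · b, a sketch with illustrative names:

```python
def multiply_backward(a, b, upstream):
    """Backward pass of z = a * b: one downstream gradient per input."""
    downstream_a = upstream * b   # local gradient dz/da = b
    downstream_b = upstream * a   # local gradient dz/db = a
    return downstream_a, downstream_b
```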

54

slide-55
SLIDE 55

An Example

55

slide-56
SLIDE 56

An Example

56

Forward prop steps

slide-57
SLIDE 57

An Example

57

Forward prop steps 6 3 2 1 2 2

slide-58
SLIDE 58

An Example

58

Forward prop steps 6 3 2 1 2 2 Local gradients

slide-59
SLIDE 59

An Example

59

Forward prop steps 6 3 2 1 2 2 Local gradients

slide-60
SLIDE 60

An Example

60

Forward prop steps 6 3 2 1 2 2 Local gradients

slide-61
SLIDE 61

An Example

61

Forward prop steps 6 3 2 1 2 2 Local gradients

slide-62
SLIDE 62

An Example

62

Forward prop steps 6 3 2 1 2 2 Local gradients upstream * local = downstream 1 1*3 = 3 1*2 = 2

slide-63
SLIDE 63

An Example

63

Forward prop steps 6 3 2 1 2 2 Local gradients upstream * local = downstream 1 3 2 3*1 = 3 3*0 = 0

slide-64
SLIDE 64

An Example

64

Forward prop steps 6 3 2 1 2 2 Local gradients upstream * local = downstream 1 3 2 3 2*1 = 2 2*1 = 2

slide-65
SLIDE 65

An Example

65

Forward prop steps 6 3 2 1 2 2 Local gradients 1 3 2 3 2 2

slide-66
SLIDE 66

Gradients sum at outward branches
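In symbols (the multivariable chain rule): if a variable x feeds into two downstream nodes a and b of an output f, the gradients from both branches add:

$$ \frac{\partial f}{\partial x} \;=\; \frac{\partial f}{\partial a}\,\frac{\partial a}{\partial x} \;+\; \frac{\partial f}{\partial b}\,\frac{\partial b}{\partial x} $$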

66


slide-67
SLIDE 67

Gradients sum at outward branches

67


slide-68
SLIDE 68

Node Intuitions

68

6 3 2 1 2 2 1 2 2 2

  • + “distributes” the upstream gradient to each summand
slide-69
SLIDE 69

Node Intuitions

69

6 3 2 1 2 2 1 3 3

  • + “distributes” the upstream gradient to each summand
  • max “routes” the upstream gradient
slide-70
SLIDE 70

Node Intuitions

70

6 3 2 1 2 2 1 3 2

  • + “distributes” the upstream gradient
  • max “routes” the upstream gradient
  • * “switches” the upstream gradient
slide-71
SLIDE 71

Efficiency: compute all gradients at once

  • Incorrect way of doing backprop:
  • First compute

71

slide-72
SLIDE 72

Efficiency: compute all gradients at once

  • Incorrect way of doing backprop:
  • First compute
  • Then independently compute
  • Duplicated computation!

72

slide-73
SLIDE 73

Efficiency: compute all gradients at once

  • Correct way:
  • Compute all the gradients at once
  • Analogous to using 𝜺 when we

computed gradients by hand

73

slide-74
SLIDE 74
Back-Prop in General Computation Graph

  • 1. Fprop: visit nodes in topological sort order
  • Compute value of node given predecessors
  • 2. Bprop:
  • initialize output gradient = 1
  • visit nodes in reverse order:

Compute gradient wrt each node using gradient wrt successors. Done correctly, the big-O() complexity of fprop and bprop is the same. In general our nets have regular layer structure and so we can use matrices and Jacobians…

[Figure: a general computation graph with a single scalar output z; for a node x with {y_1, …, y_n} = successors of x, ∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x)]
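A minimal sketch of this procedure in Python; the Node attributes (inputs, backward_fn) are assumptions for illustration, not the API of any particular framework:

```python
def backprop(topo_order, output_node):
    """Reverse-mode differentiation over a computation graph.

    topo_order: all nodes in topological order (forward values already computed).
    Each node is assumed to have .inputs (predecessor nodes) and
    .backward_fn(upstream) -> one downstream gradient per input.
    """
    grads = {output_node: 1.0}                 # initialize output gradient = 1
    for node in reversed(topo_order):          # visit nodes in reverse order
        upstream = grads.get(node, 0.0)
        for parent, downstream in zip(node.inputs, node.backward_fn(upstream)):
            # gradients from multiple successors sum at a node
            grads[parent] = grads.get(parent, 0.0) + downstream
    return grads
```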

74

slide-75
SLIDE 75

Automatic Differentiation

  • The gradient computation can be

automatically inferred from the symbolic expression of the fprop

  • Each node type needs to know how

to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output

  • Modern DL frameworks (Tensorflow,

PyTorch, etc.) do backpropagation for you but mainly leave the layer/node writer to hand-calculate the local derivative

75

slide-76
SLIDE 76

Backprop Implementations

76

slide-77
SLIDE 77

Implementation: forward/backward API

77

slide-78
SLIDE 78

Implementation: forward/backward API
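A hedged sketch of the kind of per-node forward/backward API these slides illustrate; the class and method names here are illustrative, not a specific framework’s API:

```python
class MultiplyGate:
    """Toy graph node computing z = x * y, caching the forward inputs."""

    def forward(self, x, y):
        self.x, self.y = x, y        # cache inputs: needed for the local gradients
        return x * y

    def backward(self, dz):
        # [downstream] = [upstream] * [local gradient], one per input
        dx = dz * self.y             # dz/dx = y
        dy = dz * self.x             # dz/dy = x
        return dx, dy
```

For example, `gate = MultiplyGate()`; `gate.forward(3.0, 2.0)` returns 6.0 and `gate.backward(1.0)` returns (2.0, 3.0).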

78

slide-79
SLIDE 79

Manual Gradient checking: Numeric Gradient

  • For small h (≈ 1e-4), approximate each partial derivative numerically (see the sketch after this list)
  • Easy to implement correctly
  • But approximate and very slow:
  • Have to recompute f for every parameter of our model
  • Useful for checking your implementation
  • In the old days when we hand-wrote everything, it was key

to do this everywhere.

  • Now much less needed, when throwing together layers
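A minimal sketch of the check, using the standard two-sided central-difference estimate (f(x+h) − f(x−h)) / 2h; names are illustrative:

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Estimate df/dx_i for scalar f and a 1-D float array x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x[i]
        x[i] = old + h
        f_plus = f(x)
        x[i] = old - h
        f_minus = f(x)
        x[i] = old                     # restore the parameter
        grad[i] = (f_plus - f_minus) / (2.0 * h)   # recompute f for every parameter: slow
    return grad
```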

79

slide-80
SLIDE 80

Summary

We’ve mastered the core technology of neural nets! 🎊

  • Backpropagation: recursively (and hence efficiently)

apply the chain rule along computation graph

  • [downstream gradient] = [upstream gradient] x [local gradient]
  • Forward pass: compute results of operations and save

intermediate values

  • Backward pass: apply chain rule to compute gradients

80

slide-81
SLIDE 81

Why learn all these details about gradients?

  • Modern deep learning frameworks compute gradients for you!
  • But why take a class on compilers or systems when they are

implemented for you?

  • Understanding what is going on under the hood is useful!
  • Backpropagation doesn’t always work perfectly
  • Understanding why is crucial for debugging and improving

models

  • See Karpathy article (in syllabus):
  • https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

  • Example in future lecture: exploding and vanishing gradients

81