SLIDE 1

Natural Language Processing with Deep Learning CS224N/Ling284

Christopher Manning Lecture 4: Backpropagation and computation graphs

SLIDE 2

Lecture Plan

Lecture 4: Backpropagation and computation graphs

  • 1. Matrix gradients for our simple neural net and some tips [15 mins]
  • 2. Computation graphs and backpropagation [40 mins]
  • 3. Stuff you should know [15 mins]
  • a. Regularization to prevent overfitting
  • b. Vectorization
  • c. Nonlinearities
  • d. Initialization
  • e. Optimizers
  • f. Learning rates

SLIDE 3

  • 1. Derivative wrt a weight matrix
  • Let’s look carefully at computing ∂s/∂W
  • Using the chain rule again:

∂s/∂W = (∂s/∂h)(∂h/∂z)(∂z/∂W)

x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
h = f(z)    z = Wx + b    s = uᵀh

SLIDE 4

Deriving gradients for backprop

  • For this function (following on from last time): s = uᵀh, h = f(z), z = Wx + b
  • Let’s consider the derivative of a single weight W_ij
  • W_ij only contributes to z_i
  • For example: W23 is only used to compute z2, not z1

[Figure: feed-forward net with inputs x1, x2, x3 (and bias +1), hidden units h1 = f(z1) and h2 = f(z2), output s; u2, W23, and b2 highlighted]

∂s/∂W_ij = δ_i (∂z_i/∂W_ij)

∂z_i/∂W_ij = ∂/∂W_ij (W_i· x + b_i) = ∂/∂W_ij Σ_{k=1}^d W_ik x_k = x_j

SLIDE 5

Deriving gradients for backprop

  • So for the derivative of a single W_ij:

∂s/∂W_ij = δ_i x_j

  • We want the gradient for the full W – but each case is the same
  • Overall answer: the outer product:

∂s/∂W = δᵀ xᵀ    (δᵀ: error signal from above; xᵀ: local gradient signal)
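A quick numpy check of the outer-product form. This is a minimal sketch: the sizes, the choice of logistic sigmoid for f, and all variable names are my own, not from the slides.

```python
import numpy as np

n, m = 4, 5                        # hidden size n, input size m (assumed)
x = np.random.randn(m)             # input, e.g., concatenated window vectors
W, b = np.random.randn(n, m), np.random.randn(n)
u = np.random.randn(n)

z = W @ x + b
h = 1 / (1 + np.exp(-z))           # assume f is the logistic sigmoid
s = u @ h                          # scalar score

delta = u * h * (1 - h)            # error signal at z: ds/dz = u ⊙ f'(z)
grad_W = np.outer(delta, x)        # ds/dW = δᵀ xᵀ, shape (n, m)

# element-wise check against ds/dW_ij = δ_i x_j
assert np.isclose(grad_W[1, 2], delta[1] * x[2])
```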

SLIDE 6

Deriving gradients: Tips

  • Tip 1: Carefully define your variables and keep track of their dimensionality!
  • Tip 2: Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

dy/dx = (dy/du)(du/dx)

Keep straight what variables feed into what computations

  • Tip 3: For the top softmax part of a model: first consider the derivative wrt f_c when c = y (the correct class), then consider the derivative wrt f_c when c ≠ y (all the incorrect classes)
  • Tip 4: Work out element-wise partial derivatives if you’re getting confused by matrix calculus!
  • Tip 5: Use the Shape Convention. Note: the error signal δ that arrives at a hidden layer has the same dimensionality as that hidden layer

SLIDE 7

Deriving gradients wrt words for window model

  • The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
  • Let ∇_x J = Wᵀδ
  • With x_window = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
  • We have ∇_{x_window} J = [ ∇_{x_museums} J  ∇_{x_in} J  ∇_{x_Paris} J  ∇_{x_are} J  ∇_{x_amazing} J ]
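A small numpy sketch of splitting the arriving window gradient into per-word updates. The dimensions, the stand-in error signal, and all names are assumptions of mine, not values from the slides.

```python
import numpy as np

d, k = 100, 5                          # word vector dim and window size (assumed)
delta = np.random.randn(8)             # stand-in error signal at z
W = np.random.randn(8, k * d)
grad_window = W.T @ delta              # ∇_{x_window} J = Wᵀδ, shape (k*d,)

# split into one gradient per word vector; each word gets its own (d,) update
words = ["museums", "in", "Paris", "are", "amazing"]
for word, g in zip(words, grad_window.reshape(k, d)):
    print(word, g.shape)
```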

SLIDE 8

Updating word gradients in window model

  • This will push word vectors around so that they will (in principle) be more helpful in determining named entities.
  • For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location

SLIDE 9

A pitfall when retraining word vectors

  • Setting: We are training a logistic regression classification model for movie review sentiment using single words.
  • In the training data we have “TV” and “telly”
  • In the testing data we have “television”
  • The pre-trained word vectors have all three similar:

[Figure: “TV”, “telly”, and “television” close together in the pre-trained vector space]

  • Question: What happens when we update the word vectors?

SLIDE 10

A pitfall when retraining word vectors

  • Question: What happens when we update the word vectors?
  • Answer:
  • Those words that are in the training data move around
  • “TV” and “telly”
  • Words not in the training data stay where they were
  • “television”

[Figure: “TV” and “telly” have moved in vector space; “television” is left where it was]

This can be bad!

SLIDE 11

So what should I do?

  • Question: Should I use available “pre-trained” word vectors?
  • Answer:
  • Almost always, yes!
  • They are trained on a huge amount of data, and so they will know about words not in your training data and will know more about words that are in your training data
  • Have 100s of millions of words of data? Okay to start random
  • Question: Should I update (“fine tune”) my own word vectors?
  • Answer:
  • If you only have a small training data set, don’t train the word vectors
  • If you have a large dataset, it probably will work better to train = update = fine-tune word vectors to the task

SLIDE 12

Backpropagation

We’ve almost shown you backpropagation: it’s taking derivatives and using the (generalized) chain rule. The other trick: we re-use derivatives computed for higher layers in computing derivatives for lower layers, so as to minimize computation.

SLIDE 13

  • 2. Computation Graphs and Backpropagation

[Figure: computation graph for s = uᵀh, h = f(z), z = Wx + b, built from · , + , and f nodes]

  • We represent our neural net equations as a graph
  • Source nodes: inputs
  • Interior nodes: operations

SLIDE 14

Computation Graphs and Backpropagation

[Figure: the same computation graph]

  • We represent our neural net equations as a graph
  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation

SLIDE 15

Computation Graphs and Backpropagation

[Figure: the same computation graph, evaluated left to right]

  • Representing our neural net equations as a graph
  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation

“Forward Propagation”

SLIDE 16

Backpropagation

[Figure: the same graph traversed right to left, passing gradients]

  • Go backwards along edges
  • Pass along gradients

SLIDE 17

Backpropagation: Single Node

  • A node receives an “upstream gradient”
  • The goal is to pass on the correct “downstream gradient”

[Figure: a single node, with the upstream gradient arriving at its output edge and the downstream gradient leaving along its input edge]

SLIDE 18

Backpropagation: Single Node

[Figure: single node with downstream, local, and upstream gradients labeled]

  • Each node has a local gradient
  • The gradient of its output with respect to its input

SLIDE 19

Backpropagation: Single Node

[Figure: the same node; the downstream gradient is obtained from the upstream and local gradients]

  • Each node has a local gradient
  • The gradient of its output with respect to its input
  • Chain rule!

SLIDE 20

Backpropagation: Single Node

[Figure: the same node with downstream, upstream, and local gradients labeled]

  • Each node has a local gradient
  • The gradient of its output with respect to its input
  • [downstream gradient] = [upstream gradient] x [local gradient]
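In code, the rule at one node is a single multiply. A toy sketch; the choice of logistic sigmoid for f and all numbers are mine:

```python
import numpy as np

def f(z):                       # logistic sigmoid as the example node
    return 1 / (1 + np.exp(-z))

z = 0.5
h = f(z)                        # forward: the node's output
upstream = 2.0                  # ds/dh, handed down from the node above
local = h * (1 - h)             # dh/dz, the node's local gradient
downstream = upstream * local   # ds/dz, passed on to the node below
```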

SLIDE 21

Backpropagation: Single Node

[Figure: a * node with multiple inputs]

  • What about nodes with multiple inputs?

SLIDE 22

Backpropagation: Single Node

[Figure: * node with one upstream gradient, a local gradient per input, and one downstream gradient per input]

  • Multiple inputs → multiple local gradients

SLIDE 23

An Example

[Figure: an example expression built from +, max, and * nodes]

SLIDE 24

An Example

[Figure: the example graph]

Forward prop steps

SLIDE 25

An Example

Forward prop steps

[Figure: the example graph with node values 6, 3, 2, 1, 2, 2 filled in]

SLIDE 26

An Example

[Figure: the graph with forward values and local gradients annotated]

SLIDE 27

An Example

[Figure: the graph with forward values and local gradients annotated]

SLIDE 28

An Example

[Figure: the graph with forward values and local gradients annotated]

SLIDE 29

An Example

[Figure: the graph with forward values and local gradients annotated]

SLIDE 30

An Example

[Figure: backprop begins at the output * node]

upstream * local = downstream: with upstream gradient 1 at the * node, the downstream gradients are 1*3 = 3 and 1*2 = 2

SLIDE 31

An Example

[Figure: backprop at the max node]

upstream * local = downstream: with upstream gradient 3 at the max node, it passes 3*1 = 3 to its larger input and 3*0 = 0 to the other

SLIDE 32

An Example

[Figure: backprop at the + node]

upstream * local = downstream: with upstream gradient 2 at the + node, it passes 2*1 = 2 to each input

SLIDE 33

An Example

[Figure: the completed graph, with forward values 6, 3, 2, 1, 2, 2 and gradients 1, 3, 2, 3, 2, 2 annotated]
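Reading the node types and numbers off these slides, the example appears to be f(x, y, z) = (x + y) · max(y, z) with inputs x = 1, y = 2, z = 0; that function and those inputs are my reconstruction, so treat this as a sketch under that assumption:

```python
x, y, z = 1.0, 2.0, 0.0

# forward pass
a = x + y                  # 3
b = max(y, z)              # 2
out = a * b                # 6

# backward pass: downstream = upstream * local
dout = 1.0
da = dout * b              # * "switches" its inputs: 1*2 = 2
db = dout * a              # 1*3 = 3
dx = da * 1.0              # + "distributes": 2
dy = da * 1.0 + db * (1.0 if y >= z else 0.0)  # branches sum: 2 + 3 = 5
dz = db * (0.0 if y >= z else 1.0)             # max "routes" nothing here: 0
print(dx, dy, dz)          # 2.0 5.0 0.0
```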

SLIDE 34

Gradients sum at outward branches

[Figure: a variable feeding two downstream nodes; the gradients from the two branches are added]

SLIDE 35

Gradients sum at outward branches

[Figure: the same branching, with the two branch gradients summed]

SLIDE 36

Node Intuitions

[Figure: the example graph, highlighting the + node]

  • + “distributes” the upstream gradient

SLIDE 37

Node Intuitions

[Figure: the example graph, highlighting the max node]

  • + “distributes” the upstream gradient to each summand
  • max “routes” the upstream gradient
SLIDE 38

Node Intuitions

[Figure: the example graph, highlighting the * node]

  • + “distributes” the upstream gradient
  • max “routes” the upstream gradient
  • * “switches” the upstream gradient
SLIDE 39

Efficiency: compute all gradients at once

[Figure: the full computation graph]

  • Incorrect way of doing backprop:
  • First compute ∂s/∂b

SLIDE 40

Efficiency: compute all gradients at once

[Figure: the graph, with the shared part of the gradient computation repeated]

  • Incorrect way of doing backprop:
  • First compute ∂s/∂b
  • Then independently compute ∂s/∂W
  • Duplicated computation!

SLIDE 41

Efficiency: compute all gradients at once

[Figure: the graph, with the shared gradient δ computed once and reused]

  • Correct way:
  • Compute all the gradients at once
  • Analogous to using δ when we computed gradients by hand

SLIDE 42

Back-Prop in General Computation Graph

[Figure: a general computation graph with a single scalar output at the top; a node's gradient combines the gradients of its successors]

  • 1. Fprop: visit nodes in topological sort order
  • Compute the value of a node given its predecessors
  • 2. Bprop:
  • initialize output gradient = 1
  • visit nodes in reverse order:
  • Compute the gradient wrt each node using the gradient wrt its successors
  • Done correctly, the big O() complexity of fprop and bprop is the same
  • In general our nets have a regular layer structure, so we can use matrices and Jacobians…
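A minimal sketch of fprop/bprop in topological order. The class and variable names are my own; real frameworks are far more elaborate:

```python
class Node:
    def __init__(self, parents=()):
        self.parents, self.value, self.grad = list(parents), None, 0.0

class Add(Node):
    def forward(self):
        self.value = self.parents[0].value + self.parents[1].value
    def backward(self):
        for p in self.parents:          # local gradient of + is 1 per input
            p.grad += self.grad

class Mul(Node):
    def forward(self):
        self.value = self.parents[0].value * self.parents[1].value
    def backward(self):
        a, b = self.parents
        a.grad += self.grad * b.value   # * "switches" its inputs
        b.grad += self.grad * a.value

x, y = Node(), Node()
x.value, y.value = 1.0, 2.0
a = Add((x, y))
out = Mul((a, y))
topo = [a, out]                         # topological order of interior nodes

for n in topo:                          # fprop: visit in topological order
    n.forward()
out.grad = 1.0                          # initialize output gradient = 1
for n in reversed(topo):                # bprop: visit in reverse order
    n.backward()
print(x.grad, y.grad)                   # 2.0 5.0: y's two branches summed
```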

SLIDE 43

Automatic Differentiation

  • The gradient computation can be automatically inferred from the symbolic expression of the fprop
  • Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output
  • Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you but mainly leave the layer/node writer to hand-calculate the local derivative

SLIDE 44

Backprop Implementations

[Figure: code screenshot of a backprop implementation]

SLIDE 45

Implementation: forward/backward API

[Figure: code screenshot of a node/gate class exposing forward() and backward() methods]

SLIDE 46

Implementation: forward/backward API

[Figure: code screenshot, continued]
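The slide's code screenshots didn't survive extraction; here is a sketch of the kind of forward/backward API they illustrate. The class and method names are my own:

```python
class MultiplyGate:
    """Computes z = x * y; caches inputs during forward for use in backward."""
    def forward(self, x, y):
        self.x, self.y = x, y         # cache values needed for the backward pass
        return x * y
    def backward(self, dz):
        # downstream = upstream * local gradient, one per input
        return dz * self.y, dz * self.x

gate = MultiplyGate()
z = gate.forward(3.0, 2.0)            # forward pass: 6.0
dx, dy = gate.backward(1.0)           # backward pass: (2.0, 3.0)
```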

SLIDE 47

Gradient checking: Numeric Gradient

  • For small h (≈ 1e-4), f′(x) ≈ (f(x + h) − f(x − h)) / 2h
  • Easy to implement correctly
  • But approximate and very slow:
  • Have to recompute f for every parameter of our model
  • Useful for checking your implementation
  • In the old days when we hand-wrote everything, it was key to do this everywhere.
  • Now much less needed, when throwing together layers
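A sketch of the numeric check with a helper of my own; note the two function evaluations per parameter, which is why it is so slow:

```python
import numpy as np

def numeric_grad(f, x, h=1e-4):
    """Central-difference estimate (f(x+h) - f(x-h)) / 2h, per parameter."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; fp = f(x)    # recompute f with x_i nudged up
        x.flat[i] = old - h; fm = f(x)    # and nudged down
        x.flat[i] = old                   # restore the parameter
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# compare against an analytic gradient: f(x) = sum(x**2), so df/dx = 2x
x = np.random.randn(5)
g = numeric_grad(lambda v: np.sum(v ** 2), x)
assert np.allclose(g, 2 * x, atol=1e-6)
```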

SLIDE 48

Summary

  • We’ve mastered the core technology of neural nets!!!
  • Backpropagation: recursively apply the chain rule along the computation graph
  • [downstream gradient] = [upstream gradient] x [local gradient]
  • Forward pass: compute results of operations and save intermediate values
  • Backward pass: apply chain rule to compute gradients

SLIDE 49

Why learn all these details about gradients?

  • Modern deep learning frameworks compute gradients for you
  • But why take a class on compilers or systems when they are implemented for you?
  • Understanding what is going on under the hood is useful!
  • Backpropagation doesn’t always work perfectly.
  • Understanding why is crucial for debugging and improving models
  • See Karpathy article (in syllabus): https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
  • Example in future lecture: exploding and vanishing gradients

SLIDE 50

  • 3. We have models with many params! Regularization!
  • Really, a full loss function in practice includes regularization over all parameters θ, e.g., L2 regularization:

J(θ) = (1/N) Σ_{i=1..N} −log( e^{f_{y_i}} / Σ_c e^{f_c} ) + λ Σ_k θ_k²

  • Regularization (largely) prevents overfitting when we have a lot of features (or later a very powerful/deep model, ++)

[Figure: training error keeps decreasing with model power while test error turns back up: overfitting]

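A sketch of tacking the L2 term onto a data loss; the value of λ here is illustrative, not one from the lecture:

```python
import numpy as np

def regularized_loss(data_loss, params, lam=1e-4):
    # full loss = data loss + lambda * sum over all parameters theta_k^2
    return data_loss + lam * sum(np.sum(p ** 2) for p in params)

W = np.random.randn(3, 5)
b = np.random.randn(3)
total = regularized_loss(0.42, [W, b])
```
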
SLIDE 51

“Vectorization”

  • E.g., looping over word vectors versus concatenating them all into one large matrix and then multiplying the softmax weights with that matrix
  • 1000 loops, best of 3: 639 µs per loop
  • 10000 loops, best of 3: 53.8 µs per loop
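The slide's timed code isn't reproduced here; this is a sketch of the comparison it describes, with sizes C, d, N that are my guesses:

```python
import timeit
import numpy as np

C, d, N = 5, 300, 1000                 # classes, vector dim, number of words
W = np.random.randn(C, d)              # softmax weights
wordvecs = [np.random.randn(d) for _ in range(N)]

def with_loop():                       # one matrix-vector product per word
    return [W @ v for v in wordvecs]

def vectorized():                      # a single C x N matrix product
    return W @ np.stack(wordvecs, axis=1)

print(timeit.timeit(with_loop, number=100))
print(timeit.timeit(vectorized, number=100))   # typically ~10x faster
```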

SLIDE 52

“Vectorization”

  • The (10x) faster method is using a C x N matrix
  • Always try to use vectors and matrices rather than for loops!
  • You should speed-test your code a lot too!!
  • tl;dr: Matrices are awesome!!!

SLIDE 53

Non-linearities: The starting points

[Figure: curves of logistic (“sigmoid”), tanh, and hard tanh]

  • tanh is just a rescaled and shifted sigmoid (twice as steep, range [−1, 1]): tanh(z) = 2·logistic(2z) − 1
  • Both logistic and tanh are still used in particular cases, but are no longer the defaults for making deep networks
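The rescaling identity on this slide is easy to verify numerically:

```python
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-3, 3, 13)
# tanh(z) = 2 * logistic(2z) - 1
assert np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1)
```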

SLIDE 54

Non-linearities: The new world order

[Figure: curves of ReLU (rectified linear unit), Leaky ReLU, and Parametric ReLU, alongside hard tanh]

  • For building a feed-forward deep network, the first thing you should try is ReLU: it trains quickly and performs well due to good gradient backflow

rect(z) = max(z, 0)
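A sketch of the ReLU family; the leak slope 0.01 is a common default of my choosing, not a value from the slide:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)                 # rect(z) = max(z, 0)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # small slope instead of 0 for z < 0

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z))
```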

SLIDE 55

Parameter Initialization

  • You normally must initialize weights to small random values
  • To avoid symmetries that prevent learning/specialization
  • Initialize hidden layer biases to 0 and output (or reconstruction) biases to the optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target)
  • Initialize all other weights ~ Uniform(−r, r), with r chosen so numbers get neither too big nor too small
  • Xavier initialization has variance inversely proportional to fan-in n_in (previous layer size) and fan-out n_out (next layer size):

Var(W_i) = 2 / (n_in + n_out)
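A sketch of Xavier initialization in its uniform form, assuming the variance target Var(W) = 2/(n_in + n_out) given above:

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    # Uniform(-r, r) has variance r**2 / 3; solve for r so that
    # Var(W) = 2 / (n_in + n_out)
    r = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-r, r, size=(n_out, n_in))

W = xavier_uniform(300, 100)
print(W.var(), 2 / (300 + 100))   # empirical vs target variance
```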

SLIDE 56

Optimizers

  • Usually, plain SGD will work just fine
  • However, getting good results will often require hand-tuning the learning rate (next slide)
  • For more complex nets and situations, or just to avoid worry, you often do better with one of a family of more sophisticated “adaptive” optimizers that scale the parameter adjustment by an accumulated gradient
  • These models give per-parameter learning rates
  • Adagrad
  • RMSprop
  • Adam ← a fairly good, safe place to begin in many cases
  • SparseAdam

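In PyTorch, switching among these optimizers is one line. A minimal sketch; the model and data are toy placeholders of my own:

```python
import torch

model = torch.nn.Linear(300, 5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # or SGD, Adagrad, RMSprop

x = torch.randn(32, 300)
y = torch.randint(0, 5, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)

opt.zero_grad()    # clear stale gradients
loss.backward()    # backprop fills .grad on each parameter
opt.step()         # Adam's per-parameter adaptive update
```
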
SLIDE 57

Learning Rates

  • You can just use a constant learning rate. Start around lr = 0.001?
  • It must be order-of-magnitude right – try powers of 10
  • Too big: model may diverge or not converge
  • Too small: your model may not have trained by the deadline
  • Better results can generally be obtained by allowing learning rates to decrease as you train
  • By hand: halve the learning rate every k epochs
  • An epoch = a pass through the data (shuffled or sampled)
  • By a formula: lr = lr₀ e^(−kt), for epoch t
  • There are fancier methods like cyclic learning rates (q.v.)
  • Fancier optimizers still use a learning rate, but it may be an initial rate that the optimizer shrinks – so you may be able to start high
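A sketch of the two schedules mentioned above; the constants are illustrative:

```python
import math

def exponential_decay(lr0, k, t):
    return lr0 * math.exp(-k * t)      # lr = lr0 * e^(-kt) for epoch t

def halve_every(lr0, every, t):
    return lr0 * 0.5 ** (t // every)   # by hand: halve every `every` epochs

for t in range(0, 20, 5):
    print(t, exponential_decay(1e-3, 0.1, t), halve_every(1e-3, 5, t))
```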