Natural Language Processing with Deep Learning CS224N/Ling284 - PowerPoint PPT Presentation


  1. Natural Language Processing with Deep Learning, CS224N/Ling284. Christopher Manning. Lecture 4: Backpropagation and Computation Graphs

  2. Lecture Plan (Lecture 4: Backpropagation and computation graphs)
     1. Matrix gradients for our simple neural net and some tips [15 mins]
     2. Computation graphs and backpropagation [40 mins]
     3. Stuff you should know [15 mins]
        a. Regularization to prevent overfitting
        b. Vectorization
        c. Nonlinearities
        d. Initialization
        e. Optimizers
        f. Learning rates

  3. 1. Derivative wrt a weight matrix
     • Let's look carefully at computing $\frac{\partial s}{\partial W}$
     • Using the chain rule again: $\frac{\partial s}{\partial W} = \frac{\partial s}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial W}$
       where $s = u^T h$, $h = f(z)$, $z = Wx + b$, and
       $x = [\,x_{museums}\;\, x_{in}\;\, x_{Paris}\;\, x_{are}\;\, x_{amazing}\,]$ (the concatenated window of word vectors)

  4. Deriving gradients for backprop
     • For this function (following on from last time): $\frac{\partial s}{\partial W} = \delta\,\frac{\partial z}{\partial W}$, with error signal $\delta = \frac{\partial s}{\partial h}\frac{\partial h}{\partial z}$
     • Let's consider the derivative of a single weight $W_{ij}$
     • $W_{ij}$ only contributes to $z_i$; for example, $W_{23}$ is only used to compute $z_2$, not $z_1$
     • $\frac{\partial z_i}{\partial W_{ij}} = \frac{\partial}{\partial W_{ij}}\big(W_{i\cdot}\,x + b_i\big) = \frac{\partial}{\partial W_{ij}} \sum_{k} W_{ik} x_k = x_j$
     (Figure: a small network with hidden units $h_1 = f(z_1)$, $h_2 = f(z_2)$, output weights $u$, inputs $x_1, x_2, x_3$, and bias $+1$.)

  5. Deriving gradients for backprop
     • So for the derivative of a single $W_{ij}$: $\frac{\partial s}{\partial W_{ij}} = \delta_i\, x_j$
       (error signal from above $\times$ local input signal)
     • We want the gradient for the full $W$, but each case is the same
     • Overall answer, an outer product: $\frac{\partial s}{\partial W} = \delta^T x^T$ (an $n \times m$ matrix, the same shape as $W$)
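A minimal NumPy sketch of these formulas for the simple net above. The sigmoid choice for $f$ and the sizes are illustrative assumptions, not the course's exact setup; the names `W`, `b`, `u`, `x`, `delta` mirror the slides.

```python
import numpy as np

# Illustrative sizes (assumed): hidden size n = 4, window input size m = 6
n, m = 4, 6
rng = np.random.default_rng(0)
W, b, u = rng.normal(size=(n, m)), rng.normal(size=n), rng.normal(size=n)
x = rng.normal(size=m)                 # concatenated window of word vectors

def f(z):                              # assume f is a sigmoid, purely for concreteness
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: z = Wx + b, h = f(z), s = u^T h
z = W @ x + b
h = f(z)
s = u @ h

# Error signal at the hidden layer: delta = (ds/dh) * (dh/dz) = u * f'(z)
delta = u * h * (1.0 - h)              # shape (n,), same shape as the hidden layer (Tip 5)

# Gradient for the full W is the outer product  ds/dW = delta^T x^T
dW = np.outer(delta, x)
db = delta                             # the same delta is reused for ds/db
assert dW.shape == W.shape

# Check one entry against the single-weight formula  ds/dW_ij = delta_i * x_j
i, j = 1, 2
assert np.isclose(dW[i, j], delta[i] * x[j])
```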

  6. Deriving gradients: Tips
     • Tip 1: Carefully define your variables and keep track of their dimensionality!
     • Tip 2: Chain rule! If $y = f(u)$ and $u = g(x)$, i.e., $y = f(g(x))$, then $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$. Keep straight what variables feed into what computations
     • Tip 3: For the top softmax part of a model: first consider the derivative wrt $f_c$ when $c = y$ (the correct class), then consider the derivative wrt $f_c$ when $c \neq y$ (all the incorrect classes)
     • Tip 4: Work out element-wise partial derivatives if you're getting confused by matrix calculus!
     • Tip 5: Use the Shape Convention. Note: the error message $\delta$ that arrives at a hidden layer has the same dimensionality as that hidden layer

  7. Deriving gradients wrt words for window model
     • The gradient that arrives at and updates the word vectors can simply be split up for each word vector
     • With $x_{window} = [\,x_{museums}\;\, x_{in}\;\, x_{Paris}\;\, x_{are}\;\, x_{amazing}\,]$, the window gradient splits into one chunk per word vector

  8. Updating word gradients in window model
     • This will push word vectors around so that they will (in principle) be more helpful in determining named entities
     • For example, the model can learn that seeing $x_{in}$ as the word just before the center word is indicative of the center word being a location (a small sketch of splitting the window gradient follows below)
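A minimal sketch of splitting the window gradient back into per-word updates. The embedding matrix `emb`, the word ids, and the dimension `d = 4` are made-up stand-ins for illustration; `grad_window` plays the role of the gradient that arrives at the window.

```python
import numpy as np

d = 4                                    # assumed word-vector dimension
rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, d))       # hypothetical embedding matrix (vocab x d)
word_ids = [2031, 17, 501, 93, 788]      # "museums in Paris are amazing" (made-up ids)

x_window = np.concatenate([emb[i] for i in word_ids])   # shape (5d,)

# Suppose grad_window = dJ/dx_window arrived from backprop (random stand-in here)
grad_window = rng.normal(size=5 * d)

# Split it into one chunk per word vector and apply an SGD-style update
lr = 0.01
for k, i in enumerate(word_ids):
    emb[i] -= lr * grad_window[k * d:(k + 1) * d]
```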

  9. A pitfall when retraining word vectors
     • Setting: we are training a logistic regression classification model for movie review sentiment using single words
     • In the training data we have "TV" and "telly"
     • In the testing data we have "television"
     • The pre-trained word vectors have all three similar (figure: "TV", "telly", and "television" plotted close together)
     • Question: what happens when we update the word vectors?

  10. A pitfall when retraining word vectors
      • Question: what happens when we update the word vectors?
      • Answer:
        • Those words that are in the training data move around: "TV" and "telly"
        • Words not in the training data stay where they were: "television"
      • This can be bad! (figure: "telly" and "TV" have moved away, leaving "television" behind)

  11. So what should I do?
      • Question: Should I use available "pre-trained" word vectors?
      • Answer: Almost always, yes!
        • They are trained on a huge amount of data, so they will know about words not in your training data and will know more about the words that are in your training data
        • Have 100s of millions of words of data? Then it is okay to start random
      • Question: Should I update ("fine-tune") my own word vectors?
      • Answer:
        • If you only have a small training data set, don't train the word vectors
        • If you have a large dataset, it will probably work better to train = update = fine-tune the word vectors to the task
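A minimal PyTorch sketch of the two options, assuming `pretrained` is a tensor of pre-trained vectors loaded elsewhere (a random tensor stands in here): `freeze=True` keeps the vectors fixed, `freeze=False` fine-tunes them along with the rest of the model.

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10_000, 300)   # stand-in for loaded GloVe/word2vec vectors

# Small training set: keep the word vectors fixed
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Large training set: fine-tune the word vectors with the model
emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)

ids = torch.tensor([12, 403, 7])        # made-up word ids for a window
vectors = emb_tuned(ids)                # (3, 300); gradients will flow into emb_tuned.weight
```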

  12. Backpropagation
      • We've almost shown you backpropagation: it is taking derivatives and using the (generalized) chain rule
      • Other trick: we re-use derivatives computed for higher layers when computing derivatives for lower layers, so as to minimize computation

  13. 2. Computation Graphs and Backpropagation
      • We represent our neural net equations as a graph
        • Source nodes: inputs
        • Interior nodes: operations
      (Figure: the graph for $z = Wx + b$, $h = f(z)$, $s = u^T h$, built from $\cdot$ and $+$ operation nodes.)

  14. Computation Graphs and Backpropagation
      • We represent our neural net equations as a graph
        • Source nodes: inputs
        • Interior nodes: operations
        • Edges pass along the result of the operation

  15. Computation Graphs and Backpropagation
      • Representing our neural net equations as a graph
        • Source nodes: inputs
        • Interior nodes: operations
        • Edges pass along the result of the operation
      • Evaluating the graph from inputs to output is "Forward Propagation"

  16. Backpropagation
      • Go backwards along edges
      • Pass along gradients

  17. Backpropagation: Single Node
      • A node receives an "upstream gradient"
      • The goal is to pass on the correct "downstream gradient"

  18. Backpropagation: Single Node
      • Each node has a local gradient: the gradient of its output with respect to its input

  19. Backpropagation: Single Node
      • Each node has a local gradient: the gradient of its output with respect to its input
      • Chain rule!

  20. Backpropagation: Single Node
      • Each node has a local gradient: the gradient of its output with respect to its input
      • [downstream gradient] = [upstream gradient] x [local gradient]
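A tiny numeric illustration of this rule for a single node, assuming the node computes $h = f(z)$ with $f$ a sigmoid (the function choice and the numbers are illustrative only):

```python
import math

# Node: h = sigmoid(z), with an upstream gradient dL/dh arriving from above
z = 0.5
h = 1.0 / (1.0 + math.exp(-z))

upstream = 1.7                    # dL/dh, assumed to arrive from the rest of the graph
local = h * (1.0 - h)             # dh/dz, the node's local gradient
downstream = upstream * local     # dL/dz, passed on to the node's input
print(downstream)
```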

  21. Backpropagation: Single Node
      • What about nodes with multiple inputs (e.g., a $*$ node)?

  22. Backpropagation: Single Node
      • Multiple inputs → multiple local gradients, and hence multiple downstream gradients (one per input)

  23. An Example

  24. An Example: forward prop steps
      • The graph computes $f(x, y, z) = (x + y)\cdot\max(y, z)$: a $+$ node gives $a = x + y$, a $\max$ node gives $b = \max(y, z)$, and a $*$ node gives $f = a \cdot b$

  25. An Example: forward prop steps
      • With inputs $x = 1$, $y = 2$, $z = 0$: $a = x + y = 3$, $b = \max(y, z) = 2$, $f = 3 \cdot 2 = 6$

  26.–29. An Example: local gradients
      • Working node by node, annotate the graph with each node's local gradients:
        • $*$ node: $\partial f/\partial a = b = 2$ and $\partial f/\partial b = a = 3$
        • $\max$ node: $\partial b/\partial y = \mathbf{1}[y > z] = 1$ and $\partial b/\partial z = \mathbf{1}[z > y] = 0$
        • $+$ node: $\partial a/\partial x = 1$ and $\partial a/\partial y = 1$

  30. An Example: upstream * local = downstream
      • Start at the output with upstream gradient 1. At the $*$ node: downstream to $a$ is $1 \cdot 2 = 2$; downstream to $b$ is $1 \cdot 3 = 3$

  31. An Example: upstream * local = downstream
      • At the $\max$ node, the upstream gradient is 3: downstream to $y$ is $3 \cdot 1 = 3$; downstream to $z$ is $3 \cdot 0 = 0$

  32. An Example: upstream * local = downstream
      • At the $+$ node, the upstream gradient is 2: downstream to $x$ is $2 \cdot 1 = 2$; downstream to $y$ is $2 \cdot 1 = 2$

  33. An Example: the completed backward pass
      • $x$ receives gradient 2, $z$ receives 0, and $y$ receives 2 (through the $+$ node) and 3 (through the $\max$ node)

  34. Gradients sum at outward branches

  35. Gradients sum at outward branches
      • $y$ feeds into both $a = x + y$ and $b = \max(y, z)$, so its gradients from the two branches add: $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial y} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial y} = 2 + 3 = 5$
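A short PyTorch check of the whole worked example: leaf tensors with `requires_grad=True` stand in for the graph's inputs, and `backward()` reproduces the hand-computed gradients, including the summed gradient at $y$.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(0.0, requires_grad=True)

a = x + y             # + node: a = 3
b = torch.max(y, z)   # max node: b = 2
f = a * b             # * node: f = 6

f.backward()
print(x.grad, y.grad, z.grad)   # tensor(2.), tensor(5.), tensor(0.)
```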

  36. Node Intuitions
      • $+$ "distributes" the upstream gradient (here, each input of the $+$ node receives 2)

  37. Node Intuitions
      • $+$ "distributes" the upstream gradient to each summand
      • $\max$ "routes" the upstream gradient (here, $y$ receives 3 and $z$ receives 0)

  38. Node Intuitions
      • $+$ "distributes" the upstream gradient
      • $\max$ "routes" the upstream gradient
      • $*$ "switches" the upstream gradient: each input's gradient is the upstream gradient scaled by the other input's value

  39. Efficiency: compute all gradients at once
      • Incorrect way of doing backprop:
        • First compute one of the gradients on its own …

  40. Efficiency: compute all gradients at once
      • Incorrect way of doing backprop:
        • First compute one of the gradients on its own …
        • Then independently compute another …
        • Duplicated computation!

  41. Efficiency: compute all gradients at once
      • Correct way: compute all the gradients at once
      • Analogous to using $\delta$ when we computed gradients by hand

  42. Back-Prop in General Computation Graph (single scalar output $z$)
      1. Fprop: visit nodes in topological sort order
         - Compute the value of each node given its predecessors
      2. Bprop:
         - Initialize the output gradient to 1
         - Visit nodes in reverse order: compute the gradient wrt each node using the gradients wrt its successors; if $\{y_1, \dots, y_n\}$ are the successors of $x$, then $\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$
      • Done correctly, the big-O complexity of fprop and bprop is the same
      • In general our nets have a regular layer structure, so we can use matrices and Jacobians
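A compact sketch of this algorithm over an explicit graph. The `Node` class, `local_grads` helper, and the two supported ops are illustrative choices, not from any framework; the point is the topological-order forward pass and the reverse-order accumulation over successors.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                            # "input", "add", or "mul"
    inputs: list = field(default_factory=list)
    value: float = 0.0
    grad: float = 0.0                  # dz/d(this node), accumulated from successors

def forward(node):
    if node.op == "add":
        node.value = sum(p.value for p in node.inputs)
    elif node.op == "mul":
        node.value = node.inputs[0].value * node.inputs[1].value

def local_grads(node):
    if node.op == "add":
        return [1.0 for _ in node.inputs]
    if node.op == "mul":
        a, b = node.inputs
        return [b.value, a.value]
    return []

def backprop(topo_order):
    """topo_order: nodes in topological order; the last node is the scalar output."""
    for n in topo_order:               # fprop: compute each node from its predecessors
        if n.inputs:
            forward(n)
    for n in topo_order:
        n.grad = 0.0
    topo_order[-1].grad = 1.0          # initialize output gradient = 1
    for n in reversed(topo_order):     # bprop: reverse topological order
        for parent, g in zip(n.inputs, local_grads(n)):
            parent.grad += n.grad * g  # gradients sum over a node's successors

# Example: f = (x + y) * y with x = 1, y = 2
x, y = Node("input", value=1.0), Node("input", value=2.0)
a = Node("add", inputs=[x, y])
f = Node("mul", inputs=[a, y])
backprop([x, y, a, f])
print(x.grad, y.grad)   # 2.0, 5.0 (y's gradient sums over its two branches)
```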

  43. Automatic Differentiation
      • The gradient computation can be automatically inferred from the symbolic expression of the fprop
      • Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output
      • Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you, but mainly leave it to the layer/node writer to hand-calculate the local derivative
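For instance, in PyTorch the writer of a custom op supplies exactly those two pieces via `torch.autograd.Function`. A minimal sketch; the op itself (an elementwise square) is just for illustration:

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # stash what backward will need
        return x * x                    # how to compute the output

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x      # hand-calculated local derivative, chained with upstream

x = torch.tensor(3.0, requires_grad=True)
y = Square.apply(x)
y.backward()
print(x.grad)   # tensor(6.)
```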

  44. Backprop Implementations

  45. Implementation: forward/backward API

  46. Implementation: forward/backward API
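A minimal sketch of a forward/backward API in this style, assuming each gate caches whatever it needs during `forward` and returns one downstream gradient per input from `backward`. The class names and the two gates are illustrative, not the course's actual code.

```python
class MultiplyGate:
    """z = x * y, with forward/backward methods (an illustrative gate)."""
    def forward(self, x, y):
        self.x, self.y = x, y          # cache inputs for the backward pass
        return x * y

    def backward(self, dz):            # dz is the upstream gradient dL/dz
        dx = dz * self.y               # local gradient dz/dx = y
        dy = dz * self.x               # local gradient dz/dy = x
        return dx, dy

class AddGate:
    """z = x + y: the + gate distributes the upstream gradient."""
    def forward(self, x, y):
        return x + y

    def backward(self, dz):
        return dz, dz

# A two-gate check on f = (x + y) * y at x = 1, y = 2
add, mul = AddGate(), MultiplyGate()
a = add.forward(1.0, 2.0)
f = mul.forward(a, 2.0)
da, dy2 = mul.backward(1.0)
dx, dy1 = add.backward(da)
print(dx, dy1 + dy2)   # 2.0, 5.0 (gradients sum where y branches)
```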
