
Training Neural Networks: Normalization, Regularization, etc. (Intro to Deep Learning, Fall 2020)

Quick Recap: Training a network • The loss is the divergence between the desired output and the actual output of the net for a given input, averaged over all training instances


  1. A note on derivatives • The minibatch loss is the average of the divergences between the actual and desired outputs of the network for all inputs in the minibatch: $Loss = \frac{1}{B}\sum_{t=1}^{B} Div(Y_t, d_t)$ • The derivative of the minibatch loss w.r.t. the network parameters is the average of the derivatives of the divergences for the individual training instances: $\frac{dLoss}{dW} = \frac{1}{B}\sum_{t=1}^{B} \frac{dDiv(Y_t, d_t)}{dW}$ • In conventional training, both the output of the network in response to an input and the derivative of the divergence for that input are independent of the other inputs in the minibatch • If we use Batch Norm, this relation gets a little more complicated
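The averaging relation is easy to verify numerically. Below is a minimal numpy sketch (my own illustration, not from the slides) using a linear model $y = Wx$ with an L2 divergence; all variable names are assumptions for the example.

```python
import numpy as np

# Minibatch loss as the mean of per-instance divergences, and its gradient
# as the mean of per-instance gradients, for a linear model y = W x.
rng = np.random.default_rng(0)
B, D = 4, 3                      # minibatch size, input dimension
X = rng.normal(size=(B, D))      # minibatch inputs
d = rng.normal(size=B)           # desired outputs
W = rng.normal(size=D)           # parameters

Y = X @ W                                    # per-instance outputs
per_instance_div = 0.5 * (Y - d) ** 2        # Div(Y_t, d_t)
loss = per_instance_div.mean()               # minibatch loss

per_instance_grad = (Y - d)[:, None] * X     # dDiv_t/dW, one row per instance
batch_grad = per_instance_grad.mean(axis=0)  # dLoss/dW: average of the rows

# Same gradient computed directly from the mean loss
assert np.allclose(batch_grad, X.T @ (Y - d) / B)
```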

  2. A note on derivatives • With Batch Norm, the outputs are functions of $\mu_B$ and $\sigma_B^2$, which are functions of the entire minibatch • The divergence for each input depends on all the inputs within the minibatch – Training instances within the minibatch are no longer independent

  3. The actual divergence with BN • The actual divergence for any minibatch, with the BN terms explicitly written, depends on every instance in the batch through $\mu_B$ and $\sigma_B^2$ • We need the derivative of this function • To derive the derivative, let's consider the dependencies at a single neuron – Shown pictorially in the following slide

  4. Batchnorm is a vector function over the minibatch • Batch normalization is really a vector function applied over all the inputs from a minibatch – Every $z_i$ affects every $\hat{z}_j$ – Shown on the next slide • To compute the derivative of the minibatch loss w.r.t. any $z_i$, we must consider all $u_j$ in the batch

  5. Or more explicitly: $u_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad \hat{z}_i = \gamma u_i + \beta$ • The computation of the mini-batch normalized $u_i$'s is a vector function – Invoking mean and variance statistics across the minibatch • The subsequent shift and scaling is individually applied to each $u_i$ to compute the corresponding $\hat{z}_i$

  6. Or more explicitly: $u_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad \hat{z}_i = \gamma u_i + \beta$ • We can compute $\frac{dDiv}{d\hat{z}_i}$ individually for each $\hat{z}_i$, because the processing after the computation of $\hat{z}_i$ is independent for each instance • The computation of the mini-batch normalized $u_i$'s is a vector function – Invoking mean and variance statistics across the minibatch • The subsequent shift and scaling is individually applied to each $u_i$ to compute the corresponding $\hat{z}_i$
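A minimal numpy sketch of these two stages at a single neuron (my own code, following the equations above; the $\gamma$, $\beta$, and $\epsilon$ values are illustrative assumptions):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """z: shape (B,), the pre-BN activations of one neuron over a minibatch."""
    mu = z.mean()                      # minibatch mean mu_B
    var = z.var()                      # minibatch variance sigma_B^2 (1/B form)
    u = (z - mu) / np.sqrt(var + eps)  # stage 1: vector function over the batch
    z_hat = gamma * u + beta           # stage 2: per-instance shift and scale
    return z_hat, u

z = np.array([0.2, -1.3, 0.8, 2.1])
z_hat, u = batchnorm_forward(z, gamma=1.5, beta=0.1)
```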

  7. Batch normalization: Forward pass • [Figure: a batch normalization unit inserted between a neuron's affine sum $z$ and its activation, computing $u_i$ and $\hat{z}_i$ as above]

  8–10. Batch normalization: Backpropagation • [Figure: the same unit traversed in the backward direction] • $\gamma$ and $\beta$ are parameters to be learned, so backprop must compute derivatives for them as well as for the $z_i$'s

  11. Propagating the derivative • Derivatives are computed for every $\hat{z}_i$, and hence for every $u_i$ • We now have $\frac{dDiv}{du_i}$ for every $u_i$ • We must propagate the derivative through the first stage of BN – Which is a vector operation over the minibatch

  12. The first stage of batchnorm • The complete dependency figure for the first "normalization" stage of Batchnorm – Which computes the centered $u_i$'s from the $z_i$'s for the minibatch • Note: inputs and outputs are different instances in a minibatch – The diagram represents BN occurring at a single neuron • Let's complete the figure and work out the derivatives

  13–15. The first stage of Batchnorm • The complete derivative of the mini-batch loss w.r.t. $z_i$: $\frac{dDiv}{dz_i} = \sum_j \frac{dDiv}{du_j} \frac{\partial u_j}{\partial z_i}$ – The $\frac{dDiv}{du_j}$ terms are already computed – The $\frac{\partial u_j}{\partial z_i}$ terms must be computed for every $(i, j)$ pair

  16–49. The first stage of Batchnorm • The derivative for the "through" line ($\partial u_i / \partial z_i$), derived step by step from the highlighted relations: with $\mu_B = \frac{1}{B}\sum_j z_j$ and $\sigma_B^2 = \frac{1}{B}\sum_j (z_j - \mu_B)^2$, we have $\frac{\partial \mu_B}{\partial z_i} = \frac{1}{B}$ and $\frac{\partial \sigma_B^2}{\partial z_i} = \frac{2(z_i - \mu_B)}{B}$, which combine to give $\frac{\partial u_i}{\partial z_i} = \frac{1 - \frac{1}{B}}{\sqrt{\sigma_B^2 + \epsilon}} - \frac{(z_i - \mu_B)^2}{B(\sigma_B^2 + \epsilon)^{3/2}}$

  50–55. The first stage of Batchnorm • The derivative for the "cross" lines ($\partial u_j / \partial z_i$ for $j \neq i$): $\frac{\partial u_j}{\partial z_i} = -\frac{1}{B\sqrt{\sigma_B^2 + \epsilon}} - \frac{(z_j - \mu_B)(z_i - \mu_B)}{B(\sigma_B^2 + \epsilon)^{3/2}}$ – This is identical to the equation for $\partial u_i / \partial z_i$, without the first "through" term $\frac{1}{\sqrt{\sigma_B^2 + \epsilon}}$
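These two formulas define the full $B \times B$ Jacobian of the first stage. The following sketch (my own verification, not from the slides) builds that Jacobian and checks it against finite differences:

```python
import numpy as np

def stage1(z, eps=1e-5):
    """First BN stage: centered, variance-normalized u's for the minibatch."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def stage1_jacobian(z, eps=1e-5):
    """J[j, i] = du_j/dz_i, from the 'through' and 'cross' formulas above."""
    B = z.size
    s = np.sqrt(z.var() + eps)
    zc = z - z.mean()                                    # centered z's
    J = -1.0 / (B * s) - np.outer(zc, zc) / (B * s**3)   # "cross" terms
    J[np.diag_indices(B)] += 1.0 / s                     # add the "through" term
    return J

z = np.array([0.2, -1.3, 0.8, 2.1])
J = stage1_jacobian(z)

# Finite-difference check: column i approximates du/dz_i
h = 1e-6
J_fd = np.column_stack([(stage1(z + h * e) - stage1(z - h * e)) / (2 * h)
                        for e in np.eye(z.size)])
assert np.allclose(J, J_fd, atol=1e-5)
```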

  56–57. The first stage of Batchnorm • The complete derivative of the mini-batch loss w.r.t. $z_i$, summing the "through" and "cross" contributions: $\frac{dDiv}{dz_i} = \sum_j \frac{dDiv}{du_j} \frac{\partial u_j}{\partial z_i} = \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \frac{dDiv}{du_i} - \frac{1}{B\sqrt{\sigma_B^2 + \epsilon}} \sum_j \frac{dDiv}{du_j} - \frac{z_i - \mu_B}{B(\sigma_B^2 + \epsilon)^{3/2}} \sum_j (z_j - \mu_B) \frac{dDiv}{du_j}$
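Put together with the $\gamma$/$\beta$ gradients, this yields a complete BN backward pass for one neuron. A self-contained numpy sketch (names are my own; the second and third terms use $z_j - \mu_B = u_j \sqrt{\sigma_B^2 + \epsilon}$ to simplify):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    u = (z - z.mean()) / np.sqrt(z.var() + eps)
    return gamma * u + beta, u

def batchnorm_backward(dz_hat, z, u, gamma, eps=1e-5):
    """dz_hat: dDiv/dz_hat_i for every instance in the minibatch, shape (B,)."""
    B = z.size
    s = np.sqrt(z.var() + eps)
    dgamma = np.sum(dz_hat * u)   # from z_hat_i = gamma * u_i + beta
    dbeta = np.sum(dz_hat)
    du = dz_hat * gamma           # dDiv/du_i
    # Complete derivative: "through" term, mean term, variance term
    dz = du / s - du.sum() / (B * s) - u * np.sum(du * u) / (B * s)
    return dz, dgamma, dbeta

z = np.array([0.2, -1.3, 0.8, 2.1])
gamma, beta = 1.5, 0.1
z_hat, u = batchnorm_forward(z, gamma, beta)
dz, dgamma, dbeta = batchnorm_backward(np.ones_like(z), z, u, gamma)
```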

  58. Batch normalization: Backpropagation • Having computed $\frac{dDiv}{dz_i}$ (along with the derivatives for the learned parameters $\gamma$ and $\beta$), the rest of backprop continues as usual from $\frac{dDiv}{dz_i}$

  59. Batch normalization: Inference • On test data, BN requires $\mu_k$ and $\sigma_k^2$, but there is no minibatch to compute them from • We will use the average over all training minibatches: $\mu_k^{BN} = \frac{1}{Nbatches} \sum_{batch} \mu_k(batch)$ and $\sigma_k^{2\,BN} = \frac{B}{(B-1)\,Nbatches} \sum_{batch} \sigma_k^2(batch)$ • Note: these are neuron-specific – $\mu_k(batch)$ and $\sigma_k^2(batch)$ here are obtained from the final converged network – The $B/(B-1)$ term gives us an unbiased estimator for the variance
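A sketch of this bookkeeping (class and function names are my own inventions for illustration):

```python
import numpy as np

class BNStats:
    """Accumulate per-minibatch statistics for one neuron during training."""
    def __init__(self):
        self.mus, self.vars = [], []

    def observe(self, z):                 # call once per training minibatch
        self.B = z.size
        self.mus.append(z.mean())
        self.vars.append(z.var())         # biased (1/B) minibatch variance

    def inference_params(self):
        mu = np.mean(self.mus)
        var = self.B / (self.B - 1) * np.mean(self.vars)  # unbiased correction
        return mu, var

def bn_inference(z_test, mu, var, gamma, beta, eps=1e-5):
    """Normalize test activations with the fixed training-set statistics."""
    return gamma * (z_test - mu) / np.sqrt(var + eps) + beta

stats = BNStats()
for batch in np.split(np.random.default_rng(0).normal(size=32), 8):
    stats.observe(batch)
mu, var = stats.inference_params()
y = bn_inference(np.array([0.3, -0.7]), mu, var, gamma=1.5, beta=0.1)
```

In practice, frameworks usually maintain an exponential moving average of the minibatch statistics rather than storing every batch's values, but the averaged-statistics idea is the same.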

  60. Batch normalization • Batch normalization may be applied to only some layers – Or even only to selected neurons in a layer • Improves both convergence rate and neural network performance – Anecdotal evidence that BN eliminates the need for dropout – To get maximum benefit from BN, learning rates must be increased, and learning rate decay can be faster • Since the data generally remain in the high-gradient regions of the activations – Also needs better randomization of training data order

  61. Batch Normalization: Typical result • Performance on ImageNet, from Ioffe and Szegedy, ICML 2015

  62. Story so far • Gradient descent can be sped up by incremental updates • Convergence can be improved using smoothed updates • The choice of divergence affects both the learned network and results • Covariate shift between training and test may cause problems and may be handled by batch normalization

  63. The problem of data underspecification • The figures shown to illustrate the learning problem so far were fake news…

  64. Learning the network • We attempt to learn an entire function from just a few snapshots of it

  65. General approach to training • Define a divergence between the actual network output for any parameter value and the desired output – Typically an L2 divergence or KL divergence • [Figure: blue lines show the error when the function is below the desired output; black lines show the error when it is above]

  66. Overfitting • Problem: the network may just learn the values at the training inputs – Learning the red curve instead of the dotted blue one, given only the red vertical bars as inputs

  67. Data under-specification • Consider a binary 100-dimensional input: there are $2^{100} \approx 10^{30}$ possible inputs • Complete specification of the function would require specifying $10^{30}$ output values • A training set with only $10^{15}$ training instances will be off by a factor of $10^{15}$
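As a quick check on the $2^{100} \approx 10^{30}$ figure (my arithmetic, not the slides'):

```latex
% 2^100 in powers of 10, via 2^10 = 1024 \approx 10^3
2^{100} = \left(2^{10}\right)^{10} = 1024^{10} \approx \left(10^{3}\right)^{10} = 10^{30}
```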

  68. Data under-specification in learning • Find the function! • The same counts as above: $2^{100} \approx 10^{30}$ possible inputs for a binary 100-dimensional input, complete specification requires $10^{30}$ output values, and a training set of $10^{15}$ instances is off by a factor of $10^{15}$

  69. Need "smoothing" constraints • Need additional constraints that will "fill in" the missing regions acceptably – Generalization
