  1. Natural Language Processing with Deep Learning Neural Networks – a Walkthrough Navid Rekab-Saz navid.rekabsaz@jku.at Institute of Computational Perception

  2. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  3. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  4. Notation
  • a → scalar
  • 𝒃 → vector
  - the j-th element of 𝒃 is the scalar b_j
  • 𝑪 → matrix
  - the j-th vector of 𝑪 is 𝒄_j
  - the k-th element of the j-th vector of 𝑪 is the scalar c_{j,k}
  • Tensor: generalization of scalars, vectors, and matrices to any arbitrary number of dimensions
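
A quick illustration (not part of the original slides): the notation maps directly onto NumPy objects. The array values below are arbitrary.

```python
import numpy as np

a = 3.0                                   # scalar
b = np.array([1.0, 2.0, 3.0])             # vector; b[j] is the scalar b_j
C = np.array([[1.0, 2.0],
              [3.0, 4.0]])                # matrix; C[j] is the vector c_j, C[j, k] the scalar c_{j,k}
T = np.zeros((2, 3, 4))                   # a 3-dimensional tensor (generalisation to arbitrary dimensions)

print(a, b.shape, C.shape, T.shape)       # -> 3.0 (3,) (2, 2) (2, 3, 4)
```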

  5. Linear Algebra

  6. Linear Algebra – Transpose
  • 𝒃 has dimensions 1 × d → 𝒃ᵀ has dimensions d × 1
  • 𝑩 has dimensions e × d → 𝑩ᵀ has dimensions d × e
  • Example: the 2 × 3 matrix with rows (1 2 3) and (4 5 6) transposes to the 3 × 2 matrix with rows (1 4), (2 5), (3 6)

  7. Linear Algebra – Dot product
  • 𝒃 · 𝒄ᵀ is a scalar — dimensions: (1 × d) · (d × 1) = 1
  • 𝒃 · 𝑪 is a vector — dimensions: (1 × d) · (d × e) = 1 × e
  • 𝑩 · 𝑪 = 𝑫 — dimensions: (l × m) · (m × n) = l × n
  • Linear transformation: the dot product of a vector with a matrix (the slide illustrates each case with small numeric examples)
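
A minimal NumPy sketch of these operations. The transpose example uses the matrix from the previous slide; the dot-product values are illustrative stand-ins, not the slide's own numbers.

```python
import numpy as np

# Transpose: a 2 x 3 matrix becomes 3 x 2
B = np.array([[1, 2, 3],
              [4, 5, 6]])
print(B.T)                       # [[1 4], [2 5], [3 6]]

# Dot products (illustrative values)
b = np.array([1, 2, 3])          # 1 x d
c = np.array([0, 5, 1])          # 1 x d
print(b @ c)                     # (1 x d) . (d x 1) -> scalar: 13

C = np.array([[2, 3],
              [0, 1],
              [5, -1]])          # d x e
print(b @ C)                     # (1 x d) . (d x e) -> 1 x e vector: [17, 2], a linear transformation of b
```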

  8. Probability
  • Conditional probability: p(y|x)
  • Probability distribution
  - For a discrete random variable X with L states: 0 ≤ p(X_j) ≤ 1 and Σ_{j=1..L} p(X_j) = 1
  - E.g., with L = 4 states: (0.2, 0.3, 0.45, 0.05)

  9. Probability
  • Expected value: 𝔼_{x∼X}[f] = (1/|X|) Σ_{x∈X} f(x)
  - Note: this is an imprecise definition, but it suffices for our use in this lecture
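
A tiny illustration of this working definition, with made-up samples and an arbitrary function f:

```python
import numpy as np

# Expected value as an empirical mean over samples x ~ X (the lecture's working definition)
samples = np.array([0.2, 1.5, -0.3, 0.8])     # illustrative draws of x
f = lambda x: x ** 2                          # some function f of x
print(np.mean(f(samples)))                    # E_{x~X}[f] ≈ (1/|X|) * sum_x f(x)
```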

  10. Artificial Neural Networks
  • Neural networks are non-linear functions and universal approximators
  • They are composed of several simple (non-)linear operations
  • Neural networks can readily be defined as probabilistic models that estimate p(y|𝒙; 𝑾)
  - Given the input vector 𝒙 and the set of parameters 𝑾, estimate the probability of the output class y

  11. A Feedforward Network
  • Figure: an input vector 𝒙 is mapped through two parameter matrices, 𝑾⁽¹⁾ (size 4 × 2) and 𝑾⁽²⁾ (size 3 × 4), to an output probability distribution p(y|𝒙; 𝑾)
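
A rough NumPy sketch of such a network, assuming a 2-dimensional input, a 4-unit hidden layer with a ReLU (the slide itself does not specify the non-linearity), and a 3-class softmax output; the weights `W1`, `W2` are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)             # input vector (2-dimensional, to match the 4 x 2 first layer)

W1 = rng.normal(size=(4, 2))       # parameter matrix W^(1), size 4 x 2
W2 = rng.normal(size=(3, 4))       # parameter matrix W^(2), size 3 x 4

h = np.maximum(0.0, W1 @ x)        # hidden layer (ReLU chosen here for illustration)
z = W2 @ h                         # output scores, one per class
p = np.exp(z) / np.exp(z).sum()    # softmax turns the scores into p(y | x; W)
print(p, p.sum())                  # a probability distribution over 3 classes
```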

  12. Learning with Neural Networks
  • Design the network's architecture
  • Consider proper regularization methods
  • Initialize the parameters
  • Loop until some exit criterion is met
  - Sample a minibatch from the training data 𝒟
  - Loop over the data points in the minibatch
  • Forward pass: given input 𝒙, predict the output p(y|𝒙; 𝑾)
  - Calculate the loss function
  - Calculate the gradient of each parameter with respect to the loss function using the backpropagation algorithm
  - Update the parameters using their gradients
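
A minimal PyTorch sketch of this loop; the architecture, optimizer, and the randomly generated minibatches are illustrative placeholders, not the lecture's setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 3))   # design the architecture
opt = torch.optim.SGD(model.parameters(), lr=0.1)                    # parameters are initialized by the layers
loss_fn = nn.CrossEntropyLoss()                                      # NLL of a softmax output (see the later slides)

for step in range(100):                       # loop until some exit criterion is met
    x = torch.randn(16, 2)                    # sample a minibatch (random stand-in for real training data)
    y = torch.randint(0, 3, (16,))            # correct output classes
    scores = model(x)                         # forward pass
    loss = loss_fn(scores, y)                 # calculate the loss function
    opt.zero_grad()
    loss.backward()                           # backpropagation computes the gradients
    opt.step()                                # update the parameters using their gradients
```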

  13. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  14. Neural Computation (figure; source linked on the slide)

  15. An Artificial Neuron (figure; source linked on the slide)

  16. Linear
  • f(x) = x

  17. Sigmoid
  • f(x) = σ(x) = 1 / (1 + e^(−x))
  • Squashes the input to the range (0, 1)
  • The output can be interpreted as a probability value

  18. Hyperbolic Tangent (Tanh)
  • f(x) = tanh(x) = (e^(2x) − 1) / (e^(2x) + 1)
  • Squashes the input to the range (−1, 1)

  19. Rectified Linear Unit (ReLU)
  • f(x) = max(0, x)
  • Well suited for deep architectures, as it helps prevent vanishing gradients

  20. Examples
  • Input vector 𝒙 = (1, 3); parameter matrix 𝑾 (2 × 5) with rows (0.5, −0.5, 2, 0, 0) and (0, 0, 0, 4, −1)
  • Linear transformation: 𝒙𝑾 = (0.5, −0.5, 2, 12, −3)
  • Non-linear transformation ReLU(𝒙𝑾) = (0.5, 0, 2, 12, 0)
  • Non-linear transformation σ(𝒙𝑾) ≈ (0.62, 0.38, 0.88, 1.00, 0.05)
  • Non-linear transformation tanh(𝒙𝑾) ≈ (0.46, −0.46, 0.96, 1.00, −1.00)
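
The same example in NumPy; the printed values match the numbers above up to rounding.

```python
import numpy as np

x = np.array([1.0, 3.0])                       # input vector
W = np.array([[0.5, -0.5, 2.0, 0.0,  0.0],
              [0.0,  0.0, 0.0, 4.0, -1.0]])    # parameter matrix

s = x @ W                                      # linear transformation: [0.5, -0.5, 2, 12, -3]

print(np.maximum(0.0, s))                      # ReLU:    [0.5, 0,     2,    12,    0    ]
print(1.0 / (1.0 + np.exp(-s)))                # sigmoid: [0.62, 0.38, 0.88, ~1.00, 0.05 ]
print(np.tanh(s))                              # tanh:    [0.46, -0.46, 0.96, ~1.00, -1.00]
```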

  21. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  22. Forward pass
  • Consider this calculation: z(x; 𝒘) = 2·w₂² + 2·x·w₁ + w₀, where x is the input and 𝒘 is the set of parameters, initialized as w₀ = 1, w₁ = 3, w₂ = 2
  • Let's break it into intermediary variables: a = 2·x·w₁, b = a + w₀, c = w₂², z = b + 2·c

  23. Computational Graph
  • The computation is drawn as a graph: the leaves x, w₀, w₁, w₂ feed the nodes a = 2·x·w₁, b = a + w₀, c = w₂², and the output z = b + 2·c
  • Initialization: w₀ = 1, w₁ = 3, w₂ = 2

  24. Computational Graph – local derivatives
  • Each edge of the graph is annotated with its local derivative: ∂z/∂b = 1, ∂z/∂c = 2, ∂b/∂a = 1, ∂b/∂w₀ = 1, ∂c/∂w₂ = 2·w₂, ∂a/∂x = 2·w₁, ∂a/∂w₁ = 2·x
  • Initialization: w₀ = 1, w₁ = 3, w₂ = 2

  25. Forward pass
  • With input x = 1 and w₀ = 1, w₁ = 3, w₂ = 2, the forward pass fills in the node values: a = 6, b = 7, c = 4, z = 15
  • The local derivatives on the edges stay as on the previous slide

  26. Backward pass
  • Starting from the output z = 15, gradients are propagated backwards through the graph by multiplying the local derivatives along each path: ∂z/∂b = 1, ∂z/∂c = 2, ∂z/∂a = 1
  • At the leaves: ∂z/∂w₀ = 1, ∂z/∂w₁ = 2, ∂z/∂w₂ = 8, ∂z/∂x = 6
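
The forward and backward passes can be traced in a few lines of plain Python; the variable names follow the graph above.

```python
# Forward and backward pass for z = b + 2c, with a = 2*x*w1, b = a + w0, c = w2**2
x, w0, w1, w2 = 1.0, 1.0, 3.0, 2.0

# forward pass
a = 2 * x * w1          # 6
b = a + w0              # 7
c = w2 ** 2             # 4
z = b + 2 * c           # 15

# backward pass: multiply local derivatives along each path (chain rule)
dz_db, dz_dc = 1.0, 2.0
dz_da  = dz_db * 1.0            # 1
dz_dw0 = dz_db * 1.0            # 1
dz_dw1 = dz_da * 2 * x          # 2
dz_dw2 = dz_dc * 2 * w2         # 8
dz_dx  = dz_da * 2 * w1         # 6
```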

  27. Gradient & Chain rule
  • For optimization we need the gradient of z with respect to 𝒘: ∇_𝒘 z = (∂z/∂w₀, ∂z/∂w₁, ∂z/∂w₂)
  • We calculate it using the chain rule and the local derivatives:
  ∂z/∂w₀ = (∂z/∂b)·(∂b/∂w₀)
  ∂z/∂w₁ = (∂z/∂b)·(∂b/∂a)·(∂a/∂w₁)
  ∂z/∂w₂ = (∂z/∂c)·(∂c/∂w₂)

  28. Backpropagation
  ∂z/∂w₀ = (∂z/∂b)·(∂b/∂w₀) = 1 · 1 = 1
  ∂z/∂w₁ = (∂z/∂b)·(∂b/∂a)·(∂a/∂w₁) = 1 · 1 · 2 = 2
  ∂z/∂w₂ = (∂z/∂c)·(∂c/∂w₂) = 2 · 4 = 8
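
These hand-computed gradients can be checked with an automatic-differentiation library; a small PyTorch sketch (not part of the slides):

```python
import torch

w = torch.tensor([1.0, 3.0, 2.0], requires_grad=True)   # w0, w1, w2
x = torch.tensor(1.0)

z = 2 * x * w[1] + w[0] + 2 * w[2] ** 2                  # the same function, written in one line
z.backward()                                             # backpropagation through the recorded graph
print(z.item())                                          # 15.0
print(w.grad)                                            # tensor([1., 2., 8.])
```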

  29. Agenda • Introduction • Non-linearities • Forward pass & backpropagation • Softmax & loss function • Optimization & regularization

  30. Softmax
  • Given the output vector 𝒛 of a neural network model with L output classes, softmax turns the vector into a probability distribution:
  softmax(𝒛)ᵢ = e^(zᵢ) / Σ_{j=1..L} e^(zⱼ)
  • The denominator is the normalization term

  31. Softmax – numeric example
  • L = 4 classes, 𝒛 = (1, 2, 5, 6)
  • softmax(𝒛) ≈ (0.004, 0.013, 0.264, 0.717)
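
The same numbers can be reproduced with a few lines of NumPy:

```python
import numpy as np

z = np.array([1.0, 2.0, 5.0, 6.0])
p = np.exp(z) / np.exp(z).sum()
print(p.round(3))        # [0.005 0.013 0.264 0.718] -- the slide's 0.004 / 0.717 are the same values truncated
print(p.sum())           # 1.0
```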

  32. Softmax characteristics
  • The exponential function in softmax separates the highest value from the others
  • Softmax identifies the "max", but in a "soft" way!
  • Softmax creates competition between the predicted output values, so that in the extreme case the "winner takes all"
  - Winner-takes-all: one output is 1 and the rest are 0
  - This resembles the competition between nearby neurons in the cortex

  33. Negative Log Likelihood (NLL) Loss
  • The NLL loss function is commonly used to optimize neural networks for classification tasks:
  ℒ = −𝔼_{(𝒙, y)∼𝒟} [ log p(y|𝒙; 𝑾) ]
  - 𝒟 is the set of (training) data
  - 𝒙 is the input vector
  - y is the correct output class
  • NLL is a form of the cross-entropy loss

  34. NLL + Softmax
  • The choice of output function (such as softmax) is closely tied to the choice of loss function: the two should fit each other!
  • Softmax and NLL are a good pair
  • To see why, let's calculate the final NLL loss when softmax is used at the output layer (next slide)

  35. NLL + Softmax
  • Loss function for one data point: ℒ(f(𝒙; 𝑾), y)
  • 𝒛 is the output vector of the network before applying softmax, and y is the index of the correct class
  ℒ(f(𝒙; 𝑾), y) = −log p(y|𝒙; 𝑾) = −log( e^(z_y) / Σⱼ e^(zⱼ) ) = −z_y + log Σⱼ e^(zⱼ)

  36. NLL + Softmax – example
  • 𝒛 = (1, 2, 0.5, 6)
  • If the correct class is the first one (y = 0): ℒ = −1 + log(e¹ + e² + e^0.5 + e⁶) = −1 + 6.02 = 5.02
  • If the correct class is the third one (y = 2): ℒ = −0.5 + 6.02 = 5.52
  • If the correct class is the fourth one (y = 3): ℒ = −6 + 6.02 = 0.02

  37. NLL + Softmax – another example
  • 𝒛 = (1, 2, 5, 6)
  • If the correct class is the first one (y = 0): ℒ = −1 + log(e¹ + e² + e⁵ + e⁶) = −1 + 6.33 = 5.33
  • If the correct class is the third one (y = 2): ℒ = −5 + 6.33 = 1.33
  • If the correct class is the fourth one (y = 3): ℒ = −6 + 6.33 = 0.33
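
Both examples can be verified with a small helper that implements −z_y + log Σⱼ e^(zⱼ); the helper name nll_from_scores is illustrative, not from the lecture.

```python
import numpy as np

def nll_from_scores(z, y):
    # NLL of the softmax output: -z_y + log(sum_j exp(z_j))
    return -z[y] + np.log(np.exp(z).sum())

z1 = np.array([1.0, 2.0, 0.5, 6.0])      # first example
z2 = np.array([1.0, 2.0, 5.0, 6.0])      # second example
for y in (0, 2, 3):
    print(nll_from_scores(z1, y), nll_from_scores(z2, y))
# prints (5.03, 5.33), (5.53, 1.33), (0.03, 0.33) -- the slide values up to rounding
```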
