SLIDE 1

Natural Language Processing with Deep Learning Neural Networks – a Walkthrough

Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception

SLIDE 2

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 3

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 4

Notation

Β§ $b$ β†’ scalar
Β§ $\boldsymbol{c}$ β†’ vector
  • the $j^{th}$ element of $\boldsymbol{c}$ is the scalar $c_j$
Β§ $\boldsymbol{D}$ β†’ matrix
  • the $j^{th}$ vector of $\boldsymbol{D}$ is $\boldsymbol{d}_j$
  • the $k^{th}$ element of the $j^{th}$ vector of $\boldsymbol{D}$ is the scalar $d_{j,k}$
Β§ Tensor: generalization of scalar, vector, and matrix to any arbitrary dimension

SLIDE 5

Linear Algebra

SLIDE 6

Linear Algebra – Transpose

Β§ $\boldsymbol{b}$ is in 1Γ—d dimensions β†’ $\boldsymbol{b}^T$ is in dΓ—1 dimensions
Β§ $\boldsymbol{B}$ is in eΓ—d dimensions β†’ $\boldsymbol{B}^T$ is in dΓ—e dimensions

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$

SLIDE 7

Linear Algebra – Dot product

Β§ $\boldsymbol{b} \cdot \boldsymbol{c}^T = d$ (a scalar)
  • dimensions: 1Γ—d Β· dΓ—1 = 1Γ—1
Β§ $\boldsymbol{b} \cdot \boldsymbol{C} = \boldsymbol{d}$ (a vector)
  • dimensions: 1Γ—d Β· dΓ—e = 1Γ—e
Β§ $\boldsymbol{B} \cdot \boldsymbol{C} = \boldsymbol{D}$ (a matrix)
  • dimensions: lΓ—m Β· mΓ—n = lΓ—n
Β§ Linear transformation: the dot product of a vector with a matrix (see the sketch below)
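
These shape rules are easy to check in NumPy; a minimal sketch (the numeric values below are illustrative, not the slide's):

```python
import numpy as np

b = np.array([[1., 2., 3.]])          # shape 1x3 (a row vector)
c = np.array([[2., 1., 0.]])          # shape 1x3
C = np.array([[2., 3.],
              [1., 1.],
              [-1., 0.]])             # shape 3x2
B = np.array([[1., 2., 3.],
              [4., 1., 1.]])          # shape 2x3

print((b @ c.T).shape)   # (1, 1): the scalar d
print((b @ C).shape)     # (1, 2): a vector in 1xe dimensions
print((B @ C).shape)     # (2, 2): a matrix in lxn dimensions
print(b.T.shape)         # (3, 1): the transpose flips 1xd to dx1
```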

SLIDE 8

Probability

Β§ Conditional probability

$$q(z|y)$$

Β§ Probability distribution
  • For a discrete random variable $\boldsymbol{a}$ with $L$ states
  • $0 \leq q(a_j) \leq 1$
  • $\sum_{j=1}^{L} q(a_j) = 1$
  • E.g. with $L = 4$ states: 0.2  0.3  0.45  0.05
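
As a quick sanity check, the example distribution satisfies both conditions; a minimal NumPy sketch:

```python
import numpy as np

q = np.array([0.2, 0.3, 0.45, 0.05])   # L = 4 states
assert np.all((0 <= q) & (q <= 1))     # every probability lies in [0, 1]
assert np.isclose(q.sum(), 1.0)        # the probabilities sum to 1
```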

SLIDE 9

Probability

Β§ Expected value

$$\mathbb{E}_{y \sim Y}\left[g\right] = \frac{1}{|Y|} \sum_{y \in Y} g(y)$$

  • Note: this is an imprecise definition, but it suffices for our use in this lecture
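
A sketch of this simplified (uniform) expected value; the set Y and the function g below are arbitrary placeholders:

```python
import numpy as np

Y = np.array([1.0, 2.0, 3.0, 4.0])   # a finite set of values
g = lambda y: y ** 2                 # placeholder function g
print(np.mean(g(Y)))                 # (1/|Y|) * sum of g(y) over y in Y = 7.5
```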

SLIDE 10

Artificial Neural Networks

Β§ Neural Networks are non-linear functions and universal approximators
Β§ They are composed of several simple (non-)linear operations
Β§ Neural networks can readily be defined as probabilistic models which estimate $q(z|\boldsymbol{y}; \boldsymbol{X})$
  • Given input vector π’š and the set of parameters 𝑿, estimate the probability of the output class 𝑧

SLIDE 11

A Feedforward Network

Β§ Input vector π’š
Β§ Parameter matrices 𝑿^(𝟐) (size 3Γ—4) and 𝑿^(πŸ‘) (size 4Γ—2)
Β§ Output probability distribution $q(z|\boldsymbol{y}; \boldsymbol{X})$

SLIDE 12

Learning with Neural Networks

Β§ Design the network’s architecture
Β§ Consider proper regularization methods
Β§ Initialize parameters
Β§ Loop until some exit criteria are met (a skeleton of this loop is sketched below)
  • Sample a minibatch from the training data 𝒠
  • Loop over the data points in the minibatch
  • Forward pass: given input π’š, predict the output $q(z|\boldsymbol{y}; \boldsymbol{X})$
  • Calculate the loss function
  • Calculate the gradient of each parameter with respect to the loss function, using the backpropagation algorithm
  • Update the parameters using their gradients
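
The same recipe as a Python skeleton; `model`, `nll_loss`, `backprop`, and `sample_minibatch` are hypothetical helpers standing in for the steps named above:

```python
def train(params, data, lr=0.1, n_steps=1000):
    for step in range(n_steps):                    # exit criterion: a fixed step budget
        batch = sample_minibatch(data)             # sample a minibatch from the training data
        grads = {name: 0.0 for name in params}
        for y, z in batch:                         # loop over the data points in the minibatch
            q = model(y, params)                   # forward pass: predict q(z | y; X)
            loss = nll_loss(q, z)                  # calculate the loss
            for name, g in backprop(loss, params).items():
                grads[name] += g                   # gradient of each parameter w.r.t. the loss
        for name in params:
            params[name] -= lr * grads[name] / len(batch)  # update using the gradients
    return params
```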
SLIDE 13

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 14

Neural Computation

SLIDE 15

An Artificial Neuron

SLIDE 16

Linear

$$g(y) = y$$

SLIDE 17

Sigmoid

$$g(y) = \sigma(y) = \frac{1}{1 + e^{-y}}$$

Β§ Squashes input between 0 and 1
Β§ Output becomes like a probability value

SLIDE 18

Hyperbolic Tangent (Tanh)

$$g(y) = \tanh(y) = \frac{e^{2y} - 1}{e^{2y} + 1}$$

Β§ Squashes input between βˆ’1 and 1

SLIDE 19

Rectified Linear Unit (ReLU)

$$g(y) = \max(0, y)$$

Β§ Good for deep architectures, as it mitigates the vanishing-gradient problem

SLIDE 20

Examples

Β§ Input vector π’š = [1  3] and parameter matrix 𝑿
Β§ Linear transformation π’šπ‘Ώ: π’šπ‘Ώ = [0.5  βˆ’0.5  2  12  βˆ’3]
Β§ Non-linear transformation ReLU(π’šπ‘Ώ): ReLU([0.5  βˆ’0.5  2  12  βˆ’3]) = [0.5  0  2  12  0]
Β§ Non-linear transformation Οƒ(π’šπ‘Ώ): Οƒ([0.5  βˆ’0.5  2  12  βˆ’3]) β‰ˆ [0.62  0.38  0.88  1.00  0.05]
Β§ Non-linear transformation tanh(π’šπ‘Ώ): tanh([0.5  βˆ’0.5  2  12  βˆ’3]) β‰ˆ [0.46  βˆ’0.46  0.96  1.00  βˆ’1.00]

SLIDE 21

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 22

Forward pass

Β§ Consider this calculation:

$$z(y; \boldsymbol{x}) = 2x_3^2 + 2yx_2 + x_1$$

where $y$ is the input and $\boldsymbol{x}$ is the set of parameters, initialized as $x_1 = 1$, $x_2 = 3$, $x_3 = 2$

Β§ Let’s break it into intermediary variables:

$$b = 2yx_2 \qquad c = b + x_1 \qquad d = x_3^2 \qquad z = c + 2d$$

SLIDE 23

Computational Graph

Β§ Inputs: $y$ and the parameters $x_1 = 1$, $x_2 = 3$, $x_3 = 2$
Β§ Nodes: $b = 2yx_2$, $c = b + x_1$, $d = x_3^2$, $z = c + 2d$

SLIDE 24

Computational Graph

Β§ The same graph, with the local derivative $\partial$ annotated on each edge:
  • $\partial c/\partial b = 1$, $\partial c/\partial x_1 = 1$
  • $\partial z/\partial c = 1$, $\partial z/\partial d = 2$
  • $\partial b/\partial y = 2x_2$, $\partial b/\partial x_2 = 2y$
  • $\partial d/\partial x_3 = 2x_3$

SLIDE 25

Forward pass

Β§ With input $y = 1$, the forward pass computes the value of every node in the graph:
  • $b = 6$, $c = 7$, $d = 4$, $z = 15$

SLIDE 26

Backward pass

Β§ Starting from $\partial z/\partial z = 1$, the backward pass multiplies local derivatives along each path of the graph:
  • $\partial z/\partial c = 1$, $\partial z/\partial b = 1$, $\partial z/\partial d = 2$
  • $\partial z/\partial x_1 = 1$, $\partial z/\partial x_2 = 2$, $\partial z/\partial x_3 = 8$, $\partial z/\partial y = 6$

SLIDE 27

Gradient & Chain rule

Β§ We need the gradient of $z$ with respect to $\boldsymbol{x}$ for optimization:

$$\nabla_{\boldsymbol{x}} z = \left[ \frac{\partial z}{\partial x_1} \;\; \frac{\partial z}{\partial x_2} \;\; \frac{\partial z}{\partial x_3} \right]$$

Β§ We calculate it using the chain rule and the local derivatives:

$$\frac{\partial z}{\partial x_1} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial x_1} \qquad \frac{\partial z}{\partial x_2} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial x_2} \qquad \frac{\partial z}{\partial x_3} = \frac{\partial z}{\partial d}\frac{\partial d}{\partial x_3}$$

SLIDE 28

Backpropagation

$$\frac{\partial z}{\partial x_1} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial x_1} = 1 \cdot 1 = 1 \qquad \frac{\partial z}{\partial x_2} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial x_2} = 1 \cdot 1 \cdot 2 = 2 \qquad \frac{\partial z}{\partial x_3} = \frac{\partial z}{\partial d}\frac{\partial d}{\partial x_3} = 2 \cdot 4 = 8$$
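
The whole walkthrough fits in a few lines of plain Python; a sketch that mirrors the computational graph and checks the gradients (1, 2, 8):

```python
# function: z = c + 2*d, with b = 2*y*x2, c = b + x1, d = x3**2
y, x1, x2, x3 = 1.0, 1.0, 3.0, 2.0

# forward pass: compute every node
b = 2 * y * x2          # 6
c = b + x1              # 7
d = x3 ** 2             # 4
z = c + 2 * d           # 15

# backward pass: chain rule over the local derivatives
dz_dc, dz_dd = 1.0, 2.0
dz_db = dz_dc * 1.0           # dc/db = 1
dz_dx1 = dz_dc * 1.0          # dc/dx1 = 1    -> 1
dz_dx2 = dz_db * 2 * y        # db/dx2 = 2y   -> 2
dz_dx3 = dz_dd * 2 * x3       # dd/dx3 = 2*x3 -> 8
print(z, dz_dx1, dz_dx2, dz_dx3)   # 15.0 1.0 2.0 8.0
```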

SLIDE 29

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 30

Softmax

Β§ Given the output vector $\boldsymbol{a}$ of a neural network model with $L$ output classes, softmax turns the vector into a probability distribution:

$$\mathrm{softmax}(\boldsymbol{a})_j = \frac{e^{a_j}}{\sum_{k=1}^{L} e^{a_k}}$$

where the denominator is the normalization term.

SLIDE 31

Softmax – numeric example

Β§ $L = 4$ classes: $\boldsymbol{a} = [1 \;\; 2 \;\; 5 \;\; 6]$, $\mathrm{softmax}(\boldsymbol{a}) = [0.004 \;\; 0.013 \;\; 0.264 \;\; 0.717]$
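
A minimal sketch reproducing this example (printed with four decimals; the slide truncates to three):

```python
import numpy as np

a = np.array([1.0, 2.0, 5.0, 6.0])   # output scores for L = 4 classes
q = np.exp(a) / np.exp(a).sum()      # softmax
print(np.round(q, 4))                # [0.0048 0.0131 0.2641 0.7179]
print(q.sum())                       # ~1.0: a valid probability distribution
```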

SLIDE 32

Softmax characteristics

Β§ The exponential function in softmax separates the highest value from the others
Β§ Softmax identifies the β€œmax”, but in a β€œsoft” way!
Β§ Softmax creates competition between the predicted output values, so that in the extreme case the β€œwinner takes all”
  • Winner-takes-all: one output is 1 and the rest are 0
  • This resembles the competition between nearby neurons in the cortex

SLIDE 33

Negative Log Likelihood (NLL) Loss

Β§ The NLL loss function is commonly used to optimize neural networks for classification tasks:

$$\mathcal{L} = -\mathbb{E}_{\boldsymbol{y},z \sim \mathcal{D}}\left[\log q(z|\boldsymbol{y}; \boldsymbol{X})\right]$$

  • $\mathcal{D}$ the set of (training) data
  • π’š input vector
  • 𝑧 correct output class

Β§ NLL is a form of cross-entropy loss
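
A sketch of the NLL loss for a single data point; q below is the model's predicted distribution and z the index of the correct class:

```python
import numpy as np

def nll_loss(q, z):
    """Negative log likelihood of the correct class z under distribution q."""
    return -np.log(q[z])

q = np.array([0.004, 0.013, 0.264, 0.717])   # predicted distribution (slide 31)
print(nll_loss(q, 3))                        # ~0.33: small loss for a confident, correct prediction
```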

SLIDE 34

NLL + Softmax

Β§ The choice of output function (such as softmax) is closely tied to the choice of loss function; the two should fit each other!
Β§ Softmax and NLL are a good pair
Β§ To see why, let’s calculate the final NLL loss when softmax is used at the output layer (next slide)

SLIDE 35

NLL + Softmax

Β§ Loss function for one data point: $\mathcal{L}(g(\boldsymbol{y}; \boldsymbol{x}), z)$
Β§ $\boldsymbol{a}$ the output vector of the network before applying softmax
Β§ $z$ the index of the correct class

$$\mathcal{L}(g(\boldsymbol{y}; \boldsymbol{x}), z) = -\log q(z|\boldsymbol{y}; \boldsymbol{X}) = -\log \frac{e^{a_z}}{\sum_{k=1}^{L} e^{a_k}} = -a_z + \log \sum_{k=1}^{L} e^{a_k}$$

SLIDE 36

NLL + Softmax – example

$\boldsymbol{a} = [1 \;\; 2 \;\; 0.5 \;\; 6]$

Β§ If the correct class is the first one, $z = 0$: $\mathcal{L} = -1 + \log(e^1 + e^2 + e^{0.5} + e^6) = -1 + 6.02 = 5.02$
Β§ If the correct class is the third one, $z = 2$: $\mathcal{L} = -0.5 + 6.02 = 5.52$
Β§ If the correct class is the fourth one, $z = 3$: $\mathcal{L} = -6 + 6.02 = 0.02$

SLIDE 37

NLL + Softmax – example

$\boldsymbol{a} = [1 \;\; 2 \;\; 5 \;\; 6]$

Β§ If the correct class is the first one, $z = 0$: $\mathcal{L} = -1 + \log(e^1 + e^2 + e^5 + e^6) = -1 + 6.33 = 5.33$
Β§ If the correct class is the third one, $z = 2$: $\mathcal{L} = -5 + 6.33 = 1.33$
Β§ If the correct class is the fourth one, $z = 3$: $\mathcal{L} = -6 + 6.33 = 0.33$
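
Both examples are instances of $\mathcal{L} = -a_z + \log \sum_k e^{a_k}$; a minimal sketch for the second vector:

```python
import numpy as np

def nll_softmax(a, z):
    return -a[z] + np.log(np.exp(a).sum())   # -a_z plus the log-sum-exp term

a = np.array([1.0, 2.0, 5.0, 6.0])
for z in (0, 2, 3):
    print(z, round(nll_softmax(a, z), 2))    # 5.33, 1.33, 0.33
```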

SLIDE 38

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 39

Stochastic Gradient Descent (SGD)

Β§ For every $x \in \mathbb{X}$ and for $n$ training data points: starting from the current value of $x$ at step $t$, take a step in the direction $-\nabla_x \mathcal{L}(x)$, against the gradient $\nabla_x \mathcal{L}(x)$, to reach the value at step $t+1$, moving toward the optimum of the loss $\mathcal{L}(x)$.

SLIDE 40

Stochastic Gradient Descent algorithm

Β§ A set of parameters 𝒙
Β§ A learning rate πœƒ
Β§ Loop until some exit criteria are met (a toy run is sketched below)
  • Sample a minibatch of 𝑛 data points from 𝒠
  • Compute the gradient (vectors) of the parameters:

$$\boldsymbol{h} \leftarrow \frac{1}{n} \nabla_{\boldsymbol{x}} \sum_{j} \mathcal{L}\left(g(\boldsymbol{y}^{(j)}; \boldsymbol{x}), z^{(j)}\right)$$

  • Update the parameters by taking a step in the opposite direction of the corresponding gradients:

$$\boldsymbol{x} \leftarrow \boldsymbol{x} - \theta \boldsymbol{h}$$

  • Reduce the learning rate (annealing) if some criteria are met or based on a schedule
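
A toy run of this update rule on a 1-D quadratic loss (a sketch; the simple halving schedule is an assumption, not the slide's):

```python
x, lr = 5.0, 0.4                  # parameter x and learning rate
for step in range(50):
    grad = 2 * (x - 3)            # gradient of L(x) = (x - 3)^2
    x = x - lr * grad             # x <- x - lr * h
    if step % 10 == 9:
        lr *= 0.5                 # anneal the learning rate on a schedule
print(round(x, 4))                # converges to the optimum x = 3
```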

SLIDE 41

Sampling size

Β§ If only one data point is used in every step (𝑛 = 1)
  • Fast; learns online
  • Training can become unstable, with a lot of fluctuation
Β§ If all 𝑁 data points are used in every step (𝑛 = 𝑁)
  • Also called Batch Gradient Descent
  • Training can take a very long time
Β§ If 𝑛 is between these
  • Also called Mini-Batch Gradient Descent
  • The typical setting for training deep learning models
SLIDE 42

Other gradient-based optimizations

Β§ Limitations of the mentioned SGD algorithm
  • Choosing the learning rate is hard
  • Choosing the annealing method/rate is hard
  • The same learning rate is applied to all parameters
  • Can get trapped in non-optimal local minima and saddle points
Β§ Some other commonly used algorithms:
  • Nesterov accelerated gradient
  • Adagrad
  • Adam
SLIDE 43

Regularization techniques for neural networks and deep learning

Β§ Parameter norm penalties (discussed in the previous lecture)
Β§ Early stopping
Β§ Dropout
Β§ Batch normalization
Β§ Transfer learning
Β§ Multitask learning
Β§ Unsupervised / semi-supervised pre-training
Β§ Noise robustness
Β§ Dataset augmentation
Β§ Ensembles
Β§ Adversarial training

SLIDE 44

Early Stopping

Β§ Run the model for several steps (epochs), and in each step evaluate the model on the validation set
Β§ Store the model if the evaluation results improve
Β§ At the end, take the stored model (the one with the best validation results) as the final model
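
A sketch of this procedure in Python; `train_one_epoch` and `evaluate` are hypothetical helpers, and a higher validation score is assumed to be better:

```python
import copy

def fit(model, train_data, val_data, n_epochs=100):
    best_score, best_model = float("-inf"), None
    for epoch in range(n_epochs):
        train_one_epoch(model, train_data)   # one training step (epoch)
        score = evaluate(model, val_data)    # evaluate on the validation set
        if score > best_score:               # store the model if results improve
            best_score, best_model = score, copy.deepcopy(model)
    return best_model                        # the stored model with the best validation results
```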

SLIDE 45

Dropout

Srivastava, Nitish, et al. β€œDropout: a simple way to prevent neural networks from overfitting”, JMLR 2014

Β§ Key idea: prune the neural network by stochastically removing some hidden units
Β§ At training time, for each data point:
  • Each hidden unit’s output is multiplied by zero based on a dropout probability (like 0.6)
SLIDE 46

Dropout

Β§ At test time:
  • All hidden units are used
  • The output of each hidden unit is multiplied by the keep probability (one minus the dropout probability)
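
A minimal sketch of the classic formulation described on these two slides; the hidden-unit outputs below are illustrative values (the now more common β€œinverted” variant instead rescales by 1/(1βˆ’p) at training time and leaves test time untouched):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.6                                  # dropout probability from the slide
h = np.array([0.3, -1.2, 0.8, 2.0, -0.5])     # hidden-unit outputs (illustrative)

# training time: each unit's output is zeroed with probability p_drop
mask = (rng.random(h.shape) >= p_drop).astype(float)
h_train = h * mask

# test time: all units are used, each output scaled by the keep probability
h_test = h * (1.0 - p_drop)
print(h_train)
print(h_test)
```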

SLIDE 47

Dropout – characteristics

Β§ Computationally inexpensive but powerful
Β§ Dropout can be viewed as a geometric average of an exponential number of networks β†’ an ensemble
Β§ Dropout prevents hidden units from forming co-dependencies with each other
Β§ Every hidden unit learns to perform well regardless of the other units