SLIDE 1

Natural Language Processing with Deep Learning Neural Networks – a Walkthrough

Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception

SLIDE 2

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 3

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 4

Notation

Β§ $b$ β†’ scalar
Β§ $\boldsymbol{c}$ β†’ vector
  • the $j^{th}$ element of $\boldsymbol{c}$ is the scalar $c_j$
Β§ $\boldsymbol{D}$ β†’ matrix
  • the $j^{th}$ vector of $\boldsymbol{D}$ is $\boldsymbol{d}_j$
  • the $k^{th}$ element of the $j^{th}$ vector of $\boldsymbol{D}$ is the scalar $d_{j,k}$
Β§ Tensor: generalization of scalar, vector, and matrix to any arbitrary dimension

SLIDE 5

Linear Algebra

SLIDE 6

Linear Algebra – Transpose

Β§ $\boldsymbol{b}$ is in 1Γ—d dimensions β†’ $\boldsymbol{b}^T$ is in dΓ—1 dimensions
Β§ $\boldsymbol{B}$ is in eΓ—d dimensions β†’ $\boldsymbol{B}^T$ is in dΓ—e dimensions

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$

SLIDE 7

Linear Algebra – Dot product

Β§ $\boldsymbol{b} \cdot \boldsymbol{c}^T = d$ (a scalar)
  • dimensions: 1Γ—d Β· dΓ—1 = 1Γ—1
Β§ $\boldsymbol{b} \cdot \boldsymbol{C} = \boldsymbol{d}$ (a vector)
  • dimensions: 1Γ—d Β· dΓ—e = 1Γ—e
Β§ $\boldsymbol{B} \cdot \boldsymbol{C} = \boldsymbol{D}$ (a matrix)
  • dimensions: lΓ—m Β· mΓ—n = lΓ—n
Β§ Linear transformation: the dot product of a vector with a matrix (see the sketch below)
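
These shape rules are easy to check in NumPy; a minimal sketch (the numeric values below are illustrative, not the slide's):

```python
import numpy as np

b = np.array([[1., 2., 3.]])          # shape 1x3 (a row vector)
c = np.array([[2., 1., 0.]])          # shape 1x3
C = np.array([[2., 3.],
              [1., 1.],
              [-1., 0.]])             # shape 3x2
B = np.array([[1., 2., 3.],
              [4., 1., 1.]])          # shape 2x3

print((b @ c.T).shape)   # (1, 1): the scalar d
print((b @ C).shape)     # (1, 2): a vector in 1xe dimensions
print((B @ C).shape)     # (2, 2): a matrix in lxn dimensions
print(b.T.shape)         # (3, 1): the transpose flips 1xd to dx1
```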

SLIDE 8

Probability

Β§ Conditional probability

$$q(z|y)$$

Β§ Probability distribution
  • For a discrete random variable $\boldsymbol{a}$ with $L$ states
  • $0 \leq q(a_j) \leq 1$
  • $\sum_{j=1}^{L} q(a_j) = 1$
  • E.g. with $L = 4$ states: 0.2  0.3  0.45  0.05
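
As a quick sanity check, the example distribution satisfies both conditions; a minimal NumPy sketch:

```python
import numpy as np

q = np.array([0.2, 0.3, 0.45, 0.05])   # L = 4 states
assert np.all((0 <= q) & (q <= 1))     # every probability lies in [0, 1]
assert np.isclose(q.sum(), 1.0)        # the probabilities sum to 1
```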

SLIDE 9

Probability

Β§ Expected value

$$\mathbb{E}_{y \sim Y}\left[g\right] = \frac{1}{|Y|} \sum_{y \in Y} g(y)$$

  • Note: this is an imprecise definition, but it suffices for our use in this lecture
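
A sketch of this simplified (uniform) expected value; the set Y and the function g below are arbitrary placeholders:

```python
import numpy as np

Y = np.array([1.0, 2.0, 3.0, 4.0])   # a finite set of values
g = lambda y: y ** 2                 # placeholder function g
print(np.mean(g(Y)))                 # (1/|Y|) * sum of g(y) over y in Y = 7.5
```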

SLIDE 10

Artificial Neural Networks

Β§ Neural Networks are non-linear functions and universal approximators
Β§ They are composed of several simple (non-)linear operations
Β§ Neural networks can readily be defined as probabilistic models which estimate $q(z|\boldsymbol{y}; \boldsymbol{X})$
  • Given input vector π’š and the set of parameters 𝑿, estimate the probability of the output class 𝑧

SLIDE 11

A Feedforward Network

Β§ Input vector π’š
Β§ Parameter matrices 𝑿^(𝟐) (size 3Γ—4) and 𝑿^(πŸ‘) (size 4Γ—2)
Β§ Output probability distribution $q(z|\boldsymbol{y}; \boldsymbol{X})$

SLIDE 12

Learning with Neural Networks

Β§ Design the network’s architecture
Β§ Consider proper regularization methods
Β§ Initialize parameters
Β§ Loop until some exit criteria are met (a skeleton of this loop is sketched below)
  • Sample a minibatch from the training data 𝒠
  • Loop over the data points in the minibatch
  • Forward pass: given input π’š, predict the output $q(z|\boldsymbol{y}; \boldsymbol{X})$
  • Calculate the loss function
  • Calculate the gradient of each parameter with respect to the loss function, using the backpropagation algorithm
  • Update the parameters using their gradients
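
The same recipe as a Python skeleton; `model`, `nll_loss`, `backprop`, and `sample_minibatch` are hypothetical helpers standing in for the steps named above:

```python
def train(params, data, lr=0.1, n_steps=1000):
    for step in range(n_steps):                    # exit criterion: a fixed step budget
        batch = sample_minibatch(data)             # sample a minibatch from the training data
        grads = {name: 0.0 for name in params}
        for y, z in batch:                         # loop over the data points in the minibatch
            q = model(y, params)                   # forward pass: predict q(z | y; X)
            loss = nll_loss(q, z)                  # calculate the loss
            for name, g in backprop(loss, params).items():
                grads[name] += g                   # gradient of each parameter w.r.t. the loss
        for name in params:
            params[name] -= lr * grads[name] / len(batch)  # update using the gradients
    return params
```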
SLIDE 13

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 14

Neural Computation

SLIDE 15

An Artificial Neuron

SLIDE 16

Linear

$$g(y) = y$$

SLIDE 17

Sigmoid

$$g(y) = \sigma(y) = \frac{1}{1 + e^{-y}}$$

Β§ Squashes input between 0 and 1
Β§ Output becomes like a probability value

SLIDE 18

Hyperbolic Tangent (Tanh)

$$g(y) = \tanh(y) = \frac{e^{2y} - 1}{e^{2y} + 1}$$

Β§ Squashes input between βˆ’1 and 1

SLIDE 19

Rectified Linear Unit (ReLU)

$$g(y) = \max(0, y)$$

Β§ Good for deep architectures, as it mitigates the vanishing-gradient problem

SLIDE 20

Examples

Β§ Input vector π’š = [1  3] and parameter matrix 𝑿
Β§ Linear transformation π’šπ‘Ώ: π’šπ‘Ώ = [0.5  βˆ’0.5  2  12  βˆ’3]
Β§ Non-linear transformation ReLU(π’šπ‘Ώ): ReLU([0.5  βˆ’0.5  2  12  βˆ’3]) = [0.5  0  2  12  0]
Β§ Non-linear transformation Οƒ(π’šπ‘Ώ): Οƒ([0.5  βˆ’0.5  2  12  βˆ’3]) β‰ˆ [0.62  0.38  0.88  1.00  0.05]
Β§ Non-linear transformation tanh(π’šπ‘Ώ): tanh([0.5  βˆ’0.5  2  12  βˆ’3]) β‰ˆ [0.46  βˆ’0.46  0.96  1.00  βˆ’1.00]

SLIDE 21

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 22

Forward pass

Β§ Consider this calculation:

$$z(y; \boldsymbol{x}) = 2x_3^2 + 2yx_2 + x_1$$

where $y$ is the input and $\boldsymbol{x}$ is the set of parameters, initialized as $x_1 = 1$, $x_2 = 3$, $x_3 = 2$

Β§ Let’s break it into intermediary variables:

$$b = 2yx_2 \qquad c = b + x_1 \qquad d = x_3^2 \qquad z = c + 2d$$

SLIDE 23

Computational Graph

Β§ Inputs: $y$ and the parameters $x_1 = 1$, $x_2 = 3$, $x_3 = 2$
Β§ Nodes: $b = 2yx_2$, $c = b + x_1$, $d = x_3^2$, $z = c + 2d$

SLIDE 24

Computational Graph

Β§ The same graph, with the local derivative $\partial$ annotated on each edge:
  • $\partial c/\partial b = 1$, $\partial c/\partial x_1 = 1$
  • $\partial z/\partial c = 1$, $\partial z/\partial d = 2$
  • $\partial b/\partial y = 2x_2$, $\partial b/\partial x_2 = 2y$
  • $\partial d/\partial x_3 = 2x_3$

SLIDE 25

Forward pass

Β§ With input $y = 1$, the forward pass computes the value of every node in the graph:
  • $b = 6$, $c = 7$, $d = 4$, $z = 15$

SLIDE 26

Backward pass

Β§ Starting from $\partial z/\partial z = 1$, the backward pass multiplies local derivatives along each path of the graph:
  • $\partial z/\partial c = 1$, $\partial z/\partial b = 1$, $\partial z/\partial d = 2$
  • $\partial z/\partial x_1 = 1$, $\partial z/\partial x_2 = 2$, $\partial z/\partial x_3 = 8$, $\partial z/\partial y = 6$

SLIDE 27

Gradient & Chain rule

Β§ We need the gradient of $z$ with respect to $\boldsymbol{x}$ for optimization:

$$\nabla_{\boldsymbol{x}} z = \left[ \frac{\partial z}{\partial x_1} \;\; \frac{\partial z}{\partial x_2} \;\; \frac{\partial z}{\partial x_3} \right]$$

Β§ We calculate it using the chain rule and the local derivatives:

$$\frac{\partial z}{\partial x_1} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial x_1} \qquad \frac{\partial z}{\partial x_2} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial x_2} \qquad \frac{\partial z}{\partial x_3} = \frac{\partial z}{\partial d}\frac{\partial d}{\partial x_3}$$

SLIDE 28

Backpropagation

$$\frac{\partial z}{\partial x_1} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial x_1} = 1 \cdot 1 = 1 \qquad \frac{\partial z}{\partial x_2} = \frac{\partial z}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial x_2} = 1 \cdot 1 \cdot 2 = 2 \qquad \frac{\partial z}{\partial x_3} = \frac{\partial z}{\partial d}\frac{\partial d}{\partial x_3} = 2 \cdot 4 = 8$$
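
The whole walkthrough fits in a few lines of plain Python; a sketch that mirrors the computational graph and checks the gradients (1, 2, 8):

```python
# function: z = c + 2*d, with b = 2*y*x2, c = b + x1, d = x3**2
y, x1, x2, x3 = 1.0, 1.0, 3.0, 2.0

# forward pass: compute every node
b = 2 * y * x2          # 6
c = b + x1              # 7
d = x3 ** 2             # 4
z = c + 2 * d           # 15

# backward pass: chain rule over the local derivatives
dz_dc, dz_dd = 1.0, 2.0
dz_db = dz_dc * 1.0           # dc/db = 1
dz_dx1 = dz_dc * 1.0          # dc/dx1 = 1    -> 1
dz_dx2 = dz_db * 2 * y        # db/dx2 = 2y   -> 2
dz_dx3 = dz_dd * 2 * x3       # dd/dx3 = 2*x3 -> 8
print(z, dz_dx1, dz_dx2, dz_dx3)   # 15.0 1.0 2.0 8.0
```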

SLIDE 29

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 30

Softmax

Β§ Given the output vector $\boldsymbol{a}$ of a neural network model with $L$ output classes, softmax turns the vector into a probability distribution:

$$\mathrm{softmax}(\boldsymbol{a})_j = \frac{e^{a_j}}{\sum_{k=1}^{L} e^{a_k}}$$

where the denominator is the normalization term.

SLIDE 31

Softmax – numeric example

Β§ $L = 4$ classes: $\boldsymbol{a} = [1 \;\; 2 \;\; 5 \;\; 6]$, $\mathrm{softmax}(\boldsymbol{a}) = [0.004 \;\; 0.013 \;\; 0.264 \;\; 0.717]$
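
A minimal sketch reproducing this example (printed with four decimals; the slide truncates to three):

```python
import numpy as np

a = np.array([1.0, 2.0, 5.0, 6.0])   # output scores for L = 4 classes
q = np.exp(a) / np.exp(a).sum()      # softmax
print(np.round(q, 4))                # [0.0048 0.0131 0.2641 0.7179]
print(q.sum())                       # ~1.0: a valid probability distribution
```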

SLIDE 32

Softmax characteristics

Β§ The exponential function in softmax separates the highest value from the others
Β§ Softmax identifies the β€œmax”, but in a β€œsoft” way!
Β§ Softmax creates competition between the predicted output values, so that in the extreme case the β€œwinner takes all”
  • Winner-takes-all: one output is 1 and the rest are 0
  • This resembles the competition between nearby neurons in the cortex

SLIDE 33

Negative Log Likelihood (NLL) Loss

Β§ The NLL loss function is commonly used to optimize neural networks for classification tasks:

$$\mathcal{L} = -\mathbb{E}_{\boldsymbol{y},z \sim \mathcal{D}}\left[\log q(z|\boldsymbol{y}; \boldsymbol{X})\right]$$

  • $\mathcal{D}$ the set of (training) data
  • π’š input vector
  • 𝑧 correct output class

Β§ NLL is a form of cross-entropy loss
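
A sketch of the NLL loss for a single data point; q below is the model's predicted distribution and z the index of the correct class:

```python
import numpy as np

def nll_loss(q, z):
    """Negative log likelihood of the correct class z under distribution q."""
    return -np.log(q[z])

q = np.array([0.004, 0.013, 0.264, 0.717])   # predicted distribution (slide 31)
print(nll_loss(q, 3))                        # ~0.33: small loss for a confident, correct prediction
```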

SLIDE 34

NLL + Softmax

Β§ The choice of output function (such as softmax) is closely tied to the choice of loss function; the two should fit each other!
Β§ Softmax and NLL are a good pair
Β§ To see why, let’s calculate the final NLL loss when softmax is used at the output layer (next slide)

SLIDE 35

NLL + Softmax

Β§ Loss function for one data point: $\mathcal{L}(g(\boldsymbol{y}; \boldsymbol{x}), z)$
Β§ $\boldsymbol{a}$ the output vector of the network before applying softmax
Β§ $z$ the index of the correct class

$$\mathcal{L}(g(\boldsymbol{y}; \boldsymbol{x}), z) = -\log q(z|\boldsymbol{y}; \boldsymbol{X}) = -\log \frac{e^{a_z}}{\sum_{k=1}^{L} e^{a_k}} = -a_z + \log \sum_{k=1}^{L} e^{a_k}$$

SLIDE 36

NLL + Softmax – example

$\boldsymbol{a} = [1 \;\; 2 \;\; 0.5 \;\; 6]$

Β§ If the correct class is the first one, $z = 0$: $\mathcal{L} = -1 + \log(e^1 + e^2 + e^{0.5} + e^6) = -1 + 6.02 = 5.02$
Β§ If the correct class is the third one, $z = 2$: $\mathcal{L} = -0.5 + 6.02 = 5.52$
Β§ If the correct class is the fourth one, $z = 3$: $\mathcal{L} = -6 + 6.02 = 0.02$

SLIDE 37

NLL + Softmax – example

$\boldsymbol{a} = [1 \;\; 2 \;\; 5 \;\; 6]$

Β§ If the correct class is the first one, $z = 0$: $\mathcal{L} = -1 + \log(e^1 + e^2 + e^5 + e^6) = -1 + 6.33 = 5.33$
Β§ If the correct class is the third one, $z = 2$: $\mathcal{L} = -5 + 6.33 = 1.33$
Β§ If the correct class is the fourth one, $z = 3$: $\mathcal{L} = -6 + 6.33 = 0.33$
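
Both examples are instances of $\mathcal{L} = -a_z + \log \sum_k e^{a_k}$; a minimal sketch for the second vector:

```python
import numpy as np

def nll_softmax(a, z):
    return -a[z] + np.log(np.exp(a).sum())   # -a_z plus the log-sum-exp term

a = np.array([1.0, 2.0, 5.0, 6.0])
for z in (0, 2, 3):
    print(z, round(nll_softmax(a, z), 2))    # 5.33, 1.33, 0.33
```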

SLIDE 38

Agenda

  • Introduction
  • Non-linearities
  • Forward pass & backpropagation
  • Softmax & loss function
  • Optimization & regularization
SLIDE 39

Stochastic Gradient Descent (SGD)

Β§ For every $x \in \mathbb{X}$ and for $n$ training data points: starting from the current value of $x$ at step $t$, take a step in the direction $-\nabla_x \mathcal{L}(x)$, against the gradient $\nabla_x \mathcal{L}(x)$, to reach the value at step $t+1$, moving toward the optimum of the loss $\mathcal{L}(x)$.

SLIDE 40

Stochastic Gradient Descent algorithm

Β§ A set of parameters 𝒙
Β§ A learning rate πœƒ
Β§ Loop until some exit criteria are met (a toy run is sketched below)
  • Sample a minibatch of 𝑛 data points from 𝒠
  • Compute the gradient (vectors) of the parameters:

$$\boldsymbol{h} \leftarrow \frac{1}{n} \nabla_{\boldsymbol{x}} \sum_{j} \mathcal{L}\left(g(\boldsymbol{y}^{(j)}; \boldsymbol{x}), z^{(j)}\right)$$

  • Update the parameters by taking a step in the opposite direction of the corresponding gradients:

$$\boldsymbol{x} \leftarrow \boldsymbol{x} - \theta \boldsymbol{h}$$

  • Reduce the learning rate (annealing) if some criteria are met or based on a schedule
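
A toy run of this update rule on a 1-D quadratic loss (a sketch; the simple halving schedule is an assumption, not the slide's):

```python
x, lr = 5.0, 0.4                  # parameter x and learning rate
for step in range(50):
    grad = 2 * (x - 3)            # gradient of L(x) = (x - 3)^2
    x = x - lr * grad             # x <- x - lr * h
    if step % 10 == 9:
        lr *= 0.5                 # anneal the learning rate on a schedule
print(round(x, 4))                # converges to the optimum x = 3
```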

SLIDE 41

Sampling size

Β§ If only one data point is used in every step (𝑛 = 1)
  • Fast; learns online
  • Training can become unstable, with a lot of fluctuation
Β§ If all 𝑁 data points are used in every step (𝑛 = 𝑁)
  • Also called Batch Gradient Descent
  • Training can take a very long time
Β§ If 𝑛 is between these
  • Also called Mini-Batch Gradient Descent
  • The typical setting for training deep learning models
SLIDE 42

Other gradient-based optimizations

Β§ Limitations of the mentioned SGD algorithm
  • Choosing the learning rate is hard
  • Choosing the annealing method/rate is hard
  • The same learning rate is applied to all parameters
  • Can get trapped in non-optimal local minima and saddle points
Β§ Some other commonly used algorithms:
  • Nesterov accelerated gradient
  • Adagrad
  • Adam
SLIDE 43

Regularization techniques for neural networks and deep learning

Β§ Parameter norm penalties (discussed in the previous lecture)
Β§ Early stopping
Β§ Dropout
Β§ Batch normalization
Β§ Transfer learning
Β§ Multitask learning
Β§ Unsupervised / semi-supervised pre-training
Β§ Noise robustness
Β§ Dataset augmentation
Β§ Ensembles
Β§ Adversarial training

SLIDE 44

Early Stopping

Β§ Run the model for several steps (epochs), and in each step evaluate the model on the validation set
Β§ Store the model if the evaluation results improve
Β§ At the end, take the stored model (the one with the best validation results) as the final model
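
A sketch of this procedure in Python; `train_one_epoch` and `evaluate` are hypothetical helpers, and a higher validation score is assumed to be better:

```python
import copy

def fit(model, train_data, val_data, n_epochs=100):
    best_score, best_model = float("-inf"), None
    for epoch in range(n_epochs):
        train_one_epoch(model, train_data)   # one training step (epoch)
        score = evaluate(model, val_data)    # evaluate on the validation set
        if score > best_score:               # store the model if results improve
            best_score, best_model = score, copy.deepcopy(model)
    return best_model                        # the stored model with the best validation results
```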

SLIDE 45

Dropout

Srivastava, Nitish, et al. β€œDropout: a simple way to prevent neural networks from overfitting”, JMLR 2014

Β§ Key idea: prune the neural network by stochastically removing some hidden units
Β§ At training time, for each data point:
  • Each hidden unit’s output is multiplied by zero based on a dropout probability (like 0.6)
SLIDE 46

Dropout

Β§ At test time:
  • All hidden units are used
  • The output of each hidden unit is multiplied by the keep probability (one minus the dropout probability)
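
A minimal sketch of the classic formulation described on these two slides; the hidden-unit outputs below are illustrative values (the now more common β€œinverted” variant instead rescales by 1/(1βˆ’p) at training time and leaves test time untouched):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.6                                  # dropout probability from the slide
h = np.array([0.3, -1.2, 0.8, 2.0, -0.5])     # hidden-unit outputs (illustrative)

# training time: each unit's output is zeroed with probability p_drop
mask = (rng.random(h.shape) >= p_drop).astype(float)
h_train = h * mask

# test time: all units are used, each output scaled by the keep probability
h_test = h * (1.0 - p_drop)
print(h_train)
print(h_test)
```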

SLIDE 47

Dropout – characteristics

Β§ Computationally inexpensive but powerful
Β§ Dropout can be viewed as a geometric average of an exponential number of networks β†’ an ensemble
Β§ Dropout prevents hidden units from forming co-dependencies with each other
Β§ Every hidden unit learns to perform well regardless of the other units