SLIDE 1
SFU NatLangLab
Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University
October 17, 2019
Part 1
SLIDE 2
SLIDE 3
◮ Log-linear models versus Neural networks
◮ Feedforward neural networks
◮ Stochastic Gradient Descent
◮ Motivating example: XOR
◮ Computation Graphs
SLIDE 4
Log-linear model

◮ Let there be m features, f_k(x, y) for k = 1, . . . , m
◮ Define a parameter vector v ∈ R^m
◮ A log-linear model for classification into labels y ∈ Y:

Pr(y | x; v) = exp(v · f(x, y)) / Σ_{y′ ∈ Y} exp(v · f(x, y′))
Advantages
The feature representation f(x, y) can represent any aspect of the input that is useful for classification.
Disadvantages
The feature representation f(x, y) has to be designed by hand, which is time-consuming and error-prone.
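As a concrete illustration (a minimal numpy sketch, not part of the original slides; all values are made up), the model can be computed by stacking the joint features f(x, y) one row per label and taking a softmax over the label scores:

```python
import numpy as np

def log_linear_probs(f_xy, v):
    """Pr(y | x; v): f_xy has one row per label y', holding f(x, y')."""
    scores = f_xy @ v                     # v . f(x, y') for every y' in Y
    scores -= scores.max()                # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize over all labels

# Toy example: 3 labels, m = 2 features (illustrative values).
f_xy = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
v = np.array([0.5, -0.2])
print(log_linear_probs(f_xy, v))          # a distribution that sums to 1
```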
SLIDE 5
Log-linear model
Figure from [1]
Disadvantage: the number of combined features can explode.
SLIDE 6
Neural Networks

Advantages

◮ Neural networks replace hand-engineered features with representation learning
◮ Empirical results across many different domains show that learned representations give significant improvements in accuracy
◮ Neural networks allow end-to-end training for complex NLP tasks and do not have the limitations of multiple chained pipeline models

Disadvantages

For many tasks, linear models are much faster to train than neural network models.
SLIDE 7
Alternative Form of the Log-linear Model
Log-linear model:
Pr(y | x; v) = exp(v · f(x, y)) / Σ_{y′ ∈ Y} exp(v · f(x, y′))

Alternative form using functions:

Pr(y | x; v) = exp(v(y) · f(x) + γ_y) / Σ_{y′ ∈ Y} exp(v(y′) · f(x) + γ_{y′})

◮ Feature vector f(x) maps input x to R^d
◮ Parameters v(y) ∈ R^d and γ_y ∈ R for each y ∈ Y
◮ We assume v(y) · f(x) is a dot product. Using matrix multiplication it would be v(y) f(x)^T
◮ Let v = {(v(y), γ_y) : y ∈ Y}
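A short sketch of this alternative form (illustrative, not from the slides; it assumes the vectors v(y) are stacked as the rows of a matrix V and the biases γ_y as a vector gamma):

```python
import numpy as np

def alt_log_linear_probs(f_x, V, gamma):
    """Pr(y | x; v) with per-label weights: row y of V is v(y)."""
    scores = V @ f_x + gamma              # v(y) . f(x) + gamma_y for each y
    scores -= scores.max()                # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

f_x = np.array([0.3, -1.2, 0.8])                   # f(x) in R^d, d = 3
V = np.array([[0.1, 0.2, -0.5], [0.4, 0.0, 0.3]])  # |Y| = 2 labels
gamma = np.array([0.0, -0.1])
print(alt_log_linear_probs(f_x, V, gamma))
```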
SLIDE 8
◮ Log-linear models versus Neural networks
◮ Feedforward neural networks
◮ Stochastic Gradient Descent
◮ Motivating example: XOR
◮ Computation Graphs
SLIDE 9
Representation Learning: Feedforward Neural Network
Replace hand-engineered features f with learned features φ:
Pr(y | x; θ, v) = exp(v(y) · φ(x; θ) + γ_y) / Σ_{y′ ∈ Y} exp(v(y′) · φ(x; θ) + γ_{y′})

◮ Replace f(x) with φ(x; θ) ∈ R^d where θ are new parameters
◮ Parameters θ are learned from training data
◮ Using θ, the model φ maps input x to R^d: a learned representation of x
◮ x ∈ R^d is a pre-trained vector of size d
◮ We will use feedforward neural networks to define φ(x; θ)
◮ φ(x; θ) will be a non-linear mapping to R^d
◮ φ replaces f, which was a linear model
SLIDE 10
A Single Neuron aka Perceptron
A single neuron maps input x ∈ Rd to output h:
h = g(w · x + b)

◮ Weight vector w ∈ R^d and bias b ∈ R are the parameters of the model, learned from training data
◮ Transfer function (also called activation function) g : R → R
◮ It is important that g is a non-linear transfer function
◮ A linear g(z) = α · z + β for constants α, β gives a linear perceptron
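In code, a single neuron is essentially one line (a sketch, not from the slides; the input and weight values are made up):

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    """A single neuron: h = g(w . x + b)."""
    return g(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # input in R^d
w = np.array([0.1, 0.4, -0.3])   # learned weights
print(neuron(x, w, b=0.2))       # scalar output h
```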
SLIDE 11
Activation Functions and their Gradients
from [2], Fig. 4.3
SLIDE 12
The sigmoid Transfer Function: σ
sigmoid transfer function:
g(z) = 1 / (1 + exp(−z))
Derivative of sigmoid:
dg(z)/dz = g(z)(1 − g(z))
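A quick numerical check (illustrative, not from the slides) that the identity dg(z)/dz = g(z)(1 − g(z)) agrees with a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)         # the identity g(z)(1 - g(z))

z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.isclose(sigmoid_grad(z), numeric))   # True
```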
SLIDE 13
The tanh Transfer Function
tanh transfer function:
g(z) = (exp(2z) − 1) / (exp(2z) + 1)
Derivative of tanh:
dg(z)/dz = 1 − g(z)²
SLIDE 14
Alternatives to tanh
hardtanh:
g(z) = 1 if z > 1; −1 if z < −1; z otherwise

dg(z)/dz = 1 if −1 ≤ z ≤ 1; 0 otherwise
softsign:
g(z) = z / (1 + |z|)

dg(z)/dz = 1 / (1 + z)² if z ≥ 0; 1 / (1 − z)² if z < 0; equivalently 1 / (1 + |z|)²
SLIDE 15
The ReLU Transfer Function
Rectified Linear Unit (ReLU):
g(z) = z if z ≥ 0; 0 if z < 0, or equivalently g(z) = max{0, z}

Derivative of ReLU:

dg(z)/dz = 1 if z > 0; 0 if z < 0; non-differentiable/undefined at z = 0 (in practice: choose a value for z = 0)
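A numpy sketch of ReLU and its derivative; treating the derivative at z = 0 as 0 is a common convention and an assumption here, not something the slides prescribe:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Undefined at z == 0; a common convention (assumed here) is to use 0.
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # [0. 0. 3.]
print(relu_grad(z))   # [0. 0. 1.]
```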
SLIDE 16
Desperately Seeking Transfer Functions
from [3]
Enumeration of non-linear functions
SLIDE 17
Desperately Seeking Transfer Functions
from [3]
Enumeration of non-linear functions
SLIDE 18
The Swish Transfer Function [3]
Enumeration of activation functions:
Swish was the end result of comparing all the auto-generated activation functions for accuracy on standard datasets.
Swish uses the sigmoid σ:
g(z) = z · σ(βz)

◮ If β = 0 then g(z) = z/2 (a linear function; so avoid this)
◮ If β → ∞ then g(z) approaches ReLU(z)
Derivative of Swish:
dg(z)/dz = βg(z) + σ(βz)(1 − βg(z))
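A sketch (illustrative, not from [3]) that checks the Swish derivative formula against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z, beta=1.0):
    return z * sigmoid(beta * z)

def swish_grad(z, beta=1.0):
    g = swish(z, beta)
    return beta * g + sigmoid(beta * z) * (1.0 - beta * g)

# Confirm the derivative formula numerically at a few points.
z, eps = np.array([-1.5, 0.3, 2.0]), 1e-6
numeric = (swish(z + eps) - swish(z - eps)) / (2 * eps)
print(np.allclose(swish_grad(z), numeric))    # True
```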
SLIDE 19
The Swish Transfer Function [3]
Plots: the Swish transfer function for different values of β, and its first derivative.
SLIDE 20
Derivatives w.r.t. parameters
Derivatives w.r.t. w:

Given h = g(w · x + b), the derivatives w.r.t. w_1, . . . , w_j, . . . , w_d: dh/dw_j

Derivatives w.r.t. b:

The derivative w.r.t. b: dh/db
SLIDE 21
Chain Rule of Differentiation
Introduce an intermediate variable z ∈ R
z = w · x + b
h = g(z)

Then by the chain rule, to differentiate w.r.t. w:

dh/dw_j = (dh/dz)(dz/dw_j) = dg(z)/dz × x_j

And similarly for b:

dh/db = (dh/dz)(dz/db) = dg(z)/dz × 1
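The chain-rule result can be sanity-checked numerically; a sketch (illustrative values) for g = tanh, where dg(z)/dz = 1 − tanh²(z):

```python
import numpy as np

def h(w, x, b):
    return np.tanh(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
z = np.dot(w, x) + b
analytic = (1.0 - np.tanh(z) ** 2) * x   # dh/dw_j = dg(z)/dz * x_j for every j

# Finite-difference check for one coordinate j.
eps, j = 1e-6, 1
w_hi, w_lo = w.copy(), w.copy()
w_hi[j] += eps
w_lo[j] -= eps
numeric = (h(w_hi, x, b) - h(w_lo, x, b)) / (2 * eps)
print(np.isclose(analytic[j], numeric))  # True
```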
SLIDE 22
Single Layer Feedforward model
A single layer feedforward model consists of:
◮ An integer d specifying the input dimension. Each input to the network is x ∈ R^d
◮ An integer m specifying the number of hidden units
◮ A parameter matrix W ∈ R^{m×d}. The vector W_k ∈ R^d for 1 ≤ k ≤ m is the kth row of W
◮ A vector b ∈ R^m of bias parameters
◮ A transfer function g : R → R, e.g. g(z) = ReLU(z) or g(z) = tanh(z)
SLIDE 23
Single Layer Feedforward model (continued)
For k = 1, . . . , m:
◮ The input to the kth neuron is: z_k = W_k · x + b_k
◮ The output from the kth neuron is: h_k = g(z_k)
◮ Define the vector φ(x; θ) ∈ R^m as: φ(x; θ) = [h_1, . . . , h_m]
◮ θ = (W, b) where W ∈ R^{m×d} and b ∈ R^m
◮ Size of θ is m × (d + 1) parameters
Some intuition
The neural network employs m hidden units, each with its own parameters W_k and b_k, and these neurons are used to construct a hidden representation h ∈ R^m.
SLIDE 24
Matrix Form
We can replace the per-neuron operation z_k = W_k · x + b_k for k = 1, . . . , m with

z = Wx + b

where the dimensions are as follows (a vector of size m equals a matrix of size m × 1):

z [m × 1] = W [m × d] · x [d × 1] + b [m × 1]
SLIDE 25
Single Layer Feedforward model (matrix form)
A single layer feedforward model consists of:
◮ An integer d specifying the input dimension. Each input to the network is x ∈ R^d
◮ An integer m specifying the number of hidden units
◮ A parameter matrix W ∈ R^{m×d}
◮ A vector b ∈ R^m of bias parameters
◮ A transfer function g : R^m → R^m applied elementwise for i = 1, . . . , m: g(z) = [. . . , ReLU(z_i), . . .] or g(z) = [. . . , tanh(z_i), . . .] or g(z) = [. . . , σ(z_i), . . .]
SLIDE 26
Single Layer Feedforward model (matrix form, continued)
◮ Vector of inputs to the hidden layer z ∈ R^m: z = Wx + b
◮ Vector of outputs from the hidden layer h ∈ R^m: h = g(z)
◮ Define φ(x; θ) = h where θ = (W, b)
◮ Define softmax_y = exp(r_y) / Σ_{y′} exp(r_{y′}) for r_y = v(y) · h + γ_y
◮ Let V = [. . . , v_y, . . .] for y ∈ Y; v_y ∈ R^m so V ∈ R^{|Y|×m}
◮ Let Γ = [. . . , γ_y, . . .] for y ∈ Y; Γ ∈ R^{|Y|}

Putting it all together:

r = softmax(V · φ(x; θ) + Γ)

where V · φ(x; θ) + Γ gives one R value for each y ∈ Y, and r is a vector of size |Y| that sums to 1.
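Putting the matrix form together as a numpy sketch (illustrative; random parameters stand in for trained ones):

```python
import numpy as np

def forward(x, W, b, V, Gamma, g=np.tanh):
    """Single-layer feedforward classifier, matrix form."""
    z = W @ x + b                 # z in R^m
    h = g(z)                      # h = phi(x; theta) in R^m
    r = V @ h + Gamma             # one score per label y in Y
    r -= r.max()                  # numerical stability
    e = np.exp(r)
    return e / e.sum()            # softmax: a distribution over Y

d, m, labels = 4, 3, 2
rng = np.random.default_rng(0)
x = rng.normal(size=d)
W, b = rng.normal(size=(m, d)), rng.normal(size=m)
V, Gamma = rng.normal(size=(labels, m)), rng.normal(size=labels)
print(forward(x, W, b, V, Gamma))  # sums to 1
```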
SLIDE 27
Feedforward neural network
SLIDE 28
n-gram Feedforward neural network
from [5]
SLIDE 29
◮ Log-linear models versus Neural networks
◮ Feedforward neural networks
◮ Stochastic Gradient Descent
◮ Motivating example: XOR
◮ Computation Graphs
SLIDE 30
Simple stochastic gradient descent
Inputs:
◮ Training examples (x_i, y_i) for i = 1, . . . , n
◮ A feedforward representation φ(x; θ)
◮ An integer T specifying the number of updates
◮ A sequence of learning rates η_1, . . . , η_T where η_t ∈ [0, 1]
◮ One should experiment with learning rates: 0.001, 0.01, 0.1, 1
◮ Bottou (2012) suggests the learning rate η_t = η_1 / (1 + η_1 × λ × t), where λ is a hyperparameter that can be tuned experimentally
Initialization:
Set v = {(v(y), γ_y) : y ∈ Y} and θ to random values
SLIDE 31
Gradient descent
Algorithm:
◮ For t = 1, . . . , T:
  ◮ Select an integer i uniformly at random from {1, . . . , n}
  ◮ Define L(θ, v) = −log Pr(y_i | x_i; θ, v)
  ◮ For each parameter θ_j, v_k(y), and γ_y (for each label y):
    θ_j = θ_j − η_t × dL(θ, v)/dθ_j
    v_k(y) = v_k(y) − η_t × dL(θ, v)/dv_k(y)
    γ_y = γ_y − η_t × dL(θ, v)/dγ_y
◮ Output: parameters θ and v = (v(y), γ_y) for all y
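A sketch of this loop in numpy. The function neg_log_lik_grad is a hypothetical placeholder for code that returns the gradient of L(θ, v) for one example; parameters are kept in a dict so one loop updates θ, v(y), and γ_y alike (default η_1 and λ are illustrative):

```python
import numpy as np

def sgd(examples, params, neg_log_lik_grad, T, eta1=0.1, lam=1e-4):
    """params: dict of numpy arrays holding theta, v(y), and gamma_y.
    neg_log_lik_grad(params, x, y) is assumed to return a matching dict of
    gradients of L(theta, v) = -log Pr(y | x; theta, v) for one example."""
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        eta_t = eta1 / (1.0 + eta1 * lam * t)             # Bottou (2012) schedule
        x_i, y_i = examples[rng.integers(len(examples))]  # pick i at random
        grads = neg_log_lik_grad(params, x_i, y_i)
        for name in params:                               # update every parameter
            params[name] = params[name] - eta_t * grads[name]
    return params
```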
SLIDE 32
◮ Log-linear models versus Neural networks
◮ Feedforward neural networks
◮ Stochastic Gradient Descent
◮ Motivating example: XOR
◮ Computation Graphs
SLIDE 33
Motivating example: the XOR problem
From Deep Learning by Goodfellow, Bengio, Courville
We will assume a training set where each label is in the set Y = {−1, +1}. There are four training examples:

x_1 = [0, 0], y_1 = −1
x_2 = [0, 1], y_2 = +1
x_3 = [1, 0], y_3 = +1
x_4 = [1, 1], y_4 = −1
SLIDE 34
Motivating example: the XOR problem
SLIDE 35
Motivating example: the XOR problem
Theorem
For examples (x_i, y_i) for i = 1, . . . , 4 as defined previously, for the feedforward neural network

Pr(y | x; W, b, v) = exp(v(y) · g(Wx + b) + γ_y) / Σ_{y′ ∈ Y} exp(v(y′) · g(Wx + b) + γ_{y′})

where x ∈ R² (d = 2), m = 2 so W ∈ R^{2×2} and b ∈ R², and g is the ReLU transfer function, there are parameter settings v(−1), v(+1), γ_{−1}, γ_{+1}, W, b such that Pr(y_i | x_i; v) > 0.5 for i = 1, . . . , 4.
SLIDE 36
Motivating example: the XOR problem
Proof Sketch
Define

W = [1 1; 1 1] and b = [0, −1]

Then for each input x, calculate the values of z = Wx + b and h = g(z):

x = [0, 0] ⇒ z = [0, −1] ⇒ h = [0, 0]
x = [1, 0] ⇒ z = [1, 0] ⇒ h = [1, 0]
x = [0, 1] ⇒ z = [1, 0] ⇒ h = [1, 0]
x = [1, 1] ⇒ z = [2, 1] ⇒ h = [2, 1]
SLIDE 37
Motivating example: the XOR problem
Proof Sketch (continued)
p(+1 | x; v) = exp(v(+1) · h + γ_{+1}) / (exp(v(+1) · h + γ_{+1}) + exp(v(−1) · h + γ_{−1})) = 1 / (1 + exp(−(u · h + γ)))

To satisfy Pr(y_i | x_i; v) > 0.5 for i = 1, . . . , 4 we have to find parameters u = v(+1) − v(−1) and γ = γ_{+1} − γ_{−1} such that:

u · [0, 0] + γ < 0
u · [1, 0] + γ > 0
u · [1, 0] + γ > 0
u · [2, 1] + γ < 0

u = [1, −2] and γ = −0.5 satisfies these constraints.
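The proof sketch can be verified directly; a small numpy check (illustrative) that all four XOR examples get probability above 0.5 under these parameters:

```python
import numpy as np

W = np.array([[1.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, -1.0])
u = np.array([1.0, -2.0])   # u = v(+1) - v(-1)
gamma = -0.5                # gamma = gamma_{+1} - gamma_{-1}

for x, y in [([0, 0], -1), ([0, 1], +1), ([1, 0], +1), ([1, 1], -1)]:
    h = np.maximum(0.0, W @ np.array(x, float) + b)  # ReLU hidden layer
    p_plus = 1.0 / (1.0 + np.exp(-(u @ h + gamma)))  # p(+1 | x; v)
    p_correct = p_plus if y == +1 else 1.0 - p_plus
    print(x, y, round(p_correct, 3))                 # all four are > 0.5
```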
SLIDE 38
Solving the XOR problem
SLIDE 39
◮ Log-linear models versus Neural networks
◮ Feedforward neural networks
◮ Stochastic Gradient Descent
◮ Motivating example: XOR
◮ Computation Graphs
SLIDE 40
Complex neural networks
Neural network with a loss function
Consider a neural network trained using a squared-error loss: for the correct answer y* the output value y is compared using the function (y* − y)².

h′ = W_{xh} x + b_h
h = tanh(h′)
y = w_{hy} · h + b_y
ℓ = (y* − y)²
SLIDE 41
Derivatives of the loss
h′ = W_{xh} x + b_h
h = tanh(h′)
y = w_{hy} · h + b_y
ℓ = (y* − y)²

We want to compute dℓ/db_y, dℓ/dw_{hy}, dℓ/db_h, dℓ/dW_{xh}:

dℓ/db_y = (dℓ/dy)(dy/db_y)
dℓ/dw_{hy} = (dℓ/dy)(dy/dw_{hy})
dℓ/db_h = (dℓ/dy)(dy/dh)(dh/dh′)(dh′/db_h)
dℓ/dW_{xh} = (dℓ/dy)(dy/dh)(dh/dh′)(dh′/dW_{xh})
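These chain-rule products translate line by line into a backward pass; a numpy sketch with illustrative dimensions for the network above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 2
x = rng.normal(size=d)
W_xh, b_h = rng.normal(size=(m, d)), rng.normal(size=m)
w_hy, b_y = rng.normal(size=m), 0.1
y_star = 1.0

# Forward pass.
h_pre = W_xh @ x + b_h                    # h'
h = np.tanh(h_pre)
y = w_hy @ h + b_y
loss = (y_star - y) ** 2

# Backward pass: multiply the chain-rule factors.
dl_dy = -2.0 * (y_star - y)
dl_dby = dl_dy * 1.0                      # dy/db_y = 1
dl_dwhy = dl_dy * h                       # dy/dw_hy = h
dl_dhpre = dl_dy * w_hy * (1.0 - h ** 2)  # dy/dh = w_hy; dh/dh' = 1 - tanh^2
dl_dbh = dl_dhpre                         # dh'/db_h = identity
dl_dWxh = np.outer(dl_dhpre, x)           # dh'/dW_xh: outer product with x
```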
SLIDE 42
Computation graphs and automatic differentiation
Figure from [1]
SLIDE 43
Computation graphs and automatic differentiation
◮ Automatic differentiation is a two-step dynamic programming algorithm that operates over the computation graph and performs:
  Forward calculation, which traverses the nodes in the graph in topological order, calculating the actual result of the computation.
  Back propagation, which traverses the nodes in reverse topological order, calculating the gradients.
◮ Many neural network toolkits can perform automatic differentiation for very large computation graphs.
SLIDE 44
[1] Graham Neubig. Neural Networks for NLP. 2018.
[2] Yoav Goldberg. Neural Network Methods for Natural Language Processing. 2017.
[3] Prajit Ramachandran, Barret Zoph, Quoc V. Le. Searching for Activation Functions. 2017.
[4] Xavier Glorot, Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. 2010.
[5] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. 2003.
SLIDE 45