Feedforward Neural Networks
Michael Collins, Columbia University
Recap: Log-linear Models
A log-linear model takes the following form:

p(y|x; v) = exp (v · f(x, y)) / Σy′∈Y exp (v · f(x, y′))

◮ f(x, y) is the representation of (x, y)
◮ Advantage: f(x, y) is highly flexible in terms of the features that can be included
◮ Disadvantage: it can be hard to design features by hand
◮ Neural networks allow the representation itself to be learned. Recent empirical results across a broad set of domains have shown that learned representations in neural networks can give very significant improvements in accuracy over hand-engineered features.
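As a concrete illustration of the formula above, here is a minimal sketch of a log-linear model in Python. The feature function `f` is a hypothetical toy (a single indicator feature per (word, label) pair, stored as a sparse dict), not something from the lecture:

```python
import math

# Hypothetical toy feature function f(x, y): one indicator feature per
# (word, label) pair, represented as a sparse dict from feature to value.
def f(x, y):
    return {(x, y): 1.0}

def log_linear_prob(x, y, labels, v):
    """p(y|x; v) = exp(v . f(x, y)) / sum_{y' in Y} exp(v . f(x, y'))."""
    def score(lab):
        return sum(v.get(feat, 0.0) * val for feat, val in f(x, lab).items())
    denom = sum(math.exp(score(lab)) for lab in labels)
    return math.exp(score(y)) / denom

labels = ["NN", "VB", "DT"]
v = {("dog", "NN"): 2.0, ("dog", "VB"): 0.5}   # all other weights are 0
p_nn = log_linear_prob("dog", "NN", labels, v)
```

The probabilities over all labels sum to one, and raising the weight on a feature raises the probability of the labels it fires for.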
Example 1: The Language Modeling Problem
◮ wi is the i'th word in a document
◮ Estimate a distribution p(wi|w1, w2, . . . , wi−1) given the previous "history" w1, . . . , wi−1.
◮ E.g., w1, . . . , wi−1 =
Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English". It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical . . .
Example 2: Part-of-Speech Tagging
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .

There are many possible tags in the position ??:
{NN, NNS, Vt, Vi, IN, DT, . . . }

The task: model the distribution

p(ti|t1, . . . , ti−1, w1, . . . , wn)

where ti is the i'th tag in the sequence and wi is the i'th word.
Overview
◮ Basic definitions
◮ Stochastic gradient descent
◮ Defining the input to a neural network
◮ A single neuron
◮ A single-layer feedforward network
◮ Motivation: the XOR problem
An Alternative Form for Log-Linear Models
Old form:

p(y|x; v) = exp (v · f(x, y)) / Σy′∈Y exp (v · f(x, y′))     (1)

New form:

p(y|x; v) = exp (v(y) · f(x) + γy) / Σy′∈Y exp (v(y′) · f(x) + γy′)     (2)

◮ Feature vector f(x) maps input x to f(x) ∈ RD.
◮ Parameters: v(y) ∈ RD, γy ∈ R for each y ∈ Y.
◮ The score v · f(x, y) in Eq. 1 has essentially been replaced by v(y) · f(x) + γy in Eq. 2.
◮ We will use v to refer to the set of all parameter vectors and bias values: that is, v = {(v(y), γy) : y ∈ Y}
Introducing Learned Representations
p(y|x; θ, v) = exp (v(y) · φ(x; θ) + γy) / Σy′∈Y exp (v(y′) · φ(x; θ) + γy′)     (3)
◮ Replaced f(x) by φ(x; θ) where θ are some additional
parameters of the model
◮ The parameter values θ will be estimated from training
examples: the representation of x is then “learned”
◮ In this lecture we’ll show how feedforward neural networks
can be used to define φ(x; θ).
Definition (Multi-Class Feedforward Models)
A multi-class feedforward model consists of:
◮ A set X of possible inputs.
◮ A finite set Y of possible labels.
◮ A positive integer D specifying the number of features in the feedforward representation.
◮ A parameter vector θ defining the feedforward parameters of the network. We use Ω to refer to the set of possible values for θ.
◮ A function φ : X × Ω → RD that maps any (x, θ) pair to a
“feedforward representation” φ(x; θ).
◮ For each label y ∈ Y, a parameter vector v(y) ∈ RD, and a bias
value γy ∈ R. For any x ∈ X, y ∈ Y,

p(y|x; θ, v) = exp (v(y) · φ(x; θ) + γy) / Σy′∈Y exp (v(y′) · φ(x; θ) + γy′)
Two Questions
◮ How can we define the feedforward representation φ(x; θ)?
◮ Given training examples (xi, yi) for i = 1 . . . n, how can we train the parameters θ and v?
Overview
◮ Basic definitions
◮ Stochastic gradient descent
◮ Defining the input to a neural network
◮ A single neuron
◮ A single-layer feedforward network
◮ Motivation: the XOR problem
A Simple Version of Stochastic Gradient Descent
Inputs: Training examples (xi, yi) for i = 1 . . . n. A feedforward representation φ(x; θ). An integer T specifying the number of updates. A sequence of learning rate values η1 . . . ηT where each ηt > 0.

Initialization: Set v and θ to random parameter values.
A Simple Version of Stochastic Gradient Descent (Continued)
Algorithm:
◮ For t = 1 . . . T
  ◮ Select an integer i uniformly at random from {1 . . . n}
  ◮ Define L(θ, v) = − log p(yi|xi; θ, v)
  ◮ For each parameter θj, θj = θj − ηt × dL(θ, v)/dθj
  ◮ For each label y, for each parameter vk(y), vk(y) = vk(y) − ηt × dL(θ, v)/dvk(y)
  ◮ For each label y, γy = γy − ηt × dL(θ, v)/dγy
Output: parameters θ and v
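A minimal sketch of this loop in Python, applied to the "new form" log-linear model from Eq. 2 (no feedforward component, so there are no θ parameters) on a hypothetical toy dataset. For this softmax model the generic derivatives have a simple closed form, which is used below in place of the abstract dL/dθj notation:

```python
import math
import random

random.seed(0)

# Hypothetical toy training set (not from the lecture): label +1 when the
# first coordinate exceeds the second, -1 otherwise.
data = [([1.0, 0.0], +1), ([0.0, 1.0], -1), ([2.0, 1.0], +1), ([1.0, 2.0], -1)]
labels = [+1, -1]

# Parameters: a weight vector v[y] and a bias gamma[y] for each label y.
v = {y: [0.0, 0.0] for y in labels}
gamma = {y: 0.0 for y in labels}

def probs(x):
    """p(y|x; v) = exp(v(y).x + gamma_y) / sum_y' exp(v(y').x + gamma_y')."""
    scores = {y: sum(vk * xk for vk, xk in zip(v[y], x)) + gamma[y]
              for y in labels}
    m = max(scores.values())                  # subtract max for stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# The SGD loop from the slides: T updates, each on one random example.
# For this softmax model the derivatives of L = -log p(yi|x) are
# dL/dv_k(y) = (p(y|x) - 1{y = yi}) * x_k, dL/dgamma_y = p(y|x) - 1{y = yi}.
T, eta = 500, 0.5
for t in range(T):
    x, yi = data[random.randrange(len(data))]
    p = probs(x)
    for y in labels:
        g = p[y] - (1.0 if y == yi else 0.0)
        for k in range(len(x)):
            v[y][k] -= eta * g * x[k]
        gamma[y] -= eta * g
```

With a linearly separable toy set like this one, a few hundred updates are enough to give every training example probability above 0.5 under the learned model.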
Overview
◮ Basic definitions
◮ Stochastic gradient descent
◮ Defining the input to a neural network
◮ A single neuron
◮ A single-layer feedforward network
◮ Motivation: the XOR problem
Defining the Input to a Feedforward Network
◮ Given an input x, we need to define a function f(x) ∈ Rd
that specifies the input to the network
◮ In general it is assumed that the representation f(x) is
“simple”, not requiring careful hand-engineering.
◮ The neural network will take f(x) as input, and will produce
a representation φ(x; θ) that depends on the input x and the parameters θ.
Linear Models
We could build a log-linear model using f(x) as the representation:

p(y|x; v) = exp{v(y) · f(x) + γy} / Σy′ exp{v(y′) · f(x) + γy′}     (4)

This is a "linear" model, because the score v(y) · f(x) is linear in the input features f(x). The general assumption is that a model of this form will perform poorly, or at least non-optimally. Neural networks enable "non-linear" models that often achieve much higher accuracy.
An Example: Digit Classification
◮ Task is to map an image x to a label y
◮ Each image contains a hand-written digit in the set {0, 1, 2, . . . , 9}
◮ The representation f(x) simply represents pixel values in the image.
◮ For example, if the image is 16 × 16 grey-scale pixels, where each pixel takes some value indicating how bright it is, we would have d = 256, with f(x) just being the list of values for the 256 different pixels in the image.
◮ Linear models under this representation perform poorly; neural networks give much better performance.
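A sketch of this representation in plain Python: the hypothetical image below is just a nested list of brightness values in [0, 1], and f(x) is its row-by-row flattening, giving d = 256:

```python
# Hypothetical 16x16 grey-scale image: a nested list of brightness values
# in [0, 1] (a real image would come from actual pixel data).
image = [[(r + c) / 30.0 for c in range(16)] for r in range(16)]

def pixel_features(img):
    # f(x): flatten the image row by row into a single d = 256 vector
    return [value for row in img for value in row]

fx = pixel_features(image)
```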
Simplifying Notation
◮ From now on assume that x = f(x): that is, the input x is already defined as a vector
◮ This will simplify notation
◮ But remember that when using a neural network you will have to define a representation of the inputs
Overview
◮ Basic definitions
◮ Stochastic gradient descent
◮ Defining the input to a neural network
◮ A single neuron
◮ A single-layer feedforward network
◮ Motivation: the XOR problem
A Single Neuron
◮ A neuron is defined by a weight vector w ∈ Rd, a bias b ∈ R,
and a transfer function g : R → R.
◮ The neuron maps an input vector x ∈ Rd to an output h as
follows: h = g(w · x + b)
◮ The vector w ∈ Rd and scalar b ∈ R are parameters of the
model, which are learned from training examples.
Transfer Functions
◮ It is important that the transfer function g(z) is non-linear
◮ A linear transfer function would be g(z) = α × z + β for some constants α and β
The Rectified Linear Unit (ReLU) Transfer Function
The ReLU transfer function is defined as

g(z) = { z if z ≥ 0; 0 if z < 0 }

or equivalently, g(z) = max{0, z}. It follows that the derivative is

dg(z)/dz = { 1 if z > 0; 0 if z < 0; undefined if z = 0 }
The tanh Transfer Function
The tanh transfer function is defined as

g(z) = (e^{2z} − 1) / (e^{2z} + 1)

It can be shown that the derivative is

dg(z)/dz = 1 − (g(z))^2
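Both transfer functions and their derivatives are easy to check numerically. A short sketch (the function names are my own, not from the slides), including a finite-difference spot-check of the identity dg(z)/dz = 1 − (g(z))^2:

```python
import math

def relu(z):
    # g(z) = max{0, z}
    return max(0.0, z)

def tanh(z):
    # g(z) = (e^{2z} - 1) / (e^{2z} + 1)
    e2z = math.exp(2.0 * z)
    return (e2z - 1.0) / (e2z + 1.0)

def tanh_deriv(z):
    # dg(z)/dz = 1 - (g(z))^2
    return 1.0 - tanh(z) ** 2

# Spot-check the tanh derivative against a central finite difference.
z, eps = 0.3, 1e-6
fd = (tanh(z + eps) - tanh(z - eps)) / (2.0 * eps)
```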
Calculating Derivatives
Given h = g(w · x + b), it will be useful to calculate the derivatives dh/dwj for the parameters w1, w2, . . . , wd, and also dh/db for the bias parameter b.
Calculating Derivatives (Continued)
We can use the chain rule of differentiation. First introduce an intermediate variable z ∈ R:

z = w · x + b,    h = g(z)

Then by the chain rule we have

dh/dwj = dh/dz × dz/dwj = dg(z)/dz × xj

Here we have used dh/dz = dg(z)/dz and dz/dwj = xj.
Calculating Derivatives (Continued)
We can use the chain rule of differentiation. First introduce an intermediate variable z ∈ R:

z = w · x + b,    h = g(z)

Then by the chain rule we have

dh/db = dh/dz × dz/db = dg(z)/dz × 1

Here we have used dh/dz = dg(z)/dz and dz/db = 1.
Definition (Single-Layer Feedforward Representation)
A single-layer feedforward representation consists of the following:
◮ An integer d specifying the input dimension. Each input to the network is a vector x ∈ Rd.
◮ An integer m specifying the number of hidden units.
◮ A parameter matrix W ∈ Rm×d. We use the vector Wk ∈ Rd for each k ∈ {1, 2, . . . , m} to refer to the k'th row of W.
◮ A vector b ∈ Rm of bias parameters.
◮ A transfer function g : R → R. Common choices are g(x) = ReLU(x) or g(x) = tanh(x).
Definition (Single-Layer Feedforward Representation (Continued))
We then define the following:
◮ For k = 1 . . . m, the input to the k'th neuron is zk = Wk · x + bk.
◮ For k = 1 . . . m, the output from the k'th neuron is hk = g(zk).
◮ Finally, define the vector φ(x; θ) ∈ Rm as φk(x; θ) = hk for k = 1 . . . m. Here θ denotes the parameters W ∈ Rm×d and b ∈ Rm. Hence θ contains m × (d + 1) parameters in total.
Some Intuition
The neural network employs m units, each with their own parameters Wk and bk, and these neurons are used to construct a “hidden” representation h ∈ Rm.
Matrix Form
We can for example replace the operations zk = Wk · x + bk for k = 1 . . . m with

z = Wx + b

where the dimensions are as follows (note that an m-dimensional column vector is equivalent to a matrix of dimension m × 1): z is m × 1, W is m × d, x is d × 1, and b is m × 1.
Definition (Single-Layer Feedforward Representation (Matrix Form))
A single-layer feedforward representation consists of the following:
◮ An integer d specifying the input dimension. Each input to the network is a vector x ∈ Rd.
◮ An integer m specifying the number of hidden units.
◮ A matrix of parameters W ∈ Rm×d.
◮ A vector of bias parameters b ∈ Rm.
◮ A transfer function g : Rm → Rm. Common choices would be to define g(z) to be a vector with components ReLU(z1), ReLU(z2), . . . , ReLU(zm) or tanh(z1), tanh(z2), . . . , tanh(zm).
Definition (Single-Layer Feedforward Representation (Matrix Form) (Continued))
We then define the following:
◮ The vector of inputs to the hidden layer z ∈ Rm is defined as
z = Wx + b.
◮ The vector of outputs from the hidden layer h ∈ Rm is
defined as h = g(z)
◮ Finally, define φ(x; θ) = h. Here the parameters θ contain
the matrix W and the vector b.
◮ It follows that
φ(x; θ) = g(Wx + b)
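The matrix-form forward pass φ(x; θ) = g(Wx + b) can be sketched in a few lines of plain Python, representing W as a list of m rows and taking g = ReLU applied componentwise (the particular values of W, b, and x below are arbitrary illustrations):

```python
# W is a list of m rows, each a length-d list; g = ReLU componentwise.
def phi(x, W, b):
    # z_k = W_k . x + b_k,  h_k = g(z_k)
    return [max(0.0, sum(w * xj for w, xj in zip(Wk, x)) + bk)
            for Wk, bk in zip(W, b)]

W = [[1.0, -1.0], [0.5, 0.5]]    # m = 2 hidden units, d = 2 inputs
b = [0.0, -0.25]
h = phi([1.0, 2.0], W, b)        # z = [-1.0, 1.25], so h = [0.0, 1.25]
```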
Overview
◮ Basic definitions
◮ Stochastic gradient descent
◮ Defining the input to a neural network
◮ A single neuron
◮ A single-layer feedforward network
◮ Motivation: the XOR problem
A Motivating Example: the XOR Problem (from Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville)
We will assume a training set where each label is in the set Y = {−1, +1}, and there are 4 training examples, as follows:

x1 = [0, 0], y1 = −1
x2 = [0, 1], y2 = +1
x3 = [1, 0], y3 = +1
x4 = [1, 1], y4 = −1
A Useful Lemma
Assume we have a model of the form

p(y|x; v) = exp{v(y) · x + γy} / Σy′ exp{v(y′) · x + γy′}

and the set of possible labels is Y = {−1, +1}. Then for any x, p(+1|x; v) > 0.5 if and only if u · x + γ > 0, where u = v(+1) − v(−1) and γ = γ+1 − γ−1. Similarly, for any x, p(−1|x; v) > 0.5 if and only if u · x + γ < 0.
Proof: We have

p(+1|x; v) = exp{v(+1) · x + γ+1} / (exp{v(+1) · x + γ+1} + exp{v(−1) · x + γ−1}) = 1 / (1 + exp{−(u · x + γ)})

It follows that p(+1|x; v) > 0.5 if and only if exp{−(u · x + γ)} < 1, from which it follows that u · x + γ > 0. A similar argument applies to the condition p(−1|x; v) > 0.5.
Theorem
Assume we have examples (xi, yi) for i = 1 . . . 4 as defined above. Assume we have a model of the form

p(y|x; v) = exp{v(y) · x + γy} / Σy′ exp{v(y′) · x + γy′}

Then there are no parameter settings for v(+1), v(−1), γ+1, γ−1 such that p(yi|xi; v) > 0.5 for i = 1 . . . 4.
Proof Sketch:
From the previous lemma, p(yi = +1|xi; v) > 0.5 if and only if u · xi + γ > 0, where u = v(+1) − v(−1) and γ = γ+1 − γ−1. Similarly, p(yi = −1|xi; v) > 0.5 if and only if u · xi + γ < 0. Hence to satisfy p(yi|xi; v) > 0.5 for i = 1 . . . 4, there must exist parameters u and γ such that

u · [0, 0] + γ < 0     (5)
u · [0, 1] + γ > 0     (6)
u · [1, 0] + γ > 0     (7)
u · [1, 1] + γ < 0     (8)
The Constraints Cannot be Satisfied

u · [0, 0] + γ < 0     (5)
u · [0, 1] + γ > 0     (6)
u · [1, 0] + γ > 0     (7)
u · [1, 1] + γ < 0     (8)

Adding constraints (6) and (7) gives u · [1, 1] + 2γ > 0, while adding constraints (5) and (8) gives u · [1, 1] + 2γ < 0: a contradiction. Hence no linear model can satisfy all four constraints.
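Although no linear model can satisfy the four constraints, a single-layer ReLU network can represent XOR exactly. A sketch using the standard hand-picked weights from the Goodfellow et al. example the slides cite: φ(x) = ReLU(Wx + c) with two hidden units, followed by a linear score u · φ(x):

```python
# phi(x) = ReLU(Wx + c) with m = 2 hidden units, then a linear score u . phi(x).
def relu_vec(z):
    return [max(0.0, zk) for zk in z]

def phi(x):
    W = [[1.0, 1.0], [1.0, 1.0]]
    c = [0.0, -1.0]
    return relu_vec([sum(w * xj for w, xj in zip(Wk, x)) + ck
                     for Wk, ck in zip(W, c)])

def score(x):
    u = [1.0, -2.0]
    return sum(uk * hk for uk, hk in zip(u, phi(x)))

# Predict +1 when score(x) > 0.5; the four XOR inputs get scores 0, 1, 1, 0.
xor_data = [([0.0, 0.0], -1), ([0.0, 1.0], +1), ([1.0, 0.0], +1), ([1.0, 1.0], -1)]
predictions = [(+1 if score(x) > 0.5 else -1) for x, _ in xor_data]
```

The hidden layer maps the four inputs to h-vectors [0,0], [1,0], [1,0], [2,1], which are linearly separable even though the original inputs are not.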
Theorem
Assume we have examples (xi, yi) for i = 1 . . . 4 as defined above. Assume we have a model of the form

p(y|x; θ, v) = exp{v(y) · φ(x; θ) + γy} / Σy′ exp{v(y′) · φ(x; θ) + γy′}