Neural Networks: Computation + Gradient Descent
LING572 Advanced Statistical Methods in NLP, February 27, 2020


SLIDE 1

Neural Networks:
 Computation + Gradient Descent

LING572 Advanced Statistical Methods in NLP February 27 2020

1

SLIDE 2

Today’s Outline

  • Computation: the forward pass
  • Functional form / matrix notation
  • Parameters and Hyperparameters
  • Gradient Descent
  • Intro
  • Stochastic Gradient Descent + Mini-batches

2

SLIDE 3

Notation

  • I will generally use plain variables (e.g. $x$, $y$, $W$) for vectors and matrices as well as scalars, relying on context
  • $\hat{y}$: a “guess” at $y$
  • e.g.: a model’s output $f(x)$
  • $f(x)$, when $x$ is a vector/matrix, means that $f$ is applied element-wise
  • $\theta$: all parameters
  • $\hat{y} = f(x; \theta) = f_\theta(x)$: $\hat{y}$ is a (parameterized) function of $x$ with parameters $\theta$

3

SLIDE 4

Feed-forward networks
 aka Multi-layer perceptrons (MLP)

4

SLIDE 5

XOR Network

5

$a_{and} = \sigma\left(w_{and,or} \cdot a_{or} + w_{and,nand} \cdot a_{nand} + b_{and}\right) = \sigma\left(\begin{bmatrix} w_{and,or} & w_{and,nand} \end{bmatrix} \begin{bmatrix} a_{or} \\ a_{nand} \end{bmatrix} + b_{and}\right)$

SLIDE 6

XOR Network

6

$a_{or} = \sigma\left(w_{or,p} \cdot a_p + w_{or,q} \cdot a_q + b_{or}\right)$

$a_{nand} = \sigma\left(w_{nand,p} \cdot a_p + w_{nand,q} \cdot a_q + b_{nand}\right)$

$a_{and} = \sigma\left(w_{and,or} \cdot a_{or} + w_{and,nand} \cdot a_{nand} + b_{and}\right) = \sigma\left(\begin{bmatrix} w_{and,or} & w_{and,nand} \end{bmatrix} \begin{bmatrix} a_{or} \\ a_{nand} \end{bmatrix} + b_{and}\right)$

SLIDE 7

XOR Network

7

$a_{and} = \sigma\left(w_{and,or} \cdot a_{or} + w_{and,nand} \cdot a_{nand} + b_{and}\right) = \sigma\left(\begin{bmatrix} w_{and,or} & w_{and,nand} \end{bmatrix} \begin{bmatrix} a_{or} \\ a_{nand} \end{bmatrix} + b_{and}\right)$

$\begin{bmatrix} a_{or} \\ a_{nand} \end{bmatrix} = \sigma\left(\begin{bmatrix} w_{or,p} & w_{or,q} \\ w_{nand,p} & w_{nand,q} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{or} \\ b_{nand} \end{bmatrix}\right)$

SLIDE 8

XOR Network

8

$a_{and} = \sigma\left(w_{and,or} \cdot a_{or} + w_{and,nand} \cdot a_{nand} + b_{and}\right) = \sigma\left(\begin{bmatrix} w_{and,or} & w_{and,nand} \end{bmatrix} \begin{bmatrix} a_{or} \\ a_{nand} \end{bmatrix} + b_{and}\right)$

$a_{and} = \sigma\left(\begin{bmatrix} w_{and,or} & w_{and,nand} \end{bmatrix} \sigma\left(\begin{bmatrix} w_{or,p} & w_{or,q} \\ w_{nand,p} & w_{nand,q} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{or} \\ b_{nand} \end{bmatrix}\right) + b_{and}\right)$

SLIDE 9

Generalizing

9

$a_{and} = \sigma\left(\begin{bmatrix} w_{and,or} & w_{and,nand} \end{bmatrix} \sigma\left(\begin{bmatrix} w_{or,p} & w_{or,q} \\ w_{nand,p} & w_{nand,q} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{or} \\ b_{nand} \end{bmatrix}\right) + b_{and}\right)$

$\hat{y} = f_2\left(W_2 f_1\left(W_1 x + b_1\right) + b_2\right)$

$\hat{y} = f_n\left(W_n f_{n-1}\left(\cdots f_2\left(W_2 f_1\left(W_1 x + b_1\right) + b_2\right)\cdots\right) + b_n\right)$
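To make the forward pass concrete, here is a minimal NumPy sketch of the two-layer form $\hat{y} = f_2(W_2 f_1(W_1 x + b_1) + b_2)$. The specific weights are hand-picked for illustration (in the same or/nand/and spirit as the slides, but not taken from them), not learned:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # hidden layer: h = f1(W1 x + b1); output: y_hat = f2(W2 h + b2); here f1 = f2 = sigmoid
    h = sigmoid(W1 @ x + b1)        # shape (2, 1)
    return sigmoid(W2 @ h + b2)     # shape (1, 1)

# Hand-picked weights: the hidden neurons approximate OR and NAND,
# and the output neuron approximates AND, so the network computes XOR.
W1 = np.array([[20.0, 20.0],        # "or" row
               [-20.0, -20.0]])     # "nand" row
b1 = np.array([[-10.0], [30.0]])
W2 = np.array([[20.0, 20.0]])       # "and" weights
b2 = np.array([[-30.0]])

for p, q in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([[p], [q]], dtype=float)
    print(p, q, round(forward(x, W1, b1, W2, b2).item(), 3))   # ~0, ~1, ~1, ~0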

SLIDE 10

Some terminology

  • Our XOR network is a feed-forward neural network with one hidden layer
  • Aka a multi-layer perceptron (MLP)
  • Input nodes: 2; output nodes: 1
  • Activation function: sigmoid

10

SLIDE 11

General MLP

11

$W_1$: the layer-1 weight matrix. Entry $w^1_{ij}$: weight to neuron $i$ in layer 1 from neuron $j$ in layer 0.

SLIDE 12

General MLP

12

$\hat{y} = f_n\left(W_n f_{n-1}\left(\cdots f_2\left(W_2 f_1\left(W_1 x + b_1\right) + b_2\right)\cdots\right) + b_n\right)$

$W_1 = \begin{bmatrix} w^1_{00} & w^1_{01} & \cdots & w^1_{0 n_0} \\ w^1_{10} & w^1_{11} & \cdots & w^1_{1 n_0} \\ \vdots & \vdots & \ddots & \vdots \\ w^1_{n_1 0} & w^1_{n_1 1} & \cdots & w^1_{n_1 n_0} \end{bmatrix}$

Shape: $(n_1, n_0)$, where $n_0$ = number of neurons in layer 0 (the input) and $n_1$ = number of neurons in layer 1

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n_0} \end{bmatrix}$    Shape: $(n_0, 1)$

$b_1 = \begin{bmatrix} b^1_1 \\ \vdots \\ b^1_{n_1} \end{bmatrix}$    Shape: $(n_1, 1)$

SLIDE 13

Parameters of an MLP

  • Weights and biases
  • For each layer $l$: $n_l(n_{l-1} + 1)$ parameters in total: $n_l n_{l-1}$ weights; $n_l$ biases
  • With $n$ hidden layers (considering the output as a hidden layer), the total parameter count is:

13

$\sum_{i=1}^{n} n_i(n_{i-1} + 1)$
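A quick sanity check of this count in Python (an illustrative helper, not from the slides):

def count_params(layer_sizes):
    """layer_sizes = [n_0, n_1, ..., n_n]: input size followed by each layer's size."""
    # each layer i contributes n_i * n_{i-1} weights plus n_i biases
    return sum(n_i * (n_prev + 1) for n_prev, n_i in zip(layer_sizes, layer_sizes[1:]))

print(count_params([2, 2, 1]))  # XOR network: 2*(2+1) + 1*(2+1) = 9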

SLIDE 14

Hyper-parameters of an MLP

  • Input size, output size
  • Usually fixed by your problem / dataset
  • Input: image size, vocab size; number of “raw” features in general
  • Output: 1 for binary classification or simple regression, number of labels for classification, …
  • Number of hidden layers
  • For each hidden layer:
  • Size
  • Activation function
  • Others: initialization, regularization (and associated values), learning rate / training, …

14

SLIDE 15

The Deep in Deep Learning

  • The Universal Approximation Theorem says that one hidden layer suffices for arbitrarily-closely approximating a given function
  • Empirical drawbacks: may require super-exponentially many neurons; hard to discover
  • “Deep and narrow” >> “Shallow and wide”
  • In principle allows hierarchical features to be learned
  • More well-behaved w/r/t optimization

15


SLIDE 16

Activation Functions

  • Note: non-linear activation functions are essential
  • MLP: linear transformation, followed by a point-wise non-linearity, repeated several times over
  • Without the non-linearity, would just have several linear transformations
  • Composition of linear transformations is also linear!

16

$\hat{y} = f_n\left(W_n f_{n-1}\left(\cdots f_2\left(W_2 f_1\left(W_1 x + b_1\right) + b_2\right)\cdots\right) + b_n\right)$

SLIDE 17

Activation Functions: Hidden Layer

17

sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$

tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$

Problem: derivative “saturates” (nearly 0) everywhere except near origin

  • Use ReLU by default
  • Generalizations:
  • Leaky ReLU
  • ELU
  • Softplus
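For reference, a small NumPy sketch of these activations and of why sigmoid saturates (the sample inputs are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z) * (1.0 - sigmoid(z)))   # sigmoid's derivative: ~0 away from the origin (saturation)
print(np.tanh(z))                        # also saturates, at -1 and +1
print(relu(z))                           # derivative is 1 for all positive inputs
print(leaky_relu(z))                     # small slope instead of 0 for negative inputs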
SLIDE 18

Activation Functions: Output Layer

  • Depends on the task!
  • Regression (continuous output(s)): none!
  • Just use final linear transformation
  • Binary classification: sigmoid
  • Also for multi-label classification
  • Multi-class classification: softmax
  • Terminology: the inputs to a softmax are called logits
  • [there are sometimes other uses of the term, so beware]

18

$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$
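A minimal softmax sketch in NumPy; subtracting the max of the logits is a standard numerical-stability trick, not something the slide itself mentions:

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # shifting by a constant leaves the result unchanged, avoids overflow
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))        # e.g. [0.66, 0.24, 0.10]
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # 1.0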

SLIDE 19

Learning: (Stochastic) Gradient Descent

19

SLIDE 20

Gradient Descent: Basic Idea

  • Treat NN training as an optimization problem
  • $\ell(\hat{y}, y)$: loss function (“objective function”)
  • How “close” is the model’s output to the true output
  • Local loss, averaged over training instances: $\mathcal{L}(\hat{Y}, Y) = \frac{1}{|Y|} \sum_i \ell(\hat{y}(x_i), y_i)$
  • More later: depends on the particular task, among other things
  • View the loss as a function of the model’s parameters
  • The gradient of the loss w/r/t parameters tells which direction in parameter space to “walk” to make the loss smaller (i.e. to improve model outputs)
  • Guaranteed to work in linear case; can get stuck in local minima for NNs

20

SLIDE 21

Gradient Descent: Basic Idea

21


SLIDE 22

Derivatives

  • The derivative of a function of one real variable measures how much the output changes with respect to a change in the input variable

22

$f(x) = x^2 + 35x + 12 \quad\Rightarrow\quad \frac{df}{dx} = 2x + 35$

$f(x) = e^x \quad\Rightarrow\quad \frac{df}{dx} = e^x$

SLIDE 23

Partial Derivatives

  • A partial derivative of a function of several variables measures its derivative with respect to one of those variables, with the others held constant.

23

$f(x, y) = 10x^3y^2 + 5xy^3 + 4x + y$

$\frac{\partial f}{\partial x} = 30x^2y^2 + 5y^3 + 4 \qquad \frac{\partial f}{\partial y} = 20x^3y + 15xy^2 + 1$

SLIDE 24

Gradient

  • The gradient of a function $f(x_1, x_2, \ldots, x_n)$ is a vector function, returning all of the partial derivatives:
  • The gradient is perpendicular to the level curve at a point
  • The gradient points in the direction of greatest rate of increase of $f$

24

$\nabla f = \left\langle \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right\rangle$

Example: $f(x, y) = 4x^2 + y^2$, so $\nabla f = \langle 8x, 2y \rangle$
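A tiny Python check of this gradient against a finite-difference approximation (the test point is arbitrary):

def f(x, y):
    return 4 * x**2 + y**2

def grad_f(x, y):
    return (8 * x, 2 * y)

x, y, eps = 1.0, 1.0, 1e-6
num_dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)   # central finite difference
num_dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
print(grad_f(x, y))       # (8.0, 2.0)
print((num_dx, num_dy))   # approximately (8.0, 2.0)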

SLIDE 25

Gradient and Level Curves

25

$f(x, y) = 4x^2 + y^2, \qquad \nabla f = \langle 8x, 2y \rangle$

Level curves: $f(x, y) = c$

Points marked on the plot: (1.25, 0), (1, 1), (0, 5)

Q: what are the actual gradients at those points?

SLIDE 26

Gradient Descent and Level Curves

26


SLIDE 27

Gradient Descent Algorithm

  • Initialize $\theta_0$
  • Repeat until convergence:

$\theta_{n+1} = \theta_n - \alpha \nabla \mathcal{L}(\hat{Y}(\theta_n), Y)$

where $\alpha$ is the learning rate.

  • High learning rate: big steps, may bounce and “overshoot” the target
  • Low learning rate: small steps, smoother minimization of loss, but can be slow

27
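A minimal Python sketch of this update rule, with an illustrative scalar loss to show the effect of the learning rate (the loss function and values are my own example, not from the slides):

def gradient_descent(grad_loss, theta0, lr=0.1, n_steps=100):
    """theta_{n+1} = theta_n - lr * grad_loss(theta_n)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - lr * grad_loss(theta)
    return theta

# loss L(theta) = theta^2, so grad_loss(theta) = 2 * theta; minimum at theta = 0
print(gradient_descent(lambda t: 2 * t, theta0=1.0, lr=0.1))  # ~0: converges
print(gradient_descent(lambda t: 2 * t, theta0=1.0, lr=1.1))  # huge: overshoots and diverges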
SLIDE 28

Gradient Descent: Minimal Example

  • Task: predict a target/true value $y = 2$
  • “Model”: $\hat{y}(\theta) = \theta$
  • A single parameter: the actual guess
  • Loss: (squared) Euclidean distance

28

$\mathcal{L}(\hat{y}(\theta), y) = (\hat{y} - y)^2 = (\theta - y)^2$
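A minimal Python version of this example: gradient descent on $\mathcal{L}(\theta) = (\theta - 2)^2$, whose gradient is $2(\theta - 2)$; the learning rate and starting point are arbitrary choices:

y = 2.0        # the target/true value
theta = 0.0    # the single parameter (our "guess"), initialized arbitrarily
lr = 0.1

for step in range(50):
    grad = 2 * (theta - y)     # d/dtheta of (theta - y)^2
    theta = theta - lr * grad

print(theta)   # ~2.0: the guess has walked to the target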

SLIDE 29

Gradient Descent: Minimal Example

29

SLIDE 30

Stochastic Gradient Descent

  • The above is called “batch” gradient descent
  • Updates once per pass through the dataset
  • Expensive, and slow; does not scale well
  • Stochastic gradient descent:
  • Break the data into “mini-batches”: small chunks of the data
  • Compute gradients and update parameters for each batch
  • Mini-batch of size 1 = single example
  • A noisy estimate of the true gradient, but works well in practice; more parameter updates
  • Epoch: one pass through the whole training data

30

SLIDE 31

Stochastic Gradient Descent

31

initialize parameters / build model
for each epoch:
    data = shuffle(data)
    batches = make_batches(data)
    for each batch in batches:
        outputs = model(batch)
        loss = loss_fn(outputs, true_outputs)
        compute gradients    // e.g. loss.backward()
        update parameters
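For comparison, a minimal PyTorch-style version of the same loop; `model` (an nn.Module), `loss_fn`, `n_epochs`, and `data` (a list of (x, y) tensor pairs) are assumed to be defined elsewhere:

import torch

loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(n_epochs):
    for x_batch, y_batch in loader:
        optimizer.zero_grad()             # clear gradients from the previous batch
        outputs = model(x_batch)          # forward pass over the whole mini-batch
        loss = loss_fn(outputs, y_batch)
        loss.backward()                   # compute gradients
        optimizer.step()                  # update parameters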

SLIDE 32

Computing with Mini-batches

  • Bad idea (it processes one example at a time, so it cannot exploit fast batched matrix operations):

32

for each batch in batches:
    for each datum in batch:
        outputs = model(datum)
        loss = loss_fn(outputs, true_outputs)
        compute gradients    // e.g. loss.backward()
        update parameters

SLIDE 33

Computing with a Single Input

33

$\hat{y} = f_n\left(W_n f_{n-1}\left(\cdots f_2\left(W_2 f_1\left(W_1 x + b_1\right) + b_2\right)\cdots\right) + b_n\right)$

$W_1 = \begin{bmatrix} w^1_{00} & w^1_{01} & \cdots & w^1_{0 n_0} \\ w^1_{10} & w^1_{11} & \cdots & w^1_{1 n_0} \\ \vdots & \vdots & \ddots & \vdots \\ w^1_{n_1 0} & w^1_{n_1 1} & \cdots & w^1_{n_1 n_0} \end{bmatrix}$

Shape: $(n_1, n_0)$, where $n_0$ = number of neurons in layer 0 (the input) and $n_1$ = number of neurons in layer 1

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n_0} \end{bmatrix}$    Shape: $(n_0, 1)$

$b_1 = \begin{bmatrix} b^1_1 \\ \vdots \\ b^1_{n_1} \end{bmatrix}$    Shape: $(n_1, 1)$

SLIDE 34

Computing with a Batch of Inputs

34

$\hat{y} = f_n\left(f_{n-1}\left(\cdots f_2\left(f_1\left(x W_1 + b_1\right) W_2 + b_2\right)\cdots\right) W_n + b_n\right)$

$W_1 = \begin{bmatrix} w^1_{00} & w^1_{01} & \cdots & w^1_{0 n_1} \\ w^1_{10} & w^1_{11} & \cdots & w^1_{1 n_1} \\ \vdots & \vdots & \ddots & \vdots \\ w^1_{n_0 0} & w^1_{n_0 1} & \cdots & w^1_{n_0 n_1} \end{bmatrix}$

Shape: $(n_0, n_1)$, where $n_0$ = number of neurons in layer 0 (the input) and $n_1$ = number of neurons in layer 1

$x = \begin{bmatrix} x^0_1 & \cdots & x^0_{n_0} \\ x^1_1 & \cdots & x^1_{n_0} \\ \vdots & \ddots & \vdots \\ x^n_1 & \cdots & x^n_{n_0} \end{bmatrix}$

Shape: $(n, n_0)$, where $n$ = batch_size (one input per row)

$b_1 = \begin{bmatrix} b^1_1 & \cdots & b^1_{n_1} \end{bmatrix}$

Shape: $(1, n_1)$; added (broadcast) to each row of $xW_1$
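A small NumPy sketch of the batched forward pass for one layer, just to show the shapes (all sizes here are arbitrary):

import numpy as np

batch_size, n0, n1 = 4, 3, 5
X  = np.random.randn(batch_size, n0)   # one input per row: shape (batch_size, n0)
W1 = np.random.randn(n0, n1)           # note the (n0, n1) convention for batched inputs
b1 = np.random.randn(1, n1)            # broadcast: added to each row of X @ W1

H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))   # f1(X W1 + b1) with a sigmoid
print(H.shape)                              # (4, 5) == (batch_size, n1)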

SLIDE 35

Note on mini-batches and shape

  • Most modern neural net libraries (e.g. PyTorch) expect the first dimension of matrices/tensors to be a batch size

  • Produce a sequence of representations, for each item in the batch
  • e.g. (batch_size, input_size) —> (batch_size, hidden_size) —> (batch_size, output_size)
  • In principle, can be higher than 2-dimensional
  • Images: (batch_size, width, height, 3)
  • Sequences: (batch_size, seq_len, representation_size)
  • Two comments:
  • In your code, annotate every tensor with a comment saying intended shape
  • When debugging, look at shapes early on!!

35
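An example of the shape-annotation habit in PyTorch (the sizes are made up):

import torch
import torch.nn as nn

linear1 = nn.Linear(300, 128)   # input_size=300, hidden_size=128
linear2 = nn.Linear(128, 3)     # 3 output classes

x = torch.randn(32, 300)        # (batch_size, input_size)
h = torch.relu(linear1(x))      # (batch_size, hidden_size)
logits = linear2(h)             # (batch_size, output_size)
print(logits.shape)             # torch.Size([32, 3])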

SLIDE 36

Regularization

  • NNs are often overparameterized, so regularization helps
  • L1/L2: add a norm penalty to the loss (formula below)
  • Dropout (2012):
  • During training, randomly turn off X% of neurons in each layer
  • (Don’t do this during testing/predicting)
  • Batch Normalization (2015)

36

$\mathcal{L}'(\theta, y) = \mathcal{L}(\theta, y) + \lambda \|\theta\|^2$
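A minimal PyTorch sketch of these two regularizers: dropout as a layer, and L2 regularization via the optimizer's `weight_decay` argument; the sizes and constants are illustrative:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(300, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # during training, randomly zeroes 50% of activations
    nn.Linear(128, 3),
)
# weight_decay applies an L2 penalty on the parameters during the update
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()   # dropout active while training
model.eval()    # dropout disabled when testing/predicting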

SLIDE 37

Hyper-parameters

  • In addition to the model architecture ones mentioned earlier
  • Optimizer: SGD, Adam, Adagrad, RMSProp, ….
  • Optimizer-specific hyper-parameters: learning rate, alpha, beta, …
  • NB: backprop computes gradients; optimizer uses them to update parameters
  • Regularization: L1/L2, Dropout, BN, …
  • regularizer-specific ones: e.g. dropout rate
  • Batch size
  • Number of epochs to train for
  • Early stopping criterion (e.g. number of epochs, “patience”)

37

SLIDE 38

Early stopping

  • One: Pick # of epochs, hope for no overfitting
  • Better: pick max # of epochs, and “patience”
  • Halt when validation error does not improve over patience-many epochs

38

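A minimal sketch of patience-based early stopping in PyTorch-style Python; `train_one_epoch`, `evaluate`, `model`, `val_data`, and `max_epochs` are assumed helpers/values, not part of the slides:

import copy

best_val_loss = float("inf")
best_state = None
patience, bad_epochs = 3, 0

for epoch in range(max_epochs):
    train_one_epoch(model)                  # assumed helper: one pass over the training data
    val_loss = evaluate(model, val_data)    # assumed helper: loss on the validation set
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())   # remember the best parameters
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # no improvement for `patience` epochs: halt
            break

model.load_state_dict(best_state)           # roll back to the best checkpoint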

SLIDE 39

A note on hyper-parameter tuning

  • Grid search: specify a range of values for each hyper-parameter, try all possible combinations thereof
  • Random search: specify possible values for all parameters, randomly sample values for each, stop when some criterion is met

39

Bergstra and Bengio 2012
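A sketch of both strategies in Python; `train_and_evaluate` is an assumed helper that trains a model with the given hyper-parameters and returns a validation score:

import itertools
import random

space = {"lr": [0.1, 0.01, 0.001], "hidden_size": [64, 128, 256], "dropout": [0.0, 0.5]}

# grid search: every combination (3 * 3 * 2 = 18 runs here)
for values in itertools.product(*space.values()):
    config = dict(zip(space.keys(), values))
    score = train_and_evaluate(**config)    # assumed helper

# random search: sample each value independently, under a fixed budget
for _ in range(10):
    config = {name: random.choice(choices) for name, choices in space.items()}
    score = train_and_evaluate(**config)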

SLIDE 40

Next time

  • Today: how to train an NN by SGD
  • Compute gradients of loss w/r/t parameters
  • Update parameters (weights) in the opposite direction, to minimize loss
  • Next time:
  • How do we compute gradients???
  • Backpropagation

40