
Neural Networks: Computation + Gradient Descent



  1. Neural Networks: Computation + Gradient Descent
     LING572 Advanced Statistical Methods in NLP
     February 27, 2020

  2. Today’s Outline
     ● Computation: the forward pass
        ● Functional form / matrix notation
        ● Parameters and Hyperparameters
     ● Gradient Descent
        ● Intro
        ● Stochastic Gradient Descent + Mini-batches

  3. Notation
     ● I will generally use plain variables (e.g. $x$, $y$, $W$) for vectors and matrices as well as scalars, relying on context
     ● $\hat{y}$: a “guess” at $y$
        ● e.g.: a model’s output
     ● $f(x)$, when $x$ is a vector/matrix, means that $f$ is applied element-wise
     ● $\theta$: all parameters
     ● $\hat{y} = f(x; \theta) = f_\theta(x)$: $\hat{y}$ is a (parameterized) function of $x$ with parameters $\theta$

  4. Feed-forward networks
     aka Multi-layer perceptrons (MLP)

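  5. XOR Network
     $a_{\text{and}} = \sigma(w_{\text{and},\text{or}} \cdot a_{\text{or}} + w_{\text{and},\text{nand}} \cdot a_{\text{nand}} + b_{\text{and}}) = \sigma\left(\begin{bmatrix} w_{\text{and},\text{or}} & w_{\text{and},\text{nand}} \end{bmatrix} \begin{bmatrix} a_{\text{or}} \\ a_{\text{nand}} \end{bmatrix} + b_{\text{and}}\right)$

  6. XOR Network
     $a_{\text{or}} = \sigma(w_{\text{or},p} \cdot a_p + w_{\text{or},q} \cdot a_q + b_{\text{or}})$
     $a_{\text{nand}} = \sigma(w_{\text{nand},p} \cdot a_p + w_{\text{nand},q} \cdot a_q + b_{\text{nand}})$

  7. XOR Network
     $\begin{bmatrix} a_{\text{or}} \\ a_{\text{nand}} \end{bmatrix} = \sigma\left(\begin{bmatrix} w_{\text{or},p} & w_{\text{or},q} \\ w_{\text{nand},p} & w_{\text{nand},q} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{\text{or}} \\ b_{\text{nand}} \end{bmatrix}\right)$

  8. XOR Network
     $a_{\text{and}} = \sigma\left(\begin{bmatrix} w_{\text{and},\text{or}} & w_{\text{and},\text{nand}} \end{bmatrix} \sigma\left(\begin{bmatrix} w_{\text{or},p} & w_{\text{or},q} \\ w_{\text{nand},p} & w_{\text{nand},q} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{\text{or}} \\ b_{\text{nand}} \end{bmatrix}\right) + b_{\text{and}}\right)$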
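
The XOR network above can be sketched in a few lines of Python. The slides do not give concrete weight values, so the numbers below are hand-picked assumptions that make each sigmoid unit behave approximately like its logic gate; the names `HIDDEN_WEIGHTS`, `AND_WEIGHTS` and `xor_net` are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, the activation used by every unit in the XOR network."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed weights (not from the slides): (weight for p, weight for q, bias).
HIDDEN_WEIGHTS = {
    "or": (20.0, 20.0, -10.0),
    "nand": (-20.0, -20.0, 30.0),
}
# Assumed output weights: (weight from or, weight from nand, bias).
AND_WEIGHTS = (20.0, 20.0, -30.0)

def xor_net(a_p, a_q):
    """Compute a_and = sigma(w_and_or * a_or + w_and_nand * a_nand + b_and)."""
    w_p, w_q, b = HIDDEN_WEIGHTS["or"]
    a_or = sigmoid(w_p * a_p + w_q * a_q + b)
    w_p, w_q, b = HIDDEN_WEIGHTS["nand"]
    a_nand = sigmoid(w_p * a_p + w_q * a_q + b)
    w_or, w_nand, b = AND_WEIGHTS
    return sigmoid(w_or * a_or + w_nand * a_nand + b)

# Rounded outputs for the four input combinations (0,0), (0,1), (1,0), (1,1).
truth_table = [round(xor_net(p, q)) for p in (0, 1) for q in (0, 1)]
```

With these assumed weights the outputs are close to 0 or 1 rather than exact, which is why the truth table is rounded.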

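  9. Generalizing
     $a_{\text{and}} = \sigma\left(\begin{bmatrix} w_{\text{and},\text{or}} & w_{\text{and},\text{nand}} \end{bmatrix} \sigma\left(\begin{bmatrix} w_{\text{or},p} & w_{\text{or},q} \\ w_{\text{nand},p} & w_{\text{nand},q} \end{bmatrix} \begin{bmatrix} a_p \\ a_q \end{bmatrix} + \begin{bmatrix} b_{\text{or}} \\ b_{\text{nand}} \end{bmatrix}\right) + b_{\text{and}}\right)$
     $\hat{y} = f_2(W^2 f_1(W^1 x + b^1) + b^2)$
     $\hat{y} = f_n(W^n f_{n-1}(\cdots f_2(W^2 f_1(W^1 x + b^1) + b^2) \cdots) + b^n)$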
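
A minimal sketch of the general forward pass, assuming each layer is stored as a hypothetical `(W, b, f)` triple with `W` as a list of rows. The XOR instance at the end reuses hand-picked weights that are assumptions, not values from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Return W x + b, where W is a list of rows and x, b are plain lists."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def forward(layers, x):
    """Apply y_hat = f_n(W^n f_{n-1}(... f_1(W^1 x + b^1) ...) + b^n).

    `layers` is a list of (W, b, f) triples ordered from first to last layer;
    each f is applied element-wise.
    """
    a = x
    for W, b, f in layers:
        a = [f(z) for z in affine(W, a, b)]
    return a

# The XOR network as a two-layer instance (assumed weights).
xor_layers = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0], sigmoid),  # rows: or, nand
    ([[20.0, 20.0]], [-30.0], sigmoid),                        # row: and
]
xor_outputs = [round(forward(xor_layers, [p, q])[0])
               for p in (0, 1) for q in (0, 1)]
```

Storing the activation alongside each weight matrix keeps the loop body identical for every layer, which mirrors the nested functional form.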

  10. Some terminology
     ● Our XOR network is a feed-forward neural network with one hidden layer
        ● Aka a multi-layer perceptron (MLP)
     ● Input nodes: 2; output nodes: 1
     ● Activation function: sigmoid

  11. General MLP
     [Figure: MLP diagram]
     ● $w^1_{ij}$: weight to neuron $i$ in layer 1 from neuron $j$ in layer 0
     ● $W^1$: the matrix of these weights

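  12. General MLP
     $\hat{y} = f_n(W^n f_{n-1}(\cdots f_2(W^2 f_1(W^1 x + b^1) + b^2) \cdots) + b^n)$
     $x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n_0} \end{bmatrix}$, shape: $(n_0, 1)$
     $b^1 = \begin{bmatrix} b^1_0 \\ b^1_1 \\ \vdots \\ b^1_{n_1} \end{bmatrix}$, shape: $(n_1, 1)$
     $W^1 = \begin{bmatrix} w^1_{00} & w^1_{01} & \cdots & w^1_{0 n_0} \\ w^1_{10} & w^1_{11} & \cdots & w^1_{1 n_0} \\ \vdots & \vdots & \ddots & \vdots \\ w^1_{n_1 0} & w^1_{n_1 1} & \cdots & w^1_{n_1 n_0} \end{bmatrix}$, shape: $(n_1, n_0)$
     $n_0$: number of neurons in layer 0 (input)
     $n_1$: number of neurons in layer 1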

  13. Parameters of an MLP
     ● Weights and biases
     ● For each layer $l$: $n_l (n_{l-1} + 1)$ parameters
        ● $n_l n_{l-1}$ weights; $n_l$ biases
     ● With $n$ hidden layers (considering the output as a hidden layer): $\sum_{i=1}^{n} n_i (n_{i-1} + 1)$
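
The parameter count can be checked with a short sketch. The function name `count_parameters` and the example layer sizes are illustrative assumptions; the formula is the one on the slide.

```python
def count_parameters(layer_sizes):
    """Total weights and biases of an MLP.

    `layer_sizes` is [n_0, n_1, ..., n_n]: the input size followed by the
    size of every later layer (the output counts as the last layer).
    Each layer l contributes n_l * (n_{l-1} + 1) parameters.
    """
    return sum(n_l * (n_prev + 1)
               for n_prev, n_l in zip(layer_sizes, layer_sizes[1:]))

# The XOR network: 2 inputs, 2 hidden units, 1 output.
# 2 * (2 + 1) + 1 * (2 + 1) = 9 parameters.
xor_parameter_count = count_parameters([2, 2, 1])
```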

  14. Hyper-parameters of an MLP
     ● Input size, output size
        ● Usually fixed by your problem / dataset
        ● Input: image size, vocab size; number of “raw” features in general
        ● Output: 1 for binary classification or simple regression, number of labels for classification, …
     ● Number of hidden layers
     ● For each hidden layer:
        ● Size
        ● Activation function
     ● Others: initialization, regularization (and associated values), learning rate / training, …

  15. The Deep in Deep Learning
     ● The Universal Approximation Theorem says that one hidden layer suffices for arbitrarily-closely approximating a given function
        ● Empirical drawbacks: super-exponentially many neurons; hard to discover
     ● “Deep and narrow” >> “Shallow and wide”
        ● In principle allows hierarchical features to be learned
        ● Better behaved w/r/t optimization

  16. Activation Functions
     ● Note: non-linear activation functions are essential
     ● MLP: linear transformation, followed by a point-wise non-linearity, repeated several times over
     ● Without the non-linearity, we would just have several linear transformations
     ● Composition of linear transformations is also linear!
     $\hat{y} = f_n(W^n f_{n-1}(\cdots f_2(W^2 f_1(W^1 x + b^1) + b^2) \cdots) + b^n)$
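
To illustrate why the non-linearity matters, here is a small sketch showing that two layers without activation functions collapse into a single layer with $W = W^2 W^1$ and $b = W^2 b^1 + b^2$. The matrices and the input are made-up values chosen only for the demonstration.

```python
def matvec(W, x):
    """Matrix-vector product, with W given as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product of two lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in A]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

# Made-up layers with no activation function: 2 inputs -> 3 units -> 2 outputs.
W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, -0.5, 1.0]
W2, b2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0]], [1.0, 2.0]

# The equivalent single layer.
W = matmul(W2, W1)
b = vec_add(matvec(W2, b1), b2)

x = [3.0, -2.0]
two_layers = vec_add(matvec(W2, vec_add(matvec(W1, x), b1)), b2)
one_layer = vec_add(matvec(W, x), b)
```

Both computations give the same output for every input, so stacking layers adds nothing unless a non-linear function sits between them.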

  17. Activation Functions: Hidden Layer
     ● sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$
     ● tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$
     ● Problem: derivative “saturates” (nearly 0) everywhere except near origin
     ● Use ReLU by default
     ● Generalizations:
        ● Leaky ReLU
        ● ELU
        ● Softplus
        ● …
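
A brief sketch of these activations and their derivatives, useful for checking the identity $\tanh(x) = 2\sigma(2x) - 1$ and for seeing saturation numerically. The derivative formulas are the standard ones and are not stated on the slide; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU (taken to be 0 at x = 0 by convention here)."""
    return 1.0 if x > 0 else 0.0

# Largest deviation from tanh(x) = 2 * sigmoid(2x) - 1 over a few sample points.
identity_error = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
                     for x in (-3.0, -0.5, 0.0, 0.5, 3.0))

# Saturation: far from the origin the sigmoid and tanh derivatives are tiny,
# while the ReLU derivative stays at 1 for any positive input.
grads_at_10 = (sigmoid_grad(10.0), tanh_grad(10.0), relu_grad(10.0))
```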

  18. Activation Functions: Output Layer
     ● Depends on the task!
     ● Regression (continuous output(s)): none!
        ● Just use final linear transformation
     ● Binary classification: sigmoid
        ● Also for multi-label classification
     ● Multi-class classification: softmax
        ● $\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$
        ● Terminology: the inputs to a softmax are called logits
        ● [there are sometimes other uses of the term, so beware]
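
A minimal softmax sketch following the formula above. Subtracting the largest logit before exponentiating is an addition not mentioned on the slide: it leaves the result unchanged mathematically and avoids overflow for large inputs.

```python
import math

def softmax(logits):
    """Map a list of logits to probabilities: exp(x_i) / sum_j exp(x_j)."""
    largest = max(logits)  # shift for numerical stability (assumption, see above)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
```

The outputs are positive, sum to 1, and preserve the ordering of the logits.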

  19. Learning: (Stochastic) Gradient Descent

  20. Gradient Descent: Basic Idea
     ● Treat NN training as an optimization problem
     ● $\ell(\hat{y}, y)$: loss function (“objective function”)
        ● How “close” is the model’s output to the true output
     ● $\mathcal{L}(\hat{Y}, Y) = \frac{1}{|Y|} \sum_i \ell(\hat{y}(x_i), y_i)$
        ● Local loss, averaged over training instances
        ● More later: depends on the particular task, among other things
     ● View the loss as a function of the model’s parameters
     ● The gradient of the loss w/r/t parameters tells which direction in parameter space to “walk” to make the loss smaller (i.e. to improve model outputs)
     ● Guaranteed to work in linear case; can get stuck in local minima for NNs

  21. Gradient Descent: Basic Idea
     [Figure]

  22. Derivatives
     ● The derivative of a function of one real variable measures how much the output changes with respect to a change in the input variable
     $f(x) = x^2 + 35x + 12$, $\frac{df}{dx} = 2x + 35$
     $f(x) = e^x$, $\frac{df}{dx} = e^x$

  23. Partial Derivatives
     ● A partial derivative of a function of several variables measures its derivative with respect to one of those variables, with the others held constant.
     $f(x, y) = 10x^3y^2 + 5xy^3 + 4x + y$
     $\frac{\partial f}{\partial x} = 30x^2y^2 + 5y^3 + 4$
     $\frac{\partial f}{\partial y} = 20x^3y + 15xy^2 + 1$
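
The partial derivatives above can be sanity-checked numerically. This sketch compares them with central finite differences at an arbitrarily chosen point; the helper `central_difference`, the step size `h` and the point (1, 2) are assumptions made for the illustration.

```python
def f(x, y):
    return 10 * x**3 * y**2 + 5 * x * y**3 + 4 * x + y

def df_dx(x, y):
    return 30 * x**2 * y**2 + 5 * y**3 + 4

def df_dy(x, y):
    return 20 * x**3 * y + 15 * x * y**2 + 1

def central_difference(g, x, y, wrt, h=1e-5):
    """Approximate the partial derivative of g at (x, y) with respect to `wrt`.

    Only the chosen variable is perturbed; the other is held constant.
    """
    if wrt == "x":
        return (g(x + h, y) - g(x - h, y)) / (2 * h)
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

# Differences between the numeric and analytic partials at the point (1, 2).
error_x = abs(central_difference(f, 1.0, 2.0, "x") - df_dx(1.0, 2.0))
error_y = abs(central_difference(f, 1.0, 2.0, "y") - df_dy(1.0, 2.0))
```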

  24. Gradient
     ● The gradient of a function $f(x_1, x_2, \ldots, x_n)$ is a vector function, returning all of the partial derivatives
     $\nabla f = \left\langle \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right\rangle$
     $f(x, y) = 4x^2 + y^2$, $\nabla f = \langle 8x, 2y \rangle$
     ● The gradient is perpendicular to the level curve at a point
     ● The gradient points in the direction of greatest rate of increase of $f$
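
The claim that the gradient points in the direction of greatest increase can be probed numerically. This sketch takes small steps of equal length in 360 directions around a point and reports the direction with the largest gain in $f$; the step size, the one-degree sampling and the name `best_direction` are assumptions made for the illustration.

```python
import math

def f(x, y):
    return 4 * x**2 + y**2

def grad_f(x, y):
    return (8 * x, 2 * y)

def best_direction(x, y, step=1e-3, samples=360):
    """Return the angle in whole degrees of the small step that increases f most."""
    def gain(degrees):
        t = math.radians(degrees)
        return f(x + step * math.cos(t), y + step * math.sin(t)) - f(x, y)
    return max(range(samples), key=gain)

# Compare the empirically best direction with the gradient's direction at (1, 1).
gx, gy = grad_f(1.0, 1.0)
gradient_angle = math.degrees(math.atan2(gy, gx))
sampled_angle = best_direction(1.0, 1.0)
```

Up to the one-degree sampling resolution, the two angles agree.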

  25. Gradient and Level Curves
     $f(x, y) = 4x^2 + y^2$, $\nabla f = \langle 8x, 2y \rangle$
     Level curves: $f(x, y) = c$
     Points marked: (0, 5), (1, 1), (1.25, 0)
     Q: what are the actual gradients at those points?

  26. Gradient Descent and Level Curves
     [Figure]

  27. Gradient Descent Algorithm
     ● Initialize $\theta_0$
     ● Repeat until convergence: $\theta_{n+1} = \theta_n - \alpha \nabla \mathcal{L}(\hat{Y}(\theta_n), Y)$
        ● $\alpha$: learning rate
     ● High learning rate: big steps, may bounce and “overshoot” the target
     ● Low learning rate: small steps, smoother minimization of loss, but can be slow
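
A minimal sketch of the update rule, applied to the bowl-shaped function $4x^2 + y^2$ in place of a real network loss. The stopping test (largest parameter change below `tol`), the step cap and the three learning rates are assumptions chosen to show slow, fast and overshooting behaviour.

```python
def gradient_descent(grad, theta0, learning_rate, max_steps=1000, tol=1e-8):
    """Repeat theta <- theta - learning_rate * grad(theta).

    Stops early once no parameter moves by more than `tol` in one step,
    which is one simple (assumed) reading of "until convergence".
    """
    theta = list(theta0)
    for _ in range(max_steps):
        g = grad(theta)
        theta_next = [t - learning_rate * g_i for t, g_i in zip(theta, g)]
        if max(abs(a - b) for a, b in zip(theta_next, theta)) < tol:
            return theta_next
        theta = theta_next
    return theta

def grad_bowl(theta):
    """Gradient of 4x^2 + y^2, whose minimum is at (0, 0)."""
    x, y = theta
    return [8 * x, 2 * y]

slow = gradient_descent(grad_bowl, [1.0, 1.0], learning_rate=0.01)
fast = gradient_descent(grad_bowl, [1.0, 1.0], learning_rate=0.1)
# With this rate each step multiplies x by (1 - 8 * 0.3) = -1.4, so x bounces
# across the minimum with growing magnitude instead of settling.
overshoot = gradient_descent(grad_bowl, [1.0, 1.0], learning_rate=0.3, max_steps=50)
```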

  28. Gradient Descent: Minimal Example
     ● Task: predict a target/true value $y = 2$
     ● “Model”: $\hat{y}(\theta) = \theta$
        ● A single parameter: the actual guess
     ● Loss: squared Euclidean distance
     $\mathcal{L}(\hat{y}(\theta), y) = (\hat{y} - y)^2 = (\theta - y)^2$
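
The minimal example is small enough to run by hand or in a few lines. In this sketch the starting guess $\theta_0 = 0$, the learning rate 0.1 and the number of steps are assumptions; the gradient $2(\theta - y)$ follows from differentiating the loss above.

```python
def loss(theta, y=2.0):
    """Loss of the one-parameter model y_hat(theta) = theta."""
    return (theta - y) ** 2

def loss_grad(theta, y=2.0):
    """Derivative of the loss with respect to theta."""
    return 2.0 * (theta - y)

def run(theta0, learning_rate, steps):
    """Return the list of guesses visited by gradient descent."""
    thetas = [theta0]
    for _ in range(steps):
        thetas.append(thetas[-1] - learning_rate * loss_grad(thetas[-1]))
    return thetas

trajectory = run(theta0=0.0, learning_rate=0.1, steps=50)
losses = [loss(theta) for theta in trajectory]
```

Each step shrinks the distance to the target by a factor of $1 - 2\alpha$, so with $\alpha = 0.1$ the guess moves from 0 to 0.4, then 0.72, and so on towards 2.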

  29. Gradient Descent: Minimal Example

  30. Stochastic Gradient Descent
     ● The above is called “batch” gradient descent
        ● Updates once per pass through the dataset
        ● Expensive, and slow; does not scale well
     ● Stochastic gradient descent:
        ● Break the data into “mini-batches”: small chunks of the data
        ● Compute gradients and update parameters for each batch
        ● Mini-batch of size 1 = single example
        ● A noisy estimate of the true gradient, but works well in practice; more parameter updates
     ● Epoch: one pass through the whole training data

  31. Stochastic Gradient Descent
     initialize parameters / build model
     for each epoch:
         data = shuffle(data)
         batches = make_batches(data)
         for each batch in batches:
             outputs = model(batch)
             loss = loss_fn(outputs, true_outputs)
             compute gradients // e.g. loss.backward()
             update parameters
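
A runnable sketch of this loop for a deliberately tiny model, a line $wx + b$ fit with mean squared error, using only the standard library. The toy data, the hyperparameter values and the hand-written gradients are assumptions; in a framework, the gradient step would typically be done by automatic differentiation (the `loss.backward()` mentioned in the pseudocode).

```python
import random

def make_batches(data, batch_size):
    """Split the data into consecutive chunks of at most batch_size items."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

def sgd_linear_regression(data, epochs=200, batch_size=4,
                          learning_rate=0.05, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0                       # initialize parameters
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                 # new batch composition every epoch
        for batch in make_batches(data, batch_size):
            # Forward pass: prediction error of the model on each example.
            errors = [(w * x + b) - y for x, y in batch]
            # Gradients of the mini-batch mean squared error.
            grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # Parameter update, once per mini-batch.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Made-up, noise-free data drawn from y = 3x + 1.
toy_data = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd_linear_regression(toy_data)
```

With 21 examples and a batch size of 4, each epoch performs six parameter updates, compared with one per epoch for batch gradient descent.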

  32. Computing with Mini-batches
     ● Bad idea:
     for each batch in batches:
         for each datum in batch:
             outputs = model(datum)
             loss = loss_fn(outputs, true_outputs)
             compute gradients // e.g. loss.backward()
             update parameters

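  33. Computing with a Single Input
     $\hat{y} = f_n(W^n f_{n-1}(\cdots f_2(W^2 f_1(W^1 x + b^1) + b^2) \cdots) + b^n)$
     $x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n_0} \end{bmatrix}$, shape: $(n_0, 1)$
     $b^1 = \begin{bmatrix} b^1_0 \\ b^1_1 \\ \vdots \\ b^1_{n_1} \end{bmatrix}$, shape: $(n_1, 1)$
     $W^1 = \begin{bmatrix} w^1_{00} & w^1_{01} & \cdots & w^1_{0 n_0} \\ w^1_{10} & w^1_{11} & \cdots & w^1_{1 n_0} \\ \vdots & \vdots & \ddots & \vdots \\ w^1_{n_1 0} & w^1_{n_1 1} & \cdots & w^1_{n_1 n_0} \end{bmatrix}$, shape: $(n_1, n_0)$
     $n_0$: number of neurons in layer 0 (input)
     $n_1$: number of neurons in layer 1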
