

  1. Training Neural Nets — COMPSCI 371D Machine Learning

  2. Outline
     1 The Softmax Simplex
     2 Loss and Risk
     3 Back-Propagation
     4 Stochastic Gradient Descent
     5 Regularization
     6 Network Depth and Batch Normalization
     7 Experiments with SGD

  3. The Softmax Simplex
     • Neural-net classifier: ŷ = h(x) : R^d → Y
     • The last layer of a neural net used for classification is a softmax layer:
       p = σ(z) = exp(z) / (1^T exp(z))
     • The net is p = f(x, w) : X → P
     • The classifier is ŷ = h(x) = arg max p = arg max f(x, w)
     • P is the set of all nonnegative real-valued vectors p ∈ R^e whose entries add up to 1 (with e = |Y|):
       P = {p ∈ R^e : p ≥ 0 and Σ_{i=1}^e p_i = 1}
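A minimal NumPy sketch of the softmax layer and the resulting classifier; the function name and the example scores are illustrative assumptions, not part of the slides:

```python
import numpy as np

def softmax(z):
    """Map scores z in R^e to a point p = sigma(z) on the simplex P."""
    z = z - np.max(z)              # shift for numerical stability; does not change sigma(z)
    e = np.exp(z)
    return e / e.sum()             # entries are nonnegative and sum to 1

z = np.array([2.0, 0.5, -1.0])     # hypothetical last-layer activations
p = softmax(z)                     # a point in the softmax simplex P
y_hat = int(np.argmax(p))          # classifier output: arg max_c p_c
```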

  4. The Softmax Simplex
     P = {p ∈ R^e : p ≥ 0 and Σ_{i=1}^e p_i = 1}
     [Figure: the simplex P for e = 2 (a segment in the (p_1, p_2) plane) and e = 3 (a triangle with vertices on the p_1, p_2, p_3 axes)]
     • Decision regions are polyhedral: P_c = {p ∈ P : p_c ≥ p_j for j ≠ c} for c = 1, ..., e
     • A network transforms images into points in P

  5. Loss and Risk (Déjà Vu)
     • Ideal loss would be the 0-1 loss on the network output ŷ
     • 0-1 loss is constant where it is differentiable! Not useful for computing a gradient
     • Use the cross-entropy loss on the softmax output p as a proxy loss: ℓ(y, p) = −log p_y
     • Non-differentiability of ReLU or max-pooling is minor (pointwise), and typically ignored
     • Risk, as usual: L_T(w) = (1/N) Σ_{n=1}^N ℓ_n(w), where ℓ_n(w) = ℓ(y_n, f(x_n, w))
     • We need ∇L_T(w) and therefore ∇ℓ_n(w)
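A minimal sketch of the proxy loss and the empirical risk, assuming the softmax outputs p_n = f(x_n, w) have already been computed; the function names are illustrative:

```python
import numpy as np

def cross_entropy(y, p):
    """Proxy loss l(y, p) = -log p_y for true label y and softmax output p."""
    return -np.log(p[y])

def empirical_risk(labels, outputs):
    """L_T(w) = (1/N) sum_n l(y_n, p_n), given the outputs p_n = f(x_n, w)."""
    return np.mean([cross_entropy(y, p) for y, p in zip(labels, outputs)])
```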

  6. Back-Propagation
     [Diagram: x_n = x^(0) → f^(1)(·, w^(1)) → x^(1) → f^(2)(·, w^(2)) → x^(2) → f^(3)(·, w^(3)) → x^(3) = p → ℓ(y_n, ·) → ℓ_n]
     • We need ∇L_T(w) and therefore ∇ℓ_n(w) = ∂ℓ_n/∂w
     • Computations from x to ℓ_n form a chain
     • Apply the chain rule
     • Every derivative of ℓ_n w.r.t. layers before k goes through x^(k):
       ∂ℓ_n/∂w^(k) = (∂ℓ_n/∂x^(k)) (∂x^(k)/∂w^(k))
       ∂ℓ_n/∂x^(k−1) = (∂ℓ_n/∂x^(k)) (∂x^(k)/∂x^(k−1))   (recursion!)
     • Start: ∂ℓ_n/∂x^(K) = ∂ℓ/∂p

  7. Back-Propagation: Local Jacobians
     [Diagram: the same layer chain as on the previous slide]
     • Local computations at layer k: ∂x^(k)/∂w^(k) and ∂x^(k)/∂x^(k−1)
     • Partial derivatives of f^(k) with respect to the layer weights and the input to the layer
     • Local Jacobian matrices; can be computed by knowing what the layer does
     • The start of the process, ∂ℓ_n/∂x^(K) = ∂ℓ/∂p, can be computed from knowing the loss function
     • Another local Jacobian
     • The rest is going recursively from output to input, one layer at a time, accumulating the ∂ℓ_n/∂w^(k) into a vector ∂ℓ_n/∂w

  8. Back-Propagation: The Forward Pass
     [Diagram: the same layer chain as on the previous slides]
     • All local Jacobians, ∂x^(k)/∂w^(k) and ∂x^(k)/∂x^(k−1), are computed numerically for the current values of the weights w^(k) and the layer inputs x^(k−1)
     • Therefore, we need to know x^(k−1) for training sample n and for all k
     • This is achieved by a forward pass through the network: Run the network on input x_n and store x^(0) = x_n, x^(1), ...
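A sketch of the forward pass that stores every intermediate layer output; the layer interface (a `forward(x, w)` method per layer) is an assumption made for illustration:

```python
def forward_pass(x_n, layers, weights):
    """Run the network on x_n and keep x^(0), x^(1), ..., x^(K) for back-propagation."""
    xs = [x_n]                                   # x^(0) = x_n
    for layer, w_k in zip(layers, weights):
        xs.append(layer.forward(xs[-1], w_k))    # x^(k) = f^(k)(x^(k-1), w^(k))
    return xs                                    # xs[-1] is the softmax output p
```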

  9. Back-Propagation Spelled Out for K = 3
     ∂ℓ_n/∂w^(k) = (∂ℓ_n/∂x^(k)) (∂x^(k)/∂w^(k))
     ∂ℓ_n/∂x^(k−1) = (∂ℓ_n/∂x^(k)) (∂x^(k)/∂x^(k−1))
     (after the forward pass)
     ∂ℓ_n/∂x^(3) = ∂ℓ/∂p
     ∂ℓ_n/∂w^(3) = (∂ℓ_n/∂x^(3)) (∂x^(3)/∂w^(3))
     ∂ℓ_n/∂x^(2) = (∂ℓ_n/∂x^(3)) (∂x^(3)/∂x^(2))
     ∂ℓ_n/∂w^(2) = (∂ℓ_n/∂x^(2)) (∂x^(2)/∂w^(2))
     ∂ℓ_n/∂x^(1) = (∂ℓ_n/∂x^(2)) (∂x^(2)/∂x^(1))
     ∂ℓ_n/∂w^(1) = (∂ℓ_n/∂x^(1)) (∂x^(1)/∂w^(1))
     ∂ℓ_n/∂x^(0) = (∂ℓ_n/∂x^(1)) (∂x^(1)/∂x^(0))
     ∂ℓ_n/∂w = [∂ℓ_n/∂w^(1) ; ∂ℓ_n/∂w^(2) ; ∂ℓ_n/∂w^(3)]
     (Jacobians in blue are local, those in red are what we want eventually)
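The same recursion written as a sketch of the backward pass. It assumes, hypothetically, that each layer exposes its local Jacobians as `dx_dw(x_in, w)` and `dx_dx(x_in, w)`, and that `dl_dp` is the row vector ∂ℓ/∂p computed from the loss:

```python
import numpy as np

def backward_pass(xs, layers, weights, dl_dp):
    """Return dl_n/dw, stacking dl_n/dw^(1), ..., dl_n/dw^(K)."""
    g = dl_dp                                        # dl_n / dx^(K) = dl / dp
    grads = [None] * len(layers)
    for k in reversed(range(len(layers))):           # from the last layer back to the first
        x_in, w_k = xs[k], weights[k]                # input to this layer, stored by the forward pass
        grads[k] = g @ layers[k].dx_dw(x_in, w_k)    # dl_n / dw^(k)
        g = g @ layers[k].dx_dx(x_in, w_k)           # dl_n / dx^(k-1)   (recursion)
    return np.concatenate(grads)                     # dl_n / dw
```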

  10. Back-Propagation: Computing Local Jacobians
      ∂x^(k)/∂w^(k) and ∂x^(k)/∂x^(k−1)
      • Easier to make a "layer" as simple as possible
      • z = Vx + b is one layer (fully connected (FC), affine part)
      • z = ρ(x) (ReLU) is another layer
      • Softmax, max-pooling, convolutional, ...
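For example, a ReLU layer in this sense has no weights, and its local Jacobian with respect to the input is a diagonal indicator matrix. A sketch using the same hypothetical layer interface as above:

```python
import numpy as np

class ReLULayer:
    """z = rho(x), applied elementwise; this layer has no weights."""
    def forward(self, x, w=None):
        return np.maximum(x, 0.0)
    def dx_dx(self, x, w=None):
        # Diagonal Jacobian: 1 where x > 0, 0 elsewhere (the kink at 0 is ignored)
        return np.diag((x > 0).astype(float))
    def dx_dw(self, x, w=None):
        return np.zeros((x.size, 0))     # no parameters, so an empty Jacobian
```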

  11. Local Jacobians for a FC Layer z = Vx + b
      • ∂z/∂x = V (easy!)
      • ∂z/∂w: What is ∂z/∂V? Three subscripts: ∂z_i/∂v_jk. A 3D tensor?
      • For a general package, tensors are the way to go
      • Conceptually, it may be easier to vectorize everything:
        V = [v_11 v_12 v_13 ; v_21 v_22 v_23],  b = [b_1 ; b_2]  →  w = [v_11, v_12, v_13, v_21, v_22, v_23, b_1, b_2]^T
      • ∂z/∂w is then a 2 × 8 matrix
      • With e outputs and d inputs, an e × e(d+1) matrix

  12. Jacobian w.r.t. w for a FC Layer
      [z_1 ; z_2] = [w_1 w_2 w_3 ; w_4 w_5 w_6] [x_1 ; x_2 ; x_3] + [w_7 ; w_8]
      • Don't be afraid to spell things out:
        z_1 = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_7
        z_2 = w_4 x_1 + w_5 x_2 + w_6 x_3 + w_8
        ∂z/∂w = [ x_1 x_2 x_3 0 0 0 1 0 ; 0 0 0 x_1 x_2 x_3 0 1 ]
      • Obvious pattern: repeat x^T, staggered, e times
      • Then append the e × e identity at the end
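A sketch that builds this ∂z/∂w matrix directly from the pattern above, using the vectorization of slide 11; the function name is illustrative, not from the slides:

```python
import numpy as np

def fc_jacobian_wrt_w(x, e):
    """dz/dw for z = V x + b, with w = (rows of V, then b)."""
    d = x.size
    blocks = np.kron(np.eye(e), x.reshape(1, d))   # x^T repeated, staggered, e times
    return np.hstack([blocks, np.eye(e)])          # append the e x e identity for b

x = np.array([1.0, 2.0, 3.0])
J = fc_jacobian_wrt_w(x, e=2)   # the 2 x 8 matrix spelled out above
```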

  13. Stochastic Gradient Descent: Training
      • Local gradients are used in back-propagation
      • So we now have ∇L_T(w)
      • ŵ = arg min L_T(w)
      • L_T(w) is (very) non-convex, so we look for local minima
      • w ∈ R^m with m very large: no Hessians
      • Gradient descent
      • Even so, every step calls back-propagation N = |T| times
      • Back-propagation computes m derivatives ∇ℓ_n(w)
      • Computational complexity is Ω(mN) per step
      • Even gradient descent is way too expensive!

  14. Stochastic Gradient Descent: No Line Search
      • Line search is out of the question
      • Fix some step multiplier α, called the learning rate:
        w_{t+1} = w_t − α ∇L_T(w_t)
      • How to pick α? Cross-validation is too expensive
      • Tradeoffs:
        • α too small: slow progress
        • α too big: jump over minima
      • Frequent practice:
        • Start with α relatively large, and monitor L_T(w)
        • When L_T(w) levels off, decrease α
      • Alternative: fixed decay schedule for α
      • Better (recent) option: change α adaptively (Adam, 2015)
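A minimal sketch of fixed-step gradient descent with the "monitor L_T and decrease α when it levels off" practice. The callables `grad_LT` and `LT`, and the plateau thresholds, are assumptions made for illustration, not prescriptions from the slides:

```python
def train_gd(w, LT, grad_LT, alpha=0.1, decay=0.1, tol=1e-4, patience=5, steps=1000):
    """Fixed learning rate, decreased whenever the training risk levels off."""
    history, stall = [], 0
    for t in range(steps):
        w = w - alpha * grad_LT(w)        # w_{t+1} = w_t - alpha * grad L_T(w_t)
        history.append(LT(w))
        leveled_off = len(history) > 1 and history[-2] - history[-1] < tol
        stall = stall + 1 if leveled_off else 0
        if stall >= patience:             # L_T(w_t) has leveled off for a while
            alpha *= decay                # decrease the learning rate
            stall = 0
    return w
```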

  15. Stochastic Gradient Descent: Manual Adjustment of α
      • Start with α relatively large, and monitor L_T(w_t)
      • When L_T(w_t) levels off, decrease α
      • Typical plots of L_T(w_t) versus the iteration index t:
        [Figure: training risk versus iteration t for different learning rates]

  16. Stochastic Gradient Descent: Batch Gradient Descent
      • ∇L_T(w) = (1/N) Σ_{n=1}^N ∇ℓ_n(w)
      • Taking a macro-step −α ∇L_T(w_t) is the same as taking the N micro-steps −(α/N) ∇ℓ_1(w_t), ..., −(α/N) ∇ℓ_N(w_t)
      • First compute all the N steps at w_t, then take all the steps
      • Thus, standard gradient descent is a batch method: compute the gradient at w_t using the entire batch of data, then move
      • Even with no line search, computing N micro-steps is still expensive
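A sketch of one batch step written as N micro-steps, all computed at the same w_t; `grad_l(n, w)` is a hypothetical callable returning ∇ℓ_n(w):

```python
import numpy as np

def batch_step(w, grad_l, N, alpha):
    """One macro-step of batch gradient descent, as the sum of N micro-steps."""
    micro_steps = [-(alpha / N) * grad_l(n, w) for n in range(N)]  # all evaluated at w_t
    return w + np.sum(micro_steps, axis=0)       # equals w - alpha * grad L_T(w_t)
```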

  17. Stochastic Gradient Descent: Stochastic Descent
      • Taking a macro-step −α ∇L_T(w_t) is the same as taking the N micro-steps −(α/N) ∇ℓ_1(w_t), ..., −(α/N) ∇ℓ_N(w_t)
      • First compute all the N steps at w_t, then take all the steps
      • Can we use this effort more effectively?
      • Key observation: −∇ℓ_n(w) is a poor estimate of −∇L_T(w), but an estimate all the same: micro-steps are correct on average!
      • After each micro-step, we are on average in a better place
      • How about computing a new micro-gradient after every micro-step?
      • Now each micro-step gradient is evaluated at a point that is on average better (lower risk) than in the batch method
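In contrast with `batch_step` above, a sketch of one stochastic pass recomputes a micro-gradient after every micro-step; the random visiting order is an added assumption, since the slide does not specify one:

```python
import numpy as np

def sgd_epoch(w, grad_l, N, alpha, rng=None):
    """One pass of stochastic gradient descent: update after each sample."""
    rng = rng or np.random.default_rng()
    for n in rng.permutation(N):
        w = w - alpha * grad_l(n, w)   # micro-gradient evaluated at the current, updated w
    return w
```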
