SLIDE 1

Training Neural Nets

COMPSCI 371D — Machine Learning


SLIDE 2

Outline

1. The Softmax Simplex
2. Loss and Risk
3. Back-Propagation
4. Stochastic Gradient Descent
5. Regularization
6. Network Depth and Batch Normalization
7. Experiments with SGD


SLIDE 3

The Softmax Simplex

  • Neural-net classifier: $\hat{y} = h(x) : \mathbb{R}^d \to Y$
  • The last layer of a neural net used for classification is a soft-max layer: $p = \sigma(z) = \frac{\exp(z)}{\mathbf{1}^T \exp(z)}$
  • The net is $p = f(x, w) : X \to P$
  • The classifier is $\hat{y} = h(x) = \arg\max p = \arg\max f(x, w)$
  • $P$ is the set of all nonnegative real-valued vectors $p \in \mathbb{R}^e$ whose entries add up to 1 (with $e = |Y|$): $P \stackrel{\text{def}}{=} \{p \in \mathbb{R}^e : p \geq 0 \text{ and } \sum_{i=1}^e p_i = 1\}$.
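To make the map concrete, here is a small NumPy sketch (added, not from the slides) of the softmax layer and the resulting classifier. Subtracting max(z) before exponentiating is an added numerical-stability convention, not something the slide requires.

```python
import numpy as np

def softmax(z):
    # Subtract max(z) for numerical stability; sigma(z) is unchanged because
    # softmax is invariant to adding a constant to all entries of z.
    e = np.exp(z - np.max(z))
    return e / e.sum()          # p = exp(z) / (1^T exp(z)), a point in P

z = np.array([2.0, 1.0, -1.0])  # scores from the last linear layer
p = softmax(z)
print(p, p.sum())               # entries are nonnegative and sum to 1
y_hat = np.argmax(p)            # classifier h(x) = arg max p
```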


SLIDE 4

The Softmax Simplex

$P \stackrel{\text{def}}{=} \{p \in \mathbb{R}^e : p \geq 0 \text{ and } \sum_{i=1}^e p_i = 1\}$

[Figure: the simplex $P$ for $e = 2$ (segment between the unit points, midpoint $(1/2, 1/2)$) and $e = 3$ (triangle with vertices at the unit points, center $(1/3, 1/3, 1/3)$), with axes $p_1, p_2, p_3$]

  • Decision regions are polyhedral: $P_c = \{p \in P : p_c \geq p_j \text{ for } j \neq c\}$ for $c = 1, \ldots, e$
  • A network transforms images into points in $P$


SLIDE 5

Loss and Risk

Loss and Risk (Déjà Vu)

  • The ideal loss would be the 0-1 loss on the network output $\hat{y}$
  • The 0-1 loss is constant wherever it is differentiable!
  • Not useful for computing a gradient
  • Use the cross-entropy loss on the softmax output $p$ as a proxy loss: $\ell(y, p) = -\log p_y$

  • Non-differentiability of ReLU or max-pooling is minor (pointwise), and typically ignored
  • Risk, as usual: $L_T(w) = \frac{1}{N} \sum_{n=1}^N \ell_n(w)$ where $\ell_n(w) = \ell(y_n, f(x_n, w))$
  • We need $\nabla L_T(w)$ and therefore $\nabla \ell_n(w)$
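A matching sketch (added) of the cross-entropy loss and the empirical risk; the softmax outputs ps and labels ys below are hypothetical stand-ins for $f(x_n, w)$ and $y_n$.

```python
import numpy as np

def cross_entropy(y, p):
    # loss(y, p) = -log p_y: small when the net puts mass on the true class
    return -np.log(p[y])

def risk(ys, ps):
    # L_T(w) = (1/N) * sum_n loss(y_n, f(x_n, w))
    return np.mean([cross_entropy(y, p) for y, p in zip(ys, ps)])

ps = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]  # softmax outputs
ys = [0, 1]                                                   # true labels
print(risk(ys, ps))  # average cross-entropy over this (tiny) training set
```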


SLIDE 6

Back-Propagation

[Figure: the computation chain $x^{(0)} = x_n \to f^{(1)} \to x^{(1)} \to f^{(2)} \to x^{(2)} \to f^{(3)} \to x^{(3)} = p$, where layer $f^{(k)}$ has weights $w^{(k)}$, and the loss $\ell_n$ is computed from $p$ and $y_n$]

  • We need $\nabla L_T(w)$ and therefore $\nabla \ell_n(w) = \frac{\partial \ell_n}{\partial w}$
  • Computations from $x$ to $\ell_n$ form a chain
  • Apply the chain rule
  • Every derivative of $\ell_n$ w.r.t. layers before $k$ goes through $x^{(k)}$:
    $\frac{\partial \ell_n}{\partial w^{(k)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial w^{(k)}} \qquad \frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$ (recursion!)
  • Start: $\frac{\partial \ell_n}{\partial x^{(K)}} = \frac{\partial \ell}{\partial p}$


SLIDE 7

Back-Propagation

Local Jacobians

[Figure: the same three-layer chain as on the previous slide]

  • Local computations at layer $k$: $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$
  • Partial derivatives of $f^{(k)}$ with respect to the layer weights and the input to the layer
  • Local Jacobian matrices; we can compute them by knowing what the layer does
  • The start of the process can be computed from knowing the loss function: $\frac{\partial \ell_n}{\partial x^{(K)}} = \frac{\partial \ell}{\partial p}$
  • Another local Jacobian
  • The rest is going recursively from output to input, one layer at a time, accumulating $\frac{\partial \ell_n}{\partial w^{(k)}}$ into a vector $\frac{\partial \ell_n}{\partial w}$


SLIDE 8

Back-Propagation

The Forward Pass

[Figure: the same three-layer chain as on the previous slides]

  • All local Jacobians, $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$, are computed numerically for the current values of the weights $w^{(k)}$ and layer inputs $x^{(k-1)}$
  • Therefore, we need to know $x^{(k-1)}$ for training sample $n$ and for all $k$
  • This is achieved by a forward pass through the network: run the network on input $x_n$ and store $x^{(0)} = x_n, x^{(1)}, \ldots$


SLIDE 9

Back-Propagation

Back-Propagation Spelled Out for K = 3

[Figure: the same three-layer chain as on the previous slides]

$\frac{\partial \ell_n}{\partial w^{(k)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial w^{(k)}} \qquad \frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$ (after the forward pass)

$\frac{\partial \ell_n}{\partial x^{(3)}} = \frac{\partial \ell}{\partial p}$
$\frac{\partial \ell_n}{\partial w^{(3)}} = \frac{\partial \ell_n}{\partial x^{(3)}} \frac{\partial x^{(3)}}{\partial w^{(3)}} \qquad \frac{\partial \ell_n}{\partial x^{(2)}} = \frac{\partial \ell_n}{\partial x^{(3)}} \frac{\partial x^{(3)}}{\partial x^{(2)}}$
$\frac{\partial \ell_n}{\partial w^{(2)}} = \frac{\partial \ell_n}{\partial x^{(2)}} \frac{\partial x^{(2)}}{\partial w^{(2)}} \qquad \frac{\partial \ell_n}{\partial x^{(1)}} = \frac{\partial \ell_n}{\partial x^{(2)}} \frac{\partial x^{(2)}}{\partial x^{(1)}}$
$\frac{\partial \ell_n}{\partial w^{(1)}} = \frac{\partial \ell_n}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial w^{(1)}}$

  • $\frac{\partial \ell_n}{\partial x^{(0)}} = \frac{\partial \ell_n}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial x^{(0)}}$
  • $\frac{\partial \ell_n}{\partial w} = \left[ \frac{\partial \ell_n}{\partial w^{(1)}}, \frac{\partial \ell_n}{\partial w^{(2)}}, \frac{\partial \ell_n}{\partial w^{(3)}} \right]$

(Jacobians in blue are local, those in red are what we want eventually)
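Here is a minimal numeric sketch (added, not from the slides) of this recursion on a hypothetical two-layer chain; the layer choices (one linear map, one ReLU, loss = sum of outputs) are illustrative stand-ins, and each backward line mirrors one line of the recursion above.

```python
import numpy as np

# Hypothetical chain: x1 = V x0, x2 = relu(x1), loss = sum(x2).
V = np.array([[1.0, 2.0], [0.5, -1.0]])
x0 = np.array([1.0, 3.0])

# Forward pass: store all activations.
x1 = V @ x0
x2 = np.maximum(x1, 0.0)
loss = x2.sum()

# Backward pass: start the recursion with dloss/dx(2).
g = np.ones_like(x2)          # dloss/dx(2): d(sum)/dx2 = 1 for each entry
g = g * (x1 > 0)              # dloss/dx(1) = dloss/dx(2) times ReLU Jacobian
dloss_dV = np.outer(g, x0)    # dloss/dw(1) = dloss/dx(1) dx(1)/dV
dloss_dx0 = V.T @ g           # dloss/dx(0) = dloss/dx(1) dx(1)/dx(0)
```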


SLIDE 10

Back-Propagation

Computing Local Jacobians

$\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$

  • Easier to make a “layer” as simple as possible
  • $z = Vx + b$ is one layer (Fully Connected (FC), affine part)
  • $z = \rho(x)$ (ReLU) is another layer
  • Softmax, max-pooling, convolutional, ...


SLIDE 11

Back-Propagation

Local Jacobians for a FC Layer

$z = Vx + b$

  • $\frac{\partial z}{\partial x} = V$ (easy!)
  • $\frac{\partial z}{\partial w}$: What is $\frac{\partial z}{\partial V}$? Three subscripts: $\frac{\partial z_i}{\partial v_{jk}}$. A 3D tensor?
  • For a general package, tensors are the way to go
  • Conceptually, it may be easier to vectorize everything:
    $V = \begin{bmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \end{bmatrix}$, $b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$, $w = [v_{11}, v_{12}, v_{13}, v_{21}, v_{22}, v_{23}, b_1, b_2]^T$
  • $\frac{\partial z}{\partial w}$ is a $2 \times 8$ matrix
  • With $e$ outputs and $d$ inputs, an $e \times e(d + 1)$ matrix


SLIDE 12

Back-Propagation

The Jacobian $\frac{\partial z}{\partial w}$ for a FC Layer

$\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} w_7 \\ w_8 \end{bmatrix}$

  • Don’t be afraid to spell things out:
    $z_1 = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_7$
    $z_2 = w_4 x_1 + w_5 x_2 + w_6 x_3 + w_8$
  • $\frac{\partial z}{\partial w} = \begin{bmatrix} \frac{\partial z_1}{\partial w_1} & \cdots & \frac{\partial z_1}{\partial w_8} \\ \frac{\partial z_2}{\partial w_1} & \cdots & \frac{\partial z_2}{\partial w_8} \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & x_3 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & x_3 & 0 & 1 \end{bmatrix}$
  • Obvious pattern: repeat $x^T$, staggered, $e$ times
  • Then append the $e \times e$ identity at the end
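A quick NumPy check (added) that the staggered pattern really is the Jacobian: the staggered repetition of $x^T$ is $\mathrm{kron}(I_e, x^T)$, with the identity appended, and a finite-difference Jacobian agrees. The helper z_of_w and the random test weights are illustrative assumptions.

```python
import numpy as np

e, d = 2, 3                        # e outputs, d inputs of the FC layer
x = np.array([1.0, 2.0, 3.0])      # layer input

# Repeat x^T, staggered, e times; then append the e x e identity.
J = np.hstack([np.kron(np.eye(e), x), np.eye(e)])   # shape e x e(d+1) = 2 x 8

# Finite-difference check against z = V x + b, w = [v11,...,v23, b1, b2]^T.
def z_of_w(w):
    V, b = w[:e * d].reshape(e, d), w[e * d:]
    return V @ x + b

w0, eps = np.random.randn(e * (d + 1)), 1e-6
J_num = np.column_stack([(z_of_w(w0 + eps * np.eye(8)[j]) - z_of_w(w0)) / eps
                         for j in range(8)])
assert np.allclose(J, J_num, atol=1e-4)
```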


SLIDE 13

Stochastic Gradient Descent

Training

  • Local gradients are used in back-propagation
  • So we now have $\nabla L_T(w)$
  • $\hat{w} = \arg\min L_T(w)$
  • $L_T(w)$ is (very) non-convex, so we look for local minima
  • $w \in \mathbb{R}^m$ with $m$ very large: no Hessians
  • Gradient descent
  • Even so, every step calls back-propagation $N = |T|$ times
  • Back-propagation computes $m$ derivatives $\nabla \ell_n(w)$
  • Computational complexity is $\Omega(mN)$ per step
  • Even gradient descent is way too expensive!


SLIDE 14

Stochastic Gradient Descent

No Line Search

  • Line search is out of the question
  • Fix some step multiplier $\alpha$, called the learning rate:
    $w_{t+1} = w_t - \alpha \nabla L_T(w_t)$
  • How to pick $\alpha$? Cross-validation is too expensive
  • Tradeoffs:
    • $\alpha$ too small: slow progress
    • $\alpha$ too big: jump over minima
  • Frequent practice:
    • Start with $\alpha$ relatively large, and monitor $L_T(w)$
    • When $L_T(w)$ levels off, decrease $\alpha$
  • Alternative: fixed decay schedule for $\alpha$
  • Better (recent) option: change $\alpha$ adaptively (Adam, 2015)


SLIDE 15

Stochastic Gradient Descent

Manual Adjustment of α

  • Start with $\alpha$ relatively large, and monitor $L_T(w_t)$
  • When $L_T(w_t)$ levels off, decrease $\alpha$
  • Typical plots of $L_T(w_t)$ versus iteration index $t$:

[Figure: training risk $L_T(w_t)$ versus iteration index $t$]


SLIDE 16

Stochastic Gradient Descent

Batch Gradient Descent

  • $\nabla L_T(w) = \frac{1}{N} \sum_{n=1}^N \nabla \ell_n(w)$
  • Taking a macro-step $-\alpha \nabla L_T(w_t)$ is the same as taking the $N$ micro-steps $-\frac{\alpha}{N} \nabla \ell_1(w_t), \ldots, -\frac{\alpha}{N} \nabla \ell_N(w_t)$
  • First compute all the $N$ steps at $w_t$, then take all the steps
  • Thus, standard gradient descent is a batch method: compute the gradient at $w_t$ using the entire batch of data, then move
  • Even with no line search, computing $N$ micro-steps is still expensive


SLIDE 17

Stochastic Gradient Descent

Stochastic Descent

  • Taking a macro-step $-\alpha \nabla L_T(w_t)$ is the same as taking the $N$ micro-steps $-\frac{\alpha}{N} \nabla \ell_1(w_t), \ldots, -\frac{\alpha}{N} \nabla \ell_N(w_t)$
  • First compute all the $N$ steps at $w_t$, then take all the steps
  • Can we use this effort more effectively?
  • Key observation: $-\nabla \ell_n(w)$ is a poor estimate of $-\nabla L_T(w)$, but an estimate all the same: micro-steps are correct on average!
  • After each micro-step, we are on average in a better place
  • How about computing a new micro-gradient after every micro-step?
  • Now each micro-step gradient is evaluated at a point that is on average better (lower risk) than in the batch method


SLIDE 18

Stochastic Gradient Descent

Batch versus Stochastic Gradient Descent

  • $s_n(w) = -\frac{\alpha}{N} \nabla \ell_n(w)$
  • Batch:
    • Compute $s_1(w_t), \ldots, s_N(w_t)$
    • Move by $s_1(w_t)$, then $s_2(w_t)$, ... then $s_N(w_t)$ (or equivalently move once by $s_1(w_t) + \ldots + s_N(w_t)$)
  • Stochastic (SGD):
    • Compute $s_1(w_t)$, then move by $s_1(w_t)$ from $w_t$ to $w_t^{(1)}$
    • Compute $s_2(w_t^{(1)})$, then move by $s_2(w_t^{(1)})$ from $w_t^{(1)}$ to $w_t^{(2)}$
    • ...
    • Compute $s_N(w_t^{(N-1)})$, then move by $s_N(w_t^{(N-1)})$ from $w_t^{(N-1)}$ to $w_t^{(N)} = w_{t+1}$
  • In SGD, each micro-step is taken from a better (lower risk) place on average


SLIDE 19

Stochastic Gradient Descent

Why “Stochastic?”

  • Progress occurs only on average
  • Many micro-steps are bad, but they are good on average
  • Progress is a random walk

[Figure: random-walk illustration; image from https://towardsdatascience.com/]

SLIDE 20

Stochastic Gradient Descent

Reducing Variance: Mini-Batches

  • Each single data sample is a poor stand-in for all of $T$: high-variance micro-steps
  • Each micro-step takes full advantage of the estimate by moving right away: low-bias micro-steps
  • High variance may hurt more than low bias helps
  • Can we lower variance at the expense of bias?
  • Average $B$ samples at a time: take mini-steps
  • With bigger $B$:
    • Higher bias
    • Lower variance
  • The $B$ samples are a mini-batch


SLIDE 21

Stochastic Gradient Descent

Mini-Batches

  • Scramble $T$ at random
  • Divide $T$ into $J$ mini-batches $T_j$ of size $B$
  • $w^{(0)} = w$
  • For $j = 1, \ldots, J$:
    • Batch gradient: $g_j = \nabla L_{T_j}(w^{(j-1)}) = \frac{1}{B} \sum_{n=(j-1)B+1}^{jB} \nabla \ell_n(w^{(j-1)})$
    • Move: $w^{(j)} = w^{(j-1)} - \alpha g_j$
  • This for loop amounts to one macro-step
  • Each execution of the entire loop uses the training data once
  • Each execution of the entire loop is an epoch
  • Repeat over several epochs until a stopping criterion is met (a code sketch follows below)
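A minimal sketch (added) of one epoch of mini-batch SGD as described above; grad_loss(w, x, y), the data arrays X and Y, and the hyperparameter values are hypothetical placeholders.

```python
import numpy as np

def sgd_epoch(w, X, Y, grad_loss, alpha=0.1, B=32, rng=None):
    """One epoch: scramble T, split into mini-batches, one step per batch."""
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(len(X))              # scramble T at random
    for j in range(0, len(X), B):               # J = ceil(N / B) mini-batches
        idx = perm[j:j + B]
        # g_j: average of per-sample gradients over mini-batch T_j
        g = np.mean([grad_loss(w, X[n], Y[n]) for n in idx], axis=0)
        w = w - alpha * g                       # w(j) = w(j-1) - alpha * g_j
    return w
```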


SLIDE 22

Stochastic Gradient Descent

Momentum

  • Sometimes $w^{(j)}$ meanders around in shallow valleys

[Figure: risk versus iteration (0 to about 1200), risk between roughly 0.1 and 0.9; no $\alpha$ adjustment here]

  • $\alpha$ is too small, the direction is still promising
  • Add momentum (a sketch follows below):
    $v^{(0)} = 0$
    $v^{(j+1)} = \mu^{(j)} v^{(j)} - \alpha \nabla L_T(w^{(j)}) \qquad (0 \leq \mu^{(j)} < 1)$
    $w^{(j+1)} = w^{(j)} + v^{(j+1)}$
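In code, the momentum recursion above looks like this (added sketch); the constant $\mu$ and the grad_risk callable are hypothetical choices.

```python
import numpy as np

def momentum_steps(w, grad_risk, alpha=0.01, mu=0.9, steps=100):
    # v accumulates a decaying sum of past gradients, so steps keep pointing
    # in a promising direction even when each single gradient is small.
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - alpha * grad_risk(w)   # v(j+1) = mu v(j) - alpha grad
        w = w + v                           # w(j+1) = w(j) + v(j+1)
    return w
```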


SLIDE 23

Regularization

  • The capacity of deep networks is very high: it is often possible to achieve near-zero training loss
  • “Memorize the training set”
  • Overfitting
  • All training methods use some type of regularization
  • Regularization can be seen as inductive bias: bias the training algorithm to find weights with certain properties
  • Simplest method: weight decay; add a term $\lambda \|w\|^2$ to the loss function to keep the weights small (Tikhonov); see the sketch after this list
  • Many proposals have been made
  • Not yet clear which method works best; a few proposals follow
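A one-line sketch (added) of how weight decay changes the gradient step; grad_risk, alpha, and lam are hypothetical placeholders.

```python
def decayed_step(w, grad_risk, alpha=0.1, lam=1e-4):
    # Adding lambda * ||w||^2 to the loss adds 2 * lambda * w to its gradient,
    # so each step also shrinks the weights toward zero ("weight decay").
    return w - alpha * (grad_risk(w) + 2 * lam * w)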


SLIDE 24

Regularization

Early Termination

  • Terminating training well before $L_T$ is minimized is somewhat similar to “implicit” weight decay
  • Progress at each iteration is limited, so stopping early keeps us close to $w_0$, which is a set of small random weights
  • Therefore, the norm of $w_t$ is restrained, albeit in terms of how long the learner takes to get there rather than in absolute terms
  • A more informed approach to early termination stops when a validation risk (or, even better, error rate) stops declining; see the sketch after this list
  • This is arguably the most widely used regularization method
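A sketch (added) of validation-based early termination; train_epoch and val_error are hypothetical helpers, and patience is an assumed tolerance on how many epochs without improvement to allow.

```python
def train_with_early_stopping(w, train_epoch, val_error, patience=5):
    # Stop when the validation error has not improved for `patience` epochs.
    best_w, best_err, bad_epochs = w, float("inf"), 0
    while bad_epochs < patience:
        w = train_epoch(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, bad_epochs = w, err, 0
        else:
            bad_epochs += 1
    return best_w        # weights at the best validation error seen
```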


SLIDE 25

Regularization

Dropout

  • Dropout is inspired by ensemble methods (random forests): regularize by averaging multiple predictors
  • Key difficulty: it is too expensive to train an ensemble of deep neural networks
  • Efficient (crude!) approximation:
    • Before processing a new mini-batch, flip a coin with $P[\text{heads}] = p$ (typically $p = 1/2$) for each neuron
    • Turn off the neurons for which the coin comes up tails
    • Restore all neurons at the end of the mini-batch
    • When training is done, multiply all weights by $p$
  • This is very loosely akin to training a different network for every mini-batch
  • Multiplication by $p$ takes the “average” of all networks
  • There are flaws in the reasoning, but the method works (a sketch follows below)
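A sketch (added) of the coin-flipping step for one mini-batch; x stands for a layer's activations and p for the keep probability above, and the function name is a hypothetical stand-in.

```python
import numpy as np

def dropout_forward(x, p=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # One coin per neuron: heads (probability p) keeps the neuron on.
    mask = rng.random(x.shape) < p
    return x * mask   # tails: neuron turned off for this mini-batch

# After training, multiply the weights by p so that inference sees the
# "average" of all the randomly thinned networks.
```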



SLIDE 27

Regularization

Data Augmentation

  • Data augmentation is not a regularization method, but it combats overfitting
  • Make new training data out of thin air
  • Given data sample $(x, y)$, create perturbed copies $x_1, \ldots, x_k$ of $x$ (these have the same label!)
  • Add samples $(x_1, y), \ldots, (x_k, y)$ to the training set $T$
  • With images this is easy: the $x_i$ are cropped, rotated, stretched, re-colored, ... versions of $x$
  • One training sample generates $k$ new ones
  • $T$ grows by a factor of $k + 1$
  • Very effective, used almost universally
  • Need to use realistic perturbations (a sketch follows below)
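A sketch (added) of simple image perturbations in plain NumPy; the choice of flips and small shifts is illustrative, not the slides' recipe.

```python
import numpy as np

def augment(x, k=3, rng=None):
    """Make k perturbed copies of image x (an H x W array); labels unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    copies = []
    for _ in range(k):
        xi = x.copy()
        if rng.random() < 0.5:
            xi = xi[:, ::-1]                      # horizontal flip
        dy, dx = rng.integers(-2, 3, size=2)      # small random shift
        xi = np.roll(xi, (dy, dx), axis=(0, 1))
        copies.append(xi)
    return copies   # add (x_i, y) for each copy to the training set T
```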


SLIDE 28

Network Depth and Batch Normalization

Current Trend: Go Deeper

  • If the output of the last layer comes from a ReLU, it is nonnegative
  • Therefore, an additional layer, even with ReLU, can implement the identity by setting $V = I$ and $b = 0$
  • Therefore, more layers give more capacity (expressive power)
  • So, why not go deeper?
  • Two problems with greater capacity:
    • Overfitting
    • Vanishing or exploding gradients
  • Overfitting can be controlled by regularization


SLIDE 29

Network Depth and Batch Normalization

Vanishing or Exploding Gradients

[Figure: the three-layer chain from the back-propagation slides]

  • The recursion $\frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$ yields
    $\frac{\partial \ell_n}{\partial x^{(i)}} = \frac{\partial \ell_n}{\partial x^{(K)}} \frac{\partial x^{(K)}}{\partial x^{(K-1)}} \cdots \frac{\partial x^{(i+1)}}{\partial x^{(i)}} = \frac{\partial \ell_n}{\partial x^{(K)}} J_K \cdots J_{i+1}$
  • The feedback signal (gradient) from the loss $\ell_n$ to layer $i$, and therefore also $\frac{\partial \ell_n}{\partial w^{(i)}} = \frac{\partial \ell_n}{\partial x^{(i)}} \frac{\partial x^{(i)}}{\partial w^{(i)}}$, depends on the product $J^{(i)} = J_K \cdots J_{i+1}$ of layer Jacobians
  • $\det(J^{(i)}) = \det(J_K) \cdots \det(J_{i+1})$ determines (pun intended) the magnitude of the gradient
  • Vanishing gradients choke information flow: no progress in early layers
  • Exploding gradients cause instability during training
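A quick numerical illustration (added): multiplying many random layer Jacobians shrinks or blows up the backward signal depending on their scale. The dimensions, depth, and scales are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.ones(10)                          # stand-in for dloss/dx(K)
for scale in (0.5, 1.5):
    h, K = g.copy(), 50                  # a 50-layer chain of Jacobians
    for _ in range(K):
        # Random Jacobian with entries of standard deviation scale/sqrt(10),
        # so each multiplication changes the norm by about `scale` on average.
        J = scale * rng.standard_normal((10, 10)) / np.sqrt(10)
        h = h @ J                        # one step of the backward recursion
    print(scale, np.linalg.norm(h))      # ~0 (vanishing) or huge (exploding)
```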


SLIDE 30

Network Depth and Batch Normalization

Batch Normalization

  • Ideally, we would like the norms of all activations $x^{(0)}, \ldots, x^{(K)}$ to be equal ($\det(J_i) \approx 1$)
  • Suppose that we could interpose a layer $\beta_k$ between layers $k$ and $k+1$ that subtracts the mean of all possible outputs $x^{(k)}$ from layer $k$ and divides by their standard deviation:
    $\hat{x}_k^{(c)} = \frac{x_k^{(c)} - \mu_k^{(c)}}{\sigma_k^{(c)}}$ for component $c$ of $x_k$
  • Then, layer $k$ together with $\beta_k$ has normalized outputs
  • If we do this for all layers, all layers transform normalized inputs to normalized outputs
  • Problem 1: we don’t know “all possible outputs $x^{(k)}$ from layer $k$” because the network changes during training
  • Problem 2: we limit the expressive power of the network


SLIDE 31

Network Depth and Batch Normalization

Batch Normalization

  • Normalize each activation by an estimate of its mean and standard deviation
  • During training, compute the estimate over each mini-batch
  • During inference, use the mean estimate over all mini-batches
  • Let $x$ be a scalar activation just before a non-linearity
  • Let $\mu, \sigma$ be the sample mean and standard deviation of $x$ over the current mini-batch
  • Pass $x$ through a Batch Normalization (BN) module that
    • Normalizes each component of $x$: $\hat{x} = \frac{x - \mu}{\sigma}$
    • Computes $z = \gamma \hat{x} + \beta$
  • The learnable parameters $\gamma$ and $\beta$ restore the layer’s expressive power (a sketch follows below)
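A minimal sketch (added) of the BN computation during training, for one unit's activations over a mini-batch; eps, a small constant guarding against division by zero, is a standard added assumption not mentioned on the slide.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: activations of one unit over a mini-batch (1-D array)."""
    mu, sigma = x.mean(), x.std()
    x_hat = (x - mu) / (sigma + eps)   # normalize over the mini-batch
    return gamma * x_hat + beta        # learnable de-normalization

x = np.array([1.0, 2.0, 4.0, 5.0])
z = batch_norm_train(x, gamma=1.0, beta=0.0)
print(z.mean(), z.std())               # ~0 and ~1 with gamma=1, beta=0
```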


SLIDE 32

Network Depth and Batch Normalization

Normalization and De-Normalization

  • Wait, what? What is the point of normalizing $x$ to $\hat{x} = \frac{x - \mu}{\sigma}$ and then letting the network undo the normalization by $z = \gamma \hat{x} + \beta$?
  • Why we must do this: if we don’t, we restrict the expressive power of the layer
  • Why we can do this: the de-normalization is local
  • If, say, $\gamma = 2$ in layer $k$, then mini-batch inputs to layer $k+1$ are twice as large, and will be normalized again by a $\sigma$ that is also twice as big
  • BN in layer $k$ accounts for all the $\gamma$s in previous layers
  • The $\gamma$s in different layers do not multiply


SLIDE 33

Network Depth and Batch Normalization

Example

  • Only look at standard deviations for simplicity (similar considerations hold for means)
  • Start with $\gamma_1 = 1$ in layer 1
  • Outputs of layer 2 have standard deviation $\sigma_2$ before BN
  • Now change $\gamma_1$ to $\gamma_1' = 2$
  • Outputs from layer 2 now have $\sigma_2' = 2\sigma_2$ before BN
  • They are twice as big, but BN divides them by a standard deviation that is twice as big as well
  • The $\mu, \sigma$ statistics of the outputs from layer 2 are unchanged after BN
  • Key point: $\gamma_1, \beta_1$ affect $\mu_2, \sigma_2$


SLIDE 34

Network Depth and Batch Normalization

Going Deeper

  • With batch normalization, gradients are tame
  • Need to compute BN Jacobians for back-propagation
  • Need to store estimates of $\mu, \sigma$ for inference
  • Everything else remains the same
  • Network depth is no longer a problem for training
  • Regularization reduces overfitting for deep networks
  • Networks with BN often have tens or hundreds of layers
  • A network with 1000 layers was shown to be trainable (Deep Residual Learning for Image Recognition, He et al., arXiv, 2015)
  • Of course, regularization and data augmentation are now even more crucial


SLIDE 35

Experiments with SGD

  • SGD is difficult to analyze because the risk depends on many samples and in many dimensions: $L_T(w) = \frac{1}{N} \sum_{n=1}^N \ell_n(w)$
  • Xing et al., A Walk with SGD, arXiv, 2018, gives some insights (optional reading)
  • Main (empirical) results:
    • SGD spends quite a bit of time bouncing between the walls of descending canyons with debris at the bottom
    • A greater learning rate helps stay away from the bottom
    • Stochasticity from mini-batches helps move along, rather than oscillating in place


SLIDE 36

Experiments with SGD

Is the Risk Landscape a Wadi?

  • Full gradient, no momentum: $w_{t+1} = w_t - \alpha \nabla L_T(w_t)$
  • Plotted: ten points of $L_T(w)$ between $w_t$ and $w_{t+1}$
  • Parameter Distance is $\|w_t - w_0\|$


SLIDE 37

Experiments with SGD

Reducing α Brings Us to the Bottom

  • Height over floor: $h_t = \frac{L_T(w_t) + L_T(w_{t+1})}{2} - \min_{w \in [w_t, w_{t+1}]} L_T(w)$

[Figure: $L_T$ along the segment from $w_t$ to $w_{t+1}$, with $h_t$ the height of the average of the endpoint risks over the lowest point between them]

  • Typical averages (see paper for variances):

      α      Epoch 1   Epoch 10   Epoch 25
      0.1    0.0625    0.0199     0.0104
      0.05   0.0102    0.0050     0.0035

  • Smaller $\alpha$ ⇒ more likely to get stuck in the debris


SLIDE 38

Experiments with SGD

Stochasticity Reduces In-Place Oscillations

[Figure: two panels of SGD trajectories, “Minibatch Size N = 10,000” versus “Minibatch Size 100”]

  • CIFAR-10, $\alpha = 0.1$
  • Many more variants in the paper, similar results


SLIDE 39

Experiments with SGD

Flatter Minima Seem to Generalize Better

  • The spectral norm of a matrix $A$ is the square root of the largest eigenvalue of $A^T A$; for a square positive semidefinite matrix, it is the largest eigenvalue itself
  • A Hessian with large spectral norm indicates a sharp minimum (high curvature)
  • Spectral norm correlates negatively with validation accuracy
  • Flatter minima seem to generalize better
  • Caveat: some literature shows that sharp minima can generalize well, too


SLIDE 40

Experiments with SGD

Thoughts about Xing et al.

  • Much more analysis in the paper
  • Main points:
    • A large learning rate helps avoid local traps
    • Stochasticity from mini-batches prevents in-place oscillations
    • Flatter minima seem to generalize better
  • Experiments are done well, across datasets and with error bars
  • The analysis combines landscape properties and SGD behavior
  • Separating these factors would be nice but hard, as there is “too much space out there”
  • Empirical data are best used as observations to spark theoretical conjectures
