SLIDE 1

Training Neural Nets

COMPSCI 527 — Computer Vision

SLIDE 2

Outline

1. The Softmax Simplex
2. Loss and Risk
3. Back-Propagation
4. Stochastic Gradient Descent
5. Regularization

SLIDE 3

The Softmax Simplex

  • Neural-net classifier: ŷ = h(x) : Rᵈ → Y
  • The last layer of a neural net used for classification is a soft-max layer p = σ(z) = exp(z) / (1ᵀ exp(z))
  • The net is p = f(x, w) : X × Rᵐ → P
  • The classifier is ŷ = h(x) = arg max p = arg max f(x, w)
  • P is the set of all nonnegative real-valued vectors p ∈ Rᴷ whose entries add up to 1 (with K = |Y|):
    P ≝ {p ∈ Rᴷ : p ≥ 0 and Σi pi = 1}

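As a concrete illustration of the soft-max layer and the simplex P, here is a minimal NumPy sketch (the function name and example values are mine, not from the slides); the usual max-shift keeps exp from overflowing without changing p:

```python
import numpy as np

def softmax(z):
    """p = sigma(z) = exp(z) / (1^T exp(z)); shifting z by max(z) is a
    standard stability trick that leaves p unchanged."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])            # hypothetical activations, K = 3
p = softmax(z)
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)  # p lies in the simplex P
print(p.argmax())                          # the classifier h(x) = arg max p
```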

SLIDE 4

The Softmax Simplex

P ≝ {p ∈ Rᴷ : p ≥ 0 and Σi pi = 1}

[Figure: the simplex P for K = 2 (a segment in the (p1, p2) plane, with midpoint (1/2, 1/2)) and for K = 3 (a triangle in (p1, p2, p3) space, with centroid (1/3, 1/3, 1/3))]

  • Decision regions are polyhedral and convex:
    Pc = {p ∈ P : pc ≥ pj for j ≠ c}, for c = 1, …, K

  • A network transforms images into points in P

SLIDE 5

Loss and Risk

Training is Empirical Risk Minimization

  • Define a loss ℓ(y, ŷ): How much do we pay when the true label is y and the network says ŷ?
  • Network: p = f(x, w), then ŷ = arg max p
  • Risk is the average loss over the training set T = {(x1, y1), …, (xN, yN)}:
    LT(w) = (1/N) Σn=1,…,N ℓn(w), where ℓn(w) = ℓ(yn, f(xn, w))
  • Determine the network weights w by minimizing LT(w)
  • Use some variant of steepest descent
  • We need ∇LT(w) and therefore ∇ℓn(w)

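A sketch of the empirical risk as an average of per-sample losses; `net` (standing in for f) and `loss_fn` (standing in for ℓ) are hypothetical, not code from the course:

```python
import numpy as np

def empirical_risk(w, T, net, loss_fn):
    """L_T(w) = (1/N) * sum of loss(y_n, f(x_n, w)) over the training set T,
    given as a list of (x_n, y_n) pairs."""
    return np.mean([loss_fn(y, net(x, w)) for x, y in T])
```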

SLIDE 6

Loss and Risk

The Cross-Entropy Loss

  • The ideal loss would be the 0-1 loss ℓ0-1(y, ŷ) on the classifier output ŷ
  • The 0-1 loss is constant wherever it is differentiable!
  • Not useful for computing a gradient for risk minimization
  • Use the cross-entropy loss on the softmax output p as a proxy loss: ℓ(y, p) = − log py
  • Differentiable!
  • Unbounded loss for total misclassification

[Figure: the cross-entropy loss − log py as a function of py, diverging as py → 0 and equal to 0 at py = 1]

SLIDE 7

Loss and Risk

Example: K = 5 Classes

  • The last layer before the soft-max has activations z ∈ Rᴷ
  • The soft-max output is p = σ(z) = exp(z) / (1ᵀ exp(z)) ∈ R⁵
  • Ideally, if the correct class is y = 4, we would like the output p to equal q = [0, 0, 0, 1, 0], the one-hot encoding of y
  • That is, qy = q4 = 1 and all other qj are zero
  • ℓ(y, p) = − log py = − log p4
  • py → 1 and ℓ(y, p) → 0 when zy/zy′ → ∞ for all y′ ≠ y
  • That is, when p approaches the correct simplex corner
  • py → 0 and ℓ(y, p) → ∞ when zy/zy′ → −∞ for some y′ ≠ y
  • That is, when p is far from the correct simplex corner

SLIDE 8

Loss and Risk

Example, Continued

[Figure: plot of ℓ(y, p) as a function of the activations]

  • ℓ(y, p) = − log py = − log ( exp(zy) / (1ᵀ exp(z)) ) = log(1ᵀ exp(z)) − zy
  • py → 1 and ℓ(y, p) → 0 when zy/zy′ → ∞ for y′ ≠ y
  • py → 0 and ℓ(y, p) → ∞ when zy/zy′ → −∞ for y′ ≠ y
  • The actual plot depends on all the values in z
  • This is a “soft hinge loss” in z (not in p)

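The identity ℓ(y, p) = log(1ᵀ exp(z)) − zy suggests computing the loss directly from z with a shifted log-sum-exp; a minimal sketch (0-based class index, example values mine):

```python
import numpy as np

def cross_entropy(z, y):
    """l(y, p) = -log p_y = log(1^T exp(z)) - z_y, computed stably:
    the log-sum-exp is shifted by max(z) so exp never overflows."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m))) - z[y]

z = np.array([1.0, 0.5, -0.2, 6.0, -1.0])  # K = 5 activations (hypothetical)
print(cross_entropy(z, 3))                  # the slide's class y = 4 is index 3 here
```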

SLIDE 9

Back-Propagation

Back-Propagation

[Figure: a chain of three layers f(1), f(2), f(3) with weights w(1), w(2), w(3); the input is x(0) = xn, the layer outputs are x(1), x(2), x(3) = p, and p together with yn produces the loss ℓn]

  • We need ∇LT(w) and therefore ∇ℓn(w) = ∂ℓn/∂w
  • The computations from xn to ℓn form a chain
  • Apply the chain rule
  • Every derivative of ℓn w.r.t. layers before k goes through x(k):
    ∂ℓn/∂w(k) = ∂ℓn/∂x(k) · ∂x(k)/∂w(k)
    ∂ℓn/∂x(k−1) = ∂ℓn/∂x(k) · ∂x(k)/∂x(k−1)   (recursion!)
  • Start: ∂ℓn/∂x(K) = ∂ℓ/∂p

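The recursion maps directly to code. A sketch, under the assumption that each layer object has stored its local Jacobians `jac_w` = ∂x(k)/∂w(k) and `jac_x` = ∂x(k)/∂x(k−1) during the forward pass (these attribute names are mine):

```python
import numpy as np

def backprop(layers, dl_dp):
    """Back-propagation as the recursion on the slide. `dl_dp` is the
    starting row vector dl/dx(K) = dl/dp."""
    grads = []
    dl_dx = dl_dp
    for layer in reversed(layers):          # from output to input
        grads.append(dl_dx @ layer.jac_w)   # dl/dw(k) = dl/dx(k) dx(k)/dw(k)
        dl_dx = dl_dx @ layer.jac_x         # dl/dx(k-1) = dl/dx(k) dx(k)/dx(k-1)
    return np.concatenate(grads[::-1])      # stack into one vector dl/dw
```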

SLIDE 10

Back-Propagation

Local Jacobians

[Figure: the same chain of layers f(1), f(2), f(3) with weights w(1), w(2), w(3), input x(0) = xn, output x(3) = p, and loss ℓn]

  • Local computations at layer k: ∂x(k)/∂w(k) and ∂x(k)/∂x(k−1)
  • These are the partial derivatives of f(k) with respect to the layer weights and the input to the layer
  • They are local Jacobian matrices, and can be computed by knowing what the layer does
  • The start of the process can be computed from knowing the loss function: ∂ℓn/∂x(K) = ∂ℓ/∂p
  • Another local Jacobian
  • The rest is going recursively from output to input, one layer at a time, accumulating ∂ℓn/∂w(k) into a vector ∂ℓn/∂w

SLIDE 11

Back-Propagation

Back-Propagation Spelled Out for K = 3

[Figure: the chain of layers f(1), f(2), f(3) once more, annotated with the quantities below]

  • ∂ℓn/∂x(3) = ∂ℓ/∂p
  • ∂ℓn/∂w(3) = ∂ℓn/∂x(3) · ∂x(3)/∂w(3)
  • ∂ℓn/∂x(2) = ∂ℓn/∂x(3) · ∂x(3)/∂x(2)
  • ∂ℓn/∂w(2) = ∂ℓn/∂x(2) · ∂x(2)/∂w(2)
  • ∂ℓn/∂x(1) = ∂ℓn/∂x(2) · ∂x(2)/∂x(1)
  • ∂ℓn/∂w(1) = ∂ℓn/∂x(1) · ∂x(1)/∂w(1)
  • ∂ℓn/∂x(0) = ∂ℓn/∂x(1) · ∂x(1)/∂x(0)
  • Stack the pieces: ∂ℓn/∂w = [∂ℓn/∂w(1); ∂ℓn/∂w(2); ∂ℓn/∂w(3)]

(The local Jacobians ∂x(k)/∂w(k) and ∂x(k)/∂x(k−1) are shown in blue on the slide)

SLIDE 12

Back-Propagation

Computing Local Jacobians

∂x(k)/∂w(k) and ∂x(k)/∂x(k−1)

  • It is easier to make a “layer” as simple as possible
  • z = Vx + b is one layer (the affine part of a Fully Connected (FC) layer)
  • z = ρ(x) (ReLU) is another layer
  • Softmax, max-pooling, convolutional, ...

SLIDE 13

Back-Propagation

Local Jacobians for a FC Layer

z = Vx + b

  • ∂z/∂x = V (easy!)
  • ∂z/∂w: What is ∂z/∂V? Three subscripts, ∂zi/∂vjk. A 3D tensor?
  • For a general package, tensors are the way to go
  • Conceptually, it may be easier to vectorize everything:
    V = [v11 v12 v13; v21 v22 v23],   b = [b1; b2]
    w = [v11, v12, v13, v21, v22, v23, b1, b2]ᵀ
  • ∂z/∂w is then a 2 × 8 matrix
  • With e outputs and d inputs, an e × e(d + 1) matrix

SLIDE 14

Back-Propagation

The Jacobian ∂z/∂w for a FC Layer

  • [z1; z2] = [w1 w2 w3; w4 w5 w6] · [x1; x2; x3] + [w7; w8]
  • Don’t be afraid to spell things out:
    z1 = w1 x1 + w2 x2 + w3 x3 + w7
    z2 = w4 x1 + w5 x2 + w6 x3 + w8
  • ∂z/∂w = [ ∂z1/∂w1 ⋯ ∂z1/∂w8 ; ∂z2/∂w1 ⋯ ∂z2/∂w8 ]
  • ∂z/∂w = | x1 x2 x3 0  0  0  1 0 |
            | 0  0  0  x1 x2 x3 0 1 |
  • Obvious pattern: Repeat xᵀ, staggered, e times
  • Then append the e × e identity at the end

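The staggered-xᵀ-plus-identity pattern can be built in one line with a Kronecker product; a sketch (function name mine) that reproduces the 2 × 8 example above:

```python
import numpy as np

def fc_jacobian_w(x, e):
    """dz/dw for z = V x + b, with w = [rows of V, then b]:
    x^T repeated e times, staggered, then the e-by-e identity appended.
    The result is e x e(d+1) for d inputs."""
    return np.hstack([np.kron(np.eye(e), x), np.eye(e)])

x = np.array([2.0, -1.0, 3.0])   # d = 3 inputs (example values)
print(fc_jacobian_w(x, e=2))     # the 2 x 8 matrix from the slide
```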

SLIDE 15

Stochastic Gradient Descent

Training

  • Local gradients are used in back-propagation
  • So we now have ∇LT(w)
  • ŵ = arg min LT(w)

  • LT(w) is (very) non-convex, so we look for local minima
  • w ∈ Rm with m very large: No Hessians
  • Gradient descent
  • Even so, every step calls back-propagation N = |T| times
  • Back-propagation computes m derivatives ∇ℓn(w)
  • Computational complexity is Ω(mN) per step
  • Even gradient descent is way too expensive!

SLIDE 16

Stochastic Gradient Descent

No Line Search

  • Line search is out of the question
  • Fix some step multiplier α, called the learning rate

wt+1 = wt − α∇LT(wt)

  • How to pick α? Validation is too expensive
  • Tradeoffs:
  • α too small: Slow progress
  • α too big: Jump over minima
  • Frequent practice:
  • Start with α relatively large, and monitor LT(w)
  • When LT(w) levels off, decrease α
  • Alternative: Fixed decay schedule for α
  • Better (recent) option: Change α adaptively (Adam, 2015)

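A hedged sketch of the frequent practice above: fixed-step descent where α is cut whenever the monitored risk levels off. Here `risk_and_grad` is a hypothetical oracle returning LT(w) and ∇LT(w), and the thresholds are illustrative, not prescribed by the slides:

```python
def descend(w, risk_and_grad, alpha=1e-2, steps=1000, decay=0.1, window=50, tol=1e-4):
    """w <- w - alpha * grad, with alpha multiplied by `decay` whenever the
    risk has improved by less than `tol` over the last `window` steps."""
    history = []
    for _ in range(steps):
        risk, grad = risk_and_grad(w)
        history.append(risk)
        if len(history) >= window and history[-window] - risk < tol:
            alpha *= decay       # risk leveled off: decrease the learning rate
            history = []         # restart monitoring at the new rate
        w = w - alpha * grad
    return w
```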

SLIDE 17

Stochastic Gradient Descent

Manual Adjustment of α

  • Start with α relatively large, and monitor LT(wt)
  • When LT(wt) levels off, decrease α
  • Typical plots of LT(wt) versus iteration index t:

[Figure: typical plots of the risk LT(wt) versus the iteration index t]

SLIDE 18

Stochastic Gradient Descent

Batch Gradient Descent

  • ∇LT(w) = (1/N) Σn=1,…,N ∇ℓn(w)
  • Taking a macro-step −α∇LT(wt) is the same as taking the N micro-steps −(α/N)∇ℓ1(wt), …, −(α/N)∇ℓN(wt)
  • First compute all the N steps at wt, then take all the steps
  • Thus, standard gradient descent is a batch method: Compute the gradient at wt using the entire batch of data, then move
  • Even with no line search, computing N micro-steps is still expensive

SLIDE 19

Stochastic Gradient Descent

Stochastic Descent

  • Taking a macro-step −α∇LT(wt) is the same as taking the N micro-steps −(α/N)∇ℓ1(wt), …, −(α/N)∇ℓN(wt)
  • First compute all the N steps at wt, then take all the steps
  • Can we use this effort more effectively?
  • Key observation: −α∇ℓn(w) is a poor estimate of −α∇LT(w), but an estimate all the same: Micro-steps are in the correct direction on average!
  • After each micro-step, we are on average in a better place
  • How about computing a new micro-gradient after every micro-step?
  • Now each micro-step gradient is evaluated at a point that is on average better (lower risk) than in the batch method

SLIDE 20

Stochastic Gradient Descent

Batch versus Stochastic Gradient Descent

  • Define sn(w) = −(α/N) ∇ℓn(w)
  • Batch:
    • Compute s1(wt), …, sN(wt)
    • Move by s1(wt), then s2(wt), …, then sN(wt)
      (or equivalently move once by s1(wt) + … + sN(wt))
  • Stochastic (SGD):
    • Compute s1(wt), then move by s1(wt) from wt to wt(1)
    • Compute s2(wt(1)), then move by s2(wt(1)) from wt(1) to wt(2)
    • …
    • Compute sN(wt(N−1)), then move by sN(wt(N−1)) from wt(N−1) to wt(N) = wt+1
  • In SGD, each micro-step is taken from a better (lower risk) place on average

SLIDE 21

Stochastic Gradient Descent

Why “Stochastic?”

  • Progress occurs only on average
  • Many micro-steps are bad, but they are good on average
  • Progress is a random walk

[Figure: a random-walk illustration of SGD progress, from https://towardsdatascience.com/]

SLIDE 22

Stochastic Gradient Descent

Reducing Variance: Mini-Batches

  • Each data sample is a poor estimate of T: High-variance micro-steps
  • Each micro-step takes full advantage of the estimate, by moving right away: Low-bias micro-steps
  • High variance may hurt more than low bias helps
  • Can we lower variance at the expense of bias?
  • Average B samples at a time: Take mini-steps
  • The B samples are a mini-batch
  • With bigger B,
    • Higher bias
    • Lower variance

SLIDE 23

Stochastic Gradient Descent

Mini-Batches

  • Scramble T at random
  • Divide T into J mini-batches Tj of size B
  • w(0) = w
  • For j = 1, …, J:
    • Batch gradient: gj = ∇LTj(w(j−1)) = (1/B) Σn=(j−1)B+1,…,jB ∇ℓn(w(j−1))
    • Move: w(j) = w(j−1) − α gj
  • This for loop amounts to one macro-step
  • Each execution of the entire loop uses the training data once
  • Each execution of the entire loop is an epoch
  • Repeat over several epochs until a stopping criterion is met

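One epoch of this loop in NumPy; `grad_loss(w, Xb, Yb)` is a hypothetical oracle returning the mini-batch gradient gj:

```python
import numpy as np

def sgd_epoch(w, X, Y, grad_loss, alpha=1e-2, B=32, rng=None):
    """One epoch of mini-batch SGD: scramble T, split it into mini-batches
    of size B, and take one mini-step per batch."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(len(X))        # scramble T at random
    for j in range(0, len(X), B):         # J mini-batches of size B
        idx = perm[j:j + B]
        w = w - alpha * grad_loss(w, X[idx], Y[idx])   # one mini-step
    return w
```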

SLIDE 24

Stochastic Gradient Descent

Momentum

  • Sometimes w(j) meanders around in shallow valleys

[Figure: risk versus iteration (up to about 1200 iterations), meandering between roughly 0.1 and 0.9; no α adjustment here]

  • α is too small, but the direction is still promising
  • Add momentum:
    v(0) = 0
    v(j+1) = µ(j) v(j) − α ∇LTj(w(j))   (with 0 ≤ µ(j) < 1)
    w(j+1) = w(j) + v(j+1)

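The momentum update in code; a minimal sketch with a constant µ (the slides allow µ(j) to vary across iterations):

```python
def momentum_step(w, v, grad, alpha, mu=0.9):
    """v <- mu * v - alpha * grad,  then  w <- w + v   (0 <= mu < 1).
    mu = 0 recovers the plain mini-batch step."""
    v = mu * v - alpha * grad
    return w + v, v
```

Starting from v = 0 and calling this once per mini-batch reproduces the update above.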

SLIDE 25

Regularization

Regularization

  • The capacity of deep networks is very high: It is often possible to achieve near-zero training risk
  • “Memorize the training set”
  • Overfitting
  • All training methods use some type of regularization
  • Regularization can be seen as inductive bias: Bias the training algorithm to find weights with certain properties
  • Simplest method: weight decay. Add a term λ‖w‖² to the risk function: Keep the weights small (Tikhonov)
  • Many proposals have been made
  • It is not yet clear which method works best; a few proposals follow

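Weight decay amounts to one extra gradient term; a sketch (the function name and λ value are illustrative):

```python
def decayed_grad(w, grad, lam=1e-4):
    """Adding lam * ||w||^2 to the risk adds 2 * lam * w to its gradient,
    which pulls the weights toward zero (Tikhonov regularization)."""
    return grad + 2.0 * lam * w
```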

SLIDE 26

Regularization

Early Termination

  • Terminating training well before LT is minimized is somewhat similar to “implicit” weight decay
  • Progress at each iteration is limited, so stopping early keeps us close to w0, which is a set of small random weights
  • Therefore, the norm of wt is restrained, albeit in terms of how long the learner takes to get there rather than in absolute terms
  • A more informed approach to early termination stops when a validation risk (or, even better, error rate) stops declining
  • This (with validation check) is arguably the most widely used regularization method

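A sketch of the validation-based variant: run epochs, track the validation risk, and stop when it has not improved for a while. `step` (one training epoch) and `val_risk` are hypothetical stand-ins:

```python
def train_early_stop(w, step, val_risk, max_epochs=100, patience=5):
    """Stop when the validation risk stops declining for `patience` epochs;
    return the weights that achieved the best validation risk."""
    best_w, best_risk, bad = w, float("inf"), 0
    for epoch in range(max_epochs):
        w = step(w)                    # e.g., one epoch of mini-batch SGD
        r = val_risk(w)
        if r < best_risk:
            best_w, best_risk, bad = w, r, 0
        else:
            bad += 1
            if bad >= patience:
                break                  # validation risk stopped declining
    return best_w
```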

SLIDE 27

Regularization

Dropout

  • Dropout is inspired by ensemble methods: Regularize by averaging multiple predictors
  • Key difficulty: It is too expensive to train an ensemble of deep neural networks
  • Efficient (crude!) approximation:
    • Before processing a new mini-batch, flip a coin with P[heads] = p (typically p = 1/2) for each neuron
    • Turn off the neurons for which the coin comes up tails
    • Restore all neurons at the end of the mini-batch
    • When training is done, multiply all weights by p
  • This is very loosely akin to training a different network for every mini-batch
  • Multiplication by p takes the “average” of all networks
  • There are flaws in the reasoning, but the method works

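A sketch of this scheme applied to a layer's activations (equivalent to turning neurons off); the test-time multiplication by p is written on the activations rather than the weights, which has the same effect:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=np.random.default_rng()):
    """Training: keep each neuron with probability p (the coin comes up
    heads), zero the rest. Testing: keep everything and scale by p to
    take the 'average' of all the networks."""
    if train:
        return x * (rng.random(x.shape) < p)   # one coin flip per neuron
    return x * p
```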

SLIDE 28

Regularization

[Figure-only slide]

SLIDE 29

Regularization

Data Augmentation

  • Data augmentation is not a regularization method, but it combats overfitting
  • Make new training data out of thin air
  • Given a data sample (x, y), create perturbed copies x1, …, xk of x (these have the same label!)
  • Add the samples (x1, y), …, (xk, y) to the training set T
  • With images this is easy: The xi are cropped, rotated, stretched, re-colored, … versions of x
  • One training sample generates k new ones
  • T grows by a factor of k + 1
  • Very effective, used almost universally
  • Need to use realistic perturbations

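A sketch of the idea with two simple perturbations, assuming a 2-D grayscale image larger than the crop margin (real pipelines also rotate, stretch, and re-color, and resize crops back to the original shape):

```python
import numpy as np

def augment(x, y, k=4, rng=np.random.default_rng()):
    """Create k perturbed copies of image x, all with the same label y;
    adding them to T grows it by a factor of k + 1."""
    copies = []
    for _ in range(k):
        xi = np.flip(x, axis=1) if rng.random() < 0.5 else x  # horizontal flip
        r, c = rng.integers(0, 4, size=2)                     # random crop offset
        copies.append((xi[r:r + x.shape[0] - 4, c:c + x.shape[1] - 4], y))
    return copies
```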