CSCI 5525 Machine Learning, Fall 2019
Lecture 10: Neural Networks (Part 2)
Feb 25th, 2020
Lecturer: Steven Wu    Scribe: Steven Wu

1 Backpropagation

Now we consider the ERM problem of minimizing the following empirical risk function over $\theta$:
$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, F(x_i, \theta)),$$
where $\ell$ denotes the loss function, which can be the cross-entropy loss or the square loss. We will use the gradient descent method to optimize this function, even though it is non-convex. First, the gradient w.r.t. each $W_j$ is
$$\nabla_{W_j} \hat{R}(\theta) = \nabla_{W_j} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, F(x_i, \theta)) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{W_j} \ell(y_i, F(x_i, \theta)).$$
The same equality holds for the gradient w.r.t. each $b_j$, so it suffices to look at the gradient for each example. We can rewrite the loss for each example as
$$\ell(y_i, F(x_i, \theta)) = \ell\big(y_i, \sigma_L(W_L(\ldots W_2 \sigma_1(W_1 x_i + b_1) + b_2 \ldots) + b_L)\big) = \tilde{\sigma}_L(W_L(\ldots W_2 \sigma_1(W_1 x_i + b_1) + b_2 \ldots) + b_L) \equiv \tilde{F}(x_i, \theta),$$
where $\tilde{\sigma}_L$ absorbs $y_i$ and $\ell$, that is, $\tilde{\sigma}_L(a) = \ell(y_i, a)$ for any $a$. Note that $\tilde{\sigma}_L$ can just be viewed as another activation function, so this loss function can be viewed as a different neural network mapping. Therefore, it suffices to look at the gradient $\nabla_{W_j} F(x, \theta)$ for a generic neural network $F$; the gradient computation is the same.

Backpropagation is a linear-time algorithm with runtime $O(V + E)$, where $V$ is the number of nodes and $E$ is the number of edges in the network. It is essentially a message-passing protocol.

Univariate case. Let us work out the case where everything is in $\mathbb{R}$. The goal is to compute the derivatives of the function
$$F(\theta) = \sigma_L(W_L(\ldots W_2 \sigma_1(W_1 x + b_1) + b_2 \ldots) + b_L).$$
For any $1 \le j \le L$, let
$$F_j(\theta) = \sigma_j(W_j(\ldots W_2 \sigma_1(W_1 x + b_1) + b_2 \ldots) + b_j), \qquad J_j = \sigma_j'(W_j F_{j-1}(\theta) + b_j).$$
All of these quantities can be computed with a forward pass. Next, we can apply the chain rule and compute the derivatives with a backward pass:
$$\frac{\partial F_L}{\partial W_L} = J_L F_{L-1}(\theta), \qquad \frac{\partial F_L}{\partial b_L} = J_L,$$
$$\vdots$$
$$\frac{\partial F_L}{\partial W_j} = J_L W_L J_{L-1} W_{L-1} \cdots J_j\, F_{j-1}(\theta), \qquad \frac{\partial F_L}{\partial b_j} = J_L W_L J_{L-1} W_{L-1} \cdots J_j.$$
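To make the message-passing picture concrete, here is a minimal Python sketch of the univariate case (not part of the original notes): a forward pass stores every $F_j(\theta)$ and local derivative $J_j$, and a backward pass accumulates the running product $J_L W_L J_{L-1} W_{L-1} \cdots$ to produce $\partial F_L / \partial W_j$ and $\partial F_L / \partial b_j$. Using tanh for every activation $\sigma_j$ is an assumption made purely for illustration.

```python
import math

def forward_backward(x, Ws, bs):
    """Univariate backprop sketch: every quantity is a plain float.

    Assumes sigma_j = tanh for every layer (an illustrative choice,
    not something fixed by the notes).
    """
    # Forward pass: store F_j(theta) and the local derivatives J_j.
    F = [x]            # F[0] = x, F[j] = F_j(theta)
    J = [None]         # J[j] = sigma_j'(W_j F_{j-1}(theta) + b_j)
    for W, b in zip(Ws, bs):
        z = W * F[-1] + b
        F.append(math.tanh(z))
        J.append(1.0 - math.tanh(z) ** 2)    # derivative of tanh at z

    # Backward pass: accumulate the running product J_L W_L J_{L-1} W_{L-1} ...
    dW = [0.0] * len(Ws)
    db = [0.0] * len(bs)
    running = 1.0
    for j in range(len(Ws), 0, -1):          # j = L, L-1, ..., 1
        running *= J[j]                      # ... J_j
        dW[j - 1] = running * F[j - 1]       # dF_L / dW_j
        db[j - 1] = running                  # dF_L / db_j
        running *= Ws[j - 1]                 # append W_j before moving to layer j-1
    return F[-1], dW, db
```

For example, `forward_backward(0.5, [1.0, -2.0], [0.1, 0.3])` returns $F_2(\theta)$ together with the four derivatives; comparing them against finite differences of the returned output is a quick sanity check that the recursion matches the formulas above.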

Multivariate case. That looks nice and simple. As we move to the multivariate case, we will need the following multivariate chain rule:
$$\nabla_W f(Wa) = J^\top a^\top,$$
where $J \in \mathbb{R}^{l \times k}$ is the Jacobian matrix of $f: \mathbb{R}^k \to \mathbb{R}^l$ at $Wa$. (Recall that for any function $f(r_1, \ldots, r_k) = (y_1, \ldots, y_l)$, the entry $J_{ij} = \partial y_i / \partial r_j$.) Applying the chain rule again:
$$\frac{\partial F_L}{\partial W_L} = J_L^\top F_{L-1}(\theta)^\top, \qquad \frac{\partial F_L}{\partial b_L} = J_L^\top,$$
$$\vdots$$
$$\frac{\partial F_L}{\partial W_j} = (J_L W_L J_{L-1} W_{L-1} \cdots J_j)^\top F_{j-1}(\theta)^\top, \qquad \frac{\partial F_L}{\partial b_j} = (J_L W_L J_{L-1} W_{L-1} \cdots J_j)^\top,$$
where $J_j$ is the Jacobian of $\sigma_j$ at $W_j F_{j-1}(\theta) + b_j$. If $\sigma_j$ applies its activation function coordinatewise, then the Jacobian matrix is diagonal.
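The same recursion carries over with vectors and matrices. Below is a minimal NumPy sketch (again an illustration, not the course's implementation) that assumes a coordinatewise tanh activation at every layer, so each Jacobian $J_j$ is diagonal and can be stored as a vector, and a one-dimensional final output, as is the case once the loss has been absorbed into $\tilde{\sigma}_L$.

```python
import numpy as np

def forward_backward(x, Ws, bs):
    """Multivariate backprop sketch following the formulas above.

    Ws[j-1] is W_j (shape d_j x d_{j-1}); bs[j-1] is b_j (shape (d_j,)).
    Assumes sigma_j = tanh applied coordinatewise, so each Jacobian J_j
    is diagonal and is stored as the vector of its diagonal entries,
    and assumes the final output F_L is one-dimensional.
    """
    # Forward pass: store F_j(theta) and the diagonal of each Jacobian J_j.
    F, Jdiag = [x], [None]
    for W, b in zip(Ws, bs):
        z = W @ F[-1] + b
        F.append(np.tanh(z))
        Jdiag.append(1.0 - np.tanh(z) ** 2)

    # Backward pass: maintain g = (J_L W_L ... W_{j+1} J_j)^T, a vector since
    # the output is scalar; then dF_L/dW_j = g F_{j-1}^T and dF_L/db_j = g.
    L = len(Ws)
    dW, db = [None] * L, [None] * L
    g = np.ones(1)                          # derivative of the scalar output w.r.t. itself
    for j in range(L, 0, -1):               # j = L, L-1, ..., 1
        g = Jdiag[j] * g                    # multiply by J_j (diagonal)
        dW[j - 1] = np.outer(g, F[j - 1])   # (J_L W_L ... J_j)^T F_{j-1}(theta)^T
        db[j - 1] = g
        g = Ws[j - 1].T @ g                 # multiply by W_j^T before layer j-1
    return F[-1], dW, db
```

Averaging the per-example gradients produced this way over the training set recovers $\nabla_{W_j} \hat{R}(\theta)$ from the first display of this lecture.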

2 Stochastic Gradient Descent

Recall that the empirical gradient is defined as
$$\nabla_\theta \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell(y_i, F(x_i, \theta)).$$
For large $n$, this can be very expensive to compute. A common practice is to evaluate the gradient on a mini-batch $\{(x_i', y_i')\}_{i=1}^{b}$ selected uniformly at random. In expectation, the update moves in the right direction:
$$\mathbb{E}\left[\frac{1}{b} \sum_{i} \nabla_\theta \ell(y_i', F(x_i', \theta_t))\right] = \nabla_\theta \hat{R}(\theta_t).$$
The batch size is another hyperparameter to tune.
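Here is a minimal sketch of the resulting training loop, with a hypothetical `grad_loss(params, x, y)` helper (for example, the backward pass above composed with the loss) standing in for $\nabla_\theta \ell(y_i, F(x_i, \theta))$; the learning rate, batch size, and step count shown are illustrative choices, not values from the notes.

```python
import numpy as np

def sgd(params, X, Y, grad_loss, lr=0.1, batch_size=32, steps=1000):
    """Mini-batch SGD sketch.

    params is a list of arrays (the W_j and b_j); grad_loss(params, x, y)
    is assumed to return a list of arrays of the same shapes containing
    the per-example gradient grad_theta l(y, F(x, theta)).
    """
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(steps):
        # Draw a mini-batch uniformly at random.
        batch = rng.choice(n, size=batch_size)
        # Average the per-example gradients: an unbiased estimate of grad R_hat(theta_t).
        grads = [grad_loss(params, X[i], Y[i]) for i in batch]
        avg = [sum(g[k] for g in grads) / batch_size for k in range(len(params))]
        # Gradient step: theta_{t+1} = theta_t - lr * estimate.
        params = [p - lr * d for p, d in zip(params, avg)]
    return params
```

Setting the batch size to $n$ recovers full-batch gradient descent; smaller batches trade gradient noise for cheaper iterations, which is why the batch size is treated as a hyperparameter.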
