
Learning From Data Lecture 21 Neural Networks: Backpropagation

1. Learning From Data Lecture 21 Neural Networks: Backpropagation

   Forward propagation: algorithmic computation of h(x)
   Backpropagation: algorithmic computation of ∂e(x)/∂(weights)

   M. Magdon-Ismail, CSCI 4100/6100

2. recap: The Neural Network

   Biology ──────────→ Engineering

   [Figure: a feed-forward neural network. Inputs x_1, x_2, ..., x_d enter at the input layer ℓ = 0; each node in the hidden layers 0 < ℓ < L applies θ(s) to its incoming signal s; the output layer ℓ = L produces h(x).]

3. Zooming into a Hidden Node

   [Figure: zoom on layer ℓ. Layer (ℓ − 1) produces the output x^(ℓ−1); the incoming weights W^(ℓ) form the signal s^(ℓ); the node applies θ; layer ℓ's output x^(ℓ) feeds the outgoing weights W^(ℓ+1) of layer (ℓ + 1).]

   Layers ℓ = 0, 1, 2, ..., L. Layer ℓ has "dimension" d^(ℓ), i.e. d^(ℓ) + 1 nodes.

   Layer ℓ parameters:

   s^(ℓ)      signals in     d^(ℓ)-dimensional input vector
   x^(ℓ)      outputs        (d^(ℓ) + 1)-dimensional output vector
   W^(ℓ)      weights in     (d^(ℓ−1) + 1) × d^(ℓ)-dimensional matrix, W^(ℓ) = [w_1^(ℓ)  w_2^(ℓ)  ···  w_{d^(ℓ)}^(ℓ)]
   W^(ℓ+1)    weights out    (d^(ℓ) + 1) × d^(ℓ+1)-dimensional matrix

   W = {W^(1), W^(2), ..., W^(L)}  ←  specifies the network
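   A minimal sketch (not from the slides) of how weight matrices with these dimensions might be set up in Python with numpy; init_weights and its layer sizes are hypothetical names chosen for illustration.

   import numpy as np

   def init_weights(dims, rng=np.random.default_rng(0)):
       # dims = [d^(0), d^(1), ..., d^(L)]; W^(l) has shape (d^(l-1) + 1) x d^(l),
       # the +1 absorbing the bias node of layer l-1; small random values for illustration
       return [rng.normal(scale=0.1, size=(dims[l - 1] + 1, dims[l]))
               for l in range(1, len(dims))]

   W = init_weights([2, 3, 1])    # d^(0)=2 inputs, one hidden layer with d^(1)=3, d^(2)=1 output
   print([Wl.shape for Wl in W])  # [(3, 3), (4, 1)]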

4. The Linear Signal

   The input s^(ℓ) is a linear combination (using the weights) of the outputs of the previous layer, x^(ℓ−1):

   s^(ℓ) = (W^(ℓ))^t x^(ℓ−1)

   Componentwise, s_j^(ℓ) = (w_j^(ℓ))^t x^(ℓ−1) for j = 1, ..., d^(ℓ)  (recall the linear signal s = w^t x). The transformation θ then maps s^(ℓ) → x^(ℓ).
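   A tiny numerical illustration of the linear signal, with assumed shapes (d^(ℓ−1) = 2, d^(ℓ) = 2) and made-up numbers:

   import numpy as np

   x_prev = np.array([1.0, 0.5, -0.2])   # x^(l-1), length d^(l-1)+1, leading 1 is the bias
   W_l = 0.1 * np.ones((3, 2))           # W^(l), shape (d^(l-1)+1) x d^(l)
   s_l = W_l.T @ x_prev                  # s^(l) = (W^(l))^t x^(l-1), length d^(l)
   print(s_l)                            # [0.13 0.13]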

5. Forward Propagation: Computing h(x)

   x = x^(0) --W^(1)--> s^(1) --θ--> x^(1) --W^(2)--> s^(2) --θ--> x^(2) --> ··· --W^(L)--> s^(L) --θ--> x^(L) = h(x)

   Forward propagation to compute h(x):
   1: x^(0) ← x                                   [Initialization]
   2: for ℓ = 1 to L do                           [Forward Propagation]
   3:   s^(ℓ) ← (W^(ℓ))^t x^(ℓ−1)
   4:   x^(ℓ) ← [1; θ(s^(ℓ))]
   5: end for
   6: h(x) = x^(L)                                [Output]
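   A minimal Python sketch of the forward-propagation loop above, assuming θ = tanh at every node and the weights given as a list [W^(1), ..., W^(L)] of numpy matrices (as in the earlier init_weights sketch). One liberty taken here: no bias 1 is prepended at the output layer, so the last entry is x^(L) = h(x) itself.

   import numpy as np

   def forward(W, x):
       xs = [np.concatenate(([1.0], x))]        # x^(0) = [1; x]
       ss = []
       for l, Wl in enumerate(W, start=1):
           s = Wl.T @ xs[-1]                    # s^(l) = (W^(l))^t x^(l-1)
           ss.append(s)
           if l < len(W):
               xs.append(np.concatenate(([1.0], np.tanh(s))))   # x^(l) = [1; theta(s^(l))]
           else:
               xs.append(np.tanh(s))            # x^(L) = theta(s^(L)) = h(x)
       return xs, ss

   # e.g.  xs, ss = forward(W, np.array([0.3, -0.7]));  h = xs[-1]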

6. Minimizing E_in

   E_in(h) = E_in(W) = (1/N) Σ_{n=1}^N (h(x_n) − y_n)^2,        W = {W^(1), W^(2), ..., W^(L)}

   [Figure: E_in as a function of a weight w, with curves for sign, linear, and tanh output nodes.]

   Using θ = tanh makes E_in differentiable, so we can use gradient descent → a local minimum.
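   A short sketch of computing E_in this way, assuming the forward() routine from the previous sketch (so forward(W, x) returns the list of x^(ℓ) with x^(L) = h(x) last) and a one-dimensional output:

   import numpy as np

   def E_in(W, X, y):
       # X: N x d array of inputs, y: length-N array of targets
       return np.mean([(forward(W, xn)[0][-1][0] - yn) ** 2 for xn, yn in zip(X, y)])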

7. Gradient Descent

   W(t + 1) = W(t) − η ∇E_in(W(t))
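   One batch gradient-descent step, sketched under the assumption that a routine E_in_and_gradient(W, X, y) is available (one is sketched after the algorithm on slide 13) returning the error and the list of gradient matrices G^(ℓ):

   def gradient_descent_step(W, X, y, eta=0.1):
       _, G = E_in_and_gradient(W, X, y)
       return [Wl - eta * Gl for Wl, Gl in zip(W, G)]   # W(t+1) = W(t) - eta * gradient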

8. Gradient of E_in

   E_in(W) = (1/N) Σ_{n=1}^N e(h(x_n), y_n),        with e_n = e(h(x_n), y_n)

   ∂E_in(W)/∂W^(ℓ) = (1/N) Σ_{n=1}^N ∂e_n/∂W^(ℓ)

   We need ∂e(x)/∂W^(ℓ).

9. Numerical Approach

   ∂e(x)/∂W_ij^(ℓ)  ≈  [ e(x | W_ij^(ℓ) + Δ) − e(x | W_ij^(ℓ) − Δ) ] / (2Δ)

   Approximate, and inefficient.
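   A sketch of this finite-difference estimate for a single weight W_ij^(ℓ); the error function e(W) is assumed to be any callable that returns the error on one example for a given list of weight matrices W. Every weight needs two such evaluations (two forward propagations), which is why this is inefficient.

   def numerical_partial(e, W, l, i, j, delta=1e-4):
       # perturb only W^(l)_ij by +/- delta and take the centered difference
       W_plus  = [Wl.copy() for Wl in W]
       W_minus = [Wl.copy() for Wl in W]
       W_plus[l][i, j]  += delta
       W_minus[l][i, j] -= delta
       return (e(W_plus) - e(W_minus)) / (2 * delta)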

10. Algorithmic Approach

   e(x) is a function of s^(ℓ), and s^(ℓ) = (W^(ℓ))^t x^(ℓ−1). By the chain rule,

   ∂e/∂W^(ℓ) = x^(ℓ−1) (∂e/∂s^(ℓ))^t = x^(ℓ−1) (δ^(ℓ))^t

   where the sensitivity is δ^(ℓ) = ∂e/∂s^(ℓ). It remains to compute δ^(ℓ), again via the chain rule.
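   A one-line illustration of this identity with assumed shapes (d^(ℓ−1) = 2, d^(ℓ) = 3) and made-up values: once δ^(ℓ) is known, the gradient with respect to W^(ℓ) is just an outer product.

   import numpy as np

   x_prev  = np.array([1.0, 0.4, -0.6])    # x^(l-1), length d^(l-1)+1
   delta_l = np.array([0.2, -0.1, 0.05])   # delta^(l) = de/ds^(l), length d^(l)
   dE_dW   = np.outer(x_prev, delta_l)     # de/dW^(l), shape (d^(l-1)+1) x d^(l)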

11. Computing δ^(ℓ) Using the Chain Rule

   δ^(1) ← δ^(2) ← ··· ← δ^(L−1) ← δ^(L)

   Multiple applications of the chain rule: a perturbation propagates as Δs^(ℓ) --θ--> Δx^(ℓ) --W^(ℓ+1)--> Δs^(ℓ+1) --> ··· --> Δe(x).

   δ^(ℓ) = θ′(s^(ℓ)) ⊗ [ W^(ℓ+1) δ^(ℓ+1) ]_1^{d^(ℓ)}

   where ⊗ is componentwise multiplication and [·]_1^{d^(ℓ)} keeps components 1, ..., d^(ℓ), i.e. drops the 0th (bias) component.

12. The Backpropagation Algorithm

   δ^(1) ← δ^(2) ← ··· ← δ^(L−1) ← δ^(L)

   Backpropagation to compute the sensitivities δ^(ℓ) (assume s^(ℓ) and x^(ℓ) have been computed for all ℓ):
   1: δ^(L) ← 2(x^(L) − y) · θ′(s^(L))                      [Initialization]
   2: for ℓ = L − 1 down to 1 do                            [Back-Propagation]
   3:   compute θ′(s^(ℓ)) = [1 − x^(ℓ) ⊗ x^(ℓ)]_1^{d^(ℓ)}   (for tanh hidden nodes)
   4:   δ^(ℓ) ← θ′(s^(ℓ)) ⊗ [W^(ℓ+1) δ^(ℓ+1)]_1^{d^(ℓ)}     (⊗ is componentwise multiplication)
   5: end for
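   A minimal Python sketch of this loop for tanh nodes, reusing the list xs = [x^(0), ..., x^(L)] returned by the forward() sketch above (recall that in that sketch x^(L) carries no bias slot) and the weight list W, with W[l-1] playing the role of W^(ℓ):

   import numpy as np

   def backprop(W, xs, y):
       L = len(W)
       deltas = [None] * (L + 1)                   # deltas[l] will hold delta^(l)
       xL = xs[-1]
       deltas[L] = 2 * (xL - y) * (1 - xL ** 2)    # delta^(L) = 2(x^(L)-y) . theta'(s^(L))
       for l in range(L - 1, 0, -1):
           theta_prime = 1 - xs[l][1:] ** 2        # theta'(s^(l)) = 1 - x^(l) (x) x^(l), bias dropped
           back = (W[l] @ deltas[l + 1])[1:]       # [W^(l+1) delta^(l+1)], 0th (bias) row dropped
           deltas[l] = theta_prime * back          # componentwise multiplication
       return deltas[1:]                           # [delta^(1), ..., delta^(L)]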

13. Algorithm for Gradient Descent on E_in

   Algorithm to compute E_in(W) and g = ∇E_in(W):
   Input: weights W = {W^(1), ..., W^(L)}; data D.
   Output: error E_in(W) and gradient g = {G^(1), ..., G^(L)}.
   1: Initialize: E_in = 0; for ℓ = 1, ..., L, G^(ℓ) = 0 · W^(ℓ).
   2: for each data point x_n (n = 1, ..., N) do
   3:   compute x^(ℓ) for ℓ = 0, ..., L             [forward propagation]
   4:   compute δ^(ℓ) for ℓ = 1, ..., L             [backpropagation]
   5:   E_in ← E_in + (1/N)(x^(L) − y_n)^2
   6:   for ℓ = 1, ..., L do
   7:     G^(ℓ)(x_n) = x^(ℓ−1) (δ^(ℓ))^t
   8:     G^(ℓ) ← G^(ℓ) + (1/N) G^(ℓ)(x_n)
   9:   end for
   10: end for

   Can do the batch version or the sequential version (SGD).
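   A sketch of the whole routine, assuming the forward() and backprop() sketches above and a one-dimensional output; E_in_and_gradient and its signature are illustrative names, not the author's.

   import numpy as np

   def E_in_and_gradient(W, X, y):
       N = len(X)
       E = 0.0
       G = [np.zeros_like(Wl) for Wl in W]              # G^(l) = 0 . W^(l)
       for xn, yn in zip(X, y):
           xs, ss = forward(W, xn)                      # forward propagation
           deltas = backprop(W, xs, yn)                 # backpropagation
           E += (xs[-1][0] - yn) ** 2 / N
           for l in range(len(W)):
               G[l] += np.outer(xs[l], deltas[l]) / N   # G^(l) += (1/N) x^(l-1) (delta^(l))^t
       return E, G

   # The sequential (SGD) version would instead update W after each example,
   # using the single-example gradient x^(l-1) (delta^(l))^t directly.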

14. Digits Data

   [Figure, left: log10(error) versus log10(iteration) for gradient descent and SGD on the digits data. Figure, right: the digits data in the (average intensity, symmetry) feature plane.]
