  1. Advanced Machine Learning - Exercise 3: Deep learning essentials

  2. Introduction - What's the plan?
     – Exercise overview
     – Deep learning in a nutshell
     – Backprop in (painful) detail

  3. Introduction - Exercise overview
     Goal: implement a simple DL framework.
     Tasks:
     – Compute derivatives (Jacobians)
     – Write code
     You'll need some help...

  4. Introduction - Deep learning in a nutshell
     Given:
     – Training data $X = \{x_i\}_{i=1..N}$ with $x_i \in \mathcal{I}$, usually stored as $X \in \mathbb{R}^{N \times N_I}$
     – Training labels $T = \{t_i\}_{i=1..N}$ with $t_i \in \mathcal{O}$.
     Choose:
     – A parametrized, (sub-)differentiable function $F(X, \theta): \mathcal{I} \times \mathcal{P} \mapsto \mathcal{O}$, where typically: input space $\mathcal{I} = \mathbb{R}^{N_I}$ (generic data) or $\mathcal{I} = \mathbb{R}^{3 \times H \times W}$ (images), ...; output space $\mathcal{O} = \mathbb{R}^{N_O}$ (regression) or $\mathcal{O} = [0, 1]^{N_O}$ (probabilistic classification), ...; parameter space $\mathcal{P} = \mathbb{R}^{N_P}$.
     – A (sub-)differentiable criterion/loss $L(T, F(X, \theta)): \mathcal{O} \times \mathcal{O} \mapsto \mathbb{R}$.
     Find:
     $\theta^* = \operatorname{argmin}_{\theta \in \mathcal{P}} L(T, F(X, \theta))$
     Assumption: the loss decomposes over samples,
     $L(T, F(X, \theta)) = \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, F(x_i, \theta))$
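
     As a concrete (toy) instance of this setup, a minimal NumPy sketch; the linear model, squared-error loss, and shapes below are illustrative assumptions, not part of the exercise:

         import numpy as np

         # Illustrative shapes: N samples, N_I input features, N_O output dimensions.
         N, N_I, N_O = 100, 5, 1
         rng = np.random.default_rng(0)
         X = rng.normal(size=(N, N_I))   # training data, X in R^{N x N_I}
         T = rng.normal(size=(N, N_O))   # training labels, t_i in O = R^{N_O}

         def F(Xs, theta):
             """A parametrized, (sub-)differentiable function; here simply linear."""
             W, b = theta
             return Xs @ W + b           # maps I -> O for every sample (row)

         def L(Ts, Ys):
             """Criterion: mean of the per-sample losses l(t_i, F(x_i, theta))."""
             return np.mean(np.sum((Ys - Ts) ** 2, axis=1))

         theta = (rng.normal(size=(N_I, N_O)), np.zeros(N_O))
         print("initial loss:", L(T, F(X, theta)))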

  5. Backprop
     $D_\theta \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, F(x_i, \theta)) = \frac{1}{N} \sum_{i=1}^{N} D_\theta \ell(t_i, F(x_i, \theta)) = \frac{1}{N} \sum_{i=1}^{N} D_F \ell(t_i, F(x_i, \theta)) \circ D_\theta F(x_i, \theta)$
     Assumption: $F$ is hierarchical: $F(x_i, \theta) = f_1(f_2(f_3(\ldots x_i \ldots, \theta_3), \theta_2), \theta_1)$
     $D_{\theta_1} F(x_i, \theta) = D_{\theta_1} f_1(f_2, \theta_1)$
     $D_{\theta_2} F(x_i, \theta) = D_{f_2} f_1(f_2, \theta_1) \circ D_{\theta_2} f_2(f_3, \theta_2)$
     $D_{\theta_3} F(x_i, \theta) = D_{f_2} f_1(f_2, \theta_1) \circ D_{f_3} f_2(f_3, \theta_2) \circ D_{\theta_3} f_3(\ldots, \theta_3)$
     where $f_2 = f_2(f_3(\ldots x_i \ldots, \theta_3), \theta_2)$, etc.
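
     A toy numerical check of this chain rule for the $\theta_2$ gradient; the particular two-module composition, shapes, and tolerance are made up for illustration:

         import numpy as np

         rng = np.random.default_rng(1)
         x = rng.normal(size=3)
         theta1 = rng.normal(size=(2, 4))         # parameters of f1
         theta2 = rng.normal(size=(4, 3))         # parameters of f2

         f2 = lambda z, th2: np.tanh(th2 @ z)     # inner module, R^3 -> R^4
         f1 = lambda z, th1: th1 @ z              # outer module, R^4 -> R^2
         F  = lambda x, th1, th2: f1(f2(x, th2), th1)

         # Chain rule: D_{theta2} F = D_{f2} f1 . D_{theta2} f2
         z2 = theta2 @ x                          # pre-activation inside f2
         D_f2_f1 = theta1                         # Jacobian of f1 w.r.t. its input (f1 is linear in z)
         # d f2_i / d theta2[i, j] = (1 - tanh(z2_i)^2) * x_j; f2_i depends only on row i of theta2
         D_th2_f2 = (1 - np.tanh(z2) ** 2)[:, None] * x[None, :]      # shape (4, 3)
         analytic = np.einsum('ki,ij->kij', D_f2_f1, D_th2_f2)        # dF_k / d theta2[i, j]

         # Finite-difference check of one entry, e.g. dF / d theta2[1, 2]
         h = 1e-6
         tp, tm = theta2.copy(), theta2.copy()
         tp[1, 2] += h
         tm[1, 2] -= h
         numeric = (F(x, theta1, tp) - F(x, theta1, tm)) / (2 * h)
         print(np.allclose(analytic[:, 1, 2], numeric, atol=1e-5))    # True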

  6. Backprop - Jacobians
     The loss:
     $D_F \ell(t_i, F(x_i, \theta)) = (\partial_{F_1} \ell \;\; \ldots \;\; \partial_{F_{N_F}} \ell) \in \mathbb{R}^{1 \times N_F}$
     The functions (modules):
     $f(z, \theta) = \begin{pmatrix} f_1((z_1 \ldots z_{N_z}), \theta) \\ \vdots \\ f_{N_f}((z_1 \ldots z_{N_z}), \theta) \end{pmatrix}$
     $D_z f(z, \theta) = \begin{pmatrix} \partial_{z_1} f_1 & \ldots & \partial_{z_{N_z}} f_1 \\ \vdots & & \vdots \\ \partial_{z_1} f_{N_f} & \ldots & \partial_{z_{N_z}} f_{N_f} \end{pmatrix} \in \mathbb{R}^{N_f \times N_z}$
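
     For example, for an elementwise tanh module the Jacobian $D_z f$ is diagonal; a small sketch (the module choice is just an example):

         import numpy as np

         def tanh_fprop(z):
             return np.tanh(z)

         def tanh_jacobian(z):
             # f_i depends only on z_i, so D_z f is diagonal with entries 1 - tanh(z_i)^2.
             return np.diag(1.0 - np.tanh(z) ** 2)       # shape (N_f x N_z) with N_f == N_z

         z = np.array([0.5, -1.0, 2.0])
         J = tanh_jacobian(z)                             # 3 x 3 diagonal Jacobian
         grad_output = np.array([[1.0, 0.0, -2.0]])       # row vector in R^{1 x N_f}
         grad_input = grad_output @ J                     # backprop step: R^{1 x N_z}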

  7. Backprop - Modules
     Looking at module $f_2$:
     $D_{\theta_3} F(x_i, \theta) = \underbrace{\underbrace{[D_{f_2} f_1(f_2, \theta_1)]}_{\text{grad output}} \cdot \underbrace{[D_{f_3} f_2(f_3, \theta_2)]}_{\text{Jacobian wrt. input}}}_{\text{grad input}} \cdot [D_{\theta_3} f_3(\ldots, \theta_3)]$
     Three (core) functions per module:
     – fprop: compute the output $f_i(z, \theta_i)$ given the input $z$ and the current parametrization $\theta_i$.
     – grad_input: compute grad_output $\cdot\, D_z f_i(z, \theta_i)$.
     – grad_param: compute $\nabla_{\theta_i}$ = grad_output $\cdot\, D_{\theta_i} f_i(z, \theta_i)$.
     Typically:
     – fprop caches its input and/or output for later reuse.
     – grad_input and grad_param are combined into a single bprop function to share computation.
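
     One way this interface might look in Python; a minimal sketch, assuming NumPy and my own class/attribute names (the slides only prescribe fprop, grad_input/grad_param, and the combined bprop):

         import numpy as np

         class Module:
             """Base interface: fprop caches what bprop needs; bprop returns grad_input
             and stores the parameter gradients internally."""
             def fprop(self, z):
                 raise NotImplementedError
             def bprop(self, grad_output):
                 raise NotImplementedError
             def params(self):
                 return []
             def grads(self):
                 return []

         class Tanh(Module):
             def fprop(self, z):
                 self.out = np.tanh(z)        # cache the output for reuse in bprop
                 return self.out
             def bprop(self, grad_output):
                 # grad_input = grad_output * D_z f, with diagonal Jacobian 1 - tanh(z)^2
                 return grad_output * (1.0 - self.out ** 2)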

  8. Backprop - (Mini-)Batching
     Remember: $\frac{1}{N} \sum_{i=1}^{N} D_F \ell(t_i, F(x_i, \theta)) \circ D_\theta F(\ldots)$, where $D_F \ell = (\partial_{F_1} \ell \;\ldots\; \partial_{F_{N_F}} \ell) \in \mathbb{R}^{1 \times N_F}$.
     Reformulating this as matrix-vector operations allows computing it in a single pass:
     $\frac{1}{N} \begin{pmatrix} \partial_{F_1} \ell(t_1, F(x_1, \theta)) & \ldots & \partial_{F_{N_F}} \ell(t_1, F(x_1, \theta)) \\ \vdots & & \vdots \\ \partial_{F_1} \ell(t_N, F(x_N, \theta)) & \ldots & \partial_{F_{N_F}} \ell(t_N, F(x_N, \theta)) \end{pmatrix} \in \mathbb{R}^{N \times N_F}$
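
     A small sketch of this stacking for a squared-error loss (the loss choice and shapes are assumptions for illustration):

         import numpy as np

         def loss_fprop(Y, T):
             # per-sample losses l(t_i, F(x_i, theta)), shape (N,)
             return np.sum((Y - T) ** 2, axis=1)

         def loss_bprop(Y, T):
             # one row per sample: D_F l(t_i, .) in R^{1 x N_F}, stacked into R^{N x N_F},
             # already scaled by 1/N so downstream modules need no extra averaging
             N = Y.shape[0]
             return 2.0 * (Y - T) / N

         rng = np.random.default_rng(0)
         Y = rng.normal(size=(4, 3))   # batch of N=4 network outputs, N_F=3
         T = rng.normal(size=(4, 3))
         grad = loss_bprop(Y, T)       # shape (4, 3): the whole batch backpropagates in one pass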

  9. Backprop - Usage/training
     net = [f1, f2, ...], l = criterion
     for Xb, Tb in batched(X, T):
         z = Xb
         for module in net:
             z = module.fprop(z)
         costs = l.fprop(z, Tb)
         ∂z = l.bprop([1/N_B ... 1/N_B])
         for module in reversed(net):
             ∂z = module.bprop(∂z)
         for module in net:
             θ, ∂θ = module.params(), module.grads()
             θ = θ - λ · ∂θ
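
     The seed $[\frac{1}{N_B} \ldots \frac{1}{N_B}]$ passed to the criterion's bprop is just the derivative of the averaged batch cost with respect to the per-sample costs; a tiny check of that claim (the numbers are made up):

         import numpy as np

         N_B = 4
         costs = np.array([0.3, 1.2, 0.7, 0.1])   # per-sample losses in the batch
         total = costs.mean()                      # (1/N_B) * sum_i costs_i
         # d(total) / d(costs_i) = 1/N_B for every i, which is exactly the seed vector:
         seed = np.full(N_B, 1.0 / N_B)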

  10. Backprop - Example: Linear aka. fully-connected module
      $f(z, W, b) = z \cdot W + b$
      where $z \in \mathbb{R}^{1 \times N_z}$, $W \in \mathbb{R}^{N_z \times N_f}$, and $b \in \mathbb{R}^{1 \times N_f}$. The gradients are:
      – $\mathbb{R}^{N_z \times N_f} \ni \text{grad}_W = z^T \cdot$ grad_output
      – $\mathbb{R}^{1 \times N_f} \ni \text{grad}_b =$ grad_output
      – $\mathbb{R}^{1 \times N_z} \ni$ grad_input $=$ grad_output $\cdot\, W^T$
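
      These formulas translate almost verbatim into code; a sketch of a Linear module following the hypothetical Module interface above, extended to batched inputs (one row per sample) as an assumption:

          import numpy as np

          class Linear:
              """Fully-connected module: f(z, W, b) = z @ W + b."""
              def __init__(self, n_in, n_out, rng=None):
                  rng = rng or np.random.default_rng(0)
                  self.W = rng.normal(scale=np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))
                  self.b = np.zeros((1, n_out))
              def fprop(self, z):
                  self.z = z                                   # cache the input for bprop
                  return z @ self.W + self.b
              def bprop(self, grad_output):
                  # grad_param: grad_W = z^T . grad_output, grad_b = grad_output (summed over batch rows)
                  self.grad_W = self.z.T @ grad_output
                  self.grad_b = grad_output.sum(axis=0, keepdims=True)
                  return grad_output @ self.W.T                # grad_input = grad_output . W^T
              def params(self):
                  return [self.W, self.b]
              def grads(self):
                  return [self.grad_W, self.grad_b]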

  11. Backprop - Gradient checking
      Crucial debugging method! Compare the Jacobian computed by finite differences using the fprop function to the Jacobian computed by the bprop function.
      Advice: use a (small) random input $x$, and $h_i = \sqrt{\text{eps}} \cdot \max(x_i, 1)$.
      Finite differences give, e.g., the first column of the Jacobian as:
      $x^- = (x_1 - h_1 \;\; x_2 \;\ldots\; x_{N_x})$, $x^+ = (x_1 + h_1 \;\; x_2 \;\ldots\; x_{N_x})$
      $J_{\bullet,1} = \frac{\text{fprop}(x^+) - \text{fprop}(x^-)}{2 h_1}$
      Backprop gives the first row of the Jacobian as:
      fprop$(x)$; $J_{1,\bullet} = \text{bprop}((1 \;\; 0 \;\ldots\; 0))$
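
      A possible implementation of such a check for a module's input Jacobian; the loop structure, tolerance, and helper name are my choices, not prescribed by the slides:

          import numpy as np

          def check_gradients(module, x, tol=1e-5):
              """Compare the finite-difference Jacobian (via fprop) with the bprop Jacobian."""
              eps = np.finfo(x.dtype).eps
              n_in = x.shape[1]
              n_out = module.fprop(x).shape[1]

              # Finite differences: build the Jacobian column by column.
              J_fd = np.zeros((n_out, n_in))
              for j in range(n_in):
                  h = np.sqrt(eps) * max(x[0, j], 1.0)
                  xp, xm = x.copy(), x.copy()
                  xp[0, j] += h
                  xm[0, j] -= h
                  J_fd[:, j] = (module.fprop(xp) - module.fprop(xm))[0] / (2 * h)

              # Backprop: build the Jacobian row by row by seeding bprop with unit vectors.
              J_bp = np.zeros((n_out, n_in))
              module.fprop(x)                    # restore the cache for the unperturbed x
              for i in range(n_out):
                  seed = np.zeros((1, n_out))
                  seed[0, i] = 1.0
                  J_bp[i, :] = module.bprop(seed)[0]

              return np.max(np.abs(J_fd - J_bp)) < tol

          # Example usage with the sketched Linear module:
          # rng = np.random.default_rng(0)
          # print(check_gradients(Linear(4, 3), rng.normal(size=(1, 4))))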

  12. Backprop - Rule-of-thumb results on MNIST
      – Linear(28*28, 10), SoftMax should give ±750 errors.
      – Linear(28*28, 200), Tanh, Linear(200, 10), SoftMax should give ±250 errors.
      – Typical learning rates: $\lambda \in [0.01, 0.1]$.
      – Typical batch sizes: $N_B \in [100, 1000]$.
      – Initialize weights as $\mathbb{R}^{M \times N} \ni W \sim \mathcal{N}(0, \sigma = \sqrt{2/(M+N)})$ and $b = 0$.
      – Don't forget data pre-processing; here at least divide values by 255 (the max pixel value).
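
      A small sketch of the suggested initialization and pre-processing (MNIST loading is omitted; the array names and shapes are assumptions):

          import numpy as np

          rng = np.random.default_rng(0)

          def init_linear(M, N):
              # W ~ N(0, sigma) with sigma = sqrt(2 / (M + N)), b = 0
              W = rng.normal(scale=np.sqrt(2.0 / (M + N)), size=(M, N))
              b = np.zeros((1, N))
              return W, b

          # Pre-processing: scale raw pixel values from [0, 255] into [0, 1].
          X_raw = rng.integers(0, 256, size=(100, 28 * 28))   # stand-in for MNIST images
          X = X_raw.astype(np.float64) / 255.0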

  13. Merry Christmas and a happy New Year! Also, good luck with the exercise =)
